Jake Cannell serving TGI on Vast notebook

Created by Dimitri McDaniel
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook shows how to serve a large language model on Vast's GPU platform using HuggingFace's open source inference framework [TGI](https://github.com/huggingface/text-generation-inference). `TGI` is particularly easy to use if you're familiar with Huggingface. It automatically batches queries for you and is compatible with the OpenAI API. This notebook is adapted from our `vLLM` guide so that you can see what exactly needs to be changed between the two of them.\n",
    "\n",
    "The commands in this notebook can be run here, or copied and pasted into your terminal (Minus the `%%bash`). At the end, we will include a way to query your `TGI` service in either python or with a curl request for the terminal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "#In an environment of your choice\n",
    "pip install --upgrade vastai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Here we will set our api key\n",
    "vastai set api-key <Your-API-Key-Here>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are going to look for GPU's on vast. The model that we are using is going to be very small, but to allow for easily swapping out the model you desire, we will select machines that:\n",
    "1. Have GPU's with Ampere or newer architecture\n",
    "2. Have at least 24gb of GPU RAM (to run 13B parameter LLMs)\n",
    "3. One GPU as `TGI` primarily serves one copy of a model.\n",
    "4. Have a static IP address to route requests to\n",
    "5. Have direct port counts available (greater than 1) to enable port reservations\n",
    "6. Use CUDA 12.1 or higher - the `TGI` image is based upon a CUDA 12.1 Base Image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "vastai search offers 'compute_cap >= 800 gpu_ram >= 24 num_gpus = 1 static_ip=true direct_port_count > 1 cuda_vers >= 12.1' \n"
   ]
  },
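  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the search returns many machines, it can help to order the results. The optional cell below is a sketch that assumes the `vastai` CLI's `-o` ordering flag and the `dph` (dollars per hour) field behave as documented; it lists the cheapest matching machines first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Optional: same query as above, ordered by price per hour (assumes the -o ordering flag)\n",
    "vastai search offers 'compute_cap >= 800 gpu_ram >= 24 num_gpus = 1 static_ip=true direct_port_count > 1 cuda_vers >= 12.1' -o 'dph'\n"
   ]
  },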
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copy and Paste the id of a machine that you would like to choose below for `<instance-id>`.\n",
    "We will activate this instance with the `text-generation-inference:1.4` image. This image gives us a TGI server that is compatible with the OpenAI SDK. This means that it can slide in to any application that uses the openAI api. All you need to change in your app is the `base_url` and the `model_id` to the model that you are using so that the requests are properly routed to your model.\n",
    "\n",
    "This command also exposes the port 8000 in the docker container, the default openAI server port, and tells the docker container to automatically download and serve the `stabilityai/stablelm-2-zephyr-1_6b`. You can change the model by using any HuggingFace model ID. We chose this because it is fast to download and start playing with.\n",
    "\n",
    "We use vast's `--args` command to funnel the rest of the command to the container, in this case `--model-id stabilityai/stablelm-2-zephyr-1_6b`, which `TGI` uses to download the model, and `--port 8000` to ensure that TGI is listening on the right port."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "vastai create instance <instance-id> --image ghcr.io/huggingface/text-generation-inference:1.4 --env '-p 8000:8000' --disk 40 --args --port 8000 --model-id stabilityai/stablelm-2-zephyr-1_6b"
   ]
  },
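  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While the instance is starting, you can watch it from the CLI. The optional cell below uses two `vastai` commands: `show instances` lists your instances and their status, and `logs` fetches the container logs (fill in your instance id for `<instance-id>`). Wait until the logs show that `TGI` has finished downloading the model and is serving."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "# List your instances and their current status\n",
    "vastai show instances\n",
    "\n",
    "# Fetch the container logs for the instance created above\n",
    "vastai logs <instance-id>\n"
   ]
  },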
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we need to verify that our setup is working. We first need to wait for our machine to download the image and the model and start serving. This will take a few minutes. The logs will show you when it's done. \n",
    "\n",
    "Then, at the top of the instance, there is a button with an IP address in it. Click this and a panel will show up of the ip address and the forwarded ports. \n",
    "You should see something like:\n",
    "```\n",
    "Open Ports\n",
    "XX.XX.XXX.XX:YYYY -> 8000/tcp\n",
    "``` \n",
    "Copy and paste the IP address and the port in the curl command below.\n",
    "\n",
    "This curl command sends and OpenAI compatible request to your TGI server. You should see the response if everything is setup correctly. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "# This request assumes you haven't changed the model. If you did, fill it in the \"model\" value in the payload json below\n",
    "curl -X POST http://<Instance-IP-Address>:<Port>/v1/completions -H \"Content-Type: application/json\"  -d '{\"model\" : \"stabilityai/stablelm-2-zephyr-1_6b\", \"prompt\": \"Hello, how are you?\", \"max_tokens\": 50}'\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "headers = {\n",
    "    'Content-Type': 'application/json',\n",
    "}\n",
    "\n",
    "model = 'stabilityai/stablelm-2-zephyr-1_6b'\n",
    "\n",
    "json_data = {\n",
    "    'model': model,\n",
    "    'prompt': 'Hello, how are you?',\n",
    "    'max_tokens': 50,\n",
    "}\n",
    "\n",
    "response = requests.post('http://<Instance-IP-Address>:<Port>/v1/completions', headers=headers, json=json_data)\n",
    "print(response.content)"
   ]
  },
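  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The raw bytes printed above are JSON. Assuming the response follows the OpenAI completions schema (a `choices` list whose entries carry a `text` field), you can extract just the generated text like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parse the JSON body from the previous cell's `response` and print the first completion.\n",
    "# Assumes an OpenAI-style completions payload.\n",
    "data = response.json()\n",
    "print(data['choices'][0]['text'])"
   ]
  },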
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "pip install openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from openai import OpenAI\n",
    "\n",
    "# Modify OpenAI's API key and API base to use TGI's's API server.\n",
    "openai_api_key = \"EMPTY\"\n",
    "openai_api_base = \"http://<Instance-IP-Address>:<Port>/v1\"\n",
    "client = OpenAI(\n",
    "    api_key=openai_api_key,\n",
    "    base_url=openai_api_base,\n",
    ")\n",
    "completion = client.completions.create(model=model,\n",
    "                                      prompt=\"Hello, how are you?\",\n",
    "                                      max_tokens=50)\n",
    "print(\"Completion result:\", completion)"
   ]
  }
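  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`TGI` 1.4 also exposes an OpenAI-style chat endpoint (the Messages API). The cell below is a minimal sketch that reuses the `client` and `model` from the previous cells and assumes your model ships a chat template (the `stablelm-2-zephyr` model does):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Chat-style request against the same TGI server, reusing `client` and `model` from above\n",
    "chat_completion = client.chat.completions.create(\n",
    "    model=model,\n",
    "    messages=[{\"role\": \"user\", \"content\": \"Hello, how are you?\"}],\n",
    "    max_tokens=50,\n",
    ")\n",
    "print(\"Chat result:\", chat_completion.choices[0].message.content)"
   ]
  }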
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
