Created by Dimitri McDaniel
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook shows how to serve an embedding model on Vast's GPU platform using Hugging Face's open-source inference framework [Text-Embeddings-Inference](https://github.com/huggingface/text-embeddings-inference). `Text-Embeddings-Inference` is arguably the fastest embedding runtime for GPUs. With its small image size, it is well suited to spinning up and down in a serverless manner, and it is compatible with the OpenAI API for online inference. This notebook borrows from other notebooks to help with setup. If you're already familiar with setting up and searching for an instance on Vast, you can skip ahead to the \"\" section.\n",
"\n",
"The commands in this notebook can be run here, or copied and pasted into your terminal (minus the `%%bash`). At the end, we include a way to query your `Text-Embeddings-Inference` service either in Python or with a curl request from the terminal."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"#In an environment of your choice\n",
"pip install --upgrade vastai"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"# Here we will set our api key\n",
"vastai set api-key <Your-API-Key-Here>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are going to look for GPUs on Vast. There are different versions of `Text-Embeddings-Inference` depending on which GPU you have. For these purposes, we're going to use Ampere GPUs.\n",
"If you have another type of GPU, you can check the `Text-Embeddings-Inference` repository to see if it is supported and which image of `Text-Embeddings-Inference` you should use. We will search for machines that:\n",
"1. Have GPUs with Ampere or newer architecture\n",
"2. Have one GPU, since `Text-Embeddings-Inference` runs fast enough that we won't notice any latency, so let's save some money\n",
"3. Have a static IP address to route requests to\n",
"4. Have direct port counts available (greater than 1) to enable port reservations\n",
"5. Use CUDA 12.2 or higher, since the `Text-Embeddings-Inference` image is based on a CUDA 12.2 base image"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"vastai search offers 'compute_cap >= 800 num_gpus = 1 static_ip=true direct_port_count > 1 cuda_vers >= 12.2' \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copy and paste the id of the machine you would like to use below for `<instance-id>`.\n",
"We will activate this instance with the `text-embeddings-inference:1.2` image. This image gives us a `Text-Embeddings-Inference` server that is compatible with the OpenAI SDK for online embedding inference, which means it can slide into any application that uses the OpenAI API for embeddings. All you need to change in your app is the `base_url` so that requests are properly routed to your model.\n",
"\n",
"This command exposes port 80 in the Docker container, which is the default port for this service, and forwards it to port 8000, the default OpenAI server port. It also tells the container to automatically download and serve `jinaai/jina-embeddings-v2-base-en`, using Vast's `--args` flag to forward the remaining text as arguments to the container. You can change the model by using any Hugging Face model ID. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"vastai create instance <instance-id> --image ghcr.io/huggingface/text-embeddings-inference:1.2 --env '-p 8000:80' --disk 16 --args --model-id jinaai/jina-embeddings-v2-base-en"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we need to verify that our setup is working. First, wait for the machine to download the image and the model and start serving. This will take a few minutes, and the logs will show you when it's done.\n",
"\n",
"Then, at the top of the instance, there is a button with an IP address in it. Click it and a panel will appear showing the IP address and the forwarded ports.\n",
"You should see something like:\n",
"```\n",
"Open Ports\n",
"XX.XX.XXX.XX:YYYY -> 8000/tcp\n",
"``` \n",
"Copy and paste the IP address and the port into the curl command below.\n",
"\n",
"This curl command sends a request to the server's native `/embed` endpoint. If everything is set up correctly, you should see a raw embedding returned in the response."
]
},
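{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch (assuming the standard `vastai` CLI subcommands), you can also monitor readiness from the command line: `vastai show instances` lists your instances and their status, and `vastai logs <instance-id>` fetches the container logs so you can see when the model has finished loading."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"# List your instances and their current status\n",
"vastai show instances\n",
"# Fetch the container logs; replace <instance-id> with your instance's id\n",
"vastai logs <instance-id>"
]
},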
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"curl <Instance-IP-Address>:<Port>/embed -X POST -d '{\"inputs\":\"What is Deep Learning?\"}' -H 'Content-Type: application/json'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"headers = {\n",
" 'Content-Type': 'application/json',\n",
"}\n",
"\n",
"json_data = {\n",
" 'inputs': 'What is Deep Learning?',\n",
"}\n",
"\n",
"response = requests.post('http://<Instance-IP-Address>:<Port>/embed', headers=headers, json=json_data)\n",
"print(response.content)"
]
},
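{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (a sketch, assuming the request above succeeded), the `/embed` endpoint returns a JSON list with one embedding vector per input string, so you can parse the response and inspect the vector's dimension:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# /embed returns a list of embedding vectors, one per input string\n",
"vectors = response.json()\n",
"embedding = vectors[0]\n",
"print(f\"embedding dimension: {len(embedding)}\")"
]
},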
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also use the OpenAI SDK to generate embeddings, which allows for a lot of portability into other applications."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"pip install openai"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"# Modify OpenAI's API key and API base to use the Text-Embeddings-Inference server.\n",
"openai_api_key = \"EMPTY\"\n",
"openai_api_base = \"http://<Instance-IP-Address>:<Port>/v1\"\n",
"client = OpenAI(\n",
" api_key=openai_api_key,\n",
" base_url=openai_api_base,\n",
")\n",
"model = \"jinaai/jina-embeddings-v2-base-en\"\n",
"embeddings = client.embeddings.create(model=model, input=\"What is Deep Learning?\")\n",
"print(embeddings)"
]
},
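{
"cell_type": "markdown",
"metadata": {},
"source": [
"The response follows the OpenAI SDK's embeddings shape, so (as a sketch) the raw vector for the first input is available at `embeddings.data[0].embedding`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pull the raw vector out of the OpenAI-style response object\n",
"vector = embeddings.data[0].embedding\n",
"print(f\"embedding dimension: {len(vector)}\")"
]
},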
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "vast",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.19"
}
},
"nbformat": 4,
"nbformat_minor": 2
}