Snippets
Created by Dimitri McDaniel
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook shows how to serve a large language model on Vast's GPU platform using the popular open source inference framework [lmDeploy](https://github.com/InternLM/lmdeploy). `lmDeploy` is particularly good at high-throughput serving, for multi user or high load use-cases, and is one of the most popular serving frameworks today.\n",
"\n",
"The commands in this notebook can be run here, or copied and pasted into your terminal (Minus the `!` or the `%%bash`). At the end, we will include a way to query your `lmDeploy` service in either python or with a curl request for the terminal."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"#In an environment of your choice\n",
"pip install --upgrade vastai"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "powershell"
}
},
"outputs": [],
"source": [
"%%bash\n",
"# Here we will set our api key\n",
"vastai set api-key <Your-API-Key-Here>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are going to look for GPU's on vast. The model that we are using is going to be very small, but to allow for easilly swapping out the model you desire, we will select machines that:\n",
"1. Have GPU's with Ampere or newer architecture\n",
"2. Have at least 24gb of GPU RAM (to run 13B parameter LLMs)\n",
"3. One GPU as `lmDeploy` primarily serves one copy of a model.\n",
"4. Have a static IP address to route requests to\n",
"5. Have direct port counts available (greater than 1) to enable port reservations\n",
"6. Use Cuda 12.0 or higher"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "powershell"
}
},
"outputs": [],
"source": [
"%%bash\n",
"vastai search offers 'compute_cap >= 800 gpu_ram >= 24 num_gpus = 1 static_ip=true direct_port_count > 1 cuda_vers >= 12.0' \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copy and Paste the id of a machine that you would like to choose below for `<instance-id>`.\n",
"We will activate this instance with the `lmDeploy-OpenAI` template. This template gives us a docker image that uses `lmDeploy` behind an OpenAI Compatible server. This means that it can slide in to any application that uses the openAI api. All you need to change in your app is the `base_url` and the `model_id` to the model that you are using so that the requests are properly routed to your model.\n",
"\n",
"This command also exposes the port 8000 in the docker container, the default openAI server port, and tells the docker container to automatically download and serve the `internlm/internlm2_5-7b-chat`. You can change the model by using any HuggingFace model ID. We chose this because it is fast to download and start playing with. If we are using any HuggingFace model we need to make sure with agree its terms and conditions for access to the model\n",
"\n",
"The lmDeploy image requires us to specify an entrypoint. So we use vast's --entrypoint flag instead of arguments.\n",
"\n",
"We use vast's `--entrypoint` which set executables of `lmDeploy` that will always run when the container is initiated uses to download the model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "powershell"
}
},
"outputs": [],
"source": [
"%%bash\n",
"vastai create instance <instance-id> \\\n",
" --image openmmlab/lmdeploy:latest \\\n",
" --env '-p 8000:8000' \\\n",
" --disk 40 \\\n",
" --entrypoint \"lmdeploy serve api_server internlm/internlm2_5-7b-chat --model-name internlm/internlm2_5-7b-chat --server-port 8000\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we need to verify that our setup is working. We first need to wait for our machine to download the image and the model and start serving. This will take a few minutes. The logs will show you when it's done. \n",
"\n",
"Then, at the top of the instance, there is a button with an IP address in it. Click this and a panel will show up of the ip address and the forwarded ports. \n",
"You should see something like: \n",
"\n",
"```\n",
"Open Ports\n",
"XX.XX.XXX.XX:YYYY -> 8000/tcp\n",
"``` \n",
"\n",
"\n",
"This curl command sends and OpenAI compatible request to your lmDeploy server. You should see the response if everything is setup correctly. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "powershell"
}
},
"outputs": [],
"source": [
"%%bash\n",
"curl -X GET http://<Instance-IP-Address>:<Port>/v1/models\n"
]
},
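{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to check the `/v1/models` endpoint from Python, here is a minimal sketch. The `extract_model_ids` helper is our own illustration, not part of `lmDeploy`; fill in your instance's IP address and port, then uncomment the request."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"def extract_model_ids(payload):\n",
"    # Pull the model IDs out of an OpenAI-style /v1/models response\n",
"    return [m['id'] for m in payload.get('data', [])]\n",
"\n",
"# Replace the placeholders with your instance's values, then uncomment:\n",
"# resp = requests.get('http://<Instance-IP-Address>:<Port>/v1/models')\n",
"# print(extract_model_ids(resp.json()))"
]
},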
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "powershell"
}
},
"outputs": [],
"source": [
"%%bash\n",
"curl -X POST http://<Instance-IP-Address>:<Port>/v1/completions \\\n",
" -H \"Content-Type: application/json\" \\\n",
" -d '{\"model\": \"internlm/internlm2_5-7b-chat\", \"prompt\": \"Hello, how are you?\", \"max_tokens\": 50}'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This next cell replicates exactly the same request but in the python requests library. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"headers = {\n",
" 'Content-Type': 'application/json',\n",
"}\n",
"\n",
"json_data = {\n",
" 'model': 'internlm/internlm2_5-7b-chat',\n",
" 'prompt': 'Hello, how are you?',\n",
" 'max_tokens': 50,\n",
"}\n",
"\n",
"response = requests.post('http://<Instance-IP-Address>:<Port>/v1/completions', headers=headers, json=json_data)\n",
"print(response.content)"
]
},
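{
"cell_type": "markdown",
"metadata": {},
"source": [
"The OpenAI-compatible server should also expose a `/v1/chat/completions` endpoint. Below is a sketch of the same request in chat form; the payload follows the standard OpenAI chat schema, but the endpoint path and IP/port placeholders are assumptions you should verify against your own instance before uncommenting the request."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"# Chat-style payload following the OpenAI chat completions schema\n",
"json_data = {\n",
"    'model': 'internlm/internlm2_5-7b-chat',\n",
"    'messages': [{'role': 'user', 'content': 'Hello, how are you?'}],\n",
"    'max_tokens': 50,\n",
"}\n",
"\n",
"# Replace the placeholders with your instance's values, then uncomment:\n",
"# resp = requests.post('http://<Instance-IP-Address>:<Port>/v1/chat/completions', json=json_data)\n",
"# print(resp.json()['choices'][0]['message']['content'])"
]
},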
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you're looking to build off of this more, we recommend checking out the [OpenAI sdk](https://github.com/openai/openai-python), which we will use here for easier interaction with the model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"pip install openai"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"# Modify OpenAI's API key and API base to use lmDeploy 's's API server.\n",
"openai_api_key = \"PLACEHOLDER_KEY\"\n",
"\n",
"openai_api_base = \"http://<Instance-IP-Address>:<Port>/v1\"\n",
"client = OpenAI(\n",
" api_key=openai_api_key,\n",
" base_url=openai_api_base,\n",
")\n",
"completion = client.completions.create(model=\"internlm/internlm2_5-7b-chat\",\n",
" prompt=\"Hello, how are you?\",\n",
" max_tokens=50)\n",
"print(\"Completion result:\", completion)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}