Jake Cannell serving_lmdeploy_on_vast

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook shows how to serve a large language model on Vast's GPU platform using the popular open source inference framework [lmDeploy](https://github.com/InternLM/lmdeploy). `lmDeploy` is particularly good at high-throughput serving, for multi user or high load use-cases, and is one of the most popular serving frameworks today.\n",
    "\n",
    "The commands in this notebook can be run here, or copied and pasted into your terminal (Minus the `!` or the `%%bash`). At the end, we will include a way to query your `lmDeploy` service in either python or with a curl request for the terminal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "#In an environment of your choice\n",
    "pip install --upgrade vastai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "powershell"
    }
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Here we will set our api key\n",
    "vastai set api-key <Your-API-Key-Here>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are going to look for GPU's on vast. The model that we are using is going to be very small, but to allow for easilly swapping out the model you desire, we will select machines that:\n",
    "1. Have GPU's with Ampere or newer architecture\n",
    "2. Have at least 24gb of GPU RAM (to run 13B parameter LLMs)\n",
    "3. One GPU as `lmDeploy` primarily serves one copy of a model.\n",
    "4. Have a static IP address to route requests to\n",
    "5. Have direct port counts available (greater than 1) to enable port reservations\n",
    "6. Use Cuda 12.0 or higher"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "powershell"
    }
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "vastai search offers 'compute_cap >= 800 gpu_ram >= 24 num_gpus = 1 static_ip=true direct_port_count > 1 cuda_vers >= 12.0' \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copy and Paste the id of a machine that you would like to choose below for `<instance-id>`.\n",
    "We will activate this instance with the `lmDeploy-OpenAI` template. This template gives us a docker image that uses `lmDeploy` behind an OpenAI Compatible server. This means that it can slide in to any application that uses the openAI api. All you need to change in your app is the `base_url` and the `model_id` to the model that you are using so that the requests are properly routed to your model.\n",
    "\n",
    "This command also exposes the port 8000 in the docker container, the default openAI server port, and tells the docker container to automatically download and serve the `internlm/internlm2_5-7b-chat`. You can change the model by using any HuggingFace model ID. We chose this because it is fast to download and start playing with. If we are using any HuggingFace model we need to make sure with agree its terms and conditions for access to the model\n",
    "\n",
    "The lmDeploy image requires us to specify an entrypoint. So we use vast's --entrypoint flag instead of arguments.\n",
    "\n",
    "We use vast's `--entrypoint` which set executables of `lmDeploy`  that will always run when the container is initiated uses to download the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "powershell"
    }
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "vastai create instance <instance-id> \\\n",
    "  --image openmmlab/lmdeploy:latest \\\n",
    "  --env '-p 8000:8000' \\\n",
    "  --disk 40 \\\n",
    "  --entrypoint \"lmdeploy serve api_server internlm/internlm2_5-7b-chat --model-name internlm/internlm2_5-7b-chat --server-port 8000\" \\"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we need to verify that our setup is working. We first need to wait for our machine to download the image and the model and start serving. This will take a few minutes. The logs will show you when it's done. \n",
    "\n",
    "Then, at the top of the instance, there is a button with an IP address in it. Click this and a panel will show up of the ip address and the forwarded ports. \n",
    "You should see something like: \n",
    "\n",
    "```\n",
    "Open Ports\n",
    "XX.XX.XXX.XX:YYYY -> 8000/tcp\n",
    "``` \n",
    "\n",
    "\n",
    "This curl command sends and OpenAI compatible request to your lmDeploy server. You should see the response if everything is setup correctly. "
   ]
  },
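  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While you wait, you can also check on the instance from the CLI. As a quick sketch (assuming the `vastai` CLI installed earlier, with your API key set), the command below lists your instances along with their current status:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "# List your instances and their current status\n",
    "vastai show instances"
   ]
  },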
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "powershell"
    }
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "curl -X GET http://<Instance-IP-Address>:<Port>/v1/models\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "powershell"
    }
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "curl -X POST http://<Instance-IP-Address>:<Port>/v1/completions \\\n",
    "     -H \"Content-Type: application/json\" \\\n",
    "     -d '{\"model\": \"internlm/internlm2_5-7b-chat\", \"prompt\": \"Hello, how are you?\", \"max_tokens\": 50}'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This next cell replicates exactly the same request but in the python requests library. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "headers = {\n",
    "    'Content-Type': 'application/json',\n",
    "}\n",
    "\n",
    "json_data = {\n",
    "    'model': 'internlm/internlm2_5-7b-chat',\n",
    "    'prompt': 'Hello, how are you?',\n",
    "    'max_tokens': 50,\n",
    "}\n",
    "\n",
    "response = requests.post('http://<Instance-IP-Address>:<Port>/v1/completions', headers=headers, json=json_data)\n",
    "print(response.content)"
   ]
  },
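  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The raw `response.content` above is a JSON body. As a small sketch (assuming the server returns the standard OpenAI completions schema with a `choices` list), you can pull out just the generated text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parse the JSON body from the previous cell and print only the generated text.\n",
    "# Assumes the standard OpenAI completions schema: {\"choices\": [{\"text\": ...}, ...]}\n",
    "result = response.json()\n",
    "print(result[\"choices\"][0][\"text\"])"
   ]
  },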
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're looking to build off of this more, we recommend checking out the [OpenAI sdk](https://github.com/openai/openai-python), which we will use here for easier interaction with the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "pip install openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from openai import OpenAI\n",
    "\n",
    "# Modify OpenAI's API key and API base to use lmDeploy 's's API server.\n",
    "openai_api_key = \"PLACEHOLDER_KEY\"\n",
    "\n",
    "openai_api_base = \"http://<Instance-IP-Address>:<Port>/v1\"\n",
    "client = OpenAI(\n",
    "    api_key=openai_api_key,\n",
    "    base_url=openai_api_base,\n",
    ")\n",
    "completion = client.completions.create(model=\"internlm/internlm2_5-7b-chat\",\n",
    "                                      prompt=\"Hello, how are you?\",\n",
    "                                      max_tokens=50)\n",
    "print(\"Completion result:\", completion)"
   ]
  }
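  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since `internlm/internlm2_5-7b-chat` is a chat model, you may prefer the chat endpoint. Here is a minimal sketch, assuming the server exposes the standard `/v1/chat/completions` route (the default for lmDeploy's OpenAI-compatible server); it reuses the `client` created above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Send a chat-style request through the same OpenAI client.\n",
    "chat_completion = client.chat.completions.create(\n",
    "    model=\"internlm/internlm2_5-7b-chat\",\n",
    "    messages=[{\"role\": \"user\", \"content\": \"Hello, how are you?\"}],\n",
    "    max_tokens=50,\n",
    ")\n",
    "print(\"Chat completion result:\", chat_completion.choices[0].message.content)"
   ]
  }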
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
