Jake Cannell serving_text_embeddings_inference_on_vast

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook shows how to serve a text embedding model on Vast's GPU platform using Hugging Face's open source inference framework [Text-Embeddings-Inference](https://github.com/huggingface/text-embeddings-inference). `Text-Embeddings-Inference` is arguably the fastest embedding inference framework available for GPUs. With its small image size, it is well suited to spinning up and down in a serverless manner, and it is compatible with the OpenAI API for online inference. This notebook reuses setup steps from our other notebooks; if you're already familiar with setting up and searching for an instance on Vast, you can skip ahead to the instance creation and querying steps below.\n",
    "\n",
    "The commands in this notebook can be run here, or copied and pasted into your terminal (minus the `%%bash`). At the end, we include ways to query your `Text-Embeddings-Inference` service either from Python or with a curl request from the terminal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "#In an environment of your choice\n",
    "pip install --upgrade vastai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Here we will set our api key\n",
    "vastai set api-key <Your-API-Key-Here>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are going to look for GPUs on Vast. There are different Docker images of `Text-Embeddings-Inference` depending on which GPU you have; for this notebook, we're going to use Ampere GPUs. If you have another type of GPU, you can check the `Text-Embeddings-Inference` repository linked above to see whether it is supported and which image you should use.\n",
    "\n",
    "We want machines that:\n",
    "1. Have GPUs with Ampere or newer architecture\n",
    "2. Have one GPU - `Text-Embeddings-Inference` is fast enough on a single GPU that we won't notice any latency, so let's save some money\n",
    "3. Have a static IP address to route requests to\n",
    "4. Have direct port counts available (greater than 1) to enable port reservations\n",
    "5. Use CUDA 12.2 or higher - the `Text-Embeddings-Inference` image is based on a CUDA 12.2 base image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "vastai search offers 'compute_cap >= 800 num_gpus = 1 static_ip=true direct_port_count > 1 cuda_vers >= 12.2' \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copy and paste the ID of a machine that you would like to use below for `<instance-id>`.\n",
    "We will create this instance with the `text-embeddings-inference:1.2` image. This image gives us a Text-Embeddings-Inference server that is compatible with the OpenAI SDK for online embedding inference, which means it can slide into any application that uses the OpenAI API for embeddings. All you need to change in your app is the `base_url` so that requests are properly routed to your model.\n",
    "\n",
    "This command also exposes port 80 in the Docker container, which is the default port for this service, and maps it to port 8000, which Vast in turn forwards to an external port on the instance's public IP. It also tells the container to automatically download and serve the `jinaai/jina-embeddings-v2-base-en` model, using Vast's `--args` flag to pass the remaining text through to the container as arguments. You can change the model by using any Hugging Face model ID.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "vastai create instance <instance-id> --image ghcr.io/huggingface/text-embeddings-inference:1.2 --env '-p 8000:80' --disk 16 --args --model-id jinaai/jina-embeddings-v2-base-en"
   ]
  },
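  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While the instance starts up, you can optionally check on it from the CLI instead of the web console. The cell below is a small optional sketch that assumes the Vast CLI's `vastai show instances` and `vastai logs` commands; replace `<Instance-ID>` with the ID returned by the create command above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Optional: list your instances and their current status\n",
    "vastai show instances\n",
    "\n",
    "# Optional: check the container logs to see when the model has finished downloading\n",
    "vastai logs <Instance-ID>"
   ]
  },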
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we need to verify that our setup is working. First, we need to wait for the machine to download the image and the model and start serving. This will take a few minutes, and the logs will show you when it's done.\n",
    "\n",
    "Then, at the top of the instance, there is a button with an IP address in it. Click this and a panel will appear showing the IP address and the forwarded ports.\n",
    "You should see something like:\n",
    "```\n",
    "Open Ports\n",
    "XX.XX.XXX.XX:YYYY -> 8000/tcp\n",
    "``` \n",
    "Copy and paste the IP address and the port into the curl command below.\n",
    "\n",
    "This curl command sends a request to your embedding server's `/embed` endpoint. If everything is set up correctly, you should see a raw embedding returned in the response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "curl <Instance-IP-Address>:<Port>/embed -X POST -d '{\"inputs\":\"What is Deep Learning?\"}' -H 'Content-Type: application/json'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "headers = {\n",
    "    'Content-Type': 'application/json',\n",
    "}\n",
    "\n",
    "json_data = {\n",
    "    'inputs': 'What is Deep Learning?',\n",
    "}\n",
    "\n",
    "response = requests.post('http://<Instance-IP-Address>:<Port>/embed', headers=headers, json=json_data)\n",
    "print(response.content)"
   ]
  },
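  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `/embed` endpoint also accepts a list of inputs, so you can embed several texts in one request. The sketch below is an optional example of what you might do with the results: it sends two sentences in a single batch and compares them with cosine similarity. It assumes the same `<Instance-IP-Address>` and `<Port>` as above, and that the response body is a JSON list containing one embedding per input."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "import requests\n",
    "\n",
    "# Embed two texts in a single batched request\n",
    "json_data = {\n",
    "    'inputs': ['What is Deep Learning?', 'Deep Learning is a subfield of machine learning.'],\n",
    "}\n",
    "response = requests.post(\n",
    "    'http://<Instance-IP-Address>:<Port>/embed',\n",
    "    headers={'Content-Type': 'application/json'},\n",
    "    json=json_data,\n",
    ")\n",
    "vec_a, vec_b = response.json()\n",
    "\n",
    "# Cosine similarity between the two embedding vectors\n",
    "dot = sum(a * b for a, b in zip(vec_a, vec_b))\n",
    "norm_a = math.sqrt(sum(a * a for a in vec_a))\n",
    "norm_b = math.sqrt(sum(b * b for b in vec_b))\n",
    "print('cosine similarity:', dot / (norm_a * norm_b))"
   ]
  },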
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also use the OpenAI SDK to generate embeddings, which makes it easy to drop this server into applications that already use the OpenAI API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "pip install openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from openai import OpenAI\n",
    "\n",
    "# Modify OpenAI's API key and API base to use your Text-Embeddings-Inference server.\n",
    "openai_api_key = \"EMPTY\"\n",
    "openai_api_base = \"http://<Instance-IP-Address>:<Port>/v1\"\n",
    "client = OpenAI(\n",
    "    api_key=openai_api_key,\n",
    "    base_url=openai_api_base,\n",
    ")\n",
    "model = \"jinaai/jina-embeddings-v2-base-en\"\n",
    "embeddings = client.embeddings.create(model=model, input=\"What is Deep Learning?\")\n",
    "print(embeddings)"
   ]
  },
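  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The OpenAI SDK wraps the raw vectors in a response object. As a small follow-up (using the standard fields of the OpenAI SDK's embeddings response), you can pull the vector out of `embeddings.data[0].embedding` and check its dimensionality before passing it to a vector database or similarity search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract the raw vector from the response returned by the previous cell\n",
    "vector = embeddings.data[0].embedding\n",
    "print('embedding dimension:', len(vector))\n",
    "print('first few values:', vector[:5])"
   ]
  },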
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "vast",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
