{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook shows how to serve embedding, reranker, and classification models on Vast's GPU platform using a really handy framework called [Infinity](https://github.com/michaelfeil/infinity/tree/main) `Infinity` is particularly good at high-throughput serving of these models, running multiple models at the same time, and providing out-of-the-box support for rerankers and classifier models.\n",
    "\n",
    "The commands in this notebook can be run here, or copied and pasted into your terminal (Minus the `!` or the `%%bash`). At the end, we will include a way to query your `Infinity` service in either python or with a curl request for the terminal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "#In an environment of your choice\n",
    "pip install --upgrade vastai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Here we will set our api key\n",
    "vastai set api-key <Your-API-Key-Here>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are going to look for GPU's on vast. The model that we are using is going to be very small, but to allow for easilly swapping out the model you desire, we will select machines that:\n",
    "1. Have GPU's with Ampere or newer architecture\n",
    "2. Have at least 24gb of GPU RAM (to run 13B parameter LLMs)\n",
    "3. One GPU as these models are small, and `Infinity` can still serve multiple of these models on one gpu.\n",
    "4. Have a static IP address to route requests to\n",
    "5. Have direct port counts available (greater than 1) to enable port reservations\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "vastai search offers 'compute_cap >= 800 gpu_ram >= 24 num_gpus = 1 static_ip=true direct_port_count > 1' \n"
   ]
  },
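  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The query above lists every matching offer. As a sketch, you can also sort the results, for example by price; this assumes the CLI's `--order` flag and the `dph_total` (dollars-per-hour) field, so double-check with `vastai search offers --help` if it errors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Assumption: sort ascending by on-demand price; verify the flag and field name with `vastai search offers --help`\n",
    "vastai search offers 'compute_cap >= 800 gpu_ram >= 24 num_gpus = 1 static_ip=true direct_port_count > 1' --order 'dph_total'"
   ]
  },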
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Usage: Embedding Models behind the OpenAI compatible server"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copy and Paste the id of a machine that you would like to choose below for `<instance-id>`.\n",
    "We will activate this instance with the `Infinity` docker image. This Image lets us serve an embedding model behind OpenAI Compatible server. This means that it can slide in to any application that uses the openAI api for embeddings. All you need to change in your app is the `base_url` and the `model_id` to the model that you are using so that the requests are properly routed to your model.\n",
    "\n",
    "This command also exposes the port 8000 in the docker container, the default openAI server port, and tells the docker container to automatically download and serve the `--modelmichaelfeil/bge-small-en-v1.5`. You can change the model by using any HuggingFace model ID for an embedding model. We chose this because it is fast start playing with, but is also one of the default ones from `Infinity`'s docs.\n",
    "\n",
    "We use vast's `--args` command to funnel the rest of the command to the container, in this case `v2 --model-id michaelfeil/bge-small-en-v1.5 --port 8000`. `v2` is to specify the second version of this API (which we'll use more later), and `--port 8000` is to tell `Infinity` to host on port 8000."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "vastai create instance <instance-id> --image michaelf34/infinity:latest --env '-p 8000:8000' --disk 40 --args v2 --model-id michaelfeil/bge-small-en-v1.5 --port 8000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we need to verify that our setup is working. We first need to wait for our machine to download the image and the model and start serving. This will take a few minutes. The logs will show you when it's done. \n",
    "\n",
    "Then, at the top of the instance, there is a button with an IP address in it. Click this and a panel will show up of the ip address and the forwarded ports. \n",
    "You should see something like: \n",
    "```\n",
    "Open Ports\n",
    "XX.XX.XXX.XX:YYYY -> 8000/tcp\n",
    "``` \n",
    "Copy and paste the IP address and the port in the curl command below.\n",
    "\n",
    "This curl command sends and OpenAI compatible request to your `Infinity` server. You should see the response if everything is setup correctly. "
   ]
  },
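  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Replace <Instance-IP-Address> and <Port> with the values from the Open Ports panel.\n",
    "# The /v1/embeddings path mirrors the OpenAI API, matching the Python example below.\n",
    "curl -X POST \"http://<Instance-IP-Address>:<Port>/v1/embeddings\" \\\n",
    "  -H \"Content-Type: application/json\" \\\n",
    "  -d '{\"model\": \"michaelfeil/bge-small-en-v1.5\", \"input\": \"What is Deep Learning?\"}'"
   ]
  },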
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "pip install openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from openai import OpenAI\n",
    "\n",
    "# Modify OpenAI's API key and API base to use vLLM's's API server.\n",
    "openai_api_key = \"EMPTY\"\n",
    "openai_api_base = \"http://<Instance-IP-Address>:<Port>/v1\"\n",
    "client = OpenAI(\n",
    "    api_key=openai_api_key,\n",
    "    base_url=openai_api_base,\n",
    ")\n",
    "model = \"michaelfeil/bge-small-en-v1.5\"\n",
    "embeddings = client.embeddings.create(model=model, input=\"What is Deep Learning?\").data[0].embedding\n",
    "print(\"Embeddings:\")\n",
    "print(embeddings)"
   ]
  },
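  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check on the embeddings themselves, here is a minimal sketch that embeds a few sentences in one batch and compares them with cosine similarity. It reuses `client` and `model` from the previous cell and assumes `numpy` is installed (`pip install numpy`); the related pair should score noticeably higher than the unrelated one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Embed several sentences in a single batch request\n",
    "texts = [\n",
    "    \"What is Deep Learning?\",\n",
    "    \"Deep Learning is a subfield of machine learning.\",\n",
    "    \"I like pizza.\",\n",
    "]\n",
    "resp = client.embeddings.create(model=model, input=texts)\n",
    "vectors = [np.array(d.embedding) for d in resp.data]\n",
    "\n",
    "def cosine(a, b):\n",
    "    # Cosine similarity: dot product over the product of the vector norms\n",
    "    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))\n",
    "\n",
    "print(\"query vs related:  \", cosine(vectors[0], vectors[1]))\n",
    "print(\"query vs unrelated:\", cosine(vectors[0], vectors[2]))"
   ]
  },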
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced Usage: Rerankers, Classifiers, and Multiple Models at the same time\n",
    "\n",
    "Feel free to delete your previous instance. Now, we will run a similar command to create a new instance. We'll have two small changes:\n",
    "1. We swapped `michaelfeil/bge-small-en-v1.5` for `mixedbread-ai/mxbai-rerank-xsmall-v1`. This allows us to run a re-ranker model\n",
    "2. We added `--model-id  SamLowe/roberta-base-go_emotions`, which lets us serve an additional model, in this case a classifier model\n",
    "\n",
    "We will also provide ways to call these models using the requests library for reranking and classifying, based on the `Infinity` Api Spec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash \n",
    "\n",
    "vastai create instance <instance-id> --image michaelf34/infinity:latest --env '-p 8000:8000' --disk 40 --args v2 --model-id mixedbread-ai/mxbai-rerank-xsmall-v1 --model-id  SamLowe/roberta-base-go_emotions --port 8000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "base_url = \"http://<Instance-IP-Address>:<Port>\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rerank_url = base_url + \"/rerank\"\n",
    "model1 = \"mixedbread-ai/mxbai-rerank-xsmall-v1\"\n",
    "input_json = {\"query\": \"Where is Munich?\",\"documents\": [\"Munich is in Germany.\", \"The sky is blue.\"],\"return_documents\": \"false\",\"model\": \"mixedbread-ai/mxbai-rerank-xsmall-v1\"}\n",
    "headers = {\n",
    "    \"accept\": \"application/json\",\n",
    "    \"Content-Type\": \"application/json\"\n",
    "}\n",
    "    \n",
    "payload = {\n",
    "    \"query\": input_json[\"query\"],\n",
    "    \"documents\": input_json[\"documents\"],\n",
    "    \"return_documents\": input_json[\"return_documents\"],\n",
    "    \"model\": model1\n",
    "}\n",
    "\n",
    "response = requests.post(rerank_url, json=payload, headers=headers)\n",
    "    \n",
    "if response.status_code == 200:\n",
    "    resp_json = response.json()\n",
    "    print(resp_json)\n",
    "else: \n",
    "    print(response.status_code)\n",
    "    print(response.text)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "classify_url = base_url + \"/classify\"\n",
    "model2 = \"SamLowe/roberta-base-go_emotions\"\n",
    "\n",
    "headers = {\n",
    "    \"accept\": \"application/json\",\n",
    "    \"Content-Type\": \"application/json\"\n",
    "}\n",
    "\n",
    "payload = {\n",
    "        \"input\": [\"I am feeling really happy today\"],\n",
    "        \"model\": model2\n",
    "    }\n",
    "\n",
    "response = requests.post(classify_url, json=payload, headers=headers)\n",
    "    \n",
    "if response.status_code == 200:\n",
    "    resp_json = response.json()\n",
    "    print(resp_json)\n",
    "else: \n",
    "    print(response.status_code)\n",
    "    print(response.text)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "vast",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
