Self Hosting [Experimental]¶
This guide is currently highly experimental. Instructions are subject to change as we improve support for self-hosting.
We provide a Helm chart that deploys LLM Engine to an Elastic Kubernetes Cluster (EKS) in AWS. This Helm chart should be configured to connect to dependencies (such as a PostgreSQL database) that you may already have available in your environment.
The only portions of the Helm chart that are production ready are the parts that configure and manage LLM Server itself (not PostgreSQL, IAM, etc.)
We first go over required AWS dependencies that are required to exist before we can run helm install
in your EKS cluster.
AWS Dependencies¶
This section describes assumptions about existing AWS resources required run to the LLM Engine Server
EKS¶
The LLM Engine server must be deployed in an EKS cluster environment. Currently only versions 1.23+
are supported. Below are the assumed requirements for the EKS cluster:
You will need to provision EKS node groups with GPUs to schedule model pods. These node groups must have the node-lifecycle: normal
label on them.
Additionally, they must have the k8s.amazonaws.com/accelerator
label set appropriately depending on the instance type:
Instance family | k8s.amazonaws.com/accelerator label |
---|---|
g4dn | nvidia-tesla-t4 |
g5 | nvidia-tesla-a10 |
p4d | nvidia-tesla-a100 |
p4de | nvidia-tesla-a100e |
We also recommend setting the following taint on your GPU nodes to prevent pods requiring GPU resources from being scheduled on them: - { key = "nvidia.com/gpu", value = "true", effect = "NO_SCHEDULE" }
PostgreSQL¶
The LLM Engine server requires a PostgreSQL database to back data. LLM Engine currently supports PostgreSQL version 14. Create a PostgreSQL database (e.g. AWS RDS PostgreSQL) if you do not have an existing one you wish to connect LLM Engine to.
To enable LLM Engine to connect to the PostgreSQL engine, we create a Kubernetes secret with the PostgreSQL url. An example YAML is provided below:
apiVersion: v1
kind: Secret
metadata:
name: llm-engine-database-credentials # this name will be an input to our Helm Chart
data:
database_url = "postgresql://[user[:password]@][netloc][:port][/dbname][?param1=value1&...]"
Redis¶
The LLM Engine server requires Redis for various caching/queue functionality. LLM Engine currently supports Redis version 6. Create a Redis cluster (e.g. AWS Elasticache for Redis) if you do not have an existing one you wish to connect LLM Engine to.
To enable LLM Engine to connect redis, fill out the Helm chart values with the redis host and url.
Amazon S3¶
You will need to have an S3 bucket for LLM Engine to store various assets (e.g model weigts, prediction restuls). The ARN of this bucket should be provided in the Helm chart values.
Amazon ECR¶
You will need to provide an ECR repository for the deployment to store model containers. The ARN of this repository should be provided in the Helm chart values.
Amazon SQS¶
LLM Engine utilizes Amazon SQS to keep track of jobs. LLM Engine will create and use SQS queues as needed.
Identity and Access Management (IAM)¶
The LLM Engine server will an IAM role to perform various AWS operations. This role will be assumed by the serviceaccount llm-engine
in the launch
namespace in the EKS cluster. The ARN of this role needs to be provided to the Helm chart, and the role needs to be provided the following permissions:
Action | Resource |
---|---|
s3:Get* , s3:Put* |
${s3_bucket_arn}/* |
s3:List* |
${s3_bucket_arn} |
sqs:* |
arn:aws:sqs:${region}:${account_id}:llm-engine-endpoint-id-* |
sqs:ListQueues |
* |
ecr:BatchGetImage , ecr:DescribeImages , ecr:GetDownloadUrlForLayer , ecr:ListImages |
${ecr_repository_arn} |
Helm Chart¶
Now that all dependencies have been installed and configured, we can run the provided Helm chart. The values in the Helm chart will need to correspond with the resources described in the Dependencies section.
Ensure that Helm V3 is installed instructions and can connect to the EKS cluster. Users should be able to install the chart with helm install llm-engine llm-engine -f llm-engine/values_sample.yaml -n <DESIRED_NAMESPACE>
.
Below are the configurations to specify in the values_sample.yaml
file.
Parameter | Description | Required |
---|---|---|
tag | The LLM Engine docker image tag | Yes |
context | A user-specified deployment tag | No |
image.gatewayRepository | The docker repository to pull the LLM Engine gateway image from | Yes |
image.builderRepository | The docker repository to pull the LLM Engine endpoint builder image from | Yes |
image.cacherRepository | The docker repository to pull the LLM Engine cacher image from | Yes |
image.forwarderRepository | The docker repository to pull the LLM Engine forwarder image from | Yes |
image.pullPolicy | The docker image pull policy | No |
secrets.kubernetesDatabaseSecretName | The name of the secret that contains the database credentials | Yes |
serviceAccount.annotations.eks.amazonaws.com/role-arn | The ARN of the IAM role that the service account will assume | Yes |
service.type | The service configuration for the main LLM Engine server | No |
service.port | The service configuration for the main LLM Engine server | No |
replicaCount | The amount of replica pods for each deployment | No |
autoscaling | The autoscaling configuration for LLM Engine server deployments | No |
resources.requests.cpu | The k8s resources for LLM Engine server deployments | No |
nodeSelector | The node selector for LLM Engine server deployments | No |
tolerations | The tolerations for LLM Engine server deployments | No |
affinity | The affinity for LLM Engine server deployments | No |
aws.configMap.name | The AWS configurations (by configMap) for LLM Engine server deployments | No |
aws.configMap.create | The AWS configurations (by configMap) for LLM Engine server deployments | No |
aws.profileName | The AWS configurations (by configMap) for LLM Engine server deployments | No |
serviceTemplate.securityContext.capabilities.drop | Additional flags for model endpoints | No |
serviceTemplate.mountInfraConfig | Additional flags for model endpoints | No |
config.values.infra.k8s_cluster_name | The name of the k8s cluster | Yes |
config.values.infra.dns_host_domain | The domain name of the k8s cluster | Yes |
config.values.infra.default_region | The default AWS region for various resources | Yes |
config.values.infra.ml_account_id | The AWS account ID for various resources | Yes |
config.values.infra.docker_repo_prefix | The prefix for AWS ECR repositories | Yes |
config.values.infra.redis_host | The hostname of the redis cluster you wish to connect | Yes |
config.values.infra.s3_bucket | The S3 bucket you wish to connect | Yes |
config.values.llm_engine.endpoint_namespace | K8s namespace the endpoints will be created in | Yes |
config.values.llm_engine.cache_redis_url | The full url for the redis cluster you wish to connect | Yes |
config.values.llm_engine.s3_file_llm_fine_tuning_job_repository | The S3 URI for the S3 bucket/key that you wish to save fine-tuned assets | Yes |
config.values.datadog_trace_enabled | Whether to enable datadog tracing, datadog must be installed in the cluster | No |
Play With It¶
Once helm install
succeeds, you can forward port 5000
from a llm-engine
pod and test sending requests to it.
First, see a list of pods in the namespace that you performed helm install
in:
$ kubectl get pods -n <NAMESPACE_WHERE_LLM_ENGINE_IS_INSTALLED>
NAME READY STATUS RESTARTS AGE
llm-engine-668679554-9q4wj 1/1 Running 0 18m
llm-engine-668679554-xfhxx 1/1 Running 0 18m
llm-engine-cacher-5f8b794585-fq7dj 1/1 Running 0 18m
llm-engine-endpoint-builder-5cd6bf5bbc-sm254 1/1 Running 0 18m
llm-engine-image-cache-a10-sw4pg 1/1 Running 0 18m
Forward a port from a llm-engine
pod:
$ kubectl port-forward pod/llm-engine-<REST_OF_POD_NAME> 5000:5000 -n <NAMESPACE_WHERE_LLM_ENGINE_IS_INSTALLED>
Then, try sending a request to get LLM model endpoints for test-user-id
:
$ curl -X GET -H "Content-Type: application/json" -u "test-user-id:" "http://localhost:5000/v1/llm/model-endpoints"
You should get the following response:
Next, let's create a LLM endpoint using llama-7b:
$ curl -X POST 'http://localhost:5000/v1/llm/model-endpoints' \
-H 'Content-Type: application/json' \
-d '{
"name": "llama-7b",
"model_name": "llama-7b",
"source": "hugging_face",
"inference_framework": "text_generation_inference",
"inference_framework_image_tag": "0.9.3",
"num_shards": 4,
"endpoint_type": "streaming",
"cpus": 32,
"gpus": 4,
"memory": "40Gi",
"storage": "40Gi",
"gpu_type": "nvidia-ampere-a10",
"min_workers": 1,
"max_workers": 12,
"per_worker": 1,
"labels": {},
"metadata": {}
}' \
-u test_user_id:
It should output something like:
Wait a few minutes for the endpoint to be ready. You can tell that it's ready by listing pods and checking that all containers in the llm endpoint pod are ready:
$ kubectl get pods -n <endpoint_namespace specified in values_sample.yaml>
NAME READY STATUS RESTARTS AGE
llm-engine-endpoint-id-end-cismpd08agn003rr2kc0-7f86ff64f9qj9xp 2/2 Running 1 (4m41s ago) 7m26s
Then, you can send an inference request to the endppoint:
$ curl -X POST 'http://localhost:5000/v1/llm/completions-sync?model_endpoint_name=llama-7b' \
-H 'Content-Type: application/json' \
-d '{
"prompts": ["Tell me a joke about AI"],
"max_new_tokens": 30,
"temperature": 0.1
}' \
-u test-user-id:
You should get a response similar to:
{"status":"SUCCESS","outputs":[{"text":". Tell me a joke about AI. Tell me a joke about AI. Tell me a joke about AI. Tell me","num_completion_tokens":30}],"traceback":null}
Pointing LLM Engine client to use self-hosted infrastructure¶
The llmengine
client makes requests to Scale AI's hosted infrastructure by default. You can have llmengine
client make requests to your own self-hosted infrastructure by setting the LLM_ENGINE_BASE_PATH
environment variable to the URL of the llm-engine
service.
The exact URL of llm-engine
service depends on your Kubernetes cluster networking setup. The domain is specified at config.values.infra.dns_host_domain
in the helm chart values config file. Using charts/llm-engine/values_sample.yaml
as an example, you would do: