
Completions

Language models are trained to predict natural language and provide text outputs in response to their inputs. The inputs are called prompts and the outputs are referred to as completions. LLMs take the input prompts and chunk them into smaller units called tokens before processing and generating language. Tokens may include trailing spaces and even sub-words; this process is language dependent.
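To get a feel for what tokens look like, you can inspect a prompt with a standalone tokenizer. This is an illustration only and uses the Hugging Face transformers tokenizer for Llama 2, which is not part of LLM Engine; the exact token split depends on the tokenizer and model.

# Illustration only: the Hugging Face tokenizer for Llama 2 (not part of
# LLM Engine) shows how a prompt is chunked into sub-word tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(tokenizer.tokenize("Hello, my name is"))
# A possible output: ['▁Hello', ',', '▁my', '▁name', '▁is']
# Note the space markers and that punctuation becomes its own token.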

Scale's LLM Engine provides access to open source language models (see Model Zoo) that can be used for producing completions to prompts.

Completion API call

An example API call looks as follows:

from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",
    prompt="Hello, my name is",
    max_new_tokens=10,
    temperature=0.2,
)

print(response.json())
# '{"request_id": "c4bf0732-08e0-48a8-8b44-dfe8d4702fb0", "output": {"text": "________ and I am a ________", "num_completion_tokens": 10}}'

print(response.output.text)
# ________ and I am a ________

  • model: The LLM you want to use (see Model Zoo).
  • prompt: The main input for the LLM to respond to.
  • max_new_tokens: The maximum number of tokens to generate in the completion.
  • temperature: The sampling temperature to use. Higher values make the output more random, while lower values make it more focused and deterministic. When temperature is 0, greedy search is used (see the sketch after the reference link below).

See the full Completion API reference documentation to learn more.
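For example, setting temperature to 0 requests greedy decoding, so repeated calls with the same prompt should return the same completion. This is a minimal sketch reusing the call above; the exact text generated depends on the model.

from llmengine import Completion

# With temperature=0, greedy search is used, so the same prompt should
# yield the same completion on repeated calls.
response = Completion.create(
    model="llama-2-7b",
    prompt="Hello, my name is",
    max_new_tokens=10,
    temperature=0,
)
print(response.output.text)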

Completion API response

An example Completion API response looks as follows:

    >>> print(response.json())
    {
      "request_id": "c4bf0732-08e0-48a8-8b44-dfe8d4702fb0",
      "output": {
        "text": "_______ and I am a _______",
        "num_completion_tokens": 10
      }
    }
    >>> print(response.output.text)
    _______ and I am a _______
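The other fields in the response can be read the same way as response.output.text. This is a minimal sketch assuming the parsed response object mirrors the JSON structure above with attribute access.

# Assumes attribute access mirrors the JSON shown above, as
# response.output.text does.
print(response.request_id)                    # "c4bf0732-08e0-48a8-8b44-dfe8d4702fb0"
print(response.output.num_completion_tokens)  # 10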

Token streaming

The Completions API supports token streaming to reduce perceived latency for certain applications. When streaming, tokens will be sent as data-only server-sent events.

To enable token streaming, pass stream=True to either Completion.create or Completion.acreate.

An example of token streaming using the synchronous Completions API looks as follows:

import sys

from llmengine import Completion

stream = Completion.create(
    model="llama-2-7b",
    prompt="Give me a 200 word summary on the current economic events in the US.",
    max_new_tokens=1000,
    temperature=0.2,
    stream=True,
)

for response in stream:
    if response.output:
        print(response.output.text, end="")
        sys.stdout.flush()

Async requests

The Python client supports asyncio for creating Completions. Use Completion.acreate instead of Completion.create to utilize async processing. The function signatures are otherwise identical.

An example of async Completions looks as follows:

import asyncio
from llmengine import Completion

async def main():
    response = await Completion.acreate(
        model="llama-2-7b",
        prompt="Hello, my name is",
        max_new_tokens=10,
        temperature=0.2,
    )
    print(response.json())

asyncio.run(main())
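Streaming and async requests can be combined. The following is a minimal sketch that assumes Completion.acreate with stream=True returns an async iterator yielding the same response objects as the synchronous streaming example above.

import asyncio

from llmengine import Completion

async def main():
    # Assumes acreate(stream=True) returns an async iterator of responses,
    # mirroring the synchronous streaming example.
    stream = await Completion.acreate(
        model="llama-2-7b",
        prompt="Give me a 200 word summary on the current economic events in the US.",
        max_new_tokens=1000,
        temperature=0.2,
        stream=True,
    )
    async for response in stream:
        if response.output:
            print(response.output.text, end="", flush=True)

asyncio.run(main())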

Which model should I use?

See the Model Zoo for more information on best practices for which model to use for Completions.