Completions
Language models are trained to predict natural language and provide text outputs as a response to their inputs. The inputs are called prompts, and the outputs are referred to as completions. LLMs take the input prompts and chunk them into smaller units called tokens in order to process and generate language. Tokens may include whitespace and even sub-words; exactly how text is split is language dependent.
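To make tokenization concrete, here is a minimal sketch using the Hugging Face transformers tokenizer for GPT-2, chosen purely because it is publicly downloadable; LLM Engine's models such as Llama 2 use their own tokenizers, so the exact splits will differ:

from transformers import AutoTokenizer

# Load a publicly available tokenizer purely for illustration;
# llama-2-7b uses a different (SentencePiece-based) tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

tokens = tokenizer.tokenize("Hello, my name is")
print(tokens)
# ['Hello', ',', 'Ġmy', 'Ġname', 'Ġis']
# The 'Ġ' prefix marks an attached space, showing that tokens can
# carry whitespace and need not align with whole words.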
Scale's LLM Engine provides access to open source language models (see Model Zoo) that can be used for producing completions to prompts.
Completion API call
An example API call looks as follows:
from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",
    prompt="Hello, my name is",
    max_new_tokens=10,
    temperature=0.2,
)

print(response.json())
# '{"request_id": "c4bf0732-08e0-48a8-8b44-dfe8d4702fb0", "output": {"text": "________ and I am a ________", "num_completion_tokens": 10}}'
print(response.output.text)
# ________ and I am a ________
- model: The LLM you want to use (see Model Zoo).
- prompt: The main input for the LLM to respond to.
- max_new_tokens: The maximum number of tokens to generate in the completion.
- temperature: The sampling temperature to use. Higher values make the output more random, while lower values make it more focused and deterministic. When temperature is 0, greedy search is used (see the sketch after this list).
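The calls below contrast greedy decoding with higher-temperature sampling. The prompt and behavior notes are illustrative; sampled outputs will vary from run to run:

from llmengine import Completion

# temperature=0 selects the most likely token at each step (greedy search),
# so repeated calls with the same prompt return the same completion.
deterministic = Completion.create(
    model="llama-2-7b",
    prompt="The capital of France is",
    max_new_tokens=5,
    temperature=0,
)

# A higher temperature flattens the token distribution, so repeated calls
# can return different completions for the same prompt.
varied = Completion.create(
    model="llama-2-7b",
    prompt="The capital of France is",
    max_new_tokens=5,
    temperature=0.9,
)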
See the full Completion API reference documentation to learn more.
Completion API response

An example Completion API response, matching the fields shown in the example above, looks as follows:

{
  "request_id": "c4bf0732-08e0-48a8-8b44-dfe8d4702fb0",
  "output": {
    "text": "________ and I am a ________",
    "num_completion_tokens": 10
  }
}
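The response object exposes these fields as attributes. A short sketch of reading them, with attribute names taken from the example above:

# request_id identifies the request; output holds the generated text
# and the number of completion tokens produced.
print(response.request_id)
# c4bf0732-08e0-48a8-8b44-dfe8d4702fb0
print(response.output.text)
# ________ and I am a ________
print(response.output.num_completion_tokens)
# 10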
Token streaming
The Completions API supports token streaming to reduce perceived latency for certain applications. When streaming, tokens are sent as data-only server-sent events.

To enable token streaming, pass stream=True to either Completion.create or Completion.acreate.
An example of token streaming using the synchronous Completions API looks as follows:
import sys

from llmengine import Completion

stream = Completion.create(
    model="llama-2-7b",
    prompt="Give me a 200 word summary on the current economic events in the US.",
    max_new_tokens=1000,
    temperature=0.2,
    stream=True,
)

for response in stream:
    if response.output:
        print(response.output.text, end="")
        sys.stdout.flush()
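If you also need the complete text once streaming finishes, one approach is to collect the chunks as they arrive. This is a sketch that replaces the loop above; the chunk-joining logic is an assumption, not part of the client:

chunks = []
for response in stream:
    if response.output:
        chunks.append(response.output.text)
        print(response.output.text, end="")
        sys.stdout.flush()

# Join the streamed pieces to recover the full completion.
full_text = "".join(chunks)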
Async requests

The Python client supports asyncio for creating Completions. Use Completion.acreate instead of Completion.create to utilize async processing; the function signatures are otherwise identical.
An example of async Completions looks as follows:
import asyncio

from llmengine import Completion

async def main():
    response = await Completion.acreate(
        model="llama-2-7b",
        prompt="Hello, my name is",
        max_new_tokens=10,
        temperature=0.2,
    )
    print(response.json())

asyncio.run(main())
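Async requests are most useful when several completions are issued at once. The sketch below fans out multiple prompts concurrently with asyncio.gather from the standard library; the prompts themselves are illustrative:

import asyncio

from llmengine import Completion

async def main():
    prompts = [
        "Hello, my name is",
        "The weather today is",
    ]
    # Issue all requests concurrently instead of awaiting each in turn.
    responses = await asyncio.gather(
        *(
            Completion.acreate(
                model="llama-2-7b",
                prompt=p,
                max_new_tokens=10,
                temperature=0.2,
            )
            for p in prompts
        )
    )
    for r in responses:
        print(r.output.text)

asyncio.run(main())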
Which model should I use?
See the Model Zoo for more information on best practices for which model to use for Completions.