# HuggingFace LLM
> **Warning**
>
> Deploying HuggingFace LLM requires an existing Kubernetes cluster with a GPU-enabled node group.
## Introduction
A generative AI chatbot service backed by a HuggingFace model, exposed via a convenient web interface.
## Launch configuration
To get started, in the Platforms tab, press the New Platform button, and select HuggingFace LLM.
You will then be presented with launch configuration options to fill in:
| Option | Explanation |
| --- | --- |
| Platform name | A name to identify the HuggingFace LLM platform. |
| Kubernetes cluster | The Kubernetes platform on which to deploy HuggingFace LLM. If one hasn't already been created, check out the Kubernetes Overview. |
| App version | The version of the HuggingFace LLM Azimuth Application to use. |
| Model | The model to deploy from HuggingFace. vLLM is used for model serving, so any of its supported models should work. |
| Access Token | A HuggingFace [access token](https://huggingface.co/docs/hub/security-tokens), which is required for some gated models (see the token check sketch below the table). |
| Instruction | The initial system prompt, hidden from the user, which is used when generating responses. |
| Page Title | The title displayed at the top of the chat interface. |
| Backend vLLM Version | The version of vLLM to use for serving the model, selected from the list of supported versions. |
| LLM Sampling Parameters (Temperature, Frequency Penalty, etc.) | See the vLLM docs. These map onto chat request fields, as in the example request below the table. |
| Max Tokens | The maximum number of tokens to generate per response. Use this to moderate compute cost. |
| Model Context Length | An override for the model's maximum context length. |
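
Before launching, it can be useful to confirm that the access token you plan to supply can actually see the gated model you intend to deploy. The following is a minimal sketch using the `huggingface_hub` Python library; the model id and token value are placeholders for illustration, not values from this documentation:

```python
# Hypothetical pre-flight check: verify that a HuggingFace access token
# has access to a gated model before deploying the platform.
from huggingface_hub import HfApi
from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

HF_TOKEN = "hf_..."  # placeholder: your HuggingFace access token
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder: an example gated model

api = HfApi(token=HF_TOKEN)
try:
    # model_info raises if the repo is gated (and not granted) or missing.
    info = api.model_info(MODEL_ID)
    print(f"Token OK: {info.id} is accessible (gated={info.gated})")
except GatedRepoError:
    print("Token cannot access this gated model; request access on the model page first.")
except RepositoryNotFoundError:
    print("Model not found: check the repo id.")
```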
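
Once the platform is running, the Instruction, sampling parameters, and Max Tokens options correspond to fields of a chat completion request, since vLLM serves an OpenAI-compatible API. The sketch below assumes that endpoint is reachable from your client at a hypothetical URL; whether and how the backend API is exposed depends on your deployment:

```python
# A minimal sketch of querying the deployed model through vLLM's
# OpenAI-compatible API. The base_url is hypothetical; substitute the
# address at which your deployment exposes the vLLM backend.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.example.com/v1",  # hypothetical endpoint
    api_key="not-used",  # vLLM requires no key unless one is configured
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder: the deployed model
    messages=[
        # The "Instruction" launch option plays the role of this system prompt.
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise what a Kubernetes node group is."},
    ],
    temperature=0.7,  # one of the LLM sampling parameters
    max_tokens=256,   # caps tokens generated per response, bounding compute cost
)
print(response.choices[0].message.content)
```

With `max_tokens` set, each response is capped regardless of what the model would otherwise generate, which bounds the GPU time spent per request.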