# HuggingFace LLM
> **Warning**
>
> Deploying HuggingFace LLM requires an existing Kubernetes cluster with a GPU-enabled node group.
## Introduction
A generative AI chatbot service backed by a HuggingFace model, exposed via a convenient web interface.
## Launch configuration
To get started, in the Platforms tab, press the New Platform button, and select HuggingFace LLM.
You will then be presented with launch configuration options to fill in:
| Option | Explanation |
| --- | --- |
| Platform name | A name to identify the HuggingFace LLM platform. |
| Kubernetes cluster | The Kubernetes platform on which to deploy HuggingFace LLM. If one hasn't already been created, check out the Kubernetes Overview. |
| App version | The version of the HuggingFace LLM Azimuth Application to use. |
| Model | The model to deploy from HuggingFace. vLLM is used for model serving, so any of its supported models should work. |
| Access Token | A HuggingFace [access token](https://huggingface.co/docs/hub/security-tokens), which is required for some gated models (see the token check sketch below the table). |
| Instruction | The initial system prompt, hidden from the user, which is used when generating responses. |
| Page Title | The title displayed at the top of the chat interface. |
| Backend vLLM Version | The version of vLLM to use for serving the model, selected from the list of supported versions. |
| LLM Sampling Parameters (Temperature, Frequency Penalty, etc.) | See the vLLM docs. These map onto chat request fields, as in the example request below the table. |
| Max Tokens | The maximum number of tokens to generate per response. Use this to moderate compute cost. |
| Model Context Length | An override for the model's maximum context length. |
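
Before launching, it can be useful to confirm that the access token you plan to supply can actually see the gated model you intend to deploy. The following is a minimal sketch using the `huggingface_hub` Python library; the model id and token value are placeholders for illustration, not values from this documentation:

```python
# Hypothetical pre-flight check: verify that a HuggingFace access token
# has access to a gated model before deploying the platform.
from huggingface_hub import HfApi
from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

HF_TOKEN = "hf_..."  # placeholder: your HuggingFace access token
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder: an example gated model

api = HfApi(token=HF_TOKEN)
try:
    # model_info raises if the repo is gated (and not granted) or missing.
    info = api.model_info(MODEL_ID)
    print(f"Token OK: {info.id} is accessible (gated={info.gated})")
except GatedRepoError:
    print("Token cannot access this gated model; request access on the model page first.")
except RepositoryNotFoundError:
    print("Model not found: check the repo id.")
```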
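
Once the platform is running, the Instruction, sampling parameters, and Max Tokens options correspond to fields of a chat completion request, since vLLM serves an OpenAI-compatible API. The sketch below assumes that endpoint is reachable from your client at a hypothetical URL; whether and how the backend API is exposed depends on your deployment:

```python
# A minimal sketch of querying the deployed model through vLLM's
# OpenAI-compatible API. The base_url is hypothetical; substitute the
# address at which your deployment exposes the vLLM backend.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.example.com/v1",  # hypothetical endpoint
    api_key="not-used",  # vLLM requires no key unless one is configured
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder: the deployed model
    messages=[
        # The "Instruction" launch option plays the role of this system prompt.
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise what a Kubernetes node group is."},
    ],
    temperature=0.7,  # one of the LLM sampling parameters
    max_tokens=256,   # caps tokens generated per response, bounding compute cost
)
print(response.choices[0].message.content)
```

With `max_tokens` set, each response is capped regardless of what the model would otherwise generate, which bounds the GPU time spent per request.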