Your First LLM API on Kubernetes
Running an LLM on Kubernetes is different from running a standard web app. Request volume is not the only thing that limits you: each replica needs a GPU and enough memory to hold the model weights.
In this guide, you will deploy a model to a GPU node, expose it as an API, and call it with a curl request.
We use these tools:
- Model: Qwen/Qwen2.5-1.5B-Instruct
- Engine: vLLM
- Infrastructure: Kubernetes with NVIDIA GPU support
Step 1: Check your GPU capacity
Before you start, ensure Kubernetes sees your hardware. Run this command (the backslash escapes the dot in the nvidia.com/gpu resource name so kubectl reads it as a single key):
kubectl get nodes -o=custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
If the GPU column shows <none> for every node, stop. You must fix your NVIDIA device plugin first.
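If you want this check to gate a script rather than rely on reading the table by eye, a small filter can do it. The sketch below assumes the two-column NAME/GPU table produced by the command above; the function name has_gpu_node is hypothetical and not part of kubectl.

```shell
# has_gpu_node: reads the NAME/GPU table on stdin.
# Succeeds if at least one node row reports a positive GPU count.
# Rows showing <none> (no allocatable GPU) are ignored.
has_gpu_node() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { found = 1 } END { exit !found }'
}

# Intended usage (requires a cluster, so it is left as a comment here):
#   kubectl get nodes \
#     -o=custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | has_gpu_node || { echo "no GPU nodes visible" >&2; exit 1; }
```

The header row is skipped with NR > 1, so a cluster with no GPU nodes makes the function return a non-zero status.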
Step 2: Secure your model access
Even for public models, use a Hugging Face token. This makes it easier to swap to private models later.
- Create a namespace: kubectl create namespace llm-demo
- Create a Secret for your token: kubectl create secret generic hf-token -n llm-demo --from-literal=HF_TOKEN="your_token_here"
Step 3: Deploy the model server
We use vLLM because it handles the heavy lifting. It manages request batching and tokenization, and it serves an OpenAI-compatible API.
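Create a deployment file named qwen-vllm.yaml with these key requirements:
- Request 1 GPU: nvidia.com/gpu: 1
- Mount /dev/shm: Model servers use shared memory, and the container default is often too small, which can cause crashes.
- Use Secrets: Pass your HF_TOKEN to the container.
- Add a Service: Step 4 port-forwards to svc/qwen-vllm on port 8000, so define a Service with that name.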
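The requirements above could be written out as the following manifest. This is a minimal sketch, not a tuned configuration. The names qwen-vllm, llm-demo, hf-token and HF_TOKEN come from this guide. The image tag vllm/vllm-openai:latest, the container arguments, the app: qwen-vllm label and the 2Gi shared-memory limit are assumptions you should adjust for your cluster and vLLM version.

```shell
# Write the manifest to qwen-vllm.yaml (quoted heredoc, so nothing is expanded).
cat > qwen-vllm.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-vllm
  namespace: llm-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen-vllm
  template:
    metadata:
      labels:
        app: qwen-vllm
    spec:
      containers:
        - name: vllm
          # Assumed image and arguments; pin a specific tag in real use.
          image: vllm/vllm-openai:latest
          args: ["--model", "Qwen/Qwen2.5-1.5B-Instruct", "--port", "8000"]
          env:
            - name: HF_TOKEN            # read from the Secret created in Step 2
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: HF_TOKEN
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1         # request 1 GPU
          volumeMounts:
            - name: shm
              mountPath: /dev/shm       # memory-backed shared memory
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi              # assumed size
---
apiVersion: v1
kind: Service
metadata:
  name: qwen-vllm
  namespace: llm-demo
spec:
  selector:
    app: qwen-vllm
  ports:
    - port: 8000
      targetPort: 8000
EOF
```

The Service is included in the same file because the port-forward in Step 4 targets svc/qwen-vllm.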
Apply your configuration: kubectl apply -f qwen-vllm.yaml
Step 4: Verify the API
Do not trust the "Running" status. A pod reports "Running" as soon as its container starts, even while the server is still downloading the model files and loading them onto the GPU. Watch your logs:
kubectl logs -n llm-demo -f deployment/qwen-vllm
Wait until you see the server listening on port 8000.
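If you would rather script the wait than watch the logs, a generic retry helper is one option. The sketch below is an illustration under assumptions: wait_until and WAIT_INTERVAL are hypothetical names, and the log text in the usage comment is a guess at what the server prints on startup, which may differ between vLLM versions.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or roughly TIMEOUT_SECONDS have passed.
# WAIT_INTERVAL (seconds between attempts) defaults to 5.
wait_until() {
  timeout_s=$1
  shift
  interval_s=${WAIT_INTERVAL:-5}
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      echo "timed out after ${timeout_s}s: $*" >&2
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
}

# Intended usage (requires a cluster, so it is left as a comment here).
# The grep pattern is an assumption about the startup log line:
#   wait_until 900 sh -c \
#     'kubectl logs -n llm-demo deployment/qwen-vllm | grep -q "Application startup complete"'
```

A generous timeout is deliberate here, since the first start includes the model download.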
Test it with port-forwarding. The command keeps running in the foreground, so leave it open:
kubectl port-forward -n llm-demo svc/qwen-vllm 8000:8000
Run your curl request in a second terminal:
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Explain Kubernetes in two sentences."}],
    "max_tokens": 120
  }'
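To turn the request above into a pass/fail check, you can pipe the response body through a crude filter. This sketch assumes a successful OpenAI-style chat completion contains a "choices" field and that error bodies do not. The function name looks_like_completion is hypothetical, and a plain string match is no substitute for a real JSON parser.

```shell
# looks_like_completion: reads a response body on stdin.
# Succeeds if the body contains a "choices" field; otherwise prints it to stderr.
looks_like_completion() {
  body=$(cat)
  case $body in
    *'"choices"'*)
      return 0
      ;;
    *)
      printf 'unexpected response: %s\n' "$body" >&2
      return 1
      ;;
  esac
}

# Intended usage: append "| looks_like_completion" to the curl command above
# and add -s to curl so progress output does not mix into the body.
```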
The Goal: You have moved from raw GPU capacity to a working API. You have proven that:
- Kubernetes schedules the GPU workload.
- The container can access the hardware.
- The model server loads weights into memory.
- The API responds to standard requests.
If this loop fails, scaling and routing will not save you. Fix the foundation first.
