Your First LLM API on Kubernetes

Running an LLM on Kubernetes is different from running a standard web app. You do not just scale on request volume. You are limited by GPU capacity and by the size of the model weights.

In this guide, you will deploy a model to a GPU node, expose it as an API, and call it with a curl request.

We use these tools:

  • Model: Qwen/Qwen2.5-1.5B-Instruct
  • Engine: vLLM
  • Infrastructure: Kubernetes with NVIDIA GPU support

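Step 1: Check your GPU capacity

Before you start, make sure Kubernetes sees your hardware. Run this command:

kubectl get nodes -o=custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

The backslash escapes the dot in nvidia.com so that kubectl reads it as part of the resource name rather than as a path separator.

If the GPU column shows <none>, stop. You must fix your NVIDIA device plugin first.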
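To turn this check into something a script can act on, here is a minimal sketch. The helper name count_gpu_nodes is hypothetical and not part of kubectl. It assumes the two-column NAME/GPU table that the command in this step prints.

```shell
# Hypothetical helper: read the NAME/GPU table on stdin and print how
# many nodes advertise at least one allocatable GPU.
count_gpu_nodes() {
  # NR > 1 skips the header row. Nodes without GPUs show "<none>" in
  # column 2, so only purely numeric values greater than zero count.
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { n++ } END { print n + 0 }'
}
```

Pipe the output of the kubectl get nodes command from this step into count_gpu_nodes. A result of 0 means no node currently offers a GPU to the scheduler.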

Step 2: Secure your model access

Even for public models, use a Hugging Face token. This makes it easier to swap to private models later.

  • Create a namespace: kubectl create namespace llm-demo
  • Create a Secret for your token: kubectl create secret generic hf-token -n llm-demo --from-literal=HF_TOKEN="your_token_here"
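If you would rather not paste the token into the command itself, one option is to read it from an environment variable. The sketch below assumes the token is already in HF_TOKEN. The helper name create_hf_secret is hypothetical, and it reuses the Secret name and namespace from the bullets above.

```shell
# Hypothetical helper: create the hf-token Secret from the HF_TOKEN
# environment variable instead of a literal typed on the command line.
create_hf_secret() {
  # Refuse to create a Secret that holds an empty token.
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  kubectl create secret generic hf-token -n llm-demo \
    --from-literal=HF_TOKEN="$HF_TOKEN"
}
```

Call create_hf_secret after setting HF_TOKEN in your shell. The guard only catches an unset or empty variable. It does not check that the token is valid.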

Step 3: Deploy the model server

We use vLLM because it handles the heavy lifting. It manages batching, tokenization, and the OpenAI-compatible API.

Create a deployment file with these key requirements:

  • Request 1 GPU: nvidia.com/gpu: 1
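  • Mount /dev/shm: PyTorch, which vLLM builds on, passes data between processes through shared memory. The container default is small, and running out of it can crash the server.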
  • Use Secrets: Pass your HF_TOKEN to the container.
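The guide does not show the file itself, so here is one possible qwen-vllm.yaml that meets the three requirements above. Treat it as a sketch. The vllm/vllm-openai:latest image, the app: qwen-vllm label, the volume name dshm, and the 2Gi shared-memory limit are assumptions. The Deployment name, Service name, namespace, Secret, and port come from the commands used elsewhere in this guide.

```shell
# Write a minimal Deployment and Service to qwen-vllm.yaml.
# Assumed, not from the guide: image tag, labels, volume name, sizeLimit.
cat > qwen-vllm.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-vllm
  namespace: llm-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen-vllm
  template:
    metadata:
      labels:
        app: qwen-vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # assumed image and tag
          args:
            - "--model"
            - "Qwen/Qwen2.5-1.5B-Instruct"
            - "--port"
            - "8000"
          ports:
            - containerPort: 8000
          env:
            - name: HF_TOKEN               # read from the Secret in Step 2
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: HF_TOKEN
          resources:
            limits:
              nvidia.com/gpu: 1            # request exactly one GPU
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm          # memory-backed shared memory
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi                 # assumed size
---
apiVersion: v1
kind: Service
metadata:
  name: qwen-vllm
  namespace: llm-demo
spec:
  selector:
    app: qwen-vllm
  ports:
    - port: 8000
      targetPort: 8000
EOF
```

Pinning the image to a specific version instead of latest is usually the safer choice once you move past a first test.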

Apply your configuration: kubectl apply -f qwen-vllm.yaml

Step 4: Verify the API

Do not trust the "Running" status. A pod shows "Running" as soon as its container starts, even while it is still downloading the model files. Watch your logs:

kubectl logs -n llm-demo -f deployment/qwen-vllm

Wait until you see the server listening on port 8000.

Test it with port-forwarding. This command keeps running, so leave it open in its own terminal:

kubectl port-forward -n llm-demo svc/qwen-vllm 8000:8000
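With the port-forward running, you can also poll the API instead of watching logs by hand. This is a minimal sketch. The helper name wait_for_api and its defaults (60 attempts, 5 seconds apart) are hypothetical. It assumes the server answers on /v1/models only once it is up.

```shell
# Hypothetical helper: poll an HTTP endpoint until it returns a success
# status, or give up after a fixed number of attempts.
# Usage: wait_for_api URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_api() {
  url=$1
  attempts=${2:-60}
  delay=${3:-5}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s hides progress output.
    if curl -fs -o /dev/null "$url"; then
      echo "ready after $((i + 1)) attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}
```

For example: wait_for_api http://127.0.0.1:8000/v1/models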

In a second terminal, run your curl request:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Explain Kubernetes in two sentences."}], "max_tokens": 120 }'
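The response is a JSON document in the OpenAI chat-completions shape, with the generated text under choices[0].message.content. If you only want that text, a small filter helps. This sketch assumes jq is installed. The helper name extract_reply is hypothetical.

```shell
# Hypothetical helper: read a chat-completions JSON response on stdin
# and print only the assistant's reply text. Requires jq.
extract_reply() {
  jq -r '.choices[0].message.content'
}
```

Add -s to the curl command to hide the progress meter, then pipe its output into extract_reply.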

The Goal: You have moved from raw GPU capacity to a working API. You have proven that:

  • Kubernetes schedules the GPU workload.
  • The container can access the hardware.
  • The model server loads weights into memory.
  • The API responds to standard requests.

If this loop fails, scaling and routing will not save you. Fix the foundation first.

Source: https://dev.to/the-persistent-engineer/your-first-llm-api-on-kubernetes-from-model-to-curl-request-4l1j
