Your First LLM API on Kubernetes


AI-assisted draft.

GyaanSetu Editorial · Last week · 2 min read


Running an LLM on Kubernetes is different from running a standard web app. You do not scale on request count alone. Capacity depends on how many GPUs you have and whether each one can hold the model weights.

In this guide, you will deploy a model to a GPU node, expose it as an API, and call it with a curl request.

We use these tools:

Model: Qwen/Qwen2.5-1.5B-Instruct
Engine: vLLM
Infrastructure: Kubernetes with NVIDIA GPU support

Step 1: Check your GPU capacity

Before you start, ensure Kubernetes sees your hardware. Run this command: kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

The backslash escapes the dot in the resource name nvidia.com/gpu so that kubectl reads it as a single key.

If the GPU column shows <none> for every node, stop. You must fix your NVIDIA device plugin first.
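The same check can be scripted so that a setup script stops early. The sketch below is one possible way to do it, not part of the original guide: has_gpu_node is a hypothetical helper name, and it assumes the two-column NAME/GPU table produced by the command in this step.

```shell
# Hypothetical helper: reads the NAME/GPU table on stdin and succeeds
# only if at least one node reports a positive GPU count.
has_gpu_node() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { found = 1 }
       END { exit (found ? 0 : 1) }'
}

# Example use (requires kubectl and cluster access):
#   kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | has_gpu_node || echo "No allocatable GPUs found"
```

The first line of input is skipped as the header, and any value that is not a whole number, such as <none>, counts as no GPU.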

Step 2: Secure your model access

Even for public models, use a Hugging Face token. This makes it easier to swap to private models later.

Create a namespace: kubectl create namespace llm-demo
Create a Secret for your token: kubectl create secret generic hf-token -n llm-demo --from-literal=HF_TOKEN="your_token_here"

Step 3: Deploy the model server

We use vLLM because it handles request batching and tokenization for you and serves an OpenAI-compatible API.

Create a deployment file with these key requirements:

Request 1 GPU: nvidia.com/gpu: 1
Mount /dev/shm: vLLM runs on PyTorch, which shares data between processes through shared memory. The small default /dev/shm in a container can cause crashes.
Use Secrets: Pass your HF_TOKEN to the container.
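A minimal sketch of what qwen-vllm.yaml could contain, written here with a shell heredoc. The original guide does not show its manifest, so treat every detail as an assumption: the vllm/vllm-openai image and its latest tag, the vllm serve start command, the app: qwen-vllm label, the 2Gi shared-memory size, and the added Service (included because Step 4 port-forwards to svc/qwen-vllm). Only the names qwen-vllm, llm-demo, hf-token, and HF_TOKEN come from the guide.

```shell
# Write an illustrative Deployment and Service to qwen-vllm.yaml.
cat > qwen-vllm.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-vllm
  namespace: llm-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen-vllm
  template:
    metadata:
      labels:
        app: qwen-vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # assumed image; pin a version in practice
          command: ["/bin/sh", "-c"]
          args: ["vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 8000"]
          ports:
            - containerPort: 8000
          env:
            - name: HF_TOKEN               # read from the Secret created in Step 2
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: HF_TOKEN
          resources:
            limits:
              nvidia.com/gpu: 1            # request 1 GPU
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory                 # memory-backed /dev/shm
            sizeLimit: 2Gi                 # assumed size
---
apiVersion: v1
kind: Service
metadata:
  name: qwen-vllm
  namespace: llm-demo
spec:
  selector:
    app: qwen-vllm
  ports:
    - port: 8000
      targetPort: 8000
EOF
```

The GPU is listed under limits because Kubernetes treats extended resources such as nvidia.com/gpu as limit-driven, and the request defaults to the same value.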

Apply your configuration: kubectl apply -f qwen-vllm.yaml

Step 4: Verify the API

Do not trust the "Running" status. A pod reports "Running" as soon as its container starts, even while vLLM is still downloading and loading large model files. Watch your logs: kubectl logs -n llm-demo -f deployment/qwen-vllm

Wait until you see the server listening on port 8000.

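Test it with port-forwarding: kubectl port-forward -n llm-demo svc/qwen-vllm 8000:8000

The port-forward command keeps running in the foreground, so leave it open and send your request from a second terminal.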
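If you would rather not watch the logs by eye, a small polling loop can confirm that the API answers before you send a real request. This is only a sketch: wait_for_http is a hypothetical helper, the retry counts are arbitrary, and it assumes curl is installed and the port-forward above is active.

```shell
# Hypothetical helper: poll a URL until it returns an HTTP success
# status or the attempts run out.
# Usage: wait_for_http URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_http() {
  url=$1
  attempts=${2:-60}
  delay=${3:-5}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl fail on HTTP errors; --max-time bounds each try.
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example use:
#   wait_for_http http://127.0.0.1:8000/v1/models && echo "API is ready"
```

The example polls /v1/models, the model listing route of the OpenAI-compatible API, because it only succeeds once the server is up.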

Run your curl request:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Explain Kubernetes in two sentences."}], "max_tokens": 120 }'
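The reply comes back as JSON in the OpenAI chat completions shape, with the generated text under choices[0].message.content. To print only that text, one option is to pipe the response through jq. This assumes jq is installed, which the original guide does not mention, and extract_reply is a hypothetical helper name.

```shell
# Hypothetical helper: print only the assistant's text from a
# chat completion response read on stdin. Requires jq.
extract_reply() {
  jq -r '.choices[0].message.content'
}

# Example use: append "| extract_reply" to the curl command above.
```
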

The Goal: You have moved from raw GPU capacity to a working API. You have proven that:

Kubernetes schedules the GPU workload.
The container can access the hardware.
The model server loads weights into memory.
The API responds to standard requests.

If this loop fails, scaling and routing will not save you. Fix the foundation first.

Source: https://dev.to/the-persistent-engineer/your-first-llm-api-on-kubernetes-from-model-to-curl-request-4l1j

