
Day 6: Run an LLM on Your Laptop - With Docker

Published
10 min read

7 Days of Docker (2026) - A Docker Captain's guide. Not your average tutorial.

I'm a Docker Captain. And if you'd told me two years ago that I'd be pulling AI models from Docker Hub the same way I pull nginx, I would've laughed.

I'm not laughing anymore.

Docker shipped something called Model Runner. It lets you pull, run, and serve Large Language Models locally - no Python environment, no conda, no CUDA drivers, no dependency hell. One command. The model runs on your hardware with GPU acceleration. And it exposes an OpenAI-compatible API that any app can talk to.

Today we're going to pull a model, talk to it, build a real app that uses it, containerize that app, and deploy the whole thing with Compose. By the end of this post, you'll have a working AI-powered API running on your laptop. No cloud. No API keys. No monthly bill.


Pull a Model Like You Pull an Image

docker model pull ai/smollm2
Using cached model: 256.35 MiB

That's it. 256 megabytes. A small but capable language model, pulled from Docker Hub using the same infrastructure that serves container images. Same content-addressable storage, same caching.

Check what you have:

docker model list
MODEL NAME  PARAMETERS  QUANTIZATION    ARCHITECTURE  MODEL ID      SIZE
smollm2     361.82 M    IQ2_XXS/Q4_K_M  llama         354bf30d0aa3  256.35 MiB

Look familiar? Same format as docker images. Model name, ID, size.


Talk to It

docker model run ai/smollm2 "What is Kubernetes in one sentence?"
Kubernetes is a container orchestration platform that automates the deployment,
scaling, and management of microservices-based applications.

That ran locally. On my Mac. With Metal GPU acceleration. No internet required after the initial pull.

Check the runner status:

docker model status
Docker Model Runner is running
llama.cpp: running llama.cpp latest-metal e365e65

latest-metal means it's using Apple Silicon GPU acceleration via Metal. On Linux with NVIDIA, you'd see a CUDA tag. Docker picks the right backend automatically.

Model Runner supports multiple inference backends:

Backend            Platform                Models
llama.cpp + Metal  Mac (default)           GGUF models from Docker Hub (~340 tok/s)
vllm-metal         Mac (install required)  MLX models from Hugging Face (~275 tok/s)
vLLM + CUDA        Linux with NVIDIA GPU   Production inference
Diffusers          Linux/NVIDIA            Image generation (Stable Diffusion)

Want MLX models? Install vllm-metal

llama.cpp + Metal is the default and handles GGUF models from Docker Hub. But if you want to run MLX models - Apple's native ML framework, designed for Apple Silicon's unified memory architecture - you can install the vllm-metal backend:

docker model install-runner --backend vllm
Installing vllm backend...
vllm backend installed successfully

Check the status - both backends now running:

BACKEND    STATUS         DETAILS
llama.cpp  Running        llama.cpp latest-metal e365e65
vllm       Running        vllm-metal v0.1.0-20260320-122309
diffusers  Not Installed
mlx        Not Installed  package not installed

MLX models live on Hugging Face (not Docker Hub). Pull one:

docker model pull hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit

The same API (localhost:12434/v1/) serves both backends - Docker routes to the right one based on model format.
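You can see that shared endpoint in action from the host. A minimal sketch in Python, stdlib only, assuming Model Runner is listening on localhost:12434 and returns the standard OpenAI list shape ({"object": "list", "data": [{"id": ...}, ...]}):

```python
# Sketch: list every model Model Runner exposes, across both backends.
# Assumes the runner is listening on localhost:12434 (host-side endpoint).
import json
import urllib.request

MODELS_URL = "http://localhost:12434/v1/models"

def model_ids(listing: dict) -> list[str]:
    """Pull the `id` field out of an OpenAI-style model listing."""
    return [m["id"] for m in listing.get("data", [])]

if __name__ == "__main__":
    with urllib.request.urlopen(MODELS_URL) as resp:
        for model_id in model_ids(json.load(resp)):
            print(model_id)  # GGUF and MLX models appear in one list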

Watch out: You might see docker model install-runner --backend mlx --gpu metal suggested online or even by Gordon (Docker's AI assistant). On Docker Desktop, this fails with "Standalone installation not supported." The correct command is --backend vllm, which installs vllm-metal on Mac automatically. The mlx flag is for standalone Docker Engine only.

Here's the same question hitting both backends on my M2 - same API, same endpoint, different models:

# llama.cpp backend - SmolLM2 (362M params, GGUF from Docker Hub)
curl localhost:12434/v1/chat/completions \
  -d '{"model":"ai/smollm2", "messages":[{"role":"user","content":"What is a Docker container?"}]}'
"A Docker container is a lightweight, isolated, and self-contained runtime
environment that encapsulates an application and its dependencies."
74 tokens
# vllm-metal backend - Llama 3.2 1B (1B params, MLX from Hugging Face)
curl localhost:12434/v1/chat/completions \
  -d '{"model":"hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit", ...}'
"A Docker container is a lightweight, fully virtualized, and managed package
that performs a consistent version of an application, allowing it to be easily
deployed, scaled, and managed across multiple hosts and environments."
84 tokens

Two different models, two different backends, one API. Your app code doesn't change when you switch.
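The same point, sketched in Python with only the standard library (mirroring the curl calls above): one ask() helper, two model strings. The endpoint and model names come from the examples above; everything else is illustrative.

```python
# Sketch: the same chat-completions call serves both backends; only the
# model string changes. Assumes Model Runner on localhost:12434 with both
# models from the examples above already pulled.
import json
import urllib.request

CHAT_URL = "http://localhost:12434/v1/chat/completions"

def chat_payload(model: str, question: str, max_tokens: int = 60) -> dict:
    """Build a standard OpenAI chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
    }

def ask(model: str, question: str) -> str:
    req = urllib.request.Request(
        CHAT_URL,
        data=json.dumps(chat_payload(model, question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # llama.cpp backend (GGUF from Docker Hub)
    print(ask("ai/smollm2", "What is a Docker container?"))
    # vllm-metal backend (MLX from Hugging Face)
    print(ask("hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit", "What is a Docker container?"))
```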

What Nobody Tells You: Model Runner doesn't run models inside containers. It runs them directly on your host hardware - Metal on Mac, CUDA on Linux. The llama.cpp process is a native binary on your host, not inside a container namespace. Why? Performance. LLMs need direct GPU access. Container isolation adds overhead. Docker's role here is distribution (pull from Hub) and API (OpenAI-compatible endpoint). The container is the app that CALLS the model, not the model itself.


The API - This Changes Everything

Model Runner exposes an OpenAI-compatible API. Two endpoints depending on where you're calling from:

Calling from...                       URL
Your Mac (terminal, Python, VS Code)  http://localhost:12434/v1/
Inside a Docker container             http://model-runner.docker.internal/v1/

This is important. model-runner.docker.internal is Docker's internal DNS - it only resolves from inside containers. From your Mac, use localhost:12434.
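One way to handle that split in code - a sketch only; checking for /.dockerenv is a common but informal container-detection heuristic, not an official Docker guarantee:

```python
# Sketch: pick the right Model Runner base URL for wherever the code runs.
# The /.dockerenv check is a widely used heuristic for "am I inside a
# container", not a documented Docker contract.
import os

def model_runner_url() -> str:
    if os.path.exists("/.dockerenv"):  # inside a container
        return "http://model-runner.docker.internal/v1/"
    return "http://localhost:12434/v1/"  # on the host

print(model_runner_url())
```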

Try it:

curl http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [{"role": "user", "content": "Explain Docker volumes in 2 sentences"}],
    "max_tokens": 60
  }'
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "A Docker volume is a data container that allows you to mount a file system onto other files or directories. It enables you to use volumes as a source or destination for files within your Docker container, improving the portability and extensibility of your applications."
    }
  }],
  "usage": {
    "prompt_tokens": 37,
    "completion_tokens": 51,
    "total_tokens": 88
  }
}

That's the standard OpenAI chat completions format. Switch localhost:12434 to api.openai.com, add an API key, and the same request hits GPT-4. Your code doesn't care which backend it talks to. Local model for dev, cloud for prod. Same interface.

What Nobody Tells You: If curl fails with "Could not resolve host: model-runner.docker.internal" - you're running it from your Mac, not from inside a container. Use localhost:12434 from the host. This trips up everyone the first time.


Build a Real AI App

Enough theory. Let's build an AI-powered API, containerize it, and serve it with Compose.

The App (app.py)

from flask import Flask, request, jsonify
from openai import OpenAI
import os

app = Flask(__name__)
client = OpenAI(
    base_url=os.environ.get("LLM_URL", "http://model-runner.docker.internal/v1/"),
    api_key="not-needed"
)

@app.route("/")
def home():
    return jsonify({"service": "AI Demo", "model": "smollm2"})

@app.route("/ask")
def ask():
    question = request.args.get("q", "What is Docker?")
    response = client.chat.completions.create(
        model="ai/smollm2",
        messages=[{"role": "user", "content": question}],
        max_tokens=100
    )
    return jsonify({
        "question": question,
        "answer": response.choices[0].message.content,
        "tokens": response.usage.total_tokens
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

Notice that base_url reads from an environment variable, defaulting to the Docker-internal hostname. Run the app on the host instead, and you'd set LLM_URL=http://localhost:12434/v1/. The api_key is "not-needed" because Model Runner doesn't require auth.

The Dockerfile

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
EXPOSE 5000
CMD ["python", "app.py"]

The requirements.txt

flask==3.1.1
openai==1.82.0

compose.yaml

services:
  ai-app:
    build: .
    ports:
      - "5002:5000"
    environment:
      - LLM_URL=http://model-runner.docker.internal/v1/

The LLM_URL environment variable tells the container to use the Docker-internal endpoint.

Run It

docker compose up -d --build
 ✔ Image d6-ai-app-ai-app        Built
 ✔ Network d6-ai-app_default     Created
 ✔ Container d6-ai-app-ai-app-1  Started

Test It

curl http://localhost:5002/
{"model": "smollm2", "service": "AI Demo"}
curl "http://localhost:5002/ask?q=What+is+a+container+in+one+sentence"
{
  "answer": "A container is a lightweight package that runs applications and provides a controlled environment for the application's dependencies, allowing for easier deployment and scaling.",
  "question": "What is a container in one sentence",
  "tokens": 65
}

That's a containerized Flask app, calling a local LLM via Docker Model Runner, returning AI-generated answers. No cloud API. No API key. No monthly bill. Running on your laptop.
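Any client can now consume the service like a normal HTTP API. A minimal Python caller, stdlib only, assuming the Compose stack above is still up on port 5002:

```python
# Sketch: call the containerized /ask endpoint from any Python client.
# Assumes `docker compose up` from above is running on localhost:5002.
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:5002"

def ask_url(question: str, base: str = BASE) -> str:
    """Build the /ask URL with the question safely percent-encoded."""
    return f"{base}/ask?" + urllib.parse.urlencode({"q": question})

def ask_service(question: str) -> dict:
    """GET /ask and return the decoded JSON response."""
    with urllib.request.urlopen(ask_url(question)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    result = ask_service("What is a container in one sentence")
    print(result["answer"], f"({result['tokens']} tokens)")
```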

docker compose down

What Else Is New - The Quick Version

Model Runner is the headline, but Docker shipped more AI tooling in 2026:

Gordon (docker ai) - Docker's built-in AI assistant. It reads your project - Dockerfiles, Compose files, running containers - and gives context-specific answers. Not generic ChatGPT. It sees your actual environment.

docker ai "Why is my container using so much memory?"

Note: If you have many MCP servers configured, Gordon may error with "too many tools." Disable unused MCP servers in Docker Desktop settings to fix it.

MCP Toolkit (docker mcp) - Model Context Protocol is a standard for connecting AI agents to tools. Docker runs MCP servers inside isolated containers with restricted permissions. Think of it as a security layer between AI agents and your system.

Docker Scout - Already covered in Day 2, but worth repeating: scan your AI app images too. AI dependencies (PyTorch, transformers, etc.) are massive and often carry CVEs.

Docker Sandboxes - Run AI agents inside dedicated MicroVMs with their own kernel and private Docker daemon. Not containers - actual VM-grade isolation. Each agent session gets a disposable environment where it can clone repos, run tests, and build images without any path back to your host. Works with Claude Code, Codex, Copilot, and others. Install with brew install docker/tap/sbx.

Docker's AI story in 2026 goes well beyond "run containers." It's Model Runner for local inference, Gordon for context-aware assistance, MCP for secure tool access, Sandboxes for agent isolation, and Scout for supply chain security. All shipping today.


The Big Picture

Docker in 2026 is two things:

  1. A container platform (Days 1-5) - build, ship, run applications

  2. An AI development platform (Day 6) - pull models, run local inference, build AI apps

The second part is new. And it's growing fast. Docker Hub already hosts Llama, Mistral, Phi, Gemma, SmolLM, and others in GGUF format with various quantization levels.

The OpenAI-compatible API is the killer feature. Write your app against the OpenAI interface. During development, point it at localhost:12434 - free, fast, private. In production, swap to the real OpenAI API or any other compatible provider. Your code doesn't change.
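That swap can live entirely in configuration. A sketch, assuming env-var names of my own choosing (LLM_BASE_URL and LLM_API_KEY are illustrative, not a standard):

```python
# Sketch: choose the LLM endpoint from the environment so dev and prod run
# identical code. LLM_BASE_URL / LLM_API_KEY are illustrative names.
import os

def llm_config() -> tuple[str, str]:
    """Return (base_url, api_key); defaults to local Model Runner."""
    base_url = os.environ.get("LLM_BASE_URL", "http://localhost:12434/v1/")
    api_key = os.environ.get("LLM_API_KEY", "not-needed")  # Model Runner ignores keys
    return base_url, api_key

# Dev (no env vars set): local Model Runner, free and private.
# Prod: set LLM_BASE_URL=https://api.openai.com/v1/ and a real LLM_API_KEY,
# and the same OpenAI-client code (as in app.py above) talks to the cloud.
```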


Quick Reference

Command                                   What It Does
docker model pull ai/smollm2              Pull a model from Docker Hub
docker model run ai/smollm2 "prompt"      Chat with a model
docker model list                         List downloaded models
docker model status                       Check runner status and backend
docker model rm ai/smollm2                Remove a model
docker ai "question"                      Ask Gordon (context-aware)
docker mcp --version                      Check MCP Toolkit version
curl localhost:12434/v1/models            List models via API (from host)
curl localhost:12434/v1/chat/completions  Chat completions API (from host)

Tomorrow: Day 7

You just built an AI-powered app with Docker. From docker model pull to a working API in minutes.

Tomorrow is the finale. Day 7: Ship It. We take everything from the past 6 days and make it production-ready. Non-root users. Read-only filesystems. Resource limits. Security scanning. The checklist that separates side projects from production systems.
