May 3, 2016 Marie H.

Reverse Engineering a Dockerfile from a Running Docker Image

Here's a scenario I ran into several times in 2015 and 2016: someone on the team had a Docker container running in staging — maybe they'd built it months ago, maybe they'd pulled it from Docker Hub and customized it locally — and now they needed to understand what was in it. The original Dockerfile was gone, never committed, or never existed in a shareable form.

You can't extract a Dockerfile from a Docker image. The Dockerfile is a build recipe, and once the image is built, the recipe is gone — only the result remains. But you can extract a significant portion of the image's configuration, which is often what you actually need. I wrote a script to do that.

What the image actually stores

A Docker image is a stack of read-only filesystem layers plus a configuration blob. The configuration blob is what docker inspect shows you: environment variables, exposed ports, volumes, the working directory, the user the container runs as, the entrypoint, the default command, and any ONBUILD triggers. All of that is recoverable.

What's not recoverable in clean form: FROM, RUN, ADD, and COPY. The FROM base image isn't explicitly recorded in the config; it's implicit in the layer history. RUN, ADD, and COPY instructions get baked into filesystem layers. The layer content is there (you can poke around under /var/lib/docker/overlay2/ if your daemon uses the overlay2 storage driver), but the original instruction text isn't part of the Config section that docker inspect returns. What survives of it is the per-layer history.

docker history --no-trunc gets you partway there. It shows the command that produced each layer, in the format /bin/sh -c <cmd> for RUN instructions and /bin/sh -c #(nop) <instruction> for metadata instructions. (Images built with BuildKit record these commands in a slightly different form.) That's useful for reading, but it's noisy and not formatted as a Dockerfile. The CREATED BY column is also truncated by default, which is why --no-trunc matters.
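
To make that output easier to read, a small filter can strip the /bin/sh -c noise and put the layers back in build order. This is a minimal sketch, not part of the original script: history_to_instructions is a hypothetical helper name, and it assumes the classic-builder history format described above with one command per line.

```shell
#!/bin/bash
# Hypothetical helper: read "created by" lines from docker history on stdin
# (newest layer first) and print Dockerfile-like instructions (oldest first).
history_to_instructions() {
    # docker history lists the newest layer first, so reverse the line order.
    awk '{ lines[NR] = $0 } END { for (i = NR; i > 0; i--) print lines[i] }' |
    # Metadata instructions: drop the "/bin/sh -c #(nop)" prefix.
    # Everything else that starts with "/bin/sh -c" was a RUN instruction.
    sed -e 's,^/bin/sh -c #(nop) *,,' \
        -e 's,^/bin/sh -c ,RUN ,'
}
```

Assuming a Docker CLI whose history subcommand supports --format, it could be fed with docker history --no-trunc --format '{{.CreatedBy}}' <image> | history_to_instructions. The result is still an approximation: ADD and COPY lines show content checksums rather than the original source paths.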

The script

#!/bin/bash
# gen-docker-file: Reconstruct a Dockerfile-like config from a running Docker image
# Usage: gen-docker-file <image-name-or-id>
#
# Extracts MAINTAINER from build history, then inspects the image config
# for ENV, EXPOSE, VOLUME, USER, WORKDIR, ENTRYPOINT, CMD, and ONBUILD.
# Note: RUN, ADD, and COPY instructions are NOT recoverable — this is the
# config of the final image, not a reproducible build.

image="$1"
if [ -z "$image" ]; then
    echo "Usage: gen-docker-file <image>" >&2
    exit 1
fi

# Try to extract MAINTAINER from the build history
# (older images only; MAINTAINER was deprecated in favor of LABEL).
# The size column reads "0 B" or "0B" depending on the Docker version.
docker history --no-trunc "$image" | \
    sed -n -e 's,.*/bin/sh -c #(nop) *\(MAINTAINER .*[^ ]\) *0 *B.*,\1,p' | \
    head -1

# Extract the image config and format it as Dockerfile instructions
docker inspect --format='
{{- range $e := .Config.Env}}ENV {{$e}}
{{end -}}
{{- range $e,$v := .Config.ExposedPorts}}EXPOSE {{$e}}
{{end -}}
{{- range $e,$v := .Config.Volumes}}VOLUME {{$e}}
{{end -}}
{{- with .Config.User}}USER {{.}}
{{end -}}
{{- with .Config.WorkingDir}}WORKDIR {{.}}
{{end -}}
{{- with .Config.Entrypoint}}ENTRYPOINT {{json .}}
{{end -}}
{{- with .Config.Cmd}}CMD {{json .}}
{{end -}}
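{{- range $e := .Config.OnBuild}}ONBUILD {{$e}}
{{end -}}' "$image"

The docker inspect --format syntax uses Go's text/template. The {{- ... -}} trim markers strip surrounding whitespace so you don't get blank lines for fields that aren't set. The {{json .}} call on Entrypoint and Cmd serializes them as JSON arrays, which is the correct Dockerfile exec form representation (["nginx", "-g", "daemon off;"] rather than a shell string). OnBuild is handled differently: each entry in that list is already a complete instruction string (for example, COPY . /app), so the template ranges over the list and prefixes each entry with ONBUILD instead of dumping the whole list as one JSON array.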

The MAINTAINER extraction is a bit of a hack: it scans the history output with sed for the #(nop) MAINTAINER pattern. This only works on images that used the old MAINTAINER instruction, which was deprecated in favor of LABEL maintainer= in Docker 1.13. Most images I was dealing with in 2015-2016 still used it.
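
For newer images, the same information usually lives in a label instead. Here is a minimal sketch, assuming the image author followed the LABEL maintainer= convention (nothing enforces it). label_maintainer is a hypothetical helper, not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: print a LABEL line if the image carries a
# "maintainer" label, and print nothing otherwise.
label_maintainer() {
    local image="$1" value
    # "with" skips the lookup entirely when the image has no labels at all.
    value=$(docker inspect \
        --format '{{with .Config.Labels}}{{index . "maintainer"}}{{end}}' \
        "$image") || return 1
    if [ -n "$value" ]; then
        echo "LABEL maintainer=\"$value\""
    fi
}
```

Called as label_maintainer <image>, this could run alongside the history-based extraction so that both old and new images are covered.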

What this is useful for

The three situations where I actually reached for this script:

Understanding a third-party image before deploying it. Before dropping something into Kubernetes, I wanted to know what ports it expected to expose, what user it ran as, and whether it had an ENTRYPOINT or just a CMD (and whether I'd need to override one to run it correctly in our setup). Running gen-docker-file gave me a quick readable summary without having to hunt for the Dockerfile on Docker Hub.

Figuring out what a base image expects when writing a derived image. If I was writing a Dockerfile that started with FROM somecompany/node-base:latest, I needed to know whether the base image set a WORKDIR I'd be inheriting, what ENV vars were already set (so I didn't clobber them), and what USER it ran as. docker inspect gives you all of this, but the script formats it as Dockerfile syntax, which is easier to read at a glance than raw JSON.

Quick documentation. When someone asked "what does our staging image actually expose?", I could run the script against the image tag in ECR and paste the output into a Slack message. Not a substitute for real docs, but faster than digging through the manifest history.

What it can't do

It can't give you a reproducible build. The output of this script is not a Dockerfile you can build to get the same image back. You're missing the base image, all the RUN instructions, and all the COPY/ADD operations. What you have is the shape of the final image, its runtime configuration, which is useful but different.

If you want to understand what files are in each layer, docker history --no-trunc shows the commands, and you can mount the layers directly to inspect the filesystem contents. It's tedious but possible. The dive tool (open source, came out a bit later) makes this interactive and much more pleasant — it shows you a layer-by-layer diff of the filesystem alongside the command that produced each layer.
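
A lighter-weight option than mounting layers is to list the final, merged filesystem. The sketch below swaps in a different technique from the per-layer inspection described above: it exports a throwaway container and lists the resulting tar archive, so it shows what ended up in the image but not which layer put it there. list_image_files is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: list every path in an image's final (merged) filesystem.
list_image_files() {
    local image="$1" cid
    # Create (but never start) a throwaway container. The dummy command is
    # only there so images with neither CMD nor ENTRYPOINT can be created.
    cid=$(docker create "$image" /bin/true) || return 1
    # docker export streams the container's filesystem as a tar archive;
    # tar -t prints the member names without extracting anything.
    docker export "$cid" | tar -tf -
    # Clean up the throwaway container.
    docker rm "$cid" >/dev/null
}
```

Piping the result through grep or sort makes it easy to check for a specific binary or config file, for example list_image_files <image> | grep nginx.conf.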

The modern equivalent

docker image inspect <image> --format '{{json .Config}}' gets you the same raw data as a JSON blob. For most purposes now I'd just pipe that to jq and pull out the fields I care about. The docker inspect --format Go template syntax is expressive but it's also obscure enough that I always had to look it up.
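
As a rough illustration of the jq route, the filter below turns that JSON blob into the same Dockerfile-style lines. It is a sketch under assumptions: jq must be installed, config_to_dockerfile is a hypothetical name, and it covers only the fields shown here.

```shell
#!/bin/bash
# Hypothetical helper: read the JSON from '{{json .Config}}' on stdin and
# print Dockerfile-style lines for the fields that are set.
config_to_dockerfile() {
    jq -r '
        ((.Env // [])[] | "ENV \(.)"),
        ((.ExposedPorts // {}) | keys[] | "EXPOSE \(.)"),
        ((.Volumes // {}) | keys[] | "VOLUME \(.)"),
        (select((.User // "") != "") | "USER \(.User)"),
        (select((.WorkingDir // "") != "") | "WORKDIR \(.WorkingDir)"),
        (select(.Entrypoint != null) | "ENTRYPOINT \(.Entrypoint | tojson)"),
        (select(.Cmd != null) | "CMD \(.Cmd | tojson)")
    '
}
```

Usage would look like docker image inspect <image> --format '{{json .Config}}' | config_to_dockerfile. The // fallbacks are there because unset fields come back as null rather than as empty lists or maps.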

The other thing that's different now: most public images on Docker Hub link directly to their Dockerfile on GitHub. In 2015 that was hit or miss. Official images were usually documented, but a lot of community images weren't, and plenty of internal images at startups lived only in someone's laptop build context. That's the environment this script was built for.