Multi-arch Docker builds in GitHub Actions
We needed ARM64 containers. Our Python services run on mixed infrastructure: amd64 in CI and some production clusters, arm64 on newer nodes. Building on one arch and emulating the other with QEMU was painfully slow and broke native extensions. So we added proper multi-arch builds to our GitHub Actions CI.
It took a week to get right. Then five more fixes over two weeks as each failure mode revealed itself in production. This is what went wrong and how we fixed it.
The setup
We use reusable workflows in a shared Woosmap/.github repo. Individual service repos call these. A build-helper repo (checked out at runtime) contains the actual logic as a single Node.js action that routes to TypeScript modules based on an action input:
switch (action) {
  case 'build': await build(await get_build_info()); break
  case 'publish': await publish(await get_build_info()); break
  case 'release': await release(await get_build_info()); break
  case 'create_manifest':
    await createManifest(await get_build_info()); break
  case 'create_release_manifest':
    await createReleaseManifests(await get_build_info()); break
}
The multi-arch matrix is generated dynamically in the workflow:
strategy:
  matrix:
    include: ${{ fromJson(inputs.multi_arch &&
      '[{"runner":"ubuntu-24.04","arch":"amd64"},
        {"runner":"ubuntu-24.04-arm","arch":"arm64"}]'
      || '[{"runner":"ubuntu-24.04","arch":"amd64"}]') }}
runs-on: ${{ matrix.runner }}
When multi_arch is true, two jobs run in parallel on native runners. No QEMU. An ARCH_SUFFIX environment variable (-amd64 or -arm64) drives platform selection downstream:
export function getPlatformForArch(): string {
  const archSuffix = getArchSuffix()
  return archSuffix === '-arm64' ? 'linux/arm64' : 'linux/amd64'
}
ARM64 jobs skip tests, coverage, and SonarCloud. Only amd64 does those. Build + push only for arm64.
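The post doesn't show getArchSuffix(), so here is a minimal sketch of what a stricter version of the suffix-to-platform mapping could look like. The names parseArchSuffix and platformFor are hypothetical. Unlike the original ternary, which treats anything that isn't -arm64 as amd64, this variant rejects unknown suffixes so a typo in ARCH_SUFFIX fails loudly.

```typescript
// Hypothetical sketch: the real getArchSuffix() is not shown in the post.
// The suffix is passed in as a parameter so the logic is testable without env access.
type ArchSuffix = '' | '-amd64' | '-arm64'

function parseArchSuffix(raw: string | undefined): ArchSuffix {
  // Unset or empty means a single-arch build: no suffix at all.
  if (raw === undefined || raw === '') return ''
  if (raw === '-amd64' || raw === '-arm64') return raw
  // Anything else is treated as a configuration error (assumption, not the author's behavior).
  throw new Error(`Unsupported ARCH_SUFFIX: ${JSON.stringify(raw)}`)
}

function platformFor(suffix: ArchSuffix): string {
  // Same mapping as getPlatformForArch(): only -arm64 selects linux/arm64.
  return suffix === '-arm64' ? 'linux/arm64' : 'linux/amd64'
}
```

In a workflow this would be called as platformFor(parseArchSuffix(process.env['ARCH_SUFFIX'])).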
Problem 1: buildx attestations
Buildx 0.10+ attaches a provenance attestation to every image by default (SBOM attestations use the same mechanism when enabled). When you build and push myimage:tag, the registry doesn’t receive a plain image manifest. It receives a manifest list containing the image plus an attestation manifest.
This is fine if you just pull and run. But we need to stitch arch-specific images into a multi-arch manifest:
docker manifest create myimage:pr_10 \
  --amend myimage:sha-amd64 \
  --amend myimage:sha-arm64
docker manifest create expects plain images as sources, not manifest lists. When sha-amd64 is itself a manifest list (because of attestations), the stitching fails.
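To check whether an arch tag has been wrapped this way, you can look at its raw manifest JSON (for example from docker buildx imagetools inspect --raw). The following is a rough sketch, not part of our helper: the function names are hypothetical, and it assumes the attestation entries carry the vnd.docker.reference.type annotation that BuildKit sets.

```typescript
// Hypothetical helpers for inspecting a raw registry manifest (already parsed from JSON).
interface ManifestDescriptor {
  mediaType: string
  digest: string
  platform?: { architecture: string; os: string }
  annotations?: Record<string, string>
}

interface RegistryManifest {
  mediaType?: string
  manifests?: ManifestDescriptor[]
}

// Media types used for multi-entry indexes (OCI image index and Docker manifest list).
const INDEX_MEDIA_TYPES = [
  'application/vnd.oci.image.index.v1+json',
  'application/vnd.docker.distribution.manifest.list.v2+json'
]

// True when the tag points at an index rather than a single image manifest.
function isImageIndex(m: RegistryManifest): boolean {
  const byType = m.mediaType !== undefined && INDEX_MEDIA_TYPES.indexOf(m.mediaType) !== -1
  return byType || Array.isArray(m.manifests)
}

// Entries that BuildKit marks as attestations rather than runnable images.
function attestationEntries(m: RegistryManifest): ManifestDescriptor[] {
  return (m.manifests ?? []).filter(
    d => d.annotations?.['vnd.docker.reference.type'] === 'attestation-manifest'
  )
}
```

If isImageIndex returns true for a per-arch tag, that tag is not a plain image and will not work as a source for the stitching step described above.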
The fix: disable attestations with BUILDX_NO_DEFAULT_ATTESTATIONS=1.
Our first attempt was conditional:
# Broken: empty string "" fails strconv.ParseBool
BUILDX_NO_DEFAULT_ATTESTATIONS: ${{ inputs.multi_arch && '1' || '' }}
When multi_arch is false, this evaluates to "". Buildx tries to parse it as a boolean via Go’s strconv.ParseBool, which fails on empty string. This broke every non-multi-arch build across the organization.
The fix was trivial. Always set it to 1:
BUILDX_NO_DEFAULT_ATTESTATIONS: 1
Attestations aren’t needed regardless of arch mode. An env var set to an empty string is different from one that isn’t set at all: a subtle but organization-wide footgun.
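The set-but-empty versus unset distinction is easy to reproduce outside Go. Here is a minimal TypeScript model of it. parseBoolLikeGo accepts the same literals as Go's strconv.ParseBool; attestationsDisabled is a hypothetical stand-in for how a consumer might read the variable, assumed from the behavior we observed rather than taken from the Buildx source.

```typescript
// Literals accepted by Go's strconv.ParseBool; everything else is an error.
const TRUE_VALUES = ['1', 't', 'T', 'TRUE', 'true', 'True']
const FALSE_VALUES = ['0', 'f', 'F', 'FALSE', 'false', 'False']

function parseBoolLikeGo(value: string): boolean {
  if (TRUE_VALUES.indexOf(value) !== -1) return true
  if (FALSE_VALUES.indexOf(value) !== -1) return false
  // The empty string lands here, exactly like any other unrecognized value.
  throw new Error(`invalid boolean syntax: ${JSON.stringify(value)}`)
}

// Hypothetical consumer: only parses the variable when it is present in the environment.
function attestationsDisabled(env: Record<string, string | undefined>): boolean {
  const raw = env['BUILDX_NO_DEFAULT_ATTESTATIONS']
  if (raw === undefined) return false // unset: keep the default
  return parseBoolLikeGo(raw)         // set: must be a valid boolean, "" is not
}
```

With this model, an empty env map returns false, a value of '1' returns true, and a value of '' throws. A workflow expression that falls back to '' produces the third case, not the first.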
Problem 2: PR number resolution
Each arch job publishes its image with a tag. We need a consistent tag so the manifest step can find both. The original design used pr_X-amd64 / pr_X-arm64, where the PR number came from an API call:
export async function get_pr(): Promise<number> {
  const {data: commit_prs} =
    await oktokit.rest.repos.listPullRequestsAssociatedWithCommit({
      owner: repo[0], repo: repo[1], commit_sha: sha
    })
  if (commit_prs[0] === undefined) {
    throw new Error('Commit has no PR')
  }
  return commit_prs[0].number
}
The listPullRequestsAssociatedWithCommit API sometimes returned empty results: race conditions, merge commits, event timing. When it failed, get_publish_tag() silently fell back to the raw SHA:
export async function get_publish_tag(): Promise<string> {
  try {
    return `pr_${await get_pr()}`
  } catch (e) {
    return process.env['GITHUB_SHA'] || '' // No warning!
  }
}
The amd64 job would tag as pr_42-amd64 (API succeeded), the arm64 job as abc123def (API failed). The manifest step looked for pr_42-arm64, which didn’t exist.
Two fixes:
Read the PR number from the event payload. On pull_request events it is always available, with zero API calls:
export async function get_pr(): Promise<number> {
  const con = new Context()
  if (con.payload.pull_request) {
    return con.payload.pull_request.number
  }
  // Fallback for push events only
  const {data: commit_prs} = await oktokit.rest.repos
    .listPullRequestsAssociatedWithCommit({...})
  // ...
}
Use the commit SHA for arch-suffixed images. The SHA is always the same across parallel jobs:
export async function docker(image: string, name: string): Promise<void> {
  const archSuffix = getArchSuffix()
  const sha = process.env['GITHUB_SHA'] || ''
  const tag = archSuffix ? `${sha}${archSuffix}` : await get_publish_tag()
  // ...
}
The manifest step bridges them:
const manifestTag = await get_publish_tag() // "pr_X"
const sha = process.env['GITHUB_SHA'] || ''
const archTags = ARCHITECTURES.map(arch => `${sha}-${arch}`)
await createAndPushManifest(name, manifestTag, archTags)
So the flow is: {sha}-amd64 + {sha}-arm64 → manifest pr_X. No API race, no silent fallback.
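The body of createAndPushManifest isn't shown here, so the following is only a sketch of the tag arithmetic under stated assumptions. planPrManifest and manifestCommands are hypothetical names, and the command strings simply mirror the docker manifest create invocation from Problem 1. One deliberate difference from the snippets above: an empty SHA is rejected instead of silently producing tags like -amd64.

```typescript
// Hypothetical sketch of the PR tagging flow; not the helper's real implementation.
const ARCHITECTURES = ['amd64', 'arm64'] as const

interface ManifestPlan {
  manifestTag: string // the multi-arch tag consumers pull, e.g. pr_42
  archTags: string[]  // the per-arch source tags, e.g. <sha>-amd64
}

function planPrManifest(sha: string, prNumber: number): ManifestPlan {
  // Assumption: fail fast on a missing SHA rather than building malformed tags.
  if (sha === '') throw new Error('GITHUB_SHA is empty; refusing to build tags')
  return {
    manifestTag: `pr_${prNumber}`,
    archTags: ARCHITECTURES.map(arch => `${sha}-${arch}`)
  }
}

// Renders the docker CLI commands that would stitch and push the manifest.
function manifestCommands(image: string, plan: ManifestPlan): string[] {
  const sources = plan.archTags.map(tag => `--amend ${image}:${tag}`).join(' ')
  return [
    `docker manifest create ${image}:${plan.manifestTag} ${sources}`,
    `docker manifest push ${image}:${plan.manifestTag}`
  ]
}
```

Because both parallel jobs and the manifest job derive the arch tags from the same SHA, none of them needs to agree on anything fetched at runtime.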
Problem 3: deploy ordering
The Leela admin service deploys on every PR build. It was wired inline in the build job:
- name: Deploy
  if: matrix.arch == 'amd64'
  uses: ./helper/.github/actions/deploy
Problem: in multi-arch mode, the deploy step ran before the manifest existed. It tried to pull pr_X, which only gets created in a separate create_manifest job that depends on both arch builds completing.
The fix adds a separate downstream job for multi-arch deploys:
# Inline deploy only for single-arch
- name: Deploy
  if: matrix.arch == 'amd64' && !inputs.multi_arch
  uses: ./helper/.github/actions/deploy
# Multi-arch: deploy after manifest is ready
pr_deploy:
  if: inputs.multi_arch
  needs: create_manifest
  runs-on: ubuntu-24.04
  steps:
    - name: Deploy
      uses: ./helper/.github/actions/deploy
Same pattern for the release workflow: deploy only after create_release_manifest completes.
Problem 4: release version calculation
The createReleaseManifests function creates the multi-arch manifest for release tags. It originally called get_split_version_tag() to determine the version:
const tag = await get_split_version_tag() // Fetches latest release, bumps version
const version = `${tag[0]}.${tag[1]}.${tag[2]}`
get_split_version_tag() fetches the latest GitHub release and bumps the version based on PR labels. But createReleaseManifests runs after the release jobs complete. The release already happened. So get_latest_tag() now returns v5.5.3 (the new release), and bumping it gives v5.5.4, which doesn’t exist as a tag.
The fix: just use get_latest_tag() directly.
const version = await get_latest_tag()
const parts = version.split('.')
const archTags = ARCHITECTURES.map(arch => `${version}-${arch}`)
await createAndPushManifest(name, version, archTags) // v1.2.3
await createAndPushManifest(name, `${parts[0]}.${parts[1]}`, archTags) // v1.2
await createAndPushManifest(name, parts[0], archTags) // v1
Three manifest tags per release. Semver consumers can pin to major, minor, or patch.
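The major/minor/patch expansion can be pulled into a small pure function, which makes it easy to test on its own. This is a sketch with a hypothetical name (releaseManifestTags); it assumes tags look like v1.2.3 or 1.2.3 and, unlike a bare split('.'), it refuses anything else instead of emitting tags containing undefined.

```typescript
// Hypothetical helper: expands a full release tag into the three manifest tags.
function releaseManifestTags(version: string): string[] {
  // Accepts an optional leading "v" followed by exactly three numeric parts.
  const match = /^(v?\d+)\.(\d+)\.(\d+)$/.exec(version)
  if (match === null) {
    throw new Error(`Not a major.minor.patch tag: ${version}`)
  }
  const [, major, minor] = match
  // Full version first, then the minor and major aliases that point at the same images.
  return [version, `${major}.${minor}`, major]
}
```

For v5.5.3 this yields v5.5.3, v5.5, and v5, each of which would be passed to createAndPushManifest with the same per-arch source tags.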
The full tagging flow
After all the fixes, the image lifecycle looks like this:
PR Build:
  {sha}-amd64, {sha}-arm64 → manifest: pr_X
Release (per-arch):
  pull pr_X (Docker resolves correct arch)
  push v1.2.3-amd64, v1.2.3-arm64
Release Manifests:
  v1.2.3-amd64 + v1.2.3-arm64 → manifests: v1.2.3, v1.2, v1
Only the amd64 job creates the GitHub release to avoid duplicates:
if (!archSuffix || archSuffix === '-amd64') {
  await create_release(version)
} else {
  core.info(`Skipping GitHub release creation on ${archSuffix} build`)
}
Timeline
- Feb 4: Initial multi-arch support. Matrix strategy, arch suffix, manifest creation.
- Feb 5: Deploy ordering fix. Leela deploying before manifests existed.
- Feb 16: Same fix for the release/merge workflow.
- Feb 17: BUILDX_NO_DEFAULT_ATTESTATIONS to fix attestation-poisoned manifests.
- Feb 18: SHA-based arch tagging + PR number from event payload.
- Feb 19: BUILDX_NO_DEFAULT_ATTESTATIONS empty string crash fix.
- Feb 19: get_latest_tag() instead of get_split_version_tag() in release manifests.
Five fixes in two weeks after the initial rollout. Each one was discovered in production through a different failure mode.
What I’d do differently
Test the non-multi-arch path. The empty string strconv.ParseBool crash only affected repos that hadn’t opted into multi-arch. We tested the new feature, not the old path.
Use SHA tags from the start. The PR-number-based approach was elegant but fragile. The SHA is always consistent across parallel jobs: no API calls, no race conditions.
Make the manifest step a required downstream job. The deploy-before-manifest problem was obvious in retrospect. Any step that depends on a multi-arch manifest should be in a job that needs: the manifest creation job. No exceptions, no “it works for single-arch” shortcuts.
Docker’s multi-arch story is good once it works. Getting there from a single-arch CI that evolved organically over years is where the pain lives. Every assumption about “one build produces one image” breaks in interesting ways.