Closed Bug 1990389 Opened 5 months ago Closed 5 months ago

Intermittent exception [d2g] could not load docker image: exit status 1

Categories

(Infrastructure & Operations :: RelOps: Posix OS, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: chorotan, Assigned: jcristau)

References

Details

Attachments

(2 files)

Failure push: https://treeherder.mozilla.org/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel&revision=a7c39575ed11b45d662b5139d924007b16420daa&selectedTaskRun=bJtKR2NMTU2ILZ5uK2_cKA.0

Failure log: https://firefoxci.taskcluster-artifacts.net/bJtKR2NMTU2ILZ5uK2_cKA/0/public/logs/live_backing.log

[taskcluster 2025-09-23T18:07:07.870Z] Task ID: bJtKR2NMTU2ILZ5uK2_cKA
[taskcluster 2025-09-23T18:07:07.870Z] === Task Starting ===
[taskcluster 2025-09-23T18:07:08.117Z] Uploading redirect artifact public/logs/live.log to URL https://firefoxci-websocktunnel.services.mozilla.com/us-central1-a.5637449981301321985.60099/log/HkH23Yb0QAaYLl4sz-STXA with mime type "text/plain; charset=utf-8" and expiry 2025-09-23T19:22:08.063Z
[taskcluster 2025-09-23T18:07:08.299Z] [mounts] Moving existing writable directory cache gecko-level-3-checkouts-hg58-v3-38439bed431008791a5f from /home/generic-worker/caches/V1jpBh5tRKu5vWIxYbE4vA to /home/task_175865082763016/cache0
[taskcluster 2025-09-23T18:07:08.299Z] [mounts] Creating directory /home/task_175865082763016
[taskcluster 2025-09-23T18:07:08.307Z] [mounts] Updating ownership of files inside directory '/home/task_175865082763016/cache0' from task_175864999138420 to task_175865082763016
[taskcluster 2025-09-23T18:07:12.366Z] [mounts] Successfully mounted writable directory cache '/home/task_175865082763016/cache0'
[taskcluster 2025-09-23T18:07:12.366Z] [mounts] Downloading task e2R5awZzQDqHxQcZOa1-4w artifact public/image.tar.zst to /home/generic-worker/downloads/fjTHG07kTy2YJXoTWca8Tw
[taskcluster 2025-09-23T18:07:20.361Z] [mounts] Downloaded 1725268915 bytes with SHA256 99e742e057e327ea4f8f15a1cf58cf647ef9f11d76bb8ded3bf7fd73f42f448f from task e2R5awZzQDqHxQcZOa1-4w artifact public/image.tar.zst to /home/generic-worker/downloads/fjTHG07kTy2YJXoTWca8Tw
[taskcluster:warn 2025-09-23T18:07:20.361Z] [mounts] Download /home/generic-worker/downloads/fjTHG07kTy2YJXoTWca8Tw of task e2R5awZzQDqHxQcZOa1-4w artifact public/image.tar.zst has SHA256 99e742e057e327ea4f8f15a1cf58cf647ef9f11d76bb8ded3bf7fd73f42f448f but task payload does not declare a required value, so content authenticity cannot be verified
[taskcluster 2025-09-23T18:07:20.361Z] [mounts] Creating directory /home/task_175865082763016
[taskcluster 2025-09-23T18:07:20.377Z] [mounts] Copying /home/generic-worker/downloads/fjTHG07kTy2YJXoTWca8Tw to /home/task_175865082763016/dockerimage
[taskcluster 2025-09-23T18:07:33.389Z] [mounts] Granting task_175865082763016 full control of file '/home/task_175865082763016/dockerimage'
[taskcluster 2025-09-23T18:07:34.604Z] [d2g] Loading docker image
[taskcluster:warn 2025-09-23T18:08:41.033Z] [d2g] Artifact "public/remote/pup_errorsummary.json" not found at "/builds/worker/pup_errorsummary.json": exit status 1
[taskcluster:warn 2025-09-23T18:08:41.033Z] Error response from daemon: No such container: taskcontainer_NG3pfVJyT4mGNeLt8xs6Dw
[taskcluster:warn 2025-09-23T18:08:41.033Z] 
[taskcluster 2025-09-23T18:08:41.105Z] Uploading error artifact public/remote/pup_errorsummary.json from file artifact0.json with message "Could not read file '/home/task_175865082763016/artifact0.json'", reason "file-missing-on-worker" and expiry 2026-09-23T16:52:37.779Z
[taskcluster 2025-09-23T18:08:41.105Z] [mounts] Preserving cache: Moving "/home/task_175865082763016/cache0" to "/home/generic-worker/caches/V1jpBh5tRKu5vWIxYbE4vA"
[taskcluster 2025-09-23T18:08:41.198Z] Uploading link artifact public/logs/live.log to artifact public/logs/live_backing.log with expiry 2026-09-23T16:52:37.779Z
[taskcluster:error] [d2g] could not load docker image: exit status 1
[taskcluster:error] 

From the worker logs:

Sep 23 18:07:33Z gecko-t-t-linux-docker-amd-bfjnvrqms8ctz-lwy7rnyw start-worker 2025/09/23 18:07:33 Starting task feature D2G...
Sep 23 18:07:33Z gecko-t-t-linux-docker-amd-bfjnvrqms8ctz-lwy7rnyw start-worker 2025/09/23 18:07:33 Loaded file d2g-image-cache.json
Sep 23 18:08:40Z gecko-t-t-linux-docker-amd-bfjnvrqms8ctz-lwy7rnyw dockerd time="2025-09-23T18:08:40.201958553Z" level=info msg="Layer sha256:53e4dfc804e0c6f716f1f6abab7e52946476239fe59ac49fbf7e111d240e6bfd cleaned up"
Sep 23 18:08:40Z gecko-t-t-linux-docker-amd-bfjnvrqms8ctz-lwy7rnyw dockerd time="2025-09-23T18:08:40.231717132Z" level=info msg="Layer sha256:337c58fbf55ec1878b83b9a421aaa4196f71af5b82cc927f98b5b26e2f238b76 cleaned up"
Sep 23 18:08:40Z gecko-t-t-linux-docker-amd-bfjnvrqms8ctz-lwy7rnyw dockerd time="2025-09-23T18:08:40.320021380Z" level=info msg="Layer sha256:107cbdaeec042e6154640c94972c638f4e2fee795902b149e8ce9acbd03d59d7 cleaned up"
Sep 23 18:08:41Z gecko-t-t-linux-docker-amd-bfjnvrqms8ctz-lwy7rnyw start-worker 2025/09/23 18:08:41 Stopping task feature D2G...

Matt, any idea why the load might have failed here?

Flags: needinfo?(mboris)

These worker logs look suspicious:

Sep 23 18:08:40Z gecko-t-t-linux-docker-amd-bfjnvrqms8ctz-lwy7rnyw dockerd time="2025-09-23T18:08:40.201958553Z" level=info msg="Layer sha256:53e4dfc804e0c6f716f1f6abab7e52946476239fe59ac49fbf7e111d240e6bfd cleaned up"
Sep 23 18:08:40Z gecko-t-t-linux-docker-amd-bfjnvrqms8ctz-lwy7rnyw dockerd time="2025-09-23T18:08:40.231717132Z" level=info msg="Layer sha256:337c58fbf55ec1878b83b9a421aaa4196f71af5b82cc927f98b5b26e2f238b76 cleaned up"
Sep 23 18:08:40Z gecko-t-t-linux-docker-amd-bfjnvrqms8ctz-lwy7rnyw dockerd time="2025-09-23T18:08:40.320021380Z" level=info msg="Layer sha256:107cbdaeec042e6154640c94972c638f4e2fee795902b149e8ce9acbd03d59d7 cleaned up"

But it's hard to say exactly what happened without being able to see the stderr output. I could patch d2g to capture that output, which would help with future issues like this. Let's see whether this issue turns out to be prevalent.
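For reference, the patch amounts to surfacing the subprocess's stderr alongside its exit status instead of discarding it. A minimal sketch of that pattern (d2g itself is Go; this is an illustrative Python stand-in, and `run_and_capture` is a hypothetical helper, not d2g's actual code):

```python
import subprocess

def run_and_capture(cmd):
    """Run a command; on failure, include its stderr in the raised error
    rather than reporting only the bare exit status."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(
            f"exit status {proc.returncode}; stderr: {proc.stderr.strip()}"
        )
    return proc.stdout
```

With something like this, "could not load docker image: exit status 1" would instead carry the daemon's actual complaint (as later comments show, "no space left on device").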

Flags: needinfo?(mboris)

I found a couple of other tasks that ran on that same worker, depended on the same docker image, and ran into the same issue:

https://firefox-ci-tc.services.mozilla.com/tasks/M2l--90iT4ub-nN8MKnkgQ/runs/0 and https://firefox-ci-tc.services.mozilla.com/tasks/Vb58OTZtSsql9MqK7E7jzg/runs/0

https://github.com/taskcluster/taskcluster/pull/7973 has now landed, so we'll have better logging for future failures once a new version is released and deployed to FxCI workers.

Also, looking at the dependents of https://firefox-ci-tc.services.mozilla.com/tasks/e2R5awZzQDqHxQcZOa1-4w, I'm seeing multiple tasks where the worker successfully loaded this docker image, which leads me to believe there's some strange issue with that individual worker.

The latest version was rolled out to FxCI workers today. Let's watch for similar failures; this time we should have more info on the specific error.

The severity field is not set for this bug.
:bhearsum, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(bhearsum)
Severity: -- → S3
Priority: -- → P3
Flags: needinfo?(bhearsum)

I discovered a handful more of these failures on the same pool: https://firefox-ci-tc.services.mozilla.com/worker-manager/gecko-t%2Ft-linux-docker-amd

Now that the workers are running an updated version of Generic Worker that includes more d2g logging, I can see the root cause:

Oct 08 03:52:10 PM gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w start-worker 2025/10/08 19:52:10 ERROR(s) encountered: [d2g] could not load docker image: exit status 1
Oct 08 03:52:10 PM gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w start-worker stdout:
Oct 08 03:52:10 PM gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w start-worker stderr: write /usr/lib/x86_64-linux-gnu/libx265.so.199: no space left on device

:hneiva or :jcristau, do you think we should bump the disk size? Are we running into any other disk space issues like this on that pool?
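As a quick way to check whether a worker is tight on space before a `docker load` fails like this, something along these lines can report free space on the relevant mounts (`free_gib` is a hypothetical helper for illustration, not part of the worker scripts):

```python
import shutil

def free_gib(path):
    """Return free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30

# On a worker one would check e.g. free_gib("/") and free_gib("/mnt")
```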

Flags: needinfo?(jcristau)
Flags: needinfo?(hneiva)

I haven't noticed any errors like this, but then again I wasn't specifically looking for them.
I've opened a PR to bump the disk size for the gecko pool to 120 GB (was 75 GB).

Flags: needinfo?(hneiva)

Somehow we're not using the local SSD on that worker:

Oct 08 17:27:18 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w configure-generic-worker-disks /mnt is not using an nvme0n* device. Setting it up...
Oct 08 17:27:18 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w configure-generic-worker-disks Detecting available nvme0n* devices...
Oct 08 17:27:18 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w configure-generic-worker-disks Found 1 nvme0n* devices: /dev/nvme0n1
Oct 08 17:27:18 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w configure-generic-worker-disks   Cannot use /dev/nvme0n1: device is partitioned
Oct 08 17:27:18 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w configure-generic-worker-disks   Command requires all devices to be found.
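Reading the log, the script's logic appears to be: glob `nvme0n*` devices, reject any that already carry partitions, and require every matched device to be usable, so a single partitioned device fails the whole setup. A rough sketch of that logic (my reconstruction from the log output, not the actual configure-generic-worker-disks script):

```python
def usable_nvme_devices(devices):
    """devices: dict mapping device name -> list of partition names
    (as lsblk would report them).

    Mirrors the logged behavior: every matched nvme0n* device must be
    unpartitioned, or setup fails outright."""
    matched = {d: parts for d, parts in devices.items() if d.startswith("nvme0n")}
    partitioned = [d for d, parts in matched.items() if parts]
    if partitioned:
        raise RuntimeError(
            f"Cannot use {', '.join(partitioned)}: device is partitioned"
        )
    return sorted(matched)
```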

Ah. Because the boot disk is on nvme0n1:

Oct 08 17:27:11 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w multipath nvme0n1: uid = nvme.1ae0-6e766d655f636172642d7064-6e766d655f636172642d7064-00000001 (sysfs)
Oct 08 17:27:11 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w systemd-fsck /dev/nvme0n1p15: 12 files, 12499/213663 clusters
Oct 08 17:27:11 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w kernel  nvme0n1: p1 p14 p15 p16
Oct 08 17:27:15 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w kernel EXT4-fs (nvme0n1p1): Supports (experimental) DIO atomic writes awu_min: 4096, awu_max: 4096
Oct 08 17:27:15 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w kernel EXT4-fs (nvme0n1p1): mounted filesystem f8b1fd37-145c-44d7-bd5d-420a37a1468c ro with ordered data mode. Quota mode: none.
Oct 08 17:27:15 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w kernel EXT4-fs (nvme0n1p1): re-mounted f8b1fd37-145c-44d7-bd5d-420a37a1468c r/w.
Oct 08 17:27:15 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w kernel EXT4-fs (nvme0n1p16): Supports (experimental) DIO atomic writes awu_min: 4096, awu_max: 4096
Oct 08 17:27:15 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w kernel EXT4-fs (nvme0n1p16): mounted filesystem 44bd5d32-077f-4b29-871d-cfeb7677f0f6 r/w with ordered data mode. Quota mode: none.
Oct 08 17:27:15 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w kernel EXT4-fs (nvme0n1p1): resizing filesystem from 15466235 to 19398395 blocks
Oct 08 17:27:15 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w kernel EXT4-fs (nvme0n1p1): resized filesystem to 19398395
Oct 08 17:27:18 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w configure-generic-worker-disks Found 1 nvme0n* devices: /dev/nvme0n1
Oct 08 17:27:18 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w configure-generic-worker-disks   Cannot use /dev/nvme0n1: device is partitioned
Oct 08 18:58:39 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w fstrim /boot/efi: 98.2 MiB (102995968 bytes) trimmed on /dev/nvme0n1p15
Oct 08 18:58:39 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w fstrim /boot: 795 MiB (833662976 bytes) trimmed on /dev/nvme0n1p16
Oct 08 18:58:39 gecko-t-t-linux-docker-amd-jhp5qmujtf2j6ynvs3av2w fstrim /: 16.2 GiB (17419616256 bytes) trimmed on /dev/nvme0n1p1

So where's the local SSD?

> So where's the local SSD?

It's at /dev/nvme1n1, which the configure-generic-worker-disks script ignores.
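On GCE, local SSDs also appear under stable `/dev/disk/by-id/google-local-nvme-ssd-*` names regardless of which `nvmeXn1` slot the boot disk claimed, so selecting by that pattern instead of `nvme0n*` sidesteps the ordering problem. A sketch of that selection (hypothetical helper; this assumes the GCE by-id symlinks are present):

```python
import glob

def local_ssd_devices():
    """Return GCE local SSD device paths via their stable by-id names,
    avoiding any dependence on nvme0n*/nvme1n* enumeration order."""
    return sorted(glob.glob("/dev/disk/by-id/google-local-nvme-ssd-*"))
```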

Assignee: nobody → jcristau
Status: NEW → ASSIGNED
Flags: needinfo?(jcristau)
Blocks: 1983305
Component: Task Configuration → RelOps: Posix OS
Product: Firefox Build System → Infrastructure & Operations

OK, this looks better:

Oct 09 16:36:05 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks /mnt is not using a local SSD. Setting it up...
Oct 09 16:36:05 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks Detecting available local SSDs...
Oct 09 16:36:05 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks Found 1 local SSD devices: /dev/disk/by-id/google-local-nvme-ssd-0
Oct 09 16:36:05 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks   Physical volume "/dev/disk/by-id/google-local-nvme-ssd-0" successfully created.
Oct 09 16:36:05 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks   Volume group "instance_storage" successfully created
Oct 09 16:36:05 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks   Logical volume "lv_instance_storage" created.
Oct 09 16:36:05 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks mke2fs 1.47.0 (5-Feb-2023)
Oct 09 16:36:08 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks Discarding device blocks: done
Oct 09 16:36:08 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks Creating filesystem with 98302976 4k blocks and 24576000 inodes
Oct 09 16:36:08 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks Filesystem UUID: 474bab00-b63f-4ac0-a4ca-49fca3977c8f
Oct 09 16:36:08 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks Superblock backups stored on blocks:
Oct 09 16:36:08 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
Oct 09 16:36:08 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968
Oct 09 16:36:08 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks Allocating group tables: done
Oct 09 16:36:08 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks Writing inode tables: done
Oct 09 16:36:08 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks Creating journal (262144 blocks): done
Oct 09 16:36:08 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks Writing superblocks and filesystem accounting information: done
Oct 09 16:36:08 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks umount: /home: not mounted.
Oct 09 16:36:08 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks umount: /mnt: not mounted.
Oct 09 16:36:12 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks /mnt is now using local SSDs.
Oct 09 16:36:12 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks Creating directories for generic-worker
Oct 09 16:36:12 gecko-t-t-linux-docker-amd-alpha-cqv594rxql-7qv-gnwtziq configure-generic-worker-disks Creating docker specific directories
Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED