Open Bug 1932466 Opened 15 days ago Updated 4 days ago

Perma Exception: Download failed, no more retries! | Download failed: size mismatch

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

Filed by: smolnar [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=483525355&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/ZBhuyN0qRUGX_dPNbQvk1w/runs/0/artifacts/public/logs/live_backing.log


Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/LcGMJ5QdT8OUk0kqVu9GXg/artifacts/public/build/clang.tar.zst
[fetches 2024-11-20T18:39:35.237Z] https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/LcGMJ5QdT8OUk0kqVu9GXg/artifacts/public/build/clang.tar.zst resolved to 7864272 bytes with sha256 97978193086485ea780444136bbeaf3dcb9b4ebce3717fdea555601272c16d31 in 0.403s
[fetches 2024-11-20T18:39:35.238Z] Download failed: size mismatch on https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/LcGMJ5QdT8OUk0kqVu9GXg/artifacts/public/build/clang.tar.zst: wanted 472193866; got 7864272
[fetches 2024-11-20T18:39:35.239Z] Traceback (most recent call last):
[fetches 2024-11-20T18:39:35.239Z]   File "/builds/worker/checkouts/gecko/third_party/python/taskcluster_taskgraph/taskgraph/run-task/fetch-content", line 978, in <module>
[fetches 2024-11-20T18:39:35.239Z]     sys.exit(main())
[fetches 2024-11-20T18:39:35.240Z]              ^^^^^^
[fetches 2024-11-20T18:39:35.240Z]   File "/builds/worker/checkouts/gecko/third_party/python/taskcluster_taskgraph/taskgraph/run-task/fetch-content", line 974, in main
[fetches 2024-11-20T18:39:35.240Z]     return args.func(args)
[fetches 2024-11-20T18:39:35.240Z]            ^^^^^^^^^^^^^^^
[fetches 2024-11-20T18:39:35.240Z]   File "/builds/worker/checkouts/gecko/third_party/python/taskcluster_taskgraph/taskgraph/run-task/fetch-content", line 880, in command_task_artifacts
[fetches 2024-11-20T18:39:35.240Z]     fetch_urls(downloads)
[fetches 2024-11-20T18:39:35.240Z]   File "/builds/worker/checkouts/gecko/third_party/python/taskcluster_taskgraph/taskgraph/run-task/fetch-content", line 612, in fetch_urls
[fetches 2024-11-20T18:39:35.240Z]     f.result()
[fetches 2024-11-20T18:39:35.240Z]   File "/usr/lib/python3.11/concurrent/futures/_base.py", line 456, in result
[fetches 2024-11-20T18:39:35.240Z]     return self.__get_result()
[fetches 2024-11-20T18:39:35.240Z]            ^^^^^^^^^^^^^^^^^^^
[fetches 2024-11-20T18:39:35.240Z]   File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
[fetches 2024-11-20T18:39:35.240Z]     raise self._exception
[fetches 2024-11-20T18:39:35.240Z]   File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
[fetches 2024-11-20T18:39:35.240Z]     result = self.fn(*self.args, **self.kwargs)
[fetches 2024-11-20T18:39:35.240Z]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[fetches 2024-11-20T18:39:35.240Z]   File "/builds/worker/checkouts/gecko/third_party/python/taskcluster_taskgraph/taskgraph/run-task/fetch-content", line 590, in fetch_and_extract
[fetches 2024-11-20T18:39:35.240Z]     download_to_path(url, dest_path, sha256=sha256, size=size)
[fetches 2024-11-20T18:39:35.240Z]   File "/builds/worker/checkouts/gecko/third_party/python/taskcluster_taskgraph/taskgraph/run-task/fetch-content", line 279, in download_to_path
[fetches 2024-11-20T18:39:35.240Z]     raise Exception("Download failed, no more retries!")
[fetches 2024-11-20T18:39:35.240Z] Exception: Download failed, no more retries!
[taskcluster 2024-11-20 18:39:35.563Z] === Task Finished ===
[taskcluster 2024-11-20 18:39:35.568Z] Artifact "public/logs" not found at "/builds/worker/logs/": (HTTP code 404) no such container - Could not find the file /builds/worker/logs/ in container c4b219191ea54b82f25e5c659d109eb91fa455585c9507413eaceef40c799489 
[taskcluster 2024-11-20 18:39:35.571Z] Artifact "public/build" not found at "/builds/worker/artifacts/": (HTTP code 404) no such container - Could not find the file /builds/worker/artifacts/ in container c4b219191ea54b82f25e5c659d109eb91fa455585c9507413eaceef40c799489 
[taskcluster 2024-11-20 18:39:35.573Z] Artifact "public/cidata/target.crashreporter-symbols-full.tar.zst" not found at "/builds/worker/cidata/target.crashreporter-symbols-full.tar.zst": (HTTP code 404) no such container - Could not find the file /builds/worker/cidata/target.crashreporter-symbols-full.tar.zst in container c4b219191ea54b82f25e5c659d109eb91fa455585c9507413eaceef40c799489 
[taskcluster 2024-11-20 18:39:35.575Z] Artifact "public/cidata/sccache.log" not found at "/builds/worker/cidata/sccache.log": (HTTP code 404) no such container - Could not find the file /builds/worker/cidata/sccache.log in container c4b219191ea54b82f25e5c659d109eb91fa455585c9507413eaceef40c799489 
[taskcluster 2024-11-20 18:39:35.577Z] Artifact "public/cidata/sccache-stats.json" not found at "/builds/worker/cidata/sccache-stats.json": (HTTP code 404) no such container - Could not find the file /builds/worker/cidata/sccache-stats.json in container c4b219191ea54b82f25e5c659d109eb91fa455585c9507413eaceef40c799489 
[taskcluster 2024-11-20 18:39:35.664Z] Unsuccessful task run with exit code: 1 completed in 499.877 seconds

As far as I can tell, the downloads come out with a different size each time, which seems to point toward some wonkiness in either the network or the queue service. I'm not aware of anything changing in either of those places, though. I get the correct sizes when downloading locally.
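
For anyone trying to reproduce this locally, the check that fails in the log above (expected size and sha256 against what was actually received, retried a few times) can be sketched roughly as below. This is a minimal illustration and not the actual fetch-content code: `verify_payload`, `download_with_retries`, `IntegrityError`, and the `fetch` callable are hypothetical names, and the retry count of 5 is an assumption.

```python
import hashlib
from typing import Callable


class IntegrityError(Exception):
    """Raised when a payload does not match its expected size or digest."""


def verify_payload(data: bytes, expected_size: int, expected_sha256: str) -> None:
    # Check the size first: it is cheap and catches truncated responses.
    if len(data) != expected_size:
        raise IntegrityError(
            f"size mismatch: wanted {expected_size}; got {len(data)}"
        )
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise IntegrityError(
            f"sha256 mismatch: wanted {expected_sha256}; got {digest}"
        )


def download_with_retries(
    fetch: Callable[[], bytes],  # hypothetical: performs one download attempt
    expected_size: int,
    expected_sha256: str,
    attempts: int = 5,  # assumed retry budget, not taken from the real script
) -> bytes:
    for attempt in range(1, attempts + 1):
        try:
            data = fetch()
            verify_payload(data, expected_size, expected_sha256)
            return data
        except IntegrityError as exc:
            print(f"attempt {attempt} failed: {exc}")
    raise Exception("Download failed, no more retries!")
```

Passing the download step in as a callable keeps the sketch independent of any particular HTTP client, so the same check can be pointed at the queue URL or at the CDN URL directly to compare results.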

Severity: S4 → --
Priority: P5 → P1
Summary: Perma Exception: Download failed, no more retries! | Download failed: size mismatch → [TREES CLOSED] Perma Exception: Download failed, no more retries! | Download failed: size mismatch
See Also: → 1921446

With green reruns of the affected builds, we decided to reopen the trees; it seems we're in the clear now.

Priority: P1 → --
Summary: [TREES CLOSED] Perma Exception: Download failed, no more retries! | Download failed: size mismatch → Perma Exception: Download failed, no more retries! | Download failed: size mismatch

We dug through some Cloud CDN logs with jbuck and found some interesting requests where we get 200s back with much smaller response sizes than expected:

[{
  "logName": "projects/moz-fx-taskcluster-prod-4b87/logs/requests",
  "resource": {
    "type": "http_load_balancer",
    "labels": {
      "url_map_name": "taskcluster-firefoxcitc-artifacts-gcs-cdn",
      "forwarding_rule_name": "taskcluster-firefoxcitc-artifacts-gcs-cdn-https-ipv4",
      "backend_service_name": "",
      "target_proxy_name": "taskcluster-firefoxcitc-artifacts-gcs-cdn",
      "zone": "global",
      "project_id": "moz-fx-taskcluster-prod-4b87"
    }
  },
  "textPayload": null,
  "jsonpayload_type_loadbalancerlogentry": {
    "statusdetails": "cache_lookup_failed_after_partial_response",
    "_type": "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry",
    "remoteip": "34.31.2.242",
    "cacheid": "CBF",
    "parentinsertid": null,
    "cachedecision": ["RESPONSE_HAS_CACHE_CONTROL", "RESPONSE_CACHE_CONTROL_PUBLIC", "RESPONSE_HAS_ETAG", "RESPONSE_HAS_LAST_MODIFIED", "RESPONSE_HAS_EXPIRES", "RESPONSE_HAS_CONTENT_TYPE", "CACHE_MODE_FORCE_CACHE_ALL"],
    "backendtargetprojectnumber": "projects/90111867433"
  },
  "timestamp": "2024-11-20 19:24:07.573867 UTC",
  "receiveTimestamp": "2024-11-20 19:24:08.344779 UTC",
  "severity": "INFO",
  "insertId": "1bglqb1f9ryq0i",
  "httpRequest": {
    "requestMethod": "GET",
    "requestUrl": "https://firefoxci.taskcluster-artifacts.net/LcGMJ5QdT8OUk0kqVu9GXg/0/public/build/clang.tar.zst",
    "requestSize": "190",
    "status": "200",
    "responseSize": "6292192",
    "userAgent": "Python-urllib/3.11",
    "remoteIp": "34.31.2.242",
    "serverIp": null,
    "referer": null,
    "cacheLookup": "true",
    "cacheHit": "true",
    "cacheValidatedWithOriginServer": null,
    "cacheFillBytes": null,
    "protocol": null,
    "latency": "0.030741"
  },
  "operation": null,
  "trace": "projects/moz-fx-taskcluster-prod-4b87/traces/846f339d822b94b814ac1962f85ae003",
  "spanId": "2a5ea9aaa5cb394d",
  "traceSampled": null,
  "sourceLocation": null,
  "split": null
}]

jbuck suggests we file a GCP support ticket for this.
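
In case it helps with the support ticket, entries like the one above could be pulled out of an exported log with something along these lines. It is only a sketch: it assumes the export is a JSON array shaped like the excerpt above (string-valued `status` and `responseSize`, and the `jsonpayload_type_loadbalancerlogentry` key), and `find_truncated_responses` is a hypothetical helper name. The expected size would be the "wanted" value from the failing task's log, 472193866 for this artifact.

```python
import json


def find_truncated_responses(entries, expected_size):
    """Return 200 responses whose logged size is below the expected size."""
    suspicious = []
    for entry in entries:
        request = entry.get("httpRequest") or {}
        payload = entry.get("jsonpayload_type_loadbalancerlogentry") or {}
        # Status and sizes are strings in the export; skip entries without them.
        try:
            status = int(request.get("status"))
            response_size = int(request.get("responseSize"))
        except (TypeError, ValueError):
            continue
        if status == 200 and response_size < expected_size:
            suspicious.append(
                {
                    "url": request.get("requestUrl"),
                    "responseSize": response_size,
                    "cacheHit": request.get("cacheHit"),
                    "statusdetails": payload.get("statusdetails"),
                    "timestamp": entry.get("timestamp"),
                }
            )
    return suspicious


def load_entries(path):
    # Assumes the file holds a single JSON array of log entries.
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Grouping the results by `statusdetails` and `cacheHit` should show whether every short response is a cache hit with `cache_lookup_failed_after_partial_response`, as in the entry above, or whether other cases are mixed in.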

Severity: -- → S3
Priority: -- → P2
Whiteboard: [stockwell disable-recommended] → [stockwell infra]