mozilla-esr115: Cached tasks were automatically rebuilt while there was no need to
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
People
(Reporter: jlorenzo, Assigned: jlorenzo)
References
Details
Attachments
(2 files)
(I'm filing this bug in Release Engineering for now, but I'm not sure whether the problem lies elsewhere.)
TL;DR: On Wednesday, April 9th 2024, one of the cron decision tasks decided to rebuild several cached tasks while there was no reason to. The root cause remains unclear; it may have been a network blip. More details below.
What happened on April 9th?
Just like every day at 19:00 UTC[1], the cron decision task bouncer-check ran. Like any other decision task, taskgraph:
- generated the full graph of tasks
- targeted only the task we needed (namely cron-bouncer-check-firefox[2])
- looked up whether we could optimize away some tasks cron-bouncer-check-firefox depends on
The problem happened in the 3rd step[3] (full logs attached):
[...]
[task 2024-04-09T19:05:33.749Z] Filter filter_target_tasks pruned 26754 tasks (1 remain)
[task 2024-04-09T19:05:33.749Z] writing artifact file `target-tasks.json`
[task 2024-04-09T19:05:33.749Z] Generating target task graph
[task 2024-04-09T19:05:33.770Z] Adding 0 tasks with `always_target` attribute
[task 2024-04-09T19:05:33.884Z] Generating optimized task graph
[task 2024-04-09T19:05:33.895Z] optimize: cron-bouncer-check-firefox kept because of never
[task 2024-04-09T19:05:33.895Z] optimize: docker-image-android-build kept because of index-search (index-search)
[task 2024-04-09T19:05:33.895Z] optimize: docker-image-condprof kept because of index-search (index-search)
[task 2024-04-09T19:05:33.895Z] optimize: docker-image-custom-car-linux kept because of index-search (index-search)
[task 2024-04-09T19:05:33.895Z] optimize: docker-image-custom-v8 kept because of index-search (index-search)
[task 2024-04-09T19:05:33.895Z] optimize: docker-image-deb11-toolchain-build kept because of index-search (index-search)
[...]
[task 2024-04-09T19:05:33.898Z] No tasks removed during optimization
[...]
What does index-search do?
index-search makes taskgraph look up a task by its Taskcluster index route and use it if it finds one. We leverage index-search to cache some tasks (like docker images). If the cached task doesn't exist yet, or if it has expired since then, taskgraph automatically schedules a new cached task. To determine whether a cached task exists, taskgraph generates a hash that represents the state of some in-tree files (e.g. the Dockerfile of a docker image). This hash is baked into the index route.
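To make the lookup concrete, here is a minimal sketch of such an index lookup against the Firefox CI index, assuming the standard Taskcluster index API path; the helper name is illustrative, this is not taskgraph's actual code:

import requests

# Assumed public API path of the Firefox CI index service (standard Taskcluster layout).
INDEX_ROOT = "https://firefox-ci-tc.services.mozilla.com/api/index/v1"

def find_cached_task(index_route):
    """Return the taskId indexed under index_route, or None on a 404."""
    response = requests.get(f"{INDEX_ROOT}/task/{index_route}")
    if response.status_code == 404:
        return None  # nothing cached under this route: schedule a new task
    response.raise_for_status()
    return response.json()["taskId"]

task_id = find_cached_task(
    "gecko.cache.level-3.docker-images.v2.deb11-toolchain-build.hash.e52634f9f22ef85b580d202af7ea30d32679272a03601e44289cfc23ca7de2ac"
)
print(f"cache hit: {task_id}" if task_id else "cache miss: rebuilding")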
Were cached tasks expired?
No. The last time we regenerated cached tasks on every tree (including esr115) was 3 months ago, in bug 1868558. These tasks still exist[4] and won't expire until January 2025; they had been considered valid for more than 3 months.
Was there a new hash in the index?
No. Whether the cached task was made 3 months ago or on April 9th, the two share the same hash. For instance, both docker-image-deb11-toolchain-build tasks[4][5] point to:
index.gecko.cache.level-3.docker-images.v2.deb11-toolchain-build.hash.e52634f9f22ef85b580d202af7ea30d32679272a03601e44289cfc23ca7de2ac
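For illustration, that route can be thought of as being built roughly like this. This is a simplified sketch: the real hashing in taskgraph covers more inputs than a single file, and the Dockerfile path below is hypothetical. The point is that unchanged inputs produce the same hash, hence the same index route, which is why the January task and an April 9th rebuild share the route above.

import hashlib
from pathlib import Path

def cache_hash(paths):
    # Hash the contents of the in-tree files that define the cached task.
    digest = hashlib.sha256()
    for path in sorted(paths):
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()

name = "deb11-toolchain-build"
# Hypothetical input file; the real task hashes its whole docker context.
h = cache_hash(["taskcluster/docker/deb11-toolchain-build/Dockerfile"])
print(f"gecko.cache.level-3.docker-images.v2.{name}.hash.{h}")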
Is it a bug in index-search?
As of this writing, the code is identical in central[6] and mozilla-esr115[7], and it hasn't changed in 2 years (since bug 1780278), so a recent regression there is unlikely. That said, index-search doesn't log why it couldn't find a cached task; it just silently makes a decision. Given that this code is stateful (it depends on what the index contains at execution time), I wonder if we should add more logging to help further debugging.
Moreover, it's unclear to me whether only 404s are silently dropped[8] or whether 5xx errors are too.
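As a sketch of the kind of logging I have in mind (illustrative names only, not the actual taskgraph code), the lookup could distinguish a genuine 404 miss from a transient server error instead of treating both as a cache miss:

import logging
import requests

logger = logging.getLogger(__name__)
INDEX_ROOT = "https://firefox-ci-tc.services.mozilla.com/api/index/v1"

def lookup_cached_task(index_route):
    response = requests.get(f"{INDEX_ROOT}/task/{index_route}")
    if response.status_code == 404:
        # A genuine miss: rebuilding the cached task is the right call.
        logger.info("index miss (404) for %s; rebuilding cached task", index_route)
        return None
    if response.status_code >= 500:
        # A transient server error is not a cache miss; rebuilding here would
        # silently throw away a perfectly valid cached task.
        logger.warning("index lookup for %s failed with HTTP %d",
                       index_route, response.status_code)
        response.raise_for_status()
    return response.json()["taskId"]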
Is it a regression in the Taskcluster index?
The last deployment of the Firefox CI instance was on April 4th, and the cron job runs twice a day, so a regression introduced by that deployment would likely have shown up in earlier runs. This scenario seems unlikely.
Is the network to blame?
I'm not sure, because I don't know what the network path between the decision task and the Taskcluster index looks like. Moreover, about 30 cached tasks were rescheduled. That would be ~30 failed requests in ~1ms:
[task 2024-04-09T19:05:33.884Z] Generating optimized task graph
[task 2024-04-09T19:05:33.895Z] optimize: cron-bouncer-check-firefox kept because of never
[task 2024-04-09T19:05:33.895Z] optimize: docker-image-android-build kept because of index-search (index-search)
[...]
[task 2024-04-09T19:05:33.896Z] optimize: docker-image-webrender kept because of index-search (index-search)
[task 2024-04-09T19:05:33.896Z] optimize: packages-deb11-cmake kept because of dependent tasks
[1] https://searchfox.org/mozilla-central/rev/98a54cbdc5964e778af0e6b325f9e7831f00c05b/.cron.yml#219
[2] https://firefox-ci-tc.services.mozilla.com/tasks/TBqzR0ESQWqsRDjS34EJpA
[3] https://firefox-ci-tc.services.mozilla.com/tasks/ej4mLxmeTnWGYkWv8wX2SA/runs/0/logs/public/logs/live.log#L3337
[4] e.g. https://firefox-ci-tc.services.mozilla.com/tasks/fymSrpZGTfWiUaTARM2IKg
[5] https://firefox-ci-tc.services.mozilla.com/tasks/GOfcQeHjTA2_xMOSH_OVSQ
[6] https://searchfox.org/mozilla-central/rev/98a54cbdc5964e778af0e6b325f9e7831f00c05b/third_party/python/taskcluster_taskgraph/taskgraph/optimize/strategies.py#11-46
[7] https://searchfox.org/mozilla-esr115/rev/af07f6d804845d72ed0dee7c3a64baed78921d6c/third_party/python/taskcluster_taskgraph/taskgraph/optimize/strategies.py#11-47
[8] https://searchfox.org/mozilla-central/rev/98a54cbdc5964e778af0e6b325f9e7831f00c05b/third_party/python/taskcluster_taskgraph/taskgraph/optimize/strategies.py#43
Assignee
Comment 1 • 10 months ago
:yarik, would you know who would be the best person to look at what happened with the network surrounding the Firefox CI index around 2024-04-09T19:05:33.895Z?
Assignee
Updated • 10 months ago
Comment 2 • 10 months ago
:jlorenzo I don't see anything extraordinary in the logs for that time frame. There were no network outages or failed API calls (that were logged).
I only see two calls to index.findTask that returned 404:
2024-04-09 21:04:53.346 /task/gecko.v2.mozilla-central.branch.ed20a8b98a8f2d8593dbbbd9b5df993e3e91433f.revision.ed20a8b98a8f2d8593dbbbd9b5df993e3e91433f.taskgraph.decision-nightly-android
2024-04-09 21:05:58.520 /task/gecko.v2.mozilla-central.branch.ed20a8b98a8f2d8593dbbbd9b5df993e3e91433f.revision.ed20a8b98a8f2d8593dbbbd9b5df993e3e91433f.taskgraph.decision-nightly-android
One thing I don't quite understand yet is that the index-search optimizer calls status_task(), but I can't find any calls to queue.status in the logs. Maybe they are being excluded from the logs for some reason, so there is no way to see whether any of those failed.
Update: indeed, those logs are being excluded.
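For reference, a rough sketch of that two-step check (illustrative only, not the actual taskgraph implementation): resolve the index route to a taskId, then check the task status before reusing it. A failure at either step ends up looking like a plain cache miss from the outside:

import requests

ROOT = "https://firefox-ci-tc.services.mozilla.com/api"

def resolve_and_check(index_route):
    # Step 1: resolve the index route to a taskId.
    found = requests.get(f"{ROOT}/index/v1/task/{index_route}")
    if found.status_code != 200:
        return None  # index miss (or error): the cached task gets rebuilt
    task_id = found.json()["taskId"]
    # Step 2: confirm the task is still in a usable state before reusing it.
    status = requests.get(f"{ROOT}/queue/v1/task/{task_id}/status")
    if status.status_code != 200:
        return None  # status lookup failed: indistinguishable from a miss
    state = status.json()["status"]["state"]
    return task_id if state in ("completed", "running", "pending") else None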
Assignee
Comment 3 • 10 months ago
Thanks for the analysis, :yarik! It looks like our best option at the moment is to log what happened in taskgraph. RelEng can take care of this part, then.
Comment 4 • 10 months ago
The severity field is not set for this bug.
:jlorenzo, could you have a look please?
For more information, please visit BugBot documentation.
Assignee
Updated • 10 months ago
Assignee
Comment 5 • 9 months ago
Status update: this could be related to bug 1833507, but I don't currently believe so, mainly because the artifacts were still there when I investigated 3 days later. I currently think our best shot is to log why taskgraph didn't find the cached task in the first place. Let's keep this bug open to add the logs, at least.
Assignee
Comment 6 • 9 months ago
Updated • 2 months ago