Open Bug 1891183 Opened 10 months ago Updated 2 months ago

mozilla-esr115: Cached tasks were automatically rebuilt even though there was no need to

Categories

(Release Engineering :: General, defect)


Tracking

(Not tracked)

People

(Reporter: jlorenzo, Assigned: jlorenzo)

References

Details

Attachments

(2 files)

(I'm filing this bug in Release Engineering for now, but I'm not sure whether the problem lies elsewhere.)

TL;DR: On Wednesday, April 9th 2024, one of the cron decision tasks rebuilt several cached tasks even though there was no reason to. The root cause remains unclear; it may have been a network blip. More details below.

What happened on April 9th?

As it does every day at 19:00 UTC[1], the cron decision task bouncer-check ran. Like in any other decision task, taskgraph:

  1. generated the full graph of tasks
  2. targeted only the task we needed (namely cron-bouncer-check-firefox[2])
  3. checked whether some of the tasks cron-bouncer-check-firefox depends on could be optimized away

The problem happened in the 3rd step[3] (full logs attached):

[...]
[task 2024-04-09T19:05:33.749Z] Filter filter_target_tasks pruned 26754 tasks (1 remain)
[task 2024-04-09T19:05:33.749Z] writing artifact file `target-tasks.json`
[task 2024-04-09T19:05:33.749Z] Generating target task graph
[task 2024-04-09T19:05:33.770Z] Adding 0 tasks with `always_target` attribute
[task 2024-04-09T19:05:33.884Z] Generating optimized task graph
[task 2024-04-09T19:05:33.895Z] optimize: cron-bouncer-check-firefox kept because of never
[task 2024-04-09T19:05:33.895Z] optimize: docker-image-android-build kept because of index-search (index-search)
[task 2024-04-09T19:05:33.895Z] optimize: docker-image-condprof kept because of index-search (index-search)
[task 2024-04-09T19:05:33.895Z] optimize: docker-image-custom-car-linux kept because of index-search (index-search)
[task 2024-04-09T19:05:33.895Z] optimize: docker-image-custom-v8 kept because of index-search (index-search)
[task 2024-04-09T19:05:33.895Z] optimize: docker-image-deb11-toolchain-build kept because of index-search (index-search)
[...]
[task 2024-04-09T19:05:33.898Z] No tasks removed during optimization
[...]

What does index-search do?

index-search makes taskgraph look up a task in the Taskcluster index and use it if one is found. We leverage index-search to cache some tasks (like docker images). If a cached task doesn't exist yet, or if it has expired since then, taskgraph automatically schedules a new one. To determine whether a cached task exists, taskgraph generates a hash that represents the state of some in-tree files (e.g. the Dockerfile of a docker image). This hash is baked into the index path.
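For illustration, here is a minimal sketch of that lookup, assuming the official taskcluster Python client. The index path layout mirrors the deb11-toolchain-build example quoted below; digest_of() and find_cached_task() are illustrative helpers, not taskgraph's actual code.

import hashlib

import taskcluster
from taskcluster.exceptions import TaskclusterRestFailure

ROOT_URL = "https://firefox-ci-tc.services.mozilla.com"


def digest_of(paths):
    """Hash the in-tree files that define the cached task (e.g. a Dockerfile).

    The real hash covers more inputs; this only shows the idea.
    """
    h = hashlib.sha256()
    for path in sorted(paths):
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()


def find_cached_task(image_name, paths, level="3"):
    """Return the taskId of a cached docker image, or None if it must be rebuilt."""
    index = taskcluster.Index({"rootUrl": ROOT_URL})
    index_path = (
        f"gecko.cache.level-{level}.docker-images.v2.{image_name}"
        f".hash.{digest_of(paths)}"
    )
    try:
        return index.findTask(index_path)["taskId"]
    except TaskclusterRestFailure:
        # No task under that index path (or the request failed):
        # taskgraph falls back to scheduling a new cached task.
        return None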

Were cached tasks expired?

No. The last time we regenerated cached tasks on every tree (including esr115) was 3 months ago, in bug 1868558. These tasks still exist[4] and won't expire until January 2025. They had been considered valid for more than 3 months.

Was there a new hash in the index?

No. Whether the cached task was created 3 months ago or on April 9th, the two share the same hash. For instance, both docker-image-deb11-toolchain-build tasks[4][5] point to:

index.gecko.cache.level-3.docker-images.v2.deb11-toolchain-build.hash.e52634f9f22ef85b580d202af7ea30d32679272a03601e44289cfc23ca7de2ac

Is it a bug in index-search?

As of this writing, the code is identical on central[6] and mozilla-esr115[7]. It hasn't changed in 2 years (since bug 1780278), so a recent regression there is unlikely. That said, index-search doesn't log why it couldn't find a cached task; it just silently makes a decision. Given that this code is stateful (it depends on what the index contains at the time of execution), I wonder if we should add more logging to help further debugging.

Moreover, it's unclear to me whether only 404s are silently dropped[8] or whether 5xx errors are too.
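To make that distinction visible, one option would be to stop treating every lookup failure like a plain cache miss. Here's a minimal sketch (not the current taskgraph code), assuming the taskcluster Python client, whose TaskclusterRestFailure exposes the HTTP status code; lookup_cached_task() is an illustrative name.

import logging

from taskcluster.exceptions import TaskclusterRestFailure

logger = logging.getLogger(__name__)


def lookup_cached_task(index, index_path):
    """Look up an index path; log why a cached task could not be reused.

    Returns the taskId, or None when the caller should rebuild the cached task.
    """
    try:
        return index.findTask(index_path)["taskId"]
    except TaskclusterRestFailure as e:
        if e.status_code == 404:
            logger.info("%s: no cached task in the index, rebuilding", index_path)
        else:
            # A 5xx (or anything else) is a service problem, not a cache miss;
            # surfacing it would make an incident like April 9th diagnosable.
            logger.warning(
                "%s: index lookup failed with HTTP %s, rebuilding",
                index_path,
                e.status_code,
            )
        return None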

Is it a regression in the Taskcluster index?

Unlikely. The last deployment of the Firefox CI instance was on April 4th, and the cron job runs twice a day; if that deployment had introduced a regression, the runs between April 4th and April 9th should have hit it too.

Is the network to blame?

I'm not sure, because I don't know what the network path between the decision task and the Taskcluster index looks like. Moreover, about 30 cached tasks were rescheduled, which would mean ~30 failed lookups within ~1 ms:

[task 2024-04-09T19:05:33.884Z] Generating optimized task graph
[task 2024-04-09T19:05:33.895Z] optimize: cron-bouncer-check-firefox kept because of never
[task 2024-04-09T19:05:33.895Z] optimize: docker-image-android-build kept because of index-search (index-search)
[...]
[task 2024-04-09T19:05:33.896Z] optimize: docker-image-webrender kept because of index-search (index-search)
[task 2024-04-09T19:05:33.896Z] optimize: packages-deb11-cmake kept because of dependent tasks

[1] https://searchfox.org/mozilla-central/rev/98a54cbdc5964e778af0e6b325f9e7831f00c05b/.cron.yml#219
[2] https://firefox-ci-tc.services.mozilla.com/tasks/TBqzR0ESQWqsRDjS34EJpA
[3] https://firefox-ci-tc.services.mozilla.com/tasks/ej4mLxmeTnWGYkWv8wX2SA/runs/0/logs/public/logs/live.log#L3337
[4] e.g. https://firefox-ci-tc.services.mozilla.com/tasks/fymSrpZGTfWiUaTARM2IKg
[5] https://firefox-ci-tc.services.mozilla.com/tasks/GOfcQeHjTA2_xMOSH_OVSQ
[6] https://searchfox.org/mozilla-central/rev/98a54cbdc5964e778af0e6b325f9e7831f00c05b/third_party/python/taskcluster_taskgraph/taskgraph/optimize/strategies.py#11-46
[7] https://searchfox.org/mozilla-esr115/rev/af07f6d804845d72ed0dee7c3a64baed78921d6c/third_party/python/taskcluster_taskgraph/taskgraph/optimize/strategies.py#11-47
[8] https://searchfox.org/mozilla-central/rev/98a54cbdc5964e778af0e6b325f9e7831f00c05b/third_party/python/taskcluster_taskgraph/taskgraph/optimize/strategies.py#43

:yarik, would you know who would be the best person to look at what happened with the network surrounding the Firefox CI index around 2024-04-09T19:05:33.895Z?

Flags: needinfo?(ykurmyza)
Attachment #9396385 - Attachment mime type: application/octet-stream → text/plain

:jlorenzo I don't see anything extraordinary in the logs for that time frame. There were no network outages or failed API calls (that were logged).

I only see two calls to index.findTask that returned 404:

2024-04-09 21:04:53.346 /task/gecko.v2.mozilla-central.branch.ed20a8b98a8f2d8593dbbbd9b5df993e3e91433f.revision.ed20a8b98a8f2d8593dbbbd9b5df993e3e91433f.taskgraph.decision-nightly-android
2024-04-09 21:05:58.520 /task/gecko.v2.mozilla-central.branch.ed20a8b98a8f2d8593dbbbd9b5df993e3e91433f.revision.ed20a8b98a8f2d8593dbbbd9b5df993e3e91433f.taskgraph.decision-nightly-android

One thing I don't quite understand yet is that the index-search optimizer calls status_task(), but I can't find any calls to queue.status in the logs. Maybe they are being excluded from logging for some reason, so there is no way to tell whether any of those failed.

Update: indeed, those logs are being excluded.

Flags: needinfo?(ykurmyza)

Thanks for the analysis, :yarik! It looks like our best option at the moment is to log what happened from the taskgraph side. RelEng can take care of that part, then.

The severity field is not set for this bug.
:jlorenzo, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(jlorenzo)
Severity: -- → S3
Flags: needinfo?(jlorenzo)

Status update: this could be related to bug 1833507, but I don't currently believe so, mainly because the artifacts were still there when I investigated 3 days later. I think our best shot is to log why taskgraph didn't find the cached task in the first place. Let's keep this bug open to add that logging at least.
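As a starting point for that logging, something along these lines could wrap the existing lookup so the decision task log records the index path and the concrete reason whenever a cached task is not reused. This is a sketch only; reuse_or_rebuild() is illustrative, not existing taskgraph API, and the index/queue helpers are passed in as stand-ins.

import logging
from datetime import datetime

logger = logging.getLogger(__name__)

_TC_FMT = "%Y-%m-%dT%H:%M:%S.%fZ"


def reuse_or_rebuild(label, index_path, find_task_id, status_task, deadline=None):
    """Decide whether a cached task can be reused, logging the reason either way.

    `find_task_id` and `status_task` stand in for the existing index/queue helpers.
    Returns the taskId to reuse, or None when the cached task must be rebuilt.
    """
    try:
        task_id = find_task_id(index_path)
    except KeyError:
        logger.info("%s: %s not found in the index, rebuilding", label, index_path)
        return None

    status = status_task(task_id)
    if status.get("state") in ("exception", "failed"):
        logger.info("%s: cached task %s is in state %s, rebuilding", label, task_id, status["state"])
        return None

    if deadline and datetime.strptime(status["expires"], _TC_FMT) < datetime.strptime(deadline, _TC_FMT):
        logger.info("%s: cached task %s expires before the deadline, rebuilding", label, task_id)
        return None

    logger.info("%s: reusing cached task %s", label, task_id)
    return task_id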

See Also: → 1833507
See Also: → 1860081
QA Contact: jlorenzo
