Race condition between decision tasks can have unexpected side effects
Categories
(Firefox Build System :: Task Configuration, task)
Tracking
(Not tracked)
People
(Reporter: glandium, Unassigned)
Details
When two decision tasks run concurrently because one started shortly after the other, they can interfere with each other wrt cached tasks.
Concrete example:
https://treeherder.mozilla.org/jobs?repo=autoland&revision=cf6d1e146636b679bfbb016a0fd3b62b5c33cb4e
This push changed a script that affects tasks such as toolchain-linux64-x64-compiler-rt-15, yet that task wasn't triggered on this push. The toolchain-linux64-clang-15-profile task that depends on it was triggered, though. And, this is the most interesting part, if you look at that toolchain-linux64-clang-15-profile task, you'll see that it depends on a fresh new toolchain-linux64-x64-compiler-rt-15 task... that was triggered on https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=eYV-Z2arSsKTW3PAeyxxlA.0&revision=fa64d4d68a0c9fa6b9232d1c1a577bc6d0dc0854, the very next push...
This had no practical effect in this case, but it could have subtle consequences.
Comment 1•2 years ago
From the first decision task (on cf6d1e146636b679bfbb016a0fd3b62b5c33cb4e):
[task 2023-02-23T23:00:08.628Z] https://firefox-ci-tc.services.mozilla.com:443 "GET /api/index/v1/task/gecko.cache.level-3.toolchains.v3.linux64-clang-15-profile.hash.51dc9b16c81565a0d765a0da261fe80fcc8073b92949111977f3436487d0f3e2 HTTP/1.1" 404 476
[task 2023-02-23T23:00:13.329Z] https://firefox-ci-tc.services.mozilla.com:443 "GET /api/index/v1/task/gecko.cache.level-3.toolchains.v3.linux64-x86-compiler-rt-15.hash.0b7d7aeffdc5ccd2e5db9e4606b1223145784352b9ad3f7baeedbec8ac933f60 HTTP/1.1" 200 257
From the second one (on fa64d4d68a0c9fa6b9232d1c1a577bc6d0dc0854):
[task 2023-02-23T22:57:12.734Z] https://firefox-ci-tc.services.mozilla.com:443 "GET /api/index/v1/task/gecko.cache.level-3.toolchains.v3.linux64-x64-compiler-rt-15.hash.dcc44a1eb6178ce3c4c08c400b629a9fde01cec8756dcff6d345211ff69e1ec6 HTTP/1.1" 404 478
[task 2023-02-23T22:57:22.543Z] Creating task with taskId C90XbUnXQHOFeLQxRQd89w for eager-index-toolchain-linux64-x64-compiler-rt-15
[task 2023-02-23T22:57:22.773Z] http://taskcluster:80 "PUT /queue/v1/task/C90XbUnXQHOFeLQxRQd89w HTTP/1.1" 200 420
From that eager-index task:
[taskcluster 2023-02-23 22:59:53.967Z] === Task Starting ===
Inserting eYV-Z2arSsKTW3PAeyxxlA into index (rank 0) under: gecko.cache.level-3.toolchains.v3.linux64-x64-compiler-rt-15.hash.dcc44a1eb6178ce3c4c08c400b629a9fde01cec8756dcff6d345211ff69e1ec6
indexing successfully completed.
[taskcluster 2023-02-23 22:59:56.865Z] === Task Finished ===
So the index was indeed updated with the new toolchain-linux64-x64-compiler-rt-15 task before the decision task for the first push got around to checking the index.
I think this is all working as expected. What bad side effects are you worried about? What would you have it do instead?
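To make the ordering in the logs above easier to follow, here is a minimal sketch that replays the three events against a single shared index. It is a toy model, not taskgraph code: `replay`, `KEY`, and the placeholder task ID for D1 are hypothetical, the key is abbreviated, timestamps are rounded, and D1's lookup time for this key is assumed to be close to the 23:00:13 request shown in its log.

```python
from typing import Dict, List, Optional, Tuple

# Abbreviated, illustrative stand-in for the full index path in the logs.
KEY = "toolchains.v3.linux64-x64-compiler-rt-15.hash.dcc44a1e"


def replay(events: List[Tuple[str, str, str]]) -> Dict[str, Tuple[str, bool]]:
    """Replay (timestamp, actor, task_id) events against one shared index.

    Returns, per decision task, the task ID it ends up using and whether
    it had to schedule that task itself (True on an index miss).
    """
    index: Dict[str, str] = {}
    outcome: Dict[str, Tuple[str, bool]] = {}
    for _timestamp, actor, task_id in sorted(events):
        if actor == "eager-index":
            # The eager-index task publishes the new task under the cache key.
            index[KEY] = task_id
            continue
        cached: Optional[str] = index.get(KEY)
        if cached is None:
            # Index miss (the 404 in the logs): schedule a fresh task.
            outcome[actor] = (task_id, True)
        else:
            # Index hit (a 200): reuse whatever task is already indexed.
            outcome[actor] = (cached, False)
    return outcome


events = [
    # D1 belongs to the earlier push but checks the index last (assumed time).
    ("23:00:13", "D1", "hypothetical-task-from-D1"),
    ("22:57:12", "D2", "eYV-Z2arSsKTW3PAeyxxlA"),
    ("22:59:53", "eager-index", "eYV-Z2arSsKTW3PAeyxxlA"),
]
result = replay(events)
print(result)
```

In this model D2 misses and schedules the task, the eager-index task publishes it, and D1 then reuses the task created for the later push, which matches what the reporter observed.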
Comment 2 (Reporter)•2 years ago
Imagine the case where building toolchain-linux64-clang-15 from the first push with the toolchain-linux64-x64-compiler-rt-15 from the second push produces a broken toolchain. Things would end up busted on the first push with no obvious explanation, and would be magically fixed on the next push, which would probably only lead to the failures being marked as "fixed by" or something, with a shrug. Which I guess is a fine-ish outcome, but still, that kind of rubs me the wrong way. I hope it's not possible to end up with the second push not producing its own toolchain-linux64-clang-15 (I think that's not possible but don't quote me on that).
Comment 3 (Reporter)•2 years ago
(Yes, in theory, if there's a substantial difference that could lead to bustage, toolchain-linux64-x64-compiler-rt-15 shouldn't have the same index between the two pushes, but indices are not perfect).
Comment 4•2 years ago
> I hope it's not possible to end up with the second push not producing its own toolchain-linux64-clang-15 (I think that's not possible but don't quote me on that).
Yeah I don't think that's possible. If the decision task from the first push (D1) doesn't schedule toolchain-linux64-x64-compiler-rt-15, it means the eager-index task from the second push has already run, meaning the corresponding decision task (D2) is completed, before D1 starts scheduling anything. That means D2 would also schedule toolchain-linux64-clang-15 (assuming the relevant digest changes in either of those pushes?).
Ultimately we do rely on indices for this stuff, so if the digest doesn't change when it ought to then all bets are off...
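As a rough illustration of why everything hinges on the digest, the sketch below derives an index path shaped like the ones in the logs from a list of inputs. The helper `cache_index_path`, the input strings, and the choice of SHA-256 over newline-joined inputs are assumptions made for illustration; the actual digest computation in the build system may differ.

```python
import hashlib
from typing import List


def cache_index_path(name: str, digest_data: List[str], level: int = 3) -> str:
    """Build a cache index path from the inputs that feed the digest.

    Hypothetical helper: only the overall path shape is taken from the logs.
    """
    digest = hashlib.sha256("\n".join(digest_data).encode("utf-8")).hexdigest()
    return f"gecko.cache.level-{level}.toolchains.v3.{name}.hash.{digest}"


NAME = "linux64-x64-compiler-rt-15"

# Inputs that are tracked by the digest (placeholder values).
tracked = ["hypothetical-script-hash-A", "hypothetical-source-rev-1"]
path_push1 = cache_index_path(NAME, tracked)

# Something relevant changes between pushes but is not part of digest_data,
# so the second push computes exactly the same path and shares the cache entry.
path_push2 = cache_index_path(NAME, tracked)

# A change to a tracked input produces a different path, hence a fresh task.
path_tracked_change = cache_index_path(
    NAME, ["hypothetical-script-hash-B", "hypothetical-source-rev-1"]
)

print(path_push1 == path_push2)
print(path_push1 == path_tracked_change)
```

Under these assumptions, two pushes only get distinct cached tasks when a tracked input differs, which is why a digest that misses a relevant input lets one push silently pick up the other's task.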