Closed Bug 1501113 Opened 6 years ago Closed 6 years ago

race condition: concurrent partials generation with a shared cache

Categories

(Release Engineering :: Release Automation: Updates, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED

Tracking Status:
firefox-esr60 --- fixed
firefox64 --- fixed
firefox65 --- fixed

People

(Reporter: mozilla, Assigned: sfraser)

Details

Attachments

(2 files)

This caused update verify (UV) chunk 1/12 to fail on both linux64 and win64 in Fx 64.0b3. To fix, we reran the partials tasks and everything downstream of them; otherwise it would have required a build 2.
Attached file rerun list
This is the list of tasks I had to rerun. I accidentally reran linux64 be partials, so I added those.
We could serialize this by awaiting these tasks [1] one at a time rather than appending them to an asyncio.gather() call (a minimal sketch follows the task lists below).
Alternatively, we could make the cache unshared, or only ever update it with complete files (are we hitting mid-download files, mid-generation files, or something else?).

The two UV tasks that showed the bustage are

+ update-verify win64 1/12 YETuIdBuQpuvb1lXUCbHxg
+ update-verify linux64 1/12 EtMs1b3vT1i7hhZ6kbtsgA

The two partials tasks that created the bustage are

+ win64 be partials C9mYTX4kTWe5lX8y8-NnLw
+ linux64 ar partials EkOgMi-NSXSHWjDqCV65Jg

[1] https://searchfox.org/mozilla-central/source/taskcluster/docker/funsize-update-generator/scripts/funsize.py#428
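
For illustration, a minimal asyncio sketch of the two approaches mentioned above — the current gather-style concurrency versus awaiting each partial in turn. The generate_partial coroutine and the tasks list are hypothetical stand-ins, not the actual funsize.py API:

    import asyncio

    async def generate_partial(task):
        """Hypothetical stand-in for one partial-MAR generation coroutine."""
        ...

    async def run_concurrently(tasks):
        # Roughly the current behaviour: every partial in the task runs at once,
        # so two generations can race on the shared local cache directory.
        await asyncio.gather(*(generate_partial(t) for t in tasks))

    async def run_serially(tasks):
        # The serialized alternative: await each partial in turn, trading
        # wall-clock time for never having two generations touch the cache at once.
        for t in tasks:
            await generate_partial(t)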
Reruns worked OK, except that Balrog bug 1501167 required a manual fix and bug 1501131 was needed to purge the CDNs.
(In reply to Aki Sasaki [:aki] from comment #2)
> We could serialize this by awaiting these tasks [1] one at a time rather
> than appending them to an asyncio.gather() call.
> Alternatively, we could make the cache unshared, or only ever update it with
> complete files (are we hitting mid-download files, mid-generation files, or
> something else?).
> 

Do we know which file inside the mar was created using an incorrect cached file? I didn't think S3 had race conditions like that, but if it does we should definitely reorganise. 

I'm not clear at this point how the wrong size ended up in balrog, as funsize just uses os.path.getsize() on the finished file, and puts that in the manifest. I've checked the logs for all partials tasks in that group for 34192776 and none of them logged it, at least on the first runs.
I don't think it's an s3 race condition; it looks like a working directory race condition in the docker container.
`be` appears to be diffing against msvcp140.dll at the same time as bg is patching it?

$ grep msvcp140.dll bg.log
...
2018-10-22 18:59:53,478 - DEBUG - target-63.0b12.partial.mar: diffing "msvcp140.dll"
2018-10-22 18:59:53,478 - DEBUG - target-63.0b12.partial.mar: '/tmp/tmp0ibcyk7k/to.work/msvcp140.dll.patch.xz' -> '/tmp/fs-cache/830840918d9c262c4763b1a4364554333bf790c05f7baec994bb38d5210d203001654f1a50a0ae9b54e8fe6bee9b920a5f76b428ec36c3a126ce83d3999e50ec/2bb5ada10764e4ea527e1f8e706ceba8f3fd25704f494e1b900b8c9a24a954a1ee069ebcff8899d0e1bec92af2025e0a58b4b7745f72b1313ba27b93b26be5e0'
2018-10-22 18:59:53,478 - DEBUG - target-63.0b12.partial.mar: /tmp/tmp0ibcyk7k/to.work/msvcp140.dll.patch.xz saved on local cache.
2018-10-22 18:59:53,478 - DEBUG - target-63.0b12.partial.mar:       patch "msvcp140.dll.patch" "msvcp140.dll"
2018-10-22 18:59:54,788 - DEBUG - target-63.0b13.partial.mar: diffing "msvcp140.dll"
2018-10-22 18:59:54,788 - DEBUG - target-63.0b13.partial.mar: '/tmp/fs-cache/830840918d9c262c4763b1a4364554333bf790c05f7baec994bb38d5210d203001654f1a50a0ae9b54e8fe6bee9b920a5f76b428ec36c3a126ce83d3999e50ec/2bb5ada10764e4ea527e1f8e706ceba8f3fd25704f494e1b900b8c9a24a954a1ee069ebcff8899d0e1bec92af2025e0a58b4b7745f72b1313ba27b93b26be5e0' -> '/tmp/tmp6sngxofn/to.work/msvcp140.dll.patch.xz'
2018-10-22 18:59:54,788 - DEBUG - target-63.0b13.partial.mar: Successful retrieved /tmp/tmp6sngxofn/to.work/msvcp140.dll.patch.xz from local cache.
2018-10-22 18:59:54,788 - DEBUG - target-63.0b13.partial.mar: file "msvcp140.dll" found in funsize, diffing skipped
2018-10-22 18:59:54,788 - DEBUG - target-63.0b13.partial.mar:       patch "msvcp140.dll.patch" "msvcp140.dll"
2018-10-22 19:00:16,904 - DEBUG - target-63.0b14.partial.mar: diffing "msvcp140.dll"
2018-10-22 19:00:16,904 - DEBUG - target-63.0b14.partial.mar: '/tmp/fs-cache/830840918d9c262c4763b1a4364554333bf790c05f7baec994bb38d5210d203001654f1a50a0ae9b54e8fe6bee9b920a5f76b428ec36c3a126ce83d3999e50ec/2bb5ada10764e4ea527e1f8e706ceba8f3fd25704f494e1b900b8c9a24a954a1ee069ebcff8899d0e1bec92af2025e0a58b4b7745f72b1313ba27b93b26be5e0' -> '/tmp/tmpgcwzn26c/to.work/msvcp140.dll.patch.xz'
2018-10-22 19:00:16,904 - DEBUG - target-63.0b14.partial.mar: Successful retrieved /tmp/tmpgcwzn26c/to.work/msvcp140.dll.patch.xz from local cache.
2018-10-22 19:00:16,904 - DEBUG - target-63.0b14.partial.mar: file "msvcp140.dll" found in funsize, diffing skipped
2018-10-22 19:00:16,904 - DEBUG - target-63.0b14.partial.mar:       patch "msvcp140.dll.patch" "msvcp140.dll"

$ grep msvcp140.dll be.log
...
2018-10-22 19:00:04,296 - DEBUG - target-63.0b14.partial.mar: diffing "msvcp140.dll"
2018-10-22 19:00:04,296 - DEBUG - target-63.0b14.partial.mar: '/tmp/fs-cache/830840918d9c262c4763b1a4364554333bf790c05f7baec994bb38d5210d203001654f1a50a0ae9b54e8fe6bee9b920a5f76b428ec36c3a126ce83d3999e50ec/2bb5ada10764e4ea527e1f8e706ceba8f3fd25704f494e1b900b8c9a24a954a1ee069ebcff8899d0e1bec92af2025e0a58b4b7745f72b1313ba27b93b26be5e0' -> '/tmp/tmpv5lxp3fz/to.work/msvcp140.dll.patch.xz'
2018-10-22 19:00:04,296 - DEBUG - target-63.0b14.partial.mar: Successful retrieved /tmp/tmpv5lxp3fz/to.work/msvcp140.dll.patch.xz from local cache.
2018-10-22 19:00:04,296 - DEBUG - target-63.0b14.partial.mar: file "msvcp140.dll" found in funsize, diffing skipped
2018-10-22 19:00:04,296 - DEBUG - target-63.0b14.partial.mar:       patch "msvcp140.dll.patch" "msvcp140.dll"
2018-10-22 19:00:04,429 - DEBUG - target-63.0b12.partial.mar: diffing "msvcp140.dll"
2018-10-22 19:00:04,429 - DEBUG - target-63.0b12.partial.mar: '/tmp/tmp51f3_u08/to.work/msvcp140.dll.patch.xz' -> '/tmp/fs-cache/830840918d9c262c4763b1a4364554333bf790c05f7baec994bb38d5210d203001654f1a50a0ae9b54e8fe6bee9b920a5f76b428ec36c3a126ce83d3999e50ec/2bb5ada10764e4ea527e1f8e706ceba8f3fd25704f494e1b900b8c9a24a954a1ee069ebcff8899d0e1bec92af2025e0a58b4b7745f72b1313ba27b93b26be5e0'
2018-10-22 19:00:04,429 - DEBUG - target-63.0b12.partial.mar: /tmp/tmp51f3_u08/to.work/msvcp140.dll.patch.xz saved on local cache.
2018-10-22 19:00:04,429 - DEBUG - target-63.0b12.partial.mar:       patch "msvcp140.dll.patch" "msvcp140.dll"
2018-10-22 19:00:04,564 - DEBUG - target-63.0b13.partial.mar: diffing "msvcp140.dll"
2018-10-22 19:00:04,564 - DEBUG - target-63.0b13.partial.mar: '/tmp/tmpxqgi7nda/to.work/msvcp140.dll.patch.xz' -> '/tmp/fs-cache/830840918d9c262c4763b1a4364554333bf790c05f7baec994bb38d5210d203001654f1a50a0ae9b54e8fe6bee9b920a5f76b428ec36c3a126ce83d3999e50ec/2bb5ada10764e4ea527e1f8e706ceba8f3fd25704f494e1b900b8c9a24a954a1ee069ebcff8899d0e1bec92af2025e0a58b4b7745f72b1313ba27b93b26be5e0'
2018-10-22 19:00:04,564 - DEBUG - target-63.0b13.partial.mar: /tmp/tmpxqgi7nda/to.work/msvcp140.dll.patch.xz saved on local cache.
2018-10-22 19:00:04,564 - DEBUG - target-63.0b13.partial.mar:       patch "msvcp140.dll.patch" "msvcp140.dll"
More context, since I wasn't the one that found this:

<•nthomas> This is what I get trying to unpack win64 be partial manually https://irccloud.mozilla.com/pastebin/PU37Q6FV/
15:01 <•nthomas> msvcp140.dll definitely changes between 63.0b14 and 64.0b3, fwiw
15:03 <•aki> i looked at the be partial task for win64 -- the partial task reruns, but the sha for the partial artifact matches between run 1 and the signing task's cot log
15:04 <•nthomas> yeah, the first run appeared to abort early enough that it was still downloading and hadn't generated yet
15:09 <•nthomas> do we generate the partials in parallel ?
15:10 <•aki> the partials tasks all queue and are claimed in parallel, if that's what you mean
15:10 <•nthomas> actually meant within each task
15:11 <•nthomas> this is from grepping the partial generation logs https://irccloud.mozilla.com/pastebin/dCuCU2Au/
15:11 <•nthomas> bg does what I'd blindly expect, diff and cache the first partial, then use the cache for the subsequent two
15:11 <•nthomas> be finds something in the cache the first time, then generates diffs the subsequent two
15:12 <•aki> looks like yes, we await multiple concurrent partial generation coroutines in taskcluster/docker/funsize-update-generator/scripts/funsize.py
15:13 <•nthomas> and all the log timestamps are very similar in be
15:13 <•nthomas> raaaaaaaaaaaaace
(In reply to Aki Sasaki [:aki] from comment #5)
> I don't think it's an s3 race condition; it looks like a working directory
> race condition in the docker container.
> `be` appears to be diffing against msvcp140.dll at the same time as bg is
> patching it?

If they're separate tasks they should be on separate instances, sharing only the S3 cache. There is definitely a race within an instance, though. A quick fix would be to turn off the local cache, so we can try that; the S3 cache will still be used anyway.
Assignee: nobody → sfraser
Yeah, aiui this is a single funsize.py run in a single container, with multiple partials run concurrently via asyncio.
I agree. We're not using the S3 caches for releases so it's a local cache at /tmp/fs-cache/. It's hard to trust the timestamps for be in comment #5 as something was found in the cache before it was written, but the two writes are only 140ms apart so it seems very likely there was a write collision.
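
As an aside on the "only update it with complete files" idea from comment #2: a hedged sketch (hypothetical helper, not the real funsize.py code) of publishing cache entries atomically, so a concurrent reader can never pick up a half-written file. It would not, by itself, stop two coroutines from both deciding to generate the same diff, but it removes the torn-read half of the race:

    import os
    import shutil
    import tempfile

    def save_to_cache(src_path, cache_path):
        """Copy src_path into the cache so readers only ever see complete entries.

        Write to a temporary file in the same directory first, then publish it
        with os.replace(), which is an atomic rename on POSIX filesystems.
        """
        cache_dir = os.path.dirname(cache_path)
        os.makedirs(cache_dir, exist_ok=True)
        fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
        try:
            with os.fdopen(fd, "wb") as tmp, open(src_path, "rb") as src:
                shutil.copyfileobj(src, tmp)
            os.replace(tmp_path, cache_path)  # readers see the old entry or the new one, never a partial write
        except BaseException:
            os.unlink(tmp_path)
            raise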
Pushed by sfraser@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/eaf542a7bdb1
Disable local partials caching to work around races r=aki
https://hg.mozilla.org/mozilla-central/rev/eaf542a7bdb1
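
The landed change simply stops using the local cache during partials generation; purely as an illustration (the environment variable and helper below are hypothetical — see the revision above for the actual patch), such behaviour could be gated like this:

    import os

    # Hypothetical guard: only consult the local cache when explicitly enabled,
    # otherwise always regenerate, which is effectively what the landed fix does.
    LOCAL_CACHE_ENABLED = os.environ.get("FUNSIZE_ALLOW_LOCAL_CACHE") == "1"

    def get_or_generate(cache_lookup, generate):
        """Return a cached artifact only when local caching is enabled."""
        if LOCAL_CACHE_ENABLED:
            cached = cache_lookup()
            if cached is not None:
                return cached
        return generate()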
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Whiteboard: [checkin-needed-beta]
Whiteboard: [checkin-needed-beta]