Closed
Bug 1248299
Opened 9 years ago
Closed 9 years ago
Frequent antivirus job failures with IncompleteRead after several retries
Categories
(Release Engineering :: Release Automation: Other, defect)
Release Engineering
Release Automation: Other
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: nthomas, Unassigned)
References
Details
Eg: for 45.0b5 we had errors in three separate runs
09:29:14 INFO - Downloading pub/firefox/candidates/45.0b5-candidates/build1/update/mac/cy/firefox-45.0b5.complete.mar to cache/update/mac/cy/firefox-45.0b5.complete.mar
09:29:14 INFO - retry: Calling <bound method Key.get_contents_to_filename of <Key: net-mozaws-prod-delivery-firefox,pub/firefox/candidates/45.0b5-candidates/build1/update/mac/cy/firefox-45.0b5.complete.mar>> with args: (u'cache/update/mac/cy/firefox-45.0b5.complete.mar',), kwargs: {}, attempt #1
09:32:37 INFO - ... attempt #2
09:33:16 INFO - ... attempt #3
09:33:32 INFO - ... attempt #4
09:33:56 INFO - ... attempt #5
09:33:56 INFO - retry: Giving up ...
10:23:28 INFO - Downloading pub/firefox/candidates/45.0b5-candidates/build1/update/win32/ml/firefox-45.0b5.complete.mar to cache/update/win32/ml/firefox-45.0b5.complete.mar
10:23:28 INFO - retry: Calling <bound method Key.get_contents_to_filename of <Key: net-mozaws-prod-delivery-firefox,pub/firefox/candidates/45.0b5-candidates/build1/update/win32/ml/firefox-45.0b5.complete.mar>> with args: (u'cache/update/win32/ml/firefox-45.0b5.complete.mar',), kwargs: {}, attempt #1
10:26:50 INFO - .... attempt #2
10:28:40 INFO - ... attempt #3
10:28:54 INFO - ... attempt #4
10:29:15 INFO - ... attempt #5
10:29:15 INFO - retry: Giving up ...
10:51:22 INFO - Downloading pub/firefox/candidates/45.0b5-candidates/build1/win32/fr/Firefox Setup 45.0b5.exe to cache/win32/fr/Firefox Setup 45.0b5.exe
10:51:22 INFO - retry: Calling <bound method Key.get_contents_to_filename of <Key: net-mozaws-prod-delivery-firefox,pub/firefox/candidates/45.0b5-candidates/build1/win32/fr/Firefox Setup 45.0b5.exe>> with args: (u'cache/win32/fr/Firefox Setup 45.0b5.exe',), kwargs: {}, attempt #1
10:52:47 INFO - ... attempt #2
10:52:54 INFO - ... attempt #3
10:53:03 INFO - ... attempt #4
10:53:16 INFO - ... attempt #5
10:53:16 INFO - retry: Giving up ...
We've also had issues on other releases, although usually one retry is sufficient. We probably have to retry harder.
Reporter | ||
Comment 1•9 years ago
|
||
The explicit parameters on the retry call are at
http://hg.mozilla.org/mozilla-central/file/default/testing/mozharness/scripts/release/antivirus.py#l156
namely
sleeptime=30, max_sleeptime=150
on the call, and defaults are
attempts=5, sleepscale=1.5, jitter=1
I was just going to increase the number of attempts, but ....
The retry code is in http://hg.mozilla.org/mozilla-central/file/default/python/redo/redo/__init__.py, so we should be waiting sleeptime, sleeptime * sleep_scale, sleeptime * sleep_scale**2, etc, with plus/minus 1 second jitter on each iteration. So roughly speaking 30s, 45s, etc. This is definitely not happening between attempts 3, 4, and 5.
Ben, are you aware of any problems using redo with multiprocessing ? It might be that as we finish other downloads that free workers are picking retreies, maybe *waves hands wildly*.
Flags: needinfo?(bhearsum)
Comment 2•9 years ago
|
||
(In reply to Nick Thomas [:nthomas] from comment #1)
> The explicit parameters on the retry call are at
>
> http://hg.mozilla.org/mozilla-central/file/default/testing/mozharness/
> scripts/release/antivirus.py#l156
> namely
> sleeptime=30, max_sleeptime=150
> on the call, and defaults are
> attempts=5, sleepscale=1.5, jitter=1
>
> I was just going to increase the number of attempts, but ....
>
> The retry code is in
> http://hg.mozilla.org/mozilla-central/file/default/python/redo/redo/__init__.
> py, so we should be waiting sleeptime, sleeptime * sleep_scale, sleeptime *
> sleep_scale**2, etc, with plus/minus 1 second jitter on each iteration. So
> roughly speaking 30s, 45s, etc. This is definitely not happening between
> attempts 3, 4, and 5.
>
> Ben, are you aware of any problems using redo with multiprocessing ? It
> might be that as we finish other downloads that free workers are picking
> retreies, maybe *waves hands wildly*.
I'm not aware of any issues, but that doesn't mean much! It looks like to me like sleeptime is all over the place...we've got:
1->2: a few minutes
2->3: 39s
3->4: 16s
4->5: 24s
1->2: a few minutes
2->3: 110s
3->4: 14s
4->5: 21s
1->2: 85s
2->3: 7s
3->4: 9s
4->5: 13s
So sleeptime is all over the place, sometimes lower than the original, and sometimes lower than the previous attempt. That's definitely funky. I also noticed that despite other debug messages being logged, the "sleeping for...." ones are not. This happens after retrier yields, which makes me wonder if there's something funky about doing work in a function after yielding, maybe only when using multiprocessing?
Flags: needinfo?(bhearsum)
Comment 3•9 years ago
|
||
I just realized that a ThreadPool is being used here ("multiprocessing" made me think many processes), which makes me wonder if there's a GIL related issue where waiting threads are unable to do anything while another one downloads. That might explain having many retries, not sure if it explains the funky logging or sleep times. I tried to reproduce in a smaller test case and couldn't, but I'm not sure if time.sleep() is equivalent to waiting on a read(), or if I missed other key things in the real case: https://gist.github.com/bhearsum/690ec4c880db2b45bfe2
Comment 5•9 years ago
|
||
so I think part of the confusion is that m-c antivirus retry timeout was bumped by me but I failed to uplift it immediately to beta:
http://hg.mozilla.org/mozilla-central/file/default/testing/mozharness/scripts/release/antivirus.py#l156
http://hg.mozilla.org/releases/mozilla-beta/file/default/testing/mozharness/scripts/release/antivirus.py#l158
does your findings line up with what is on beta right now?
Comment 6•9 years ago
|
||
Fresh data on this for 38.7.0esr on three successive runs so far:
1st run (via http://buildbot-master77.bb.releng.use1.mozilla.com:8001/builders/release-mozilla-esr38-firefox_antivirus/builds/4/steps/run_script/logs/stdio)
21:55:14 INFO - Downloading pub/firefox/candidates/38.7.0esr-candidates/build1/update/linux-x86_64/ca/firefox-38.3.0esr-38.7.0esr.partial.mar to cache/update/linux-x86_64/ca/firefox-38.3.0esr-38.7.0esr.partial.mar
21:57:05 INFO - .... attempt #2
21:57:50 INFO - ..... attempt #3
21:58:58 INFO - ..... attempt #4
22:00:40 INFO - ..... attempt #5
22:00:40 INFO - retry: Giving up ....
2nd run (via http://buildbot-master77.bb.releng.use1.mozilla.com:8001/builders/release-mozilla-esr38-firefox_antivirus/builds/5/steps/run_script/logs/stdio):
23:00:53 INFO - Downloading pub/firefox/candidates/38.7.0esr-candidates/build1/update/win32/da/firefox-38.6.1esr-38.7.0esr.partial.mar to cache/update/win32/da/firefox-38.6.1esr-38.7.0esr.partial.mar
23:02:43 INFO - .... attempt #2
23:03:28 INFO - .... attempt #3
23:04:36 INFO - .... attempt #4
23:06:16 INFO - .... attempt #5
23:06:16 INFO - retry: Giving up ...
(in the same file also)
23:01:47 INFO - Downloading pub/firefox/candidates/38.7.0esr-candidates/build1/update/win32/gl/firefox-38.7.0esr.complete.mar to cache/update/win32/gl/firefox-38.7.0esr.complete.mar
23:03:37 INFO - .... attempt #2
23:04:22 INFO - .... attempt #3
23:05:29 INFO - .... attempt #4
3rd run (via http://buildbot-master77.bb.releng.use1.mozilla.com:8001/builders/release-mozilla-esr38-firefox_antivirus/builds/6/steps/run_script/logs/stdio):
23:36:04 INFO - Downloading pub/firefox/candidates/38.7.0esr-candidates/build1/update/linux-x86_64/km/firefox-38.7.0esr.complete.mar to cache/update/linux-x86_64/km/firefox-38.7.0esr.complete.mar
23:37:56 INFO - .... attempt #2
23:38:42 INFO - .... attempt #3
23:39:49 INFO - .... attempt #4
23:41:31 INFO - .... attempt #5
23:41:31 INFO - retry: Giving up ...
This proves two things:
1) jlund's patch uplifted to beta and all the way to esr38 is working smooth. There is a gap of at least 30/45 seconds between retries so I assume that earlier in the thread the problem was just an async between the m-c -> m-b uplift and the time of the builds as jlund suggested.
2) on the other hand - should we retry even harder on this? ... usually one retrigger suffices but in this particular case I'm retriggering for the forth time.
Reporter | ||
Comment 7•9 years ago
|
||
The times still seem a bit off (not increasing by ~50% each time) but I'm not sure how to make any progress here. When I was setting this up there was definitely a speedup when using a threadpool over sequential downloads, so the GIL (comment #3) can't be hard blocking.
Assignee: nthomas → nobody
Reporter | ||
Comment 8•9 years ago
|
||
Going to WFM this on the basis that we haven't had issues recently, and release promotion has reimplemented this.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
You need to log in
before you can comment on or make changes to this bug.
Description
•