Closed Bug 1306860 Opened 9 years ago Closed 8 years ago

Intermittent-infra l10n nightly command timed out: 10800 seconds elapsed running ['c:/mozilla-build/python27/python', '-u', 'scripts/scripts/desktop_l10n.py', '--environment-config', 'single_locale/production.py', '--branch-config', 'single_locale/mozilla

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: bulk-close-intermittents, intermittent-failure)

Attachments

(1 file)

Ok, so we have two problems here: (1) - b-2008 spots don't actually kill buildbot when they time out. Theres a bug on file, and its reasonably important, n-i to markco to comment here what bug that is. (2) - We have a *lot* of retries to balrog in this log... That probably contributes a LOT to the potential timeout here... (some of the log excerpt follows) Ben is there any insight into how we can shore up issue #2 here a bit (either on client side, or server side, to avoid the excess time-spent) 06:17:11 INFO - "GET /api/releases/Firefox-mozilla-aurora-nightly-latest/builds/WINNT_x86-msvc/fa HTTP/1.1" 200 1838 06:17:11 INFO - REQUEST STATS: {"url": "https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest/builds/WINNT_x86-msvc/fa", "timestamp": 1475327831.498, "method": "GET", "elapsed_secs": 1.184999942779541, "status_code": 200} 06:17:11 INFO - Balrog request to https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-20161001004000 06:17:11 INFO - Data sent: None 06:17:11 INFO - Starting new HTTPS connection (1): aus4-admin.mozilla.org 06:17:13 INFO - "GET /api/releases/Firefox-mozilla-aurora-nightly-20161001004000 HTTP/1.1" 200 658908 06:17:14 INFO - REQUEST STATS: {"url": "https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-20161001004000", "timestamp": 1475327834.162, "method": "GET", "elapsed_secs": 2.6610000133514404, "status_code": 200} 06:17:14 INFO - Balrog request to https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-20161001004000/builds/WINNT_x86-msvc/fa 06:17:14 INFO - Data sent: None 06:17:15 INFO - "GET /api/releases/Firefox-mozilla-aurora-nightly-20161001004000/builds/WINNT_x86-msvc/fa HTTP/1.1" 200 454 06:17:15 INFO - REQUEST STATS: {"url": "https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-20161001004000/builds/WINNT_x86-msvc/fa", "timestamp": 1475327835.082, "method": "GET", "elapsed_secs": 0.9119999408721924, "status_code": 200} 06:17:15 INFO - Balrog request to https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest 06:17:15 INFO - Data sent: None 06:17:15 INFO - "HEAD /api/releases/Firefox-mozilla-aurora-nightly-latest HTTP/1.1" 200 0 06:17:15 INFO - REQUEST STATS: {"url": "https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest", "timestamp": 1475327835.842, "method": "HEAD", "elapsed_secs": 0.7579998970031738, "status_code": 200} 06:17:15 INFO - Balrog request to https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest/builds/WINNT_x86-msvc/fa 06:17:15 INFO - Data sent: {'alias': '["WINNT_x86-msvc-x86", "WINNT_x86-msvc-x64"]', 'product': u'Firefox', 'hashFunction': u'sha512', 'data_version': '334039', 'data': '{"buildID": "20161001004000", "platformVersion": "51.0a2", "displayVersion": "51.0a2", "completes": [{"fileUrl": "https://archive.mozilla.org/pub/firefox/nightly/2016/10/2016-10-01-00-40-00-mozilla-aurora-l10n/firefox-51.0a2.fa.win32.complete.mar", "from": "*", "hashValue": "f84541c771e99c7f6c760b61b153644967b4e8a9b4ca26e1f2cb3cac0faa4eb60f8951417685ce276d7bfeb3b9c7c91dba860d85a71e3adb5a1606370d918101", "filesize": 54580969}], "appVersion": "51.0a2"}', 'schema_version': 4} 06:17:17 INFO - "PUT /api/releases/Firefox-mozilla-aurora-nightly-latest/builds/WINNT_x86-msvc/fa HTTP/1.1" 400 87 06:17:17 INFO - Caught HTTPError: {"data": ["Failed to update row, old_data_version doesn't match current data_version"]} 06:17:17 INFO - REQUEST STATS: {"url": "https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest/builds/WINNT_x86-msvc/fa", "timestamp": 1475327837.714, "method": "PUT", "elapsed_secs": 1.871000051498413, "status_code": 400} 06:17:17 INFO - retry: Caught exception: 06:17:17 INFO - Traceback (most recent call last): 06:17:17 INFO - File "c:\builds\moz2_slave\m-aurora-w32-l10n-ntly-2-00000\build\tools\lib\python\vendor\redo-1.4.1\redo\__init__.py", line 152, in retry 06:17:17 INFO - return action(*args, **kwargs) 06:17:17 INFO - File "c:\builds\moz2_slave\m-aurora-w32-l10n-ntly-2-00000\build\tools\scripts\updates\../../lib/python\balrog\submitter\cli.py", line 338, in update_latest 06:17:17 INFO - data_version=latest_data_version) 06:17:17 INFO - File "c:\builds\moz2_slave\m-aurora-w32-l10n-ntly-2-00000\build\tools\scripts\updates\../../lib/python\balrog\submitter\api.py", line 216, in update_build 06:17:17 INFO - return self.request(method='PUT', data=data) 06:17:17 INFO - File "c:\builds\moz2_slave\m-aurora-w32-l10n-ntly-2-00000\build\tools\scripts\updates\../../lib/python\balrog\submitter\api.py", line 103, in request 06:17:17 INFO - return self.do_request(url, data, method) 06:17:17 INFO - File "c:\builds\moz2_slave\m-aurora-w32-l10n-ntly-2-00000\build\tools\scripts\updates\../../lib/python\balrog\submitter\api.py", line 121, in do_request 06:17:17 INFO - req.raise_for_status() 06:17:17 INFO - File "c:\builds\moz2_slave\m-aurora-w32-l10n-ntly-2-00000\build\tools\scripts\updates\../../lib/python/vendor/requests-2.7.0\requests\models.py", line 851, in raise_for_status 06:17:17 INFO - raise HTTPError(http_error_msg, response=self) 06:17:17 INFO - HTTPError: 400 Client Error: BAD REQUEST 06:17:17 INFO - sleeping for 2.00s (attempt 2/30) 06:17:19 INFO - attempt 3/30 06:17:19 INFO - retry: Calling <function update_latest at 0x02C3E1F0> with args: (), kwargs: {}, attempt #3 06:17:19 INFO - Balrog request to https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest 06:17:19 INFO - Data sent: None 06:17:19 INFO - Starting new HTTPS connection (1): aus4-admin.mozilla.org 06:17:20 INFO - "GET /api/releases/Firefox-mozilla-aurora-nightly-latest HTTP/1.1" 200 968131 command timed out: 10800 seconds elapsed running ['c:/mozilla-build/python27/python', '-u', 'scripts/scripts/desktop_l10n.py', '--environment-config', 'single_locale/production.py', '--branch-config', 'single_locale/mozilla-aurora.py', '--platform-config', 'single_locale/win32.py', '--total-chunks', '6', '--this-chunk', '2', '--balrog-config', 'balrog/production.py'], attempting to kill 06:17:21 INFO - REQUEST STATS: {"url": "https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest", "timestamp": 1475327841.646, "method": "GET", "elapsed_secs": 1.930999994277954, "status_code": 200} 06:17:21 INFO - Balrog request to https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest/builds/WINNT_x86-msvc/fa 06:17:21 INFO - Data sent: None 06:17:22 INFO - "GET /api/releases/Firefox-mozilla-aurora-nightly-latest/builds/WINNT_x86-msvc/fa HTTP/1.1" 200 1838 06:17:22 INFO - REQUEST STATS: {"url": "https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest/builds/WINNT_x86-msvc/fa", "timestamp": 1475327842.469, "method": "GET", "elapsed_secs": 0.8110001087188721, "status_code": 200} 06:17:22 INFO - Balrog request to https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-20161001004000 06:17:22 INFO - Data sent: None
Flags: needinfo?(mcornmesser)
Flags: needinfo?(bhearsum)
There's probably more retries because we recently reduced sleep time and increased attempts when submitting to Balrog. The fix for this is balrogworker, and we've been trying to marginally improve the situation in https://bugzilla.mozilla.org/show_bug.cgi?id=1284516 in the meantime.
Flags: needinfo?(bhearsum)
One thing that might help here is to reduce the # of retries. In https://github.com/mozilla/build-tools/commit/baaf2fb71a1deedc2206413d5eeafee29ae11272, we switched from 5 attempts w ~30s sleep between them -> 30 attempts with 2s sleep. That ended up increasing the theoretical max sleep time, which may have pushed us over here. Let's try 10 attempts max, which is still twice the number we had before.
Attachment #8797175 - Flags: review?(rail)
My thoughts from IRC to immortalize in this bug... 11:13 AM <Callek> bhearsum: sooo I'm thinking the likely "before" case was "something got a data version mismatch, and waits 30sec, <other job> doesn't and manages to churn through a bunch of submissions in short succession --" thus meaning that the 2sec sleep is actually hurting us when coupled with tcp/web-app overhead 11:14 AM <Callek> so I'm not sure dropping to 10 retries will solve more than causing them to fail for failing to submit, rather than fail for timeout 11:14 AM <Callek> (mostly said based on my own informed intuition, rather than on hard data) 11:17 AM <•bhearsum> you could be onto something there 11:17 AM <•bhearsum> honestly, i'm just throwing things at the wall 11:18 AM <•bhearsum> it's so difficult to reason about what's going to happen with so many variables in the mix 11:18 AM <•bhearsum> if you want to try that feel free to toss up a patch, otherwise we'll see what happens with mine first 11:18 AM <Callek> I'm happy to see how well yours takes and go from there
Attachment #8797175 - Flags: review?(rail) → review+
Rail, do you need to generate new images for this?
Flags: needinfo?(rail)
Depends on: 1304075
Flags: needinfo?(mcornmesser)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: