Closed Bug 1306860 Opened 8 years ago Closed 7 years ago

Intermittent-infra l10n nightly command timed out: 10800 seconds elapsed running ['c:/mozilla-build/python27/python', '-u', 'scripts/scripts/desktop_l10n.py', '--environment-config', 'single_locale/production.py', '--branch-config', 'single_locale/mozilla

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: bulk-close-intermittents, intermittent-failure)

Attachments

(1 file)

Ok, so we have two problems here:

(1) - b-2008 spots don't actually kill buildbot when they time out. Theres a bug on file, and its reasonably important, n-i to markco to comment here what bug that is.

(2) - We have a *lot* of retries to balrog in this log... That probably contributes a LOT to the potential timeout here...  (some of the log excerpt follows)

Ben is there any insight into how we can shore up issue #2 here a bit (either on client side, or server side, to avoid the excess time-spent)


06:17:11     INFO -  "GET /api/releases/Firefox-mozilla-aurora-nightly-latest/builds/WINNT_x86-msvc/fa HTTP/1.1" 200 1838
06:17:11     INFO -  REQUEST STATS: {"url": "https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest/builds/WINNT_x86-msvc/fa", "timestamp": 1475327831.498, "method": "GET", "elapsed_secs": 1.184999942779541, "status_code": 200}
06:17:11     INFO -  Balrog request to https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-20161001004000
06:17:11     INFO -  Data sent: None
06:17:11     INFO -  Starting new HTTPS connection (1): aus4-admin.mozilla.org
06:17:13     INFO -  "GET /api/releases/Firefox-mozilla-aurora-nightly-20161001004000 HTTP/1.1" 200 658908
06:17:14     INFO -  REQUEST STATS: {"url": "https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-20161001004000", "timestamp": 1475327834.162, "method": "GET", "elapsed_secs": 2.6610000133514404, "status_code": 200}
06:17:14     INFO -  Balrog request to https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-20161001004000/builds/WINNT_x86-msvc/fa
06:17:14     INFO -  Data sent: None
06:17:15     INFO -  "GET /api/releases/Firefox-mozilla-aurora-nightly-20161001004000/builds/WINNT_x86-msvc/fa HTTP/1.1" 200 454
06:17:15     INFO -  REQUEST STATS: {"url": "https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-20161001004000/builds/WINNT_x86-msvc/fa", "timestamp": 1475327835.082, "method": "GET", "elapsed_secs": 0.9119999408721924, "status_code": 200}
06:17:15     INFO -  Balrog request to https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest
06:17:15     INFO -  Data sent: None
06:17:15     INFO -  "HEAD /api/releases/Firefox-mozilla-aurora-nightly-latest HTTP/1.1" 200 0
06:17:15     INFO -  REQUEST STATS: {"url": "https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest", "timestamp": 1475327835.842, "method": "HEAD", "elapsed_secs": 0.7579998970031738, "status_code": 200}
06:17:15     INFO -  Balrog request to https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest/builds/WINNT_x86-msvc/fa
06:17:15     INFO -  Data sent: {'alias': '["WINNT_x86-msvc-x86", "WINNT_x86-msvc-x64"]', 'product': u'Firefox', 'hashFunction': u'sha512', 'data_version': '334039', 'data': '{"buildID": "20161001004000", "platformVersion": "51.0a2", "displayVersion": "51.0a2", "completes": [{"fileUrl": "https://archive.mozilla.org/pub/firefox/nightly/2016/10/2016-10-01-00-40-00-mozilla-aurora-l10n/firefox-51.0a2.fa.win32.complete.mar", "from": "*", "hashValue": "f84541c771e99c7f6c760b61b153644967b4e8a9b4ca26e1f2cb3cac0faa4eb60f8951417685ce276d7bfeb3b9c7c91dba860d85a71e3adb5a1606370d918101", "filesize": 54580969}], "appVersion": "51.0a2"}', 'schema_version': 4}
06:17:17     INFO -  "PUT /api/releases/Firefox-mozilla-aurora-nightly-latest/builds/WINNT_x86-msvc/fa HTTP/1.1" 400 87
06:17:17     INFO -  Caught HTTPError: {"data": ["Failed to update row, old_data_version doesn't match current data_version"]}
06:17:17     INFO -  REQUEST STATS: {"url": "https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest/builds/WINNT_x86-msvc/fa", "timestamp": 1475327837.714, "method": "PUT", "elapsed_secs": 1.871000051498413, "status_code": 400}
06:17:17     INFO -  retry: Caught exception:
06:17:17     INFO -  Traceback (most recent call last):
06:17:17     INFO -    File "c:\builds\moz2_slave\m-aurora-w32-l10n-ntly-2-00000\build\tools\lib\python\vendor\redo-1.4.1\redo\__init__.py", line 152, in retry
06:17:17     INFO -      return action(*args, **kwargs)
06:17:17     INFO -    File "c:\builds\moz2_slave\m-aurora-w32-l10n-ntly-2-00000\build\tools\scripts\updates\../../lib/python\balrog\submitter\cli.py", line 338, in update_latest
06:17:17     INFO -      data_version=latest_data_version)
06:17:17     INFO -    File "c:\builds\moz2_slave\m-aurora-w32-l10n-ntly-2-00000\build\tools\scripts\updates\../../lib/python\balrog\submitter\api.py", line 216, in update_build
06:17:17     INFO -      return self.request(method='PUT', data=data)
06:17:17     INFO -    File "c:\builds\moz2_slave\m-aurora-w32-l10n-ntly-2-00000\build\tools\scripts\updates\../../lib/python\balrog\submitter\api.py", line 103, in request
06:17:17     INFO -      return self.do_request(url, data, method)
06:17:17     INFO -    File "c:\builds\moz2_slave\m-aurora-w32-l10n-ntly-2-00000\build\tools\scripts\updates\../../lib/python\balrog\submitter\api.py", line 121, in do_request
06:17:17     INFO -      req.raise_for_status()
06:17:17     INFO -    File "c:\builds\moz2_slave\m-aurora-w32-l10n-ntly-2-00000\build\tools\scripts\updates\../../lib/python/vendor/requests-2.7.0\requests\models.py", line 851, in raise_for_status
06:17:17     INFO -      raise HTTPError(http_error_msg, response=self)
06:17:17     INFO -  HTTPError: 400 Client Error: BAD REQUEST
06:17:17     INFO -  sleeping for 2.00s (attempt 2/30)
06:17:19     INFO -  attempt 3/30
06:17:19     INFO -  retry: Calling <function update_latest at 0x02C3E1F0> with args: (), kwargs: {}, attempt #3
06:17:19     INFO -  Balrog request to https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest
06:17:19     INFO -  Data sent: None
06:17:19     INFO -  Starting new HTTPS connection (1): aus4-admin.mozilla.org
06:17:20     INFO -  "GET /api/releases/Firefox-mozilla-aurora-nightly-latest HTTP/1.1" 200 968131

command timed out: 10800 seconds elapsed running ['c:/mozilla-build/python27/python', '-u', 'scripts/scripts/desktop_l10n.py', '--environment-config', 'single_locale/production.py', '--branch-config', 'single_locale/mozilla-aurora.py', '--platform-config', 'single_locale/win32.py', '--total-chunks', '6', '--this-chunk', '2', '--balrog-config', 'balrog/production.py'], attempting to kill
06:17:21     INFO -  REQUEST STATS: {"url": "https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest", "timestamp": 1475327841.646, "method": "GET", "elapsed_secs": 1.930999994277954, "status_code": 200}
06:17:21     INFO -  Balrog request to https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest/builds/WINNT_x86-msvc/fa
06:17:21     INFO -  Data sent: None
06:17:22     INFO -  "GET /api/releases/Firefox-mozilla-aurora-nightly-latest/builds/WINNT_x86-msvc/fa HTTP/1.1" 200 1838
06:17:22     INFO -  REQUEST STATS: {"url": "https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-latest/builds/WINNT_x86-msvc/fa", "timestamp": 1475327842.469, "method": "GET", "elapsed_secs": 0.8110001087188721, "status_code": 200}
06:17:22     INFO -  Balrog request to https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-nightly-20161001004000
06:17:22     INFO -  Data sent: None
Flags: needinfo?(mcornmesser)
Flags: needinfo?(bhearsum)
There's probably more retries because we recently reduced sleep time and increased attempts when submitting to Balrog. The fix for this is balrogworker, and we've been trying to marginally improve the situation in https://bugzilla.mozilla.org/show_bug.cgi?id=1284516 in the meantime.
Flags: needinfo?(bhearsum)
One thing that might help here is to reduce the # of retries. In https://github.com/mozilla/build-tools/commit/baaf2fb71a1deedc2206413d5eeafee29ae11272, we switched from 5 attempts w ~30s sleep between them -> 30 attempts with 2s sleep. That ended up increasing the theoretical max sleep time, which may have pushed us over here. Let's try 10 attempts max, which is still twice the number we had before.
Attachment #8797175 - Flags: review?(rail)
My thoughts from IRC to immortalize in this bug...

11:13 AM <Callek> bhearsum: sooo I'm thinking the likely "before" case was "something got a data version mismatch, and waits 30sec, <other job> doesn't and manages to churn through a bunch of submissions in short succession --" thus meaning that the 2sec sleep is actually hurting us when coupled with tcp/web-app overhead
11:14 AM <Callek> so I'm not sure dropping to 10 retries will solve more than causing them to fail for failing to submit, rather than fail for timeout
11:14 AM <Callek> (mostly said based on my own informed intuition, rather than on hard data)
11:17 AM <•bhearsum> you could be onto something there
11:17 AM <•bhearsum> honestly, i'm just throwing things at the wall
11:18 AM <•bhearsum> it's so difficult to reason about what's going to happen with so many variables in the mix
11:18 AM <•bhearsum> if you want to try that feel free to toss up a patch, otherwise we'll see what happens with mine first
11:18 AM <Callek> I'm happy to see how well yours takes and go from there
Attachment #8797175 - Flags: review?(rail) → review+
Rail, do you need to generate new images for this?
Flags: needinfo?(rail)
Depends on: 1304075
Flags: needinfo?(mcornmesser)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: