Closed Bug 1117620 Opened 11 years ago Closed 11 years ago

Blobber uploads broken by a json error, all trees closed

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References

Details

e.g. https://treeherder.mozilla.org/logviewer.html#?job_id=5064885&repo=mozilla-inbound

Since our third most common test failure requires looking at a blobber-uploaded screenshot to identify that it's the slave's fault, all trees are closed.
Looks like it broke between 16:30 and 16:47, assuming the time of the failure is the significant time rather than something more annoying like the time the slave ran puppet.
Looking into this....
Assignee: nobody → bugspam.Callek
Initial investigation:

17:54:39 INFO - Running command: ['/builds/slave/test/build/venv/bin/python', '/builds/slave/test/build/venv/bin/blobberc.py', '-u', 'https://blobupload.elasticbeanstalk.com', '-a', '/builds/slave/test/oauth.txt', '-b', 'mozilla-inbound', '-d', '/builds/slave/test/build/blobber_upload_dir', '--output-manifest', '/builds/slave/test/build/uploaded_files.json']
17:54:39 INFO - Copy/paste: /builds/slave/test/build/venv/bin/python /builds/slave/test/build/venv/bin/blobberc.py -u https://blobupload.elasticbeanstalk.com -a /builds/slave/test/oauth.txt -b mozilla-inbound -d /builds/slave/test/build/blobber_upload_dir --output-manifest /builds/slave/test/build/uploaded_files.json
17:54:40 INFO - Traceback (most recent call last):
17:54:40 INFO -   File "/builds/slave/test/build/venv/bin/blobberc.py", line 253, in <module>
17:54:40 INFO -     main()
17:54:40 INFO -   File "/builds/slave/test/build/venv/bin/blobberc.py", line 235, in main
17:54:40 INFO -     filetype_whitelist = get_server_whitelist(args['--url'])
17:54:40 INFO -   File "/builds/slave/test/build/venv/bin/blobberc.py", line 69, in get_server_whitelist
17:54:40 INFO -     return set(response.json().get('whitelist', []))
17:54:40 INFO -   File "/builds/slave/test/build/venv/local/lib/python2.7/site-packages/requests/models.py", line 651, in json
17:54:40 INFO -     return json.loads(self.text or self.content, **kwargs)
17:54:40 INFO -   File "/builds/slave/test/build/venv/local/lib/python2.7/site-packages/simplejson/__init__.py", line 488, in loads
17:54:40 INFO -     return _default_decoder.decode(s)
17:54:40 INFO -   File "/builds/slave/test/build/venv/local/lib/python2.7/site-packages/simplejson/decoder.py", line 370, in decode
17:54:40 INFO -     obj, end = self.raw_decode(s)
17:54:40 INFO -   File "/builds/slave/test/build/venv/local/lib/python2.7/site-packages/simplejson/decoder.py", line 389, in raw_decode
17:54:40 INFO -     return self.scan_once(s, idx=_w(s, idx).end())
17:54:40 INFO - simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
17:54:40 ERROR - Return code: 1

This was on slave tst-linux64-spot-820. The failure is in https://github.com/mozilla/build-blobuploader/blob/master/blobberc.py. tst-linux64-spot-820 is *not* in any sort of AWS error state, has been running for ~2 hours, and is up now.
filetype_whitelist = get_server_whitelist(args['--url']) is using '-u', 'https://blobupload.elasticbeanstalk.com', so it hits https://github.com/mozilla/build-blobuploader/blob/master/blobberc.py#L62, which yields an AWS error:

[root@tst-linux64-spot-820.test.releng.use1.mozilla.com ~]# /builds/slave/test/build/venv/bin/python
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urlparse
>>> import requests
>>> from blobuploader import cert
>>> hostname = "https://blobupload.elasticbeanstalk.com"
>>> url = urlparse.urljoin(hostname, '/blobs/whitelist')
>>> url
'https://blobupload.elasticbeanstalk.com/blobs/whitelist'
>>> response = requests.get(url, verify=cert.where())
>>> response
<Response [503]>
>>> response.content
''
>>> response.text
u''
>>> response.__dict__
{'cookies': <<class 'requests.cookies.RequestsCookieJar'>[]>, '_content': '', 'headers': CaseInsensitiveDict({'content-length': '0', 'connection': 'keep-alive'}), 'url': u'https://blobupload.elasticbeanstalk.com/blobs/whitelist', 'status_code': 503, '_content_consumed': True, 'encoding': None, 'request': <PreparedRequest [GET]>, 'connection': <requests.adapters.HTTPAdapter object at 0x1dea750>, 'elapsed': datetime.timedelta(0, 0, 267530), 'raw': <requests.packages.urllib3.response.HTTPResponse object at 0x1f7bf90>, 'reason': 'Service Unavailable: Back-end server is at capacity', 'history': []}
Which is really interesting since, afaict elasticbeanstalk is *not* experiencing any errors according to AWS's status page: http://status.aws.amazon.com/
Unassigning myself, since I can't file an AWS ticket myself and don't know of any way to alleviate the issue on my own. Informed nigelb (sheriff) on IRC, and pinged a few relengers who might be getting up soon. Given the particular error it *might* clear on its own. We should also improve the error checking in blobberc.py here, so it reports a useful error message instead of crashing on the empty response.
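For illustration, a hedged sketch of what more defensive error checking in get_server_whitelist could look like, based only on the call shown in the traceback above (blobberc.py line 69) and the imports seen in the session below; the surrounding code and logging setup are assumptions, not copied from the real file:

import logging
import urlparse

import requests
from blobuploader import cert

log = logging.getLogger(__name__)

def get_server_whitelist(hostname):
    url = urlparse.urljoin(hostname, '/blobs/whitelist')
    response = requests.get(url, verify=cert.where())
    # A 503 with an empty body (as seen in this bug) currently crashes in
    # response.json(); fail soft with a clear message instead.
    if response.status_code != 200 or not response.content:
        log.error("whitelist request to %s failed: HTTP %s (%s)",
                  url, response.status_code, response.reason)
        return set()
    try:
        return set(response.json().get('whitelist', []))
    except ValueError:
        # simplejson's JSONDecodeError subclasses ValueError.
        log.error("whitelist response from %s was not valid JSON", url)
        return set()

Whether falling back to an empty whitelist is the right behaviour is a separate call; the point is to surface the HTTP status rather than a bare JSONDecodeError.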
Assignee: bugspam.Callek → nobody
Looks like an underlying instance was terminated which caused the service to fail. I have rebuilt the environment which seems to have fixed it. Not clear why the instance was terminated. Carsten is reopening trees...
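For reference, a hedged sketch of what an environment rebuild looks like via the Elastic Beanstalk API (the environment name here is a placeholder, not taken from this bug):

import boto3

eb = boto3.client("elasticbeanstalk", region_name="us-east-1")

# RebuildEnvironment tears down and recreates the environment's resources
# (instances, autoscaling group, load balancer) in place.
eb.rebuild_environment(EnvironmentName="blobupload")

# Check status until the environment reports Ready/Green again.
envs = eb.describe_environments(EnvironmentNames=["blobupload"])["Environments"]
for env in envs:
    print(env["EnvironmentName"], env["Status"], env["Health"])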
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Terminated instance was i-3d4d5ed4. (mgerva grabbed this from the events page in aws console). Thanks mgerva!
There should be an autoscaling group involved -- EB sets those up for you. So maybe there's something wrong with the EB configuration?
I believe an autoscaling group was set up; I think I saw it being removed while I rebuilt the environment. I wasn't able to pull back logs (perhaps because I could not connect to the terminated instance), so I couldn't see in detail what had taken place. After rebuilding the environment, I was also unable to find the instance i-3d4d5ed4 in the EC2 console, so I'm rather unsure what really happened. In the event logs it was possible to see that the status had moved from green to yellow, and then from yellow to red, due to this instance being terminated, but the cause of the termination is unclear, as is why a new instance was not brought up automatically by the autoscaling mechanism. Further investigation may help.
Instance termination is to be expected. IIRC there was a maintenance window for one of the AZs, so that might have been responsible. But yes, investigating the autoscaling would be good.
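A hedged sketch of one way to look at the autoscaling side with boto3; the group-name filter is a placeholder guess, not the real group name:

import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Find the autoscaling group that Elastic Beanstalk created for the
# blobber environment (placeholder filter on the name).
groups = autoscaling.describe_auto_scaling_groups()["AutoScalingGroups"]
for group in groups:
    name = group["AutoScalingGroupName"]
    if "blobupload" not in name.lower():
        continue
    # Scaling activities record terminations and any (failed) launch
    # attempts, including the Cause string for each.
    activities = autoscaling.describe_scaling_activities(
        AutoScalingGroupName=name)["Activities"]
    for activity in activities:
        print(activity["StartTime"], activity["StatusCode"],
              activity["Description"])
        print("  ", activity.get("Cause", ""))

That should show whether the group ever tried to replace i-3d4d5ed4 and, if it did, why the launch failed.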
See Also: → 1298759
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard