Closed
Bug 1117620
Opened 11 years ago
Closed 11 years ago
Blobber uploads broken by a json error, all trees closed
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Unassigned)
e.g. https://treeherder.mozilla.org/logviewer.html#?job_id=5064885&repo=mozilla-inbound
Since our third most common test failure requires looking at a blobber-uploaded screenshot to identify that it's the slave's fault, all trees are closed.
Comment 1•11 years ago (Reporter)
Looks like it broke between 16:30 and 16:47, assuming the time of the failure is the significant time rather than something more annoying like the time the slave ran puppet.
Comment 3•11 years ago
Initial investigation:
17:54:39 INFO - Running command: ['/builds/slave/test/build/venv/bin/python', '/builds/slave/test/build/venv/bin/blobberc.py', '-u', 'https://blobupload.elasticbeanstalk.com', '-a', '/builds/slave/test/oauth.txt', '-b', 'mozilla-inbound', '-d', '/builds/slave/test/build/blobber_upload_dir', '--output-manifest', '/builds/slave/test/build/uploaded_files.json']
17:54:39 INFO - Copy/paste: /builds/slave/test/build/venv/bin/python /builds/slave/test/build/venv/bin/blobberc.py -u https://blobupload.elasticbeanstalk.com -a /builds/slave/test/oauth.txt -b mozilla-inbound -d /builds/slave/test/build/blobber_upload_dir --output-manifest /builds/slave/test/build/uploaded_files.json
17:54:40 INFO - Traceback (most recent call last):
17:54:40 INFO - File "/builds/slave/test/build/venv/bin/blobberc.py", line 253, in <module>
17:54:40 INFO - main()
17:54:40 INFO - File "/builds/slave/test/build/venv/bin/blobberc.py", line 235, in main
17:54:40 INFO - filetype_whitelist = get_server_whitelist(args['--url'])
17:54:40 INFO - File "/builds/slave/test/build/venv/bin/blobberc.py", line 69, in get_server_whitelist
17:54:40 INFO - return set(response.json().get('whitelist', []))
17:54:40 INFO - File "/builds/slave/test/build/venv/local/lib/python2.7/site-packages/requests/models.py", line 651, in json
17:54:40 INFO - return json.loads(self.text or self.content, **kwargs)
17:54:40 INFO - File "/builds/slave/test/build/venv/local/lib/python2.7/site-packages/simplejson/__init__.py", line 488, in loads
17:54:40 INFO - return _default_decoder.decode(s)
17:54:40 INFO - File "/builds/slave/test/build/venv/local/lib/python2.7/site-packages/simplejson/decoder.py", line 370, in decode
17:54:40 INFO - obj, end = self.raw_decode(s)
17:54:40 INFO - File "/builds/slave/test/build/venv/local/lib/python2.7/site-packages/simplejson/decoder.py", line 389, in raw_decode
17:54:40 INFO - return self.scan_once(s, idx=_w(s, idx).end())
17:54:40 INFO - simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
17:54:40 ERROR - Return code: 1
This was on slave tst-linux64-spot-820.
The failing script is https://github.com/mozilla/build-blobuploader/blob/master/blobberc.py
tst-linux64-spot-820 is *not* in any sort of AWS error state; it has been running for ~2 hours and is up now.
Comment 4•11 years ago
The failing line,
filetype_whitelist = get_server_whitelist(args['--url'])
is using the URL passed via '-u', 'https://blobupload.elasticbeanstalk.com',
so it hits https://github.com/mozilla/build-blobuploader/blob/master/blobberc.py#L62
which yields an AWS error when reproduced by hand:
[root@tst-linux64-spot-820.test.releng.use1.mozilla.com ~]# /builds/slave/test/build/venv/bin/python
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urlparse
>>> import requests
>>> from blobuploader import cert
>>> hostname = "https://blobupload.elasticbeanstalk.com"
>>> url = urlparse.urljoin(hostname, '/blobs/whitelist')
>>> url
'https://blobupload.elasticbeanstalk.com/blobs/whitelist'
>>> response = requests.get(url, verify=cert.where())
>>> response
<Response [503]>
>>> response.content
''
>>> response.text
u''
>>> response.__dict__
{'cookies': <<class 'requests.cookies.RequestsCookieJar'>[]>, '_content': '',
 'headers': CaseInsensitiveDict({'content-length': '0', 'connection': 'keep-alive'}),
 'url': u'https://blobupload.elasticbeanstalk.com/blobs/whitelist', 'status_code': 503,
 '_content_consumed': True, 'encoding': None, 'request': <PreparedRequest [GET]>,
 'connection': <requests.adapters.HTTPAdapter object at 0x1dea750>,
 'elapsed': datetime.timedelta(0, 0, 267530),
 'raw': <requests.packages.urllib3.response.HTTPResponse object at 0x1f7bf90>,
 'reason': 'Service Unavailable: Back-end server is at capacity', 'history': []}
Comment 5•11 years ago
Which is really interesting since, AFAICT, Elastic Beanstalk is *not* experiencing any errors according to AWS's status page:
http://status.aws.amazon.com/
Comment 6•11 years ago
Unassigning myself, since I can't file an AWS ticket myself and don't know of any mitigation I can apply on my own.
Informed nigelb (sheriff) on IRC, and pinged a few relengers who might be getting up soon. Given the particular error, it *might* clear on its own.
This is also something we should harden blobberc.py against: check the response status and emit a clear error message instead of a raw JSON traceback. A sketch of that follows.
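A minimal sketch of that hardening, assuming the current get_server_whitelist() in blobberc.py (the request itself matches the upstream script; the exact integration point and messages are illustrative):

import sys
import urlparse

import requests
from blobuploader import cert

def get_server_whitelist(hostname):
    # Same request blobberc.py makes today, but with explicit error
    # checking instead of assuming a JSON body is always present.
    url = urlparse.urljoin(hostname, '/blobs/whitelist')
    response = requests.get(url, verify=cert.where())
    if response.status_code != 200:
        # A 503 with an empty body (as seen above) previously surfaced
        # as an opaque JSONDecodeError from deep inside simplejson.
        sys.exit("Could not fetch whitelist from %s: HTTP %d (%s)"
                 % (url, response.status_code, response.reason))
    try:
        return set(response.json().get('whitelist', []))
    except ValueError:
        # requests raises a ValueError subclass for malformed JSON.
        sys.exit("Whitelist response from %s was not valid JSON" % url)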
Assignee: bugspam.Callek → nobody
Comment hidden (Legacy TBPL/Treeherder Robot)
Comment 8•11 years ago
Looks like an underlying instance was terminated, which caused the service to fail. I have rebuilt the environment, which seems to have fixed it. It's not clear why the instance was terminated. (A sketch of scripting such a rebuild follows.)
Carsten is reopening trees...
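For reference, an environment rebuild like this can also be scripted. A minimal sketch using boto3 (a current AWS SDK, not necessarily the tooling used at the time); the environment name and region are hypothetical, since neither is recorded in this bug:

import boto3

# Hypothetical name; the real blobber EB environment name is not in this bug.
ENVIRONMENT_NAME = "blobupload-prod"

eb = boto3.client("elasticbeanstalk", region_name="us-east-1")

# RebuildEnvironment terminates and recreates the environment's resources
# (instances, autoscaling group, load balancer).
eb.rebuild_environment(EnvironmentName=ENVIRONMENT_NAME)

# Check the environment until it reports Ready/Green again.
desc = eb.describe_environments(EnvironmentNames=[ENVIRONMENT_NAME])
for env in desc["Environments"]:
    print(env["EnvironmentName"], env["Status"], env["Health"])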
Updated•11 years ago
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 9•11 years ago
Terminated instance was i-3d4d5ed4 (mgerva grabbed this from the events page in the AWS console). Thanks mgerva!
Comment 10•11 years ago
There should be an autoscaling group involved -- EB sets those up for you. So maybe there's something wrong with the EB configuration?
Comment 11•11 years ago
I believe an autoscaling group was set up; I think I saw it being removed as I rebuilt the environment. I wasn't able to pull back logs (perhaps because I could not connect to the terminated instance), so I could not see in detail what had taken place. After rebuilding the environment, I was also unable to find the instance i-3d4d5ed4 in the EC2 console. So I'm rather unsure what really happened.
In the event logs, it was possible to see that the status had moved from green to yellow, and then from yellow to red due to this instance being terminated, but the cause of the termination is unclear, as is why a new instance was not brought up automatically by the autoscaling mechanism.
Further investigation may help.
Comment 12•11 years ago
Instance termination is to be expected. IIRC there was a maintenance window for one of the AZs, so that might have been responsible. But yes, investigating the autoscaling would be good; a sketch of how to pull the scaling history follows.
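A minimal sketch of that investigation, again using boto3; the autoscaling group name is hypothetical (EB generates one per environment, visible in the environment's resources):

import boto3

# Hypothetical group name; EB-generated names look like "awseb-e-xxxx-stack-...".
GROUP_NAME = "awseb-blobupload-AutoScalingGroup"

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# The scaling activity log records every launch/termination the group
# performed, with a cause string; it should show whether the group tried
# (and failed) to replace i-3d4d5ed4.
activities = autoscaling.describe_scaling_activities(
    AutoScalingGroupName=GROUP_NAME, MaxRecords=20)
for activity in activities["Activities"]:
    print(activity["StartTime"], activity["StatusCode"])
    print("   " + activity["Cause"])

# A desired capacity of 0 (or a min/max misconfiguration) would explain
# why no replacement instance was brought up automatically.
groups = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[GROUP_NAME])
for group in groups["AutoScalingGroups"]:
    print(group["MinSize"], group["DesiredCapacity"], group["MaxSize"])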
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard