Closed
Bug 898688
Opened 11 years ago
Closed 11 years ago
tree closure due to broken report for builds-4hr
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: joduinn, Assigned: Callek)
References
Details
Filing this placeholder while people work on the actual bustage. From IRC, this might be caused by code that landed on Monday, but it's unclear why that's biting us now. Trees closed at: 2013-07-26 15:50:20
Comment 1 (Assignee) • 11 years ago
Trees Reopened just now, I'll dump out state here in just a few minutes, for history.
Updated • 11 years ago
Assignee: nobody → bugspam.Callek
Comment 2 (Reporter) • 11 years ago
From IRC:

1) It looks like JSON fails to parse the properties from http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/builders/release-mozilla-beta-win32_repack_10%2F10/builds/24, which broke buildapi's reading of the properties, which in turn completely broke writing builds-4hr, and that is what is holding the trees closed.

2) We have an alert monitoring builds-4hr, which alerted us to this before the sheriffs or any devs noticed.
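The failure mode in (1) is easy to reproduce in isolation: json.loads on a value that was cut off mid-string raises the same ValueError seen in the tracebacks below (on Python 3 it is json.JSONDecodeError, a ValueError subclass). A minimal sketch; the property payload here is invented for illustration, not the actual stored blob:

```python
import json

# A hypothetical properties blob truncated mid-string, as a DB column
# might hold if a property value was cut off while being written.
truncated = '{"basedir": "e:/builds/moz2_slave/rel-m-'

try:
    json.loads(truncated)
    parsed = True
except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
    parsed = False
    message = str(exc)

# The parse fails with an "Unterminated string" error, mirroring the
# traceback that broke the builds-4hr report.
```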
Comment 3 (Assignee) • 11 years ago
Ok, so here's the timeline, with inline details as I learned them.

At 5:16pm ET (2:16pm PT) we started getting cron spam for some of the reporting we do:

===============================
Date: Fri, 26 Jul 2013 15:31:29 -0700
Message-Id: <201307262231.r6QMVTOw027582@buildapi01.build.scl1.mozilla.com>
From: root@buildapi01.build.scl1.mozilla.com (Cron Daemon)
To: release@mozilla.com
Subject: Cron <buildapi@buildapi01> /home/buildapi/bin/report-today.sh

Traceback (most recent call last):
  File "/home/buildapi/src/buildapi/scripts/reporter.py", line 353, in <module>
    report = build_report(R, session, scheduler_db_engine, starttime, endtime)
  File "/home/buildapi/src/buildapi/scripts/reporter.py", line 218, in build_report
    build_dict = get_build_dict()
  File "/home/buildapi/src/buildapi/scripts/reporter.py", line 174, in get_build_dict
    for prop in build.properties:
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/orm/attributes.py", line 168, in __get__
    return self.impl.get(instance_state(instance),dict_)
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/orm/attributes.py", line 420, in get
    value = self.callable_(state, passive)
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/orm/strategies.py", line 548, in _load_for_state
    result = q.all()
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 1677, in all
    return list(self)
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 1912, in instances
    rows = [process[0](row, None) for row in fetch]
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/orm/mapper.py", line 2578, in _instance
    populate_state(state, dict_, row, isnew, only_load_props)
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/orm/mapper.py", line 2445, in populate_state
    populator(state, dict_, row)
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/orm/strategies.py", line 150, in new_execute
    dict_[key] = row[col]
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/types.py", line 572, in process
    return process_value(value, dialect)
  File "/home/buildapi/src/buildapi/model/jsoncol.py", line 21, in process_result_value
    return json.loads(value)
  File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.6/json/decoder.py", line 336, in raw_decode
    obj, end = self._scanner.iterscan(s, **kw).next()
  File "/usr/lib64/python2.6/json/scanner.py", line 55, in iterscan
    rval, next_pos = action(m, context)
  File "/usr/lib64/python2.6/json/decoder.py", line 155, in JSONString
    return scanstring(match.string, match.end(), encoding, strict)
ValueError: Unterminated string starting at: line 1 column 0 (char 0)
=======================================================

I pretty much ignored this since I was busy doing other things and didn't think it needed to be triaged right away. (This cron spam was coming once every 15 minutes.)

** Next **

At about 6:45pm ET / 3:45pm PT I noticed this message in #buildduty:

Fri 15:44:13 PDT [4273] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 1:30:47 ago - 1597514 bytes in 0.081 second response time (http://m.allizom.org/http+file+age+-+/buildjson/builds-4hr.js.gz)

At which point I knew we had to close trees, so I did so (closed just a few minutes later, noted in comment 0). I then began working to figure out what was wrong. The builds-4hr log showed the same traceback as the cron spam we had been getting.

Next I started debugging the problem in a python shell. Following the steps from http://mxr.mozilla.org/build/source/buildapi/buildapi/scripts/reporter.py I could verify that build.properties itself was throwing.
So, checking out which build that was:

>>> n = len(all_builds)
11696
>>> build.id
25794014L
>>> build.__dict__
{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0xa3b8e10>, 'master_id': 77L, 'endtime': datetime.datetime(2013, 7, 26, 21, 13, 53), 'builder_id': 176064L, 'buildnumber': 24L, 'lost': False, 'reason': u"The web-page 'rebuild' button was pressed by '<unknown>': \n", 'result': 5L, 'starttime': datetime.datetime(2013, 7, 26, 21, 13, 52), 'source_id': 4976243L, 'slave_id': 543L, 'id': 25794014L}

Then, using http://mxr.mozilla.org/build/source/buildapi/buildapi/model/statusdb_orm.py as my guide, a bit more investigation:

>>> build.master
<buildapi.model.statusdb_orm.Master object at 0xf278650>
>>> build.master.name
u'bm58-build1'
>>> build.slave
<buildapi.model.statusdb_orm.Slave object at 0xb71ead0>
>>> build.slave.name
u'mw32-ix-slave07'
>>> build.builder
<buildapi.model.statusdb_orm.Builder object at 0xb71ed10>
>>> build.builder.id
176064L
>>> build.builder.name
u'release-mozilla-beta-win32_repack_10'
>>> build.builder.master_id
77L

So, looking slightly closer, I found http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/builders/release-mozilla-beta-win32_repack_10%2F10/builds/24, which has a totally borked basedir value of:

"e:/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/obj-l10n/_virtualenv/Scripts/python.exe /e/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/config/nsinstall.py -t -D ../../dist/xpi-stage/locale-zh-TW/browser/defaults/profile/ e:/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/obj-l10n/_virtualenv/Scripts/python.exe /e/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/config/Preprocessor.py \ -I /e/builds/moz2_slave/rel-m-beta-w32_rpk_10- .. [property value too long]"

Since it's a release build, I suspect[ed] that backing out the recent cset that shows release builds (and others) in buildapi's recent view was a good enough fix for a late Friday night.
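The shell session above boils down to: walk the recent builds and find the rows whose stored properties JSON will not parse. A stdlib-only sketch of that scan; the function name and the row tuples are invented for illustration (the real reporter.py iterates SQLAlchemy ORM objects against statusdb):

```python
import json

def find_bad_property_rows(rows):
    """Return the ids of rows whose raw properties JSON fails to parse.

    `rows` is an iterable of (build_id, raw_json) pairs, standing in
    for the statusdb query the real reporter script performs.
    """
    bad = []
    for build_id, raw in rows:
        try:
            json.loads(raw)
        except ValueError:  # covers json.JSONDecodeError on Python 3
            bad.append(build_id)
    return bad

# Hypothetical sample data: one good row, plus one truncated row
# playing the role of build 25794014 above.
sample = [
    (1, '{"basedir": "/builds/slave/ok"}'),
    (25794014, '{"basedir": "e:/builds/moz2_slave/rel-m-'),
]
bad_ids = find_bad_property_rows(sample)
```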
So I backed that out and updated:
http://hg.mozilla.org/build/buildapi/rev/eb629a2c049d
http://hg.mozilla.org/build/buildapi/rev/5280524601ff

A rerun of the whole script hit the same properties JSON failure, so I suspect there was at least one other build (not a release) that had the same problem. I then proposed a patch that jhopkins said looked fine on IRC, and zeller r+'ed for me, which I landed in:
http://hg.mozilla.org/build/buildapi/rev/f988446ed820
[along with the reland of ben's patch in http://hg.mozilla.org/build/buildapi/rev/c4e7469f95fd]

I forced a buildapi update and re-ran the script, which was successful. I also forced nagios to recheck itself, which showed the 4hr file was up to date. Trees were reopened at 20:31:28 EDT (17:31:28 PT), which is within the 4 hour threshold, so we did not need to do magic to 'fix' tbpl (which reads the 4 hour file for info about the jobs it needs to care about). This fix did indeed fix the cron job mentioned above as well.

---

We probably will want to investigate why the basedir property got SO messed up.

---

As I write this I looked again at the build with the borked property, and it looks like it was reading output from a *previous run* of the job, where it was still performing a repack, while it attempted to read stdout for the basedir property. Absolutely no idea how that can happen.

Since the tree-closing issue is fixed, I'm resolving this bug.
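The landed patch isn't quoted here, but a defensive fix of this general shape (tolerating one corrupt JSON column value instead of letting the whole builds-4hr report die) would look roughly like the following. This is a sketch under assumptions: loads_or_none is an invented name, not the actual buildapi jsoncol.py code.

```python
import json

def loads_or_none(value):
    """Decode a JSON column value, returning None for corrupt data
    instead of raising and killing the entire report run."""
    if value is None:
        return None
    try:
        return json.loads(value)
    except ValueError:  # e.g. "Unterminated string starting at ..."
        return None

# One well-formed value and one truncated one, like the borked
# basedir property on build 25794014.
good = loads_or_none('{"basedir": "/builds/slave/ok"}')
bad = loads_or_none('{"basedir": "e:/builds/moz2_slave/rel-m-')
```

Callers then skip builds whose properties decode to None rather than aborting, which matches the observed outcome: after the fix the script ran to completion even though the bad row was still in the database.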
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Summary: tree closure → tree closure due to broken report for builds-4hr
Comment 4 • 11 years ago
Thanks for debugging this. Looking at http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/builders/release-mozilla-beta-win32_repack_10%2F10/builds/24 I see the get_basedirs step has unexpected output in it. Given that this slave also had problems in bug 898642, and that it ran a bunch of exception jobs without a reboot, this looks like Windows screwing up input streams, which is pretty common when processes aren't killed properly.
Updated • 11 years ago
Product: mozilla.org → Release Engineering