Closed Bug 898688 Opened 11 years ago Closed 11 years ago

tree closure due to broken report for builds-4hr

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: Callek)

References

Details

Filing this placeholder, while people work on the actual bustage. From irc, this might be caused by a code landing on Monday, but unclear why thats biting us now. 


Trees closed at: 2013-07-26 15:50:20
Trees Reopened just now, I'll dump out state here in just a few minutes, for history.
Assignee: nobody → bugspam.Callek
from irc: 

1) It looks like json fails to parse properties from http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/builders/release-mozilla-beta-win32_repack_10%2F10/builds/24
...which broke buildapi reading in the properties, which completely broke writing builds-4hr which is what is holding trees closed

2) We have an alert monitoring builds-4h which alerted us to this before sheriffs or any devs noticed.
Ok, so timeline with inline details as I learned it.


At 5:16pm ET (2:16pm PT) we started getting cron spam for some of the reporting we do:

===============================
Date: Fri, 26 Jul 2013 15:31:29 -0700
Message-Id: <201307262231.r6QMVTOw027582@buildapi01.build.scl1.mozilla.com>
From: root@buildapi01.build.scl1.mozilla.com (Cron Daemon)
To: release@mozilla.com
Subject: Cron <buildapi@buildapi01> /home/buildapi/bin/report-today.sh

Traceback (most recent call last):
  File "/home/buildapi/src/buildapi/scripts/reporter.py", line 353, in <module>
    report = build_report(R, session, scheduler_db_engine, starttime, endtime)
  File "/home/buildapi/src/buildapi/scripts/reporter.py", line 218, in build_report
    build_dict = get_build_dict()
  File "/home/buildapi/src/buildapi/scripts/reporter.py", line 174, in get_build_dict
    for prop in build.properties:
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/orm/attributes.py", line 168, in __get__
    return self.impl.get(instance_state(instance),dict_)
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/orm/attributes.py", line 420, in get
    value = self.callable_(state, passive)
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/orm/strategies.py", line 548, in _load_for_state
    result = q.all()
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 1677, in all
    return list(self)
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 1912, in instances
    rows = [process[0](row, None) for row in fetch]
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/orm/mapper.py", line 2578, in _instance
    populate_state(state, dict_, row, isnew, only_load_props)
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/orm/mapper.py", line 2445, in populate_state
    populator(state, dict_, row)
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/orm/strategies.py", line 150, in new_execute
    dict_[key] = row[col]
  File "/home/buildapi/lib/python2.6/site-packages/sqlalchemy/types.py", line 572, in process
    return process_value(value, dialect)
  File "/home/buildapi/src/buildapi/model/jsoncol.py", line 21, in process_result_value
    return json.loads(value)
  File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.6/json/decoder.py", line 336, in raw_decode
    obj, end = self._scanner.iterscan(s, **kw).next()
  File "/usr/lib64/python2.6/json/scanner.py", line 55, in iterscan
    rval, next_pos = action(m, context)
  File "/usr/lib64/python2.6/json/decoder.py", line 155, in JSONString
    return scanstring(match.string, match.end(), encoding, strict)
ValueError: Unterminated string starting at: line 1 column 0 (char 0)

=======================================================

Which I pretty much ignored since I was busy doing other things and didn't think it needed to be triaged right away.

(This cron spam was coming once every 15 minutes)

** Next **

At about 6:45pm ET// 3:45pm PT I noticed this message in #buildduty:
Fri 15:44:13 PDT [4273] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 1:30:47 ago - 1597514 bytes in 0.081 second response time (http://m.allizom.org/http+file+age+-+/buildjson/builds-4hr.js.gz)

At which point I knew we had to close trees, so I did so...

(closed just a few minutes later, noted in c#0)

I then begun my work to figure out what was wrong...

Info in the builds4hr log showed the same traceback as the cron spam we had been getting.

Next I started debugging in a python shell what the problem was...

With the steps from http://mxr.mozilla.org/build/source/buildapi/buildapi/scripts/reporter.py I could verify that build.properties itself was throwing.

So checking out what build that was:

>>> n = len(all_builds)
11696
>>> build.id
25794014L
>>> build.__dict__
{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0xa3b8e10>, 'master_id': 77L, 'endtime': datetime.datetime(2013, 7, 26, 21, 13, 53), 'builder_id': 176064L, 'buildnumber': 24L, 'lost': False, 'reason': u"The web-page 'rebuild' button was pressed by '<unknown>': \n", 'result': 5L, 'starttime': datetime.datetime(2013, 7, 26, 21, 13, 52), 'source_id': 4976243L, 'slave_id': 543L, 'id': 25794014L}

Then using http://mxr.mozilla.org/build/source/buildapi/buildapi/model/statusdb_orm.py as my guide a bit more investigation

>>> build.master
<buildapi.model.statusdb_orm.Master object at 0xf278650>
>>> build.master.name
u'bm58-build1'
>>> build.slave
<buildapi.model.statusdb_orm.Slave object at 0xb71ead0>
>>> build.slave.name
u'mw32-ix-slave07'
>>> build.builder
<buildapi.model.statusdb_orm.Builder object at 0xb71ed10>
>>> build.builder.id
176064L
>>> build.builder.name
u'release-mozilla-beta-win32_repack_10'
>>> build.builder.master_id
77L

so looking slightly closer I find http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/builders/release-mozilla-beta-win32_repack_10%2F10/builds/24 which has a totally borked basedir value. [of "e:/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/obj-l10n/_virtualenv/Scripts/python.exe /e/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/config/nsinstall.py -t -D ../../dist/xpi-stage/locale-zh-TW/browser/defaults/profile/ e:/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/obj-l10n/_virtualenv/Scripts/python.exe /e/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/config/Preprocessor.py \ -I /e/builds/moz2_slave/rel-m-beta-w32_rpk_10- .. [property value too long]"] 

Since its a release build I suspect[ed] that backing out the recent cset that shows release builds (and others) in buildapi's recent is a good enough fix on a late friday night.

So I back that out and update:
http://hg.mozilla.org/build/buildapi/rev/eb629a2c049d
http://hg.mozilla.org/build/buildapi/rev/5280524601ff

A rerun of the whole script gets the same properties json failure, so I suspect there was at least 1 other build (not release) that had the same problem.

I then proposed a patch that jhopkins said looked fine on IRC, and zeller r+'ed for me. Which I landed in:
http://hg.mozilla.org/build/buildapi/rev/f988446ed820
[Along with the reland of ben's patch with http://hg.mozilla.org/build/buildapi/rev/c4e7469f95fd]

Forced a buildapi update and re-ran the script again, which was successful. I also forced nagios to recheck itself which showed the 4hr file was up to date.

Trees were reopened at 20:31:28 EDT (17:31:28 PT) which is within the 4 hour threshold so we did not need to do magic to 'fix' tbpl (which reads the 4 hour file for info about the jobs it needs to care about).

This fix did indeed fix the cron job mentioned above as well.

---

We probably will want to investigate why the basedir property got SOOO messed up.

--- 
As I write this I looked again at the build with the borked property, and it looks like it was reading output from a *previous run* of the job, where it was still performing a repack, while it attempted to read stdout from the basedir for the property.... absolutely no idea how that can happen.

Since the tree closing issue is fixed, I'm resolving this bug.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Summary: tree closure → tree closure due to broken report for builds-4hr
Thanks for debugging this.

Looking at http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/builders/release-mozilla-beta-win32_repack_10%2F10/builds/24 I see get_basedirs step has unexpected output in it. Given that this slave also had problems in bug 898642, and it had a bunch of exception jobs without a reboot, this looks like Windows screwing up input streams. This is pretty common when processes aren't killed properly.
Product: mozilla.org → Release Engineering
Blocks: 926246
You need to log in before you can comment on or make changes to this bug.