Closed Bug 712988 Opened 13 years ago Closed 13 years ago

update_from_files.py jobs on some masters are receiving tracebacks, causing missing jobs from JSON files & TBPL

Categories

(Release Engineering :: General, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: bear)

References

Details

(Whiteboard: [buildduty][outage])

I don't know whether they're being run or not being shown, but most pushes across m-c, m-i, m-a, and m-b only show a quarter or so of the total spread of Android tests. May or may not be related to bug 712877.
philor - was this due to the inrush of phantom android-xul jobs?

those have been cleared now.
Nope, afraid not, this goes back into yesterday. It's hard to point at any one thing since there's always coalescing to confuse things, but take https://tbpl.mozilla.org/?tree=Mozilla-Aurora&onlyunstarred=1&rev=e08be766bdba - it had both on-push builds and nightlies, and out of all that barely shows any jobs. https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=6707b2415598 is entangled in the reconfig fun, and the bug 712877 fun, and yet another db meltdown, but it's still quite a few tests shy of a full xul load (and the next push up, which grabbed native tests for it, isn't quite all there, either)
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=android and hit the green down-arrow at the bottom, twice.
Priority: -- → P1
Whiteboard: [buildduty]
Depends on: 713092
Whiteboard: [buildduty] → [buildduty][outage]
Assignee: nobody → bear
This may or may not actually be fixed, but I think we decided we didn't care whether or not it was, or something.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Depends on: 713174
And indeed it is not fixed: after bug 713174, I retriggered both Android builds on the tip of m-i, got full sets of tests scheduled, picked up, and, according to self-serve, run, and out of all that I'm going to get pretty much the same limited set actually reported to builds-4hr that we've had for the last, what is it, 36 hours?

Can't help thinking there might be something significant in the pattern of what generally does get reported (mostly M5 and M6 for native, sometimes some of the others but pretty much never M8; dhtml and tp for the freshly returned xul talos, never any of the rest; xul much more prone to reporting J1 and R3 than any of the rest of those suites), though I don't know what.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Back in the tree status leaderboard:

mozilla-inbound, mozilla-central, mozilla-aurora and mozilla-beta are closed for this.

Reopening condition:

The tip push on mozilla-inbound needs to be showing as finished (green or orange (or red in the single case of tpan on xul, I think it's permared)) every single test that is run and visible, for both the xul and native flavors of Android - all 8 mochitests, all of the talos, browser-chrome and both crashtests and all three reftests and jsreftests for xul, everything.
Verified that the string "Android XUL Tegra 250 mozilla-inbound opt test mochitest-5" does not exist in http://build.mozilla.org/builds/builds-4hr.js.gz, despite two of them having finished in the last 4 hours, one having both started and finished in the last 2 hours.

Whatever self-serve queries has the data from all the Android runs.

If the job that builds builds-4hr queries the same thing, then for some reason it either doesn't see a lot of the jobs, or discards them. If it queries something else, then a lot of them probably aren't making it into that.
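
For reference, this check can be reproduced by hand with a throwaway script like the one below (a hypothetical spot-check, not part of any existing tooling; Python 2 to match the environment the masters run). It just pulls down builds-4hr.js.gz and looks for the builder string:

import gzip
import urllib2
from StringIO import StringIO

URL = 'http://build.mozilla.org/builds/builds-4hr.js.gz'
BUILDER = 'Android XUL Tegra 250 mozilla-inbound opt test mochitest-5'

# Fetch the gzipped JSON and decompress it in memory.
compressed = urllib2.urlopen(URL).read()
text = gzip.GzipFile(fileobj=StringIO(compressed)).read()

# A plain substring search is enough to tell whether the builder is reported at all.
print BUILDER, ('is' if BUILDER in text else 'is NOT'), 'in builds-4hr'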
This is the exploration I've done:

I checked that bm19 and bm20 have update.log
I checked those logs for any immediate errors - both were showing recent good runs

I scanned the master builds for the mochitest-5 job that philor mentions and found it on both; for example:

http://buildbot-master20.build.mtv1.mozilla.com:8201/builders/Android%20XUL%20Tegra%20250%20mozilla-inbound%20opt%20test%20mochitest-5/builds/3

shows a job for revision 9ed956b9a3b1 that was handled by tegra-276

I looked in builds-4hr.js.gz for:
  that revision - found the revision
  the tegra - found it
  that test - not found
  for any Android XUL test - found them

I did find that test in update.log:

---------------------------------------------------------------------------
Starting update at Wed Dec 21 14:40:02 2011
./mozilla-beta_tegra_android-o_test-reftest-1 0 1/1 0.00% complete ETA in 0 seconds
./mozilla-inbound_tegra_android-xul_test-mochitest-5 0 1/1 33.33% complete ETA in 131 seconds
./mozilla-beta_tegra_android-o_test-remote-tp4m 0 1/1 66.67% complete ETA in 44 seconds

I did find in update.log a traceback that looks "interesting":

Traceback (most recent call last):
  File "/builds/buildbot/tests1-tegra/buildbotcustom/bin/update_from_files.py", line 300, in <module>
    updated = updateFromFiles(session, options.master, options.name, builders, last_time, options.times)
  File "/builds/buildbot/tests1-tegra/buildbotcustom/bin/update_from_files.py", line 219, in updateFromFiles
    db_build.updateFromBBBuild(session, build)
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/buildbotcustom/status/db/model.py", line 483, in updateFromBBBuild
    mysteps = dict((s.name, s) for s in self.steps)
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/attributes.py", line 163, in __get__
    instance_dict(instance))
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/attributes.py", line 383, in get
    value = callable_(passive=passive)
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/strategies.py", line 646, in __call__
    result = q.all()
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 1492, in all
    return list(self)
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 1603, in __iter__
    self.session._autoflush()
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 843, in _autoflush
    self.flush()
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 1359, in flush
    self._flush(objects)
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 1440, in _flush
    flush_context.execute()
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/unitofwork.py", line 299, in execute
    rec.execute(self)
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/unitofwork.py", line 401, in execute
    self.dependency_processor.process_saves(uow, states)
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/dependency.py", line 1018, in process_saves
    secondary_update, secondary_delete)
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/dependency.py", line 1039, in _run_crud
    result.rowcount)
sqlalchemy.orm.exc.StaleDataError: DELETE statement on table 'build_properties' expected to delete 11 row(s); Only 0 were matched.
That particular error sounds like the in-memory state of the tables didn't match what was in the database. It's hard to be sure exactly why that is, but it could be another victim of the DB slaves being slow, bug 712830.
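
For anyone not familiar with that exception, here is a minimal sketch of the failure mode catlee describes, using a made-up one-table schema and an in-memory SQLite DB (nothing to do with the real status schema, and written against the 0.7-era SQLAlchemy API these masters run): the session's in-memory state says a row exists, another writer removes it behind the session's back, and the flush's statement matches fewer rows than expected, so SQLAlchemy raises StaleDataError.

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Build(Base):
    # Stand-in table; the real error involved the 'build_properties' table.
    __tablename__ = 'builds_sketch'
    id = Column(Integer, primary_key=True)
    result = Column(String(20))
    version = Column(Integer, nullable=False)
    # A version column makes the ORM verify UPDATE rowcounts, which is the
    # simplest way to provoke this class of error on any backend.
    __mapper_args__ = {'version_id_col': version}

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine, expire_on_commit=False)()

build = Build(id=1, result='running')
session.add(build)
session.commit()

# Another writer (or a restored/repaired table) removes the row while this
# session still believes it exists.
engine.execute("DELETE FROM builds_sketch WHERE id = 1")

build.result = 'success'
session.commit()  # UPDATE matches 0 rows -> sqlalchemy.orm.exc.StaleDataError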
bm19 and bm20 have not been running since last night.

I can see mochitest-5 showing up for Android.

Can we re-open the tree?

We should keep this bug open nevertheless to discover the root cause before turning those 2 masters on (IIUC).

https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=Android%20XUL%20Tegra%20250%20mozilla-inbound%20opt%20test%20mochitest
This isn't affecting just Android, updating summary.

Armen, bear, and I have been investigating this some more this morning. Here's the latest:
* Using a recent push on fx-team, we managed to figure out that all the missing jobs are on a few masters: test-master01, buildbot-master06, 14, and 15.
* All of these masters have tracebacks in their update_from_files.py log (~cltbld/update.log) that are very recent. None of the other masters have tracebacks this recent. (Some have tracebacks from last night, but those can likely be attributed to the database master outage.)
* Given the above, I've concluded that these masters are effectively broken. I've set a graceful shutdown on bm06, 14, and 15. I've left tm01 running, because it's our only Tegra master at this point.

Still working on how to fix the jobs on these masters, though.
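
The per-master check was essentially "does ~cltbld/update.log contain a traceback in one of its recent runs". A rough helper along these lines (hypothetical, not part of buildbotcustom; it assumes you've pulled a copy of each master's update.log locally, and the filenames below are just placeholders) would be:

import sys

def last_traceback_time(path):
    """Return the 'Starting update at ...' timestamp of the last run in the
    log that hit a traceback, or None if there are no tracebacks at all."""
    last_start = None
    last_traceback = None
    for line in open(path):
        if line.startswith('Starting update at'):
            last_start = line[len('Starting update at'):].strip()
        elif line.startswith('Traceback (most recent call last):'):
            last_traceback = last_start
    return last_traceback

for logfile in sys.argv[1:]:
    when = last_traceback_time(logfile)
    print '%s: last traceback in run started %s' % (logfile, when or 'never')

Run as, e.g., python check_update_logs.py bm06-update.log bm14-update.log bm15-update.log.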
Summary: 75% of Android jobs don't show up in tbpl → update_from_files.py jobs on some masters are receiving tracebacks, causing missing jobs from JSON files & TBPL
Depends on: 713238
Blocks: 713271
Now that the DB repair is completed the update_from_files.py jobs are successfully submitting data. Once they all finish submitting the old data I'm going to bring up the masters again and then retrigger some jobs.
No longer depends on: 713092
Summary of everything from today:
This morning I took the lead on fixing up the DB issues we've been having, and getting the tree re-opened. The issue holding the tree closed this morning was tons of missing test jobs on TBPL. After much analysis, largely done by Armen, we confirmed that the missing test jobs corresponded to masters that were having issues importing pickle data into the status DB. Each of these masters had errors like this in the import log:
sqlalchemy.orm.exc.StaleDataError: DELETE statement on table 'build_properties' expected to delete 11 row(s); Only 0 were matched.

Catlee speculated that this could be an issue with the DB, possibly a busted index, and I filed https://bugzilla.mozilla.org/show_bug.cgi?id=713238 to get IT to repair the table. After a bit of chasing them down by John O, mpressman started the repair. He chose to do a full repair on the DB (probably the right thing to do, even though it was only one table we were seeing an error with). During this, IT decided they wanted to try to upgrade the RAM on this machine, since it was effectively down anyway. They did attempt this, but something happened and they weren't able to complete it. mpressman made another optimization to the DB, detailed in https://bugzilla.mozilla.org/show_bug.cgi?id=713238#c5.
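
(For context, the table-level check/repair here is the stock MySQL CHECK TABLE / REPAIR TABLE pair. The sketch below shows the shape of it via MySQLdb purely for illustration; the host, credentials, and database name are placeholders, the actual repair was run by the DBAs directly, and REPAIR TABLE only applies to MyISAM-style tables anyway.)

import MySQLdb

# Placeholder connection details; the real host/credentials live with the DBAs.
conn = MySQLdb.connect(host='statusdb.example.internal', user='status',
                       passwd='********', db='status')
cur = conn.cursor()

cur.execute("CHECK TABLE build_properties")
print cur.fetchall()   # anything other than Msg_text == 'OK' means trouble

cur.execute("REPAIR TABLE build_properties")   # MyISAM only; InnoDB needs a rebuild
print cur.fetchall()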

While the DB was down, I decided to shut down all the Buildbot masters. No jobs could be run anyway, and it had been a while since we'd done a full restart of the pool.

Once the DB was back up, I watched the status DB import jobs, which all completed their imports without issue. I was hoping to see data get backfilled, and have the missing jobs show up on TBPL after that, but they didn't. I don't have a great explanation for this, but I'm not too surprised, given the state of the database last night/this morning.

Next, I triggered one of the missing builds on the latest push on Fx-Team. Very quickly it showed up as a running build on TBPL. Once the job completed, it showed up as completed, which means to me that things are fixed!

After that, I brought up all the masters again. I found someone in #developers who was ready to push (tbsaunde) to Mozilla-Inbound. He pushed, and jobs started showing up normally.

Bear and I noticed that a bunch of slaves weren't coming up. It turned out this was because all of the MPT slaves were hammering the Puppet master and couldn't get a successful run. I "fixed" this by killing Puppet on all of them, to give the master a break and bring the slaves up. They'll be fine on their next reboot, now that load has settled on the master.

At this moment, we're waiting for the Inbound push to fully complete before we re-open. I'm quite confident this will be the case, and have handed off control of the tree to Philor. He'll be re-opening when he's confident in the infrastructure.
Not sure whether there are still loose ends that need an open bug, but we're unblockered, trees are open :)
Severity: blocker → normal
OS: Android → All
Hardware: ARM → All
this has been fixed
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
OK. So, then, why was there no Linux 32-bit Nightly on 20120104?
(In reply to Bill Gianopoulos [:WG9s] from comment #16)
> OK. So, then, why was there no Linux 32-bit Nightly on 20120104?

this isn't the bug or place to ask this question, you should ask in #developers or file a standalone releng bug
(In reply to Mike Taylor [:bear] from comment #17)
> (In reply to Bill Gianopoulos [:WG9s] from comment #16)
> > OK. So, then, why was there no Linux 32-bit Nightly on 20120104?
> 
> this isn't the bug or place to ask this question, you should ask in
> #developers or file a standalone releng bug

I already did that with no answer.
Just trying to say it is odd to be closing this just as we had the first unexplained "nightly build did not run" issue.
(In reply to Bill Gianopoulos [:WG9s] from comment #19)
> Just trying to say it is odd to be trying to close this just as we had the
> first unexplained nightly build did not run issue.

this bug isn't about nightlies not being started - it was about tbpl not getting proper information.
OK, just thought it odd that it was closed after a bit of time, right at the same time we actually had the next build issue, is all. If it was a completely different issue, that is fine with me.
Product: mozilla.org → Release Engineering