Closed
Bug 712988
Opened 13 years ago
Closed 13 years ago
update_from_files.py jobs on some masters are receiving tracebacks, causing missing jobs from JSON files & TBPL
Categories
(Release Engineering :: General, defect, P1)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Assigned: bear)
References
Details
(Whiteboard: [buildduty][outage])
I don't know whether they're being run or not being shown, but most pushes across m-c, m-i, m-a, and m-b only show a quarter or so of the total spread of Android tests. May or may not be related to bug 712877.
Assignee
Comment 1•13 years ago
philor - was this due to the inrush of phantom android-xul jobs? Those have been cleared now.
Reporter
Comment 2•13 years ago
Nope, afraid not, this goes back into yesterday. It's hard to point at any one thing since there's always coalescing to confuse things, but take https://tbpl.mozilla.org/?tree=Mozilla-Aurora&onlyunstarred=1&rev=e08be766bdba - it had both on-push builds and nightlies, and out of all that barely shows any jobs. https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=6707b2415598 is entangled in the reconfig fun, and the bug 712877 fun, and yet another db meltdown, but it's still quite a few tests shy of a full xul load (and the next push up, which grabbed native tests for it, isn't quite all there, either)
Reporter
Comment 3•13 years ago
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=android and hit the green down-arrow at the bottom, twice.
Updated•13 years ago
Priority: -- → P1
Whiteboard: [buildduty]
Assignee
Updated•13 years ago
Assignee: nobody → bear
Reporter
Comment 4•13 years ago
This may or may not actually be fixed, but I think we decided we didn't care whether or not it was, or something.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Reporter
Comment 5•13 years ago
And indeed it is not fixed: after bug 713174, I retriggered both Android builds on the tip of m-i, got full sets of tests scheduled, picked up, and, according to self-serve, run, and out of that I'm going to get pretty much the same limited set actually reported to builds-4hr that we've had for the last, what is it, 36 hours? Can't help thinking there might be something significant in the pattern of what generally does get reported (mostly M5 and M6 for native, sometimes some of the others but pretty much never M8; dhtml and tp for the freshly returned xul talos, never any of the rest; xul much more prone to reporting J1 and R3 than any of the rest of those suites), though I don't know what.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 6•13 years ago
Back in the tree status leaderboard: mozilla-inbound, mozilla-central, mozilla-aurora and mozilla-beta are closed for this. Reopening condition: The tip push on mozilla-inbound needs to be showing as finished (green or orange (or red in the single case of tpan on xul, I think it's permared)) every single test that is run and visible, for both the xul and native flavors of Android - all 8 mochitests, all of the talos, browser-chrome and both crashtests and all three reftests and jsreftests for xul, everything.
Reporter
Comment 7•13 years ago
Verified that the string "Android XUL Tegra 250 mozilla-inbound opt test mochitest-5" does not exist in http://build.mozilla.org/builds/builds-4hr.js.gz, despite two of them having finished in the last 4 hours, one having both started and finished in the last 2 hours. Whatever self-serve queries has the data from all the Android runs. If the job that builds builds-4hr queries the same thing, then for some reason it either doesn't see a lot of the jobs, or discards them. If it queries something else, then a lot of them probably aren't making it into that.
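The check described in this comment is essentially a substring search over a gzipped status file: a job that finished in the last four hours should appear in builds-4hr.js.gz by builder name. A minimal sketch, with invented sample content (the real file is a large JSON-ish blob fetched from the URL above):

```python
import gzip

# Invented stand-in for builds-4hr.js.gz; only the shape matters here.
sample = (b'{"builds": [{"buildername": '
          b'"Android Tegra 250 mozilla-inbound opt test mochitest-1"}]}')
blob = gzip.compress(sample)

def builder_reported(gz_bytes, buildername):
    """Return True if the builder name occurs anywhere in the gzipped file."""
    text = gzip.decompress(gz_bytes).decode("utf-8", errors="replace")
    return buildername in text

# The missing mochitest-5 job fails this check, matching what was observed:
print(builder_reported(blob, "Android XUL Tegra 250 mozilla-inbound opt test mochitest-5"))
```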
Assignee
Comment 8•13 years ago
This is the exploration I've done:

I checked that bm19 and bm20 have update.log, and checked that for any immediate error - both were showing recent good runs.

I scanned the master builds for the mochitest-5 job that philor mentions and I find them on both; for example, http://buildbot-master20.build.mtv1.mozilla.com:8201/builders/Android%20XUL%20Tegra%20250%20mozilla-inbound%20opt%20test%20mochitest-5/builds/3 shows a job for revision 9ed956b9a3b1 that was handled by tegra-276.

Looked in builds-4hr.js.gz for:
- that revision - found the revision
- the tegra - found it
- that test - not found
- any Android XUL test - found them

I did find that test in update.log:

---------------------------------------------------------------------------
Starting update at Wed Dec 21 14:40:02 2011
./mozilla-beta_tegra_android-o_test-reftest-1 0 1/1 0.00% complete ETA in 0 seconds
./mozilla-inbound_tegra_android-xul_test-mochitest-5 0 1/1 33.33% complete ETA in 131 seconds
./mozilla-beta_tegra_android-o_test-remote-tp4m 0 1/1 66.67% complete ETA in 44 seconds

I did find in update.log a traceback that looks "interesting":

Traceback (most recent call last):
  File "/builds/buildbot/tests1-tegra/buildbotcustom/bin/update_from_files.py", line 300, in <module>
    updated = updateFromFiles(session, options.master, options.name, builders, last_time, options.times)
  File "/builds/buildbot/tests1-tegra/buildbotcustom/bin/update_from_files.py", line 219, in updateFromFiles
    db_build.updateFromBBBuild(session, build)
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/buildbotcustom/status/db/model.py", line 483, in updateFromBBBuild
    mysteps = dict((s.name, s) for s in self.steps)
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/attributes.py", line 163, in __get__
    instance_dict(instance))
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/attributes.py", line 383, in get
    value = callable_(passive=passive)
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/strategies.py", line 646, in __call__
    result = q.all()
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 1492, in all
    return list(self)
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 1603, in __iter__
    self.session._autoflush()
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 843, in _autoflush
    self.flush()
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 1359, in flush
    self._flush(objects)
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 1440, in _flush
    flush_context.execute()
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/unitofwork.py", line 299, in execute
    rec.execute(self)
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/unitofwork.py", line 401, in execute
    self.dependency_processor.process_saves(uow, states)
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/dependency.py", line 1018, in process_saves
    secondary_update, secondary_delete)
  File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/sqlalchemy/orm/dependency.py", line 1039, in _run_crud
    result.rowcount)
sqlalchemy.orm.exc.StaleDataError: DELETE statement on table 'build_properties' expected to delete 11 row(s); Only 0 were matched.
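For readers unfamiliar with StaleDataError: it is SQLAlchemy's optimistic rowcount check firing during a flush. A minimal sketch of the mechanism, using plain sqlite3 rather than SQLAlchemy, with an invented table and values mirroring the error message:

```python
import sqlite3

# The importer's in-memory session believes a build has 11 property rows, so
# its flush issues a DELETE and checks that exactly 11 rows were matched. If
# something else (another writer, or a damaged index on the real DB) has
# already removed them, the count comes back 0 and the flush aborts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE build_properties (build_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO build_properties VALUES (1, ?)",
                 [("prop%d" % i,) for i in range(11)])
conn.commit()

# Rows vanish out from under the importer:
conn.execute("DELETE FROM build_properties WHERE build_id = 1")
conn.commit()

# The importer's flush now deletes the 11 rows it believes exist:
cur = conn.execute("DELETE FROM build_properties WHERE build_id = 1")
expected, matched = 11, cur.rowcount
stale = matched != expected
if stale:
    print("DELETE statement on table 'build_properties' expected to delete "
          "%d row(s); Only %d were matched." % (expected, matched))
```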
Comment 9•13 years ago
That particular error sounds like the in-memory state of the tables didn't match what was in the database. It's hard to be sure exactly why that is, but it could be another victim of the DB slaves being slow, bug 712830.
Comment 10•13 years ago
bm19 and bm20 have not been running since last night. I can see mochitest-5 showing up for Android. Can we re-open the tree? We should keep this bug open nevertheless to discover the root cause before turning those two masters on (IIUC). https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=Android%20XUL%20Tegra%20250%20mozilla-inbound%20opt%20test%20mochitest
Comment 11•13 years ago
This isn't affecting just Android; updating summary. Armen, bear, and I have been investigating this some more this morning. Here's the latest:
* Using a recent push on fx-team, we managed to figure out that all the missing jobs are on a few masters: test-master01, buildbot-master06, 14, and 15.
* All of these masters have tracebacks in their update_from_files.py log (~cltbld/update.log) that are very recent. None of the other masters have tracebacks this recent. (Some have tracebacks from last night, but that can likely be attributed to the database master outage.)
* Given the above, I've concluded that these masters are effectively broken. I've set a graceful shutdown on bm06, 14, and 15. I've left tm01 running, because it's our only Tegra master at this point.
Still working on how to fix the jobs on these masters, though.
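The triage step described above (flagging masters whose update.log has a recent traceback) can be sketched roughly as follows. The log format is modeled on the excerpt in comment 8; the sample text and the helper are illustrative, not the actual buildduty tooling:

```python
import re
from datetime import datetime

# Invented sample of a master's update.log, shaped like the comment 8 excerpt.
SAMPLE_LOG = """\
Starting update at Wed Dec 21 14:40:02 2011
./mozilla-inbound_tegra_android-xul_test-mochitest-5 0 1/1 33.33% complete
Traceback (most recent call last):
  File "update_from_files.py", line 300, in <module>
Starting update at Wed Dec 21 15:10:02 2011
./mozilla-beta_tegra_android-o_test-reftest-1 0 1/1 0.00% complete
"""

def last_traceback_run(log_text):
    """Return the start time of the most recent update run that hit a Traceback."""
    current = None   # timestamp of the run we're currently inside
    latest = None    # timestamp of the latest run containing a traceback
    for line in log_text.splitlines():
        m = re.match(r"Starting update at (.+)$", line)
        if m:
            current = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
        elif line.startswith("Traceback") and current is not None:
            latest = current
    return latest

print(last_traceback_run(SAMPLE_LOG))  # the 14:40:02 run is the broken one
```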
Summary: 75% of Android jobs don't show up in tbpl → update_from_files.py jobs on some masters are receiving tracebacks, causing missing jobs from JSON files & TBPL
Comment 12•13 years ago
Now that the DB repair is completed the update_from_files.py jobs are successfully submitting data. Once they all finish submitting the old data I'm going to bring up the masters again and then retrigger some jobs.
No longer depends on: 713092
Comment 13•13 years ago
Summary of everything from today:

This morning I took the lead on fixing up the DB issues we've been having, and getting the tree re-opened. The issue holding the tree closed this morning was tons of missing test jobs on TBPL. After much analysis, largely done by Armen, we confirmed that the missing test jobs corresponded to masters that were having issues importing pickle data into the status DB. Each of these masters had errors like this in the import log:

sqlalchemy.orm.exc.StaleDataError: DELETE statement on table 'build_properties' expected to delete 11 row(s); Only 0 were matched.

Catlee speculated that this could be an issue with the DB, possibly a busted index, and I filed https://bugzilla.mozilla.org/show_bug.cgi?id=713238 to get IT to repair the table. After a bit of chasing them down, done by John O, mpressman started the repair. He chose to do a full repair on the DB (probably the right thing to do, even though it was only one table we were seeing an error with). During this, IT decided they wanted to try to upgrade the RAM on this machine, since it was effectively down anyways. They did attempt this, but something happened and they weren't able to complete it. mpressman made another optimization to the DB, detailed in https://bugzilla.mozilla.org/show_bug.cgi?id=713238#c5.

While the DB was down, I decided to shut down all the Buildbot masters. No jobs could be run anyways, and it's been a while since we've done a full restart of the pool.

Once the DB was back up, I watched the status DB import jobs, which all completed their imports without issue. I was hoping to see data get backfilled, and have the missing jobs show up on TBPL after that, but they didn't. I don't have a great explanation for this, but I'm not too surprised, given the state of the database last night/this morning.

Next, I triggered one of the missing builds on the latest push on Fx-Team. Very quickly it showed up as a running build on TBPL. Once the job completed, it showed up as completed, which tells me that things are fixed! After that, I brought up all the masters again. I found someone in #developers who was ready to push (tbsaunde) to Mozilla-Inbound. He pushed, and jobs started showing up normally.

Bear and I noticed that a bunch of slaves weren't coming up. It turned out this was because all of the MPT slaves were hammering the Puppet master, and couldn't get a successful run. I "fixed" this by killing Puppet on all of them, to give the master a break and bring the slaves up. They'll be fine on their next reboot, now that load has settled on the master.

At this moment, we're waiting for the Inbound push to fully complete before we re-open. I'm quite confident this will be the case, and have handed off control of the tree to Philor. He'll be re-opening when he's confident in the infrastructure.
Reporter
Comment 14•13 years ago
Not sure whether there are still loose ends that need an open bug, but we're unblockered, trees are open :)
Severity: blocker → normal
OS: Android → All
Hardware: ARM → All
Assignee
Comment 15•13 years ago
This has been fixed.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Comment 16•13 years ago
OK. So, then, why was there no Linux 32-bit Nightly on 20120104?
Assignee
Comment 17•13 years ago
(In reply to Bill Gianopoulos [:WG9s] from comment #16)
> OK. So, then, why was there no Linux 32-bit Nightly on 20120104?

this isn't the bug or place to ask this question, you should ask in #developers or file a standalone releng bug
Comment 18•13 years ago
(In reply to Mike Taylor [:bear] from comment #17)
> (In reply to Bill Gianopoulos [:WG9s] from comment #16)
> > OK. So, then, why was there no Linux 32-bit Nightly on 20120104?
>
> this isn't the bug or place to ask this question, you should ask in
> #developers or file a standalone releng bug

I already did that, with no answer.
Comment 19•13 years ago
Just trying to say it is odd to be closing this just as we had the first unexplained nightly-build-did-not-run issue.
Assignee
Comment 20•13 years ago
(In reply to Bill Gianopoulos [:WG9s] from comment #19)
> Just trying to say it is odd to be trying to close this just as we had the
> first unexplained nightly build did not run issue.

this bug isn't about nightlies not being started - it was about tbpl not getting proper information.
Comment 21•13 years ago
OK. Just thought it odd that it was closed after a bit of time, at the same time we actually had the next build issue, is all. If it was a completely different issue, that is fine with me.
Updated•11 years ago
Product: mozilla.org → Release Engineering