Closed Bug 613832 Opened 15 years ago Closed 15 years ago

Test jobs not being scheduled promptly

Categories

(Release Engineering :: General, defect, P2)

x86
All
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: dustin)

References

Details

(Whiteboard: [buildbot])

Attachments

(2 files)

We're not scheduling tests quickly (or at all?) for some pushes. For example rev 0601ba3e9395 on mozilla-central. One of the compiles: Linux x86-64 mozilla-central leak test build scheduled: 2010-11-20 16:49:26 started: 2010-11-20 16:49:51 ended: 2010-11-20 18:16:49 result: green The sendchange took place at Sat Nov 20 18:02:25 2010 and succeeded. Log on pm01 of it being received 2010-11-20 18:02:28-0800 [Broker,2377,10.2.90.74] adding change, who sendchange-unittest, 2 files, rev=0601ba3e939572d2ea0a773ec05c2b4e72af92ae, branch=mozilla-central-linux64-debug-unittest, repository=, comments , category None, project There's a change in the DB, changes.changeid=240857 on changes.branch=mozilla-central-linux64-debug-unittest at changes.when_timestamp=1290304948, aka Sat Nov 20 18:02:28 2010 PST. However the debug unit test don't get scheduled until 20:34. There's a sourcestamp_changes.sourcestampid=682672, and buildset.id=791317 with buildset.submitted_at=1290314071 (aka Sat Nov 20 20:34:31 2010 PST). That's a 2hour 32 minute gap from the sendchange which is unexplained. When the tests do run they're showing two other changes, for 6f28a60c1c71, baa51e6d4a15, for which the sendchanges are reported as 18:20 and 18:44. I have restarted pm02:tests-scheduler to try to free up other pushes which missing test jobs but that hasn't helped.
philor thinks he started seeing strangeness on Wednesday afternoon, where the 0.8.2 deployment was Wednesday morning. [21:09] <nthomas> was this an issue before DONTBUILD came along ? [21:10] <philor> yeah, :bs pushed something right before the last TM merge, and his builds didn't exist for an hour or more, but I think they then popped up as finished so, Thursday? [21:12] Wednesday <nthomas> I know https://wiki.mozilla.org/ReleaseEngineering:Maintenance is incomplete, so that's not much help [21:14] <philor> 0.8.2 was 6 to 9am Wednesday, that push was 1pm Wednesday [21:15] <philor> and, 10.6 opt tests ran like crap right out of the gate s/opt// <philor> two pushes where opt got just 1,3,oth, then one where debug got just X, then two where opt got nothing
Comment 1 is probably me being confusing, talking about two different things at once. I think that was an instance of bug 613429 (or, bug 613429 comment 2 if they really are separate things), where right after the 0.8.2 upgrade we were already starting to see builds that didn't show up on tbpl while they were running, though at the time I didn't know that I should have looked at builds-running for another cset to see if that was what was happening. I didn't start paying attention to "omg, where are the tests for that long-completed build that doesn't list pending tests?" until last night, so I don't know if we've been having two hour gaps since Wednesday or not.
See also 613429#c2 and onwards.
I believe that the test scheduler stuff is unrelated to DONTBUILD.
Regardless of the assignment of revisions to tests, this illustrates the scheduling problem. Recall that the sendchanges for the tests scheduled at 20:34 took place at 18:02, 18:20 and 18:44.
This is more consistent with something delaying the test schedulers from running than the revisions being assigned incorrectly.
Attached file Better DB data
Agreed. From this test job it looks like it started about midday on Saturday. Pending Tegra builds crossed some threshold to slow everything down ?
catlee suggested adding a logging line in schedulers/basic.py, in BasicScheduler's run(), to make sure the schedulers are being executed every minute. We then had some delayed Win7 opt tests on m-c 8a5f16022cbd. The sendchange was at 14:14 and jobs were not scheduled until 14:22. Grepping over the twistd.log we have 2010-11-21 13:35:08-0800 [-] FOO: Running tests-mozilla-central-win7-opt-unittest 2010-11-21 13:36:05-0800 [-] FOO: Running tests-mozilla-central-win7-opt-unittest 2010-11-21 13:37:05-0800 [-] FOO: Running tests-mozilla-central-win7-opt-unittest 2010-11-21 14:22:15-0800 [-] FOO: Running tests-mozilla-central-win7-opt-unittest 2010-11-21 14:22:15-0800 [-] FOO: tests-mozilla-central-win7-opt-unittest creating buildstamp with ssid=685535 rev=8a5f16022cbd88de7b4a5e25463c749882323310 2010-11-21 14:22:28-0800 [-] FOO: Running tests-mozilla-central-win7-opt-unittest 2010-11-21 14:23:04-0800 [-] FOO: Running tests-mozilla-central-win7-opt-unittest 2010-11-21 14:24:03-0800 [-] FOO: Running tests-mozilla-central-win7-opt-unittest So that scheduler just didn't run between 13:38 and 14:21. A variable number of schedulers was running each minute in that window. Interestingly there was a push to mozilla-central at 14:21 (which was picked up at 14:22:00 on pm01). There were sendchanges and other pushes detected in that window so it might just be a coincidence.
Priority: P1 → P2
s/BasicScheduler/Scheduler/ on comment #8.
And again 2010-11-21 16:06:09-0800 [-] FOO: Running tests-mozilla-central-fedora-opt-unittest 2010-11-21 16:07:06-0800 [-] FOO: Running tests-mozilla-central-fedora-opt-unittest 2010-11-21 16:43:39-0800 [-] FOO: Running tests-mozilla-central-fedora-opt-unittest 2010-11-21 16:43:39-0800 [-] FOO: tests-mozilla-central-fedora-opt-unittest creating buildstamp with ssid=686182 rev=4847e1cf6cf4725b9a36da1a94788374b0cab397 2010-11-21 16:43:53-0800 [-] FOO: Running tests-mozilla-central-fedora-opt-unittest 2010-11-21 16:06:10-0800 [-] FOO: Running tests-mozilla-central-leopard-debug-unittest 2010-11-21 16:07:06-0800 [-] FOO: Running tests-mozilla-central-leopard-debug-unittest 2010-11-21 16:43:42-0800 [-] FOO: Running tests-mozilla-central-leopard-debug-unittest 2010-11-21 16:43:54-0800 [-] FOO: Running tests-mozilla-central-leopard-debug-unittest 2010-11-21 16:07:02-0800 [-] FOO: Running tests-mozilla-central-fedora64-debug-unittest 2010-11-21 16:08:03-0800 [-] FOO: Running tests-mozilla-central-fedora64-debug-unittest 2010-11-21 16:43:47-0800 [-] FOO: Running tests-mozilla-central-fedora64-debug-unittest 2010-11-21 16:43:47-0800 [-] FOO: tests-mozilla-central-fedora64-debug-unittest creating buildstamp with ssid=686232 rev=4847e1cf6cf4725b9a36da1a94788374b0cab397 There was a m-c push at 16:42, so it's looking more likely that a new change on m-c unwedges something. Those three schedulers are stuck again, they last ran at 16:45. strace says the scheduler master is looping around this sort of pattern select(11, [3 5 6 10], [], [], {20, 552355}) = 1 (in [6], left {20, 447000}) read(6, "x", 8192) = 1 gettimeofday({1290387640, 180350}, NULL) = 0 gettimeofday({1290387640, 180350}, NULL) = 0 futex(0x9f47958, FUTEX_WAKE, 1) = 1 futex(0x9a02f38, FUTEX_WAKE, 1) = 1 gettimeofday({1290387640, 181304}, NULL) = 0 gettimeofday({1290387640, 181304}, NULL) = 0 select(11, [3 5 6 10], [], [], {20, 443257} with a pause of a few seconds on most selects.
lsof says that file descriptor 11 is a TCP connection to the DB, one of four that are established. Need someone with serious buildbot & DB background to progress further here.
Based on the "new change on m-c unwedges something" hunch, I think it's worth taking a look at the scheduler loop. It's quite possible that the loop just isn't triggered for many minutes at a time.
Assignee: nobody → dustin
Well, looking at the logging, the Loop itself is working fine, but each scheduler is taking a looong time (5-15s) to do its processing. Looking in SHOW PROCESSLIST, I see a simple select, but replicating it takes *forever*: mysql> SELECT changeid, important FROM scheduler_changes WHERE schedulerid=27; Empty set (14.78 sec) yet mysql> select count(*) from scheduler_changes; +----------+ | count(*) | +----------+ | 0 | +----------+ 1 row in set (0.01 sec) and even mysql> SELECT changeid, important FROM scheduler_changes; Empty set (14.58 sec) SHOW PROCESSLIST shows these connections in the "statistics" state. The manual describes the state thusly: --- The server is calculating statistics to develop a query execution plan. If a thread is in this state for a long time, the server is probably disk-bound performing other work. --- (http://dev.mysql.com/doc/refman/5.0/en/general-thread-states.html) So I think the tl;dr is that the MySQL cluster is busy for some reason, and Buildbot is scaling back its DB access (arguably appropriately).
Are these queries being executed against the DB master or slave?
On a slave : mysql> SELECT changeid, important FROM scheduler_changes WHERE schedulerid=27; Empty set (8.15 sec) And the some more digging around led to finding out that a select * on an empty table was returning after about 8 second. Also learnt from nthomas that a bunch of rows were culled from the table a while back. Ran an optimize : mysql> optimize table scheduler_changes; +---------------------------------------+----------+----------+----------+ | Table | Op | Msg_type | Msg_text | +---------------------------------------+----------+----------+----------+ | buildbot_schedulers.scheduler_changes | optimize | status | OK | +---------------------------------------+----------+----------+----------+ 1 row in set (1.60 sec) And we now have : mysql> select * from scheduler_changes; Empty set (0.00 sec) Which I guess fixes the issue :)
they're on the master - the buildbot masters do selects and inserts/updates, so they need r/w access. fox2mike found that the scheduler_changes table is using 566M on disk, and reminded me that MySQL/MyISAM implements deletes with, more or less, a "deleted" flag. So these queries which return nothing are scanning the entire table and then removing all rows from the result set. An OPTIMIZE TABLE scheduler_changes seems to have fixed things. Thanks!
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Blocks: 613191
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: