613832 - Test jobs not being scheduled promptly

Reporter

Description

•

15 years ago

We're not scheduling tests quickly (or at all?) for some pushes. For example rev 0601ba3e9395 on mozilla-central. One of the compiles: Linux x86-64 mozilla-central leak test build scheduled: 2010-11-20 16:49:26 started: 2010-11-20 16:49:51 ended: 2010-11-20 18:16:49 result: green The sendchange took place at Sat Nov 20 18:02:25 2010 and succeeded. Log on pm01 of it being received 2010-11-20 18:02:28-0800 [Broker,2377,10.2.90.74] adding change, who sendchange-unittest, 2 files, rev=0601ba3e939572d2ea0a773ec05c2b4e72af92ae, branch=mozilla-central-linux64-debug-unittest, repository=, comments , category None, project There's a change in the DB, changes.changeid=240857 on changes.branch=mozilla-central-linux64-debug-unittest at changes.when_timestamp=1290304948, aka Sat Nov 20 18:02:28 2010 PST. However the debug unit test don't get scheduled until 20:34. There's a sourcestamp_changes.sourcestampid=682672, and buildset.id=791317 with buildset.submitted_at=1290314071 (aka Sat Nov 20 20:34:31 2010 PST). That's a 2hour 32 minute gap from the sendchange which is unexplained. When the tests do run they're showing two other changes, for 6f28a60c1c71, baa51e6d4a15, for which the sendchanges are reported as 18:20 and 18:44. I have restarted pm02:tests-scheduler to try to free up other pushes which missing test jobs but that hasn't helped.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 1

•

15 years ago

philor thinks he started seeing strangeness on Wednesday afternoon, where the 0.8.2 deployment was Wednesday morning. [21:09] <nthomas> was this an issue before DONTBUILD came along ? [21:10] <philor> yeah, :bs pushed something right before the last TM merge, and his builds didn't exist for an hour or more, but I think they then popped up as finished so, Thursday? [21:12] Wednesday <nthomas> I know https://wiki.mozilla.org/ReleaseEngineering:Maintenance is incomplete, so that's not much help [21:14] <philor> 0.8.2 was 6 to 9am Wednesday, that push was 1pm Wednesday [21:15] <philor> and, 10.6 opt tests ran like crap right out of the gate s/opt// <philor> two pushes where opt got just 1,3,oth, then one where debug got just X, then two where opt got nothing

Phil Ringnalda (:philor)

Comment 2

•

15 years ago

Comment 1 is probably me being confusing, talking about two different things at once. I think that was an instance of bug 613429 (or, bug 613429 comment 2 if they really are separate things), where right after the 0.8.2 upgrade we were already starting to see builds that didn't show up on tbpl while they were running, though at the time I didn't know that I should have looked at builds-running for another cset to see if that was what was happening. I didn't start paying attention to "omg, where are the tests for that long-completed build that doesn't list pending tests?" until last night, so I don't know if we've been having two hour gaps since Wednesday or not.

Chris AtLee [:catlee]

Comment 3

•

15 years ago

Comment 4

•

15 years ago

I believe that the test scheduler stuff is unrelated to DONTBUILD.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 5

•

15 years ago

Attached file 'Rev3 Fedora 12x64 mozilla-central debug test mochitests-1/5' jobs — Details

Regardless of the assignment of revisions to tests, this illustrates the scheduling problem. Recall that the sendchanges for the tests scheduled at 20:34 took place at 18:02, 18:20 and 18:44.

Chris AtLee [:catlee]

Comment 6

•

15 years ago

This is more consistent with something delaying the test schedulers from running than the revisions being assigned incorrectly.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 7

•

15 years ago

Attached file Better DB data — Details

Agreed. From this test job it looks like it started about midday on Saturday. Pending Tegra builds crossed some threshold to slow everything down ?

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 8

•

15 years ago

catlee suggested adding a logging line in schedulers/basic.py, in BasicScheduler's run(), to make sure the schedulers are being executed every minute. We then had some delayed Win7 opt tests on m-c 8a5f16022cbd. The sendchange was at 14:14 and jobs were not scheduled until 14:22. Grepping over the twistd.log we have 2010-11-21 13:35:08-0800 [-] FOO: Running tests-mozilla-central-win7-opt-unittest 2010-11-21 13:36:05-0800 [-] FOO: Running tests-mozilla-central-win7-opt-unittest 2010-11-21 13:37:05-0800 [-] FOO: Running tests-mozilla-central-win7-opt-unittest 2010-11-21 14:22:15-0800 [-] FOO: Running tests-mozilla-central-win7-opt-unittest 2010-11-21 14:22:15-0800 [-] FOO: tests-mozilla-central-win7-opt-unittest creating buildstamp with ssid=685535 rev=8a5f16022cbd88de7b4a5e25463c749882323310 2010-11-21 14:22:28-0800 [-] FOO: Running tests-mozilla-central-win7-opt-unittest 2010-11-21 14:23:04-0800 [-] FOO: Running tests-mozilla-central-win7-opt-unittest 2010-11-21 14:24:03-0800 [-] FOO: Running tests-mozilla-central-win7-opt-unittest So that scheduler just didn't run between 13:38 and 14:21. A variable number of schedulers was running each minute in that window. Interestingly there was a push to mozilla-central at 14:21 (which was picked up at 14:22:00 on pm01). There were sendchanges and other pushes detected in that window so it might just be a coincidence.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Updated

•

15 years ago

Priority: P1 → P2

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 9

•

15 years ago

s/BasicScheduler/Scheduler/ on comment #8.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 10

•

15 years ago

And again 2010-11-21 16:06:09-0800 [-] FOO: Running tests-mozilla-central-fedora-opt-unittest 2010-11-21 16:07:06-0800 [-] FOO: Running tests-mozilla-central-fedora-opt-unittest 2010-11-21 16:43:39-0800 [-] FOO: Running tests-mozilla-central-fedora-opt-unittest 2010-11-21 16:43:39-0800 [-] FOO: tests-mozilla-central-fedora-opt-unittest creating buildstamp with ssid=686182 rev=4847e1cf6cf4725b9a36da1a94788374b0cab397 2010-11-21 16:43:53-0800 [-] FOO: Running tests-mozilla-central-fedora-opt-unittest 2010-11-21 16:06:10-0800 [-] FOO: Running tests-mozilla-central-leopard-debug-unittest 2010-11-21 16:07:06-0800 [-] FOO: Running tests-mozilla-central-leopard-debug-unittest 2010-11-21 16:43:42-0800 [-] FOO: Running tests-mozilla-central-leopard-debug-unittest 2010-11-21 16:43:54-0800 [-] FOO: Running tests-mozilla-central-leopard-debug-unittest 2010-11-21 16:07:02-0800 [-] FOO: Running tests-mozilla-central-fedora64-debug-unittest 2010-11-21 16:08:03-0800 [-] FOO: Running tests-mozilla-central-fedora64-debug-unittest 2010-11-21 16:43:47-0800 [-] FOO: Running tests-mozilla-central-fedora64-debug-unittest 2010-11-21 16:43:47-0800 [-] FOO: tests-mozilla-central-fedora64-debug-unittest creating buildstamp with ssid=686232 rev=4847e1cf6cf4725b9a36da1a94788374b0cab397 There was a m-c push at 16:42, so it's looking more likely that a new change on m-c unwedges something. Those three schedulers are stuck again, they last ran at 16:45. strace says the scheduler master is looping around this sort of pattern select(11, [3 5 6 10], [], [], {20, 552355}) = 1 (in [6], left {20, 447000}) read(6, "x", 8192) = 1 gettimeofday({1290387640, 180350}, NULL) = 0 gettimeofday({1290387640, 180350}, NULL) = 0 futex(0x9f47958, FUTEX_WAKE, 1) = 1 futex(0x9a02f38, FUTEX_WAKE, 1) = 1 gettimeofday({1290387640, 181304}, NULL) = 0 gettimeofday({1290387640, 181304}, NULL) = 0 select(11, [3 5 6 10], [], [], {20, 443257} with a pause of a few seconds on most selects.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 11

•

15 years ago

lsof says that file descriptor 11 is a TCP connection to the DB, one of four that are established. Need someone with serious buildbot & DB background to progress further here.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 12

•

15 years ago

Based on the "new change on m-c unwedges something" hunch, I think it's worth taking a look at the scheduler loop. It's quite possible that the loop just isn't triggered for many minutes at a time.

Assignee: nobody → dustin

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 13

•

15 years ago

Well, looking at the logging, the Loop itself is working fine, but each scheduler is taking a looong time (5-15s) to do its processing. Looking in SHOW PROCESSLIST, I see a simple select, but replicating it takes *forever*: mysql> SELECT changeid, important FROM scheduler_changes WHERE schedulerid=27; Empty set (14.78 sec) yet mysql> select count(*) from scheduler_changes; +----------+ | count(*) | +----------+ | 0 | +----------+ 1 row in set (0.01 sec) and even mysql> SELECT changeid, important FROM scheduler_changes; Empty set (14.58 sec) SHOW PROCESSLIST shows these connections in the "statistics" state. The manual describes the state thusly: --- The server is calculating statistics to develop a query execution plan. If a thread is in this state for a long time, the server is probably disk-bound performing other work. --- (http://dev.mysql.com/doc/refman/5.0/en/general-thread-states.html) So I think the tl;dr is that the MySQL cluster is busy for some reason, and Buildbot is scaling back its DB access (arguably appropriately).

Shyam Mani [:fox2mike]

Comment 14

•

15 years ago

Are these queries being executed against the DB master or slave?

Shyam Mani [:fox2mike]

Comment 15

•

15 years ago

On a slave : mysql> SELECT changeid, important FROM scheduler_changes WHERE schedulerid=27; Empty set (8.15 sec) And the some more digging around led to finding out that a select * on an empty table was returning after about 8 second. Also learnt from nthomas that a bunch of rows were culled from the table a while back. Ran an optimize : mysql> optimize table scheduler_changes; +---------------------------------------+----------+----------+----------+ | Table | Op | Msg_type | Msg_text | +---------------------------------------+----------+----------+----------+ | buildbot_schedulers.scheduler_changes | optimize | status | OK | +---------------------------------------+----------+----------+----------+ 1 row in set (1.60 sec) And we now have : mysql> select * from scheduler_changes; Empty set (0.00 sec) Which I guess fixes the issue :)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 16

•

15 years ago

they're on the master - the buildbot masters do selects and inserts/updates, so they need r/w access. fox2mike found that the scheduler_changes table is using 566M on disk, and reminded me that MySQL/MyISAM implements deletes with, more or less, a "deleted" flag. So these queries which return nothing are scanning the entire table and then removing all rows from the result set. An OPTIMIZE TABLE scheduler_changes seems to have fixed things. Thanks!

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Nick Thomas [:nthomas] (UTC+12)

Reporter

Updated

•

15 years ago

Blocks: 613191

Nobody; OK to take it and work on it

Updated

•

12 years ago

Product: mozilla.org → Release Engineering

'Rev3 Fedora 12x64 mozilla-central debug test mochitests-1/5' jobs 15 years ago Nick Thomas [:nthomas] (UTC+12) 2.60 KB, text/plain		Details
Better DB data 15 years ago Nick Thomas [:nthomas] (UTC+12) 4.68 KB, text/plain		Details