Closed Bug 981368 Opened 10 years ago Closed 10 years ago

Determine RFO for 5 hour tree closure: No Linux64 test slaves running, 3+ hour pending

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Linux
task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: catlee)

References

Details

(Whiteboard: [re][closed-trees])

Attachments

(1 file)

Because it's DST day, I suspect a bit of timezone oopsie. We've got a couple hundred pending jobs for linux64 test slaves (b2g emulator and linux64 desktop) going back to 20:07, but there are zero jobs running on them.

All trees except Try are closed.
Where by "Linux64 test slaves," since Questions Will Be Asked, I mean the linux64 AWS slaves that run everything except debug browser-chrome and talos, not the ubuntu_hw slaves that do talos or the doddering old fedora slaves that do debug browser-chrome, since both of those pools are still alive.
(In reply to Phil Ringnalda (:philor) from comment #0)
> Because it's DST day, I suspect a bit of timezone oopsie.

The pending build list goes back to just after 20:00 PST (20:07:10, to be exact), which does indeed make this a little suspicious.  But I have no idea what particular timezone oopsie would happen at that hour.
Flags: needinfo?
The problem cleared itself about 3:45 Pacific, trees reopened at 04:14.

Possibly related to bug 978971. Also may have been slaves connected to masters but work not being taken up. Hard to tell as I was looking as more slaves came online; going to need some serious log analysis to figure it out.
Severity: blocker → critical
Flags: needinfo? → needinfo?(rail)
I haven't seen any "bad" condition since I landed bug 978971. This is expected during weekends, I believe. I don't think it helped.

BTW, ~10 days ago we changed our algorithm and instead of starting EC2 instances for every single pending request, we reduce this amount by 10% of running instances. http://hg.mozilla.org/build/cloud-tools/rev/057aaeff75d2#l2.379

This may affect our pending queue esp we we have spikes and have a lot of running instances.
Flags: needinfo?(rail)
We seem to have more of a problem starting up enough instances when we're otherwise idle...That makes me think maybe a few things could be wrong:

- builders not listed in aws_watch_pending are stealing jobs (e.g. l10n)
- spot requests aren't being fulfilled (bug 978971)
- spot machines are coming online, but aren't able to do work for some reason
To preserve some history, I've copied master/twistd.log* to ~/b981368/ on the 9 tests1-linux masters.
We definitely don't match on a few type of jobs, some 1319 in total
* b2g_emulator_vm.* (479), b2g_ubuntu64_vm (73), b2g_ubuntu32_vm (49)
* Android nightly and non-unified builds (and upcoming non-profiling)

The build system branch causes all sorts of fun because it adds another 'build' in the builder names, eg 'Android 2.2 Armv6 Tegra build-system opt test crashtest'. If that was still active we'd be starting EC2 instances for Tegra jobs.

I haven't added spidermonkey, hazard, TB, l10n or release builds, on the basis I don't know if we really want to or not. And frankly the regexp are getting awkward already.

See http://people.mozilla.org/~nthomas/bug981368/ for script and inputs to work this out. The json comes from a side project that dumps information out of masters.
Attachment #8388271 - Flags: review?(rail)
Comment on attachment 8388271 [details] [diff] [review]
[cloud-tools] Match on more EC2 jobs

Thanks for updating these. lgtm
Attachment #8388271 - Flags: review?(rail) → review+
Comment on attachment 8388271 [details] [diff] [review]
[cloud-tools] Match on more EC2 jobs

https://hg.mozilla.org/build/cloud-tools/rev/481a527a9fad and deployed.
Attachment #8388271 - Flags: checked-in+
(In reply to Nick Thomas [:nthomas] from comment #7)
> The build system branch causes all sorts of fun because it adds another
> 'build' in the builder names, eg 'Android 2.2 Armv6 Tegra build-system opt
> test crashtest'. If that was still active we'd be starting EC2 instances for
> Tegra jobs.

Yuck - it's both sadfaces making (and also perversely slightly satisfying) that there aren't workarounds within our automation for handling the same problems that things like TBPL & treeherder have to deal with - due to not having separated fields for job metadata. We really need to replace the buildername (or at least supplement it) with individual properties available throughout automation, so we don't have to rely on fragile regex :-(
morphing summary into determining reason for outage, and initial remediation, since that is where the comments have gone.
Summary: No Linux64 test slaves running, 3+ hour pending, all trees closed → Determine RFO for 5 hour tree closure: No Linux64 test slaves running, 3+ hour pending
Whiteboard: [re][closed-trees]
Blocks: 984228
Another hour-long gap in starting any tonight, starting around the same time, which conveniently ended while I was closing trees.
To document it, aws watch pending reported 835 currently running, which :philor thinks was incorrect.

And right before c#13 happened for that matter:

    2014-03-16 21:20:46,890 - INFO - 134 running for bld-linux64 None
    2014-03-16 21:20:46,892 - INFO - reducing required count for bld-linux64 None by 13 (134 running; need 11)
    2014-03-16 21:20:46,892 - INFO - removing requirement for bld-linux64 None
    2014-03-16 21:20:46,896 - INFO - 835 running for tst-linux64 None
    2014-03-16 21:20:46,899 - INFO - reducing required count for tst-linux64 None by 83 (835 running; need 195)
    2014-03-16 21:20:46,901 - INFO - 134 running for bld-linux64 None
    2014-03-16 21:20:46,909 - INFO - reducing required count for bld-linux64 None by 13 (134 running; need 1)
    2014-03-16 21:20:46,910 - INFO - removing requirement for bld-linux64 None
    2014-03-16 21:20:46,919 - INFO - 835 running for tst-linux64 None
    2014-03-16 21:20:46,920 - INFO - reducing required count for tst-linux64 None by 83 (835 running; need 1)
    2014-03-16 21:20:46,920 - INFO - removing requirement for tst-linux64 None
    2014-03-16 21:21:17,762 - INFO - getting filtered spot requests for us-east-1c (m1.medium)
    2014-03-16 21:21:17,762 - INFO - getting all spot requests for us-east-1
    2014-03-16 21:21:47,647 - INFO - getting filtered spot requests for us-east-1a (m1.medium)
    2014-03-16 21:21:49,495 - WARNING - No free network interfaces left in us-east-1
    2014-03-16 21:21:49,495 - WARNING - No free network interfaces left in us-east-1
    2014-03-16 21:21:49,496 - WARNING - No free network interfaces left in us-east-1
    2014-03-16 21:21:49,496 - WARNING - No free network interfaces left in us-east-1
    2014-03-16 21:21:49,496 - WARNING - No free network interfaces left in us-east-1
    2014-03-16 21:21:49,496 - WARNING - No free network interfaces left in us-east-1
    2014-03-16 21:21:49,496 - INFO - getting filtered spot requests for us-east-1d (m1.medium)
    2014-03-16 21:21:56,904 - INFO - getting filtered spot requests for us-east-1b (m1.medium)
    2014-03-16 21:22:05,405 - INFO - getting filtered spot requests for us-west-2a (m1.medium)
    2014-03-16 21:22:05,405 - INFO - getting all spot requests for us-west-2
    2014-03-16 21:22:13,874 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,876 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,877 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,878 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,878 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,878 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,878 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,878 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,879 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,881 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,881 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,881 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,881 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,881 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,882 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,882 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,884 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,885 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,885 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,885 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,885 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,885 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,886 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,886 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,886 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,886 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,886 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,886 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,887 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,887 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,889 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,890 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,890 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,892 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,893 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,893 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,893 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,893 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,893 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,894 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,894 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,894 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,894 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,894 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,894 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,895 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,899 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,900 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,900 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,900 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,900 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,900 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,900 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,901 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,901 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,901 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,901 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,901 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,902 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,902 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,902 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,902 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,904 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,909 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,910 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,910 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,910 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,910 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,910 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,911 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,911 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,913 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,914 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,916 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,917 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,917 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,917 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,917 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,917 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,918 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,918 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,918 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,918 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,918 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,918 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,919 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,919 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,921 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,922 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,922 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,924 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,925 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,925 - WARNING - No free network interfaces left in us-west-2
    2014-03-16 21:22:13,925 - INFO - getting filtered spot requests for us-west-2b (m1.medium)
    2014-03-16 21:23:51,297 - INFO - tst-linux64 - started 112 spot instances for slaveset None; need 0
If there were in any sense "835 running" it was not in the sense of https://secure.pub.build.mozilla.org/buildapi/running - not only is that probably 100 times the number that were visibly running, I'd bet it's actually more than the total number of jobs of all sorts on all platforms that were running at the time.
"running" means how many machines are running

pretty sure this is another case of priortizeBuilders sucking:

e.g. tst-linux64-spot-835 was started 20:45, connected to buildbot-master52 at 20:51.
bm52 was stuck doing stuff like this:

2014-03-16 20:51:39-0700 [Broker,71299,10.134.157.193] Got slaveinfo from 'tst-linux64-spot-835'
2014-03-16 20:51:39-0700 [Broker,71299,10.134.157.193] bot attached
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.00s starting
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.18s got 112 request(s)
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.18s requests for my builders: 96
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.19s builders with requests: 96
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.39s found 43 available of 57 connected slaves 
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.40s builders with slaves: 96
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.40s prioritized 96 builder(s): ...
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.41s assigned into 1 slave set(s)
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.41s finding important builders for slaves: ['tst-linux64-spot-878', 'tst-linux64-spot-851', 'tst-linux64-spot-850', 'tst-linux64-spot-971', 'tst-linux64-spot-870', 'tst-linux64-spot-238', 'tst-linux64-spot-271', 'tst-linux64-spot-035', 'tst-linux64-spot-256', 'tst-linux64-spot-030', 'tst-linux64-spot-279', 'tst-linux64-spot-032', 'tst-linux64-spot-066', 'tst-linux64-spot-896', 'tst-linux64-spot-160', 'tst-linux64-spot-149', 'tst-linux64-spot-085', 'tst-linux64-spot-999', 'tst-linux64-spot-898', 'tst-linux64-spot-182', '...']
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.41s important builder b2g_emulator_vm b2g-inbound debug test mochitest-debug-12 (p == (4, 0L, 100, 1395026895L))
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.41s unimportant builder Ubuntu VM 12.04 x64 mozilla-inbound opt test xpcshell ((4, 0L, 100, 1395027132L) != (4, 0L, 100, 1395026895L))
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.41s unimportant builder Ubuntu VM 12.04 x64 mozilla-inbound opt test mochitest-3 ((4, 0L, 100, 1395027132L) != (4, 0L, 100, 1395026895L))
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.41s unimportant builder Ubuntu VM 12.04 x64 mozilla-inbound opt test mochitest-1 ((4, 0L, 100, 1395027132L) != (4, 0L, 100, 1395026895L))
...
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.42s found 1 important builder(s): ['b2g_emulator_vm b2g-inbound debug test mochitest-debug-12']
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.42s shuffling important builders
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.42s triggering builder loop again since we've dropped some lower priority builders
2014-03-16 20:51:40-0700 [-] prioritizeBuilders: 0.42s finished prioritization
2014-03-16 20:51:41-0700 [-] nextAWSSlave: 1 retries for b2g_emulator_vm b2g-inbound debug test mochitest-debug-12
2014-03-16 20:51:41-0700 [-] nextAWSSlave: No slaves appropriate for retried job - returning None
Made a few changes to watch_pending:

http://hg.mozilla.org/build/cloud-tools/rev/8b2961b13630 - factor in freshly started instances to prevent us from overshooting due to bootup -> taking jobs lag

http://hg.mozilla.org/build/cloud-tools/rev/0ce75ee78ae7 - keep # running separate for spot and ondemand so we actually start ondemand if we need to (due to retries)
Depends on: 984923
Assignee: nobody → catlee
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: