Closed Bug 851622 Opened 12 years ago Closed 12 years ago

Frequent ec2 disconnects causing most jobs to fail

Categories

(Release Engineering :: General, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Assigned: rail)

References

Details

(Whiteboard: [buildduty])

Attachments

(2 files)

All the trees are currently closed because, at the moment, pretty much every ec2 job is being cut off prematurely. Example log: https://tbpl.mozilla.org/php/getParsedLog.php?id=20699468&tree=Mozilla-Inbound
Adding hwine as buildduty.

For posterity, we had a *long* reconfig this morning. 3 linux test masters (17, 18, and 24) took over 2 hours to reconfig before I finally disabled them in slavealloc, killed them, and then started them back up. I had set the masters to gracefully shut down, but that doesn't keep new slaves from trying to attach (as I'm told now), so new ec2 slaves would gleefully attach to a master that was trying to shut down and start taking jobs.

Just now, I took a look at those same masters that had been restarted, and every single available slave was running a job. It's possible we're seeing some sort of congestion with 3 masters and all associated slaves coming back online in quick succession, but that's a guess. If that's the case, things should flatten out to normal in short order.

Bug 833334 covers making the reconfig process less painful, but I'm not sure we know how to do that reliably yet.
See Also: → 833334
Maybe we need more masters to handle that many ec2 slaves.
See Also: → 851431
See Also: → 844648
Trees got reopened, then closed again for bug 851705, then got reopened while I retriggered 20 jobs per push. I can't keep doing that any longer, so mozilla-central, mozilla-inbound, mozilla-aurora, fx-team and services-central are closed again.
The common feature of bugs 851431, 844648, and 851697 is a job finishing (usually green) and another job starting on the same slave within 10 seconds. The EC2 VMs seem to take on the order of 1-2 minutes to start, so the 2nd job gets cut off when the network drops out underneath it.
Whiteboard: [buildduty]
Strangely, when I reboot tst-linux32-ec2-029 I see this in the slave's twistd.log:

2013-03-16 03:22:19-0700 [-] Received SIGTERM, shutting down.
2013-03-16 03:22:19-0700 [Broker,client] lost remote
# many more "lost remote" messages
2013-03-16 03:22:19-0700 [-] Server Shut Down.
2013-03-16 03:22:24-0700 [-] Log opened.
...
2013-03-16 03:22:30-0700 [Broker,client] Connected to buildbot-master24... slave is ready
# doesn't get any work here, but if it did, that's where our bug occurs
2013-03-16 03:23:37-0700 [-] Log opened.
...
2013-03-16 03:23:40-0700 [Broker,client] Connected to buildbot-master24... slave is ready

i.e. buildbot restarts within 5 seconds of getting SIGTERM and reattaches, then repeats that about a minute later. The master log agrees with this story. Meanwhile the slave syslog has rsyslogd stopping at 03:22:17 and restarting at 03:23:06 (similar for sshd). So a job could start between 03:22:30 and 03:23:40 and then find the machine disappear as the reboot actually happens.

The question is what is respawning buildbot, whether that has changed, and why we're hitting it now. The master restart (comment #1) may have made the masters more responsive and therefore more likely to hand out work.
> Meanwhile the slave syslog has rsyslogd stopping at 03:22:17 and restarting
> at 03:23:06 (similar for sshd). So a job could start between 03:22:30 and
> 03:23:40 and then find the machine disappear as the reboot actually happens.

Correction - a job could start between 03:22:30 and whenever the slave actually goes down (could be up to 03:23:06, depending on Amazon overhead).
I suspect 2 things:

1) The Xsession respawn stanza in http://hg.mozilla.org/build/puppet/file/ffc193221048/modules/gui/templates/Xsession.conf.erb#l9, which may try to restart Xsession after the SIGTERM sent by reboot.

2) xterm in http://hg.mozilla.org/build/puppet/file/ffc193221048/modules/buildslave/templates/gnome-terminal.desktop.erb#l8, which makes buildbot run in the background without any parent process.

I added "respawn limit 3 120" (no more than 3 respawns within a 2 minute interval) and changed xterm back to gnome-terminal. A sketch of the upstart stanza follows below.
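For illustration only, here is a minimal sketch of what an upstart job with that limit could look like. This is not the actual Xsession.conf.erb template from build/puppet; the start/stop conditions and exec line are assumptions, and only the "respawn limit 3 120" line reflects the change described above.

    # hypothetical /etc/init/Xsession.conf sketch, not the real template
    start on runlevel [2345]
    stop on runlevel [016]

    respawn
    # give up after 3 respawns within 120 seconds, so the SIGTERM sent at
    # reboot doesn't keep resurrecting the session (and buildbot with it)
    respawn limit 3 120

    exec /etc/X11/Xsession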
I think this is the "respawn" stanza - the pattern of the Xsession logs changed, and "Killed" started appearing recently.

/var/log/upstart/Xsession.log.7.gz:
Session terminated, terminating shell... ...terminated.
Session terminated, terminating shell... ...terminated.
Session terminated, terminating shell... ...terminated.

/var/log/upstart/Xsession.log:
Session terminated, terminating shell... ...terminated.
Killed
Session terminated, terminating shell... ...terminated.
Killed
Session terminated, terminating shell... ...terminated.
Killed
Session terminated, terminating shell... ...terminated.
Killed
Session terminated, terminating shell...
Session terminated, terminating shell... ...terminated.
Killed
Session terminated, terminating shell... ...terminated.
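(As a quick, hypothetical spot check - not part of the fix - something like the following counts the "Killed" lines across the rotated upstart logs on a slave:

    zgrep -c Killed /var/log/upstart/Xsession.log*

zgrep handles both the plain and the .gz-rotated files, so old and new logs can be compared directly.)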
Assignee: nobody → rail
No purples for the last 2000 Ubuntu test builds so far.
Severity: blocker → normal
Priority: -- → P1
Looks like the issue is gone - no such failures in the last 24 hours.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering