Closed Bug 851622 Opened 12 years ago Closed 12 years ago

Frequent ec2 disconnects causing most jobs to fail

Categories

(Release Engineering :: General, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Assigned: rail)

References

Details

(Whiteboard: [buildduty])

Attachments

(2 files)

All the trees are currently closed because, at the moment, pretty much every ec2 job is being cut off prematurely. Example log: https://tbpl.mozilla.org/php/getParsedLog.php?id=20699468&tree=Mozilla-Inbound
Adding hwine as buildduty.

For posterity, we had a *long* reconfig this morning. 3 linux test masters (17, 18, and 24) took over 2 hours to reconfig before I finally disabled them in slavealloc, killed them, and then started them back up. I had set the masters to gracefully shut down, but that doesn't keep new slaves from trying to attach (as I'm told now), so new ec2 slaves would gleefully attach to a master that was trying to shut down and start taking jobs.

Just now, I took a look at those same masters that had been restarted, and every single available slave was running a job. It's possible we're seeing some sort of congestion with 3 masters and all associated slaves coming back online in quick succession, but that's a guess. If that's the case, things should flatten out to normal in short order.

Bug 833334 covers making the reconfig process less painful, but I'm not sure we know how to do that reliably yet.
See Also: → 833334
Maybe we need more masters to handle that many ec2 slaves.
See Also: → 851431
See Also: → 844648
Trees got reopened, then closed again for bug 851705, then got reopened while I retriggered 20 jobs per push. I can't keep doing that any longer, so mozilla-central, mozilla-inbound, mozilla-aurora, fx-team and services-central are closed again.
The common feature of bugs 851431, 844648, and 851697 is a job finishing (usually green) and another job starting on the same slave within 10 seconds. The EC2 VMs seem to take on the order of 1-2 minutes to start, so the 2nd job gets cut off when the network drops out underneath it.
Whiteboard: [buildduty]
Strangely, when I reboot tst-linux32-ec2-029 I see this in the slave's twistd.log:

2013-03-16 03:22:19-0700 [-] Received SIGTERM, shutting down.
2013-03-16 03:22:19-0700 [Broker,client] lost remote
# many more "lost remote" messages
2013-03-16 03:22:19-0700 [-] Server Shut Down.
2013-03-16 03:22:24-0700 [-] Log opened.
...
2013-03-16 03:22:30-0700 [Broker,client] Connected to buildbot-master24... slave is ready
# doesn't get any work here, but if it did, that's where our bug occurs
2013-03-16 03:23:37-0700 [-] Log opened.
...
2013-03-16 03:23:40-0700 [Broker,client] Connected to buildbot-master24... slave is ready

i.e. buildbot restarts within 5 seconds of getting SIGTERM and reattaches, then repeats that about a minute later. The master log agrees with this story. Meanwhile the slave syslog has rsyslogd stopping at 03:22:17 and restarting at 03:23:06 (similar for sshd). So a job could start between 03:22:30 and 03:23:40 and then find the machine disappear as the reboot actually happens.

The question is what is respawning buildbot, whether that has changed, and why we're hitting it now. The master restart (comment #1) may have made the masters more responsive and therefore more likely to hand out work.
> Meanwhile the slave syslog has rsyslogd stopping at 03:22:17 and restarting
> at 03:23:06 (similar for sshd). So a job could start between 03:22:30 and
> 03:23:40 and then find the machine disappear as the reboot actually happens.

Correction - a job could start between 03:22:30 and whenever the slave actually goes down (could be up to 03:23:06, depending on Amazon overhead).
I suspect 2 things:

1) The Xsession respawn stanza in http://hg.mozilla.org/build/puppet/file/ffc193221048/modules/gui/templates/Xsession.conf.erb#l9, which may try to restart Xsession after the SIGTERM sent by reboot.

2) xterm in http://hg.mozilla.org/build/puppet/file/ffc193221048/modules/buildslave/templates/gnome-terminal.desktop.erb#l8, which makes buildbot run in the background without any parent process.

I added "respawn limit 3 120" (no more than 3 respawns within a 2 minute interval) and changed xterm back to gnome-terminal. A sketch of the upstart stanza follows below.
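For illustration only, here is a minimal sketch of what an upstart job with that limit could look like. This is not the actual Xsession.conf.erb template from build/puppet; the start/stop conditions and exec line are assumptions, and only the "respawn limit 3 120" line reflects the change described above.

    # hypothetical /etc/init/Xsession.conf sketch, not the real template
    start on runlevel [2345]
    stop on runlevel [016]

    respawn
    # give up after 3 respawns within 120 seconds, so the SIGTERM sent at
    # reboot doesn't keep resurrecting the session (and buildbot with it)
    respawn limit 3 120

    exec /etc/X11/Xsession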
I think this is the "respawn" stanza - the pattern of the Xsession logs changed, and "Killed" started appearing recently.

/var/log/upstart/Xsession.log.7.gz:
Session terminated, terminating shell... ...terminated.
Session terminated, terminating shell... ...terminated.
Session terminated, terminating shell... ...terminated.

/var/log/upstart/Xsession.log:
Session terminated, terminating shell... ...terminated.
Killed
Session terminated, terminating shell... ...terminated.
Killed
Session terminated, terminating shell... ...terminated.
Killed
Session terminated, terminating shell... ...terminated.
Killed
Session terminated, terminating shell...
Session terminated, terminating shell... ...terminated.
Killed
Session terminated, terminating shell... ...terminated.
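(As a quick, hypothetical spot check - not part of the fix - something like the following counts the "Killed" lines across the rotated upstart logs on a slave:

    zgrep -c Killed /var/log/upstart/Xsession.log*

zgrep handles both the plain and the .gz-rotated files, so old and new logs can be compared directly.)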
Assignee: nobody → rail
No purples for the last 2000 Ubuntu test builds so far.
Severity: blocker → normal
Priority: -- → P1
Looks like the issue is gone - no such failures in the last 24 hours.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering