Closed Bug 959266 Opened 10 years ago Closed 7 years ago

slaves that disconnect during a FileDownload (or other transfer step, presumably) get hung forever

Categories

(Release Engineering :: General, defect, P2)

x86_64
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2287] )

Attachments

(1 file)

Haven't been able to catch this in time to get twistd.log information yet, but I've seen this a few times already, and it may be related to reconfigs getting stuck.

Basically, if we lose a slave while it's doing a FileDownload, its build hangs forever. E.g.:
http://buildbot-master53.srv.releng.usw2.mozilla.com:8201/builders/b2g_emulator_vm%20mozilla-inbound%20opt%20test%20crashtest-2/builds/1053
(In reply to Dustin J. Mitchell [:dustin] (I ignore NEEDINFO) from comment #1)
> https://github.com/buildbot/buildbot/commit/c8d1ee63f6789d63a97ef39e62e7dd9d9a912562 ?

That looks like it would do it! Is it hard to roll out buildbot slave changes these days?
For "all but Windows", yes: we'd create a new pip package for our buildslave setup and upload it. Then, in

http://mxr.mozilla.org/build/source/puppet/modules/buildslave/manifests/install.pp?force=1

set -0moz2 to active=false and 0moz3 to install; after that we can also set -0moz2 to absent. And things should just work (AIUI; untested, of course).
You could probably just take the master-side fix. The slave-side change is, judging from the commit message, just cleanup.
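
For anyone reading along, here is a minimal sketch of the failure mode and the general shape of a master-side fix. It is not the code from the Buildbot commit linked above and does not use Buildbot or Twisted APIs; `TransferStep`, `SlaveLost`, `connection_lost` and `transfer_finished` are hypothetical names, and asyncio stands in for the real event loop. The idea is only that a step waiting on a transfer must be told when its slave goes away, otherwise the wait never completes.

```python
import asyncio


class SlaveLost(Exception):
    """Hypothetical error raised into a step when its slave disconnects."""


class TransferStep:
    """Hypothetical stand-in for a master-side transfer step (e.g. FileDownload)."""

    def __init__(self):
        self._done = None  # future resolved when the slave reports completion

    async def run(self):
        self._done = asyncio.get_running_loop().create_future()
        try:
            # If nothing ever resolves this future, the build hangs forever.
            await self._done
            return "SUCCESS"
        except SlaveLost:
            # The step ends with an exception result instead of hanging.
            return "EXCEPTION"

    def transfer_finished(self):
        if self._done is not None and not self._done.done():
            self._done.set_result(None)

    def connection_lost(self):
        # The important part: fail the pending wait when the slave goes away.
        if self._done is not None and not self._done.done():
            self._done.set_exception(SlaveLost("slave disconnected mid-transfer"))


async def simulate(disconnect):
    step = TransferStep()
    task = asyncio.ensure_future(step.run())
    await asyncio.sleep(0)  # let the step start waiting on the transfer
    if disconnect:
        step.connection_lost()
    else:
        step.transfer_finished()
    return await task


def run_simulation(disconnect):
    return asyncio.run(simulate(disconnect))
```

Calling `run_simulation(True)` models a slave dropping mid-transfer; without the `connection_lost` hook, that call would never return.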
I tested this on my dev master and it worked well!

I don't think I want to do another round of buildbot masters just yet, but I'll land it on default so we don't forget about it.
Attachment #8360516 - Flags: review?(catlee)
Comment on attachment 8360516 [details] [diff] [review]
wow. such buildslave. many disconnect. very exception

Review of attachment 8360516 [details] [diff] [review]:
-----------------------------------------------------------------

f- for inappropriate meme spreading

I'm surprised this isn't default behaviour for build steps...
Attachment #8360516 - Flags: review?(catlee) → review+
Comment on attachment 8360516 [details] [diff] [review]
wow. such buildslave. many disconnect. very exception

Landed on default. This'll get put into production the next time we land something to Buildbot...
Attachment #8360516 - Flags: checked-in+
Moving to the buildduty queue: all that's left to do here is deploy the buildbot change.
Component: General Automation → Buildduty
QA Contact: catlee → armenzg
Priority: -- → P2
per :jhopkins, this needs to be done during a TCW.

To properly get that scheduled, we'll need a CAB bug (I can file that if need be). For CAB, we'll need the information requested at https://wiki.mozilla.org/IT/ChangeControl#Submitting_a_Change_Request
Flags: needs-treeclosure?
bhearsum: iirc, we need:
1) merge 'buildbot' repo's default branch -> production
2) a full restart of each buildbot master process (not a reconfig).
Flags: needinfo?(bhearsum)
(In reply to Hal Wine [:hwine] (use needinfo) from comment #10)
> per :jhopkins, this needs to be done during a TCW.

To be clear: this could be done in a series of rolling restarts as we've done for other Buildbot master changes. If we have a TCW where hard restarts (ones that interrupt jobs) aren't going to be a problem, it's certainly easier to do it there, though.

> To properly get that scheduled, we'll need a CAB bug (I can file that if
> need be). For CAB, we'll need the information requested at
> https://wiki.mozilla.org/IT/ChangeControl#Submitting_a_Change_Request

If someone wants to do this, great! This bug isn't urgent though, so I won't be filing one specifically for it.

(In reply to John Hopkins (:jhopkins) from comment #11)
> bhearsum: iirc, we need:
> 1) merge 'buildbot' repo's default branch -> production
> 2) a full restart of each buildbot master process (not a reconfig).

The "update-buildbot" fabric target needs to be run after you push the buildbot changes, too.
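
Putting the pieces from this comment together, the deployment order can be sketched as a simple plan. This is only an illustration of the sequencing described in this bug (merge, update-buildbot, then full restarts done in rolling batches); `rolling_restart_plan`, the batch size and the master names are hypothetical, and the function only produces a checklist, it does not run anything.

```python
def rolling_restart_plan(masters, batch_size):
    """Return an ordered checklist for deploying a Buildbot master change.

    `masters` is a list of master names (hypothetical placeholders) and
    `batch_size` is how many masters are restarted at the same time.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    steps = [
        "merge 'buildbot' repo default branch -> production",
        "run the update-buildbot fabric target",
    ]
    # Restart masters a few at a time so the rest keep running jobs.
    for start in range(0, len(masters), batch_size):
        batch = masters[start:start + batch_size]
        steps.append("full restart (not a reconfig): " + ", ".join(batch))
    return steps


if __name__ == "__main__":
    # Master names below are made up for the example.
    for step in rolling_restart_plan(["master-a", "master-b", "master-c"], 2):
        print(step)
```

With a TCW where interrupting jobs is acceptable, the batch size could simply be the full list of masters.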
Flags: needinfo?(bhearsum)
Component: Buildduty → General Automation
QA Contact: armenzg → catlee
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2273]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2273] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2281]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2281] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2287]
dropping tree closure flag -- we won't deploy this on its own merits.
Flags: needs-treeclosure?
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: General Automation → General