Closed Bug 1167475 Opened 9 years ago Closed 9 years ago

Constant Windows buildslave disconnects leading to constant broken objdirs

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: philor, Assigned: q)

Phil Ringnalda (:philor)

Reporter

Description

9 years ago

See https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20build and hit the "get next: 50" button at the bottom.

Right now I'm looking at 57 disconnects since Wednesday morning, which is 57 opportunities for a broken objdir, of which I see a surprisingly low 7 instances. But then, every single time that I see one happen, I clobber every single Windows build on that tree, to save myself having to go right back again.

Phil Ringnalda (:philor)

Reporter

Comment 1

9 years ago

I should have gone with the better link, https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20build&filter-resultStatus=retry plus a couple of clicks on "get next: 50". It shows that you cannot tell when this started, since the history leads you back to the full pool reboot for bug 1166415. Maybe the reboot did cause it, or maybe something else did around 23:00 Tuesday or 07:00 Wednesday.

Justin Wood (:Callek)

Comment 2

9 years ago

I wonder if this has anything to do with the network stack registry changes we recently made...

Assignee

Comment 3

9 years ago

I am not seeing anything obvious. I am diving deeper into the systems themselves now. Can I get a little more clarification? We did a mass reboot of all machines while jobs may have been running, per the sheriff and build duty at that time. Are those the 57 disconnects, and have we seen two more since then? Is a "disconnect" when a machine stops checking in with the master, or what are the criteria for the disconnected state?

Phil Ringnalda (:philor)

Reporter

Comment 4

9 years ago

No, the 57 was "at the time I filed this, on Thursday evening, there have been 57 on this one tree since Wednesday morning." The reboot was Tuesday afternoon; I wasn't counting those.

Since then? Dunno where the 2 came from, but at a very rough estimate across all trees, I'd say we're seeing more than 100 every 24 hours. The link in comment 1 is live: it shows the 10 most recent pushes to mozilla-inbound at the time you load it, and the "get next: 50" link at the bottom adds older pushes. So at any time you can click that link to add more pushes, and every blue B (or SM(p)) that you see down the center of the page is a Windows build job which set the buildbot retry state. Right now, unless something else breaks, every one of them is because of this.

Assignee

Comment 5

9 years ago

Okay, that makes more sense. I have a few hosts and time frames to zero in on.

Assignee

Comment 6

9 years ago

Okay, after some more review of b-2008-0081 I found some stack issues that might affect LAN connections with our topology. I saw TCP resets happening at times that coincided with the Treeherder errors. After MUCH testing I was able to tweak the stack to compensate. I have a new set of registry settings and a netsh command script that were pushed via GPO, and I will submit them for inclusion into Puppet as well.

philor, can you keep an eye out and let me know whether or not these stop?
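For anyone wanting to repeat the "TCP resets coincide with the Treeherder errors" check on another host, here is a minimal sketch of one way to line up the two sets of timestamps. It is not the method actually used on b-2008-0081. The function name `correlate`, the 30-second window, and the sample timestamps are all hypothetical; real input would come from a packet capture and from the retried jobs' times.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def correlate(resets, disconnects, window=timedelta(seconds=30)):
    """Pair each disconnect with its nearest TCP reset, if one is within `window`.

    `resets` and `disconnects` are lists of datetime objects.
    Returns a list of (disconnect_time, reset_time) tuples.
    """
    resets = sorted(resets)
    matches = []
    for d in disconnects:
        # Position of the first reset at or after the disconnect.
        i = bisect_left(resets, d)
        # The nearest reset is either just before or just after that position.
        candidates = resets[max(i - 1, 0):i + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda r: abs(r - d))
        if abs(nearest - d) <= window:
            matches.append((d, nearest))
    return matches


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    base = datetime(2000, 1, 1, 12, 0, 0)
    sample_resets = [base, base + timedelta(minutes=10)]
    sample_disconnects = [base + timedelta(seconds=5), base + timedelta(minutes=5)]
    for disconnect, reset in correlate(sample_resets, sample_disconnects):
        print(f"disconnect at {disconnect} is near reset at {reset}")
```

A high share of matched disconnects would be consistent with the resets being the cause, though it would not prove it on its own.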

Flags: needinfo?(philringnalda)

Assignee

Comment 7

9 years ago

My initial feeling is that this may also speed up hg checkouts if the data pipes have the extra space.

Amy Rich [:arr] [:arich]

Updated

9 years ago

Blocks: 1168812

Phil Ringnalda (:philor)

Reporter

Comment 8

9 years ago

Looks fixed to me, thanks!

Flags: needinfo?(philringnalda)

Kim Moir [:kmoir] ET

Updated

9 years ago

Assignee: nobody → q

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

BMO Automation

Updated

6 years ago

Product: Release Engineering → Infrastructure & Operations

BMO Automation

Updated

4 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)
