https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20build and hit the "get next: 50" button at the bottom. Right now I'm looking at 57 disconnects since Wednesday morning, which is 57 opportunities for a broken objdir, of which I see a surprisingly low 7 instances. But then, every single time that I see one happen, I clobber every single Windows build on that tree, to save myself having to go right back again.
I should have gone with the better link, https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20build&filter-resultStatus=retry plus a couple 50s, which tells you that you cannot tell when it started, since that leads you back to the full pool reboot for bug 1166415. Maybe it did cause it, maybe something else did around 23:00 Tuesday, or 07:00 Wednesday.
I wonder if this has anything to do with the network stack registry changes we recently made...
I am not seeing anything obvious. I am diving deeper into the systems themselves now. Can I get a little more clarification? We did a mass reboot of all machines while jobs may have been running, per the sheriff and build duty at that time. Is that the source of the 57 disconnects, and have we seen two more since then? Is a "disconnect" when a machine stops checking in with the master, or what are the criteria for the disconnected state?
No, the 57 was "at the time I filed this, on Thursday evening, there have been 57 on this one tree since Wednesday morning." The reboot was Tuesday afternoon, I wasn't counting them. Since then? Dunno where the 2 came from, but at a very rough estimate across all trees, I'd say we're seeing more than 100 every 24 hours. The link in comment 1 is live, shows you the most recent 10 pushes to mozilla-inbound at the time that you load it, and then the "get next: 50" link at the bottom will add more older pushes, so at any time you can just click that link, add more pushes, and every blue B (or SM(p)) that you see down the center of the page will be a Windows build job which set the buildbot retry state, and right now, unless something else breaks, every one will be because of this.
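For anyone who wants to build the same filtered view for a different tree or search string, here is a minimal sketch in Python. It only assembles the URL from the query parameters visible in the links above (`repo`, `filter-searchStr`, `filter-resultStatus`); the helper name `retry_jobs_url` is hypothetical, and it does not query Treeherder or count jobs, you still have to load the page and look at the retried builds yourself.

```python
from urllib.parse import quote, urlencode

# Base of the Treeherder jobs view, as used in the links in this bug.
TREEHERDER_JOBS = "https://treeherder.mozilla.org/#/jobs"


def retry_jobs_url(repo, search, result_status="retry"):
    """Build a Treeherder jobs URL filtered by search string and result status.

    Hypothetical helper: the parameter names are copied from the links in
    this bug, nothing else about Treeherder's URL scheme is assumed.
    """
    params = {"repo": repo, "filter-searchStr": search}
    if result_status:
        # Omit the status filter to get the unfiltered view of matching jobs.
        params["filter-resultStatus"] = result_status
    # quote_via=quote encodes spaces as %20, matching the original links.
    return TREEHERDER_JOBS + "?" + urlencode(params, quote_via=quote)


print(retry_jobs_url("mozilla-inbound", "win build"))
```

Passing `result_status=None` gives the broader link from comment 1, which shows every Windows build rather than only the ones that set the retry state.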
Okay, that makes more sense. I have a few hosts and time frames to zero in on.
Okay, after some more review of b-2008-0081 I found some network stack issues that might affect LAN connections with our topology. I saw TCP resets happening at times that coincided with the Treeherder errors. After MUCH testing I was able to tweak the stack to compensate. I have a new set of registry settings and a netsh command script, which have been pushed via GPO, and I will submit them for inclusion into Puppet as well. Philor, can you keep an eye out and let me know whether the disconnects stop?
My initial feeling is that this may also speed up hg checkouts if the data pipes have the extra space.
Looks fixed to me, thanks!
Assignee: nobody → q
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED