There are tons of slaves all over the place suffering from bug 695267. Before we start with complicated things, let's verify that the dongles are firmly and completely inserted into all the slaves. I know from experience that it's really hard to reach the plugs for the minis from behind the rack mount system, and it is easy to not plug a dongle in completely.

The way to check that a dongle is inserted properly is to ssh into the machine and run:

  screenresolution get

If that doesn't show "Display 0: 1600x1200x32", then run:

  screenresolution list 2>/dev/null | grep 1600x1200x32

There should be output containing 1600x1200x32. If there is no output from that list+grep, the machine does not have a functioning dongle. If 1600x1200x32 is in the list of available resolutions but is not the selected resolution, the machine needs a reboot to get the right resolution. Please verify with buildduty that the machine is not running jobs before rebooting it.
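The check described above boils down to a three-way decision, which could be sketched roughly as follows. This is only an illustration: the function name `classify_resolution` and the labels `ok`, `needs-reboot` and `no-dongle` are made up here and are not part of screenresolution or any existing tooling. The function takes the two command outputs as arguments so the logic can be tried without a display attached.

```shell
#!/bin/sh
# Hypothetical helper (not existing tooling): classify a machine from the
# output of the two screenresolution commands described above.
WANT="1600x1200x32"

classify_resolution() {
    current="$1"    # output of: screenresolution get
    available="$2"  # output of: screenresolution list 2>/dev/null
    if printf '%s\n' "$current" | grep -qF "Display 0: $WANT"; then
        echo ok              # correct resolution is already selected
    elif printf '%s\n' "$available" | grep -qF "$WANT"; then
        echo needs-reboot    # resolution is offered but not selected
    else
        echo no-dongle       # resolution is not offered at all
    fi
}

# Only query the display when the tool is actually installed.
if command -v screenresolution >/dev/null 2>&1; then
    classify_resolution "$(screenresolution get)" \
                        "$(screenresolution list 2>/dev/null)"
fi
```

A `needs-reboot` result still means checking with buildduty before rebooting, as noted above.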
Duplicate of this bug: 695215
no need for a trip - this can be done remotely.
Assignee: server-ops-releng → dustin
colo-trip: scl1 → ---
Created attachment 568266
talos-r4-resolutions.csv

I ran

  for h in talos-r4-snow-047; do echo -n "$h," >> talos-r4-resolutions.csv; ssh cltbld@$h 'get=`/usr/local/bin/screenresolution get`; list=`/usr/local/bin/screenresolution list 2>/dev/null | grep 1600x1200x32`; test "$get" = "Display 0: 1600x1200x32" && echo -n Y, || echo -n N,; test -n "$list" && echo Y || echo N' >> talos-r4-resolutions.csv; done

So column 2 is "Y" if the current resolution is correct; column 3 is "Y" if that resolution is available. I don't see any "N,Y"s, so no reboots are required here.

I couldn't connect to 047. I got these errors for 056:

  Wed Oct 19 17:40:39 talos-r4-snow-056.build.scl1.mozilla.com screenresolution <Error>: kCGErrorRangeCheck: On-demand launch of the Window Server is allowed for root user only.
  Wed Oct 19 17:40:39 talos-r4-snow-056.build.scl1.mozilla.com screenresolution <Error>: kCGErrorFailure: Set a breakpoint @ CGErrorBreakpoint() to catch errors as they are logged.
  Error: failed to get list of active displays

The list of "N,N"s is:

  talos-r4-snow-011,N,N
  talos-r4-snow-018,N,N
  talos-r4-snow-052,N,N
  talos-r4-snow-061,N,N
  talos-r4-snow-069,N,N
  talos-r4-snow-074,N,N

which doesn't quite correspond to the bad slaves in bug 695267.
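For anyone reading the CSV later, a small filter along these lines might help pull out the rows that need attention. It is a sketch that assumes the three-column layout described above (host, current resolution correct, resolution available); the function name `summarize_resolutions` and the message wording are invented for illustration. Rows without exactly three fields are reported as-is, since the loop writes the host name before ssh runs and a failed connection could leave a partial row behind.

```shell
#!/bin/sh
# Hypothetical summary of talos-r4-resolutions.csv (host,current,available).
summarize_resolutions() {
    awk -F, '
        NF != 3                { print $0 ": unexpected row"; next }
        $2 == "Y"              { next }  # correct resolution, nothing to do
        $2 == "N" && $3 == "Y" { print $1 ": reboot needed"; next }
        $2 == "N" && $3 == "N" { print $1 ": dongle not detected"; next }
                               { print $0 ": unexpected row" }
    ' "$1"
}

# Run against the attachment only if it is present in the current directory.
if [ -f talos-r4-resolutions.csv ]; then
    summarize_resolutions talos-r4-resolutions.csv
fi
```

Hosts reported as "dongle not detected" correspond to the "N,N" rows listed above.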
OK. Let's reseat the dongle on those slaves.
Sounds good. John is continuing research in bug 695267 on the remaining problem slaves. Since this is just six slaves, and not the whole pool, I've reduced the importance to "major" - this should happen on the next scl1 run, but no special trip.
Assignee: dustin → mlarrain
Severity: critical → major
colo-trip: --- → scl1
Summary: please verify that all dongles are firmly and completely inserted → re-seat dongles in six talos-r4 systems and check results
Please re-seat these five; 069 is moved to bug 695930.

  talos-r4-snow-011
  talos-r4-snow-018
  talos-r4-snow-052
  talos-r4-snow-061
  talos-r4-snow-074
Summary: re-seat dongles in six talos-r4 systems and check results → re-seat dongles in five talos-r4 systems and check results
Duplicate of this bug: 695930
We're back to re-seating six - these are all pretty clearly not dongle'd, and they stay that way over reboots.

  talos-r4-snow-011
  talos-r4-snow-018
  talos-r4-snow-052
  talos-r4-snow-061
  talos-r4-snow-069
  talos-r4-snow-074
Summary: re-seat dongles in five talos-r4 systems and check results → re-seat dongles in six talos-r4 systems and check results
I've verified these six are disabled in slavealloc, so they can be rebooted at will.
I will be there today working on the lion install/deploys for John Ford so I can verify those whilst I am there.
awesome, thanks Matt!
I checked all the dongles and they all seem to be plugged in. Just need them verified again.
Removed and replugged in the dongles and all is working now.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Matt and I did some diagnostics just now. Looks like the problem is improperly inserted dongles. The solution is to completely remove the dongle and reinsert. We also tested on 069, 011, 018, 052 with the same results.

[14:11:52] <digipengi> try talos-r4-snow-074
[14:11:59] <digipengi> I changed the dongle out
[14:12:18] <jhford> talos-r4-snow-074:~ cltbld$ screenresolution get
[14:12:18] <jhford> Display 0: 1600x1200x32
[14:12:29] <jhford> sounds like a defective dongle
[14:12:39] <jhford> can we measure the difference in resistance?
[14:12:44] <jhford> and otherwise compare them?
[14:13:24] <dustin> hang onto the defective one and bring it back
[14:13:45] <digipengi> but i put it in my "blow up with c4" pile :'(
[14:13:48] <dustin> although it'd be good to know if it's the dp dongle or the soldered part
[14:15:11] <jhford> digipengi: can you put the bad dongle back on the slave?
[14:15:18] <digipengi> sure
[14:15:20] <digipengi> brbs
[14:16:17] <digipengi> bad dongle back on
[14:17:29] <jhford> ok, it works now
[14:17:35] <jhford> did you use the same cable?
[14:17:41] <digipengi> no
[14:17:51] <jhford> can you use the same cable with the 'bad' dongle?
[14:18:16] <digipengi> that is
[14:18:30] <digipengi> by dongle I refer to the whole cable + plugged in thing
[14:18:33] <jhford> ok
[14:18:55] <digipengi> let me try another one of those gain
[14:18:57] <digipengi> again*
[14:19:01] <jhford> so the 'bad dongle' becomes a 'good dongle' when reinserted
[14:19:15] <digipengi> guess so
[14:19:35] <digipengi> I'm not going to question it
[14:19:39] <digipengi> if it works
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations