Windows 10 Moonshots take roughly 15 mins to pick up new tasks
Categories: Infrastructure & Operations :: RelOps: Hardware (task)
Tracking: Not tracked
People: Reporter: zfay; Assignee: markco
Details
After investigating W10 hardware workers, I noticed the time between tasks goes up to 15 minutes. This seems higher than it should be for a reboot plus deleting the files from the previous test.
@pmoore - markco thinks it is actually taking that long for the user environment to load.
Reporter
Comment 1 • 6 years ago
Hey Rob, what do you think is causing this?
Assignee
Comment 2 • 6 years ago
Here is an example:
20:29:20 (00:00) Generic-worker exits
20:30:27 (01:07) OCC starts
20:40:10 (10:50) DSC starts
20:41:10 (11:50) DSC completes
20:41:12 (11:52) OCC completes
20:41:13 (11:53) Generic-worker wrapper script starts (was waiting on DSC to complete)
20:41:21 (12:01) User init script begins to loop (HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Explorer\StuckRects3 does not yet exist; see the sketch after this timeline)
20:45:59 (16:39) User init script finishes
20:46:21 (17:01) Generic-worker is waiting on a task
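For context, a minimal sketch of what I believe the init script's wait loop amounts to (hypothetical reconstruction; the interval and timeout are illustrative, not the real values):

    # Poll until Explorer has written the StuckRects3 key, which only appears
    # once the user shell has finished initialising (interval/timeout are guesses).
    $key = 'HKCU:\Software\Microsoft\Windows\CurrentVersion\Explorer\StuckRects3'
    $deadline = (Get-Date).AddMinutes(10)
    while (-not (Test-Path $key)) {
        if ((Get-Date) -gt $deadline) {
            Write-Warning "gave up waiting for $key"
            break
        }
        Start-Sleep -Seconds 5
    }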
I am going to do a fresh install on a node and see if it takes a similar time. On Monday I will also dive through the event viewer and see if there are any issues causing a slowdown, particularly between the start of OCC and DSC and the init script exit.
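Something along these lines should surface the timing gaps (a sketch; the date and the 60-second threshold are illustrative):

    # Pull System and Application events from the window between generic-worker
    # exit and the next task pickup, then flag long gaps between entries.
    $start = Get-Date '2018-06-15 20:29:20'   # illustrative date
    $end   = Get-Date '2018-06-15 20:46:21'
    $events = Get-WinEvent -FilterHashtable @{
        LogName = 'System', 'Application'; StartTime = $start; EndTime = $end
    } | Sort-Object TimeCreated

    # Report any gap longer than 60 seconds between consecutive events.
    for ($i = 1; $i -lt $events.Count; $i++) {
        $gap = ($events[$i].TimeCreated - $events[$i - 1].TimeCreated).TotalSeconds
        if ($gap -gt 60) {
            '{0,6:n0}s gap before: {1} (event {2})' -f $gap,
                $events[$i].ProviderName, $events[$i].Id
        }
    }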
I think, though I could be wrong, that when we first started testing on the Moonshot nodes the time between exit and the next task was less than 5 minutes.
Assignee
Comment 3 • 6 years ago
I am doing a fresh install on ms-016.
Comment 4 • 6 years ago
minions-managed.tk uses a specific papertrail search filter to look for significant events in the run-task/system-restart cycle. that search filter would normally help us understand what's happening, but when i attempt to take a look, i see that the search does not return any results. probing further, i can see that there is no useful log data arriving in papertrail from gecko-t-win10-64-hw workers.
we have long had an issue with some percentage of hardware eventlog entries getting truncated or garbled, but today that percentage looks to be 100%. you can recognise a garbled entry by the system (hostname) field in papertrail. when the message has been truncated it always arrives at papertrail with a hostname that includes wintest.releng.mdc1.mozilla.com in its fqdn. log message entries that are not garbled never include wintest.releng in the hostname/fqdn.
i have long suspected the log aggregators of being to blame for these garbled messages but i don't know anything about our aggregator setup other than the hostname (log-aggregator.srv.releng.mdc1.mozilla.com).
in order to get reliable debug information out of the hardware workers, we need to know what they are doing between task completion and the start of a subsequent task. the first step to understanding that will be to get the papertrail logs ungarbled sufficiently that they start to make sense and show the timestamp of each routine start and end.
since i know nothing about the log aggregators, my approach would be to send the event log entries directly to papertrail without routing them through the log aggregators. we can achieve that immediately with a very simple change to the nxlog configuration, which i will experiment with shortly. since there is no useful log data arriving in papertrail from gecko-t-win10-64-hw instances, i think we have nothing to lose from making this change, at least temporarily, until somebody can fix the aggregators, or until we see that bypassing the aggregators still results in garbled messages (i'll know shortly).
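a minimal sketch of the kind of change i have in mind, modelled on papertrail's documented nxlog setup (the host, port and ca path are placeholders, not our real values):

    # send the windows event log straight to papertrail over tls, skipping
    # the log aggregators entirely. host/port/ca below are placeholders.
    <Extension syslog>
        Module  xm_syslog
    </Extension>

    <Input eventlog>
        Module  im_msvistalog
    </Input>

    <Output papertrail>
        Module  om_ssl
        Host    logsN.papertrailapp.com
        Port    XXXXX
        CAFile  %CERTDIR%\papertrail-bundle.pem
        Exec    to_syslog_ietf();
    </Output>

    <Route eventlog_to_papertrail>
        Path    eventlog => papertrail
    </Route>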
Comment 5 • 6 years ago
modified nxlog config to route directly to papertrail, bypassing the log aggregators:
https://github.com/mozilla-releng/OpenCloudConfig/commit/4978a75
worked perfectly (no garbled logs, no trailing "#015" - the escaped carriage return - on every message):
- gecko-t-win10-64-hw: https://papertrailapp.com/groups/13033891/events
- gecko-t-win10-64-ux: https://papertrailapp.com/groups/13033881/events
i think it's safe to say that either the log aggregator config needs to be fixed, or the aggregators were just overwhelmed by the volume of log messages.
i will see if i can figure out what the machines are doing between task runs now...
Comment 6 • 6 years ago
(In reply to Rob Thijssen [:grenade (EET)] from comment #4)
> we have long had an issue with some percentage of hardware eventlog entries getting truncated or garbled, but today that percentage looks to be 100%. you can recognise a garbled entry by the system (hostname) field in papertrail. when the message has been truncated it always arrives at papertrail with a hostname that includes wintest.releng.mdc1.mozilla.com in its fqdn. log message entries that are not garbled never include wintest.releng in the hostname/fqdn.
> i have long suspected the log aggregators of being to blame for these garbled messages but i don't know anything about our aggregator setup other than the hostname (log-aggregator.srv.releng.mdc1.mozilla.com).
Jake, can you look into this? If the aggregators are doing something to windows logs, we should fix; if not, then we know to look elsewhere for the garbling.
Comment 7 • 6 years ago
looking at the logs, it appears that there are two dsc processes running simultaneously before control of the instance is handed to generic-worker. one is triggered by occ, the other is triggered by dsc's internal scheduler. the second process isn't required and is just slowing things down.
dsc is a resource-hungry process, and when the internally triggered dsc run starts, pretty much everything else just hangs until it finishes and other processes can get their hands on a few cpu cycles.
i think we need to look at disabling the internal dsc scheduler. we don't want that running, as we have no control over its timings, and it does no good to have both processes running simultaneously.
i'll work on a patch to get that done (a sketch of the approach is below). it may need to wait until some of the work for sha pinning of hardware workers is complete (next week, maybe), because there's enough churn on windows 10 config code in occ right now to get things really confused if we add anything else into the mix.
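a sketch of one way to do it, assuming the workers use the default lcm settings: push a dsc meta-configuration that sets the lcm to ApplyOnly, which stops the internally scheduled consistency re-runs (the output path is a placeholder):

    # Compile and apply an LCM meta-configuration so DSC only runs when OCC
    # invokes it explicitly, instead of re-running on its own schedule.
    [DSCLocalConfigurationManager()]
    configuration DisableDscScheduler {
        Node localhost {
            Settings {
                RefreshMode       = 'Push'       # only run when explicitly pushed
                ConfigurationMode = 'ApplyOnly'  # no scheduled consistency re-runs
            }
        }
    }

    DisableDscScheduler -OutputPath 'C:\dsc\meta'   # writes localhost.meta.mof
    Set-DscLocalConfigurationManager -Path 'C:\dsc\meta' -Verbose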