Closed Bug 1452133 Opened 7 years ago Closed 6 years ago

Moonshot Windows 10 nodes stop functioning

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: markco, Assigned: markco)

References

Details

Attachments

(8 files)

A small percentage of nodes will suddenly stop reporting to papertrail and become unreachable. Rebooting the machine remedies the situation.
The local OS event logs also stop and do not pick up again until the reboot.
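A quick, illustrative way to find affected nodes (a hypothetical sweep, not an existing tool; the node-number range and the mdc1.mozilla.com naming are assumptions taken from hostnames quoted later in this bug, and ping is only a rough proxy for "stopped functioning"):

# Hypothetical sweep: report which moonshot Windows nodes no longer answer ping.
$nodes = 1..299 | ForEach-Object { 'T-W1064-MS-{0:d3}.mdc1.mozilla.com' -f $_ }
$down = foreach ($node in $nodes) {
    # -Quiet makes Test-Connection return $true/$false instead of echo objects.
    if (-not (Test-Connection -ComputerName $node -Count 1 -Quiet -ErrorAction SilentlyContinue)) {
        $node
    }
}
Write-Output ("Unreachable nodes: {0}" -f ($down -join ', '))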
Summary: Moonshot Windows 10 nodes drop off the network → Moonshot Windows 10 nodes stop functioning
Assignee: relops → mcornmesser
Added additional logging to the run-generic-worker bat to see whether cleanup is affecting the node: https://github.com/mozilla-releng/OpenCloudConfig/commit/bb9f3a82317d722d536c4b363315f007bca2f9be The papertrail logs have shown, for multiple nodes, that the cleanup starts and then the logs stop:
Apr 11 13:23:07 T-W1064-MS-071.mdc1.mozilla.com generic-worker: Removing temp dir contents #015
Apr 11 13:23:07 T-W1064-MS-071.mdc1.mozilla.com generic-worker: C:\Users\GenericWorker\AppData\Local\Temp\aria-debug-4180.log#015
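For context, the pattern behind that change (the exact edit is in the OCC commit linked above) is to emit a marker before and after each cleanup step so the last marker visible in papertrail shows where a hung node stopped. A minimal PowerShell sketch of the same idea, with made-up step names and assuming the wrapper's output is picked up by the existing log forwarding:

# Illustrative only: bracket each cleanup step with log markers so papertrail
# shows the last step a node reached before it stopped logging.
function Invoke-LoggedStep {
    param([string]$Name, [scriptblock]$Action)
    Write-Output ('{0} cleanup step starting: {1}' -f (Get-Date -Format s), $Name)
    & $Action
    Write-Output ('{0} cleanup step finished: {1}' -f (Get-Date -Format s), $Name)
}
# Hypothetical step names; the real steps live in run-generic-worker.bat / OCC.
Invoke-LoggedStep -Name 'remove temp dir contents' -Action {
    Remove-Item -Path "$env:TEMP\*" -Recurse -Force -ErrorAction SilentlyContinue
}
Invoke-LoggedStep -Name 'empty recycle bin' -Action {
    Clear-RecycleBin -Force -ErrorAction SilentlyContinue
}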
Commented out the dism command: https://github.com/mozilla-releng/OpenCloudConfig/commit/b4dbe3ce9511b5c5cb6ec047f1485c80b1ea4056 It throws an error from time to time, and may not be needed now that updates are under control.
https://github.com/mozilla-releng/OpenCloudConfig/commit/d355e59c7c3cf6c55b8e58334624e41e106c98e0 Added additional log messages to see if the node locks up during various parts of the cleanup.
After surfing through each page of the Windows 10 test machines in papertrail, I've reimaged the following t-w1064-ms machines: 35, 38, 214, 244, 269. I have also updated the above doc to note that these machines have been affected.
I've just sanity-checked the above workers as a follow-up and can confirm that machines 38, 214, 244 and 269 are back, taking jobs and completing them. However, machines 35 and 77 are still unreachable.
Attached image 1.png —
I have re-imaged 35 and 77, but the process is not OK. I rebooted the server, hit F12, started PXE, selected option two and started the re-image process; after the first boot the server starts PXE again (it should boot from HDD, right?). I let the process start again and, after a while, it says that there is already a re-image running: do you want to start a new one, or not (and let the first one continue)? If I choose Yes, the process starts again and a loop is created. If I say No and let the already-running process continue, it fails with the errors from capture 1. If it does not fail, the process continues, but after a reboot and some Windows setup steps I get error 2.
Attached image 2.png —
Tried to reimage t-w1064-ms-262. It seems to get stuck in a loading state.
Attached image worker-ms-77.PNG —
T-W1064-MS-077 looks OK now; many jobs have completed successfully. However, the iLO capture shows the machine in a stuck state.
Attached image ms77-stuck.png —
Checking other machines that are visible in TC and taking jobs, it looks like they are in the same stuck state. I believe that state is a normal one :).
T-W1064-MS-035 is also back in TC, waiting to take jobs.
MS-35 took jobs and completed them successfully. There is a glitch in the re-image. To get it done successfully, follow these steps:
- Start the re-image as the docs say and watch the iLO.
- If, after the first reboot, the server goes to PXE, reboot it and press F9 to select booting from HDD; otherwise follow the steps below.
- Watch the iLO and the Windows setup; if the Windows login is root then you are on the right track. If not, the re-image process needs to be restarted until the root user logs in.
- Once the setup completes and the final reboot is done, the iLO will show Windows stuck at loading.
- The re-imaged machine should be visible in TC and take jobs.
After more than 10 re-images, the above solution works :).
I have found other Win10-MS servers that are not in TC: 111, 130, 178, 179, 215, 257, 263, 268. Working to reimage 111 and 130 (:apop).
T-W1064-MS-111 is reimaged and took jobs that completed successfully. T-W1064-MS-178 is reimaged and waiting for jobs.
Attached image T-W1064-MS-179.png —
I have tried to re-image T-W1064-MS-179 but encountered this problem. ^ After this the Automatic Repair started and then the iLO console seems to be stuck ("Scanning and repairing drive (\\?\SystemPartition): 100% complete").
T-W1064-MS-178 - running jobs successfully
T-W1064-MS-179 - re-imaged, running jobs successfully
It looks like the re-image process is set to run until it is done, no matter what. If there is some error, like the one captured by Radu, you somehow need to finish the process. I rebooted the server several times, let the server boot from HDD, skipped the disk check and let the process run until I got the same error in Windows. After one more reboot, I started PXE, where I got the error that the deployment cannot continue, and I hit Finish. That is when the previous process ended and I was able to start a PXE boot and re-image the server successfully.
T-W1064-MS-215 - re-imaged, taking jobs successfully
T-W1064-MS-257 - re-imaged, taking jobs successfully
T-W1064-MS-263 - re-imaged, taking jobs successfully
T-W1064-MS-268 - re-imaged, taking jobs successfully
T-W1064-MS-064 - re-imaged, taking jobs successfully
T-W1064-MS-255 - re-imaged, taking jobs successfully
Looks like T-W1064-MS-262 is taking jobs now.
While monitoring Windows 10 Moonshots machines I saw that we have troubles with T-W1064-MS-021. Tried to connect remotly to it via ILO it returned me the following error: ExitException[ 3]com.sun.deploy.net.FailedDownloadException: Unable to load resource at jdk.plugin@9.0.4/sun.plugin2.applet.JNLP2Manager.downloadResources(JNLP2Manager.java:1846) at jdk.plugin@9.0.4/sun.plugin2.applet.JNLP2Manager.prepareLaunchFile(JNLP2Manager.java:1457) at jdk.plugin@9.0.4/sun.plugin2.applet.JNLP2Manager.loadJarFiles(JNLP2Manager.java:476) at jdk.plugin@9.0.4/sun.plugin2.applet.Plugin2Manager$AppletExecutionRunnable.run(Plugin2Manager.java:1770) at java.base/java.lang.Thread.run(Thread.java:844) Caused by: com.sun.deploy.net.FailedDownloadException: Unable to load resource at jdk.deploy@9.0.4/com.sun.deploy.net.DownloadEngine.actionDownload(DownloadEngine.java:807) at jdk.deploy@9.0.4/com.sun.deploy.net.DownloadEngine.downloadResource(DownloadEngine.java:914) at jdk.deploy@9.0.4/com.sun.deploy.cache.ResourceProviderImpl.getResource(ResourceProviderImpl.java:387) at jdk.deploy@9.0.4/com.sun.deploy.cache.ResourceProviderImpl.getResource(ResourceProviderImpl.java:325) at jdk.javaws@9.0.4/com.sun.javaws.LaunchDownload$DownloadTask.call(LaunchDownload.java:1649) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641) ... 1 more
Started the reimaging process for T-W1064-MS-113 and T-W1064-MS-034.
(In reply to Roland Mutter Michael (:rmutter) from comment #23)
> Created attachment 8980529 [details]
> Screenshot_2018-05-25_12-35-49.png
>
> While monitoring the Windows 10 Moonshot machines I saw that we have trouble
> with T-W1064-MS-021. Tried to connect to it remotely via iLO and it returned
> the following error:

Works from my tux box, so I have started the re-image for T-W1064-MS-021 and T-W1064-MS-023.
Adding T-W1064-MS-263 to reimaged moonshots ^.
T-W1064-MS-021 and T-W1064-MS-023 - re-imaged, taking jobs successfully
T-W1064-MS-281 has some issues. It won't boot from the network for a reimage. Not sure if it's a hardware-related issue.
T-W1064-MS-037 - not found in TC, rebooted, took jobs successfully
UPDATE: T-W1064-MS-263, T-W1064-MS-113 and T-W1064-MS-034 are back in production taking jobs.
Update:
T-W1064-MS-042 - reimaged and takes jobs
T-W1064-MS-134 - reimaged and takes jobs
T-W1064-MS-207 - reimaged and takes jobs
T-W1064-MS-212 - reimaged and takes jobs
T-W1064-MS-265 - reimaged and takes jobs
T-W1064-MS-129 - reimaged and takes jobs
T-W1064-MS-215 - reimaged and takes jobs
Problems with:
T-W1064-MS-170 - not found in Taskcluster; unable to reimage it, it gets rebooted when it reaches the Deployment Kit view
T-W1064-MS-035 - reimaged, still waiting for it to appear in TC
T-W1064-MS-036 - reimaged, still waiting for it to appear in TC
T-W1064-MS-040 - reimaged, still waiting for it to appear in TC
T-W1064-MS-035 - takes jobs
T-W1064-MS-036 - takes jobs
T-W1064-MS-040 - takes jobs
T-W1064-MS-129 - reimaged and takes jobs T-W1064-MS-214 - reimaged, rebooted and waiting to take jobs
T-W1064-MS-21 reimaged and takes jobs
T-W1064-MS-70 reimaged and takes jobs
T-W1064-MS-81 reimaged and takes jobs
T-W1064-MS-165 reimaged and takes jobs
T-W1064-MS-170 reimaged and takes jobs
Please do not reimage 135 and 81. I am using both of those for testing and they may not be reporting to papertrail. I will update here once I am done with those nodes.
Reimaged: T-W1064-MS-077 T-W1064-MS-087 They are both back in taskcluster.
T-W1064-MS-031 rebooted, reimaged and takes jobs
T-W1064-MS-082 rebooted, reimaged and takes jobs
T-W1064-MS-175 rebooted and takes jobs
T-W1064-MS-247 wasn't in Taskcluster, reimaged and takes jobs
T-W1064-MS-267 rebooted, reimaged and takes jobs
T-W1064-MS-029 - reboot, takes jobs
T-W1064-MS-030 - reboot, takes jobs
T-W1064-MS-031 - reimage, takes jobs
T-W1064-MS-032 - reboot, takes jobs
T-W1064-MS-035 - reimage, takes jobs
T-W1064-MS-037 - reboot, takes jobs
T-W1064-MS-085 - reboot, reimage, takes jobs
T-W1064-MS-088 - reboot, takes jobs
T-W1064-MS-089 - reboot, takes jobs
T-W1064-MS-117 - reboot, takes jobs
T-W1064-MS-123 - reboot, takes jobs
T-W1064-MS-131 - reboot, takes jobs
T-W1064-MS-132 - reboot, takes jobs
T-W1064-MS-152 - reboot, takes jobs
T-W1064-MS-155 - reboot, takes jobs
T-W1064-MS-219 - reboot, takes jobs
T-W1064-MS-267 - reboot, takes jobs
Hey Mark, can you confirm that you still need 81 and 135 for testing? Also I know you are testing the new generic worker on chassis 1, so fyi 24 and 34 are suddenly missing from taskcluster. If this was intentionally done then just ignore this.
Flags: needinfo?(mcornmesser)
Went for a full check of which Windows moonshots appear in TC. It seems the following machines are not in TC:
T-W1064-MS-18, T-W1064-MS-24, T-W1064-MS-27, T-W1064-MS-29, T-W1064-MS-30, T-W1064-MS-33, T-W1064-MS-34, T-W1064-MS-35, T-W1064-MS-40, T-W1064-MS-42, T-W1064-MS-84, T-W1064-MS-109, T-W1064-MS-113, T-W1064-MS-117, T-W1064-MS-130, T-W1064-MS-151, T-W1064-MS-247, T-W1064-MS-261, T-W1064-MS-281
Will proceed with a check in papertrail and a reboot for every machine. If that doesn't work, I'll start a reimage for each one. I'll be back with updates.
After two sessions of reboots, the following machines seem to need a reimage: T-W1064-MS-18, T-W1064-MS-29, T-W1064-MS-30, T-W1064-MS-40, T-W1064-MS-84, T-W1064-MS-117, T-W1064-MS-130. T-W1064-MS-33 seems to be broken, and T-W1064-MS-281 was found in the System Health Summary; I rebooted it and now it is in some kind of boot loop. My guess is that it doesn't find a source to boot from.
(In reply to Zsolt Fay [:zsoltfay] from comment #40)
> Hey Mark, can you confirm that you still need 81 and 135 for testing?
>
> Also I know you are testing the new generic worker on chassis 1, so fyi 24
> and 34 are suddenly missing from taskcluster. If this was intentionally done
> then just ignore this.

I am no longer using any additional nodes outside of chassis 1. 34 and 24 were unintentional. I will take a look at them later.
Flags: needinfo?(mcornmesser)
Found several machines today that hadn't taken jobs for over 3 hours (some even 20+ hours), while there was a 700+ pending queue. Re-imaged them, and they have already taken jobs successfully. The machines in question are: t-win1064-ms-{067, 078, 080, 086, 088, 107, 109, 117, 118, 129, 131, 152, 157, 217, 256, 262}. I have also re-imaged 168, but it does not complete the re-image. At first glance it appears it does not have the option to boot from an HDD (only XenServer, IPv4 and IPv6 boot).
Here is a complete list of machines that don't appear in taskcluster [221]: 'T-W1064-MS-017', 'T-W1064-MS-018', 'T-W1064-MS-020', 'T-W1064-MS-023', 'T-W1064-MS-029', 'T-W1064-MS-034', 'T-W1064-MS-035', 'T-W1064-MS-041', 'T-W1064-MS-042', 'T-W1064-MS-044', 'T-W1064-MS-045', 'T-W1064-MS-064', 'T-W1064-MS-065', 'T-W1064-MS-106', 'T-W1064-MS-112', 'T-W1064-MS-116', 'T-W1064-MS-126', 'T-W1064-MS-130', 'T-W1064-MS-134', 'T-W1064-MS-151', 'T-W1064-MS-156', 'T-W1064-MS-168', 'T-W1064-MS-222', 'T-W1064-MS-261', 'T-W1064-MS-281', 'T-W1064-MS-316', 'T-W1064-MS-318', 'T-W1064-MS-319', 'T-W1064-MS-320', 'T-W1064-MS-321', 'T-W1064-MS-322', 'T-W1064-MS-323', 'T-W1064-MS-324', 'T-W1064-MS-325', 'T-W1064-MS-326', 'T-W1064-MS-327', 'T-W1064-MS-328', 'T-W1064-MS-329', 'T-W1064-MS-330', 'T-W1064-MS-331', 'T-W1064-MS-332', 'T-W1064-MS-333', 'T-W1064-MS-334', 'T-W1064-MS-335', 'T-W1064-MS-336', 'T-W1064-MS-337', 'T-W1064-MS-338', 'T-W1064-MS-339', 'T-W1064-MS-340', 'T-W1064-MS-341', 'T-W1064-MS-342', 'T-W1064-MS-343', 'T-W1064-MS-344', 'T-W1064-MS-345', 'T-W1064-MS-362', 'T-W1064-MS-363', 'T-W1064-MS-364', 'T-W1064-MS-365', 'T-W1064-MS-366', 'T-W1064-MS-367', 'T-W1064-MS-368', 'T-W1064-MS-369', 'T-W1064-MS-370', 'T-W1064-MS-371', 'T-W1064-MS-372', 'T-W1064-MS-373', 'T-W1064-MS-374', 'T-W1064-MS-375', 'T-W1064-MS-376', 'T-W1064-MS-377', 'T-W1064-MS-378', 'T-W1064-MS-379', 'T-W1064-MS-380', 'T-W1064-MS-381', 'T-W1064-MS-382', 'T-W1064-MS-383', 'T-W1064-MS-384', 'T-W1064-MS-385', 'T-W1064-MS-386', 'T-W1064-MS-387', 'T-W1064-MS-388', 'T-W1064-MS-389', 'T-W1064-MS-390', 'T-W1064-MS-406', 'T-W1064-MS-407', 'T-W1064-MS-408', 'T-W1064-MS-409', 'T-W1064-MS-410', 'T-W1064-MS-411', 'T-W1064-MS-412', 'T-W1064-MS-413', 'T-W1064-MS-414', 'T-W1064-MS-415', 'T-W1064-MS-416', 'T-W1064-MS-417', 'T-W1064-MS-418', 'T-W1064-MS-419', 'T-W1064-MS-420', 'T-W1064-MS-421', 'T-W1064-MS-422', 'T-W1064-MS-423', 'T-W1064-MS-424', 'T-W1064-MS-425', 'T-W1064-MS-426', 'T-W1064-MS-427', 'T-W1064-MS-428', 'T-W1064-MS-429', 'T-W1064-MS-430', 'T-W1064-MS-431', 'T-W1064-MS-432', 'T-W1064-MS-433', 'T-W1064-MS-434', 'T-W1064-MS-435', 'T-W1064-MS-451', 'T-W1064-MS-452', 'T-W1064-MS-453', 'T-W1064-MS-454', 'T-W1064-MS-455', 'T-W1064-MS-456', 'T-W1064-MS-457', 'T-W1064-MS-458', 'T-W1064-MS-459', 'T-W1064-MS-460', 'T-W1064-MS-461', 'T-W1064-MS-462', 'T-W1064-MS-463', 'T-W1064-MS-464', 'T-W1064-MS-465', 'T-W1064-MS-466', 'T-W1064-MS-467', 'T-W1064-MS-468', 'T-W1064-MS-469', 'T-W1064-MS-470', 'T-W1064-MS-471', 'T-W1064-MS-472', 'T-W1064-MS-473', 'T-W1064-MS-474', 'T-W1064-MS-475', 'T-W1064-MS-476', 'T-W1064-MS-477', 'T-W1064-MS-478', 'T-W1064-MS-479', 'T-W1064-MS-480', 'T-W1064-MS-497', 'T-W1064-MS-498', 'T-W1064-MS-499', 'T-W1064-MS-500', 'T-W1064-MS-501', 'T-W1064-MS-502', 'T-W1064-MS-503', 'T-W1064-MS-504', 'T-W1064-MS-505', 'T-W1064-MS-506', 'T-W1064-MS-507', 'T-W1064-MS-508', 'T-W1064-MS-509', 'T-W1064-MS-510', 'T-W1064-MS-511', 'T-W1064-MS-512', 'T-W1064-MS-513', 'T-W1064-MS-514', 'T-W1064-MS-515', 'T-W1064-MS-516', 'T-W1064-MS-517', 'T-W1064-MS-518', 'T-W1064-MS-519', 'T-W1064-MS-520', 'T-W1064-MS-521', 'T-W1064-MS-522', 'T-W1064-MS-523', 'T-W1064-MS-524', 'T-W1064-MS-525', 'T-W1064-MS-542', 'T-W1064-MS-543', 'T-W1064-MS-544', 'T-W1064-MS-545', 'T-W1064-MS-546', 'T-W1064-MS-547', 'T-W1064-MS-548', 'T-W1064-MS-549', 'T-W1064-MS-550', 'T-W1064-MS-551', 'T-W1064-MS-552', 'T-W1064-MS-553', 'T-W1064-MS-554', 'T-W1064-MS-555', 'T-W1064-MS-556', 'T-W1064-MS-557', 'T-W1064-MS-558', 'T-W1064-MS-559', 'T-W1064-MS-560', 'T-W1064-MS-561', 'T-W1064-MS-562', 
'T-W1064-MS-563', 'T-W1064-MS-564', 'T-W1064-MS-565', 'T-W1064-MS-566', 'T-W1064-MS-567', 'T-W1064-MS-568', 'T-W1064-MS-569', 'T-W1064-MS-570', 'T-W1064-MS-581', 'T-W1064-MS-582', 'T-W1064-MS-583', 'T-W1064-MS-584', 'T-W1064-MS-585', 'T-W1064-MS-586', 'T-W1064-MS-587', 'T-W1064-MS-588', 'T-W1064-MS-589', 'T-W1064-MS-590', 'T-W1064-MS-591', 'T-W1064-MS-592', 'T-W1064-MS-593', 'T-W1064-MS-594', 'T-W1064-MS-595', 'T-W1064-MS-596', 'T-W1064-MS-597', 'T-W1064-MS-598', 'T-W1064-MS-599', 'T-W1064-MS-600' I'll begin working on them, if anyone has any info about them, please comment in this bug.
Update for the machines above:
moon-chassis-2
'T-W1064-MS-064' - Rebooted, reachable, took a job and finished it as completed
'T-W1064-MS-065' - Rebooted, reachable, took a job and finished it as completed
moon-chassis-3
'T-W1064-MS-106' - Rebooted, reachable, took a job and finished it as completed
'T-W1064-MS-112' - Rebooted, reachable, took a job and finished it as completed
'T-W1064-MS-116' - Rebooted, reachable, took a job and is running it
'T-W1064-MS-126' - Rebooted but unreachable
'T-W1064-MS-130' - Rebooted but starts with disk-checking and a blue screen; after a few retries I turned it off
'T-W1064-MS-134' - Rebooted, reachable, took a job and finished it as completed
moon-chassis-4
'T-W1064-MS-151' - Rebooted, reachable
'T-W1064-MS-156' - Rebooted, reachable, took a job and finished it as completed
'T-W1064-MS-168' - Tried to reimage it but failed; see screenshot (Screenshot_2018-07-04_06-31-42_fail_T-W1064-MS-168.png)
moon-chassis-5
'T-W1064-MS-222' - Rebooted, reachable but did not appear in taskcluster; initiated a reimage on it and it reimaged successfully
moon-chassis-6
'T-W1064-MS-261' - Rebooted, reachable
moon-chassis-7
'T-W1064-MS-281' - At boot it stays in booting-from-network and I cannot make it boot from SSD
moon-chassis-8
'T-W1064-MS-316' - Runs XenServer
'T-W1064-MS-325' - Began a reimage on it
'T-W1064-MS-338' - Began a reimage on it
Those from moon-chassis-8 were randomly chosen.
Also from the last chassis: 'T-W1064-MS-600' - Runs XenServer
Went for 3 sessions of checks/cold boots for the Windows moonshots. Performed a cold boot on the following: 017, 018, 020, 023, 026, 029, 034, 035, 041, 042, 044, 045, 068, 072, 077, 090, 111, 114, 130, 135, 162, 165, 281, 288, 292, 294. Of these, the following recovered: 023, 026, 029, 034, 041, 044, 045, 068, 072, 165, 288, 292, 294. The following need a later recheck/reimage: 017, 018, 020, 035, 042, 044, 072, 077, 090, 111, 114, 130, 135, 162, 281.
Depends on: 1473589
T-W1064-MS-017.releng.mdc1.mozilla.com reboots itself after Windows starts; reimaged, OK.
T-W1064-MS-018.releng.mdc1.mozilla.com booted into Windows with the administrator login; NIC configuration error during re-image, cannot reimage.
T-W1064-MS-024.releng.mdc1.mozilla.com reboot, re-imaged, OK.
T-W1064-MS-035.releng.mdc1.mozilla.com in TC, no jobs for 3 days; reboot, reimage, not in TC anymore.
T-W1064-MS-038.releng.mdc1.mozilla.com in TC, no jobs for 1 day; reboot, OK.
T-W1064-MS-042.releng.mdc1.mozilla.com not in TC; reboot, reimage, OK.
T-W1064-MS-073.releng.mdc1.mozilla.com in TC, no jobs for 1 day; reboot, OK.
T-W1064-MS-074.releng.mdc1.mozilla.com in TC, no jobs for 2 days; reboot, reimage, OK.
T-W1064-MS-077.releng.mdc1.mozilla.com in TC, no jobs for 2 days; reboot, reimage, in TC waiting for tasks.
T-W1064-MS-080.releng.mdc1.mozilla.com in TC, no jobs for 2 days; reboot, reimage, in TC waiting for tasks.
T-W1064-MS-019.releng.mdc1.mozilla.com not in TC; reboot, OK.
T-W1064-MS-020.releng.mdc1.mozilla.com not in TC; reboot, reimage, OK, tasks running fine.
T-W1064-MS-035.releng.mdc1.mozilla.com reboot, reimage, still not in TC.
T-W1064-MS-036.releng.mdc1.mozilla.com not in TC, no jobs for 2 days; reboot, reimage, OK, tasks running fine.
T-W1064-MS-037.releng.mdc1.mozilla.com not in TC, no jobs for 1 day; reboot, reimage, OK, running jobs.
T-W1064-MS-043.releng.mdc1.mozilla.com not in TC, no jobs for 2 days; reboot, OK, jobs running.
T-W1064-MS-065.releng.mdc1.mozilla.com not in TC, no jobs for 1 day; reboot, reimage, not in TC anymore.
T-W1064-MS-071.releng.mdc1.mozilla.com not in TC, no jobs for 2 days; reboot, reimage, not in TC anymore.
T-W1064-MS-072.releng.mdc1.mozilla.com not in TC, no jobs for 2 days; reboot, reimage, not in TC anymore.
T-W1064-MS-083.releng.mdc1.mozilla.com not in TC, no jobs for 2 days; reboot, reimage, not in TC anymore.
I noticed an image rename from "Windows 10" to "Windows 10 -1". Are there some changes going on? The above servers were reimaged with "Windows 10 generic 10" (19-43) and with "Windows 10 -1" (65-83).
We should be using the same image on all systems; I'm not sure what the new image is, so until we say otherwise please don't use it. Mark/Q: what IS that new image?
Flags: needinfo?(q)
Flags: needinfo?(mcornmesser)
Those two images have been around for a while; there are no new images, as the task IDs are the same. The GW 10 one is generic-worker 10, which in all likelihood should not have been done as a separate image but as an OCC switch.
Flags: needinfo?(q)
Ok, so anything that was imaged with "Windows 10 generic 10" needs to be quarantined and re-imaged.
Flags: needinfo?(ciduty)
I'm going to reimage the machines that were reimaged with "Windows 10 generic 10" on my ongoing shift. :fubar, is there a way to find out if we have any other machines reimaged with "Windows 10 generic 10" besides those mentioned above by :arny?
Flags: needinfo?(ciduty)
Great question, and I don't know the answer; check in with :markco - he's still investigating a couple of other install issues that came out of this, and we might need to wait until those are sorted. But he's got all of the details atm.
Because of differences in functionality and configuration needs there are 2 separate task sequences. GW 8 creates a single user that performs tasks. GW 10 is a service that creates individual user environments and performs tasks. In addition, the config file, which is currently deployed from MDT, is different. Because of this, the cleanest way to go about it was to have a separate image. Moving forward, if GW does not have a significant change in functionality then we will not need separate images for it. As noted in other bugs and the spreadsheet attached to this bug, there is a GW 10 pool of ms-016 through ms-045. That is being used to debug and catch any other unknown issues with the current configuration and GW 10. I have not seen the "Windows 10 -1" task sequence name. If it pops up again, could you take a screenshot please?
Flags: needinfo?(mcornmesser)
Reimaged all servers that:
- Got reinstalled from the 4th to date.
- Were missing in TC.
Reimage successful on: 064, 065, 071, 072, 074, 078, 080, 083, 090, 106, 107, 110, 111, 112, 114, 116, 124, 126, 128, 132, 151, 153, 165, 166, 168, 169, 222, 256, 260, 261
HPE RESTful API missing / Intelligent Provisioning: 085
No video/image: 130, 134, 135, 154, 156, 158, 162
Stuck at loading: 262
Stuck at PXE: 281
As a follow-up to the above-mentioned machines: they were checked and reimaged (if needed) along with 066, 109, 134, 135, 154, 156, 158, 164, 262. These are all taking jobs now except: 071 and 072 are still missing from TC after reimage; 222 is in quarantine; 262 doesn't even load the BIOS and looks to be offline entirely. Also, 130 is still broken, see Bug 1463754.
I checked 071 and 072. They had multiple drives mounted. I have kicked off a new install.
085: was able to connect and kicked off a new install.
130: updated the bug with its current behavior. Will include it in the request for support from HP.
262: appears to be up and waiting for a task: https://papertrailapp.com/groups/1958653/events?focus=952660421653987342&q=ms-262&selected=952660421653987342
Jul 08 10:00:27 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:00:26 No task claimed...#015
Jul 08 10:01:28 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:01:27 No task claimed...#015
Jul 08 10:02:29 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:02:28 No task claimed...#015
Jul 08 10:03:30 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:03:29 No task claimed...#015
Jul 08 10:04:31 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:04:30 No task claimed...#015
Jul 08 10:05:32 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:05:31 No task claimed...#015
281: was attempting PXE boot over IPv6. Previously, shutting down the node, waiting hours, and then attempting to reimage resolved the issue. The node is currently shut down.
(In reply to Kendall Libby [:fubar] from comment #53)
> Ok, so anything that was imaged with "Windows 10 generic 10" needs to be
> quarantined and re-imaged.

From what I know, the Windows 10 generic 10 image should be used only on MS 16-45; I did those reimages, and they are needed for testing. :markco, can you confirm?
Flags: needinfo?(mcornmesser)
That has been my plan thus far, and most of those are up and in Taskcluster currently, except for 18 and 35, which are having PXE boot issues, and 23, which may have a different hardware issue.
Flags: needinfo?(mcornmesser)
T-W1064-MS-018.releng.mdc1.mozilla.com PXE issue
T-W1064-MS-035.releng.mdc1.mozilla.com PXE issue
T-W1064-MS-065.releng.mdc1.mozilla.com no jobs for 1 day; reboot, reimage, reboot, still no jobs, not in TC anymore. Reimaged again; after the second reimage, there is no disk space :|
Jul 09 07:31:26 T-W1064-MS-065.mdc1.mozilla.com OpenCloudConfig: Current available disk space CRITCAL 0% free. Will not start Generic-Worker!#015
T-W1064-MS-071.releng.mdc1.mozilla.com not in TC; reboot, reimage, reboot, still not in TC. Papertrail error:
Jul 09 05:51:58 T-W1064-MS-071.mdc1.mozilla.com Microsoft-Windows-DNS-Client: The system failed to register host (A or AAAA) resource records (RRs) for network adapter with settings: Adapter Name : {103341C9-EE3A-4DBF-BBE8-13E74A6368AA} Host Name : T-W1064-MS-071 Primary Domain Suffix : mdc1.mozilla.com DNS server list : #01110.48.75.120, 10.50.75.120 Sent update to server : <?> IP Address(es) : 10.49.40.42 The reason the system could not register these RRs during the update request was because of a system problem. You can manually retry DNS registration of the network adapter and its settings by typing 'ipconfig /registerdns' at the command prompt. If problems still persist, contact your DNS server or network systems administrator. See event details for specific error code information.#015
T-W1064-MS-072.releng.mdc1.mozilla.com not in TC; reboot, reimage, reboot, still not in TC. Papertrail error:
Jul 09 05:52:10 T-W1064-MS-072.mdc1.mozilla.com Microsoft-Windows-DNS-Client: The system failed to register host (A or AAAA) resource records (RRs) for network adapter with settings: Adapter Name : {A56E9F32-310E-4A61-9453-20D41890DDE2} Host Name : T-W1064-MS-072 Primary Domain Suffix : mdc1.mozilla.com DNS server list : #01110.48.75.120, 10.50.75.120 Sent update to server : <?> IP Address(es) : 10.49.40.43 The reason the system could not register these RRs during the update request was because of a system problem. You can manually retry DNS registration of the network adapter and its settings by typing 'ipconfig /registerdns' at the command prompt. If problems still persist, contact your DNS server or network systems administrator. See event details for specific error code information.#015
T-W1064-MS-083.releng.mdc1.mozilla.com no jobs for 20h; reboot, reimage, running jobs
T-W1064-MS-111.releng.mdc1.mozilla.com no jobs for 1 day; reboot, reimage, running jobs
T-W1064-MS-118.releng.mdc1.mozilla.com not in TC; reboot, back in TC, waiting for jobs, running jobs
T-W1064-MS-119.releng.mdc1.mozilla.com not in TC; reboot, reimage, running jobs
T-W1064-MS-121.releng.mdc1.mozilla.com not in TC; reboot, reimage, running jobs
T-W1064-MS-125.releng.mdc1.mozilla.com not in TC; reboot, still not in TC, reimage. Back in TC, waiting for jobs. Here are some papertrail errors that I captured during the reimage:
Jul 09 07:44:57 T-W1064-MS-125.mdc1.mozilla.com Microsoft-Windows-DSC: Job {8BAAAEA5-8386-11E8-8A1C-F40343DF3195} : This event indicates that failure happens when LCM is processing the configuration. Error Id is 0x1. Error Detail is The SendConfigurationApply function did not succeed.. Resource Id is [Script]FirewallRule_ICMPv6In and Source Info is C:\windows\TEMP\xDynamicConfig.ps1::583::9::Script. Error Message is PowerShell DSC resource MSFT_ScriptResource failed to execute Set-TargetResource functionality with error message: Error formatting a string: Index (zero based) must be greater than or equal to zero and less than the size of the argument list..
.#015 Jul 09 07:44:57 T-W1064-MS-125.mdc1.mozilla.com Microsoft-Windows-DSC: Job {8BAAAEA5-8386-11E8-8A1C-F40343DF3195} : MIResult: 1 Error Message: PowerShell DSC resource MSFT_ScriptResource failed to execute Set-TargetResource functionality with error message: Error formatting a string: Index (zero based) must be greater than or equal to zero and less than the size of the argument list.. Message ID: ProviderOperationExecutionFailure Error Category: 7 Error Code: 1 Error Type: MI#015 Jul 09 07:48:28 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: Run-RemoteDesiredStateConfig :: end#015 Jul 09 07:48:28 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: C:\dsc\rundsc.ps1 deleted.#015 Jul 09 07:48:28 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: C:\dsc\rundsc.ps1 downloaded.#015 Jul 09 07:48:28 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: scheduled task: RunDesiredStateConfigurationAtStartup, created.#015 Jul 09 07:48:29 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: log archive C:\log\20180709144311.userdata-run.zip created.#015 Jul 09 07:48:29 T-W1064-MS-125.mdc1.mozilla.com Microsoft-Windows-DSC: RESULT 1 [NXLOG@14506 Keywords="4611686018427387904" EventType="ERROR" EventID="4252" ProviderGuid="{50DF9E12-A8C4-4939-B281-47E1325BA63E}" Version="0" Task="0" OpcodeValue="0" RecordNumber="62" ActivityID="{1E08B3DF-1793-0004-4BCF-081E9317D401}" ThreadID="5528" Channel="Microsoft-Windows-DSC/Operational" Domain="NT AUTHORITY" AccountName="SYSTEM" UserID="S-1-5-18" AccountType="User" Opcode="Info" JobId="{8BAAAEA5-8386-11E8-8A1C-F40343DF3195}" MIResult="1" ErrorMessage="The SendConfigurationApply function did not succeed." ErrorCategory="0" ErrorCode="1" ErrorType="MI" EventReceivedTime="2018-07-09 14:48:28" SourceModuleName="eventlog" SourceModuleType="im_msvistalog"] Job {8BAAAEA5-8386-11E8-8A1C-F40343DF3195} : MIResult: 1 Error Message: The SendConfigurationApply function did not succeed. Message ID: MI RESULT 1 Error Category: 0 Error Code: 1 Error Type: MI#015 Jul 09 07:53:31 T-W1064-MS-125.mdc1.mozilla.com Microsoft-Windows-DSC: RESULT 1 [NXLOG@14506 Keywords="4611686018427387904" EventType="ERROR" EventID="4252" ProviderGuid="{50DF9E12-A8C4-4939-B281-47E1325BA63E}" Version="0" Task="0" OpcodeValue="0" RecordNumber="105" ActivityID="{519CFE68-1794-0004-2328-9D519417D401}" ThreadID="996" Channel="Microsoft-Windows-DSC/Operational" Domain="NT AUTHORITY" AccountName="SYSTEM" UserID="S-1-5-18" AccountType="User" Opcode="Info" JobId="{C12CFCC9-8387-11E8-8A1D-F40343DF3195}" MIResult="1" ErrorMessage="The SendConfigurationApply function did not succeed." ErrorCategory="0" ErrorCode="1" ErrorType="MI" EventReceivedTime="2018-07-09 14:53:29" SourceModuleName="eventlog" SourceModuleType="im_msvistalog"] Job {C12CFCC9-8387-11E8-8A1D-F40343DF3195} : MIResult: 1 Error Message: The SendConfigurationApply function did not succeed. 
Message ID: MI RESULT 1 Error Category: 0 Error Code: 1 Error Type: MI#015 Jul 09 07:53:31 T-W1064-MS-125.mdc1.mozilla.com Microsoft-Windows-DSC: Job {C12CFCC9-8387-11E8-8A1D-F40343DF3195} : Details logging completed for C:\windows\System32\Configuration\ConfigurationStatus\{C12CFCC9-8387-11E8-8A1D-F40343DF3195}-0.details.json.#015 Jul 09 07:53:31 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: C:\dsc\rundsc.ps1 deleted.#015 Jul 09 07:53:31 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: C:\dsc\rundsc.ps1 downloaded.#015 Jul 09 07:53:31 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: scheduled task: RunDesiredStateConfigurationAtStartup, created.#015 Jul 09 07:53:32 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: log archive C:\log\20180709145149.userdata-run.zip created.#015 Jul 09 07:53:32 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: generic-worker installation detected.#015 Jul 09 07:53:32 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: generic-worker running process detected 12 ms after task-claim-state.valid flag set.#015 Jul 09 07:53:32 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: process priority for generic worker altered from Normal to AboveNormal.#015 Jul 09 07:53:32 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: userdata run completed#015 Jul 09 07:53:53 T-W1064-MS-125.mdc1.mozilla.com generic-worker: 2018/07/09 14:53:51 No task claimed...#015 T-W1064-MS-129.releng.mdc1.mozilla.com no jobs for 2 days, reboot. running jobs T-W1064-MS-130.releng.mdc1.mozilla.com shut down T-W1064-MS-162.releng.mdc1.mozilla.com not in TC, reboot, reimage, running jobs T-W1064-MS-217.releng.mdc1.mozilla.com no jobs for 1 day, reboot, running jobs T-W1064-MS-222.releng.mdc1.mozilla.com no jobs for 4 days, reboot, reimage, running jobs T-W1064-MS-281.releng.mdc1.mozilla.com Stuck at PXE: 281
The dynamic registration errors can be ignored; we don't use DDNS in our infra, DNS is statically assigned. We should patch OCC to disable DNS registration to reduce log and network noise, but it causes no actual problem.
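A minimal sketch of what that OCC patch could do on each node, assuming the standard DnsClient cmdlets are the mechanism (the actual integration point in OCC/DSC is not decided here):

# Sketch: stop dynamic DNS registration so the "failed to register host (A or AAAA)
# resource records" noise goes away; our records are statically assigned anyway.
Get-DnsClient |
    Where-Object { $_.RegisterThisConnectionsAddress } |
    Set-DnsClient -RegisterThisConnectionsAddress $false

# Verify the setting on every interface.
Get-DnsClient | Select-Object InterfaceAlias, RegisterThisConnectionsAddress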
T-W1064-MS-023 reimage, running jobs.
T-W1064-MS-064 not in TC; reboot, running jobs.
T-W1064-MS-065, 71, 72 reimaged twice, no space after reimage. Opened Bug 1474578.
T-W1064-MS-066 not in TC; reboot, running jobs.
T-W1064-MS-152 not in TC, no jobs for 1 day; reboot, running jobs.
The following workers were in TC and lazy (5h+ since last job): T-W1064-MS-{61,64,69,76,77,85,87,109,110,111,116,120,124,127,129,131,134,154,156,159,161,166,169,178,282}. Of these, I had to reimage: T-W1064-MS-{69,76,77,109,116,120,124,127,129,134,156,159,161,166,169,178,282}. All of them started working after a reimage. The ones that weren't reimaged started working after a reboot. All of them have since taken jobs.
:pmoore tells me that we may be inadvertently running generic-worker 8.3.0 on these nodes. That's a really old version, and could have unintended side-effects. Is there any chance that the work in bug 1443589 is causing these nodes to go offline?
It's not inadvertent; gw10 has not yet worked successfully on the moonshot hardware. Bug 1443589 is to get us there.
All of the following were reimaged and are running jobs: T-W1064-MS-{131, 132, 153, 156, 158, 159, 160, 161, 162, 163, 164, 166, 169, 171, 246, 247, 254, 262, 267, 284}.
The above machines, and also the following ones, were missing from TC; they are now back after re-imaging: T-W1064-MS-{061, 062, 067, 076, 078, 079, 081, 106, 107, 108, 115, 116, 119, 121, 122, 124, 126, 127}.
The following machines were lazy (alive in TC but failing to pick up jobs) and were reimaged; all are running jobs: T-W1064-MS-{066, 075, 077, 086, 088, 090, 111, 112, 120, 151, 154, 177, 200, 216, 217, 241, 258, 285, 294}.
The following machines are the ones we know cannot be fixed by re-imaging: T-W1064-MS-065, T-W1064-MS-071, T-W1064-MS-072, T-W1064-MS-130, T-W1064-MS-281.
> :pmoore tells me that we may be inadvertently running generic-worker 8.3.0 > on these nodes. That's a really old version, and could have unintended > side-effects. This looks like it is unrelated to the generic-worker version. 13 out of the 30 nodes running generic-worker 10.8.5 have stopped picking up tasks in the past 20 hours.
Depends on: 1475711
(In reply to Zsolt Fay [:zsoltfay] from comment #69) > The following machines were lazy (alive in TC but failing to pick up jobs): I've created bug 1475711 for recording some details about workers not taking work when there are jobs in the queue. If we collect details about the times and which workers, then we can ask the Taskcluster team to investigate if there is a problem in the api or queue.
Re-imaged the following machines because they were lazy and hadn't recovered after a reboot: T-W1064-MS-107, T-W1064-MS-162, T-W1064-MS-169, T-W1064-MS-110, T-W1064-MS-164, T-W1064-MS-294, T-W1064-MS-127, T-W1064-MS-165, T-W1064-MS-134, T-W1064-MS-167, T-W1064-MS-161, T-W1064-MS-168.
Re-imaged the following machines because they were missing from TC: T-W1064-MS-065, T-W1064-MS-076, T-W1064-MS-077, T-W1064-MS-082, T-W1064-MS-086, T-W1064-MS-088, T-W1064-MS-106, T-W1064-MS-108, T-W1064-MS-111, T-W1064-MS-114, T-W1064-MS-124, T-W1064-MS-125, T-W1064-MS-126, T-W1064-MS-131, T-W1064-MS-132, T-W1064-MS-151, T-W1064-MS-152, T-W1064-MS-155, T-W1064-MS-156, T-W1064-MS-160, T-W1064-MS-171, T-W1064-MS-209, T-W1064-MS-217, T-W1064-MS-222, T-W1064-MS-247, T-W1064-MS-258, T-W1064-MS-267.
Worth noting that T-W1064-MS-134 isn't taking jobs and that t-w1064-ms-110 is performing very slowly and is encountering errors at the BIOS stage of boot.
065 - missing
077 - missing
126 - missing
132 - missing
134 - ?? had tasks resolved a day ago; I don't think it took tasks all day.
155 - ?? had a task completed 13h ago
076, 082, 086, 088, 106, 108, 111, 114, 124, 125, 131, 151, 152, 156, 160, 171, 247, 258, 267 - running (took jobs and completed them all day)
The following were all re-imaged and took jobs: T-W1064-MS-{063, 068, 070, 073, 081, 108, 110, 111, 112, 113, 116, 117, 118, 120, 126, 132, 133, 152, 160, 210, 216, 217, 245, 246, 256, 294}.
Re-imaged the missing win10 machines: T-W1064-MS-{064, 066, 067, 068, 076, 078, 082, 084, 086, 087, 090, 127, 129, 154, 157, 162, 166, 168, 169, 205, 222, 247, 260, 284}. These still need to be checked ^
Flags: needinfo?(ciduty)
I can confirm that all of the above are once again in TC except T-W1064-MS-127 and T-W1064-MS-247. Will take action on those 2 and the following: T-W1064-MS-065, T-W1064-MS-070, T-W1064-MS-077, T-W1064-MS-088, T-W1064-MS-131, T-W1064-MS-156, T-W1064-MS-241, T-W1064-MS-256, T-W1064-MS-262, T-W1064-MS-285, T-W1064-MS-294. Will comment later with the results.
Flags: needinfo?(ciduty)
The following machines got back into TC after a reboot: T-W1064-MS-070, T-W1064-MS-088, T-W1064-MS-131, T-W1064-MS-256, T-W1064-MS-120, T-W1064-MS-210, T-W1064-MS-217, T-W1064-MS-246, T-W1064-MS-267.
The following have been reimaged for not recovering after a reboot: T-W1064-MS-077, T-W1064-MS-127, T-W1064-MS-156, T-W1064-MS-241, T-W1064-MS-247, T-W1064-MS-262, T-W1064-MS-294.
The reimaged ones still need a Taskcluster check.
Flags: needinfo?(ciduty)
All of them took tasks and completed them without problems except for T-W1064-MS-077, which doesn't appear in Taskcluster. I'll look into it and come back with more details.
Flags: needinfo?(ciduty)
Below are machines that were rebooted and take jobs: T-W1064-MS-063, T-W1064-MS-074, T-W1064-MS-082, T-W1064-MS-108, T-W1064-MS-110, T-W1064-MS-222.
The following machine has been rebooted and is now available in Taskcluster, but it hasn't received any job yet: T-W1064-MS-285.
Here is the list of machines that were rebooted and reimaged and are now available in Taskcluster, split by whether they have taken jobs yet:
Available, but no job yet: T-W1064-MS-061, T-W1064-MS-077, T-W1064-MS-115, T-W1064-MS-153, T-W1064-MS-161.
Available and taking jobs: T-W1064-MS-062, T-W1064-MS-079, T-W1064-MS-107, T-W1064-MS-118.
The following were re-imaged and are now taking jobs and available in Taskcluster: T-W1064-MS-{061, 106, 111, 112, 113, 115, 116, 118, 120, 121, 125, 128, 129, 131, 133, 134, 135, 155, 169, 200, 206, 217, 254, 258, 262}.
The following workers have been re-imaged today: T-W1064-MS-{066, 079, 109, 110, 156, 159, 161, 168, 198, 212, 241, 247, 250, 251, 282, 286}. All have since taken jobs except 250. See Bug 1479187.
T-W1064-MS-064 no jobs for 1 day; reboot, jobs running
T-W1064-MS-127 no jobs for 1 day; reboot, jobs running
T-W1064-MS-212 no jobs for 1 day; reboot, reimage, jobs running
T-W1064-MS-222 no jobs for 1 day; reboot, reimage, jobs running
T-W1064-MS-285 no jobs for 1 day; reboot, reimage, jobs running
T-W1064-MS-324 not in TC; reboot, running jobs
T-W1064-MS-335 not in TC; reboot, reimage, see next comment.
Attached image ms-win10-335.PNG —
I have re-imaged this machine twice today, but after the Windows installation it got stuck at the "Task sequence" step. I selected Win 10 GW 10 once again and it finished the install, but the machine does not appear in TC.
T-W1064-MS-125: no jobs for 1 day, reboot, running jobs T-W1064-MS-156: no jobs for 1 day, reboot, running jobs T-W1064-MS-169: no jobs for 1 day, reboot, reimage, running jobs T-W1064-MS-260: no jobs for 1 day, reboot, reimage, running jobs T-W1064-MS-267: no jobs for 1 day, reboot, running jobs
T-W1064-MS-080: no jobs for one day, reboot, reimage, running jobs T-W1064-MS-089: no jobs for one day, reboot, reimage, running jobs T-W1064-MS-107: no jobs for one day, reboot, reimage, running jobs T-W1064-MS-198: no jobs for one day, reboot, running jobs T-W1064-MS-162: no jobs for one day, reboot, reimage, running jobs
When these servers have gone "no jobs for one day", is the worker process still running, what was the logged end of the last task (a timeout, like we sometimes see on OSX?), and what is in the logs? (If the worker process is running, is it polling for work or giving warnings?) After a reboot, how do you determine that the machine needs to be reimaged?
After the reboot, in the logs I see many lines like the one below, repeating forever. After 30-45 minutes I reimage them.
generic-worker: 2018/08/01 13:18:05 No task claimed...#015
T-W1064-MS-168 - just had the same issue: no jobs for one day. In the logs before and after the reboot:
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 Response#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 HTTP/1.1 200 OK#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Content-Length: 0#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Date: Tue, 31 Jul 2018 13:05:50 GMT#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Etag: "52ea05b8431def9a40afcd85b2af1c23"#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Server: AmazonS3#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: X-Amz-Id-2: T/1wOyAuCN2hTlZrWxN7kYnCsN7pXTTVHiDktWQ6N6YW5UAbv9ew5oN/kfEHld0dX2w7cfKQK10=#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: X-Amz-Request-Id: E71719FDE2A62B38#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: X-Amz-Version-Id: k4DcGnPfxH3WsJVrSUabWmu.3UcejhZx#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: #015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 Resolving task...#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 Command finished successfully!#015
Jul 31 06:05:50 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 No previous task user desktop, so no need to close any open desktops#015
Jul 31 06:05:50 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 Trying to remove directory 'C:\Users\task_1533041946' via os.RemoveAll(path) call as GenericWorker user...#015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Graphic Card being used "Intel(R) Iris(R) Pro Graphics P580 " #015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Removing temp dir contents #015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: C:\Users\GenericWorker\AppData\Local\Temp\aria-debug-7960.log#015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Deleted file - C:\Users\GenericWorker\AppData\Local\Temp\livelog807285603\stream#015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Deleted file - C:\ProgramData\Package Cache\{B74E65FD-CC47-41C5-4B89-791A3F61942D}v8.100.25984\Installers\Kits Configuration Installer-x86_en-us.msi#015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Removing log files older than 1 day #015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Removing Windows log files older than 7 days #015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Removing Recycle.bin contents #015
Jul 31 06:05:56 T-W1064-MS-168.mdc1.mozilla.com User32: The process C:\windows\system32\shutdown.exe (T-W1064-MS-168) has initiated the restart of computer T-W1064-MS-168 on behalf of user T-W1064-MS-168\GenericWorker for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: restart Comment: Rebooting as generic worker ran successfully#015
Jul 31 06:05:59 T-W1064-MS-168.mdc1.mozilla.com Service_Control_Manager: The sshd service terminated unexpectedly. It has done this 1 time(s).#015
Then no logs for almost 24h.
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: /07/31 14:05:48 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:06:43 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:07:39 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:08:46 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:09:49 Could not claim work. Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:09:49 No task claimed...#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:09:49 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Then, after ~45 min, the server took jobs.
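Those "No such host is known" lines are DNS lookup failures for queue.taskcluster.net, after which the worker recovered on its own. Before deciding that a "lazy" node needs a reimage, a small triage check along these lines could help (a hypothetical helper, not part of OCC; the generic-worker process name is an assumption): is the worker process alive, and can the node resolve and reach the queue?

# Hypothetical triage for a node stuck on "No task claimed..." or DNS errors.
$queueHost = 'queue.taskcluster.net'

$worker = Get-Process -Name 'generic-worker' -ErrorAction SilentlyContinue | Select-Object -First 1
if ($worker) {
    Write-Output ('generic-worker running, PID {0}, started {1}' -f $worker.Id, $worker.StartTime)
} else {
    Write-Output 'generic-worker process NOT running'
}

try {
    $addrs = (Resolve-DnsName -Name $queueHost -Type A -ErrorAction Stop).IPAddress -join ', '
    Write-Output ('DNS OK: {0} -> {1}' -f $queueHost, $addrs)
} catch {
    Write-Output ('DNS FAILED for {0}: {1}' -f $queueHost, $_.Exception.Message)
}

# TCP reachability of the queue on port 443.
$tcp = Test-NetConnection -ComputerName $queueHost -Port 443 -WarningAction SilentlyContinue
Write-Output ('HTTPS port reachable: {0}' -f $tcp.TcpTestSucceeded)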
Looks like there is a common error on the MS Windows servers, at least on the 3 below:
Microsoft-Windows-DSC: Job DscTimerConsistencyOperationResult : DSC Engine Error : #011 Error Message: NULL #011Error Code : 1 #015
https://bugzilla.mozilla.org/show_bug.cgi?id=1478723
https://bugzilla.mozilla.org/show_bug.cgi?id=1480347
https://bugzilla.mozilla.org/show_bug.cgi?id=1480386
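To gauge how widespread that DSC error is on a given node, the DSC operational log and the LCM status history can be pulled locally with standard cmdlets; a short sketch (run in an elevated PowerShell on the affected worker):

# Sketch: recent errors from the DSC operational log (the same channel the
# papertrail excerpts above come from), plus the status of recent DSC runs.
Get-WinEvent -LogName 'Microsoft-Windows-DSC/Operational' -MaxEvents 200 |
    Where-Object { $_.LevelDisplayName -eq 'Error' } |
    Select-Object TimeCreated, Id, Message |
    Format-Table -AutoSize -Wrap

Get-DscConfigurationStatus -All |
    Select-Object StartDate, Type, Status, DurationInSeconds |
    Format-Table -AutoSize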
Blocks: 1481426
T-W1064-MS-112 and T-W1064-MS-117 have a non-responsive iLO after logging in with credentials.
Were these nodes rebooted or reimaged? They seem to be up and working now.
(In reply to Mark Cornmesser [:markco] from comment #91)
> Were these nodes rebooted or reimaged? They seem to be up and working now.

I've checked the handovers, and I don't see anyone in CiDuty reporting that they worked on the 2 machines.
I rebooted 112 and 117 yesterday, and afterwards they were taking tasks. Both had taken no tasks for one day.
- rebooted, reimaged & take jobs: T-W1064-MS-132, T-W1064-MS-133, T-W1064-MS-134
- rebooted & take jobs: T-W1064-MS-067, T-W1064-MS-071, T-W1064-MS-118, T-W1064-MS-168, T-W1064-MS-196, T-W1064-MS-211, T-W1064-MS-247, T-W1064-MS-262, T-W1064-MS-266, T-W1064-MS-286, T-W1064-MS-296
Found a lot of lazy workers in the Windows pool. Mark, the number this time is quite considerable; you might be extra interested in this. Went ahead and rebooted all of the machines below. @CiDuty: these all need to be checked; please re-image, check the logs, and update their respective bugs for the ones that don't recover after a re-image.
Flags: needinfo?(mcornmesser)
T-W1064-MS-{021, 024, 027, 031, 033, 034, 035, 037, 037, 039, 041, 043, 045, 062, 064, 066, 071, 072, 078, 079, 088, 089, 090, 110, 116, 117, 120, 121, 122, 125, 126, 127, 129, 131, 132, 133, 134, 152, 158, 160, 164, 165, 167, 169, 174, 176, 180, 198, 199, 202, 204, 206, 210, 212, 215, 216, 218, 222, 242, 244, 246, 247, 249, 253, 256, 259, 264, 268, 269, 270, 281, 282, 283, 287, 295}. Note, all of the above have had at least 2+ hours since last job taken, with over 75% of them having 10h+.
Flags: needinfo?(ciduty)
Zsoltfay: What was the time frame in which these stopped working? All within the last 2 to 10 hours? Were there groups that stopped around the same time?
Flags: needinfo?(mcornmesser)
Yes, it was during my shift. We rebooted all of them ~4 hours ago. A lot of them had gone 17-18 hours since their last job, and another cluster had ~2 hours since their last job. Then there were a few stray ones around the board at 5-7-9-12h.
Do you have a record of which ones were greater than 2 hours? I am going to go ahead and close this bug; the original issue here has been addressed. I have opened Bug 1490398. CiDuty: could you add nodes that have not picked up tasks in 2.5 or more hours to that bug?
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
(In reply to Zsolt Fay [:zfay] from comment #97)
> T-W1064-MS-{021, 024, 027, 031, 033, 034, 035, 037, 037, 039, 041, 043, 045,
> 062, 064, 066, 071, 072, 078, 079, 088, 089, 090, 110, 116, 117, 120, 121,
> 122, 125, 126, 127, 129, 131, 132, 133, 134, 152, 158, 160, 164, 165, 167,
> 169, 174, 176, 180, 198, 199, 202, 204, 206, 210, 212, 215, 216, 218, 222,
> 242, 244, 246, 247, 249, 253, 256, 259, 264, 268, 269, 270, 281, 282, 283,
> 287, 295}.
>
> Note, all of the above have had at least 2+ hours since last job taken, with
> over 75% of them having 10h+.

Cleared the NI request since I did a follow-up on those machines in Bug 1490398#c1.
Flags: needinfo?(ciduty)