Closed Bug 1452133 Opened 7 years ago Closed 6 years ago

Moonshot Windows 10 nodes stop functioning

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: markco, Assigned: markco)

References

Details

Attachments

(8 files)

A small percentage of nodes will suddenly stop reporting to papertrail and become unreachable. Rebooting the machine remedies the situation.
The local OS event logs also stop and do not pick up again until the reboot.
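A quick, illustrative way to find affected nodes (a hypothetical sweep, not an existing tool; the node-number range and the mdc1.mozilla.com naming are assumptions taken from hostnames quoted later in this bug, and ping is only a rough proxy for "stopped functioning"):

# Hypothetical sweep: report which moonshot Windows nodes no longer answer ping.
$nodes = 1..299 | ForEach-Object { 'T-W1064-MS-{0:d3}.mdc1.mozilla.com' -f $_ }
$down = foreach ($node in $nodes) {
    # -Quiet makes Test-Connection return $true/$false instead of echo objects.
    if (-not (Test-Connection -ComputerName $node -Count 1 -Quiet -ErrorAction SilentlyContinue)) {
        $node
    }
}
Write-Output ("Unreachable nodes: {0}" -f ($down -join ', '))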
Summary: Moonshot Windows 10 nodes drop off the network → Moonshot Windows 10 nodes stop functioning
Assignee: relops → mcornmesser
Added additional logging to the run-generic-worker bat to see whether cleanup is affecting the node: https://github.com/mozilla-releng/OpenCloudConfig/commit/bb9f3a82317d722d536c4b363315f007bca2f9be The papertrail logs have shown, for multiple nodes, that the cleanup starts and then the logs stop:
Apr 11 13:23:07 T-W1064-MS-071.mdc1.mozilla.com generic-worker: Removing temp dir contents #015
Apr 11 13:23:07 T-W1064-MS-071.mdc1.mozilla.com generic-worker: C:\Users\GenericWorker\AppData\Local\Temp\aria-debug-4180.log#015
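For context, the pattern behind that change (the exact edit is in the OCC commit linked above) is to emit a marker before and after each cleanup step so the last marker visible in papertrail shows where a hung node stopped. A minimal PowerShell sketch of the same idea, with made-up step names and assuming the wrapper's output is picked up by the existing log forwarding:

# Illustrative only: bracket each cleanup step with log markers so papertrail
# shows the last step a node reached before it stopped logging.
function Invoke-LoggedStep {
    param([string]$Name, [scriptblock]$Action)
    Write-Output ('{0} cleanup step starting: {1}' -f (Get-Date -Format s), $Name)
    & $Action
    Write-Output ('{0} cleanup step finished: {1}' -f (Get-Date -Format s), $Name)
}
# Hypothetical step names; the real steps live in run-generic-worker.bat / OCC.
Invoke-LoggedStep -Name 'remove temp dir contents' -Action {
    Remove-Item -Path "$env:TEMP\*" -Recurse -Force -ErrorAction SilentlyContinue
}
Invoke-LoggedStep -Name 'empty recycle bin' -Action {
    Clear-RecycleBin -Force -ErrorAction SilentlyContinue
}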
Commented out the dism command: https://github.com/mozilla-releng/OpenCloudConfig/commit/b4dbe3ce9511b5c5cb6ec047f1485c80b1ea4056 It throws an error from time to time, and may not be needed now that updates are under control.
https://github.com/mozilla-releng/OpenCloudConfig/commit/d355e59c7c3cf6c55b8e58334624e41e106c98e0 Added additional log messages to see if the node locks up during various parts of the cleanup.
After surfing through each page of the Windows 10 test machines in papertrail, I've reimaged the following t-w1064-ms machines: 35, 38, 214, 244, 269. I have also updated the above doc to note that these machines have been affected.
I've just sanity-checked the above workers as a follow-up and can confirm that machines 38, 214, 244 and 269 are back, taking jobs and completing them. However, machines 35 and 77 are still unreachable.
Attached image 1.png —
I have re-imaged 35 and 77, but the process is not OK. I rebooted the server, hit F12, started PXE, selected option two and started the re-image process; after the first boot the server starts PXE again (it should boot from HDD, right?). I let the process start again and, after a while, it says that there is already a re-image running: do you want to start a new one, or not (and let the first one continue)? If I choose Yes, the process starts again and a loop is created. If I say No and let the already-running process continue, it fails with the errors from capture 1. If it does not fail, the process continues, but after a reboot and some Windows setup steps I get error 2.
Attached image 2.png —
Tried to reimage t-w1064-ms-262. It seems to get stuck in a loading state.
Attached image worker-ms-77.PNG —
T-W1064-MS-077 looks OK now; many jobs have completed successfully. However, the iLO capture shows the machine in a stuck state.
Attached image ms77-stuck.png —
Checking other machines that are visible in TC and taking jobs, it looks like they are in the same stuck state. I believe that state is a normal one :).
T-W1064-MS-035 is also back in TC, waiting to take jobs.
MS-35 took jobs and completed them successfully. There is a glitch in the re-image. To get it done successfully, follow these steps:
- Start the re-image as the docs say and watch the iLO.
- If, after the first reboot, the server goes to PXE, reboot it and press F9 to select booting from HDD; otherwise follow the steps below.
- Watch the iLO and the Windows setup; if the Windows login is root then you are on the right track. If not, the re-image process needs to be restarted until the root user logs in.
- Once the setup completes and the final reboot is done, the iLO will show Windows stuck at loading.
- The re-imaged machine should be visible in TC and take jobs.
After more than 10 re-images, the above solution works :).
I have found other Win10-MS servers that are not in TC: 111, 130, 178, 179, 215, 257, 263, 268. Working to reimage 111 and 130 (:apop).
T-W1064-MS-111 is reimaged and took jobs that completed successfully. T-W1064-MS-178 is reimaged and waiting for jobs.
Attached image T-W1064-MS-179.png —
I have tried to re-image T-W1064-MS-179 but encountered this problem. ^ After this the Automatic Repair started and then the iLO console seems to be stuck ("Scanning and repairing drive (\\?\SystemPartition): 100% complete").
T-W1064-MS-178 - running jobs successfully
T-W1064-MS-179 - re-imaged, running jobs successfully
It looks like the re-image process is set to run until it is done, no matter what. If there is some error, like the one captured by Radu, you somehow need to finish the process. I rebooted the server several times, let the server boot from HDD, skipped the disk check and let the process run until I got the same error in Windows. After one more reboot, I started PXE, where I got the error that the deployment cannot continue, and I hit Finish. That is when the previous process ended and I was able to start a PXE boot and re-image the server successfully.
T-W1064-MS-215 - re-imaged, taking jobs successfully
T-W1064-MS-257 - re-imaged, taking jobs successfully
T-W1064-MS-263 - re-imaged, taking jobs successfully
T-W1064-MS-268 - re-imaged, taking jobs successfully
T-W1064-MS-064 - re-imaged, taking jobs successfully
T-W1064-MS-255 - re-imaged, taking jobs successfully
Looks like T-W1064-MS-262 is taking jobs now.
While monitoring Windows 10 Moonshots machines I saw that we have troubles with T-W1064-MS-021. Tried to connect remotly to it via ILO it returned me the following error: ExitException[ 3]com.sun.deploy.net.FailedDownloadException: Unable to load resource at jdk.plugin@9.0.4/sun.plugin2.applet.JNLP2Manager.downloadResources(JNLP2Manager.java:1846) at jdk.plugin@9.0.4/sun.plugin2.applet.JNLP2Manager.prepareLaunchFile(JNLP2Manager.java:1457) at jdk.plugin@9.0.4/sun.plugin2.applet.JNLP2Manager.loadJarFiles(JNLP2Manager.java:476) at jdk.plugin@9.0.4/sun.plugin2.applet.Plugin2Manager$AppletExecutionRunnable.run(Plugin2Manager.java:1770) at java.base/java.lang.Thread.run(Thread.java:844) Caused by: com.sun.deploy.net.FailedDownloadException: Unable to load resource at jdk.deploy@9.0.4/com.sun.deploy.net.DownloadEngine.actionDownload(DownloadEngine.java:807) at jdk.deploy@9.0.4/com.sun.deploy.net.DownloadEngine.downloadResource(DownloadEngine.java:914) at jdk.deploy@9.0.4/com.sun.deploy.cache.ResourceProviderImpl.getResource(ResourceProviderImpl.java:387) at jdk.deploy@9.0.4/com.sun.deploy.cache.ResourceProviderImpl.getResource(ResourceProviderImpl.java:325) at jdk.javaws@9.0.4/com.sun.javaws.LaunchDownload$DownloadTask.call(LaunchDownload.java:1649) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641) ... 1 more
Started the reimaging process for T-W1064-MS-113 and T-W1064-MS-034.
(In reply to Roland Mutter Michael (:rmutter) from comment #23)
> Created attachment 8980529 [details]
> Screenshot_2018-05-25_12-35-49.png
>
> While monitoring the Windows 10 Moonshot machines I saw that we have trouble
> with T-W1064-MS-021. Tried to connect to it remotely via iLO and it returned
> the following error:

Works from my tux box, so I have started the re-image for T-W1064-MS-021 and T-W1064-MS-023.
Adding T-W1064-MS-263 to reimaged moonshots ^.
T-W1064-MS-021 and T-W1064-MS-023 - re-imaged, taking jobs successfully
T-W1064-MS-281 has some issues. It won't boot from the network for a reimage. Not sure if it's a hardware-related issue.
T-W1064-MS-037 - not found in TC, rebooted, took jobs successfully
UPDATE: T-W1064-MS-263, T-W1064-MS-113 and T-W1064-MS-034 are back in production taking jobs.
Update:
T-W1064-MS-042 - reimaged and takes jobs
T-W1064-MS-134 - reimaged and takes jobs
T-W1064-MS-207 - reimaged and takes jobs
T-W1064-MS-212 - reimaged and takes jobs
T-W1064-MS-265 - reimaged and takes jobs
T-W1064-MS-129 - reimaged and takes jobs
T-W1064-MS-215 - reimaged and takes jobs
Problems with:
T-W1064-MS-170 - not found in Taskcluster; unable to reimage it, it gets rebooted when it reaches the Deployment Kit view
T-W1064-MS-035 - reimaged, still waiting for it to appear in TC
T-W1064-MS-036 - reimaged, still waiting for it to appear in TC
T-W1064-MS-040 - reimaged, still waiting for it to appear in TC
T-W1064-MS-035 - takes jobs
T-W1064-MS-036 - takes jobs
T-W1064-MS-040 - takes jobs
T-W1064-MS-129 - reimaged and takes jobs T-W1064-MS-214 - reimaged, rebooted and waiting to take jobs
T-W1064-MS-21 reimaged and takes jobs
T-W1064-MS-70 reimaged and takes jobs
T-W1064-MS-81 reimaged and takes jobs
T-W1064-MS-165 reimaged and takes jobs
T-W1064-MS-170 reimaged and takes jobs
Please do not reimage 135 and 81. I am using both of those for testing and they may not be reporting to papertrail. I will update here once I am done with those nodes.
Reimaged: T-W1064-MS-077 T-W1064-MS-087 They are both back in taskcluster.
T-W1064-MS-031 rebooted, reimaged and takes jobs
T-W1064-MS-082 rebooted, reimaged and takes jobs
T-W1064-MS-175 rebooted and takes jobs
T-W1064-MS-247 wasn't in Taskcluster, reimaged and takes jobs
T-W1064-MS-267 rebooted, reimaged and takes jobs
T-W1064-MS-029 - reboot, takes jobs
T-W1064-MS-030 - reboot, takes jobs
T-W1064-MS-031 - reimage, takes jobs
T-W1064-MS-032 - reboot, takes jobs
T-W1064-MS-035 - reimage, takes jobs
T-W1064-MS-037 - reboot, takes jobs
T-W1064-MS-085 - reboot, reimage, takes jobs
T-W1064-MS-088 - reboot, takes jobs
T-W1064-MS-089 - reboot, takes jobs
T-W1064-MS-117 - reboot, takes jobs
T-W1064-MS-123 - reboot, takes jobs
T-W1064-MS-131 - reboot, takes jobs
T-W1064-MS-132 - reboot, takes jobs
T-W1064-MS-152 - reboot, takes jobs
T-W1064-MS-155 - reboot, takes jobs
T-W1064-MS-219 - reboot, takes jobs
T-W1064-MS-267 - reboot, takes jobs
Hey Mark, can you confirm that you still need 81 and 135 for testing? Also I know you are testing the new generic worker on chassis 1, so fyi 24 and 34 are suddenly missing from taskcluster. If this was intentionally done then just ignore this.
Flags: needinfo?(mcornmesser)
Went for a full check of which Windows moonshots appear in TC. It seems the following machines are not in TC:
T-W1064-MS-18, T-W1064-MS-24, T-W1064-MS-27, T-W1064-MS-29, T-W1064-MS-30, T-W1064-MS-33, T-W1064-MS-34, T-W1064-MS-35, T-W1064-MS-40, T-W1064-MS-42, T-W1064-MS-84, T-W1064-MS-109, T-W1064-MS-113, T-W1064-MS-117, T-W1064-MS-130, T-W1064-MS-151, T-W1064-MS-247, T-W1064-MS-261, T-W1064-MS-281
Will proceed with a check in papertrail and a reboot for every machine. If that doesn't work, I'll start a reimage for each one. I'll be back with updates.
After two sessions of reboots, the following machines seem to need a reimage: T-W1064-MS-18, T-W1064-MS-29, T-W1064-MS-30, T-W1064-MS-40, T-W1064-MS-84, T-W1064-MS-117, T-W1064-MS-130. T-W1064-MS-33 seems to be broken, and T-W1064-MS-281 was found in the System Health Summary; I rebooted it and now it is in some kind of boot loop. My guess is that it doesn't find a source to boot from.
(In reply to Zsolt Fay [:zsoltfay] from comment #40)
> Hey Mark, can you confirm that you still need 81 and 135 for testing?
>
> Also I know you are testing the new generic worker on chassis 1, so fyi 24
> and 34 are suddenly missing from taskcluster. If this was intentionally done
> then just ignore this.

I am no longer using any additional nodes outside of chassis 1. 34 and 24 were unintentional. I will take a look at them later.
Flags: needinfo?(mcornmesser)
Found several machines today that hadn't taken jobs for over 3 hours (some even 20+ hours), while there was a 700+ pending queue. Re-imaged them, and they have already taken jobs successfully. The machines in question are: t-win1064-ms-{067, 078, 080, 086, 088, 107, 109, 117, 118, 129, 131, 152, 157, 217, 256, 262}. I have also re-imaged 168, but it does not complete the re-image. At first glance it appears it does not have the option to boot from an HDD (only XenServer, IPv4 and IPv6 boot).
Here is a complete list of machines that don't appear in taskcluster [221]: 'T-W1064-MS-017', 'T-W1064-MS-018', 'T-W1064-MS-020', 'T-W1064-MS-023', 'T-W1064-MS-029', 'T-W1064-MS-034', 'T-W1064-MS-035', 'T-W1064-MS-041', 'T-W1064-MS-042', 'T-W1064-MS-044', 'T-W1064-MS-045', 'T-W1064-MS-064', 'T-W1064-MS-065', 'T-W1064-MS-106', 'T-W1064-MS-112', 'T-W1064-MS-116', 'T-W1064-MS-126', 'T-W1064-MS-130', 'T-W1064-MS-134', 'T-W1064-MS-151', 'T-W1064-MS-156', 'T-W1064-MS-168', 'T-W1064-MS-222', 'T-W1064-MS-261', 'T-W1064-MS-281', 'T-W1064-MS-316', 'T-W1064-MS-318', 'T-W1064-MS-319', 'T-W1064-MS-320', 'T-W1064-MS-321', 'T-W1064-MS-322', 'T-W1064-MS-323', 'T-W1064-MS-324', 'T-W1064-MS-325', 'T-W1064-MS-326', 'T-W1064-MS-327', 'T-W1064-MS-328', 'T-W1064-MS-329', 'T-W1064-MS-330', 'T-W1064-MS-331', 'T-W1064-MS-332', 'T-W1064-MS-333', 'T-W1064-MS-334', 'T-W1064-MS-335', 'T-W1064-MS-336', 'T-W1064-MS-337', 'T-W1064-MS-338', 'T-W1064-MS-339', 'T-W1064-MS-340', 'T-W1064-MS-341', 'T-W1064-MS-342', 'T-W1064-MS-343', 'T-W1064-MS-344', 'T-W1064-MS-345', 'T-W1064-MS-362', 'T-W1064-MS-363', 'T-W1064-MS-364', 'T-W1064-MS-365', 'T-W1064-MS-366', 'T-W1064-MS-367', 'T-W1064-MS-368', 'T-W1064-MS-369', 'T-W1064-MS-370', 'T-W1064-MS-371', 'T-W1064-MS-372', 'T-W1064-MS-373', 'T-W1064-MS-374', 'T-W1064-MS-375', 'T-W1064-MS-376', 'T-W1064-MS-377', 'T-W1064-MS-378', 'T-W1064-MS-379', 'T-W1064-MS-380', 'T-W1064-MS-381', 'T-W1064-MS-382', 'T-W1064-MS-383', 'T-W1064-MS-384', 'T-W1064-MS-385', 'T-W1064-MS-386', 'T-W1064-MS-387', 'T-W1064-MS-388', 'T-W1064-MS-389', 'T-W1064-MS-390', 'T-W1064-MS-406', 'T-W1064-MS-407', 'T-W1064-MS-408', 'T-W1064-MS-409', 'T-W1064-MS-410', 'T-W1064-MS-411', 'T-W1064-MS-412', 'T-W1064-MS-413', 'T-W1064-MS-414', 'T-W1064-MS-415', 'T-W1064-MS-416', 'T-W1064-MS-417', 'T-W1064-MS-418', 'T-W1064-MS-419', 'T-W1064-MS-420', 'T-W1064-MS-421', 'T-W1064-MS-422', 'T-W1064-MS-423', 'T-W1064-MS-424', 'T-W1064-MS-425', 'T-W1064-MS-426', 'T-W1064-MS-427', 'T-W1064-MS-428', 'T-W1064-MS-429', 'T-W1064-MS-430', 'T-W1064-MS-431', 'T-W1064-MS-432', 'T-W1064-MS-433', 'T-W1064-MS-434', 'T-W1064-MS-435', 'T-W1064-MS-451', 'T-W1064-MS-452', 'T-W1064-MS-453', 'T-W1064-MS-454', 'T-W1064-MS-455', 'T-W1064-MS-456', 'T-W1064-MS-457', 'T-W1064-MS-458', 'T-W1064-MS-459', 'T-W1064-MS-460', 'T-W1064-MS-461', 'T-W1064-MS-462', 'T-W1064-MS-463', 'T-W1064-MS-464', 'T-W1064-MS-465', 'T-W1064-MS-466', 'T-W1064-MS-467', 'T-W1064-MS-468', 'T-W1064-MS-469', 'T-W1064-MS-470', 'T-W1064-MS-471', 'T-W1064-MS-472', 'T-W1064-MS-473', 'T-W1064-MS-474', 'T-W1064-MS-475', 'T-W1064-MS-476', 'T-W1064-MS-477', 'T-W1064-MS-478', 'T-W1064-MS-479', 'T-W1064-MS-480', 'T-W1064-MS-497', 'T-W1064-MS-498', 'T-W1064-MS-499', 'T-W1064-MS-500', 'T-W1064-MS-501', 'T-W1064-MS-502', 'T-W1064-MS-503', 'T-W1064-MS-504', 'T-W1064-MS-505', 'T-W1064-MS-506', 'T-W1064-MS-507', 'T-W1064-MS-508', 'T-W1064-MS-509', 'T-W1064-MS-510', 'T-W1064-MS-511', 'T-W1064-MS-512', 'T-W1064-MS-513', 'T-W1064-MS-514', 'T-W1064-MS-515', 'T-W1064-MS-516', 'T-W1064-MS-517', 'T-W1064-MS-518', 'T-W1064-MS-519', 'T-W1064-MS-520', 'T-W1064-MS-521', 'T-W1064-MS-522', 'T-W1064-MS-523', 'T-W1064-MS-524', 'T-W1064-MS-525', 'T-W1064-MS-542', 'T-W1064-MS-543', 'T-W1064-MS-544', 'T-W1064-MS-545', 'T-W1064-MS-546', 'T-W1064-MS-547', 'T-W1064-MS-548', 'T-W1064-MS-549', 'T-W1064-MS-550', 'T-W1064-MS-551', 'T-W1064-MS-552', 'T-W1064-MS-553', 'T-W1064-MS-554', 'T-W1064-MS-555', 'T-W1064-MS-556', 'T-W1064-MS-557', 'T-W1064-MS-558', 'T-W1064-MS-559', 'T-W1064-MS-560', 'T-W1064-MS-561', 'T-W1064-MS-562', 
'T-W1064-MS-563', 'T-W1064-MS-564', 'T-W1064-MS-565', 'T-W1064-MS-566', 'T-W1064-MS-567', 'T-W1064-MS-568', 'T-W1064-MS-569', 'T-W1064-MS-570', 'T-W1064-MS-581', 'T-W1064-MS-582', 'T-W1064-MS-583', 'T-W1064-MS-584', 'T-W1064-MS-585', 'T-W1064-MS-586', 'T-W1064-MS-587', 'T-W1064-MS-588', 'T-W1064-MS-589', 'T-W1064-MS-590', 'T-W1064-MS-591', 'T-W1064-MS-592', 'T-W1064-MS-593', 'T-W1064-MS-594', 'T-W1064-MS-595', 'T-W1064-MS-596', 'T-W1064-MS-597', 'T-W1064-MS-598', 'T-W1064-MS-599', 'T-W1064-MS-600' I'll begin working on them, if anyone has any info about them, please comment in this bug.
Update for the machines above:
moon-chassis-2
'T-W1064-MS-064' - Rebooted, reachable, took a job and finished it as completed
'T-W1064-MS-065' - Rebooted, reachable, took a job and finished it as completed
moon-chassis-3
'T-W1064-MS-106' - Rebooted, reachable, took a job and finished it as completed
'T-W1064-MS-112' - Rebooted, reachable, took a job and finished it as completed
'T-W1064-MS-116' - Rebooted, reachable, took a job and is running it
'T-W1064-MS-126' - Rebooted but unreachable
'T-W1064-MS-130' - Rebooted but starts with disk-checking and a blue screen; after a few retries I turned it off
'T-W1064-MS-134' - Rebooted, reachable, took a job and finished it as completed
moon-chassis-4
'T-W1064-MS-151' - Rebooted, reachable
'T-W1064-MS-156' - Rebooted, reachable, took a job and finished it as completed
'T-W1064-MS-168' - Tried to reimage it but failed; see screenshot (Screenshot_2018-07-04_06-31-42_fail_T-W1064-MS-168.png)
moon-chassis-5
'T-W1064-MS-222' - Rebooted, reachable but did not appear in taskcluster; initiated a reimage on it and it reimaged successfully
moon-chassis-6
'T-W1064-MS-261' - Rebooted, reachable
moon-chassis-7
'T-W1064-MS-281' - At boot it stays in booting-from-network and I cannot make it boot from SSD
moon-chassis-8
'T-W1064-MS-316' - Runs XenServer
'T-W1064-MS-325' - Began a reimage on it
'T-W1064-MS-338' - Began a reimage on it
Those from moon-chassis-8 were randomly chosen.
Also from the last chassis: 'T-W1064-MS-600' - Runs XenServer
Went for 3 sessions of checks/cold boots for the Windows moonshots. Performed a cold boot on the following: 017, 018, 020, 023, 026, 029, 034, 035, 041, 042, 044, 045, 068, 072, 077, 090, 111, 114, 130, 135, 162, 165, 281, 288, 292, 294. Of these, the following recovered: 023, 026, 029, 034, 041, 044, 045, 068, 072, 165, 288, 292, 294. The following need a later recheck/reimage: 017, 018, 020, 035, 042, 044, 072, 077, 090, 111, 114, 130, 135, 162, 281.
Depends on: 1473589
T-W1064-MS-017.releng.mdc1.mozilla.com reboots itself after Windows starts; reimaged, OK.
T-W1064-MS-018.releng.mdc1.mozilla.com booted into Windows with the administrator login; NIC configuration error during re-image, cannot reimage.
T-W1064-MS-024.releng.mdc1.mozilla.com reboot, re-imaged, OK.
T-W1064-MS-035.releng.mdc1.mozilla.com in TC, no jobs for 3 days; reboot, reimage, not in TC anymore.
T-W1064-MS-038.releng.mdc1.mozilla.com in TC, no jobs for 1 day; reboot, OK.
T-W1064-MS-042.releng.mdc1.mozilla.com not in TC; reboot, reimage, OK.
T-W1064-MS-073.releng.mdc1.mozilla.com in TC, no jobs for 1 day; reboot, OK.
T-W1064-MS-074.releng.mdc1.mozilla.com in TC, no jobs for 2 days; reboot, reimage, OK.
T-W1064-MS-077.releng.mdc1.mozilla.com in TC, no jobs for 2 days; reboot, reimage, in TC waiting for tasks.
T-W1064-MS-080.releng.mdc1.mozilla.com in TC, no jobs for 2 days; reboot, reimage, in TC waiting for tasks.
T-W1064-MS-019.releng.mdc1.mozilla.com not in TC; reboot, OK.
T-W1064-MS-020.releng.mdc1.mozilla.com not in TC; reboot, reimage, OK, tasks running fine.
T-W1064-MS-035.releng.mdc1.mozilla.com reboot, reimage, still not in TC.
T-W1064-MS-036.releng.mdc1.mozilla.com not in TC, no jobs for 2 days; reboot, reimage, OK, tasks running fine.
T-W1064-MS-037.releng.mdc1.mozilla.com not in TC, no jobs for 1 day; reboot, reimage, OK, running jobs.
T-W1064-MS-043.releng.mdc1.mozilla.com not in TC, no jobs for 2 days; reboot, OK, jobs running.
T-W1064-MS-065.releng.mdc1.mozilla.com not in TC, no jobs for 1 day; reboot, reimage, not in TC anymore.
T-W1064-MS-071.releng.mdc1.mozilla.com not in TC, no jobs for 2 days; reboot, reimage, not in TC anymore.
T-W1064-MS-072.releng.mdc1.mozilla.com not in TC, no jobs for 2 days; reboot, reimage, not in TC anymore.
T-W1064-MS-083.releng.mdc1.mozilla.com not in TC, no jobs for 2 days; reboot, reimage, not in TC anymore.
I noticed an image rename from "Windows 10" to "Windows 10 -1". Are there some changes going on? The above servers were reimaged with "Windows 10 generic 10" (19-43) and with "Windows 10 -1" (65-83).
We should be using the same image on all systems; I'm not sure what the new image is, so until we say otherwise please don't use it. Mark/Q: what IS that new image?
Flags: needinfo?(q)
Flags: needinfo?(mcornmesser)
Those two images have been around for a while; there are no new images, as the task IDs are the same. The GW 10 one is generic-worker 10, which in all likelihood should not have been done as a separate image but as an OCC switch.
Flags: needinfo?(q)
Ok, so anything that was imaged with "Windows 10 generic 10" needs to be quarantined and re-imaged.
Flags: needinfo?(ciduty)
I'm going to reimage the machines that were reimaged with "Windows 10 generic 10" on my ongoing shift. :fubar, is there a way to find out if we have any other machines reimaged with "Windows 10 generic 10" besides those mentioned above by :arny?
Flags: needinfo?(ciduty)
Great question, and I don't know the answer; check in with :markco - he's still investigating a couple of other install issues that came out of this, and we might need to wait until those are sorted. But he's got all of the details atm.
Because of differences in functionality and configuration needs there are 2 separate task sequences. GW 8 creates a single user that performs tasks. GW 10 is a service that creates individual user environments and performs tasks. In addition, the config file, which is currently deployed from MDT, is different. Because of this, the cleanest way to go about it was to have a separate image. Moving forward, if GW does not have a significant change in functionality then we will not need separate images for it. As noted in other bugs and the spreadsheet attached to this bug, there is a GW 10 pool of ms-016 through ms-045. That is being used to debug and catch any other unknown issues with the current configuration and GW 10. I have not seen the "Windows 10 -1" task sequence name. If it pops up again, could you take a screenshot please?
Flags: needinfo?(mcornmesser)
Reimaged all servers that:
- Got reinstalled from the 4th to date.
- Were missing in TC.
Reimage successful on: 064, 065, 071, 072, 074, 078, 080, 083, 090, 106, 107, 110, 111, 112, 114, 116, 124, 126, 128, 132, 151, 153, 165, 166, 168, 169, 222, 256, 260, 261
HPE RESTful API missing / Intelligent Provisioning: 085
No video/image: 130, 134, 135, 154, 156, 158, 162
Stuck at loading: 262
Stuck at PXE: 281
As a follow-up to the above-mentioned machines: they were checked and reimaged (if needed) along with 066, 109, 134, 135, 154, 156, 158, 164, 262. These are all taking jobs now except: 071 and 072 are still missing from TC after reimage; 222 is in quarantine; 262 doesn't even load the BIOS and looks to be offline entirely. Also, 130 is still broken, see Bug 1463754.
I checked 071 and 072. They had multiple drives mounted. I have kicked off a new install.
085: was able to connect and kicked off a new install.
130: updated the bug with its current behavior. Will include it in the request for support from HP.
262: appears to be up and waiting for a task: https://papertrailapp.com/groups/1958653/events?focus=952660421653987342&q=ms-262&selected=952660421653987342
Jul 08 10:00:27 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:00:26 No task claimed...#015
Jul 08 10:01:28 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:01:27 No task claimed...#015
Jul 08 10:02:29 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:02:28 No task claimed...#015
Jul 08 10:03:30 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:03:29 No task claimed...#015
Jul 08 10:04:31 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:04:30 No task claimed...#015
Jul 08 10:05:32 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:05:31 No task claimed...#015
281: was attempting PXE boot over IPv6. Previously, shutting down the node, waiting hours, and then attempting to reimage resolved the issue. The node is currently shut down.
(In reply to Kendall Libby [:fubar] from comment #53)
> Ok, so anything that was imaged with "Windows 10 generic 10" needs to be
> quarantined and re-imaged.

From what I know, the Windows 10 generic 10 image should be used only on MS 16-45; I did those reimages, and they are needed for testing. :markco, can you confirm?
Flags: needinfo?(mcornmesser)
That has been my plan thus far, and most of those are up and in Taskcluster currently, except for 18 and 35, which are having PXE boot issues, and 23, which may have a different hardware issue.
Flags: needinfo?(mcornmesser)
T-W1064-MS-018.releng.mdc1.mozilla.com PXE issue
T-W1064-MS-035.releng.mdc1.mozilla.com PXE issue
T-W1064-MS-065.releng.mdc1.mozilla.com no jobs for 1 day; reboot, reimage, reboot, still no jobs, not in TC anymore. Reimaged again; after the second reimage, there is no disk space :|
Jul 09 07:31:26 T-W1064-MS-065.mdc1.mozilla.com OpenCloudConfig: Current available disk space CRITCAL 0% free. Will not start Generic-Worker!#015
T-W1064-MS-071.releng.mdc1.mozilla.com not in TC; reboot, reimage, reboot, still not in TC. Papertrail error:
Jul 09 05:51:58 T-W1064-MS-071.mdc1.mozilla.com Microsoft-Windows-DNS-Client: The system failed to register host (A or AAAA) resource records (RRs) for network adapter with settings: Adapter Name : {103341C9-EE3A-4DBF-BBE8-13E74A6368AA} Host Name : T-W1064-MS-071 Primary Domain Suffix : mdc1.mozilla.com DNS server list : #01110.48.75.120, 10.50.75.120 Sent update to server : <?> IP Address(es) : 10.49.40.42 The reason the system could not register these RRs during the update request was because of a system problem. You can manually retry DNS registration of the network adapter and its settings by typing 'ipconfig /registerdns' at the command prompt. If problems still persist, contact your DNS server or network systems administrator. See event details for specific error code information.#015
T-W1064-MS-072.releng.mdc1.mozilla.com not in TC; reboot, reimage, reboot, still not in TC. Papertrail error:
Jul 09 05:52:10 T-W1064-MS-072.mdc1.mozilla.com Microsoft-Windows-DNS-Client: The system failed to register host (A or AAAA) resource records (RRs) for network adapter with settings: Adapter Name : {A56E9F32-310E-4A61-9453-20D41890DDE2} Host Name : T-W1064-MS-072 Primary Domain Suffix : mdc1.mozilla.com DNS server list : #01110.48.75.120, 10.50.75.120 Sent update to server : <?> IP Address(es) : 10.49.40.43 The reason the system could not register these RRs during the update request was because of a system problem. You can manually retry DNS registration of the network adapter and its settings by typing 'ipconfig /registerdns' at the command prompt. If problems still persist, contact your DNS server or network systems administrator. See event details for specific error code information.#015
T-W1064-MS-083.releng.mdc1.mozilla.com no jobs for 20h; reboot, reimage, running jobs
T-W1064-MS-111.releng.mdc1.mozilla.com no jobs for 1 day; reboot, reimage, running jobs
T-W1064-MS-118.releng.mdc1.mozilla.com not in TC; reboot, back in TC, waiting for jobs, running jobs
T-W1064-MS-119.releng.mdc1.mozilla.com not in TC; reboot, reimage, running jobs
T-W1064-MS-121.releng.mdc1.mozilla.com not in TC; reboot, reimage, running jobs
T-W1064-MS-125.releng.mdc1.mozilla.com not in TC; reboot, still not in TC, reimage. Back in TC, waiting for jobs. Here are some papertrail errors that I captured during the reimage:
Jul 09 07:44:57 T-W1064-MS-125.mdc1.mozilla.com Microsoft-Windows-DSC: Job {8BAAAEA5-8386-11E8-8A1C-F40343DF3195} : This event indicates that failure happens when LCM is processing the configuration. Error Id is 0x1. Error Detail is The SendConfigurationApply function did not succeed.. Resource Id is [Script]FirewallRule_ICMPv6In and Source Info is C:\windows\TEMP\xDynamicConfig.ps1::583::9::Script. Error Message is PowerShell DSC resource MSFT_ScriptResource failed to execute Set-TargetResource functionality with error message: Error formatting a string: Index (zero based) must be greater than or equal to zero and less than the size of the argument list..
.#015 Jul 09 07:44:57 T-W1064-MS-125.mdc1.mozilla.com Microsoft-Windows-DSC: Job {8BAAAEA5-8386-11E8-8A1C-F40343DF3195} : MIResult: 1 Error Message: PowerShell DSC resource MSFT_ScriptResource failed to execute Set-TargetResource functionality with error message: Error formatting a string: Index (zero based) must be greater than or equal to zero and less than the size of the argument list.. Message ID: ProviderOperationExecutionFailure Error Category: 7 Error Code: 1 Error Type: MI#015 Jul 09 07:48:28 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: Run-RemoteDesiredStateConfig :: end#015 Jul 09 07:48:28 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: C:\dsc\rundsc.ps1 deleted.#015 Jul 09 07:48:28 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: C:\dsc\rundsc.ps1 downloaded.#015 Jul 09 07:48:28 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: scheduled task: RunDesiredStateConfigurationAtStartup, created.#015 Jul 09 07:48:29 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: log archive C:\log\20180709144311.userdata-run.zip created.#015 Jul 09 07:48:29 T-W1064-MS-125.mdc1.mozilla.com Microsoft-Windows-DSC: RESULT 1 [NXLOG@14506 Keywords="4611686018427387904" EventType="ERROR" EventID="4252" ProviderGuid="{50DF9E12-A8C4-4939-B281-47E1325BA63E}" Version="0" Task="0" OpcodeValue="0" RecordNumber="62" ActivityID="{1E08B3DF-1793-0004-4BCF-081E9317D401}" ThreadID="5528" Channel="Microsoft-Windows-DSC/Operational" Domain="NT AUTHORITY" AccountName="SYSTEM" UserID="S-1-5-18" AccountType="User" Opcode="Info" JobId="{8BAAAEA5-8386-11E8-8A1C-F40343DF3195}" MIResult="1" ErrorMessage="The SendConfigurationApply function did not succeed." ErrorCategory="0" ErrorCode="1" ErrorType="MI" EventReceivedTime="2018-07-09 14:48:28" SourceModuleName="eventlog" SourceModuleType="im_msvistalog"] Job {8BAAAEA5-8386-11E8-8A1C-F40343DF3195} : MIResult: 1 Error Message: The SendConfigurationApply function did not succeed. Message ID: MI RESULT 1 Error Category: 0 Error Code: 1 Error Type: MI#015 Jul 09 07:53:31 T-W1064-MS-125.mdc1.mozilla.com Microsoft-Windows-DSC: RESULT 1 [NXLOG@14506 Keywords="4611686018427387904" EventType="ERROR" EventID="4252" ProviderGuid="{50DF9E12-A8C4-4939-B281-47E1325BA63E}" Version="0" Task="0" OpcodeValue="0" RecordNumber="105" ActivityID="{519CFE68-1794-0004-2328-9D519417D401}" ThreadID="996" Channel="Microsoft-Windows-DSC/Operational" Domain="NT AUTHORITY" AccountName="SYSTEM" UserID="S-1-5-18" AccountType="User" Opcode="Info" JobId="{C12CFCC9-8387-11E8-8A1D-F40343DF3195}" MIResult="1" ErrorMessage="The SendConfigurationApply function did not succeed." ErrorCategory="0" ErrorCode="1" ErrorType="MI" EventReceivedTime="2018-07-09 14:53:29" SourceModuleName="eventlog" SourceModuleType="im_msvistalog"] Job {C12CFCC9-8387-11E8-8A1D-F40343DF3195} : MIResult: 1 Error Message: The SendConfigurationApply function did not succeed. 
Message ID: MI RESULT 1 Error Category: 0 Error Code: 1 Error Type: MI#015 Jul 09 07:53:31 T-W1064-MS-125.mdc1.mozilla.com Microsoft-Windows-DSC: Job {C12CFCC9-8387-11E8-8A1D-F40343DF3195} : Details logging completed for C:\windows\System32\Configuration\ConfigurationStatus\{C12CFCC9-8387-11E8-8A1D-F40343DF3195}-0.details.json.#015 Jul 09 07:53:31 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: C:\dsc\rundsc.ps1 deleted.#015 Jul 09 07:53:31 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: C:\dsc\rundsc.ps1 downloaded.#015 Jul 09 07:53:31 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: scheduled task: RunDesiredStateConfigurationAtStartup, created.#015 Jul 09 07:53:32 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: log archive C:\log\20180709145149.userdata-run.zip created.#015 Jul 09 07:53:32 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: generic-worker installation detected.#015 Jul 09 07:53:32 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: generic-worker running process detected 12 ms after task-claim-state.valid flag set.#015 Jul 09 07:53:32 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: process priority for generic worker altered from Normal to AboveNormal.#015 Jul 09 07:53:32 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: userdata run completed#015 Jul 09 07:53:53 T-W1064-MS-125.mdc1.mozilla.com generic-worker: 2018/07/09 14:53:51 No task claimed...#015 T-W1064-MS-129.releng.mdc1.mozilla.com no jobs for 2 days, reboot. running jobs T-W1064-MS-130.releng.mdc1.mozilla.com shut down T-W1064-MS-162.releng.mdc1.mozilla.com not in TC, reboot, reimage, running jobs T-W1064-MS-217.releng.mdc1.mozilla.com no jobs for 1 day, reboot, running jobs T-W1064-MS-222.releng.mdc1.mozilla.com no jobs for 4 days, reboot, reimage, running jobs T-W1064-MS-281.releng.mdc1.mozilla.com Stuck at PXE: 281
The dynamic registration errors can be ignored; we don't use DDNS in our infra, DNS is statically assigned. We should patch OCC to disable DNS registration to reduce log and network noise, but it causes no actual problem.
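A minimal sketch of what that OCC patch could do on each node, assuming the standard DnsClient cmdlets are the mechanism (the actual integration point in OCC/DSC is not decided here):

# Sketch: stop dynamic DNS registration so the "failed to register host (A or AAAA)
# resource records" noise goes away; our records are statically assigned anyway.
Get-DnsClient |
    Where-Object { $_.RegisterThisConnectionsAddress } |
    Set-DnsClient -RegisterThisConnectionsAddress $false

# Verify the setting on every interface.
Get-DnsClient | Select-Object InterfaceAlias, RegisterThisConnectionsAddress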
T-W1064-MS-023 reimage, running jobs.
T-W1064-MS-064 not in TC; reboot, running jobs.
T-W1064-MS-065, 71, 72 reimaged twice, no space after reimage. Opened Bug 1474578.
T-W1064-MS-066 not in TC; reboot, running jobs.
T-W1064-MS-152 not in TC, no jobs for 1 day; reboot, running jobs.
The following workers were in TC and lazy (5h+ since last job): T-W1064-MS-{61,64,69,76,77,85,87,109,110,111,116,120,124,127,129,131,134,154,156,159,161,166,169,178,282}. Of these, I had to reimage: T-W1064-MS-{69,76,77,109,116,120,124,127,129,134,156,159,161,166,169,178,282}. All of them started working after a reimage. The ones that weren't reimaged started working after a reboot. All of them have since taken jobs.
:pmoore tells me that we may be inadvertently running generic-worker 8.3.0 on these nodes. That's a really old version, and could have unintended side-effects. Is there any chance that the work in bug 1443589 is causing these nodes to go offline?
It's not inadvertent; gw10 has not yet worked successfully on the moonshot hardware. Bug 1443589 is to get us there.
All of the following were reimaged and are running jobs: T-W1064-MS-{131, 132, 153, 156, 158, 159, 160, 161, 162, 163, 164, 166, 169, 171, 246, 247, 254, 262, 267, 284}.
The above machines, and also the following ones, were missing from TC; they are now back after re-imaging: T-W1064-MS-{061, 062, 067, 076, 078, 079, 081, 106, 107, 108, 115, 116, 119, 121, 122, 124, 126, 127}.
The following machines were lazy (alive in TC but failing to pick up jobs) and were reimaged; all are running jobs: T-W1064-MS-{066, 075, 077, 086, 088, 090, 111, 112, 120, 151, 154, 177, 200, 216, 217, 241, 258, 285, 294}.
The following machines are the ones we know cannot be fixed by re-imaging: T-W1064-MS-065, T-W1064-MS-071, T-W1064-MS-072, T-W1064-MS-130, T-W1064-MS-281.
> :pmoore tells me that we may be inadvertently running generic-worker 8.3.0 > on these nodes. That's a really old version, and could have unintended > side-effects. This looks like it is unrelated to the generic-worker version. 13 out of the 30 nodes running generic-worker 10.8.5 have stopped picking up tasks in the past 20 hours.
Depends on: 1475711
(In reply to Zsolt Fay [:zsoltfay] from comment #69) > The following machines were lazy (alive in TC but failing to pick up jobs): I've created bug 1475711 for recording some details about workers not taking work when there are jobs in the queue. If we collect details about the times and which workers, then we can ask the Taskcluster team to investigate if there is a problem in the api or queue.
Re-imaged the following machines because they were lazy and hadn't recovered after a reboot: T-W1064-MS-107, T-W1064-MS-162, T-W1064-MS-169, T-W1064-MS-110, T-W1064-MS-164, T-W1064-MS-294, T-W1064-MS-127, T-W1064-MS-165, T-W1064-MS-134, T-W1064-MS-167, T-W1064-MS-161, T-W1064-MS-168.
Re-imaged the following machines because they were missing from TC: T-W1064-MS-065, T-W1064-MS-076, T-W1064-MS-077, T-W1064-MS-082, T-W1064-MS-086, T-W1064-MS-088, T-W1064-MS-106, T-W1064-MS-108, T-W1064-MS-111, T-W1064-MS-114, T-W1064-MS-124, T-W1064-MS-125, T-W1064-MS-126, T-W1064-MS-131, T-W1064-MS-132, T-W1064-MS-151, T-W1064-MS-152, T-W1064-MS-155, T-W1064-MS-156, T-W1064-MS-160, T-W1064-MS-171, T-W1064-MS-209, T-W1064-MS-217, T-W1064-MS-222, T-W1064-MS-247, T-W1064-MS-258, T-W1064-MS-267.
Worth noting that T-W1064-MS-134 isn't taking jobs and that t-w1064-ms-110 is performing very slowly and is encountering errors at the BIOS stage of boot.
065 - missing
077 - missing
126 - missing
132 - missing
134 - ?? had tasks resolved a day ago; I don't think it took tasks all day.
155 - ?? had a task completed 13h ago
076, 082, 086, 088, 106, 108, 111, 114, 124, 125, 131, 151, 152, 156, 160, 171, 247, 258, 267 - running (took jobs and completed them all day)
The following were all re-imaged and took jobs: T-W1064-MS-{063, 068, 070, 073, 081, 108, 110, 111, 112, 113, 116, 117, 118, 120, 126, 132, 133, 152, 160, 210, 216, 217, 245, 246, 256, 294}.
Re-imaged the missing win10 machines: T-W1064-MS-{064, 066, 067, 068, 076, 078, 082, 084, 086, 087, 090, 127, 129, 154, 157, 162, 166, 168, 169, 205, 222, 247, 260, 284}. These still need to be checked ^
Flags: needinfo?(ciduty)
I can confirm that all of the above are once again in TC except T-W1064-MS-127 and T-W1064-MS-247. Will take action on those 2 and the following: T-W1064-MS-065, T-W1064-MS-070, T-W1064-MS-077, T-W1064-MS-088, T-W1064-MS-131, T-W1064-MS-156, T-W1064-MS-241, T-W1064-MS-256, T-W1064-MS-262, T-W1064-MS-285, T-W1064-MS-294. Will comment later with the results.
Flags: needinfo?(ciduty)
The following machines got back into TC after a reboot: T-W1064-MS-070, T-W1064-MS-088, T-W1064-MS-131, T-W1064-MS-256, T-W1064-MS-120, T-W1064-MS-210, T-W1064-MS-217, T-W1064-MS-246, T-W1064-MS-267.
The following have been reimaged for not recovering after a reboot: T-W1064-MS-077, T-W1064-MS-127, T-W1064-MS-156, T-W1064-MS-241, T-W1064-MS-247, T-W1064-MS-262, T-W1064-MS-294.
The reimaged ones still need a Taskcluster check.
Flags: needinfo?(ciduty)
All of them took tasks and completed them without problems except for T-W1064-MS-077, which doesn't appear in Taskcluster. I'll look into it and come back with more details.
Flags: needinfo?(ciduty)
Below are machines that were rebooted and take jobs: T-W1064-MS-063, T-W1064-MS-074, T-W1064-MS-082, T-W1064-MS-108, T-W1064-MS-110, T-W1064-MS-222.
The following machine has been rebooted and is now available in Taskcluster, but it hasn't received any job yet: T-W1064-MS-285.
Here is the list of machines that were rebooted and reimaged and are now available in Taskcluster, split by whether they have taken jobs yet:
Available, but no job yet: T-W1064-MS-061, T-W1064-MS-077, T-W1064-MS-115, T-W1064-MS-153, T-W1064-MS-161.
Available and taking jobs: T-W1064-MS-062, T-W1064-MS-079, T-W1064-MS-107, T-W1064-MS-118.
The following were re-imaged and are now taking jobs and available in Taskcluster: T-W1064-MS-{061, 106, 111, 112, 113, 115, 116, 118, 120, 121, 125, 128, 129, 131, 133, 134, 135, 155, 169, 200, 206, 217, 254, 258, 262}.
The following workers have been re-imaged today: T-W1064-MS-{066, 079, 109, 110, 156, 159, 161, 168, 198, 212, 241, 247, 250, 251, 282, 286}. All have since taken jobs except 250. See Bug 1479187.
T-W1064-MS-064 no jobs for 1 day; reboot, jobs running
T-W1064-MS-127 no jobs for 1 day; reboot, jobs running
T-W1064-MS-212 no jobs for 1 day; reboot, reimage, jobs running
T-W1064-MS-222 no jobs for 1 day; reboot, reimage, jobs running
T-W1064-MS-285 no jobs for 1 day; reboot, reimage, jobs running
T-W1064-MS-324 not in TC; reboot, running jobs
T-W1064-MS-335 not in TC; reboot, reimage, see next comment.
Attached image ms-win10-335.PNG —
I have re-imaged this machine twice today, but after the Windows installation it got stuck at the "Task sequence" step. I selected Win 10 GW 10 once again and it finished the install, but the machine does not appear in TC.
T-W1064-MS-125: no jobs for 1 day, reboot, running jobs T-W1064-MS-156: no jobs for 1 day, reboot, running jobs T-W1064-MS-169: no jobs for 1 day, reboot, reimage, running jobs T-W1064-MS-260: no jobs for 1 day, reboot, reimage, running jobs T-W1064-MS-267: no jobs for 1 day, reboot, running jobs
T-W1064-MS-080: no jobs for one day, reboot, reimage, running jobs T-W1064-MS-089: no jobs for one day, reboot, reimage, running jobs T-W1064-MS-107: no jobs for one day, reboot, reimage, running jobs T-W1064-MS-198: no jobs for one day, reboot, running jobs T-W1064-MS-162: no jobs for one day, reboot, reimage, running jobs
When these servers have gone "no jobs for one day", is the worker process still running, what was the logged end of the last task (a timeout, like we sometimes see on OSX?), and what is in the logs? (If the worker process is running, is it polling for work or giving warnings?) After a reboot, how do you determine that the machine needs to be reimaged?
After the reboot, in the logs I see many lines like the one below, repeating forever. After 30-45 minutes I reimage them.
generic-worker: 2018/08/01 13:18:05 No task claimed...#015
T-W1064-MS-168 - just had the same issue: no jobs for one day. In the logs before and after the reboot:
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 Response#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 HTTP/1.1 200 OK#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Content-Length: 0#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Date: Tue, 31 Jul 2018 13:05:50 GMT#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Etag: "52ea05b8431def9a40afcd85b2af1c23"#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Server: AmazonS3#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: X-Amz-Id-2: T/1wOyAuCN2hTlZrWxN7kYnCsN7pXTTVHiDktWQ6N6YW5UAbv9ew5oN/kfEHld0dX2w7cfKQK10=#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: X-Amz-Request-Id: E71719FDE2A62B38#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: X-Amz-Version-Id: k4DcGnPfxH3WsJVrSUabWmu.3UcejhZx#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: #015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 Resolving task...#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 Command finished successfully!#015
Jul 31 06:05:50 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 No previous task user desktop, so no need to close any open desktops#015
Jul 31 06:05:50 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 Trying to remove directory 'C:\Users\task_1533041946' via os.RemoveAll(path) call as GenericWorker user...#015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Graphic Card being used "Intel(R) Iris(R) Pro Graphics P580 " #015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Removing temp dir contents #015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: C:\Users\GenericWorker\AppData\Local\Temp\aria-debug-7960.log#015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Deleted file - C:\Users\GenericWorker\AppData\Local\Temp\livelog807285603\stream#015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Deleted file - C:\ProgramData\Package Cache\{B74E65FD-CC47-41C5-4B89-791A3F61942D}v8.100.25984\Installers\Kits Configuration Installer-x86_en-us.msi#015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Removing log files older than 1 day #015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Removing Windows log files older than 7 days #015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Removing Recycle.bin contents #015
Jul 31 06:05:56 T-W1064-MS-168.mdc1.mozilla.com User32: The process C:\windows\system32\shutdown.exe (T-W1064-MS-168) has initiated the restart of computer T-W1064-MS-168 on behalf of user T-W1064-MS-168\GenericWorker for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: restart Comment: Rebooting as generic worker ran successfully#015
Jul 31 06:05:59 T-W1064-MS-168.mdc1.mozilla.com Service_Control_Manager: The sshd service terminated unexpectedly. It has done this 1 time(s).#015
Then no logs for almost 24h.
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: /07/31 14:05:48 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:06:43 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:07:39 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:08:46 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:09:49 Could not claim work. Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:09:49 No task claimed...#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:09:49 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Then, after ~45 min, the server took jobs.
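Those "No such host is known" lines are DNS lookup failures for queue.taskcluster.net, after which the worker recovered on its own. Before deciding that a "lazy" node needs a reimage, a small triage check along these lines could help (a hypothetical helper, not part of OCC; the generic-worker process name is an assumption): is the worker process alive, and can the node resolve and reach the queue?

# Hypothetical triage for a node stuck on "No task claimed..." or DNS errors.
$queueHost = 'queue.taskcluster.net'

$worker = Get-Process -Name 'generic-worker' -ErrorAction SilentlyContinue | Select-Object -First 1
if ($worker) {
    Write-Output ('generic-worker running, PID {0}, started {1}' -f $worker.Id, $worker.StartTime)
} else {
    Write-Output 'generic-worker process NOT running'
}

try {
    $addrs = (Resolve-DnsName -Name $queueHost -Type A -ErrorAction Stop).IPAddress -join ', '
    Write-Output ('DNS OK: {0} -> {1}' -f $queueHost, $addrs)
} catch {
    Write-Output ('DNS FAILED for {0}: {1}' -f $queueHost, $_.Exception.Message)
}

# TCP reachability of the queue on port 443.
$tcp = Test-NetConnection -ComputerName $queueHost -Port 443 -WarningAction SilentlyContinue
Write-Output ('HTTPS port reachable: {0}' -f $tcp.TcpTestSucceeded)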
Looks like there is a common error on the MS Windows servers, at least on the 3 below:
Microsoft-Windows-DSC: Job DscTimerConsistencyOperationResult : DSC Engine Error : #011 Error Message: NULL #011Error Code : 1 #015
https://bugzilla.mozilla.org/show_bug.cgi?id=1478723
https://bugzilla.mozilla.org/show_bug.cgi?id=1480347
https://bugzilla.mozilla.org/show_bug.cgi?id=1480386
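To gauge how widespread that DSC error is on a given node, the DSC operational log and the LCM status history can be pulled locally with standard cmdlets; a short sketch (run in an elevated PowerShell on the affected worker):

# Sketch: recent errors from the DSC operational log (the same channel the
# papertrail excerpts above come from), plus the status of recent DSC runs.
Get-WinEvent -LogName 'Microsoft-Windows-DSC/Operational' -MaxEvents 200 |
    Where-Object { $_.LevelDisplayName -eq 'Error' } |
    Select-Object TimeCreated, Id, Message |
    Format-Table -AutoSize -Wrap

Get-DscConfigurationStatus -All |
    Select-Object StartDate, Type, Status, DurationInSeconds |
    Format-Table -AutoSize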
Blocks: 1481426
T-W1064-MS-112 and T-W1064-MS-117 have a non-responsive iLO after logging in with credentials.
Were these nodes rebooted or reimaged? They seem to be up and working now.
(In reply to Mark Cornmesser [:markco] from comment #91)
> Were these nodes rebooted or reimaged? They seem to be up and working now.

I've checked the handovers, and I don't see anyone in CiDuty reporting that they worked on the 2 machines.
I rebooted 112 and 117 yesterday, and afterwards they were taking tasks. Both had taken no tasks for one day.
- rebooted, reimaged & take jobs: T-W1064-MS-132, T-W1064-MS-133, T-W1064-MS-134
- rebooted & take jobs: T-W1064-MS-067, T-W1064-MS-071, T-W1064-MS-118, T-W1064-MS-168, T-W1064-MS-196, T-W1064-MS-211, T-W1064-MS-247, T-W1064-MS-262, T-W1064-MS-266, T-W1064-MS-286, T-W1064-MS-296
Found a lot of lazy workers in the Windows pool. Mark, the number this time is quite considerable; you might be extra interested in this. Went ahead and rebooted all of the machines below. @CiDuty: these all need to be checked; please re-image, check the logs, and update their respective bugs for the ones that don't recover after a re-image.
Flags: needinfo?(mcornmesser)
T-W1064-MS-{021, 024, 027, 031, 033, 034, 035, 037, 037, 039, 041, 043, 045, 062, 064, 066, 071, 072, 078, 079, 088, 089, 090, 110, 116, 117, 120, 121, 122, 125, 126, 127, 129, 131, 132, 133, 134, 152, 158, 160, 164, 165, 167, 169, 174, 176, 180, 198, 199, 202, 204, 206, 210, 212, 215, 216, 218, 222, 242, 244, 246, 247, 249, 253, 256, 259, 264, 268, 269, 270, 281, 282, 283, 287, 295}. Note, all of the above have had at least 2+ hours since last job taken, with over 75% of them having 10h+.
Flags: needinfo?(ciduty)
Zsoltfay: What was the time frame in which these stopped working? All within the last 2 to 10 hours? Were there groups that stopped around the same time?
Flags: needinfo?(mcornmesser)
Yes, it was during my shift. We rebooted all of them ~4 hours ago. A lot of them had gone 17-18 hours since their last job, and another cluster had ~2 hours since their last job. Then there were a few stray ones around the board at 5-7-9-12h.
Do you have a record of which ones were greater than 2 hours? I am going to go ahead and close this bug; the original issue here has been addressed. I have opened Bug 1490398. CiDuty: could you add nodes that have not picked up tasks in 2.5 or more hours to that bug?
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
(In reply to Zsolt Fay [:zfay] from comment #97)
> T-W1064-MS-{021, 024, 027, 031, 033, 034, 035, 037, 037, 039, 041, 043, 045,
> 062, 064, 066, 071, 072, 078, 079, 088, 089, 090, 110, 116, 117, 120, 121,
> 122, 125, 126, 127, 129, 131, 132, 133, 134, 152, 158, 160, 164, 165, 167,
> 169, 174, 176, 180, 198, 199, 202, 204, 206, 210, 212, 215, 216, 218, 222,
> 242, 244, 246, 247, 249, 253, 256, 259, 264, 268, 269, 270, 281, 282, 283,
> 287, 295}.
>
> Note, all of the above have had at least 2+ hours since last job taken, with
> over 75% of them having 10h+.

Cleared the NI request since I did a follow-up on those machines in Bug 1490398#c1.
Flags: needinfo?(ciduty)