Closed
Bug 1452133
Opened 7 years ago
Closed 6 years ago
Moonshot Windows 10 nodes stop functioning
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: markco, Assigned: markco)
References
Details
Attachments
(8 files)
A small percentage of nodes will suddenly stop reporting to papertrail and become unreachable.
Rebooting the machine remedies the situation.
Assignee | ||
Comment 1•7 years ago
The local OS event logs also stop and do not pick up again till the reboot.
Summary: Moonshot Windows 10 nodes drop of the network → Moonshot Windows 10 nodes stop functioning
Assignee | ||
Updated•7 years ago
Assignee: relops → mcornmesser
Assignee | ||
Comment 2•7 years ago
Added additional logging to the run generic-worker bat file to see whether the cleanup is affecting the node.
https://github.com/mozilla-releng/OpenCloudConfig/commit/bb9f3a82317d722d536c4b363315f007bca2f9be
The papertrail logs have shown for multiple nodes that it starts the cleanup, and then the logs stop.
Apr 11 13:23:07 T-W1064-MS-071.mdc1.mozilla.com generic-worker: Removing temp dir contents #015
Apr 11 13:23:07 T-W1064-MS-071.mdc1.mozilla.com generic-worker: C:\Users\GenericWorker\AppData\Local\Temp\aria-debug-4180.log#015
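One way to spot this pattern across many nodes is to look at the last line each host logged before going silent. The following is a minimal sketch, assuming papertrail lines in the `Mon DD HH:MM:SS host program: message` shape shown above; `last_message_per_host` is an illustrative helper, not part of OpenCloudConfig or papertrail.

```python
import re

# Matches the papertrail line shape seen in this bug, for example:
# "Apr 11 13:23:07 T-W1064-MS-071.mdc1.mozilla.com generic-worker: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<program>[^:]+): (?P<msg>.*)$"
)


def last_message_per_host(lines):
    """Return {host: (timestamp, program, message)} for each host's final line.

    Lines are assumed to be in chronological order; lines that do not match
    the expected shape are skipped.
    """
    last = {}
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            last[match["host"]] = (match["ts"], match["program"], match["msg"])
    return last
```

A host whose final message is a cleanup line and whose timestamp is well in the past would be a candidate for the lock-up described here.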
Assignee | ||
Comment 3•7 years ago
Commented out the dism command:
https://github.com/mozilla-releng/OpenCloudConfig/commit/b4dbe3ce9511b5c5cb6ec047f1485c80b1ea4056
It is throwing an error from time to time, and may not be needed now that updates are under control.
Assignee | ||
Comment 4•7 years ago
https://github.com/mozilla-releng/OpenCloudConfig/commit/d355e59c7c3cf6c55b8e58334624e41e106c98e0
Added additional log messages to see if the node locks up during various parts of the cleanup.
Assignee | ||
Comment 5•7 years ago
Tracking which nodes have been affected:
https://docs.google.com/spreadsheets/d/1RYn2ojbcdCjnt0Z16Sl_qhN7W1-KvCilGitCfvZjibc/edit#gid=1896682270
Comment 6•7 years ago
After going through each page of the Windows 10 test machines in papertrail, I've reimaged the following t-w1064-ms machines:
35, 38, 214, 244, 269.
I have also updated the above doc to note that these machines were affected as well.
Comment 7•7 years ago
I've just sanity-checked the above workers as a follow-up and can confirm that machines 38, 214, 244 and 269 are back, taking jobs and completing them.
However, machines 35 and 77 are still unreachable.
Comment 8•7 years ago
I have re-imaged 35 and 77; however, the process did not go well.
I rebooted the server, hit F12, started PXE, selected option two and started the re-image process. After the first boot the server starts PXE again (it should boot from HDD, right?). I let the process start again and, after a while, it reports that a re-image is already running and asks whether to start a new one or not (that is, let the first one continue). If I choose Yes, the process starts again and a loop is created. If I choose No and let the already running process continue, it fails with the errors from capture 1. If it does not fail, the process continues, but after a reboot and some Windows settings I get error 2.
Comment 9•7 years ago
Comment 10•7 years ago
Tried to reimage t-w1064-ms-262. Seems like it gets stuck in a loading state.
Comment 11•7 years ago
T-W1064-MS-077 looks OK now; many jobs were completed successfully. However, the ILO capture shows the machine in a stuck state.
Comment 12•7 years ago
Comment 13•7 years ago
Checking other machines that are visible in TS and taking jobs, looks like they are in the same stuck state. I believe that is a normal one :).
Comment 14•7 years ago
T-W1064-MS-035 is also back in TS, waiting to take jobs.
Comment 15•7 years ago
MS-35 took jobs and completed them successfully.
There is a glitch in the re-image. To get it done successfully, follow these steps:
- start the re-image as the docs say and watch the ILO;
- if, after the first reboot, the server goes to PXE, reboot it and press F9 to select boot from HDD; otherwise follow the steps below;
- watch the ILO and Windows setup; if the Windows login is root then you are on the right track; if not, the re-image process needs to be restarted until the root user logs in;
- once the setup completes and the final reboot is done, the ILO will show Windows stuck at loading;
- the re-imaged machine should be visible in TC and take jobs.
After more than 10 re-images, the above solution works :).
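The decision points in the steps above can be written down as a small lookup, which may help when several people re-image nodes in parallel. This is only a sketch: the stage and observation labels (`first_reboot`, `pxe`, `root_login` and so on) are hypothetical names chosen here, not anything the deployment tooling or the ILO reports.

```python
def next_action(stage, observation):
    """Return the operator step for what the ILO console currently shows.

    `stage` and `observation` are hypothetical labels for the situations
    described in the comment above.
    """
    if stage == "first_reboot":
        # The node should boot from HDD; a second PXE boot is the glitch.
        if observation == "pxe":
            return "reboot and press F9 to boot from HDD"
        return "keep watching Windows setup"
    if stage == "windows_setup":
        # Only a root login indicates the re-image is on track.
        if observation == "root_login":
            return "let setup finish"
        return "restart the re-image"
    if stage == "final_reboot":
        # A stuck loading screen in the ILO is expected at this point.
        return "check that the worker appears in TC"
    raise ValueError(f"unknown stage: {stage!r}")
```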
Comment 16•7 years ago
I have found other Win10-MS servers that are not in TC: 111, 130, 178, 179, 215, 257, 263, 268.
Working to reimage 111 and 130 (:apop).
Comment 17•7 years ago
T-W1064-MS-111 is reimaged and took jobs that are completed successfully.
T-W1064-MS-178 is reimaged and waiting for jobs.
Comment 18•7 years ago
I have tried to re-image T-W1064-MS-179 but I encountered this problem. ^
After this, Automatic Repair started and then the iLO console seems to be stuck (Scanning and repairing drive (\\?\SystemPartition): 100% complete).
Comment 19•7 years ago
T-W1064-MS-178 - running jobs successfully
T-W1064-MS-179 - re-imaged, running jobs successfully
It looks like the re-image process is set to run until it is done, no matter what. If there is some error, like the one captured by Radu, you somehow need to finish the process. I rebooted the server several times, let it boot from HDD, skipped the disk check and let the process run until I got the same error in Windows. After one more reboot, I started PXE, where I got the error that the deploy cannot continue, and I hit Finish. At that point the previous process ended and I was able to start a PXE boot and re-image the server successfully.
T-W1064-MS-215 - re-imaged, taking jobs successfully
Comment 20•7 years ago
T-W1064-MS-257 - re-imaged, taking jobs successfully
T-W1064-MS-263 - re-imaged, taking jobs successfully
T-W1064-MS-268 - re-imaged, taking jobs successfully
Comment 21•7 years ago
T-W1064-064 - re-imaged, taking jobs successfully
T-W1064-255 - re-imaged, taking jobs successfully
Comment 22•7 years ago
Looks like T-W1064-MS-262 is taking jobs now.
Comment 23•7 years ago
While monitoring the Windows 10 Moonshot machines I saw that we have trouble with T-W1064-MS-021. When I tried to connect to it remotely via ILO, it returned the following error:
ExitException[ 3]com.sun.deploy.net.FailedDownloadException: Unable to load resource
at jdk.plugin@9.0.4/sun.plugin2.applet.JNLP2Manager.downloadResources(JNLP2Manager.java:1846)
at jdk.plugin@9.0.4/sun.plugin2.applet.JNLP2Manager.prepareLaunchFile(JNLP2Manager.java:1457)
at jdk.plugin@9.0.4/sun.plugin2.applet.JNLP2Manager.loadJarFiles(JNLP2Manager.java:476)
at jdk.plugin@9.0.4/sun.plugin2.applet.Plugin2Manager$AppletExecutionRunnable.run(Plugin2Manager.java:1770)
at java.base/java.lang.Thread.run(Thread.java:844)
Caused by: com.sun.deploy.net.FailedDownloadException: Unable to load resource
at jdk.deploy@9.0.4/com.sun.deploy.net.DownloadEngine.actionDownload(DownloadEngine.java:807)
at jdk.deploy@9.0.4/com.sun.deploy.net.DownloadEngine.downloadResource(DownloadEngine.java:914)
at jdk.deploy@9.0.4/com.sun.deploy.cache.ResourceProviderImpl.getResource(ResourceProviderImpl.java:387)
at jdk.deploy@9.0.4/com.sun.deploy.cache.ResourceProviderImpl.getResource(ResourceProviderImpl.java:325)
at jdk.javaws@9.0.4/com.sun.javaws.LaunchDownload$DownloadTask.call(LaunchDownload.java:1649)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
... 1 more
Comment 24•7 years ago
Started the reimaging process for T-W1064-MS-113 and T-W1064-MS-034.
Comment 25•7 years ago
(In reply to Roland Mutter Michael (:rmutter) from comment #23)
> Created attachment 8980529 [details]
> Screenshot_2018-05-25_12-35-49.png
>
> While monitoring Windows 10 Moonshots machines I saw that we have troubles
> with T-W1064-MS-021. Tried to connect remotly to it via ILO it returned me
> the following error:
Works from my tux box, so I have started the re-image for T-W1064-MS-021 and T-W1064-MS-023.
Comment 26•7 years ago
Adding T-W1064-MS-263 to reimaged moonshots ^.
Comment 27•7 years ago
T-W1064-MS-021 and T-W1064-MS-023 - re-imaged, taking jobs successfully
Comment 28•7 years ago
T-W1064-MS-281 has some issues. It won't boot from the network for a reimage. Not sure if it's a hardware-related issue.
Comment 29•7 years ago
T-W1064-MS-037 - not found in TC, rebooted, took jobs successfully
Comment 30•7 years ago
UPDATE: T-W1064-MS-263, T-W1064-MS-113 and T-W1064-MS-034 are back in production taking jobs.
Comment 31•7 years ago
Update:
T-W1064-MS-042 - reimaged and takes jobs
T-W1064-MS-134 - reimaged and takes jobs
T-W1064-MS-207 - reimaged and takes jobs
T-W1064-MS-212 - reimaged and takes jobs
T-W1064-MS-265 - reimaged and takes jobs
T-W1064-MS-129 - reimaged and takes jobs
T-W1064-MS-215 - reimaged and takes jobs
Problems with:
T-W1064-MS-170
- not found in Taskcluster, unable to reimage it; it gets rebooted when it gets to the Deployment kit view
Comment 32•7 years ago
T-W1064-MS-035 - reimaged, still waiting for it to appear in tc
T-W1064-MS-036 - reimaged, still waiting for it to appear in tc
T-W1064-MS-040 - reimaged, still waiting for it to appear in tc
Comment 33•7 years ago
T-W1064-MS-035 - takes jobs
T-W1064-MS-036 - takes jobs
T-W1064-MS-040 - takes jobs
Comment 34•7 years ago
T-W1064-MS-129 - reimaged and takes jobs
T-W1064-MS-214 - reimaged, rebooted and waiting to take jobs
Comment 35•7 years ago
T-W1064-MS-21 reimaged and takes jobs
T-W1064-MS-70 reimaged and takes jobs
T-W1064-MS-81 reimaged and takes jobs
T-W1064-MS-165 reimaged and takes jobs
T-W1064-MS-170 reimaged and takes jobs
Assignee | ||
Comment 36•7 years ago
Please do not reimage 135 and 81. I am using both of those for testing and they may not be reporting to papertrail. I will update here once I am done with those nodes.
Comment 37•7 years ago
Reimaged:
T-W1064-MS-077
T-W1064-MS-087
They are both back in taskcluster.
Comment 38•7 years ago
T-W1064-MS-031
rebooted,reimaged and takes jobs
T-W1064-MS-082
rebooted, reimaged and it takes jobs
T-W1064-MS-175
rebooted and takes jobs
T-W1064-MS-247
wasn’t in TaskCluster, reimaged and takes jobs
T-W1064-MS-267
rebooted, reimaged and it takes jobs
Comment 39•7 years ago
T-W1064-MS-029
- reboot,takes jobs
T-W1064-MS-030
- reboot,takes jobs
T-W1064-MS-031
- reimage,takes jobs
T-W1064-MS-032
- reboot,takes jobs
T-W1064-MS-035
- reimage,takes jobs
T-W1064-MS-037
- reboot,takes jobs
T-W1064-MS-085
- reboot,reimage,takes jobs
T-W1064-MS-088
- reboot,takes jobs
T-W1064-MS-089
- reboot,takes jobs
T-W1064-MS-117
- reboot,takes jobs
T-W1064-MS-123
- reboot,takes jobs
T-W1064-MS-131
- reboot,takes jobs
T-W1064-MS-132
- reboot,takes jobs
T-W1064-MS-152
- reboot,takes jobs
T-W1064-MS-155
- reboot,takes jobs
T-W1064-MS-219
- reboot,takes jobs
T-W1064-MS-267
- reboot,takes jobs
Comment 40•7 years ago
Hey Mark, can you confirm that you still need 81 and 135 for testing?
Also I know you are testing the new generic worker on chassis 1, so fyi 24 and 34 are suddenly missing from taskcluster. If this was intentionally done then just ignore this.
Flags: needinfo?(mcornmesser)
Comment 41•7 years ago
Went for a full check of which Windows moonshots appear in TC. Seems like the following machines are not in TC:
T-W1064-MS-18
T-W1064-MS-24
T-W1064-MS-27
T-W1064-MS-29
T-W1064-MS-30
T-W1064-MS-33
T-W1064-MS-34
T-W1064-MS-35
T-W1064-MS-40
T-W1064-MS-42
T-W1064-MS-84
T-W1064-MS-109
T-W1064-MS-113
T-W1064-MS-117
T-W1064-MS-130
T-W1064-MS-151
T-W1064-MS-247
T-W1064-MS-261
T-W1064-MS-281
Will proceed with a check in papertrail and a reboot for every machine. If that doesn't work, I'll start a reimage for each one. I'll be back with updates.
Comment 42•7 years ago
After two sessions of reboots, the following machines seem to need a reimage:
T-W1064-MS-18
T-W1064-MS-29
T-W1064-MS-30
T-W1064-MS-40
T-W1064-MS-84
T-W1064-MS-117
T-W1064-MS-130
T-W1064-MS-33 - Seems to be broken.
T-W1064-MS-281 - Was found in System Health Summary; rebooted it and now it is in some kind of boot loop. My guess is that it doesn't find a source to boot from.
Assignee | ||
Comment 43•7 years ago
(In reply to Zsolt Fay [:zsoltfay] from comment #40)
> Hey Mark, can you confirm that you still need 81 and 135 for testing?
>
> Also I know you are testing the new generic worker on chassis 1, so fyi 24
> and 34 are suddenly missing from taskcluster. If this was intentionally done
> then just ignore this.
I am no longer using any additional nodes outside of chassis 1. 34 and 24 were unintentional. I will take a look at them later.
Flags: needinfo?(mcornmesser)
Comment 44•7 years ago
Found several machines today that hadn't taken jobs for over 3 hours (some even 20+ hours), while there was a pending queue of 700+.
Re-imaged them and they have already taken jobs successfully. The machines in question are:
t-win1064-ms-{067, 078, 080, 086, 088, 107, 109, 117, 118, 129, 131, 152, 157, 217, 256, 262}
I have also re-imaged 168 but it does not complete the re-image. At first glance it appears it does not have the option to boot from an HDD (only xenserver, ipv4 and ipv6 boot).
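A check like this could be automated if the time of each worker's last claimed task is available. The following is a minimal sketch, assuming that data has already been collected into a dict; how it would be obtained from Taskcluster is not shown, and `idle_workers` is an illustrative name. The 3-hour default is the figure mentioned above.

```python
from datetime import datetime, timedelta


def idle_workers(last_claim, now, threshold=timedelta(hours=3)):
    """Return the names of workers whose last claimed task is older than `threshold`.

    `last_claim` maps a worker name to the datetime of its last claimed task.
    The result is sorted so that reports are stable between runs.
    """
    return sorted(name for name, ts in last_claim.items() if now - ts > threshold)


if __name__ == "__main__":
    # Made-up timestamps, purely for illustration.
    now = datetime(2018, 7, 3, 12, 0, 0)
    sample = {
        "t-w1064-ms-067": now - timedelta(hours=20),
        "t-w1064-ms-068": now - timedelta(minutes=5),
    }
    print(idle_workers(sample, now))
```

Workers flagged this way would still need the manual reboot or re-image described in this bug; the sketch only narrows down where to look.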
Comment 45•7 years ago
Here is a complete list of machines that don't appear in taskcluster [221]:
'T-W1064-MS-017', 'T-W1064-MS-018', 'T-W1064-MS-020', 'T-W1064-MS-023', 'T-W1064-MS-029', 'T-W1064-MS-034', 'T-W1064-MS-035', 'T-W1064-MS-041', 'T-W1064-MS-042', 'T-W1064-MS-044', 'T-W1064-MS-045', 'T-W1064-MS-064', 'T-W1064-MS-065', 'T-W1064-MS-106', 'T-W1064-MS-112', 'T-W1064-MS-116', 'T-W1064-MS-126', 'T-W1064-MS-130', 'T-W1064-MS-134', 'T-W1064-MS-151', 'T-W1064-MS-156', 'T-W1064-MS-168', 'T-W1064-MS-222', 'T-W1064-MS-261', 'T-W1064-MS-281', 'T-W1064-MS-316', 'T-W1064-MS-318', 'T-W1064-MS-319', 'T-W1064-MS-320', 'T-W1064-MS-321', 'T-W1064-MS-322', 'T-W1064-MS-323', 'T-W1064-MS-324', 'T-W1064-MS-325', 'T-W1064-MS-326', 'T-W1064-MS-327', 'T-W1064-MS-328', 'T-W1064-MS-329', 'T-W1064-MS-330', 'T-W1064-MS-331', 'T-W1064-MS-332', 'T-W1064-MS-333', 'T-W1064-MS-334', 'T-W1064-MS-335', 'T-W1064-MS-336', 'T-W1064-MS-337', 'T-W1064-MS-338', 'T-W1064-MS-339', 'T-W1064-MS-340', 'T-W1064-MS-341', 'T-W1064-MS-342', 'T-W1064-MS-343', 'T-W1064-MS-344', 'T-W1064-MS-345', 'T-W1064-MS-362', 'T-W1064-MS-363', 'T-W1064-MS-364', 'T-W1064-MS-365', 'T-W1064-MS-366', 'T-W1064-MS-367', 'T-W1064-MS-368', 'T-W1064-MS-369', 'T-W1064-MS-370', 'T-W1064-MS-371', 'T-W1064-MS-372', 'T-W1064-MS-373', 'T-W1064-MS-374', 'T-W1064-MS-375', 'T-W1064-MS-376', 'T-W1064-MS-377', 'T-W1064-MS-378', 'T-W1064-MS-379', 'T-W1064-MS-380', 'T-W1064-MS-381', 'T-W1064-MS-382', 'T-W1064-MS-383', 'T-W1064-MS-384', 'T-W1064-MS-385', 'T-W1064-MS-386', 'T-W1064-MS-387', 'T-W1064-MS-388', 'T-W1064-MS-389', 'T-W1064-MS-390', 'T-W1064-MS-406', 'T-W1064-MS-407', 'T-W1064-MS-408', 'T-W1064-MS-409', 'T-W1064-MS-410', 'T-W1064-MS-411', 'T-W1064-MS-412', 'T-W1064-MS-413', 'T-W1064-MS-414', 'T-W1064-MS-415', 'T-W1064-MS-416', 'T-W1064-MS-417', 'T-W1064-MS-418', 'T-W1064-MS-419', 'T-W1064-MS-420', 'T-W1064-MS-421', 'T-W1064-MS-422', 'T-W1064-MS-423', 'T-W1064-MS-424', 'T-W1064-MS-425', 'T-W1064-MS-426', 'T-W1064-MS-427', 'T-W1064-MS-428', 'T-W1064-MS-429', 'T-W1064-MS-430', 'T-W1064-MS-431', 'T-W1064-MS-432', 'T-W1064-MS-433', 
'T-W1064-MS-434', 'T-W1064-MS-435', 'T-W1064-MS-451', 'T-W1064-MS-452', 'T-W1064-MS-453', 'T-W1064-MS-454', 'T-W1064-MS-455', 'T-W1064-MS-456', 'T-W1064-MS-457', 'T-W1064-MS-458', 'T-W1064-MS-459', 'T-W1064-MS-460', 'T-W1064-MS-461', 'T-W1064-MS-462', 'T-W1064-MS-463', 'T-W1064-MS-464', 'T-W1064-MS-465', 'T-W1064-MS-466', 'T-W1064-MS-467', 'T-W1064-MS-468', 'T-W1064-MS-469', 'T-W1064-MS-470', 'T-W1064-MS-471', 'T-W1064-MS-472', 'T-W1064-MS-473', 'T-W1064-MS-474', 'T-W1064-MS-475', 'T-W1064-MS-476', 'T-W1064-MS-477', 'T-W1064-MS-478', 'T-W1064-MS-479', 'T-W1064-MS-480', 'T-W1064-MS-497', 'T-W1064-MS-498', 'T-W1064-MS-499', 'T-W1064-MS-500', 'T-W1064-MS-501', 'T-W1064-MS-502', 'T-W1064-MS-503', 'T-W1064-MS-504', 'T-W1064-MS-505', 'T-W1064-MS-506', 'T-W1064-MS-507', 'T-W1064-MS-508', 'T-W1064-MS-509', 'T-W1064-MS-510', 'T-W1064-MS-511', 'T-W1064-MS-512', 'T-W1064-MS-513', 'T-W1064-MS-514', 'T-W1064-MS-515', 'T-W1064-MS-516', 'T-W1064-MS-517', 'T-W1064-MS-518', 'T-W1064-MS-519', 'T-W1064-MS-520', 'T-W1064-MS-521', 'T-W1064-MS-522', 'T-W1064-MS-523', 'T-W1064-MS-524', 'T-W1064-MS-525', 'T-W1064-MS-542', 'T-W1064-MS-543', 'T-W1064-MS-544', 'T-W1064-MS-545', 'T-W1064-MS-546', 'T-W1064-MS-547', 'T-W1064-MS-548', 'T-W1064-MS-549', 'T-W1064-MS-550', 'T-W1064-MS-551', 'T-W1064-MS-552', 'T-W1064-MS-553', 'T-W1064-MS-554', 'T-W1064-MS-555', 'T-W1064-MS-556', 'T-W1064-MS-557', 'T-W1064-MS-558', 'T-W1064-MS-559', 'T-W1064-MS-560', 'T-W1064-MS-561', 'T-W1064-MS-562', 'T-W1064-MS-563', 'T-W1064-MS-564', 'T-W1064-MS-565', 'T-W1064-MS-566', 'T-W1064-MS-567', 'T-W1064-MS-568', 'T-W1064-MS-569', 'T-W1064-MS-570', 'T-W1064-MS-581', 'T-W1064-MS-582', 'T-W1064-MS-583', 'T-W1064-MS-584', 'T-W1064-MS-585', 'T-W1064-MS-586', 'T-W1064-MS-587', 'T-W1064-MS-588', 'T-W1064-MS-589', 'T-W1064-MS-590', 'T-W1064-MS-591', 'T-W1064-MS-592', 'T-W1064-MS-593', 'T-W1064-MS-594', 'T-W1064-MS-595', 'T-W1064-MS-596', 'T-W1064-MS-597', 'T-W1064-MS-598', 'T-W1064-MS-599', 'T-W1064-MS-600'
I'll begin working on them, if anyone has any info about them, please comment in this bug.
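A list like this is a set difference between the nodes that should exist and the workers currently visible. The following is a minimal sketch, assuming the expected node numbers and the visible worker names are supplied by the caller; `worker_name` and `missing_workers` are illustrative helpers, not existing tooling.

```python
def worker_name(number):
    """Format a node number as a hostname in the T-W1064-MS-NNN style used here."""
    return f"T-W1064-MS-{number:03d}"


def missing_workers(expected_numbers, seen_names):
    """Return the expected workers that are absent from `seen_names`.

    Names in `seen_names` may be short hostnames or FQDNs in any letter case.
    """
    seen = {name.split(".")[0].upper() for name in seen_names}
    return [
        worker_name(n)
        for n in sorted(expected_numbers)
        if worker_name(n) not in seen
    ]
```

For example, `missing_workers(range(17, 21), ["t-w1064-ms-019.mdc1.mozilla.com"])` returns the names for 017, 018 and 020.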
Comment 46•7 years ago
Update for machines above:
moon-chassis-2
'T-W1064-MS-064',- Rebooted, reachable, took job and finished it as completed
'T-W1064-MS-065',- Rebooted, reachable, took job and finished it as completed
moon-chassis-3
'T-W1064-MS-106',- Rebooted, reachable, took job and finished it as completed
'T-W1064-MS-112',- Rebooted, reachable, took job and finished it as completed
'T-W1064-MS-116' - Rebooted, reachable, took job and it's running it
'T-W1064-MS-126',- Rebooted but unreachable
'T-W1064-MS-130',- Rebooted, but it starts with disk checking and a blue screen; after a few retries I turned it off.
'T-W1064-MS-134',- Rebooted, reachable, took job and finished it as completed
moon-chassis-4
'T-W1064-MS-151',- Rebooted, reachable
'T-W1064-MS-156',- Rebooted, reachable, took job and finished it as completed
'T-W1064-MS-168',- Tried to reimage it but failed to do it. see screenshot (Screenshot_2018-07-04_06-31-42_fail_T-W1064-MS-168.png)
moon-chassis-5
'T-W1064-MS-222',- Rebooted, reachable but not appeared in taskcluster, initiated a reimage on it and reimaged successfully
moon-chassis-6
'T-W1064-MS-261',- Rebooted, reachable
moon-chassis-7
'T-W1064-MS-281',- at boot it stays at booting from network and I cannot make it boot from SSD
moon-chassis-8
'T-W1064-MS-316',- Runs XenServer
'T-W1064-MS-325',- Begin reimage on it
'T-W1064-MS-338',- Begin reimage on it
Those from moon-chassis-8 were randomly chosen
Also from the last chassis:
'T-W1064-MS-600',- Runs XenServer
Comment 47•7 years ago
Comment 48•7 years ago
Went for 3 sessions of checks/cold boots for the Windows moonshots. Performed a cold boot on the following: 017, 018, 020, 023, 026, 029, 034, 035, 041, 042, 044, 045, 068, 072, 077, 090, 111, 114, 130, 135, 162, 165, 281, 288, 292, 294. Out of the list the following recovered: 023, 026, 029, 034, 041, 044, 045, 068, 072, 165, 288, 292, 294. The following need a later recheck/reimage: 017, 018, 020, 035, 042, 044, 072, 077, 090, 111, 114, 130, 135, 162, 281.
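The three lists above can be cross-checked mechanically. The sketch below uses the numbers exactly as posted; run as written, it reports 044 and 072 in both the recovered and the recheck list, and every cold-booted node in at least one of the two. The helper names `parse_ids` and `cross_check` are illustrative.

```python
def parse_ids(text):
    """Split a comma-separated list of node numbers, tolerating stray spaces."""
    return {part.strip() for part in text.split(",") if part.strip()}


BOOTED = parse_ids(
    "017, 018, 020, 023, 026, 029, 034, 035, 041, 042, 044, 045, 068, 072, "
    "077, 090, 111, 114, 130, 135, 162, 165, 281, 288, 292, 294"
)
RECOVERED = parse_ids(
    "023, 026, 029, 034, 041, 044, 045, 068, 072, 165, 288, 292, 294"
)
RECHECK = parse_ids(
    "017, 018, 020, 035, 042, 044, 072, 077, 090, 111, 114, 130, 135, 162, 281"
)


def cross_check(booted, recovered, recheck):
    """Report overlaps and gaps between the three status lists."""
    return {
        # Nodes listed as both recovered and needing a recheck.
        "in_both": sorted(recovered & recheck),
        # Cold-booted nodes that ended up in neither list.
        "unaccounted": sorted(booted - (recovered | recheck)),
        # Nodes with a status that were never in the cold-boot list.
        "not_booted": sorted((recovered | recheck) - booted),
    }


if __name__ == "__main__":
    print(cross_check(BOOTED, RECOVERED, RECHECK))
```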
Comment 49•7 years ago
T-W1064-MS-017.releng.mdc1.mozilla.com kept rebooting itself after Windows started, reimaged, OK.
T-W1064-MS-018.releng.mdc1.mozilla.com booted into Windows with administrator login, NIC configuration error during re-image, cannot reimage
T-W1064-MS-024.releng.mdc1.mozilla.com reboot, re-imaged, OK
T-W1064-MS-035.releng.mdc1.mozilla.com in TC, no jobs taken for 3 days, reboot, reimage, not in TC anymore
T-W1064-MS-038.releng.mdc1.mozilla.com in TC, no jobs taken for 1 day, reboot, OK.
T-W1064-MS-042.releng.mdc1.mozilla.com not in TC, reboot, reimage, OK
T-W1064-MS-073.releng.mdc1.mozilla.com in TC, no jobs taken for 1 day, reboot, OK.
T-W1064-MS-074.releng.mdc1.mozilla.com in TC, no jobs taken for 2 days, reboot, reimage, OK
T-W1064-MS-077.releng.mdc1.mozilla.com in TC, no jobs taken for 2 days, reboot, reimage, in TC waiting for tasks
T-W1064-MS-080.releng.mdc1.mozilla.com in TC, no jobs taken for 2 days, reboot, reimage, in TC waiting for tasks
Comment 50•7 years ago
T-W1064-MS-019.releng.mdc1.mozilla.com not in TC, reboot, OK.
T-W1064-MS-020.releng.mdc1.mozilla.com not in TC, reboot, reimage, OK, tasks running fine.
T-W1064-MS-035.releng.mdc1.mozilla.com reboot, reimage, still not in TC
T-W1064-MS-036.releng.mdc1.mozilla.com not in tc,no jobs for 2 days, reboot, reimage, OK, tasks running fine.
T-W1064-MS-037.releng.mdc1.mozilla.com not in tc, no jobs for 1 day, reboot, reimage, ok, running jobs
T-W1064-MS-043.releng.mdc1.mozilla.com not in tc, no jobs for 2 days, reboot, ok, jobs running.
T-W1064-MS-065.releng.mdc1.mozilla.com not in tc, no jobs for 1 day, reboot, reimage, not in TC anymore
T-W1064-MS-071.releng.mdc1.mozilla.com not in tc, no jobs for 2 days, reboot, reimage, not in TC anymore
T-W1064-MS-072.releng.mdc1.mozilla.com not in tc, no jobs for 2 days, reboot, reimage, not in TC anymore
T-W1064-MS-083.releng.mdc1.mozilla.com not in tc, no jobs for 2 days, reboot, reimage, not in TC anymore
I noticed an image rename from Windows 10 to Windows 10 -1. Are there some changes going on?
The above servers were reimaged with Windows 10 generic 10 (19-43) and with Windows 10 -1 (65-83).
Comment 51•7 years ago
We should be using the same image on all systems; I'm not sure what the new image is, so until we say otherwise please don't use it.
Mark/Q: what IS that new image?
Flags: needinfo?(q)
Flags: needinfo?(mcornmesser)
Comment 52•7 years ago
Those two images have been around for a while; there are no new images, as the task IDs are the same. The gw 10 is generic worker 10, which in all likelihood should not have been done as an image but as an OCC switch.
Flags: needinfo?(q)
Comment 53•7 years ago
Ok, so anything that was imaged with "Windows 10 generic 10" needs to be quarantined and re-imaged.
Flags: needinfo?(ciduty)
Comment 54•7 years ago
I'm going to reimage those machines that were reimaged with "Windows 10 generic 10" on my ongoing shift.
:fubar, is there a way to find out if we have any other machines reimaged with "Windows 10 generic 10" besides those that were mentioned above by :arny?
Flags: needinfo?(ciduty)
Comment 55•7 years ago
Great question, and I don't know the answer; check in with :markco - he's still investigating a couple other install issues that came up out of this, and we might need to wait until those are sorted. but he's got all of the details atm.
Assignee | ||
Comment 56•7 years ago
Because of differences in functionality and configuration needs, there are 2 separate task sequences. GW 8 creates a single user that performs tasks. GW 10 is a service that creates individual user environments and performs tasks. In addition, the config file, which is currently deployed from MDT, is different. Because of this, the cleanest way to go about it was to have a separate image. Moving forward, if GW does not have a significant change in functionality, then we will not need separate images for it.
As noted in other bugs and in the spreadsheet attached to this bug, there is a GW 10 pool of ms-016 through ms-045. That is being used to debug and catch any other unknown issues with the current configuration and GW 10.
I have not seen the Windows 10 -1 task sequence name. If this pops back up again, could you take a screenshot please?
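Whether a given node belongs to the GW 10 pool can be checked from its hostname. The following is a minimal sketch, assuming the pool is exactly ms-016 through ms-045 as stated above and that hostnames follow the `T-W1064-MS-NNN` pattern; `node_number` and `in_gw10_pool` are hypothetical helpers.

```python
import re

# ms-016 through ms-045, inclusive, as described in the comment above.
GW10_POOL = range(16, 46)


def node_number(hostname):
    """Extract the node number from a T-W1064-MS-NNN hostname (FQDN allowed)."""
    match = re.match(r"t-w1064-ms-(\d+)", hostname, re.IGNORECASE)
    if not match:
        raise ValueError(f"unexpected hostname: {hostname!r}")
    return int(match.group(1))


def in_gw10_pool(hostname):
    """Return True if the node number falls inside the GW 10 test pool."""
    return node_number(hostname) in GW10_POOL
```

A node outside this range that was imaged with the GW 10 task sequence would then stand out as a candidate for re-imaging.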
Flags: needinfo?(mcornmesser)
Comment 57•7 years ago
Reimaged all servers that:
- Were reinstalled from the 4th to date.
- Were missing in TC
Reimage successful on: 064, 065, 071, 072, 074, 078, 080, 083, 090, 106, 107, 110, 111, 112, 114, 116, 124, 126, 128, 132, 151, 153, 165, 166, 168, 169, 222, 256, 260, 261
HPE Restful API missing / Intelligent provisioning: 085
No video/image: 130, 134, 135, 154, 156, 158, 162
Stuck at Loading: 262
Stuck at PXE: 281
Comment 58•7 years ago
As a follow-up to the above mentioned machines:
They were checked and reimaged (if needed) along with 066, 109, 134, 135, 154, 156, 158, 164, 262. These are all taking jobs now except:
071,072 still missing from TC after reimage. 222 is in quarantine.
262 doesn't even load BIOS, looks to be offline entirely. Also 130 is still broken, see Bug 1463754.
Assignee | ||
Comment 59•7 years ago
I checked 071 and 072. They had multiple drives mounted. I have kicked off a new install.
I was able to connect to 085 and kicked off a new install.
130 updated bug with current behavior. Will include in request for support from HP.
262 Appears to be up and waiting for a task:
https://papertrailapp.com/groups/1958653/events?focus=952660421653987342&q=ms-262&selected=952660421653987342
Jul 08 10:00:27 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:00:26 No task claimed...#015
Jul 08 10:01:28 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:01:27 No task claimed...#015
Jul 08 10:02:29 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:02:28 No task claimed...#015
Jul 08 10:03:30 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:03:29 No task claimed...#015
Jul 08 10:04:31 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:04:30 No task claimed...#015
Jul 08 10:05:32 T-W1064-MS-262.mdc1.mozilla.com generic-worker: 2018/07/08 17:05:31 No task claimed...#015
281 was attempting PXE boot over IPv6. Previously, shutting down the node, waiting hours, and then attempting to reimage resolved the issue. The node is currently shut down.
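The steady cadence of the `No task claimed...` lines for 262 is what suggests the worker is alive and polling. The following is a minimal sketch that measures the gap between consecutive entries, using the worker's own timestamp inside each message; `poll_gaps` is an illustrative name, and the line format is assumed to match the excerpt above.

```python
import re
from datetime import datetime

# Captures the worker-side timestamp, e.g. "2018/07/08 17:00:26".
STAMP_RE = re.compile(
    r"generic-worker: (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) No task claimed"
)


def poll_gaps(lines):
    """Return the seconds between consecutive 'No task claimed' log entries."""
    stamps = []
    for line in lines:
        match = STAMP_RE.search(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]
```

Applied to the six lines quoted for 262, every gap comes out at 61 seconds; a much larger gap, or no recent entry at all, would point to the lock-up this bug tracks.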
Comment 60•7 years ago
(In reply to Kendall Libby [:fubar] from comment #53)
> Ok, so anything that was imaged with "Windows 10 generic 10" needs to be
> quarantined and re-imaged.
From what I know, the Windows 10 generic 10 image should be used only on MS16-45; as I did the reimages, those are needed for testing. :markco can you confirm?
Flags: needinfo?(mcornmesser)
Assignee | ||
Comment 61•7 years ago
That has been my plan thus far, and most of those are up and in Taskcluster currently, except for 18 and 35, which are having PXE boot issues, and 23, which may have a different hardware issue.
Flags: needinfo?(mcornmesser)
Comment 62•7 years ago
T-W1064-MS-018.releng.mdc1.mozilla.com PXE issue
T-W1064-MS-035.releng.mdc1.mozilla.com PXE issue
T-W1064-MS-065.releng.mdc1.mozilla.com no jobs for 1 day, reboot, reimage, reboot, still no jobs, not in TC anymore. Reimage
After second reimage, there is no space :|
Jul 09 07:31:26 T-W1064-MS-065.mdc1.mozilla.com OpenCloudConfig: Current available disk space CRITCAL 0% free. Will not start Generic-Worker!#015
T-W1064-MS-071.releng.mdc1.mozilla.com not in TC, reboot, reimage, reboot, still not in TC. PP error:
Jul 09 05:51:58 T-W1064-MS-071.mdc1.mozilla.com Microsoft-Windows-DNS-Client: The system failed to register host (A or AAAA) resource records (RRs) for network adapter with settings: Adapter Name : {103341C9-EE3A-4DBF-BBE8-13E74A6368AA} Host Name : T-W1064-MS-071 Primary Domain Suffix : mdc1.mozilla.com DNS server list : #01110.48.75.120, 10.50.75.120 Sent update to server : <?> IP Address(es) : 10.49.40.42 The reason the system could not register these RRs during the update request was because of a system problem. You can manually retry DNS registration of the network adapter and its settings by typing 'ipconfig /registerdns' at the command prompt. If problems still persist, contact your DNS server or network systems administrator. See event details for specific error code information.#015
T-W1064-MS-072.releng.mdc1.mozilla.com not in TC, reboot, reimage, reboot, still not in TC: PP error:
Jul 09 05:52:10 T-W1064-MS-072.mdc1.mozilla.com Microsoft-Windows-DNS-Client: The system failed to register host (A or AAAA) resource records (RRs) for network adapter with settings: Adapter Name : {A56E9F32-310E-4A61-9453-20D41890DDE2} Host Name : T-W1064-MS-072 Primary Domain Suffix : mdc1.mozilla.com DNS server list : #01110.48.75.120, 10.50.75.120 Sent update to server : <?> IP Address(es) : 10.49.40.43 The reason the system could not register these RRs during the update request was because of a system problem. You can manually retry DNS registration of the network adapter and its settings by typing 'ipconfig /registerdns' at the command prompt. If problems still persist, contact your DNS server or network systems administrator. See event details for specific error code information.#015
T-W1064-MS-083.releng.mdc1.mozilla.com no jobs for 20h, reboot, reimage, running jobs
T-W1064-MS-111.releng.mdc1.mozilla.com no jobs for 1 day, reboot, reimage, running jobs
T-W1064-MS-118.releng.mdc1.mozilla.com not in TC, reboot, back in TC, waiting jobs, running jobs
T-W1064-MS-119.releng.mdc1.mozilla.com not in TC, reboot, reimage, running jobs
T-W1064-MS-121.releng.mdc1.mozilla.com not in TC, reboot, reimage, running jobs
T-W1064-MS-125.releng.mdc1.mozilla.com not in TC, reboot, not in TC, reimage. Back in TC, waiting jobs.
Here are some PP errors that I captured during the reimage.
Jul 09 07:44:57 T-W1064-MS-125.mdc1.mozilla.com Microsoft-Windows-DSC: Job {8BAAAEA5-8386-11E8-8A1C-F40343DF3195} : This event indicates that failure happens when LCM is processing the configuration. Error Id is 0x1. Error Detail is The SendConfigurationApply function did not succeed.. Resource Id is [Script]FirewallRule_ICMPv6In and Source Info is C:\windows\TEMP\xDynamicConfig.ps1::583::9::Script. Error Message is PowerShell DSC resource MSFT_ScriptResource failed to execute Set-TargetResource functionality with error message: Error formatting a string: Index (zero based) must be greater than or equal to zero and less than the size of the argument list.. .#015
Jul 09 07:44:57 T-W1064-MS-125.mdc1.mozilla.com Microsoft-Windows-DSC: Job {8BAAAEA5-8386-11E8-8A1C-F40343DF3195} : MIResult: 1 Error Message: PowerShell DSC resource MSFT_ScriptResource failed to execute Set-TargetResource functionality with error message: Error formatting a string: Index (zero based) must be greater than or equal to zero and less than the size of the argument list.. Message ID: ProviderOperationExecutionFailure Error Category: 7 Error Code: 1 Error Type: MI#015
Jul 09 07:48:28 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: Run-RemoteDesiredStateConfig :: end#015
Jul 09 07:48:28 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: C:\dsc\rundsc.ps1 deleted.#015
Jul 09 07:48:28 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: C:\dsc\rundsc.ps1 downloaded.#015
Jul 09 07:48:28 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: scheduled task: RunDesiredStateConfigurationAtStartup, created.#015
Jul 09 07:48:29 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: log archive C:\log\20180709144311.userdata-run.zip created.#015
Jul 09 07:48:29 T-W1064-MS-125.mdc1.mozilla.com Microsoft-Windows-DSC: RESULT 1 [NXLOG@14506 Keywords="4611686018427387904" EventType="ERROR" EventID="4252" ProviderGuid="{50DF9E12-A8C4-4939-B281-47E1325BA63E}" Version="0" Task="0" OpcodeValue="0" RecordNumber="62" ActivityID="{1E08B3DF-1793-0004-4BCF-081E9317D401}" ThreadID="5528" Channel="Microsoft-Windows-DSC/Operational" Domain="NT AUTHORITY" AccountName="SYSTEM" UserID="S-1-5-18" AccountType="User" Opcode="Info" JobId="{8BAAAEA5-8386-11E8-8A1C-F40343DF3195}" MIResult="1" ErrorMessage="The SendConfigurationApply function did not succeed." ErrorCategory="0" ErrorCode="1" ErrorType="MI" EventReceivedTime="2018-07-09 14:48:28" SourceModuleName="eventlog" SourceModuleType="im_msvistalog"] Job {8BAAAEA5-8386-11E8-8A1C-F40343DF3195} : MIResult: 1 Error Message: The SendConfigurationApply function did not succeed. Message ID: MI RESULT 1 Error Category: 0 Error Code: 1 Error Type: MI#015
Jul 09 07:53:31 T-W1064-MS-125.mdc1.mozilla.com Microsoft-Windows-DSC: RESULT 1 [NXLOG@14506 Keywords="4611686018427387904" EventType="ERROR" EventID="4252" ProviderGuid="{50DF9E12-A8C4-4939-B281-47E1325BA63E}" Version="0" Task="0" OpcodeValue="0" RecordNumber="105" ActivityID="{519CFE68-1794-0004-2328-9D519417D401}" ThreadID="996" Channel="Microsoft-Windows-DSC/Operational" Domain="NT AUTHORITY" AccountName="SYSTEM" UserID="S-1-5-18" AccountType="User" Opcode="Info" JobId="{C12CFCC9-8387-11E8-8A1D-F40343DF3195}" MIResult="1" ErrorMessage="The SendConfigurationApply function did not succeed." ErrorCategory="0" ErrorCode="1" ErrorType="MI" EventReceivedTime="2018-07-09 14:53:29" SourceModuleName="eventlog" SourceModuleType="im_msvistalog"] Job {C12CFCC9-8387-11E8-8A1D-F40343DF3195} : MIResult: 1 Error Message: The SendConfigurationApply function did not succeed. Message ID: MI RESULT 1 Error Category: 0 Error Code: 1 Error Type: MI#015
Jul 09 07:53:31 T-W1064-MS-125.mdc1.mozilla.com Microsoft-Windows-DSC: Job {C12CFCC9-8387-11E8-8A1D-F40343DF3195} : Details logging completed for C:\windows\System32\Configuration\ConfigurationStatus\{C12CFCC9-8387-11E8-8A1D-F40343DF3195}-0.details.json.#015
Jul 09 07:53:31 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: C:\dsc\rundsc.ps1 deleted.#015
Jul 09 07:53:31 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: C:\dsc\rundsc.ps1 downloaded.#015
Jul 09 07:53:31 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: scheduled task: RunDesiredStateConfigurationAtStartup, created.#015
Jul 09 07:53:32 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: log archive C:\log\20180709145149.userdata-run.zip created.#015
Jul 09 07:53:32 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: generic-worker installation detected.#015
Jul 09 07:53:32 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: generic-worker running process detected 12 ms after task-claim-state.valid flag set.#015
Jul 09 07:53:32 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: process priority for generic worker altered from Normal to AboveNormal.#015
Jul 09 07:53:32 T-W1064-MS-125.mdc1.mozilla.com OpenCloudConfig: userdata run completed#015
Jul 09 07:53:53 T-W1064-MS-125.mdc1.mozilla.com generic-worker: 2018/07/09 14:53:51 No task claimed...#015
T-W1064-MS-129.releng.mdc1.mozilla.com no jobs for 2 days, reboot. running jobs
T-W1064-MS-130.releng.mdc1.mozilla.com shut down
T-W1064-MS-162.releng.mdc1.mozilla.com not in TC, reboot, reimage, running jobs
T-W1064-MS-217.releng.mdc1.mozilla.com no jobs for 1 day, reboot, running jobs
T-W1064-MS-222.releng.mdc1.mozilla.com no jobs for 4 days, reboot, reimage, running jobs
T-W1064-MS-281.releng.mdc1.mozilla.com Stuck at PXE: 281
Comment 63•7 years ago
|
||
The dynamic registration errors can be ignored: we don't use DDNS in our infra, and DNS is statically assigned. We should patch OCC to disable DNS registration to lower log and network noise, but it causes no actual problem.
Comment 64•7 years ago
|
||
T-W1064-MS-023 reimage, running jobs.
T-W1064-MS-064 not in TC, reboot, running jobs
T-W1064-MS-065, 71, 72 reimage two times, no space after reimage. Opened Bug 1474578
T-W1064-MS-066 not in TC, reboot, running jobs.
T-W1064-MS-152 not in TC - no jobs for 1 day, reboot, running jobs
Comment 65•7 years ago
|
||
The following workers were in TC and lazy (5h+ since last job):
T-W1064-MS-{61,64,69,76,77,85,87,109,110,111,116,120,124,127,129,131,134,154,156,159,161,166,169,178,282}
Of which I had to reimage: T-W1064-MS-{69,76,77,109,116,120,124,127,129,134,156,159,161,166,169,178,282}
All of them started working after a reimage.
The ones that weren't reimaged started working after a reboot. All of them have since taken jobs.
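The two brace lists above can be cross-checked mechanically. The following is a minimal sketch, not part of any tooling used in this bug: the `expand` helper is hypothetical, and zero-padding the node numbers to three digits is an assumption made to match the host names used elsewhere in the thread.

```python
import re

def expand(pattern: str) -> set:
    """Expand 'PREFIX{a,b,c}' into a set of host names.

    Assumption: node numbers are zero-padded to three digits, as in
    the host names written out in other comments (e.g. T-W1064-MS-061).
    """
    match = re.fullmatch(r"(.*)\{(.*)\}", pattern)
    if match is None:
        return {pattern}
    prefix, body = match.groups()
    return {f"{prefix}{int(number):03d}" for number in body.split(",")}

# Lists copied from the comment above.
lazy = expand(
    "T-W1064-MS-{61,64,69,76,77,85,87,109,110,111,116,120,124,127,"
    "129,131,134,154,156,159,161,166,169,178,282}"
)
reimaged = expand(
    "T-W1064-MS-{69,76,77,109,116,120,124,127,129,134,156,159,161,"
    "166,169,178,282}"
)

# Workers that were lazy but recovered with a reboot alone.
reboot_only = sorted(lazy - reimaged)
print(len(lazy), len(reimaged), len(reboot_only))
print(reboot_only)
```

Under these assumptions, 8 of the 25 lazy workers recovered with a reboot alone and 17 needed a reimage.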
Comment 66•7 years ago
|
||
:pmoore tells me that we may be inadvertently running generic-worker 8.3.0 on these nodes. That's a really old version, and could have unintended side-effects.
Is there any chance that the work in bug 1443589 is causing these nodes to go offline?
Comment 67•7 years ago
|
||
It's not inadvertent; gw10 has not yet successfully worked on the Moonshot hardware. Bug 1443589 is to get us there.
Comment 68•7 years ago
|
||
T-W1064-MS-131 reimage, running jobs
T-W1064-MS-132 reimage, running jobs
T-W1064-MS-153 reimage, running jobs
T-W1064-MS-156 reimage, running jobs
T-W1064-MS-158 reimage, running jobs
T-W1064-MS-159 reimage, running jobs
T-W1064-MS-160 reimage, running jobs
T-W1064-MS-161 reimage, running jobs
T-W1064-MS-162 reimage, running jobs
T-W1064-MS-163 reimage, running jobs
T-W1064-MS-164 reimage, running jobs
T-W1064-MS-166 reimage, running jobs
T-W1064-MS-169 reimage, running jobs
T-W1064-MS-171 reimage, running jobs
T-W1064-MS-246 reimage, running jobs
T-W1064-MS-254 reimage, running jobs
T-W1064-MS-247 reimage, running jobs
T-W1064-MS-254 reimage, running jobs
T-W1064-MS-262 reimage, running jobs
T-W1064-MS-267 reimage, running jobs
T-W1064-MS-284 reimage, running jobs
Comment 69•7 years ago
|
||
The above machines and also the following ones were missing from TC, they are now back after re-imaging.
T-W1064-MS-061 reimage, running jobs
T-W1064-MS-062 reimage, running jobs
T-W1064-MS-067 reimage, running jobs
T-W1064-MS-076 reimage, running jobs
T-W1064-MS-078 reimage, running jobs
T-W1064-MS-079 reimage, running jobs
T-W1064-MS-081 reimage, running jobs
T-W1064-MS-106 reimage, running jobs
T-W1064-MS-107 reimage, running jobs
T-W1064-MS-108 reimage, running jobs
T-W1064-MS-115 reimage, running jobs
T-W1064-MS-116 reimage, running jobs
T-W1064-MS-119 reimage, running jobs
T-W1064-MS-121 reimage, running jobs
T-W1064-MS-122 reimage, running jobs
T-W1064-MS-124 reimage, running jobs
T-W1064-MS-126 reimage, running jobs
T-W1064-MS-127 reimage, running jobs
The following machines were lazy (alive in TC but failing to pick up jobs):
T-W1064-MS-066 reimage, running jobs
T-W1064-MS-075 reimage, running jobs
T-W1064-MS-077 reimage, running jobs
T-W1064-MS-086 reimage, running jobs
T-W1064-MS-088 reimage, running jobs
T-W1064-MS-090 reimage, running jobs
T-W1064-MS-111 reimage, running jobs
T-W1064-MS-112 reimage, running jobs
T-W1064-MS-120 reimage, running jobs
T-W1064-MS-151 reimage, running jobs
T-W1064-MS-154 reimage, running jobs
T-W1064-MS-177 reimage, running jobs
T-W1064-MS-200 reimage, running jobs
T-W1064-MS-216 reimage, running jobs
T-W1064-MS-217 reimage, running jobs
T-W1064-MS-241 reimage, running jobs
T-W1064-MS-258 reimage, running jobs
T-W1064-MS-285 reimage, running jobs
T-W1064-MS-294 reimage, running jobs
The following machines are the ones we know can not be fixed by re-imaging:
T-W1064-MS-065, T-W1064-MS-071, T-W1064-MS-072, T-W1064-MS-130, T-W1064-MS-281.
Assignee | ||
Comment 70•7 years ago
|
||
> :pmoore tells me that we may be inadvertently running generic-worker 8.3.0
> on these nodes. That's a really old version, and could have unintended
> side-effects.
This looks like it is unrelated to the generic-worker version. 13 out of the 30 nodes running generic-worker 10.8.5 have stopped picking up tasks in the past 20 hours.
Comment 71•7 years ago
|
||
(In reply to Zsolt Fay [:zsoltfay] from comment #69)
> The following machines were lazy (alive in TC but failing to pick up jobs):
I've created bug 1475711 for recording some details about workers not taking work when there are jobs in the queue. If we collect details about the times and which workers, then we can ask the Taskcluster team to investigate if there is a problem in the api or queue.
Updated•7 years ago
|
Blocks: t-w1060-ms-071
Updated•7 years ago
|
Blocks: t-w1060-ms-261
Updated•7 years ago
|
Blocks: t-w1060-ms-291
Comment 72•7 years ago
|
||
Re-imaged the following machines because they were lazy and haven't recovered after a reboot:
T-W1064-MS-107 T-W1064-MS-162 T-W1064-MS-169
T-W1064-MS-110 T-W1064-MS-164 T-W1064-MS-294
T-W1064-MS-127 T-W1064-MS-165
T-W1064-MS-134 T-W1064-MS-167
T-W1064-MS-161 T-W1064-MS-168
Re-imaged the following machines because they were missing from TC:
T-W1064-MS-065 T-W1064-MS-076 T-W1064-MS-077
T-W1064-MS-082 T-W1064-MS-086 T-W1064-MS-088
T-W1064-MS-106 T-W1064-MS-108 T-W1064-MS-111
T-W1064-MS-114 T-W1064-MS-124 T-W1064-MS-125
T-W1064-MS-126 T-W1064-MS-131 T-W1064-MS-132
T-W1064-MS-151 T-W1064-MS-152 T-W1064-MS-155
T-W1064-MS-156 T-W1064-MS-160 T-W1064-MS-171
T-W1064-MS-209 T-W1064-MS-217 T-W1064-MS-222
T-W1064-MS-247 T-W1064-MS-258 T-W1064-MS-267
Worth noting that T-W1064-MS-134 isn't taking jobs and that T-W1064-MS-110 is performing very slowly and is encountering errors at BIOS when booting.
Comment 73•7 years ago
|
||
065-missing
077-missing
126-missing
132-missing
134 ?? had tasks resolved a day ago, I don't think it took tasks all day.
155 ?? had a task completed 13h ago
076-running (took jobs and completed them all day)
082-running (took jobs and completed them all day)
086-running (took jobs and completed them all day)
088-running (took jobs and completed them all day)
106-running (took jobs and completed them all day)
108-running (took jobs and completed them all day)
111-running (took jobs and completed them all day)
114-running (took jobs and completed them all day)
124-running (took jobs and completed them all day)
125-running (took jobs and completed them all day)
131-running (took jobs and completed them all day)
151-running (took jobs and completed them all day)
152-running (took jobs and completed them all day)
156-running (took jobs and completed them all day)
160-running (took jobs and completed them all day)
171-running (took jobs and completed them all day)
247-running (took jobs and completed them all day)
258-running (took jobs and completed them all day)
267-running (took jobs and completed them all day)
Comment 74•7 years ago
|
||
T-W1064-MS-063 - re-imaged and took jobs
T-W1064-MS-068 - re-imaged and took jobs
T-W1064-MS-070 - re-imaged and took jobs
T-W1064-MS-073 - re-imaged and took jobs
T-W1064-MS-081 - re-imaged and took jobs
T-W1064-MS-108 - re-imaged and took jobs
T-W1064-MS-110 - re-imaged and took jobs
T-W1064-MS-111 - re-imaged and took jobs
T-W1064-MS-112 - re-imaged and took jobs
T-W1064-MS-113 - re-imaged and took jobs
T-W1064-MS-116 - re-imaged and took jobs
T-W1064-MS-117 - re-imaged and took jobs
T-W1064-MS-118 - re-imaged and took jobs
T-W1064-MS-120 - re-imaged and took jobs
T-W1064-MS-126 - re-imaged and took jobs
T-W1064-MS-132 - re-imaged and took jobs
T-W1064-MS-133 - re-imaged and took jobs
T-W1064-MS-152 - re-imaged and took jobs
T-W1064-MS-160 - re-imaged and took jobs
T-W1064-MS-210 - re-imaged and took jobs
T-W1064-MS-216 - re-imaged and took jobs
T-W1064-MS-217 - re-imaged and took jobs
T-W1064-MS-245 - re-imaged and took jobs
T-W1064-MS-246 - re-imaged and took jobs
T-W1064-MS-256 - re-imaged and took jobs
T-W1064-MS-294 - re-imaged and took jobs
Comment 75•7 years ago
|
||
Re-imaged the missing win10 machines:
T-W1064-MS-064 T-W1064-MS-087 T-W1064-MS-169
T-W1064-MS-066 T-W1064-MS-090 T-W1064-MS-205
T-W1064-MS-067 T-W1064-MS-127 T-W1064-MS-222
T-W1064-MS-068 T-W1064-MS-129 T-W1064-MS-247
T-W1064-MS-076 T-W1064-MS-154 T-W1064-MS-260
T-W1064-MS-078 T-W1064-MS-157 T-W1064-MS-284
T-W1064-MS-082 T-W1064-MS-162
T-W1064-MS-084 T-W1064-MS-166
T-W1064-MS-086 T-W1064-MS-168
These still need to be checked ^
Flags: needinfo?(ciduty)
Comment 76•7 years ago
|
||
I can confirm that all of the above are once again in TC except T-W1064-MS-127 and T-W1064-MS-247.
Will take actions for those 2 and the following:
T-W1064-MS-065
T-W1064-MS-070
T-W1064-MS-077
T-W1064-MS-088
T-W1064-MS-131
T-W1064-MS-156
T-W1064-MS-241
T-W1064-MS-256
T-W1064-MS-262
T-W1064-MS-285
T-W1064-MS-294
Will comment later with the results.
Flags: needinfo?(ciduty)
Comment 77•7 years ago
|
||
The following machines got back into TC after a reboot:
T-W1064-MS-070
T-W1064-MS-088
T-W1064-MS-131
T-W1064-MS-256
T-W1064-MS-120
T-W1064-MS-210
T-W1064-MS-217
T-W1064-MS-246
T-W1064-MS-267
The following have been reimaged for not recovering after reboot:
T-W1064-MS-077
T-W1064-MS-127
T-W1064-MS-156
T-W1064-MS-241
T-W1064-MS-247
T-W1064-MS-262
T-W1064-MS-294
Reimaged ones still need a TaskCluster check.
Flags: needinfo?(ciduty)
Comment 78•7 years ago
|
||
All of them took tasks and completed them without problems except for T-W1064-MS-077, this one doesn't appear in TaskCluster.
I'll look into it and come back with more details.
Flags: needinfo?(ciduty)
Comment 79•7 years ago
|
||
Below are the machines that were rebooted and are taking jobs:
T-W1064-MS-063
T-W1064-MS-074
T-W1064-MS-082
T-W1064-MS-108
T-W1064-MS-110
T-W1064-MS-222
The following machine has been rebooted and is now available on Task Cluster, but it hasn't received any job yet:
T-W1064-MS-285
Here is a list of the machines that were rebooted and reimaged and are now available on Task Cluster, split by whether they have taken jobs yet:
available, but no job yet: T-W1064-MS-061, T-W1064-MS-077, T-W1064-MS-115, T-W1064-MS-153, T-W1064-MS-161.
available and taking jobs: T-W1064-MS-062, T-W1064-MS-079, T-W1064-MS-107, T-W1064-MS-118.
Updated•6 years ago
|
Blocks: T-W1064-MS-069
Updated•6 years ago
|
Blocks: T-W1064-MS-075
Updated•6 years ago
|
Blocks: T-W1064-MS-078
Updated•6 years ago
|
Blocks: T-W1064-MS-080
Updated•6 years ago
|
Blocks: T-W1064-MS-086
Comment 80•6 years ago
|
||
The following were re-imaged and are now available on Task Cluster and taking jobs:
T-W1064-MS-061
T-W1064-MS-106
T-W1064-MS-111
T-W1064-MS-112
T-W1064-MS-113
T-W1064-MS-115
T-W1064-MS-116
T-W1064-MS-118
T-W1064-MS-120
T-W1064-MS-121
T-W1064-MS-125
T-W1064-MS-128
T-W1064-MS-129
T-W1064-MS-131
T-W1064-MS-133
T-W1064-MS-134
T-W1064-MS-135
T-W1064-MS-155
T-W1064-MS-169
T-W1064-MS-200
T-W1064-MS-206
T-W1064-MS-217
T-W1064-MS-254
T-W1064-MS-258
T-W1064-MS-262
Comment 81•6 years ago
|
||
The following workers have been re-imaged today:
T-W1064-MS-066
T-W1064-MS-079
T-W1064-MS-109
T-W1064-MS-110
T-W1064-MS-156
T-W1064-MS-159
T-W1064-MS-161
T-W1064-MS-168
T-W1064-MS-198
T-W1064-MS-212
T-W1064-MS-241
T-W1064-MS-247
T-W1064-MS-250
T-W1064-MS-251
T-W1064-MS-282
T-W1064-MS-286
All have since taken jobs except 250. See Bug 1479187
Comment 82•6 years ago
|
||
T-W1064-MS-064 no jobs for 1 day, reboot, jobs running
T-W1064-MS-127 no jobs for 1 day, reboot, jobs running
T-W1064-MS-212 no jobs for 1 day, reboot, reimage, jobs running
T-W1064-MS-222 no jobs for 1 day, reboot, reimage, jobs running
T-W1064-MS-285 no jobs for 1 day, reboot, reimage, jobs running
T-W1064-MS-324 not in TC, reboot, running jobs
T-W1064-MS-335 not in TC, reboot, reimage, see next comment.
Comment 83•6 years ago
|
||
I have re-imaged this machine twice today, but after the Windows installation it got stuck at the "Task sequence". I selected win 10 GW 10 once again and it finished the install, but it does not appear in TC.
Comment 84•6 years ago
|
||
T-W1064-MS-125: no jobs for 1 day, reboot, running jobs
T-W1064-MS-156: no jobs for 1 day, reboot, running jobs
T-W1064-MS-169: no jobs for 1 day, reboot, reimage, running jobs
T-W1064-MS-260: no jobs for 1 day, reboot, reimage, running jobs
T-W1064-MS-267: no jobs for 1 day, reboot, running jobs
Comment 85•6 years ago
|
||
T-W1064-MS-080: no jobs for one day, reboot, reimage, running jobs
T-W1064-MS-089: no jobs for one day, reboot, reimage, running jobs
T-W1064-MS-107: no jobs for one day, reboot, reimage, running jobs
T-W1064-MS-198: no jobs for one day, reboot, running jobs
T-W1064-MS-162: no jobs for one day, reboot, reimage, running jobs
Comment 86•6 years ago
|
||
When these servers have done "no jobs for one day", is the worker process still running, and what was the logged end for the last task (timeout? like we see sometimes in OSX), and what is in the logs? (if the worker process is running, is it polling for work or giving warnings?)
After reboot, how do you determine that it needs to be reimaged?
Comment 87•6 years ago
|
||
After the reboot, in the logs I see many lines like the one below, which just go on forever. After 30-45 mins I reimage them.
generic-worker: 2018/08/01 13:18:05 No task claimed...#015
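The rule of thumb described above (reimage when, 30-45 minutes after a reboot, the worker has logged nothing but "No task claimed...") could be sketched roughly as follows. This is only an illustration of that rule: `needs_reimage`, its parameters, and the 45-minute default are hypothetical names and values, and the sketch simply treats any other generic-worker message as a reason to look at the node by hand rather than reimage it.

```python
import re
from datetime import datetime, timedelta

# Matches generic-worker's own timestamp and message, with or without
# the trailing "#015" (carriage return) that appears in these logs.
LINE = re.compile(
    r"generic-worker: (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*?)(?:#015)?$"
)
IDLE_MESSAGE = "No task claimed..."

def needs_reimage(lines, rebooted_at, now, wait=timedelta(minutes=45)):
    """Return True if the worker only logged the idle message since reboot.

    Assumptions: `rebooted_at` and `now` are in the same time zone as the
    timestamps generic-worker writes, and `wait` is the grace period
    (the comment above mentions 30-45 minutes).
    """
    if now - rebooted_at < wait:
        return False  # too early to judge
    saw_idle = False
    for line in lines:
        match = LINE.search(line)
        if match is None:
            continue  # not a timestamped generic-worker line
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        if stamp < rebooted_at:
            continue  # logged before the reboot
        if match.group(2).strip() != IDLE_MESSAGE:
            return False  # something else was logged; inspect by hand
        saw_idle = True
    return saw_idle

sample = ["generic-worker: 2018/08/01 13:18:05 No task claimed...#015"]
print(needs_reimage(sample,
                    rebooted_at=datetime(2018, 8, 1, 12, 30),
                    now=datetime(2018, 8, 1, 13, 20)))
```

A check like this does not answer the earlier question of why the worker is idle; it only makes the existing manual threshold explicit.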
Comment 88•6 years ago
|
||
T-W1064-MS-168 - just had the same issue, no jobs for one day.
In the logs before and after reboot:
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 Response#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 HTTP/1.1 200 OK#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Content-Length: 0#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Date: Tue, 31 Jul 2018 13:05:50 GMT#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Etag: "52ea05b8431def9a40afcd85b2af1c23"#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Server: AmazonS3#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: X-Amz-Id-2: T/1wOyAuCN2hTlZrWxN7kYnCsN7pXTTVHiDktWQ6N6YW5UAbv9ew5oN/kfEHld0dX2w7cfKQK10=#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: X-Amz-Request-Id: E71719FDE2A62B38#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: X-Amz-Version-Id: k4DcGnPfxH3WsJVrSUabWmu.3UcejhZx#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: #015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 Resolving task...#015
Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 Command finished successfully!#015
Jul 31 06:05:50 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 No previous task user desktop, so no need to close any open desktops#015
Jul 31 06:05:50 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 13:05:49 Trying to remove directory 'C:\Users\task_1533041946' via os.RemoveAll(path) call as GenericWorker user...#015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Graphic Card being used "Intel(R) Iris(R) Pro Graphics P580 " #015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Removing temp dir contents #015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: C:\Users\GenericWorker\AppData\Local\Temp\aria-debug-7960.log#015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Deleted file - C:\Users\GenericWorker\AppData\Local\Temp\livelog807285603\stream#015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Deleted file - C:\ProgramData\Package Cache\{B74E65FD-CC47-41C5-4B89-791A3F61942D}v8.100.25984\Installers\Kits Configuration Installer-x86_en-us.msi#015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Removing log files older than 1 day #015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Removing Windows log files older than 7 days #015
Jul 31 06:05:54 T-W1064-MS-168.mdc1.mozilla.com generic-worker: Removing Recycle.bin contents #015
Jul 31 06:05:56 T-W1064-MS-168.mdc1.mozilla.com User32: The process C:\windows\system32\shutdown.exe (T-W1064-MS-168) has initiated the restart of computer T-W1064-MS-168 on behalf of user T-W1064-MS-168\GenericWorker for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: restart Comment: Rebooting as generic worker ran successfully#015
Jul 31 06:05:59 T-W1064-MS-168.mdc1.mozilla.com Service_Control_Manager: The sshd service terminated unexpectedly. It has done this 1 time(s).#015
no logs for almost 24h.
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: /07/31 14:05:48 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:06:43 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:07:39 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:08:46 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:09:49 Could not claim work. Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:09:49 No task claimed...#015
Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: 2018/07/31 14:09:49 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup queue.taskcluster.net: getaddrinfow: No such host is known.#015
Then after ~ 45 min the server took jobs.
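One way to read the excerpt above is to compare the time Papertrail stamped on each line with the time generic-worker wrote into the message. The sketch below is an illustration only: `delivery_lag` is a hypothetical helper, the year 2018 is taken from the worker timestamps, and the fixed 7-hour offset between the two clocks is an assumption read off the pre-reboot lines (06:05:49 on the left, 13:05:49 in the message).

```python
import re
from datetime import datetime, timedelta

PATTERN = re.compile(
    r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ generic-worker: "
    r"(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)
CLOCK_OFFSET = timedelta(hours=7)  # assumption: log prefix is 7h behind the worker clock
YEAR = 2018                        # assumption: the prefix carries no year
DNS_ERROR = "getaddrinfow: No such host is known"

def delivery_lag(line):
    """Return (lag, is_dns_error) for a timestamped generic-worker line.

    `lag` is how long after the worker wrote the message the line was
    stamped by the log collector. Returns None for lines without a
    worker timestamp (for example HTTP header lines).
    """
    match = PATTERN.match(line)
    if match is None:
        return None
    received = datetime.strptime(
        f"{YEAR} {match.group(1)}", "%Y %b %d %H:%M:%S"
    ) + CLOCK_OFFSET
    emitted = datetime.strptime(match.group(2), "%Y/%m/%d %H:%M:%S")
    return received - emitted, DNS_ERROR in match.group(3)

before = ("Jul 31 06:05:49 T-W1064-MS-168.mdc1.mozilla.com generic-worker: "
          "2018/07/31 13:05:49 Resolving task...#015")
after = ("Aug 01 05:55:59 T-W1064-MS-168.mdc1.mozilla.com generic-worker: "
         "2018/07/31 14:06:43 Error: Post https://queue.taskcluster.net/v1/"
         "claim-work/releng-hardware/gecko-t-win10-64-hw: dial tcp: lookup "
         "queue.taskcluster.net: getaddrinfow: No such host is known.#015")
print(delivery_lag(before))
print(delivery_lag(after))
```

Under these assumptions the pre-reboot line shows no lag, while the DNS error lines were written by the worker almost 23 hours before they were stamped. That would suggest the worker was running and failing name resolution during the gap rather than being hung, though the excerpt alone does not prove it.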
Updated•6 years ago
|
Blocks: T-W1064-MS-118
Updated•6 years ago
|
Blocks: T-W1064-MS-322
Updated•6 years ago
|
Blocks: T-W1064-MS-165
Updated•6 years ago
|
Blocks: T-W1064-MS-241
Updated•6 years ago
|
Blocks: T-W1064-MS-151
Updated•6 years ago
|
Blocks: T-W1064-MS-221
Comment 89•6 years ago
|
||
Looks like there is a common error on the MS win servers, at least on the 3 below:
Microsoft-Windows-DSC: Job DscTimerConsistencyOperationResult : DSC Engine Error : #011 Error Message: NULL #011Error Code : 1 #015
https://bugzilla.mozilla.org/show_bug.cgi?id=1478723
https://bugzilla.mozilla.org/show_bug.cgi?id=1480347
https://bugzilla.mozilla.org/show_bug.cgi?id=1480386
Updated•6 years ago
|
Blocks: T-W1064-MS-070
Updated•6 years ago
|
Blocks: T-W1064-MS-135
Updated•6 years ago
|
Blocks: T-W1064-MS-064
Updated•6 years ago
|
Blocks: T-W1064-MS-163
Updated•6 years ago
|
Blocks: T-W1064-MS-260
Updated•6 years ago
|
Blocks: T-W1064-MS-152
Updated•6 years ago
|
Blocks: T-W1064-MS-316
Updated•6 years ago
|
Blocks: T-W1064-MS-324
Updated•6 years ago
|
Blocks: T-W1064-MS-326
Updated•6 years ago
|
Blocks: T-W1064-MS-334
Updated•6 years ago
|
Blocks: T-W1064-MS-339
Updated•6 years ago
|
Blocks: T-W1064-MS-340
Updated•6 years ago
|
Blocks: T-W1064-MS-109
Updated•6 years ago
|
Blocks: T-W1064-MS-169
Updated•6 years ago
|
Blocks: T-W1064-MS-117
Updated•6 years ago
|
Blocks: T-W1064-MS-076
Updated•6 years ago
|
Blocks: T-W1064-MS-121
Updated•6 years ago
|
Blocks: T-W1064-MS-122
Updated•6 years ago
|
Blocks: T-W1064-MS-132
Updated•6 years ago
|
Blocks: T-W1064-MS-074
Updated•6 years ago
|
Blocks: T-W1064-MS-157
Updated•6 years ago
|
Blocks: T-W1064-MS-114
Updated•6 years ago
|
Blocks: T-W1064-MS-128
Updated•6 years ago
|
Blocks: T-W1064-MS-256
Updated•6 years ago
|
Blocks: T-W1064-MS-258
Updated•6 years ago
|
Blocks: T-W1064-MS-216
Updated•6 years ago
|
Blocks: T-W1064-MS-282
Updated•6 years ago
|
Blocks: T-W1064-MS-107
Updated•6 years ago
|
Blocks: T-W1064-MS-209
Updated•6 years ago
|
Blocks: T-W1064-MS-215
Updated•6 years ago
|
Blocks: T-W1064-MS-247
Updated•6 years ago
|
Blocks: T-W1064-MS-253
Updated•6 years ago
|
Blocks: T-W1064-MS-344
Updated•6 years ago
|
Blocks: T-W1064-MS-087
Updated•6 years ago
|
Blocks: T-W1064-MS-066
Updated•6 years ago
|
Blocks: T-W1064-MS-116
Comment 90•6 years ago
|
||
T-W1064-MS-112 and T-W1064-MS-117 have a non-responsive iLO after logging in with credentials.
Updated•6 years ago
|
Blocks: T-W1064-MS-159
Assignee | ||
Comment 91•6 years ago
|
||
Were these nodes rebooted or reimaged? They seem to be up and working now.
Comment 92•6 years ago
|
||
(In reply to Mark Cornmesser [:markco] from comment #91)
> Were these nodes rebooted or reimaged? They seem to be up and working now.
I've checked the handovers, and I don't see anyone in CiDuty reporting that they worked on the 2 machines.
Comment 93•6 years ago
|
||
I rebooted 112 and 117 yesterday and then they were taking tasks. Both had no tasks for one day.
Comment 94•6 years ago
|
||
Comment 95•6 years ago
|
||
- rebooted,reimaged & take jobs : T-W1064-MS-132, T-W1064-MS-133, T-W1064-MS-134
- rebooted & take jobs : T-W1064-MS-067, T-W1064-MS-071, T-W1064-MS-118, T-W1064-MS-168, T-W1064-MS-196, T-W1064-MS-211, T-W1064-MS-247, T-W1064-MS-262, T-W1064-MS-266, T-W1064-MS-286, T-W1064-MS-296
Updated•6 years ago
|
Blocks: T-W1064-MS-211
Updated•6 years ago
|
Blocks: T-W1064-MS-031
Updated•6 years ago
|
Blocks: T-W1064-MS-061
Updated•6 years ago
|
Blocks: T-W1064-MS-111
Updated•6 years ago
|
Blocks: T-W1064-MS-218
Updated•6 years ago
|
Blocks: T-W1064-MS-220
Updated•6 years ago
|
Blocks: T-W1064-MS-206
Comment 96•6 years ago
|
||
Found a lot of lazy workers in the windows pool. Mark, the number this time is quite considerable. You might be extra interested in this.
Went ahead and rebooted all of the machines below. @CiDuty these all need to be checked, please re-image, check the logs and update their respective bugs for the ones that don't recover after a re-image.
Flags: needinfo?(mcornmesser)
Comment 97•6 years ago
|
||
T-W1064-MS-{021, 024, 027, 031, 033, 034, 035, 037, 037, 039, 041, 043, 045, 062, 064, 066, 071, 072, 078, 079, 088, 089, 090, 110, 116, 117, 120, 121, 122, 125, 126, 127, 129, 131, 132, 133, 134, 152, 158, 160, 164, 165, 167, 169, 174, 176, 180, 198, 199, 202, 204, 206, 210, 212, 215, 216, 218, 222, 242, 244, 246, 247, 249, 253, 256, 259, 264, 268, 269, 270, 281, 282, 283, 287, 295}.
Note, all of the above have had at least 2+ hours since last job taken, with over 75% of them having 10h+.
Flags: needinfo?(ciduty)
Assignee | ||
Comment 98•6 years ago
|
||
Zsoltfay: What was the time frame in which these stopped working? All within the last 2 to 10 hours? Were there groups that stopped around the same time?
Flags: needinfo?(mcornmesser)
Comment 99•6 years ago
|
||
Yes, it was during my shift. We've rebooted all of them ~4 hours ago. A lot of them had 17-18 hours since their last jobs and another cluster had ~2 hours since their last job. Then we've had a few stray ones around the board with 5-7-9-12h.
Assignee | ||
Comment 100•6 years ago
|
||
Do you have a record of which ones were greater than 2 hours?
I am going to go ahead and close this bug. The original issue here has been addressed. I have opened Bug 1490398.
CiDuty: Could you add nodes to that bug that have not picked up tasks in 2.5 or more hours?
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
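The 2.5-hour cutoff requested in comment 100 can be expressed as a simple filter. This is a minimal sketch under stated assumptions: `stale_workers` is a hypothetical helper, and the `last_task` mapping (worker name to the time of its last task) is made-up illustration data, since the thread does not say how that information is collected.

```python
from datetime import datetime, timedelta

THRESHOLD = timedelta(hours=2.5)  # cutoff named in comment 100

def stale_workers(last_task, now, threshold=THRESHOLD):
    """Return the sorted names of workers idle for `threshold` or longer."""
    return sorted(
        worker for worker, seen in last_task.items()
        if now - seen >= threshold
    )

# Hypothetical data for illustration only; not taken from the bug.
now = datetime(2018, 9, 11, 12, 0)
last_task = {
    "T-W1064-MS-021": datetime(2018, 9, 11, 11, 30),  # 0.5h ago
    "T-W1064-MS-024": datetime(2018, 9, 11, 9, 30),   # exactly 2.5h ago
    "T-W1064-MS-027": datetime(2018, 9, 10, 18, 0),   # 18h ago
}
print(stale_workers(last_task, now))
```

The comparison is inclusive, so a worker idle for exactly 2.5 hours is reported, matching "2.5 or more hours".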
Comment 101•6 years ago
|
||
(In reply to Zsolt Fay [:zfay] from comment #97)
> T-W1064-MS-{021, 024, 027, 031, 033, 034, 035, 037, 037, 039, 041, 043, 045,
> 062, 064, 066, 071, 072, 078, 079, 088, 089, 090, 110, 116, 117, 120, 121,
> 122, 125, 126, 127, 129, 131, 132, 133, 134, 152, 158, 160, 164, 165, 167,
> 169, 174, 176, 180, 198, 199, 202, 204, 206, 210, 212, 215, 216, 218, 222,
> 242, 244, 246, 247, 249, 253, 256, 259, 264, 268, 269, 270, 281, 282, 283,
> 287, 295}.
>
> Note, all of the above have had at least 2+ hours since last job taken, with
> over 75% of them having 10h+.
Cleared the NI request since I did a follow up to those machines in Bug 1490398#c1.
Flags: needinfo?(ciduty)