Closed Bug 1488195 Opened 6 years ago Closed 6 years ago

Production win10 hw machines running generic-worker 8.3.0

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: apavel, Unassigned)

References

Details

Treeherder link: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&selectedJob=197201474
Failure log: https://taskcluster-artifacts.net/DQbh_umgSZWMjfzzWHpixw/0/public/logs/live_backing.log

[taskcluster 2018-09-03T11:48:43.487Z] TASK FAIL since the task payload is invalid. See errors:
[taskcluster 2018-09-03T11:48:43.487Z] - expires: expires is required
[taskcluster 2018-09-03T11:48:43.487Z] - expires: expires is required
[taskcluster 2018-09-03T11:48:43.487Z] Task not successful due to following exception(s):
[taskcluster 2018-09-03T11:48:43.487Z] Exception 1)
[taskcluster 2018-09-03T11:48:43.487Z] Validation of payload failed for task DQbh_umgSZWMjfzzWHpixw
[taskcluster 2018-09-03T11:48:43.487Z]
Pete, is this related to your push?
Flags: needinfo?(pmoore)
This seems to be a loaner machine (T-W1064-MS-361) that is taking production jobs, running the wrong version of generic-worker (8.3.0 vs 10.8.5), and not logging to papertrail (at least I can't find logs for it in papertrail).

I'm not sure why a loaner is taking production jobs; we probably shouldn't allow this.

The worker has been quarantined by arny. I suspect we'll want to reimage it and review which other machines may be taking production workloads that shouldn't be.
Flags: needinfo?(pmoore) → needinfo?(mcornmesser)
From IRC:
"Callek> apavel|sheriffduty: in that case, my gut feeling is to pull out failing instances from the production pool and get bug(s) on file about them, also get a wrap up and e-mail pmoore about it when you think the main hurting has stopped"

I've quarantined T-W1064-MS-{067, 070, 361, 541} since they are the only workers that failed those tasks.
- T-W1064-MS-{067, 070} are running the latest generic-worker version.
- T-W1064-MS-{361, 541} are from MDC2 and are on loan to Markco.

I don't understand why they show up in MDC1 in TC, e.g.
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw/workers/mdc1/T-W1064-MS-541
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw/workers/mdc1/T-W1064-MS-361
Summary: task payload is invalid → Production win10 hw machines running generic-worker 8.3.0
Note, all machines should have been upgraded to 10.8.5 - see bug 1443589 comment 91.
See Also: → 1443589
In summary, the problem was caused by some machines running generic-worker 8.3.0 when all machines taking production jobs should be running generic-worker 10.8.5. Some of these were loaners to markco, which shouldn't have been configured to take production jobs. Others were production machines that, for some unknown reason, had the wrong version of generic-worker installed.

Since all the offending machines have been quarantined, the production fallout should now be mitigated. We can wait for Mark/Kendall to decide what to do with these machines and find out how they got into a bad state. zfay is going to look into reimaging the production machines (not the loaners).
I've re-imaged T-W1064-MS-067 and T-W1064-MS-070 with the new generic worker. Both have taken jobs since and completed them with success.
(In reply to Pete Moore [:pmoore][:pete] from comment #2)
> This seems to be a loaner machine (T-W1064-MS-361) that is taking production jobs, running the wrong version of generic-worker (8.3.0 vs 10.8.5), and not logging to papertrail (at least I can't find logs for it in papertrail).
>
> I'm not sure why a loaner is taking production jobs, we probably shouldn't allow this.
>
> The worker has been quarantined by arny, and I suspect we'll want to reimage it, and review which other machines may be taking production workloads that shouldn't be.

It looks like the wrong MDT task sequence was applied to the nodes with the older version of generic-worker. The older task sequence is no longer available, so this will not happen in the future.
Flags: needinfo?(mcornmesser)
One more machine, MS-106, is running GW 8.3. Should we reimage it with GW10.10, or is it a loaner for you, :markco, and should we leave the old GW?

Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "generic-worker": {
Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "go-arch": "386",
Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "go-os": "windows",
Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "go-version": "go1.7.5",
Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "release": "https://github.com/taskcluster/generic-worker/releases/tag/v8.3.0",
Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "version": "10.0.5"
Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: },
Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "machine-setup": {
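For reviewing which other machines might still be on the old release, the log format above could be scanned automatically. The following is only a minimal sketch: it assumes papertrail-style lines shaped like the excerpt in this comment, takes 10.8.5 as the expected release (the version named earlier in this bug), and uses hypothetical names (`LINE_RE`, `find_outdated_workers`) that are not part of generic-worker or any existing tooling.

```python
import re

# Matches the "release" line that generic-worker logs at startup, e.g.
#   Sep 05 04:15:11 <host> generic-worker: "release": "https://.../tag/v8.3.0",
# The pattern is an assumption based on the log excerpt in this bug.
LINE_RE = re.compile(
    r'^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) generic-worker:\s+'
    r'"release":\s*"https://github\.com/taskcluster/generic-worker/releases/tag/v(?P<ver>[\d.]+)"'
)


def find_outdated_workers(lines, expected="10.8.5"):
    """Return {host: release} for hosts whose logged release tag differs from `expected`."""
    outdated = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("ver") != expected:
            outdated[match.group("host")] = match.group("ver")
    return outdated


if __name__ == "__main__":
    sample = [
        'Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "go-os": "windows",#015',
        'Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "release": '
        '"https://github.com/taskcluster/generic-worker/releases/tag/v8.3.0",#015',
    ]
    for host, release in find_outdated_workers(sample).items():
        print(f"{host} is running generic-worker {release}")
```

The check keys on the release tag URL rather than the "version" field, since the release tag is what identifies the 8.3.0 build in the excerpt above.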
Flags: needinfo?(mcornmesser)
(In reply to Attila Craciun [:arny] from comment #10)
> One more MS-106 running GW 8.3. Should we reimage this with GW10.10 or is a loaner for you :markco and leave the old GW?

Hi Attila,

Yes, please reimage any machines you find running 8.3.0, including loaners, since we no longer support this release.

Many thanks!
Pete
Flags: needinfo?(mcornmesser)
(In reply to Attila Craciun [:arny] from comment #10)
> One more MS-106 running GW 8.3. Should we reimage this with GW10.10 or is a loaner for you :markco and leave the old GW?
>
> Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "generic-worker": {
> Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "go-arch": "386",
> Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "go-os": "windows",
> Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "go-version": "go1.7.5",
> Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "release": "https://github.com/taskcluster/generic-worker/releases/tag/v8.3.0",
> Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "version": "10.0.5"
> Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: },
> Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "machine-setup": {

Mark,

Any idea why this showed up today, and not yesterday?

Thanks.
Flags: needinfo?(mcornmesser)
(In reply to Pete Moore [:pmoore][:pete] from comment #12)
> (In reply to Attila Craciun [:arny] from comment #10)
> > One more MS-106 running GW 8.3. Should we reimage this with GW10.10 or is a loaner for you :markco and leave the old GW?
> >
> > Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "generic-worker": {
> > Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "go-arch": "386",
> > Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "go-os": "windows",
> > Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "go-version": "go1.7.5",
> > Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "release": "https://github.com/taskcluster/generic-worker/releases/tag/v8.3.0",
> > Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "version": "10.0.5"
> > Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: },
> > Sep 05 04:15:11 T-W1064-MS-106.mdc1.mozilla.com generic-worker: "machine-setup": {
>
> Mark,
>
> Any idea why this showed up today, and not yesterday?
>
> Thanks.

Looks like this machine was AWOL for four days between Sep 01 and Sep 05 - any ideas why?

https://papertrailapp.com/systems/1730304451/events?focus=973798966959439916&selected=973798966959439916

Sep 01 02:45:20 T-W1064-MS-106.mdc1.mozilla.com generic-worker: Removing log files older than 1 day
Sep 01 02:45:21 T-W1064-MS-106.mdc1.mozilla.com generic-worker: Removing Windows log files older than 7 days
Sep 01 02:45:21 T-W1064-MS-106.mdc1.mozilla.com generic-worker:
Sep 01 02:45:21 T-W1064-MS-106.mdc1.mozilla.com generic-worker: C:\Windows\Logs\CBS\CbsPersist_20180816190950.cab
Sep 01 02:45:21 T-W1064-MS-106.mdc1.mozilla.com generic-worker: Removing Recycle.bin contents
Sep 01 02:45:21 T-W1064-MS-106.mdc1.mozilla.com User32: The process C:\windows\system32\shutdown.exe (T-W1064-MS-106) has initiated the restart of computer T-W1064-MS-106 on behalf of user T-W1064-MS-106\GenericWorker for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: restart Comment: Rebooting as generic worker ran successfully
Sep 01 02:45:24 T-W1064-MS-106.mdc1.mozilla.com Service_Control_Manager: The sshd service terminated unexpectedly. It has done this 1 time(s).
Sep 05 02:56:28 T-W1064-MS-106.mdc1.mozilla.com Service_Control_Manager: The CldFlt service failed to start due to the following error: The request is not supported.
Sep 05 02:56:28 T-W1064-MS-106.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc1.mozilla.com'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)
Sep 05 02:56:28 T-W1064-MS-106.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc2.mozilla.com,8'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)
Sep 05 02:56:28 T-W1064-MS-106.mdc1.mozilla.com Service_Control_Manager: The cphs service terminated with the following error: No more data is available.
Sep 05 02:56:28 T-W1064-MS-106.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc1.mozilla.com'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)
Sep 05 02:56:28 T-W1064-MS-106.mdc1.mozilla.com Microsoft-Windows-Time-Service: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'infoblox1.private.mdc2.mozilla.com,8'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)
Sep 05 02:56:28 T-W1064-MS-106.mdc1.mozilla.com generic-worker: Running generic-worker startup script (run-generic-worker.bat) ...
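A gap like the one between Sep 01 and Sep 05 in the excerpt above could be spotted mechanically. This is a rough sketch only, assuming each log line starts with a `Mon DD HH:MM:SS` timestamp as in the excerpt; the year is not in the logs and is supplied as a parameter (2018 here, taken from the task failure date in this bug). The function name `find_log_gaps` and the one-day threshold are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, year=2018, threshold=timedelta(days=1)):
    """Return (before, after) timestamp pairs where consecutive log entries
    are more than `threshold` apart.

    Assumes every relevant line begins with a 15-character syslog-style
    timestamp such as "Sep 01 02:45:24"; lines that do not parse are skipped.
    """
    stamps = []
    for line in lines:
        try:
            stamps.append(
                datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
            )
        except ValueError:
            continue  # not a timestamped log line
    stamps.sort()
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > threshold]


if __name__ == "__main__":
    sample = [
        "Sep 01 02:45:24 T-W1064-MS-106.mdc1.mozilla.com Service_Control_Manager: "
        "The sshd service terminated unexpectedly. It has done this 1 time(s).",
        "Sep 05 02:56:28 T-W1064-MS-106.mdc1.mozilla.com Service_Control_Manager: "
        "The CldFlt service failed to start due to the following error: "
        "The request is not supported.",
    ]
    for before, after in find_log_gaps(sample):
        print(f"No log entries between {before} and {after} ({after - before})")
```

A silent period does not by itself say why the machine stopped logging; it only narrows down where to look.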
Assignee: nobody → relops
Component: General → RelOps
Product: Taskcluster → Infrastructure & Operations
QA Contact: klibby
(In reply to Pete Moore [:pmoore][:pete] from comment #12)
> (In reply to Attila Craciun [:arny] from comment #10)
> > One more MS-106 running GW 8.3. Should we reimage this with GW10.10 or is a
> Mark,
>
> Any idea why this showed up today, and not yesterday?
>
> Thanks.

My assumption is that it booted up, hit a network issue, and was then manually rebooted today. Because it was running the old worker, it did not have the recovery pieces in place. The hope is that once we get the firmware update applied, this won't be an issue at all.
Flags: needinfo?(mcornmesser)
Reimaged T-W1064-MS-106 with GW10, and it is running jobs fine.
Closing the bug. We have not seen a node with GW 8 in the last few weeks. The Windows deployment option for GW 8 has been removed, and I will be removing the GW 8 hardware support from OCC in the near future.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED