Open Bug 1789388 Opened 2 years ago Updated 2 years ago

Increased number of "An attempt was made to reference a token that does not exist." errors in generic-worker

Categories

(Infrastructure & Operations :: RelOps: Windows OS, defect)

x86_64
Windows
defect

Tracking

(Not tracked)

People

(Reporter: yarik, Unassigned)

Details

Sentry shows thousands of errors coming from generic-worker on amd64/windows machines running generic-worker 44.17.1.
This happens in the gecko-t/win10-64-2004-gpu worker pool.

*os.SyscallError: An attempt was made to reference a token that does not exist.
  File "/task_165763351549456/taskcluster/workers/generic-worker/sentry.go", line 27, in func1
  File "/task_165763351549456/go/path/pkg/mod/github.com/getsentry/raven-go@v0.2.0/client.go", line 822, in CapturePanicAndWait
  File "/task_165763351549456/taskcluster/workers/generic-worker/sentry.go", line 25, in ReportCrashToSentry
  File "/task_165763351549456/taskcluster/workers/generic-worker/main.go", line 325, in HandleCrash
  File "/task_165763351549456/taskcluster/workers/generic-worker/main.go", line 341, in func1
  File "/task_165763351549456/go/go/src/runtime/panic.go", line 844, in gopanic
  File "/task_165763351549456/taskcluster/workers/generic-worker/multiuser.go", line 49, in PlatformTaskEnvironmentSetup
  File "/task_165763351549456/taskcluster/workers/generic-worker/main.go", line 1097, in PrepareTaskEnvironment
  File "/task_165763351549456/taskcluster/workers/generic-worker/main.go", line 1168, in RotateTaskEnvironment
  File "/task_165763351549456/taskcluster/workers/generic-worker/main.go", line 396, in RunWorker
  File "/task_165763351549456/taskcluster/workers/generic-worker/main.go", line 158, in main
  File "/task_165763351549456/go/go/src/runtime/proc.go", line 250, in main

This issue came up recently in email, so copy/pasting the details here for good measure.

In summary, in July this was traced to an issue with Azure hypervisors after a software update. The issue was resolved at the time, but it may have reappeared.

Mark, is the Azure case still open? Also, if there are any related bugs or links to Azure cases, could you link them? Many thanks!

From: Peter Moore <pmoore@mozilla.com>
Subject: Re: Fwd: Taskcluster Worker Manager Error: generic-worker error
Date: 28. July 2022 at 11:25
To: Mark Cornmesser <mcornmesser@mozilla.com>

Glad the mystery is solved, thanks for letting me know Mark.
Also if anything else needs my attention, let me know. I'm not always great at following bugmail etc and I wouldn't want you to be stuck with anything.
Speak soon! Pete

On 28 Jul 2022, at 00:48, Mark Cornmesser <mcornmesser@mozilla.com> wrote:
Hey Pete, so this does turn out to be an Azure issue. It is an issue with their hypervisor after a recent software update.

On Mon, Jul 25, 2022 at 3:07 PM Mark Cornmesser <mcornmesser@mozilla.com> wrote:
Just FYI, I am opening a support case with Azure. The more I look into it, the more it looks like the increased frequency is a result of VMs not staying shut down.

On Mon, Jul 25, 2022 at 2:46 PM Mark Cornmesser <mcornmesser@mozilla.com> wrote:
Hey Pete,
I am still diving into this and gathering information.

Is this a new platform, and does it happen 100% of the time?

This is not a new platform.

If it doesn't always happen, is it an error that once it has occurred, it then happens permanently?

It looks like the error is permanent once it occurs. In fact, it seems to occur after the VM has failed to shut down. This error then causes GW to exit with code 69, and the VM continues to fail to shut down. Example. So this may just be a continuation of the other issue. I will let you know as I figure out more.

Are the errors limited to a subset of workers in the worker pool?

It doesn't seem so, but I can't rule that out yet.

And what version of Windows and patch level is installed?

Windows 10 2004 with no patching.

On Mon, Jul 25, 2022 at 12:05 AM Peter Moore <pmoore@mozilla.com> wrote:
Hi Mark,
That's a good question, I've not seen this error before. Is this a new platform, and does it happen 100% of the time? If it doesn't always happen, is it an error that once it has occurred, it then happens permanently? Are the errors limited to a subset of workers in the worker pool? And what version of Windows and patch level is installed?
Happy to meet up to discuss!
Pete

On 24 Jul 2022, at 18:06, Mark Cornmesser <mcornmesser@mozilla.com> wrote:
Hey Pete, we are seeing an extreme uptick in this error, thousands of times in the last few days. This seems to be isolated to the same worker pool and VM size as the other issue I NI'ed on. Any ideas? Any suggestions on how to troubleshoot it?

---------- Forwarded message ---------
From: <firefoxcitc-taskcluster-noreply@mozilla.com>
Date: Sat, Jul 23, 2022 at 6:08 PM
Subject: Taskcluster Worker Manager Error: generic-worker error
To: <relops-azure-provisioning@mozilla.com>

WTSQueryUserToken: An attempt was made to reference a token that does not e
ErrorId: FajYbL4uS8SbNU6QF_KAuw
It includes the extra information:
GOARCH: amd64
GOOS: windows
cleanUpTaskDirs: 'true'
deploymentId: ''
engine: multiuser
gwRevision: fb202d2bffa82750d35609fbb6bdcc38b0ce00e5
gwVersion: 43.0.0
instanceType: Standard_NV6
provisionerId: gecko-t
rootURL: https://firefox-ci-tc.services.mozilla.com
workerGroup: southcentralus
workerId: vm-xhmduxepqxctd3bybjvoatxsyrocrnmkmm0
workerType: win10-64-2004-gpu

Component: Workers → RelOps: Windows OS
Flags: needinfo?(mcornmesser)
Product: Taskcluster → Infrastructure & Operations

This is still an open issue with Azure. The support case is 2207250040008499.

Unfortunately, this will continue to be an issue for the next few months. It seems that Azure needs to update VM clusters individually, which prevents a quick rollout of the fix. I am working with Azure and doing weekly maintenance to try to mitigate this error, but we may continue to see spikes from time to time.

This is also being tracked at https://mozilla-hub.atlassian.net/browse/RELOPS-231 .

Flags: needinfo?(mcornmesser)