Closed Bug 1539096 Opened 6 years ago Closed 5 years ago

Artifact upload failures - artifact not found & Queue error

Categories: Taskcluster :: Workers
Type: task
Priority: Not set
Severity: normal
Tracking: (Not tracked)
Status: RESOLVED WONTFIX
Reporter: sfraser
Assignee: Unassigned

Our beta release had some task errors around artifacts:

repackage-msi-ka-win32-nightly/opt GHsD7AvsRyCFf0wl-aocdQ

[taskcluster:error] goroutine 1 [running]:
[taskcluster:error] runtime/debug.Stack(0xc0424c3278, 0x9100a0, 0xc04226e6a0)
[taskcluster:error] /home/travis/.gimme/versions/go1.10.3.src/src/runtime/debug/stack.go:24 +0xae
[taskcluster:error] main.(*TaskRun).Run.func2(0xc042480118, 0xc04252e700, 0xc042480120)
[taskcluster:error] /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1202 +0x231
[taskcluster:error] panic(0x9100a0, 0xc04226e6a0)
[taskcluster:error] /home/travis/.gimme/versions/go1.10.3.src/src/runtime/panic.go:502 +0x237
[taskcluster:error] main.(*TaskRun).uploadArtifact(0xc04252e700, 0xa7c600, 0xc042240f60, 0x4)
[taskcluster:error] /home/travis/gopath/src/github.com/taskcluster/generic-worker/artifacts.go:464 +0x109d
[taskcluster:error] main.(*LiveLogTask).Stop(0xc0421f6b00, 0xc042109960)
[taskcluster:error] /home/travis/gopath/src/github.com/taskcluster/generic-worker/livelog.go:135 +0x452
[taskcluster:error] main.(*TaskRun).Run.func3(0xc042480118, 0xa7c500, 0xc0421f6b00, 0xa7cb40, 0xd38890)
[taskcluster:error] /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1284 +0xdc
[taskcluster:error] main.(*TaskRun).Run(0xc04252e700, 0xc0420be6e0)
[taskcluster:error] /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1368 +0x1835
[taskcluster:error] main.RunWorker(0x0)
[taskcluster:error] /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:748 +0xc9c
[taskcluster:error] main.main()
[taskcluster:error] /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:472 +0xbda
[taskcluster:error]
[taskcluster:error] &errors.errorString{s:"WORKER EXCEPTION due to response code 401 from Queue when uploading artifact &main.RedirectArtifact{BaseArtifact:(*main.BaseArtifact)(0xc042d7b200), URL:\"https://queue.taskcluster.net/v1/task/GHsD7AvsRyCFf0wl-aocdQ/runs/0/artifacts/public/logs/live_backing.log\"} with CreateArtifact payload {\"contentType\":\"text/plain; charset=utf-8\",\"expires\":\"2020-03-24T16:41:17.556Z\",\"storageType\":\"reference\",\"url\":\"https://queue.taskcluster.net/v1/task/GHsD7AvsRyCFf0wl-aocdQ/runs/0/artifacts/public/logs/live_backing.log\"}"}
[taskcluster:error] WORKER EXCEPTION due to response code 401 from Queue when uploading artifact &main.RedirectArtifact{BaseArtifact:(*main.BaseArtifact)(0xc042d7b200), URL:"https://queue.taskcluster.net/v1/task/GHsD7AvsRyCFf0wl-aocdQ/runs/0/artifacts/public/logs/live_backing.log"} with CreateArtifact payload {"contentType":"text/plain; charset=utf-8","expires":"2020-03-24T16:41:17.556Z","storageType":"reference","url":"https://queue.taskcluster.net/v1/task/GHsD7AvsRyCFf0wl-aocdQ/runs/0/artifacts/public/logs/live_backing.log"}
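
For context on what is failing here: live_backing.log is uploaded as a "reference" (redirect) artifact via the Queue's createArtifact endpoint, with the payload shown in the error. Below is a minimal Go sketch of that request, assuming a plain unauthenticated POST; the real worker signs the request with its temporary Hawk task credentials, and once the claim has lapsed those credentials stop working, which is one way to end up with a 401.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Payload for a "reference" (redirect) artifact, mirroring the
// CreateArtifact payload in the error above.
type redirectArtifact struct {
	ContentType string `json:"contentType"`
	Expires     string `json:"expires"`
	StorageType string `json:"storageType"`
	URL         string `json:"url"`
}

func main() {
	taskID, runID := "GHsD7AvsRyCFf0wl-aocdQ", 0
	name := "public/logs/live_backing.log"
	endpoint := fmt.Sprintf("https://queue.taskcluster.net/v1/task/%s/runs/%d/artifacts/%s",
		taskID, runID, name)

	body, _ := json.Marshal(redirectArtifact{
		ContentType: "text/plain; charset=utf-8",
		Expires:     "2020-03-24T16:41:17.556Z",
		StorageType: "reference",
		URL:         endpoint, // here the redirect points at the log's own URL, as in the error above
	})

	// The real worker signs this request with its temporary task credentials.
	// Once the claim has lapsed, those credentials no longer authorize the
	// call and the Queue responds 401 - the panic seen above.
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("Queue responded:", resp.Status)
}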

The update-verify tasks all had:
[task 2019-03-25T19:11:45.456Z] All is well
[fetches 2019-03-25T19:11:45.456Z] removing /builds/worker/fetches
[fetches 2019-03-25T19:11:45.456Z] finished
[taskcluster 2019-03-25 19:11:45.854Z] === Task Finished ===
[taskcluster 2019-03-25 19:11:45.975Z] Artifact "public/build/diff-summary.log" not found at "/builds/worker/tools/release/updates/diff-summary.log"
[taskcluster 2019-03-25 19:11:46.877Z] Successful task run with exit code: 0 completed in 2943.735 seconds

and were marked as failed:

AUY7gFcPTbyroFBHJW_DMQ release-update-verify-firefox-linux64-11/12 failed
BgKY6VvBTM2qCSnr21V62A release-update-verify-firefox-macosx64-2/12 failed
EAszPCRWTHykktwB8VhUsw release-update-verify-firefox-linux64-12/12 failed
E_I-5zAgRwqxVKcgwKhytg release-update-verify-firefox-linux-11/12 failed
ITjqMPesSaSe33V-C_hjnw release-update-verify-firefox-linux-4/12 failed
K-5eDdTFQAGKHNfj7WkZ3A release-update-verify-firefox-macosx64-4/12 failed
KuWArIReTLuZ6K9633MHWg release-update-verify-firefox-linux-1/12 failed
O9X96W4kTdOsKaSIcEZ4kA release-update-verify-firefox-macosx64-6/12 failed
QU7Zd2POTRSfs-JGXq8I7g release-update-verify-firefox-macosx64-3/12 failed

This all seemed to happen at or soon after 2019-03-25 19:08 GMT. Was there an infrastructure issue at that time? Or is something else going on? All tasks had successful reruns.
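
Note on the "not found" failures above: even with exit code 0, a declared-but-missing artifact fails the run, because the worker checks each declared artifact path once the task command exits. A hypothetical Go sketch of that post-task check (these update-verify tasks ran on docker-worker, whose actual implementation is JavaScript, so this is only illustrative):

package main

import (
	"fmt"
	"os"
)

func main() {
	// Declared in the task payload: artifact name -> path inside the worker.
	artifacts := map[string]string{
		"public/build/diff-summary.log": "/builds/worker/tools/release/updates/diff-summary.log",
	}

	failed := false
	for name, path := range artifacts {
		if _, err := os.Stat(path); os.IsNotExist(err) {
			// A declared-but-missing artifact fails the run even when the
			// task command itself exited 0, matching the log above.
			fmt.Printf("Artifact %q not found at %q\n", name, path)
			failed = true
		}
	}
	if failed {
		os.Exit(1)
	}
}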

(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #0)

> This all seemed to happen at or soon after 2019-03-25 19:08 GMT. Was there an infrastructure issue at that time? Or is something else going on? All tasks had successful reruns.

I suspect we might have had a Queue downtime - Dustin, do we have historical outage data on our services?

Similarly, I'm curious whether bug 1539174 also corresponds to an outage the following day (2019-03-26 13:11 GMT).

Component: Operations and Service Requests → Services
Flags: needinfo?(dustin)
See Also: → 1539174
Flags: needinfo?(dustin)

This issue has been observed on several windows10-aarch64 machines hosted at Bitbar over the weekend (July 12-15) running generic-worker 14.1.2.

This resulted in the machines dropping out of the queue.

(In reply to Edwin Gao (:egao) from comment #3)

> This issue has been observed on several windows10-aarch64 machines hosted at Bitbar over the weekend (July 12-15) running generic-worker 14.1.2.
>
> This resulted in the machines dropping out of the queue.

Hi Edwin,

In bug 1562964 (gw 15.1.1) we now log more information when we get the 401 response code.

I realise there are still some open issues with running generic-worker 15 on Bitbar aarch64 hardware (bug 1565005), but hopefully once those are resolved we can run generic-worker 15.1.1 on those machines, which should give us a better idea of what is causing the issue.

I'm curious if the issue might be due to the laptops going to sleep while they are running jobs. Do we have full access to logs on one of the affected machines?
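
For reference, the extra detail added in bug 1562964 amounts to surfacing the Queue's response body alongside the status code. An illustrative Go sketch of that pattern (not the exact generic-worker 15.1.1 code):

package upload

import (
	"io/ioutil"
	"log"
	"net/http"
)

// logUploadFailure logs the Queue's response body as well as its status code,
// since the body usually explains why authentication failed (expired claim,
// clock skew, bad credentials, ...).
func logUploadFailure(resp *http.Response) {
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Printf("artifact upload failed with %s (could not read body: %v)", resp.Status, err)
		return
	}
	log.Printf("artifact upload failed with %s from Queue: %s", resp.Status, body)
}

With that in place, the worker log records the Queue's explanation rather than just the bare 401.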

Flags: needinfo?(egao)
See Also: → 1565005

:pmoore - I do not have access to the logs or the machines themselves.

I can ask the Bitbar folks to verify that the machines are configured to never sleep if you suspect it as the cause.

I've also added and pinged you in the Slack channel where I'm working with the Bitbar folks.

Flags: needinfo?(egao)
2019/07/14 02:57:48  *********** PANIC occurred! *********** 

2019/07/14 02:57:48 WORKER EXCEPTION due to response code 401 from Queue when uploading artifact &main.RedirectArtifact{BaseArtifact:(*main.BaseArtifact)(0x18fd0420), URL:"https://queue.taskcluster.net/v1/task/AhAoVzLpRNGFzSq8KQOGfA/runs/3/artifacts/public/logs/live_backing.log"} with CreateArtifact payload {"contentType":"text/plain; charset=utf-8","expires":"2020-07-12T21:58:19.649Z","storageType":"reference","url":"https://queue.taskcluster.net/v1/task/AhAoVzLpRNGFzSq8KQOGfA/runs/3/artifacts/public/logs/live_backing.log"}

2019/07/14 02:57:48 No sentry project defined, not reporting to sentry

2019/07/14 02:57:48 Exiting worker with exit code 69

Looking at this example, we see the claim expired: https://tools.taskcluster.net/groups/ct3UKgHkRAOemrdTtpf6Qg/tasks/AhAoVzLpRNGFzSq8KQOGfA/runs/3

The claim expired at 2019-07-14T02:52:13.000Z.

From the logs, we see that the worker tried to post an artifact at 2019/07/14 02:57:48, more than 5 mins after the claim had expired.

The worker would normally reclaim the task 17 mins after it first claimed it, i.e. at 2019-07-14T02:49:11.519Z (3 mins before the claim expires).

So the question is, what was the worker (t-lenovoyogac630-024) doing between 2019-07-14T02:49:11.519Z and 2019/07/14 02:57:48? Did it sleep during this time?

I've requested logs between 02:49 and 02:58 UTC on 14 July for t-lenovoyogac630-024 from Stanley Lao in the Slack chat.
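
To make that timeline concrete, here is the arithmetic as a Go sketch (assuming the roughly 20-minute claim window and 3-minute reclaim margin described above; the exact constants and timestamps in generic-worker differ slightly):

package main

import (
	"fmt"
	"time"
)

func main() {
	// takenUntil for run 3 of AhAoVzLpRNGFzSq8KQOGfA.
	takenUntil, _ := time.Parse(time.RFC3339, "2019-07-14T02:52:13.000Z")

	// Assumed 3-minute safety margin: the worker should reclaim before the
	// claim lapses, not at the last moment.
	reclaimBy := takenUntil.Add(-3 * time.Minute)

	// When the worker actually attempted the artifact upload (from the panic log).
	uploadAt, _ := time.Parse("2006/01/02 15:04:05", "2019/07/14 02:57:48")

	fmt.Println("should have reclaimed by:", reclaimBy.UTC())
	fmt.Println("upload attempted at:     ", uploadAt.UTC())
	fmt.Println("upload came", uploadAt.Sub(takenUntil), "after the claim lapsed")
	// If the machine slept through the reclaim window, both the reclaim and
	// the upload land after takenUntil, and the Queue rejects them with 401.
}

Run as-is, this prints a gap of 5m35s, matching the roughly five-minute discrepancy described above.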

Blocks: 1585578

Pete, did you want to keep investigating this or has the moment passed?

Component: Services → Workers
Flags: needinfo?(pmoore)

Yes, let's close, this seems to be just due to the services outage. Thanks!

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(pmoore)
Resolution: --- → WONTFIX
No longer blocks: 1585578