Bug 1759288 (Open) - Opened 3 years ago - Updated 3 days ago

[docker-worker] Permanent and Intermittent Linux 18.04 x64 jobs fail after many retries with "claim_expired" - log is unavailable

Categories

(Taskcluster :: Workers, defect, P3)

Tracking

(Not tracked)

People

(Reporter: imoraru, Assigned: aerickson)

References

(Regression)

Details

(Keywords: intermittent-failure, regression, Whiteboard: [stockwell infra])

There is no failure summary in Treeherder and the log is unavailable.
This has started to fail quite often.
Push where we noticed the failure rate increased

Hi! Can you please take a look at this when you have some time?
Thank you!

Flags: needinfo?(mgoossens)

Hey :pmoore, any idea what could cause these intermittent claim-expired failures?

Severity: -- → S3
Flags: needinfo?(mgoossens) → needinfo?(pmoore)
Priority: -- → P3
Summary: Intermittent Linux 18.04 x64 WebRender Shippable Wc jobs fail after many retries with "claim_expired" - log is unavailable → Intermittent Linux 18.04 x64 jobs fail after many retries with "claim_expired" - log is unavailable
Flags: needinfo?(pmoore)
Summary: Intermittent Linux 18.04 x64 jobs fail after many retries with "claim_expired" - log is unavailable → [docker-worker] Intermittent Linux 18.04 x64 jobs fail after many retries with "claim_expired" - log is unavailable

I suspect the task is causing the worker to crash / run out of memory, or trigger a bug in the worker implementation (docker-worker).

If it happens consistently, it could be worth tailing the live log while the task is running - as soon as the worker goes offline, that log is lost.

Claim-expired means that the queue lost touch with the worker. The worker is required to routinely call queue.reclaimTask while it is executing a task. Failure to do so indicates the worker has broken, crashed, lost network connectivity, etc., and the queue will reschedule the task to run on a different worker. If a task consistently results in exception/claim-expired, that suggests something the task does causes the worker to crash. This is still a docker-worker bug, but it may be triggered by e.g. excessive memory use by the task.

Note that while the task is running, the worker streams the live log file, and only when the task completes does it upload the full task log to AWS S3. Since the worker crashes/goes offline before the task completes, the complete log file never gets uploaded, which is why the errored tasks have no log. However, the log is probably still streamed during task execution up until the worker goes offline, so it is worth running a task and tailing the live log (e.g. with a curl command, as sketched below) to capture as much of the log as possible and help diagnose the issue - at least to get an idea of approximately which stage of the task it fails at.
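
For reference, a minimal sketch of tailing the live log with curl. The task/run IDs are placeholders, and this assumes the Firefox CI root URL and the queue's getArtifact endpoint; the live log artifact is assumed to be public/logs/live.log (some tasks use public/build/live.log instead):

# Placeholders -- substitute the taskId/runId of the run you want to watch.
TASK_ID="<taskId>"
RUN_ID=0

# While the task runs, the getArtifact endpoint redirects to the streaming
# live log. -L follows that redirect, -N disables output buffering so lines
# appear as they are produced, and tee keeps a local copy in case the worker
# goes offline mid-task.
curl -sSL -N \
  "https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/${TASK_ID}/runs/${RUN_ID}/artifacts/public/logs/live.log" \
  | tee "live-${TASK_ID}-run${RUN_ID}.log"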

Linux tsan wpt6 is a permanent failure. James, can you get a log out of the worker or is this a request for RelOps?

Flags: needinfo?(james)
Summary: [docker-worker] Intermittent Linux 18.04 x64 jobs fail after many retries with "claim_expired" - log is unavailable → [docker-worker] Permanent and Intermittent Linux 18.04 x64 jobs fail after many retries with "claim_expired" - log is unavailable

If it helps, I've exported all the worker logs that I can see into a giant text file.

My attempts to get the log out of a running instance of the task were unsuccessful, so maybe RelOps? I tried running it in an interactive task but it seemed to be trying to download a tests-by-manifest.json.gz file that doesn't exist (maybe they have very short expiry times? Not sure).

Flags: needinfo?(james)

That could potentially be it. From the worker logs (5000 lines since the worker started), I don't see anything that sticks out, but if anyone from RelOps or the TC team wants to view them, let me know and I'll upload them.

Could you check the logs to see why some tasks fail permanently with an exception (see comment 15)?

Flags: needinfo?(dhouse)
Whiteboard: [stockwell disable-recommended]
Assignee: nobody → dhouse
Flags: needinfo?(dhouse)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

Update: there are 109 total failures in the last 7 days and 647 total failures in the last 30 days on:

  • linux1804-64-tsan-qr opt
  • linux1804-64-shippable-qr opt
  • linux1804-64-devedition-qr opt

@dhouse: are there any updates on this?

Flags: needinfo?(dhouse)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

(In reply to Michelle Goossens [:masterwayz] from comment #18)

That could potentially be it. From the worker logs (5000 lines since the worker started), I don't see anything that sticks out, but if anyone from RelOps or the TC team wants to view them, let me know and I'll upload them.

Hi Michelle, could you attach the logs you captured? I tried an interactive task, but something prevents me from connecting to interactive tasks (the Taskcluster web view doesn't load, with some permissions errors).

Flags: needinfo?(dhouse) → needinfo?(mgoossens)

I don't have it anymore, but I can get a new one. Do you have a link to a task that failed with this?

Flags: needinfo?(mgoossens) → needinfo?(dhouse)

(In reply to Michelle Goossens [:masterwayz] from comment #50)

I don't have it anymore, but I can get a new one. Do you have a link to a task that failed with this?

Hi Michelle, does this work for you? It's the most recent failure in OF.

Yes, that works!

Flags: needinfo?(dhouse) → needinfo?(mgoossens)

For logs see the bug that this depends on.

Flags: needinfo?(mgoossens)
Whiteboard: [stockwell disable-recommended] → [stockwell infra]

I can't help but notice that green runs have a boatload and a half of logspam in them along these lines:

[task 2022-05-16T22:55:19.118Z] 22:55:19     INFO - PID 3555 | JavaScript error: http://web-platform.test:8000/css/CSS2/normal-flow/crashtests/block-in-inline-ax-crash.html, line 31: TypeError: can't access property "insertNode" of null
[task 2022-05-16T22:55:19.129Z] 22:55:19     INFO - PID 3555 | JavaScript error: http://web-platform.test:8000/css/CSS2/normal-flow/crashtests/block-in-inline-ax-crash.html, line 50: TypeError: can't access property "extractContents", oRange is null

I wonder if skipping the test would suddenly make this more reliable...

Log link:
https://firefoxci.taskcluster-artifacts.net/cLsypBdgSy2JdLNNmdTRpg/5/public/logs/live_backing.log

The push mentioned in #c0 also added a new wpt crashtest (svg/animations/end-of-time-002-crash.html) - I'm wondering why we haven't tried disabling it yet?

OK, I've been able to confirm that at least the wpt crashtest issues were caused by the Part 2 patch from bug 1751107. At Aryx's request, I'm filing a spin-off bug for that specific issue. Regardless, we should address the fact that these jobs ought to fail more gracefully than they do here.

Depends on: 1770058
Component: Workers → DOM: Streams
Flags: needinfo?(mgaudet)
Product: Taskcluster → Core
Regressed by: 1761483

I'm guessing we just need to skip large-pipeto.js on tsan in that case?

(In reply to Ryan VanderMeulen [:RyanVM] from comment #72)

I'm guessing we just need to skip large-pipeto.js on tsan in that case?

Comment 8 suggested that OOM could cause this. large-pipeto.js definitely could cause OOM on a tightly constrained machine. I'll make a patch to skip under tsan, unless there's a tighter constraint that can be put around these jobs.

Flags: needinfo?(mgaudet)

Hopefully the frequency of this will drop now that Bug 1770170 is closed.

Set release status flags based on info from the regressing bug 1761483

Looks like ASAN mochitest-browser-chrome is hitting this now for the past week or so :(

Moving back to Taskcluster :: Workers, since https://bugzilla.mozilla.org/show_bug.cgi?id=1759288#c70 was resolved and it was not a browser-chrome test.

Component: DOM: Streams → Workers
Product: Core → Taskcluster

There have been 88 total failures in the last 7 days (recent failure log).
Affected platforms are:

  • linux1804-64-asan-qr
  • linux1804-64-qr
  • linux1804-64-shippable-qr
  • linux64

There have been 42 total failures in the last 7 days (recent failure log).
Affected platforms are:

  • linux1804-64-tsan-qr
  • linux1804-64-asan-qr

Crashtests started failing with exceptions on release in this range.

Pete, is there someone you can assign this bug to?
Thank you.

Flags: needinfo?(pmoore)

I'm not sure what to suggest.

One important thing to call out is that there are two different types of logs discussed in this bug:

  • Task logs are the logs that get attached to the task, such as public/build/live.log and public/build/live_backing.log
  • Worker logs generally refer to the log file that remains on the worker, which is output from the worker process (docker-worker/generic-worker/scriptworker/...)

RelOps can provide access to worker logs, but those probably do not offer a lot of value. However, they may report certain issues like out-of-memory, or give an indication of when something went wrong.
Task logs, on the other hand, are probably pretty useful, because they will show what the task was doing at the point the incident occurred that caused the worker to go offline or stop talking to the Queue.
The only way to get task logs is to capture the live log while it is streaming (see comment 8). If the live stream is not captured, there is usually no way to recover it afterwards. This is much easier to do when an issue is a permanent failure, since you can curl the live log, be sure that at some point it will fail, and end up with a copy. With an intermittent task that is harder: unless there is a high chance it will fail, you might try many times and never capture one. A rough sketch of one approach is below.
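
A hedged sketch of that approach for an intermittent failure; the task ID is a placeholder, and it assumes the same Firefox CI queue endpoints and artifact name as the sketch after comment 8 (runs that have not started yet would need polling rather than the simple loop shown here):

# Since claim-expired triggers automatic retries, tail each run of the task in
# turn and record how it was resolved, stopping once a claim-expired is captured.
ROOT="https://firefox-ci-tc.services.mozilla.com/api/queue/v1"
TASK_ID="<taskId>"   # placeholder

for RUN_ID in 0 1 2 3 4; do
  OUT="live-${TASK_ID}-run${RUN_ID}.log"
  # Blocks while the run streams; returns when the run ends or the worker vanishes.
  curl -sSL -N "${ROOT}/task/${TASK_ID}/runs/${RUN_ID}/artifacts/public/logs/live.log" -o "${OUT}"
  REASON=$(curl -sS "${ROOT}/task/${TASK_ID}/status" | jq -r ".status.runs[${RUN_ID}].reasonResolved")
  echo "run ${RUN_ID}: ${REASON} (log in ${OUT})"
  [ "${REASON}" = "claim-expired" ] && break   # captured the failing run's log
done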

I would suggest this is the next step. Undoubtedly there will be something in the task that is causing the problem, so it will likely result in a fix to the task rather than a fix to the worker. However, it could result in both, since ideally workers wouldn't crash when tasks do bad things. Docker-worker is soon to be replaced by generic-worker, so these types of issues should become less frequent once we have switched over, hopefully later this year.

Ryan, any other ideas about how this issue could be tackled?

Flags: needinfo?(pmoore) → needinfo?(ryanvm)

Seems like a question that RelEng might be better equipped to answer.

Flags: needinfo?(ryanvm) → needinfo?(jcristau)

Looking at a recent failure that was starred with this bug: https://firefox-ci-tc.services.mozilla.com/tasks/BpS8EBPPTNGIWmASxDedBQ/runs/0 (test-linux1804-64-ccov-qr/opt-jittest-1proc-6)

      {
        "runId": 0,
        "state": "exception",
        "reasonCreated": "scheduled",
        "reasonResolved": "claim-expired",
        "workerGroup": "us-west1",
        "workerId": "10906424040333399",
        "takenUntil": "2023-03-23T14:04:17.413Z",
        "scheduled": "2023-03-23T13:34:16.722Z",
        "started": "2023-03-23T13:34:16.822Z",
        "resolved": "2023-03-23T14:04:18.380Z"
      },
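
For reference, that run information can be pulled directly from the queue's status endpoint; a sketch, assuming the standard Firefox CI root URL (the task will eventually expire, at which point this returns an error):

# Show how each run of the task above was resolved (runId, state, reasonResolved).
curl -sS "https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/BpS8EBPPTNGIWmASxDedBQ/status" \
  | jq -r '.status.runs[] | "\(.runId)\t\(.state)\t\(.reasonResolved)"'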

The worker logs show it taking the task and starting to run stuff. At 2023-03-23 13:52:09 it's still alive and well. Then it looks like it reboots, and we get syslog and kernel messages starting at 2023-03-23 13:52:30 from a new boot. Eventually docker-worker restarts, but its registerWorker call gets a 400:

2023/03/23 13:52:50 Could not register worker:
CALL SUMMARY
============
Method: POST
Service: firefox-ci-tc.services.mozilla.com:
Response Body:
{
  "code": "InputError",
  "message": "Token validation error\n\n---\n\n* method:     registerWorker\n* errorCode:  InputError\n* statusCode: 400\n* time:       2023-03-23T13:52:50.443Z",
  "requestInfo": {
    "method": "registerWorker",
    "params": {},
    "payload": "(OMITTED)",
    "time": "2023-03-23T13:52:50.443Z"
  }
}
Attempts: 1
(Permanent) HTTP response code 400
[...]
Worker exited. Shutting down...

Nothing in the worker log seems to indicate why it's rebooting.
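
One hedged diagnostic idea for a future occurrence, assuming someone can get onto the instance after the reboot and that journald on these images persists logs across reboots (neither of which is confirmed here): check the previous boot's kernel messages for an OOM kill or panic.

# Look at the previous boot's kernel log for anything that would explain the
# unexpected restart (OOM killer, panic, hardware watchdog, etc.).
journalctl -k -b -1 --no-pager | grep -iE 'out of memory|oom|panic|watchdog' \
  || echo "no OOM/panic markers found in the previous boot's kernel log"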

A couple more data points:

Worker memory before the task is claimed:
2023/03/23 13:34:09 {"EnvVersion":"2.0","Fields":{"key":"memory.free","v":1,"val":10094071808},"Hostname":"gecko-t-t-linux-xlarge-gcp-cn5whsflr1ypz84aws32mg","Logger":"taskcluster.docker-worker.gecko-t.t-linux-xlarge-gcp.projects/887720501152/machineTypes/n2-standard-4","Pid":2417,"Severity":6,"Timestamp":1679578449456000000,"Type":"monitor.measure","serviceContext":{"service":"docker-worker"},"severity":"INFO"}

Last stats before the reboot:
2023/03/23 13:52:09 {"EnvVersion":"2.0","Fields":{"key":"memory.free","v":1,"val":6610112512},"Hostname":"gecko-t-t-linux-xlarge-gcp-cn5whsflr1ypz84aws32mg","Logger":"taskcluster.docker-worker.gecko-t.t-linux-xlarge-gcp.projects/887720501152/machineTypes/n2-standard-4","Pid":2417,"Severity":6,"Timestamp":1679579529469000000,"Type":"monitor.measure","serviceContext":{"service":"docker-worker"},"severity":"INFO"}
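
For context, a quick conversion of those two memory.free readings, assuming they are byte counts (an n2-standard-4 has 16 GB of RAM):

# Convert the memory.free values above from bytes to GiB.
for bytes in 10094071808 6610112512; do
  gib=$(awk -v b="$bytes" 'BEGIN { printf "%.1f", b / (1024 * 1024 * 1024) }')
  echo "${bytes} bytes = ${gib} GiB free"
done
# -> roughly 9.4 GiB free before the claim, 6.2 GiB free just before the reboot.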

Doesn't look like it's close to OOM at that point.

I'm afraid I don't know what's next here.

Flags: needinfo?(jcristau)
Depends on: 1859204
See Also: → 1866612
Assignee: dhouse → aerickson

quite likely this is related; if this is 95%+ on jittest for ccov, then I am happy to put that back on a worker with a scratch disk. I will do that in bug 1892798.

Flags: needinfo?(jmaher)

This has gotten a lot worse lately, especially with these two tests:

Hi Pete! Can you please take a look at this?
Thank you!

Flags: needinfo?(aerickson)
Flags: needinfo?(aerickson)
Flags: needinfo?(pmoore)
Depends on: 1902612
Depends on: 1902620
Depends on: 1902979

This has started to permafail for some browser-chrome jobs on Linux 18.04 x64 WebRender asan opt -> example, and I think it started with the landing of Bug 1911801.

Hi Dave! Can you please take a look and confirm if your changes could have really caused this?
Thank you!

Flags: needinfo?(dtownsend)

(In reply to Iulian Moraru from comment #229)

This has started to permafail for some browser-chrome jobs on Linux 18.04 x64 WebRender asan opt -> example, and I think it started with the landing of Bug 1911801.

Hi Dave! Can you please take a look and confirm if your changes could have really caused this?
Thank you!

I can't see any information about the error, so it is a bit difficult to say. That bug simply starts running a browser-chrome test on Linux where previously it wasn't run, but I don't know why that would cause an exception in the test run.

Flags: needinfo?(dtownsend)

often these are OOM or some kind of kernel panic that causes the OS to restart and the task to terminate.

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #231)

often these are OOM or some kind of kernel panic that causes the OS to restart and the task to terminate.

Ok, that might make sense. I suggest backing out bug 1911801 and, if that resolves the issue, filing a bug blocking bug 1635451 (which originally added the test but with incorrect targeting).

Hi Dave! I've backed out Bug 1911801. Let's see if this goes away after the backout.

See Also: → 1917391

After the landing of bug 1907273, some Linux bc jobs started to time out and eventually fail with an exception, as can be seen here.
@Joel, could you take a look?

Flags: needinfo?(jmaher)

hmm, this runs fine on mozilla-central, so this must be something quirky with the selective tests? or maybe something else landed that fixed this?

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #240)

hmm, this runs fine on mozilla-central, so this must be something quirky with the selective tests? or maybe something else landed that fixed this?

it fails on mozilla-central too

ok, this appears to be a legit regression; odd that I didn't see this on try, but I tested almost 2 weeks ago, before my PTO. I have a patch on bug 1919684.

Flags: needinfo?(jmaher)

@Joel, it is still failing with an exception on tsan jobs. Do you think we can disable this too, like in bug 1919684?

Flags: needinfo?(jmaher)

yes, working on a patch for that now; I will post it to bug 1919684.

Flags: needinfo?(jmaher)