Bug 1672869 (Closed) Opened 4 years ago, Closed 4 years ago

Mda jobs thrown as exceptions - Claim expired

Categories

(Core :: Web Audio, defect)

Tracking

RESOLVED FIXED
Tracking Status
firefox-esr78 --- unaffected
firefox81 --- unaffected
firefox82 --- unaffected
firefox83 --- unaffected
firefox84 --- disabled

People

(Reporter: nataliaCs, Assigned: decoder)

References

(Regression)

Details

(Keywords: regression)

Attachments

(2 files)

Started when bug 1648192 added the mda tasks to Linux TSan. The failures occur in the test chunk running dom/media/webaudio/test/mochitest.ini; the task runs only the tests from that manifest.

Component: Task Configuration → Web Audio
Product: Firefox Build System → Core
Regressed by: mochitest_media_tsan
Has Regression Range: --- → yes

Paul, Christian, please coordinate how this shall be resolved.

Flags: needinfo?(padenot)
Flags: needinfo?(choller)

I've seen this too and I am clueless as to why it happens. When these tasks do finish, they finish very quickly, with a runtime of only 8 or 9 minutes.

Could this be related to the fact that a different machine type (xlarge) is requested here, with different spot price properties? It might be useful to loop in someone who works on TC. Without any kind of logs, this is going to be hard to figure out.

Flags: needinfo?(choller) → needinfo?(pmoore)

(In reply to Christian Holler (:decoder) from comment #4)

> I've seen this too and I am clueless as to why it happens. When these tasks do finish, they finish very quickly, with a runtime of only 8 or 9 minutes.

> Could this be related to the fact that a different machine type (xlarge) is requested here, with different spot price properties? It might be useful to loop in someone who works on TC. Without any kind of logs, this is going to be hard to figure out.

Brian, can you assist Christian? Thanks in advance.

Flags: needinfo?(bstack)

Yeah, happy to help. Are only the Mda jobs experiencing this? I've found the worker logs for one of these instances, and the machine restarted in the middle of a task without saying why. In our experience this is usually due to an OOM, a kernel panic, etc.

My intuition is that if this is only affecting one test type, and not the many other tasks that run on these workers, the test might be leaking memory or something along those lines. If the sheriffs ping me right after the next one of these, maybe we can get to the instance in AWS before AWS forgets about it.

Is there an in-tree reporting mechanism that is tracking how much memory these tasks use?

Flags: needinfo?(bstack)

Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/f82597b1e299
disable dom/media/webaudio/test/mochitest.ini for ThreadSanitizer because new task fails very frequently. DONTBUILD

Normally I would also have guessed OOM (and I suppose we could rule this out by providing an even larger machine; do we have anything larger than xlarge in TC?).

What confuses me is this: when these jobs finish normally, they only take 8-9 minutes. When tests put enough OOM pressure on the machine that they sometimes fail, they usually take really long even when they succeed (because the machine gets bogged down by swapping and generally poor test performance). This is not the case here. I've also run the tests locally with TSan and couldn't see memory usage beyond what we would usually expect from TSan, but I can try these again; maybe the spikes only happen intermittently?
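
For reference, a minimal sketch of such a local peak-memory check; this is not an in-tree reporting mechanism, and the mach invocation below is only a placeholder for whatever command actually runs the tests:

```python
# Minimal sketch: report the peak RSS of a local test run after it exits.
# The mach invocation is a placeholder; substitute the real command.
import resource
import subprocess

subprocess.run(["./mach", "mochitest", "dom/media/webaudio/test"], check=False)

# On Linux, ru_maxrss is in kilobytes and covers child processes that have
# been waited for.
peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak child RSS: {peak_kb / 1024:.1f} MiB")
```

A single end-of-run measurement like this can miss short-lived spikes, which would be consistent with not noticing anything unusual locally.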

Flags: needinfo?(padenot)

(In reply to Brian Stack [:bstack] from comment #6)

> Yeah, happy to help. Are only the Mda jobs experiencing this? I've found the worker logs for one of these instances, and the machine restarted in the middle of a task without saying why. In our experience this is usually due to an OOM, a kernel panic, etc.

Please attach the logs here; the link goes to a login page (or otherwise provide instructions on how to get them). OOM sounds plausible, and we might well be able to modify the test to make this more reliable. Thanks!

Flags: needinfo?(bstack)

Perhaps this could also be a worker configuration issue; for example, if two workers are configured to run on a single machine, one might be rebooting the machine while the other is running.

Claim-expired is a resolution that the queue assigns to a task when the worker goes AWOL, i.e. when it has failed to reclaim the task within 20 minutes, which it should keep doing while it is actively working on a task. Since logs are streamed during execution and only uploaded on completion, a worker that "disappears" unexpectedly will unfortunately leave no trace of logs, unless the live logs are captured as the task happens. If the claim-expired is reasonably repeatable, it may be worth tailing the live logs in a console session while the task runs, so that they can be captured despite the claim-expired failure. However, even if the worker goes offline, it may still retain the task logs, assuming it doesn't terminate, so it may be possible for us to retrieve them for a post mortem.
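
As an illustration of the mechanism described above, here is a minimal model of the claim/reclaim protocol; this is not the actual Taskcluster queue or worker code, just a sketch of the logic, with the 20-minute window taken from this comment:

```python
# Illustrative model of the claim/reclaim protocol described above; this is
# not the actual Taskcluster queue or worker implementation.
from datetime import datetime, timedelta, timezone

CLAIM_WINDOW = timedelta(minutes=20)  # the window mentioned in the comment

def now():
    return datetime.now(timezone.utc)

class TaskRun:
    def __init__(self):
        self.taken_until = None
        self.resolution = None

    def claim(self):
        # The worker claims the run and is given a takenUntil deadline.
        self.taken_until = now() + CLAIM_WINDOW

    def reclaim(self):
        # A healthy worker keeps calling this while it works on the task,
        # pushing the deadline forward each time.
        self.taken_until = now() + CLAIM_WINDOW

    def queue_tick(self):
        # If the worker rebooted or hung and the deadline passes without a
        # reclaim, the queue resolves the run as exception / claim-expired.
        if self.taken_until and now() > self.taken_until and not self.resolution:
            self.resolution = ("exception", "claim-expired")
```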

Note this could also be a worker bug, or a misconfiguration of docker on the worker. Normally for docker-worker tasks, it is difficult for the task to do something evil like reboot the machine mid-task, since it should only be able to affect the docker container the task is running in, not the host environment.

Flags: needinfo?(pmoore)

> This is not the case here. I've also run the tests locally with TSan and couldn't see memory usage beyond what we would usually expect from TSan, but I can try these again; maybe the spikes only happen intermittently?

Yeah, that is definitely confusing. I'm not sure how often this is happening; is it some constant percentage of these tasks, or 100% of them?

> Please attach the logs here; the link goes to a login page (or otherwise provide instructions on how to get them). OOM sounds plausible, and we might well be able to modify the test to make this more reliable. Thanks!

Ah, there's really nothing to see there. The logs just showed normal docker-worker output, followed by the machine's init process starting again, with nothing in between. I shouldn't even have linked them, really. They're no longer indexed in the web service we use there, but I can go get them out of S3 if you'd like to see them.

> If the claim-expired is reasonably repeatable, it may be worth tailing the live logs in a console session while the task runs, so that they can be captured despite the claim-expired failure.

We tried that for a bit on Friday and couldn't get it to reproduce at the time. We can try again today/tomorrow/etc.; it is probably the best next step. Tailing a handful of tasks at a time for a few hours should result in a hit?
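
A sketch of how the live log could be followed from a terminal; it assumes the firefox-ci root URL and the conventional public/logs/live.log artifact name used by docker-worker (adjust both if needed), and requires the requests package:

```python
# Sketch: follow a task's live log while it runs, so its output survives a
# claim-expired. Assumes the firefox-ci deployment and the conventional
# public/logs/live.log artifact name; adjust both for other setups.
import sys
import requests

ROOT_URL = "https://firefox-ci-tc.services.mozilla.com"

def tail_live_log(task_id: str, run_id: int = 0) -> None:
    url = (f"{ROOT_URL}/api/queue/v1/task/{task_id}"
           f"/runs/{run_id}/artifacts/public/logs/live.log")
    with requests.get(url, stream=True, timeout=None) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            print(line)

if __name__ == "__main__":
    tail_live_log(sys.argv[1])
```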

Flags: needinfo?(bstack)

If we can determine whether this happens during a particular test, that would be useful. The media test suite includes, alongside the regular tests, regression tests that allocate a lot of memory to put Firefox into an OOM or quasi-OOM situation that wasn't previously handled, and to check that it is now handled gracefully, but sometimes this causes weird issues.

Assignee: nobody → choller
Status: NEW → ASSIGNED

I found the particular test that causes this (it produces a 12 GB RSS spike on my local machine), and it looks like that was the main problem:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=df288d9eb2f1a55efd27e271ac1e69e6628c82cc

I've decided to disable this test not just on TSan but also on ASan. Even though it is not failing on ASan right now, such huge allocations can cause large delays with any of the sanitizers and make the test susceptible to intermittent failures. I guess there is little use for this test on ASan anyway.
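
A transient spike like the 12 GB one mentioned above can be missed by an end-of-run measurement; a small polling sketch like the following can catch it while the test runs. It assumes the psutil package, and matching processes by name is only a rough heuristic:

```python
# Sketch: poll the combined RSS of all Firefox processes to catch short-lived
# allocation spikes that an end-of-run measurement would miss.
# Assumes the psutil package; matching processes by name is a rough heuristic.
# Start this after the browser is running, then run the test.
import time
import psutil

def peak_firefox_rss_mib(name: str = "firefox", interval: float = 0.5) -> float:
    peak = 0
    while True:
        total = 0
        for proc in psutil.process_iter(["name", "memory_info"]):
            mem = proc.info["memory_info"]
            if mem and name in (proc.info["name"] or ""):
                total += mem.rss
        if total == 0:
            break  # no matching processes left (or none started yet)
        peak = max(peak, total)
        time.sleep(interval)
    return peak / (1024 * 1024)

if __name__ == "__main__":
    print(f"peak Firefox RSS: {peak_firefox_rss_mib():.0f} MiB")
```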

Pushed by choller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8ce94732d1ef
Disable WebAudio OOM test for sanitizers. r=padenot

The severity field is not set for this bug.
:padenot, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(padenot)

Christian, do you think we can back out the first patch in this bug and then close this? The second patch should be enough, and I'd rather have TSan coverage here.

Flags: needinfo?(padenot) → needinfo?(choller)

Unless I missed something, the second patch already reverted the first patch?

Flags: needinfo?(choller) → needinfo?(padenot)

Ah brilliant, I missed that. I think we can close this then. Thanks!

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Flags: needinfo?(padenot)
Resolution: --- → FIXED