Bug 1672869 (Closed) Opened 4 years ago, Closed 4 years ago

Mda jobs thrown as exceptions - Claim expired

Categories

(Core :: Web Audio, defect)

Tracking

RESOLVED FIXED
Tracking Status
firefox-esr78 --- unaffected
firefox81 --- unaffected
firefox82 --- unaffected
firefox83 --- unaffected
firefox84 --- disabled

People

(Reporter: nataliaCs, Assigned: decoder)

References

(Regression)

Details

(Keywords: regression)

Attachments

(2 files)

Started when bug 1648192 added the mda tasks to Linux TSan. The failures occur in the test chunk running dom/media/webaudio/test/mochitest.ini; the task runs only the tests from that manifest.

Component: Task Configuration → Web Audio
Product: Firefox Build System → Core
Regressed by: mochitest_media_tsan
Has Regression Range: --- → yes

Paul, Christian, please coordinate how this shall be resolved.

Flags: needinfo?(padenot)
Flags: needinfo?(choller)

I've seen this too and I am clueless as to why it happens. When these tasks do finish, they finish very quickly, with a runtime of only 8 or 9 minutes.

Could this be related to the fact that a different machine type (xlarge) is requested here, with different spot price properties? It might be useful to loop in someone who works on TC. Without any kind of logs, this is going to be hard to figure out.

Flags: needinfo?(choller) → needinfo?(pmoore)

(In reply to Christian Holler (:decoder) from comment #4)

> I've seen this too and I am clueless as to why it happens. When these tasks do finish, they finish very quickly, with a runtime of only 8 or 9 minutes.

> Could this be related to the fact that a different machine type (xlarge) is requested here, with different spot price properties? It might be useful to loop in someone who works on TC. Without any kind of logs, this is going to be hard to figure out.

Brian, can you assist Christian? Thanks in advance.

Flags: needinfo?(bstack)

Yeah, happy to help. Are only the Mda jobs experiencing this? I've found the worker logs for one of these instances, and the machine restarted in the middle of a task without saying why. In our experience this is usually due to an OOM, a kernel panic, etc.

My intuition is that if this is only affecting one test type, and not the many other tasks that run on these workers, the test might be leaking memory or something along those lines. If the sheriffs ping me right after the next one of these, maybe we can get to the instance in AWS before AWS forgets about it.

Is there an in-tree reporting mechanism that is tracking how much memory these tasks use?

Flags: needinfo?(bstack)

Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/f82597b1e299
disable dom/media/webaudio/test/mochitest.ini for ThreadSanitizer because new task fails very frequently. DONTBUILD

Normally I would also have guessed OOM (and I suppose we could rule this out by providing an even larger machine; do we have anything larger than xlarge in TC?).

What confuses me is this: when these jobs finish normally, they only take 8-9 minutes. When tests put enough OOM pressure on the machine that they sometimes fail, they usually take really long even when they succeed (because the machine gets bogged down by swapping and generally poor test performance). This is not the case here. I've also run the tests locally with TSan and couldn't see memory usage beyond what we would usually expect from TSan, but I can try these again; maybe the spikes only happen intermittently?
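
For reference, a minimal sketch of such a local peak-memory check; this is not an in-tree reporting mechanism, and the mach invocation below is only a placeholder for whatever command actually runs the tests:

```python
# Minimal sketch: report the peak RSS of a local test run after it exits.
# The mach invocation is a placeholder; substitute the real command.
import resource
import subprocess

subprocess.run(["./mach", "mochitest", "dom/media/webaudio/test"], check=False)

# On Linux, ru_maxrss is in kilobytes and covers child processes that have
# been waited for.
peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak child RSS: {peak_kb / 1024:.1f} MiB")
```

A single end-of-run measurement like this can miss short-lived spikes, which would be consistent with not noticing anything unusual locally.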

Flags: needinfo?(padenot)

(In reply to Brian Stack [:bstack] from comment #6)

> Yeah, happy to help. Are only the Mda jobs experiencing this? I've found the worker logs for one of these instances, and the machine restarted in the middle of a task without saying why. In our experience this is usually due to an OOM, a kernel panic, etc.

Please attach the logs here; the link goes to a login page (or otherwise provide instructions on how to get them). OOM sounds plausible, and we might well be able to modify the test to make this more reliable. Thanks!

Flags: needinfo?(bstack)

Perhaps this could also be a worker configuration issue; for example, if two workers are configured to run on a single machine, one might be rebooting the machine while the other is running.

Claim-expired is a resolution that the queue assigns to a task when the worker goes AWOL, i.e. when it has failed to reclaim the task within 20 minutes, which it should keep doing while it is actively working on a task. Since logs are streamed during execution and only uploaded on completion, a worker that "disappears" unexpectedly will unfortunately leave no trace of logs, unless the live logs are captured as the task happens. If the claim-expired is reasonably repeatable, it may be worth tailing the live logs in a console session while the task runs, so that they can be captured despite the claim-expired failure. However, even if the worker goes offline, it may still retain the task logs, assuming it doesn't terminate, so it may be possible for us to retrieve them for a post mortem.
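
As an illustration of the mechanism described above, here is a minimal model of the claim/reclaim protocol; this is not the actual Taskcluster queue or worker code, just a sketch of the logic, with the 20-minute window taken from this comment:

```python
# Illustrative model of the claim/reclaim protocol described above; this is
# not the actual Taskcluster queue or worker implementation.
from datetime import datetime, timedelta, timezone

CLAIM_WINDOW = timedelta(minutes=20)  # the window mentioned in the comment

def now():
    return datetime.now(timezone.utc)

class TaskRun:
    def __init__(self):
        self.taken_until = None
        self.resolution = None

    def claim(self):
        # The worker claims the run and is given a takenUntil deadline.
        self.taken_until = now() + CLAIM_WINDOW

    def reclaim(self):
        # A healthy worker keeps calling this while it works on the task,
        # pushing the deadline forward each time.
        self.taken_until = now() + CLAIM_WINDOW

    def queue_tick(self):
        # If the worker rebooted or hung and the deadline passes without a
        # reclaim, the queue resolves the run as exception / claim-expired.
        if self.taken_until and now() > self.taken_until and not self.resolution:
            self.resolution = ("exception", "claim-expired")
```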

Note this could also be a worker bug, or a misconfiguration of docker on the worker. Normally for docker-worker tasks, it is difficult for the task to do something evil like reboot the machine mid-task, since it should only be able to affect the docker container the task is running in, not the host environment.

Flags: needinfo?(pmoore)

> This is not the case here. I've also run the tests locally with TSan and couldn't see memory usage beyond what we would usually expect from TSan, but I can try these again; maybe the spikes only happen intermittently?

Yeah, that is definitely confusing. I'm not sure how often this is happening; is it some constant percentage of these tasks, or 100% of them?

> Please attach the logs here; the link goes to a login page (or otherwise provide instructions on how to get them). OOM sounds plausible, and we might well be able to modify the test to make this more reliable. Thanks!

Ah, there's really nothing to see there. The logs just showed normal docker-worker output, followed by the machine's init process starting again, with nothing in between. I shouldn't even have linked them, really. They're no longer indexed in the web service we use there, but I can go get them out of S3 if you'd like to see them.

> If the claim-expired is reasonably repeatable, it may be worth tailing the live logs in a console session while the task runs, so that they can be captured despite the claim-expired failure.

We tried that for a bit on Friday and couldn't get it to reproduce at the time. We can try again today/tomorrow/etc.; it is probably the best next step. Tailing a handful of tasks at a time for a few hours should result in a hit?
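
A sketch of how the live log could be followed from a terminal; it assumes the firefox-ci root URL and the conventional public/logs/live.log artifact name used by docker-worker (adjust both if needed), and requires the requests package:

```python
# Sketch: follow a task's live log while it runs, so its output survives a
# claim-expired. Assumes the firefox-ci deployment and the conventional
# public/logs/live.log artifact name; adjust both for other setups.
import sys
import requests

ROOT_URL = "https://firefox-ci-tc.services.mozilla.com"

def tail_live_log(task_id: str, run_id: int = 0) -> None:
    url = (f"{ROOT_URL}/api/queue/v1/task/{task_id}"
           f"/runs/{run_id}/artifacts/public/logs/live.log")
    with requests.get(url, stream=True, timeout=None) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            print(line)

if __name__ == "__main__":
    tail_live_log(sys.argv[1])
```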

Flags: needinfo?(bstack)

If we can determine whether this happens during a particular test, that would be useful. The media test suite includes, alongside the regular tests, regression tests that allocate a lot of memory to put Firefox into an OOM or quasi-OOM situation that wasn't previously handled, and to check that it is now handled gracefully, but sometimes this causes weird issues.

Assignee: nobody → choller
Status: NEW → ASSIGNED

I found the particular test that causes this (it produces a 12 GB RSS spike on my local machine), and it looks like that was the main problem:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=df288d9eb2f1a55efd27e271ac1e69e6628c82cc

I've decided to disable this test not just on TSan but also on ASan. Even though it is not failing on ASan right now, such huge allocations can cause large delays with any of the sanitizers and make the test susceptible to intermittent failures. I guess there is little use for this test on ASan anyway.
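
A transient spike like the 12 GB one mentioned above can be missed by an end-of-run measurement; a small polling sketch like the following can catch it while the test runs. It assumes the psutil package, and matching processes by name is only a rough heuristic:

```python
# Sketch: poll the combined RSS of all Firefox processes to catch short-lived
# allocation spikes that an end-of-run measurement would miss.
# Assumes the psutil package; matching processes by name is a rough heuristic.
# Start this after the browser is running, then run the test.
import time
import psutil

def peak_firefox_rss_mib(name: str = "firefox", interval: float = 0.5) -> float:
    peak = 0
    while True:
        total = 0
        for proc in psutil.process_iter(["name", "memory_info"]):
            mem = proc.info["memory_info"]
            if mem and name in (proc.info["name"] or ""):
                total += mem.rss
        if total == 0:
            break  # no matching processes left (or none started yet)
        peak = max(peak, total)
        time.sleep(interval)
    return peak / (1024 * 1024)

if __name__ == "__main__":
    print(f"peak Firefox RSS: {peak_firefox_rss_mib():.0f} MiB")
```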

Pushed by choller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8ce94732d1ef
Disable WebAudio OOM test for sanitizers. r=padenot

The severity field is not set for this bug.
:padenot, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(padenot)

Christian, do you think we can back out the first patch in this bug and then close this? The second patch should be enough, and I'd rather have TSan coverage here.

Flags: needinfo?(padenot) → needinfo?(choller)

Unless I missed something, the second patch already reverted the first patch?

Flags: needinfo?(choller) → needinfo?(padenot)

Ah brilliant, I missed that. I think we can close this then. Thanks!

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Flags: needinfo?(padenot)
Resolution: --- → FIXED