1811237 - Linux webrender asan xpcshell frequent retries that end up in exception

Reporter

Description

•

2 years ago

Linux webrender asan xpcshell tests are frequently retried and then end in an exception.
This started with this push.
@Jens, can you take a look?
The range indicates that this failure started with bug 1810666.

Flags: needinfo?(jstutte)

Sandor Molnar[:smolnar]

Reporter

Updated

•

2 years ago

Regressed by: 1810666

BugBot [:suhaib / :marco/ :calixte]

Updated

•

2 years ago

Keywords: regression

BugBot [:suhaib / :marco/ :calixte]

Comment 1

•

2 years ago

Set release status flags based on info from the regressing bug 1810666

status-firefox109: --- → unaffected

status-firefox110: --- → unaffected

status-firefox111: --- → affected

status-firefox-esr102: --- → unaffected

Jens Stutte [:jstutte]

Assignee

Comment 2

•

2 years ago

I am able to reproduce this, but unfortunately the log files of those tasks do not seem to be accessible. I can probably try to run the asan build from try locally (which I have never done), but then I would not know which test to look out for.

There is some potential in the patches from bug 1810666 to cause more content process creations during parent shutdown. This might just make us hit some limit of the machine, or we might be seeing a real problem with those patches.

Flags: needinfo?(jstutte) → needinfo?(smolnar)

Jens Stutte [:jstutte]

Assignee

Comment 4

•

2 years ago

(In reply to Jens Stutte [:jstutte] from comment #2)

There is some potential in the patches from bug 1810666 to cause more content process creations during parent shutdown. This might just make us hit some limit of the machine, or we might be seeing a real problem with those patches.

I added a diagnostic assert to see if we ever hit the case I have in mind.

Jens Stutte [:jstutte]

Assignee

Comment 5

•

2 years ago

(In reply to Jens Stutte [:jstutte] from comment #4)

I added a diagnostic assert to see if we ever hit the case I have in mind.

From a first run, this seems not to be the case, and it ran successfully now. I re-triggered it a few more times to understand whether this is intermittent now. Did someone alter the machine's configuration?

Jens Stutte [:jstutte]

Assignee

Comment 6

•

2 years ago

(In reply to Jens Stutte [:jstutte] from comment #5)

From a first run, this seems not to be the case, and it ran successfully now. I re-triggered it a few more times to understand whether this is intermittent now. Did someone alter the machine's configuration?

Hmm, now I see 6 re-triggers of X3, while I am sure I started only two of them. There seems to be something weird going on with restarts here. In any case, without being able to see any logs, I do not see much that is actionable for me here.

Cosmin Sabou [:CosminS]

Comment 7

•

2 years ago

•

Edited

The test groups that run when this ends as an exception are:

    browser/components/customizableui/test/unit/xpcshell.ini
    browser/components/sessionstore/test/unit/xpcshell.ini
    browser/extensions/formautofill/test/unit/heuristics/third_party/xpcshell.ini
    browser/tools/mozscreenshots/tests/xpcshell/xpcshell.ini
    chrome/test/unit/xpcshell.ini
    devtools/server/actors/compatibility/lib/test/xpcshell/xpcshell.ini
    devtools/shared/discovery/tests/xpcshell/xpcshell.ini
    devtools/shared/tests/xpcshell/xpcshell.ini
    devtools/shared/webconsole/test/xpcshell/xpcshell.ini
    docshell/test/unit/xpcshell.ini
    dom/abort/tests/unit/xpcshell.ini
    dom/base/test/unit_ipc/xpcshell.ini
    dom/encoding/test/unit/xpcshell.ini
    dom/media/webvtt/test/xpcshell/xpcshell.ini
    dom/messagechannel/tests/unit/xpcshell.ini
    dom/notification/test/unit/xpcshell.ini
    dom/quota/test/xpcshell/xpcshell.ini
    dom/tests/unit/xpcshell.ini
    extensions/pref/autoconfig/test/unit/xpcshell.ini
    extensions/pref/autoconfig/test/unit/xpcshell_snap.ini
    intl/uconv/tests/unit/xpcshell.ini
    js/xpconnect/tests/unit/xpcshell.ini
    modules/libjar/test/unit/xpcshell.ini
    modules/libmar/tests/unit/xpcshell.ini
    parser/xml/test/unit/xpcshell.ini
    remote/shared/test/xpcshell/xpcshell.ini
    security/manager/ssl/tests/unit/xpcshell-smartcards.ini
    testing/modules/tests/xpcshell/xpcshell.ini
    toolkit/components/aboutthirdparty/tests/xpcshell/xpcshell.ini
    toolkit/components/asyncshutdown/tests/xpcshell/xpcshell.ini
    toolkit/components/autocomplete/tests/unit/xpcshell.ini
    toolkit/components/commandlines/test/unit_unix/xpcshell.ini
    toolkit/components/contextualidentity/tests/unit/xpcshell.ini
    toolkit/components/credentialmanagement/tests/xpcshell/xpcshell.ini
    toolkit/components/ctypes/tests/unit/xpcshell.ini
    toolkit/components/downloads/test/unit/xpcshell.ini
    toolkit/components/extensions/test/xpcshell/xpcshell.ini
    toolkit/components/mediasniffer/test/unit/xpcshell.ini
    toolkit/components/messaging-system/targeting/test/unit/xpcshell.ini
    toolkit/components/mozintl/test/xpcshell.ini
    toolkit/components/osfile/tests/xpcshell/xpcshell.ini
    toolkit/components/passwordmgr/test/unit/xpcshell.ini
    toolkit/components/satchel/test/unit/xpcshell.ini
    toolkit/components/startup/tests/unit/xpcshell.ini
    toolkit/components/telemetry/dap/tests/xpcshell/xpcshell.ini
    toolkit/components/thumbnails/test/xpcshell.ini
    toolkit/components/urlformatter/tests/unit/xpcshell.ini
    toolkit/components/windowcreator/tests/unit/xpcshell.ini
    toolkit/mozapps/update/tests/unit_service_updater/xpcshell.ini
    toolkit/profile/xpcshell/xpcshell.ini
    widget/headless/tests/xpcshell.ini

fwiw these ^ are only run on backstop pushes
vs when there's a green one there's only this one:

toolkit/components/extensions/test/xpcshell/xpcshell.ini

Taking as example this range.

Jens, could this be a case of a test misbehaving just as it was in Bug 1796753?

Flags: needinfo?(smolnar) → needinfo?(jstutte)

Jens Stutte [:jstutte]

Assignee

Comment 9

•

2 years ago

•

Edited

The patches from bug 1810666 did not change any test directly. However, they could cause a higher number of content processes, or at least change the order in which they are created and removed. AFAICS, this try run shows that the only known case where we expect this to be possible is not hit when the patches from bug 1811195 are also applied (but the test still fails), though I might be overlooking something.

Does the XPCShell test harness know how many processes were ever spawned or, even better, the maximum number of processes alive in parallel? It would be interesting to compare this number between the successful and the failed runs.

And can we reduce the number of tests running in parallel on those instances (or give them more memory), just to see if it makes a difference? I'd like to understand if something is really going nuts and allocating a very high number of extra processes or if we just sail along the border already and small fluctuations make us fail.
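
To make the comparison asked for above concrete, here is a minimal sketch of the two numbers in question: total processes spawned and the peak number alive at once. `ProcessTally`, `on_spawn` and `on_exit` are hypothetical names for illustration only; nothing in this bug confirms that the harness tracks these values or exposes such hooks.

```python
import threading


class ProcessTally:
    """Hypothetical counter for spawned processes and peak concurrency."""

    def __init__(self):
        self._lock = threading.Lock()
        self.spawned = 0  # total processes ever started
        self.alive = 0    # processes currently running
        self.peak = 0     # highest value `alive` has reached

    def on_spawn(self):
        # Call when a process starts (assumed hook, not a harness API).
        with self._lock:
            self.spawned += 1
            self.alive += 1
            self.peak = max(self.peak, self.alive)

    def on_exit(self):
        # Call when a process terminates.
        with self._lock:
            self.alive -= 1
```

Logging `spawned` and `peak` at the end of a green run and of a failing run would show whether the failing runs really create many more processes or only sit slightly above the same level.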

Flags: needinfo?(jstutte) → needinfo?(smolnar)

Sandor Molnar[:smolnar]

Reporter

Comment 10

•

2 years ago

Unfortunately, I do not have information on the specific metrics regarding the XPCShell test harness.
@Aryx, do you have any insight about this?

Flags: needinfo?(smolnar) → needinfo?(aryx.bugmail)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 12

•

2 years ago

(In reply to Jens Stutte [:jstutte] from comment #9)

Does the XPCShell test harness know how many processes were ever spawned or, even better, the maximum number of processes alive in parallel? It would be interesting to compare this number between the successful and the failed runs.

And can we reduce the number of tests running in parallel on those instances (or give them more memory), just to see if it makes a difference? I'd like to understand if something is really going nuts and allocating a very high number of extra processes or if we just sail along the border already and small fluctuations make us fail.

Flags: needinfo?(aryx.bugmail) → needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 13

•

2 years ago

Interesting questions; this might be possible. Maybe some solutions here:

xpcshell.ini has options to run tests sequentially, not in parallel ( https://searchfox.org/mozilla-central/search?q=sequential&path=xpcshell.ini&case=false&regexp=false ). In fact, about a year ago I took the most frequent failures (ones that almost perma-failed in parallel but maybe not sequentially) and forced them to run sequentially.
we run 1 test per thread, and this is defined by https://searchfox.org/mozilla-central/source/testing/xpcshell/runxpcshelltests.py#54 ( NUM_THREADS = int(cpu_count() * 4))

That does not fully answer the question yet; here is more of the answer: if you designate a test as run-sequentially, it is put into a list, and after all the parallel tests have completed we iterate through the sequential list:
https://searchfox.org/mozilla-central/source/testing/xpcshell/runxpcshelltests.py#1946

So a few ways forward:

adjust num_threads and push to try
if certain tests or directories are suspect, add run-sequentially to the manifest
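
A rough sketch of the scheduling described above, for anyone who wants to reason about it without reading the harness: tests marked run-sequentially are set aside, the rest run on a bounded pool, and the sequential list is walked afterwards. This is not the code from runxpcshelltests.py; `split_tests`, `run_all`, `run_one` and the dict layout are made up for illustration, and `os.cpu_count()` stands in for whatever `cpu_count` the harness imports.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Same formula as the constant quoted above (os.cpu_count is an assumption).
NUM_THREADS = int((os.cpu_count() or 1) * 4)


def split_tests(tests):
    """Partition test dicts into (parallel, sequential) lists.

    The "run-sequentially" key models the manifest annotation; the real
    harness may represent it differently.
    """
    parallel = [t for t in tests if not t.get("run-sequentially")]
    sequential = [t for t in tests if t.get("run-sequentially")]
    return parallel, sequential


def run_all(tests, run_one, thread_count=NUM_THREADS):
    """Run parallel tests on a bounded pool, then the sequential ones."""
    parallel, sequential = split_tests(tests)
    results = []
    with ThreadPoolExecutor(max_workers=max(1, thread_count)) as pool:
        # At most `thread_count` tests are in flight at any time.
        results.extend(pool.map(run_one, parallel))
    # Only after every parallel test has finished, go through the rest.
    for test in sequential:
        results.append(run_one(test))
    return results
```

Lowering `thread_count` in this sketch corresponds to the first option; adding the annotation to a suspect test corresponds to the second.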

Flags: needinfo?(jmaher)

Jens Stutte [:jstutte]

Assignee

Comment 15

•

2 years ago

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #13)

So a few ways forward:

adjust num_threads and push to try

if certain tests or directories are suspect, add run-sequentially to the manifest

(In reply to Cosmin Sabou [:CosminS] from comment #7)

The test groups that run when this ends as an exception are:
    browser/components/customizableui/test/unit/xpcshell.ini
    ...
    widget/headless/tests/xpcshell.ini
fwiw these ^ are only run on backstop pushes
vs when there's a green one there's only this one:
toolkit/components/extensions/test/xpcshell/xpcshell.ini

:CosminS, based on the above: do you have an idea which tests we could apply those manifest changes to? I do not really have the feeling that we have identified a clear offender yet.

Flags: needinfo?(csabou)

Jens Stutte [:jstutte]

Assignee

Comment 16

•

2 years ago

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #13)

we run 1 test per thread, and this is defined by https://searchfox.org/mozilla-central/source/testing/xpcshell/runxpcshelltests.py#54 ( NUM_THREADS = int(cpu_count() * 4))

It seems that constant has not been changed for 9 years now. But IIUC there is also an option threadCount that can be set from the command line? Do we ever use this option, and/or would that be an easier way to test it?

Jens Stutte [:jstutte]

Assignee

Comment 18

•

2 years ago

(In reply to Jens Stutte [:jstutte] from comment #16)

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #13)

we run 1 test per thread, and this is defined by https://searchfox.org/mozilla-central/source/testing/xpcshell/runxpcshelltests.py#54 ( NUM_THREADS = int(cpu_count() * 4))

It seems that constant has not been changed for 9 years now. But IIUC there is also an option threadCount that can be set from the command line? Do we ever use this option, and/or would that be an easier way to test it?

There is also this adjustment for tsan already. I cannot find anything similar for asan, though. A successful run of X4 asan shows:

[task 2023-01-20T19:43:21.720Z] 19:43:21     INFO -  Using at most 8 threads.

while a successful run of tsan says:

[task 2023-01-20T19:24:56.226Z] 19:24:56     INFO -  Using at most 4 threads.

logged from runxpcshelltests.py.

If cpu_count() == 2 on the node in question, that would explain both numbers: 8 would be the initial NUM_THREADS = int(cpu_count() * 4) value, used unchanged for asan, and 4 would be that value after the tsan adjustment. It is probably reasonable to make the same or a similar adjustment for asan?
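
The arithmetic behind that guess, as a small sketch. `effective_thread_count` and its `halve` flag are hypothetical; the factor of two is inferred from the 8 versus 4 in the two log lines above, not read from the harness, and cpu_count() == 2 is itself an assumption about the worker.

```python
def effective_thread_count(cpu_count, halve=False):
    """Return the thread count for a node with `cpu_count` CPUs.

    `halve` models a sanitizer-specific reduction (assumed factor of 2).
    """
    threads = int(cpu_count * 4)  # NUM_THREADS formula quoted in comment 13
    if halve:
        threads = max(1, threads // 2)
    return threads


asan_today = effective_thread_count(2)              # unadjusted
tsan_today = effective_thread_count(2, halve=True)  # with the reduction
```

With two CPUs this gives 8 and 4, which matches the "Using at most 8 threads." and "Using at most 4 threads." lines.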

Jens Stutte [:jstutte]

Assignee

Comment 19

•

2 years ago

Try: https://treeherder.mozilla.org/jobs?repo=try&revision=4ae7bd2d6aa321cca5b1c042c33122ccd1ad1657

Jens Stutte [:jstutte]

Assignee

Comment 20

•

2 years ago

•

Edited

(In reply to Jens Stutte [:jstutte] from comment #19)

Try: https://treeherder.mozilla.org/jobs?repo=try&revision=4ae7bd2d6aa321cca5b1c042c33122ccd1ad1657

That looks good so far. The log shows we are now using 4 "threads". For some reason I do not understand, Phabricator keeps the patch I attached secret, so it does not show up here.

Joel Maher ( :jmaher ) (UTC -8)

Comment 21

•

2 years ago

I think lowering the thread count for asan wouldn't be a problem.

Jens Stutte [:jstutte]

Assignee

Comment 22

•

2 years ago

Attached file Bug 1811237 - Limit the number of parallel running XPCShell tests for asan builds. r?jmaher — Details

Phabricator Automation

Updated

•

2 years ago

Assignee: nobody → jstutte

Status: NEW → ASSIGNED

Pulsebot

Comment 23

•

2 years ago

Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/079e8849e811 Limit the number of parallel running XPCShell tests for asan builds. r=jmaher

Noemi Erli[:noemi_erli]

Comment 24

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/079e8849e811

Status: ASSIGNED → RESOLVED

Closed: 2 years ago

status-firefox111: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → 111 Branch
