Closed Bug 1811237 Opened 2 years ago Closed 2 years ago

Linux webrender asan xpcshell frequent retries that end up in exception

Categories

(Core :: DOM: Content Processes, defect)

Tracking

RESOLVED FIXED
111 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox109 --- unaffected
firefox110 --- unaffected
firefox111 --- fixed

People

(Reporter: smolnar, Assigned: jstutte)

References

(Regression)

Details

(Keywords: regression, Whiteboard: [stockwell disable-recommended])

Attachments

(1 file)

There are frequent Linux webrender asan xpcshell test retries that end in an exception.
This started from this push.
@Jens, can you take a look?
The range indicates that this failure started with bug 1810666.

Flags: needinfo?(jstutte)
Regressed by: 1810666

Set release status flags based on info from the regressing bug 1810666

I am able to reproduce this, but unfortunately the log files of those tasks do not seem to be accessible. I could probably try to run the asan build from try locally (which I have never done), but then I would not know which test to look out for.

There is some potential in the patches from bug 1810666 to cause more content process creations during parent shutdown. This might just make us hit some limit of the machine, or we might be seeing a real problem with those patches.

Flags: needinfo?(jstutte) → needinfo?(smolnar)

(In reply to Jens Stutte [:jstutte] from comment #2)

There is some potential in the patches from bug 1810666 to cause more content process creations during parent shutdown. This might just make us hit some limit of the machine, or we might be seeing a real problem with those patches.

I added a diagnostic assert to see if we ever hit the case I have in mind.

(In reply to Jens Stutte [:jstutte] from comment #4)

I added a diagnostic assert to see if we ever hit the case I have in mind.

From a first run, this does not seem to be the case, and the job ran successfully this time. I re-triggered it a few more times just to understand whether this is intermittent now. Did someone alter the machine's configuration?

(In reply to Jens Stutte [:jstutte] from comment #5)

From a first run, this does not seem to be the case, and the job ran successfully this time. I re-triggered it a few more times just to understand whether this is intermittent now. Did someone alter the machine's configuration?

Hmm, now I see 6 re-triggers of X3, while I am sure I only started two of them. Is something weird going on with restarts here? In any case, without being able to see any logs, I do not see much that is actionable for me here.

The test groups that run when this ends as an exception are:

    browser/components/customizableui/test/unit/xpcshell.ini
    browser/components/sessionstore/test/unit/xpcshell.ini
    browser/extensions/formautofill/test/unit/heuristics/third_party/xpcshell.ini
    browser/tools/mozscreenshots/tests/xpcshell/xpcshell.ini
    chrome/test/unit/xpcshell.ini
    devtools/server/actors/compatibility/lib/test/xpcshell/xpcshell.ini
    devtools/shared/discovery/tests/xpcshell/xpcshell.ini
    devtools/shared/tests/xpcshell/xpcshell.ini
    devtools/shared/webconsole/test/xpcshell/xpcshell.ini
    docshell/test/unit/xpcshell.ini
    dom/abort/tests/unit/xpcshell.ini
    dom/base/test/unit_ipc/xpcshell.ini
    dom/encoding/test/unit/xpcshell.ini
    dom/media/webvtt/test/xpcshell/xpcshell.ini
    dom/messagechannel/tests/unit/xpcshell.ini
    dom/notification/test/unit/xpcshell.ini
    dom/quota/test/xpcshell/xpcshell.ini
    dom/tests/unit/xpcshell.ini
    extensions/pref/autoconfig/test/unit/xpcshell.ini
    extensions/pref/autoconfig/test/unit/xpcshell_snap.ini
    intl/uconv/tests/unit/xpcshell.ini
    js/xpconnect/tests/unit/xpcshell.ini
    modules/libjar/test/unit/xpcshell.ini
    modules/libmar/tests/unit/xpcshell.ini
    parser/xml/test/unit/xpcshell.ini
    remote/shared/test/xpcshell/xpcshell.ini
    security/manager/ssl/tests/unit/xpcshell-smartcards.ini
    testing/modules/tests/xpcshell/xpcshell.ini
    toolkit/components/aboutthirdparty/tests/xpcshell/xpcshell.ini
    toolkit/components/asyncshutdown/tests/xpcshell/xpcshell.ini
    toolkit/components/autocomplete/tests/unit/xpcshell.ini
    toolkit/components/commandlines/test/unit_unix/xpcshell.ini
    toolkit/components/contextualidentity/tests/unit/xpcshell.ini
    toolkit/components/credentialmanagement/tests/xpcshell/xpcshell.ini
    toolkit/components/ctypes/tests/unit/xpcshell.ini
    toolkit/components/downloads/test/unit/xpcshell.ini
    toolkit/components/extensions/test/xpcshell/xpcshell.ini
    toolkit/components/mediasniffer/test/unit/xpcshell.ini
    toolkit/components/messaging-system/targeting/test/unit/xpcshell.ini
    toolkit/components/mozintl/test/xpcshell.ini
    toolkit/components/osfile/tests/xpcshell/xpcshell.ini
    toolkit/components/passwordmgr/test/unit/xpcshell.ini
    toolkit/components/satchel/test/unit/xpcshell.ini
    toolkit/components/startup/tests/unit/xpcshell.ini
    toolkit/components/telemetry/dap/tests/xpcshell/xpcshell.ini
    toolkit/components/thumbnails/test/xpcshell.ini
    toolkit/components/urlformatter/tests/unit/xpcshell.ini
    toolkit/components/windowcreator/tests/unit/xpcshell.ini
    toolkit/mozapps/update/tests/unit_service_updater/xpcshell.ini
    toolkit/profile/xpcshell/xpcshell.ini
    widget/headless/tests/xpcshell.ini

fwiw these ^ are only run on backstop pushes
vs when there's a green one there's only this one:

toolkit/components/extensions/test/xpcshell/xpcshell.ini

Taking as example this range.

Jens, could this be a case of a test misbehaving just as it was in Bug 1796753?

Flags: needinfo?(smolnar) → needinfo?(jstutte)

The patches from bug 1810666 did not change any test directly. However, they could cause a higher number of content processes, or at least change the order in which they are created and removed. That said, AFAICS this try shows that the only known case where we expect this to be possible is not hit when the patches from bug 1811195 are also applied (but the test still fails). I might be overlooking something, though.

Does the XPCShell test harness know how many processes were ever spawned or, even better, the maximum number of processes alive in parallel? It would be interesting to compare this number between the successful and the failed runs.

And can we reduce the number of tests running in parallel on those instances (or give them more memory), just to see if it makes a difference? I'd like to understand whether something is really going nuts and allocating a very high number of extra processes, or whether we are already sailing close to the limit and small fluctuations make us fail.
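The comparison asked for above could be sketched as follows, assuming the harness (or a log post-processor) could emit a timestamped event whenever a process starts or exits. The event format and the name `process_stats` are hypothetical and not part of runxpcshelltests.py; this only illustrates the counting.

```python
def process_stats(events):
    """Return (total_spawned, peak_alive) for a list of process events.

    Each event is a (timestamp, delta) tuple, where delta is +1 for a
    process start and -1 for a process exit. This event format is an
    assumption made for illustration only.
    """
    total_spawned = 0
    alive = 0
    peak_alive = 0
    # Sorting tuples orders by timestamp first and by delta second, so at
    # equal timestamps an exit (-1) is handled before a start (+1) and a
    # restart at the same instant is not counted as an overlap.
    for _, delta in sorted(events):
        if delta > 0:
            total_spawned += 1
        alive += delta
        peak_alive = max(peak_alive, alive)
    return total_spawned, peak_alive
```

Comparing the two returned numbers between a green run and a failed run would show whether the failed run spawns far more processes overall or merely has a slightly higher peak.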

Flags: needinfo?(jstutte) → needinfo?(smolnar)

Unfortunately, I do not have information on those specific metrics for the XPCShell test harness.
@Aryx, do you have any insight about this?

Flags: needinfo?(smolnar) → needinfo?(aryx.bugmail)

(In reply to Jens Stutte [:jstutte] from comment #9)

Does the XPCShell test harness know how many processes were ever spawned or, even better, the maximum number of processes alive in parallel? It would be interesting to compare this number between the successful and the failed runs.

And can we reduce the number of tests running in parallel on those instances (or give them more memory), just to see if it makes a difference? I'd like to understand whether something is really going nuts and allocating a very high number of extra processes, or whether we are already sailing close to the limit and small fluctuations make us fail.

Flags: needinfo?(aryx.bugmail) → needinfo?(jmaher)

Interesting questions. This might be possible; here are some options.

  1. xpcshell.ini has options to run tests sequentially, not in parallel ( https://searchfox.org/mozilla-central/search?q=sequential&path=xpcshell.ini&case=false&regexp=false ). In fact, about a year ago I took the most frequent failures (ones that almost perma-failed in parallel but maybe not when run sequentially) and forced them to run sequentially.
  2. we run 1 test per thread, and this is defined by https://searchfox.org/mozilla-central/source/testing/xpcshell/runxpcshelltests.py#54 ( NUM_THREADS = int(cpu_count() * 4))

That does not fully answer the question yet, so here is a bit more: if you designate a test as run-sequentially, it is put into a separate list, and after all the parallel tests have completed we iterate through that sequential list:
https://searchfox.org/mozilla-central/source/testing/xpcshell/runxpcshelltests.py#1946
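The scheduling described above can be sketched roughly like this. It is a simplified illustration, not the code in runxpcshelltests.py: the function `run_tests`, the `run_one` callback, and the dict-based test entries are hypothetical, and a standard-library thread pool stands in for the harness's own threading.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tests(tests, run_one, num_threads):
    """Run parallel tests first, then the run-sequentially ones one by one.

    `tests` is a list of dicts with a "name" key and an optional
    "run-sequentially" key (hypothetical structure for this sketch).
    """
    parallel = [t for t in tests if not t.get("run-sequentially")]
    sequential = [t for t in tests if t.get("run-sequentially")]
    results = {}

    # Phase 1: at most num_threads tests are in flight at any time.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for test, outcome in zip(parallel, pool.map(run_one, parallel)):
            results[test["name"]] = outcome

    # Phase 2: only after every parallel test has finished, run the
    # sequential list one test at a time.
    for test in sequential:
        results[test["name"]] = run_one(test)

    return results
```

In this model, lowering `num_threads` limits how many tests compete for memory at once, while marking a test run-sequentially moves it out of the contended phase entirely.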

So a few ways forward:

  1. adjust num_threads and push to try
  2. if certain tests or directories are suspect, add run-sequentially to the manifest
Flags: needinfo?(jmaher)

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #13)

So a few ways forward:

  1. adjust num_threads and push to try
  2. if certain tests or directories are suspect, add run-sequentially to the manifest

(In reply to Cosmin Sabou [:CosminS] from comment #7)

The test groups that run when this ends as an exception are:

    browser/components/customizableui/test/unit/xpcshell.ini
    ...
    widget/headless/tests/xpcshell.ini

fwiw these ^ are only run on backstop pushes
vs when there's a green one there's only this one:

toolkit/components/extensions/test/xpcshell/xpcshell.ini

:CosminS, based on the above: do you have an idea which tests we could apply those manifest changes to? I do not really have the feeling that we have identified a clear offender yet.

Flags: needinfo?(csabou)

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #13)

  1. we run 1 test per thread, and this is defined by https://searchfox.org/mozilla-central/source/testing/xpcshell/runxpcshelltests.py#54 ( NUM_THREADS = int(cpu_count() * 4))

It seems that constant has not been changed for 9 years now. But IIUC there is also an option threadCount that can be set from the command line? Do we ever use this option, and/or would that be an easier way to test it?

(In reply to Jens Stutte [:jstutte] from comment #16)

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #13)

  1. we run 1 test per thread, and this is defined by https://searchfox.org/mozilla-central/source/testing/xpcshell/runxpcshelltests.py#54 ( NUM_THREADS = int(cpu_count() * 4))

It seems that constant has not been changed for 9 years now. But IIUC there is also an option threadCount that can be set from the command line? Do we ever use this option, and/or would that be an easier way to test it?

There is also this adjustment for tsan already. I cannot find anything similar for asan, though. A successful run of X4 asan shows:

[task 2023-01-20T19:43:21.720Z] 19:43:21     INFO -  Using at most 8 threads.

while a successful run of tsan says:

[task 2023-01-20T19:24:56.226Z] 19:24:56     INFO -  Using at most 4 threads.

logged from runxpcshelltests.py.

If cpu_count() == 2 on the node in question, then 8 would be the initial NUM_THREADS = int(cpu_count() * 4) value used for asan, and 4 would be that value after the adjustment for tsan. It is probably reasonable to make the same or a similar adjustment for asan?
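The arithmetic behind that guess can be written out as a small sketch. The function name `effective_thread_count` and its flags are hypothetical, and the halving for sanitizer builds is only inferred from the two log lines above (8 threads versus 4); it is not copied from runxpcshelltests.py.

```python
def effective_thread_count(cpu_count, asan=False, tsan=False):
    """Return the assumed number of parallel xpcshell tests.

    Starts from NUM_THREADS = int(cpu_count * 4), as quoted in the thread,
    then halves the value for sanitizer builds. The halving factor is an
    assumption based on the logged thread counts.
    """
    threads = int(cpu_count * 4)
    if asan or tsan:
        # Sanitizer builds need considerably more memory per process,
        # so fewer tests are run at the same time (assumed factor: 2).
        threads = max(1, threads // 2)
    return threads


# With the hypothesised two-core worker:
default_threads = effective_thread_count(2)           # 8, as in the asan log
tsan_threads = effective_thread_count(2, tsan=True)   # 4, as in the tsan log
asan_threads = effective_thread_count(2, asan=True)   # 4, the proposed change
```

The `max(1, ...)` guard is a design choice for this sketch so that a single-core machine never ends up with zero threads.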

(In reply to Jens Stutte [:jstutte] from comment #19)

Try: https://treeherder.mozilla.org/jobs?repo=try&revision=4ae7bd2d6aa321cca5b1c042c33122ccd1ad1657

That looks good so far. The log shows we are now using 4 "threads". For some reason I do not understand, Phabricator keeps the patch I attached secret, so it does not show up here.

I think lowering the thread count for asan wouldn't be a problem.

Assignee: nobody → jstutte
Status: NEW → ASSIGNED
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/079e8849e811 Limit the number of parallel running XPCShell tests for asan builds. r=jmaher
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 111 Branch
Flags: needinfo?(csabou)