Closed Bug 1736430 Opened 3 years ago Closed 3 years ago

Perma tests/jit-test/jit-test/tests/wasm/large-memory.js | Unknown (code -11, args "--ion-eager --ion-offthread-compile=off --more-compartments") [0.0 s] | (code 138, args "--ion-eager --ion-offthread-compile=off --more-compartments") [0.2 s]

Categories

(Core :: JavaScript: WebAssembly, defect, P3)

defect

Tracking


RESOLVED DUPLICATE of bug 1736531
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- unaffected
firefox93 --- unaffected
firefox94 --- unaffected
firefox95 --- affected

People

(Reporter: intermittent-bug-filer, Assigned: lth)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: intermittent-failure, regression)

This looks to be from Bug 1727084. Lars, could you please take a look, as it's perma-failing on central?
This one too.

Added jobs that point to the culprit here and here.

Flags: needinfo?(lhansen)
Summary: Perma tests/jit-test/jit-test/tests/wasm/large-memory.js | Unknown (code -11, args "--ion-eager --ion-offthread-compile=off --more-compartments") [0.0 s] → Perma tests/jit-test/jit-test/tests/wasm/large-memory.js | Unknown (code -11, args "--ion-eager --ion-offthread-compile=off --more-compartments") [0.0 s] | (code 138, args "--ion-eager --ion-offthread-compile=off --more-compartments") [0.2 s]
Assignee: nobody → lhansen
Status: NEW → ASSIGNED
Priority: P5 → P3
Component: JavaScript Engine → JavaScript: WebAssembly

To summarize:

jit-test/tests/wasm/large-memory.js fails with what appears to be a SIGSEGV on macOS 11 x64 and on Android 8 arm64. This test was not changed by the memory64 patch set, so something was perturbed in a way that makes it fail. This could be a code bug, or the introduction of a concurrent test case that causes turbulence.

If the systems are underprovisioned on real memory or swap, the huge memory demands of the jit-test/tests/wasm/memory64/basic.js test, which could be running concurrently, could in principle cause overcommit issues, but that's not a completely obvious candidate.

Both failing builds are opt builds, but this could be happenstance; I'm not sure non-opt builds even run on these devices. The failure appears with various command-line parameters; in particular, it also occurs with --disable-wasm-huge-memory.

I'm not able to repro the Mac failure locally, but I have an MBP with a newer OS and lots of RAM, so that's not much data. With a full jit-test --tbpl run, memory use on the system never got very high.

The next step here is probably to disable jit-test/tests/wasm/memory64/basic.js on the affected devices to see whether that changes the outcome of the other test. If it does, the problem has to do with the provisioning of the test systems; if it does not, the memory64 patches introduced a bug.
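For reference, disabling a jit-test on particular configurations is normally done with a |jit-test| directive on the first line of the test file. A minimal sketch, assuming the shell's getBuildConfiguration() helper and an "android" key; the "arm64" key stands in for whatever predicate actually matches the failing macOS workers and is an assumption here:

  // |jit-test| skip-if: getBuildConfiguration()['android'] || getBuildConfiguration()['arm64']

With such a directive at the top of jit-test/tests/wasm/memory64/basic.js, the harness skips the test on matching configurations and leaves other platforms alone.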

Flags: needinfo?(lhansen)

Set release status flags based on info from the regressing bug 1727084

With basic.js disabled, the Windows ccov failure didn't reproduce, but the macOS and Android failures persisted.

Yeah. I found another problem last night that may be the cause of the present bug; I will test that this morning.

I can't repro locally with the failing artifact, and the other bug I thought might be connected almost certainly isn't, so I'm going to have to bisect this on try in the hope of tracking down some non-obvious problem. I will update this comment as I progress.

Range of updates:

Newest:
changeset: ______:80388e7f335c
user: Lars T Hansen <lhansen@mozilla.com>
date: Mon Oct 18 09:58:16 2021 +0000
summary: Bug 1727084 - Memory64 - Test cases and testing code. r=yury
Try run on that patch: https://treeherder.mozilla.org/jobs?repo=try&revision=eeb728b75e64e981f8032344d14a599dd60f64a4 shows the desired failure (last entry for "OS X 11 WebRender Shippable").

changeset: ______:13ee9674ee35
user: Lars T Hansen <lhansen@mozilla.com>
date: Mon Oct 18 09:58:14 2021 +0000
summary: Bug 1727084 - Memory64 - Bulk memory operations. r=yury
Try run on that patch: https://treeherder.mozilla.org/jobs?repo=try&revision=42ec18a0685d80b97690f133693826e914920e94&selectedTaskRun=YVqRvipjQ5utWHMJ64OhRA.0 shows the desired failure.

changeset: ______:b658cfe4b173
user: Lars T Hansen <lhansen@mozilla.com>
date: Mon Oct 18 09:58:14 2021 +0000
summary: Bug 1727084 - Memory64 - Expose the index type via js-types. r=yury
Try run on that patch: https://treeherder.mozilla.org/jobs?repo=try&revision=29ad95f7325351c973d8d4d61d5ea5e6e0769bc5&selectedTaskRun=GgOEvTsrTfq5MhGkr2gHxg.0 shows no failure.

changeset: ______:caca657178e4
user: Lars T Hansen <lhansen@mozilla.com>
date: Mon Oct 18 09:58:14 2021 +0000
summary: Bug 1727084 - Memory64 - Allow larger-than-4GB allocations. r=yury
Try run on that patch: https://treeherder.mozilla.org/jobs?repo=try&revision=1412f60882731b418d484c82cef651223d42fc7e shows no failure.

Oldest:
changeset: ____:83e52246d0ea
user: Lars T Hansen <lhansen@mozilla.com>
date: Mon Oct 18 09:58:13 2021 +0000
summary: Bug 1727084 - Memory64 - Huge-memory status depends on index type. r=yury
(Unknown)

Try run on the patch before the oldest patch: https://treeherder.mozilla.org/jobs?repo=try&revision=1470bbfa0c0dfd6bf40b1c4cc6fa0da9318e08ba&selectedTaskRun=MPGnd5CaQoqshprHrr8PBg.0 shows no failure, as expected and desired.

In conclusion, it looks like the bulk memory change created this problem.

Blocks: wasm64

The test run succeeds if I remove the bulk memory tests from large-memory.js, so I think we have a smoking gun. Tomorrow I'll try to narrow it down to one of memory.copy, memory.fill, memory.init, and memory.grow.
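As a starting point for that narrowing, here is a minimal isolation sketch, assuming the shell's wasmEvalText and assertEq helpers and a 64-bit build; the page count and offsets are illustrative rather than taken from large-memory.js:

  let ins = wasmEvalText(`
    (module
      (memory (export "mem") 32769)            ;; just over 2GB
      (func (export "fill") (param i32 i32 i32)
        (memory.fill (local.get 0) (local.get 1) (local.get 2)))
      (func (export "copy") (param i32 i32 i32)
        (memory.copy (local.get 0) (local.get 1) (local.get 2))))
  `);
  ins.exports.fill(0x7ffffff8, 0x55, 16);      // fill 16 bytes straddling the 2GB boundary
  ins.exports.copy(0, 0x7ffffff8, 16);         // copy them back to the start of memory
  assertEq(new Uint8Array(ins.exports.mem.buffer)[8], 0x55);

Analogous snippets with a data segment (memory.init) and Memory.prototype.grow would cover the other two suspects; whichever of these crashes on the affected workers should point at the faulty path.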

It's worth noting that there have been no failures for several days, and I can't repro on current central. It is possible that this is somehow related to bug [redacted] (now fixed), though it's a little hard to see precisely how. It's also possible that there have been no failures simply because there has been little activity over the weekend. But since I had been able to repro every time I tried with patches in that queue, and I can't now, it's possible that the problem has been fixed.

Oh, there's an important detail: the name of the job that fails is "OS X 11 WebRender Shippable opt test-macosx1100-64-shippable-qr/opt-jittest-1proc Jit" and the artifact is "macosx1100-64-shippable-qr". I took this to mean x64, but it is not: the OrangeFactor graph shows that all failures are on an M1 Mac Mini (in addition to the Pixel 2). That is, this is exclusively an arm64 bug, which makes it vastly more likely that bug [redacted] (now fixed) was the cause.

I'm not sure who to blame for the mixup here; in truth, macosx builds are multi-arch, so "64" is technically correct, even if confusing (to me anyhow).

Sebastian, re comment 12, the fact that this is an arm64 bug is very well hidden. Consider the failure on https://treeherder.mozilla.org/jobs?repo=try&revision=42ec18a0685d80b97690f133693826e914920e94&selectedTaskRun=YVqRvipjQ5utWHMJ64OhRA.0. If I select the failing run, I find no indication of architecture in any of the panes. If I inspect the task (from the meatball menu), ditto. And if I open the log and scroll to the top, I get this confusing collection of facts:

Worker Type (releng-hardware/gecko-t-osx-1100-m1) settings:
[taskcluster 2021-10-23T07:01:19.596Z]   {
[taskcluster 2021-10-23T07:01:19.596Z]     "arch": "x86_64",
[taskcluster 2021-10-23T07:01:19.596Z]     "config": {
[taskcluster 2021-10-23T07:01:19.596Z]       "deploymentId": ""
[taskcluster 2021-10-23T07:01:19.596Z]     },
[taskcluster 2021-10-23T07:01:19.596Z]     "disk_size": "228.27 GiB",
[taskcluster 2021-10-23T07:01:19.596Z]     "generic-worker": {
[taskcluster 2021-10-23T07:01:19.596Z]       "engine": "simple",
[taskcluster 2021-10-23T07:01:19.596Z]       "go-arch": "arm64",
[taskcluster 2021-10-23T07:01:19.596Z]       "go-os": "darwin",
[taskcluster 2021-10-23T07:01:19.596Z]       "go-version": "go1.16.4",
[taskcluster 2021-10-23T07:01:19.596Z]       "release": "https://github.com/taskcluster/taskcluster/releases/tag/v30.0.2",
[taskcluster 2021-10-23T07:01:19.596Z]       "revision": "6fdba0dad3ef52d4c547a794901f75b7171e3172",
[taskcluster 2021-10-23T07:01:19.596Z]       "source": "https://github.com/taskcluster/taskcluster/commits/6fdba0dad3ef52d4c547a794901f75b7171e3172",
[taskcluster 2021-10-23T07:01:19.596Z]       "version": "30.0.2"
[taskcluster 2021-10-23T07:01:19.596Z]     },
[taskcluster 2021-10-23T07:01:19.596Z]     "ip": "10.155.0.59",
[taskcluster 2021-10-23T07:01:19.596Z]     "machine-setup": {
[taskcluster 2021-10-23T07:01:19.596Z]       "config": "https://github.com/mozilla-platform-ops/ronin_puppet"
[taskcluster 2021-10-23T07:01:19.596Z]     },
[taskcluster 2021-10-23T07:01:19.596Z]     "memory": "16 GB",
[taskcluster 2021-10-23T07:01:19.596Z]     "model_identifier": "Macmini9,1",
[taskcluster 2021-10-23T07:01:19.596Z]     "processor_cores": "8",
[taskcluster 2021-10-23T07:01:19.596Z]     "processor_count": "1",
[taskcluster 2021-10-23T07:01:19.596Z]     "processor_name": "Unknown",
[taskcluster 2021-10-23T07:01:19.596Z]     "processor_speed": "2.4 GHz",
[taskcluster 2021-10-23T07:01:19.596Z]     "system_version": "macOS 11.2.3 (20D91)",
[taskcluster 2021-10-23T07:01:19.596Z]     "workerGroup": "macstadium-vegas",
[taskcluster 2021-10-23T07:01:19.596Z]     "workerId": "macmini-m1-49"
[taskcluster 2021-10-23T07:01:19.596Z]   }

There are several clues here that this is an arm64 / M1 machine, yet "arch" is plainly stated to be "x86_64".

This situation seems suboptimal. Where might I file a bug about making it harder to make the same mistake about the architecture that I did?

Flags: needinfo?(aryx.bugmail)
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → DUPLICATE

Sorry for the trouble. Please file a bug in Firefox Build System :: Task Configuration and needinfo glandium and CC me. Thank you.

Flags: needinfo?(aryx.bugmail)
Has Regression Range: --- → yes