Closed Bug 1897792 Opened 1 year ago Closed 1 year ago

Perma Android 13.0 Pixel5 AArch64 opt mochitest-webgl [tier 2] No tests run or test summary not found | TinderboxPrint: mochitest-webgl1-ext<br/><em class="testfail">T-FAIL</em>

Categories

(Core :: JavaScript Engine: JIT, defect, P1)

ARM64
All
defect

Tracking

()

RESOLVED FIXED
132 Branch
Tracking Status
firefox-esr115 131+ fixed
firefox-esr128 131+ fixed
firefox126 --- wontfix
firefox127 --- wontfix
firefox128 --- wontfix
firefox129 --- wontfix
firefox130 --- wontfix
firefox131 + fixed
firefox132 + fixed

People

(Reporter: intermittent-bug-filer, Assigned: jandem)

References

(Blocks 1 open bug)

Details

(Keywords: csectype-jit, sec-high, Whiteboard: [stockwell disable-recommended][adv-main131+r][adv-esr128.3+r][adv-esr115.16+r])

Attachments

(8 files, 2 obsolete files)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
9.44 KB, text/plain
Details
6.71 KB, text/plain
Details
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

Filed by: imoraru [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=458862299&repo=try
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/BBg0fpjeS0ialGqSE23Cxw/runs/0/artifacts/public/logs/live_backing.log


[task 2024-05-20T12:42:42.803Z] 12:42:39     INFO -  TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__ogles__GL__abs__abs_001_to_006.html | getError was expected value: NO_ERROR : there should be no errors
[task 2024-05-20T12:42:42.803Z] 12:42:39     INFO -  Buffered messages finished
[task 2024-05-20T12:42:42.803Z] 12:42:40     INFO -  0 INFO TEST-START | Shutdown
[task 2024-05-20T12:42:42.803Z] 12:42:40     INFO -  1 INFO Passed:  0
[task 2024-05-20T12:42:42.803Z] 12:42:40     INFO -  2 INFO Failed:  0
[task 2024-05-20T12:42:42.803Z] 12:42:40     INFO -  3 INFO Todo:    0
[task 2024-05-20T12:42:42.803Z] 12:42:40     INFO -  4 INFO Mode:    e10s
[task 2024-05-20T12:42:42.803Z] 12:42:40     INFO -  5 INFO SimpleTest FINISHED
[task 2024-05-20T12:42:42.803Z] 12:42:40     INFO -  Buffered messages finished
[task 2024-05-20T12:42:42.803Z] 12:42:40     INFO -  SUITE-END | took 44s
[task 2024-05-20T12:42:42.803Z] 12:42:40     INFO - Return code: 1
[task 2024-05-20T12:42:42.803Z] 12:42:40    ERROR - No tests run or test summary not found
[task 2024-05-20T12:42:42.803Z] 12:42:40     INFO - TinderboxPrint: mochitest-webgl1-ext<br/><em class="testfail">T-FAIL</em>
[task 2024-05-20T12:42:42.803Z] 12:42:40     INFO - ##### mochitest-webgl1-ext log ends
[task 2024-05-20T12:42:42.803Z] 12:42:40  WARNING - setting return code to 1
[task 2024-05-20T12:42:42.803Z] 12:42:40     INFO - The mochitest-webgl1-ext suite: mochitest-webgl1-ext ran with return status: WARNING
[task 2024-05-20T12:42:42.803Z] 12:42:40     INFO - Running post-action listener: _package_coverage_data
[task 2024-05-20T12:42:42.803Z] 12:42:40     INFO - Running post-action listener: _resource_record_post_action
[task 2024-05-20T12:42:42.803Z] 12:42:40     INFO - Running post-action listener: process_java_coverage_data
[task 2024-05-20T12:42:42.803Z] 12:42:40     INFO - Running post-action listener: stop_device
[task 2024-05-20T12:42:42.803Z] 12:42:42     INFO - /data/tombstones/tombstone_00 deleted
[task 2024-05-20T12:42:42.803Z] 12:42:42     INFO - /data/tombstones/tombstone_00.pb deleted
[task 2024-05-20T12:42:42.803Z] 12:42:42     INFO - Killing logcat pid 820.
[task 2024-05-20T12:42:42.803Z] 12:42:42     INFO - [mozharness: 2024-05-20 12:42:42.631880Z] Finished run-tests step (success)
[task 2024-05-20T12:42:42.803Z] 12:42:42     INFO - Running post-run listener: _resource_record_post_run
[task 2024-05-20T12:42:47.860Z] 12:42:42     INFO - Total resource usage - Wall time: 67s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2024-05-20T12:42:47.860Z] 12:42:42     INFO - TinderboxPrint: I/O read bytes / time<br/>0 / 0
[task 2024-05-20T12:42:47.860Z] 12:42:42     INFO - TinderboxPrint: I/O write bytes / time<br/>0 / 0
[task 2024-05-20T12:42:47.860Z] 12:42:42     INFO - TinderboxPrint: Swap in / out<br/>0 / 0
[task 2024-05-20T12:42:47.860Z] 12:42:42     INFO - verify-device - Wall time: 4s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2024-05-20T12:42:47.860Z] 12:42:42     INFO - install - Wall time: 8s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2024-05-20T12:42:47.860Z] 12:42:42     INFO - run-tests - Wall time: 52s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2024-05-20T12:42:47.860Z] 12:42:42  WARNING - returning nonzero exit status 1
[task 2024-05-20T12:42:47.860Z] cleanup

Observed in today's central-as-early-beta simulation, it only happened on Android 13.0 Pixel5 AArch64 opt Mochitests without fission enabled.

This seems to be permafailing since the central as early beta simulation from May 16th.

Hi @owlish! Can you please take a look at this?
Thank you!

Flags: needinfo?(bugzeeeeee)
Summary: Perma mochitest-webgl [tier 2] No tests run or test summary not found | TinderboxPrint: mochitest-webgl1-ext<br/><em class="testfail">T-FAIL</em> → Perma Android 13.0 Pixel5 AArch64 opt mochitest-webgl [tier 2] No tests run or test summary not found | TinderboxPrint: mochitest-webgl1-ext<br/><em class="testfail">T-FAIL</em> when Gecko 128 merges to beta on 2024-06-10

Do we have a rough idea of when this started?

Component: General → Graphics: CanvasWebGL
Flags: needinfo?(imoraru)
Product: GeckoView → Core
Severity: S4 → --
Priority: P5 → --
Severity: -- → S3

Test suite execution ends during the execution of a test:

    INFO -  TEST-START | dom/canvas/test/webgl-conf/generated/test_conformance__ogles__GL__abs__abs_001_to_006.html
    INFO -  wait for org.mozilla.geckoview.test_runner complete; top activity=com.google.android.apps.nexuslauncher
    INFO -  runtestsremote.py | Application ran for: 0:00:34.386762

Today's run stops executing the suite during the next test but it lists errors during the execution of the test started above. That run might have failed to handle the issue.

Could an issue in the environment be triggered, e.g. by high memory usage?

Pushlog between last good and first bad central-as-beta simulation.

Flags: needinfo?(imoraru) → needinfo?(aaron.train)

I was running in to this on try and thought it was due to my patches before I spotted this bug. You can see in the logcat that the content process is crashing, although there are no symbols. So I did a push without stripping symbols, which gives this:

*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
Build fingerprint: 'google/redfin/redfin:13/TP1A.220905.004/8927612:user/release-keys'
Revision: 'MP1.0'
ABI: 'arm64'
Timestamp: 2024-05-21 20:31:37.562276831+0000
Process uptime: 8s
Cmdline: org.mozilla.geckoview.test_runner:tab12
pid: 11913, tid: 11946, name: Web Content  >>> org.mozilla.geckoview.test_runner:tab12 <<<
uid: 10239
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x0001ffffffffffff
    x0  0000007af5d35700  x1  0000000000000004  x2  0000007b145328a0  x3  0000000000000001
    x4  0000000000000000  x5  0000000000000004  x6  0000007b14532978  x7  68796e6c2d66716e
    x8  0000007b14532a60  x9  0000007b14532798  x10 0000007b14532780  x11 0000007b14532890
    x12 0000007a6f9380b0  x13 ffffffffffffffff  x14 0000007a57890020  x15 0000007a591f4fb0
    x16 0000000000000000  x17 0000007b14532950  x18 0000007b138ac000  x19 fff9800000000000
    x20 0000007af5d35700  x21 0000007b145328a0  x22 0000007b145327a8  x23 0000007b14532790
    x24 0000007a6f9380b4  x25 0001ffffffffffff  x26 0000007b14536000  x27 0000007af5d35768
    x28 0000007af5d35718  x29 0000007b14532830
    lr  0000007b0742fe3c  sp  0000007b14532770  pc  0000007b072777c8  pst 0000000020001000
backtrace:
      #00 pc 0000000007a6e7c8  /data/app/~~G7OFszoupsjRuJWkID0I8A==/org.mozilla.geckoview.test_runner-izkRhys_PXkK308bbTW90Q==/lib/arm64/libxul.so (js::ToPrimitiveSlow(JSContext*, JSType, JS::MutableHandle<JS::Value>)+156) (BuildId: 71936ccbb31c7045bd666a24784c8283d737d0d2)
      #01 pc 0000000007c26e38  /data/app/~~G7OFszoupsjRuJWkID0I8A==/org.mozilla.geckoview.test_runner-izkRhys_PXkK308bbTW90Q==/lib/arm64/libxul.so (js::ToNumberSlow(JSContext*, JS::Handle<JS::Value>, double*)+92) (BuildId: 71936ccbb31c7045bd666a24784c8283d737d0d2)
      #02 pc 0000000007b1e538  /data/app/~~G7OFszoupsjRuJWkID0I8A==/org.mozilla.geckoview.test_runner-izkRhys_PXkK308bbTW90Q==/lib/arm64/libxul.so (js::ElementSpecific<float, js::UnsharedOps>::initFromIterablePackedArray(JSContext*, JS::Handle<js::FixedLengthTypedArrayObject*>, JS::Handle<js::ArrayObject*>)+848) (BuildId: 71936ccbb31c7045bd666a24784c8283d737d0d2)
      #03 pc 0000000007b5c1d4  /data/app/~~G7OFszoupsjRuJWkID0I8A==/org.mozilla.geckoview.test_runner-izkRhys_PXkK308bbTW90Q==/lib/arm64/libxul.so ((anonymous namespace)::TypedArrayObjectTemplate<float>::fromArray(JSContext*, JS::Handle<JSObject*>, JS::Handle<JSObject*>)+1644) (BuildId: 71936ccbb31c7045bd666a24784c8283d737d0d2)
      #04 pc 0000000000007efc  [anon:js-executable-memory]

Perhaps that might help identify something in the pushlog

It's interesting that bug 1894716 didn't break the test on nightly but only in beta.

Let me see what I can find out - it's likely a timing change that's breaking this test.

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #3)

Today's run stops executing the suite during the next test but it lists errors during the execution of the test started above. That run might have failed to handle the issue.

Yes, this is interesting.
For this run we end up failing the pixel compares.

[task 2024-05-21T12:19:16.258Z] 12:19:16  WARNING -  TEST-UNEXPECTED-FAIL | dom/canvas/test/webgl-conf/generated/test_conformance__ogles__GL__abs__abs_001_to_006.html | pixel @ (412, 253 was [191,0,0,255] expected [98,0,0,255]
[task 2024-05-21T12:19:16.258Z] 12:19:16     INFO -      SimpleTest.ok@SimpleTest/SimpleTest.js:426:16
[task 2024-05-21T12:19:16.258Z] 12:19:16     INFO -      reportResults@dom/canvas/test/webgl-conf/mochi-single.html?checkout/conformance/ogles/GL/abs/abs_001_to_006.html:22:14
[task 2024-05-21T12:19:16.258Z] 12:19:16     INFO -      reportTestResultsToHarness@dom/canvas/test/webgl-conf/checkout/js/js-test-pre.js:108:22

I'm trying to narrow this down, but it would be helpful if we had someone who was familiar with these webGL tests.
Bug 1894716 optimized gecko on android for speed ("-O2"), which shifted the timing of many operations.

Kelsey: This one specific Android webgl mochitest task fails permanently for this one test platform.

  1. Is there anything webgl specific which would explain why the summary at the end does not count already executed tests? (If not, jmaher should be the next person to ask about it from a mochitest POV.)
  2. The test failures from the timing change in bug 1894716: could you investigate them?
    There should be one bug for each issue, please share for which one this bug should be used.
Flags: needinfo?(bugzeeeeee) → needinfo?(jgilbert)
Flags: needinfo?(aaron.train)

In this push I've enabled logging in the OpenGLESTestRunner.

I'm seeing pixel failures in each test, e.g.

 dom/canvas/test/webgl-conf/generated/test_conformance__ogles__GL__abs__abs_001_to_006.html | pixel @ (412, 253 was [191,0,0,255] expected [98,0,0,255]

The SIGSEGV follows shortly after.
There might be more details in the logs, still looking.

Without knowledge of these tests, I'm not sure I can help in any meaningful way (but do let me know if I can).

In summary, with the faster binary and logs enabled on the OpenGLESTestRunner, we consistently see this test fail for that one platform:

TEST-UNEXPECTED-FAIL | dom/canvas/test/webgl-conf/generated/test_conformance__ogles__GL__abs__abs_001_to_006.html | pixel @ (412, 253 was [98,0,0,255] expected [191,0,0,255]

It's not clear to me why geckoview crashes after the test failure, as in general that doesn't happen when these tests fail (induced example here).

But finding out why the pixel test fails seems like a good next step.

I don't think this needs to track, it's just a test failure.

Assignee: nobody → aryx.bugmail
Status: NEW → ASSIGNED
Pushed by archaeopteryx@coole-files.de: https://hg.mozilla.org/integration/autoland/rev/93963136c6fe don't run mochitest webgl1-ext as chunked tasks. r=jmaher https://hg.mozilla.org/integration/autoland/rev/a4db4d624dfe skip test_conformance__ogles__GL__abs__abs_001_to_006.html on Android debug for terminating test suite. r=jgilbert https://hg.mozilla.org/integration/autoland/rev/bb4f8ae9bdfe drop unsupported or unused in CI operating systems from documentation in mochitest-errata.toml. r=jgilbert https://hg.mozilla.org/integration/autoland/rev/1c8e49ce3fa1 update mochitest-errata file extension in documentation from .ini to .toml. r=jgilbert

Since nightly and release are affected, beta will likely be affected too.
For more information, please visit BugBot documentation.

The issue persists.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 130 Branch → ---

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #8)

Kelsey: This one specific Android webgl mochitest task fails permanently for this one test platform.

  1. Is there anything webgl specific which would explain why the summary at the end does not count already executed tests? (If not, jmaher should be the next person to ask about it from a mochitest POV.)

No, I don't know why the summary isn't counting things.

  1. The test failures from the timing change in bug 1894716: could you investigate them?
    There should be one bug for each issue, please share for which one this bug should be used.

I don't think that this is worth investigating, and I would prefer to just disable it on these machines.

Flags: needinfo?(jgilbert)

I took a quick poke at this, but it seems that skipping one test just moves the failures to the next one.

Attachment #9413714 - Attachment is obsolete: true
Summary: Perma Android 13.0 Pixel5 AArch64 opt mochitest-webgl [tier 2] No tests run or test summary not found | TinderboxPrint: mochitest-webgl1-ext<br/><em class="testfail">T-FAIL</em> when Gecko 128 merges to beta on 2024-06-10 → Perma Android 13.0 Pixel5 AArch64 opt mochitest-webgl [tier 2] No tests run or test summary not found | TinderboxPrint: mochitest-webgl1-ext<br/><em class="testfail">T-FAIL</em>

FWIW I can reproduce this not just on a Pixel 5, but on a Pixel 8 too. Presumably it affects all aarch64 android devices. The backtrace (which I will attach as it's long) shows we are crashing in JS engine code.

I took a quick poke at this, but it seems that skipping one test just moves the failures to the next one.

The next one presumably being acos_001_to_006.html, followed by various other 001_to00<n> tests. The thing they have in common is that they use OpenGLESTestRunner.

Looking at abs_001_to_006.html in particular, the first 3 subtests alone don't cause the crash, but any of the last 3 subtests alone do. They key difference is that they use "model": "grid" rather than "model": null. The issue appears to be with the drawGrid() function.

Attached file Backtrace

Based on the stack this looks like a javascript / gc bug. Moving to that component, hopefully somebody more familiar with the code can take it from here

Component: Graphics: CanvasWebGL → JavaScript: GC

Based on the stack this looks like reading from a bad JSObject pointer.

    [Inlined] js::gc::HeaderWord::get() const Cell.h:106
    [Inlined] js::gc::CellWithTenuredGCPointer::headerPtr() const Cell.h:799
    [Inlined] JSObject::shape() const JSObject.h:93
    [Inlined] JSObject::is<…>() const NativeObject.h:1887
    [Inlined] JSObject::maybeHasInterestingSymbolProperty() const JSObject-inl.h:240
    [Inlined] js::MaybeHasInterestingSymbolProperty(JSContext *, JSObject *, JS::Symbol *, JSObject **) ObjectOperations-inl.h:244
    [Inlined] js::GetInterestingSymbolProperty(JSContext *, JS::Handle<…>, JS::Symbol *, JS::MutableHandle<…>) ObjectOperations-inl.h:264
    js::ToPrimitiveSlow(JSContext *, JSType, JS::MutableHandle<…>) JSObject.cpp:2464
    [Inlined] js::ToPrimitive(JSContext *, JSType, JS::MutableHandle<…>) JSObject.h:778
    js::ToNumberSlow(JSContext *, JS::Handle<…>, double *) jsnum.cpp:2026
    [Inlined] JS::ToNumber(JSContext *, JS::Handle<…>, double *) Conversions.h:139
    [Inlined] js::ElementSpecific::valueToNative(JSContext *, JS::Handle<…>, float *) TypedArrayObject-inl.h:696
    js::ElementSpecific::initFromIterablePackedArray(JSContext *, JS::Handle<…>, JS::Handle<…>) TypedArrayObject-inl.h:573
    [Inlined] TypedArrayObjectTemplate::fromObject(JSContext *, JS::Handle<…>, JS::Handle<…>) TypedArrayObject.cpp:1388
    TypedArrayObjectTemplate::fromArray(JSContext *, JS::Handle<…>, JS::Handle<…>) TypedArrayObject.cpp:1237
    <unknown> 0x000000778f578040
...
    <unknown> 0x000000778f5707bc
    [Inlined] EnterJit(JSContext *, js::RunState &, unsigned char *) Jit.cpp:115
    js::jit::MaybeEnterJit(JSContext *, js::RunState &) Jit.cpp:261
    js::RunScript(JSContext *, js::RunState &) Interpreter.cpp:453
Component: JavaScript: GC → JavaScript Engine
Flags: needinfo?(acreskey)

Certainly seems like if we could debug this there might be something interesting happening; I'd be particularly curious what the object pointer is that has been passed to GetInterestingSymbolProperty (which ultimately is what gets dereferenced at the crash point)

Jamie: If you can still reproduce this (my android environment hasn't been spun up in a year plus at the moment) I'd be really interested in the value of the JSObject passed to GetInterestingSymbolProperty; specifically I'm curious if it's actually some sort of poison value?

Flags: needinfo?(jnicol)
Priority: -- → P3

I think it was 0x1fffffffffffffff, which I believe corresponds to Value::as_bits_ == 0xffffffffffffffff.

I also traced up the stack to initFromIterablePackedArray() here and found that it's always element 65548 of the values array that is 0xffffffffffffffff. The values before and after seem normal, and in debug builds that element is also normal.

In some of the other tests which use this function we get a value of 0x0 instead, which doesn't cause a crash but causes the test to fail.

Prior to this it seemed as if NativeObject::growElements was realloc()ing the array. that's as far as I got

Flags: needinfo?(jnicol)
Attached file drawGrid.js (obsolete) —
65548 is 0x1000c, which is suspiciously close to a round number. Tracing through `initFromIterablePackedArray`, it does look like this is exactly how we would crash if we somehow ended up with Value::fromRawBits(-1) in a big array. But that just pushes the question back one level: how did we do that? Assuming that comment 33 correctly identified the crashing file, I'm pretty sure the `initFromIterablePackedArray` has to correspond to the `new Float32Array(buffer.data)` [here](https://searchfox.org/mozilla-central/rev/0c55d51c0d2a9b672e42ad40ea54f90267f92a8e/dom/canvas/test/webgl-conf/checkout/conformance/ogles/ogles-utils.js#115). It looks like drawGrid is being called [here](https://searchfox.org/mozilla-central/rev/e942f7bc56cd103bba86b396ddeba5b1ab04f1a4/dom/canvas/test/webgl-conf/checkout/conformance/ogles/ogles-utils.js#531) with width, height, and attribs all being derived from `gl`. The canvas height and width appear to be [fixed at 500](https://searchfox.org/mozilla-central/rev/e942f7bc56cd103bba86b396ddeba5b1ab04f1a4/dom/canvas/test/webgl-conf/checkout/conformance/ogles/ogles-utils.js#763). So naively I would assume that the following cutdown testcase should replicate the bug:

Talked about this bug with Jan, so needinfoing him -- we think there's perhaps some investigation to be done here.

What you describe is definitely interesting. Do you know if this can be reproduced in the simulator(emulator?)

If you can reproduce this, we might also be able to produce some debugging-assistance patches too. E.g, for the case you describe, saving the last element of the elements array on the stack, and validating that realloc correctly preserves the contents of the array when running. At the very least it would help point more directly to the root cause, even if it's an Android OS issue.

Oh and there's iain's test case now too :)

Flags: needinfo?(jnicol)
Flags: needinfo?(jdemooij)
Attachment #9420167 - Attachment is obsolete: true
Attached file drawGrid.js

Some initial thoughts:

  • According to comments here, this started after bug 1894716 landed. That bug changed compiler flags for the Android build, so it could be a Clang/LLVM issue.
  • Running the test locally and dumping JS stacks from initFromIterablePackedArray, I see some arrays with length 73728 with the following stack:
http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/conformance/ogles/ogles-utils.js:115 (1385b4c2d060 @ 218)
http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/conformance/ogles/ogles-utils.js:389 (1385b4c2d150 @ 3028)
http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/conformance/ogles/ogles-utils.js:531 (1385b4c2d1a0 @ 2119)
http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/conformance/ogles/ogles-utils.js:578 (1385b4c2d330 @ 408)
http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/conformance/ogles/ogles-utils.js:551 (1385b4c2d2e0 @ 27)
http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/js/webgl-test-utils.js:1937 (1385b4c26920 @ 214)
  • This matches Iain's analysis in comment 43 for the location of the initFromIterablePackedArray call.
  • Object elements have two reserved Values for the elements header. Elements allocations here should be powers of two, so for a length of 73728 we went from 65536 to 131072 elements including the header.
  • Element 65548 would have been stored shortly after growing from 65536. If the array contains a bogus value at this index, it either was a store of an invalid Value or the store code itself was buggy (probably in JIT code). Or we didn't store anything and this is uninitialized memory. Or something else corrupted this memory after we stored a valid JS value there.

If I look at element 65548 in initFromIterablePackedArray, I get the following different arrays multiple times on my desktop machine:

Element 65547 = double 0.1015625 (0x3fba000000000000)
Element 65548 = double 0.890625 (0x3fec800000000000)
Element 65549 = double 0.90625 (0x3fed000000000000)
Element 65547 = double 0.78125 (0x3fe9000000000000)
Element 65548 = double 0.8125 (0x3fea000000000000)
Element 65549 = int32 -2

The second one has an int32 value -2 for every third element. This likely comes from this code in drawGrid:

var z = -2.0;
...
      gridVertices[currentVertex++] = x1;
      gridVertices[currentVertex++] = y1;
      gridVertices[currentVertex++] = z;

The first one is probably the gridColors array but I'm not completely sure.

Jamie, I'm curious if one of these matches the values you saw on Android.

I can do some Try debugging to see if I can narrow it down more, assuming it reproduces somewhat reliably.

This test also crashes a local mozilla-beta build on my M1.

Group: javascript-core-security
Assignee: aryx.bugmail → jdemooij
Status: REOPENED → ASSIGNED

I think this is a problem with our definition of non-volatile float registers for ARM64 here.

We're claiming that v8-v30 are non-volatile, so preserved across a call (v31 is a scratch register). However the docs say this:

Registers v8-v15 must be preserved by a callee across subroutine calls; the remaining registers (v0-v7, v16-v31) do not need to be preserved (or should be preserved by the caller). Additionally, only the bottom 64 bits of each value stored in v8-v15 need to be preserved [8]; it is the responsibility of the caller to preserve larger values.

So for registers v16-v30 we're incorrectly assuming calls to C++ code will preserve them.

This particular test case has a ton of live float registers and C++ code clobbers d26 and stores a NaN value in it when growing array elements. Removing that register from NonVolatileSingleMask fixes the test for me.

Flags: needinfo?(jnicol)
See Also: → 1914979
Component: JavaScript Engine → JavaScript Engine: JIT
Flags: needinfo?(jdemooij)
Keywords: sec-high
OS: Android → All
Hardware: All → ARM64
Severity: S3 → S2
Priority: P3 → P1

Yooo! Super exciting to see root-causes and fixes to "no idea what's going wrong here"-type bugs! Thanks!

Comment on attachment 9420838 [details]
Bug 1897792 - Change NonVolatileSingleMask on ARM64. r?iain!

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: Very hard. It also depends on floating-point registers used in Clang/LLVM generated code.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
  • Which branches (beta, release, and/or ESR) are affected by this flaw, and do the release status flags reflect this affected/unaffected state correctly?: All
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: Yes
  • If not, how different, hard to create, and risky will they be?: Patch should apply.
  • How likely is this patch to cause regressions; how much testing does it need?: Unlikely. It only affects arm64 builds and only affects code that uses a lot of different floating-point values.
  • Is the patch ready to land after security approval is given?: Yes
  • Is Android affected?: Yes
Attachment #9420838 - Flags: sec-approval?

Comment on attachment 9420838 [details]
Bug 1897792 - Change NonVolatileSingleMask on ARM64. r?iain!

Approved to land and uplift

Attachment #9420838 - Flags: sec-approval? → sec-approval+
Whiteboard: [stockwell disable-recommended] → [stockwell disable-recommended][reminder-test 2024-11-12]

Comment on attachment 9420838 [details]
Bug 1897792 - Change NonVolatileSingleMask on ARM64. r?iain!

Beta/Release Uplift Approval Request

  • User impact if declined: Crashes, potential security bugs.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Only affects ARM64 where it just marks some floating point registers as volatile instead of non-volatile in the JIT compilers.
  • String changes made/needed: N/A
  • Is Android affected?: Yes
Attachment #9420838 - Flags: approval-mozilla-esr128?
Attachment #9420838 - Flags: approval-mozilla-esr115?
Attachment #9420838 - Flags: approval-mozilla-beta?
Pushed by jdemooij@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e8d5adc23a97 Change NonVolatileSingleMask on ARM64. r=iain
Group: javascript-core-security → core-security-release
Status: ASSIGNED → RESOLVED
Closed: 1 year ago1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 132 Branch

Comment on attachment 9420838 [details]
Bug 1897792 - Change NonVolatileSingleMask on ARM64. r?iain!

Approved for 131.0b4

Attachment #9420838 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9420838 [details]
Bug 1897792 - Change NonVolatileSingleMask on ARM64. r?iain!

Approved for 128.3esr

Attachment #9420838 - Flags: approval-mozilla-esr128? → approval-mozilla-esr128+

Comment on attachment 9420838 [details]
Bug 1897792 - Change NonVolatileSingleMask on ARM64. r?iain!

Approved for 115.16esr

Attachment #9420838 - Flags: approval-mozilla-esr115? → approval-mozilla-esr115+
QA Whiteboard: [post-critsmash-triage]
Regressions: 1917628
Whiteboard: [stockwell disable-recommended][reminder-test 2024-11-12] → [stockwell disable-recommended][reminder-test 2024-11-12][adv-main131+r]
Whiteboard: [stockwell disable-recommended][reminder-test 2024-11-12][adv-main131+r] → [stockwell disable-recommended][reminder-test 2024-11-12][adv-main131+r][adv-esr128.3+r]
Whiteboard: [stockwell disable-recommended][reminder-test 2024-11-12][adv-main131+r][adv-esr128.3+r] → [stockwell disable-recommended][reminder-test 2024-11-12][adv-main131+r][adv-esr128.3+r][adv-esr115.16+r]

2 months ago, tjr placed a reminder on the bug using the whiteboard tag [reminder-test 2024-11-12] .

jandem, please refer to the original comment to better understand the reason for the reminder.

Flags: needinfo?(jdemooij)
Whiteboard: [stockwell disable-recommended][reminder-test 2024-11-12][adv-main131+r][adv-esr128.3+r][adv-esr115.16+r] → [stockwell disable-recommended][adv-main131+r][adv-esr128.3+r][adv-esr115.16+r]
See Also: → 1919803
Attachment #9420839 - Attachment description: Bug 1897792 - Add test and simulator volatile register poisoning. r?iain! → Bug 1897792 - Add test. r?iain!
Flags: needinfo?(jdemooij)
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: