Perma Android 13.0 Pixel5 AArch64 opt mochitest-webgl [tier 2] No tests run or test summary not found | TinderboxPrint: mochitest-webgl1-ext<br/><em class="testfail">T-FAIL</em>
Categories
(Core :: JavaScript Engine: JIT, defect, P1)
Tracking
()
People
(Reporter: intermittent-bug-filer, Assigned: jandem)
References
(Blocks 1 open bug)
Details
(Keywords: csectype-jit, sec-high, Whiteboard: [stockwell disable-recommended][adv-main131+r][adv-esr128.3+r][adv-esr115.16+r])
Attachments
(8 files, 2 obsolete files)
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
Bug 1897792 - update mochitest-errata file extension in documentation from .ini to .toml. r=jgilbert
48 bytes,
text/x-phabricator-request
|
Details | Review | |
9.44 KB,
text/plain
|
Details | |
6.71 KB,
text/plain
|
Details | |
48 bytes,
text/x-phabricator-request
|
diannaS
:
approval-mozilla-beta+
diannaS
:
approval-mozilla-esr115+
diannaS
:
approval-mozilla-esr128+
tjr
:
sec-approval+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
Details | Review |
Filed by: imoraru [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=458862299&repo=try
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/BBg0fpjeS0ialGqSE23Cxw/runs/0/artifacts/public/logs/live_backing.log
[task 2024-05-20T12:42:42.803Z] 12:42:39 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__ogles__GL__abs__abs_001_to_006.html | getError was expected value: NO_ERROR : there should be no errors
[task 2024-05-20T12:42:42.803Z] 12:42:39 INFO - Buffered messages finished
[task 2024-05-20T12:42:42.803Z] 12:42:40 INFO - 0 INFO TEST-START | Shutdown
[task 2024-05-20T12:42:42.803Z] 12:42:40 INFO - 1 INFO Passed: 0
[task 2024-05-20T12:42:42.803Z] 12:42:40 INFO - 2 INFO Failed: 0
[task 2024-05-20T12:42:42.803Z] 12:42:40 INFO - 3 INFO Todo: 0
[task 2024-05-20T12:42:42.803Z] 12:42:40 INFO - 4 INFO Mode: e10s
[task 2024-05-20T12:42:42.803Z] 12:42:40 INFO - 5 INFO SimpleTest FINISHED
[task 2024-05-20T12:42:42.803Z] 12:42:40 INFO - Buffered messages finished
[task 2024-05-20T12:42:42.803Z] 12:42:40 INFO - SUITE-END | took 44s
[task 2024-05-20T12:42:42.803Z] 12:42:40 INFO - Return code: 1
[task 2024-05-20T12:42:42.803Z] 12:42:40 ERROR - No tests run or test summary not found
[task 2024-05-20T12:42:42.803Z] 12:42:40 INFO - TinderboxPrint: mochitest-webgl1-ext<br/><em class="testfail">T-FAIL</em>
[task 2024-05-20T12:42:42.803Z] 12:42:40 INFO - ##### mochitest-webgl1-ext log ends
[task 2024-05-20T12:42:42.803Z] 12:42:40 WARNING - setting return code to 1
[task 2024-05-20T12:42:42.803Z] 12:42:40 INFO - The mochitest-webgl1-ext suite: mochitest-webgl1-ext ran with return status: WARNING
[task 2024-05-20T12:42:42.803Z] 12:42:40 INFO - Running post-action listener: _package_coverage_data
[task 2024-05-20T12:42:42.803Z] 12:42:40 INFO - Running post-action listener: _resource_record_post_action
[task 2024-05-20T12:42:42.803Z] 12:42:40 INFO - Running post-action listener: process_java_coverage_data
[task 2024-05-20T12:42:42.803Z] 12:42:40 INFO - Running post-action listener: stop_device
[task 2024-05-20T12:42:42.803Z] 12:42:42 INFO - /data/tombstones/tombstone_00 deleted
[task 2024-05-20T12:42:42.803Z] 12:42:42 INFO - /data/tombstones/tombstone_00.pb deleted
[task 2024-05-20T12:42:42.803Z] 12:42:42 INFO - Killing logcat pid 820.
[task 2024-05-20T12:42:42.803Z] 12:42:42 INFO - [mozharness: 2024-05-20 12:42:42.631880Z] Finished run-tests step (success)
[task 2024-05-20T12:42:42.803Z] 12:42:42 INFO - Running post-run listener: _resource_record_post_run
[task 2024-05-20T12:42:47.860Z] 12:42:42 INFO - Total resource usage - Wall time: 67s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2024-05-20T12:42:47.860Z] 12:42:42 INFO - TinderboxPrint: I/O read bytes / time<br/>0 / 0
[task 2024-05-20T12:42:47.860Z] 12:42:42 INFO - TinderboxPrint: I/O write bytes / time<br/>0 / 0
[task 2024-05-20T12:42:47.860Z] 12:42:42 INFO - TinderboxPrint: Swap in / out<br/>0 / 0
[task 2024-05-20T12:42:47.860Z] 12:42:42 INFO - verify-device - Wall time: 4s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2024-05-20T12:42:47.860Z] 12:42:42 INFO - install - Wall time: 8s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2024-05-20T12:42:47.860Z] 12:42:42 INFO - run-tests - Wall time: 52s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2024-05-20T12:42:47.860Z] 12:42:42 WARNING - returning nonzero exit status 1
[task 2024-05-20T12:42:47.860Z] cleanup
Comment 1•1 year ago
•
|
||
Observed in today's central-as-early-beta simulation, it only happened on Android 13.0 Pixel5 AArch64 opt Mochitests without fission enabled.
This seems to be permafailing since the central as early beta simulation from May 16th.
Hi @owlish! Can you please take a look at this?
Thank you!
Updated•1 year ago
|
Comment 2•1 year ago
|
||
Do we have a rough idea of when this started?
Updated•1 year ago
|
Updated•1 year ago
|
![]() |
||
Comment 3•1 year ago
|
||
Test suite execution ends during the execution of a test:
INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_conformance__ogles__GL__abs__abs_001_to_006.html
INFO - wait for org.mozilla.geckoview.test_runner complete; top activity=com.google.android.apps.nexuslauncher
INFO - runtestsremote.py | Application ran for: 0:00:34.386762
Today's run stops executing the suite during the next test but it lists errors during the execution of the test started above. That run might have failed to handle the issue.
Could an issue in the environment be triggered, e.g. by high memory usage?
Pushlog between last good and first bad central-as-beta simulation.
Comment 4•1 year ago
|
||
I was running in to this on try and thought it was due to my patches before I spotted this bug. You can see in the logcat that the content process is crashing, although there are no symbols. So I did a push without stripping symbols, which gives this:
*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
Build fingerprint: 'google/redfin/redfin:13/TP1A.220905.004/8927612:user/release-keys'
Revision: 'MP1.0'
ABI: 'arm64'
Timestamp: 2024-05-21 20:31:37.562276831+0000
Process uptime: 8s
Cmdline: org.mozilla.geckoview.test_runner:tab12
pid: 11913, tid: 11946, name: Web Content >>> org.mozilla.geckoview.test_runner:tab12 <<<
uid: 10239
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x0001ffffffffffff
x0 0000007af5d35700 x1 0000000000000004 x2 0000007b145328a0 x3 0000000000000001
x4 0000000000000000 x5 0000000000000004 x6 0000007b14532978 x7 68796e6c2d66716e
x8 0000007b14532a60 x9 0000007b14532798 x10 0000007b14532780 x11 0000007b14532890
x12 0000007a6f9380b0 x13 ffffffffffffffff x14 0000007a57890020 x15 0000007a591f4fb0
x16 0000000000000000 x17 0000007b14532950 x18 0000007b138ac000 x19 fff9800000000000
x20 0000007af5d35700 x21 0000007b145328a0 x22 0000007b145327a8 x23 0000007b14532790
x24 0000007a6f9380b4 x25 0001ffffffffffff x26 0000007b14536000 x27 0000007af5d35768
x28 0000007af5d35718 x29 0000007b14532830
lr 0000007b0742fe3c sp 0000007b14532770 pc 0000007b072777c8 pst 0000000020001000
backtrace:
#00 pc 0000000007a6e7c8 /data/app/~~G7OFszoupsjRuJWkID0I8A==/org.mozilla.geckoview.test_runner-izkRhys_PXkK308bbTW90Q==/lib/arm64/libxul.so (js::ToPrimitiveSlow(JSContext*, JSType, JS::MutableHandle<JS::Value>)+156) (BuildId: 71936ccbb31c7045bd666a24784c8283d737d0d2)
#01 pc 0000000007c26e38 /data/app/~~G7OFszoupsjRuJWkID0I8A==/org.mozilla.geckoview.test_runner-izkRhys_PXkK308bbTW90Q==/lib/arm64/libxul.so (js::ToNumberSlow(JSContext*, JS::Handle<JS::Value>, double*)+92) (BuildId: 71936ccbb31c7045bd666a24784c8283d737d0d2)
#02 pc 0000000007b1e538 /data/app/~~G7OFszoupsjRuJWkID0I8A==/org.mozilla.geckoview.test_runner-izkRhys_PXkK308bbTW90Q==/lib/arm64/libxul.so (js::ElementSpecific<float, js::UnsharedOps>::initFromIterablePackedArray(JSContext*, JS::Handle<js::FixedLengthTypedArrayObject*>, JS::Handle<js::ArrayObject*>)+848) (BuildId: 71936ccbb31c7045bd666a24784c8283d737d0d2)
#03 pc 0000000007b5c1d4 /data/app/~~G7OFszoupsjRuJWkID0I8A==/org.mozilla.geckoview.test_runner-izkRhys_PXkK308bbTW90Q==/lib/arm64/libxul.so ((anonymous namespace)::TypedArrayObjectTemplate<float>::fromArray(JSContext*, JS::Handle<JSObject*>, JS::Handle<JSObject*>)+1644) (BuildId: 71936ccbb31c7045bd666a24784c8283d737d0d2)
#04 pc 0000000000007efc [anon:js-executable-memory]
Perhaps that might help identify something in the pushlog
Comment 5•1 year ago
|
||
This failure appeared on beta after the uplift of Bug 1894716.
https://treeherder.mozilla.org/jobs?repo=mozilla-beta&group_state=expanded&fromchange=c53a1221d79f4b2f3492262364140539a3ba8ac9&searchStr=android%2C13.0%2Cpixel5%2Caarch64%2Copt%2Cmochitests%2Cwithout%2Cfission%2Cenabled%2Ctest-android-hw-p5-13-0-android-aarch64-qr%2Fopt-geckoview-mochitest-webgl1-ext-nofis%2Cgl1e2&tochange=845df5851b8c01ecca7830d5227f23516f3c21fb&selectedTaskRun=HtkBdjtCRA6msadrJ-WmTQ.0
Can you please take a look?
Comment 6•1 year ago
|
||
It's interesting that bug 1894716 didn't break the test on nightly but only in beta.
Let me see what I can find out - it's likely a timing change that's breaking this test.
Comment 7•1 year ago
|
||
(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #3)
Today's run stops executing the suite during the next test but it lists errors during the execution of the test started above. That run might have failed to handle the issue.
Yes, this is interesting.
For this run we end up failing the pixel compares.
[task 2024-05-21T12:19:16.258Z] 12:19:16 WARNING - TEST-UNEXPECTED-FAIL | dom/canvas/test/webgl-conf/generated/test_conformance__ogles__GL__abs__abs_001_to_006.html | pixel @ (412, 253 was [191,0,0,255] expected [98,0,0,255]
[task 2024-05-21T12:19:16.258Z] 12:19:16 INFO - SimpleTest.ok@SimpleTest/SimpleTest.js:426:16
[task 2024-05-21T12:19:16.258Z] 12:19:16 INFO - reportResults@dom/canvas/test/webgl-conf/mochi-single.html?checkout/conformance/ogles/GL/abs/abs_001_to_006.html:22:14
[task 2024-05-21T12:19:16.258Z] 12:19:16 INFO - reportTestResultsToHarness@dom/canvas/test/webgl-conf/checkout/js/js-test-pre.js:108:22
I'm trying to narrow this down, but it would be helpful if we had someone who was familiar with these webGL tests.
Bug 1894716 optimized gecko on android for speed ("-O2"), which shifted the timing of many operations.
![]() |
||
Comment 8•1 year ago
|
||
Kelsey: This one specific Android webgl mochitest task fails permanently for this one test platform.
- Is there anything webgl specific which would explain why the summary at the end does not count already executed tests? (If not, jmaher should be the next person to ask about it from a mochitest POV.)
- The test failures from the timing change in bug 1894716: could you investigate them?
There should be one bug for each issue, please share for which one this bug should be used.
Updated•1 year ago
|
Comment 9•1 year ago
|
||
In this push I've enabled logging in the OpenGLESTestRunner.
I'm seeing pixel failures in each test, e.g.
dom/canvas/test/webgl-conf/generated/test_conformance__ogles__GL__abs__abs_001_to_006.html | pixel @ (412, 253 was [191,0,0,255] expected [98,0,0,255]
The SIGSEGV follows shortly after.
There might be more details in the logs, still looking.
Comment 10•1 year ago
|
||
Without knowledge of these tests, I'm not sure I can help in any meaningful way (but do let me know if I can).
In summary, with the faster binary and logs enabled on the OpenGLESTestRunner, we consistently see this test fail for that one platform:
TEST-UNEXPECTED-FAIL | dom/canvas/test/webgl-conf/generated/test_conformance__ogles__GL__abs__abs_001_to_006.html | pixel @ (412, 253 was [98,0,0,255] expected [191,0,0,255]
It's not clear to me why geckoview crashes after the test failure, as in general that doesn't happen when these tests fail (induced example here).
But finding out why the pixel test fails seems like a good next step.
![]() |
||
Comment 11•1 year ago
|
||
I don't think this needs to track, it's just a test failure.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
![]() |
||
Comment 18•1 year ago
|
||
Task duration for each of the 2 chunks was below 10 minutes.
Updated•1 year ago
|
Comment 22•1 year ago
|
||
Comment 23•1 year ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/93963136c6fe
https://hg.mozilla.org/mozilla-central/rev/a4db4d624dfe
https://hg.mozilla.org/mozilla-central/rev/bb4f8ae9bdfe
https://hg.mozilla.org/mozilla-central/rev/1c8e49ce3fa1
Comment 24•1 year ago
|
||
Since nightly and release are affected, beta will likely be affected too.
For more information, please visit BugBot documentation.
![]() |
||
Comment 25•1 year ago
|
||
The issue persists.
Comment 26•1 year ago
|
||
(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #8)
Kelsey: This one specific Android webgl mochitest task fails permanently for this one test platform.
- Is there anything webgl specific which would explain why the summary at the end does not count already executed tests? (If not, jmaher should be the next person to ask about it from a mochitest POV.)
No, I don't know why the summary isn't counting things.
- The test failures from the timing change in bug 1894716: could you investigate them?
There should be one bug for each issue, please share for which one this bug should be used.
I don't think that this is worth investigating, and I would prefer to just disable it on these machines.
Comment hidden (Intermittent Failures Robot) |
Comment 28•1 year ago
|
||
Comment 29•1 year ago
|
||
I took a quick poke at this, but it seems that skipping one test just moves the failures to the next one.
Updated•1 year ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•1 year ago
|
Comment 33•1 year ago
•
|
||
FWIW I can reproduce this not just on a Pixel 5, but on a Pixel 8 too. Presumably it affects all aarch64 android devices. The backtrace (which I will attach as it's long) shows we are crashing in JS engine code.
I took a quick poke at this, but it seems that skipping one test just moves the failures to the next one.
The next one presumably being acos_001_to_006.html
, followed by various other 001_to00<n>
tests. The thing they have in common is that they use OpenGLESTestRunner
.
Looking at abs_001_to_006.html
in particular, the first 3 subtests alone don't cause the crash, but any of the last 3 subtests alone do. They key difference is that they use "model": "grid"
rather than "model": null
. The issue appears to be with the drawGrid()
function.
Comment 34•1 year ago
|
||
Comment hidden (Intermittent Failures Robot) |
Comment 36•1 year ago
|
||
Based on the stack this looks like a javascript / gc bug. Moving to that component, hopefully somebody more familiar with the code can take it from here
Comment 37•1 year ago
|
||
Based on the stack this looks like reading from a bad JSObject pointer.
[Inlined] js::gc::HeaderWord::get() const Cell.h:106
[Inlined] js::gc::CellWithTenuredGCPointer::headerPtr() const Cell.h:799
[Inlined] JSObject::shape() const JSObject.h:93
[Inlined] JSObject::is<…>() const NativeObject.h:1887
[Inlined] JSObject::maybeHasInterestingSymbolProperty() const JSObject-inl.h:240
[Inlined] js::MaybeHasInterestingSymbolProperty(JSContext *, JSObject *, JS::Symbol *, JSObject **) ObjectOperations-inl.h:244
[Inlined] js::GetInterestingSymbolProperty(JSContext *, JS::Handle<…>, JS::Symbol *, JS::MutableHandle<…>) ObjectOperations-inl.h:264
js::ToPrimitiveSlow(JSContext *, JSType, JS::MutableHandle<…>) JSObject.cpp:2464
[Inlined] js::ToPrimitive(JSContext *, JSType, JS::MutableHandle<…>) JSObject.h:778
js::ToNumberSlow(JSContext *, JS::Handle<…>, double *) jsnum.cpp:2026
[Inlined] JS::ToNumber(JSContext *, JS::Handle<…>, double *) Conversions.h:139
[Inlined] js::ElementSpecific::valueToNative(JSContext *, JS::Handle<…>, float *) TypedArrayObject-inl.h:696
js::ElementSpecific::initFromIterablePackedArray(JSContext *, JS::Handle<…>, JS::Handle<…>) TypedArrayObject-inl.h:573
[Inlined] TypedArrayObjectTemplate::fromObject(JSContext *, JS::Handle<…>, JS::Handle<…>) TypedArrayObject.cpp:1388
TypedArrayObjectTemplate::fromArray(JSContext *, JS::Handle<…>, JS::Handle<…>) TypedArrayObject.cpp:1237
<unknown> 0x000000778f578040
...
<unknown> 0x000000778f5707bc
[Inlined] EnterJit(JSContext *, js::RunState &, unsigned char *) Jit.cpp:115
js::jit::MaybeEnterJit(JSContext *, js::RunState &) Jit.cpp:261
js::RunScript(JSContext *, js::RunState &) Interpreter.cpp:453
Updated•1 year ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 40•1 year ago
|
||
Certainly seems like if we could debug this there might be something interesting happening; I'd be particularly curious what the object pointer is that has been passed to GetInterestingSymbolProperty (which ultimately is what gets dereferenced at the crash point)
Jamie: If you can still reproduce this (my android environment hasn't been spun up in a year plus at the moment) I'd be really interested in the value of the JSObject passed to GetInterestingSymbolProperty
; specifically I'm curious if it's actually some sort of poison value?
Comment hidden (Intermittent Failures Robot) |
Comment 42•1 year ago
|
||
I think it was 0x1fffffffffffffff
, which I believe corresponds to Value::as_bits_ == 0xffffffffffffffff
.
I also traced up the stack to initFromIterablePackedArray()
here and found that it's always element 65548 of the values array that is 0xffffffffffffffff. The values before and after seem normal, and in debug builds that element is also normal.
In some of the other tests which use this function we get a value of 0x0 instead, which doesn't cause a crash but causes the test to fail.
Prior to this it seemed as if NativeObject::growElements was realloc()
ing the array. that's as far as I got
Comment 43•1 year ago
|
||
Comment 44•1 year ago
|
||
Talked about this bug with Jan, so needinfoing him -- we think there's perhaps some investigation to be done here.
What you describe is definitely interesting. Do you know if this can be reproduced in the simulator(emulator?)
If you can reproduce this, we might also be able to produce some debugging-assistance patches too. E.g, for the case you describe, saving the last element of the elements array on the stack, and validating that realloc correctly preserves the contents of the array when running. At the very least it would help point more directly to the root cause, even if it's an Android OS issue.
Comment 45•1 year ago
|
||
Oh and there's iain's test case now too :)
Updated•1 year ago
|
Comment 46•1 year ago
|
||
Assignee | ||
Comment 47•1 year ago
|
||
Some initial thoughts:
- According to comments here, this started after bug 1894716 landed. That bug changed compiler flags for the Android build, so it could be a Clang/LLVM issue.
- Running the test locally and dumping JS stacks from
initFromIterablePackedArray
, I see some arrays with length 73728 with the following stack:
http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/conformance/ogles/ogles-utils.js:115 (1385b4c2d060 @ 218)
http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/conformance/ogles/ogles-utils.js:389 (1385b4c2d150 @ 3028)
http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/conformance/ogles/ogles-utils.js:531 (1385b4c2d1a0 @ 2119)
http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/conformance/ogles/ogles-utils.js:578 (1385b4c2d330 @ 408)
http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/conformance/ogles/ogles-utils.js:551 (1385b4c2d2e0 @ 27)
http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/js/webgl-test-utils.js:1937 (1385b4c26920 @ 214)
- This matches Iain's analysis in comment 43 for the location of the
initFromIterablePackedArray
call. - Object elements have two reserved
Values
for the elements header. Elements allocations here should be powers of two, so for a length of 73728 we went from 65536 to 131072 elements including the header. - Element 65548 would have been stored shortly after growing from 65536. If the array contains a bogus value at this index, it either was a store of an invalid
Value
or the store code itself was buggy (probably in JIT code). Or we didn't store anything and this is uninitialized memory. Or something else corrupted this memory after we stored a valid JS value there.
Assignee | ||
Comment 48•1 year ago
|
||
If I look at element 65548 in initFromIterablePackedArray
, I get the following different arrays multiple times on my desktop machine:
Element 65547 = double 0.1015625 (0x3fba000000000000)
Element 65548 = double 0.890625 (0x3fec800000000000)
Element 65549 = double 0.90625 (0x3fed000000000000)
Element 65547 = double 0.78125 (0x3fe9000000000000)
Element 65548 = double 0.8125 (0x3fea000000000000)
Element 65549 = int32 -2
The second one has an int32 value -2
for every third element. This likely comes from this code in drawGrid
:
var z = -2.0;
...
gridVertices[currentVertex++] = x1;
gridVertices[currentVertex++] = y1;
gridVertices[currentVertex++] = z;
The first one is probably the gridColors
array but I'm not completely sure.
Jamie, I'm curious if one of these matches the values you saw on Android.
I can do some Try debugging to see if I can narrow it down more, assuming it reproduces somewhat reliably.
Assignee | ||
Comment 49•1 year ago
|
||
This test also crashes a local mozilla-beta build on my M1.
Assignee | ||
Updated•1 year ago
|
Assignee | ||
Comment 50•1 year ago
|
||
I think this is a problem with our definition of non-volatile float registers for ARM64 here.
We're claiming that v8-v30 are non-volatile, so preserved across a call (v31 is a scratch register). However the docs say this:
Registers v8-v15 must be preserved by a callee across subroutine calls; the remaining registers (v0-v7, v16-v31) do not need to be preserved (or should be preserved by the caller). Additionally, only the bottom 64 bits of each value stored in v8-v15 need to be preserved [8]; it is the responsibility of the caller to preserve larger values.
So for registers v16-v30 we're incorrectly assuming calls to C++ code will preserve them.
This particular test case has a ton of live float registers and C++ code clobbers d26 and stores a NaN value in it when growing array elements. Removing that register from NonVolatileSingleMask
fixes the test for me.
Assignee | ||
Updated•1 year ago
|
Assignee | ||
Comment 51•1 year ago
|
||
Assignee | ||
Comment 52•1 year ago
|
||
Assignee | ||
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Comment 53•1 year ago
|
||
Yooo! Super exciting to see root-causes and fixes to "no idea what's going wrong here"-type bugs! Thanks!
Assignee | ||
Comment 54•1 year ago
|
||
Comment on attachment 9420838 [details]
Bug 1897792 - Change NonVolatileSingleMask on ARM64. r?iain!
Security Approval Request
- How easily could an exploit be constructed based on the patch?: Very hard. It also depends on floating-point registers used in Clang/LLVM generated code.
- Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
- Which branches (beta, release, and/or ESR) are affected by this flaw, and do the release status flags reflect this affected/unaffected state correctly?: All
- If not all supported branches, which bug introduced the flaw?: None
- Do you have backports for the affected branches?: Yes
- If not, how different, hard to create, and risky will they be?: Patch should apply.
- How likely is this patch to cause regressions; how much testing does it need?: Unlikely. It only affects arm64 builds and only affects code that uses a lot of different floating-point values.
- Is the patch ready to land after security approval is given?: Yes
- Is Android affected?: Yes
Comment 55•1 year ago
|
||
Comment on attachment 9420838 [details]
Bug 1897792 - Change NonVolatileSingleMask on ARM64. r?iain!
Approved to land and uplift
Updated•1 year ago
|
Assignee | ||
Comment 56•1 year ago
|
||
Comment on attachment 9420838 [details]
Bug 1897792 - Change NonVolatileSingleMask on ARM64. r?iain!
Beta/Release Uplift Approval Request
- User impact if declined: Crashes, potential security bugs.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Only affects ARM64 where it just marks some floating point registers as volatile instead of non-volatile in the JIT compilers.
- String changes made/needed: N/A
- Is Android affected?: Yes
Comment 57•1 year ago
|
||
Comment 58•1 year ago
|
||
Comment 59•1 year ago
|
||
Comment on attachment 9420838 [details]
Bug 1897792 - Change NonVolatileSingleMask on ARM64. r?iain!
Approved for 131.0b4
Comment 60•1 year ago
|
||
uplift |
Updated•1 year ago
|
Comment 61•1 year ago
|
||
Comment on attachment 9420838 [details]
Bug 1897792 - Change NonVolatileSingleMask on ARM64. r?iain!
Approved for 128.3esr
Comment 62•1 year ago
|
||
uplift |
Updated•1 year ago
|
Comment 63•11 months ago
|
||
uplift |
Comment 64•11 months ago
|
||
Comment on attachment 9420838 [details]
Bug 1897792 - Change NonVolatileSingleMask on ARM64. r?iain!
Approved for 115.16esr
Updated•11 months ago
|
Updated•11 months ago
|
Updated•11 months ago
|
Updated•11 months ago
|
Updated•11 months ago
|
Comment 65•9 months ago
|
||
2 months ago, tjr placed a reminder on the bug using the whiteboard tag [reminder-test 2024-11-12]
.
jandem, please refer to the original comment to better understand the reason for the reminder.
Updated•9 months ago
|
Assignee | ||
Updated•9 months ago
|
Comment 66•9 months ago
|
||
![]() |
||
Comment 67•9 months ago
|
||
Updated•5 months ago
|
Comment hidden (Intermittent Failures Robot) |
Description
•