High number of jemalloc crashes in webgl

Status: NEW
Assignee: Unassigned
Component: Core :: Canvas: WebGL
Priority: P3; Severity: normal
Opened: a year ago; Last modified: 4 months ago
People: (Reporter: Tomcat, Unassigned)
Tracking: (Depends on: 1 bug, Blocks: 53 bugs, {leave-open, meta})
Firefox Tracking Flags: (Not tracked)
Whiteboard: [gfx-noted][test disabled]
Attachments: (2 attachments)

(Reporter)

Description

a year ago
We see a high number of jemalloc crashes on the trees; this bug is to track the hoped-for fixes for these crashes.
Possibly related to us changing the WebGL conformance tests (bug 1193526).
See Also: → bug 1193526, bug 1268747
Although some predate the conformance test update: https://bugzilla.mozilla.org/buglist.cgi?list_id=13106221&short_desc=Intermittent%20webgl-%20crashed%20jemalloc_crash&resolution=---&query_format=advanced&short_desc_type=allwordssubstr
Part of a long list of similar WebGL intermittents.
Whiteboard: [gfx-noted]

Updated

a year ago
Assignee: nobody → vliu
Peter, Vincent, any progress on this?
Flags: needinfo?(vliu)
Flags: needinfo?(howareyou322)

Comment 5

a year ago
Hi Milan,

Currently I am still trying to find a way to reproduce the issue on the try server. The crash is not consistent, so it is hard to pick a fixed chunk to look into. Furthermore, I also want to set up taskcluster and use mach to run the tests; there are still environment-related issues that need to be resolved.
Flags: needinfo?(vliu)
Flags: needinfo?(howareyou322)

Comment 6

a year ago
1. Current situation.
    1.1 I looked into the back traces for the recent tc-M(gl1) and tc-M(gl5) runs on mozilla-inbound; they are not consistent. Most of them crashed in jemalloc free,
        but with different root causes. Please see the attached file.

    1.2 I tried to collect all the test programs that easily crash in tc-M(gl1) and tc-M(gl5) into generated-mochitest.ini. With this modification, I can reproduce jemalloc_crash
        on the try server many times. Please see [1] for more detail.

        [1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=afd22fd275b2

    1.3 On the other hand, I found that running reftest-18 also easily reproduces jemalloc_crash. Please see [2] for more detail.

        [2]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffdb90ff6ce9

        This pattern always crashes in webgl-hanging-fb-test.html. Looking into this test program, I tried to simplify the code; the main gl operations are

           gl.clearColor(0.0, 1.0, 0.0, 1.0);
           gl.clear(gl.COLOR_BUFFER_BIT);

        There is more reason to suspect these gl operations, since we had some crashes in [3]. Please see bug 1292263, bug 1291864 and bug 1291790. But I have no further evidence
        to prove it.

        [3]: https://dxr.mozilla.org/mozilla-central/source/dom/canvas/test/reftest/webgl-clear-test.html


2. Possible ways for the next move.
    2.1 Build up mozharness locally.
    2.2 Keep trying the remote environment via "one click loaner".
    2.3 Rethink test programs that crash easily, like the ones mentioned in 1.3.
    2.4 I have also consulted the performance team for more help; that is currently ongoing.

Comment 7

a year ago
Created attachment 8782871 [details]
jemalloc-backtrace.log

A list of the back traces from 1.1 in comment 6 is attached.

Comment 8

a year ago
Hi Mike,

Thanks for talking with me on IRC about the jemalloc issues. This bug is a meta bug for tracking. I modified the jemalloc code as you suggested, and the try push now reproduces the crash. Could you have a look and see if there is any clue pointing to the root cause? Thanks

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8e9f3a2b8e00&selectedJob=26014306
Flags: needinfo?(mh+mozilla)
Oh wow, this one is really bad. Contrary to the past one I looked at, this one happens on malloc, so it's not even about a bad pointer being passed in. The crash happens on the same kind of check (arena magic match), so it indicates actual heap corruption: something is touching mozjemalloc's main (and only) arena structure, or the arenas list, or the pointer to it.

What I would suggest is:
- inserting page-aligned padding after the struct arena_s magic member, and mprotect'ing the first page of arenas[0] so that it can't be written to after initialization
- doing something similar with the arenas list.
Flags: needinfo?(mh+mozilla)
Hi Mike,

Thanks for the reply. From the crashes listed in comment 8, I also see another crash point in [1].


[1]: https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c?q=RELEASE_ASSERT%28%28mapbits+%26+CHUNK_MAP_ALLOCATED%29+%21%3D+0%29%3B&redirect_type=single#4587

Does it also happen because of heap corruption? Should I add any padding or protection to prevent it?
Flags: needinfo?(mh+mozilla)
That could be, or it could be totally unrelated. But it's also harder to diagnose with padding and mprotect. You should try comment 9 first before worrying about that variant.
Flags: needinfo?(mh+mozilla)
Having guard pages in the memory allocator would be hugely helpful in diagnosing memory corruption. Jemalloc checks the arena magic in malloc()/free(), but that check site is often unrelated to the corruption site.

To make this useful, we need to reserve a guard page (maybe before the arena header instead of after it), fill it with a pattern we recognize, and mprotect() it read-only, so that we can crash-annotate this as memory corruption and such crashes stand out among other kinds of access violations like use-after-free.
Following the previous suggestions, I drafted a WIP to debug this issue. The WIP adds padding after the magic member of the arena structure so that together they occupy 4096 bytes (one page in the arena).

When the arena is allocated during initialization, mprotect() is also called on a per-page basis. This way, the magic in the arena should be protected from harm. [1] is the try result.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b616fa5ca3f2&selectedJob=26494713

Looking at [1], two crashes happen, in both the alloc and free phases.

  [2]: https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#4306
  [3]: https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#4510
  

Logcat showed that only arena 0 was allocated, and confirmed the allocated arena was page aligned.

   Jemalloc:  arena 0 has allocated. ret=1426067456

From the crash point, a magic mismatch [2] can still be hit. On the other hand, [3] still happens in most try runs.

Hi Mike, Cervantes,
Could you please have a look at [1] and see if I am missing anything? Thanks.
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(cyu)
Blocks: 1294682
Blocks: 1284740
Blocks: 1291155
Blocks: 1297926
Blocks: 1283780
Blocks: 1293515
Blocks: 1291570
Blocks: 1291902
Blocks: 1294223
Blocks: 1284363
Blocks: 1294305
Blocks: 1298099
Blocks: 1298224
Blocks: 1284152
Blocks: 1284865
Blocks: 1284611
Blocks: 1293875
Blocks: 1282369
Blocks: 1295785
Blocks: 1297236
Blocks: 1291613
Blocks: 1237521
Blocks: 1284737
Blocks: 1293561
Blocks: 1285691
Blocks: 1291788
Blocks: 1284734
Blocks: 1285079
Blocks: 1271070
Blocks: 1293135
Blocks: 1291170
Blocks: 1284732
Blocks: 1284692
Blocks: 1292824
Blocks: 1284689
Blocks: 1298075
Blocks: 1291934
Blocks: 1293288
Blocks: 1271371
Blocks: 1284916
Blocks: 1285601
Blocks: 1285599
Blocks: 1291104
Blocks: 1297947
Blocks: 1296195
Blocks: 1293547
Blocks: 1284580
Blocks: 1298039
Blocks: 1285354
Blocks: 1284153
Blocks: 1285615
Blocks: 1295414
Blocks: 1292436
Blocks: 1291114
Blocks: 1292189
Blocks: 1285610
Blocks: 1284639
Blocks: 1291864
Blocks: 1291560
Blocks: 1294859
Blocks: 1285112
Blocks: 1293431
Blocks: 1292796
Blocks: 1293417
Blocks: 1291554
Blocks: 1292263
Blocks: 1293880
Blocks: 1236728
Blocks: 1291098
Blocks: 1285422
Blocks: 1293907
Blocks: 1284729
Blocks: 1284186
Blocks: 1294179
Blocks: 1284612
Blocks: 1284857
Blocks: 1291169
Blocks: 1292307
Blocks: 1298278
Blocks: 1291898
Blocks: 1292805
Blocks: 1286132
Blocks: 1284724
Blocks: 1235138
Blocks: 1285471
Blocks: 1291790
Blocks: 1284075
Blocks: 1292208, 1292965
Blocks: 1291093
Blocks: 1293249
Blocks: 1285096
Blocks: 1291782
Blocks: 1288062
Blocks: 1298461
Blocks: 1291906
Blocks: 1291783
Blocks: 1271071
Blocks: 1291094
Blocks: 1296810
Blocks: 1285612
Blocks: 1291507
Blocks: 1292746
Blocks: 1285388
Blocks: 1285753
Blocks: 1291102
Blocks: 1298267
Blocks: 1291899
Blocks: 1291493
Blocks: 1285279
Blocks: 1292759
Blocks: 1285056
Blocks: 1283469
Blocks: 1291503, 1295864
Blocks: 1285016
Blocks: 1286542
Blocks: 1293871, 1295707
Blocks: 1293320
Blocks: 1291095
Blocks: 1284761
Blocks: 1284148
Blocks: 1284638, 1284935
(In reply to Vincent Liu[:vliu] from comment #13)
> From the previous suggestions, I tried to draft a WIP to debug this issue.
> The WIP intended to add a paddings, 
> followed after magic in arena structure to combine a 4096 bytes (This equals
> to a page size in arena). 
> 
> When arena was allocated in initial phase, mprotect() was also called to
> protect in a page basis. By this way, magic in arena may be protected from
> harming. [1] is the try result.
> 
> [1]:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=b616fa5ca3f2&selectedJob=26494713

The log for that selected job shows the crash from the code you point at in [3], because of the isalloc_validate call you added to free() in https://hg.mozilla.org/try/file/b616fa5ca3f2/memory/mozjemalloc/jemalloc.c#l6529, and essentially means free() is being passed an already-freed pointer.

The log for the next orange after that one shows a crash at https://hg.mozilla.org/try/file/b616fa5ca3f2/memory/mozjemalloc/jemalloc.c#l4312 ... and no android log from the code at https://hg.mozilla.org/try/file/b616fa5ca3f2/memory/mozjemalloc/jemalloc.c#l4306. That's a huge WTF. (android log: https://public-artifacts.taskcluster.net/DokpsO8nQuaA_BjBOg1VIw/0/public/test_info//logcat-emulator-5554.log ; crash log: https://public-artifacts.taskcluster.net/DokpsO8nQuaA_BjBOg1VIw/0/public/logs/live_backing.log )

Even looking at the assembly doesn't help make it any less of a WTF:

   1c56a:       4607            mov     r7, r0       // r0 contains arenas[0], copy the value in r7
   1c56c:       b9d8            cbnz    r0, 1c5a6    // if r0 is non-zero, go to that address
(...)
   1c5a6:       6803            ldr     r3, [r0, #0] // Load arenas[0].magic in r3
   1c5a8:       4ebe            ldr     r6, [pc, #760]  ; (1c8a4) // Load ARENA_MAGIC in r6
   1c5aa:       42b3            cmp     r3, r6       // Compare r3 and r6
   1c5ac:       d021            beq.n   1c5f2        // If r3 == r6, go to that address. Considering we didn't get the android log, this branch was taken, meaning r3 == r6, and r0 is obviously non-zero, so we took the first branch too
(...)
   1c5f2:       683b            ldr     r3, [r7, #0] // r7 has a copy of arenas[0] from above, so this loads arenas[0].magic in r3 again
   1c5f4:       42b3            cmp     r3, r6       // and compare r3 and r6 again
   1c5f6:       d001            beq.n   1c5fc        // If r3 == r6, go to that address. Considering the following instruction is the one going to jemalloc_crash, and is where the backtrace points, this branch was not taken. So somewhere between 1c5a6 and 1c5f2, something changed the value of arenas[0].magic without tripping the mprotect? I'm ready to invoke the solar flare hypothesis.

The crash backtrace usefully displays r7 as 0x55001000, which corresponds to the 1426067456 from the android log giving the address of arenas[0].

   1c5f8:       f7fd fbe6       bl      19dc8        // jemalloc_crash (the branch target falls through to here)

At this point, I'd try to copy arenas[0].magic and the padding into a buffer on the stack in jemalloc_crash. Something similar to https://dxr.mozilla.org/mozilla-central/source/js/src/jit/x86-shared/BaseAssembler-x86-shared.h#3787-3798 . Although all that would tell you is what is there, not why it's there.
Flags: needinfo?(mh+mozilla)
> The crash backtrace usefully displays r7 as 0x55001000, which corresponds to the 1426067456 from the android log giving the address of arenas[0].

The minidump also confirms the page at 0x55001000 is read-only.
FWIW, I was hitting very similar symptoms (arenas[0].magic corruption) when trying to land a patch a while back, and it turned out to be (non-jemalloc) code that was using the STL without respecting the invariants the STL requires: the sort function we were passing to std::sort wasn't a valid comparator. Carefully auditing STL usage might be a worthwhile approach here.
(In reply to Nathan Froyd [:froydnj] from comment #16)
But that doesn't explain how an arenas[0].magic inside an mprotect()'d page can be changed in the try push in this bug.

Since this is easily reproduced, I think the suggestion in comment #14 might help.
Flags: needinfo?(cyu)
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #17)
> But that doesn't explain how a arenas[0].magic inside a mprotect()'d page
> can be changed in the try push in this bug.

This is true.  But e.g. bug 1260208 comment 20 and following shows similarly weird behavior.
After taking a closer look, I don't think the arena magic is corrupted. Note the crash in isalloc():

https://hg.mozilla.org/try/file/b616fa5ca3f2853ea29044e185fc67b2c0ab198b/memory/mozjemalloc/jemalloc.c#l4516

the assertion fails because the CHUNK_MAP_ALLOCATED bit is not set for the page. Maybe some buffer underflow corrupted the chunk's arena_chunk_t header. If my reasoning is right, the next step should be adding a guard page after the chunk header.
The thing is that there are *several* crashes, most of which happen on the line that checks ARENA_MAGIC, not CHUNK_MAP_ALLOCATED. Also, see comment 14.
Next Thursday will be this bug's two month anniversary, so next weekend I'll be doing what should have been done long ago, hiding all the webgl mochitests on Android and skipping all the webgl reftests.
FWIW, the oldest bugs with a similar crash signature are 8 months old.
(from the list in comment 2)
Yep, lightly broken for Christmas, broken enough to be noticeable and should have been dealt with at the end of April, then broken so badly that it's been violating the Visibility Policy since July (not sure how to even count the reftests, but the mochitests are either around a 10% failure rate, ten times what it's supposed to be for an established suite, or, if you just count the failure rate of the two worst chunks on debug where they are at their worst, more like 60%).
(In reply to Phil Ringnalda (:philor) from comment #21)
> Next Thursday will be this bug's two month anniversary, so next weekend I'll
> be doing what should have been done long ago, hiding all the webgl
> mochitests on Android and skipping all the webgl reftests.

Usually these guidelines apply, but I don't believe it's an option in this case. It would give up too much protection against regressions, and there's no clean way to clear this up for android while maintaining our coverage on desktop.

It's been two months because of the harsh reality that is prioritization. The right path forward is to push through on root causing this and getting a fix. We're doing that now. If we strike out here, we'll talk about other options to clear up the tests.
(In reply to Jeff Gilbert [:jgilbert] from comment #25)
> Usually these guidelines apply, but I don't believe it's an option in this
> case. It would give up too much protection against regressions, and there's
> no clean way to clear this up for android while maintaining our coverage on
> desktop.

Not sure what you mean by that. It's simple to hide mochitest-gl on Android while leaving it visible on desktop, and only a tiresome lot of copy-paste to skip the reftests on Android while leaving them running on desktop.

Yeah, prioritization is tough, and now we need to prioritize running the rest of reftest-18 on Android over running your reftests until they crash and not running the rest of the chunk, and we need to prioritize starring test suites with intermittent failures that happen below a 10% frequency over filing bug after bug after bug after bug as your test suites find yet another test in which to crash, and then finding those bugs when they get filed in the wrong place against some other component, and pulling them back into here rather than making some other team spend time trying to figure out why their code crashed in an incomprehensible way due to this.
We do need the reftests though. We can chunk those out into r-gl if that'll help.

I agree that WebGL issues should avoid being in the way of other tests. However, WebGL requires these tests to ensure that we actually get pixels to the screen. Backsliding on testing this is extremely risk-prone, and should be a last resort.
Oh, a way to see the reftests while they are hidden, more easily than keeping around a patch to reenable them pushed to try once a day, gotcha.

Since they are running on taskcluster, it might be much easier (or, less likely, vastly more difficult) to break them out. I'd recommend cc'ing jmaher on the bug, since he has lots of experience with creating new suites for this very reason.
Depends on: 1299941
Perhaps it's worth looking into bug 1299941, which is what I hit when I tried running the webgl mochitests under valgrind?
Yes - bug 1299941 comment 3
Depends on: 1291221
Duplicate of this bug: 1301453

Updated

a year ago
Blocks: 1301584
Blocks: 1301897
Android webgl mochitests hidden on all trunk trees, including try, click the "Excluded Jobs" link in the grey bar at the top of treeherder to see them.
Keywords: leave-open
Whiteboard: [gfx-noted] → [gfx-noted][test disabled]

Comment 33

a year ago
Pushed by philringnalda@gmail.com:
https://hg.mozilla.org/mozilla-central/rev/bef797e6c129
Skip all webgl reftests on Android for causing memory corruption resulting in crashes, often in later non-webgl tests, a=unfortunatesituation
Hi :glandium, :cyu,

I tried disabling jemalloc on try and found that no crashes happened in the test results. I looked into all the triggered jobs on gl1/gl5. Please see [1] for more details.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=86053a67f229&exclusion_profile=false&selectedJob=27339694

gl1:
they all unexpectedly fail with the same error. Another try is in [2].

[2]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ce566b1fa51d&exclusion_profile=false&selectedJob=27348933

In [2], some warning messages may be the reason for the unexpected failures. I will add more log messages to look into this.

09-12 23:42:59.399   766   785 I Gecko   : [766] WARNING: Failed to find a compatible config.: file /home/worker/workspace/build/src/gfx/gl/GLContextProviderEGL.cpp, line 960
09-12 23:42:59.399   766   785 I Gecko   : [766] WARNING: Failed to find a compatible config.: file /home/worker/workspace/build/src/gfx/gl/GLContextProviderEGL.cpp, line 960
09-12 23:42:59.409   766   785 W GeckoConsole: [JavaScript Warning: "Error: WebGL: Failed to create WebGL context: WebGL creation failed: 

gl5:
All retriggered jobs were green.


At this moment I am not sure that disabling jemalloc can give more clues to help with this issue. Could you please suggest a better next move? Really thanks.
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(cyu)
What happened to the followup to comment 14? (getting the data from the page containing modified arena magic onto the jemalloc_crash() stack)
Flags: needinfo?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #35)
> What happened to the followup to comment 14? (getting the data from the page
> containing modified arena magic onto the jemalloc_crash() stack)

Hi Mike,

Sorry for the late reply. I tried declaring a global raw pointer to the start address of the arena magic; when a crash happens, that data is dumped for debugging. Please see the logcat in [1].

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa9f9e87d676&exclusion_profile=false&selectedJob=27358972

Could you please have a look to see if there is any useful information? Thanks.
Flags: needinfo?(mh+mozilla)
I can reproduce this crash at will when starting my main profile on Linux64, with a local custom build off m-c from last night.  It crashes in the main process with a trashed arena->magic value; the same trashing in two runs.  The entire arena structure looks trashed, again identically in both runs.  It does not reproduce in rr in the one attempt, perhaps because rr may be blocking opengl; I see these messages:

JavaScript warning: http://www.salon.com/ga945320.js, line 1: Error: WebGL: Disallowing antialiased backbuffers due to blacklisting.
JavaScript warning: http://www.salon.com/ga945320.js, line 1: Error: WebGL: Refused to create native OpenGL context because of blacklist entry: 
JavaScript warning: http://www.salon.com/ga945320.js, line 1: Error: WebGL: Failed to create WebGL context: WebGL creation failed: 
* Refused to create native OpenGL context because of blacklist entry: 
* Exhausted GL driver options.

Attaching gdb dump of stack for all threads, bt full for crashing thread, dump of *arena in both decimal and hex.

I've saved a copy of the profile directory, the bin directory, and the changeset to ensure I don't lose the ability to crash it.  I've crashed it 5 times in a row now (excluding the rr attempt).

I'm open to suggestions on how to narrow it down.  It does crash pretty fast.
Created attachment 8790773 [details]
gdb logs
Assignee: vliu → nobody
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(cyu)
(In reply to Randell Jesup [:jesup] from comment #37)
Some things to check:
* There are lots of 0x90909090 in the corrupted object. Is there any correlation to the test data such as decoded texture?
* Just to be sure that we are tracking the right corruption, can you make sure that when the crash happens, arenas[0] still has the same value as when jemalloc was initialized? You can check its original value at the end of malloc_init_hard(). We need to check whether the static variable is getting corrupted to point to some random data, or whether the arena is really corrupted.
* Set watchpoints on arenas[0]->magic and &arenas[0] to capture both cases above. See if this produces a meaningful result. If not, we might need to watch for an earlier corruption that leads to this one.
* Valgrind may be worth a try, to see if your workstation produces a more meaningful result than David's in bug 1299941. There, valgrind reported corruption in nvidia's GL driver, which we can do very little about.
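The watchpoint step could be expressed as gdb commands along these lines (symbol names assumed from mozjemalloc; the exact spelling depends on the build):

```
# Record the original value once initialization finishes.
(gdb) break malloc_init_hard
(gdb) run
(gdb) finish
(gdb) print arenas[0]
# Hardware watchpoints on both the pointer and the magic it points at;
# gdb stops at the exact instruction performing the corrupting write.
(gdb) watch -l arenas[0]
(gdb) watch -l arenas[0]->magic
(gdb) continue
```

`watch -l` watches the address of the expression rather than re-evaluating it, which is what distinguishes "the static variable was repointed" from "the arena itself was scribbled on".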
Milan, do you folks have anything actionable to go on here? This is causing the gl mochitests to be hidden, which means we don't really have any coverage on Android currently.
Flags: needinfo?(milan)
Not sure how actionable, but Randell being able to reproduce it suggests we should be able to get somewhere, though if it is the driver corrupting things, I'm not sure what we could do about that.
Vincent, anything in the comment 37 and comment 39 that can help us get further on this?
Flags: needinfo?(milan)
Randell reproduces a likely different crash.
(In reply to Milan Sreckovic [:milan] from comment #41)
> Not sure how actionable, but Randall being able to reproduce it suggests we
> should be able to get somewhere, though if it is the driver corrupting
> things, I'm not sure what that would be.
> Vincent, anything in the comment 37 and comment 39 that can help us get
> further on this?

The cases I'd seen in the previous try results were likely in free, unlike the crash that happens during allocation (gdb logs attached to comment 37).
Further analysis for this bug showed that the arena isn't actually corrupted; instead, the static variable gets corrupted to point to some random data.

Currently I don't have an actionable way forward from comment 37 and comment 39. It would be great to have STR to approach the issue.
Priority: -- → P3