Closed Bug 1285531 Opened 5 years ago Closed 2 years ago
High number of jemalloc crashes in webgl
we see a high number of jemalloc crashes on the tres, this bug is to track hopefully the fixes of this crashes
Possibly related to us changing the WebGL conformance tests (bug 1193526.)
Although some predate the conformance test update: https://bugzilla.mozilla.org/buglist.cgi?list_id=13106221&short_desc=Intermittent%20webgl-%20crashed%20jemalloc_crash&resolution=---&query_format=advanced&short_desc_type=allwordssubstr
Part of a long list of similar WebGL intermittents.
Peter, Vincent, any progress on this?
Hi Milan, Currently I am still trying to find the possible way to reproduce the issue from try server. The crash is not consistent so it is hard to have a fixed chunk to look into; Furthermore, I also want to setup taskcluster and try to available mach to run test. It still have environment relative issues need to be resolved.
1. Current situation. 1.1 Looked into back trace for the recent tc-M(gl1) and tc-M(gl5) in mozilla-inbound, they are not consistent. most of them were crashed at jemalloc free but with differnt root cause. Please see the attached file. 1.2 I tried to collect all test program which easily crash in tc-M(gl1) and tc-M(gl5) together in generated-mochitest.ini. With this modification, I can reproduce jemalloc_crash on try-server many times. Please see  for more detail. : https://treeherder.mozilla.org/#/jobs?repo=try&revision=afd22fd275b2 1.3 In the other hand, I found running reftest-18 also easily to reproduce jemalloc_crash. Please see  for more detail. : https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffdb90ff6ce9 This pattern is always creash in webgl-hanging-fb-test.html. Looked into this test program, I try to simpify the code and the main gl operations are gl.clearColor(0.0, 1.0, 0.0, 1.0); gl.clear(gl.COLOR_BUFFER_BIT); There are more clues to think about the above gl operations are we had some crashs in . Please see bug 1292263, 1291864 and 1291790. But I have no more evidence to prove it. : https://dxr.mozilla.org/mozilla-central/source/dom/canvas/test/reftest/webgl-clear-test.html 2. Possible ways for the next move. 2.1 build up mozharness locally. 2.2 Keep trying remote environment by "one click loaner". 2.3 Rethink test programs which are easily to crash like 1.3 mentioned. 2.4 I also consult performance team for more help. It is currently on going.
A list of back trace 1.1 in Comment 6 was attached.
Hi Mike, Thanks for talk to you on IRC about jemalloc issues. This bug is meta bug for tracking. I modified the code you suggested in jemalloc and now the try has reproduced. Can you have a look for me if there is any clue to point to the root cause? Thanks https://treeherder.mozilla.org/#/jobs?repo=try&revision=8e9f3a2b8e00&selectedJob=26014306
Oh waw, this one is really bad. Contrary to the past one I looked at, this one happens on malloc. So it's not even about a bad pointer being passed in. The crash happens on the same kind of check (arena magic match), so it indicates an actual heap corruption, where something is actually touching mozjemalloc's main (and only) arena structure, or the arenas list, or the pointer to it. What I would suggest is: - inserting a page-aligned padding after the struct arena_s magic member, and mprotect the first page or arenas so that it can't be written to after initialization - doing something similar with the arenas list.
Hi Mike, Thanks the reply for the issue. From crash listed in Comment 8, I also see another crash point was in . : https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c?q=RELEASE_ASSERT%28%28mapbits+%26+CHUNK_MAP_ALLOCATED%29+%21%3D+0%29%3B&redirect_type=single#4587 Does it also happens because of heap corruption? Should I add any padding or any protection to prevent?
That could be or could be totally unrelated. But it's also harder to diagnose with padding and mprotect. You should try comment 9 first before trying to care about that variant.
Having guard pages in the memory allocator is something that would be hugely helpful in diagnosing memory corruptions. Jemalloc has checks of the arena magic in malloc()/free(), but this is often unrelated to the corruption site. To make this useful, we need to reserve guard page (maybe before the arena header instead of after it), fill it with some pattern that is recognized by us and mprotect() it as readonly so that we can crash-annotate this as memory-corruption so that such crashes can stand out among other kinds of access violations like use-after-free.
From the previous suggestions, I tried to draft a WIP to debug this issue. The WIP intended to add a paddings, followed after magic in arena structure to combine a 4096 bytes (This equals to a page size in arena). When arena was allocated in initial phase, mprotect() was also called to protect in a page basis. By this way, magic in arena may be protected from harming.  is the try result. : https://treeherder.mozilla.org/#/jobs?repo=try&revision=b616fa5ca3f2&selectedJob=26494713 Looked into , two crashes happens for both alloc and free phase. : https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#4306 : https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#4510 Logcat showed only arena 0 was allocated and make sure allocated arena was page aligned. Jemalloc: arena 0 has allocated. ret=1426067456 From crash point, it still has chance to hit magic not matched. In other hand,  still happens in most try. Hi Mike, Cervantes, Could you please have a look in  if I have anything missing? Thanks.
(In reply to Vincent Liu[:vliu] from comment #13) > From the previous suggestions, I tried to draft a WIP to debug this issue. > The WIP intended to add a paddings, > followed after magic in arena structure to combine a 4096 bytes (This equals > to a page size in arena). > > When arena was allocated in initial phase, mprotect() was also called to > protect in a page basis. By this way, magic in arena may be protected from > harming.  is the try result. > > : > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=b616fa5ca3f2&selectedJob=26494713 The log for that selected job is from the code you point at  because of the isalloc_validate call you added to free() in https://hg.mozilla.org/try/file/b616fa5ca3f2/memory/mozjemalloc/jemalloc.c#l6529 and essentially means free is being passed an already freed pointer. The log for the next orange after that one shows a crash at https://hg.mozilla.org/try/file/b616fa5ca3f2/memory/mozjemalloc/jemalloc.c#l4312 ... and no android log from the code at https://hg.mozilla.org/try/file/b616fa5ca3f2/memory/mozjemalloc/jemalloc.c#l4306. That's a huge WTF. (android log: https://public-artifacts.taskcluster.net/DokpsO8nQuaA_BjBOg1VIw/0/public/test_info//logcat-emulator-5554.log ; crash log: https://public-artifacts.taskcluster.net/DokpsO8nQuaA_BjBOg1VIw/0/public/logs/live_backing.log ) Even looking at the assembly doesn't help make it a little less a WTF: 1c56a: 4607 mov r7, r0 // r0 contains arenas, copy the value in r7 1c56c: b9d8 cbnz r0, 1c5a6 // if r0 is non-zero, go to that address (...) 1c5a6: 6803 ldr r3, [r0, #0] // Load arenas.magic in r3 1c5a8: 4ebe ldr r6, [pc, #760] ; (1c8a4) // Load ARENA_MAGIC in r6 1c5aa: 42b3 cmp r3, r6 // Compare r3 and r6 1c5ac: d021 beq.n 1c5f2 // If r3 != r6, go to that address. Considering we didn't get the android log, this branch was taken, meaning r3 == r6, and r0 is obviously non-zero, so we took the first branch too (...) 1c5f2: 683b ldr r3, [r7, #0] // r7 has a copy of arenas from above, so this loads arenas.magic in r3 again 1c5f4: 42b3 cmp r3, r6 // and compare r3 and r6 again 1c5f6: d001 beq.n 1c5fc // If r3 != r6, go to that address. Considering the following instruction is the one going to jemalloc_crash, and is where the backtrace points at, it means this branch is not taken. So somewhere between 1c5a6 and 1c5f2, something changed the value of arenas.magic without hitting the mprotect? I'm ready to invoke the solar flare hypothesis. The crash backtrace usefully displays r7 as 0x55001000, which corresponds to the 1426067456 from the android log giving the address of arenas. At this point, I'd try to copy arenas.magic and padding onto a buffer in the stack in jemalloc_crash. Something similar to https://dxr.mozilla.org/mozilla-central/source/js/src/jit/x86-shared/BaseAssembler-x86-shared.h#3787-3798 . Although, all that would tell you is what there is there, not why it's there. 1c5f8: f7fd fbe6 bl 19dc8 // jemalloc_crash
> The crash backtrace usefully displays r7 as 0x55001000, which corresponds to the 1426067456 from the android log giving the address of arenas. The minidump also confirms the page at 0x55001000 is read-only.
FWIW, I was hitting very similar symptons (arenas.magic corruption) when trying to land a patch a while back and it turned out to be (non-jemalloc) code that was using the STL without respecting the invariants required by the STL--the sort function we were passing to std::sort wasn't correct. Carefully auditing STL usage might be a worthwhile approach here.
(In reply to Nathan Froyd [:froydnj] from comment #16) But that doesn't explain how a arenas.magic inside a mprotect()'d page can be changed in the try push in this bug. Since this is easily reproduced, I think the suggestion in comment #14 might help.
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #17) > But that doesn't explain how a arenas.magic inside a mprotect()'d page > can be changed in the try push in this bug. This is true. But e.g. comment 1260208 comment 20 and following shows similarly weird behavior.
After taking a closer look, I don't think arena magic is corrupted. Note that the crash in isalloc(): https://hg.mozilla.org/try/file/b616fa5ca3f2853ea29044e185fc67b2c0ab198b/memory/mozjemalloc/jemalloc.c#l4516 the assertion is that CHUNK_MAP_ALLOCATED bit is not turned on for the page. Maybe it's some buffer underflow to corrupt the arena_chunk_t header for the chunk. If my reasoning is right, the next step should be adding a guard page after the chunk header.
The thing is that there are *several* crashes, most of which happen on the line that checks ARENA_MAGIC, not CHUNK_MAP_ALLOCATED. Also, see comment 14.
Next Thursday will be this bug's two month anniversary, so next weekend I'll be doing what should have been done long ago, hiding all the webgl mochitests on Android and skipping all the webgl reftests.
FWIW, the oldest bugs with a similar crash signature are 8 months old.
(from the list in comment 2)
Yep, lightly broken for Christmas, broken enough to be noticeable and should have been dealt with at the end of April, then broken so badly that it's been violating the Visiblity Policy since July (not sure how to even count the reftests, but the mochitests are either around a 10% failure rate, ten times what it's supposed to be for an established suite, or if you just count the failure rate of the two worst chunks on debug where they are at their worst, more like 60%).
(In reply to Phil Ringnalda (:philor) from comment #21) > Next Thursday will be this bug's two month anniversary, so next weekend I'll > be doing what should have been done long ago, hiding all the webgl > mochitests on Android and skipping all the webgl reftests. Usually these guidelines apply, but I don't believe it's an option in this case. It would give up too much protection against regressions, and there's no clean way to clear this up for android while maintaining our coverage on desktop. It's been two months because of the harsh reality that is prioritization. The right path forward is to push through on root causing this and getting a fix. We're doing that now. If we strike out here, we'll talk about other options to clear up the tests.
(In reply to Jeff Gilbert [:jgilbert] from comment #25) > Usually these guidelines apply, but I don't believe it's an option in this > case. It would give up too much protection against regressions, and there's > no clean way to clear this up for android while maintaining our coverage on > desktop. Not sure what you mean by that. It's simple to hide mochitest-gl on Android while leaving it visible on desktop, and only a tiresome lot of copy-paste to skip the reftests on Android while leaving them running on desktop. Yeah, prioritization is tough, and now we need to prioritize running the rest of reftest-18 on Android over running your reftests until they crash and not running the rest of the chunk, and we need to prioritize starring test suites with intermittent failures that happen below a 10% frequency over filing bug after bug after bug after bug as your test suites find yet another test in which to crash, and then finding those bugs when they get filed in the wrong place against some other component, and pulling them back into here rather than making some other team spend time trying to figure out why their code crashed in an incomprehensible way due to this.
We do need the reftests though. We can chunk those out into r-gl if that'll help. I agree that WebGL issues should avoid being in the way of other tests. However, WebGL requires these tests to ensure that we actually get pixels to the screen. Backsliding on testing this is extremely risk-prone, and should be a last resort.
Oh, a way to see the reftests while they are hidden, more easily than keeping around a patch to reenable them pushed to try once a day, gotcha. Since they are running on taskcluster, it might be much easier (or, less likely, vastly more difficult) to break them out. I'd recommend cc'ing jmaher on the bug, since he has lots of experience with creating new suites for this very reason.
Perhaps it's worth looking into bug 1299941, which is what I hit when I tried running the webgl mochitests under valgrind?
Yes - bug 1299941 comment 3
Android webgl mochitests hidden on all trunk trees, including try, click the "Excluded Jobs" link in the grey bar at the top of treeherder to see them.
Whiteboard: [gfx-noted] → [gfx-noted][test disabled]
Pushed by email@example.com: https://hg.mozilla.org/mozilla-central/rev/bef797e6c129 Skip all webgl reftests on Android for causing memory corruption resulting in crashes, often in later non-webgl tests, a=unfortunatesituation
What happened to the followup to comment 14? (getting the data from the page containing modified arena magic onto the jemalloc_crash() stack)
(In reply to Mike Hommey [:glandium] from comment #35) > What happened to the followup to comment 14? (getting the data from the page > containing modified arena magic onto the jemalloc_crash() stack) Hi Mike, Sorry to late reply. I tried to declare a global raw pointer to point to the start address of arena magic. When crash happens, dumping those data for debugging. Please see logcat in  : https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa9f9e87d676&exclusion_profile=false&selectedJob=27358972 Could you please have a look to see if there is any useful information? Thanks.
Assignee: vliu → nobody
(In reply to Randell Jesup [:jesup] from comment #37) Some things to check: * There are lots of 0x90909090 in the corrupted object. Is there any correlation to the test data such as decoded texture? * Just to be sure that we are tracking the right corruption, can you make sure that when the crash happens, arenas is still the same value as jemalloc() is initialized? You can check its original value by the end of malloc_init_hard(). We need to check whether it's the static variable getting corrupted to point to some random data, or the arena is really corrupted. * Set watchpoints in arenas->magic and &arenas to capture both cases above. See if this produces some meaningful result. If not, then we might need to watch for an earlier corruption that leads to this corruption. * Valgrind may be worth a try to see if your workstation produces some meaningful result than David's result in bug 1299941. In bug 1299941, valgrind reports corruptions in nvidia's GL driver, which we can do very little about.
Milan, do you folks have anything actionable to go on here? This is causing the gl mochitests to be hidden, which means we don't really have any coverage on Android currently.
Not sure how actionable, but Randall being able to reproduce it suggests we should be able to get somewhere, though if it is the driver corrupting things, I'm not sure what that would be. Vincent, anything in the comment 37 and comment 39 that can help us get further on this?
Randall reproduces a likely different crash.
(In reply to Milan Sreckovic [:milan] from comment #41) > Not sure how actionable, but Randall being able to reproduce it suggests we > should be able to get somewhere, though if it is the driver corrupting > things, I'm not sure what that would be. > Vincent, anything in the comment 37 and comment 39 that can help us get > further on this? The case I'd seen from the previous try result were likely in free, which was not like the crash happens in allocation (gdb logs attached for comment 37). More information analyzed for this bug showed arena didn't actually corrupted, the static variable getting corrupted to point to some random data instead. Currently I don't have any actionable way to go from comment 37 and 39. It would be great to input any STR to approach the issue.
4 years ago
Priority: -- → P3
Looks like somewhere along the line, probably when tests were being moved from buildbot to taskcluster, someone decided there wasn't any point in bothering to run the webgl mochitests on Android, so in the unlikely event that anyone ever decides to work on this, their first task will be to recreate the job (or, if the reftests repro'ed well enough to tell about a fix, which I can't remember, test a fix against reftests and then recreate the mochitest job).
No longer blocks: 1301897
Duplicate of this bug: 1301897
Status: NEW → RESOLVED
Closed: 2 years ago
No longer depends on: 1299941
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.