Closed Bug 1285531 Opened 8 years ago Closed 5 years ago

High number of jemalloc crashes in webgl

Categories: Core :: Graphics: CanvasWebGL (defect, P3)
Status: RESOLVED WORKSFORME
People: (Reporter: cbook, Unassigned)
Keywords: meta
Whiteboard: [gfx-noted][test disabled]
Attachments: (2 files)
We see a high number of jemalloc crashes on the trees; this bug is to track (hopefully) the fixes for these crashes.
Possibly related to us changing the WebGL conformance tests (bug 1193526).
Although some predate the conformance test update: https://bugzilla.mozilla.org/buglist.cgi?list_id=13106221&short_desc=Intermittent%20webgl-%20crashed%20jemalloc_crash&resolution=---&query_format=advanced&short_desc_type=allwordssubstr
Updated•8 years ago
Assignee: nobody → vliu
Peter, Vincent, any progress on this?
Flags: needinfo?(vliu)
Flags: needinfo?(howareyou322)
Comment 5•8 years ago
Hi Milan,
Currently I am still trying to find a way to reproduce the issue on the try server. The crashes are not consistent, so it is hard to pick a fixed chunk to look into. I also want to set up taskcluster and use mach to run the tests, but there are still environment-related issues to resolve.
Flags: needinfo?(vliu)
Flags: needinfo?(howareyou322)
Comment 6•8 years ago
1. Current situation.
1.1 I looked into the backtraces for the recent tc-M(gl1) and tc-M(gl5) failures on mozilla-inbound; they are not consistent. Most of them crashed in jemalloc free, but with different root causes. Please see the attached file.
1.2 I collected all of the tests that crash easily in tc-M(gl1) and tc-M(gl5) into generated-mochitest.ini. With this modification, I can reproduce jemalloc_crash on the try server many times. Please see [1] for more detail.
[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=afd22fd275b2
1.3 On the other hand, I found that running reftest-18 also easily reproduces jemalloc_crash. Please see [2] for more detail.
[2]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffdb90ff6ce9
This pattern always crashes in webgl-hanging-fb-test.html. Looking into this test, I tried to simplify the code; the main GL operations are
gl.clearColor(0.0, 1.0, 0.0, 1.0);
gl.clear(gl.COLOR_BUFFER_BIT);
There are more clues suggesting the above GL operations are involved, since we had some crashes in [3]; see bug 1292263, bug 1291864 and bug 1291790. But I have no further evidence to prove it.
[3]: https://dxr.mozilla.org/mozilla-central/source/dom/canvas/test/reftest/webgl-clear-test.html
2. Possible next steps.
2.1 Build up mozharness locally.
2.2 Keep trying the remote environment via a "one click loaner".
2.3 Rethink the tests that crash easily, as mentioned in 1.3.
2.4 I have also consulted the performance team for more help; that is currently ongoing.
Comment 8•8 years ago
Hi Mike,
Thanks for talking with me on IRC about the jemalloc issues. This bug is a meta bug for tracking. I modified jemalloc as you suggested, and the try push has now reproduced the crash. Can you have a look and see if there is any clue pointing to the root cause? Thanks.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8e9f3a2b8e00&selectedJob=26014306
Flags: needinfo?(mh+mozilla)
Comment 9•8 years ago
Oh wow, this one is really bad. Contrary to the past one I looked at, this one happens on malloc, so it's not even about a bad pointer being passed in. The crash happens on the same kind of check (arena magic match), which indicates actual heap corruption: something is touching mozjemalloc's main (and only) arena structure, the arenas list, or the pointer to it.
What I would suggest is:
- inserting a page-aligned padding after the struct arena_s magic member, and mprotect'ing the first page of arenas[0] so that it can't be written to after initialization
- doing something similar with the arenas list.
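As a rough illustration of the first suggestion, here is a minimal sketch with hypothetical names and a fixed 4 KiB page size; mozjemalloc's real arena_s layout and constants differ:

    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE_STATIC 4096  /* assumed page size for this sketch */

    /* Hypothetical, simplified layout: the magic plus padding fill the
     * first page of the arena header, so that page can be write-protected
     * without touching the live bookkeeping that follows. */
    typedef struct arena_s {
        uint32_t magic;
        uint8_t  pad[PAGE_SIZE_STATIC - sizeof(uint32_t)];
        /* ... real arena bookkeeping would follow on the next page ... */
    } arena_t;

    /* Call once the arena is fully initialized; the arena must be
     * page-aligned for mprotect() to succeed. */
    static int
    protect_arena_magic(arena_t *arena)
    {
        return mprotect(arena, PAGE_SIZE_STATIC, PROT_READ);
    }

After this, any write to the protected page faults at the corrupting instruction instead of being discovered later by the magic check.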
Flags: needinfo?(mh+mozilla)
Comment 10•8 years ago
Hi Mike,
Thanks for the reply. In the crashes listed in comment 8, I also see another crash point, in [1].
[1]: https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c?q=RELEASE_ASSERT%28%28mapbits+%26+CHUNK_MAP_ALLOCATED%29+%21%3D+0%29%3B&redirect_type=single#4587
Does it also happen because of heap corruption? Should I add padding or some other protection to prevent it?
Flags: needinfo?(mh+mozilla)
Comment 11•8 years ago
That could be related, or it could be totally unrelated. But that variant is also harder to diagnose with padding and mprotect, so you should try the approach from comment 9 first before worrying about it.
Flags: needinfo?(mh+mozilla)
Comment 12•8 years ago
Having guard pages in the memory allocator is something that would be hugely helpful in diagnosing memory corruptions. Jemalloc has checks of the arena magic in malloc()/free(), but this is often unrelated to the corruption site.
To make this useful, we need to reserve a guard page (maybe before the arena header instead of after it), fill it with a pattern we recognize, and mprotect() it read-only, so that we can crash-annotate hits on it as memory corruption and such crashes stand out among other kinds of access violations like use-after-free.
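A hedged sketch of that idea, with made-up names; the real version would hook into mozjemalloc's chunk/arena allocation and the crash reporter's annotations:

    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #define PAGE_SIZE_STATIC 4096   /* assumed page size */
    #define GUARD_FILL       0x5A   /* arbitrary, recognizable byte pattern */

    /* Fill the reserved guard page with a recognizable pattern and make it
     * read-only; any later write into it faults immediately. */
    static int
    install_guard_page(void *guard_page)
    {
        memset(guard_page, GUARD_FILL, PAGE_SIZE_STATIC);
        return mprotect(guard_page, PAGE_SIZE_STATIC, PROT_READ);
    }

    /* At crash-report time: a faulting address inside the guard page can be
     * annotated as "memory corruption" rather than a generic access
     * violation such as use-after-free. */
    static int
    fault_hit_guard_page(uintptr_t fault_addr, uintptr_t guard_start)
    {
        return fault_addr >= guard_start &&
               fault_addr <  guard_start + PAGE_SIZE_STATIC;
    }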
Comment 13•8 years ago
Based on the previous suggestions, I drafted a WIP to debug this issue. The WIP adds padding after the magic member of the arena structure so that together they span 4096 bytes (one arena page). When the arena is allocated during initialization, mprotect() is also called to protect it on a page basis; this way the arena magic should be protected from being overwritten. [1] is the try result.
[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b616fa5ca3f2&selectedJob=26494713
Looking at [1], crashes happen in both the alloc and free phases:
[2]: https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#4306
[3]: https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#4510
Logcat showed that only arena 0 was allocated and that the allocated arena was page-aligned:
Jemalloc: arena 0 has allocated. ret=1426067456
From the crash points, it is still possible to hit a magic mismatch [2]. On the other hand, [3] still happens in most try runs.
Hi Mike, Cervantes,
Could you please take a look at [1] and see if I am missing anything? Thanks.
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(cyu)
Comment 14•8 years ago
(In reply to Vincent Liu[:vliu] from comment #13)
> From the previous suggestions, I tried to draft a WIP to debug this issue.
> The WIP intended to add a paddings,
> followed after magic in arena structure to combine a 4096 bytes (This equals
> to a page size in arena).
>
> When arena was allocated in initial phase, mprotect() was also called to
> protect in a page basis. By this way, magic in arena may be protected from
> harming. [1] is the try result.
>
> [1]:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=b616fa5ca3f2&selectedJob=26494713
The log for that selected job is from the code you point at [3] because of the isalloc_validate call you added to free() in https://hg.mozilla.org/try/file/b616fa5ca3f2/memory/mozjemalloc/jemalloc.c#l6529 and essentially means free is being passed an already freed pointer.
The log for the next orange after that one shows a crash at https://hg.mozilla.org/try/file/b616fa5ca3f2/memory/mozjemalloc/jemalloc.c#l4312 ... and no android log from the code at https://hg.mozilla.org/try/file/b616fa5ca3f2/memory/mozjemalloc/jemalloc.c#l4306. That's a huge WTF. (android log: https://public-artifacts.taskcluster.net/DokpsO8nQuaA_BjBOg1VIw/0/public/test_info//logcat-emulator-5554.log ; crash log: https://public-artifacts.taskcluster.net/DokpsO8nQuaA_BjBOg1VIw/0/public/logs/live_backing.log )
Even looking at the assembly doesn't make it any less of a WTF:
1c56a: 4607 mov r7, r0 // r0 contains arenas[0], copy the value in r7
1c56c: b9d8 cbnz r0, 1c5a6 // if r0 is non-zero, go to that address
(...)
1c5a6: 6803 ldr r3, [r0, #0] // Load arenas[0].magic in r3
1c5a8: 4ebe ldr r6, [pc, #760] ; (1c8a4) // Load ARENA_MAGIC in r6
1c5aa: 42b3 cmp r3, r6 // Compare r3 and r6
1c5ac: d021 beq.n 1c5f2 // If r3 == r6, go to that address. Considering we didn't get the android log, this branch was taken, meaning r3 == r6, and r0 is obviously non-zero, so we took the first branch too
(...)
1c5f2: 683b ldr r3, [r7, #0] // r7 has a copy of arenas[0] from above, so this loads arenas[0].magic in r3 again
1c5f4: 42b3 cmp r3, r6 // and compare r3 and r6 again
1c5f6: d001 beq.n 1c5fc // If r3 == r6, go to that address. Considering the following instruction is the one calling jemalloc_crash, and is where the backtrace points, this branch was not taken. So somewhere between 1c5a6 and 1c5f2, something changed the value of arenas[0].magic without hitting the mprotect? I'm ready to invoke the solar flare hypothesis.
1c5f8: f7fd fbe6 bl 19dc8 // jemalloc_crash
The crash backtrace usefully displays r7 as 0x55001000, which corresponds to the 1426067456 from the android log giving the address of arenas[0].
At this point, I'd try to copy arenas[0].magic and the padding onto a buffer on the stack in jemalloc_crash. Something similar to https://dxr.mozilla.org/mozilla-central/source/js/src/jit/x86-shared/BaseAssembler-x86-shared.h#3787-3798 . Although, all that would tell you is what is there, not why it's there.
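A minimal sketch of that suggestion, with hypothetical names; the real change would live inside mozjemalloc's crash path and use its actual arena globals:

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE_STATIC 4096  /* assumed size of magic + padding */

    extern struct arena_s *arenas[];  /* mozjemalloc's arena table */

    /* Copy the page holding arenas[0]->magic and its padding into a stack
     * buffer right before aborting, so the minidump captures a snapshot of
     * the (possibly corrupted) bytes on the crashing thread's stack. */
    static void
    jemalloc_crash_with_snapshot(void)
    {
        volatile uint8_t snapshot[PAGE_SIZE_STATIC];
        memcpy((void *)snapshot, arenas[0], PAGE_SIZE_STATIC);
        __builtin_trap();  /* stand-in for the real crash/abort call */
    }

As noted above, this only records what ended up in the page, not how it got there.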
Flags: needinfo?(mh+mozilla)
Comment 15•8 years ago
> The crash backtrace usefully displays r7 as 0x55001000, which corresponds to the 1426067456 from the android log giving the address of arenas[0].
The minidump also confirms the page at 0x55001000 is read-only.
Comment 16•8 years ago
FWIW, I was hitting very similar symptoms (arenas[0].magic corruption) when trying to land a patch a while back, and it turned out to be (non-jemalloc) code that was using the STL without respecting the invariants required by the STL: the sort function we were passing to std::sort wasn't correct. Carefully auditing STL usage might be a worthwhile approach here.
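To illustrate the bug class being described (a hypothetical, self-contained example, not the actual offending code), a comparator that is not a strict weak ordering invokes undefined behavior in std::sort and can lead to out-of-bounds reads and writes far from the sort call:

    #include <algorithm>
    #include <vector>

    // Broken comparator: with '<=', two equal elements compare "less than"
    // each other in both directions, violating strict weak ordering.
    struct BadCompare {
        bool operator()(int a, int b) const { return a <= b; }  // should be '<'
    };

    int main() {
        std::vector<int> v(100, 42);                  // many equal elements
        std::sort(v.begin(), v.end(), BadCompare());  // undefined behavior
    }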
Comment 17•8 years ago
(In reply to Nathan Froyd [:froydnj] from comment #16)
But that doesn't explain how arenas[0].magic inside an mprotect()'d page can be changed in the try push in this bug.
Since this is easily reproduced, I think the suggestion in comment #14 might help.
Flags: needinfo?(cyu)
Comment 18•8 years ago
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #17)
> But that doesn't explain how a arenas[0].magic inside a mprotect()'d page
> can be changed in the try push in this bug.
This is true. But e.g. bug 1260208 comment 20 and following show similarly weird behavior.
Comment 19•8 years ago
After taking a closer look, I don't think the arena magic is corrupted. Note that in the crash in isalloc():
https://hg.mozilla.org/try/file/b616fa5ca3f2853ea29044e185fc67b2c0ab198b/memory/mozjemalloc/jemalloc.c#l4516
the assertion is that the CHUNK_MAP_ALLOCATED bit is not set for the page. Maybe some buffer underflow is corrupting the arena_chunk_t header for the chunk. If my reasoning is right, the next step should be adding a guard page after the chunk header.
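A rough sketch of that next step, under the assumption of a simplified chunk layout (the names, layout, and fixed page size are illustrative; mozjemalloc's real arena_chunk_t differs):

    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE_STATIC 4096  /* assumed page size */

    /* Hypothetical, simplified chunk layout: the per-chunk header (which
     * holds the page map carrying bits such as CHUNK_MAP_ALLOCATED) is
     * followed by a guard page, so a buffer underflow coming back from the
     * data pages faults on the guard instead of silently clobbering the
     * page map. */
    typedef struct arena_chunk_s {
        uint8_t header_and_map[PAGE_SIZE_STATIC]; /* placeholder for the header */
        uint8_t guard[PAGE_SIZE_STATIC];          /* guard page, never touched */
        /* ... data pages would follow here ... */
    } arena_chunk_t;

    static int
    guard_chunk_header(arena_chunk_t *chunk)
    {
        /* The chunk itself must be page-aligned for this to work. */
        return mprotect(chunk->guard, PAGE_SIZE_STATIC, PROT_NONE);
    }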
Comment 20•8 years ago
The thing is that there are *several* crashes, most of which happen on the line that checks ARENA_MAGIC, not CHUNK_MAP_ALLOCATED. Also, see comment 14.
Comment 21•8 years ago
Next Thursday will be this bug's two month anniversary, so next weekend I'll be doing what should have been done long ago, hiding all the webgl mochitests on Android and skipping all the webgl reftests.
Comment 22•8 years ago
FWIW, the oldest bugs with a similar crash signature are 8 months old.
Comment 23•8 years ago
(from the list in comment 2)
Comment 24•8 years ago
Yep: lightly broken at Christmas, broken enough to be noticeable and should have been dealt with at the end of April, then broken so badly that it's been violating the Visibility Policy since July (not sure how to even count the reftests, but the mochitests are either at around a 10% failure rate, ten times what it's supposed to be for an established suite, or, if you just count the failure rate of the two worst debug chunks at their worst, more like 60%).
Comment 25•8 years ago
(In reply to Phil Ringnalda (:philor) from comment #21)
> Next Thursday will be this bug's two month anniversary, so next weekend I'll
> be doing what should have been done long ago, hiding all the webgl
> mochitests on Android and skipping all the webgl reftests.
Usually these guidelines apply, but I don't believe it's an option in this case. It would give up too much protection against regressions, and there's no clean way to clear this up for android while maintaining our coverage on desktop.
It's been two months because of the harsh reality that is prioritization. The right path forward is to push through on root causing this and getting a fix. We're doing that now. If we strike out here, we'll talk about other options to clear up the tests.
Comment 26•8 years ago
(In reply to Jeff Gilbert [:jgilbert] from comment #25)
> Usually these guidelines apply, but I don't believe it's an option in this
> case. It would give up too much protection against regressions, and there's
> no clean way to clear this up for android while maintaining our coverage on
> desktop.
Not sure what you mean by that. It's simple to hide mochitest-gl on Android while leaving it visible on desktop, and only a tiresome lot of copy-paste to skip the reftests on Android while leaving them running on desktop.
Yeah, prioritization is tough, and now we need to prioritize running the rest of reftest-18 on Android over running your reftests until they crash and not running the rest of the chunk, and we need to prioritize starring test suites with intermittent failures that happen below a 10% frequency over filing bug after bug after bug after bug as your test suites find yet another test in which to crash, and then finding those bugs when they get filed in the wrong place against some other component, and pulling them back into here rather than making some other team spend time trying to figure out why their code crashed in an incomprehensible way due to this.
Comment 27•8 years ago
We do need the reftests though. We can chunk those out into r-gl if that'll help.
I agree that WebGL issues should avoid being in the way of other tests. However, WebGL requires these tests to ensure that we actually get pixels to the screen. Backsliding on testing this is extremely risk-prone, and should be a last resort.
Comment 28•8 years ago
Oh, a way to see the reftests while they are hidden, more easily than keeping around a patch to reenable them pushed to try once a day, gotcha.
Since they are running on taskcluster, it might be much easier (or, less likely, vastly more difficult) to break them out. I'd recommend cc'ing jmaher on the bug, since he has lots of experience with creating new suites for this very reason.
Depends on: 1299941
Perhaps it's worth looking into bug 1299941, which is what I hit when I tried running the webgl mochitests under valgrind?
Yes - bug 1299941 comment 3
Comment 32•8 years ago
Android webgl mochitests are now hidden on all trunk trees, including try; click the "Excluded Jobs" link in the grey bar at the top of treeherder to see them.
Updated•8 years ago
Keywords: leave-open
Whiteboard: [gfx-noted] → [gfx-noted][test disabled]
Comment 33•8 years ago
Pushed by philringnalda@gmail.com:
https://hg.mozilla.org/mozilla-central/rev/bef797e6c129
Skip all webgl reftests on Android for causing memory corruption resulting in crashes, often in later non-webgl tests, a=unfortunatesituation
Comment 34•8 years ago
Hi :glandium, :cyu,
I tried disabling jemalloc on try and found that no crashes happened in the test results. I looked into all of the triggered jobs on gl1/gl5; please see [1] for more details.
[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=86053a67f229&exclusion_profile=false&selectedJob=27339694
gl1:
They all fail unexpectedly with the same error. Another try push is in [2].
[2]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ce566b1fa51d&exclusion_profile=false&selectedJob=27348933
In [2], some warning messages may be the cause of the unexpected failures. I will add more logging to look into this.
09-12 23:42:59.399 766 785 I Gecko : [766] WARNING: Failed to find a compatible config.: file /home/worker/workspace/build/src/gfx/gl/GLContextProviderEGL.cpp, line 960
09-12 23:42:59.399 766 785 I Gecko : [766] WARNING: Failed to find a compatible config.: file /home/worker/workspace/build/src/gfx/gl/GLContextProviderEGL.cpp, line 960
09-12 23:42:59.409 766 785 W GeckoConsole: [JavaScript Warning: "Error: WebGL: Failed to create WebGL context: WebGL creation failed:
gl5:
All retriggered jobs were green.
At this moment I am not sure disabling jemalloc can give us more clues to help with this issue. Could you please suggest a better next move? Thanks.
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(cyu)
Comment 35•8 years ago
What happened to the followup to comment 14? (getting the data from the page containing modified arena magic onto the jemalloc_crash() stack)
Flags: needinfo?(mh+mozilla)
Comment 36•8 years ago
(In reply to Mike Hommey [:glandium] from comment #35)
> What happened to the followup to comment 14? (getting the data from the page
> containing modified arena magic onto the jemalloc_crash() stack)
Hi Mike,
Sorry for the late reply. I declared a global raw pointer to the start address of the arena magic; when a crash happens, that data is dumped for debugging. Please see the logcat in [1].
[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa9f9e87d676&exclusion_profile=false&selectedJob=27358972
Could you please have a look to see if there is any useful information? Thanks.
Flags: needinfo?(mh+mozilla)
Comment 37•8 years ago
I can reproduce this crash at will when starting my main profile on Linux64, local custom build off m-c last night. Crashes in main process with a trashed arena->magic value; same trashing in two runs. The entire arena structure looks trashed, again identically in both runs. Does not repro in rr in the one attempt; perhaps because rr may be blocking opengl -- I see these messages:
JavaScript warning: http://www.salon.com/ga945320.js, line 1: Error: WebGL: Disallowing antialiased backbuffers due to blacklisting.
JavaScript warning: http://www.salon.com/ga945320.js, line 1: Error: WebGL: Refused to create native OpenGL context because of blacklist entry:
JavaScript warning: http://www.salon.com/ga945320.js, line 1: Error: WebGL: Failed to create WebGL context: WebGL creation failed:
* Refused to create native OpenGL context because of blacklist entry:
* Exhausted GL driver options.
Attaching gdb dump of stack for all threads, bt full for crashing thread, dump of *arena in both decimal and hex.
I've saved a copy of the profile directory, the bin directory, and the changeset to ensure I don't lose the ability to crash it. I've crashed it 5 times in a row now (excluding the rr attempt).
I'm open to suggestions on how to narrow it down. It does crash pretty fast.
Comment 38•8 years ago
Updated•8 years ago
Assignee: vliu → nobody
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(cyu)
Comment 39•8 years ago
(In reply to Randell Jesup [:jesup] from comment #37)
Some things to check:
* There are lots of 0x90909090 values in the corrupted object. Is there any correlation to the test data, such as a decoded texture?
* Just to be sure we are tracking the right corruption, can you verify that when the crash happens, arenas[0] still has the same value it had when jemalloc was initialized? You can record its original value at the end of malloc_init_hard(). We need to know whether the static variable is getting corrupted to point at some random data, or whether the arena itself is really corrupted (see the sketch after this list).
* Set watchpoints on arenas[0]->magic and on &arenas[0] to capture both cases above, and see if that produces a meaningful result. If not, we might need to watch for an earlier corruption that leads to this one.
* Valgrind may be worth a try, to see whether your workstation produces a more meaningful result than David's in bug 1299941; there, valgrind reported corruption in nvidia's GL driver, which we can do very little about.
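A minimal sketch of the second check above, with hypothetical hook names; the real code would record the pointer at the end of mozjemalloc's malloc_init_hard() and compare it on the crash path:

    extern struct arena_s *arenas[];   /* mozjemalloc's arena table */

    /* Recorded once, right after jemalloc finishes initializing. */
    static struct arena_s *arenas0_at_init;

    static void
    record_initial_arena(void)
    {
        arenas0_at_init = arenas[0];
    }

    /* Called from the crash path: distinguishes "the static pointer was
     * redirected at random data" from "the arena contents themselves were
     * corrupted". */
    static int
    arena_pointer_was_redirected(void)
    {
        return arenas[0] != arenas0_at_init;
    }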
Milan, do you folks have anything actionable to go on here? This is causing the gl mochitests to be hidden, which means we don't really have any coverage on Android currently.
Flags: needinfo?(milan)
Not sure how actionable this is, but Randell being able to reproduce it suggests we should be able to get somewhere, though if it is the driver corrupting things, I'm not sure what that would be.
Vincent, is there anything in comment 37 and comment 39 that can help us get further on this?
Flags: needinfo?(milan)
Comment 42•8 years ago
Randell reproduces a likely different crash.
Comment 43•8 years ago
(In reply to Milan Sreckovic [:milan] from comment #41)
> Not sure how actionable, but Randall being able to reproduce it suggests we
> should be able to get somewhere, though if it is the driver corrupting
> things, I'm not sure what that would be.
> Vincent, anything in the comment 37 and comment 39 that can help us get
> further on this?
The cases I'd seen in the previous try results were likely in free, unlike this crash, which happens in allocation (gdb logs attached to comment 37).
Further analysis for this bug showed the arena wasn't actually corrupted; instead, the static variable was corrupted to point at some random data.
Currently I don't have an actionable way forward from comments 37 and 39. It would be great to get any STR for approaching the issue.
Updated•7 years ago
Priority: -- → P3
Comment 44•7 years ago
Looks like somewhere along the line, probably when tests were being moved from buildbot to taskcluster, someone decided there wasn't any point in bothering to run the webgl mochitests on Android, so in the unlikely event that anyone ever decides to work on this, their first task will be to recreate the job (or, if the reftests repro'ed well enough to tell about a fix, which I can't remember, test a fix against reftests and then recreate the mochitest job).
Updated•5 years ago
See Also: → webgl-orange-ooms
Updated•5 years ago
Keywords: leave-open