Closed Bug 1285531 Opened 8 years ago Closed 5 years ago

High number of jemalloc crashes in webgl

Categories

(Core :: Graphics: CanvasWebGL, defect, P3)

Tracking

RESOLVED WORKSFORME

People

(Reporter: cbook, Unassigned)

References

Details

(Keywords: meta, Whiteboard: [gfx-noted][test disabled])

Attachments

(2 files)

We see a high number of jemalloc crashes on the trees; this bug is to track the hoped-for fixes for these crashes.
Possibly related to us changing the WebGL conformance tests (bug 1193526).
See Also: → 1193526, 1268747
Part of a long list of similar WebGL intermittents.
Whiteboard: [gfx-noted]
Assignee: nobody → vliu
Peter, Vincent, any progress on this?
Flags: needinfo?(vliu)
Flags: needinfo?(howareyou322)
Hi Milan,

Currently I am still trying to find a way to reproduce the issue on the try server. The crash is not consistent, so it is hard to pick a fixed chunk to look into. I also want to set up taskcluster and try to use mach to run the tests, but there are still environment-related issues to resolve.
Flags: needinfo?(vliu)
Flags: needinfo?(howareyou322)
1. Current situation.
    1.1 I looked into the back traces for the recent tc-M(gl1) and tc-M(gl5) failures on mozilla-inbound; they are not consistent. Most of them crashed in jemalloc free,
        but with different root causes. Please see the attached file.

    1.2 I collected the test programs that crash most easily in tc-M(gl1) and tc-M(gl5) into generated-mochitest.ini. With this modification, I can reproduce jemalloc_crash
       on the try server many times. Please see [1] for more detail.

       [1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=afd22fd275b2

    1.3 On the other hand, I found that running reftest-18 also reproduces jemalloc_crash easily. Please see [2] for more detail.

       [2]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffdb90ff6ce9

       This pattern always crashes in webgl-hanging-fb-test.html. Looking into this test program, I tried to simplify the code; the main gl operations are

          gl.clearColor(0.0, 1.0, 0.0, 1.0);
          gl.clear(gl.COLOR_BUFFER_BIT);

       There are more clues suggesting the above gl operations are involved, since we have had similar crashes around [3]. Please see bug 1292263, 1291864 and 1291790. But I have no further evidence
       to prove it.

        [3]: https://dxr.mozilla.org/mozilla-central/source/dom/canvas/test/reftest/webgl-clear-test.html


2. Possible next moves.
    2.1 Set up mozharness locally.
    2.2 Keep trying the remote environment via a "one-click loaner".
    2.3 Rethink the test programs that crash easily, like the one mentioned in 1.3.
    2.4 I have also consulted the performance team for more help; that is currently ongoing.
Attached file jemalloc-backtrace.log
The list of back traces from 1.1 in comment 6 is attached.
Hi Mike,

Thanks for talking with me on IRC about the jemalloc issues. This bug is a meta bug for tracking. I modified jemalloc as you suggested, and the try push now reproduces the crash. Could you have a look and see if there is any clue pointing to the root cause? Thanks

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8e9f3a2b8e00&selectedJob=26014306
Flags: needinfo?(mh+mozilla)
Oh wow, this one is really bad. Contrary to the past one I looked at, this one happens on malloc. So it's not even about a bad pointer being passed in. The crash happens on the same kind of check (arena magic match), so it indicates an actual heap corruption, where something is actually touching mozjemalloc's main (and only) arena structure, or the arenas list, or the pointer to it.

What I would suggest is:
- inserting page-aligned padding after the struct arena_s magic member, and mprotecting the first page of arenas[0] so that it can't be written to after initialization (sketched below)
- doing something similar with the arenas list.
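
A minimal sketch of what the first suggestion could look like (illustrative only, not the actual mozjemalloc patch; the padding field, the page-size constant, the helper, and the magic value shown here are assumptions):

    #include <stdint.h>
    #include <sys/mman.h>

    #define ARENA_MAGIC      0x947d3d24   /* value illustrative */
    #define ASSUMED_PAGESIZE 4096

    typedef struct arena_s {
        uint32_t magic;
        /* Hypothetical padding so that magic sits alone on the first page. */
        uint8_t  magic_pad[ASSUMED_PAGESIZE - sizeof(uint32_t)];
        /* ...the real arena fields would follow on later pages... */
    } arena_t;

    static void
    protect_arena_magic(arena_t *arena)
    {
        /* Assumes the arena was allocated page-aligned (comment 13 later
         * verifies this).  After this call, any write to the magic page
         * faults immediately instead of silently corrupting it. */
        mprotect(arena, ASSUMED_PAGESIZE, PROT_READ);
    }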
Flags: needinfo?(mh+mozilla)
Hi Mike,

Thanks for the reply. From the crashes listed in comment 8, I also see another crash point in [1].


[1]: https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c?q=RELEASE_ASSERT%28%28mapbits+%26+CHUNK_MAP_ALLOCATED%29+%21%3D+0%29%3B&redirect_type=single#4587

Does it also happen because of heap corruption? Should I add any padding or other protection to prevent it?
Flags: needinfo?(mh+mozilla)
It could be, or it could be totally unrelated. But it's also harder to diagnose with padding and mprotect. You should try comment 9 first before worrying about that variant.
Flags: needinfo?(mh+mozilla)
Having guard pages in the memory allocator is something that would be hugely helpful in diagnosing memory corruption. Jemalloc checks the arena magic in malloc()/free(), but that check site is often unrelated to the corruption site.

To make this useful, we need to reserve a guard page (maybe before the arena header instead of after it), fill it with a pattern we recognize, and mprotect() it as read-only, so that we can crash-annotate the failure as memory corruption and such crashes stand out among other kinds of access violations like use-after-free.
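
A rough illustration of that idea (the names, constants, and the annotation helper are assumptions, not existing mozjemalloc code):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #define GUARD_PAGE_SIZE 4096
    #define GUARD_PATTERN   0xe5          /* arbitrary byte we can recognize later */

    static uint8_t *arena_guard;          /* page reserved just before the arena header */

    static void
    install_arena_guard(void *guard_page)
    {
        arena_guard = (uint8_t *)guard_page;
        memset(arena_guard, GUARD_PATTERN, GUARD_PAGE_SIZE);  /* recognizable fill */
        mprotect(arena_guard, GUARD_PAGE_SIZE, PROT_READ);    /* stray writes now fault */
    }

    /* Hypothetical helper a crash annotator could use: did the fault land in
     * our guard page, i.e. is this almost certainly a heap-corrupting write
     * rather than, say, a use-after-free? */
    static bool
    fault_is_memory_corruption(const void *fault_addr)
    {
        const uint8_t *p = (const uint8_t *)fault_addr;
        return arena_guard && p >= arena_guard && p < arena_guard + GUARD_PAGE_SIZE;
    }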
From the previous suggestions, I tried to draft a WIP to debug this issue. The WIP adds padding
after magic in the arena structure so that together they take up 4096 bytes (one page in the arena).

When the arena is allocated in the initialization phase, mprotect() is also called to protect it on a page basis. This way, the arena magic should be
protected from being overwritten. [1] is the try result.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b616fa5ca3f2&selectedJob=26494713

Looking into [1], two crashes happen, one in the alloc phase and one in the free phase.

  [2]: https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#4306
  [3]: https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#4510
  

Logcat showed that only arena 0 was allocated, and confirmed the allocated arena was page-aligned.

   Jemalloc:  arena 0 has allocated. ret=1426067456

From the crash points, there is still a chance to hit the magic mismatch [2]. On the other hand, [3] still happens in most try runs.

Hi Mike, Cervantes,
Could you please have a look at [1] and see if I am missing anything? Thanks.
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(cyu)
(In reply to Vincent Liu[:vliu] from comment #13)
> From the previous suggestions, I tried to draft a WIP to debug this issue.
> The WIP adds padding after magic in the arena structure so that together
> they take up 4096 bytes (one page in the arena).
> 
> When the arena is allocated in the initialization phase, mprotect() is also
> called to protect it on a page basis. This way, the arena magic should be
> protected from being overwritten. [1] is the try result.
> 
> [1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b616fa5ca3f2&selectedJob=26494713
The log for that selected job is from the code you point to at [3], because of the isalloc_validate call you added to free() in https://hg.mozilla.org/try/file/b616fa5ca3f2/memory/mozjemalloc/jemalloc.c#l6529, and essentially means free() is being passed an already-freed pointer.

The log for the next orange after that one shows a crash at https://hg.mozilla.org/try/file/b616fa5ca3f2/memory/mozjemalloc/jemalloc.c#l4312 ... and no android log from the code at https://hg.mozilla.org/try/file/b616fa5ca3f2/memory/mozjemalloc/jemalloc.c#l4306. That's a huge WTF. (android log: https://public-artifacts.taskcluster.net/DokpsO8nQuaA_BjBOg1VIw/0/public/test_info//logcat-emulator-5554.log ; crash log: https://public-artifacts.taskcluster.net/DokpsO8nQuaA_BjBOg1VIw/0/public/logs/live_backing.log )

Even looking at the assembly doesn't make it any less of a WTF:

   1c56a:       4607            mov     r7, r0       // r0 contains arenas[0], copy the value in r7
   1c56c:       b9d8            cbnz    r0, 1c5a6    // if r0 is non-zero, go to that address
(...)
   1c5a6:       6803            ldr     r3, [r0, #0] // Load arenas[0].magic in r3
   1c5a8:       4ebe            ldr     r6, [pc, #760]  ; (1c8a4) // Load ARENA_MAGIC in r6
   1c5aa:       42b3            cmp     r3, r6       // Compare r3 and r6
   1c5ac:       d021            beq.n   1c5f2        // If r3 == r6, go to that address. Considering we didn't get the android log, this branch was taken, meaning r3 == r6, and r0 is obviously non-zero, so we took the first branch too
(...)
   1c5f2:       683b            ldr     r3, [r7, #0] // r7 has a copy of arenas[0] from above, so this loads arenas[0].magic in r3 again
   1c5f4:       42b3            cmp     r3, r6       // and compare r3 and r6 again
   1c5f6:       d001            beq.n   1c5fc        // If r3 == r6, go to that address. Considering the following instruction is the one going to jemalloc_crash, and is where the backtrace points at, it means this branch is not taken. So somewhere between 1c5a6 and 1c5f2, something changed the value of arenas[0].magic without hitting the mprotect? I'm ready to invoke the solar flare hypothesis.
   1c5f8:       f7fd fbe6       bl      19dc8        // jemalloc_crash

The crash backtrace usefully displays r7 as 0x55001000, which corresponds to the 1426067456 from the android log giving the address of arenas[0].

At this point, I'd try to copy arenas[0].magic and the padding onto a buffer on the stack in jemalloc_crash. Something similar to https://dxr.mozilla.org/mozilla-central/source/js/src/jit/x86-shared/BaseAssembler-x86-shared.h#3787-3798 . Although, all that would tell you is what is there, not why it's there.
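
A rough sketch of that suggestion (illustrative only; the function name, buffer size, and parameter are made up, and in the real tree jemalloc_crash would read the existing arenas global rather than take an argument):

    #include <stdlib.h>
    #include <string.h>

    /* `arena0` stands in for mozjemalloc's arenas[0]. */
    static void
    jemalloc_crash_with_snapshot(const void *arena0)
    {
        /* Copy the first bytes of the arena header (magic plus the new padding)
         * into a stack buffer; `volatile` discourages the compiler from
         * optimizing the copy away, and the buffer lands in the minidump's
         * stack memory where it can be inspected after the fact. */
        volatile unsigned char snapshot[64];
        memcpy((void *)snapshot, arena0, sizeof(snapshot));
        abort();   /* stand-in for however the real jemalloc_crash aborts */
    }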
Flags: needinfo?(mh+mozilla)
> The crash backtrace usefully displays r7 as 0x55001000, which corresponds to the 1426067456 from the android log giving the address of arenas[0].

The minidump also confirms the page at 0x55001000 is read-only.
FWIW, I was hitting very similar symptoms (arenas[0].magic corruption) when trying to land a patch a while back, and it turned out to be (non-jemalloc) code that was using the STL without respecting the invariants required by the STL--the sort function we were passing to std::sort wasn't correct.  Carefully auditing STL usage might be a worthwhile approach here.
(In reply to Nathan Froyd [:froydnj] from comment #16)
But that doesn't explain how arenas[0].magic inside an mprotect()'d page can be changed in the try push in this bug.

Since this is easily reproduced, I think the suggestion in comment #14 might help.
Flags: needinfo?(cyu)
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #17)
> But that doesn't explain how arenas[0].magic inside an mprotect()'d page
> can be changed in the try push in this bug.

This is true.  But e.g. bug 1260208 comment 20 and following shows similarly weird behavior.
After taking a closer look, I don't think the arena magic is corrupted. Note the crash in isalloc():

https://hg.mozilla.org/try/file/b616fa5ca3f2853ea29044e185fc67b2c0ab198b/memory/mozjemalloc/jemalloc.c#l4516

The assertion there is that the CHUNK_MAP_ALLOCATED bit is not set for the page. Maybe some buffer underflow is corrupting the arena_chunk_t header for the chunk. If my reasoning is right, the next step should be adding a guard page after the chunk header.
The thing is that there are *several* crashes, most of which happen on the line that checks ARENA_MAGIC, not CHUNK_MAP_ALLOCATED. Also, see comment 14.
Next Thursday will be this bug's two month anniversary, so next weekend I'll be doing what should have been done long ago, hiding all the webgl mochitests on Android and skipping all the webgl reftests.
FWIW, the oldest bugs with a similar crash signature are 8 months old.
(from the list in comment 2)
Yep, lightly broken for Christmas, broken enough to be noticeable and should have been dealt with at the end of April, then broken so badly that it's been violating the Visibility Policy since July (not sure how to even count the reftests, but the mochitests are either around a 10% failure rate, ten times what it's supposed to be for an established suite, or if you just count the failure rate of the two worst chunks on debug where they are at their worst, more like 60%).
(In reply to Phil Ringnalda (:philor) from comment #21)
> Next Thursday will be this bug's two month anniversary, so next weekend I'll
> be doing what should have been done long ago, hiding all the webgl
> mochitests on Android and skipping all the webgl reftests.

Usually these guidelines apply, but I don't believe it's an option in this case. It would give up too much protection against regressions, and there's no clean way to clear this up for android while maintaining our coverage on desktop.

It's been two months because of the harsh reality that is prioritization. The right path forward is to push through on root causing this and getting a fix. We're doing that now. If we strike out here, we'll talk about other options to clear up the tests.
(In reply to Jeff Gilbert [:jgilbert] from comment #25)
> Usually these guidelines apply, but I don't believe it's an option in this
> case. It would give up too much protection against regressions, and there's
> no clean way to clear this up for android while maintaining our coverage on
> desktop.

Not sure what you mean by that. It's simple to hide mochitest-gl on Android while leaving it visible on desktop, and only a tiresome lot of copy-paste to skip the reftests on Android while leaving them running on desktop.

Yeah, prioritization is tough, and now we need to prioritize running the rest of reftest-18 on Android over running your reftests until they crash and not running the rest of the chunk, and we need to prioritize starring test suites with intermittent failures that happen below a 10% frequency over filing bug after bug after bug after bug as your test suites find yet another test in which to crash, and then finding those bugs when they get filed in the wrong place against some other component, and pulling them back into here rather than making some other team spend time trying to figure out why their code crashed in an incomprehensible way due to this.
We do need the reftests though. We can chunk those out into r-gl if that'll help.

I agree that WebGL issues should avoid being in the way of other tests. However, WebGL requires these tests to ensure that we actually get pixels to the screen. Backsliding on testing this is extremely risk-prone, and should be a last resort.
Oh, a way to see the reftests while they are hidden, more easily than keeping around a patch to reenable them pushed to try once a day, gotcha.

Since they are running on taskcluster, it might be much easier (or, less likely, vastly more difficult) to break them out. I'd recommend cc'ing jmaher on the bug, since he has lots of experience with creating new suites for this very reason.
Perhaps it's worth looking into bug 1299941, which is what I hit when I tried running the webgl mochitests under valgrind?
Blocks: 1301584
Blocks: 1301897
Android webgl mochitests hidden on all trunk trees, including try, click the "Excluded Jobs" link in the grey bar at the top of treeherder to see them.
Keywords: leave-open
Whiteboard: [gfx-noted] → [gfx-noted][test disabled]
Pushed by philringnalda@gmail.com:
https://hg.mozilla.org/mozilla-central/rev/bef797e6c129
Skip all webgl reftests on Android for causing memory corruption resulting in crashes, often in later non-webgl tests, a=unfortunatesituation
Hi :glandium, :cyu,

I tried disabling jemalloc on try and found that no crashes happen in the test results. I looked into all triggered jobs on gl1/gl5. Please see [1] for more details.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=86053a67f229&exclusion_profile=false&selectedJob=27339694

gl1:
they all fail unexpectedly with the same error. Another try push is in [2].

[2]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ce566b1fa51d&exclusion_profile=false&selectedJob=27348933

In [2], some warning messages may be the reason for the unexpected failures. I will add more logging to look into this.

09-12 23:42:59.399   766   785 I Gecko   : [766] WARNING: Failed to find a compatible config.: file /home/worker/workspace/build/src/gfx/gl/GLContextProviderEGL.cpp, line 960
09-12 23:42:59.399   766   785 I Gecko   : [766] WARNING: Failed to find a compatible config.: file /home/worker/workspace/build/src/gfx/gl/GLContextProviderEGL.cpp, line 960
09-12 23:42:59.409   766   785 W GeckoConsole: [JavaScript Warning: "Error: WebGL: Failed to create WebGL context: WebGL creation failed: 

gl5:
All retriggered jobs were green.


At this moment I am not sure that disabling jemalloc gives us more clues to help with this issue. Could you please suggest a better next move? Really thanks.
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(cyu)
What happened to the followup to comment 14? (getting the data from the page containing modified arena magic onto the jemalloc_crash() stack)
Flags: needinfo?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #35)
> What happened to the followup to comment 14? (getting the data from the page
> containing modified arena magic onto the jemalloc_crash() stack)

Hi Mike,

Sorry for the late reply. I declared a global raw pointer to the start address of the arena magic; when the crash happens, that data is dumped for debugging. Please see the logcat in [1].

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa9f9e87d676&exclusion_profile=false&selectedJob=27358972

Could you please have a look to see if there is any useful information? Thanks.
Flags: needinfo?(mh+mozilla)
I can reproduce this crash at will when starting my main profile on Linux64, local custom build off m-c last night.  Crashes in main process with a trashed arena->magic value; same trashing in two runs.  The entire arena structure looks trashed, again identically in both runs.  Does not repro in rr in the one attempt; perhaps because rr may be blocking opengl -- I see these messages:

JavaScript warning: http://www.salon.com/ga945320.js, line 1: Error: WebGL: Disallowing antialiased backbuffers due to blacklisting.
JavaScript warning: http://www.salon.com/ga945320.js, line 1: Error: WebGL: Refused to create native OpenGL context because of blacklist entry: 
JavaScript warning: http://www.salon.com/ga945320.js, line 1: Error: WebGL: Failed to create WebGL context: WebGL creation failed: 
* Refused to create native OpenGL context because of blacklist entry: 
* Exhausted GL driver options.

Attaching gdb dump of stack for all threads, bt full for crashing thread, dump of *arena in both decimal and hex.

I've saved a copy of the profile directory, the bin directory, and the changeset to ensure I don't lose the ability to crash it.  I've crashed it 5 times in a row now (excluding the rr attempt).

I'm open to suggestions on how to narrow it down.  It does crash pretty fast.
Assignee: vliu → nobody
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(cyu)
(In reply to Randell Jesup [:jesup] from comment #37)
Some things to check:
* There are lots of 0x90909090 in the corrupted object. Is there any correlation to the test data, such as a decoded texture?
* Just to be sure that we are tracking the right corruption, can you make sure that when the crash happens, arenas[0] still has the same value as when jemalloc was initialized? You can check its original value at the end of malloc_init_hard(). We need to check whether it's the static variable getting corrupted to point to some random data, or the arena that is really corrupted.
* Set watchpoints on arenas[0]->magic and on &arenas[0] to capture both cases above (example gdb commands below). See if this produces some meaningful result. If not, then we might need to watch for an earlier corruption that leads to this one.
* Valgrind may be worth a try, to see if your workstation produces a more meaningful result than David's in bug 1299941. In bug 1299941, valgrind reports corruptions in nvidia's GL driver, which we can do very little about.
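
For the watchpoint suggestion, an illustrative gdb session on a local debug build might look like the following (the exact commands are an assumption; malloc_init_hard is static, so it needs debug symbols, and hardware watchpoints can be scarce on some targets):

    (gdb) break malloc_init_hard
    (gdb) run
    # when the breakpoint hits, let allocator initialization finish:
    (gdb) finish
    # record the original arena address for later comparison:
    (gdb) print arenas[0]
    # fires if the magic word itself is overwritten:
    (gdb) watch -l arenas[0]->magic
    # fires if the static arenas[0] pointer is redirected elsewhere:
    (gdb) watch -l arenas[0]
    (gdb) continue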
Milan, do you folks have anything actionable to go on here? This is causing the gl mochitests to be hidden, which means we don't really have any coverage on Android currently.
Flags: needinfo?(milan)
Not sure how actionable, but Randell being able to reproduce it suggests we should be able to get somewhere, though if it is the driver corrupting things, I'm not sure what that would be.
Vincent, anything in comment 37 and comment 39 that can help us get further on this?
Flags: needinfo?(milan)
Randell reproduces a likely different crash.
(In reply to Milan Sreckovic [:milan] from comment #41)
> Not sure how actionable, but Randell being able to reproduce it suggests we
> should be able to get somewhere, though if it is the driver corrupting
> things, I'm not sure what that would be.
> Vincent, anything in comment 37 and comment 39 that can help us get
> further on this?

The cases I had seen from the previous try results were likely in free, unlike the crash that happens during allocation (gdb logs attached for comment 37).
Further analysis for this bug showed the arena itself wasn't actually corrupted; instead, the static variable was getting corrupted to point to some random data.

Currently I don't have an actionable way forward from comment 37 and comment 39. It would be great to get STR to approach the issue.
Looks like somewhere along the line, probably when tests were being moved from buildbot to taskcluster, someone decided there wasn't any point in bothering to run the webgl mochitests on Android, so in the unlikely event that anyone ever decides to work on this, their first task will be to recreate the job (or, if the reftests repro'ed well enough to tell about a fix, which I can't remember, test a fix against reftests and then recreate the mochitest job).
No longer blocks: 1301897
Status: NEW → RESOLVED
Closed: 5 years ago
No longer depends on: 1299941
Resolution: --- → WORKSFORME