Intermittent b2g process crash on compositor thread (on tbpl: "TEST-UNEXPECTED-FAIL | dom-level1-core/test_characterdatasetdatanomodificationallowederr.html | application timed out after 330 seconds with no output")

RESOLVED FIXED in Firefox 19

Status

()

defect
P1
normal
RESOLVED FIXED
7 years ago
7 years ago

People

(Reporter: cjones, Assigned: bjacob)

Tracking

({intermittent-failure})

19 Branch
B2G C3 (12dec-1jan)
ARM
Gonk (Firefox OS)
Points:
---

Firefox Tracking Flags

(blocking-basecamp:+, firefox17 wontfix, firefox18 wontfix, firefox19 fixed, firefox20 fixed, firefox-esr10 wontfix, firefox-esr17 wontfix, b2g18 fixed)

Details

(Whiteboard: [b2g-gfx] [leave open])

Attachments

(2 attachments)

Posted file gdb session
While trying to catch bug 817946 today, I ran into this crash exactly twice in about 6 hours of continuous stress testing.  The first time the android tool I was using messed up and I missed the crash (surprise!).  The second time I was watching the wrong process in gdb and could only attach after our signal handler caught it, which screwed up the stack.

I attached the gdb session in which I recovered everything I could by munging the stack pointer.  Sadly, not much.

After attaching gdb to the right process, I was unable to repro in another 3 hours or so of testing.  Will keep at it.

Then lo and behold, this showed up on tbpl today!
The best stack I was able to squeeze out of gdb is

(gdb) bt
#0  __dl_read () at bionic/libc/arch-arm/syscalls/read.S:10
#1  0xb0003332 in __dl_$t (n=11, info=<value optimized out>, unused=<value optimized out>) at bionic/linker/debugger.c:149
#2  0x4119f0b8 in mozilla::layers::Layer::CalculateScissorRect (this=<value optimized out>, aCurrentScissorRect=<value optimized out>, aWorldTransform=<value optimized out>) at /home/cjones/mozilla/emulator-b2g/gecko/gfx/layers/Layers.cpp:645
#3  0x411a3c74 in ContainerRender<mozilla::layers::ShadowContainerLayerOGL> (this=0x14, aPreviousFrameBuffer=1219606168, aOffset=<value optimized out>) at /home/cjones/mozilla/emulator-b2g/gecko/gfx/layers/opengl/ContainerLayerOGL.cpp:324
#4  mozilla::layers::ShadowContainerLayerOGL::RenderLayer (this=0x14, aPreviousFrameBuffer=1219606168, aOffset=<value optimized out>) at /home/cjones/mozilla/emulator-b2g/gecko/gfx/layers/opengl/ContainerLayerOGL.cpp:446
#5  0x40155040 in ?? ()
#6  0x40155040 in ?? ()

which, if it can be believed, suggests a null ShadowContainerLayerOGL*.  But that doesn't make sense ...
https://tbpl.mozilla.org/php/getParsedLog.php?id=17674845&tree=Mozilla-Beta#error0

b2g_ics_armv7a_gecko_emulator mozilla-beta opt test mochitest-1 on 2012-12-06 08:50:22 PST for push feff35567e4a

slave: talos-r3-fed-063

09:07:40  WARNING -  TEST-UNEXPECTED-FAIL | /tests/dom/tests/mochitest/dom-level1-core/test_characterdatasetdatanomodificationallowederr.html | application timed out after 330 seconds with no output
09:10:50    ERROR - Return code: 1
09:10:50    ERROR -  F/libc    (  171): Fatal signal 7 (SIGBUS) at 0x41396588 (code=2)
09:10:50    ERROR -  This usually indicates the B2G process has crashed
09:10:50    ERROR -  F/libc    (   43): Fatal signal 11 (SIGSEGV) at 0x00000047 (code=1)
09:10:50    ERROR -  This usually indicates the B2G process has crashed
09:10:50    ERROR -  F/libc    (  740): Fatal signal 11 (SIGSEGV) at 0x47e00000 (code=2)
09:10:50    ERROR -  This usually indicates the B2G process has crashed
09:10:50    ERROR -  F/libc    (  781): Fatal signal 11 (SIGSEGV) at 0x00000030 (code=1)
09:10:50    ERROR -  This usually indicates the B2G process has crashed
Caught another hard-to-analyze stack in the emulator that might be related to bug 796773

(gdb) bt
#0  less_than_32_left () at bionic/libc/arch-arm/bionic/memcpy.S:288
#1  0x40286ade in android::GraphicBuffer::flatten (this=<value optimized out>, buffer=0xbda, size=<value optimized out>, fds=0x0, count=1133509320) at frameworks/base/libs/ui/GraphicBuffer.cpp:232
#2  0x46c17140 in ?? ()
Cannot access memory at address 0x0
#3  0x46c17140 in ?? ()
Cannot access memory at address 0x0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

I don't think we want a |count| of over 10^9, whatever we might be doing here ...

Will see what I can dig out tomorrow.

Updated

7 years ago
blocking-basecamp: ? → +
Priority: -- → P1
Target Milestone: --- → B2G C3 (12dec-1jan)
Assignee: nobody → bugs
Assignee: bugs → bjacob
Whiteboard: [b2g-gfx]
(In reply to Chris Jones [:cjones] [:warhammer] from comment #1)
> The best stack I was able to squeeze out of gdb is
> 
> (gdb) bt
> #0  __dl_read () at bionic/libc/arch-arm/syscalls/read.S:10
> #1  0xb0003332 in __dl_$t (n=11, info=<value optimized out>, unused=<value
> optimized out>) at bionic/linker/debugger.c:149
> #2  0x4119f0b8 in mozilla::layers::Layer::CalculateScissorRect (this=<value
> optimized out>, aCurrentScissorRect=<value optimized out>,
> aWorldTransform=<value optimized out>) at
> /home/cjones/mozilla/emulator-b2g/gecko/gfx/layers/Layers.cpp:645
> #3  0x411a3c74 in ContainerRender<mozilla::layers::ShadowContainerLayerOGL>
> (this=0x14, aPreviousFrameBuffer=1219606168, aOffset=<value optimized out>)
> at
> /home/cjones/mozilla/emulator-b2g/gecko/gfx/layers/opengl/ContainerLayerOGL.
> cpp:324
> #4  mozilla::layers::ShadowContainerLayerOGL::RenderLayer (this=0x14,
> aPreviousFrameBuffer=1219606168, aOffset=<value optimized out>) at
> /home/cjones/mozilla/emulator-b2g/gecko/gfx/layers/opengl/ContainerLayerOGL.
> cpp:446
> #5  0x40155040 in ?? ()
> #6  0x40155040 in ?? ()
> 
> which, if it can be believed, suggests a null ShadowContainerLayerOGL*.  But
> that doesn't make sense ...

It has got to make sense somehow: there's no other practical explanation for 'this=0x14' above...

Reasoning from this starting point:

 - the stack doesn't tell us who called RenderLayer
 - but looking at the code, we have RenderLayer and ContainerRender calling each other recursively, iterating over children in ContainerRender at ContainerLayerOGL.cpp:245:

     for (uint32_t i = 0; i < children.Length(); i++) {

 - this |children| array is obtained just above at ContainerLayerOGL.cpp:240:

     aContainer->SortChildrenBy3DZOrder(children);

 - So the question is could SortChildrenBy3DZOrder insert a null pointer in its output |children| array? It is defined at Layers.cpp:769, and it does test whether pointers are NULL before inserting them in the array... only to insert them anyway:

void
ContainerLayer::SortChildrenBy3DZOrder(nsTArray<Layer*>& aArray)
{
  nsAutoTArray<Layer*, 10> toSort;

  for (Layer* l = GetFirstChild(); l; l = l->GetNextSibling()) {
    ContainerLayer* container = l->AsContainerLayer(); // <--- this returns null sometimes
    if (container && container->GetContentFlags() & CONTENT_PRESERVE_3D) {
      toSort.AppendElement(l);
    } else {   // <--- in null case, we take this path
      if (toSort.Length() > 0) {
        SortLayersBy3DZOrder(toSort);
        aArray.MoveElementsFrom(toSort);
      }
      aArray.AppendElement(l);  // <--- we insert the null pointers here
    }
  }
  if (toSort.Length() > 0) {
    SortLayersBy3DZOrder(toSort);
    aArray.MoveElementsFrom(toSort);
  }
}
Erm, read a bit too fast: what we insert is |l|, not |container|, so it doesn't have to be null.

Is there any circumstance where we would have null children layers?
Sorry for the botched analysis in comment 7; hopefully this is more useful:

Looking deeper into "could SortChildrenBy3DZOrder possibly return null children" the next step is to look into the SortLayersBy3DZOrder function that it calls; that is defined in LayerSorter.cpp, and there we have this do { ... } while() loop:

http://dxr.mozilla.org/mozilla-central/gfx/layers/LayerSorter.cpp.html#l276

In this loop, at line 302, we set

  Layer* minNode = nullptr;

Then we /maybe/ set minNode to a non-null value, but that doesn't look guaranteed (lines 303--310); we then append it at the end of the noIncoming array at line 316:

  noIncoming.AppendElement(minNode);

and we go to the next iteration of the loop, where we take the last element of the noIncoming array and append it to the output sortedList at lines 278--283:

  uint32_t last = noIncoming.Length() - 1;

  Layer* layer = noIncoming.ElementAt(last);

  noIncoming.RemoveElementAt(last);
  sortedList.AppendElement(layer);

There it seems that we could really be inserting a null pointer into the output sortedList.

Matt, does that make sense?
Comment on attachment 693082 [details] [diff] [review]
avoid adding null Layers into the output of SortLayersBy3DZOrder

Review of attachment 693082 [details] [diff] [review]:
-----------------------------------------------------------------

Good catch!
Attachment #693082 - Flags: review?(matt.woodrow) → review+
While I have your attention:
 - can you think of other ways that null pointers could sneak into this list?
 - should we aim for bullet-proof-ness here and do the check for null right where we insert the layer into the list, so that we'd catch any other reason why a null layer might sneak in? Or are you confident enough?
Well, we're going to crash regardless if we add a null layer to the list right?

So we might as well add an assertion to check for this when we append to sortedList. Same outcome, but easier to debug.

And I can't see any other ways it could happen, but these things are easy to miss.
https://hg.mozilla.org/integration/mozilla-inbound/rev/985bab8f84df

Added a MOZ_ASSERT where it's inserted into sortedList.

[leave open] because we don't know for sure yet that this fixes the intermittent failure reported here; we'll have to let some time pass and see if the intermittent failures continue.
Whiteboard: [b2g-gfx] → [b2g-gfx] [leave open]
Comment on attachment 693082 [details] [diff] [review]
avoid adding null Layers into the output of SortLayersBy3DZOrder

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Don't know
User impact if declined: Intermittent B2G crash. Please approve at least for B2G18 and aurora; don't actually care about beta insofar as it doesn't apply to B2G.
Testing completed (on m-c, etc.): landed yesterday
Risk to taking this patch (and alternatives if risky): not risky: very simple patch avoiding doing dangerous things with null pointers
String or UUID changes made by this patch: none
Attachment #693082 - Flags: approval-mozilla-beta?
Attachment #693082 - Flags: approval-mozilla-b2g18?
Attachment #693082 - Flags: approval-mozilla-aurora?
Attachment #693082 - Flags: approval-mozilla-beta?
Comment on attachment 693082 [details] [diff] [review]
avoid adding null Layers into the output of SortLayersBy3DZOrder

Approving for B2G branches, but we won't make a call on mozilla-beta until we know the associated crash signature. Can anybody help there?
Attachment #693082 - Flags: approval-mozilla-beta?
Attachment #693082 - Flags: approval-mozilla-b2g18?
Attachment #693082 - Flags: approval-mozilla-b2g18+
Attachment #693082 - Flags: approval-mozilla-aurora?
Attachment #693082 - Flags: approval-mozilla-aurora+
This crash could affect fennec-android in theory, but seems not to in practice.  So I don't think we need to take it on beta.
Benoit, do we have enough data to tell if we've reduced the crash rate (or at least have better stacks now)?

Updated

7 years ago
Flags: needinfo?(bjacob)
There are only a few reports on this bug, all by Chris, none after the tentative fix landed.

So I don't know; it's looking good so far but I don't know that this bug ever had enough volume that we could conclude anything yet from the fact that it hasn't happened again.

Maybe Chris has more clues as he managed to reproduce.
Flags: needinfo?(bjacob) → needinfo?(jones.chris.g)
...actually, I haven't landed this yet on b2g18... doing it asap.
What looks like the same thing as this crash is still happening.
Flags: needinfo?(jones.chris.g)
Comment on attachment 693082 [details] [diff] [review]
avoid adding null Layers into the output of SortLayersBy3DZOrder

Considering comment# 19, this patch is not needed on beta.
Attachment #693082 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
I don't see these crashes happening anymore.  The b2g tests are pretty sickeningly green, in fact.

So going to declare this fix a winner.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.