Closed Bug 822398 Opened 11 years ago Closed 11 years ago

Crash when trying to add contact photo during MW2

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(blocking-basecamp:+, firefox19 fixed, firefox20 fixed, b2g18 fixed)

RESOLVED FIXED
B2G C3 (12dec-1jan)
blocking-basecamp +
Tracking Status
firefox19 --- fixed
firefox20 --- fixed
b2g18 --- fixed

People

(Reporter: cjones, Assigned: bechen)

References

Details

(Keywords: regression)

Attachments

(3 files, 1 obsolete file)

STR
 (1) Follow the steps in https://wiki.mozilla.org/B2G/Memory_acceptance_criteria#MW2:_Active_call_stays_active

During step 19, the b2g process crashes.

Can we please run these tests as part of our smoketests?
Over to tchung to determine if we should add this as a smoketest.
Flags: needinfo?(tchung)
blocking-basecamp: ? → +
Andrea: Can you have a look at this? I think this is crashing in webactivities code.
Assignee: nobody → amarchesini
Target Milestone: --- → B2G C3 (12dec-1jan)
I cannot reproduce this bug.
Can you tell me which release you are using?
Flags: needinfo?(jones.chris.g)
Assignee: amarchesini → nobody
Benjamin, could you help this case? Thanks!
Assignee: nobody → bechen
Attached file gdb stack
Looks like a duplication to Bug 823951
I have got this twice by gdb and both the strings in frame6 are "screen".

#6  0x411cef8a in nsStringBuffer::Release (this=0x49acfb60) at /home/benjamin/Documents/git/unagi/B2G/gecko/xpcom/string/src/nsSubstring.cpp:161

0000000: 00 00 00 00 52 00 61 00 64 00 69 00 6f 00 20 00  ....R.a.d.i.o. .
0000010: 46 00 4d 00 20 00 47 00 61 00 69 00 61 00 00 00  F.M. .G.a.i.a...
0000020: 04 00 ff ff 00 00 00 00 0a 00 00 00 69 6d 61 67  ............imag
0000030: 65 2f 67 69 66 00 00 20 63 6c 6f 73 65 00 00 00  e/gif.. close...
0000040: 00 00 9d 47 00 00 00 00 0b 00 00 00 6b 65 65 70  ...G........keep
0000050: 2d 61 6c 69 76 65 00 00 e0 98 30 40 d0 81 b7 43  -alive....0@...C
0000060: c7 75 23 40 00 00 00 00 0e 00 00 00 73 00 63 00  .u#@........s.c.
0000070: 72 00 65 00 65 00 6e 00 0a                       r.e.e.n
In frame 1, this assertion failed:
    RELEASE_ASSERT((run->regs_mask[elm] & (1U << bit)) == 0);
It looks like a double free.

In frame 6:
    int32_t count = PR_ATOMIC_DECREMENT(&mRefCount);
    NS_LOG_RELEASE(this, count, "nsStringBuffer");
    if (count == 0)
      {
        STRING_STAT_INCREMENT(Free);
        free(this); // we were allocated with |malloc|
      }
It is possible that more than one threads free the same thing simultaneously.
The "if (count ==0)" check should be combined with the decrement.
Sorry! The protection should be fine! So it may be that it is freed in other sites.
Attached file double free
It's double free.
Nice! :)

Are you able to reproduce this in a DEBUG build?  It looks like bug 823474 just maybe could cause this, but I'm not 100% sure.  If it's bug 823474, then assertions will fail in DEBUG builds.
What the backtraces seem to show is that
 - we're actually destroying a string here [1]; maybe that string is |aKey|? o_O
 - we hand out that string reference to an hal:: notification [2], which is probably calling into JS
 - then we later die trying to free the same string JSExternalString::finalize

This looks like it might maybe be violating some pldhash invariant.

Also looks like it might be the Great White Whale of random crashes :D.

[1] http://mxr.mozilla.org/mozilla-central/source/hal/HalWakeLock.cpp#81
[2]http://mxr.mozilla.org/mozilla-central/source/hal/HalWakeLock.cpp#85
(In reply to Chris Jones [:cjones] [:warhammer] from comment #12)
>  - we hand out that string reference to an hal:: notification [2], which is
> probably calling into JS

Confirmed that this will result in JS callbacks being invoked.
Hmm, I didn't notice that IPDL type doesn't copy values. We should copy the key. Nice catch Benjamin.
Looking back over the memory-test history, things started getting really crash between 12/6 and 12/12.  With I had more resolution there, but bug 811740 is in that history.  That seems like the kind of change that might exacerbate a pre-existing problem.
Bug 814822 merged on 11/27, and bug 817946 was filed on 12/3.  Doesn't say a whole lot, but things are in range.
(In reply to Kan-Ru Chen [:kanru] from comment #14)
> Hmm, I didn't notice that IPDL type doesn't copy values. We should copy the
> key. Nice catch Benjamin.

The IPC layer copies the incoming key.  I think the problem is

         sLockTable->Remove(aKey);

while we're enumerating |aKey|.  I think we need to return PL_DHASH_REMOVE and not remove the key there, but I need to double check the dhash API.  (It's not very friendly :/.)
Attached patch Fix (obsolete) — Splinter Review
Use Enumerate instead EnumerateRead?
Comment on attachment 695954 [details] [diff] [review]
Fix

Review of attachment 695954 [details] [diff] [review]:
-----------------------------------------------------------------

::: hal/HalWakeLock.cpp
@@ +78,5 @@
>        LockCount totalCount;
>        WakeLockInformation info;
>        aTable->EnumerateRead(CountWakeLocks, &totalCount);
>        if (!totalCount.numLocks) {
> +        op = PL_DHASH_REMOVE;

op |= PL_DHASH_REMOVE;

We need to search the whole table.

I'm not sure if EnumerateRead() is allowed to return PL_DHASH_REMOVE. If not we should change the enumerator to Enumerate().
Yep you're right.  Now trying to sort through the also not-very-friendly nsBaseHashtable API :/.
Attached patch Fix v2Splinter Review
I think this is right.  The PLDHashOperator isn't really a bitfield when enumerating.
Comment on attachment 695957 [details] [diff] [review]
Fix v2

I was able to complete the memory acceptance tests without any crashes!  Looks promising.  That's only happened once in the last 14 days.

Kan-Ru, do you have tests that would check that we're removing the wake-lock tables properly?
Attachment #695957 - Flags: review?(kchen)
Attachment #695954 - Attachment is obsolete: true
(In reply to Chris Jones [:cjones] [:warhammer] from comment #23)
> Created attachment 695957 [details] [diff] [review]
> Fix v2
> 
> I think this is right.  The PLDHashOperator isn't really a bitfield when
> enumerating.

Specifically,

http://mxr.mozilla.org/mozilla-central/source/xpcom/glue/pldhash.cpp#717
Comment on attachment 695957 [details] [diff] [review]
Fix v2

Review of attachment 695957 [details] [diff] [review]:
-----------------------------------------------------------------

The existing test doesn't cover multiple wake lock with different names.
Attachment #695957 - Flags: review?(kchen) → review+
I got a second consecutive non-crash memory acceptance run, so this is promising indeed.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #27)
> I got a second consecutive non-crash memory acceptance run, so this is
> promising indeed.

Three in a row.  Unheard of!
(In reply to Chris Jones [:cjones] [:warhammer] from comment #12)
> What the backtraces seem to show is that
>  - we're actually destroying a string here [1]; maybe that string is |aKey|?
> o_O
>  - we hand out that string reference to an hal:: notification [2], which is
> probably calling into JS
>  - then we later die trying to free the same string
> JSExternalString::finalize
> 
> This looks like it might maybe be violating some pldhash invariant.
> 
> Also looks like it might be the Great White Whale of random crashes :D.
> 
> [1] http://mxr.mozilla.org/mozilla-central/source/hal/HalWakeLock.cpp#81
> [2]http://mxr.mozilla.org/mozilla-central/source/hal/HalWakeLock.cpp#85

I have a question about the nsStringBuffer's ref count.
From my attachment 695937 [details], the first and the second free come from
nsStringBuffer::Release.
There is a mRefCount for the nsStringBuffer, only free when mRefCount == 0
   
That means someone keep the reference and add the mRefCount after the first free?
As bugs with verified seen-in-the-wild signatures were blindly duped here, can people who could reproduce the exact crash scenario fixed here please give us some crash IDs? Otherwise I think I'll probably need to reopen the duped bugs unless we can verify otherwise that those top crash signatures from the other bugs were indeed this issue.
(In reply to Benjamin Chen from comment #34)
> I have a question about the nsStringBuffer's ref count.
> From my attachment 695937 [details], the first and the second free come from
> nsStringBuffer::Release.
> There is a mRefCount for the nsStringBuffer, only free when mRefCount == 0
>    
> That means someone keep the reference and add the mRefCount after the first
> free?

The problem is that the enumerator is passed a const reference to the string element in the hash set, and then frees the item in the hash set that the reference points to.

We would have caught this with nasty assertion failures if we had a test for this running in debug builds (v. bug 824526).
This is a huge improvement in stability :).  bechen++ for the analysis!
Flags: needinfo?(tchung)
You need to log in before you can comment on or make changes to this bug.