Closed Bug 1318247 Opened 8 years ago Closed 7 years ago

Intermittent remoteautomation.py | application crashed [@ ZipCollection::GetZip]

Categories

(Firefox for Android Graveyard :: Testing, defect, P3)

defect

Tracking

(firefox-esr52 unaffected, firefox54 wontfix, firefox55 wontfix, firefox56 fixed)

RESOLVED FIXED
Firefox 56
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- wontfix
firefox55 --- wontfix
firefox56 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: esawin)

References

Details

(Keywords: crash, intermittent-failure)

Attachments

(1 file, 1 obsolete file)

[task 2016-11-17T06:16:23.634614Z] 06:16:23     INFO -  Crash reason:  SIGSEGV
[task 2016-11-17T06:16:23.634794Z] 06:16:23     INFO -  Crash address: 0x0
[task 2016-11-17T06:16:23.634980Z] 06:16:23     INFO -  Process uptime: not available
[task 2016-11-17T06:16:23.635155Z] 06:16:23     INFO -  Thread 10 (crashed)
[task 2016-11-17T06:16:23.635369Z] 06:16:23     INFO -   0  libmozglue.so!ZipCollection::GetZip [RefCounted.h:8cfe56f5b5d4 : 101 + 0x0]
[task 2016-11-17T06:16:23.635564Z] 06:16:23     INFO -       r0 = 0x0000008d    r1 = 0x21d0b45c    r2 = 0x528fe381    r3 = 0x5291610c
[task 2016-11-17T06:16:23.635757Z] 06:16:23     INFO -       r4 = 0x52d0e080    r5 = 0x52758c2c    r6 = 0x00000000    r7 = 0x52758c04
[task 2016-11-17T06:16:23.635955Z] 06:16:23     INFO -       r8 = 0x52d080f4    r9 = 0x2a357798   r10 = 0x52d080f8   r12 = 0x00000003
[task 2016-11-17T06:16:23.636150Z] 06:16:23     INFO -       fp = 0x52758c64    sp = 0x52758c00    lr = 0x52880beb    pc = 0x52881504
[task 2016-11-17T06:16:23.636336Z] 06:16:23     INFO -      Found by: given as instruction pointer in context
[task 2016-11-17T06:16:23.636563Z] 06:16:23     INFO -   1  libmozglue.so!Java_org_mozilla_gecko_mozglue_NativeZip_getZip [nsGeckoUtils.cpp:8cfe56f5b5d4 : 65 + 0x5]
[task 2016-11-17T06:16:23.636753Z] 06:16:23     INFO -       r4 = 0x4c847180    r5 = 0x2a1e4778    r6 = 0x2a357798    r7 = 0x52758c2c
[task 2016-11-17T06:16:23.636946Z] 06:16:23     INFO -       r8 = 0x4af00005    r9 = 0x00000000   r10 = 0x2a1e4b58    fp = 0x52758c64
[task 2016-11-17T06:16:23.637123Z] 06:16:23     INFO -       sp = 0x52758c28    pc = 0x528864a1
[task 2016-11-17T06:16:23.637314Z] 06:16:23     INFO -      Found by: call frame info
[task 2016-11-17T06:16:23.637488Z] 06:16:23     INFO -   2  libdvm.so + 0x1dc4e
[task 2016-11-17T06:16:23.637694Z] 06:16:23     INFO -       r4 = 0x4c847180    r5 = 0x2a1e4b48    r6 = 0x00000003    r7 = 0x52658dcc
[task 2016-11-17T06:16:23.637915Z] 06:16:23     INFO -       r8 = 0x52758c50    r9 = 0x52658dc4   r10 = 0x2a1e4b58    fp = 0x52758c64
[task 2016-11-17T06:16:23.638087Z] 06:16:23     INFO -       sp = 0x52758c50    pc = 0x4085bc50
[task 2016-11-17T06:16:23.638262Z] 06:16:23     INFO -      Found by: call frame info
[task 2016-11-17T06:16:23.638459Z] 06:16:23     INFO -   3  data@app@org.mozilla.fennec-1.apk@classes.dex + 0x4b9a84
[task 2016-11-17T06:16:23.638636Z] 06:16:23     INFO -       sp = 0x52758c5c    pc = 0x523a1a86
[task 2016-11-17T06:16:23.638817Z] 06:16:23     INFO -      Found by: stack scanning
[task 2016-11-17T06:16:23.639000Z] 06:16:23     INFO -   4  dalvik-heap (deleted) + 0x5ef996
[task 2016-11-17T06:16:23.639182Z] 06:16:23     INFO -       sp = 0x52758c60    pc = 0x41aa4998
[task 2016-11-17T06:16:23.639357Z] 06:16:23     INFO -      Found by: stack scanning
[task 2016-11-17T06:16:23.639534Z] 06:16:23     INFO -   5  libdvm.so + 0x4dcad
[task 2016-11-17T06:16:23.639717Z] 06:16:23     INFO -       sp = 0x52758c68    pc = 0x4088bcaf
[task 2016-11-17T06:16:23.639891Z] 06:16:23     INFO -      Found by: stack scanning
[task 2016-11-17T06:16:23.640086Z] 06:16:23     INFO -   6  data@app@org.mozilla.fennec-1.apk@classes.dex + 0x4b9a82
[task 2016-11-17T06:16:23.640264Z] 06:16:23     INFO -       sp = 0x52758c70    pc = 0x523a1a84
[task 2016-11-17T06:16:23.640438Z] 06:16:23     INFO -      Found by: stack scanning
[task 2016-11-17T06:16:23.640679Z] 06:16:23     INFO -   7  libmozglue.so!Java_org_mozilla_gecko_mozglue_NativeZip_getZipFromByteBuffer [nsGeckoUtils.cpp:8cfe56f5b5d4 : 87 + 0xf]
[task 2016-11-17T06:16:23.640843Z] 06:16:23     INFO -       sp = 0x52758c74    pc = 0x52886461
[task 2016-11-17T06:16:23.641016Z] 06:16:23     INFO -      Found by: stack scanning


I thought you sorted these out in bug 1302516?
Flags: needinfo?(esawin)
Priority: -- → P3
Assignee: nobody → esawin
Flags: needinfo?(esawin)
The crash is caused by the assertion in [1] because mRefCnt < 0 holds during a call to AddRef.
This can only be the case when we're attempting to call AddRef on a reference which is in the process of being destructed [2], but hasn't been removed from the zip collection yet.

The following timeline would trigger this event:
1. Thread 1 acquires the lock in GetZip [3]. 
2. Thread 2 calls Release on reference A [4].
3. Thread 2 invokes destructor of A [5].
4. Thread 2 is waiting for the lock in Forget [6].
5. Thread 1 finds zip A and calls AddRef on it [7].

[1] https://dxr.mozilla.org/mozilla-central/source/mfbt/RefCounted.h#101
[2] https://dxr.mozilla.org/mozilla-central/source/mfbt/RefCounted.h#135
[3] https://dxr.mozilla.org/mozilla-central/source/mozglue/linker/Zip.cpp#196
[4] https://dxr.mozilla.org/mozilla-central/source/mozglue/android/nsGeckoUtils.cpp#95
[5] https://dxr.mozilla.org/mozilla-central/source/mozglue/linker/Zip.cpp#77
[6] https://dxr.mozilla.org/mozilla-central/source/mozglue/linker/Zip.cpp#220
[7] https://dxr.mozilla.org/mozilla-central/source/mozglue/linker/Zip.cpp#201
To avoid the race, we make ZipCollection increase the ref count so that we can control destruction and removal of a zip in a locked environment.

Alternatively, we can remove ZipCollection at the cost of increased address space usage (otherwise, there is no measurable performance loss).
Attachment #8814157 - Flags: review?(mh+mozilla)
Comment on attachment 8814157 [details] [diff] [review]
0001-Bug-1318247-1.0-Add-custom-refcounting-to-zips-in-Zi.patch

Review of attachment 8814157 [details] [diff] [review]:
-----------------------------------------------------------------

There is similar code for LibHandle refcounting, but slightly different. It would be preferable to use the same strategy for both types, preferably by sharing the Release() code.
Attachment #8814157 - Flags: review?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #4)
> There is similar code for LibHandle refcounting, but slightly different. It
> would be preferable to use the same strategy for both types, preferably by
> sharing the Release() code.

We don't access the lib handles directly from JNI, so I think LibHandle/ElfLoader don't need to be thread-safe.

The Release specialization for Zip is only effective with the locking mechanics we've introduced to ZipCollection before.

We could change the debug mRefCnt value from 0x7fffdead to detail::DEAD (0xffffdead) in LibHandle::Release, otherwise we wouldn't trigger the assertion in AddRef in such a case, but that's unrelated to this bug.
(In reply to Eugen Sawin [:esawin] from comment #5)
> (In reply to Mike Hommey [:glandium] from comment #4)
> > There is similar code for LibHandle refcounting, but slightly different. It
> > would be preferable to use the same strategy for both types, preferably by
> > sharing the Release() code.
> 
> We don't access the lib handles directly from JNI, so I think
> LibHandle/ElfLoader don't need to be thread-safe.
> 
> The Release specialization for Zip is only effective with the locking
> mechanics we've introduced to ZipCollection before.
> 
> We could change the debug mRefCnt value from 0x7fffdead to detail::DEAD
> (0xffffdead) in LibHandle::Release, otherwise we wouldn't trigger the
> assertion in AddRef in such a case, but that's unrelated to this bug.

I've tried to picture what shared Release code would look like and I think it would require a friend entry in RefCounted to make it work since Release could not be a specialization of either Release template and would need to access mRefCnt.

Alternatively, I can modify LibHandleList to hold RefPtr and use the same forget-on-1 mechanics (essentially copy the Release specialization), however this wouldn't solve any existing issues afaik and should probably go into a refactoring bug instead.
Flags: needinfo?(mh+mozilla)
You can share by declaring something like:

template <typename T> inline void DoRelease(const T* obj, Atomic<MozRefCountType> &aRefCnt)

And making the RefCounted<*>::Release() implementations call that with (this, mRefCnt)
Flags: needinfo?(mh+mozilla)
I'm assuming that DoRelease would implement both, the forget-on-1 and release mechanics.
In this case, we can't easily share it because lib handles don't have a single authoritative collection like zips have [1].

Zips are registered on creation and, with the new mechanics, released once the ref count drops to 1.

Lib handles have multiple creation and registration paths and have two separate collections [2][3], so that the release-on-1 mechanics would not be applicable without some modifications.

Unless I'm misunderstanding which part of the release mechanics we want to share, I think it would be safer to fix this issue in isolation to have a relatively simple and upliftable fix.

[1] https://dxr.mozilla.org/mozilla-central/source/mozglue/linker/Zip.h#497
[2] https://dxr.mozilla.org/mozilla-central/source/mozglue/linker/ElfLoader.h#484
[3] https://dxr.mozilla.org/mozilla-central/source/mozglue/linker/CustomElf.h#138
Flags: needinfo?(mh+mozilla)
Comment on attachment 8814157 [details] [diff] [review]
0001-Bug-1318247-1.0-Add-custom-refcounting-to-zips-in-Zi.patch

Can you please revisit the patch given the locking mechanics introduced in bug 1358241 for the lib handles case?

I'm uncomfortable changing the release mechanics for the lib handles (see comment 8), where the zip collection case is simpler and we still see crashes there (bug 1364310).
Attachment #8814157 - Flags: review?(mh+mozilla)
Comment on attachment 8814157 [details] [diff] [review]
0001-Bug-1318247-1.0-Add-custom-refcounting-to-zips-in-Zi.patch

Review of attachment 8814157 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/linker/Zip.h
@@ +508,5 @@
> +{
> +  MOZ_ASSERT(static_cast<int32_t>(mRefCnt) > 0);
> +  const auto count = --mRefCnt;
> +  if (count == 1) {
> +    // No external references are left, attempt to remove it from the collection.

Add a comment that this is expected to loop back in this function from Forget calling vector::erase.
Attachment #8814157 - Flags: review?(mh+mozilla) → review+
Flags: needinfo?(mh+mozilla)
Pushed by esawin@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8c0a3de985d4
[1.0] Add custom refcounting of zips in ZipCollection to allow for thread-safe reuse of zips. r=glandium
Added comment.
Attachment #8814157 - Attachment is obsolete: true
Attachment #8882668 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/8c0a3de985d4
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 56
Should we consider this for backport to 55 or let it ride the 56 train?
Flags: needinfo?(esawin)
Keywords: crash
It's a low-volume crash in automation and I don't expect users to hit this a lot in release.
Although the patch is simple, it changes our custom ref-counting behavior.

I would prefer to let this ride the train.
Flags: needinfo?(esawin)
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: