Closed
Bug 987867
Opened 10 years ago
Closed 10 years ago
JB & KK crash in java.util.ConcurrentModificationException: at java.util.LinkedList$LinkIterator.next(LinkedList.java)
Categories
(Firefox for Android Graveyard :: Favicon Handling, defect)
Tracking
(firefox28 unaffected, firefox29 fixed, firefox30 fixed, firefox31 fixed, b2g-v1.4 fixed, fennec29+)
RESOLVED
FIXED
Firefox 31
People
(Reporter: kbrosnan, Assigned: rnewman)
References
Details
(Keywords: crash, regression)
Crash Data
Attachments
(1 file)
1.20 KB,
patch
|
mfinkle
:
review+
lsblakk
:
approval-mozilla-aurora-
lsblakk
:
approval-mozilla-beta+
lsblakk
:
approval-mozilla-release+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is report bp-98366937-62ad-49af-aad6-60e212140324. ============================================================= Jellybean and Kitkat crash while typing in the address bar. 0 libmozalloc.so mozalloc_abort(char const*) memory/mozalloc/mozalloc_abort.cpp 1 libxul.so Java_org_mozilla_gecko_GeckoAppShell_reportJavaCrash widget/android/AndroidJNI.cpp 2 libmozglue.so Java_org_mozilla_gecko_GeckoAppShell_reportJavaCrash mozglue/android/jni-stubs.inc 3 libdvm.so libdvm.so@0x1dbce 4 data@app@org.mozilla.firefox_beta-1.apk@classes.dex data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0x1d82ef 5 dalvik-heap (deleted) dalvik-heap (deleted)@0x7cfde 6 libdvm.so libdvm.so@0x4e125 7 data@app@org.mozilla.firefox_beta-1.apk@classes.dex data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0x1d82ed 8 libmozglue.so Java_org_mozilla_gecko_GeckoAppShell_dispatchMemoryPressure mozglue/android/jni-stubs.inc 9 @0x857752b2 10 libmozglue.so Java_org_mozilla_gecko_GeckoAppShell_dispatchMemoryPressure mozglue/android/jni-stubs.inc 11 @0x4015fffe 12 libc.so libc.so@0xdc0b 13 libdvm.so libdvm.so@0x4fd55 14 libdvm.so libdvm.so@0x73917 15 libdvm.so libdvm.so@0xaac72 16 libdvm.so libdvm.so@0x794bb 17 dalvik-LinearAlloc (deleted) dalvik-LinearAlloc (deleted)@0x10dc36 18 dalvik-LinearAlloc (deleted) dalvik-LinearAlloc (deleted)@0x35e492 19 dalvik-heap (deleted) dalvik-heap (deleted)@0x7cfde 20 dalvik-LinearAlloc (deleted) dalvik-LinearAlloc (deleted)@0x35e47e 21 data@app@org.mozilla.firefox_beta-1.apk@classes.dex data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0x2016e8 22 libdvm.so libdvm.so@0x6be39 23 data@app@org.mozilla.firefox_beta-1.apk@classes.dex data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0x2016e8 24 data@app@org.mozilla.firefox_beta-1.apk@classes.dex data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0x460fe 25 dalvik-heap (deleted) dalvik-heap (deleted)@0x7cfde 26 data@app@org.mozilla.firefox_beta-1.apk@classes.dex data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0x2016e8 27 data@app@org.mozilla.firefox_beta-1.apk@classes.dex data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0x2adffe 28 libdvm.so libdvm.so@0x4fc5b 29 libdvm.so libdvm.so@0xaac72 30 dalvik-LinearAlloc (deleted) dalvik-LinearAlloc (deleted)@0x35e47e 31 libdvm.so libdvm.so@0x4df93 32 libdvm.so libdvm.so@0xaf1ee 33 libdvm.so libdvm.so@0xaac72 34 libdvm.so libdvm.so@0x4fb11 35 data@app@org.mozilla.firefox_beta-1.apk@classes.dex data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0xc9fa4 36 dalvik-heap (deleted) dalvik-heap (deleted)@0x7cfde 37 libdvm.so libdvm.so@0x1dd3e 38 libdvm.so libdvm.so@0x26fe2 39 libdvm.so libdvm.so@0x2df52 40 dalvik-LinearAlloc (deleted) dalvik-LinearAlloc (deleted)@0x6b3e 41 libdvm.so libdvm.so@0x2dfa2 42 data@app@org.mozilla.firefox_beta-1.apk@classes.dex data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0x106d1a 43 libdvm.so libdvm.so@0x271d2 44 libdvm.so libdvm.so@0x2df52 45 libdvm.so libdvm.so@0x2b63a
Comment 1•10 years ago
|
||
> java.util.ConcurrentModificationException > at java.util.LinkedList$LinkIterator.next(LinkedList.java:124) > at java.util.LinkedList.removeOneOccurrence(LinkedList.java:835) > at java.util.LinkedList.removeFirstOccurrenceImpl(LinkedList.java:830) > at java.util.LinkedList.remove(LinkedList.java:665) > at org.mozilla.gecko.favicons.cache.FaviconCache.setMostRecentlyUsedWithinWrite(FaviconCache.java:494) > at org.mozilla.gecko.favicons.Favicons.putFaviconsInMemCache(Favicons.java:309) > at org.mozilla.gecko.favicons.Favicons.putFaviconsInMemCache(Favicons.java:313) > at org.mozilla.gecko.favicons.LoadFaviconTask.pushToCacheAndGetResult(LoadFaviconTask.java:431) > at org.mozilla.gecko.favicons.LoadFaviconTask.doInBackground$2d4c763b(LoadFaviconTask.java:354) > at org.mozilla.gecko.favicons.LoadFaviconTask.doInBackground$42af7916(LoadFaviconTask.java:42) > at org.mozilla.gecko.util.UiAsyncTask$BackgroundTaskRunnable.run(UiAsyncTask.java:48) > at android.os.Handler.handleCallback(Handler.java:733) > at android.os.Handler.dispatchMessage(Handler.java:95) > at android.os.Looper.loop(Looper.java:136) > at org.mozilla.gecko.util.GeckoBackgroundThread.run(GeckoBackgroundThread.java:32) The report above is from 29 Beta, after bug 969417 landed on 29 Aurora
Component: Keyboards and IME → General
Assignee | ||
Comment 2•10 years ago
|
||
Based on my reading of the 31 code, this can't happen unless the semaphore is allowing simultaneous reads and writes. First hypothesis: 29 is not the same as 31. Second hypothesis: the R/W exclusion mechanism is wrong. Third hypothesis: we're accidentally doing multi-threaded work when we expect single-threaded. Fourth hypothesis: something sneaky is happening with optimization -- that stack doesn't include every method that it should.
Flags: needinfo?(chriskitching)
Reporter | ||
Comment 3•10 years ago
|
||
This is happening on 31 - 29 from the signature report.
Comment 4•10 years ago
|
||
(In reply to Kevin Brosnan [:kbrosnan] from comment #3) > This is happening on 31 - 29 from the signature report. Could it be that the reports from 29 are from before the patch got uplifted? I'd be so happy if this was some sort of clerical error... (In reply to Richard Newman [:rnewman] from comment #2) > Based on my reading of the 31 code, this can't happen unless the semaphore > is allowing simultaneous reads and writes. Correct. Multiple-finish of either a read or a write would put the semaphores in a state that would allow this - but program flow is such that doing so should be impossible... Reflective access to the class? Either reflectively calling one of the methods in the wrong way, or reflectively changing the ordering semaphore. Bug in the Semaphore implementation on Android? Memory corruption from some wonkly C++ (Or sun.misc.Unsafe) code somewhere could, theoretically, cause this (It's not a segfault to corrupt the memory in Java-land when you're both within the same process, after all), so if someone somewhere got some pointer arithmetic wrong in *precisely* the right way... ... Now I'm just getting silly. > First hypothesis: 29 is not the same as 31. Hopefully in exactly the "didn't have the other patch" way... Modifications elsewhere in the code, assuming they're not doing stupid things with Reflection to do things they shouldn't be able to do, should not be able to cause this sort of failure. > Second hypothesis: the R/W exclusion mechanism is wrong. Me being stupid - by far the most likely option. But neither of us can spot the screwup at this point, it seems. Everything *appears* watertight. Might want to consider rephrasing the concurrency here to use ReentrantReadWriteLock, although it doesn't quite handle all of the wrinkles of our system, it'd mean slightly less of the code here is roll-your-own semaphore magic. (I didn't use it from the start because I wanted to support the (flawed and now defunct) lock upgrading policy, as well as to provide consistency in implementation method). I'm skeptical that'll make a difference, and it won't touch the reordering lock semaphore, which is a cruicial part of the protection of the datastructure that's getting upset. (And eliminating that entirely would cause every read to incur a write, effectively removing all concurrency from the cache system (Which would be a disaster for the performance of those scrolly list views with favicons that the about:home people seem to be so fond of using absolutely everywhere)). > Third hypothesis: we're accidentally doing multi-threaded work when we > expect single-threaded. What, exactly, do you mean by this? The FaviconCache class is ostensibly safe for an arbitrarily large number of threads to concurrently use it. If you mean that we've got an extra thread running when we're in a write, so should have only one active thread, that seems to be exactly what has happened. Since that's supposedly impossible, something very funny has happened. It's annoying that ConcurrentModificationExceptions don't give you both of the offending stack traces (Of both threads). ... That's a point. Doesn't crash reporter do a thread dump? I know the HotSpot VM does once when it crashes, but can't seem to find a comparable thing on the crash interface, alas. If we had a thread dump from this failure, we'd know in linear (in thread count) time what the problem is (Since both the thread that threw the ConcurrentModificationException and the other thread that did the concurrent modification the first one is so upset about will be present in the dump). So, that does at least give us one slightly amusing solution to the problem: Deploy a patch that causes us to have a thread dump when this next happens, and eat cake for a month until it happens again. That, however, seems awfully like giving up. > Fourth hypothesis: something sneaky is happening with optimization -- that > stack doesn't include every method that it should. If that is true, we're astronomically screwed. Attempting to detect an optimiser bug given only a very rare intermittent concurrency failure isn't simple. In any case, the inlining that seems to have taken place is of org.mozilla.gecko.favicons.cache.FaviconCache#putFavicons into org.mozilla.gecko.favicons.Favicons#putFaviconsInMemCache(java.lang.String, java.util.Iterator<android.graphics.Bitmap>, boolean). That *by itself* is safe, but we don't really know what other tomfoolery has taken place inside the method. It might prove productive to study the resulting bytecodes to see if everything reads correctly at that level. That, however, would be a slow and soul-destroying activity - there's a lot of fairly complicated methods to read through. If no sensible routes for making progress appear in the near future this is something I'll find time to do. (Finding a Proguard bug would be a pretty excellent thing to manage to do). tl;dr: That's not supposed to happen. It's not obvious why it did, but here are some really stupid possibilities as to why it might, as well as some ramblings on and around the subject in question.
Flags: needinfo?(chriskitching)
Updated•10 years ago
|
Assignee: nobody → chriskitching
tracking-fennec: ? → 29+
Comment 5•10 years ago
|
||
Maybe I'm missing something here, but it looks like writes and reads race WRT the 'ordering' LRU list. In the read case, it acquires 'reorderingSemaphore', but nothing tries to acquire that in the lock case.
Comment 6•10 years ago
|
||
NM, I see startRead() acquires a write lock. Weirdness.
Comment 7•10 years ago
|
||
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #6) > NM, I see startRead() acquires a write lock. Weirdness. Indeed. The idea is that arbitrarily many read threads can concurrently execute, but only a single writer thread (and no readers may execute while a writer is executing). Since writer threads are certain to be operating alone, they can play with the ordering list without needing to bother taking the reordering semaphore. Since reader threads don't have this luxury, they need to use the semaphore to ensure they don't step on each other's toes (But this quirk also lets reader threads write to the ordering list without needing to actually be a writer, enabling us to have decent concurrency (As the alternative would be very close to just locking the entire cache whenever we do anything)). I think this bug is either really obscure and subtle, a spurious report of the old fault in some way, or an API bug of some sort. Fun. Suggestions welcome.
Assignee | ||
Updated•10 years ago
|
Assignee: chriskitching → rnewman
Comment 8•10 years ago
|
||
Does this qualify as a top-crash on ARMv7? It's #5 on crash-stats for Firefox 29
Comment 9•10 years ago
|
||
Richard - What's the plan here, assuming we have one?
Flags: needinfo?(rnewman)
Assignee | ||
Comment 10•10 years ago
|
||
We don't have a good one. No good STR, and code reading says "can't happen", so we're still riding around in the Mystery Machine. I'll spend a little more time today attempting to figure out a root cause, and then failing that I'll go for a band-aid fix.
Flags: needinfo?(rnewman)
Assignee | ||
Comment 11•10 years ago
|
||
I decided to add some checks, and look what I found! java.lang.RuntimeException: setMostRecentlyUsedWithinRead has 1 permits. Expected 0. at org.mozilla.gecko.favicons.cache.FaviconCache.expectNoPermits(FaviconCache.java:471) at org.mozilla.gecko.favicons.cache.FaviconCache.setMostRecentlyUsedWithinRead(FaviconCache.java:483) at org.mozilla.gecko.favicons.cache.FaviconCache.getFaviconForDimensions(FaviconCache.java:319) at org.mozilla.gecko.favicons.Favicons.getSizedFaviconFromCache(Favicons.java:180) at org.mozilla.gecko.favicons.Favicons.getSizedFaviconForPageFromLocal(Favicons.java:208) at org.mozilla.gecko.favicons.Favicons.getSizedFaviconForPageFromLocal(Favicons.java:226) at org.mozilla.gecko.home.TwoLinePageRow.updateFromCursor(TwoLinePageRow.java:234) at org.mozilla.gecko.home.BookmarksListAdapter.bindView(BookmarksListAdapter.java:287) at org.mozilla.gecko.home.MultiTypeCursorAdapter.getView(MultiTypeCursorAdapter.java:62) java.lang.RuntimeException: setMostRecentlyUsedWithinRead has 1 permits. Expected 0. at org.mozilla.gecko.favicons.cache.FaviconCache.expectNoPermits(FaviconCache.java:471) at org.mozilla.gecko.favicons.cache.FaviconCache.setMostRecentlyUsedWithinRead(FaviconCache.java:483) at org.mozilla.gecko.favicons.cache.FaviconCache.getFaviconForDimensions(FaviconCache.java:319) at org.mozilla.gecko.favicons.Favicons.getSizedFaviconFromCache(Favicons.java:180) at org.mozilla.gecko.favicons.Favicons.getSizedFavicon(Favicons.java:155) at org.mozilla.gecko.BrowserApp.loadFavicon(BrowserApp.java:1480) at org.mozilla.gecko.BrowserApp.onTabChanged(BrowserApp.java:258) So apparently the read flow isn't correctly acquiring the write lock.
Assignee | ||
Comment 12•10 years ago
|
||
Also, here's a hypothetical trio of threads, hitting startRead and startWrite. R1 | R2 | W1 ======================================================================= Acquire turn | | | Wait for turn | Release turn | | | Acquire turn, release | | | Acquire turn incrementAndGet = 1 | | | incrementAndGet = 2 | | | Acquire write lock WAITING ON WRITE LOCK | FREE TO PERFORM READ | FREE TO PERFORM WRITE Chris, does this look like a fair analysis to you?
Flags: needinfo?(chriskitching)
Assignee | ||
Comment 13•10 years ago
|
||
Conclusion: https://gist.github.com/rnewman/11274983 There seems to be something going wrong with the write lock. Will investigate more tomorrow.
Assignee | ||
Comment 14•10 years ago
|
||
Double-release!
Attachment #8412793 -
Flags: review?(chriskitching)
Updated•10 years ago
|
Attachment #8412793 -
Flags: review?(chriskitching) → review+
Comment 15•10 years ago
|
||
Comment on attachment 8412793 [details] [diff] [review] Only finishRead once if we fail to compute dominant color. v1 Ha. That'd do it. That said, I don't think this resolves the scenario allowing readers and writers to coexist you outlined earlier - the use of an atomic integer for ongoingReads is not sufficient to make that safe. This, however, is very likely a much more frequent cause of failure than that other scenario. That we both missed this for so long is sort of impressively bad. Whoops. Great work! Glad this means I don't have to find time this weekend to try and stop you using an evil ConcurrentLinkedList.
Flags: needinfo?(chriskitching)
Assignee | ||
Comment 16•10 years ago
|
||
https://hg.mozilla.org/integration/fx-team/rev/df92a5c9e366
Status: NEW → ASSIGNED
Flags: needinfo?(rnewman)
Comment 17•10 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/df92a5c9e366
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 31
Comment 18•10 years ago
|
||
Hang on, as I mentioned in Comment 15, the problem RNewman pointed out in Comment 12 isn't resolved by this patch - so there's still scope for failure here. The multiple-release was merely the more common mode of failure. If readers and writers can coexist, everything's going to end horribly at least some of the time.
Assignee | ||
Comment 19•10 years ago
|
||
(In reply to Chris Kitching [:ckitching] from comment #18) > Hang on, as I mentioned in Comment 15, the problem RNewman pointed out in > Comment 12 isn't resolved by this patch… The double-finish led to dramatically worse race conditions: once this happens (by failing to compute the background color for search engine favicons), the semaphore advances to always allow for two concurrent writers, no further shenanigans required. I'm far less concerned about the problem in Comment 12. I'll file a follow-up for that.
Assignee | ||
Updated•10 years ago
|
Component: General → Favicon Handling
Flags: needinfo?(rnewman)
Assignee | ||
Comment 20•10 years ago
|
||
Comment on attachment 8412793 [details] [diff] [review] Only finishRead once if we fail to compute dominant color. v1 [Approval Request Comment] Bug caused by (feature/regressing bug #): Favicon cache rework in 29. User impact if declined: #6 crasher, as far as I can see. Testing completed (on m-c, etc.): Tested locally with logging to provoke traces of the error. It's been in Nightly for a couple of days. The change is obviously correct, so now we just need to see if it reduces the crash rate. Risk to taking this patch (and alternatives if risky): Low; this should be a slam-dunk improvement, because the error essentially turns off all of the sane concurrency restrictions, and this restores them! I would take this patch for another beta or a 29.0.1 if we spin one, so flagging for beta, too. String or IDL/UUID changes made by this patch: None.
Attachment #8412793 -
Flags: approval-mozilla-beta?
Attachment #8412793 -
Flags: approval-mozilla-aurora?
Assignee | ||
Comment 21•10 years ago
|
||
(In reply to Richard Newman [:rnewman] from comment #19) > I'm far less concerned about the problem in Comment 12. I'll file a > follow-up for that. Filed Bug 1002537.
Updated•10 years ago
|
Comment 22•10 years ago
|
||
Comment on attachment 8412793 [details] [diff] [review] Only finishRead once if we fail to compute dominant color. v1 We're not seeing a need for a 29.0.1 (yet) but flagging this for m-r in case we do so and can take this as a stability ride-along. In the meantime let's get it on Beta (now 30) to collect more proof that that this is the fix.
Attachment #8412793 -
Flags: approval-mozilla-release?
Attachment #8412793 -
Flags: approval-mozilla-beta?
Attachment #8412793 -
Flags: approval-mozilla-beta+
Attachment #8412793 -
Flags: approval-mozilla-aurora?
Attachment #8412793 -
Flags: approval-mozilla-aurora-
Assignee | ||
Comment 23•10 years ago
|
||
Thanks, Lukas! https://hg.mozilla.org/releases/mozilla-beta/rev/69fcc2d9c81b
Updated•10 years ago
|
status-b2g-v1.4:
--- → fixed
Reporter | ||
Comment 25•10 years ago
|
||
I am concerned that this bug fix missed beta 1 so we don't have any large population testing of this fix. Unless a 29.0.1 happens later next week I would hold off.
Comment 26•10 years ago
|
||
Comment on attachment 8412793 [details] [diff] [review] Only finishRead once if we fail to compute dominant color. v1 We are taking this for a non-urgent 29.0.1 release.
Attachment #8412793 -
Flags: approval-mozilla-release? → approval-mozilla-release+
Reporter | ||
Comment 27•10 years ago
|
||
I talked with Lukas on IRC, made a mistake looking at the calendar. Looked at the May calendar which has the 28th on a Wednesday. This made it into beta 1. We double checked crash-stats data and the crash has not happened on 29 beta 1. https://crash-stats.mozilla.com/search/?signature=~java.util.ConcurrentModificationException%3A+at+java.util.LinkedList%24LinkIterator.next%28LinkedList.java%29&version=30.0b1&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform
Comment 29•10 years ago
|
||
30.0b1, 30.0b2 and 29.0.1 are not affected so far by this crash https://crash-stats.mozilla.com/search/?signature=~java.util.ConcurrentModificationException%3A+at+java.util.LinkedList%24LinkIterator.next%28LinkedList.java%29&version=30.0b1&version=29.0.1&version=30.0b2&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform
Assignee | ||
Updated•10 years ago
|
Updated•3 years ago
|
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•