Closed Bug 987867 Opened 10 years ago Closed 10 years ago

JB & KK crash in java.util.ConcurrentModificationException: at java.util.LinkedList$LinkIterator.next(LinkedList.java)

Categories

(Firefox for Android Graveyard :: Favicon Handling, defect)

All
Android
defect
Not set
critical

Tracking

(firefox28 unaffected, firefox29 fixed, firefox30 fixed, firefox31 fixed, b2g-v1.4 fixed, fennec29+)

RESOLVED FIXED
Firefox 31
Tracking Status
firefox28 --- unaffected
firefox29 --- fixed
firefox30 --- fixed
firefox31 --- fixed
b2g-v1.4 --- fixed
fennec 29+ ---

People

(Reporter: kbrosnan, Assigned: rnewman)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-98366937-62ad-49af-aad6-60e212140324.
=============================================================

Jellybean and Kitkat crash while typing in the address bar.

0 	libmozalloc.so 	mozalloc_abort(char const*) 	memory/mozalloc/mozalloc_abort.cpp
1 	libxul.so 	Java_org_mozilla_gecko_GeckoAppShell_reportJavaCrash 	widget/android/AndroidJNI.cpp
2 	libmozglue.so 	Java_org_mozilla_gecko_GeckoAppShell_reportJavaCrash 	mozglue/android/jni-stubs.inc
3 	libdvm.so 	libdvm.so@0x1dbce 	
4 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0x1d82ef 	
5 	dalvik-heap (deleted) 	dalvik-heap (deleted)@0x7cfde 	
6 	libdvm.so 	libdvm.so@0x4e125 	
7 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0x1d82ed 	
8 	libmozglue.so 	Java_org_mozilla_gecko_GeckoAppShell_dispatchMemoryPressure 	mozglue/android/jni-stubs.inc
9 		@0x857752b2 	
10 	libmozglue.so 	Java_org_mozilla_gecko_GeckoAppShell_dispatchMemoryPressure 	mozglue/android/jni-stubs.inc
11 		@0x4015fffe 	
12 	libc.so 	libc.so@0xdc0b 	
13 	libdvm.so 	libdvm.so@0x4fd55 	
14 	libdvm.so 	libdvm.so@0x73917 	
15 	libdvm.so 	libdvm.so@0xaac72 	
16 	libdvm.so 	libdvm.so@0x794bb 	
17 	dalvik-LinearAlloc (deleted) 	dalvik-LinearAlloc (deleted)@0x10dc36 	
18 	dalvik-LinearAlloc (deleted) 	dalvik-LinearAlloc (deleted)@0x35e492 	
19 	dalvik-heap (deleted) 	dalvik-heap (deleted)@0x7cfde 	
20 	dalvik-LinearAlloc (deleted) 	dalvik-LinearAlloc (deleted)@0x35e47e 	
21 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0x2016e8 	
22 	libdvm.so 	libdvm.so@0x6be39 	
23 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0x2016e8 	
24 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0x460fe 	
25 	dalvik-heap (deleted) 	dalvik-heap (deleted)@0x7cfde 	
26 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0x2016e8 	
27 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0x2adffe 	
28 	libdvm.so 	libdvm.so@0x4fc5b 	
29 	libdvm.so 	libdvm.so@0xaac72 	
30 	dalvik-LinearAlloc (deleted) 	dalvik-LinearAlloc (deleted)@0x35e47e 	
31 	libdvm.so 	libdvm.so@0x4df93 	
32 	libdvm.so 	libdvm.so@0xaf1ee 	
33 	libdvm.so 	libdvm.so@0xaac72 	
34 	libdvm.so 	libdvm.so@0x4fb11 	
35 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0xc9fa4 	
36 	dalvik-heap (deleted) 	dalvik-heap (deleted)@0x7cfde 	
37 	libdvm.so 	libdvm.so@0x1dd3e 	
38 	libdvm.so 	libdvm.so@0x26fe2 	
39 	libdvm.so 	libdvm.so@0x2df52 	
40 	dalvik-LinearAlloc (deleted) 	dalvik-LinearAlloc (deleted)@0x6b3e 	
41 	libdvm.so 	libdvm.so@0x2dfa2 	
42 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex 	data@app@org.mozilla.firefox_beta-1.apk@classes.dex@0x106d1a 	
43 	libdvm.so 	libdvm.so@0x271d2 	
44 	libdvm.so 	libdvm.so@0x2df52 	
45 	libdvm.so 	libdvm.so@0x2b63a
> java.util.ConcurrentModificationException
>     at java.util.LinkedList$LinkIterator.next(LinkedList.java:124)
>     at java.util.LinkedList.removeOneOccurrence(LinkedList.java:835)
>     at java.util.LinkedList.removeFirstOccurrenceImpl(LinkedList.java:830)
>     at java.util.LinkedList.remove(LinkedList.java:665)
>     at org.mozilla.gecko.favicons.cache.FaviconCache.setMostRecentlyUsedWithinWrite(FaviconCache.java:494)
>     at org.mozilla.gecko.favicons.Favicons.putFaviconsInMemCache(Favicons.java:309)
>     at org.mozilla.gecko.favicons.Favicons.putFaviconsInMemCache(Favicons.java:313)
>     at org.mozilla.gecko.favicons.LoadFaviconTask.pushToCacheAndGetResult(LoadFaviconTask.java:431)
>     at org.mozilla.gecko.favicons.LoadFaviconTask.doInBackground$2d4c763b(LoadFaviconTask.java:354)
>     at org.mozilla.gecko.favicons.LoadFaviconTask.doInBackground$42af7916(LoadFaviconTask.java:42)
>     at org.mozilla.gecko.util.UiAsyncTask$BackgroundTaskRunnable.run(UiAsyncTask.java:48)
>     at android.os.Handler.handleCallback(Handler.java:733)
>     at android.os.Handler.dispatchMessage(Handler.java:95)
>     at android.os.Looper.loop(Looper.java:136)
>     at org.mozilla.gecko.util.GeckoBackgroundThread.run(GeckoBackgroundThread.java:32)

The report above is from 29 Beta, after bug 969417 landed on 29 Aurora
Component: Keyboards and IME → General
Blocks: 969417
Based on my reading of the 31 code, this can't happen unless the semaphore is allowing simultaneous reads and writes.

First hypothesis: 29 is not the same as 31.

Second hypothesis: the R/W exclusion mechanism is wrong.

Third hypothesis: we're accidentally doing multi-threaded work when we expect single-threaded.

Fourth hypothesis: something sneaky is happening with optimization -- that stack doesn't include every method that it should.
Flags: needinfo?(chriskitching)
This is happening on 31 - 29 from the signature report.
(In reply to Kevin Brosnan [:kbrosnan] from comment #3)
> This is happening on 31 - 29 from the signature report.

Could it be that the reports from 29 are from before the patch got uplifted? I'd be so happy if this was some sort of clerical error...


(In reply to Richard Newman [:rnewman] from comment #2)
> Based on my reading of the 31 code, this can't happen unless the semaphore
> is allowing simultaneous reads and writes.

Correct.
Multiple-finish of either a read or a write would put the semaphores in a state that would allow this - but program flow is such that doing so should be impossible...

Reflective access to the class? Either reflectively calling one of the methods in the wrong way, or reflectively changing the ordering semaphore.

Bug in the Semaphore implementation on Android?

Memory corruption from some wonkly C++ (Or sun.misc.Unsafe) code somewhere could, theoretically, cause this (It's not a segfault to corrupt the memory in Java-land when you're both within the same process, after all), so if someone somewhere got some pointer arithmetic wrong in *precisely* the right way...

... Now I'm just getting silly.

> First hypothesis: 29 is not the same as 31.

Hopefully in exactly the "didn't have the other patch" way...
Modifications elsewhere in the code, assuming they're not doing stupid things with Reflection to do things they shouldn't be able to do, should not be able to cause this sort of failure.

> Second hypothesis: the R/W exclusion mechanism is wrong.

Me being stupid - by far the most likely option. But neither of us can spot the screwup at this point, it seems. Everything *appears* watertight. Might want to consider rephrasing the concurrency here to use ReentrantReadWriteLock, although it doesn't quite handle all of the wrinkles of our system, it'd mean slightly less of the code here is roll-your-own semaphore magic. (I didn't use it from the start because I wanted to support the (flawed and now defunct) lock upgrading policy, as well as to provide consistency in implementation method).
I'm skeptical that'll make a difference, and it won't touch the reordering lock semaphore, which is a cruicial part of the protection of the datastructure that's getting upset. (And eliminating that entirely would cause every read to incur a write, effectively removing all concurrency from the cache system (Which would be a disaster for the performance of those scrolly list views with favicons that the about:home people seem to be so fond of using absolutely everywhere)).

> Third hypothesis: we're accidentally doing multi-threaded work when we
> expect single-threaded.

What, exactly, do you mean by this? The FaviconCache class is ostensibly safe for an arbitrarily large number of threads to concurrently use it.
If you mean that we've got an extra thread running when we're in a write, so should have only one active thread, that seems to be exactly what has happened. Since that's supposedly impossible, something very funny has happened. It's annoying that ConcurrentModificationExceptions don't give you both of the offending stack traces (Of both threads).
...
That's a point. Doesn't crash reporter do a thread dump? I know the HotSpot VM does once when it crashes, but can't seem to find a comparable thing on the crash interface, alas.
If we had a thread dump from this failure, we'd know in linear (in thread count) time what the problem is (Since both the thread that threw the ConcurrentModificationException and the other thread that did the concurrent modification the first one is so upset about will be present in the dump).

So, that does at least give us one slightly amusing solution to the problem: Deploy a patch that causes us to have a thread dump when this next happens, and eat cake for a month until it happens again.
That, however, seems awfully like giving up.

> Fourth hypothesis: something sneaky is happening with optimization -- that
> stack doesn't include every method that it should.

If that is true, we're astronomically screwed. Attempting to detect an optimiser bug given only a very rare intermittent concurrency failure isn't simple.

In any case, the inlining that seems to have taken place is of org.mozilla.gecko.favicons.cache.FaviconCache#putFavicons into org.mozilla.gecko.favicons.Favicons#putFaviconsInMemCache(java.lang.String, java.util.Iterator<android.graphics.Bitmap>, boolean). That *by itself* is safe, but we don't really know what other tomfoolery has taken place inside the method.
It might prove productive to study the resulting bytecodes to see if everything reads correctly at that level. That, however, would be a slow and soul-destroying activity - there's a lot of fairly complicated methods to read through.
If no sensible routes for making progress appear in the near future this is something I'll find time to do. (Finding a Proguard bug would be a pretty excellent thing to manage to do).

tl;dr: That's not supposed to happen. It's not obvious why it did, but here are some really stupid possibilities as to why it might, as well as some ramblings on and around the subject in question.
Flags: needinfo?(chriskitching)
Assignee: nobody → chriskitching
tracking-fennec: ? → 29+
Maybe I'm missing something here, but it looks like writes and reads race WRT the 'ordering' LRU list. In the read case, it acquires 'reorderingSemaphore', but nothing tries to acquire that in the lock case.
NM, I see startRead() acquires a write lock. Weirdness.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #6)
> NM, I see startRead() acquires a write lock. Weirdness.

Indeed. The idea is that arbitrarily many read threads can concurrently execute, but only a single writer thread (and no readers may execute while a writer is executing).
Since writer threads are certain to be operating alone, they can play with the ordering list without needing to bother taking the reordering semaphore. Since reader threads don't have this luxury, they need to use the semaphore to ensure they don't step on each other's toes (But this quirk also lets reader threads write to the ordering list without needing to actually be a writer, enabling us to have decent concurrency (As the alternative would be very close to just locking the entire cache whenever we do anything)).

I think this bug is either really obscure and subtle, a spurious report of the old fault in some way, or an API bug of some sort.
Fun.

Suggestions welcome.
Assignee: chriskitching → rnewman
Does this qualify as a top-crash on ARMv7? It's #5 on crash-stats for Firefox 29
Richard - What's the plan here, assuming we have one?
Flags: needinfo?(rnewman)
We don't have a good one. No good STR, and code reading says "can't happen", so we're still riding around in the Mystery Machine.

I'll spend a little more time today attempting to figure out a root cause, and then failing that I'll go for a band-aid fix.
Flags: needinfo?(rnewman)
I decided to add some checks, and look what I found!

 java.lang.RuntimeException: setMostRecentlyUsedWithinRead has 1 permits. Expected 0.
 	at org.mozilla.gecko.favicons.cache.FaviconCache.expectNoPermits(FaviconCache.java:471)
 	at org.mozilla.gecko.favicons.cache.FaviconCache.setMostRecentlyUsedWithinRead(FaviconCache.java:483)
 	at org.mozilla.gecko.favicons.cache.FaviconCache.getFaviconForDimensions(FaviconCache.java:319)
 	at org.mozilla.gecko.favicons.Favicons.getSizedFaviconFromCache(Favicons.java:180)
 	at org.mozilla.gecko.favicons.Favicons.getSizedFaviconForPageFromLocal(Favicons.java:208)
 	at org.mozilla.gecko.favicons.Favicons.getSizedFaviconForPageFromLocal(Favicons.java:226)
 	at org.mozilla.gecko.home.TwoLinePageRow.updateFromCursor(TwoLinePageRow.java:234)
 	at org.mozilla.gecko.home.BookmarksListAdapter.bindView(BookmarksListAdapter.java:287)
 	at org.mozilla.gecko.home.MultiTypeCursorAdapter.getView(MultiTypeCursorAdapter.java:62)


 java.lang.RuntimeException: setMostRecentlyUsedWithinRead has 1 permits. Expected 0.
 	at org.mozilla.gecko.favicons.cache.FaviconCache.expectNoPermits(FaviconCache.java:471)
 	at org.mozilla.gecko.favicons.cache.FaviconCache.setMostRecentlyUsedWithinRead(FaviconCache.java:483)
 	at org.mozilla.gecko.favicons.cache.FaviconCache.getFaviconForDimensions(FaviconCache.java:319)
 	at org.mozilla.gecko.favicons.Favicons.getSizedFaviconFromCache(Favicons.java:180)
 	at org.mozilla.gecko.favicons.Favicons.getSizedFavicon(Favicons.java:155)
 	at org.mozilla.gecko.BrowserApp.loadFavicon(BrowserApp.java:1480)
 	at org.mozilla.gecko.BrowserApp.onTabChanged(BrowserApp.java:258)


So apparently the read flow isn't correctly acquiring the write lock.
Also, here's a hypothetical trio of threads, hitting startRead and startWrite.

  R1                   |  R2                   |  W1
=======================================================================
Acquire turn           |                       |
                       | Wait for turn         |
Release turn           |                       |
                       | Acquire turn, release |
                       |                       | Acquire turn
incrementAndGet = 1    |                       |
                       | incrementAndGet = 2   |
                       |                       | Acquire write lock
WAITING ON WRITE LOCK  | FREE TO PERFORM READ  | FREE TO PERFORM WRITE



Chris, does this look like a fair analysis to you?
Flags: needinfo?(chriskitching)
Conclusion: https://gist.github.com/rnewman/11274983

There seems to be something going wrong with the write lock. Will investigate more tomorrow.
Double-release!
Attachment #8412793 - Flags: review?(chriskitching)
Attachment #8412793 - Flags: review?(chriskitching) → review+
Comment on attachment 8412793 [details] [diff] [review]
Only finishRead once if we fail to compute dominant color. v1

Ha. That'd do it.

That said, I don't think this resolves the scenario allowing readers and writers to coexist you outlined earlier - the use of an atomic integer for ongoingReads is not sufficient to make that safe.

This, however, is very likely a much more frequent cause of failure than that other scenario.

That we both missed this for so long is sort of impressively bad. Whoops. Great work!
Glad this means I don't have to find time this weekend to try and stop you using an evil ConcurrentLinkedList.
Flags: needinfo?(chriskitching)
https://hg.mozilla.org/integration/fx-team/rev/df92a5c9e366
Status: NEW → ASSIGNED
Flags: needinfo?(rnewman)
https://hg.mozilla.org/mozilla-central/rev/df92a5c9e366
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 31
Hang on, as I mentioned in Comment 15, the problem RNewman pointed out in Comment 12 isn't resolved by this patch - so there's still scope for failure here. The multiple-release was merely the more common mode of failure.
If readers and writers can coexist, everything's going to end horribly at least some of the time.
(In reply to Chris Kitching [:ckitching] from comment #18)
> Hang on, as I mentioned in Comment 15, the problem RNewman pointed out in
> Comment 12 isn't resolved by this patch…

The double-finish led to dramatically worse race conditions: once this happens (by failing to compute the background color for search engine favicons), the semaphore advances to always allow for two concurrent writers, no further shenanigans required.

I'm far less concerned about the problem in Comment 12. I'll file a follow-up for that.
Component: General → Favicon Handling
Flags: needinfo?(rnewman)
Blocks: 1002537
Comment on attachment 8412793 [details] [diff] [review]
Only finishRead once if we fail to compute dominant color. v1

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 
  Favicon cache rework in 29.

User impact if declined: 
  #6 crasher, as far as I can see.

Testing completed (on m-c, etc.): 
  Tested locally with logging to provoke traces of the error. It's been in Nightly for a couple of days. The change is obviously correct, so now we just need to see if it reduces the crash rate.

Risk to taking this patch (and alternatives if risky):
  Low; this should be a slam-dunk improvement, because the error essentially turns off all of the sane concurrency restrictions, and this restores them! 

  I would take this patch for another beta or a 29.0.1 if we spin one, so flagging for beta, too.

String or IDL/UUID changes made by this patch:
  None.
Attachment #8412793 - Flags: approval-mozilla-beta?
Attachment #8412793 - Flags: approval-mozilla-aurora?
(In reply to Richard Newman [:rnewman] from comment #19)
> I'm far less concerned about the problem in Comment 12. I'll file a
> follow-up for that.

Filed Bug 1002537.
Comment on attachment 8412793 [details] [diff] [review]
Only finishRead once if we fail to compute dominant color. v1

We're not seeing a need for a 29.0.1 (yet) but flagging this for m-r in case we do so and can take this as a stability ride-along.  In the meantime let's get it on Beta (now 30) to collect more proof that that this is the fix.
Attachment #8412793 - Flags: approval-mozilla-release?
Attachment #8412793 - Flags: approval-mozilla-beta?
Attachment #8412793 - Flags: approval-mozilla-beta+
Attachment #8412793 - Flags: approval-mozilla-aurora?
Attachment #8412793 - Flags: approval-mozilla-aurora-
I am concerned that this bug fix missed beta 1 so we don't have any large population testing of this fix. Unless a 29.0.1 happens later next week I would hold off.
Comment on attachment 8412793 [details] [diff] [review]
Only finishRead once if we fail to compute dominant color. v1

We are taking this for a non-urgent 29.0.1 release.
Attachment #8412793 - Flags: approval-mozilla-release? → approval-mozilla-release+
I talked with Lukas on IRC, made a mistake looking at the calendar. Looked at the May calendar which has the 28th on a Wednesday. This made it into beta 1. We double checked crash-stats data and the crash has not happened on 29 beta 1. 

https://crash-stats.mozilla.com/search/?signature=~java.util.ConcurrentModificationException%3A+at+java.util.LinkedList%24LinkIterator.next%28LinkedList.java%29&version=30.0b1&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.