Crash in java.lang.StackOverflowError: at java.lang.Character.digit(Character.java) during batching downloads

RESOLVED FIXED in Firefox 58

Status

()

defect
P1
critical
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: wgianopoulos, Assigned: Grisha)

Tracking

({crash})

Firefox 54
Firefox 58
All
Android
Points:
---

Firefox Tracking Flags

(firefox58 fixed)

Details

(crash signature)

Attachments

(2 attachments)

This bug was filed from the Socorro interface and is 
report bp-2b2b0f9a-94c3-44b5-980f-b36dc2170329.
=============================================================
This seems to crash about 1 time per day.  This occurs while Aurora is NOT running, so it is either a background sync or the background check for updates.  My money is on the latter.  I have disabled sync so we shall see what happens tomorrow.  The first crash I see in about:crashes is from March 21 at 6:21 AM EDT.

This only occurs on my Samsung Galaxy Tab 2 GT-P3113.

This does not occur on my Nexus 7 2012.

This also does not occur on mozilla-central nightlies.
Bill, Firefox mobile doesn't use the Toolkit -> Application Update code so it can't be that. Please find a better component to move this bug to.
Flags: needinfo?(wgianopoulos)
That was the component suggested by crash reporter.  So, asking me to come up with a better one seems a bit specious.
Flags: needinfo?(wgianopoulos)
(In reply to Bill Gianopoulos [:WG9s] from comment #3)
> That was the component suggested by crash reporter.  So, asking me to come
> up with a better one seems a bit specious.

IF you think this is not the correct component, then please move it to where you think it needs to be.
Where in crash reporter is this component suggested?
Flags: needinfo?(wgianopoulos)
Have you never filed a bug via the crash reporter interface?????   It either correlates your crash with an existing bug or it gives you a place to file it under a suggested component.
Flags: needinfo?(wgianopoulos)
Posted image Screenshot
Here is a screenshot showing the suggested component.  I have no idea why you are not capable of doing this yourself.
Bill, I tend to actually look at the stack and file it since I started doing this a very long time ago when we were using Talkback by Full Circle and going by the stack is authoritative.

I asked where Application Update is referenced because I don't know this interface and I would like to ask the socorro devs to fix this but I also don't see a reference to Application Update in your sceenshot. If it isn't there "doing this myself" wouldn't have helped. Anyways, now you know that android doesn't use Application Update. Sorry this has been annoying to you.
Component: Application Update → General
Product: Toolkit → Firefox for Android
Component: General → Android Sync
Product: Firefox for Android → Android Background Services
Hardware: ARM → All
Version: 54 Branch → Firefox 54
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #8)
> Bill, I tend to actually look at the stack and file it since I started doing
> this a very long time ago when we were using Talkback by Full Circle and
> going by the stack is authoritative.
> 
> I asked where Application Update is referenced because I don't know this
> interface and I would like to ask the socorro devs to fix this but I also
> don't see a reference to Application Update in your sceenshot. If it isn't
> there "doing this myself" wouldn't have helped. Anyways, now you know that
> android doesn't use Application Update. Sorry this has been annoying to you.

NO this does not happen if i do a check for updates form the about: page.  this seems to happen when the automatic check for updates occurs.
Like I said i disabled sync so by sometime tomorrow we can see if it still crashes or not.
This is _definitely_ not an Application Update issue; it's a problem with the recursive nature of the new batch APIs!

Grisha, can you take a look at this?  Is it possible we need to manage our own stack of batching uploads/downloads because we may recurse very deeply?  Or is there a logic error leading to infinite recursion?
Flags: needinfo?(gkruglov)
But this only happens on Aurora, and only happens on certain hardware.
It seems that there are two things going on here. All 52 crash logs are on GT-I9195, API19, and occur during a restoreSessionTabs call.

All 54 (and 55! so, nightlies) crash logs are on GT-P3113, API17, and occur during download batching.

First, tabs restore crash logs. JanH, do you know why this might happen?

java.lang.StackOverflowError
	at java.lang.Character.digit(Character.java:2318)
	at java.lang.Long.parse(Long.java:359)
	at java.lang.Long.parseLong(Long.java:352)
	at org.json.JSONTokener.readLiteral(JSONTokener.java:306)
	at org.json.JSONTokener.nextValue(JSONTokener.java:111)
	at org.json.JSONTokener.readObject(JSONTokener.java:385)
	at org.json.JSONTokener.nextValue(JSONTokener.java:100)
	at org.json.JSONTokener.readArray(JSONTokener.java:430)
	at org.json.JSONTokener.nextValue(JSONTokener.java:103)
	at org.json.JSONTokener.readObject(JSONTokener.java:385)

        ...(this repeats, a lot)...

        at org.json.JSONTokener.nextValue(JSONTokener.java:100)
	at org.json.JSONObject.<init>(JSONObject.java:155)
	at org.json.JSONObject.<init>(JSONObject.java:172)
	at org.mozilla.gecko.SessionParser.parse(SessionParser.java:84)
	at org.mozilla.gecko.GeckoApp.restoreSessionTabs(GeckoApp.java:1727)
	at org.mozilla.gecko.GeckoApp.access$200(GeckoApp.java:129)
	at org.mozilla.gecko.GeckoApp$15.run(GeckoApp.java:1310)
	at android.os.Handler.handleCallback(Handler.java:733)
	at android.os.Handler.dispatchMessage(Handler.java:95)
	at android.os.Looper.loop(Looper.java:146)
	at org.mozilla.gecko.util.GeckoBackgroundThread.run(GeckoBackgroundThread.java:43)


Below are the Download Batching crash logs. In every crash I count about 27-28 fetches before it blows up. This doesn't scream "deep recursion" to me, but perhaps that particular device has a small enough stack.
It's very likely we'll see more of these crashes as 54 makes it to beta. It should be easy enough to tweak BatchingDownloader to avoid recursion altogether, which I think is the right solution for this problem. Better safe than sorry, when it comes to JVMs and recursion.

java.lang.StackOverflowError
	at java.lang.Character.digit(Character.java:2360)
	at java.lang.Integer.parse(Integer.java:373)
	at java.lang.Integer.parseInt(Integer.java:366)
	at java.lang.Integer.parseInt(Integer.java:332)
	at ch.boye.httpclientandroidlib.message.BasicLineParser.parseProtocolVersion(BasicLineParser.java:161)
	at ch.boye.httpclientandroidlib.message.BasicLineParser.parseStatusLine(BasicLineParser.java:361)
	at ch.boye.httpclientandroidlib.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:160)
	at ch.boye.httpclientandroidlib.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
	at ch.boye.httpclientandroidlib.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
	at ch.boye.httpclientandroidlib.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
	at ch.boye.httpclientandroidlib.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
	at ch.boye.httpclientandroidlib.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:223)
	at ch.boye.httpclientandroidlib.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
	at ch.boye.httpclientandroidlib.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
	at ch.boye.httpclientandroidlib.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
	at ch.boye.httpclientandroidlib.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
	at ch.boye.httpclientandroidlib.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:859)
	at ch.boye.httpclientandroidlib.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.mozilla.gecko.sync.net.BaseResource.execute(BaseResource.java:308)
	at org.mozilla.gecko.sync.net.BaseResource.go(BaseResource.java:373)
	at org.mozilla.gecko.sync.net.BaseResource.get(BaseResource.java:379)
	at org.mozilla.gecko.sync.net.SyncStorageRequest.get(SyncStorageRequest.java:178)
	at org.mozilla.gecko.sync.repositories.downloaders.BatchingDownloader.fetchWithParameters(BatchingDownloader.java:124)
	at org.mozilla.gecko.sync.repositories.downloaders.BatchingDownloader.onFetchCompleted(BatchingDownloader.java:260)
	at org.mozilla.gecko.sync.repositories.downloaders.BatchingDownloaderDelegate.handleRequestSuccess(BatchingDownloaderDelegate.java:73)
	at org.mozilla.gecko.sync.net.SyncStorageCollectionRequest$SyncCollectionResourceDelegate.handleHttpResponse(SyncStorageCollectionRequest.java:143)
	at org.mozilla.gecko.sync.net.BaseResource.execute(BaseResource.java:342)
	at org.mozilla.gecko.sync.net.BaseResource.go(BaseResource.java:373)
	at org.mozilla.gecko.sync.net.BaseResource.get(BaseResource.java:379)
	at org.mozilla.gecko.sync.net.SyncStorageRequest.get(SyncStorageRequest.java:178)
	at org.mozilla.gecko.sync.repositories.downloaders.BatchingDownloader.fetchWithParameters(BatchingDownloader.java:124)
	at org.mozilla.gecko.sync.repositories.downloaders.BatchingDownloader.onFetchCompleted(BatchingDownloader.java:260)
	at org.mozilla.gecko.sync.repositories.downloaders.BatchingDownloaderDelegate.handleRequestSuccess(BatchingDownloaderDelegate.java:73)
	at org.mozilla.gecko.sync.net.SyncStorageCollectionRequest$SyncCollectionResourceDelegate.handleHttpResponse(SyncStorageCollectionRequest.java:143)
	at org.mozilla.gecko.sync.net.BaseResource.execute(BaseResource.java:342)
        
        ...(this repeats a bunch of times, indicating that client is doing a batching download)...
        
	at org.mozilla.gecko.sync.net.BaseResource.go(BaseResource.java:373)
	at org.mozilla.gecko.sync.net.BaseResource.get(BaseResource.java:379)
	at org.mozilla.gecko.sync.net.SyncStorageRequest.get(SyncStorageRequest.java:178)
	at org.mozilla.gecko.sync.repositories.downloaders.BatchingDownloader.fetchWithParameters(BatchingDownloader.java:124)
	at org.mozilla.gecko.sync.repositories.downloaders.BatchingDownloader.onFetchCompleted(BatchingDownloader.java:260)
	at org.mozilla.gecko.sync.repositories.downloaders.BatchingDownloaderDelegate.handleRequestSuccess(BatchingDownloaderDelegate.java:73)
	at org.mozilla.gecko.sync.net.SyncStorageCollectionRequest$SyncCollectionResourceDelegate.handleHttpResponse(SyncStorageCollectionRequest.java:143)
	at org.mozilla.gecko.sync.net.BaseResource.execute(BaseResource.java:342)
	at org.mozilla.gecko.sync.net.BaseResource.go(BaseResource.java:373)
	at org.mozilla.gecko.sync.net.BaseResource.get(BaseResource.java:379)
	at org.mozilla.gecko.sync.net.SyncStorageRequest.get(SyncStorageRequest.java:178)
	at org.mozilla.gecko.sync.repositories.downloaders.BatchingDownloader.fetchWithParameters(BatchingDownloader.java:124)
	at org.mozilla.gecko.sync.repositories.downloaders.BatchingDownloader.fetchSince(BatchingDownloader.java:145)
	at org.mozilla.gecko.sync.repositories.downloaders.BatchingDownloaderController.resumeFetchSinceIfPossible(BatchingDownloaderController.java:69)
	at org.mozilla.gecko.sync.repositories.Server15RepositorySession.fetchSince(Server15RepositorySession.java:60)
	at org.mozilla.gecko.sync.middleware.Crypto5MiddlewareRepositorySession.fetchSince(Crypto5MiddlewareRepositorySession.java:140)
	at org.mozilla.gecko.sync.synchronizer.RecordsChannel.flow(RecordsChannel.java:180)
	at org.mozilla.gecko.sync.synchronizer.RecordsChannel.onBeginSucceeded(RecordsChannel.java:335)
	at org.mozilla.gecko.sync.repositories.delegates.DeferredRepositorySessionBeginDelegate$1.run(DeferredRepositorySessionBeginDelegate.java:24)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1080)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:573)
	at java.lang.Thread.run(Thread.java:856)
Flags: needinfo?(gkruglov) → needinfo?(jh+bugzilla)
See Also: → 1352149
Summary: Crash in java.lang.StackOverflowError: at java.lang.Character.digit(Character.java) → Crash in java.lang.StackOverflowError: at java.lang.Character.digit(Character.java) during batching downloads
(In reply to :Grisha Kruglov from comment #13)
> First, tabs restore crash logs. JanH, do you know why this might happen?

See my answer in bug 1352149 - I've no real idea, but most probably either a session file that's grown too large (but I've no idea what too large is [1]), or else a corrupted session file that has hit a bug in our JSON parser and therefore results in a stack overflow instead of a JSONException.

[1] The crashing phone is a S4 Mini and during debugging some other bug using a large-ish session file provided by somebody, even my old S3 Mini was able to handle something like around 50 tabs fine. Also, according to telemetry [2] some users have session files well into the megabytes, yet at the same time there aren't very many crash reports for this issue around. So maybe the corrupted file triggering a parser bug might be slightly more likely?

[2] https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-03-24&keys=__none__!__none__!__none__&max_channel_version=release%252F52&measure=FX_SESSION_RESTORE_FILE_SIZE_BYTES&min_channel_version=null&processType=*&product=Fennec&sanitize=1&sort_keys=submissions&start_date=2017-03-03&table=0&trim=1&use_submission_date=0
Flags: needinfo?(jh+bugzilla)
Why are we recursing for batches?

How many other StackOverflowErrors are in crash-stats that *aren't* parsing Character.digit?
Flags: needinfo?(gkruglov)
It looks like the success path -- run another download -- is the only part of the code that doesn't use the work queue.

We should use the work queue to flatten the stack.

https://dxr.mozilla.org/mozilla-central/source/mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/downloaders/BatchingDownloader.java#260
No reason to recurse here, this slipped through the cracks. Patch upcoming to use a queue for this.
Flags: needinfo?(gkruglov)
This needs to be backported at least to aurora for me to ascertain if it actually fixes anything.
(In reply to Bill Gianopoulos [:WG9s] from comment #19)
> This needs to be backported at least to aurora for me to ascertain if it
> actually fixes anything.

That's the plan, for the sync portion of this.
Great! I was thinking I could just try uninstalling and reinstalling on the idea this had something to do with my existing install, but figured it would be better to keep it as is with sync disabled and then re-enable once you think it is fixed.
Assignee: nobody → gkruglov
Status: NEW → ASSIGNED
Priority: -- → P1
I guess I was wrong this has now crashed on nightly for me.
Didn't have bandwidth to push a patch through; going to de-prioritize this due to a very low crash count (~1 day on beta).
Status: ASSIGNED → NEW
Priority: P1 → P2
I'm now running into this on an old samsung device, while trying to sync down a large profile for the first time.

Thankfully we now have high-water-mark history downloading, and so it crashes a few times, but eventually succeeds, since the amount of work sync needs to do gets smaller with every attempt, and so our stack gets shallower here.
Priority: P2 → P1
This should be fixed rather soon, so that it has enough time to bake on 57 Nightly.
Status: NEW → ASSIGNED
Comment on attachment 8907363 [details]
Bug 1351673 - Use a single-threaded work queue to process batching downloader work items

https://reviewboard.mozilla.org/r/179034/#review184148

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/downloaders/BatchingDownloader.java:76
(Diff revision 1)
>      // Used to track outstanding requests, so that we can abort them as needed.
>      @VisibleForTesting
>      protected final Set<SyncStorageCollectionRequest> pending = Collections.synchronizedSet(new HashSet<SyncStorageCollectionRequest>());
>      /* @GuardedBy("this") */ private String lastModified;
>  
> +    private final ExecutorService taskQueue = Executors.newSingleThreadExecutor();

Another one!?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/downloaders/BatchingDownloader.java:251
(Diff revision 1)
>          }
>  
>          // We need to make another batching request!
>          // Let the delegate know that a batch fetch just completed before we proceed.
>          // This operation needs to run after every call to onFetchedRecord for this batch has been
>          // processed, hence the delayWorkItem call.

The comment is no longer accurate.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/downloaders/BatchingDownloader.java
(Diff revision 1)
>          });
>      }
>  
>      public void onFetchedRecord(CryptoRecord record,
>                                  RepositorySessionFetchRecordsDelegate fetchRecordsDelegate) {
> -        this.workTracker.incrementOutstanding();

Above we declare that we need to do some work after every `onFetchedRecord` is finished.

We do this by queueing that work on the executor. The code you're removing did this with a counter, running that work when the counter ticks back to zero.

The queue approach only works if every `onFetchedRecord` happens before `onFetchCompleted` runs. 

Are you sure that's guaranteed? Have you documented that constraint?

If that is guaranteed, what did we need `DelayedWorkTracker` for?

It's been a long day and I'm tired, but to me this looks like we had a bunch of code to make sure runnables happened-after a collection of callbacks all completed, and now we don't…
(In reply to Richard Newman [:rnewman] from comment #27)
> Are you sure that's guaranteed? Have you documented that constraint?
> 
> If that is guaranteed, what did we need `DelayedWorkTracker` for?
> 
> It's been a long day and I'm tired, but to me this looks like we had a bunch
> of code to make sure runnables happened-after a collection of callbacks all
> completed, and now we don't…

onFetchCompleted happens after all calls to onFetchRecord went out for a given payload. So, we do have strict ordering of events there (see SyncStorageCollectionRequest).

The queue should provide stronger ordering guarantee than what we currently have, as well. Currently, we only increment counter before calling into onFetchedRecord. Just using DelayedWorkTracker doesn't otherwise guarantee that our callbacks will execute in correct order. It's possible we might even have some subtle bugs here already..?

Also, looking at RecordsChannel, the only guarantees we had prior with DelayedWorkTracker (and still have with this patch) is that we've finished queuing the record for processing - _not_ that the queued record was actually processed by the other session. So it's a very weak guarantee - ISTM that this comment was always misleading. RecordsChannel can in theory ensure that strict ordering is observed (as it does for other 'completed' callbacks), but without some restructuring it's not feasible to do so on the BatchingDownloader level.

In case of onBatchComplete I don't think it's actually worth the effort. It's meant to trigger a flush operation in BufferStorage, which doesn't care about how records are flowing. We might end up making that call in a non-optimal time, but it shouldn't cause problems. And, currently, that call is a no-op, since we don't (yet) have any persistent buffers, and the in-memory buffer ignores the flush call.
What do you think, Richard? ISTM that this _should_ be ok to land safely. I can update the patch with better docs to explain what's going on clearer.
Flags: needinfo?(rnewman)
(In reply to :Grisha Kruglov from comment #29)
> What do you think, Richard? ISTM that this _should_ be ok to land safely. I
> can update the patch with better docs to explain what's going on clearer.

If you're confident in the properties of the system, and you've updated the comments in the files to match the new behavior, then ship it.
Flags: needinfo?(rnewman)
Attachment #8907363 - Flags: review?(rnewman) → review+
Comment on attachment 8907363 [details]
Bug 1351673 - Use a single-threaded work queue to process batching downloader work items

https://reviewboard.mozilla.org/r/179034/#review187328
Attachment #8907363 - Flags: review?(rnewman) → review+
Pushed by gkruglov@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/96dab44be3ba
Use a single-threaded work queue to process batching downloader work items r=rnewman
Backout by philringnalda@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/1465ade89bff
Backed out changeset 96dab44be3ba for Android build bustage
Pushed by gkruglov@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9767e159a701
Use a single-threaded work queue to process batching downloader work items r=rnewman
Backed out for failing android lint:

https://hg.mozilla.org/integration/autoland/rev/8eba074485cc441037a82b2aae44013ff29ed372

Push with failure: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=9767e159a7018465824b7f6e4d504875cfa5cc6b&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=132372610&repo=autoland

Failed tests:
    BatchingDownloaderTest. testBatching
    BatchingDownloaderTest. testBatchingSingleBatchMode
    BatchingDownloaderTest. testBatchingTrivial
    BatchingDownloaderTest. testFailureException
    BatchingDownloaderTest. testFailureLMChangedMultiBatch
Flags: needinfo?(gkruglov)
Ah, my apologies for not checking the tests; the updated patch passes everything locally. The approach is simplistic, but I think it should be good enough for the task.
Flags: needinfo?(gkruglov)
Pushed by gkruglov@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/cd91b8be78f6
Use a single-threaded work queue to process batching downloader work items r=rnewman
https://hg.mozilla.org/mozilla-central/rev/cd91b8be78f6
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 58
Depends on: 1404674
No longer depends on: 1404674
Product: Android Background Services → Firefox for Android
You need to log in before you can comment on or make changes to this bug.