Closed Bug 987323 Opened 7 years ago Closed 6 years ago

AsyncShutdown crash: "SessionFile: Finish writing the latest sessionstore.js"

Categories

(Firefox :: Session Restore, defect)

defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox30 - wontfix
firefox31 - affected
firefox32 + unaffected
firefox33 + unaffected
firefox-esr31 --- affected

People

(Reporter: Yoric, Assigned: Yoric)

References

()

Details

(Keywords: crash, dataloss, topcrash-win)

Crash Data

Attachments

(1 file, 2 obsolete files)

No description provided.
Submitted recently as https://crash-stats.mozilla.com/report/index/bp-1fc61f95-cedd-4e1c-815a-bd6652140324
Crash Signature: https://crash-stats.mozilla.com/report/index/bp-1fc61f95-cedd-4e1c-815a-bd6652140324
Is that a 60s timeout crash that occurs after the SessionFile promise for some reason doesn't resolve?
Crash Signature: https://crash-stats.mozilla.com/report/index/bp-1fc61f95-cedd-4e1c-815a-bd6652140324 → [@ mozalloc_abort(char const* const) | NS_DebugBreak | nsDebugImpl::Abort(char const*, int) ]
Apparently, we have it in the wild:
crash 6fe3999f-ea79-4edb-9c96-bdbcd2140627
crash b04fd3e3-605a-4d7c-90e9-8ac6f2140627
crash e241a261-21fb-4e07-a4c5-567de2140627
crash 65e942fa-dcf9-4f64-ba7b-cf0102140627
crash 77eb812c-cb4e-46a4-af8f-677bf2140627
Keywords: dataloss
Severity: normal → major
So far, my best hypothesis is that the on-worker XHR is failing us, once again.
Note that you can see far more of these crashes in the wild by following these steps:

1. Load https://crash-stats.mozilla.com/report/list?signature=mozalloc_abort%28char%20const*%20const%29%20|%20NS_DebugBreak%20|%20nsDebugImpl%3A%3AAbort%28char%20const*%2C%20int%29#tab-reports
2. Click "Change columns in this table" or scroll to the bottom of the page
3. Select "AsyncShutdownTimeout" from the "Available columns" list 
4. Click ">>" to add it to the list of "Chosen columns"
5. Click "Save and Reload"
6. Find in page "SessionFile: Finish writing the latest sessionstore.js" and you'll see a lot of these crashes.

I'm not sure how to search crashstats to only show the crashes with this message in AsyncShutdownTimer but manually tabulating the results on the first few pages, I would guess ~50% of the crashes are for this reason. That means this could be in the thousands of crashes and is very likely a topcrash.
Severity: major → critical
Theoretically, the only thing that can cause such a crash would be a bug in the implementation of either Chrome Workers or require(). So far, most of the problems we have had with require() were due to weird behavior of XHR in workers during shutdown. My first reaction was to investigate any correlation with bug 965309, but these crashes happened already with FF29. Ben, do you know if something changed with FF29 that could affect Chrome Workers in general and sync XHR in workers in particular?

Anthony, do we have occurrences of this crash after the landing of bug 801598? This patch improves the robustness of SessionStore by catching some errors that take place in case of worker internal errors, typically during shutdown, so there is a small chance that we might have fixed the issue already by accident.

Also, once bug 1034075 has laned, it should give us more information about such crashes, but if we only have one week, this will be a (very) close thing.
Flags: needinfo?(ttaubert)
Flags: needinfo?(bent.mozilla)
Flags: needinfo?(anthony.s.hughes)
Oh, that's interesting. Apparently, we have crashes only under Windows.
Apparently, we have a:
- 2 occurrences of this on Firefox 33;
- 3 occurrences of this on Firefox 32;
- (based on a sample) 500+ on Firefox 31.
I hope that this fixes the issue. Even if it doesn't, this contains more AsyncShutdownTimeout instrumentation to let us find out what's wrong.
Attachment #8451712 - Flags: review?(ttaubert)
Attachment #8451712 - Flags: review?(smacleod)
I forgot to add that this fix is designed specifically for Firefox 31. It backports some of bug 801598 and bug 1034075 to FF 31. Steven, if we land this, this should be done today, before we move to Beta 8, so a quick review would be great :)
Comment on attachment 8451712 [details] [diff] [review]
Instrumenting SessionFile shutdown, tightening AsyncShutdown control

Review of attachment 8451712 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/sessionstore/src/SessionFile.jsm
@@ +260,5 @@
>    let worker = new PromiseWorker("resource:///modules/sessionstore/SessionWorker.js",
>      OS.Shared.LOG.bind("SessionWorker"));
> +
> +  // Status of the latest call to `post`.
> +  let queue = Promise.resolve();

I don't like the name queue for this, it's not actually a queue, it's the promise for the most recent operation (it's name made things more confusing than it had to be). How about "lastPostPromise" or something?

@@ +269,5 @@
> +        let _queue = queue;
> +        queue = null;
> +        return _queue;
> +      },
> +      () => gDiagnostics

Aside: This fetchState argument needs better documentation in AsyncShutdown. You need to dig down into internals to tell what it is.

@@ +289,5 @@
> +      return Task.spawn(function*() {
> +
> +        // Wait until all the previous call to `post` has
> +        // resolved/rejected.
> +        yield queue;

This will wait on the current post (itself), not the last post (you've just set queue to deferred.promise). Shouldn't this be "yield _queue"?
Attachment #8451712 - Flags: review?(smacleod) → review-
Thanks for the quick review. Applied feedback.
Try: https://tbpl.mozilla.org/?tree=Try&rev=c75329b76ea9
Assignee: nobody → dteller
Attachment #8451712 - Attachment is obsolete: true
Attachment #8451712 - Flags: review?(ttaubert)
Attachment #8451760 - Flags: review?(smacleod)
Comment on attachment 8451760 [details] [diff] [review]
Instrumenting SessionFile shutdown, tightening AsyncShutdown control, v2

Review of attachment 8451760 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/sessionstore/src/SessionFile.jsm
@@ +279,5 @@
> +      if (!latestPost) {
> +        throw new Error("Shutdown in progress, cannot post messages to the SessionWorker");
> +      }
> +
> +      // Grab the queue to make sure that we are executed only once

"queue" -> "latest post promise" or something?
Attachment #8451760 - Flags: review?(smacleod) → review+
Amended comments, as requested.
Attachment #8451760 - Attachment is obsolete: true
Attachment #8451767 - Flags: review+
Comment on attachment 8451767 [details] [diff] [review]
Instrumenting SessionFile shutdown, tightening AsyncShutdown control, v3

Approval Request Comment
[Feature/regressing bug #]: bug 966253
[User impact if declined]: Shutdown timeouts/crashes.
[Describe test coverage new/current, TBPL]: TBPL, see above.
[Risks and why]: The original, well tested, fix present on Aurora and Nightly could not be uplifted as it was part of a larger refactoring. This code implements the same algorithm but is not tested in the wild.
[String/UUID change made/needed]: None.
Attachment #8451767 - Flags: approval-mozilla-beta?
Comment on attachment 8451767 [details] [diff] [review]
Instrumenting SessionFile shutdown, tightening AsyncShutdown control, v3

If I read the tests correctly, we are completely killing e10s, which is a bad sign. It also means that we are not going to have time to get this uplifted. Canceling approval-mozilla-beta.
Attachment #8451767 - Flags: approval-mozilla-beta?
> Anthony, do we have occurrences of this crash after the landing of bug 801598? 

The latest build I've seen reporting this crash is Firefox 33.0a1 20140702030201. Considering bug 801598 landed on July 2nd, that might be a good clue that it helped. I can't say 100% though.
Flags: needinfo?(anthony.s.hughes)
Well, writing a patch specifically to land in Beta 8 sounds like a bad idea. Even if the full set of patches of bug 801598 is larger, it has been tested much more extensively, so I would be more confident in landing them than in landing a smaller and unproved patch.
Flags: needinfo?(sledru)
So, this is actually not a topcrasher. As discussed with Sylvestre, we are not going to uplift this just yet.
Flags: needinfo?(sledru)
Most of the issues are triggered by experiments.jsm which is not going to be active in the release.
Just to clarify: it's not the issues that are triggered by Experiments.jsm. It's the fact that they were accidentally labelled a topcrasher, because they share the same signature as the Experiments.jsm crash. Experiments.jsm has 93% of the crashes that match this signature, and this bug has ~4%.
I don't understand enough about this bug to know what to look for in worker changes. Can you please explain your reasoning in comment 7?
Flags: needinfo?(bent.mozilla)
Flags: needinfo?(ttaubert)
According to http://yoric.github.io/are-we-shutting-down-yet-/#SessionFile:%20Finish%20writing%20the%20latest%20sessionstore.js, we do not seem to have crashes post FF31. This suggests that the bug has been fixed already.

bent: This crash means that the SessionWorker (which is a ChromeWorker) took more than 1 minute to send a response to a postMessage() from the main thread at some point during shutdown. Since there are basically no loops in SessionWorker, and since we catch any error and respond, this suggests that the error doesn't take place in JS-land but in the native-land below.

We have hit several variants around this kind of issues in the past, with ChromeWorkers not reacting too well to shutdown (e.g. bug 965309), so it might be worth investigating in this direction.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Resolution: FIXED → WORKSFORME
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #26)
> According to
> http://yoric.github.io/are-we-shutting-down-yet-/#SessionFile:
> %20Finish%20writing%20the%20latest%20sessionstore.js, we do not seem to have
> crashes post FF31. This suggests that the bug has been fixed already.

The problem is that 31 is an ESR version so can you check your data again to see how common this is? There is at least one report in bug 1074353.
Flags: needinfo?(dteller)
I got bit by this bug after being upgraded from ESR 24 to ESR 31 (see bug 1074353). I was able to resolve the issue by resetting my profile, but I saved my old profile in case it's helpful to be able to reproduce the problem.

A couple of notes:
1) With the bad profile, session restore doesn't update at all. My sessionstore.js was last modified 5 days ago, which I think is when the upgrade happened.
2) When using the bad profile, I also have problems with downloads not completing. This problem also appeared after upgrading. No idea if the two problems are related.
I have refined the dashboard since then. According to http://yoric.github.io/are-we-shutting-down-yet-/?signature=~SessionFile%3A+Finish+writing+the+latest+sessionstore.js#SessionFile:%20Finish%20writing%20the%20latest%20sessionstore.js, we have had *two* crashes with Firefox 31.1 ESR during the last 7 days.

I am also a bit worried because we have crashes in 32.0.3, build 20140923175406, which is quite recent.
Flags: needinfo?(dteller)
Will: The two issues are probably not related.
Reopening the bug as it's apparently not entirely fixed for FF31 or FF32.
Tim, any idea? Do you think the race condition you fixed in bug 1020831 could be the cause of this?
Status: RESOLVED → REOPENED
Flags: needinfo?(ttaubert)
Resolution: WORKSFORME → ---
Bug 1020831 would case an error message because we try to write again after we already closed the file. But I can't see that it would be related to this one, which doesn't mean it isn't...
Flags: needinfo?(ttaubert)
Interesting sample: https://crash-stats.mozilla.com/report/index/45f4cdab-312f-4093-8501-fbb172141004
(look at the metadata, AsyncShutdownTimeout)
This signature is showing up as a topcrash in Firefox 34, but I'm not sure if it's this issue or a new one. I filed bug 1110217 for it.
Can we close this? Looks like it isn't an issue anymore?
Flags: needinfo?(dteller)
I see a little noise with this error message (total of ~1 crash/day), but other than that, I believe we can close this.
Flags: needinfo?(dteller)
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #35)
> I see a little noise with this error message (total of ~1 crash/day), but
> other than that, I believe we can close this.

Done.
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.