Closed Bug 965527 Opened 6 years ago Closed 5 years ago

Intermittent B2G desktop Shutdown "###!!! ABORT: file resource://gre/modules/AsyncShutdown.jsm"

Categories

(Toolkit :: OS.File, defect)

x86_64
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla31
Tracking Status
firefox29 --- unaffected
firefox30 --- fixed
firefox31 --- fixed
firefox-esr24 --- unaffected
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed

People

(Reporter: RyanVM, Assigned: Yoric)

References

Details

(Keywords: intermittent-failure)

Attachments

(5 files, 1 obsolete file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=33754620&tree=B2g-Inbound

b2g_ubuntu64_vm b2g-inbound opt test reftest on 2014-01-29 11:03:47 PST for push fbd02ed19b1a
slave: tst-linux64-ec2-352

11:09:08     INFO -  REFTEST TEST-START | Shutdown
11:09:08     INFO -  WARNING: A phase completion condition is taking too long to complete. Condition: OS.File: flush I/O queued before profile-before-change Phase: profile-before-change
11:09:08     INFO -  ERROR: At least one completion condition failed to complete within a reasonable amount of time. Causing a crash to ensure that we do not leave the user with an unresponsive process draining resources. Conditions: OS.File: flush I/O queued before profile-before-change Phase: profile-before-change
11:09:08     INFO -  [Parent 2336] ###!!! ABORT: file resource://gre/modules/AsyncShutdown.jsm, line 390
11:09:08     INFO -  [Parent 2336] ###!!! ABORT: file resource://gre/modules/AsyncShutdown.jsm, line 390
11:09:08  WARNING -  PROCESS-CRASH | Shutdown | application crashed [@ mozalloc_abort(char const*)]
Dave, can you please take a look? :)
Flags: needinfo?(dteller)
I am in the process of instrumenting AsyncShutdown and OS.File to get better reporting on such things (bug 957123). In the meantime, there is nothing I can do.
Flags: needinfo?(dteller)
philor, comment 34 seems to be a different bug. Could you file it and assign it to rvitillo?
Attached patch More loggingSplinter Review
Trying with more logging.
Attachment #8375607 - Flags: review+
Keywords: checkin-needed
Whiteboard: [leave open]
https://hg.mozilla.org/integration/fx-team/rev/7cc797a9b5dc
Keywords: checkin-needed
Whiteboard: [leave open] → [leave open][fixed-in-fx-team]
https://hg.mozilla.org/mozilla-central/rev/7cc797a9b5dc
Whiteboard: [leave open][fixed-in-fx-team] → [leave open]
For some reason, we end up in a case in which the logging throws an error. Let's make it even more careful.
Attachment #8377050 - Flags: review+
This one is a little clearer.
1. OS.File (and only OS.File) has trouble shutting down;
2. the reset timer is on;
3. we don't know whether there is any message pending, because it's not a DEBUG build

This could be a bug in OS.File (most likely) or this could be some OS.File client spamming OS.File with requests. Working on further expanding the logging.
Yet more logging and as a bonus, a test on that logging.
Attachment #8378956 - Flags: review+
Same try, after pull: https://tbpl.mozilla.org/?tree=Try&rev=4882894cfb96
So here we go for further logging.
Keywords: checkin-needed
https://hg.mozilla.org/integration/fx-team/rev/03545057aff2
Keywords: checkin-needed
Whiteboard: [leave open] → [leave open][fixed-in-fx-team]
https://hg.mozilla.org/mozilla-central/rev/03545057aff2
Whiteboard: [leave open][fixed-in-fx-team] → [leave open]
So, we now know that some call to OS.File.writeAtomic is placed a few seconds before the deadline and doesn't receive a reply within ~1 minute. On closer inspection of the code, we might have a race condition if the first call to OS.File is placed during a brief window during shutdown, which would explain the issue.

Hardening the process, let's see if it works.
Assignee: nobody → dteller
Attachment #8382941 - Flags: review?(nfroyd)
Comment on attachment 8382941 [details] [diff] [review]
Hardening OS.File shutdown flag

Review of attachment 8382941 [details] [diff] [review]:
-----------------------------------------------------------------

Sure, makes sense.

::: toolkit/components/osfile/modules/osfile_async_front.jsm
@@ +205,5 @@
> +        LOG("OS.File is not available anymore. Request has been rejected.");
> +        throw new Error("OS.File has been shut down.");
> +      }
> +      return code();
> +    }.bind(this));

Fat arrow function instead?
Attachment #8382941 - Flags: review?(nfroyd) → review+
(In reply to TBPL Robot from comment #63)

This was on a push from after comment 62.
Attached patch 4. Yet more detailed logging (obsolete) — Splinter Review
I'm blocked by the lack of information on the file being written. Reworking the debugging information to ensure that we keep this data, without keeping data that is too expensive.
Attachment #8394126 - Flags: review?(nfroyd)
Comment on attachment 8394126 [details] [diff] [review]
4. Yet more detailed logging

Review of attachment 8394126 [details] [diff] [review]:
-----------------------------------------------------------------

Is there a reason you're turning this on for non-DEBUG builds too?  I worry that we're going to be collecting all this information in non-debug (i.e. release) builds and that's going to slow things down.

r=me if we can turn this back into mostly DEBUG-only logging before the next uplift.

::: toolkit/components/osfile/modules/osfile_async_front.jsm
@@ +139,5 @@
> + * Extract a shortened version of an object, fit for logging.
> + *
> + * This function returns a copy of the original object in which all
> + * long strings, Arrays, TypedArrays, ArrayBuffers are removed and
> + * replaced with plceholders. Use this function to sanitize objects

Nit: "placeholders"

@@ +160,5 @@
> +    if (Array.isArray(obj)) {
> +      if (obj.length > 32) {
> +        return {"Long array": obj.length};
> +      }
> +      return [summarizeObject(k) for (k of obj)];

Perhaps it would be better to do something like (possibly not real JS):

return [summarizeObject(k) for (k of obj.slice(0, Math.min(obj.length, 32)))] + [(obj.length - 32) + " trailing elements omitted"];

WDYT?

@@ +203,5 @@
> +  Debugging: {
> +    /**
> +     * The latest message sent and still waiting for a reply. In DEBUG
> +     * builds, the entire message is stored, which may be memory-consuming.
> +     * In non-DEBUG builds, only the method name is stored.

This comment is no longer accurate.  (Might have been more straightforward if you had made the Debugging-as-object changes, then the summary+message # changes.)

@@ +211,5 @@
> +    /**
> +     * The latest reply received, or null if we are waiting for a reply.
> +     * In DEBUG builds, the entire response is stored, which may be
> +     * memory-consuming.  In non-DEBUG builds, only exceptions and
> +     * method names are stored.

Likewise.

@@ +326,3 @@
>          reply = error;
>          isError = true;
> +        if (error instanceof PromiseWorker.WorkerError) {

I guess this refactoring makes this clearer...
Attachment #8394126 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd (:froydnj) from comment #83)
> Comment on attachment 8394126 [details] [diff] [review]
> 4. Yet more detailed logging
> 
> Review of attachment 8394126 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Is there a reason you're turning this on for non-DEBUG builds too?  I worry
> that we're going to be collecting all this information in non-debug (i.e.
> release) builds and that's going to slow things down.

For some reason, the error doesn't seem to happen on DEBUG builds, so I need info on non-DEBUG builds.

> r=me if we can turn this back into mostly DEBUG-only logging before the next
> uplift.

> @@ +160,5 @@
> > +    if (Array.isArray(obj)) {
> > +      if (obj.length > 32) {
> > +        return {"Long array": obj.length};
> > +      }
> > +      return [summarizeObject(k) for (k of obj)];
> 
> Perhaps it would be better to do something like (possibly not real JS):
> 
> return [summarizeObject(k) for (k of obj.slice(0, Math.min(obj.length,
> 32)))] + [(obj.length - 32) + " trailing elements omitted"];
> 
> WDYT?

Actually, I don't think we are ever going to get an array with size > 5 in the current setting, so I wanted to go for something simple. What do you think about keeping such a refinement for later, if needed?


> @@ +326,3 @@
> >          reply = error;
> >          isError = true;
> > +        if (error instanceof PromiseWorker.WorkerError) {
> 
> I guess this refactoring makes this clearer...

Logging was making this piece of code too complicated for its own good. I took the opportunity to clean it up a little.
(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #84)
> > Is there a reason you're turning this on for non-DEBUG builds too?  I worry
> > that we're going to be collecting all this information in non-debug (i.e.
> > release) builds and that's going to slow things down.
> 
> For some reason, the error doesn't seem to happen on DEBUG builds, so I need
> info on non-DEBUG builds.

Ah, I see I was misled by the mis-starred failures above.  OK, then, yes, we do need this in opt builds.

> > @@ +160,5 @@
> > > +    if (Array.isArray(obj)) {
> > > +      if (obj.length > 32) {
> > > +        return {"Long array": obj.length};
> > > +      }
> > > +      return [summarizeObject(k) for (k of obj)];
> > 
> > Perhaps it would be better to do something like (possibly not real JS):
> > 
> > return [summarizeObject(k) for (k of obj.slice(0, Math.min(obj.length,
> > 32)))] + [(obj.length - 32) + " trailing elements omitted"];
> > 
> > WDYT?
> 
> Actually, I don't think we are ever going to get an array with size > 5 in
> the current setting, so I wanted to go for something simple. What do you
> think about keeping such a refinement for later, if needed?

That's fine.  If you wanted to bump down the max size summarized for this, that'd be OK too.
https://hg.mozilla.org/integration/fx-team/rev/18f579c4308e
Keywords: checkin-needed
Whiteboard: [leave open] → [leave open][fixed-in-fx-team]
Oops, typo.
Attachment #8394126 - Attachment is obsolete: true
Attachment #8394801 - Flags: review+
Something in this checkin-needed push caused Win7 talos xperf regressions. It wasn't clear to me which was at fault, so I backed out the entire push. Please verify that your patch wasn't at fault and re-request checkin when ready :)
https://hg.mozilla.org/integration/fx-team/rev/6d0a341040a9

https://tbpl.mozilla.org/php/getParsedLog.php?id=36500476&tree=Fx-Team
Whiteboard: [leave open][fixed-in-fx-team] → [leave open]
Try runs look good, thanks :)
https://hg.mozilla.org/integration/fx-team/rev/8f768ced9020
Whiteboard: [leave open] → [leave open][fixed-in-fx-team]
https://hg.mozilla.org/mozilla-central/rev/8f768ced9020
Whiteboard: [leave open][fixed-in-fx-team] → [leave open]
Duplicate of this bug: 986877
So, the latest DUP informs us that:
- it's not the first message;
- a reset is pending at the time of the crash;
- one call to writeAtomic remains unanswered.

I believe that it's the same issue as the OS.File part of bug 961317. There's a patch over there pending review.
Tomcat, could you file a separate bug for this one and Cc bsmedberg and gps? The crash is part of experiments.jsm
Tomcat, the last two are a different bug.
Flags: needinfo?(cbook)
(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?" - I'll be away on April 9th-16th) from comment #101)
> Tomcat, could you file a separate bug for this one and Cc bsmedberg and gps?
> The crash is part of experiments.jsm

yeah sorry, filed bug 996601 for it
Flags: needinfo?(cbook)
Wrong bug, honzab, that should be bug 995027.

As a side-note, I believe that the current bug was fixed by the patch of bug 961317. Closing. Please reopen if needed.
Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(honzab.moz)
Resolution: --- → FIXED
I believe you.  Not sure that bug popped on me tho.  It's hard, too similar ones one by one :)
Flags: needinfo?(honzab.moz)
(In reply to Honza Bambas (:mayhemer) from comment #107)
> I believe you.  Not sure that bug popped on me tho.  It's hard, too similar
> ones one by one :)

It shows up in the TBPL suggestion list and you can see the Experiments.jsm stuff in the parsed log immediately prior to the failure.
Whiteboard: [leave open]
Target Milestone: --- → mozilla31
Gijs: Nope, unrelated bug.
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?" - I'll be away on May 8th-12th) from comment #110)
> Gijs: Nope, unrelated bug.

Thanks! Sorry, meant to star bug 924622, not sure what went wrong there. :-\
Flags: needinfo?(gijskruitbosch+bugs)
Tweaking summary to avoid future mis-stars.
Summary: Intermittent B2G desktop Shutdown | application crashed [@ mozalloc_abort(char const*)] after "###!!! ABORT: file resource://gre/modules/AsyncShutdown.jsm" → Intermittent B2G desktop Shutdown "###!!! ABORT: file resource://gre/modules/AsyncShutdown.jsm"
You need to log in before you can comment on or make changes to this bug.