Closed Bug 1044020 Opened 10 years ago Closed 10 years ago

[AsyncShutdown] Find out the duration of shutdown phases

Categories

(Toolkit :: Async Tooling, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla36

People

(Reporter: Yoric, Assigned: Yoric)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 11 obsolete files)

922 bytes, patch
froydnj
: review+
mccr8
: review+
Details | Diff | Splinter Review
4.64 KB, patch
Yoric
: review+
Details | Diff | Splinter Review
35.79 KB, patch
Yoric
: review+
Details | Diff | Splinter Review
We need to know how long shutdown phases typically take to set the duration of AsyncShutdown and Terminator timeouts accordingly. Too short, and users might lose data, too long and Firefox shutdown takes forever for some of them.

Finding out the duration of shutdown phases is something we typically cannot do with Telemetry because most of shutdown takes place after Telemetry. However, we can use a trick to get some real-world measurements, by adding this duration information to crash reporter annotations and extract the information from crash stats.

Also note that we are not interested in wallclock time but rather in wake time, i.e. unaffected by closing the lid of the computer. Both AsyncShutdown and Terminator have this information somewhere.

We should use one or the other to get it.
Assignee: nobody → dteller
Ideally, I would like to write this information to the profile and reload it on the next run to send it with Telemetry. Benjamin, I would like your blessing before I do this, as this would require me to write to the profile after profile-before-change2.
Flags: needinfo?(benjamin)
I don't particularly care, at least until 662444 is done. This would only be written when telemetry was enabled, right?
Flags: needinfo?(benjamin)
That's the idea, yes.
If we don't trust I/O to always terminate (and I believe we shouldn't trust them in a watchdog), ensuring that all the disk writes take place off the main thread will require a second dedicated thread. Vladan, what's our policy on small main thread I/O during shutdown?
Flags: needinfo?(vdjeric)
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #4)
> Vladan, what's our policy on small main thread I/O during shutdown?

We like to avoid it unless necessary :)

However, in this particular scenario, we don't have much to gain from pushing the write to another thread -- event thread responsiveness is a non-factor and we're unlikely to get any benefits from parallelizing work. It would also add complexity.

Also, we're already recording shutdown durations (using main-thread I/O): 
http://mxr.mozilla.org/mozilla-central/search?string=RecordShutdown&regexp=1&find=&findi=&filter=^[^\0]*%24&hitlimit=&tree=mozilla-central

I think you just want to make the measurements written to Telemetry.ShutdownTime.txt a bit more fine-grained.
Flags: needinfo?(vdjeric)
Comment on attachment 8488806 [details] [diff] [review]
1. Computing and collating shutdown durations

Review of attachment 8488806 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/terminator/nsTerminator.cpp
@@ +211,5 @@
> +  // Setup destinationPath and tmpFilePath
> +
> +  UniquePtr<char, NS_FreeDelete<char> >
> +    destinationPath(static_cast<char*>(arg));
> +  const char EXTENSION[] = ".tmp";

Might as well make this static.  Style guide probably wants this named sExtension as well.

@@ +214,5 @@
> +    destinationPath(static_cast<char*>(arg));
> +  const char EXTENSION[] = ".tmp";
> +  const size_t destinationLength = PL_strlen(destinationPath.get());
> +  UniquePtr<char, NS_FreeDelete<char> >
> +    tmpFilePath(static_cast<char*>(NS_Alloc(destinationLength + sizeof(EXTENSION))));

Is there a reason that you are making this painful and not using nsCString?  (I guess you're doing this to avoid XPCOM issues (?), but I think we can keep using XPCOM strings even after XPCOM shutdown...)

@@ +290,5 @@
> +{
> +  char const* const mTopic;
> +  int mTicks;
> +
> +  ShutdownStep(const char *const topic)

MOZ_CONSTEXPR, please.

@@ +336,4 @@
>    return NS_OK;
>  }
>  
> +// Actually launch thes threads. This takes place at the first sign of shutdown.

"thes threads"?

@@ +348,5 @@
>    // Add a little padding, to ensure that we do not crash before
>    // AsyncShutdown.
>    crashAfterMS += ADDITIONAL_WAIT_BEFORE_CRASH_MS;
>  
> +  if (crashAfterMS < 0) {

What are you trying to defend against here?  Negative values of the pref, or overflow?  You should be handling the latter differently, and it seems to me that the former ought to be handled by |crashAfterMS = FALLBACK_ASYNCSHUTDOWN_CRASH_AFTER_MS;|.  (Negative values should probably also be handled before the "Add a little padding" above?)

@@ +360,2 @@
>    // By design, it will never finish, nor be deallocated.
> +  PRThread* watchdogThread = PR_CreateThread(

I feel like pulling this pattern out into:

PRThread* t = CreateSystemThread(PRThreadFunc aFunc, void* aArg);

or some such would be useful.

@@ +371,2 @@
>  
> +  do {

Ditch the do-nothing loop and just use early returns, please.

If that means you have to move the |mInitialized| bit earlier, please do that.

@@ +452,5 @@
> +  }
> +  MOZ_ASSERT(nextStep != -1);
> +  mCurrentStep = nextStep;
> +
> +  do {

Instead of a do-nothing loop, please separate this out into its own function.

::: toolkit/components/terminator/terminator.manifest
@@ +1,3 @@
>  category profile-after-change nsTerminator @mozilla.org/toolkit/shutdown-terminator;1
>  
> +component {3f78ada1-cba2-442a-82dd-d5fb300ddea7} nsTerminatorTelemetry.js

Is this leftover from a previous iteration, or is it a pointer to something that should have been hg add'd?
Attachment #8488806 - Flags: review?(nfroyd) → feedback+
(In reply to Nathan Froyd (:froydnj) from comment #7)
> @@ +214,5 @@
> > +    destinationPath(static_cast<char*>(arg));
> > +  const char EXTENSION[] = ".tmp";
> > +  const size_t destinationLength = PL_strlen(destinationPath.get());
> > +  UniquePtr<char, NS_FreeDelete<char> >
> > +    tmpFilePath(static_cast<char*>(NS_Alloc(destinationLength + sizeof(EXTENSION))));
> 
> Is there a reason that you are making this painful and not using nsCString? 
> (I guess you're doing this to avoid XPCOM issues (?), but I think we can
> keep using XPCOM strings even after XPCOM shutdown...)

I can't think of a good way to atomically produce a nsCString on one thread and consume it on the other one. Did I miss something?

> ::: toolkit/components/terminator/terminator.manifest
> @@ +1,3 @@
> >  category profile-after-change nsTerminator @mozilla.org/toolkit/shutdown-terminator;1
> >  
> > +component {3f78ada1-cba2-442a-82dd-d5fb300ddea7} nsTerminatorTelemetry.js
> 
> Is this leftover from a previous iteration, or is it a pointer to something
> that should have been hg add'd?

Oops, missing file.
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #8)
> (In reply to Nathan Froyd (:froydnj) from comment #7)
> > @@ +214,5 @@
> > > +    destinationPath(static_cast<char*>(arg));
> > > +  const char EXTENSION[] = ".tmp";
> > > +  const size_t destinationLength = PL_strlen(destinationPath.get());
> > > +  UniquePtr<char, NS_FreeDelete<char> >
> > > +    tmpFilePath(static_cast<char*>(NS_Alloc(destinationLength + sizeof(EXTENSION))));
> > 
> > Is there a reason that you are making this painful and not using nsCString? 
> > (I guess you're doing this to avoid XPCOM issues (?), but I think we can
> > keep using XPCOM strings even after XPCOM shutdown...)
> 
> I can't think of a good way to atomically produce a nsCString on one thread
> and consume it on the other one. Did I miss something?

Is there a reason that |new nsCString()| and |Atomic<nsCString*>| won't work for you?  (A little unusual to dynamically allocate the string class itself, but I think it makes the logic here a lot easier...even if getting an actual |char*| out of |UniquePtr<nsCString>| is a little longer.)
Thanks for the suggestions, I have applied your feedback and added the missing files.
Attachment #8488806 - Attachment is obsolete: true
Attachment #8490280 - Flags: review?(nfroyd)
Comment on attachment 8490280 [details] [diff] [review]
1. Computing and collating shutdown durations, v2

Review of attachment 8490280 [details] [diff] [review]:
-----------------------------------------------------------------

This all looks good; I'd like confirmation on what's going on with test_terminator.js--both that it gets run and that the FIXMEs are appropriately addressed.

::: toolkit/components/terminator/nsTerminator.cpp
@@ +37,2 @@
>  #include "mozilla/DebugOnly.h"
> +#include "mozilla/Monitor.h"

I don't think you need this, since you're using NSPR monitors directly?

@@ +335,2 @@
>  void
>  nsTerminator::Start() {

Opening brace on the next line, please.

@@ +356,5 @@
>    // AsyncShutdown.
>    crashAfterMS += ADDITIONAL_WAIT_BEFORE_CRASH_MS;
>  
> +  // Defend against overflow
> +  if (crashAfterMS < 0) {

The compiler is free to remove this check, since overflow on signed integer types is undefined (even though the typical behavior is to do what you expect here.)  The right check is something like:

if (crashAfterMS > (INT32_MAX - ADDITIONAL_WAIT_BEFORE_CRASH_MS)) {
  crashAfterMS = INT32_MAX;
} else {
  crashAfterMS += ADDITIONAL_WAIT_BEFORE_CRASH_MS;
}

@@ +372,5 @@
> +// Prepare, allocate and start the writer thread. By design, it will never
> +// finish, nor be deallocated. In case of error, we degrade
> +// gracefully to not writing Telemetry data.
> +void
> +nsTerminator::StartWriter() {

Opening brace on the next line, please.

@@ +440,5 @@
> +  }
> +  MOZ_ASSERT(nextStep != -1);
> +  mCurrentStep = nextStep;
> +
> +  do {

Can this loop be factored out into its own function, please?

::: toolkit/components/terminator/tests/xpcshell/test_terminator.js
@@ +52,5 @@
> +  // FIXME: Remove ShutdownDuration.{json, tmp} if it exists
> +
> +
> +
> +  // FIXME: Spin-loop until ShutdownDuration.json is created

Are these FIXMEs intended to be fixed before landing the patch or after?

::: toolkit/components/terminator/tests/xpcshell/xpcshell.ini
@@ +2,5 @@
> +head=
> +tail=
> +
> +[test_terminator_record.js]
> +[test_terminator_reload.js]

I think you are missing test_terminator.js here?
Attachment #8490280 - Flags: review?(nfroyd) → feedback+
Ah, right. test_terminator.js has been renamed test_terminator_record.js, but I obviously forgot to remove the original file.
Attachment #8490280 - Attachment is obsolete: true
Attachment #8491574 - Flags: review?(nfroyd)
Sorry, wrong patch.
Attachment #8491574 - Attachment is obsolete: true
Attachment #8491574 - Flags: review?(nfroyd)
Attachment #8491575 - Flags: review?(nfroyd)
For some reason, a file has added itself to the list.
Attachment #8491575 - Attachment is obsolete: true
Attachment #8491575 - Flags: review?(nfroyd)
Hi,

this failed to apply - 

Hunk #1 FAILED at 6659
1 out of 1 hunks FAILED -- saving rejects to file toolkit/components/telemetry/Histograms.json.rej


could you take a look. Thanks!
Keywords: checkin-needed
Summary: [AsyncShutdown] Use the CrashReporter to find out the duration of shutdown phases → [AsyncShutdown] Find out the duration of shutdown phases
Unbitrotten.
Attachment #8495575 - Attachment is obsolete: true
Attachment #8496758 - Flags: review+
Remove PHONY file that was added by this patch (WTF?)
https://hg.mozilla.org/integration/fx-team/rev/dade62440f97
https://hg.mozilla.org/integration/fx-team/rev/49b1465d3e5e
Flags: in-testsuite+
Keywords: checkin-needed
Whiteboard: [fixed-in-fx-team]
Status: NEW → ASSIGNED
Flags: needinfo?(dteller)
Apparently, this is a merge error. I'll fix this later today.
Fixed the problems introduced in the merge.
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=761e915514df
Attachment #8496758 - Attachment is obsolete: true
Attachment #8497466 - Flags: review+
Flags: needinfo?(dteller)
Ah, leaks. Yes, they were expected and they cannot be resolved, by definition of this code, but I didn't realize that they would cause ASAN failures. How can I inform ASAN to expect them?
Flags: needinfo?(emorley)
Same code, plus LSAN annotations.
Attachment #8497466 - Attachment is obsolete: true
Attachment #8506973 - Flags: superreview?(continuation)
Attachment #8506973 - Flags: review+
Comment on attachment 8506973 [details] [diff] [review]
1. Computing and collating shutdown durations, v8

I think it makes more sense for Nathan to approve this use of the intentional leak thing.
Attachment #8506973 - Flags: superreview?(continuation) → superreview?(nfroyd)
Comment on attachment 8506973 [details] [diff] [review]
1. Computing and collating shutdown durations, v8

Review of attachment 8506973 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the changes below.

::: toolkit/components/terminator/nsTerminator.cpp
@@ +144,5 @@
> +    {
> +      MOZ_LSAN_INTENTIONAL_SCOPED_LEAK;
> +      // PR_Sleep allocates then deallocates a condition variable.
> +      // Since shutdown typically happens between the two, LSAN should
> +      // ignore this.

Given resolution in bug 1075134, please remove this and keep the lsan_suppressions.txt change.

@@ +263,5 @@
> +    //
> +    // In case of any error, we simply give up. Since the data is
> +    // hardly critical, we don't want to spend too much effort
> +    // salvaging it.
> +    //    

Whitespace.
Attachment #8506973 - Flags: superreview?(nfroyd) → superreview+
This is leaking again. Backed out. Did you upload the right patch?
https://hg.mozilla.org/integration/fx-team/rev/3bec46585a18

https://treeherder.mozilla.org/ui/logviewer.html#?job_id=1020060&repo=fx-team
Whiteboard: [fixed-in-fx-team]
Attached patch 2. Whitelist libld.so (obsolete) — Splinter Review
I finally found out what was wrong, mostly.
Apparently, PR_Sleep performs a call to libld.so that LSAN can't follow (so it's not whitelisted by PR_Sleep), which causes the allocation of 32 bytes.

I can't find a better way around this than whitelisting libld.so.
Attachment #8515526 - Flags: review?(nfroyd)
s/libld/libdl/
Attachment #8515526 - Attachment is obsolete: true
Attachment #8515526 - Flags: review?(nfroyd)
Attachment #8515662 - Flags: review?(nfroyd)
Comment on attachment 8515662 [details] [diff] [review]
2. Whitelist libdl.so

Apologies for the review delay.

How do you figure this?  libdl.so exports less than 10 functions (dlopen() and friends) and I don't immediately see how any of those could get called by PR_Sleep.  Can you elaborate with call stacks?
Flags: needinfo?(dteller)
The entire error stack is as follows:
11:07:10 INFO - Direct leak of 32 byte(s) in 1 object(s) allocated from:
11:07:10 INFO - #0 0x471e81 in calloc /builds/slave/moz-toolchain/src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:90
11:07:10 INFO - #1 0x7f1d6d15459f (/lib/x86_64-linux-gnu/libdl.so.2+0x159f)

Which makes it a bit tricky to debug. I bissected by commenting out function calls until it turned out that the issue was caused by the call to `PR_Sleep()`. Given that `PR_Sleep()` is already whitelisted and that it doesn't appear in the error stack anyway, I couldn't find a better strategy than whitelisting libdl.so, which seems to do the trick.

I do not have a precise explanation. However, looking at the source code, `PR_Sleep` may cause lazy initialization of NSPR, which in turn calls `dlopen` with `RTLD_LAZY` at http://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/linking/prlink.c#169, which could possibly explain the allocation of 32 bytes by libdl.so
Flags: needinfo?(dteller) → needinfo?(nfroyd)
There's some way to disable the fast stack unwinder that ASan uses which will give you stacks that go through the system libraries.  I think that was the trick...
Ah, well, that would be nice to use, too. Do you remember how to do that?
Ah, I see, _PR_InitLinker grabs a reference to the executable via dlopen, and then we don't cleanup those bits.  I'm a little surprised that PR_Sleep would be the first thing to try and initialize the library (I'd think creating threads would take care of that), but this is NSPR, so...
Flags: needinfo?(nfroyd)
Attachment #8515662 - Flags: review?(nfroyd) → review+
Comment on attachment 8515662 [details] [diff] [review]
2. Whitelist libdl.so

Review of attachment 8515662 [details] [diff] [review]:
-----------------------------------------------------------------

Actually, Andrew might also have things to say about this.
Attachment #8515662 - Flags: review?(continuation)
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #42)
> Ah, well, that would be nice to use, too. Do you remember how to do that?

It looks like the ASan option I was thinking of is fast_unwind_on_malloc:
  https://code.google.com/p/address-sanitizer/wiki/Flags
So you'd want to set that to 0.

For a mochitest, you'd want to set it here:
  http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#319

Though now that I think about it, I'm not sure if it was that that helped, or installing some kind of debug symbols for the system libraries that got me better stacks. But I guess this at least is easier to try.
Comment on attachment 8515662 [details] [diff] [review]
2. Whitelist libdl.so

Review of attachment 8515662 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, this is unfortunate, because any other leaks we get involving this system library will just get surpressed, but if you can't avoid it, unfortunately with the current stackwalking on TBPL there's no other way around this.
Attachment #8515662 - Flags: review?(continuation) → review+
The only alternative I can think of is the MOZ_LSAN_SCOPED_INTENTIONAL_LEAK, which didn't make it through review. I suspect that libdl links are rather benign, so if we think that MOZ_LSAN_SCOPED_INTENTIONAL_LEAK is a footgun, I don't think it's worth reintroducing it just for this bug.

In other words, if everybody agrees, I would like to land the patch in the current state.
Yeah, I think that's fine. We have a ton of libraries in the suppression list.
Oh, looks like the test is executed on b2g emulator when it shouldn't. I'll fix this and upload an updated version.
Flags: needinfo?(dteller)
Let's not build/run tests under platforms that don't even have a shutdown.
Attachment #8520144 - Flags: review+
Part 1 needs rebasing unfortunately :(
Keywords: checkin-needed
Rebased.
(sigh)
Attachment #8508599 - Attachment is obsolete: true
Attachment #8523142 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/a063964b77c0
https://hg.mozilla.org/mozilla-central/rev/a748ffac1abf
https://hg.mozilla.org/mozilla-central/rev/9dba07ce2717
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Whiteboard: [fixed-in-fx-team]
Target Milestone: --- → mozilla36
You need to log in before you can comment on or make changes to this bug.