Crash in TelemetryImpl::RunPingSender on shutdown

VERIFIED FIXED in Firefox 55

Status

()

Toolkit
Telemetry
P1
normal
VERIFIED FIXED
3 months ago
20 days ago

People

(Reporter: hildjj, Assigned: gsvelto)

Tracking

({crash})

55 Branch
mozilla55
x86_64
Mac OS X
crash
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox53 unaffected, firefox54 unaffected, firefox55 verified, firefox-esr52 unaffected)

Details

(Whiteboard: [measurement:client])

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(5 attachments)

(Reporter)

Description

3 months ago
Created attachment 8858399 [details]
stack.txt

Crash on shutdown.  Reproduce:
- start Firefox with fresh profile
- shut down
- restart
- Navigate to http://neverssl.com/
- Shut down
- Crash.  No crash report generated.

OSX 10.12.4 (16E195)
Nightly 55.0a1 (2017-04-14) (64-bit)

Stack trace attached
(Reporter)

Updated

3 months ago
Summary: TelemetryImpl::RunPingSender → Crash in TelemetryImpl::RunPingSender on shutdown

Updated

3 months ago
Keywords: crash
Flags: needinfo?(alessio.placitelli)
Priority: -- → P1
Whiteboard: [measurement:client]
Blocks: 1354482
Duplicate of this bug: 1357232
Assignee: nobody → alessio.placitelli
Flags: needinfo?(alessio.placitelli)
I've tried reproducing this a bunch of times but I don't seem to be able to hit it. However I've noticed something that might be related; sometimes when shutting down with the toolkit.telemetry.shutdownPingSender.enabled option set to true I get the following in my log:

WARNING: YOU ARE LEAKING THE WORLD (at least one JSRuntime and everything alive inside it, that is) AT JS_ShutDown TIME.  FIX THIS!
[Parent 41748] ###!!! ASSERTION: Component Manager being held past XPCOM shutdown.: 'cnt == 0', file /Users/gsvelto/projects/mozilla-central/xpcom/build/XPCOMInit.cpp, line 1045

Some other errors like this seem to show up. With the option off I've been unable to get the same error. I don't know if it's related because I cannot consistently trigger it, but I've never seen that particular error before.

That being said there's another thing of note: in the attached trace the pingsender is being called from a promise, but we're already within nsXREDirProvider::DoShutdown(). So I was wondering if we might end up in a situation where the promise used to run the pingsender is being run far too late compared to the rest of the shutdown process and triggers this issue. The reason why I'm thinking this might be the case is that the bottom of that trace shows a segmentation fault deep within memory allocation code run by fork(). Since fork() doesn't accept any parameters nor should rely on our global data then the only chance of that happening is that we've somehow corrupted those memory structures before fork() gets called. Also, since the reporter is not getting a crash then it means we've already disabled the breakpad signal handler that triggers the crashreporter. That also happens very late in the shutdown process and JS methods shouldn't work at that point.

Without a way of reproducing this reliably though it's going to be hard to debug.
See also the STR in bug 1357232, comment 0.
Both reports are OS X.
(Reporter)

Comment 4

3 months ago
Re bug 1357232, I have extensions.logging.enabled=false.
Joe, if you can hit this consistently one thing that would be very important to establish is if the crash is happening after we've disabled Breakpad's exception handler. Can you check if you see the crash after we've hit this code?

https://dxr.mozilla.org/mozilla-central/rev/a374c35469935a874fefe64d3e07003fc5bc8884/toolkit/crashreporter/nsExceptionHandler.cpp#2130

If you can put printf()/fflush(stdout) couple in there and see if it shows up before the crash it would be super helpful because if it is, then we're running the pingsender far too late and that's why it's failing (past that point it would be too late to run *anything*).
(Reporter)

Comment 6

3 months ago
From what I can tell, in both the case where it crashes and in the case where it doesn't crash, gExceptionHandler is null at 

https://dxr.mozilla.org/mozilla-central/rev/a374c35469935a874fefe64d3e07003fc5bc8884/toolkit/crashreporter/nsExceptionHandler.cpp#2205

Causing the function to return NS_ERROR_NOT_INITIALIZED.  I get this further info on crash, AFTER that return:

Leaked URLs:
  chrome://pluginproblem/content/pluginProblemBinding.css
  resource://gre-resources/counterstyles.css
  resource://gre-resources/html.css
  chrome://global/content/minimal-xul.css
  resource://gre-resources/quirk.css
  resource://gre/res/svg.css
  chrome://global/content/xul.css
  chrome://global/skin/scrollbars.css
  resource://gre-resources/number-control.css
  resource://gre-resources/forms.css
  resource://gre-resources/noscript.css
  resource://gre-resources/ua.css
  chrome://pocket/skin/pocket.css
  chrome://pocket-shared/skin/pocket.css
  chrome://global/content/bindings/general.xml#root-element
  chrome://global/content/bindings/popup.xml#tooltip
  chrome://global/content/bindings/stringbundle.xml#stringbundleset
  chrome://global/content/bindings/toolbar.xml#menubar
  chrome://global/content/bindings/stringbundle.xml#stringbundle
  chrome://global/content/bindings/popup.xml#popup
  chrome://global/content/bindings/general.xml#deck
  chrome://global/content/bindings/autocomplete.xml#autocomplete-rich-result-popup
  chrome://global/content/bindings/popup.xml#arrowpanel
  chrome://global/content/bindings/menulist.xml#menulist-popuponly
  chrome://global/content/bindings/popup.xml#panel
  chrome://global/content/bindings/notification.xml#popup-notification
  chrome://global/content/bindings/general.xml#image
  chrome://global/content/bindings/toolbarbutton.xml#toolbarbutton
  chrome://global/content/bindings/toolbar.xml#toolbox
  chrome://global/content/bindings/toolbarbutton.xml#menu
  chrome://global/content/bindings/scrollbox.xml#scrollbox
  chrome://global/content/bindings/text.xml#text-label
  chrome://global/content/bindings/text.xml#label-control
  chrome://global/content/bindings/general.xml#dropmarker
  chrome://global/content/bindings/toolbarbutton.xml#menu-button
  chrome://global/content/bindings/textbox.xml#input-box
  chrome://global/content/bindings/autocomplete.xml#history-dropmarker
  chrome://global/content/platformHTMLBindings.xml#inputFields
  chrome://global/content/bindings/toolbarbutton.xml#toolbarbutton-badged
  chrome://global/content/bindings/splitter.xml#splitter
  chrome://global/content/bindings/notification.xml#notificationbox
  chrome://global/content/bindings/tabbox.xml#tabpanels
  chrome://global/content/bindings/browser.xml
  chrome://global/content/bindings/browser.xml#browser
  chrome://global/content/bindings/browser.xml
  chrome://global/content/bindings/menu.xml#menuitem
  chrome://global/content/bindings/menu.xml#menuseparator
  chrome://global/content/bindings/scrollbar.xml#scrollbar
  chrome://global/content/bindings/scrollbar.xml
  chrome://global/content/bindings/scrollbar.xml#thumb
  chrome://global/content/bindings/scrollbar.xml
  chrome://global/content/bindings/scrollbar.xml#scrollbar-base
  chrome://global/content/bindings/scrollbar.xml#scrollbar-base
  chrome://global/content/bindings/scrollbar.xml#scrollbar
  chrome://global/content/bindings/scrollbar.xml#scrollbar-base
  chrome://global/content/bindings/scrollbar.xml#thumb
  chrome://global/content/bindings/remote-browser.xml
  chrome://global/content/bindings/browser.xml#browser
  chrome://global/content/bindings/remote-browser.xml#remote-browser
  chrome://global/content/bindings/remote-browser.xml
  chrome://global/content/bindings/menu.xml#menu
  chrome://global/content/bindings/menu.xml#menu-iconic
  chrome://global/content/bindings/toolbar.xml#toolbardecoration
  chrome://pocket/skin/Toolbar@2x.png
  chrome://global/content/bindings/browser.xml#browser
  chrome://pocket/content/panels/img/pocketmenuitem16@2x.png
  https://accounts.firefox.com/
  chrome://global/content/bindings/text.xml#text-base
  resource://search-plugins/images/yahoo.ico
  resource://search-plugins/images/amazon.ico
  resource://search-plugins/images/wikipedia.ico
  chrome://global/content/bindings/autocomplete.xml#autocomplete-richlistbox
  chrome://global/content/bindings/button.xml#button
  chrome://global/content/bindings/autocomplete.xml#autocomplete-richlistitem
  chrome://global/content/backgroundPageThumbs.xhtml
  file:///Users/jhildebrand/track/gecko-dev/toolkit/components/thumbnails/content/backgroundPageThumbs.xhtml
  chrome://global/content/bindings/remote-browser.xml#remote-browser
[Parent 82953] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /Users/jhildebrand/track/gecko-dev/xpcom/base/nsTraceRefcnt.cpp, line 172
nsStringStats
 => mAllocCount:         141852
 => mReallocCount:         7518
 => mFreeCount:          138296  --  LEAKED 3556 !!!
 => mShareCount:         123093
 => mAdoptCount:           8903
 => mAdoptFreeCount:       8901  --  LEAKED 2 !!!
(In reply to Joe Hildebrand  [:hildjj] (UTC-6) from comment #4)
> Re bug 1357232, I have extensions.logging.enabled=false.

I enabled the "extensions.logging.enabled=true" pref so I can see when fx pings the add-on/telemetry services via the console browser. It doesn't have to be enabled to reproduce the issue. However, it gives you a good indicator on when you can shutdown fx when you notice the pings.

I can only reproduce this with a released version of m-c and haven't been able to reproduce the issue with a debug build. Gabriele, let me know if there's anything that I can help with here.
As a precaution, since we cannot safely evaluate the volume of the shutdown crashes (since the crashreporter is not getting called), we're disabling the shutdown pingsender in bug 1357745.
(In reply to Gabriele Svelto [:gsvelto] from comment #2)
> WARNING: YOU ARE LEAKING THE WORLD (at least one JSRuntime and everything
> alive inside it, that is) AT JS_ShutDown TIME.  FIX THIS!
> [Parent 41748] ###!!! ASSERTION: Component Manager being held past XPCOM
> shutdown.: 'cnt == 0', file
> /Users/gsvelto/projects/mozilla-central/xpcom/build/XPCOMInit.cpp, line 1045

FWIW I occasionally see this on Ubuntu, even with the pingsender disabled.
Comment hidden (mozreview-request)
(In reply to Alessio Placitelli [:Dexter] from comment #8)
> As a precaution, since we cannot safely evaluate the volume of the shutdown
> crashes (since the crashreporter is not getting called), we're disabling the
> shutdown pingsender in bug 1357745.

Ah! Thanks to all the precious feedback in this bug, I was able to trace down the potential source of the crash.

As Gabriele mentioned in comment 2, this is crashing because we're trying to spawn a new process too late into the shutdown.

I knew we were blocking on Telemetry at shutdown, so I crosschecked the calls.

The shutdown ping is submitted at [1] and wait upon at [2]. TelemetryController waits for the shutdown to complete at [3], and we block the FF shutdown here [4].

However, is TelemetryController.submitExternalPing returning the promise we expect? If we follow the calls down to TelemetrySend [5], we can see that we're returning the promise related to saving the ping to the disk, rather than the one for the pingsender process being spawned.

This explains why we're crashing when spawning the pingsender rather than when saving the ping to the disk.

The attached patch attempts to fix the problem, by returning the right promise so that Firefox correctly waits on the pingsender to spawn before quitting.

[1] - http://searchfox.org/mozilla-central/rev/214345204f1e7d97abb571b7992b6deedb5ff98f/toolkit/components/telemetry/TelemetrySession.jsm#1786
[2] - http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/TelemetrySession.jsm#2035
[3] - http://searchfox.org/mozilla-central/rev/214345204f1e7d97abb571b7992b6deedb5ff98f/toolkit/components/telemetry/TelemetryController.jsm#796
[4] - http://searchfox.org/mozilla-central/rev/214345204f1e7d97abb571b7992b6deedb5ff98f/toolkit/components/telemetry/TelemetryController.jsm#755
[5] - http://searchfox.org/mozilla-central/rev/214345204f1e7d97abb571b7992b6deedb5ff98f/toolkit/components/telemetry/TelemetrySend.jsm#810
Kamil, thank you for your help so far.

Would you kindly check if you're still able to reproduce the problem with the patch attached to this bug?
I've also pushed the patch to try so that you can conveniently grab a pre-build copy of the built Firefox.
Flags: needinfo?(kjozwiak)
Status: NEW → ASSIGNED
Keywords: leave-open

Comment 13

3 months ago
mozreview-review
Comment on attachment 8859615 [details]
Bug 1356673 - Fix a crash in RunPingSender at shutdown.

https://reviewboard.mozilla.org/r/131622/#review134426

We should definitely fix what this patch adresses, lets land this.

However, i'm not sure about the analysis of the crash.
You are blaming memory corruption from XPCOM shutdown or so.
But this is a crash in a sync main-thread function inside PR_CreateProcessDetached / fork.
To get there, we must already have successfully passed the strings through XPCOM and converted them to CStrings.

Let's land this and set leave-open pending:
- verifying this solves the problem
- checking the crash analysis with someone more familiar with these, maybe :froydnj or :ted
Attachment #8859615 - Flags: review?(gfritzsche) → review+

Comment 14

3 months ago
Pushed by alessio.placitelli@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/f281b8da52c9
Fix a crash in RunPingSender at shutdown. r=gfritzsche
(In reply to Georg Fritzsche [:gfritzsche] [away Apr 13 - 18] from comment #13)
> However, i'm not sure about the analysis of the crash.
> You are blaming memory corruption from XPCOM shutdown or so.
> But this is a crash in a sync main-thread function inside
> PR_CreateProcessDetached / fork.
> To get there, we must already have successfully passed the strings through
> XPCOM and converted them to CStrings.

Ted, could you kindly help with this?

Context: we're crashing at shutdown when spawning a new process to send the shutdown pings using the "pingsender".

Gabriele provided some useful insights about this crash in comment 2. How can we be crashing so far off after the XPCOM shutdown? Does this crash analysis make sense?
Flags: needinfo?(ted)
(In reply to Alessio Placitelli [:Dexter] from comment #12)
> Kamil, thank you for your help so far.
> 
> Would you kindly check if you're still able to reproduce the problem with
> the patch attached to this bug?
> I've also pushed the patch to try so that you can conveniently grab a
> pre-build copy of the built Firefox.

Alessio, I can still reproduce this using the try build [1] using the same STR as before. Once fx pings the add-on/telemetry services, it will crash every single time it's closed. Let me know if I'm using the incorrect try build.

[1] https://archive.mozilla.org/pub/firefox/try-builds/alessio.placitelli@gmail.com-0bd22e8ec3dddb948b3ae56699cd1b677234ca26/try-macosx64/
Flags: needinfo?(kjozwiak)

Comment 17

3 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/f281b8da52c9
(In reply to Kamil Jozwiak [:kjozwiak] from comment #16)
> (In reply to Alessio Placitelli [:Dexter] from comment #12)
> > Kamil, thank you for your help so far.
> > 
> > Would you kindly check if you're still able to reproduce the problem with
> > the patch attached to this bug?
> > I've also pushed the patch to try so that you can conveniently grab a
> > pre-build copy of the built Firefox.
> 
> Alessio, I can still reproduce this using the try build [1] using the same
> STR as before. Once fx pings the add-on/telemetry services, it will crash
> every single time it's closed. Let me know if I'm using the incorrect try
> build.
> 
> [1]
> https://archive.mozilla.org/pub/firefox/try-builds/alessio.placitelli@gmail.
> com-0bd22e8ec3dddb948b3ae56699cd1b677234ca26/try-macosx64/

Thanks for checking this. We ended up disabling the shutdown ping sending in bug 1357745 while we investigate this.
Judging by the stack, I think this is a race condition in shutdown, and that our mozalloc implementation is doing something wrong. The crash is actually in the child process that you're launching, and it's trying to re-initialize its malloc bits after forking. It's crashing with a bad pointer that points inside libmozglue:

Exception Codes:       KERN_PROTECTION_FAILURE at 0x0000000104767610

VM Regions Near 0x104767610:
    __DATA                 0000000104756000-0000000104757000 [    4K] rw-/rwx SM=COW  /Applications/FirefoxNightly.app/Contents/MacOS/libmozglue.dylib
--> __LINKEDIT             0000000104757000-000000010476d000 [   88K] r--/rwx SM=COW  /Applications/FirefoxNightly.app/Contents/MacOS/libmozglue.dylib
    Kernel Alloc Once      000000010476d000-000000010476f000 [    8K] rw-/rwx SM=COW  

I think that mozalloc has a bug where it isn't doing something right for fork()ed processes, and this just happens to be the only code that's using fork on OS X (as you noticed) and also has the forked process racing with the main process' shutdown.

My best guess at what happens is something like:
1) fork()
2) Parent process continues running, frees some mozalloc data structures.
3) Child process starts running, tries to use those mozalloc data structures, crashes.
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #19)
> My best guess at what happens is something like:
> 1) fork()
> 2) Parent process continues running, frees some mozalloc data structures.
> 3) Child process starts running, tries to use those mozalloc data
> structures, crashes.

Thanks for the analysis Ted, I was smelling it was a problem with memory allocation because of the stack trace but I didn't realize it was in the child process :-/ Since I'm using fork() here only to have a common path between Linux and Mac, what do you think if I just switch this code to use posix_spawn() instead? We shouldn't run into fork()-related issues anymore, it should work on both platforms and since we're just running an external executable it's probably the most appropriate function to use.
Flags: needinfo?(ted)
Assignee: alessio.placitelli → gsvelto
If that fixes the issue I would say go for it. We can file a separate bug about the crash here.
Flags: needinfo?(ted)
Working on a patch to fix this, it should be ready by tonight.
Created attachment 8861222 [details] [diff] [review]
[PATCH WIP] Run the pingsender executable in a way that avoids races during shutdown

This is a mostly working WIP which still needs some polish. The fix is achieved by using nsIProcess to spawn the pingsender instead of the custom C++ code we had deployed in Telemetry. Since nsIProcess uses posix_spawnp() internally on MacOS X it will avoid any problems related to mozalloc in the child process.

I feel silly about not having noticed the nsIProcess component before, it would have saved us quite a bit of trouble :-/
Attachment #8861222 - Flags: feedback?(alessio.placitelli)
Comment on attachment 8861222 [details] [diff] [review]
[PATCH WIP] Run the pingsender executable in a way that avoids races during shutdown

Review of attachment 8861222 [details] [diff] [review]:
-----------------------------------------------------------------

This seems reasonable to me, thanks! I left a few comments below.

::: toolkit/components/telemetry/TelemetrySend.jsm
@@ +280,5 @@
>    getShutdownState() {
>      return TelemetrySendImpl.getShutdownState();
>    },
> +
> +  runPingSender(url, pingPath) {

Let's move the docs that we had in nsITelemetry.idl here.

@@ +281,5 @@
>      return TelemetrySendImpl.getShutdownState();
>    },
> +
> +  runPingSender(url, pingPath) {
> +    TelemetrySendImpl.runPingSender(url, pingPath);

Let's change this to |return TelemetrySendImpl. ...|, in case we want to propagate the errors from runpingSender.

@@ +1231,5 @@
>      };
>    },
> +
> +  runPingSender(url, pingPath) {
> +#ifdef XP_WIN

Instead of doing this, you can use AppConstants. Something like this:

const pingSenderExe = (AppConstants.platform == "win") ? "pingsender.exe" : "pingsender";

@@ +1235,5 @@
> +#ifdef XP_WIN
> +    const exeName = "pingsender.exe";
> +#else
> +    const exeName = "pingsender";
> +#endif

Let's also disable support for Fennec here, as we did in Telemetry.cpp.

@@ +1242,5 @@
> +    exe.append(exeName);
> +
> +    let process = Cc["@mozilla.org/process/util;1"]
> +                  .createInstance(Ci.nsIProcess);
> +    process.init(exe);

Let's check for errors returned by both |init| and |run| (below), as it seems they could be failing.

@@ +1243,5 @@
> +
> +    let process = Cc["@mozilla.org/process/util;1"]
> +                  .createInstance(Ci.nsIProcess);
> +    process.init(exe);
> +    process.run(/* blocking */ false, [url, pingPath], 2);

Are you sure this makes the new process "detached" from the Firefox process?
For example, are we sure the pingsender will keep running once the Firefox process is killed?

::: toolkit/components/telemetry/moz.build
@@ +84,5 @@
>      'UITelemetry.jsm',
>  ]
>  
> +EXTRA_PP_JS_MODULES += [
> +    'TelemetrySend.jsm',

If you use AppConstants, you don't need to move the file here.

::: toolkit/components/telemetry/pingsender/pingsender.h
@@ +3,5 @@
>   * License, v. 2.0. If a copy of the MPL was not distributed with this
>   * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
>  
>  #include <string>
> +#define DEBUG

Debugging leftover? :)
Attachment #8861222 - Flags: feedback?(alessio.placitelli) → feedback+
@Ted, do you think that it's reasonable to use |nsIProcess| in this case, right before Firefox shuts down (see TelemetrySend.jsm in the patch from comment 23)?
Flags: needinfo?(ted)
(In reply to Alessio Placitelli [:Dexter] from comment #24)
> Let's change this to |return TelemetrySendImpl. ...|, in case we want to
> propagate the errors from runpingSender.

runPingSender() will throw NS_ERROR_FAILURE if it can't run the pingsender, it's the same behavior as before. What other error were you thinking was worth returning? BTW TelemetrySend.runPingSender() is exposed only because I needed it for testing but it's not used directly, the only other call-site is TelemetrySend._sendWithPingSender() which is catching all exceptions.

> const pingSenderExe = (AppConstants.platform == "win") ? "pingsender.exe" :
> "pingsender";

Thanks for the pointer, I had been digging into our codebase to figure out how that's normally done but wasn't aware of AppConstants.

> Let's also disable support for Fennec here, as we did in Telemetry.cpp.
> 
> @@ +1242,5 @@
> > +    exe.append(exeName);
> > +
> > +    let process = Cc["@mozilla.org/process/util;1"]
> > +                  .createInstance(Ci.nsIProcess);
> > +    process.init(exe);
> 
> Let's check for errors returned by both |init| and |run| (below), as it
> seems they could be failing.

They both throw NS_ERROR_FAILURE in case something goes wrong and we're already catching everything in TelemetrySendImpl._sendWithPingSender() which is what we already did before. Should I change this?

> @@ +1243,5 @@
> > +
> > +    let process = Cc["@mozilla.org/process/util;1"]
> > +                  .createInstance(Ci.nsIProcess);
> > +    process.init(exe);
> > +    process.run(/* blocking */ false, [url, pingPath], 2);
> 
> Are you sure this makes the new process "detached" from the Firefox process?
> For example, are we sure the pingsender will keep running once the Firefox
> process is killed?

Yes, I've tested it just in case and the pingsender process remains alive after the main process has been terminated. 

> ::: toolkit/components/telemetry/pingsender/pingsender.h
> @@ +3,5 @@
> >   * License, v. 2.0. If a copy of the MPL was not distributed with this
> >   * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> >  
> >  #include <string>
> > +#define DEBUG
> 
> Debugging leftover? :)

Yup :)
Comment hidden (mozreview-request)

Comment 28

3 months ago
mozreview-review
Comment on attachment 8861763 [details]
Bug 1356673 - Run the pingsender executable in a way that avoids shutdown races with mozalloc;

https://reviewboard.mozilla.org/r/133766/#review136714

This looks good with the changes below addressed, thanks!

Let's also turn the shutdown pingsender back on by adding this:

// Enables sending the shutdown ping when Firefox shuts down.
pref("toolkit.telemetry.shutdownPingSender.enabled", true);

right after [here](http://searchfox.org/mozilla-central/rev/876c7dd30586f9c6f9c99ef7444f2d73c7acfe7c/browser/app/profile/firefox.js#1446).

Please let's wait for Ted to clear the ni? before landing this and flag QA for it too.

::: toolkit/components/telemetry/TelemetrySend.jsm:285
(Diff revision 1)
>    getShutdownState() {
>      return TelemetrySendImpl.getShutdownState();
>    },
> +
> +  /**
> +   * Send a ping using the ping sender.

Let's mention that this is a test only function, not meant to be used outside of the tests.

::: toolkit/components/telemetry/TelemetrySend.jsm:293
(Diff revision 1)
> +   *
> +   * @param {String} aUrl The telemetry server URL
> +   * @param {String} aPingFilePath The path to the file holding the ping
> +   *        contents, if if sent successfully the pingsender will delete it.
> +   *
> +   * @throws NS_ERROR_FAILURE if we couldn't find or run the pingsender executable.

nit: also mention that it throws on Android too with *...NOT_SUPPORTED*

::: toolkit/components/telemetry/TelemetrySend.jsm:295
(Diff revision 1)
> +   * @param {String} aPingFilePath The path to the file holding the ping
> +   *        contents, if if sent successfully the pingsender will delete it.
> +   *
> +   * @throws NS_ERROR_FAILURE if we couldn't find or run the pingsender executable.
> +   */
> +  runPingSender(url, pingPath) {

Please, rename this to testRunPingSender since this is test-only.
Attachment #8861763 - Flags: review?(alessio.placitelli) → review+
Thanks for the review, I've updated the comments and I'll push this to try so that we have builds the reporters can use to check if the issue went away for good.
Comment hidden (mozreview-request)
Kamil, Joe, can you try this build and see if it fixes the problem for you:

https://archive.mozilla.org/pub/firefox/try-builds/gsvelto@mozilla.com-d8bf263170563b327296ef94b5aa97529d0ff6c3/try-macosx64/
Flags: needinfo?(kjozwiak)
Flags: needinfo?(joe-bugzilla)
(In reply to Alessio Placitelli [:Dexter] from comment #25)
> @Ted, do you think that it's reasonable to use |nsIProcess| in this case,
> right before Firefox shuts down (see TelemetrySend.jsm in the patch from
> comment 23)?

I think it should be OK, but the only thing that worries me is that you're launching it with blocking=false, which triggers this code path:
https://dxr.mozilla.org/mozilla-central/rev/0b77ed3f26c5335503bc16e85b8c067382e7bb1e/xpcom/threads/nsProcessCommon.cpp#554

...which creates a thread to watch for the process result, and adds a shutdown observer to the observer service.
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #32)
> (In reply to Alessio Placitelli [:Dexter] from comment #25)
> > @Ted, do you think that it's reasonable to use |nsIProcess| in this case,
> > right before Firefox shuts down (see TelemetrySend.jsm in the patch from
> > comment 23)?
> 
> I think it should be OK, but the only thing that worries me is that you're
> launching it with blocking=false, which triggers this code path:
> https://dxr.mozilla.org/mozilla-central/rev/
> 0b77ed3f26c5335503bc16e85b8c067382e7bb1e/xpcom/threads/nsProcessCommon.
> cpp#554
> 
> ...which creates a thread to watch for the process result, and adds a
> shutdown observer to the observer service.

Mh, I guess this could be a problem. Does it mean that we slow the shutdown or, worse, block the shutdown on the pingsender process? Gabriele, what's your take on it?

I guess we could add some sort of "run and forget" API endpoint to nsIProcess which runs the process asynchronously and doesn't monitor it, so we don't end up copy/pasting this code again and again in the whole codebase.
Flags: needinfo?(gsvelto)
(In reply to Gabriele Svelto [:gsvelto] from comment #31)
> Kamil, Joe, can you try this build and see if it fixes the problem for you:
> 
> https://archive.mozilla.org/pub/firefox/try-builds/gsvelto@mozilla.com-
> d8bf263170563b327296ef94b5aa97529d0ff6c3/try-macosx64/

I couldn't reproduce the issue using the above try build. I installed/re-installed fx about ~12 times and went through the STR from comment#0 and bug#1357232comment#0. I didn't receive the macOS crash once.
Flags: needinfo?(kjozwiak)
Interestingly, I fixed bug 1301471 today and reprocessed some OS X crashes that were missing symbols, and picking one at random I saw:
https://crash-stats.mozilla.com/report/index/4a19e768-74ca-425a-b29b-71c730170427#tab-details

which is also a crash in malloc after fork!
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #35)
> Interestingly, I fixed bug 1301471 today and reprocessed some OS X crashes
> that were missing symbols, and picking one at random I saw:
> https://crash-stats.mozilla.com/report/index/4a19e768-74ca-425a-b29b-
> 71c730170427#tab-details
> 
> which is also a crash in malloc after fork!

Interesting, so the underlying cause is more widespread than initially thought! Is it worth having a bug on it/trying to fix it (just throwing the idea around, I don't think I'd be able to take it :-D )?
(In reply to Alessio Placitelli [:Dexter] from comment #33)
> > I think it should be OK, but the only thing that worries me is that you're
> > launching it with blocking=false, which triggers this code path:
> > https://dxr.mozilla.org/mozilla-central/rev/
> > 0b77ed3f26c5335503bc16e85b8c067382e7bb1e/xpcom/threads/nsProcessCommon.
> > cpp#554
> > 
> > ...which creates a thread to watch for the process result, and adds a
> > shutdown observer to the observer service.
> 
> Mh, I guess this could be a problem. Does it mean that we slow the shutdown
> or, worse, block the shutdown on the pingsender process? Gabriele, what's
> your take on it?
> 
> I guess we could add some sort of "run and forget" API endpoint to
> nsIProcess which runs the process asynchronously and doesn't monitor it, so
> we don't end up copy/pasting this code again and again in the whole codebase.

I've dug into this a little more and I think it's safe, here's why: the observer is used only to detect XPCOM shutdown and when we're running the pingsender XPCOM shutdown hasn't initiated yet; in fact it cannot initiate until we're done with the asynchronous shutdown (which includes telemetry). If we'd use nsIProcess during XPCOM shutdown it would be a problem but as it is, it's not.
Flags: needinfo?(gsvelto)
(In reply to Kamil Jozwiak [:kjozwiak] from comment #34)
> (In reply to Gabriele Svelto [:gsvelto] from comment #31)
> > Kamil, Joe, can you try this build and see if it fixes the problem for you:
> > 
> > https://archive.mozilla.org/pub/firefox/try-builds/gsvelto@mozilla.com-
> > d8bf263170563b327296ef94b5aa97529d0ff6c3/try-macosx64/
> 
> I couldn't reproduce the issue using the above try build. I
> installed/re-installed fx about ~12 times and went through the STR from
> comment#0 and bug#1357232comment#0. I didn't receive the macOS crash once.

Thanks a lot, this is really helpful.

Alessio, Ted, if you don't have any more objections I'll land this as it seems to be working fine.
(In reply to Gabriele Svelto [:gsvelto] from comment #38)
> (In reply to Kamil Jozwiak [:kjozwiak] from comment #34)
> > (In reply to Gabriele Svelto [:gsvelto] from comment #31)
> > > Kamil, Joe, can you try this build and see if it fixes the problem for you:
> > > 
> > > https://archive.mozilla.org/pub/firefox/try-builds/gsvelto@mozilla.com-
> > > d8bf263170563b327296ef94b5aa97529d0ff6c3/try-macosx64/
> > 
> > I couldn't reproduce the issue using the above try build. I
> > installed/re-installed fx about ~12 times and went through the STR from
> > comment#0 and bug#1357232comment#0. I didn't receive the macOS crash once.
> 
> Thanks a lot, this is really helpful.
> 
> Alessio, Ted, if you don't have any more objections I'll land this as it
> seems to be working fine.

If we're not:

- slowing down the shutdown;
- blocking the shutdown on the pingsender process;
- killing the pingsender process when FF shuts down;

then we're good to land!

Let's ask for QA coverage on this.
Abe, do you have some cycles to verify this fix?

Since we're changing the way we call the pingsender at shutdown for the "shutdown" ping, this would need to be verified on Windows, Mac and Windows again. We need to take particular care to make sure it doesn't crash at shutdown, especially on MacOS.

If you do this, you can use the "shutdown" pingsender test plan as a reference, there's only one thing that changed meanwhile: we are sending the shutdown ping from the second session on. So, if you create a new profile and shut down Firefox, no shutdown ping will be received: it will be received next time Firefox is started. From the second time FF shuts down, the shutdown ping is sent using the pingsender and should be received.
Flags: needinfo?(amasresha)

Comment 41

3 months ago
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/90774eb8f507
Run the pingsender executable in a way that avoids shutdown races with mozalloc; r=Dexter

Comment 42

3 months ago
(In reply to Alessio Placitelli [:Dexter] from comment #40)
> Abe, do you have some cycles to verify this fix?
Yes, I will verify this bug fix and rerun all test cases of shutdown pings with your inputs (@comment 40).
Flags: needinfo?(amasresha)

Comment 43

3 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/90774eb8f507

Updated

3 months ago
Depends on: 1360818
Depends on: 1360819
Abe, would you mind also testing the following test case, Windows only:

- Open Firefox, wait at least 1 minute;
- Shut down Windows;
- No error should be displayed 

Doesn't necessarily mean that the ping must be received by the server exactly at shutdown in this case, as this heavily depends on the OS. However, if it's not received at that time, it should be sent again on the next restart of FF.
Flags: needinfo?(joe-bugzilla) → needinfo?(amasresha)
(In reply to Alessio Placitelli [:Dexter] from comment #44)
> Abe, would you mind also testing the following test case, Windows only:
> 
> - Open Firefox, wait at least 1 minute;
> - Shut down Windows;
> - No error should be displayed 

Actually, I was just able to test that the error dialog I am getting when exiting Windows is about pingsender.exe. I'll file a bug.
Depends on: 1362024

Comment 46

3 months ago
(In reply to Alessio Placitelli [:Dexter] from comment #44)
> Abe, would you mind also testing the following test case, Windows only:
> 
> - Open Firefox, wait at least 1 minute;
> - Shut down Windows;
> - No error should be displayed 
> 
> Doesn't necessarily mean that the ping must be received by the server
> exactly at shutdown in this case, as this heavily depends on the OS.
> However, if it's not received at that time, it should be sent again on the
> next restart of FF.

Sure. I will add this test case to the shutdown ping test suites and make the testing.
Flags: needinfo?(amasresha)

Updated

3 months ago
Depends on: 1362500
No longer depends on: 1362500
Abe, any update on the validation for this? Do you have an ETA?
Flags: needinfo?(amasresha)
Abe, please let me know if there's anything that I can do to help. If you're currently swamped with work, I can help you go through the test suite.

Comment 49

3 months ago
(In reply to Kamil Jozwiak [:kjozwiak] from comment #48)
> Abe, please let me know if there's anything that I can do to help. If you're
> currently swamped with work, I can help you go through the test suite.
Thanks.

We have completed mid-nightly (second-round) testing on shutdown ping and this particular issue on OSX as well. 
Test results show all green. 
This issue seems to be fixed. 
I will add links to the test documents very shortly.
Flags: needinfo?(amasresha)

Comment 50

3 months ago
Created attachment 8866890 [details]
Manual Testing Summary_ Shutdown Ping - Mid-Nightly Sign Off (2017-05-11) - TestRail.pdf

Test documents:
Test runs: attached.
Test plan: https://wiki.mozilla.org/QA/Shutdown_Ping
Mid-Nightly Sign-off: will be sent via email on 5/12/2017.
Keywords: leave-open
Thank you Abe for the great work here! Can this be closed as Verified then?
Flags: needinfo?(amasresha)
Status: ASSIGNED → RESOLVED
Last Resolved: 3 months ago
Resolution: --- → FIXED

Comment 52

3 months ago
Verified-fixed.
See comment 50 for the test documents
Status: RESOLVED → VERIFIED
Flags: needinfo?(amasresha)
status-firefox53: --- → unaffected
status-firefox54: --- → unaffected
status-firefox55: affected → verified
status-firefox-esr52: --- → unaffected
Target Milestone: --- → mozilla55
gsvelto filed bug 1376567 on tracking the crash here, since we effectively just worked around it in this bug.
You need to log in before you can comment on or make changes to this bug.