Closed
Bug 1356673
Opened 8 years ago
Closed 8 years ago
Crash in TelemetryImpl::RunPingSender on shutdown
Categories
(Toolkit :: Telemetry, defect, P1)
Tracking
()
VERIFIED
FIXED
mozilla55
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox53 | --- | unaffected |
firefox54 | --- | unaffected |
firefox55 | --- | verified |
People
(Reporter: jhildebrand, Assigned: gsvelto)
References
Details
(Keywords: crash, Whiteboard: [measurement:client])
Attachments
(5 files)
Crash on shutdown. Reproduce:
- start Firefox with fresh profile
- shut down
- restart
- Navigate to http://neverssl.com/
- Shut down
- Crash. No crash report generated.
OSX 10.12.4 (16E195)
Nightly 55.0a1 (2017-04-14) (64-bit)
Stack trace attached
Reporter | ||
Updated•8 years ago
|
Summary: TelemetryImpl::RunPingSender → Crash in TelemetryImpl::RunPingSender on shutdown
Updated•8 years ago
|
Flags: needinfo?(alessio.placitelli)
Priority: -- → P1
Whiteboard: [measurement:client]
Updated•8 years ago
|
Assignee: nobody → alessio.placitelli
Flags: needinfo?(alessio.placitelli)
Assignee | ||
Comment 2•8 years ago
|
||
I've tried reproducing this a bunch of times but I don't seem to be able to hit it. However I've noticed something that might be related; sometimes when shutting down with the toolkit.telemetry.shutdownPingSender.enabled option set to true I get the following in my log:
WARNING: YOU ARE LEAKING THE WORLD (at least one JSRuntime and everything alive inside it, that is) AT JS_ShutDown TIME. FIX THIS!
[Parent 41748] ###!!! ASSERTION: Component Manager being held past XPCOM shutdown.: 'cnt == 0', file /Users/gsvelto/projects/mozilla-central/xpcom/build/XPCOMInit.cpp, line 1045
Some other errors like this seem to show up. With the option off I've been unable to get the same error. I don't know if it's related because I cannot consistently trigger it, but I've never seen that particular error before.
That being said there's another thing of note: in the attached trace the pingsender is being called from a promise, but we're already within nsXREDirProvider::DoShutdown(). So I was wondering if we might end up in a situation where the promise used to run the pingsender is being run far too late compared to the rest of the shutdown process and triggers this issue. The reason why I'm thinking this might be the case is that the bottom of that trace shows a segmentation fault deep within memory allocation code run by fork(). Since fork() doesn't accept any parameters nor should rely on our global data then the only chance of that happening is that we've somehow corrupted those memory structures before fork() gets called. Also, since the reporter is not getting a crash then it means we've already disabled the breakpad signal handler that triggers the crashreporter. That also happens very late in the shutdown process and JS methods shouldn't work at that point.
Without a way of reproducing this reliably though it's going to be hard to debug.
Comment 3•8 years ago
|
||
See also the STR in bug 1357232, comment 0.
Both reports are OS X.
Reporter | ||
Comment 4•8 years ago
|
||
Re bug 1357232, I have extensions.logging.enabled=false.
Assignee | ||
Comment 5•8 years ago
|
||
Joe, if you can hit this consistently one thing that would be very important to establish is if the crash is happening after we've disabled Breakpad's exception handler. Can you check if you see the crash after we've hit this code?
https://dxr.mozilla.org/mozilla-central/rev/a374c35469935a874fefe64d3e07003fc5bc8884/toolkit/crashreporter/nsExceptionHandler.cpp#2130
If you can put printf()/fflush(stdout) couple in there and see if it shows up before the crash it would be super helpful because if it is, then we're running the pingsender far too late and that's why it's failing (past that point it would be too late to run *anything*).
Reporter | ||
Comment 6•8 years ago
|
||
From what I can tell, in both the case where it crashes and in the case where it doesn't crash, gExceptionHandler is null at
https://dxr.mozilla.org/mozilla-central/rev/a374c35469935a874fefe64d3e07003fc5bc8884/toolkit/crashreporter/nsExceptionHandler.cpp#2205
Causing the function to return NS_ERROR_NOT_INITIALIZED. I get this further info on crash, AFTER that return:
Leaked URLs:
chrome://pluginproblem/content/pluginProblemBinding.css
resource://gre-resources/counterstyles.css
resource://gre-resources/html.css
chrome://global/content/minimal-xul.css
resource://gre-resources/quirk.css
resource://gre/res/svg.css
chrome://global/content/xul.css
chrome://global/skin/scrollbars.css
resource://gre-resources/number-control.css
resource://gre-resources/forms.css
resource://gre-resources/noscript.css
resource://gre-resources/ua.css
chrome://pocket/skin/pocket.css
chrome://pocket-shared/skin/pocket.css
chrome://global/content/bindings/general.xml#root-element
chrome://global/content/bindings/popup.xml#tooltip
chrome://global/content/bindings/stringbundle.xml#stringbundleset
chrome://global/content/bindings/toolbar.xml#menubar
chrome://global/content/bindings/stringbundle.xml#stringbundle
chrome://global/content/bindings/popup.xml#popup
chrome://global/content/bindings/general.xml#deck
chrome://global/content/bindings/autocomplete.xml#autocomplete-rich-result-popup
chrome://global/content/bindings/popup.xml#arrowpanel
chrome://global/content/bindings/menulist.xml#menulist-popuponly
chrome://global/content/bindings/popup.xml#panel
chrome://global/content/bindings/notification.xml#popup-notification
chrome://global/content/bindings/general.xml#image
chrome://global/content/bindings/toolbarbutton.xml#toolbarbutton
chrome://global/content/bindings/toolbar.xml#toolbox
chrome://global/content/bindings/toolbarbutton.xml#menu
chrome://global/content/bindings/scrollbox.xml#scrollbox
chrome://global/content/bindings/text.xml#text-label
chrome://global/content/bindings/text.xml#label-control
chrome://global/content/bindings/general.xml#dropmarker
chrome://global/content/bindings/toolbarbutton.xml#menu-button
chrome://global/content/bindings/textbox.xml#input-box
chrome://global/content/bindings/autocomplete.xml#history-dropmarker
chrome://global/content/platformHTMLBindings.xml#inputFields
chrome://global/content/bindings/toolbarbutton.xml#toolbarbutton-badged
chrome://global/content/bindings/splitter.xml#splitter
chrome://global/content/bindings/notification.xml#notificationbox
chrome://global/content/bindings/tabbox.xml#tabpanels
chrome://global/content/bindings/browser.xml
chrome://global/content/bindings/browser.xml#browser
chrome://global/content/bindings/browser.xml
chrome://global/content/bindings/menu.xml#menuitem
chrome://global/content/bindings/menu.xml#menuseparator
chrome://global/content/bindings/scrollbar.xml#scrollbar
chrome://global/content/bindings/scrollbar.xml
chrome://global/content/bindings/scrollbar.xml#thumb
chrome://global/content/bindings/scrollbar.xml
chrome://global/content/bindings/scrollbar.xml#scrollbar-base
chrome://global/content/bindings/scrollbar.xml#scrollbar-base
chrome://global/content/bindings/scrollbar.xml#scrollbar
chrome://global/content/bindings/scrollbar.xml#scrollbar-base
chrome://global/content/bindings/scrollbar.xml#thumb
chrome://global/content/bindings/remote-browser.xml
chrome://global/content/bindings/browser.xml#browser
chrome://global/content/bindings/remote-browser.xml#remote-browser
chrome://global/content/bindings/remote-browser.xml
chrome://global/content/bindings/menu.xml#menu
chrome://global/content/bindings/menu.xml#menu-iconic
chrome://global/content/bindings/toolbar.xml#toolbardecoration
chrome://pocket/skin/Toolbar@2x.png
chrome://global/content/bindings/browser.xml#browser
chrome://pocket/content/panels/img/pocketmenuitem16@2x.png
https://accounts.firefox.com/
chrome://global/content/bindings/text.xml#text-base
resource://search-plugins/images/yahoo.ico
resource://search-plugins/images/amazon.ico
resource://search-plugins/images/wikipedia.ico
chrome://global/content/bindings/autocomplete.xml#autocomplete-richlistbox
chrome://global/content/bindings/button.xml#button
chrome://global/content/bindings/autocomplete.xml#autocomplete-richlistitem
chrome://global/content/backgroundPageThumbs.xhtml
file:///Users/jhildebrand/track/gecko-dev/toolkit/components/thumbnails/content/backgroundPageThumbs.xhtml
chrome://global/content/bindings/remote-browser.xml#remote-browser
[Parent 82953] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /Users/jhildebrand/track/gecko-dev/xpcom/base/nsTraceRefcnt.cpp, line 172
nsStringStats
=> mAllocCount: 141852
=> mReallocCount: 7518
=> mFreeCount: 138296 -- LEAKED 3556 !!!
=> mShareCount: 123093
=> mAdoptCount: 8903
=> mAdoptFreeCount: 8901 -- LEAKED 2 !!!
Comment 7•8 years ago
|
||
(In reply to Joe Hildebrand [:hildjj] (UTC-6) from comment #4)
> Re bug 1357232, I have extensions.logging.enabled=false.
I enabled the "extensions.logging.enabled=true" pref so I can see when fx pings the add-on/telemetry services via the console browser. It doesn't have to be enabled to reproduce the issue. However, it gives you a good indicator on when you can shutdown fx when you notice the pings.
I can only reproduce this with a released version of m-c and haven't been able to reproduce the issue with a debug build. Gabriele, let me know if there's anything that I can help with here.
Comment 8•8 years ago
|
||
As a precaution, since we cannot safely evaluate the volume of the shutdown crashes (since the crashreporter is not getting called), we're disabling the shutdown pingsender in bug 1357745.
Comment 9•8 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #2)
> WARNING: YOU ARE LEAKING THE WORLD (at least one JSRuntime and everything
> alive inside it, that is) AT JS_ShutDown TIME. FIX THIS!
> [Parent 41748] ###!!! ASSERTION: Component Manager being held past XPCOM
> shutdown.: 'cnt == 0', file
> /Users/gsvelto/projects/mozilla-central/xpcom/build/XPCOMInit.cpp, line 1045
FWIW I occasionally see this on Ubuntu, even with the pingsender disabled.
Comment hidden (mozreview-request) |
Comment 11•8 years ago
|
||
(In reply to Alessio Placitelli [:Dexter] from comment #8)
> As a precaution, since we cannot safely evaluate the volume of the shutdown
> crashes (since the crashreporter is not getting called), we're disabling the
> shutdown pingsender in bug 1357745.
Ah! Thanks to all the precious feedback in this bug, I was able to trace down the potential source of the crash.
As Gabriele mentioned in comment 2, this is crashing because we're trying to spawn a new process too late into the shutdown.
I knew we were blocking on Telemetry at shutdown, so I crosschecked the calls.
The shutdown ping is submitted at [1] and wait upon at [2]. TelemetryController waits for the shutdown to complete at [3], and we block the FF shutdown here [4].
However, is TelemetryController.submitExternalPing returning the promise we expect? If we follow the calls down to TelemetrySend [5], we can see that we're returning the promise related to saving the ping to the disk, rather than the one for the pingsender process being spawned.
This explains why we're crashing when spawning the pingsender rather than when saving the ping to the disk.
The attached patch attempts to fix the problem, by returning the right promise so that Firefox correctly waits on the pingsender to spawn before quitting.
[1] - http://searchfox.org/mozilla-central/rev/214345204f1e7d97abb571b7992b6deedb5ff98f/toolkit/components/telemetry/TelemetrySession.jsm#1786
[2] - http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/TelemetrySession.jsm#2035
[3] - http://searchfox.org/mozilla-central/rev/214345204f1e7d97abb571b7992b6deedb5ff98f/toolkit/components/telemetry/TelemetryController.jsm#796
[4] - http://searchfox.org/mozilla-central/rev/214345204f1e7d97abb571b7992b6deedb5ff98f/toolkit/components/telemetry/TelemetryController.jsm#755
[5] - http://searchfox.org/mozilla-central/rev/214345204f1e7d97abb571b7992b6deedb5ff98f/toolkit/components/telemetry/TelemetrySend.jsm#810
Comment 12•8 years ago
|
||
Kamil, thank you for your help so far.
Would you kindly check if you're still able to reproduce the problem with the patch attached to this bug?
I've also pushed the patch to try so that you can conveniently grab a pre-build copy of the built Firefox.
Flags: needinfo?(kjozwiak)
Updated•8 years ago
|
Status: NEW → ASSIGNED
Updated•8 years ago
|
Keywords: leave-open
Comment 13•8 years ago
|
||
mozreview-review |
Comment on attachment 8859615 [details]
Bug 1356673 - Fix a crash in RunPingSender at shutdown.
https://reviewboard.mozilla.org/r/131622/#review134426
We should definitely fix what this patch adresses, lets land this.
However, i'm not sure about the analysis of the crash.
You are blaming memory corruption from XPCOM shutdown or so.
But this is a crash in a sync main-thread function inside PR_CreateProcessDetached / fork.
To get there, we must already have successfully passed the strings through XPCOM and converted them to CStrings.
Let's land this and set leave-open pending:
- verifying this solves the problem
- checking the crash analysis with someone more familiar with these, maybe :froydnj or :ted
Attachment #8859615 -
Flags: review?(gfritzsche) → review+
Comment 14•8 years ago
|
||
Pushed by alessio.placitelli@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/f281b8da52c9
Fix a crash in RunPingSender at shutdown. r=gfritzsche
Comment 15•8 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] [away Apr 13 - 18] from comment #13)
> However, i'm not sure about the analysis of the crash.
> You are blaming memory corruption from XPCOM shutdown or so.
> But this is a crash in a sync main-thread function inside
> PR_CreateProcessDetached / fork.
> To get there, we must already have successfully passed the strings through
> XPCOM and converted them to CStrings.
Ted, could you kindly help with this?
Context: we're crashing at shutdown when spawning a new process to send the shutdown pings using the "pingsender".
Gabriele provided some useful insights about this crash in comment 2. How can we be crashing so far off after the XPCOM shutdown? Does this crash analysis make sense?
Updated•8 years ago
|
Flags: needinfo?(ted)
Comment 16•8 years ago
|
||
(In reply to Alessio Placitelli [:Dexter] from comment #12)
> Kamil, thank you for your help so far.
>
> Would you kindly check if you're still able to reproduce the problem with
> the patch attached to this bug?
> I've also pushed the patch to try so that you can conveniently grab a
> pre-build copy of the built Firefox.
Alessio, I can still reproduce this using the try build [1] using the same STR as before. Once fx pings the add-on/telemetry services, it will crash every single time it's closed. Let me know if I'm using the incorrect try build.
[1] https://archive.mozilla.org/pub/firefox/try-builds/alessio.placitelli@gmail.com-0bd22e8ec3dddb948b3ae56699cd1b677234ca26/try-macosx64/
Flags: needinfo?(kjozwiak)
Comment 17•8 years ago
|
||
bugherder |
Comment 18•8 years ago
|
||
(In reply to Kamil Jozwiak [:kjozwiak] from comment #16)
> (In reply to Alessio Placitelli [:Dexter] from comment #12)
> > Kamil, thank you for your help so far.
> >
> > Would you kindly check if you're still able to reproduce the problem with
> > the patch attached to this bug?
> > I've also pushed the patch to try so that you can conveniently grab a
> > pre-build copy of the built Firefox.
>
> Alessio, I can still reproduce this using the try build [1] using the same
> STR as before. Once fx pings the add-on/telemetry services, it will crash
> every single time it's closed. Let me know if I'm using the incorrect try
> build.
>
> [1]
> https://archive.mozilla.org/pub/firefox/try-builds/alessio.placitelli@gmail.
> com-0bd22e8ec3dddb948b3ae56699cd1b677234ca26/try-macosx64/
Thanks for checking this. We ended up disabling the shutdown ping sending in bug 1357745 while we investigate this.
Comment 19•8 years ago
|
||
Judging by the stack, I think this is a race condition in shutdown, and that our mozalloc implementation is doing something wrong. The crash is actually in the child process that you're launching, and it's trying to re-initialize its malloc bits after forking. It's crashing with a bad pointer that points inside libmozglue:
Exception Codes: KERN_PROTECTION_FAILURE at 0x0000000104767610
VM Regions Near 0x104767610:
__DATA 0000000104756000-0000000104757000 [ 4K] rw-/rwx SM=COW /Applications/FirefoxNightly.app/Contents/MacOS/libmozglue.dylib
--> __LINKEDIT 0000000104757000-000000010476d000 [ 88K] r--/rwx SM=COW /Applications/FirefoxNightly.app/Contents/MacOS/libmozglue.dylib
Kernel Alloc Once 000000010476d000-000000010476f000 [ 8K] rw-/rwx SM=COW
I think that mozalloc has a bug where it isn't doing something right for fork()ed processes, and this just happens to be the only code that's using fork on OS X (as you noticed) and also has the forked process racing with the main process' shutdown.
My best guess at what happens is something like:
1) fork()
2) Parent process continues running, frees some mozalloc data structures.
3) Child process starts running, tries to use those mozalloc data structures, crashes.
Flags: needinfo?(ted)
Assignee | ||
Comment 20•8 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #19)
> My best guess at what happens is something like:
> 1) fork()
> 2) Parent process continues running, frees some mozalloc data structures.
> 3) Child process starts running, tries to use those mozalloc data
> structures, crashes.
Thanks for the analysis Ted, I was smelling it was a problem with memory allocation because of the stack trace but I didn't realize it was in the child process :-/ Since I'm using fork() here only to have a common path between Linux and Mac, what do you think if I just switch this code to use posix_spawn() instead? We shouldn't run into fork()-related issues anymore, it should work on both platforms and since we're just running an external executable it's probably the most appropriate function to use.
Flags: needinfo?(ted)
Updated•8 years ago
|
Assignee: alessio.placitelli → gsvelto
Comment 21•8 years ago
|
||
If that fixes the issue I would say go for it. We can file a separate bug about the crash here.
Flags: needinfo?(ted)
Assignee | ||
Comment 22•8 years ago
|
||
Working on a patch to fix this, it should be ready by tonight.
Assignee | ||
Comment 23•8 years ago
|
||
This is a mostly working WIP which still needs some polish. The fix is achieved by using nsIProcess to spawn the pingsender instead of the custom C++ code we had deployed in Telemetry. Since nsIProcess uses posix_spawnp() internally on MacOS X it will avoid any problems related to mozalloc in the child process.
I feel silly about not having noticed the nsIProcess component before, it would have saved us quite a bit of trouble :-/
Attachment #8861222 -
Flags: feedback?(alessio.placitelli)
Comment 24•8 years ago
|
||
Comment on attachment 8861222 [details] [diff] [review]
[PATCH WIP] Run the pingsender executable in a way that avoids races during shutdown
Review of attachment 8861222 [details] [diff] [review]:
-----------------------------------------------------------------
This seems reasonable to me, thanks! I left a few comments below.
::: toolkit/components/telemetry/TelemetrySend.jsm
@@ +280,5 @@
> getShutdownState() {
> return TelemetrySendImpl.getShutdownState();
> },
> +
> + runPingSender(url, pingPath) {
Let's move the docs that we had in nsITelemetry.idl here.
@@ +281,5 @@
> return TelemetrySendImpl.getShutdownState();
> },
> +
> + runPingSender(url, pingPath) {
> + TelemetrySendImpl.runPingSender(url, pingPath);
Let's change this to |return TelemetrySendImpl. ...|, in case we want to propagate the errors from runpingSender.
@@ +1231,5 @@
> };
> },
> +
> + runPingSender(url, pingPath) {
> +#ifdef XP_WIN
Instead of doing this, you can use AppConstants. Something like this:
const pingSenderExe = (AppConstants.platform == "win") ? "pingsender.exe" : "pingsender";
@@ +1235,5 @@
> +#ifdef XP_WIN
> + const exeName = "pingsender.exe";
> +#else
> + const exeName = "pingsender";
> +#endif
Let's also disable support for Fennec here, as we did in Telemetry.cpp.
@@ +1242,5 @@
> + exe.append(exeName);
> +
> + let process = Cc["@mozilla.org/process/util;1"]
> + .createInstance(Ci.nsIProcess);
> + process.init(exe);
Let's check for errors returned by both |init| and |run| (below), as it seems they could be failing.
@@ +1243,5 @@
> +
> + let process = Cc["@mozilla.org/process/util;1"]
> + .createInstance(Ci.nsIProcess);
> + process.init(exe);
> + process.run(/* blocking */ false, [url, pingPath], 2);
Are you sure this makes the new process "detached" from the Firefox process?
For example, are we sure the pingsender will keep running once the Firefox process is killed?
::: toolkit/components/telemetry/moz.build
@@ +84,5 @@
> 'UITelemetry.jsm',
> ]
>
> +EXTRA_PP_JS_MODULES += [
> + 'TelemetrySend.jsm',
If you use AppConstants, you don't need to move the file here.
::: toolkit/components/telemetry/pingsender/pingsender.h
@@ +3,5 @@
> * License, v. 2.0. If a copy of the MPL was not distributed with this
> * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
>
> #include <string>
> +#define DEBUG
Debugging leftover? :)
Attachment #8861222 -
Flags: feedback?(alessio.placitelli) → feedback+
Comment 25•8 years ago
|
||
@Ted, do you think that it's reasonable to use |nsIProcess| in this case, right before Firefox shuts down (see TelemetrySend.jsm in the patch from comment 23)?
Flags: needinfo?(ted)
Assignee | ||
Comment 26•8 years ago
|
||
(In reply to Alessio Placitelli [:Dexter] from comment #24)
> Let's change this to |return TelemetrySendImpl. ...|, in case we want to
> propagate the errors from runpingSender.
runPingSender() will throw NS_ERROR_FAILURE if it can't run the pingsender, it's the same behavior as before. What other error were you thinking was worth returning? BTW TelemetrySend.runPingSender() is exposed only because I needed it for testing but it's not used directly, the only other call-site is TelemetrySend._sendWithPingSender() which is catching all exceptions.
> const pingSenderExe = (AppConstants.platform == "win") ? "pingsender.exe" :
> "pingsender";
Thanks for the pointer, I had been digging into our codebase to figure out how that's normally done but wasn't aware of AppConstants.
> Let's also disable support for Fennec here, as we did in Telemetry.cpp.
>
> @@ +1242,5 @@
> > + exe.append(exeName);
> > +
> > + let process = Cc["@mozilla.org/process/util;1"]
> > + .createInstance(Ci.nsIProcess);
> > + process.init(exe);
>
> Let's check for errors returned by both |init| and |run| (below), as it
> seems they could be failing.
They both throw NS_ERROR_FAILURE in case something goes wrong and we're already catching everything in TelemetrySendImpl._sendWithPingSender() which is what we already did before. Should I change this?
> @@ +1243,5 @@
> > +
> > + let process = Cc["@mozilla.org/process/util;1"]
> > + .createInstance(Ci.nsIProcess);
> > + process.init(exe);
> > + process.run(/* blocking */ false, [url, pingPath], 2);
>
> Are you sure this makes the new process "detached" from the Firefox process?
> For example, are we sure the pingsender will keep running once the Firefox
> process is killed?
Yes, I've tested it just in case and the pingsender process remains alive after the main process has been terminated.
> ::: toolkit/components/telemetry/pingsender/pingsender.h
> @@ +3,5 @@
> > * License, v. 2.0. If a copy of the MPL was not distributed with this
> > * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> >
> > #include <string>
> > +#define DEBUG
>
> Debugging leftover? :)
Yup :)
Comment hidden (mozreview-request) |
Comment 28•8 years ago
|
||
mozreview-review |
Comment on attachment 8861763 [details]
Bug 1356673 - Run the pingsender executable in a way that avoids shutdown races with mozalloc;
https://reviewboard.mozilla.org/r/133766/#review136714
This looks good with the changes below addressed, thanks!
Let's also turn the shutdown pingsender back on by adding this:
// Enables sending the shutdown ping when Firefox shuts down.
pref("toolkit.telemetry.shutdownPingSender.enabled", true);
right after [here](http://searchfox.org/mozilla-central/rev/876c7dd30586f9c6f9c99ef7444f2d73c7acfe7c/browser/app/profile/firefox.js#1446).
Please let's wait for Ted to clear the ni? before landing this and flag QA for it too.
::: toolkit/components/telemetry/TelemetrySend.jsm:285
(Diff revision 1)
> getShutdownState() {
> return TelemetrySendImpl.getShutdownState();
> },
> +
> + /**
> + * Send a ping using the ping sender.
Let's mention that this is a test only function, not meant to be used outside of the tests.
::: toolkit/components/telemetry/TelemetrySend.jsm:293
(Diff revision 1)
> + *
> + * @param {String} aUrl The telemetry server URL
> + * @param {String} aPingFilePath The path to the file holding the ping
> + * contents, if if sent successfully the pingsender will delete it.
> + *
> + * @throws NS_ERROR_FAILURE if we couldn't find or run the pingsender executable.
nit: also mention that it throws on Android too with *...NOT_SUPPORTED*
::: toolkit/components/telemetry/TelemetrySend.jsm:295
(Diff revision 1)
> + * @param {String} aPingFilePath The path to the file holding the ping
> + * contents, if if sent successfully the pingsender will delete it.
> + *
> + * @throws NS_ERROR_FAILURE if we couldn't find or run the pingsender executable.
> + */
> + runPingSender(url, pingPath) {
Please, rename this to testRunPingSender since this is test-only.
Attachment #8861763 -
Flags: review?(alessio.placitelli) → review+
Assignee | ||
Comment 29•8 years ago
|
||
Thanks for the review, I've updated the comments and I'll push this to try so that we have builds the reporters can use to check if the issue went away for good.
Comment hidden (mozreview-request) |
Assignee | ||
Comment 31•8 years ago
|
||
Kamil, Joe, can you try this build and see if it fixes the problem for you:
https://archive.mozilla.org/pub/firefox/try-builds/gsvelto@mozilla.com-d8bf263170563b327296ef94b5aa97529d0ff6c3/try-macosx64/
Flags: needinfo?(kjozwiak)
Flags: needinfo?(joe-bugzilla)
Comment 32•8 years ago
|
||
(In reply to Alessio Placitelli [:Dexter] from comment #25)
> @Ted, do you think that it's reasonable to use |nsIProcess| in this case,
> right before Firefox shuts down (see TelemetrySend.jsm in the patch from
> comment 23)?
I think it should be OK, but the only thing that worries me is that you're launching it with blocking=false, which triggers this code path:
https://dxr.mozilla.org/mozilla-central/rev/0b77ed3f26c5335503bc16e85b8c067382e7bb1e/xpcom/threads/nsProcessCommon.cpp#554
...which creates a thread to watch for the process result, and adds a shutdown observer to the observer service.
Flags: needinfo?(ted)
Comment 33•8 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #32)
> (In reply to Alessio Placitelli [:Dexter] from comment #25)
> > @Ted, do you think that it's reasonable to use |nsIProcess| in this case,
> > right before Firefox shuts down (see TelemetrySend.jsm in the patch from
> > comment 23)?
>
> I think it should be OK, but the only thing that worries me is that you're
> launching it with blocking=false, which triggers this code path:
> https://dxr.mozilla.org/mozilla-central/rev/
> 0b77ed3f26c5335503bc16e85b8c067382e7bb1e/xpcom/threads/nsProcessCommon.
> cpp#554
>
> ...which creates a thread to watch for the process result, and adds a
> shutdown observer to the observer service.
Mh, I guess this could be a problem. Does it mean that we slow the shutdown or, worse, block the shutdown on the pingsender process? Gabriele, what's your take on it?
I guess we could add some sort of "run and forget" API endpoint to nsIProcess which runs the process asynchronously and doesn't monitor it, so we don't end up copy/pasting this code again and again in the whole codebase.
Flags: needinfo?(gsvelto)
Comment 34•8 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #31)
> Kamil, Joe, can you try this build and see if it fixes the problem for you:
>
> https://archive.mozilla.org/pub/firefox/try-builds/gsvelto@mozilla.com-
> d8bf263170563b327296ef94b5aa97529d0ff6c3/try-macosx64/
I couldn't reproduce the issue using the above try build. I installed/re-installed fx about ~12 times and went through the STR from comment#0 and bug#1357232comment#0. I didn't receive the macOS crash once.
Flags: needinfo?(kjozwiak)
Comment 35•8 years ago
|
||
Interestingly, I fixed bug 1301471 today and reprocessed some OS X crashes that were missing symbols, and picking one at random I saw:
https://crash-stats.mozilla.com/report/index/4a19e768-74ca-425a-b29b-71c730170427#tab-details
which is also a crash in malloc after fork!
Comment 36•8 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #35)
> Interestingly, I fixed bug 1301471 today and reprocessed some OS X crashes
> that were missing symbols, and picking one at random I saw:
> https://crash-stats.mozilla.com/report/index/4a19e768-74ca-425a-b29b-
> 71c730170427#tab-details
>
> which is also a crash in malloc after fork!
Interesting, so the underlying cause is more widespread than initially thought! Is it worth having a bug on it/trying to fix it (just throwing the idea around, I don't think I'd be able to take it :-D )?
Assignee | ||
Comment 37•8 years ago
|
||
(In reply to Alessio Placitelli [:Dexter] from comment #33)
> > I think it should be OK, but the only thing that worries me is that you're
> > launching it with blocking=false, which triggers this code path:
> > https://dxr.mozilla.org/mozilla-central/rev/
> > 0b77ed3f26c5335503bc16e85b8c067382e7bb1e/xpcom/threads/nsProcessCommon.
> > cpp#554
> >
> > ...which creates a thread to watch for the process result, and adds a
> > shutdown observer to the observer service.
>
> Mh, I guess this could be a problem. Does it mean that we slow the shutdown
> or, worse, block the shutdown on the pingsender process? Gabriele, what's
> your take on it?
>
> I guess we could add some sort of "run and forget" API endpoint to
> nsIProcess which runs the process asynchronously and doesn't monitor it, so
> we don't end up copy/pasting this code again and again in the whole codebase.
I've dug into this a little more and I think it's safe, here's why: the observer is used only to detect XPCOM shutdown and when we're running the pingsender XPCOM shutdown hasn't initiated yet; in fact it cannot initiate until we're done with the asynchronous shutdown (which includes telemetry). If we'd use nsIProcess during XPCOM shutdown it would be a problem but as it is, it's not.
Flags: needinfo?(gsvelto)
Assignee | ||
Comment 38•8 years ago
|
||
(In reply to Kamil Jozwiak [:kjozwiak] from comment #34)
> (In reply to Gabriele Svelto [:gsvelto] from comment #31)
> > Kamil, Joe, can you try this build and see if it fixes the problem for you:
> >
> > https://archive.mozilla.org/pub/firefox/try-builds/gsvelto@mozilla.com-
> > d8bf263170563b327296ef94b5aa97529d0ff6c3/try-macosx64/
>
> I couldn't reproduce the issue using the above try build. I
> installed/re-installed fx about ~12 times and went through the STR from
> comment#0 and bug#1357232comment#0. I didn't receive the macOS crash once.
Thanks a lot, this is really helpful.
Alessio, Ted, if you don't have any more objections I'll land this as it seems to be working fine.
Comment 39•8 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #38)
> (In reply to Kamil Jozwiak [:kjozwiak] from comment #34)
> > (In reply to Gabriele Svelto [:gsvelto] from comment #31)
> > > Kamil, Joe, can you try this build and see if it fixes the problem for you:
> > >
> > > https://archive.mozilla.org/pub/firefox/try-builds/gsvelto@mozilla.com-
> > > d8bf263170563b327296ef94b5aa97529d0ff6c3/try-macosx64/
> >
> > I couldn't reproduce the issue using the above try build. I
> > installed/re-installed fx about ~12 times and went through the STR from
> > comment#0 and bug#1357232comment#0. I didn't receive the macOS crash once.
>
> Thanks a lot, this is really helpful.
>
> Alessio, Ted, if you don't have any more objections I'll land this as it
> seems to be working fine.
If we're not:
- slowing down the shutdown;
- blocking the shutdown on the pingsender process;
- killing the pingsender process when FF shuts down;
then we're good to land!
Let's ask for QA coverage on this.
Comment 40•8 years ago
|
||
Abe, do you have some cycles to verify this fix?
Since we're changing the way we call the pingsender at shutdown for the "shutdown" ping, this would need to be verified on Windows, Mac and Windows again. We need to take particular care to make sure it doesn't crash at shutdown, especially on MacOS.
If you do this, you can use the "shutdown" pingsender test plan as a reference, there's only one thing that changed meanwhile: we are sending the shutdown ping from the second session on. So, if you create a new profile and shut down Firefox, no shutdown ping will be received: it will be received next time Firefox is started. From the second time FF shuts down, the shutdown ping is sent using the pingsender and should be received.
Flags: needinfo?(amasresha)
Comment 41•8 years ago
|
||
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/90774eb8f507
Run the pingsender executable in a way that avoids shutdown races with mozalloc; r=Dexter
Comment 42•8 years ago
|
||
(In reply to Alessio Placitelli [:Dexter] from comment #40)
> Abe, do you have some cycles to verify this fix?
Yes, I will verify this bug fix and rerun all test cases of shutdown pings with your inputs (@comment 40).
Flags: needinfo?(amasresha)
Comment 43•8 years ago
|
||
bugherder |
Comment 44•8 years ago
|
||
Abe, would you mind also testing the following test case, Windows only:
- Open Firefox, wait at least 1 minute;
- Shut down Windows;
- No error should be displayed
Doesn't necessarily mean that the ping must be received by the server exactly at shutdown in this case, as this heavily depends on the OS. However, if it's not received at that time, it should be sent again on the next restart of FF.
Flags: needinfo?(joe-bugzilla) → needinfo?(amasresha)
Comment 45•8 years ago
|
||
(In reply to Alessio Placitelli [:Dexter] from comment #44)
> Abe, would you mind also testing the following test case, Windows only:
>
> - Open Firefox, wait at least 1 minute;
> - Shut down Windows;
> - No error should be displayed
Actually, I was just able to test that the error dialog I am getting when exiting Windows is about pingsender.exe. I'll file a bug.
Comment 46•8 years ago
|
||
(In reply to Alessio Placitelli [:Dexter] from comment #44)
> Abe, would you mind also testing the following test case, Windows only:
>
> - Open Firefox, wait at least 1 minute;
> - Shut down Windows;
> - No error should be displayed
>
> Doesn't necessarily mean that the ping must be received by the server
> exactly at shutdown in this case, as this heavily depends on the OS.
> However, if it's not received at that time, it should be sent again on the
> next restart of FF.
Sure. I will add this test case to the shutdown ping test suites and make the testing.
Flags: needinfo?(amasresha)
Comment 47•8 years ago
|
||
Abe, any update on the validation for this? Do you have an ETA?
Flags: needinfo?(amasresha)
Comment 48•8 years ago
|
||
Abe, please let me know if there's anything that I can do to help. If you're currently swamped with work, I can help you go through the test suite.
Comment 49•8 years ago
|
||
(In reply to Kamil Jozwiak [:kjozwiak] from comment #48)
> Abe, please let me know if there's anything that I can do to help. If you're
> currently swamped with work, I can help you go through the test suite.
Thanks.
We have completed mid-nightly (second-round) testing on shutdown ping and this particular issue on OSX as well.
Test results show all green.
This issue seems to be fixed.
I will add links to the test documents very shortly.
Flags: needinfo?(amasresha)
Comment 50•8 years ago
|
||
Test documents:
Test runs: attached.
Test plan: https://wiki.mozilla.org/QA/Shutdown_Ping
Mid-Nightly Sign-off: will be sent via email on 5/12/2017.
Updated•8 years ago
|
Keywords: leave-open
Comment 51•8 years ago
|
||
Thank you Abe for the great work here! Can this be closed as Verified then?
Flags: needinfo?(amasresha)
Updated•8 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 52•8 years ago
|
||
Verified-fixed.
See comment 50 for the test documents
Status: RESOLVED → VERIFIED
Flags: needinfo?(amasresha)
Updated•8 years ago
|
status-firefox53:
--- → unaffected
status-firefox54:
--- → unaffected
status-firefox-esr52:
--- → unaffected
Target Milestone: --- → mozilla55
Comment 53•8 years ago
|
||
gsvelto filed bug 1376567 on tracking the crash here, since we effectively just worked around it in this bug.
You need to log in
before you can comment on or make changes to this bug.
Description
•