Closed Bug 1343282 Opened 3 years ago Closed 2 years ago

Validate sending shutdown pings using the PingSender

Categories

(Toolkit :: Telemetry, defect, P1)

defect
Points:
2

Tracking

()

RESOLVED FIXED
Tracking Status
firefox54 --- affected

People

(Reporter: Dexter, Assigned: Dexter)

References

Details

(Whiteboard: [measurement:client])

After bug 1336360 lands, we should take a moment to analyse the incoming data to check that:

- shutdown pings are coming in faster;
- there's no downfall in the number of shutdown pings we receive; we expect to see a rise instead (due to profiles that use Firefox only once).
- the shutdown pings we receive through the PingSender are not malformed.
- the number of duplicated shutdown pings is not increasing too much
Blocks: 1343277
Priority: -- → P3
Whiteboard: [measurement:client]
Priority: P3 → P1
Points: --- → 2
Priority: P1 → P2
Priority: P2 → P1
Assignee: nobody → alessio.placitelli
The analysis is being reviewed at: https://github.com/mozilla/mozilla-reports/pull/62
The analysis was reviewed and merged [1], here's the key facts:

Good
- 85% of the shutdown pings received as soon as generated, versus 35% within the first hour (no pingsender).

Bad
- shutdownDuration regression (up to 4ms)
- Increase in the duplicates rate (from 0.5% to 1.61%).

The theory behind the increase of duplicates from the pingsender is that the OS is shutting down while Firefox is open: this triggers Firefox to close, it succeeds to spawn the pingsender process and to send the ping, but fails to remove the copy of the ping from the disk. Bug 1363345 is dealing with that.

@Gijs, as mentioned above we're regressing the shutdown a little bit. Is up to 4ms at shutdown something we can deal with? Are you the right person to discuss this or do you know who I should talk to?

[1] - https://github.com/mozilla/mozilla-reports/blob/master/projects/main_ping_delays_pingsender.kp/knowledge.md
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to Alessio Placitelli [:Dexter] from comment #2)
> The analysis was reviewed and merged [1], here's the key facts:
> 
> Good
> - 85% of the shutdown pings received as soon as generated, versus 35% within
> the first hour (no pingsender).
> 
> Bad
> - shutdownDuration regression (up to 4ms)
> - Increase in the duplicates rate (from 0.5% to 1.61%).
> 
> The theory behind the increase of duplicates from the pingsender is that the
> OS is shutting down while Firefox is open: this triggers Firefox to close,
> it succeeds to spawn the pingsender process and to send the ping, but fails
> to remove the copy of the ping from the disk. Bug 1363345 is dealing with
> that.
> 
> @Gijs, as mentioned above we're regressing the shutdown a little bit. Is up
> to 4ms at shutdown something we can deal with? Are you the right person to
> discuss this or do you know who I should talk to?
> 
> [1] -
> https://github.com/mozilla/mozilla-reports/blob/master/projects/
> main_ping_delays_pingsender.kp/knowledge.md

I think Florian (for Photon perf) or Ehsan (for Quantum Flow) are better folks to talk to. Forwarding the ni.
Flags: needinfo?(gijskruitbosch+bugs)
Flags: needinfo?(florian)
Flags: needinfo?(ehsan)
What time do we run this program in relation to receiving the WM_QUERYENDSESSION and WM_ENDSESSION messages?

Your question about the 4ms isn't really possible to answer.  Windows doesn't give you time limits at shutdown.  After your application returns from its WM_ENDSESSION handler it should expect to be terminated at any time.
Flags: needinfo?(ehsan) → needinfo?(alessio.placitelli)
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #4)
> What time do we run this program in relation to receiving the
> WM_QUERYENDSESSION and WM_ENDSESSION messages?

It should be late in the shutdown, when "profile-before-change-telemetry" is notified [1].

> Your question about the 4ms isn't really possible to answer.  Windows
> doesn't give you time limits at shutdown.  After your application returns
> from its WM_ENDSESSION handler it should expect to be terminated at any time.

Sorry if I wasn't clear, my question was specifically whether it is acceptable or not to get up to 4ms at Firefox shutdown when triggered by the normal user flow (user quitting Firefox).

In bug 1363345 we'll make sure not to spawn the pingsender if Firefox is being closed due to the OS shutting down.

[1] - http://searchfox.org/mozilla-central/rev/d66b9f27d5630a90b2fce4d70d4e9050f43df9b4/widget/windows/nsWindow.cpp#5127
Flags: needinfo?(alessio.placitelli) → needinfo?(ehsan)
(In reply to Alessio Placitelli [:Dexter] from comment #5)
> (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from
> comment #4)
> > What time do we run this program in relation to receiving the
> > WM_QUERYENDSESSION and WM_ENDSESSION messages?
> 
> It should be late in the shutdown, when "profile-before-change-telemetry" is
> notified [1].

Sorry...  I was answering a different question.  I thought you are saying we launch pingsender close to Windows shutdown time and you're worried whether Windows will allow a program to run for 4ms before it shuts the system down.  :-)  Please ignore what I said before, we were talking about different things!

> > Your question about the 4ms isn't really possible to answer.  Windows
> > doesn't give you time limits at shutdown.  After your application returns
> > from its WM_ENDSESSION handler it should expect to be terminated at any time.
> 
> Sorry if I wasn't clear, my question was specifically whether it is
> acceptable or not to get up to 4ms at Firefox shutdown when triggered by the
> normal user flow (user quitting Firefox).

I see.  I'm pretty sure that's not a good idea, since we are trying to make the shutdown of Firefox faster for 57, however, I think the plans for 57 are more focused on making it "appear" that we shut down fast (for example see bug 13554590).  So I will let Florian or Mike Conley speak to this.

The real solution to fast shutdowns is bug 662444.  4ms here and there doesn't really matter.  :-)

> In bug 1363345 we'll make sure not to spawn the pingsender if Firefox is
> being closed due to the OS shutting down.

Yeah that is a good idea, there is not a great chance that it can finish its job in time.
Flags: needinfo?(ehsan) → needinfo?(mconley)
Also one question, does this time depend on the size of the ping?  And is the size of the ping different between Nightly and Release?
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #7)
> Also one question, does this time depend on the size of the ping?  And is
> the size of the ping different between Nightly and Release?

No, it doesn't!
The additional time is only bound to the time it takes to spawn a process on the machine.
Jeff, is having up to 4ms of delay at shutdown something we can deal with? 4ms is the worst case, as most of the overhead is between 1-2ms [1]. We are seeing this overhead because we're spawning a new process at shutdown to quickly send back the shutdown pings, instead of waiting for the next Firefox restart. See comment 2 for more details.

[1] - https://github.com/mozilla/mozilla-reports/blob/master/projects/main_ping_delays_pingsender.kp/knowledge.md#did-we-regress-shutdownduration
Flags: needinfo?(jgriffiths)
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #6)
> 
> I see.  I'm pretty sure that's not a good idea, since we are trying to make
> the shutdown of Firefox faster for 57, however, I think the plans for 57 are
> more focused on making it "appear" that we shut down fast (for example see
> bug 13554590).  So I will let Florian or Mike Conley speak to this.
> 

Faster shutdown, both perceived and real, has been moved out of scope for 57 based on the remaining time. That doesn't mean I'm jazzed about us regressing this, but it's not going against active performance work in that area.
Flags: needinfo?(mconley)
(In reply to Mike Conley (:mconley) - Review and needinfo queue maxed. from comment #10)
> (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from
> comment #6)
> > 
> > I see.  I'm pretty sure that's not a good idea, since we are trying to make
> > the shutdown of Firefox faster for 57, however, I think the plans for 57 are
> > more focused on making it "appear" that we shut down fast (for example see
> > bug 13554590).  So I will let Florian or Mike Conley speak to this.
> > 
> 
> Faster shutdown, both perceived and real, has been moved out of scope for 57
> based on the remaining time. That doesn't mean I'm jazzed about us
> regressing this, but it's not going against active performance work in that
> area.

Mike is correct. I'm not jazzed about this either but I think I agree that the additional utility is important enough, in fact this seems like a cool trick :)

Mike, Ehsan: IMO if this feature gets landed and has a larger-than-expected shutdown regression (  >=5ms ), we back it out.
Flags: needinfo?(jgriffiths)
(In reply to Alessio Placitelli [:Dexter] from comment #2)

> Is up to 4ms at shutdown something we can deal with?

Where is the '4ms' value coming from? What makes you confident that it'll never take longer?

By 'shutdown', do you mean the end of the firefox.exe process, or the time when we remove the profile lock from the disk? (the latter is the only thing that can really get in the way of the user if attempting to start Firefox again)
Flags: needinfo?(florian)
(In reply to Florian Quèze [:florian] [:flo] from comment #12)
> (In reply to Alessio Placitelli [:Dexter] from comment #2)
> 
> > Is up to 4ms at shutdown something we can deal with?
> 
> Where is the '4ms' value coming from? What makes you confident that it'll
> never take longer?

The telemetry analysis in comment 2, I think.

(In reply to Jeff Griffiths (:canuckistani) (:⚡︎) from comment #11)
> (In reply to Mike Conley (:mconley) - Review and needinfo queue maxed. from
> comment #10)
> > (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from
> > comment #6)
> > > 
> > > I see.  I'm pretty sure that's not a good idea, since we are trying to make
> > > the shutdown of Firefox faster for 57, however, I think the plans for 57 are
> > > more focused on making it "appear" that we shut down fast (for example see
> > > bug 13554590).  So I will let Florian or Mike Conley speak to this.
> > > 
> > 
> > Faster shutdown, both perceived and real, has been moved out of scope for 57
> > based on the remaining time. That doesn't mean I'm jazzed about us
> > regressing this, but it's not going against active performance work in that
> > area.
> 
> Mike is correct. I'm not jazzed about this either but I think I agree that
> the additional utility is important enough, in fact this seems like a cool
> trick :)

OK, great to know.

> Mike, Ehsan: IMO if this feature gets landed and has a larger-than-expected
> shutdown regression (  >=5ms ), we back it out.

FWIW in all honesty I wouldn't rush to a backout if the real regression numbers end up being 5ms instead of 4ms since 1ms isn't perceivable by any human, but I agree in the general that we should back out if the regressions becomes larger than expected.  But given comment 8, I'm hopeful that it won't.
(In reply to Florian Quèze [:florian] [:flo] from comment #12)
> (In reply to Alessio Placitelli [:Dexter] from comment #2)
> 
> > Is up to 4ms at shutdown something we can deal with?
> 
> Where is the '4ms' value coming from? What makes you confident that it'll
> never take longer?

As discussed (see comment 8 and comment 9), this additional delay comes from spawning a new process when Firefox shuts down [1][2].

The analysis from comment 9 investigates the |shutdownDuration| data point, 

> By 'shutdown', do you mean the end of the firefox.exe process, or the time
> when we remove the profile lock from the disk? (the latter is the only thing
> that can really get in the way of the user if attempting to start Firefox
> again)

I mean the time it takes to end the firefox.exe process, as the |shutdownDuration| measures the time it takes to go though the full shutdown procedure [3][4]. However, spawning the 'pingsender' happens during the |profile-before-change-telemetry| phase, which is the last phase before the profile directory gets unlocked.

(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #13)
> > Mike, Ehsan: IMO if this feature gets landed and has a larger-than-expected
> > shutdown regression (  >=5ms ), we back it out.
> 
> FWIW in all honesty I wouldn't rush to a backout if the real regression
> numbers end up being 5ms instead of 4ms since 1ms isn't perceivable by any
> human, but I agree in the general that we should back out if the regressions
> becomes larger than expected.  But given comment 8, I'm hopeful that it
> won't.

I agree with you Ehsan. Moreover, the 4ms estimate comes from the 99 percentile of shutdownDuration gathered from 1 full week of data of the Nightly population. The 50 and 90 percentiles are 1ms and 3 ms respectively.

Additionally, bug 1363345 is taking care of disabling this during OS shutdown, so that we don't risk of delaying it.

[1] - http://searchfox.org/mozilla-central/rev/ae24a3c83d22e0e35aecfd9049c2b463ca7e045b/toolkit/components/telemetry/TelemetrySend.jsm#812
[2] - http://searchfox.org/mozilla-central/rev/ae24a3c83d22e0e35aecfd9049c2b463ca7e045b/toolkit/components/telemetry/TelemetrySend.jsm#1259
[3] - http://searchfox.org/mozilla-central/rev/ae24a3c83d22e0e35aecfd9049c2b463ca7e045b/toolkit/components/startup/nsAppStartup.cpp#386
[4] - http://searchfox.org/mozilla-central/rev/9a7fbdee1d54f99cd548af95b81231d80e5f9ad1/toolkit/xre/nsAppRunner.cpp#4824
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #13)
...
> > Mike, Ehsan: IMO if this feature gets landed and has a larger-than-expected
> > shutdown regression (  >=5ms ), we back it out.
> 
> FWIW in all honesty I wouldn't rush to a backout if the real regression
> numbers end up being 5ms instead of 4ms since 1ms isn't perceivable by any
> human, but I agree in the general that we should back out if the regressions
> becomes larger than expected.  But given comment 8, I'm hopeful that it
> won't.

I can easily see it getting larger on slow hardware and slow disks. I'd be interested in knowing the distribution of SSDs on nightly vs other channels. I opened an issue against the hardware report to track this[1]. 

Ehsan - what is the human-perceived threshold?

In the future we could, for example, disable this feature on hardware we think will be excessively slow.

[1] https://github.com/mozilla/firefox-hardware-report/issues/25
I profiled shutdown on my fast macbook, and it seems telemetry code already spent ~90ms during shutdown: https://perf-html.io/public/6c3c44c91b0a18bfc746809de306763a41b98586/calltree/?implementation=js&range=10.3230_11.7577&search=Telemetry&thread=0

I haven't managed (yet?) to get a shutdown profile on the quantum reference device.
(In reply to Florian Quèze [:florian] [:flo] from comment #16)
> I profiled shutdown on my fast macbook, and it seems telemetry code already
> spent ~90ms during shutdown:
> https://perf-html.io/public/6c3c44c91b0a18bfc746809de306763a41b98586/
> calltree/?implementation=js&range=10.3230_11.7577&search=Telemetry&thread=0

This is the time it takes for the full Telemetry shutdown to complete (gathering final measurements, saving pings to disk, ...), not the time it takes for the change discussed in this bug. Feel free to file another bug and take the discussion there if you think the overall Telemetry shutdown performance needs to be discussed further.

(In reply to Jeff Griffiths (:canuckistani) (:⚡︎) from comment #15)
> In the future we could, for example, disable this feature on hardware we
> think will be excessively slow.
> 
> [1] https://github.com/mozilla/firefox-hardware-report/issues/25

That's a good idea!
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Whoops, I didn't mean to close it now, my mistake. Feel free to re-open if you feel like the conversation hasn't settled yet.
(In reply to Jeff Griffiths (:canuckistani) (:⚡︎) from comment #15)
> (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from
> comment #13)
> ...
> > > Mike, Ehsan: IMO if this feature gets landed and has a larger-than-expected
> > > shutdown regression (  >=5ms ), we back it out.
> > 
> > FWIW in all honesty I wouldn't rush to a backout if the real regression
> > numbers end up being 5ms instead of 4ms since 1ms isn't perceivable by any
> > human, but I agree in the general that we should back out if the regressions
> > becomes larger than expected.  But given comment 8, I'm hopeful that it
> > won't.
> 
> I can easily see it getting larger on slow hardware and slow disks. I'd be
> interested in knowing the distribution of SSDs on nightly vs other channels.
> I opened an issue against the hardware report to track this[1]. 
> 
> Ehsan - what is the human-perceived threshold?
> 
> In the future we could, for example, disable this feature on hardware we
> think will be excessively slow.

Per comment 2, my understanding was that the 4ms being discussed was the average slowdown over the population without the shutdown pings based on telemetry data (see the last part of the study there.)  Also AFAIK the operation here is a process creation, not pure IO (although I admit that I'm not sure how much IO exactly is involved in a CreateProcess call, there may be _some_ IO involved...), so I would expect the overhead distribution to be affected more by things like the version of Windows installed than the type of hardware.  I could be wrong tough.  I wonder if we could tell based on the data we have collected on telemetry?

On the question of the human perceivable part of the shutdown, there are only two important aspects:

 a) How quickly we can close all of your windows and remove our OS dock icon.  The former should be under our control similarly to how we managed to make our tab closing almost instantaneous, but the latter probably depends on the specific OS (some OSes may not allow you to hide your icon before the process actually exits.)

 b) How quickly the process exits in case the user tries to launch Firefox again really quickly, in which case they'll get an error because the profile will be in use.  This will only be an issue when the user does try to relaunch Firefox shortly after closing it, so the likelihood isn't high but the error blocks starting the browser.

As far as I understand the code here runs after (a) so it really only has a user perceivable effect on the OS dock if the OS keeps it alive after the last window goes away and on whether you'd get the error in scenario (b).
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #19)
> (In reply to Jeff Griffiths (:canuckistani) (:⚡︎) from comment #15)
> [...]
>
> > In the future we could, for example, disable this feature on hardware we
> > think will be excessively slow.
> 
> Per comment 2, my understanding was that the 4ms being discussed was the
> average slowdown over the population without the shutdown pings based on
> telemetry data (see the last part of the study there.)  Also AFAIK the
> operation here is a process creation, not pure IO (although I admit that I'm
> not sure how much IO exactly is involved in a CreateProcess call, there may
> be _some_ IO involved...), so I would expect the overhead distribution to be
> affected more by things like the version of Windows installed than the type
> of hardware.  I could be wrong tough.  I wonder if we could tell based on
> the data we have collected on telemetry?

Let me clarify this point: the up to 4ms (worst case) are the additional overhead to the shutdown process due to spawning the pingsender, done with nsIProcess, which boils down to ShellExecuteEx or CreateProcess on Windows.

Even without spawning the pingsender, Telemetry still performs some data collection and I/O at shutdown: that's what Florian saw in comment 16 and what I commented about in comment 18. That part wasn't changed nor there are plans to, at least not in this bug.

With that said, you have a point with CreateProcess not suffering too much on old hardware. In the specific case of the shutdown regression, we only have Nightly data (that's what the analysis is about), which is potentially too skewed toward newest hardware.
I'm planning to repeat my analysis after we hit beta in bug 1365978, there we could have enough data and enough HW variation to clear the CreateProcess doubt, if interested!

> On the question of the human perceivable part of the shutdown, there are
> only two important aspects:
> 
>  a) How quickly we can close all of your windows and remove our OS dock
> icon.  The former should be under our control similarly to how we managed to
> make our tab closing almost instantaneous, but the latter probably depends
> on the specific OS (some OSes may not allow you to hide your icon before the
> process actually exits.)
> 
>  b) How quickly the process exits in case the user tries to launch Firefox
> again really quickly, in which case they'll get an error because the profile
> will be in use.  This will only be an issue when the user does try to
> relaunch Firefox shortly after closing it, so the likelihood isn't high but
> the error blocks starting the browser.
> 
> As far as I understand the code here runs after (a) so it really only has a
> user perceivable effect on the OS dock if the OS keeps it alive after the
> last window goes away and on whether you'd get the error in scenario (b).

You're right!
FWIW it's *super* valuable to have telemetry data specifically about the overhead of CreateProcess, since we need to call that function for doing things like creating a content process on our main thread, and we currently do that on our UI thread.  If you ever got better data about this I would love to know about it.  :-)
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #21)
> FWIW it's *super* valuable to have telemetry data specifically about the
> overhead of CreateProcess, since we need to call that function for doing
> things like creating a content process on our main thread, and we currently
> do that on our UI thread.  If you ever got better data about this I would
> love to know about it.  :-)

Good to know! Even though what we have it's an indirect measurement, it's still better than nothing I guess :-) I'll keep you in the loop for the data from Beta.
Duplicate of this bug: 831104
You need to log in before you can comment on or make changes to this bug.