Closed Bug 844331 Opened 12 years ago Closed 9 years ago

Support recording of Telemetry at shutdown

Categories

(Toolkit :: Telemetry, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla22
Tracking Status
firefox21 --- fixed
firefox22 --- fixed

People

(Reporter: vladan, Unassigned)

References

Details

(Whiteboard: [Telemetry:P1][leave open])

Attachments

(9 files, 3 obsolete files)

3.92 KB, patch
taras.mozilla
: review+
Details | Diff | Splinter Review
7.18 KB, patch
taras.mozilla
: review+
Details | Diff | Splinter Review
2.89 KB, patch
taras.mozilla
: review+
Details | Diff | Splinter Review
4.64 KB, patch
taras.mozilla
: review+
Details | Diff | Splinter Review
5.59 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
6.02 KB, patch
benjamin
: review+
Details | Diff | Splinter Review
3.41 KB, patch
vladan
: review+
Details | Diff | Splinter Review
5.80 KB, patch
vladan
: review+
Details | Diff | Splinter Review
3.29 KB, patch
vladan
: review+
Details | Diff | Splinter Review
Telemetry writes out session data to a file at a very early stage of shutdown: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryPing.js#1083 However, Telemetry users such as FHR and Necko try to record Telemetry for shutdown operations that happen during profile-before-change: HEALTHREPORT_SHUTDOWN_DELAY_MS CACHE_SERVICE_LOCK_WAIT_MAINTHREAD_NSCACHESERVICE_ONPROFILESHUTDOWN NETWORK_DISK_CACHE_DELETEDIR_SHUTDOWN As a result, most of this data is lost and we only receive a handful of reports during weird race conditions (e.g. races between idle-daily & shutdown). We should either move the writing of the session file to a later point in the shutdown process (e.g. as the last operation during profiler-before-change) or tell devs not to record Telemetry at shutdown.
Whiteboard: [Telemetry:P1]
I am indifferent about this one, but the name was going to get a little unwieldy with future commits, so...can drop it on the floor if you like.
Attachment #719613 - Flags: review?(taras.mozilla)
getSimpleMeasurements should really live in TelemetryPing anyway, and moving it, along with the consolidation of the io counters and whatnot, makes it easier to split the construction of the ping later on.
Attachment #719614 - Flags: review?(taras.mozilla)
We'll need these split out functions later. The idea is to gather simpleMeasurements and metadata at quit-application-granted and then gather all the histogram info at xpcom-will-shutdown.
Attachment #719616 - Flags: review?(taras.mozilla)
We need to get at the profile directory at xpcom-will-shutdown, but we no longer have an active profile at that point (profile-before-change has already run). It seemed simplest to simply cache the profile directory in profile-after-change and use that. This needs changes to test code for the xpcshell tests, which don't send the appropriate notifications.
Here's the interesting part of the patch.
Attachment #719618 - Flags: review?(taras.mozilla)
Attachment #719617 - Flags: review?(taras.mozilla)
Attachment #719613 - Flags: review?(taras.mozilla) → review+
Attachment #719614 - Flags: review?(taras.mozilla) → review+
Comment on attachment 719616 [details] [diff] [review] part 3 - split out payload/payload+slug construction to separate functions this seems reasonable. I'll trust you that this works.
Attachment #719616 - Flags: review?(taras.mozilla) → review+
Comment on attachment 719617 [details] [diff] [review] part 4 - cache the profile directory in TelemetryPing this.setup(); + this.cacheProfileDirectory(); r+, but this feels like it should be done in setup(). No strong opinion here.
Attachment #719617 - Flags: review?(taras.mozilla) → review+
Comment on attachment 719618 [details] [diff] [review] part 5 - move saved-session writing to xpcom-will-shutdown to catch cache and FHR activity Thanks for nicely-split patches.
Attachment #719618 - Flags: review?(taras.mozilla) → review+
This needs a test. I'm ok with this landing without a test + doing test in a followup.
(In reply to Taras Glek (:taras) from comment #9) > This needs a test. I'm ok with this landing without a test + doing test in a > followup. Do you have good ideas about how to test this? xpcshell doesn't really do notifications, and we can't really test shutdown in, say, browser-chrome. I guess we could try faking the notifications in xpcshell and hoping they don't crash...? Or test-only notifications that execute the same codepaths? It's also worth pointing out two things: 1) I forgot to add TelemetryPing of an observer of xpcom-will-shutdown. Will fix. 2) On Windows--at least--I don't think we ever get xpcom-will-shutdown. At least on Windows, it looks like we do a shorter shutdown sequence: http://mxr.mozilla.org/mozilla-central/source/widget/windows/nsWindow.cpp#4544 Looks like we should be smarter about monitoring profile-before-change, perhaps? Though it looks like FHR already looks at profile-before-change: http://dxr.mozilla.org/mozilla-central/services/healthreport/healthreporter.jsm#l216 and records telemetry during it: http://dxr.mozilla.org/mozilla-central/services/healthreport/healthreporter.jsm#l333 so it may be a race to see whether we can get information from FHR anyway...
(In reply to Nathan Froyd (:froydnj) from comment #10) > (In reply to Taras Glek (:taras) from comment #9) > > This needs a test. I'm ok with this landing without a test + doing test in a > > followup. > > Do you have good ideas about how to test this? xpcshell doesn't really do > notifications, and we can't really test shutdown in, say, browser-chrome. I > guess we could try faking the notifications in xpcshell and hoping they > don't crash...? Or test-only notifications that execute the same codepaths? I see cookie tests just blithely sending profile-before-change notifications, so maybe we can get away with doing that too. From browsing around, it looks like everybody else is saving stuff at profile-before-change. Maybe we should be saving then too?
(In reply to Nathan Froyd (:froydnj) from comment #11) > From browsing around, it looks like everybody else is saving stuff at > profile-before-change. Maybe we should be saving then too? Telemetry would have to be the *last* observer to handle the profile-before-change event since other components might record in histograms in their own handlers. I don't think there's any way to guarantee this and I'm not sure it's desirable to hack it in. Firefox does get the "xpcom-will-shutdown" event during regular shutdowns, but the nsWindow.cpp code you linked to is the "fast shutdown" that Firefox performs when Windows tell Firefox that the user has requested OS shutdown. What do you think of extending the "fast shutdown" sequence to also call xpcom-will-shutdown?
(In reply to Vladan Djeric (:vladan) from comment #12) > (In reply to Nathan Froyd (:froydnj) from comment #11) > > From browsing around, it looks like everybody else is saving stuff at > > profile-before-change. Maybe we should be saving then too? > > Telemetry would have to be the *last* observer to handle the > profile-before-change event since other components might record in > histograms in their own handlers. I don't think there's any way to guarantee > this and I'm not sure it's desirable to hack it in. Right, that would be ugly. > What do you think of extending the "fast shutdown" sequence to also call > xpcom-will-shutdown? I don't have a problem with it, the question is whether the Android and Windows folks have a problem with it. roc, WDYT about adding xpcom-will-shutdown to the code here: http://mxr.mozilla.org/mozilla-central/source/widget/windows/nsWindow.cpp#4557 http://mxr.mozilla.org/mozilla-central/source/widget/android/nsAppShell.cpp#379 so that we have a place to collect Telemetry data after more-or-less everybody has finished doing things in profile-before-change?
Flags: needinfo?(roc)
Actually, roc, why don't you just review the patch?
Attachment #720074 - Flags: review?(roc)
Clearing needinfo.
Flags: needinfo?(roc)
Attachment #720074 - Flags: review?(roc) → review?(benjamin)
Comment on attachment 720074 [details] [diff] [review] have android and windows shutdown notify on xpcom-will-shutdown I don't think that this is a good name for the action being taken here. As you've described it, this is really "a notification fired right after profile-before-change". Presumably you want this notification to be fired even if we early-exit(), and xpcom-will-shutdown definitely should *not* be fired in this case. For this case you should just create/use a profile-before-change2.
Attachment #720074 - Flags: review?(benjamin) → review-
A semi-blind addition of notifying profile-before-change2 everywhere we notify profile-before-change. That dom/power bit looks *really* ugly...
Attachment #720074 - Attachment is obsolete: true
Attachment #722839 - Flags: review?(benjamin)
This just moves saved-session writing to profile-before-change2. No material changes from the earlier part 5, so carrying over r+.
Attachment #719618 - Attachment is obsolete: true
Attachment #722840 - Flags: review+
...and fix typoes.
Attachment #722839 - Attachment is obsolete: true
Attachment #722839 - Flags: review?(benjamin)
Attachment #722873 - Flags: review?(benjamin)
Attachment #722873 - Flags: review?(benjamin) → review+
Please update https://wiki.mozilla.org/XPCOM_Shutdown with docs on this when it lands.
Blocks: 853499
The xpcshell test harness does send shutdown events, if the test uses the profile: http://mxr.mozilla.org/comm-central/source/mozilla/testing/xpcshell/head.js#909 Do we want to add profile-before-change2 there as well?
Comment on attachment 719613 [details] [diff] [review] part 1 - rename bits with |getCurrentSessionPayload| to |getSessionPayload| [Approval Request Comment] Would like to upload this series to provide better measurement of things that Firefox Health Report and the network cache do. Bug caused by (feature/regressing bug #): None User impact if declined: No user impact. Strictly a measurement improvement patch. Testing completed (on m-c, etc.): On m-c for a week. Risk to taking this patch (and alternatives if risky): Little risk. String or UUID changes made by this patch: None.
Attachment #719613 - Flags: approval-mozilla-aurora?
Comment on attachment 719614 [details] [diff] [review] part 2 - move getSimpleMeasurements to live in TelemetryPing [Approval Request Comment] Would like to upload this series to provide better measurement of things that Firefox Health Report and the network cache do. Bug caused by (feature/regressing bug #): None User impact if declined: No user impact. Strictly a measurement improvement patch. Testing completed (on m-c, etc.): On m-c for a week. Risk to taking this patch (and alternatives if risky): Little risk. String or UUID changes made by this patch: None.
Attachment #719614 - Flags: approval-mozilla-aurora?
Comment on attachment 719616 [details] [diff] [review] part 3 - split out payload/payload+slug construction to separate functions [Approval Request Comment] Would like to upload this series to provide better measurement of things that Firefox Health Report and the network cache do. Bug caused by (feature/regressing bug #): None User impact if declined: No user impact. Strictly a measurement improvement patch. Testing completed (on m-c, etc.): On m-c for a week. Risk to taking this patch (and alternatives if risky): Little risk. String or UUID changes made by this patch: None.
Attachment #719616 - Flags: approval-mozilla-aurora?
Comment on attachment 719617 [details] [diff] [review] part 4 - cache the profile directory in TelemetryPing [Approval Request Comment] Would like to upload this series to provide better measurement of things that Firefox Health Report and the network cache do. Bug caused by (feature/regressing bug #): None User impact if declined: No user impact. Strictly a measurement improvement patch. Testing completed (on m-c, etc.): On m-c for a week. Risk to taking this patch (and alternatives if risky): Little risk. String or UUID changes made by this patch: None.
Attachment #719617 - Flags: approval-mozilla-aurora?
Comment on attachment 722840 [details] [diff] [review] part 6 - move saved-session writing to profile-before-change2 to catch cache and FHR activity [Approval Request Comment] Would like to upload this series to provide better measurement of things that Firefox Health Report and the network cache do. Bug caused by (feature/regressing bug #): None User impact if declined: No user impact. Strictly a measurement improvement patch. Testing completed (on m-c, etc.): On m-c for a week. Risk to taking this patch (and alternatives if risky): Little risk. String or UUID changes made by this patch: None.patch:
Attachment #722840 - Flags: approval-mozilla-aurora?
Comment on attachment 722873 [details] [diff] [review] part 5 - add profile-before-change2 notification [Approval Request Comment] Would like to upload this series to provide better measurement of things that Firefox Health Report and the network cache do. Bug caused by (feature/regressing bug #): None User impact if declined: No user impact. Strictly a measurement improvement patch. Testing completed (on m-c, etc.): On m-c for a week. Risk to taking this patch (and alternatives if risky): Little risk. String or UUID changes made by this patch: None.
Attachment #722873 - Flags: approval-mozilla-aurora?
(In reply to Nathan Froyd (:froydnj) from comment #29) > Comment on attachment 722873 [details] [diff] [review] > part 5 - add profile-before-change2 notification > > [Approval Request Comment] > Would like to upload this series to provide better measurement of things > that Firefox Health Report and the network cache do. > > Bug caused by (feature/regressing bug #): None > User impact if declined: No user impact. Strictly a measurement improvement > patch. > Testing completed (on m-c, etc.): On m-c for a week. > Risk to taking this patch (and alternatives if risky): Little risk. Can you please expand a bit more on the risk here, what could be the worst thing that may happen ? Considering this is a last week of aurora and this seems a huge uplift right before the merge, are we worried about stability ?Also is this strictly being uplifted to help with FHR and how badly do we need it there ?
(In reply to bhavana bajaj [:bajaj] from comment #30) > (In reply to Nathan Froyd (:froydnj) from comment #29) > > Risk to taking this patch (and alternatives if risky): Little risk. > > Can you please expand a bit more on the risk here, what could be the worst > thing that may happen ? The worst thing that could happen is that Telemetry would somehow get shut off, but we'd notice that after about a week or so of the patch landing. So we'd have time to back out/fix. > Considering this is a last week of aurora and this seems a huge uplift right > before the merge, are we worried about stability ?Also is this strictly > being uplifted to help with FHR and how badly do we need it there ? Stability of the browser shouldn't be a problem with these patches. FHR records several pieces of data at browser shutdown, such as how long FHR shutdown takes. Due to bad interactions with Telemetry, we're not getting any visibility on that data. There's also some networking cache telemetry that we don't have any visibility into. There's a much shorter fix that could be landed on Aurora, roughly equivalent to part 5 and part 6 of the above patch series. Would you prefer that instead?
Flags: needinfo?(bbajaj)
(In reply to Nathan Froyd (:froydnj) from comment #31) > (In reply to bhavana bajaj [:bajaj] from comment #30) > > (In reply to Nathan Froyd (:froydnj) from comment #29) > > > Risk to taking this patch (and alternatives if risky): Little risk. > > > > Can you please expand a bit more on the risk here, what could be the worst > > thing that may happen ? > > The worst thing that could happen is that Telemetry would somehow get shut > off, but we'd notice that after about a week or so of the patch landing. So > we'd have time to back out/fix. > > > Considering this is a last week of aurora and this seems a huge uplift right > > before the merge, are we worried about stability ?Also is this strictly > > being uplifted to help with FHR and how badly do we need it there ? > > Stability of the browser shouldn't be a problem with these patches. > > FHR records several pieces of data at browser shutdown, such as how long FHR > shutdown takes. Due to bad interactions with Telemetry, we're not getting > any visibility on that data. There's also some networking cache telemetry > that we don't have any visibility into. > > There's a much shorter fix that could be landed on Aurora, roughly > equivalent to part 5 and part 6 of the above patch series. Would you prefer > that instead? If that workaround solves the purpose here and is even slightly less riskier I think we should go with that option at this time.
Flags: needinfo?(bbajaj)
Here's a much simpler patch for Aurora uplift only. Instead of using all the complicated machinery, we simply move ping saving to profile-before-change2. Which is probably what I should have done once Benjamin said that adding a new notification was OK.
Attachment #730703 - Flags: review?(vdjeric)
Attachment #719613 - Flags: approval-mozilla-aurora?
Attachment #719614 - Flags: approval-mozilla-aurora?
Attachment #719616 - Flags: approval-mozilla-aurora?
Attachment #719617 - Flags: approval-mozilla-aurora?
Attachment #722840 - Flags: approval-mozilla-aurora?
Attachment #730703 - Flags: review?(vdjeric) → review+
Comment on attachment 730703 [details] [diff] [review] part 6 - switch saving pings over to profile-before-change2 Here's a much simpler patch that, in conjunction with part 5, would implement the minimum that we would need for an Aurora uplift. Taking part 5 and part 6 on Aurora would be much more palatable than the full series. See comment 31 especially for why uplifting these would be a good thing. [Approval Request Comment] Bug caused by (feature/regressing bug #): None. User impact if declined: None. Testing completed (on m-c, etc.): Local testing OK. Risk to taking this patch (and alternatives if risky): No patch is risk free; this patch has low risk. String or IDL/UUID changes made by this patch: None.
Attachment #730703 - Flags: approval-mozilla-aurora?
Comment on attachment 730703 [details] [diff] [review] part 6 - switch saving pings over to profile-before-change2 low risk patch, helps FHR get visibility into shutdown data. Thanks for the alternative patches here for aurora, please keep an eye on any regressions this may cause due to less baketime on aurora.
Attachment #730703 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #722873 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
This bug looks like a very likely candidate for breaking Telemetry submissions starting around the 19th of February. This patch backs out the last part of the patch series. I realize the expected thing to do would be to back out the entire patch series. However, this is the only patch that actually changes any behavior; the others are merely rearranging code. I also verified using the debugger that the functions the other patches affected are still working properly and not throwing errors. So this is really the only patch that matters...
Attachment #732016 - Flags: review?(vdjeric)
Attachment #732016 - Flags: review?(vdjeric) → review+
Backed out part 6: https://hg.mozilla.org/integration/mozilla-inbound/rev/322dcd797401 Verified on Windows that backing out that patch is sufficient to restore saving of telemetry pings. It's possible that the patch we uplifted to FF21 (now on beta) is really the right way to go. Will try that next.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [Telemetry:P1] → [Telemetry:P1][leave open]
Comment on attachment 732016 [details] [diff] [review] Back out 80034b418a48 (bug 844331, part 6) on suspicion of breaking Telemetry submission It turns out that the patches from this bug that were landed late in the last release cycle on m-c markedly lowered telemetry submission rates. This patch landed on m-c ~1 week ago and telemetry submission rates have shown a significant uptick. [Approval Request Comment] Bug caused by (feature/regressing bug #): bug 844331 User impact if declined: None Testing completed (on m-c, etc.): On m-c for > 1 week. Has markedly improved telemetry submission rates. Risk to taking this patch (and alternatives if risky): None. String or IDL/UUID changes made by this patch: None
Attachment #732016 - Flags: approval-mozilla-aurora?
Comment on attachment 732016 [details] [diff] [review] Back out 80034b418a48 (bug 844331, part 6) on suspicion of breaking Telemetry submission Always willing to take fixes in support of our metrics when low risk :)
Attachment #732016 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
I filed bug 862599 on what looks to be a regression on Beta caused by this patch.
Depends on: 867401
bug 867401 eliminates the issues with saving pings in profile-before-change2. This patch, therefore, moves the saving of pings to profile-before-change2. I've verified on Windows that we do save pings with these changes.
Attachment #744235 - Flags: review?(vdjeric)
Attachment #744235 - Flags: review?(vdjeric) → review+
Per comment 44 the last outstanding patch here got landed.
Status: REOPENED → RESOLVED
Closed: 12 years ago9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: