Closed Bug 1185123 Opened 10 years ago Closed 9 years ago

Validate shutdown behavior (getting and sending pings)

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: thuelbert, Assigned: sphilp)

References

Details

(Whiteboard: [unifiedTelemetry])

The tigers in the tiger team - Stuart "putting software in a headlock" Philp and Sam "all your datas are belong to me" Penrose. Takes priority over all other testing :) My hope is Sam will turn this very vague and non actionable bug into something awesome.
Sam's going to look at the data side (and/or recruit help), which is really following up on the missing subsessions bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1171268 We're prioritizing this because we think it could be a sign of a client bug. BDS has a theory that we're losing the last subsession (which could be the entire session). This could be due to crashes and/or quick shutdown when logging out of Windows.
Repeating here from https://bugzilla.mozilla.org/show_bug.cgi?id=1171268#c13: > This is the script I'm using to detect and save nonlinear history. It > excludes people who channel-switch. I'm hoping that this dataset will be > useful for somebody to test my theory, which is that there is a client bug > which causes us to lose the last subsession of a session, perhaps related to > the client crashing or the user logging out of the computer while Firefox is > running. > > Useful questions: > * is it predominantly the last subsession of a session that is missing? > * are there any crash pings from these clients? > * Is this solely or primarily on Windows? > > I also recommend manual QA verification: > > * try crashing Firefox, check the pings > * try logging out of Windows with Firefox running, check the pings
Some initial notes: Regular cases: on trigger, on profile change, normal/safe shutdown Exceptional cases: browser crash, logging out with browser open, force quit. I imagine a full OS crash is out of scope ;) On nightly OSX Yosemite: Pings seem to be sent for all regular cases. Exceptional cases are somewhat okay, but I can't get pings to trigger on crashes (using a snippet in the browser console to force a crash). There also occasionally seems to be a quite a long delay between browser shutdown and having shutdown pings sent. I’ve seen as much as a minute later (I assume these are shutdown pings otherwise what else would be sent at that point?). That could be a case where the browser would be shutdown normally, followed by the computer being put to sleep (or a similar normal user shutdown pattern), but the data is lost as the pings aren’t sent immediately/before the computer goes to sleep. On nightly Windows 8.1: Pings are sent in most regular cases, except I'm not seeing any traffic at all on shutdown. Exceptional cases don't seem to send either, although this may be related to the general lacking shutdown ping issue. Safe, crashed, killed, nothing seems to send. I'll do some testing on Ubuntu tomorrow and try to clarify the above issues with some more details.
Vladan, it feels to me like Stuart may have confirmed the concerns from :kparlante and :bsmedberg in Comment 1 (see quoted bit below -- thanks Stuart!). Since they are not here and this issue is a potential blocker for UT release, can you: 1) Let us know if Stuart has found enough to assign an engineer to a high priority bug. 2) If so, handle the assignment. 3) If this is not the right approach, let me know what is the right approach? Thanks, Sam (In reply to Stuart Philp :sphilp from comment #3) ... > On nightly OSX Yosemite: > ... There also occasionally seems to be a > ... but the data is lost as the pings aren’t sent immediately/before > the computer goes to sleep. > > On nightly Windows 8.1: > Pings are sent in most regular cases, except I'm not seeing any traffic at > all on shutdown. Exceptional cases don't seem to send either, although this > may be related to the general lacking shutdown ping issue. Safe, crashed, > killed, nothing seems to send.
Flags: needinfo?(vdjeric)
Thanks for the *awesome* work Stuart! Just one note: "shutdown" pings are not supposed to be sent when Firefox shuts down. Instead, they are saved to disk as pending pings [1] and get sent as soon as Firefox is run again and Telemetry initializes (which happens 1 minute after Firefox starts, so we don't cause jank). I'm not quite sure about the crash pings, but I think they work in a similar way. In general we try to send pings as soon as possible, if the network scheduler allows us (we're capped at 10 pings per minute). The exception to this rule are the shutdown and the saved-session pings (and possibly the crash ping?). [1] - http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetrySession.jsm#1629
Thanks Alessio -- do you think we do not have a problem, or is it too early to tell? (In reply to Alessio Placitelli [:Dexter] from comment #5) > Thanks for the *awesome* work Stuart! > > Just one note: "shutdown" pings are not supposed to be sent when Firefox > shuts down. Instead, they are saved to disk as pending pings [1] and get > sent as soon as Firefox is run again and Telemetry initializes (which > happens 1 minute after Firefox starts, so we don't cause jank). > > I'm not quite sure about the crash pings, but I think they work in a similar > way. > > In general we try to send pings as soon as possible, if the network > scheduler allows us (we're capped at 10 pings per minute). The exception to > this rule are the shutdown and the saved-session pings (and possibly the > crash ping?). > > [1] - > http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/ > TelemetrySession.jsm#1629
Thanks Alessio! I'll look at verifying that the pings are indeed sent on the following start up. So it sounds like there is a period where the pings are not sent for users who either don't use Firefox all that often, or use it but just happen to be away from their computer for a while, and so their data is not counted temporarily until each user starts Firefox again. Assuming everything is sent correctly on start up, might that be enough to account for the discrepancy here?
Confirmed on OSX and Windows 8.1 that shutdown, saved-session, and crash pings are sent on start up correctly. Thanks to Alessio for the help :) Do we have any other ideas what might be causing this?
I think this bug may have gone off track? The motivation behind this bug is to understand why Telemetry pings sent to Telemetry servers don't always form an uninterrupted chain (a singly-linked list with no forks). Each ping should point to the previous ping sent from the same profile. Benjamin's hypothesis is that we are losing the final ping of each Firefox session to abnormal Firefox shutdowns. This is the main thing we need to test in this bug. In order to verify Telemetry shutdown behaviour for this bug, it is necessary to check the chaining across v4 pings, it's not enough to check the number of pings sent. There are chains formed between pings by ping.payload.info.previousSessionId and ping.payload.info.previousSubssesionId, as well as counters in ping.payload.info.profileSubsessionCounter and ping.payload.subsessionCounter. Did you check these during your testing? Test scenarios: You need to verify chaining *during* sessions (with multiple subsessions) and the chaining *across* sessions after an improper shutdown on Windows & Mac. a) session chaining across multiple sessions with environment-change, midnight, and shutdown events b) session chaining across multiple sessions with computer sleeps & hibernates with Firefox running c) session chaining across multiple very brief sessions (less than 1 minute + slightly more than 1 minute) with clean shutdowns d) session chaining when network is unavailable for sending ad-hoc pings (midnight boundary or environment-chang), Telemetry should persist failed pings to disk and use exponential backoff to retry sending e) session chaining in the face of unclean shutdowns during sessions of varying lengths (< 1 minute, ~1 minute, > 1 minute): the unclean shutdowns can be caused by 1) browser crash, 2) force quit, 3) logging out of Windows with browser open f) also, session chaining when Firefox hangs at shutdown and the hang is resolved by 1) Firefox terminating itself on its own if shutdown lasts longer than 1 minute, 2) the user starting a second instance of Firefox with the same profile (using firefox.exe -no-remote) and then choosing to end the the concurrent "hung" Firefox process from the startup dialog, 3) user logging out of Windows Some background on expected Firefox behaviour that you'll need to take into account: Telemetry starts a new subsession whenever 1) Firefox is started, 2) whenever the environment changes while Firefox is running (addons, prefs, etc -- but subject to a changes-per-minute maximum) and 3) at midnight local time if Firefox is still running (there are also gotchas around the PC hibernating during midnight). Telemetry will immediately send a ping containing the current subsession's data whenever a new subsession is started (subject to midnight "fuzzing" and max-send-rate rules). If sending fails, the ping is persisted to saved-telemetry-pings/ to prevent losing unsent data in the case of a crash. At shutdown, Telemetry will persist the final subsession to saved-telemetry-pings/ to be sent during a future Firefox session (it's not a good idea to block Firefox shutdown on Telemetry sending). A minute after Firefox startup, Telemetry initializes itself and then attempts to send the unsent pings from the saved-telemetry-pings/ directory in batches of 10. If there are too many pings on disk, Telemetry may delete the pings without sending them on account of disk quota or expiry rules. Telemetry will also periodically persist the *current* (unfinished) subsession to saved-telemetry-pings/ in a file named "aborted-session-ping" so that no data is lost in the event of a crash (check with Alessio for timing of these backups). If Firefox shuts down cleanly, the aborted-session-ping file is deleted at the end of shutdown. If Firefox doesn't shut down properly, Firefox will send the leftover aborted-session-ping during the next Firefox session. There is also something called a "crash ping" but it's a way to report Firefox crash metadata, and is not in scope of this bug.
Flags: needinfo?(vdjeric)
Alessio: You might want to fsync session-state.json when you writeAtomic for extra safety (and small risk of jank). If you decide to fsync, please talk to bcolloran about setting up an experiment to measure the benefit of the fsync. Also measure the fsync cost with a Telemetry histogram.
Flags: needinfo?(alessio.placitelli)
(In reply to Sam Penrose from comment #6) > Thanks Alessio -- do you think we do not have a problem, or is it too early > to tell? From those preliminary results from Stuart, it doesn't look like shutdown pings for normal shutdowns are the issue. We manually tested those situations (clean shutdown on Windows and Ubuntu) and I'm not really expecting big surprises there, but having a new pair of eyes confirming that is great. That said, we did not test Benjamin's theory, and I agree we could have a problem lurking there :( Stuart correctly identified those test cases as "Exceptional Cases". We've put up some notes on how to verify session chaining here: https://docs.google.com/document/d/10sZICCbsfcSTF3RPyeVDskSI9-I2E4iApmShmIWSLfg/edit?usp=sharing (Subsessions chaining/validity section). The cases Vladan suggested cover pretty much all that needs to be covered (and thanks for expanding on the background :-) ). I understand that the (d) point is being checked in bug 1182249. Notes about the aborted session ping: it's saved every 5 minutes (starting when Telemetry is initialized) or when daily/environment changed pings are generated (so we don't gather the same data twice). @Vladan: I'll check the fsync thing, thanks for pointing me to that (leaving the ni? as a reminder). As always, feel free to contact me for any doubt or issue :-)
Thanks Vladan for clarifying! Very thorough. I'm retesting and checking the logs to look at the sessionid's and counters to make sure these are correct, here's what I have so far: Following a browser crash an aborted-session ping did send, the ID's look correct for the previous session. No other pings were sent. Is a shutdown ping supposed to be sent here? On resume from sleep with Firefox left open overnight, a daily ping was sent with the correct IDs for the previous session, no other pings were sent. On killing from task manager: the subsequent session does not appear to send any pings related to the previous session (aborted, shutdown). Following a restart with Firefox left open: aborted-session, shutdown and saved-session are all sent with the correct IDs for the aborted and new session. subsessionCounter is always 1, I assume this is supposed to increment?... profileSubsessionCounter is incrementing alongside the subsessionid changes, I assume that is correct behaviour (increment for each new subsession on the profile) When sending data saved on disk the subsessions appear to be sent out of order, if that matters. 4 pings were sent for 2 sessions, they were saved-sessions followed by shutdown pings, in the order of sessions 10-10-9-9 I'll continue testing for c and f
(In reply to Stuart Philp :sphilp from comment #12) > > Following a browser crash an aborted-session ping did send, the ID's look > correct for the previous session. No other pings were sent. Is a shutdown > ping supposed to be sent here? No, the shutdown ping is not supposed to be there. Are the subsessionCounter and the profileSubsessionCounter coherent? > On resume from sleep with Firefox left open overnight, a daily ping was sent > with the correct IDs for the previous session, no other pings were sent. Correct :) Did the subsessionCounter correctly get increased? > On killing from task manager: the subsequent session does not appear to send > any pings related to the previous session (aborted, shutdown). Mh, that's unexpected. An aborted session ping should get sent. Can you expand a bit on this test? > Following a restart with Firefox left open: aborted-session, shutdown and > saved-session are all sent with the correct IDs for the aborted and new > session. Mh, that's interesting. We're getting all the 3 pings? > subsessionCounter is always 1, I assume this is supposed to increment?... Is the subsessionCounter always 1 just for the last test, or for all the tests? > profileSubsessionCounter is incrementing alongside the subsessionid changes, > I assume that is correct behaviour (increment for each new subsession on the > profile) Yes, that's the correct behaviour. > When sending data saved on disk the subsessions appear to be sent out of > order, if that matters. 4 pings were sent for 2 sessions, they were > saved-sessions followed by shutdown pings, in the order of sessions 10-10-9-9 Yeah, it should send newest pings first.
(In reply to Alessio Placitelli [:Dexter] from comment #13) > (In reply to Stuart Philp :sphilp from comment #12) > > > > Following a browser crash an aborted-session ping did send, the ID's look > > correct for the previous session. No other pings were sent. Is a shutdown > > ping supposed to be sent here? > > No, the shutdown ping is not supposed to be there. Are the subsessionCounter > and the profileSubsessionCounter coherent? > subsessionCounter remains 1, but the profileSubsessionCounter increased by 1 > > On resume from sleep with Firefox left open overnight, a daily ping was sent > > with the correct IDs for the previous session, no other pings were sent. > > Correct :) Did the subsessionCounter correctly get increased? > The subsessionCounter is still 1. The profileSubsessionCounter is now 11, up from 10. > > On killing from task manager: the subsequent session does not appear to send > > any pings related to the previous session (aborted, shutdown). > > Mh, that's unexpected. An aborted session ping should get sent. Can you > expand a bit on this test? > Here's what I did: - Launch Firefox Nightly on Windows 8.1, browse a bit, kill nightly from Task Manager. - Launch nightly again and wait a few minutes to let telemetry start up, no pings are sent at all. > > Following a restart with Firefox left open: aborted-session, shutdown and > > saved-session are all sent with the correct IDs for the aborted and new > > session. > > Mh, that's interesting. We're getting all the 3 pings? > In the aborted-session ping for this test the subsessionCounter increased to 2, this is the only time I've seen this. Note the saved-session and shutdown pings are a new session, the aborted-session comes first from the session that was active before the OS restart: "subsessionId": "7d8e9ab0-03c5-421d-b3d8-63585c2feb36", "reason": "aborted-session", "previousSubsessionId": "c9699c64-8e03-4093-a2bc-6b183b94fc28", "sessionId": "b0b28caa-14f0-42f1-8109-b1ac8116756f", "subsessionId": "4fc81137-68b4-477a-9558-39f28e2c639c", "reason": "saved-session", "previousSubsessionId": "7d8e9ab0-03c5-421d-b3d8-63585c2feb36", "sessionId": "ac95b112-de86-44d2-aef9-116115a36998", "subsessionId": "4fc81137-68b4-477a-9558-39f28e2c639c", "reason": "shutdown", "previousSubsessionId": "7d8e9ab0-03c5-421d-b3d8-63585c2feb36", "sessionId": "ac95b112-de86-44d2-aef9-116115a36998", What's weird is the shutdown ping is from a new session, but there was no shutdown between the restart and this ping coming in after the OS came back up and I started nightly again. I might not be understanding it, but seems odd too me. > > subsessionCounter is always 1, I assume this is supposed to increment?... > > Is the subsessionCounter always 1 just for the last test, or for all the > tests? > Always 1 except the above note. When is this supposed to increment?
(In reply to Stuart Philp :sphilp from comment #14) > > Is the subsessionCounter always 1 just for the last test, or for all the > > tests? > > > Always 1 except the above note. When is this supposed to increment? It increments whenever we split subsessions, so: - At midnight (when daily pings are generated). - On environment-changes. Consider this scenario and the expected values for the subsessionCounter - Start Firefox - Init Telemetry - Environment-Change ping genrated (subsessionCounter = 1) - Shutdown Firefox (shutdown ping with subsessionCounter = 2)
(In reply to Stuart Philp :sphilp from comment #14) > > Mh, that's unexpected. An aborted session ping should get sent. Can you > > expand a bit on this test? > > > Here's what I did: > > - Launch Firefox Nightly on Windows 8.1, browse a bit, kill nightly from > Task Manager. > - Launch nightly again and wait a few minutes to let telemetry start up, no > pings are sent at all. So, if you kill Firefox Nightly before Telemetry inits, no ping should be generated. The first aborted-session ping is generated 1 minute after Firefox is started (when Telemetry inits). Moreover, when Firefox is restarted, the newly found aborted-session ping will be added to the pending ping list but not immediately sent: it will be sent with other pings or on the next Firefox start. > > > Following a restart with Firefox left open: aborted-session, shutdown and > > > saved-session are all sent with the correct IDs for the aborted and new > > > session. > > > > Mh, that's interesting. We're getting all the 3 pings? > > > > In the aborted-session ping for this test the subsessionCounter increased to > 2, this is the only time I've seen this. > > Note the saved-session and shutdown pings are a new session, the > aborted-session comes first from the session that was active before the OS > restart: > > "subsessionId": "7d8e9ab0-03c5-421d-b3d8-63585c2feb36", > "reason": "aborted-session", > "previousSubsessionId": "c9699c64-8e03-4093-a2bc-6b183b94fc28", > "sessionId": "b0b28caa-14f0-42f1-8109-b1ac8116756f", > > "subsessionId": "4fc81137-68b4-477a-9558-39f28e2c639c", > "reason": "saved-session", > "previousSubsessionId": "7d8e9ab0-03c5-421d-b3d8-63585c2feb36", > "sessionId": "ac95b112-de86-44d2-aef9-116115a36998", > > "subsessionId": "4fc81137-68b4-477a-9558-39f28e2c639c", > "reason": "shutdown", > "previousSubsessionId": "7d8e9ab0-03c5-421d-b3d8-63585c2feb36", > "sessionId": "ac95b112-de86-44d2-aef9-116115a36998", > > What's weird is the shutdown ping is from a new session, but there was no > shutdown between the restart and this ping coming in after the OS came back > up and I started nightly again. I might not be understanding it, but seems > odd too me. >
I run an analysis [1] to compare the ratio of chain-breaks due to missing ending fragments to the one due to missing starting fragments. The rationale of the comparison being that if the original hypothesis is correct then we would expect the former to be higher than the latter. The ratio is based on the total number of breaks between consecutive sessions. If there is break in the fragment chain between non-consecutive sessions then we can't really say much about it. It seems that there is no real difference between the ratios. In fact there are slightly more starting fragments missing than ending ones but that might be due to the fact that some sessions haven't ended yet. [1] http://nbviewer.ipython.org/gist/vitillo/97d5e0b09b5c144e0268
FYI for any future readers, status of missing pings hypotheses: https://etherpad.mozilla.org/u230yVoP9S
Are there open questions or issues here or does this seem to be cleared up?
Filed bug 1186877 to track the fsync suggestion.
Flags: needinfo?(alessio.placitelli)
(In reply to Georg Fritzsche [:gfritzsche] [away until july 22] from comment #19) > Are there open questions or issues here or does this seem to be cleared up? Roberto's server-side analysis seems to have invalidated the shutdown hypothesis Stuart: did you finish testing all the different shutdown behaviours?
Missing fragments seem to be strongly correlated with crashes [1]. Clients with missing fragments are about two times as likely to have experienced a crash. I used crash pings and the STARTUP_CRASH_DETECTED histogram to check if a previous session has crashed. Some notable things I found are: 1. Crash pings are difficult to join with a client fragment history as the info section is missing; I used creation timestamps for now. 2 There are some edges belonging to adjacent sessions (previous fragment’s sessionId == following fragment’s previousSessionid) that connect an end of session fragment (with reason aborted-session or shutdown) to a start of session fragment (subsessionCounter == 1) that have a skipped payload/info/profileSubsessionCounter (previous fragment’s profileSubsessionCounter == following fragment’s profileSubsessionCounter + 2), any thoughts how that is possible? This seems to happen only when the previous fragment reason is set to aborted-session. 3. The most common scenario of missing fragments is when entire sessions are missing. That’s likely due to the fact that most sessions have a single fragment. In that scenario, missing fragment edges are about 5 times as likely to be associated with a crash compared to the average likelihood. [1] http://nbviewer.ipython.org/gist/vitillo/9ed3e8da9c9888a188ce
Flags: needinfo?(gfritzsche)
Good finds here. The next steps we found here: > 1. Crash pings are difficult to join with a client fragment history as the > info section is missing; I used creation timestamps for now. * put sessionId in crash annotation, submit with crash ping: bug 1187270 > 2 There are some edges belonging to adjacent sessions (previous fragment’s > sessionId == following fragment’s previousSessionid) that connect an end of > session fragment (with reason aborted-session or shutdown) to a start of > session fragment (subsessionCounter == 1) that have a skipped > payload/info/profileSubsessionCounter (previous fragment’s > profileSubsessionCounter == following fragment’s profileSubsessionCounter + > 2), any thoughts how that is possible? This seems to happen only when the > previous fragment reason is set to aborted-session. * always duplicate new main pings as aborted-session - currently aborted-session may lag behind > 3. The most common scenario of missing fragments is when entire sessions are > missing. That’s likely due to the fact that most sessions have a single > fragment. In that scenario, missing fragment edges are about 5 times as > likely to be associated with a crash compared to the average likelihood. While i'm not sure how to completely address this, we can run things down more by: * introducing something like STARTUP_CRASH_DETECTED to detect crashes before the delayed Telemetry init (other crashes should be covered by aborted-session pings), to check whether that is the issue * track down other places we discard pings and verify we have probes covering this (or add them): * discard for ping size * discard for server error codes * failure to load pings from disk
Flags: needinfo?(gfritzsche)
Per the notebook: > < Digression> What's the percentage of clients that have at least one pair of fragments with different documentIds but the same profileSubsessionCounter? > ... > ~3% ... i think that ties into bug 1171265, where we try to run down and understand the profileSubsessionCounter issues. See bug 1171265, comment 13 ff.
(In reply to Georg Fritzsche [:gfritzsche] [away until july 22] from comment #23) > > 2 There are some edges belonging to adjacent sessions (previous fragment’s > > sessionId == following fragment’s previousSessionid) that connect an end of > > session fragment (with reason aborted-session or shutdown) to a start of > > session fragment (subsessionCounter == 1) that have a skipped > > payload/info/profileSubsessionCounter (previous fragment’s > > profileSubsessionCounter == following fragment’s profileSubsessionCounter + > > 2), any thoughts how that is possible? This seems to happen only when the > > previous fragment reason is set to aborted-session. > > * always duplicate new main pings as aborted-session - currently > aborted-session may lag behind Bug 1187339. > > 3. The most common scenario of missing fragments is when entire sessions are > > missing. That’s likely due to the fact that most sessions have a single > > fragment. In that scenario, missing fragment edges are about 5 times as > > likely to be associated with a crash compared to the average likelihood. > > While i'm not sure how to completely address this, we can run things down > more by: > * introducing something like STARTUP_CRASH_DETECTED to detect crashes before > the delayed Telemetry init (other crashes should be covered by > aborted-session pings), to check whether that is the issue Thinking about this again, this can't be the issue. If we did not get to delayed Telemetry init, then we also did not update the session-state on disk yet, so we would never see that session (nor see profileSubsessionCounter increase), so it can't show up as missing. Unless we can tell something else from that now i think we should use the crash pings to identify such scenarios (and fix them if needed). > * track down other places we discard pings and verify we have probes > covering this (or add them): > * discard for ping size Bug 1186955. > * discard for server error codes Bug 1186492. > * failure to load pings from disk Bug 1187340.
Bug 1168835 is on Nightly >=20150722 and had some days to accumulate data. Roberto, can we recheck how this affected things here?
Flags: needinfo?(rvitillo)
Iteration: --- → 42.3 - Aug 10
Whiteboard: [40b7] [unifiedTelemetry][data-validation] → [unifiedTelemetry][data-validation]
There are less missing fragments between adjacent sessions but the overall picture didn't change. See http://nbviewer.ipython.org/gist/vitillo/3047c0d896b08f75c403
Flags: needinfo?(rvitillo)
I matched the missing fragments, belonging to sessions with a single fragment, to the saved_session pings. Unfortunately I could find a match only for about 5% of those missing sessions and I haven't seen any particularly strong pattern. It's hard to say anything with such a small sample size, plus the missing sessions might be the ones that are truly interesting.
No longer depends on: 1187270
Whiteboard: [unifiedTelemetry][data-validation] → [unifiedTelemetry]
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard
You need to log in before you can comment on or make changes to this bug.