If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

aborted-session pings may not get deleted properly

NEW
Unassigned

Status

()

Toolkit
Telemetry
P4
normal
2 years ago
8 months ago

People

(Reporter: gfritzsche, Unassigned)

Tracking

(Depends on: 1 bug, Blocks: 1 bug)

Trunk
Points:
1
Dependency tree / graph

Firefox Tracking Flags

(firefox41 wontfix, firefox42 affected, firefox43 affected)

Details

(Whiteboard: [unifiedTelemetry][measurement:client])

(Vladan Djeric (:vladan) from bug 1191681, comment #21)
> I found a broken session chain on my Nightly e10s profile on my home Windows
> 7 machine.
> 
> https://docs.google.com/a/mozilla.com/spreadsheets/d/
> 13qw14UBQ2VEMk5Pu_OMt92kc_cSOJwmK9LHwpvu-xnk/edit?usp=sharing
> 
> It looks like there's a fork inside a single session's chain:
> 
> 1. at roughly midnight local time, session writes a "daily ping"
> 2. 45 minutes later, session writes an aborted-session ping
> 3. 5 minutes after that, session shuts down and writes a shutdown ping with
> the SAME subsessionID as #2
This looks like the aborted-session ping didn't get deleted although we properly wrote out the shutdown ping.
As a result we picked it up at startup but also sent/persisted the shutdown ping.

I think we need analysis on the scale of this issue, i.e. check server-side:
* for buildids >= 20150722
* how many aborted-session pings & shutdown pings are there with the same subsession id

That could happen for any number of oddities, so lets check that first.
We might want to consider adding this to the Telemetry client health monitoring.
why aren't the rest of the details in the PingReadError logged to the browser console? such as path etc

This is all I saw in my Browser Console:

1440541025908 Toolkit.Telemetry ERROR TelemetryStorage::loadAbortedSessionPing - error removing ping: PingReadError JS Stack trace: PingReadError@TelemetryStorage.jsm:80:15 <TelemetryStorageImpl.loadPingFile<@TelemetryStorage.jsm:1448:13
status-firefox41: --- → wontfix
status-firefox42: --- → affected
Priority: -- → P3
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #2)
> why aren't the rest of the details in the PingReadError logged to the
> browser console? such as path etc

Lets move this to a separate bug.

> This is all I saw in my Browser Console:
> 
> 1440541025908 Toolkit.Telemetry ERROR
> TelemetryStorage::loadAbortedSessionPing - error removing ping:
> PingReadError JS Stack trace: PingReadError@TelemetryStorage.jsm:80:15
> <TelemetryStorageImpl.loadPingFile<@TelemetryStorage.jsm:1448:13

That incorrect logging should be fixed by bug 1191825 FWIW.
(In reply to Georg Fritzsche [:gfritzsche] [away Sep 5 - 13] from comment #1)
> This looks like the aborted-session ping didn't get deleted although we
> properly wrote out the shutdown ping.
> As a result we picked it up at startup but also sent/persisted the shutdown
> ping.
> 
> I think we need analysis on the scale of this issue, i.e. check server-side:
> * for buildids >= 20150722
> * how many aborted-session pings & shutdown pings are there with the same
> subsession id

Alessio, can you see if we can get numbers here? Would be good to know the proportion of this vs. the "ok" aborted sessions.
Flags: needinfo?(alessio.placitelli)
The first 500 entries that I get in Kibana by filtering for the buildId and restricting the pings to "shutdown" and "aborted-session", don't show any duplicated subsessionId.

I'm working on getting a more quantitative measure out of Kibana.
Flags: needinfo?(alessio.placitelli)

Updated

2 years ago
Whiteboard: [unifiedTelemetry] → [unifiedTelemetry][data-validation]
(In reply to Georg Fritzsche [:gfritzsche] [away Sep 5 - 13] from comment #4)
> Alessio, can you see if we can get numbers here? Would be good to know the
> proportion of this vs. the "ok" aborted sessions.

I wrote a short notebook to answer this question [0]. Results:

- 10% samples from Firefox Nightly, July only, have 44 over 7849 aborted-session pings having the same subsessionId of shutdown pings.
- 10% samples from Firefox Nightly, July+August, have 157 over 30117 aborted-session pings having the same subsessionId of shutdown pings.

If my notebook is correct, this problem doesn't seem to have a huge scope. I'm trying to break down the numbers by channel to see if this problem is relevant to Nightly only.

[0] - https://gist.github.com/Dexterp37/c0786d2ba2f782020264

Updated

2 years ago
Points: --- → 1
Priority: P3 → P2
Whiteboard: [unifiedTelemetry][data-validation] → [unifiedTelemetry][data-validation][measurement:client]
Depends on: 1206103
Depends on: 1207473
Per bug 1206103 this seems to be around 0.5% of the aborted-session pings.
Not a major issue, but that will definitely affect analysis if the duplicates are not filtered out.
Potential causes:
(1) any exception etc. in the shutdown paths that would prevent us from doing the aborted-session file remove (see bug 1177958)
(2) file access issues through AV or device/drive issues etc. (unlikely at that scale)
(3) missing synchronization that would resolve in a race of an aborted-session file write vs. the removal
Depends on: 1177958
(In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> (3) missing synchronization that would resolve in a race of an
> aborted-session file write vs. the removal

One possible path here was found ok in bug 1207473.

Updated

2 years ago
Whiteboard: [unifiedTelemetry][data-validation][measurement:client] → [unifiedTelemetry][measurement:client]
Priority: P2 → P3
Depends on: 1237296
No longer depends on: 1237296
Priority: P3 → P4
You need to log in before you can comment on or make changes to this bug.