Closed Bug 1154113 Opened 9 years ago Closed 9 years ago

telemetry/fhrV4 missing substantial number of sessions compared to v2

Categories

(Toolkit :: Telemetry, defect)

Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED

People

(Reporter: bcolloran, Unassigned)

References

Details

(Whiteboard: [rC] [unifiedTelemetry][data-validation])

Attachments

(2 files)

Looking at the latest v4 extract from Mark (see https://bugzilla.mozilla.org/show_bug.cgi?id=1149666 ), the problem that I noticed in the first extract persists even when there has been plenty of time for upload; namely: there are a ton of sessions that show up in v2 that are not present in v4. See the attachment for examples.

I filed this as a bug against the desktop client, but it could obviously result from some kind of problem with the pipeline as well.
Blocks: 1134661
Component: Client: Desktop → Telemetry
Product: Firefox Health Report → Toolkit
Here is how I think we should approach this bug:

1) Start with individual clients. Compare the records saved on the client for unified telemetry and FHR and see whether the client has matching entries. If there are discrepancies, we have to solve this with client QA.

We can start with our own browsers, but may need to move on to the slightly larger group of fx-team. I think it should be possible once bug 1151034 lands to write a browser console script to do this comparison and flag errors.

Georg, I'm not 100% clear on the state of ping retention in the client. At what nightly would ping retention be working properly, and are there bugs other than 1151034 which need to land for retention to be a reliable QA technique? I see bug 1120380 which seems related.

2) Look at the records for these same individual client IDs on the server to see if the server has the same data that the client does. If the server is missing client data, then we have a sending/receiving issue to debug.
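For either step, the core check is a per-client set comparison. A minimal sketch of what I have in mind (illustrative only -- `left` and `right` are hypothetical dicts of clientId -> set of session keys built from whichever two sources we're comparing, e.g. v2 vs. v4 extracts or client vs. server records):

# Sketch, not the actual QA script. left/right are hypothetical dicts
# of clientId -> set(sessionKey) built from the two data sources.
def report_missing_sessions(left, right, label='v4'):
    for client_id, left_keys in left.items():
        missing = left_keys - right.get(client_id, set())
        if missing:
            print('%s: %d sessions missing from %s'
                  % (client_id, len(missing), label))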
Flags: needinfo?(gfritzsche)
That sounds reasonable.

> Georg, I'm not 100% clear on the state of ping retention in the client. At
> what nightly would ping retention be working properly, and are there bugs
> other than 1151034 which need to land for retention to be a reliable QA
> technique? I see bug 1120380 which seems related.

You mean bug 1150134?

We definitely need bug 1137252 for the archived pings and should be good with that.
Bug 1120380 is about some details that don't matter for QA.
For proper scripting we want bug 1150134; that should be enough. That one is WIP on my side right now, not far off.
Flags: needinfo?(gfritzsche)
Ok, taking a moment to check here - the remaining phase 2 landings (bug 1069869) were on April 3.
I don't think looking at data received before that patch stack landed makes much sense.

Given that, we don't seem to be in as bad a shape?
Does it make sense to recheck with a sample set that only includes data from April 4 on? (From the attachment we don't seem to have much sample data after that?)
That's actually because of the delay in FHR v2 deorphaning-- we only have deorphaned v2 data through April 5, and of course we can only look at dates present in both v2 and v4. I think I can expedite getting a new sample of v2 data that is current through April 12. Let me look into it.

Georg, would it make sense to add a filter that only attempts to match sessions that were created under the build id that got that patch stack (and any newer build ids)? If so, what build id would you suggest using as the cutoff?
Flags: needinfo?(gfritzsche)
This was the push that got the phase 2 landings to m-c (Fri Apr 03 00:18:39 2015 +0000):
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=cc3983f3acb5

The next Nightly would be:
* 2015-04-03-03-02-04-mozilla-central/ - 03-Apr-2015 15:40 UTC
  http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2015-04-03-03-02-04-mozilla-central/
  -> build id: 20150403030204
Flags: needinfo?(gfritzsche)
OS: Linux → All
Hardware: x86_64 → All
Version: unspecified → Trunk
(In reply to brendan c from comment #4)
> Georg, would it make sense to add a filter that only attempts to match
> sessions that were created under the build id that got that patch stack
> (and any newer build ids)? If so, what build id would you suggest using as
> the cutoff?

Yes, I think so - there were some bigger issues addressed in the patches that landed there, and I don't think there is much value in investigating issues that happened before.

Per comment 5, the build id for the cutoff is 20150403030204.
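If it helps with scripting, the filter could be as simple as this (sketch only -- I'm assuming the build id is read from ping["application"]["buildId"] here; adjust to wherever your extract stores it):

CUTOFF_BUILD_ID = '20150403030204'

def is_new_build(ping):
    # Build ids are zero-padded datetime strings, so plain string
    # comparison gives the right ordering.
    return ping.get('application', {}).get('buildId', '') >= CUTOFF_BUILD_ID

# 'pings' is a hypothetical list of parsed v4 ping dicts.
new_pings = [p for p in pings if is_new_build(p)]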
We're not out of the woods yet :-(

There is not much data yet for subsessions submitted under build 20150403030204 or greater, but the problem is still there. Attached are a few examples of records with subsessions that are present in v2 but that don't show up in v4 even after build 20150403030204 (look for "true" in the "newBuild" column).
Ok, thanks for checking.

We found bug 1155604 (on startup, evicting all but the 17 most recent pending pings), which might account for a lot of the loss.
We will also have to start on the testing plan outlined above next week.
A short update on things affecting this bug:
* bug 1155604 looks good - we evict far fewer pings now
* bug 1156712 should bring further improvements for ping turn-around times and loss
* bug 1157359 uncovered some substantial issues for chaining, should be fixed in the next nightly (2015-05-06)
* bug 1157408 is pending bug 1157359, should be addressed by that
Depends on: 1156712
Nightly 2015-05-06 has all the fixes from comment 9; we should recheck from that.
ni? for running re-checks for this and related bugs.
Flags: needinfo?(bcolloran)
(fyi and note for myself) As discussed in the meeting this morning, I'll wait until Monday the 10th for some data to accumulate, then look at the session graphs again, filtering on nightly buildIds greater than or equal to 20150507*. I'll loop back and clear the needinfo at that time.
http://nbviewer.ipython.org/gist/bcolloran/1377894409137c8960f8

This is hugely improved, nicely done. However, we're not totally out of the woods-- from the quick initial checking I've done, there are still quite a few clients that have subsessions with repeated profileSubsessionCounter values. Some of this could be the result of identical submissions (looking into that), but sometimes it seems like it may be failures to increment.

The example below also shows that this may be profile branching. So, good progress, more inspection still needed.

[{'previousSubsessionId': None,
  'profileSubsessionCounter': 1,
  'subsessionId': u'a9e3f6c5-706c-4892-91da-ceb182105d54'},
 {'previousSubsessionId': None,
  'profileSubsessionCounter': 1,
  'subsessionId': u'1a3e0c10-4357-4e88-b726-7a5f55b45722'},
 {'previousSubsessionId': None,
  'profileSubsessionCounter': 1,
  'subsessionId': u'65beae09-6747-45bd-beae-6f864f6018ac'},
 {'previousSubsessionId': u'a9e3f6c5-706c-4892-91da-ceb182105d54',
  'profileSubsessionCounter': 2,
  'subsessionId': u'f338373f-1ac4-4526-80a2-767b7889837f'},
 {'previousSubsessionId': u'1a3e0c10-4357-4e88-b726-7a5f55b45722',
  'profileSubsessionCounter': 2,
  'subsessionId': u'0a4c7f83-fb40-4f25-8133-234b1854340d'},
 {'previousSubsessionId': u'f338373f-1ac4-4526-80a2-767b7889837f',
  'profileSubsessionCounter': 3,
  'subsessionId': u'5f4d6fdf-b9a9-4d80-a2c2-2d6bbc06910e'},
 {'previousSubsessionId': u'0a4c7f83-fb40-4f25-8133-234b1854340d',
  'profileSubsessionCounter': 3,
  'subsessionId': u'083158ce-5e32-4b35-9250-728b70dcd4c3'},
 {'previousSubsessionId': u'5f4d6fdf-b9a9-4d80-a2c2-2d6bbc06910e',
  'profileSubsessionCounter': 4,
  'subsessionId': u'8bc8efca-1e7c-4c01-afbe-e406a663438b'},
 {'previousSubsessionId': u'083158ce-5e32-4b35-9250-728b70dcd4c3',
  'profileSubsessionCounter': 4,
  'subsessionId': u'79fe1306-ee60-4fe9-a53d-ed696e1967f8'},
 {'previousSubsessionId': u'8bc8efca-1e7c-4c01-afbe-e406a663438b',
  'profileSubsessionCounter': 5,
  'subsessionId': u'0cd6e454-5ada-4969-9137-a05a8646c1d7'},
 {'previousSubsessionId': u'0cd6e454-5ada-4969-9137-a05a8646c1d7',
  'profileSubsessionCounter': 6,
  'subsessionId': u'939d9244-0b1f-4bb4-b09d-d1fb2299988a'},
 {'previousSubsessionId': u'939d9244-0b1f-4bb4-b09d-d1fb2299988a',
  'profileSubsessionCounter': 7,
  'subsessionId': u'598fc94d-35f7-45b9-b5ab-195799cfd7b9'},
 {'previousSubsessionId': u'598fc94d-35f7-45b9-b5ab-195799cfd7b9',
  'profileSubsessionCounter': 8,
  'subsessionId': u'79f7a373-a981-4df9-8d40-242bd48a0b5e'},
 {'previousSubsessionId': u'79f7a373-a981-4df9-8d40-242bd48a0b5e',
  'profileSubsessionCounter': 9,
  'subsessionId': u'd71a7318-fed6-4e6d-8888-6be9cdb27db3'}]
Flags: needinfo?(bcolloran)
(In reply to brendan c from comment #13)
> http://nbviewer.ipython.org/gist/bcolloran/1377894409137c8960f8
> 
> This is hugely improved, nicely done. However, we're not totally out of the
> woods-- from the quick initial checking I've done, there are still quite a
> few clients that have subsessions with repeated profileSubsessionCounter
> values. Some of this could be the result of identical submissions (looking
> into that),

That would be great to know. We should investigate duplicate submissions too, but that's not as critical now.
E.g. we definitely have some edge cases where we would not be able to tell right now whether a submission already succeeded or not.
Also, if you branch/duplicate a profile with pending submissions, they can be submitted by both copies.

So everything with duplicate ping ids is probably uninteresting.
(In reply to Georg Fritzsche [:gfritzsche] from comment #14)
> So everything with duplicate ping ids is probably uninteresting.
Bug 1142153 will address this - once that is implemented, downstream consumers of the data will not see duplicate documentIds.  It is expected that sometimes the client will submit duplicates for valid / unavoidable reasons (e.g. network dies before it receives a response), so we will direct duplicates to the error stream for monitoring / validation and skip them in the main corpus of data.
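As a rough illustration of that dedup pass (sketch only -- assuming the documentId is the top-level "id" field of each ping):

def split_duplicates(pings):
    # First occurrence of a documentId goes to the main corpus;
    # repeats go to the error stream for monitoring / validation.
    seen = set()
    main, errors = [], []
    for ping in pings:
        doc_id = ping['id']  # assumption: documentId == top-level 'id'
        if doc_id in seen:
            errors.append(ping)
        else:
            seen.add(doc_id)
            main.append(ping)
    return main, errors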
(In reply to brendan c from comment #13)
> The example below also shows that this may be profile branching. So, good
> progress, more inspection still needed.
> 
> [{'previousSubsessionId': None,
>   'profileSubsessionCounter': 1,
[...]

Ok, I don't know what to make of that specific example.
If it is profile branching, can any other data points confirm this?
There's probably not much that we can do about this AFAIK.

I would love to get more details about this and other scenarios.
Unless I missed something, you need to group by "sessionId" first. For example, the pasted data in comment 13 appears to be from three independent sessions:

Session 1
subsession 1, a9e3f6c5-706c-4892-91da-ceb182105d54
 subsession 2, f338373f-1ac4-4526-80a2-767b7889837f
  subsession 3, 5f4d6fdf-b9a9-4d80-a2c2-2d6bbc06910e
   subsession 4, 8bc8efca-1e7c-4c01-afbe-e406a663438b
    subsession 5, 0cd6e454-5ada-4969-9137-a05a8646c1d7
     subsession 6, 939d9244-0b1f-4bb4-b09d-d1fb2299988a
      subsession 7, 598fc94d-35f7-45b9-b5ab-195799cfd7b9
       subsession 8, 79f7a373-a981-4df9-8d40-242bd48a0b5e
        subsession 9, d71a7318-fed6-4e6d-8888-6be9cdb27db3

Session 2
subsession 1, 1a3e0c10-4357-4e88-b726-7a5f55b45722
 subsession 2, 0a4c7f83-fb40-4f25-8133-234b1854340d
  subsession 3, 083158ce-5e32-4b35-9250-728b70dcd4c3
   subsession 4, 79fe1306-ee60-4fe9-a53d-ed696e1967f8

Session 3
subsession 1, 65beae09-6747-45bd-beae-6f864f6018ac


The code to calculate graph degree by subsession *id* looks right, but subsession counters would legitimately be repeated in each separate session, as I understand it.

I think you could use profileSubsessionCounter for this purpose.
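For what it's worth, the grouping above falls out of just following the previousSubsessionId links (sketch, assuming 'records' is the list of dicts pasted in comment 13):

def build_chains(records):
    # Map each subsession id to its successor record(s); multiple roots
    # (previousSubsessionId None) or multiple successors for the same id
    # would both indicate profile branching.
    next_of = {}
    for r in records:
        if r['previousSubsessionId'] is not None:
            next_of.setdefault(r['previousSubsessionId'], []).append(r)
    chains = []
    for root in (r for r in records if r['previousSubsessionId'] is None):
        chain, cur = [root], root
        while next_of.get(cur['subsessionId']):
            # Follow the first successor only; branches need manual review.
            cur = next_of[cur['subsessionId']][0]
            chain.append(cur)
        chains.append(chain)
    return chains

Running that on the comment 13 data yields exactly the three chains shown above.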
Mark, I'm confused... I believe I was using profileSubsessionCounter in all instances (as opposed to subsessionCounter-- doing a quick string search, I don't seem to access subsessionCounter values anywhere in that notebook).

I agree that subsessionCounter should reset in each new session, but I'd thought that profileSubsessionCounter should never reset. So as long as you use profileSubsessionCounter, there should never be duplicate values... do I have this wrong?

In general, my approach in this particular analysis has been to ignore full sessions and focus on subsessions (which should be internally consistent independent of full session consistency). At this level of granularity I think it should not be necessary to group by sessions at all.

What am I missing?


ps. just for reference, here is ['payload']['info'].keys() --

[u'subsessionStartDate',
 u'profileSubsessionCounter',
 u'subsessionId',
 u'reason',
 u'flashVersion',
 u'sessionId',
 u'subsessionCounter',
 u'asyncPluginInit',
 u'previousBuildId',
 u'subsessionLength',
 u'previousSubsessionId',
 u'addons',
 u'sessionStartDate',
 u'timezoneOffset',
 u'revision']
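pps. the duplicate check in the notebook amounts to something like this (sketch; client_pings being the list of "main" pings for a single client id):

from collections import Counter

def repeated_counters(client_pings):
    counts = Counter(p['payload']['info']['profileSubsessionCounter']
                     for p in client_pings)
    # Counter values that occur more than once -- should be empty if
    # profileSubsessionCounter really never resets or repeats.
    return sorted(c for c, n in counts.items() if n > 1)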
I think looking at the subsession chaining in isolation is a fine approach.
A grouping by session id like comment 17, however, allows a better look at the issue - if those group into separate sessions as shown, then we may be failing to persist the profileSubsessionCounter properly between sessions.

Questions from looking at the notebook:
* does the sample data set this runs on only contain pings of type "main"? If not, you'd have to filter out "saved-session" pings
* "idle-daily" pings were removed, you should not need to filter for that anymore?
* the examples of bad graphs are hard to read, but:
  * there are a lot of examples that actually look like ok session chains to me (e.g. the first example) - what's the deal with those?
  * the other common occurrence seems to be having one isolated subsession?
(In reply to Georg Fritzsche [:gfritzsche] from comment #19)
> A grouping by session id like comment 17, however, allows a better look at
> the issue - if those group into separate sessions as shown, then we may be
> failing to persist the profileSubsessionCounter properly between sessions.

Hm, or it is profile branching from a profile created before the current session-state persistence landed.
(In reply to Mark Reid [:mreid] from comment #17)
> Unless I missed something, ....
Here is the problem ^  :)

I could've sworn I read that as using subsessionCounter. Sorry for the noise.
(In reply to Georg Fritzsche [:gfritzsche] from comment #19)
> I think looking at the subsession chaining in isolation is a fine approach.
> A grouping by session id like comment 17, however, allows a better look at
> the issue - if those group into separate sessions as shown, then we may be
> failing to persist the profileSubsessionCounter properly between sessions.

I see, good idea. I'll look at both.

> Questions from looking at the notebook:
> * does the sample data set this runs on only contain pings of type "main"?
> If not, you'd have to filter out "saved-session" pings

Ok, thanks for the pointer. I'll go back and filter out the "saved-session" pings.

> * "idle-daily" pings were removed, you should not need to filter for that
> anymore?

Ok. Maybe I'll add a quick check to verify that none appear in the pings filtered by build (rough sketch at the end of this comment).

> * the examples of bad graphs are hard to read, but:
>   * there are a lot of examples that actually look like ok session chains to
> me (e.g. the first example) - what's the deal with those?
>   * the other common occurrence seems to be having one isolated subsession?

Sorry, I know the graphs are a mess. I just wanted to get something out quickly so I dashed those off, but I have actually been working on clarifying the depiction. I'll reply with an updated notebook as soon as I have something.
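Here's the quick check I have in mind, for the record (sketch only -- assuming the ping type is the top-level "type" field and the subsession reason lives at payload/info/reason):

def assert_no_stale_pings(pings):
    # 'saved-session' pings must be filtered out, and 'idle-daily' pings
    # should no longer exist at all in builds past the cutoff.
    bad = [p for p in pings
           if p.get('type') == 'saved-session'
           or p.get('payload', {}).get('info', {}).get('reason') == 'idle-daily']
    assert not bad, '%d saved-session/idle-daily pings found' % len(bad)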
(In reply to brendan c from comment #22)
> Sorry, I know the graphs are a mess. I just wanted to get something out
> quickly so I dashed those off, but I have actually been working on
> clarifying the depiction. I'll reply with an updated notebook as soon as I
> have something.

Great, thanks.

I think the next best action is for us to identify and group the remaining cases of broken chains so we can either find the underlying issue or explain them (and verify the assumptions where possible).
Your analysis and graphs seem like the best step in that direction.
> > * "idle-daily" pings were removed, you should not need to filter for that
> > anymore?
> 
> Ok. Maybe I'll add a quick check to verify that none appear in the pings
> filtered by build.

I still see some in builds >"20150507000000" -- was this dropped in a later build, or is that a potential bug?

Also, have there been more fixes since 20150507000000? Should I be filtering on a more recent build?
Flags: needinfo?(gfritzsche)
(In reply to brendan c from comment #24)
> > > * "idle-daily" pings were removed, you should not need to filter for that
> > > anymore?
> > 
> > Ok. Maybe I'll add a quick check to verify that none appear in the pings
> > filtered by build.
> 
> I still see some in builds >"20150507000000" -- was this dropped in a later
> build, or is that a potential bug?

That is odd, we definitely don't send that anymore.
Maybe that is some people running their own builds on older revisions; those would show up with current build ids.
 
> Also, have there been more fixes since 20150507000000? Should I be filtering
> on a more recent build?

Going from [0], we don't have any newer fixes that would affect data quality here.

0: https://mnoorenberghe.github.io/bz-dependency-buglist/?list=1120356&resolved=1&meta=0&sortColumn=cf_last_resolved&sortDirection=asc
Flags: needinfo?(gfritzsche)
Brendan, did you have any success running down the scenarios of brokenness or with the graphing here?
Flags: needinfo?(bcolloran)
No, I've been having big problems with Spark for the last week and a half, and I haven't made any progress. Roberto has been helping me, so hopefully we'll be back on track soon.
Flags: needinfo?(bcolloran)
ok, there is some more data here,
http://nbviewer.ipython.org/gist/bcolloran/5558dfbe42d4ef48838f
but this investigation has strayed far from the original v2/v4 comparison, so I'm going to open new bugs for the main points of this notebook.
Whiteboard: [rC] [unifiedTelemetry]
Whiteboard: [rC] [unifiedTelemetry] → [rC] [unifiedTelemetry][data-validation]
No longer blocks: 1154224
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED