Closed Bug 1344774 Opened 8 years ago Closed 5 years ago

Reported Overlapping Subsessions in Longitudinal

Categories

(Data Platform and Tools Graveyard :: Datasets: Longitudinal, enhancement, P3)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: cameres, Unassigned, Mentored, NeedInfo)

Details

The following notebook shows that after deduplication (which removed the majority of issues) there are still long subsessions that appear to overlap. In the last cell of my notebook, one of the subsession pairs in the output is the following.

NOTE: I am using the ordering of the arrays in longitudinal to determine prev_subsession_id. I note this because there is also a field in longitudinal named 'prev_subsession_id'.

((prev_subsession_id, cur_subsession_id), ((prev_subsession_date, prev_subsession_length), (cur_subsession_date, cur_subsession_length)))

((u'7f429e0d-7372-45ad-bdb8-5584499e46a1', u'675db484-0a61-4c71-9851-4a34e46098d1'), ((u'2016-11-07T00:00:00.000+09:00', 279104), (u'2016-11-08T00:00:00.000+09:00', 11239)))

These are subsessions for a particular profile where the first subsession has a length of approximately 77 hours (279104 seconds). The second subsession starts on the day after the first subsession starts, which, in a conversation with @spenrose, we decided should not exist. Although subsession lengths are known to be funky, they should not overlap.

https://gist.github.com/cameres/f9383c0c9813e63f9cc4b1b09de6613c
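The overlap claim for the pair above can be checked with a short sketch. This is not the code from the notebook; it assumes Python 3.7+, that subsession lengths are in seconds, and that each start date is the local day on which the subsession actually began. The function name `definitely_overlaps` is hypothetical.

```python
from datetime import datetime, timedelta

ONE_DAY = timedelta(days=1)


def definitely_overlaps(prev, cur):
    """Return True if prev must still be running when cur starts.

    prev and cur are (start_date, length_seconds) pairs. Start dates are
    truncated to the day, so the real start lies somewhere in that day.
    Assumption: lengths are in seconds and are accurate.
    """
    prev_day = datetime.fromisoformat(prev[0])
    cur_day = datetime.fromisoformat(cur[0])
    # Earliest moment prev could have ended: it started at midnight.
    earliest_prev_end = prev_day + timedelta(seconds=prev[1])
    # Latest moment cur could have started: the end of its start day.
    latest_cur_start = cur_day + ONE_DAY
    return earliest_prev_end > latest_cur_start


prev = ("2016-11-07T00:00:00.000+09:00", 279104)
cur = ("2016-11-08T00:00:00.000+09:00", 11239)
print(definitely_overlaps(prev, cur))  # True
```

The check is deliberately conservative: it only flags a pair when the overlap holds no matter where within their start days the two subsessions began.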
I can think of several fronts on which to advance this issue:
1) Root cause analysis for the client side. Are the pings chained via subsessionId? If not, do all fields that should be machine-invariant validate on these pings, or is it possible that we have found copied profiles?
2) Are one or both of the pings garbage which should not be included in analyses? How about the client record as a whole?
3) Building an infrastructure to prevent data like this from getting into longitudinal and other derived datasets. I am working on that.
4) Cleaning up longitudinal in the short term.
Georg and Roberto, what do you think?
Flags: needinfo?(rvitillo)
Flags: needinfo?(gfritzsche)
How often does this happen?
Flags: needinfo?(rvitillo)
Adding needinfo for Connor based on Roberto's comment.
Flags: needinfo?(cameres)
I have the counts of how frequently this occurs in the notebook that I attached. I'm not sure if that answers Roberto's question.
Flags: needinfo?(cameres)
Connor, please submit a report to mozilla-reports with clearly stated conclusions and confidence intervals. It would be useful to answer the following questions:
- what's the percentage of profiles that have at least one overlapping session?
- what's the distribution of the percentage of overlapping sessions per profile?
- what's the distribution of the overlap duration?
- what's the distribution of subsession durations > 24h and how does it relate to the studied phenomena?
- how do the above distributions vary per channel?
- how do the above distributions vary when considering only subsessions originating from recent Firefox builds?
- how do the above distributions vary when using docid or profileSubsessionCounter to dedupe sessions?
Flags: needinfo?(cameres)
Great questions! Answers are in the works. The issue I currently see with answering the third question, and any other question that requires measuring the duration of the overlap, is that I have only been able to compute the overlap in days, because of the granularity of subsession start dates. For example, for a ping that overlaps a previous ping, the duration of the overlap depends on when in the day that ping starts, which the start date does not record.
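One way to work around the day granularity is to report bounds on the overlap rather than a single value. The sketch below is only an illustration under stated assumptions: lengths are in seconds, each subsession really started on its reported day, and the current subsession did not start before the previous one. The function name `overlap_bounds` is hypothetical and requires Python 3.7+.

```python
from datetime import datetime

DAY = 86400  # seconds; start dates are only known to the day


def overlap_bounds(prev, cur):
    """Return (min_overlap, max_overlap) in seconds for two subsessions.

    prev and cur are (start_date, length_seconds) pairs. The bounds are
    limits over all start times within each reported day, assuming cur
    starts no earlier than prev.
    """
    prev_day = datetime.fromisoformat(prev[0])
    cur_day = datetime.fromisoformat(cur[0])
    prev_len, cur_len = prev[1], cur[1]
    gap = int((cur_day - prev_day).total_seconds())

    def overlap(delta):
        # delta is (cur start - prev start) in seconds, with delta >= 0.
        # The overlap shrinks as delta grows and is capped by cur's length.
        return max(0, min(prev_len - delta, cur_len))

    smallest_delta = max(0, gap - DAY)  # prev started late, cur started early
    largest_delta = gap + DAY           # prev started early, cur started late
    return overlap(largest_delta), overlap(smallest_delta)


prev = ("2016-11-07T00:00:00.000+09:00", 279104)
cur = ("2016-11-08T00:00:00.000+09:00", 11239)
print(overlap_bounds(prev, cur))  # (11239, 11239)
```

For the pair from the original report both bounds equal 11239 seconds, so under these assumptions the second subsession lies entirely inside the first and the unknown start time does not matter. For other pairs the two bounds can differ by up to two days.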
Adding a priority to get it out of triage, and assigning to Connor as he is on it. Please add points when you get a chance.
Assignee: nobody → cameres
Priority: -- → P1
Flags: needinfo?(gfritzsche)
Component: Metrics: Pipeline → Datasets: Longitudinal
Product: Cloud Services → Data Platform and Tools
Hey cameres - If you're still working on this do you mind adding points? If not, please update the priority accordingly.
Last week was Connor's last week. Removing him as assignee and dropping the priority for now.
Assignee: cameres → nobody
Points: --- → 2
Priority: P1 → P3

Longitudinal has been decommissioned per Bug 1572033.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Product: Data Platform and Tools → Data Platform and Tools Graveyard