Closed Bug 1402134 Opened 7 years ago Closed 7 years ago

probe: places_bookmarks_count not populating correctly

Categories

(Data Platform and Tools Graveyard :: Datasets: Longitudinal, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: shong, Assigned: relud)

References

(Depends on 1 open bug)

Details

The places_bookmarks_count doesn't seem to be populating correctly in longitudinal dataset. I don't think this is happening at the dervied dataset level because I'm seeing it in main summary as well. Description: * The places_bookmarks_count field doesn't populate (is null) in a large number of client's pings. Sometimes a ping includes this information, sometimes it doesn't. For example, this query: ``` q = """ SELECT client_id, places_bookmarks_count, substring(subsession_start_date, 1, 10) as ssd, profile_subsession_counter as psc FROM main_summary ORDER BY 1, 3 ASC, 4 ASC LIMIT 1000 """ spark.sql(q).show() ``` will produce: | client_id |places_bookmarks_count | subsession start dt| profile subsession counter| |00005d34-310e-409b-ab1c-2ed099dfb669 |nan |2017-08-25 |170| |00005d34-310e-409b-ab1c-2ed099dfb669 |nan |2017-08-25 |171| |00005d34-310e-409b-ab1c-2ed099dfb669 |14.0 |2017-08-25 |172| |00005d34-310e-409b-ab1c-2ed099dfb669 |nan |2017-08-27 |173| |00005d34-310e-409b-ab1c-2ed099dfb669 |nan |2017-08-27 |174| |00005d34-310e-409b-ab1c-2ed099dfb669 |nan |2017-08-28 |175| |00005d34-310e-409b-ab1c-2ed099dfb669 |14.0 |2017-08-28 |176| |00005d34-310e-409b-ab1c-2ed099dfb669 |nan |2017-08-29 |177| |00005d34-310e-409b-ab1c-2ed099dfb669 |nan |2017-08-29 |178| |00005d34-310e-409b-ab1c-2ed099dfb669 |14.0 |2017-08-30 |179| Similarly, in longitudinal, you will have clients looking like this: ``` q = """ SELECT client_id, places_bookmarks_count FROM long_filtered LIMIT 50 """ ex1 = spark.sql(q).toPandas() ex1['places_bookmarks_count'][13][230:] ``` [Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0), Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0), Row(values=[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=20), Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0), Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0), Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0), Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0), Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0), Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0), Row(values=[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=13), Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0), Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0), Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0), Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0)] This shouldn't be recording event based information, just the number of bookmarks that a client has. For some reason, it's not getting recorded on every subsession... The Problem: I'm trying to find out how many bookmarks a client is making, however, with the current state of this probe, it's impossible since it's unreliably filled in. Please advise.
I will investigate this, and file a follow up bug to fix what I find.
Assignee: nobody → dthorn
Priority: P2 → P1
Points: --- → 1
In querying main_summary on the week of 20171223 (1) I found 542,620,261 pings with null bookmark count, and 1,636,246,368 that included a bookmark count. The week of 20171216 yields 579,338,360 null to 1,836,050,118 present. The week of 20171209 yields 576,573,825 null to 1,737,042,392 present. But when querying longitudinal_v20171223 (2) I found 1,869,967 with null bookmark count and 518,646 with a bookmark count. v20171216 yields 5,260,376 null to 1,449,600 present. v2017109 yields 5,491,715 null to 1,516,335 present. The difference in ratios there looks suspicious to me, but not in a way that I know how to move forward with, so I will consult people when they return on Monday. (1) > WITH sample AS > ( > SELECT (places_bookmarks_count IS NULL) AS has_bm > FROM main_summary > WHERE submission_date_s3 <= '20171223' AND submission_date_s3 > '20171216' > ) > SELECT COUNT(*), has_bm > FROM sample > GROUP BY has_bm (2) > WITH sample AS > ( > SELECT (places_bookmarks_count IS NULL) AS has_bm > FROM longitudinal_v20171223 > ) > SELECT COUNT(*), has_bm > FROM sample > GROUP BY has_bm
Also, now that I read the bug description again, I realize that the source of the original issue is pretty clear based on the information originally given. not every subsession ping is including bookmark count, even within a consistent client id. I need to investigate this from a different angle.
link to redash query demonstrating the issue/query shown in comment 0: https://sql.telemetry.mozilla.org/queries/50103/source
okay, so this seems to be a side-effect related to the issue described in Bug 1127907 where places_bookmarks_count is collected during 'gather-telemetry' which doesn't happen for every telemetry subsession. I think, but I am not entirely sure, that if you want to check places_bookmarks_count for now, then you will need to use a 'WHERE places_bookmarks_count IS NOT NULL' clause, until Bug 1127907 is worked out, and we have a way of collecting a histogram on a per-subsession basis and can then evaluate if this metric should be collected at that frequency. :mreid does that sound right to you?
Depends on: 1127907
Flags: needinfo?(mreid)
Yes, that sounds right. Filtering out nulls should give a reasonable view of the number of bookmarks over time. As pointed out in comment 5 of bug 1127907, however, it is recorded on the "idle-daily" event which is not completely reliable.
Flags: needinfo?(mreid)
Priority: P1 → P2
as per comment 1 and comment 6 I'm going to close this 'wontfix' because this is currently the expected behavior
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Product: Data Platform and Tools → Data Platform and Tools Graveyard
You need to log in before you can comment on or make changes to this bug.