Closed
Bug 1402134
Opened 7 years ago
Closed 7 years ago
probe: places_bookmarks_count not populating correctly
Categories
(Data Platform and Tools Graveyard :: Datasets: Longitudinal, defect, P2)
Data Platform and Tools Graveyard
Datasets: Longitudinal
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: shong, Assigned: relud)
References
(Depends on 1 open bug)
Details
The places_bookmarks_count field doesn't seem to be populating correctly in the longitudinal dataset. I don't think this is happening at the derived-dataset level, because I'm seeing it in main_summary as well.
Description:
* The places_bookmarks_count field doesn't populate (is null) in a large number of clients' pings. Sometimes a ping includes this information, sometimes it doesn't. For example, this query:
```
q = """
SELECT
client_id,
places_bookmarks_count,
substring(subsession_start_date, 1, 10) as ssd,
profile_subsession_counter as psc
FROM
main_summary
ORDER BY
1, 3 ASC, 4 ASC
LIMIT 1000
"""
spark.sql(q).show()
```
will produce:
| client_id | places_bookmarks_count | subsession start dt | profile subsession counter |
|---|---|---|---|
| 00005d34-310e-409b-ab1c-2ed099dfb669 | nan | 2017-08-25 | 170 |
| 00005d34-310e-409b-ab1c-2ed099dfb669 | nan | 2017-08-25 | 171 |
| 00005d34-310e-409b-ab1c-2ed099dfb669 | 14.0 | 2017-08-25 | 172 |
| 00005d34-310e-409b-ab1c-2ed099dfb669 | nan | 2017-08-27 | 173 |
| 00005d34-310e-409b-ab1c-2ed099dfb669 | nan | 2017-08-27 | 174 |
| 00005d34-310e-409b-ab1c-2ed099dfb669 | nan | 2017-08-28 | 175 |
| 00005d34-310e-409b-ab1c-2ed099dfb669 | 14.0 | 2017-08-28 | 176 |
| 00005d34-310e-409b-ab1c-2ed099dfb669 | nan | 2017-08-29 | 177 |
| 00005d34-310e-409b-ab1c-2ed099dfb669 | nan | 2017-08-29 | 178 |
| 00005d34-310e-409b-ab1c-2ed099dfb669 | 14.0 | 2017-08-30 | 179 |
Similarly, in longitudinal, you will have clients looking like this:
```
q = """
SELECT
client_id,
places_bookmarks_count
FROM
long_filtered
LIMIT 50
"""
ex1 = spark.sql(q).toPandas()
ex1['places_bookmarks_count'][13][230:]
```
```
[Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0),
 Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0),
 Row(values=[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=20),
 Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0),
 Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0),
 Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0),
 Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0),
 Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0),
 Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0),
 Row(values=[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=13),
 Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0),
 Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0),
 Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0),
 Row(values=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sum=0)]
```
This shouldn't be recording event-based information, just the number of bookmarks a client has. For some reason, it's not getting recorded on every subsession...
The Problem:
I'm trying to find out how many bookmarks a client has; however, in the probe's current state that's impossible, since it's unreliably filled in. Please advise.
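The sparsity described above can be sketched in plain Python (hypothetical ping list mirroring the table; plain Python in place of Spark purely for illustration):

```python
# Hypothetical subsession pings for one client: the probe is present
# in only a few subsessions, null everywhere else.
pings = [
    {"psc": 170, "places_bookmarks_count": None},
    {"psc": 171, "places_bookmarks_count": None},
    {"psc": 172, "places_bookmarks_count": 14},
    {"psc": 173, "places_bookmarks_count": None},
    {"psc": 176, "places_bookmarks_count": 14},
]

populated = [p for p in pings if p["places_bookmarks_count"] is not None]
print(f"{len(populated)}/{len(pings)} pings carry a bookmark count")

# Any per-subsession analysis has to cope with the gaps, e.g. by
# carrying the last observed value forward:
last = None
filled = []
for p in pings:
    if p["places_bookmarks_count"] is not None:
        last = p["places_bookmarks_count"]
    filled.append(last)
print(filled)  # -> [None, None, 14, 14, 14]
```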
Assignee
Comment 1•7 years ago
I will investigate this, and file a follow up bug to fix what I find.
Assignee: nobody → dthorn
Priority: P2 → P1
Assignee
Updated•7 years ago
Points: --- → 1
Assignee
Comment 2•7 years ago
In querying main_summary on the week of 20171223 (1) I found 542,620,261 pings with null bookmark count, and 1,636,246,368 that included a bookmark count. The week of 20171216 yields 579,338,360 null to 1,836,050,118 present. The week of 20171209 yields 576,573,825 null to 1,737,042,392 present.
But when querying longitudinal_v20171223 (2) I found 1,869,967 rows with a null bookmark count and 518,646 with a bookmark count. v20171216 yields 5,260,376 null to 1,449,600 present. v20171209 yields 5,491,715 null to 1,516,335 present.
The difference in ratios there looks suspicious to me, but not in a way that I know how to move forward with, so I will consult people when they return on Monday.
(1)
> WITH sample AS
> (
>   SELECT (places_bookmarks_count IS NULL) AS bm_is_null
>   FROM main_summary
>   WHERE submission_date_s3 <= '20171223' AND submission_date_s3 > '20171216'
> )
> SELECT COUNT(*), bm_is_null
> FROM sample
> GROUP BY bm_is_null
(2)
> WITH sample AS
> (
>   SELECT (places_bookmarks_count IS NULL) AS bm_is_null
>   FROM longitudinal_v20171223
> )
> SELECT COUNT(*), bm_is_null
> FROM sample
> GROUP BY bm_is_null
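The shape of those counting queries can be checked end-to-end on a toy table (SQLite here purely as a runnable stand-in for Presto/Spark SQL; table name and row counts are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE longitudinal (places_bookmarks_count INTEGER)")
# Three null rows and two populated rows, standing in for the real ratios.
conn.executemany(
    "INSERT INTO longitudinal VALUES (?)",
    [(None,), (None,), (None,), (7,), (12,)],
)

# Same null-vs-present bucketing as the queries above.
rows = conn.execute("""
    WITH sample AS (
        SELECT (places_bookmarks_count IS NULL) AS bm_is_null
        FROM longitudinal
    )
    SELECT COUNT(*), bm_is_null
    FROM sample
    GROUP BY bm_is_null
    ORDER BY bm_is_null
""").fetchall()
print(rows)  # -> [(2, 0), (3, 1)]
```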
Assignee
Comment 3•7 years ago
Also, now that I read the bug description again, I realize the source of the original issue is pretty clear from the information originally given: not every subsession ping includes a bookmark count, even within a single client id. I need to investigate this from a different angle.
Assignee
Comment 4•7 years ago
link to redash query demonstrating the issue/query shown in comment 0: https://sql.telemetry.mozilla.org/queries/50103/source
Assignee
Comment 5•7 years ago
Okay, so this seems to be a side effect of the issue described in Bug 1127907: places_bookmarks_count is collected during 'gather-telemetry', which doesn't happen for every telemetry subsession.
I think, but am not entirely sure, that if you want to check places_bookmarks_count for now, you will need a 'WHERE places_bookmarks_count IS NOT NULL' clause until Bug 1127907 is worked out and we have a way of collecting a histogram on a per-subsession basis, at which point we can evaluate whether this metric should be collected at that frequency.
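A minimal sketch of that workaround (hypothetical miniature of main_summary, SQLite stdlib in place of Spark SQL just to illustrate the filter):

```python
import sqlite3

# Hypothetical miniature of main_summary: one row per subsession ping;
# places_bookmarks_count is NULL whenever 'gather-telemetry' didn't run.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE main_summary (client_id TEXT, places_bookmarks_count INTEGER)"
)
conn.executemany(
    "INSERT INTO main_summary VALUES (?, ?)",
    [("c1", None), ("c1", 14), ("c1", None), ("c1", 14), ("c2", None)],
)

# The suggested workaround: drop the null subsessions before aggregating.
rows = conn.execute("""
    SELECT client_id, MAX(places_bookmarks_count) AS bookmarks
    FROM main_summary
    WHERE places_bookmarks_count IS NOT NULL
    GROUP BY client_id
""").fetchall()
print(rows)  # -> [('c1', 14)]
```

Note the caveat this implies: a client like c2 that never reported a count drops out of the result entirely.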
:mreid does that sound right to you?
Depends on: 1127907
Flags: needinfo?(mreid)
Comment 6•7 years ago
Yes, that sounds right. Filtering out nulls should give a reasonable view of the number of bookmarks over time. As pointed out in comment 5 of bug 1127907, however, it is recorded on the "idle-daily" event which is not completely reliable.
Flags: needinfo?(mreid)
Assignee
Updated•7 years ago
Priority: P1 → P2
Assignee
Comment 7•7 years ago
As per comment 1 and comment 6, I'm going to close this WONTFIX, because this is currently the expected behavior.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Updated•5 years ago
Product: Data Platform and Tools → Data Platform and Tools Graveyard