Main ping contains missing rows between decoded and live tables on 2020-01-22
Categories
(Data Platform and Tools :: General, defect, P1)
Tracking
(Not tracked)
People
(Reporter: amiyaguchi, Assigned: klukas)
Details
(Whiteboard: [dataquality])
According to the "pings decoded and in live tables (main/event)" dashboard, there is a 0.08% difference between the row counts of the decoded and live tables for the main ping. Other dates, such as 2020-01-30, also contain a significant number of missing rows.
There should be little to no difference in row counts between the decoded and live tables.
Comment 1•6 years ago
When we deduplicate by document_id, there is no difference between the counts. The following query yields an empty result set:
with pbd as (
  select document_id, count(*) n_pbd
  from `moz-fx-data-shared-prod.payload_bytes_decoded.telemetry_telemetry__main_v4`
  where date(submission_timestamp) = '2020-01-22'
  group by 1
),
live as (
  select document_id, count(*) n_live
  from `moz-fx-data-shared-prod.telemetry_live.main_v4`
  where date(submission_timestamp) = '2020-01-22'
  group by 1
)
select * from pbd join live using (document_id)
where n_pbd is null or n_live is null
I think the only explanation here is that the payload_bytes_decoded sink read some messages from Pub/Sub twice. It would be interesting to check whether there was a deploy on this day that could have contributed.
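One caveat on a check of this shape: a plain `join` in BigQuery is an inner join, so a document_id present in only one table never reaches the `is null` filter, and the result is empty regardless of the data. A comparison that keeps one-sided IDs (the equivalent of a full outer join) is needed to confirm that both tables hold the same set of documents. The following is a minimal Python sketch of that comparison on toy data; the function name `compare_document_ids` and the sample IDs are hypothetical and not taken from the pipeline.

```python
from collections import Counter


def compare_document_ids(decoded_ids, live_ids):
    """Compare per-document_id row counts from two tables.

    Unlike an inner join, this keeps IDs that appear on only one side.
    """
    n_pbd = Counter(decoded_ids)   # rows per document_id in the decoded table
    n_live = Counter(live_ids)     # rows per document_id in the live table

    # IDs missing entirely from one side (what a full outer join would expose).
    only_decoded = sorted(set(n_pbd) - set(n_live))
    only_live = sorted(set(n_live) - set(n_pbd))

    # IDs present on both sides but with different row counts (duplicates).
    count_mismatch = {
        doc_id: (n_pbd[doc_id], n_live[doc_id])
        for doc_id in sorted(set(n_pbd) & set(n_live))
        if n_pbd[doc_id] != n_live[doc_id]
    }
    return only_decoded, only_live, count_mismatch


# Hypothetical toy data: "b" was written twice to the decoded table.
decoded = ["a", "b", "b", "c"]
live = ["a", "b", "c"]
only_decoded, only_live, count_mismatch = compare_document_ids(decoded, live)
```

If both `only_decoded` and `only_live` come back empty while `count_mismatch` is not, the tables agree on which documents exist and differ only in how often some were written, which is the situation described in this comment.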
Comment 2•6 years ago
> there is a 0.08% difference between the counts of rows in decoded vs live tables for the main ping
> When we deduplicate by document_id, there is no difference between the counts.
This is expected behavior.
The live sink uses the old ingestion-beam sink, which attempts exactly-once delivery but does not guarantee at-least-once delivery.
The decoded sink uses the new ingestion-sink, which guarantees at-least-once delivery and therefore produces a higher volume of duplicates.
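To illustrate why at-least-once delivery inflates raw row counts without changing the set of documents, here is a small Python sketch. The sink contents are invented toy data, and `row_counts` is a hypothetical helper, not part of ingestion-beam or ingestion-sink.

```python
def row_counts(document_ids):
    """Return (raw row count, distinct document_id count) for one table."""
    ids = list(document_ids)
    return len(ids), len(set(ids))


# Hypothetical toy data: the decoded sink redelivered d2 once and d3 twice,
# as an at-least-once sink may do; the live sink wrote each document once.
decoded_sink = ["d1", "d2", "d2", "d3", "d3", "d3"]
live_sink = ["d1", "d2", "d3"]

decoded_raw, decoded_distinct = row_counts(decoded_sink)
live_raw, live_distinct = row_counts(live_sink)

# Raw counts differ by the number of redelivered rows...
extra_rows = decoded_raw - live_raw
# ...but the distinct counts agree once rows are deduplicated by document_id.
same_after_dedup = decoded_distinct == live_distinct
```

Under these assumptions the raw counts differ only by the redelivered rows, while the deduplicated counts match, consistent with the observation quoted above.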