Calculate the exact number of Telemetry records on S3

RESOLVED FIXED

Status

P1
normal
RESOLVED FIXED
3 years ago
2 months ago

People

(Reporter: mreid, Assigned: trink)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

3 years ago
When querying for a relatively large number of records using different methods, I keep seeing slightly different record counts.

I'd like to use Heka/Hindsight to confirm the actual record counts on S3 (as well as check for any corrupt streams) so that I can validate counts that come from the Scala and Python code.

Ideally, I'd like to count the exact number of records under the following S3 prefixes:

telemetry-2/20160401/telemetry/4/main/Firefox
telemetry-2/20160402/telemetry/4/main/Firefox

That will make it easy to compare against the counts in the "main_summary" derived dataset.
(Assignee)

Updated

3 years ago
Assignee: nobody → mtrinkala
Status: NEW → ASSIGNED
Points: --- → 1
Priority: -- → P2
(Assignee)

Updated

3 years ago
Priority: P2 → P1
(Assignee)

Comment 1

3 years ago
Created attachment 8754655 [details]
Counts per file for each day

schema.json was used as the file selection criterion.

TSV format: column 1 is the filename, column 2 is the number of messages in the file.
cnts-20160401.tsv
  total files    = 166338
  total messages = 396151215
cnts-20160402.tsv
  total files    = 152984
  total messages = 303040220
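
For anyone re-checking the totals above, here is a minimal Python sketch that sums a per-file count TSV. It assumes the layout described in this comment (filename in column 1, message count in column 2) and no header row. The function name summarize_counts is hypothetical and is not part of Heka/Hindsight or the attachment.

```python
import csv
from typing import Iterable, Tuple


def summarize_counts(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total_files, total_messages) for a two-column TSV.

    Assumes column 1 is the filename and column 2 is an integer
    message count, with no header row (an assumption, not confirmed
    by the attachment itself).
    """
    total_files = 0
    total_messages = 0
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        total_files += 1
        total_messages += int(row[1])
    return total_files, total_messages
```

Passing an open file handle for cnts-20160401.tsv (opened with newline="") should reproduce the "total files" and "total messages" figures listed above, provided the file follows the assumed layout.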
Mark checked that the Scala implementation returns the correct number of records, and I did the same for the Python one. We can both confirm that the record counts match for the given dates.
(Assignee)

Updated

3 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED

Updated

2 months ago
Product: Cloud Services → Cloud Services Graveyard