Closed Bug 1271640 Opened 8 years ago Closed 8 years ago

Calculate the exact number of Telemetry records on S3

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Assigned: trink)

Details

Attachments

(1 file)

When querying a relatively large number of records using different methods, I keep seeing slightly different record counts.

I'd like to use Heka/Hindsight to confirm the actual record counts on S3 (as well as check for any corrupt streams) so that I can validate counts that come from the Scala and Python code.

Ideally, I'd like to count the exact number of records under the following S3 prefixes:

telemetry-2/20160401/telemetry/4/main/Firefox
telemetry-2/20160402/telemetry/4/main/Firefox

That will make it easy to compare against the counts in the "main_summary" derived dataset.
Assignee: nobody → mtrinkala
Status: NEW → ASSIGNED
Points: --- → 1
Priority: -- → P2
Priority: P2 → P1
schema.json was the file selection criterion

tsv: column 1 is the filename, column 2 is the number of messages in that file
cnts-20160401.tsv
  total files    = 166338
  total messages = 396151215
cnts-20160402.tsv
  total files    = 152984
  total messages = 303040220
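
For anyone wanting to re-derive the totals above from the tsv files, here is a minimal Python sketch. It assumes only the two-column layout described (filename, then message count, tab-separated); the function name summarize_counts and the sample rows are hypothetical, and this is not the Hindsight code that produced the counts.

```python
import csv
import io


def summarize_counts(tsv_file):
    """Return (total_files, total_messages) for a two-column TSV stream.

    Assumes column 1 is the filename and column 2 is the message count.
    """
    total_files = 0
    total_messages = 0
    for row in csv.reader(tsv_file, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        total_files += 1
        total_messages += int(row[1])
    return total_files, total_messages


# Hypothetical sample rows; the real cnts-*.tsv contents are not reproduced here.
sample = io.StringIO("fileA\t3\nfileB\t5\n")
print(summarize_counts(sample))
```

For a real file, open it with `open("cnts-20160401.tsv", newline="")` and pass the handle to the function.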
Mark checked that the Scala implementation returns the correct number of records, while I did the same for the Python one. We can both confirm that the record counts match for the given dates.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard