When querying for a relatively large number of records using different methods, I keep seeing slightly different record counts. I'd like to use Heka/Hindsight to confirm the actual record counts on S3 (and to check for any corrupt streams) so that I can validate the counts that come from the Scala and Python code.

Ideally, I'd like to count the exact number of records under the following S3 prefixes:

telemetry-2/20160401/telemetry/4/main/Firefox
telemetry-2/20160402/telemetry/4/main/Firefox

That will make it easy to compare against the counts in the "main_summary" derived dataset.
Assignee: nobody → mtrinkala
Status: NEW → ASSIGNED
Points: --- → 1
Priority: -- → P2
Created attachment 8754655 [details]
Counts per file for each day

schema.json was the file selection criterion.
TSV format: column 1 is the filename, column 2 is the number of messages in the file.

cnts-20160401.tsv: total files = 166338, total messages = 396151215
cnts-20160402.tsv: total files = 152984, total messages = 303040220
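For anyone who wants to re-derive the per-day totals from the attached TSV files, here is a minimal sketch. It assumes only the layout described in the attachment (filename in column 1, message count in column 2, tab-separated, no header row); the function name `summarize_counts` and the sample rows are hypothetical and not part of the original tooling.

```python
import csv
import io


def summarize_counts(tsv_file):
    """Return (total_files, total_messages) for a two-column TSV.

    Assumes column 1 is the filename and column 2 is an integer
    message count, with no header row.
    """
    total_files = 0
    total_messages = 0
    for row in csv.reader(tsv_file, delimiter="\t"):
        if len(row) < 2:
            # Skip blank or malformed lines rather than guessing a count.
            continue
        total_files += 1
        total_messages += int(row[1])
    return total_files, total_messages


# Made-up sample rows, only to show the expected input shape.
sample = io.StringIO("prefix/file-a\t10\nprefix/file-b\t32\n")
files, messages = summarize_counts(sample)
print("total files = %d" % files)
print("total messages = %d" % messages)
```

To run it against a real attachment, open the file in text mode (for example `open("cnts-20160401.tsv", newline="")`) and pass the file object to the function; the result can then be compared with the totals quoted above.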
Mark checked that the Scala implementation returns the correct number of records, and I did the same for the Python one. We can both confirm that the record counts match for the given dates.
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED