Closed
Bug 1329842
Opened 8 years ago
Closed 8 years ago
Create validation script to diff ToplineSummary and run_executive_report.py
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: amiyaguchi, Assigned: amiyaguchi)
References
Details
Most of the Scala code that lets the Topline report use the main_summary dataset has been written. Before this can be pushed into production, the report needs to be verified against the original v4_monthly script (or possibly the v4_weekly script).
The plan is to load both datasets into memory, and calculate the mean percent error between all fields. The aggregated data won't be the same, but they should be very close.
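A minimal sketch of the comparison described above, assuming each report has been loaded into a dict keyed by its dimensions. The field names, keys, and numbers below are hypothetical and are not taken from either report.

```python
def percent_error(expected, actual):
    """Signed percent error of `actual` relative to `expected`."""
    return 100.0 * (actual - expected) / expected


def mean_percent_error(reference, candidate, fields):
    """Mean absolute percent error per field, over rows present in both reports."""
    shared = sorted(set(reference) & set(candidate))
    result = {}
    for field in fields:
        errors = [
            abs(percent_error(reference[key][field], candidate[key][field]))
            for key in shared
            if reference[key][field] != 0  # skip rows where the error is undefined
        ]
        result[field] = sum(errors) / len(errors) if errors else None
    return result


# Hypothetical rows keyed by (geo, channel); values are invented for illustration.
reference = {
    ("US", "release"): {"actives": 1000, "searches": 200},
    ("DE", "release"): {"actives": 500, "searches": 100},
}
candidate = {
    ("US", "release"): {"actives": 1010, "searches": 220},
    ("DE", "release"): {"actives": 495, "searches": 100},
}

print(mean_percent_error(reference, candidate, ["actives", "searches"]))
```

Only rows present in both reports are compared here; rows that exist in one report but not the other would need to be listed separately.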
Assignee | ||
Comment 1•8 years ago
|
||
I've written a simple script to compare the difference between two versions of the topline report [1]. Running this notebook takes a bit of setup, but it doesn't need to be run often. I am planning on validating against the weekly report instead of the monthly one going forward.
There is a significant error (> 10%) in the search counts. I need to do more investigation and debugging to reduce this.
[1] https://gist.github.com/acmiyaguchi/f33f1b7844a8ba17bc6e2a8c85d1188c
Assignee | ||
Comment 2•8 years ago
|
||
It seems that the error is mostly from the differences in how the search engine names are normalized [1]. The total search counts between the main summary and redshift data sources are approximately the same (0.001% error).
I will need to figure out a better regex for finding the correct labels, as well as a better way to test their implementation.
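To illustrate why the totals can agree while the per-engine counts do not, here is a rough sketch of regex-based label normalization. The patterns, labels, and sample names are hypothetical stand-ins; the actual lpeg grammar and ToplineSummary regexes may differ.

```python
import re
from collections import Counter

# Hypothetical label patterns, anchored at the start of the engine name.
ENGINE_PATTERNS = [
    ("google", re.compile(r"^google", re.IGNORECASE)),
    ("bing", re.compile(r"^bing", re.IGNORECASE)),
    ("yahoo", re.compile(r"^yahoo", re.IGNORECASE)),
]


def normalize_engine(name):
    """Map a raw engine name to a label, falling back to 'other'."""
    for label, pattern in ENGINE_PATTERNS:
        if pattern.search(name or ""):
            return label
    return "other"


def counts_by_label(raw_counts):
    """Sum (engine name, count) pairs under their normalized label."""
    totals = Counter()
    for name, count in raw_counts:
        totals[normalize_engine(name)] += count
    return totals


# Invented sample data for illustration only.
raw = [("Google", 70), ("google-nocodes", 5), ("Yahoo! JAPAN", 10), ("DuckDuckGo", 15)]
totals = counts_by_label(raw)
print(dict(totals), sum(totals.values()))
```

Changing a pattern moves counts between labels but leaves the grand total untouched, which is consistent with the totals matching while individual engines disagree.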
[1] https://gist.github.com/acmiyaguchi/a7e6132a19b88ef17d8c44786ab0e197
Assignee | ||
Comment 3•8 years ago
|
||
I have updated the topline summary regexes to reflect differences between match and search. The report in this comment [1] differs from the one in comment 1 in the following ways:
* Changes the period from a month to a week
* Requires significantly less setup (you only need to scp `weekly-v4.csv` to an analysis machine)
* Uses the percent error without an absolute value (to see the direction of change)
* Formats the result columns so they don't break in the gist
Instead of 15% errors, we see more reasonable errors of around 5%. However, it looks like the total search counts ... The change to how strings are normalized seems to be the main contributor to fixing this.
I'll probably create a small harness to test the differences between lpeg and regexes on a sample of all search strings.
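One possible shape for such a harness, sketched under assumptions: both normalizers are exposed as plain functions, and the two functions below are hypothetical stand-ins (a prefix check in place of the lpeg grammar, and a deliberately unanchored regex so that a mismatch shows up).

```python
import re
from collections import Counter


def reference_label(name):
    # Stand-in for the lpeg-based normalizer: matches at the start only.
    return "google" if name.lower().startswith("google") else "other"


def candidate_label(name):
    # Stand-in for a regex normalizer; unanchored on purpose to show a mismatch.
    return "google" if re.search(r"google", name, re.IGNORECASE) else "other"


def diff_normalizers(samples, reference_fn, candidate_fn):
    """Count the inputs on which the two normalizers disagree."""
    mismatches = Counter()
    for value in samples:
        expected = reference_fn(value)
        actual = candidate_fn(value)
        if expected != actual:
            mismatches[(value, expected, actual)] += 1
    return mismatches


# Invented sample of search strings for illustration only.
samples = ["Google", "google-nocodes", "my-google-clone", "Bing", "my-google-clone"]
for (value, expected, actual), n in diff_normalizers(
    samples, reference_label, candidate_label
).items():
    print(f"{value!r}: reference={expected} candidate={actual} ({n} occurrences)")
```

Keeping the occurrence count per mismatch makes it easier to tell whether a disagreement affects a meaningful share of searches or only a rare string.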
[1] https://gist.github.com/acmiyaguchi/3b5a66649189603e78d6b72047886ccc
Assignee | ||
Updated•8 years ago
|
Summary: Verify that ToplineSummary provides the same results as run_executive_report.py → Create validation script to diff ToplineSummary and run_executive_report.py
Assignee | ||
Comment 4•8 years ago
|
||
It looks like the regex matches the lpeg patterns. This notebook [1] shows that the regex correctly identifies all the channels in the last month.
[1] https://gist.github.com/acmiyaguchi/9f525ae8e4cbdcad2b5a69c84131164f
Assignee | ||
Comment 5•8 years ago
|
||
The report is now accurate within 1% of the original script.
The topline report was overcounting the results for the week by 10%. I verified in the last comment that field normalization was working correctly. :mreid suggested that the date ranges of the two reports were off, which would account for the mismatch (7 +/- 1 days should produce an error of ~15%). It turns out I had been including the report end date (1 week/month after the report start) in the main_summary submission_date_s3 range, so the window covered one extra day.
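The off-by-one can be sketched as follows. This is an illustration only, not the report's code: the start date is hypothetical, the `YYYYMMDD` string format for submission_date_s3 is assumed, and the ~15% figure additionally assumes roughly uniform daily volume.

```python
from datetime import date, timedelta


def report_days(start, days, include_end=False):
    """List the submission dates (YYYYMMDD strings) covered by a report window."""
    count = days + 1 if include_end else days
    return [(start + timedelta(days=d)).strftime("%Y%m%d") for d in range(count)]


start = date(2017, 1, 1)  # hypothetical report start

correct = report_days(start, 7)                   # end date excluded: 7 days
buggy = report_days(start, 7, include_end=True)   # end date included: 8 days

# With uniform daily volume, one extra day in seven inflates counts by 1/7.
overcount = 100.0 * (len(buggy) - len(correct)) / len(correct)
print(correct[0], correct[-1], buggy[-1], round(overcount, 1))
```

Treating the window as half-open (start included, end excluded) keeps consecutive weekly reports from sharing a boundary day.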
I have also made revisions to the validation notebook [1], which will most likely be modified for production use.
[1] https://gist.github.com/acmiyaguchi/3b5a66649189603e78d6b72047886ccc
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 6•8 years ago
|
||
The link for the above gist should be tied down to the following revision:
https://gist.github.com/acmiyaguchi/3b5a66649189603e78d6b72047886ccc/e4d29bc5298ccf2af121be88c0fd240c30088cfd
Updated•6 years ago
|
Product: Cloud Services → Cloud Services Graveyard