Closed
Bug 1375108
Opened 7 years ago
Closed 7 years ago
Create Notebook Checking if Quantum RC Needs Deduping
Categories
(Data Platform and Tools :: General, enhancement, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: frank, Unassigned)
References
Details
We're considering deduping Spark-side for Quantum RC. Dupes will only affect "sum" columns, so we should create a notebook comparing with/without deduping to see what effect they are having.
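The comparison described above can be sketched in plain Python. This is only an illustration, not the notebook from this bug: the rows, the column names, and the use of a `document_id` field as the dedup key are all assumptions, and a real run would use Spark over the actual dataset.

```python
# Hypothetical ping rows: (document_id, subsession_length, crash_count).
# Field names and the document_id dedup key are assumptions for illustration.
rows = [
    ("a", 3600, 1),
    ("b", 1800, 0),
    ("b", 1800, 0),  # duplicate submission of ping "b"
    ("c", 7200, 2),
]


def column_sums(rows):
    """Sum each numeric ("sum") column over all rows."""
    return {
        "subsession_length": sum(r[1] for r in rows),
        "crash_count": sum(r[2] for r in rows),
    }


def drop_dupes(rows):
    """Keep only the first row seen for each document_id."""
    seen = set()
    kept = []
    for r in rows:
        if r[0] not in seen:
            seen.add(r[0])
            kept.append(r)
    return kept


def relative_inflation(rows):
    """Per-column fraction by which duplicates inflate the sum."""
    raw = column_sums(rows)
    deduped = column_sums(drop_dupes(rows))
    return {k: (raw[k] - deduped[k]) / deduped[k] for k in raw if deduped[k]}


inflation = relative_inflation(rows)
print(inflation)
```

Reporting the inflation per column makes it easy to see whether duplicates push every sum in the same direction and by a similar amount.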
Comment 1 (Reporter)•7 years ago
I don't think we need to dedupe. The differences are <1%, and they are all in the same direction and around the same amount. Since we're using these for MTBF (SUM(subsession_length) / SUM(some_val)), the numerator and denominator shift together, so that number will remain steady with or without deduping. Mark, can you do a quick r on my notebook? https://gist.github.com/fbertsch/33a87422fa1d4c1f667ca9146fc7da77
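A small worked example of why a ratio of sums stays steady when both sums are inflated by similar fractions. The totals and inflation rates below are made up for illustration and are not taken from the linked notebook.

```python
def mtbf(sum_subsession_length, sum_events):
    """Ratio of sums, as in SUM(subsession_length) / SUM(some_val)."""
    return sum_subsession_length / sum_events


# Made-up deduped totals (assumptions, not real data).
clean_length = 1_000_000.0
clean_events = 500.0

# Assumed inflation from duplicates: both under 1% and in the same direction.
length_inflation = 0.009
events_inflation = 0.008

clean = mtbf(clean_length, clean_events)
with_dupes = mtbf(
    clean_length * (1 + length_inflation),
    clean_events * (1 + events_inflation),
)

# The ratio moves by (1 + a) / (1 + b) - 1, roughly a - b for small a and b.
relative_shift = with_dupes / clean - 1
print(f"relative shift in MTBF: {relative_shift:.6f}")
```

With these assumed rates the ratio moves by about 0.1%, much less than either sum does on its own.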
Flags: needinfo?(mreid)
Comment 2•7 years ago
Analysis looks solid: r+, and I agree that 1% should not make a significant difference in conclusions. Out of curiosity, how long did the "drop dupes" operation take?
Flags: needinfo?(mreid) → needinfo?(fbertsch)
Comment 3 (Reporter)•7 years ago
(In reply to Mark Reid [:mreid] from comment #2)
> Analysis looks solid: r+, and I agree that 1% should not make a significant
> difference in conclusions.
>
> Out of curiosity, how long did the "drop dupes" operation take?

You know I'm not totally sure. I ended up running them separately and didn't time them :/
Flags: needinfo?(fbertsch)
Updated (Reporter)•7 years ago
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated (Assignee)•2 years ago
Component: Datasets: General → General