Closed Bug 1375108 Opened 7 years ago Closed 7 years ago

Create Notebook Checking if Quantum RC Needs Deduping

Categories

(Data Platform and Tools :: General, enhancement, P2)

x86
macOS
enhancement
Points: 2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Unassigned)


We're considering deduping Spark-side for Quantum RC. Dupes will only affect "sum" columns, so we should create a notebook comparing with/without deduping to see what effect they are having.
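A minimal sketch of the with/without comparison described above, in plain Python rather than Spark so it runs anywhere. The `document_id` key, the `crashes` column, and the sample rows are assumptions made up for illustration; only `subsession_length` is named in this bug, and the actual notebook may be structured differently.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def drop_dupes(rows: Iterable[Row], key: str) -> List[Row]:
    """Keep the first row seen for each value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def compare_sums(
    rows: Iterable[Row], key: str, sum_cols: List[str]
) -> Dict[str, Tuple[float, float, float]]:
    """Return {column: (raw_sum, deduped_sum, percent_inflation)}."""
    rows = list(rows)
    deduped = drop_dupes(rows, key)
    report = {}
    for col in sum_cols:
        raw = sum(r[col] for r in rows)
        clean = sum(r[col] for r in deduped)
        # Percent by which duplicates inflate the sum, relative to the deduped sum.
        pct = 100.0 * (raw - clean) / clean if clean else 0.0
        report[col] = (raw, clean, pct)
    return report


# Hypothetical pings; the third row duplicates the first.
pings = [
    {"document_id": "a", "subsession_length": 3600, "crashes": 1},
    {"document_id": "b", "subsession_length": 1800, "crashes": 0},
    {"document_id": "a", "subsession_length": 3600, "crashes": 1},
    {"document_id": "c", "subsession_length": 7200, "crashes": 2},
]

report = compare_sums(pings, "document_id", ["subsession_length", "crashes"])
for col, (raw, clean, pct) in report.items():
    print(f"{col}: raw={raw} deduped={clean} inflation={pct:.2f}%")
```

The toy data is deliberately tiny, so the inflation it reports is far larger than anything expected from real pings; the point is only the shape of the comparison.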
I don't think we need to dedupe. The differences are <1%, all in the same direction, and around the same size. Since we're using these sums for MTBF (SUM(subsession_length) / SUM(some_val)), that number will stay steady whether or not we dedupe.
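The arithmetic behind that argument can be sketched as follows: if duplicates inflate the numerator and denominator sums by similar fractions, the inflation mostly cancels in the ratio. The function name and the inflation figures below are illustrative assumptions, not values taken from the notebook.

```python
def ratio_shift(num_inflation: float, den_inflation: float) -> float:
    """Relative change in SUM(num) / SUM(den) when duplicates inflate
    the numerator sum by `num_inflation` and the denominator sum by
    `den_inflation` (both as fractions, e.g. 0.01 for 1%)."""
    return (1.0 + num_inflation) / (1.0 + den_inflation) - 1.0


# Hypothetical example: numerator inflated by 1.0%, denominator by 0.8%.
shift = ratio_shift(0.010, 0.008)
print(f"ratio changes by {100 * shift:.3f}%")

# Equal inflation on both sides leaves the ratio unchanged.
print(ratio_shift(0.01, 0.01))
```

For small inflations the shift is approximately the difference between the two fractions, so sums that are each off by under 1% in the same direction move the ratio by much less than 1%.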

Mark, can you do a quick r on my notebook? https://gist.github.com/fbertsch/33a87422fa1d4c1f667ca9146fc7da77
Flags: needinfo?(mreid)
Analysis looks solid: r+, and I agree that 1% should not make a significant difference in conclusions.

Out of curiosity, how long did the "drop dupes" operation take?
Flags: needinfo?(mreid) → needinfo?(fbertsch)
(In reply to Mark Reid [:mreid] from comment #2)
> Analysis looks solid: r+, and I agree that 1% should not make a significant
> difference in conclusions.
> 
> Out of curiosity, how long did the "drop dupes" operation take?

You know, I'm not totally sure. I ended up running them separately and didn't time them :/
Flags: needinfo?(fbertsch)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Datasets: General → General