Closed Bug 1386554 Opened 7 years ago Closed 7 years ago

Consider converting the R analysis code for RAPPOR to python

Categories

(Data Platform and Tools :: General, enhancement, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: fhartmann)

References

Details

(Whiteboard: [measurement:client:tracking])

This bug is about evaluating whether to convert the analysis code that produces the aggregates for the gathered data from R to Python. Our pipeline makes extensive use of Python, and introducing an R kernel might bring additional problems that we don't want to face within the scope of this project.

The slimmed-down RAPPOR code is available on Alejandro's GitHub at https://github.com/Alexrs95/rappor with some updated documentation.

The documentation about how to run the simulation lives at https://docs.google.com/document/d/1xi-3liU7wWOUaL_QEOA8vvNFCc4ULjLelth1QWNwSPk/edit?ts=595547c8#heading=h.jt46muo7hp9y

We should get an understanding of:

- Does the R code require exotic libraries that are only available in R?
- Is there any speed concern in using the Python libraries compared to the R ones?
Blocks: 1379180
Whiteboard: [measurement:client:tracking]
Blocks: 1386564
Assignee: nobody → fhartmann
Priority: -- → P1
Blocks: 1394856
We have finished evaluating the R code base and have started working on the reimplementation in Python.
In terms of performance, the Python version generally performs slightly better than the original R implementation.

We were able to find equivalent Python libraries for most of the libraries used in the R code.
The only exception is limSolve[1], which provides lsei, a least-squares implementation that allows the user to encode additional constraints.
We evaluated several possible replacements. Using a non-negative least-squares solver[2] allows us to encode the first constraint, and the coefficients it finds are generally very similar.
With this replacement, we get results that are very comparable to those of the original implementation.
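
As a rough illustration of the replacement, the sketch below calls scipy.optimize.nnls[2] on a small synthetic problem. The helper name fit_nonnegative, the matrix sizes, and the coefficient values are made up for this example and are not taken from the RAPPOR code. Note that nnls only enforces non-negative coefficients, so any other constraint expressible through lsei is not covered here.

```python
import numpy as np
from scipy.optimize import nnls


def fit_nonnegative(design, target):
    """Solve min ||design @ x - target||_2 subject to x >= 0.

    Hypothetical helper for illustration only; it simply wraps
    scipy.optimize.nnls and returns (coefficients, residual norm).
    """
    coefs, residual_norm = nnls(design, target)
    return coefs, residual_norm


# Synthetic, noise-free data (assumed values, not real RAPPOR reports).
rng = np.random.default_rng(0)
design = rng.random((50, 4))
true_coefs = np.array([0.5, 0.0, 0.3, 0.2])
target = design @ true_coefs

coefs, residual_norm = fit_nonnegative(design, target)
print(coefs, residual_norm)
```

On noise-free data like this, the recovered coefficients should match the ones used to build the target, since that solution already satisfies the non-negativity constraint. With real, noisy data the result can differ from what lsei returns whenever lsei's additional constraints are active.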

[1] https://cran.r-project.org/web/packages/limSolve/limSolve.pdf
[2] https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.nnls.html
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED