Consider converting the R analysis code for RAPPOR to python

RESOLVED FIXED

Status

Product: Data Platform and Tools
Component: General
Priority: P1
Severity: normal
Status: RESOLVED FIXED
Opened: 11 months ago
Modified: 10 months ago

People

(Reporter: Dexter, Assigned: Florian Hartmann)

Tracking

(Blocks: 3 bugs)

Details

(Whiteboard: [measurement:client:tracking])

(Reporter)

Description

11 months ago
This bug is about evaluating whether to convert the analysis code that produces the aggregates for the gathered data from R to Python. Our pipeline makes extensive use of Python, and introducing an R kernel might bring additional problems that we don't want to deal with within the scope of this project.

The slimmed-down RAPPOR code is available on Alejandro's GitHub at https://github.com/Alexrs95/rappor, with some updated documentation.

The documentation about how to run the simulation lives at https://docs.google.com/document/d/1xi-3liU7wWOUaL_QEOA8vvNFCc4ULjLelth1QWNwSPk/edit?ts=595547c8#heading=h.jt46muo7hp9y

We should get an understanding of:

- Does the R code require exotic libraries that are only available in R?
- Is there any speed concern in using the Python libraries compared to the R ones?
(Reporter)

Updated

11 months ago
Blocks: 1379180
Whiteboard: [measurement:client:tracking]
(Reporter)

Updated

11 months ago
Blocks: 1386564
(Reporter)

Updated

11 months ago
Assignee: nobody → fhartmann
Priority: -- → P1

Updated

10 months ago
Blocks: 1394856
(Assignee)

Comment 1

10 months ago
We have finished evaluating the R code base and have started working on the reimplementation in Python.
In terms of performance, the Python version generally performs slightly better than the original R implementation.

We were able to find equivalent Python libraries for most of the libraries used in the R code.
The only exception is limSolve[1], which provides lsei, a least squares implementation that allows the user to encode additional constraints.
We evaluated several possible replacements. A nonnegative least squares solver[2] allows us to encode the first constraint, and the coefficients it finds are generally very similar.
With this replacement, we get results that are very comparable to those of the original implementation.

[1] https://cran.r-project.org/web/packages/limSolve/limSolve.pdf
[2] https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.nnls.html
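
For illustration, a minimal sketch of what calling the nonnegative least squares solver[2] looks like. This is not the code from the reimplementation: the helper name fit_nonnegative and the toy matrix and coefficients are hypothetical, and the sketch assumes that the constraint being encoded is that all coefficients are nonnegative. Any further constraints that lsei supports are not enforced here.

```python
import numpy as np
from scipy.optimize import nnls


def fit_nonnegative(design, target):
    """Solve min ||design @ x - target||_2 subject to x >= 0.

    Hypothetical helper for illustration only. It returns the
    coefficient vector and the residual norm reported by SciPy.
    """
    coefs, residual_norm = nnls(design, target)
    return coefs, residual_norm


# Toy data (assumption, not taken from RAPPOR): 20 observations,
# 4 candidate columns, noise-free target built from known coefficients.
rng = np.random.default_rng(0)
design = rng.random((20, 4))
true_coefs = np.array([0.5, 0.0, 0.3, 0.2])
target = design @ true_coefs

coefs, residual_norm = fit_nonnegative(design, target)
print(coefs, residual_norm)
```

Because nnls only bounds the coefficients from below at zero, a comparison against the lsei output on real data is still needed to confirm that the two agree closely enough.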
Status: NEW → RESOLVED
Last Resolved: 10 months ago
Resolution: --- → FIXED