Closed Bug 1158225 Opened 10 years ago Closed 9 years ago

Explore anonymizing raw data immediately by converting IP to geo

Categories

(Content Services Graveyard :: Tiles, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Mardak, Assigned: oyiptong)

References

Details

(Whiteboard: .?)

It sounds like our current policy of keeping raw data for 7 days is raising some eyebrows, given the additional data we would be getting with Suggested Tiles. This behavior is within our strict policy, but we could potentially do better. oyiptong/tspurway, does it seem reasonable to do the geoip conversion immediately and strip off IP addresses? I would assume we would still only save the raw-ip+geo data for 7 days. What problems might this cause or prevent us from solving? (And what are potential solutions that allow immediate anonymization while avoiding those problems?) maksik had a similar request in the context of sending some data... was it for randomized response?
One potential issue is supporting bug 1156993, where it might be useful to track down individual IP addresses. But even then, we could potentially find an offending geo and then drill in further. Alternatively, we could keep a count of accesses per IP address within the 7 day window, so that we can't directly associate an IP address with payloads... although if IP address X sent 1 million requests and we see that country Y had a spike of 1 million clicks, we could probably guess it was the same source.
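The per-IP counting idea above could be sketched roughly as follows. This is only an illustration under assumptions: the class name IpAccessCounter, its methods, and the per-day bucketing are hypothetical and not taken from onyx. It keeps only a count per IP per day, never the payload, and drops buckets older than the window.

```python
from collections import Counter
from datetime import date, timedelta


class IpAccessCounter:
    """Hypothetical per-IP access counter; stores counts only, no payloads."""

    def __init__(self, window_days=7):
        self.window_days = window_days
        self._daily = {}  # maps a date to a Counter of IP -> access count

    def record(self, ip, day):
        # Count one access for this IP on the given day.
        self._daily.setdefault(day, Counter())[ip] += 1

    def expire(self, today):
        # Drop every daily bucket that has fallen out of the retention window.
        cutoff = today - timedelta(days=self.window_days)
        for day in [d for d in self._daily if d <= cutoff]:
            del self._daily[day]

    def totals(self):
        # Sum the remaining daily buckets into one count per IP.
        total = Counter()
        for counts in self._daily.values():
            total.update(counts)
        return total
```

As noted above, a large enough count for one IP can still be correlated with a matching spike in the geo data, so this limits direct association rather than ruling out inference.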
Blocks: 1158230
One tactic might be to change our current impression log to exclude the IP address (substituting the geo data), then create a separate log file that holds only the IP address (and maybe a date). You couldn't credibly match the IP log with the impression log, as there is no timestamp, and we could choose a buffering/scrambling logging algorithm that would easily defeat any cross-referencing attempt:

    ipbuffer = []
    threshold = random(25000)
    loop:
        ping = get_incoming_ping()
        IP = ping.IP
        geo = get_geo(ping)
        log(impressions, ping - IP, geo)
        ipbuffer.append(IP)
        if len(ipbuffer) >= threshold:
            log(ips, shuffle(ipbuffer))
            ipbuffer = []
            threshold = random(25000)
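A minimal runnable sketch of the buffering scheme described above, in Python. The function name process_pings, the dict-shaped ping with an "ip" key, and the geo_lookup, log_impression and log_ips callbacks are all assumptions made for illustration; none of them are the actual onyx interfaces. The sketch also flushes whatever remains at the end of the stream, which the original pseudocode does not address.

```python
import random


def process_pings(pings, geo_lookup, log_impression, log_ips,
                  max_batch=25000, rng=None):
    """Log impressions without IPs; log IPs separately in shuffled batches."""
    rng = rng or random.Random()
    ip_buffer = []
    threshold = rng.randint(1, max_batch)  # random batch size, re-drawn per flush
    for ping in pings:
        # Copy the ping without its IP and attach the geo lookup result instead.
        payload = {k: v for k, v in ping.items() if k != "ip"}
        payload["geo"] = geo_lookup(ping["ip"])
        log_impression(payload)

        ip_buffer.append(ping["ip"])
        if len(ip_buffer) >= threshold:
            rng.shuffle(ip_buffer)   # destroy arrival order within the batch
            log_ips(ip_buffer)       # no timestamp is written with the IPs
            ip_buffer = []
            threshold = rng.randint(1, max_batch)
    if ip_buffer:
        # Flush the remainder so no IP is lost at end of stream.
        rng.shuffle(ip_buffer)
        log_ips(ip_buffer)
```

Very small batches offer little mixing, so in practice a lower bound on the batch size would probably be wanted; the choice of 1 as the minimum here is arbitrary.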
:ekr, on our way to implementing the encrypted trusted third-party proxy for Suggested Tiles reporting, we have come up with the above scheme to dissociate the IP address from the other data collected by onyx and to randomize its position in the logs. The idea would be to track IP addresses in their own log (for fraud detection), and only insert the geo (currently country/state level) in the actual impression log. The IP log would be 'shuffled', and no timestamp would be recorded, making it very difficult to 're-assemble' the original IP address with the impression payload. We think this would be a good short-term solution, as it provides a better privacy profile without requiring any client-side changes or new server infrastructure. Thoughts?
Flags: needinfo?(ekr)
Assignee: nobody → oyiptong
Overtaken by bigger discussion.
Flags: needinfo?(ekr)
We looked at the data collection strategies for the data we want to collect, with input from ekr and Watson. In the end, however, we decided not to go forward with any additional data collection due to a change in product strategy.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED