Closed
Bug 1158225
Opened 10 years ago
Closed 9 years ago
Explore anonymizing raw data immediately by converting IP to geo
Categories
(Content Services Graveyard :: Tiles, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Mardak, Assigned: oyiptong)
It sounds like our current 7 day policy of keeping raw data is raising some eyebrows with the additional data that we would be getting with Suggested Tiles. This behavior is within our strict policy, but we could potentially do better.
oyiptong/tspurway, does it seem reasonable to do GeoIP conversion immediately and strip off IP addresses? I would assume we would still only save the raw-ip+geo data for 7 days.
What problems might this cause or prevent us from solving? (And what are potential solutions that allow for immediate anonymization and avoiding those problems?)
maksik had a similar request in the context of sending some data. Was it for randomized response?
Comment 1 (Reporter) • 10 years ago
One potential issue is supporting bug 1156993, where it might be useful to track down individual IP addresses. But even then, we could potentially find an offending geo and then drill in further. Alternatively, we could keep a count of accesses per IP address within the 7 day window, so that we can't directly associate the IP address with payloads. That said, if an IP address X sent 1 million requests and we see that country Y had a spike of 1 million clicks, we could probably guess it was the same source.
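As an illustration of the per-IP counting idea above, here is a minimal Python sketch. The class name `DailyIpCounter`, its methods, and the day-bucketed layout are all hypothetical and not part of onyx. It assumes requests are recorded in non-decreasing date order, and it stores only counts, never payloads.

```python
from collections import Counter, deque
from datetime import date, timedelta


class DailyIpCounter:
    """Hypothetical helper: per-IP request counts bucketed by day.

    Buckets older than the retention window are dropped, so no count
    outlives the window. No payload data is stored alongside the IP.
    """

    def __init__(self, window_days=7):
        self.window_days = window_days
        self.buckets = deque()  # (day, Counter) pairs, oldest first

    def record(self, ip, day):
        # Assumes `day` never goes backwards between calls.
        if not self.buckets or self.buckets[-1][0] != day:
            self.buckets.append((day, Counter()))
        self.buckets[-1][1][ip] += 1
        # Keep only the most recent `window_days` days, including `day`.
        cutoff = day - timedelta(days=self.window_days - 1)
        while self.buckets and self.buckets[0][0] < cutoff:
            self.buckets.popleft()

    def totals(self):
        # Sum the per-day counters into one count per IP.
        total = Counter()
        for _, counts in self.buckets:
            total.update(counts)
        return total
```

A fraud check could then look at `totals().most_common(10)` for unusually active addresses. As noted above, a very large count can still be correlated with a matching spike in one geo, so this reduces direct linkage without eliminating inference.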
Comment 2 • 10 years ago
One tactic might be to change our current impression log to exclude the IP address (and substitute the GEO data), then create another log file that just has the IP address (and maybe a date).
You couldn't credibly match the IP log with the impression log, as the IP log has no timestamp, and we could choose a buffering/scrambling logging algorithm that would easily defeat any cross-referencing attempt.
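ipbuffer = []
threshold = random(1, 25000)          # pick the flush size once per batch
loop:
    ping = get_incoming_ping()
    ip = ping.IP
    geo = get_geo(ping)
    log(impressions, ping - ip, geo)  # impression log: payload plus geo, no IP
    ipbuffer.append(ip)
    if len(ipbuffer) >= threshold:
        log(ips, shuffle(ipbuffer))   # IP log: shuffled batch, no timestamps
        ipbuffer = []                 # start a new batch
        threshold = random(1, 25000)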
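For reference, the pseudocode above could be sketched as runnable Python roughly as follows. This is only an illustration under stated assumptions: `SplitLogger`, `handle`, `flush`, and the in-memory lists standing in for the two log files are hypothetical names, pings are assumed to be dicts with an `"ip"` key, and `geo_lookup` is a caller-supplied function rather than any real GeoIP API.

```python
import random


class SplitLogger:
    """Hypothetical sketch: impressions and IPs go to separate logs."""

    def __init__(self, geo_lookup, max_batch=25000, rng=None):
        self.geo_lookup = geo_lookup      # assumed: maps an IP string to a geo label
        self.max_batch = max_batch
        self.rng = rng or random.Random()
        self.impressions = []             # stand-in for the impression log
        self.ip_batches = []              # stand-in for the IP-only log
        self._buffer = []
        # Draw the flush size once per batch, not on every ping.
        self._threshold = self.rng.randint(1, max_batch)

    def handle(self, ping):
        # Impression log entry: everything except the IP, plus the geo.
        payload = {k: v for k, v in ping.items() if k != "ip"}
        payload["geo"] = self.geo_lookup(ping["ip"])
        self.impressions.append(payload)
        self._buffer.append(ping["ip"])
        if len(self._buffer) >= self._threshold:
            self.flush()

    def flush(self):
        # IP log entry: a shuffled batch with no timestamps or payloads.
        if self._buffer:
            self.rng.shuffle(self._buffer)
            self.ip_batches.append(self._buffer)
            self._buffer = []
        self._threshold = self.rng.randint(1, self.max_batch)
```

Drawing the threshold once per batch avoids the situation where a fresh random value is compared on every ping and the buffer could grow past the intended maximum. A real deployment would also want to call `flush()` on shutdown so buffered addresses are not lost.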
Comment 3 • 9 years ago
:ekr
On our way to implementing the encrypted trusted third-party proxy for Suggested Tiles reporting, we have come up with the above scheme to dissociate the IP address from the other data collected by onyx and to randomize its order in the logs. The idea would be to track IP addresses in their own log (for fraud detection), and only insert the GEO (currently Country/State level) in the actual impression log. The IP log would be shuffled, and no timestamp would be recorded, making it very difficult to re-associate the original IP address with an impression payload.
We think this would be a good short-term solution, as it provides a better privacy profile without requiring any client-side changes or new server infrastructure.
Thoughts?
Flags: needinfo?(ekr)
Updated (Assignee) • 9 years ago
Assignee: nobody → oyiptong
Comment 5 (Assignee) • 9 years ago
We looked at the data collection strategies with input from ekr and Watson for the data we want to collect.
In the end, however, we decided not to go forward with any additional data collection due to a change in product strategy.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED