Closed Bug 1132660 Opened 10 years ago Closed 9 years ago

Change to a lua geoip lib based on libmaxminddb

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)

All
Linux
defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: mreid, Assigned: kparlante)

References

Details

(Whiteboard: [unifiedTelemetry][40b9][data-validation])

The https://github.com/agladysh/lua-geoip library uses the old-style ".dat" geo database. We should use a library that can read the GeoIP2 ".mmdb" style database.
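As a rough illustration of what a lookup against a GeoIP2 ".mmdb" database involves, here is a minimal Python sketch (the pipeline in this bug uses Lua; Python is used only for illustration). It assumes a reader object exposing a `get(ip)` method that returns a nested dict or `None`, which is how the reader returned by `maxminddb.open_database(path)` in the Python `maxminddb` package behaves. The helper name `country_code`, the `FakeReader` stand-in, and the sample addresses are hypothetical. The "??" fallback mirrors the unknown marker mentioned in this bug.

```python
UNKNOWN = "??"  # marker this bug uses for an unknown geoCountry


def country_code(reader, ip, unknown=UNKNOWN):
    """Return the ISO country code for `ip`, or `unknown` if there is none.

    `reader` is any object with a get(ip) method that returns a GeoIP2-style
    record (a nested dict) or None when the address is not in the database.
    """
    record = reader.get(ip)
    if not record:
        return unknown
    # GeoIP2 country records keep the code under record["country"]["iso_code"].
    country = record.get("country") or {}
    return country.get("iso_code") or unknown


class FakeReader:
    """Hypothetical stand-in for a real .mmdb reader, for demonstration only."""

    def __init__(self, table):
        self._table = table

    def get(self, ip):
        return self._table.get(ip)


# Made-up records keyed by documentation-range addresses.
reader = FakeReader({
    "192.0.2.1": {"country": {"iso_code": "DE"}},
    "192.0.2.2": {"continent": {"code": "EU"}},  # record without a country
})
print(country_code(reader, "192.0.2.1"))
print(country_code(reader, "192.0.2.2"))
print(country_code(reader, "198.51.100.7"))  # not in the table at all
```

Treating a missing record and a record without a country the same way keeps the "unknown" case in one place, which makes it easier to count later.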
Priority: -- → P3
Bumping up to P2 because I noticed some of my own submissions are showing up with geoCountry == "??"
Priority: P3 → P2
Brendan, could you possibly take a look at the geoCountry info for FHRv2 vs. Unified? In particular, I'm interested in:
- Cases where Unified geolocation differs from FHR, but both are set to specific country codes.
- Cases where Unified geo is "unknown" (denoted by "??"), while FHR is known.
- The ratio of "unknown" to "total" in FHR.
If you don't have bandwidth to take a look, please let me know. Thanks!
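The comparison requested above could be sketched roughly as follows. This is only an illustration in Python, not the analysis that was actually run: it assumes the data has already been reduced to one (FHR geo, Unified geo) pair per client, and that FHR marks unknown countries with the same "??" marker as Unified, which this bug does not confirm. The function names, bucket names, and sample pairs are hypothetical.

```python
from collections import Counter

UNKNOWN = "??"  # Unified's unknown marker; assumed (not confirmed) for FHR too


def classify(fhr_geo, unified_geo, unknown=UNKNOWN):
    """Put one client's (FHR, Unified) country pair into a comparison bucket."""
    if fhr_geo == unified_geo:
        return "match"
    if unified_geo == unknown:
        return "unified_unknown_fhr_known"
    if fhr_geo == unknown:
        return "fhr_unknown_unified_known"
    return "both_known_but_differ"


def summarize(pairs, unknown=UNKNOWN):
    """Return (bucket counts, ratio of unknown to total on the FHR side)."""
    pairs = list(pairs)
    buckets = Counter(classify(f, u, unknown) for f, u in pairs)
    fhr_unknown = sum(1 for f, _ in pairs if f == unknown)
    ratio = fhr_unknown / len(pairs) if pairs else 0.0
    return buckets, ratio


# Made-up sample pairs, one per client: (FHR geo, Unified geo).
sample = [("US", "US"), ("DE", "FR"), ("GB", "??"), ("??", "ES")]
buckets, fhr_unknown_ratio = summarize(sample)
print(dict(buckets))
print(fhr_unknown_ratio)
```

The first two bullets map to the `both_known_but_differ` and `unified_unknown_fhr_known` buckets, and the third to the returned ratio.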
Flags: needinfo?(bcolloran)
I am unlikely to get to this question in the near future. It's now on my radar as something that needs to be looked at, but if someone else can do the looking that'd be great!
Flags: needinfo?(bcolloran)
Blocks: 1169103
Sheeri, do you know what the FHRv2 data looks like from the perspective of Comment 2?
Flags: needinfo?(scabral)
Sorry it took a while to get to this. From what I can gather, the FHR v2 stuff is in tables named:

fhr_rollups_daily_base
fhr_rollups_weekly_base
fhr_rollups_monly_base

We have other tables, but that data stopped being collected in March, so yell if that's more like what you expect.

That being said, here's what the data looks like with a sample entry from 6/1/2015 - the format is field name:sample data. I think what you're looking for is the 8th column down - "geo"....

vendor:Ebon
name:Ebon
channel:default
os:WINNT
osdetail:win7
distribution:
locale:en-US
geo:EU
version:34.0.8.8
isstdprofile:FALSE
stdchannel:other
stdos:Windows
distribtype:mozilla
snapshot:20150601
granularity:day
timeStart:2015-03-05
timeEnd:2015-03-05
tTotalProfiles:0
tExistingProfiles:0
tNewProfiles:0
tActiveProfiles:0
tInActiveProfiles:1
tActiveDays:0
tTotalSeconds:0
tActiveSeconds:0
tNumSessions:0
tCrashes:0
tTotalSearch:0
tGoogleSearch:0
tYahooSearch:0
tBingSearch:0
tOfficialSearch:0
tIsDefault:0
tIsActiveProfileDefault:0
t5outOf7:0
tChurned:0
tHasUP:0

Here's the top 5 geos from 6/1:

dbadmin=> select count(*),geo from fhr_rollups_daily_base where snapshot='20150601' group by geo order by 1 DESC limit 5;
 count | geo
-------+-----
 53900 | US
 39354 | DE
 31423 | GB
 26551 | FR
 26215 | ES
(5 rows)
Flags: needinfo?(scabral)
Summary: Change to a lua geoip lib based on libmaxminddb → Make sure that geoIP lookups are working correctly
Whiteboard: [unifiedTelemetry][b5]
Whiteboard: [unifiedTelemetry][b5] → [unifiedTelemetry][b5][data-validation]
Assignee: nobody → kparlante
Whiteboard: [unifiedTelemetry][b5][data-validation] → [unifiedTelemetry][40b9][data-validation]
Iteration: --- → 42.3 - Aug 10
Iteration: 42.3 - Aug 10 → 43.1 - Aug 24
Updates on this bug:
- We do see some v2/v4 discrepancies when we look at an individual clientId [1]
- Bagheera/v2 and heka/v4 are using the same old-style .dat database [2]
- Both update regularly (heka/v4 updates daily), but they might update at a slightly different cadence
- whd found problems [2] with the current lib [3] while load testing, and as part of that work is recommending that we WONTFIX this bug

[1] See Appendix C https://docs.google.com/document/d/1XLaW7lq-dL6bcd7dixsk2K5F8TSgSrjG5Oy4kzweWLs/edit
[2] https://github.com/mozilla-services/data-pipeline/issues/115
[3] https://github.com/agladysh/lua-geoip
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
I don't understand what the resolution of this bug means. Are you saying that geoip itself is WONTFIX, or that properly testing geoip is WONTFIX? Or is this bug tracking something other than data validation of the geoip bits?
Flags: needinfo?(kparlante)
The original bug was a question about whether or not we needed to upgrade from MaxMind's GeoIP Legacy database (old-style .dat) to MaxMind's GeoIP2 database (.mmdb). This was the actionable decision to make. A summary of the differences is at http://dev.maxmind.com/geoip/geoip2/whats-new-in-geoip2/, and the two have the same accuracy (see the GeoIP FAQ at https://support.maxmind.com/). FWIW, we could upgrade to the precision database if we wanted improved accuracy at the city level.

The bug morphed into a comparison with v2, questioning whether the "legacy" format was less accurate than whatever v2 was using. After comparing the data, looking at what v2 was actually using (the exact same database), researching the differences between the formats, and checking how often we update the database (daily, to the paid, most accurate version), we have no reason to believe we are less accurate than v2, or that we have any problems with accuracy.

The exploration we embarked upon is done, and we're proposing we WONTFIX the change to the GeoIP mmdb.
Flags: needinfo?(kparlante)
Summary: Make sure that geoIP lookups are working correctly → Change to a lua geoip lib based on libmaxminddb
Product: Cloud Services → Cloud Services Graveyard