1116217 - investigate too many SEARCH_SERVICE_COUNTRY_SUCCESS zeroes

:Gavin Sharp [email: gavin@gavinsharp.com]

Reporter

Description

•

10 years ago

http://telemetry.mozilla.org/#filter=nightly%2F37%2FSEARCH_SERVICE_COUNTRY_SUCCESS&aggregates=multiselect-all!Submissions&evoOver=Builds&locked=true&sanitize=true&renderhistogram=Graph If I'm reading this data right, this is showing 62% of submissions had SEARCH_SERVICE_COUNTRY_SUCCESS=0. This is worrying! - SEARCH_SERVICE_COUNTRY_FETCH_TIMEOUT suggests only 17% of submissions encountered a time out - SEARCH_SERVICE_COUNTRY_SUCCESS_WITHOUT_DATA suggests 0% of submissions hit that case Some possibilities: - we're hitting the generic onerror case ~45% of the time. We should probably add some more specific telemetry for possible onerror cases? - I'm not interpreting this data correctly (is it possible we're mis-counting this data? I'm not sure how boolean histograms work exactly)

Hanno Schlichting

Comment 1

•

10 years ago

Can't speak on how interpret the SEARCH_SERVICE_COUNTRY_SUCCESS stats. But related, if I'm reading the SEARCH_SERVICE_COUNTRY_FETCH_MS data correctly, the 95th percentile is close to the 2000ms timeout value and sometimes above it. Maybe we should increase the timeout to 3000ms or more here? Almost all of this time is latency in DNS resolution and SSL connection setup - the response time on the service end (once it hits the load balancer) is more in the 10ms range. On the service end I see no noticeable error responses, almost 99% of the responses are 200 OK. But since we landed the patch on nightly, we are also getting a pretty constant ~6 requests / second from IP addresses out of Google Compute Engine. This currently accounts for 93% of all requests to the country API. Do we use GCE for any of our build or test infrastructure? If so, some tests might now call out to the live service when they shouldn't.

Mark Hammond [:markh] [:mhammond]

Comment 2

•

10 years ago

(In reply to Hanno Schlichting [:hannosch] from comment #1) > Can't speak on how interpret the SEARCH_SERVICE_COUNTRY_SUCCESS stats. Me either, although: > On the service end I see no noticeable error responses, almost 99% of the > responses are 200 OK. Note that SEARCH_SERVICE_COUNTRY_SUCCESS *includes* all responses from the server rather than just 200 (the code doesn't even check the response code - it just looks for country_code in the json). However, SEARCH_SERVICE_COUNTRY_SUCCESS_WITHOUT_DATA will almost certainly capture the rest (and we are seeing < 1% of them). So SEARCH_SERVICE_COUNTRY_SUCCESS=0 will tend to be errors that prevented us hitting the server or a network error reading the response. > But since we landed the patch on nightly, we are also > getting a pretty constant ~6 requests / second from IP addresses out of > Google Compute Engine. This currently accounts for 93% of all requests to > the country API. Do we use GCE for any of our build or test infrastructure? > If so, some tests might now call out to the live service when they shouldn't. Interesting. Most of our test infrastructure aborts on access to the network - but I'm certainly not an expert on what infra we have :) But I wonder if something similar is causing SEARCH_SERVICE_COUNTRY_SUCCESS=0 - eg, something automated that (a) can't reach the servers but (b) does report telemetry. Sounds unlikely (and very bad), but that's the only theory I've come up with so far :( > But related, if I'm reading the SEARCH_SERVICE_COUNTRY_FETCH_MS data > correctly, the 95th percentile is close to the 2000ms timeout value and > sometimes above it. Maybe we should increase the timeout to 3000ms or more > here? Almost all of this time is latency in DNS resolution and SSL > connection setup - the response time on the service end (once it hits the > load balancer) is more in the 10ms range. The trade-off is that we don't want to be waiting when the user interacts with search as that causes janky delays. But you are completely correct, and bug 1116404 is for a better technique here (and should also give us measurements even when we consider the request timed-out).

Hanno Schlichting

Comment 3

•

10 years ago

We have the first telemetry stats for the newer SEARCH_SERVICE_COUNTRY_FETCH_RESULT and SEARCH_SERVICE_COUNTRY_FETCH_TIME_MS probes introduced in https://bugzilla.mozilla.org/show_bug.cgi?id=1116404. Data is for Nightly 38, ~2k submissions, #1116404 isn't in a beta build yet and Aurora 37 only has ~200 submissions. Fetch result: - 72% success (code 0) - 14% success without data (code 1) - 1% xhr timeout (timeout at 100 seconds, code 2) - 13% error (code 3) To figure out what's going on with the errors, we need https://bugzilla.mozilla.org/show_bug.cgi?id=1116383. The success without data might be overrepresented. If we cannot determine the country for some users IP address, and that users quits his browser and restarts it, does it ping the service again? In the successful case the country is written into the profile, but what happens in the unsuccessful case? Compared to earlier we have more success without data cases than errors (used to be 30-40% errors, almost no "without data"). We did one fix on the service end, which helped here. The underlying geoip database had some fake country entries in it, for example AN1 for "anonymous network". These did result in 500 errors on the service end. We fixed those and they now report a proper 404 not found. Fetch time: - 1.5 seconds median - 2.6 seconds 75th percentile - 6-13 seconds 95th percentile Based on those, I'd say increasing browser.search.geoip.timeout from 2 seconds to 3 seconds might still be worth it.

Mark Hammond [:markh] [:mhammond]

Comment 4

•

10 years ago

(In reply to Hanno Schlichting [:hannosch] from comment #3) > Fetch result: > > - 72% success (code 0) > - 14% success without data (code 1) > - 1% xhr timeout (timeout at 100 seconds, code 2) > - 13% error (code 3) > > To figure out what's going on with the errors, we need > https://bugzilla.mozilla.org/show_bug.cgi?id=1116383. The success without > data might be overrepresented. If we cannot determine the country for some > users IP address, and that users quits his browser and restarts it, does it > ping the service again? In the successful case the country is written into > the profile, but what happens in the unsuccessful case? In any case where we don't get a country code we try again on next startup. I assume you are saying that in the "we cannot determine the country for some users IP address" case we end up in the "success without data" state and will continue to retry forever. To mitigate this, we could ask the server to return an authoritative "don't know" response, although the downside of that is that on next start the service might be able to offer a country code (eg, connected via a different network) and we then lose the ability to have a real answer. Hanno, do you have stats for how often this case occurs? > Compared to earlier we have more success without data cases than errors > (used to be 30-40% errors, almost no "without data"). We did one fix on the > service end, which helped here. The underlying geoip database had some fake > country entries in it, for example AN1 for "anonymous network". These did > result in 500 errors on the service end. We fixed those and they now report > a proper 404 not found. From what you describe, that server change should have no impact - both a 500 and 404 are treated identically - presumably "success without data" assuming no valid payload in the responses. > Fetch time: > > - 1.5 seconds median > - 2.6 seconds 75th percentile > - 6-13 seconds 95th percentile > > Based on those, I'd say increasing browser.search.geoip.timeout from 2 > seconds to 3 seconds might still be worth it. Our main concern with that timeout is that the search service is forced to initialize synchronously during this timeout. SEARCH_SERVICE_COUNTRY_FETCH_CAUSED_SYNC_INIT tracks this, and currently 147 submissions from a total of ~10k (~2%) are hitting this, so I'd be inclined to agree - I opened bug 1123972.

Flags: needinfo?(hschlichting)

Hanno Schlichting

Comment 5

•

10 years ago

(In reply to Mark Hammond [:markh] from comment #4) > (In reply to Hanno Schlichting [:hannosch] from comment #3) > In any case where we don't get a country code we try again on next startup. > I assume you are saying that in the "we cannot determine the country for > some users IP address" case we end up in the "success without data" state > and will continue to retry forever. To mitigate this, we could ask the > server to return an authoritative "don't know" response, although the > downside of that is that on next start the service might be able to offer a > country code (eg, connected via a different network) and we then lose the > ability to have a real answer. > > Hanno, do you have stats for how often this case occurs? That rate has been fairly stable at 0.5% of all API lookups (15-20 successful requests / second, 0-0.1 unsuccessful ones). And since there are likely repeating users amongst the unsuccessful API lookups the number of actual users this refers to is less than that. Since it's such a small fraction, I'd say we should leave things as is. We just have to keep this in mind when interpreting stats. Over time the percentage number will grow, since all the successful cases won't call the service anymore. > From what you describe, that server change should have no impact - both a > 500 and 404 are treated identically - presumably "success without data" > assuming no valid payload in the responses. Mmh, there goes my theory for explaining the drop in the error rate. I'm out of ideas for this one, hoping the more fine-grained error stats might give us some hints.

Flags: needinfo?(hschlichting)

Mark Hammond [:markh] [:mhammond]

Comment 6

•

10 years ago

(In reply to Hanno Schlichting [:hannosch] from comment #3) > We did one fix on the > service end, which helped here. The underlying geoip database had some fake > country entries in it, for example AN1 for "anonymous network". These did > result in 500 errors on the service end. We fixed those and they now report > a proper 404 not found. This is more related to bug 1123974, but the context is here, so I'm asking here :) * Gavin, if we do get back an authoritative "don't know the country code", should we store an empty value (ie, treat that as authoritative), or is it OK to not store anything and thus have that profile always retry the geoip next startup? * Hanno, it seems to me that a 404 is somewhat ambiguous - it might imply either the end-point wasn't found (which we probably should consider a transient error and try again next startup) or the countryCode can't be determined (which, as above, we could consider storing to avoid repeatedly getting the same answer every startup). Is there any way to differentiate those 2 cases? There are obviously trade-offs in both cases, but I'm not sure what the correct trade-off actually is.

Flags: needinfo?(hschlichting)

Flags: needinfo?(gavin.sharp)

Hanno Schlichting

Comment 7

•

10 years ago

(In reply to Mark Hammond [:markh] from comment #6) > * Gavin, if we do get back an authoritative "don't know the country code", > should we store an empty value (ie, treat that as authoritative), or is it > OK to not store anything and thus have that profile always retry the geoip > next startup? I think it's ok to retry geoip. The users IP address might change, so the next lookup could be successful. Or the geoip database might change and start including an entry for the users IP. We cannot know either of these upfront, but need to do a geo lookup again. As I said earlier, this seems to apply to about 0.5% of all geoip requests. > * Hanno, it seems to me that a 404 is somewhat ambiguous - it might imply > either the end-point wasn't found (which we probably should consider a > transient error and try again next startup) or the countryCode can't be > determined (which, as above, we could consider storing to avoid repeatedly > getting the same answer every startup). Is there any way to differentiate > those 2 cases? Yes, it's ambiguous and there's no way to distinguish the two. Any 4xx error needs to be treated as a transient failure and should be retried. We choose to use the Google geolocation API spec for all of this, so we are bound by it to stay compatible. Using that API for geolocation and the same for the country API has advantages overall. The downside is that we cannot change or improve the API.

Flags: needinfo?(hschlichting)

:Gavin Sharp [email: gavin@gavinsharp.com]

Reporter

Comment 8

•

10 years ago

(In reply to Mark Hammond [:markh] from comment #6) > * Gavin, if we do get back an authoritative "don't know the country code", > should we store an empty value (ie, treat that as authoritative), or is it > OK to not store anything and thus have that profile always retry the geoip > next startup? Retrying again seems reasonable, as long as that doesn't cause servers to melt. Seems like it should be fine.

Flags: needinfo?(gavin.sharp)

Hanno Schlichting

Comment 9

•

10 years ago

Good news on SEARCH_SERVICE_COUNTRY_FETCH_RESULT, error rates on release are much lower than those on beta. Beta 36 had: 26% error, 3% xhr timeout Release 36 has: 6% error, 1% xhr timeout We still don't know what these errors are. https://bugzilla.mozilla.org/show_bug.cgi?id=1116383 was supposed to help with that. "Result without data" is in the 1% range, which matches the stats I have on the service end.

Dave Townsend [:mossop]

Updated

•

9 years ago

Priority: -- → P3

Whiteboard: [fxsearch]

:shell escalante

Updated

•

9 years ago

Rank: 35

Mark Banner (:standard8)

Comment 10

•

5 years ago

We removed this probe in bug 1485116.

Status: NEW → RESOLVED

Closed: 5 years ago

Resolution: --- → WONTFIX

Bugzilla

investigate too many SEARCH_SERVICE_COUNTRY_SUCCESS zeroes

Categories

(Firefox :: Search, defect, P3)

Tracking

()

People

(Reporter: Gavin, Unassigned)

References

Details

(Whiteboard: [fxsearch])

Crash Data

Security

(public)

User Story

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Updated

Updated

Comment 10