Open Bug 1290256 Opened 8 years ago Updated 2 years ago

Blindly trusting HTTP 200 response code in Telemetry submission does not guarantee data delivery

Categories

(Toolkit :: Telemetry, defect, P3)

defect
Points:
2

Tracking

()

People

(Reporter: gps, Unassigned)

References

Details

(Whiteboard: [measurement:client])

When Telemetry submits pings to the server, it assumes that an HTTP 200 response code means the ping was submitted successfully:

https://hg.mozilla.org/mozilla-central/file/9ec789c0ee5b/toolkit/components/telemetry/TelemetrySend.jsm#l946

An HTTP 200 response code is *not* sufficient to assume success. There are a number of failure scenarios where an HTTP 200 response can be observed:

* Captive portal intercepts all requests and renders login page
* Temporary DNS failure results in ISP/network rendering a static HTML page
* Misbehaving HTTP proxies and other network devices (including network devices on Mozilla's network)
* Weird application level failures on the Telemetry server (I've seen lots of strange behavior of servers/apps under load over the years)

If the client sees a 200 in these or other failure scenarios, it could result in Telemetry not receiving all data. I shouldn't have to state how serious that can be.

Yes, TLS should help in a number of scenarios. But if there is a corporate controlled CA cert loaded to MitM all connections, all bets are off. (Although we may be pinning the CA cert somewhere - if we are it isn't in this file AFAICT.)

You shouldn't take the risk that data was never delivered. There needs to be some kind of stronger "protocol" between the client and server that the server actually successfully received the incoming payload. Ensuring the response body is "OK" would be a good start. Including a randomly generated identifier in the HTTP request and having the server echo it back out in the HTTP response would be even better.
I confirmed the old FHR code was susceptible to this issue as well. This /could/ account for some of the orphaned records we saw in FHR, as blindly trusting the 200/201 status code as FHR did would result in the client discarding old document IDs that were part of the HTTP request.

I vaguely recall discussions of this blind trust issue from when I initially implemented FHR. I'm not sure if there is an old bug on file. But I do have vague memories of discussions with the Bagheera server maintainers about adding some kind of server-side acknowledgement to requests so this wouldn't be an issue.
At one point there was also a failure in Bagheera where it would return success before ensuring that both deletion and insertion operations were durably enqueued. It's wise to make sure the same is true whenever deploying a service!
Good point, we should do something here.
I think a simple signal is sufficient though, like checking the servers response body for containing a known text or UUID.

Note that orphaning does not affect Telemetry heavily.
We are looking at aggregate data mostly and the pings can be processed independently of each other; so some missing pings for individual clients are not an issue.

I expect this to increase the duplicate ping submissions a bit.

Mark, any opinions here?
Flags: needinfo?(mreid)
Priority: -- → P3
Whiteboard: [measurement:client]
The server currently responds with a body of "OK" - that is not explicitly mentioned in the Edge spec[1], but it could (and should) be added so we have a baseline for an expected response to check.

I am hesitant to add anything more complex than this to handle the case where someone or something has interrupted the normal request/response cycle, since anything we do to add another layer of confirmation could simply be duplicated by an attacker.

I suggest we begin by checking the response body for "OK", and incrementing a scalar count of cases where we received a 2XX response but did /not/ see a body of "OK" (to be reported in Telemetry as usual).

If we see a significant number of these cases, then the matter warrants further investigation. If we see exceedingly few or zero of these cases, then it is unlikely that further effort will be fruitful.


[1] https://wiki.mozilla.org/CloudServices/DataPipeline/HTTPEdgeServerSpecification
Flags: needinfo?(mreid)
I agree that anything more complex seems not useful now.
I wonder if a response body of "OK" is unique enough though, i.e. if any scenarios also return this content.
If we start shipping clients with a check for "OK" and we then find out we need a more unique response body, we'd have to start sending different response bodies to different client versions.

What do you think about adding some simple, but more unique, content to the response body?
Like "OK\n<some constant uuid>"?
Flags: needinfo?(mreid)
What do you think about just checking if responses begin with "OK"?  Then no change required to the server / spec, and it leaves us free to make the response body check more specific in the future without having to send special responses to specific client versions.
Flags: needinfo?(mreid)
That sounds good, lets do that.
See Also: → 1502378
I have started work on a proposal: https://docs.google.com/document/d/1bQkkbLhWT52y-BaXczCbdeSVcBoZeJ4z9BjHvYJdAaQ/edit#
Assignee: nobody → chutten
Status: NEW → ASSIGNED
Points: --- → 2
Priority: P3 → P2
Deprioritizing as it appears that, by bug 1503988's metric at least, that the rate of gaps and dupes is low enough to not warrant immediate action.
Assignee: chutten → nobody
Status: ASSIGNED → NEW
Priority: P2 → P3
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.