
add a Content-MD5 header to telemetry pings

RESOLVED INCOMPLETE

Status

Product: Toolkit
Component: Telemetry
Status: RESOLVED INCOMPLETE
Reported: 5 years ago
Last updated: 5 years ago

People

(Reporter: froydnj, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments)

(Reporter)

Description

5 years ago
Some end-to-end verification that what we send actually arrives as-is on the server would be good.

Comment 1

5 years ago
(In reply to Nathan Froyd (:froydnj) from comment #0)
> Some end-to-end verification that what we send actually arrives as-is on the
> server would be good.

It's a bikeshed, but can we use SHA1, SHA256, or a CRC?  There's no reason to use a totally broken secure hash function.

Also, thisbug++.  :)
(Reporter)

Comment 2

5 years ago
(In reply to Justin Lebar [:jlebar] from comment #1)
> (In reply to Nathan Froyd (:froydnj) from comment #0)
> > Some end-to-end verification that what we send actually arrives as-is on the
> > server would be good.
> 
> It's a bikeshed, but can we use SHA1, SHA256, or a CRC?  There's no reason
> to use a totally broken secure hash function.

I suggested Content-MD5 because it was standard, but I don't see why we couldn't send X-Content-SHA{1,256}, since there are limited interoperability concerns here.  I think the only constraint is that nsICryptoHash support the chosen hash.
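
For reference, a rough sketch of what computing such a digest with nsICryptoHash could look like (the SHA-256 choice, the X-Content-SHA256 header name, and the pingString/request variables are illustrative assumptions, not taken from the eventual patch):

  // Illustrative sketch only: hash the serialized ping and attach the
  // base64 digest as a request header.  nsICryptoHash also supports MD5,
  // SHA-1, SHA-384, and SHA-512.
  const Cc = Components.classes, Ci = Components.interfaces;

  let converter = Cc["@mozilla.org/intl/scriptableunicodeconverter"]
                    .createInstance(Ci.nsIScriptableUnicodeConverter);
  converter.charset = "UTF-8";
  let bytes = converter.convertToByteArray(pingString, {});   // pingString: the JSON ping

  let hasher = Cc["@mozilla.org/security/hash;1"]
                 .createInstance(Ci.nsICryptoHash);
  hasher.init(hasher.SHA256);
  hasher.update(bytes, bytes.length);
  let digest = hasher.finish(true);   // true => base64-encode the result

  request.setRequestHeader("X-Content-SHA256", digest);   // request: the ping's XMLHttpRequest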

Comment 3

5 years ago
> I suggested Content-MD5 because it was standard

Oh; that's a good reason.  Sigh.

It looks like Content-SHA might be a standard as well; I can't really tell.
(Reporter)

Comment 4

5 years ago
(In reply to Justin Lebar [:jlebar] from comment #3)
> > I suggested Content-MD5 because it was standard
> 
> Oh; that's a good reason.  Sigh.
> 
> It looks like Content-SHA might be a standard as well; I can't really tell.

It looks like various people have proposed Content-Digest, with the digest indicated in the header; X-Content-Digest seems to be used in various bits of web middleware.  Googling for these terms is mostly unhelpful, since Google searches for something like "Content-SHA OR Content OR SHA".

In any event, I don't think nsICryptoHash makes it difficult to switch algorithms.
(Reporter)

Comment 5

5 years ago
Created attachment 631408 [details] [diff] [review]
part 1 - add the X-Content-Digest header

The TelemetryPing modifications.
Attachment #631408 - Flags: review?(taras.mozilla)
(Reporter)

Comment 6

5 years ago
Created attachment 631409 [details] [diff] [review]
part 2 - make tests more amenable to checksumming

This patch makes the tests more amenable to checksumming, by making sure that all JSON decoding works with a string read from the request's stream, rather than one arm decoding from a string and the other from the stream.  Separated out for ease of review.
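
For illustration, a minimal sketch of that idea in the test's request handler (decodeRequestBody is a hypothetical helper, not the patch's decodeRequestPayload, and the gzip'd case is ignored):

  // Sketch: read the request body into a string once, then JSON-parse and
  // checksum that same string rather than decoding each arm differently.
  const Cu = Components.utils;
  Cu.import("resource://gre/modules/NetUtil.jsm");

  function decodeRequestBody(request) {
    let stream = request.bodyInputStream;
    let body = NetUtil.readInputStreamToString(stream, stream.available());
    let payload = JSON.parse(body);
    return [payload, body];   // the same string can later be hashed
  }
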
Attachment #631409 - Flags: review?(taras.mozilla)
(Reporter)

Comment 7

5 years ago
Created attachment 631411 [details] [diff] [review]
part 3 - tests for X-Content-Digest

Now we can sanely check for X-Content-Digest in the tests.
Attachment #631411 - Flags: review?(taras.mozilla)

Comment 8

5 years ago
Comment on attachment 631411 [details] [diff] [review]
part 3 - tests for X-Content-Digest

+  let payload, declaredBase64Digest, computedBase64Digest;
+  [payload, declaredBase64Digest, computedBase64Digest] = decodeRequestPayload(request);
 
you can do let [x,y,z]=...
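
Spelled out, the suggestion applied to the two added lines:

  // Declare and destructure in one statement:
  let [payload, declaredBase64Digest, computedBase64Digest] =
    decodeRequestPayload(request);
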
Attachment #631411 - Flags: review?(taras.mozilla) → review+

Updated

5 years ago
Attachment #631409 - Flags: review?(taras.mozilla) → review+

Comment 9

5 years ago
Comment on attachment 631408 [details] [diff] [review]
part 1 - add the X-Content-Digest header

r- because this should sha before gzip.
Attachment #631408 - Flags: review?(taras.mozilla) → review-

Comment 10

5 years ago
(In reply to Taras Glek (:taras) from comment #9)
> Comment on attachment 631408 [details] [diff] [review]
> part 1 - add the X-Content-Digest header
> 
> r- because this should sha before gzip.

In crypto, you always encrypt then MAC.  Isn't the same rule here to compress and then hash?

It probably doesn't matter if you send gzip(data) | sha(data).  But if you instead did gzip(data|sha(data)), that would be bad.

Comment 11

5 years ago
I guess the sha is going in the HTTP header, which isn't compressed, so we're doing gzip(data) | sha(data), which should be fine.
(Reporter)

Comment 12

5 years ago
FWIW, the HTTP spec for the Content-MD5 header, which I was patterning this after, says:

"The MD5 digest is computed based on the content of the entity-body, including any content-coding that has been applied, but not including any transfer-encoding applied to the message-body. If the message is received with a transfer-encoding, that encoding MUST be removed prior to checking the Content-MD5 value against the received entity."

The transfer-encoding is not relevant to our purposes here; the content-coding that we're applying is gzip.  So if we're going to "follow" the spec, then hashing after gzipping is correct, assuming I'm reading the spec correctly.
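
Concretely, reading the spec that way means the digest covers the bytes that actually go out on the wire. A minimal sketch of that ordering (gzipCompress, base64Digest, pingJSON, and request are hypothetical stand-ins, not the patch's actual code):

  // Hash-after-gzip: apply the content-coding first, then digest the
  // compressed bytes -- exactly the entity-body the server receives.
  let compressed = gzipCompress(pingJSON);   // content-coding (gzip) applied first
  let digest = base64Digest(compressed);     // digest of the gzip'd body
  request.setRequestHeader("X-Content-Digest", digest);
  request.send(compressed);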

Comment 13

5 years ago
Xavier, we are looking to add checksums to our telemetry packets (part of the end-to-end validation principle). Can your infrastructure check the checksum of gzipped ping content? If so, I can r+ Nathan's existing patch.

Comment 14

5 years ago
Technically we could validate the checksum gzip'd or not, I think. Isn't all of this redundant to TCP checksums, though?

That aside, I also have serious concerns about the amount of CPU that would be consumed on the server between gzip and something like SHA-1 on a per-request basis.

Comment 15

5 years ago
(In reply to Xavier Stevens [:xstevens] from comment #14)
> Technically we could validate the checksum gzip'd or not I think. Isn't all
> of this redundant to TCP checksums though? 

TCP checksums are only 16 bits long, which is pretty lame.  They're also not end-to-end; data could be corrupted in a client's networking stack, for example.

> That aside, I also have serious concerns about the amount of CPU that would
> be consumed on the server between GZIP and something like SHA-1 on a per
> request basis.

The pings are already sent gzip'ed, right?

I'd expect hashing to be cheap compared to parsing JSON.  But you don't even have to do the hashes as requests come in; you can batch them and do them later as a low-priority job.

If we really care about CPU, we should zip then hash, but honestly, if we're optimizing for the amount of data being hashed in our DC, I think we're doing something wrong.

Comment 16

5 years ago
Justin,

Actual corruption in the TCP stack should be really, really rare.

Yes, the payload is gzip'd already. That is actually a benefit since it reduces the data size significantly.

If we really need to do this then we could use MD5, but I still think it's unnecessary.

Comment 17

5 years ago
I've understood that in distributed systems, end-to-end checksums are important if you care about data integrity.  Do you disagree?

There are lots of possible sources of corruption in a system, both within and without the networking stack.  Indeed, we've observed plenty of corruption in our telemetry data.  It's not clear to me whether that corruption happens before or after the data goes on the wire, but checksums implemented here would rule out one class of sources of corruption.

I guess I don't understand the reasons behind the push-back here.  Is the concern simply about wasted CPU cycles?

Comment 18

5 years ago
I just feel that the checksums included with the protocol are sufficient. These servers do more than telemetry as well, and if other projects are green-lit, then doing thousands of MD5 sums every second is really wasteful.

I think our separate process that validates the JSON structure and payload is the best place to focus our efforts in correcting the problems we see, and that process runs on the cluster (offline) rather than in the server itself.

Comment 19

5 years ago
Like I said in comment 15, there is no reason you have to compute these checksums online.

But in general, how can you validate the payload against corruption if you don't have a checksum?  You can validate that the payload meets some of your expectations, sure, but we agree that there are plenty of types of corruption you can't guard against, right?

Why are we doing any validation at all if we think the checksums in TCP are sufficient?

Comment 20

5 years ago
We do validation only to make sure we were sent valid JSON in the server. It's a really loose condition because we use the same data router for several projects.

We then added the offline checks for Telemetry specifically because we saw cases where we received valid JSON (technically), but the JSON was not in the expected format for Telemetry during analytical processing.

The server already rejects invalid JSON outright, and we currently see little to none of it.

Comment 21

5 years ago
> We then added the offline checks for Telemetry specifically because we saw cases where we
> received valid JSON (technically), but the JSON was not in the expected format for Telemetry
> during analytical processing.

If there are sources of corruption which cause us to receive valid JSON but with an unexpected format, do you think it's possible that there are sources of corruption which cause us to receive valid JSON with a valid format, but the incorrect values?

Comment 22

5 years ago
The only case where I could see that happening is a malicious actor, in which case an MD5 checksum won't prevent it from happening.

Comment 23

5 years ago
Do you think that the corruption observed by the telemetry-specific JSON checks is all due to malicious actors?  (Surely not, right?)  At least some of it, if not all of it, is due to corruption somewhere between the data being created on the client and being read in the server, right?

So what I was asking in comment 21 is, if such sources of non-malicious corruption exist, could they not modify the telemetry data sent in such a way that the packet remains valid according to our schema checks?  Surely you're not saying that random corruption can change "foo" to "goo" (invalid schema) but not "2" to "0" (valid schema, modified data).

Comment 24

5 years ago
Nathan, FWIW I think hashing the unzipped text is probably better from an end-to-end point of view.  Metrics could, if they desired, check the hash immediately before processing the unzipped JSON.  That would cover storage errors, etc., on their end.

Comment 25

5 years ago
Comment on attachment 631408 [details] [diff] [review]
part 1 - add the X-Content-Digest header

Changing this to r+ based on the fact that this behavior matches the header spec and there is no technical reason preventing the telemetry server from verifying the checksum in this form.
Attachment #631408 - Flags: review- → review+

Comment 26

5 years ago
>Do you think that the corruption observed by the telemetry-specific JSON checks is all due to malicious actors?

Absolutely not.

I do doubt, though, that something would get corrupted during transmission in such a way that it was still valid gzip and valid JSON.  That means that if you validate pre-submission on the client and we validate the JSON structure offline, MD5 is a zero-gain proposition.

Comment 27

5 years ago
What we have seen so far are cases where the data generated by Telemetry and put into the JSON payload violates the contract of what we expect Telemetry to put into the payload.

In a couple of cases, it is believed that the cause of that invalid data was corruption due to an unstable application state such as a crash.  In most cases, it is due to changes in probe definitions or bugs in the Telemetry data generation code.

A checksum won't call those out.  It will build a checksum of the invalid data, and that checksum will match what we compute on our side (unless there is also a bug in the checksumming implementation on either side).

It is possible for data corruption to occur after the construction and checksumming of the payload but before it is sent out on the wire.  I don't personally feel that is a very likely event, but it certainly wouldn't hurt to add some sort of checksum to the payload to at least try to measure the frequency of that event.

It will not be a trivial thing to implement the comparison of the checksum on the server side though.

If we wait and do the comparison out-of-band, then we need to make sure we have a way to reconstruct exactly what the payload looked like when it was received.  Currently, we blend in data about the request with the payload when it is stored.

If we tried to do the comparison on the fly, then we add more computational overhead to the system.  As Xavier mentioned above, the system is a shared resource, and we need to ensure that we have the capacity to meet the needs of not only Telemetry but also the other projects using this system.  Our capacity planning did not take this additional overhead into account, and there are many ways that Telemetry can cause an increase in load on the system:
* Telemetry Persistence being re-enabled
* Telemetry shifting to opt-out for non-release builds
* More histograms
* Additional payload sections similar to slow-sql or chromehangs

We have to do our best to gauge whether any and all of these things will add to the bandwidth, storage, and computational requirements of the system, and plan out how the cluster will have to grow to accommodate them.

We aren't pushing back just for the sake of nit-picking, but we do have to make sure we aren't going to run into trouble and have to scramble to get more resources.  Our original budgeted capacity for this cluster has already had most of the headroom taken away by modifications to Telemetry and other systems over time (and the lack of any data-expunging policy for Telemetry), and we can't get better at communicating how these changes impact our capacity planning unless we understand the effects of suggested and planned changes.

Comment 28

5 years ago
I know I am late to the game here, but it seems to me that a checksum in the header is only going to catch corruption over the wire, in which case the JSON would be corrupt anyway.  I see no information gain achieved by sticking a checksum in the header.  Just my 2 cents.

Comment 29

5 years ago
I should note, this is a nice-to-have feature for the telemetry ping.  We need the checksum for something else, so we might as well send it in the header if we ever deem over-the-wire corruption to be a problem.  I agree that, between TCP checksums, the gzip CRC, and the basic JSON checks, the chances of corruption are low.
I'm ok if the metrics server ignores this header until corruption is deemed to be a problem.
(Reporter)

Comment 30

5 years ago
I'm going to close this; it doesn't sound like we have any use for this on the server side, and I don't believe we have any use for it on the client side.  If we ever feel differently, we can revisit this bug.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → INCOMPLETE