Closed Bug 1544102 Opened 6 years ago Closed 6 years ago

Investigate spike in SSL_HANDSHAKE_VERSION and SPDY_REQUEST_PER_CONN during beta 65

Categories

(Data Science :: Investigation, defect, P3)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Unassigned, NeedInfo)

Details

There is a spike in bucket 4 in SSL_HANDSHAKE_VERSION during beta 65 as seen here. A similar spike appears in nightly 66.

The number of samples in the SPDY_REQUEST_PER_CONN histogram is up during the same time period.

It looks like the number of clients reporting these histograms hasn't changed much; one avenue to investigate is that more clients and pings are reporting very high values (>500k) for this histogram during the spike. If you exclude those large values, some of the spike disappears.
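For concreteness, here is a minimal sketch of that check in pandas, assuming a hypothetical per-ping extract (the column names are invented for illustration; the real analysis runs against the Telemetry datasets):

import pandas as pd

# Hypothetical per-ping extract: one row per ping, with the total number of
# samples recorded in that ping's SPDY_REQUEST_PER_CONN histogram.
pings = pd.DataFrame({
    "submission_date": ["2019-01-20", "2019-01-20", "2019-01-21"],
    "histogram_sample_count": [40, 900_000, 55],
})

# Re-aggregate with and without the extreme reporters (>500k samples per ping)
# to see how much of the spike they account for.
threshold = 500_000
by_day_all = pings.groupby("submission_date")["histogram_sample_count"].sum()
by_day_trimmed = (
    pings[pings["histogram_sample_count"] <= threshold]
    .groupby("submission_date")["histogram_sample_count"]
    .sum()
)
print(pd.DataFrame({"all": by_day_all, "trimmed": by_day_trimmed}))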

Moving investigation to ::Investigation

Component: Datasets: Telemetry Aggregates → Investigation
Product: Data Platform and Tools → Data Science

Nhi, can you find an engineer from network engineering to help track this down with Data Science?

Flags: needinfo?(nhnguyen)

Honza, could you take a look?

Flags: needinfo?(nhnguyen) → needinfo?(honzab.moz)

I absolutely don't understand the graphs. Can you please give me some insight into what I'm actually looking at? What are the percentage buckets?

This could well be just an experiment run by Google or Facebook or some other major site.

Flags: needinfo?(honzab.moz) → needinfo?(mreid)

The SSL_HANDSHAKE_VERSION graph shows the aggregate counts of different negotiated SSL versions, grouped by Firefox Build ID.

Buckets are summed across all Telemetry pings observed from builds on that day, and the percentage shown is the percentage of all observed values falling in that bucket. More details are documented here.

Bucket 4 corresponds to TLS 1.3, while bucket 3 corresponds to TLS 1.2.
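As a toy illustration of how the dashboard's per-bucket percentage is derived (the counts below are made up, not taken from the real aggregates):

# Illustrative bucket totals only; the real numbers come from summing each
# histogram bucket across all pings for a given build day.
bucket_counts = {
    3: 9_800_000,  # TLS 1.2 (per the mapping above)
    4: 6_100_000,  # TLS 1.3 (per the mapping above)
    # ...other buckets elided
}

total = sum(bucket_counts.values())
for bucket, count in bucket_counts.items():
    print(f"bucket {bucket}: {100.0 * count / total:.1f}% of observed handshakes")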

What the graph shows is a relative increase in the rate of TLS 1.3 for beta builds between Jan 14th and Jan 28th.

You can also change the graph to show aggregates by submission date rather than build date, which spreads this increase out over a longer period of time. This makes me think it is a change in the Firefox code rather than a change in the world (since bucket 4 counts for builds after Jan 28th drop back down, while later submission dates remain high).
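A minimal sketch of that comparison, again on a hypothetical per-ping extract (column names invented; bucket4_share stands for the fraction of a ping's SSL_HANDSHAKE_VERSION samples that landed in bucket 4):

import pandas as pd

# Each row is one ping; bucket4_share is the TLS 1.3 fraction for that ping.
pings = pd.DataFrame({
    "build_date":      ["2019-01-14", "2019-01-21", "2019-02-04"],
    "submission_date": ["2019-01-20", "2019-02-05", "2019-02-10"],
    "bucket4_share":   [0.62, 0.64, 0.41],
})

# If the anomaly follows the build date, it points at something in specific
# builds; if it only follows the submission date, it looks more like a change
# in the world (sites, server-side experiments) that affects every build.
print(pings.groupby("build_date")["bucket4_share"].mean())
print(pings.groupby("submission_date")["bucket4_share"].mean())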

If you include bucket 3 (TLS 1.2), it looks like the increase in bucket 4 matches a decrease in bucket 3 over the same set of builds.

Finally, there also seems to be an increase in the absolute number of samples for SSL_HANDSHAKE_VERSION at the end of Beta 65 as seen in the third graph here.

Flags: needinfo?(mreid)

Honza, does that help?

Flags: needinfo?(honzab.moz)
Priority: -- → P3

(In reply to Mark Reid [:mreid] from comment #6)

Honza, does that help?

Thanks, it's definitely clearer now what's going on. However, the Necko team has not identified any change in that timeframe that would cause such a spike. I'm redirecting the ni? to some folks from PSM/NSS.

Franziskus, Martin, can you please look at this spike and think of any possible cause? Thanks.

Flags: needinfo?(mt)
Flags: needinfo?(honzab.moz)
Flags: needinfo?(franziskuskiefer)

I can't explain this either; that's how we got here. Thanks to Mark's analysis, though, I'm starting to see some clues.

Now that we have had some time since the problem started, it's starting to look like there is something in Beta 65 that is causing the problem. Though we might expect some noise as 65 winds down, as mreid says, the TLS version evolution shows that the TLS 1.3 numbers are consistently higher (and noisier) for Beta 65, with TLS 1.2 down in a similar fashion.

This SPDY_REQUEST_PER_CONN by build date graph is a killer. Check out December 23. That doesn't match well with the TLS graph, but it is a clear signal. Of what, I just don't know.

I'm not aware of any change that we made during this period that would have any effect on connection rates. That, combined with the crazy spike in the above SPDY_REQUEST_PER_CONN graph, suggests that we might have hit on a problem with that specific build.

I don't see anyone landing risky changes right before Christmas (when most of us were offline). So I'm inclined to suggest that one of our friends at a big site (one that runs TLS 1.3) decided to do something extraordinary. The key insight here is that this is unlikely to be a problem with connection failures or similar, because although we get lots of connections, we are getting a TON of requests on those connections. Very few sites can drive that sort of spike, so it suggests that it might have been one of the bigger sites, like YouTube.

It's possible that there were changes in Gecko that neither Honza nor I were involved in. Maybe the media team changed the way they use fetching for MSE or something. It might pay to ask around.

p.s., I find the fact that telemetry colours the three graphs on the evolution view differently to be VERY annoying.

Flags: needinfo?(mt)
Flags: needinfo?(franziskuskiefer)

It looks like we could keep asking around indefinitely. I'm inclined to close this as INCOMPLETE. Mark, would you be OK with that, or do you want us to investigate further? As I understand it, the problem went away long ago.

Flags: needinfo?(mreid)

(In reply to Martin Thomson [:mt:] from comment #8)

p.s., I find the fact that telemetry colours the three graphs on the evolution view differently to be VERY annoying.

I filed an issue for this piece here.

Thank you all for the information.

If we are comfortable saying that this was most likely a now-resolved issue with Beta 65, I am good with calling the investigation complete and resolving this bug.

Ekr, do you agree?

Flags: needinfo?(mreid) → needinfo?(ekr)

I'm going to call this "done" - please reopen if there are other avenues of investigation you'd like us to pursue.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED