Closed Bug 1040928 Opened 7 years ago Closed 6 years ago

Improve snippet performance and delivery

Categories

(Snippets Graveyard :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cmore, Assigned: nmaul)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/116] )

This is a tracking bug for a project with three main goals:

* Understand the bottlenecks and drivers of snippet performance

* Increase the 200KB file size restriction (server side)

* Closer to real-time (hourly) snippet deployment

Team:

* Technical project manager: Chris More
* Developer: Mike Kelly
* Owner: Jean Collings
* IT: Corey Shields or Jake Maul's team

Key Dates:

* Research: July 21st-Aug 4th
* Project completion: Sept 30th

More soon!
Depends on: 1014157
Depends on: 1040933
We'll be working with jakem on the infrastructure side of this project.
Briefly, the obvious way to do this (at least the part about allowing bigger snippets) is to try to move delivery off our infrastructure and onto a CDN. That's pretty straightforward:

Create snippets.cdn.mozilla.net on Edgecast (because it has the SSL Cert we'll need)
Set the Cache-Control max-age to some reasonable (short) value
Change browser.aboutHomeSnippets.updateUrl in nightly/aurora/beta to use it
wait for next Firefox release train

Administration wouldn't change, just the URL Firefox accesses.


Longer max-age's are better for hit rate, but in this case we're looking mainly for offloading traffic from the origin. I'd bet traffic is heavily concentrated to just a few URLs (top 3-5 locales, current version, release channel) so hit rate is likely to be very good on those regardless. I'm guessing we can go pretty short and get away with it.


I can't speak well to the idea of more frequent snippets deployments (browser checking more than every 24h), but to my eyes the obvious blocker there is, if the snippet package has not changed, it needs to be able to tell that and not re-download a new one. Potentially that could be as simple as proper handling of Last-Modified and If-Modified-Since type headers returning 304's. Worth considering: that would make it imperative to have iron-clad control over when a new snippet gets published, because doing so will generate a large bandwidth spike right away. It would be bad if we published a new one and then had to remove it shortly thereafter for some reason, because a whole lot of folks would have already fetched the new bundle and would have to do so again. Lots of wasted bandwidth and processing... server-side and client-side.
jakem: Can we get a dump of the traffic logs for snippets over a day or two? I'd like to get some stats on the different URLs people are accessing.

I'd assume, if we're using apache logs, then the logs are post-zeus. Is there a way to get pre-zeus numbers, or some numbers on how Zeus is responding to requests? (I'm trying to answer questions like "What percentage of traffic is getting cached results from Zeus" or "What percentage of traffic is served by the top 20 Zeus cached responses").

Also, any sort of logging we have for memcached that lets us find out how it's being used would be useful too.
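
A minimal sketch of the kind of stat asked for above ("what percentage of traffic is served by the top 20 responses"), assuming the request paths have already been pulled out of whatever log is available. The function name and the sample paths are hypothetical and not taken from the snippets service.

```python
from collections import Counter


def top_url_share(paths, n=20):
    """Return the n most requested paths and the fraction of all
    requests that those n paths account for."""
    counts = Counter(paths)
    total = sum(counts.values())
    top = counts.most_common(n)
    share = sum(count for _, count in top) / total if total else 0.0
    return top, share


# Made-up sample: three distinct paths, ten requests in total.
sample_paths = ["/a"] * 6 + ["/b"] * 3 + ["/c"]
top, share = top_url_share(sample_paths, n=2)
for path, count in top:
    print(path, count)
print(f"top-2 share: {share:.0%}")
```

Run against pre-cache logs, the same count would show how concentrated traffic is; run against post-cache logs, it only describes the misses.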
Flags: needinfo?(nmaul)
Also as a note, I've got a public Google doc where I'm storing notes and thoughts as I research this stuff. In case you're interested: https://docs.google.com/document/d/1GPrd0-yRTitkGAFj0L0VENgUxq9SeODens5OpCtd6HE/edit?usp=sharing
(In reply to Jake Maul [:jakem] from comment #2)
> Briefly, the obvious way to do this (at least the part about allowing bigger
> snippets) is to try to move delivery off our infrastructure and onto a CDN.
> That's pretty straightforward:
> 
> Create snippets.cdn.mozilla.net on Edgecast (because it has the SSL Cert
> we'll need)
> Set the Cache-Control max-age to some reasonable (short) value
> Change browser.aboutHomeSnippets.updateUrl in nightly/aurora/beta to use it
> wait for next Firefox release train
> 
> Administration wouldn't change, just the URL Firefox accesses.
> 
> 
> Longer max-age's are better for hit rate, but in this case we're looking
> mainly for offloading traffic from the origin. I'd bet traffic is heavily
> concentrated to just a few URLs (top 3-5 locales, current version, release
> channel) so hit rate is likely to be very good on those regardless. I'm
> guessing we can go pretty short and get away with it.

This sounds straightforward and awesome.

> I can't speak well to the idea of more frequent snippets deployments
> (browser checking more than every 24h), but to my eyes the obvious blocker
> there is, if the snippet package has not changed, it needs to be able to
> tell that and not re-download a new one. Potentially that could be as simple
> as proper handling of Last-Modified and If-Modified-Since type headers
> returning 304's. Worth considering: that would make it imperative to have
> iron-clad control over when a new snippet gets published, because doing so
> will generate a large bandwidth spike right away. It would be bad if we
> published a new one and then had to remove it shortly thereafter for some
> reason, because a whole lot of folks would have already fetched the new
> bundle and would have to do so again. Lots of wasted bandwidth and
> processing... server-side and client-side.

We could use ETags to check the content of the snippets a user already has, and if they haven't changed, return a 304. That'd work even if we put out a new snippet and then take it back quickly because when users who haven't retrieved the new snippet yet request more snippets, the content won't have changed.

(The linked doc above has a chart and some more notes on how we could use ETags.)
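
A minimal sketch of the ETag idea described above, assuming the ETag is a hash of the bundle content. The helper names are hypothetical, and the comparison is simplified: it handles a single strong ETag in If-None-Match and ignores lists, weak validators, and "*".

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from the response body (assumption: content hash)."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(body: bytes, if_none_match: Optional[str] = None) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body); 304 with an empty body when the
    client's ETag matches the current content, 200 otherwise."""
    etag = make_etag(body)
    if if_none_match is not None and if_none_match.strip() == etag:
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body


# A client fetches bundle A, then bundle B is published and rolled back.
bundle_a = b"snippet bundle A"
status, headers, _ = respond(bundle_a)
client_etag = headers["ETag"]

# After the rollback the content is bundle A again, so the client's
# stored ETag still matches and nothing is re-downloaded.
status_after_rollback, _, _ = respond(bundle_a, if_none_match=client_etag)
print(status, status_after_rollback)
```

Because the validator depends only on content, clients that never saw the withdrawn bundle are unaffected by the publish-and-rollback.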
:Jakem: Can you get the following questions answered by Aug 5th? We want to keep momentum on this project and Osmose has done some good research on how to dramatically increase the efficiency of the service:

1) Current Zeus configuration; how long does it cache, what headers does it cache by?

2) Memcached utilization; how often is the cache hit and how often do keys rotate in and out?

3) Hit rates of our top urls. In other words, are we seeing tons of traffic to the same URLs? Is Zeus handling these? Are there many variations with tiny differences that all get the same snippets, and can that be optimized?

4) Does Zeus take into account ETags when choosing whether to return a cached response or not? Or is it URL only?

If you could put the answers inline or attach them as Google Docs, that would be great. Thanks!
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/645]
No longer depends on: 1040933
Blocks: 1014157, 1040933
No longer depends on: 1014157
Summary: Improve snippet performance and delivery [tracking] → Improve snippet performance and delivery
I can't get you a day or two of logs of snippets traffic... we don't store that one for that long. I could get you Apache stats, but as you guessed it'd be post-Zeus-caching, so not particularly valuable.

I can give other stats though:

Zeus hit rate for snippets.mozilla.com is about 96%.

Zeus caches for whatever the web nodes tell it to, up to 21600 seconds. Based on my quick testing, it seems that the web nodes say 300s (5 min), so that's what it caches for.

Zeus obeys Cache-Control and Expires headers. I don't know if it pays any attention to ETags... we can probably work up a test to find out.
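
A minimal sketch of the caching rule described above: the cache lifetime is whatever max-age the web nodes send, capped at 21600 seconds. The function name is hypothetical, the Expires header is ignored, and treating a missing max-age as "do not cache" is an assumption, not confirmed Zeus behavior.

```python
import re

ZEUS_MAX_TTL = 21600  # upper bound quoted above, in seconds


def effective_ttl(cache_control: str, cap: int = ZEUS_MAX_TTL) -> int:
    """Return the lifetime in seconds that a capped cache would use
    for a response carrying this Cache-Control header value."""
    match = re.search(r"\bmax-age=(\d+)", cache_control, re.IGNORECASE)
    if match is None:
        return 0  # assumption: no max-age means the response is not cached
    return min(int(match.group(1)), cap)


print(effective_ttl("public, max-age=300"))  # the 5 minute value observed from the web nodes
print(effective_ttl("max-age=86400"))        # a longer value is capped
```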

Counts for top-20 URLs added to the google doc. Nothing too surprising... Firefox 30 and 31 both rank heavily, as we are mid-release. en-US on Windows are the largest. Fennec 31 cracks the top-20. You can see that the top URL gets an order of magnitude more hits than the 20th URL. A word of caution though: this is only 1-2 hours of data, so it's potentially biased due to timing (probably US-centric data). A full day would be better, but I don't have that capability right now.

Based on log files ("bytes transferred"), it looks to me like each locale, OS, and Firefox version get a slightly different bundle. I couldn't say just how dramatic the difference really is, but it exists.
Flags: needinfo?(nmaul)
Depends on: 1054472
Depends on: 1058620
Depends on: 1058748
Depends on: 1058759
A summary of what we've done:

- We decided against the extra template caching: what we discovered about how snippets and templates are used made that caching hard or impossible.

- We added the extra ETag logic, then concluded from our tests that the only real benefit of ETags was Zeus sending back 304s, so we removed most of it and kept the basic ETags.

- We set up the CDN and are currently waiting on Firefox dev to update the snippet URL.

(In reply to Jake Maul [:jakem] from comment #2)
> Briefly, the obvious way to do this (at least the part about allowing bigger
> snippets) is to try to move delivery off our infrastructure and onto a CDN.
> That's pretty straightforward:

Thinking on this more, it was my understanding that when the snippets service couldn't handle a large snippet, it was because the app server couldn't handle the traffic that was getting through the load balancer, rather than the load balancer not being able to handle it. If so, adding a CDN to the mix wouldn't fix that issue, would it?

> I can't speak well to the idea of more frequent snippets deployments
> (browser checking more than every 24h), but to my eyes the obvious blocker
> there is, if the snippet package has not changed, it needs to be able to
> tell that and not re-download a new one. Potentially that could be as simple
> as proper handling of Last-Modified and If-Modified-Since type headers
> returning 304's. Worth considering: that would make it imperative to have
> iron-clad control over when a new snippet gets published, because doing so
> will generate a large bandwidth spike right away. It would be bad if we
> published a new one and then had to remove it shortly thereafter for some
> reason, because a whole lot of folks would have already fetched the new
> bundle and would have to do so again. Lots of wasted bandwidth and
> processing... server-side and client-side.

What we found with the ETags is that if the app server returns a 304, Zeus doesn't cache the response and lets more traffic through. Thus, it seems that the app server returning 304s is a bad idea because it will cause Zeus to send even more traffic to the app server to check if the snippets have been modified, rather than caching the 304 for a set amount of time. If Zeus could cache the 304s then it'd be worth it for the app server to respond with 304s. Is this configurable?
Flags: needinfo?(nmaul)
Oh, forgot to mention that we also upped the max-age from 5 minutes to 10 minutes on snippets-prod.
It might be worth playing with ETag's and 304's more thoroughly, but honestly I'm not particularly concerned... Cache-Control is just as good from my perspective.

The CDN should still help- it's an additional layer of caching, not a replacement layer.

For example, assume we get a 90% hit rate currently (it's actually higher, I believe, but for the sake of the math), and 1000 hits/sec. 900 hits/sec will be served from Zeus cache, and 100 hits/sec from the app servers. Now let's add in the CDN, and presume it gets the same hit rate (it won't at the edge nodes, but there's a mid-tier cache that should, or at least the two combined should). This doesn't replace Zeus cache, which is still in effect. Zeus probably won't get quite as good a cache hit rate, but it's likely to still get some... so let's say 50%. That's 900 hits/sec from CDN, and of the remaining 100 hits/sec, 50 from Zeus cache and 50 from the app servers. We've effectively cut the traffic to the origin in half.
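
The arithmetic in this example can be sketched as a small function, assuming each cache layer independently serves a fixed fraction of the requests that reach it. The function name is hypothetical, and the hit rates are the illustrative figures from the example, not measurements.

```python
def origin_load(requests_per_sec, hit_rates):
    """Requests/sec that miss every cache layer and reach the app servers.

    hit_rates lists each layer's hit rate, outermost layer first.
    """
    remaining = requests_per_sec
    for rate in hit_rates:
        remaining *= 1 - rate
    return remaining


zeus_only = origin_load(1000, [0.90])            # Zeus at 90%
cdn_then_zeus = origin_load(1000, [0.90, 0.50])  # CDN at 90%, then Zeus at 50%

print(round(zeus_only), round(cdn_then_zeus))
print(round(cdn_then_zeus / zeus_only, 2))  # fraction of the old origin traffic that remains
```

The layers multiply, so even a modest hit rate at the inner layer keeps reducing what reaches the app servers.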

We'll have proper numbers once it starts getting traffic, but I suspect we'll see a reasonable improvement overall. It should be easy to see the effect by watching New Relic's request volume to snippets.mozilla.com- that's the origin for the CDN. Any traffic it receives is traffic that fell through everything all the way back to the app servers.
Flags: needinfo?(nmaul)
Depends on: 1091939
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/645] → [kanban:https://webops.kanbanize.com/ctrl_board/2/116]
We haven't had any problems here for quite some time, so I'm content to call this one good and close out.

If I've forgotten anything, please reopen and let me know. Thanks!
Assignee: nobody → nmaul
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Snippets → Snippets Graveyard