Closed Bug 1088329 Opened 10 years ago Closed 10 years ago

Investigate bottleneck in homepage feed API endpoint

Categories

(Marketplace Graveyard :: API, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cvan, Unassigned)

References

()

Details

(Keywords: perf)

Attachments

(2 files)

This bug was filed in response to bug 1085739 (bug 1085739 comment 7, specifically). This bug is to do profiling to figure out what our the bottlenecks in returning a speedy response. My hope is that we can discover exactly where all of the computing time is spent in the client->server->client request lifecycle.

I'm excited!

(By Christopher Van Wiemeersch [:cvan] from bug 1085739 comment 7)
> We agree we should increase the `max-age`, we already serve the important public API responses from the CDN to avoid hitting Zamboni directly, and we cache all API responses in `localStorage` and refresh them asynchronously - but all these perf tricks are only masking an underlying problem with Zamboni/the webheads. Do we know why hitting Zamboni directly is taking so long? Servers? TLS/SSL? Nginx? Python? Django? Elasticsearch? MySQL? Redis? Which tools can we use to profile this and who can help?

from CDN (with response header of `Cache-Control:must-revalidate, max-age=180`, gets cached on CDN):
https://marketplace.cdn.mozilla.net/api/v2/feed/get/?lang=en-US&limit=10&region=us&cache=1&vary=0

from Zamboni directly:
https://marketplace.firefox.com/api/v2/feed/get/?lang=en-US&limit=10&region=us&cache=1&vary=0
So, FeedView from start to finish, including middlewares etc take between 1 and 1.5 seconds. That's already a lot, there is room for improvement.

Interestingly, our Ajax response times for marketplace.cdn.mozilla.net/api/v2/feed/get/ are much worse. This seems to indicate that the CDN is evicting our content from its cache. 

Locally:
- Loading the endpoint a couple times it takes between 1.5 and 3 seconds to load
- All subsequent loads take 0.150 to 0.250 ms.

That seems to prove this is a sound theory. I've merged bug 1088362 and bug 1088361 so *next tuesday* we should see some improvements, as this endpoint will get cached for 6 hours.
Priority: -- → P3
I started looking into this and dropping in some timer decorators around various things to try to get a more fine-grained idea of what inside FeedView is taking so long.

Pretty quickly I discovered that all of FeedView is pretty fast except for these lines here where the majority of time is spent:
https://github.com/mozilla/zamboni/blob/2014.11.04/mkt/feed/views.py#L726-L731

From there I started digging deeper. There are many `to_native` calls, some taking longer than others but nothing that really seemed to take a majority of time. Then I found that this call:
https://github.com/mozilla/zamboni/blob/2014.11.04/mkt/feed/serializers.py#L70-L80
would take up 1/3rd of the time of the view but only the first time it is called. On successive calls it takes 1/10th of its original time. I couldn't figure why the first time takes so long.

There are also a couple thing just reading the code that might be cause for concern:

1. In some edge cases when a feed for a certain region has feed items but those feed items get filtered out, we essentially call the FeedView a 2nd time except with region=RESTOFWORLD. This is to ensure we don't deliver an empty feed. But this also means if the feed took 1s to load and came up empty we've got another 1s to get the RoW feed. This happens here: https://github.com/mozilla/zamboni/blob/2014.11.04/mkt/feed/views.py#L664-L674  We should probably added some logging here to get a sense of how often this might happen (if at all?) https://github.com/mozilla/zamboni/blob/2014.11.04/mkt/feed/views.py#L738

2. The note here about https://github.com/mozilla/zamboni/blob/2014.11.04/mkt/feed/serializers.py#L70-L80 to native getting called twice. In my timing output I saw while `to_native` is pretty fast it's by no means free and cumulatively could add up to a decent amount of time since it gets called a lot. If we could avoid the double call I think it would help a little.

There isn't much actionable here but I think knowing that Elasticsearch isn't the culprit is a good discovery. On -dev loading the region=us feed results in 3 ES queries with the following times as reported by the ES call output: 0.004s, 0.006s, and 0.068s... 0.078s in total. All the time spent was after these 3 ES calls in Python serializing the data into JSON.

Another oddity is that if you hit the API endpoint it seems the first time is always slow, while successive calls are faster.

time curl 'https://marketplace-dev.allizom.org/api/v2/feed/get/?cache=1&lang=en-US&limit=25&region=us' => 2.339s and a successive call is 0.590s. I'm not sure why this is.

I'm open to exploring more and digging deeper but I wanted to report what I had so far if anyone had ideas of where to go next.
(In reply to Rob Hudson [:robhudson] from comment #3)
> time curl
> 'https://marketplace-dev.allizom.org/api/v2/feed/get/?cache=1&lang=en-
> US&limit=25&region=us' => 2.339s and a successive call is 0.590s. I'm not
> sure why this is.

Drop the `?cache=1` and you'll notice that both take ~2.339s. This is because passing `?cache=1` causes Zamboni to set this response header, `Cache-Control: must-revalidate, max-age=180`, which caches the response for 3 minutes.

––

Lots of great discoveries. Thanks, Rob! What are the actionable tasks in your opinion?
?cache=1 shouldn't affect the timings, since it only sets the response header, we are not using the CDN here, and we are using `curl` and not a browser, so there is nothing doing caching ?
(In reply to Mathieu Pillard [:mat] from comment #5)
> ?cache=1 shouldn't affect the timings, since it only sets the response
> header, we are not using the CDN here, and we are using `curl` and not a
> browser, so there is nothing doing caching ?

Whatever it does, however, does make the times consistently slow when doing cache=0.

(In reply to Christopher Van Wiemeersch [:cvan] from comment #4)
> Lots of great discoveries. Thanks, Rob! What are the actionable tasks in
> your opinion?

I'm not sure exactly. I'd definitely like to see a log so we know how often we run through the full FeedView twice with the 2nd time using region=RESTOFWORLD. Hopefully that never happens.

Other than that I think looking into speeding up the serialization process would be the big win and along the way trying to avoid the double `to_native` call? The labyrinth of serializer classes is a challenge to really see what's going on.
(In reply to Mathieu Pillard [:mat] from comment #5)
> ?cache=1 shouldn't affect the timings, since it only sets the response
> header, we are not using the CDN here, and we are using `curl` and not a
> browser, so there is nothing doing caching ?

Yeah, that's what I thought.

I know that `?cache=1` should add a `Cache-Control: must-revalidate, max-age=180`, but why is it also adding an `Expires` header for 3 minutes in the future?

    > Expires: Fri, 31 Oct 2014 22:37:04 GMT

And `?cache=1` is also adding this header:

    > X-Cache-Info: cached

https://www.dropbox.com/s/36jsy70eab3shyl/Screenshot%202014-10-31%2015.37.22.png?dl=0

FWIW, I was not able to reproduce this behaviour on prod.

That same endpoint on prod takes the same amount of time regardless of the `?cache=1` query-string parameter:

    > % time curl 'https://marketplace.firefox.com/api/v2/feed/get/?lang=en-US&limit=25&region=us' -I ~
    > curl  -I  0.04s user 0.01s system 2% cpu 2.031 total
    > % time curl 'https://marketplace.firefox.com/api/v2/feed/get/?lang=en-US&limit=25&region=us&cache=1' -I ~
    > curl  -I  0.04s user 0.01s system 2% cpu 2.033 total

Same thing on stage.

Y'all think this is a regression from bug 1088362? Or something wonky going on with -dev setup?
I don't think it's a regression from bug 1088362, I think it's something on dev configuration: X-Cache-Info sounds like something from nginx. I suspect django is not doing any caching, it's just that when we add headers suddently nginx starts caching it. 

We should ask ops.
(In reply to Christopher Van Wiemeersch [:cvan] from comment #7)
> That same endpoint on prod takes the same amount of time regardless of the
> `?cache=1` query-string parameter:

That's not what I'm seeing. I get 2.138 first time, then 0.393 after when passing ?cache=1.
Jason, see comment 7, comment 8. Ideas?
Flags: needinfo?(jthomas)
On -dev and stage X-Cache-Info is set by the LB. On prod it will say something like 'Via: NS-CACHE-10.1'.

-dev: cache=1 vs no cache
sdiff <(curl https://marketplace-dev.allizom.org/api/v2/feed/get/\?lang\=en-US\&limit\=25\&region\=us\&cache\=1 -D- -so /dev/null) <(curl https://marketplace-dev.allizom.org/api/v2/feed/get/\?lang\=en-US\&limit\=25\&region\=us -D- -so /dev/null)
HTTP/1.1 200 OK 						HTTP/1.1 200 OK
Content-Security-Policy-Report-Only: script-src https://*.new   Content-Security-Policy-Report-Only: script-src https://*.new
Server: nginx   						Server: nginx
Vary: Accept, API-Filter, Accept-Language, Cookie, Accept-Enc   Vary: Accept, API-Filter, Accept-Language, Cookie, Accept-Enc
X-Backend-Server: dev1  					X-Backend-Server: dev1
Cache-Control: must-revalidate, max-age=180     	      <
Content-Type: application/json  				Content-Type: application/json
Access-Control-Expose-Headers: API-Filter, API-Status, API-Ve   Access-Control-Expose-Headers: API-Filter, API-Status, API-Ve
Strict-Transport-Security: max-age=31536000     		Strict-Transport-Security: max-age=31536000
Date: Tue, 04 Nov 2014 17:47:47 GMT     		      |	Date: Tue, 04 Nov 2014 17:49:22 GMT
API-Pinned: False       					API-Pinned: False
Expires: Tue, 04 Nov 2014 17:50:47 GMT  		      <
Transfer-Encoding: chunked      				Transfer-Encoding: chunked
Access-Control-Allow-Origin: *  				Access-Control-Allow-Origin: *
ETag: "692375c51aa4e102e8c23daec64ea8f0"			ETag: "692375c51aa4e102e8c23daec64ea8f0"
Via: Moz-zlb10  						Via: Moz-zlb10
							      >	Connection: keep-alive
API-Filter: carrier=&lang=en-US&pro=&region=us  		API-Filter: carrier=&lang=en-US&pro=&region=us
Last-Modified: Tue, 04 Nov 2014 17:47:47 GMT    	      <
Allow: GET, HEAD, OPTIONS       				Allow: GET, HEAD, OPTIONS
Access-Control-Allow-Methods: GET, OPTIONS      		Access-Control-Allow-Methods: GET, OPTIONS
Access-Control-Allow-Headers: X-HTTP-Method-Override, Content   Access-Control-Allow-Headers: X-HTTP-Method-Override, Content
API-Version: 2  						API-Version: 2
Connection: Keep-Alive  				      <
X-Cache-Info: cached    				      <

prod:
sdiff <(curl https://marketplace.firefox.com/api/v2/feed/get/\?lang\=en-US\&limit\=25\&region\=us\&cache\=1 -D- -so /dev/null) <(curl https://marketplace.firefox.com/api/v2/feed/get/\?lang\=en-US\&limit\=25\&region\=us -D- -so /dev/null)
HTTP/1.1 200 OK 						HTTP/1.1 200 OK
Age: 3          					      <
Date: Tue, 04 Nov 2014 17:51:12 GMT     		      <
Expires: Tue, 04 Nov 2014 17:48:06 GMT  		      <
Cache-Control: max-age=180       ,must-revalidate             <
Content-Length: 24225           			      <
Connection: Keep-Alive  				      <
Via: NS-CACHE-10.1:  11 				      <
ETag: "cd6faa3d6d4b974c89e9209fe554832e"		      <
Server: nginx   						Server: nginx
							      >	Date: Tue, 04 Nov 2014 17:51:14 GMT
Content-Type: application/json  				Content-Type: application/json
							      >	Transfer-Encoding: chunked
							      >	Connection: keep-alive
API-Pinned: False       					API-Pinned: False
Access-Control-Allow-Methods: GET, OPTIONS      		Access-Control-Allow-Methods: GET, OPTIONS
API-Filter: carrier=&lang=en-US&pro=&region=us  		API-Filter: carrier=&lang=en-US&pro=&region=us
Access-Control-Expose-Headers: API-Filter, API-Status, API-Ve   Access-Control-Expose-Headers: API-Filter, API-Status, API-Ve
Vary: Accept, API-Filter, Accept-Language, Cookie, Accept-Enc   Vary: Accept, API-Filter, Accept-Language, Cookie, Accept-Enc
Last-Modified: Tue, 04 Nov 2014 17:45:06 GMT    	      |	ETag: "cd6faa3d6d4b974c89e9209fe554832e"
Strict-Transport-Security: max-age=31536000     		Strict-Transport-Security: max-age=31536000
Allow: GET, HEAD, OPTIONS       				Allow: GET, HEAD, OPTIONS
API-Version: 2  						API-Version: 2
Access-Control-Allow-Origin: *  				Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: X-HTTP-Method-Override, Content   Access-Control-Allow-Headers: X-HTTP-Method-Override, Content
X-Backend-Server: web2  				      |	X-Backend-Server: web4
Flags: needinfo?(jthomas)
As per my conversation with :mat, we aren't really gaining much by having LB cache enabled. This is now disabled on -dev, stage and prod.
Found some small optimizations committed here:
https://github.com/mozilla/zamboni/commit/06ed8d2 

1. The app serializer was using `AppESField` which serializes all fields for an app object but we only need a handful for the feed page.

2. In some cases we could have queried for a list of apps and that list wasn't unique.
After profiling the serializer call it seems much of the time spent is spent resolving urls. See the attached profile output. We can see what urls this serialization process is resolving and possibly resolve them during indexing and skip the need for this at view time.
Attached file profile.txt
Heh very interesting. On a different Django project I discovered the same thing.

Which tools did you use to profile this?
Attached file profile.py
(In reply to Christopher Van Wiemeersch [:cvan] from comment #17)
> Heh very interesting. On a different Django project I discovered the same
> thing.
> 
> Which tools did you use to profile this?

I copy/pasted the (attached) profile.py into `./manage.py shell` and it prints out the profile view. I tried to run it directly but zamboni does some patch hacking I couldn't reproduce so this was easier. Hopefully we'll move the path hacking once we finish moving everything out of apps/.
This commit slims down the app object of collection listing feed elements:
https://github.com/mozilla/zamboni/commit/7108c1a
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: