Closed Bug 735867 Opened 9 years ago Closed 9 years ago

People page takes up to 12 seconds to render on production

Categories

(Mozilla Reps Graveyard :: reps.mozilla.org, task)

Version: 0.2 - Skon
Type: task
Priority: Not set
Severity: critical

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: stephend, Assigned: giorgos)


Details

(Keywords: perf)

Attachments

(2 files)

Leaving this untargeted for now, but I'd love to see it in the 0.2 Skon release.

The /people page is *really* slow on prod, sometimes taking up to 18 seconds to render:

http://www.webpagetest.org/result/120314_MY_3K6JJ/

First View 18.691s

Repeat View 12.067s

One reason is that 284 objects are being requested; see the waterfall there for a more detailed view.
mpressman: are there any DB-related optimizations we can make here?
That's probably also related to bug 735050. We'll ship some caching improvements with 0.2.
Assignee: nobody → scabral
We tested this out, and it doesn't seem to be the database:

[14:52:57] <sheeri> I've got both prod servers logging every query now, so one more time please :D
[14:53:21] <stephend> sure thing
[14:54:34] <stephend> done
[14:55:11] <stephend> 16 seconds, 9 seconds
[14:55:16] <sheeri> seeing queries, that's good :D
[14:55:51] <sheeri> OK, so on both logs I did:
[14:55:52] <sheeri> grep Query mysql-slow.log  | grep -v "Query_time: 0.0"
[14:56:04] <sheeri> there's only one query that matched that, on the master, # Query_time: 0.108180  Lock_time: 0.000122 Rows_sent: 1  Rows_examined: 18377
[14:56:23] <sheeri> that's SELECT AVG(clicks) FROM (SELECT `auth_user`.`id` AS `id`, `auth_user`.`username` AS `username`, `auth_user`.`first_name` AS `first_name`, `auth_user`.`last_name` AS `last_name`, `auth_user`.`email` AS `email`, `auth_user`.`password` AS `password`, `auth_user`.`is_staff` AS `is_staff`, `auth_user`.`is_active` AS `is_active`, `auth_user`.`is_superuser` AS `is_superuser`, `auth_user`.`last_login` AS `last_login`, `auth_user`.`date_joined` AS
[14:56:23] <sheeri> `date_joined`, SUM(`badges_clickstats`.`clicks`) AS `clicks` FROM `auth_user` LEFT OUTER JOIN `badges_badgeinstance` ON (`auth_user`.`id` = `badges_badgeinstance`.`user_id`) LEFT OUTER JOIN `badges_clickstats` ON (`badges_badgeinstance`.`id` = `badges_clickstats`.`badge_instance_id`) WHERE (`badges_clickstats`.`year` = 2012  AND `badges_clickstats`.`month` = 3 ) GROUP BY `auth_user`.`id`, `auth_user`.`id`, `auth_user`.`username`,
[14:56:23] <sheeri> `auth_user`.`first_name`, `auth_user`.`last_name`, `auth_user`.`email`, `auth_user`.`password`, `auth_user`.`is_staff`, `auth_user`.`is_active`, `auth_user`.`is_superuser`, `auth_user`.`last_login`, `auth_user`.`date_joined` ORDER BY NULL) subquery;
[14:56:43] <sheeri> but I honestly don't think that a 1/10 second query caused 16 or 9 seconds of delay
[14:56:49] <stephend> yah
Paul/Jen: is there anything else we can investigate here? How closely do we think this is related to bug 735050, if that's a valid guess?
Assignee: scabral → nobody
What kind of caching was implemented so far?
According to bug 731756 we do not cache pages under /people/, /featured/ and /u/, basically everything except the main page.

I think the best solution would be to cache pages for a few hours or so (suggestions please!) for non-logged-in users, and not cache at all for logged-in users.

Logged-in users make changes to the website, e.g. editing their profile data or adding reports, which alters most of the website views. So I guess no caching would be the best option for them; I'm not sure how to implement this in Django though. Any suggestions?

Jason, what's your opinion on this? What about the "must-revalidate" option: can it somehow solve our problem, and is it supported by our upstream caches?

Thanks
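
For what it's worth, here is a minimal sketch of how this could look with Django's stock cache middleware, assuming a Django-1.3-era setup; the backend, timeout, and middleware list below are illustrative, not the site's actual settings:

    # settings.py -- illustrative values only
    CACHES = {
        'default': {
            'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
            'LOCATION': '127.0.0.1:11211',
        }
    }

    MIDDLEWARE_CLASSES = (
        'django.middleware.cache.UpdateCacheMiddleware',     # must be first
        'django.middleware.common.CommonMiddleware',
        'django.contrib.sessions.middleware.SessionMiddleware',
        'django.contrib.auth.middleware.AuthenticationMiddleware',
        # ... the rest of the stack ...
        'django.middleware.cache.FetchFromCacheMiddleware',  # must be last
    )

    CACHE_MIDDLEWARE_SECONDS = 60 * 60 * 3  # cache whole pages for ~3 hours
    CACHE_MIDDLEWARE_ANONYMOUS_ONLY = True  # never cache for logged-in users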
Giorgos,

I do not think that 'must-revalidate' will do any good. It is useful only after you have some sort of front-end caching in place: it tells the cache to force a refresh even if the page modification time has not changed since the page was last cached.

In order to improve performance on this page (people) you will need to figure out how to do some sort of front end caching.  Right now the production site has caching disabled for everything under people, configuration file looks like:

    # Bug 731756
    <LocationMatch "/(u|people|featured)/">
        ExpiresActive On
        ExpiresDefault "access plus 0 seconds"
    </LocationMatch>

You can either take control of caching in the Django interface or provide us with a more complex configuration for Apache.

Either way, I think there will need to be some compromise between real-time updates and performance. Perhaps you would be willing to cache these pages for 10 minutes or more; anything like that will help. Additionally, you can try to make fewer external HTTP requests: aggregate your JS and CSS files, minify them, etc. For example, this page currently reports:

This page has 8 external Javascript scripts. Try combining them into one.
This page has 3 external stylesheets. Try combining them into one.

There are 3 components that can be minified
    https://reps.mozilla.org/media/js/activate.browserid.js
    https://reps.mozilla.org/media/js/app.js
    inline <script> tag #1

There are 6701 DOM elements on the page

There are 34 images that are scaled down


I am sure you know this, so I won't belabor the point.

Really the largest help is going to be getting some sort of caching in place.  Even if you just cache the static elements (js css png etc) and leave the avatars or whatever alone, this will help.

After some sort of caching scheme is developed, we can then evaluate placing some static assets on a CDN if the site remains slow due to these elements.

Please let me know if this is not helpful or if I can provide any additional information.
I've looked into this a bit as well. Here are my recommendations:

1)
There are approximately 25 static content things (js, css, png) that are delivered from our servers without any form of cache headers (expires or cache-control). These result in round-trips back to us on every single page load... this adds up fast, especially as the user gets farther away from PHX1. Folks in EU are hurt significantly by this.

Fred Wenzel should be able to point you in the right direction on this. Basically, we already have a standard solution for having a Django app return static content with a query string that changes whenever the actual content changes. It then sets a very long Expires header on that content; the query string is used to invalidate the cache as needed. This works extremely well for almost all of our Django-based sites (AMO sets a *1 year* expiration thanks to this).
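
The idea, sketched in Python (the helper name and path here are hypothetical, not the actual playdoh implementation):

    import hashlib
    import os

    MEDIA_ROOT = '/data/www/reps.mozilla.org/media'  # hypothetical path

    def busted_url(path):
        """Media URL with a query string that changes when the file does."""
        with open(os.path.join(MEDIA_ROOT, path), 'rb') as f:
            build_id = hashlib.md5(f.read()).hexdigest()[:8]
        # The file itself can now carry a far-future Expires header:
        # clients re-fetch only when the HTML points at a new URL.
        return '/media/%s?build=%s' % (path, build_id)

    # e.g. busted_url('js/app.js') -> '/media/js/app.js?build=3f2a9c01'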

Note that the lack of headers on this also means that even Zeus won't cache it... these requests have to go all the way back to the origin nodes.


2)
The page itself is rather complex. As jd mentioned, almost 7000 DOM elements. This is probably affecting rendering speed, especially on older PCs and mobile devices. If anything can be done to simplify the page structure, that would probably help the page to display faster.

3)
There is a huge number of calls to Gravatar for all the individual images. Most of these have very short Expires headers, or in some cases *no* Expires headers at all. This results in over 100 round-trip calls to Gravatar's CDN. It might be good to talk to them about this and see if anything can be done.

Alternatively, perhaps the page can be paginated and show fewer users at a time. This may help with both #2 and #3.
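
A rough sketch of what pagination could look like with Django's stock Paginator (the view, page size, and template name are made up):

    from django.contrib.auth.models import User
    from django.core.paginator import EmptyPage, PageNotAnInteger, Paginator
    from django.shortcuts import render

    def people(request):
        # 48 reps per page instead of the whole community at once.
        paginator = Paginator(User.objects.filter(is_active=True), 48)
        try:
            page = paginator.page(request.GET.get('page', 1))
        except PageNotAnInteger:
            page = paginator.page(1)
        except EmptyPage:
            page = paginator.page(paginator.num_pages)  # clamp to last page
        # Fewer users per response means fewer DOM elements (#2) and fewer
        # Gravatar hits (#3) per page load.
        return render(request, 'people/list.html', {'page': page})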

4)
The actual document https://reps.mozilla.org/people/ is slow to load in the first place... this seems to account for 1/3 to 1/2 of the overall page load time. This appears to be largely CPU usage on whichever node happens to handle the request.

The fix for this is going to be similar to what we do with the generic cluster in PHX1: make the HP blade primary and the Seamicro Atom nodes failover. In my testing this drops the processing time for this request from about 6-9s down to around 2-4s, simply due to having much faster processors. That is enough of an improvement that I don't think CPU is a limitation anymore, for any single request.

The remaining 2-4s seems much more likely to be database now (maybe 1-2s of CPU, plus 1-2s of DB queries), and I suspect we'd want to revisit that more closely once this is in place. My guess is that we have a lot of small queries happening.
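
If it does turn out to be many small queries, the usual fix is to batch them in the ORM. A sketch, assuming default related names derived from the models visible in the slow query above (BadgeInstance, ClickStats):

    from django.contrib.auth.models import User
    from django.db.models import Sum

    # N+1 pattern: one query for the users, then one more per user.
    totals = [u.badgeinstance_set.aggregate(c=Sum('clickstats__clicks'))['c']
              for u in User.objects.all()]

    # Batched: a single query annotating every user with their click total.
    users = User.objects.annotate(
        clicks=Sum('badgeinstance__clickstats__clicks'))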



I am working on #4 now. I recommend you (web dev) look over 1-3 to figure out what might be doable.
I really think we should try to get fixes in for 0.2; setting that milestone so we can triage it out, if we decide not to.
Version: unspecified → 0.2 - Skon
Thanks for your recommendations, Jake!

(In reply to Jake Maul [:jakem] from comment #8)
> 1)
> There are approximately 25 static content things (js, css, png) that are
> delivered from our servers without any form of cache headers (expires or
> cache-control). These result in round-trips back to us on every single page
> load... this adds up fast, especially as the user gets farther away from
> PHX1. Folks in EU are hurt significantly by this.

Are you using jingo-minify? It's part of playdoh's standard package, and it seems you're using it, but on deployment you neither set TEMPLATE_DEBUG to False, nor do you seem to actually minify the CSS and JS. Please do. There's also an update script for stage and prod servers packaged with playdoh that does this.
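
For reference, declaring bundles for jingo-minify looks roughly like this; the bundle names and file lists below are assumptions, not the site's actual configuration:

    # settings.py
    MINIFY_BUNDLES = {
        'css': {
            'common': (
                'css/base.css',
                'css/people.css',
            ),
        },
        'js': {
            'common': (
                'js/libs/jquery.min.js',
                'js/app.js',
                'js/activate.browserid.js',
            ),
        },
    }

Templates then reference {{ css('common') }} and {{ js('common') }}, and with TEMPLATE_DEBUG = False plus the compress step run at deploy time, each bundle is served as a single concatenated, minified file with a cache-busting build id.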

> 2)
> The page itself is rather complex. As jd mentioned, almost 7000 DOM
> elements. This is probably affecting rendering speed, especially on older
> PCs and mobile devices. If anything can be done to simplify the page
> structure, that would probably help the page to display faster.

Yeeeah, 11000 lines, 7k DOM elements. That's waaay too complex.

> 3)
> Alternatively, perhaps the page can be paginated and show fewer users at a
> time. This may help with both #2 and #3.

Yeah that's probably a good idea.

> 4)
> The actual document https://reps.mozilla.org/people/ is slow to load in the
> first place... this seems to account for 1/3 to 1/2 of the overall page load
> time. This appears to be largely CPU usage on whichever node happens to
> handle the request.
> 
> The fix for this is going to be similar to what we do with the generic
> cluster in PHX1- make the HP blade primary, and the Seamicro Atom nodes
> failover. In my testing this drops the processing time for this request from
> about 6-9s down to around 2-4s, simply due to having much faster processors.
> That is enough of an improvement that I don't think CPU is a limitation
> anymore, for any single request.

That works for me, though this is pretty clearly foremost an app problem, and only as a secondary fix would I throw more horsepower at it.

> The remaining 2-4s seems much more likely to be database now (maybe 1-2s of
> CPU, plus 1-2s of DB queries), and I suspect we'd want to revisit that more
> closely once this is in place. My guess is that we have a lot of small
> queries happening.
> 
> I am working on #4 now. I recommend you (web dev) look over 1-3 to figure
> out what might be doable.

Sounds good, thanks!
(In reply to Jason Crowe [:jd] from comment #7)
> In order to improve performance on this page (people) you will need to
> figure out how to do some sort of front end caching.  Right now the
> production site has caching disabled for everything under people,
> configuration file looks like:
> 
>     # Bug 731756
>     <LocationMatch "/(u|people|featured)/">
>         ExpiresActive On
>         ExpiresDefault "access plus 0 seconds"
>     </LocationMatch>
> 
> You can either take control of caching in the Django interface or provide us
> with a more complex configuration for Apache.
> 

I've already added logic on the application side to control the cache. When we push 0.2, I'll ask you to remove this configuration, and performance should improve with the new, more fine-grained control. I plan to cache some pages, like the one in question (/people/), for which it doesn't matter if they are not perfectly up to date.

> Really the largest help is going to be getting some sort of caching in
> place.  Even if you just cache the static elements (js css png etc) and
> leave the avatars or whatever alone, this will help.

I wonder why static elements are not currently cached. I assume they are served directly by Apache; isn't there a default caching policy for them?


> Please let me know if this is not helpful or if I can provide any additional
> information.

You once mentioned on IRC (if I remember correctly, of course ;) that on Drupal sites you don't cache any content when a user is logged in. I don't really understand how you can instruct upstream caches not to serve a cached page to a logged-in user: the cache server will serve the page, and the request will never hit the app to check whether a user is logged in or not. I guess there's a trick I'm missing or something I misunderstand. Any thoughts?

Great info, thanks for your time! ;)
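
For what it's worth, one common answer to that question: the app itself marks responses to logged-in users as uncacheable, so an upstream cache only ever stores (and serves) the anonymous variant, and the proxy is additionally configured to bypass its cache when a session cookie is present. A sketch with Django's cache utilities (the view and template names are hypothetical):

    from django.shortcuts import render
    from django.utils.cache import patch_cache_control

    def people(request):
        response = render(request, 'people/list.html')
        if request.user.is_authenticated():
            # Shared caches must not store or reuse this response.
            patch_cache_control(response, private=True, no_cache=True,
                                max_age=0)
        else:
            # Anonymous responses may be cached upstream for 10 minutes.
            patch_cache_control(response, public=True, max_age=600)
        return response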
(In reply to Fred Wenzel [:wenzel] from comment #10)
> Thanks for your recommendations, Jake!
> 
> (In reply to Jake Maul [:jakem] from comment #8)
> > 1)
> > There are approximately 25 static content things (js, css, png) that are
> > delivered from our servers without any form of cache headers (expires or
> > cache-control). These result in round-trips back to us on every single page
> > load... this adds up fast, especially as the user gets farther away from
> > PHX1. Folks in EU are hurt significantly by this.
> 
> Are you using jingo-minify? It's part of playdoh's standard package, and it
> seems you're using it, but on deployment you neither set TEMPLATE_DEBUG to
> False, nor do you seem to actually minify the CSS and JS. Please do. There's
> also an update script for stage and prod servers packaged with playdoh that
> does this.

Yes, we are using jingo-minify in 0.2, which is still on the dev servers. Version 0.1 was not using jingo-minify because it was not using Jinja in the first place!

> 
> > 2)
> > The page itself is rather complex. As jd mentioned, almost 7000 DOM
> > elements. This is probably affecting rendering speed, especially on older
> > PCs and mobile devices. If anything can be done to simplify the page
> > structure, that would probably help the page to display faster.
> 
> Yeeeah, 11000 lines, 7k DOM elements. That's waaay too complex.

Agreed. It's quite complex and contains lots of information. Maybe Pierros can design something more lightweight for 0.3.

> 
> > 3)
> > Alternatively, perhaps the page can be paginated and show fewer users at a
> > time. This may help with both #2 and #3.
> 
> Yeah that's probably a good idea.

Right now we cannot paginate, since search is done client-side in JavaScript. With a different design, that's definitely a good workaround.

> 
> > 4)
> > The actual document https://reps.mozilla.org/people/ is slow to load in the
> > first place... this seems to account for 1/3 to 1/2 of the overall page load
> > time. This appears to be largely CPU usage on whichever node happens to
> > handle the request.
> > 

Since the view of this page only executes the query to the database, I guess that most of the processing time is spent on rendering the template. 

In 0.1 we are using the default Django template engine and not Jinja, so I expect 0.2 to be quite a bit faster (that is, if the promise of Jinja being faster than Django templates holds ;)
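
If template rendering really does dominate, the rendered markup itself can also be cached. A minimal sketch with Django's low-level cache API (the cache key and template name are made up):

    from django.core.cache import cache
    from django.template.loader import render_to_string

    def people_grid_html(users):
        html = cache.get('people_grid')
        if html is None:
            # Render once, then serve the stored markup for 10 minutes.
            html = render_to_string('people/grid.html', {'users': users})
            cache.set('people_grid', html, 60 * 10)
        return html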
Summing up, changes to go into 0.2 related to this bug:

1. Switch to Jinja for template rendering
2. Activate jingo-minify
3. Cache static content. Jason, do I have to file a bug about this? Do we cache content under /media/?
4. Cache some pages, including /people/ for about 10 minutes.

More comments and suggestions always welcome. 

Thank you all for your detailed answers!
On the /media/ cache header front, here's what AMO does:

    Alias /media /data/www/addons.mozilla.org/zamboni/media
    <FilesMatch "(\.(css|gif|ico|jpe?g|js|png|svg))$">
        ExpiresActive On
        ExpiresDefault "access plus 10 years"
    </FilesMatch>

So yes, things in /media/ are served up directly by Apache, and Apache sets a huge Expires time on them. It relies on the app to include a query string in the HTML when requesting that content, so the cache can be "busted" when the content changes.

Note that SUMO is basically the same, but uses a slightly different implementation. They have a .htaccess file in /media/ which sets all of the headers for them, and it keys off MIME type rather than file extension.

Their javascript and CSS files include a ?build=XXXX query string (presumably from Jinja), and have 1-year Expires headers. Their images in /media/ do not have query strings, but *do* have 1-month Expires headers. Videos have no query string and a 1-week expiry. If they change an image or video, there will be a significant lag before people see the new one unless they force-reload the page.


I recommend you check a .htaccess file into /media/ with some suitable cache headers. I'll attach a sample one to start from. Note that you might want to leave some MIME types without any cache headers until the switch to Jinja/jingo-minify is completed and you have working query strings on content.

A compromise might be to set it up with a relatively short TTL (6 hours maybe?) for now, and plan to increase it dramatically once the query string is included. I don't know what the timeline on that is like... maybe it's not worth the trouble.
As I mentioned in my last comment, you may want to start with something like this, but with smaller values, and/or comment out the more troublesome MIME types. Also take note of the fallback "ExpiresDefault" line... you'll get that for any MIME type that doesn't have an explicit setting!
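
(The attachment itself is not inlined here; below is a sketch of what such a .htaccess might contain, following the description above. The 6-hour TTLs are the compromise from the previous comment and are assumptions, not the attached values.)

    ExpiresActive On

    # Fallback: applies to ANY MIME type without an explicit rule below!
    ExpiresDefault "access plus 0 seconds"

    # Conservative TTLs until cache-busting query strings are in place.
    ExpiresByType text/css "access plus 6 hours"
    ExpiresByType application/javascript "access plus 6 hours"
    ExpiresByType image/png "access plus 6 hours"
    ExpiresByType image/jpeg "access plus 6 hours"
    ExpiresByType image/gif "access plus 6 hours"
    ExpiresByType image/x-icon "access plus 6 hours"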
Thanks Jake! I've already included your .htaccess file. I didn't change the times, since we have already activated jingo-minify.

Jason, I added the .htaccess file to the /media directory. Please take care of any Apache configuration if required. Thanks!
Assignee: nobody → giorgos
Status: NEW → ASSIGNED
To clarify, did you want a push to stage or production (or neither) to deploy this .htaccess file? It's not live in stage or prod at the moment, but I do see it in dev.
Jake,

This is not yet pushed to production or stage. Expect that by next week!

thanks
Summary: People page takes up to 18 seconds to render on production → People page takes up to 12 seconds to render on production
We deployed the suggested improvements and the page feels a bit faster. But the page keeps growing as more people join the website; as of today it contains more than 8000 DOM elements. We will work on a number of improvements to reduce the number of DOM elements and the weight of the JS. I'll file new bugs to track that work.
Depends on: 743656
Closing this one based on Giorgos's last comment. Any new improvements can be tracked in a new bug.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
It's worse now, taking 21+ seconds -- likely because we're growing our awesome community of Reps :-)  But, as you both say, on to new bugs.

https://browsermob.com/free-website-performance-test/b6e41d4739f14f33a046b6d89410aa87

I'll go find them and comment there :-)
Couldn't find any new issues filed, so I logged bug 750951.
Verified FIXED:

http://www.webpagetest.org/result/120905_J1_KAC/

~ 7 seconds; nice!
Status: RESOLVED → VERIFIED
Product: Mozilla Reps → Mozilla Reps Graveyard