[research] figure out why uploads pages are so slow
Categories
(Tecken :: General, defect, P2)
Tracking
(Not tracked)
People
(Reporter: willkg, Assigned: willkg)
References
Details
Attachments
(4 files)
Occasionally, the general uploads page is slow:
https://symbols.mozilla.org/uploads
Much more often, the upload files page is slow:
https://symbols.mozilla.org/uploads/files/
This bug covers getting timings, figuring out what the issues are, and then fixing them.
Assignee
Comment 1•4 years ago
Making this a P3. It affects me every time I have to investigate symbols issues, but otherwise it's not a big deal.
Assignee
Comment 2•4 years ago
We got the site status page into prod. It takes 20 seconds to run, and the most expensive thing it's doing is table counts. I wonder if table counts are slow for large tables. That would totally affect the upload pages, because they all return table counts.
Maybe we should cache counts for large tables?
Maybe we should ditch the whole table count altogether? Is it necessary to have?
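If we go the caching route, a minimal sketch of a TTL-cached count, in plain Python (in the real app this would presumably wrap something like a hypothetical `Upload.objects.count()` or use Django's cache framework; names here are illustrative):

```python
import time


class CachedCount:
    """Cache an expensive count for a fixed time-to-live.

    `compute` is any zero-argument callable, e.g. a hypothetical
    `lambda: Upload.objects.count()` in the Django app.
    """

    def __init__(self, compute, ttl_seconds=300):
        self.compute = compute
        self.ttl = ttl_seconds
        self._value = None
        self._expires = 0.0

    def get(self):
        # Recompute only when the cached value is missing or stale.
        now = time.monotonic()
        if self._value is None or now >= self._expires:
            self._value = self.compute()
            self._expires = now + self.ttl
        return self._value
```

The tradeoff is that the displayed counts can lag reality by up to the TTL, which is probably fine for a status page.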
Assignee
Comment 3•4 years ago
Ideas from bug #1667132:
The site status page I added gets table counts for a few tables, a couple of which are large. It takes like 30+ seconds to load.
I have a few ideas on next steps:
- uploads views are doing a bunch of counts
- remove instances we don't really need?
- look at caching count values?
- look at saving file counts to the upload record?
- switch to estimates where possible? https://wiki.postgresql.org/wiki/Count_estimate
- missingsymbols views are doing pagination that involves OFFSET and LIMIT which is slow
- switch to another way to do pagination? https://www.citusdata.com/blog/2020/05/20/postgres-tips-for-django-and-python/
Assignee
Comment 4•3 years ago
Here's a cursor pagination library:
https://github.com/photocrowd/django-cursor-pagination
That's worth looking into switching to.
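The idea behind cursor (keyset) pagination, sketched in plain Python over an in-memory list of dicts (the library above does the equivalent with Django querysets; the field names here are illustrative):

```python
def keyset_page(rows, batch_size, after=None):
    """Return one page of rows plus a cursor for the next page.

    `rows` must be sorted descending by the (created, id) key, mirroring
    an ORDER BY created DESC, id DESC query. `after` is the key of the
    last row from the previous page; filtering on it replaces OFFSET,
    so the database can seek via an index instead of scanning and
    discarding all the skipped rows.
    """
    if after is not None:
        rows = [r for r in rows if (r["created"], r["id"]) < after]
    page = rows[:batch_size]
    cursor = (page[-1]["created"], page[-1]["id"]) if page else None
    return page, cursor
```

Unlike OFFSET/LIMIT, the cost of fetching page N this way doesn't grow with N.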
Updated•2 years ago
Comment 5•2 years ago
For https://symbols.mozilla.org/uploads/files/
I loaded my local psql db with 20M entries (2x prod) and noticed some slowdown (~4s), but nothing prod-like (~5m). I then used django-silk to profile calls to the db.
Assuming I reproduced the issue, the attached image suggests the culprits are probably:
count for incomplete files:
SELECT COUNT(*) AS "__count" FROM "upload_fileupload" WHERE "upload_fileupload"."completed_at" IS NULL
time_avg:
SELECT AVG(("upload_fileupload"."completed_at" - "upload_fileupload"."created_at")) AS "time_avg" FROM "upload_fileupload"
count, size_avg, size_sum:
SELECT COUNT("upload_fileupload"."id") AS "count", AVG("upload_fileupload"."size") AS "size_avg", SUM("upload_fileupload"."size") AS "size_sum" FROM "upload_fileupload"
Comment 6•2 years ago
The admin page https://symbols.mozilla.org/admin/sitestatus/ is also slow on prod, and despite not being able to quite reproduce the issue locally (it only takes 1.3s), I found some interesting things:
- SELECT count(*) and SELECT count(id) yield roughly the same load time (1.3s)
- An estimate count,
  SELECT reltuples AS estimate FROM pg_class WHERE relname = '<table_name>'
  reduces the load time from 1.3s to 276ms
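That reltuples query can be wrapped in a small helper — a sketch assuming a DB-API cursor (e.g. from Django's `connection.cursor()`). Keep in mind reltuples is only an estimate, refreshed by VACUUM/ANALYZE:

```python
def estimated_count(cursor, table_name):
    """Return Postgres's approximate row count for a table.

    Reads pg_class.reltuples, which the planner maintains via
    VACUUM/ANALYZE, instead of running a full COUNT(*) scan.
    `cursor` is any DB-API cursor with execute()/fetchone().
    """
    cursor.execute(
        "SELECT reltuples::bigint FROM pg_class WHERE relname = %s",
        [table_name],
    )
    row = cursor.fetchone()
    # Fall back to 0 if the table isn't in pg_class at all.
    return int(row[0]) if row else 0
```

This only works for whole-table counts; filtered counts (like the incomplete-files count) would need a different approach.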
Comment 7•2 years ago
As a first step I decided to make the File Uploads page (https://symbols.mozilla.org/uploads/files/) components load asynchronously: "Files", which populates the table with a list of files, and "Aggregation", which sits at the bottom of the page and provides additional data such as total count, avg size, and avg upload time. Local tests show usability improves considerably, since users can see the list of files sooner while the slower-to-compute Aggregation part loads in parallel.
My main issue now is with the "Next Page" button. It relies on the "total count" value, which after the refactoring is being fetched by the Aggregation component (slower than Files).
So if we wait for the total count to arrive along with the other calculations before enabling the "Next Page" button, the user will get stuck on a page until Aggregation loads. That makes me think we should add the cursor pagination library willkg mentioned previously, but because the pagination component is shared between three different pages in the front-end, I'm not sure it's a good idea to do it all in one shot. Here are some options I thought of:
- do it all in one shot: add the cursor pagination library and refactor all APIs and front-end views to use it now (slowest, cleanest)
- quick fix #1: when asked for Files, the backend also sends a "hasNext" flag (temporarily determined by fetching BATCH_SIZE + 1 elements); the front-end adds a special condition in its pagination component to use this field when present. This is not as quick to implement as the third option, but not as dirty, since hasNext will continue to be a thing after introducing cursor pagination.
- quick fix #2: the File Uploads page assumes that whenever a page is full of items there's a Next Page. This is the fastest to implement, but also the dirtiest, since it will all be throwaway work once we add cursor pagination, plus there's the obvious bug when (total count % BATCH_SIZE == 0), in which case the last page is blank.
willkg, I'd like to get your input here. Feel free to suggest a different option.
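The BATCH_SIZE + 1 trick can be sketched in plain Python, with a list standing in for a queryset slice (names are illustrative, not the actual API):

```python
def fetch_page(items, offset, batch_size):
    """Return one page of items plus a has_next flag.

    Fetches batch_size + 1 rows; the presence of the extra row is
    enough to know a next page exists, without a separate COUNT(*).
    """
    window = items[offset:offset + batch_size + 1]
    return window[:batch_size], len(window) > batch_size
```

Note this also sidesteps the blank-last-page bug described under quick fix #2: when the total is an exact multiple of batch_size, the extra row simply isn't there and has_next is False.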
Assignee
Comment 8•2 years ago
I don't think we need to worry about "in which the last page is blank". The only time we're looking at the last page would be in a local dev environment because there aren't many pages. In stage and prod, we have plenty of data so we're never going to look at the last page. (Or at least, I can't think of a case where I've wanted to look at the last page.)
Beyond that, I'm not sure I have a strong opinion here. None of the pros and cons that you've outlined strike me as particularly amazing or onerous. I think the risk of making a "bad decision" here is low. In the worst possible scenario, you and I are inconvenienced with a frontend that has some issue until we get around to fixing it which is effectively where we're at now.
One thing I like doing is breaking things down into small, incremental steps--especially when I'm testing theories. I don't know if framing this work in that way changes how you want to approach the adjustments you want to make.
Comment 9•2 years ago
Fair enough. One case where we might hit a blank page is if we use a filter that reduces the resultset to an exact multiple of the batch size, say 120 results.
Assignee
Comment 10•2 years ago
Oh, you're right! I completely forgot about the filters.
Comment 11•2 years ago
Comment 12•2 years ago
Comment 13•2 years ago
Deployed the Async fetch Files and Aggregates for the /uploads/files page to prod in bug 1764583.
We're seeing the following changes in that page's load time:
- Before the deployment the page used to take from 1 to 3 minutes to load
- After the deployment, the file list component is loading in less than a second, stably, which is better than I expected
- Right after the deployment the aggregates component was loading in under 20s, but 15 minutes later it started taking from 1 to 3 minutes to load, similar to the load time from before the deployment. But then right after that it went back to loading fast (~30 seconds). This looks like a degradation and might have something to do with available pod memory/garbage collection.
My thoughts on next steps:
- Apply the same solution to the /uploads page
- In case we want to stick with aggregates and make them faster, further investigation should consider issuing requests to the /files/aggregates endpoint and looking at db load/memory consumption/caching to pinpoint where the bottleneck is.
Assignee
Comment 14•2 years ago
Comment 15•2 years ago
Comment 16•2 years ago
We just deployed the changes for the /uploads page and had similar improvements on the uploads list!
However, :willkg and I noticed that when enough requests are made to /uploads/files (usually 5-10), something happens and the page becomes slow again. We think something might be tying up gunicorn workers, making them unavailable for new requests, which causes requests to queue up. We need to investigate and hopefully fix that, so we're leaving this bug open.
Assignee
Comment 17•2 years ago
Here's a summary of where this bug is at:
- Current table counts in production:

    key                      value
    download_missingsymbol   7,737,096
    upload_upload            207,665
    upload_fileupload        14,590,035
    upload_uploadscreated    1,799

  Current table counts in stage:

    key                      value
    download_missingsymbol   529
    upload_upload            164
    upload_fileupload        733
    upload_uploadscreated    449

  These counts come from the "Site Status" page in the Django admin. In production, it takes over a minute to produce those counts.

- The uploads view (https://symbols.mozilla.org/uploads) lists records in the upload_upload table. Eduardo broke that view up into two separate API calls--one that gets records and one that gets aggregates. My attempt today to load that page with no filters is pretty fast. Both API calls return very quickly.

  However, when I filtered by mcastelluccio just now, it took a long time. I hard-refreshed the page for a second set of timings to see if I could reproduce that:

    API          timing 1     timing 2
    records      96 seconds   115 milliseconds
    aggregates   98 seconds   51 seconds

- This page isn't used very often, but the concern is that if it is used by multiple people for debugging some symbol upload problem, it could tie up the database and web workers sufficiently to DOS the site and block symbol uploads.

- When Eduardo was researching the problem, he was unable to reproduce it in his local dev environment. We don't have enough records on stage to reproduce it, either. Currently the only place this problem exists is in production.

Outstanding questions and possible next steps:

- Can we reproduce this issue on stage? If so, this gives us a lot more freedom to experiment and test theories. If not, why?

- Are the production database and the local dev environment defined the same? (tables, indexes, and other things) If not (missing index, different table definitions, etc), then maybe the slowness is caused by a difference.

- Is the production database being maintained regularly? (vacuum, updates, etc) If not (not getting regularly vacuumed), then maybe we should vacuum the upload_upload table and that will reduce the slowness.

- Is the number of records in the upload_upload table not conducive to the way Django does aggregations and pagination? If so, then we should switch how we do aggregations and caching. An earlier comment wondered whether we can remove the aggregations altogether. Another earlier comment suggested other ways to do pagination.
We don't know what the underlying cause is. We've done some work to alleviate it somewhat, but the issue is still reproducible, though not consistently. Lonnen and I don't think this is a "production database is set up differently" issue; the problem is more likely related to the number of records in the tables and how Postgres does counts. We should spend some time verifying that, but it's probably the case that we should rethink this view.
Further, the "upload files" view suffers the same issues and whatever we do with "uploads" we should probably do there, too.
Related resources:
- PostgreSQL vacuum documentation: https://www.postgresql.org/docs/14/sql-vacuum.html
- Efficient Pagination in Django and Postgres (2021): https://pganalyze.com/blog/pagination-django-postgres
- Postgres tips for optimizing Django and Python performance: https://www.citusdata.com/blog/2020/05/20/postgres-tips-for-django-and-python/
Assignee
Updated•2 years ago
Assignee
Comment 18•2 years ago
In bug #1796579, I removed all the aggregates from the Uploads page. In bug #1796633, I removed all the aggregates from the Uploads Files page. The uploads and uploads files pages are a lot faster now.
Given that, I'm going to close this out.