Closed Bug 1668371 Opened 4 years ago Closed 2 years ago

[research] figure out why uploads pages are so slow

Categories

(Tecken :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

Details

Attachments

(4 files)

Occasionally, the general uploads page is slow:

https://symbols.mozilla.org/uploads

Much more often, the upload files page is slow:

https://symbols.mozilla.org/uploads/files/

This bug covers getting timings, figuring out what the issues are, and then fixing them.

Making this a P3. It affects me every time I have to investigate symbols issues, but otherwise it's not a big deal.

Priority: -- → P3

We got the site status page into prod and it takes 20 seconds to run. The most interesting thing it does is table counts. I wonder if table counts are slow for large tables. That'd totally affect the upload pages because they all return table counts.

Maybe we should cache counts for large tables?

Maybe we should ditch the whole table count altogether? Is it necessary to have?
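If we do keep the counts, caching them could look something like this rough sketch, assuming Django's cache framework is configured; the model import path, cache key, and timeout are made up for illustration:

    from django.core.cache import cache

    from tecken.upload.models import FileUpload  # hypothetical import path

    def cached_fileupload_count():
        # Recompute the expensive COUNT(*) at most once per hour; all other
        # requests read the cached value.
        return cache.get_or_set(
            "fileupload_total_count",
            FileUpload.objects.count,
            timeout=60 * 60,
        )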

Ideas from bug #1667132:

The site status page I added gets table counts for a few tables, a couple of which are large. It takes 30+ seconds to load.

I have a few ideas on next steps:

  1. The uploads views are doing a bunch of counts.
  2. The missingsymbols views are doing pagination that involves OFFSET and LIMIT, which is slow (see the sketch below).
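To illustrate item 2, here's a rough sketch of the OFFSET/LIMIT problem versus keyset (cursor) pagination using the Django ORM; the model and variable names are illustrative stand-ins, not the actual Tecken code:

    from tecken.download.models import MissingSymbol  # hypothetical import path

    page_size = 100
    offset = 500_000  # a deep page

    # OFFSET/LIMIT: Postgres scans and discards `offset` rows before returning
    # the page, so deep pages get progressively slower.
    slow_page = MissingSymbol.objects.order_by("-id")[offset:offset + page_size]

    # Keyset ("cursor") pagination: filter on the last id the client saw.
    # This can use the primary key index and stays fast at any depth.
    last_seen_id = 12345
    fast_page = MissingSymbol.objects.filter(id__lt=last_seen_id).order_by("-id")[:page_size]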
Priority: P3 → P2

Here's a cursor pagination library:

https://github.com/photocrowd/django-cursor-pagination

That's worth looking into switching to.
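A rough sketch of what switching could look like for the file uploads API, loosely based on that library's README (worth double-checking the exact API against the project); the function name, model import path, and page size are made up:

    from cursor_pagination import CursorPaginator

    from tecken.upload.models import FileUpload  # hypothetical import path

    def file_uploads_page(after_cursor=None, page_size=100):
        paginator = CursorPaginator(
            FileUpload.objects.all(), ordering=("-created_at", "-id")
        )
        page = paginator.page(first=page_size, after=after_cursor)
        items = list(page)
        return {
            "files": items,
            "has_next": page.has_next,
            # Opaque cursor the client sends back as `after` for the next page.
            "next_cursor": paginator.cursor(items[-1]) if items else None,
        }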

Assignee: nobody → efilho
Status: NEW → ASSIGNED
Attached image sql_profiling_all.png

For https://symbols.mozilla.org/uploads/files/:
I loaded my local psql db with 20M entries (2x prod) and was able to notice some slowdown (~4s), but nothing prod-like (~5m). Then I used django-silk to profile calls to the db.

Assuming I was able to reproduce the issue, if we look at the attached image, the culprits are probably:

count for incomplete files:
SELECT COUNT(*) AS "__count" FROM "upload_fileupload" WHERE "upload_fileupload"."completed_at" IS NULL

time_avg:
SELECT AVG(("upload_fileupload"."completed_at" - "upload_fileupload"."created_at")) AS "time_avg" FROM "upload_fileupload"

count, size_avg, size_sum:
SELECT COUNT("upload_fileupload"."id") AS "count", AVG("upload_fileupload"."size") AS "size_avg", SUM("upload_fileupload"."size") AS "size_sum" FROM "upload_fileupload"

The admin page https://symbols.mozilla.org/admin/sitestatus/ is also slow on prod, and despite not being able to quite reproduce the issue locally (it only takes 1.3s), I found some interesting things:

  • SELECT count(*) or SELECT count(id) yield roughly the same load time (1.3s)
  • An estimate count (SELECT reltuples AS estimate FROM pg_class WHERE relname = '<table_name>') reduces the load time from 1.3s to 276ms (see the sketch below)
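For reference, here's a small sketch of how that estimate could be wired up in Django; this is just the idea, not code that exists in Tecken:

    from django.db import connection

    def estimated_count(table_name):
        # Use the planner's row estimate from pg_class instead of an exact
        # COUNT(*); close enough for a status page and much cheaper.
        with connection.cursor() as cursor:
            cursor.execute(
                "SELECT reltuples::bigint FROM pg_class WHERE relname = %s",
                [table_name],
            )
            row = cursor.fetchone()
        return row[0] if row else 0

    # e.g. estimated_count("upload_fileupload")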

As a first step I decided to make the File Uploads page (https://symbols.mozilla.org/uploads/files/) load its components asynchronously: "Files", which populates the table with the list of files, and "Aggregation", which sits at the bottom of the page and provides additional data such as total count, average size, and average upload time. Local tests show this improves usability considerably, since users can see the list of files quickly while the Aggregation part, which is slower to compute, loads in parallel.

My main issue now is with the "Next Page" button. It relies on the "total count" value, which after the refactoring is fetched by the Aggregation component (slower than Files).
So if we wait for the total count to arrive along with the other calculations before enabling the "Next Page" button, the user will be stuck on a page until Aggregation loads. That makes me think we should add the cursor pagination library willkg mentioned previously, but because the pagination component is shared between three different pages in the front-end, I don't know if it's a good idea to do it all in one shot. Here are some options I thought of:

  • do it all in one shot: add the cursor pagination component and refactor all APIs and front-end views to use it now (slowest, cleanest)
  • quick fix #1: when asked for Files, the backend also sends a "hasNext" flag (temporarily determined by fetching BATCH_SIZE + 1 elements), and the front-end adds a special condition in its pagination component to use this field when present (sketched after this list). This is not as quick to implement as the third option, but it's not as dirty either, since hasNext will continue to be a thing after introducing cursor pagination.
  • quick fix #2: the File Uploads page assumes that whenever a page is full of items there's a Next Page. This is the fastest to implement, but also the dirtiest, since it will all be throwaway once we add cursor pagination, plus there's an obvious bug when (total count % BATCH_SIZE == 0), in which case the last page is blank.
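Here's roughly what the hasNext trick from quick fix #1 could look like on the backend; a sketch only, with made-up names:

    def files_page(qs, page_number, batch_size):
        # Fetch one row beyond the batch size: if it exists, there is a next
        # page, and we never have to run an expensive COUNT(*).
        start = (page_number - 1) * batch_size
        rows = list(qs[start:start + batch_size + 1])
        has_next = len(rows) > batch_size
        return rows[:batch_size], has_next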

willkg, I'd like your input here. Feel free to suggest a different option.

Flags: needinfo?(willkg)

I don't think we need to worry about the "last page is blank" case. The only time we'd be looking at the last page would be in a local dev environment because there aren't many pages. In stage and prod, we have plenty of data, so we're never going to look at the last page. (Or at least, I can't think of a case where I've wanted to look at the last page.)

Beyond that, I'm not sure I have a strong opinion here. None of the pros and cons that you've outlined strike me as particularly amazing or onerous. I think the risk of making a "bad decision" here is low. In the worst possible scenario, you and I are inconvenienced with a frontend that has some issue until we get around to fixing it which is effectively where we're at now.

One thing I like doing is breaking things down into small, incremental steps--especially when I'm testing theories. I don't know if framing this work in that way changes how you want to approach the adjustments you want to make.

Flags: needinfo?(willkg)

Fair enough. One use case where we might hit a blank page is if we use a filter that reduces the result set to, say, exactly 120 results.

Oh, you're right! I completely forgot about the filters.

Deployed the async fetching of Files and Aggregates for the /uploads/files page to prod in bug 1764583.

We're seeing the following changes in that page's load time:

  • Before the deployment, the page used to take 1 to 3 minutes to load
  • After the deployment, the file list component loads in less than a second, consistently, which is better than I expected
  • Right after the deployment the aggregates component was loading in under 20s, but 15 minutes later it started taking 1 to 3 minutes to load, similar to the load time from before the deployment. Then right after that it went back to loading faster (~30 seconds). This looks like a degradation and might have something to do with available pod memory/garbage collection.

My thoughts on next steps:

  • Apply the same solution to the /uploads page
  • If we want to stick with aggregates and make them faster, further investigation should involve issuing requests to the /files/aggregates endpoint and looking at db load/memory consumption/caching to pinpoint where the bottleneck is.

We just deployed the changes for the /uploads page and had similar improvements on the uploads list!

However, :willkg and I noticed that when enough requests are made to /uploads/files (usually 5-10), something happens and the page becomes slow again. We think something might be tying up gunicorn workers, making them unavailable for new requests, which causes requests to queue up. We need to investigate and hopefully fix that. Because of that, we're leaving this bug open.

Here's a summary of where this bug is at:

  1. Current table counts in production:

    table                   count
    download_missingsymbol  7,737,096
    upload_upload           207,665
    upload_fileupload       14,590,035
    upload_uploadscreated   1,799

    Current table counts in stage:

    table                   count
    download_missingsymbol  529
    upload_upload           164
    upload_fileupload       733
    upload_uploadscreated   449

    These counts come from the "Site Status" page in the Django admin. In production, it takes over a minute to produce those counts.

  2. The uploads view (https://symbols.mozilla.org/uploads) lists records in the upload_upload table. Eduardo broke that view up into two separate API calls--one that gets records and one that gets aggregates.

    When I loaded that page today with no filters, it was pretty fast. Both API calls returned very quickly.

    However, when I filtered by mcastelluccio just now, it took a long time. I hard-refreshed the page to get a second set of timings and see if I could reproduce that:

    API         timing 1    timing 2
    records     96 seconds  115 milliseconds
    aggregates  98 seconds  51 seconds
  3. This page isn't used very often, but the concern is that if it is used by multiple people for debugging some symbol upload problem, it could tie up the database and web workers sufficiently to DoS the site and block symbol uploads.

  4. When Eduardo was researching the problem, he was unable to reproduce it in his local dev environment. We don't have enough records on stage to reproduce it, either. Currently the only place this problem exists is in production.

Outstanding questions and possible next steps:

  1. Can we reproduce this issue on stage?

    If so, this gives us a lot more freedom to experiment and test theories. If not, why?

  2. Are the production database and the local dev environment defined the same way? (tables, indexes, and other things)

    If not (missing index, table definitions different, etc), then maybe the slowness problem is caused by a difference.

  3. Is the production database being maintained regularly? (vacuum, updates, etc)

    If not (not getting regularly vacuumed), then maybe we should vacuum the upload_upload table and that will reduce the slowness.

  4. Is the number of records in the upload_upload table not conducive to the way Django does aggregations and pagination?

    If so, then we should switch how we do aggregations and caching. An earlier comment wondered whether we can remove the aggregations altogether. Another earlier comment suggested other ways to do pagination.

We don't know what the underlying cause is. We've done some work to alleviate it somewhat, but the issue is still reproducible, though not consistently. Lonnen and I don't think this is a "production database is set up differently" issue; we think the problem is related to the number of records in the tables and how Postgres does counts. We should spend some time verifying that, but it's probably the case that we should rethink this view.

Further, the "upload files" view suffers the same issues and whatever we do with "uploads" we should probably do there, too.

Related resources:

Assignee: efilho → willkg
Depends on: 1796579, 1796633

In bug #1796579, I removed all the aggregates from the Uploads page. In bug #1796633, I removed all the aggregates from the Uploads Files page. The uploads and uploads files pages are a lot faster now.

Given that, I'm going to close this out.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
