Closed Bug 1668371 Opened 4 years ago Closed 2 years ago

[research] figure out why uploads pages are so slow

Categories

(Tecken :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

Details

Attachments

(4 files)

Occasionally, the general uploads page is slow:

https://symbols.mozilla.org/uploads

Much more often, the upload files page is slow:

https://symbols.mozilla.org/uploads/files/

This bug covers getting timings, figuring out what the issues are, and then fixing them.

Making this a P3. It affects me every time I have to investigate symbols issues, but otherwise it's not a big deal.

Priority: -- → P3

We got the site status page into prod and it takes 20 seconds to run. The most interesting thing it does is table counts. I wonder if table counts are slow for large tables. That'd totally affect the upload pages because they all return table counts.

Maybe we should cache counts for large tables?

Maybe we should ditch the whole table count altogether? Is it necessary to have?
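If we do keep the counts, caching them could look something like this rough sketch, assuming Django's cache framework is configured; the model import path, cache key, and timeout are made up for illustration:

    from django.core.cache import cache

    from tecken.upload.models import FileUpload  # hypothetical import path

    def cached_fileupload_count():
        # Recompute the expensive COUNT(*) at most once per hour; all other
        # requests read the cached value.
        return cache.get_or_set(
            "fileupload_total_count",
            FileUpload.objects.count,
            timeout=60 * 60,
        )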

Ideas from bug #1667132:

The site status page I added gets table counts for a few tables, a couple of which are large. It takes 30+ seconds to load.

I have a few ideas on next steps:

  1. The uploads views are doing a bunch of counts.
  2. The missingsymbols views are doing pagination that involves OFFSET and LIMIT, which is slow (see the sketch below).
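To illustrate item 2, here's a rough sketch of the OFFSET/LIMIT problem versus keyset (cursor) pagination using the Django ORM; the model and variable names are illustrative stand-ins, not the actual Tecken code:

    from tecken.download.models import MissingSymbol  # hypothetical import path

    page_size = 100
    offset = 500_000  # a deep page

    # OFFSET/LIMIT: Postgres scans and discards `offset` rows before returning
    # the page, so deep pages get progressively slower.
    slow_page = MissingSymbol.objects.order_by("-id")[offset:offset + page_size]

    # Keyset ("cursor") pagination: filter on the last id the client saw.
    # This can use the primary key index and stays fast at any depth.
    last_seen_id = 12345
    fast_page = MissingSymbol.objects.filter(id__lt=last_seen_id).order_by("-id")[:page_size]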
Priority: P3 → P2

Here's a cursor pagination library:

https://github.com/photocrowd/django-cursor-pagination

That's worth looking into switching to.
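A rough sketch of what switching could look like for the file uploads API, loosely based on that library's README (worth double-checking the exact API against the project); the function name, model import path, and page size are made up:

    from cursor_pagination import CursorPaginator

    from tecken.upload.models import FileUpload  # hypothetical import path

    def file_uploads_page(after_cursor=None, page_size=100):
        paginator = CursorPaginator(
            FileUpload.objects.all(), ordering=("-created_at", "-id")
        )
        page = paginator.page(first=page_size, after=after_cursor)
        items = list(page)
        return {
            "files": items,
            "has_next": page.has_next,
            # Opaque cursor the client sends back as `after` for the next page.
            "next_cursor": paginator.cursor(items[-1]) if items else None,
        }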

Assignee: nobody → efilho
Status: NEW → ASSIGNED
Attached image sql_profiling_all.png

For https://symbols.mozilla.org/uploads/files/:
I loaded my local psql db with 20M entries (2x prod) and was able to notice some slowdown (~4s), but nothing prod-like (~5m). Then I used django-silk to profile calls to the db.

Assuming I was able to reproduce the issue, if we look at the attached image, the culprits are probably:

count for incomplete files:
SELECT COUNT(*) AS "__count" FROM "upload_fileupload" WHERE "upload_fileupload"."completed_at" IS NULL

time_avg:
SELECT AVG(("upload_fileupload"."completed_at" - "upload_fileupload"."created_at")) AS "time_avg" FROM "upload_fileupload"

count, size_avg, size_sum:
SELECT COUNT("upload_fileupload"."id") AS "count", AVG("upload_fileupload"."size") AS "size_avg", SUM("upload_fileupload"."size") AS "size_sum" FROM "upload_fileupload"

The admin page https://symbols.mozilla.org/admin/sitestatus/ is also slow on prod, and despite not being able to quite reproduce the issue locally (it only takes 1.3s), I found some interesting things:

  • SELECT count(*) or SELECT count(id) yield roughly the same load time (1.3s)
  • An estimate count (SELECT reltuples AS estimate FROM pg_class WHERE relname = '<table_name>') reduces the load time from 1.3s to 276ms (see the sketch below)
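For reference, here's a small sketch of how that estimate could be wired up in Django; this is just the idea, not code that exists in Tecken:

    from django.db import connection

    def estimated_count(table_name):
        # Use the planner's row estimate from pg_class instead of an exact
        # COUNT(*); close enough for a status page and much cheaper.
        with connection.cursor() as cursor:
            cursor.execute(
                "SELECT reltuples::bigint FROM pg_class WHERE relname = %s",
                [table_name],
            )
            row = cursor.fetchone()
        return row[0] if row else 0

    # e.g. estimated_count("upload_fileupload")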

As a first step I decided to make the File Uploads page (https://symbols.mozilla.org/uploads/files/) load its components asynchronously: "Files", which populates the table with the list of files, and "Aggregation", which sits at the bottom of the page and provides additional data such as total count, average size, and average upload time. Local tests show this improves usability considerably, since users can see the list of files quickly while the Aggregation part, which is slower to compute, loads in parallel.

My main issue now is with the "Next Page" button. It relies on the "total count" value, which after the refactoring is fetched by the Aggregation component (slower than Files).
So if we wait for the total count to arrive along with the other calculations before enabling the "Next Page" button, the user will be stuck on a page until Aggregation loads. That makes me think we should add the cursor pagination library willkg mentioned previously, but because the pagination component is shared between three different pages in the front-end, I don't know if it's a good idea to do it all in one shot. Here are some options I thought of:

  • do it all in one shot: add the cursor pagination component and refactor all APIs and front-end views to use it now (slowest, cleanest)
  • quick fix #1: when asked for Files, the backend also sends a "hasNext" flag (temporarily determined by fetching BATCH_SIZE + 1 elements), and the front-end adds a special condition in its pagination component to use this field when present (sketched after this list). This is not as quick to implement as the third option, but it's not as dirty either, since hasNext will continue to be a thing after introducing cursor pagination.
  • quick fix #2: the File Uploads page assumes that whenever a page is full of items there's a Next Page. This is the fastest to implement, but also the dirtiest, since it will all be throwaway once we add cursor pagination, plus there's an obvious bug when (total count % BATCH_SIZE == 0), in which case the last page is blank.
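Here's roughly what the hasNext trick from quick fix #1 could look like on the backend; a sketch only, with made-up names:

    def files_page(qs, page_number, batch_size):
        # Fetch one row beyond the batch size: if it exists, there is a next
        # page, and we never have to run an expensive COUNT(*).
        start = (page_number - 1) * batch_size
        rows = list(qs[start:start + batch_size + 1])
        has_next = len(rows) > batch_size
        return rows[:batch_size], has_next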

willkg, I'd like your input here. Feel free to suggest a different option.

Flags: needinfo?(willkg)

I don't think we need to worry about the "last page is blank" case. The only time we'd be looking at the last page would be in a local dev environment because there aren't many pages. In stage and prod, we have plenty of data, so we're never going to look at the last page. (Or at least, I can't think of a case where I've wanted to look at the last page.)

Beyond that, I'm not sure I have a strong opinion here. None of the pros and cons that you've outlined strike me as particularly amazing or onerous. I think the risk of making a "bad decision" here is low. In the worst possible scenario, you and I are inconvenienced with a frontend that has some issue until we get around to fixing it which is effectively where we're at now.

One thing I like doing is breaking things down into small, incremental steps--especially when I'm testing theories. I don't know if framing this work in that way changes how you want to approach the adjustments you want to make.

Flags: needinfo?(willkg)

Fair enough. One use case where we might hit a blank page is if we use a filter that reduces the result set to, say, exactly 120 results.

Oh, you're right! I completely forgot about the filters.

Deployed the async fetching of Files and Aggregates for the /uploads/files page to prod in bug 1764583.

We're seeing the following changes in that page's load time:

  • Before the deployment, the page used to take 1 to 3 minutes to load
  • After the deployment, the file list component loads in less than a second, consistently, which is better than I expected
  • Right after the deployment the aggregates component was loading in under 20s, but 15 minutes later it started taking 1 to 3 minutes to load, similar to the load time from before the deployment. Then right after that it went back to loading faster (~30 seconds). This looks like a degradation and might have something to do with available pod memory/garbage collection.

My thoughts on next steps:

  • Apply the same solution to the /uploads page
  • If we want to stick with aggregates and make them faster, further investigation should involve issuing requests to the /files/aggregates endpoint and looking at db load/memory consumption/caching to pinpoint where the bottleneck is.

We just deployed the changes for the /uploads page and had similar improvements on the uploads list!

However, :willkg and I noticed that when enough requests are made to /uploads/files (usually 5-10), something happens and the page becomes slow again. We think something might be tying up gunicorn workers, making them unavailable for new requests, which causes requests to queue up. We need to investigate and hopefully fix that. Because of that, we're leaving this bug open.

Here's a summary of where this bug is at:

  1. Current table counts in production:

    table                   count
    download_missingsymbol  7,737,096
    upload_upload           207,665
    upload_fileupload       14,590,035
    upload_uploadscreated   1,799

    Current table counts in stage:

    table                   count
    download_missingsymbol  529
    upload_upload           164
    upload_fileupload       733
    upload_uploadscreated   449

    These counts come from the "Site Status" page in the Django admin. In production, it takes over a minute to produce those counts.

  2. The uploads view (https://symbols.mozilla.org/uploads) lists records in the upload_upload table. Eduardo broke that view up into two separate API calls--one that gets records and one that gets aggregates.

    When I loaded that page today with no filters, it was pretty fast. Both API calls returned very quickly.

    However, when I filtered by mcastelluccio just now, it took a long time. I hard-refreshed the page to get a second set of timings and see if I could reproduce that:

    API         timing 1    timing 2
    records     96 seconds  115 milliseconds
    aggregates  98 seconds  51 seconds
  3. This page isn't used very often, but the concern is that if it is used by multiple people for debugging some symbol upload problem, it could tie up the database and web workers sufficiently to DoS the site and block symbol uploads.

  4. When Eduardo was researching the problem, he was unable to reproduce it in his local dev environment. We don't have enough records on stage to reproduce it, either. Currently the only place this problem exists is in production.

Outstanding questions and possible next steps:

  1. Can we reproduce this issue on stage?

    If so, this gives us a lot more freedom to experiment and test theories. If not, why?

  2. Are the production database and the local dev environment defined the same way? (tables, indexes, and other things)

    If not (missing index, table definitions different, etc), then maybe the slowness problem is caused by a difference.

  3. Is the production database being maintained regularly? (vacuum, updates, etc)

    If not (not getting regularly vacuumed), then maybe we should vacuum the upload_upload table and that will reduce the slowness.

  4. Is the number of records in the upload_upload table not conducive to the way Django does aggregations and pagination?

    If so, then we should switch how we do aggregations and caching. An earlier comment wondered whether we can remove the aggregations altogether. Another earlier comment suggested other ways to do pagination.

We don't know what the underlying cause is. We've done some work to alleviate it somewhat, but the issue is still reproducible, though not consistently. Lonnen and I don't think this is a "production database is set up differently" issue; we think the problem is related to the number of records in the tables and how Postgres does counts. We should spend some time verifying that, but it's probably the case that we should rethink this view.

Further, the "upload files" view suffers the same issues and whatever we do with "uploads" we should probably do there, too.

Related resources:

Assignee: efilho → willkg
Depends on: 1796579, 1796633

In bug #1796579, I removed all the aggregates from the Uploads page. In bug #1796633, I removed all the aggregates from the Uploads Files page. The uploads and uploads files pages are a lot faster now.

Given that, I'm going to close this out.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
