Celery queue "mdn_purgeable" grows to critical levels

Status

Product: developer.mozilla.org
Component: Performance
Status: RESOLVED FIXED
Reported: 10 months ago
Last modified: 3 months ago

People

(Reporter: rjohnson, Assigned: rjohnson)

Tracking

({in-triage})

Details

(Whiteboard: [specification][type:bug])

(Assignee)

Description

10 months ago
What did you do?
================
1. On May 18, 2017, high levels were reported (see https://bugzilla.mozilla.org/show_bug.cgi?id=1366044).

2. Yesterday (June 15, 2017), very high levels were reported again in the IRC channel #mdndev:

[2017-06-15 09:25:42] <ericz> jwhitlock: Rabbit's mdn_purgeable queue is getting swamped in prod.  It did it a couple hours ago then cleared and now it's really high again.  Should we take any action there?
[2017-06-15 09:26:16] <jwhitlock> that is unexpected, and mostly invisible to me. I'll take a look on my side. No action yet, plz
[2017-06-15 09:26:23] <ericz> Sure thing
[2017-06-15 09:56:37] <jwhitlock> My analysis shows it is the DocumentNearestZoneJob again


What happened?
==============
The Celery queue "mdn_purgeable" sometimes grows to critical levels.

What should have happened?
==========================
The Celery queue "mdn_purgeable" should not grow to critical levels.

Is there anything else we should know?
======================================
This instability seems to have been introduced by our backend changes for zones (https://github.com/mozilla/kuma/pull/4209).
(Assignee)

Updated

10 months ago
Assignee: nobody → rjohnson
(Assignee)

Comment 1

10 months ago
:jwhitlock did some research into the Celery "mdn_purgeable" queue when it reached critical levels, and the growth appears to be due to "refresh_cache" tasks for the DocumentNearestZoneJob (DNZJ). We developed the following hypothesis as to what is happening:

1) A request hits an MDN page, which loads the cache with the DNZJ result and an initial expiration of 29 hours
2) The page is hit again after that expiration, and since the cache entry is stale, two things occur:
    a) the cache is re-loaded with the stale value and an expiration based on "refresh_timeout" (which defaults to 60 seconds)
    b) a "refresh_cache" task is queued
3) The page is hit again after "refresh_timeout" but before the task queued in 2b has completed (tasks are slow to complete, perhaps due to a spike of tasks entering the queue), so the cache has expired again and another async task is queued
4) This repeats, creating a vicious cycle: the queue keeps growing, tasks take longer to complete, and even more tasks are queued
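
Below is a minimal, self-contained sketch (in Python; not Kuma's actual code) of the behavior described above. An in-memory dict stands in for memcached, a plain list stands in for the "mdn_purgeable" queue, and fetch_nearest_zone() is a placeholder for the real nearest-zone lookup:

import time

REFRESH_TIMEOUT = 60          # step 2a: how long the stale value is re-cached for
LIFETIME = 29 * 60 * 60       # step 1: initial expiration of 29 hours

cache = {}                    # key -> (value, expires_at); stands in for memcached
task_queue = []               # stands in for the "mdn_purgeable" Celery queue


def fetch_nearest_zone(doc_id):
    """Placeholder for the expensive nearest-zone lookup."""
    return "zone-for-%s" % doc_id


def get_nearest_zone(doc_id):
    now = time.time()
    entry = cache.get(doc_id)

    if entry is None:
        # Cache miss: compute synchronously and cache with the full lifetime.
        value = fetch_nearest_zone(doc_id)
        cache[doc_id] = (value, now + LIFETIME)
        return value

    value, expires_at = entry
    if now >= expires_at:
        # Step 2a: serve the stale value, but re-cache it for only
        # REFRESH_TIMEOUT seconds ...
        cache[doc_id] = (value, now + REFRESH_TIMEOUT)
        # Step 2b: ... and queue an async refresh. If the refresh is slow and
        # the page is hit again after REFRESH_TIMEOUT, this branch runs again
        # and queues another task -- the cycle in steps 3 and 4.
        task_queue.append(("refresh_cache", doc_id))
    return value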

Proposed Solution:

It seems best to spread out the expirations and to make the job less eager to queue async refresh tasks after expiration. Specifically:

1) Spread the initial expirations by overriding the "expiry" method on the DNZJ class so that they are distributed randomly over 1-10 days
2) Increase "refresh_timeout" from 60 seconds to something longer (perhaps 2-5 minutes)
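
A rough sketch of what this proposal could look like, assuming the DNZJ builds on a cacheback-style Job base class; the import path, the exact refresh_timeout value, and the fetch() placeholder are illustrative assumptions, and the values that actually landed are in the commits below:

import random
import time

from cacheback.base import Job  # assumption: DNZJ builds on a cacheback-style Job


class DocumentNearestZoneJob(Job):
    # 2) Be less eager to queue another async refresh while a stale value is
    #    being served: raise refresh_timeout from the 60-second default to,
    #    say, 3 minutes (the proposal suggests 2-5 minutes).
    refresh_timeout = 3 * 60

    def expiry(self, *args, **kwargs):
        # 1) Spread the initial expirations randomly over 1-10 days so that
        #    entries cached around the same time do not all go stale at once.
        #    expiry() is assumed to return an absolute expiration timestamp.
        return time.time() + 60 * 60 * 24 * random.randint(1, 10)

    def fetch(self, *args, **kwargs):
        # Placeholder for the real nearest-zone lookup.
        raise NotImplementedError
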
Keywords: in-triage

Comment 3

10 months ago
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/bc28dfe5158e0702c09d0a0bd58a28507ea64e2f
bug 1373720: adjust DocumentNearestZoneJob cache staleness thresholds

* increase DocumentNearestZoneJob.refresh_timeout from 60 to 180 seconds
* change DocumentNearestZoneJob.lifetime from a constant to a property
  that returns a random number of days between 1 and 10 (in seconds)

https://github.com/mozilla/kuma/commit/d5334e3c35414723fd63055008e66f2084d349b4
Merge pull request #4272 from escattone/address-celery-queue-issue-1373720

bug 1373720: adjust DocumentNearestZoneJob cache staleness thresholds

Over the last month, the running average has been about 250 tasks in the "mdn_purgeable" queue. The pattern is still spiky: roughly zero tasks most of the time, then jumps to 200-500. There have been some larger spikes, up to 30K, but those were caused by unusual workloads, such as regenerating all the pages on the site.

Thanks Ryan, it looks like the fix worked.
Status: NEW → RESOLVED
Last Resolved: 3 months ago
Resolution: --- → FIXED