Bug 1413553 (Closed) - Opened 7 years ago, Closed 5 years ago

Investigate why DB failovers require worker dyno restarts for tasks to resume

Categories

(Tree Management :: Treeherder: Infrastructure, enhancement, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

Details

A couple of times now (the most recent being bug 1413547), the DB has failed over successfully, but the celery queue backlogs for pulse job ingestion (amongst other things) have appeared to keep growing until the Heroku dynos were restarted.

If this is actually the case (vs just coincidental timing), then we should try to resolve this, so that unexpected DB failovers in the middle of the night don't require human intervention.

I imagine the best way to confirm the problem would be to:
1) Initiate a failover on stage (dev isn't multi-AZ, so wouldn't work)
2) Monitor the queue sizes on New Relic
3) Don't restart the dynos until a reasonable amount of time has passed (say 30 minutes), to see whether they self-recover once the celery task timeouts have expired
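For step 2, the queue sizes could also be checked directly against the broker rather than via New Relic. A minimal sketch, assuming a RabbitMQ-style broker reachable via a BROKER_URL environment variable; the queue names are hypothetical, not necessarily the ones Treeherder uses:

import os

from kombu import Connection

# Hypothetical queue names, for illustration only.
QUEUES = ["store_pulse_jobs", "log_parser"]

with Connection(os.environ["BROKER_URL"]) as conn:
    channel = conn.channel()
    for name in QUEUES:
        # A passive declare returns the current depth without creating
        # or modifying the queue.
        info = channel.queue_declare(queue=name, passive=True)
        print("{}: {} messages waiting".format(name, info.message_count))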

I'd imagine the outcome will be one of:
* the queues self-recover within a few minutes - i.e. the previous instances were just unlucky timing of the restart, so there's nothing more to do here.
* the queues self-recover after 10-20 minutes - i.e. the tasks have to hit the celery timeouts, which we could lower to improve the situation (see the sketch after this list).
* the queues never self-recover, in which case this might be a Heroku dyno DNS caching issue or similar.
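If the second outcome turns out to be the case, lowering the timeouts is a small config change. A minimal sketch, assuming Celery 4 style lowercase settings; the values are illustrative, not what Treeherder actually uses:

from celery import Celery

app = Celery("treeherder")

app.conf.update(
    # Raise SoftTimeLimitExceeded inside the task after 60s so it can
    # clean up, then hard-kill the worker child process at 90s.
    task_soft_time_limit=60,
    task_time_limit=90,
)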

We may also want to try updating to Celery 4 (bug 1337717), since this may be a bug they've already fixed.
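Another angle worth checking, sketched here under the assumption (not confirmed by this bug) that the workers are reusing Django DB connections that died during the failover: dropping stale connections before each task would force a reconnect instead of reusing a dead socket.

from celery.signals import task_prerun
from django.db import close_old_connections

@task_prerun.connect
def close_stale_db_connections(*args, **kwargs):
    # Drops connections that are unusable or past CONN_MAX_AGE; Django
    # opens a fresh one the next time the ORM is touched.
    close_old_connections()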
Assignee: nobody → emorley

I've rebooted prototype's RDS instance (confusingly called treeherder-dev) to try out the approach above.
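For reference, the reboot can also be triggered from boto3 rather than the AWS console. A minimal sketch, assuming a Multi-AZ instance (ForceFailover only applies there) and a hypothetical region:

import boto3

rds = boto3.client("rds", region_name="us-east-1")  # region is an assumption

# ForceFailover reboots onto the standby rather than in place.
rds.reboot_db_instance(
    DBInstanceIdentifier="treeherder-dev",
    ForceFailover=True,
)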

Status: NEW → ASSIGNED

The workers reconnected to the DB with no intervention required, so it looks like this has been fixed in the meantime (perhaps by the newer Celery/kombu from bug 1337717).

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED