Closed Bug 1853962 Opened 1 year ago Closed 1 year ago

disk manager process dies if db isn't available

Categories

(Tecken :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Details

Attachments

(1 file)

When the Tecken Docker container starts up, it runs tini which runs Honcho. Honcho runs:

/app/bin/run_web_disk_manager.sh

to start the disk manager. That runs:

python manage.py remove_orphaned_files --daemon

If the db isn't available or some other check fails, that command returns a non-zero exit code, which drops the script out of its loop, and the disk manager process exits. Honcho sees that and sends SIGTERM to the web process. Once all processes exit, Honcho exits and the Docker container stops. Then the instance starts it up again, and the cycle repeats while the db stays unavailable. That's what happened in bug #1853745.
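To make the cascade concrete, here is a minimal sketch of the "one process exits, so everything gets stopped" behavior described above. It is not Honcho's code: `supervise` is a hypothetical helper, and the two child commands are stand-ins for the disk manager and the web process.

```python
import subprocess
import sys
import time


def supervise(commands, poll_interval=0.05):
    """Start every command; when any one exits, terminate the rest.

    Returns a dict mapping each name to its exit code.
    """
    procs = {name: subprocess.Popen(cmd) for name, cmd in commands.items()}
    # Wait until the first child exits, for whatever reason.
    while all(p.poll() is None for p in procs.values()):
        time.sleep(poll_interval)
    # One child is gone, so ask the survivors to stop (SIGTERM on POSIX).
    for p in procs.values():
        if p.poll() is None:
            p.terminate()
    return {name: p.wait() for name, p in procs.items()}


if __name__ == "__main__":
    codes = supervise({
        # Stand-in for the disk manager failing its startup checks.
        "disk_manager": [sys.executable, "-c", "import sys; sys.exit(1)"],
        # Stand-in for a healthy, long-running web process.
        "web": [sys.executable, "-c", "import time; time.sleep(60)"],
    })
    print(codes)
```

The web stand-in would happily run for a minute, but it gets terminated as soon as the disk manager stand-in exits, which is the same shape as the outage described here.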

We should:

  1. add --skip-checks to the python manage.py remove_orphaned_files --daemon line
  2. wrap the line in a Sentry wrapper so any failures get sent to Sentry
  3. reconsider the loop: the script is set to die on error, so the loop isn't helping
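One way to picture items 2 and 3 together is a loop that reports a failure and tries again instead of dying on the first non-zero exit. This is only a sketch under assumptions, not the change that landed: `run_with_restarts` is a hypothetical helper, `report` is a placeholder for whatever sends the failure to Sentry, and stopping on a clean exit is a design choice made for the example.

```python
import subprocess
import sys
import time


def run_with_restarts(cmd, max_attempts=None, delay=1.0, report=print):
    """Run cmd repeatedly, reporting failures instead of exiting on them.

    Stops when cmd exits with code 0 or after max_attempts runs
    (max_attempts=None means keep trying indefinitely).
    Returns the number of attempts made.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            break  # clean exit: nothing to report or retry
        # Placeholder for the Sentry wrapper: surface the failure somewhere.
        report(f"attempt {attempts}: {cmd!r} exited with {returncode}")
        if max_attempts is None or attempts < max_attempts:
            time.sleep(delay)  # brief pause so a db outage isn't hammered
    return attempts


if __name__ == "__main__":
    # Stand-in for a command that always fails its startup checks.
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_with_restarts(failing, max_attempts=3, delay=0.1))
```

With the real command in place of the stand-in, the list passed as `cmd` would be along the lines of `["python", "manage.py", "remove_orphaned_files", "--daemon", "--skip-checks"]`, so a transient db outage produces a reported failure and a retry instead of taking the container down.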
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Attachment #9365889 - Attachment description: [mozilla-services/tecken] bug-1853962: improve disk cache manager resilience (#2840) → [mozilla-services/tecken] bug-1853962: improve disk manager resilience (#2840)

willkg merged PR [mozilla-services/tecken]: bug-1853962: improve disk manager resilience (#2840) in eab3500.

This improves the resilience of the disk manager and also fixes sentry-wrap to work better with Django commands. This should improve Tecken's stability, especially with transient issues like being unable to connect to the db.

I pushed this to prod just now in bug #1867844. Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED