autoland should use a queue instead of a database

NEW
Unassigned

Status

2 years ago
2 months ago

People

(Reporter: glob, Unassigned)

Tracking

(Depends on: 1 bug, Blocks: 1 bug, {conduit-story, conduit-triaged})

Production

Attachments

(1 attachment, 1 obsolete attachment)

(Reporter)

Description

2 years ago
my concerns with current design:

- doesn't function as a true queue
  - patches are not always landed in the order they were submitted, which is unexpected
  - if there's a transient failure when trying to land a patch at the head of the queue, it'll be deferred and the next patch will be attempted
  - if the trees are closed for an extended period of time, this is likely to result in unexpected ordering of patches
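The ordering problem above can be reproduced with a small model. This is only a sketch: `land_with_deferral` and the patch names are hypothetical, and it assumes a deferred job is retried after the jobs submitted behind it (this bug does not say exactly where autoland re-queues a deferred job).

```python
from collections import deque


def land_with_deferral(jobs, fails_once):
    """Model of defer-on-failure: a job that hits a transient failure is
    set aside and the next job is attempted instead."""
    pending = deque(jobs)
    failed_already = set()
    landed = []
    while pending:
        job = pending.popleft()
        if job in fails_once and job not in failed_already:
            failed_already.add(job)
            # Assumption: the deferred job goes behind later submissions.
            pending.append(job)
            continue
        landed.append(job)
    return landed


submitted = ["patch-1", "patch-2", "patch-3"]
print(land_with_deferral(submitted, fails_once={"patch-1"}))
# prints ['patch-2', 'patch-3', 'patch-1']
```

One transient failure on patch-1 is enough for it to land last, even though it was submitted first.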

- a single "queue" is used for all repositories
  - we shouldn't be restricted to one worker for all repos

- data stored unnecessarily on server
  - there's a pg database which holds the "queue"
  - once jobs are processed they are not removed from the database

- service outages (e.g. review board upgrades) may result in lost notifications

proposed solution:

- use pulse for queuing commits
  - removes data store from autoland server, simplifying deployment and code
  - report success and failures back to review board, also via pulse
    - review board should be the data store, not autoland
  - report to treeherder (maybe?), to provide a high level view of autoland's activities, and as a means to view the detailed autoland logs for success and failures
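As a rough illustration of the request/result exchange, the two message types might look like the sketch below. Every field name here is hypothetical, JSON is assumed as the wire format, and no Pulse client code is shown; the point is only that the landing result carries enough information for review board to be the data store.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class LandingRequest:
    request_id: str   # hypothetical id assigned by review board
    repository: str   # destination repository
    commit: str       # commit to land


@dataclass
class LandingResult:
    request_id: str   # echoes the request so review board can match it
    repository: str
    landed: bool
    detail: str       # e.g. the landed revision or an error summary


def encode(message) -> str:
    """Serialize a request or result for publishing to the queue."""
    return json.dumps(asdict(message), sort_keys=True)


def decode_request(payload: str) -> LandingRequest:
    return LandingRequest(**json.loads(payload))


def decode_result(payload: str) -> LandingResult:
    return LandingResult(**json.loads(payload))
```

With messages like these, autoland keeps no record once a result has been published; a consumer that was offline reads the result when it reconnects, provided the queue is durable.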

- always process the commit at the head of the queue (ie. FIFO)
  - need to detect transient vs fatal failures
  - a transient failure should result in the job being retried
    - with a back-off
    - if retry attempts hit a max value, autoland should stop and trigger alerts for admins to deal with
  - admins need a mechanism to easily examine the job at the head of the queue and the failures encountered during processing
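A minimal sketch of the head-of-queue loop described above, with exponential back-off and a retry cap. The exception classes, `process_queue`, and its parameters are all hypothetical names; how a failure gets classified as transient or fatal is left to the `land` callable, and the choice of exponential delays is an assumption (the bug only asks for "a back-off").

```python
import time
from collections import deque


class TransientError(Exception):
    """Failure worth retrying, e.g. a closed tree or a network timeout."""


class FatalError(Exception):
    """Failure that retrying cannot fix, e.g. a patch that does not apply."""


class QueueStalled(Exception):
    """Retries exhausted; the head job stays queued for admins to inspect."""


def process_queue(queue, land, report, max_retries=5, base_delay=1.0,
                  sleep=time.sleep):
    """Process jobs strictly in FIFO order; never skip past the head."""
    while queue:
        job = queue[0]   # peek at the head, remove it only when finished
        failures = []    # kept so admins can see what went wrong
        while True:
            try:
                land(job)
            except TransientError as exc:
                failures.append(str(exc))
                if len(failures) > max_retries:
                    # Stop here; a caller would alert admins at this point.
                    raise QueueStalled(f"{job}: {failures}") from exc
                # Exponential back-off: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** (len(failures) - 1))
                continue
            except FatalError as exc:
                report(job, False, str(exc))
            else:
                report(job, True, "")
            break
        queue.popleft()
```

When `QueueStalled` is raised the job is still at the head of the queue, so nothing behind it lands out of order, and the exception message carries the failures seen so far.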

- use a separate queue/topic/key for each repository
  - required when switching to a true queuing system
  - allows us to land to different repos at the same time
  - spin up a process/container/instance for each queue
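The per-repository layout can be sketched with in-process queues and threads standing in for real queues and processes/containers. `start_workers` and the `land` callable are hypothetical; a real deployment would replace `queue.Queue` with the chosen queuing service.

```python
import queue
import threading


def start_workers(repositories, land):
    """Create one FIFO queue and one worker thread per repository."""
    queues = {repo: queue.Queue() for repo in repositories}

    def worker(repo, jobs):
        while True:
            commit = jobs.get()
            if commit is None:   # sentinel: shut this worker down
                jobs.task_done()
                return
            try:
                land(repo, commit)
            finally:
                jobs.task_done()

    threads = []
    for repo, jobs in queues.items():
        thread = threading.Thread(
            target=worker, args=(repo, jobs),
            name=f"autoland-{repo}", daemon=True,
        )
        thread.start()
        threads.append(thread)
    return queues, threads
```

Each repository keeps its own strict ordering, while a slow or blocked landing on one repository does not hold up the others.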
(Reporter)

Comment 1

2 years ago
Created attachment 8770024 [details]
autoland - current
(Reporter)

Comment 2

2 years ago
Created attachment 8770025 [details]
autoland - proposed

Comment 3

2 years ago
There will be some upcoming changes to autoland to support Servo that may require persistent state (read: a database).

I'm all for using a proper queue, however. But don't get your heart set on killing the database :(
(Reporter)

Comment 4

2 years ago
(In reply to Gregory Szorc [:gps] from comment #3)
> There will be some upcoming changes to autoland to support Servo that may
> require persistent state (read: a database).
> 
> I'm all for using a proper queue, however. But don't get your heart set on
> killing the database :(

no worries -- if required we should use RDS (or S3?) for persistence, not a database running on the server.
Depends on: 1287537

Comment 5

2 years ago
Glob and I discussed this the other day and I will be working with him to use Pulse rather than the database as a message queue for the Servo autoland changes.

Updated

2 months ago
Product: MozReview → Conduit

Comment 6

2 months ago
This is still something we very much want to do, but it needs to be prioritized against other important Lando features.
Keywords: conduit-triaged
Whiteboard: [lando-backlog]

Updated

2 months ago
Blocks: 1266732
This will most likely use SQS, not Pulse. Turning this into the overarching story.
Depends on: 1312140
Keywords: conduit-story
Summary: autoland should use a queue (amqp/pulse) instead of a database → autoland should use a queue instead of a database
Whiteboard: [lando-backlog]
Depends on: 1467694
(Reporter)

Updated

2 months ago
Attachment #8770025 - Attachment is obsolete: true