Closed Bug 1467845 Opened 6 years ago Closed 5 years ago

Host and run Buildhub2

Categories

(Cloud Services :: Operations: Kinto, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: peterbe, Assigned: autrilla)

References

Details

This is the tracker bug for getting Buildhub2 off the ground and into Stage + Prod. 

The code is https://github.com/mozilla/buildhub2 and its three key features are:

1. A Django web server that needs to be exposed
2. A daemon script (*)
3. A cron job script (**)


(*) This script is basically `python manage.py daemon`, which consumes an AWS SQS queue forever. If it crashes/fails it needs to be restarted. (Any crashes/fails will be reported to Sentry.)
(**) The backup for the SQS consumer that picks up any potential slack. Essentially, you just need to run `./manage.py backfill` once a day.
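
For illustration, if plain cron ends up running it, the daily backfill is a single crontab entry along these lines (the schedule and working directory below are just placeholders; the actual scheduling mechanism is up to cloudops):

    # hypothetical crontab entry: run the S3 backfill once a day
    0 4 * * *  cd /app && ./manage.py backfill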

The project builds a complete docker container on CircleCI https://circleci.com/gh/mozilla/buildhub2
(At the moment it's failing builds on master because it can't publish to Dockerhub)

The intention is to have a Stage and a Prod. Each environment should consume its own SQS queue but it would be ideal if the Stage environment could run its backfill against the S3 bucket used by Prod. 

The S3 bucket it should have access to is the (public) one used by the existing Buildhub service. 

The documentation regarding configuration is (at the time of writing) incomplete. Tracking that separately here: https://github.com/mozilla/buildhub2/issues/50

Ultimately the goal of Buildhub2 is to completely replace the previous Buildhub.
REGARDING CRON JOB

This can wait.

The hope is that the SQS queue is stable enough that even if there's an RDS connection glitch or an AWS S3 networking glitch (e.g. SSLError), the job is not lost, unlike with Lambda. Instead, we'll try and try again till everything works. However, the cron backfill is there in case of bugs in our code or some unpredictable network error after we have deleted a good message off the queue.

In the "early days" we can execute the cron job manually (e.g. cloudops via ssh shell) as we're learning to walk confidently.
REGARDING THE SQS CONSUMER DAEMON

The expectation is that the SQS queue keeps messages until we have successfully consumed and deleted them. The implementation looks like this:


    # Poll the SQS queue indefinitely.
    for message in sqs.receive_messages():  # runs forever
        # Only S3 keys that end in /buildhub.json are of interest.
        if message.key.endswith('/buildhub.json'):
            content_from_s3 = download_from_s3(message.key)
            validate_json_schema(content_from_s3)
            insert_into_postgresql(content_from_s3)
        # Delete only after processing; any exception above propagates,
        # crashes the daemon, and leaves the message on the queue.
        message.delete()

There are more details (such as logging and the actual functions) but the key point is that it does not try to handle any unexpected errors. For example, if there's a network error related to downloading from S3, that'll cause an exception to be raised and the script will exit with a non-zero exit code. The expectation is that we'll start the daemon up again immediately.
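
To make that restart contract concrete, here's a minimal sketch of it (assuming the daemon is simply launched as a subprocess; in practice the container platform is expected to do the restarting):

    import subprocess
    import time

    # Hypothetical supervisor loop: any crash surfaces as a non-zero exit
    # code and the only remedy is to start the daemon again.
    while True:
        exit_code = subprocess.call(["python", "manage.py", "daemon"])
        if exit_code != 0:
            time.sleep(1)  # brief pause, then restart immediately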

As time progresses and we study the unexpected errors with Sentry, we'll add some exception handling to the Python code.
REGARDING THE TECH STACK

* Python 3.6
* Docker (uses docker-flow so things like https://domain/__lbheartbeat__ etc. will work)
* PostgreSQL 9.6
* Elasticsearch 6.x
* Nginx that talks to Gunicorn running on TCP
Blocks: 1467854
(In reply to Peter Bengtsson [:peterbe] from comment #3)
> REGARDING THE TECH STACK
> 
> * Python 3.6
> * Docker (uses docker-flow so things like https://domain/__lbheartbeat__
> etc. will work)
> * PostgreSQL 9.6
> * Elasticsearch 6.x
> * Nginx that talks to Gunicorn running on TCP

Note! A lot of the infrastructure for this project is based on https://github.com/mozilla-services/tecken/ and https://github.com/mozilla-services/cloudops-deployment/tree/master/projects/symbols, from which many solutions can be copied, such as the Nginx configuration.
Depends on: 1467857
Depends on: 1469317
Depends on: 1469322
It's technically disconnected, but it's good for the complete picture: the success of Buildhub2 depends on the work in Taskcluster.

The two current bugs are:

* https://bugzilla.mozilla.org/show_bug.cgi?id=1443873 - actually uploading the buildhub.json and making sure the download.url key is right and an absolute URL.

* https://bugzilla.mozilla.org/show_bug.cgi?id=1459302 - exploding the en-US buildhub.json into one buildhub.json for each locale.
See Also: → 1443873, 1459302
Wei, 
How's this going? What can I do to help you?
Flags: needinfo?(wezhou)
Hi Peter,

Still working on it. Have set up postgresql and elasticsearch. Next will be working on setting up the app stack.

I'm not blocked and should have some updates by this Friday.
Flags: needinfo?(wezhou)
Assignee: nobody → wezhou
Component: Operations → Operations: Storage
QA Contact: chartjes
Peter, a few questions,

1. What command do I run to migrate the database?

2. Does the db migration help bootstrap the database too (e.g. create the first user, db, and tables etc.)?

3. The document [1] says buildhub2 relies on "net-mozaws-prod-delivery-inventory-us-east-1" bucket, can you confirm that? I'm asking because currently the buildhub lambda depends on two different buckets, namely "net-mozaws-prod-delivery-firefox" and "net-mozaws-prod-delivery-archive". 


4. Can you list all the S3 events you want to be sent to the SQS queue, so that I can pass the info to the engineers who manage that S3 bucket? Available events are listed in [2].



[1] https://buildhub2.readthedocs.io/en/latest/configuration.html
[2] https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html#supported-notification-event-types
(In reply to :wezhou from comment #8)
> Peter, a few questions,
> 
> 1. What command do I run to migrate the database?
> 

The command is `./bin/run.sh migrate`
I'm actually not sure how you are supposed to run that but it's expected to be run after every deployment.
There's also the command `./bin/run.sh web` which starts the `gunicorn` server.

This is also exactly how Tecken works. 

You can run `./bin/run.sh migrate` as many times as you like. The very first time it's going to take care of creating all the initial tables.

To be specific, there's another migration and that's how we copy all the old data from the Kinto database. It's actually a very different thing and I had envisioned we run that once together in an ssh session or something. 
Ultimately what you need to do is, inside the docker container's bash session:

  ./manage.py kinto-migration --skip-validation https://fullurlto.kinto.prod.mozaws.net/v1

That command is expected to take minutes and I've only ever tested it against a fully populated Kinto server on localhost:8888/v1
This is something that can wait. At least until we have something up and running in Stage. Once the dust settles we'll tackle it together on a Vidyo chat or something.

> 2. Does the db migration help bootstrap the database too (e.g. create the
> first user, db, and tables etc.)?
> 

Yes. Creates tables and upgrades them whenever new migrations are added to the repo. 

> 3. The document [1] says buildhub2 relies on
> "net-mozaws-prod-delivery-inventory-us-east-1" bucket, can you confirm that?
> I'm asking because currently the buildhub lambda depends on two different
> buckets, namely "net-mozaws-prod-delivery-firefox" and
> "net-mozaws-prod-delivery-archive". 
> 

That is a very interesting question. I'm pretty confident that `net-mozaws-prod-delivery-inventory-us-east-1` is correct because that's what I've tested locally. 
Also, it might be worth leaving it as is for now and just trying it later when the dust settles. This is only needed for the S3 backfill job that is meant to run as a cron job, and it's also something we hope we can live entirely without. It's available to remedy potential problems that might arise from the SQS messages failing us entirely, e.g. a bug in our code that incorrectly acknowledges SQS messages but fails to persist them.

> 
> 4. Can you list all the S3 events you want to be sent to the SQS queue, so
> that I can pass the info to the engineers who manage that S3 bucket.
> Available events are listed in [2].
> 

Benson, 
Can you answer this? The way you set up the SQS queue for Dev definitely worked. If we can replicate that setup, but for the S3 prod bucket that archive.mozilla.org uses, that would be great.

> 
> 
> [1] https://buildhub2.readthedocs.io/en/latest/configuration.html
> [2] https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html#supported-notification-event-types
Flags: needinfo?(bwong)
For the record, there's a separate bug just about setting up the SQS queues here: https://bugzilla.mozilla.org/show_bug.cgi?id=1469322
Need the ObjectCreate (All) S3 events sent to the SQS queue.
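
For the record, that wiring amounts to something like the following boto3 sketch (the bucket name, queue ARN, and configuration id below are placeholders; the actual change is made by whoever owns the prod bucket):

    import boto3

    s3 = boto3.client("s3")
    # Send every ObjectCreated event ("ObjectCreate (All)") on the bucket
    # to the Buildhub2 SQS queue.
    s3.put_bucket_notification_configuration(
        Bucket="example-prod-delivery-bucket",  # placeholder
        NotificationConfiguration={
            "QueueConfigurations": [
                {
                    "Id": "buildhub2-objectcreated",  # placeholder
                    "QueueArn": "arn:aws:sqs:us-east-1:123456789012:buildhub2",  # placeholder
                    "Events": ["s3:ObjectCreated:*"],
                }
            ]
        },
    )
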
Flags: needinfo?(bwong)
Hi :peterbe,

I think the documentation [1] is missing a part for configuring Elasticsearch for the application. Could you add that?

[1] https://buildhub2.readthedocs.io/en/latest/configuration.html
Flags: needinfo?(peterbe)
Coming to a master branch near you soon! https://github.com/mozilla/buildhub2/pull/169
Flags: needinfo?(peterbe)
-stage environment is up and running.

https://buildhub2.stage.mozaws.net/
Filed an issue regarding the -stage env: https://github.com/mozilla/buildhub2/issues/245
https://github.com/mozilla/buildhub2/issues/245 closed.

-stage environment is up and running again.
Blocks: 1505154
Assignee: wezhou → autrilla
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED