Closed Bug 1467845 Opened 6 years ago Closed 5 years ago

Host and run Buildhub2

Categories

(Cloud Services :: Operations: Kinto, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: peterbe, Assigned: autrilla)

References

Details

This is the tracker bug for getting Buildhub2 off the ground and into Stage + Prod. 

The code is https://github.com/mozilla/buildhub2 and its three key features are:

1. A Django web server that needs to be exposed
2. A daemon script (*)
3. A cron job script (**)


(*) This script is basically `python manage.py daemon`, which consumes an AWS SQS queue forever. If it crashes/fails it needs to be restarted. (Any crashes/fails will be reported to Sentry.)
(**) The backup for the SQS consumer that picks up any potential slack. Essentially, you just need to run `./manage.py backfill` once a day.
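
For illustration, if plain cron ends up running it, the daily backfill is a single crontab entry along these lines (the schedule and working directory below are just placeholders; the actual scheduling mechanism is up to cloudops):

    # hypothetical crontab entry: run the S3 backfill once a day
    0 4 * * *  cd /app && ./manage.py backfill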

The project builds a complete docker container on CircleCI https://circleci.com/gh/mozilla/buildhub2
(At the moment it's failing builds on master because it can't publish to Dockerhub)

The intention is to have a Stage and a Prod. Each environment should consume its own SQS queue but it would be ideal if the Stage environment could run its backfill against the S3 bucket used by Prod. 

The S3 bucket it should have access to is the (public) one used by the existing Buildhub service. 

The documentation regarding configuration is (at the time of writing) incomplete. Tracking that separately here: https://github.com/mozilla/buildhub2/issues/50

Ultimately the goal of Buildhub2 is to completely replace the previous Buildhub.
REGARDING CRON JOB

This can wait.

The hope is that the SQS queue is stable enough that even if there's an RDS connection glitch or an AWS S3 networking glitch (e.g. SSLError), the job is not lost, unlike with Lambda. Instead, we'll try and try again till everything works. However, the cron backfill is there in case of bugs in our code or some unpredictable network error after we have deleted a good message off the queue.

In the "early days" we can execute the cron job manually (e.g. cloudops via ssh shell) as we're learning to walk confidently.
REGARDING THE SQS CONSUMER DAEMON

The expectation is that the SQS queue keeps messages until we have successfully consumed and deleted them. The implementation looks like this:


    # Poll the SQS queue indefinitely.
    for message in sqs.receive_messages():  # runs forever
        # Only S3 keys that end in /buildhub.json are of interest.
        if message.key.endswith('/buildhub.json'):
            content_from_s3 = download_from_s3(message.key)
            validate_json_schema(content_from_s3)
            insert_into_postgresql(content_from_s3)
        # Delete only after processing; any exception above propagates,
        # crashes the daemon, and leaves the message on the queue.
        message.delete()

There are more details (such as logging and the actual functions) but the key point is that it does not try to handle any unexpected errors. For example, if there's a network error related to downloading from S3, that'll cause an exception to be raised and the script will exit with a non-zero exit code. The expectation is that we'll start the daemon up again immediately.
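
To make that restart contract concrete, here's a minimal sketch of it (assuming the daemon is simply launched as a subprocess; in practice the container platform is expected to do the restarting):

    import subprocess
    import time

    # Hypothetical supervisor loop: any crash surfaces as a non-zero exit
    # code and the only remedy is to start the daemon again.
    while True:
        exit_code = subprocess.call(["python", "manage.py", "daemon"])
        if exit_code != 0:
            time.sleep(1)  # brief pause, then restart immediately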

As time progresses and we study the unexpected errors with Sentry, we'll add some exception handling to the Python code.
REGARDING THE TECH STACK

* Python 3.6
* Docker (uses docker-flow so things like https://domain/__lbheartbeat__ etc. will work)
* PostgreSQL 9.6
* Elasticsearch 6.x
* Nginx that talks to Gunicorn running on TCP
Blocks: 1467854
(In reply to Peter Bengtsson [:peterbe] from comment #3)
> REGARDING THE TECH STACK
> 
> * Python 3.6
> * Docker (uses docker-flow so things like https://domain/__lbheartbeat__
> etc. will work)
> * PostgreSQL 9.6
> * Elasticsearch 6.x
> * Nginx that talks to Gunicorn running on TCP

Note! A lot of the infrastructure for this project is based on https://github.com/mozilla-services/tecken/ and https://github.com/mozilla-services/cloudops-deployment/tree/master/projects/symbols, from which many solutions can be copied, such as the Nginx configuration.
Depends on: 1467857
Depends on: 1469317
Depends on: 1469322
It's technically disconnected, but it's good for the complete picture: the success of Buildhub2 depends on the work in Taskcluster.

The two current bugs are:

* https://bugzilla.mozilla.org/show_bug.cgi?id=1443873 - actually uploading the buildhub.json and making sure the download.url key is right and an absolute URL.

* https://bugzilla.mozilla.org/show_bug.cgi?id=1459302 - exploding the en-US buildhub.json into one buildhub.json for each locale.
See Also: → 1443873, 1459302
Wei, 
How's this going? What can I do to help you?
Flags: needinfo?(wezhou)
Hi Peter,

Still working on it. Have set up postgresql and elasticsearch. Next will be working on setting up the app stack.

I'm not blocked and should have some updates by this Friday.
Flags: needinfo?(wezhou)
Assignee: nobody → wezhou
Component: Operations → Operations: Storage
QA Contact: chartjes
Peter, a few questions,

1. What command do I run to migrate the database?

2. Does the db migration help bootstrap the database too (e.g. create the first user, db, and tables etc.)?

3. The document [1] says buildhub2 relies on "net-mozaws-prod-delivery-inventory-us-east-1" bucket, can you confirm that? I'm asking because currently the buildhub lambda depends on two different buckets, namely "net-mozaws-prod-delivery-firefox" and "net-mozaws-prod-delivery-archive". 


4. Can you list all the S3 events you want to be sent to the SQS queue, so that I can pass the info to the engineers who manage that S3 bucket? Available events are listed in [2].



[1] https://buildhub2.readthedocs.io/en/latest/configuration.html
[2] https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html#supported-notification-event-types
(In reply to :wezhou from comment #8)
> Peter, a few questions,
> 
> 1. What command do I run to migrate the database?
> 

The command is `./bin/run.sh migrate`
I'm actually not sure how you are supposed to run that but it's expected to be run after every deployment.
There's also the command `./bin/run.sh web` which starts the `gunicorn` server.

This is also exactly how Tecken works. 

You can run `./bin/run.sh migrate` as many times as you like. The very first time it's going to take care of creating all the initial tables.

To be specific, there's another migration and that's how we copy all the old data from the Kinto database. It's actually a very different thing and I had envisioned we run that once together in an ssh session or something. 
Ultimately what you need to do is, inside the docker container's bash session:

  ./manage.py kinto-migration --skip-validation https://fullurlto.kinto.prod.mozaws.net/v1

That command is expected to take minutes and I've only ever tested it against a fully populated Kinto server on localhost:8888/v1
This is something that can wait. At least until we have something up and running in Stage. Once the dust settles we'll tackle it together on a Vidyo chat or something.

> 2. Does the db migration help bootstrap the database too (e.g. create the
> first user, db, and tables etc.)?
> 

Yes. Creates tables and upgrades them whenever new migrations are added to the repo. 

> 3. The document [1] says buildhub2 relies on
> "net-mozaws-prod-delivery-inventory-us-east-1" bucket, can you confirm that?
> I'm asking because currently the buildhub lambda depends on two different
> buckets, namely "net-mozaws-prod-delivery-firefox" and
> "net-mozaws-prod-delivery-archive". 
> 

That is a very interesting question. I'm pretty confident that `net-mozaws-prod-delivery-inventory-us-east-1` is correct because that's what I've tested locally. 
Also, it might be worth leaving it as is for now and just trying it later when the dust settles. This is only needed for the S3 backfill job that is meant to run as a cron job, and it's also something we hope we can live entirely without. It's available to remedy potential problems that might arise from the SQS messages failing us entirely, e.g. a bug in our code that incorrectly acknowledges SQS messages but fails to persist them.

> 
> 4. Can you list all the S3 events you want to be sent to the SQS queue, so
> that I can pass the info to the engineers who manage that S3 bucket.
> Available events are listed in [2].
> 

Benson, 
Can you answer this? The way you set up the SQS queue for Dev definitely worked. If we can replicate that setup, but for the S3 prod bucket that archive.mozilla.org uses, that would be great.

> 
> 
> [1] https://buildhub2.readthedocs.io/en/latest/configuration.html
> [2] https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html#supported-notification-event-types
Flags: needinfo?(bwong)
For the record, there's a separate bug just about setting up the SQS queues here: https://bugzilla.mozilla.org/show_bug.cgi?id=1469322
Need the ObjectCreate (All) S3 events sent to the SQS queue.
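
For the record, that wiring amounts to something like the following boto3 sketch (the bucket name, queue ARN, and configuration id below are placeholders; the actual change is made by whoever owns the prod bucket):

    import boto3

    s3 = boto3.client("s3")
    # Send every ObjectCreated event ("ObjectCreate (All)") on the bucket
    # to the Buildhub2 SQS queue.
    s3.put_bucket_notification_configuration(
        Bucket="example-prod-delivery-bucket",  # placeholder
        NotificationConfiguration={
            "QueueConfigurations": [
                {
                    "Id": "buildhub2-objectcreated",  # placeholder
                    "QueueArn": "arn:aws:sqs:us-east-1:123456789012:buildhub2",  # placeholder
                    "Events": ["s3:ObjectCreated:*"],
                }
            ]
        },
    )
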
Flags: needinfo?(bwong)
Hi :peterbe,

I think the documentation [1] is missing a part for configuring Elasticsearch for the application. Could you add that?

[1] https://buildhub2.readthedocs.io/en/latest/configuration.html
Flags: needinfo?(peterbe)
Coming to a master branch near you soon! https://github.com/mozilla/buildhub2/pull/169
Flags: needinfo?(peterbe)
-stage environment is up and running.

https://buildhub2.stage.mozaws.net/
Filed an issue regarding the -stage env: https://github.com/mozilla/buildhub2/issues/245
https://github.com/mozilla/buildhub2/issues/245 closed.

-stage environment is up and running again.
Blocks: 1505154
Assignee: wezhou → autrilla
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED