Closed Bug 1366301 Opened 5 years ago Closed 4 years ago

write new buildhub scraper


(Socorro :: General, task, P2)


(Not tracked)



(Reporter: willkg, Unassigned)



BuildHub has an API for pulling down build information. Socorro currently uses ftpscraper to do this.

This bug covers writing a new script to populate the product_version table, but using the BuildHub API instead of what ftpscraper does.
First off, this is preliminary until BuildHub is "production", which I don't think it is yet.

Working on this now helps us help them identify what needs to change so that we can eventually achieve goal number 1 of ditching ftpscraper.

Documentation for the API is located at the "Docs" link on the BuildHub page.

There's a related issue in the buildhub github thinger mcdooder:

In it, Rémy says this:

> We currently have more than 100k records there so using _limit is greatly recommended :)
Sorry if this is already dead-obvious, but if you do use `_limit` remember to do that *per* product. There are very few SeaMonkey releases so if we use `?product=*&_limit=10` we might miss SeaMonkey releases. 

Also, the old ftpscraper did one run per product, for each product that was enabled in its config.
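
A minimal sketch of the per-product approach, in Python. The `product` and `_limit` query parameters are the ones mentioned above; the base URL is a placeholder rather than the real Buildhub endpoint, and `per_product_urls` is a hypothetical helper, not part of any existing scraper.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): substitute the real Buildhub records URL.
BASE_URL = "https://buildhub.example.invalid/records"


def per_product_urls(products, limit):
    """Return one query URL per product so each product gets its own _limit."""
    return [
        BASE_URL + "?" + urlencode({"product": product, "_limit": limit})
        for product in products
    ]


if __name__ == "__main__":
    # Products with few releases (e.g. SeaMonkey) get their own query and
    # so cannot be crowded out by products with many releases.
    for url in per_product_urls(["firefox", "seamonkey"], limit=10):
        print(url)
```

Issuing one request per product keeps the limit from being consumed entirely by whichever product has the most records.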
Spent today working on a demo script. The purpose is threefold:

1. write a demo script that pulls data from buildhub
2. build a harness so I can compare the output of ftpscraper with the output of the demo script
3. compare the output of the two to see if buildhub has the data we need
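
The comparison step could look roughly like the following Python sketch. It is not the harness from the branch. It assumes both scrapers emit dicts with `product_name`, `build_id`, and `platform` fields, and that those three fields together identify a build. `build_key` and `compare_records` are hypothetical names.

```python
def build_key(record):
    """Identify a build by product, build id, and platform (an assumption)."""
    return (record["product_name"], record["build_id"], record["platform"])


def compare_records(ftp_records, buildhub_records):
    """Return (only_in_ftp, only_in_buildhub, differing_fields_by_key)."""
    ftp = {build_key(r): r for r in ftp_records}
    bh = {build_key(r): r for r in buildhub_records}

    only_in_ftp = sorted(ftp.keys() - bh.keys())
    only_in_buildhub = sorted(bh.keys() - ftp.keys())

    differing = {}
    for key in sorted(ftp.keys() & bh.keys()):
        # A field missing on one side is treated the same as a null value.
        fields = sorted(
            name
            for name in ftp[key].keys() | bh[key].keys()
            if ftp[key].get(name) != bh[key].get(name)
        )
        if fields:
            differing[key] = fields
    return only_in_ftp, only_in_buildhub, differing
```

Reporting the names of the differing fields, rather than just a yes/no, makes it easier to see whether a mismatch is a real data gap or just a difference in how a value such as `version` is formatted.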

I had a bunch of questions about the buildhub data. I put them in this comment:

Here's an example of data.

buildhub scraper:

{"beta_number": null, "build_id": "20170503004005", "build_type": "opt", "ignore_duplicates": true, "platform": "mac", "product_name": "firefox", "repository": "mozilla-aurora", "version": "54.0a2.en-US.mac.dmg", "version_build": null}


{"beta_number": null, "build_id": "20170503004005", "build_type": "aurora", "ignore_duplicates": true, "platform": "mac", "product_name": "firefox", "repository": "mozilla-aurora", "version": "54.0a2"}

Beta releases are a little more off. I didn't check non-firefox products, yet.

Everything I've done so far is in a branch:

I think I have to wait on some answers before I can continue further.
Assignee: nobody → willkg
Somehow I deleted my branch at some point in the last month. But then I did some git reflog skullduggery and recovered it! Yay for git reflog.

I talked with Rémy and crew yesterday. I'm going to fix the collection issues in this script and some other things and then see if I can get it working and comparable to ftpscraper then report back about issues.
I did some more work and wrote up some more issues. Waiting on those before I proceed.
Turns out I was using the wrong collection again. I spent some time on the scraper today, switched the collection, and adjusted some more code.

buildhub doesn't look at the candidates/ tree. I don't know if we (Socorro) *need* those builds or not. Figuring that out is covered in:

buildhub is missing a lot of build information. When I compare buildhub data with what we get back from ftpscraper, I can't tell the difference between data that's missing from buildhub because a lot of data is currently missing and data that's missing because the system will never acquire it. Some of that is covered in:

I think I've gotten as far as I can get in this round. I'll wait until they have a -prod environment with for-realz data before working on this further.

Again, all my work is in this branch:
Three things I want to do with this bug:

1. The new buildhub scraper should replace the old ftpscraper AND all the stored procedures involved. In other words, the buildhub scraper pulls data from buildhub and puts it into the end tables directly.

2. The new script should be able to run every hour idempotently.

3. We need to figure out what happens when the buildhub data is wrong, they fix the bug in buildhub, and we need to update the data in Socorro. Do we have a way to drop the data in Socorro? Do we have a way to run buildhub to update specific data?
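
For the second goal, one way to make repeated runs harmless is to key each build on a uniqueness constraint and ignore rows that already exist. The sketch below uses Python's built-in `sqlite3` purely so that it is self-contained; Socorro's real tables and database differ, and the `builds` table, its columns, and the function names are all hypothetical.

```python
import sqlite3


def ensure_table(conn):
    """Create the illustrative builds table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS builds (
            product_name TEXT NOT NULL,
            version TEXT NOT NULL,
            build_id TEXT NOT NULL,
            platform TEXT NOT NULL,
            PRIMARY KEY (product_name, version, build_id, platform)
        )
        """
    )


def store_builds(conn, records):
    """Insert records, skipping any that a previous run already stored."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO builds "
            "(product_name, version, build_id, platform) "
            "VALUES (:product_name, :version, :build_id, :platform)",
            records,
        )
```

Running `store_builds` twice with the same records leaves the table unchanged after the first run. Note that ignoring existing rows does not address the third goal: picking up corrected buildhub data would need an update or a delete-and-reload path instead.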

Also, once we fix this, we can nix the AuroraVersionFixitRule processor rule and possibly some other things.
I was looking at product_versions data between -stage and -prod and noticed that the start_date and end_date for some of the versions differ between the two environments. Seems like the ftpscraper flow sets the start_date when it first finds the version.

When we do the buildhub version, we should set it to the date encoded in the build id. That way it'll be stable between environments even if the script doesn't work for a few days.
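
A small Python sketch of deriving the date from a build id, assuming the usual 14-digit `YYYYMMDDHHMMSS` layout. `build_id_to_datetime` is a hypothetical helper, and the result is a naive datetime with no timezone attached.

```python
from datetime import datetime


def build_id_to_datetime(build_id):
    """Parse a 14-digit build id (YYYYMMDDHHMMSS) into a naive datetime."""
    return datetime.strptime(build_id, "%Y%m%d%H%M%S")


if __name__ == "__main__":
    # Build id taken from the example records earlier in this bug.
    print(build_id_to_datetime("20170503004005").date())  # 2017-05-03
```

Because the value depends only on the build id, every environment computes the same start date no matter when the script first sees the build.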
Unassigning myself from bugs I'm not immediately working on and/or have some meaningful progress on.
Assignee: willkg → nobody
I hit a case today where builds for 61.0b1 were built on 2018-04-28, but released today. They didn't show up in buildhub until today. I don't know why that is, but it's probably good to keep in mind especially for logic we have that validates version numbers against what Socorro thinks the universe looks like.
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #10)
> I hit a case today where builds for 61.0b1 were built on 2018-04-28, but
> released today. They didn't show up in buildhub until today. I don't know
> why that is, but it's probably good to keep in mind especially for logic we
> have that validates version numbers against what Socorro thinks the universe
> looks like.

Thank you. It's sad that it happened, but it's interesting to see that it did happen.
I've been working really hard to make buildhub more reliable. What probably happened was that our Lambda job failed when that file was made available and then our scraper cron job (which is supposed to fix the holes from unfortunate Lambda failures) took a very long time to fix that. :(

These things will be fixed in a new release of Buildhub hopefully by the end of the quarter. At least we'll go from 95% reliability to 99% or something.
FWIW missioncontrol's buildhub scraper code is here:

I think I promised to dump something like that into the buildhub documentation for people to build on, but for now, there it is.
Will: I've got part of a script done, but the harder part for this bug is making sure the data we get from buildhub goes into "the right places" in crash stats db. That part I haven't figured out. Getting there! Maybe next quarter. I'll definitely look at your script for inspiration.
Making this a P2 to do soon.
Priority: -- → P2
Depends on: 1501780
Instead of doing this, I rewrote the BetaVersionRule to do Buildhub lookups itself. Further, I rewrote a bunch of other things so nothing depends on product_versions data. Thus there's no need to write a buildhub scraper.

Given that, I'm going to close this as WONTFIX.
Closed: 4 years ago
Resolution: --- → WONTFIX