Closed Bug 684112 Opened 13 years ago Closed 13 years ago

Add nightly & aurora FTP-scraping to releases_raw

Categories

(Socorro :: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jberkus, Assigned: rhelmer)

In 2.2.3 we're getting nightly/aurora information from the "builds" table. However, this table is slated to be deprecated, so we need to add them to releases_raw instead.
I guess bug 640242 should be able to hook into this, then.
Depends on: 688629
Target Milestone: 2.3 → 2.3.1
First shot at rewriting the old FTP scraper:

https://github.com/rhelmer/socorro/commit/f831575c70cd6336177ee35f31422f69cf258322

This is still heavily influenced by the old scraper, and I've hooked it into the "Socorro way" of doing things (unit tests, config, wrapper scripts etc. - the unit test is based on the old scraper, for instance) since I want to make this week's freeze. Given more time I'd like to think about how we can start moving towards something with less boilerplate, and also fewer socorro-isms.

I've tested that this seems to work (and put data into devdb for jberkus to review) and also that the unit tests pass.

peterbe, lonnen, brandon, lars - any thoughts? 

Asking for feedback rather than r? since we can't pull this until we disable the old one, and I'd like to move the "nightly report" UI to use a matview based on releases_raw at the same time (jberkus is working on that now).
Status: NEW → ASSIGNED
One last thing - I've replaced the use of SGMLParser with BeautifulSoup. It isn't included in stdlib, but it tests OK for me with the RHEL-provided RPM, so deployment shouldn't be an issue; we just need to make sure the stage and prod puppet manifests get the new package.
There's also a database schema change in upgrade/2.3.1/ associated with this bug.
OK took peterbe and lonnen's comments into account - I think this is ready to land:
r? https://github.com/mozilla/socorro/pull/73

Lots of little changes from comment #2 but bigger ones are:
* switch from beautifulsoup to lxml.html 
** RHEL-provided RPM seems ok for this purpose, just need parsing
* pep8 (for everything except schema.py)

This is blocking work we want to get into 2.3.1 (code freeze EOD tomorrow) and I have some other changes that need to go into this release. Totally happy to continue improving this, but I'd like to push anything non-trivial that doesn't need to block ship to next week's release.
Commit pushed to https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/d872b44000fbc0b167139ca566e39e9c3b0cb8d3
Merge pull request #73 from rhelmer/bug684112-rewrite-ftp-scraper

rewrite FTP scraper to support nightly/aurora
Here are a few supporting changes, r? anyone who has time:

Stop the old scraper from writing to releases_raw table:
https://github.com/mozilla/socorro/pull/74

We're going to remove the old scraper entirely in 2.3.2, see bug 694466.

Add backfill support to ftpScraper.py:
https://github.com/mozilla/socorro/pull/75

Josh reminded me today that we needed this for when we push the release. I also did a little refactoring I had intended to do anyway as part of it (split up the nightly and release main loops).

The larger implied change here is that for nightlies, we're always going to look in http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2011/10/ instead of http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-* (which are symlinks to the above actually).

It means we'll pick up a few more builds than we otherwise would have, but it should not be a big deal. In general, I'd rather pick up a few extra builds than put a bunch of special-casing in the code - scraping FTP to get this info is already error-prone as it is.

Releases don't go into these dated dirs, and there are way fewer of them, so we automatically backfill whatever we can find on every run - this code doesn't need to change.
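
To illustrate the dated-directory approach for nightly backfill, here is a minimal sketch, not the actual ftpScraper.py code. The base URL is the one quoted above; the names month_range and nightly_dirs are hypothetical and used only for this example. It assumes a backfill is expressed as a start and end date and that one directory per year/month needs to be visited.

```python
from datetime import date

# Base URL as quoted in the comment above.
NIGHTLY_BASE = "http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly"


def month_range(start, end):
    """Yield (year, month) pairs from start to end, inclusive.

    Hypothetical helper for illustration only.
    """
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            # Roll over into January of the next year.
            year, month = year + 1, 1


def nightly_dirs(start, end, base=NIGHTLY_BASE):
    """Return the dated nightly directory URLs covering start..end.

    Hypothetical helper; the month is zero-padded to two digits.
    """
    return ["%s/%04d/%02d/" % (base, y, m) for y, m in month_range(start, end)]
```

Calling nightly_dirs(date(2011, 9, 15), date(2011, 10, 20)) would give the 2011/09/ and 2011/10/ directories. Every build found under those directories is a candidate, which is why this picks up a few more builds than following the latest-* symlinks would.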
Commit pushed to https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/562a0d2ae1deca8614aec07cc51e162e9b283969
Merge pull request #74 from rhelmer/bug684112-disable-old-scraper

bug 684112 - disable releases for old scraper
Commit pushed to https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/cc8c57e313097e05306b8768cd4b0c9fe59d83c7
Merge pull request #75 from rhelmer/bug684112-ftp-scraper-backfill

bug 684112 - add backfill support
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Blocks: 684106
Commits pushed to https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/b123e9e4a6f810327bc28292af9743ac93503f33
bug 684112 - easier to join if we leave a1/a2 in here

https://github.com/mozilla/socorro/commit/2dcfa9c18e20850caa204fa3242f41a824e67054
Merge pull request #85 from rhelmer/bug684112-fix-version-column

bug 684112 - easier to join if we leave a1/a2 in here
Commits pushed to https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/07452de6c55ab16836699b2a92bfba3001f010a9
bug 684112 - indentation is wrong here, want to run insertBuild regardless of nightly/aurora

https://github.com/mozilla/socorro/commit/7b144f544b8a2cd710feb8e58e45a2228bf712ff
Merge pull request #86 from rhelmer/bug684112-fix-indentation

bug 684112 - indentation is wrong here, want to run insertBuild regardles
Commits pushed to https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/5afd93df53c7a41eb4cc062705f30e563aa4c6ff
bug 684112 - easier to join if we leave a1/a2 in here

https://github.com/mozilla/socorro/commit/df6b2e12024308edad0d6460f1c6b2c97f6d87b8
bug 684112 - indentation is wrong here, want to run insertBuild regardless of nightly/aurora
This depends on DB changes that were pulled from 2.3.1, so bumping this and reopening. 

The old scraper (disabled in comment 9) should be re-enabled for 2.3.1
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 2.3.1 → 2.3.2
Commits pushed to https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/4a0bfd54bd9d8d3067356baa5a68b4c5cd0a7297
bug 684112 - reinstate old release scraper until new DB changes are ready

https://github.com/mozilla/socorro/commit/626ad0533b1b5599d8fa287d7b9d0acb5e0d2725
Merge pull request #104 from rhelmer/bug684112-reinstate-old-scraper

bug 684112 - reinstate old release scraper until new DB changes are ready
Commit pushed to https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/f723ab74590beb7a93be5def0d313b1b66008628
bug 684112 - reinstate old release scraper until new DB changes are ready
The DB changes for releases_raw are still in 2.3.1 and are ready.

If there is a bug with them, that's a different matter.
(In reply to Josh Berkus from comment #18)
> The DB changes for releases_raw are still in 2.3.1 and are ready.
> 
> If there is a bug with them, that's a different matter.

I don't feel that this has adequate testing to replace the old scraper yet, and I am out this week so can't help with any fallout.

It's not really necessary to enable this until bug 684106 ships, so not worth the risk.
This should be ready for 2.3.2; it was only backed out on the 2.3.1 branch.
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
There's a bug in the way single-digit months are handled (should always be padded to two digits, since that's what FTP wants).

One-line, tested fix incoming.
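
For reference, a minimal sketch of the kind of formatting involved, assuming the path is built from a date object. The function name nightly_month_path is hypothetical and this is not the actual patch.

```python
from datetime import date


def nightly_month_path(day):
    """Build the year/month part of a dated nightly path, e.g. '2011/09/'.

    Hypothetical helper for illustration. '%02d' pads the month to two
    digits; plain '%d' would produce '2011/9/' for September, which is
    the single-digit case described above.
    """
    return "%04d/%02d/" % (day.year, day.month)
```

day.strftime("%Y/%m/") is an equivalent way to get the padded form.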
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Commit pushed to https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/a25fe165c30b289e1fd779f5e519aca900d78b85
Merge pull request #128 from rhelmer/bug684112-month-formatting-fix

bug 684112 - format month to two digits
Component: Socorro → General
Product: Webtools → Socorro