Closed Bug 674605 Opened 13 years ago Closed 13 years ago

Need FTP-scraper for beta information

Categories

(Socorro :: General, task, P1)

x86
macOS

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: jberkus, Assigned: rhelmer)

References

Details

Attachments

(1 file, 2 obsolete files)

Since no web service for release information will be available in time for 2.2, we urgently need someone to write an FTP scraper that gets the new beta information from the releases FTP server.

This task has not, AFAIK, been assigned to anyone.
Laura: you can assign this to the PGX team and we'll get someone working on writing one.  But I need an explicit assignment.
I've already talked to rhelmer about building this, so it's his bug, sorry Josh.
Assignee: laura → rhelmer
Severity: critical → blocker
Priority: -- → P1
Table spec:

beta_releases (
  product_name citext not null,
  version OR major_version text not null,
  build_id numeric not null,
  beta_number INT not null
);
Rob,

So it turns out I need final releases as well.  For that reason, I'm changing the table spec:

releases_raw (
  product_name citext not null,
  major_version text not null,
  build_id numeric not null,
  releasechannel citext not null ("beta" or "release"),
  beta_number INT (null if "release")
)

... for final releases, just leave beta_number NULL.
Well, the release report should only be for everything with that version on a channel other than "beta", "aurora" or "nightly" (note that there's a pretty large number of "releasechannel" values that fall into that bucket).

But for the same build, we need a beta report based on beta channel + version + buildid as well, which is a separate thing because the reports on that channel aren't throttled and so need to be represented differently on the graphs. Fun times. :)

Not sure if "releasechannel" is the best thing to put there, maybe "releasetype" would be better?
also, we only want the releases 5.0beta and later.
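
To illustrate the split being described: a minimal sketch, assuming the bucket labels and the helper below (neither is Socorro code), of how release and beta reports would be keyed:

def report_bucket(release_channel, version, build_id):
    channel = release_channel.lower()
    if channel == 'beta':
        # Beta-channel crashes are keyed by build as well, since that
        # channel isn't throttled and is graphed separately.
        return ('beta', version, build_id)
    if channel in ('aurora', 'nightly'):
        # Assumed here to stay with the existing nightly reports.
        return ('development', version)
    # Everything else with this version, whatever the exact releasechannel
    # string says, rolls up into the release report.
    return ('release', version)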
Kairo,

This spec is for the raw data table for scraping FTP data into.  I'm not sure that any of your comments apply.
OK, next spec iteration:

create domain major_version AS text
	CHECK ( VALUE ~ $x$^\d+\.\d+$x$ );
	
alter domain major_version owner to breakpad_rw;
	
create table releases_raw (
	product_name citext not null,
	major_version major_version not null,
	build_id numeric not null,
	buildtype citext not null,
	beta_number int,
	constraint release_raw_key primary key ( product_name, major_version, build_id, buildtype )
);

alter table releases_raw owner to breakpad_rw;

insert into releases_raw values
	( 'Firefox', '6.0', 20110705195857, 'Beta', 1 ),
	( 'Firefox', '5.0', 20110614174314, 'Beta', 7 ),
	( 'Firefox', '5.0', 20110613165758, 'Beta', 6 ),
	( 'Firefox', '6.0', 20110713171652, 'Beta', 2 ),
	( 'Firefox', '6.0', 20110721152715, 'Beta', 3 ),
	( 'Thunderbird', '5.0', 20110528060900, 'Beta', 1 ),
	( 'Thunderbird', '5.0', 20110620152233, 'Beta', 2 ),
	( 'Thunderbird', '6.0', 20110714224206, 'Beta', 1 ),
	( 'Seamonkey', '2.3', 20110724010946, 'Beta', 1 ),
	( 'Firefox', '5.0', 20110615151330 , 'Release', NULL ),
	( 'Thunderbird', '5.0', 20110624132812 , 'Release', NULL );
I've adapted the existing nightly FTP scraper to add support for release/beta candidate areas.

Things to be aware of:

* this job runs once per day
* tested with Firefox
* I expect it to work with Tbird/SeaMonkey 
* mobile will need minor changes (uses a different directory structure)
* I don't expect it to work with Camino since they don't release the same metadata as the other projects.
* no tests for the new methods yet

Looking for feedback on this approach, as the current scraper has some issues but does work. I expect we can merge the nightly and release scrape process and tables over time, but there's a lot of forking for now.
Attachment #549193 - Flags: feedback?(laura)
Attachment #549193 - Flags: feedback?(lars)
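For anyone reading along without the patch, a rough sketch of the scraping approach, assuming a <version>-candidates/buildN/ directory layout and that each platform's metadata .txt file contains a 14-digit build ID; the helper names and the placeholder URL below are illustrative, not the code from the attached patch (socorro/cron/builds.py):

import re
import urllib2  # the scraper ran on Python 2 at this point

def parse_build_id(metadata_txt):
    # The per-platform .txt metadata files carry a 14-digit build ID;
    # whether it appears bare or in a key=value line varies, so just
    # search for the first 14-digit number.
    match = re.search(r'\b(\d{14})\b', metadata_txt)
    return match.group(1) if match else None

def scrape_candidate(product_name, version_dir, metadata_url):
    # version_dir is e.g. '6.0b3-candidates' or '5.0-candidates';
    # metadata_url points at one platform's .txt file inside buildN/.
    version = version_dir.replace('-candidates', '')
    beta_number = None
    if 'b' in version:
        version, beta = version.split('b')
        beta_number = int(beta)
    build_id = parse_build_id(urllib2.urlopen(metadata_url).read())
    build_type = 'Beta' if beta_number is not None else 'Release'
    # One row per (product, version, build), matching releases_raw above.
    return (product_name, version, build_id, build_type, beta_number)

# Hypothetical usage, with a placeholder metadata filename under the
# directory layout quoted later in this bug:
#   row = scrape_candidate('mobile', '6.0b3-candidates',
#       'http://stage.mozilla.org/pub/mozilla.org/mobile/candidates/'
#       '6.0b3-candidates/build2/unsigned/android/en-US/<platform>.txt')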
What entries is this producing right now, can you give us a good sample?
(In reply to comment #10)
> What entries is this producing right now, can you give us a good sample?

Sure, here it is against a non-final schema, for Firefox:

http://rhelmer.pastebin.mozilla.org/1284466

I let it run for all versions, not just rapid release, just to see how it'd do.
(In reply to comment #11)
> http://rhelmer.pastebin.mozilla.org/1284466

That looks good. Of course, for the older releases, it's not 100% correct, but quite near to the whole truth even there. ;-)
If it works out well for Fennec, Thunderbird, and SeaMonkey, this looks quite good to me!
Comment on attachment 549193 [details] [diff] [review]
add release/beta support to FTP scraper

Checked logic and output, which is in agreement with my personal mental model so r+ from me.
Attachment #549193 - Flags: feedback?(laura) → feedback+
Looks fine to me.  Can you please load that data into devdb?
Comment on attachment 549193 [details] [diff] [review]
add release/beta support to FTP scraper

Review of attachment 549193 [details] [diff] [review]:
-----------------------------------------------------------------

I really love that this can be tagged onto an existing cron rather than having to create a new one.

As I said in IRC, this looks fine.  There are a few uses of deprecated apis (SGML and the Socorro psycopg helper), but this is not the time to address these issues.
Attachment #549193 - Flags: feedback?(lars) → feedback+
I have this working now for Firefox/Thunderbird/SeaMonkey. Camino does not provide metadata .txt files, so this will not work for their project.

Mobile is going to be a problem. The files are inconsistently placed (which I can work around), and not always present (which I cannot).

For example:

http://stage.mozilla.org/pub/mozilla.org/mobile/candidates/4.0b2-candidates/build4/

* maemo5-gtk_info.txt is in root, others in platform/en-US/platform.txt
* android doesn't have a .txt file at all

I think we're going to have to manually update the releases_raw table, perhaps as a supplement to the scraper (and whatever data collector replaces it in the future). 

I suggest adding an admin interface for this, similar to what we do now for branch data sources.
Status: NEW → ASSIGNED
(In reply to comment #16)
> Mobile is going to be a problem. The files are inconsistently placed (which
> I can work around), and not always present (which I cannot).
> 
> For example:
> 
> http://stage.mozilla.org/pub/mozilla.org/mobile/candidates/4.0b2-candidates/
> build4/
> 
> * maemo5-gtk_info.txt is in root, others in platform/en-US/platform.txt
> * android doesn't have a .txt file at all

Ah actually the problem is that the .txt files are only in the unsigned directory:

http://stage.mozilla.org/pub/mozilla.org/mobile/candidates/6.0b3-candidates/build2/unsigned/android/en-US/

So this should be doable. I still think it'd be nice to have an admin interface for viewing and modifying this table though.
Don't waste too much time for things before 5.0, as we only need the new reports starting from there.
(In reply to comment #18)
> Don't waste too much time for things before 5.0, as we only need the new
> reports starting from there.

I tried excluding those, but there are still enough differences in mobile that it's making it really hairy. I filed bug 675272 to see if we can get the metadata put in place now, and made more consistent moving forward.
Depends on: 675272
rhelmer,

I'm not at all sanguine about updating Fennec releases manually.  It's critical that the products table is updated BEFORE we do any of the nightly batch updates to TCBS and other matviews.  Crashes for products which have not been updated simply won't show up in TCBS.  And if we're relying on manual updates, that'll happen on a weekly basis.

You can't solve this with the scraper, though ... it's going to require someone getting on the mobile team's case to get them to put the info file in every time, and in a consistent place.

Hmmm.  I guess this isn't really a serious problem **for us**.  If the Fennec team wants to see crash reports, then they need to solve the problem.  If they don't solve it, then clearly they didn't care.

Who can contact them and lay this out so that they're not surprised when Fennec starts getting excluded?
rhelmer,

Also, if you can take the scraper you have and deploy it on Dev so that it's updating DevDB, I'd appreciate it.
(In reply to comment #20)

> 
> Who can contact them and lay this out so that they're not surprised when
> Fennec starts getting excluded?

Fennec won't get excluded, because that would block ship, just FYI.  This attitude isn't acceptable.

Worst case is leaving Fennec on the old aggregation system for now.

KaiRo, can you help here?
(In reply to comment #20)
> rhelmer,
> 
> I'm not at all sanguine about updating Fennec releases manually.  It's
> critical that the products table is updated BEFORE we do any of the nightly
> batch updates to TCBS and other matviews.  Crashes for products which have
> not been updated simply won't show up in TCBS.  And if we're relying on
> manual updates, that'll happen on a weekly basis.


Why will manual updates happen on a weekly basis, as opposed to happening the same day a release happens? If it misses the nightly window why wouldn't it happen the next night?

 
> You can't solve this with the scraper, though ... it's going to require
> someone getting on the mobile team's case to get them to put the info file
> in every time, and in a consistent place.


I think you're right and that I can't solve it with the scraper as-is, if I can't depend on the name or location of the file to be consistent across releases. The only option I have now is to crawl the whole tree looking for .txt files, and then use the directory name (not as awful as it sounds actually). I'd really rather not though, since we now have code that works for everything else.

I have filed bug 675272 against releng (not the mobile team) and set it to block this one. See that bug for more info on the specific problems here. I have also proposed (and volunteered) to backfill the data, provided it'll be put in a consistent place from now on.

 
> Hmmm.  I guess this isn't really a serious problem **for us**.  If the
> Fennec team wants to see crash reports, then they need to solve the
> problem.  If they don't solve it, then clearly they didn't care.
> 
> Who can contact them and lay this out so that they're not surprised when
> Fennec starts getting excluded?


See above. Excluding them is not an option, we'll have to update this manually if we don't have a reliable place to automatically source this. Releng has offered to automatically notify us at release time.

Can you provide more details on what the problem with manual updates is? I don't see any real difference between the scraper code and a human running the inserts, except of course typos/delays/etc. (which is why I'd rather get this done using one of the methods I've mentioned).
(In reply to comment #23)
>
> See above. Excluding them is not an option, we'll have to update this
> manually if we don't have a reliable place to automatically source this.
> Releng has offered to automatically notify us at release time.

(In reply to comment #22)
> 
> Fennec won't get excluded, because that would block ship, just FYI.  This
> attitude isn't acceptable.
> 
> Worst case is leaving Fennec on the old aggregation system for now.

We could leave them on the old system (I guess Camino is in that boat), but manual update seems more desirable than this? We could easily expose an admin panel and let someone else set this, as we do for branches, no?
(In reply to comment #21)
> rhelmer,
> 
> Also, if you can take the scraper you have and deploy it on Dev so that it's
> updating DevDB, I'd appreciate it.

Sure thing, doing that now.
Laura,

FWIW, we're not terminating any data on the old aggregation system.  It's just that the reporting for betas is deceptive because they get lumped in with releases.  So Fennec will still appear there as it does now.

Fennec could also make one of their team responsible for updating the product_versions list every time they do a release, if that's easier than fixing the info files.  

Clearly, though, writing per-product backfill code is going to be a requirement for 2.2 since we're going to be backfilling.
(In reply to comment #25)
> (In reply to comment #21)
> > rhelmer,
> > 
> > Also, if you can take the scraper you have and deploy it on Dev so that it's
> > updating DevDB, I'd appreciate it.
> 
> Sure thing, doing that now.

Done, let me know if it looks ok.
To be completely clear about my opinion:

(a) yes, we will need to support manual update and make it accessible to the dev/release teams.

(b) I will need to write backfill code which copes with adding a product/version late.
I went ahead and mirrored the .txt files in the mobile area on FTP, and rearranged it to be consistent with others, so I could continue testing the script (for example if bug 675272 is fixed).

It seems to work ok. 

Josh, would it be useful to have the mobile info for the 5+ releases in the releases_raw table in devdb now?
rhelmer:

Yes, it would.
(In reply to comment #29)
> I went ahead and mirrored the .txt files in the mobile area on FTP, and
> rearranged it to be consistent with others, so I could continue testing the
> script (for example if bug 675272 is fixed).
> 
> It seems to work ok. 
> 
> Josh, would it be useful to have the mobile info for the 5+ releases in the
> releases_raw table in devdb now?

(In reply to comment #30)
> rhelmer:
> 
> Yes, it would.

Done.
* tested/works for Firefox, Thunderbird, SeaMonkey (and mobile, assuming dependent bug 675272 is fixed; tested this on a modified local FTP mirror)
* fixed a few bugs/spurious warnings for nightly case
* reduced the delta between the nightly and release cases 

I've been holding this patch back because I don't have full test coverage of all the release cases, but I don't want to hold QA back any longer. I'll continue working on this while you review.
Attachment #549193 - Attachment is obsolete: true
Attachment #549992 - Flags: review?(lars)
Attachment #549992 - Attachment is obsolete: true
Attachment #549992 - Flags: review?(lars)
Attachment #549993 - Flags: review?(lars)
Comment on attachment 549993 [details] [diff] [review]
add release/beta support to FTP scraper

you can choose to make these trivial changes or not:

====================
.../socorro/cron/builds.py - lines 62-68

>  if result != None:
>    return True
>  else:
>    return False

how about:

>  return result is not None

====================
.../socorro/cron/builds.py - line 305

> if version.find('b') != -1:

how about:

> if 'b' in version:
Attachment #549993 - Flags: review?(lars) → review+
Sound suggestions, made them; thanks lars!

Sending        socorro/cron/builds.py
Sending        socorro/database/schema.py
Sending        socorro/unittest/cron/testBuilds.py
Sending        socorro/unittest/database/testSchema.py
Transmitting file data ....
Committed revision 3323.

I am going to follow up with some deeper unit tests; all the shared code and most of the new code is covered at this point, but it's a little tricky (for me anyway) to get all the urllib mocking right.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
FYI I've got some unittests that use a urllib mock here:
http://hg.mozilla.org/hgcustom/hghooks/file/60fe3c851eb2/runtests.py#l173

It lets you do "expect(url, data)" to tell the mock that the next URL should be |url|, and should return |data|.
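
Roughly, the pattern looks like this (a minimal sketch assuming the code under test fetches pages via urllib2.urlopen; the class name is made up, and this is not the linked hghooks implementation):

import StringIO

class ExpectingURLOpener(object):
    """Returns canned bodies for an expected sequence of URLs."""
    def __init__(self):
        self.queue = []
    def expect(self, url, data):
        # The next URL the code under test should request, and the body
        # the fake urlopen will hand back for it.
        self.queue.append((url, data))
    def __call__(self, url):
        expected_url, data = self.queue.pop(0)
        assert url == expected_url, 'expected %s, got %s' % (expected_url, url)
        return StringIO.StringIO(data)

# In a test, swap it in before exercising the scraper (and restore the
# real urllib2.urlopen afterwards):
#   opener = ExpectingURLOpener()
#   opener.expect('http://stage.mozilla.org/example/', 'directory listing')
#   urllib2.urlopen = opener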
(In reply to comment #36)
> FYI I've got some unittests that use a urllib mock here:
> http://hg.mozilla.org/hgcustom/hghooks/file/60fe3c851eb2/runtests.py#l173
> 
> It lets you do "expect(url, data)" to tell the mock that the next URL should
> be |url|, and should return |data|.

Thanks! That looks useful.

Socorro has something similar (http://code.google.com/p/socorro/source/browse/trunk/socorro/unittest/cron/testBuilds.py#182); I just haven't had time to focus on it yet and want to make sure this makes code freeze.

I have tested the code, but it's brittle, and I want to spend some extra time on tests to ensure it doesn't break in the future. However, I don't want to block the release on this :)
Bumping to QA verified. The releases are properly being scraped. Verified by looking at the data in the app itself. lgtm
Status: RESOLVED → VERIFIED
Component: Socorro → General
Product: Webtools → Socorro