create API endpoint for release metadata

Status: NEW, Unassigned
Component: Socorro :: Webapp
Reported: 3 years ago
Last updated: 2 years ago
Reporter: rhelmer
Blocks: 2 bugs
Firefox Tracking Flags: (Not tracked)

(Reporter)

Description

3 years ago
Right now Socorro scrapes FTP to get release metadata. It goes into a table that looks like this:

 product_name | version |  platform  |    build_id    | build_type | beta_number |   repository    | update_channel | version_build 
--------------+---------+------------+----------------+------------+-------------+-----------------+----------------+---------------
 firefox      | 40.0a1  | linux-i686 | 20150410030204 | nightly    |             | mozilla-central | nightly        | 

We should have a service (similar to symbol upload) that allows them to upload this data, instead of us having to scrape it from FTP.

This is somewhat urgent, since FTP is going away in the near future.
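The table above suggests the shape of the payload such an upload service would accept. A minimal sketch, using the example row from the table; the field names mirror the table columns, and the idea of a JSON POST body is an assumption (no endpoint exists yet):

```python
import json

# Hypothetical upload payload mirroring the table columns above;
# the values are the nightly example row from the bug description.
payload = {
    "product_name": "firefox",
    "version": "40.0a1",
    "platform": "linux-i686",
    "build_id": "20150410030204",
    "build_type": "nightly",
    "beta_number": None,  # empty for nightly builds
    "repository": "mozilla-central",
    "update_channel": "nightly",
}

# What a client would POST to the service instead of Socorro scraping FTP.
body = json.dumps(payload)
```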
(Reporter)

Comment 1

3 years ago
How feasible is it to do this as a microservice, separate from the main crash-stats django app? It'd be nice if crash-stats being down didn't mean we missed release metadata too...
Flags: needinfo?(peterbe)
(Reporter)

Comment 2

3 years ago
Is it possible for you to keep track (within reason) and retry if the upload fails for some reason? 

Thinking of temporary/intermittent downtime events, right now we can "backfill" by scraping dates off of FTP, it'd be a shame to miss it because of a transient network or server error.
Flags: needinfo?(bhearsum)
(Reporter)

Updated

3 years ago
Component: Middleware → Webapp

Comment 3

3 years ago
(In reply to Robert Helmer [:rhelmer] from comment #1)
> How feasible is it to do this as a microservice, separate from the main
> crash-stats django app? It'd be nice if crash-stats being down didn't mean
> we missed release metadata too...

Microservices are cool and all but they're admin overhead too. Why not just stick it in as a django app within the webapp-django. Writing some basic RESTish interface is easy. It doesn't have to be unicorns all the way down. Just some HTTP POST endpoint basically. 
Once we're away from the middleware the code can simply be:

 @protect_with_the_usual_crash_stats_tokens_stuff
 @json_view
 def new_release(request):
    from socorro.external.postgresql.releases import Releases
    Releases().add_release(product_name=request.POST['product_name'], ...)
    return True
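For illustration, the core of that sketch with the Django plumbing factored out - a bare validate-and-store function. The required field list is an assumption based on the table in the description, and `add_release` stands in for the `Releases().add_release` call above:

```python
# Hypothetical sketch of the endpoint's core logic, framework aside.
# REQUIRED_FIELDS mirrors the table in the bug description.
REQUIRED_FIELDS = (
    "product_name", "version", "platform",
    "build_id", "build_type", "repository", "update_channel",
)

def new_release(post_data, add_release):
    """Validate the POSTed fields, then hand them to a storage callable
    (standing in for Releases().add_release)."""
    missing = [f for f in REQUIRED_FIELDS if f not in post_data]
    if missing:
        # An error dict rather than an exception, so @json_view-style
        # wrappers could serialize it straight back to the client.
        return {"error": "missing fields: %s" % ", ".join(missing)}
    add_release(**{f: post_data[f] for f in REQUIRED_FIELDS})
    return True
```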
Flags: needinfo?(peterbe)

Comment 4

3 years ago
(In reply to Robert Helmer [:rhelmer] from comment #2)
> Is it possible for you to keep track (within reason) and retry if the upload
> fails for some reason? 
> 
> Thinking of temporary/intermittent downtime events, right now we can
> "backfill" by scraping dates off of FTP, it'd be a shame to miss it because
> of a transient network or server error.

While we're still in Buildbot, it's going to be difficult to cope with extended outages. E.g., we could retry for half an hour or so, but I wouldn't want to tie up a slave for hours just waiting to push to Socorro.

When we have scheduling in taskcluster this probably gets a lot easier - we can have a tiny downstream task that retries for a very long period of time.

Another thing we might be able to do is implement this as some sort of status plugin to buildbot, or maybe stuff it into postrun.py... that might let us retry for longer without tying up a slave.
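The bounded retry window described above could be sketched as a simple doubling-backoff loop. The `post_release` callable is hypothetical (whatever actually POSTs to Socorro); the half-hour cap is the figure from the comment:

```python
import time

def upload_with_retry(post_release, payload, max_total=1800, base=2):
    """Call post_release(payload); on failure, sleep with doubling
    backoff, giving up once the cumulative wait would exceed
    max_total seconds (default ~30 minutes, per the comment above)."""
    waited = 0.0
    delay = base
    while True:
        try:
            return post_release(payload)
        except Exception:
            if waited + delay > max_total:
                raise  # budget exhausted; don't tie up the slave longer
            time.sleep(delay)
            waited += delay
            delay *= 2
```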
(Reporter)

Comment 5

3 years ago
(In reply to Peter Bengtsson [:peterbe] from comment #3)
> Microservices are cool and all but they're admin overhead too. Why not just
> stick it in as a django app within the webapp-django. Writing some basic
> RESTish interface is easy. It doesn't have to be unicorns all the way down.
> Just some HTTP POST endpoint basically. 
> Once we're away from the middleware the code can simply be::
> 
>  @protect_with_the_usual_crash_stats_tokens_stuff
>  @json_view
>  def new_release(request):
>     from socorro.external.postgresql.releases import Releases
>     Releases().add_release(product_name=request.POST['product_name'], ...)
>     return True

I agree it's easier, but there are two things that really suck about the symbol upload situation:

1) when crash-stats is down (for any silly reason) we can't accept debug symbols, breaking nightly builds and potentially closing the tree
2) our auth system is a bit clunky for machine accounts - for instance, the way we had to have an individual log in via Persona and give them extra-long tokens

So I want to learn from that and do better if we can. I think #1 in particular is going to make us a lot more hesitant to push changes or move to CD if we get burned by a tree closure or two.

It really seems like this new service (and also symbol upload) don't fit with the rest of crash-stats, and deserve their own services. (I'll push on splitting symbol upload out separately; I don't want to block the momentum on that.)

The overhead of bringing up and maintaining a new service on AWS is really not much.
(Reporter)

Comment 6

3 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #4)
> While we're still in Buildbot, it's going to be difficult to cope with
> extended outages. Eg, we could retry for half an hour or so, but I wouldn't
> want to tie up a slave for hours just waiting to push to Socorro.
> 
> When we have scheduling in taskcluster this probably gets a lot easier - we
> can have a tiny downstream task that retries for a very long period of time.
> 
> Another thing we might be able to do is implement this as some sort of
> status plugin to buildbot, or maybe stuff it into postrun.py... that might
> let us retry for longer without tying up a slave.


Thanks! Also, it would be remiss of me not to at least ask - have you considered running this service instead of us? :) E.g. running a little web app that receives the POST, and then providing an API endpoint for consumers to get at the metadata?

I know for sure there are other people around Mozilla (and maybe outside too) that want to get at this info. We could provide it from crash-stats, but that seems a bit roundabout, doesn't it?

Comment 7

3 years ago
(In reply to Robert Helmer [:rhelmer] from comment #6)
> Thanks! Also it would be remiss of me not to at least ask - have you
> considered running this service instead of us? :) e.g. running a little web
> app that receives the POST, and then providing an API endpoint for consumers
> to get at the metadata?

At some point in the future, ship-it.mozilla.org is intended to be a source of truth for release-y builds (Beta/Release/ESR), but that's a ways away. Not sure about nightly-style...
Flags: needinfo?(bhearsum)
(Reporter)

Comment 8

3 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #7)
> At some point in the future, ship-it.mozilla.org is intended to be a source
> of truth for release-y builds (Beta/Release/ESR), but that's a ways away. Not
> sure about nightly-style...

Thanks, good to know. The most important one for us really is Beta, since crash reports come in without the beta number (just the version, e.g. "40.0", and channel "beta", but no "b2").

We do use Aurora/Nightly and the rest to try to determine whether incoming crashes look valid (primarily, whether the buildid matches up with the version number and platform that we fetched from FTP).
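The validity check described here amounts to a set-membership lookup against the scraped metadata. A rough sketch; the table contents and function name are illustrative, with the single known row taken from the bug description:

```python
# Known (build_id, version, platform) triples, as scraped from FTP;
# the one row here is the nightly example from the bug description.
known_releases = {
    ("20150410030204", "40.0a1", "linux-i686"),
}

def crash_looks_valid(build_id, version, platform):
    """Hypothetical check: does an incoming crash's build metadata
    match a release we know about?"""
    return (build_id, version, platform) in known_releases
```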

Comment 9

3 years ago
Can't people already listen to pulse to get the info? Of course, a service that can be queried lazily would be even nicer.

Comment 10

3 years ago
Ben, 

Sounds like ship-it.mozilla.org would solve so many problems for us. In particular, we don't need to invent something new for hosting data. 

* How soon can we expect that?

* What can we do to help make that happen sooner?

* Looking at Rob's first description of our SQL table and the data we need to have, will ship-it be able to satisfy that?
Flags: needinfo?(bhearsum)

Comment 11

3 years ago
(In reply to Peter Bengtsson [:peterbe] from comment #10)
> Ben, 
> 
> Sounds like ship-it.mozilla.org would solve so many problems for us. In
> particular, we don't need to invent something new for hosting data. 
> 
> * How soon can we expect that?
> 
> * What can we do to help make that happen sooner?
> 
> * Looking at Rob's first description of our SQL table and the data we need
> to have, will ship-it be able to satisfy that?

Someday Ship It might fit the bill here, but it doesn't have nearly enough information at this time. It has some information about "release"-style builds (Beta, Release, and ESR channel builds that ship to users), but only for ~18.0 onwards. And even for those builds it doesn't hold the update channel, buildid, or platform lists.

I would love for it to be the source of truth for all of this information, but it's in dire need of rearchitecting if it's going to do that. It was designed as an internal-only tool (it requires auth and has no external IP) for starting release-style automation. If we want to start tracking nightly information, or even just buildids for release builds, we need to adjust its data model and APIs, and add more hooks to downstream automation to feed data back to it.
Flags: needinfo?(bhearsum)
So, it sounds like we're best off building something ourselves that RelEng can POST to.
Let's build a prototype then, Rob!

So, given that we're already posting the crash symbols to Socorro, would it make sense to post the release metadata at the same time, as part of that request?

When the release promotion project is complete, scraping probably will not work anymore. We should make sure we address this before that goes live.
Blocks: 1118794
(Reporter)

Updated

2 years ago
Duplicate of this bug: 1021048
I have the same situation in bug 1021048. I hope we can have this feature soon to address it.
Blocks: 1249753