Closed Bug 1116567 Opened 9 years ago Closed 7 years ago

Migrate daily_urls.py crontabber job to S3

Categories: Socorro :: Backend (task)
Platform: x86_64, Linux
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked

RESOLVED WONTFIX

Reporter: selenamarie
Assignee: Unassigned

Currently, this job creates CSV files that are stored on a server. Migrating away from server-stored CSV files should include:

* Create a bucket for holding report output like this
* Write output to S3
* Create/update API to access these reports

The old version is in:
https://github.com/mozilla/socorro/blob/master/socorro/cron/dailyUrl.py

There is a newer crontabber-based rewrite:
https://github.com/mozilla/socorro/blob/master/socorro/cron/jobs/daily_url.py

This would be a good opportunity to fix the name ("daily url"?).

The most important issue, I think, is:

* this writes both a private and public CSV file so we need a private S3 bucket too
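A minimal sketch of the private/public split described above, assuming the public file is the private one with sensitive columns blanked. The column names, the key naming scheme, and the helpers `render_csv` and `build_uploads` are hypothetical and do not come from the existing job; the sketch only prepares the CSV text per bucket and leaves the actual upload to whatever S3 client is used.

```python
import csv
import io

# Hypothetical column names; the real job's columns may differ.
PRIVATE_ONLY_COLUMNS = {"url", "email"}


def render_csv(rows, fieldnames, redact=()):
    """Return CSV text for rows, blanking any column named in redact."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(
            {k: ("" if k in redact else row.get(k, "")) for k in fieldnames}
        )
    return out.getvalue()


def build_uploads(rows, fieldnames, day, private_bucket, public_bucket):
    """Map (bucket name, key name) to the CSV text for one day's output."""
    key = "%s-crashdata.csv" % day  # hypothetical key naming scheme
    return {
        (private_bucket, key): render_csv(rows, fieldnames),
        (public_bucket, key): render_csv(
            rows, fieldnames, redact=PRIVATE_ONLY_COLUMNS
        ),
    }
```

Keeping the two outputs under separate bucket names means access can be controlled per bucket, which is the reason a private bucket is needed in addition to the public one.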

There are some other issues that are important but not blocking moving this to S3:

1) uses deprecated code like socorro.database
2) uses reports table which is deprecated

Those two could probably be fixed at the same time later, to use, e.g., the processed_crashes table instead.

When I wrote daily_url.py I just copied dailyUrl.py, changed some indentation, and fixed PEP8 nits. I never really studied how it worked.


> * Create a bucket for holding report output like this

What do you mean by "Create a bucket"? Can we not give this crontabber app sufficient credentials to create its own bucket? The job can do something like::

    import boto

    conn = boto.connect_s3()
    # lookup() returns None when the bucket does not exist
    bucket = conn.lookup(self.config.bucket_name)
    if bucket is None:
        bucket = conn.create_bucket(self.config.bucket_name)
    ...

I might have lost some context in the recent work on migrating from HBase to S3, but it scares me that the code depends on manual work to set up buckets.

> * Write output to S3

Definitely!

> * Create/update API to access these reports

The app should (IMHO) have the power to create the bucket and to make the necessary PUT to place files in there. But the "API" term scares me. Can't we just use boto right there and then?

(In reply to Peter Bengtsson [:peterbe] from comment #2)
> When I wrote daily_url.py I just copied dailyUrl.py, changed some
> indentation and PEP8 nits. I never really studied how it worked. 
> 
> 
> > * Create a bucket for holding report output like this
> 
> What do you mean by "Create a bucket"? Can we not give this crontabber app
> sufficient credentials to create its own bucket. The job can do something
> like::
>     
>     bucket = boto.get_bucket(self.config.bucket_name)
>     if not bucket:
>         boto.create_bucket(self.config.bucket_name)
>         bucket = boto.get_bucket(self.config.bucket_name)
>     ...
> 
> I might have lost some context in the recent work on migrating from hbase to
> s3 but it scares me that code depends on manual work on setting up buckets. 


It doesn't require manual work; the missing piece here is that our infra will be configured and activated using something like Terraform or CloudFormation. Individual components (like crontabber) will be using very restricted AWS IAM accounts, so they won't be able to create things and will only be able to write to very restricted buckets.

It would be more friendly to devs and others if we had the code attempt to create the bucket when it doesn't exist, but this won't work on Socorro infra, for the reasons above.


> 
> > * Write output to S3
> 
> Definitely!
> 
> > * Create/update API to access these reports
> 
> The app should have power (IMHO) to create the bucket and to make the
> necessary PUT to place in there. But the "API" term scares me. Can't we just
> use boto right there and then?


I think this API bit is more about the output of the reports - there's no autoindex on AWS, so we either need to write something specific for these reports, or have a webapp that does something like Apache autoindex (which is what generates what you see on https://crash-analysis.mozilla.com).
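As a rough illustration of the "something like Apache autoindex" option, the sketch below turns a list of key names into a minimal HTML listing. It assumes the key names have already been fetched from the bucket by some S3 client; `render_index` and the `prefix` parameter are hypothetical names, not part of Socorro.

```python
from html import escape
from urllib.parse import quote


def render_index(key_names, prefix=""):
    """Return a minimal HTML listing of the keys that start with prefix."""
    names = sorted(n for n in key_names if n.startswith(prefix))
    items = "\n".join(
        # quote() makes the name safe in a URL, escape() safe in HTML text
        '<li><a href="%s">%s</a></li>' % (quote(n), escape(n))
        for n in names
    )
    return "<ul>\n%s\n</ul>" % items
```

A webapp view could serve this output for the public bucket, while the private bucket would need an authenticated equivalent.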
(In reply to Robert Helmer [:rhelmer] from comment #3)
> Doesn't require manual work - the missing piece here is that our infra will
> be configured and activated using something like Terraform or
> CloudFormation. Individual components (like crontabber) will be using very
> restricted AWS IAM accounts so they won't be able to create things, and will
> only be able to write to very restricted buckets.
> 
> It would be more friendly to devs and others if we had the code attempt to
> create the bucket if it doesn't exist - this won't work on Socorro infra for
> reasons above.

Just to clarify - I am saying that it's fine for the code to go ahead and try to do this, just don't expect it to actually work on Socorro stage/prod.
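One way to sketch the "try to create it, but don't expect that to work on stage/prod" behaviour is below. It assumes a connection object with `lookup(name)` returning a bucket or `None` and `create_bucket(name)` returning a bucket, as boto 2's `S3Connection` provides; `get_or_create_bucket`, `BucketUnavailable`, and the `create_errors` parameter are hypothetical names introduced for illustration.

```python
import logging

logger = logging.getLogger(__name__)


class BucketUnavailable(Exception):
    """Raised when the bucket is missing and could not be created."""


def get_or_create_bucket(conn, name, create_errors=(Exception,)):
    """Return the bucket called name, creating it when permitted.

    conn is assumed to offer lookup(name) -> bucket or None and
    create_bucket(name) -> bucket.
    """
    bucket = conn.lookup(name)
    if bucket is not None:
        return bucket
    try:
        return conn.create_bucket(name)
    except create_errors as exc:
        # Expected when the credentials are too restricted to create buckets.
        logger.error("cannot create bucket %r: %s", name, exc)
        raise BucketUnavailable(name) from exc
```

On a developer machine with broad credentials the bucket gets created on first use; with restricted IAM credentials the job fails with a clear error instead of silently continuing.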
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX