Closed Bug 1162105 Opened 10 years ago Closed 8 years ago

disable daily CSV cron job

Categories

(Testing Graveyard :: Sisyphus, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lars, Unassigned)

References

Details

We need a migration plan. It appears that Sisyphus is the last consumer of the old Socorro CSV files known as the DailyURL. As the Socorro project migrates to AWS, we need to make sure that the data from Socorro still flows into Sisyphus. Is Sisyphus also moving to AWS? If Sisyphus does not also move, then the Socorro data must cross the boundary out of AWS and may incur $$ costs. We'd like to audit exactly what data Sisyphus uses from the Socorro CSV files and, perhaps, devise a way for Sisyphus to use the data directly from Socorro. Alternatively, perhaps Socorro could inject data directly into the Sisyphus database.
Blocks: 1118288
The fields that are definitely in use are:

signature
url
product
version
branch
os_name
os_version
cpu_info

I believe the url is the only sensitive data item.

It would be nice to be able to get the exploitability rating so I could prioritize testing those first.

I would prefer to pull the data from Socorro given a date range, usually just a day. Sisyphus/Bughunter can't handle the full volume of crashes on a daily basis. I normally load them as needed when the current set of crash urls have been processed.
(In reply to Bob Clary [:bc:] from comment #1)
> The fields that are definitely in use are:
>
> signature
> url
> product
> version
> branch
> os_name
> os_version
> cpu_info
>
> I believe the url is the only sensitive data item.
>
> It would be nice to be able to get the exploitability rating so I could
> prioritize testing those first.
>
> I would prefer to pull the data from Socorro given a date range, usually
> just a day. Sisyphus/Bughunter can't handle the full volume of crashes on a
> daily basis. I normally load them as needed when the current set of crash
> urls have been processed.

I believe that you could get all of this info now as JSON from our API: https://crash-stats.mozilla.com/api

Peter, would you mind helping to figure out exactly what API call(s) we'd need to provide the above?

Bob, you'd need an account on crash-stats with PII access (you can generate a token for use with scripts/curl/etc); you could then get this data close to real-time rather than waiting for a daily CSV dump.
Flags: needinfo?(peterbe)
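[Editor's note: for illustration, here is a minimal Python sketch of the token-based access described above, using the `requests` library. The `Auth-Token` header name and the account-settings flow are assumptions to verify against https://crash-stats.mozilla.com/api, not details spelled out in this bug.]

```python
# Hypothetical sketch: fetch crash data from crash-stats with a personal
# API token instead of parsing the daily CSV dump. Assumes (not confirmed
# in this bug) that the token is sent in an "Auth-Token" header.
import requests

API_TOKEN = "..."  # generated from your crash-stats account (needs PII access)

resp = requests.get(
    "https://crash-stats.mozilla.com/api/SuperSearchUnredacted/",
    headers={"Auth-Token": API_TOKEN},
    params={"product": "Firefox"},  # illustrative filter only
    timeout=60,
)
resp.raise_for_status()
data = resp.json()
```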
It would be helpful if the API allowed me to only get the crashes I will actually use in testing. I ignore crashes without urls, crashes on old versions, crashes on private urls, and duplicates. Not sure if I have PII, but I can see urls and exploitability ratings when logged in.

To answer the question about where sisyphus/bughunter is going to live: as far as I know we are staying in PHX1 for now and will move to SCL3. The need for OSX, Windows, and Linux; the need to update the workers to keep up with security patches; the need to be able to run under snapshots that can be rolled back; and the sensitive nature of the urls and crashes all give me pause when thinking of moving it out of house.

I think the actual volume of data to be moved will be much smaller than you might expect from the daily crash dumps:

1) I won't be needing new data every day; more like once a week.
2) The overall size will be much smaller.
I'm going to throw the ball over to Adrian, who is our resident master of SuperSearch.

The endpoint to use is https://crash-stats.mozilla.com/api/#SuperSearchUnredacted which *requires* that you have the following two permissions under your name:

* View Exploitability Results
* View Personal Identifiable Information

If it doesn't appear on https://crash-stats.mozilla.com/api/ for you, Bob, you'll have to file bugs against Socorro to have those permissions added to your name.

One crux is that SuperSearchUnredacted currently returns EVERYTHING, but the ability to specify exact fields is in the works and will be available in a couple of weeks.
Flags: needinfo?(peterbe) → needinfo?(adrian)
Actually, the truth is that the API will not work for this until we have the ability to specify which columns should come back. If you request all the fields, you risk an enormous JSON blob that would bring our server to a crawl. Adrian, can we make this bug depend on the one you're working on for the columns?
Peter answered the question very well. You can use the `date` parameter to get only the day you want, and use any other param to filter the results even more. You'll also need to use `_results_offset` and `_results_number` to page through the data; by default it returns only the first 100 results.

I'd even recommend that you use the UI we have [0] to make the initial search, then click the "More options" link under the search form and copy the link to the public API it gives you. You'll just need to replace `SuperSearch` with `SuperSearchUnredacted` to get all the data you need.

Here's an example API call:
https://crash-stats.mozilla.com/api/SuperSearch/?date=%3E%3D2015-05-06T00%3A00%3A00&date=%3C2015-05-07T00%3A00%3A00

[0] https://crash-stats.mozilla.com/search/
Depends on: 1144569
Flags: needinfo?(adrian)
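[Editor's note: to make the paging concrete, here is a hedged Python sketch that follows Adrian's example URL: it pulls one day of crashes from SuperSearchUnredacted, pages with `_results_offset`/`_results_number`, and keeps only the fields listed in comment 1. The `{"total": N, "hits": [...]}` response shape is assumed from the public SuperSearch API; anything not named in this bug is an assumption.]

```python
# Hedged sketch of paging through one day of crashes, based on Adrian's
# example URL above. Field names come from comment 1; the {"total": N,
# "hits": [...]} response shape is assumed from the SuperSearch API.
import requests

URL = "https://crash-stats.mozilla.com/api/SuperSearchUnredacted/"
API_TOKEN = "..."  # token with PII + exploitability permissions

params = {
    # repeated "date" params encode the >= / < range from the example URL
    "date": [">=2015-05-06T00:00:00", "<2015-05-07T00:00:00"],
    "_results_number": 100,  # page size (the documented default)
    "_results_offset": 0,
}

hits = []
while True:
    resp = requests.get(URL, params=params,
                        headers={"Auth-Token": API_TOKEN}, timeout=60)
    resp.raise_for_status()
    page = resp.json()
    hits.extend(page["hits"])
    if not page["hits"] or len(hits) >= page["total"]:
        break
    params["_results_offset"] += params["_results_number"]

# Keep only the fields Bughunter actually uses (comment 1).
FIELDS = ("signature", "url", "product", "version", "branch",
          "os_name", "os_version", "cpu_info")
rows = [{f: hit.get(f) for f in FIELDS} for hit in hits]
```

Once the column-selection work from bug 1144569 landed, a parameter restricting the returned fields (e.g. `_columns`) would avoid pulling full processed crashes; treat that parameter name as an assumption to check against the API docs.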
We've just pushed to AWS - do you need any help moving over to the API for this, or do you need us to continue producing a CSV? Unfortunately, we're not going to be able to push to servers in Mozilla's datacenters.
No longer blocks: 1118288
Flags: needinfo?(bob)
I have one from yesterday which I'll load as soon as the current set completes. That will hold me over for several days at least. I haven't had a chance to look at this while I've been working on Autophone issues, but I should be able to get something together soon. If I have problems, I'll ask for help. No need to keep producing the CSV file any more.
Flags: needinfo?(bob)
Lars, bsmedberg asked that we re-enable the CSV job since folks on the graphics team and external users are using it. Can we come up with a better way to produce this so you can land your refactoring patch? (I know the current method uses a bunch of old code we'd like to get rid of.)
Flags: needinfo?(lars)
Summary: rework Sisyphus data source to support Socorro's new home in AWS → disabled daily CSV cron job
Summary: disabled daily CSV cron job → disable daily CSV cron job
I believe I can use SuperSearchUnredacted to get what I need. I am going to leave this bug open for any work you may need to complete, but will be making the necessary changes to Bughunter in bug 1185498.
fyi, Bughunter is now using the Socorro search api. Thanks all.
Bughunter is loading urls directly from Socorro (bug 1192646) and no longer uses the daily csv dump files. Is there anything else that needs to be done here to support others' uses or can we mark this fixed?
Flags: needinfo?(lars)
-> fixed as far as I know.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard