Closed Bug 529946 Opened 15 years ago Closed 12 years ago

Query access to backend data

Categories

(Socorro :: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dmandelin, Assigned: laura)

Details

(Whiteboard: [Q22012wanted][qa-])

I'd like to do historical analysis on crash report data. Direct query access (SQL or whatever) to the backend would make it a lot easier. Currently I'm using Python scripts to drive the web interface and parse results, but that's clunky and inefficient.
There is some concern that long-running queries might impact the server's ability to accept incoming data in a timely fashion, and that allowing free-form queries would lead to such a state.
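The "drive the web interface" workflow described above might look roughly like the sketch below: build one search URL per day over a date range, then (in real use) fetch and parse each results page. This is a hypothetical illustration; the base URL path and the `signature`/`date_start`/`date_end` parameter names are assumptions, not Socorro's actual query API.

```python
from datetime import date, timedelta

# Illustrative base URL; not necessarily the real query endpoint.
BASE = "https://crash-stats.mozilla.com/query"

def daily_query_urls(start, end, signature):
    """Yield one search URL per day in [start, end)."""
    day = start
    while day < end:
        nxt = day + timedelta(days=1)
        yield (f"{BASE}?signature={signature}"
               f"&date_start={day.isoformat()}&date_end={nxt.isoformat()}")
        day = nxt

# One query per day over a 3-day window; a real script would fetch each
# URL and scrape the HTML results, which is the clunky part.
urls = list(daily_query_urls(date(2009, 9, 1), date(2009, 9, 4), "js_Interpret"))
```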

Would it be reasonable to add a CSV output format to the web interface and use that? I believe at least some pages already provide that.
Assignee: nobody → server-ops
Component: Socorro → Server Operations
Product: Webtools → mozilla.org
QA Contact: socorro → mrz
Version: Trunk → other
mrz - can we get dave access to this data somehow?  Aravind?

If it helps, we could let them query the database directly via some read-only account, or give them a staging DB that contains a copy. I'm all for it. Possible?
I share Frank's concern.  Is there a way to do this and cut off David if it becomes a problem?  

David, you'd need to know access will be cut off if it's service impacting (shoot first, ask questions later).

Aravind, what do you think?
Assignee: server-ops → aravind
(In reply to comment #3)
> David, you'd need to know access will be cut off if it's service impacting
> (shoot first, ask questions later).

Of course. Or maybe you could run my jobs at a lower priority if possible.

Btw, in the past I have done some historical bug analysis by writing scripts that query the web page, e.g., one query for each day in a 3-month time period. You guys never complained about that :-) so I assume it wasn't service impacting. What I'm really asking for in this bug is a way to do that kind of thing that is less clunky (some sort of query language or map-reduce expression vs. a patchwork of HTML parsing scripts) and hopefully faster because it's more direct. Ideally it would have less overall performance impact, with the same or better performance for me, since I wouldn't be touching the web/HTML layers as much.
I think this would be better achieved through the python middleware layer we have deployed.

@lars: is this feasible, would it make sense to provide access like this to folks through the middleware layer?
(In reply to comment #5)
Another thing the Postgres folks advocated was creating a special user/password. This would allow us to set different max memory, query timeouts, etc. for this privileged user to fine-tune resource usage.
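For reference, PostgreSQL supports per-role defaults for settings like `statement_timeout` and `work_mem` via `ALTER ROLE ... SET`, which is one way to implement the tuning described above. A minimal sketch of generating such statements follows; the role name and values are illustrative assumptions, not a decided configuration.

```python
def role_tuning_sql(role, settings):
    """Build ALTER ROLE statements applying per-role resource limits.

    PostgreSQL applies these as session defaults whenever the role logs in,
    so the privileged analytics user gets its own timeout/memory caps.
    """
    return [f"ALTER ROLE {role} SET {name} = '{value}';"
            for name, value in settings.items()]

# Hypothetical role and limits for a read-only analytics account.
stmts = role_tuning_sql("crash_analytics_ro", {
    "statement_timeout": "5min",  # abort queries that run too long
    "work_mem": "64MB",           # cap per-operation sort/hash memory
})
```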
Assignee: aravind → nobody
Component: Server Operations → Socorro
Product: mozilla.org → Webtools
QA Contact: mrz → socorro
Let's hold this on rewriting the middleware (1.7) and review for 1.8.  Should be a generally useful feature.
Target Milestone: --- → 1.8
This will be a 2.x feature.
Target Milestone: 1.8 → 2.0
dmandelin: for right now, if you have a query, file a bug for Metrics.  We'll likely solve this with ES ad hoc queries.
Target Milestone: 2.0 → 2.2
(In reply to comment #10)
> dmandelin: for right now, if you have a query file a bug for Metrics.  We'll
> likely solve this with ES ad hoc queries.

OK, thanks.
Target Milestone: 2.2 → 2.3
Target Milestone: 2.3 → 2.3.1
(In reply to David Mandelin from comment #0)
> I like to do historical analysis on crash report data. Direct query access
> (SQL or whatever) to the backend would make it a lot easier. Currently I'm
> using Python scripts to drive the web interface and parse results, but
> that's clunky and inefficient.

Just a note that we do drop a CSV file (which is a sanitized dump of the main reports table) nightly:
https://crash-analysis.mozilla.com/crash_analysis/20111005/20111005-pub-crashdata.csv.gz

Given the crash ID in that file, you can access individual reports like so:
https://crash-stats.mozilla.com/dumps/446e67d0-def4-43e7-8228-7d62e2110316.jsonz
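Consuming that nightly dump could look something like the sketch below: parse the file with the stdlib `csv` module and build per-crash report URLs using the pattern shown above. The inline sample data and the `crash_id`/`signature` column names are assumptions for illustration; the real dump's column layout may differ.

```python
import csv
import io

# Tiny inline stand-in for the nightly dump; the real file is a gzipped,
# sanitized export of the main reports table.
SAMPLE = (
    "signature\tcrash_id\n"
    "js_Interpret\t446e67d0-def4-43e7-8228-7d62e2110316\n"
)

def report_url(crash_id):
    """Build the per-crash report URL (pattern quoted in the comment above)."""
    return f"https://crash-stats.mozilla.com/dumps/{crash_id}.jsonz"

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
urls = [report_url(r["crash_id"]) for r in rows]
```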
We have plans for this, but not until early 2012.  De-targeting until then.

If you need something specific let us know.
Whiteboard: [1Q2012]
Target Milestone: 2.3.1 → ---
Whiteboard: [1Q2012] → [Q12012wanted]
Basic plans here are:
- Add a VM to run SQL against (some subset of? secondary?) PostgreSQL, depending on DBA advice
- Add a VM to run queries against HBase
- Open up our API internally for other people to run apps/scripts against
Assignee: nobody → laura
Target Milestone: --- → 2.4.2
Component: Socorro → General
Product: Webtools → Socorro
Re item 2, above:
Dave: You now have access to query HBase via Hive (looks like SQL, maps to MapReduce) and Pig (uses its own query language, called Pig Latin, maps to MapReduce).  Details coming in 
https://bugzilla.mozilla.org/show_bug.cgi?id=708452

Re item 1:
Do you also need access to run SQL against the secondary PostgreSQL instance?  If so I'll get a bug on file for that.  

Re item 3: We're planning on opening the API internally around the end of Q1.

Let me know if you have any questions.
Target Milestone: 2.4.2 → 2.4.3
Target Milestone: 2.4.3 → 2.4.4
Done except for the API part, revisit in a month or so.
Target Milestone: 2.4.4 → 2.6
Target Milestone: 4 → 5
Whiteboard: [Q12012wanted] → [Q22012wanted]
Target Milestone: 5 → 9
Target Milestone: 9 → 13
Target Milestone: 13 → 14
Target Milestone: 14 → 16
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whiteboard: [Q22012wanted] → [Q22012wanted][qa-]