Closed Bug 1069469 (Opened 11 years ago, Closed 10 years ago)

rewrite modulelist M/R job to use postgres instead

Categories

(Socorro :: Infra, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rhelmer, Assigned: rhelmer)

Details

It should be straightforward to port the current modulelist job to Postgres. We can and should make this job more useful, but to make it easier to move off of HBase I am thinking of doing a straight port that continues dumping CSV files to https://crash-analysis.mozilla.com/crash_analysis/modulelist/
The query should be something along the lines of the following. Due to the nature of this job it's a full table scan, grouping to remove duplicates:

COPY (
  WITH modules AS (
    SELECT json_array_elements(processed_crash->'json_dump'->'modules') AS modules
    FROM processed_crashes
    WHERE date_processed BETWEEN '2014-09-01' AND '2014-09-02'
  )
  SELECT modules->>'filename' AS filename
       , modules->>'debug_file' AS debug_file
       , modules->>'debug_id' AS debug_id
       , modules->>'version' AS version
  FROM modules
  GROUP BY filename, debug_file, debug_id, version
) TO STDOUT WITH CSV;

Hadoop takes ~5 minutes to run this; I expect Postgres to be slower than that. Timing it now.
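One way to time a run like this from the shell (the database name and file names below are placeholders, not the actual setup):

  time psql breakpad -f modulelist.sql > modulelist-20140901.csv

Alternatively, turning on \timing in psql before running the COPY gives the server-side elapsed time.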
The current job only selects Windows symbols, so you should make sure to keep that behavior.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #2)
> The current job only selects Windows symbols, so you should make sure to
> keep that behavior.

Ah, good point. Need to adjust that query, thanks.
OK, more like this:

COPY (
  WITH modules AS (
    SELECT json_array_elements(processed_crash->'json_dump'->'modules') AS modules
         , processed_crash->'json_dump'->>'os_name' AS os_name
    FROM processed_crashes
    WHERE date_processed BETWEEN '2014-09-01 00:00:00' AND '2014-09-01 00:01:00'
  )
  SELECT modules->>'filename' AS filename
       , modules->>'debug_file' AS debug_file
       , modules->>'debug_id' AS debug_id
       , modules->>'version' AS version
  FROM modules
  WHERE os_name = 'Windows NT'
  GROUP BY filename, debug_file, debug_id, version
) TO STDOUT WITH CSV;
This seems worth testing on JSONB.
(In reply to [:jberkus] Josh Berkus from comment #5)
> This seems worth testing on JSONB.

I agree. Note that this job is temporary (bug 948644 should replace it eventually), but it's still very useful to do ad-hoc queries on this data.
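A rough sketch of what a JSONB comparison run could look like, casting inline since processed_crash is stored as json today (the cast is an assumption for testing purposes, not a proposed schema change):

COPY (
  WITH modules AS (
    SELECT jsonb_array_elements((processed_crash::jsonb)->'json_dump'->'modules') AS modules
         , (processed_crash::jsonb)->'json_dump'->>'os_name' AS os_name
    FROM processed_crashes
    WHERE date_processed BETWEEN '2014-09-01 00:00:00' AND '2014-09-01 00:01:00'
  )
  SELECT modules->>'filename' AS filename
       , modules->>'debug_file' AS debug_file
       , modules->>'debug_id' AS debug_id
       , modules->>'version' AS version
  FROM modules
  WHERE os_name = 'Windows NT'
  GROUP BY filename, debug_file, debug_id, version
) TO STDOUT WITH CSV;

For a fair comparison the per-row cast cost would dominate, so materializing a jsonb copy of the column in a scratch table first would give a better read on the actual query time.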
There are also other multicolumn queries we do which suffer from the n-combinatorial problem on the relational data. Offering search access on the JSONB would improve these a lot, and possibly allow us to drop relational indexes to shrink the DB. As a warning, though, the JSONB indexes are equality only, so queries like "modules->version between 1.0 and 1.5" would be unindexed. We won't have range indexes on JSONB datums until Postgres 9.5.
(In reply to [:jberkus] Josh Berkus from comment #7)
> There are also other multicolumn queries we do which suffer from the
> n-combinatorial problem on the relational data. Offering search access on
> the JSONB would improve these a lot, and possibly allow us to drop
> relational indexes to shrink the DB.
>
> As a warning, though, the JSONB indexes are equality only, so queries like
> "modules->version between 1.0 and 1.5" would be unindexed. We won't have
> range indexes on JSONB datums until Postgres 9.5.

That is good to know... in this example and others I have worked on recently (like correlations) we have only cared about equality, fortunately.
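For illustration, assuming the column were actually jsonb, a jsonb_path_ops GIN index covers containment/equality lookups but not range comparisons (the index name and queries below are hypothetical):

CREATE INDEX processed_crashes_crash_idx
    ON processed_crashes USING gin (processed_crash jsonb_path_ops);

-- indexable: exact-match containment
SELECT count(*)
  FROM processed_crashes
 WHERE processed_crash @> '{"os_name": "Windows NT"}';

-- not indexable by the GIN index: range comparison on an extracted value
SELECT count(*)
  FROM processed_crashes
 WHERE processed_crash->>'version' BETWEEN '1.0' AND '1.5';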
(In reply to Robert Helmer [:rhelmer] from comment #4)
> OK, more like this:
>
> COPY (
>   WITH modules AS (
>     SELECT json_array_elements(processed_crash->'json_dump'->'modules') AS modules
>          , processed_crash->'json_dump'->>'os_name' AS os_name

Actually this should be: processed_crash->>'os_name'

I am testing to see how long this takes right now.
(In reply to Robert Helmer [:rhelmer] from comment #9)
> (In reply to Robert Helmer [:rhelmer] from comment #4)
> > OK, more like this:
> >
> > COPY (
> >   WITH modules AS (
> >     SELECT json_array_elements(processed_crash->'json_dump'->'modules') AS modules
> >          , processed_crash->'json_dump'->>'os_name' AS os_name
>
> Actually this should be: processed_crash->>'os_name'
>
> I am testing to see how long this takes right now.

I gave up on this after several hours... I think it's far too slow right now. I have it running in screen recording the start/end time so we'll know for sure, but I think this approach isn't going to work.
Yeah, with JSON instead of JSONB that's going to be pretty slow.
We ended up recording these into a table from the processor as they happen, and then dumping that table to a CSV nightly.
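As a sketch of that shape (the table and column names here are illustrative, not the actual Socorro schema):

-- the processor writes one row per module it sees while processing a crash
CREATE TABLE modules (
    os_name    TEXT,
    filename   TEXT,
    debug_file TEXT,
    debug_id   TEXT,
    version    TEXT
);

-- a nightly cron job then dumps the deduplicated Windows rows to CSV
COPY (
  SELECT DISTINCT filename, debug_file, debug_id, version
    FROM modules
   WHERE os_name = 'Windows NT'
) TO STDOUT WITH CSV;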
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED