Closed
Bug 1069469
Opened 11 years ago
Closed 10 years ago
rewrite modulelist M/R job to use postgres instead
Categories
(Socorro :: Infra, task)
Socorro
Infra
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rhelmer, Assigned: rhelmer)
Details
It should be straightforward to port the current modulelist to Postgres.
We can and should make this job more useful, but to facilitate being able to move off of HBase I am thinking of doing a straight port that continues dumping CSV files to https://crash-analysis.mozilla.com/crash_analysis/modulelist/
Assignee | ||
Comment 1•11 years ago
|
||
Query should be something along the lines of the following. Due to the nature of this job it's a full table scan, grouping to remove duplicates:
COPY (
WITH modules AS (
SELECT json_array_elements(processed_crash->'json_dump'->'modules')
AS modules
FROM processed_crashes
WHERE date_processed
BETWEEN '2014-09-01' AND '2014-09-02')
SELECT modules->>'filename' AS filename
, modules->>'debug_file' AS debug_file
, modules->>'debug_id' AS debug_id
, modules->>'version' AS version
FROM modules
GROUP BY filename, debug_file, debug_id, version
) TO STDOUT WITH CSV
;
Hadoop takes ~5 minutes to run this, I expect Postgres to be slower than that. Timing it now.
Comment 2•11 years ago
|
||
The current job only selects Windows symbols, so you should make sure to keep that behavior.
Assignee | ||
Comment 3•11 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #2)
> The current job only selects Windows symbols, so you should make sure to
> keep that behavior.
Ah good point, need to adjust that query thanks.
Assignee | ||
Comment 4•11 years ago
|
||
OK more like this:
COPY (
WITH modules AS (
SELECT json_array_elements(processed_crash->'json_dump'->'modules') AS modules
, processed_crash->'json_dump'->>'os_name' AS os_name
FROM processed_crashes
WHERE date_processed
BETWEEN '2014-09-01 00:00:00' AND '2014-09-01 00:01:00'
)
SELECT modules->>'filename' AS filename
, modules->>'debug_file' AS debug_file
, modules->>'debug_id' AS debug_id, modules->>'version' AS version
FROM modules WHERE os_name = 'Windows NT'
GROUP BY filename, debug_file, debug_id, version) TO STDOUT WITH CSV;
Comment 5•11 years ago
|
||
This seems worth testing on JSONB.
Assignee | ||
Comment 6•11 years ago
|
||
(In reply to [:jberkus] Josh Berkus from comment #5)
> This seems worth testing on JSONB.
I agree. Note that this job is temporary (bug 948644 should replace it eventually), but it's still very useful to do ad-hoc queries on this data.
Comment 7•11 years ago
|
||
There are also other multicolumn queries we do which suffer from the n-combinatorial problem on the relational data. Offering search access on the JSONB would improve these a lot, and possibly allow us to drop relational indexes to shrink the DB.
As a warning, though, the JSONB indexes are equality only, so queries like "modules->version between 1.0 and 1.5" would be unindexed. We won't have range indexes on JSONB datums until Postgres 9.5.
Assignee | ||
Comment 8•11 years ago
|
||
(In reply to [:jberkus] Josh Berkus from comment #7)
> There are also other multicolumn queries we do which suffer from the
> n-combinatorial problem on the relational data. Offering search access on
> the JSONB would improve these a lot, and possibly allow us to drop
> relational indexes to shrink the DB.
>
> As a warning, though, the JSONB indexes are equality only, so queries like
> "modules->version between 1.0 and 1.5" would be unindexed. We won't have
> range indexes on JSONB datums until Postgres 9.5.
That is good to know... in this example and others I have worked on recently (like correlations) we have only cared about equality, fortunately.
Assignee | ||
Comment 9•11 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #4)
> OK more like this:
>
> COPY (
> WITH modules AS (
> SELECT json_array_elements(processed_crash->'json_dump'->'modules') AS
> modules
> , processed_crash->'json_dump'->>'os_name' AS os_name
Actually this should be: processed_crash->>'os_name'
I am testing to see how long this takes right now.
Assignee | ||
Comment 10•11 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #9)
> (In reply to Robert Helmer [:rhelmer] from comment #4)
> > OK more like this:
> >
> > COPY (
> > WITH modules AS (
> > SELECT json_array_elements(processed_crash->'json_dump'->'modules') AS
> > modules
> > , processed_crash->'json_dump'->>'os_name' AS os_name
>
> Actually this should be: processed_crash->>'os_name'
>
> I am testing to see how long this takes right now.
I gave up on this after several hours... I think it's far too slow right now.
I have it running in screen right now recording the start/end time so we'll know for sure, but I think this approach isn't going to work.
Comment 11•11 years ago
|
||
Yeah, with JSON intead of JSONB that's going to be pretty slow.
Assignee | ||
Comment 12•10 years ago
|
||
We ended up recording these into a table from the processor as they happen, and then dumping that table to a CSV nightly.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•