Closed Bug 889041 Opened 11 years ago Closed 11 years ago

Socorro API: return any raw field from report/list

Categories

(Socorro Graveyard :: Middleware, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: benjamin, Assigned: peterbe)

References

Details

(Whiteboard: [qa+])

Related to bug 889040: the report/list API should be configurably able to return additional fields from the raw JSON. For public users, this should be limited to the whitelisted set of raw fields in the API, bug 858245. It would be nice for logged-in users to be able to see full raw data, but that that is a secondary goal.
Depends on: 901977
Assignee: nobody → peterbe
Re-reading this bug, it actually doesn't depend on bug 901977. The report/list page doesn't use elasticsearch at the moment, all its data comes from postgresql. I am not sure how that overlaps with supersearch, as it will be able to show any field from the processed JSON at first, and raw JSON when 889040 is done (hopefully before the end of q3). We might still want to have that ability on both pages.
No longer depends on: 901977
Depends on: 912132
Target Milestone: --- → 60
Target Milestone: 60 → 61
Selena, My objective is to extend the reports page (Reports tab on the reports list) with columns from the raw crash. http://cl.ly/RSUn Currently that table is generated by querying socorro.external.postgresql.report:Report.get_list() which, it appears, simply does a plain select from `reports`. Can I just add an inner join on `raw_crash` and join on uuid and then include `raw_crash` which is a JSON type? Does that sound right? Then I can unpack that massive JSON thing when I render the template based on which columns specifically the user has chosen to see.
Flags: needinfo?(sdeckelmann)
Sorry, I think it should be a left OUTER join.
Commit pushed to master at https://github.com/mozilla/socorro https://github.com/mozilla/socorro/commit/8bdd29f03c5aec9db37f11e6d00394a195979986 fixes bug 889041 - return any raw field from report/list, r=AdrianGaudebert
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
The new `LEFT OUTER JOIN` added to that query turned out to be MUCH too slow. Some queries took 10 minutes!
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Commit pushed to master at https://github.com/mozilla/socorro https://github.com/mozilla/socorro/commit/27ada56def564db9fc430ed22c7cc72ee1a2c039 Revert "fixes bug 889041 - return any raw field from report/list, r=AdrianGaudebert" This reverts commit 8bdd29f03c5aec9db37f11e6d00394a195979986.
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
Stay reopened darn it!
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Selena, Can you help us figure out why the left outer join is so impressively slow? If there is no way to make it faster we're going to have to abandon this approach entirely.
Apparently even the `select count()` causes massive seq scan on raw_crashes!! http://pastebin.mozilla.org/3147786
So I ran that query (3 days span) on dev: http://peterbe.pastebin.mozilla.org/3147873 That's 0.2ms to join 14 million reports with 14 million rows in raw_crashes. On stage, the same query takes 1160ms. http://pastebin.mozilla.org/3147829 At the time of writing, dev db is at schema revision 608afb3824d and stage is at 523b9e57eba2
OOps! I used the wrong time interval on dev. With August instead I get: http://peterbe.pastebin.mozilla.org/3147988 97,000 rows took 3,000ms to do a count on.
(In reply to Peter Bengtsson [:peterbe] from comment #11) > So I ran that query (3 days span) on dev: > http://peterbe.pastebin.mozilla.org/3147873 > That's 0.2ms to join 14 million reports with 14 million rows in raw_crashes. > > On stage, the same query takes 1160ms. > http://pastebin.mozilla.org/3147829 Without any WHERE clause on raw_crashes.date_processed, this query seems to be a sequential scan across all raw_crashes partitions. With the WHERE clause it does an index scan. I am not sure we can compare the contents of dev/stage/prod directly, they potentially have very different sets of data. > At the time of writing, dev db is at schema revision 608afb3824d and stage > is at 523b9e57eba2 I looked into this and think it's a red herring (but would be nice to fix) - latest migration doesn't look related.
(In reply to Peter Bengtsson [:peterbe] from comment #10) > Apparently even the `select count()` causes massive seq scan on raw_crashes!! > http://pastebin.mozilla.org/3147786 I see this same result across dev/stage/prod. If you add a WHERE clause for raw_crashes.date_processed (equivalent date range to the one used on reports) all the seq scans go away, and index scans are used on a smaller set of partitions instead.
You'd have to check the EXPLAIN results, but you can also try joining raw_crashes and crashes on raw_crashes.date_processed = reports_clean.date_processed AND raw_crashes.uuid = reports_clean.date_processed. That may cause the query planner to jump to the single correct partition to look it up. However, meta-question! Aren't the list of crashes in report/list generated from elasticsearch queries anyway? Why couldn't we just use the data from elasticsearch, instead of going back to postgres to get it again?
(In reply to Benjamin Smedberg [:bsmedberg] from comment #15) > However, meta-question! Aren't the list of crashes in report/list generated > from elasticsearch queries anyway? Why couldn't we just use the data from > elasticsearch, instead of going back to postgres to get it again? That is not the case, raw crash data is not even in elasticsearch at the moment. PostgreSQL has its own copy of that data. And report/list has always been using PostgreSQL, and not elasticsearch. We had planned to switch some day, but that never happened.
I don't see how it's going to be possible to fix bug 897139 properly without driving report/list from elasticsearch: how can you apply the same search limits to report/list?
Commit pushed to master at https://github.com/mozilla/socorro https://github.com/mozilla/socorro/commit/f55cecde56355d46e4c8983c183df60b06cc4efe fixes bug 889041 - return any raw field from report/list take 2, r=selenamarie
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
Flags: needinfo?(sdeckelmann)
QA Steps to reproduce: 1. Click on Top Crashers (for any version) 2. Click on any of the signatures 3. Click the Reports tab 4. Scroll down and expect there to be lots of new available columns you can add (all with a * suffix) 5. Select some of the columns with a * and click the "Save and reload" button Expected outcome: * Should see the columns selected in the big table
Whiteboard: [qa?]
Target Milestone: 61 → 63
verified on stage
Status: RESOLVED → VERIFIED
Whiteboard: [qa?] → [qa+]
Product: Socorro → Socorro Graveyard
You need to log in before you can comment on or make changes to this bug.