Closed Bug 650904 Opened 13 years ago Closed 11 years ago

pull data for correlations from pig, implement in postgres

Categories

(Socorro :: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX
Future

People

(Reporter: laura, Assigned: rhelmer)

References

Details

(Whiteboard: [Q42011wanted] [Q12012wanted])

The correlation scripts should be refactored to be more Socorro-ish; namely, they should use Socorro configuration files and other libraries.

See bug 629029 for main details and bug 649891, bug 623164 for some of the symptoms.
This script was checked into Socorro SVN in bug 629029, although we're still using the old version for 1.7.7 release.

Looks like it should be pretty easy to convert this shell script to Python and have it use Socorro configuration files, logging, etc. It has its own rules for deciding which versions to use, which we could carry forward or simplify: http://code.google.com/p/socorro/source/browse/trunk/scripts/crons/cron_libraries.sh#

Seems like this is intended to find the most active releases, perhaps we should just use the same table that the "branch data sources" uses.
Status: NEW → ASSIGNED
Hmm, isn't this a dupe of bug 642325?
Assignee: rhelmer → chris.lonnen
Target Milestone: 1.7.8 → 2.0
See Also: → 642325
So, here's the table structure we want:

table os_names ( 
    os_name text not null primary key )

this is a lookup list containing only the major OS names (OSX, Linux, Windows NT and Solaris).  It's there for convenience and does not change.

table module_crash_counts (
    module_crash_count_id integer not null
    crash_day date not null
    os_name text not null
    field??? text?
    module_signature text not null
    field??? text?
    crash_count integer not null
)

This table contains the counts of crashes for that module/OS for that day. These rows are the same for both reports and are shared; as a result, attempts to insert duplicate rows will be silently ignored (trigger).  The two field??? references are because we're still researching information about how the modules are currently identified in the report.

table module_crash_core_counts (
  module_crash_count_id integer not null FK module_crash_counts
  processor_type text not null
  cores integer not null
  crash_count integer not null
  key module_crash_count_id, processor_type, cores
)

This is the table for storing the core_counts reports.

table module_crash_interesting_addons (
  module_crash_count_id integer not null FK module_crash_counts
  addon_key text not null
  addon_source text
  addon_url text
  addon_version text
  crash_count integer not null
  key module_crash_count_id, addon_key, addon_version
)

This table populates both the interesting_addons and interesting_addons_with_versions reports.

table module_crash_interesting_modules (
  module_crash_count_id integer not null FK module_crash_counts
  module_signature text not null
  module_version text not null
  crash_count integer not null
  key module_crash_count_id, module_signature, module_version
);

This table populates both the interesting_modules and interesting_modules_with_versions reports.
So, questions to answer:

1) What are the two unidentified fields???

2) Are the column names I've given above sensible?  Since I don't really understand the data, I may have misidentified some of it.

3) How long do we keep this data?  It is imperative that we don't add new tables to the database which never expire data.
new (final) schema:

table os_names ( 
    os_name text not null primary key )

this is a lookup list containing only the major OS names (OSX, Linux, Windows NT and Solaris).  It's there for convenience and does not change.

table correlation_crash_counts (
    correlation_id integer not null
    crash_day date not null
    os_name text not null
    signature text not null
    reason text not null
    crash_count integer not null
    key crash_day, os_name, signature, reason
)

This table contains the counts of crashes for that signature/OS/reason for that day. These rows are the same for both reports and are shared; as a result, attempts to insert duplicate rows will be silently ignored (trigger).
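The duplicate-ignoring trigger itself isn't spelled out in this bug. As a rough illustration of the intended semantics only (the real implementation would be a PostgreSQL trigger, and the column values here are made up), the behavior is the same as an insert that silently no-ops on a duplicate key, sketched with the stdlib sqlite3 module:

```python
import sqlite3

# Illustrative sketch: mimics "attempts to insert duplicate rows will be
# silently ignored (trigger)". Table and column names follow the schema
# sketch above; the sample row values are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE correlation_crash_counts (
        crash_day TEXT NOT NULL,
        os_name TEXT NOT NULL,
        signature TEXT NOT NULL,
        reason TEXT NOT NULL,
        crash_count INTEGER NOT NULL,
        PRIMARY KEY (crash_day, os_name, signature, reason)
    )
""")

row = ("2012-08-01", "Windows NT", "js::GC", "EXCEPTION_ACCESS_VIOLATION", 42)
# First insert succeeds; the second is silently dropped instead of raising.
conn.execute("INSERT OR IGNORE INTO correlation_crash_counts VALUES (?,?,?,?,?)", row)
conn.execute("INSERT OR IGNORE INTO correlation_crash_counts VALUES (?,?,?,?,?)", row)

count = conn.execute("SELECT count(*) FROM correlation_crash_counts").fetchone()[0]
print(count)  # 1 -- the duplicate was ignored, not errored
```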

table correlation_core_counts (
  correlation_id integer not null FK correlation_crash_counts
  processor_type text not null
  cores integer not null
  crash_count integer not null
  key correlation_id, processor_type, cores
)

This is the table for storing the core_counts reports.

table correlation_addons (
  correlation_id integer not null FK correlation_crash_counts
  addon_key text not null
  addon_source text
  addon_url text
  addon_version text
  crash_count integer not null
  key correlation_id, addon_key, addon_version
)

This table populates both the interesting_addons and interesting_addons_with_versions reports.

table correlation_modules (
  correlation_id integer not null FK correlation_crash_counts
  module_signature text not null
  module_version text not null
  crash_count integer not null
  key correlation_id, module_signature, module_version
);

This table populates both the interesting_modules and interesting_modules_with_versions reports.
feh, those are NOT final.  Final ones still coming.
Target Milestone: 2.0 → 2.1
Blocks: 665605
Target Milestone: 2.1 → 2.2
Blocks: 623164
Blocks: 642325
Target Milestone: 2.2 → 2.3
Target Milestone: 2.3 → 2.4
Blocks: 628341
Whiteboard: Q42011wanted
Blocks: 683388
Blocks: 678373
Assignee: chris.lonnen → josh
Whiteboard: Q42011wanted → [Q42011wanted] [Q12012wanted]
Target Milestone: 2.4 → 2.4.1
Component: Socorro → General
Product: Webtools → Socorro
This needs coordination and UI work as well as database work.   Moving to 2.4.3.
Target Milestone: 2.4.1 → 2.4.3
http://dbaron.org/mozilla/topcrash-modules <-- example correlation reports, in case you can't find one in the webapp.
So, question for KaiRo:

Correlation by OS and Architecture are slotted to be in signature summary independently. Is there value in also having a correlation report for these?
(In reply to Chris Lonnen :lonnen from comment #11)
> Correlation by OS and Architecture are slotted to be in signature summary
> independently. Is there value in also having a correlation report for these?

Not really. But we don't have correlation reports for those right now either (architecture != CPU Cores). Let's bring what we have right now in the script to in-Socorro correlation reports, and then we still can look into other possibilities.
So, plans per meeting Friday:

(1) Correlations per addon:  counts will be kept in a matview and populated by stored procedure.  However, addon names/urls/sources are not available in the socorro system at all; they are available from the addon API service.  Undecided whether we should cache them somewhere or just look them up by JS from the API.  Thoughts?

(2) Correlations per module: full module data is currently not available from Postgres or from the processed dumps at all as far as we can tell; what we have are only plugin modules.  As a result, this data would need to be pulled from Hadoop somehow, probably a daily python script which calls a map/reduce job and then stashes the results in Postgres.

(3) Correlations per architecture/core: this data is available from Postgres, but it's not clear how useful it is.  See discussion with Kairo on this bug.

Table schema for the matviews has been checked in.
Kairo,

Your answer above is hard to interpret.  Do we need the Architecture/Core correlations, or don't we?
> (2) Correlations per module: full module data is currently not available
> from Postgres or from the processed dumps at all as far as we can tell; what
> we have are only plugin modules.  As a result, this data would need to be
> pulled from Hadoop somehow, probably a daily python script which calls a
> map/reduce job and then stashes the results in Postgres.

Actually, looking at http://dbaron.org/mozilla/topcrash-modules, it appears that "modules" on the correlations report only refers to plugin modules and not all modules.   Is this correct?
(In reply to [:jberkus] Josh Berkus from comment #14)
> Your answer above is hard to interpret.  Do we need the Architecture/Core
> correlations, or don't we?

We use the "CPU Cores" correlations quite rarely, but we have done it at times. What I was saying above is that "CPU Cores" is something different from "Architecture" - one is CPU information, the other is the basic architecture (say, x86 or amd64 or arm) that the _build_ is compiled for. So, the "CPU Cores" correlation tells us e.g. if people on multicore or single-core CPUs are more likely to hit the crash than the average, while the "Architecture" signature summary tells us how many of those people have a 32bit, 64bit, or ARM build. So it's two completely different things.


(In reply to [:jberkus] Josh Berkus from comment #15)
> Actually, looking at http://dbaron.org/mozilla/topcrash-modules, it appears
> that "modules" on the correlations report only refers to plugin modules and
> not all modules.   Is this correct?

No. When looking at the correlation reports, you'll see that all kinds of stuff is in there: system DLLs, modules hooking into our process, firefox.exe, every binary with executable code that the crashing process has loaded.
(In reply to [:jberkus] Josh Berkus from comment #13)
> However, addon names/urls/sources are not available in
> the socorro system at all; the are available from the addon API service. 
> Undecided whether we should cache them somewhere or just look them up by JS
> from the API.  Thoughts?

From our position, either is fine. If it takes an XHR and a bit of time until the actual name appears (i.e. if it's uncached and you fetch via JS), that's fine with us. There are a number of add-on IDs we need to manually look up somewhere anyhow because they're not in AMO at all.
Still, this is data that rarely ever changes (people _can_ rename an add-on on AMO, but it's really rare and probably never substantial), so caching shouldn't be too hard in the end.
mass migration of correlation-report related bugs that won't land today.
This should no longer be assigned to me, since the DB infrastructure for correlations is complete (AFAIK).  Assigning to lonnen.
Assignee: josh → chris.lonnen
Target Milestone: 2.4.3 → 2.4.4
Target Milestone: 2.4.4 → 2.5.1
Target Milestone: 2.5.1 → 2.5.2
Kicking these out to 2.6, which roughly matches the next week I'll be working on Socorro. If someone wants to poach these to get them done earlier, that's fine with me.
Target Milestone: 2.5.2 → 2.6
Target Milestone: 4 → 6
Assignee: chris.lonnen → peterbe
Target Milestone: 6 → 7
Target Milestone: 7 → 8
Target Milestone: 8 → 10
Target Milestone: 10 → 11
I've been looking into reimplementing the Socorro UI and correlation reports are one of the big pain points right now. It's slow and doesn't seem to work on the topcrashers page as far as I can tell. I'd like to just fix it now, and avoid having to reimplement the current approach.

The current method works something like this:

report (topcrashers, etc) 
  -> XHR call to Correlation_Controller
    -> fetch and parse a report such as [1]

There is some attempt to cache, but since we know the schedule on which these are uploaded to the server, there's no point in letting the first user suffer. I can't see any reason this should use XHR at all, instead of just being a backend job like bugzilla, ftpscraper, etc. that populates the tables Josh describes in comment 5.

So, I propose we have a backend service to fetch the reports, and also to fetch and store addon info (per comment 17 it rarely changes), and provide a middleware service to get at the data. Then the UI can simply call the service and generate HTML.

Does that make sense? It'd be nice to change the summary to something more actionable if so :) 
peterbe, I can help with this if you need, too - lmk.

[1] https://crash-analysis.mozilla.com/crash_analysis/20120516/20120516_Firefox_15.0a1-core-counts.txt
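The backend-job idea above could start as small as a daily fetcher. This is only a sketch: the function and constant names are invented, and the URL layout is inferred from the sample link above rather than from any documented API.

```python
from datetime import date
from urllib.request import urlopen

# Assumption: reports live under this base path, named by date/product/version,
# matching the sample link in this comment.
BASE = "https://crash-analysis.mozilla.com/crash_analysis"

def report_url(day, product, version, report="core-counts"):
    """Build the URL for one day's correlation report (URL shape is an
    assumption based on the sample link, not a documented contract)."""
    stamp = day.strftime("%Y%m%d")
    return "%s/%s/%s_%s_%s-%s.txt" % (BASE, stamp, stamp, product, version, report)

def fetch_report(day, product, version):
    """Fetch one day's report; meant for a daily cron job, not a per-user XHR."""
    with urlopen(report_url(day, product, version)) as resp:
        return resp.read().decode("utf-8")

print(report_url(date(2012, 5, 16), "Firefox", "15.0a1"))
# -> .../20120516/20120516_Firefox_15.0a1-core-counts.txt
```

A cron job would call `fetch_report` once per active product/version after the upload window and insert the parsed rows into the tables from comment 5.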
We should have proper DB tables or matviews that are being filled with the correlation data instead of pulling in the text files generated by the current scripts. If we have the info in proper DB tables and can get it via the middleware like any other stuff we get from the DB, that would make dealing with the UI pretty easy, I guess.
rhelmer: I think the goal is actually to generate these within Socorro rather than pulling them in from the text files

The database schema for storage is done (see above), so the remaining parts are:
- Stored proc + cron job to generate the data, daily
- Middleware
- UI (separate bug for this)
(In reply to Laura Thomson :laura from comment #23)
> rhelmer: I think the goal is actually to generate these within Socorro
> rather than pulling them in from the text files

Yeah understood (I thought that was bug 642325 specifically? there seems to be some overlap)

> The database schema for storage is done (see above), so the remaining parts
> are:
> - Stored proc + cron job to generate the data, daily

I suspect we could get it working now by doing the below parts first, and then not have to change them when we have generation moved directly into socorro

That said, I'd be happy to help out with the above too if everyone wants to do that bit first.

> - Middleware
> - UI (separate bug for this)
Talk to Lonnen about where we left this.

There are already Sprocs+Matviews for two of the three correlation reports.  We can't currently do the "modules" correlation report because only "interesting" modules are stored in the PostgreSQL database, and that report needs to search *all* modules.  Lonnen was working on getting an aggregate of the complete list of modules out of HBase, although we could try a different approach.
(In reply to [:jberkus] Josh Berkus from comment #25)
> Talk to Lonnen about where we left this.
> 
> There are already Sprocs+Matviews for two out of the 3 correlation reports. 
> We can't currently do the "modules" correlation report because only
> "interesting" modules are stored in the PostgreSQL database, and that report
> needs to search *all* modules.  Lonnen was working on getting an aggregate
> of the complete list of modules out of Hbase, although we could try a
> different approach.

Socorro actually runs a mapreduce job nightly that does this, I think:
https://github.com/mozilla/socorro/blob/ae282ea69ef8d663519a7628e530d27c1a02ba53/analysis/src/java/com/mozilla/socorro/hadoop/CrashReportModuleList.java

Output is here:

https://crash-analysis.mozilla.com/crash_analysis/modulelist/

If that is not exactly what we need I am sure we can adjust. If it does happen to be exactly what we need, it'd be easy to also insert into the DB as well as generate that file.
Well, the problem is that we need to have a complete list of modules per crash to generate the aggregates in PostgreSQL.  A canonical list of modules without a relationship to crashes doesn't do us much good.

When I get back from pgcon, I'll look at the exact aggregate we need; generating it daily via a scheduled M/R job would be a fine approach.
Blocks: 749359
Target Milestone: 11 → Future
Assignee: peterbe → rhelmer
Summary: Refactor correlation reports to be more Socorro-ish → move correlation reports to postgres
So what I'd need from the aggregation job is the following grouping:

Group By:
product
version
crash reason
operating system name
signature
module_name
module_signature

Operation:
count of crashes

Yes, this means that the same crash will be counted for multiple modules.

The domain of the search would be just the last complete UTC day (usually yesterday), and (if you can make this happen) currently "active" versions only.  You probably don't have active version information, though.

So a sample pseudo-SQL query would be:

SELECT product, version, reason, os_name, signature, 
      module_name, module_signature, count(*)
FROM reports JOIN modules
   JOIN product_versions
WHERE date_trunc('day', date_processed) = ( current_date - 1 ) 
   AND ( current_date -1 ) between release_date and sunset_date
GROUP BY product, version, reason, os_name, signature, 
      module_name, module_signature
Xavier, please see comment 29 - I can take a shot at doing this with Pig and get review from you, but might need some help getting set up.
Summary: move correlation reports to postgres → implement correlation reports in pig, output to postgres
(In reply to Robert Helmer [:rhelmer] from comment #30)
> Xavier, please see comment 29 - I can take a shot at doing this with Pig and
> get review from you, but might need some help getting set up.

I'm set up now and working on this.

(In reply to [:jberkus] Josh Berkus from comment #29)

Looks doable.

> The domain of the search would be just the last complete UTC day (usually
> yesterday), and (if you can make this happen) currently "active" versions
> only.  You probably don't have active version information, though.

Since this is triggered from Socorro, we could pass params to pig such as active versions, product name etc. to filter on, so this is doable.

I am going to take a first stab without this though since it's a little bigger in scope (need to make the cron job that calls this smarter etc)
Josh - please try this sample: http://people.mozilla.org/~rhelmer/temp/correlations-20120801.csv.gz

This is a little more than you asked for - it has everything the "modulelist" report has (which has the debug_filename and also module_version, which you didn't ask for) and a little more (including count). I assume that by "module_signature" you mean what we call "debug id" in the code.

Note that I am doing no sanity checking of the data, so there are missing fields - I can filter these out if you want. Also this doesn't filter by active versions as I mentioned, but that's something we could add pretty easily.

If the extra info isn't a problem for you, I think we could use this info to replace the current modulelist impl, since it's a superset of that. It'd make sense to just have one nightly run instead of two, so I am inclined to keep all this in if it isn't a problem.
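The sanity check mentioned above could be as simple as dropping rows with empty required fields before the Postgres load. A sketch (the column names here are illustrative; the real layout comes from the pig job's output):

```python
import csv
import io

def filter_complete_rows(lines, required):
    """Yield only CSV rows where every required column is non-empty,
    dropping the rows with missing fields mentioned above."""
    reader = csv.DictReader(lines)
    for row in reader:
        if all(row.get(col) for col in required):
            yield row

# Invented sample resembling the correlations CSV.
sample = io.StringIO(
    "signature,os_name,module_name,count\n"
    "js::GC,Linux,libxul.so,12\n"
    "js::GC,Linux,,3\n"  # missing module_name: filtered out
)
rows = list(filter_complete_rows(sample, ["signature", "os_name", "module_name"]))
print(len(rows))  # 1
```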
Summary: implement correlation reports in pig, output to postgres → pull data for correlations from pig, implement in postgres
xstevens - can you please review this pull request? https://github.com/mozilla-metrics/socorro-toolbox/pull/7

Obviously based heavily off of the existing modulelist.pig, per comment 32 I think we could potentially replace that one with this one.
Rob,

Data looks pretty good, except that we're missing crash_reason.
(In reply to [:jberkus] Josh Berkus from comment #34)
> Rob,
> 
> Data looks pretty good, except that we're missing crash_reason.

Oops, sorry about that. Here's a new version:
http://people.mozilla.org/~rhelmer/temp/correlations-20120801-v2.csv.gz

Updated the PR as well.
(In reply to Robert Helmer [:rhelmer] from comment #35)
> http://people.mozilla.org/~rhelmer/temp/correlations-20120801-v2.csv.gz

BTW, note that I put the header in there; the pig script doesn't do this (perhaps it should, now that I think about it; it's not exactly intuitive).
Yes, please.

The header will really help us if something goes wrong with the data.
(In reply to [:jberkus] Josh Berkus from comment #37)
> Yes, please.
> 
> The header will really help us if something goes wrong with the data.

The version of pig we're using (0.8) doesn't really have a great way to do this (there is a way in 0.9), but I can have the script that's driving the overall run put the header in for now.

That means there's some danger of the header and actual columns going out of sync; there's not an easy way to deal with this without writing our own storage UDF (which I think is a disproportionate amount of work versus the risk, considering that we'll be able to use something more standard in the near future).
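Since the header would be prepended by the driver script rather than by pig 0.8 itself, a cheap guard against the drift risk above is to check each row's column count against the header when writing. A sketch (the header list is illustrative, not the actual column order):

```python
# Illustrative column list; the real one must match the pig script's output.
HEADER = ["product", "version", "reason", "os_name", "signature",
          "module_name", "module_signature", "count"]

def prepend_header(pig_lines, sep="\t"):
    """Yield a header line, then the pig output lines, refusing rows whose
    column count disagrees with the header -- a cheap out-of-sync guard."""
    yield sep.join(HEADER)
    for line in pig_lines:
        if line.count(sep) != len(HEADER) - 1:
            raise ValueError("row has wrong column count: %r" % line)
        yield line

rows = ["Firefox\t15.0a1\tSIGSEGV\tLinux\tjs::GC\tlibxul.so\tABC123\t12"]
out = list(prepend_header(rows))
print(len(out))  # 2: header plus one data row
```

This doesn't prevent a reordered-but-same-width drift, but it catches the common case of an added or dropped column without needing a custom storage UDF.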
Here's a very simple script for generating the report and importing to postgres:
https://github.com/rhelmer/socorro/commit/418b75107d39c8e7d3690ec9cfdbb5b185aa1e94

I've landed the report itself here after review from xstevens:
https://github.com/mozilla-metrics/socorro-toolbox/

This depends on some changes to replace the old java analysis stuff with submodules and pig in bug 765001 (it's branched off of that branch in fact)
Depends on: 765001
Target Milestone: Future → 19
Blocks: 788048
Target Milestone: 19 → 20
Target Milestone: 20 → 21
Target Milestone: 21 → 23
Depends on: 792630
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/bc328ffe5f2a0eb79e2078089ae12e1fc67ef3f6
bug 650904 - start correlations job from pig, insert into postgres

https://github.com/mozilla/socorro/commit/1120b11ed0a842c5cac7c87a2f7377e7a8607f6f
Merge pull request #859 from rhelmer/bug650904-correlations-pig

bug 650904 - start correlations job from pig, insert into postgres
(In reply to [:jberkus] Josh Berkus from comment #37)
> Yes, please.
> 
> The header will really help us if something goes wrong with the data.

OK this has landed and I am testing it for inclusion into Socorro.

Into which table(s) does the module info from hbase+pig go?

I see a correlation_modules table, but it seems to be unused by any stored procedures. Is this intended as the matview table?
Flags: needinfo?(josh)
Yes, it is.

My original design was that you'd land the data into a raw data table (correlation_modules_raw, maybe) and then a sproc would clean it up and normalize it and put it into correlation_modules.  Talk to Selena about how she wants to handle it now, though.
Flags: needinfo?(josh)
(In reply to [:jberkus] Josh Berkus from comment #45)
> Yes, it is.
> 
> My original design was that you'd land the data into a raw data table
> (correlation_modules_raw, maybe) and then a sproc would clean it up and
> normalize it and put it into correlation_modules.  Talk to Selena about how
> she wants to handle it now, though.

OK thanks for the info, I had thought this was further along and just waiting on the extra modulelist data to proceed.

I am going to change tack for the django version - we should proceed with this but not blocking on it.
No longer blocks: 749359, 788048
No longer blocks: 683388
Target Milestone: 23 → Future
Selena is working on a better way in bug 875990. 

Selena, you may find some of the history in this bug interesting if you're stuck etc.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX