Closed Bug 674415 Opened 13 years ago Closed 12 years ago

Replace reports with a real fact table

Categories

(Socorro :: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jberkus, Assigned: jberkus)

Details

(Whiteboard: [qa-])

This bug makes the assumption that we will never be able to entirely replace the reports table in PostgreSQL with ElasticSearch.  It is more-or-less a contingency plan.

Currently the reports table (as well as several other tables) in the PostgreSQL database conflates the concept of a fact table with that of a raw data table.  reports is a raw data table: full of raw strings, garbage, duplicates, and bad rows, as well as many fields which are populated for less than 2% of the table.  A fact table would contain cleaned data, and many, if not most, string values would be reduced to references to canonical dimension tables.

Currently, the reports table has this schema:

 id                  | integer                     | not null default nextval('reports_id_seq'::regclass)
 client_crash_date   | timestamp with time zone    | 
 date_processed      | timestamp without time zone | 
 uuid                | character varying(50)       | not null
 product             | character varying(30)       | 
 version             | character varying(16)       | 
 build               | character varying(30)       | 
 signature           | character varying(255)      | 
 url                 | character varying(255)      | 
 install_age         | integer                     | 
 last_crash          | integer                     | 
 uptime              | integer                     | 
 cpu_name            | character varying(100)      | 
 cpu_info            | character varying(100)      | 
 reason              | character varying(255)      | 
 address             | character varying(20)       | 
 os_name             | character varying(100)      | 
 os_version          | character varying(100)      | 
 email               | character varying(100)      | 
 build_date          | timestamp without time zone | 
 user_id             | character varying(50)       | 
 started_datetime    | timestamp without time zone | 
 completed_datetime  | timestamp without time zone | 
 success             | boolean                     | 
 truncated           | boolean                     | 
 processor_notes     | text                        | 
 user_comments       | character varying(1024)     | 
 app_notes           | character varying(1024)     | 
 distributor         | character varying(20)       | 
 distributor_version | character varying(20)       | 
 topmost_filenames   | text                        | 
 addons_checked      | boolean                     | 
 flash_version       | text                        | 
 hangid              | text                        | 
 process_type        | text                        | 

... the reason nearly every column is VARCHAR or TEXT is so that it can accept garbage values.  

In a real DW configuration, only a "buffer" table would have a schema like that, and it would only hold an hour of data or so.  Periodically, a set of ETL procedures (in python or in stored procedures, it doesn't much matter) would clean the data and put it into a normalized reports schema which would look more like:

 client_crash_date   | timestamp with time zone
 date_processed      | timestamp with time zone
 uuid                | uuid not null PK
 product             | INT FK products
 version             | INT FK product_versions
 build               | INT FK builds
 signature           | INT FK signatures
 url                 | url_id
 url_raw             | text
 install_time        | timestamp with time zone
 last_crash          | timestamp with time zone
 uptime              | interval
 cpu                 | INT FK cpus
 reason              | character varying(255)
 address             | character varying(20)
 os_version          | INT FK os_versions
 build_date          | timestamp with time zone
 processor_notes     | text
 app_notes           | text
 hang_id             | uuid
 
 (A number of fields I'm not sure we still need have been omitted; those will need to be discussed.)
 
Some things which are seldom inserted and have special data purposes should be farmed off to related tables, for example:

reports_userinfo
    uuid            uuid FK reports.uuid
    date_processed  timestamp with time zone
    email           email
    user_comments   text
	
The above is just an example and would need some discussion and analysis.  But it would make the reports table MUCH smaller (on the order of 75%) and the data more consistent, and thus faster to query and more useful.
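As a rough illustration of the ETL step described above (hypothetical table and column names; sketched with SQLite for brevity, though Socorro itself runs on PostgreSQL), a batch job might look up or create dimension rows and write only integer keys into the normalized table:

```python
import sqlite3

# In-memory stand-in for the buffer table and the normalized schema.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE reports_buffer (uuid TEXT, signature TEXT, os_name TEXT);
    CREATE TABLE signatures (id INTEGER PRIMARY KEY, signature TEXT UNIQUE);
    CREATE TABLE reports_normalized (uuid TEXT PRIMARY KEY, signature_id INT);
""")

def dimension_id(db, table, column, value):
    """Get-or-create: return the surrogate key for a raw string value."""
    db.execute(f"INSERT OR IGNORE INTO {table} ({column}) VALUES (?)", (value,))
    row = db.execute(f"SELECT id FROM {table} WHERE {column} = ?", (value,)).fetchone()
    return row[0]

def etl_batch(db):
    """Move one batch from the buffer into the normalized fact table."""
    rows = db.execute("SELECT uuid, signature FROM reports_buffer").fetchall()
    for uuid, signature in rows:
        sig_id = dimension_id(db, "signatures", "signature", signature)
        db.execute("INSERT OR REPLACE INTO reports_normalized VALUES (?, ?)",
                   (uuid, sig_id))
    db.execute("DELETE FROM reports_buffer")
    db.commit()

db.executemany(
    "INSERT INTO reports_buffer VALUES (?, ?, ?)",
    [("a1", "js::GC", "Windows"),
     ("a2", "js::GC", "Linux"),
     ("a3", "nsFoo::Bar", "Linux")],
)
etl_batch(db)
# Two distinct signature strings collapse into two dimension rows;
# the fact table stores only integer references.
```

Repeated strings (signatures, OS names, product names) are stored once in the dimension table, which is where the bulk of the projected size reduction comes from.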

Unfortunately, I can't see any good way to move towards this model incrementally in any reasonable period of time.  We would need to overhaul a lot of the application to use the new data structure.
I certainly like this idea.  I would further like to collapse the product/version/build/channel into a single column that refers to a dimension table - the same one that we'll be using for the new materialized views.
Lars,

Yeah, I was thinking of that.  That partly depends on performance tests; we do a LOT of rollups by OS, version, and build, so it might actually be better to have all three references in the table even though they're redundant.

And actually, it occurs to me that there's no reason we can't build the new_reports table alongside the old and gradually move functionality over to it.  I'd need to spend a big chunk of time with you understanding the various columns ... and the sub-tables of reports ... but then we could have a phased approach.
Per IRC discussion, I'm going to suggest bumping up the schedule for this feature.  In retrospect, doing 2.2 would have been both easier and faster for all staff if I'd done "reports-normalized" in the first place instead of trying to work around not doing it.
Lars and I chatted about this.  There are some issues which make doing this more complex:

Advanced Search currently offers the ability to get reports up to the current minute.  This makes updating reports_normalized by batch impractical, unless we introduce a time-delay which users are liable to be unhappy with.  

This means that the normalization needs to happen in the processors, or via a batch job which runs every minute.  If we do it in the processors, then we need to write middleware or SP code which will deal with conflicting concurrent updates to dimension tables without blocking the processors.  Such concurrency-sensitive code will be more complex than the originally anticipated batch jobs.
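One common pattern for the concurrency-sensitive lookup described above (a sketch only, with hypothetical names; real processor code would target PostgreSQL rather than the SQLite used here) is to attempt the insert, tolerate the unique-constraint error raised when a concurrent processor wins the race, and then re-select:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE os_versions (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

def get_or_create_os_version(db, name):
    """Return the dimension key for `name`, creating the row if needed.

    If another processor inserts the same value between our INSERT and
    SELECT, the unique constraint rejects our INSERT, and the SELECT
    still finds the winner's row -- so no processor ever blocks waiting
    for a lock on the dimension table.
    """
    try:
        db.execute("INSERT INTO os_versions (name) VALUES (?)", (name,))
    except sqlite3.IntegrityError:
        pass  # a concurrent processor won the race; fall through and read its row
    row = db.execute("SELECT id FROM os_versions WHERE name = ?", (name,)).fetchone()
    return row[0]

first = get_or_create_os_version(db, "Windows NT 6.1")
second = get_or_create_os_version(db, "Windows NT 6.1")  # no duplicate row
```

The same insert-then-select shape works as a PL/pgSQL stored procedure catching unique_violation, which is presumably what the "SP code" option above would look like.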

However, the list of products will always be time-lagged, because it depends on FTP scraping or other asynchronous population.  This means that we would need to add some reports as being "Unknown Build" to the normalized table, and then update them later (by batch job) to be the correct release versions.
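The "Unknown Build" reconciliation described above could be a simple batch UPDATE once the scraped build list catches up.  A minimal sketch, with hypothetical table and column names and SQLite standing in for PostgreSQL:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE builds (id INTEGER PRIMARY KEY, build_code TEXT UNIQUE);
    CREATE TABLE reports_normalized (uuid TEXT PRIMARY KEY, build_code TEXT,
                                     build_id INT);
""")

UNKNOWN_BUILD = 0  # sentinel dimension key for not-yet-known builds

# A processor files a report before the FTP scraper has seen this build.
db.execute("INSERT INTO reports_normalized VALUES ('a1', '20110815', ?)",
           (UNKNOWN_BUILD,))

# Later, the scraper populates the dimension table...
db.execute("INSERT INTO builds (build_code) VALUES ('20110815')")

# ...and a batch job resolves any outstanding placeholder rows.
db.execute("""
    UPDATE reports_normalized
       SET build_id = (SELECT id FROM builds
                        WHERE builds.build_code = reports_normalized.build_code)
     WHERE build_id = ?
       AND EXISTS (SELECT 1 FROM builds
                    WHERE builds.build_code = reports_normalized.build_code)
""", (UNKNOWN_BUILD,))
db.commit()
```

Rows whose build still isn't in the dimension table are left at the sentinel value and picked up by the next run of the batch job.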

The reasons to do this are:

A) Fix current broken search issues in the Socorro UI for rapid release betas, and anticipated issues with nightlies and aurora.

B) Better eventual integration with ES-based search if we drop PG search, by allowing us to keep only metadata for reports.

C) Eventually shrinking of the database (by as much as 50%) once we can get rid of the raw reports.

D) Better performance on search, exports, and batch jobs.
This is in progress with the new reports_clean in 2.3.2.

Unresolved is deciding how to handle making reports_clean up-to-the-minute for the devs.  In 2.3.2, reports_clean gets updated hourly and runs a few hours behind.  This means that it's not current enough.
Component: Socorro → General
Product: Webtools → Socorro
This is on track but slower than expected.
Target Milestone: 2.5 → 2.7
Currently waiting on the new search code.
Target Milestone: 5 → 8
Also needs to wait on ripping out OldTCBS.
Target Milestone: 8 → 11
Target Milestone: 11 → Future
Target Milestone: Future → 18
Mobeta is merged now; marking this done.  [qa-]
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whiteboard: [qa-]