Closed
Bug 674415
Opened 13 years ago
Closed 12 years ago
Replace reports with a real fact table
Categories
(Socorro :: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
18
People
(Reporter: jberkus, Assigned: jberkus)
Details
(Whiteboard: [qa-])
This bug makes the assumption that we will never be able to entirely replace the reports table in PostgreSQL with ElasticSearch. It is more-or-less a contingency plan. Currently the reports table (as well as several other tables) in the PostgreSQL database confuses the concepts of a fact table with a raw data table. reports is a raw data table, full of raw strings, garbage, duplicates and bad rows, as well as many fields which are populated for less than 2% of the table. A fact table would contain cleaned data and many, if not most, string values would be reduced to references to canonical dimension tables. Currently, the reports table has this schema: id | integer | not null default nextval('reports_id_seq'::regclass) client_crash_date | timestamp with time zone | date_processed | timestamp without time zone | uuid | character varying(50) | not null product | character varying(30) | version | character varying(16) | build | character varying(30) | signature | character varying(255) | url | character varying(255) | install_age | integer | last_crash | integer | uptime | integer | cpu_name | character varying(100) | cpu_info | character varying(100) | reason | character varying(255) | address | character varying(20) | os_name | character varying(100) | os_version | character varying(100) | email | character varying(100) | build_date | timestamp without time zone | user_id | character varying(50) | started_datetime | timestamp without time zone | completed_datetime | timestamp without time zone | success | boolean | truncated | boolean | processor_notes | text | user_comments | character varying(1024) | app_notes | character varying(1024) | distributor | character varying(20) | distributor_version | character varying(20) | topmost_filenames | text | addons_checked | boolean | flash_version | text | hangid | text | process_type | text | ... the reason nearly every column is VARCHAR or TEXT is so that it can accept garbage values. In a real DW configuration, only a "buffer" table would have a schema like that, and it would only hold an hour of data or so. Periodically, a set of ETL procedures (in python or in stored procedures, it doesn't much matter) would clean the data and put it into a normalized reports schema which would look more like: client_crash_date | timestamp with time zone date_processed | timestamp with time zone uuid | uuid not null PK product | INT FK products version | INT FK product_versions build | INT FK builds signature | INT FK signatures url | url_id url_raw | text install_time | timestamp with time zone last_crash | timestamp with time zone uptime | interval cpu | INT FK cpus reason | character varying(255) | address | character varying(20) | os_version | INT FK os_versions build_date | timestamp with time zone | processor_notes | text | app_notes | text | hang_id | uuid (a bunch of fields I'm not sure we need anymore omitted. will need to be discussed.) Some things which are seldom inserted and have special data purposes should be farmed off to related tables, for example: reports_userinfo uuid uuid FK reports.uuid date_processed timestamp with time zone email email user_comments text The above is just an example and would need some discussion and analysis. But it would make the reports table MUCH smaller (like 75%) and the data more consistent, and thus faster and more useful. Unfortunately, I can't see any good way to move towards this model incrementally in any reasonable period of time. We would need to overhaul a lot of the application to use the new data structure.
Comment 1•13 years ago
|
||
I certainly like this idea. I would further like to collapse the product/version/build/channel into a single column that refers to a dimension table - the same one that we'll be using for the new materialized views.
Assignee | ||
Comment 2•13 years ago
|
||
Lars, Yeah, I was thinking of that. That partly depends on performance tests; we do a LOT of rollups by OS, version, and build, so it might actually be better to have all three references in the table even thought they're redundant. And actually, it occurs to me that there's no reason we can't build the new_reports table alongside the old and gradually move functionality over to it. I'd need to spend a big chunk of time with you inderstanding the various columns ... and the sub-tables of reports ... but then we could have a phased approach.
Assignee | ||
Comment 3•13 years ago
|
||
Per IRC discussion, I'm going to suggest bumping up the schedule for this feature. In retrospect, doing 2.2 would have been both easier and faster for all staff if I'd done "reports-normalized" in the first place instead of trying to woork around not doing it.
Assignee | ||
Comment 4•13 years ago
|
||
Lars and I chatted about this. There are some issues which make doing this more complex: Advanced Search currently offers the ability to get reports up to the current minute. This makes updating reports_normalized by batch impractical, unless we introduce a time-delay which users are liable to be unhappy with. This means that the normalization needs to happen in the processors, or via a batch job which runs every minute. If we do it in the processors, then we need to write middleware or SP code which will deal with conflicting concurrent updates to dimension tables without blocking the processors. Such concurrency-sensitive code will be more complex than the originally anticipated batch jobs. However, the list of products will always be time-lagged, because it depends on FTP scraping or other asynchronous population. This means that we would need to add some reports as being "Unknown Build" to the normalized table, and then update them later (by batch job) to be the correct release versions. The reasons to do this are: A) Fix current broken search issues in the Socorro UI for rapid release betas, and anticipated issues with nightlies and aurora. B) Better eventual integration with ES-based search if we drop PG search, by allowing us to keep only metadata for reports. C) Eventually shrinking of the database (by as much as 50%) once we can get rid of the raw reports. D) Better performance on search, exports, and batch jobs.
Assignee | ||
Comment 5•13 years ago
|
||
This is in progress with the new reports_clean in 2.3.2. Unresolved is deciding how to handle making reports_clean up-to-the-minute for the devs. In 2.3.2, reports_clean get updated hourly and runs a few hours behind. This means that it's not current enough.
Updated•13 years ago
|
Component: Socorro → General
Product: Webtools → Socorro
Assignee | ||
Comment 6•12 years ago
|
||
This is on track but slower than expected.
Target Milestone: 2.5 → 2.7
Assignee | ||
Comment 8•12 years ago
|
||
Also needs to wait on ripping out OldTCBS.
Assignee | ||
Updated•12 years ago
|
Target Milestone: 8 → 11
Assignee | ||
Updated•12 years ago
|
Target Milestone: 11 → Future
Assignee | ||
Updated•12 years ago
|
Target Milestone: Future → 18
Assignee | ||
Comment 9•12 years ago
|
||
Mobeta is merged now,marking this done. [qa-]
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whiteboard: [qa-]
You need to log in
before you can comment on or make changes to this bug.
Description
•