Closed Bug 661266 Opened 13 years ago Closed 10 years ago

Socorro - care and maintenance of 'osdims' and 'productdims' tables

Categories

(Socorro :: General, task)

Hardware: x86_64
OS: Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lars, Unassigned)

Currently, the 'osdims' and 'productdims' tables are populated by clients of the socorro.database.cachedIdAccess module's IdCache class.  That means there is no central process responsible for populating these tables.  The crons TCBS, TCBU and DailyUrl all use the IdCache class, so any of them may be inserting into the 'osdims' and 'productdims' tables.

I suggest that we consolidate this into one location so that we have a saner one-writer, many-reader model.  That one location ought to be the processor.  It examines every processed crash, so it has the opportunity to keep the 'osdims' and 'productdims' tables up to date with the stream of incoming crashes.
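
To make the one-writer idea concrete, here is a minimal sketch of a single component that owns all inserts into an osdims-like table and hands out ids from a local cache. It is not the actual IdCache code: the class name OsDimsWriter, the column names os_name and os_version, and the use of an in-memory SQLite database are all assumptions made so the example is self-contained.

```python
import sqlite3


class OsDimsWriter:
    """Hypothetical single writer for an osdims-like table (not Socorro's IdCache)."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}  # (os_name, os_version) -> id

    def get_id(self, os_name, os_version):
        key = (os_name, os_version)
        if key in self.cache:
            return self.cache[key]
        row = self.conn.execute(
            "SELECT id FROM osdims WHERE os_name = ? AND os_version = ?", key
        ).fetchone()
        if row is None:
            # Only this class ever inserts; readers just look ids up.
            cur = self.conn.execute(
                "INSERT INTO osdims (os_name, os_version) VALUES (?, ?)", key
            )
            self.conn.commit()
            row = (cur.lastrowid,)
        self.cache[key] = row[0]
        return row[0]


# Assumed schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE osdims ("
    "id INTEGER PRIMARY KEY, os_name TEXT, os_version TEXT, "
    "UNIQUE (os_name, os_version))"
)
writer = OsDimsWriter(conn)
first = writer.get_id("Windows NT", "6.1")
second = writer.get_id("Windows NT", "6.1")  # served from the cache
```

With a single writer, the crons would only ever read from the table, so there is no race between several processes inserting the same row.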

I also suggest that the reports table be given extra fields for 'productdims_id' and 'osdims_id'.  This would simplify the new TCBS and TCBU crons as they would no longer be responsible for normalizing os versions.

I'm considering having the processor also populate the 'urldims' table because parallelism is attractive.  However, that task depends on the fate of TCBU.
Blocks: 660087
In the same theme as this bug, the 'osdims' table should be reduced to only a few entries.  If the incoming OS doesn't match anything ever seen before, it shouldn't just get automatically included in the table.  There should be an 'other' or 'unknown' entry in the osdims table for these.  What about new versions of a known OS?  How do we distinguish those from spurious garbage?
Can we have known patterns that are considered valid, like "Windows NT x.x", "Mac OS X x.x", and insert new entries for new ones of those, with the mapping to human-readable versions (i.e. "Windows 7") added later?
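
A rough sketch of what that pattern check could look like, assuming the OS string arrives as a single text field. The function name normalize_os, the pattern list, and the 'unknown' label are illustrative choices, not anything that exists in Socorro; the two patterns are just the ones mentioned above.

```python
import re

# Patterns considered valid; anything else is bucketed as 'unknown'.
KNOWN_OS_PATTERNS = [
    re.compile(r"^Windows NT \d+\.\d+"),
    re.compile(r"^Mac OS X \d+\.\d+"),
]

# Human-readable names can be filled in later, after a new version shows up.
HUMAN_NAMES = {
    "Windows NT 6.1": "Windows 7",
}


def normalize_os(raw):
    """Return the matched 'name major.minor' prefix, or 'unknown'."""
    text = (raw or "").strip()
    for pattern in KNOWN_OS_PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group(0)
    return "unknown"


def display_name(raw):
    """Map a raw OS string to a friendly name, falling back to the normalized form."""
    key = normalize_os(raw)
    return HUMAN_NAMES.get(key, key)
```

A new version of a known OS would match a pattern and get its own entry right away, while garbage falls through to 'unknown' instead of growing the table.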
After some discussion on IRC:

(1) Productdims will become completely different under the new releasechannel model, so it's taken off this bug.

(2) I disagree that the processors ought to handle this, for several reasons:
    (a) processors do row-at-a-time instead of batches
    (b) will add locking overhead to processors
    (c) will add extra processing time to processors

(3) Adding IDs to the reports table ought to be part of a more general move to separate "raw" data from a fully normalized & cleaned fact table.  In other words, we shouldn't be adding columns, we should be creating a new table.

(4) I've added a new bug for a new osdims schema: 674065
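
For comparison with the row-at-a-time approach, points (2) and (3) could be sketched as a batch job that fills the dimension table and a separate cleaned fact table with set-based SQL. Everything here is an assumption for illustration: the reports_clean table, the column names, and the normalize_batch function are hypothetical, and SQLite (with its INSERT OR IGNORE syntax) stands in for the real database only to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reports (uuid TEXT PRIMARY KEY, os_name TEXT, os_version TEXT);
CREATE TABLE osdims (
    id INTEGER PRIMARY KEY, os_name TEXT, os_version TEXT,
    UNIQUE (os_name, os_version)
);
-- Hypothetical normalized fact table, kept apart from the raw reports.
CREATE TABLE reports_clean (
    uuid TEXT PRIMARY KEY,
    osdims_id INTEGER REFERENCES osdims (id)
);
""")
conn.executemany(
    "INSERT INTO reports VALUES (?, ?, ?)",
    [
        ("a", "Windows NT", "6.1"),
        ("b", "Windows NT", "6.1"),
        ("c", "Mac OS X", "10.6"),
    ],
)


def normalize_batch(conn):
    """Populate osdims and reports_clean in one transaction, one statement each."""
    with conn:
        conn.execute("""
            INSERT OR IGNORE INTO osdims (os_name, os_version)
            SELECT DISTINCT os_name, os_version FROM reports
        """)
        conn.execute("""
            INSERT OR IGNORE INTO reports_clean (uuid, osdims_id)
            SELECT r.uuid, o.id
            FROM reports r
            JOIN osdims o
              ON o.os_name = r.os_name AND o.os_version = r.os_version
        """)


normalize_batch(conn)
```

The raw reports table is left untouched, and the job can be rerun safely because already-present rows are skipped.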
Depends on: 674065
Component: Socorro → General
Product: Webtools → Socorro
Resolved long ago.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED