Closed Bug 629054 Opened 9 years ago Closed 9 years ago
Generate date of first occurrence for a signature
We'd like to know the first date a signature occurred, see https://wiki.mozilla.org/Socorro:PRD_2.x#New_.2F_explosive_.2F_critical_crash_tracking This feels like a cron job to me. Thoughts?
We can actually get this data out of the existing HBase schema. When processed items are stored, we store a record in a secondary index table named crash_reports_index_signature_ooid. The key is a signature followed by the ooid. If you do a scan on this table for a specific signature you will get all of the ooids that have that signature. If you look at the last 6 characters of those ooids, you'll know the date they were submitted.
in hindsight, it would have been better to prefix the ooid with the date so you could do a limit 1 and get the earliest date, but I put this feature in not knowing exactly how it might be used at the time.
We can also query this out of the existing PostgreSQL data, if "first throttled occurance" is OK. I'd like to create a matview to store it, but the underlying data is there.
X suggests we do it in PG, it will be easier.
Josh, can you advise re what you'd do for a matview?
Assignee: nobody → laura
We already have the signature_productdims table, which contains a list of all of the crash signatures found in each product/version. This table is populated by the scripts/startSignatureProductdims.py cron script. So, we could probably update this script to grab the first occurrence of a crash signature for a product/version and add it to this table.
How often does that script run? We'll want a separate query and matview for "first occurances"; I'm currently going over with chofmann what he means by "first" (which is actually non-obvious). Note that it would be relatively easy (as in low-CPU-overhead) to keep this table populated on an ongoing basis if we use a small window based on the frequency of polling. I'll talk to Ryan about what the queries and data structure would look like.
we want to keep track of both first time we have ever seen a signature, and first time we have seen for any given product-release combo. The goals here is to help us find regression ranges for crashes, and to understand the differences in number signatures and of volume of crashes between releases. this is the prototype that we have been using for a solution to this problem http://people.mozilla.org/~jst/new-crashes/Firefox/ a report like http://people.mozilla.org/~jst/new-crashes/Firefox/20110125/3.5.5-3.5.html gives us a list of which crashes are new in a release.
Chris, The reports would seem to indicate that by "first" you mean "first version" not "first timestamp". That is, you don't really care about timestamps at all. Is this correct?
Added Chofmann so that he'll answer my questions.
If you click though the links on the reports on comment 8 you can eventually get to a timestamp for the first crash so that information is indirectly accessible. There are actually 3 pieces of info that help us to narrow down regression windows depending on the type of crash we are looking at. * time stamp of the first instance of a signature * first version, and/or full set of versions it appeared in, and weather it appeared roughly together in several releases, or if it appear initially in just one release. * first buildid (which inbeds the time of build) is also useful if we could get it. That ads a bit more precsion to the version information. having all those helps to drill down on the regression window for any given signature and figure out what kind of crash we are looking at and where the source of the crash might be (either product code, or external website or plugin). here is an example of one instance where determining the time of first instance was critical in correlating the bug to an external website/plugin change. https://bugzilla.mozilla.org/show_bug.cgi?id=519039#c15 and the associated attachement graph that narrowed down the prettya close to the instance where and external website made changes and the crash volume increased. https://bug519039.bugzilla.mozilla.org/attachment.cgi?id=403138 in other cases finding the buildid(build time), product, and first crash time, helps us to understand and correlate how recent source changes in firefox might be correlated to the crash, and how visible the crash might be. when source change time, build time, and first trunk first crash time are all close together we have a highly visible regression seen with only a few testers. when those are spread out more we might have a crash that first appears when we get a million, ten million, or 100 million users on a product. thats probably more that you wanted to know, but gives some ideas about the types of things we want to do with those pieces of data once we make then more accessible in the database and then in reports.
So I hashed this out with Chris. What's required in terms of matviews is: * signature_productdims needs to have a date of first appearance for a given signature for that product_id. * new matview signature_build should track which signatures have shown up in which build, and when they first appeared in that build The rest of what Chris needs is all UI work. Unfortunately, accomplishing both of the above requires a bit of data cleanup, hence the two new blocker bugs. Please approve work for those.
Feh. As usual, one more thing: Also, we need to update signature_productdims and the new signature_build on the same schedule as Top Crashes By Signature: once per hour, trailing 3 hours behind the clock.
(In reply to comment #12) > So I hashed this out with Chris. What's required in terms of matviews is: > > * signature_productdims needs to have a date of first appearance for a given > signature for that product_id. > > * new matview signature_build should track which signatures have shown up in > which build, and when they first appeared in that build > > The rest of what Chris needs is all UI work. Some of this is in bug 629067, I am not sure what else is needed but I'll discuss with chofmann. > > Unfortunately, accomplishing both of the above requires a bit of data cleanup, > hence the two new blocker bugs. Please approve work for those. Approved.
Assinging this to myself since it's waiting on me.
Assignee: laura → josh
some notes from an IRC conversation. If we need a sort order for the version/build_id pairs where crashes first appeared then we can use the build_id which corresponds to a timestamp of when the build was run. for instance its more correct to show a list the following theoretical list of of crash that might have been introduced into "trunk/4.0" development, then back ported and showed up in build of 3.6.5pre 20100624, then that change may have been later back ported to the 3.5.10 branch and showed up first on that branch on a build from 20100712. The sort order for that scenario is best represented as: 4.0b1 20100610, 3.6.5pre 20100624, 3.5.10 20100712, thus, we can use build id as the best sort key to show the linage for any particular crash. Its less interesting and more confusing to try and create a sort order based on the version, which we tend to make up random names for as you can see in https://wiki.mozilla.org/Releases
Ok, so I have most of the code for this. However, the natural place to put summary statistics for display is TCBS. But ... apparently nobody currently on the project readily understands the code which generates TCBS. So, this is a blocker. Need to discuss Tuesday.
So the code for this is on devdb. Where we're starting for display is the TCBS view, per Ryan. This requires just a small change to the TCBS SQL: explain analyze select tcbs.signature, sum(tcbs.count) as count, cast(sum(tcbs.count) as float) / 1000 as percentOfTotal, sum(case when os.os_name LIKE 'Windows%' then tcbs.count else 0 end) as win_count, sum(case when os.os_name = 'Mac OS X' then tcbs.count else 0 end) as mac_count, sum(case when os.os_name = 'Linux' then tcbs.count else 0 end) as linux_count, sum(tcbs.hang_count) as hang_count, sum(tcbs.plugin_count) as plugin_count, min(first_report) as first_report, min(first_build) as first_build from top_crashes_by_signature tcbs join osdims os on tcbs.osdims_id = os.id left outer join signature_first on tcbs.signature = signature_first.signature and tcbs.productdims_id = signature_first.productdims_id and tcbs.osdims_id = signature_first.osdims_id where '2011-01-25' < tcbs.window_end and tcbs.window_end <= '2011-02-01' and tcbs.productdims_id = 297 group by tcbs.signature order by 2 desc limit 100; Note that I'm also correcting an error in the original SQL. This will require some other corrections in the code to make it work, mostly in the generation of optional WHERE clauses.
HOWEVER, I've discovered a potential issue with the new version. The join, above, makes the TCBS view query take considerably more time. As in 4X longer. I'll test some tricks to make it faster ...
I posted this in IRC, but just so it is documented I will post it here as well. We are running a supplementary query for TCBS on the PHP side to fetch all of the versions that a crash signature is found within. It's not ideal to add more logic on the PHP side, but it would be significantly easier than updating TCBS, and would suffice until we rewrite TCBS. You can find the SQL in the fetchTopcrasherVersions() method in https://code.google.com/p/socorro/source/browse/trunk/webapp-php/application/models/topcrashers.php
Attached is a file of instructions on how to modify schema.py to accomodate the new first appearance matviews (signature_build, signature_first and the new column in signature_productdims). Over to you, Rob.
Currently we're only displaying the first appearance in one field on one screen, but I'm going to close this bug once Rob adds the schema.py changes. We should open more specific bugs when we have a spec on where to display other information. The underlying data is now available.
Assignee: josh → rhelmer
Target Milestone: 1.7.8 → 1.7.7
Comment on attachment 514935 [details] [diff] [review] add schema changes to schema.py the need for refactoring of our SQL code is so acute that we're in danger of cutting ourselves... but this looks acceptable to me for now.
Attachment #514935 - Flags: review?(lars) → review+
Comment on attachment 514935 [details] [diff] [review] add schema changes to schema.py Committed revision 2990.
Also updated hardCodedSchemaClasses in socorro/unittest/database/testSchema.py
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
In testing bug 629067, I noticed that note all signatures had a value for the First Appearance column -- is that expected?
Stephen, Well, the lack of a first appearance date for a Null Signature is expected, since those can't be matched against any other signature. However, I cannot explain why some of the other signatures don't have first appearance dates, unless the cron jobs aren't running on stage to update them. Can someone check?
Looks like something's wrong; reopening. Chofmann was also mentioning that the tooltips aren't always right, either, for the "ver" column when sorted by First Appearance.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
We weren't getting email about this cron job failing (now fixed, bug 651201), but it hasn't been working on staging since we enabled it. The problem appears to be the "update_signature_matviews" function, here's a traceback for the exception. Josh already has this info and has been working on it. Traceback (most recent call last): File "/data/socorro/application/scripts/startSignatures.py", line 30, in <module> signatures.update_signatures(config) File "/data/socorro/application/socorro/cron/signatures.py", line 27, in update_signatures databaseCursor.execute(sql % (datetime.now())) psycopg2.IntegrityError: duplicate key value violates unique constraint "signature_first_key" DETAIL: Key (signature, productdims_id, osdims_id)=(mozalloc_abort(char const* const) | NS_DebugBreak_P | nsCycleCollectorGCHookRunnable::Run(), 378, 25) already exists. CONTEXT: SQL statement "insert into signature_first (signature, productdims_id, osdims_id, first_report, first_build ) select sbup.signature, sbup.productdims_id, osdims.id, min(sbup.first_report), min(sbup.build) from signature_build_updates sbup join top_crashes_by_signature tcbs on sbup.signature = tcbs.signature and sbup.productdims_id = tcbs.productdims_id join osdims ON tcbs.osdims_id = osdims.id left outer join signature_first sfirst on sbup.signature = sfirst.signature and sbup.productdims_id = sfirst.productdims_id and tcbs.osdims_id = sfirst.osdims_id where sbup.os_name = osdims.os_name and tcbs.window_end BETWEEN ( currenttime - ( interval '1 hour' * hours_back ) - (interval '1 hour' * hours_window ) ) AND ( currenttime - ( interval '1 hour' * hours_back ) ) group by sbup.signature, sbup.productdims_id, osdims.id" PL/pgSQL function "update_signature_matviews" line 63 at SQL statement
Assignee: rhelmer → josh
Fixed the SP called by the cron job. Rob fixed the fact that the cron job wasn't reporting failure. We're currently fixing the cron job so that it understands backfilling.
Assignee: josh → rhelmer
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro
You need to log in before you can comment on or make changes to this bug.