Closed Bug 544940 Opened 11 years ago Closed 11 years ago

Hang detector: associate two crashes in Socorro

Categories

(Socorro :: General, task, P1)

x86
Linux
task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benjamin, Assigned: lars)

References

Details

Attachments

(1 file)

For OOPP we're going to implement a hang detector which notices when the child process is not responding. When this happens, the hang detector will generate two minidumps: one for the child and one for the parent. We would like to be able to associate these two reports together so that you can click from one to the other in crash-stats.

The simplest thing is to send the two reports separately and include a GUID in the .extra data which associates the two, although we could possibly also send both reports in the same HTTP post (i.e. two extra files and two dumps).

At this point I don't think we need to add extra searcheable metadata for hangs: we'll just treat them like crashes but in the comments we'll note that this is a hang instead of a normal crash.
Sending the associated GUID in the extra data sounds like the simplest way to go here.
(In reply to comment #0)
What is the rough deadline?

Will both .extra files have another GUID? Will there be a way to distinguish parent process versus child processes? Do we need to? What is the field name?

Roughing out requirements, please correct:
I think we'll need to add client GUID to socorro UUID in our database?

When we accept a crash the collector Assigns a UUID. We will need to track GUID to UUID during processing. Reason: When processing crash 1 we will know crash 1 and crash 2's GUIDs, crashes 1's UUID, but not crash 2's UUID since it (possibly) hasn't been collected yet.

Extra GUID will go into the processed jsonz file via current extra -> json properties flow.

For displaying a crash the PHP app will need to lookup the other crash's UUID via the extra GUID in the database.
> Will both .extra files have another GUID? Will there be a way to distinguish
> parent process versus child processes? Do we need to? What is the field name?

GUID and UUID seem to be causing confusion with client and server crash IDs. Let's call this new things a HangID. When submitting a hang:

* The client will generate a HangID
* The client will submit a plugin crash report, with the normal ProcessType=plugin PluginName=foo PluginFilename=bar fields, plugin HangID=generatedID
* The client will submit a browser crash report with the normal fields

> I think we'll need to add client GUID to socorro UUID in our database?

I'm not sure this is necessary. Couldn't you just have a HangID field? Then when you see a crash with a HangID you can query other crashes for the same HangID. That way you don't have to try and map anything in the collector or processor, just store the HangID.
Oh, and this is not "right now" urgent: we probably wont have the client-side work done for another couple weeks, and even after that we can just use comments to manually correlate the two crashes for a bit. I wanted to get it filed to give you a 2 or 4 or 6-week heads-up.
(In reply to comment #3)
HangID is only on 1 of the crashes.

Support search by HangID with exact match.

No need to link the two crashes reports.

Much easier, thanks!
Target Milestone: --- → 1.6
Thinking aloud:
If they are separate crashes from the collector's and processor's point of view and there is throttling, we need to be sure we get both crashes or neither (or are able to retrieve the 'other one' from the deferred file storage.

If they are separate crashes, we will need to associate them. This can be done in an association 
 table report_hang_guid (
   guid text, 
   report_id int -- foreign key
)
Which probably needs to be partitioned as reports table is so the report_id is readily accessible. Do it this way to avoid alter-ing reports (again)

If you select guid from this table where guid = 'whatever it is', then you should find (or force a find?) two report ids (and their uuids) that are associated.
Target Milestone: 1.6 → ---
Another option would be to keep track of crashes that have the guid/hangId field, skip them in the primary processing batch, make a second processing pass over just those crashes, and assure the association happens in that pass. Keep a memory based cross-reference between our ooid and the hangId that were seen, fill in blanks as needed from deferred storage, and only then finish processing.

More aloud thinking:
 - what if one of the crashes doesn't make it to processing for some reason? (maybe the processor ran just as one had arrived?)
 - We still need the association some way. Could create an 'associated' uuid (lose yet another digit to the task?), but I think the association table is better than an 'orrible 'ack.
Hmm. Reports table has uuid varchar(50) but an actual uuid is only 36 chars long. That gives us 14 'extra' characters that could be used for association if we prefer hacking on that field to creating an association table. 

Using base-36 integers (as string) gives you 4.7*10^18 distinct associations using just 12 characters. Envision associated uuidX examples:
  'f5343b5f-36b8-4440-a802-f0d512100210:00000000001z'
  '450f502d-570c-4cd2-93ec-b46552100210:00000000001z'

This will cause a problem select-ing using the uuid as a key unless we use the whole extended uuid when appropriate.
Assignee: nobody → lars
Target Milestone: --- → 1.7
Severity: normal → critical
ok, Frank is no longer with us so this bug falls to me.

bsmedberg, in comment #4, you said you were a couple weeks out for getting this code done.  We're beyond that point, is your code complete?  Did you opt to add a new field called HangId?

Frank was very right in comment #6 in his comment about throttling.  There is a very good chance that throttling will separate the two crashes and make the association impossible.  A way around that is to make sure that we accept all crashes that have these pairs.  This could be done using the pre-existing 'Throttleable' flag.  My concern then becomes that we're messing up the statistics by accepting 100% of these crashes while accepting only 15% of other crash types.

The throttling is going away, but we're a couple months at least from that.
The code is implemented and is present in 3.6/Lorentz/Trunk. It generates a HangID=UUID which is sent with the pair of reports.
A point of clarification, please verify that the following assertion is true:

Given crash1_uuid and crash2_uuid:

crash1_uuid != HangID && crash2_uuid != crash2_uuid

The UUID mentioned here is independent of the uuid that is returned as a crash identifier from collector on submission.  I'm assuming that this UUID is unique one time use that is the same value in both crashes but does not equal the Socorro assigned identifying uuid of either crash (except in case of an extremely improbable coincidence).
gah, pasting error in my assertion.  That should have read: 

crash1_uuid != HangID && crash2_uuid != HangID
Correct, modulo the astronomically improbable coincidence.  However, what's the threat model here?  It would be easy to forge .extras for paired dumps with a HangID of 00...0 and submit the pair multiple times.
There's no threat, we're just trying to figure out a way to implement this association in Socorro.  We've been working blind without actually having seen one of these pairs and needed to get more data about them.  We're trying to guarantee that both crashes get through throttling without being separated and having to register at one of those separated at birth Web sites...
In case its still helpful, there are some crash report pairs for different hangs in bug 549930 comment 4 and bug 555699 comment 0. These all made it through throttling due to manual requests.

If crash-report pairs relating to a plugin hang that have been throttled would be useful, then I am willing to submit a bunch of them and provide client GUIDs.
to recap
the #1  KiFastSystemCallRet signature (currently 550 crashes) and 
the #18 NtWaitForSingleObject signature (currently 34 crashes)

in the 3.6.1plugin1 topcrash report at

http://crash-stats.mozilla.com/topcrasher/byversion/Firefox/3.6.3plugin1

are one of these pairs.  That would be a good test case for how we try and map these together.

One question is why we would see such a disparity in the volume for this pair, especially when there should be no throttling applied to that set of crashes.

what are the indicators in the reports from these two signatures that could be use to do the mapping to each other?
chofmann, I don't think that KiFastSystemCallRet and NtWaitForSingleObject are always a matched pair. I think you're going to see KiFastSystemCallRet(firefox)/KiFastSystemCallRet(plugin) pairs, plus many other child stacks and perhaps several different parent stacks depending on different versions of Windows.

Note that this bug is just about cross-linking the reports, it won't have any effect on the signature. I filed bug 559174 about changing the signature for hangs, and bug 559137 about doing short-term custom queries which I can use to create better correlations for hang reports.
ok, so based on that here are some pairs, and some singletons that are missing mates based on time, signature, and OS that could be used as a test sample to evaluate lar's theories above about how to do the matching.
Blocks: 559360
Blocks: 560627
Blocks: 560628
No longer blocks: 560628
No longer blocks: 560627
Priority: -- → P1
Blocks: 562375
Lars,  I'd like to sync up with you on this work to see if what we are planning on delivering is exactly what you need.
This was really a tracker; this work was shipped in 1.6.2
Target Milestone: 1.7 → 1.6.2
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro
You need to log in before you can comment on or make changes to this bug.