Closed Bug 782630 Opened 13 years ago Closed 8 years ago

Improve signature generation algorithm

Categories

(Socorro :: Backend, task)

x86_64
Windows 7
task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: benjamin, Unassigned)

Details

(Whiteboard: [crashkill:P2])

Attachments

(1 file)

I've been looking at the signature generation algorithm (for non-Java crashes). I'm currently experimenting with alternate algorithms, using Pig to compare the existing algorithm with any changes. The advantage of doing this with Pig is that I can provide summaries of old-signature -> new-signature with samples so that we can directly compare and tweak algorithms before we put them into production.

My first trial is that by default we don't start walking at the top of the stack. Instead, we start the algorithm at the first frame with source information. This produces some interesting results that in the common case are probably better. https://github.com/bsmedberg/socorro-toolbox/commits/badsignaturesearch is the UDF and .pig I'm using. I'll try to attach a summary shortly; some of the side effects may be undesirable:

* crashes in modules without symbols (Firefox, Flash, etc.) will be charged to the caller in ways that may bucket too broadly
* crashes deep in system/driver libraries get charged to the caller, and if the caller is not directly responsible (e.g. memory corruption) this may unbucket the crash

I haven't yet tried to fully replicate the exact current signature generation algorithm as a Pig UDF; I just ported the pieces which were necessary to rewrite a single frame into the same pattern. I suspect that rather than replicating the entire signature generation system in Java, it would make more sense to write the UDF in Python and use the existing socorro code. If this is the case, I may shortly submit a socorro patch which factors out the signature generation bits to make that possible.

I haven't yet thought about whether or how we'd do any migration/backfill if we decided to change the algorithm. Let's not worry about that until I have a more solid proposal.
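A minimal Python sketch of the "start at the first frame with source information" idea described above. This is not the Pig UDF or the Socorro code; the frame dictionary keys (`function`, `module`, `offset`, `file`), the helper names, and the `prefix_functions` set (a stand-in for Socorro's prefix list) are all assumptions made for illustration.

```python
def frame_label(frame):
    """Render one frame the way a signature fragment usually looks."""
    if frame.get("function"):
        return frame["function"]
    if frame.get("module"):
        return "%s@%s" % (frame["module"], frame.get("offset", "0x0"))
    return "@%s" % frame.get("offset", "0x0")


def first_source_signature(frames, prefix_functions=frozenset()):
    """Build a signature starting at the first frame that has source info.

    frames: list of dicts, top of stack first (hypothetical shape).
    prefix_functions: labels after which the walk continues to the next frame.
    Falls back to the top frame when no frame has source information.
    """
    start = next((i for i, f in enumerate(frames) if f.get("file")), 0)
    parts = []
    for frame in frames[start:]:
        label = frame_label(frame)
        parts.append(label)
        if label not in prefix_functions:
            break
    return " | ".join(parts)


stack = [
    {"module": "ntuser.dll", "offset": "0x3a4a"},
    {"function": "recv", "module": "ws2_32.dll"},
    {"function": "SocketRead", "module": "nspr4.dll", "file": "ptio.c"},
    {"function": "nsSocketInputStream::Read(char*, unsigned int, unsigned int*)",
     "module": "xul.dll", "file": "nsSocketTransport2.cpp"},
]
print(first_source_signature(stack, {"SocketRead"}))
```

With this stack, the symbol-less `ntuser.dll` frame and the source-less `recv` frame are both skipped, which illustrates the "charged to the caller" side effect listed above.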
These are the results of my first experiment improving the signature generation algorithm, which is to skip to the first frame with source information. There are definitely some flaws, specifically around things like: 'SocketRead | nsSocketInputStream::Read(char*, unsigned int, unsigned int*)' combines multiple old signatures: 'strstr | recv | SocketRead | nsSocketInputStream::Read(char*, unsigned int, unsigned int*)' (10 : c898df15-af0c-4612-a260-784592120814) 'ntuser.dll@0x3a4a' (1 : 71226bf1-450c-4cb5-95de-d6b332120814) 'DSOCKET::GetCountedDSocketFromSocket(unsigned int)' (2 : 7824e698-eb7a-4030-bac8-52fa52120814) 'RtlpCoalesceFreeBlocks | RtlpCoalesceFreeBlocks | idmmbc.dll@0x19692' (1 : 56d8f905-dc0e-4abd-81e2-7ace42120814) '@0x0 | SocketRead | nsSocketInputStream::Read(char*, unsigned int, unsigned int*)' (4 : d13a0d61-3bcf-4eda-83ef-5e94e2120815) 'memcpy | gxvxcjmvfbnsnirrcexyqhrapefdwkpninuwk.dll@0x1f25' (1 : 6040c30c-e791-4b19-9a6d-b85d92120814) 'op_uid.dll@0x3c01' (1 : 2efbdf38-786a-420d-9303-3b97c2120815) 'nmsvc.dll@0x1052ab' (4 : f1e9e53f-f3de-46ed-b90d-7673b2120814) 'towlower' (1 : 56405112-2874-4057-bcd0-61e802120815) 'RtlpCoalesceFreeBlocks | msspirex.dll@0x1c492' (1 : e3f9f10f-7a2d-4a0d-9935-be3c52120815) 'memcpy | WSARecv | recv | SocketRead | nsSocketInputStream::Read(char*, unsigned int, unsigned int*)' (9 : dbfa3bf4-b3fd-4865-bf63-7eeb92120815) 'RtlpCoalesceFreeBlocks | recv | SocketRead | nsSocketInputStream::Read(char*, unsigned int, unsigned int*)' (1 : 6197a433-951a-49c0-a7a7-2ee602120815) 'filterlsp.dll@0x17e1' (2 : 9706ff0a-4290-414e-9663-68c5f2120815) 'RtlpCoalesceFreeBlocks | msspirex.dll@0x1d282' (4 : c8ec0c52-e1ac-4054-ad39-653132120815) 'nmsvc.dll@0x10528b' (32 : 7d1d28c5-0e0c-402f-9bb3-a4ef72120815) 'swi_filter_0001.dll@0x14750' (1 : 9cc9fb89-0960-422b-b9ec-f43af2120815) '_strdup | ftp33.dll@0x105cab' (2 : 86c52e8e-8eec-42ce-9eaf-24b032120814) 'winhadnt.dll@0x590cb' (1 : 8ab9f8e6-2160-4b1e-8692-d02f32120814) 'RtlReAllocateHeap | realloc
| nspr4.dll@0x297f' (2 : 00c0bfb5-aced-4256-a4d4-113ed2120814) 'ntload.dll@0x3a13' (2 : 4b0acbdf-2f2f-4c79-a3d2-45c512120815) 'OutputDebugStringA' (10 : 1f3b1279-83b7-445a-aa8a-57d892120814) 'netchartfilter.dll@0x168ab' (90 : 0ad0bd7b-f708-4c62-b571-f29d42120815) 'strncmp' (1 : 74679b4a-5a6a-4221-8434-e9deb2120814) 'RtlInitializeCriticalSection | WSARecv | recv | SocketRead | nsSocketInputStream::Read(char*, unsigned int, unsigned int*)' (3 : e908c38e-a4f1-4ad7-af94-ba86e2120814) 'memcmp | ntdll.dll@0xfff' (4 : e0300d5b-324c-4c02-a36d-704e82120815) 'flvsniff.dll@0x51000' (6 : cfa51a2b-041a-49f9-98a1-546d02120814) 'flvsniff.dll@0x506f0' (3 : f936a215-ba5a-4ce3-90b9-b17c72120815) '_invalid_parameter_noinfo' (1 : bd32fcbb-b81a-4222-bff9-a76722120815) 'nmsvc.dll@0x103c84' (6 : f0b18989-5676-4140-b984-1fd912120815) 'RtlpCoalesceFreeBlocks | msspirex.dll@0x1d3a2' (23 : b50aa0f3-91ac-4e69-b901-f6aa22120814) 'WahCloseApcHelper' (1 : 60769b5e-953e-4347-83b4-ebec72120815) 'nmsvc.dll@0xeffe' (269 : 469adc7f-b721-47a8-b54a-8b8812120814) '@0x0 | WSARecv | recv | SocketRead | nsSocketInputStream::Read(char*, unsigned int, unsigned int*)' (1 : 659fa2f9-592c-4ec9-9a01-c2b172120814) 'adguard.dll@0x168bb' (49 : 468636d5-dddb-4d9c-9ad5-16d782120814) 'op_uid.dll@0x3bf7' (1 : 85c8ffbc-e3e9-4df9-b61a-336002120815) 'wlifhfltr32.dll@0x8b3b' (1 : 53b9c7ce-1034-4908-b992-e569d2120815) 'op_uid.dll@0x3bf3' (1 : b502be9c-e167-4a26-8a19-db5db2120815) '_SEH_prolog' (1 : 4b40e3b9-a323-4cb9-8fa6-00bec2120815) 'malloc | operator new(unsigned int) | filterlsp.dll@0x12de5' (1 : 7bccf6b6-3563-47d2-9aba-e0e082120814) 'notepad.dll@0x3a13' (4 : a2f75649-904e-43a2-86ff-497ad2120814) 'plspnt.dll@0xe9ba' (3 : b496ef9c-8b9c-4331-a0de-10c192120815) '_strdup | ftp34.dll@0x105ce8' (4 : c3f55920-c1b2-49ae-9f4c-d657c2120815) 'RtlpWin32NTNameToNtPathName_U | StrStrIA' (1 : 0d60a1e0-e3df-46d8-a6af-1b5a02120814) 'CharLowerA' (30 : becde4c6-c202-4fee-b3b9-bb34f2120815) 'CharNextA' (35 : 
7c03d6a4-6368-409b-b043-c58f02120814) 'radhslib.dll@0x254f' (1 : 55ee36e3-585b-4cd3-a7ad-a1d422120815) 'RtlpWin32NTNameToNtPathName_U | _StrCmpLocaleA' (2 : a236f9be-5631-4dfb-9d25-a18552120814) 'recv | SocketRead | nsSocketInputStream::Read(char*, unsigned int, unsigned int*)' (3 : c3e8b1ea-57fe-4ca6-86a9-521fb2120814) 'infilter.dll@0x68470' (1 : d9e8bda6-507f-43c1-9fdc-8d7142120814) 'RtlReAllocateHeap | realloc | nspr4.dll@0x280f' (2 : cc60728f-13fe-41e9-84d6-82ffa2120814) 'RtlpCoalesceFreeBlocks | msspirex.dll@0x1d222' (3 : 28919e66-ba3b-40fc-baaf-853532120814) 'flvsniff.dll@0x50f30' (11 : b2fd1254-af0d-4a77-9835-a69172120814) 'radhslib.dll@0x3b6f' (126 : 3f00d057-ad69-4b6f-9182-7de212120815) 'bnmndrv.dll@0x81210' (1 : 11fe434c-f212-4a03-aca5-8238a2120814) 'WSARecv | recv | SocketRead | nsSocketInputStream::Read(char*, unsigned int, unsigned int*)' (3 : 4fbd174f-f1d0-490e-9273-2163c2120814) 'WSPRecv' (1 : 20e80016-4fb6-4b4f-a888-c9a6f2120815) 'RtlpFreeToHeapLookaside | nspr4.dll@0x280f' (1 : 3fc6e4a4-6886-4479-a6ae-f2c692120814) '_strdup | ftp34.dll@0x105bc3' (1 : 022a1e22-1c71-4aee-9373-ab6142120815) 'flvsniff.dll@0x42a67' (1 : 082949e9-1f45-4940-a2e2-cb3ba2120814) 'RtlInitializeCriticalSection | RtlpExtendHeap | RtlpInsertFreeBlock | msspirex.dll@0x1c501' (1 : 70526993-add4-48f8-847c-28b432120815) 'RtlLookupElementGenericTableFull | RtlLookupElementGenericTable | WSARecv | recv | SocketRead | nsSocketInputStream::Read(char*, unsigned int, unsigned int*)' (1 : 3bfbdcdd-066a-4ab2-8a9c-644572120814) '@0x0 | WSPRecv' (1 : feeada70-3dd9-4ff2-989f-b38562120815) 'cc3250.dll@0x59977' (2 : ccd3d25b-dbde-4a09-83e6-085e32120814) 'RtlpLowFragHeapFree | RtlFreeHeap | HeapFree' (1 : c2a2d208-29a8-4955-8ac3-17f0d2120815) 'webfilter.dll@0x1de3' (4 : a4a28ef5-aa3a-41a7-abc6-875c42120815) 'msspirex.dll@0x1d411' (2 : 14737c53-9826-48ac-9a60-347252120815) '_strdup | explorer.dll@0x1994' (4 : df2608e7-a8e8-417e-ad9d-4588a2120814) 'plspnt.dll@0x59ea' (2 : 
6f0bf5bd-d063-4793-8afb-554f82120815) 'plspnt.dll@0x6111' (1 : 497e2ec1-229b-4bab-baa4-9d77a2120814) 'strstr | rsvp322.dll@0x4613' (1 : 8c69a54a-f452-4f21-92c9-4c4d52120815) 'msspirex.dll@0x1d2f1' (4 : d098809b-d231-4c42-b450-9c8f52120815) 'sscanf' (2 : 709de08e-098f-4713-8fc6-7c4682120814) 'nmsvc.dll@0x10671b' (2 : 074fa717-4aff-471a-a043-12a662120815) 'adguard.dll@0x168ab' (1 : 56196d21-acaf-4a86-a1db-6ee7f2120815) 'deezripdll.dll@0x3a56' (1 : 59943e43-45d4-4aa3-b994-ddff92120814) 'memcpy | filterlsp.dll@0x39682' (5 : 6c61d6ed-8273-4448-97d8-336792120815) 'StrCmpNIA' (74 : b35b2e38-4b58-44af-88bc-c611f2120814) 'radhslib.dll@0x3b78' (26 : 30453a0e-60d2-4365-8fdd-bf36e2120815) 'memcpy | filterlsp.dll@0x37dc8' (4 : e35cae90-f4cb-41b9-929f-f07812120814) 'notepad.dll@0x3ce5' (4 : f117df20-a826-4fed-9fcc-e41012120814) 'strstr | strtolX' (1 : 273d143a-cec5-4718-9ff8-3c07d2120814) 'radhslib.dll@0x3b73' (13 : 0e0eb92a-692c-4ba3-b328-680ad2120815) 'autochk.dll@0x317c' (6 : f82c3400-bce5-4f49-8d38-84c0c2120814) 'winhadnt.dll@0x525a9' (6 : 4ac4172c-585d-49de-9c4f-4cc6a2120814) 'flvsniff.dll@0x4f229' (1 : 9ec55ecf-dd83-4557-a225-15c852120814) 'StrChrIA' (2227 : 83a9c653-12e2-4af0-8c92-be7412120815) 'nmsvc.dll@0x10529e' (1 : ddaa9aaa-6829-4c18-9e42-8aada2120814) 'strstr | rsvp322.dll@0x6378' (2 : 0aca0096-828e-4b9d-9638-597d22120815) 'radhslib.dll@0x3b7d' (21 : eede047c-2af3-4f7e-bc89-b6d842120814) 'ws2_32.dll@0x18541' (1 : 1a9381ff-fbef-4176-814f-3c5b32120814) 'nmsvc.dll@0xfa8e' (14 : ceca14ca-7e7c-4c46-9141-f062b2120814) '_strdup | ftpdll.dll@0x105c6d' (5 : 55778256-aacd-470c-b528-5d4212120815) 'RtlpCoalesceFreeBlocks | RtlpExtendHeap | RtlpInsertFreeBlock | msspirex.dll@0x1d411' (1 : d0647958-af7e-4183-bee7-fa18c2120815) 'RtlpInsertFreeBlock | RtlpExtendHeap | RtlpInsertFreeBlock | msspirex.dll@0x1d411' (1 : e7ffe8ca-c950-402d-a75a-c8c5b2120814) 'wsprintfA' (2 : b7987037-5e27-495e-bc83-f29a12120815) 'RtlConsoleMultiByteToUnicodeN | vsprintf' (1 : 
a6e4fcac-9421-4b35-98de-442dc2120815) 'flvsniff.dll@0x424c7' (3 : 05550a1b-f8a3-4899-99b4-4910a2120815) 'memmove | recv | SocketRead | nsSocketInputStream::Read(char*, unsigned int, unsigned int*)' (1 : 595e9f3d-8918-4e4d-97ee-5898a2120815) 'adguard.dll@0x181cb' (48 : 3def249a-9e7d-4f5b-a553-725202120815)

and 'nppdf32.dll@0xa302' always maps to 'mozilla::plugins::BrowserStreamChild::DeliverPendingData()' (7 : 16fd6b7d-6f52-47a2-8279-7f8462120815).

I think we should probably coalesce all frames that have a module without symbols into a single entry, so the signature for this particular report would end up as "<nppdf32.dll> | mozilla::plugins::BrowserStreamChild::DeliverPendingData()"

The changes for this experiment can be found at: https://github.com/bsmedberg/socorro/compare/signature-of-source

I'd love others (scoobidiver/Kairo/ted) to look through this and see if there's anything else in here that is particularly good or bad about this alternate slicing mechanism.
(In reply to Benjamin Smedberg [:bsmedberg] [away 27-July until 7-Aug] from comment #1)
> I'd love others (scoobidiver/Kairo/ted) to look through this and see if
> there's anything else in here that is particularly good or bad about this
> alternate slicing mechanism.

I'm not sure what you really want me to look at there, I don't really understand the diff at github, and the attachment appears unreadable (wrong MIME type?) to me.

The suggestion of something like "<nppdf32.dll> | mozilla::plugins::BrowserStreamChild::DeliverPendingData()" is surely something I feel more comfortable with than ignoring symbol-less frames completely, but I fear even that would make some stats worth less than they are currently. See for example https://crash-analysis.mozilla.com/rkaiser/2012-08-14/2012-08-14.firefox.flash.11-3-300-271.html which at least tells us that a single location we crash in has 506 crashes and 29% of the overall ones - if we put all crashes in that symbol-missing module into a single bucket, we even lose that info.
Skipping modules without source info doesn't seem like the right tradeoff. There are going to be things that we have symbols for (system libs), but not source, and we should try to aggregate over that. Skipping things without symbols is probably a decent call, and your suggestion of bucketing foo.dll@0xanything -> <foo.dll> seems right. I'd say we should probably even make it coalesce any number of consecutive frames in the same module into just the module name, so that something like:

foo.dll@0x123
foo.dll@0xABC
nsFoo::Foo()

turns into

<foo.dll> | nsFoo::Foo()
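The coalescing rule suggested above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the frame dictionaries and the `coalesced_signature` name are hypothetical and do not come from Socorro.

```python
def coalesced_signature(frames):
    """Collapse runs of consecutive symbol-less frames from the same module.

    frames: list of dicts, top of stack first, each with an optional
    'function' (present when symbols resolved) and a 'module' name.
    A frame without a function name is rendered as <module>, and adjacent
    identical <module> entries are merged into one.
    """
    labels = []
    for frame in frames:
        if frame.get("function"):
            labels.append(frame["function"])
            continue
        label = "<%s>" % (frame.get("module") or "unknown")
        if not labels or labels[-1] != label:
            labels.append(label)
    return " | ".join(labels)


example = [
    {"module": "foo.dll", "offset": "0x123"},
    {"module": "foo.dll", "offset": "0xABC"},
    {"module": "xul.dll", "function": "nsFoo::Foo()"},
]
print(coalesced_signature(example))  # <foo.dll> | nsFoo::Foo()
```

Note that the offsets are dropped entirely, so two distinct crash locations inside foo.dll land in the same bucket; that is the loss of per-location information raised earlier in the thread.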
> Skipping modules without source info doesn't seem like the right tradeoff.

I'd be a proponent of having multiple signature algorithms running in parallel producing a variety of top crash lists. Analysis would then proceed across the different lists, helping us to uncover different bugs. Of course this could also introduce some confusion with the same bug having multiple signatures; but we already have that problem, and have developed a vocabulary and tools for being able to track these kinds of multiple sigs that lead to the same underlying problem. "Source top crash #1" could be its own unique problem, or it could link to "old style topcrash #25 ...."

The key to expanding our crash analysis beyond the current levels is to use the same data that we have, and develop interesting new tools and ways of looking at it to uncover new bugs in the long tail of crashes. I think this project fits into that goal very well.
Yes, as an additional algorithm to experiment with, I'm all for trying. I'm just wary of replacing the current algorithm at this time.
Sorry, it's a plaintext.bz2. It's rather large and unwieldy; I'm printing out results for every crash over a two-day period; I wonder if I should instead be sorting or filtering so that we can focus on the top signatures. I'll try slicing the data that way.

Ted, I'm not sure that *in general* we really need the system lib functions to be in the signature to get good bucketing (which is why I'm doing this experiment). I started out by noticing that in order to get useful hang signatures, we've basically had to spend all our time tweaking the append/skiplists so that we end up at the first mozilla-or-flash frame:

* tweaked "hang | WaitForMultipleObjectsEx | RealMsgWaitForMultipleObjectsEx | MsgWaitForMultipleObjects | F_1152915508___________________________________"
* tweaked "BaseGetNamedObjectDirectory | RealMsgWaitForMultipleObjectsEx | MsgWaitForMultipleObjects | F_1152915508___________________________________"
* haven't yet tweaked "hang | GetProcessHeap"

So the theory I'm testing is that *usually* we could just skip to the first Flash-or-Mozilla frame and get better results. Although I do think that coalescing multiple frames into <foo.dll> would probably be good, I'll try implementing that on my next full run.

Also one thing to consider is always printing the first frame and then skipping to the first source-frame, so those hangs above would end up as hang | WaitForMultipleObjectsEx | F_1152915508...

Alternately and less invasively, we could make frames-without-source always appendlist (instead of skiplist). We also have the option of doing a blacklist of sorts: crashes should include (or end at) particular system library frames even without sources.

chofmann, I love the *idea* of having multiple signatures, but I'm looking for something shorter-term which doesn't require invasive changes to socorro and improves our bucketing and analysis more quickly.
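As a rough illustration of the "always print the first frame, then skip to the first source-frame" variant mentioned above, here is a small Python sketch. The `(label, has_source)` tuple shape, the function name, and the `is_hang` flag are assumptions for the example; which frames count as source frames is taken from the comment above rather than from any real symbol data.

```python
def top_plus_first_source(frames, is_hang=False):
    """Keep the top frame, then jump to the first frame flagged as a source frame.

    frames: list of (label, has_source) tuples, top of stack first.
    is_hang: when True, the signature gets the "hang" prefix.
    """
    parts = ["hang"] if is_hang else []
    if frames:
        parts.append(frames[0][0])
        for index, (label, has_source) in enumerate(frames):
            if has_source:
                # Avoid repeating the top frame if it is itself a source frame.
                if index > 0:
                    parts.append(label)
                break
    return " | ".join(parts)


hang_stack = [
    ("WaitForMultipleObjectsEx", False),
    ("RealMsgWaitForMultipleObjectsEx", False),
    ("MsgWaitForMultipleObjects", False),
    ("F_1152915508...", True),  # treated as the first source frame, per the comment
]
print(top_plus_first_source(hang_stack, is_hang=True))
```

For the hang stack above this yields "hang | WaitForMultipleObjectsEx | F_1152915508...", the shape proposed in the comment, without any per-function skiplist entries.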
For multiple signatures produced by various signature algorithms, I suggest this system:

1) signature algorithms should have a unique identifier that is the module and classname of the python code that produces the signature. All signature algorithms must ducktype for the class socorro.processor.signature_tools.SignatureTool

2) the current signature field is not changed in Postgres and in the processed crash json. A new key is added to the processed crash json in this form:

    "signatures": {
        "socorro.processor.signature_utilities.SignatureTool": "ntuser.dll@0x3a4a | abc | xyz",
        "bsmedberg.signature_algorithms.test01.SignatureTool17": "xyz(const char * const)",
        "socorro_lars.signature_tests.SignatureTool2012": "zyx | cba | a4a3x0@lld.resutn"
    }

As for doing all the aggregation reports for a given experimental signature, they may have to be generated on demand by map/reduce jobs or other techniques. More discussion would be required.

3) the processor has a configuration parameter that specifies which signature algorithm is the primary one. This is already implemented in Processor2012.

4) a new table in postgres should list the names of the signature algorithms currently used for the "signatures" key in the processed crash. That way we can dynamically add or remove new algorithms without having to restart the processors. It also has the benefit of allowing for a UI component.

Digression: making the code for a new algorithm available to the processors would still require action by IT...

If there is a desire for a system like this, please let me know and I'll file a separate bug for implementation.
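A small Python sketch of how points 1) to 3) above could fit together. The two tool classes, the single-argument `generate` method, and the `add_signatures` helper are hypothetical stand-ins; they are not the Socorro SignatureTool interface, only an example of duck-typed tools keyed by module and class name.

```python
class TopFrameTool:
    """Hypothetical algorithm: the signature is just the top frame."""

    def generate(self, frames):
        return frames[0] if frames else ""


class JoinedFramesTool:
    """Hypothetical algorithm: join the first three frames."""

    def generate(self, frames):
        return " | ".join(frames[:3])


def tool_id(tool):
    """Unique identifier: module plus class name of the tool (point 1)."""
    cls = type(tool)
    return "%s.%s" % (cls.__module__, cls.__qualname__)


def add_signatures(processed_crash, frames, tools, primary):
    """Fill 'signatures' from every tool (point 2) and 'signature' from the
    configured primary tool (point 3)."""
    processed_crash["signatures"] = {
        tool_id(tool): tool.generate(frames) for tool in tools
    }
    processed_crash["signature"] = primary.generate(frames)
    return processed_crash


primary_tool = JoinedFramesTool()
crash = add_signatures(
    {},
    ["ntuser.dll@0x3a4a", "abc", "xyz"],
    [primary_tool, TopFrameTool()],
    primary_tool,
)
print(crash["signature"])
print(sorted(crash["signatures"]))
```

Keying by module and class name keeps experimental results from different authors apart without a central registry; the table described in point 4) would then decide which identifiers are active.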
I disagree with the current mapping that is not bijective and shows Firefox as the only culprit. For instance, some plugin, driver and third-party module crashes would be classified as Firefox crashes as long as one Firefox component is in the stack trace. The problem with the current algo is with OS or utility modules and for that there are at least bug 711953, bug 711954, bug 764756, and bug 778404. To know the rank of one crash composed of various signatures, there's bug 717797.
Attachment #652584 - Attachment mime type: text/plain → application/x-bzip2
(In reply to Benjamin Smedberg [:bsmedberg] from comment #6)
> Sorry, it's a plaintext.bz2.

Changed the attachment type to reflect that it's a .bz2, so that Bugzilla won't try to show it plainly in UI and the browser.

> Ted, I'm not sure that *in general* we really need the system lib functions
> to be in the signature to get good bucketing (which is why I'm doing this
> experiment).

Well, I'm somewhat concerned about cases e.g. when a new OS X version comes out, we tend to get actual crashes in some system libraries where those are indeed bugs in that Apple code and we report those to them. Same for other OSes and other third-party code at times.
Assignee: benjamin → nobody
Whiteboard: [crashkill:P2]
These thoughts are useful as a historic artifact. Now that data is available in a.t.m.o and spark, we are encouraging interested parties to experiment with different signature generation methods there and then bring specific innovations back for implementation in socorro or similar.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX