Closed Bug 423968 Opened 14 years ago Closed 12 years ago

find and report on bad modules (e.g. DLLs) in the process list that correlate with particular stack signatures

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 521917

People

(Reporter: chofmann, Unassigned)

References

Details

(Whiteboard: cloud[crashkill][crashkill-metrics])

Attachments

(1 file)

https://bugzilla.mozilla.org/show_bug.cgi?id=412605#c6 has a good idea about analyzing the module list to detect .dll's that might have some association with a particular stack trace, and producing a report that would help speed up the analysis.

this would be a great addition. right now it just takes a lot of eyeballs and time going through individual crash reports to find this association.
Target Milestone: --- → 0.8
Attached file proposed crash report
Chofmann and I have started creating a proposed data script/query and report format.  The attached example uses most of the data provided by Benjamin on bug 427820.  We want a way to look at all dll's across all crashes, count those dll's, provide a couple additional metrics (these last two are left blank in the attached example)... and then make all of this data/reporting both platform specific and release specific.
Ted -- copying you with Benjamin being on leave.  this bug is also related to bugs 427820 and 412605.
I watch Benjamin's address, so I get mail for bugs he's CCed on. I'm not sure what the intent of this report you've attached is, this just seems to be the baseline data? Do you need anything more from us to proceed here?
We need to start thinking about how we can turn ken's rough prototype into a script that can run and produce (nightly/weekly?) reports. Maybe this is more for morgamic. I see that he has this marked as 0.8 work.

If anyone can contribute feedback on what ken proposes before work on this starts in 0.8 that would be great.
the analysis part of this bug might share some of the same back end process needed for bug 439679 and we might want to have several report outputs that look at the data in a variety of ways.
Blocks: 439679
I think this kind of analysis could have spotted this bug pretty easily

bug 441649 Firefox 3.0 Crash Report [@ nsBaseWidget::RemoveChild(nsIWidget*) ] - metasearch addon

xshared.dll     1.0.0.42 might only appear in crashes related to stack signature nsBaseWidget::RemoveChild    
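
A rough sketch of the kind of check suggested here: for one signature, compare how often each module shows up in crashes with that signature against how often it shows up across all crashes. The input shape (pairs of signature and module names) and the name `module_correlation` are assumptions made for illustration; nothing in this bug says Socorro exposes the data this way.

```python
from collections import Counter

def module_correlation(crashes, signature):
    """crashes: iterable of (signature, iterable_of_module_names) pairs.

    Returns (module, share_in_signature, share_overall) tuples, with the
    modules most over-represented in the given signature listed first.
    """
    in_sig = Counter()
    overall = Counter()
    n_sig = 0
    n_all = 0
    for sig, modules in crashes:
        names = set(modules)  # count each module once per crash
        n_all += 1
        overall.update(names)
        if sig == signature:
            n_sig += 1
            in_sig.update(names)
    rows = []
    for module, count in in_sig.items():
        rows.append((module, count / n_sig, overall[module] / n_all))
    # Sort by (overall share - signature share), so the biggest excess comes first.
    rows.sort(key=lambda row: (row[2] - row[1], row[0]))
    return rows
```

A module present in every crash for the signature but in only a small share of all crashes floats to the top; modules that are loaded everywhere have equal shares and sink to the bottom.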


Blocks: 441648
Blocks: 441649
No longer blocks: 441648
Target Milestone: 0.8 → ---
Severity: normal → enhancement
OS: Mac OS X → All
Hardware: PC → All
another possible output of this analysis was just spotted in https://bugzilla.mozilla.org/show_bug.cgi?id=434403#c111

  -------  Comment #111 From  Henrik Skupin   2008-11-14 16:37:14 PST  -------

I've taken a look at a couple of these reports and each one lists some bogus
DLL files which have random names and cannot be found by searching on Google.
Looks like Firefox 3.0.x can be used to detect this trojan.

------------------------------------------------------------------

let's find new .dll names that don't appear in previous reports or anti-virus databases and turn them over to others that can investigate the possibility that these are new spyware variants.
it would be very cool to get alerts when the name of some new .dll never seen before was detected in the module list of an incoming report.
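
A minimal sketch of such an alert, assuming the set of previously seen module names is already available from somewhere (a file, a database table) and that names can be compared case-insensitively. The function name `new_module_names` is hypothetical.

```python
def new_module_names(report_modules, known_names):
    """Return module names from one report that are not in the known set."""
    # Assumption: names are compared case-insensitively.
    known = {name.lower() for name in known_names}
    return sorted({m.lower() for m in report_modules} - known)
```

Anything returned here would be a candidate for an alert and could then be added to the known set.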
Chris, that would be somewhat problematic for crash reports like we have on bug 434403. Because the file names are chosen randomly by the trojan you won't get a significant list of new DLL files. It will be mostly garbage.
Could this also be used to detect *.so (Linux shared libraries) which don't jibe with the version they were used with? I remember random crashes caused by having installed a newer .tar.gz/.tar.bz2 version "on top of" an older version which included some .so library not present in the newer version. (The solution is to remove the installdir with all its contents before unpacking the archive for a newer version, but not everyone does that.)
re: comment 9

sure it would not work in all cases.  It would work in others.   for example if we had this report running right now we could do a search for 

 a Trojan DLL named “wmimachine2.dll”

and find out how many people have been hit by the current zero-day attacks on  Adobe Flash and PDF reader, before they have gotten the patch out, or before anti-virus packages have started detecting.

http://www.scmagazineuk.com/Finjan-detects-zero-day-attacks-due-to-Adobe-vulnerability/article/140564/

This is definitely going to be a needle in the haystack kind of problem so developing good filters is going to be part of it.
bug 366973 is marked as fixed.  just printing that stuff out in a report somewhere would help with a bunch of related analysis.
if we had a tab separated text file that contained

signature \t uuid_url \t last_crash \t product \t version \t build \t branch \t os_name \t os_version \t comma,separated,list,of,the,module,list,in,alpha,order

we could start to analyze some of the module list data in spreadsheets and text processing tools.

I've heard the module list is in a format that isn't easy to work with, but is something like this possible?
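
For illustration, a short sketch of writing that file with Python's `csv` module. It assumes each crash is already available as a dict with the column names listed above plus a `modules` list; that input shape, the `write_rows` name, and every value in the example record are placeholders, not real Socorro data.

```python
import csv
import io

# Column order from the comment above; the module list goes last.
FIELDS = ["signature", "uuid_url", "last_crash", "product", "version",
          "build", "branch", "os_name", "os_version"]

def write_rows(crashes, out):
    """Write one tab-separated line per crash dict to the file-like `out`."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for crash in crashes:
        # De-duplicate, then sort case-insensitively so lists compare easily.
        modules = ",".join(sorted(set(crash["modules"]), key=str.lower))
        writer.writerow([crash[name] for name in FIELDS] + [modules])

# Made-up example record; every value here is a placeholder.
example = {
    "signature": "nsBaseWidget::RemoveChild(nsIWidget*)",
    "uuid_url": "http://crash-stats.example/report/index/0000",
    "last_crash": "3600",
    "product": "Firefox",
    "version": "3.0",
    "build": "2008000000",
    "branch": "1.9",
    "os_name": "Windows NT",
    "os_version": "5.1",
    "modules": ["xul.dll", "xshared.dll", "NPSWF32.dll"],
}
buf = io.StringIO()
write_rows([example], buf)
print(buf.getvalue(), end="")
```

Passing an open file instead of the `StringIO` buffer would produce a file that spreadsheets and text tools can load directly.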
Socorro is not presently parsing the module data provided by breakpad. Does anyone have any sort of institutional memory about why we stopped collecting module info?

Assuming we re-started collecting the module data, the report mentioned in comment #13 would be reasonably easy to do. The alternative would be to parse the crash's json file on demand.

I think Aravind mentioned that we are now storing json files for only a few days, so the window of opportunity to do on-demand parsing is small. If we want historical data, or the load is high, or we need to compare crashes, then we probably need to look at parsing / saving the module data as crashes are processed.
We're collecting it, just not storing it in the database. This was a database-size-and-maintenance issue IIRC, because it's difficult to normalize the table and a non-normalized table was very large.
Agree: My 'collecting' should have been spelled 'saving'.

On average, the module list per crash seems to be something over 100 modules long (based on an exhaustive analysis of three data points); and both the module list and the details within each module are different, even when the crashes have the same signature. It does appear that trying to normalize such data would not work because there would be too many distinct module strings; and trying to save it in raw form would make storage even heavier.

Saving only the module name (e.g. 'nss3.dll' or 'XUL') would be much more feasible, probably requiring only several hundred to a few thousand distinct module names. That would make normalizing the data simple, and storing it reasonably easy, especially if we store the module list as a comma-separated list within the database. Would such coarsely ground data be useful?
imo, not without the version and hash.
if you retained those two bits, ... maybe. is it possible to have a table which would have (key autoinc, name, version, hash) and then have module list just reference things by that key?
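
A sketch of that layout, using SQLite from Python purely as a stand-in (this bug does not specify the schema, and the table and function names below are hypothetical). Each distinct (name, version, hash) triple is stored once, and a crash only stores the keys.

```python
import sqlite3

def open_db(path=":memory:"):
    """Create the two illustrative tables if they do not exist yet."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS modules (
            id      INTEGER PRIMARY KEY AUTOINCREMENT,
            name    TEXT NOT NULL,
            version TEXT NOT NULL,
            hash    TEXT NOT NULL,
            UNIQUE (name, version, hash)
        );
        CREATE TABLE IF NOT EXISTS crash_modules (
            crash_uuid TEXT NOT NULL,
            module_id  INTEGER NOT NULL REFERENCES modules(id),
            PRIMARY KEY (crash_uuid, module_id)
        );
    """)
    return db

def module_id(db, name, version, hash_):
    """Return the key for a module, inserting the row on first sight."""
    db.execute(
        "INSERT OR IGNORE INTO modules (name, version, hash) VALUES (?, ?, ?)",
        (name, version, hash_),
    )
    row = db.execute(
        "SELECT id FROM modules WHERE name = ? AND version = ? AND hash = ?",
        (name, version, hash_),
    ).fetchone()
    return row[0]

def record_crash(db, crash_uuid, modules):
    """modules: iterable of (name, version, hash) tuples for one crash."""
    for name, version, hash_ in modules:
        db.execute(
            "INSERT OR IGNORE INTO crash_modules VALUES (?, ?)",
            (crash_uuid, module_id(db, name, version, hash_)),
        )
    db.commit()
```

With this shape the size of the `modules` table depends only on how many distinct builds are seen, while the per-crash cost is one small row per loaded module.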
version info would be nice, but we can get started without it.

we could also filter the list down considerably by removing any module that we have symbols for on the symbol server.   That would be the first step I'll look at in the post processing of the data, but if we could do it at the time when the crash report is digested and stored in the database that would be a plus for me.

with product version number info we know the versions of modules that we have symbols for.
I think it makes sense to filter out known product modules: there's no point in correlating crashes against versions of xul.dll or js3250.dll.

It also makes sense to filter out known system modules which are always present, such as libc, libstdc++, system32, etc... it's possible but unlikely that we'll correlate crashes against particular versions of those, and removing them significantly reduces the dataset size.

However, I don't think it makes sense to remove all modules we have symbols for: we have symbols for various extensions and plugins, and hopefully will have more in the future, and it's likely that we'll be able to correlate crashes against those.

How exactly we enumerate "known product and system modules" should be considered: it would probably be better to keep a list in the DB instead of hardcoding it, so that as products and systems change we can keep up with potential name changes.
Also note that Socorro in general has no knowledge of symbol files at all. It treats Breakpad like a black box, and simply accepts the minidump_stackwalk output with or without symbols, never knowing (or caring) whether they're present for any given module.

In the script I implemented to fetch missing Win32 symbols for bug 419882, I simply had a blacklist of modules that I knew were part of Mozilla apps, and ignored them:
http://hg.mozilla.org/users/tmielczarek_mozilla.com/fetch-win32-symbols/file/tip/blacklist.txt

It wouldn't be hard to expand that list to include common Windows/Mac/Linux system symbols, so you could pare the list down to simply third-party symbols (plugins, drivers, malware, etc).
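
A minimal sketch of that filtering step. It assumes the blacklist is plain text with one module file name per line (the actual format of blacklist.txt is not described here) and that names are matched case-insensitively; both function names are hypothetical.

```python
def load_blacklist(lines):
    """Build a lowercase set of module names from an iterable of text lines."""
    # Assumption: one module file name per line, blank lines ignored.
    return {line.strip().lower() for line in lines if line.strip()}

def third_party_modules(modules, blacklist):
    """Keep only the module names that are not on the blacklist."""
    return [m for m in modules if m.lower() not in blacklist]
```

Something like `load_blacklist(open("blacklist.txt"))` would build the set once, after which each crash's module list can be filtered before it is stored or reported on.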
per comment #17
What timeless suggests is exactly what normalization does. The issue here is keeping the size of the 'module info' table small enough to be reasonable. Adding a version column probably multiplies the table size, long term, by about 10 (immediately, it has small effect). 

I don't know what a 'hash' is here. If it is a hash of all the data except the name and version, then I think it would be too much: it would basically get us back to just storing it all, since a good hash would be different for each different set of details. I think name and version is quite feasible. If 'hash' has fewer than 10 values for a given name and version, then I'll go out on a limb and say that would be feasible. More than 1000 hash values per name/version: Not feasible.

per comment #18, comment #19
A list of module names/versions to not keep track of would be easy to handle in the database: Simple (LDAP authorized) GUI to add/remove them, and simple SQL to access them from the processor. Items missing from that list add a little noise, with small conceptual cost for programmers (I think). Items improperly in that list probably would be noticed 'pretty soon' when a programmer tries to see details that aren't there. Maybe 'almost anyone' can remove items from the list, but you have to be specially authorized to put them in?
A hash is a unique identifier for a particular build of a DLL. Most details will have common modules, since (for example) system32.dll will have the same hash/version for everyone who's using Windows XP SP3, or Windows Vista SP1, etc... so I don't expect that the modules table will grow multiplicatively large.

Since some (many?) DLLs don't have useful versions, the hash is usually the better thing to key on... the version number is mainly useful for human-readable communication. It's unlikely that if a DLL has a version number it will have more than one hash.

As for the blacklist management, a few key technical people for each project (ted and myself, a few people each from SM and TB) would be sufficient to maintain it.
re: comment 21 about the number of possible/reasonable hashes...

Here is a sample of the number of flash versions detected on mozilla.com cut off at 20...   include various platform and debug versions and it's easy to see the number of hashes getting into the hundreds pretty quickly for flash alone.

Rank  Version    Count      Share
 1.   10.0.22    2,302,363  60.7%
 2.   10.0.12      585,302  15.4%
 3.   9.0.124      275,640   7.3%
 4.   10.0.32      225,881   6.0%
 5.   -1           116,714   3.1%
 6.   9.0.115       82,708   2.2%
 7.   9.0.159       54,124   1.4%
 8.   9.0.47        29,641   0.8%
 9.   9.0.151       24,253   0.6%
10.   9.0.45        23,791   0.6%
11.   9.0.28        22,946   0.6%
12.   10.0.2        19,153   0.5%
13.   8.0.22         7,809   0.2%
14.   9.0.16         6,972   0.2%
15.   7.0.19         2,606   0.1%
16.   8.0.24         2,171   0.1%
17.   10.0.b218      1,054   0.0%
18.   10.0.15          797   0.0%
19.   9.0.19           746   0.0%
20.   10.0.26          692   0.0%
Sure, but hundreds is not a big deal. I'd only be worried about database size if we ended up with 50k. If we exclude browser DLLs and known system DLLs I think we'll be well under that mark.
50,000 rows/table divided by 200 rows/item = 250 items/table. Assuming "hundreds" is 200. My look at a few data points seems to indicate that 250 is the appropriate order of magnitude for the number of modules, so this probably fits.

Given this raw datum:
Module|libplds4.dylib|0.1.0.0|libplds4.dylib|414E08F7EE504E7FBFED13DA7F38DFE20|0x00f46000|0x00f50fff|0

I'm guessing the hash is 414E08F7EE504E7FBFED13DA7F38DFE20, correct? If so, then from my three data points we get about 380 distinct hashes out of 430 lines. Looking two at a time, we see a little under 20% overlap for some pairs, and effectively no overlap for others. I'm not statistician enough (and three points isn't data enough) to make any prediction from that...
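
For anyone wanting to repeat this count on more reports, here is a sketch that pulls (name, version, hash) out of `Module|` lines and measures overlap between reports. It assumes the field positions seen in the sample line above (name second, version third, hash fifth). The comment does not say how overlap was measured, so the Jaccard ratio used here (shared hashes divided by all distinct hashes in the pair) is a swapped-in choice, and both function names are hypothetical.

```python
from itertools import combinations

def parse_modules(dump_text):
    """Return a set of (name, version, hash) tuples from 'Module|' lines."""
    modules = set()
    for line in dump_text.splitlines():
        fields = line.strip().split("|")
        # Assumed layout: Module|name|version|debug name|hash|start|end|flag
        if fields[0] == "Module" and len(fields) >= 5:
            modules.add((fields[1], fields[2], fields[4]))
    return modules

def pairwise_overlap(reports):
    """reports: dict mapping a report id to its set of hashes.

    Returns {(id_a, id_b): shared / total distinct} for every pair.
    """
    result = {}
    for a, b in combinations(sorted(reports), 2):
        union = reports[a] | reports[b]
        shared = reports[a] & reports[b]
        result[(a, b)] = len(shared) / len(union) if union else 0.0
    return result
```

The number of distinct hashes across all reports is then just the size of the union of the per-report sets.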
I wouldn't use libplds4.dylib as an example, since it's part of Firefox. There's going to be more variation in modules we ship, since we upload new builds every day, therefore ensuring a huge amount of different modules. It'd be more interesting to look at chofmann's example of flash--npswf32.dll on Windows, "Flash Player" on mac, and libflashplayer.so on Linux.
Summary: find and report on bad .dll's in the process list that correlate with particular stack signatures → find and report on bad modules (e.g. DLLs) in the process list that correlate with particular stack signatures
See also bug 464775.
Whiteboard: cloud
I guess there are two ways the analysis might go with tools related to this bug.

one way is to look at a big pool of reports to try and figure out what combination of .dll's and versions might be associated with a crash.  e.g. we don't know what's happening, let's look to see if we can see a common pattern in what extra .dll might be running....

the other way is more targeted.  in this case we have a hunch that a particular plugin is the cause of the crash, and we just want to confirm or deny if it's entirely the same version of the plugin or more distributed across versions.

cww and I have a few tools for doing the latter now.

my tool basically: 

1) gets a list of crash reports for a particular signature
2) foreach report grab the version of the .dll we are interested in - e.g.
   grep -i "Module|"NPSWF32.dll temp  | awk -F'|' '{printf "\t%s\t%s\n",$2,$3}'
3) cycle through the full list of reports and kick out version info or a summary report with counts of all the different versions encountered in the crash reports.
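
The same three steps could be sketched in Python roughly as follows. This is not the actual tool, only an illustration: it assumes the raw dump text of each report for the signature has already been fetched into a list of strings, and the function names are made up.

```python
from collections import Counter

def module_versions(dump_text, module_name):
    """Return the versions listed for one module in a single raw dump."""
    wanted = module_name.lower()
    versions = []
    for line in dump_text.splitlines():
        fields = line.split("|")
        # Same fields the grep/awk line picks out: name is $2, version is $3.
        if len(fields) >= 3 and fields[0] == "Module" and fields[1].lower() == wanted:
            versions.append(fields[2])
    return versions

def version_summary(dumps, module_name):
    """Count how many reports contain each version of the module."""
    counts = Counter()
    for text in dumps:
        counts.update(set(module_versions(text, module_name)))
    return counts.most_common()
```

Calling `version_summary(dumps, "NPSWF32.dll")` gives (version, number of reports) pairs with the most common version first, which is the summary described in step 3.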
sounds like cww has some rough scraping tools that also look at the "big pool" analysis part.
Whiteboard: cloud → cloud[crashkill][crashkill-metrics]
Fixed by bug 521917?
yeah
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 521917
Component: Socorro → General
Product: Webtools → Socorro