Closed Bug 948644 Opened 11 years ago Closed 10 years ago

record missing symbols

Categories

(Socorro :: General, task)

All
Windows 7
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benjamin, Assigned: rhelmer)

References

Details

(Whiteboard: [db migration])

I think we should monitor when we're missing symbols more carefully, and notify if we're missing symbols that we don't expect to be missing. I recently ran reports of missing symbols for a single day (8-Dec):

https://crash-analysis.mozilla.com/bsmedberg/missing-symbols.20131208.byname.csv
https://crash-analysis.mozilla.com/bsmedberg/missing-symbols.20131208.csv

This is just a report on missing symbols that affect signature generation, not all missing symbols. It uses the JSON MDSW data from bug 431514. The script to generate this is at https://github.com/mozilla/jydoop/blob/master/scripts/missing-symbols.py

What I think I'd like to do is spend some time manually noting, for each DLL name, the following information:

* who made it
* whether we expect to have symbols for it

So for example:

"user32.dll" -> ("Microsoft", true)
"xul.dll" -> ("Mozilla", true)
"igd10iumd32.dll" -> ("Intel", false)
"nvwgf2um.dll" -> ("Nvidia", false)

Then, either continuously in the processors or at the end of each day, create a report for missing symbols. This report should also trigger alerts to stability@mozilla.org:

* if we are missing symbols that we expect to have
* with the DLL version and debug ID that we're missing

Currently my report includes DLLs even if we don't know the debug ID. This is interesting information in a general sense, but it's often the case that OOM can cause this to happen, and we can't look up symbols without a debug ID anyway, so we should exclude DLLs with no debug ID from the alerts.

I'd really like the alerts to happen very quickly, more frequently than once a day: e.g. on Patch Tuesday we often have a lag of a day or more before we've fetched symbols, and it would be nice to fix that within a few hours. But as a first step, doing it as a daily batch would get us most of the way there and is probably very easy, either using the hadoop/jydoop script I already wrote, or building this into the processors and having them populate a Postgres table of missing symbols.
Laura, this is a January/February ask from me if possible.
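The per-DLL classification and alert filtering described above could be sketched roughly like this (a hypothetical illustration, not actual Socorro code; the module names and tuple layout follow the examples in the comment):

```python
# Hypothetical lookup table: module name -> (vendor, expect_symbols),
# following the examples given in comment 0.
KNOWN_MODULES = {
    "user32.dll": ("Microsoft", True),
    "xul.dll": ("Mozilla", True),
    "igd10iumd32.dll": ("Intel", False),
    "nvwgf2um.dll": ("Nvidia", False),
}

def should_alert(filename, debug_id):
    """Alert only for modules whose symbols we expect to have, and only
    when a debug ID is present: missing debug IDs are often caused by
    OOM, and symbols can't be looked up without one anyway."""
    if not debug_id:
        return False
    vendor, expected = KNOWN_MODULES.get(filename.lower(), ("unknown", False))
    return expected

print(should_alert("xul.dll", "44E4EC8C2F41492B9369D6B9A059577C2"))  # True
print(should_alert("nvwgf2um.dll", "ABCD1234"))                      # False
print(should_alert("xul.dll", ""))                                   # False: no debug ID
```

Unknown modules default to "no symbols expected", so they are simply reported rather than alerted on.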
Oh, I primarily care about Windows right now. It might be interesting to run this for B2G as well, but that's a lower priority and might have slightly different requirements. I care only a little about Mac, and not at all about Linux because we're unlikely to have symbols for any useful OS or 3rd-party software on Linux.
OS: Linux → Windows 7
Hardware: x86_64 → All
Assignee: nobody → dmaher
Depends on: 976034
There is a query in bug 976034 that produces the data necessary to satisfy this request; however, as per conversations with Ted on IRC, it is quite heavy (on the order of hours to complete). Clearly this isn't as simple as "write a Nagios check", since running that query every 5 minutes would cause havoc, heh. Running the query once per day and then creating some sort of report would seem to be the only way to use that query sanely.

:bsmedberg has a jydoop job that, while effective, is "pretty heavyweight in general", and therefore falls into the same category as the SQL query above.

Alternatively, Ted suggests another potential approach: the processors are already generating this data (as part of the processed JSON), and it may be possible to extract and report on this data more efficiently than from the SQL database. Note the "filename" and "missing_symbols" elements of this example: https://gist.github.com/luser/9510714 . It may be possible to use this data somehow.

Laura notes that the above-noted JSON exists in HBase, but also in Postgres and Elasticsearch, the latter two being potentially very interesting (read: efficient) for report generation.

Investigation on-going.
(In reply to Daniel Maher [:phrawzty] from comment #2)
> There is a query in bug 976034 that produces the data necessary to satisfy
> this request; however, as per conversations with Ted on IRC, it is quite
> heavy (on the order of hours to complete). Clearly this isn't as simple as
> "write a Nagios check", since running that query every 5 minutes would cause
> havoc, heh. Running the query once per day and then creating some sort of
> report would seem to be the only way to use that query sanely.
>
> :bsmedberg has a jydoop job that, while effective, is "pretty heavyweight in
> general", and therefore falls into the same category as the SQL query above.
>
> Alternatively, Ted suggests another potential approach: the processors are
> already generating this data (as part of the processed JSON), and it may be
> possible to extract and report on this data more efficiently than from the
> SQL database. Note the "filename" and "missing_symbols" elements of this
> example: https://gist.github.com/luser/9510714 . It may be possible to use
> this data somehow.

We're looking at how to make this query faster; the data is totally unstructured right now, so there should be some easy wins. Worst case, we can run it once per day and save the output in a table which you could use.

> Laura notes that the above-noted JSON exists in HBase, but also in Postgres
> and Elasticsearch, the latter two being potentially very interesting (read:
> efficient) for report generation.
>
> Investigation on-going.
FWIW, I think you'd have better luck by having the processors write a simple log of missing symbols and aggregating those logs. After-the-fact database querying seems like overkill.
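The log-and-aggregate idea could be as simple as the following sketch (assumed log format: one `debug_file,debug_id` line per missing-symbol module; this is an illustration, not the actual processor output):

```python
# Sketch: each processor appends one line per missing-symbol module,
# and a periodic job tallies the lines across all processor logs.
from collections import Counter

# Hypothetical log lines collected from the processors.
log_lines = [
    "igd10iumd32.dll,1234ABCD",
    "igd10iumd32.dll,1234ABCD",   # duplicate from another crash report
    "nvwgf2um.dll,5678EFGH",
]

# Count occurrences of each (debug_file, debug_id) pair.
counts = Counter(tuple(line.split(",")) for line in log_lines)
print(counts.most_common(1))  # [(('igd10iumd32.dll', '1234ABCD'), 2)]
```

Aggregation then deduplicates for free, and the counts themselves are useful for prioritizing which symbols to chase.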
(In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> FWIW, I think you'd have better luck by having the processors write a simple
> log of missing symbols and aggregating those logs. After-the-fact database
> querying seems like overkill.

Good point! We could have the processors insert info about missing symbols into a table in Postgres (or wherever else we want, too).
(In reply to Daniel Maher [:phrawzty] from comment #2)
> There is a query in bug 976034 that produces the data necessary to satisfy
> this request; however, as per conversations with Ted on IRC, it is quite
> heavy (on the order of hours to complete). Clearly this isn't as simple as
> "write a Nagios check", since running that query every 5 minutes would cause
> havoc, heh. Running the query once per day and then creating some sort of
> report would seem to be the only way to use that query sanely.
>
> :bsmedberg has a jydoop job that, while effective, is "pretty heavyweight in
> general", and therefore falls into the same category as the SQL query above.
>
> Alternatively, Ted suggests another potential approach: the processors are
> already generating this data (as part of the processed JSON), and it may be
> possible to extract and report on this data more efficiently than from the
> SQL database. Note the "filename" and "missing_symbols" elements of this
> example: https://gist.github.com/luser/9510714 . It may be possible to use
> this data somehow.
>
> Laura notes that the above-noted JSON exists in HBase, but also in Postgres
> and Elasticsearch, the latter two being potentially very interesting (read:
> efficient) for report generation.
>
> Investigation on-going.

I agree with bsmedberg in comment 4 - we're looking for needles in the haystack here, so doing a huge query to pull them out doesn't make sense. Processors are in a good position to see that 'missing_symbols' is 'true' in the JSON, and to record just the info about those modules somewhere (a Postgres table would be quite easy to provide).

Ted, could we WONTFIX bug 976034 in favor of this bug? Sorry, I didn't notice before that it is currently set to block this one.
Flags: needinfo?(ted)
Flags: needinfo?(dmaher)
If you can get this output out in some format, I'm happy to make the missing symbols job use it. I don't care how you want to work the Bugzilla mechanics.

My only worry with producing this data on-demand is that it's going to wind up being a lot of data. We should do some exploratory queries here to figure out how much data we're talking about. Something like `select count(*) from modules where modules.missing_symbols = true` (although clearly that's not proper SQL) would give us an estimate of how many entries we'd have for a set of reports.
Flags: needinfo?(ted)
Flags: needinfo?(dmaher)
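The exploratory count Ted sketches above could look like the following, shown here against an in-memory SQLite database as a stand-in for the real store (the `modules` table and its columns are illustrative, not the actual Socorro schema):

```python
import sqlite3

# Stand-in for the real database; schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE modules (filename TEXT, debug_id TEXT, missing_symbols INTEGER)"
)
rows = [
    ("xul.dll", "AAAA1111", 0),
    ("igd10iumd32.dll", "BBBB2222", 1),
    ("nvwgf2um.dll", "CCCC3333", 1),
]
conn.executemany("INSERT INTO modules VALUES (?, ?, ?)", rows)

# The "proper SQL" version of the count from the comment above.
(count,) = conn.execute(
    "SELECT count(*) FROM modules WHERE missing_symbols = 1"
).fetchone()
print(count)  # 2
```

On Postgres the predicate would be `missing_symbols = true`; the shape of the query is the same.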
Blocks: 1091124
I believe that we can do this with a processor rule. Something along these lines:

MissingSymbols
--
for module in processed_crash['modules']:
    if module['missing_symbols']:
        cursor.execute('''
            INSERT INTO missing_symbols (debug_file, debug_id)
            VALUES (%s, %s)'
        '''INSERT INTO missing_symbols
Assignee: dmaher → rhelmer
Flags: needinfo?(sdeckelmann)
Flags: needinfo?(lars)
What I get for inserting code into bugzilla :P

MissingSymbols
--
for module in processed_crash['modules']:
    if module['missing_symbols']:
        debug_file = module['debug_file']
        debug_id = module['debug_id']
        cursor.execute('''
            INSERT INTO missing_symbols (debug_file, debug_id)
            VALUES (%s, %s)''', (debug_file, debug_id))
--

Open questions for you, Lars and Selena:

* is it OK to insert into Postgres in the _action() method of a processor rule?
* we expect a lot of duplicates; would it be better to upsert here, or to insert dupes and group them when doing daily reports?

See comment 4 for context on this.
Flags: needinfo?(sdeckelmann)
Flags: needinfo?(lars)
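On the upsert question: one way to keep the table duplicate-free at insert time is `INSERT ... ON CONFLICT DO NOTHING` against a unique key on (debug_file, debug_id). A sketch using SQLite (which supports the same clause as Postgres 9.5+; table name matches the proposal above, but this is an illustration, not the actual rule):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Unique key on the pair makes duplicate inserts no-ops.
conn.execute("""CREATE TABLE missing_symbols (
    debug_file TEXT,
    debug_id TEXT,
    PRIMARY KEY (debug_file, debug_id))""")

def record(debug_file, debug_id):
    # Insert once per (file, id) pair; later duplicates are ignored.
    conn.execute(
        "INSERT INTO missing_symbols VALUES (?, ?) "
        "ON CONFLICT (debug_file, debug_id) DO NOTHING",
        (debug_file, debug_id))

for _ in range(3):
    record("igd10iumd32.dll", "1234ABCD")

(count,) = conn.execute("SELECT count(*) FROM missing_symbols").fetchone()
print(count)  # 1
```

The trade-off versus inserting dupes and grouping later is losing the per-pair hit counts, which can themselves be useful for prioritization.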
More concrete proposal w/ simple test in https://gist.github.com/rhelmer/5113a29812708450c08e
Note that I have co-opted this bug somewhat, because we need a replacement for the modulelist hadoop job (which dumps all modules, but really we only need missing symbols). However, I believe we can help satisfy comment 0 by providing a table that is updated by the processors in real time.
I think the bits from comment 0 would be better off as a standalone service that drank from the firehose of the data feed you'll be producing here.
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/f7841cbcf90c35f0849a67de54ec6cc4cf04af01
bug 948644 - add rule to track missing symbols in postgres table, with help from lars on unittests

https://github.com/mozilla/socorro/commit/b3d76ac25e12bdd0501450b6b60a90dc9ce64a11
Merge pull request #2487 from rhelmer/bug948644-store-missing-symbols
bug 948644 - add rule to track missing symbols in postgres table, with
Target Milestone: --- → 111
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/bcca1a7cd549d7267d6d3417c61dd32d179de3c4
followup to bug 948644 - provide placeholder for all expected values

https://github.com/mozilla/socorro/commit/fbc9176346ce00a54ac43dda41748d1e6b47ce07
Merge pull request #2496 from rhelmer/bug948644-store-missing-symbols
followup to bug 948644 - provide placeholder for all expected values
I hadn't looked at your commits till just now, but I note that this will produce missing symbols for crashes on all platforms, whereas before we were limiting to just Windows. I don't think it's that big of a deal, since it's not that hard to filter by filename and the majority of our crashes are on Windows anyway.
So, since you morphed bsmedberg's bug to not exactly do what he wanted in comment 0 (but lay the groundwork for it), we should file a followup to actually build what he wants. As per comment 12, I think what would be useful would be to feed this data into a queue (I would suggest RabbitMQ but I hear we want to do away with that) and write a simple consumer that can handle the incoming data stream and figure out what to do with it.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #16)
> I hadn't looked at your commits till just now, but I note that this will
> produce missing symbols for crashes on all platforms, whereas before we were
> limiting to just Windows. I don't think it's that big of a deal, since it's
> not that hard to filter by filename and the majority of our crashes are on
> Windows anyway.

That is what I was thinking. I still need to generate daily reports based on the data we're collecting to replace modulelist.txt; going to file a separate bug for that.

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #17)
> So, since you morphed bsmedberg's bug to not exactly do what he wanted in
> comment 0 (but lay the groundwork for it), we should file a followup to
> actually build what he wants. As per comment 12, I think what would be
> useful would be to feed this data into a queue (I would suggest RabbitMQ but
> I hear we want to do away with that) and write a simple consumer that can
> handle the incoming data stream and figure out what to do with it.

I was leaving this bug open, but can file a separate bug if it makes things clearer/more actionable.

Would the (potentially large) number of duplicates be a problem if we were inserting this straight from the processor into a queue? If some delay is tolerable (hourly, for instance) then we could filter out at least some of these by doing a periodic SELECT on the missing_symbols table with a GROUP BY to remove duplicates within that hour bucket.

In comment 0 bsmedberg mentions starting with a daily batch for simplicity, with "a few hours" as ideal. I think it would be just as simple to start with hourly reports, if it's just a query against the missing_symbols table.
Status: NEW → ASSIGNED
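The hourly GROUP BY report described above could look like this, shown against in-memory SQLite as a stand-in (the `date_processed` column is an assumption; the table name matches the one discussed in this bug):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE missing_symbols (debug_file TEXT, debug_id TEXT, date_processed TEXT)"
)
rows = [
    ("xul.dll", "AAAA1111", "2014-12-01 10:05"),
    ("xul.dll", "AAAA1111", "2014-12-01 10:45"),  # duplicate within the hour
    ("nvwgf2um.dll", "BBBB2222", "2014-12-01 10:30"),
]
conn.executemany("INSERT INTO missing_symbols VALUES (?, ?, ?)", rows)

# Collapse duplicates within the hour bucket, keeping a hit count.
report = conn.execute("""
    SELECT debug_file, debug_id, count(*) AS hits
    FROM missing_symbols
    WHERE date_processed >= '2014-12-01 10:00'
      AND date_processed <  '2014-12-01 11:00'
    GROUP BY debug_file, debug_id
    ORDER BY hits DESC
""").fetchall()
print(report)  # [('xul.dll', 'AAAA1111', 2), ('nvwgf2um.dll', 'BBBB2222', 1)]
```

A downstream consumer would then only see each (debug_file, debug_id) pair once per hour, with a count indicating how hot it is.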
(In reply to Robert Helmer [:rhelmer] from comment #18)
> I was leaving this bug open, but can file a separate bug if it makes things
> clearer/more actionable.

I think so, conflating separate but related issues in a single bug gets messy.

> In comment 0 bsmedberg mentions starting with a daily batch for simplicity,
> with "a few hours" as ideal. I think it would be just as simple to start
> with hourly reports, if it's just a query against the missing_symbols table.

Yeah, that might be better, especially as a first implementation. Whatever is reading this data is still going to have to deal with duplicates at some point (the delay between when the report is generated and when it produces the missing symbols will ensure that), but we can fine-tune this a bit.
Target Milestone: 111 → 112
Blocks: 1106313
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Summary: Automatic monitoring and classification of missing symbols → record missing symbols
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #19)
> (In reply to Robert Helmer [:rhelmer] from comment #18)
> > I was leaving this bug open, but can file a separate bug if it makes things
> > clearer/more actionable.
>
> I think so, conflating separate but related issues in a single bug gets
> messy.

OK, filed bug 1106313 to track automatic monitoring/classification of missing symbols, and morphed this one to be about recording the missing symbols that the processor sees (the actual work done in this bug).
Whiteboard: [db migration]
Migration run on prod.