Closed
Bug 637680
Opened 14 years ago
Closed 14 years ago
Get top crashers for Firefox and Fennec where crash-stats are broken (linux, android)
Categories
(Socorro :: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: glandium, Assigned: rhelmer)
References
Details
Attachments (9 files, 1 obsolete file)
- 4.21 KB, text/plain
- 4.63 KB, text/plain
- 6.79 KB, patch (lars: review+)
- 394.01 KB, application/octet-stream
- 651.88 KB, text/plain
- 411.89 KB, application/octet-stream
- 395.76 KB, application/octet-stream
- 615.33 KB, text/plain
- 407.67 KB, application/octet-stream
I'll attach the two programs that can be used to fix up minidumps.
Reporter
Comment 1 • 14 years ago
Build with -I$(topsrcdir)/toolkit/crashreporter/google-breakpad/src
Just pass a bunch of minidumps on the command line, and it will modify them in place.
Reporter
Comment 2 • 14 years ago
Comment 3 • 14 years ago
The plan is to get these minidumps into a dev server (in bug 637678), where I'll run this tool on them, then we'll feed them into the Socorro staging server to generate topcrash lists.
Comment 4 • 14 years ago
How many dumps are we talking?
Could we:
- run a MapReduce job to pull each busted dump, fix it, and replace it in HBase
- insert all fixed dumps into the legacy processing queue
This would get the data up on prod.
Comment 5 • 14 years ago
Actually, after chatting with laura a bit on IRC, here is what I would offer for your consideration:
1. Create a Postgres query that can extract a list of submitted_timestamp that need to be fixed
2. Create a simple Python script that iterates over those ooids and talks to the hbaseClient object:
   - Call hbaseClient.get_dump(ooid)
   - Shell-exec the fixer program on the dump
   - Insert the dump back into HBase using a subset of the code in hbaseClient.put_json_dump()
   - Insert the ooid back into the legacy processing queue by calling hbaseClient.put_crash_report_indices(ooid, CurrentTimestamp, ['crash_reports_index_legacy_unprocessed_flag'])

Note that the current timestamp, in the same format as submitted_timestamp, should be used so that the entries to be reprocessed don't take priority over normal jobs.
The end result of this job, if run on a regular basis, is that we would update the record in HBase with a fixed copy of the dump file (the old one would still be present, but not visible to the normal Socorro system). The monitor would see these entries in the queue and, as long as it doesn't reject them as already processed, would send them back through the system.
There would be no load increase on the production HBase cluster to support this. If we attempted a MapReduce job, we'd have to tune and test it carefully to make sure it wouldn't mess things up. If this were tens of thousands of crashes per day, that might be worth it, but for a small volume this is a simple solution to implement.
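The pull/fix/push loop described above can be sketched in a few lines of Python. This is a hedged illustration, not the shipped Socorro code: `fix_dumps` and `fixer_cmd` are hypothetical names, and the `client` object is assumed to expose `get_dump` and a `put_fixed_dump` method along the lines discussed later in this bug.

```python
import os
import subprocess
import tempfile

def fix_dumps(client, ooids, fixer_cmd):
    """Pull each dump, run the fixer binary on it, and push it back."""
    fixed = []
    for ooid in ooids:
        dump = client.get_dump(ooid)
        # The fixer binaries modify dumps in place, so stage each dump in a
        # temporary file, run the fixer, and read the result back.
        fd, path = tempfile.mkstemp(suffix=".dmp")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(dump)
            subprocess.check_call([fixer_cmd, path])
            with open(path, "rb") as f:
                # put_fixed_dump is the write-back-and-requeue method
                # sketched in this bug; name and signature are assumptions.
                client.put_fixed_dump(ooid, f.read(),
                                      add_to_unprocessed_queue=True)
            fixed.append(ooid)
        finally:
            os.unlink(path)
    return fixed
```

The real script would additionally need to catch and log per-ooid failures so one bad dump doesn't abort the whole batch.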
Comment 6 • 14 years ago
Sorry, at the beginning, the first step should read:
Create a Postgres query that can extract a list of ooids that need to be fixed
Comment 7 • 14 years ago
Ted, do you want us to go ahead?
Comment 8 • 14 years ago
Daniel's proposal sounds fine to me. Let me know how I can help make this happen.
Updated • 14 years ago
Attachment #515926 -
Attachment mime type: text/x-csrc → text/plain
Updated • 14 years ago
Attachment #515927 -
Attachment mime type: text/x-csrc → text/plain
Comment 9 • 14 years ago
Rob: see comment 5 for the agreed procedure. We'll need two weeks' worth of dumps for the broken builds to get decent topcrasher info. You might need jberkus to run the query on prod PG for you for that part.
The other part is hacking up some python to follow the above steps, and running that on prod.
We really need to get this done today - it fell off the radar this week. Can you manage it?
Assignee: ted.mielczarek → rhelmer
Severity: normal → blocker
Assignee
Comment 10 • 14 years ago
Here's a count and example queries we'll be working with:
"""
breakpad=> select count(*) from reports where product = 'Firefox' and version = '4.0b11' or version = '4.0b12' and os_name = 'Linux' and date_processed > '2011-01-01';
count
---------
1233417
(1 row)
breakpad=> select count(*) from reports where product = 'Fennec' and version = '4.0b5';
 count
-------
  7851
(1 row)
"""
Went over this with ted on IRC; looks good, but let me know if anyone notices anything odd. I'm now working on the approach Daniel suggests in comment 5.
Status: NEW → ASSIGNED
Assignee
Comment 11 • 14 years ago
(In reply to comment #5)
> Insert the dump back into HBase using a subset of the code in
> hbaseClient.put_json_dump()
Daniel, can you expand on which part(s) of put_json_dump() we don't want?
> Insert the ooid back into the legacy processing queue by calling
> hbaseClient.put_crash_report_indices(ooid,CurrentTimestamp,['crash_reports_index_legacy_unprocessed_flag'])
> Note that the current timestamp in the same format as what is used for
> submitted_timestamp should be used so that the entries to be reprocessed don't
> take priority over normal jobs.
I think for this we could easily add an optional param to put_json_dump() to override submitted_timestamp, passing in the current time (or whatever we want instead). Let me know if I'm understanding correctly.
Comment 12 • 14 years ago
Basically, we only want the lines in put_json_dump() that write the dump; none of the metadata manipulation, index management, and such.
Something like this, with the one comment placeholder filled in:
@optional_retry_wrapper
def put_fixed_dump(self, ooid, dump, add_to_unprocessed_queue=True):
    """
    Update a crash report with a new dump file, optionally queuing it for processing
    """
    row_id = ooid_to_row_id(ooid)
    submitted_timestamp = # Python code for getting current timestamp in correct format
    columns = [
        ("raw_data:dump", dump)
    ]
    mutationList = [self.mutationClass(column=c, value=v)
                    for c, v in columns if v is not None]
    indices = []
    if add_to_unprocessed_queue:
        indices.append('crash_reports_index_legacy_unprocessed_flag')
    self.client.mutateRow('crash_reports', row_id, mutationList)  # unit test marker 233
    self.put_crash_report_indices(ooid, submitted_timestamp, indices)
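One way to fill the timestamp placeholder above is shown below. This is purely an illustration under the assumption that an ISO 8601 string is acceptable; the real code must match whatever format Socorro actually uses for submitted_timestamp, which is not spelled out in this bug.

```python
import datetime

def current_submitted_timestamp():
    # Assumption: submitted_timestamp is an ISO 8601 local-time string.
    # Verify against real submitted_timestamp values before using this.
    return datetime.datetime.now().isoformat()
```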
Assignee
Comment 13 • 14 years ago
This is based on comment 5 and comment 12 (thanks Daniel!)
Not sure if we want to keep it, but I set it up so we could trivially land this into Socorro and call this as a cron job if it's needed.
Building attachment 515926 [details] and 515927 with a Socorro checkout just needs:
make minidump_stackwalk
gcc -o minidump_hack-fennec -I google-breakpad/src/ minidump_hack-fennec.c
gcc -o minidump_hack-firefox_linux -I google-breakpad/src/ minidump_hack-firefox_linux.c
Names of the fixup commands and also the SQL queries are configurable, but "Fennec" and "Firefox Linux" are hardcoded in the config and the start script (hopefully we never need to expand this :))
Might be nice to make the fixup commands read from stdin and write to stdout so we don't need to touch the disk, but I'm not going to sweat that right now.
Attachment #517027 -
Flags: review?(lars)
Attachment #517027 -
Flags: feedback?(deinspanjer)
Assignee
Comment 14 • 14 years ago
We tested this on a single crash to start:
https://crash-stats.mozilla.com/report/index/30100333-b41e-4b2e-93fb-694472110220
I have the original dump, and the md5sum changed, but I'm not sure how else to verify it.
Who can help with this?
Reporter
Comment 15 • 14 years ago
(In reply to comment #14)
> We tested this on a single crash to start:
> https://crash-stats.mozilla.com/report/index/30100333-b41e-4b2e-93fb-694472110220
>
> I have the original dump, and the md5sum changed, but not sure how else to
> verify.
>
> Who can help with this?
Taking a look at the original raw dump vs. the new one should help. In the original, you should see 3 modules for Fennec libraries such as libxul.so, while in the new one you should see only two, with the first one covering the address space covered by the original first two. The resolved function names should look better too.
Reporter
Comment 16 • 14 years ago
In the crash report you link, the stack traces on other threads look almost normal, except for the ashmem (deleted) parts, which are not related to elfhack. Having the /proc/pid/maps output from the minidump would help there.
Reporter
Comment 17 • 14 years ago
Note the fixing behaviour is different on Linux: the original minidumps should have one module for each Firefox library, except each is too small. The fixup makes the module address space larger, so that it fits the actual address space used in the process.
Assignee
Comment 18 • 14 years ago
glandium has been helping test this in IRC; it looks good, so we're going to proceed with all Fennec 4.0b5 crashes. Doing a dry run now to make sure everything looks OK (processing the right number of crashes, calling the right binary).
There appears to be caching enabled on /rawdumps calls (which Apache rewrites to the socorro-api hostname); I imagine this is on the Zeus. Not sure if this matters.
Also, just a reminder that per comment 5 these will get inserted into the normal (not priority) queue for processing, so it'll take a while for processors to pick these up.
I should have a reasonable estimate for how long this will take once we have it running for real for a bit.
Assignee
Comment 19 • 14 years ago
(In reply to comment #10)
> Here's a count and example queries we'll be working with:
>
> """
> breakpad=> select count(*) from reports where product = 'Firefox' and version =
> '4.0b11' or version = '4.0b12' and os_name = 'Linux' and date_processed >
> '2011-01-01';
> count
> ---------
> 1233417
> (1 row)
Oops, this is wrong; we should be explicit about precedence here (thanks glandium):
breakpad=> select count(*) from reports where product = 'Firefox' and (version = '4.0b11' or version = '4.0b12') and os_name = 'Linux' and date_processed > '2011-01-01';
count
-------
7732
(1 row)
Assignee
Comment 20 • 14 years ago
Same as attachment 517027 [details] [diff] [review] plus:
* fix firefox SQL statement
* use /dev/shm for tmpfiles instead of disk
* catch/log exceptions in the pull/fix/push loop
Attachment #517027 -
Attachment is obsolete: true
Attachment #517027 -
Flags: review?(lars)
Attachment #517027 -
Flags: feedback?(deinspanjer)
Attachment #517061 -
Flags: review?(lars)
Attachment #517061 -
Flags: feedback?(deinspanjer)
Assignee
Comment 21 • 14 years ago
Assignee
Comment 22 • 14 years ago
started 2011-03-04 17:51:18
stopped 2011-03-04 18:02:55
Assignee
Comment 23 • 14 years ago
Assignee
Comment 24 • 14 years ago
(In reply to comment #16)
> In the crash report you link, the stack trace on other threads almost look
> normal, except for the ashmem (deleted) parts, which are not related to
> elfhack. Having the /proc/pid/maps output from the minidump would help, there.
This looks like a separate, pre-existing issue, per IRC.
Assignee
Comment 25 • 14 years ago
Expected number of OOIDs processed.
I need to step away for a little bit, will run this when I get back and can keep an eye on it.
Assignee
Comment 26 • 14 years ago
Assignee
Comment 27 • 14 years ago
Assignee
Comment 28 • 14 years ago
Comment on attachment 517078 [details]
firefox OOIDs modified
started 2011-03-04 20:59:10
stopped 2011-03-04 21:24:48
Reporter
Comment 29 • 14 years ago
Worked awesomely. The top crashers list seems not to be updated, though. And new crashes obviously are broken, too.
Updated • 14 years ago
Attachment #517061 -
Flags: review?(lars) → review+
Assignee
Comment 30 • 14 years ago
Fennec reports are fixed as of 2011-03-04 18:02:55, and Firefox as of 2011-03-04 21:24:48.
(In reply to comment #29)
> Worked awesomely. The top crashers list seems not to be updated, though. And
> new crashes obviously are broken, too.
We are looking into the top crashers issue now.
To run this on a regular basis, we can easily add it as a cron job in Socorro, but we should add a feature so the script keeps track of where it left off and doesn't fix crashes multiple times. (The bugzilla cron drops a timestamp into a pickled file; we could do something similar, perhaps using the last-fixed id from the reports table rather than a timestamp.)
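A minimal sketch of that checkpointing idea, storing the last-fixed reports.id as a plain integer rather than a pickled timestamp. The file path, format, and function names here are assumptions for illustration, not the implementation that eventually shipped:

```python
import os

# Assumed checkpoint location; the real cron job would make this configurable.
CHECKPOINT = "/var/lib/socorro/fixBrokenDumps.last_id"

def read_last_fixed_id(path=CHECKPOINT):
    """Return the last reports.id we fixed, or 0 on the first run."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0

def write_last_fixed_id(last_id, path=CHECKPOINT):
    """Persist the checkpoint atomically so a crash can't truncate it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(last_id))
    os.rename(tmp, path)  # atomic on POSIX filesystems
```

Each cron run would then query only `reports.id > read_last_fixed_id()` and call `write_last_fixed_id()` after a successful batch.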
Assignee
Comment 31 • 14 years ago
(In reply to comment #30)
> Fennec reports are fixed as of 2011-03-04 18:02:55, and Firefox as of
> 2011-03-04 21:24:48.
>
> (In reply to comment #29)
> > Worked awesomely. The top crashers list seems not to be updated, though. And
> > new crashes obviously are broken, too.
>
> We are looking into the top crashers issue now.
Each run of the TCBS cron job will look for reprocessed jobs up to 2 hours prior, so we should be able to keep up with a batch job such as the one proposed in comment 30, as long as it is run at least hourly.
However, to catch up the backlog, the most straightforward approach would be to delete rows from the top crash by signature (TCBS) table from the first crash processed that exhibits the problem ('2011-02-20 12:22:40') until '2011-03-04 21:24:48', and let the TCBS cron job rebuild based on the (now fixed) reports table.
We would expect this to take between 1 and 10 hours. This means the top crashers list for crashes before the start date above (Feb 20) would be unavailable; as each hour of work completes, though, that hour becomes immediately available.
Assignee
Comment 32 • 14 years ago
(In reply to comment #31)
> Each run of the TCBS cron job will look for reprocessed jobs up to 2 hours
> prior, so we should be able to keep up with a batch job such as the one
> proposed by comment 30, as long as it was run at least hourly.
Filed bug 639514 to follow up on this.
> However to catch up the backlog, the most straightforward way to do this would
> be to delete the top crash by signature (TCBS) table from the first crash
> processed which exhibits the problem ('2011-02-20 12:22:40') until '2011-03-04
> 21:24:48', and let the TCBS cron job rebuild based on the (now fixed) reports
> table.
Filed bug 639512 for this.
Assignee
Comment 33 • 14 years ago
Comment on attachment 517061 [details] [diff] [review]
script dump fix and re-insertion
Landed this, which is appropriate for a one-time fix (given an appropriate SQL query in the config file). Going to add the changes needed for running from cron in bug 639514:
Committed revision 2997.
Assignee
Comment 34 • 14 years ago
Backlog is caught up, top crashers lists updated, and a cron job is running hourly to fix broken crashes as they come in.
Top Crashes By Signature table was rebuilt in bug 639512
fixBrokenDumps cron job was installed in bug 639514
All work was completed by 2011-03-07 22:40:56 Pacific.
Please reopen if you see any problems, or mark the bug verified if everything looks OK.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Assignee
Updated • 14 years ago
Attachment #517061 -
Flags: feedback?(deinspanjer)
Updated • 13 years ago
Component: Socorro → General
Product: Webtools → Socorro