Closed Bug 1410580 Opened 7 years ago Closed 7 years ago

[skiplist] Properly differentiate SIGABRT crashes on macOS with a top frame of mach_msg_trap (bug 1409699)

Categories

(Socorro :: General, task, P1)

All
macOS
task

Tracking

(firefox57 fixed, firefox58 fixed)

RESOLVED FIXED
Tracking Status
firefox57 --- fixed
firefox58 --- fixed

People

(Reporter: spohl, Assigned: willkg)

References

Details

Attachments

(3 files)

SIGABRT content crashes on macOS used to get reported incorrectly and land in bug 1405151 because they all looked the same. Now that bug 1405151 is fixed, we're submitting proper crash reports with correct call stacks. Unfortunately, they're now all landing in bug 1409699 with a top crash frame of mach_msg_trap. I have verified that there are several different crashes (with different call stacks). Since we have never reported these crashes before, Socorro doesn't seem to be able to differentiate between them. I'm not sure how to begin to establish a proper skiplist. It seems like we would want to apply a skiplist that's similar to other crashes on macOS. Since this accounts for 44% or more of our content crashes on macOS (see bug 1405151 comment 0), I would appreciate any help to get this resolved quickly.
There are a lot of crash reports mentioned in those bugs. Can you list urls for crash reports that are good examples of the issue?
Flags: needinfo?(spohl.mozilla.bugs)
We basically need to be smarter about all of them. But I've noticed that two crashes seem to occur particularly often: One seems related to cubeb: https://crash-stats.mozilla.com/report/index/a2e5766e-5d9c-4972-967b-13f1e0171019 Another one has to do with shared surfaces: https://crash-stats.mozilla.com/report/index/2e177d22-8236-40dc-88f7-2e5b10171018 Once these are being filtered out correctly, it would be interesting to see if we have any further high-volume crashes.
Flags: needinfo?(spohl.mozilla.bugs)
If I add "mach_msg_trap" and "mach_msg" to the irrelevant list (it gets ignored and the signature proceeds to the next frame), then we end up with this: app@c29d818f7d3a:/app$ python -m socorro.signature a2e5766e-5d9c-4972-967b-13f1e0171019 2e177d22-8236-40dc-88f7-2e5b10171018 Crash id: a2e5766e-5d9c-4972-967b-13f1e0171019 Original: mach_msg_trap New: google_breakpad::ReceivePort::WaitForMessage | google_breakpad::CrashGenerationClient::RequestDumpForException Same?: False Crash id: 2e177d22-8236-40dc-88f7-2e5b10171018 Original: mach_msg_trap New: google_breakpad::ReceivePort::WaitForMessage | google_breakpad::CrashGenerationClient::RequestDumpForException Same?: False Does that look better?
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Bah--keep missing the needsinfo thing.
Flags: needinfo?(spohl.mozilla.bugs)
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #3) > If I add "mach_msg_trap" and "mach_msg" to the irrelevant list (it gets > ignored and the signature proceeds to the next frame), then we end up with > this: > > app@c29d818f7d3a:/app$ python -m socorro.signature > a2e5766e-5d9c-4972-967b-13f1e0171019 2e177d22-8236-40dc-88f7-2e5b10171018 > Crash id: a2e5766e-5d9c-4972-967b-13f1e0171019 > Original: mach_msg_trap > New: google_breakpad::ReceivePort::WaitForMessage | > google_breakpad::CrashGenerationClient::RequestDumpForException > Same?: False > > Crash id: 2e177d22-8236-40dc-88f7-2e5b10171018 > Original: mach_msg_trap > New: google_breakpad::ReceivePort::WaitForMessage | > google_breakpad::CrashGenerationClient::RequestDumpForException > Same?: False > > > Does that look better? Unfortunately, this still wouldn't differentiate between the two crashes. Don't we have a skiplist in place for other (non-SIGABRT) crashes? If I had to make a suggestion based on my very limited knowledge of what's currently in place for other crashes I would say that we should skip everything until we hit the first XUL frame that doesn't originate from toolkit/crashreporter/. This would give us these two top stack frames: 1: https://crash-stats.mozilla.com/report/index/a2e5766e-5d9c-4972-967b-13f1e0171019 Original: mach_msg_trap New: audiounit_stream_destroy(cubeb_stream*) 2: https://crash-stats.mozilla.com/report/index/2e177d22-8236-40dc-88f7-2e5b10171018 Original: mach_msg_trap New: mozilla::gl::SharedSurface_IOSurface::SharedSurface_IOSurface(RefPtr<MacIOSurface> const&, mozilla::gl::GLContext*, mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&, bool) Does this help?
Flags: needinfo?(spohl.mozilla.bugs)
That does help! Looking at those stacks, I think we'd need to ignore (add to the irrelevant list) the following: * google_breakpad::ReceivePort::WaitForMessage * google_breakpad::CrashGenerationClient::RequestDumpForException * google_breakpad::ExceptionHandler::SignalHandler * google_breakpad::ExceptionHandler::WriteMinidumpWithException That should be fine--there are similar things in the irrelevant list (ignore the frame altogether). Then we should also add these, which I'm less sure about: * _dispatch_barrier_sync_f_invoke * _dispatch_client_callout * glDrawArrays.* * gleCmdProcessor * glGenTextures_ExecThread * glIsTexture_Exec I looked at how often these show up in signatures over the last 6 months and I think this won't cause problems. We should also add the following things to the prefix list (keeps the frame and continue to the next one): * __assert_rtn * AppleIntelHD3000GraphicsGLDriver.* * _sigtramp There's an assert-like thing in the prefix list already and a lot of signatures have AppleIntelHD3000GraphicsGLDriver* in them, so it seems like something to keep in the signature. _sigtramp shows up in other signatures. That gets us to this: app@c2c14e5bba3a:/app$ python -m socorro.signature a2e5766e-5d9c-4972-967b-13f1e0171019 2e177d22-8236-40dc-88f7-2e5b10171018 Crash id: a2e5766e-5d9c-4972-967b-13f1e0171019 Original: mach_msg_trap New: _sigtramp | malloc | abort | __assert_rtn | audiounit_stream_destroy Same?: False Crash id: 2e177d22-8236-40dc-88f7-2e5b10171018 Original: mach_msg_trap New: _sigtramp | abort | AppleIntelHD3000GraphicsGLDriver@0x1fc55 | AppleIntelHD3000GraphicsGLDriver@0x2c2fee | mozilla::gl::SharedSurface_IOSurface::SharedSurface_IOSurface Same?: False I think that's pretty close to what you want. I ran signature generation on the last week or so of crashes that have a signature with "mach_msg_trap". I'll attach the .csv. That caused me to add this to the irrelevant list, too: * glFlush_ExecThread * _mach_msg_trap I'll throw that into a PR for Monday and have Ted sanity-check it.
This looks fantastic and will be a dramatic improvement over the status quo! Thank you!
So I think we can fix some of this to improve signatures for now, but I also think we need to fix the Breakpad client code to do better here. I filed bug 1410868 on that.
For the record, [@ mach_msg_trap ] is basically never a useful signature. It's just "this thread is waiting on something". (In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #6) > * google_breakpad::ReceivePort::WaitForMessage > * google_breakpad::CrashGenerationClient::RequestDumpForException > * google_breakpad::ExceptionHandler::SignalHandler > * google_breakpad::ExceptionHandler::WriteMinidumpWithException These are all fine to add to the irrelevant list--they'll never be part of a useful signature, they're just Breakpad internals. > That should be fine--there are similar things in the irrelevant list (ignore > the frame altogether). Then we should also add these, which I'm less sure > about: > > * _dispatch_barrier_sync_f_invoke > * _dispatch_client_callout > * glDrawArrays.* > * gleCmdProcessor > * glGenTextures_ExecThread > * glIsTexture_Exec These...are probably not things we want to skip. They all seem like valid functions that just happen to accidentally wind up in the stack between the crashing frames and the signal handler. If we need to add these to get better signatures, we should only do so until we fix the Breakpad code to do the right thing here. > * __assert_rtn > * AppleIntelHD3000GraphicsGLDriver.* > * _sigtramp These are probably fine. The best thing to do here, however, might be to put _sigtramp on the signature sentinel list. Looking at crash reports where the signature is currently [@ mach_msg_trap ] but _sigtramp isn't in the proto signature gives only a handful of crashes: https://crash-stats.mozilla.com/search/?proto_signature=%21~_sigtramp&signature=%3Dmach_msg_trap&product=Firefox&date=%3E%3D2017-10-16T12%3A28%3A00.000Z&date=%3C2017-10-23T12%3A28%3A00.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature Most of those seem to just be crashes where we're missing symbols for the system library that contains _sigtramp (libsystem_platform.dylib), and a few of them are not crashes, but cases where we wrote a minidump of a process for a hang or something like that and the "crashing thread" is simply spinning the event loop, so the top frame is waiting in mach_msg_trap.
Ted: Thank you so much--that was super helpful! In light of that, I adjusted the patch. Added to the signature_sentinel list (signature starts at these frames): * _sigtramp Added to irrelevant list (these frames don't show up in the signature): * google_breakpad::CrashGenerationClient::RequestDumpForException * google_breakpad::ExceptionHandler::SignalHandler * google_breakpad::ExceptionHandler::WriteMinidumpWithException * google_breakpad::ReceivePort::WaitForMessage * mach_msg * _mach_msg * mach_msg_trap * _mach_msg_trap Also added to the irrelevant list: * glFlush_ExecThread * glDrawArrays.* * AppleIntelHD3000GraphicsGLDriver@.* That AppleIntelHD3000GraphicsGLDriver@.* one might be a bad thing to add since it will lump multiple AppleIntelHD3000GraphicsGLDriver issues into the same bucket. We can revisit that later if warranted. Added to the prefix list (these show up in the signature and indicate signature generation should continue to the next frame): * __assert_rtn * _dispatch_barrier_sync_f_invoke * _dispatch_client_callout * _dispatch_source_merge_mach_msg_direct * gleCmdProcessor * glGenTextures_ExecThread * glIsTexture_Exec * _sigtramp * __sigtramp Now the example crashes look like this: app@59e4c7f769b6:/app$ python -m socorro.signature a2e5766e-5d9c-4972-967b-13f1e0171019 2e177d22-8236-40dc-88f7-2e5b10171018 Crash id: a2e5766e-5d9c-4972-967b-13f1e0171019 Original: mach_msg_trap New: _sigtramp | malloc | abort | __assert_rtn | audiounit_stream_destroy Same?: False Crash id: 2e177d22-8236-40dc-88f7-2e5b10171018 Original: mach_msg_trap New: _sigtramp | abort | gleCmdProcessor | _dispatch_client_callout | _dispatch_barrier_sync_f_invoke | glGenTextures_ExecThread | glIsTexture_Exec | mozilla::gl::SharedSurface_IOSurface::SharedSurface_IOSurface Same?: False I'll attach a new .csv of other crashes with the mach_msg_trap signature and how they change.
After some discussion on IRC, I just want to clarify: (In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #12) > Ted: Thank you so much--that was super helpful! > > In light of that, I adjusted the patch. > > Added to the signature_sentinel list (signature starts at these frames): > > * _sigtramp > > Added to irrelevant list (these frames don't show up in the signature): > > * google_breakpad::CrashGenerationClient::RequestDumpForException > * google_breakpad::ExceptionHandler::SignalHandler > * google_breakpad::ExceptionHandler::WriteMinidumpWithException > * google_breakpad::ReceivePort::WaitForMessage > * mach_msg > * _mach_msg > * mach_msg_trap > * _mach_msg_trap This is all good and fine. > Also added to the irrelevant list: > > * glFlush_ExecThread > * glDrawArrays.* > * AppleIntelHD3000GraphicsGLDriver@.* > > That AppleIntelHD3000GraphicsGLDriver@.* one might be a bad thing to add > since it will lump multiple AppleIntelHD3000GraphicsGLDriver issues into the > same bucket. We can revisit that later if warranted. > > Added to the prefix list (these show up in the signature and indicate > signature generation should continue to the next frame): > > * __assert_rtn > * _dispatch_barrier_sync_f_invoke > * _dispatch_client_callout > * _dispatch_source_merge_mach_msg_direct > * gleCmdProcessor > * glGenTextures_ExecThread > * glIsTexture_Exec These are specific to that one set of crashes and not directly related to the general issue here--"bad signatures from SIGABRT crashes in child processes". If spohl thinks this is the sensible thing to do for that set of crashes that's fine. > * _sigtramp > * __sigtramp I think these should be in the irrelevant list. They don't add any value.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #14) > > Also added to the irrelevant list: > > > > * glFlush_ExecThread > > * glDrawArrays.* > > * AppleIntelHD3000GraphicsGLDriver@.* > > > > That AppleIntelHD3000GraphicsGLDriver@.* one might be a bad thing to add > > since it will lump multiple AppleIntelHD3000GraphicsGLDriver issues into the > > same bucket. We can revisit that later if warranted. > > > > Added to the prefix list (these show up in the signature and indicate > > signature generation should continue to the next frame): > > > > * __assert_rtn > > * _dispatch_barrier_sync_f_invoke > > * _dispatch_client_callout > > * _dispatch_source_merge_mach_msg_direct > > * gleCmdProcessor > > * glGenTextures_ExecThread > > * glIsTexture_Exec > > These are specific to that one set of crashes and not directly related to > the general issue here--"bad signatures from SIGABRT crashes in child > processes". If spohl thinks this is the sensible thing to do for that set of > crashes that's fine. My goal was only to disambiguate the two crashes that seemed to occur most often (see comment 5). Since these frames are unique to one of the two crashes, I have no problem with leaving them as-is and not adding them to the prefix list.
Stephen: If we don't add those items to the prefix list, then signature generation stops at that frame. Then you end up with this: app@processor:/app$ python -m socorro.signature a2e5766e-5d9c-4972-967b-13f1e0171019 2e177d22-8236-40dc-88f7-2e5b10171018 Crash id: a2e5766e-5d9c-4972-967b-13f1e0171019 Original: mach_msg_trap New: malloc | abort | __assert_rtn Same?: False Crash id: 2e177d22-8236-40dc-88f7-2e5b10171018 Original: mach_msg_trap New: abort | gleCmdProcessor Same?: False In comment #2, you said you'd prefer the signature contain the first "first XUL frame that doesn't originate from toolkit/crashreporter" which this doesn't do. Are these signatures what you're looking for?
Flags: needinfo?(spohl.mozilla.bugs)
In comment 5 I said: "If I had to make a suggestion based on my very limited knowledge of what's currently in place for other crashes I would say that we should skip everything until we hit the first XUL frame that doesn't originate from toolkit/crashreporter/." This was meant as suggestion only. Ted is in a much better position to comment on what's currently being done for other crashes. As long as the generated signatures are unique to individual crash reasons, they can be assigned to the right team, product and/or engineers directly. I am a bit concerned that "malloc | abort | __assert_rtn" is too generic and may not be unique to one crash reason, but I can't tell for sure. Ted, what are your thoughts?
Flags: needinfo?(spohl.mozilla.bugs) → needinfo?(ted)
__assert_rtn should be on either the prefix or irrelevant list, signature generation should be allowed to go further than that.
(In reply to Julien Cristau [:jcristau] from comment #18) > __assert_rtn should be on either the prefix or irrelevant list, signature > generation should be allowed to go further than that. Yes, sorry, that's my mistake. `__assert_rtn` is an internal implementation detail of `assert`, so it should go in the irrelevant list. That should mean that (roughly): a2e5766e-5d9c-4972-967b-13f1e0171019 => [@ malloc | abort | audiounit_stream_destroy ] and (assuming we keep AppleIntelHD3000GraphicsGLDriver@.* in the irrelevant list): 2e177d22-8236-40dc-88f7-2e5b10171018 => [@ abort | glDrawArraysInstanced_STD_GL3Exec ] Both of which seem pretty reasonable. If the GL crashes wind up being a pain we can revisit them later.
Flags: needinfo?(ted)
Commits pushed to master at https://github.com/mozilla-services/socorro https://github.com/mozilla-services/socorro/commit/7a586aa07f1979eaba77dbdc355b9ef1504c529d fixes bug 1410580 - improve "mach_msg_trap" signatures https://github.com/mozilla-services/socorro/commit/29897e4814b090ab35ae8478c7647ab941b140e2 bug 1410580 - adjust mach_msg_trap signature fixing This moves some things around and adds _sigtramp to the sentinels list. This improves mach_msg_trap signatures without removing as many of the frames and takes Ted's thoughts into account. https://github.com/mozilla-services/socorro/commit/c907d2481f6234a671ec05a82ccfb9a4fead1555 bug 1410580 - More adjustments to scale back the overall adjustments https://github.com/mozilla-services/socorro/commit/95028e1ce9da15b7b0b2d95b969ecd33e67bff4f Merge pull request #4056 from willkg/1410580-mach_msg_trap fixes bug 1410580 - improve "mach_msg_trap" signatures
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
I landed this round of changes. We can hone the changes going forward if we need to. Stephen: After this goes to -prod today, does it help to reprocess the last week of mach_msg_trap-related crashes?
Flags: needinfo?(spohl.mozilla.bugs)
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #21) > I landed this round of changes. We can hone the changes going forward if we > need to. > > Stephen: After this goes to -prod today, does it help to reprocess the last > week of mach_msg_trap-related crashes? Yes, that would be very helpful! Thank you!
Flags: needinfo?(spohl.mozilla.bugs)
We pushed the code to -prod an hour ago or so and I reprocessed crashes with "mach_msg_trap" in the signature over the last week.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: