Closed
Bug 1410580
Opened 7 years ago
Closed 7 years ago
[skiplist] Properly differentiate SIGABRT crashes on macOS with a top frame of mach_msg_trap (bug 1409699)
Categories
(Socorro :: General, task, P1)
Tracking
(firefox57 fixed, firefox58 fixed)
RESOLVED
FIXED
People
(Reporter: spohl, Assigned: willkg)
References
Details
Attachments
(3 files)
SIGABRT content crashes on macOS used to get reported incorrectly and land in bug 1405151 because they all looked the same. Now that bug 1405151 is fixed, we're submitting proper crash reports with correct call stacks. Unfortunately, they're now all landing in bug 1409699 with a top crash frame of mach_msg_trap. I have verified that there are several different crashes (with different call stacks). Since we have never reported these crashes before, Socorro doesn't seem to be able to differentiate between them.
I'm not sure how to begin to establish a proper skiplist. It seems like we would want to apply a skiplist that's similar to other crashes on macOS. Since this accounts for 44% or more of our content crashes on macOS (see bug 1405151 comment 0), I would appreciate any help to get this resolved quickly.
Assignee | ||
Comment 1•7 years ago
|
||
There are a lot of crash reports mentioned in those bugs. Can you list urls for crash reports that are good examples of the issue?
Assignee | ||
Updated•7 years ago
|
Flags: needinfo?(spohl.mozilla.bugs)
Reporter | ||
Comment 2•7 years ago
|
||
We basically need to be smarter about all of them. But I've noticed that two crashes seem to occur particularly often:
One seems related to cubeb: https://crash-stats.mozilla.com/report/index/a2e5766e-5d9c-4972-967b-13f1e0171019
Another one has to do with shared surfaces: https://crash-stats.mozilla.com/report/index/2e177d22-8236-40dc-88f7-2e5b10171018
Once these are being filtered out correctly, it would be interesting to see if we have any further high-volume crashes.
Flags: needinfo?(spohl.mozilla.bugs)
Assignee | ||
Comment 3•7 years ago
|
||
If I add "mach_msg_trap" and "mach_msg" to the irrelevant list (it gets ignored and the signature proceeds to the next frame), then we end up with this:
app@c29d818f7d3a:/app$ python -m socorro.signature a2e5766e-5d9c-4972-967b-13f1e0171019 2e177d22-8236-40dc-88f7-2e5b10171018
Crash id: a2e5766e-5d9c-4972-967b-13f1e0171019
Original: mach_msg_trap
New: google_breakpad::ReceivePort::WaitForMessage | google_breakpad::CrashGenerationClient::RequestDumpForException
Same?: False
Crash id: 2e177d22-8236-40dc-88f7-2e5b10171018
Original: mach_msg_trap
New: google_breakpad::ReceivePort::WaitForMessage | google_breakpad::CrashGenerationClient::RequestDumpForException
Same?: False
Does that look better?
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Assignee | ||
Comment 4•7 years ago
|
||
Bah--keep missing the needsinfo thing.
Flags: needinfo?(spohl.mozilla.bugs)
Reporter | ||
Comment 5•7 years ago
|
||
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #3)
> If I add "mach_msg_trap" and "mach_msg" to the irrelevant list (it gets
> ignored and the signature proceeds to the next frame), then we end up with
> this:
>
> app@c29d818f7d3a:/app$ python -m socorro.signature
> a2e5766e-5d9c-4972-967b-13f1e0171019 2e177d22-8236-40dc-88f7-2e5b10171018
> Crash id: a2e5766e-5d9c-4972-967b-13f1e0171019
> Original: mach_msg_trap
> New: google_breakpad::ReceivePort::WaitForMessage |
> google_breakpad::CrashGenerationClient::RequestDumpForException
> Same?: False
>
> Crash id: 2e177d22-8236-40dc-88f7-2e5b10171018
> Original: mach_msg_trap
> New: google_breakpad::ReceivePort::WaitForMessage |
> google_breakpad::CrashGenerationClient::RequestDumpForException
> Same?: False
>
>
> Does that look better?
Unfortunately, this still wouldn't differentiate between the two crashes. Don't we have a skiplist in place for other (non-SIGABRT) crashes?
If I had to make a suggestion based on my very limited knowledge of what's currently in place for other crashes I would say that we should skip everything until we hit the first XUL frame that doesn't originate from toolkit/crashreporter/. This would give us these two top stack frames:
1:
https://crash-stats.mozilla.com/report/index/a2e5766e-5d9c-4972-967b-13f1e0171019
Original: mach_msg_trap
New: audiounit_stream_destroy(cubeb_stream*)
2:
https://crash-stats.mozilla.com/report/index/2e177d22-8236-40dc-88f7-2e5b10171018
Original: mach_msg_trap
New: mozilla::gl::SharedSurface_IOSurface::SharedSurface_IOSurface(RefPtr<MacIOSurface> const&, mozilla::gl::GLContext*, mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&, bool)
Does this help?
Flags: needinfo?(spohl.mozilla.bugs)
Assignee | ||
Comment 6•7 years ago
|
||
That does help!
Looking at those stacks, I think we'd need to ignore (add to the irrelevant list) the following:
* google_breakpad::ReceivePort::WaitForMessage
* google_breakpad::CrashGenerationClient::RequestDumpForException
* google_breakpad::ExceptionHandler::SignalHandler
* google_breakpad::ExceptionHandler::WriteMinidumpWithException
That should be fine--there are similar things in the irrelevant list (ignore the frame altogether). Then we should also add these, which I'm less sure about:
* _dispatch_barrier_sync_f_invoke
* _dispatch_client_callout
* glDrawArrays.*
* gleCmdProcessor
* glGenTextures_ExecThread
* glIsTexture_Exec
I looked at how often these show up in signatures over the last 6 months and I think this won't cause problems.
We should also add the following things to the prefix list (keeps the frame and continue to the next one):
* __assert_rtn
* AppleIntelHD3000GraphicsGLDriver.*
* _sigtramp
There's an assert-like thing in the prefix list already and a lot of signatures have AppleIntelHD3000GraphicsGLDriver* in them, so it seems like something to keep in the signature. _sigtramp shows up in other signatures.
That gets us to this:
app@c2c14e5bba3a:/app$ python -m socorro.signature a2e5766e-5d9c-4972-967b-13f1e0171019 2e177d22-8236-40dc-88f7-2e5b10171018
Crash id: a2e5766e-5d9c-4972-967b-13f1e0171019
Original: mach_msg_trap
New: _sigtramp | malloc | abort | __assert_rtn | audiounit_stream_destroy
Same?: False
Crash id: 2e177d22-8236-40dc-88f7-2e5b10171018
Original: mach_msg_trap
New: _sigtramp | abort | AppleIntelHD3000GraphicsGLDriver@0x1fc55 | AppleIntelHD3000GraphicsGLDriver@0x2c2fee | mozilla::gl::SharedSurface_IOSurface::SharedSurface_IOSurface
Same?: False
I think that's pretty close to what you want.
I ran signature generation on the last week or so of crashes that have a signature with "mach_msg_trap". I'll attach the .csv. That caused me to add this to the irrelevant list, too:
* glFlush_ExecThread
* _mach_msg_trap
I'll throw that into a PR for Monday and have Ted sanity-check it.
Assignee | ||
Comment 7•7 years ago
|
||
Reporter | ||
Comment 8•7 years ago
|
||
This looks fantastic and will be a dramatic improvement over the status quo! Thank you!
Assignee | ||
Comment 9•7 years ago
|
||
Comment 10•7 years ago
|
||
So I think we can fix some of this to improve signatures for now, but I also think we need to fix the Breakpad client code to do better here. I filed bug 1410868 on that.
Comment 11•7 years ago
|
||
For the record, [@ mach_msg_trap ] is basically never a useful signature. It's just "this thread is waiting on something".
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #6)
> * google_breakpad::ReceivePort::WaitForMessage
> * google_breakpad::CrashGenerationClient::RequestDumpForException
> * google_breakpad::ExceptionHandler::SignalHandler
> * google_breakpad::ExceptionHandler::WriteMinidumpWithException
These are all fine to add to the irrelevant list--they'll never be part of a useful signature, they're just Breakpad internals.
> That should be fine--there are similar things in the irrelevant list (ignore
> the frame altogether). Then we should also add these, which I'm less sure
> about:
>
> * _dispatch_barrier_sync_f_invoke
> * _dispatch_client_callout
> * glDrawArrays.*
> * gleCmdProcessor
> * glGenTextures_ExecThread
> * glIsTexture_Exec
These...are probably not things we want to skip. They all seem like valid functions that just happen to accidentally wind up in the stack between the crashing frames and the signal handler. If we need to add these to get better signatures, we should only do so until we fix the Breakpad code to do the right thing here.
> * __assert_rtn
> * AppleIntelHD3000GraphicsGLDriver.*
> * _sigtramp
These are probably fine.
The best thing to do here, however, might be to put _sigtramp on the signature sentinel list. Looking at crash reports where the signature is currently [@ mach_msg_trap ] but _sigtramp isn't in the proto signature gives only a handful of crashes:
https://crash-stats.mozilla.com/search/?proto_signature=%21~_sigtramp&signature=%3Dmach_msg_trap&product=Firefox&date=%3E%3D2017-10-16T12%3A28%3A00.000Z&date=%3C2017-10-23T12%3A28%3A00.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
Most of those seem to just be crashes where we're missing symbols for the system library that contains _sigtramp (libsystem_platform.dylib), and a few of them are not crashes, but cases where we wrote a minidump of a process for a hang or something like that and the "crashing thread" is simply spinning the event loop, so the top frame is waiting in mach_msg_trap.
Assignee | ||
Comment 12•7 years ago
|
||
Ted: Thank you so much--that was super helpful!
In light of that, I adjusted the patch.
Added to the signature_sentinel list (signature starts at these frames):
* _sigtramp
Added to irrelevant list (these frames don't show up in the signature):
* google_breakpad::CrashGenerationClient::RequestDumpForException
* google_breakpad::ExceptionHandler::SignalHandler
* google_breakpad::ExceptionHandler::WriteMinidumpWithException
* google_breakpad::ReceivePort::WaitForMessage
* mach_msg
* _mach_msg
* mach_msg_trap
* _mach_msg_trap
Also added to the irrelevant list:
* glFlush_ExecThread
* glDrawArrays.*
* AppleIntelHD3000GraphicsGLDriver@.*
That AppleIntelHD3000GraphicsGLDriver@.* one might be a bad thing to add since it will lump multiple AppleIntelHD3000GraphicsGLDriver issues into the same bucket. We can revisit that later if warranted.
Added to the prefix list (these show up in the signature and indicate signature generation should continue to the next frame):
* __assert_rtn
* _dispatch_barrier_sync_f_invoke
* _dispatch_client_callout
* _dispatch_source_merge_mach_msg_direct
* gleCmdProcessor
* glGenTextures_ExecThread
* glIsTexture_Exec
* _sigtramp
* __sigtramp
Now the example crashes look like this:
app@59e4c7f769b6:/app$ python -m socorro.signature a2e5766e-5d9c-4972-967b-13f1e0171019 2e177d22-8236-40dc-88f7-2e5b10171018
Crash id: a2e5766e-5d9c-4972-967b-13f1e0171019
Original: mach_msg_trap
New: _sigtramp | malloc | abort | __assert_rtn | audiounit_stream_destroy
Same?: False
Crash id: 2e177d22-8236-40dc-88f7-2e5b10171018
Original: mach_msg_trap
New: _sigtramp | abort | gleCmdProcessor | _dispatch_client_callout | _dispatch_barrier_sync_f_invoke | glGenTextures_ExecThread | glIsTexture_Exec | mozilla::gl::SharedSurface_IOSurface::SharedSurface_IOSurface
Same?: False
I'll attach a new .csv of other crashes with the mach_msg_trap signature and how they change.
Assignee | ||
Comment 13•7 years ago
|
||
Comment 14•7 years ago
|
||
After some discussion on IRC, I just want to clarify:
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #12)
> Ted: Thank you so much--that was super helpful!
>
> In light of that, I adjusted the patch.
>
> Added to the signature_sentinel list (signature starts at these frames):
>
> * _sigtramp
>
> Added to irrelevant list (these frames don't show up in the signature):
>
> * google_breakpad::CrashGenerationClient::RequestDumpForException
> * google_breakpad::ExceptionHandler::SignalHandler
> * google_breakpad::ExceptionHandler::WriteMinidumpWithException
> * google_breakpad::ReceivePort::WaitForMessage
> * mach_msg
> * _mach_msg
> * mach_msg_trap
> * _mach_msg_trap
This is all good and fine.
> Also added to the irrelevant list:
>
> * glFlush_ExecThread
> * glDrawArrays.*
> * AppleIntelHD3000GraphicsGLDriver@.*
>
> That AppleIntelHD3000GraphicsGLDriver@.* one might be a bad thing to add
> since it will lump multiple AppleIntelHD3000GraphicsGLDriver issues into the
> same bucket. We can revisit that later if warranted.
>
> Added to the prefix list (these show up in the signature and indicate
> signature generation should continue to the next frame):
>
> * __assert_rtn
> * _dispatch_barrier_sync_f_invoke
> * _dispatch_client_callout
> * _dispatch_source_merge_mach_msg_direct
> * gleCmdProcessor
> * glGenTextures_ExecThread
> * glIsTexture_Exec
These are specific to that one set of crashes and not directly related to the general issue here--"bad signatures from SIGABRT crashes in child processes". If spohl thinks this is the sensible thing to do for that set of crashes that's fine.
> * _sigtramp
> * __sigtramp
I think these should be in the irrelevant list. They don't add any value.
Reporter | ||
Comment 15•7 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #14)
> > Also added to the irrelevant list:
> >
> > * glFlush_ExecThread
> > * glDrawArrays.*
> > * AppleIntelHD3000GraphicsGLDriver@.*
> >
> > That AppleIntelHD3000GraphicsGLDriver@.* one might be a bad thing to add
> > since it will lump multiple AppleIntelHD3000GraphicsGLDriver issues into the
> > same bucket. We can revisit that later if warranted.
> >
> > Added to the prefix list (these show up in the signature and indicate
> > signature generation should continue to the next frame):
> >
> > * __assert_rtn
> > * _dispatch_barrier_sync_f_invoke
> > * _dispatch_client_callout
> > * _dispatch_source_merge_mach_msg_direct
> > * gleCmdProcessor
> > * glGenTextures_ExecThread
> > * glIsTexture_Exec
>
> These are specific to that one set of crashes and not directly related to
> the general issue here--"bad signatures from SIGABRT crashes in child
> processes". If spohl thinks this is the sensible thing to do for that set of
> crashes that's fine.
My goal was only to disambiguate the two crashes that seemed to occur most often (see comment 5). Since these frames are unique to one of the two crashes, I have no problem with leaving them as-is and not adding them to the prefix list.
Assignee | ||
Comment 16•7 years ago
|
||
Stephen: If we don't add those items to the prefix list, then signature generation stops at that frame. Then you end up with this:
app@processor:/app$ python -m socorro.signature a2e5766e-5d9c-4972-967b-13f1e0171019 2e177d22-8236-40dc-88f7-2e5b10171018
Crash id: a2e5766e-5d9c-4972-967b-13f1e0171019
Original: mach_msg_trap
New: malloc | abort | __assert_rtn
Same?: False
Crash id: 2e177d22-8236-40dc-88f7-2e5b10171018
Original: mach_msg_trap
New: abort | gleCmdProcessor
Same?: False
In comment #2, you said you'd prefer the signature contain the first "first XUL frame that doesn't originate from toolkit/crashreporter" which this doesn't do.
Are these signatures what you're looking for?
Flags: needinfo?(spohl.mozilla.bugs)
Reporter | ||
Comment 17•7 years ago
|
||
In comment 5 I said:
"If I had to make a suggestion based on my very limited knowledge of what's currently in place for other crashes I would say that we should skip everything until we hit the first XUL frame that doesn't originate from toolkit/crashreporter/."
This was meant as suggestion only. Ted is in a much better position to comment on what's currently being done for other crashes. As long as the generated signatures are unique to individual crash reasons, they can be assigned to the right team, product and/or engineers directly. I am a bit concerned that "malloc | abort | __assert_rtn" is too generic and may not be unique to one crash reason, but I can't tell for sure. Ted, what are your thoughts?
Flags: needinfo?(spohl.mozilla.bugs) → needinfo?(ted)
Comment 18•7 years ago
|
||
__assert_rtn should be on either the prefix or irrelevant list, signature generation should be allowed to go further than that.
Comment 19•7 years ago
|
||
(In reply to Julien Cristau [:jcristau] from comment #18)
> __assert_rtn should be on either the prefix or irrelevant list, signature
> generation should be allowed to go further than that.
Yes, sorry, that's my mistake. `__assert_rtn` is an internal implementation detail of `assert`, so it should go in the irrelevant list. That should mean that (roughly):
a2e5766e-5d9c-4972-967b-13f1e0171019 => [@ malloc | abort | audiounit_stream_destroy ]
and (assuming we keep AppleIntelHD3000GraphicsGLDriver@.* in the irrelevant list):
2e177d22-8236-40dc-88f7-2e5b10171018 => [@ abort | glDrawArraysInstanced_STD_GL3Exec ]
Both of which seem pretty reasonable. If the GL crashes wind up being a pain we can revisit them later.
Flags: needinfo?(ted)
Comment 20•7 years ago
|
||
Commits pushed to master at https://github.com/mozilla-services/socorro
https://github.com/mozilla-services/socorro/commit/7a586aa07f1979eaba77dbdc355b9ef1504c529d
fixes bug 1410580 - improve "mach_msg_trap" signatures
https://github.com/mozilla-services/socorro/commit/29897e4814b090ab35ae8478c7647ab941b140e2
bug 1410580 - adjust mach_msg_trap signature fixing
This moves some things around and adds _sigtramp to the sentinels list. This
improves mach_msg_trap signatures without removing as many of the frames and
takes Ted's thoughts into account.
https://github.com/mozilla-services/socorro/commit/c907d2481f6234a671ec05a82ccfb9a4fead1555
bug 1410580 - More adjustments to scale back the overall adjustments
https://github.com/mozilla-services/socorro/commit/95028e1ce9da15b7b0b2d95b969ecd33e67bff4f
Merge pull request #4056 from willkg/1410580-mach_msg_trap
fixes bug 1410580 - improve "mach_msg_trap" signatures
Updated•7 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 21•7 years ago
|
||
I landed this round of changes. We can hone the changes going forward if we need to.
Stephen: After this goes to -prod today, does it help to reprocess the last week of mach_msg_trap-related crashes?
Assignee | ||
Updated•7 years ago
|
Flags: needinfo?(spohl.mozilla.bugs)
Reporter | ||
Comment 22•7 years ago
|
||
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #21)
> I landed this round of changes. We can hone the changes going forward if we
> need to.
>
> Stephen: After this goes to -prod today, does it help to reprocess the last
> week of mach_msg_trap-related crashes?
Yes, that would be very helpful! Thank you!
Flags: needinfo?(spohl.mozilla.bugs)
Assignee | ||
Comment 23•7 years ago
|
||
We pushed the code to -prod an hour ago or so and I reprocessed crashes with "mach_msg_trap" in the signature over the last week.
Updated•7 years ago
|
Updated•7 years ago
|
tracking-firefox57:
? → ---
tracking-firefox58:
? → ---
You need to log in
before you can comment on or make changes to this bug.
Description
•