Closed Bug 1709658 Opened 3 years ago Closed 3 years ago

Add "mac_crash_info" to the "details" page of crash reports, and make it searchable

Categories

(Socorro :: General, task, P2)

All
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: smichaud, Assigned: willkg)

Attachments

(5 files, 1 obsolete file)

Patches have just been landed at bug 1577886 to add support for __crash_info data to Breakpad. Another is about to be landed at https://github.com/mozilla-services/minidump-stackwalk/pull/29. I'm opening this follow up bug to deal with two tasks:

  1. Make a summary of __crash_info data (if present) available on the "details" page of crash reports at https://crash-stats.mozilla.org/.

  2. Make this data searchable.

Once the GitHub pull request has landed, __crash_info data will be available on the "raw data" page of crash reports. But its format won't be particularly user-friendly, and of course it won't (yet) be searchable. Gabriele has pointed out that before we make either of these changes, we need to make sure this data doesn't contain user-sensitive information.

Apple's __crash_info data is completely undocumented, beyond a few references to it in the source code available at https://opensource.apple.com/. But it's only system modules that contain a __crash_info section. So the information in them can only be very low-level. For example I'd be surprised if it could contain URLs. I'd expect the concept of a URL to only be understandable by higher-level, Mozilla-specific code.

I have a lot of experience reverse-engineering macOS, so this is something I can try to check. I'll spend a few hours doing that and report back.

By the way, I only have the vaguest notion of "user-sensitive information". Is there a good definition of it somewhere, that I can rely on?

Here are my ideas of how the two tasks from comment #0 can be accomplished, once the problem of user-sensitive information is resolved.

I'd like the summary of __crash_info data on the "details" page to look something like this (from the output of minidump_stackwalk):

    Application-specific information:
     Module "/System/Library/Frameworks/Security.framework/Versions/A/Security":
      message: "CryptKit fatal error: Raise test exception from _pthread_cond_wait(1)"
     Module "/usr/lib/system/libsystem_c.dylib":
      message: "abort() called"
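A rough sketch of how such a summary could be rendered, assuming the stackwalker JSON carries a `records` list whose entries have a `module` field plus other fields (as in the sample later in this bug). The function name `format_mac_crash_info` is hypothetical and not part of Socorro or minidump_stackwalk.

```python
def format_mac_crash_info(info):
    """Render a mac_crash_info dict as an indented text summary."""
    lines = ["Application-specific information:"]
    for record in info.get("records", []):
        module = record.get("module", "")
        lines.append(f' Module "{module}":')
        for key, value in record.items():
            if key == "module":
                continue
            # Quote string fields; show numeric fields as-is.
            shown = f'"{value}"' if isinstance(value, str) else str(value)
            lines.append(f"  {key}: {shown}")
    return "\n".join(lines)


sample = {
    "num_records": 1,
    "records": [
        {
            "message": "abort() called",
            "module": "/usr/lib/system/libsystem_c.dylib",
        }
    ],
}
print(format_mac_crash_info(sample))
```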

I'd like the following fields (from mac_crash_info in the output of stackwalker) to be searchable:

    num_records
    message
    signature_string
    backtrace
    message2
    thread
    dialog_mode
    abort_cause

I'd also like the whole mac_crash_info field (all of its contents) to be searchable, like proto signature is.

See Also: → 1577886

For adding this data to the details page, I can take a pass at that and attach screenshots that we can iterate on.

Does the __crash_info data include argument data in the messages? For example, Java crash reports for exceptions that occur when manipulating strings include the string arguments in the message and that can contain urls being visited.

Examples of sensitive data that shouldn't be public:

  • personally identifiable information: names, phone numbers, addresses, email addresses, SSNs, drivers license cards, passport numbers, personal credentials, account numbers, passwords, urls of visited sites
  • sensitive data: exploitability information, anything Mozilla confidential, credentials

I don't know offhand if there's a list somewhere. Having one would be a good idea--I wrote up bug #1709688 to cover that.

For making the data searchable, generally, I don't make data in a crash report searchable unless it's useful to search. I'd need to know more about what questions users might be asking and how they'd be searching these fields.

For example, num_records doesn't seem interesting to search to me. What questions would an engineer have such that they're searching for crash reports that have some number of records?

Can you walk me through how you expect users to be looking at each field?

Also, I'm not sure how to take what's in the stackwalker output and convert it into that list of fields. I see num_records and for each record message and module. Where do the rest of them come from?

Flags: needinfo?(smichaud)

I don't have a good handle on all the information that can be included in __crash_info. But, like I said in comment #0, I'm confident it's all very low-level. For example I doubt it can ever contain a URL. This is helped by the fact that Gecko tends to do everything itself, and not rely on system code for any high-level stuff. I'll know more when I've finished my survey of the __crash_info sections in all the system modules that Firefox pulls in which have one. That's only about 50 modules. They shouldn't take too long to work through.

The only __crash_info section I'm already familiar with is the one in /System/Library/PrivateFrameworks/GPUSupport.framework/Libraries/libGPUSupportMercury.dylib. Code in its gpusGenerateCrashLog.cold.1() can write either of the following two messages to __crash_info's 'signature_string' field:

    Graphics kernel error: 0x%08x\\n

    Graphics hardware encountered an error and was reset: 0x%08x\\n

Where do the rest of them come from?

One place to see all the fields is in the code I added to ConvertProcessStateToJSON(), here.

I agree that num_records won't be interesting to most people. But I'm quite interested in it. The reason is that I don't know the largest number of records that's practically possible, or even the number to expect in a "typical" crash report.

The other fields are self-explanatory, I think. Any of them might contain critically useful information. Because Apple hasn't documented __crash_info, I don't really know what to expect. Aside from the research I've promised to do into those 50 modules I mentioned above, I have no way of finding out except to look at crash reports as they come in. Unless these fields are searchable, doing that will be like looking for a needle in a haystack.

You should probably treat all the "string" fields (message, signature_string, backtrace and message2) as free-form strings. Searching on them should mean finding out whether or not they contain a given substring. The other fields (thread, dialog_mode and abort_cause) are numeric -- at least Apple's very sparse documentation seems to indicate that.

Examples of sensitive data that shouldn't be public:

Thanks very much for these! Your list is very helpful.

Flags: needinfo?(smichaud)

(Following up comment #3)

Also, I think the entire mac_crash_info field should be searchable as a free-form string, like proto signature. You'd be trying to find out whether or not it contained a given substring.

I didn't mention the module field above, because I don't think it's important to be able to search on it individually. But I would like it to be considered part of the contents of mac_crash_info, for the purposes of searching on the whole field.

Grabbing this to work on in the next week or so.

Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: -- → P2

(Following up comment #0)

Apple's __crash_info data is completely undocumented, beyond a few references to it in the source code available at https://opensource.apple.com/. But it's only system modules that contain a __crash_info section. So the information in them can only be very low-level. For example I'd be surprised if it could contain URLs. I'd expect the concept of a URL to only be understandable by higher-level, Mozilla-specific code.

I have a lot of experience reverse-engineering macOS, so this is something I can try to check. I'll spend a few hours doing that and report back.

This is going to take me longer than I expected, because it's a lot more complicated than I expected. I may end up recommending that we not make public the __crash_info data from certain modules. But doing that prematurely may end up making it impossible to do the research required to find out whether or not information from those modules' __crash_info sections is too sensitive.

I'll have at least a preliminary report available either later today or sometime tomorrow.

Edit: It's going to be sometime tomorrow.

Here are my thoughts about this: I don't think that making the individual fields searchable is valuable, but it would be useful to be able to search for crashes that have / don't have this particular bit, and it would be nice to be able to do free form searches in the field as if it were a simple string. My reasoning is the following: this is a little bit like last_error_value on Windows: it's not something that matters in and by itself but it's interesting to know if all crashes under a given signature have the same (or similar) errors recorded there.

(In reply to Gabriele Svelto [:gsvelto] from comment #7)

My reasoning is the following: this is a little bit like last_error_value on Windows: it's not something that matters in and by itself but it's interesting to know if all crashes under a given signature have the same (or similar) errors recorded there.

Actually, the information in __crash_info is much more precisely targeted than last_error_value. It's usually written, just before an abort, to specify the reason for that abort. See for example gpusGenerateCrashLog.cold.1() from comment #3.

it would be nice to be able to do free form searches in the field as if it were a simple string.

By "in the field" do you mean the whole mac_crash_info field? If so, then I don't really object to your suggestion. With __crash_info basically undocumented, it's hard to tell which field to look in for information. We can be much more certain that important information will be found somewhere in mac_crash_info than we can that it will be found in any particular field within mac_crash_info. On the other hand, it'd be good to find out, over time, if some fields in mac_crash_info tend to be used for specific purposes.

So yes. Let's make it possible to do free form searches in the entire mac_crash_info field, and find out which crashes have or don't have data in particular fields within mac_crash_info.

Also, of course, we should be able to search for crashes that have or don't have a mac_crash_info field at all.

Here's my preliminary report.

I quickly found that I wouldn't have time to report on all the modules pulled in by Firefox that have __crash_info sections. So I concentrated on seven of them whose names indicate they are more likely to log user-sensitive information. Of these, I found only one that actually does:

    /usr/lib/libnetwork.dylib

It's used by Firefox's DNSResolver. And when it's affected by low-level errors, it can write external IP addresses to __crash_info. So we should probably prevent the public from seeing __crash_info data from this module.

At least for now, I think everything else should be reported without restrictions. I'll keep my eye on crash reports with __crash_info data, and so I assume will others. If I see user-sensitive information there, I'll open a new bug, mark it security-sensitive, and CC at least Gabriele and Will.

I created this list using output from my HookCase hook library from bug 1577886 comment #12. Beforehand, I re-enabled the commented-out code that traces Firefox's crash handling.

Attached file HookCase hook library I tested with (obsolete) —

Here's the HookCase hook library I used to test writing __crash_info data in the libnetwork.dylib and AccountsDaemon modules (as a patch on https://github.com/steven-michaud/HookCase/blob/master/HookLibraryTemplate/hook.mm).

(Following up comment #10)

    /usr/lib/libnetwork.dylib

It's used by Firefox's DNSResolver. And when it's affected by low-level errors, it can write external IP addresses to __crash_info. So we should probably prevent the public from seeing __crash_info data from this module.

I'm now much less confident that we need to prevent this module's __crash_info data from becoming public. I notice that the ip addresses in my logs never include pages that Firefox has visited -- including sites I'm quite sure I've never visited before, or at least for a very long time (so it's unlikely they're in some kind of cache).

All of the logged NWConcrete_nw_endpoint objects are created by calls to mozilla::net::GetAddrInfo() from here:

https://hg.mozilla.org/mozilla-central/file/f9bdd1b929f234f4defb8c2344c24d4e3b2547bc/netwerk/dns/nsHostResolver.cpp#l2232

Maybe we should consult someone who knows this code, and who could tell us whether any of the host addresses that pass through here are user-sensitive.

Revised version of the HookCase hook library from comment #12.

Attachment #9220910 - Attachment is obsolete: true

(Following up comment #13)

I'm now much less confident that we need to prevent this module's __crash_info data from becoming public. I notice that the ip addresses in my logs never include pages that Firefox has visited -- including sites I'm quite sure I've never visited before, or at least for a very long time (so it's unlikely they're in some kind of cache).

I've now figured out what was happening: I had DNS over HTTPS turned on. When I turned it off, I started seeing IP addresses for the sites I was visiting. (I also saw a lot more logging.)

So yes, we probably do need to prevent __crash_info data from the following module from becoming public:

    /usr/lib/libnetwork.dylib
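If per-module filtering were ever wanted, it could look roughly like the sketch below. This is only an illustration under assumptions: the `PROTECTED_MODULES` set and the `public_mac_crash_info` helper are hypothetical names, and nothing here describes how Socorro actually protects data.

```python
# Hypothetical list of modules whose __crash_info records stay private.
PROTECTED_MODULES = {"/usr/lib/libnetwork.dylib"}


def public_mac_crash_info(info, protected=PROTECTED_MODULES):
    """Return a copy of mac_crash_info without records from protected modules."""
    records = [
        record
        for record in info.get("records", [])
        if record.get("module") not in protected
    ]
    # Build a new dict so the original (protected) data is left untouched.
    return {**info, "num_records": len(records), "records": records}
```

Recomputing `num_records` for the public copy is a design choice in this sketch; keeping the original count would instead reveal that some records were withheld.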

Just to make things clear:

Crashes in /usr/lib/libnetwork.dylib of the kind that might cause user-sensitive information to be written to its __crash_info section are vanishingly rare. There have been none over the last six months, aside from the ones I myself triggered, using the HookCase hook library I've attached to this bug. (My crashes all have hook.dylib in the stack trace.)

https://crash-stats.mozilla.org/search/?proto_signature=~NWConcrete_nw_endpoint&platform=Mac%20OS%20X&date=%3E%3D2020-11-10T15%3A35%3A00.000Z&date=%3C2021-05-10T15%3A35%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

So we probably don't need to work out how to prevent __crash_info data from /usr/lib/libnetwork.dylib from becoming public before we start allowing mac_crash_info to appear in crash reports.

I'm continuing to investigate what can show up in the __crash_info sections of system modules pulled in by Firefox. So far I haven't discovered any more user-sensitive information. I'll post another report later today.

Edit: It'll be sometime tomorrow.

(Following up comment #16)

I broadened my search a bit and found two crashes (besides my own) in /usr/lib/libnetwork.dylib over the last six months that might cause user-sensitive information to be written to its __crash_info section. I'd still say they're "vanishingly rare", though.

https://crash-stats.mozilla.org/search/?proto_signature=~NWConcrete_nw_path&platform=Mac%20OS%20X&date=%3E%3D2020-11-10T18%3A14%3A00.000Z&date=%3C2021-05-10T18%3A14%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

I spent a bunch of time thinking about this. I can add the mac_crash_info to the Details tab--that's pretty straightforward.

The mac_crash_info structure has an array of structures in it--I can't break that up into parts that I can index and make searchable. Further, I can't index structures.

I think what I'm going to do is serialize the data as a string and then index that string and make it searchable. It's not great, but I think it'll give you something you can use to answer questions like:

  1. what are all the crash reports that have a mac_crash_info?
  2. what are all the crash reports that have "CryptKit" in the mac_crash_info?

Then we can iterate on that in the future.
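The serialize-then-search idea can be sketched as follows. This is a minimal illustration, not the Socorro implementation: the helper names are hypothetical, the report is assumed to be a plain dict with an optional `mac_crash_info` key, and a simple in-memory substring test stands in for whatever the search index does.

```python
import json


def serialize_mac_crash_info(info):
    """Serialize the structure to one string, or None if it is absent."""
    if not info:
        return None
    return json.dumps(info, indent=2, sort_keys=True)


def has_mac_crash_info(report):
    """Question 1: does this crash report have a mac_crash_info?"""
    return serialize_mac_crash_info(report.get("mac_crash_info")) is not None


def mac_crash_info_contains(report, needle):
    """Question 2: does the serialized mac_crash_info contain a substring?"""
    text = serialize_mac_crash_info(report.get("mac_crash_info"))
    return text is not None and needle in text
```

Because the whole structure becomes one string, field names and module paths are searchable along with the messages.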

I just triggered another of my CKRaise crashes (all of whose minidumps contain __crash_info data):

bp-a3ae783c-13a2-4ccf-8197-b7a840210511

But there isn't any mac_crash_info information in either the "details" page or the "raw data" page.

Will: How long will it take for your changes to work their way into public-facing systems?

Flags: needinfo?(willkg)

I merged a patch which automatically gets deployed to the staging site. You can see your crash here on the staging site:

https://crash-stats.allizom.org/report/index/a3ae783c-13a2-4ccf-8197-b7a840210511

This involved an index change, so the mac_crash_info field won't be searchable until Monday when the new index is created. Crash reports submitted after the new index is created will get indexed correctly and will be searchable.

In order for this to be available in our production environment, I need to do a prod deploy. I'll probably do that later this week. I hit issues with availability last week, so it's possible it may take me longer. I'll update the bug as things progress.

Flags: needinfo?(willkg)

Thanks for the info.

https://crash-stats.allizom.org/report/index/a3ae783c-13a2-4ccf-8197-b7a840210511

This looks fine to me. I notice that you've chosen to display mac_crash_info in the "details" page exactly as it's displayed in the "raw data" page. It takes up a bit more room that way than as minidump_stackwalk displays it. But it shows people exactly how to search on substrings of mac_crash_info. I assume that, once mac_crash_info becomes searchable, it will be possible to do a search like mac_crash_info contains '"num_records": 2'. Is that right?

    {
      "num_records": 2,
      "records": [
        {
          "message": "CryptKit fatal error: Raise test exception from _pthread_cond_wait(1)",
          "module": "/System/Library/Frameworks/Security.framework/Versions/A/Security"
        },
        {
          "message": "abort() called",
          "module": "/usr/lib/system/libsystem_c.dylib"
        }
      ]
    }

Yes. I had no specification for the structure, so I figured it's best to show it JSON encoded for now. Regarding searches, yes, I'm pretty sure that's right. I think we'll know more once stage creates a new index on Monday.

This report is about large, general-purpose system modules. After the first batch, these are the most likely to write user-sensitive information to their __crash_info sections. I worked through everything they might write there, and I didn't find anything that might be user-sensitive.

So it looks like Apple is quite careful about what it writes to __crash_info. I'll keep my eyes open. If I find problems, I'll open security-sensitive bugs about them. But I doubt I'll find anything. It seems like what I reported above about /usr/lib/libnetwork.dylib is very much the exception.

Unless something comes up, I don't plan on writing any more of these reports.

If it turns out to contain protected data, we can lock it down--we've got runbooks for that. I think you've done the due diligence and I feel comfortable with where things are. I really appreciate the work you've done on this!

You're most welcome! It'll be very good to have access to this new trove of Mac crash data.

I pushed the code to prod in bug #1711055. On Monday, a new index will get created and we should be able to search the mac_crash_info field. I'll keep this open and needinfo myself to verify that next week.

Flags: needinfo?(willkg)

Crashes with mac_crash_info have started to appear at https://crash-stats.mozilla.org/, all (so far) with the signature gpusGenerateCrashLog.cold.1:

https://crash-stats.mozilla.org/signature/?signature=__pthread_kill%20%7C%20abort%20%7C%20gpusGenerateCrashLog.cold.1&version=90.0a1&platform=Mac%20OS%20X&date=%3E%3D2021-05-13T17%3A10%3A00.000Z&date=%3C2021-05-14T14%3A53%3A00.000Z&_sort=-date#aggregations

But I've noticed that the "aggregate on" function doesn't work on mac_crash_info, though the option is available.

Is this something that will be resolved by having a new index? Or will getting this functionality require extra work, and maybe a new bug?

The "aggregate on" won't work until we have a new index because there's no data for the mac_crash_info field being indexed, yet.

Flags: needinfo?(willkg)

I've already found one puzzle, though. Here's a search for all crash reports with signatures containing "gpusGenerateCrashLog", on macOS and the 90.0a1 branch, created since 2021-05-13 05:01PM UTC (when comment #28's push to prod happened). Oddly, the results don't contain any of today's crashes:

https://crash-stats.mozilla.org/search/?signature=~gpusGenerateCrashLog&version=90.0a1&platform=Mac%20OS%20X&date=%3E%3D2021-05-13T17%3A01%3A00.000Z&date=%3C2021-05-17T16%3A16%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

(All these crash reports happen to contain mac_crash_info.)

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

Hrm... That's puzzling. Seems like there's some funny business with date stamps. If I push the end date of the original query to 6:00pm (18:00), then three crash reports show up, but they're all before 4:00pm (16:00).

I don't think this search issue is related to the new index getting created. I think it's more likely there's some timezone conversion happening somewhere that shouldn't be. It should get a new bug.

See Also: → 1711550

I wrote up bug #1711550 to cover the date filter issue with super search.

(Following up comment #31)

The index now includes __pthread_kill | abort | gpusGenerateCrashLog.cold.1 crashes with more than one kind of "graphics kernel error". So I reran my test of the "aggregate on" function. It works fine:

https://crash-stats.mozilla.org/signature/?mac_crash_info=%21__null__&signature=__pthread_kill%20%7C%20abort%20%7C%20gpusGenerateCrashLog.cold.1&date=%3E%3D2021-05-14T17%3A23%3A00.000Z&date=%3C2021-05-21T17%3A23%3A00.000Z#aggregations

Edit: To see the results you have to explicitly choose "aggregate on mac_crash_info".
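For anyone building such search links by hand, here is a small sketch using Python's standard library. The `!__null__` and `~` operators and the parameter names are copied from the URLs in this bug; using `~` on mac_crash_info specifically is an assumption, and `super_search_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode

BASE = "https://crash-stats.mozilla.org/search/"


def super_search_url(params):
    """Build a search URL; list values become repeated query keys."""
    # quote_via=quote encodes spaces as %20, matching the links above.
    return BASE + "?" + urlencode(params, doseq=True, quote_via=quote)


# Crash reports that have any mac_crash_info at all.
print(super_search_url({
    "mac_crash_info": "!__null__",
    "platform": "Mac OS X",
    "date": [">=2021-05-14T17:23:00.000Z", "<2021-05-21T17:23:00.000Z"],
}))

# Assumed: a "contains" search on the serialized field.
print(super_search_url({"mac_crash_info": "~CryptKit"}))
```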

I've just opened bug 1713355 for a followup issue.

I've opened bug 1714190 for another followup issue.

I've opened bug 1715812 for another followup issue.

Edit: This turns out to be an Apple bug, and not a problem with Socorro.
