Open Bug 1745224 Opened 4 years ago Updated 4 years ago

Occasional crash reports with NULL debug ids for Mozilla-specific modules, maybe only on content process

Component: Toolkit :: Crash Reporting (defect, P3)
Platform: Unspecified, macOS
Reporter: smichaud
Assignee: Unassigned

This bug is spun off from bug 1741287. It eventually became clear that bug 1741287 covers two distinct, unrelated bugs, as follows:

  1. The build process for official builds sometimes fails to copy Mozilla-specific symbols for that build to the symbol server.

  2. Sometimes the debug_id for Mozilla-specific modules (like XUL) is zeroed out in crash reports (perhaps only with content-process crashes). This prevents these modules from being symbolicated in those crash reports.

Bug 1741287 has now been DUPed to bug 1658531, which covers only issue #1. I'm opening this bug to deal with issue #2.

Neither issue exists on Windows. Issue #2 (this bug) may exist on Linux, though I haven't seen it. But I've seen many examples on macOS. So, at least for the moment, I'm limiting this report to macOS.

There's no way to search or facet on module debug ids, so it's difficult to search for crash reports that match this bug. At best you can search for crash reports that match either issue #1 or issue #2, and then look through them by hand to find examples of one or the other.

https://crash-stats.mozilla.org/search/?signature=~XUL%40&release_channel=nightly&release_channel=release&date=%3E%3D2021-12-02T17%3A36%3A00.000Z&date=%3C2021-12-09T17%3A36%3A00.000Z&_facets=signature&_facets=platform&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports
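For anyone who wants to adjust the date window or channels, here is a minimal sketch that rebuilds the search URL above from its decoded parameters. The parameter names and values are copied from that URL; the helper name `build_search_url` is my own, and I'm assuming the leading `~` on the signature value is the "contains" operator.

```python
from urllib.parse import urlencode

BASE = "https://crash-stats.mozilla.org/search/"

# Decoded query parameters, in the same order as the URL above.
# Repeated keys (release_channel, date, _facets, _columns) are kept as
# separate pairs so they are emitted once per value.
PARAMS = [
    ("signature", "~XUL@"),  # assumed: "~" means "contains"
    ("release_channel", "nightly"),
    ("release_channel", "release"),
    ("date", ">=2021-12-02T17:36:00.000Z"),
    ("date", "<2021-12-09T17:36:00.000Z"),
    ("_facets", "signature"),
    ("_facets", "platform"),
    ("_sort", "-date"),
    ("_columns", "date"),
    ("_columns", "signature"),
    ("_columns", "product"),
    ("_columns", "version"),
    ("_columns", "build_id"),
    ("_columns", "platform"),
]


def build_search_url(params):
    """Percent-encode the parameters and append the results-tab fragment."""
    # safe="~" keeps the operator prefix unescaped, as in the URL above.
    return f"{BASE}?{urlencode(params, safe='~')}#crash-reports"


search_url = build_search_url(PARAMS)
print(search_url)
```

Changing the two `date` entries is enough to move the one-week window; the results still have to be checked by hand for NULL debug ids.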

Here's the most recent example I can find for a mozilla-central nightly:

bp-756ce220-911c-41fc-b452-3b29c0211207

Here's a snippet from its "Modules" tab, showing Mozilla-specific modules with NULL debug ids:

      SafariSafeBrowsing   0.0.0.0       5513EB53B5393D8D801DE64175D3A03C0  SafariSafeBrowsing
    Ø libnss3.dylib        0.1.0.0       000000000000000000000000000000000  libnss3.dylib
    Ø libmozglue.dylib     0.1.0.0       000000000000000000000000000000000  libmozglue.dylib
      liblgpllibs.dylib    0.1.0.0       000000000000000000000000000000000  liblgpllibs.dylib
    Ø XUL                  0.1.0.0       000000000000000000000000000000000  XUL
      libcorecrypto.dylib  0.1000.140.4  D211160DE22F344080541F5824519C7F0  libcorecrypto.dylib

This may be a bug in Breakpad code. When I have time I'll look through it for possible causes.
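Since the "Modules" tab has to be checked by hand, a small script can at least flag the affected rows in text copied from it. This is only a sketch: it assumes the whitespace-separated column layout shown in the snippet above (optional leading marker, filename, version, debug id, debug filename), and the function name `null_debug_id_modules` is made up for illustration.

```python
MODULES_SNIPPET = """\
  SafariSafeBrowsing   0.0.0.0       5513EB53B5393D8D801DE64175D3A03C0  SafariSafeBrowsing
Ø libnss3.dylib        0.1.0.0       000000000000000000000000000000000  libnss3.dylib
Ø libmozglue.dylib     0.1.0.0       000000000000000000000000000000000  libmozglue.dylib
  liblgpllibs.dylib    0.1.0.0       000000000000000000000000000000000  liblgpllibs.dylib
Ø XUL                  0.1.0.0       000000000000000000000000000000000  XUL
  libcorecrypto.dylib  0.1000.140.4  D211160DE22F344080541F5824519C7F0  libcorecrypto.dylib
"""


def null_debug_id_modules(text):
    """Return the filenames of modules whose debug id consists only of zeros."""
    found = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the leading marker column when it is present.
        if fields and fields[0] == "Ø":
            fields = fields[1:]
        if len(fields) < 3:
            continue  # not a module row
        name, _version, debug_id = fields[:3]
        if set(debug_id) == {"0"}:
            found.append(name)
    return found


print(null_debug_id_modules(MODULES_SNIPPET))
```

On the snippet above this picks out libnss3.dylib, libmozglue.dylib, liblgpllibs.dylib and XUL, and leaves the two system libraries alone.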

See Also: → 1741287

The rust-minidump version of this crash report also reports "null" debug_ids, so this is unlikely to be a bug in the processor:

https://crash-stats.allizom.org/report/index/756ce220-911c-41fc-b452-3b29c0211207#tab-modules

(In reply to Steven Michaud [:smichaud] (Retired) from comment #0)

> Here's a snippet from its "Modules" tab, showing Mozilla-specific modules with NULL debug ids:
>
>       SafariSafeBrowsing   0.0.0.0       5513EB53B5393D8D801DE64175D3A03C0  SafariSafeBrowsing
>     Ø libnss3.dylib        0.1.0.0       000000000000000000000000000000000  libnss3.dylib
>     Ø libmozglue.dylib     0.1.0.0       000000000000000000000000000000000  libmozglue.dylib
>       liblgpllibs.dylib    0.1.0.0       000000000000000000000000000000000  liblgpllibs.dylib
>     Ø XUL                  0.1.0.0       000000000000000000000000000000000  XUL
>       libcorecrypto.dylib  0.1000.140.4  D211160DE22F344080541F5824519C7F0  libcorecrypto.dylib
>
> This may be a bug in Breakpad code. When I have time I'll look through it for possible causes.

Yeah, they're all empty. This definitely smells like a bug in the minidump writer. It's curious that it specifically affects Mozilla's libraries but not the system ones.

This bug may be limited to the content process. A quick search (of necessity by hand) didn't turn up any on the parent process.

Summary: Occasional crash reports with NULL debug ids for Mozilla-specific modules → Occasional crash reports with NULL debug ids for Mozilla-specific modules, maybe only on content process

Here's another interesting data point: I found a few content process crashes with these NULL debug_ids but no parent process crashes (and I've looked at several dozen). When we're doing out-of-process minidump generation and the module is not in the dyld shared cache, we bail out early if we can't find the ID. We may need to double-check the error handling there to be sure we're not bailing out too early.

[edit] Hadn't seen Steven's comment, glad we came to the same conclusion.

Here are some parent crashes for Firefox for release channel from 12/8/2021:

Here are some content crashes:

I threw together an STMO query. I don't know offhand what the access requirements are for it:

https://sql.telemetry.mozilla.org/queries/83220/source

(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #5)

> Here are some parent crashes for Firefox for release channel from 12/8/2021:
>
> Here are some content crashes:
>
>   • 7446f41e-b51e-4eaa-882e-92c2f0211208
>   • d0503ddc-9dc3-417f-8415-819570211208
>   • d284750c-42d5-4e7c-a98f-d8dcc0211208

None of these are missing Mozilla-specific symbols, or have NULL debug ids for Mozilla-specific modules.

> I threw together an STMO query. I don't know offhand what the access requirements are for it:
>
> https://sql.telemetry.mozilla.org/queries/83220/source

I don't seem to have permission to see these results (or to perform the query). If you're able, in this custom query, to search on module debug ids, I'd specify the following search criteria (to be ANDed together):

  1. Signature contains "XUL@"

  2. Release channel is "release" or "nightly"

  3. Product is not "SeaMonkey"

  4. XUL module debug id contains "000000000000000000000000000000000"

SeaMonkey needs to be excluded because its build process never copies Mozilla-specific symbols to the symbol server.
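To make the ANDed criteria concrete, here is a rough sketch of them as a predicate over a crash record. The record layout (a dict with `signature`, `release_channel`, `product`, and a `modules` list of `filename`/`debug_id` entries) is hypothetical and is not the actual STMO table schema; the sample records are made up purely to exercise the filter.

```python
NULL_DEBUG_ID = "0" * 33  # the all-zero debug id from criterion 4


def matches_issue_2(report):
    """Apply the four search criteria above, ANDed together."""
    xul_debug_ids = [
        module["debug_id"]
        for module in report["modules"]
        if module["filename"] == "XUL"
    ]
    return (
        "XUL@" in report["signature"]                              # 1
        and report["release_channel"] in {"release", "nightly"}    # 2
        and report["product"] != "SeaMonkey"                       # 3
        and any(NULL_DEBUG_ID in d for d in xul_debug_ids)         # 4
    )


# Made-up records, for illustration only.
affected = {
    "signature": "XUL@0x1234",
    "release_channel": "release",
    "product": "Firefox",
    "modules": [{"filename": "XUL", "debug_id": NULL_DEBUG_ID}],
}
seamonkey = dict(affected, product="SeaMonkey")

print(matches_issue_2(affected))   # True
print(matches_issue_2(seamonkey))  # False
```

A report that only matches issue #1 (symbols missing from the symbol server) would still carry a real XUL debug id, so criterion 4 is what separates the two issues.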

Thanks Will! I've narrowed down the query to only macOS crashes with a NULL XUL debug id, and this is what I get: https://sql.telemetry.mozilla.org/queries/83221

They're all content crashes, save for a handful coming from a single machine. The assertion message in those parent process minidumps points to a potentially corrupted Firefox installation, so given that they all come from a single user, I'm fairly convinced this is a content-specific issue.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #6)

>   2. Release channel is "release" or "nightly"
>
>   3. Product is not "SeaMonkey"

I haven't added those but all crashes I found come from Firefox and are on the release channel.

The severity field is not set for this bug.
:gsvelto, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(gsvelto)
Severity: -- → S3
Flags: needinfo?(gsvelto)
Priority: -- → P3