Closed Bug 1752970 Opened 2 years ago Closed 1 year ago

Crash in [@ nsHtml5StreamParser::nsHtml5StreamParser]

Categories

(Core :: DOM: HTML Parser, defect, P5)

RESOLVED WORKSFORME

People

(Reporter: gsvelto, Unassigned)

Details

(Keywords: crash, Whiteboard: [likely a CPU bug plus failure to update microcode])

Crash Data

Crash report: https://crash-stats.mozilla.org/report/index/dd696029-1a5c-4345-a2f4-c89730220131

Reason: SIGSEGV / SEGV_MAPERR

Top 10 frames of crashing thread:

0 libxul.so nsHtml5StreamParser::nsHtml5StreamParser parser/html/nsHtml5StreamParser.cpp:236
1 libxul.so nsHtml5Parser::MarkAsNotScriptCreated parser/html/nsHtml5Parser.cpp:536
2 libxul.so nsHTMLDocument::StartDocumentLoad dom/html/nsHTMLDocument.cpp:378
3 libxul.so nsContentDLF::CreateInstance layout/build/nsContentDLF.cpp:123
4 libxul.so nsDocShell::CreateContentViewer docshell/base/nsDocShell.cpp:7683
5 libxul.so nsDSURIContentListener::DoContent docshell/base/nsDSURIContentListener.cpp:186
6 libxul.so nsDocumentOpenInfo::TryContentListener uriloader/base/nsURILoader.cpp:632
7 libxul.so nsDocumentOpenInfo::OnStartRequest uriloader/base/nsURILoader.cpp:155
8 libxul.so nsBaseChannel::OnStartRequest netwerk/base/nsBaseChannel.cpp:819
9 libxul.so nsInputStreamPump::OnInputStreamReady netwerk/base/nsInputStreamPump.cpp:371

Not a new crash, but it seems to be gaining significant volume on Nightly. The crash is happening at parser/html/nsHtml5StreamParser.cpp:236 (frame 0 above), which suggests that nsHtml5Module::GetStreamParserThread() returned NULL. This might be related to bug 1642086 comment 2.

(In reply to Gabriele Svelto [:gsvelto] from comment #0)

which suggests that nsHtml5Module::GetStreamParserThread() returned NULL.

Indeed.

All but two crashes are on Linux.

All are 64-bit.

A suspiciously high proportion is ALT 8 SP Workstation (cliff).

The crashes are clustered to Coffee Lake and Atom Z36xxx/Z37xxx, which hints at a CPU bug and Linux configurations that don't load microcode updates into the CPU.

Most crashes are near startup but still late enough that normal startup should have completed.

We don't seem to get microcode versions in Linux crash reports, but I think there are enough indications to dismiss this as a CPU bug that's probably remedied by a microcode update that ALT Linux probably isn't applying.
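For reference, on x86 Linux the microcode revision currently loaded by the kernel appears in the `microcode` field of `/proc/cpuinfo`, next to `cpu family`, `model` and `stepping`. Below is a minimal Python sketch of reading those fields. The function name `parse_cpuinfo` and the sample text are hypothetical and are not part of the crash reporter.

```python
def parse_cpuinfo(text):
    """Return family, model, stepping and microcode of the first CPU listed."""
    wanted = {
        "cpu family": "family",
        "model": "model",
        "stepping": "stepping",
        "microcode": "microcode",
    }
    info = {}
    for line in text.splitlines():
        if not line.strip():
            if info:
                break  # a blank line ends the first processor's entry
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        name = wanted.get(key.strip())
        if name is not None:
            # int(..., 0) accepts both decimal ("158") and hex ("0xea")
            info[name] = int(value.strip(), 0)
    return info


# Hypothetical sample in the /proc/cpuinfo layout, not taken from a real report.
SAMPLE = (
    "processor\t: 0\n"
    "vendor_id\t: GenuineIntel\n"
    "cpu family\t: 6\n"
    "model\t\t: 158\n"
    "model name\t: Hypothetical CPU\n"
    "stepping\t: 10\n"
    "microcode\t: 0xea\n"
    "\n"
    "processor\t: 1\n"
    "microcode\t: 0xff\n"
)

info = parse_cpuinfo(SAMPLE)
print(info["model"], hex(info["microcode"]))
```

On a real machine the same function can be fed the contents of `/proc/cpuinfo`, for example `parse_cpuinfo(open("/proc/cpuinfo").read())`.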

This might be related to bug 1642086 comment 2.

What relation do you mean?

Aside: We could remove StaticPrefs::html5_offmainthread(). I'm not aware of anyone flipping the pref for any legitimate reason anyway.

(In reply to Henri Sivonen (:hsivonen) from comment #1)

The crashes are clustered to Coffee Lake and Atom Z36xxx/Z37xxx, which hints at a CPU bug and Linux configurations that don't load microcode updates into the CPU.

Most crashes are near startup but still late enough that normal startup should have completed.

We don't seem to get microcode versions in Linux crash reports, but I think there are enough indications to dismiss this as a CPU bug that's probably remedied by a microcode update that ALT Linux probably isn't applying.

The data is in the dump but we don't surface it, see bug 1320921.

(In reply to Henri Sivonen (:hsivonen) from comment #1)

What relation do you mean?

This is happening on the main thread. IIUC, that object might be freed on the parser thread, so I thought we might be getting a NULL pointer because it was cleared on the parser thread.

Aside: We could remove StaticPrefs::html5_offmainthread(). I'm not aware of anyone flipping the pref for any legitimate reason anyway.

Good point.

(In reply to Gabriele Svelto [:gsvelto] from comment #2)

(In reply to Henri Sivonen (:hsivonen) from comment #1)

The crashes are clustered to Coffee Lake and Atom Z36xxx/Z37xxx, which hints at a CPU bug and Linux configurations that don't load microcode updates into the CPU.

Most crashes are near startup but still late enough that normal startup should have completed.

We don't seem to get microcode versions in Linux crash reports, but I think there are enough indications to dismiss this as a CPU bug that's probably remedied by a microcode update that ALT Linux probably isn't applying.

The data is in the dump but we don't surface it, see bug 1320921.

I'm leaving this open in order to have a bug number to attribute the stacks to, but I'm treating this as a WONTFIX given the information available so far. Hence, lowering severity and priority.

(In reply to Henri Sivonen (:hsivonen) from comment #1)

What relation do you mean?

This is happening on the main thread. IIUC, that object might be freed on the parser thread, so I thought we might be getting a NULL pointer because it was cleared on the parser thread.

nsHtml5StreamParser is supposed to be freed on the main thread, always. Bug 1642086 relates to the mechanism that handles the freeing on the main thread.

Severity: S2 → S4
Priority: -- → P5
Whiteboard: [likely a CPU bug plus failure to update microcode]

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 10 content process crashes on beta
  • Top 5 desktop browser crashes on Linux on beta

:hsivonen, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(hsivonen)
Keywords: topcrash

(In reply to Release mgmt bot [:suhaib / :marco/ :calixte] from comment #4)

:hsivonen, could you consider increasing the severity of this top-crash bug?

Continuing to assume that this crash is due to failure to apply CPU microcode updates on Linux.

Flags: needinfo?(hsivonen)
Keywords: stalled

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 5 desktop browser crashes on Linux on beta

:hsivonen, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(hsivonen)
Keywords: topcrash

Bug 1791728 would make it easy to confirm the hypothesis at a glance.

I see a lot of crashes with Intel family 6 model 158 (decimal), which by now should be at least at microcode version 0xEC, but the crashes all show earlier microcode. This supports the hypothesis that this crash is due to failure to apply microcode updates.
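The comparison described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CPU string is assumed to look like `family 6 model 158 stepping 10`, the figure 158 is assumed to be the model number within family 6, the microcode value is assumed to be a hex string, and `MIN_MICROCODE` and `is_microcode_outdated` are hypothetical names. Only the 0xEC threshold comes from the comment itself.

```python
import re

# Assumed minimum revision per (family, model). The 0xEC value for 158 is the
# figure quoted in the comment; reading 158 as "family 6 model 158" is an assumption.
MIN_MICROCODE = {(6, 158): 0xEC}


def is_microcode_outdated(cpu_info, microcode, minimums=MIN_MICROCODE):
    """Return True or False, or None if the CPU is unknown or unparsable."""
    match = re.search(r"family (\d+) model (\d+)", cpu_info)
    if match is None:
        return None
    key = (int(match.group(1)), int(match.group(2)))
    if key not in minimums:
        return None
    # int(..., 16) accepts the value with or without a leading "0x"
    return int(microcode, 16) < minimums[key]


print(is_microcode_outdated("family 6 model 158 stepping 10", "0xea"))
```

Returning `None` for CPUs that are not in the table keeps "unknown" separate from "up to date", which matters when most reports come from a handful of models.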

gsvelto, now that crash stats has CPU microcode data for Linux, how do I see the CPU info and CPU microcode fields in a joined table so that I can systematically see what CPU models the microcode versions apply to?

Flags: needinfo?(hsivonen) → needinfo?(gsvelto)

You can use this query that adds two columns to the reports table with CPU info values and microcode versions. The majority of the crashes are indeed coming from Intel family 6 model 158 machines with microcode versions lower than 0xEC.
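Once the two columns have been exported, the joined view can also be summarised offline. The sketch below counts reports per (CPU info, microcode) pair. It assumes each report is a dict and that the fields are called `cpu_info` and `cpu_microcode_version`; both names and the `tabulate` helper are assumptions for illustration, and the sample rows are made up.

```python
from collections import Counter


def tabulate(reports):
    """Count reports per (cpu_info, microcode) pair, most frequent first."""
    counts = Counter(
        (r.get("cpu_info", "unknown"), r.get("cpu_microcode_version", "unknown"))
        for r in reports
    )
    return counts.most_common()


# Made-up rows, only to show the shape of the input.
reports = [
    {"cpu_info": "family 6 model 158 stepping 10", "cpu_microcode_version": "0xea"},
    {"cpu_info": "family 6 model 158 stepping 10", "cpu_microcode_version": "0xea"},
    {"cpu_info": "family 6 model 55 stepping 8", "cpu_microcode_version": "0x10"},
    {},
]

for (cpu, microcode), count in tabulate(reports):
    print(f"{count:4d}  {cpu}  {microcode}")
```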

Flags: needinfo?(gsvelto)

(In reply to Henri Sivonen (:hsivonen) from comment #1)

Aside: We could remove StaticPrefs::html5_offmainthread(). I'm not aware of anyone flipping the pref for any legitimate reason anyway.

Bug 1801862. It might perturb things enough to make this go away. Or it might just change the crash signature.

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

(In reply to Henri Sivonen (:hsivonen) from comment #10)

Bug 1801862. It might perturb things enough to make this go away. Or it might just change the crash signature.

This has now landed, but after the bot remark in the previous comment.

Looks like this simply went away after 108.0rc2. Perhaps a compiler optimization was perturbed enough not to trigger the CPU bug anymore?

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → WORKSFORME

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit auto_nag documentation.

Keywords: stalled