1793410 - Crash in [@ mozilla::RestyleManager::ElementStateChanged], due to diagnostic assertion failing: MOZ_DIAGNOSTIC_ASSERT(!mInStyleRefresh)

Gabriele Svelto [:gsvelto]

Reporter

Description

•

3 years ago

Crash report: https://crash-stats.mozilla.org/report/index/7e69a3c4-4632-4e86-8702-67fdf0221003

MOZ_CRASH Reason: MOZ_DIAGNOSTIC_ASSERT(!mInStyleRefresh)

Top 10 frames of crashing thread:

0 XUL mozilla::RestyleManager::ElementStateChanged layout/base/RestyleManager.cpp
0 XUL mozilla::PresShell::ElementStateChanged layout/base/PresShell.cpp:4489
1 XUL mozilla::dom::Document::ElementStateChanged dom/base/Document.cpp:8059
1 XUL mozilla::dom::Element::UpdateState dom/base/Element.cpp:386
2 XUL mozilla::dom::HTMLInputElement::OnValueChanged dom/html/HTMLInputElement.cpp:6695
2 XUL mozilla::TextControlState::SetValue dom/html/TextControlState.cpp:2744
3 XUL mozilla::TextControlState::SetValue dom/html/TextControlState.h:283
3 XUL mozilla::TextControlState::UnbindFromFrame dom/html/TextControlState.cpp:2507
4 XUL nsTextControlFrame::DestroyFrom layout/forms/nsTextControlFrame.cpp:148
5 XUL nsLineBox::DeleteLineList layout/generic/nsLineBox.cpp:387

I'm not sure what's going on but this appears to be a recent regression, the first affected build ID is 20220906213903.

Emilio Cobos Álvarez (:emilio)

Assignee

Comment 1

•

3 years ago

So this is one of those times where the form control state gets out of sync between the DOM and layout (and layout saving the value changes the form control state). This seems to happen only on Android, and it seems it has IMEStateManager::DispatchCompositionEvent on the stack. Masayuki / Makoto, do you know what might have triggered this?

Flags: needinfo?(masayuki)

Flags: needinfo?(m_kato)

Makoto Kato [:m_kato]

Comment 2

•

3 years ago

This seems to happen only on Android

bp-9f9d06ea-6541-4483-b543-412350221009 is macOS crash and doesn't seem to be related to IMEStateManager.

Makoto Kato [:m_kato]

Comment 3

•

3 years ago

So this is dup of bug 1778476?

Flags: needinfo?(m_kato)

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 4

•

3 years ago

It seems that the web app handles compositionupdate event in <input> element and flush pending things in it. Then, the <input> element is reframed and its state tries to store the new value while TextEditor is not available. So, I agree that this is not an Android specific bug.

Flags: needinfo?(masayuki)

BugBot [:suhaib / :marco/ :calixte]

Comment 5

•

3 years ago

Since the crash volume is low (less than 5 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.

For more information, please visit auto_nag documentation.

Severity: S2 → S3

BugBot [:suhaib / :marco/ :calixte]

Comment 6

•

3 years ago

The bug is linked to a topcrash signature, which matches the following criterion:

Top 10 content process crashes on beta

:dholbert, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(dholbert)

Keywords: topcrash

Daniel Holbert [:dholbert]

Comment 7

•

3 years ago

•

Edited

Yeah, let's call this S2 given the recent increase in crash volume; it would be nice to understand what's going on, given that this seems to be a recently-introduced thing (first crash in august, turning into a consistent daily low volume of crashes in mid-Sept).

This is a diagnostic assert, so release users won't be affected, but also they might be getting other forms of bustage when they sail past this assertion (not sure).

RE which platforms are affected -- this initially seemed to be Android-only, but as Masayuki noted, we got several on Mac and now we're getting crashes on Windows, -- e.g. bp-6c82a5e3-2691-4000-9d36-185f20221024.

The "Correlations" pane on crash reports show that 76% of the crashes (including that^ one) have useragent_locale = zh-CN. Those ones seem to have IME-related stuff in the backtrace, e.g. that^ one has IMEContentObserver::HandleQueryContentEvent. So there does seem to be an IME connection here (maybe IME code just makes this easier to hit).

For other locales, I'm not seeing IME in the backtrace; e.g. here's a recent en-US Windows crash:
bp-6ed7330a-8abb-4782-a703-53cdf0221025

But it does look like the backtraces consistently have RecreateFramesForContent calling into nsTextControlFrame::DestroyFrom / mozilla::TextControlState::UnbindFromFrame / mozilla::TextControlState::SetValue in the backtrace, though.

So: essentially it looks like, during a style flush, we have to RecreateFramesForContent on a subtree that includes some text input field, and destruction of that text field's frames causes a value to be set, which triggers ElementStateChanged which then fatally asserts.

Severity: S3 → S2

Flags: needinfo?(dholbert)

Hardware: Unspecified → All

Summary: Crash in [@ mozilla::RestyleManager::ElementStateChanged] → Crash in [@ mozilla::RestyleManager::ElementStateChanged], due to diagnostic assertion failing: MOZ_DIAGNOSTIC_ASSERT(!mInStyleRefresh)

Emilio Cobos Álvarez (:emilio)

Assignee

Comment 8

•

3 years ago

I can try to expand the assert message to try to come up with a test-case.

Flags: needinfo?(emilio)

Emilio Cobos Álvarez (:emilio)

Assignee

Updated

•

3 years ago

Flags: needinfo?(emilio)

Emilio Cobos Álvarez (:emilio)

Assignee

Comment 9

•

3 years ago

Attached file Bug 1793410 - Print changed bits in assert failure. r=dholbert — Details

Phabricator Automation

Updated

•

3 years ago

Assignee: nobody → emilio

Status: NEW → ASSIGNED

Pulsebot

Comment 10

•

3 years ago

Pushed by ealvarez@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/61d384d41c43 Print changed bits in assert failure. r=jwatt

Emilio Cobos Álvarez (:emilio)

Assignee

Updated

•

3 years ago

Keywords: leave-open

Sandor Molnar[:smolnar]

Comment 11

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/61d384d41c43

Hiroyuki Ikezoe (:hiro)

Updated

•

3 years ago

Comment 12

•

3 years ago

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

BugBot [:suhaib / :marco/ :calixte]

Comment 13

•

3 years ago

The bug is linked to a topcrash signature, which matches the following criterion:

Top 10 content process crashes on beta

For more information, please visit auto_nag documentation.

Keywords: topcrash

Daniel Holbert [:dholbert]

Comment 14

•

3 years ago

•

Edited

Now that we've got the bits changed in the assertion message (per emilio's tweak), we have a tad more info here. I can see the bits by adding the "Moz crash reason raw" column in the crash reports listing.
The names for these bits are in this struct definition:
https://searchfox.org/mozilla-central/rev/670e2e0999f04dc7734c8c12b2c3d420a1e31f12/dom/base/rust/lib.rs#12

pub struct ElementState: u64 {

In the last week of crash data, it looks like the changed-bits values are all one of the following:

140737488355328 (aka 1u64<<47 which is VALUE_EMPTY) -- this is the most common right now, FWIW.
4 (aka 1<<2 which is HOVER)
3072 (aka 1<<10 | 1<<11 which is VALID and INVALID)
15360 (aka bits 10, 11, 12, 13 which is VALID, INVALID, USER_VALID, and USER_INVALID)

Daniel Holbert [:dholbert]

Comment 15

•

3 years ago

In the past 7 days (since comment 14), I'm seeing more of those same codes (15360, 3072, and 140737488355328), plus one additional one:

140737488355456 (aka 1<<47 | 1<<7 which is VALUE_EMPTY and PLACEHOLDER_SHOWN).

Hiroyuki Ikezoe (:hiro)

Updated

•

3 years ago

Comment 16

•

3 years ago

(In reply to Daniel Holbert [:dholbert] from comment #14)

In the last week of crash data, it looks like the changed-bits values are all one of the following:
[...]

3072 (aka 1<<10 | 1<<11 which is VALID and INVALID)

Awesome, it looks like we've got a fuzzer testcase for this one in bug 1804806.

Depends on: 1804806, 1800543

Comment 17

•

3 years ago

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

Daniel Holbert [:dholbert]

Comment 18

•

3 years ago

Downgrading due to reduced crash volume, and since we've identified two causes (and fixed one).

Severity: S2 → S3

Daniel Holbert [:dholbert]

Updated

•

3 years ago

Depends on: 1651070
No longer depends on: 1804806

Daniel Holbert [:dholbert]

Comment 19

•

2 years ago

(In reply to BugBot [:suhaib / :marco/ :calixte] from comment #17)

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

Update: apparently this is a topcrash again, though the bot posted it on the wrong bug (since we had two bugs with the same crash-signature field).

BugBot posted in bug 1651070 comment #9:

The bug is linked to a topcrash signature, which matches the following criterion:

Top 10 AArch64 and ARM crashes on nightly

Daniel Holbert [:dholbert]

Updated

•

2 years ago

Keywords: topcrash

Daniel Holbert [:dholbert]

Comment 20

•

2 years ago

•

Edited

In the past 7 days, we have 62 crashes, with these MOZ_CRASH_REASONS:

128 (13 of these)
140737488355328 (43 of these)
140737488355456 (3 of these)
15360 (2 of these)
3072 (1 of these)

These are nearly all the same values from comment 14 - comment 15, except for 128 which is a new one -- that corresponds to the PLACEHOLDER_SHOWN bit.

As was the case in comment 14, VALUE_EMPTY (140737488355328) remains the most common bit here.

Treeherder Bug Filer

Updated

•

2 years ago

Comment 21

•

2 years ago

(In reply to Daniel Holbert [:dholbert] from comment #20)

As was the case in comment 14, VALUE_EMPTY (140737488355328) remains the most common bit here.

Good news, we've got a fuzzer testcase that triggers this now, over in bug 1836854 (potentially related to intermittent bug 1835587 which has the diagnostic assert failing for the same bit.)

Depends on: 1836854

Daniel Holbert [:dholbert]

Updated

•

2 years ago

Depends on: 1838985

Daniel Holbert [:dholbert]

Comment 22

•

2 years ago

•

Edited

(In reply to Daniel Holbert [:dholbert] from comment #21)

(In reply to Daniel Holbert [:dholbert] from comment #20)

As was the case in comment 14, VALUE_EMPTY (140737488355328) remains the most common bit here.

Good news, we've got a fuzzer testcase that triggers this now, over in bug 1836854

Update: that bug was fixed (via Bug 1472169), but it didn't fix this flavor of the crash entirely; we have a new fuzzer testcase in bug 1838985 that crashes with the same signature/enum-value.

Andrew McCreight (out of office until 8/21) [:mccr8]

Updated

•

2 years ago

Duplicate of this bug: 1843445

BugBot [:suhaib / :marco/ :calixte]

Comment 24

•

2 years ago

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

BugBot [:suhaib / :marco/ :calixte]

Comment 25

•

2 years ago

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

Top 10 AArch64 and ARM crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

BugBot [:suhaib / :marco/ :calixte]

Comment 26

•

2 years ago

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

BugBot [:suhaib / :marco/ :calixte]

Comment 27

•

2 years ago

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

Top 10 AArch64 and ARM crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

BugBot [:suhaib / :marco/ :calixte]

Comment 28

•

2 years ago

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Comment hidden (Intermittent Failures Robot)

Mayank Bansal

Updated

•

1 year ago

Updated

•

1 year ago

Depends on: 1889803

Daniel Holbert [:dholbert]

Comment 30

•

1 year ago

Good news, have a new fuzzer testcase that triggers this crash in bug 1889803 (and bug 1889804, though I tentatively duped that to the other one).

Comment hidden (Intermittent Failures Robot)

BugBot [:suhaib / :marco/ :calixte]

Comment 32

•

10 months ago

The leave-open keyword is there and there is no activity for 6 months.
:emilio, maybe it's time to close this bug?
For more information, please visit BugBot documentation.

Flags: needinfo?(emilio)

Emilio Cobos Álvarez (:emilio)

Assignee

Comment 33

•

10 months ago

No

Flags: needinfo?(emilio)

Comment hidden (Intermittent Failures Robot)

Mayank Bansal

Updated

•

8 months ago

Comment 35

•

8 months ago

FWIW bug 1936213 is another case with 3072 (aka 1<<10 | 1<<11 which are ElementState's VALID and INVALID bits combined).

Daniel Holbert [:dholbert]

Comment 36

•

8 months ago

•

Edited

But looking at the last 3 months, 822 / 864 of the crashes with this signature (95% of them) have this for moz crash reason raw:

Element state change during style refresh (35184372088832)

That giant number is from just a single bit being set -- it's 1<<45 i.e. VALUE_EMPTY, defined here:
https://searchfox.org/mozilla-central/rev/9783996dbd86f999cab50ea426079a7b10f28a2f/dom/base/rust/lib.rs#126-128

/// For :-moz-value-empty (to show widgets like the reveal password
/// button or the clear button).
const VALUE_EMPTY = 1u64 << 45;

Daniel Holbert [:dholbert]

Comment 37

•

8 months ago

I guess comment 36's 35184372088832 aka VALUE_EMPTY version of this bug (comment 36) is covered by the fuzzer testcase in bug 1889803. (At least, that's one way to trigger this.)