Closed Bug 1737405 Opened 3 years ago Closed 3 years ago

Diagnostic crash in [@ mozilla::dom::JSStreamConsumer::WriteSegment ]

Tracking

()

Status:

RESOLVED FIXED

Milestone:

98 Branch

Tracking Flags:

Flag             Tracking   Status
firefox-esr78    ---        unaffected
firefox-esr91    ---        unaffected
firefox93        ---        unaffected
firefox94        ---        unaffected
firefox95        ---        disabled
firefox96        ---        disabled
firefox97        ---        disabled
firefox98        ---        fixed

People

(Reporter: jimm, Assigned: yury)

References

(Regression)

Details

(Keywords: regression, Whiteboard: [necko-triaged])

Crash Data

Attachments

(6 files)

Bug 1737405 - Disable wasm caching for release/beta. r?lth, 3 years ago, Yury Delendik (:yury), 48 bytes, text/x-phabricator-request, pascalc: approval-mozilla-beta+
log-main.10416.zip, 3 years ago, Jim Mathies [:jimm], 1.42 MB, application/x-zip-compressed
Bug 1737405 - Change MOZ_DIAGNOSTIC_ASSERTs in wasm cache code. r?valentin, 3 years ago, Yury Delendik (:yury), 48 bytes, text/x-phabricator-request
Bug 1737405 - Remove superfluous wasm cache stream check. r?valentin, 3 years ago, Yury Delendik (:yury), 48 bytes, text/x-phabricator-request
Bug 1737405 - Enable wasm caching for early beta. r?ryanvm, 3 years ago, Yury Delendik (:yury), 48 bytes, text/x-phabricator-request
Bug 1737405 - Enable wasm caching. r?ryanvm, 3 years ago, Yury Delendik (:yury), 48 bytes, text/x-phabricator-request

Jim Mathies [:jimm]

Reporter

Description

•

3 years ago

Reliably happens on tab load of Matrix Chat.

https://crash-stats.mozilla.org/report/index/b363da03-399c-4259-9791-b152d0211023#tab-details

MOZ_DIAGNOSTIC_ASSERT(self->mZStream.avail_out > 0)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

3 years ago

status-firefox93: --- → unaffected

status-firefox94: --- → unaffected

status-firefox-esr78: --- → unaffected

status-firefox-esr91: --- → unaffected

Flags: needinfo?(ydelendik)

Keywords: regression

Regressed by: 1545131

BMO Automation

Updated

•

3 years ago

Has Regression Range: --- → yes

Jim Mathies [:jimm]

Reporter

Comment 1

•

3 years ago

Seems to be fixed this morning.

Yury Delendik (:yury)

Assignee

Comment 2

•

3 years ago

I don't see how this state can be triggered. I have a couple of thoughts:

- Multiple JSStreamConsumer read/write operations are happening at the same time (and thus using the same mZStream). Maybe WriteSegment is called while a new cache entry is being saved.
- An unusual buildId during an in-between version upgrade.
- Something else corrupts the cache's alt-data or the in-memory mZStream.

I'll monitor this for a little bit longer.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

3 years ago

Comment 3

•

3 years ago

I haven't reproduced this crash myself, but when monitoring crash pings from Fission users, I see the MOZ_DIAGNOSTIC_ASSERT(self->mZStream.avail_out > 0) crash reason in about 70 crash pings from Beta 95 and 5 from Nightly 96.

status-firefox96: --- → affected

Yury Delendik (:yury)

Assignee

Comment 4

•

3 years ago

Yeah, I need help from the necko team to figure this out. I cannot reproduce the issue locally on my Windows or macOS systems either, but it looks like the cache system returns invalid or corrupted alt-data. The timing of events plays a huge role in it.

FWIW, backing out the compression (bug 1545131) will not solve the issue, only hide it. It looks like the problem would affect uncompressed data too, which is more depressing: there would be no way to check whether the machine code is corrupted.

Dragana Damjanovic [:dragana]

Comment 5

•

3 years ago

Do you have more hints about how the data is corrupted, e.g. incomplete, completely empty, or corrupted bytes in the stream?

Yury Delendik (:yury)

Assignee

Comment 6

•

3 years ago

Attached file Bug 1737405 - Disable wasm caching for release/beta. r?lth — Details

Phabricator Automation

Updated

•

3 years ago

Assignee: nobody → ydelendik

Status: NEW → ASSIGNED

Yury Delendik (:yury)

Assignee

Comment 7

•

3 years ago

> Do you have more hints about how the data is corrupted, e.g. incomplete, completely empty, or corrupted bytes in the stream?

MOZ_DIAGNOSTIC_ASSERT(self->mZStream.avail_out > 0) indicates that the buffer allocated for decompressed data is exhausted while data is still coming from the stream.

There is also bug 1738987, which fails because the delivered stream is empty (or possibly incomplete).

Matthew Gaudet (he/him) [:mgaudet]

Comment 8

•

3 years ago

(Taking a guess at component: DOM Streams is probably wrong because none of the code covered by that component is active yet)

Component: DOM: Streams → DOM: Networking

Yury Delendik (:yury)

Assignee

Updated

•

3 years ago

Flags: needinfo?(ydelendik)

Keywords: leave-open

Pulsebot

Comment 9

•

3 years ago

Pushed by ydelendik@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ddba013c8f70 Disable wasm caching for release/beta. r=lth

Cristian Tuns

Comment 10

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/ddba013c8f70

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

3 years ago

Regressions: 1739617

Valentin Gosu [:valentin] (he/him)

Comment 11

•

3 years ago

(In reply to Yury Delendik (:yury) from comment #7)

> > Do you have more hints about how the data is corrupted, e.g. incomplete, completely empty, or corrupted bytes in the stream?
>
> MOZ_DIAGNOSTIC_ASSERT(self->mZStream.avail_out > 0) indicates that the buffer allocated for decompressed data is exhausted while data is still coming from the stream.
>
> There is also bug 1738987, which fails because the delivered stream is empty (or possibly incomplete).

It's not clear to me "how" the alt-data could get corrupted. Technically the bytes you put into a cache entry should be the same as the bytes that come out - unless something else corrupts the file on disk.
Is it possible to add some sort of checksum to the alt-data representation?

Flags: needinfo?(ydelendik)

Yury Delendik (:yury)

Assignee

Comment 12

•

3 years ago

•

Edited

> Is it possible to add some sort of checksum to the alt-data representation?

The length prefix works like a checksum. The WriteSegment/OnInputStreamReady logic is built in such a way that the inflate algorithm will fail if the data is corrupted, because of a length overrun or underrun. On success, inflate extracts exactly the number of bytes declared at the start of the data.
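
As a rough illustration of the length-prefix idea described in this comment, here is a minimal Python sketch (it is not the Firefox C++ code). The 4-byte little-endian prefix and the `encode`/`decode` helper names are assumptions made for this example; the actual alt-data layout is not shown in this bug.

```python
import struct
import zlib


def encode(payload: bytes) -> bytes:
    """Prefix the zlib-compressed payload with its uncompressed length.

    The 4-byte little-endian prefix is an assumption for this sketch.
    """
    return struct.pack("<I", len(payload)) + zlib.compress(payload)


def decode(blob: bytes) -> bytes:
    """Inflate blob and accept it only if exactly the declared size comes out."""
    if len(blob) < 4:
        raise ValueError("missing length prefix")
    (expected,) = struct.unpack_from("<I", blob)
    inflater = zlib.decompressobj()
    try:
        # Allow one byte more than declared so an overrun is detectable
        # without inflating an arbitrarily large stream.
        out = inflater.decompress(blob[4:], expected + 1)
    except zlib.error as exc:
        raise ValueError(f"corrupt stream: {exc}") from exc
    if len(out) > expected:
        raise ValueError("overrun: more data than the prefix declared")
    if len(out) < expected or not inflater.eof:
        raise ValueError("underrun: stream ended before the declared length")
    return out


if __name__ == "__main__":
    module = b"\x00asm" + b"stand-in module bytes " * 100
    assert decode(encode(module)) == module
```

In this sketch a truncated entry, or a prefix that disagrees with the stream, is rejected with `ValueError` instead of being handed to the consumer. It detects length mismatches and zlib-level corruption only, which is the same limited sense in which the prefix "works like a checksum".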

Flags: needinfo?(ydelendik)

Valentin Gosu [:valentin] (he/him)

Comment 13

•

3 years ago

Technically the alt-data cache entry could have both an input stream and an output stream open. I guess it's possible that the output stream could fail for some reason before it gets the chance to write all the content and the input stream reader would get incomplete data.
But I'm not sure that's what's going on in this case.

Jimm, if you can reliably reproduce this, some HTTP logging would be very useful.

Flags: needinfo?(jmathies)

Yury Delendik (:yury)

Assignee

Comment 14

•

3 years ago

Comment on attachment 9249305 [details]
Bug 1737405 - Disable wasm caching for release/beta. r?lth

Beta/Release Uplift Approval Request

User impact if declined: wasm http caching will be enabled, currently not really stable / no intent to ship yet
Is this code covered by automated tests?: Yes
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: Bug 1739617
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky):
String changes made/needed:

Attachment #9249305 - Flags: approval-mozilla-beta?

Pascal Chevrel:pascalc

Comment 15

•

3 years ago

Comment on attachment 9249305 [details]
Bug 1737405 - Disable wasm caching for release/beta. r?lth

Crash fix, approved for 95 beta 6, thanks.

Attachment #9249305 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Pascal Chevrel:pascalc

Comment 16

•

3 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/610226674258

status-firefox95: affected → fixed

Jim Mathies [:jimm]

Reporter

Comment 17

•

3 years ago

Attached file log-main.10416.zip — Details

(In reply to Valentin Gosu [:valentin] (he/him) from comment #13)

> Technically the alt-data cache entry could have both an input stream and an output stream open. I guess it's possible that the output stream could fail for some reason before it gets the chance to write all the content and the input stream reader would get incomplete data.
> But I'm not sure that's what's going on in this case.
>
> Jimm, if you can reliably reproduce this, some HTTP logging would be very useful.

This came back today. Here's the log.

Flags: needinfo?(jmathies)

Jim Mathies [:jimm]

Reporter

Updated

•

3 years ago

Flags: needinfo?(valentin.gosu)

Jim Mathies [:jimm]

Reporter

Comment 18

•

3 years ago

FWIW, disabling 'javascript.options.wasm_caching' fixes the problem.

Valentin Gosu [:valentin] (he/him)

Comment 19

•

3 years ago

It's not clear from the logs what is wrong.
Yury, is it possible that we read 0 as the length of an entry here?

Flags: needinfo?(valentin.gosu) → needinfo?(ydelendik)

Yury Delendik (:yury)

Assignee

Comment 20

•

3 years ago

•

Edited

(wrong comment/analysis, removed)

Flags: needinfo?(ydelendik)

Yury Delendik (:yury)

Assignee

Comment 21

•

3 years ago

Attached file Bug 1737405 - Change MOZ_DIAGNOSTIC_ASSERTs in wasm cache code. r?valentin — Details

Separate the assert logic to provide more details for crashes.

BugBot [:suhaib / :marco/ :calixte]

Comment 22

•

3 years ago

The severity field is not set for this bug.
:dragana, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(dd.mozilla)

Valentin Gosu [:valentin] (he/him)

Comment 23

•

3 years ago

Does not affect release.

Severity: -- → S3

Flags: needinfo?(dd.mozilla)

Priority: -- → P2

Phabricator Automation

Updated

•

3 years ago

Attachment #9252030 - Attachment description: Bug 1737405 - Change MOZ_DAIG_ASSERTs in wasm cache code. r?valentin → Bug 1737405 - Change MOZ_DIAGNOSTIC_ASSERTs in wasm cache code. r?valentin

Pulsebot

Comment 24

•

3 years ago

Pushed by ydelendik@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9df077db9e0c Change MOZ_DIAGNOSTIC_ASSERTs in wasm cache code. r=valentin

Atila Butkovits

Comment 25

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/9df077db9e0c

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

3 years ago

Regressions: 1742887

Yury Delendik (:yury)

Assignee

Comment 26

•

3 years ago

An interesting assert triggered on Fenix at https://hg.mozilla.org/mozilla-central/file/524df7136a1f401f317d472f7945e6a284bd66f5/dom/fetch/FetchUtil.cpp#l410: MOZ_DIAGNOSTIC_ASSERT(!self->mConsumerAborted); It looks unrelated to the subject of mOptimizedEncoding, though it has the same signature. It might be useful to investigate it somewhere else.

Yury Delendik (:yury)

Assignee

Comment 27

•

3 years ago

Attached file Bug 1737405 - Remove superfluous wasm cache stream check. r?valentin — Details

The self->mZStream.avail_out > 0 check is not a correct assert.
In some rare cases the incoming data may not produce any output.

Rely on zlib's inflate() to perform needed validation.
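
One way a segment can legitimately produce no output can be sketched in Python (again, not the Firefox C++). If the last segment delivered holds only the zlib trailer, the 4-byte Adler-32 checksum, every payload byte has already been emitted before that segment arrives, so an exactly sized output buffer would already be full. Whether this is the exact case that users hit is not established in this bug; the `inflate_segments` helper and the split point are hypothetical.

```python
import zlib


def inflate_segments(segments):
    """Feed segments to one inflater; return per-segment output sizes and eof."""
    inflater = zlib.decompressobj()
    sizes = [len(inflater.decompress(segment)) for segment in segments]
    return sizes, inflater.eof


# Stand-in for cached module bytes (illustrative data only).
payload = b"\x00asm" + bytes(range(256)) * 8
blob = zlib.compress(payload)

# Split so that the second segment carries only the Adler-32 trailer.
sizes, finished = inflate_segments([blob[:-4], blob[-4:]])

# The first segment yields the whole payload; the second yields nothing,
# yet it is still required for the stream to finish and be validated.
assert sizes == [len(payload), 0]
assert finished
```

A check that demands free output space on every incoming segment would reject the second segment here even though the stream is valid, which is consistent with leaving validation to inflate() itself.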

Pulsebot

Comment 28

•

3 years ago

Pushed by ydelendik@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/bdb2ea56b4ec Remove superfluous wasm cache stream check. r=valentin

Marian-Vasile Laza

Comment 29

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/bdb2ea56b4ec

Yury Delendik (:yury)

Assignee

Comment 30

•

3 years ago

I'm surprised that, after 6 days, I don't see the "WriteSegment" diagnostic crashes. Either a) the self->mZStream.avail_out > 0 assert was incorrect and the main cause of the mayhem, or b) the wasm-related crashes (from twitch.tv) surfaced with a different signature.

Yury Delendik (:yury)

Assignee

Comment 31

•

3 years ago

Ryan, I'm looking for a release management consult. It looks like the "Remove superfluous wasm cache stream check" patch removed the WriteSegment crash from the top of the list. My initial thought was that the cache data was corrupted, but it looks like that is not the case and the assert was simply wrong. There is still a tiny chance that the problem can manifest in some other form, e.g. as a failure to load a wasm module that leaves a web application unable to run, in particular at twitch.tv.

I'm looking for a way to turn the "Enable HTTP wasm caching" feature back on, but in a controlled manner. I'm open to ideas, e.g. only for beta, A/B testing, etc.

Flags: needinfo?(ryanvm)

Ryan VanderMeulen [:RyanVM]

Comment 32

•

3 years ago

Could we start with early beta? I guess the only concern I have is that for Fenix, we won't have any diagnostic assert coverage beyond Nightly. Are we likely to see crashes elsewhere outside of those?

Flags: needinfo?(ryanvm)

Yury Delendik (:yury)

Assignee

Comment 33

•

3 years ago

> Could we start with early beta?

I assume we just enable that (by reversing the "Disable wasm caching for release/beta." patch). What are the early beta time periods for the two upcoming releases?

> Are we likely to see crashes elsewhere outside of those?

The asserts are diagnostic-only and should not cause any crashes on release. If we ignore (or remove) these asserts, Firefox presumably will not load the affected applications (e.g. if something is wrong with the internal cache logic/data). I wonder if telemetry would be useful here.

Ryan VanderMeulen [:RyanVM]

Comment 34

•

3 years ago

•

Edited

There's a separate define for early beta: https://wiki.mozilla.org/Platform/Channel-specific_build_defines#EARLY_BETA_OR_EARLIER

In your case, you'd want to re-land with @IS_EARLY_BETA_OR_EARLIER@ instead of @IS_NIGHTLY_BUILD@. That'd get you the first half of the Beta cycle before being automatically turned off. I don't think it'd be particularly risky to go ahead and land that change, but we should probably have a better sense of what to be on the lookout for before letting it ride the trains past early beta.
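
To make the difference between the two gates concrete, here is a small Python model of which builds each one covers. It is only an illustration of the behaviour described in this comment, not Mozilla build-system code; the `Channel` enum and the helper names are hypothetical.

```python
from enum import Enum


class Channel(Enum):
    """Hypothetical stages of the release train, earliest first."""
    NIGHTLY = 1
    EARLY_BETA = 2
    LATE_BETA = 3
    RELEASE = 4


def is_nightly_build(channel):
    """Models a gate that is true on Nightly only."""
    return channel is Channel.NIGHTLY


def is_early_beta_or_earlier(channel):
    """Models a gate that is true on Nightly and the first half of Beta."""
    return channel in (Channel.NIGHTLY, Channel.EARLY_BETA)


def enabled_channels(gate):
    """List the channels on which a pref using this gate defaults to on."""
    return [channel.name for channel in Channel if gate(channel)]


nightly_only = enabled_channels(is_nightly_build)
through_early_beta = enabled_channels(is_early_beta_or_earlier)
```

Under this model, switching the gate adds early beta to the covered builds while late beta and release stay off, matching the "first half of the Beta cycle" description.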

Yury Delendik (:yury)

Assignee

Comment 35

•

3 years ago

Attached file Bug 1737405 - Enable wasm caching for early beta. r?ryanvm — Details

Valentin Gosu [:valentin] (he/him)

Updated

•

3 years ago

Whiteboard: [necko-triaged]

Pulsebot

Comment 36

•

3 years ago

Pushed by ydelendik@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4b72b106a50b Enable wasm caching for early beta. r=RyanVM

Cristian Tuns

Comment 37

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/4b72b106a50b

Ryan VanderMeulen [:RyanVM]

Comment 38

•

3 years ago

I believe that the current status of this bug is that wasm caching remains disabled on 95/96, but it's now enabled up through early beta on 97 and we believe that the issues which led to the prior disabling have been fixed now. We're waiting on wider testing of 97 to confirm that, however.

Did I get that correct, Yury? Tracking on this bug has gotten pretty messy :(

status-firefox95: fixed → disabled

status-firefox96: affected → disabled

status-firefox97: --- → affected

Flags: needinfo?(ydelendik)

Yury Delendik (:yury)

Assignee

Comment 39

•

3 years ago

> wasm caching remains disabled on 95/96, but it's now enabled up through early beta on 97
> we believe that the issues which led to the prior disabling have been fixed now.

That is correct. If there are no weird manifestations of this issue, I would like this to ride the trains with 97, e.g. try to switch it fully on in the later beta releases.

Flags: needinfo?(ydelendik)

Yury Delendik (:yury)

Assignee

Comment 40

•

3 years ago

Attached file Bug 1737405 - Enable wasm caching. r?ryanvm — Details

Yury Delendik (:yury)

Assignee

Comment 41

•

3 years ago

No visible fallout was found from enabling it for Nightly and early beta. The plan from the wasm team is to enable HTTP wasm caching so that it is fully released with FF98.

Yury Delendik (:yury)

Assignee

Updated

•

3 years ago

Keywords: leave-open

Pulsebot

Comment 42

•

3 years ago

Pushed by ydelendik@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5c9a21444fc8 Enable wasm caching. r=RyanVM

Ryan VanderMeulen [:RyanVM]

Updated

•

3 years ago

status-firefox97: affected → disabled

Noemi Erli[:noemi_erli]

Comment 43

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/5c9a21444fc8

Status: ASSIGNED → RESOLVED

Closed: 3 years ago

status-firefox98: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 98 Branch

Ryan Hunt [:rhunt]

Updated

•

3 years ago

Regressions: 1762619
