Closed Bug 1737405 Opened 7 months ago Closed 4 months ago

Diagnostic crash in [@ mozilla::dom::JSStreamConsumer::WriteSegment ]

Categories

(Core :: DOM: Networking, defect, P2)

Desktop
Windows 10
defect

Tracking

()

RESOLVED FIXED
98 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- unaffected
firefox93 --- unaffected
firefox94 --- unaffected
firefox95 --- disabled
firefox96 --- disabled
firefox97 --- disabled
firefox98 --- fixed

People

(Reporter: jimm, Assigned: yury)

References

(Regression)

Details

(Keywords: regression, Whiteboard: [necko-triaged])

Crash Data

Attachments

(6 files)

Reliably happens on tab load of Matrix Chat.

https://crash-stats.mozilla.org/report/index/b363da03-399c-4259-9791-b152d0211023#tab-details

MOZ_DIAGNOSTIC_ASSERT(self->mZStream.avail_out > 0)

Has Regression Range: --- → yes

Seems to be fixed this morning.

I don't see how this state can be triggered. I have couple of thoughts:

  • Multiple JSStreamConsumer read/write operations are happening at the same time (thus using the same mZStream). Maybe WriteSegment is called when new cache is saved.
  • Interesting buildId in-between versions upgrade
  • Something else corrupts cache's alt-data or in-memory mZStream

I'll monitor this for a little bit longer.

I haven't reproduced this crash myself, but when monitoring crash pings from Fission users, I see the MOZ_DIAGNOSTIC_ASSERT(self->mZStream.avail_out > 0) crash reason in about 70 crash pings from Beta 95 and 5 from Nightly 96.

Yeah, I need help from necko team to figure this out. I cannot reproduce the issue on my systems locally Windows or Mac OS as well, but looks like cache system returns invalid or corrupted alt data. The timing of events play huge role in it.

FWIW backing out the compression (bug 1545131) will not solve the issue, but will hide it. Looks like the problem will be with uncompressed data too, which is more depressing -- no ways to check if machine code is corrupted.

Do you have more hints on in which way data is corrupted, e.g. incomplete, completely empty, bytes in the stream are corrupted?

Assignee: nobody → ydelendik
Status: NEW → ASSIGNED

Do you have more hints on in which way data is corrupted, e.g. incomplete, completely empty, bytes in the stream are corrupted?

MOZ_DIAGNOSTIC_ASSERT(self->mZStream.avail_out > 0) is indication that allocated buffer for decompressed data is exhausted, but data is still coming from the stream.

There is also bug 1738987, which has fails because delivered stream is empty (or possibly incomplete).

(Taking a guess at component: DOM Streams is probably wrong because none of the code covered by that component is active yet)

Component: DOM: Streams → DOM: Networking
Flags: needinfo?(ydelendik)
Keywords: leave-open
Pushed by ydelendik@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ddba013c8f70
Disable wasm caching for release/beta. r=lth

(In reply to Yury Delendik (:yury) from comment #7)

Do you have more hints on in which way data is corrupted, e.g. incomplete, completely empty, bytes in the stream are corrupted?

MOZ_DIAGNOSTIC_ASSERT(self->mZStream.avail_out > 0) is indication that allocated buffer for decompressed data is exhausted, but data is still coming from the stream.

There is also bug 1738987, which has fails because delivered stream is empty (or possibly incomplete).

It's not clear to me "how" the alt-data could get corrupted. Technically the bytes you put into a cache entry should be the same as the bytes that come out - unless something else corrupts the file on disk.
Is it possible to add some sort of checksum to the alt-data representation?

Flags: needinfo?(ydelendik)

Is it possible to add some sort of checksum to the alt-data representation?

The prefix length works like checksum. The WriteSegment/OnInputStreamReady logic is built such way the inflate algorithm will fail if data is corrupted, because of this length over- or under run. On the success, the inflate shall extract the exact amount of bytes provided at the start of the data.

Flags: needinfo?(ydelendik)

Technically the alt-data cache entry could have both an input stream and an output stream open. I guess it's possible that the output stream could fail for some reason before it gets the chance to write all the content and i the input stream reader would get incomplete data.
But I'm not sure that's what's going on in this case.

Jimm, if you can reliably reproduce this, some HTTP logging would be very useful.

Flags: needinfo?(jmathies)

Comment on attachment 9249305 [details]
Bug 1737405 - Disable wasm caching for release/beta. r?lth

Beta/Release Uplift Approval Request

  • User impact if declined: wasm http caching will be enabled, currently not really stable / no intent to ship yet
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: Bug 1739617
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky):
  • String changes made/needed:
Attachment #9249305 - Flags: approval-mozilla-beta?

Comment on attachment 9249305 [details]
Bug 1737405 - Disable wasm caching for release/beta. r?lth

Crash fix, approved for 95 beta 6, thanks.

Attachment #9249305 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attached file log-main.10416.zip

(In reply to Valentin Gosu [:valentin] (he/him) from comment #13)

Technically the alt-data cache entry could have both an input stream and an output stream open. I guess it's possible that the output stream could fail for some reason before it gets the chance to write all the content and i the input stream reader would get incomplete data.
But I'm not sure that's what's going on in this case.

Jimm, if you can reliably reproduce this, some HTTP logging would be very useful.

This came back today. Here's the log.

Flags: needinfo?(jmathies)
Flags: needinfo?(valentin.gosu)

FWIW, disabling 'javascript.options.wasm_caching' fixes the problem.

It's not clear from the logs what is wrong.
Yury, is it possible that we read 0 as the length of an entry here?

Flags: needinfo?(valentin.gosu) → needinfo?(ydelendik)

(wrong comment/analysis, removed)

Flags: needinfo?(ydelendik)

Separate asserts logic to provide more details for crashes.

The severity field is not set for this bug.
:dragana, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(dd.mozilla)

Does not affect release.

Severity: -- → S3
Flags: needinfo?(dd.mozilla)
Priority: -- → P2
Attachment #9252030 - Attachment description: Bug 1737405 - Change MOZ_DAIG_ASSERTs in wasm cache code. r?valentin → Bug 1737405 - Change MOZ_DIAGNOSTIC_ASSERTs in wasm cache code. r?valentin
Pushed by ydelendik@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9df077db9e0c
Change MOZ_DIAGNOSTIC_ASSERTs in wasm cache code. r=valentin

Interesting assert triggered on Fenix https://hg.mozilla.org/mozilla-central/file/524df7136a1f401f317d472f7945e6a284bd66f5/dom/fetch/FetchUtil.cpp#l410: MOZ_DIAGNOSTIC_ASSERT(!self->mConsumerAborted); Looks like it is unrelated to the subject of mOptimizedEncoding though it has the same signature -- it might be useful to investigate it somewhere else.

The self->mZStream.avail_out > 0 check is not correct assert.
In some rare cases the incoming data may not produce any output.

Rely on zlib's inflate() to perform needed validation.

Pushed by ydelendik@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/bdb2ea56b4ec
Remove superfluous wasm cache stream check. r=valentin

I'm surprised, that after 6 days, I don't see the "WriteSegment" diagnostics crashes. It is either: a) self->mZStream.avail_out > 0 was incorrect and the main cause of the mayhem, b) the wasm related crashes (from twitch.tv) surfaced with different signature.

Ryan, I'm looking for a release management consult. It looks like "Remove superfluous wasm cache stream check" patch removed WriteSegment crash from top of the list. My initial thought was it is corrupted cache data, but looks like it is not the case and just the assert was wrong. There is still a tiny chance that it can manifest in some other form, e.g. as a failure to load a wasm module and not run a web application, in particular at twitch.tv.

I'm looking for a way to enable the "Enable HTTP wasm caching" feature back, but in controlled manner. Open for ideas, e.g. only for beta, A/B testing, etc.

Flags: needinfo?(ryanvm)

Could we start with early beta? I guess the only concern I have is that for Fenix, we won't have any diagnostic assert coverage beyond Nightly. Are we likely to see crashes elsewhere outside of those?

Flags: needinfo?(ryanvm)

Could we start with early beta?

I assume we just enable that (by reversing "Disable wasm caching for release/beta." path). What early beta time periods for two upcoming releases.

Are we likely to see crashes elsewhere outside of those?

The crashes are only for diagnostics and shall not cause any crashes for release. If we ignore (or remove) these assert, the Firefox presumably will not load applications (e.g. if something wrong with the internal cache logic/data). I wonder if the telemetry will be useful here.

There's a separate define for early beta: https://wiki.mozilla.org/Platform/Channel-specific_build_defines#EARLY_BETA_OR_EARLIER

In your case, you'd want to re-land with @IS_EARLY_BETA_OR_EARLIER@ instead of @IS_NIGHTLY_BUILD@. That'd get you the first half of the Beta cycle before being automatically turned off. I don't think it'd be particularly risky to go ahead and land that change, but we should probably have a better sense of what to be on the lookout for before letting it ride the trains past early beta.

Pushed by ydelendik@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4b72b106a50b
Enable wasm caching for early beta. r=RyanVM

I believe that the current status of this bug is that wasm caching remains disabled on 95/96, but it's now enabled up through early beta on 97 and we believe that the issues which led to the prior disabling have been fixed now. We're waiting on wider testing of 97 to confirm that, however.

Did I get that correct, Yury? Tracking on this bug has gotten pretty messy :(

wasm caching remains disabled on 95/96, but it's now enabled up through early beta on 97
we believe that the issues which led to the prior disabling have been fixed now.

That is correct. If there is no weird manifestations of this issue, I would like this to ride trains with 97, e.g. try to switch it full on at the later beta releases.

Flags: needinfo?(ydelendik)

No visible fallout found with enabling it for nightly and early beta. The plan, from wasm team, is to enable HTTP wasm caching to be fully released with FF98.

Keywords: leave-open
Status: ASSIGNED → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → 98 Branch
Regressions: 1762619
You need to log in before you can comment on or make changes to this bug.