Diagnostic crash in [@ mozilla::dom::JSStreamConsumer::WriteSegment ]
Categories
(Core :: DOM: Networking, defect, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr78 | --- | unaffected |
firefox-esr91 | --- | unaffected |
firefox93 | --- | unaffected |
firefox94 | --- | unaffected |
firefox95 | --- | disabled |
firefox96 | --- | disabled |
firefox97 | --- | disabled |
firefox98 | --- | fixed |
People
(Reporter: jimm, Assigned: yury)
References
(Regression)
Details
(Keywords: regression, Whiteboard: [necko-triaged])
Crash Data
Attachments
(6 files)
48 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
|
Details | Review |
1.42 MB,
application/x-zip-compressed
|
Details | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review |
Reliably happens on tab load of Matrix Chat.
https://crash-stats.mozilla.org/report/index/b363da03-399c-4259-9791-b152d0211023#tab-details
MOZ_DIAGNOSTIC_ASSERT(self->mZStream.avail_out > 0)
Updated•3 years ago
|
Updated•3 years ago
|
Reporter | ||
Comment 1•3 years ago
|
||
Seems to be fixed this morning.
Assignee | ||
Comment 2•3 years ago
|
||
I don't see how this state can be triggered. I have couple of thoughts:
- Multiple JSStreamConsumer read/write operations are happening at the same time (thus using the same mZStream). Maybe
WriteSegment
is called when new cache is saved. - Interesting buildId in-between versions upgrade
- Something else corrupts cache's alt-data or in-memory mZStream
I'll monitor this for a little bit longer.
Comment 3•3 years ago
|
||
I haven't reproduced this crash myself, but when monitoring crash pings from Fission users, I see the MOZ_DIAGNOSTIC_ASSERT(self->mZStream.avail_out > 0)
crash reason in about 70 crash pings from Beta 95 and 5 from Nightly 96.
Assignee | ||
Comment 4•3 years ago
|
||
Yeah, I need help from necko team to figure this out. I cannot reproduce the issue on my systems locally Windows or Mac OS as well, but looks like cache system returns invalid or corrupted alt data. The timing of events play huge role in it.
FWIW backing out the compression (bug 1545131) will not solve the issue, but will hide it. Looks like the problem will be with uncompressed data too, which is more depressing -- no ways to check if machine code is corrupted.
Comment 5•3 years ago
|
||
Do you have more hints on in which way data is corrupted, e.g. incomplete, completely empty, bytes in the stream are corrupted?
Assignee | ||
Comment 6•3 years ago
|
||
Updated•3 years ago
|
Assignee | ||
Comment 7•3 years ago
|
||
Do you have more hints on in which way data is corrupted, e.g. incomplete, completely empty, bytes in the stream are corrupted?
MOZ_DIAGNOSTIC_ASSERT(self->mZStream.avail_out > 0)
is indication that allocated buffer for decompressed data is exhausted, but data is still coming from the stream.
There is also bug 1738987, which has fails because delivered stream is empty (or possibly incomplete).
Comment 8•3 years ago
|
||
(Taking a guess at component: DOM Streams is probably wrong because none of the code covered by that component is active yet)
Assignee | ||
Updated•3 years ago
|
Comment 10•3 years ago
|
||
bugherder |
Comment 11•3 years ago
|
||
(In reply to Yury Delendik (:yury) from comment #7)
Do you have more hints on in which way data is corrupted, e.g. incomplete, completely empty, bytes in the stream are corrupted?
MOZ_DIAGNOSTIC_ASSERT(self->mZStream.avail_out > 0)
is indication that allocated buffer for decompressed data is exhausted, but data is still coming from the stream.There is also bug 1738987, which has fails because delivered stream is empty (or possibly incomplete).
It's not clear to me "how" the alt-data could get corrupted. Technically the bytes you put into a cache entry should be the same as the bytes that come out - unless something else corrupts the file on disk.
Is it possible to add some sort of checksum to the alt-data representation?
Assignee | ||
Comment 12•3 years ago
•
|
||
Is it possible to add some sort of checksum to the alt-data representation?
The prefix length works like checksum. The WriteSegment/OnInputStreamReady logic is built such way the inflate algorithm will fail if data is corrupted, because of this length over- or under run. On the success, the inflate shall extract the exact amount of bytes provided at the start of the data.
Comment 13•3 years ago
|
||
Technically the alt-data cache entry could have both an input stream and an output stream open. I guess it's possible that the output stream could fail for some reason before it gets the chance to write all the content and i the input stream reader would get incomplete data.
But I'm not sure that's what's going on in this case.
Jimm, if you can reliably reproduce this, some HTTP logging would be very useful.
Assignee | ||
Comment 14•3 years ago
|
||
Comment on attachment 9249305 [details]
Bug 1737405 - Disable wasm caching for release/beta. r?lth
Beta/Release Uplift Approval Request
- User impact if declined: wasm http caching will be enabled, currently not really stable / no intent to ship yet
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: Bug 1739617
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky):
- String changes made/needed:
Comment 15•3 years ago
|
||
Comment on attachment 9249305 [details]
Bug 1737405 - Disable wasm caching for release/beta. r?lth
Crash fix, approved for 95 beta 6, thanks.
Comment 16•3 years ago
|
||
bugherder uplift |
Reporter | ||
Comment 17•3 years ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #13)
Technically the alt-data cache entry could have both an input stream and an output stream open. I guess it's possible that the output stream could fail for some reason before it gets the chance to write all the content and i the input stream reader would get incomplete data.
But I'm not sure that's what's going on in this case.Jimm, if you can reliably reproduce this, some HTTP logging would be very useful.
This came back today. Here's the log.
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Comment 18•3 years ago
|
||
FWIW, disabling 'javascript.options.wasm_caching' fixes the problem.
Comment 19•3 years ago
|
||
It's not clear from the logs what is wrong.
Yury, is it possible that we read 0 as the length of an entry here?
Assignee | ||
Comment 21•3 years ago
|
||
Separate asserts logic to provide more details for crashes.
Comment 22•3 years ago
|
||
The severity field is not set for this bug.
:dragana, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 23•3 years ago
|
||
Does not affect release.
Updated•3 years ago
|
Comment 24•3 years ago
|
||
Comment 25•3 years ago
|
||
bugherder |
Assignee | ||
Comment 26•3 years ago
|
||
Interesting assert triggered on Fenix https://hg.mozilla.org/mozilla-central/file/524df7136a1f401f317d472f7945e6a284bd66f5/dom/fetch/FetchUtil.cpp#l410: MOZ_DIAGNOSTIC_ASSERT(!self->mConsumerAborted);
Looks like it is unrelated to the subject of mOptimizedEncoding though it has the same signature -- it might be useful to investigate it somewhere else.
Assignee | ||
Comment 27•3 years ago
|
||
The self->mZStream.avail_out > 0
check is not correct assert.
In some rare cases the incoming data may not produce any output.
Rely on zlib's inflate() to perform needed validation.
Comment 28•3 years ago
|
||
Comment 29•3 years ago
|
||
bugherder |
Assignee | ||
Comment 30•3 years ago
|
||
I'm surprised, that after 6 days, I don't see the "WriteSegment" diagnostics crashes. It is either: a) self->mZStream.avail_out > 0
was incorrect and the main cause of the mayhem, b) the wasm related crashes (from twitch.tv) surfaced with different signature.
Assignee | ||
Comment 31•3 years ago
|
||
Ryan, I'm looking for a release management consult. It looks like "Remove superfluous wasm cache stream check" patch removed WriteSegment crash from top of the list. My initial thought was it is corrupted cache data, but looks like it is not the case and just the assert was wrong. There is still a tiny chance that it can manifest in some other form, e.g. as a failure to load a wasm module and not run a web application, in particular at twitch.tv.
I'm looking for a way to enable the "Enable HTTP wasm caching" feature back, but in controlled manner. Open for ideas, e.g. only for beta, A/B testing, etc.
Comment 32•3 years ago
|
||
Could we start with early beta? I guess the only concern I have is that for Fenix, we won't have any diagnostic assert coverage beyond Nightly. Are we likely to see crashes elsewhere outside of those?
Assignee | ||
Comment 33•3 years ago
|
||
Could we start with early beta?
I assume we just enable that (by reversing "Disable wasm caching for release/beta." path). What early beta time periods for two upcoming releases.
Are we likely to see crashes elsewhere outside of those?
The crashes are only for diagnostics and shall not cause any crashes for release. If we ignore (or remove) these assert, the Firefox presumably will not load applications (e.g. if something wrong with the internal cache logic/data). I wonder if the telemetry will be useful here.
Comment 34•3 years ago
•
|
||
There's a separate define for early beta: https://wiki.mozilla.org/Platform/Channel-specific_build_defines#EARLY_BETA_OR_EARLIER
In your case, you'd want to re-land with @IS_EARLY_BETA_OR_EARLIER@ instead of @IS_NIGHTLY_BUILD@. That'd get you the first half of the Beta cycle before being automatically turned off. I don't think it'd be particularly risky to go ahead and land that change, but we should probably have a better sense of what to be on the lookout for before letting it ride the trains past early beta.
Assignee | ||
Comment 35•3 years ago
|
||
Updated•3 years ago
|
Comment 36•3 years ago
|
||
Comment 37•3 years ago
|
||
bugherder |
Comment 38•3 years ago
|
||
I believe that the current status of this bug is that wasm caching remains disabled on 95/96, but it's now enabled up through early beta on 97 and we believe that the issues which led to the prior disabling have been fixed now. We're waiting on wider testing of 97 to confirm that, however.
Did I get that correct, Yury? Tracking on this bug has gotten pretty messy :(
Assignee | ||
Comment 39•3 years ago
|
||
wasm caching remains disabled on 95/96, but it's now enabled up through early beta on 97
we believe that the issues which led to the prior disabling have been fixed now.
That is correct. If there is no weird manifestations of this issue, I would like this to ride trains with 97, e.g. try to switch it full on at the later beta releases.
Assignee | ||
Comment 40•3 years ago
|
||
Assignee | ||
Comment 41•3 years ago
|
||
No visible fallout found with enabling it for nightly and early beta. The plan, from wasm team, is to enable HTTP wasm caching to be fully released with FF98.
Assignee | ||
Updated•3 years ago
|
Comment 42•3 years ago
|
||
Updated•3 years ago
|
Comment 43•3 years ago
|
||
bugherder |
Description
•