YSOD on browser.xhtml with NO_ELEMENTS/UNCLOSED_TOKEN
Categories
(Core :: XPCOM, defect, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr102 | --- | unaffected |
firefox112 | --- | wontfix |
firefox113 | --- | wontfix |
firefox114 | --- | wontfix |
People
(Reporter: zbraniecki, Unassigned)
References
(Depends on 1 open bug, Blocks 1 open bug, Regression)
Details
(Keywords: regression)
User Story
Volume: ~18000/day on release
Attachments
(1 file)
Investigating data on Beta channel on 2021-03-03, I have identified a common pattern of zero_byte_load and YSOD events from telemetry on Beta that look like this:
Session 1
Build: 20210302185821
- ts: 16727,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/browser/browser.xhtml"}]},{"f":[{"v":"status"},{"v":"NS_BINDING_ABORTED"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]
- ts: 16759,
ysod
:[{"f":[{"v":"error_code"},{"v":"3"}]},{"f":[{"v":"location"},{"v":"1:1"}]},{"f":[{"v":"last_line"},{"v":""}]},{"f":[{"v":"last_line_len"},{"v":"0"}]}]
- ts: 21821,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/toolkit/content/gfxsanity/sanitytest.html"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FAILURE"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]
Session 2
Build: 20210302185821
- ts: 6640,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/browser/browser.xhtml"}]},{"f":[{"v":"status"},{"v":"NS_BINDING_ABORTED"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]
- ts: 6656,
ysod
:[{"f":[{"v":"error_code"},{"v":"3"}]},{"f":[{"v":"location"},{"v":"1:1"}]},{"f":[{"v":"last_line"},{"v":""}]},{"f":[{"v":"last_line_len"},{"v":"0"}]}]
- ts: 7732,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/toolkit/content/gfxsanity/sanitytest.html"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FAILURE"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]
Session 3
Build: 20210225185804
- ts: 29421,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/toolkit/content/mozapps/extensions/default-theme/preview.png"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_NOT_FOUND"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
- ts: 29437,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/builtin-themes/light/preview.png"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_NOT_FOUND"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
- ts: 29437,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/builtin-themes/dark/preview.png"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_NOT_FOUND"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
- ts: 29468,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/builtin-themes/alpenglow/preview.png"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_NOT_FOUND"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
- ts: 35703,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/browser/browser.xhtml"}]},{"f":[{"v":"status"},{"v":"NS_BINDING_ABORTED"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]
- ts: 35718,
ysod
:[{"f":[{"v":"error_code"},{"v":"3"}]},{"f":[{"v":"location"},{"v":"1:1"}]},{"f":[{"v":"last_line"},{"v":""}]},{"f":[{"v":"last_line_len"},{"v":"0"}]}]
- ts: 57484,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/toolkit/content/gfxsanity/sanitytest.html"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FAILURE"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]
Session 4
Build: 20210225185804
- ts: 12261,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/builtin-themes/light/preview.png"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_NOT_FOUND"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
- ts: 12230,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/toolkit/content/mozapps/extensions/default-theme/preview.png"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_NOT_FOUND"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
- ts: 12277,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/builtin-themes/dark/preview.png"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_NOT_FOUND"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
- ts: 12308,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/builtin-themes/alpenglow/preview.png"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_NOT_FOUND"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
- ts: 16926,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/browser/browser.xhtml"}]},{"f":[{"v":"status"},{"v":"NS_BINDING_ABORTED"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]
- ts: 16926,
ysod
:[{"f":[{"v":"error_code"},{"v":"3"}]},{"f":[{"v":"location"},{"v":"1:1"}]},{"f":[{"v":"last_line"},{"v":""}]},{"f":[{"v":"last_line_len"},{"v":"0"}]}]
- ts: 18657,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/toolkit/content/gfxsanity/sanitytest.html"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FAILURE"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]
Session 5
Build: 20210225185804
- ts: 36211124,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/toolkit/content/gfxsanity/sanityparent.html"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FAILURE"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]
- ts: 36211843,
ysod
:[{"f":[{"v":"error_code"},{"v":"3"}]},{"f":[{"v":"location"},{"v":"1:1"}]},{"f":[{"v":"last_line"},{"v":""}]},{"f":[{"v":"last_line_len"},{"v":"0"}]}]
- ts: 536211530,
zero_byte_load
:[{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/browser/browser.xhtml"}]},{"f":[{"v":"status"},{"v":"NS_BINDING_ABORTED"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]
Reporter | ||
Comment 1•4 years ago
|
||
I'm not sure if the Session 3-4 file not found in themes is relevant, but listed all events from the probe from that session just in case it gives us a lead.
Except of that, all 5 sessions registered a zero_byte_load
with NS_ERROR_FAILURE
and cancelled: true
right before YSOD
on NO_ELEMENTS
in line 1:1
.
Also, all 5 of them had a NS_ERROR_FAILURE
with cancelled: true
on gfxsanity/sanitytest.html
or gfxsanity/sanityparent.html
in case of Session 5. Session 5 is also much older when it happens.
Reporter | ||
Comment 2•4 years ago
|
||
Matt, Mossop - you reviewed the sanitytest.html at landing. Do you have any idea on why would there be an NS_FAILURE_ERROR
correlated with YSOD on browser.xhtml
?
My only wild hypothesis would be that user is trying to open a new window, but some I/O error causes graphic error which triggers sanitytest.html
check (which errors out) right around browser.xhtml
load that ends up in an error?
We're trying to understand why under any circumstances anything would want to cancel a load of browser.xhtml
.
Reporter | ||
Comment 3•4 years ago
|
||
I just found a session which looks exactly like Session 1 & 2 above, except that YSOD error payload is [{"f":[{"v":"error_code"},{"v":"3"}]},{"f":[{"v":"location"},{"v":"4343:7"}]},{"f":[{"v":"last_line"},{"v":" "}]},{"f":[{"v":"last_line_len"},{"v":"6"}]}]
- notice line:col, and last_line_len
.
Since the error code is still NO_ELEMENTS, I assume this is mangled stream of 4343 bytes, but I don't know if it's just whitespace or not. My guess would be that it must be whitespace, or the XML parser would error out earlier, but it could be also something half-mangled, like <ELEMENTr2j3rn2krjnk2...
for 4343 bytes.
I'm also not sure if it's the same issue as 1:1, or separate one, but the sanitytest.html
error is also present and NS_BINDING_ABORTED
zero_byte_load as well.
Reporter | ||
Comment 4•4 years ago
|
||
And now I confirmed that all YSODs with UNCLOSED_TOKEN
on browser.xhtml also look like one of the 5 sessions listed above.
One session has YSOD with payload [{"f":[{"v":"error_code"},{"v":"5"}]},{"f":[{"v":"location"},{"v":"3741:17"}]},{"f":[{"v":"last_line"},{"v":"
and looks exactly like Session 3 & 4
Another has YSOD with payload [{"f":[{"v":"error_code"},{"v":"5"}]},{"f":[{"v":"location"},{"v":"565:7"}]},{"f":[{"v":"last_line"},{"v":"
and looks exactly like Session 1 & 2
For the UNCLOSED_TOKEN
YSOD, the line:col distribution is not normal, and there are 4-5 clusters: 565:7, 1894:5, 2458:11, 3741:17 - it may be just the buffer size in async load, but at least since it keeps repeating, it is not a random stream of bytes. And since last_line
seems to be reliably empty it may mean that the I/O feeds an empty string.
Reporter | ||
Updated•4 years ago
|
Comment 5•4 years ago
|
||
(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #2)
Matt, Mossop - you reviewed the sanitytest.html at landing. Do you have any idea on why would there be an
NS_FAILURE_ERROR
correlated with YSOD onbrowser.xhtml
?My only wild hypothesis would be that user is trying to open a new window, but some I/O error causes graphic error which triggers
sanitytest.html
check (which errors out) right aroundbrowser.xhtml
load that ends up in an error?We're trying to understand why under any circumstances anything would want to cancel a load of
browser.xhtml
.
I can't think of any correlation between sanitytest.html and browser.xhtml failing sorry.
Comment 6•4 years ago
|
||
I don't remember anything about sanitytest.html.
Updated•4 years ago
|
Comment 7•4 years ago
|
||
As with bug 1696565 I don't think there is much to do from the update side. See bug 1700205 for a proposal about verifying files while updating.
Comment 8•4 years ago
|
||
Can this bug be closed or have a severity assigned?
Reporter | ||
Comment 9•4 years ago
|
||
No. This is a category of YSOD that our recent removal of DTD did not fix. This happens when Necko sends incomplete or empty XHTML document (for example in scenario where Necko received a cancel signal during loading). This may still happen and there's an architectural disagreement on whether Necko API should allow consumer to receive incomplete/empty document from it.
Reporter | ||
Comment 11•4 years ago
|
||
Frankly, I'd move this to Necko and let them asses the severity. On beta we slashed the rate of YSODs by 2/3 by removing bug 1696565 vector. I want to be cautious when trying to reason about Release, but it may be that we'll decrease the volume by as much as 70% (in March 70% of browser.xhtml YSODs on Release were from DTDs).
If that's the case, we'd be talking about ~800-900 YSODs a day on release from this and I'd argue the severity should be fairly low, but I'll wait for the DTD removal to hit release first.
Updated•4 years ago
|
Updated•4 years ago
|
Reporter | ||
Comment 12•4 years ago
|
||
With bug 1696565 we now see 50% of the remaining YSODs on browser.xhtml to come from old betas and 50% from this bug.
I'll monitor the progress as we move to release, but if this would be a single causal problem and we can fix it it would be the last piece of YSOD on main UI!
Comment 13•3 years ago
|
||
dragana, do you think this is still s2-level?
(Also/in-support-of-that-question: do you know if have anyone working on YSOD-related stuff at this point, e.g. monitoring the data that Zibi mentioned in comment 0? It looks like it was a project that Zibi was focusing on, but he's no longer active at this point. So I'm not sure who the right person would be to ask about the data stream referenced in comment 0 and comment 12 here.)
Comment 14•3 years ago
|
||
Kershaw, can you take a look if we can change the s2-level? Thank you.
Comment 15•3 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #14)
Kershaw, can you take a look if we can change the s2-level? Thank you.
After our internal discussion, this bug should be still a s2.
We will try to investigate this soon.
Comment 16•2 years ago
|
||
Comment 17•2 years ago
|
||
So, I took another look at the graphs.
One thing that seems rather clear by the error rate, is that this kind of error usually happens during/after an update: the Nightly error rate is rather constant, beta has a big spike at the beginning of the release cycle, release is rather constant, and its error rates don't approach beta, while these errors are almost missing from ESR (with the note that ESR may have a smaller number of users and they might not all be sending telemetry).
That said, I noticed a big spike in browser.xhtml errors on Nightly starting 21 feb - This is also visible in the Beta graph starting March 13th (112.0b1)
Looking at the m-c pushlog feb 20-22 nothing immediately stands out - but it might be a good lead into figuring this out:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=9de559474dda78da572098cddf3be734718bab56&tochange=e4d0dac7f34f555d2085fd2f9fe07dc9318c87a6
Comment 18•2 years ago
|
||
Short update. Noting that the event recording is EARLY_BETA_OR_EARLIER which is probably why we're seeing the big spike at the beginning of the beta cycle, and it goes down towards the end. All of the release events are probably from Fx88.
Comment 19•2 years ago
|
||
Should we push reporting of this to release?
Comment 20•2 years ago
|
||
The likely change is Bug 1804803 - Hit MOZ_CRASH(DocumentLoadListener::Open for invalid history entry due to mismatch of 'Invalid LoadId'). r=smaug
Comment 21•2 years ago
|
||
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #19)
Should we push reporting of this to release?
The only question is whether this info is actionable enough to warrant the extra storage for the events.
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #20)
The likely change is Bug 1804803 - Hit MOZ_CRASH(DocumentLoadListener::Open for invalid history entry due to mismatch of 'Invalid LoadId'). r=smaug
This seems the most likely indeed. Thanks for finding it.
Comment 22•2 years ago
|
||
Storing a counter for telemetry should be measured in 10's or 100's of bytes, I'd imagine, so that's a low bar
Comment 23•2 years ago
|
||
Peter - you landed the regressing patch, can you take a look?
Updated•2 years ago
|
Comment 24•2 years ago
|
||
Set release status flags based on info from the regressing bug 1804803
Comment 25•2 years ago
|
||
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #23)
Peter - you landed the regressing patch, can you take a look?
That's highly unlikely, that patch is specific to reloads. How would we be doing a reload of browser.xhtml?
Comment 26•2 years ago
|
||
I'm going to try to add some debug code to figure how it's possible that we would be doing reloads of chrome URLs, and browser.xhtml in particular. Smaug and I are really confused why we would even end up in the code that bug 1804803 added for a load of browser.xhtml.
Updated•2 years ago
|
Comment 27•2 years ago
|
||
Peter, is there any progress on this one? Should it be assigned to somebody?
Comment 28•2 years ago
|
||
Comment 29•2 years ago
|
||
The peak for browser.xhtml went away a while ago. I had prepared this patch to catch reloading of browser.xhtml, but I wanted to make sure that it works, but I haven't been able to trigger it either. At this point I'm not sure it's actually useful anymore.
So I think we should refocus this bug a bit, and figure out what we need to be investigating. Looking at the YSOD dashboards, I don't see any peak for browser.xhtml which is quite odd? A lot of the YSOD seem to be happening on about: pages like about:preferences, which do show up in the network errors. Those do look interesting. But all the other network errors actually not lead to a YSOD? And do they have any effect on the user?
Comment 30•2 years ago
|
||
Kershaw, can you please re-evaluate if this is still a valid S2/P2? Thanks!
Comment 31•2 years ago
|
||
(In reply to Neha Kochar [:neha] from comment #30)
Kershaw, can you please re-evaluate if this is still a valid S2/P2? Thanks!
I've created a query to see YSOD events on browser.xhtml
. It shows that after 2023-06-06
we have 300~500 YSOD per day, so it seems that the peak for browser.xhtml
is still there.
Unfortunately, I think this is sill a S2 bug.
Peter, I think we could land your patch if that's helpful.
Thanks.
Comment 32•2 years ago
|
||
Moving this to xpcom -- the current guess is that this is somehow related to startupcache (see comment 17, and more recent issues highlighted by the Android incident)
Comment 33•1 year ago
|
||
Jesup, is this still an issue? Do you have any idea what the next steps are? I pinged nika about this and she told me you're actually the most familiar with StartupCache which you guessed was the responsible part.
Comment 34•1 year ago
|
||
Definitely still an issue. Also a spike in the last few months in aboutNetError (see the bottom of https://sql.telemetry.mozilla.org/dashboard/ui-resource-network-errors)
While we've improved things some since 3 years ago, there are still ongoing issues, obviously. I'm not sure of next steps at this point. Perhaps some improved diagnostic telemetry?
Updated•1 year ago
|
Comment 35•5 months ago
|
||
Given that we're still seeing this on browser.xhtml
, maybe my diagnostic crash patch should be landed after all? I haven't been able to test it because I don't know of a way to reload browser.xhtml
.
Comment 36•5 months ago
|
||
location.reload() and such of course would do reload, but I couldn't get the patch to be triggered.
Comment 37•5 months ago
|
||
What do the ui-resource-network-errors errors mean? Are those Necko level errors?
If the YSOD error happens mostly after updating, that hints that the issue might be about non-StartupCache, since normally browser.xhtml should be coming from the cache.
Comment 39•4 months ago
|
||
Was looking into the telemetry a bit this morning. I think some (perhaps most?) of the NS_ERROR_FAILURE
errors for documents loaded in-content (like about pages and activity-stream-noscripts.html) are not real errors, and should probably be filtered out so we can get a real idea of what the errors look like here.
The zero-length-read telemetry reports a message whenever we don't read any data from the underlying file, (e.g. because it was cancelled before reading). We do this intentionally very frequently for about: documents loaded in content processes due to how process switching/selection works:
When we start a document navigation, we first open the channel in the parent process, and wait for onStartRequest
, where we'll make a process selection decision. After that decision, we then use RedirectToRealChannel
to redirect the request into the "real" channel in the parent process. This registers the channel with the RedirectChannelRegistrar so that it can be linked up with a channel opened in the content process which was selected for the navigation. For non-http(s) channels, however, we often don't end up linking up the channel at all, as the content process will instead open up their own independent version of the channel to directly read the channel data (e.g. from the omnijar).
If the channel is never taken to be linked up, we should notice this in RedirectToRealChannelFinished
, and end up calling FinishReplacementChannelSetup(NS_ERROR_FAILURE)
, which cancels the channel with that error, before any data is read from it.
I think there's a good chance that this normal pattern is treated as a zero-byte read for the purposes of the zero byte read telemetry, and might be inflating the number of errors we're seeing here quite a bit. I'm not sure how best to filter it out though, as NS_ERROR_FAILURE
with no cancellation reason isn't the most descriptive. The easiest might be to add a new error which is used by this call-site as well as this assignment to indicate this reason for cancelling. (maybe it should be something specific, like NS_ERROR_DLL_NO_PARENT_CHANNEL
? - names are hard). That error could then be filtered out. (We might also want to add a reason string here, though it'd be less precise).
As a side note, there are a large number of errors on activity-stream-noscripts.html
which were previously filtered out using https://searchfox.org/mozilla-central/rev/4d00b50c42a788c51ac4c5fe92b684569ba6c3a5/modules/libjar/nsJARChannel.cpp#1009-1015, however recent changes to how activity stream is bundled has changed the path for this file, meaning that these errors are no longer filtered out. I imagine (almost?) all of these errors are from the call-site I mention above (the newtab page is notoriously loaded in a content process).
Comment 40•4 months ago
|
||
Thanks nika, those are good ideas to improve the telemetry here. Can you file a bug for that?
@smaug: true YSOD errors seem to be always right after an update
Comment 41•4 months ago
|
||
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #40)
Thanks nika, those are good ideas to improve the telemetry here. Can you file a bug for that?
I have filed bug 1963079
Description
•