Open Bug 1696551 Opened 4 years ago Updated 1 month ago

YSOD on browser.xhtml with NO_ELEMENTS/UNCLOSED_TOKEN

Categories

(Core :: XPCOM, defect, P2)

defect

Tracking

()

Tracking Status
firefox-esr102 --- unaffected
firefox112 --- wontfix
firefox113 --- wontfix
firefox114 --- wontfix

People

(Reporter: zbraniecki, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug, Regression)

Details

(Keywords: regression)

User Story

Volume: ~18000/day on release

Attachments

(1 file)

Investigating data on Beta channel on 2021-03-03, I have identified a common pattern of zero_byte_load and YSOD events from telemetry on Beta that look like this:

Session 1

Build: 20210302185821

  • ts: 16727, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/browser/browser.xhtml"}]},{"f":[{"v":"status"},{"v":"NS_BINDING_ABORTED"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]
  • ts: 16759, ysod: [{"f":[{"v":"error_code"},{"v":"3"}]},{"f":[{"v":"location"},{"v":"1:1"}]},{"f":[{"v":"last_line"},{"v":""}]},{"f":[{"v":"last_line_len"},{"v":"0"}]}]
  • ts: 21821, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/toolkit/content/gfxsanity/sanitytest.html"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FAILURE"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]

Session 2

Build: 20210302185821

  • ts: 6640, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/browser/browser.xhtml"}]},{"f":[{"v":"status"},{"v":"NS_BINDING_ABORTED"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]
  • ts: 6656, ysod: [{"f":[{"v":"error_code"},{"v":"3"}]},{"f":[{"v":"location"},{"v":"1:1"}]},{"f":[{"v":"last_line"},{"v":""}]},{"f":[{"v":"last_line_len"},{"v":"0"}]}]
  • ts: 7732, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/toolkit/content/gfxsanity/sanitytest.html"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FAILURE"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]

Session 3

Build: 20210225185804

  • ts: 29421, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/toolkit/content/mozapps/extensions/default-theme/preview.png"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_NOT_FOUND"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • ts: 29437, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/builtin-themes/light/preview.png"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_NOT_FOUND"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • ts: 29437, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/builtin-themes/dark/preview.png"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_NOT_FOUND"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • ts: 29468, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/builtin-themes/alpenglow/preview.png"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_NOT_FOUND"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • ts: 35703, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/browser/browser.xhtml"}]},{"f":[{"v":"status"},{"v":"NS_BINDING_ABORTED"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]
  • ts: 35718, ysod: [{"f":[{"v":"error_code"},{"v":"3"}]},{"f":[{"v":"location"},{"v":"1:1"}]},{"f":[{"v":"last_line"},{"v":""}]},{"f":[{"v":"last_line_len"},{"v":"0"}]}]
  • ts: 57484, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/toolkit/content/gfxsanity/sanitytest.html"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FAILURE"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]

Session 4

Build: 20210225185804

  • ts: 12261, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/builtin-themes/light/preview.png"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_NOT_FOUND"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • ts: 12230, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/toolkit/content/mozapps/extensions/default-theme/preview.png"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_NOT_FOUND"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • ts: 12277, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/builtin-themes/dark/preview.png"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_NOT_FOUND"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • ts: 12308, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/builtin-themes/alpenglow/preview.png"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_NOT_FOUND"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • ts: 16926, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/browser/browser.xhtml"}]},{"f":[{"v":"status"},{"v":"NS_BINDING_ABORTED"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]
  • ts: 16926, ysod: [{"f":[{"v":"error_code"},{"v":"3"}]},{"f":[{"v":"location"},{"v":"1:1"}]},{"f":[{"v":"last_line"},{"v":""}]},{"f":[{"v":"last_line_len"},{"v":"0"}]}]
  • ts: 18657, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/toolkit/content/gfxsanity/sanitytest.html"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FAILURE"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]

Session 5

Build: 20210225185804

  • ts: 36211124, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/toolkit/content/gfxsanity/sanityparent.html"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FAILURE"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]
  • ts: 36211843, ysod: [{"f":[{"v":"error_code"},{"v":"3"}]},{"f":[{"v":"location"},{"v":"1:1"}]},{"f":[{"v":"last_line"},{"v":""}]},{"f":[{"v":"last_line_len"},{"v":"0"}]}]
  • ts: 536211530, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/browser/browser.xhtml"}]},{"f":[{"v":"status"},{"v":"NS_BINDING_ABORTED"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]

I'm not sure if the Session 3-4 file not found in themes is relevant, but listed all events from the probe from that session just in case it gives us a lead.

Except of that, all 5 sessions registered a zero_byte_load with NS_ERROR_FAILURE and cancelled: true right before YSOD on NO_ELEMENTS in line 1:1.

Also, all 5 of them had a NS_ERROR_FAILURE with cancelled: true on gfxsanity/sanitytest.html or gfxsanity/sanityparent.html in case of Session 5. Session 5 is also much older when it happens.

Matt, Mossop - you reviewed the sanitytest.html at landing. Do you have any idea on why would there be an NS_FAILURE_ERROR correlated with YSOD on browser.xhtml?

My only wild hypothesis would be that user is trying to open a new window, but some I/O error causes graphic error which triggers sanitytest.html check (which errors out) right around browser.xhtml load that ends up in an error?

We're trying to understand why under any circumstances anything would want to cancel a load of browser.xhtml.

Flags: needinfo?(matt.woodrow)
Flags: needinfo?(dtownsend)

I just found a session which looks exactly like Session 1 & 2 above, except that YSOD error payload is [{"f":[{"v":"error_code"},{"v":"3"}]},{"f":[{"v":"location"},{"v":"4343:7"}]},{"f":[{"v":"last_line"},{"v":" "}]},{"f":[{"v":"last_line_len"},{"v":"6"}]}] - notice line:col, and last_line_len.

Since the error code is still NO_ELEMENTS, I assume this is mangled stream of 4343 bytes, but I don't know if it's just whitespace or not. My guess would be that it must be whitespace, or the XML parser would error out earlier, but it could be also something half-mangled, like <ELEMENTr2j3rn2krjnk2... for 4343 bytes.

I'm also not sure if it's the same issue as 1:1, or separate one, but the sanitytest.html error is also present and NS_BINDING_ABORTED zero_byte_load as well.

And now I confirmed that all YSODs with UNCLOSED_TOKEN on browser.xhtml also look like one of the 5 sessions listed above.

One session has YSOD with payload [{"f":[{"v":"error_code"},{"v":"5"}]},{"f":[{"v":"location"},{"v":"3741:17"}]},{"f":[{"v":"last_line"},{"v":" and looks exactly like Session 3 & 4

Another has YSOD with payload [{"f":[{"v":"error_code"},{"v":"5"}]},{"f":[{"v":"location"},{"v":"565:7"}]},{"f":[{"v":"last_line"},{"v":" and looks exactly like Session 1 & 2

For the UNCLOSED_TOKEN YSOD, the line:col distribution is not normal, and there are 4-5 clusters: 565:7, 1894:5, 2458:11, 3741:17 - it may be just the buffer size in async load, but at least since it keeps repeating, it is not a random stream of bytes. And since last_line seems to be reliably empty it may mean that the I/O feeds an empty string.

Summary: YSOD on browser.xhtml with NO_ELEMENTS in line 1:1 → YSOD on browser.xhtml with NO_ELEMENTS/UNCLOSED_TOKEN
User Story: (updated)

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #2)

Matt, Mossop - you reviewed the sanitytest.html at landing. Do you have any idea on why would there be an NS_FAILURE_ERROR correlated with YSOD on browser.xhtml?

My only wild hypothesis would be that user is trying to open a new window, but some I/O error causes graphic error which triggers sanitytest.html check (which errors out) right around browser.xhtml load that ends up in an error?

We're trying to understand why under any circumstances anything would want to cancel a load of browser.xhtml.

I can't think of any correlation between sanitytest.html and browser.xhtml failing sorry.

Flags: needinfo?(matt.woodrow)

I don't remember anything about sanitytest.html.

Flags: needinfo?(dtownsend)
Component: General → Application Update
Product: Core → Toolkit

As with bug 1696565 I don't think there is much to do from the update side. See bug 1700205 for a proposal about verifying files while updating.

Component: Application Update → Startup and Profile System

Can this bug be closed or have a severity assigned?

Flags: needinfo?(zbraniecki)

No. This is a category of YSOD that our recent removal of DTD did not fix. This happens when Necko sends incomplete or empty XHTML document (for example in scenario where Necko received a cancel signal during loading). This may still happen and there's an architectural disagreement on whether Necko API should allow consumer to receive incomplete/empty document from it.

Flags: needinfo?(zbraniecki)

So what should the severity be?

Flags: needinfo?(zbraniecki)

Frankly, I'd move this to Necko and let them asses the severity. On beta we slashed the rate of YSODs by 2/3 by removing bug 1696565 vector. I want to be cautious when trying to reason about Release, but it may be that we'll decrease the volume by as much as 70% (in March 70% of browser.xhtml YSODs on Release were from DTDs).

If that's the case, we'd be talking about ~800-900 YSODs a day on release from this and I'd argue the severity should be fairly low, but I'll wait for the DTD removal to hit release first.

Flags: needinfo?(zbraniecki)
Component: Startup and Profile System → Networking
Product: Toolkit → Core
Severity: -- → S2
Priority: -- → P2
Whiteboard: [necko-triaged]

With bug 1696565 we now see 50% of the remaining YSODs on browser.xhtml to come from old betas and 50% from this bug.

I'll monitor the progress as we move to release, but if this would be a single causal problem and we can fix it it would be the last piece of YSOD on main UI!

dragana, do you think this is still s2-level?

(Also/in-support-of-that-question: do you know if have anyone working on YSOD-related stuff at this point, e.g. monitoring the data that Zibi mentioned in comment 0? It looks like it was a project that Zibi was focusing on, but he's no longer active at this point. So I'm not sure who the right person would be to ask about the data stream referenced in comment 0 and comment 12 here.)

Flags: needinfo?(dd.mozilla)

Kershaw, can you take a look if we can change the s2-level? Thank you.

Flags: needinfo?(dd.mozilla) → needinfo?(kershaw)

(In reply to Dragana Damjanovic [:dragana] from comment #14)

Kershaw, can you take a look if we can change the s2-level? Thank you.

After our internal discussion, this bug should be still a s2.
We will try to investigate this soon.

Flags: needinfo?(kershaw)
Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-review]
Whiteboard: [necko-triaged][necko-priority-review] → [necko-triaged][necko-priority-queue]

So, I took another look at the graphs.
One thing that seems rather clear by the error rate, is that this kind of error usually happens during/after an update: the Nightly error rate is rather constant, beta has a big spike at the beginning of the release cycle, release is rather constant, and its error rates don't approach beta, while these errors are almost missing from ESR (with the note that ESR may have a smaller number of users and they might not all be sending telemetry).

That said, I noticed a big spike in browser.xhtml errors on Nightly starting 21 feb - This is also visible in the Beta graph starting March 13th (112.0b1)

Looking at the m-c pushlog feb 20-22 nothing immediately stands out - but it might be a good lead into figuring this out:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=9de559474dda78da572098cddf3be734718bab56&tochange=e4d0dac7f34f555d2085fd2f9fe07dc9318c87a6

Short update. Noting that the event recording is EARLY_BETA_OR_EARLIER which is probably why we're seeing the big spike at the beginning of the beta cycle, and it goes down towards the end. All of the release events are probably from Fx88.

Should we push reporting of this to release?

The likely change is Bug 1804803 - Hit MOZ_CRASH(DocumentLoadListener::Open for invalid history entry due to mismatch of 'Invalid LoadId'). r=smaug

Flags: needinfo?(valentin.gosu)

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #19)

Should we push reporting of this to release?

The only question is whether this info is actionable enough to warrant the extra storage for the events.

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #20)

The likely change is Bug 1804803 - Hit MOZ_CRASH(DocumentLoadListener::Open for invalid history entry due to mismatch of 'Invalid LoadId'). r=smaug

This seems the most likely indeed. Thanks for finding it.

Flags: needinfo?(valentin.gosu)

Storing a counter for telemetry should be measured in 10's or 100's of bytes, I'd imagine, so that's a low bar

Peter - you landed the regressing patch, can you take a look?

Blocks: 1804803
Flags: needinfo?(peterv)
No longer blocks: 1804803
Keywords: regression
Regressed by: 1804803

Set release status flags based on info from the regressing bug 1804803

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #23)

Peter - you landed the regressing patch, can you take a look?

That's highly unlikely, that patch is specific to reloads. How would we be doing a reload of browser.xhtml?

Flags: needinfo?(peterv)

I'm going to try to add some debug code to figure how it's possible that we would be doing reloads of chrome URLs, and browser.xhtml in particular. Smaug and I are really confused why we would even end up in the code that bug 1804803 added for a load of browser.xhtml.

Blocks: 1639821

Peter, is there any progress on this one? Should it be assigned to somebody?

Flags: needinfo?(peterv)

The peak for browser.xhtml went away a while ago. I had prepared this patch to catch reloading of browser.xhtml, but I wanted to make sure that it works, but I haven't been able to trigger it either. At this point I'm not sure it's actually useful anymore.

So I think we should refocus this bug a bit, and figure out what we need to be investigating. Looking at the YSOD dashboards, I don't see any peak for browser.xhtml which is quite odd? A lot of the YSOD seem to be happening on about: pages like about:preferences, which do show up in the network errors. Those do look interesting. But all the other network errors actually not lead to a YSOD? And do they have any effect on the user?

Flags: needinfo?(peterv)

Kershaw, can you please re-evaluate if this is still a valid S2/P2? Thanks!

Flags: needinfo?(kershaw)

(In reply to Neha Kochar [:neha] from comment #30)

Kershaw, can you please re-evaluate if this is still a valid S2/P2? Thanks!

I've created a query to see YSOD events on browser.xhtml. It shows that after 2023-06-06 we have 300~500 YSOD per day, so it seems that the peak for browser.xhtml is still there.
Unfortunately, I think this is sill a S2 bug.

Peter, I think we could land your patch if that's helpful.
Thanks.

Flags: needinfo?(kershaw) → needinfo?(peterv)

Moving this to xpcom -- the current guess is that this is somehow related to startupcache (see comment 17, and more recent issues highlighted by the Android incident)

Component: Networking → XPCOM
Whiteboard: [necko-triaged][necko-priority-queue]

Jesup, is this still an issue? Do you have any idea what the next steps are? I pinged nika about this and she told me you're actually the most familiar with StartupCache which you guessed was the responsible part.

Flags: needinfo?(rjesup)

Definitely still an issue. Also a spike in the last few months in aboutNetError (see the bottom of https://sql.telemetry.mozilla.org/dashboard/ui-resource-network-errors)

While we've improved things some since 3 years ago, there are still ongoing issues, obviously. I'm not sure of next steps at this point. Perhaps some improved diagnostic telemetry?

Flags: needinfo?(rjesup)
No longer blocks: 1639821
See Also: → 1639821

Given that we're still seeing this on browser.xhtml, maybe my diagnostic crash patch should be landed after all? I haven't been able to test it because I don't know of a way to reload browser.xhtml.

Flags: needinfo?(smaug)
Flags: needinfo?(peterv)
Flags: needinfo?(htsai)

location.reload() and such of course would do reload, but I couldn't get the patch to be triggered.

What do the ui-resource-network-errors errors mean? Are those Necko level errors?

If the YSOD error happens mostly after updating, that hints that the issue might be about non-StartupCache, since normally browser.xhtml should be coming from the cache.

Flags: needinfo?(smaug) → needinfo?(rjesup)

(was about to NI :jesup, too)

Flags: needinfo?(htsai)

Was looking into the telemetry a bit this morning. I think some (perhaps most?) of the NS_ERROR_FAILURE errors for documents loaded in-content (like about pages and activity-stream-noscripts.html) are not real errors, and should probably be filtered out so we can get a real idea of what the errors look like here.

The zero-length-read telemetry reports a message whenever we don't read any data from the underlying file, (e.g. because it was cancelled before reading). We do this intentionally very frequently for about: documents loaded in content processes due to how process switching/selection works:

When we start a document navigation, we first open the channel in the parent process, and wait for onStartRequest, where we'll make a process selection decision. After that decision, we then use RedirectToRealChannel to redirect the request into the "real" channel in the parent process. This registers the channel with the RedirectChannelRegistrar so that it can be linked up with a channel opened in the content process which was selected for the navigation. For non-http(s) channels, however, we often don't end up linking up the channel at all, as the content process will instead open up their own independent version of the channel to directly read the channel data (e.g. from the omnijar).

If the channel is never taken to be linked up, we should notice this in RedirectToRealChannelFinished, and end up calling FinishReplacementChannelSetup(NS_ERROR_FAILURE), which cancels the channel with that error, before any data is read from it.

I think there's a good chance that this normal pattern is treated as a zero-byte read for the purposes of the zero byte read telemetry, and might be inflating the number of errors we're seeing here quite a bit. I'm not sure how best to filter it out though, as NS_ERROR_FAILURE with no cancellation reason isn't the most descriptive. The easiest might be to add a new error which is used by this call-site as well as this assignment to indicate this reason for cancelling. (maybe it should be something specific, like NS_ERROR_DLL_NO_PARENT_CHANNEL? - names are hard). That error could then be filtered out. (We might also want to add a reason string here, though it'd be less precise).


As a side note, there are a large number of errors on activity-stream-noscripts.html which were previously filtered out using https://searchfox.org/mozilla-central/rev/4d00b50c42a788c51ac4c5fe92b684569ba6c3a5/modules/libjar/nsJARChannel.cpp#1009-1015, however recent changes to how activity stream is bundled has changed the path for this file, meaning that these errors are no longer filtered out. I imagine (almost?) all of these errors are from the call-site I mention above (the newtab page is notoriously loaded in a content process).

Thanks nika, those are good ideas to improve the telemetry here. Can you file a bug for that?

@smaug: true YSOD errors seem to be always right after an update

Flags: needinfo?(rjesup) → needinfo?(nika)

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #40)

Thanks nika, those are good ideas to improve the telemetry here. Can you file a bug for that?

I have filed bug 1963079

Flags: needinfo?(nika)

Downgrading based on comment 39.

Severity: S2 → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: