Closed Bug 1695560 Opened 3 years ago Closed 3 years ago

Narrow down the zero_byte_load probe to tailor results for YSOD, part 4

Categories

(Core :: Networking: JAR, task, P2)

task

Tracking

()

RESOLVED FIXED
88 Branch
Tracking Status
firefox87 --- fixed
firefox88 --- fixed

People

(Reporter: zbraniecki, Assigned: kershaw)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged])

Attachments

(2 files)

As a follow up to bug 1693955 and bug 1694263 I believe we have now enough data from Beta to place additional filters.

Looking at the dashboard at https://sql.telemetry.mozilla.org/dashboard/ui-resource-network-errors I see several potentials:

Status as of Feb 28 2021

Nightly stands at ~1.8m events per day. 900k/day of them are HTML and then PNG, SVG and Others with +200k/day.

Beta stands at ~14m events per day. 6.6m are HTML, then 2m SVG and Others, and then 1.5m XHTML and PNG.

Proposed new filters

1) Filter out non-omni.ja HTML

  • Scope: all channels
  • Impact:
    ** Nightly: 900k -> 300k
    ** Beta: 6.6m -> 3.9m

This seems quite obvious. Lots of volume from extensions, and we are not currently aiming to fix that.

2) Filter out omni.ja!/chrome/browser/res/activity-stream/prerendered/activity-stream-noscript/NS_ERROR_FAILURE

  • Scope: all but Nightly (keep it on Nightly)
  • Impact:
    ** Nightly: 300k -> 60k
    ** Beta: 3.9m -> 300k

This is a bit more tricky. The error indicates that something is going wrong, and it is on a primary UI. Today we saw 3.6m events from Beta from 380k unique clients, and on Nightly from 25k clients.
Those numbers represent a substantial portion of the DAU!
What's also worth noticing is that on average a client which experiences it, reports it 10 times a day.
I would be concerned about filtering it out, because both, it can be useful in our investigation of YSOD sources, and it may be a valueable Q&A artifact.

Unfortunately, since the volume is really high, I think it would be sensible to restrict it to Nightly and monitor there.

3) Filter out omni.ja!/chrome/browser/search-extensions/google/favicon.ico / NS_BINDING_ABORTED

  • Scope: all channels
  • Impact:
    ** Nightly: -
    ** Beta: 2m -> 1m

While this file seems to be of importance, it feels to me like it should have the same underlying root cause as other search_extensions/*/favicon.ico with NS_BINDING_ABORTED, all of which are in top 10, but none of which has the same 1m events per day volume.
So by filtering this out, we're not ignoring the problem, we just have to scale the other analogous volume by 20x to include this one.
And if we fix the icon for ddg, bing etc. we're likely going to fix this one as well.
Another thing to notice is that the problem seems to be non-existing in Nightly.

Also, we should file a bug about it and CC someone from the Search Team to investigate this particular error's root cause.

4) Filter out omni.ja!/update.locale / NS_ERROR_FILE_NOT_FOUND

  • Scope: all channels
  • Impact:
    ** Nightly: 200k -> 145k
    ** Beta: 1m -> 400k

This is a known problem that building Firefox without an installer leads to this file being missing.
We have two places where we ask for this file, in LocaleService where I think we stat it, and in JS here: https://searchfox.org/mozilla-central/rev/6a023272d590409c80458d373986e379b3ad86f4/toolkit/modules/UpdateUtils.jsm#146
Since JS doesn't have an easy "if (fileExists(path)), they justfetch` and check the ouput, which leads to NS_ERROR_FILE_NOT_FOUND.
Monitoring this error may be useful for general Q&A as we can then verify that the source code handles missing result, but for this particular file we already did, and I think we can filter it out now.

  1. Filter out non-omni.ja SVG&PNG
    Scope: all channels
    Impact:
    ** Nightly: 470k -> 200k
    ** Beta: 3.4m -> 2.3m

I know it sounds like overly cautious but I prefer to slowly filter out non-omni.ja per type rather than all at once because it allows me to see if there's any suspicous non-omni.ja call that may lead us to a quality problem in code.
But I also want to reduce the volume, so

Those 5 target all that I can see standing out in the dashboard, and reduces Beta volume from 14m to 5m, which I think is appropriate.

I am in particular not cutting on events for XHTML which I think is significant for YSOD and I'd like to monitor it fully for a couple weeks more.

This is also the first time we're cutting out what I suspect may be a legitimate error, but is at a volume that justifies cutting it since we have lower-volume equivalents that we can study and will likely be fixed by fixing the same root cause.

It'll be important to remember to add the volume of those ((2) on Beta and (3) on all platforms) when evaluating severity and scope of the problem.

Does the list look good to you? If so, would you have time to write a PR?

Flags: needinfo?(kershaw)

I also think that at this point we'll likely be done with Beta filters, and we'll switch fully to evaluating the correlations with the data at this level.
And waiting for Stable to hit, which may mean 60x volume.

I think :kershaw's idea to implement a perf that we could use to turn it off (or at least clamp?) may be important if we the data from release surprised us and we'd get much more than 300m events per day.
From what I understand from #telemetry Matrix channel, 300m events a day wouldn't overload the system, but I'd still prefer to only collect such volume for a week or so, and then maybe turn it off on Release, fine-tune on Nightly+Beta and then re-enable for the next release with new filters.

Assignee: nobody → kershaw
Severity: -- → N/A
Flags: needinfo?(kershaw)
Priority: -- → P2
Whiteboard: [necko-triaged]

2) Filter out omni.ja!/chrome/browser/res/activity-stream/prerendered/activity-stream-noscript/NS_ERROR_FAILURE

  • Scope: all but Nightly (keep it on Nightly)
  • Impact:
    ** Nightly: 300k -> 60k
    ** Beta: 3.9m -> 300k

This is a bit more tricky. The error indicates that something is going wrong, and it is on a primary UI. Today we saw 3.6m events from Beta from 380k unique clients, and on Nightly from 25k clients.
Those numbers represent a substantial portion of the DAU!
What's also worth noticing is that on average a client which experiences it, reports it 10 times a day.
I would be concerned about filtering it out, because both, it can be useful in our investigation of YSOD sources, and it may be a valueable Q&A artifact.

Unfortunately, since the volume is really high, I think it would be sensible to restrict it to Nightly and monitor there.

This can be easily reproduced locally on my side.
It seems the channel is cancelled here.
I also captured the log and stack trace below.

Since this is easy to reproduce, I think we don't need to record this on Nightly and we can filter out this completely.

#01: mozilla::net::DocumentLoadListener::FinishReplacementChannelSetup(nsresult)[/Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/MacOS/XUL +0x8a4868]
#02: mozilla::net::DocumentLoadListener::RedirectToRealChannelFinished(nsresult)[/Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/MacOS/XUL +0x8a42fb]
#03: mozilla::MozPromise<nsresult, mozilla::ipc::ResponseRejectReason, true>::ThenValue<mozilla::net::DocumentLoadListener::TriggerRedirectToRealChannel(mozilla::Maybe<unsigned long long> const&)::$_24, mozilla::net::DocumentLoadListener::TriggerRedirectToReal[/Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/MacOS/XUL +0x8c9ae4]
#04: mozilla::MozPromise<nsresult, mozilla::ipc::ResponseRejectReason, true>::ThenValueBase::ResolveOrRejectRunnable::Run()[/Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/MacOS/XUL +0x8c11c4]
#05: mozilla::SimpleTaskQueue::DrainTasks()[/Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/MacOS/XUL +0x1898ad]
#06: nsThread::DrainDirectTasks()[/Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/MacOS/XUL +0x199531]
#07: nsThread::ProcessNextEvent(bool, bool*)[/Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/MacOS/XUL +0x197fca]
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener ctor [this=0x12d815940]
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener [0x12d815940] OpenDocument [uri=about:home]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::nsJARChannel [this=0x12ccc08d0]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::Init [this=0x12ccc08d0]
[Parent 7283: Main Thread]: V/DocumentChannel ParentProcessDocumentOpenInfo ctor [this=0x12d76c900]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::AsyncOpen [this=0x12ccc08d0]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::AsyncOpen [this=0x12ccc08d0]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::LookupFile [this=0x12ccc08d0 jar:file:///Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/Resources/browser/omni.ja!/chrome/browser/res/activity-stream/prerendered/activity-stream-noscripts.html]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::OpenLocalFile [this=0x12ccc08d0]
[Parent 7283: StreamTrans #5]: D/nsJarProtocol nsJARChannel::CreateLocalJarInput [aJarCache=0x10eb11940, , chrome/browser/res/activity-stream/prerendered/activity-stream-noscripts.html]
[Parent 7283: StreamTrans #1]: D/nsJarProtocol nsJARChannel::CreateLocalJarInput [aJarCache=0x10eb11940, , update.locale]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::ContinueOpenLocalFile [this=0x12d14e030 0x111670ca0]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::OnOpenLocalFileComplete [this=0x12d14e030 00000000]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::ContinueOpenLocalFile [this=0x12ccc08d0 0x129436d60]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::OnOpenLocalFileComplete [this=0x12ccc08d0 00000000]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::OnStartRequest [this=0x12ccc08d0 jar:file:///Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/Resources/browser/omni.ja!/chrome/browser/res/activity-stream/prerendered/activity-stream-noscripts.html]
[Parent 7283: Main Thread]: V/DocumentChannel ParentProcessDocumentOpenInfo OnStartRequest [this=0x12d76c900]
[Parent 7283: Main Thread]: V/DocumentChannel ParentProcessDocumentOpenInfo OnDocumentStartRequest [this=0x12d76c900]
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener OnStartRequest [this=0x12d815940]
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener MaybeTriggerProcessSwitch [this=0x12d815940]
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener GetRemoteTypeForPrincipal [this=0x12d815940, contentParent=privilegedabout, preferredRemoteType=privilegedabout]
[Parent 7283: Main Thread]: V/DocumentChannel GetRemoteTypeForPrincipal -> current:privilegedabout remoteType:privilegedabout
[Parent 7283: Main Thread]: V/DocumentChannel Process Switch Abort: type (privilegedabout) is compatible
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener::TriggerRedirectToRealChannel [this=0x12d815940] aDestinationProcess=-1
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener RedirectToRealChannel [this=0x12d815940] aRedirectFlags=4, aLoadFlags=3d0004
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::nsJARChannel [this=0x12d14ea40]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::Init [this=0x12d14ea40]
[Child 7285: Main Thread]: V/DocumentChannel DocumentChannelChild RecvRedirectToRealChannel [this=0x10e3c7280, uri=about:home]
[Child 7285: Main Thread]: V/DocumentChannel DocumentChannelChild OnRedirectVerifyCallback [this=0x10e3c7280, aRv=0x00000000 ]
[Child 7285: Main Thread]: D/nsJarProtocol nsJARChannel::nsJARChannel [this=0x1168aed20]
[Child 7285: Main Thread]: D/nsJarProtocol nsJARChannel::Init [this=0x1168aed20]
[Child 7285: Main Thread]: D/nsJarProtocol nsJARChannel::Open [this=0x1168aed20]
[Child 7285: Main Thread]: D/nsJarProtocol nsJARChannel::Open [this=0x1168aed20]
[Child 7285: Main Thread]: D/nsJarProtocol nsJARChannel::LookupFile [this=0x1168aed20 jar:file:///Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/Resources/browser/omni.ja!/chrome/devtools/modules/devtools/client/jsonview/Sniffer.jsm]
[Child 7285: Main Thread]: D/nsJarProtocol nsJARChannel::CreateJarInput [this=0x1168aed20]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::AsyncOpen [this=0x12d14ea40]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::AsyncOpen [this=0x12d14ea40]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::LookupFile [this=0x12d14ea40 jar:file:///Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/Resources/omni.ja!/modules/workers/PromiseWorker.js]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::OpenLocalFile [this=0x12d14ea40]
[Parent 7283: StreamTrans #3]: D/nsJarProtocol nsJARChannel::CreateLocalJarInput [aJarCache=0x10eb11940, , modules/workers/PromiseWorker.js]
[Child 7285: Main Thread]: D/nsJarProtocol nsJARChannel::~nsJARChannel [this=0x1168aed20]
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener RedirectToRealChannelFinished [this=0x12d815940, aRv=0 ]
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener FinishReplacementChannelSetup [this=0x12d815940, aResult=80004005]

Ooh, thank you!
So, if I read your stack correctly, this is not an error, we are just redirecting to another document while loading one?

Do you think more of HTML/XHTML loads are caused by such behavior? Is there a way for us to recognize that as an always "correct" behavior and not include in the probe (should it even be an error?)? - in a follow-up bug of course.

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #8)

Ooh, thank you!
So, if I read your stack correctly, this is not an error, we are just redirecting to another document while loading one?

Do you think more of HTML/XHTML loads are caused by such behavior? Is there a way for us to recognize that as an always "correct" behavior and not include in the probe (should it even be an error?)? - in a follow-up bug of course.

Honestly, I am not sure. It seems wrong to me that redirection involves loading a file from omni.ja locally. I'll file a bug for this.

Pushed by kjang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8f8d7e12acaa
Narrow down the zero_byte_load probe r=zbraniecki

Comment on attachment 9206061 [details]
Bug 1695560 - Narrow down the zero_byte_load probe

Beta/Release Uplift Approval Request

  • User impact if declined: No impact. However, this patch is about filtering out some unnecessary events for YSOD. If we don't uplift this, the telemetry data will be inconsistent with the data on nightly. To make it easier for analyzing this data, it'd be great if we can uplift this patch.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch only affects how we send the zero_byte load event.
  • String changes made/needed: N/A
Attachment #9206061 - Flags: approval-mozilla-beta?

This is important because we seem a continuous increase in the event volume from the Beta channel. As of yesterday we passed 18m events per day. This PR will allow us to lower that to ~5m.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 88 Branch

FYI, bug 1696151 will block uplifting this.

Flags: needinfo?(zbraniecki)

Thank you! Bug 1696151 just landed. Should we add request beta flag on that follow up, or would you prefer we folded the two PRs for beta uplift?

Flags: needinfo?(zbraniecki) → needinfo?(jcristau)

:kershaw said he's working on a folded one for uplift.

Flags: needinfo?(jcristau)

Comment on attachment 9206674 [details]
Bug 1695560 - Narrow down the zero_byte_load probe (beta uplift)

Beta/Release Uplift Approval Request

  • User impact if declined: No impact. However, this patch is about filtering out some unnecessary events for YSOD. If we don't uplift this, the telemetry data will be inconsistent with the data on nightly. To make it easier for analyzing this data, it'd be great if we can uplift this patch.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch only affects how we send the zero_byte load event.
  • String changes made/needed: N/A
Attachment #9206674 - Flags: approval-mozilla-beta?

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #15)

Thank you! Bug 1696151 just landed. Should we add request beta flag on that follow up, or would you prefer we folded the two PRs for beta uplift?

I can't remove the previous approval-mozilla-beta flag. Please make sure to uplift the latest patch. Thanks.

Attachment #9206061 - Flags: approval-mozilla-beta?

Comment on attachment 9206674 [details]
Bug 1695560 - Narrow down the zero_byte_load probe (beta uplift)

approved for 87.0b6

Attachment #9206674 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: