Bug 1510262 (Open) · Opened 5 years ago · Updated 21 days ago

nsIncrementalDownload causes a lot of jank with main thread I/O while downloading updates in the background

Categories

(Core :: Networking: HTTP, defect, P2)

defect

Tracking

()

Performance Impact medium

People

(Reporter: florian, Unassigned, NeedInfo)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: perf, perf:responsiveness, Whiteboard: [necko-triaged][bhr:nsIncrementalDownload][necko-priority-next])

See this profile captured on the 2018 reference hardware: https://perfht.ml/2RiiwTM

This happens when downloading an update in the background, which can occur while the user is attempting to load pages, especially since this background update download tends to start soon after startup.
Whiteboard: [qf] → [qf:p2:responsiveness]
The jank occurs when closing the fd, and we currently open and close the fd on every OnDataAvailable (ODA) call.

We could close the fd only in OnStopRequest, but it's risky to modify such old code [1].
[1] https://searchfox.org/mozilla-central/rev/0859e6b10fb901875c80de8f8fc33cbb77b2505e/netwerk/base/nsIncrementalDownload.cpp#60
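As a rough illustration of that narrow fix (not the actual patch): keep the file descriptor open across OnDataAvailable calls and close it once in OnStopRequest. Member names such as mOutFd are hypothetical here, even though mDest/mChunk/mChunkLen mirror the existing code; signatures are the modern two-argument OnStopRequest form.

```cpp
// Hypothetical sketch only: open the destination file once, write chunks to
// the same fd, and pay for PR_Close() once per request instead of per ODA.
nsresult nsIncrementalDownload::FlushChunk() {
  if (!mOutFd) {
    // Open on first flush instead of on every OnDataAvailable.
    nsresult rv = mDest->OpenNSPRFileDesc(
        PR_WRONLY | PR_CREATE_FILE | PR_APPEND, 0600, &mOutFd);
    if (NS_FAILED(rv)) {
      return rv;
    }
  }

  int32_t n = PR_Write(mOutFd, mChunk.get(), mChunkLen);
  if (n != int32_t(mChunkLen)) {
    return NS_ERROR_UNEXPECTED;
  }

  mCurrentSize += mChunkLen;
  mChunkLen = 0;
  return NS_OK;
}

NS_IMETHODIMP
nsIncrementalDownload::OnStopRequest(nsIRequest* aRequest, nsresult aStatus) {
  if (mChunkLen > 0) {
    FlushChunk();  // flush whatever is left
  }
  if (mOutFd) {
    PR_Close(mOutFd);  // the expensive close now happens once per request
    mOutFd = nullptr;
  }
  // ... existing completion / retry logic would follow here ...
  return NS_OK;
}
```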
Component: Networking → Networking: HTTP
Priority: -- → P2
Whiteboard: [qf:p2:responsiveness] → [qf:p2:responsiveness][necko-triaged]
(In reply to Junior Hsu from comment #1)

> We might only close the fd on OnStopRequest

Ideally the whole thing should move off the main thread. We only need to send something back to the main thread if we have JavaScript code on the receiving end.

http://queze.net/bhr/#filter=nsIncrementalDownload%3A%3AFlush shows we have hangs for the PR_Close you pointed to, but we also have some for PR_Write and for OpenNSPRFileDesc.
Agree, this has to move to a different thread. Is this a recent regression?
(In reply to Honza Bambas (:mayhemer) from comment #3)
> Is this a recent regression?

I don't think so; I already noticed this behavior when profiling for Quantum Flow / Firefox 57. It's more likely the last remaining occurrence of something that was kinda OK a decade ago.
Honza, is this something you can tackle?
Assignee: nobody → honzab.moz
Flags: needinfo?(honzab.moz)
The following calls hit PR_Close():

OnStartRequest:
https://searchfox.org/mozilla-central/rev/3160ddc1f0ab55d230c595366662c62950e5c785/netwerk/base/nsIncrementalDownload.cpp#616
OnDataAvailable:
https://searchfox.org/mozilla-central/rev/3160ddc1f0ab55d230c595366662c62950e5c785/netwerk/base/nsIncrementalDownload.cpp#690
OnStopRequest:
https://searchfox.org/mozilla-central/rev/3160ddc1f0ab55d230c595366662c62950e5c785/netwerk/base/nsIncrementalDownload.cpp#653


Solution:
- retarget the channel to the stream transport service
- have a task queue, as at [1]
- all writes to the file will happen there
- all notifications (start/progress) must go to the main thread, and only after the file operations (truncate in OnStartRequest, final write in OnStopRequest) are done; this is a precaution, not something I have verified we strictly need (a rough sketch follows below)

[1] https://searchfox.org/mozilla-central/rev/3160ddc1f0ab55d230c595366662c62950e5c785/netwerk/base/nsNetUtil.cpp#1433-1447
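A minimal sketch of that plan, assuming the channel supports nsIThreadRetargetableRequest and using NS_CreateBackgroundTaskQueue as the serial target (the nsNetUtil.cpp code linked above uses a task queue on the stream transport service instead); mWriteTarget is a hypothetical member, and this is not the actual patch. The listener side would still need to be made safe for off-main-thread delivery.

```cpp
// Rough sketch: retarget ODA delivery off the main thread so all file writes
// land on one serial background target.
NS_IMETHODIMP
nsIncrementalDownload::OnStartRequest(nsIRequest* aRequest) {
  // ... existing validation / truncation logic stays here (and would itself
  // eventually run on the background target) ...

  // Hypothetical member holding the serial target used for file I/O.
  nsCOMPtr<nsISerialEventTarget> target;
  nsresult rv = NS_CreateBackgroundTaskQueue("nsIncrementalDownload",
                                             getter_AddRefs(target));
  NS_ENSURE_SUCCESS(rv, rv);
  mWriteTarget = target;

  // Ask the channel to deliver OnDataAvailable on that target instead of the
  // main thread; if retargeting fails we fall back to main-thread delivery.
  nsCOMPtr<nsIThreadRetargetableRequest> retargetable = do_QueryInterface(aRequest);
  if (retargetable) {
    mozilla::Unused << retargetable->RetargetDeliveryTo(target);
  }
  return NS_OK;
}
```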
Status: NEW → ASSIGNED
Flags: needinfo?(honzab.moz)
https://perfht.ml/2GFHcEG has us janking for a combined total of about 15 seconds (7k samples on Windows for nsIncrementalDownload::FlushChunk, plus some more elsewhere) because it's downloading a complete MAR file, on the reference hardware with its disk + CPU contention... fixing this bug would really help quite a lot in those cases.

I find the jank in this startup profile even more impressive: https://perfht.ml/2Ezid1L Probably because I was using a fast internet connection on a machine with slow I/O, so the main thread blocked on I/O the whole time.

This could be nicely fixed with bug 1528285. Looking more into the code, the observer is only an nsIRequestObserver, which we can easily dispatch to the main thread. The same goes for mProgressSink.

Adding the dependency; should it be significantly delayed, I can move only the writes/flushes to a background thread.
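For the observer part, a short hedged sketch of what bouncing the nsIRequestObserver callback back to the main thread could look like; NotifyObserverStart is a hypothetical helper, mObserver is the existing observer member, and the lambda-based forwarding is purely illustrative.

```cpp
// Illustrative only: once the download work runs off the main thread, the
// consumer-facing nsIRequestObserver callback is forwarded back to it.
void nsIncrementalDownload::NotifyObserverStart() {
  nsCOMPtr<nsIRequestObserver> observer = mObserver;
  nsCOMPtr<nsIRequest> self = this;
  NS_DispatchToMainThread(NS_NewRunnableFunction(
      "nsIncrementalDownload::NotifyObserverStart",
      [observer, self]() { observer->OnStartRequest(self); }));
}
```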

Depends on: omt-networking
Assignee: honzab.moz → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(gijskruitbosch+bugs)
Whiteboard: [qf:p2:responsiveness][necko-triaged] → [qf:p2:responsiveness][necko-triaged][bhr:nsIncrementalDownload]
Priority: P2 → P3
Performance Impact: --- → P2
Whiteboard: [qf:p2:responsiveness][necko-triaged][bhr:nsIncrementalDownload] → [necko-triaged][bhr:nsIncrementalDownload]
Severity: normal → S3

Can this still be reproduced?

We have omt-networking in the content-process scheduled for H1, 2023, fwiw.

(In reply to Andrew Creskey [:acreskey] from comment #11)

> Can this still be reproduced?

I'm not sure. Florian, do you have cycles + a machine to check this on? It may be worth trying release vs. nightly here given the work on socket process etc.

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(florian)

(In reply to Andrew Creskey [:acreskey] from comment #11)

> Can this still be reproduced?

I haven't tried recently, but it's still visible on BHR: https://fqueze.github.io/hang-stats/#date=20230222&row=0&filter=nsIncrementalDownload

It's possible we no longer see it on Windows if updates are downloaded by a background service there.

I think comment 6 still holds.
I think we can improve things at least a little by retargeting OnDataAvailable to a background thread. We'd still have main-thread I/O in OnStartRequest/OnStopRequest, but that's just two calls out of many.
Implementing nsIThreadRetargetableRequest, handling OnDataAvailable off the main thread, and dispatching progress notifications back to the main thread should be fairly straightforward.
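Under that approach, a minimal sketch of what the off-main-thread OnDataAvailable path could look like; WriteChunk and UpdateProgress are hypothetical helpers standing in for the existing flush and progress-sink logic, and this is not the actual patch.

```cpp
// Illustrative sketch: with delivery retargeted, OnDataAvailable runs on a
// background thread, does the file write there, and only the progress update
// is handed back to the main thread.
NS_IMETHODIMP
nsIncrementalDownload::OnDataAvailable(nsIRequest* aRequest,
                                       nsIInputStream* aInput,
                                       uint64_t aOffset, uint32_t aCount) {
  MOZ_ASSERT(!NS_IsMainThread(), "expected off-main-thread delivery");

  // Append the new data to the destination file on this background thread.
  nsresult rv = WriteChunk(aInput, aCount);  // hypothetical helper
  NS_ENSURE_SUCCESS(rv, rv);

  // Progress reporting stays a main-thread affair, using the same
  // NS_DispatchToMainThread pattern sketched earlier.
  int64_t current = mCurrentSize;
  RefPtr<nsIncrementalDownload> self = this;
  NS_DispatchToMainThread(NS_NewRunnableFunction(
      "nsIncrementalDownload::UpdateProgress",
      [self, current]() { self->UpdateProgress(current); }));
  return NS_OK;
}
```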

Priority: P3 → P2
Whiteboard: [necko-triaged][bhr:nsIncrementalDownload] → [necko-triaged][bhr:nsIncrementalDownload][necko-priority-next]
See Also: → 1348087