Closed Bug 1373708 Opened 3 years ago Closed 2 years ago

Convert Jar channels to stop using main thread I/O

Categories

(Core :: Networking, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox61 --- fixed

People

(Reporter: florian, Assigned: michal)

References

(Depends on 1 open bug, Blocks 6 open bugs)

Details

(Keywords: perf, Whiteboard: [qf:f61][qf:p1][necko-triaged])

Attachments

(4 files, 13 obsolete files)

20.47 KB, patch
u408661
: review+
Details | Diff | Splinter Review
4.07 KB, patch
Details | Diff | Splinter Review
1.15 KB, patch
michal
: review+
Details | Diff | Splinter Review
59 bytes, text/x-review-board-request
florian
: review+
Details
The name looks like it's asynchronous, but as soon as jar files are accessed, we have main thread IO in nsZipArchive::OpenArchive.

See this cold startup profile where NetUtil_asyncFetch costs us 372ms before first paint: https://perfht.ml/2sGURmJ

During startup, the consumers of this API are readJSON functions from http://searchfox.org/mozilla-central/rev/20d16dadd336e0c6b97e3e19dc4ff907744b5734/toolkit/components/extensions/Extension.jsm#424 and http://searchfox.org/mozilla-central/rev/20d16dadd336e0c6b97e3e19dc4ff907744b5734/toolkit/components/extensions/Schemas.jsm#43

This is similar to bug 1363384, in that it's mostly nsZipArchive doing blocking IO.

I'm wondering if there's something else that can be done without waiting on a complete fix for bug 1363384. It seems that the add-on manager callers are designed to be async. Could they use another non-blocking API? Or could they be delayed until after first paint?
NetUtil.asyncFetch isn't doing anything wrong, it's calling nsIChannel.asyncOpen2() as it should.  It is the jar channel that doesn't adhere to the necko API contracts by implementing the async channel APIs synchronously, which is the subject of bug 1363384.  Not sure if there is any point to having this bug also open separately...
(In reply to Florian Quèze [:florian] [:flo] from comment #0)
> During startup, the consumers of this API are readJSON functions from
> http://searchfox.org/mozilla-central/rev/
> 20d16dadd336e0c6b97e3e19dc4ff907744b5734/toolkit/components/extensions/
> Extension.jsm#424 and
> http://searchfox.org/mozilla-central/rev/
> 20d16dadd336e0c6b97e3e19dc4ff907744b5734/toolkit/components/extensions/
> Schemas.jsm#43

Note that these are only used during cold startups, after the startup cache has been flushed. Most of the time, these files are cached in IndexedDB and fetched asynchronously.

(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #1)
> NetUtil.asyncFetch isn't doing anything wrong, it's calling
> nsIChannel.asyncOpen2() as it should.  It is the jar channel that doesn't
> adhere to the necko API contracts by implementing the async channel APIs
> synchronously, which is the subject of bug 1363384.  Not sure if there is
> any point to having this bug also open separately...

I think, arguably, they're both doing something wrong. It would be nice if the JAR channel did its IO off-thread in any case, but asyncFetch should really retarget its data off-loading to a background thread when possible. That would have responsiveness benefits even if the underlying channel didn't have sync IO issues.
Assignee: nobody → xeonchen
Gary, do you have time to take a look here? If not, assign it to Jason, and he'll find someone.
Whiteboard: [necko-active]
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #1)
> NetUtil.asyncFetch isn't doing anything wrong, it's calling
> nsIChannel.asyncOpen2() as it should.  It is the jar channel that doesn't
> adhere to the necko API contracts by implementing the async channel APIs
> synchronously, which is the subject of bug 1363384.  Not sure if there is
> any point to having this bug also open separately...

For me the difference between these 2 bugs is that in bug 1363384 we have code that does synchronously something that it expects to just be a memory access, but sometimes turns out to do IO due to the memory mapping. Whereas in this bug, we have code that knows it'll cause IO, and does the correct thing to do this IO asynchronously, but ends up blocking the main thread anyway.
Status: NEW → ASSIGNED
Comment on attachment 8880712 [details]
Bug 1373708 - make local jar I/O off-main-thread;

cancel review because it breaks a lot of tests
Attachment #8880712 - Flags: review?(aklotz)
Attached patch [WIP] base (obsolete) — Splinter Review
Attached patch [WIP] dispatch (obsolete) — Splinter Review
(In reply to Gary Chen [:xeonchen] (needinfo plz) from comment #9)
> Created attachment 8882242 [details] [diff] [review]
> [WIP] dispatch

It seems that on try server, lots of tests are broken ([1]) if I run |CreateJarInput| asynchronously (I've made the patch simplified on [2], [3]).
The patch works perfectly on my computer, and the error messages on try server are nearly random to me.

According to [4], the jar file is possibly broken because the clients use the file before it's ready.
So we may not just simply make |CreateJarInput| asynchronously, I'll do some deeper investigation.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=0044c2dc5a5f
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=6b8867f1fdb53403a937781a293bb71f2ad30b2c
[3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=0d20211dbc571d255186938ebcf6e53b35beecc6
[4] https://treeherder.mozilla.org/logviewer.html#?job_id=109470825&repo=try&lineNumber=2287
Daniel, do you have any idea how attachment 8882242 [details] [diff] [review] makes attachment 8882241 [details] [diff] [review] broken on try server?

My assumption is that any client calls asyncOpen will process following steps only after it gets the callback from the channel, so if I divide the code within asyncOpen into several parts and dispatch to current thread, most things are supposed to be just work because the sequences are the same. However it works on my computer but not on try server...
Flags: needinfo?(daniel)
(In reply to Gary Chen [:xeonchen] (needinfo plz) from comment #11)
> Daniel, do you have any idea how attachment 8882242 [details] [diff] [review]
> makes attachment 8882241 [details] [diff] [review] broken on try server?

Eek, no sorry. I'm not at all well oriented in the finer details of our threading and runnable methods inner workings so I have no idea about this.

Jason any idea about this or who we can ask for further advice?
Flags: needinfo?(daniel) → needinfo?(jduell.mcbugs)
(In reply to Florian Quèze [:florian] [:flo] from comment #0)

> I'm wondering if there's something else that can be done without waiting on
> a complete fix for bug 1363384. It seems that the add-on manager callers are
> designed to be async. Could they use another non-blocking API? Or could they
> be delayed until after first paint?

Another set of callers to nsJARChannel::AsyncOpen2 that end up blocking on main thread IO is asyncLoadModule (http://searchfox.org/mozilla-central/rev/5e1e8d2f244bd8c210a578ff1f65c3b720efe34e/toolkit/components/extensions/ExtensionCommon.jsm#1024) calling ChromeUtils.compileScript. As seen in this profile: https://perfht.ml/2sPCwQB
I took a quick look but I don't see anything obvious (but I was also a little confused because the little patches here seem to be on top of something that's not mozilla-central but also not the big patch?).

So you don't see any crashes on your local box?  That's too bad, otherwise I'd say try to attach to one of the xpcshell tests that's crashing.
Flags: needinfo?(jduell.mcbugs)
Attached patch [WIP] base (obsolete) — Splinter Review
Attachment #8882241 - Attachment is obsolete: true
(In reply to Jason Duell [:jduell] (needinfo me) from comment #14)
> I took a quick look but I don't see anything obvious (but I was also a
> little confused because the little patches here seem to be on top of
> something that's not mozilla-central but also not the big patch?).
> 

Oh, it's my fault. The base patch is incorrect, and I just updated.


> So you don't see any crashes on your local box?  That's too bad, otherwise
> I'd say try to attach to one of the xpcshell tests that's crashing.

Everything's green on my laptop, and it breaks almost every tests on try.
Comment on attachment 8884195 [details] [diff] [review]
[WIP] base

So this is the correct version, trying to move |CreateJarInput| in to a Runnable.
Attachment #8884195 - Flags: feedback?(jduell.mcbugs)
Any progress on this? Just to remind everybody how important this is, here is a startup profile captured today on the reference hardware where nsJARChannel::AsyncOpen2 blocks startup for more than 7s (yes, seconds) : https://perfht.ml/2uDaOHw

(This was the first startup after a Windows update, and there was a malware scan happening in the background.)
(In reply to Florian Quèze [:florian] [:flo] from comment #18)
> Any progress on this? Just to remind everybody how important this is, here
> is a startup profile captured today on the reference hardware where
> nsJARChannel::AsyncOpen2 blocks startup for more than 7s (yes, seconds) :
> https://perfht.ml/2uDaOHw
> 
> (This was the first startup after a Windows update, and there was a malware
> scan happening in the background.)

I'll spend more time on this, and it's very welcome if anyone has any idea on the WIP patch.
Clearer summary.  Note that the main thread I/O here is both the open() that happens in AsyncOpen (via nsZipArchive::OpenArchive: see comment 0), and the fact that the mmap architecture we're using probably means we're hitting jank (via page faults) all over place, probably on the main thread.
Summary: NetUtil_asyncFetch blocks on main thread IO before first paint → Convert Jar channels to stop using main thread I/O
Gary, can you push your patch to try server again. The old runs are not there any more. I can take a look.

Did you consider turning MOZ_LOG on try?
Flags: needinfo?(xeonchen)
Very useful info from the archives (courtesy of aklotz):
--

I am familiar with some of the liberties that Taras took with the zip specification to allow for additional perf optimizations: I strongly suggest that anybody taking this work on should read Taras' old blog post that illustrates some of the decisions and optimizations that were made:

    https://web.archive.org/web/20120301065240/http://blog.mozilla.com/tglek/2010/09/14/firefox-4-jar-jar-jar/

--

This is great information--I've been tinkering with JAR stuff for years and have always wondered why we do things this way, and if we have data to back it up.
 
I'm more convinced than I was previously that the way to move forward here is to keep the whole underlying Zip/mmap/etc architecture, but to "simply" change the code to 1) open the jar file off-main thread (not in the middle of AsyncOpen), and then 2) do the mmap memory reads on a non-main thread and then post the buffer to the main thread (or whatever the target thread is), which then calls OnDataAvailable.  (That will add some interthread latencies/overhead, but I'm hoping it's small and worth the jank savings).  I'm hoping we can do both of those things without any major changes to the existing code.
(In reply to Dragana Damjanovic [:dragana] from comment #21)
> Gary, can you push your patch to try server again. The old runs are not
> there any more. I can take a look.
> 
> Did you consider turning MOZ_LOG on try?

Thank you Dragana!

Here is the try push:
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=644507895c4025a02dc3d1ff65842f6f1e7edbf5
Flags: needinfo?(xeonchen)
Gary's patch is a good start. There is only one problem: when we install, for example feedConverter we use nsJARChannel::Open, the synch version and that in some cases does not manage to load scripts because asynch reading from omni.ja is not done.
We can postpone opening all channels until omni.ja is loaded or make HttpChannelChild/nsHttpChannel::OnStartRequest deal with asynch loading of jar channels. Only in OnStartRequest we know if we need some of these converters like FeedConverter and other scripts like jar:file:///home/worker/workspace/build/application/firefox/omni.ja!/chrome/en-US/locale/en-US/necko/necko.properties

This is a try run where I have added a lot of logging in nsJARChannel:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=56d5e484df2f7510e174762d5ee79caba97358ce&selectedJob=123590499
The test fail on HttpChannelChild::OnStartRequest, but probably there are othe places that use nsJARChannel::Open.
Thank you Dragana, I'm following this direction to do some investigation.
Duplicate of this bug: 817202
Duplicate of this bug: 1364250
Sounds like we should open omnijar async early in startup--I'd suggest during the socket transport init?  It's fine to special case omnijar--it's huge and we know we need it early.

Daniel did mention seeing JarChannel:Open() calls in bug 1364250.  We should see if we can convert all of these call sites to use AsyncOpen instead.  Might make sense to look into that ASAP, as we may need help from domain experts to do that conversion in some cases.  I opened bug 1391356 for that since it could proceed in parallel with this bug.
(In reply to Jason Duell [:jduell] (needinfo me) from comment #29)
> Sounds like we should open omnijar async early in startup--I'd suggest
> during the socket transport init?  It's fine to special case omnijar--it's
> huge and we know we need it early.

OmniJar files are already opened very early during startup, on the main thread: http://searchfox.org/mozilla-central/rev/13148faaa91a1c823a7d68563d9995480e714979/xpcom/build/Omnijar.cpp#109 But this Omnijar.cpp code doesn't seem to involve jar channels.
Attachment #8884195 - Attachment is obsolete: true
Attachment #8884195 - Flags: feedback?(jduell.mcbugs)
Attachment #8882242 - Attachment is obsolete: true
Hi Gary, what's the status of this bug? It would be really nice to get this fixed in time for 57, as jar I/O is currently very visible in cold startup profiles.
Flags: needinfo?(xeonchen)
Whiteboard: [necko-active] → [necko-active][qf]
(In reply to Florian Quèze [:florian] [:flo] from comment #32)
> Hi Gary, what's the status of this bug? It would be really nice to get this
> fixed in time for 57, as jar I/O is currently very visible in cold startup
> profiles.

I'm fixing try test failure, [1] is the progress until yesterday.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=4a2e99cb57f0124ebe60b71e4515d6d63d557d1d
Flags: needinfo?(xeonchen)
(In reply to Gary Chen [:xeonchen] (needinfo plz) from comment #33)
> (In reply to Florian Quèze [:florian] [:flo] from comment #32)
> > Hi Gary, what's the status of this bug? It would be really nice to get this
> > fixed in time for 57, as jar I/O is currently very visible in cold startup
> > profiles.
> 
> I'm fixing try test failure, [1] is the progress until yesterday.
> 
> [1]
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=4a2e99cb57f0124ebe60b71e4515d6d63d557d1d

Gary,
During QF triage, we wanted to ask what is your confidence with your fix on this bug while not risking any regression to 57. If the answer is no to both questions then you should defer implementing this until 58.
(In reply to Jean Gong from comment #34)
> Gary,
> During QF triage, we wanted to ask what is your confidence with your fix on
> this bug while not risking any regression to 57. If the answer is no to both
> questions then you should defer implementing this until 58.

I would say this is a behavior change and no matter how I'm confident with my patch, it could still break some stuff based on previous implementation, and that's what I'm trying to fix.
This basically works, except some tests that don't wait for |onStopRequest| correctly. In order not to block while fixing broken tests, I'll keep update this file if I found any incorrect case while fix test cases.
Attachment #8880712 - Attachment is obsolete: true
Attachment #8904907 - Flags: review?(dd.mozilla)
This session restore test closes the window before all tabs are successfully loaded.
Attachment #8904909 - Flags: review?(wmccloskey)
Windows are closed too early to consume data from |onDataAvailable|, which caused an assertion ([1]).

[1] https://treeherder.mozilla.org/logviewer.html#?job_id=127695437&repo=try&lineNumber=44063
Attachment #8904910 - Flags: review?(josh)
Comment on attachment 8904907 [details] [diff] [review]
(v2) Part 1: make local jar I/O off-main-thread

I am on vacation, Nick, please take a look...
Attachment #8904907 - Flags: review?(dd.mozilla) → review?(hurley)
Attachment #8904956 - Flags: review?(kmaglione+bmo)
Attachment #8904910 - Flags: review?(josh) → review+
Attachment #8904907 - Flags: review?(hurley) → review+
Whiteboard: [necko-active][qf] → [necko-active][qf:p2]
remove debug message
Attachment #8904909 - Attachment is obsolete: true
Attachment #8904909 - Flags: review?(wmccloskey)
Attachment #8905739 - Flags: review?(wmccloskey)
Comment on attachment 8904956 [details] [diff] [review]
Part 4: make sure extensions are loaded

Review of attachment 8904956 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/extensions/test/browser/browser_ext_browserAction_popup_resize.js
@@ +26,5 @@
>  
>    let browser = await openPanel(extension, undefined, true);
>  
> +  // |openPanel| only waits for browser loaded, but not extension loaded.
> +  // Thus the following |setSize| may fail.

Sorry for the delay. I looked at this a few days ago, but haven't had time to dive deeper to answer my own question:

`await extension.startup()` waits for the extension to finish loading before we get to the `openPanel` call. And `openPanel` should already wait for anything that matters for the `setSize` call to succeed. The only possible hitch I can think of is that there may be a corner case where we get a load for an initial about:blank document before the expected extension document. And if that's the case, we should just handle that case directly rather than adding an idle timeout.

So, what exactly does this change fix?
(In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're blocked) from comment #42)
> Comment on attachment 8904956 [details] [diff] [review]
> Part 4: make sure extensions are loaded
> 
> Review of attachment 8904956 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> :::
> browser/components/extensions/test/browser/
> browser_ext_browserAction_popup_resize.js
> @@ +26,5 @@
> >  
> >    let browser = await openPanel(extension, undefined, true);
> >  
> > +  // |openPanel| only waits for browser loaded, but not extension loaded.
> > +  // Thus the following |setSize| may fail.
> 
> Sorry for the delay. I looked at this a few days ago, but haven't had time
> to dive deeper to answer my own question:
> 
> `await extension.startup()` waits for the extension to finish loading before
> we get to the `openPanel` call. And `openPanel` should already wait for
> anything that matters for the `setSize` call to succeed. The only possible
> hitch I can think of is that there may be a corner case where we get a load
> for an initial about:blank document before the expected extension document.
> And if that's the case, we should just handle that case directly rather than
> adding an idle timeout.
> 
> So, what exactly does this change fix?

Yes, after my OMT patch, while |setSize| is called, it's still at the about:blank page.
So here I'm trying to fix is to wait until the extension is completely loaded.

Actually I see another failure case has similar issue, so I'm still figuring out how to properly fix this kind of issues.
I don't have much experience on this, would you give me some direction/suggestion? Thanks.
Flags: needinfo?(kmaglione+bmo)
fix some errors found while running test cases.
Attachment #8904907 - Attachment is obsolete: true
Attachment #8906933 - Flags: review+
Depends on: 1399409
(In reply to Gary Chen [:xeonchen] (needinfo plz) from comment #43)
> (In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're
> blocked) from comment #42)
> > Comment on attachment 8904956 [details] [diff] [review]
> > Part 4: make sure extensions are loaded
> > 
> > Review of attachment 8904956 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > :::
> > browser/components/extensions/test/browser/
> > browser_ext_browserAction_popup_resize.js
> > @@ +26,5 @@
> > >  
> > >    let browser = await openPanel(extension, undefined, true);
> > >  
> > > +  // |openPanel| only waits for browser loaded, but not extension loaded.
> > > +  // Thus the following |setSize| may fail.
> > 
> > Sorry for the delay. I looked at this a few days ago, but haven't had time
> > to dive deeper to answer my own question:
> > 
> > `await extension.startup()` waits for the extension to finish loading before
> > we get to the `openPanel` call. And `openPanel` should already wait for
> > anything that matters for the `setSize` call to succeed. The only possible
> > hitch I can think of is that there may be a corner case where we get a load
> > for an initial about:blank document before the expected extension document.
> > And if that's the case, we should just handle that case directly rather than
> > adding an idle timeout.
> > 
> > So, what exactly does this change fix?
> 
> Yes, after my OMT patch, while |setSize| is called, it's still at the
> about:blank page.
> So here I'm trying to fix is to wait until the extension is completely
> loaded.
> 
> Actually I see another failure case has similar issue, so I'm still figuring
> out how to properly fix this kind of issues.
> I don't have much experience on this, would you give me some
> direction/suggestion? Thanks.

Looks like this is more complicated than I thought, I've opened bug 1399409 for this issue.
Bulk priority update: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
(In reply to Gary Chen [:xeonchen] (needinfo plz) from comment #43)
> Yes, after my OMT patch, while |setSize| is called, it's still at the
> about:blank page.
> So here I'm trying to fix is to wait until the extension is completely
> loaded.
> 
> Actually I see another failure case has similar issue, so I'm still figuring
> out how to properly fix this kind of issues.
> I don't have much experience on this, would you give me some
> direction/suggestion? Thanks.

Sorry for the delay. You should just be able to check that the document URL isn't "about:blank" in head.js's awaitBrowserLoaded. That should have about the same effect as the idle callback, but more reliably.
Flags: needinfo?(kmaglione+bmo)
Attachment #8904956 - Attachment is obsolete: true
Attachment #8904956 - Flags: review?(kmaglione+bmo)
Changing to P2 as this will not be landed in 57, and we would like to land it in the next iteration.
Priority: P1 → P2
Whiteboard: [necko-active][qf:p2] → [qf:p2]
Depends on: 1401081
Depends on: 1401082
Attachment #8905739 - Flags: review?(wmccloskey) → review+
Is this ready to land?
Flags: needinfo?(xeonchen)
Do we have performance numbers here?  How much does this fix help?
(In reply to Jason Duell [:jduell] (needinfo me) from comment #49)
> Is this ready to land?

Yes, after bug 1401081 landed.

(In reply to Jason Duell [:jduell] (needinfo me) from comment #50)
> Do we have performance numbers here?  How much does this fix help?

Not yet, I just got the reference platform. I'll check it :)
Flags: needinfo?(xeonchen)
Whiteboard: [qf:p2] → [qf:p2][necko-triaged]
I've measured the |NetUtil_asyncFetch| function on reference hardware.
Both launched after full OS reboot.

before [1]: NetUtil_asyncFetch took 43ms
after  [2]: NetUtil_asyncFetch took 5ms


[1] https://perfht.ml/2ysMkHy
[2] https://perfht.ml/2ytv8Bw
make OMT default enabled, carry r+
Attachment #8906933 - Attachment is obsolete: true
Attachment #8915865 - Flags: review+
Keywords: perf
Talos result shows it's ~90% faster for startup cases.

[1] https://mzl.la/2ycBD91
Depends on: 1407547
(In reply to Gary Chen [:xeonchen] (needinfo plz) from comment #55)
> Talos result shows it's ~90% faster for startup cases.
> 
> [1] https://mzl.la/2ycBD91

cpstartup is "content process startup", not startup of the browser. The ts_paint_webext changes do look like nice startup improvements though.

But comparing talos between try and mozilla-central is usually not reliable, so to have less noisy results you would need to push to try the changeset right before your patch.
(In reply to Florian Quèze [:florian] [:flo] from comment #56)
> cpstartup is "content process startup", not startup of the browser. The
> ts_paint_webext changes do look like nice startup improvements though.
> 
> But comparing talos between try and mozilla-central is usually not reliable,
> so to have less noisy results you would need to push to try the changeset
> right before your patch.

Thank you Florian, I have a new comparison to the base on try.

[1] https://mzl.la/2ycxmVn
Blocks: 1363384
Blocks: 1409331
Blocks: 1409332
Blocks: 1409333
Blocks: 1409334
Blocks: 1409335
Blocks: 1409336
Blocks: 1409337
Depends on: 1415115
Whiteboard: [qf:p2][necko-triaged] → [qf:p1][necko-triaged]
(In reply to Gary Chen [:xeonchen] (needinfo plz) from comment #51)
> (In reply to Jason Duell [:jduell] (needinfo me) from comment #49)
> > Is this ready to land?
> 
> Yes, after bug 1401081 landed.

Looks like that other bug had its patch land ~2 months back.  Is this one ready to land now?
Flags: needinfo?(xeonchen)
(In reply to Daniel Holbert [:dholbert] (recovering from vacation bugmail backlog) from comment #58)
> (In reply to Gary Chen [:xeonchen] (needinfo plz) from comment #51)
> > (In reply to Jason Duell [:jduell] (needinfo me) from comment #49)
> > > Is this ready to land?
> > 
> > Yes, after bug 1401081 landed.
> 
> Looks like that other bug had its patch land ~2 months back.  Is this one
> ready to land now?

Unfortunately I still got error on xpcshell-test of Win32 platforms.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=190a6cb322667f5865c4af5d0083e66067bf1d8d

Still looking into it. :(
Flags: needinfo?(xeonchen)
Depends on: 1422496
Whiteboard: [qf:p1][necko-triaged] → [qf:i60][qf:p1][necko-triaged]
Whiteboard: [qf:i60][qf:p1][necko-triaged] → [qf:f60][qf:p1][necko-triaged]
rebase, carry r+
Attachment #8915865 - Attachment is obsolete: true
Attachment #8940077 - Flags: review+
Comment on attachment 8904910 [details] [diff] [review]
Part 3: wait until the window is closed

looks like this is not necessary anymore
Attachment #8904910 - Attachment is obsolete: true
I think test failures are filed intermittent bug, so it's worthy to give this a try.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9fafd12416b332311571fc664355acfd950cd26e
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d31af7725e65
Part 1: Make local jar I/O off-main-thread. r=nwgh
https://hg.mozilla.org/integration/mozilla-inbound/rev/a30d2216e0b9
Part 2: Wait until all tabs are closed. r=billm
Keywords: checkin-needed
Duplicate of this bug: 1363384
https://hg.mozilla.org/mozilla-central/rev/d31af7725e65
https://hg.mozilla.org/mozilla-central/rev/a30d2216e0b9
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Backout by nerli@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/f7f5ba2214d2
Backed out 2 changesets frequently failing browser-chrome's browser/components/sessionstore/test/browser_dying_cache.js a=backout
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla59 → ---
Also, this made browser_all_files_referenced.js permafail on Win32 builds on Beta (confirmed it was this patch via bisection on Try). Looks maybe OOMish.

https://treeherder.mozilla.org/logviewer.html#?job_id=155779172&repo=try
https://treeherder.mozilla.org/logviewer.html#?job_id=155779168&repo=try
Depends on: 1331853
No longer depends on: 1331853
Depends on: 1430039
Depends on: 1430088
(In reply to Ryan VanderMeulen [:RyanVM] from comment #67)
> Also, this made browser_all_files_referenced.js permafail on Win32 builds on
> Beta (confirmed it was this patch via bisection on Try). Looks maybe OOMish.
> 
> https://treeherder.mozilla.org/logviewer.html#?job_id=155779172&repo=try
> https://treeherder.mozilla.org/logviewer.html#?job_id=155779168&repo=try

It shows ~900 threads are created when it crashed, the names of the threads are printed in [1], which show they are StreamTrans threads. But the thread pool size is set to 25 and by nsThreadPool's log, there's no more than 4 StreamTrans threads concurrently.
(In reply to Gary Chen [:xeonchen] (needinfo plz) from comment #68)
> (In reply to Ryan VanderMeulen [:RyanVM] from comment #67)
> > Also, this made browser_all_files_referenced.js permafail on Win32 builds on
> > Beta (confirmed it was this patch via bisection on Try). Looks maybe OOMish.
> > 
> > https://treeherder.mozilla.org/logviewer.html#?job_id=155779172&repo=try
> > https://treeherder.mozilla.org/logviewer.html#?job_id=155779168&repo=try
> 
> It shows ~900 threads are created when it crashed, the names of the threads
> are printed in [1], which show they are StreamTrans threads. But the thread
> pool size is set to 25 and by nsThreadPool's log, there's no more than 4
> StreamTrans threads concurrently.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=98118b75eea6005547dcfeb88deede2588e78122&selectedJob=156530171
(In reply to Ryan VanderMeulen [:RyanVM] from comment #67)
> Also, this made browser_all_files_referenced.js permafail on Win32 builds on
> Beta (confirmed it was this patch via bisection on Try). Looks maybe OOMish.
> 
> https://treeherder.mozilla.org/logviewer.html#?job_id=155779172&repo=try
> https://treeherder.mozilla.org/logviewer.html#?job_id=155779168&repo=try

Hi Florian, bug 1349307 used to have similar issue on this test, do you think it's reasonable to skip the test in win32 as well?
Flags: needinfo?(florian)
Move to bug 1430088
Flags: needinfo?(florian)
Seems we need someone else to work on this...
Assignee: xeonchen → jduell.mcbugs
Comment on attachment 8940077 [details] [diff] [review]
(v3.2) Part 1: make local jar I/O off-main-thread

Review of attachment 8940077 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/libjar/nsJARChannel.cpp
@@ +537,5 @@
> +        // nsJARChannel must be release on main thread, and sometimes
> +        // this still hold nsJARChannel after dispatched.
> +        self = nullptr;
> +
> +        NS_DispatchToMainThread(target);

target.forget()

This will fix bug 1430039, to make sure target is released on main thread.
Hey jduell, looks like this is close but needs someone to drive it over the line. Does anybody on your team have cycles for this, or should I try to find someone?
Flags: needinfo?(jduell.mcbugs)
Michal is taking it (although you certainly pricked my ears up by offering to find someone :)
Assignee: jduell.mcbugs → michal.novotny
Flags: needinfo?(jduell.mcbugs)
This is blocked by 3 bugs.
Bug 1331853 is not fixed yet, bug 1430039 can be fixed by comment 73, and I think bug 1430088 you can just disable it.
See Also: → 1427726
Whiteboard: [qf:f60][qf:p1][necko-triaged] → [qf:f61][qf:p1][necko-triaged]
This patch causes intermittent failure of browser/components/sessionstore/test/browser_354894_perwindowpb.js. I cannot reproduce it locally but on try server it fails almost always: https://treeherder.mozilla.org/logviewer.html#?job_id=163910312&repo=try&lineNumber=3476

It seems that with this patch whenNewWindowLoaded() could resolve the promise before the session is restored. Tim, do you have any idea how to change whenNewWindowLoaded() to fix this?
Flags: needinfo?(ttaubert)
I can't reproduce this either, but here's something you can try. In the failing test, change `injectTestTabs()` to:

> async function injectTestTabs(win) {
>   await Promise.all(TEST_URLS.map(function(url) {
>     let tab = win.gBrowser.addTab(url);
>     return promiseBrowserLoaded(tab.linkedBrowser);
>   }));
> }

And then change `setupTest()` to do:

> await injectTestTabs(newWin);

instead of just calling the function without waiting. We open a new window, add tabs, close the window, and expect all the tabs to reappear. That we're not waiting for the tabs to finish loading seems suspicious to me.
Flags: needinfo?(ttaubert)
Here's the diff that I described above: https://pastebin.mozilla.org/9078614
Depends on: 1379072
Thanks Tim, this helps. Unfortunately, I didn't notice earlier that it's already covered by patch in comment #37. I'll move the patch to bug 1379072.
Comment on attachment 8905739 [details] [diff] [review]
(v2) Part 2: wait until all tabs are closed

This has landed in bug 1379072.
Attachment #8905739 - Attachment is obsolete: true
Attached patch patch v4Splinter Review
- includes fix from comment #73
- removes aAllowAsync argument from nsJARChannel::LookupFile(), it's unused
- patch v3.1 ignored mPreCachedJarReader, this is now fixed
- ContinueOpenLocalFile() and OnOpenLocalFileComplete() have now argument aIsSyncCall to indicate when we should call NotifyError() in OnOpenLocalFileComplete(). This was needed due to mPreCachedJarReader change, when we call CreateJarInput() directly from OpenLocalFile().
Attachment #8940077 - Attachment is obsolete: true
Attachment #8955176 - Flags: review?(hurley)
interdiff for easier review
Comment on attachment 8955176 [details] [diff] [review]
patch v4

Review of attachment 8955176 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, Michal.
Attachment #8955176 - Flags: review?(hurley) → review+
Hi Michal! Looks like this is all good to go - just so I know, are you waiting until after the next uplift to land?
Flags: needinfo?(michal.novotny)
There are at least 2 intermittent failures (bug 1443318 and 1443640) which are failing consistently with this patch and the latest revision. But those tests have not been changed since my last push on try of the same patch on the top of an older revision, where they didn't fail. So now I'm trying to nail down the changesets that cause it.
Flags: needinfo?(michal.novotny)
Depends on: 1443640
It seems that browser_dying_cache.js no longer fails with this patch.
No longer depends on: 1331853
This seems to be ready to land, let's see if it sticks.
Keywords: checkin-needed
Pushed by nbeleuzu@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9e3a9fb7c903
Convert Jar channels to stop using main thread I/O, r=nwgh
Keywords: checkin-needed
In bug 1430088 we disabled the browser_all_files_referenced.js test in Win32 for browser chrome tests, but I forgot we also need to disable it for devtools tests, so we need to change this line:
https://searchfox.org/mozilla-central/rev/99df6dad057b94c2ac4d8b12f7b0b3a7fdf25d04/devtools/client/framework/test/browser.ini#133

I think "(os == 'linux' && bits == 32)" can just be changed to "bits == 32"
(In reply to Andreea Pavel [:apavel] from comment #90)
> Devtools:
> https://treeherder.mozilla.org/logviewer.html#?job_id=167659217&repo=mozilla-
> inbound&lineNumber=1938

How is it possible that test browser/base/content/test/static/browser_all_files_referenced.js is run on windows7-32-pgo when it was disabled on win32 in bug 1430088?
(In reply to Michal Novotny (:michal) from comment #92)
> How is it possible that test
> browser/base/content/test/static/browser_all_files_referenced.js is run on
> windows7-32-pgo when it was disabled on win32 in bug 1430088?

Most likely some  condition in the bug is not correct reason why this test still runs on win32. 
In that patch i see  
[browser_all_files_referenced.js]
+skip-if = bits == 32   - shouldn't it be '32'?
(In reply to Andreea Pavel [:apavel] from comment #93)
> [browser_all_files_referenced.js]
> +skip-if = bits == 32   - shouldn't it be '32'?

Hm, no, I don't think the quotes are necessary - see for example: https://searchfox.org/mozilla-central/rev/99df6dad057b94c2ac4d8b12f7b0b3a7fdf25d04/browser/components/extensions/test/browser/browser-common.ini#154
Comment 91 already covered it - we run the same test in two different suites with two different manifest entries that need updating. I took a look at making that situation a bit saner over in bug 1355984, but ran into issues with test packaging that I haven't been able to spend time looking at further.
Okay, sounds like florian's approach in comment 91 has everything we need then? Is this ready to roll then?
ni?ing myself to check in on this in the next little while to make sure it's going to land.
Flags: needinfo?(mconley)
(In reply to Mike Conley (:mconley) (:⚙️) (Totally backlogged on reviews and needinfos) from comment #96)
> Okay, sounds like florian's approach in comment 91 has everything we need
> then? Is this ready to roll then?

I r+'ed attachment 8958470 [details] [diff] [review] in bug 1430088 for this.
(In reply to Florian Quèze [:florian] from comment #98)
> (In reply to Mike Conley (:mconley) (:⚙️) (Totally backlogged on reviews and
> needinfos) from comment #96)
> > Okay, sounds like florian's approach in comment 91 has everything we need
> > then? Is this ready to roll then?
> 
> I r+'ed attachment 8958470 [details] [diff] [review] in bug 1430088 for this.

I've set checkin-needed keyword in that bug, but it has been cleared when the first patch landed on m-c, which I didn't notice.

Anyway, I'm still trying to find out what's the problem with the xpcshell test, so this patch isn't yet ready to land again.
Flags: needinfo?(michal.novotny)
(In reply to Andreea Pavel [:apavel] from comment #90)
> Xpcshell:
> https://treeherder.mozilla.org/logviewer.html#?job_id=167654130&repo=mozilla-
> inbound&lineNumber=11247

From the log it's not clear to me what is failing in the test. I thought the problem is the failure in amIAddonManagerStartup.enumerateZipFile, but locally I see the same error call stack and the test succeeds. Bob, you've made some changes in the test so I assume you're familiar with it. Can you please have a look at the log and identify what's failing and what could be a cause?
Flags: needinfo?(bob.silverberg)
I don't think I'm the right person to look into this. Transferring my needinfo to ddurst who should be able to recommend someone.
Flags: needinfo?(bob.silverberg) → needinfo?(ddurst)
So, there are one or two problems here, depending on how you look at it:

1) We don't wait for the extension to finish starting up in this test before we try to remove it. We generally try to do that, and it would make this problem go away, at least.

2) When we uninstall an extension, we flush its zip reader cache entry, and expect that to synchronously close the file descriptor that the cache entry holds open. In this case, though, since the extension is still starting up and has a pending background request, the input thunk holds the descriptor alive even after its cache entry has been flushed.

If this is something we care about fixing, we should probably have flush-cache-entry cancel any background requests for the jar and wait for their descriptors to close.

That's probably better as a follow-up, though. For now, we should probably just fix the test to correctly wait for extension startup.
Flags: needinfo?(ddurst)
https://hg.mozilla.org/integration/mozilla-inbound/rev/f1625f41dda7a51114cc0c51dc6b468897a01622
Bug 1373708: Part α: Fix test_webextension_install to wait for startup before uninstalling. r=aswan
Keywords: leave-open
Thanks Kris, this fixed the failure.
Pushed by dluca@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a06dc76c6ed8
Convert Jar channels to stop using main thread I/O, r=nwgh
Keywords: checkin-needed
(In reply to Florian Quèze [:florian] from comment #91)
> In bug 1430088 we disabled the browser_all_files_referenced.js test in Win32
> for browser chrome tests.

Florian, I see another OOM failure of a different test in the same directory after the patch has re-landed. See Windows 7 pgo bc5 failure https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=231448afb53d8997b569a42f682adb79042bab1a&selectedJob=168649253. Shouldn't we disable the tests for all 32bit OSes?
Flags: needinfo?(florian)
(In reply to Michal Novotny (:michal) from comment #107)
> Florian, I see another OOM failure of a different test in the same directory
> after the patch has re-landed. See Windows 7 pgo bc5 failure
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> inbound&revision=231448afb53d8997b569a42f682adb79042bab1a&selectedJob=1686492
> 53. Shouldn't we disable the tests for all 32bit OSes?

:(

Please disable the tests on 32 bit OSes for now and I'll update them to throttle the requests... I don't understand why we didn't just do that the first time it started failing...
Flags: needinfo?(florian)
Attached patch Disable tests on 32bit OSes (obsolete) — Splinter Review
Attachment #8959744 - Flags: review?(kmaglione+bmo)
Comment on attachment 8959744 [details] [diff] [review]
Disable tests on 32bit OSes

Please also remove the redundant skip-if statements below.
Attachment #8959744 - Flags: review?(kmaglione+bmo) → review+
Attachment #8959744 - Attachment is obsolete: true
Attachment #8959745 - Flags: review+
Keywords: checkin-needed
Pushed by dluca@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b3bb4c38a0ef
Temporarily disable tests on all 32 bit OSes due to OOM failures, r=kmag
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/a06dc76c6ed8
https://hg.mozilla.org/mozilla-central/rev/b3bb4c38a0ef
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Comment on attachment 8959750 [details]
Bug 1373708: Follow-up: Throttle static browser tests so they don't OOM when using background requests.

https://reviewboard.mozilla.org/r/228602/#review234422

Good idea :-).

::: browser/base/content/test/static/browser.ini
(Diff revision 1)
> -skip-if = asan || debug || (os == 'linux' && bits == 32)
>  support-files =
>    head.js
>  
>  [browser_all_files_referenced.js]
> -skip-if = (os == 'win' && bits == 32)

This test should remain disabled for win32 verify jobs, see bug 1412343. It's also an OOM issue, but related to opening xul.dll in an array buffer repeatedly.

::: browser/base/content/test/static/browser_parsable_script.js:125
(Diff revision 1)
> -    }
> -    allPromises.push(parsePromise(uri.spec));
> -  }
> +    }
> -
> -  let promiseResults = await Promise.all(allPromises);
> -  is(promiseResults.filter((x) => !x).length, 0, "There should be 0 parsing errors");
> +    return parsePromise(uri.spec);
> +  });
> +  ok(true, "All files parsed");

Where is this check going?

::: browser/base/content/test/static/head.js:150
(Diff revision 1)
>      };
>      xhr.send(null);
>    });
>  }
> +
> +async function throttledMapPromises(limit, iterable, task) {

I would make 'limit' the last parameter, and give it a default value so that you don't have to specify it in each test file.

::: browser/base/content/test/static/head.js:160
(Diff revision 1)
> +    }
> +
> +    let promise = task(data);
> +    if (promise) {
> +      let cleanup = () => promises.delete(promise);
> +      promise.then(cleanup, cleanup);

promise.finally ? And then you no longer need the 'cleanup' variable.
Attachment #8959750 - Flags: review?(florian) → review-
Comment on attachment 8959750 [details]
Bug 1373708: Follow-up: Throttle static browser tests so they don't OOM when using background requests.

https://reviewboard.mozilla.org/r/228602/#review234422

> Where is this check going?

Without it, the job fails because there are no assertions. The alternative was to add an assertion to every success case, but that seemed unnecessarily noisy.
Comment on attachment 8959750 [details]
Bug 1373708: Follow-up: Throttle static browser tests so they don't OOM when using background requests.

https://reviewboard.mozilla.org/r/228602/#review234436

Looks good to me, thanks!
Attachment #8959750 - Flags: review?(florian) → review+
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla61 → ---
Keywords: leave-open
https://hg.mozilla.org/integration/mozilla-inbound/rev/a8f2043af23a7a7e329a3e0418dc993e732c87fd
Bug 1373708: Follow-up: Throttle static browser tests so they don't OOM when using background requests. r=florian
Depends on: 1446821
Keywords: leave-open
Depends on: 1429628
https://hg.mozilla.org/mozilla-central/rev/a8f2043af23a
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
This bug is kind of a mess to follow - from my understanding, what landed and closed this bug in comment 121 was a test update, but the main-thread IO fix was backed out and hasn't re-landed... so I _think_ this is still open.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla61 → ---
(In reply to Mike Conley (:mconley) (:⚙️) (Totally backlogged on reviews and needinfos) from comment #122)
> This bug is kind of a mess to follow - from my understanding, what landed
> and closed this bug in comment 121 was a test update, but the main-thread IO
> fix was backed out and hasn't re-landed... so I _think_ this is still open.

Urgh. Sorry. I'm not sure how I wound up removing the leave-open keyword.
(In reply to Noemi Erli[:noemi_erli] from comment #119)
> Backed out 2 changesets (bug 1373708) for frequent xpcshell failures in
> xpcshell.ini:toolkit/components/extensions/test/xpcshell/
> test_ext_startup_perf.js a=backout

This was fixed in bug 1446821 which has landed on m-i.
Flags: needinfo?(michal.novotny)
Keywords: checkin-needed
We're sorry - something has gone wrong while rewriting or rebasing your commits. The commits being pushed no longer match what was requested. Please file a bug.
Flags: needinfo?(michal.novotny)
Keywords: checkin-needed
I don't understand the problem. The patch applies cleanly to the latest revision. Did you try to land correct patch (attachment 8955176 [details] [diff] [review])? That one was backed out in comment 119.
Flags: needinfo?(michal.novotny)
Keywords: checkin-needed
Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8b721870d808
Convert Jar channels to stop using main thread I/O, r=nwgh
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/8b721870d808
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Depends on: 1448290
You need to log in before you can comment on or make changes to this bug.