Closed Bug 1470907 Opened 6 years ago Closed 6 years ago

the number of dns resolver threads is out of control

Categories

(Core :: Networking: DNS, defect, P2)

defect

Tracking

()

VERIFIED FIXED
mozilla63
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox61 --- unaffected
firefox62 + verified
firefox63 + verified

People

(Reporter: froydnj, Assigned: valentin)

References

Details

(Keywords: regression, Whiteboard: [MemShrink][necko-triaged])

Attachments

(1 file)

If I understand:

https://searchfox.org/mozilla-central/source/netwerk/dns/nsHostResolver.h#45-50

correctly, nsHostResolver should spin up, at most, 8 resolver threads at one time.

Looking at htop on my currently running Nightly, I have 37 DNS Res~ver (sic) threads running in the parent process.

It is not super-clear to me how this situation comes about.  nsHostResolver::ConditionallyCreateThread is, AFAICT, only called under nsHostResolver::mLock, so there shouldn't be weird issues racing on nsHostResolver::mThreadCount.
Oh, perhaps the problem is that GetHostToLookup returns false in *two* situations:

1) We're shutting down;
2) We timed out in some way.

And ThreadFunc interprets both situations as "welp, time to bail".  And if we timeout on *one* lookup, the thread that timed out just sits there forevermore, because nothing joins the individual threads that we spin up.
[Tracking Requested - why for this release]: leaking threads is bad.  Not sure how many releases this affects, or whether it is a side-effect of bug 1426019.

Valentin, can you see if asynchronously shutting down the thread after ThreadFunc has completed (the shutdown will have to be done on the main thread, IIUC) fixes this issue?
Flags: needinfo?(valentin.gosu)
I tried reproducing it for about 15 minutes... no luck. Couldn't get it to have more than 8 DNS threads, but you're right, not shutting them down would probably leave them hanging there. I pushed a patch to try with your suggested change. Can you use the build to see if it keeps happening? Maybe I don't have a good enough test case :)
Flags: needinfo?(valentin.gosu) → needinfo?(nfroyd)
Assignee: nobody → valentin.gosu
Priority: -- → P2
Whiteboard: [MemShrink] → [MemShrink][necko-triaged]
Blocks: 1426019
It reproduces fairly easily on my Linux machine. I just switched off TRR, revisited most tabs and within minutes I was over 8 DNS threads:

$ ps -eL | grep DNS
23574 23674 pts/8    00:00:00 DNS Resolver #1
23574 23803 pts/8    00:00:00 DNS Resolver #2
23574 23804 pts/8    00:00:00 DNS Resolver #3
23574 23805 pts/8    00:00:00 DNS Resolver #4
23574 23806 pts/8    00:00:00 DNS Resolver #5
23574 23807 pts/8    00:00:00 DNS Resolver #6
23574 23924 pts/8    00:00:00 DNS Resolver #7
23574 32245 pts/8    00:00:00 DNS Resolver #8
23574 32253 pts/8    00:00:00 DNS Resolver #9
23574 32309 pts/8    00:00:00 DNS Res~ver #10
Comment on attachment 8987706 [details]
Bug 1470907 - Call AsyncShutdown for threads at the end of nsHostResolver::ThreadFunc

https://reviewboard.mozilla.org/r/252946/#review259596

Thank you.  I haven't tried the try build, though.  Maybe Daniel's STR help?
Attachment #8987706 - Flags: review?(nfroyd) → review+
Also, I wonder if it's the best policy for the resolver threads to drop themselves if there hasn't been activity for them in the last N minutes.  Seems like we may want them to stick around indefinitely to make DNS lookups more responsive (i.e. not potentially have to spin up threads).  But that would be a separate bug.
Flags: needinfo?(nfroyd)
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/a5a1fdc41ec4
Call AsyncShutdown for threads at the end of nsHostResolver::ThreadFunc r=froydnj
> Seems like we may want them to stick around indefinitely

Agreed. It seems like a waste to start/stop them like this. Once they're all started, I think it would be better to let them run.
Blocks: 1471280
https://hg.mozilla.org/mozilla-central/rev/a5a1fdc41ec4
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
[Tracking Requested - why for this release]:
Leaking threads could be a big performance issue.
Comment on attachment 8987706 [details]
Bug 1470907 - Call AsyncShutdown for threads at the end of nsHostResolver::ThreadFunc

Approval Request Comment
[Feature/Bug causing the regression]: bug 1426019
[User impact if declined]: unnecessary use of resources. Possible OOM.
[Is this code covered by automated tests?]: no
[Has the fix been verified in Nightly?]: just landed. verified with try build.
[Needs manual test from QE? If yes, steps to reproduce]: yes. browse normally for about 15 minutes, opening a large number of tabs. Check the resource monitor on windows, or `ps -eL | grep DNS` on linux and make sure the number of threads is never more than 8. 
[List of other uplifts needed for the feature/fix]: none.
[Is the change risky?]: no
[Why is the change risky/not risky?]: it just makes the threads end after ThreadFunc completes, as it did prior to bug 1426019
[String changes made/needed]: none
Attachment #8987706 - Flags: approval-mozilla-beta?
Andrei, can your team verify the fix in nightly? Thanks!
Flags: qe-verify+
Flags: needinfo?(andrei.vaida)
Flags: needinfo?(andrei.vaida) → needinfo?(adrian.florinescu)
I couldn't manage to reproduce the initial issue on a 63.0a1 20180626220124 nor by switching TRR off (network.trr.wait-for-portal). Valentin, I wonder if this is the right preference to switch off (comment 5).

In latest Nightly(2018-06-28) available, with the same steps, I was never over 8 DNS threads.

(Ubuntu 16.04 x64)
Flags: needinfo?(valentin.gosu)
I didn't actually mean to imply that the TRR pref would have anything to do with this issue, just that switching off TRR will use the native threadpool fully. Since the problem needs some timeouts in the resolver queues, it might take a little time to trigger and is probably more likely the larger number of names Firefox tries to resolve...
(In reply to Adrian Florinescu [:adrian_sv] from comment #14)
> I couldn't manage to reproduce the initial issue on a 63.0a1 20180626220124
> nor by switching TRR off (network.trr.wait-for-portal). Valentin, I wonder
> if this is the right preference to switch off (comment 5).
> 
> In latest Nightly(2018-06-28) available, with the same steps, I was never
> over 8 DNS threads.
> 
> (Ubuntu 16.04 x64)

To disable TRR you should set network.trr.mode;0 (it might already be 0)

Indeed, I had trouble myself reproducing the bug. I guess just verifying that it doesn't happen now is OK.
Flags: needinfo?(valentin.gosu)
I've tried a bit more, using excesive tabs and windows, with both 2018-06-26 build and 2018-6-29 build and the results do look similar to me (network.trr.mode  0). 

Nightly 53 2018-06-26
ps -eL | grep DNS
11508 13520 pts/4    00:00:00 DNS Resolver #7
11508 13525 pts/4    00:00:00 DNS Res~ver #10
11508 13526 pts/4    00:00:00 DNS Res~ver #11
11508 13789 pts/4    00:00:00 DNS Res~ver #12
11508 13790 pts/4    00:00:00 DNS Res~ver #13
11508 13791 pts/4    00:00:00 DNS Res~ver #14
11508 13792 pts/4    00:00:00 DNS Res~ver #15
11508 13794 pts/4    00:00:00 DNS Res~ver #16




Nightly 53 2018-06-29
ps -eL | grep DNS
 8564  8680 pts/4    00:00:00 DNS Resolver #1
 8564  8774 pts/4    00:00:00 DNS Resolver #2
 8564  8775 pts/4    00:00:00 DNS Resolver #3
 8564  8879 pts/4    00:00:00 DNS Resolver #5
 8564  8881 pts/4    00:00:00 DNS Resolver #7
 8564  9198 pts/4    00:00:00 DNS Resolver #9
 8564  9199 pts/4    00:00:00 DNS Res~ver #10
 8564  9256 pts/4    00:00:00 DNS Res~ver #11


I couldn't catch the over 8 DNS threads. Guessing that there must be something additional that makes the original issue easier to reproduce, but no idea what.
Since Daniel's env. reproduced best the initial problem, Daniel would you mind giving the fixed version a spin, just to make sure we're not missing anything?
Flags: needinfo?(daniel)
I've not seen more than 8 DNS threads happen with the fixed version - and I've checked a few times and did it basically as I did before when I could see more than eight. It's far from a very tight verification, but it's at least a data point.
Flags: needinfo?(daniel)
Based on comment 16, comment 17 and comment 18, marking this bug as verified.
Status: RESOLVED → VERIFIED
Flags: qe-verify+
Flags: needinfo?(adrian.florinescu)
Comment on attachment 8987706 [details]
Bug 1470907 - Call AsyncShutdown for threads at the end of nsHostResolver::ThreadFunc

Fix for a recent regression, verified in nightly, let's uplift for beta 5.
Attachment #8987706 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Adding the qe-verify+ flag for the verification on beta 62.
Flags: qe-verify+
I was also unlucky in reproducing this bug, using an affected Nightly build from 2018-06-25; after browsing for several minutes on different tabs, the number of DNS threads remained under 8, on Ubuntu 16.04 x64.

However, I spent quite some time (~30 minutes) on Beta 62.0b6 in order to see the current behavior in a fixed build and the bug seemed fixed; the number of DNS threads stayed at no more that eight when tested on Linux.

I've tried to verify this on Windows as well with the Process Explorer, but after a closer look it turned out that symbols were required to see the DNS threads and a bit complicated for manual verification; After chatting with Valentin on slack we concluded that it shouldn't be differences between platforms and the verification on Ubuntu only, will suffice.

Based on the above verification on Ubuntu 16.04 x64, I think it's safe to mark this bug verified fixed on Beta.
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: