DNS: nsHostResolver should add all negative nsHostRecords to the eviction queue

RESOLVED FIXED in Firefox 28

Status


defect
RESOLVED FIXED
Opened: 6 years ago
Last modified: 5 years ago

People

(Reporter: sworkman, Assigned: sworkman)

Tracking

Version: 25 Branch
Target Milestone: mozilla28
Points:
---

Firefox Tracking Flags

(firefox28 fixed)

Details

(Whiteboard: [MemShrink][qa-])

Attachments

(2 attachments)

In nsHostResolver::OnLookupComplete, nsHostRecords that don't have a result from getaddrinfo are not added to the eviction queue. This means they are not deleted until shutdown.

Note: mMaxCacheEntries is compared against the size of the eviction queue, so these undeleted nsHostRecords are not counted in the DNS cache size. These negative entries therefore add to the overall size of the cache, potentially pushing it beyond mMaxCacheEntries.

Note 2: I checked hg blame to see if there was a reason for these records not being deleted, but this behavior seems to date from the file's creation/import into hg, so there's no useful information there.
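
For context, a minimal sketch of the kind of change this implies, reusing names already in nsHostResolver.cpp (mEvictionQ, mEvictionQSize, mDB); the exact condition and eviction handling here are assumptions for illustration, not the landed patch:

// Sketch only: in OnLookupComplete, append the record to the eviction
// queue even when getaddrinfo produced no result, so that negative
// entries count against mMaxCacheEntries and are eventually freed.
if (!mShutdown) {  // assumed: previously this path also required rec->addr_info
    PR_APPEND_LINK(rec, &mEvictionQ);
    NS_ADDREF(rec);
    if (mEvictionQSize < mMaxCacheEntries) {
        mEvictionQSize++;
    } else {
        // queue full: evict the oldest record to keep the cache bounded
        nsHostRecord *head =
            static_cast<nsHostRecord *>(PR_LIST_HEAD(&mEvictionQ));
        PR_REMOVE_AND_INIT_LINK(head);
        PL_DHashTableOperate(&mDB, static_cast<nsHostKey *>(head),
                             PL_DHASH_REMOVE);
        NS_RELEASE(head);
    }
}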
This is a good find, Steve! (It's good that they aren't added to the size measurement; otherwise we'd have a deadlock instead of a modest leak.)
Simple one-liner. Pat, if you don't think this is good, let me know.
Attachment #8336383 - Flags: review?(mcmanus)
Comment on attachment 8336383 [details] [diff] [review]
v1.0 Add negative DNS cache entries to the eviction queue

Review of attachment 8336383 [details] [diff] [review]:
-----------------------------------------------------------------

So I think this is fine - but I'm going to send it back hoping you'll add a couple of other things to it:

1] In CacheEntryEnumerator(), addr_info is deref'd without a null check... can you add one if you agree?

2] In the code for "Specific DNS request (%s) for an unspecified" around line 635 there are 4 derefs of
unspecHe->rec->addr_info that seem suspicious... at least for clarity I would like a null check or two here as well - do you agree?
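
Something along these lines, maybe (a hypothetical sketch; rec and addr_info are the fields in question, while the negative flag check is an assumption about how a negative entry is marked):

if (unspecHe->rec &&
    !unspecHe->rec->negative &&   // assumed marker for negative entries
    unspecHe->rec->addr_info) {
    // only now is it safe to deref unspecHe->rec->addr_info
}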
Attachment #8336383 - Flags: review?(mcmanus)
oh - and make sure you've actually seen the new expiration path run with a negative entry :)
Whiteboard: [MemShrink]
(In reply to Patrick McManus [:mcmanus] from comment #4)
> oh - and make sure you've actually seen the new expiration path run with a
> negative entry :)

Oh yeah - one of the first things I did yesterday after discovering this and writing up the patch.

Checking on the requests from comment 4 now.
(In reply to Steve Workman [:sworkman] from comment #5)
> Checking on the requests from comment 4 now.

Uh, I meant Comment 3.
(In reply to Patrick McManus [:mcmanus] from comment #3)
> Comment on attachment 8336383 [details] [diff] [review]
> v1.0 Add negative DNS cache entries to the eviction queue
> 
> Review of attachment 8336383 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> So I think this is fine - but I'm going to send it back hoping you'll add
> a couple of other things to it:
> 
> 1] In CacheEntryEnumerator(), addr_info is deref'd without a null check...
> can you add one if you agree?

There's a null check for addr_info at the top of the function. However, there isn't one for rec. Something has gone badly wrong if that's not there, but a check is good anyway. I'll MOZ_ASSERT it and add it into the initial condition for release builds.

> 2] In the code for "Specific DNS request (%s) for an unspecified" around
> line 635 there are 4 derefs of unspecHe->rec->addr_info that seem
> suspicious... at least for clarity I would like a null check or two here
> as well - do you agree?

For clarity, yes. And really, negative should be checked too. I'll add in some checks.
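
Roughly what I have in mind (a sketch only; the enumerator signature and the nsHostDBEnt entry type are assumed from the file's PLDHashTable usage):

static PLDHashOperator
CacheEntryEnumerator(PLDHashTable *table, PLDHashEntryHdr *hdr,
                     uint32_t number, void *arg)
{
    nsHostDBEnt *entry = static_cast<nsHostDBEnt *>(hdr);
    MOZ_ASSERT(entry->rec, "nsHostDBEnt should always have a record");
    // Release builds: fold the rec check into the existing addr_info guard.
    if (!entry->rec || !entry->rec->addr_info) {
        return PL_DHASH_NEXT;
    }
    // ... existing body that derefs entry->rec->addr_info ...
    return PL_DHASH_NEXT;
}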
Adding this as a second patch, attending to Patrick's requests in comment 3.
Attachment #8336434 - Flags: review?(mcmanus)
Comment on attachment 8336383 [details] [diff] [review]
v1.0 Add negative DNS cache entries to the eviction queue

Re-requesting review, as the second patch is on top of this one.
Attachment #8336383 - Flags: review?(mcmanus)
Comment on attachment 8336434 [details] [diff] [review]
v1.0 Add in some null checks for nsHostRecord

Review of attachment 8336434 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks Steve... you were right that the CacheEntryEnumerator block is totally gratuitous... keep it or leave it out at your discretion.

::: netwerk/dns/nsHostResolver.cpp
@@ +623,5 @@
>                          unspecHe->rec &&
>                          unspecHe->rec->HasUsableResult(flags) &&
>                          TimeStamp::NowLoRes() <= (he->rec->expiration +
>                              TimeDuration::FromSeconds(mGracePeriod * 60))) {
> +                        

whitespace?
Attachment #8336434 - Flags: review?(mcmanus) → review+
Attachment #8336383 - Flags: review?(mcmanus) → review+
https://hg.mozilla.org/mozilla-central/rev/4eca387b65c5
https://hg.mozilla.org/mozilla-central/rev/f7b3e94e3d50
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla28
Is this worth backporting to Aurora?
I think so. Can you redo the test in Bug 893857 comment 22? I'd like to see the effect with and without the patch, not including the redundant measurement of host in nsHostKey. (I think the initial 1MB measurement included host, right?)
Would you mind doing it? It's an easy benchmark to run -- it just takes a few minutes: wait for the "finished" dialog, hit the "close all windows" button, and then measure in about:memory. To do the "before" part of the comparison, you'll have to back out your leak fix while keeping the (just-landed) memory reporter in place, of course. Thanks!
That should be fine. I actually thought that there was a lot more involved in running the test - I wouldn't have asked had I realized it was so simple. I'll try it out tomorrow when your DNS memory reporter patch (Bug 893857) has landed on m-c.
One other thing:  you'll need to follow the instructions at the top of the page, particularly the bit about allowing pop-ups for that site, and modifying the popup maximum in about:config.
Keywords: verifyme
(In reply to Patrick McManus [:mcmanus] from comment #1)
> This is a good find, Steve! (It's good that they aren't added to the size
> measurement; otherwise we'd have a deadlock instead of a modest leak.)

Could you please provide more details about this leak? (I'd like to verify this fix on Firefox for desktop.)
Flags: needinfo?(mcmanus)
lookups of hostnames without valid dns records could generate leaks
Flags: needinfo?(mcmanus)
(In reply to Patrick McManus [:mcmanus] from comment #19)
> lookups of hostnames without valid dns records could generate leaks

This sounds like it's not that easy to generate at will. Does it happen often enough to require manual verification?
Flags: needinfo?(mcmanus)
(In reply to Ioana Budnar, QA [:ioana] from comment #20)
> (In reply to Patrick McManus [:mcmanus] from comment #19)
> > lookups of hostnames without valid dns records could generate leaks
> 
> This sounds like it's not that easy to generate at will. Does it happen
> often enough to require manual verification?

You should be able to generate it by visiting hostnames that don't have valid dns addresses.

I don't know what the criteria are for manual verification, so I can't answer that.
Flags: needinfo?(mcmanus)
(In reply to Patrick McManus [:mcmanus] from comment #21)
...
> You should be able to generate it by visiting hostnames that don't have
> valid dns addresses.

Yeah, that's fairly obvious :D. I was thinking we'd need specific invalid entries and would have to set up the DNS ourselves or something. I went a bit too far there...

Anyway, I tried to reproduce this with a pre-fix build and I'm not seeing the cache in about:memory increase. about:cache doesn't register any change either. 

Steve, did you use some other tool for monitoring the cache?
Flags: needinfo?(sworkman)
Ioana, my test was based on observing the logs. Without the patch, the LOG statement "Clearing cache db entry for host [%s]." would not appear, because the function containing it would never be called.

So, the verification I'd suggest: run with NSPR_LOG_MODULES=nsHostResolver:5 and fill up the cache (you'll need 400 hostnames for desktop, plus at least one that resolves to nothing). Then check the logs for the statement above: as long as one of the non-resolving hostnames shows up, the clear function is being called. Without the patch, this code path should not be executed.

Leaks: Well, I don't think you'll detect one on shutdown, because the entry is deleted along with the host resolver. The problem would be seen during runtime. If you can generate a list with a lot more than 400 non-resolving entries, you should see memory use increase and keep on increasing - this is without the patch, of course. With the patch, you might see an initial spike, but once each lookup fails (which should be very quick), memory use should drop again.

However, I think checking for the log is good enough :)
Flags: needinfo?(sworkman)
Thanks for the details Steve!

I tried to verify this several times yesterday and today: Fx beta debug - latest repo, built and run with NSPR_LOG_MODULES=nsHostResolver:5 set, multiple profiles. Every time it gets to 200-300 tabs, Firefox freezes for a couple of minutes, then quits (no crash reporter; the process gets closed most of the time, but not always).

The only way I managed to open enough tabs was using "Open All in Tabs" from older profiles' history, but that only worked twice: 404 tabs and 823 tabs (the rest of the time I got the same behavior as above). There were invalid entries and many unique entries, but also plenty of repeats. Unfortunately, I couldn't count all the repeats before my system gave out. I did get to search the logs, though, and there were no entries containing "Clear cache".

Given the situation, QA will stop working on verifying this. I will investigate the freezing and closing issue and file a new bug if necessary.
Keywords: verifyme
Whiteboard: [MemShrink] → [MemShrink][qa-]
(In reply to Ioana Budnar, QA [:ioana] from comment #24)
> Thanks for the details Steve!

Thanks for trying the verification and sorry you had such a rough time with it.

> I tried to verify this several times yesterday and today: Fx beta debug -
> latest repo, built and run with NSPR_LOG_MODULES=nsHostResolver:5 set,
> multiple profiles. Every time it gets to 200-300 tabs, Firefox freezes
> for a couple of minutes, then quits (no crash reporter; the process gets
> closed most of the time, but not always).
> 
> The only way I managed to open enough tabs was using "Open All in Tabs"
> from older profiles' history, but that only worked twice: 404 tabs and
> 823 tabs (the rest of the time I got the same behavior as above). There
> were invalid entries and many unique entries, but also plenty of repeats.
> Unfortunately, I couldn't count all the repeats before my system gave out.
> I did get to search the logs, though, and there were no entries containing
> "Clear cache".
> 
> Given the situation, QA will stop working on verifying this. I will
> investigate the freezing and closing issue and file a new bug if necessary.

If you come back to this, I have a way that doesn't involve opening 100s of tabs. We prefetch DNS results for <a> tags, so you could open several pages with different lists of links in them. Now, we max out at 512 prefetches per page, so one page of 512 unique links should be enough to get evictions happening. I didn't think to suggest this earlier, sorry.
(In reply to Steve Workman [:sworkman] from comment #25)
> If you come back to this, I have a way that doesn't involve opening 100s of
> tabs. We prefetch DNS results for <a> tags, so you could open several pages
> with different lists of links in them. Now, we max out at 512 prefetches per
> page, so one page of 512 unique links should be enough to get evictions
> happening. I didn't think to suggest this earlier, sorry.

I used a webpage with 500 links, then opened other sites and checked the logs each time. No "clear" appeared in the logs, even after surpassing 512. I tried a couple more times with other pages of 500 and 1000 links, plus valid and invalid entries, but "clear" did not appear in the logs at any point. I checked each page I used to make sure the links were to unique hostnames (e.g. instead of using the Alexa top sites, I went for moz.com/top500). There might be a problem with this fix.

Steve, did you get to test beta for this? If yes, did you get the "Clearing cache db entry for host [%s]." message at any point?
Flags: needinfo?(sworkman)
Ioana, sorry for the delayed response.

I forgot one thing about the test case I suggested to you: we max out at 512 prefetch requests sent to the DNS service, but internally the lookup queue is capped at 150 for low-priority requests (including prefetches). So, a page of 512 links will only prefetch DNS for the first 150.

Therefore, the test I suggested needs a little reworking. What I just did locally was create 4 pages, each with 150 links: I used a bash script to write links for http://1.com, http://2.com, etc., so that page 1 has 1-150.com, page 2 has 151-300.com, and so on. Then I just added normal HTML boilerplate.

( for i in {1..150}; do echo "  <a href=\"http://$i.com\">$i.com</a>"; done; ) > test1.html

Then, with NSPR_LOG_MODULES=nsHostResolver:5, I loaded them in turn, making sure to wait for all the lookups to be complete before loading the next page. The wait ensures that the internal queue is flushed. And you can tell that from the logs - look for "DNS lookup thread - lookup completed for host [150.com]" (or 300, 450 or 600).

Some of these will succeed; some will not. On my system 74.com failed, so I made sure to look for "Clearing cache db entry for host [74.com]" in the logs. It was there - hurrah! :)

Let me know if this works for you.
Flags: needinfo?(sworkman)