Closed Bug 893857 Opened 11 years ago Closed 11 years ago

Add a memory reporter for the DNS cache

Categories

(Core :: Networking: DNS, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla28

People

(Reporter: mccr8, Assigned: n.nethercote)

References

Details

(Whiteboard: [MemShrink:P2])

Attachments

(3 files)

This is the DMD log from a 36-hour B2G monkey tester run, and it is the 3rd highest single entry.  It looks like quite a bit of live PRAddrInfo to have at once, not knowing anything about networking.  If this DNS info can be held alive by a ContentParent, then it is probably just a symptom of another leak we are investigating.

Unreported: ~322 blocks in stack trace record 3 of 932
 ~1,317,946 bytes (~1,317,946 requested / ~0 slop)
 2.60% of the heap (11.32% cumulative);  6.16% of unreported (26.84% cumulative)
 Allocated at
   malloc /home001/jeremy.kim/workspace/d300/b2g/gecko/memory/build/replace_malloc.c:152 (0x400c82ae libmozglue.so+0x42ae)
   __res_vinit /home001/jeremy.kim/workspace/d300/b2g/gecko/other-licenses/android/res_init.c:284 (0x4009c2e0 libmozglue.so+0xd2e0)
   __res_ninit /home001/jeremy.kim/workspace/d300/b2g/gecko/other-licenses/android/res_init.c:183 (0x4009c8e0 libmozglue.so+0xd8e0)
   _res_thread_alloc /home001/jeremy.kim/workspace/d300/b2g/gecko/other-licenses/android/res_state.c:89 (0x4009dcd8 libmozglue.so+0xecd8)
   __res_get_state /home001/jeremy.kim/workspace/d300/b2g/gecko/other-licenses/android/res_state.c:195 (0x4009dd8a libmozglue.so+0xed8a)
   _dns_getaddrinfo /home001/jeremy.kim/workspace/d300/b2g/gecko/other-licenses/android/getaddrinfo.c:1893 (0x400980b4 libmozglue.so+0x90b4)
   nsdispatch (0x401023ec libc.so+0x323ec) (can't find lib)
   explore_fqdn /home001/jeremy.kim/workspace/d300/b2g/gecko/other-licenses/android/getaddrinfo.c:770 (0x4009897e libmozglue.so+0x997e)
   PR_GetAddrInfoByName /home001/jeremy.kim/workspace/d300/b2g/gecko/nsprpub/pr/src/misc/prnetdb.c:2049 (0x4083a716 libnspr4.so+0x11716)
   nsHostResolver::ThreadFunc(void*) /home001/jeremy.kim/workspace/d300/b2g/gecko/netwerk/dns/nsHostResolver.cpp:975 (0x410f0574 libxul.so+0x2d2574)
   _pt_root /home001/jeremy.kim/workspace/d300/b2g/gecko/nsprpub/pr/src/pthreads/ptthread.c:205 (0x41660a40 libnspr4.so+0x19a40)
   __thread_entry (0x400e30ec libc.so+0x130ec) (can't find lib)
   pthread_create (0x400e2c40 libc.so+0x12c40) (can't find lib)
Summary: Possible leak of PRAddrInfo on B2G → Possible 1.3mb leak of PRAddrInfo on B2G
No longer blocks: 893172
The full log is in bug 893172.
Blocks: 889261
(In reply to Andrew McCreight [:mccr8] from comment #0)
> Unreported: ~322 blocks in stack trace record 3 of932

that looks like the DNS cache to me; not a leak. It defaults to 400 entries, you can override that with network.dnsCacheEntries if you like.. though trading a little ram at the cost of a rtt doesn't seem like the right thing to me. rtt is the enemy and this report is pretty close to the top end of the cost.

If saving 1 MB was deemed high value, I'm sure the cache could be altered to be something far more memory efficient than the list of fragmented pages the PRAddrInfo's come out of NSPR on.
> though trading a little ram at the cost of a rtt doesn't seem like the right thing to me. 

I guess you mean "does", not "doesn't"?

I agree it's probably the right thing, but we should at least have a memory reporter.

> If saving 1 MB was deemed high value, I'm sure the cache could be altered to be something 
> far more memory efficient than the list of fragmented pages the PRAddrInfo's come out of 
> NSPR on.

Saving 1mb is definitely of value on B2G.  On our 256mb devices, we have roughly 150mb available to all Gecko processes.  After counting code and other stuff that the main process uses, we have less than 100mb of memory available to apps.  So saving 1mb in the main process is a ~1% improvement in the amount of memory available to apps!
Summary: Possible 1.3mb leak of PRAddrInfo on B2G → Add a memory reporter for the DNS cache
Let's make this bug about adding a memory reporter, and then we can separately consider whether we can shrink this data structure.
Ah, this is totally different on b2g18 and trunk.

On trunk, we immediately free the object returned by PR_GetAddrInfoByName and shove its info into a mozilla::net::AddrInfo object.  On b2g18, we don't do this; we keep the nspr data structure around.

The mozilla::net::AddrInfo thing looks like it probably uses less space than the nspr type.

I'm a bit confused as to why we use a linked list for mozilla::net::AddrInfo instead of using an nsTArray.  But it's not clear to me how much waste there is in our current setup, or if it matters.
Assignee: nobody → n.nethercote
Whiteboard: [MemShrink] → [MemShrink:P2]
I haven't seen this go above ~100 KB on desktop, but it seems worth having if
it might creep up slowly to much higher than that.
Attachment #8335044 - Flags: review?(sworkman)
Comment on attachment 8335044 [details] [diff] [review]
(part 2) - Add a memory reporter for the DNS service.

Review of attachment 8335044 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for doing this! It'll be great to have it in there.

I have some changes requested, but the bulk of it is there. Feel free to challenge me on some of the points. r- for now.

::: netwerk/dns/nsDNSService2.cpp
@@ +407,5 @@
>      uint32_t lifetimeGracePeriod = 1;
>      bool     disableIPv6      = false;
>      bool     disablePrefetch  = false;
>      int      proxyType        = nsIProtocolProxyService::PROXYCONFIG_DIRECT;
> +

Thanks for whitespace removal!

@@ +905,5 @@
> +    // is worthwhile:
> +    // - mIDN
> +    // - mLock
> +    // - mIPv4OnlyDomains
> +    // - mLocalDomains

mIPv4OnlyDomains and mLocalDomains might be useful are pulled from prefs. The amount of memory isn't something we directly control, but the object is owned by DNS code.

Maybe it's something we should put an upper bound on to keep memory in check? Outside the scope of this bug, of course.

::: netwerk/dns/nsDNSService2.h
@@ +50,5 @@
>      bool                      mFirstTime;
>      bool                      mOffline;
>      nsTHashtable<nsCStringHashKey> mLocalDomains;
> +
> +    nsCOMPtr<nsIMemoryReporter> mReporter;

nit: Please call this mMemoryReporter.

::: netwerk/dns/nsHostResolver.cpp
@@ +272,5 @@
> +
> +    // Measurement of the following members may be added later if DMD finds it
> +    // is worthwhile:
> +    // - nsHostKey (the superclass)
> +    // - callbacks

You mean measuring the component size of nsHostKey, right? Isn't it included in mallocSizeOf(this)?

For callbacks, I'd actually like that to be included. The list can grow if there are a lot of requests for the same domain, e.g. a page with 100s of images all from images.foo.com. It still wouldn't be a ridiculous amount, but I'd rather be accurate here.

Possibly I should know this, but I don't - what is DMD? :)

@@ +280,5 @@
> +    n += mBlacklistedItems.SizeOfExcludingThis(mallocSizeOf);
> +    for (size_t i = 0; i < mBlacklistedItems.Length(); i++) {
> +        n += mBlacklistedItems[i].SizeOfIncludingThisMustBeUnshared(mallocSizeOf);
> +    }
> +    return n;

I'm not sure how often the memory reporter will call these functions or even how it happens, but if it's not too often, it would be great to add a LOG here to print out the memory use of each record. If possible, could you declare an individual size_t for each element, and then print it out; something like:

LOG(("nsHostRecord::SizeOfIncludingThis [%p] [%d=%d+%d+%d+%d+%d] %s",
     this, total_size, this_size, addr_info_size, addr_size, blacklist_size, callback_size, host));

Not mandatory if you think there's a good reason not to have it.

@@ +1070,5 @@
>  }
>  
> +static size_t
> +SizeOfHostRecordListExcludingThis(MallocSizeOf mallocSizeOf,
> +                                  const PRCList *head)

Can you rename the function to ...ExcludingHead or rename 'head' to 'aThis'?

@@ +1089,5 @@
> +    size_t n = mallocSizeOf(this);
> +
> +    // No need to measure what the table entries point to, because those are
> +    // non-owning pointers.
> +    n += PL_DHashTableSizeOfIncludingThis(mDB, nullptr, mallocSizeOf);

Need MutexAutoLock lock(mLock) here.

Also, I think you should be measuring the hash table and not the queues.

The hash table is the main store for nsHostRecords. The queues are only important for prioritizing which nsHostRecord should be used next for a call to getaddrinfo, or which records should be evicted. There's a possibility if this gets called just before nsHostResolver::OnLookupComplete that you'd lose one nsHostRecord, because it has been dequeued from mHigh|Medium|LowQ for lookup, but not yet added to the eviction queue. One record is hardly a big problem, but the hash table is a little more accurate, I think.
Attachment #8335044 - Flags: review?(sworkman) → review-
> @@ +905,5 @@
> > +    // is worthwhile:
> > +    // - mIDN
> > +    // - mLock
> > +    // - mIPv4OnlyDomains
> > +    // - mLocalDomains
> 
> mIPv4OnlyDomains and mLocalDomains might be useful are pulled from prefs.

I added code to measure both of them.  I sure hope the strings stored in mLocalDomains are not shared!


> > +    // Measurement of the following members may be added later if DMD finds it
> > +    // is worthwhile:
> > +    // - nsHostKey (the superclass)
> > +    // - callbacks
> 
> You mean measuring the component size of nsHostKey, right? Isn't it included
> in mallocSizeOf(this)?

The nsHostKey struct itself is measured, yes, but the |host| member of the nsHostKey isn't measured.  I've added measurement of it -- I sure hope it also isn't shared.


> For callbacks, I'd actually like that to be included. The list can grow if
> there are a lot of requests for the same domain, e.g. a page with 100s of
> images all from images.foo.com. It still wouldn't be a ridiculous amount,
> but I'd rather be accurate here.

Done.  Slightly tricky because of the inheritance, but not too bad.


> Possibly I should know this, but I don't - what is DMD? :)

It's a tool that tells us which allocations are being lumped into about:memory's "heap-unclassified", i.e. where we need to add new memory reporters.  Comment 0 has some sample output.  https://wiki.mozilla.org/Performance/MemShrink/DMD has the full details.


> I'm not sure how often the memory reporter will call these functions or even
> how it happens, but if it's not too often, it would be great to add a LOG
> here to print out the memory use of each record.

The main use of memory reporters is when about:memory does measurements.  There are dozens of memory reporters, each of which does any number of measurements (in some cases thousands, e.g. the JS engine's memory reporter).  Adding a log is a reasonable idea but if every memory reporter did it we'd be swamped with output.

If the amount of memory reported by this reporter ends up being large, we could certainly split it into multiple buckets, but at the moment that doesn't seem necessary.


> Also, I think you should be measuring the hash table and not the queues.

Good to know!  Understanding memory ownership is the hardest part of writing a memory reporter.  I've fixed this.


BTW, is there a good way to stress the DNS service, and therefore the reporter?  I've just done vanilla browsing, but I'm not convinced I'm hitting all these new code paths.
I'll fold this into the earlier patch before landing.
Attachment #8335801 - Flags: review?(sworkman)
Comment on attachment 8335801 [details] [diff] [review]
(part 2b) - (address review comments)

Review of attachment 8335801 [details] [diff] [review]:
-----------------------------------------------------------------

Super quick turnaround! Thanks for making the changes. I think one comment is out of date, but otherwise all good. r=me.

::: netwerk/dns/nsDNSService2.cpp
@@ +320,5 @@
> +    // - mResolver, because it's a non-owning pointer
> +
> +    // Measurement of the following members may be added later if DMD finds it
> +    // is worthwhile:
> +    // - mListener

FYI, I doubt this will be needed. mListener is usually a pre-existing object like nsSocketTransport and should be managed elsewhere.

@@ +950,5 @@
>      size_t n = mallocSizeOf(this);
>      n += mResolver->SizeOfIncludingThis(mallocSizeOf);
> +    n += mIPv4OnlyDomains.SizeOfExcludingThisMustBeUnshared(mallocSizeOf);
> +    n += mLocalDomains.SizeOfExcludingThis(SizeOfLocalDomainsEntryExcludingThis,
> +                                           mallocSizeOf);

Thanks for including these strings.

::: netwerk/dns/nsHostResolver.cpp
@@ +276,5 @@
> +            static_cast<nsResolveHostCallback*>(curr);
> +        n += callback->SizeOfIncludingThis(mallocSizeOf);
> +        curr = curr->next;
> +    }
> +    return n;

Thank you for including this!

@@ +290,3 @@
>  
> +    n += SizeOfResolveHostCallbackListExcludingHead(&callbacks, mallocSizeOf);
> +    n += addr_info ? addr_info->SizeOfIncludingThis(mallocSizeOf) : 0;

Good catch of the null check.

@@ +1084,5 @@
>      }
>  }
>  
>  static size_t
> +SizeOfDBEntryExcludingThis(PLDHashEntryHdr* hdr, MallocSizeOf mallocSizeOf,

SizeOfHostDBEntExclusingThis - Just in case someone else wants to have a DBEnt.

@@ +1097,5 @@
>  {
>      size_t n = mallocSizeOf(this);
>  
>      // No need to measure what the table entries point to, because those are
>      // non-owning pointers.

This comment is now out of date.

@@ +1104,5 @@
>  
> +    // The following fields aren't measured.
> +    // - mHighQ, mMediumQ, mLowQ, mEvictionQ, because they just point to
> +    //   nsHostRecords that also pointed to by entries |mDB|, and measured when
> +    //   |mDB| is measured.

Good explanation, thank you.
Attachment #8335801 - Flags: review?(sworkman) → review+
> Need MutexAutoLock lock(mLock) here.

Oh, I missed that.  I'll add before landing.

Any suggestions on stressing the DNS service?

(BTW, I was going to ask you this on IRC, but firebot tells me "sworkman was last seen 10 weeks, 1 day, 1 hour, 19 minutes and 46 seconds ago"...)
(In reply to Nicholas Nethercote [:njn] from comment #12)
> > Need MutexAutoLock lock(mLock) here.
> 
> Oh, I missed that.  I'll add before landing.

Oh yes - you should do that! I should have checked for that! Bad Stephen!

> Any suggestions on stressing the DNS service?
> 
> (BTW, I was going to ask you this on IRC, but firebot tells me "sworkman was
> last seen 10 weeks, 1 day, 1 hour, 19 minutes and 46 seconds ago"...)

10 weeks, huh? I think it's lying - I was on there last week ;)

The easiest way to stress it is to load lots of different pages. Some good examples:
- pages that shard, e.g. pinterest - lots of images or other resources coming from many domains.
- pages with lots of links, but http only; this should start DNS link prefetching.

Just keep opening more and more of these to increase the size of the cache.

Or, if you want to be fancy about it, write up a simple html page with images from many domains. It shouldn't matter if the domains or resources are real or not. The DNS service will still be used to try to resolve the hostnames, and negative entries will be added for domains that don't resolve.

I suggest more than 400, which is the max cache size.

Now, this leads me to Bug 941884, which I discovered while reviewing your patch. We currently don't add negative cache entries to the eviction queue. I've uploaded a one-liner patch in that bug and that should be fixed soon. So, if you run your tests now, and there are lots of unresolving domains, you will probably see the cache size grow stupidly big. I suggest you consider applying my patch, or waiting until that bug is fixed.
> Now, this leads me to Bug 941884, which I discovered while reviewing your
> patch. We currently don't add negative cache entries to the eviction queue.

Excellent!  I wonder if that's the cause of the high memory usage from comment 0.
I imagine it's possible if a "36-hour B2G monkey tester run" includes a lot of dummy or unreachable domains. In that case we'd have a lot of negative cache entries.
I just ran http://gregor-wagner.com/tmp/mem, which opens 150 sites, and "dns-service" got to just over 1MB.
Is this with or without the patch from Bug 941884? I just landed that patch on inbound.
Without the patch.  I'll try again on Monday to see how things have changed.
Awesome - that would be great, thanks.
Comment on attachment 8335043 [details] [diff] [review]
(part 1) - Add LinkedList::SizeOf{In,Ex}cludingThis().

Review of attachment 8335043 [details] [diff] [review]:
-----------------------------------------------------------------

::: mfbt/LinkedList.h
@@ +385,5 @@
> +     * it only measures the list elements themselves.  If the list elements
> +     * contain pointers to other memory blocks, those blocks must be measured
> +     * separately during a subsequent iteration over the list.
> +     */
> +    size_t sizeOfExcludingThis(mozilla::MallocSizeOf mallocSizeOf) const {

The mozilla:: prefix could be omitted here and in the other signature.  Doesn't hurt leaving them, except in noise.

@@ +389,5 @@
> +    size_t sizeOfExcludingThis(mozilla::MallocSizeOf mallocSizeOf) const {
> +      size_t n = 0;
> +      for (const T* t = getFirst(); t; t = t->getNext()) {
> +        n += mallocSizeOf(t);
> +      }

No braces around a single-line body, like JS.
Attachment #8335043 - Flags: review?(jwalden+bmo) → review+
Test failures just taught me that nsHostKey::host shouldn't be measured separately, because we use a variable-length allocation to include it at the end of the nsHostRecord.  I've written the following comment to explain it.

    // The |host| field (inherited from nsHostKey) actually points to extra
    // memory that is allocated beyond the end of the nsHostRecord (see
    // nsHostRecord::Create()).  So it will be included in the
    // |mallocSizeOf(this)| call above.
> I just ran http://gregor-wagner.com/tmp/mem, which opens 150 sites, and
> "dns-service" got to just over 1MB.

With the patch from bug 941884 in place I just re-reran and got 280KB.  Good.
Re comment 21 - Yes, of course. Apologies for missing that one.

Re comment 22 - Wow - that's quite the drop. I assume that not including host in nsHostKey made a difference here as well. But still, it looks like the patch made a decent difference.

I'll add a comment in Bug 941884 to respond to your question about backporting to aurora.
> Re comment 22 - Wow - that's quite the drop. I assume that not including
> host in nsHostKey made a difference here as well.

I don't know.  Calling mallocSizeOf(host) was totally bogus, and I'm surprised I didn't get crashes locally.  Who knows what values it was returning.
https://hg.mozilla.org/mozilla-central/rev/f91e339a1b36
https://hg.mozilla.org/mozilla-central/rev/6cfc4f26a183
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla28
Depends on: 969902
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: