Intermittent test_bug497875.xul | application crashed [@ nsHostResolver::DeQueue(PRCListStr &,nsHostRecord * *)]

RESOLVED FIXED in Firefox 35

Status

()

defect
RESOLVED FIXED
5 years ago
4 years ago

People

(Reporter: RyanVM, Assigned: bagder)

Tracking

({crash, intermittent-failure})

Trunk
mozilla36
x86
Windows 8.1
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:2.1+, firefox34 unaffected, firefox35 fixed, firefox36 fixed, firefox-esr31 unaffected, b2g-v2.1 unaffected, b2g-v2.1S unaffected, b2g-v2.2 fixed)

Details

Attachments

(1 attachment, 4 obsolete attachments)

https://treeherder.mozilla.org/ui/logviewer.html#?job_id=808071&repo=fx-team

builder 	WINNT 6.2 fx-team debug test mochitest-other
buildid 	20141002073921
builduid 	7eb8008831114e3dad9a5aeb84d0d683
results 	warnings (1)
revision 	69bb4035eba1
slave 		t-w864-ix-094
starttime 	Thu Oct 02 2014 11:17:35 GMT-0400 (Eastern Standard Time)

08:22:01 INFO - 435 INFO TEST-START | chrome://mochitests/content/chrome/content/xul/document/test/test_bug497875.xul
08:22:01 INFO - ++DOMWINDOW == 61 (0A354E00) [pid = 3848] [serial = 748] [outer = 158FC600]
08:22:01 INFO - ++DOCSHELL 09E2D000 == 9 [pid = 3848] [id = 216]
08:22:01 INFO - ++DOMWINDOW == 62 (0D383600) [pid = 3848] [serial = 749] [outer = 00000000]
08:22:01 INFO - [Parent 3848] WARNING: NS_ENSURE_TRUE(nsContentUtils::GetCurrentJSContext()) failed: file c:/builds/moz2_slave/fx-team-w32-d-0000000000000000/build/docshell/base/nsDocShell.cpp, line 9036
08:22:01 INFO - ++DOMWINDOW == 63 (0D389E00) [pid = 3848] [serial = 750] [outer = 0D383600]
08:22:01 INFO - Failed to load ftp://some.website.that.will.cause.an.error/
08:22:03 INFO - TEST-INFO | Main app process: exit status c0000005
08:22:03 INFO - 436 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/content/xul/document/test/test_bug497875.xul | application terminated with exit code 3221225477
08:22:03 INFO - runtests.py | Application ran for: 0:02:23.742000
08:22:03 INFO - zombiecheck | Reading PID log: c:\users\cltbld~1.t-w\appdata\local\temp\tmpv5z04hpidlog
08:22:03 INFO - ==> process 3848 launched child process 1208 ("C:\slave\test\build\application\firefox\plugin-container.exe" --channel=3848.de9cd20.1131237995 -greomni "C:\slave\test\build\application\firefox\omni.ja" -appomni "C:\slave\test\build\application\firefox\browser\omni.ja" -appdir "C:\slave\test\build\application\firefox\browser" 3848 "\\.\pipe\gecko-crash-server-pipe.3848" tab)
08:22:11 INFO - mozcrash Saved minidump as C:\slave\test\build\blobber_upload_dir\5f9f6ade-07aa-479b-9263-031226d1b5ee.dmp
08:22:11 INFO - mozcrash Saved app info as C:\slave\test\build\blobber_upload_dir\5f9f6ade-07aa-479b-9263-031226d1b5ee.extra
08:22:11 WARNING - PROCESS-CRASH | chrome://mochitests/content/chrome/content/xul/document/test/test_bug497875.xul | application crashed [@ nsHostResolver::DeQueue(PRCListStr &,nsHostRecord * *)]
08:22:11 INFO - Crash dump filename: c:\users\cltbld~1.t-w\appdata\local\temp\tmpjet6l9.mozrunner\minidumps\5f9f6ade-07aa-479b-9263-031226d1b5ee.dmp
08:22:11 INFO - Operating system: Windows NT
08:22:11 INFO - 6.2.9200
08:22:11 INFO - CPU: x86
08:22:11 INFO - GenuineIntel family 6 model 30 stepping 5
08:22:11 INFO - 8 CPUs
08:22:11 INFO - Crash reason: EXCEPTION_ACCESS_VIOLATION_WRITE
08:22:11 INFO - Crash address: 0x5a5a5a5a
08:22:11 INFO - Thread 73 (crashed)
08:22:11 INFO - 0 xul.dll!nsHostResolver::DeQueue(PRCListStr &,nsHostRecord * *) [nsHostResolver.cpp:69bb4035eba1 : 1271 + 0x5]
08:22:11 INFO - eip = 0x7177d0b6 esp = 0x19cffda4 ebp = 0x19cffda8 ebx = 0x000493e0
08:22:11 INFO - esi = 0x5a5a5a5a edi = 0x19cffe34 eax = 0x19cffe34 ecx = 0x092b85e0
08:22:11 INFO - edx = 0x5a5a5a5a efl = 0x00010216
08:22:11 INFO - Found by: given as instruction pointer in context
08:22:11 INFO - 1 xul.dll!nsHostResolver::GetHostToLookup(nsHostRecord * *) [nsHostResolver.cpp:69bb4035eba1 : 1313 + 0xe]
08:22:11 INFO - eip = 0x7177d3dd esp = 0x19cffdb0 ebp = 0x19cffdd4
08:22:11 INFO - Found by: call frame info
08:22:11 INFO - 2 xul.dll!nsHostResolver::ThreadFunc(void *) [nsHostResolver.cpp:69bb4035eba1 : 1586 + 0xf]
08:22:11 INFO - eip = 0x7177eadb esp = 0x19cffddc ebp = 0x19cffe3c
08:22:11 INFO - Found by: call frame info
08:22:11 INFO - 3 nss3.dll!_PR_NativeRunThread [pruthr.c:69bb4035eba1 : 397 + 0x8]
08:22:11 INFO - eip = 0x6f74f8bf esp = 0x19cffe44 ebp = 0x19cffe5c
08:22:11 INFO - Found by: call frame info
08:22:11 INFO - 4 nss3.dll!pr_root [w95thred.c:69bb4035eba1 : 90 + 0xb]
08:22:11 INFO - eip = 0x6f74136f esp = 0x19cffe64 ebp = 0x19cffe68
08:22:11 INFO - Found by: call frame info
08:22:11 INFO - 5 msvcr100.dll + 0x5c6dd
probly steve or daniel touched this last.. who will take it?
Flags: needinfo?(sworkman)
Flags: needinfo?(daniel)
I've stared at this code now for a while.

The DeQueue() that crashes seems to attempt to dequeue a node that is already freed (5a5a5a5a), and the fact that it dequeues from mLowQ _could_ mean that this is a node added on nsHostResolver.cpp:1490, which is code that came in with the TTL experiments.
Flags: needinfo?(daniel)
Tried 100 runs locally on a Win7 VM can't reproduce it. But I may have an explanation and a speculative solution. Please cast your eyes over this and tell me what you think:

On the DNS thread, a request is dequeued from the HIGH priority queue and a lookup completes.
nsHostResolver::OnLookupComplete is called.
LOCK
  process the lookup result
UNLOCK

Meanwhile, on the main thread, FlushCache is waiting for the lock. It get's it.
LOCK
  prune entries --> release record --> => refcnt == 1
  get eviction queue
UNLOCK
release all entries on eviction q --> release record =? refcnt == 0 --> DELETE

Back on the DNS thread, still in OnLookupComplete:
call callback->OnLookupComplete for all the callbacks associated with the record.
LOCK (again)
  Issue a 2nd lookup, this time to get TTL
  Request now enqueued on LOW priority queue.
UNLOCK

Then, when it comes time to DeQueue from the LOW priority queue, it gets the record, but the data has been erased. Sounds plausible, right?

So, move the IssueLookup for TTL that's in OnLookupComplete into the first mutex block. We could do a holding AddRef for OnLookupComplete, but if the cache has been flushed, then this entry has been removed from the hash table already and a lookup is pointless. Better to have the second lookup issued BEFORE FlushCache has a chance to run and clear it out.
Flags: needinfo?(sworkman)
(In reply to Steve Workman [:sworkman] from comment #3)

> Then, when it comes time to DeQueue from the LOW priority queue, it gets the
> record, but the data has been erased. Sounds plausible, right?
>
> So, move the IssueLookup for TTL that's in OnLookupComplete into the first
> mutex block. We could do a holding AddRef for OnLookupComplete, but if the
> cache has been flushed, then this entry has been removed from the hash table
> already and a lookup is pointless. Better to have the second lookup issued
> BEFORE FlushCache has a chance to run and clear it out.

I completely agree with this analysis and yeah, moving the code to within the previous block seems simpler than doing tricks with adding another ref.

The only possible downside with this, is that in this particular case the events have deemed that the existing entry should be flushed, so this adds a lookup for an entry that otherwise would be (and will be?) discarded.
Blocks: 1008091
Here's a patch for what we've discussed.
Attachment #8502425 - Flags: review?(sworkman)
Comment on attachment 8502425 [details] [diff] [review]
0001-bug-1077084-move-TTL-lookup-to-within-initial-lock-b.patch

Review of attachment 8502425 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Daniel Stenberg [:bagder] from comment #4)
> The only possible downside with this, is that in this particular case the
> events have deemed that the existing entry should be flushed, so this adds a
> lookup for an entry that otherwise would be (and will be?) discarded.

Yup - I think that a PR_REMOVE_LINK should be added to PruneEntry in the PL_DHASH_REMOVE case. But .....

There's one other thing that I'm wondering about here. In PruneEntry we return true if the record is both resolving and onQueue, right? But this means that requests that are on the queue pending resolution will be removed from the cache. Am I reading that correctly? If so, that's something I missed in the earlier review sorry.

So, in RemoveOrRefresh we should
if (resolving && onQueue) {
  // Don't remove; resolution pending.
  return false;
} else if (resolving && !onQueue) {
  // Request already with OS resolver; require a refresh.
  mResolveAgain = true;
  return false;
} else {
  // Already resolved; not in a pending state; remove from cache.
  return true;
}

Assuming this is right, the lookup for TTL should still happen, but that's ok. It should be a valid result since it should happen after the NETWORK CHANGE event. We could do a check to see if the pending request is for TTL, but that's extra complexity that I don't think we need.

Anyway, tell me what you think. I'm concerned about this, but I'm aware that I could be missing something in the existing code.
Attachment #8502425 - Flags: review?(sworkman) → feedback+
(In reply to Steve Workman [:sworkman] from comment #6)

Gah. I feel stupid for not noticing that the entry was added already when the entry is added to a queue but of course it makes sense and yes RemoveOrRefresh needs to be adapted to this.

I left the TTL block moved like in the first version of the patch since it seems to still make sense, and I believe addressed all the other concerns in this v2 patch.
Attachment #8502425 - Attachment is obsolete: true
Attachment #8502794 - Flags: review?(sworkman)
Clearly this patch introduces memory leaks:

https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=8dea0317bcc2
v3. Fixed the leak.

Since we now remove the entries from the list in the HostDB_PruneEntry() function, it also removes entries from the evictionq. To avoid that causing problem, I changed the logic to first remove all entries that are in the evictionq and then clear the cache accordingly. This seems to fix the memory leaks. I re-run a try on windows here; https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=8794bdf39851

I'll run a full try too, just want to see that you're with me on this approach Steve.
Attachment #8502794 - Attachment is obsolete: true
Attachment #8502794 - Flags: review?(sworkman)
Attachment #8503293 - Flags: review?(sworkman)
Comment on attachment 8503293 [details] [diff] [review]
v3-0001-bug-1077084-improved-DNS-cache-pruning-r-sworkman.patch

Review of attachment 8503293 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/dns/nsHostResolver.cpp
@@ +479,5 @@
> +            // to get resolved. Leave it.
> +            fprintf(stderr, "!!!!! RemoveOrRefresh() line %d\n", __LINE__);
> +            return false;
> +        }
> +        else {

} else {

And remove the fprintf ;)

@@ +485,5 @@
> +            // record should be considered stale and not trusted; set a flag to
> +            // ensure it is called again.
> +            mResolveAgain = true;
> +            return false;
> +        }

You could put the return false outside the else and just have an "if (!onQueue) { ... }".

@@ +816,5 @@
> +      }
> +  }
> +
> +  // prune the hash from all hosts already resolved
> +  PL_DHashTableEnumerate(&mDB, HostDB_PruneEntry, nullptr);

So, by the time we arrive at PruneEntry now, all the resolved (i.e. !resolving && !onQueue) should all be removed. When an entry is resolved, it's added to the evictionQ, so I think PruneEntry is effectively just marking the entries for refresh. Can you check this, please?
Attachment #8503293 - Flags: review?(sworkman) → feedback+
(In reply to Daniel Stenberg [:bagder] from comment #7)
> Gah. I feel stupid for not noticing that the entry was added already when
> the entry is added to a queue but of course it makes sense and yes
> RemoveOrRefresh needs to be adapted to this.

It's ok. I feel stupid too ;) I should have spotted it in the earlier review, especially since I'm the one that requested dealing with requests already with the OS resolver!
Assignee: nobody → daniel
(In reply to Steve Workman [:sworkman] from comment #10)

> So, by the time we arrive at PruneEntry now, all the resolved (i.e.
> !resolving && !onQueue) should all be removed. When an entry is resolved,
> it's added to the evictionQ, so I think PruneEntry is effectively just
> marking the entries for refresh. Can you check this, please?

Ah, yes exactly what I think. The entries still in the cache at the point after the evictionq is "drained" should all either get resolved soon or be in the process of being resolved right now.

However...

I did a test-run[1] and asserted that 'resolving' was always true in RemoveOrRefresh() but that triggered in some test cases and caused failures. So, the safer approach is clearly to keep the logic that iterates over all entries, and removes those that can be removed and refreshes those in need of a second resolve.

[1] = that test run can be seen here: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=50fe060f7a05
v4, updated according to previous comments. Only minor changes since I couldn't do the simplification as mention in a previous comment above here.

Test run on win32 debug-only (since that's where we saw the problems before): https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=50fe060f7a05
Attachment #8503293 - Attachment is obsolete: true
Attachment #8504048 - Flags: review?(sworkman)
Comment on attachment 8504048 [details] [diff] [review]
v4-0001-bug-1077084-improved-DNS-cache-pruning-r-sworkman.patch

Review of attachment 8504048 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for fixing this Daniel! r=me.

::: netwerk/dns/nsHostResolver.cpp
@@ +600,5 @@
>  static PLDHashOperator
>  HostDB_PruneEntry(PLDHashTable *table,
> +                    PLDHashEntryHdr *hdr,
> +                    uint32_t number,
> +                    void *arg)

nit: Indentation changed here; looks like some weirdness.

@@ +670,5 @@
>  
>      // We can't set a pref in the context of a pref change callback, so
>      // dispatch DnsExperimentChangedInternal for async getting/setting.
>      DebugOnly<nsresult> rv = NS_DispatchToMainThread(
> +        NS_NewRunnableMethod(self, &DnsExperimentChangedInternal));

Ah, this was changed in a recent patch. Please don't change it back. The class name is needed to avoid a clang warning.

@@ +807,5 @@
> +      }
> +  }
> +
> +  // Refresh the cache entries that are resolving RIGHT now, remove the rest.
> +  PL_DHashTableEnumerate(&mDB, HostDB_PruneEntry, nullptr);

Cool. Safer route is better.
Attachment #8504048 - Flags: review?(sworkman) → review+
(In reply to Steve Workman [:sworkman] from comment #14)

> >      DebugOnly<nsresult> rv = NS_DispatchToMainThread(
> > +        NS_NewRunnableMethod(self, &DnsExperimentChangedInternal));
> 
> Ah, this was changed in a recent patch. Please don't change it back. The
> class name is needed to avoid a clang warning.

Good catch. I don't know how it got to be like that, I blame a crazy rebase or something. It was not done on purpose and I'll get rid of it from my patch!
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/073e83cef238
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
AFAICT, we need this on Aurora at least. Do we need this on Beta34 as well? Please nominate accordingly :)
Flags: needinfo?(daniel)
This fix is primarily a fix for flaws in the bug 939318 series that are targeted for 35 as I understand it.

It also changes code brought in via bug 1067679 that probably has a completely different target (if any) so some splitting up work is probably required at some point.
Flags: needinfo?(daniel)
Please nominate this for aurora approval when you get a chance :-)
Blocks: 939318
Flags: needinfo?(daniel)
Comment on attachment 8504516 [details] [diff] [review]
v5-0001-bug-1077084-improved-DNS-cache-pruning-r-sworkman.patch

Approval Request Comment
[Feature/regressing bug #]: 939318

[User impact if declined]: occasional crashes occur as outlined in this bug report

[Describe test coverage new/current, TBPL]: not terribly good, which is why 939318 got in with this problem in the first place, but also difficult to do good tests for due to the timing dependent/race nature of the problem

[Risks and why]: 

[String/UUID change made/needed]:
Flags: needinfo?(daniel)
Attachment #8504516 - Flags: approval-mozilla-aurora?
Attachment #8504516 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Per bug 1008091 comment 79.
blocking-b2g: --- → 2.1S?
stability issue.
blocking-b2g: 2.1S? → 2.1S+
Flags: needinfo?(vliu)
(In reply to Steven Yang [:styang] from comment #26)
> stability issue.

Shouldn't we merge into v2.1 if this is a stability issue?
Flags: needinfo?(vliu) → needinfo?(styang)
ok, let me nominate it for triage
Flags: needinfo?(styang)
[Blocking Requested - why for this release]:
blocking-b2g: 2.1S+ → 2.1?
as triage. plug it for 2.1
blocking-b2g: 2.1? → 2.1+
Please nominate this patch for b2g34 approval when you get a chance.
Flags: needinfo?(daniel)
Comment on attachment 8504516 [details] [diff] [review]
v5-0001-bug-1077084-improved-DNS-cache-pruning-r-sworkman.patch

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]

Bug caused by (feature/regressing bug #): 939318

User impact if declined:  occasional crashes occur as outlined in this bug report

Testing completed: It has been landed and used for quite some time now in other Firefox versions/trees

Risk to taking this patch (and alternatives if risky): 
String or UUID changes made by this patch:
Flags: needinfo?(daniel)
Attachment #8504516 - Flags: approval-mozilla-b2g34?
Comment on attachment 8504516 [details] [diff] [review]
v5-0001-bug-1077084-improved-DNS-cache-pruning-r-sworkman.patch

Keeping the risk in mind and backing out if causing any side effect.
Attachment #8504516 - Flags: approval-mozilla-b2g34? → approval-mozilla-b2g34+
(In reply to Daniel Stenberg [:bagder] from comment #23)
> Approval Request Comment
> [Feature/regressing bug #]: 939318

Bug 939318 never landed on b2g34. Consequently, the patch in this bug doesn't apply at all. Why was this nominated for v2.1?
Flags: needinfo?(vliu)

(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #34)
> (In reply to Daniel Stenberg [:bagder] from comment #23)
> > Approval Request Comment
> > [Feature/regressing bug #]: 939318
> 
> Bug 939318 never landed on b2g34. Consequently, the patch in this bug
> doesn't apply at all. Why was this nominated for v2.1?

I think the reason why it was nominated for v2,1 is based on Comment 25 - Comment 30.
Flags: needinfo?(vliu)
Bug 1008091 never got uplifted to v2.1/v2.1S. There's nothing to do here AFAICT.
Comment on attachment 8504516 [details] [diff] [review]
v5-0001-bug-1077084-improved-DNS-cache-pruning-r-sworkman.patch

Clearing the approval for now. Please re-nominate if I'm missing something here.
Attachment #8504516 - Flags: approval-mozilla-b2g34+
You need to log in before you can comment on or make changes to this bug.