Closed Bug 775090 Opened 7 years ago Closed 7 years ago

Firefox startup crash in PR_EnumerateAddrInfo | PR_GetHostByAddr

Categories

(Core :: Networking, defect, critical)

14 Branch
x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox14 + wontfix
firefox15 - ---

People

(Reporter: marcia, Unassigned)

References

Details

(Keywords: crash, qawanted, topcrash, Whiteboard: [startupcrash])

Crash Data

This bug was filed from the Socorro interface and is report bp-3cfc2b99-4374-4cf0-a9f4-0009c2120718.
============================================================= 

This crash appears as a new signature that occurs only in 14.0.1. https://crash-stats.mozilla.com/report/list?signature=PR_EnumerateAddrInfo%20|%20PR_GetHostByAddr%20|%20PR_ExitMonitor%20|%20nspr4.dll@0x26cf. Currently the advanced query shows it ranking #15.

Suspect a third party issue but manual correlations are not showing anything so far. Comments so far are not useful.

Frame 	Module 	Signature 	Source
0 	nspr4.dll 	PR_EnumerateAddrInfo 	nsprpub/pr/src/misc/prnetdb.c:2117
1 	nspr4.dll 	PR_GetHostByAddr 	nsprpub/pr/src/misc/prnetdb.c:1171
2 	nspr4.dll 	PR_ExitMonitor 	nsprpub/pr/src/threads/prmon.c:132
3 	nspr4.dll 	nspr4.dll@0x26cf 	
4 	winmm.dll 	timeGetTime 	
5 	xul.dll 	nsSocketTransportService::Poll 	netwerk/base/src/nsSocketTransportService2.cpp:431
6 	nspr4.dll 	PR_ExitMonitor 	nsprpub/pr/src/threads/prmon.c:132
7 	xul.dll 	nsSocketTransportService::Run 	netwerk/base/src/nsSocketTransportService2.cpp:652
8 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:656
9 	xul.dll 	nsThread::ThreadFunc 	xpcom/threads/nsThread.cpp:289
10 	nspr4.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:426
11 	nspr4.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:122
12 	msvcr100.dll 	_callthreadstartex 	f:\dd\vctools\crt_bld\self_x86\crt\src\threadex.c:314
13 	msvcr100.dll 	_threadstartex 	f:\dd\vctools\crt_bld\self_x86\crt\src\threadex.c:292
14 	kernel32.dll 	BaseThreadInitThunk 	
15 	ntdll.dll 	__RtlUserThreadStart 	
16 	ntdll.dll 	_RtlUserThreadStart
Whiteboard: [startupcrash]
Let's wait for more data - volume is pretty low so far from what I can see. Since this crash is only affecting 14.0.1, possible causes could be:

* External issue where our beta population isn't representative
* External issue where only {release channel, version 14} is affected
* Bug 772282, which hasn't been on a beta release yet
* A crash signature move for some reason
* A buildID-specific build issue

Let's keep up with this, investigate any actionable leads, and discuss during tomorrow's channel meeting.
(In reply to Marcia Knous [:marcia] from comment #0)
> Comments so far are not useful.

FWIW, a German commenter mentions getting these crashes even in Safe Mode and after reinstalling.
It seems related to IPv6:
89% (449/502) vs.  34% (22863/66763) wship6.dll
(In reply to Scoobidiver from comment #3)
> It seems related to IPv6:
> 89% (449/502) vs.  34% (22863/66763) wship6.dll

Given this, CC'ing the networking team. Did we take any major IPv6 changes recently?

Also tracking for 14 given that this is #14 on the top crash list currently.

We need to start thinking about the types of software that only target our release software, but for some reason wouldn't show up in correlations - firewalls?
Keywords: topcrash
Here are some URLs so far:

186 	about:blank
23 	about:home
5 	about:sessionrestore
4 	http://www.google.co.uk/
4 	jar:file:///C:/Program%20Files/Mozilla%20Firefox/omni.ja!/chrome/browser/content
4 	http://www.facebook.com/
2 	http://www.mozilla.com/ru/firefox/14.0.1/whatsnew/?oldversion=13.0.1
2 	http://vuku.ru/
2 	http://www.mozilla.com/en-US/firefox/14.0.1/whatsnew/?oldversion=13.0.1
1 	http://www.lenta.ru/
1 	http://bl150w.blu150.mail.live.com/default.aspx#!/mail/InboxLight.aspx
1 	http://s2.gladiators.ru/xml/main/news.php?id=11443&enableChat=1
1 	https://www.google.com/search?q=facebook&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:
1 	http://gmx.de/
1 	http://www.diesiedleronline.de/de/spielen
1 	http://go.divx.com/divx/windows/uninstallsurvey/de
1 	http://www.facebook.com/home.php?
1 	https://www.google.com/search?q=soundcloud&ie=utf-8&oe=utf-8&aq=t&rls=org.mozill
1 	http://support.mozilla.org/1/firefox/14.0.1/WINNT/es-ES/prefs-main
1 	http://www.aol.com/
1 	http://www.yandex.ru/?clid=187997
1 	http://www.dailymail.co.uk/home/index.html
1 	https://www.mozilla.org/de/download/?product=firefox-14.0.1&os=win&lang=de
1 	http://www.google.ca/
1 	http://www.chip.de/downloads/Adobe-Flash-Player_13003561.html
1 	http://www.iransetup.com/
1 	http://www.repubblica.it/
1 	http://google.se/
1 	https://www.google.com/search?q=jesus%20manuel%20chavez%20plascencia&ie=utf-8&oe
1 	http://yandex.ru/yandsearch?text=%D0%BE%D0%B4%D0%BD%D0%BE%D0%BA%D0%BB%D0%B0%D1%8
1 	http://smotri.com/broadcast/list/
1 	http://odnoklassniki.ru/
1 	https://www.norsk-tipping.no/
1 	http://www.hentaimedia.com/
1 	http://www.apeha.ru/
1 	http://www.yandex.ru/?vid=101&clid=48577
1 	http://www.ask.com/?o=10148&l=dis&tb=PTV
1 	https://www.google.com/
1 	https://www.google.de/search?q=Firefox+14&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla
1 	http://www.yahoo.com/?ilc=1
1 	http://go.microsoft.com/fwlink/?LinkId=69157
1 	https://login.live.com/
1 	http://firefox.yandex.ru/
1 	https://services.addons.mozilla.org/en-US/firefox/discovery/pane/14.0.1/WINNT/no
1 	http://get.adobe.com/de/flashplayer/
1 	http://www.jeanmarcmorandini.com/
1 	http://www.jappy.de/
1 	http://www.google.cl/
1 	https://www.google.com/search?q=vkontakte&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla
1 	http://de.mg41.mail.yahoo.com/
1 	http://donbass.ua/news/health/2012/07/18/v-donbasse-malyshi-travjatsja-rakami-i-
1 	https://www.google.com/search?q=screen%20for%20desk&ie=utf-8&oe=utf-8&aq=t&rls=o
1 	about:addons
1 	http://de-de.facebook.com/
1 	http://home.webalta.ru/
I can see the stack gets screwed up at the call to nspr4.dll@0x26cf. When you disassemble it, timeGetTime calls code at address 700826C0 (winmm.dll!_soundPlay@8+0F1h); my winmm.dll is loaded at 70080000.

This could be related to ASLR, but ASLR has been enabled since Fx13 (bug 728429).

In one of the reports, nspr4.dll is loaded at 0x6dac0000 and winmm.dll at 0x73f50000, so I don't quite understand. Other threads are just waiting.
CC'ing Benjamin and Kyle, since they stand behind ASLR.
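The winmm.dll address quoted in comment 6 can be converted into a module-relative offset, which makes it easier to compare against the unsymbolicated frame nspr4.dll@0x26cf. This is only a minimal sketch in Python using the two numbers from that comment; the helper name `rva` is made up for illustration, and the values come from a local winmm.dll, not from a crash report.

```python
def rva(address: int, module_base: int) -> int:
    """Return the offset of an absolute address from its module's load base."""
    return address - module_base


# Values quoted in comment 6 (local machine, not taken from a minidump).
winmm_base = 0x70080000
callee_address = 0x700826C0

offset = rva(callee_address, winmm_base)
print(hex(offset))  # 0x26c0
```

The result, 0x26c0, is close to but not equal to the 0x26cf in the signature, so the similarity is suggestive at best and does not by itself show that the frame belongs to winmm.dll.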
Crashes on line 2117 at a call to _pr_ipv6_is_present()
Note, this is the same function call that crashes in bug 718389, and that bug does not have a resolution yet.

Checked changelog for PR_EnumerateAddrInfo and _pr_ipv6_is_present:
-- no changes since 2008 in the code for the functions.

The following thoughts seem to correlate with Honza's comment 6:

In bug 718389 the crash went up and down with different builds/releases, so I'm wondering if it's an intermittent build thing?

I'm also wondering if the function (_pr_ipv6_is_present) is being declared but not defined due to a build error? Or not linked correctly? The error is EXCEPTION_ACCESS_VIOLATION_READ at the time the function call is made, so something isn't being read right. I don't know how likely that is, but it seems worth it to ask the question.
Updates to 14.0.1 are now fully throttled, which will stop the bleeding. If this does in fact end up being a one-off (two-off?) build-related issue, let's figure out how to identify that the issue is present in a build. We may be able to get away with re-spinning 14.0.1 for all remaining users who will update.
(In reply to Honza Bambas (:mayhemer) from comment #10)
> The second set of reports is strictly on 13.* and 14.0.1 versions.  So I
> really bet this is some ASLR regression...

Is there a way to confirm without STR, prior to re-spinning?
Er, Firefox has had ASLR for a *long* time.  All we did was make sure that binary extensions have it too.  I think it's very unlikely that this is related.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #12)
> Er, Firefox has had ASLR for a *long* time.  All we did was make sure that
> binary extensions have it too.  I think it's very unlikely that this is
> related.

Right, but see Comment 8 - we suspect this has happened before and gone away before as well. What else is variable in that way besides ASLR? PGO?
ASLR is not variable in that way.  ASLR has been on for years.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #14)
> ASLR is not variable in that way.  ASLR has been on for years.

OK, understood. I don't know the build system inside out, so this might be a rathole-type of question, but what variable things are there - is there anything else that would change the binary between releases? Something Windows related? Something that only affects certain users? What can we check in the release binaries? No accusation - trying to rule out possibilities.

Per Honza's comment 6, the stack trace is also weird. There have been no changes to the code. How likely is it that winmm.dll is corrupted or infected? Or how likely is it that the crash reporter is reporting wrong because of an optimization? And, of most importance, is there any way we can check this?

I'm going to continue to investigate the code and build from Necko/NSPR side - is there something weird affecting the build flags around _pr_ipv6_is_present()?
(sorry to interrupt the actual technical conversation)

Just wanted to let you know that I re-checked, and the only other bug that landed between our final 14 beta and the 14.0.1 release build is bug 772841. Doubt that could be related though.
(Ooops sorry for the tracking-firefox15 change - that was a refresh issue, unintentional)
Update after looking at code in and around the stack trace:

The last part of the stack trace is just weird. It doesn't fit with the call chain in mozilla-release. PR_GetHostByAddr() doesn't call PR_EnumerateAddrInfo() according to what I see in mxr: http://mxr.mozilla.org/mozilla-release/source/nsprpub/pr/src/misc/prnetdb.c#1013 And the line number is the end of the function.

I know that compiler optimization can make it tricky for the stack trace to be collected sometimes, so I checked for places where these two functions were called next to each other just in case: I don't see any such occurrences. So, I'm not sure what code to look at for the last three calls mentioned.

Moving higher up the crash stack trace, before winmm.dll:
-- PR_IntervalNow: no recent changes to the code
-- The call at netwerk/base/src/nsSocketTransportService2.cpp:431 is actually PR_IntervalToSeconds() - no recent changes to that code either. The rest of the stack points to timeGetTime which is in PR_IntervalNow, so I'm not sure that this line number is right.
-- In Poll, I did notice that the param *interval is not null-checked and is set with the return value of PR_IntervalNow. That is a potential problem, but Poll seems only to be called by DoPollIteration, and interval is declared right before the call, so it shouldn't be null.
-- In 2011, there was a change from PRBool to bool; this affects param bool wait, but, again, the passed in var is of type bool. And I don't think this should have made a difference anyway.

I can keep looking at it from different angles, but something has been corrupted in the actual code path, or the reporter is having a difficult time understanding the trace. Not sure how to proceed with that. I'll also poke around in nspr4.dll and xul.dll tomorrow (Friday) to check what I can.

As per comment 6 and 15, I'm still wondering about winmm.dll. But then why don't we see more crashes affected by winmm.dll? And why only this stack trace? Why is PR_IntervalNow not crashing all over?

Please comment if you see something here that I don't.
(In reply to Steve Workman [:sworkman] from comment #15)
> Or how likely is it that the crash reporter is reporting wrong
> because of an optimization?

That surely can happen, bsmedberg and others might know more there.

I heard that loading a minidump with the MSVC debugger can lead to better stacks in such cases as it has more data available to walk the stack correctly.
I can't believe that nobody has looked at the minidump in a debugger; guessing based on the crash-stats dump is not a great use of time for something this critical:

>	nspr4.dll!PR_EnumerateAddrInfo(iterPtr=0x00000000, base=0x01525eb0, port=0x01bb, result=0x08af88b8)  Line 2117	C
 	xul.dll!nsDNSRecord::GetNextAddr(port=0x01bb, addr=0x08af88b8)  Line 150	C++
 	xul.dll!nsSocketTransport::OnSocketEvent(type=0x00000000, status=0x00000000, param=0x05b9b6c0)  Line 1490	C++
 	nspr4.dll!PR_ExitMonitor(mon=)  Line 134	C
 	xul.dll!nsThread::ProcessNextEvent(mayWait=true, result=0x045ef927)  Line 662	C++
 	xul.dll!NS_ProcessNextEvent_P(thread=0x0101de80, mayWait=true)  Line 245	C++
 	xul.dll!nsSocketTransportService::Run()  Line 654	C++
 	xul.dll!nsThread::ProcessNextEvent(mayWait=true, result=0x045ef9c4)  Line 662	C++
 	xul.dll!nsThread::ThreadFunc(arg=0x0201de01)  Line 289	C++
 	nspr4.dll!_PR_NativeRunThread(arg=0x0200f640)  Line 448	C

The disassembly is:

  2109: PR_IMPLEMENT(void *) PR_EnumerateAddrInfo(void             *iterPtr,
  2110:                                           const PRAddrInfo *base,
  2111:                                           PRUint16          port,
  2112:                                           PRNetAddr        *result)
  2113: {
6D1487C0  push        ebp  
6D1487C1  mov         ebp,esp  
6D1487C3  and         esp,0FFFFFFF8h  
6D1487C6  push        ecx  
  2114: #if defined(_PR_HAVE_GETADDRINFO)
  2115:     PRADDRINFO *ai;
  2116: #if defined(_PR_INET6_PROBE)
  2117:     if (!_pr_ipv6_is_present()) {
6D1487C7  cmp         dword ptr ds:[6D02858Ch],0   <-- CRASH HERE reading 0x6D02858C
6D1487CE  push        ebx  
6D1487CF  push        esi  
6D1487D0  push        edi  
6D1487D1  je          6D14EAF0  
6D1487D7  cmp         dword ptr [__type_info_root_node+0BCh (6D167A88h)],0
6D1487DE  je          6D14EAFA  

_pr_ipv6_is_present and PR_CallOnce are inlined at 0x6D14EAF0 in this function:

  2117:     if (!_pr_ipv6_is_present()) {
6D14EAF0  call        _PR_InitStuff (6D143F40h)  
6D14EAF5  jmp         PR_EnumerateAddrInfo+17h (6D1487D7h)  
6D14EAFA  mov         eax,1  
6D14EAFF  mov         ecx,65817A8Ch  
6D14EB04  xchg        eax,dword ptr [ecx]  
6D14EB06  test        eax,eax  
6D14EB08  jne         6D14EB46  
6D14EB0A  call        _pr_init_ipv6 (6D145920h)  

So the initial "cmp" should be checking _pr_initialized from here:
http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/misc/prinit.c#771
and then the second cmp is checking once->initialized two lines below

According to MSVC, &_pr_initialized

&_pr_initialized	0x6d16858c __pr_initialized	int *

So it kinda looks to me like the compiler got the wrong address for _pr_initialized in the code.
Trying this in nightly shows that the matching code in nightly is loading/checking the correct address of __pr_initialized.
According to dumpbin, the initial disassembly of PR_EnumerateAddrInfo is:

_PR_EnumerateAddrInfo:
  100087C0: 55                 push        ebp
  100087C1: 8B EC              mov         ebp,esp
  100087C3: 83 E4 F8           and         esp,0FFFFFFF8h
  100087C6: 51                 push        ecx
  100087C7: 83 3D 8C 85 02 10  cmp         dword ptr [__pr_initialized],0
            00

So when relocated into this minidump, I believe the address of __pr_initialized should have been

0x1002858c - 0x10000000 (original base address) + 0x6d140000 (new base address) == 0x6d16858c which is what MSVC said. So ISTM that either:

* the memory got corrupted
* the relocation process produced the wrong result

Either way I don't think this is a problem with PGO or code generation.
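The relocation arithmetic above can be checked with a few lines of Python. This is a sketch, not tooling used in the bug: the function name `relocate` is hypothetical, and the preferred image base of 0x10000000 is an assumption, chosen because it is the value that maps the dumpbin function address 0x100087C0 onto the minidump address 0x6D1487C0.

```python
def relocate(address: int, preferred_base: int, actual_base: int) -> int:
    """Rebase an absolute address from the preferred image base to the actual one."""
    return address - preferred_base + actual_base


# Assumption: preferred base inferred from 0x100087C0 (dumpbin) -> 0x6D1487C0 (minidump).
PREFERRED_BASE = 0x10000000
ACTUAL_BASE = 0x6D140000

function_address = relocate(0x100087C0, PREFERRED_BASE, ACTUAL_BASE)
variable_address = relocate(0x1002858C, PREFERRED_BASE, ACTUAL_BASE)
faulting_address = 0x6D02858C  # address the crashing cmp actually read

print(hex(function_address))                     # 0x6d1487c0
print(hex(variable_address))                     # 0x6d16858c
print(hex(variable_address - faulting_address))  # 0x140000
```

With that base, the relocated address of __pr_initialized matches what MSVC reports, and the address the crashing instruction read is 0x140000 below it.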
OK, I have a theory.  If you look at the addresses for the crashes, they *all* end in 02858c, and the first byte is either 00, 10, 5c, 5e, 60, 63, 65, 66, 67, 68, 69, 6a, 6b, 6c, 6d, 6e, 6f, etc.  Also, the absolute addresses later in the function are correct.  Therefore it's extremely likely that we're dealing with a memory corruption bug here, not a compiler/linker bug.  In other words, something is corrupting the first byte of this address, leaving the rest untouched.

(And this makes this bug so much harder to figure out :( )
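For the single minidump disassembled earlier in this bug, the expected operand (0x6D16858C, per MSVC) and the operand the CPU actually used (0x6D02858C) can be compared byte by byte. The sketch below is illustrative Python, not something that was run in the bug; it assumes the `cmp` at 0x6D1487C7 is encoded as `83 3D` followed by a 4-byte little-endian address, as the dumpbin listing shows.

```python
import struct

expected_operand = 0x6D16858C  # &_pr_initialized according to MSVC
observed_operand = 0x6D02858C  # address read by the crashing instruction

# x86 stores the 32-bit address operand in little-endian byte order.
expected_bytes = struct.pack("<I", expected_operand)
observed_bytes = struct.pack("<I", observed_operand)

differing = [
    (index, expected_bytes[index], observed_bytes[index])
    for index in range(4)
    if expected_bytes[index] != observed_bytes[index]
]

# The operand starts two bytes into the instruction (after opcode 83 and ModRM 3D).
OPERAND_START = 0x6D1487C7 + 2
for index, want, got in differing:
    print(hex(OPERAND_START + index), hex(want), hex(got))  # 0x6d1487cb 0x16 0x2
```

In this one report exactly one byte of the operand differs. The sketch says nothing about the other reports or about the spread of leading bytes listed above.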
Correlation report:

  PR_EnumerateAddrInfo | PR_GetHostByAddr | PR_ExitMonitor | nspr4.dll@0x26cf|EXCEPTION_ACCESS_VIOLATION_READ (502 crashes)
     89% (449/502) vs.  34% (22863/66763) wship6.dll
     88% (444/502) vs.  35% (23402/66763) WSHTCPIP.DLL
     88% (444/502) vs.  36% (24074/66763) Wldap32.dll
     83% (419/502) vs.  33% (22038/66763) NapiNSP.dll
     82% (413/502) vs.  33% (22026/66763) pnrpnsp.dll
     81% (409/502) vs.  33% (21952/66763) nlaapi.dll
     71% (355/502) vs.  28% (18675/66763) FWPUCLNT.DLL
     71% (355/502) vs.  28% (18784/66763) RpcRtRemote.dll
    100% (502/502) vs.  58% (38474/66763) rasadhlp.dll
     71% (355/502) vs.  29% (19184/66763) cryptsp.dll
     75% (375/502) vs.  33% (21900/66763) DWrite.dll
    100% (502/502) vs.  59% (39116/66763) browsercomps.dll
    100% (502/502) vs.  59% (39552/66763) softokn3.dll
    100% (502/502) vs.  59% (39690/66763) firefox.exe
    100% (502/502) vs.  59% (39712/66763) xpcom.dll
    100% (502/502) vs.  60% (40065/66763) dbghelp.dll
     94% (473/502) vs.  55% (36545/66763) nssckbi.dll
     94% (473/502) vs.  55% (36711/66763) freebl3.dll
     94% (473/502) vs.  55% (36716/66763) nssdbm3.dll
     95% (477/502) vs.  56% (37532/66763) feclient.dll
     95% (475/502) vs.  56% (37624/66763) winrnr.dll
     90% (454/502) vs.  54% (36262/66763) rsaenh.dll
    101% (507/502) vs.  65% (43414/66763) mswsock.dll
     78% (390/502) vs.  42% (27911/66763) t2embed.dll
    100% (502/502) vs.  64% (42988/66763) dnsapi.dll
     80% (401/502) vs.  45% (29889/66763) ntmarta.dll
    100% (502/502) vs.  66% (43784/66763) wintrust.dll
     99% (499/502) vs.  69% (46376/66763) urlmon.dll
     55% (278/502) vs.  27% (17821/66763) explorerframe.dll
     55% (278/502) vs.  27% (17839/66763) dui70.dll
     56% (280/502) vs.  27% (18283/66763) duser.dll
     88% (444/502) vs.  61% (41004/66763) propsys.dll
     95% (478/502) vs.  70% (46401/66763) iertutil.dll
    100% (501/502) vs.  76% (50827/66763) wininet.dll
     88% (444/502) vs.  66% (44321/66763) powrprof.dll
     88% (444/502) vs.  67% (44652/66763) winnsi.dll
     88% (444/502) vs.  67% (44652/66763) nsi.dll
     88% (444/502) vs.  67% (44658/66763) IPHLPAPI.DLL
     88% (444/502) vs.  67% (44662/66763) dwmapi.dll
     83% (418/502) vs.  65% (43554/66763) MMDevAPI.dll
     83% (415/502) vs.  65% (43376/66763) AudioSes.dll
     80% (401/502) vs.  63% (41833/66763) normaliz.dll
     88% (444/502) vs.  73% (48423/66763) lpk.dll
     71% (355/502) vs.  56% (37240/66763) devobj.dll
     71% (355/502) vs.  56% (37244/66763) sechost.dll
     71% (355/502) vs.  56% (37244/66763) CRYPTBASE.dll
     71% (355/502) vs.  56% (37244/66763) KERNELBASE.dll
     69% (346/502) vs.  54% (36320/66763) profapi.dll
     71% (355/502) vs.  56% (37590/66763) cfgmgr32.dll
     90% (454/502) vs.  78% (51742/66763) msctf.dll
     28% (143/502) vs.  16% (10414/66763) mdnsNSP.dll
     29% (147/502) vs.  16% (11010/66763) WLIDNSP.DLL
     76% (380/502) vs.  64% (42994/66763) psapi.dll
     17% (85/502) vs.   6% (3973/66763) ntdsapi.dll
     20% (101/502) vs.  10% (6875/66763) wshbth.dll
     22% (112/502) vs.  15% (9953/66763) d3d10.dll
     22% (112/502) vs.  15% (9953/66763) d3d10core.dll
     23% (117/502) vs.  16% (10754/66763) dxgi.dll
     23% (116/502) vs.  16% (10683/66763) d3d10_1core.dll
     23% (116/502) vs.  16% (10683/66763) d3d10_1.dll
     22% (111/502) vs.  15% (10036/66763) d2d1.dll
      9% (45/502) vs.   4% (2414/66763) wkscli.dll
     16% (80/502) vs.  11% (7245/66763) AudioEng.dll
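To rank modules from a report like the one above, a short script can compute the gap between a module's share in the crashing population and its share overall. The following is a minimal sketch in Python over three lines copied from the report; the regular expression and the function name `parse_line` are assumptions made for illustration and are not part of Socorro.

```python
import re

# Matches lines such as "89% (449/502) vs.  34% (22863/66763) wship6.dll".
LINE = re.compile(r"\s*\d+% \((\d+)/(\d+)\) vs\.\s+\d+% \((\d+)/(\d+)\) (\S+)")


def parse_line(line: str) -> tuple[str, float, float]:
    """Return (module, share among these crashes, share among all crashes)."""
    match = LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized correlation line: {line!r}")
    hits, total, base_hits, base_total = (int(g) for g in match.groups()[:4])
    return match.group(5), hits / total, base_hits / base_total


sample = [
    "     89% (449/502) vs.  34% (22863/66763) wship6.dll",
    "    100% (502/502) vs.  58% (38474/66763) rasadhlp.dll",
    "     17% (85/502) vs.   6% (3973/66763) ntdsapi.dll",
]

rows = [parse_line(line) for line in sample]
# Sort by the difference in share, largest gap first.
rows.sort(key=lambda row: row[1] - row[2], reverse=True)

for module, share, baseline in rows:
    print(f"{module}: {100 * (share - baseline):.0f} percentage points")
```

Sorting by the difference in share puts wship6.dll first among these three, consistent with the IPv6 observation made earlier in the bug. A large gap is a correlation only and does not identify a cause.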

Has anybody in QA attempted to reproduce this on ipv6 machines?  That seems to have high relevancy to the crash happening.
Adding QA wanted and seeing about ipv6 machines.
Keywords: qawanted
The other thing that Juan and I talked about a few minutes ago was whether anything from the July 10th Patch Tuesday might have tickled something here. He mentioned there were comments relating to people rolling back their machines or brownouts - he can clarify further since I think those comments were in Spanish.
(In reply to Marcia Knous [:marcia] from comment #25)
> Adding QA wanted and seeing about ipv6 machines.

I tried this on a Win 7 VM, fully updated - no crash. Network Adapter and test-ipv6.org both say that IPv6 is enabled. Maybe QA will have different luck.
We had a user report this at https://support.mozilla.org/en-US/questions/932781. It ended up being bp-7ac3330b-2a28-4c3c-8599-696122120720. If needed I can reach out to the user to gather needed information.
Tyler: If you could please get some additional information from the user, that would be great. I would be interested in whether they applied any MS patches from the most recent Patch Tuesday.

Also, I notice they have Ad Aware, but I am wondering which version, since there are three on the site.

(In reply to Tyler Downer [:Tyler] from comment #28)
> We had a user report this at
> https://support.mozilla.org/en-US/questions/932781. It ended up being
> bp-7ac3330b-2a28-4c3c-8599-696122120720. If needed I can reach out to the
> user to gather needed information.
(In reply to Tyler Downer [:Tyler] from comment #28)
> We had a user report this at
> https://support.mozilla.org/en-US/questions/932781. It ended up being
> bp-7ac3330b-2a28-4c3c-8599-696122120720. If needed I can reach out to the
> user to gather needed information.

This looks like an instance of this crash to me...
cor-el asked what may be a telling question on https://support.mozilla.org/en-US/questions/932781?page=2.

"What are the connection settings?

    Tools > Options > Advanced : Network : Connection > Settings
    https://support.mozilla.org/kb/Options+window+-+Advanced+panel 

Does it help if you disable IPv6?

    http://kb.mozillazine.org/Error_loading_websites#IPv6"
I have reached out to the user, and am waiting to hear back with the info Marcia asked for above, as well as an answer to cor-el's comment. As soon as I receive those I will update here.
(In reply to Alex Keybl [:akeybl] from comment #31)
> 
> Does it help if you disable IPv6?
> 
Tried this in my VM; still not reproducible. I doubt this should make a difference though - the function in question is looking at OS capability for NSPR rather than a configuration pref in Firefox. Nonetheless, playing with IPv6 config is worthwhile. I did a little bit of that when trying to reproduce in my Win 7 VM: specifically, I tried different combinations of the Windows 7 network adapter settings, IPv4 only, v6 only and v4 & v6 - still not reproducible.

I'll search for other Windows IPv6 settings - maybe something in the registry?

Re the memory address of the var being wrong, beyond reproducing the bug and hooking up a debugger, I'm not sure how else to determine what is causing it.
This signature is dropping down the crash charts day-to-day (it's a permanent startup crasher of course, so that makes sense).

We've now had close to 50 million ADU at our peak late last week, but the number of affected users is only (2593 throttled crashes) * (10x throttling factor) / (~3 launches per user at the least). That's about 8k users. Once we fully unthrottle, we can expect no more than another ~10k users lost (sad I know).
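The estimate above is simple enough to reproduce. A minimal sketch in Python, using only the figures quoted in this comment; the variable names are illustrative, and the 50 million ADU figure is the approximate peak mentioned above.

```python
throttled_crashes = 2593       # crash reports received while throttled
throttle_factor = 10           # 10x throttling, so scale reports up by 10
launches_per_user = 3          # "at the least", so the result is an upper bound
peak_adu = 50_000_000          # approximate peak active daily users

estimated_crashes = throttled_crashes * throttle_factor
estimated_users = estimated_crashes / launches_per_user
share_of_adu = estimated_users / peak_adu

print(round(estimated_users))   # 8643
print(f"{share_of_adu:.4%}")    # 0.0173%
```

Because each affected user is assumed to launch at least three times, dividing by three gives an upper bound on the number of users, which is roughly the "about 8k" quoted above.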

Given the fact that this is such an amorphous issue, possibly malware related, and we don't have any actionable leads at the moment, let's deprioritize this to an ongoing investigation as opposed to a chemspill driver (15+). We should be looking into hardening around this and bug 718389 for the next release.

That being said, if anybody has ideas of what we could ask affected users to run (something with full crash dumps enabled?) instead of asking them for access to their computer, please do share.

Also, any methods of comparing the builds affected by bug 718389 and this bug would be greatly appreciated.

Steve - feel free to reassign later this week when you're going out of town.
Assignee: nobody → sworkman
What we're seeing here is a byte in the code page getting modified after the binary corresponding to it has been loaded by the OS loader.  This is very worrying, as it means that some code in our address space is modifying the page's protection bits and overwriting a byte in our native code.  This may result in any number of weird cases (crashes or worse) if this turns out to happen systematically on different addresses...  I think determining why this is happening is potentially more important than the percentage of users affected by this crash.
(In reply to Ehsan Akhgari [:ehsan] from comment #35)
So, this is outside my range of expertise in the DNS code - It definitely seems like there's a serious issue here - is there anything to be done to harden the build or code against this? I ask because it seems unlikely that we'll get debugger access to an affected machine, and I haven't been able to reproduce it for debugging internally. Or what other options are there?
(In reply to Steve Workman [:sworkman] from comment #36)
> (In reply to Ehsan Akhgari [:ehsan] from comment #35)
> So, this is outside my range of expertise in the DNS code - It definitely
> seems like there's a serious issue here - is there anything to be done to
> harden the build or code against this? I ask because it seems unlikely that
> we'll get debugger access to an affected machine, and I haven't been able to
> reproduce it for debugging internally. Or what other options are there?

Just to be clear, there is probably nothing wrong with our DNS code.  See comment 20 through comment 23.  The generated code in nspr4.dll attempts to access the correct address.  However, the code that gets loaded from nspr4.dll has one of its bytes modified, which causes the CPU to try to read from an invalid address, which is what triggers the crash.  What we need to find out here is what is modifying our code pages.  I don't believe we can protect against this kind of stuff easily since we have no way of checking whether a page in our code has been modified before starting to execute it.
(In reply to Ehsan Akhgari [:ehsan] from comment #37)
> Just to be clear, there is probably nothing wrong with our DNS code.  See
> comment 20 through comment 23.  

Yup, that's what I understood - if there's a way to harden it or something else though ... but you think not. Or at least not easily.

> The generated code in nspr4.dll attempts to
> access the correct address.  However, the code that gets loaded from
> nspr4.dll has one of its bytes modified, which causes the CPU to try to read
> from an invalid address, which is what triggers the crash.  What we need to
> find out here is what is modifying our code pages.  I don't believe we can
> protect against this kind of stuff easily since we have no way of checking
> whether a page in our code has been modified before starting to execute it.

:( Yeah, it was a long shot to ask.
QA Contact: mozillamarcia.knous
Juan and I have installed various versions of Ad Aware from the http://lavasoft.com/ site. So far I have not generated any crashes.

http://lavasoft.com/products/ad_aware_pro.php has both safe networking and safe browsing options so that is the one we are targeting ATM.
(In reply to comment #38)
> (In reply to Ehsan Akhgari [:ehsan] from comment #37)
> > Just to be clear, there is probably nothing wrong with our DNS code.  See
> > comment 20 through comment 23.  
> 
> Yup, that's what I understood - if there's a way to harden it or something else
> though ... but you think not. Or at least not easily.

Not without some black arts, and serious performance penalties.  :(
Re-assigning to you Tyler, since I'm going on vacation for two weeks and it looks like we're still exploring the possibility of getting the user's machine - it seems like you're a good point person for the time being. I suggest you coordinate with bsmedberg or alex regarding re-assignment if/when you get the user's machine.
Thanks!
Assignee: sworkman → tdowner
The user has not gotten back to me after I attempted to reach out through several channels. I've also looked for other users affected by this crash, and haven't found any (or at least none have gotten back to me with crash IDs so I can see if it is this crash). Unassigning.
Assignee: tdowner → nobody
I received a response from one user who was experiencing this problem. The user was running four or five tabs with social networking sites, hotmail, and the school's site (duoc.cl), and had the following antivirus programs installed: eset online scanner, ccleaner, malwarebytes

The problem happened while upgrading Firefox. The user removed Firefox from his machine and could not send us crash ids for this.
Component: General → Networking
Product: Firefox → Core
(In reply to juan becerra [:juanb] from comment #43)
> I received a response from one user who was experiencing this problem, and
> he/she was running four or five tabs with social networking sites, hotmail,
> his school's site (duoc.cl), and that he had the following antivirus
> programs installed: eset online scanner, ccleaner, malwarebytes

Can we try re-testing with these sites and apps installed on a few Windows machines? Should be fairly quick to test, since post-install Firefox should just not start if we're able to repro.
I've been testing on a couple of machines, one real and one virtual, with the software mentioned in comment #43, doing all sorts of common user actions including application updates, but so far I have not encountered any problems.
Another user has offered to help and he provided us with a system profile. A quick look shows a few modules associated with Firefox that look suspicious, Google Desktop, RealPlayer Browser Record Plugin, AOL 9.1.

Other software installed in that machine that looks suspicious: SkyCaddie, Babylon 1.5.3.17, Prevx Computer Security Investigator, Skype 5.0.0.156

We can try installing all of this and see if we are able to reproduce the problem, but I will hold off on uploading the system profile information (xml file) until we hear back from the user if this is ok.
This moved down a bit in rank in 14.0.1 to #40 in the last week, but it would still be good to figure out what is going on here.
(In reply to comment #46)
> Another user has offered to help and he provided us with a system profile. A
> quick look shows a few modules associated with Firefox that look suspicious,
> Google Desktop, RealPlayer Browser Record Plugin, AOL 9.1.
> 
> Other software installed in that machine that looks suspicious: SkyCaddie,
> Babylon 1.5.3.17, Prevx Computer Security Investigator, Skype 5.0.0.156
> 
> We can try installing all of this and see if we are able to reproduce the
> problem, but I will hold off on uploading the system profile information (xml
> file) until we hear back from the user if this is ok.

FWIW I installed RealPlayer and couldn't reproduce.  I could not get download links for the other two (Google Desktop and AOL 9.1).
There aren't any crashes for versions other than 14.0.1 for this, so I'm untracking it for 15.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME