Closed Bug 501446 Opened 15 years ago Closed 9 years ago

Crash with NoScript and ABE enabled [@ PR_EnumerateAddrInfo | nsDNSRecord::GetNextAddr][@ PR_EnumerateAddrInfo ]

Categories

(Firefox :: Extension Compatibility, defect)

x86
macOS
defect
Not set
critical

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: justdave, Assigned: ma1)

References

()

Details

(Keywords: crash)

Crash Data

I've had this same crash 7 times so far in the last 6 or 8 hours, seems to happen at random when the machine is otherwise idle.  I have had three or four pages up with an auto-refresh on them.

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1pre) Gecko/20090620 Firefox/3.5pre Glubble/2.0.4.6 ID:20090630031219

ChatZilla 0.9.85
DownloadHelper 4.5
DownThemAll! 1.1.4
Firebug 1.4.0b3
Firefox PDF Plugin for Mac OS X 1.1
Glubble 2.0.4.6
Greasemonkey 0.8.20090123.1
Japanese-English Dictionary for rikaichan 1.10
Live HTTP headers 0.15
Mass Password Reset 1.04
Modify Headers 0.6.6
Nagios Checker 0.14.4
Names Dictionary for rikaichan 1.10
Nightly Tester Tools 2.0.2
NoScript 1.9.5
Personas for Firefox 1.2.1
Rikaichan 1.06
Update Channel Selector 1.5
User Agent Switcher 0.7.1
Web Developer 1.1.7
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.1pre) Gecko/20090630 Shiretoko/3.5.1pre Glubble/2.0.4.6

^^ My correct UserAgent...  I had User-Agent Switcher active because of a Facebook sniffing bug, and NTT believed it when I told it to paste. :)
This is bug 337418 reincarnated.
OK, given the information in bug 337418 and the additional debugging info requested there, I'm suspecting it's a site with automatic-update/refresh usage since it happens at random without me actually touching the browser, and I have several tabs open with auto-refresh stuff (keeping an eye on release activity, etc).

Tab 1: http://www.facebook.com/home.php - does ajax requests to poll for new status updates and notifications

www.facebook.com has address 69.63.184.30

Tab 2: https://nagios.mozilla.org/sentry/ - has a Refresh: 150 on it.

nagios.mozilla.org is an alias for dm-nagios01.mozilla.org.
dm-nagios01.mozilla.org has address 63.245.208.170

Tab 3: https://nagios.mozilla.org/ftplag/ - has a Refresh: 60 on it.

nagios.mozilla.org is an alias for dm-nagios01.mozilla.org.
dm-nagios01.mozilla.org has address 63.245.208.170

Tab 4: http://people.mozilla.org/~nthomas/misc/uptake.html - has <meta http-equiv="refresh" content="180">

people.mozilla.org is an alias for people.mozilla.com.
people.mozilla.com has address 63.245.208.169

Tab 5: http://downloadstats.mozilla.com/ - I assume it's using ajax of some sort to poll for the stats data.

downloadstats.mozilla.com is an alias for star-moz-com-php5.nslb.sj.mozilla.com.
star-moz-com-php5.nslb.sj.mozilla.com has address 63.245.209.77

Of course, the ones doing ajaxy stuff could be using other host names for where to pull data from.
What product/component owns class nsDNSRecord ?  
IMO, the problem probably lies there, not in NSS.
Perhaps we should add members of the cc list for bug 337418 to this bug's list.
PR_EnumerateAddrInfo	 nsprpub/pr/src/misc/prnetdb.c:2095
nsDNSRecord::GetNextAddr	netwerk/dns/src/nsDNSService2.cpp:134
nsDNSRecord::GetNextAddrAsString	netwerk/dns/src/nsDNSService2.cpp:166
NS_InvokeByIndex_P	xpcom/reflect/xptcall/src/md/unix/xptcinvoke_unixish_x86.cpp:179
XPCWrappedNative::CallMethod	js/src/xpconnect/src/xpcwrappednative.cpp:2454
XPC_WN_CallMethod	js/src/xpconnect/src/xpcwrappednativejsops.cpp:1590
js_Invoke	js/src/jsinterp.cpp:1386
js_Interpret	js/src/jsinterp.cpp:5179 

justdave: are you using PAC?
Assignee: wtc → nobody
Severity: major → critical
Component: NSPR → Networking
Product: NSPR → Core
QA Contact: nspr → networking
Summary: Crash [@ PR_EnumerateAddrInfo ] at random times → Crash [@ PR_EnumerateAddrInfo - nsDNSRecord::GetNextAddr] at random times
Version: other → Trunk
(In reply to comment #5)
> justdave: are you using PAC?

What's PAC?
Nope, not using that.
Looks like NoScript is to blame.  While running NoScript (no changes to base install), I loaded about 12 tabs, most to support.mozilla.com, along with three other tabs: Facebook, Nagios (open source monitoring tool, nagios.org) and OTRS (open source trouble ticket system, otrs.org).  

The common thread between OTRS and Nagios is meta refreshes every so often (90 seconds for Nagios, 2 minutes for OTRS.

Additionally, Facebook does some triggered JS, which checks back to FB servers to see if there are any new status updates.  I think it also interacts with their chat servers.

I did not have this crash before 6/30/09, and I was either up to date or a day or two behind Shiretoko Nightlies.  I'm fairly certain that whatever updates were made to Shiretoko and thus FF 3.5 between 6/27/09 and release is causing this issue.

CrashID for the crash with only NoScript running:

41724f2c-19f4-47da-ab54-785202090703	7/3/09	5:01 PM
NoScript 1.9.5 was released on or around June 29, 2009.  Since NoScript was the only AddOn installed, and the same tabs hadn't crashed Firefox with NO AddOns installed, I'm wondering if the combination is suspect.

I likely updated NoScript on June 29 or 30.  Maybe I'll install a previous version and see if that still causes the problems.  I also wonder if it is NoScript to blame, since between the 26th and 30th a bunch of non-3.5 browsers had the same issue.

Other related crashes of mine that show up in PR_EnumerateAddrInfo crash search:
3073a03a-093a-4a83-a7ba-2ba8e2090703
e18a0d58-1c80-469e-8bda-3f0082090630  Shiretoko
eea60f39-2c9a-4953-86eb-c610a2090630  Shiretoko
fa650c74-553b-425a-a1a0-a66742090702
42fc54ef-72f1-4bea-8276-e72032090701
Latest NoScript does perform async DNS resolution from the UI thread as part of ABE's protection against internet-to-intranet CSRF (see http://noscript.net/abe ) and implements a 2nd level cache on top of the native nsIDNSService's one in order to minimize XPCOM and inter-thread dispatching and gain more control over unpinning attacks.

During development I found that if I cached nsIDNSRecord objects directly I caused random crashed like the one described here (I suppose for some pointer ownership issue), so I started storing the address strings returned by the query into an ad-hoc JavaScript DNSRecord object for caching purposes, and had no crash at all since then.

However this report seems to suggest there's still something broken in what I do. Should I deep-clone the strings (e.g. new String(record.getNextAddrAsString()) ) rather than pushing them directly into a JS array?

If you've got NoScript 1.9.5 (or even better 1.9.5.4 from http://noscript.net/getit#devel) installed, the relevant code is all in chrome://noscript/content/DNS.js
Sorry, I actually read the stack now and it doesn't seem an ownership issue but rather a concurrence one. Nothing I can fix on my side, apparently.

Also, I wonder why none of my testers (nor I) had a single crash of this kind in one month of beta testing (the ones I got when I tried to cache nsIDNSRecord objects directly were in private preliminary builds) and I still can't see any report on the NoScript forum.
Stupid question related to comment 12 (unexpectedly few reports): might it be Mac OS X specific? I can see Darwin-specific code in prnetdb.c...
If you look at the crash-stats report linked from the URL field, you can see that almost all of the crashes with this signature are indeed on Mac.  There's a couple Windows NT 6.0 in there, but not very many.
It does seem that the prnetdb.c does have OSX and Darwin specific ifdefs, the only one starts at line 196, with a possible elif at line 333, but the else is at line 403, and the endif at 416.  There's one more DARWIN ifdef on line 2186 which defines function pr_StringToNetAddrFB(), which is called in PR_StringToNetAddr, which I don't see called, but then again, this code is pretty foreign to me, so it is possible that the Darwin specific code related to bug 404399 is to blame. 

Darwin 8.8.4 was OSX 10.4.4, and while OSX10.5 was released in October 2007 (after the bug was opened), it is possible that it is causing problems in 10.5.  I'm guessing that since no popular code used this code recently, it had not reared its head before, as it does now.  And given that the related crashes are almost all OSX, it's possible this code needs a new review.
It worked for me without a crash in Windows7, but crashed in OS X
TEMPORARY FIX FOR THIS ISSUE
If you want to avoid this bug temporarily, and this bug is due to your use of the NoScript Addon, open NoScript Preferences and Disable ABE.

Tools > AddOns > NoScript Preferences > Advanced Tab > ABE Sub-tab > Uncheck Enable ABE

According to NoScript developer Giorgio Maone confirmed in Bugzilla that ABE calls the code that seems to be broken in OSX version of Firefox, and that by disabling ABE, you still get the functionality of NoScript, without the crashes. However once this bug is fixed, you should re-enable ABE to have the utmost protection that NoScript can provide.
Is this going to get fixed?  We've narrowed it down to a piece of OSX-specific code, probably a bad pointer.  What do we do here to get this one fixed to we can use ABE again?
Out of curiosity, what is ABE?
(In reply to comment #20)
> Out of curiosity, what is ABE?

http://noscript.net/abe
From comment #11:

Latest NoScript does perform async DNS resolution from the UI thread as part of
ABE's protection against internet-to-intranet CSRF (see http://noscript.net/abe
) and implements a 2nd level cache on top of the native nsIDNSService's one in
order to minimize XPCOM and inter-thread dispatching and gain more control over
unpinning attacks.
I wonder if the changes to class nsDNSService for bug 453403 are in some way
causal of these crashes.
It'd be great to get this issue resolved.
Summary: Crash [@ PR_EnumerateAddrInfo - nsDNSRecord::GetNextAddr] at random times → Crash with NoScript and ABE enabled [@ PR_EnumerateAddrInfo - nsDNSRecord::GetNextAddr]
Summary: Crash with NoScript and ABE enabled [@ PR_EnumerateAddrInfo - nsDNSRecord::GetNextAddr] → Crash with NoScript and ABE enabled [@ PR_EnumerateAddrInfo - nsDNSRecord::GetNextAddr][@ PR_EnumerateAddrInfo ]
Summary: Crash with NoScript and ABE enabled [@ PR_EnumerateAddrInfo - nsDNSRecord::GetNextAddr][@ PR_EnumerateAddrInfo ] → Crash with NoScript and ABE enabled [@ PR_EnumerateAddrInfo | nsDNSRecord::GetNextAddr][@ PR_EnumerateAddrInfo ]
This has been driving my wife nuts (OSX Firefox user) for weeks now. I finally did some digging and found out about about:crashes, which finally pointed us to the NoScript ABE issue. Would really like to see this addressed, since it seems to be affecting a lot of people (as NoScript is one of the most popular Firefox addons).

She has just disabled the ABE feature and we're crossing our fingers that it stops the crashes.
ok, i claim this is noscript's fault:

timeless-mbp-2:ns timeless$ grep em:ver '/Users/timeless/Library/Application Support/Firefox/Profiles/*.default/extensions/{73a6fe31-595d-460b-a920-fcc0f8843232}/install.rdf' 
   <em:version>1.9.9.35</em:version>

timeless-mbp-2:ns timeless$ unzip '/Users/timeless/Library/Application Support/Firefox/Profiles/*.default/extensions/{73a6fe31-595d-460b-a920-fcc0f8843232}/chrome/noscript.jar' 

timeless-mbp-2:ns timeless$ grep idn-ser content/noscript/DNS.js        host = CC["@mozilla.org/network/idn-service;1"].createInstance(CI.nsIIDNService).convertUTF8toACE(host);

mao: the idn-service is a service. please use .getService().
Assignee: nobody → g.maone
Component: Networking → Extension Compatibility
Product: Core → Firefox
QA Contact: networking → extension.compatibility
Timeless,  Please spell your suggestion out a little more explicitly.
You might even spell it out in a form that an adventurous user could try
out him/her self by editing his/her own noscript.jar file.  If that fixes
the problem, then users get immediate relief, you get more fame & glory. :)
I think fixing NoScript to not cause the bug is not as good as fixing the underlying bug.  One day, another extension or even some Firefox code might come along that triggers the same bug.
@timeless:
You're correct about nsIIDNService being, as the interface name says, a service.
However I happened to copy that code (which, BTW, gets called quite rarely) verbatim from MDC. Therefore I'm gonna fix my call, but someone should fix the docs:
https://developer.mozilla.org/en/nsIIDNService

That said, how is this supposed to fix this very bug?
i'm not sure i want to think about or explain it. i can guess. the code uses locks to protect objects, but if there are two locks each of which should be the *only* lock for something, then when things think they have "locked" the object, they might have just locked a lock which wasn't the same as the other guy's lock for the same object.

as for the wiki, wow, that sucks. in general, anyone who creates an account can edit any page on the wiki (including that one). I don't touch wikis. You're welcome to fix it, or i'll see about getting someone to fix it.

nelson: i'd just as soon giorgio fix it and send it to everyone.

jonathan louie: comment 29 has identified the underlying bug, it's bad documentation.

our arrangement with extension authors is this:
You have full access to the entire application, you can do anything, you can crash the browser. we expect you to follow certain rules and try not to crash the browser.

among the rules are this: a service should be used as a service.

we might someday enforce that codewise, however officially the design of xpcom is such that there might be a class for which it's legal to have both a service and multiple instances. Unfortunately, since xpcom is portable, it's possible some downstream vendor has taken advantage of that with their own objects, which means if we change how the service manager works, we're breaking our contract with them.

it's unfortunate that sheppy made this mistake in his documentation (or in copying it from elsewhere), but he was basically the only person creating any MDC content and he created a lot, getting everything right is a lot to ask for.
While I agree that fixing the nsIIDNService instantiation is due (and indeed I did it), I doubt it's the actual culprit here, especially since it gets called exclusively for hosts containing characters which are illegal in ASCII domains (i.e. quite rarely so far).

After re-digging Gecko's sources I found some more suspicious code:
http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/src/nsDNSService2.cpp#184

By "doing the right thing" with the nsIDNSRecord interface (i.e. calling hasMore() before getNextAddrAsString()), NoScript is apparently calling getNextAddr() *twice*, and looking at the stack traces it crashes on the second call.

Furthermore, the thread safety of hasMore() seems dubious, so it plausible that it's preparing our crash since NoScript calls the DNS service asynchronously from an "unusual" thread (the UI one).

Therefore I eliminated the hasMore() call completely, and released both "fixes" (nsIIDNService and hasMore() elision) in NoScript 1.9.9.36, now available on http://noscript.net/getit#direct

Let's cross our fingers and see if crashes go down in a week or so...
Depends on: 716345
There are many startup crashes with this crash signature in 10.0b4.
It doesn't seem related to NoScript.

More reports at:
https://crash-stats.mozilla.com/report/list?signature=PR_EnumerateAddrInfo
Crash Signature: [@ PR_EnumerateAddrInfo] [@ PR_EnumerateAddrInfo | nsDNSRecord::GetNextAddr]
Keywords: crash
For crashes in 10.0b4, I filed bug 718389.
(In reply to Scoobidiver (away) from comment #33)
> For crashes in 10.0b4, I filed bug 718389.

I've closed the above bug report.

Does anyone believe this crash still exists?  
I'm not seeing any on crash-stats
Whiteboard: [closeme 2015-06-01]
Resolved per whiteboard
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE
Whiteboard: [closeme 2015-06-01]
You need to log in before you can comment on or make changes to this bug.