This bug was filed from the Socorro interface and is
report bp-0ac0593f-8bdf-47cd-851a-770f22110710 .
Filing as this is a new crash showing up in the early 6.0 data (but I did see it showing up in smaller numbers before beta). See https://crash-stats.mozilla.com/report/list?signature=nsStandardURL::SchemeIs%28char%20const*,%20int*%29. It also appears on 7.0a2 but in no other versions.
Some of the comments mentioned it happened right after they updated.
The stack varies quite a bit but this is one sample:
Frame  Module  Signature  Source
0 xul.dll nsStandardURL::SchemeIs netwerk/base/src/nsStandardURL.cpp:1686
1 yo0O9efo.cpl yo0O9efo.cpl@0xe67b
2 yo0O9efo.cpl yo0O9efo.cpl@0x3321
3 xul.dll nsDocShell::LoadURI docshell/base/nsDocShell.cpp:1431
4 xul.dll nsLocation::SetURI dom/base/nsLocation.cpp:364
5 xul.dll nsLocation::SetHrefWithBase dom/base/nsLocation.cpp:646
6 xul.dll nsRefPtr<nsIDOMEventListener>::~nsRefPtr<nsIDOMEventListener> obj-firefox/xpcom/build/nsCOMPtr.cpp:81
7 xul.dll nsLocation::SetHrefWithContext dom/base/nsLocation.cpp:593
It is #1 top crasher in 6.0 and happens almost exclusively at startup.
Stack traces contain either a .dll or a .cpl as frames 1 and 2.
Kairo, can we get some correlations for this one? It seems like there are different stacks... can we get a report with the groupings? That might help.
Tracking this for FF6.
(In reply to comment #2)
> Kairo, can we get some correlations for this one?
We don't have correlations with modules for 6.0 right now, and there's no way to force those reports. We need to wait until we have enough crashes on it.
And I'm not even sure how chofmann's signature grouping reports work, unfortunately.
The correlation reports work off an older method of picking a few of the releases with the highest-volume crashes to 'automatically' manage where we see the correlation reports. I think Aravind put that script together. Now that we have rapid release, we might be able to do correlations based on channel for Nightly, Aurora, and Beta without taxing the system for these extra reports, since the volume there should be lower.
It might be a good idea to file a bug on this and CC rhelmer, who has picked up some of Aravind's scripts.
(In reply to comment #5)
> It might be a good idea to file a bug on this and CC rhelmer, who has
> picked up some of Aravind's scripts.
Sure. I added the additional background to the cleanup parts that were initially talked about in that bug.
(In reply to comment #3)
> Tracking this for FF6.
Sheila, do we have enough data with beta2 to get a better gauge of this?
#9 on 6.0b2 yesterday.
Asa, for b2 it's still high. No obvious leads when we look at the add-on correlations. We do see it in previous versions (e.g. 3.6), but at very low volume. None on 5.0 in the last week.
Line 1686 on the 6.0 branch is this:
1686 *result = SegmentIs(mScheme, scheme);
I spot-checked a few of the crashes, and they all share the following features:
1) The crash is at address 0 or nearly 0 with a write violation.
2) The crash is on that line above, which implies that |result| is a null pointer.
3) The two stack frames above the SchemeIs call are from non-Gecko DLLs. Some
of the DLLs that appear there are:
4) The stack frame above that is either the call to InternalLoad in
nsDocShell::LoadURI or the call to InternalLoad in OnLinkClickSync.
I don't know what to make of item 3 above, nor how reliable it is...
Frame 1 is almost certainly correct, since frame 0 is in our code and we have unwind information for it. Frame 2 and beyond may be suspect since we may resort to stack scanning there.
I amend my previous statement. I downloaded the dump from comment 0, as well as the matching xul.sym from the symbol server, and running minidump_stackwalk locally gives me:
Thread 0 (crashed)
0 xul.dll!nsStandardURL::SchemeIs(char const *,int *) [nsStandardURL.cpp:0d82a53ffaa6 : 1686 + 0x7]
eip = 0x57b6fcf8 esp = 0x0018b6f4 ebp = 0x0018b74c ebx = 0x00000000
esi = 0x128d82f0 edi = 0x00000000 eax = 0x00000000 ecx = 0x0018b728
edx = 0x0000001c efl = 0x00010246
Found by: given as instruction pointer in context
1 yo0O9efo.cpl + 0xe67b
eip = 0x0c99e67c esp = 0x0018b6fc ebp = 0x0018b74c
Found by: call frame info with scanning
so the stackwalker did resort to scanning here, which means this is not an entirely reliable frame.
I'll try WinDBG to see what it thinks...
FWIW, I question the integrity of that randomly-named .cpl file:
Created attachment 546605 [details]
stack from windbg
WinDBG gave me something like:
xul!nsTHashtable<nsBaseHashtableET<nsCStringHashKey,nsFactoryEntry *> >::s_MatchEntry
(full stack attached). I don't know if that looks any saner, but it is nothing but Mozilla code on the stack.
OK -- so nsDocShell::LoadURI has this:
> 1365 nsCOMPtr<nsIScriptSecurityManager> secMan =
> 1366 do_GetService(NS_SCRIPTSECURITYMANAGER_CONTRACTID, &rv);
...which invokes the component manager code to look up that service.
The hashtable that's being searched looks like this:
> 151 nsDataHashtable<nsCStringHashKey, nsFactoryEntry*> mContractIDs;
and the hashtable ::s_MatchEntry() function just does this:
>374 return ((const EntryType*) entry)->KeyEquals(
>375 reinterpret_cast<const KeyTypePointer>(key));
At first glance, I'd expect that we'd only hit string-comparison code there (to compare two nsCStringHashKey objects) -- not sure why we're hitting nsStandardURL-comparison code instead. Maybe I'm misunderstanding, though.
So if I break (in a debug trunk build) at the first chunk in Comment 17, and single-step, I eventually hit what looks like the "s_MatchEntry" call from Ted's stack. Here's the function I'm in, copied from GDB:
> nsTHashtable<nsBaseHashtableET<nsCStringHashKey, nsFactoryEntry*> >::s_MatchEntry
When I single-step through that, I enter " nsCStringHashKey::KeyEquals" (defined in nsHashKeys.h), which just does a string comparison (nsTSubstring_CharT::Equals) and then returns. No URI code is invoked at all.
I have no idea how we could end up triggering a call to nsStandardURL::SchemeIs...
(In reply to comment #11)
> 1) The crash is at address 0 or nearly 0 with a write violation.
Some of them are at other addresses with a write violation, too -- e.g. the following are at 0x6ff7115f and 0x725c115f, respectively:
(In reply to comment #11)
> 2) The crash is on that line above, which implies that |result| is a null
BTW, the stack that Ted attached confirms this, for that crash. It has "int * result = 0x00000015" in the first line.
A few levels up Ted's stack, things are looking pretty busted -- here are 3 consecutive stack levels (partially trimmed for readability):
> xul!SearchTable(struct PLDHashTable * table = 0x0c993322, void * key = 0x128d82f0, unsigned int keyHash = 0x15, PLDHashOperator op = 1470972173 (No matching enumerant)
> xul!PL_DHashTableOperate(struct PLDHashTable * table = 0x0018b728, void * key = 0x0000001c, PLDHashOperator op = <Memory access error>
> xul!nsCOMPtr_base::~nsCOMPtr_base(void)+0xe [e:\builds\moz2_slave\rel-m-beta-w32-bld\build\obj-firefox\xpcom\build\nscomptr.cpp @ 82]
Note in particular:
* 1st line: Bogus PLDHashOperator value 1470972173 (Valid values are 0,1,2)
Likely-bogus keyHash = 0x15 (that's a distinctly un-random-looking hash value, and it happens to end up exactly matching our "int * result" in prev comment, which is suspicious)
* 2nd line: Likely-bogus "void * key" pointer, 0x0000001c
* 3rd line: Looks like it could trigger a delete operation -- that's basically the only way for code to be invoked from ~nsCOMPtr. Not sure why "delete" ends up translating to "do some bogus hash table operations" though.
I wonder if this is a double-free or something...
(Looks like we have a bundle of crashes in nsSimpleURI::SchemeIs, with similar-looking stacks, BTW. e.g. bp-831337db-a24c-461c-b8d4-19ac32110717 )
Created attachment 546700 [details]
alternate stack from MSVC
Here's a stack of a different (but related) crash report -- bp-831337db-a24c-461c-b8d4-19ac32110717 -- which is a crash in nsSimpleURI::SchemeIs (with the user comment "Starting Aurora for the first time"). In this case, we crash when setting the outparam *o_equals, because it's a null pointer. (Incidentally, |this| is null too, though.)
Snippet of stack:
> nsSimpleURI::SchemeIs(const char * i_Scheme, int * o_Equals) Line 483
> nsScriptSecurityManager::QueryInterface(const nsID & aIID, void * * aInstancePtr) Line 514 + 0x48 bytes
> nsComponentManagerImpl::GetServiceByContractID(const char * aContractID, const nsID & aIID, void * * result) Line 1642 + 0x1d bytes
> nsDocShell::LoadURI(nsIURI * aURI, nsIDocShellLoadInfo * aLoadInfo, unsigned int aLoadFlags, int aFirstParty) Line 1469 + 0x3e bytes
> nsLocation::SetURI(nsIURI * aURI, int aReplace) Line 364 + 0x12 bytes
> nsLocation::SetHrefWithBase(const nsAString_internal & aHref, nsIURI * aBase, int aReplace) Line 646 + 0x15 bytes
From debugging this minidump in MSVC, the stack mostly appears sensible, right up to the nsSimpleURI::SchemeIs call -- which makes no sense (how did we get there?). It appears that nsScriptSecurityManager::QueryInterface is calling "SchemeIs", but that's bogus, as the actual code there (implemented via NS_IMPL_ISUPPORTS4) does no such thing.
Per my comment 15, it's possible that there's malware involved here doing something awful.
Hmmm, yeah... The fact that things get mysterious (e.g. jump into the mysterious .cpl / .dll from ::LoadURI, according to crash stats) is a bit suspicious -- maybe this is malware trying to intercept visited URIs, either for redirection ("sure, paypal.com is right over here") or logging purposes. (Maybe calling SchemeIs to check for https?)
This *does* seem to be entirely windows-only, which increases the apparent likelihood of it all being due to a particular piece of [windows] malware. (In the last 4 weeks, there were 3666 crashes at nsStandardURL::SchemeIs & 165 at nsSimpleURI::SchemeIs, 100% on Windows.)
I'm unclear on why WinDBG / MSVC would disagree with crash-stats about the mysterious-.cpl files being involved (in comment 16 & comment 23), though. I guess the moz-only alternative stacks don't make a ton of sense anyway...
Can some users who left their email addresses in crash reports be contacted to get more information? Which information is needed?
Yeah, I don't know if that's actually a code address in that .cpl, but if you look at the stack memory from the crashing thread in the dump from comment 0, you'll clearly see:
0018b6f4 98 06 0d 12 7c e6 99 0c f0 82 8d 12 24 b7 18 00 1c 00
That address (0x0c99e67c) is the second word of memory on the stack. Not a smoking gun, but certainly suspicious.
So in every instance of this I've seen, the following is true:
(a) There's a non-Mozilla .dll or .cpl file on the stack, inside of ::LoadURI (which is a suspicious place to be jumping into external code, per the first chunk in comment 25)
(b) In the "Raw Dump" view, that dll/cpl is missing a bunch of metadata that's present in (virtually) all other loaded modules. e.g. here's the shady dll:
vs. here's what the rest of the modules look like:
Ted says in IRC: "99% of legit software will have version/debug id info. anything lacking it is automatically suspicious. that's a pretty sure sign of a virus/malware/trojan/whatever."
(c) In cases where the mysterious dll/cpl has a "normal"-looking name (e.g. SysPathVdm.dll, oleobjspl.dll, HpMainSnap.dll), a google-search for the filename turns up nothing, which suggests that it might be a randomly-generated-but-believable filename (from a wordlist, or from making tweaks to existing DLL names)
Conclusion: These facts *strongly* suggest that this is from one or more pieces of malware that use randomly-named (and metadata-lacking) DLL / CPL files.
(In reply to comment #26)
> Can some users who left their email addresses in crash reports be contacted
> to get more information? Which information is needed?
> Ref: https://support.mozilla.com/kb/crash-report-email-faq
That's probably a good idea, though I'm not sure what we'd ask them.
I imagine we'd want them to run some malware or virus scanner, and in an ideal world, we'd end up with either or both of the following results:
(a) the scanner would detect the same malware across different users
(b) the scanner would remove some malware, and in doing so, fix the Firefox crashes
If we get (a), that'd tell us there's a pattern & that our hunch about the cause is likely correct. If we get (b), that'd confirm the cause (and fix the problem for the users we contact, which is nice.)
This could also go wrong in a number of ways, though -- the scanner might detect false-positive malware that's not involved at all (since there are just a lot of malware-infected machines out there), or it might try & fail to remove malware and leave the machine in a less-functional state (unlikely but worth considering when we ask users to do something), or the scanner might fail to find anything 'cause it's missing signatures for whatever malware is involved here. (particularly considering the random-ish DLL / CPL filename)
So, the above are my thoughts/concerns about contacting users. I have no idea which malware scanner(s) to recommend, or who (cww?) would drive the user-contacting effort...
I think it's the equivalent of bug 633445, which is the #3 top browser crasher in 5.0 (159 occurrences in 6.0, but not related to malware), while this one is the #3 top browser crasher in 6.0 (7 occurrences in 5.0 or 5.0.1, but not related to malware).
Let's assign it to Chris, who will give email addresses to Cheng.
I need the email addresses and the name of the DLL... I can ask users to search their computers and email me that DLL.
That's about all that I think users will be able to do given the information we have. I won't recommend AV software, since I'd rather not troubleshoot other people's software, and pretty much all malware these days tries its darndest to cripple your AV.
For info., here is the SUMO article about malware that targets Firefox:
The failure to start Firefox is one of the symptoms.
In my view, the goal of contacting some users is not to help troubleshoot their problem (there's already the article above), but to get one or several kinds of DLLs in order to make Firefox more robust against these threats.
But maybe it's better to have a big visible symptom than invisible malicious ones. The risk is that Mozilla will be held responsible for the problem instead of the malware, and users will switch to another browser.
Sent mail to cww with a list of reports that have email addresses, to see if he can make headway.
(In reply to comment #31)
> I need the email addresses and the name of the DLL...
(the name of the DLL or CPL file varies per-report -- hopefully you can grab that on a case-by-case basis from the reports that chofmann sent you. Thanks cww!)
Agree with Scoobidiver in comment 32, so as soon as we figure out what to tell users we should consider turning on the e-mail auto-responder for these signatures.
(In reply to comment #35)
> what to tell users
After Cheng gets some DLLs, we can send something like this:
Some time ago, Firefox crashed on your computer and you sent us a crash report using the Mozilla Crash Reporter tool. We are sending you this e-mail because we have more information about that crash, and you indicated in the report that you wanted to be informed when we do.
We have established that this crash was caused by malware that targets, among other programs, Firefox. This malware uses randomly named DLL or CPL files so as not to be identified by anti-virus and anti-spyware software. Please follow the procedure in the following support article to try to get rid of it:
If you have questions about this email, see:
We hope this email helped you solve your crash issue.
The Mozilla Security team
Cww spoke with me & LegNeato this morning about maaaaaaaybe backing out bug 308590 and all its dependencies / helper-bugs, with the goal of giving AV companies another 6 weeks to killkillkill this malware before we ship an update that will make it trip over itself and explode in Firefox.
Cww points out that the current situation is this: if we release Firefox 6.0 as-is, our #6 beta topcrash will likely become a much-more-frequent crash for our much larger release audience, because they're likely to be more malware-ridden than our beta audience. That's bad.
From one perspective, it's better to crash like this than to let the malware hijack us to spy on users. However, we have reason to believe that this malware targets multiple browsers -- so the crash doesn't really stop any spying, because users may continue to be spied on when they try another browser after hitting this bug.
So, just to experiment, I ran a trial backout locally, to see how feasible that'd be. The results were good - I had no merge conflicts. The bugs that backed out locally were:
Bug 662242, Bug 659698, Bug 659177, Bug 658949, Bug 658845, Bug 308590
with these 15 csets:
In addition to the mega-backout, we'd also need a one-off fix for the inverse of bug 662242 -- basically, in order for sessionrestore to work across the backout-update for current beta users, we'd need to add a chunk of code much like bug 662242's patch but with the IIDs reversed. (because otherwise the nsIURI IID change will break their deserialization in one particular part of session restore.)
I'll leave it up to release drivers to decide whether or not we want this backout, but I have at least established that "it's possible and not too painful."
For the record, the malware is a variant (and possibly an undetected variant) of: http://www.symantec.com/security_response/writeup.jsp?docid=2010-121016-0900-99&tabid=3
So, Bug 308590 made 3 tweaks to the nsIURI interface, and 2 of them are before the declaration of "SchemeIs".
jst says it's possible that if we shift those tweaks to all live at the *end* of the nsIURI interface, then this problem will go away, because the first chunk of the vtable will look like it used to. That's worked with some similar bugs in the past, anyway.
That would require IID revving and an additional hackaround along the lines of bug 662242, but it's significantly more appealing than a full backout, and it might "fix" this for good regardless of how quickly AV companies can jump on this.
I'm spinning up a patch to hopefully make this change on trunk today, so that we can figure out whether this strategy works ASAP.
Created attachment 548651 [details] [diff] [review]
patch 1: shift nsIURI changes to end of interface
Created attachment 548652 [details] [diff] [review]
patch 2: rev IIDs in IDL files inheriting from nsIURI
This patch revs the IIDs of every interface that inherits (directly or indirectly) from nsIURI.
I verified that this is all of the interfaces whose IIDs we changed in the csets mentioned in comment 37.
Created attachment 548654 [details] [diff] [review]
patch 3: Swap out previous nsIURI IID for new one, in nsBinaryInputStream::ReadObject
This third patch extends the fix in bug 662242 to catch the newly-obsolete nsIURI IID and replace it with the new one, when we QI a freshly-deserialized principal during session restore.
I think it's well worth getting these patches landed ASAP to see whether this eliminates the crashes we've been tracking here. If so, we should consider this as an option for the branches as well.
(confirmed via local testing that patch 2 re-breaks bug 662242 and patch 3 re-fixes it. Hooray!)
(In reply to comment #43)
> I think it's well worth getting these patches landed ASAP
Agreed - I intend to land them tonight, for tomorrow's nightly build.
(In reply to comment #43)
> I think it's well worth getting these patches landed ASAP to see whether
> this eliminates the crashes we've been tracking here. If so, we should
> consider this as an option for the branches as well.
It happens only once every 3-5 days in 8.0a1, so it will be hard to determine if it is fixed without a testcase:
Cheng, can you ask someone who reliably crashes in 6.0 to test the new build?
This is malware. Now that I know how to fix it, I've been telling users to delete the file, so I don't have users with reproducible crashes.
Started discussion with release-drivers on this - it's not clear yet whether or not we want to take this fix on beta, due to how soon the next beta-->release switchover is & the potential for addon-author-pain caused by IID revving.
Will update this bug when we know more.
Sorry -- I forgot to mention that I landed the attached patches on m-c last night, so they're included in today's nightly build.
(In reply to comment #46)
> This is malware. Now that I know how to fix it, I've been telling users to
> delete the file, so I don't have users with reproducible crashes.
Try some you didn't talk to previously, by asking Chris for new email addresses.
(In reply to comment #49)
Already in progress... I was just referring to the fact that I can't just do it right away.
Crash-stats sees no instances of this from 20110727 or newer nightlies. (which have the csets from comment 48):
There are 2 crashes that *happened* since 2011-7-27, but those users were running unpatched out-of-date-by-a-few-days nightly builds.
So, the (sparse) nightly crash data so far suggests that Comment 48 worked.
Cww contacted some users per Comment 50 to ask them if patched builds fix their crash, but he hasn't heard back yet.
I'm going to request approval to land on Aurora. I intend to wait one more day before landing, for more crash-stats data and possible-user-responses-to-Cww, and then I'll assume the best & land.
Landed patches on aurora:
Tentatively resolving as FIXED, but we should watch crash-stats (on nightly as well as aurora) to verify.
(In reply to comment #47)
> Started discussion with release-drivers on this - it's not clear yet whether
> or not we want to take this fix on beta, due to how soon the next
> beta-->release switchover is & the potential for addon-author-pain caused by
> IID revving.
To follow up on this: we ended up leaning against taking this on beta, with the feeling that the fix would likely cause more collective pain than leaving it unfixed on that branch (due to the IID rev likely breaking a bunch of binary addons, without much time for them to react before release).
We decided to fix this on Aurora and Trunk, but not Beta. There is really no point continuing to track it for 6.0. If for some reason it becomes a serious issue after 6.0 is released, the crashkill team can track it.
(In reply to Daniel Holbert [:dholbert] from comment #52)
> Tentatively resolving as FIXED, but we should watch crash-stats (on nightly
> as well as aurora) to verify.
Just to follow up on this -- the fix does appear to have stuck!
Here's a search I just did, for crashes with this signature on Aurora over the last 2 weeks:
That has 89 results (including some crashes submitted yesterday), but the last *build* with any results returned is 2011080200, which uncoincidentally is the day that the patches landed on Aurora (fixing the subsequent nightly).
Also, there's at least one crash submitted on each build for the week before 20110802 (with four days that have >10 crashes). So, the fact that we've now had 6 straight days of generating apparently-non-crashy builds is significant.
(Same sort of thing on Nightly channel, though less dramatic. The most recent build with crashes there is 2011072600, which is the last unpatched build on that channel.)
So, I'm marking this VERIFIED|FIXED, based on the above crash analysis.
As expected, this is hitting us pretty badly after 6.0 shipped. This is the #4 report overall in yesterday's data for 6.
This happened to me after Firefox 5 updated itself to 6. The DLL causing the problem in my case was called apphelpinterval.dll. After removing it from the registry and rebooting, Firefox 6 started working.
This is my crash report: https://crash-stats.mozilla.com/report/index/c2720c47-7a6d-4a6a-96df-596282110817.
A virus scan of the dll file is here: http://www.virustotal.com/file-scan/report.html?id=904c029f2818d9136f6a4420abd26d4034b0f8970978e31255cf7034bac544c2-1307082643
Can we set up the auto-mailer to mail users crashing with this signature to tell them they have malware?
*** Bug 680418 has been marked as a duplicate of this bug. ***
bug 680418 shows malware with no randomly named dll: DesktopGLSvcs.dll.
(In reply to Scoobidiver from comment #59)
> bug 680418 shows malware with no randomly named dll: DesktopGLSvcs.dll.
I don't think that's new information, if I'm understanding you correctly. I've been using "random" to mean "different on each system" (which suggests that it's being randomly generated), but not necessarily gibberish. See e.g. comment 11, comment 28, comment 56, which all have superficially-legit-sounding DLL filenames.
*** Bug 680127 has been marked as a duplicate of this bug. ***