Closed Bug 598007 Opened 9 years ago Closed 4 years ago

Start-up crash under Windows XP [@ nsDiskCacheMap::Open(nsILocalFile*) ]

Categories

(Core :: Networking: Cache, defect, critical)

x86
Windows XP

RESOLVED WORKSFORME

People

(Reporter: scoobidiver, Assigned: jduell.mcbugs)

References

(Blocks 1 open bug)

Details

(Keywords: crash, user-doc-needed)

Build: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b7pre) Gecko/20100919 Firefox/4.0b7pre

This is a residual crash signature that exists in trunk builds.
It is the #57 top crasher for 4.0b7pre over the last two weeks.

Signature	nsDiskCacheMap::Open(nsILocalFile*)
UUID	c6ceb095-65f9-447c-8768-85eed2100920
Time 	2010-09-20 06:20:51.277412
Uptime	1
Last Crash	2 seconds before submission
Install Age	40712 seconds (11.3 hours) since version was first installed.
Product	Firefox
Version	4.0b7pre
Build ID	20100919042023
Branch	2.0
OS	Windows NT
OS Version	5.1.2600 Service Pack 3
CPU	x86
CPU Info	GenuineIntel family 6 model 23 stepping 6
Crash Reason	EXCEPTION_ACCESS_VIOLATION_READ
Crash Address	0xffffffff80000000
App Notes 	AdapterVendorID: 10de, AdapterDeviceID: 0622

Crashing Thread
Frame 	Module 	Signature [Expand] 	Source
0 		@0x80000000 	
1 	xul.dll 	nsDiskCacheMap::Open 	netwerk/cache/nsDiskCacheMap.cpp:155
2 	xul.dll 	nsDiskCacheDevice::OpenDiskCache 	
3 	xul.dll 	nsDiskCacheDevice::Init 	netwerk/cache/nsDiskCacheDevice.cpp:384
4 	xul.dll 	nsCacheService::CreateDiskDevice 	netwerk/cache/nsCacheService.cpp:1305
5 	xul.dll 	nsCacheService::SearchCacheDevices 	netwerk/cache/nsCacheService.cpp:1718
6 	xul.dll 	nsCacheService::ActivateEntry 	netwerk/cache/nsCacheService.cpp:1627
7 	xul.dll 	nsCacheService::ProcessRequest 	netwerk/cache/nsCacheService.cpp:1490
8 	xul.dll 	nsProcessRequestEvent::Run 	netwerk/cache/nsCacheService.cpp:913
9 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:547
10 	xul.dll 	nsThread::ThreadFunc 	xpcom/threads/nsThread.cpp:263
11 	nspr4.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:426
12 	nspr4.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:122
13 	mozcrt19.dll 	_callthreadstartex 	obj-firefox/memory/jemalloc/crtsrc/threadex.c:348
14 	mozcrt19.dll 	_threadstartex 	obj-firefox/memory/jemalloc/crtsrc/threadex.c:326
15 	kernel32.dll 	BaseThreadStart
Assignee: nobody → honzab.moz
Keywords: regression
Looks like something has been rotten here for a long time, apparently a race condition. There are similar, much older reports from this area of code:

http://crash-stats.mozilla.com/report/index/b0cc7822-a4cb-429e-b758-cdea22100906
http://crash-stats.mozilla.com/report/index/9e270a67-27f6-4b4c-b98b-9b16f2100918
http://crash-stats.mozilla.com/report/index/97cc456a-1361-48b9-bd8f-bb9662100830
http://crash-stats.mozilla.com/report/index/07761768-0cbf-492d-8b3e-7d0242100907

Something has just woken this up so that it happens more often. I will look for the regression range.
I wouldn't be surprised if this has something to do with the smart_size changes in bug 559942, though I don't exactly see how.

If we're lucky this *might* be fixed by bug 596476 or 595413.  Alas, the former won't make it into Beta7.
Depends on: 595413, 596476
This is the #2 top crash in early 4.0b7pre data from yesterday. We should figure out how to mitigate this. Is bug 596476 still on track? It sounds like bug 595413 has been fixed as of Sept. 15. We need to check deeper to see if that fix has helped, but it looks like it may not have.
blocking2.0: --- → ?
Actually, it looks like this got worse on builds from Sept. 17.

Maybe after trunk users got the patches in
https://bugzilla.mozilla.org/show_bug.cgi?id=596476#c7  or
https://bugzilla.mozilla.org/show_bug.cgi?id=595413#c8

date     tl crashes at, count build, count build, ...
         nsDiskCacheMap::Open.nsILocalFile..
20100910 2 3.62010011514, 2 ,, 
20100911  ,, 
20100912 2 3.0b12007110904, 2 ,, 
20100913 16 ,, 13 3.0b12007110904, 2 3.0.52008120122, 1 3.0b22007121120, 
20100914 14 ,, 8 3.0b22007121120, 5 3.0b12007110904, 1 3.6.92010082415, 
20100915 2 ,, 1 3.6.92010082415, 1 3.62010011514, 
20100916 2 3.6.92010082415, 2 ,, 
20100917 12 ,, 10 4.0b7pre2010091704, 1 4.0b62010091408, 1 3.0b12007110904, 
20100918 8 ,, 7 4.0b7pre2010091704, 1 3.0b12007110904, 
20100919 46 ,, 42 4.0b7pre2010091704, 3 4.0b62010091408, 1 3.6.102010091412, 
20100920 79 , 67 4.0b7pre2010091704, 7 4.0b7pre2010091904, 4 3.0b12007110904, 
20100921 163 , 67 4.0b7pre2010091704, 60 4.0b7pre2010092004, 31 4.0b7pr20100919
Notes:

1) Every instance of this crash is happening only on "Windows NT 5.1.2600 Service Pack 3". I'm guessing this reduces the importance of this bug, though we should obviously fix it.

2) It seems to be causing fewer crashes in the last few days.  But that could be an artifact of jitter in the number of NT 5.1 boxes running beta7--I don't know how many such boxes there are out there, so we may have high variance.

3) Honza is correct in comment 1 that this bug has been triggered for a while.  The main change seems to be that it used to get (infrequently) hit via a synchronous codepath from AsyncOpen->Connect->OpenCacheEntry->ProcessRequest, whereas now it's getting (more frequently) hit via the async cache read path created in bug 513008 (eliminate sync reads from cache).   So if we get desperate, that's the bug to back out (I'd really hate to back that out, though)

Still looking into the cause of this by poring over the stack trace. One thing I don't understand is the segfault happening at nsDiskCacheMap.cpp:155: that's a function call, and shouldn't segfault. How accurate are our crash stack traces? (I notice there's a frame 0 with just an address listed.) I assume we're crashing in OpenBlockFiles() somewhere.
(In reply to comment #5)
> So if we get
> desperate, that's the bug to back out (I'd really hate to back that out,
> though)

I would like to avoid that too.

> How accurate are our crash stack traces

On Windows, the position recorded on the stack is placed after the line that is being executed. So you have to find the line that was actually executed "manually" by going upward in the source code.
> On windows you get the cursor on a stack put after the line it is being
> executed.  So, you have to find a line executed "manually" by going upward in
> the source code.

Sorry, having trouble understanding.  So the stack trace is

 0 @0x80000000
 1 xul.dll nsDiskCacheMap::Open netwerk/cache/nsDiskCacheMap.cpp:155
 2 xul.dll nsDiskCacheDevice::OpenDiskCache

And line 155 is a call to OpenBlockFiles().  So are you saying that means the crash happened somewhere in OpenDiskCache, or in nsDiskCacheMap::Open somewhere above line 155?
Status: NEW → ASSIGNED
Is this Breakpad, or MSVC? Breakpad often skips the next-to-top frame when the top frame is a numeric address.
This should probably block b7. It's now the #1 topcrash in b7pre and a regression from b6. Can someone mark the blocking status so we make sure it's on the release radar?
Blocking beta7.
blocking2.0: ? → beta7+
Keywords: topcrash
Oddly enough, sometimes the error is EXCEPTION_ACCESS_VIOLATION_READ, and sometimes EXCEPTION_ACCESS_VIOLATION_EXEC.  Has anyone ever seen that before?  Given that all errors are on x86 systems, which don't even support separate read/exec page permissions, is that a red herring?

FWIW this looks like the same as bug 595957 (which goes back as far as 3.0b1): it also seems to be affecting only Windows NT machines in Russia, and has essentially the same stack trace.  The only difference I see is that async cache reads weren't landed yet in 3.6.x, and that some of the errors are EXCEPTION_ACCESS_VIOLATION_WRITE, which we don't seem to be getting any more with b7pre.

Very weird.  I'd love to hear ideas on how to proceed (other than staring at code, which I'm still doing).  Do we have a Windows NT box somewhere?
Summary: start-up crash under Windows XP [@ nsDiskCacheMap::Open(nsILocalFile*) ] → start-up crash under Windows NT [@ nsDiskCacheMap::Open(nsILocalFile*) ]
Wild guess #1:  This is a problem with appending ASCII to a Cyrillic filename, and/or passing a Cyrillic filename to an NSPR I/O function.  I don't understand charsets (and maybe XPCOM) well enough to know.

nsresult
nsDiskCacheMap::GetBlockFileForIndex(PRUint32 index, nsILocalFile ** result)
{
    if (!mCacheDirectory)  return NS_ERROR_NOT_AVAILABLE;
    
    nsCOMPtr<nsIFile> file;
    nsresult rv = mCacheDirectory->Clone(getter_AddRefs(file));
    if (NS_FAILED(rv))  return rv;
    
    char name[32];
    ::sprintf(name, "_CACHE_%03d_", index + 1);
    rv = file->AppendNative(nsDependentCString(name));
    if (NS_FAILED(rv))  return rv;
    
    nsCOMPtr<nsILocalFile> localFile = do_QueryInterface(file, &rv);
    NS_IF_ADDREF(*result = localFile);

    return rv;
}

The IDL for AppendNative says that the argument must be in the native charset of the filesystem (in our error case, Russian Cyrillic).  If for some reason converting ASCII "_CACHE_001_" to wchar and appending it (AppendNative does the conversion to wchar) returns NS_OK, but then the QI back to nsILocalFile fails, we'll return NS_OK without having touched 'result', which is a stack variable and thus garbage, which could then segfault when OpenBlockFiles calls Open() with it.

But ASCII usually converts to wchar fine, right? And I don't see any reason why the QI back to nsILocalFile could fail: mCacheDirectory is an nsCOMPtr<nsILocalFile>, so we're just going from that to nsIFile and back. There's nothing fancy about nsLocalFileWin.cpp's implementation of QI:

    NS_IMPL_THREADSAFE_ISUPPORTS4(nsLocalFile, nsILocalFile, 
                                  nsIFile, nsILocalFileWin, nsIHashable)
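
The hazard worried about here is generic: an out-parameter that is written only on success is safe only if the caller checks the returned status before using it. A minimal sketch of that pattern follows. It uses hypothetical names (Status, File, GetFile, OpenFirstFile) rather than the real XPCOM types, and it is not the Mozilla code.

```cpp
#include <cassert>

// Hypothetical stand-ins for nsresult and nsILocalFile; not Mozilla types.
enum class Status { Ok, Failed };
struct File { int id; };

// Models the pattern under discussion: the out-parameter is written only on
// success, so the returned status is the caller's only signal that *result
// is valid.
Status GetFile(bool lookupSucceeds, File** result) {
    static File file{1};
    if (!lookupSucceeds) {
        return Status::Failed;   // *result is deliberately left untouched
    }
    *result = &file;
    return Status::Ok;
}

// Defensive caller: initialise the out-parameter and check the status before
// dereferencing. Returns -1 when no file is available.
int OpenFirstFile(bool lookupSucceeds) {
    File* file = nullptr;        // never leave this as stack garbage
    if (GetFile(lookupSucceeds, &file) != Status::Ok || !file) {
        return -1;
    }
    return file->id;
}
```

With the pointer initialised to null and the status checked, a failed lookup becomes an ordinary error return instead of a jump through garbage.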

Wild guess #2: We could get past GetBlockFileForIndex OK, and die in nsDiskCacheBlockFile::Open(), which passes the file to OpenNSPRFileDesc(), which calls the Windows SDK functions GetFileInfo() and CreateFileW().  MSDN doesn't mention GetFileInfo() supporting unicode.  Perhaps some of our Russian users have home directories with characters in them that trigger some sort of crash (only on Windows NT)?
AppendNative should be fine here, it's always ASCII-compatible. The obvious way to check is to create a profile in a Cyrillic-named directory and run against it.

GetFileInfo is not a win32 API, it's http://mxr.mozilla.org/mozilla-central/source/xpcom/io/nsLocalFileWin.cpp#473 and it is unicode-safe.

This is WinXP, so if you don't have a VM of it, we can arrange for one, or you can get somebody in the QA lab to run some experiments for you.
Summary: start-up crash under Windows NT [@ nsDiskCacheMap::Open(nsILocalFile*) ] → start-up crash under Windows XP [@ nsDiskCacheMap::Open(nsILocalFile*) ]
sample of OS versions from yesterday

  87 Windows NT 5.1. nsDiskCacheMap::Open(nsILocalFile*)

83      0.954023        Windows NT5.1.2600 Service Pack 3
4       0.045977        Windows NT5.1.2600 Service Pack 2
Can we get an ETA for a patch here?  Or, will this be fixed by bug 596476?  Also, are we still sure this should block beta 7?
I no longer think bug 596476 is relevant--this is much older than smart sizing.

I can't give an ETA, because I still have no clue what's going on.  I've asked for help from the Mozilla Russia folks, and am trying to repro on an XP box I've set up with Cyrillic.

Re: blocking beta 7: this only appears to affect Russian Windows XP boxes.  It also seems to have tapered off in frequency from 300 crashes/day on 9/17 to 20-30 per day in the last few days. 

   http://tinyurl.com/28hrqvz

Alas, I have no idea why the decline is happening, so it could go back up.  

I wouldn't personally keep the train at the station for this, but I'm not a release driver and don't know how much we care about the Russian audience for the beta.
No longer depends on: 596476, 595413
Leaving this as a blocker so we keep investigating (though it's not clear to me that it actually needs to block, or that it's even something we can fix), but this should not block beta7, not given the decline in crashes and the fact that this has been around seemingly forever.
blocking2.0: beta7+ → betaN+
It's been around forever in low volume, but the crashes happening now are almost exclusively on 4.0b7pre. Also, we are seeing some kind of spike related to crashes from Russia or to the Cyrillic problems noted in bug 599126 and bug 597260, but those seem unconnected in both timing and the releases they apply to.

Here are the latest stats on which builds were hit by this in the last few days.

date     tl crashes at, count build, count build, ...
         nsDiskCacheMap::Open.nsILocalFile..
20100920 79 ,, 67 4.0b7pre2010091704, 7 4.0b7pre2010091904, 4 3.0b12007110904,
                1 4.0b7pre2010091804, 
20100921 163 ,, 67 4.0b7pre2010091704, 60 4.0b7pre2010092004, 
                31 4.0b7pre2010091904, 3 3.0b12007110904, 
                 1 3.6.92010082415, 1 3.6.102010091412, 
20100922 136 ,, 66 4.0b7pre2010091704, 27 4.0b7pre2010092104, 
                19 3.0b12007110904, 9 4.0b7pre2010092204, 
                 6 4.0b7pre2010091904, 4 3.0b22007121120, 
                 3 4.0b7pre2010091804, 1 4.0b7pre2010092004, 
                 1 3.0.52008120122, 
20100923 87 ,, 32 4.0b7pre2010091704, 27 3.0b12007110904, 
                8 4.0b7pre2010092204, 8 4.0b7pre2010092004, 
                4 4.0b7pre2010091904, 3 4.0b7pre2010092104, 
                2 4.0b7pre2010091804, 1 4.0b7pre2010092304, 
                1 3.6.62010062523, 1 3.6.102010091412,
Hmm.. I don't see that creation/access to nsCacheService::mDiskDevice would be synchronized...  There is some nsCacheService::mLock and the ref counter is thread safe, but what happens when we enter the code on two threads concurrently?
Exactly: executing nsCacheService::SearchCacheDevices.
There are a lot of comments in German in the latest crashes.

I tried creating an account with some Czech letters in the name, but had no luck reproducing.
(In reply to comment #19)
> Hmm.. I don't see that creation/access to nsCacheService::mDiskDevice would be
> synchronized...  There is some nsCacheService::mLock and the ref counter is
> thread safe, but what happens when we enter the code on two threads
> concurrently?

Taking that back... I just checked that all code paths leading to access to mDiskDevice are protected by nsCacheService::mLock.

(In reply to comment #21)
> I was trying to create an account with some Czech letters in the name, no luck
> to reproduce.

And the system was Windows XP SP3 [5.1.2600]
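
For readers unfamiliar with the concern raised and then withdrawn in this exchange, the property being checked is that every path that creates or reads the device pointer holds the same lock. A minimal sketch under that assumption is below. CacheService, DiskDevice and GetOrCreateDiskDevice are hypothetical names modelled loosely on the discussion, and std::mutex stands in for whatever lock type the real code uses.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical model of a lazily created device; not the Mozilla class.
struct DiskDevice { int initCount = 0; };

class CacheService {
public:
    // Every path that reads or creates mDiskDevice takes mLock first, so two
    // threads entering concurrently cannot both create the device.
    DiskDevice* GetOrCreateDiskDevice() {
        std::lock_guard<std::mutex> guard(mLock);
        if (!mDiskDevice) {
            mDiskDevice.reset(new DiskDevice());
            ++mDiskDevice->initCount;   // counts how often creation ran
        }
        return mDiskDevice.get();
    }

private:
    std::mutex mLock;
    std::unique_ptr<DiskDevice> mDiskDevice;
};
```

If any single path touched the pointer without taking the lock, the check-then-create sequence could run twice, which is the scenario comment 19 was asking about.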
I installed the multilingual user interface package for Russian, created a user account with Cyrillic characters, and created a profile in a folder with Cyrillic characters. I've been trying to reproduce through general browsing, but no luck so far. None of the comments I saw say much in the way of reproducing the problem.
Just a thought, referring to bug #595957, comment #4: Is there any way we could get hold of the fx-binaries from a user who has experienced this and check if there is a trojan involved? (Or alternatively: Is there a way we can guarantee that no trojan is mucking things up in this particular case?)
It's possible that malware is involved, but malware is rarely the cause of a #1 top crash, and even more rarely of a #1 top crash that affects trunk users.

Another thing to do would be to make sure we've looked at all the changes on trunk just prior to Sept. 17 that could have affected cache operations, since this ramped up exclusively on 4.0b7pre builds.

Honza started that in comment 1, but it's not clear that anything conclusive was found.
Summary: start-up crash under Windows XP [@ nsDiskCacheMap::Open(nsILocalFile*) ] → spike in 4.0b7pre start-up crash under Windows XP [@ nsDiskCacheMap::Open(nsILocalFile*) ]
I don't see any mention of 
b47978b94fc9
2010-09-16 20:21 -0700	Bjarne Herland - Bug 596808 - nsDiskCacheDevice::Init() called twice resulting in no disk cache available r=jduell, a=betaN

which landed shortly before this started appearing. I wonder if it might be worth backing that out for b7, or for a few days on trunk, to see if it makes the volume drop back down. What would the trade-off be?
If we are thinking about investigating and trying a backout of bug 596808, we should flip the blocking flag from "betaN+" to "beta7+" so it gets on the radar to hold the release.
Let's try the backout and see what it does to the stats.
blocking2.0: betaN+ → beta7+
Whiteboard: [trying a backout]
You might have found the issue although I don't see the relevance of Cyrillic profiles...

The patch for bug #596808 was supposed to initialize the disk-device earlier than it used to. I believe the issue here is that this actually fails (because of the check for existence of the disk-device object in nsCacheService::OnProfileChanged() !) and that this has consequences for later requests, which actually create and initialize the disk-device. The reason the patch resolves bug #596808 is simply that it avoids initializing the disk-device twice (it fails).

IMO, the solution is to ensure the disk-device is created in nsCacheService::OnProfileChanged(). I can come up with a patch for this later, or Honza or Michal could do it.
Yeah, there was a huge spike in this crash on the 17th, although there were a few on the 14th:

http://crash-stats.mozilla.com/report/list?range_value=4&range_unit=weeks&signature=nsDiskCacheMap%3A%3AOpen%28nsILocalFile*%29&branch=2.0&product=Firefox

Since it seems like a startup crash, it probably does seem important to fix for beta7.
For what it's worth, there's also a second cache change in the one-day window when this started:
http://hg.mozilla.org/mozilla-central/rev/26e2971eeec9
Could someone offer an explanation of why the number of these crashes drops dramatically in nightlies *after* the 17th (see also comment #16)?

Also observe that a crash-profile described with nsDiskCacheDevice::OpenDiskCache() is on the top-crasher list of 3.6.9, mainly on WinNT 5.1 SP3 (with lots of Cyrillic fonts in the comments). The stacks from these crashes look very similar to the stacks for this issue.

I'm not so convinced that the patch for bug #596808 is the culprit anymore. IMO it is likely that the earlier initialization performed in this patch exposes something lurking in other parts of the code, and I believe we should try to track down and fix the real issue. It might be worth backing it out to see if it makes a difference in the stats but there are not many crashes with this signature anymore, so I'm not convinced we will see anything.
Is it possible the decline came because people were crashing on startup so they stopped using the browser? Seems like a reasonable reaction to me.
> Is it possible the decline came because people were crashing on startup so 
> they stopped using the browser? Seems like a reasonable reaction to me.
According to crash stats, the number of users is increasing:
2010-09-27 	1,824 	40,509 	100% 	4.5%
2010-09-26 	1,876 	32,320 	100% 	5.8%
2010-09-25 	1,867 	30,232 	100% 	6.18%
2010-09-24 	1,864 	33,378 	100% 	5.58%
2010-09-23 	2,081 	32,737 	100% 	6.36%
2010-09-22 	2,431 	30,958 	100% 	7.85%
2010-09-21 	2,571 	28,803 	100% 	8.93%
2010-09-20 	2,040 	25,458 	100% 	8.01%
2010-09-19 	1,653 	20,031 	100% 	8.25%
2010-09-18 	1,714 	18,371 	100% 	9.33%
2010-09-17 	2,519 	20,792 	100% 	12.12%
2010-09-16 	738 	18,556 	100% 	3.98%
2010-09-15 	22 	11,565 	100% 	0.19%
2010-09-14 	1,081 	2,601 	100% 	41.56%
> Could someone offer an explanation why the number of these crashes drops
> dramatically in nightlies *after* the 17th (see also comment #16) ?

That's an interesting point, but I'm not sure we can say the crashes have "dropped" without understanding how fast people might be rolling forward. The core of our nightly testers move forward pretty routinely and aggressively, but we have had several tech press articles with "feature X lands on mozilla nightlies" lately. One of these articles might have skewed the pool of users on builds from the 17th, or changed the nightly tester composition, and maybe more people got stuck on Sept. 17 or just gave up. Here are updated stats.

Crashes are showing up on the 0924 and 0925 builds, but it's true they are still at half the rate of 0917.

20100916 2 3.6.92010082415 2 , 
20100917 12  10 4.0b7pre2010091704, 
	     1 4.0b62010091408, 1 3.0b12007110904, 
20100918 8  7 4.0b7pre2010091704, 
	     1 3.0b12007110904, 
20100919 46  42 4.0b7pre2010091704, 
	     3 4.0b62010091408, 1 3.6.102010091412, 
20100920 79  67 4.0b7pre2010091704, 
	     7 4.0b7pre2010091904, 4 3.0b12007110904, 
	     1 4.0b7pre2010091804, 
20100921 163  67 4.0b7pre2010091704, 
	     60 4.0b7pre2010092004, 31 4.0b7pre2010091904, 
	     3 3.0b12007110904, 1 3.6.92010082415, 
	     1 3.6.102010091412, 
20100922 136  66 4.0b7pre2010091704, 
	     27 4.0b7pre2010092104, 19 3.0b12007110904, 
	     9 4.0b7pre2010092204, 6 4.0b7pre2010091904, 
	     4 3.0b22007121120, 3 4.0b7pre2010091804, 
	     1 4.0b7pre2010092004, 1 3.0.52008120122, 
20100923 87  32 4.0b7pre2010091704, 
	     27 3.0b12007110904, 8 4.0b7pre2010092204, 
	     8 4.0b7pre2010092004, 4 4.0b7pre2010091904, 
	     3 4.0b7pre2010092104, 2 4.0b7pre2010091804, 
	     1 4.0b7pre2010092304, 1 3.6.62010062523, 
	     1 3.6.102010091412, 
20100924 66  34 4.0b7pre2010091704, 
	     15 4.0b7pre2010092404, 6 3.0b12007110904, 
	     5 4.0b7pre2010092004, 3 4.0b7pre2010091904, 
	     2 4.0b7pre2010091804, 1 3.6.102010091412, 
20100925 87  47 4.0b7pre2010091704, 
	     20 4.0b7pre2010092404, 6 3.0b12007110904, 
	     5 4.0b7pre2010092304, 4 4.0b7pre2010092312, 
	     3 3.6.102010091412, 2 3.0.52008120122, 
20100926 85  40 4.0b7pre2010091704, 
	     25 4.0b7pre2010092504, 10 4.0b7pre2010092004, 
	     5 4.0b7pre2010092404, 4 3.0b12007110904, 
	     1 3.6.102010091412, 
20100927 89  51 4.0b7pre2010091704, 
	     12 4.0b7pre2010092204, 9 4.0b7pre2010092604, 
	     9 4.0b7pre2010092312, 4 3.0b12007110904, 
	     3 4.0b7pre2010092404, 1 3.6.102010091412,
(In reply to comment #35)
> that's an interesting point, but I'm not sure we can to say the crashes have
> "dropped", without understand how fast people might be rolling forward.  The
> core of our nightly testers move forward pretty routinely and agressively, but
> we have had several tech press articles with "feature X lands on mozilla
> nightlies" lately.  One of these articles might have skewed the pool of users
> on builds from the 17, or changed the nightly tester composition, and maybe
> more people got stuck on sept 17 or just gave up.  here are updated stats.

It seems likely that people got stuck on the Sept. 17 build, since this seems to be a startup crash.
Status:

Spent much of the day staring at minidumps w/dbaron and sicking.  Didn't get much traction.

Just checked in a version bump of the HTTP cache:

   http://hg.mozilla.org/mozilla-central/rev/a9d1ad0bc386

This will cause nightly users to have their cache re-created.  We wanted to do this anyway so that nightly users get the fallocate optimization from bug 592520.  But also, since landing 592520 coincided with the crash spike for beta7 (comment 31), we may wind up seeing either a crash or a dropoff in the crash count.  Seemed worth trying.

I'm also planning to land the patches for bug 596476 tomorrow--they clean up the smart size logic, and might help reduce the crash rate if we're lucky, though they're almost definitely not going to completely fix this.
I think we were able to rule a few things out from the minidumps:

The most notable thing we ruled out is the idea that it's related to having Cyrillic characters in the username. In a bunch of the minidumps (maybe even all?), there were parts of file paths for a cache map file on the stack, and those paths were for the user name Admin.


It's perhaps also of interest that the crashes for this bug are *off* the main thread, and during the crash, the main thread is waiting for the cache lock.  This made it seem like the bug on making nsCacheProfilePrefObserver::GetSmartCacheSize (which runs off the main thread) not call NS_GetSpecialDirectory might help, although we couldn't really see how.

We didn't come to a conclusion about whether or not this is the same as bug 595957.  They have a whole bunch of similarities, though:  most user comments are Cyrillic, spiked around the same time (although not exactly).  It's possible that both are related to malware circulating in Russia, the Ukraine, and Poland.
(In reply to comment #35)
> crashes are showing up on 0924 and 0925 builds, but its true they are still 1/2
> the rate of 0917

I'm sorry, but we're probably looking at different data...  I tend to look at the link provided in comment #30, then choose the "Table" tab. I see 4 crashes on the 14th, 536 on the 17th, 43 on the 24th and 25 on the 25th. Am I looking at the wrong thing?

(In reply to comment #37)
> Just checked in a version bump of the HTTP cache:

Brilliant idea!  :)  If we see another spike, I'd suggest to bump again and back out #596808 (it should probably be fixed more thoroughly anyway).

(In reply to comment #38)
> The most notable is that it's related to having Cyrillic characters in the
> username.  In a bunch of the minidumps (maybe even all?), there were parts of
> file paths for a cache map file on the stack, and those paths were for the user
> name Admin.

Admin means elevated privileges on Windows, right? Virus/Malware...?

A few holes in the story still:

- do we know if the users who experience this crash run the beta again without
  the crash, or do we even know that this is on the first run? (The version-bump
  may provide insight here.)

- is there really no relation to this crash

http://crash-stats.mozilla.com/report/list?range_value=2&range_unit=weeks&signature=nsDiskCacheDevice%3A%3AOpenDiskCache%28%29&version=Firefox%3A3.6.9

  which also has a spike on the 17th?
(In reply to comment #38)
> We didn't come to a conclusion about whether or not this is the same as bug
> 595957.  They have a whole bunch of similarities, though:  most user comments
> are Cyrillic, spiked around the same time (although not exactly).  It's
> possible that both are related to malware circulating in Russia, the Ukraine,
> and Poland.

Sorry - I missed the fact that this is the same as the 3.6.9-crash I was referring to in previous comment.

AFAICS the crashes for these two issues seem to both revolve around the statement

rv = mCacheMap.Open(mCacheDirectory)
The theory about the bad off-main-thread usage of the directory service is very plausible; see bug 597658. There's a patch in bug 596476 to fix it.
Depends on: 596476
>  do we know if the users who experience this crash run the beta again without
>  the crash, or do we even know that this is on the first run?

We're seeing a lot of repeat crashes with the same hour:minute timestamp--usually from 2-6 in a row, which suggests it may be users crashing repeatedly and then giving up. 

> the crashes for these two issues seem to both revolve around 
> 
> rv = mCacheMap.Open(mCacheDirectory)

which is calling nsDiskCacheMap::OpenBlockFiles(), which calls nsDiskCacheMap::GetBlockFileForIndex() three times (to get nsILocalFiles for _CACHE_001,2, and then 3).  I believe we kept seeing "_CACHE_001_" in the disassembly on the stack;  if true we're dying after the first call.  I'm going to write a patch for the potential segfault mentioned in comment 12 just in case that helps.
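
To make the file names concrete: the format string "_CACHE_%03d_" with index + 1, as in the GetBlockFileForIndex excerpt quoted earlier in this bug, yields one name per block file. The sketch below reproduces only that naming step. BlockFileName is a hypothetical helper, and it uses snprintf with %u rather than the original sprintf call.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper that mirrors only the naming step of the quoted code:
// block file indexes are zero-based, the names on disk are one-based.
std::string BlockFileName(unsigned index) {
    char name[32];
    std::snprintf(name, sizeof(name), "_CACHE_%03u_", index + 1);
    return std::string(name);
}
```

Index 0 maps to "_CACHE_001_", which is the string reported on the stack in the minidumps, consistent with the crash happening around the first of the three calls.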
I take it back.  The code mentioned in comment 12 already returns any error from QI, so that theory is bunk.

Will land 596476 once I get jst's (or anyone's) +r for the directory service patch.

Oh, hmm--we're still seeing crashes from the build after my cache version bump (build 20100928041914): the crash stack (and exception addr) are still the same, but the exception is now always EXCEPTION_ACCESS_VIOLATION_EXEC (before it was almost always a READ exception, with a few EXEC's thrown in).
(In reply to comment #39)
> (In reply to comment #35)
> > crashes are showing up on 0924 and 0925 builds, but its true they are still 1/2
> > the rate of 0917
> 
> I'm sorry, but we're probably looking at different data...  I tend to look at
> the link provided in comment #30, then choose the "Table" tab. I see 4 crashes
> on the 14th, 536 on the 17th, 43 on the 24th and 25 on the 25th. Am I looking
> at the wrong thing?

There are two different notions of time:  (1) the build ID, and (2) the crash date.  chofmann is saying that *for current crash dates*, half the crashes are still from the build ID of the 17th.

The data in comment 35 are a matrix showing *both* of these notions of time.  Each entry is a date-of-crash, formatted like:

date-of-crash  total-count-on-date build-id-1 crashes-that-date-on-build-id-1
               build-id-2 crashes-that-date-on-build-id-2 etc.
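
For anyone who wants to process the rows in comment 35 mechanically, here is a rough sketch of a parser. It assumes the layout actually used in those rows (crash date, total, then comma-separated "count build" pairs, with the count before the build string), and the names CrashRow, BuildCount and ParseCrashRow are hypothetical. It does not attempt to handle the wrapped continuation lines.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One "count build" pair, e.g. 7 crashes on 4.0b7pre2010091704.
struct BuildCount {
    int crashes;
    std::string build;
};

// One row of the matrix: a crash date, the total for that date, and the
// per-build breakdown.
struct CrashRow {
    std::string crashDate;
    int total = 0;
    std::vector<BuildCount> builds;
};

CrashRow ParseCrashRow(const std::string& line) {
    CrashRow row;
    std::istringstream in(line);
    in >> row.crashDate >> row.total;

    // The remainder is a comma-separated list; empty pieces (from ",," or a
    // trailing comma) fail to parse as a pair and are skipped.
    std::string piece;
    while (std::getline(in, piece, ',')) {
        std::istringstream entry(piece);
        BuildCount bc{0, ""};
        if (entry >> bc.crashes >> bc.build) {
            row.builds.push_back(bc);
        }
    }
    return row;
}
```

The build string is kept as a single opaque token because the rows concatenate the version and the build ID without a separator.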
Thanks for the clarification! So that means that e.g. on Sept. 22nd there were 136 total crashes with this signature, 66 from the 0917 build, 27 from the 0921 build, 9 from the 0922 build etc... ok. (Quite useful, I must say :) )

However, IMO it still doesn't explain why the builds after 0917 produce fewer crashes...
Well... because nightly users are stuck on the 09-17 build, I'll bet!
Why would they be stuck? In particular: if it crashes at startup, why would anyone continue using it?
They keep hitting their Minefield icon in the taskbar and then remember that it crashes. They're stuck because if we can't launch, we can't update.

Anyway, let's land the fix we know is a problem and see if this crash signature goes away.
(In reply to comment #48)
> They keep hitting their Minefield icon in the taskbar and then remember that it
> crashes. They're stuck because if we can't launch, we can't update.

How would we ever get them back? :)

Seriously: so the theory is that a number of nightly users have the 0917 build installed and cannot manage to upgrade from it? In fact, there are so many of these that the crashes they generate after 11 days (and builds) still dominate this type of crash? Counter-intuitive to me, but I'll accept it if established experience says that this is how it works...

> Anyway, let's land the fix we know is a problem and see if this crash signature
> goes away.

Definitely! :)

(In reply to comment #43)
> Oh, hmm--we're still seeing crashes from the build after my cache version bump
> (build 20100928041914): the crash stack (and exception addr) are still the
> same, but the exception is now always EXCEPTION_ACCESS_VIOLATION_EXEC (before
> it was almost always a READ exception, with a few EXEC's thrown in).

But the number of crashes did not jump? I.e. the act of re-creating the cache does not seem to be the problem (yet)?

Does anyone know what EXCEPTION_ACCESS_VIOLATION_EXEC actually means? Illegal instruction?
(In reply to comment #49)
> In fact, there are so many of
> these that the crashes they generate after 11 days (and builds) still dominate
> this type of crash?

... and, btw, they all use Cyrillic keyboards?
Landed 596476--let's see from the nightlies tomorrow if the directory service was indeed the culprit.

> what does EXCEPTION_ACCESS_VIOLATION_EXEC mean?

I believe it means a bad address was used as an instruction (instead of a read/write).   Really not sure what that means here.  A little odd given that the stack frame and addr are the same.

So far 16 crashes today with the build from last night. Hard to say if this is an improvement, as our Slavic XP user base may or may not be trying it in large numbers (some may be stuck on the build from the 17th, or have given up on nightlies, etc.)

> How would we ever get [those users] back? :)

We can let them switch to Chrome for a while, then realize FF 4 is better.
(In reply to comment #51)
> Landed 596476--let's see from the nightlies tomorrow if the directory service
> was indeed the culprit.
> 
> > what does EXCEPTION_ACCESS_VIOLATION_EXEC mean?
> 
> I believe it means a bad address was used as an instruction (instead of a
> read/write).   Really not sure what that means here.  A little odd given that
> the stack frame and addr are the same.

That's exactly what it means:
http://code.google.com/p/google-breakpad/source/browse/trunk/src/processor/minidump_processor.cc#723

If you look at:
http://crash-stats.mozilla.com/report/index/deb500a9-e87b-4f4b-926c-b0b0b2100924

The top of the stack is the crash address, yes, which means that something caused us to jump to a bad address in non-executable memory. Saved by DEP!

Interestingly, frame 1 is missing source info, which probably means it's in a "cold" block of that function. I've investigated this in the past: when VC++ does PGO, it separates functions into "hot" and "cold" blocks, and puts all the hot blocks in one set of pages and the cold ones in another. Unfortunately, VC2005 then fails to write out source line info in the PDB for the cold blocks. (VC2010 fixes this, at least.)
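For reference, the READ/WRITE/EXEC suffix in these crash reasons follows the access-type code that Windows stores in the first element of ExceptionInformation for an access violation: 0 is a read, 1 is a write, and 8 is a DEP (execute) violation. A minimal Python sketch of that classification; the helper name and the output format are illustrative assumptions, not Breakpad's actual code:

```python
# Access-type codes stored in ExceptionInformation[0] of a Windows
# EXCEPTION_ACCESS_VIOLATION record.
ACCESS_TYPES = {
    0: "READ",   # the thread tried to read an inaccessible address
    1: "WRITE",  # the thread tried to write to an inaccessible address
    8: "EXEC",   # DEP violation: the thread tried to execute from the address
}


def describe_access_violation(access_type, address):
    """Hypothetical helper: build a crash-reason string like those in the reports."""
    reason = "EXCEPTION_ACCESS_VIOLATION"
    suffix = ACCESS_TYPES.get(access_type)
    if suffix is not None:
        reason += "_" + suffix
    return "%s at 0x%08x" % (reason, address)
```

With these codes, a jump to the bad address itself (as in frame 0 of the stack) is reported as EXEC, while merely loading data from the same address is reported as READ.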
So something made us try executing from a bad address? Could this be caused by e.g. calling a method on a dangling pointer to an object?
re comment 45
> So that means that on day A there were X crashes on Y build.... . (Quite useful, I must say :) 

Bug 600534 tracks trying to get this view in the web interface of socorro
(In reply to comment #53)
> So something made us try executing from a bad address? Could this be caused by
> e.g. calling a method on a dangling pointer to an object?

I'd say no: that would require the object to have virtual methods, and nsDiskCacheBlockFile doesn't have any.  This looks to me more like stack corruption, where a RET jumps to a bad address and produces the EXEC violation.  But I'm not much of an expert in deep debugging...
Well, we're at 10 crashes so far today with the build from last night, so the directory service fix hasn't made this go away.  We're still down from the 9/17 spike, but it's hard to know what sort of prevalence we'd see if this shipped in beta7. 

Error is now back to EXCEPTION_ACCESS_VIOLATION_READ.  (Is it just me, or does the combo of same crash stack + different access error + Slavic XP only == probably some sort of malware problem?)

Will look at some minidumps of yesterday and today as soon as I can get my hands on some.
(In reply to comment #56)
>(Is it just me, or does
> the combo of same crash stack + different access error + Slavic XP only ==
> probably some sort of malware problem?)

No (i.e. it's not just you).

Could we conclude that there was no new spike after the new version-bump?
Still 31 crashes in the 2010-09-30 build.
21 of those 31 crashes for the 9/30 build are in rapid succession, so probably just a very persistent user crashing over and over at startup.

We have alas made very little headway on this bug.  Opening the crashdumps causes my copy of devstudio to load the blue screen of death, which is making it hard for me at least to get anywhere.

Given the crash levels are pretty low since 9/17 do we want to mark this betaN?
FWIW, I found this blog post in Russian: http://translate.google.com/translate?hl=ru&sl=ru&tl=en&u=http%3A%2F%2Fsibilev.net%2F%3Fp%3D3573 that describes either this bug or probably Bug 595957 (Firefox constantly crashes on startup and tries to send a message about the crash). The cause of the Firefox startup crash is a virus loaded through HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon\Userinit
Once the virus is removed, the Firefox crash is gone. A few users on our local forum report that the virus removal method described in this blog post fixed their Firefox startup crash.
Alexander,

Thanks very much for this information!

I am not clear from the translation of the blog post whether the "DrWeb" software mentioned was part of the problem (caused the virus), or was just part of an attempt to fix it.

It looks like the problem here is that a malicious program is somehow inserted into HKEY_LOCAL_MACHINE \ SOFTWARE \ Microsoft \ Windows NT \ CurrentVersion \ Winlogon \ Userinit, and then presumably run at startup.   It's not clear how it winds up affecting Firefox, but when the registry key is removed, the crashes go away.   One possibility is that the filesystem I/O syscalls are being intercepted, but presumably it could be lots of different things.

We've looked over the "interesting_modules" file for the crash, and we don't see any clear .dll file that's associated with this.  So we don't have a .dll name that we can block.   Not sure if there's anything else to do here.

At JST's behest, marking INVALID and removing as blocker, since the crash numbers have stayed low since the spike on 9/17: 

  http://crash-stats.mozilla.com/report/list?range_value=2&range_unit=weeks&signature=nsDiskCacheMap%3A%3AOpen(nsILocalFile*)&version=Firefox%3A4.0b7pre
Assignee: honzab.moz → jduell.mcbugs
Status: ASSIGNED → RESOLVED
blocking2.0: beta7+ → ---
Closed: 9 years ago
Keywords: regression
Resolution: --- → INVALID
Whiteboard: [trying a backout]
(In reply to comment #61)
> I am not clear from the translation of the blog post whether the "DrWeb"
> software mentioned was part of the problem (caused the virus), or was just part
> of an attempt to fix it.
For clarity, Dr.Web is antivirus software, popular in Russia - http://www.drweb.com/?lng=en
Duplicate of this bug: 595957
http://technet.microsoft.com/en-us/library/cc939862.aspx
Specifies the programs that Winlogon runs when a user logs on. By default, Winlogon runs Userinit.exe, which runs logon scripts, reestablishes network connections, and then starts Explorer.exe, the Windows user interface.

Think of it as like ~/.profile or something, a happy place for bad guys to ask to run really early.
we are getting 10,000-14,000 crashes per day on this (summed over the three related signatures below). let's not call it invalid, and let's try and figure out what we can do to drive those numbers lower.

date     crashes at
         HeapDestroy   bug 597960
20101001 5943
20101002 6788
20101003 5717
20101004 5288
20101005 4787
172:crashdata chofmann$ ./stacktrend.sh nsDiskCacheDevice::OpenDiskCache 201010*

date     crashes at
         nsDiskCacheDevice::OpenDiskCache bug 595957 
20101001 5503
20101002 7564
20101003 7252
20101004 6486
20101005 6324
172:crashdata chofmann$ ./stacktrend.sh nsDiskCacheMap::Open.nsILocalFile.. 201010*

date     crashes at
         nsDiskCacheMap::Open.nsILocalFile..
20101001 63
20101002 20
20101003 23
20101004 38
20101005 44
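The 10,000-14,000 figure is the per-day sum over the three signatures. A short Python sketch that adds up the counts from the tables above; the dictionary layout and function name are just for illustration:

```python
# Daily crash counts copied from the stacktrend output above.
counts = {
    "HeapDestroy": {
        "20101001": 5943, "20101002": 6788, "20101003": 5717,
        "20101004": 5288, "20101005": 4787,
    },
    "nsDiskCacheDevice::OpenDiskCache": {
        "20101001": 5503, "20101002": 7564, "20101003": 7252,
        "20101004": 6486, "20101005": 6324,
    },
    "nsDiskCacheMap::Open": {
        "20101001": 63, "20101002": 20, "20101003": 23,
        "20101004": 38, "20101005": 44,
    },
}


def daily_totals(counts):
    """Sum the per-signature counts for each day."""
    totals = {}
    for per_day in counts.values():
        for day, n in per_day.items():
            totals[day] = totals.get(day, 0) + n
    return totals


totals = daily_totals(counts)
for day in sorted(totals):
    print(day, totals[day])
```

The daily totals range from 11,155 (20101005) to 14,372 (20101002).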

I suggested that we post something on SUMO and try and drive traffic to that article with press, but cww says visits by Russian users to SUMO are low.  What are the support venues that we should be hitting?

Should we try and ramp up some press on this with instructions on how to repair?

I've sent mail to contacts at Kaspersky to maybe get them involved in blocking/repairing this malware; are there other contacts like that we should reach out to?

e-mail responder feature should be going on-line with socorro 1.7 tomorrow night. this is a good candidate to use for responding to users that add e-mails to crash reports on these three signatures.
Status: RESOLVED → REOPENED
Keywords: user-doc-needed
Resolution: INVALID → ---
(In reply to comment #65)
> I suggested that we post something on SUMO and try and drive traffic to that
> article with press, but cww says visits by Russian users to SUMO are low.  
> What are the support venues that we should be hitting?
I guess it's possible to make the crash reporter detect that Firefox has crashed several times on startup. Once it has detected that, the crash reporter could launch another browser (most probably Internet Explorer) and open a SUMO article in it explaining what to do when Firefox cannot be launched.

Some pitfalls:
1) Most viruses block user access to the web sites of antivirus companies. They could easily block access to SUMO too. We could ship the SUMO web page bundled with the browser, but that would hurt distribution size.
2) Some computers don't have another browser installed (thanks to the European Union browser choice initiative). Thankfully Russia is not part of the EU.
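The detection step proposed here could be as simple as counting recent crashes that happened within a few seconds of launch. A minimal Python sketch, purely illustrative: the function name, the thresholds and the (crash time, uptime) record format are all assumptions, and the real crash reporter does not necessarily keep data in this form.

```python
def is_startup_crash_loop(crashes, max_uptime=5, min_crashes=3, window=600):
    """Hypothetical check for a startup crash loop.

    crashes: list of (crash_time_seconds, uptime_seconds) tuples, any order.
    Returns True if at least `min_crashes` crashes with uptime <= `max_uptime`
    fall within `window` seconds of each other.
    """
    # Keep only crashes that happened right after launch, oldest first.
    startup_times = sorted(t for t, uptime in crashes if uptime <= max_uptime)
    # Look for min_crashes consecutive startup crashes inside one window.
    for i in range(len(startup_times) - min_crashes + 1):
        if startup_times[i + min_crashes - 1] - startup_times[i] <= window:
            return True
    return False
```

Only when such a check fires would the reporter fall back to opening the help article in another browser, so a single isolated crash would not trigger it.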
yeah, we can't ensure that we will reach users with any of these channels; but we need to try!

The impact of this is probably far greater than the 10,000-14,000 users I mentioned above.  That's the number of users that crash per day from the buggy malware.  We will probably run into more as we find more signatures related to this problem.   The larger problem might be for users where the malware runs as designed and does not crash and the system is compromised.
(In reply to comment #67)
> The larger problem might be for users where the malware runs
> as designed and does not crash and the system is compromised.
Well, Firefox already has a safe browsing feature, thanks to Google. It would be helpful to add some basic virus/malware detection to Firefox that could indicate that the system has been compromised (maybe some antivirus company would be interested). At least Mozilla developers wouldn't waste their precious time on bugs caused by malware.
Though I guess this discussion doesn't belong in this bug.
(In reply to comment #67)
> The larger problem might be for users where the malware runs
> as designed and does not crash and the system is compromised.
Do we have a name for the malware concerned at this point? e.g. could we get a copy from Dr.Web for sandboxed analysis?
Alexander L. Slovesnik: where do most Russians go for tech support (and by extension, where do they go for tech support with Firefox?)  Is the front page of http://mozilla-russia.org/ a good place for a notice about this issue?  You seem to have a much more active community than we do.

FWIW, since it's a startup crash, the primary driver of traffic to SUMO -- the built-in Help button -- is not usable so we should be looking at messaging in other places.

I have no sense for the scale of this problem either... is it bad enough that we should try to send an official notice to the Technology ministry in Russia? Should we try to get Microsoft to release a security update to address this? Comment 67 makes it seem like a much larger percentage of users are affected than the crash reports we see.
translated version of the support forum at http://mozilla-russia.org/

http://translate.google.com/translate?js=n&prev=_t&hl=en&ie=UTF-8&layout=2&eotf=1&sl=ru&tl=en&u=http%3A%2F%2Fmozilla-russia.org%2F

shows the symptoms of this bug 

translated post name                                      posts  views

Mozilla Firefox will not start - Chara [ 1 2 3 4 ]         80 	24660
firefox does not run a report of an unexpected error - axe  14 	1195
Permanent fall browser - KReoN                       	    8 	154
(In reply to comment #70)
> Alexander L. Slovesnik: where do most Russians go for tech support (and by
> extension, where do they go for tech support with Firefox?)  Is the front page
> of http://mozilla-russia.org/ a good place for a notice about this issue?  You
> seem to have a much more active community than we do.
The Russian Mozilla forum is http://forum.mozilla-russia.org/. I've created a post about this issue in our local FAQ at http://forum.mozilla-russia.org/viewtopic.php?id=46369
However, malware removal is a very tricky business and I'm reluctant to turn the Mozilla support forum into a malware removal support forum. Antivirus company support and specialized forums are better qualified to deal with malware issues.

> I have no sense for the scale of this problem either... is it bad enough that
> we should try to send an official notice to the Technology ministry in Russia?
FWIW, it's not only a Russian problem. On http://crash-stats.mozilla.com/report/list?signature=nsDiskCacheDevice::OpenDiskCache%28%29 there are some comments in Italian and German. 

> Should we try to get Microsoft to release a security update to address this?
> Comment 67 makes it seem like a much larger percentage of users are affected
> than the crash reports we see.
There is nothing that indicates that this is a Microsoft issue.
Microsoft, as part of monthly security updates, pushes out a malware scanner... I don't know if it works really well but if chofmann is right, this is affecting tons of users (who are not crashing) and causing loss of personal data and we should leverage whatever resources we can to help them.

Another question: Is there anything you think Mozilla should do to help?  You probably have a better sense of your locale than we do and I'd be happy to do what we can.  However, you are much better qualified to say what steps/outreach is necessary.
(In reply to comment #73)
> Microsoft, as part of monthly security updates, pushes out a malware scanner...
> I don't know if it works really well but if chofmann is right, this is
> affecting tons of users (who are not crashing) and causing loss of personal
> data and we should leverage whatever resources we can to help them.
Unfortunately, a lot of users disable Microsoft Update on pirated Windows installations.
 
> Another question: Is there anything you think Mozilla should do to help?  You
> probably have a better sense of your locale than we do and I'd be happy to do
> what we can.  However, you are much better qualified to say what 
> steps/outreach is necessary.
I've posted a kind of plan in comment 66. Additionally, Mozilla could contact antivirus companies (http://translate.google.com/translate?js=n&prev=_t&hl=en&ie=UTF-8&layout=2&eotf=1&sl=ru&tl=en&u=http%3A%2F%2Fwww.anti-malware.ru%2Frussian_antivirus_market_2009_2010 shows some stats on the antivirus market in Russia) to ask them for any data on the Firefox start-up crash issue. I guess they can correlate Firefox crash statistics with malware spread statistics.
The plan in comment 66 is a good long term idea but would minimally require a new version of Firefox to work (and maybe a lot of work in the socorro backend).  Is there anything that we can do without making changes to Firefox?
yes,  https://bugzilla.mozilla.org/show_bug.cgi?id=585593 outlines the plan for changes going into socorro that will allow finding the crash signatures like in this bug and the two other related bugs, then pulling e-mail address where users provided them, then e-mailing with instructions on how to avoid the crash they just hit.

this won't require any changes to firefox.

the message that we construct for the e-mail ought to have information in Russian and English, and it sounds like maybe Italian and German too, with links to instructions on how to avoid the crash in each of these languages.
(In reply to comment #76)
> yes,  https://bugzilla.mozilla.org/show_bug.cgi?id=585593 outlines the plan for
> changes going into socorro that will allow finding the crash signatures like in
> this bug and the two other related bugs, then pulling e-mail address where
> users provided them, then e-mailing with instructions on how to avoid the crash
> they just hit.
Can you estimate the percentage of users who have provided their e-mail addresses in crash reports? Are we talking about 1%, 10% or 90%?
yeah, the projections for the number of users that we can reach with this technique are low, but it's still one more tool to get the word out.

some quick checks indicate that we might be able to reach just over 1,000 users per day that are hitting these crashes.  Here is a sample from oct 6

HeapDestroy
6319 reports -  no e-mail provided
 516 yes, have e-mail address

nsDiskCacheDevice::OpenDiskCache
8269 no e-mail provided
 549 yes, have e-mail

nsDiskCacheMap::Open.nsILocalFile..
68  no e-mail

this is probably a good bug to test the rollout of the e-mail responder system.
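To put the oct 6 sample in terms of the percentage asked about earlier, here is a quick Python calculation. The e-mail count for nsDiskCacheMap::Open is not listed in the sample and is assumed to be 0 here; the variable names are illustrative only.

```python
# Oct 6 sample from the comment above: (reports without e-mail, reports with e-mail).
sample = {
    "HeapDestroy": (6319, 516),
    "nsDiskCacheDevice::OpenDiskCache": (8269, 549),
    # Assumption: no "with e-mail" figure was given for this signature, so use 0.
    "nsDiskCacheMap::Open": (68, 0),
}

reachable = sum(with_email for _, with_email in sample.values())
total = sum(no_email + with_email for no_email, with_email in sample.values())
share = 100.0 * reachable / total

print("reachable by e-mail: %d of %d reports (%.1f%%)" % (reachable, total, share))
```

That is 1,065 reachable reports out of 15,721, or a little under 7%, which matches the "just over 1,000 per day" estimate.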
I'm no expert in runtime C++ and only use Windows if I'm forced to, but would it be possible to add exception handling (possibly for Windows only) in appropriately coarse-grained places in the code which loads/bootstraps Firefox modules? Just to catch stuff like this and pop up some reasonable message?
not like that. we don't know if a library is poisoning our process and running away, or if a process is attacking our process, or if a kernel driver is ruining us.

there's also another minor detail... a rogue piece of code could hurt any random file i/o, not just the one we pick.

ignoring that, assuming the process actually does care about us, this is a losing battle.
Depends on: 585593
still currently running at about ten thousand crashes per day on
Bug 597960 - crash under Windows XP [@ HeapDestroy ] mainly on start-up 

Plus another 8,000 per day with the nsDiskCacheDevice::OpenDiskCache.. signature 

plus another 100 or so per day on this signature would bring the total to 19,000 crashes per day of the crash reports we process.
I'm not seeing anything at 10,000 crashes a day on http://crash-stats.mozilla.com/products/Firefox/versions/4.0b8pre - where are we seeing this volume?
this is one of several bugs where we are affected by the same possible malware, which spans all releases.  this particular signature applies only to trunk, so it's low volume. one of the bugs is duped against this bug, so I figured we were concentrating comments here.  maybe we should spin up a tracking bug to cover common stats and attributes of all the bugs.  Here is the first comment for the tracking bug

this bug's stats.

date     tl crashes at, count build, count build, ...
         nsDiskCacheMap::Open.nsILocalFile..
20101020 33  12 4.0b7pre^\2010100204, 
                10 4.0b8pre^\2010101804, 6 4.0b8pre^\2010102004, 
                2 4.0b8pre^\2010101904, 1 4.0b8pre^\2010101104, 
                1 4.0b8pre^\2010100704, 1 4.0b7pre^\2010100304, 
20101021 60  53 4.0b7pre^\2010100204, 
                2 4.0b4^\2010081813, 2 3.6.10^\2010091412, 
                1 4.0b8pre^\2010101604, 1 4.0b8pre^\2010100904, 
                1 3.6.11^\2010101211, 
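Each row in these listings is a date, a total, and then "count version^\buildid" entries. A small Python sketch that parses the entries and checks that they add up to the stated total; the regular expression and function name are assumptions based only on the format shown here:

```python
import re

# One entry looks like "12 4.0b7pre^\2010100204": count, version, build ID.
ENTRY = re.compile(r"(\d+) ([^\s^]+)\^\\(\d+)")


def parse_builds(text):
    """Return (count, version, build_id) tuples found in a stats row."""
    return [(int(n), version, build) for n, version, build in ENTRY.findall(text)]


# The 20101020 row for this bug, copied from the listing above.
row = r"""20101020 33  12 4.0b7pre^\2010100204,
                10 4.0b8pre^\2010101804, 6 4.0b8pre^\2010102004,
                2 4.0b8pre^\2010101904, 1 4.0b8pre^\2010101104,
                1 4.0b8pre^\2010100704, 1 4.0b7pre^\2010100304,
"""

builds = parse_builds(row)
print(sum(count for count, _, _ in builds))
```

For this row the per-build counts sum to 33, the total given at the start of the row.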


Bug 595957 - Sept 10-12, Spike in Firefox Crashes for Russian Users [@ nsDiskCacheDevice::OpenDiskCache() ]

date     tl crashes at, count build, count build, ...
         nsDiskCacheDevice::OpenDiskCache..
20101020 4063  2173 3.6.10^\2010091412, 
                343 4.0b6^\2010091408, 260 3.0.19^\2010031422, 
                150 3.6.11^\2010101211, 149 3.6^\2010011514, 
                141 3.5.13^\2010091413, 125 3.5.5^\2009110215, 
                89 3.6.3^\2010040108, 72 3.6.8^\2010072215, 
                59 3.0b5^\2008032620, 55 3.0.1^\2008070208, 
                47 4.0b4^\2010081813, 47 4.0b2^\2010072019, 
                31 3.6.9^\2010082415, 25 3.0.5^\2008120122, 
         <releases where volume is less than 30 crashes per day snipped>

Bug 597960 - crash under Windows XP [@ HeapDestroy ] 

date     tl crashes at, count build, count build, ...
         HeapDestroy
20101020 17985  7901 3.6.10^\2010091412, 
                1517 3.5.13^\2010091413, 901 3.6.8^\2010072215, 
                844 3.6.11^\2010101211, 762 4.0b6^\2010091408, 
                751 3.6^\2010011514, 522 3.6.3^\2010040108, 
                503 3.0.19^\2010031422, 334 3.0.6^\2009011913, 
                297 3.5.6^\2009120122, 283 3.6.6^\2010062523, 
                222 3.5.5^\2009110215, 201 3.5.3^\2009082410, 
                184 3.5.2^\2009072922, 183 3.7a1pre^\2009082804, 
                148 3.0.1^\2008070208, 137 4.0b7pre^\2010100204, 
                131 3.0.5^\2008120122, 111 3.0^\2008052906, 
                101 4.0b4^\2010081813, 96 4.0a1pre^\2008051003, 
    <releases where volume is under 100 per day snipped>
Similar crash in thunderbird. All are win XP.
bp-0d4166c4-99dc-42c3-a130-2e3e42101109
"Opens up again but breaks down right with the 1st click. Firefox doesn't even open anymore."
0		@0xf195b58c	
1	thunderbird.exe	nsDiskCacheMap::OpenBlockFiles	netwerk/cache/src/nsDiskCacheMap.cpp:617
2	thunderbird.exe	nsDiskCacheMap::Open	netwerk/cache/src/nsDiskCacheMap.cpp:155
3	thunderbird.exe	nsDiskCacheDevice::OpenDiskCache	netwerk/cache/src/nsDiskCacheDevice.cpp:896
4	thunderbird.exe	nsDiskCacheDevice::Init	netwerk/cache/src/nsDiskCacheDevice.cpp:374
5	thunderbird.exe	nsCacheService::CreateDiskDevice	netwerk/cache/src/nsCacheService.cpp:966
6	thunderbird.exe	nsCacheService::SearchCacheDevices	netwerk/cache/src/nsCacheService.cpp:1362
7	thunderbird.exe	nsCacheService::ActivateEntry	netwerk/cache/src/nsCacheService.cpp:1271
8	thunderbird.exe	nsCacheService::ProcessRequest	netwerk/cache/src/nsCacheService.cpp:1151
9	thunderbird.exe	nsCacheService::OpenCacheEntry	netwerk/cache/src/nsCacheService.cpp:1236
10	thunderbird.exe	nsCacheSession::OpenCacheEntry	netwerk/cache/src/nsCacheSession.cpp:98
11	thunderbird.exe	nsHttpChannel::OpenCacheEntry	netwerk/protocol/http/src/nsHttpChannel.cpp:1832 

bp-ec0241d2-852e-42fe-b18d-b13fe2101110 (e.biehl)
bp-45bb010c-5f03-4f11-b2cb-3e5022101111 (g.birkle)
We'd like to use this as a test pilot for reaching out to users suffering from a crash where there's a known workaround but not a fix in place in Firefox. 

Based on the Russian forum thread, here's my attempt to translate the instructions to English. Can anyone confirm that this is an accurate translation (and clarification)?

1. Open regedit (click Start, then Run..., and then type "regedit" and press Enter).
2. Locate the key: HKEY_LOCAL_MACHINE \ SOFTWARE \ Microsoft \ Windows NT \ CurrentVersion \ Winlogon.
3. Find the entry called "Userinit". It should only have the value of "C:\WINDOWS\system32\userinit.exe". If there is a comma and more text after it, this is a virus. Remember the part after the comma, which might look like this: "C:\WINDOWS\system32\3abcde04.exe".
4. Open My Computer and navigate to the folder containing the virus. In the example above, this is "C:\Windows\system32".
5. Completely remove the virus file by selecting it ("3abcde04.exe" in the example above) and pressing the Delete key while holding down the Shift key.
6. Go back to regedit and remove the extra part of the "Userinit" entry so that it contains only "C:\WINDOWS\system32\userinit.exe".
7. Restart the computer.
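The check in step 3 can be sketched in Python as plain string handling on the Userinit value. Reading the value from the registry is left out so the sketch runs anywhere. The function name is hypothetical, and treating a trailing comma with nothing after it as harmless is an assumption:

```python
def extra_userinit_entries(value, expected=r"C:\WINDOWS\system32\userinit.exe"):
    """Return the entries in a Userinit value other than the expected userinit.exe.

    The value is a comma-separated list of programs; anything besides the
    expected path is suspicious and worth inspecting by hand.
    """
    entries = [entry.strip().strip('"') for entry in value.split(",")]
    # Drop empty pieces (e.g. from a trailing comma) and the expected program.
    return [e for e in entries if e and e.lower() != expected.lower()]


clean = r"C:\WINDOWS\system32\userinit.exe,"
infected = r"C:\WINDOWS\system32\userinit.exe,C:\WINDOWS\system32\3abcde04.exe"

print(extra_userinit_entries(clean))
print(extra_userinit_entries(infected))
```

An empty result means the value matches the expected form. Anything returned corresponds to "the part after the comma" in step 3 and should be checked before deleting files.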
(In reply to comment #85)
> Based on the Russian forum thread, here's my attempt to translate the
> instructions to English. Can anyone confirm that this is an accurate
> translation (and clarification)?
Translation looks good.
Crash Signature: [@ nsDiskCacheMap::Open(nsILocalFile*) ]
It's now a low volume crash: only 11 crashes in 8.0 over the last week.
Keywords: topcrash
Summary: spike in 4.0b7pre start-up crash under Windows XP [@ nsDiskCacheMap::Open(nsILocalFile*) ] → Start-up crash under Windows XP [@ nsDiskCacheMap::Open(nsILocalFile*) ]
Crash Signature: [@ nsDiskCacheMap::Open(nsILocalFile*) ] → [@ nsDiskCacheMap::Open(nsILocalFile*) ] [@ nsDiskCacheMap::Open ]
zero examples with nsDiskCacheMap::Open in signature in the past week for any version
Status: REOPENED → RESOLVED
Closed: 9 years ago → 4 years ago
Resolution: --- → WORKSFORME