Closed Bug 630420 Opened 13 years ago Closed 13 years ago

Cold startup is sometimes slow due to race condition of cache lock

Categories

(Core :: Networking: Cache, defect)

x86
Windows Vista
defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 670911
Tracking Status
blocking2.0 --- -

People

(Reporter: m_kato, Assigned: bjarne)

References

Details

(Whiteboard: [ts])

Attachments

(5 files, 1 obsolete file)

Since UI thread is pending to get lock of cache, startup time sometimes spends 30s on cold startup.

ENV
===
2011-01-30 nightly

UI thread
=========
0:021> ~0k
ChildEBP RetAddr
003c9b40 77e08dcd ntdll!ZwWaitForSingleObject+0x15
003c9ba4 77e08d98 ntdll!RtlpWaitOnCriticalSection+0x155
003c9bcc 6956b3a7 ntdll!RtlEnterCriticalSection+0x152
003c9bdc 664a864f nspr4!PR_Lock+0x17
003c9bf4 666221bd xul!nsCacheService::OpenCacheEntry+0x2d
003c9c10 6649dfbb xul!nsCacheSession::AsyncOpenCacheEntry+0x19
003c9c9c 6649de6f xul!nsHttpChannel::OpenNormalCacheEntry+0xc1
003c9d84 664a8d1c xul!nsHttpChannel::OpenCacheEntry+0x111
003c9dbc 664ad05a xul!nsHttpChannel::Connect+0x12c
003c9dd4 667109f8 xul!nsHttpChannel::AsyncOpen+0x1aa
003c9df0 665613f7 xul!NS_InvokeByIndex_P+0x27
003ca16c 68553d1f xul!XPC_WN_CallMethod+0x747

This lock is ...

0:000> dt -r2 nspr4!PRLock 08f21b30
   +0x000 links            : PRCListStr
 :
 :
   +0x01c ilock            : _MDLock
      +0x000 mutex            : _RTL_CRITICAL_SECTION
         +0x000 DebugInfo        : 0x001dcfc8 _RTL_CRITICAL_SECTION_DEBUG
         +0x004 LockCount        : -6
         +0x008 RecursionCount   : 1
         +0x00c OwningThread     : 0x00001244
         +0x010 LockSemaphore    : 0x0000046c
         +0x014 SpinCount        : 0
      +0x018 notified         : _MDNotified
         +0x000 length           : 0
         +0x004 cv               : [6] <unnamed-tag>
         +0x04c link             : (null)

Owner of lock

0:021> ~19k 100
ChildEBP RetAddr
0985f2d4 74e11e4d ntdll!ZwOpenFile+0x12
0985f31c 74e1431f NTMARTA!I_MartaFileNtOpenFile+0x4d
0985f360 74e14171 NTMARTA!MartaFindNextFile+0xc5
0985f3a0 74e14422 NTMARTA!MartaUpdateTree+0x293
0985f3f0 74e13d74 NTMARTA!MartaUpdateTree+0x1e2
0985f448 74e134cd NTMARTA!MartaManualPropagation+0x31c
0985f4f8 772a59a4 NTMARTA!AccRewriteSetNamedRights+0x207
0985f528 6646bcb9 ADVAPI32!SetNamedSecurityInfoW+0x4f
0985f720 6661ed31 xul!nsLocalFile::CopySingleFile+0x16b
0985f980 6667b323 xul!nsLocalFile::CopyMove+0xfb
0985f994 66845899 xul!nsLocalFile::MoveTo+0x15
0985f9e0 6667f05f xul!nsLocalFile::MoveToNative+0x1e5dd8
0985fa88 6684ab0b xul!GetTrashDir+0x73
0985faa0 666d9d19 xul!nsDiskCacheDevice::OpenDiskCache+0x18b371
0985fabc 666d9cac xul!nsDiskCacheBindery::Init+0x14
0985fad0 6662d55b xul!nsCacheService::CreateDiskDevice+0x69
0985fae0 6662243d xul!nsCacheService::SearchCacheDevices+0x68
0985fb00 6662230e xul!nsCacheService::ActivateEntry+0x9b
0985fb34 66621e49 xul!nsCacheService::ProcessRequest+0x32
0985fb48 66552ea6 xul!nsProcessRequestEvent::Run+0x26
0985fbb4 664e994c xul!nsThread::ProcessNextEvent+0x266
0985fbdc 6956bdd9 xul!nsThread::ThreadFunc+0x8c
0985fbfc 6956e05d nspr4!_PR_NativeRunThread+0x169
0985fc04 6ea22c28 nspr4!pr_root+0xd
0985fc3c 6ea22cb6 MOZCRT19!_callthreadstartex+0x48
0985fc44 76ceeccb MOZCRT19!_threadstartex+0x66
0985fc50 77e4d24d kernel32!BaseThreadInitThunk+0xe
0985fc90 77e4d45f ntdll!__RtlUserThreadStart+0x23
0985fca8 00000000 ntdll!_RtlUserThreadStart+0x1b
blocking2.0: --- → ?
To be clear I nominated this to block because it can cause cold-startup times in the tens of seconds "randomly" (depending on the size of the user's cache, presumably, and possibly on alignment of the moon and whatnot; I don't know when we decide to do that MoveTo call in the OpenDiskCache code).
And I'm assuming, btw, that the problem is that we're copying lots of data around or something... and locking around that operation, which means we block the UI thread when it tries to also lock during startup.
Don't think it blocks ship, but we definitely want to fix this!
blocking2.0: ? → -
Whiteboard: [ts]
Who should own this bug?

/be
Fwiw, the call seems to originate here

http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheDevice.cpp#922

going to

http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDeleteDir.cpp#81
http://mxr.mozilla.org/mozilla-central/source/xpcom/io/nsLocalFileWin.cpp#3013
http://mxr.mozilla.org/mozilla-central/source/xpcom/io/nsLocalFileWin.cpp#1767

and I guess

http://mxr.mozilla.org/mozilla-central/source/xpcom/io/nsLocalFileWin.cpp#1592

There is a discussion about this in bug #231300, and in particular bug #231300, comment #19 has a constructive proposal pointing to bug #179641. We might want to reopen the latter, assuming that the set of platforms we care about has changed since 2006? (I'm no Windows-guy and could not do this myself, though.) Whether to this sync or async is discussed in bug #321833.
Whatever Windows system calls we use, this bug is about holding a mutex across any filesystem operation. That's a bug on its face. We should never block a thread waiting for another thread to do I/O by camping (possibly competitive-spinning up to some limit) on a mutex. If we have to wait, we should use a condition variable.

But we should not make cold startup wait. Better to come up with the cache off until the operation completes. That seems entirely doable.

Who will take this bug? Bjarne?

/be
Wearing my dogmatic glasses (or contacts, actually) I perfectly agree with you on the point of not holding a mutex across filesystem operations. :)  However, I was taking the pragmatic approach since our cache violates this pretty much all over the place anyway...

I can look at this - need to find a nice way to simulate a slow MoveTo since I'm still no windows-guy; could probably be achieved with a timer or sth.

An approach is to post an event in nsCacheService::Init() to create the disk-device at init-time instead of creating it lazily everywhere (effectively making if "off" until it's ready). Does that sound reasonable? Should we do this also to the offline-device?
Dogmatic, meet pragmatic. Trying to fix this bug by speeding up the filesystem operation a bit won't really cut it. We need to get rid of the mutex-based synch.

While we surely need a cache overhaul, we almost always win with piece-wise improvements. So event-delayed init sounds good to me so long as it does not hit startup time, but I haven't looked at our disk cache code in ages. Boris, Jason, Taras: any thoughts?

/be
Attached patch Proposed solution (obsolete) — Splinter Review
Removing lazy-creation and just post an event to create the disk device. |mInitialized| is set by the event and prevents all usage of cache until it's true.

With this solution we crash into bug #622357 and I'll add the dependency.

Pushed to tryserver.
Assignee: nobody → bjarne
Attachment #509798 - Flags: review?(bzbarsky)
Use this if you want 15s delay in initializing the disk-cache on cold start.
When cache is not available, we are likely to hit bug #622357. Adding dependency as described in comment #9.
Depends on: 622357
(In reply to comment #8)
> While we surely need a cache overhaul, we almost always win with piece-wise
> improvements. So event-delayed init sounds good to me so long as it does not
> hit startup time, but I haven't looked at our disk cache code in ages. Boris,
> Jason, Taras: any thoughts?

Yes, got a rant. Our cache model is broken which is visible when disk io gets expensive. The browser should never have to wait on cache. 

Chrome has the same problem(except it suffers from more fs-fragmentation). We sheepishly wait on disk cache assuming that it's always faster than network. This might've been true with dialup. 

With slow or overloaded disks (or slow flash memory) disk-io often causes more latency than. We should restructure our networking code to access cache in a completely async way and spawn network requests + race-to-completion if the cache doesn't get back soon enough.

The proposed patch seems to create the cache on a thread which seems like a step in the right direction. I'm not sure if this means we'd also parse our big CACHE_* files on a separate thread(which we should).
Comment on attachment 509798 [details] [diff] [review]
Proposed solution

What makes sure that mDiskDevice is not null in EvictEntriesForClient and VisitEntries?
Tryserver-run fails consistently for xpcshell-tests - will have a look sometime over the weekend.
(In reply to comment #13)
> What makes sure that mDiskDevice is not null in EvictEntriesForClient

Line 1230 in the patch..?

> and
> VisitEntries?

Good point - will fix.

A more serious issue with this is that some tests seem to assume that cache is ready to rock immediately after retrieving a cache-session. This is no longer true, and I maybe a notification about the cache being ready is needed. We may also have to indicate this change somehow - e.g. by renaming CreateSession() to AsyncCreateSession() (possibly just add the latter method and make sure the browser uses it) and at least document it properly. An alternative may be to modify all tests using the cache to poll until it's ready, but that may hit add-ons etc...?

Input on this topic is appreciated: Is there a preferred style/approach to such issues or am I free to put together something I'm happy with?
> Line 1230 in the patch..?

Ah, indeed.

> Is there a preferred style/approach to such issues 

Check with jduell?
Ok - issue is that for "normal" operation we're ok with cache being off until it's ready, but a number of existing tests rely on cache being ready and functional when needed, and will fail randomly if this changes.

Current thinking is to

1) send a message via the notification-service when the cache is ready
2) for xpcshell-tests, add a method do_initiate_cache() or similar which loads the cache-service and blocks until the notification is received. xpcshell-tests have to use this method to safely use the cache (potentially we can just do this in the setup of an xpcshell-test but that may hit performance)
3) for mochitests, load cache and wait for notification in test-setup

Identifying xpcshell-tests using the cache is simple since they have to load the cache-service explicitly. It's a little harder to identify such mochitests, which is why I'd prefer to do this in the test-setup (I might be wrong here, though).

Jason, any input? Or anyone else, of-course? Dogmatic or pragmatic - all input is appreciated...
Is there any reason to allow to create a cache session when the cache service isn't initialized completely? I think that CreateSession() should return NS_ERROR_NOT_INITIALIZED in this case. Caller could then use a new method AsyncCreateSession() that you've proposed...


Anyway, I'd say that there could be much easier solution. Described hang is IMO possible only when opening a cache entry. We don't care about OpenCacheEntry() because we shouldn't use it anywhere and if we do, we should fix it. AsyncOpenCacheEntry() is supposed to post the request to the background thread and it should be quite easy to get rid of the lock in nsCacheService::OpenCacheEntry() in case it is called from nsCacheSession::AsyncOpenCacheEntry().
(In reply to comment #18)
> Is there any reason to allow to create a cache session when the cache service
> isn't initialized completely?

I was thinking about that also for some time but concluded that there is no reason *not* to allow it. IMO we won't improve anything by changing this since (as you also state below) the problem occurs when we open a cache-entry, not when we create a session.

> Anyway, I'd say that there could be much easier solution. Described hang is IMO
> possible only when opening a cache entry. We don't care about OpenCacheEntry()
> because we shouldn't use it anywhere and if we do, we should fix it.
> AsyncOpenCacheEntry() is supposed to post the request to the background thread
> and it should be quite easy to get rid of the lock in
> nsCacheService::OpenCacheEntry() in case it is called from
> nsCacheSession::AsyncOpenCacheEntry().

Keep in mind that the issue here is user experience. The lock is held while preparing the disk-device, which is necessary because it modifies shared datastructures. In order to get anything useful out of the cache we must obtain this lock at some point. It will not change the user-experience if all channels return immediately from AsyncOpenCacheEntry but hang on another thread waiting for the cache-lock - the content won't be loaded in either case.

But it's probably worth changing locking in nsCacheService::OpenCacheEntry in any case (not only when called from nsCacheSession::AsyncOpenCacheEntry). The lock is only needed AFAICS if ProcessRequest is called.
(In reply to comment #19)
> But it's probably worth changing locking in nsCacheService::OpenCacheEntry in
> any case (not only when called from nsCacheSession::AsyncOpenCacheEntry). The
> lock is only needed AFAICS if ProcessRequest is called.

Created bug #633146 for this.
This adds a method to head.js and modifies relevant xpcshell-tests to use the method. Note that xpcshell-tests modified by this patch will function correctly without any other code-changes relevant to this bug.
Attachment #511730 - Flags: review?(michal.novotny)
This changes TestRunner to wait for cache to initialize before each test. I also made it clear the cache prior to running each test, but that can be avoided.

I believe this is the right place to do this, but I'm open to suggestions. Feel free to push review to anyone else more familiar with how the mochitests work.

This patch does not need any other changes related to this bug in order to work.
Attachment #511731 - Flags: review?(bzbarsky)
I decided against using a notification since it was simple to make the tests work without it. Adding the notification "cache has been initialized" could be done in a separate bug, if anyone needs it.

nsCacheService::EvictEntries() now throws if the cache is uninitialized, which may have to be mentioned in the idl...
Attachment #509798 - Attachment is obsolete: true
Attachment #511733 - Flags: review?(michal.novotny)
Attachment #509798 - Flags: review?(bzbarsky)
Comment on attachment 511730 [details] [diff] [review]
Part 1: prepare xpcshell tests

Oh - documentation for xpcshell-tests should probably also mention this method and its usage.
Finally, as mentioned in comment #9, this stuff makes fixing bug #622357 very relevant...
Comment on attachment 511731 [details] [diff] [review]
Part 2: Prepare Mochitests

Robert, can you review this?
Attachment #511731 - Flags: review?(bzbarsky) → review?(sayrer)
Comment on attachment 511733 [details] [diff] [review]
Part 3: Code-changes to create disk-device on background thread

nsCreateDiskDeviceEvent is posted even if the disk cache is disabled which fails in CreateDiskDevice(). And since all lazy initialization attempts are removed the disk cache cannot be enabled and used if it was disabled at startup.

(In reply to comment #19)
> (In reply to comment #18)
> > Is there any reason to allow to create a cache session when the cache service
> > isn't initialized completely?
> 
> I was thinking about that also for some time but concluded that there is no
> reason *not* to allow it. IMO we won't improve anything by changing this since
> (as you also state below) the problem occurs when we open a cache-entry, not
> when we create a session.

I.e. we would allow to create a session but we couldn't use it because the cache service is not yet initialized... That doesn't make sense to me.

Do we know if we are blocked by renaming/moving of a single directory, or if we end up moving every single file in the cache? We should investigate and fix this too. Anyway in both cases we could e.g. introduce a cache directory versioning. Then we could first create a new cache directory (e.g. Cache1) and then delete the old directory on a background thread without holding the lock. This could be a sufficient solution.

Anyway if we really want to avoid using the cache during the disk device initialization (but I'm not sure about it) we should IMO limit this only to the disk cache and allow using the memory and offline cache. I.e. just change the patch so that cache service is initialized and fully usable even without the disk device and change the disk device initialization code so that it doesn't need to hold the cache lock. In this case only requests with STORE_ON_DISK policy would fail until the disk device is initialized.
Attachment #511733 - Flags: review?(michal.novotny) → review-
(In reply to comment #27)
> nsCreateDiskDeviceEvent is posted even if the disk cache is disabled which
> fails in CreateDiskDevice(). And since all lazy initialization attempts are
> removed the disk cache cannot be enabled and used if it was disabled at
> startup.

Ahh - good catch! Will fix...

> I.e. we would allow to create a session but we couldn't use it because the
> cache service is not yet initialized... That doesn't make sense to me.

Fair enough.

> Do we know if we are blocked by renaming/moving of a single directory, or if we
> end up moving every single file in the cache?

It seems like in this case we use some sort of backup-solution where each file is moved separately. See stack in comment #0 and comment #5.

> We should investigate and fix
> this too. Anyway in both cases we could e.g. introduce a cache directory
> versioning. Then we could first create a new cache directory (e.g. Cache1) and
> then delete the old directory on a background thread without holding the lock.

I don't disagree - it would be a pragmatic approach to this issue (see last sentence in comment #5). However, see Brendans remarks in comment #6 and comment #8.

Moreover, even if we do apply versioning and delete a corrupted cache-dir off main-thread we first need to determine that it *is* corrupt, which involves reading quite a lot of information blocking the main-thread. Hence, handling this by initializing off main-thread is also be seen as a step towards getting rid of IO on main-thread.

> Anyway if we really want to avoid using the cache during the disk device
> initialization (but I'm not sure about it) we should IMO limit this only to the
> disk cache and allow using the memory and offline cache. I.e. just change the
> patch so that cache service is initialized and fully usable even without the
> disk device

This could be done, yes. It might also fix a few other issues I see.
So, if we are changing how the cache service can be used, we aren't going to take this for 2.0, right?
(In reply to comment #28)
> > Do we know if we are blocked by renaming/moving of a single directory, or if we
> > end up moving every single file in the cache?
> 
> It seems like in this case we use some sort of backup-solution where each file
> is moved separately. See stack in comment #0 and comment #5.

Hmm, from what I see in Windows version of nsLocalFile::CopyMove() and the stack in comment #0 this is a moving of a single directory and not the fallback enumeration. It's interesting that it can block for so long time.


> > We should investigate and fix
> > this too. Anyway in both cases we could e.g. introduce a cache directory
> > versioning. Then we could first create a new cache directory (e.g. Cache1) and
> > then delete the old directory on a background thread without holding the lock.
> 
> I don't disagree - it would be a pragmatic approach to this issue (see last
> sentence in comment #5). However, see Brendans remarks in comment #6 and
> comment #8.

I see, but I think that the IO would be quite quick in this case. And together with fix in bug 633146 there wouldn't be any IO blocking the main thread.


> Moreover, even if we do apply versioning and delete a corrupted cache-dir off
> main-thread we first need to determine that it *is* corrupt, which involves
> reading quite a lot of information blocking the main-thread. Hence, handling
> this by initializing off main-thread is also be seen as a step towards getting
> rid of IO on main-thread.

This would happen automatically off the main thread thanks to bug 633146. Even if we implement solution described below we should minimize the time when the disk cache isn't available due to its initialization.


> > Anyway if we really want to avoid using the cache during the disk device
> > initialization (but I'm not sure about it) we should IMO limit this only to the
> > disk cache and allow using the memory and offline cache. I.e. just change the
> > patch so that cache service is initialized and fully usable even without the
> > disk device
> 
> This could be done, yes. It might also fix a few other issues I see.

Also please note that removing of lazy initialization will lead to increased startup times (in comparison with solution above) in case when nobody wants to use it because there will always be some slowdown of the system response due to IO even if it proceeds on a different thread.
(In reply to comment #30)
> And together
> with fix in bug 633146 there wouldn't be any IO blocking the main thread.
  [ snip ]
> This would happen automatically off the main thread thanks to bug 633146.

Bug #633146 just eliminates one unnecessary lock/unlock pair from the code-path taken by main-thread when requesting/opening a cache-entry. Although this is good, the page-load still waits for all its channels because they are suspended waiting for their cache-entries to be found and returned. These requests for cache-entries are queued up on the cache-io thread *behind* the call to nsDiskCacheDevice::Init() and must wait until it finishes, even if it doesn't lock the cache-service. Thus, it may take some time before the channels start loading, and the browser may seem frozen. (Dogmatically speaking we are fine because we don't block main-thread on a mutex, but the user-experience is pretty much the same...)

One way (the proposed patch) to avoid this is to make sure channels do *not* end up waiting for cache-entries in this situation. The proposed implementation does this by returning failure from the call to OpenCacheEntry(), which means that the channel loads and displays the content without a cache-entry, independently of initializing the disk device.

Another approach is to let the cache-service work temporarily without disk-cache (my interpretation of comment #27 - I'm working on a patch). I.e. give each channel a cache-entry, but not a disk-cache entry. However, this means we have to use another thread to initialize the disk-device because otherwise, as mentioned above, initialization blocks later events on the cache-io thread and hangs the page-load (not because of locking, but because initialization simply takes time to finish).
Status: NEW → ASSIGNED
(In reply to comment #31)
> (In reply to comment #30)
> > And together
> > with fix in bug 633146 there wouldn't be any IO blocking the main thread.
>   [ snip ]
> > This would happen automatically off the main thread thanks to bug 633146.
> 
> Bug #633146 just eliminates one unnecessary lock/unlock pair from the code-path
> taken by main-thread when requesting/opening a cache-entry. Although this is
> good, the page-load still waits for all its channels because they are suspended
> waiting for their cache-entries to be found and returned. These requests for
> cache-entries are queued up on the cache-io thread *behind* the call to
> nsDiskCacheDevice::Init() and must wait until it finishes, even if it doesn't
> lock the cache-service. Thus, it may take some time before the channels start

You've snipped the substantial part of my comment. I meant that bug 633146 together with reversed order of deleting and creating of the cache directory would be good solution.


> loading, and the browser may seem frozen. (Dogmatically speaking we are fine
> because we don't block main-thread on a mutex, but the user-experience is
> pretty much the same...)

I can't agree that the user experience is the same. Frozen UI is unacceptable whereas delayed load could be IMO acceptable (it depends on the delay).
Cache service available without the disk device as described in comment #31 last paragraph.

It doesn't seem necessary to modify tests for this to work. However, we may want to do that anyway for mochitests (along the lines of part 2 above) to ensure a reproducible environment in which tests are run. For xpcshell-tests, we just need to worry about those few who use the method do_load_profile().

Pushed to tryserver...
Attachment #512485 - Flags: review?(michal.novotny)
(In reply to comment #32)
> (In reply to comment #31)
> > (Dogmatically speaking we are fine
> > because we don't block main-thread on a mutex, but the user-experience is
> > pretty much the same...)
> 
> I can't agree that the user experience is the same. Frozen UI is unacceptable
> whereas delayed load could be IMO acceptable (it depends on the delay).

Right. The "dogma" (let's please drop this term) is based precisely on the goal of avoiding starvation, or of preserving responsiveness and other kinds of availability. It's not a blind-faith thing!

/be
Ok - let's drop undesired terms. ;)

> I can't agree that the user experience is the same. Frozen UI is unacceptable
> whereas delayed load could be IMO acceptable (it depends on the delay).

I see your point. But if I start a browser and nothing interesting appears on the screen for 20-30 seconds, I'd probably kill it (I'm an impatient guy, like many other people). Maybe we could combine this with a "please wait while i clean up your cache"-notification? I.e. a third approach could be 1) land bug #633146, 2) reverse-order deleting and creating cache directory and 3) inform user with a notification so that she doesn't kill the browser while waiting.

I'd be happy for input to which approach people would prefer here:

1) run with no cache until disk-device is ready
2) run with cache but without disk-device until its ready
3) approach suggested above
(In reply to comment #35)
> I see your point. But if I start a browser and nothing interesting appears on
> the screen for 20-30 seconds, I'd probably kill it (I'm an impatient guy, like
> many other people). Maybe we could combine this with a "please wait while i

Would you really kill the app if it is responsible, but the pages just don't load? I would check the connectivity instead :)

Seriously, I wouldn't recommend the solution if I thought there would be 20-30 second delay. We should detect broken cache really quickly. In almost all cases the reason is that the browser crashed and the cache wasn't closed properly. This we know after reading nsDiskCacheHeader (i.e. 276 bytes). So if we at this point first create a new disk cache that can be used immediately, there would be virtually no delay.

I'm not against some combined solution, but I think that we should implement the above anyway.

Input from others will be appreciated.
If the primary goal here is to keep GUI responsive, applying the fix for bug #633146 takes care of that.

In order to speed up content loading for cold startup (so that impatient guys like myself doesn't kill the browser or hurt their back crawling under the desk to check cables ;) ) the two first approaches from comment #35 does this job, and Michals suggestion of first create new then later delete old cache-dir will most likely also work. Alternatively we could re-open (or at least investigate!) bug #179641 as suggested in comment #5. (I cannot do the latter, though.)
Comment on attachment 512485 [details] [diff] [review]
Approach 2: Working cache, temporarily wo/ disk device

> +    nsShutdownThreadEvent(nsCOMPtr<nsIThread> thr)

This should be nsIThread *
Or because this is a single purpose event you could move NS_GetCurrentThread() to the constructor instead of passing the thread as an parameter.


> +   // Make sure current thread is shut down when this event has finished
> +   // (unless it is the cache-io thread). Posting this event here ensures
> +   // that it is run before the cache-io thread is shut down, which we
> +   // want in order to control the lifecycle of this thread

I would prefer to not post nsInitDiskDeviceEvent to the cache-io thread. I.e. if NS_NewThread() in AsyncInitializeDiskDevice() fails we won't initialize the disk device.

And does this really work reliably? Although it is very unlikely, it is possible, that the cache is shut down before nsInitDiskDeviceEvent::Run() is called, isn't it?


> +#if DEBUG
> +    printf("###\n");
> +    printf("### mDiskDevice->Init() failed (0x%.8x)\n", rv);
> +    printf("###    - disabling disk cache for this session.\n");
> +    printf("###\n");
> +#endif

I know that this came from the old code, but please change the printf to CACHE_LOG_ERROR or NS_ERROR.


> +   // TODO remove this one?? How can this actually happen..?
>     if (this == nsnull)  return NS_ERROR_NOT_AVAILABLE;

This could happen if the method were called on the null pointer. Something like:
((nsCacheService*)0)->CreateSession();
Not sure if this is really needed.


> +   if (mEnableDiskDevice && mDiskDeviceInitialized) {
> +       nsresult rv = NS_OK;
>         if (mDiskDevice)
>             rv = mDiskDevice->EvictEntries(clientID);
>         if (NS_FAILED(rv)) res = rv;
>     }

You can simplify it. mDiskDevice should be always valid when mDiskDeviceInitialized is true.

if (mEnableDiskDevice && mDiskDeviceInitialized) {
    nsresult rv;
    rv = mDiskDevice->EvictEntries(clientID);
    if (NS_FAILED(rv)) res = rv;
}


> -nsCacheService::CreateDiskDevice()
> +nsCacheService::AsyncInitializeDiskDevice(PRBool deleteOnFailure)
>  {
>  #ifdef NECKO_DISK_CACHE
> -    if (!mInitialized)      return NS_ERROR_NOT_AVAILABLE;
> -    if (!mEnableDiskDevice) return NS_ERROR_NOT_AVAILABLE;
> -    if (mDiskDevice)        return NS_OK;
> -
> -    mDiskDevice = new nsDiskCacheDevice;
> -    if (!mDiskDevice)       return NS_ERROR_OUT_OF_MEMORY;
> +    if (!mDiskDevice)
> +        return NS_ERROR_NOT_AVAILABLE;

This means that the disk device is initialized always, even if it is disabled. This is wrong.


> +    if (gService->mDiskDevice)
> +        gService->AsyncInitializeDiskDevice(PR_FALSE);
>  #endif // !NECKO_DISK_CACHE

The same as above. This would initialize the disk device regardless on mEnableDiskDevice.


> +   mDiskDeviceInitialized = PR_FALSE;

Why is this necessary (in nsCacheService::AsyncInitializeDiskDevice)?
Attachment #512485 - Flags: review?(michal.novotny)
I'll happily listen to advise on questions raised in comment #37 before re-working the patch to match review.
Is this still an issue or should we drop it?
(In reply to comment #40)
> Is this still an issue or should we drop it?

Yes.  Today this occurs.

When I analyzed this on reproduced environment, this root cause seems to be SetNamedSecurityInfoW is too slow.

Although nsDiskCacheDevice::OpenDiskCache moves cache to "Trash", nsLocalFile::CopySingleFile copies ACL for all.  So nsDiskCacheDevice::OpenDiskCache sometimes be bad performance.
Comment on attachment 511731 [details] [diff] [review]
Part 2: Prepare Mochitests

I'm not the right reviewer for this. Over to Josh for dispatch.
Attachment #511731 - Flags: review?(sayrer) → review?(joshmoz)
Attachment #511731 - Flags: review?(joshmoz) → review?(michal.novotny)
Comment on attachment 511731 [details] [diff] [review]
Part 2: Prepare Mochitests

Waiting for the cache service in the while loop is an ugly hack. I still do think that the patch #511733 is a wrong solution. And I would also guess that patch in bug #670911 fixed this bug too. I.e. there will probably still be some delay during the start after the crash, but now it should be just few seconds instead of 30s or more.
Attachment #511731 - Flags: review?(michal.novotny) → review-
Attachment #511730 - Flags: review?(michal.novotny) → review-
Yes, the delays seen in this bug were from deleting the cache on NTFS, which has been fixed by bug 670911.  So there should be no more 30s delays to be blocking on.

WONTFIX?  I'll leave that call up to Michal, in case there are other cases where we could block, and might be able not to with some extra coding.  But it doesnt sound like it.
Actually I guess i'd mark this a DUPE of 670911, if Michal doesn't see anything else to do here.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: