Last Comment Bug 695003 - Race condition when deleting cache directory
: Race condition when deleting cache directory
Status: RESOLVED FIXED
[inbound] [may need patch for Firefox...
:
Product: Core
Classification: Components
Component: Networking: Cache (show other bugs)
: Trunk
: All All
: P1 major (vote)
: mozilla11
Assigned To: Michal Novotny (:michal)
:
: Patrick McManus [:mcmanus]
Mentors:
Depends on: 681407
Blocks: 701909 705761
  Show dependency treegraph
 
Reported: 2011-10-17 08:31 PDT by Michal Novotny (:michal)
Modified: 2011-12-09 11:44 PST (History)
11 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
patch v1 (10.56 KB, patch)
2011-10-25 18:05 PDT, Michal Novotny (:michal)
no flags Details | Diff | Splinter Review
patch v2 (23.04 KB, patch)
2011-11-02 17:46 PDT, Michal Novotny (:michal)
no flags Details | Diff | Splinter Review
patch v3 (23.30 KB, patch)
2011-11-03 16:45 PDT, Michal Novotny (:michal)
no flags Details | Diff | Splinter Review
patch v4 (24.24 KB, patch)
2011-11-04 10:19 PDT, Michal Novotny (:michal)
jduell.mcbugs: review+
Details | Diff | Splinter Review
patch v5 (24.12 KB, patch)
2011-11-29 03:05 PST, Michal Novotny (:michal)
jduell.mcbugs: review+
Details | Diff | Splinter Review
Add 3 telemetry vars (applies on top of main patch) (3.53 KB, patch)
2011-12-02 02:18 PST, Jason Duell [:jduell] (needinfo me)
michal.novotny: review+
Details | Diff | Splinter Review

Description Michal Novotny (:michal) 2011-10-17 08:31:15 PDT
This is a follow-up bug of #681407. We access the trash directory from multiple threads without any synchronization. In fact there is almost guaranteed race condition when the disk cache is cleared. First the cache directory is renamed to cache.trash in nsDiskCacheDevice::ClearDiskCache() and an event to delete this trash is posted to a background thread. Then we check if there is any trash directory in nsDiskCacheDevice::OpenDiskCache() and we try to delete it again.

#0  DeleteDir (dirIn=0xe8258f00, moveToTrash=0, sync=0, delay=0) at /opt/moz/hg/netwerk/cache/nsDeleteDir.cpp:69
#1  0x013ea2d2 in nsDiskCacheDevice::OpenDiskCache (this=0xe9de44e0) at /opt/moz/hg/netwerk/cache/nsDiskCacheDevice.cpp:1045
#2  0x013e8adf in nsDiskCacheDevice::Init (this=0xe9de44e0) at /opt/moz/hg/netwerk/cache/nsDiskCacheDevice.cpp:424
#3  0x013ea39d in nsDiskCacheDevice::ClearDiskCache (this=0xe9de44e0) at /opt/moz/hg/netwerk/cache/nsDiskCacheDevice.cpp:1070
#4  0x013e9fc8 in nsDiskCacheDevice::EvictEntries (this=0xe9de44e0, clientID=0x0)
    at /opt/moz/hg/netwerk/cache/nsDiskCacheDevice.cpp:978
#5  0x013dce75 in nsCacheService::EvictEntriesForClient (this=0xe9852ca0, clientID=0x0, storagePolicy=0)
    at /opt/moz/hg/netwerk/cache/nsCacheService.cpp:1214
#6  0x013dd1ee in nsCacheService::EvictEntries (this=0xe9852ca0, storagePolicy=0)
    at /opt/moz/hg/netwerk/cache/nsCacheService.cpp:1326
Comment 1 Michal Novotny (:michal) 2011-10-25 18:05:44 PDT
Created attachment 569572 [details] [diff] [review]
patch v1

This patch changes:

- always create an unique trash directory
- find all trashes (and do a delayed delete) from previous run during disk cache initialization
- sync deleting of the trash in nsDiskCacheDevice::Shutdown() is no longer needed
- fixes non-working delayed deleting

I planned to implement serialized removing of the directories, but it isn't worth the complexity. The case when we are removing two or more really big directories is rare or almost impossible.
Comment 2 Bjarne (:bjarne) 2011-10-26 06:12:25 PDT
Comment on attachment 569572 [details] [diff] [review]
patch v1

Review of attachment 569572 [details] [diff] [review]:
-----------------------------------------------------------------

Looks ok. Any idea of how this will behave if we shut down while the delete-event is pending?
Comment 3 Michal Novotny (:michal) 2011-10-26 10:27:00 PDT
On shutdown Firefox doesn't wait until deleting process finishes. The current cache code doesn't handle it either. I got some warnings while running the test. I'm not sure what exactly it means and if we should care.

WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /opt/moz/hg/xpcom/base/nsTraceRefcntImpl.cpp, line 172
WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /opt/moz/hg/xpcom/base/nsTraceRefcntImpl.cpp, line 172

(gdb) bt
#0  AssertActivityIsLegal () at /opt/moz/hg/xpcom/base/nsTraceRefcntImpl.cpp:172
#1  0x0287ab35 in NS_LogRelease_P (aPtr=0xf34d1030, aRefcnt=2, aClazz=0x359e298 "nsStringBuffer")
    at /opt/moz/hg/xpcom/base/nsTraceRefcntImpl.cpp:1045
#2  0x02893cfc in nsStringBuffer::Release (this=0xf34d1030) at /opt/moz/hg/xpcom/string/src/nsSubstring.cpp:189
#3  0x02897def in ReleaseData (data=0xf34d1038, flags=65541) at /opt/moz/hg/xpcom/string/src/nsSubstring.cpp:118
#4  0x02895d34 in nsACString_internal::Finalize (this=0xe6afcddc) at /opt/moz/hg/xpcom/string/src/nsTSubstring.cpp:188
#5  0x011a65fe in nsACString_internal::~nsACString_internal (this=0xe6afcddc, __in_chrg=<value optimized out>)
    at ../../dist/include/nsTSubstring.h:113
#6  0x011a6803 in nsCString::~nsCString (this=0xe6afcddc, __in_chrg=<value optimized out>) at ../../dist/include/nsTString.h:55
#7  0x011a6861 in nsFixedCString::~nsFixedCString (this=0xe6afcddc, __in_chrg=<value optimized out>)
    at ../../dist/include/nsTString.h:413
#8  0x011a68b1 in nsCAutoString::~nsCAutoString (this=0xe6afcddc, __in_chrg=<value optimized out>)
    at ../../dist/include/nsTString.h:474
#9  0x02850340 in nsDirEnumeratorUnix::Init (this=0xe56090d0, parent=0xe5676300, resolveSymlinks=false)
    at /opt/moz/hg/xpcom/io/nsLocalFileUnix.cpp:185
#10 0x02852897 in nsLocalFile::Remove (this=0xe5676300, recursive=true) at /opt/moz/hg/xpcom/io/nsLocalFileUnix.cpp:974
#11 0x02852988 in nsLocalFile::Remove (this=0xe5676280, recursive=true) at /opt/moz/hg/xpcom/io/nsLocalFileUnix.cpp:988
#12 0x02852988 in nsLocalFile::Remove (this=0xe5676200, recursive=true) at /opt/moz/hg/xpcom/io/nsLocalFileUnix.cpp:988
#13 0x02852988 in nsLocalFile::Remove (this=0xe5676100, recursive=true) at /opt/moz/hg/xpcom/io/nsLocalFileUnix.cpp:988
#14 0x02852988 in nsLocalFile::Remove (this=0xe5676180, recursive=true) at /opt/moz/hg/xpcom/io/nsLocalFileUnix.cpp:988
#15 0x02852988 in nsLocalFile::Remove (this=0xe5676080, recursive=true) at /opt/moz/hg/xpcom/io/nsLocalFileUnix.cpp:988
#16 0x02852988 in nsLocalFile::Remove (this=0xe577f980, recursive=true) at /opt/moz/hg/xpcom/io/nsLocalFileUnix.cpp:988
#17 0x02852988 in nsLocalFile::Remove (this=0xe5677600, recursive=true) at /opt/moz/hg/xpcom/io/nsLocalFileUnix.cpp:988
#18 0x02852988 in nsLocalFile::Remove (this=0xe604e780, recursive=true) at /opt/moz/hg/xpcom/io/nsLocalFileUnix.cpp:988
#19 0x02852988 in nsLocalFile::Remove (this=0xe5677c00, recursive=true) at /opt/moz/hg/xpcom/io/nsLocalFileUnix.cpp:988
#20 0x02852988 in nsLocalFile::Remove (this=0xea822e80, recursive=true) at /opt/moz/hg/xpcom/io/nsLocalFileUnix.cpp:988
#21 0x012a819a in DeleteDirThreadFunc (arg=0xef9e73e0) at /opt/moz/hg/netwerk/cache/nsDeleteDir.cpp:60
#22 0x00ceb923 in _pt_root (arg=0xe5644ed0) at /opt/moz/hg/nsprpub/pr/src/pthreads/ptthread.c:187
#23 0x00a3ae99 in start_thread (arg=0xe6afdb70) at pthread_create.c:301
#24 0x001e7d2e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:133
Comment 4 Bjarne (:bjarne) 2011-10-27 05:19:32 PDT
Intuitively I'd say we should handle this gracefully but I'm not sure how. What's the state of the cache if this happens? How does it look like for the user?

The reason I raised this issue in the first place is that we delay removal of potentially a number of large directories - this opens a time-window in which the browser actually may be shut down. (Not sure if this is likely to happen very often though.) If the cache stays intact and we don't scare the living daylight out of the user, I guess it's ok...  (who can provide guidelines to this?)

What about mobile? How will this be handled by a device, and how does it work if the browser is suspended while the remove-event is pending?
Comment 5 Michal Novotny (:michal) 2011-10-30 03:42:31 PDT
(In reply to Bjarne (:bjarne) from comment #4)
> Intuitively I'd say we should handle this gracefully but I'm not sure how.
> What's the state of the cache if this happens? How does it look like for the
> user?

These warnings show up only when the deleting is in progress. The cache is in consistent state. Just the trash isn't deleted completely and will be deleted on the next start. We could handle it so that we would postpone the shutdown until deleting finishes but it could take a lot of time. This is a problem for example when user restarts FF after installing a new addon because during the shutdown user doesn't see any window and has no idea what's happening.


> The reason I raised this issue in the first place is that we delay removal
> of potentially a number of large directories - this opens a time-window in
> which the browser actually may be shut down. (Not sure if this is likely to
> happen very often though.) If the cache stays intact and we don't scare the
> living daylight out of the user, I guess it's ok...  (who can provide
> guidelines to this?)

There is no problem when FF is shut down during the delay window (except that we leave the trash on the disk). The trash will be deleted on the next start. The same as above apply, IMO it is much better to leave the trash on the disk than postpone the shutdown.


> What about mobile? How will this be handled by a device, and how does it
> work if the browser is suspended while the remove-event is pending?

I have no idea.
Comment 6 Bjarne (:bjarne) 2011-10-31 03:22:35 PDT
Comment on attachment 569572 [details] [diff] [review]
patch v1

Review of attachment 569572 [details] [diff] [review]:
-----------------------------------------------------------------

If the cache is consistent and the user doesn't see it too much, as you indicate above, I agree with you about being pragmatic about the warnings.

From the code-pow I'll r+ this, however I would prefer if you request a sr for some input on the mobile issue, or at least try it yourself on a mobile to get some idea of how it will work.

Clearing review-flag until this is resolved.
Comment 7 Jason Duell [:jduell] (needinfo me) 2011-10-31 07:43:45 PDT
cc'ing bsmedberg in case he can make sense of the XPCOM error msg in comment 3.  I can't figure out why 1) a stringbuffer is using NS_LOG_RELEASE when AFAICT it's not an XPCOM object, nor 2) how it considers this a static destructor on a call chain that seems to be coming from a nsCAutoString declared on the stack (in nsDirEnumeratorUnix::Init).

Re: mobile: I can only assume that if the browser is suspended by the OS while we're deleting an old cache dir, we'll pause while that happens, and either resume later, or if we get killed by the OS, recover as usual (i.e. delete the rest of the directory tree when we start up the next time).  Do we have any reason to think otherwise?  What's the concern exactly?
Comment 8 Benjamin Smedberg [:bsmedberg] 2011-10-31 07:57:29 PDT
The error message is generated if there are any XPCOM calls after the final NS_LogTerm. This protects against unitialized XPCOM usage and crashes caused by static destructors and exactly this kind of code.

stringbuffers are refcounted, NS_LOG_RELEASE is perfectly appropriate.

You need to shut down your threads before the end of xpcom-shutdown-threads, see https://wiki.mozilla.org/XPCOM_Shutdown. This may just mean setting a flag and abandoning the deletion at the next file (walking the directories manually instead of using recursive=true).

Alternately you could avoid all XPCOM usage on this thread and just "abandon" it during shutdown, but that doesn't seem necessary in this case.
Comment 9 Bjarne (:bjarne) 2011-10-31 08:23:51 PDT
(In reply to Jason Duell (:jduell) from comment #7)
> What's the concern exactly?

The concern is that we don't seem to have any clue of how this will work on mobile, and that we might want to at least get an idea or an informed opinion about it, keeping in mind that there is no test in the patch. Michal: if it's possible to set up a test, please do so (although I don't really see how to do this in practice).
Comment 10 Michal Novotny (:michal) 2011-11-02 17:46:02 PDT
Created attachment 571523 [details] [diff] [review]
patch v2

- deleting is now stopped on shutdown
- sync deleting was removed since it is no longer needed
Comment 11 Michal Novotny (:michal) 2011-11-02 17:48:35 PDT
Pushed to try server https://tbpl.mozilla.org/?tree=Try&rev=a5647efa21d9
Comment 12 Jason Duell [:jduell] (needinfo me) 2011-11-02 22:24:46 PDT
Comment on attachment 571523 [details] [diff] [review]
patch v2

Review of attachment 571523 [details] [diff] [review]:
-----------------------------------------------------------------

Michal,

This patch definitely improves things.  I think I found some issues that need to be fixed first, though.  

It's probably worth fixing them before bjarne looks at it, so I've cleared his review for now.  Bjarne: feel free to ignore me and look at it now if you like :)

::: netwerk/cache/nsDeleteDir.cpp
@@ +67,5 @@
> +    : mThread(thread)
> +  {}
> +  NS_IMETHOD Run()
> +  {
> +    mThread->Shutdown();

As discussed below I think this needs to look like

   if (mTimers.Count() == 0 && mThread) {
      mThread->Shutdown();
      mThread = nsnull;
   }

(I don't *think* you need to lock to check mTimers.Count at this point--this only gets called if background thread sees mTimers was empty, and we're just trying to catch if main thread added any timers in the meantime.)

@@ +123,5 @@
> +      timer->GetClosure(&arg);
> +      delete static_cast<nsCOMArray<nsIFile> *>(arg);
> +    }
> +
> +    thread.swap(gInstance->mThread);

Since you grab mThread here, you need to check if mThread is null in nsDestroyThreadEvent::Run.  Run also needs to set mThread to null so we don't delete it here again.

@@ +127,5 @@
> +    thread.swap(gInstance->mThread);
> +  }
> +
> +  if (thread) {
> +    MutexAutoLock lock(gInstance->mLock);

Any reason why you're releasing the lock then grabbing it right away again here?  If so add comment explaining why.  AFAICT you could hold the lock the whole time and release it at either mCondVar.wait, or the end of the block if (!thread)

@@ +129,5 @@
> +
> +  if (thread) {
> +    MutexAutoLock lock(gInstance->mLock);
> +
> +    nsCOMPtr<nsIRunnable> event = new nsBlockOnBackgroundThreadEvent();

Add comment:

 // dispatch event and wait for it to run and notify us, so we know thread
 // has completed all work and can be shutdown

@@ +140,5 @@
> +    rv = gInstance->mCondVar.Wait();
> +    thread->Shutdown();
> +  }
> +
> +  delete gInstance;

Nit: we only have one call site for this function, but if we ever added more, there'd be the possibility of calling delete gInstance twice (for instance if 1st call was waiting for mConVar when 2nd call enters). Maybe just add a static bool deleting = false, have 1st called set it to true immediately, and return otherwise?

@@ +145,5 @@
> +  return NS_OK;
> +}
> +
> +nsresult
> +nsDeleteDir::InitThread()

I suggest we rename "EnsureThread" since for many calls thread will already exist.

@@ +164,5 @@
> +  return NS_OK;
> +}
> +
> +void
> +nsDeleteDir::DestroyThread()

Since much of the time this function won't actually destroy the thread (whenever mTimers.Count !=0, i.e. there is more work to do), I suggest we rename it "MaybeDestroyThread".

@@ +170,5 @@
> +  if (!mThread)
> +    return;
> +
> +  if (mTimers.Count())
> +    return;

Add comment above timers.Count() check:

// more work to do, so don't delete thread.

Also, AFAICT you've got a race condition here.  You're checking mTimers.Count() from the non-main thread, then sending a msg to the main thread to do the actual deletion unconditionally.  But by the time the event runs on the main thread, a different DeleteDir call may have added something to mTimers.  So I think you need to do the mTimers.Count check again in nsDestroyThreadEvent::Run.   (If you think I'm wrong about this, at least add an assertion in Run to make sure I'm not wrong :)

@@ +201,5 @@
> +    gInstance->RemoveDir((*dirList)[i], &shuttingDown);
> +  }
> +
> +  {
> +    MutexAutoLock lock(gInstance->mLock);

style nit: Since you only call DestroyThread from this location, you could also put lock inside DestroyThread to avoid extra brace scope here.  

I really wish we had a "EnsureLocked(lock)" function.

@@ +440,5 @@
> +  // No check for errors to remove as much as possible
> +
> +  MutexAutoLock lock(mLock);
> +  if (mShutdownPending)
> +    *shuttingDown = true;

Can mShutDownPending be moved to top of function, so that we exit sooner when it gets set?

::: netwerk/cache/nsDeleteDir.h
@@ +59,5 @@
> +  static nsresult Shutdown();
> +
> +  /**
> +   * This routine attempts to delete a directory that may contain some files
> +   * that are still in use. This later point is only an issue on Windows and

s/later/latter/

::: netwerk/cache/nsDiskCacheDevice.cpp
@@ +420,5 @@
>      if (NS_FAILED(rv))
>          return rv;
> +
> +    nsDeleteDir::RemoveOldTrashes(mCacheDirectory);
> +

Hmm, so this does a bunch of file IO on main thread right at startup (enumerates Cache directory to find any .Trash directories to delete).  Given how precious startup time is, I think we should instead post an event to the background DeleteDir thread to do the RemoveOldTrashes check later (30-60 seconds later?).  I also think we should do RemoveOldTrashes only once--the first time Init is called, at startup, and not when it's called later by the user calling ClearCache. 

This means that we might try to delete the same directory twice (if at startup we determine that "Cache" is corrupt, we rename it immediately and then post a timer to delete, but RemoveOldTrashes might see it and delete it first).  But that's harmless--nsIFile.Remove will just return an error (which we ignore), AFAICT.   The only bad side-effect is that we'd report two telemetry timings for DeleteDir (one very short).  We could avoid that in practice by setting a long timer for RemoveOldTrashes (120 secs?) and a shorter one for DeleteDir calls (60 secs), so that the vast majority of the time RemoveOldTrashes won't see anything except directories that were lying around from the last run.  We might still get an artificially short telemetry number if the user deletes their cache during the 2nd minute that the browser is open (RemoveOldTrashes might delete it first), but I don't think we need to worry about that.

Or we could try to be fancier about this, and guarantee a directory is only attempted to be deleted once (and only counted once toward telemetry).  I think the following would work:

  DeleteDir:
    // ex: if renaming Cache -> Cache.Trash.456
     { Lock
       AlreadyDeletingList.Add("Cache.Trash.456")
     }
     Rename(Cache, Cache.Trash.456)
     Schedule timer callback to do actual deletion

  RemoveOldTrashes:
    List = enumerateDirectory;
    for (dir in List)
       { Lock
         if dir in AlreadyDeletingList
            continue;
       }
       if matches "Trash" 
            // I'm ok with not counting enumerateDirectory as part of the
            // telemetry for deleteDir.  It's the actual directory deletion
            // that's going to be expensive.

            Telemetry::Timer
            Remove(dir)

  DeleteDir Timer callback ("TimerCallBack" in patch):
    Delete(dir)
    { Lock
      AlreadyDeletingList.Remove(dir)
    }

I'm ok with either approach (use long timer to make bad telemetry unlikely, or use locked AlreadyDeletingList to make sure it's always just right, but we don't count time to enumerate directory).
Comment 13 Bjarne (:bjarne) 2011-11-03 01:28:31 PDT
(In reply to Jason Duell (:jduell) from comment #12)
> It's probably worth fixing them before bjarne looks at it, so I've cleared
> his review for now.  Bjarne: feel free to ignore me and look at it now if
> you like :)

I wouldn't dream of ignoring your advice Jason. :) Moreover, you seem to be at least as capable as myself to review this so I'd be fine with you doing it.
Comment 14 neil@parkwaycc.co.uk 2011-11-03 01:37:17 PDT
For those people who expect to use Clear Private Data to clear the cache on shutdown, will this mean that their files will still exist on the disk until the next restart?
Comment 15 Michal Novotny (:michal) 2011-11-03 16:45:10 PDT
Created attachment 571837 [details] [diff] [review]
patch v3

(In reply to Jason Duell (:jduell) from comment #12)
> ::: netwerk/cache/nsDeleteDir.cpp
> @@ +67,5 @@
> > +    : mThread(thread)
> > +  {}
> > +  NS_IMETHOD Run()
> > +  {
> > +    mThread->Shutdown();
> 
> As discussed below I think this needs to look like
> 
>    if (mTimers.Count() == 0 && mThread) {
>       mThread->Shutdown();
>       mThread = nsnull;
>    }

This is completely different class, see below...


> @@ +170,5 @@
> > +  if (!mThread)
> > +    return;
> > +
> > +  if (mTimers.Count())
> > +    return;
> 
> Add comment above timers.Count() check:
> 
> // more work to do, so don't delete thread.

Added.


> Also, AFAICT you've got a race condition here.  You're checking
> mTimers.Count() from the non-main thread, then sending a msg to the main
> thread to do the actual deletion unconditionally.  But by the time the event
> runs on the main thread, a different DeleteDir call may have added something
> to mTimers.  So I think you need to do the mTimers.Count check again in
> nsDestroyThreadEvent::Run.   (If you think I'm wrong about this, at least
> add an assertion in Run to make sure I'm not wrong :)

There is no race condition. nsDeleteDir::DestroyThread() is called under the lock, so no other timer can be added to the mTimers. Member mThread is nulled out just after posting nsDestroyThreadEvent so no other event can be posted to it. mThread variable in nsDestroyThreadEvent() is a member of the event class.

If there is a new request for deleting a directory between calls to nsDeleteDir::DestroyThread() and to nsDestroyThreadEvent::Run() then a new thread is created by nsDeleteDir::InitThread(). This could seem like an inefficient handling of resources but please note that the probability that this happens is IMO 0.00000...
 

> @@ +123,5 @@
> > +      timer->GetClosure(&arg);
> > +      delete static_cast<nsCOMArray<nsIFile> *>(arg);
> > +    }
> > +
> > +    thread.swap(gInstance->mThread);
> 
> Since you grab mThread here, you need to check if mThread is null in
> nsDestroyThreadEvent::Run.  Run also needs to set mThread to null so we
> don't delete it here again.

This was already explained above.


> @@ +127,5 @@
> > +    thread.swap(gInstance->mThread);
> > +  }
> > +
> > +  if (thread) {
> > +    MutexAutoLock lock(gInstance->mLock);
> 
> Any reason why you're releasing the lock then grabbing it right away again
> here?  If so add comment explaining why.  AFAICT you could hold the lock the
> whole time and release it at either mCondVar.wait, or the end of the block
> if (!thread)

Right, these blocks can be merged. Fixed.


> @@ +129,5 @@
> > +
> > +  if (thread) {
> > +    MutexAutoLock lock(gInstance->mLock);
> > +
> > +    nsCOMPtr<nsIRunnable> event = new nsBlockOnBackgroundThreadEvent();
> 
> Add comment:
> 
>  // dispatch event and wait for it to run and notify us, so we know thread
>  // has completed all work and can be shutdown

Added.


> @@ +140,5 @@
> > +    rv = gInstance->mCondVar.Wait();
> > +    thread->Shutdown();
> > +  }
> > +
> > +  delete gInstance;
> 
> Nit: we only have one call site for this function, but if we ever added
> more, there'd be the possibility of calling delete gInstance twice (for
> instance if 1st call was waiting for mConVar when 2nd call enters). Maybe
> just add a static bool deleting = false, have 1st called set it to true
> immediately, and return otherwise?

This class is supposed to be initialized at cache startup and destroyed at cache shutdown. Anyway, I've added an assertion to detect such misuse...


> @@ +145,5 @@
> > +  return NS_OK;
> > +}
> > +
> > +nsresult
> > +nsDeleteDir::InitThread()
> 
> I suggest we rename "EnsureThread" since for many calls thread will already
> exist.
> 
> @@ +164,5 @@
> > +  return NS_OK;
> > +}
> > +
> > +void
> > +nsDeleteDir::DestroyThread()
> 
> Since much of the time this function won't actually destroy the thread
> (whenever mTimers.Count !=0, i.e. there is more work to do), I suggest we
> rename it "MaybeDestroyThread".

I was thinking about this but didn't do it because in fact most of the calls will initialize/destroy the thread.


> @@ +201,5 @@
> > +    gInstance->RemoveDir((*dirList)[i], &shuttingDown);
> > +  }
> > +
> > +  {
> > +    MutexAutoLock lock(gInstance->mLock);
> 
> style nit: Since you only call DestroyThread from this location, you could
> also put lock inside DestroyThread to avoid extra brace scope here.

I could but I wanted to be consistent with InitThread().
  

> @@ +440,5 @@
> > +  // No check for errors to remove as much as possible
> > +
> > +  MutexAutoLock lock(mLock);
> > +  if (mShutdownPending)
> > +    *shuttingDown = true;
> 
> Can mShutDownPending be moved to top of function, so that we exit sooner
> when it gets set?

I wanted to have it only at one place to limit the usage of the lock and having it after the file removal is IMO the best place. The worst that can happen is that we traverse from the top directory to the first file or empty directory and we exit after it is removed. I think this is OK since our cache tree is quite flat.


> ::: netwerk/cache/nsDeleteDir.h
> @@ +59,5 @@
> > +  static nsresult Shutdown();
> > +
> > +  /**
> > +   * This routine attempts to delete a directory that may contain some files
> > +   * that are still in use. This later point is only an issue on Windows and
> 
> s/later/latter/

Fixed.


> ::: netwerk/cache/nsDiskCacheDevice.cpp
> @@ +420,5 @@
> >      if (NS_FAILED(rv))
> >          return rv;
> > +
> > +    nsDeleteDir::RemoveOldTrashes(mCacheDirectory);
> > +
> 
> Hmm, so this does a bunch of file IO on main thread right at startup
> (enumerates Cache directory to find any .Trash directories to delete). 
> Given how precious startup time is, I think we should instead post an event
> to the background DeleteDir thread to do the RemoveOldTrashes check later
> (30-60 seconds later?).  I also think we should do RemoveOldTrashes only
> once--the first time Init is called, at startup, and not when it's called
> later by the user calling ClearCache. 

This is completely wrong. First call to RemoveOldTrashes() is done when we create the disk device which we do on the cache IO thread unless (1) the first request to open an entry is synchronous (we don't do it) or (2) the browser starts with pages that don't use the disk cache (i.e. about:blank) and user clears the cache as the first step (we can really ignore this scenario). All further calls to RemoveOldTrashes() are ignored (see firstRun variable) and this is also stated in the method's description. The static variable firstRun is a leftover from the previous patch, I could make it a member variable to be a more obvious.


> This means that we might try to delete the same directory twice (if at
> startup we determine that "Cache" is corrupt, we rename it immediately and
> then post a timer to delete, but RemoveOldTrashes might see it and delete it
> first).  But that's harmless--nsIFile.Remove will just return an error
> (which we ignore), AFAICT.   The only bad side-effect is that we'd report
> two telemetry timings for DeleteDir (one very short).

As described above, this can't happen. RemoveOldTrashes() is called before we try to open the cache, so first we handle old trashes and any other cache deletion has its own timer.
Comment 16 Michal Novotny (:michal) 2011-11-03 16:56:36 PDT
(In reply to neil@parkwaycc.co.uk from comment #14)
> For those people who expect to use Clear Private Data to clear the cache on
> shutdown, will this mean that their files will still exist on the disk until
> the next restart?

I don't know why the users should expect that the data is deleted on the shutdown. The deleting starts immediately but doesn't block the shutdown, so some of the files _can_ stay on the disk after the shutdown. But they are not used by Firefox anymore.
Comment 17 Michal Novotny (:michal) 2011-11-03 17:04:40 PDT
(In reply to Michal Novotny (:michal) from comment #16)
> (In reply to neil@parkwaycc.co.uk from comment #14)
> > For those people who expect to use Clear Private Data to clear the cache on
> > shutdown, will this mean that their files will still exist on the disk until
> > the next restart?
> 
> I don't know why the users should expect that the data is deleted on the
> shutdown.

Sorry, I misread your question. Yes, people shouldn't use "Clear Private Data" to delete all files from the disk right before shutdown.
Comment 18 Benjamin Smedberg [:bsmedberg] 2011-11-03 19:55:25 PDT
What do you mean, "shouldn't"? The whole point of the "clear private data on shutdown" flag is to make sure that the cache files are deleted, and users who have checked that box expect the behavior.
Comment 19 Josh Aas 2011-11-03 20:17:31 PDT
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #18)
> What do you mean, "shouldn't"? The whole point of the "clear private data on
> shutdown" flag is to make sure that the cache files are deleted, and users
> who have checked that box expect the behavior.

Agreed. All of the data must be deleted on shutdown, that's what I would expect.
Comment 20 Bjarne (:bjarne) 2011-11-04 02:05:48 PDT
Comment on attachment 571837 [details] [diff] [review]
patch v3

Review of attachment 571837 [details] [diff] [review]:
-----------------------------------------------------------------

Clearing review-flag until the "clear private data on shutdown flag"-issue has been sorted out.
Comment 21 Michal Novotny (:michal) 2011-11-04 03:16:24 PDT
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #18)
> What do you mean, "shouldn't"? The whole point of the "clear private data on
> shutdown" flag is to make sure that the cache files are deleted, and users
> who have checked that box expect the behavior.

OK, I have to confess that I'm lost. I don't know this flag and have no idea where it is located. I thought that the talk is about clearing cache using "Clear Recent History".
Comment 22 Benjamin Smedberg [:bsmedberg] 2011-11-04 05:04:38 PDT
Options -> Privacy -> Custom settings -> Clear history when Firefox closes
Comment 23 Michal Novotny (:michal) 2011-11-04 10:19:32 PDT
Created attachment 572014 [details] [diff] [review]
patch v4

Shutdown is now blocked until all data is deleted when "Clear history when Firefox closes" flag is set and cache is selected in "Settings".
Comment 24 Jason Duell [:jduell] (needinfo me) 2011-11-08 01:03:01 PST
Comment on attachment 572014 [details] [diff] [review]
patch v4

Review of attachment 572014 [details] [diff] [review]:
-----------------------------------------------------------------

You were right about all the race conditions--thanks for being patient with me while I confused mThread from different classes.

This all looks good to me, including the new logic that ensures deletion if Clear History on shutdown is selected (Neil: thanks for noticing we needed that).

I have only very minor style nits, but it sounds like we need to make Bjarne happy (re: comment 9) with the mobile story before this can land.  AFAICT if we're suspended then killed by Android, we might not be able to delete the full cache even if "clear history on shutdown" is selected--but then again, I can't think of any way we *could* guarantee that under mobile.  So I don't see what there is to worry about specifically under mobile here.  I agree with Bjarne that we can't easily write a test for this code.

::: netwerk/cache/nsDeleteDir.cpp
@@ +119,5 @@
> +    if (!finishDeleting)
> +      gInstance->mStopDeleting = true;
> +
> +    // remove all pending timers
> +    for (PRInt32 i=gInstance->mTimers.Count() ; i > 0 ; i--) {

nit: spaces around "=", no space before semicolons:

  for (PRInt32 i = gInstance->mTimers.Count(); i > 0; i--) {

@@ +152,5 @@
> +  }
> +
> +  delete gInstance;
> +
> +  for (PRInt32 i=0 ; i<dirsToRemove.Count() ; i++)

same deal: spaces around "=" and "<", no space before ";"

@@ +211,5 @@
> +  nsAutoPtr<nsCOMArray<nsIFile> > dirList;
> +  dirList = static_cast<nsCOMArray<nsIFile> *>(arg);
> +
> +  bool shuttingDown = false;
> +  for (PRInt32 i=0 ; i<dirList->Count() && !shuttingDown ; i++) {

and again.  But if there's code elsewhere in the cache directory that has this idiom, I'm fine.  Not a big deal.

::: netwerk/cache/nsDeleteDir.h
@@ +66,5 @@
> +   * If the moveToTrash parameter is true, then the process for deleting the
> +   * directory creates a sibling directory of the same name with the ".Trash"
> +   * suffix followed by a random number. It then attempts to move the given
> +   * directory into the corresponding trash folder (moving individual files if
> +   * necessary). Next, it proceeds to delete each file in the trash folder on

Nit: we don't create the "trash123" directory, and then move the given directory into it (that would make us slow on NTFS): we rename the directory.  How about we change wording to

If the moveToTrash parameter is true we first rename the given directory "foo.Trash123" (where "foo" is the original directory name, and "123" is a random number, in order to support multiple concurrent deletes).  The directory is then deleted, file-by-file, on a background thread.
Comment 25 Michal Novotny (:michal) 2011-11-29 03:05:07 PST
Created attachment 577539 [details] [diff] [review]
patch v5

Fixed comment and style nits.


> What about mobile? How will this be handled by a device, and how does it
> work if the browser is suspended while the remove-event is pending?

If the browser is suspended it is in fact killed. I.e. the behavior is the same as if you killed your firefox on your desktop. Everything works as expected and if the session is longer than 90 seconds then there is no problem. Otherwise the old trashes won't be deleted and their count will grow (if the session is shorter than 60 seconds).
Comment 26 Jason Duell [:jduell] (needinfo me) 2011-12-02 02:12:33 PST
Comment on attachment 577539 [details] [diff] [review]
patch v5

So, this is probably causing the #1 hang for FF nightly (see bug 705761 comment 9), so we should move it along ASAP.

I think the strategy in this patch is fine for mobile.  It may be more likely that mobile could get into a situation where it repeatedly is run for less than 90 seconds (and so >1 delete dirs lie around), but I don't see this as being a problem (this patch will handle them efficiently once the browser is run for 2 minutes at some point).  If we determine that there's an issue with mobile, we can always lower the timeout there (or take whatever other action is needed).
Comment 27 Jason Duell [:jduell] (needinfo me) 2011-12-02 02:18:51 PST
Created attachment 578525 [details] [diff] [review]
Add 3 telemetry vars (applies on top of main patch)

This adds 3 telemetry metrics to this patch

1) NETWORK_DISK_CACHE_SHUTDOWN, "Total Time spent (ms) during disk cache showdown".  Just good to have.

2) NETWORK_DISK_CACHE_DELETEDIR_SHUTDOWN, "Time spent during showdown stopping thread deleting old disk cache (ms)".  I.e. how much time it takes to stop the deletedir thread during shutdown.

3) NETWORK_DISK_CACHE_SHUTDOWN_CLEAR_PRIVATE, "Time spent (ms) during showdown deleting disk cache for 'clear private data' option".  If the user has selected "clear private data", then we wait for the entire cache to be deleted at shutdown.  Generally they shouldn't have a massive cache, since they clear it each time they close, but if they've kept the browser open for a long time, it could take a while.  That may show up as a "hang" given our new hang-detection system (not sure how it's measuring), but I don't think there's anything we can do about it.  I want to make sure we capture that case so we don't think something's terribly wrong when it happens (which is rare, I expect: I don't think it's a popular option to set).

Michal, if you can +r these ASAP, let's land with this.  We can also split out into a separate bug if needed (but it's really a small patch).
Comment 28 Jason Duell [:jduell] (needinfo me) 2011-12-02 02:29:40 PST
P.S. The NETWORK_DISK_CACHE_SHUTDOWN_CLEAR_PRIVATE is not a very exact measure--it doesn't actually time from the initial DeleteDir call (which is called from sanitize.js -> EvictEntries: not an easy place to add telemetry for this) but rather from the time Shutdown is called.  I assume those both happen fairly close together, and the time granularity lost should be minor compared to the total time taken (all we really want to know if how long the longest cases take, and to clearly identify that 'clear private data' is to blame).
Comment 30 (dormant account) 2011-12-02 14:25:47 PST
Note telemetry here is not useful since it does not current persist to disk on shutdown.
Comment 31 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-12-02 19:05:24 PST
(In reply to Taras Glek (:taras) from comment #30)
> Note telemetry here is not useful since it does not current persist to disk
> on shutdown.

What is the best strategy for getting data on shutdown performance problems like this?
Comment 32 Jason Duell [:jduell] (needinfo me) 2011-12-02 19:20:34 PST
When is telemetry data uploaded?  On some timer?

We could roll a one-off solution to store this data to disk and report it at startup.  Or we just leave the shutdown metrics in, even if for now they don't do anything.  It seems likely that we'll eventually fix telemetry to persist un-phone-homed data to disk.
Comment 33 (dormant account) 2011-12-02 19:43:57 PST
the histogram class already supports serialization, just a little matter of implementation in bug 707320
Comment 34 Marco Bonardo [::mak] 2011-12-03 03:15:37 PST
https://hg.mozilla.org/mozilla-central/rev/bb764828a379

Note You need to log in before you can comment on or make changes to this bug.