Closed Bug 829419 Opened 11 years ago Closed 11 years ago

Intermittent Android Talos tcheck2, trobopan, trobocheck, tcheckerboard, ttest.py | application crashed [@ nsDeleteDir::PostTimer(void*, unsigned int)]

Categories

(Core :: Networking: Cache, defect)

ARM
Android
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla23
Tracking Status
firefox20 --- affected
firefox21 --- affected

People

(Reporter: philor, Assigned: michal)

References

Details

(Keywords: crash, intermittent-failure)

Crash Data

Attachments

(1 file, 2 obsolete files)

Now that we actually get stacks, the bulk of the crashes in bug 796914 (and its clones bug 816584 and bug 829371) will turn out to be the complex of NSS crashes around bug 761987, but some of them are yours.

https://tbpl.mozilla.org/php/getParsedLog.php?id=18687626&tree=Mozilla-Inbound
Android Tegra 250 mozilla-inbound talos remote-trobopan on 2013-01-10 14:38:48 PST for push 42b22f0ab58c
slave: tegra-267

mozcrash ERROR | PROCESS-CRASH | ttest.py | application crashed [@ nsDeleteDir::PostTimer(void*, unsigned int)]
getting files in '/mnt/sdcard/tests/profile/minidumps/'
Crash dump filename: /tmp/tmpCeQnZw/7e61f8cc-79d6-126a-2eff37a5-151ea153.dmp
Operating system: Android
                  0.0.0 Linux 2.6.32.9-00002-gd8084dc-dirty #1 SMP PREEMPT Wed Feb 2 11:32:06 PST 2011 armv7l nvidia/harmony/harmony/harmony:2.2/FRF91/20110202.102810:eng/test-keys
CPU: arm
     0 CPUs

Crash reason:  SIGSEGV
Crash address: 0x0

Thread 30 (crashed)
 0  libxul.so!nsDeleteDir::PostTimer(void*, unsigned int) [Mutex.h : 74 + 0x0]
     r4 = 0x00000000    r5 = 0x56fffb9c    r6 = 0x514b81a0    r7 = 0x0000ea60
     r8 = 0x4e934a80    r9 = 0x00010011   r10 = 0x4ed38fd8    fp = 0x5132ae48
     sp = 0x56fffb90    lr = 0x54d53241    pc = 0x5460cdda
    Found by: given as instruction pointer in context
 1  libxul.so!nsDeleteDir::DeleteDir(nsIFile*, bool, unsigned int) [nsDeleteDir.cpp : 270 + 0xd]
     r4 = 0x00000000    r5 = 0x56fffbec    r6 = 0x514b81a0    r7 = 0x0000000a
     r8 = 0x0000ea60    r9 = 0x00010011   r10 = 0x4ed38fd8    fp = 0x5132ae48
     sp = 0x56fffbd0    pc = 0x5460d30f
    Found by: call frame info
 2  libxul.so!nsDiskCacheDevice::OpenDiskCache() [nsDiskCacheDevice.cpp : 999 + 0xb]
     r4 = 0x00000000    r5 = 0x4e96f240    r6 = 0x00000000    r7 = 0x00000000
     r8 = 0x4ed38fd8    r9 = 0x00000001   r10 = 0x4ed38fd8    fp = 0x5132ae48
     sp = 0x56fffcc8    pc = 0x54605a45
    Found by: call frame info
 3  libxul.so!nsDiskCacheDevice::Init() [nsDiskCacheDevice.cpp : 408 + 0x5]
     r4 = 0x00000000    r5 = 0x4e96f240    r6 = 0x00000000    r7 = 0x00000000
     r8 = 0x4ed38fd8    r9 = 0x00000001   r10 = 0x4ed38fd8    fp = 0x5132ae48
     sp = 0x56fffcf8    pc = 0x54605b07
    Found by: call frame info
 4  libxul.so!nsCacheService::CreateDiskDevice() [nsCacheService.cpp : 1570 + 0x7]
     r4 = 0x00000000    r5 = 0x4e96f240    r6 = 0x5132adf0    r7 = 0x00000000
     r8 = 0x4ed38fd8    r9 = 0x00000001   r10 = 0x4ed38fd8    fp = 0x5132ae48
     sp = 0x56fffd08    pc = 0x54600def
    Found by: call frame info
 5  libxul.so!nsCacheService::SearchCacheDevices(nsCString*, int, bool*) [nsCacheService.cpp : 2171 + 0x5]
     r4 = 0x00000000    r5 = 0x5132adf0    r6 = 0x56fffd77    r7 = 0x00000000
     r8 = 0x4ed38fd8    r9 = 0x00000001   r10 = 0x4ed38fd8    fp = 0x5132ae48
     sp = 0x56fffd38    pc = 0x546014a3
    Found by: call frame info
 6  libxul.so!nsCacheService::ActivateEntry(nsCacheRequest*, nsCacheEntry**, nsCacheEntry**) [nsCacheService.cpp : 2074 + 0x3]
     r4 = 0x00000000    r5 = 0x4ed38fd0    r6 = 0x5132adf0    r7 = 0x557035bc
     r8 = 0x56fffdac    r9 = 0x56fffdb0   r10 = 0x4ed38fd8    fp = 0x5132ae48
     sp = 0x56fffd70    pc = 0x54602aa9
    Found by: call frame info
 7  libxul.so!nsCacheService::ProcessRequest(nsCacheRequest*, bool, nsICacheEntryDescriptor**) [nsCacheService.cpp : 1900 + 0xb]
     r4 = 0x4ed38fd0    r5 = 0x00000000    r6 = 0x5132adf0    r7 = 0x00000000
     r8 = 0x00000001    r9 = 0x514b88f8   r10 = 0x557035bc    fp = 0x51efafd0
     sp = 0x56fffda0    pc = 0x54602317
https://tbpl.mozilla.org/php/getParsedLog.php?id=18785030&tree=Mozilla-Inbound
Summary: Intermittent Android Talos "mozcrash ERROR | PROCESS-CRASH | ttest.py | application crashed [@ nsDeleteDir::PostTimer(void*, unsigned int)]" → Intermittent Android Talos "PROCESS-CRASH | tcheck2, trobopan | application crashed [@ nsDeleteDir::PostTimer(void*, unsigned int)]"
Crash Signature: [@ nsDeleteDir::PostTimer(void*, unsigned int)]
https://tbpl.mozilla.org/php/getParsedLog.php?id=18818980&tree=Mozilla-Inbound
Summary: Intermittent Android Talos "PROCESS-CRASH | tcheck2, trobopan | application crashed [@ nsDeleteDir::PostTimer(void*, unsigned int)]" → Intermittent Android Talos "PROCESS-CRASH | tcheck2, trobopan, tcheckerboard | application crashed [@ nsDeleteDir::PostTimer(void*, unsigned int)]"
Additional logs for this crash can be found in bug 796914: see comments 614 and 624, for example.
See Also: → 811653
Michal, I don't suppose you could have a look at this? Now that we're able to get stacks from talos runs, this appears to be the 2nd most common crash we see on TBPL. 

Cheers :-)
Depends on: 811653
Flags: needinfo?(michal.novotny)
See Also: 811653
Summary: Intermittent Android Talos "PROCESS-CRASH | tcheck2, trobopan, tcheckerboard | application crashed [@ nsDeleteDir::PostTimer(void*, unsigned int)]" → Intermittent Android Talos tcheck2, trobopan, tcheckerboard | application crashed [@ nsDeleteDir::PostTimer(void*, unsigned int)]
Thread 29 (crashed)
 0  libxul.so!nsDeleteDir::PostTimer(void*, unsigned int) [Mutex.h : 74 + 0x0]
 1  libxul.so!nsDeleteDir::DeleteDir(nsIFile*, bool, unsigned int) [nsDeleteDir.cpp : 270 + 0xd]
 2  libxul.so!nsDiskCacheDevice::OpenDiskCache() [nsDiskCacheDevice.cpp : 1027 + 0xb]
 3  libxul.so!nsDiskCacheDevice::Init() [nsDiskCacheDevice.cpp : 436 + 0x5]
 4  libxul.so!nsCacheService::CreateDiskDevice() [nsCacheService.cpp : 1542 + 0x7]
 5  libxul.so!nsCacheService::SearchCacheDevices(nsCString*, int, bool*) [nsCacheService.cpp : 2136 + 0x5]
 6  libxul.so!nsCacheService::ActivateEntry(nsCacheRequest*, nsCacheEntry**, nsCacheEntry**) [nsCacheService.cpp : 2039 + 0x3]
 7  libxul.so!nsCacheService::ProcessRequest(nsCacheRequest*, bool, nsICacheEntryDescriptor**) [nsCacheService.cpp : 1865 + 0xb]
 8  libxul.so!nsCacheService::OpenCacheEntry(nsCacheSession*, nsACString_internal const&, int, bool, nsICacheListener*, nsICacheEntryDescriptor**) [nsCacheService.cpp : 1997 + 0xf]
 9  libxul.so!nsCacheSession::AsyncOpenCacheEntry(nsACString_internal const&, int, nsICacheListener*, bool) [nsCacheSession.cpp : 110 + 0x11]
10  libxul.so!mozilla::net::HttpCacheQuery::Run() [nsHttpChannel.cpp : 2804 + 0x15]
11  libxul.so!nsThread::ProcessNextEvent(bool, bool*) [nsThread.cpp : 627 + 0x5]


From the backtrace, I don't understand what's wrong. It seems that mLock in nsDeleteDir is null, but nsDeleteDir was successfully initialized since gInstance is non-null (it is checked at the beginning of nsDeleteDir::DeleteDir()). Also this can't be race condition between nsDeleteDir::DeleteDir() and nsDeleteDir::Shutdown() because Shutdown() is called from nsCacheService::Shutdown() after the cache IO thread is shut down but the crash is on cache IO thread.
Flags: needinfo?(michal.novotny)
I reproduced on try with additional checks. In the failure (crash) case, I found gInstance == 0 in nsDeleteDir::Shutdown(), just before calling gInstance->PostTimer.
(In reply to Geoff Brown [:gbrown] from comment #99)
> I reproduced on try with additional checks. In the failure (crash) case, I
> found gInstance == 0 in nsDeleteDir::Shutdown(), just before calling
> gInstance->PostTimer.

Sorry, that should read "...I found gInstance == 0 in nsDeleteDir::DeleteDir() ..."
Based on https://tbpl.mozilla.org/?tree=Try&rev=8d7b21ae84a8, looking at https://tbpl.mozilla.org/php/getParsedLog.php?id=18900440&tree=Try&full=1#error0, note that nsCacheService::Init, nsCacheService::Shutdown, nsCacheService::Init, and nsCacheService::Shutdown are called in rapid succession. (It appears that Android activity pause/resume notifications have queued up ACTIVITY_PAUSING / ACTIVITY_RESUMING events during application startup. We recently noticed that this is happening and causing problems in bug 792242 and bug 826385.) On one thread (2681), OpenDiskCache detects a corrupt cache and calls DeleteDir; meanwhile, another thread (2609), executes nsCacheService::Shutdown, which calls nsDeleteDir::Shutdown, setting gInstance to null.

I don't understand why the cache is corrupt.

The rapid Init/Shutdown/Init/Shutdown is obviously troubling.

Even so, nsDeleteDir should probably guard against this case.
Attached patch v1 (obsolete) — Splinter Review
Geoff: thanks for digging into this!

Yes, it looks like there's clearly a race between the check for !gInstance at the beginning of nsDeleteDir::DeleteDir and the blind use of it at the end of the function (after it's done lots of I/O, so it can take a while, making a big window for Shutdown() to delete gInstance).

Michal: This is not a very good patch, but I think it may reduce the crashes a lot.  It lowers the window where the race can happen to be much smaller, but there's still a race possible (Shutdown can release the lock and let DeleteDir proceed, but then Shutdown can delete gInstance while DeleteDir is checking mShutDownPending.  But that's a small window, and the crash we're seeing has the opposite ordering, so DeleteDir should see a null ptr and we'll be ok).

A better fix would avoid all race windows entirely.  Possibly making nsDeleteDir a refcounted object (a service?), and/or moving the lock for it outside of the class, so we can avoid the small "check non-null, then grab lock" race window that my patch still has.

Also, we shouldn't be creating a timer in a non-main thread: see 

   http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsITimer.idl#56

Perhaps we should turn nsDeleteDir into a runnable, and it's Run method can post the timer?
Attachment #703716 - Flags: review?(michal.novotny)
Jason: thanks for looking at this! 

There are actually 2 calls to gInstance->PostTimer in nsDeleteDir.cpp -- the one your patch modifies in RemoveOldTrashes and another in nsDeleteDir::DeleteDir (and it's the one in nsDeleteDir::DeleteDir that is causing most of the trouble). Both calls likely need attention.
Comment on attachment 703716 [details] [diff] [review]
v1

(In reply to Jason Duell (:jduell) from comment #114)
> Also, we shouldn't be creating a timer in a non-main thread: see 
>   
> http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsITimer.idl#56
> 
> Perhaps we should turn nsDeleteDir into a runnable, and it's Run method can
> post the timer?

I've filed bug #832474 to fix this issue.


> Yes, it looks like there's clearly a race between the check for !gInstance
> at the beginning of nsDeleteDir::DeleteDir and the blind use of it at the
> end of the function (after it's done lots of I/O, so it can take a while,
> making a big window for Shutdown() to delete gInstance).

No, there is IMO no race condition. Please see my comment #78. nsDeleteDir is initialized in nsCacheService::Init() and destroyed in nsCacheService::ShutDown(). Cache IO thread is shut down before nsDeleteDir is destroyed, so there is no race condition possible when nsDeleteDir is called from cache IO thread, which is this case. See frame #10 in comment #78, in HttpCacheQuery::Run() we open the cache entry on non-mainthread and HttpCacheQuery uses only cache IO thread.

Even if nsDeleteDir::DeleteDir() was called on another thread (which we don't do), it would still be safe because nsDeleteDir::DeleteDir() is called under the cache lock which ensures synchronization with nsCacheService::Shutdown(). We call nsDeleteDir::DeleteDir() outside the cache lock only in case of nsCacheProfilePrefObserver::Observe(), nsCacheService::Shutdown() and nsCacheService::init() always on the main thread and when nsDeleteDir is initialized.

Instead of this hotfix I need to find out what's really happening.

BTW, how are threads organized on android? Thread #5 in all backtraces seems to be main thread, which is #0 on desktop...
Attachment #703716 - Flags: review?(michal.novotny) → review-
Assignee: nobody → michal.novotny
I noticed something interesting....

When nsCacheService::Shutdown calls cacheIOThread->Shutdown() (http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsCacheService.cpp#1251), event processing continues on the main thread at http://mxr.mozilla.org/mozilla-central/source/widget/android/nsAppShell.cpp#270. As I noted earlier, in this Android test case, I see multiple pause/resume events queued up; so what happens here is that we get a resume event, causing a call to nsCacheService::Init, inside the cacheIOThread->Shutdown() call. Once the resume event is processed, a pause event is encountered and nsCacheService::Shutdown is called again, still inside the call to cacheIOThread->Shutdown() inside the first call to nsCacheService::Shutdown: so nsCacheService::Shutdown is called during nsCacheService::Shutdown. 

I don't immediately see how this causes any trouble, but it certainly is unexpected and likely unique to the Android environment.
Attached patch patch v2 (obsolete) — Splinter Review
(In reply to Geoff Brown [:gbrown] from comment #129)
> I don't immediately see how this causes any trouble, but it certainly is
> unexpected and likely unique to the Android environment.

I don't exactly see how we crash when another nsCacheService::Init() and nsCacheService::Shutdown() is called when waiting for the thread shutdown in nsCacheService::Shutdown(). But there is a clear crash scenario when just nsCacheService::Init() is called:

1) nsCacheService::Init() is called while waiting for the cache IO thread shutdown
2) HttpCacheQuery event is being processed on the new cache IO thread
3) at the same time the old cache IO thread is shut down and nsCacheService::Shutdown() continues
4) nsDeleteDir is shut down and gInstance is deleted at the end of nsCacheService::Shutdown()

Crash occurs on the new cache IO thread when point 4 happens after we check gInstance and before we use it in nsDeleteDir::DeleteDir(). Attached patch ensures that no pending event is processed while waiting for the thread shutdown. Tryserver looks good:

https://tbpl.mozilla.org/?tree=Try&rev=4a523ba5439a

Maybe it would be sufficient to post event to the main thread that would call thread->Shutdown() (tryserver push https://tbpl.mozilla.org/?tree=Try&rev=01a2d9ce6c83), but in this case it wouldn't be ensured that nsDeleteDir exists for the whole lifetime of the cache IO thread.
Attachment #703716 - Attachment is obsolete: true
Attachment #704584 - Flags: review?(jduell.mcbugs)
Summary: Intermittent Android Talos tcheck2, trobopan, tcheckerboard | application crashed [@ nsDeleteDir::PostTimer(void*, unsigned int)] → Intermittent Android Talos tcheck2, trobopan, trobocheck, tcheckerboard | application crashed [@ nsDeleteDir::PostTimer(void*, unsigned int)]
https://tbpl.mozilla.org/php/getParsedLog.php?id=19180346&tree=Mozilla-Aurora
Summary: Intermittent Android Talos tcheck2, trobopan, trobocheck, tcheckerboard | application crashed [@ nsDeleteDir::PostTimer(void*, unsigned int)] → Intermittent Android Talos tcheck2, trobopan, trobocheck, tcheckerboard, ttest.py | application crashed [@ nsDeleteDir::PostTimer(void*, unsigned int)]
jduell: review ping?

Also, does anyone know why the test failures stopped? Is it just timing/luck?

The fix for bug 826385 is landing today; that should avoid the frequent nsCacheService::Shutdown/Init calls.
No longer depends on: 811653
Blocks: 810471
Review ping :-)
Flags: needinfo?(jduell.mcbugs)
Comment on attachment 704584 [details] [diff] [review]
patch v2

Review of attachment 704584 [details] [diff] [review]:
-----------------------------------------------------------------

I can live with this fix if it fixes the crash here (I assume we're not seeing the crash now just because the test has been disabled?  Can we get it re-enabled for a try run?).

Basically, in the API of nsDeleteDir there's a race between ::Shutdown and ::DeleteDir/::RemoveOldTrashes: if they're called simultaneously by different threads, we can wind up deleting gInstance in the middle of ::DeleteDir (or ::RemoveOldTrashes) and things go boom when we try gInstance->PostTimer.  To me that's a problem that would ideally be fixed internally inside nsDeleteDir.  That's good API design.  This patch instead fixes it (hopefully: AFAICT it looks good) by making sure the client code never calls the code in a way that hits the race.

If it works let's ship it.  If it doesn't I think the simplest approach might be to really make nsDeleteDir thread-safe (with the main fix being we can't use a lock inside gInstance to check whether gInstance is still live).

::: netwerk/cache/nsCacheUtils.h
@@ +1,1 @@
> +/* -*- Mode: C++; tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 4 -*-

add vi modeline too

  /* vim: set sw=2 ts=8 et tw=80 : */

@@ +13,5 @@
> +#include "mozilla/CondVar.h"
> +
> +class nsIThread;
> +
> +class nsShutdownThread : public nsRunnable {

/**
 * A class with utility methods for shutting down nsIThreads easily. 
 */

@@ +20,5 @@
> +  ~nsShutdownThread();
> +
> +  NS_IMETHOD Run();
> +
> +  static nsresult Shutdown(nsIThread *aThread);

Add comments describing methods:

/**
 * Shutdown ensures that aThread->Shutdown() is called on a different thread (which is required by the API)
 */

/**
 * BlockingShutdown ensures that by the time it returns, aThread->Shutdown() has been called on a different thread.
 */
Attachment #704584 - Flags: review?(jduell.mcbugs) → review+
Flags: needinfo?(jduell.mcbugs)
ryanvm: can you confirm the crashes here are gone because the test is disabled?  Also can you tell us how to enable the test in a try run?
Flags: needinfo?(ryanvm)
I think these tests are still running, and the crashes are gone (drastically reduced in frequency) because the fix for bug 826385 avoids the frequent nsCacheService::Shutdown/Init calls.
What Geoff said. They're running.
Flags: needinfo?(ryanvm)
> Basically, in the API of nsDeleteDir there's a race between ::Shutdown and
> ::DeleteDir/::RemoveOldTrashes: if they're called simultaneously by
> different threads, we can wind up deleting gInstance in the middle of
> ::DeleteDir (or ::RemoveOldTrashes) and things go boom when we try
> gInstance->PostTimer.  To me that's a problem that would ideally be fixed
> internally inside nsDeleteDir.  That's good API design.

Your understanding of the problem is completely wrong. Lifecycle of nsDeleteDir should fully depend on lifecycle of the cache service. This means that nsDeleteDir should be initialized if and only if the cache service is initialized. There is no reason for using nsDeleteDir outside the cache service. This all works as expected.

The problem is a race between nsCacheService::Shutdown() and nsCacheService::Init() because of a stupid nsIThread API. I really don't understand why I can't prohibit processing of pending events on the current thread while a different thread is being shut down.
> There is no reason for using nsDeleteDir outside the cache service.

For now that's true, which is why I'm fine with your patch! :)  Let's land it and see if it works. (would it be worth reverting 826385 in your local build and push it to try a bunch of times with and without your patch?  I'm not sure of a better way to test this.)
(In reply to Jason Duell (:jduell) from comment #209)
> would it be worth reverting 826385 in your local build and push it to try
> a bunch of times with and without your patch?  I'm not sure of a better way
> to test this.

The code has changed a little bit, so the patch from bug #826385 cannot be reverted directly. I've reverted the changes manually, but there is no crash on the try (https://tbpl.mozilla.org/?tree=Try&rev=6aaea40fe931). Maybe I did something wrong, or there is yet another change that avoids frequent calls to nsCacheService::Shutdown()/Init(). Anyway, the fix worked correctly on the tryserver at the time when I wrote it.
Attached patch patch v3Splinter Review
Attachment #704584 - Attachment is obsolete: true
Attachment #738026 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/d8b73ccd6725
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
The code from this patch triggers the warning at http://hg.mozilla.org/mozilla-central/annotate/18467a85acf6/xpcom/threads/nsThread.cpp#l434.  From reading the code it looks like http://hg.mozilla.org/mozilla-central/annotate/18467a85acf6/netwerk/cache/nsCacheUtils.cpp#l66 should be dispatching to the workerThread, not aThread?  Otherwise why is workerThread created at all?
Flags: needinfo?(michal.novotny)
Right, there should be workerThread->Dispatch(). I've already fixed it in gum tree some time ago, but I forgot to land it on m-c.
Flags: needinfo?(michal.novotny)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: