Closed Bug 1011771 Opened 10 years ago Closed 10 years ago

Intermittent f02n0g08.png | application crashed [@ mozilla::net::CacheFile::RemoveChunk(mozilla::net::CacheFileChunk*)]

Categories

(Core :: Networking: Cache, defect)

All
Android
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla32
Tracking Status
firefox30 --- unaffected
firefox31 --- unaffected
firefox32 --- fixed
firefox-esr24 --- unaffected

People

(Reporter: KWierso, Assigned: michal)

References

(Blocks 1 open bug)

Details

(Keywords: crash, intermittent-failure)

Crash Data

Attachments

(1 file, 2 obsolete files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=39845205&tree=Mozilla-Inbound
Android 2.2 Tegra mozilla-inbound opt test plain-reftest-1 on 2014-05-16 14:17:36 PDT for push 505f38b0649b

slave: tegra-111



REFTEST TEST-START | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f01n2c08.png
REFTEST TEST-LOAD | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f01n2c08.png | 186 / 2985 (6%)
REFTEST TEST-LOAD | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f01n2c08.html | 186 / 2985 (6%)
REFTEST TEST-PASS | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f01n2c08.png | image comparison (==)
REFTEST TEST-UNEXPECTED-FAIL | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f01n2c08.html | This test left crash dumps behind, but we weren't expecting it to!
REFTEST INFO | Found unexpected crash dump file /mnt/sdcard/tests/reftest/profile/minidumps/60e7e9b5-6e3a-d375-49b88d3f-5ddec81e.dmp.
REFTEST INFO | Found unexpected crash dump file /mnt/sdcard/tests/reftest/profile/minidumps/60e7e9b5-6e3a-d375-49b88d3f-5ddec81e.extra.
REFTEST INFO | Loading a blank page
REFTEST TEST-END | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f01n2c08.png
REFTEST TEST-START | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f02n0g08.png
REFTEST TEST-LOAD | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f02n0g08.png | 187 / 2985 (6%)
REFTEST TEST-LOAD | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f02n0g08.html | 187 / 2985 (6%)
INFO | automation.py | Application ran for: 0:02:42.719816
INFO | zombiecheck | Reading PID log: /tmp/tmpRSB3hXpidlog
Contents of /data/anr/traces.txt:


PROCESS-CRASH | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f02n0g08.png | application crashed [@ mozilla::net::CacheFile::RemoveChunk(mozilla::net::CacheFileChunk*)]
Crash dump filename: /tmp/tmp60vTPM/60e7e9b5-6e3a-d375-49b88d3f-5ddec81e.dmp
Operating system: Android
                  0.0.0 Linux 2.6.32.9-00002-gd8084dc-dirty #1 SMP PREEMPT Wed Feb 2 11:32:06 PST 2011 armv7l nvidia/harmony/harmony/harmony:2.2/FRF91/20110202.102810:eng/test-keys
CPU: arm
     2 CPUs

Crash reason:  SIGSEGV
Crash address: 0x5

Thread 21 (crashed)
 0  libxul.so!mozilla::net::CacheFile::RemoveChunk(mozilla::net::CacheFileChunk*) [nsAutoPtr.h:505f38b0649b : 897 + 0x4]
     r4 = 0x00340028    r5 = 0x5e9ff380    r6 = 0x5cd0d1c8    r7 = 0x00000001
     r8 = 0x5cd0d1c4    r9 = 0x56e333ba   r10 = 0x00000001    fp = 0x00000000
     sp = 0x449c5dc8    lr = 0x55d93c45    pc = 0x55d92f62
    Found by: given as instruction pointer in context
 1  libxul.so!mozilla::net::CacheFileChunk::Release() [CacheFileChunk.cpp:505f38b0649b : 64 + 0x5]
     r4 = 0x00000001    r5 = 0x00000000    r6 = 0x5cd0d1c8    r7 = 0x00000001
     r8 = 0x5cd0d1c4    r9 = 0x56e333ba   r10 = 0x00000001    fp = 0x00000000
     sp = 0x449c5de8    pc = 0x55d93c45
    Found by: call frame info
 2  libxul.so!mozilla::RefPtr<mozilla::SharedThreadPool>::~RefPtr() + 0xd
     r4 = 0x449c5e04    r5 = 0x00000000    r6 = 0x5cd0d1c8    r7 = 0x00000001
     r8 = 0x5cd0d1c4    r9 = 0x56e333ba   r10 = 0x00000001    fp = 0x00000000
     sp = 0x449c5df8    pc = 0x55cecfaf
    Found by: call frame info
 3  libxul.so!mozilla::net::CacheFile::RemoveChunk(mozilla::net::CacheFileChunk*) [CacheFile.cpp:505f38b0649b : 1348 + 0x5]
     r4 = 0x5ee581c0    r5 = 0x00000000    r6 = 0x5cd0d1c8    r7 = 0x00000001
     r8 = 0x5cd0d1c4    r9 = 0x56e333ba   r10 = 0x00000001    fp = 0x00000000
     sp = 0x449c5e00    pc = 0x55d930cf
    Found by: call frame info
 4  libxul.so!mozilla::net::CacheFileChunk::Release() [CacheFileChunk.cpp:505f38b0649b : 64 + 0x5]
     r4 = 0x00000001    r5 = 0x00000000    r6 = 0x5cd0d1c8    r7 = 0x00000001
     r8 = 0x5cd0d1c4    r9 = 0x56e333ba   r10 = 0x00000001    fp = 0x00000000
     sp = 0x449c5e20    pc = 0x55d93c45
    Found by: call frame info
michal/honza:  new cache?
Assignee: nobody → michal.novotny
FYI, there is a lot of crashes with mozilla::net::CacheFile::RemoveChunk in the stack, maybe check on all of them, some may give a clue.
The problem is that mRemovingChunk is accessed in CacheFileChunk::Release() after the refcnt is decreased. When the CacheFileChunk::Release() method is called on two threads at the same time, it is possible that the object is destroyed on one thread before we execute the condition on the second thread.
Attached patch patch v1 (obsolete) — Splinter Review
The patch fixes the problem described in comment #6. I've also renamed the member mRemovingChunk. The name was chosen at time when we didn't cache chunks, but now is its name incorrect and misleading.
Attachment #8426991 - Flags: review?(honzab.moz)
Comment on attachment 8426991 [details] [diff] [review]
patch v1

It can still crash since the activeChunk information might be outdated in Release() so we would still call RemoveChunk() on a deleted object. Reposting Release() to a single thread seems to be the simplest solution for this problem. I'll create a new patch.
Attachment #8426991 - Attachment is obsolete: true
Attachment #8426991 - Flags: review?(honzab.moz)
Comment on attachment 8426991 [details] [diff] [review]
patch v1

Review of attachment 8426991 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Michal Novotny (:michal) from comment #8)
> Comment on attachment 8426991 [details] [diff] [review]
> patch v1
> 
> It can still crash since the activeChunk information might be outdated in
> Release() so we would still call RemoveChunk() on a deleted object.
> Reposting Release() to a single thread seems to be the simplest solution for
> this problem. I'll create a new patch.

Yes, it apparently can.

Can AddRef be called on a different thread then the thread you want to post Release() to?  If yes, then posting is not a solution here.

::: netwerk/cache2/CacheFileChunk.cpp
@@ +61,5 @@
>      delete (this);
>      return 0;
>    }
>  
> +  if (activeChunk && count == 1) {

so, here you are using a local var since |this| can already be freed, right?  But you are passing |this| to RemoveChunk that happily dereferences it again.  Do you really think that is a correct way of coding?


This code is all wrong.  You should actually allow Put/Get to/from both the mChunks and mCachedChunks _and_ calls to AddRef() and Release() on chunks happen just an only under the file's lock.  Chunk may remove itself from the hashtable when it reaches count 1 safely - under the lock.
Attachment #8426991 - Flags: review-
Actually, Get/Put+AddRef and Release+Remove should both be strictly atomic.
(In reply to Honza Bambas (:mayhemer) from comment #9)
> ::: netwerk/cache2/CacheFileChunk.cpp
> @@ +61,5 @@
> >      delete (this);
> >      return 0;
> >    }
> >  
> > +  if (activeChunk && count == 1) {
> 
> so, here you are using a local var since |this| can already be freed, right?
> But you are passing |this| to RemoveChunk that happily dereferences it
> again.  Do you really think that is a correct way of coding?

You should first try to read the code and think about it a bit before you post any comment. This would be OK if activeChunk would contain fresh information, because if the chunk is active it is in mChunks and this release was the last reference other than from HT. So it is ensured that the object exists.

So back to reposting Release(). This will ensure that the object is not destroyed on a different thread. This still won't ensure that anybody AddRefs() the chunk on the different thread but I can safely call RemoveChunk (which should be maybe renamed to DeactivateChunk or MaybeDeactivateChunk) which can grabs the CacheFile's lock and decide whether this is really the last reference.


> You should actually allow Put/Get to/from both the mChunks and mCachedChunks

I don't understand what are you trying to say with this part of the sentence.


> _and_ calls to AddRef() and Release() on chunks happen just an only under the
> file's lock.

This is a complete nonsense. You can never ensure that CacheFileChunk::AddRef() or CacheFileChunk::Release() is called always under the CacheFile's lock.
(In reply to Michal Novotny (:michal) from comment #11)
> (In reply to Honza Bambas (:mayhemer) from comment #9)
> > ::: netwerk/cache2/CacheFileChunk.cpp
> > @@ +61,5 @@
> > >      delete (this);
> > >      return 0;
> > >    }
> > >  
> > > +  if (activeChunk && count == 1) {
> > 
> > so, here you are using a local var since |this| can already be freed, right?
> > But you are passing |this| to RemoveChunk that happily dereferences it
> > again.  Do you really think that is a correct way of coding?
> 
> You should first try to read the code and think about it a bit before you
> post any comment. This would be OK if activeChunk would contain fresh
> information, because if the chunk is active it is in mChunks and this
> release was the last reference other than from HT. So it is ensured that the
> object exists.

Then you can use mActiveChunk instead, right?

> 
> So back to reposting Release(). This will ensure that the object is not
> destroyed on a different thread. This still won't ensure that anybody
> AddRefs() the chunk on the different thread but I can safely call
> RemoveChunk (which should be maybe renamed to DeactivateChunk or
> MaybeDeactivateChunk) 

I'm in favor of that.

> which can grabs the CacheFile's lock and decide
> whether this is really the last reference.
> 
> 
> > You should actually allow Put/Get to/from both the mChunks and mCachedChunks
> 
> I don't understand what are you trying to say with this part of the sentence.
> 
> 
> > _and_ calls to AddRef() and Release() on chunks happen just an only under the
> > file's lock.

I'll try to rephrase:

The problem here is that when Release() { if (--mRefCnt == 1) m[Cached]Chunks.Remove(); } is not atomic, someone can find the reference to the chunk in the hash table and do AddRef() in a way that you on one thread get to the point of deleting the object and on the other do (dead-object)->AddRef().

So, it should be like (pseudo-code):

GetChunk(**chunk)
{
  autolock();
  if (!m[Cached]Chunks.Get(index, chunk)) { // if found, ref++
    refptr<Chunk> c = new Chunk(); // 1 ref
    m[Cached]Chunks.Put(index, c); // 2 refs
    c.forget(chunk); // 2 refs
  }
}

and

Release()
{
  autolock();
  cnt = --mRefCnt;
  if (cnt == 1) {
    m[Cached]Chunks.Remove(this); // problem here is that you will reenter the Release(), but you could well post here
    // on the other hand, you could just let m[Cached]Chunks don't ref the chunk, but then there is a chance (that will happen! :)) there will be an invalid ptr in the hashtables, so I would take that only as a last resort solution
  }
  if (cnt == 0) 
    delete this;
}

> 
> This is a complete nonsense. You can never ensure that
> CacheFileChunk::AddRef() or CacheFileChunk::Release() is called always under
> the CacheFile's lock.

Of course, what I'm saying is that you need to atomize moving with the refcnt _and_ searching/removing in/from the hashtable.  Sorry if that was not clear.
Other option is to make the hash tables use nsAutoPtr instead.  So it will not alter the mRefCnt and will auto delete the handle when you reach mRefCnt == 0 where you will instead of |delete this| do mChunks.Remove(this).  Moving the counter and access to the hash tables must be atomic.
Also FYI: autoptr1 = autoptr2 will move value from autoptr2 to autoptr1, so that the object won't go away.
I want to make sure you fully understand the life cycle of the chunk. The goal isn't to just simply remove the chunk from hashtables when the refcnt decreases to 1 (or zero if hashtables won't hold a strong reference to it). AFAICS this is what you're focused on with your proposal.

The chunk is "active" when it is actively used and during this time it _must_ be present in mChunks. When the chunk is inactive (all references other than the reference from mChunks were released) it is either put into mCachedChunks or destroyed. I.e. mChunks hashtable is driven by Release() and on the other hand _we_ decide when the cached chunk should be released, so once we release a cached chunk the Release() method must just decrement the RefCnt and it must not do anything else.

Also please note that when all releases of an active chunk are released that doesn't necessarily mean that such chunk will be released or cached. If it is dirty, we initiate writing of this chunk to disk and the chunk _must_ stay during this write in mChunks. The write process can fail synchronously or it can fail/succeed asynchronously. During initiation of this write process the chunk is addrefed/released several times and we must make sure we won't re-enter the CacheFile's lock.
(In reply to Michal Novotny (:michal) from comment #15)
> I.e. mChunks hashtable is driven by Release()
> and on the other hand _we_ decide when the cached chunk should be released,
> so once we release a cached chunk the Release() method must just decrement
> the RefCnt and it must not do anything else.
> 

I only don't follow this part.  Can you please rephrase or something to make this more clear?  I cannot answer before I fully understand.  Thanks.
I just wanted to explain the difference between mChunks and mCachedChunks. Chunk _might_ be deactivated (removed from mChunks) when all references are released, i.e. the impulse comes always from CacheFileChunk::Release(). OTOH chunks from mCachedChunks are removed by impulse from CacheFile::ThrowMemoryCachedData(), CacheFile::CleanUpPreloadedChunks() etc. and in this case we must not try to deactivate the chunk in CacheFileChunk::Release().
I'll try..  Thanks for this outline BTW.

(In reply to Michal Novotny (:michal) from comment #15)
> I want to make sure you fully understand the life cycle of the chunk. The
> goal isn't to just simply remove the chunk from hashtables when the refcnt
> decreases to 1 (or zero if hashtables won't hold a strong reference to it).
> AFAICS this is what you're focused on with your proposal.

Yes, you are right on this.  

> 
> The chunk is "active" when it is actively used and during this time it
> _must_ be present in mChunks. When the chunk is inactive (all references
> other than the reference from mChunks were released) it is either put into
> mCachedChunks or destroyed. I.e. mChunks hashtable is driven by Release()
> and on the other hand _we_ decide when the cached chunk should be released,

I think with the proposal from comment 13 we still can have a control, maybe you got confused as I wrote that instead of delete this you will just do mChunks.Remove(Index()).  No, you would still call mFile->RemoveChunk(this) of course that may do whatever it wants.  Also, can addref the object again safely.

> so once we release a cached chunk the Release() method must just decrement
> the RefCnt and it must not do anything else.
> 
> Also please note that when all releases of an active chunk are released that
> doesn't necessarily mean that such chunk will be released or cached. 

Aha.  Good point.  Anyway, looking into CacheFile::RemoveChunk I can see the dirty write is started from there.  And it happens under the lock.  Note that my proposal means to NOT lock in CacheFileChunk::AddRef().  So, I presume the write process holds a hard ref (in this case it will go from 0 back to 1 - which in our case is perfectly alright) and will hold the chunk live again for the time of write, is that correct?  If so, then I don't see why not to use comment 13 proposal.

> If it
> is dirty, we initiate writing of this chunk to disk and the chunk _must_
> stay during this write in mChunks. The write process can fail synchronously
> or it can fail/succeed asynchronously. During initiation of this write
> process the chunk is addrefed/released several times and we must make sure
> we won't re-enter the CacheFile's lock.

The sync fail would be bad if you addref/realease the chunk under the lock, right, and it can happen when dispatch to the IO thread fails, what easily can in case of shutdown...  Could we just make it fail only asynchronously?

If this becomes too complicated then we may do something similar to CacheEntry referencing.  There is a handle given to consumers that has a side refcoutner to make any before-end-of-life decisions on the object.
(In reply to Michal Novotny (:michal) from comment #17)
> I just wanted to explain the difference between mChunks and mCachedChunks.

I know the difference, but this outline was good anyway.

> Chunk _might_ be deactivated (removed from mChunks) when all references are
> released, i.e. the impulse comes always from CacheFileChunk::Release(). OTOH
> chunks from mCachedChunks are removed by impulse from
> CacheFile::ThrowMemoryCachedData(), CacheFile::CleanUpPreloadedChunks() etc.
> and in this case we must not try to deactivate the chunk in
> CacheFileChunk::Release().

I don't much see an argument against the proposal.  

"we must not try to deactivate" - hmm.. question is if the chunk is added a ref (and potentially released) when in mCachedChunks only.  Is there such a case?  Anyway, we can easily branch by checking if the chunk is in mChunks (by GetWeak()).  If not, then it's not active, and we can just safely do nothing.

If the chunk is released from mCachedChunks (and then deleted) under the lock, we are still safe.
Attached patch patch v2 (obsolete) — Splinter Review
I've added a comment explaining why just re-posting CacheFileChunk::Release() from non-main thread to main thread fixes the problem.
Attachment #8428679 - Flags: review?(honzab.moz)
(In reply to Michal Novotny (:michal) from comment #6)
> The problem is that mRemovingChunk is accessed in CacheFileChunk::Release()
> after the refcnt is decreased. When the CacheFileChunk::Release() method is
> called on two threads at the same time, it is possible that the object is
> destroyed on one thread before we execute the condition on the second thread.

So this was happening:

T1                            T2
Release() // last external ref                    
{
  cnt = refcnt-- (= 1)
  !removing && cnt == 1:      
                              GetChunkLocked()
                              lock
                              AddRef(): refcnt++ (= 2)
                              unlock
                              Release()
                              {
                                cnt = refcnt-- (= 1)
                                !removing && cnt == 1:
     lock                       
     refcnt++ (= 2)
     removing = true
     mChunk.Remove() (= 1)
     unlock
                                   lock
     refcnt-- (= 0)
     delete this;
                                   AddRef() or whatever access -> CRASH!


Hence having a single thread Release() call will fix this bug.  Hopefully there is no other similar to this one just even more hidden...  I still don't like even the patched code but I don't see any other problem with it right now.

And the idea for referring chunks in the hashtable with nsAutoPtr would not help here either.
Comment on attachment 8428679 [details] [diff] [review]
patch v2

Review of attachment 8428679 [details] [diff] [review]:
-----------------------------------------------------------------

Please add AssertOwnsLock to CacheFile::RemoveChunkInternal

::: netwerk/cache2/CacheFile.cpp
@@ +1325,5 @@
>  
>          SetError(rv);
>          CacheFileIOManager::DoomFile(mHandle, nullptr);
>          return rv;
> +      } else {

else after return, btw

::: netwerk/cache2/CacheFileChunk.cpp
@@ +65,5 @@
>  {
> +  nsrefcnt count = mRefCnt - 1;
> +  if (DispatchRelease()) {
> +    // Redispatched to the main thread.
> +    return count;

nit: maybe just return mRefCnt - 1 or just mRefCnt, you will save atomic (expensive) access when already on the main thread.

@@ +79,5 @@
>      return 0;
>    }
>  
> +  // We can safely access this chunk after decreasing mRefCnt since we re-post
> +  // all calls to Release() from non-main thread to the main thread. I.e. no

from non-main threadS

or simply:

off main-thread

::: netwerk/cache2/CacheFileChunk.h
@@ +128,5 @@
>    uint32_t mIndex;
>    EState   mState;
>    nsresult mStatus;
>    bool     mIsDirty;
> +  bool     mActiveChunk;

please comment when this is actually set.  as I understand, mActiveChunk == true <=> mChunk.contains(the_chunk), right?  also a note about locking (that both these states change atomically)
Attachment #8428679 - Flags: review?(honzab.moz) → review+
Michal, any reason not to update/land this asap?
https://hg.mozilla.org/mozilla-central/rev/7430a9048a56
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
Depends on: 1019934
No longer depends on: 1019934
Depends on: 1022625
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: