Last Comment Bug 549767 - Write/flush cache metadata off main-thread
: Write/flush cache metadata off main-thread
Status: RESOLVED FIXED
: main-thread-io
Product: Core
Classification: Components
Component: Networking: Cache (show other bugs)
: unspecified
: All All
: -- normal with 9 votes (vote)
: mozilla6
Assigned To: Bjarne (:bjarne)
:
Mentors:
Depends on: 620660
Blocks: 513074 595574 645848
  Show dependency treegraph
 
Reported: 2010-03-02 16:01 PST by Robert Sayre
Modified: 2011-04-27 10:47 PDT (History)
37 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
.x+
wanted


Attachments
instruments screen grab (390.33 KB, image/png)
2010-03-02 16:01 PST, Robert Sayre
no flags Details
strick that, I can still see it happening. here's a stack (3.60 KB, text/plain)
2010-03-02 16:29 PST, Robert Sayre
no flags Details
Posting DeactivateEntry to the thread (3.03 KB, patch)
2010-03-10 05:32 PST, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Posting DeactivateEntry to the thread, waiting for threads to finish processing in Shutdown (6.01 KB, patch)
2010-03-14 06:50 PDT, Bjarne (:bjarne)
michal.novotny: review-
Details | Diff | Splinter Review
Updated : Post DeactivateEntry, wait for threads to finish processing in Shutdown (6.51 KB, patch)
2010-03-23 03:30 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Updated : Post DeactivateEntry, wait for deactivate-events in Shutdown (8.44 KB, patch)
2010-03-26 06:19 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Updated according to comments from reviewer ++ (8.88 KB, patch)
2010-05-07 05:51 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Modified to detach descriptors from cacheEntry if deleted in DeactivateEntry (11.34 KB, patch)
2010-06-08 07:31 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Modified as described in comment #77 (11.13 KB, patch)
2010-06-09 07:56 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Modified to also handle profile-shutdown (16.28 KB, patch)
2010-06-15 05:18 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Cleaned up and deleted code which disabled IPC (24.60 KB, patch)
2010-06-23 04:02 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
New, lighter approach to achieve the desired effect (5.27 KB, patch)
2010-08-19 07:35 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Handles race-condition described in comment #105 (10.24 KB, patch)
2010-09-08 05:38 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Carefully handling bindings... (10.84 KB, patch)
2010-09-14 11:45 PDT, Bjarne (:bjarne)
michal.novotny: review-
Details | Diff | Splinter Review
Carefully handling bindings, v1.1 (8.61 KB, patch)
2010-10-25 07:08 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Carefully handling bindings v1.2 (blocking on cache-io thread in dtor) (19.52 KB, patch)
2010-11-07 14:46 PST, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Carefully handling bindings version 1.3 (16.08 KB, patch)
2011-03-18 05:04 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Carefully handling bindings version 1.3 (14.12 KB, patch)
2011-04-07 12:50 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Carefully handling bindings version 1.3 (updated) (12.61 KB, patch)
2011-04-11 17:35 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Carefully handling bindings version 1.3 (updated) (12.48 KB, patch)
2011-04-11 17:46 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Carefully handling bindings version 1.3 (updated and unbitrotted) (12.50 KB, patch)
2011-04-11 18:36 PDT, Bjarne (:bjarne)
michal.novotny: review+
Details | Diff | Splinter Review

Description Robert Sayre 2010-03-02 16:01:08 PST
I saw this with a March 2nd m-c nightly. Will attach instruments data shortly.
Comment 1 Robert Sayre 2010-03-02 16:01:52 PST
Created attachment 429871 [details]
instruments screen grab
Comment 2 Robert Sayre 2010-03-02 16:04:51 PST
(also filing on the GetUSLayoutCharFromKeyTranslate thing pictured there... ho ho ho)
Comment 3 Robert Sayre 2010-03-02 16:25:55 PST
hmm, I can't reproduce this anymore, but sdwilsh said he saw the same thing in irc.
Comment 4 Robert Sayre 2010-03-02 16:29:20 PST
Created attachment 429876 [details]
strick that, I can still see it happening. here's a stack
Comment 5 Shawn Wilsher :sdwilsh 2010-03-02 16:36:34 PST
(In reply to comment #4)
> Created an attachment (id=429876) [details]
> strick that, I can still see it happening. here's a stack
Any comments on why we see writes on the main thread still Bjarne?
Comment 6 Boris Zbarsky [:bz] 2010-03-02 18:21:19 PST
So it looks like bug 513074 made the data writes (the actual cached content) happen on a worker thread.

The stack in comment 4 is writing cache entry metadata.

What I actually wonder is whether the DataSize() is correct a this point if there are still pending data writes on the worker thread...
Comment 7 Bjarne (:bjarne) 2010-03-04 08:04:12 PST
(In reply to comment #6)
> So it looks like bug 513074 made the data writes (the actual cached content)
> happen on a worker thread.

Well, yes... The general idea was to post cache-writes on a worker thread while reading from the source instead of writing to the cache synchronously.

Should we also try to write metadata asynchronously? It may be possible to achieve if we e.g. post the call to CloseCacheEntry() on the worker thread, but I'd need to experiment with it to tell for sure.

However, fwiw, keep in mind that if the caller set the UseCacheAsFile-flag (see e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=548406#c33) also data writes happens on the main-thread. Hence, depending on what is loaded during profiling, one may or may not see even data written on the main-thread. 

> What I actually wonder is whether the DataSize() is correct a this point if
> there are still pending data writes on the worker thread...

That's a good point. I guess there are no tests which would catch this...? It may be resolved by the fix suggested above though... :)
Comment 8 Michal Novotny (:michal) 2010-03-07 18:26:50 PST
(In reply to comment #6)
> What I actually wonder is whether the DataSize() is correct a this point if
> there are still pending data writes on the worker thread...

The DataSize() is always correct. It is updated after the call to nsCacheEntryDescriptor::nsOutputStreamWrapper::Write(). If the writes happen on the worker thread then the cache entry isn't closed in nsHttpChannel::CloseCacheEntry() because posted write events on the worker thread hold the reference to the entry. So the cache entry is closed (and also metadata is written) after the last write is done.
Comment 9 Bjarne (:bjarne) 2010-03-10 05:32:45 PST
Created attachment 431620 [details] [diff] [review]
Posting DeactivateEntry to the thread

Suggested way to write metadata on the cache-write thread. There is also a line to ensure mCacheEntry::mThread is initialized in the ctor (odd, really, that this was missing)

This seems to me as the appropriate place for this split : The cache-service does its own clean-up synchronously, but posts the call to device->DeactivateEntry(). We must avoid posting if we are shutting down (since the cache-service lock is held during shutdown), and there is no point to post the call to current-thread IMO.

Pushed to tryserver...
Comment 10 Shawn Wilsher :sdwilsh 2010-03-10 08:06:34 PST
Note that this is responsible for about 16% of our I/O on the main thread during normal browsing:
https://wiki.mozilla.org/Firefox/Goals/2010Q1/IO_Reduction
Comment 11 Bjarne (:bjarne) 2010-03-11 02:48:10 PST
Comment on attachment 431620 [details] [diff] [review]
Posting DeactivateEntry to the thread

Tryserver orange on WINNT 5.2 and Linux unit tests. Ok on Mac. I'll try to reproduce locally...
Comment 12 Bjarne (:bjarne) 2010-03-11 04:07:25 PST
Comment on attachment 431620 [details] [diff] [review]
Posting DeactivateEntry to the thread

Not able to reproduce any of the failures locally but I see some backouts on trunk due to oranges. Re-dispatching to tryserver...
Comment 13 Bjarne (:bjarne) 2010-03-12 01:36:51 PST
Comment on attachment 431620 [details] [diff] [review]
Posting DeactivateEntry to the thread

Still fails on tryserver. I should probably wait for the async call to finish before removing the devices in nsCacheService::Shutdown()...
Comment 14 Bjarne (:bjarne) 2010-03-13 07:42:10 PST
Mmmm...  no complete success on tryserver yet. I sometimes see runtime-msg "pure virtual method called", which most likely comes from using a dangling pointer. Trying to resolve, but cannot reproduce this locally...
Comment 15 Bjarne (:bjarne) 2010-03-14 06:50:42 PDT
Created attachment 432413 [details] [diff] [review]
Posting DeactivateEntry to the thread, waiting for threads to finish processing in Shutdown

Making sure the posted deactivations happens before devices are destroyed. Posted to tryserver...
Comment 16 Johnathan Nightingale [:johnath] 2010-03-16 10:16:54 PDT
(In reply to comment #15)
> Created an attachment (id=432413) [details]
> Posting DeactivateEntry to the thread, waiting for threads to finish processing
> in Shutdown
> 
> Making sure the posted deactivations happens before devices are destroyed.
> Posted to tryserver...

Bjarne - while we wait for review, does silence here suggest that tryserver was happy with it?
Comment 17 Bjarne (:bjarne) 2010-03-16 15:11:04 PDT
> Bjarne - while we wait for review, does silence here suggest that tryserver was
> happy with it?

Unfortunately, no. (It merely suggests that I was busy with a few other things and forgot to update status here.)

This one is on top of my list for tomorrow...
Comment 18 Michal Novotny (:michal) 2010-03-16 18:42:12 PDT
Comment on attachment 432413 [details] [diff] [review]
Posting DeactivateEntry to the thread, waiting for threads to finish processing in Shutdown

Sometimes the metadata is still written on the main thread even if the data is written on the worker thread. nsCacheEntry::mThread is set in nsCacheEntry::SetData(). For new cache entries this is called on the first write, so everything works as expected. But for existing entries SetData() is called when opening the entry and this is done on the main thread.


> +        // release its data-object.
> +        nsCacheServiceAutoLock lock;
> +        (void) mDevice->DeactivateEntry(mEntry);
> +        return NS_OK;

Check the return value and increase nsCacheService::mDeactivateFailures on failure.


> +      mThread(nsnull)

This isn't needed since mThread is nsCOMPtr.


> -    if (mInitialized) {

I don't think this can happen, nevertheless you should probably keep the condition.
Comment 19 Shawn Wilsher :sdwilsh 2010-03-16 23:03:50 PDT
(In reply to comment #18)
> > -    if (mInitialized) {
> 
> I don't think this can happen, nevertheless you should probably keep the
> condition.
drop it and make it an assertion?
Comment 20 Bjarne (:bjarne) 2010-03-17 07:10:56 PDT
(In reply to comment #18)
> (From update of attachment 432413 [details] [diff] [review])
> Sometimes the metadata is still written on the main thread even if the data is
> written on the worker thread. nsCacheEntry::mThread is set in
> nsCacheEntry::SetData(). For new cache entries this is called on the first
> write, so everything works as expected. But for existing entries SetData() is
> called when opening the entry and this is done on the main thread.

Metadata is also written for cache *read* requests...  Yes - I overlooked that one! Lemme' have a closer look at when deactivates are posted! (There is also potential for a similar problem as in bug #548406 I believe...)

> > +        // release its data-object.
> > +        nsCacheServiceAutoLock lock;
> > +        (void) mDevice->DeactivateEntry(mEntry);
> > +        return NS_OK;
> 
> Check the return value and increase nsCacheService::mDeactivateFailures on
> failure.

Fair enough.

> > +      mThread(nsnull)
> 
> This isn't needed since mThread is nsCOMPtr.

Very well... but for my curiosity : Why don't we risk garbage in an nsCOMPtr-member? Will statements like |if (this->mThread) {| always work correctly if |mThread| is an nsCOMPtr?

> > -    if (mInitialized) {
> 
> I don't think this can happen, nevertheless you should probably keep the
> condition.

Wouldn't the assertion handle the situation? Having both the assertion and the |if| seem a little redundant to me...?
Comment 21 Shawn Wilsher :sdwilsh 2010-03-17 08:15:22 PDT
(In reply to comment #20)
> Very well... but for my curiosity : Why don't we risk garbage in an
> nsCOMPtr-member? Will statements like |if (this->mThread) {| always work
> correctly if |mThread| is an nsCOMPtr?
nsCOMPtr's constructor sets it's member data to nsnull so you don't have to.
Comment 22 Bjarne (:bjarne) 2010-03-17 08:27:23 PDT
(In reply to comment #21)
> (In reply to comment #20)
> > Very well... but for my curiosity : Why don't we risk garbage in an
> > nsCOMPtr-member? Will statements like |if (this->mThread) {| always work
> > correctly if |mThread| is an nsCOMPtr?
> nsCOMPtr's constructor sets it's member data to nsnull so you don't have to.

Ok, but the member in question is |nsCacheEntry::mThread| . mThread is an nsCOMPtr-member in nsCacheEntry, but nsCacheEntry is not (even close to) a nsCOMPtr and is created in nsCacheEntry::Create() like this

nsCacheEntry* entry = new nsCacheEntry(newKey, streamBased, storagePolicy);

What I don't see is how the code |if (entry->mThread)| can be safe unless |nsCacheEntry::mThread| is explicitly initialized...
Comment 23 Benjamin Smedberg AWAY UNTIL 2-AUG-2016 [:bsmedberg] 2010-03-17 08:43:37 PDT
the nsCacheEntry constructor runs constructors for all its members, that's normal C++
Comment 24 Bjarne (:bjarne) 2010-03-17 08:56:49 PDT
Ah - ok. Thanks...  :)
Comment 25 Shawn Wilsher :sdwilsh 2010-03-17 09:45:48 PDT
(In reply to comment #22)
> nsCacheEntry* entry = new nsCacheEntry(newKey, streamBased, storagePolicy);
member variables are initialized during construction.
Comment 26 Michal Novotny (:michal) 2010-03-17 19:30:11 PDT
(In reply to comment #19)
> > > -    if (mInitialized) {
> drop it and make it an assertion?

The old code contains NS_ASSERTION as well.

(In reply to comment #20)
> Wouldn't the assertion handle the situation? Having both the assertion and the
> |if| seem a little redundant to me...?

But NS_ASSERTION does nothing in release build, so the code would probably crash.
Comment 27 Bjarne (:bjarne) 2010-03-18 03:36:17 PDT
> But NS_ASSERTION does nothing in release build, so the code would probably
> crash.

Well, then the |if| should definitely be kept. :)
Comment 28 Bjarne (:bjarne) 2010-03-18 08:03:11 PDT
I feel like summarizing (in writing) current thinking around this...  all input and corrections to understanding and/or analysis is highly appreciated!

Bottom line is that we want to let the device deactivate entries off the main-thread because metadata is typically written to disk as a consequence of deactivating and deleting a cache-entry in the disk-cache device. As pointed out by michal in comment #18, this also applies to cache-reads because these may also touch metadata for the entry. In other words, general access to the cache seems to result in writing metadata to disk.

When a cache-entry is read from or written to, data for the entry is set by calling SetData(). If the data set is non-null, nsCacheEntry maintains a member |mThread| pointing to the running thread.

(As a side-note, if a call to SetData(realData) happens prior to a call to SetData(nsnull) we have a weird situation in this respect since |mThread| is not cleared on the second call... maybe we should change this?)

Anyway, when nsCacheService::DeactivateEntry() is called, we have three possible states for nsCacheEntry::mThread

1) not set
2) set to current (main) thread (indicating sync cache-write or a cache-read)
3) set to some other thread (async cache-write implemented in bug #513074)

In the suggested patch, case (3) serializes deactivation of the entry with async writes from bug #513074 by posting to the cache-write thread. The implementation of (2) just deactivates the entry synchronously, resulting in writes on the main-thread (as noted in comment #0).

AFAICS (1) may indicate that there is a pending write (or read) which has not yet accessed its stream, that something (like the collision-detection mechanism in nsDiskCacheBindery::CreateBinding()) failed causing a call like |entry->SetData(nsnull)|, or simply that some component set nsICacheEntryDescriptor::cacheElement to null for whatever reason. The suggested patch implements (1) in the same way as (2) (as a side-effect, not intentionally).

In essence I see two issues here (implementation of (3) is IMO ok) :

First, implementation of (2) must be improved so that possible metadata-writes happens off the main-thread. This can be done by passing the cache-write thread e.g. via nsICacheEntryDescriptor and post deactivation of the entry to this thread. This includes adding a field to a public interface and set it e.g. when the descriptor is created in nsHttpChannel (or it can be added to nsICacheSession and handled by the session).

Second, (1) must be handled properly. In the cases where (1) is caused by a pending request, fixing bug #548406 is IMO a prerequisite (i.e. properly deal with binding doomed entries). Apart from this, we can IMO treat it like (2) above.

I'll implement and test some of this while waiting for comments...
Comment 29 Bjarne (:bjarne) 2010-03-18 08:19:04 PDT
I haven't followed the work of making cache-reads async, but it occurred to me that if read-requests are handled on a separate thread also, deactivating such entries will probably end up in case (3) above also, no?
Comment 30 Michal Novotny (:michal) 2010-03-18 09:33:28 PDT
(In reply to comment #29)
> I haven't followed the work of making cache-reads async, but it occurred to me
> that if read-requests are handled on a separate thread also, deactivating such
> entries will probably end up in case (3) above also, no?

I think you're right. Fixing #513008 (I'm working on it right now) will eliminate case (2).
Comment 31 Bjarne (:bjarne) 2010-03-19 07:31:08 PDT
If this turns out to be right, we could just wait for solutions to bug #513008 and bug #548406 and keep the suggested solution for (3) here, no?
Comment 32 Shawn Wilsher :sdwilsh 2010-03-22 09:49:16 PDT
(In reply to comment #26)
> But NS_ASSERTION does nothing in release build, so the code would probably
> crash.
But we never expect this to be true, right?  We have assertions all over the code base that check that something isn't null and then just use it without checking nullness.

(In reply to comment #31)
> If this turns out to be right, we could just wait for solutions to bug #513008
> and bug #548406 and keep the suggested solution for (3) here, no?
Do we need to wait for those bugs before we can have a patch up here?
Comment 33 Bjarne (:bjarne) 2010-03-22 15:45:52 PDT
> (In reply to comment #31)
> > If this turns out to be right, we could just wait for solutions to bug #513008
> > and bug #548406 and keep the suggested solution for (3) here, no?
> Do we need to wait for those bugs before we can have a patch up here?

Fixing those bugs should eliminate cases (1) and (2) (comment #28) and the suggested patch handles case (3). If we want a fix for case (3) only I can whip it up based on the suggested patch - I see no reason for extending the interface(s) to handle case (2) if it will go away in the near future anyway.
Comment 34 Shawn Wilsher :sdwilsh 2010-03-22 15:57:01 PDT
(In reply to comment #33)
> Fixing those bugs should eliminate cases (1) and (2) (comment #28) and the
> suggested patch handles case (3). If we want a fix for case (3) only I can whip
> it up based on the suggested patch - I see no reason for extending the
> interface(s) to handle case (2) if it will go away in the near future anyway.
I'd like to get as much as we can fix done right away, so getting at least case (3) fixed will be a big help.
Comment 35 Bjarne (:bjarne) 2010-03-23 03:30:42 PDT
Created attachment 434202 [details] [diff] [review]
Updated : Post DeactivateEntry, wait for threads to finish processing in Shutdown

Updated according to comments from reviewer and cleaned up to deliberately only handle "case (3)" entries (i.e. entries which has attached a thread different from main-thread). Pushed to tryserver.
Comment 36 Bjarne (:bjarne) 2010-03-26 06:19:30 PDT
Created attachment 435149 [details] [diff] [review]
Updated : Post DeactivateEntry, wait for deactivate-events in Shutdown

Passes tryserver.

This patch writes metadata for any cache-entry on the thread which accessed the entry the last time. I.e. it should also deal with metadata for async cache-reads when these are implemented. Metadata for cache-entries which are read/written on main-thread are still written on main-thread.

The concern raised in comment #28, 4th paragraph is still unanswered.
 
Moreover, I believe this is a dup of bug #537934 (or vice-versa).

I'd also suggest to rename this bug to "Write/flush cache metadata off main-thread" or something similar to better reflect the real issue.
Comment 37 Shawn Wilsher :sdwilsh 2010-03-26 11:02:21 PDT
(In reply to comment #36)
> The concern raised in comment #28, 4th paragraph is still unanswered.
michal, can you respond to this?

> Moreover, I believe this is a dup of bug #537934 (or vice-versa).
We should probably forward dupe it.
Comment 38 Shawn Wilsher :sdwilsh 2010-03-26 11:09:43 PDT
(In reply to comment #28)
> Second, (1) must be handled properly. In the cases where (1) is caused by a
> pending request, fixing bug #548406 is IMO a prerequisite (i.e. properly deal
> with binding doomed entries). Apart from this, we can IMO treat it like (2)
> above.
Is this the right bug number?
Comment 39 Bjarne (:bjarne) 2010-03-26 15:52:31 PDT
(In reply to comment #38)
> (In reply to comment #28)
> > Second, (1) must be handled properly. In the cases where (1) is caused by a
> > pending request, fixing bug #548406 is IMO a prerequisite (i.e. properly deal
> > with binding doomed entries). Apart from this, we can IMO treat it like (2)
> > above.
> Is this the right bug number?

Yes. Bug #548406 is caused by pending cache-write-requests. If these for some reason are deactivated before executed (e.g. due to another write-request to the same entry) the entries will not have a thread attached at deactivation-time, i.e. case (1).
Comment 40 Michal Novotny (:michal) 2010-03-31 06:42:03 PDT
(In reply to comment #37)
> (In reply to comment #36)
> > The concern raised in comment #28, 4th paragraph is still unanswered.
> michal, can you respond to this?

Yes, we should probably null out mThread as well.

Hmm, cacheElement is a public attribute of the cache entry, but we use it internally to hold the reference to the binding. The question here is what will happen if someone try to use it for his own purpose...
Comment 41 Michal Novotny (:michal) 2010-03-31 08:15:44 PDT
Comment on attachment 435149 [details] [diff] [review]
Updated : Post DeactivateEntry, wait for deactivate-events in Shutdown

Sorry for the delay, I was experiencing a crash when this patch and the patch from #513008 were applied, so I was inspecting it. Regarding my comment #40 I don't think it's a good idea to use mThread for the decision on which thread to deactivate the entry. IMHO we should either remove cacheElement attribute from the IDL if it isn't needed, or add a new thread variable for our purpose.
Comment 42 Michal Novotny (:michal) 2010-04-01 04:51:03 PDT
(In reply to comment #36)
> +  // (The dispatched event must post some event back to this thread to ensure
> +  // the event-loop-spin which implements NS_DISPATCH_SYNC doesn't starve -
> +  // hence the parameter)

Why exactly is this needed?
Comment 43 Bjarne (:bjarne) 2010-04-04 14:06:28 PDT
>Regarding my comment #40 I
> don't think it's a good idea to use mThread for the decision on which thread to
> deactivate the entry. IMHO we should either remove cacheElement attribute from
> the IDL if it isn't needed, or add a new thread variable for our purpose.

Make cacheElement or the mThread "non-scriptable" in the IDL? (Assuming you are worried about scripts accessing this?)

(In reply to comment #42)
> (In reply to comment #36)
> > +  // (The dispatched event must post some event back to this thread to ensure
> > +  // the event-loop-spin which implements NS_DISPATCH_SYNC doesn't starve -
> > +  // hence the parameter)
> 
> Why exactly is this needed?

The code which implements synchronous event-dispatching spins the event-loop for the calling thread until the dispatched event has finished. If the event-queue for the calling thread is exhausted before the dispatched event runs, it *looks like* it keeps spinning until an event is dispatched to the calling thread. So, by posting an (empty) event back to the calling thread I make sure the calling thread receives an event and breaks out of the spin.
Comment 44 Shawn Wilsher :sdwilsh 2010-04-04 15:59:08 PDT
(In reply to comment #43)
> Make cacheElement or the mThread "non-scriptable" in the IDL? (Assuming you are
> worried about scripts accessing this?)
C++ could still access it then (think add-ons).
Comment 45 Mike Shaver (:shaver -- probably not reading bugmail closely) 2010-04-04 16:33:11 PDT
Speaking only for myself, I'm willing to break all 0.05 add-ons that want to mess around with this stuff from C++, and give them use-case-specific APIs if their authors ask.
Comment 46 Michal Novotny (:michal) 2010-04-08 14:55:14 PDT
I'm not sure about the reason why this attribute exists, but it doesn't seem to be used by our code (only some tests use it). And I could find only one addon that uses it (Bugmail for Thunderbird https://mxr.mozilla.org/addons/source/9584/chrome/content/overlay.js#75). So can we simply remove the attribute or do we need to keep it and create a new member variable in nsCacheEntry for our internal purposes?
Comment 47 Michal Novotny (:michal) 2010-04-09 07:24:24 PDT
(In reply to comment #43)
> The code which implements synchronous event-dispatching spins the event-loop
> for the calling thread until the dispatched event has finished. If the
> event-queue for the calling thread is exhausted before the dispatched event
> runs, it *looks like* it keeps spinning until an event is dispatched to the
> calling thread. So, by posting an (empty) event back to the calling thread I
> make sure the calling thread receives an event and breaks out of the spin.

This shouldn't happen and IMHO if it happens it is a bug and it should be fixed in nsThread. BTW what you're trying to do is exactly what nsThreadSyncDispatch::Run() already does.
Comment 48 Bjarne (:bjarne) 2010-04-09 16:06:38 PDT
(In reply to comment #47)
> This shouldn't happen and IMHO if it happens it is a bug and it should be fixed
> in nsThread. BTW what you're trying to do is exactly what
> nsThreadSyncDispatch::Run() already does.

You're right on both, of-course. But the fact remains : The ack made the patch pass the tryserver (tests were hanging without it). I'll dig a little deeper on this next week and re-run the patch a few times to make sure it's robust in this sense.

As a side-note, there are very few places where sync-dispatching is used at all (ogg- and wave-decoders, gnomevfs protocolhandler and then in TestProxies). Not entirely sure what TestProxies do, but I'll figure out next week...
Comment 49 Shawn Wilsher :sdwilsh 2010-05-03 12:18:17 PDT
What is the status of this bug?
Comment 50 Bjarne (:bjarne) 2010-05-07 05:51:03 PDT
Created attachment 444071 [details] [diff] [review]
Updated according to comments from reviewer ++

This has been on the list for too long... here's an updated patch.

(In reply to comment #37)
> > Moreover, I believe this is a dup of bug #537934 (or vice-versa).
> We should probably forward dupe it.

Uncertain about terminology...  should I mark 537934 a duplicate of this bug?

(In reply to comment #40)
> Yes, we should probably null out mThread as well.

Ok - done.

> Hmm, cacheElement is a public attribute of the cache entry, but we use it
> internally to hold the reference to the binding. The question here is what will
> happen if someone try to use it for his own purpose...

Well, |cacheElement| only provides access to the data of the element, not the nsCacheEntry-object itself. Setting the attribute sets the data and stores a reference to the calling thread in |nsCacheEntry::mThread|, but apart from storing this reference I don't see any problems with |cacheElement| being public (please correct me if I'm missing something!).

The lifecycle of |nsCacheEntry::mThread| may thus be outside our control (e.g. it may be created and destroyed by an addon). Which means that when we try to deactivate the entry using |nsCacheEntry::mThread| it may fail miserably. I've added code to deactivate the entry on main-thread if dispatching fails. Should we be even more careful? Is there a util-method to tell if a thread is still valid? Can we tell somehow if the thread is out of our control and avoid storing it in |nsCacheEntry::mThread| in this case?

Or the simplest solution is of-course to remove the attribute completely... :) It is used in tests, like michal says, to verify that the cache returns what is stored, but there may be other ways to verify that.

> (In reply to comment #43)
> This shouldn't happen and IMHO if it happens it is a bug and it should be fixed
> in nsThread. BTW what you're trying to do is exactly what
> nsThreadSyncDispatch::Run() already does.

I've removed this from the current patch. Pushed to tryserver for verification (again).
Comment 51 Bjarne (:bjarne) 2010-05-10 03:15:04 PDT
Patch passes tryserver on Linux and OSX. Fails one reftest on WinNT and encounters a number of 

###!!! [Parent][RPCChannel] Error: Channel error: cannot send/recv

which results in a crash, logged as

PROCESS-CRASH | automation.py | application crashed (minidump found)

(Brief log can be found at http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1273235252.1273245260.13979.gz )

The reftest-failure seems completely unrelated to the patch, and I can see these types of crashes in other tryserver-runs, hence I'd claim that patch passes tryserver. Waiting for review.
Comment 52 Shawn Wilsher :sdwilsh 2010-05-10 07:38:12 PDT
(In reply to comment #51)
> ###!!! [Parent][RPCChannel] Error: Channel error: cannot send/recv
Did you build with electrolysis?  It sounds like it's trying to to do ipc stuff there.
Comment 53 Bjarne (:bjarne) 2010-05-10 08:33:11 PDT
That would probably be a bummer, yes... I've not enabled that intentionally, and 'hg identify' returns

0459345d8026 tip

Anything else I can check to verify/falsify? (Patch did apply cleanly on tryserver "mozilla-central" repository...)
Comment 54 Shawn Wilsher :sdwilsh 2010-05-10 09:39:26 PDT
(In reply to comment #53)
> Anything else I can check to verify/falsify? (Patch did apply cleanly on
> tryserver "mozilla-central" repository...)
Ask in #content?  I'm not sure offhand.
Comment 55 Bjarne (:bjarne) 2010-05-12 02:53:22 PDT
I've disabled electrolysis at runtime in this patch by setting pref 'dom.ipc.plugins.enabled' to false everywhere in "browser/app/profile/firefox.js". Pushed to tryserver again...
Comment 56 Bjarne (:bjarne) 2010-05-12 06:41:20 PDT
Patch as described above passes tryserver for OSX but encounters bug #562886 on Linux and bug #538616 on WinNT (with electrolysis disabled though, which is interesting).

Link to WinNT log : http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1273657708.1273667985.641.gz

Link to Linux log : http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1273657708.1273667868.32729.gz

Patch is identical to the patch attached in comment #50 (with the exception of setting "dom.ipc.plugins.enabled" to false). Hence, I claim again that it passes tryserver. Waiting for review.
Comment 57 Bjarne (:bjarne) 2010-05-12 07:14:17 PDT
Hmm....  I was mistaken about OSX. I confused email about building successfully on OSX with successfully running unit tests on OSX. Finally, the email with 2 warnings in OSX unit tests came along...

I have to look closer at it later - cannot reproduce any of those issues locally. Review would still be valuable, I believe.

Log for OSX is : http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1273657708.1273671840.19635.gz
Comment 58 Bjarne (:bjarne) 2010-05-12 08:48:57 PDT
Ohh - if it's feasible to run the DTrace-script which defines the criteria for IO-reduction, please direct me to instructions for how to do so. Or maybe sdwilsh wants to do that? Even without the async cache-read patch we should see some effect...
Comment 59 Shawn Wilsher :sdwilsh 2010-05-12 09:51:52 PDT
(In reply to comment #58)
> Ohh - if it's feasible to run the DTrace-script which defines the criteria for
> IO-reduction, please direct me to instructions for how to do so. Or maybe
> sdwilsh wants to do that? Even without the async cache-read patch we should see
> some effect...
I documented this in the wiki :)
https://wiki.mozilla.org/Firefox/Goals/2010Q1/IO_Reduction_Criteria
Comment 60 Shawn Wilsher :sdwilsh 2010-05-12 09:54:20 PDT
(In reply to comment #57)
> I have to look closer at it later - cannot reproduce any of those issues
> locally. Review would still be valuable, I believe.
The failure is from thread 5 crashing (EXC_BAD_ACCESS).  Backtrace looks like this:
 0  XUL + 0x73721
    eip = 0x02281721   esp = 0xb030adc0   ebp = 0xb030ade8   ebx = 0x0228170d
    esi = 0x0020002a   edi = 0x00dd3d60   eax = 0x00000000   ecx = 0xb030ab4c
    edx = 0xb030addc   efl = 0x00010202
    Found by: given as instruction pointer in context
 1  XUL + 0x7188d
    eip = 0x0227f88e   esp = 0xb030adf0   ebp = 0xb030ae18
    Found by: previous frame's frame pointer
 2  XUL!nsMemoryCacheDevice::DeactivateEntry(nsCacheEntry*) [nsMemoryCacheDevice.cpp:2108d2458719 : 170 + 0x7]
    eip = 0x022870c5   esp = 0xb030ae20   ebp = 0xb030ae48
    Found by: previous frame's frame pointer
 3  XUL!nsCacheServiceDeactivateEntryEvent::Run() [nsCacheService.cpp:2108d2458719 : 1876 + 0x11]
    eip = 0x022860f8   esp = 0xb030ae50   ebp = 0xb030ae68
    Found by: previous frame's frame pointer
 4  XUL!nsThread::ProcessNextEvent(int, int*) [nsThread.cpp:2108d2458719 : 527 + 0x7]
    eip = 0x02ef263d   esp = 0xb030ae70   ebp = 0xb030aeb8
    Found by: previous frame's frame pointer
 5  XUL!NS_ProcessNextEvent_P(nsIThread*, int) [nsThreadUtils.cpp : 250 + 0x15]
    eip = 0x02eaf5ba   esp = 0xb030aec0   ebp = 0xb030aee8
    Found by: previous frame's frame pointer
 6  XUL!nsThread::ThreadFunc(void*) [nsThread.cpp:2108d2458719 : 254 + 0xf]
    eip = 0x02ef27f0   esp = 0xb030aef0   ebp = 0xb030af38
    Found by: previous frame's frame pointer
 7  libnspr4.dylib!_pt_root [ptthread.c:2108d2458719 : 228 + 0x8]
    eip = 0x0005f892   esp = 0xb030af40   ebp = 0xb030af78
    Found by: previous frame's frame pointer
 8  libSystem.B.dylib + 0x31c54
    eip = 0x90037c55   esp = 0xb030af80   ebp = 0xb030afc8
    Found by: previous frame's frame pointer
 9  libSystem.B.dylib + 0x31b11
    eip = 0x90037b12   esp = 0xb030afd0   ebp = 0xb030afec
    Found by: previous frame's frame pointer
Comment 61 Shawn Wilsher :sdwilsh 2010-05-12 09:54:50 PDT
also see this just before the crash:
xpcshell(30795,0xb030b000) malloc: *** error for object 0xdd0006: Non-aligned pointer being freed
*** set a breakpoint in malloc_error_break to debug
Comment 62 Bjarne (:bjarne) 2010-05-14 07:57:56 PDT
> > IO-reduction, please direct me to instructions for how to do so. Or maybe
> > sdwilsh wants to do that? Even without the async cache-read patch we should see
> > some effect...
> I documented this in the wiki :)
> https://wiki.mozilla.org/Firefox/Goals/2010Q1/IO_Reduction_Criteria

Dtrace does not seem to be available for my normal development-box (Ubuntu Linux) but I'm planning to get my hands on a Mac-box. I assume that would help?

(In reply to comment #60)
> The failure is from thread 5 crashing (EXC_BAD_ACCESS).  Backtrace looks like
> this:
>  0  XUL + 0x73721
>     eip = 0x02281721   esp = 0xb030adc0   ebp = 0xb030ade8   ebx = 0x0228170d
>     esi = 0x0020002a   edi = 0x00dd3d60   eax = 0x00000000   ecx = 0xb030ab4c
>     edx = 0xb030addc   efl = 0x00010202
>     Found by: given as instruction pointer in context
>  1  XUL + 0x7188d
>     eip = 0x0227f88e   esp = 0xb030adf0   ebp = 0xb030ae18
>     Found by: previous frame's frame pointer
>  2  XUL!nsMemoryCacheDevice::DeactivateEntry(nsCacheEntry*)
> [nsMemoryCacheDevice.cpp:2108d2458719 : 170 + 0x7]
>     eip = 0x022870c5   esp = 0xb030ae20   ebp = 0xb030ae48
>     Found by: previous frame's frame pointer

  [ snip ]

Thanks a bundle! This is certainly relevant to the patch. However, I'm wondering if maybe the currently suggested patch for bug #548406 may help here : it cleans up the sequence of removing entries from memory-cache, which is where this crash happens.

The suggested patch for this bug changes timing of deactivating / removing entries, and a cleaner sequence for this in the memory-cache cannot hurt. I'll submit a combined patch to the tryserver.
Comment 63 Shawn Wilsher :sdwilsh 2010-05-14 09:49:58 PDT
(In reply to comment #62)
> Dtrace does not seem to be available for my normal development-box (Ubuntu
> Linux) but I'm planning to get my hands on a Mac-box. I assume that would help?
Very likely.  I was using it on a mac, don't know if that script would work on Linux.
Comment 64 Bjarne (:bjarne) 2010-05-18 03:26:20 PDT
> The suggested patch for this bug changes timing of deactivating / removing
> entries, and a cleaner sequence for this in the memory-cache cannot hurt. I'll
> submit a combined patch to the tryserver.

Results from tryserver for combined patch for this bug with patch for bug #548406 :

1) Combined patch fails with bug #562886 on Linux.
2) On WinNT it fails with the error described in bug #548406, comment #323.
3) On all three platforms it fails in chrome://mochikit/content/browser/browser/components/sessionstore/test/browser/browser_522545.js

Analysis

(1) Is a known issue on Linux and is unrelated to this patch
(2) Indicates that the patch for bug #548406 introduces/triggers some issue in the application-cache. Must be resolved.
(3) Can be reproduced locally both with and without the patch. However, when using the latest version from trunk it passes locally without and with the patch. I'll ignore this issue for now.

Conclusion

Need to figure out what causes (2) above (failure in "test_simpleManifest.html")
Comment 65 Bjarne (:bjarne) 2010-05-18 07:15:33 PDT
> Need to figure out what causes (2) above (failure in
> "test_simpleManifest.html")

Cannot reproduce locally (yet). An observation is that the listener for the app-cache is installed before the ordinary cache-listener, hence when the ordinary cache-listener becomes an async tee, the app-cache will also be written asynchronously on the cache-write thread (since these tees are chained with the last one installed first in the list).
Comment 66 Bjarne (:bjarne) 2010-05-19 04:58:58 PDT
Now I see what probably causes the problem (ref also comment above).

The failing check in "test_simpleManifest.js" uses the attribute "mozLength" which is derived from the list of entries gathered from the cache-device. Before performing this check, the test waits for the cache-entry to become available using nsCacheService::OpenCacheEntry() which calls nsCacheService::ActivateCache(). ActivateCache() may find the cache-entry in mActiveEntries *before* the entry is bound to the device, hence there is a time-window from the cache-entry is found and until the cache-device knows the entry. This time-window is also the root of bug #548406, however the fix for bug #548406 is not to eliminate the time-window but rather to handle it properly. I'll find a way to handle it here also.
Comment 67 Bjarne (:bjarne) 2010-05-19 06:45:57 PDT
Ahh...  can this be bug #484123?

I've uploaded the patch from comment #64 to tryserver including a small fix to the test to ensure the time-window is properly handled.
Comment 68 Johnny Stenback (:jst, jst@mozilla.com) 2010-05-19 17:23:38 PDT
I think we want this for 1.9.3.
Comment 69 Shawn Wilsher :sdwilsh 2010-05-19 18:03:15 PDT
(In reply to comment #68)
> I think we want this for 1.9.3.
I think we *need* it ;)
Comment 70 Bjarne (:bjarne) 2010-05-20 01:01:17 PDT
(In reply to comment #67)
> I've uploaded the patch from comment #64 to tryserver including a small fix to
> the test to ensure the time-window is properly handled.

Results from tryserver contain too many of these to make sense

###!!! [Parent][RPCChannel] Error: Channel error: cannot send/recv

followed by a crash. I (accidentally) re-enabled the electrolysis-stuff in this version, so I've disabled it again and resubmitted to tryserver...
Comment 71 Bjarne (:bjarne) 2010-05-21 02:08:59 PDT
(In reply to comment #70)
> ... , so I've disabled it again and resubmitted to tryserver...

Results from tryserver

OSX   : ok
Linux : hits bug #562866, plus a crash in test_updatecheck.js
WinNT : hits bug #538616

The crash on Linux in test_updatecheck.js shows this log

Crash reason:  SIGSEGV
Crash address: 0x1100

Thread 5 (crashed)
 0  libxul.so!nsDiskCacheMap::DeleteRecord(nsDiskCacheRecord*) [nsDiskCacheMap.h:962135366d6b : 139 + 0x0]
    eip = 0x4026de26   esp = 0x4270127c   ebp = 0x0a92b308   ebx = 0x40ff6974
    esi = 0x0fb6aab1   edi = 0x00001100   eax = 0x00000010   ecx = 0x00000000
    edx = 0x00001100   efl = 0x00010246
    Found by: given as instruction pointer in context

It can certainly be related to this patch (it posts deactivation on the device to another thread) and I cannot find any existing bugs mentioning this. Will have to look closer at this issue.
Comment 72 Bjarne (:bjarne) 2010-05-21 07:38:10 PDT
(In reply to comment #71)
> Linux : hits bug #562866, plus a crash in test_updatecheck.js

Sorry - should be :

Linux : hits bug #562886, plus a crash in test_updatecheck.js
Comment 73 Bjarne (:bjarne) 2010-06-08 07:31:15 PDT
Created attachment 449863 [details] [diff] [review]
Modified to detach descriptors from cacheEntry if deleted in DeactivateEntry

An nsCacheEntry may actually have descriptors attached when we delete it in nsCacheService::DeactivateEntry(), so in this version we make sure to clear these before deleting the entry. Failing to do so may have caused the crashes in comment #62 and comment #71.

(It passes local testing - just waiting to get push-to-tryserver access so that it can be tested properly...)
Comment 74 :Ehsan Akhgari (away Aug 1-5) 2010-06-08 08:50:18 PDT
Comment on attachment 449863 [details] [diff] [review]
Modified to detach descriptors from cacheEntry if deleted in DeactivateEntry

I've pushed this patch to the try server on your behalf (changeset 3f7a8816a26f).

You can see the results (in a few hours/days/months/centuries) on <http://tests.themasta.com/tinderboxpushlog/?tree=MozillaTry>.  :-)
Comment 75 Shawn Wilsher :sdwilsh 2010-06-08 13:17:49 PDT
Try server builds look good.  Note that we have ten days until the beta code freeze, and this is blocking.
Comment 76 Bjarne (:bjarne) 2010-06-09 05:01:21 PDT
(In reply to comment #74)
> I've pushed this patch to the try server on your behalf (changeset
> 3f7a8816a26f).

Thanks! Appreciate it!

Note that the attached patch explicitly disables IPC globally by changing browser/app/profile.firefox.js. Should probably be stripped from the final patch.

(In reply to comment #75)
> Try server builds look good.  Note that we have ten days until the beta code
> freeze, and this is blocking.

Lots of things can be done in ten days. :)

try-mac-slave15 is orange, but so are most builds before and after
try-mac-slave08 is orange, the two subsequent builds have the exact same error
try-mac-slave24 is orange, the builds before and after have same errors
try-win32-slave01 is orange, the builds before and after have same errors
try-win32-slave17 is orange, suspicious-looking crash!
try-win32-slave19 is orange, Java RT died

otherwise all is green.

I'll have a look at the issue on try-win32-slave17 which seems relevant to this patch. (Hopefully, the Java-crash is a coincidence although I cannot find any bugs mentioning it.)
Comment 77 Bjarne (:bjarne) 2010-06-09 07:53:19 PDT
(In reply to comment #76)
> I'll have a look at the issue on try-win32-slave17 which seems relevant to this
> patch.

Seems like the same issue as seen in comment #71.

nsCacheEntry-objects are actually deleted in a number of places so I'll detach descriptors (ref comment #73) in the dtor to block pending operations on the entry.
Comment 78 Bjarne (:bjarne) 2010-06-09 07:56:23 PDT
Created attachment 450116 [details] [diff] [review]
Modified as described in comment #77

I'm still waiting for permission to push to tryserver, so if anyone would like to push it for me I'd be grateful. :)
Comment 79 Johnathan Nightingale [:johnath] 2010-06-09 08:00:38 PDT
(In reply to comment #78)
> Created an attachment (id=450116) [details]
> Modified as described in comment #77
> 
> I'm still waiting for permission to push to tryserver, so if anyone would like
> to push it for me I'd be grateful. :)

[Aside: bug number for your request? Level 1 access should not take that long.]
Comment 80 Johnny Stenback (:jst, jst@mozilla.com) 2010-06-09 09:13:44 PDT
Bug 570475, and I think the reason for the delay is that the magic token message saying it's ready wasn't added, so I just added that.

I just pushed the latest patch to the try server, commit id is 828551107c36.
Comment 81 :Ehsan Akhgari (away Aug 1-5) 2010-06-09 17:54:27 PDT
I've pushed a try server run with this patch, and I just saw comment 78.  Bjarne, feel free to watch over it (changeset 27d8113d68f7 at Wed Jun 9 17:36:00 2010 titled: imported patch attachment.cgi?id=450116).
Comment 82 :Ehsan Akhgari (away Aug 1-5) 2010-06-09 17:56:00 PDT
And I just saw comment 80!  Well, now you have two try server runs!  :-)
Comment 83 Bjarne (:bjarne) 2010-06-10 01:55:55 PDT
(In reply to comment #80)
> Bug 570475, and I think the reason for the delay is that the magic token
> message saying it's ready wasn't added, so I just added that.

Thanks! I'm not too familiar with these formal procedures...

> I just pushed the latest patch to the try server, commit id is 828551107c36.

try-win32-slave10 (opt, mochitests-1/5): fails with bug #565245
try-win32-slave29: Java RT died
try-win32-slave10 (opt, xpcshell): fails with bug #538616
try-linux-slave13 (debug, reftest): fails reftest/tests/layout/reftests/svg/dynamic-use-nested-01.svg
try-linux-slave18 (debug, crashtest): fails with assertion count for crashtest 432114-2.html
Comment 84 :Ehsan Akhgari (away Aug 1-5) 2010-06-10 08:15:11 PDT
Bjarne, since you have two try server pushes with the same patch, you could compare their results, specifically for the second one below.

The dynamic-use-nested-01.svg failure was bug 571036 which is fixed now.  On the assertion count failure, you need to check the actual assertions triggered.  If they're not related to the cache code, this is probably an intermittent failure, which you should file a bug for.  :-)
Comment 85 Bjarne (:bjarne) 2010-06-11 00:44:16 PDT
Seems like my previous comment was submitted to early...  here's the rest.

> I just pushed the latest patch to the try server, commit id is 828551107c36.

try-w32-slave04 (debug, crashtest): application timeout near crashtests/386680-1.html
try-w32-slave10 (opt, xpcshell): fails with bug #538616
try-w32-slave10 (opt, mochitests-1/5): fails with bug #565245
try-w32-slave10 (debug, mochitests-2/5): fails with bug #492232
try-w32-slave18 (debug, mochitests-3/5): LEAK
try-w32-slave25 (debug, jsreftest): fails with wrong assertion count for regress-344804.js and regress-181834.js
try-w32-slave29: Java RT died

try-linux-slave02 (maemo build): failed, but previous builds also failed
try-linux-slave13 (debug, reftest): fails reftest/tests/layout/reftests/svg/dynamic-use-nested-01.svg
try-linux-slave18 (debug, crashtest): fails with wrong assertion count for crashtest 432114-2.html
try-linux-slave22 (opt, mochitests-other): some failures, starting with bug #539283

try-mac-slave24 (opt, mochitest/other): fails with bug #567950
try-mac-slave28 (opt, reftest): crash in ValueToNumberSlow() while processing an nsInputStreamReadyEvent - not sure whether it's relevant
OSX 10.6.2 (talos tp4): Busted, but build prior to this was also busted
try-mac64-slave09 (OSX 10.6.2 leak test): fails like all builds before and after

Phew...

(In reply to comment #81)
> Bjarne, feel free to watch over it (changeset 27d8113d68f7 at Wed Jun 9

Thanks again. :) 

try-w32-slave09 (opt, xpcshell): fails with bug #538616 (like above)
try-w32-slave16 (opt, mochitests-1/5): fails with bug #565245 (like above)
try-linux-slave11 (debug, reftest): fails in "dynamic-use-nested-01.svg" (like above)
try-w32-slave26 (debug, jsreftest): fails with wrong assertion count for regress-344804.js and regress-181834.js (like above)

try-linux-slave08 (opt, xpcshell): crash in "test_bug_406857.js" - very much like the one in comment #71
try-linux-slave21 (opt, mochitest-other): some failures, first in "browser_sidebarpanels_click.js" which probably is bug #539283 (like above)
try-linux-slave25 (opt, reftest): fails reftest/tests/layout/reftests/svg/dynamic-use-nested-01.svg (like above for debug-build)

+ several more failures

The interesting (and relevant) thing is the (Linux, opt, xpcshell) crash on try-linux-slave08 which looks similar to comment #71. The latest fix apparently did not resolve this, so I'll have a look at it again.

It seems to be triggered by accessing a field in a nsDiskCacheMap-object which, most likely, is deleted:

http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheMap.h#139

probably called from

http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheMap.cpp#478

and I have also seen a stack pointing to

http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheMap.cpp#484
Comment 86 Bjarne (:bjarne) 2010-06-11 08:29:14 PDT
The interesting crash happens in this context

1. nsDiskCacheDevice::DeactivateEntry() is called off main-thread
2. The entry is not doomed...
3. ...hence DeactivateEntry() tries to write the entry to disk, but fails
4. ...and crashes in mCacheMap::DeleteRecord() when cleaning up

Stack is here http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1276132562.1276133440.25820.gz#err1

Looks like the nsDiskCacheRecord-param to mCacheMap::DeleteRecord() is corrupted, causing the crash. Questions:

- why would writing the cache entry fail in the first place?
- how can the param to nsDiskCacheMap::DeleteRecord() be corrupted?

I'm a little stuck at the moment. I don't see what can happen in the timeslot between posting the deactivate-event and running it, which causes writing the entry to disk to fail (or corrupt the param to DeleteRecord for that matter).

Any suggestions are welcome, as well as any ideas for how to set up a test to trigger this. (I'm not able to reproduce.) Will look into this over the weekend again.
Comment 87 :Ehsan Akhgari (away Aug 1-5) 2010-06-11 11:09:19 PDT
Given a quick look at this patch, it's clearly wrong to assume that the weak references helf by nsCacheServiceDeactivateEntryEvent are going to be valid once we actually need them.  For example, what would prevent the service itself from going away?  What about the entry?  I think we should be holding some kind of a strong pointer to those objects.
Comment 88 Bjarne (:bjarne) 2010-06-14 04:29:15 PDT
That's valuable input. Thanks!

In principle I agree, but the plan was to control this by SYNC_DISPATCHing the dummy-events in nsCacheService::Shutdown() to ensure deactivation-events have finished before releasing the service. This assumes that Shutdown() in fact is called, and that nsCacheEntry-objects are not delete'd inappropriately (I'm tracing the potential for the latter now, but it looks ok). Are these assumptions completely wrong?

Nevertheless, I'll try using a strong ref and once my commit-access is working...
Comment 89 :Ehsan Akhgari (away Aug 1-5) 2010-06-14 09:18:00 PDT
(In reply to comment #88)
> That's valuable input. Thanks!
> 
> In principle I agree, but the plan was to control this by SYNC_DISPATCHing the
> dummy-events in nsCacheService::Shutdown() to ensure deactivation-events have
> finished before releasing the service. This assumes that Shutdown() in fact is
> called, and that nsCacheEntry-objects are not delete'd inappropriately (I'm
> tracing the potential for the latter now, but it looks ok). Are these
> assumptions completely wrong?

I don't know this code very well, so I'm probably not the best person to answer that.  But I have a helpful tip for you, try running Firefox under valgrind, which would assist you in determining memory safety issues.

> Nevertheless, I'll try using a strong ref and once my commit-access is
> working...

Feel free to ask me at any time to do a try server push in the mean time.  You can comment on the bug, send me a direct email, ping me on irc, give me a call, whatever!  :-)  I'd really like to help you get this landed on mozilla-central as quickly as possible.
Comment 90 Bjarne (:bjarne) 2010-06-15 05:18:17 PDT
Created attachment 451264 [details] [diff] [review]
Modified to also handle profile-shutdown

Apppreciate it, Ehsan - thanks for your support!

Finally managed to reproduce locally, and I really wonder why I didn't see this issue before: The cache-devices are shut down also in nsCacheService::OnProfileChanged() but with no code to block on pending deactivation-events. Fixed, cannot reproduce locally anymore, and pushed to tryserver (yes - I can do that now! :)  ).

Waiting for results from tryserver and review.
Comment 91 Bjarne (:bjarne) 2010-06-16 15:03:42 PDT
Results are better, but still there is a weird crash. After spending most of the day reproducing it locally and analyzing it I realize that nsMemoryCache::FindEntry() just picks up the entry from a hashtable, keeping old values for |mData| and |mThread| in particular.

This seems wrong since deactivation of the entry will then be dispatched to the thread which initially wrote data to the entry, as opposed to the thread which called FindEntry().

Fixed, cannot reproduce locally, and new patch pushed to tryserver...
Comment 92 Bjarne (:bjarne) 2010-06-17 15:10:05 PDT
More details fixed and another push to tryserver...  (this feels more and more like Pandora's box)
Comment 93 Bjarne (:bjarne) 2010-06-18 07:55:25 PDT
Still ironing out details, but at least we're down to details now. One more push to tryserver before the week-end...
Comment 94 Johnny Stenback (:jst, jst@mozilla.com) 2010-06-18 15:42:15 PDT
I'd love to see this in beta1, but I can't say it's really a blocker. If this is landable before the deadline (TBD, but not very many days from now), let's make that happen!
Comment 95 Bjarne (:bjarne) 2010-06-22 06:30:00 PDT
Must have been something wrong with tryservers over the weekend because results all over the place look really funny... pushed again on monday when things seemed to be stable.

Now results look ok. There are two failing jsreftests (orange) with wrong assertion-count on w32/dbg but these seem to be common these days. The last (and interesting) orange is in

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1277123225.1277125340.1704.gz#err0

Scroll one line up and observe the explicit abort from imglib... looks like imglib aborts in certain cases (Release() in this case) when multiple threads are involved and I doubt such behaviour will/can be handled properly by the cache in the same process...

(The next error in the given log is clearly related to this patch but I expect it to be a consequence of the explicit abort by imglib. I'm trying to reproduce it locally but no success yet.)

Question is : What to do with explicit protection against multiple threads in imglib with increased usage of multiple threads in the cache?
Comment 96 Bjarne (:bjarne) 2010-06-23 04:02:17 PDT
Created attachment 453341 [details] [diff] [review]
Cleaned up and deleted code which disabled IPC

This concludes work here for now. Pushed to tryserver one last time for verification and requesting review.

There is still the open question/concern in comment #95 (last paragraph) but I think it should be handled as a separate bug, and it may also be relevant in other contexts than this one.
Comment 97 Johnny Stenback (:jst, jst@mozilla.com) 2010-07-19 17:27:58 PDT
Michal, any chance you can get to this review any time soon?
Comment 98 Mike Beltzner [:beltzner, not reading bugmail] 2010-07-29 13:13:42 PDT
Still a b3 blocker? Not comfortable with it continuing to bump down the line, maybe betaN+?
Comment 99 Johnny Stenback (:jst, jst@mozilla.com) 2010-07-30 17:11:12 PDT
Marking betaN+
Comment 100 Bjarne (:bjarne) 2010-08-16 08:09:55 PDT
Unbitrotted and passed to tryserver to get a fresh feeling about current status.

The open question in comment #95 (in essence: How to release imgCacheEntry-objects on an arbitrary thread) is probably bug #529949 so I'm taking the liberty of adding a dependency to it. I also observe bug #542809 which seems to be the same problem in a different context.
Comment 101 Bjarne (:bjarne) 2010-08-18 00:29:48 PDT
(In reply to comment #100)
> Unbitrotted and passed to tryserver to get a fresh feeling about current
> status.

Five starred failures. However, it also fails consistently on all platforms, both debug and opt on

chrome://mochikit/content/chrome/widget/tests/test_imestate.html

I find this failure mentioned in bug #584944 and bug #585361 although it is not the main issue of either of those bugs. Trying to reproduce locally...
Comment 102 Bjarne (:bjarne) 2010-08-18 02:35:40 PDT
Not able to reproduce locally (opt/dbg). Push again to tryserver...
Comment 103 Bjarne (:bjarne) 2010-08-19 02:24:18 PDT
Results from last push are surprisingly (suspiciously?) green with only three oranges :

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282131772.1282134988.1526.gz

which looks like bug #544601

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282124270.1282134123.28731.gz

where I cannot find the problem (a leak test showing no leaks afaics)

and http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282131772.1282133850.27248.gz

which fails in /tests/dom/tests/mochitest/dom-level2-core/test_elementgetattributenodens03.html on the assertion

###!!! ASSERTION: nsScriptCacheCleaner not thread-safe: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file e:/builds/moz2_slave/tryserver-win32-debug/build/content/base/src/nsFrameMessageManager.cpp, line 681

The latter issue should probably be taken seriously although the test itself only contains two TODOs. Cannot reproduce locally but will investigate further.
Comment 104 Bjarne (:bjarne) 2010-08-19 07:35:41 PDT
Created attachment 467409 [details] [diff] [review]
New, lighter approach to achieve the desired effect

As discussed on irc, this is a different and more direct approach to writing cache-metadata off the main-thread. I'm attaching it here to give you an idea of what I mean - comments are appreciated.
Comment 105 Michal Novotny (:michal) 2010-08-25 06:42:45 PDT
Comment on attachment 467409 [details] [diff] [review]
New, lighter approach to achieve the desired effect

In the first patch (#453341) I really don't like collecting threads in mDeactivateEntryThreads and the method BlockOnAsyncDeactivateEvents(). Now when patch from bug #513008 has landed we have a dedicated cache IO thread that we can use to deactivate the entry. IMO there is also no reason to try to deactivate entry on the same thread where SetData() was called. Correct me if I'm wrong but there shouldn't be any problem to deactivate entry on the background IO thread even if data was written synchronously due to STORE_ON_DISK_AS_FILE policy.

The second patch (#467409) is more straightforward. But what I don't like on both patches is the fact that we have the cache in inconsistent state for a short period. We proceed nsCacheService::DeactivateEntry() immediately but the disk device deactivates the entry later. I.e. within this period we don't have the entry in nsCacheService::mActiveEntries but nsDiskCacheBindery::FindActiveBinding() can find a binding for the entry.

Instead of posting nsDiskCacheDevice::DeactivateEntry() to the background thread we should probably try to change the code so that most of the calls to nsCacheService::DeactivateEntry() happen on the background thread. It is called from

1) nsCacheService::ProcessRequest()
2) nsCacheService::DoomEntry_Internal()
3) nsCacheService::CloseDescriptor()
4) nsCacheService::ClearDoomList()
5) nsCacheService::DeactivateAndClearEntry()

We should focus on 3 which happens all the time during normal browsing.
Comment 106 Bjarne (:bjarne) 2010-08-27 04:51:06 PDT
(In reply to comment #105)
> In the first patch (#453341) I really don't like collecting threads in
> mDeactivateEntryThreads and the method BlockOnAsyncDeactivateEvents(). Now when
> patch from bug #513008 has landed we have a dedicated cache IO thread that we
> can use to deactivate the entry.

I agree.

> IMO there is also no reason to try to
> deactivate entry on the same thread where SetData() was called.

Not when we have a suitable thread available for this, no.

> Correct me if
> I'm wrong but there shouldn't be any problem to deactivate entry on the
> background IO thread even if data was written synchronously due to
> STORE_ON_DISK_AS_FILE policy.

I think you're right.

> The second patch (#467409) is more straightforward. But what I don't like on
> both patches is the fact that we have the cache in inconsistent state for a
> short period. We proceed nsCacheService::DeactivateEntry() immediately but the
> disk device deactivates the entry later. I.e. within this period we don't have
> the entry in nsCacheService::mActiveEntries but
> nsDiskCacheBindery::FindActiveBinding() can find a binding for the entry.

Good point! But (in the second approach) couldn't we just call nsDiskCacheBindery::RemoveBinding() before proxying the call?

> Instead of posting nsDiskCacheDevice::DeactivateEntry() to the background
> thread we should probably try to change the code so that most of the calls to
> nsCacheService::DeactivateEntry() happen on the background thread.

I see (and fwiw I like) what you say but I suspect it will be even harder to keep the cache consistent if we do it like that (i.e. there will be more state to synchronize between the threads). I'll keep this in mind, but I would like to give my second approach a harder try, addressing your concerns above.
Comment 107 Bjarne (:bjarne) 2010-09-08 05:38:14 PDT
Created attachment 473021 [details] [diff] [review]
Handles race-condition described in comment #105

I believe this slight modification takes care of the potential for inconsistent state which Michal points out in comment #105 by introducing a flag to prevent GetCacheEntryBinding(nsCacheEntry) to return the binding if there is a pending deactivate.

(IMO there is a very unfortunate binding between the nsCacheEntry and the nsDiskCacheEntry where they point to each other and where the mData field of nsCacheEntry actually *is* the binding-object, but those are not relevant here...)

It passes simple local testing as well as local random browsing. Pushed to tryserver...
Comment 108 Bjarne (:bjarne) 2010-09-09 02:01:35 PDT
(In reply to comment #107)
> It passes simple local testing as well as local random browsing. Pushed to
> tryserver...

Some timeouts seen from tryserver caused by the changes to dtor. Reverted and pushed again.
Comment 109 Bjarne (:bjarne) 2010-09-09 12:59:48 PDT
Timeouts gone but more careful handling of bindings is required.
Comment 110 Bjarne (:bjarne) 2010-09-14 11:45:51 PDT
Created attachment 475153 [details] [diff] [review]
Carefully handling bindings...

This version passes local testing and random browsing - pushed to tryserver...
Comment 111 Bjarne (:bjarne) 2010-09-14 18:44:15 PDT
Afaics, it also passes tryserver. Five starred oranges plus some OpenGL oranges, which are to be expected these days. Waiting for review...
Comment 112 Johnny Stenback (:jst, jst@mozilla.com) 2010-09-23 23:36:35 PDT
Running with the latest patch in this bug *and* with the patch in bug 588804 (I think, at least) applied I just hit a crash with the following stack:
#2  <signal handler called>
#3  arena_dalloc (ptr=0x100000001)
    at ../../../mozilla/memory/jemalloc/jemalloc.c:4271
#4  free (ptr=0x100000001) at ../../../mozilla/memory/jemalloc/jemalloc.c:6121
#5  0x00007fca4f16acfc in ~nsACString_internal (this=0x7fc9fbac7280, 
    __in_chrg=<value optimized out>) at ../../dist/include/nsTSubstring.h:115
#6  ~nsCString (this=0x7fc9fbac7280, __in_chrg=<value optimized out>)
    at ../../dist/include/nsTString.h:55
#7  nsCacheEntry::~nsCacheEntry (this=0x7fc9fbac7280, 
    __in_chrg=<value optimized out>)
    at ../../../mozilla/netwerk/cache/nsCacheEntry.cpp:83
#8  0x00007fca4f170a0b in nsDiskCacheDevice::DeactivateEntry (this=
    0x7fca32f44200, entry=0x7fc9fbac7280)
    at ../../../mozilla/netwerk/cache/nsDiskCacheDevice.cpp:527
#9  0x00007fca4f16eb9d in nsCacheServiceDeactivateEntryEvent::Run (this=
    0x7fc9a2293ae0) at ../../../mozilla/netwerk/cache/nsCacheService.cpp:1989
#10 0x00007fca4fab47ad in nsThread::ProcessNextEvent (this=0x7fc9c1bfde90, 
    mayWait=1, result=0x7fc9ae5fdd3c)
    at ../../../mozilla/xpcom/threads/nsThread.cpp:547
#11 0x00007fca4fa852a9 in NS_ProcessNextEvent_P (thread=<value optimized out>, 
    mayWait=<value optimized out>) at nsThreadUtils.cpp:250
#12 0x00007fca4fab4d5a in nsThread::ThreadFunc (arg=0x7fc9c1bfde90)
    at ../../../mozilla/xpcom/threads/nsThread.cpp:263
#13 0x00007fca4e16f4a3 in _pt_root (arg=0x7fc9b48c7240)
    at ../../../../mozilla/nsprpub/pr/src/pthreads/./ptthread.c:228
#14 0x00000037ed406a3a in start_thread () from /lib64/libpthread.so.0
#15 0x00000037eccde77d in clone () from /lib64/libc.so.6
(gdb) f5
#5  0x00007fca4f16acfc in ~nsACString_internal (this=0x7fc9fbac7280, 
    __in_chrg=<value optimized out>) at ../../dist/include/nsTSubstring.h:115
115	      ~nsTSubstring_CharT() { Finalize(); }
(gdb) p *this
$1 = {mData = 0x100000001 <Address 0x100000001 out of bounds>, mLength = 
    52500848, mFlags = 32714}
(gdb) up
#6  ~nsCString (this=0x7fc9fbac7280, __in_chrg=<value optimized out>)
    at ../../dist/include/nsTString.h:55
55	  {
(gdb) up
#7  nsCacheEntry::~nsCacheEntry (this=0x7fc9fbac7280, 
    __in_chrg=<value optimized out>)
    at ../../../mozilla/netwerk/cache/nsCacheEntry.cpp:83
warning: Source file is more recent than executable.
83	    delete mKey;
(gdb) p *this
$2 = {<PRCListStr> = {next = 0x7fc9fbac7280, prev = 0x7fc9fbac7280}, mKey = 
    0x7fca2fb064c0, mFetchCount = 2, mLastFetched = 1285305343, 
  mLastModified = 1285305348, mLastValidated = 1936024697, mExpirationTime = 
    1286880963, mFlags = 154368, mDataSize = 0, mCacheDevice = 0x7fca32f44200, 
  mSecurityInfo = {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, 
  mData = 0x7fc98df897e0, mThread = {<nsCOMPtr_base> = {mRawPtr = 
    0x7fc9c1bfde90}, <No data fields>}, mMetaData = {mBuffer = 0x0, 
    mBufferSize = 0, mMetaSize = 0}, mRequestQ = {next = 0x7fc9fbac72e8, 
    prev = 0x7fc9fbac72e8}, mDescriptorQ = {next = 0x7fc9fbac72f8, prev = 
    0x7fc9fbac72f8}}
(gdb)
I have a core file (from an optimized with symbols build) for this crash in case more info is needed.
Comment 113 Shawn Wilsher :sdwilsh 2010-10-13 09:56:00 PDT
We have any forward progress on this yet?
Comment 114 Michal Novotny (:michal) 2010-10-22 13:47:59 PDT
Comment on attachment 475153 [details] [diff] [review]
Carefully handling bindings...

> +  nsresult rv = NS_NewThread(getter_AddRefs(mDiskCacheThread));

Shouldn't you explicitly shutdown mDiskCacheThread somewhere? Why didn't you use mCacheIOThread?


>    if (binding->mGeneration == p->mGeneration) {
> -      if (calcGeneration)  ++binding->mGeneration;    // try the next generation
> -      else {
> +      if (calcGeneration || p->mDeactivateInProgress) {
> +          ++binding->mGeneration;    // try the next generation
> +      } else {
>            NS_ERROR("### disk cache: generations collide!");

What does this change mean? I do think that it isn't needed and if yes then there is some serious problem. This code could potentially change existing generation number, which is bad.


>  nsDiskCacheBinding *
>  GetCacheEntryBinding(nsCacheEntry * entry)
>  {
> -    return (nsDiskCacheBinding *) entry->Data();
> +    nsDiskCacheBinding *tmp = (nsDiskCacheBinding *) entry->Data();
> +    if (tmp->mDeactivateInProgress)
> +        return nsnull;
> +    else
> +        return tmp;
>  }

This doesn't solve the issue that FindActiveEntry() sees (and returns) the binding. Also did you check all places where GetCacheEntryBinding() is called, if this can't be a problem? E.g. in nsDiskCacheDevice::DoomEntry() is an assertion which can be IMO triggered in following situation:

- nsDiskCacheDevice::DeactivateEntry() is called and nsDiskCacheDeviceDeactivateEntryEvent is posted
- some other entry is written and the cache is full, this starts evicting some entries in EvictDiskCacheEntries()
- nsDiskCacheEvictor::VisitRecord() is called for the entry being deactivated
- nsDiskCacheDevice::DoomEntry() is called -> assertion


This is just one example. I'm afraid that there could be a lot of other possible problems that would occur only in specific situations, which would make them really hard to debug. So I think that we should keep nsDiskCacheBindery in sync with nsCacheService::mActiveEntries.
Comment 115 Bjarne (:bjarne) 2010-10-25 07:08:00 PDT
Created attachment 485727 [details] [diff] [review]
Carefully handling bindings, v1.1

(In reply to comment #114)
> Shouldn't you explicitly shutdown mDiskCacheThread somewhere? Why didn't you
> use mCacheIOThread?

Most of this was done before mDiskCacheThread existed, but yes: It should be used. Fixed.

> This code could potentially change existing
> generation number, which is bad.

You're right - the OR is definitely wrong. In fact, this method doesn't need to know anything about whether the entry is about to be deactivated - it can be left unchanged. Should be ok now.

> So I think that we should keep
> nsDiskCacheBindery in sync with nsCacheService::mActiveEntries.

I agree or disagree, depending on what you mean by "keep in sync". :)

The flag nsDiskCacheBinding::mDeactivateInProgress was introduced to prevent the bindery from returning entries which are queued to be deactivated. I.e. it is meant to "bridge the time-slot" which is introduced by deactivating disk-cache entries asynchronously. I believe this is sufficiently "in sync" for our needs. E.g. ActiveBinding() and AddBinding() do *not* take this flag into account, while FindActiveBinding() and GetCacheEntryBinding() do.

> This doesn't solve the issue that FindActiveEntry() sees (and returns) the
> binding.

FindActiveEntry() skip this binding like it does for doomed entries.

> - nsDiskCacheDevice::DeactivateEntry() is called and
> nsDiskCacheDeviceDeactivateEntryEvent is posted
> - some other entry is written and the cache is full, this starts evicting some
> entries in EvictDiskCacheEntries()
> - nsDiskCacheEvictor::VisitRecord() is called for the entry being deactivated
> - nsDiskCacheDevice::DoomEntry() is called -> assertion

This shouldn't happen because FindActiveEntry() would skip entries with the flag set, ref above.

However, you do have a point about the assertions which are often used immediately after calling GetCacheEntryBinding(). This applies to DeactivateEntry(), DoomEntry(), GetFileForEntry() and OnDataSizeChange(), and the question is whether these may be called in the time-slot introduced by this patch. The methods OpenInput/OutputStreamForEntry() use ENSURE_TRUE first, which seems to me like a reasonable solution, do you agree?

Patch passes basic local testing - pushed to tryserver.
Comment 116 Michal Novotny (:michal) 2010-10-25 17:14:45 PDT
(In reply to comment #112)
> 83        delete mKey;
> (gdb) p *this
> $2 = {<PRCListStr> = {next = 0x7fc9fbac7280, prev = 0x7fc9fbac7280}, mKey = 
>     0x7fca2fb064c0, mFetchCount = 2, mLastFetched = 1285305343, 
>   mLastModified = 1285305348, mLastValidated = 1936024697, mExpirationTime = 
>     1286880963, mFlags = 154368, mDataSize = 0, mCacheDevice = 0x7fca32f44200, 
>   mSecurityInfo = {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, 
>   mData = 0x7fc98df897e0, mThread = {<nsCOMPtr_base> = {mRawPtr = 
>     0x7fc9c1bfde90}, <No data fields>}, mMetaData = {mBuffer = 0x0, 
>     mBufferSize = 0, mMetaSize = 0}, mRequestQ = {next = 0x7fc9fbac72e8, 
>     prev = 0x7fc9fbac72e8}, mDescriptorQ = {next = 0x7fc9fbac72f8, prev = 
>     0x7fc9fbac72f8}}

The members seem to be reasonable, except mFlags. I'd guess that the cache entry is deleted for the second time. I don't think that the patch from bug #588804 could cause this, so I'd suspect patch #475153 from this bug.
Comment 117 Bjarne (:bjarne) 2010-10-26 06:30:45 PDT
Not sure where you're going with this, but the mFlags value means that the nsCacheEntry is

1. doomed
2. dirty
3. metadata is dirty
4. it has stream data
5. it's initialized

and the 0x2000-value looks weird. But isn't it deleting |mKey| which fails? Why do you believe the nsCacheEntry gets deleted twice? (I'm sorry if I'm a bit slow here, but I just don't see this...)
Comment 118 Bjarne (:bjarne) 2010-10-26 06:37:50 PDT
Oh - and note that patch #475153 from this bug did have some bad logic (as you pointed out in comment #114 and I acked in comment #115) which the new patch has fixed.
Comment 119 Michal Novotny (:michal) 2010-10-27 08:34:47 PDT
(In reply to comment #117)
> Not sure where you're going with this, but the mFlags value means that the
> nsCacheEntry is
> 
> 1. doomed
> 2. dirty
> 3. metadata is dirty
> 4. it has stream data
> 5. it's initialized
> 
> and the 0x2000-value looks weird.

That's exactly the problem. How 0x20000 could get into mFlags? I'd guess that the memory was freed and overwritten.


> But isn't it deleting |mKey| which fails? Why do you believe the nsCacheEntry
> gets deleted twice?

I've checked the lifetime of the mKey and I don't see a way how we could delete an invalid nsCString. First I thought that the problem could be caused by passing the key from the request to the entry in nsCacheService::ActivateEntry. But mKey is nulled out almost immediately after that in nsCacheEntry::RequestAccess(). The problem occurs only if mActiveEntries.AddEntry() fails since RequestAccess() isn't called and we release the key twice. But this isn't our case because the entry seems to be used.

I've tried to add second delete entry to nsDiskCacheDevice::DeactivateEntry() and the debugger crashed when deleting the nsCString just like in comment #112, so that's why I think that the entry was deleted twice.


> Oh - and note that patch #475153 from this bug did have some bad logic (as you
> pointed out in comment #114 and I acked in comment #115) which the new patch
> has fixed.

Do you mean the condition when calculating generation? I think that the wrong condition can't happen at all, so this couldn't cause the crash.
Comment 120 Mike Shaver (:shaver -- probably not reading bugmail closely) 2010-10-27 08:42:37 PDT
what does valgrind say about the overwriting of freed memory?
Comment 121 Bjarne (:bjarne) 2010-10-28 07:44:45 PDT
(In reply to comment #119)
> I've tried to add second delete entry to nsDiskCacheDevice::DeactivateEntry()
> and the debugger crashed when deleting the nsCString just like in comment #112,
> so that's why I think that the entry was deleted twice.

Fair enough - what did mFlags look like?

In fact, one way I can see this happening is that nsCacheService::DectivateEntry() is called for a doomed entry which has not bound yet due to delayed writing. The call will delete the entry, but when the write-request comes, the entry will bind. If the entry is deactivated later, we will get the second delete by the event. (This should not happen after the fix for bug #596443 landed though.)

Will try to find other patterns which may give us two deletes. I'd also appreciate comments to the patch.
Comment 122 Bjarne (:bjarne) 2010-10-28 15:49:52 PDT
Mmmm...  I'm looking at a crash from tryserver (with this patch) which asserts in nsDiskCacheBindery::RemoveBinding() that the bindery is not initialized. Wonder how that could happen. Note also that it is the code-line immediately before the one crashing in comment #116.

An uninitialized bindery indicates that nsDiskCacheDevice::Shutdwon_Private() is called while a deactivate-event is pending. Either from nsDiskCacheDevice::ClearDiskCache() or from nsDiskCacheDevice::Shutdown().

Maybe nsDiskCacheDevice::Shutdown() should wait until mBindery has no more active bindings, like nsDiskCacheDevice::ClearDiskCache() does..
Comment 123 Mike Shaver (:shaver -- probably not reading bugmail closely) 2010-10-28 15:51:59 PDT
Does valgrind show anything when you run under it?  It often provides very useful clues when dealing with memory issues.
Comment 124 Michal Novotny (:michal) 2010-10-28 15:55:11 PDT
AFAIK we can't reproduce the crash, so valgrind can't help here.
Comment 125 Shawn Wilsher :sdwilsh 2010-10-28 15:56:28 PDT
(In reply to comment #124)
> AFAIK we can't reproduce the crash, so valgrind can't help here.
Doesn't mean it won't flag bad memory access patterns that don't result in a crash.
Comment 126 Bjarne (:bjarne) 2010-10-30 04:06:37 PDT
(Warning: I have no previous experience with valgrind.)

I've followed the guidelines at https://developer.mozilla.org/en/Debugging_Mozilla_with_Valgrind but can find nothing even remotely relevant in the output. As Michal says, the crash mentioned in comment #112 and comment #116 is not really reproducible.

The crash mentioned in comment #122 is in leaktest/Linux. I've not yet been able to reproduce, but I'm currently trying to figure out how to run leaktest under valgrind (any pointers or input is appreciated) to see if it gives any clues.
Comment 127 Bjarne (:bjarne) 2010-11-07 14:44:36 PST
An update:

I've occasionally seen crashes on the tryserver with this patch very similar to comment #112 and comment #122. However, after some investigation I've found that these are not caused by deactivating an entry twice, but rather by the way nsCacheService shuts down the devices: The device is deleted, the dtor calls Shutdown(), which (with this patch) may fire async deactivations. When an async deactivation runs, the device w/ members is gone but the memory is accessed by the async deactivation event, and boom...

I'll attach an updated patch which blocks the dtor of the disk-device until all cache-io events have run, and I have not seen any crashes with this modification. However, there is an issue in /tests/content/base/test/test_bug475156.html where 3-4 of the tests fails with a clearly cache-related message. The suspicion is that there is a small timeslot while deactivation is pending where the cache doesn't find the entry, and I'm currently trying to find it and close it.
Comment 128 Bjarne (:bjarne) 2010-11-07 14:46:19 PST
Created attachment 488785 [details] [diff] [review]
Carefully handling bindings v1.2 (blocking on cache-io thread in dtor)
Comment 129 Markus Stange [:mstange] 2010-11-24 02:27:21 PST
There has been a similar crash as in comment 112 just now on the Firefox Tinderbox, see bug 614513.
Comment 130 Bjarne (:bjarne) 2010-11-25 00:49:04 PST
Nice catch. :)  See bug #bug 614513, comment #5 in particular the last paragraph.
Comment 131 Michal Novotny (:michal) 2010-11-25 05:21:13 PST
Comment on attachment 488785 [details] [diff] [review]
Carefully handling bindings v1.2 (blocking on cache-io thread in dtor)

> +nsCacheService::BlockOnCacheIOThread()
> +{
> +    NS_ASSERTION(gService->mLockedThread == PR_GetCurrentThread(), "oops");
> +    if (!gService->mCacheIOThread) return NS_ERROR_NOT_AVAILABLE;
> +
> +    nsBlockOnCacheThreadEvent *event = new nsBlockOnCacheThreadEvent();
> +
> +    Unlock();
> +        nsresult rv =
> +            gService->mCacheIOThread->Dispatch(event, NS_DISPATCH_SYNC);
> +    Lock();

Posting a synchronous event will spin the event loop until the event finishes. This can be IMO problematic.

Maybe a better solution would be to have a counter of posted events and wait using monitor until it is 0. Also it makes sense to do a synchronous deactivation while doing shutdown. I.e. in Shutdown_Private() set variable mShutdownInProgress = PR_TRUE and check this variable in DeactivateEntry(), in case it is true do a synchronous deactivation.


> +class nsDiskCacheDeviceDeactivateEntryEvent : public nsRunnable {
> +public:
> +    nsDiskCacheDeviceDeactivateEntryEvent(nsDiskCacheDevice *device,
> +                                          nsCacheEntry * entry,
> +                                          nsDiskCacheBinding * binding)

Events and other helper classes are usually at the beginning of the file. Please move it there.


> -    binding = nsnull;
> +    binding = nsnull;  // why assign here??

This should be changed to an assertion NS_ASSERTION(!binding, "..."). It used to be a debug check (see http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/netwerk/cache/src/nsDiskCacheDevice.cpp&rev=1.59) but became a regular code which is incorrect and misleading. The logic is that if we find an active binding it must be a collision, otherwise there is some serious problem.
Comment 132 Bjarne (:bjarne) 2010-11-25 06:52:51 PST
(In reply to comment #131)
> Posting a synchronous event will spin the event loop until the event finishes.

Yes - that's the idea. :)  I want to block current thread until all pending events on the cache-io thread has finished. We are, after all, shutting down the cache at this point.

> This can be IMO problematic.

How? (Observe that the lock is released before the sync-event.)

> Events and other helper classes are usually at the beginning of the file.
> Please move it there.

Fair enough.

> This should be changed to an assertion NS_ASSERTION(!binding, "..."). It used
> to be a debug check (see
> http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/netwerk/cache/src/nsDiskCacheDevice.cpp&rev=1.59)
> but became a regular code which is incorrect and misleading. The logic is that
> if we find an active binding it must be a collision, otherwise there is some
> serious problem.

I can add an assertion in FindEntry() but the TODOs are really about why we assign binding to null (also present in 1.59) when it's not used before the next assignment (also in 1.59).
Comment 133 Bjarne (:bjarne) 2010-11-25 07:41:18 PST
(In reply to comment #132)
> but the TODOs are really about why we
> assign binding to null (also present in 1.59) when it's not used before the
> next assignment (also in 1.59).

Ahh - sorry. I missed all the cleanup in dtor.  :P
Comment 134 Michal Novotny (:michal) 2010-11-25 17:02:22 PST
(In reply to comment #132)
> (In reply to comment #131)
> > Posting a synchronous event will spin the event loop until the event finishes.
> 
> Yes - that's the idea. :)  I want to block current thread until all pending
> events on the cache-io thread has finished. We are, after all, shutting down
> the cache at this point.

But the problem is that the current thread won't be blocked. It will process any pending event on the main thread. Maybe that's OK, but some time ago I wanted to spin the event queue using NS_ProcessNextEvent while waiting for a result and I was told that I should avoid doing that since it isn't always safe.
Comment 135 Shawn Wilsher :sdwilsh 2010-11-27 18:33:17 PST
(In reply to comment #134)
> But the problem is that the current thread won't be blocked. It will process
> any pending event on the main thread. Maybe that's OK, but some time ago I
> wanted to spin the event queue using NS_ProcessNextEvent while waiting for a
> result and I was told that I should avoid doing that since it isn't always
> safe.
It might be OK for this.  When do we shut the cache down?
Comment 136 Michal Novotny (:michal) 2010-11-28 12:28:41 PST
(In reply to comment #135)
> It might be OK for this.  When do we shut the cache down?

The code is called on "xpcom-shutdown" and "profile-before-change",  and also when somebody calls nsICacheSession::evictEntries() to evict entries in the disk cache.
Comment 137 Bjarne (:bjarne) 2010-11-29 04:01:44 PST
(In reply to comment #134)
> But the problem is that the current thread won't be blocked. It will process
> any pending event on the main thread. 

Reading the nsThread impl I see what you mean... :(  The calling thread won't block but rather execute pending events while stalling the return of the dispatch-sync. Effectively, upon calling dispatch-sync necko will be in some half-way-done state, subsequent events will run in this state, and nobody knows what will happen. Makes me wonder which situations the dispatch-sync actually can handle safely... (and consequently why it's available at all).

I've re-implemented BlockOnCacheIoThread() to use conditional variables instead, attempting to properly block the calling thread (which is what I want to do). Are there any known pitfalls/issues with condvars? :)

(Patch is running on tryserver now - I'll attach it asap.)
Comment 138 Michal Novotny (:michal) 2010-11-29 06:04:53 PST
(In reply to comment #137)
> I've re-implemented BlockOnCacheIoThread() to use conditional variables
> instead, attempting to properly block the calling thread (which is what I want
> to do). Are there any known pitfalls/issues with condvars? :)

AFAICS the only problem could be if we would do the same on the cache IO thread, i.e. if we would need to do some task on the main thread to finish all events posted to the IO thread. But we don't do this so it should be safe.
Comment 139 Bjarne (:bjarne) 2010-11-30 06:45:27 PST
Referring to comment #127 and the intermittent failure in test test_bug475156.html, I finally think I see what's going on...

nsHttpChannel::OnStopRequest() is called when the response has been retrieved from the server. It calls CloseCacheEntry() which sets |mCacheEntry| to null. This will/might trigger the dtor of |mCacheEntry| and the dtor of nsCacheEntryDescriptor calls Close() which calls nsCacheService::CloseDescriptor(), which finally deactivates the entry if it's considered inactive.

Now, when deactivating an entry we remove it from nsCacheService::mActiveEntries and make we sure to remove it from nsDiskCacheDevice::mBindery (ref last sentence in comment #114) to keep these in sync. Then we post an event which writes the entry to disk at some point later. Note that this creates a time-slot between removing it from mActiveEntries/mBindery and writing the updated entry to disk. If another request for this entry is processed in this time-slot it will find the *old* entry on disk. In the case of "test_bug475156.html" we find an entry on disk with the wrong eTag and thus validate it with the server, leading to intermittent failures depending on the size and timing of this time-slot.

IMO we must delay removing an entry from mBindery until it has been written to disk. If the entry is requested in the time-slot, we should re-use the binding and not remove it from mBindery. I'll see what I can come up with in terms of code - feel free to challenge the reasoning above in the meantime.
Comment 140 Jonas Sicking (:sicking) PTO Until July 5th 2010-12-14 09:54:32 PST
We really want this, but it's no longer a blocker this late in the game.
Comment 141 Bjarne (:bjarne) 2010-12-30 02:05:09 PST
Part of solution here is described separately in bug #620660 - adding dependency.
Comment 142 Bjarne (:bjarne) 2011-03-18 05:04:54 PDT
Created attachment 520177 [details] [diff] [review]
Carefully handling bindings version 1.3

After a fair number of tryserver-runs I finally identified an issue which caused similar symptoms as seen in bug #614513 (which btw probably could be confirmed fixed by now). The issue is essentially that a cache-entry gets deactivated twice, but this particular code-path can only occur with the rest of this patch applied.

Problem is avoided by canceling the pending event doing the first deactivation and after the fix, this patch passes a full tryserver-run (a few starred oranges) with one exception. The exception really looks like bug #529733, although the crash happens in strlen instead of memcpy. However, origin is the same: nsHttpChannel::ContinueProcessNormal() and InitCacheEntry(). Stack can be seen at http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1300415888.1300416140.711.gz#err1

and full tryserver-run can be seen at  http://tbpl.mozilla.org/?tree=MozillaTry&rev=adfb1b91e827

I'll try to reproduce a couple of times before moving on with this. If there is any insight wrt whether the exception is a version of bug #529733, I'd be happy to listen...
Comment 143 Bjarne (:bjarne) 2011-03-21 06:21:31 PDT
(In reply to comment #142)
> I'll try to reproduce a couple of times before moving on with this.

The same test crashed once again (also on Win32) but with a different stack. Must investigate further...
Comment 144 Bjarne (:bjarne) 2011-04-07 12:50:46 PDT
Created attachment 524455 [details] [diff] [review]
Carefully handling bindings version 1.3

This adds some minor cleanup compared to the previous patch. It has passed two full tryserver-runs with 3-4 starred oranges in each run.
Comment 145 Michal Novotny (:michal) 2011-04-11 16:36:24 PDT
Comment on attachment 524455 [details] [diff] [review]
Carefully handling bindings version 1.3

> +    void NeverMind() { mNeverMind = PR_TRUE; }

I think that names like CancelDeactivate() nad mCancelled would be more self-explaining.


> +    nsDiskCacheDeviceDeactivateEntryEvent *event =
> +        new nsDiskCacheDeviceDeactivateEntryEvent(this, entry, binding);
> +    if (!event) {
> +        CACHE_LOG_DEBUG(("  failed allocating deactivate-event - "
> +                         "falling back to synchronous operation.\n"));
> +        return DeactivateEntry_Private(entry, binding);
> +    }

new doesn't fail.


> -    if (!binding)  return NS_ERROR_UNEXPECTED;
> +    NS_ASSERTION(binding->mDeactivateEvent == nsnull,
> +                 "OnDataSizeChange: entry in process of deactivation");
> +    if (!binding || binding->mDeactivateEvent)
> +        return NS_ERROR_UNEXPECTED;

Just a nit. There are several places where this check is used. Couldn't we create some inline function for it? Like:

inline bool ValidBinding(nsDiskCacheBinding *binding)
{
  if (!binding) {
    NS_ASSERTION(...);
    return false;
  }
  if (binding->mDeactivateEvent) {
    NS_ASSERTION(...);
    return false;
  }
  return true;
}

And then just use
if (!ValidBinding(binding))
  return NS_ERROR_UNEXPECTED;


> -        binding = nsnull;
> +        binding = nsnull; // why assign here? it's set below...

See my comment #131


> +        if (!mNeverMind) {
> +            nsCacheServiceAutoLock lock;

There is a possible race condition if we want to reuse the binding on different thread.
Comment 146 Bjarne (:bjarne) 2011-04-11 17:31:52 PDT
(In reply to comment #145)
> > +    void NeverMind() { mNeverMind = PR_TRUE; }
> 
> I think that names like CancelDeactivate() nad mCancelled would be more
> self-explaining.

Fair enough. Fixed.

> > +    nsDiskCacheDeviceDeactivateEntryEvent *event =
> > +        new nsDiskCacheDeviceDeactivateEntryEvent(this, entry, binding);
> > +    if (!event) {
> new doesn't fail.

Ok.

> > -    if (!binding)  return NS_ERROR_UNEXPECTED;
> > +    NS_ASSERTION(binding->mDeactivateEvent == nsnull,
> > +                 "OnDataSizeChange: entry in process of deactivation");
> > +    if (!binding || binding->mDeactivateEvent)
> > +        return NS_ERROR_UNEXPECTED;
> 
> Just a nit. There are several places where this check is used. Couldn't we
> create some inline function for it? 

Ok

> > -        binding = nsnull;
> > +        binding = nsnull; // why assign here? it's set below...
> 
> See my comment #131

Ok - fixed as agreed.

> > +        if (!mNeverMind) {
> > +            nsCacheServiceAutoLock lock;
> 
> There is a possible race condition if we want to reuse the binding on different
> thread.

Good catch! Fixed.
Comment 147 Bjarne (:bjarne) 2011-04-11 17:35:07 PDT
Created attachment 525240 [details] [diff] [review]
Carefully handling bindings version 1.3 (updated)

Fixed as requested, pushed to tryserver.
Comment 148 Bjarne (:bjarne) 2011-04-11 17:46:18 PDT
Created attachment 525247 [details] [diff] [review]
Carefully handling bindings version 1.3 (updated)

One more place to use the util-function...
Comment 149 Bjarne (:bjarne) 2011-04-11 18:36:52 PDT
Created attachment 525263 [details] [diff] [review]
Carefully handling bindings version 1.3 (updated and unbitrotted)

... and here it is, unbitrotted as well...
Comment 150 Shawn Wilsher :sdwilsh 2011-04-11 19:19:26 PDT
Is this going to land tonight so it makes Firefox 5?
Comment 151 Bjarne (:bjarne) 2011-04-11 22:58:19 PDT
Tryserver is still running

http://tbpl.mozilla.org/?tree=MozillaTry&rev=5e9434e84258

If someone wants to land it - feel free.
Comment 152 Bjarne (:bjarne) 2011-04-13 10:03:05 PDT
Lots of traffic on tryserver these days...  according to the web interface this job is not yet complete. It seems to be ok but since we missed the window anyway I'll re-submit the job to get a coherent picture.
Comment 153 Bjarne (:bjarne) 2011-04-17 06:09:41 PDT
Pushed again to tryserver as http://tbpl.mozilla.org/?tree=MozillaTry&rev=a2588d175b07
Comment 154 Bjarne (:bjarne) 2011-04-17 06:11:01 PDT
Testrun looks ok. Requesting check-in.
Comment 155 Dão Gottwald [:dao] 2011-04-17 10:33:05 PDT
http://hg.mozilla.org/mozilla-central/rev/1346f7ada9d5
Comment 156 Shawn Wilsher :sdwilsh 2011-04-17 11:52:57 PDT
(In reply to comment #155)
> http://hg.mozilla.org/mozilla-central/rev/1346f7ada9d5
Let me be the first to say "SQUEEEEEEEEE!"

Note You need to log in before you can comment on or make changes to this bug.