549767 - Write/flush cache metadata off main-thread

Reporter

Description

•

14 years ago

I saw this with a March 2nd m-c nightly. Will attach instruments data shortly.

Robert Sayre

Reporter

Comment 1

•

14 years ago

Attached image instruments screen grab — Details

Robert Sayre

Reporter

Updated

•

14 years ago

Blocks: 513074

Robert Sayre

Reporter

Comment 2

•

14 years ago

(also filing on the GetUSLayoutCharFromKeyTranslate thing pictured there... ho ho ho)

Robert Sayre

Reporter

Updated

•

14 years ago

Assignee: general → nobody

Component: JavaScript Engine → Networking: Cache

QA Contact: general → networking.cache

Robert Sayre

Reporter

Comment 3

•

14 years ago

hmm, I can't reproduce this anymore, but sdwilsh said he saw the same thing in irc.

Robert Sayre

Reporter

Comment 4

•

14 years ago

Attached file strick that, I can still see it happening. here's a stack — Details

Shawn Wilsher :sdwilsh

Comment 5

•

14 years ago

(In reply to comment #4)
> Created an attachment (id=429876) [details]
> strick that, I can still see it happening. here's a stack
Any comments on why we see writes on the main thread still Bjarne?

Boris Zbarsky [:bzbarsky]

Comment 6

•

14 years ago

So it looks like bug 513074 made the data writes (the actual cached content) happen on a worker thread.

The stack in comment 4 is writing cache entry metadata.

What I actually wonder is whether the DataSize() is correct a this point if there are still pending data writes on the worker thread...

Bjarne (:bjarne)

Assignee

Comment 7

•

14 years ago

(In reply to comment #6)
> So it looks like bug 513074 made the data writes (the actual cached content)
> happen on a worker thread.

Well, yes... The general idea was to post cache-writes on a worker thread while reading from the source instead of writing to the cache synchronously.

Should we also try to write metadata asynchronously? It may be possible to achieve if we e.g. post the call to CloseCacheEntry() on the worker thread, but I'd need to experiment with it to tell for sure.

However, fwiw, keep in mind that if the caller set the UseCacheAsFile-flag (see e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=548406#c33) also data writes happens on the main-thread. Hence, depending on what is loaded during profiling, one may or may not see even data written on the main-thread. 

> What I actually wonder is whether the DataSize() is correct a this point if
> there are still pending data writes on the worker thread...

That's a good point. I guess there are no tests which would catch this...? It may be resolved by the fix suggested above though... :)

Michal Novotny [:michal]

Comment 8

•

14 years ago

(In reply to comment #6)
> What I actually wonder is whether the DataSize() is correct a this point if
> there are still pending data writes on the worker thread...

The DataSize() is always correct. It is updated after the call to nsCacheEntryDescriptor::nsOutputStreamWrapper::Write(). If the writes happen on the worker thread then the cache entry isn't closed in nsHttpChannel::CloseCacheEntry() because posted write events on the worker thread hold the reference to the entry. So the cache entry is closed (and also metadata is written) after the last write is done.

Bjarne (:bjarne)

Assignee

Comment 9

•

14 years ago

Attached patch Posting DeactivateEntry to the thread (obsolete) — Details — Splinter Review

Suggested way to write metadata on the cache-write thread. There is also a line to ensure mCacheEntry::mThread is initialized in the ctor (odd, really, that this was missing)

This seems to me as the appropriate place for this split : The cache-service does its own clean-up synchronously, but posts the call to device->DeactivateEntry(). We must avoid posting if we are shutting down (since the cache-service lock is held during shutdown), and there is no point to post the call to current-thread IMO.

Pushed to tryserver...

Assignee: nobody → bjarne

Status: NEW → ASSIGNED

Attachment #431620 - Flags: review?(michal.novotny)

Shawn Wilsher :sdwilsh

Comment 10

•

14 years ago

Note that this is responsible for about 16% of our I/O on the main thread during normal browsing:
https://wiki.mozilla.org/Firefox/Goals/2010Q1/IO_Reduction

Bjarne (:bjarne)

Assignee

Comment 11

•

14 years ago

Comment on attachment 431620 [details] [diff] [review]
Posting DeactivateEntry to the thread

Tryserver orange on WINNT 5.2 and Linux unit tests. Ok on Mac. I'll try to reproduce locally...

Bjarne (:bjarne)

Assignee

Comment 12

•

14 years ago

Comment on attachment 431620 [details] [diff] [review]
Posting DeactivateEntry to the thread

Not able to reproduce any of the failures locally but I see some backouts on trunk due to oranges. Re-dispatching to tryserver...

Bjarne (:bjarne)

Assignee

Comment 13

•

14 years ago

Comment on attachment 431620 [details] [diff] [review]
Posting DeactivateEntry to the thread

Still fails on tryserver. I should probably wait for the async call to finish before removing the devices in nsCacheService::Shutdown()...

Bjarne (:bjarne)

Assignee

Comment 14

•

14 years ago

Mmmm...  no complete success on tryserver yet. I sometimes see runtime-msg "pure virtual method called", which most likely comes from using a dangling pointer. Trying to resolve, but cannot reproduce this locally...

Bjarne (:bjarne)

Assignee

Comment 15

•

14 years ago

Attached patch Posting DeactivateEntry to the thread, waiting for threads to finish processing in Shutdown (obsolete) — Details — Splinter Review

Making sure the posted deactivations happens before devices are destroyed. Posted to tryserver...

Attachment #431620 - Attachment is obsolete: true

Attachment #432413 - Flags: review?(michal.novotny)

Attachment #431620 - Flags: review?(michal.novotny)

Johnathan Nightingale [:johnath]

Comment 16

•

14 years ago

(In reply to comment #15)
> Created an attachment (id=432413) [details]
> Posting DeactivateEntry to the thread, waiting for threads to finish processing
> in Shutdown
> 
> Making sure the posted deactivations happens before devices are destroyed.
> Posted to tryserver...

Bjarne - while we wait for review, does silence here suggest that tryserver was happy with it?

Bjarne (:bjarne)

Assignee

Comment 17

•

14 years ago

> Bjarne - while we wait for review, does silence here suggest that tryserver was
> happy with it?

Unfortunately, no. (It merely suggests that I was busy with a few other things and forgot to update status here.)

This one is on top of my list for tomorrow...

Michal Novotny [:michal]

Comment 18

•

14 years ago

Comment on attachment 432413 [details] [diff] [review]
Posting DeactivateEntry to the thread, waiting for threads to finish processing in Shutdown

Sometimes the metadata is still written on the main thread even if the data is written on the worker thread. nsCacheEntry::mThread is set in nsCacheEntry::SetData(). For new cache entries this is called on the first write, so everything works as expected. But for existing entries SetData() is called when opening the entry and this is done on the main thread.


> +        // release its data-object.
> +        nsCacheServiceAutoLock lock;
> +        (void) mDevice->DeactivateEntry(mEntry);
> +        return NS_OK;

Check the return value and increase nsCacheService::mDeactivateFailures on failure.


> +      mThread(nsnull)

This isn't needed since mThread is nsCOMPtr.


> -    if (mInitialized) {

I don't think this can happen, nevertheless you should probably keep the condition.

Attachment #432413 - Flags: review?(michal.novotny) → review-

Shawn Wilsher :sdwilsh

Comment 19

•

14 years ago

(In reply to comment #18)
> > -    if (mInitialized) {
> 
> I don't think this can happen, nevertheless you should probably keep the
> condition.
drop it and make it an assertion?

Bjarne (:bjarne)

Assignee

Comment 20

•

14 years ago

(In reply to comment #18)
> (From update of attachment 432413 [details] [diff] [review])
> Sometimes the metadata is still written on the main thread even if the data is
> written on the worker thread. nsCacheEntry::mThread is set in
> nsCacheEntry::SetData(). For new cache entries this is called on the first
> write, so everything works as expected. But for existing entries SetData() is
> called when opening the entry and this is done on the main thread.

Metadata is also written for cache *read* requests...  Yes - I overlooked that one! Lemme' have a closer look at when deactivates are posted! (There is also potential for a similar problem as in bug #548406 I believe...)

> > +        // release its data-object.
> > +        nsCacheServiceAutoLock lock;
> > +        (void) mDevice->DeactivateEntry(mEntry);
> > +        return NS_OK;
> 
> Check the return value and increase nsCacheService::mDeactivateFailures on
> failure.

Fair enough.

> > +      mThread(nsnull)
> 
> This isn't needed since mThread is nsCOMPtr.

Very well... but for my curiosity : Why don't we risk garbage in an nsCOMPtr-member? Will statements like |if (this->mThread) {| always work correctly if |mThread| is an nsCOMPtr?

> > -    if (mInitialized) {
> 
> I don't think this can happen, nevertheless you should probably keep the
> condition.

Wouldn't the assertion handle the situation? Having both the assertion and the |if| seem a little redundant to me...?

Shawn Wilsher :sdwilsh

Comment 21

•

14 years ago

(In reply to comment #20)
> Very well... but for my curiosity : Why don't we risk garbage in an
> nsCOMPtr-member? Will statements like |if (this->mThread) {| always work
> correctly if |mThread| is an nsCOMPtr?
nsCOMPtr's constructor sets it's member data to nsnull so you don't have to.

Bjarne (:bjarne)

Assignee

Comment 22

•

14 years ago

(In reply to comment #21)
> (In reply to comment #20)
> > Very well... but for my curiosity : Why don't we risk garbage in an
> > nsCOMPtr-member? Will statements like |if (this->mThread) {| always work
> > correctly if |mThread| is an nsCOMPtr?
> nsCOMPtr's constructor sets it's member data to nsnull so you don't have to.

Ok, but the member in question is |nsCacheEntry::mThread| . mThread is an nsCOMPtr-member in nsCacheEntry, but nsCacheEntry is not (even close to) a nsCOMPtr and is created in nsCacheEntry::Create() like this

nsCacheEntry* entry = new nsCacheEntry(newKey, streamBased, storagePolicy);

What I don't see is how the code |if (entry->mThread)| can be safe unless |nsCacheEntry::mThread| is explicitly initialized...

Benjamin Smedberg

Comment 23

•

14 years ago

the nsCacheEntry constructor runs constructors for all its members, that's normal C++

Bjarne (:bjarne)

Assignee

Comment 24

•

14 years ago

Ah - ok. Thanks...  :)

Shawn Wilsher :sdwilsh

Comment 25

•

14 years ago

(In reply to comment #22)
> nsCacheEntry* entry = new nsCacheEntry(newKey, streamBased, storagePolicy);
member variables are initialized during construction.

Michal Novotny [:michal]

Comment 26

•

14 years ago

(In reply to comment #19)
> > > -    if (mInitialized) {
> drop it and make it an assertion?

The old code contains NS_ASSERTION as well.

(In reply to comment #20)
> Wouldn't the assertion handle the situation? Having both the assertion and the
> |if| seem a little redundant to me...?

But NS_ASSERTION does nothing in release build, so the code would probably crash.

Bjarne (:bjarne)

Assignee

Comment 27

•

14 years ago

> But NS_ASSERTION does nothing in release build, so the code would probably
> crash.

Well, then the |if| should definitely be kept. :)

Bjarne (:bjarne)

Assignee

Comment 28

•

14 years ago

I feel like summarizing (in writing) current thinking around this...  all input and corrections to understanding and/or analysis is highly appreciated!

Bottom line is that we want to let the device deactivate entries off the main-thread because metadata is typically written to disk as a consequence of deactivating and deleting a cache-entry in the disk-cache device. As pointed out by michal in comment #18, this also applies to cache-reads because these may also touch metadata for the entry. In other words, general access to the cache seems to result in writing metadata to disk.

When a cache-entry is read from or written to, data for the entry is set by calling SetData(). If the data set is non-null, nsCacheEntry maintains a member |mThread| pointing to the running thread.

(As a side-note, if a call to SetData(realData) happens prior to a call to SetData(nsnull) we have a weird situation in this respect since |mThread| is not cleared on the second call... maybe we should change this?)

Anyway, when nsCacheService::DeactivateEntry() is called, we have three possible states for nsCacheEntry::mThread

1) not set
2) set to current (main) thread (indicating sync cache-write or a cache-read)
3) set to some other thread (async cache-write implemented in bug #513074)

In the suggested patch, case (3) serializes deactivation of the entry with async writes from bug #513074 by posting to the cache-write thread. The implementation of (2) just deactivates the entry synchronously, resulting in writes on the main-thread (as noted in comment #0).

AFAICS (1) may indicate that there is a pending write (or read) which has not yet accessed its stream, that something (like the collision-detection mechanism in nsDiskCacheBindery::CreateBinding()) failed causing a call like |entry->SetData(nsnull)|, or simply that some component set nsICacheEntryDescriptor::cacheElement to null for whatever reason. The suggested patch implements (1) in the same way as (2) (as a side-effect, not intentionally).

In essence I see two issues here (implementation of (3) is IMO ok) :

First, implementation of (2) must be improved so that possible metadata-writes happens off the main-thread. This can be done by passing the cache-write thread e.g. via nsICacheEntryDescriptor and post deactivation of the entry to this thread. This includes adding a field to a public interface and set it e.g. when the descriptor is created in nsHttpChannel (or it can be added to nsICacheSession and handled by the session).

Second, (1) must be handled properly. In the cases where (1) is caused by a pending request, fixing bug #548406 is IMO a prerequisite (i.e. properly deal with binding doomed entries). Apart from this, we can IMO treat it like (2) above.

I'll implement and test some of this while waiting for comments...

Bjarne (:bjarne)

Assignee

Comment 29

•

14 years ago

I haven't followed the work of making cache-reads async, but it occurred to me that if read-requests are handled on a separate thread also, deactivating such entries will probably end up in case (3) above also, no?

Michal Novotny [:michal]

Comment 30

•

14 years ago

(In reply to comment #29)
> I haven't followed the work of making cache-reads async, but it occurred to me
> that if read-requests are handled on a separate thread also, deactivating such
> entries will probably end up in case (3) above also, no?

I think you're right. Fixing #513008 (I'm working on it right now) will eliminate case (2).

Bjarne (:bjarne)

Assignee

Comment 31

•

14 years ago

If this turns out to be right, we could just wait for solutions to bug #513008 and bug #548406 and keep the suggested solution for (3) here, no?

Shawn Wilsher :sdwilsh

Comment 32

•

14 years ago

(In reply to comment #26)
> But NS_ASSERTION does nothing in release build, so the code would probably
> crash.
But we never expect this to be true, right?  We have assertions all over the code base that check that something isn't null and then just use it without checking nullness.

(In reply to comment #31)
> If this turns out to be right, we could just wait for solutions to bug #513008
> and bug #548406 and keep the suggested solution for (3) here, no?
Do we need to wait for those bugs before we can have a patch up here?

Bjarne (:bjarne)

Assignee

Comment 33

•

14 years ago

> (In reply to comment #31)
> > If this turns out to be right, we could just wait for solutions to bug #513008
> > and bug #548406 and keep the suggested solution for (3) here, no?
> Do we need to wait for those bugs before we can have a patch up here?

Fixing those bugs should eliminate cases (1) and (2) (comment #28) and the suggested patch handles case (3). If we want a fix for case (3) only I can whip it up based on the suggested patch - I see no reason for extending the interface(s) to handle case (2) if it will go away in the near future anyway.

Shawn Wilsher :sdwilsh

Comment 34

•

14 years ago

(In reply to comment #33)
> Fixing those bugs should eliminate cases (1) and (2) (comment #28) and the
> suggested patch handles case (3). If we want a fix for case (3) only I can whip
> it up based on the suggested patch - I see no reason for extending the
> interface(s) to handle case (2) if it will go away in the near future anyway.
I'd like to get as much as we can fix done right away, so getting at least case (3) fixed will be a big help.

Bjarne (:bjarne)

Assignee

Comment 35

•

14 years ago

Attached patch Updated : Post DeactivateEntry, wait for threads to finish processing in Shutdown (obsolete) — Details — Splinter Review

Updated according to comments from reviewer and cleaned up to deliberately only handle "case (3)" entries (i.e. entries which has attached a thread different from main-thread). Pushed to tryserver.

Attachment #432413 - Attachment is obsolete: true

Attachment #434202 - Flags: review?(michal.novotny)

Bjarne (:bjarne)

Assignee

Comment 36

•

14 years ago

Attached patch Updated : Post DeactivateEntry, wait for deactivate-events in Shutdown (obsolete) — Details — Splinter Review

Passes tryserver.

This patch writes metadata for any cache-entry on the thread which accessed the entry the last time. I.e. it should also deal with metadata for async cache-reads when these are implemented. Metadata for cache-entries which are read/written on main-thread are still written on main-thread.

The concern raised in comment #28, 4th paragraph is still unanswered.
 
Moreover, I believe this is a dup of bug #537934 (or vice-versa).

I'd also suggest to rename this bug to "Write/flush cache metadata off main-thread" or something similar to better reflect the real issue.

Attachment #434202 - Attachment is obsolete: true

Attachment #435149 - Flags: review?(michal.novotny)

Attachment #434202 - Flags: review?(michal.novotny)

Shawn Wilsher :sdwilsh

Comment 37

•

14 years ago

(In reply to comment #36)
> The concern raised in comment #28, 4th paragraph is still unanswered.
michal, can you respond to this?

> Moreover, I believe this is a dup of bug #537934 (or vice-versa).
We should probably forward dupe it.

Summary: Cache Writes Still happening on the main thread? → Write/flush cache metadata off main-thread

Shawn Wilsher :sdwilsh

Comment 38

•

14 years ago

(In reply to comment #28)
> Second, (1) must be handled properly. In the cases where (1) is caused by a
> pending request, fixing bug #548406 is IMO a prerequisite (i.e. properly deal
> with binding doomed entries). Apart from this, we can IMO treat it like (2)
> above.
Is this the right bug number?

Bjarne (:bjarne)

Assignee

Comment 39

•

14 years ago

(In reply to comment #38)
> (In reply to comment #28)
> > Second, (1) must be handled properly. In the cases where (1) is caused by a
> > pending request, fixing bug #548406 is IMO a prerequisite (i.e. properly deal
> > with binding doomed entries). Apart from this, we can IMO treat it like (2)
> > above.
> Is this the right bug number?

Yes. Bug #548406 is caused by pending cache-write-requests. If these for some reason are deactivated before executed (e.g. due to another write-request to the same entry) the entries will not have a thread attached at deactivation-time, i.e. case (1).

Michal Novotny [:michal]

Comment 40

•

14 years ago

(In reply to comment #37)
> (In reply to comment #36)
> > The concern raised in comment #28, 4th paragraph is still unanswered.
> michal, can you respond to this?

Yes, we should probably null out mThread as well.

Hmm, cacheElement is a public attribute of the cache entry, but we use it internally to hold the reference to the binding. The question here is what will happen if someone try to use it for his own purpose...

Michal Novotny [:michal]

Comment 41

•

14 years ago

Comment on attachment 435149 [details] [diff] [review]
Updated : Post DeactivateEntry, wait for deactivate-events in Shutdown

Sorry for the delay, I was experiencing a crash when this patch and the patch from #513008 were applied, so I was inspecting it. Regarding my comment #40 I don't think it's a good idea to use mThread for the decision on which thread to deactivate the entry. IMHO we should either remove cacheElement attribute from the IDL if it isn't needed, or add a new thread variable for our purpose.

Attachment #435149 - Flags: review?(michal.novotny)

Michal Novotny [:michal]

Comment 42

•

14 years ago

(In reply to comment #36)
> +  // (The dispatched event must post some event back to this thread to ensure
> +  // the event-loop-spin which implements NS_DISPATCH_SYNC doesn't starve -
> +  // hence the parameter)

Why exactly is this needed?

Bjarne (:bjarne)

Assignee

Comment 43

•

14 years ago

>Regarding my comment #40 I
> don't think it's a good idea to use mThread for the decision on which thread to
> deactivate the entry. IMHO we should either remove cacheElement attribute from
> the IDL if it isn't needed, or add a new thread variable for our purpose.

Make cacheElement or the mThread "non-scriptable" in the IDL? (Assuming you are worried about scripts accessing this?)

(In reply to comment #42)
> (In reply to comment #36)
> > +  // (The dispatched event must post some event back to this thread to ensure
> > +  // the event-loop-spin which implements NS_DISPATCH_SYNC doesn't starve -
> > +  // hence the parameter)
> 
> Why exactly is this needed?

The code which implements synchronous event-dispatching spins the event-loop for the calling thread until the dispatched event has finished. If the event-queue for the calling thread is exhausted before the dispatched event runs, it *looks like* it keeps spinning until an event is dispatched to the calling thread. So, by posting an (empty) event back to the calling thread I make sure the calling thread receives an event and breaks out of the spin.

Shawn Wilsher :sdwilsh

Comment 44

•

14 years ago

(In reply to comment #43)
> Make cacheElement or the mThread "non-scriptable" in the IDL? (Assuming you are
> worried about scripts accessing this?)
C++ could still access it then (think add-ons).

Mike Shaver (:shaver -- probably not reading bugmail closely)

Comment 45

•

14 years ago

Speaking only for myself, I'm willing to break all 0.05 add-ons that want to mess around with this stuff from C++, and give them use-case-specific APIs if their authors ask.

Michal Novotny [:michal]

Comment 46

•

14 years ago

I'm not sure about the reason why this attribute exists, but it doesn't seem to be used by our code (only some tests use it). And I could find only one addon that uses it (Bugmail for Thunderbird https://mxr.mozilla.org/addons/source/9584/chrome/content/overlay.js#75). So can we simply remove the attribute or do we need to keep it and create a new member variable in nsCacheEntry for our internal purposes?

Michal Novotny [:michal]

Comment 47

•

14 years ago

(In reply to comment #43)
> The code which implements synchronous event-dispatching spins the event-loop
> for the calling thread until the dispatched event has finished. If the
> event-queue for the calling thread is exhausted before the dispatched event
> runs, it *looks like* it keeps spinning until an event is dispatched to the
> calling thread. So, by posting an (empty) event back to the calling thread I
> make sure the calling thread receives an event and breaks out of the spin.

This shouldn't happen and IMHO if it happens it is a bug and it should be fixed in nsThread. BTW what you're trying to do is exactly what nsThreadSyncDispatch::Run() already does.

Bjarne (:bjarne)

Assignee

Comment 48

•

14 years ago

(In reply to comment #47)
> This shouldn't happen and IMHO if it happens it is a bug and it should be fixed
> in nsThread. BTW what you're trying to do is exactly what
> nsThreadSyncDispatch::Run() already does.

You're right on both, of-course. But the fact remains : The ack made the patch pass the tryserver (tests were hanging without it). I'll dig a little deeper on this next week and re-run the patch a few times to make sure it's robust in this sense.

As a side-note, there are very few places where sync-dispatching is used at all (ogg- and wave-decoders, gnomevfs protocolhandler and then in TestProxies). Not entirely sure what TestProxies do, but I'll figure out next week...

Shawn Wilsher :sdwilsh

Comment 49

•

14 years ago

What is the status of this bug?

Shawn Wilsher :sdwilsh

Updated

•

14 years ago

blocking2.0: --- → ?

Bjarne (:bjarne)

Assignee

Comment 50

•

14 years ago

Attached patch Updated according to comments from reviewer ++ (obsolete) — Details — Splinter Review

This has been on the list for too long... here's an updated patch.

(In reply to comment #37)
> > Moreover, I believe this is a dup of bug #537934 (or vice-versa).
> We should probably forward dupe it.

Uncertain about terminology...  should I mark 537934 a duplicate of this bug?

(In reply to comment #40)
> Yes, we should probably null out mThread as well.

Ok - done.

> Hmm, cacheElement is a public attribute of the cache entry, but we use it
> internally to hold the reference to the binding. The question here is what will
> happen if someone try to use it for his own purpose...

Well, |cacheElement| only provides access to the data of the element, not the nsCacheEntry-object itself. Setting the attribute sets the data and stores a reference to the calling thread in |nsCacheEntry::mThread|, but apart from storing this reference I don't see any problems with |cacheElement| being public (please correct me if I'm missing something!).

The lifecycle of |nsCacheEntry::mThread| may thus be outside our control (e.g. it may be created and destroyed by an addon). Which means that when we try to deactivate the entry using |nsCacheEntry::mThread| it may fail miserably. I've added code to deactivate the entry on main-thread if dispatching fails. Should we be even more careful? Is there a util-method to tell if a thread is still valid? Can we tell somehow if the thread is out of our control and avoid storing it in |nsCacheEntry::mThread| in this case?

Or the simplest solution is of-course to remove the attribute completely... :) It is used in tests, like michal says, to verify that the cache returns what is stored, but there may be other ways to verify that.

> (In reply to comment #43)
> This shouldn't happen and IMHO if it happens it is a bug and it should be fixed
> in nsThread. BTW what you're trying to do is exactly what
> nsThreadSyncDispatch::Run() already does.

I've removed this from the current patch. Pushed to tryserver for verification (again).

Attachment #435149 - Attachment is obsolete: true

Attachment #444071 - Flags: review?(michal.novotny)

Bjarne (:bjarne)

Assignee

Comment 51

•

14 years ago

Patch passes tryserver on Linux and OSX. Fails one reftest on WinNT and encounters a number of 

###!!! [Parent][RPCChannel] Error: Channel error: cannot send/recv

which results in a crash, logged as

PROCESS-CRASH | automation.py | application crashed (minidump found)

(Brief log can be found at http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1273235252.1273245260.13979.gz )

The reftest-failure seems completely unrelated to the patch, and I can see these types of crashes in other tryserver-runs, hence I'd claim that patch passes tryserver. Waiting for review.

Shawn Wilsher :sdwilsh

Comment 52

•

14 years ago

(In reply to comment #51)
> ###!!! [Parent][RPCChannel] Error: Channel error: cannot send/recv
Did you build with electrolysis?  It sounds like it's trying to to do ipc stuff there.

Bjarne (:bjarne)

Assignee

Comment 53

•

14 years ago

That would probably be a bummer, yes... I've not enabled that intentionally, and 'hg identify' returns

0459345d8026 tip

Anything else I can check to verify/falsify? (Patch did apply cleanly on tryserver "mozilla-central" repository...)

Shawn Wilsher :sdwilsh

Comment 54

•

14 years ago

(In reply to comment #53)
> Anything else I can check to verify/falsify? (Patch did apply cleanly on
> tryserver "mozilla-central" repository...)
Ask in #content?  I'm not sure offhand.

Bjarne (:bjarne)

Assignee

Comment 55

•

14 years ago

I've disabled electrolysis at runtime in this patch by setting pref 'dom.ipc.plugins.enabled' to false everywhere in "browser/app/profile/firefox.js". Pushed to tryserver again...

Bjarne (:bjarne)

Assignee

Comment 56

•

14 years ago

Patch as described above passes tryserver for OSX but encounters bug #562886 on Linux and bug #538616 on WinNT (with electrolysis disabled though, which is interesting).

Link to WinNT log : http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1273657708.1273667985.641.gz

Link to Linux log : http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1273657708.1273667868.32729.gz

Patch is identical to the patch attached in comment #50 (with the exception of setting "dom.ipc.plugins.enabled" to false). Hence, I claim again that it passes tryserver. Waiting for review.

Bjarne (:bjarne)

Assignee

Comment 57

•

14 years ago

Hmm....  I was mistaken about OSX. I confused email about building successfully on OSX with successfully running unit tests on OSX. Finally, the email with 2 warnings in OSX unit tests came along...

I have to look closer at it later - cannot reproduce any of those issues locally. Review would still be valuable, I believe.

Log for OSX is : http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1273657708.1273671840.19635.gz

Bjarne (:bjarne)

Assignee

Comment 58

•

14 years ago

Ohh - if it's feasible to run the DTrace-script which defines the criteria for IO-reduction, please direct me to instructions for how to do so. Or maybe sdwilsh wants to do that? Even without the async cache-read patch we should see some effect...

Shawn Wilsher :sdwilsh

Comment 59

•

14 years ago

(In reply to comment #58)
> Ohh - if it's feasible to run the DTrace-script which defines the criteria for
> IO-reduction, please direct me to instructions for how to do so. Or maybe
> sdwilsh wants to do that? Even without the async cache-read patch we should see
> some effect...
I documented this in the wiki :)
https://wiki.mozilla.org/Firefox/Goals/2010Q1/IO_Reduction_Criteria

Shawn Wilsher :sdwilsh

Comment 60

•

14 years ago

(In reply to comment #57)
> I have to look closer at it later - cannot reproduce any of those issues
> locally. Review would still be valuable, I believe.
The failure is from thread 5 crashing (EXC_BAD_ACCESS).  Backtrace looks like this:
 0  XUL + 0x73721
    eip = 0x02281721   esp = 0xb030adc0   ebp = 0xb030ade8   ebx = 0x0228170d
    esi = 0x0020002a   edi = 0x00dd3d60   eax = 0x00000000   ecx = 0xb030ab4c
    edx = 0xb030addc   efl = 0x00010202
    Found by: given as instruction pointer in context
 1  XUL + 0x7188d
    eip = 0x0227f88e   esp = 0xb030adf0   ebp = 0xb030ae18
    Found by: previous frame's frame pointer
 2  XUL!nsMemoryCacheDevice::DeactivateEntry(nsCacheEntry*) [nsMemoryCacheDevice.cpp:2108d2458719 : 170 + 0x7]
    eip = 0x022870c5   esp = 0xb030ae20   ebp = 0xb030ae48
    Found by: previous frame's frame pointer
 3  XUL!nsCacheServiceDeactivateEntryEvent::Run() [nsCacheService.cpp:2108d2458719 : 1876 + 0x11]
    eip = 0x022860f8   esp = 0xb030ae50   ebp = 0xb030ae68
    Found by: previous frame's frame pointer
 4  XUL!nsThread::ProcessNextEvent(int, int*) [nsThread.cpp:2108d2458719 : 527 + 0x7]
    eip = 0x02ef263d   esp = 0xb030ae70   ebp = 0xb030aeb8
    Found by: previous frame's frame pointer
 5  XUL!NS_ProcessNextEvent_P(nsIThread*, int) [nsThreadUtils.cpp : 250 + 0x15]
    eip = 0x02eaf5ba   esp = 0xb030aec0   ebp = 0xb030aee8
    Found by: previous frame's frame pointer
 6  XUL!nsThread::ThreadFunc(void*) [nsThread.cpp:2108d2458719 : 254 + 0xf]
    eip = 0x02ef27f0   esp = 0xb030aef0   ebp = 0xb030af38
    Found by: previous frame's frame pointer
 7  libnspr4.dylib!_pt_root [ptthread.c:2108d2458719 : 228 + 0x8]
    eip = 0x0005f892   esp = 0xb030af40   ebp = 0xb030af78
    Found by: previous frame's frame pointer
 8  libSystem.B.dylib + 0x31c54
    eip = 0x90037c55   esp = 0xb030af80   ebp = 0xb030afc8
    Found by: previous frame's frame pointer
 9  libSystem.B.dylib + 0x31b11
    eip = 0x90037b12   esp = 0xb030afd0   ebp = 0xb030afec
    Found by: previous frame's frame pointer

Shawn Wilsher :sdwilsh

Comment 61

•

14 years ago

also see this just before the crash:
xpcshell(30795,0xb030b000) malloc: *** error for object 0xdd0006: Non-aligned pointer being freed
*** set a breakpoint in malloc_error_break to debug

Bjarne (:bjarne)

Assignee

Comment 62

•

14 years ago

> > IO-reduction, please direct me to instructions for how to do so. Or maybe
> > sdwilsh wants to do that? Even without the async cache-read patch we should see
> > some effect...
> I documented this in the wiki :)
> https://wiki.mozilla.org/Firefox/Goals/2010Q1/IO_Reduction_Criteria

Dtrace does not seem to be available for my normal development-box (Ubuntu Linux) but I'm planning to get my hands on a Mac-box. I assume that would help?

(In reply to comment #60)
> The failure is from thread 5 crashing (EXC_BAD_ACCESS).  Backtrace looks like
> this:
>  0  XUL + 0x73721
>     eip = 0x02281721   esp = 0xb030adc0   ebp = 0xb030ade8   ebx = 0x0228170d
>     esi = 0x0020002a   edi = 0x00dd3d60   eax = 0x00000000   ecx = 0xb030ab4c
>     edx = 0xb030addc   efl = 0x00010202
>     Found by: given as instruction pointer in context
>  1  XUL + 0x7188d
>     eip = 0x0227f88e   esp = 0xb030adf0   ebp = 0xb030ae18
>     Found by: previous frame's frame pointer
>  2  XUL!nsMemoryCacheDevice::DeactivateEntry(nsCacheEntry*)
> [nsMemoryCacheDevice.cpp:2108d2458719 : 170 + 0x7]
>     eip = 0x022870c5   esp = 0xb030ae20   ebp = 0xb030ae48
>     Found by: previous frame's frame pointer

  [ snip ]

Thanks a bundle! This is certainly relevant to the patch. However, I'm wondering if maybe the currently suggested patch for bug #548406 may help here : it cleans up the sequence of removing entries from memory-cache, which is where this crash happens.

The suggested patch for this bug changes timing of deactivating / removing entries, and a cleaner sequence for this in the memory-cache cannot hurt. I'll submit a combined patch to the tryserver.

Shawn Wilsher :sdwilsh

Comment 63

•

14 years ago

(In reply to comment #62)
> Dtrace does not seem to be available for my normal development-box (Ubuntu
> Linux) but I'm planning to get my hands on a Mac-box. I assume that would help?
Very likely.  I was using it on a mac, don't know if that script would work on Linux.

Bjarne (:bjarne)

Assignee

Comment 64

•

14 years ago

> The suggested patch for this bug changes timing of deactivating / removing
> entries, and a cleaner sequence for this in the memory-cache cannot hurt. I'll
> submit a combined patch to the tryserver.

Results from tryserver for combined patch for this bug with patch for bug #548406 :

1) Combined patch fails with bug #562886 on Linux.
2) On WinNT it fails with the error described in bug #548406, comment #323.
3) On all three platforms it fails in chrome://mochikit/content/browser/browser/components/sessionstore/test/browser/browser_522545.js

Analysis

(1) Is a known issue on Linux and is unrelated to this patch
(2) Indicates that the patch for bug #548406 introduces/triggers some issue in the application-cache. Must be resolved.
(3) Can be reproduced locally both with and without the patch. However, when using the latest version from trunk it passes locally without and with the patch. I'll ignore this issue for now.

Conclusion

Need to figure out what causes (2) above (failure in "test_simpleManifest.html")

Bjarne (:bjarne)

Assignee

Comment 65

•

14 years ago

> Need to figure out what causes (2) above (failure in
> "test_simpleManifest.html")

Cannot reproduce locally (yet). An observation is that the listener for the app-cache is installed before the ordinary cache-listener, hence when the ordinary cache-listener becomes an async tee, the app-cache will also be written asynchronously on the cache-write thread (since these tees are chained with the last one installed first in the list).

Bjarne (:bjarne)

Assignee

Comment 66

•

14 years ago

Now I see what probably causes the problem (ref also comment above).

The failing check in "test_simpleManifest.js" uses the attribute "mozLength" which is derived from the list of entries gathered from the cache-device. Before performing this check, the test waits for the cache-entry to become available using nsCacheService::OpenCacheEntry() which calls nsCacheService::ActivateCache(). ActivateCache() may find the cache-entry in mActiveEntries *before* the entry is bound to the device, hence there is a time-window from the cache-entry is found and until the cache-device knows the entry. This time-window is also the root of bug #548406, however the fix for bug #548406 is not to eliminate the time-window but rather to handle it properly. I'll find a way to handle it here also.

Bjarne (:bjarne)

Assignee

Comment 67

•

14 years ago

Ahh...  can this be bug #484123?

I've uploaded the patch from comment #64 to tryserver including a small fix to the test to ensure the time-window is properly handled.

Johnny Stenback (:jst)

Comment 68

•

14 years ago

I think we want this for 1.9.3.

blocking2.0: ? → beta1+

Shawn Wilsher :sdwilsh

Comment 69

•

14 years ago

(In reply to comment #68)
> I think we want this for 1.9.3.
I think we *need* it ;)

Bjarne (:bjarne)

Assignee

Comment 70

•

14 years ago

(In reply to comment #67)
> I've uploaded the patch from comment #64 to tryserver including a small fix to
> the test to ensure the time-window is properly handled.

Results from tryserver contain too many of these to make sense

###!!! [Parent][RPCChannel] Error: Channel error: cannot send/recv

followed by a crash. I (accidentally) re-enabled the electrolysis-stuff in this version, so I've disabled it again and resubmitted to tryserver...

Bjarne (:bjarne)

Assignee

Comment 71

•

14 years ago

(In reply to comment #70)
> ... , so I've disabled it again and resubmitted to tryserver...

Results from tryserver

OSX   : ok
Linux : hits bug #562866, plus a crash in test_updatecheck.js
WinNT : hits bug #538616

The crash on Linux in test_updatecheck.js shows this log

Crash reason:  SIGSEGV
Crash address: 0x1100

Thread 5 (crashed)
 0  libxul.so!nsDiskCacheMap::DeleteRecord(nsDiskCacheRecord*) [nsDiskCacheMap.h:962135366d6b : 139 + 0x0]
    eip = 0x4026de26   esp = 0x4270127c   ebp = 0x0a92b308   ebx = 0x40ff6974
    esi = 0x0fb6aab1   edi = 0x00001100   eax = 0x00000010   ecx = 0x00000000
    edx = 0x00001100   efl = 0x00010246
    Found by: given as instruction pointer in context

It can certainly be related to this patch (it posts deactivation on the device to another thread) and I cannot find any existing bugs mentioning this. Will have to look closer at this issue.

Bjarne (:bjarne)

Assignee

Comment 72

•

14 years ago

(In reply to comment #71)
> Linux : hits bug #562866, plus a crash in test_updatecheck.js

Sorry - should be :

Linux : hits bug #562886, plus a crash in test_updatecheck.js

Bjarne (:bjarne)

Assignee

Comment 73

•

14 years ago

Attached patch Modified to detach descriptors from cacheEntry if deleted in DeactivateEntry (obsolete) — Details — Splinter Review

An nsCacheEntry may actually have descriptors attached when we delete it in nsCacheService::DeactivateEntry(), so in this version we make sure to clear these before deleting the entry. Failing to do so may have caused the crashes in comment #62 and comment #71.

(It passes local testing - just waiting to get push-to-tryserver access so that it can be tested properly...)

Attachment #444071 - Attachment is obsolete: true

Attachment #449863 - Flags: review?(michal.novotny)

Attachment #444071 - Flags: review?(michal.novotny)

(no longer active)

Comment 74

•

14 years ago

Comment on attachment 449863 [details] [diff] [review]
Modified to detach descriptors from cacheEntry if deleted in DeactivateEntry

I've pushed this patch to the try server on your behalf (changeset 3f7a8816a26f).

You can see the results (in a few hours/days/months/centuries) on <http://tests.themasta.com/tinderboxpushlog/?tree=MozillaTry>.  :-)

Shawn Wilsher :sdwilsh

Comment 75

•

14 years ago

Try server builds look good.  Note that we have ten days until the beta code freeze, and this is blocking.

Bjarne (:bjarne)

Assignee

Comment 76

•

14 years ago

(In reply to comment #74)
> I've pushed this patch to the try server on your behalf (changeset
> 3f7a8816a26f).

Thanks! Appreciate it!

Note that the attached patch explicitly disables IPC globally by changing browser/app/profile.firefox.js. Should probably be stripped from the final patch.

(In reply to comment #75)
> Try server builds look good.  Note that we have ten days until the beta code
> freeze, and this is blocking.

Lots of things can be done in ten days. :)

try-mac-slave15 is orange, but so are most builds before and after
try-mac-slave08 is orange, the two subsequent builds have the exact same error
try-mac-slave24 is orange, the builds before and after have same errors
try-win32-slave01 is orange, the builds before and after have same errors
try-win32-slave17 is orange, suspicious-looking crash!
try-win32-slave19 is orange, Java RT died

otherwise all is green.

I'll have a look at the issue on try-win32-slave17 which seems relevant to this patch. (Hopefully, the Java-crash is a coincidence although I cannot find any bugs mentioning it.)

Bjarne (:bjarne)

Assignee

Comment 77

•

14 years ago

(In reply to comment #76)
> I'll have a look at the issue on try-win32-slave17 which seems relevant to this
> patch.

Seems like the same issue as seen in comment #71.

nsCacheEntry-objects are actually deleted in a number of places so I'll detach descriptors (ref comment #73) in the dtor to block pending operations on the entry.

Bjarne (:bjarne)

Assignee

Comment 78

•

14 years ago

Attached patch Modified as described in comment #77 (obsolete) — Details — Splinter Review

I'm still waiting for permission to push to tryserver, so if anyone would like to push it for me I'd be grateful. :)

Attachment #449863 - Attachment is obsolete: true

Attachment #450116 - Flags: review?(michal.novotny)

Attachment #449863 - Flags: review?(michal.novotny)

Johnathan Nightingale [:johnath]

Comment 79

•

14 years ago

(In reply to comment #78)
> Created an attachment (id=450116) [details]
> Modified as described in comment #77
> 
> I'm still waiting for permission to push to tryserver, so if anyone would like
> to push it for me I'd be grateful. :)

[Aside: bug number for your request? Level 1 access should not take that long.]

Johnny Stenback (:jst)

Comment 80

•

14 years ago

Bug 570475, and I think the reason for the delay is that the magic token message saying it's ready wasn't added, so I just added that.

I just pushed the latest patch to the try server, commit id is 828551107c36.

(no longer active)

Comment 81

•

14 years ago

I've pushed a try server run with this patch, and I just saw comment 78.  Bjarne, feel free to watch over it (changeset 27d8113d68f7 at Wed Jun 9 17:36:00 2010 titled: imported patch attachment.cgi?id=450116).

(no longer active)

Comment 82

•

14 years ago

And I just saw comment 80!  Well, now you have two try server runs!  :-)

Bjarne (:bjarne)

Assignee

Comment 83

•

14 years ago

(In reply to comment #80)
> Bug 570475, and I think the reason for the delay is that the magic token
> message saying it's ready wasn't added, so I just added that.

Thanks! I'm not too familiar with these formal procedures...

> I just pushed the latest patch to the try server, commit id is 828551107c36.

try-win32-slave10 (opt, mochitests-1/5): fails with bug #565245
try-win32-slave29: Java RT died
try-win32-slave10 (opt, xpcshell): fails with bug #538616
try-linux-slave13 (debug, reftest): fails reftest/tests/layout/reftests/svg/dynamic-use-nested-01.svg
try-linux-slave18 (debug, crashtest): fails with assertion count for crashtest 432114-2.html

(no longer active)

Comment 84

•

14 years ago

Bjarne, since you have two try server pushes with the same patch, you could compare their results, specifically for the second one below.

The dynamic-use-nested-01.svg failure was bug 571036 which is fixed now.  On the assertion count failure, you need to check the actual assertions triggered.  If they're not related to the cache code, this is probably an intermittent failure, which you should file a bug for.  :-)

Bjarne (:bjarne)

Assignee

Comment 85

•

14 years ago

Seems like my previous comment was submitted to early...  here's the rest.

> I just pushed the latest patch to the try server, commit id is 828551107c36.

try-w32-slave04 (debug, crashtest): application timeout near crashtests/386680-1.html
try-w32-slave10 (opt, xpcshell): fails with bug #538616
try-w32-slave10 (opt, mochitests-1/5): fails with bug #565245
try-w32-slave10 (debug, mochitests-2/5): fails with bug #492232
try-w32-slave18 (debug, mochitests-3/5): LEAK
try-w32-slave25 (debug, jsreftest): fails with wrong assertion count for regress-344804.js and regress-181834.js
try-w32-slave29: Java RT died

try-linux-slave02 (maemo build): failed, but previous builds also failed
try-linux-slave13 (debug, reftest): fails reftest/tests/layout/reftests/svg/dynamic-use-nested-01.svg
try-linux-slave18 (debug, crashtest): fails with wrong assertion count for crashtest 432114-2.html
try-linux-slave22 (opt, mochitests-other): some failures, starting with bug #539283

try-mac-slave24 (opt, mochitest/other): fails with bug #567950
try-mac-slave28 (opt, reftest): crash in ValueToNumberSlow() while processing an nsInputStreamReadyEvent - not sure whether it's relevant
OSX 10.6.2 (talos tp4): Busted, but build prior to this was also busted
try-mac64-slave09 (OSX 10.6.2 leak test): fails like all builds before and after

Phew...

(In reply to comment #81)
> Bjarne, feel free to watch over it (changeset 27d8113d68f7 at Wed Jun 9

Thanks again. :) 

try-w32-slave09 (opt, xpcshell): fails with bug #538616 (like above)
try-w32-slave16 (opt, mochitests-1/5): fails with bug #565245 (like above)
try-linux-slave11 (debug, reftest): fails in "dynamic-use-nested-01.svg" (like above)
try-w32-slave26 (debug, jsreftest): fails with wrong assertion count for regress-344804.js and regress-181834.js (like above)

try-linux-slave08 (opt, xpcshell): crash in "test_bug_406857.js" - very much like the one in comment #71
try-linux-slave21 (opt, mochitest-other): some failures, first in "browser_sidebarpanels_click.js" which probably is bug #539283 (like above)
try-linux-slave25 (opt, reftest): fails reftest/tests/layout/reftests/svg/dynamic-use-nested-01.svg (like above for debug-build)

+ several more failures

The interesting (and relevant) thing is the (Linux, opt, xpcshell) crash on try-linux-slave08 which looks similar to comment #71. The latest fix apparently did not resolve this, so I'll have a look at it again.

It seems to be triggered by accessing a field in a nsDiskCacheMap-object which, most likely, is deleted:

http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheMap.h#139

probably called from

http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheMap.cpp#478

and I have also seen a stack pointing to

http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheMap.cpp#484

Bjarne (:bjarne)

Assignee

Comment 86

•

14 years ago

The interesting crash happens in this context

1. nsDiskCacheDevice::DeactivateEntry() is called off main-thread
2. The entry is not doomed...
3. ...hence DeactivateEntry() tries to write the entry to disk, but fails
4. ...and crashes in mCacheMap::DeleteRecord() when cleaning up

Stack is here http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1276132562.1276133440.25820.gz#err1

Looks like the nsDiskCacheRecord-param to mCacheMap::DeleteRecord() is corrupted, causing the crash. Questions:

- why would writing the cache entry fail in the first place?
- how can the param to nsDiskCacheMap::DeleteRecord() be corrupted?

I'm a little stuck at the moment. I don't see what can happen in the timeslot between posting the deactivate-event and running it, which causes writing the entry to disk to fail (or corrupt the param to DeleteRecord for that matter).

Any suggestions are welcome, as well as any ideas for how to set up a test to trigger this. (I'm not able to reproduce.) Will look into this over the weekend again.

(no longer active)

Comment 87

•

14 years ago

Given a quick look at this patch, it's clearly wrong to assume that the weak references helf by nsCacheServiceDeactivateEntryEvent are going to be valid once we actually need them.  For example, what would prevent the service itself from going away?  What about the entry?  I think we should be holding some kind of a strong pointer to those objects.

Bjarne (:bjarne)

Assignee

Comment 88

•

14 years ago

That's valuable input. Thanks!

In principle I agree, but the plan was to control this by SYNC_DISPATCHing the dummy-events in nsCacheService::Shutdown() to ensure deactivation-events have finished before releasing the service. This assumes that Shutdown() in fact is called, and that nsCacheEntry-objects are not delete'd inappropriately (I'm tracing the potential for the latter now, but it looks ok). Are these assumptions completely wrong?

Nevertheless, I'll try using a strong ref and once my commit-access is working...

(no longer active)

Comment 89

•

14 years ago

(In reply to comment #88)
> That's valuable input. Thanks!
> 
> In principle I agree, but the plan was to control this by SYNC_DISPATCHing the
> dummy-events in nsCacheService::Shutdown() to ensure deactivation-events have
> finished before releasing the service. This assumes that Shutdown() in fact is
> called, and that nsCacheEntry-objects are not delete'd inappropriately (I'm
> tracing the potential for the latter now, but it looks ok). Are these
> assumptions completely wrong?

I don't know this code very well, so I'm probably not the best person to answer that.  But I have a helpful tip for you, try running Firefox under valgrind, which would assist you in determining memory safety issues.

> Nevertheless, I'll try using a strong ref and once my commit-access is
> working...

Feel free to ask me at any time to do a try server push in the mean time.  You can comment on the bug, send me a direct email, ping me on irc, give me a call, whatever!  :-)  I'd really like to help you get this landed on mozilla-central as quickly as possible.

Bjarne (:bjarne)

Assignee

Comment 90

•

14 years ago

Attached patch Modified to also handle profile-shutdown (obsolete) — Details — Splinter Review

Apppreciate it, Ehsan - thanks for your support!

Finally managed to reproduce locally, and I really wonder why I didn't see this issue before: The cache-devices are shut down also in nsCacheService::OnProfileChanged() but with no code to block on pending deactivation-events. Fixed, cannot reproduce locally anymore, and pushed to tryserver (yes - I can do that now! :)  ).

Waiting for results from tryserver and review.

Attachment #450116 - Attachment is obsolete: true

Attachment #451264 - Flags: review?(michal.novotny)

Attachment #450116 - Flags: review?(michal.novotny)

Bjarne (:bjarne)

Assignee

Comment 91

•

14 years ago

Results are better, but still there is a weird crash. After spending most of the day reproducing it locally and analyzing it I realize that nsMemoryCache::FindEntry() just picks up the entry from a hashtable, keeping old values for |mData| and |mThread| in particular.

This seems wrong since deactivation of the entry will then be dispatched to the thread which initially wrote data to the entry, as opposed to the thread which called FindEntry().

Fixed, cannot reproduce locally, and new patch pushed to tryserver...

Bjarne (:bjarne)

Assignee

Comment 92

•

14 years ago

More details fixed and another push to tryserver...  (this feels more and more like Pandora's box)

Bjarne (:bjarne)

Assignee

Comment 93

•

14 years ago

Still ironing out details, but at least we're down to details now. One more push to tryserver before the week-end...

Johnny Stenback (:jst)

Comment 94

•

14 years ago

I'd love to see this in beta1, but I can't say it's really a blocker. If this is landable before the deadline (TBD, but not very many days from now), let's make that happen!

blocking2.0: beta1+ → beta2+

Bjarne (:bjarne)

Assignee

Comment 95

•

14 years ago

Must have been something wrong with tryservers over the weekend because results all over the place look really funny... pushed again on monday when things seemed to be stable.

Now results look ok. There are two failing jsreftests (orange) with wrong assertion-count on w32/dbg but these seem to be common these days. The last (and interesting) orange is in

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1277123225.1277125340.1704.gz#err0

Scroll one line up and observe the explicit abort from imglib... looks like imglib aborts in certain cases (Release() in this case) when multiple threads are involved and I doubt such behaviour will/can be handled properly by the cache in the same process...

(The next error in the given log is clearly related to this patch but I expect it to be a consequence of the explicit abort by imglib. I'm trying to reproduce it locally but no success yet.)

Question is : What to do with explicit protection against multiple threads in imglib with increased usage of multiple threads in the cache?

Bjarne (:bjarne)

Assignee

Comment 96

•

14 years ago

Attached patch Cleaned up and deleted code which disabled IPC — Details — Splinter Review

This concludes work here for now. Pushed to tryserver one last time for verification and requesting review.

There is still the open question/concern in comment #95 (last paragraph) but I think it should be handled as a separate bug, and it may also be relevant in other contexts than this one.

Attachment #451264 - Attachment is obsolete: true

Attachment #453341 - Flags: review?(michal.novotny)

Attachment #451264 - Flags: review?(michal.novotny)

Johnny Stenback (:jst)

Comment 97

•

14 years ago

Michal, any chance you can get to this review any time soon?

blocking2.0: beta2+ → beta3+

Mike Beltzner [:beltzner, not reading bugmail]

Comment 98

•

14 years ago

Still a b3 blocker? Not comfortable with it continuing to bump down the line, maybe betaN+?

Johnny Stenback (:jst)

Comment 99

•

14 years ago

Marking betaN+

blocking2.0: beta3+ → betaN+

Bjarne (:bjarne)

Assignee

Comment 100

•

14 years ago

Unbitrotted and passed to tryserver to get a fresh feeling about current status.

The open question in comment #95 (in essence: How to release imgCacheEntry-objects on an arbitrary thread) is probably bug #529949 so I'm taking the liberty of adding a dependency to it. I also observe bug #542809 which seems to be the same problem in a different context.

Depends on: 529949

Bjarne (:bjarne)

Assignee

Comment 101

•

14 years ago

(In reply to comment #100)
> Unbitrotted and passed to tryserver to get a fresh feeling about current
> status.

Five starred failures. However, it also fails consistently on all platforms, both debug and opt on

chrome://mochikit/content/chrome/widget/tests/test_imestate.html

I find this failure mentioned in bug #584944 and bug #585361 although it is not the main issue of either of those bugs. Trying to reproduce locally...

Bjarne (:bjarne)

Assignee

Comment 102

•

14 years ago

Not able to reproduce locally (opt/dbg). Push again to tryserver...

Bjarne (:bjarne)

Assignee

Comment 103

•

14 years ago

Results from last push are surprisingly (suspiciously?) green with only three oranges :

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282131772.1282134988.1526.gz

which looks like bug #544601

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282124270.1282134123.28731.gz

where I cannot find the problem (a leak test showing no leaks afaics)

and http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282131772.1282133850.27248.gz

which fails in /tests/dom/tests/mochitest/dom-level2-core/test_elementgetattributenodens03.html on the assertion

###!!! ASSERTION: nsScriptCacheCleaner not thread-safe: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file e:/builds/moz2_slave/tryserver-win32-debug/build/content/base/src/nsFrameMessageManager.cpp, line 681

The latter issue should probably be taken seriously although the test itself only contains two TODOs. Cannot reproduce locally but will investigate further.

Bjarne (:bjarne)

Assignee

Comment 104

•

14 years ago

Attached patch New, lighter approach to achieve the desired effect (obsolete) — Details — Splinter Review

As discussed on irc, this is a different and more direct approach to writing cache-metadata off the main-thread. I'm attaching it here to give you an idea of what I mean - comments are appreciated.

Attachment #467409 - Flags: review?(michal.novotny)

Michal Novotny [:michal]

Comment 105

•

14 years ago

Comment on attachment 467409 [details] [diff] [review]
New, lighter approach to achieve the desired effect

In the first patch (#453341) I really don't like collecting threads in mDeactivateEntryThreads and the method BlockOnAsyncDeactivateEvents(). Now when patch from bug #513008 has landed we have a dedicated cache IO thread that we can use to deactivate the entry. IMO there is also no reason to try to deactivate entry on the same thread where SetData() was called. Correct me if I'm wrong but there shouldn't be any problem to deactivate entry on the background IO thread even if data was written synchronously due to STORE_ON_DISK_AS_FILE policy.

The second patch (#467409) is more straightforward. But what I don't like on both patches is the fact that we have the cache in inconsistent state for a short period. We proceed nsCacheService::DeactivateEntry() immediately but the disk device deactivates the entry later. I.e. within this period we don't have the entry in nsCacheService::mActiveEntries but nsDiskCacheBindery::FindActiveBinding() can find a binding for the entry.

Instead of posting nsDiskCacheDevice::DeactivateEntry() to the background thread we should probably try to change the code so that most of the calls to nsCacheService::DeactivateEntry() happen on the background thread. It is called from

1) nsCacheService::ProcessRequest()
2) nsCacheService::DoomEntry_Internal()
3) nsCacheService::CloseDescriptor()
4) nsCacheService::ClearDoomList()
5) nsCacheService::DeactivateAndClearEntry()

We should focus on 3 which happens all the time during normal browsing.

Attachment #467409 - Flags: review?(michal.novotny)

Michal Novotny [:michal]

Updated

•

14 years ago

Attachment #453341 - Flags: review?(michal.novotny)

Bjarne (:bjarne)

Assignee

Comment 106

•

14 years ago

(In reply to comment #105)
> In the first patch (#453341) I really don't like collecting threads in
> mDeactivateEntryThreads and the method BlockOnAsyncDeactivateEvents(). Now when
> patch from bug #513008 has landed we have a dedicated cache IO thread that we
> can use to deactivate the entry.

I agree.

> IMO there is also no reason to try to
> deactivate entry on the same thread where SetData() was called.

Not when we have a suitable thread available for this, no.

> Correct me if
> I'm wrong but there shouldn't be any problem to deactivate entry on the
> background IO thread even if data was written synchronously due to
> STORE_ON_DISK_AS_FILE policy.

I think you're right.

> The second patch (#467409) is more straightforward. But what I don't like on
> both patches is the fact that we have the cache in inconsistent state for a
> short period. We proceed nsCacheService::DeactivateEntry() immediately but the
> disk device deactivates the entry later. I.e. within this period we don't have
> the entry in nsCacheService::mActiveEntries but
> nsDiskCacheBindery::FindActiveBinding() can find a binding for the entry.

Good point! But (in the second approach) couldn't we just call nsDiskCacheBindery::RemoveBinding() before proxying the call?

> Instead of posting nsDiskCacheDevice::DeactivateEntry() to the background
> thread we should probably try to change the code so that most of the calls to
> nsCacheService::DeactivateEntry() happen on the background thread.

I see (and fwiw I like) what you say but I suspect it will be even harder to keep the cache consistent if we do it like that (i.e. there will be more state to synchronize between the threads). I'll keep this in mind, but I would like to give my second approach a harder try, addressing your concerns above.

Bjarne (:bjarne)

Assignee

Comment 107

•

14 years ago

Attached patch Handles race-condition described in comment #105 (obsolete) — Details — Splinter Review

I believe this slight modification takes care of the potential for inconsistent state which Michal points out in comment #105 by introducing a flag to prevent GetCacheEntryBinding(nsCacheEntry) to return the binding if there is a pending deactivate.

(IMO there is a very unfortunate binding between the nsCacheEntry and the nsDiskCacheEntry where they point to each other and where the mData field of nsCacheEntry actually *is* the binding-object, but those are not relevant here...)

It passes simple local testing as well as local random browsing. Pushed to tryserver...

Attachment #467409 - Attachment is obsolete: true

Attachment #473021 - Flags: review?(michal.novotny)

Bjarne (:bjarne)

Assignee

Comment 108

•

14 years ago

(In reply to comment #107)
> It passes simple local testing as well as local random browsing. Pushed to
> tryserver...

Some timeouts seen from tryserver caused by the changes to dtor. Reverted and pushed again.

Bjarne (:bjarne)

Assignee

Comment 109

•

14 years ago

Timeouts gone but more careful handling of bindings is required.

Matheus Kerschbaum

Updated

•

14 years ago

Blocks: 595574

Bjarne (:bjarne)

Assignee

Comment 110

•

14 years ago

Attached patch Carefully handling bindings... (obsolete) — Details — Splinter Review

This version passes local testing and random browsing - pushed to tryserver...

Attachment #473021 - Attachment is obsolete: true

Attachment #475153 - Flags: review?(michal.novotny)

Attachment #473021 - Flags: review?(michal.novotny)

Bjarne (:bjarne)

Assignee

Comment 111

•

14 years ago

Afaics, it also passes tryserver. Five starred oranges plus some OpenGL oranges, which are to be expected these days. Waiting for review...

Johnny Stenback (:jst)

Comment 112

•

14 years ago

Running with the latest patch in this bug *and* with the patch in bug 588804 (I think, at least) applied I just hit a crash with the following stack:
#2  <signal handler called>
#3  arena_dalloc (ptr=0x100000001)
    at ../../../mozilla/memory/jemalloc/jemalloc.c:4271
#4  free (ptr=0x100000001) at ../../../mozilla/memory/jemalloc/jemalloc.c:6121
#5  0x00007fca4f16acfc in ~nsACString_internal (this=0x7fc9fbac7280, 
    __in_chrg=<value optimized out>) at ../../dist/include/nsTSubstring.h:115
#6  ~nsCString (this=0x7fc9fbac7280, __in_chrg=<value optimized out>)
    at ../../dist/include/nsTString.h:55
#7  nsCacheEntry::~nsCacheEntry (this=0x7fc9fbac7280, 
    __in_chrg=<value optimized out>)
    at ../../../mozilla/netwerk/cache/nsCacheEntry.cpp:83
#8  0x00007fca4f170a0b in nsDiskCacheDevice::DeactivateEntry (this=
    0x7fca32f44200, entry=0x7fc9fbac7280)
    at ../../../mozilla/netwerk/cache/nsDiskCacheDevice.cpp:527
#9  0x00007fca4f16eb9d in nsCacheServiceDeactivateEntryEvent::Run (this=
    0x7fc9a2293ae0) at ../../../mozilla/netwerk/cache/nsCacheService.cpp:1989
#10 0x00007fca4fab47ad in nsThread::ProcessNextEvent (this=0x7fc9c1bfde90, 
    mayWait=1, result=0x7fc9ae5fdd3c)
    at ../../../mozilla/xpcom/threads/nsThread.cpp:547
#11 0x00007fca4fa852a9 in NS_ProcessNextEvent_P (thread=<value optimized out>, 
    mayWait=<value optimized out>) at nsThreadUtils.cpp:250
#12 0x00007fca4fab4d5a in nsThread::ThreadFunc (arg=0x7fc9c1bfde90)
    at ../../../mozilla/xpcom/threads/nsThread.cpp:263
#13 0x00007fca4e16f4a3 in _pt_root (arg=0x7fc9b48c7240)
    at ../../../../mozilla/nsprpub/pr/src/pthreads/./ptthread.c:228
#14 0x00000037ed406a3a in start_thread () from /lib64/libpthread.so.0
#15 0x00000037eccde77d in clone () from /lib64/libc.so.6
(gdb) f5
#5  0x00007fca4f16acfc in ~nsACString_internal (this=0x7fc9fbac7280, 
    __in_chrg=<value optimized out>) at ../../dist/include/nsTSubstring.h:115
115	      ~nsTSubstring_CharT() { Finalize(); }
(gdb) p *this
$1 = {mData = 0x100000001 <Address 0x100000001 out of bounds>, mLength = 
    52500848, mFlags = 32714}
(gdb) up
#6  ~nsCString (this=0x7fc9fbac7280, __in_chrg=<value optimized out>)
    at ../../dist/include/nsTString.h:55
55	  {
(gdb) up
#7  nsCacheEntry::~nsCacheEntry (this=0x7fc9fbac7280, 
    __in_chrg=<value optimized out>)
    at ../../../mozilla/netwerk/cache/nsCacheEntry.cpp:83
warning: Source file is more recent than executable.
83	    delete mKey;
(gdb) p *this
$2 = {<PRCListStr> = {next = 0x7fc9fbac7280, prev = 0x7fc9fbac7280}, mKey = 
    0x7fca2fb064c0, mFetchCount = 2, mLastFetched = 1285305343, 
  mLastModified = 1285305348, mLastValidated = 1936024697, mExpirationTime = 
    1286880963, mFlags = 154368, mDataSize = 0, mCacheDevice = 0x7fca32f44200, 
  mSecurityInfo = {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, 
  mData = 0x7fc98df897e0, mThread = {<nsCOMPtr_base> = {mRawPtr = 
    0x7fc9c1bfde90}, <No data fields>}, mMetaData = {mBuffer = 0x0, 
    mBufferSize = 0, mMetaSize = 0}, mRequestQ = {next = 0x7fc9fbac72e8, 
    prev = 0x7fc9fbac72e8}, mDescriptorQ = {next = 0x7fc9fbac72f8, prev = 
    0x7fc9fbac72f8}}
(gdb)
I have a core file (from an optimized with symbols build) for this crash in case more info is needed.

Shawn Wilsher :sdwilsh

Updated

•

14 years ago

Keywords: main-thread-io

Shawn Wilsher :sdwilsh

Comment 113

•

14 years ago

We have any forward progress on this yet?

Michal Novotny [:michal]

Comment 114

•

14 years ago

Comment on attachment 475153 [details] [diff] [review]
Carefully handling bindings...

> +  nsresult rv = NS_NewThread(getter_AddRefs(mDiskCacheThread));

Shouldn't you explicitly shutdown mDiskCacheThread somewhere? Why didn't you use mCacheIOThread?


>    if (binding->mGeneration == p->mGeneration) {
> -      if (calcGeneration)  ++binding->mGeneration;    // try the next generation
> -      else {
> +      if (calcGeneration || p->mDeactivateInProgress) {
> +          ++binding->mGeneration;    // try the next generation
> +      } else {
>            NS_ERROR("### disk cache: generations collide!");

What does this change mean? I do think that it isn't needed and if yes then there is some serious problem. This code could potentially change existing generation number, which is bad.


>  nsDiskCacheBinding *
>  GetCacheEntryBinding(nsCacheEntry * entry)
>  {
> -    return (nsDiskCacheBinding *) entry->Data();
> +    nsDiskCacheBinding *tmp = (nsDiskCacheBinding *) entry->Data();
> +    if (tmp->mDeactivateInProgress)
> +        return nsnull;
> +    else
> +        return tmp;
>  }

This doesn't solve the issue that FindActiveEntry() sees (and returns) the binding. Also did you check all places where GetCacheEntryBinding() is called, if this can't be a problem? E.g. in nsDiskCacheDevice::DoomEntry() is an assertion which can be IMO triggered in following situation:

- nsDiskCacheDevice::DeactivateEntry() is called and nsDiskCacheDeviceDeactivateEntryEvent is posted
- some other entry is written and the cache is full, this starts evicting some entries in EvictDiskCacheEntries()
- nsDiskCacheEvictor::VisitRecord() is called for the entry being deactivated
- nsDiskCacheDevice::DoomEntry() is called -> assertion


This is just one example. I'm afraid that there could be a lot of other possible problems that would occur only in specific situations, which would make them really hard to debug. So I think that we should keep nsDiskCacheBindery in sync with nsCacheService::mActiveEntries.

Attachment #475153 - Flags: review?(michal.novotny) → review-

Bjarne (:bjarne)

Assignee

Comment 115

•

14 years ago

Attached patch Carefully handling bindings, v1.1 (obsolete) — Details — Splinter Review

(In reply to comment #114)
> Shouldn't you explicitly shutdown mDiskCacheThread somewhere? Why didn't you
> use mCacheIOThread?

Most of this was done before mDiskCacheThread existed, but yes: It should be used. Fixed.

> This code could potentially change existing
> generation number, which is bad.

You're right - the OR is definitely wrong. In fact, this method doesn't need to know anything about whether the entry is about to be deactivated - it can be left unchanged. Should be ok now.

> So I think that we should keep
> nsDiskCacheBindery in sync with nsCacheService::mActiveEntries.

I agree or disagree, depending on what you mean by "keep in sync". :)

The flag nsDiskCacheBinding::mDeactivateInProgress was introduced to prevent the bindery from returning entries which are queued to be deactivated. I.e. it is meant to "bridge the time-slot" which is introduced by deactivating disk-cache entries asynchronously. I believe this is sufficiently "in sync" for our needs. E.g. ActiveBinding() and AddBinding() do *not* take this flag into account, while FindActiveBinding() and GetCacheEntryBinding() do.

> This doesn't solve the issue that FindActiveEntry() sees (and returns) the
> binding.

FindActiveEntry() skip this binding like it does for doomed entries.

> - nsDiskCacheDevice::DeactivateEntry() is called and
> nsDiskCacheDeviceDeactivateEntryEvent is posted
> - some other entry is written and the cache is full, this starts evicting some
> entries in EvictDiskCacheEntries()
> - nsDiskCacheEvictor::VisitRecord() is called for the entry being deactivated
> - nsDiskCacheDevice::DoomEntry() is called -> assertion

This shouldn't happen because FindActiveEntry() would skip entries with the flag set, ref above.

However, you do have a point about the assertions which are often used immediately after calling GetCacheEntryBinding(). This applies to DeactivateEntry(), DoomEntry(), GetFileForEntry() and OnDataSizeChange(), and the question is whether these may be called in the time-slot introduced by this patch. The methods OpenInput/OutputStreamForEntry() use ENSURE_TRUE first, which seems to me like a reasonable solution, do you agree?

Patch passes basic local testing - pushed to tryserver.

Attachment #475153 - Attachment is obsolete: true

Attachment #485727 - Flags: review?(michal.novotny)

Michal Novotny [:michal]

Comment 116

•

14 years ago

(In reply to comment #112)
> 83        delete mKey;
> (gdb) p *this
> $2 = {<PRCListStr> = {next = 0x7fc9fbac7280, prev = 0x7fc9fbac7280}, mKey = 
>     0x7fca2fb064c0, mFetchCount = 2, mLastFetched = 1285305343, 
>   mLastModified = 1285305348, mLastValidated = 1936024697, mExpirationTime = 
>     1286880963, mFlags = 154368, mDataSize = 0, mCacheDevice = 0x7fca32f44200, 
>   mSecurityInfo = {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, 
>   mData = 0x7fc98df897e0, mThread = {<nsCOMPtr_base> = {mRawPtr = 
>     0x7fc9c1bfde90}, <No data fields>}, mMetaData = {mBuffer = 0x0, 
>     mBufferSize = 0, mMetaSize = 0}, mRequestQ = {next = 0x7fc9fbac72e8, 
>     prev = 0x7fc9fbac72e8}, mDescriptorQ = {next = 0x7fc9fbac72f8, prev = 
>     0x7fc9fbac72f8}}

The members seem to be reasonable, except mFlags. I'd guess that the cache entry is deleted for the second time. I don't think that the patch from bug #588804 could cause this, so I'd suspect patch #475153 from this bug.

Bjarne (:bjarne)

Assignee

Comment 117

•

14 years ago

Not sure where you're going with this, but the mFlags value means that the nsCacheEntry is

1. doomed
2. dirty
3. metadata is dirty
4. it has stream data
5. it's initialized

and the 0x2000-value looks weird. But isn't it deleting |mKey| which fails? Why do you believe the nsCacheEntry gets deleted twice? (I'm sorry if I'm a bit slow here, but I just don't see this...)

Bjarne (:bjarne)

Assignee

Comment 118

•

14 years ago

Oh - and note that patch #475153 from this bug did have some bad logic (as you pointed out in comment #114 and I acked in comment #115) which the new patch has fixed.

Michal Novotny [:michal]

Comment 119

•

14 years ago

(In reply to comment #117)
> Not sure where you're going with this, but the mFlags value means that the
> nsCacheEntry is
> 
> 1. doomed
> 2. dirty
> 3. metadata is dirty
> 4. it has stream data
> 5. it's initialized
> 
> and the 0x2000-value looks weird.

That's exactly the problem. How 0x20000 could get into mFlags? I'd guess that the memory was freed and overwritten.


> But isn't it deleting |mKey| which fails? Why do you believe the nsCacheEntry
> gets deleted twice?

I've checked the lifetime of the mKey and I don't see a way how we could delete an invalid nsCString. First I thought that the problem could be caused by passing the key from the request to the entry in nsCacheService::ActivateEntry. But mKey is nulled out almost immediately after that in nsCacheEntry::RequestAccess(). The problem occurs only if mActiveEntries.AddEntry() fails since RequestAccess() isn't called and we release the key twice. But this isn't our case because the entry seems to be used.

I've tried to add second delete entry to nsDiskCacheDevice::DeactivateEntry() and the debugger crashed when deleting the nsCString just like in comment #112, so that's why I think that the entry was deleted twice.


> Oh - and note that patch #475153 from this bug did have some bad logic (as you
> pointed out in comment #114 and I acked in comment #115) which the new patch
> has fixed.

Do you mean the condition when calculating generation? I think that the wrong condition can't happen at all, so this couldn't cause the crash.

Mike Shaver (:shaver -- probably not reading bugmail closely)

Comment 120

•

14 years ago

what does valgrind say about the overwriting of freed memory?

Bjarne (:bjarne)

Assignee

Comment 121

•

14 years ago

(In reply to comment #119)
> I've tried to add second delete entry to nsDiskCacheDevice::DeactivateEntry()
> and the debugger crashed when deleting the nsCString just like in comment #112,
> so that's why I think that the entry was deleted twice.

Fair enough - what did mFlags look like?

In fact, one way I can see this happening is that nsCacheService::DectivateEntry() is called for a doomed entry which has not bound yet due to delayed writing. The call will delete the entry, but when the write-request comes, the entry will bind. If the entry is deactivated later, we will get the second delete by the event. (This should not happen after the fix for bug #596443 landed though.)

Will try to find other patterns which may give us two deletes. I'd also appreciate comments to the patch.

Bjarne (:bjarne)

Assignee

Comment 122

•

14 years ago

Mmmm...  I'm looking at a crash from tryserver (with this patch) which asserts in nsDiskCacheBindery::RemoveBinding() that the bindery is not initialized. Wonder how that could happen. Note also that it is the code-line immediately before the one crashing in comment #116.

An uninitialized bindery indicates that nsDiskCacheDevice::Shutdwon_Private() is called while a deactivate-event is pending. Either from nsDiskCacheDevice::ClearDiskCache() or from nsDiskCacheDevice::Shutdown().

Maybe nsDiskCacheDevice::Shutdown() should wait until mBindery has no more active bindings, like nsDiskCacheDevice::ClearDiskCache() does..

Mike Shaver (:shaver -- probably not reading bugmail closely)

Comment 123

•

14 years ago

Does valgrind show anything when you run under it?  It often provides very useful clues when dealing with memory issues.

Michal Novotny [:michal]

Comment 124

•

14 years ago

AFAIK we can't reproduce the crash, so valgrind can't help here.

Shawn Wilsher :sdwilsh

Comment 125

•

14 years ago

(In reply to comment #124)
> AFAIK we can't reproduce the crash, so valgrind can't help here.
Doesn't mean it won't flag bad memory access patterns that don't result in a crash.

Bjarne (:bjarne)

Assignee

Comment 126

•

14 years ago

(Warning: I have no previous experience with valgrind.)

I've followed the guidelines at https://developer.mozilla.org/en/Debugging_Mozilla_with_Valgrind but can find nothing even remotely relevant in the output. As Michal says, the crash mentioned in comment #112 and comment #116 is not really reproducible.

The crash mentioned in comment #122 is in leaktest/Linux. I've not yet been able to reproduce, but I'm currently trying to figure out how to run leaktest under valgrind (any pointers or input is appreciated) to see if it gives any clues.

Bjarne (:bjarne)

Assignee

Comment 127

•

14 years ago

An update:

I've occasionally seen crashes on the tryserver with this patch very similar to comment #112 and comment #122. However, after some investigation I've found that these are not caused by deactivating an entry twice, but rather by the way nsCacheService shuts down the devices: The device is deleted, the dtor calls Shutdown(), which (with this patch) may fire async deactivations. When an async deactivation runs, the device w/ members is gone but the memory is accessed by the async deactivation event, and boom...

I'll attach an updated patch which blocks the dtor of the disk-device until all cache-io events have run, and I have not seen any crashes with this modification. However, there is an issue in /tests/content/base/test/test_bug475156.html where 3-4 of the tests fails with a clearly cache-related message. The suspicion is that there is a small timeslot while deactivation is pending where the cache doesn't find the entry, and I'm currently trying to find it and close it.

Bjarne (:bjarne)

Assignee

Comment 128

•

14 years ago

Attached patch Carefully handling bindings v1.2 (blocking on cache-io thread in dtor) (obsolete) — Details — Splinter Review

Attachment #485727 - Attachment is obsolete: true

Attachment #488785 - Flags: review?(michal.novotny)

Attachment #485727 - Flags: review?(michal.novotny)

Markus Stange [:mstange]

Comment 129

•

14 years ago

There has been a similar crash as in comment 112 just now on the Firefox Tinderbox, see bug 614513.

Bjarne (:bjarne)

Assignee

Comment 130

•

14 years ago

Nice catch. :)  See bug #bug 614513, comment #5 in particular the last paragraph.

Michal Novotny [:michal]

Comment 131

•

14 years ago

Comment on attachment 488785 [details] [diff] [review]
Carefully handling bindings v1.2 (blocking on cache-io thread in dtor)

> +nsCacheService::BlockOnCacheIOThread()
> +{
> +    NS_ASSERTION(gService->mLockedThread == PR_GetCurrentThread(), "oops");
> +    if (!gService->mCacheIOThread) return NS_ERROR_NOT_AVAILABLE;
> +
> +    nsBlockOnCacheThreadEvent *event = new nsBlockOnCacheThreadEvent();
> +
> +    Unlock();
> +        nsresult rv =
> +            gService->mCacheIOThread->Dispatch(event, NS_DISPATCH_SYNC);
> +    Lock();

Posting a synchronous event will spin the event loop until the event finishes. This can be IMO problematic.

Maybe a better solution would be to have a counter of posted events and wait using monitor until it is 0. Also it makes sense to do a synchronous deactivation while doing shutdown. I.e. in Shutdown_Private() set variable mShutdownInProgress = PR_TRUE and check this variable in DeactivateEntry(), in case it is true do a synchronous deactivation.


> +class nsDiskCacheDeviceDeactivateEntryEvent : public nsRunnable {
> +public:
> +    nsDiskCacheDeviceDeactivateEntryEvent(nsDiskCacheDevice *device,
> +                                          nsCacheEntry * entry,
> +                                          nsDiskCacheBinding * binding)

Events and other helper classes are usually at the beginning of the file. Please move it there.


> -    binding = nsnull;
> +    binding = nsnull;  // why assign here??

This should be changed to an assertion NS_ASSERTION(!binding, "..."). It used to be a debug check (see http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/netwerk/cache/src/nsDiskCacheDevice.cpp&rev=1.59) but became a regular code which is incorrect and misleading. The logic is that if we find an active binding it must be a collision, otherwise there is some serious problem.

Attachment #488785 - Flags: review?(michal.novotny)

Bjarne (:bjarne)

Assignee

Comment 132

•

14 years ago

(In reply to comment #131)
> Posting a synchronous event will spin the event loop until the event finishes.

Yes - that's the idea. :)  I want to block current thread until all pending events on the cache-io thread has finished. We are, after all, shutting down the cache at this point.

> This can be IMO problematic.

How? (Observe that the lock is released before the sync-event.)

> Events and other helper classes are usually at the beginning of the file.
> Please move it there.

Fair enough.

> This should be changed to an assertion NS_ASSERTION(!binding, "..."). It used
> to be a debug check (see
> http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/netwerk/cache/src/nsDiskCacheDevice.cpp&rev=1.59)
> but became a regular code which is incorrect and misleading. The logic is that
> if we find an active binding it must be a collision, otherwise there is some
> serious problem.

I can add an assertion in FindEntry() but the TODOs are really about why we assign binding to null (also present in 1.59) when it's not used before the next assignment (also in 1.59).

Bjarne (:bjarne)

Assignee

Comment 133

•

14 years ago

(In reply to comment #132)
> but the TODOs are really about why we
> assign binding to null (also present in 1.59) when it's not used before the
> next assignment (also in 1.59).

Ahh - sorry. I missed all the cleanup in dtor.  :P

Michal Novotny [:michal]

Comment 134

•

14 years ago

(In reply to comment #132)
> (In reply to comment #131)
> > Posting a synchronous event will spin the event loop until the event finishes.
> 
> Yes - that's the idea. :)  I want to block current thread until all pending
> events on the cache-io thread has finished. We are, after all, shutting down
> the cache at this point.

But the problem is that the current thread won't be blocked. It will process any pending event on the main thread. Maybe that's OK, but some time ago I wanted to spin the event queue using NS_ProcessNextEvent while waiting for a result and I was told that I should avoid doing that since it isn't always safe.

Shawn Wilsher :sdwilsh

Comment 135

•

14 years ago

(In reply to comment #134)
> But the problem is that the current thread won't be blocked. It will process
> any pending event on the main thread. Maybe that's OK, but some time ago I
> wanted to spin the event queue using NS_ProcessNextEvent while waiting for a
> result and I was told that I should avoid doing that since it isn't always
> safe.
It might be OK for this.  When do we shut the cache down?

Michal Novotny [:michal]

Comment 136

•

14 years ago

(In reply to comment #135)
> It might be OK for this.  When do we shut the cache down?

The code is called on "xpcom-shutdown" and "profile-before-change",  and also when somebody calls nsICacheSession::evictEntries() to evict entries in the disk cache.

Bjarne (:bjarne)

Assignee

Comment 137

•

14 years ago

(In reply to comment #134)
> But the problem is that the current thread won't be blocked. It will process
> any pending event on the main thread. 

Reading the nsThread impl I see what you mean... :(  The calling thread won't block but rather execute pending events while stalling the return of the dispatch-sync. Effectively, upon calling dispatch-sync necko will be in some half-way-done state, subsequent events will run in this state, and nobody knows what will happen. Makes me wonder which situations the dispatch-sync actually can handle safely... (and consequently why it's available at all).

I've re-implemented BlockOnCacheIoThread() to use conditional variables instead, attempting to properly block the calling thread (which is what I want to do). Are there any known pitfalls/issues with condvars? :)

(Patch is running on tryserver now - I'll attach it asap.)

Michal Novotny [:michal]

Comment 138

•

14 years ago

(In reply to comment #137)
> I've re-implemented BlockOnCacheIoThread() to use conditional variables
> instead, attempting to properly block the calling thread (which is what I want
> to do). Are there any known pitfalls/issues with condvars? :)

AFAICS the only problem could be if we would do the same on the cache IO thread, i.e. if we would need to do some task on the main thread to finish all events posted to the IO thread. But we don't do this so it should be safe.

Bjarne (:bjarne)

Assignee

Comment 139

•

14 years ago

Referring to comment #127 and the intermittent failure in test test_bug475156.html, I finally think I see what's going on...

nsHttpChannel::OnStopRequest() is called when the response has been retrieved from the server. It calls CloseCacheEntry() which sets |mCacheEntry| to null. This will/might trigger the dtor of |mCacheEntry| and the dtor of nsCacheEntryDescriptor calls Close() which calls nsCacheService::CloseDescriptor(), which finally deactivates the entry if it's considered inactive.

Now, when deactivating an entry we remove it from nsCacheService::mActiveEntries and make we sure to remove it from nsDiskCacheDevice::mBindery (ref last sentence in comment #114) to keep these in sync. Then we post an event which writes the entry to disk at some point later. Note that this creates a time-slot between removing it from mActiveEntries/mBindery and writing the updated entry to disk. If another request for this entry is processed in this time-slot it will find the *old* entry on disk. In the case of "test_bug475156.html" we find an entry on disk with the wrong eTag and thus validate it with the server, leading to intermittent failures depending on the size and timing of this time-slot.

IMO we must delay removing an entry from mBindery until it has been written to disk. If the entry is requested in the time-slot, we should re-use the binding and not remove it from mBindery. I'll see what I can come up with in terms of code - feel free to challenge the reasoning above in the meantime.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Comment 140

•

14 years ago

We really want this, but it's no longer a blocker this late in the game.

blocking2.0: betaN+ → .x

status2.0: --- → wanted

Bjarne (:bjarne)

Assignee

Comment 141

•

14 years ago

Part of solution here is described separately in bug #620660 - adding dependency.

Depends on: 620660
No longer depends on: 529949

Bjarne (:bjarne)

Assignee

Comment 142

•

13 years ago

Attached patch Carefully handling bindings version 1.3 (obsolete) — Details — Splinter Review

After a fair number of tryserver-runs I finally identified an issue which caused similar symptoms as seen in bug #614513 (which btw probably could be confirmed fixed by now). The issue is essentially that a cache-entry gets deactivated twice, but this particular code-path can only occur with the rest of this patch applied.

Problem is avoided by canceling the pending event doing the first deactivation and after the fix, this patch passes a full tryserver-run (a few starred oranges) with one exception. The exception really looks like bug #529733, although the crash happens in strlen instead of memcpy. However, origin is the same: nsHttpChannel::ContinueProcessNormal() and InitCacheEntry(). Stack can be seen at http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1300415888.1300416140.711.gz#err1

and full tryserver-run can be seen at  http://tbpl.mozilla.org/?tree=MozillaTry&rev=adfb1b91e827

I'll try to reproduce a couple of times before moving on with this. If there is any insight wrt whether the exception is a version of bug #529733, I'd be happy to listen...

Attachment #488785 - Attachment is obsolete: true

Bjarne (:bjarne)

Assignee

Comment 143

•

13 years ago

(In reply to comment #142)
> I'll try to reproduce a couple of times before moving on with this.

The same test crashed once again (also on Win32) but with a different stack. Must investigate further...

Bjarne (:bjarne)

Assignee

Comment 144

•

13 years ago

Attached patch Carefully handling bindings version 1.3 (obsolete) — Details — Splinter Review

This adds some minor cleanup compared to the previous patch. It has passed two full tryserver-runs with 3-4 starred oranges in each run.

Attachment #520177 - Attachment is obsolete: true

Attachment #524455 - Flags: review?(michal.novotny)

Jason Duell

Updated

•

13 years ago

Blocks: 645848

Michal Novotny [:michal]

Comment 145

•

13 years ago

Comment on attachment 524455 [details] [diff] [review]
Carefully handling bindings version 1.3

> +    void NeverMind() { mNeverMind = PR_TRUE; }

I think that names like CancelDeactivate() nad mCancelled would be more self-explaining.


> +    nsDiskCacheDeviceDeactivateEntryEvent *event =
> +        new nsDiskCacheDeviceDeactivateEntryEvent(this, entry, binding);
> +    if (!event) {
> +        CACHE_LOG_DEBUG(("  failed allocating deactivate-event - "
> +                         "falling back to synchronous operation.\n"));
> +        return DeactivateEntry_Private(entry, binding);
> +    }

new doesn't fail.


> -    if (!binding)  return NS_ERROR_UNEXPECTED;
> +    NS_ASSERTION(binding->mDeactivateEvent == nsnull,
> +                 "OnDataSizeChange: entry in process of deactivation");
> +    if (!binding || binding->mDeactivateEvent)
> +        return NS_ERROR_UNEXPECTED;

Just a nit. There are several places where this check is used. Couldn't we create some inline function for it? Like:

inline bool ValidBinding(nsDiskCacheBinding *binding)
{
  if (!binding) {
    NS_ASSERTION(...);
    return false;
  }
  if (binding->mDeactivateEvent) {
    NS_ASSERTION(...);
    return false;
  }
  return true;
}

And then just use
if (!ValidBinding(binding))
  return NS_ERROR_UNEXPECTED;


> -        binding = nsnull;
> +        binding = nsnull; // why assign here? it's set below...

See my comment #131


> +        if (!mNeverMind) {
> +            nsCacheServiceAutoLock lock;

There is a possible race condition if we want to reuse the binding on different thread.

Attachment #524455 - Flags: review?(michal.novotny)

Bjarne (:bjarne)

Assignee

Comment 146

•

13 years ago

(In reply to comment #145)
> > +    void NeverMind() { mNeverMind = PR_TRUE; }
> 
> I think that names like CancelDeactivate() nad mCancelled would be more
> self-explaining.

Fair enough. Fixed.

> > +    nsDiskCacheDeviceDeactivateEntryEvent *event =
> > +        new nsDiskCacheDeviceDeactivateEntryEvent(this, entry, binding);
> > +    if (!event) {
> new doesn't fail.

Ok.

> > -    if (!binding)  return NS_ERROR_UNEXPECTED;
> > +    NS_ASSERTION(binding->mDeactivateEvent == nsnull,
> > +                 "OnDataSizeChange: entry in process of deactivation");
> > +    if (!binding || binding->mDeactivateEvent)
> > +        return NS_ERROR_UNEXPECTED;
> 
> Just a nit. There are several places where this check is used. Couldn't we
> create some inline function for it? 

Ok

> > -        binding = nsnull;
> > +        binding = nsnull; // why assign here? it's set below...
> 
> See my comment #131

Ok - fixed as agreed.

> > +        if (!mNeverMind) {
> > +            nsCacheServiceAutoLock lock;
> 
> There is a possible race condition if we want to reuse the binding on different
> thread.

Good catch! Fixed.

Bjarne (:bjarne)

Assignee

Comment 147

•

13 years ago

Attached patch Carefully handling bindings version 1.3 (updated) (obsolete) — Details — Splinter Review

Fixed as requested, pushed to tryserver.

Attachment #524455 - Attachment is obsolete: true

Attachment #525240 - Flags: review?(michal.novotny)

Bjarne (:bjarne)

Assignee

Comment 148

•

13 years ago

Attached patch Carefully handling bindings version 1.3 (updated) (obsolete) — Details — Splinter Review

One more place to use the util-function...

Attachment #525240 - Attachment is obsolete: true

Attachment #525247 - Flags: review?(michal.novotny)

Attachment #525240 - Flags: review?(michal.novotny)

Bjarne (:bjarne)

Assignee

Comment 149

•

13 years ago

Attached patch Carefully handling bindings version 1.3 (updated and unbitrotted) — Details — Splinter Review

... and here it is, unbitrotted as well...

Attachment #525247 - Attachment is obsolete: true

Attachment #525263 - Flags: review?(michal.novotny)

Attachment #525247 - Flags: review?(michal.novotny)

Michal Novotny [:michal]

Updated

•

13 years ago

Attachment #525263 - Flags: review?(michal.novotny) → review+

Shawn Wilsher :sdwilsh

Comment 150

•

13 years ago

Is this going to land tonight so it makes Firefox 5?

Bjarne (:bjarne)

Assignee

Comment 151

•

13 years ago

Tryserver is still running

http://tbpl.mozilla.org/?tree=MozillaTry&rev=5e9434e84258

If someone wants to land it - feel free.

Bjarne (:bjarne)

Assignee

Comment 152

•

13 years ago

Lots of traffic on tryserver these days...  according to the web interface this job is not yet complete. It seems to be ok but since we missed the window anyway I'll re-submit the job to get a coherent picture.

Bjarne (:bjarne)

Assignee

Comment 153

•

13 years ago

Pushed again to tryserver as http://tbpl.mozilla.org/?tree=MozillaTry&rev=a2588d175b07

Bjarne (:bjarne)

Assignee

Comment 154

•

13 years ago

Testrun looks ok. Requesting check-in.

Keywords: checkin-needed

Dão Gottwald [:dao]

Comment 155

•

13 years ago

http://hg.mozilla.org/mozilla-central/rev/1346f7ada9d5

Status: ASSIGNED → RESOLVED

Closed: 13 years ago

Keywords: checkin-needed

OS: Mac OS X → All

Hardware: x86 → All

Resolution: --- → FIXED

Target Milestone: --- → mozilla6

Shawn Wilsher :sdwilsh

Comment 156

•

13 years ago

(In reply to comment #155)
> http://hg.mozilla.org/mozilla-central/rev/1346f7ada9d5
Let me be the first to say "SQUEEEEEEEEE!"

instruments screen grab 14 years ago Robert Sayre 390.33 KB, image/png		Details
strick that, I can still see it happening. here's a stack 14 years ago Robert Sayre 3.60 KB, text/plain		Details
Posting DeactivateEntry to the thread 14 years ago Bjarne (:bjarne) 3.03 KB, patch		Details \| Diff \| Splinter Review
Posting DeactivateEntry to the thread, waiting for threads to finish processing in Shutdown 14 years ago Bjarne (:bjarne) 6.01 KB, patch	michal : review-	Details \| Diff \| Splinter Review
Updated : Post DeactivateEntry, wait for threads to finish processing in Shutdown 14 years ago Bjarne (:bjarne) 6.51 KB, patch		Details \| Diff \| Splinter Review
Updated : Post DeactivateEntry, wait for deactivate-events in Shutdown 14 years ago Bjarne (:bjarne) 8.44 KB, patch		Details \| Diff \| Splinter Review
Updated according to comments from reviewer ++ 14 years ago Bjarne (:bjarne) 8.88 KB, patch		Details \| Diff \| Splinter Review
Modified to detach descriptors from cacheEntry if deleted in DeactivateEntry 14 years ago Bjarne (:bjarne) 11.34 KB, patch		Details \| Diff \| Splinter Review
Modified as described in comment #77 14 years ago Bjarne (:bjarne) 11.13 KB, patch		Details \| Diff \| Splinter Review
Modified to also handle profile-shutdown 14 years ago Bjarne (:bjarne) 16.28 KB, patch		Details \| Diff \| Splinter Review
Cleaned up and deleted code which disabled IPC 14 years ago Bjarne (:bjarne) 24.60 KB, patch		Details \| Diff \| Splinter Review
New, lighter approach to achieve the desired effect 14 years ago Bjarne (:bjarne) 5.27 KB, patch		Details \| Diff \| Splinter Review
Handles race-condition described in comment #105 14 years ago Bjarne (:bjarne) 10.24 KB, patch		Details \| Diff \| Splinter Review
Carefully handling bindings... 14 years ago Bjarne (:bjarne) 10.84 KB, patch	michal : review-	Details \| Diff \| Splinter Review
Carefully handling bindings, v1.1 14 years ago Bjarne (:bjarne) 8.61 KB, patch		Details \| Diff \| Splinter Review
Carefully handling bindings v1.2 (blocking on cache-io thread in dtor) 14 years ago Bjarne (:bjarne) 19.52 KB, patch		Details \| Diff \| Splinter Review
Carefully handling bindings version 1.3 13 years ago Bjarne (:bjarne) 16.08 KB, patch		Details \| Diff \| Splinter Review
Carefully handling bindings version 1.3 13 years ago Bjarne (:bjarne) 14.12 KB, patch		Details \| Diff \| Splinter Review
Carefully handling bindings version 1.3 (updated) 13 years ago Bjarne (:bjarne) 12.61 KB, patch		Details \| Diff \| Splinter Review
Carefully handling bindings version 1.3 (updated) 13 years ago Bjarne (:bjarne) 12.48 KB, patch		Details \| Diff \| Splinter Review
Carefully handling bindings version 1.3 (updated and unbitrotted) 13 years ago Bjarne (:bjarne) 12.50 KB, patch	michal : review+	Details \| Diff \| Splinter Review