Closed
Bug 549767
Opened 13 years ago
Closed 12 years ago
Write/flush cache metadata off main-thread
Categories
(Core :: Networking: Cache, defect)
Core
Networking: Cache
Tracking
()
RESOLVED
FIXED
mozilla6
People
(Reporter: sayrer, Assigned: bjarne)
References
Details
(Keywords: main-thread-io)
Attachments
(4 files, 17 obsolete files)
390.33 KB,
image/png
|
Details | |
3.60 KB,
text/plain
|
Details | |
24.60 KB,
patch
|
Details | Diff | Splinter Review | |
12.50 KB,
patch
|
michal
:
review+
|
Details | Diff | Splinter Review |
I saw this with a March 2nd m-c nightly. Will attach instruments data shortly.
Reporter | ||
Comment 1•13 years ago
|
||
Reporter | ||
Comment 2•13 years ago
|
||
(also filing on the GetUSLayoutCharFromKeyTranslate thing pictured there... ho ho ho)
Reporter | ||
Updated•13 years ago
|
Assignee: general → nobody
Component: JavaScript Engine → Networking: Cache
QA Contact: general → networking.cache
Reporter | ||
Comment 3•13 years ago
|
||
hmm, I can't reproduce this anymore, but sdwilsh said he saw the same thing in irc.
Reporter | ||
Comment 4•13 years ago
|
||
Comment 5•13 years ago
|
||
(In reply to comment #4) > Created an attachment (id=429876) [details] > strick that, I can still see it happening. here's a stack Any comments on why we see writes on the main thread still Bjarne?
![]() |
||
Comment 6•13 years ago
|
||
So it looks like bug 513074 made the data writes (the actual cached content) happen on a worker thread. The stack in comment 4 is writing cache entry metadata. What I actually wonder is whether the DataSize() is correct a this point if there are still pending data writes on the worker thread...
Assignee | ||
Comment 7•13 years ago
|
||
(In reply to comment #6) > So it looks like bug 513074 made the data writes (the actual cached content) > happen on a worker thread. Well, yes... The general idea was to post cache-writes on a worker thread while reading from the source instead of writing to the cache synchronously. Should we also try to write metadata asynchronously? It may be possible to achieve if we e.g. post the call to CloseCacheEntry() on the worker thread, but I'd need to experiment with it to tell for sure. However, fwiw, keep in mind that if the caller set the UseCacheAsFile-flag (see e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=548406#c33) also data writes happens on the main-thread. Hence, depending on what is loaded during profiling, one may or may not see even data written on the main-thread. > What I actually wonder is whether the DataSize() is correct a this point if > there are still pending data writes on the worker thread... That's a good point. I guess there are no tests which would catch this...? It may be resolved by the fix suggested above though... :)
Comment 8•13 years ago
|
||
(In reply to comment #6) > What I actually wonder is whether the DataSize() is correct a this point if > there are still pending data writes on the worker thread... The DataSize() is always correct. It is updated after the call to nsCacheEntryDescriptor::nsOutputStreamWrapper::Write(). If the writes happen on the worker thread then the cache entry isn't closed in nsHttpChannel::CloseCacheEntry() because posted write events on the worker thread hold the reference to the entry. So the cache entry is closed (and also metadata is written) after the last write is done.
Assignee | ||
Comment 9•13 years ago
|
||
Suggested way to write metadata on the cache-write thread. There is also a line to ensure mCacheEntry::mThread is initialized in the ctor (odd, really, that this was missing) This seems to me as the appropriate place for this split : The cache-service does its own clean-up synchronously, but posts the call to device->DeactivateEntry(). We must avoid posting if we are shutting down (since the cache-service lock is held during shutdown), and there is no point to post the call to current-thread IMO. Pushed to tryserver...
Comment 10•13 years ago
|
||
Note that this is responsible for about 16% of our I/O on the main thread during normal browsing: https://wiki.mozilla.org/Firefox/Goals/2010Q1/IO_Reduction
Assignee | ||
Comment 11•13 years ago
|
||
Comment on attachment 431620 [details] [diff] [review] Posting DeactivateEntry to the thread Tryserver orange on WINNT 5.2 and Linux unit tests. Ok on Mac. I'll try to reproduce locally...
Assignee | ||
Comment 12•13 years ago
|
||
Comment on attachment 431620 [details] [diff] [review] Posting DeactivateEntry to the thread Not able to reproduce any of the failures locally but I see some backouts on trunk due to oranges. Re-dispatching to tryserver...
Assignee | ||
Comment 13•13 years ago
|
||
Comment on attachment 431620 [details] [diff] [review] Posting DeactivateEntry to the thread Still fails on tryserver. I should probably wait for the async call to finish before removing the devices in nsCacheService::Shutdown()...
Assignee | ||
Comment 14•13 years ago
|
||
Mmmm... no complete success on tryserver yet. I sometimes see runtime-msg "pure virtual method called", which most likely comes from using a dangling pointer. Trying to resolve, but cannot reproduce this locally...
Assignee | ||
Comment 15•13 years ago
|
||
Making sure the posted deactivations happens before devices are destroyed. Posted to tryserver...
Attachment #431620 -
Attachment is obsolete: true
Attachment #432413 -
Flags: review?(michal.novotny)
Attachment #431620 -
Flags: review?(michal.novotny)
Comment 16•13 years ago
|
||
(In reply to comment #15) > Created an attachment (id=432413) [details] > Posting DeactivateEntry to the thread, waiting for threads to finish processing > in Shutdown > > Making sure the posted deactivations happens before devices are destroyed. > Posted to tryserver... Bjarne - while we wait for review, does silence here suggest that tryserver was happy with it?
Assignee | ||
Comment 17•13 years ago
|
||
> Bjarne - while we wait for review, does silence here suggest that tryserver was
> happy with it?
Unfortunately, no. (It merely suggests that I was busy with a few other things and forgot to update status here.)
This one is on top of my list for tomorrow...
Comment 18•13 years ago
|
||
Comment on attachment 432413 [details] [diff] [review] Posting DeactivateEntry to the thread, waiting for threads to finish processing in Shutdown Sometimes the metadata is still written on the main thread even if the data is written on the worker thread. nsCacheEntry::mThread is set in nsCacheEntry::SetData(). For new cache entries this is called on the first write, so everything works as expected. But for existing entries SetData() is called when opening the entry and this is done on the main thread. > + // release its data-object. > + nsCacheServiceAutoLock lock; > + (void) mDevice->DeactivateEntry(mEntry); > + return NS_OK; Check the return value and increase nsCacheService::mDeactivateFailures on failure. > + mThread(nsnull) This isn't needed since mThread is nsCOMPtr. > - if (mInitialized) { I don't think this can happen, nevertheless you should probably keep the condition.
Attachment #432413 -
Flags: review?(michal.novotny) → review-
Comment 19•13 years ago
|
||
(In reply to comment #18) > > - if (mInitialized) { > > I don't think this can happen, nevertheless you should probably keep the > condition. drop it and make it an assertion?
Assignee | ||
Comment 20•13 years ago
|
||
(In reply to comment #18) > (From update of attachment 432413 [details] [diff] [review]) > Sometimes the metadata is still written on the main thread even if the data is > written on the worker thread. nsCacheEntry::mThread is set in > nsCacheEntry::SetData(). For new cache entries this is called on the first > write, so everything works as expected. But for existing entries SetData() is > called when opening the entry and this is done on the main thread. Metadata is also written for cache *read* requests... Yes - I overlooked that one! Lemme' have a closer look at when deactivates are posted! (There is also potential for a similar problem as in bug #548406 I believe...) > > + // release its data-object. > > + nsCacheServiceAutoLock lock; > > + (void) mDevice->DeactivateEntry(mEntry); > > + return NS_OK; > > Check the return value and increase nsCacheService::mDeactivateFailures on > failure. Fair enough. > > + mThread(nsnull) > > This isn't needed since mThread is nsCOMPtr. Very well... but for my curiosity : Why don't we risk garbage in an nsCOMPtr-member? Will statements like |if (this->mThread) {| always work correctly if |mThread| is an nsCOMPtr? > > - if (mInitialized) { > > I don't think this can happen, nevertheless you should probably keep the > condition. Wouldn't the assertion handle the situation? Having both the assertion and the |if| seem a little redundant to me...?
Comment 21•13 years ago
|
||
(In reply to comment #20) > Very well... but for my curiosity : Why don't we risk garbage in an > nsCOMPtr-member? Will statements like |if (this->mThread) {| always work > correctly if |mThread| is an nsCOMPtr? nsCOMPtr's constructor sets it's member data to nsnull so you don't have to.
Assignee | ||
Comment 22•13 years ago
|
||
(In reply to comment #21) > (In reply to comment #20) > > Very well... but for my curiosity : Why don't we risk garbage in an > > nsCOMPtr-member? Will statements like |if (this->mThread) {| always work > > correctly if |mThread| is an nsCOMPtr? > nsCOMPtr's constructor sets it's member data to nsnull so you don't have to. Ok, but the member in question is |nsCacheEntry::mThread| . mThread is an nsCOMPtr-member in nsCacheEntry, but nsCacheEntry is not (even close to) a nsCOMPtr and is created in nsCacheEntry::Create() like this nsCacheEntry* entry = new nsCacheEntry(newKey, streamBased, storagePolicy); What I don't see is how the code |if (entry->mThread)| can be safe unless |nsCacheEntry::mThread| is explicitly initialized...
Comment 23•13 years ago
|
||
the nsCacheEntry constructor runs constructors for all its members, that's normal C++
Assignee | ||
Comment 24•13 years ago
|
||
Ah - ok. Thanks... :)
Comment 25•13 years ago
|
||
(In reply to comment #22) > nsCacheEntry* entry = new nsCacheEntry(newKey, streamBased, storagePolicy); member variables are initialized during construction.
Comment 26•13 years ago
|
||
(In reply to comment #19) > > > - if (mInitialized) { > drop it and make it an assertion? The old code contains NS_ASSERTION as well. (In reply to comment #20) > Wouldn't the assertion handle the situation? Having both the assertion and the > |if| seem a little redundant to me...? But NS_ASSERTION does nothing in release build, so the code would probably crash.
Assignee | ||
Comment 27•13 years ago
|
||
> But NS_ASSERTION does nothing in release build, so the code would probably
> crash.
Well, then the |if| should definitely be kept. :)
Assignee | ||
Comment 28•13 years ago
|
||
I feel like summarizing (in writing) current thinking around this... all input and corrections to understanding and/or analysis is highly appreciated! Bottom line is that we want to let the device deactivate entries off the main-thread because metadata is typically written to disk as a consequence of deactivating and deleting a cache-entry in the disk-cache device. As pointed out by michal in comment #18, this also applies to cache-reads because these may also touch metadata for the entry. In other words, general access to the cache seems to result in writing metadata to disk. When a cache-entry is read from or written to, data for the entry is set by calling SetData(). If the data set is non-null, nsCacheEntry maintains a member |mThread| pointing to the running thread. (As a side-note, if a call to SetData(realData) happens prior to a call to SetData(nsnull) we have a weird situation in this respect since |mThread| is not cleared on the second call... maybe we should change this?) Anyway, when nsCacheService::DeactivateEntry() is called, we have three possible states for nsCacheEntry::mThread 1) not set 2) set to current (main) thread (indicating sync cache-write or a cache-read) 3) set to some other thread (async cache-write implemented in bug #513074) In the suggested patch, case (3) serializes deactivation of the entry with async writes from bug #513074 by posting to the cache-write thread. The implementation of (2) just deactivates the entry synchronously, resulting in writes on the main-thread (as noted in comment #0). AFAICS (1) may indicate that there is a pending write (or read) which has not yet accessed its stream, that something (like the collision-detection mechanism in nsDiskCacheBindery::CreateBinding()) failed causing a call like |entry->SetData(nsnull)|, or simply that some component set nsICacheEntryDescriptor::cacheElement to null for whatever reason. The suggested patch implements (1) in the same way as (2) (as a side-effect, not intentionally). In essence I see two issues here (implementation of (3) is IMO ok) : First, implementation of (2) must be improved so that possible metadata-writes happens off the main-thread. This can be done by passing the cache-write thread e.g. via nsICacheEntryDescriptor and post deactivation of the entry to this thread. This includes adding a field to a public interface and set it e.g. when the descriptor is created in nsHttpChannel (or it can be added to nsICacheSession and handled by the session). Second, (1) must be handled properly. In the cases where (1) is caused by a pending request, fixing bug #548406 is IMO a prerequisite (i.e. properly deal with binding doomed entries). Apart from this, we can IMO treat it like (2) above. I'll implement and test some of this while waiting for comments...
Assignee | ||
Comment 29•13 years ago
|
||
I haven't followed the work of making cache-reads async, but it occurred to me that if read-requests are handled on a separate thread also, deactivating such entries will probably end up in case (3) above also, no?
Comment 30•13 years ago
|
||
(In reply to comment #29) > I haven't followed the work of making cache-reads async, but it occurred to me > that if read-requests are handled on a separate thread also, deactivating such > entries will probably end up in case (3) above also, no? I think you're right. Fixing #513008 (I'm working on it right now) will eliminate case (2).
Assignee | ||
Comment 31•13 years ago
|
||
If this turns out to be right, we could just wait for solutions to bug #513008 and bug #548406 and keep the suggested solution for (3) here, no?
Comment 32•13 years ago
|
||
(In reply to comment #26) > But NS_ASSERTION does nothing in release build, so the code would probably > crash. But we never expect this to be true, right? We have assertions all over the code base that check that something isn't null and then just use it without checking nullness. (In reply to comment #31) > If this turns out to be right, we could just wait for solutions to bug #513008 > and bug #548406 and keep the suggested solution for (3) here, no? Do we need to wait for those bugs before we can have a patch up here?
Assignee | ||
Comment 33•13 years ago
|
||
> (In reply to comment #31) > > If this turns out to be right, we could just wait for solutions to bug #513008 > > and bug #548406 and keep the suggested solution for (3) here, no? > Do we need to wait for those bugs before we can have a patch up here? Fixing those bugs should eliminate cases (1) and (2) (comment #28) and the suggested patch handles case (3). If we want a fix for case (3) only I can whip it up based on the suggested patch - I see no reason for extending the interface(s) to handle case (2) if it will go away in the near future anyway.
Comment 34•13 years ago
|
||
(In reply to comment #33) > Fixing those bugs should eliminate cases (1) and (2) (comment #28) and the > suggested patch handles case (3). If we want a fix for case (3) only I can whip > it up based on the suggested patch - I see no reason for extending the > interface(s) to handle case (2) if it will go away in the near future anyway. I'd like to get as much as we can fix done right away, so getting at least case (3) fixed will be a big help.
Assignee | ||
Comment 35•13 years ago
|
||
Updated according to comments from reviewer and cleaned up to deliberately only handle "case (3)" entries (i.e. entries which has attached a thread different from main-thread). Pushed to tryserver.
Attachment #432413 -
Attachment is obsolete: true
Attachment #434202 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 36•13 years ago
|
||
Passes tryserver. This patch writes metadata for any cache-entry on the thread which accessed the entry the last time. I.e. it should also deal with metadata for async cache-reads when these are implemented. Metadata for cache-entries which are read/written on main-thread are still written on main-thread. The concern raised in comment #28, 4th paragraph is still unanswered. Moreover, I believe this is a dup of bug #537934 (or vice-versa). I'd also suggest to rename this bug to "Write/flush cache metadata off main-thread" or something similar to better reflect the real issue.
Attachment #434202 -
Attachment is obsolete: true
Attachment #435149 -
Flags: review?(michal.novotny)
Attachment #434202 -
Flags: review?(michal.novotny)
Comment 37•13 years ago
|
||
(In reply to comment #36) > The concern raised in comment #28, 4th paragraph is still unanswered. michal, can you respond to this? > Moreover, I believe this is a dup of bug #537934 (or vice-versa). We should probably forward dupe it.
Summary: Cache Writes Still happening on the main thread? → Write/flush cache metadata off main-thread
Comment 38•13 years ago
|
||
(In reply to comment #28) > Second, (1) must be handled properly. In the cases where (1) is caused by a > pending request, fixing bug #548406 is IMO a prerequisite (i.e. properly deal > with binding doomed entries). Apart from this, we can IMO treat it like (2) > above. Is this the right bug number?
Assignee | ||
Comment 39•13 years ago
|
||
(In reply to comment #38) > (In reply to comment #28) > > Second, (1) must be handled properly. In the cases where (1) is caused by a > > pending request, fixing bug #548406 is IMO a prerequisite (i.e. properly deal > > with binding doomed entries). Apart from this, we can IMO treat it like (2) > > above. > Is this the right bug number? Yes. Bug #548406 is caused by pending cache-write-requests. If these for some reason are deactivated before executed (e.g. due to another write-request to the same entry) the entries will not have a thread attached at deactivation-time, i.e. case (1).
Comment 40•13 years ago
|
||
(In reply to comment #37) > (In reply to comment #36) > > The concern raised in comment #28, 4th paragraph is still unanswered. > michal, can you respond to this? Yes, we should probably null out mThread as well. Hmm, cacheElement is a public attribute of the cache entry, but we use it internally to hold the reference to the binding. The question here is what will happen if someone try to use it for his own purpose...
Comment 41•13 years ago
|
||
Comment on attachment 435149 [details] [diff] [review] Updated : Post DeactivateEntry, wait for deactivate-events in Shutdown Sorry for the delay, I was experiencing a crash when this patch and the patch from #513008 were applied, so I was inspecting it. Regarding my comment #40 I don't think it's a good idea to use mThread for the decision on which thread to deactivate the entry. IMHO we should either remove cacheElement attribute from the IDL if it isn't needed, or add a new thread variable for our purpose.
Attachment #435149 -
Flags: review?(michal.novotny)
Comment 42•13 years ago
|
||
(In reply to comment #36) > + // (The dispatched event must post some event back to this thread to ensure > + // the event-loop-spin which implements NS_DISPATCH_SYNC doesn't starve - > + // hence the parameter) Why exactly is this needed?
Assignee | ||
Comment 43•13 years ago
|
||
>Regarding my comment #40 I > don't think it's a good idea to use mThread for the decision on which thread to > deactivate the entry. IMHO we should either remove cacheElement attribute from > the IDL if it isn't needed, or add a new thread variable for our purpose. Make cacheElement or the mThread "non-scriptable" in the IDL? (Assuming you are worried about scripts accessing this?) (In reply to comment #42) > (In reply to comment #36) > > + // (The dispatched event must post some event back to this thread to ensure > > + // the event-loop-spin which implements NS_DISPATCH_SYNC doesn't starve - > > + // hence the parameter) > > Why exactly is this needed? The code which implements synchronous event-dispatching spins the event-loop for the calling thread until the dispatched event has finished. If the event-queue for the calling thread is exhausted before the dispatched event runs, it *looks like* it keeps spinning until an event is dispatched to the calling thread. So, by posting an (empty) event back to the calling thread I make sure the calling thread receives an event and breaks out of the spin.
Comment 44•13 years ago
|
||
(In reply to comment #43) > Make cacheElement or the mThread "non-scriptable" in the IDL? (Assuming you are > worried about scripts accessing this?) C++ could still access it then (think add-ons).
Speaking only for myself, I'm willing to break all 0.05 add-ons that want to mess around with this stuff from C++, and give them use-case-specific APIs if their authors ask.
Comment 46•13 years ago
|
||
I'm not sure about the reason why this attribute exists, but it doesn't seem to be used by our code (only some tests use it). And I could find only one addon that uses it (Bugmail for Thunderbird https://mxr.mozilla.org/addons/source/9584/chrome/content/overlay.js#75). So can we simply remove the attribute or do we need to keep it and create a new member variable in nsCacheEntry for our internal purposes?
Comment 47•13 years ago
|
||
(In reply to comment #43) > The code which implements synchronous event-dispatching spins the event-loop > for the calling thread until the dispatched event has finished. If the > event-queue for the calling thread is exhausted before the dispatched event > runs, it *looks like* it keeps spinning until an event is dispatched to the > calling thread. So, by posting an (empty) event back to the calling thread I > make sure the calling thread receives an event and breaks out of the spin. This shouldn't happen and IMHO if it happens it is a bug and it should be fixed in nsThread. BTW what you're trying to do is exactly what nsThreadSyncDispatch::Run() already does.
Assignee | ||
Comment 48•13 years ago
|
||
(In reply to comment #47) > This shouldn't happen and IMHO if it happens it is a bug and it should be fixed > in nsThread. BTW what you're trying to do is exactly what > nsThreadSyncDispatch::Run() already does. You're right on both, of-course. But the fact remains : The ack made the patch pass the tryserver (tests were hanging without it). I'll dig a little deeper on this next week and re-run the patch a few times to make sure it's robust in this sense. As a side-note, there are very few places where sync-dispatching is used at all (ogg- and wave-decoders, gnomevfs protocolhandler and then in TestProxies). Not entirely sure what TestProxies do, but I'll figure out next week...
Comment 49•13 years ago
|
||
What is the status of this bug?
Updated•13 years ago
|
blocking2.0: --- → ?
Assignee | ||
Comment 50•13 years ago
|
||
This has been on the list for too long... here's an updated patch. (In reply to comment #37) > > Moreover, I believe this is a dup of bug #537934 (or vice-versa). > We should probably forward dupe it. Uncertain about terminology... should I mark 537934 a duplicate of this bug? (In reply to comment #40) > Yes, we should probably null out mThread as well. Ok - done. > Hmm, cacheElement is a public attribute of the cache entry, but we use it > internally to hold the reference to the binding. The question here is what will > happen if someone try to use it for his own purpose... Well, |cacheElement| only provides access to the data of the element, not the nsCacheEntry-object itself. Setting the attribute sets the data and stores a reference to the calling thread in |nsCacheEntry::mThread|, but apart from storing this reference I don't see any problems with |cacheElement| being public (please correct me if I'm missing something!). The lifecycle of |nsCacheEntry::mThread| may thus be outside our control (e.g. it may be created and destroyed by an addon). Which means that when we try to deactivate the entry using |nsCacheEntry::mThread| it may fail miserably. I've added code to deactivate the entry on main-thread if dispatching fails. Should we be even more careful? Is there a util-method to tell if a thread is still valid? Can we tell somehow if the thread is out of our control and avoid storing it in |nsCacheEntry::mThread| in this case? Or the simplest solution is of-course to remove the attribute completely... :) It is used in tests, like michal says, to verify that the cache returns what is stored, but there may be other ways to verify that. > (In reply to comment #43) > This shouldn't happen and IMHO if it happens it is a bug and it should be fixed > in nsThread. BTW what you're trying to do is exactly what > nsThreadSyncDispatch::Run() already does. I've removed this from the current patch. Pushed to tryserver for verification (again).
Attachment #435149 -
Attachment is obsolete: true
Attachment #444071 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 51•13 years ago
|
||
Patch passes tryserver on Linux and OSX. Fails one reftest on WinNT and encounters a number of ###!!! [Parent][RPCChannel] Error: Channel error: cannot send/recv which results in a crash, logged as PROCESS-CRASH | automation.py | application crashed (minidump found) (Brief log can be found at http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1273235252.1273245260.13979.gz ) The reftest-failure seems completely unrelated to the patch, and I can see these types of crashes in other tryserver-runs, hence I'd claim that patch passes tryserver. Waiting for review.
Comment 52•13 years ago
|
||
(In reply to comment #51) > ###!!! [Parent][RPCChannel] Error: Channel error: cannot send/recv Did you build with electrolysis? It sounds like it's trying to to do ipc stuff there.
Assignee | ||
Comment 53•13 years ago
|
||
That would probably be a bummer, yes... I've not enabled that intentionally, and 'hg identify' returns 0459345d8026 tip Anything else I can check to verify/falsify? (Patch did apply cleanly on tryserver "mozilla-central" repository...)
Comment 54•13 years ago
|
||
(In reply to comment #53) > Anything else I can check to verify/falsify? (Patch did apply cleanly on > tryserver "mozilla-central" repository...) Ask in #content? I'm not sure offhand.
Assignee | ||
Comment 55•13 years ago
|
||
I've disabled electrolysis at runtime in this patch by setting pref 'dom.ipc.plugins.enabled' to false everywhere in "browser/app/profile/firefox.js". Pushed to tryserver again...
Assignee | ||
Comment 56•13 years ago
|
||
Patch as described above passes tryserver for OSX but encounters bug #562886 on Linux and bug #538616 on WinNT (with electrolysis disabled though, which is interesting). Link to WinNT log : http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1273657708.1273667985.641.gz Link to Linux log : http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1273657708.1273667868.32729.gz Patch is identical to the patch attached in comment #50 (with the exception of setting "dom.ipc.plugins.enabled" to false). Hence, I claim again that it passes tryserver. Waiting for review.
Assignee | ||
Comment 57•13 years ago
|
||
Hmm.... I was mistaken about OSX. I confused email about building successfully on OSX with successfully running unit tests on OSX. Finally, the email with 2 warnings in OSX unit tests came along... I have to look closer at it later - cannot reproduce any of those issues locally. Review would still be valuable, I believe. Log for OSX is : http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1273657708.1273671840.19635.gz
Assignee | ||
Comment 58•13 years ago
|
||
Ohh - if it's feasible to run the DTrace-script which defines the criteria for IO-reduction, please direct me to instructions for how to do so. Or maybe sdwilsh wants to do that? Even without the async cache-read patch we should see some effect...
Comment 59•13 years ago
|
||
(In reply to comment #58) > Ohh - if it's feasible to run the DTrace-script which defines the criteria for > IO-reduction, please direct me to instructions for how to do so. Or maybe > sdwilsh wants to do that? Even without the async cache-read patch we should see > some effect... I documented this in the wiki :) https://wiki.mozilla.org/Firefox/Goals/2010Q1/IO_Reduction_Criteria
Comment 60•13 years ago
|
||
(In reply to comment #57) > I have to look closer at it later - cannot reproduce any of those issues > locally. Review would still be valuable, I believe. The failure is from thread 5 crashing (EXC_BAD_ACCESS). Backtrace looks like this: 0 XUL + 0x73721 eip = 0x02281721 esp = 0xb030adc0 ebp = 0xb030ade8 ebx = 0x0228170d esi = 0x0020002a edi = 0x00dd3d60 eax = 0x00000000 ecx = 0xb030ab4c edx = 0xb030addc efl = 0x00010202 Found by: given as instruction pointer in context 1 XUL + 0x7188d eip = 0x0227f88e esp = 0xb030adf0 ebp = 0xb030ae18 Found by: previous frame's frame pointer 2 XUL!nsMemoryCacheDevice::DeactivateEntry(nsCacheEntry*) [nsMemoryCacheDevice.cpp:2108d2458719 : 170 + 0x7] eip = 0x022870c5 esp = 0xb030ae20 ebp = 0xb030ae48 Found by: previous frame's frame pointer 3 XUL!nsCacheServiceDeactivateEntryEvent::Run() [nsCacheService.cpp:2108d2458719 : 1876 + 0x11] eip = 0x022860f8 esp = 0xb030ae50 ebp = 0xb030ae68 Found by: previous frame's frame pointer 4 XUL!nsThread::ProcessNextEvent(int, int*) [nsThread.cpp:2108d2458719 : 527 + 0x7] eip = 0x02ef263d esp = 0xb030ae70 ebp = 0xb030aeb8 Found by: previous frame's frame pointer 5 XUL!NS_ProcessNextEvent_P(nsIThread*, int) [nsThreadUtils.cpp : 250 + 0x15] eip = 0x02eaf5ba esp = 0xb030aec0 ebp = 0xb030aee8 Found by: previous frame's frame pointer 6 XUL!nsThread::ThreadFunc(void*) [nsThread.cpp:2108d2458719 : 254 + 0xf] eip = 0x02ef27f0 esp = 0xb030aef0 ebp = 0xb030af38 Found by: previous frame's frame pointer 7 libnspr4.dylib!_pt_root [ptthread.c:2108d2458719 : 228 + 0x8] eip = 0x0005f892 esp = 0xb030af40 ebp = 0xb030af78 Found by: previous frame's frame pointer 8 libSystem.B.dylib + 0x31c54 eip = 0x90037c55 esp = 0xb030af80 ebp = 0xb030afc8 Found by: previous frame's frame pointer 9 libSystem.B.dylib + 0x31b11 eip = 0x90037b12 esp = 0xb030afd0 ebp = 0xb030afec Found by: previous frame's frame pointer
Comment 61•13 years ago
|
||
also see this just before the crash: xpcshell(30795,0xb030b000) malloc: *** error for object 0xdd0006: Non-aligned pointer being freed *** set a breakpoint in malloc_error_break to debug
Assignee | ||
Comment 62•13 years ago
|
||
> > IO-reduction, please direct me to instructions for how to do so. Or maybe > > sdwilsh wants to do that? Even without the async cache-read patch we should see > > some effect... > I documented this in the wiki :) > https://wiki.mozilla.org/Firefox/Goals/2010Q1/IO_Reduction_Criteria Dtrace does not seem to be available for my normal development-box (Ubuntu Linux) but I'm planning to get my hands on a Mac-box. I assume that would help? (In reply to comment #60) > The failure is from thread 5 crashing (EXC_BAD_ACCESS). Backtrace looks like > this: > 0 XUL + 0x73721 > eip = 0x02281721 esp = 0xb030adc0 ebp = 0xb030ade8 ebx = 0x0228170d > esi = 0x0020002a edi = 0x00dd3d60 eax = 0x00000000 ecx = 0xb030ab4c > edx = 0xb030addc efl = 0x00010202 > Found by: given as instruction pointer in context > 1 XUL + 0x7188d > eip = 0x0227f88e esp = 0xb030adf0 ebp = 0xb030ae18 > Found by: previous frame's frame pointer > 2 XUL!nsMemoryCacheDevice::DeactivateEntry(nsCacheEntry*) > [nsMemoryCacheDevice.cpp:2108d2458719 : 170 + 0x7] > eip = 0x022870c5 esp = 0xb030ae20 ebp = 0xb030ae48 > Found by: previous frame's frame pointer [ snip ] Thanks a bundle! This is certainly relevant to the patch. However, I'm wondering if maybe the currently suggested patch for bug #548406 may help here : it cleans up the sequence of removing entries from memory-cache, which is where this crash happens. The suggested patch for this bug changes timing of deactivating / removing entries, and a cleaner sequence for this in the memory-cache cannot hurt. I'll submit a combined patch to the tryserver.
Comment 63•13 years ago
|
||
(In reply to comment #62) > Dtrace does not seem to be available for my normal development-box (Ubuntu > Linux) but I'm planning to get my hands on a Mac-box. I assume that would help? Very likely. I was using it on a mac, don't know if that script would work on Linux.
Assignee | ||
Comment 64•13 years ago
|
||
> The suggested patch for this bug changes timing of deactivating / removing > entries, and a cleaner sequence for this in the memory-cache cannot hurt. I'll > submit a combined patch to the tryserver. Results from tryserver for combined patch for this bug with patch for bug #548406 : 1) Combined patch fails with bug #562886 on Linux. 2) On WinNT it fails with the error described in bug #548406, comment #323. 3) On all three platforms it fails in chrome://mochikit/content/browser/browser/components/sessionstore/test/browser/browser_522545.js Analysis (1) Is a known issue on Linux and is unrelated to this patch (2) Indicates that the patch for bug #548406 introduces/triggers some issue in the application-cache. Must be resolved. (3) Can be reproduced locally both with and without the patch. However, when using the latest version from trunk it passes locally without and with the patch. I'll ignore this issue for now. Conclusion Need to figure out what causes (2) above (failure in "test_simpleManifest.html")
Assignee | ||
Comment 65•13 years ago
|
||
> Need to figure out what causes (2) above (failure in
> "test_simpleManifest.html")
Cannot reproduce locally (yet). An observation is that the listener for the app-cache is installed before the ordinary cache-listener, hence when the ordinary cache-listener becomes an async tee, the app-cache will also be written asynchronously on the cache-write thread (since these tees are chained with the last one installed first in the list).
Assignee | ||
Comment 66•13 years ago
|
||
Now I see what probably causes the problem (ref also comment above). The failing check in "test_simpleManifest.js" uses the attribute "mozLength" which is derived from the list of entries gathered from the cache-device. Before performing this check, the test waits for the cache-entry to become available using nsCacheService::OpenCacheEntry() which calls nsCacheService::ActivateCache(). ActivateCache() may find the cache-entry in mActiveEntries *before* the entry is bound to the device, hence there is a time-window from the cache-entry is found and until the cache-device knows the entry. This time-window is also the root of bug #548406, however the fix for bug #548406 is not to eliminate the time-window but rather to handle it properly. I'll find a way to handle it here also.
Assignee | ||
Comment 67•13 years ago
|
||
Ahh... can this be bug #484123? I've uploaded the patch from comment #64 to tryserver including a small fix to the test to ensure the time-window is properly handled.
Comment 69•13 years ago
|
||
(In reply to comment #68) > I think we want this for 1.9.3. I think we *need* it ;)
Assignee | ||
Comment 70•13 years ago
|
||
(In reply to comment #67) > I've uploaded the patch from comment #64 to tryserver including a small fix to > the test to ensure the time-window is properly handled. Results from tryserver contain too many of these to make sense ###!!! [Parent][RPCChannel] Error: Channel error: cannot send/recv followed by a crash. I (accidentally) re-enabled the electrolysis-stuff in this version, so I've disabled it again and resubmitted to tryserver...
Assignee | ||
Comment 71•13 years ago
|
||
(In reply to comment #70) > ... , so I've disabled it again and resubmitted to tryserver... Results from tryserver OSX : ok Linux : hits bug #562866, plus a crash in test_updatecheck.js WinNT : hits bug #538616 The crash on Linux in test_updatecheck.js shows this log Crash reason: SIGSEGV Crash address: 0x1100 Thread 5 (crashed) 0 libxul.so!nsDiskCacheMap::DeleteRecord(nsDiskCacheRecord*) [nsDiskCacheMap.h:962135366d6b : 139 + 0x0] eip = 0x4026de26 esp = 0x4270127c ebp = 0x0a92b308 ebx = 0x40ff6974 esi = 0x0fb6aab1 edi = 0x00001100 eax = 0x00000010 ecx = 0x00000000 edx = 0x00001100 efl = 0x00010246 Found by: given as instruction pointer in context It can certainly be related to this patch (it posts deactivation on the device to another thread) and I cannot find any existing bugs mentioning this. Will have to look closer at this issue.
Assignee | ||
Comment 72•13 years ago
|
||
(In reply to comment #71) > Linux : hits bug #562866, plus a crash in test_updatecheck.js Sorry - should be : Linux : hits bug #562886, plus a crash in test_updatecheck.js
Assignee | ||
Comment 73•13 years ago
|
||
An nsCacheEntry may actually have descriptors attached when we delete it in nsCacheService::DeactivateEntry(), so in this version we make sure to clear these before deleting the entry. Failing to do so may have caused the crashes in comment #62 and comment #71. (It passes local testing - just waiting to get push-to-tryserver access so that it can be tested properly...)
Attachment #444071 -
Attachment is obsolete: true
Attachment #449863 -
Flags: review?(michal.novotny)
Attachment #444071 -
Flags: review?(michal.novotny)
Comment 74•13 years ago
|
||
Comment on attachment 449863 [details] [diff] [review] Modified to detach descriptors from cacheEntry if deleted in DeactivateEntry I've pushed this patch to the try server on your behalf (changeset 3f7a8816a26f). You can see the results (in a few hours/days/months/centuries) on <http://tests.themasta.com/tinderboxpushlog/?tree=MozillaTry>. :-)
Comment 75•13 years ago
|
||
Try server builds look good. Note that we have ten days until the beta code freeze, and this is blocking.
Assignee | ||
Comment 76•13 years ago
|
||
(In reply to comment #74) > I've pushed this patch to the try server on your behalf (changeset > 3f7a8816a26f). Thanks! Appreciate it! Note that the attached patch explicitly disables IPC globally by changing browser/app/profile.firefox.js. Should probably be stripped from the final patch. (In reply to comment #75) > Try server builds look good. Note that we have ten days until the beta code > freeze, and this is blocking. Lots of things can be done in ten days. :) try-mac-slave15 is orange, but so are most builds before and after try-mac-slave08 is orange, the two subsequent builds have the exact same error try-mac-slave24 is orange, the builds before and after have same errors try-win32-slave01 is orange, the builds before and after have same errors try-win32-slave17 is orange, suspicious-looking crash! try-win32-slave19 is orange, Java RT died otherwise all is green. I'll have a look at the issue on try-win32-slave17 which seems relevant to this patch. (Hopefully, the Java-crash is a coincidence although I cannot find any bugs mentioning it.)
Assignee | ||
Comment 77•13 years ago
|
||
(In reply to comment #76) > I'll have a look at the issue on try-win32-slave17 which seems relevant to this > patch. Seems like the same issue as seen in comment #71. nsCacheEntry-objects are actually deleted in a number of places so I'll detach descriptors (ref comment #73) in the dtor to block pending operations on the entry.
Assignee | ||
Comment 78•13 years ago
|
||
I'm still waiting for permission to push to tryserver, so if anyone would like to push it for me I'd be grateful. :)
Attachment #449863 -
Attachment is obsolete: true
Attachment #450116 -
Flags: review?(michal.novotny)
Attachment #449863 -
Flags: review?(michal.novotny)
Comment 79•13 years ago
|
||
(In reply to comment #78) > Created an attachment (id=450116) [details] > Modified as described in comment #77 > > I'm still waiting for permission to push to tryserver, so if anyone would like > to push it for me I'd be grateful. :) [Aside: bug number for your request? Level 1 access should not take that long.]
Comment 80•13 years ago
|
||
Bug 570475, and I think the reason for the delay is that the magic token message saying it's ready wasn't added, so I just added that. I just pushed the latest patch to the try server, commit id is 828551107c36.
Comment 81•13 years ago
|
||
I've pushed a try server run with this patch, and I just saw comment 78. Bjarne, feel free to watch over it (changeset 27d8113d68f7 at Wed Jun 9 17:36:00 2010 titled: imported patch attachment.cgi?id=450116).
Comment 82•13 years ago
|
||
And I just saw comment 80! Well, now you have two try server runs! :-)
Assignee | ||
Comment 83•13 years ago
|
||
(In reply to comment #80) > Bug 570475, and I think the reason for the delay is that the magic token > message saying it's ready wasn't added, so I just added that. Thanks! I'm not too familiar with these formal procedures... > I just pushed the latest patch to the try server, commit id is 828551107c36. try-win32-slave10 (opt, mochitests-1/5): fails with bug #565245 try-win32-slave29: Java RT died try-win32-slave10 (opt, xpcshell): fails with bug #538616 try-linux-slave13 (debug, reftest): fails reftest/tests/layout/reftests/svg/dynamic-use-nested-01.svg try-linux-slave18 (debug, crashtest): fails with assertion count for crashtest 432114-2.html
Comment 84•13 years ago
|
||
Bjarne, since you have two try server pushes with the same patch, you could compare their results, specifically for the second one below. The dynamic-use-nested-01.svg failure was bug 571036 which is fixed now. On the assertion count failure, you need to check the actual assertions triggered. If they're not related to the cache code, this is probably an intermittent failure, which you should file a bug for. :-)
Assignee | ||
Comment 85•13 years ago
|
||
Seems like my previous comment was submitted to early... here's the rest. > I just pushed the latest patch to the try server, commit id is 828551107c36. try-w32-slave04 (debug, crashtest): application timeout near crashtests/386680-1.html try-w32-slave10 (opt, xpcshell): fails with bug #538616 try-w32-slave10 (opt, mochitests-1/5): fails with bug #565245 try-w32-slave10 (debug, mochitests-2/5): fails with bug #492232 try-w32-slave18 (debug, mochitests-3/5): LEAK try-w32-slave25 (debug, jsreftest): fails with wrong assertion count for regress-344804.js and regress-181834.js try-w32-slave29: Java RT died try-linux-slave02 (maemo build): failed, but previous builds also failed try-linux-slave13 (debug, reftest): fails reftest/tests/layout/reftests/svg/dynamic-use-nested-01.svg try-linux-slave18 (debug, crashtest): fails with wrong assertion count for crashtest 432114-2.html try-linux-slave22 (opt, mochitests-other): some failures, starting with bug #539283 try-mac-slave24 (opt, mochitest/other): fails with bug #567950 try-mac-slave28 (opt, reftest): crash in ValueToNumberSlow() while processing an nsInputStreamReadyEvent - not sure whether it's relevant OSX 10.6.2 (talos tp4): Busted, but build prior to this was also busted try-mac64-slave09 (OSX 10.6.2 leak test): fails like all builds before and after Phew... (In reply to comment #81) > Bjarne, feel free to watch over it (changeset 27d8113d68f7 at Wed Jun 9 Thanks again. :) try-w32-slave09 (opt, xpcshell): fails with bug #538616 (like above) try-w32-slave16 (opt, mochitests-1/5): fails with bug #565245 (like above) try-linux-slave11 (debug, reftest): fails in "dynamic-use-nested-01.svg" (like above) try-w32-slave26 (debug, jsreftest): fails with wrong assertion count for regress-344804.js and regress-181834.js (like above) try-linux-slave08 (opt, xpcshell): crash in "test_bug_406857.js" - very much like the one in comment #71 try-linux-slave21 (opt, mochitest-other): some failures, first in "browser_sidebarpanels_click.js" which probably is bug #539283 (like above) try-linux-slave25 (opt, reftest): fails reftest/tests/layout/reftests/svg/dynamic-use-nested-01.svg (like above for debug-build) + several more failures The interesting (and relevant) thing is the (Linux, opt, xpcshell) crash on try-linux-slave08 which looks similar to comment #71. The latest fix apparently did not resolve this, so I'll have a look at it again. It seems to be triggered by accessing a field in a nsDiskCacheMap-object which, most likely, is deleted: http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheMap.h#139 probably called from http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheMap.cpp#478 and I have also seen a stack pointing to http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheMap.cpp#484
Assignee | ||
Comment 86•13 years ago
|
||
The interesting crash happens in this context 1. nsDiskCacheDevice::DeactivateEntry() is called off main-thread 2. The entry is not doomed... 3. ...hence DeactivateEntry() tries to write the entry to disk, but fails 4. ...and crashes in mCacheMap::DeleteRecord() when cleaning up Stack is here http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1276132562.1276133440.25820.gz#err1 Looks like the nsDiskCacheRecord-param to mCacheMap::DeleteRecord() is corrupted, causing the crash. Questions: - why would writing the cache entry fail in the first place? - how can the param to nsDiskCacheMap::DeleteRecord() be corrupted? I'm a little stuck at the moment. I don't see what can happen in the timeslot between posting the deactivate-event and running it, which causes writing the entry to disk to fail (or corrupt the param to DeleteRecord for that matter). Any suggestions are welcome, as well as any ideas for how to set up a test to trigger this. (I'm not able to reproduce.) Will look into this over the weekend again.
Comment 87•13 years ago
|
||
Given a quick look at this patch, it's clearly wrong to assume that the weak references helf by nsCacheServiceDeactivateEntryEvent are going to be valid once we actually need them. For example, what would prevent the service itself from going away? What about the entry? I think we should be holding some kind of a strong pointer to those objects.
Assignee | ||
Comment 88•13 years ago
|
||
That's valuable input. Thanks! In principle I agree, but the plan was to control this by SYNC_DISPATCHing the dummy-events in nsCacheService::Shutdown() to ensure deactivation-events have finished before releasing the service. This assumes that Shutdown() in fact is called, and that nsCacheEntry-objects are not delete'd inappropriately (I'm tracing the potential for the latter now, but it looks ok). Are these assumptions completely wrong? Nevertheless, I'll try using a strong ref and once my commit-access is working...
Comment 89•13 years ago
|
||
(In reply to comment #88) > That's valuable input. Thanks! > > In principle I agree, but the plan was to control this by SYNC_DISPATCHing the > dummy-events in nsCacheService::Shutdown() to ensure deactivation-events have > finished before releasing the service. This assumes that Shutdown() in fact is > called, and that nsCacheEntry-objects are not delete'd inappropriately (I'm > tracing the potential for the latter now, but it looks ok). Are these > assumptions completely wrong? I don't know this code very well, so I'm probably not the best person to answer that. But I have a helpful tip for you, try running Firefox under valgrind, which would assist you in determining memory safety issues. > Nevertheless, I'll try using a strong ref and once my commit-access is > working... Feel free to ask me at any time to do a try server push in the mean time. You can comment on the bug, send me a direct email, ping me on irc, give me a call, whatever! :-) I'd really like to help you get this landed on mozilla-central as quickly as possible.
Assignee | ||
Comment 90•13 years ago
|
||
Apppreciate it, Ehsan - thanks for your support! Finally managed to reproduce locally, and I really wonder why I didn't see this issue before: The cache-devices are shut down also in nsCacheService::OnProfileChanged() but with no code to block on pending deactivation-events. Fixed, cannot reproduce locally anymore, and pushed to tryserver (yes - I can do that now! :) ). Waiting for results from tryserver and review.
Attachment #450116 -
Attachment is obsolete: true
Attachment #451264 -
Flags: review?(michal.novotny)
Attachment #450116 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 91•13 years ago
|
||
Results are better, but still there is a weird crash. After spending most of the day reproducing it locally and analyzing it I realize that nsMemoryCache::FindEntry() just picks up the entry from a hashtable, keeping old values for |mData| and |mThread| in particular. This seems wrong since deactivation of the entry will then be dispatched to the thread which initially wrote data to the entry, as opposed to the thread which called FindEntry(). Fixed, cannot reproduce locally, and new patch pushed to tryserver...
Assignee | ||
Comment 92•13 years ago
|
||
More details fixed and another push to tryserver... (this feels more and more like Pandora's box)
Assignee | ||
Comment 93•13 years ago
|
||
Still ironing out details, but at least we're down to details now. One more push to tryserver before the week-end...
Comment 94•13 years ago
|
||
I'd love to see this in beta1, but I can't say it's really a blocker. If this is landable before the deadline (TBD, but not very many days from now), let's make that happen!
blocking2.0: beta1+ → beta2+
Assignee | ||
Comment 95•13 years ago
|
||
Must have been something wrong with tryservers over the weekend because results all over the place look really funny... pushed again on monday when things seemed to be stable. Now results look ok. There are two failing jsreftests (orange) with wrong assertion-count on w32/dbg but these seem to be common these days. The last (and interesting) orange is in http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1277123225.1277125340.1704.gz#err0 Scroll one line up and observe the explicit abort from imglib... looks like imglib aborts in certain cases (Release() in this case) when multiple threads are involved and I doubt such behaviour will/can be handled properly by the cache in the same process... (The next error in the given log is clearly related to this patch but I expect it to be a consequence of the explicit abort by imglib. I'm trying to reproduce it locally but no success yet.) Question is : What to do with explicit protection against multiple threads in imglib with increased usage of multiple threads in the cache?
Assignee | ||
Comment 96•13 years ago
|
||
This concludes work here for now. Pushed to tryserver one last time for verification and requesting review. There is still the open question/concern in comment #95 (last paragraph) but I think it should be handled as a separate bug, and it may also be relevant in other contexts than this one.
Attachment #451264 -
Attachment is obsolete: true
Attachment #453341 -
Flags: review?(michal.novotny)
Attachment #451264 -
Flags: review?(michal.novotny)
Comment 97•13 years ago
|
||
Michal, any chance you can get to this review any time soon?
blocking2.0: beta2+ → beta3+
Comment 98•13 years ago
|
||
Still a b3 blocker? Not comfortable with it continuing to bump down the line, maybe betaN+?
Assignee | ||
Comment 100•13 years ago
|
||
Unbitrotted and passed to tryserver to get a fresh feeling about current status. The open question in comment #95 (in essence: How to release imgCacheEntry-objects on an arbitrary thread) is probably bug #529949 so I'm taking the liberty of adding a dependency to it. I also observe bug #542809 which seems to be the same problem in a different context.
Depends on: 529949
Assignee | ||
Comment 101•13 years ago
|
||
(In reply to comment #100) > Unbitrotted and passed to tryserver to get a fresh feeling about current > status. Five starred failures. However, it also fails consistently on all platforms, both debug and opt on chrome://mochikit/content/chrome/widget/tests/test_imestate.html I find this failure mentioned in bug #584944 and bug #585361 although it is not the main issue of either of those bugs. Trying to reproduce locally...
Assignee | ||
Comment 102•13 years ago
|
||
Not able to reproduce locally (opt/dbg). Push again to tryserver...
Assignee | ||
Comment 103•13 years ago
|
||
Results from last push are surprisingly (suspiciously?) green with only three oranges : http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282131772.1282134988.1526.gz which looks like bug #544601 http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282124270.1282134123.28731.gz where I cannot find the problem (a leak test showing no leaks afaics) and http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282131772.1282133850.27248.gz which fails in /tests/dom/tests/mochitest/dom-level2-core/test_elementgetattributenodens03.html on the assertion ###!!! ASSERTION: nsScriptCacheCleaner not thread-safe: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file e:/builds/moz2_slave/tryserver-win32-debug/build/content/base/src/nsFrameMessageManager.cpp, line 681 The latter issue should probably be taken seriously although the test itself only contains two TODOs. Cannot reproduce locally but will investigate further.
Assignee | ||
Comment 104•13 years ago
|
||
As discussed on irc, this is a different and more direct approach to writing cache-metadata off the main-thread. I'm attaching it here to give you an idea of what I mean - comments are appreciated.
Attachment #467409 -
Flags: review?(michal.novotny)
Comment 105•13 years ago
|
||
Comment on attachment 467409 [details] [diff] [review] New, lighter approach to achieve the desired effect In the first patch (#453341) I really don't like collecting threads in mDeactivateEntryThreads and the method BlockOnAsyncDeactivateEvents(). Now when patch from bug #513008 has landed we have a dedicated cache IO thread that we can use to deactivate the entry. IMO there is also no reason to try to deactivate entry on the same thread where SetData() was called. Correct me if I'm wrong but there shouldn't be any problem to deactivate entry on the background IO thread even if data was written synchronously due to STORE_ON_DISK_AS_FILE policy. The second patch (#467409) is more straightforward. But what I don't like on both patches is the fact that we have the cache in inconsistent state for a short period. We proceed nsCacheService::DeactivateEntry() immediately but the disk device deactivates the entry later. I.e. within this period we don't have the entry in nsCacheService::mActiveEntries but nsDiskCacheBindery::FindActiveBinding() can find a binding for the entry. Instead of posting nsDiskCacheDevice::DeactivateEntry() to the background thread we should probably try to change the code so that most of the calls to nsCacheService::DeactivateEntry() happen on the background thread. It is called from 1) nsCacheService::ProcessRequest() 2) nsCacheService::DoomEntry_Internal() 3) nsCacheService::CloseDescriptor() 4) nsCacheService::ClearDoomList() 5) nsCacheService::DeactivateAndClearEntry() We should focus on 3 which happens all the time during normal browsing.
Attachment #467409 -
Flags: review?(michal.novotny)
Updated•13 years ago
|
Attachment #453341 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 106•13 years ago
|
||
(In reply to comment #105) > In the first patch (#453341) I really don't like collecting threads in > mDeactivateEntryThreads and the method BlockOnAsyncDeactivateEvents(). Now when > patch from bug #513008 has landed we have a dedicated cache IO thread that we > can use to deactivate the entry. I agree. > IMO there is also no reason to try to > deactivate entry on the same thread where SetData() was called. Not when we have a suitable thread available for this, no. > Correct me if > I'm wrong but there shouldn't be any problem to deactivate entry on the > background IO thread even if data was written synchronously due to > STORE_ON_DISK_AS_FILE policy. I think you're right. > The second patch (#467409) is more straightforward. But what I don't like on > both patches is the fact that we have the cache in inconsistent state for a > short period. We proceed nsCacheService::DeactivateEntry() immediately but the > disk device deactivates the entry later. I.e. within this period we don't have > the entry in nsCacheService::mActiveEntries but > nsDiskCacheBindery::FindActiveBinding() can find a binding for the entry. Good point! But (in the second approach) couldn't we just call nsDiskCacheBindery::RemoveBinding() before proxying the call? > Instead of posting nsDiskCacheDevice::DeactivateEntry() to the background > thread we should probably try to change the code so that most of the calls to > nsCacheService::DeactivateEntry() happen on the background thread. I see (and fwiw I like) what you say but I suspect it will be even harder to keep the cache consistent if we do it like that (i.e. there will be more state to synchronize between the threads). I'll keep this in mind, but I would like to give my second approach a harder try, addressing your concerns above.
Assignee | ||
Comment 107•13 years ago
|
||
I believe this slight modification takes care of the potential for inconsistent state which Michal points out in comment #105 by introducing a flag to prevent GetCacheEntryBinding(nsCacheEntry) to return the binding if there is a pending deactivate. (IMO there is a very unfortunate binding between the nsCacheEntry and the nsDiskCacheEntry where they point to each other and where the mData field of nsCacheEntry actually *is* the binding-object, but those are not relevant here...) It passes simple local testing as well as local random browsing. Pushed to tryserver...
Attachment #467409 -
Attachment is obsolete: true
Attachment #473021 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 108•13 years ago
|
||
(In reply to comment #107) > It passes simple local testing as well as local random browsing. Pushed to > tryserver... Some timeouts seen from tryserver caused by the changes to dtor. Reverted and pushed again.
Assignee | ||
Comment 109•13 years ago
|
||
Timeouts gone but more careful handling of bindings is required.
Assignee | ||
Comment 110•13 years ago
|
||
This version passes local testing and random browsing - pushed to tryserver...
Attachment #473021 -
Attachment is obsolete: true
Attachment #475153 -
Flags: review?(michal.novotny)
Attachment #473021 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 111•13 years ago
|
||
Afaics, it also passes tryserver. Five starred oranges plus some OpenGL oranges, which are to be expected these days. Waiting for review...
Comment 112•13 years ago
|
||
Running with the latest patch in this bug *and* with the patch in bug 588804 (I think, at least) applied I just hit a crash with the following stack: #2 <signal handler called> #3 arena_dalloc (ptr=0x100000001) at ../../../mozilla/memory/jemalloc/jemalloc.c:4271 #4 free (ptr=0x100000001) at ../../../mozilla/memory/jemalloc/jemalloc.c:6121 #5 0x00007fca4f16acfc in ~nsACString_internal (this=0x7fc9fbac7280, __in_chrg=<value optimized out>) at ../../dist/include/nsTSubstring.h:115 #6 ~nsCString (this=0x7fc9fbac7280, __in_chrg=<value optimized out>) at ../../dist/include/nsTString.h:55 #7 nsCacheEntry::~nsCacheEntry (this=0x7fc9fbac7280, __in_chrg=<value optimized out>) at ../../../mozilla/netwerk/cache/nsCacheEntry.cpp:83 #8 0x00007fca4f170a0b in nsDiskCacheDevice::DeactivateEntry (this= 0x7fca32f44200, entry=0x7fc9fbac7280) at ../../../mozilla/netwerk/cache/nsDiskCacheDevice.cpp:527 #9 0x00007fca4f16eb9d in nsCacheServiceDeactivateEntryEvent::Run (this= 0x7fc9a2293ae0) at ../../../mozilla/netwerk/cache/nsCacheService.cpp:1989 #10 0x00007fca4fab47ad in nsThread::ProcessNextEvent (this=0x7fc9c1bfde90, mayWait=1, result=0x7fc9ae5fdd3c) at ../../../mozilla/xpcom/threads/nsThread.cpp:547 #11 0x00007fca4fa852a9 in NS_ProcessNextEvent_P (thread=<value optimized out>, mayWait=<value optimized out>) at nsThreadUtils.cpp:250 #12 0x00007fca4fab4d5a in nsThread::ThreadFunc (arg=0x7fc9c1bfde90) at ../../../mozilla/xpcom/threads/nsThread.cpp:263 #13 0x00007fca4e16f4a3 in _pt_root (arg=0x7fc9b48c7240) at ../../../../mozilla/nsprpub/pr/src/pthreads/./ptthread.c:228 #14 0x00000037ed406a3a in start_thread () from /lib64/libpthread.so.0 #15 0x00000037eccde77d in clone () from /lib64/libc.so.6 (gdb) f5 #5 0x00007fca4f16acfc in ~nsACString_internal (this=0x7fc9fbac7280, __in_chrg=<value optimized out>) at ../../dist/include/nsTSubstring.h:115 115 ~nsTSubstring_CharT() { Finalize(); } (gdb) p *this $1 = {mData = 0x100000001 <Address 0x100000001 out of bounds>, mLength = 52500848, mFlags = 32714} (gdb) up #6 ~nsCString (this=0x7fc9fbac7280, __in_chrg=<value optimized out>) at ../../dist/include/nsTString.h:55 55 { (gdb) up #7 nsCacheEntry::~nsCacheEntry (this=0x7fc9fbac7280, __in_chrg=<value optimized out>) at ../../../mozilla/netwerk/cache/nsCacheEntry.cpp:83 warning: Source file is more recent than executable. 83 delete mKey; (gdb) p *this $2 = {<PRCListStr> = {next = 0x7fc9fbac7280, prev = 0x7fc9fbac7280}, mKey = 0x7fca2fb064c0, mFetchCount = 2, mLastFetched = 1285305343, mLastModified = 1285305348, mLastValidated = 1936024697, mExpirationTime = 1286880963, mFlags = 154368, mDataSize = 0, mCacheDevice = 0x7fca32f44200, mSecurityInfo = {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, mData = 0x7fc98df897e0, mThread = {<nsCOMPtr_base> = {mRawPtr = 0x7fc9c1bfde90}, <No data fields>}, mMetaData = {mBuffer = 0x0, mBufferSize = 0, mMetaSize = 0}, mRequestQ = {next = 0x7fc9fbac72e8, prev = 0x7fc9fbac72e8}, mDescriptorQ = {next = 0x7fc9fbac72f8, prev = 0x7fc9fbac72f8}} (gdb) I have a core file (from an optimized with symbols build) for this crash in case more info is needed.
Updated•13 years ago
|
Keywords: main-thread-io
Comment 113•13 years ago
|
||
We have any forward progress on this yet?
Comment 114•13 years ago
|
||
Comment on attachment 475153 [details] [diff] [review] Carefully handling bindings... > + nsresult rv = NS_NewThread(getter_AddRefs(mDiskCacheThread)); Shouldn't you explicitly shutdown mDiskCacheThread somewhere? Why didn't you use mCacheIOThread? > if (binding->mGeneration == p->mGeneration) { > - if (calcGeneration) ++binding->mGeneration; // try the next generation > - else { > + if (calcGeneration || p->mDeactivateInProgress) { > + ++binding->mGeneration; // try the next generation > + } else { > NS_ERROR("### disk cache: generations collide!"); What does this change mean? I do think that it isn't needed and if yes then there is some serious problem. This code could potentially change existing generation number, which is bad. > nsDiskCacheBinding * > GetCacheEntryBinding(nsCacheEntry * entry) > { > - return (nsDiskCacheBinding *) entry->Data(); > + nsDiskCacheBinding *tmp = (nsDiskCacheBinding *) entry->Data(); > + if (tmp->mDeactivateInProgress) > + return nsnull; > + else > + return tmp; > } This doesn't solve the issue that FindActiveEntry() sees (and returns) the binding. Also did you check all places where GetCacheEntryBinding() is called, if this can't be a problem? E.g. in nsDiskCacheDevice::DoomEntry() is an assertion which can be IMO triggered in following situation: - nsDiskCacheDevice::DeactivateEntry() is called and nsDiskCacheDeviceDeactivateEntryEvent is posted - some other entry is written and the cache is full, this starts evicting some entries in EvictDiskCacheEntries() - nsDiskCacheEvictor::VisitRecord() is called for the entry being deactivated - nsDiskCacheDevice::DoomEntry() is called -> assertion This is just one example. I'm afraid that there could be a lot of other possible problems that would occur only in specific situations, which would make them really hard to debug. So I think that we should keep nsDiskCacheBindery in sync with nsCacheService::mActiveEntries.
Attachment #475153 -
Flags: review?(michal.novotny) → review-
Assignee | ||
Comment 115•13 years ago
|
||
(In reply to comment #114) > Shouldn't you explicitly shutdown mDiskCacheThread somewhere? Why didn't you > use mCacheIOThread? Most of this was done before mDiskCacheThread existed, but yes: It should be used. Fixed. > This code could potentially change existing > generation number, which is bad. You're right - the OR is definitely wrong. In fact, this method doesn't need to know anything about whether the entry is about to be deactivated - it can be left unchanged. Should be ok now. > So I think that we should keep > nsDiskCacheBindery in sync with nsCacheService::mActiveEntries. I agree or disagree, depending on what you mean by "keep in sync". :) The flag nsDiskCacheBinding::mDeactivateInProgress was introduced to prevent the bindery from returning entries which are queued to be deactivated. I.e. it is meant to "bridge the time-slot" which is introduced by deactivating disk-cache entries asynchronously. I believe this is sufficiently "in sync" for our needs. E.g. ActiveBinding() and AddBinding() do *not* take this flag into account, while FindActiveBinding() and GetCacheEntryBinding() do. > This doesn't solve the issue that FindActiveEntry() sees (and returns) the > binding. FindActiveEntry() skip this binding like it does for doomed entries. > - nsDiskCacheDevice::DeactivateEntry() is called and > nsDiskCacheDeviceDeactivateEntryEvent is posted > - some other entry is written and the cache is full, this starts evicting some > entries in EvictDiskCacheEntries() > - nsDiskCacheEvictor::VisitRecord() is called for the entry being deactivated > - nsDiskCacheDevice::DoomEntry() is called -> assertion This shouldn't happen because FindActiveEntry() would skip entries with the flag set, ref above. However, you do have a point about the assertions which are often used immediately after calling GetCacheEntryBinding(). This applies to DeactivateEntry(), DoomEntry(), GetFileForEntry() and OnDataSizeChange(), and the question is whether these may be called in the time-slot introduced by this patch. The methods OpenInput/OutputStreamForEntry() use ENSURE_TRUE first, which seems to me like a reasonable solution, do you agree? Patch passes basic local testing - pushed to tryserver.
Attachment #475153 -
Attachment is obsolete: true
Attachment #485727 -
Flags: review?(michal.novotny)
Comment 116•13 years ago
|
||
(In reply to comment #112) > 83 delete mKey; > (gdb) p *this > $2 = {<PRCListStr> = {next = 0x7fc9fbac7280, prev = 0x7fc9fbac7280}, mKey = > 0x7fca2fb064c0, mFetchCount = 2, mLastFetched = 1285305343, > mLastModified = 1285305348, mLastValidated = 1936024697, mExpirationTime = > 1286880963, mFlags = 154368, mDataSize = 0, mCacheDevice = 0x7fca32f44200, > mSecurityInfo = {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, > mData = 0x7fc98df897e0, mThread = {<nsCOMPtr_base> = {mRawPtr = > 0x7fc9c1bfde90}, <No data fields>}, mMetaData = {mBuffer = 0x0, > mBufferSize = 0, mMetaSize = 0}, mRequestQ = {next = 0x7fc9fbac72e8, > prev = 0x7fc9fbac72e8}, mDescriptorQ = {next = 0x7fc9fbac72f8, prev = > 0x7fc9fbac72f8}} The members seem to be reasonable, except mFlags. I'd guess that the cache entry is deleted for the second time. I don't think that the patch from bug #588804 could cause this, so I'd suspect patch #475153 from this bug.
Assignee | ||
Comment 117•13 years ago
|
||
Not sure where you're going with this, but the mFlags value means that the nsCacheEntry is 1. doomed 2. dirty 3. metadata is dirty 4. it has stream data 5. it's initialized and the 0x2000-value looks weird. But isn't it deleting |mKey| which fails? Why do you believe the nsCacheEntry gets deleted twice? (I'm sorry if I'm a bit slow here, but I just don't see this...)
Assignee | ||
Comment 118•13 years ago
|
||
Oh - and note that patch #475153 from this bug did have some bad logic (as you pointed out in comment #114 and I acked in comment #115) which the new patch has fixed.
Comment 119•13 years ago
|
||
(In reply to comment #117) > Not sure where you're going with this, but the mFlags value means that the > nsCacheEntry is > > 1. doomed > 2. dirty > 3. metadata is dirty > 4. it has stream data > 5. it's initialized > > and the 0x2000-value looks weird. That's exactly the problem. How 0x20000 could get into mFlags? I'd guess that the memory was freed and overwritten. > But isn't it deleting |mKey| which fails? Why do you believe the nsCacheEntry > gets deleted twice? I've checked the lifetime of the mKey and I don't see a way how we could delete an invalid nsCString. First I thought that the problem could be caused by passing the key from the request to the entry in nsCacheService::ActivateEntry. But mKey is nulled out almost immediately after that in nsCacheEntry::RequestAccess(). The problem occurs only if mActiveEntries.AddEntry() fails since RequestAccess() isn't called and we release the key twice. But this isn't our case because the entry seems to be used. I've tried to add second delete entry to nsDiskCacheDevice::DeactivateEntry() and the debugger crashed when deleting the nsCString just like in comment #112, so that's why I think that the entry was deleted twice. > Oh - and note that patch #475153 from this bug did have some bad logic (as you > pointed out in comment #114 and I acked in comment #115) which the new patch > has fixed. Do you mean the condition when calculating generation? I think that the wrong condition can't happen at all, so this couldn't cause the crash.
what does valgrind say about the overwriting of freed memory?
Assignee | ||
Comment 121•13 years ago
|
||
(In reply to comment #119) > I've tried to add second delete entry to nsDiskCacheDevice::DeactivateEntry() > and the debugger crashed when deleting the nsCString just like in comment #112, > so that's why I think that the entry was deleted twice. Fair enough - what did mFlags look like? In fact, one way I can see this happening is that nsCacheService::DectivateEntry() is called for a doomed entry which has not bound yet due to delayed writing. The call will delete the entry, but when the write-request comes, the entry will bind. If the entry is deactivated later, we will get the second delete by the event. (This should not happen after the fix for bug #596443 landed though.) Will try to find other patterns which may give us two deletes. I'd also appreciate comments to the patch.
Assignee | ||
Comment 122•13 years ago
|
||
Mmmm... I'm looking at a crash from tryserver (with this patch) which asserts in nsDiskCacheBindery::RemoveBinding() that the bindery is not initialized. Wonder how that could happen. Note also that it is the code-line immediately before the one crashing in comment #116. An uninitialized bindery indicates that nsDiskCacheDevice::Shutdwon_Private() is called while a deactivate-event is pending. Either from nsDiskCacheDevice::ClearDiskCache() or from nsDiskCacheDevice::Shutdown(). Maybe nsDiskCacheDevice::Shutdown() should wait until mBindery has no more active bindings, like nsDiskCacheDevice::ClearDiskCache() does..
Does valgrind show anything when you run under it? It often provides very useful clues when dealing with memory issues.
Comment 124•13 years ago
|
||
AFAIK we can't reproduce the crash, so valgrind can't help here.
Comment 125•13 years ago
|
||
(In reply to comment #124) > AFAIK we can't reproduce the crash, so valgrind can't help here. Doesn't mean it won't flag bad memory access patterns that don't result in a crash.
Assignee | ||
Comment 126•13 years ago
|
||
(Warning: I have no previous experience with valgrind.) I've followed the guidelines at https://developer.mozilla.org/en/Debugging_Mozilla_with_Valgrind but can find nothing even remotely relevant in the output. As Michal says, the crash mentioned in comment #112 and comment #116 is not really reproducible. The crash mentioned in comment #122 is in leaktest/Linux. I've not yet been able to reproduce, but I'm currently trying to figure out how to run leaktest under valgrind (any pointers or input is appreciated) to see if it gives any clues.
Assignee | ||
Comment 127•13 years ago
|
||
An update: I've occasionally seen crashes on the tryserver with this patch very similar to comment #112 and comment #122. However, after some investigation I've found that these are not caused by deactivating an entry twice, but rather by the way nsCacheService shuts down the devices: The device is deleted, the dtor calls Shutdown(), which (with this patch) may fire async deactivations. When an async deactivation runs, the device w/ members is gone but the memory is accessed by the async deactivation event, and boom... I'll attach an updated patch which blocks the dtor of the disk-device until all cache-io events have run, and I have not seen any crashes with this modification. However, there is an issue in /tests/content/base/test/test_bug475156.html where 3-4 of the tests fails with a clearly cache-related message. The suspicion is that there is a small timeslot while deactivation is pending where the cache doesn't find the entry, and I'm currently trying to find it and close it.
Assignee | ||
Comment 128•13 years ago
|
||
Attachment #485727 -
Attachment is obsolete: true
Attachment #488785 -
Flags: review?(michal.novotny)
Attachment #485727 -
Flags: review?(michal.novotny)
Comment 129•13 years ago
|
||
There has been a similar crash as in comment 112 just now on the Firefox Tinderbox, see bug 614513.
Assignee | ||
Comment 130•13 years ago
|
||
Nice catch. :) See bug #bug 614513, comment #5 in particular the last paragraph.
Comment 131•13 years ago
|
||
Comment on attachment 488785 [details] [diff] [review] Carefully handling bindings v1.2 (blocking on cache-io thread in dtor) > +nsCacheService::BlockOnCacheIOThread() > +{ > + NS_ASSERTION(gService->mLockedThread == PR_GetCurrentThread(), "oops"); > + if (!gService->mCacheIOThread) return NS_ERROR_NOT_AVAILABLE; > + > + nsBlockOnCacheThreadEvent *event = new nsBlockOnCacheThreadEvent(); > + > + Unlock(); > + nsresult rv = > + gService->mCacheIOThread->Dispatch(event, NS_DISPATCH_SYNC); > + Lock(); Posting a synchronous event will spin the event loop until the event finishes. This can be IMO problematic. Maybe a better solution would be to have a counter of posted events and wait using monitor until it is 0. Also it makes sense to do a synchronous deactivation while doing shutdown. I.e. in Shutdown_Private() set variable mShutdownInProgress = PR_TRUE and check this variable in DeactivateEntry(), in case it is true do a synchronous deactivation. > +class nsDiskCacheDeviceDeactivateEntryEvent : public nsRunnable { > +public: > + nsDiskCacheDeviceDeactivateEntryEvent(nsDiskCacheDevice *device, > + nsCacheEntry * entry, > + nsDiskCacheBinding * binding) Events and other helper classes are usually at the beginning of the file. Please move it there. > - binding = nsnull; > + binding = nsnull; // why assign here?? This should be changed to an assertion NS_ASSERTION(!binding, "..."). It used to be a debug check (see http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/netwerk/cache/src/nsDiskCacheDevice.cpp&rev=1.59) but became a regular code which is incorrect and misleading. The logic is that if we find an active binding it must be a collision, otherwise there is some serious problem.
Attachment #488785 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 132•13 years ago
|
||
(In reply to comment #131) > Posting a synchronous event will spin the event loop until the event finishes. Yes - that's the idea. :) I want to block current thread until all pending events on the cache-io thread has finished. We are, after all, shutting down the cache at this point. > This can be IMO problematic. How? (Observe that the lock is released before the sync-event.) > Events and other helper classes are usually at the beginning of the file. > Please move it there. Fair enough. > This should be changed to an assertion NS_ASSERTION(!binding, "..."). It used > to be a debug check (see > http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/netwerk/cache/src/nsDiskCacheDevice.cpp&rev=1.59) > but became a regular code which is incorrect and misleading. The logic is that > if we find an active binding it must be a collision, otherwise there is some > serious problem. I can add an assertion in FindEntry() but the TODOs are really about why we assign binding to null (also present in 1.59) when it's not used before the next assignment (also in 1.59).
Assignee | ||
Comment 133•13 years ago
|
||
(In reply to comment #132) > but the TODOs are really about why we > assign binding to null (also present in 1.59) when it's not used before the > next assignment (also in 1.59). Ahh - sorry. I missed all the cleanup in dtor. :P
Comment 134•13 years ago
|
||
(In reply to comment #132) > (In reply to comment #131) > > Posting a synchronous event will spin the event loop until the event finishes. > > Yes - that's the idea. :) I want to block current thread until all pending > events on the cache-io thread has finished. We are, after all, shutting down > the cache at this point. But the problem is that the current thread won't be blocked. It will process any pending event on the main thread. Maybe that's OK, but some time ago I wanted to spin the event queue using NS_ProcessNextEvent while waiting for a result and I was told that I should avoid doing that since it isn't always safe.
Comment 135•13 years ago
|
||
(In reply to comment #134) > But the problem is that the current thread won't be blocked. It will process > any pending event on the main thread. Maybe that's OK, but some time ago I > wanted to spin the event queue using NS_ProcessNextEvent while waiting for a > result and I was told that I should avoid doing that since it isn't always > safe. It might be OK for this. When do we shut the cache down?
Comment 136•13 years ago
|
||
(In reply to comment #135) > It might be OK for this. When do we shut the cache down? The code is called on "xpcom-shutdown" and "profile-before-change", and also when somebody calls nsICacheSession::evictEntries() to evict entries in the disk cache.
Assignee | ||
Comment 137•13 years ago
|
||
(In reply to comment #134) > But the problem is that the current thread won't be blocked. It will process > any pending event on the main thread. Reading the nsThread impl I see what you mean... :( The calling thread won't block but rather execute pending events while stalling the return of the dispatch-sync. Effectively, upon calling dispatch-sync necko will be in some half-way-done state, subsequent events will run in this state, and nobody knows what will happen. Makes me wonder which situations the dispatch-sync actually can handle safely... (and consequently why it's available at all). I've re-implemented BlockOnCacheIoThread() to use conditional variables instead, attempting to properly block the calling thread (which is what I want to do). Are there any known pitfalls/issues with condvars? :) (Patch is running on tryserver now - I'll attach it asap.)
Comment 138•13 years ago
|
||
(In reply to comment #137) > I've re-implemented BlockOnCacheIoThread() to use conditional variables > instead, attempting to properly block the calling thread (which is what I want > to do). Are there any known pitfalls/issues with condvars? :) AFAICS the only problem could be if we would do the same on the cache IO thread, i.e. if we would need to do some task on the main thread to finish all events posted to the IO thread. But we don't do this so it should be safe.
Assignee | ||
Comment 139•13 years ago
|
||
Referring to comment #127 and the intermittent failure in test test_bug475156.html, I finally think I see what's going on... nsHttpChannel::OnStopRequest() is called when the response has been retrieved from the server. It calls CloseCacheEntry() which sets |mCacheEntry| to null. This will/might trigger the dtor of |mCacheEntry| and the dtor of nsCacheEntryDescriptor calls Close() which calls nsCacheService::CloseDescriptor(), which finally deactivates the entry if it's considered inactive. Now, when deactivating an entry we remove it from nsCacheService::mActiveEntries and make we sure to remove it from nsDiskCacheDevice::mBindery (ref last sentence in comment #114) to keep these in sync. Then we post an event which writes the entry to disk at some point later. Note that this creates a time-slot between removing it from mActiveEntries/mBindery and writing the updated entry to disk. If another request for this entry is processed in this time-slot it will find the *old* entry on disk. In the case of "test_bug475156.html" we find an entry on disk with the wrong eTag and thus validate it with the server, leading to intermittent failures depending on the size and timing of this time-slot. IMO we must delay removing an entry from mBindery until it has been written to disk. If the entry is requested in the time-slot, we should re-use the binding and not remove it from mBindery. I'll see what I can come up with in terms of code - feel free to challenge the reasoning above in the meantime.
We really want this, but it's no longer a blocker this late in the game.
Assignee | ||
Comment 141•12 years ago
|
||
Part of solution here is described separately in bug #620660 - adding dependency.
Assignee | ||
Comment 142•12 years ago
|
||
After a fair number of tryserver-runs I finally identified an issue which caused similar symptoms as seen in bug #614513 (which btw probably could be confirmed fixed by now). The issue is essentially that a cache-entry gets deactivated twice, but this particular code-path can only occur with the rest of this patch applied. Problem is avoided by canceling the pending event doing the first deactivation and after the fix, this patch passes a full tryserver-run (a few starred oranges) with one exception. The exception really looks like bug #529733, although the crash happens in strlen instead of memcpy. However, origin is the same: nsHttpChannel::ContinueProcessNormal() and InitCacheEntry(). Stack can be seen at http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1300415888.1300416140.711.gz#err1 and full tryserver-run can be seen at http://tbpl.mozilla.org/?tree=MozillaTry&rev=adfb1b91e827 I'll try to reproduce a couple of times before moving on with this. If there is any insight wrt whether the exception is a version of bug #529733, I'd be happy to listen...
Attachment #488785 -
Attachment is obsolete: true
Assignee | ||
Comment 143•12 years ago
|
||
(In reply to comment #142) > I'll try to reproduce a couple of times before moving on with this. The same test crashed once again (also on Win32) but with a different stack. Must investigate further...
Assignee | ||
Comment 144•12 years ago
|
||
This adds some minor cleanup compared to the previous patch. It has passed two full tryserver-runs with 3-4 starred oranges in each run.
Attachment #520177 -
Attachment is obsolete: true
Attachment #524455 -
Flags: review?(michal.novotny)
Comment 145•12 years ago
|
||
Comment on attachment 524455 [details] [diff] [review] Carefully handling bindings version 1.3 > + void NeverMind() { mNeverMind = PR_TRUE; } I think that names like CancelDeactivate() nad mCancelled would be more self-explaining. > + nsDiskCacheDeviceDeactivateEntryEvent *event = > + new nsDiskCacheDeviceDeactivateEntryEvent(this, entry, binding); > + if (!event) { > + CACHE_LOG_DEBUG((" failed allocating deactivate-event - " > + "falling back to synchronous operation.\n")); > + return DeactivateEntry_Private(entry, binding); > + } new doesn't fail. > - if (!binding) return NS_ERROR_UNEXPECTED; > + NS_ASSERTION(binding->mDeactivateEvent == nsnull, > + "OnDataSizeChange: entry in process of deactivation"); > + if (!binding || binding->mDeactivateEvent) > + return NS_ERROR_UNEXPECTED; Just a nit. There are several places where this check is used. Couldn't we create some inline function for it? Like: inline bool ValidBinding(nsDiskCacheBinding *binding) { if (!binding) { NS_ASSERTION(...); return false; } if (binding->mDeactivateEvent) { NS_ASSERTION(...); return false; } return true; } And then just use if (!ValidBinding(binding)) return NS_ERROR_UNEXPECTED; > - binding = nsnull; > + binding = nsnull; // why assign here? it's set below... See my comment #131 > + if (!mNeverMind) { > + nsCacheServiceAutoLock lock; There is a possible race condition if we want to reuse the binding on different thread.
Attachment #524455 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 146•12 years ago
|
||
(In reply to comment #145) > > + void NeverMind() { mNeverMind = PR_TRUE; } > > I think that names like CancelDeactivate() nad mCancelled would be more > self-explaining. Fair enough. Fixed. > > + nsDiskCacheDeviceDeactivateEntryEvent *event = > > + new nsDiskCacheDeviceDeactivateEntryEvent(this, entry, binding); > > + if (!event) { > new doesn't fail. Ok. > > - if (!binding) return NS_ERROR_UNEXPECTED; > > + NS_ASSERTION(binding->mDeactivateEvent == nsnull, > > + "OnDataSizeChange: entry in process of deactivation"); > > + if (!binding || binding->mDeactivateEvent) > > + return NS_ERROR_UNEXPECTED; > > Just a nit. There are several places where this check is used. Couldn't we > create some inline function for it? Ok > > - binding = nsnull; > > + binding = nsnull; // why assign here? it's set below... > > See my comment #131 Ok - fixed as agreed. > > + if (!mNeverMind) { > > + nsCacheServiceAutoLock lock; > > There is a possible race condition if we want to reuse the binding on different > thread. Good catch! Fixed.
Assignee | ||
Comment 147•12 years ago
|
||
Fixed as requested, pushed to tryserver.
Attachment #524455 -
Attachment is obsolete: true
Attachment #525240 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 148•12 years ago
|
||
One more place to use the util-function...
Attachment #525240 -
Attachment is obsolete: true
Attachment #525247 -
Flags: review?(michal.novotny)
Attachment #525240 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 149•12 years ago
|
||
... and here it is, unbitrotted as well...
Attachment #525247 -
Attachment is obsolete: true
Attachment #525263 -
Flags: review?(michal.novotny)
Attachment #525247 -
Flags: review?(michal.novotny)
Updated•12 years ago
|
Attachment #525263 -
Flags: review?(michal.novotny) → review+
Comment 150•12 years ago
|
||
Is this going to land tonight so it makes Firefox 5?
Assignee | ||
Comment 151•12 years ago
|
||
Tryserver is still running http://tbpl.mozilla.org/?tree=MozillaTry&rev=5e9434e84258 If someone wants to land it - feel free.
Assignee | ||
Comment 152•12 years ago
|
||
Lots of traffic on tryserver these days... according to the web interface this job is not yet complete. It seems to be ok but since we missed the window anyway I'll re-submit the job to get a coherent picture.
Assignee | ||
Comment 153•12 years ago
|
||
Pushed again to tryserver as http://tbpl.mozilla.org/?tree=MozillaTry&rev=a2588d175b07
Comment 155•12 years ago
|
||
http://hg.mozilla.org/mozilla-central/rev/1346f7ada9d5
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Keywords: checkin-needed
OS: Mac OS X → All
Hardware: x86 → All
Resolution: --- → FIXED
Target Milestone: --- → mozilla6
Comment 156•12 years ago
|
||
(In reply to comment #155) > http://hg.mozilla.org/mozilla-central/rev/1346f7ada9d5 Let me be the first to say "SQUEEEEEEEEE!"
You need to log in
before you can comment on or make changes to this bug.
Description
•