194568 - Creating two instances of the XULPrototype Cache

Reporter

Description

•

22 years ago

I'm able to produce a scenario where we end up creating two instances of the XULPrototype Cache such that the XBLService has one instance and XBLListeners (and other parts of the product) use another instance. The symptom of this problem is an infinite loop where we end up trying to reload the same set of xbl bindings over and over again. Here's how this situation can happen. 1) In nsChromeProtocolHandler::NewChannel, we create the an instance of the XUL prototype cache service: nsCOMPtr<nsIXULPrototypeCache> cache = do_GetService(kXULPrototypeCacheCID, &rv); IF we have not previously created a prototype cache, we do the following: 2) Go to the service manager which detects that we have not created an instance yet. 3) The xul prototype cache is registered by the layout module. So the Component manager loads layout and initializes the layout module 4) The act of initializing the layout module causes the static method: CSSFrameConstructor::InitGlobals to get called which creates an instance of the XBL Service: static nsresult InitGlobals() { return CallGetService("@mozilla.org/xbl;1", &gXBLService); } 5) Now the service manager has not created the xbl service yet, so we got to the component manager which creates our first instance. In the constructor for the XBL Service, we attempt to create a XUL Prototype Cache: nsXBLService::nsXBLService() { .... CallGetService("@mozilla.org/xul/xul-prototype-cache;1", &gXULCache); } 6) This causes us to go back into the service manager, we still have not created an instance of the xul prototype cache yet since we are still on the stack from our attempt to create one back in step (1). So we create an instance and give it back to the XBL Service. 7) Now the stack unwinds back down to the component manager's attempt to create the xul prototype cache for the chrome protocol handler. We've successfully loaded the layout module, so we now ask it to create the 2nd instance of the cache. We now have two instances of the cache running at the same time. In my particular case, nsXBLListener's end up using a different instance from the XBL service, and this causes us to never find any bindings in the cache.

Boris Zbarsky [:bzbarsky]

Comment 1

•

22 years ago

> So the Component manager loads layout and initializes the layout module So we seem to have a few options: 1) Move the XUL prototype cache into the chrome stuff so it does not require layout to be loaded 2) Find whoever is trying to open a chrome channel before layout has been inited and make them stop. Who's the caller of NewChannel? 3) Make some of those getService calls lazy.... And this has come up before; it would be good to have the service manager asserting if it's reentered for the same contractid/cid.

Brendan Eich [:brendan]

Comment 2

•

22 years ago

bz's questions are all good, but I have one too (I hope it's not dumb): Why doesn't nsComponentManagerImpl::GetService in the outer activation (step 1, the one looking by CID) re-try the lookup (around line 2006 of nsComponentManger.cpp) and find the instance created by contract-id by the inner activation (the one looking up by contract-id). /be

Doug Turner (:dougt)

Assignee

Comment 3

•

22 years ago

scott - can you provide a test case. I would be happy to debug this problem. Calling GetService(CID) followed by GetService(ContractID) is known to work. I just tested it on the 1.3 branch and it worked fine. Although these object's are suppose to be used as services, this is the simple test i used: static NS_DEFINE_CID(kPRBoolCID, NS_SUPPORTS_PRBOOL_CID); int main(int argc, char* argv[]) { NS_InitXPCOM2(NULL, nsnull, nsnull); nsCOMPtr<nsISupportsPRBool> a = do_GetService(kPRBoolCID); nsCOMPtr<nsISupportsPRBool> b = do_GetService(NS_SUPPORTS_PRBOOL_CONTRACTID); NS_ASSERTION(a == b, "ERROR!"); NS_ShutdownXPCOM(nsnull); return 0; } It is a known problem that CreateInstance isn't thread safe. This problem may be do to the factory implementation not properly implementing CreateInstance (or rather, the generic module code). See bug 5795. Looking at the "constructor" function of the xul prototype cache (the function that is called by the generic module code), there is nothing to protect two objects from being created: http://lxr.mozilla.org/seamonkey/source/content/xul/document/src/nsXULPrototypeCache.cpp#213 Scott, it would be interesting to just cache the value of the new and return it if this function is called again.

Boris Zbarsky [:bzbarsky]

Comment 4

•

22 years ago

> Calling GetService(CID) followed by GetService(ContractID) is known to work. In this case the second GetService call comes from _inside_ the first GetService call....

Alec Flett

Comment 5

•

22 years ago

dougt: this is a legitimate bug, one that I've encountered myself before. Its a hard problem to solve though. The call stack would look like this (theoretical.. I'm writing this off the top of my head) do_GetService("component1") nsComponentManager::CreateInstanceByCID() nsGenericFactory::CreateInstance() nsSomeServiceConstructor() do_GetService("component2") nsComponentManager::CreateInstanceByCID() nsGenericFactory::CreateInstance() nsSomeServiceConstructor() do_GetService("component1") what needs to happen is that nsComponentManager::CreateInstanceByCID needs to keep a stack of services (probably just a reference to the CID that got passed in each time) that it is in the process of creating, and when someone calls do_GetService the list needs to be walked.. if the service in question is found, then an assertion should be thrown and an error returned. The component manager can't cache the instance of the service in question because the initial object hasn't been returned by the factory constructor method.

Scott MacGregor

Reporter

Comment 6

•

22 years ago

Boris was curious about who was calling nsChromeProtocolHandler::NewChannel before layout has been initialized. In my particular case, I had two prefs set in my prefs file: user_pref("capability.principal.codebase.p1.granted", "UniversalXPConnect"); user_pref("capability.principal.codebase.p0.id", "http://kezwald.mcom.com"); On startup, we first ensure that we have a profile, which causes prefs to load. Our script security manager is a prefs observer, looking for changes to capability.principal.codebase.*. It gets notified when prefs load. The security manager attempts to create a new URI out of http://kezwald.mcom.com. The http protocol handler is created and during it's initialization, it attempts to load a navigator string bundle in order to determine our locale. Loading the chrome resource for the string bundle causes us to enter the chrome protocol handler. Then we enter the scenario I described in the initial bug description. Alec, your theoretical stack trace looked pretty much like what I had on my machine over the weekend when I was running into this.

Doug Turner (:dougt)

Assignee

Comment 7

•

22 years ago

over to me. alec, thank you for the clarification. I will look into it this week.

Assignee: hyatt → dougt

Severity: normal → critical

Component: XP Toolkit/Widgets: XUL → XPCOM

OS: Windows 2000 → All

Priority: -- → P1

Hardware: PC → All

Target Milestone: --- → mozilla1.4alpha

Brendan Eich [:brendan]

Comment 8

•

22 years ago

I agree we should have assertions, but they may not indicate how to fix the underlying problem. One thought: for nsSomeServiceConstructor() do_GetService("component2") we could require nsSomeServiceConstructor to re-lookup the service instance it is trying to create, to see whether it was created by the nested call into the component manager. Is there a way to do that? /be

Doug Turner (:dougt)

Assignee

Comment 9

•

22 years ago

we do provide a |isServiceInstantiated| method.

Doug Turner (:dougt)

Assignee

Comment 10

•

22 years ago

i think we have to protect both the gs(cid) path, as well as the gs(contractID) path.

Doug Turner (:dougt)

Assignee

Comment 11

•

22 years ago

Attached patch patch v.1 — Details — Splinter Review

(a) instead of calling FindFactory() which only returns a nsIFactory, I am instead calling what FindFactory does so that I can do two things - get access to the CID in the CI(contractID) case and assert if someone is calling CI when there exist a service for this same CID (a assertion nothing more). (b) I am blocking all reentrant CID creation. I think this is what we want. Can you look over the patch and try it out? Alec - should I be using a nsSmallVoidArray?

Alec Flett

Comment 12

•

22 years ago

I wonder if it would be faster to use the nsDoubleHashSet or whatever that jkeiser invented.. it might be faster given that we don't want to spend all our time looking up and down an array? But maybe for small sets its irrelevant? hmm... as for nsSmallVoidArray, probably not - if anything you probably want nsAutoVoidArray.. nsSmallVoidArray is for structures that might never have any values, so a zero-length array takes up only a word as long as you never add data to it.

Doug Turner (:dougt)

Assignee

Comment 13

•

22 years ago

the list is quite short.

Alec Flett

Updated

•

22 years ago

Attachment #116435 - Flags: superreview?(alecf)

Alec Flett

Comment 14

•

22 years ago

Comment on attachment 116435 [details] [diff] [review] patch v.1 I'm not sure I agree that we need an assertion when you do a CreateInstance() for something that is already a service - I think this occurs naturally in the code.. perhaps a warning so we don't bug the crap out of developers The other thing I was thinking was that perhaps the CID-checking should be debug-only...if not for performance (which I suppose we'll see on tinderbox if its an issue) - but also because we are CPU-intense as it is and this only makes us worse at a key point (creation of objects) in XPCOM - if the architecture of some component is flawed enough to cause this assertion, then hopefully we'd start catching it in a debug build. sr=alecf with the assertion changed to a warning, and optionally with the CID list thing made debug-only.

Attachment #116435 - Flags: superreview?(alecf) → superreview+

Doug Turner (:dougt)

Assignee

Comment 15

•

22 years ago

I can agree with making this all debug-only. However, the problem will still exist where you can create two instances of the XUL prototype cache by the steps mscott outlined. Someone is going to have to solve the problem in the xul prototype cache. It is also fair to note that some developers will not be using a debug version of XPCOM so detection of this class of bug will not occur. Also, we have to consider the case where all component configuration combinations are not be tested. For example, a user can download a dozen different components from mozdev.org and we have no real clue how they interact. Any cycle that is produce may lead to this class of problems. Returning an error from XPCOM allows these cases to unwind instead of a much worse alternative. I am going to check this in as debug only for now. Enabling it is something we can consider tomorrow.

Flags: blocking1.4a?

Doug Turner (:dougt)

Assignee

Comment 16

•

22 years ago

Checking in nsComponentManager.cpp; /cvsroot/mozilla/xpcom/components/nsComponentManager.cpp,v <-- nsComponentManager.cpp new revision: 1.226; previous revision: 1.225 done Checking in nsComponentManager.h; /cvsroot/mozilla/xpcom/components/nsComponentManager.h,v <-- nsComponentManager.h new revision: 1.85; previous revision: 1.84 done

Boris Zbarsky [:bzbarsky]

Comment 17

•

22 years ago

Um... there should never be a CreateInstance call for _services_ unless it's called from GetService. There should never be a GetService for an existing service calling CreateInstance unless we are recursing (as in this bug). Any instance of such a recursion is a proble, in my opinion and needs to be detected and either asserted on or (preferably) handled gracefully.

Doug Turner (:dougt)

Assignee

Comment 18

•

22 years ago

bz - that is what we are now doing in debug builds. the question now stands - do we do this in opt builds?

Boris Zbarsky [:bzbarsky]

Comment 19

•

22 years ago

Right. Sorry, I didn't make it clear I was replying to comment 14's suggestion to make it a warning instead of an assert.

Alec Flett

Comment 20

•

22 years ago

no, I totally disagree. what I'm saying is that it IS reasonable to have multiple instances of a class as well as have a service version of the class. There's no reason that XPCOM should prevent that.. For example, something that I'm working on right now for bug 176559 - we need a global nsINodeInfoManager to keep track of (and cache) the most commonly accessed nsINodeInfo objects. But we also need a per-document nsINodeInfoManager for nodes which are not in the "commonly accessed" list. The simplest and most straight forward way to maintain the global version is to just do a GetService() and let XPCOM handle the singleton aspect of that particular instance. From there, each document will create their own nsINodeInfoManagers and refer to both it and to the service version. While it is certainly incorrect to call CreateInstance() on a object that is SUPPOSED to be a service, I don't think you can make the blanket statement that all class that are instantiated somewhere via GetService() cannot be CreateInstanced()'d elsewhere.

Alec Flett

Comment 21

•

22 years ago

ugh, and just for this reason I asked that the NS_ASSERTION be a NS_WARNING as a part of the review. What happened there?

Doug Turner (:dougt)

Assignee

Comment 22

•

22 years ago

alec, you must compile with SHOW_CI_ON_EXISTING_SERVICE defined to have this extra ci check. Where should this bug go now? Who owns the xul prototype cache or who has a good argument why the reentrant CID creation check should be enabled?

Doug Turner (:dougt)

Assignee

Comment 23

•

22 years ago

brendan - do you have any comments?

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 24

•

22 years ago

This seems like it's just the problem that was discussed in bug 5795 and bug 128646, so it's good to see a fix coming. It looks like it's already landed in ifdefs, although I'm wondering why the SHOW_CI_ON_EXISTING_SERVICE ifdef doesn't cover more of the code. However, I wonder (along the lines of alecf's comments) why these checks aren't in the GetService functions instead of the CreateInstance functions. I don't think you'd want multiple calls to CreateInstance that happen to occur on different threads at the same time to fail.

Doug Turner (:dougt)

Assignee

Comment 25

•

22 years ago

dbaron - see comment 11 (b). Maybe it isn't what we want. :-)

Asa Dotzler [:asa]

Updated

•

22 years ago

Flags: blocking1.4a? → blocking1.4a-

Scott MacGregor

Reporter

Comment 26

•

22 years ago

I still run into this infinite loop. Just found another code path that causes it to happen. I'm always getting bit by it because the act of loading the layout module, causes nsCSSFrameConstructor to get created which calls nsCSSFrameConstructor::InitGlobals. InitGlobals calls get service on the xbl service which in turn creates a xul prototype cache in it's constructor. Any time a xul prototype cache object is instantiated through get service before layout has been loaded, causes this infinite loop. Just found a new case with minotaur where this happens if we don't have a minotaur profile. dbaron, is there a way we can break this cycle between the CSS frame constructor, the xbl service and the xul prototype cache?

Scott MacGregor

Reporter

Comment 27

•

22 years ago

Here's a less complicated stack trace from the one I showed before when I filed the bug. nsChromeRegistry::FlushCaches gets an instance of the xul prototype cache which kicks off the chain of events I described in my previous comment 25. nsChromeRegistry::FlushCaches() line 1420 + 30 bytes nsChromeRegistry::LoadProfileDataSource() line 3219 + 11 bytes nsChromeRegistry::Observe(nsChromeRegistry * const 0x00b6fee4, nsISupports * 0x00aefae0, const char * 0x0132f7e8, const unsigned short * 0x0132f668) line 3556 + 11 bytes nsObserverService::NotifyObservers(nsObserverService * const 0x00a92980, nsISupports * 0x00aefae0, const char * 0x0132f7e8, const unsigned short * 0x0132f668) line 212 nsProfile::SetCurrentProfile(nsProfile * const 0x00aefae0, const unsigned short * 0x00b2ba08) line 1260 nsProfile::LoadDefaultProfileDir(nsCString & {...}, int 0x00000001) line 548 + 28 bytes nsProfile::StartupWithArgs(nsProfile * const 0x00aefae0, nsICmdLineService * 0x00af0230, int 0x00000001) line 356 + 16 bytes nsAppShellService::DoProfileStartup(nsAppShellService * const 0x00b5c3c8, nsICmdLineService * 0x00af0230, int 0x00000001) line 273 + 31 bytes InitializeProfileService(nsICmdLineService * 0x00af0230) line 908 + 31 bytes main1(int 0x00000001, char * * 0x003677b8, nsISupports * 0x00ad7d68, const nsXREAppData & {...}) line 1189 + 14 bytes xre_main(int 0x00000001, char * * 0x003677b8, const nsXREAppData & {...}) line 1692 + 41 bytes

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 28

•

22 years ago

The easy fix is to make gXBLService lazily initialized. We'd have to fix this anyway once the XPCOM bug is fixed.

Doug Turner (:dougt)

Assignee

Comment 29

•

22 years ago

dbaron - see comment #20. This is all disabled. Are you suggestion that you would want to see these checks enabled by default?

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 30

•

22 years ago

Yes, definitely, for threadsafety. I still think the checks are wrong, though.

Doug Turner (:dougt)

Assignee

Comment 31

•

22 years ago

i have no problem enabling as long as performance isn't hit hard. alec, agree?

Alec Flett

Comment 32

•

22 years ago

ok, so I'm all for turning on the checks for when a reentrant service creation happens (i.e. you're creating service "Foo" which has the side effect of creating service "bar" which in turn requests service "Foo") - but ONLY for services. Not for normal createinstance stuff.

Scott MacGregor

Reporter

Comment 33

•

22 years ago

Doug, in a debug build, I still run into this problem. I just noticed that XPCOM_CHECK_PENDING_CIDS is defined for debug builds. Should I still be seeing this problem with two instances of a service getting created with this define enabled?

Doug Turner (:dougt)

Assignee

Comment 34

•

22 years ago

scott, yes. you should be seeing an assertion, but as dbaron mentioned, the check are in the wrong place.

Scott MacGregor

Reporter

Comment 35

•

22 years ago

Attached patch move add/remove pending cds into getservice not create instance (obsolete) — Details — Splinter Review

With this new updated patch Doug and I came up with, I now run into the xpcom warning. The add and remove pendind CIDs needed to be placed in GetService not in CreateInstance. Note: The define is still only turned on for debug builds.

Doug Turner (:dougt)

Assignee

Updated

•

22 years ago

Attachment #120003 - Flags: review?(dbaron)

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 36

•

22 years ago

Comment on attachment 120003 [details] [diff] [review] move add/remove pending cds into getservice not create instance Both of the AddPendingCID sections are mis-indented. With this patch, all AddPendingCID and RemovePendingCID calls are within the monitor, so I think you should document that the caller must be inside the monitor and remove the nsAutoMonitor inside AddPendingCID and RemovePendingCID. The NS_WARNING in AddPendingCID should also mention GetService rather than CI, and should also mention the possibility that the assertion is triggered by near-simultaneous calls to GetService on multiple threads. (I've seen that happen.) I also think you should get rid of the ifdefs.

Doug Turner (:dougt)

Assignee

Comment 37

•

22 years ago

Attached patch with dbarons comments. — Details — Splinter Review

I added your suggestions, but stilll haven't removed the #ifdef's.

Attachment #120003 - Attachment is obsolete: true

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Updated

•

22 years ago

Attachment #120012 - Flags: review+

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Updated

•

22 years ago

Attachment #120003 - Flags: review?(dbaron) → review-

Scott MacGregor

Reporter

Comment 38

•

22 years ago

I filed Bug #201891 to break the XULPrototypeCache scenario which causes us to try to create two instances of the xul prototype cache. If we fix that bug then we won't see the warning message being introduced by Doug's patch here. And Minotaur won't have these infinite loop issues =).

Doug Turner (:dougt)

Assignee

Comment 39

•

22 years ago

i check the fix mscott and I worked on: Checking in nsComponentManager.cpp; /cvsroot/mozilla/xpcom/components/nsComponentManager.cpp,v <-- nsComponentManager.cpp new revision: 1.231; previous revision: 1.230 done marking as fixed.

Status: NEW → RESOLVED

Closed: 22 years ago

Resolution: --- → FIXED

Darin Fisher

Comment 40

•

21 years ago

so, what does it mean if i hit this warning: NS_WARNING("Creation in progress (Reentrant GS - see bug 194568)"); i am basically calling GetServiceByContractID for the same service on multiple threads. does it mean that this bug isn't quite fixed?

Darin Fisher

Comment 41

•

21 years ago

It seems that it is not possible to call GetService for the same service from two threads at the same time. The problem goes away if I create it once from the main thread before spawning the worker threads. Afterwards it is ok if the worker threads call GetService to acquire a reference to the service. Perhaps we should enforce this somehow?

Brendan Eich [:brendan]

Comment 42

•

21 years ago

Darin, can you file a bug? It sounds like we simply are not distinguishing reentry on the same thread from races among threads. If AddPending associated a thread-id with the pending CID .... /be

Allan Beaufour

Updated

•

18 years ago

Blocks: 362564

patch v.1 22 years ago Doug Turner (:dougt) 6.46 KB, patch	alecf : superreview+	Details \| Diff \| Splinter Review
move add/remove pending cds into getservice not create instance 22 years ago Scott MacGregor 2.45 KB, patch	dbaron : review-	Details \| Diff \| Splinter Review
with dbarons comments. 22 years ago Doug Turner (:dougt) 4.39 KB, patch	dbaron : review+	Details \| Diff \| Splinter Review