Closed Bug 194568 Opened 22 years ago Closed 22 years ago

Creating two instances of the XULPrototype Cache

Categories

(Core :: XPCOM, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla1.4alpha

People

(Reporter: mscott, Assigned: dougt)

References

Details

Attachments

(2 files, 1 obsolete file)

I'm able to produce a scenario where we end up creating two instances of the XULPrototype Cache such that the XBLService has one instance and XBLListeners (and other parts of the product) use another instance. The symptom of this problem is an infinite loop where we end up trying to reload the same set of xbl bindings over and over again. Here's how this situation can happen. 1) In nsChromeProtocolHandler::NewChannel, we create the an instance of the XUL prototype cache service: nsCOMPtr<nsIXULPrototypeCache> cache = do_GetService(kXULPrototypeCacheCID, &rv); IF we have not previously created a prototype cache, we do the following: 2) Go to the service manager which detects that we have not created an instance yet. 3) The xul prototype cache is registered by the layout module. So the Component manager loads layout and initializes the layout module 4) The act of initializing the layout module causes the static method: CSSFrameConstructor::InitGlobals to get called which creates an instance of the XBL Service: static nsresult InitGlobals() { return CallGetService("@mozilla.org/xbl;1", &gXBLService); } 5) Now the service manager has not created the xbl service yet, so we got to the component manager which creates our first instance. In the constructor for the XBL Service, we attempt to create a XUL Prototype Cache: nsXBLService::nsXBLService() { .... CallGetService("@mozilla.org/xul/xul-prototype-cache;1", &gXULCache); } 6) This causes us to go back into the service manager, we still have not created an instance of the xul prototype cache yet since we are still on the stack from our attempt to create one back in step (1). So we create an instance and give it back to the XBL Service. 7) Now the stack unwinds back down to the component manager's attempt to create the xul prototype cache for the chrome protocol handler. We've successfully loaded the layout module, so we now ask it to create the 2nd instance of the cache. We now have two instances of the cache running at the same time. In my particular case, nsXBLListener's end up using a different instance from the XBL service, and this causes us to never find any bindings in the cache.
> So the Component manager loads layout and initializes the layout module So we seem to have a few options: 1) Move the XUL prototype cache into the chrome stuff so it does not require layout to be loaded 2) Find whoever is trying to open a chrome channel before layout has been inited and make them stop. Who's the caller of NewChannel? 3) Make some of those getService calls lazy.... And this has come up before; it would be good to have the service manager asserting if it's reentered for the same contractid/cid.
bz's questions are all good, but I have one too (I hope it's not dumb): Why doesn't nsComponentManagerImpl::GetService in the outer activation (step 1, the one looking by CID) re-try the lookup (around line 2006 of nsComponentManger.cpp) and find the instance created by contract-id by the inner activation (the one looking up by contract-id). /be
scott - can you provide a test case. I would be happy to debug this problem. Calling GetService(CID) followed by GetService(ContractID) is known to work. I just tested it on the 1.3 branch and it worked fine. Although these object's are suppose to be used as services, this is the simple test i used: static NS_DEFINE_CID(kPRBoolCID, NS_SUPPORTS_PRBOOL_CID); int main(int argc, char* argv[]) { NS_InitXPCOM2(NULL, nsnull, nsnull); nsCOMPtr<nsISupportsPRBool> a = do_GetService(kPRBoolCID); nsCOMPtr<nsISupportsPRBool> b = do_GetService(NS_SUPPORTS_PRBOOL_CONTRACTID); NS_ASSERTION(a == b, "ERROR!"); NS_ShutdownXPCOM(nsnull); return 0; } It is a known problem that CreateInstance isn't thread safe. This problem may be do to the factory implementation not properly implementing CreateInstance (or rather, the generic module code). See bug 5795. Looking at the "constructor" function of the xul prototype cache (the function that is called by the generic module code), there is nothing to protect two objects from being created: http://lxr.mozilla.org/seamonkey/source/content/xul/document/src/nsXULPrototypeCache.cpp#213 Scott, it would be interesting to just cache the value of the new and return it if this function is called again.
> Calling GetService(CID) followed by GetService(ContractID) is known to work. In this case the second GetService call comes from _inside_ the first GetService call....
dougt: this is a legitimate bug, one that I've encountered myself before. Its a hard problem to solve though. The call stack would look like this (theoretical.. I'm writing this off the top of my head) do_GetService("component1") nsComponentManager::CreateInstanceByCID() nsGenericFactory::CreateInstance() nsSomeServiceConstructor() do_GetService("component2") nsComponentManager::CreateInstanceByCID() nsGenericFactory::CreateInstance() nsSomeServiceConstructor() do_GetService("component1") what needs to happen is that nsComponentManager::CreateInstanceByCID needs to keep a stack of services (probably just a reference to the CID that got passed in each time) that it is in the process of creating, and when someone calls do_GetService the list needs to be walked.. if the service in question is found, then an assertion should be thrown and an error returned. The component manager can't cache the instance of the service in question because the initial object hasn't been returned by the factory constructor method.
Boris was curious about who was calling nsChromeProtocolHandler::NewChannel before layout has been initialized. In my particular case, I had two prefs set in my prefs file: user_pref("capability.principal.codebase.p1.granted", "UniversalXPConnect"); user_pref("capability.principal.codebase.p0.id", "http://kezwald.mcom.com"); On startup, we first ensure that we have a profile, which causes prefs to load. Our script security manager is a prefs observer, looking for changes to capability.principal.codebase.*. It gets notified when prefs load. The security manager attempts to create a new URI out of http://kezwald.mcom.com. The http protocol handler is created and during it's initialization, it attempts to load a navigator string bundle in order to determine our locale. Loading the chrome resource for the string bundle causes us to enter the chrome protocol handler. Then we enter the scenario I described in the initial bug description. Alec, your theoretical stack trace looked pretty much like what I had on my machine over the weekend when I was running into this.
over to me. alec, thank you for the clarification. I will look into it this week.
Assignee: hyatt → dougt
Severity: normal → critical
Component: XP Toolkit/Widgets: XUL → XPCOM
OS: Windows 2000 → All
Priority: -- → P1
Hardware: PC → All
Target Milestone: --- → mozilla1.4alpha
I agree we should have assertions, but they may not indicate how to fix the underlying problem. One thought: for nsSomeServiceConstructor() do_GetService("component2") we could require nsSomeServiceConstructor to re-lookup the service instance it is trying to create, to see whether it was created by the nested call into the component manager. Is there a way to do that? /be
we do provide a |isServiceInstantiated| method.
i think we have to protect both the gs(cid) path, as well as the gs(contractID) path.
Attached patch patch v.1Splinter Review
(a) instead of calling FindFactory() which only returns a nsIFactory, I am instead calling what FindFactory does so that I can do two things - get access to the CID in the CI(contractID) case and assert if someone is calling CI when there exist a service for this same CID (a assertion nothing more). (b) I am blocking all reentrant CID creation. I think this is what we want. Can you look over the patch and try it out? Alec - should I be using a nsSmallVoidArray?
I wonder if it would be faster to use the nsDoubleHashSet or whatever that jkeiser invented.. it might be faster given that we don't want to spend all our time looking up and down an array? But maybe for small sets its irrelevant? hmm... as for nsSmallVoidArray, probably not - if anything you probably want nsAutoVoidArray.. nsSmallVoidArray is for structures that might never have any values, so a zero-length array takes up only a word as long as you never add data to it.
the list is quite short.
Attachment #116435 - Flags: superreview?(alecf)
Comment on attachment 116435 [details] [diff] [review] patch v.1 I'm not sure I agree that we need an assertion when you do a CreateInstance() for something that is already a service - I think this occurs naturally in the code.. perhaps a warning so we don't bug the crap out of developers The other thing I was thinking was that perhaps the CID-checking should be debug-only...if not for performance (which I suppose we'll see on tinderbox if its an issue) - but also because we are CPU-intense as it is and this only makes us worse at a key point (creation of objects) in XPCOM - if the architecture of some component is flawed enough to cause this assertion, then hopefully we'd start catching it in a debug build. sr=alecf with the assertion changed to a warning, and optionally with the CID list thing made debug-only.
Attachment #116435 - Flags: superreview?(alecf) → superreview+
I can agree with making this all debug-only. However, the problem will still exist where you can create two instances of the XUL prototype cache by the steps mscott outlined. Someone is going to have to solve the problem in the xul prototype cache. It is also fair to note that some developers will not be using a debug version of XPCOM so detection of this class of bug will not occur. Also, we have to consider the case where all component configuration combinations are not be tested. For example, a user can download a dozen different components from mozdev.org and we have no real clue how they interact. Any cycle that is produce may lead to this class of problems. Returning an error from XPCOM allows these cases to unwind instead of a much worse alternative. I am going to check this in as debug only for now. Enabling it is something we can consider tomorrow.
Flags: blocking1.4a?
Checking in nsComponentManager.cpp; /cvsroot/mozilla/xpcom/components/nsComponentManager.cpp,v <-- nsComponentManager.cpp new revision: 1.226; previous revision: 1.225 done Checking in nsComponentManager.h; /cvsroot/mozilla/xpcom/components/nsComponentManager.h,v <-- nsComponentManager.h new revision: 1.85; previous revision: 1.84 done
Um... there should never be a CreateInstance call for _services_ unless it's called from GetService. There should never be a GetService for an existing service calling CreateInstance unless we are recursing (as in this bug). Any instance of such a recursion is a proble, in my opinion and needs to be detected and either asserted on or (preferably) handled gracefully.
bz - that is what we are now doing in debug builds. the question now stands - do we do this in opt builds?
Right. Sorry, I didn't make it clear I was replying to comment 14's suggestion to make it a warning instead of an assert.
no, I totally disagree. what I'm saying is that it IS reasonable to have multiple instances of a class as well as have a service version of the class. There's no reason that XPCOM should prevent that.. For example, something that I'm working on right now for bug 176559 - we need a global nsINodeInfoManager to keep track of (and cache) the most commonly accessed nsINodeInfo objects. But we also need a per-document nsINodeInfoManager for nodes which are not in the "commonly accessed" list. The simplest and most straight forward way to maintain the global version is to just do a GetService() and let XPCOM handle the singleton aspect of that particular instance. From there, each document will create their own nsINodeInfoManagers and refer to both it and to the service version. While it is certainly incorrect to call CreateInstance() on a object that is SUPPOSED to be a service, I don't think you can make the blanket statement that all class that are instantiated somewhere via GetService() cannot be CreateInstanced()'d elsewhere.
ugh, and just for this reason I asked that the NS_ASSERTION be a NS_WARNING as a part of the review. What happened there?
alec, you must compile with SHOW_CI_ON_EXISTING_SERVICE defined to have this extra ci check. Where should this bug go now? Who owns the xul prototype cache or who has a good argument why the reentrant CID creation check should be enabled?
brendan - do you have any comments?
This seems like it's just the problem that was discussed in bug 5795 and bug 128646, so it's good to see a fix coming. It looks like it's already landed in ifdefs, although I'm wondering why the SHOW_CI_ON_EXISTING_SERVICE ifdef doesn't cover more of the code. However, I wonder (along the lines of alecf's comments) why these checks aren't in the GetService functions instead of the CreateInstance functions. I don't think you'd want multiple calls to CreateInstance that happen to occur on different threads at the same time to fail.
dbaron - see comment 11 (b). Maybe it isn't what we want. :-)
Flags: blocking1.4a? → blocking1.4a-
I still run into this infinite loop. Just found another code path that causes it to happen. I'm always getting bit by it because the act of loading the layout module, causes nsCSSFrameConstructor to get created which calls nsCSSFrameConstructor::InitGlobals. InitGlobals calls get service on the xbl service which in turn creates a xul prototype cache in it's constructor. Any time a xul prototype cache object is instantiated through get service before layout has been loaded, causes this infinite loop. Just found a new case with minotaur where this happens if we don't have a minotaur profile. dbaron, is there a way we can break this cycle between the CSS frame constructor, the xbl service and the xul prototype cache?
Here's a less complicated stack trace from the one I showed before when I filed the bug. nsChromeRegistry::FlushCaches gets an instance of the xul prototype cache which kicks off the chain of events I described in my previous comment 25. nsChromeRegistry::FlushCaches() line 1420 + 30 bytes nsChromeRegistry::LoadProfileDataSource() line 3219 + 11 bytes nsChromeRegistry::Observe(nsChromeRegistry * const 0x00b6fee4, nsISupports * 0x00aefae0, const char * 0x0132f7e8, const unsigned short * 0x0132f668) line 3556 + 11 bytes nsObserverService::NotifyObservers(nsObserverService * const 0x00a92980, nsISupports * 0x00aefae0, const char * 0x0132f7e8, const unsigned short * 0x0132f668) line 212 nsProfile::SetCurrentProfile(nsProfile * const 0x00aefae0, const unsigned short * 0x00b2ba08) line 1260 nsProfile::LoadDefaultProfileDir(nsCString & {...}, int 0x00000001) line 548 + 28 bytes nsProfile::StartupWithArgs(nsProfile * const 0x00aefae0, nsICmdLineService * 0x00af0230, int 0x00000001) line 356 + 16 bytes nsAppShellService::DoProfileStartup(nsAppShellService * const 0x00b5c3c8, nsICmdLineService * 0x00af0230, int 0x00000001) line 273 + 31 bytes InitializeProfileService(nsICmdLineService * 0x00af0230) line 908 + 31 bytes main1(int 0x00000001, char * * 0x003677b8, nsISupports * 0x00ad7d68, const nsXREAppData & {...}) line 1189 + 14 bytes xre_main(int 0x00000001, char * * 0x003677b8, const nsXREAppData & {...}) line 1692 + 41 bytes
The easy fix is to make gXBLService lazily initialized. We'd have to fix this anyway once the XPCOM bug is fixed.
dbaron - see comment #20. This is all disabled. Are you suggestion that you would want to see these checks enabled by default?
Yes, definitely, for threadsafety. I still think the checks are wrong, though.
i have no problem enabling as long as performance isn't hit hard. alec, agree?
ok, so I'm all for turning on the checks for when a reentrant service creation happens (i.e. you're creating service "Foo" which has the side effect of creating service "bar" which in turn requests service "Foo") - but ONLY for services. Not for normal createinstance stuff.
Doug, in a debug build, I still run into this problem. I just noticed that XPCOM_CHECK_PENDING_CIDS is defined for debug builds. Should I still be seeing this problem with two instances of a service getting created with this define enabled?
scott, yes. you should be seeing an assertion, but as dbaron mentioned, the check are in the wrong place.
With this new updated patch Doug and I came up with, I now run into the xpcom warning. The add and remove pendind CIDs needed to be placed in GetService not in CreateInstance. Note: The define is still only turned on for debug builds.
Attachment #120003 - Flags: review?(dbaron)
Comment on attachment 120003 [details] [diff] [review] move add/remove pending cds into getservice not create instance Both of the AddPendingCID sections are mis-indented. With this patch, all AddPendingCID and RemovePendingCID calls are within the monitor, so I think you should document that the caller must be inside the monitor and remove the nsAutoMonitor inside AddPendingCID and RemovePendingCID. The NS_WARNING in AddPendingCID should also mention GetService rather than CI, and should also mention the possibility that the assertion is triggered by near-simultaneous calls to GetService on multiple threads. (I've seen that happen.) I also think you should get rid of the ifdefs.
I added your suggestions, but stilll haven't removed the #ifdef's.
Attachment #120003 - Attachment is obsolete: true
I filed Bug #201891 to break the XULPrototypeCache scenario which causes us to try to create two instances of the xul prototype cache. If we fix that bug then we won't see the warning message being introduced by Doug's patch here. And Minotaur won't have these infinite loop issues =).
i check the fix mscott and I worked on: Checking in nsComponentManager.cpp; /cvsroot/mozilla/xpcom/components/nsComponentManager.cpp,v <-- nsComponentManager.cpp new revision: 1.231; previous revision: 1.230 done marking as fixed.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
so, what does it mean if i hit this warning: NS_WARNING("Creation in progress (Reentrant GS - see bug 194568)"); i am basically calling GetServiceByContractID for the same service on multiple threads. does it mean that this bug isn't quite fixed?
It seems that it is not possible to call GetService for the same service from two threads at the same time. The problem goes away if I create it once from the main thread before spawning the worker threads. Afterwards it is ok if the worker threads call GetService to acquire a reference to the service. Perhaps we should enforce this somehow?
Darin, can you file a bug? It sounds like we simply are not distinguishing reentry on the same thread from races among threads. If AddPending associated a thread-id with the pending CID .... /be
Blocks: 362564
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: