Since my patch for bug 509130 landed (on the trunk, and on the 1.9.2 branch in time for FF 3.6 beta3), http://crash-stats.mozilla.com/ has been reporting crashes at WebKit_WebInitForCarbon() in InitCarbonWebKit() (in nsAppShell.mm). This is in code added by my patch. These crashes are currently at #24 in the OS X topcrash list for builds that contain my patch. All of them happen on OS X 10.4.11. The crashes aren't 100% reproducible on OS X 10.4.11: I can't reproduce them, and if they were anywhere close to 100% reproducible their numbers would be much higher. That they only happen on OS X 10.4.11 suggests they're caused by bugs in the (older) version of the WebKit framework available on 10.4.11. We can't just back out my patch for bug 509130 -- it fixed a much higher-volume crash (at IdleTimerVector), which is still the #1 or #2 OS X topcrash for builds that don't contain the patch. This might leave us in a quandary. But since I wrote my patch for bug 509130 I've discovered a way to hook/subclass dynamically bound C/C++ functions. This has allowed me to write a much better (and more narrowly targeted) patch to work around the WebKit bug that triggers the IdleTimerVector crashes. I'll present it in my next comment.
Created attachment 416180 [details] [diff] [review] Fix As I explained in the comments to my original patch for bug 509130 (https://bugzilla.mozilla.org/attachment.cgi?id=409942), that patch got around the WebKit bug by always loading the WebKit framework and calling WebInitForCarbon() as the browser starts up. This is the only way to control the number of autorelease pools on the stack when WebInitForCarbon() is first called -- which (at the time) I thought was the only way to disable the WebKit behavior that triggers the crashes. Since then I've discovered a way to hook C/C++ functions, which allows us to write a much better (and more narrowly targeted) patch -- one that only does anything if the plugin loads the WebKit framework and calls WebInitForCarbon(). No longer always calling WebInitForCarbon() ourselves should greatly reduce the frequency of this bug's crashes, or even make them go away completely. The remaining crashes (if any) will (presumably) be something a plugin vendor has to deal with, and are likely also to happen in Safari. My new patch hooks WebInitForCarbon() and HIToolbox's InstallEventLoopIdleTimer() (the latter is what WebInitForCarbon() calls to install an idle timer, from which WebKit's buggy PoolCleaner() method is called). It makes InstallEventLoopIdleTimer() return without doing anything when called from WebInitForCarbon(), which stops the idle timer from being installed, which prevents calls to PoolCleaner(). The technique I use to hook C/C++ functions is brand new. Though (as I say in my patch's comments) the individual steps are mostly documented by Apple, I don't think anyone else has documented (or is using) the whole thing. It's worked well in all my tests, and I'm very confident in it. But we should let this patch bake for a while on the trunk before landing it on any of the branches. This means it won't make it into FF 3.6. But I hope that (if no unfixable problems are discovered) we can get it into FF 3.6.1. A tryserver build will follow in a few hours.
Side note: The reason I've wrapped my patch in an #ifndef __LP64__ macro is that it won't work in 64-bit builds in its current state: ScanImportedFunctions() currently only works on 32-bit Intel and PPC builds. But it shouldn't be hard to rewrite it also to work in 64-bit builds. I've just postponed doing the work (and the testing) until I'm more comfortable playing with our 64-bit builds.
Since the DivXBrowserPlugin is currently the only plugin that uses WebKit in Carbon mode, you'll need to use it to test my patch. The current version (7.1 or 7.2) is available at: http://www.divx.com/en/downloads/divx/mac Here's a URL to test with (once the tryserver build is available). It loads a DivX video. http://www.divx.com/en/downloads/post-install/mac This URL is much better to test with than the one Marcia and I used at bug 509130 (see bug 509130 comment #15). You can find other URLs by doing a Google search for "No video? Get the DivX Web Player" (this was suggested to me in email by Paul Ellis, the DivX Web Player's Product Manager). Since neither this bug's nor bug 509130's crashes are reproducible, you won't be able to test them. One thing you *can* test is to run the tryserver build in gdb and confirm that IdleTimerVector() never gets called, even after a DivX video is loaded. I used a custom mozconfig to stop the tryserver build's symbols from being stripped.
Here's a tryserver build made with my patch from comment #1: https://email@example.com/1259966609-macosx.dmg
We should decide if we should fix this for 1.9.2 soon, especially since 1.9.3 won't support 10.4.
+ // If any of our "images" gets unloaded, an image's 'index' will no longer + // correspond to the order in which ScanImportedFunctions() is called. But + // it appears that images never do get unloaded on OS X (bug 509130 comment + // #24 shows that plugins never get unloaded). And in any case we only use + // imageName for logging, so it's not a disaster if it's wrong. + static uint32_t count = 1; + const char *imageName = _dyld_get_image_name(count - 1); + ++count; This code is unnecessary and is left over from debugging versions of my patch. I'll remove it before I land the patch.
How many crashes is this responsible for? Speculatively blocking since crashes R bad, but we might change that if it turns out to not be worth holding the release. Josh, can you expedite review and/or comment on blocking merit?
> How many crashes is this responsible for? 51 in the last week. The number will (of course) increase when FF 3.6 is released (if the bug isn't fixed before then). But the patch that triggers the crashes has been in betas since beta3, and the crashes only happen on 10.4. So (if left unfixed) this bug will be responsible for a significant number of crashes, though not a huge one.
Created attachment 419964 [details] [diff] [review] Fix rev1 Here's a revision of my patch that drops the code mentioned in comment #6. And here's a new tryserver build: https://firstname.lastname@example.org/bugzilla533001-rev1-macosx.dmg
Created attachment 424163 [details] [diff] [review] Fix rev2 (fix crashes in dyld) Here's a new version of my patch that fixes a problem reported at bug 536684 comment #7 and following (see bug 536684 comment #13 for STR) -- wierd crashes in dyld. Turns out I wasn't setting permissions correctly on some modules' __IMPORT segments (these contain the jump table, some of whose entries my patch overwrites to hook dynamically bound functions). My previous patches always changed the permissions of these segments (to VM_PROT_READ | VM_PROT_WRITE) before attempting to change them, then afterwards always made them read-only. But some __IMPORT segments are writeable to begin with (so there's no need to change their permissions), and making them read-only causes access errors when dyld tries to bind a new symbol in that module: If dyld sees that an __IMPORT segment is *supposed* to be read-only, it will temporarily make it writeable before changing one of its entries. But it doesn't do this if the __IMPORT segment is supposed to be writeable. A tryserver build will follow in a few hours.
It's a bit more than a few hours later ... but here's a tryserver build made with my patch from comment #10: https://email@example.com/bugzilla533001-rev2-macosx.dmg
Steven, can you begin the process of getting this patch reviewed and landed on m-c and then m1.9.2 as soon as it reopens? We're very close to being able to spit out Camino nightlies on 1.9.2, and the presence of the crash/this crash/bug 546902 crash/whatever that is caused by the initial fix in bug 509130 makes stock-1.9.2 builds pretty hard to dogfood. In addition, I'm sure people using 10.4 and people using PKCS#11 modules would appreciate seeing this fixed in Fx 3.6.3, too.
Since trunk doesn't support 10.4 we can kill all of this code off there. Can you put up a patch for that? This patch is 1.9.2-only now, right? And since 1.9.2 will never be 64-bit, we don't need any of the 64-bit accommodations in the patch, right?
> Since trunk doesn't support 10.4 we can kill all of this code off > there. Can you put up a patch for that? What do you mean by "all of this code"? Note that my patch also fixes two other problems (bug 545653 and bug 546902) that aren't 10.4-specific (they also happen on Leopard). > This patch is 1.9.2-only now, right? No, it's not. It's also needed on trunk for at least bug 545653.
Why don't we just blacklist DivXBrowserPlugin? Is it really important to support it?
> Why don't we just blacklist DivXBrowserPlugin? Is it really important to > support it? I get the impression it has lots of users. I don't really know how to quantify that, though. In any case it's really bad form to blacklist something just because it causes us inconvenience. These crashes (and the IdleTimerVector crashes) aren't really the fault of the DivX plugin, and we do have a fix for them.
Yeah, but let's be clear --- the fix is an evil, evil hack. Sometimes such things are necessary; the question is whether this one is really necessary.
> the fix is an evil, evil hack It's not. It's just something rather rare in our profession -- a new idea (though it's composed of "old" (i.e. reasonably well documented) parts.) Nonetheless, I'm not sure Mozilla has anyone qualified to review my patch (someone with expertise in the areas it covers). And (sighs of relief all around) I've since discovered another, much simpler technique that might also be able to fix the problem -- "dyld interposing" (http://books.google.com/books?id=K8vUkpOXhN4C&pg=PA73&lpg=PA73&dq=__interpose&source=bl&ots=OJnnXZYpZC&sig=o7I3lXvoduUi13SrPfOON7o3do4&hl=en&ei=AoehS9brCYGQNrvsmeUM&sa=X&oi=book_result&ct=result&resnum=6&ved=0CBsQ6AEwBQ#v=onepage&q=__interpose&f=false). It'll take a few days for me to try out "dyld interposing". Do you want me to do it now, or wait for later?
Comment on attachment 424163 [details] [diff] [review] Fix rev2 (fix crashes in dyld) Have you looked at how Chromium handles hooking functions like this, if it does, just in case there is a simpler way that we don't know about? I'm pretty sure they also hook Carbon functions like this in order to simulate parent process behavior for Carbon plugins. If they do, they aren't using '_dyld_register_func_for_add_image'. A comment of yours notes that '_dyld_register_func_for_add_image' will trigger the callback for all of the images that have already been loaded, including the main image. Since presumably we're never going to want to hook things in our main image (firefox-bin, for example) here, can we detect when the callback happens for the main image and bail to save work? Why store oldAddressPtr in _nsHookedFunctionSpec? Doesn't look like it is ever used. If/when this lands it needs to bake on trunk for at least a week and we need to watch crash reports, Tp and Ts closely.
Has anyone contacted DivX and asked them to stop doing this?
(In reply to comment #18) > > the fix is an evil, evil hack > > It's not. It's just something rather rare in our profession -- a new > idea (though it's composed of "old" (i.e. reasonably well documented) > parts.) We've had this discussion before. Any solution that involves hooking/overriding particular functions in libraries we don't control is an evil hack and we need to keep such things to an absolute minimum.
If we do decide we need to keep supporting DivXBrowserPlugin in 1.9.2 on 10.4, then we should gratefully take this hack, evil and all :-). But blacklisting DivXBrowserPlugin seems safer and simpler to me. How many users can there be using DivXBrowserPlugin with Firefox 3.6 on 10.4 (which Apple doesn't even support anymore)?
> How many users can there be using DivXBrowserPlugin with Firefox 3.6 > on 10.4 (which Apple doesn't even support anymore)? As I say in comment #14, this is no longer just a problem on 10.4.
> Have you looked at how Chromium handles hooking functions like this The latest Chromium code does use dyld interposing. > can we detect when the callback happens for the main image and bail > to save work? We probably can. But Firefox loads hundreds of modules (and _dyld_register_func_for_add_image() gets called hundreds of times). So it's probably not worth the trouble. > Why store oldAddressPtr in _nsHookedFunctionSpec? Doesn't look like > it is ever used. Not so. It's by means of the void** that the HIToolbox_InstallEventLoopIdleTimer and WebKit_WebInitForCarbon function pointers get set. > If/when this lands it needs to bake on trunk for at least a week and > we need to watch crash reports, Tp and Ts closely. Absolutely! I think it should bake for longer (a month?), and should start out on the trunk. As with any technique/idea that's this new and different :-)
Let's see if we can get the DivX developers to fix their stuff. If we can, then we can blacklist the old versions of the plugin.
Josh, would you feel more comfortable with dyld interposing? Is it worth my while spending a couple of days looking into that?
> Let's see if we can get the DivX developers to fix their stuff. If > we can, then we can blacklist the old versions of the plugin. It's WebKit that needs to be fixed ... and I doubt very much that's going to happen.
>> can we detect when the callback happens for the main image and bail >> to save work? > > We probably can. But Firefox loads hundreds of modules (and > _dyld_register_func_for_add_image() gets called hundreds of times). > So it's probably not worth the trouble. It's the callback registered by _dyld_register_func_for_add_image() that gets called hundreds of times.
Maybe they don't actually need to link in Webkit when they're running in Gecko.
(In reply to comment #29) > Maybe they don't actually need to link in Webkit when they're running in Gecko. Right. We should talk to the DivX people about potential solutions before we invest any more time trying to solve this on our end. I have sent them email.
> Maybe they don't actually need to link in Webkit when they're > running in Gecko. Possibly. But only DivX can answer that question. And judging by what transpired at bug 509130, getting them to say "yes" and waiting for the new version would take a long time ... possibly forever. I don't have the energy or the time to pursue this. Someone else would need to do it.
Let me say some things very clearly (and, frankly, bluntly): The code that is currently in the tree (e.g., the patch from bug 509130), is unusable from a "platform" perspective. Sure, the code in the tree works fine for Firefox--unless you're on 10.4 (this bug) or have to load PKCS#11 modules (bug 545653)--but it's absolutely disastrous for embedders like Camino (bug 546902) or XULrunner users (bug 536684). I cannot ship software to users that crashes randomly once an hour--not even nightly builds. I don't think even Flash crashes that often. It seems to me like we have three options in the short term: 1) this patch messing with low-level stuff 2) another patch messing with low-level stuff, which method may already be in use by Chromium 3) back out the patch for bug 509130 entirely and restore a topcrasher In the longer term: 1) Rearcitect Gecko to change the way the runloop works (bug 509130 comment 44) 2) Hope there's something that DivX can do to not involve WebKit at all when running in Gecko browsrs and that DivX can release a new version, so that then we can blacklist old versions and remove whatever short-term fix we've chosen I think my position against messing around with low-level stuff is well-known, but I can't wait around for one of the longer-term solutions to manifest itself. As distasteful as the current patch is to me, it works, and in the past month of using a build with this patch, I've not had a single crash, nor any other bizarre behavior I can reasonably attribute to the patch. Is backing out bug 509130 safer? Absolutely. Will restoring a Firefox topcrash fly? Probably not. Of the three short-term solutions, this patch is at least the devil I know :(
That's fair. I'll be OK with taking this patch as an interim fix until we get a better one.
Josh: ping. It's been over a week now since the last comments here, and all that stands between me spitting out builds is branch approval on a tiny packaging patch (any day now) and the fact that the builds won't even be dogfoodable without getting the patch here (or some other short-term solution, but as I said before, this patch is the devil I know) on the branch.
Landed on trunk: http://hg.mozilla.org/mozilla-central/rev/35955c4e9112 As I said in comment #24, I'd like to have this bake on the trunk for a while -- perhaps a month. But in my experience it takes a long time for a patch to get through the approval queue. So (presuming everything's OK), in one week I'll seek 1.9.2-branch approval.
Comment on attachment 424163 [details] [diff] [review] Fix rev2 (fix crashes in dyld) It's been a week, and no problems with this patch have been reported. Ts and Ts-cold performance seem actually to have improved. This is especially apparent when you compare the results of tests on machines running OS X 10.5.8 (where the patch has effect) to those running OS X 10.6.2 (where the patch has no effect). So my patch might actually be responsible for the improvement. This isn't what I intended. But if so it's likely because of what I do at the top of ScanImportedFunctions(). My patch seems to make no difference to the Tp (Tp4) test results. These have been slowly worsening, but that trend started before my patch landed, and appears on both OS X 10.5.8 and 10.6.2. As I said above, I think this patch should bake on the trunk for about a month, total. The hooking strategy (which functions I hook and what I do there) is very straightforward, and very unlikely to cause trouble. But the hooking infrastucture (how the patch hooks functions) is new. And, while I'm confident in it, there might be problems I haven't forseen. As I also said above, the approval queue is long. So I'm seeking approval now, on the understanding that my patch will have been on the trunk for about a month by the time the decision is made. Presuming the hooking infrastructure works as planned (as it seems to), this patch is a vast improvement on the patch for bug 509130. If we're able to hook dynamically bound functions, the fix for bug 509130 is almost trivially easy. And it avoids the original patch's serious side effects (this bug, bug 546902, bug 545653 and bug 536684).
(Following up comment #36) > So my patch might actually be responsible for the improvement. This > isn't what I intended. But if so it's likely because of what I do > at the top of ScanImportedFunctions(). On second thought this isn't right. What I do at the top of ScanImportedFunctions() is a significant optimization, but it can't be responsible for a speed improvement relative to my patch not being present. Another possibility is that, if my patch *has* speeded up Ts results, this is because of the changes it makes to the __jump_table section of the __IMPORT segment. Whenever my patch overwrites an unitialized "lazy pointer", it means that dyld no longer needs to "lazily" initialize it later on. Initializating these pointers all at once may very well be more efficient than doing it piecemeal (as dyld normally does). And my initialization code may be faster than dyld's -- because I know beforehand what the pointer's value should be.
Comment on attachment 424163 [details] [diff] [review] Fix rev2 (fix crashes in dyld) a=LegNeato for 18.104.22.168. Please ONLY land this on mozilla-1.9.2 default, as we are still working on 22.214.171.124 on the relbranch
Landed on the 1.9.2 branch (default, aka 126.96.36.199): http://hg.mozilla.org/releases/mozilla-1.9.2/rev/6814976c4b11
(In reply to comment #33) > That's fair. I'll be OK with taking this patch as an interim fix until we get a > better one. I filed bug 570380 as the follow-up for investigating/developing that better fix.
Marcia, you have access to OS X 10.4. Can you verify this for 1.9.2?
There is a 10.4 machine available in the lab. At the moment I don't have time to verify this particular bug.
I wish you would have told me that two weeks ago when I asked you to verify it. I only work in the office two days a week. I can't verify this until next week now.