[10.4] Crashes [@ objc_msgSend | nsAppShell::Init() ] triggered by patch for bug 509130

RESOLVED FIXED

Status

()

Core
Plug-ins
--
critical
RESOLVED FIXED
8 years ago
7 years ago

People

(Reporter: smichaud, Assigned: smichaud)

Tracking

({topcrash})

1.9.2 Branch
All
Mac OS X
topcrash
Points:
---
Dependency tree / graph
Bug Flags:
blocking1.9.2 -

Firefox Tracking Flags

(status1.9.2 .5-fixed)

Details

(crash signature)

Attachments

(1 attachment, 2 obsolete attachments)

Since my patch for bug 509130 landed (on the trunk, and on the 1.9.2
branch in time for FF 3.6 beta3), http://crash-stats.mozilla.com/ has
been reporting crashes at WebKit_WebInitForCarbon() in
InitCarbonWebKit() (in nsAppShell.mm).  This is in code added by my
patch.  These crashes are currently at #24 in the OS X topcrash list
for builds that contain my patch.  All of them happen on OS X 10.4.11.

The crashes aren't 100% reproducible on OS X 10.4.11:  I can't
reproduce them, and if they were anywhere close to 100% reproducible
their numbers would be much higher.  That they only happen on OS X
10.4.11 suggests they're caused by bugs in the (older) version of the
WebKit framework available on 10.4.11.

We can't just back out my patch for bug 509130 -- it fixed a much
higher-volume crash (at IdleTimerVector), which is still the #1 or #2
OS X topcrash for builds that don't contain the patch.

This might leave us in a quandary.  But since I wrote my patch for bug
509130 I've discovered a way to hook/subclass dynamically bound C/C++
functions.  This has allowed me to write a much better (and more
narrowly targeted) patch to work around the WebKit bug that triggers
the IdleTimerVector crashes.  I'll present it in my next comment.
(Assignee)

Updated

8 years ago
Assignee: nobody → smichaud
(Assignee)

Updated

8 years ago
Keywords: topcrash
(Assignee)

Updated

8 years ago
Blocks: 509130
(Assignee)

Comment 1

8 years ago
Created attachment 416180 [details] [diff] [review]
Fix

As I explained in the comments to my original patch for bug 509130
(https://bugzilla.mozilla.org/attachment.cgi?id=409942), that patch
got around the WebKit bug by always loading the WebKit framework and
calling WebInitForCarbon() as the browser starts up.  This is the only
way to control the number of autorelease pools on the stack when
WebInitForCarbon() is first called -- which (at the time) I thought
was the only way to disable the WebKit behavior that triggers the
crashes.

Since then I've discovered a way to hook C/C++ functions, which allows
us to write a much better (and more narrowly targeted) patch -- one
that only does anything if the plugin loads the WebKit framework and
calls WebInitForCarbon().  No longer always calling WebInitForCarbon()
ourselves should greatly reduce the frequency of this bug's crashes,
or even make them go away completely.  The remaining crashes (if any)
will (presumably) be something a plugin vendor has to deal with, and
are likely also to happen in Safari.

My new patch hooks WebInitForCarbon() and HIToolbox's
InstallEventLoopIdleTimer() (the latter is what WebInitForCarbon()
calls to install an idle timer, from which WebKit's buggy
PoolCleaner() method is called).  It makes InstallEventLoopIdleTimer()
return without doing anything when called from WebInitForCarbon(),
which stops the idle timer from being installed, which prevents calls
to PoolCleaner().

The technique I use to hook C/C++ functions is brand new.  Though (as
I say in my patch's comments) the individual steps are mostly
documented by Apple, I don't think anyone else has documented (or is
using) the whole thing.  It's worked well in all my tests, and I'm
very confident in it.  But we should let this patch bake for a while
on the trunk before landing it on any of the branches.  This means it
won't make it into FF 3.6.  But I hope that (if no unfixable problems
are discovered) we can get it into FF 3.6.1.

A tryserver build will follow in a few hours.
(Assignee)

Comment 2

8 years ago
Side note:

The reason I've wrapped my patch in an #ifndef __LP64__ macro is that
it won't work in 64-bit builds in its current state:
ScanImportedFunctions() currently only works on 32-bit Intel and PPC
builds.

But it shouldn't be hard to rewrite it also to work in 64-bit builds.
I've just postponed doing the work (and the testing) until I'm more
comfortable playing with our 64-bit builds.
(Assignee)

Comment 3

8 years ago
Since the DivXBrowserPlugin is currently the only plugin that uses
WebKit in Carbon mode, you'll need to use it to test my patch.  The
current version (7.1 or 7.2) is available at:

http://www.divx.com/en/downloads/divx/mac

Here's a URL to test with (once the tryserver build is available).  It
loads a DivX video.

http://www.divx.com/en/downloads/post-install/mac

This URL is much better to test with than the one Marcia and I used at
bug 509130 (see bug 509130 comment #15).

You can find other URLs by doing a Google search for "No video? Get
the DivX Web Player" (this was suggested to me in email by Paul Ellis,
the DivX Web Player's Product Manager).

Since neither this bug's nor bug 509130's crashes are reproducible,
you won't be able to test them.  One thing you *can* test is to run
the tryserver build in gdb and confirm that IdleTimerVector() never
gets called, even after a DivX video is loaded.  I used a custom
mozconfig to stop the tryserver build's symbols from being stripped.
(Assignee)

Comment 4

8 years ago
Here's a tryserver build made with my patch from comment #1:
https://build.mozilla.org/tryserver-builds/smichaud@pobox.com-1259966609/1259966609-macosx.dmg
We should decide if we should fix this for 1.9.2 soon, especially since 1.9.3 won't support 10.4.
Flags: blocking1.9.2?
(Assignee)

Updated

8 years ago
Attachment #416180 - Flags: review?(joshmoz)
(Assignee)

Comment 6

8 years ago
+  // If any of our "images" gets unloaded, an image's 'index' will no longer
+  // correspond to the order in which ScanImportedFunctions() is called.  But
+  // it appears that images never do get unloaded on OS X  (bug 509130 comment
+  // #24 shows that plugins never get unloaded).  And in any case we only use
+  // imageName for logging, so it's not a disaster if it's wrong.
+  static uint32_t count = 1;
+  const char *imageName = _dyld_get_image_name(count - 1);
+  ++count;

This code is unnecessary and is left over from debugging versions of my patch.  I'll remove it before I land the patch.
How many crashes is this responsible for? Speculatively blocking since crashes R bad, but we might change that if it turns out to not be worth holding the release.

Josh, can you expedite review and/or comment on blocking merit?
Flags: blocking1.9.2? → blocking1.9.2+
(Assignee)

Comment 8

8 years ago
> How many crashes is this responsible for?

51 in the last week.

The number will (of course) increase when FF 3.6 is released (if the bug isn't fixed before then).  But the patch that triggers the crashes has been in betas since beta3, and the crashes only happen on 10.4.

So (if left unfixed) this bug will be responsible for a significant number of crashes, though not a huge one.

Updated

8 years ago
Flags: blocking1.9.2+ → blocking1.9.2-

Updated

8 years ago
Attachment #416180 - Flags: review?(joshmoz)
(Assignee)

Comment 9

8 years ago
Created attachment 419964 [details] [diff] [review]
Fix rev1

Here's a revision of my patch that drops the code mentioned in comment
#6.  And here's a new tryserver build:

https://build.mozilla.org/tryserver-builds/smichaud@pobox.com-bugzilla533001-rev1/bugzilla533001-rev1-macosx.dmg
Attachment #416180 - Attachment is obsolete: true
Created attachment 424163 [details] [diff] [review]
Fix rev2 (fix crashes in dyld)

Here's a new version of my patch that fixes a problem reported at bug
536684 comment #7 and following (see bug 536684 comment #13 for STR)
-- wierd crashes in dyld.

Turns out I wasn't setting permissions correctly on some modules'
__IMPORT segments (these contain the jump table, some of whose entries
my patch overwrites to hook dynamically bound functions).  My previous
patches always changed the permissions of these segments (to
VM_PROT_READ | VM_PROT_WRITE) before attempting to change them, then
afterwards always made them read-only.  But some __IMPORT segments are
writeable to begin with (so there's no need to change their
permissions), and making them read-only causes access errors when dyld
tries to bind a new symbol in that module:

If dyld sees that an __IMPORT segment is *supposed* to be read-only,
it will temporarily make it writeable before changing one of its
entries.  But it doesn't do this if the __IMPORT segment is supposed
to be writeable.

A tryserver build will follow in a few hours.
Attachment #419964 - Attachment is obsolete: true
Blocks: 546902
It's a bit more than a few hours later ... but here's a tryserver
build made with my patch from comment #10:

https://build.mozilla.org/tryserver-builds/smichaud@pobox.com-bugzilla533001-rev2/bugzilla533001-rev2-macosx.dmg
(Assignee)

Updated

8 years ago
Blocks: 545653
Steven, can you begin the process of getting this patch reviewed and landed on m-c and then m1.9.2 as soon as it reopens?  

We're very close to being able to spit out Camino nightlies on 1.9.2, and the presence of the crash/this crash/bug 546902 crash/whatever that is caused by the initial fix in bug 509130 makes stock-1.9.2 builds pretty hard to dogfood.

In addition, I'm sure people using 10.4 and people using PKCS#11 modules would appreciate seeing this fixed in Fx 3.6.3, too.
(Assignee)

Updated

8 years ago
Attachment #424163 - Flags: review?(joshmoz)

Comment 13

8 years ago
Since trunk doesn't support 10.4 we can kill all of this code off there. Can you put up a patch for that?

This patch is 1.9.2-only now, right? And since 1.9.2 will never be 64-bit, we don't need any of the 64-bit accommodations in the patch, right?
> Since trunk doesn't support 10.4 we can kill all of this code off
> there. Can you put up a patch for that?

What do you mean by "all of this code"?

Note that my patch also fixes two other problems (bug 545653 and bug
546902) that aren't 10.4-specific (they also happen on Leopard).

> This patch is 1.9.2-only now, right?

No, it's not.  It's also needed on trunk for at least bug 545653.
Why don't we just blacklist DivXBrowserPlugin? Is it really important to support it?
> Why don't we just blacklist DivXBrowserPlugin? Is it really important to
> support it?

I get the impression it has lots of users.  I don't really know how to quantify that, though.

In any case it's really bad form to blacklist something just because it causes us inconvenience.  These crashes (and the IdleTimerVector crashes) aren't really the fault of the DivX plugin, and we do have a fix for them.
Yeah, but let's be clear --- the fix is an evil, evil hack. Sometimes such things are necessary; the question is whether this one is really necessary.
> the fix is an evil, evil hack

It's not.  It's just something rather rare in our profession -- a new
idea (though it's composed of "old" (i.e. reasonably well documented)
parts.)

Nonetheless, I'm not sure Mozilla has anyone qualified to review
my patch (someone with expertise in the areas it covers).  And (sighs
of relief all around) I've since discovered another, much simpler
technique that might also be able to fix the problem -- "dyld
interposing" (http://books.google.com/books?id=K8vUkpOXhN4C&pg=PA73&lpg=PA73&dq=__interpose&source=bl&ots=OJnnXZYpZC&sig=o7I3lXvoduUi13SrPfOON7o3do4&hl=en&ei=AoehS9brCYGQNrvsmeUM&sa=X&oi=book_result&ct=result&resnum=6&ved=0CBsQ6AEwBQ#v=onepage&q=__interpose&f=false).

It'll take a few days for me to try out "dyld interposing".  Do you
want me to do it now, or wait for later?

Comment 19

8 years ago
Comment on attachment 424163 [details] [diff] [review]
Fix rev2 (fix crashes in dyld)

Have you looked at how Chromium handles hooking functions like this, if it does, just in case there is a simpler way that we don't know about? I'm pretty sure they also hook Carbon functions like this in order to simulate parent process behavior for Carbon plugins. If they do, they aren't using '_dyld_register_func_for_add_image'.

A comment of yours notes that '_dyld_register_func_for_add_image' will trigger the callback for all of the images that have already been loaded, including the main image. Since presumably we're never going to want to hook things in our main image (firefox-bin, for example) here, can we detect when the callback happens for the main image and bail to save work?

Why store oldAddressPtr in _nsHookedFunctionSpec? Doesn't look like it is ever used.

If/when this lands it needs to bake on trunk for at least a week and we need to watch crash reports, Tp and Ts closely.

Comment 20

8 years ago
Has anyone contacted DivX and asked them to stop doing this?
(In reply to comment #18)
> > the fix is an evil, evil hack
> 
> It's not.  It's just something rather rare in our profession -- a new
> idea (though it's composed of "old" (i.e. reasonably well documented)
> parts.)

We've had this discussion before. Any solution that involves hooking/overriding particular functions in libraries we don't control is an evil hack and we need to keep such things to an absolute minimum.
If we do decide we need to keep supporting DivXBrowserPlugin in 1.9.2 on 10.4, then we should gratefully take this hack, evil and all :-). But blacklisting DivXBrowserPlugin seems safer and simpler to me. How many users can there be using DivXBrowserPlugin with Firefox 3.6 on 10.4 (which Apple doesn't even support anymore)?
> How many users can there be using DivXBrowserPlugin with Firefox 3.6
> on 10.4 (which Apple doesn't even support anymore)?

As I say in comment #14, this is no longer just a problem on 10.4.
> Have you looked at how Chromium handles hooking functions like this

The latest Chromium code does use dyld interposing.

> can we detect when the callback happens for the main image and bail
> to save work?

We probably can.  But Firefox loads hundreds of modules (and
_dyld_register_func_for_add_image() gets called hundreds of times).
So it's probably not worth the trouble.

> Why store oldAddressPtr in _nsHookedFunctionSpec? Doesn't look like
> it is ever used.

Not so.  It's by means of the void** that the
HIToolbox_InstallEventLoopIdleTimer and WebKit_WebInitForCarbon
function pointers get set.

> If/when this lands it needs to bake on trunk for at least a week and
> we need to watch crash reports, Tp and Ts closely.

Absolutely!  I think it should bake for longer (a month?), and should
start out on the trunk.  As with any technique/idea that's this new
and different :-)
Let's see if we can get the DivX developers to fix their stuff. If we can, then we can blacklist the old versions of the plugin.
Josh, would you feel more comfortable with dyld interposing?

Is it worth my while spending a couple of days looking into that?
> Let's see if we can get the DivX developers to fix their stuff. If
> we can, then we can blacklist the old versions of the plugin.

It's WebKit that needs to be fixed ... and I doubt very much that's
going to happen.
>> can we detect when the callback happens for the main image and bail
>> to save work?
>
> We probably can.  But Firefox loads hundreds of modules (and
> _dyld_register_func_for_add_image() gets called hundreds of times).
> So it's probably not worth the trouble.

It's the callback registered by _dyld_register_func_for_add_image()
that gets called hundreds of times.
Maybe they don't actually need to link in Webkit when they're running in Gecko.

Comment 30

8 years ago
(In reply to comment #29)
> Maybe they don't actually need to link in Webkit when they're running in Gecko.

Right. We should talk to the DivX people about potential solutions before we invest any more time trying to solve this on our end. I have sent them email.
> Maybe they don't actually need to link in Webkit when they're
> running in Gecko.

Possibly.  But only DivX can answer that question.  And judging by
what transpired at bug 509130, getting them to say "yes" and waiting
for the new version would take a long time ... possibly forever.

I don't have the energy or the time to pursue this.  Someone else
would need to do it.
Let me say some things very clearly (and, frankly, bluntly):

The code that is currently in the tree (e.g., the patch from bug 509130), is unusable from a "platform" perspective.

Sure, the code in the tree works fine for Firefox--unless you're on 10.4 (this bug) or have to load PKCS#11 modules (bug 545653)--but it's absolutely disastrous for embedders like Camino (bug 546902) or XULrunner users (bug 536684).

I cannot ship software to users that crashes randomly once an hour--not even nightly builds.  I don't think even Flash crashes that often.

It seems to me like we have three options in the short term:
 
1) this patch messing with low-level stuff
2) another patch messing with low-level stuff, which method may already be in use by Chromium
3) back out the patch for bug 509130 entirely and restore a topcrasher

In the longer term:

1) Rearcitect Gecko to change the way the runloop works (bug 509130 comment 44)
2) Hope there's something that DivX can do to not involve WebKit at all when running in Gecko browsrs and that DivX can release a new version, so that then we can blacklist old versions and remove whatever short-term fix we've chosen

I think my position against messing around with low-level stuff is well-known, but I can't wait around for one of the longer-term solutions to manifest itself.  

As distasteful as the current patch is to me, it works, and in the past month of using a build with this patch, I've not had a single crash, nor any other bizarre behavior I can reasonably attribute to the patch.  Is backing out bug 509130 safer?  Absolutely.  Will restoring a Firefox topcrash fly?  Probably not.  Of the three short-term solutions, this patch is at least the devil I know :(
That's fair. I'll be OK with taking this patch as an interim fix until we get a better one.
Josh: ping.  

It's been over a week now since the last comments here, and all that stands between me spitting out builds is branch approval on a tiny packaging patch (any day now) and the fact that the builds won't even be dogfoodable without getting the patch here (or some other short-term solution, but as I said before, this patch is the devil I know) on the branch.

Updated

8 years ago
Attachment #424163 - Flags: review?(joshmoz) → review+
Landed on trunk:
http://hg.mozilla.org/mozilla-central/rev/35955c4e9112

As I said in comment #24, I'd like to have this bake on the trunk for
a while -- perhaps a month.  But in my experience it takes a long time
for a patch to get through the approval queue.  So (presuming
everything's OK), in one week I'll seek 1.9.2-branch approval.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Comment on attachment 424163 [details] [diff] [review]
Fix rev2 (fix crashes in dyld)

It's been a week, and no problems with this patch have been reported.

Ts and Ts-cold performance seem actually to have improved.  This is
especially apparent when you compare the results of tests on machines
running OS X 10.5.8 (where the patch has effect) to those running OS X
10.6.2 (where the patch has no effect).  So my patch might actually be
responsible for the improvement.  This isn't what I intended.  But if
so it's likely because of what I do at the top of
ScanImportedFunctions().

My patch seems to make no difference to the Tp (Tp4) test results.
These have been slowly worsening, but that trend started before my
patch landed, and appears on both OS X 10.5.8 and 10.6.2.

As I said above, I think this patch should bake on the trunk for about
a month, total.  The hooking strategy (which functions I hook and what
I do there) is very straightforward, and very unlikely to cause
trouble.  But the hooking infrastucture (how the patch hooks
functions) is new.  And, while I'm confident in it, there might be
problems I haven't forseen.

As I also said above, the approval queue is long.  So I'm seeking
approval now, on the understanding that my patch will have been on the
trunk for about a month by the time the decision is made.

Presuming the hooking infrastructure works as planned (as it seems
to), this patch is a vast improvement on the patch for bug 509130.  If
we're able to hook dynamically bound functions, the fix for bug 509130
is almost trivially easy.  And it avoids the original patch's serious
side effects (this bug, bug 546902, bug 545653 and bug 536684).
Attachment #424163 - Flags: approval1.9.2.4?
(Following up comment #36)

> So my patch might actually be responsible for the improvement.  This
> isn't what I intended.  But if so it's likely because of what I do
> at the top of ScanImportedFunctions().

On second thought this isn't right.  What I do at the top of
ScanImportedFunctions() is a significant optimization, but it can't be
responsible for a speed improvement relative to my patch not being
present.

Another possibility is that, if my patch *has* speeded up Ts results,
this is because of the changes it makes to the __jump_table section of
the __IMPORT segment.  Whenever my patch overwrites an unitialized
"lazy pointer", it means that dyld no longer needs to "lazily"
initialize it later on.  Initializating these pointers all at once may
very well be more efficient than doing it piecemeal (as dyld normally
does).  And my initialization code may be faster than dyld's --
because I know beforehand what the pointer's value should be.

Comment 38

8 years ago
Comment on attachment 424163 [details] [diff] [review]
Fix rev2 (fix crashes in dyld)

a=LegNeato for 1.9.2.5. Please ONLY land this on mozilla-1.9.2 default, as we
are still working on 1.9.2.4 on the relbranch
Attachment #424163 - Flags: approval1.9.2.4? → approval1.9.2.5+
Landed on the 1.9.2 branch (default, aka 1.9.2.5):
http://hg.mozilla.org/releases/mozilla-1.9.2/rev/6814976c4b11
status1.9.2: --- → .5-fixed
(In reply to comment #33)
> That's fair. I'll be OK with taking this patch as an interim fix until we get a
> better one.

I filed bug 570380 as the follow-up for investigating/developing that better fix.
Marcia, you have access to OS X 10.4. Can you verify this for 1.9.2?
There is a 10.4 machine available in the lab. At the moment I don't have time to verify this particular bug.
I wish you would have told me that two weeks ago when I asked you to verify it.

I only work in the office two days a week. I can't verify this until next week now.
Crash Signature: [@ objc_msgSend | nsAppShell::Init() ]
You need to log in before you can comment on or make changes to this bug.