Closed Bug 525579 Opened 15 years ago Closed 15 years ago

topcrash [@ nsAccessibilityService::GetAccessible(nsIDOMNode*, nsIPresShell*, nsIWeakReference*, nsIFrame**, int*, nsIAccessible**)]

Categories

(Core :: Disability Access APIs, defect)

x86
Windows XP
defect
Not set
critical

Tracking

()

RESOLVED FIXED
Tracking Status
status1.9.2 --- beta4-fixed

People

(Reporter: dbaron, Assigned: davidb)

References

(Blocks 1 open bug)

Details

(Keywords: access, crash, topcrash, Whiteboard: [crashkill][#7 Firefox 3.6b1 topcrash][crashkill-fix])

Crash Data

Attachments

(2 files, 9 obsolete files)

23.18 KB, patch
roc
: review+
surkov
: review+
Details | Diff | Splinter Review
24.51 KB, patch
roc
: review+
Details | Diff | Splinter Review
Severity: normal → critical
Whiteboard: [crashkill]
Alex, any idea?
88 crashers like this within the last 7 days. Requesting blocking. Note that I haven't seen this crash myself at all yet.
Flags: blocking1.9.2?
The stacks seem to indicate we walk up an accessible parent chain and then look at siblings. The source for frame 0 is:
if (!frame || content != frame->GetContent())

Could the frame be defunct? We should probably call: FlushPendingNotifications(Flush_Layout) near the top of GetAccessible.

Also, this might indicate that we are allowing the call run at an unsafe time -- not sure.
Hoping for steps to reproduce in order to debug and fix this properly. I guess wallpapering it for 3.6 might be our best option for now.
Hmm I wonder if frame poisoning bug 497495 uncovered this bug.
The address is 0xfffffffff0dea813 so the frame is poisoned. Zack, what is the right approach here? We used to just check for a null frame.
for 2009 11 03 about half the crashes are near startup.

37 total crashes for nsAccessibilityService::GetAccessible on
20091103-crashdata.csv
19 start up crashes inside 3 minutes

distribution of all versions where the nsAccessibilityService::GetAccessible
crash was found on 20091103-crashdata.csv
  29 Firefox 3.6b1
   4 Firefox 3.6b2pre
   2 Firefox 3.7a1pre
   2 Firefox 3.5.4

____________number of uniq sites found with this signature:
      16
   3  about:blank
   1  about:blank
   1  http://www.youtube.com/results?search_query=GAME&page=&utm_source=opense
arch
   1 http://www.youtube.com/republicandebate
   1 http://www.yandex.ru/?edit=1&ncrnd=2395080453#widgetAdded
   1 http://www.photohunters.de/meist-gesehen/natur/landschaft/sonnenaufgang-
in-sassenberg-349.html
   1 http://www.google.it/#hl=it&source=hp&q=testo+legge+328+00&meta=&aq=0&oq
=testo+legge++328&fp=8170e646eb90a16c
   1 http://www.google.fr/ig
   1 http://www.bild.de/
   1 http://windows.microsoft.com/en-US/windows/downloads/personalize?T1=desk
topgadgets
   1 http://news.google.de/news/search?um=1&cf=all&ned=de&hl=de&q=t-home+ente
rtain&cf=all&scoring=n
   1 http://mail2.daum.net/hanmail/spam/SpamReport.daum
   1 http://kirutoku-rublog.seesaa.net/article/124938451.html#comment

some other possible related signatures

signature list
  35 nsAccessibilityService::GetAccessible(nsIDOMNode*, nsIPresShell*,
nsIWeakReference*, nsIFrame**, int*, nsIAccessible**)
   1 nsAccessibilityService::GetAccessibleByType(nsIDOMNode*, nsIAccessible**)
   1 @0x0 | nsAccessibilityService::GetAccessible(nsIDOMNode*, nsIPresShell*,
nsIWeakReference*, nsIFrame**, int*, nsIAccessible**)
Chris thanks! While some of those reports seem to pre-date frame poisoning, they are a separate kind of crash, happening on a different line.

Note a proper fix for bug 423737 might solve this bug.

Marco, can you try those sites to see if you can recreate?
Unfortunately, simply calling these links will crash me. NONE. BTW I always start Firefox with about:blank, and new tabs are also blank when I open them. I've never seen this crash happen ever myself.
Depends on: 423737
(In reply to comment #6)
> The address is 0xfffffffff0dea813 so the frame is poisoned. Zack, what is the
> right approach here? We used to just check for a null frame.

You need to find the place that is freeing a frame that still has live pointers, and make it not do that, or else null out those pointers when it does.
David, this bug should be your #1 priority.
Assignee: nobody → bolterbugz
Whiteboard: [crashkill] → [crashkill][#7 Firefox 3.6b1 topcrash]
Still no luck reproducing using any of the above links with either Minefield or Namoroka, with either JAWS or NVDA.
(In reply to comment #11)
> David, this bug should be your #1 priority.

Yep. Let's fix this.

(In reply to comment #10)
> (In reply to comment #6)
> > The address is 0xfffffffff0dea813 so the frame is poisoned. Zack, what is the
> > right approach here? We used to just check for a null frame.
> 
> You need to find the place that is freeing a frame that still has live
> pointers, and make it not do that, or else null out those pointers when it
> does.

Thanks, this makes sense; so far I haven't pin-pointed it.
I am very suspicious of the way that nsAccessibleTreeWalker caches a frame in mState.frame. This means if any layout flush happens while an nsAccessibleTreeWalker is alive, we could crash or do something random and unsafe (before frame poisoning) or crash or do something random but safe (after frame poisoning).

(Were you guys already aware that flushing is not permitted while an nsAccessibleTreeWalker is alive?)

So far I can't find a spot where such a flush could have happened, but this is an awful lot of code to secure against flushes. Can we make mState.frame go away?

Another option is to expose nsWeakFrame and use it here, although then an unexpected flush would cause the frame to go away, and recovering from that reliably would probably be harder than just not using a frame in the first place.

On another topic, I wonder why our #4 topcrash is in accessibility at all. Why are so many people running with accessibility on? Is it because of that antivirus software that turns it on?
3.6b1 doesn't have any users or much crash volume yet so the top crash list can be influenced by a few users that are crashing a lot.
It might be reasonable to use IsDefunct inside of CacheChildren() instead of mWeakShell check because IsDefunct() returns true if presshell died which could happen before we'll process delayed event for example. It's totally safe and easy, we could try and see if it helps.
IsDefunct won't protect you from all frame destruction/reconstruction situations, at best it's just a stop-gap measure.
(In reply to comment #17)
> IsDefunct won't protect you from all frame destruction/reconstruction
> situations, at best it's just a stop-gap measure.

Sure, but nsAccTreeWalker is always local, used to create accessibles and accessible tree and cache them. I can't see anything dangerous we could launch destruction/reconstruction frame process. What should we call from layout to start this process? How is this process usually started?
Anything that calls nsIPresShell::FlushPendingNotifications can create or destroy frames. For example, nsAccessible::GetState(). Are you sure that nothing in nsAccessibilityService::GetAccessible can ever call nsAccessible::GetState, directly or indirectly? That's an awful lot of code.
Yes, it could be, for example, nsAccessibleService::GetAccessible() -> nsLinkableAccessible::Init() -> nsLinkableAccessible::CacheActionContent().
(In reply to comment #14)
> I am very suspicious of the way that nsAccessibleTreeWalker caches a frame in
> mState.frame.

Yes me too. I've been looking at tree walker quite a bit today -- it is fairly
new to me since that code has been mostly untouched for a while.

> (Were you guys already aware that flushing is not permitted while an
> nsAccessibleTreeWalker is alive?)

I only had a hunch, but now I'm aware thanks. I will not add a flush during
tree walking. 

> So far I can't find a spot where such a flush could have happened, but this is
> an awful lot of code to secure against flushes. Can we make mState.frame go
> away?

I don't know yet.

> 
> Another option is to expose nsWeakFrame and use it here, although then an
> unexpected flush would cause the frame to go away, and recovering from that
> reliably would probably be harder than just not using a frame in the first
> place.

Good to know. This might be an option if it isn't a flush that is biting us.
But if it isn't a flush I wonder what it is... how are we walking a tree that
has a poisoned frame?

> On another topic, I wonder why our #4 topcrash is in accessibility at all. Why
> are so many people running with accessibility on? Is it because of that
> antivirus software that turns it on?

It surprises me. As Chris says in comment #15 it could be sample size. I didn't
notice the virtual keyboard (mzvkbd.dll) that Kaspersky installs listed in the
modules. Not sure if there are others. I've started a list on
https://wiki.mozilla.org/Accessibility/Crash.
(In reply to comment #20)
> Yes, it could be, for example, nsAccessibleService::GetAccessible() ->
> nsLinkableAccessible::Init() -> nsLinkableAccessible::CacheActionContent().

OK, so that's one way we could get a random crash.

I looked pretty hard for a possible cause and failed to notice that code path, because there's just a lot of code here. I think that's a strong reason to make nsAccessibleTreeWalker not cache the frame. Can't we just call GetPrimaryFrameFor whenever we need a frame? I.e., get rid of the frame hints?

(In reply to comment #21)
> Good to know. This might be an option if it isn't a flush that is biting us.
> But if it isn't a flush I wonder what it is... how are we walking a tree that
> has a poisoned frame?

I'm guessing not. Bad frame trees are very rare in real Web content and would cause crashes outside accessibility a lot more often than in it, I would think, since accessibility code is normally not run.

By the way, let me explain why this wouldn't have crashed (much) before frame poisoning. From the crash stack it looks like we're doing
  mState.frame = mState.frame->GetNextSibling();
and then (where frame is mState.frame)
  if (!frame || content != frame->GetContent()) {
    frame = aPresShell->GetRealPrimaryFrameFor(content);

If the first mState.frame has been deleted, then mState.frame->GetNextSibling() is probably going to return null, or a pointer to another frame (dead or alive). If it's null, then !frame is false. If it's a pointer to another frame --- dead or alive --- "content != frame->GetContent()" will almost certainly return false. So, unless we're very unlucky, we'll go ahead and reset 'frame' to the real frame.
(In reply to comment #22)
> (In reply to comment #20)
> > Yes, it could be, for example, nsAccessibleService::GetAccessible() ->
> > nsLinkableAccessible::Init() -> nsLinkableAccessible::CacheActionContent().
> 
> OK, so that's one way we could get a random crash.
> 
> I looked pretty hard for a possible cause and failed to notice that code path,
> because there's just a lot of code here. I think that's a strong reason to make
> nsAccessibleTreeWalker not cache the frame. Can't we just call
> GetPrimaryFrameFor whenever we need a frame? I.e., get rid of the frame hints?

Seems reasonable to me. Alexander do you know if we were keeping mState.frame for performance only or was there more to it?

> (In reply to comment #21)
> > Good to know. This might be an option if it isn't a flush that is biting us.
> > But if it isn't a flush I wonder what it is... how are we walking a tree that
> > has a poisoned frame?
> 
> I'm guessing not. Bad frame trees are very rare in real Web content and would
> cause crashes outside accessibility a lot more often than in it, I would think,
> since accessibility code is normally not run.
> 
> By the way, let me explain why this wouldn't have crashed (much) before frame
> poisoning.

(...)

Yes thanks for the explanation, I suspected frame poisoning was unveiling some hidden badness.

I should mention that I'm not married to our two calls to FlushPendingNotifications and would consider (in a pinch) pulling them at least for 1.9.2.

First let's try your recommended approach of not storing mState.frame.
(In reply to comment #22)
> (In reply to comment #20)
> > Yes, it could be, for example, nsAccessibleService::GetAccessible() ->
> > nsLinkableAccessible::Init() -> nsLinkableAccessible::CacheActionContent().
> 
> OK, so that's one way we could get a random crash.
> 
> I looked pretty hard for a possible cause and failed to notice that code path,
> because there's just a lot of code here.

nsAccessibleTreeWalker calls nsAccessibilityService::GetAccessible when it needs to create an accessible, which calls nsAccessibilityService::InitAccessible for every new created accessible and then the patch I described above.

> I think that's a strong reason to make
> nsAccessibleTreeWalker not cache the frame. Can't we just call
> GetPrimaryFrameFor whenever we need a frame? I.e., get rid of the frame hints?

Technically this might work. But it will make impossible to use nsAccTreeWalker to deal with generated content (after and before styles) and list bullet frames. I should say now we don't create accessible for generated content (it's our bug) and we use hacks to create accessible for list bullet frame. Therefore I said this might work but this is not good in perspective. However I'm afraid to touch nsAccTreeWalker without good set of mochitests because I don't know this code well.
(In reply to comment #23)

> Seems reasonable to me. Alexander do you know if we were keeping mState.frame
> for performance only or was there more to it?

I don't think it's performance issue. I would guess (I didn't look at history) we traversed frame tree to create generated content and etc and therefore we cached frame. But I might be wrong. We stopped to traverse frame tree because of huge performance problem.
I dug up the old performance bug 242589 where the frame hint was added.

TreeWalker changes: http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla%2Faccessible%2Fsrc%2Fbase%2FnsAccessibleTreeWalker.cpp&rev=&cvsroot=%2Fcvsroot

I can essentially undo that... feels dangerous.
Backing out the frame hint sounds like the right thing to do, to me.

If you come up with a testcase that shows a significant performance loss after the backout, we can have a new bug on getting that performance back, safely. It's probably not too hard.
Attached patch 192-fix (don't flush layout) (obsolete) — Splinter Review
This is a stopgap with low risk just in cased we are uncomfortable with the right fix which changes code that has been around longer. Note we don't have test coverage that will fail with this patch, so it just a code change. The only known possible regression is regarding accessibility in chatzilla + screen reader usage, though there may be others.

Do we want to push this ASAP into a beta to make sure it works?


Next stop, frame hint removal...
Comment on attachment 410488 [details] [diff] [review]
192-fix (don't flush layout)

Seeking r+ just in case we go with this stopgap.
Attachment #410488 - Flags: review?(surkov.alexander)
Attachment #410488 - Flags: review?(roc)
Comment on attachment 410488 [details] [diff] [review]
192-fix (don't flush layout)

Marco, your assessment on the stopgap would be great. I recall you found the flushing to help the chatzilla bug.
Attachment #410488 - Flags: review?(marco.zehe)
Comment on attachment 410488 [details] [diff] [review]
192-fix (don't flush layout)

r=me. This reverts bug 444644. I don't really expect any negative side effects because there were more factors playing in with bug 444644 than this one. This will bring us back to a level of 3.5, which is fine.
Attachment #410488 - Flags: review?(marco.zehe) → review+
Attached patch WIP (framehint removal) (obsolete) — Splinter Review
Comments on the approach encouraged. Try build started. Marco will give it a spin.
Comment on attachment 410512 [details] [diff] [review]
WIP (framehint removal)

With this patch, both fileupload widget and the embedded HTML5 audio/video controls stop working completely. They simply disappear from the accessible tree alltogether.
OK sounds like we might need to walk into anonymous child frames.
I'm now fuzzy on what "back(In reply to comment #27)
> Backing out the frame hint sounds like the right thing to do, to me.

I think I might have gone a little too far on my "framehint removal" patch.

Just to be clear, did you mean do not walk frames? Or did you mean grab the frame as we need it in the walker? Or?
What is the best way to walk the children of an <input type="file" />?

I think this is anonymous non XBL content?
Comment on attachment 410488 [details] [diff] [review]
192-fix (don't flush layout)

Looks OK to me but I guess it might break stuff for accessibility.

And it's definitely just a stopgap, since it's difficult to verify there aren't other places that flush.
Attachment #410488 - Flags: review?(roc) → review+
(In reply to comment #36)
> Just to be clear, did you mean do not walk frames? Or did you mean grab the
> frame as we need it in the walker? Or?

You can work with frames, but you can't hang onto an nsIFrame* between treewalker steps.

If you need to find anonymous content by walking the frame tree, you probably need to collect that anonymous content into an array, by getting the primary frame for the parent content, then walking its frame subtree *all at once* to collect all the anonymous content nodes into the array. Does that make sense?
(In reply to comment #40)
> (In reply to comment #36)
> > Just to be clear, did you mean do not walk frames? Or did you mean grab the
> > frame as we need it in the walker? Or?
> 
> You can work with frames, but you can't hang onto an nsIFrame* between
> treewalker steps.

OK.

> 
> If you need to find anonymous content by walking the frame tree, you probably
> need to collect that anonymous content into an array, by getting the primary
> frame for the parent content, then walking its frame subtree *all at once* to
> collect all the anonymous content nodes into the array. Does that make sense?

Yes I think so; instead of as a state machine.
(I mean during sibling traversal)
I'd sleep better if we can get the stopgap out there and tested. When is the next beta?

I like Roc's direction for the longer term more resilient fix but it feels like it is going to be higher risk to me, given the timing.
It's already too late for beta2. I don't think there is a next beta.
Comment on attachment 410488 [details] [diff] [review]
192-fix (don't flush layout)

r=me as for temporary solution
Attachment #410488 - Flags: review?(surkov.alexander) → review+
I'm still building the big picture in my mind. Why doesn't content let me walk down into native anonymous content?
Elements do not have references to their anonymous children. For XBL anonymous children there are auxiliary data structures you can use to look up the XBL anonymous children given a parent, but for native anonymous children only the element's frame has references to the children. It would be possible to create a way to find the native anonymous children for an element, but we haven't needed it yet.
Comment on attachment 410488 [details] [diff] [review]
192-fix (don't flush layout)

Renaming 192 fix for book-keeping.
Attachment #410488 - Attachment description: stopgap → 192-fix (don't flush layout)
Attached patch updated 192 fix (obsolete) — Splinter Review
Needed to tweak it for 1.9.2. (pushed to try).
Attachment #410488 - Attachment is obsolete: true
Attachment #410805 - Flags: approval1.9.2?
Attachment #410805 - Flags: approval1.9.2? → approval1.9.2+
Do we need this fix on the older branches? The 1.9.1 branch code doesn't seem to have the problem in the two places that were patched, but maybe code was just moved around a bit.
blocking1.9.1: --- → ?
status1.9.1: --- → ?
Keywords: testcase-wanted
Whiteboard: [crashkill][#7 Firefox 3.6b1 topcrash] → [needs landing][crashkill][#7 Firefox 3.6b1 topcrash]
No need to backport a fix pre 1.9.2.; thanks for checking.
blocking1.9.1: ? → ---
status1.9.1: ? → ---
(Note to self: I could flush layout in the treewalker ctor for further protection)
Different crash signature, but probably the same root cause:
http://crash-stats.mozilla.com/report/index/cdf9f4bc-e10c-4c10-b158-9f1bf2091105

Possible STR: "From Facebook's private message view, I click on the Updates tab on the left, and Firefox crashes every time."
(In reply to comment #56)
> Possible STR: "From Facebook's private message view, I click on the Updates tab
> on the left, and Firefox crashes every time."

I cannot reproduce this with either JAWS or NVDA and Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.3a1pre) Gecko/20091029 Minefield/3.7a1pre (.NET CLR 3.5.30729)

David, did you have a screen reader running while doing this, or what did you do to turn on accessibility while reproducing this?
Also no luck reproducing with Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2b2pre) Gecko/20091029 Namoroka/3.6b2pre (.NET CLR 3.5.30729) and either JAWS or NVDA.
(In reply to comment #57)
> David, did you have a screen reader running while doing this, or what did you
> do to turn on accessibility while reproducing this?

Those aren't my steps; I just pasted them from a user comment in a crash report.
Closing as fixed. I landed the 192 fix on the weekend before beta2 was tagged. I spun off bug 527466 to make sure we don't hold onto frames between calls as per Roc's suggestion.

Please reopen if the 192 fix doesn't resolve (or at least significantly reduce) these crashes in FF3.6beta2.
Status: NEW → RESOLVED
Closed: 15 years ago
Flags: blocking1.9.2?
Resolution: --- → FIXED
Heavy sigh. Still seeing the crash frequently in 3.6b2.

E.g. http://crash-stats.mozilla.com/report/index/57fa3fb4-615b-4cef-9e46-e05e52091111

Unfortunately, this is going to be really tricky to get right without regressing some a11y.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Roc is there a way we can freeze the frame tree while we walk it and unfreeze when we're done?
Adding a script blocker (e.g. with nsAutoScriptBlocker) will stop FlushPendingNotifications from flushing notifications. But it won't block all changes to the frame tree. For example, removing elements from the DOM will still destroy frames.

You can't call nsContentUtils from accessibility I guess, so you'd have to add some API to make that work.

Another option would just be to add a flag to your accessibility code to disable the calls to FlushPendingNotifications from accessibility.
Hmm, but I guess you've already removed those completely, and that wasn't enough.

Perhaps the best quick fix is to use nsWeakFrame. Although you can't do that directly, since again it's not linkable from accessibility.

Maybe you should make nsWeakFrame usable externally by adding ClearInternal/InitInternal non-virtual methods that do what Init and Clear currently do, add ClearExternal/InitExternal that are virtual, and make Init and Clear inline methods that use _IMPL_NS_LAYOUT #ifdefs to choose either the internal or external versions.
Roc thanks so much for the leads!

(In reply to comment #63)
> Adding a script blocker (e.g. with nsAutoScriptBlocker) will stop
> FlushPendingNotifications from flushing notifications. But it won't block all
> changes to the frame tree. For example, removing elements from the DOM will
> still destroy frames.

This might cover us actually, since I can't imagine our calls into layout during frame walking causing element removal but I'm not sure. (?)

(In reply to comment #64)
> Perhaps the best quick fix is to use nsWeakFrame. Although you can't do that
> directly, since again it's not linkable from accessibility.
> 
> Maybe you should make nsWeakFrame usable externally by adding
> ClearInternal/InitInternal non-virtual methods that do what Init and Clear
> currently do, add ClearExternal/InitExternal that are virtual, and make Init
> and Clear inline methods that use _IMPL_NS_LAYOUT #ifdefs to choose either the
> internal or external versions.

This sounds like a decent plan too. It sounds like you are recommending this approach over the nsAutoScriptBlocker? Your advice is appreciated.
Yes, it's more robust, lower risk, and about as much work as making script blocking usable from accessibility.
OK great thanks, I have made the layout changes, now looking at accessibility.
Attached patch layout-wip (sad vtable) (obsolete) — Splinter Review
Roc, anyone, seen this error before?

(attached is what I currently have changed in layout)

In our a11y code I use the nsWeakFrame in this way:
nsWeakFrame weakFrame = nsWeakFrame(mState.frame);
nsIFrame *frame = weakFrame.GetFrame();
..

But problems with linking (when building a11y):
Undefined symbols:
  "vtable for nsWeakFrame", referenced from:
      __ZTV11nsWeakFrame$non_lazy_ptr in nsAccessibleTreeWalker.o
ld: symbol(s) not found
collect2: ld returned 1 exit status
make[2]: *** [libaccessibility.dylib] Error 1
make[1]: *** [libs] Error 2
make: *** [default] Error 2

Is it not seeing the vtable at all? After googling I think it means I haven't defined all my virtual (nsWeakFrame) methods but I'm just not seeing that to be true.
Word on #developers is that this means we are out of luck for this approach.
Hmm does this have something to do with NS_NO_VTABLE?
What didn't compile when you left nsWeakFrame as-is?  Was it just the call to nsIPresShell::RemoveWeakFrame?  Would it work to, instead of modifying nsWeakFrame, split nsIPresShell::RemoveWeakFrame into Internal and External variants much like this.  (But the External one should be implemented in the .cpp file.)
Oh... you'd also need to add an external version of Init; probably easiest if it also just calls a virtual method on the pres shell.  (It would have to be inline, call Clear first, and then, if aFrame, call a method on the presshell that just calls nsWeakFrame::Init.)
(In reply to comment #71)
> What didn't compile when you left nsWeakFrame as-is?

Undefined symbols:
  "nsIPresShell::RemoveWeakFrame(nsWeakFrame*)", referenced from:
      nsWeakFrame::Clear(nsIPresShell*)       in nsAccessibleTreeWalker.o
  "nsWeakFrame::Init(nsIFrame*)", referenced from:
      nsWeakFrame::nsWeakFrame(nsIFrame*)in nsAccessibleTreeWalker.o


Hmmm yeah, I'll try your suggestions thanks.
David, do we have to remove the beta2-fixed flag too? It looks like.
David, over Twitter I received several reports from people using Windows Vista and ZoomText by AISquared and reporting big instability problems with Firefox 3.6. Since ZT is a screen magnifier, I obviously can't use it like a low-vision user would. But this might be something to go on as well.
I've removed the beta2-fixed flag (thanks Henrik).

Marco, yes, I'm going to be heads down on making our frame usage safer, but if anyone can help diagnose the the main usage cause (with STR) in the meantime that would be great.
(In reply to comment #72)
> Oh... you'd also need to add an external version of Init; probably easiest if
> it also just calls a virtual method on the pres shell.  (It would have to be
> inline, call Clear first, and then, if aFrame, call a method on the presshell
> that just calls nsWeakFrame::Init.)

To followup here, unfortunately I got the same linkage error from comment #68.

I'm going to cut my losses and create API (have patch locally) that will throw an nsWeakFrame on the heap and return a pointer to it (caller cleans up). A bit sad but it should work.
It was pointed out to me that you need to have a viable frame in order to create a weakframe (otherwise creating a weakframe would crash right away in the current implementation). It isn't exactly what I want but I can work with that.

Can someone reasure me that if I have a weakframe and I do weakframe->IsAlive() it will return false if the actual frame is poisoned?
(I lied, trying one more thing on dbaron's approach)
(In reply to comment #72)
> Oh... you'd also need to add an external version of Init; probably easiest if
> it also just calls a virtual method on the pres shell.  (It would have to be
> inline, call Clear first, and then, if aFrame, call a method on the presshell
> that just calls nsWeakFrame::Init.)

The gotcha here is that the init chain is called during nsWeakFrame construction, so I can't pass 'this' to the presShell method for a subsequent call to nsWeakFrame::Init (which is not static).
An idea (since my changes are not small so far)...

Could we turn off frame poisoning if accessibility is invoked? (For 3.6 only)
No.

Also, it seems like an awful lot of the users who hit this crash (over a third) are using the "Tab Mix Plus" extension at https://addons.mozilla.org/addon/1122
(In reply to comment #80)
> (In reply to comment #72)
> > Oh... you'd also need to add an external version of Init; probably easiest if
> > it also just calls a virtual method on the pres shell.  (It would have to be
> > inline, call Clear first, and then, if aFrame, call a method on the presshell
> > that just calls nsWeakFrame::Init.)
> 
> The gotcha here is that the init chain is called during nsWeakFrame
> construction, so I can't pass 'this' to the presShell method for a subsequent
> call to nsWeakFrame::Init (which is not static).

I don't follow.  Could you attach the patch you had so I can take a look?
Attached patch WIP (linkage problems) (obsolete) — Splinter Review
(In reply to comment #83)
> (In reply to comment #80)
> > (In reply to comment #72)
> > > Oh... you'd also need to add an external version of Init; probably easiest if
> > > it also just calls a virtual method on the pres shell.  (It would have to be
> > > inline, call Clear first, and then, if aFrame, call a method on the presshell
> > > that just calls nsWeakFrame::Init.)
> > 
> > The gotcha here is that the init chain is called during nsWeakFrame
> > construction, so I can't pass 'this' to the presShell method for a subsequent
> > call to nsWeakFrame::Init (which is not static).
> 
> I don't follow.  Could you attach the patch you had so I can take a look?

Sure. Here's where I left off (not sure if there's kruft).

I'm moved along quite far along the heap alloc approach (which I can switch back to using this approach if you can help me with the linking).
I probably misunderstood you and I really appreciate you taking a look.
Hmm that patch unfortunately went further with some trial and error stuff (duplicating the body of InitInternal for example). Sorry.
To be clear here's where I am:

I gave up on trying to expose nsWeakFrame such that I could stack allocate an nsWeakFrame via nsWeakFrame wf = nsWeakFrame(aFrame); due to the linkage pain.

I proceeded with adding nsIFrame::GetWeakFrame API that heap allocates an nsWeakFrame (with cleanup responsibility current on the caller). I have proceeded to use this API in accessibility code and am just debugging and fixing some errors I have.

If you can solve the linkage problem with the first way (above) I'll happily take that and combine/rework what I've done in the second approach.

Do you want me to clean up that patch I posted for you to look at? (IRC me?)
(In reply to comment #78)
> It was pointed out to me that you need to have a viable frame in order to
> create a weakframe (otherwise creating a weakframe would crash right away in
> the current implementation). It isn't exactly what I want but I can work with
> that.

The idea is that right after getting a valid nsIFrame pointer, before there's any possibility of the frame being deleted, you stash it in an nsWeakFrame. Then you can go ahead and flush and do whatever else might cause frames to be deleted. Later you come back to your nsWeakFrame and you can test if the frame is still alive.
Yep, got it Roc but...

I'm finding that deleting my (new local api) heap allocated nsWeakFrame (in stack frame 3 below) doesn't clean up well inside nsIFrame.

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0xc785b9b3
0x15122abd in nsIFrame::GetStyleContext (this=0xc785b99b) at nsIFrame.h:642
642	  nsStyleContext* GetStyleContext() const { return mStyleContext; }
(gdb) bt
#0  0x15122abd in nsIFrame::GetStyleContext (this=0xc785b99b) at nsIFrame.h:642
#1  0x15122ad3 in nsIFrame::PresContext (this=0xc785b99b) at nsIFrame.h:499
#2  0x15137644 in nsWeakFrame::~nsWeakFrame (this=0x1e1717e0) at nsIFrame.h:2537
#3  0x151493fa in nsAccessibleTreeWalker::PopState (this=0xbfffc85c) at /Users/dtb/mozworkspace/src/accessible/src/base/nsAccessibleTreeWalker.cpp:149
#4  0x151498ba in nsAccessibleTreeWalker::~nsAccessibleTreeWalker (this=0xbfffc85c) at /Users/dtb/mozworkspace/src/accessible/src/base/nsAccessibleTreeWalker.cpp:70
#5  0x15143fbc in nsAccessible::CacheChildren (this=0x1e38ae10) at /Users/dtb/mozworkspace/src/accessible/src/base/nsAccessible.cpp:770
#6  0x151801b9 in nsHyperTextAccessible::CacheChildren (this=0x1e38ae10) at /Users/dtb/mozworkspace/src/accessible/src/html/nsHyperTextAccessible.cpp:217

Could the mStyleContext go bad before nsWeakFrame dies?  There is probably some life-cycle here detail I don't know.
No where near polished, but posting because of weird crash (see comment 89)
Attached patch WIP (linkage problems fixed) (obsolete) — Splinter Review
Basically, what you did to layout was exactly right, except you need to do the same thing for AddWeakFrame that you did for RemoveWeakFrame. :-)

I also wrote something more like what I think the accessibility changes would need to look like -- using nsWeakFrame to store the frame pointer.  (The point of nsWeakFrame is that it's a frame pointer that gets notified and set to null if the frame gets deleted, so you have to store the problematic long-lived pointer in an nsWeakFrame.)  But I did them rather quickly, and I'm not very familiar with the accessibility code?

Could you continue work on this based on this?
(In reply to comment #91)
> Created an attachment (id=412286) [details]
> WIP (linkage problems fixed)
> 
> Basically, what you did to layout was exactly right, except you need to do the
> same thing for AddWeakFrame that you did for RemoveWeakFrame. :-)

Thanks!

> I also wrote something more like what I think the accessibility changes would
> need to look like -- using nsWeakFrame to store the frame pointer.  (The point
> of nsWeakFrame is that it's a frame pointer that gets notified and set to null
> if the frame gets deleted, so you have to store the problematic long-lived
> pointer in an nsWeakFrame.)

Ah yes, this patch just had dummy usage of the call to make sure I could link. The more elaborate usage was in the patch you noticed later ;)

>  But I did them rather quickly, and I'm not very
> familiar with the accessibility code?
> 
> Could you continue work on this based on this?

Yes, thanks a bunch!
(In reply to comment #91)
> Created an attachment (id=412286) [details]
> WIP (linkage problems fixed)
>

David Baron, FF crashes on startup (not in accessibility) with your patch -- did this happen for you? (I'm updating my tree and will rebuild).
(In reply to comment #94)
> David Baron, FF crashes on startup (not in accessibility) with your patch --
> did this happen for you? (I'm updating my tree and will rebuild).

It didn't happen for me.  Maybe you didn't rebuild enough and some things are expecting the old layout of nsIPresShell's vtable?
Yep, full clobber build launches -- sorry for the noise and thanks again :)
Attached patch patch 1 (obsolete) — Splinter Review
roc, dbaron, can you look this one over? I probably should sr one of you?

dbaron, it should be easy to r+ the changes to layout ;), and I like what you had started with the accessibility tree walker (it closely matches an earlier local patch I have and we can't both be wrong), I just added one small addition to pass the frame(hint) to GetAccessible since it is an in/out param.

I added some frame paranoia to nsAccessibilityService where we were getting bitten on this bug and a few other spots. (Overkill?)

Unsurprisingly, we pass all accessibility mochitests, including file uploader Marco :)
Attachment #410512 - Attachment is obsolete: true
Attachment #410578 - Attachment is obsolete: true
Attachment #410805 - Attachment is obsolete: true
Attachment #411920 - Attachment is obsolete: true
Attachment #412263 - Attachment is obsolete: true
Attachment #412284 - Attachment is obsolete: true
Attachment #412286 - Attachment is obsolete: true
Attachment #412368 - Flags: review?(surkov.alexander)
Attachment #412368 - Flags: review?(roc)
Attachment #412368 - Flags: review?(dbaron)
Attachment #412368 - Flags: superreview?(roc)
Attachment #412368 - Flags: review?(roc)
Attachment #412368 - Flags: review-
+  weakFrame = *aFrameHint;
 #ifdef DEBUG_A11Y
   static int frameHintFailed, frameHintTried, frameHintNonexistant, frameHintFailedForText;
   ++frameHintTried;
 #endif
-  if (!frame || content != frame->GetContent()) {
+  if (!weakFrame.GetFrame() || content != weakFrame.GetFrame()->GetContent()) {

What is the point of weakFrame here? the frame can't be destroyed between setting weakFrame and testing weakFrame.GetFrame(). weakFrame.GetFrame() is guaranteed to be exactly *aFrameHint, whether that's bogus or not.

You need to rev the IID on nsIPresShell. That's OK on trunk, but unfortunately we can't do that on the 1.9.2 branch. For the branch you'll need to create a new interface, say nsIPresShell_1_9_2_BRANCH, that contains the methods you need and that PresShell can be QIed to.
(In reply to comment #98)
> +  weakFrame = *aFrameHint;
>  #ifdef DEBUG_A11Y
>    static int frameHintFailed, frameHintTried, frameHintNonexistant,
> frameHintFailedForText;
>    ++frameHintTried;
>  #endif
> -  if (!frame || content != frame->GetContent()) {
> +  if (!weakFrame.GetFrame() || content != weakFrame.GetFrame()->GetContent())
> {
> 
> What is the point of weakFrame here? the frame can't be destroyed between
> setting weakFrame and testing weakFrame.GetFrame(). weakFrame.GetFrame() is
> guaranteed to be exactly *aFrameHint, whether that's bogus or not.

Right, the assignment there is actually unnecessary. Also, we could just null check *aFrameHint here but it is low cost to leave this guard in. (?)

> You need to rev the IID on nsIPresShell. That's OK on trunk, but unfortunately
> we can't do that on the 1.9.2 branch. For the branch you'll need to create a
> new interface, say nsIPresShell_1_9_2_BRANCH, that contains the methods you
> need and that PresShell can be QIed to.

OK I'll simply rev for this (trunk) patch. Hmmm I might not get to the 192 interface until Sunday night, Monday morning.
Note there are some other places in accessible code (e.g. nsHyperTextAccessible) where which we'll need to investigate in follow up bugs, but I haven't addressed them here. We still aren't certain when the frames where getting deallocated right? (Since we removed our calls to flush layout).
Rev'ed the PresShell uuid, removed two bogus msWeakFrame assignments and added some comments.
Attachment #412368 - Attachment is obsolete: true
Attachment #412421 - Flags: review?(surkov.alexander)
Attachment #412421 - Flags: review?(roc)
Attachment #412368 - Flags: superreview?(roc)
Attachment #412368 - Flags: review?(surkov.alexander)
Attachment #412368 - Flags: review?(dbaron)
i just experienced this crash twice during firefox startup when the addon-update notifier was showing up. some other users report the same via the crash reporter...
(In reply to comment #97)

> I added some frame paranoia to nsAccessibilityService where we were getting
> bitten on this bug and a few other spots. (Overkill?)

Yes, until you point the code what can destroy the frame. We don't flush layout (we can flush layout in nsAccessibilityService::InitAccessible but we don't use frame at this point already), we don't change the DOM and we don't have crash reports pointing to nsAccessibilityService::GetAccessible(). I think this protection is unnecessary.
I would prefer to see method GetFrame or GetStateFrame instead of mState.frame.GetFrame().

> http://lxr.mozilla.org/seamonkey/source/accessible/accessible-docs.html

that's a bit out of date. Either remove this comment or change it on something more valid.
(In reply to comment #102)
> i just experienced this crash twice during firefox startup when the
> addon-update notifier was showing up. some other users report the same via the
> crash reporter...

I also was able to finally reproduce this top crasher. But ONLY on my Win 7 64-bit installation with the same profile that I previously used on Windows XP 32 bit, where I never saw this crash even with add-on updates coming up.

Right now, Firefox 3.6b2 never starts for me with that profile, it imeediately crashes. See these reports:
bp-bb3549f2-c8bb-432d-8767-fe56a2091114
bp-ae9a8022-04e6-4685-b623-a188a2091114
bp-beff5628-ea92-41d5-aa3f-e2d6f2091114
bp-43c9a35f-69d4-4749-9d8e-251da2091114
Comment on attachment 412421 [details] [diff] [review]
patch 2 (trunk version)

I think I can give conditional r+
Attachment #412421 - Flags: review?(surkov.alexander) → review+
(In reply to comment #103)
> (In reply to comment #97)
> 
> > I added some frame paranoia to nsAccessibilityService where we were getting
> > bitten on this bug and a few other spots. (Overkill?)
> 
> Yes, until you point the code what can destroy the frame. We don't flush layout
> (we can flush layout in nsAccessibilityService::InitAccessible but we don't use
> frame at this point already), we don't change the DOM and we don't have crash
> reports pointing to nsAccessibilityService::GetAccessible(). I think this
> protection is unnecessary.

I've had similar thoughts, but since the crashes were actually happening inside nsAccessibilityService::GetAccessible. I think we should err on the side of caution for now.

Now that we can recreate the bug, we can fine tune it later.
(In reply to comment #104)
> I would prefer to see method GetFrame or GetStateFrame instead of
> mState.frame.GetFrame().
> 
> > http://lxr.mozilla.org/seamonkey/source/accessible/accessible-docs.html
> 
> that's a bit out of date. Either remove this comment or change it on something
> more valid.

Yes thanks, updated locally.
Pushed http://hg.mozilla.org/mozilla-central/rev/f360de4b8b08

I will roll the 192 version... might as well add to blassey's nsIPresShell_MOZILLA_1_9_2.
A quick note that I've been able toverify using a local build that this patch fixes the bug. The STR I used were:

1. Install Firebug 1.5xb2, which is one version older than the current b3 release.
2. Restart Minefield, and throubh about:config, set the option to enable notification of add-on updates to true.
3. Closed Minefield and restarted it.

In a regular nightly build from yesterday, on Windows 7 only, I crash when the add-on updates UI should appear. This is consistent with reports we receivedd. See this report which I submitted: bp-3d3dec95-2a49-4190-ac4c-846392091116

In the build I made after David landed the fix above, the crash no longer happens, and the add-ons UI appears.
Roc, is this what you had in mind? (BTW can you approve this 1.9.2?)
Attachment #412644 - Flags: review?(roc)
Attachment #412644 - Flags: approval1.9.2?
Attachment #412644 - Attachment description: (192 version) → patch 2 (192 version)
Comment on attachment 412644 [details] [diff] [review]
patch 2 (192 version)

Unrequesting 1.9.2 until I have r+ so that people don't get confused.
Attachment #412644 - Flags: approval1.9.2?
Attachment #412644 - Flags: approval1.9.2?
Note to 1.9.2 drivers, this landed on trunk and is a confirmed fix. The 1.9.2 patch moves some virtual methods (in layout) into a 192 specific interface to make sure we are not breaking anybody.
Comment on attachment 412644 [details] [diff] [review]
patch 2 (192 version)

a1.9.2=dbaron
Attachment #412644 - Flags: approval1.9.2? → approval1.9.2+
Landed on 1.9.2. http://hg.mozilla.org/releases/mozilla-1.9.2/rev/4eda96e67e9a

Thanks everyone!
Status: REOPENED → RESOLVED
Closed: 15 years ago15 years ago
Resolution: --- → FIXED
Whiteboard: [needs landing][crashkill][#7 Firefox 3.6b1 topcrash] → [crashkill][#7 Firefox 3.6b1 topcrash]
Depends on: 529397
Whiteboard: [crashkill][#7 Firefox 3.6b1 topcrash] → [crashkill][#7 Firefox 3.6b1 topcrash][crashkill-fix]
Crash Signature: [@ nsAccessibilityService::GetAccessible(nsIDOMNode*, nsIPresShell*, nsIWeakReference*, nsIFrame**, int*, nsIAccessible**)]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: