crash in nsTArray_Impl<mozilla::dom::workers::WorkerFeature*, ... >, often at 0x5a5a5caa address (Web Worker)

RESOLVED FIXED

Status

()

defect
--
critical
RESOLVED FIXED
5 years ago
4 years ago

People

(Reporter: kairo, Unassigned)

Tracking

({crash})

Trunk
x86
Windows NT
Points:
---

Firefox Tracking Flags

(firefox33+ wontfix, firefox34+ wontfix, firefox35+ wontfix, firefox36+ wontfix, firefox37- unaffected, firefox38- unaffected)

Details

(crash signature)

This bug was filed from the Socorro interface and is 
report bp-945b1211-2a62-4a39-8380-3e89e2141110.
=============================================================

Top frames:
0 	xul.dll 	nsTArray_Impl<mozilla::dom::workers::WorkerFeature*, nsTArrayInfallibleAllocator>::IndexOf<mozilla::dom::workers::WorkerFeature*, nsDefaultComparator<mozilla::dom::workers::WorkerFeature*, mozilla::dom::workers::WorkerFeature*> >(mozilla::dom::workers::WorkerFeature* const&, unsigned int, nsDefaultComparator<mozilla::dom::workers::WorkerFeature*, mozilla::dom::workers::WorkerFeature*> const&) 	obj-firefox/dist/include/nsTArray.h
1 	xul.dll 	mozilla::dom::workers::WorkerPrivate::RemoveFeature(JSContext*, mozilla::dom::workers::WorkerFeature*) 	dom/workers/WorkerPrivate.cpp
2 	xul.dll 	mozilla::dom::workers::XMLHttpRequest::Unpin() 	dom/workers/XMLHttpRequest.cpp
3 	xul.dll 	`anonymous namespace'::LoadStartDetectionRunnable::ProxyCompleteRunnable::WorkerRun(JSContext*, mozilla::dom::workers::WorkerPrivate*) 	dom/workers/XMLHttpRequest.cpp
4 	xul.dll 	mozilla::dom::workers::WorkerRunnable::Run() 	dom/workers/WorkerRunnable.cpp
5 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp

This is one of the top crashes in early 33.1 data, we also have seen it in 34 beta versions.
Many of those crashes, but not all, have a 0x5a5a5caa address listed, which is an offset from our poison-on-free value.

For more reports, see https://crash-stats.mozilla.com/report/list?signature=nsTArray_Impl%3Cmozilla%3A%3Adom%3A%3Aworkers%3A%3AWorkerFeature*%2C+nsTArrayInfallibleAllocator%3E%3A%3AIndexOf%3Cmozilla%3A%3Adom%3A%3Aworkers%3A%3AWorkerFeature*%2C+nsDefaultComparator%3Cmozilla%3A%3Adom%3A%3Aworkers%3A%3AWorkerFeature*%2C+mozilla%3A%3Adom%3A%3Aworkers%3A%3AWorkerFeature*%3E+%3E%28mozilla%3A%3Adom%3A%3Aworkers%3A...#tab-reports
Note that it looks like most of these crashes seem to be on Google Maps.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #1)
> This might be related to bug 1063538 or bug 1091002.

Can you say with enough certainty that the patch to either of them will fix this one? The crash in here is the #2 topcrash in 33.1 so far, with 2.4% of all crashes, and is almost exclusive to Google Maps. If we happen to do any followup release too 33.1, we should see to have this fixed.
Flags: needinfo?(khuey)
Indeed, if the patch is low risk, we would probably do a 33.1.1 with this patch.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #3)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #1)
> > This might be related to bug 1063538 or bug 1091002.
> 
> Can you say with enough certainty that the patch to either of them will fix
> this one?

No, I can't, which is why I said this *might* be related.
Flags: needinfo?(khuey)
OK, what strategy would you advise for this then? This crash could be a driver for a 33.1.1 if we actually get fairly confident we have a fix.
Flags: needinfo?(khuey)
It might be good to land this patch in Beta now, for Beta 9 since the crash is showing up in Beta as well.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #6)
> OK, what strategy would you advise for this then? This crash could be a
> driver for a 33.1.1 if we actually get fairly confident we have a fix.

Finding STR would be a good start.

(In reply to Liz Henry :lizzard from comment #7)
> It might be good to land this patch in Beta now, for Beta 9 since the crash
> is showing up in Beta as well.

There is no patch ...
Flags: needinfo?(khuey)
Kyle, the patch in bug 1091002 mentioned in comment 3. But, ok, never mind.
Ben, any thoughts on this would be greatly appreciated.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #8)
> Finding STR would be a good start.

So, far, all we have is crash data and URLs from that. The URLs point almost entirely to Google Maps, as I mentioned before.

Also, it affects all Windows versions (XP, Vista, 7, 8, 8.1) and only Firefox 33.1 and some 34 betas. We do not see this signature in 33.0.x, interestingly. It's mostly, but not exclusively happening with Intel graphics and the count of installations is almost 1:1 with the amount of crashes, so it seems to not happen multiple times to most users.
That makes me believe it's a race condition of some sort that is more likely to happen with slower GPUs.
It looks a lot to me like https://crash-stats.mozilla.com/report/list?signature=nsTArray_Impl%3CgfxFont%2A%2C%20nsTArrayInfallibleAllocator%3E%3A%3AIndexOf%3CgfxFont%2A%2C%20nsDefaultComparator%3CgfxFont%2A%2C%20gfxFont%2A%3E%20%3E%28gfxFont%2A%20const%26%2C%20unsigned%20int%2C%20nsDefaultComparator%3CgfxFont%2A%2C%20gfxFont%2A%3E%20const%26%29 is the same crash as the stacks are also in workers with XHR and that signature appears in versions where we do not see the other one.
And in other versions (like 33.0.3), we see https://crash-stats.mozilla.com/report/list?signature=nsTArray_Impl%3Cmozilla%3A%3ACSSStyleSheet%2A%2C%20nsTArrayInfallibleAllocator%3E%3A%3AIndexOf%3Cmozilla%3A%3ACSSStyleSheet%2A%2C%20nsDefaultComparator%3Cmozilla%3A%3ACSSStyleSheet%2A%2C%20mozilla%3A%3ACSSStyleSheet%2A%3E%20%3E%28mozilla%3A%3ACSSStyleSheet%2A%20const%26%2C%20unsigned%20int%2C%20nsDefaultComparator%3Cmozilla%3A%3ACSS...#tab-reports

Also, given that all those only came up towards the end of last week across different versions and all those have mostly Google Maps URLs, I suspect that Google changed something on Maps late last week that triggers us running into those crashes.
Crash Signature: , mozilla::dom::workers::WorkerFeature*> >(mozilla::dom::workers:...] → , mozilla::dom::workers::WorkerFeature*> >(mozilla::dom::workers:...] [@ nsTArray_Impl<gfxFont*, nsTArrayInfallibleAllocator>::IndexOf<gfxFont*, nsDefaultComparator<gfxFont*, gfxFont*> >(gfxFont* const&, unsigned int, nsDefaultComparator<gfxFont*, gfxFon…
We contacted Google, they did push a new change on Google Maps. They disabled it for now. So, wontfix for 33.
I can confirm that this has vanished from 33.1 stats now that Google Maps has disabled the feature. We still should work on getting a proper fix on our side but we don't need to do it under pressure any more.
Agreed. jst has asked for a way to reenable the feature in maps so that we can continue to debug. That should happen within a few weeks time. We can investigate a fix in the 35 time frame.
There's not a whole lot that I can deduce from the minidumps we have collected...

There are only two places where we delete a WorkerPrivate object, and neither are possible with the stacks we're seeing, so I'm willing to bet money that the WorkerPrivate object is alive and well.

The problem is most likely that the ProxyCompleteRunnable is holding an XMLHttpRequest that has already been unpinned and deleted, so the ProxyCompleteRunnable::mWorkerPrivate member has the poisoned value 0x5a5a5a5a. Then RemoveFeature is attempting to modify mFeatures, probably with an offset that lands it at 0x5a5a5cc2 (which is where I see most of the crashes in the reports happening). nsTArray then derefs that address trying to find the length of the array.

One fix we could try is making the ProxyCompleteRunnable hold a strong ref to XMLHttpRequest, but I am not sure that's really the problem.
wontfixing on 35 as this doesn't appear to have been re-enabled (yet?). Ben, when you get back will you update with any new information from trying your fix suggestion in comment 18?
Flags: needinfo?(bent.mozilla)
Not sure how to proceed here because Google has apparently removed whatever it was that tripped us up before. jst, any news on what it was or how to re-enable it?
Flags: needinfo?(bent.mozilla) → needinfo?(jst)
Heads up, we have enabled Web Workers on http://google.com/maps excluding user agent strings that contain "Firefox".
Ben, see previous comment, it's enabled again if you tweak the UA string.
Flags: needinfo?(jst)
So far in my testing I have only been able to trigger bug 1091002 on release. It's fixed on beta, so I think we can either let it ride the train and try again with FF 36 or we can backport it to a point release.
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #23)
> So far in my testing I have only been able to trigger bug 1091002 on
> release. It's fixed on beta, so I think we can either let it ride the train
> and try again with FF 36 or we can backport it to a point release.

Given that this bug isn't currently affecting the 35 release, I don't think it's worth backporting. Given that this is fixed in 36+ (how did you test? can we get someone else to confirm?) it sounds like we should resolve this bug as fixed.
(In reply to Lawrence Mandel [:lmandel] (use needinfo) from comment #24)
> (In reply to ben turner [:bent] (use the needinfo? flag!) from comment #23)
> Given that this bug isn't currently affecting the 35 release,

Well, that's only because Google has blacklisted Firefox from using the new worker code on Google Maps. As soon as they flip that back we'll see more crashes.

> Given that this is fixed in 36+ (how did you test?
> can we get someone else to confirm?)

I used my own build of 35, and set the hidden pref "general.useragent.override" to a Chrome string and then did everything I could think of on Google Maps. The steps that triggered problems were somehow getting the maps to start loading new tiles, but then panning far away before they finished loading. That triggers some code to abort the previous tile loads, and that seemed to trip us up.

Once I applied the patch from bug 1091002 on that build of 35 I was unable to see any problems.
> That triggers some code to abort the previous tile loads, and that seemed to trip us up.

Another reliable way to abort tile downloads is to throttle your network connection (e.g. 50 Kbps, 500ms RTT) and zoom in or out many times. The XHR for the intermediate zoom levels should be aborted.
(In reply to John Tantalo from comment #26)

Thanks John!
Wontfix for the 36 release. Tracking for 37 as we want to have this bug fixed (sooner or later).
(In reply to Sylvestre Ledru [:sylvestre] from comment #28)
> Wontfix for the 36 release. Tracking for 37 as we want to have this bug
> fixed (sooner or later).

AFAICT from comment 25, this bug has actually been fixed in 36 by bug 1091002. Once 36 hits release we should be able to ask Google to remove the hack noted in comment 21. Can we resolve this as fixed?
Well, assuming that bug 1091002 is the only problem, yes, this would be a duplicate. It's always possible that there is another problem lurking that I haven't been able to reproduce though.
This seems to be resolved now, as these crash signatures are no longer appearing.  I don't think we need to track it.
(In reply to John Tantalo from comment #21)
> Heads up, we have enabled Web Workers on http://google.com/maps excluding
> user agent strings that contain "Firefox".

John, we think this is fixed in Firefox now.  Can you enabled Web Workers for Firefox UAs?
Flags: needinfo?(tantalo)
John, Lawrence brought up another good point: is there a way to limit turning on Web Workers for Google Maps to certain versions of Firefox?
Yes, we could restrict Web Workers by User Agent or feature detection.

For UA, what versions of Firefox should get Web Workers? Is the version number in the UA string?

For feature detection, I need a piece of code that can repro the failure case and classify the browser as capable or not.

The feature detection test may not be feasible in this case, but the UA test should be sufficient.

When I re-enable Web Workers for Firefox, I can gradually increase the number of users with the feature to 100%. Meanwhile we should monitor the crash rate; if it rises I can rollback the feature. Can somebody provide the crash rate statistics to me during the rollout?
Flags: needinfo?(tantalo)
(In reply to John Tantalo from comment #34)
> Yes, we could restrict Web Workers by User Agent or feature detection.

Great, thanks!

> For UA, what versions of Firefox should get Web Workers? Is the version
> number in the UA string?

I'm pretty sure the version number is in the UA string.  "Mozilla/5.0 (X11; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/xx.y" where xx >= 37 is probably what you're looking for.  Lawrence, is that correct?

> The feature detection test may not be feasible in this case, but the UA test
> should be sufficient.

Yeah, the feature will be there but the crash likely won't be detectible :)

> When I re-enable Web Workers for Firefox, I can gradually increase the
> number of users with the feature to 100%. Meanwhile we should monitor the
> crash rate; if it rises I can rollback the feature. Can somebody provide the
> crash rate statistics to me during the rollout?

Lawrence can hook you up with that.
Flags: needinfo?(lmandel)
(In reply to Andrew Overholt [:overholt] from comment #35)
> (In reply to John Tantalo from comment #34)
> > Yes, we could restrict Web Workers by User Agent or feature detection.
> 
> Great, thanks!
> 
> > For UA, what versions of Firefox should get Web Workers? Is the version
> > number in the UA string?
> 
> I'm pretty sure the version number is in the UA string.  "Mozilla/5.0 (X11;
> Linux x86_64; rv:39.0) Gecko/20100101 Firefox/xx.y" where xx >= 37 is
> probably what you're looking for.  Lawrence, is that correct?

Yes. Firefox's UA string structure is defined in the following doc: 

https://developer.mozilla.org/en-US/docs/Web/HTTP/Gecko_user_agent_string_reference
 
> > When I re-enable Web Workers for Firefox, I can gradually increase the
> > number of users with the feature to 100%. Meanwhile we should monitor the
> > crash rate; if it rises I can rollback the feature. Can somebody provide the
> > crash rate statistics to me during the rollout?
> 
> Lawrence can hook you up with that.

Yes. Please let me know when you enable the feature. We can check in our crash reporting system (crash-stats.mozilla.com) to review crash reports.

ni Kairo, who runs our stability program, to make him aware of this upcoming change.
Flags: needinfo?(lmandel) → needinfo?(kairo)
I'm reading bugmail here anyhow, I'll be watching for something to get enabled. You can also click at the links in the Crash Signature field and/or the one at the end of comment #0 here in the bug and should be able to find new reports coming in.
Flags: needinfo?(kairo)
John, I am the release manager for the next Firefox release (38) which is going to be live in ~7 weeks. Please let me know if I can help you in any way.
Flags: needinfo?(tantalo)
Summary: crash in nsTArray_Impl<mozilla::dom::workers::WorkerFeature*, ... >, often at 0x5a5a5caa address → crash in nsTArray_Impl<mozilla::dom::workers::WorkerFeature*, ... >, often at 0x5a5a5caa address (Web Worker)
I see Firefox 37 is now the prevalent version, so I will begin to rollout a change to enable Web Workers on Firefox 37 (and higher) this week. My expected timeline is,

  Mon 13 April 3pm: 1%
  Tue 14 April 3pm: 10%
  Wed 15 April 3pm: 50%
  Thu 16 April 3pm: 100%

*all times GMT-0700 (PDT)
Flags: needinfo?(tantalo)
KaiRo, can you keep an eye on crashstats for any spikes in the crash data?
Flags: needinfo?(kairo)
(In reply to John Tantalo from comment #39)

Thanks for the heads up John!
I'm always keeping an eye on crash-stats. CCing dmajor though as he is looking at them quite a bit as well.
Web Workers on Google Maps are enabled for 100% of Firefox 37+ clients as of 2015-04-17 09:24 PDT.
Glancing at crashstats, I don't see any spikes in relevant signatures on our end.
Yes, still no crashes with the signatures of this bug.

The largest crashes with 37.0.1 release on google maps in the last 7 days seem to be bug 1152972 and bug 1151736, FWIW.
Flags: needinfo?(kairo)
Crash Signature: , nsDefaultComparator<mozilla::CSS... ] → , nsDefaultComparator<mozilla::CSS... ] [@ nsTArray_Impl<T>::IndexOf<T> ]
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.