Closed Bug 716945 Opened 13 years ago Closed 13 years ago

Figure out why Flash hang volume dropped very significantly in 10.0a1 trunk on 2011-10-29

Categories

(Core :: General, defect)

Platform: x86, Windows 7
Priority: Not set
Severity: normal

Tracking


Status: RESOLVED FIXED
Target Milestone: mozilla13
Tracking Status
firefox10 + wontfix
firefox11 + fixed
firefox12 + fixed

People

(Reporter: kairo, Assigned: felix)

References

Details

(Whiteboard: [qa-])

Attachments

(1 file)

Not sure what the best place to file this is, as it's just an investigation of something good that happened.
It would be very good to know how we fixed significant problems for our users, so that we understand the mechanisms of Flash hangs better.

Looking at https://crash-analysis.mozilla.com/rkaiser/firefox.10.0a1.flashsummary.html you'll see that on 2011-10-29 the volume of Flash hangs dropped significantly, falling in the following days to less than 50% of what it was before. (On that first day, a lot of nightlies from the previous day were presumably still in use, so the full extent is only visible the day after, but the day the drop started is probably the Nightly that contained the fix.) This was not coupled with a drop in ADUs (we only saw that later when Nightly changed to 12.0a1, where volume drops again, of course).
This trend continued with the switch of Aurora and now Beta from 9 to 10: comparing https://crash-analysis.mozilla.com/rkaiser/firefox.10.0.flashsummary.html to the period when https://crash-analysis.mozilla.com/rkaiser/firefox.9.0.flashsummary.html tracked Beta shows our Flash hangs dropping by almost a factor of 10, which is incredibly significant. (On Beta we usually see much more intense Flash usage than on Aurora and Nightly, so a stronger effect was somewhat expected, but it's still a huge difference between 3000-3600 hangs on 9 Beta and 300-450 on 10 Beta.)

The assumption is that something in http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2011-10-28&enddate=2011-10-29+04%3A00%3A00 caused this tremendous drop in Flash hang volume, but it would be really good to know exactly what we did to make things so much better for users, or at least to have a really good educated guess.
Was this seen on both Windows and Mac or just Windows?

Are we sure that this is a real improvement, or are we perhaps just failing to submit hang reports? I've seen a lot of people getting "Failed to submit crash report" from plugin reports...

Was there any corresponding drop in plugin crash (non-hang) reports? I see that bug 665196 landed on the 27th, so I strongly suspect that the UI change is the root cause here. Either a bug is causing us not to submit the report, or people are misreading the UI and more of them are choosing not to submit a report.
Component: IPC → General
QA Contact: ipc → general
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #1)
> Was this seen on both Windows and Mac or just Windows?

I don't have that breakdown, would need to create it.

> Are we sure that this is a real improvement, or perhaps we're just failing
> to submit hang reports?

That's surely a possibility.

> Was there any corresponding drop in plugin crash (non-hang) reports?

Actually, there was, and your explanation of bug 665196 sounds somewhat likely. Does that mean we should back this out for 10? We are getting late in Beta with it now...
I could theorize that the stream we use to submit the report only stays alive as long as the current page is loaded, which means that we cancel the stream as soon as we reload. Now that we're submitting-and-reloading all at once, most users aren't submitting reports.

So yeah, we could back out the patch. It's got l10n and UI impact, so it's not a terribly easy thing to back out of release channels.
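
To make the theory concrete, here's a minimal sketch of that race. All names here (submitAndReload, the URL, the dumpID parameter) are illustrative stand-ins, not the actual crash-reporter code:

// Sketch only: the XHR carrying the report lives in the page that is
// about to be torn down.
function submitAndReload(win, dumpID) {
  const xhr = new win.XMLHttpRequest();
  xhr.open("POST", "https://crash-reports.example/submit"); // placeholder URL
  xhr.onload  = () => console.log("report " + dumpID + " accepted");
  xhr.onabort = () => console.log("report " + dumpID + " cancelled");
  xhr.send("dumpID=" + encodeURIComponent(dumpID));

  // Reloading right away destroys the document that owns the request's
  // load group; the in-flight POST can be cancelled before onload ever
  // fires, so the report silently never reaches the server.
  win.location.reload();
}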
Blocks: 665196
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #3)
> I could theorize that the stream we use to submit the report only stays
> alive as long as the current page is loaded, which means that we cancel the
> stream as soon as we reload. Now that we're submitting-and-reloading all at
> once, most users aren't submitting reports.

I'm looking into whether that's the case and whether there is a relatively simple fix...
OK, just as a data point, I can't make out any obvious hang signature pairs we're missing, but it's a bit hard to tell. I think we should for now assume that we're not missing any specific hangs but that we're just getting fewer reports overall (but keep in mind that this is an assumption).
What's interesting: in 10.0a1 Nightly data, the drop in hangs and crashes was similar, about a factor of 3 (though hard to tell exactly at the low volumes we have on Nightly). In Beta data (9.0 vs. 10.0), on the other hand, the drop is far larger for hangs than for crashes: hangs are down by almost a factor of 10 (~10-15% of 9.0 values), while crashes are down by only a quarter to a third as far as I can estimate from the numbers (~65-80% of 9.0 values).
FWIW, out of about 6 Flash crashes with the new UI, I have yet to see a successful crash report submission. Like David, I get the failed-submission message: http://www.flickr.com/photos/djst/6635768441
Talked this through with verdi: one additional scenario is that the new UI tells you to reload to fix the problem. However, if you don't click the reload button in the toolbar or error message and instead use the browser's regular reload button, the crash report doesn't get submitted, even though the UI seems to imply that it will.
I have a backout prepared for Aurora, see try results here: https://tbpl.mozilla.org/?tree=Try&rev=0a073bc8c072

Felix/dolske, unless you have a real fix in hand, I think we should go ahead and land this on Aurora. Do you agree?
Not sure if it's related, but I recently had a Flash-related crash report that wouldn't submit: https://crash-stats.mozilla.com/report/index/bp-1a3aa2e0-a92b-473e-bf6a-fbc542120121
I posted some thinking on this on IRC; I guess it should be preserved here as well:

<bsmedberg> so I think that somehow we're timing out or doing something else funky with the XHR doing the submission
<ted> seems odd, since we're using the same code as about:crashes
<KaiRo> ted: well, just thinking about the difference between about:crashes and the plugin crash reporter, in the former we navigate to a different URL, while in the latter we reload (the parent) - maybe the latter is more destructive than the former and we get destroyed before we can finish sending - after all, at navigation something stays around in the bfcache while reload doesn't do that, right?
<ted> KaiRo: certainly possible
<KaiRo> ted: I'm only theorizing without any detailed code knowledge of course

I have enough overview knowledge to make educated guesses there, I hope, but I surely don't know how the code itself works, so I leave it to those who do to dig deeper into this (if they think it's valuable) or not (if it's clearly a wrong idea).
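
If that theory holds, one possible shape of a fix (purely a sketch with made-up names, mirroring the example above, not a patch against the real code) would be to let the submission settle before reloading:

// Sketch only: defer the reload until the XHR settles, so the request
// no longer races the page teardown.
function submitThenReload(win, dumpID) {
  const xhr = new win.XMLHttpRequest();
  xhr.open("POST", "https://crash-reports.example/submit"); // placeholder URL
  // onloadend fires after load, error, or abort, so the page reloads
  // exactly once whether or not the submission succeeded.
  xhr.onloadend = () => win.location.reload();
  xhr.send("dumpID=" + encodeURIComponent(dumpID));
}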
https://hg.mozilla.org/releases/mozilla-aurora/rev/2878c9b2f02e
https://hg.mozilla.org/releases/mozilla-aurora/rev/03ff575be222
https://hg.mozilla.org/releases/mozilla-aurora/rev/18d1d5407dbe

Fixed on Aurora. We decided not to fix this on Beta for 10. I have not pushed the backout to -central, so I'm marking this tracking for 12 to make sure we either fix it or perform the backout again on the next Aurora.
I think those Aurora l10n changes are going to lead to problems without additional outreach.
To quote from my previous answer in email:

For the l10n impact, I suggest that we keep the new string, add back the old string, and add a fallback in content that picks up the new string if the old one's not there.

Something like
<!ENTITY report.please "&report.checkbox;">
either before or after the l10n file reference in pluginProblem.xml. If you get the order right, the double definition just gets dropped; otherwise we end up with the slightly odd string, but still a localized one.
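
To make the mechanics concrete, here's a sketch of what that inline DTD subset could look like (the chrome URL and root element are assumptions, not the actual file contents):

<!DOCTYPE bindings [
  <!-- localized strings; older localizations may not define report.please -->
  <!ENTITY % pluginDTD SYSTEM "chrome://mozapps/locale/plugins/pluginproblem.dtd">
  %pluginDTD;
  <!-- fallback: in XML the first declaration of an entity wins, so this
       only takes effect when the localized file above lacks report.please -->
  <!ENTITY report.please "&report.checkbox;">
]>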
I believed that advice was for the beta channel only; why can't we or localizers just retrieve the previous string from hg blame?
Aurora is string frozen, and the advice was for any string frozen tree.

I don't see any reason to debate string freezes, surely not in this bug, so I'll just ignore the "why" question.
What happens with this on central? Having to back this out again on Aurora for 12 shouldn't be necessary.
Backing this out again on Aurora is the plan unless Felix comes up with the correct fix.
Assignee: nobody → felix.the.cheshire.cat
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #17)
> Backing this out again on Aurora is the plan unless Felix comes up with the
> correct fix.

In order to avoid the l10n impact on Aurora, we should back out of m-c as well. We can re-land bug 665196 on FF13 once we have a fix in hand.
Blocks: 722394
This is a backout patch that matches the three changesets in comment 11, minus the change to nsExceptionHandler (which is harmless to leave as-is). Want to make sure I got this right before backing it out on trunk.
Attachment #592897 - Flags: feedback?(dolske)
bsmedberg, gavin, akeybl: apparently we missed doing this on trunk before the uplift to Aurora. Can we at least get this backout done very early on Aurora? And what should we do about trunk?
The patch in comment 19 is just a combined version of the changes in comment 14, minus an unrelated change that doesn't need to be backed out. They're the same from an l10n perspective - we should use comment 19's patch for any backouts.
So, can we please get the backout landed at least on Aurora 12 ASAP?
Maybe we can even land it on Nightly 13 so that we get better data there and can detect any regressions or improvements more easily, until we find a real fix for the problem?
Comment on attachment 592897 [details] [diff] [review]
backout of bug 665196

[Approval Request Comment]
String changes made by this patch: one string change
Attachment #592897 - Flags: approval-mozilla-aurora?
Comment on attachment 592897 [details] [diff] [review]
backout of bug 665196

[Triage Comment]
Approved for Aurora 12 - thanks Gavin.
Attachment #592897 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
This is still not backed out on Aurora, nor on central. Gavin, can you own that?
(In reply to Gavin Sharp (use gavin@gavinsharp.com for email) from comment #27)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/2087f423a1a4

https://hg.mozilla.org/mozilla-central/rev/2087f423a1a4
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla13
Is there anything here for QA to verify?
Whiteboard: [qa?]
Blocks: 725898
Whiteboard: [qa?] → [qa-]