Last Comment Bug 716945 - Figure out why Flash hang volume dropped very significantly in 10.0a1 trunk on 2011-10-29
: Figure out why Flash hang volume dropped very significantly in 10.0a1 trunk o...
Status: RESOLVED FIXED
[qa-]
:
Product: Core
Classification: Components
Component: General (show other bugs)
: Trunk
: x86 Windows 7
: -- normal (vote)
: mozilla13
Assigned To: Felix Fung (:felix)
:
:
Mentors:
Depends on:
Blocks: 665196 722394 723970 725898
  Show dependency treegraph
 
Reported: 2012-01-10 10:26 PST by Robert Kaiser
Modified: 2012-03-29 12:40 PDT (History)
16 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
wontfix
+
fixed
+
fixed


Attachments
backout of bug 665196 (13.61 KB, patch)
2012-01-30 15:39 PST, :Gavin Sharp [email: gavin@gavinsharp.com]
akeybl: approval‑mozilla‑aurora+
Details | Diff | Splinter Review

Description Robert Kaiser 2012-01-10 10:26:50 PST
Not sure what's the best place to file this in, as this is just about an investigation of something good that happened.
It would be very good to know how we fixed significant problems for our users, so that we understand the mechanisms of Flash hangs better.

Looking at https://crash-analysis.mozilla.com/rkaiser/firefox.10.0a1.flashsummary.html you'll see that on 2011-10-29 the volume of Flash hangs dropped significantly, in the following days to less than 50% of what it was before (on that day, we can assume that a lot of nightlies from the last day were still in use, so the full extent is only visible the day after, but the day the drop started is probably the Nightly that had the fix). This was not coupled with a drop in ADUs or so (we only saw that later when Nightly changed to 12.0a1, where volume drops again, of course).
This trend continued with the switch of Aurora and now Beta from 9 to 10 - comparing https://crash-analysis.mozilla.com/rkaiser/firefox.10.0.flashsummary.html to when https://crash-analysis.mozilla.com/rkaiser/firefox.9.0.flashsummary.html tracked Beta has our Flash hangs dropping by almost a factor 10, which is incredibly significant (on Beta, we're usually seeing a lot more intense Flash usage than on Aurora and Nightly, so the effect getting stronger could be somewhat expected, but it's really a huge difference between 3000-3600 on 9 Beta to 300-450 on 10 Beta).

The assumption is that something in http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2011-10-28&enddate=2011-10-29+04%3A00%3A00 probably caused this tremendous drop in Flash hang volume, but it would be really good to know what exactly we did to make things so much better for users - or have at least a really good educated guess on what it was.
Comment 1 Benjamin Smedberg [:bsmedberg] 2012-01-17 11:59:18 PST
Was this seen on both Windows and Mac or just Windows?

Are we sure that this is a real improvement, or perhaps we're just failing to submit hang reports? I've seen a lot of people seeing "Failed to submit crash report" from plugin reports...

Was there any corresponding drop in plugin crash (non-hang) reports? I see that bug 665196 landed on the 27th, so I strongly suspect that the UI change is the root cause here. Either a bug is causing us not to submit the report, or people are misreading the UI and more of them are choosing not to submit a report.
Comment 2 Robert Kaiser 2012-01-17 12:35:44 PST
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #1)
> Was this seen on both Windows and Mac or just Windows?

I don't have that breakdown, would need to create it.

> Are we sure that this is a real improvement, or perhaps we're just failing
> to submit hang reports?

That's surely a possibility.

> Was there any corresponding drop in plugin crash (non-hang) reports?

Actually, there was, and your explanation of bug 665196 sounds somewhat likely. Does that mean we should back this out for 10? We are getting late in Beta with it now...
Comment 3 Benjamin Smedberg [:bsmedberg] 2012-01-17 12:42:15 PST
I could theorize that the stream we use to submit the report only stays alive as long as the current page is loaded, which means that we cancel the stream as soon as we reload. Now that we're submitting-and-reloading all at once, most users aren't submitting reports.

So yeah, we could back out the patch. It's got l10n and UI impact, so it's not a terribly easy thing to back out of release channels.
Comment 4 Felix Fung (:felix) 2012-01-17 13:53:23 PST
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #3)
> I could theorize that the stream we use to submit the report only stays
> alive as long as the current page is loaded, which means that we cancel the
> stream as soon as we reload. Now that we're submitting-and-reloading all at
> once, most users aren't submitting reports.

I'm looking into whether that's the case and whether there is a relatively simple fix...
Comment 5 Robert Kaiser 2012-01-18 11:59:51 PST
OK, just as a data point, I can't make out any obvious hang signature pairs we're missing, but it's a bit hard to tell. I think we should for now assume that we're not missing any specific hangs but that we're just getting fewer reports overall (but keep in mind that this is an assumption).
What's interesting is that while in 10.0a1 Nightly data, the drop in hangs and crashes was similar at about a factor 3 (but hard to tell exactly on the low level we have in Nightly), in beta data (9.0 vs. 10.0) OTOH, the drop is way higher in hangs than in crashes - while hangs are down almost by a factor of 10 (~10-15% of 9.0 values), crashes are down only slightly by at most 1/3 or 1/4 from what I can estimate from the numbers (~65-80% of 9.0 values).
Comment 6 Verdi [:verdi] 2012-01-20 11:42:59 PST
FWIW, out of about 6 flash crashes with the new UI I have yet to have a successful crash report submission. Like David, I get the failed submission message http://www.flickr.com/photos/djst/6635768441
Comment 7 [:Cww] 2012-01-20 16:08:08 PST
Talked this through with verdi: One additional scenario is that the new UI tells you to reload to fix the problem.  However, if you don't click the reload button in the toolbar or error message and instead use the regular reload button then the crash report doesn't submit even though the UI seems to imply that it will.
Comment 8 Benjamin Smedberg [:bsmedberg] 2012-01-23 15:32:32 PST
I have a backout prepared for Aurora, see try results here: https://tbpl.mozilla.org/?tree=Try&rev=0a073bc8c072

Felix/dolske, unless you have a real fix in-hand, I think we should go ahead and land this on Aurora, do you agree?
Comment 9 [Baboo] 2012-01-24 11:31:07 PST
Not sure if related but I had recently a Flash-related bug report that wouldn't submit: https://crash-stats.mozilla.com/report/index/bp-1a3aa2e0-a92b-473e-bf6a-fbc542120121
Comment 10 Robert Kaiser 2012-01-24 12:28:12 PST
I posted some thinking I had on this on IRC, I guess this should make it here as well for preservation:

<bsmedberg> so I think that somehow we're timing out or doing something else funky with the XHR doing the submission
<ted> seems odd, since we're using the same code as about:crashes
<KaiRo> ted: well, just thinking about the difference between about:crashes and the plugin crash reporter, in the former we navigate to a different URL, while in the latter we reload (the parent) - maybe the latter is more destructive than the former and we get destroyed before we can finish sending - after all, at navigation something stays around in the bfcache while reload doesn't do that, right?
<ted> KaiRo: certainly possible
<KaiRo> ted: I'm only theorizing without any detailed code knowledge of course

I have enough overview knowledge to make educated guesses there, I hope, but I surely don't know anything of how the code itself works, so I leave it up to those who do to dig deeper into this (if they think it's valuable) or not (if it's clearly a wrong idea).
Comment 11 Benjamin Smedberg [:bsmedberg] 2012-01-24 12:57:40 PST
https://hg.mozilla.org/releases/mozilla-aurora/rev/2878c9b2f02e
https://hg.mozilla.org/releases/mozilla-aurora/rev/03ff575be222
https://hg.mozilla.org/releases/mozilla-aurora/rev/18d1d5407dbe

Fixed on Aurora. We decided not to fix this on Beta for 10. I have not pushed the backout to -central, so I'm marking tracking for 12 so that we either fix or perform the backout again on the next aurora.
Comment 12 :Gavin Sharp [email: gavin@gavinsharp.com] 2012-01-24 16:06:44 PST
I think those aurora l10n changes are going to lead to problems, without additional outreach.
Comment 13 Axel Hecht [:Pike] 2012-01-25 00:56:43 PST
To quote from my previous answer in email:

For the l10n impact, I suggest that we keep the the new string, add back the old string, and add a fallback to content that picks up the new string if the old one's not there.

Something like
<!ENTITY report.please "&report.checkbox;">
either before or after the l10n file, in pluginProblem.xml. If you get the order right, the double definition just gets dropped, and otherwise we end up with the slightly odd string, but still a localized one.
Comment 14 Benjamin Smedberg [:bsmedberg] 2012-01-25 06:23:06 PST
I believed that advice was for the beta channel only; why can't we or localizers just retrieve the previous string from hg blame?
Comment 15 Axel Hecht [:Pike] 2012-01-25 06:36:51 PST
Aurora is string frozen, and the advice was for any string frozen tree.

I don't see any reason to debate string freezes, surely not in this bug, so I'll just ignore the "why" question.
Comment 16 Axel Hecht [:Pike] 2012-01-30 04:25:27 PST
What happens about this on central? Like, backing this out again on aurora for 12 shouldn't happen.
Comment 17 Benjamin Smedberg [:bsmedberg] 2012-01-30 10:13:28 PST
Backing this out again on Aurora is the plan unless Felix comes up with the correct fix.
Comment 18 Alex Keybl [:akeybl] 2012-01-30 10:47:15 PST
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #17)
> Backing this out again on Aurora is the plan unless Felix comes up with the
> correct fix.

In order to avoid the l10n impact on Aurora, we should back out of m-c as well. We can re-land bug 665196 on FF13 once we have a fix in hand.
Comment 19 :Gavin Sharp [email: gavin@gavinsharp.com] 2012-01-30 15:39:55 PST
Created attachment 592897 [details] [diff] [review]
backout of bug 665196

This is a backout patch that matches the three changesets in comment 11, minus the change to nsExceptionHandler (which is harmless to leave as-is). Want to make sure I got this right before backing it out on trunk.
Comment 20 Robert Kaiser 2012-02-01 14:54:50 PST
bsmedberg, gavin, akeybl, apparently we missed to do this on trunk before the uplift to aurora, can we get this backout done on aurora at least very early? What should we do about trunk?
Comment 21 Alex Keybl [:akeybl] 2012-02-02 12:15:13 PST
I believe we should land https://bugzilla.mozilla.org/show_bug.cgi?id=716945#c11 on Aurora and https://bugzilla.mozilla.org/show_bug.cgi?id=716945#c19 on m-c.
Comment 22 :Gavin Sharp [email: gavin@gavinsharp.com] 2012-02-02 13:19:21 PST
The patch in comment 19 is just a combined version of the changes in comment 14, minus an unrelated change that doesn't need to be backed out. They're the same from an l10n perspective - we should use comment 19's patch for any backouts.
Comment 23 Robert Kaiser 2012-02-03 07:00:56 PST
So, can we please get the backout landed at least on Aurora 12 ASP?
Maybe we can even land it on Nightly 13 so that we get better data there and can detect any regressions or improvements better, until we find a real fix to the problem?
Comment 24 :Gavin Sharp [email: gavin@gavinsharp.com] 2012-02-03 17:11:05 PST
Comment on attachment 592897 [details] [diff] [review]
backout of bug 665196

[Approval Request Comment]
String changes made by this patch: one string change
Comment 25 Alex Keybl [:akeybl] 2012-02-05 13:08:39 PST
Comment on attachment 592897 [details] [diff] [review]
backout of bug 665196

[Triage Comment]
Approved for Aurora 12 - thanks Gavin.
Comment 26 Axel Hecht [:Pike] 2012-02-08 00:37:15 PST
This is still not backed out on Aurora, nor on central. Gavin, can you own that?
Comment 29 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-02-09 15:53:45 PST
Is there anything here for QA to verify?

Note You need to log in before you can comment on or make changes to this bug.