Created attachment 8863988 [details] media-notify-data-received.txt BHR shows we have 11 reports in the 2017-04-29 period  with the main thread blocked for at least 8 seconds where when we walk the main thread it has MediaCacheStream::NotifyDataReceived on the stack. The reports often are blocked waiting on a monitor acquisition. Thread pseudostacks with counts are attached.  https://people-mozilla.org/~mlayzell/bhr/20170429/hangs.json
Do we have the build id on which these hangs happen?
Unfortunately no. We're working on having the background hang reports include the buildId in future. We should have them in the next couple of weeks.
It would be very useful to have stack traces from all threads to find out which thread is holding the monitor. Otherwise we need to struggle in the code review.
That would increase the size of the data returned maybe 100X... I'll ask...
We already include a huge amount of data in each crash report. Maybe we can add a special mode to BHR which just crashes when hang is detected.
We hang far too frequently for that to really be practical. :-)
Right, so we need to run the special mode in a less frequent fashion. Once we get the crash report, we can start debugging immediately without further reports.
Looking at BHR logs, I think we're in a much better situation now, thanks to dependent bugs that have already been solved: We now only have zero to two up-to-2-second hangs per day, instead of dozens of 2-to-65s hangs in April-May. The remaining hangs would most likely be solved by bug 1369263 ("Offload all MediaCache activities to another thread"). But I believe that bug would require a lot of work to get done, so I'm not sure if it can be safely achieved before beta57. And its priority is P3, so it's not deemed urgent. In the meantime, I think this [qf:p1] bug could either be closed as "mission pretty much accomplished", or be downgraded to [qf:p3] (backlog queue; may have a large impact on performance) so as not to unnecessarily block 57. JW, as champion for bug 1369263, what do you think?
I think we should lower the priority and keep it tracked. IIUC, we can't rule out the CPU busy case from BHR. So there might be false positives where CPU is just too busy to finish jobs in time on the main thread. It doesn't make sense for these false positives to block 57. Since bug 1369263 is a huge change and is too risky to make it in 57, I will proceed the bug in an incremental way so we can monitor BHR and crash reports to fix regressions as soon as possible. Therefore we should keep tracking this bug as an indicator that we are in a good shape while moving bug 1369263 forward.
Thank you JW, I'll lower the qf:priority.
Mass change P1->P2 to align with new Mozilla triage process