If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

MediaCacheStream::NotifyDataReceived() may block the main thread for up to 8 seconds

NEW
Unassigned

Status

()

Core
Audio/Video: Playback
P2
normal
6 months ago
11 days ago

People

(Reporter: cpearce, Unassigned)

Tracking

(Depends on: 2 bugs, Blocks: 1 bug, {perf, stale-bug})

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [qf:p3][bhr])

Attachments

(1 attachment)

(Reporter)

Description

6 months ago
Created attachment 8863988 [details]
media-notify-data-received.txt

BHR shows we have 11 reports in the 2017-04-29 period [1] with the main thread blocked for at least 8 seconds where when we walk the main thread it has MediaCacheStream::NotifyDataReceived on the stack.

The reports often are blocked waiting on a monitor acquisition.

Thread pseudostacks with counts are attached.


[1] https://people-mozilla.org/~mlayzell/bhr/20170429/hangs.json
Whiteboard: [qf:p1] → [qf:p1][bhr]
Do we have the build id on which these hangs happen?
(Reporter)

Comment 2

6 months ago
Unfortunately no. We're working on having the background hang reports include the buildId in future. We should have them in the  next couple of weeks.
It would be very useful to have stack traces from all threads to find out which thread is holding the monitor. Otherwise we need to struggle in the code review.
(Reporter)

Comment 4

6 months ago
That would increase the size of the data returned maybe 100X... I'll ask...
We already include a huge amount of data in each crash report. Maybe we can add a special mode to BHR which just crashes when hang is detected.

Comment 6

5 months ago
We hang far too frequently for that to really be practical.  :-)
Right, so we need to run the special mode in a less frequent fashion. Once we get the crash report, we can start debugging immediately without further reports.
Depends on: 1365534
Depends on: 1365538
Depends on: 1366929
Depends on: 1367277
Depends on: 1368837
See Also: → bug 1369263
Depends on: 1369263
See Also: bug 1369263
Depends on: 1369322
Depends on: 1369326
Depends on: 1369538
Depends on: 1371205
Depends on: 1371882
Depends on: 1374180
Depends on: 1374441
Depends on: 1374545
Depends on: 1377575
Depends on: 1379091
Looking at BHR logs, I think we're in a much better situation now, thanks to dependent bugs that have already been solved: We now only have zero to two up-to-2-second hangs per day, instead of dozens of 2-to-65s hangs in April-May.

The remaining hangs would most likely be solved by bug 1369263 ("Offload all MediaCache activities to another thread").
But I believe that bug would require a lot of work to get done, so I'm not sure if it can be safely achieved before beta57. And its priority is P3, so it's not deemed urgent.

In the meantime, I think this [qf:p1] bug could either be closed as "mission pretty much accomplished", or be downgraded to [qf:p3] (backlog queue; may have a large impact on performance) so as not to unnecessarily block 57.

JW, as champion for bug 1369263, what do you think?
Flags: needinfo?(jwwang)
I think we should lower the priority and keep it tracked. IIUC, we can't rule out the CPU busy case from BHR. So there might be false positives where CPU is just too busy to finish jobs in time on the main thread. It doesn't make sense for these false positives to block 57.

Since bug 1369263 is a huge change and is too risky to make it in 57, I will proceed the bug in an incremental way so we can monitor BHR and crash reports to fix regressions as soon as possible. Therefore we should keep tracking this bug as an indicator that we are in a good shape while moving bug 1369263 forward.
Flags: needinfo?(jwwang)
Thank you JW, I'll lower the qf:priority.
Whiteboard: [qf:p1][bhr] → [qf:p3][bhr]
Keywords: stale-bug
Mass change P1->P2 to align with new Mozilla triage process
Priority: P1 → P2
Keywords: perf
You need to log in before you can comment on or make changes to this bug.