Closed Bug 970916 Opened 10 years ago Closed 8 years ago

Compositor sometimes hangs in CompositorOGL::DrawQuad

Categories

(Core :: Graphics, defect, P5)

ARM
Android
defect

Tracking

RESOLVED WONTFIX
Tracking Status
firefox30 + wontfix
firefox31 + wontfix
firefox32 - wontfix
firefox33 - wontfix
firefox34 --- wontfix
fennec + ---

People

(Reporter: snorp, Unassigned)

According to hang reports (http://darchons.github.io/hang-telemetry-dashboard/bhr.html), we are frequently hanging the compositor in CompositorOGL::DrawQuad. I don't know the duration of time (tool is having trouble atm), but that shouldn't really happen.
CJ, what do you think about somebody from your team investigating this as a good Android introduction?
Flags: needinfo?(cku)
This might affect b2g too, we just don't have hang detection there yet.
Flags: needinfo?(cku) → needinfo?(pchang)
Looks like we didn't enable OMTC on Android. Prepare environment to check.
Flags: needinfo?(pchang)
(In reply to peter chang[:pchang][:peter] from comment #3)
> Looks like we didn't enable OMTC on Android. Prepare environment to check.

Why do you say that?
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #4)
> (In reply to peter chang[:pchang][:peter] from comment #3)
> > Looks like we didn't enable OMTC on Android. Prepare environment to check.
> 
> Why do you say that?

I said that because I saw the call stacks of several cases in the hang report system.
I thought they were dumped from the main thread, but I was wrong.

[one hung case]
    Timer.Fire:
    Startup.XRE_Main:
    Gecko:
[compositor hung case]
    CompositorOGL.DrawQuad:
    ThebesLayerComposite.RenderLayer:
    LayerManagerComposite.Render:
    CompositorParent.Composite:
    Compositor:

Bug 725095 already enabled OMTC on fennec.

BTW, are we able to know which line caused this hang problem?
(In reply to peter chang[:pchang][:peter] from comment #5)
> 
> BTW, are we able to know which line caused this hang problem?

Not right now, and I doubt we'll ever have that information unless someone is able to reproduce it under gdb. Is that right, Jim?
Flags: needinfo?(nchen)
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #6)
> (In reply to peter chang[:pchang][:peter] from comment #5)
> > 
> > BTW, are we able to know which line caused this hang problem?
> 
> Not right now, and I doubt we'll ever have that information unless someone
> is able to reproduce it under gdb. Is that right, Jim?

That's right for now, but once bug 938157 (new unwinding library) lands, we will be able to get frame-by-frame hang stacks.

BTW, the hang times plot in the dashboard is now fixed.
Flags: needinfo?(nchen)
(In reply to Jim Chen [:jchen :nchen] from comment #7)
> (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #6)
> > (In reply to peter chang[:pchang][:peter] from comment #5)
> > > 
> > > BTW, are we able to know which line caused this hang problem?
> > 
> > Not right now, and I doubt we'll ever have that information unless someone
> > is able to reproduce it under gdb. Is that right, Jim?
> 
> That's right for now, but once bug 938157 (new unwinding library) lands, we
> will be able to get frame-by-frame hang stacks.
> 
> BTW, the hang times plot in the dashboard is now fixed.
Having the line info would be great.

Jim, I have a question: how do we define a "hang", and where does the thread stack in comment 5 come from? For fennec, is it from the ANR log? If so, how can I get the original ANR log?
If so, where can I get the
(In reply to peter chang[:pchang][:peter] from comment #8)
> 
> Jim, I have a question: how do we define a "hang", and where does the
> thread stack in comment 5 come from? For fennec, is it from the ANR log? If
> so, how can I get the original ANR log?

For the Compositor thread, a hang is defined as any event that takes more than 128ms to execute.
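
As a rough illustration of that definition (a sketch only, not the actual hang monitor code; `HANG_THRESHOLD_MS`, `is_hang`, and `hang_durations` are hypothetical names), an event counts as a hang once its run time exceeds 128ms:

```python
# Hypothetical sketch of the hang definition described above.
# The 128 ms threshold is the value stated for the Compositor thread.
HANG_THRESHOLD_MS = 128


def is_hang(duration_ms, threshold_ms=HANG_THRESHOLD_MS):
    """Return True when an event ran strictly longer than the threshold."""
    return duration_ms > threshold_ms


def hang_durations(event_durations_ms, threshold_ms=HANG_THRESHOLD_MS):
    """Keep only the event durations that qualify as hangs."""
    return [d for d in event_durations_ms if is_hang(d, threshold_ms)]
```

Under this reading, an event that takes exactly 128ms is not reported, while one that takes 129ms is.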

The stack is from the pseudo-stack used by the Gecko profiler [1]. It is an internal Gecko structure and it is not related to the ANR traces.

[1] https://developer.mozilla.org/en-US/docs/Performance/Profiling_with_the_Built-in_Profiler#Native_stack_vs._Pseudo_stack
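
A pseudo-stack can be pictured as a list of labels that instrumented code pushes and pops by hand, which is why the report in comment 5 shows names such as CompositorOGL.DrawQuad but no line numbers. A minimal sketch, assuming hypothetical names `PseudoStack`, `label`, and `snapshot` (this is not the Gecko profiler API):

```python
from contextlib import contextmanager


class PseudoStack:
    """Hypothetical stand-in for a per-thread pseudo-stack of labels."""

    def __init__(self):
        self._labels = []

    @contextmanager
    def label(self, name):
        # Push a label on entry and pop it on exit, even if an error occurs.
        self._labels.append(name)
        try:
            yield
        finally:
            self._labels.pop()

    def snapshot(self):
        # Innermost label first, matching the order used in the hang reports.
        return list(reversed(self._labels))
```

A hang reporter built this way can only see the labels that were active when it sampled the thread, so anything between two labels (for example, time spent inside a driver call) is attributed to the innermost label.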
Could bug 974054 be related?
I just checked the last week data from now and didn't see the hang for DrawQuad.
I will keep monitoring for one more week to see whether this is still an issue.
Based on comment 11, we are not going to track this now. Peter, if the data in the next week changes to show the hang and you believe it needs tracking, please re-nominate.
:jchen indicates that the data might not be terribly accurate this week, so I think we should continue looking into this bug.
Do you think we might have some more data to look at by EOW so we can make a decision on tracking?
Flags: needinfo?(snorp)
(In reply to Benjamin Kerensa [:bkerensa] from comment #14)
> Do you think we might have some more data to look at by EOW so we can make a
> decision on tracking?

I don't really know. Jim, is that fixed now?
Flags: needinfo?(snorp) → needinfo?(nchen)
(In reply to Andreas Gal :gal from comment #10)
> Could bug 974054 be related?

Hah, that is funny. I wonder if that did indeed fix this.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #15)
> (In reply to Benjamin Kerensa [:bkerensa] from comment #14)
> > Do you think we might have some more data to look at by EOW so we can make a
> > decision on tracking?
> 
> I don't really know. Jim, is that fixed now?

Yeah it's mostly fixed now and data from last week are in. Unfortunately DrawQuad still seems to be the top hang for Fennec. Looking at the build ids, it's hard to see, but there doesn't seem to be a decrease in hangs after bug 974054 landed. :(
Flags: needinfo?(nchen)
James,

I guess it's unclear from Comment 17 whether this has resolved the issue enough that it no longer warrants tracking. Do you believe this should be a blocker?
Flags: needinfo?(snorp)
(In reply to Benjamin Kerensa [:bkerensa] from comment #18)
> James,
> 
> I guess it's unclear from Comment 17 whether this has resolved the issue
> enough that it no longer warrants tracking. Do you believe this should be a
> blocker?

We need to track this. Comment #17 is saying that the hang tool is fixed now, not that this bug is fixed. The latest hang data from last week indicates we still frequently have a hang here, with about half of them taking longer than 255ms.
Flags: needinfo?(snorp)
Tracking based on feedback from Comment #17 and Comment #19
tracking-fennec: --- → 30+
Peter, are you working on this?
Assignee: nobody → pchang
tracking-fennec: 30+ → ---
tracking-fennec: --- → +
tracking-fennec: + → 30+
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #21)
> Peter, are you working on this?

James, I just created bug 983540 for b2g compositor high CPU usage. The related CPU bottlenecks match the report from http://darchons.github.io/hang-telemetry-dashboard/bhr.html.

Do you think they are related?
(In reply to peter chang[:pchang][:peter] from comment #22)
> (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #21)
> > Peter, are you working on this?
> 
> James, I just created bug 983540 for b2g compositor high CPU usage. The
> related CPU bottlenecks match the report from
> http://darchons.github.io/hang-telemetry-dashboard/bhr.html.
> 
> Do you think they are related?

Possible, I suppose.
Assuming that Peter is indeed working on this. Please flip flags accordingly if that's not the case.
Status: NEW → ASSIGNED
(In reply to Jim Chen [:jchen :nchen] from comment #7)
> (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #6)
> > (In reply to peter chang[:pchang][:peter] from comment #5)
> > > 
> > > BTW, are we able to know which line caused this hang problem?
> > 
> > Not right now, and I doubt we'll ever have that information unless someone
> > is able to reproduce it under gdb. Is that right, Jim?
> 
> That's right for now, but once bug 938157 (new unwinding library) lands, we
> will be able to get frame-by-frame hang stacks.
> 
> BTW, the hang times plot in the dashboard is now fixed.

I would like to wait for bug 938157 to land to see which line causes the hang.

From bug 983540 comment 0, the stack for DrawQuad looks related to the driver implementation, as does the hang in CompositorOGL::EndFrame.
(In reply to peter chang[:pchang][:peter] from comment #25)
> (In reply to Jim Chen [:jchen :nchen] from comment #7)
> > (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #6)
> > > (In reply to peter chang[:pchang][:peter] from comment #5)
> > > > 
> > > > BTW, are we able to know which line caused this hang problem?
> > > 
> > > Not right now, and I doubt we'll ever have that information unless someone
> > > is able to reproduce it under gdb. Is that right, Jim?
> > 
> > That's right for now, but once bug 938157 (new unwinding library) lands, we
> > will be able to get frame-by-frame hang stacks.
> > 
> > BTW, the hang times plot in the dashboard is now fixed.
> 
> I would like to wait for bug 938157 to land to see which line causes the
> hang.
> 
> From bug 983540 comment 0, the stack for DrawQuad looks related to the
> driver implementation, as does the hang in CompositorOGL::EndFrame.


Peter - bug 938157 landed in FF31 and we're a couple of weeks from shipping FF30 - is this on your radar for this week?
Flags: needinfo?(pchang)
(In reply to Jim Chen [:jchen :nchen] from comment #7)
> (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #6)
> > (In reply to peter chang[:pchang][:peter] from comment #5)
> > > 
> > > BTW, are we able to know which line caused this hang problem?
> > 
> > Not right now, and I doubt we'll ever have that information unless someone
> > is able to reproduce it under gdb. Is that right, Jim?
> 
> That's right for now, but once bug 938157 (new unwinding library) lands, we
> will be able to get frame-by-frame hang stacks.
> 
> BTW, the hang times plot in the dashboard is now fixed.
Jim, I just checked the latest hang report but I didn't see the line info.
Do you know where I can get the line info from hang report?
Flags: needinfo?(pchang) → needinfo?(nchen)
Sorry, we don't have native stack support yet. I just filed bug 1016629 and I'll be working on it soon.
Flags: needinfo?(nchen)
We're now too late to get this into Firefox 30. Marking affected for 31/32 and carrying forward tracking to try and target a landing in FF31.
tracking-fennec: 30+ → 31+
Jim, any news on this? Thanks
Flags: needinfo?(nchen)
Not much progress so far, but I'm going back to working on it today.
Flags: needinfo?(nchen)
Jim, any news on this?
Flags: needinfo?(nchen)
Depends on: 1016629
Untracking. We won't block the 31 release because of this bug. However, we could take a patch for 32.
tracking-fennec: 31+ → 32+
To really get this going, we at least need bug 1016629, which just landed on the trunk. Unless we can glean why this is happening some other way, I think that means 34 is the earliest release where we could act on this?
I'm marking tracking- for Firefox 32/33 based on comment 34. With the current state, I think it's unrealistic to expect that we're going to ship a fix in 32. I have left 33 as affected in case there is a way to uplift.
Looking at the latest BHR data, it seems the hangs are happening in the graphics driver, so I'm not sure there's much we can do at the moment.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: needinfo?(nchen)
Resolution: --- → WONTFIX
I'm not convinced we should just close this if there is a hang in the driver. Maybe we're getting the driver into a bad state or giving it bad input. Even if not, we should at least open a driver bug and point this to it.
Unless the volume of hangs has gone down, and now we don't care?
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
I couldn't load last week's results from http://darchons.github.io/hang-telemetry-dashboard/bhr.html . Is anyone able to load the above page?
The dashboard is now at http://telemetry.mozilla.org/hang/bhr/. So for CompositorOGL::DrawQuad, the native stack shows it's inside the graphics driver.
tracking-fennec: 32+ → +
filter on [mass-p5]
Priority: -- → P5
This bug is now 9+ months old. We have shipped many releases with this bug. The Android team has assigned this priority P5. I don't see the value in continuing to track this bug.
Unassigning myself since I don't have time to work on this bug now.
Assignee: pchang → nobody
It's now been a year since we stopped tracking this. I suggest this bug report should get closed if we're not going to fix it.
(In reply to Anthony Hughes (:ashughes) [GFX][QA][Mentor] from comment #43)
> It's now been a year since we stopped tracking this. I suggest this bug
> report should get closed if we're not going to fix it.

I am finally closing this bug report as it's been several months with no objections to the above. Please reopen if we're going to fix it.
Status: NEW → RESOLVED
Closed: 10 years ago → 8 years ago
Resolution: --- → WONTFIX