Closed Bug 1338844 Opened 7 years ago Closed 2 years ago

[e10s] Hangs on 2-core AMD platforms with e10s enabled

Categories

(Core :: General, defect)

50 Branch
defect
Not set
normal

RESOLVED INCOMPLETE
Tracking Status
platform-rel --- ?

People

(Reporter: marco, Unassigned)

Details

(Whiteboard: [ele:1a][platform-rel-AMD])

AMD reported this to us:

> We are currently seeing some hang issues on 2-core AMD platforms (e.g. A6-6400K,
> A6-7050B, A9-9410) using Firefox v50 within the PCMark10 web workloads.

> The hangs have been tracked down so far to the Multi-process feature in Mozilla
> (Electrolysis). Experiments showed that the hangs vanish in the following cases.
> 1) Downgrade the version of FF used in PCMark10 from v50 to v48, where the
>    multi-process feature -- Electrolysis -- does not exist
> 2) Disable Electrolysis in FF v50 by setting the
>    "browser.tabs.remote.autostart.2" flag to "false"
> 3) Upgrade the version of FF used in PCMark10 from v50 to v53-Nightly.  v51 and
>    v52-Developer do not fix the problem.

> AMD machines with more cores e.g. 4-core or higher run through it just fine. Also,
> no problems are seen with Chrome to my knowledge.
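
For anyone who wants to repeat workaround 2 on a test machine without clicking through about:config, the pref can be pinned from a `user.js` file in the profile. The sketch below is only an illustration: the pref name comes from AMD's report, `user_pref(...)` is the usual `user.js` syntax, and `disable_e10s` plus the profile path passed to it are hypothetical names supplied by the caller.

```python
from pathlib import Path

# Pref name taken from AMD's report above.
PREF_NAME = "browser.tabs.remote.autostart.2"


def format_user_pref(name: str, value: bool) -> str:
    """Return one user.js line that pins a boolean pref."""
    return f'user_pref("{name}", {"true" if value else "false"});'


def disable_e10s(profile_dir: Path) -> Path:
    """Append the e10s opt-out to user.js in the given profile directory.

    profile_dir is a hypothetical path supplied by the caller; this helper
    does not try to locate a Firefox profile on its own.
    """
    user_js = profile_dir / "user.js"
    line = format_user_pref(PREF_NAME, False)
    existing = user_js.read_text(encoding="utf-8") if user_js.exists() else ""
    # Only append when the exact line is not already present.
    if line not in existing.splitlines():
        with user_js.open("a", encoding="utf-8") as fh:
            if existing and not existing.endswith("\n"):
                fh.write("\n")
            fh.write(line + "\n")
    return user_js
```

Calling `disable_e10s(Path("/path/to/profile"))` twice leaves a single line in `user.js`, so the helper is safe to re-run between benchmark passes.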

I have asked them to run mozregression to find what fixed the hangs between 50 and 53:

> The fix was narrowed down to commit b4ade2b0841c in v52 nightly.  It enables
> out-of-process GPU compositing by default.
> I do worry that enabling GPU compositing "fixing" the hang might just be covering
> up an underlying issue on 2-core machines, but it does provide us a solution in
> the meantime.
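
For context on how a fix range like this gets narrowed down: the search is essentially a bisection over an ordered list of builds, looking for the first one where the hang no longer reproduces. The sketch below shows only that idea and is not mozregression's actual implementation; `first_fixed_build`, the build labels, and the `hangs` callback are hypothetical names for illustration.

```python
from typing import Callable, Sequence


def first_fixed_build(builds: Sequence[str], hangs: Callable[[str], bool]) -> str:
    """Return the earliest build for which hangs(build) is False.

    Assumes builds are in chronological order, builds[0] hangs,
    builds[-1] does not, and the fix is never reverted in between.
    """
    lo, hi = 0, len(builds) - 1  # lo: known to hang, hi: known fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if hangs(builds[mid]):
            lo = mid  # still hangs, so the fix landed later
        else:
            hi = mid  # already fixed, so the fix landed here or earlier
    return builds[hi]
```

In practice `hangs` would be a manual or scripted run of the workload against each candidate build, which is why halving the range at every step matters.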
Anthony, could you please advise on who should own the next step for this issue and also what would be the best component for this bug?
Flags: needinfo?(anthony.s.hughes)
(In reply to Adrian Florinescu [:AdrianSV] from comment #1)
> Anthony, could you please advise on who should own the next step for this
> issue and also what would be the best component for this bug?

I don't know where this belongs. I suggest tracking down an affected system and debugging the issue so you can determine what code is causing this to happen. This will in turn inform the appropriate component.

If this only affects AMD APU chipsets it could be an underlying issue with the on-die GPU which might explain why GPU Process helps this situation. However this is pure speculation and debugging the issue further should help answer some of your questions.

I might suggest keeping this in Untriaged (or moving to General) until you can find out more about this issue. But I certainly don't think this belongs on Graphics, at least not yet.
Flags: needinfo?(anthony.s.hughes)
Marco, do you own next steps on this? Our QA teams don't have these computers and so aren't going to be much help. If we're talking about *Firefox* hanging, the AMD team should be able to break into a debugger and get us stacks of the hang, or at least follow the instructions at https://developer.mozilla.org/en-US/docs/Mozilla/How_to_report_a_hung_Firefox to generate a crash report.

In untriaged this bug doesn't have an owner, and we don't want it to be stuck here forever without a decision or clear followup path.
Flags: needinfo?(mcastelluccio)
I've contacted AMD and asked them if they can give us a stack trace. They agreed but haven't sent me anything yet, so I'll ping them again.

In the meantime, perhaps QA can try to reproduce with another dual core machine under the same workload?
Marco, if you want any of the QA teams to help you should probably make a more specific request. I'm going to assign this to you since you own next steps.
Assignee: nobody → mcastelluccio
Component: Untriaged → General
(In reply to Benjamin Smedberg [:bsmedberg] from comment #5)
> Marco, if you want any of the QA teams to help you should probably make a
> more specific request. I'm going to assign this to you since you own next
> steps.

I've asked Andrei if they can test with one of their 2-core machines. I will leave the needinfo to me until I hear back from AMD or Andrei.
(In reply to Marco Castelluccio [:marco] from comment #6)
> (In reply to Benjamin Smedberg [:bsmedberg] from comment #5)
> > Marco, if you want any of the QA teams to help you should probably make a
> > more specific request. I'm going to assign this to you since you own next
> > steps.
> 
> I've asked Andrei if they can test with one of their 2-core machines. I will
> leave the needinfo to me until I hear back from AMD or Andrei.

We've been looking for a 2-core AMD test machine across the teams, but the only legacy hardware we have is an Intel-based one. I don't think we'll be able to be of any assistance here, but feel free to ni? me if you think there's some other way we could help.
(In reply to Andrei Vaida, QA [:avaida] – please ni? me from comment #8)
> (In reply to Marco Castelluccio [:marco] from comment #6)
> > (In reply to Benjamin Smedberg [:bsmedberg] from comment #5)
> > > Marco, if you want any of the QA teams to help you should probably make a
> > > more specific request. I'm going to assign this to you since you own next
> > > steps.
> > 
> > I've asked Andrei if they can test with one of their 2-core machines. I will
> > leave the needinfo to me until I hear back from AMD or Andrei.
> 
> We've been looking for a 2-core AMD test machine across the teams, but the
> only legacy hardware we have is an Intel-based one. I don't think we'll be
> able to be of any assistance here, but feel free to ni? me if you think
> there's some other way we could help.

Could you test with a 2-core machine (even Intel) under the same workload (PCMark10 web)? This way we can make sure the hang is restricted to those 2-core AMD machines.
Flags: needinfo?(andrei.vaida)
These two hang reports don't appear to be similar on the main thread, but there may be an issue in the compositor/graphics subsystem. Here's my basic diagnosis of each:

https://crash-stats.mozilla.com/report/index/796e510d-9cb5-4c03-9cad-033d12170313#allthreads

The main thread (thread 0) is at this precise moment allocating memory through jemalloc, and waiting on the jemalloc lock. I don't see any other thread currently waiting on this lock, so we either caught it right at this time or it's stuck. A debugger might tell, or more runs. The main thread is running the refresh driver to paint a frame.

Thread 24 is the compositor thread; it is at the end of painting a frame and is within the ATI driver. Threads 22 and 23 appear to be ATI threads, and both are blocked. This appears a bit suspicious to me, especially because...

https://crash-stats.mozilla.com/report/index/29534afc-240c-4775-a409-64fe82170313#allthreads

The main thread appears to be perfectly normal waiting on the event loop.

Thread 27 is the compositor thread and again we're at the end of painting a frame, in the ATI driver, and the thread is blocked on NtGdiDdDDISubmitCommand.

So I'd point investigation at the ATI driver and its behavior on a 2-core system as much as the processor itself.
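
If more hang reports come in, the same check (which threads are sitting in a blocking call at the top of their stack) can be done mechanically. This is a rough sketch under stated assumptions: the input is a plain mapping of thread id to frame names with the innermost frame first, which is a hypothetical format and not what crash-stats exports; only `NtGdiDdDDISubmitCommand` comes from the reports above.

```python
from typing import Dict, Iterable, List

# Frame named in the second report above; extend the set as needed.
BLOCKING_FRAMES = frozenset({"NtGdiDdDDISubmitCommand"})


def threads_blocked_in(
    stacks: Dict[int, List[str]],
    blocking: Iterable[str] = BLOCKING_FRAMES,
) -> List[int]:
    """Return ids of threads whose innermost frame (index 0) is a blocking call."""
    blocking_set = set(blocking)
    return sorted(
        tid for tid, frames in stacks.items()
        if frames and frames[0] in blocking_set
    )
```

Running it over several reports and comparing which threads keep showing up would help tell a genuine stall in the driver from a snapshot that just happened to catch the thread mid-call.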
Andrei, could you test with a 2-core machine (Intel or AMD, it doesn't matter) and with different graphics cards?

If you can't reproduce with any configuration, it might also be worth trying with the gfx configuration from those crash reports (AMD graphics card 0x130b with driver 21.19.407.0).

In the meantime, I will ask AMD if they can get those raw crash dumps and tell us what the driver is doing in those threads.
We tried to test using an Intel Pentium D 923 CPU with an Intel 82945G GPU.
We were unable to run into this crash: there were no visible differences with e10s enabled or disabled (Fx 50), and no crashes occurred.

Unfortunately we don't have access to other similar low-end machines, and we cannot test using a different GPU on this setup.
Flags: needinfo?(andrei.vaida)
platform-rel: --- → ?
Whiteboard: [ele:1a] → [ele:1a][platform-rel-AMD]
So... this time I don't have one of the "lucky systems".
But with "unexplained crashes" being brought up, I can't help but think of some hidden hardware erratum.
https://support.amd.com/TechDocs/48931_15h_Mod_10h-1Fh_Rev_Guide.pdf#page=14
https://support.amd.com/TechDocs/51603_Rev_Guide_15h_Models_30h-3Fh.pdf#page=13
https://support.amd.com/TechDocs/55370_Rev_Guide_For_Family_15h_Models_70h-7Fh_Processors.pdf#page=12
(These should cover the architectures of the CPUs mentioned in the OP, respectively Piledriver, Steamroller and Excavator, all of them in turn revisions of Bulldozer.)

There are many possible candidates in the first two documents (but for the life of me, I cannot believe Stoney Ridge would have just a dozen IOMMU bugs and nothing else).
So I guess having an AMD rep seriously look into this would really help.
Assignee: mcastelluccio → nobody

I think we can close e10s regressions at this point. Hopefully this is not still an issue.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INCOMPLETE