Open Bug 1673201 Opened 11 months ago Updated 1 month ago

Crash in [@ OpenAdapter10_2] affecting all Firefox versions

Categories

(Core :: Graphics, defect)

Unspecified
Windows

Tracking Status
firefox-esr78 --- affected
firefox81 --- wontfix
firefox82 --- affected
firefox83 --- affected
firefox84 --- affected

People

(Reporter: aryx, Unassigned)

Details

(Keywords: crash, Whiteboard: [tbird crash])

Attachments

(1 file)

Jimm, this graphics-related crash popped up on Friday across Firefox versions. If it's caused by updates of third-party software or drivers, the volume might increase over the next day. Please have a look at the issue.

This will likely turn into a top crasher (~60 crashes yesterday, ~50 so far today; 81.0.2 and 82.0 account for 77 installations for which we have crash reports stored). There were only a few crashes before 2020-10-23: the very first on 2020-09-17, the others in October. Websites in the active tab, where reported, are mostly the usual social media sites and a few others.

All crash reports are for Intel Graphics with either driver version 23.20.16.4973 (114 crashes) or 20.19.15.5107 (7 crashes). The former has been in the Windows Update catalog since early 2018.

All except 2 crashes are on Windows 10 but on different versions of it.

See https://crash-stats.mozilla.org/signature/?product=Firefox&signature=OpenAdapter10_2&date=%3E%3D2020-04-24T20%3A50%3A00.000Z&date=%3C2020-10-24T20%3A50%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=startup_crash&_sort=-date&page=3#summary

Crash report: https://crash-stats.mozilla.org/report/index/f7e1e052-a848-4071-83f6-f204d0201024

Reason: EXCEPTION_ACCESS_VIOLATION_WRITE

Top 10 frames of crashing thread:

0 igd10iumd64.dll OpenAdapter10_2 
1 igd10iumd64.dll OpenAdapter10_2 
2 igd10iumd64.dll OpenAdapter10_2 
3 igd10iumd64.dll GTPIN_IGC_Instrument 
4 d3d11.dll CContext::TID3D11DeviceContext_IASetVertexBuffers_<2> 
5 d2d1.dll void GeometryStageManager::Flush 
6 d2d1.dll virtual void CHwSurfaceRenderTarget::FlushQueuedOperations 
7 d2d1.dll class CHwShaderState* CDeferredRenderingManager::LockForNewPrimitive 
8 d2d1.dll virtual void CHwSurfaceRenderTarget::SetClipRect 
9 d2d1.dll void CBaseRenderTarget::SetFinalTargetSpaceClip 

More info about OpenAdapter10_2.

Flags: needinfo?(jmathies)

AIUI we (by which I mean Gabriele) just recently started importing symbols from Intel graphics drivers, so this may not actually be new, just a signature shift?

Yes, this is an effect of my first tests in bug 1655476. I found a version of the Intel drivers which was particularly crashy, and used the updated dump_syms and scripts to see if they would locate the correct symbols, which they did. This week this process should become automatic for all Intel, NVidia and AMD driver symbols. I'll send an e-mail to stability when it happens. Sorry for not having raised a warning about this particular instance.

Flags: needinfo?(jmathies)

The severity field is not set for this bug.
:jimm, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jmathies)

FYI as more and more symbols are scraped it seems that this signature will go up a bit, but there seem to be multiple stacks below the first call. It might be worth telling them apart by adding some of the function names to Socorro's prefix list. Someone with a better understanding of the graphics stack than me would have to take a look though.

Blocks: gfx-triage
Flags: needinfo?(jmathies)
Flags: needinfo?(jmuizelaar)

So getting OpenAdapter10_2 here is actually a regression: that public symbol doesn't have a length, so we just assume that most of the DLL is covered by it. I.e. whenever we encounter an Intel driver crash, it will resolve to OpenAdapter10_2.
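
To illustrate the problem, here is a minimal sketch of a "nearest public symbol" lookup. It is not the stack-walker's actual code; the symbol table and its addresses are made up for illustration.

```python
import bisect

# Hypothetical PUBLIC records as (module-relative address, name), sorted by
# address. The addresses are invented and not taken from a real driver.
PUBLIC_SYMBOLS = [
    (0x1000, "OpenAdapter10"),
    (0x1200, "OpenAdapter10_2"),
]


def nearest_public(symbols, address):
    """Return the name of the closest PUBLIC symbol at or below `address`.

    PUBLIC records carry no size, so the last record matches every higher
    address in the module.
    """
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    return symbols[index][1] if index >= 0 else None


# Offsets far beyond the real function still resolve to the last symbol.
for offset in (0x1210, 0xE3ED, 0x395E7):
    print(hex(offset), nearest_public(PUBLIC_SYMBOLS, offset))
```

With a table like this, every offset above 0x1200 is attributed to OpenAdapter10_2, however far from it the address really is.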

We could use the ExceptionData that we create CFI from to get the actual function starts and ends and create pseudo functions with made-up names. We would want to have these made-up names always be in the prefix list so that we keep including them in the signature until we hit a real function.
Alternatively, we could use the exception data to find the real size of OpenAdapter10_2 and emit that. I think doing that should be sufficient to have it not match and then we'd fall back to something like the old behaviour.
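
A rough sketch of the second option, assuming the exception data can be reduced to a list of (start, end) function ranges. The ranges, addresses and helper names below are hypothetical and only show the idea of bounding a public symbol.

```python
# Hypothetical (start, end) function ranges, as they might be recovered from
# the exception data. All addresses are made up for illustration.
FUNCTION_RANGES = [(0x1000, 0x1200), (0x1200, 0x1280), (0xE300, 0xE400)]


def guess_public_size(public_start, function_ranges):
    """Return the distance from a PUBLIC symbol to the end of its function.

    Returns None when no known function range contains the symbol.
    """
    for start, end in function_ranges:
        if start <= public_start < end:
            return end - public_start
    return None


def covers(public_start, size, address):
    """True if `address` falls inside the sized symbol."""
    return size is not None and public_start <= address < public_start + size


size = guess_public_size(0x1200, FUNCTION_RANGES)
print(hex(size))
print(covers(0x1200, size, 0x1210))
print(covers(0x1200, size, 0xE3ED))
```

Once the symbol has a size, an address such as 0xE3ED no longer matches it and the lookup can fall through to some other behaviour.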

Flags: needinfo?(jmuizelaar) → needinfo?(gsvelto)

So the goal here is to have actionable, high-value signatures. The ones without symbols were not actionable because they were all different, even those for the same crash, and the symbolicated ones are too generic. I don't know how hard it would be to generate synthetic symbols from the CFI data, so I'm NI?ing Calixte who should know better.

That being said, that kind of approach might be much better than having the library name + offset, given the large number of driver versions available. One quick alternative is to add OpenAdapter10, OpenAdapter10_2 and OpenAdapter12 to the prefix list and see what comes out of it. Poking at a few crashes shows that we will get at least these four different signatures:

[@ OpenAdapter10_2 | GTPIN_IGC_Instrument]
[@ OpenAdapter10_2 | RtlpTpWaitCallback]
[@ OpenAdapter10_2 | CContext::TID3D11DeviceContext_SetConstantBuffers_]
[@ OpenAdapter10_2 | CContext::UMQueryVS_ConstBuf_]
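
For reference, a simplified model of how a prefix list extends a signature. This is not Socorro's implementation: the real generator applies more rules, and collapsing consecutive identical frames is an assumption made here so that the output matches the signatures listed above.

```python
# Entries proposed above for the prefix list.
PREFIX_LIST = {"OpenAdapter10", "OpenAdapter10_2", "OpenAdapter12"}


def generate_signature(frames, prefixes=PREFIX_LIST):
    """Join frames into a signature, continuing past prefix-list entries.

    Simplified model: a frame on the prefix list is kept and the next frame
    is appended too; the first frame not on the list ends the signature.
    """
    parts = []
    for function in frames:
        if parts and parts[-1] == function:
            continue  # assumption: consecutive identical frames collapse
        parts.append(function)
        if function not in prefixes:
            break
    return " | ".join(parts)


# Function names from the top frames of the crash report quoted earlier.
frames = [
    "OpenAdapter10_2",
    "OpenAdapter10_2",
    "OpenAdapter10_2",
    "GTPIN_IGC_Instrument",
    "CContext::TID3D11DeviceContext_IASetVertexBuffers_<2>",
]
print(generate_signature(frames))
print(generate_signature(frames, prefixes=set()))
```

With the three entries on the list the example stack yields `OpenAdapter10_2 | GTPIN_IGC_Instrument`; with an empty list it stops at `OpenAdapter10_2`.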

Alternatively, if we feel that these crashes are driver-specific, we could append the adapter driver version to the signature so that they "clump" together by driver version. That would make it easier to blacklist specific driver versions.

This last change would require a little bit of extra work on the Socorro side but I'm confident I can do it.
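
A minimal sketch of that idea, with a made-up suffix format (`| driver <version>`) and hypothetical report tuples. How Socorro would actually encode the version is not decided here.

```python
from collections import Counter


def signature_with_driver(signature, driver_version):
    """Append the adapter driver version, if known, to a signature.

    The " | driver <version>" suffix is a hypothetical format.
    """
    if not driver_version:
        return signature
    return f"{signature} | driver {driver_version}"


# Hypothetical reports as (signature, adapter driver version).
reports = [
    ("OpenAdapter10_2", "23.20.16.4973"),
    ("OpenAdapter10_2", "23.20.16.4973"),
    ("OpenAdapter10_2", "20.19.15.5107"),
    ("OpenAdapter10_2", None),
]

counts = Counter(signature_with_driver(s, v) for s, v in reports)
for signature, count in counts.most_common():
    print(count, signature)
```

Reports then group per driver version instead of under one generic signature.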

Flags: needinfo?(gsvelto) → needinfo?(cdenizet)

Gabriele and I talked on Matrix about this a bit. We agreed that for https://crash-stats.mozilla.org/report/index/f7e1e052-a848-4071-83f6-f204d0201024 we want the signature to be [igd10iumd64.dll | CContext::TID3D11DeviceContext_IASetVertexBuffers_<2>].

"The breakpad stack-walker will pick the nearest public symbol to an address if it can't find a FUNC entry that covers that particular address range. I think it should be possible to change that behavior into emitting the library name instead (w/o the offset). Assuming we can extract address ranges from the CFI info we should be able to add a size to the symbol which would make this process almost automatic. I'll ask Calixte because he's the one with the most hands-on experience dealing with PDB info and how it's transformed into SYM."

I made a patch:
https://github.com/mozilla/dump_syms/pull/153

PUBLIC symbols size is guessed using CFI info.
And a dummy symbol is added to catch addresses in the wild.
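
Roughly, the effect on symbolication can be sketched as below. This is not the dump_syms code: the sized symbol table is hypothetical, and treating the dummy symbol as anchored at the module base (so the printed offset is module-relative) is an assumption.

```python
MODULE = "igd10iumd64.dll"

# Hypothetical sized symbols as (start, size, name). The size stands in for
# the value guessed from CFI info; the numbers are made up.
SIZED_SYMBOLS = [(0x1200, 0x80, "OpenAdapter10_2")]


def symbolicate(offset, module=MODULE, sized=SIZED_SYMBOLS):
    """Resolve a module-relative offset to 'module!symbol + offset'.

    Offsets outside every sized symbol fall back to a dummy symbol named
    after the module (assumed here to start at the module base).
    """
    for start, size, name in sized:
        if start <= offset < start + size:
            return f"{module}!{name} + {offset - start:#x}"
    return f"{module}!<unknown in {module}> + {offset:#x}"


for offset in (0x1210, 0xE3ED):
    print(symbolicate(offset))
```

An address inside the sized symbol still resolves to OpenAdapter10_2, while anything else lands on the dummy symbol instead of the nearest public one.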

Assignee: nobody → cdenizet
Status: NEW → ASSIGNED
Flags: needinfo?(cdenizet)
Assignee: cdenizet → nobody
Status: ASSIGNED → NEW

Flags: needinfo?(jmathies)
Severity: -- → S3
See Also: → 1677281

I'm testing Calixte's patch with a few different driver versions and crashes.

I tested a few crashes with the proposed changes and here are the results. For each crash I'm posting the crash URL, the stack frames contributing to the signature and an example of what the new signature should look like (right now the signature is OpenAdapter10_2 for all of these crashes and they all happen in the same driver version):

  • https://crash-stats.mozilla.org/report/index/985bbd5d-a145-41c3-b676-47f070201117

     0  igd10iumd64.dll!<unknown in igd10iumd64.dll> + 0xe3ed
     1  igd10iumd64.dll!<unknown in igd10iumd64.dll> + 0xe7f5
     2  igd10iumd64.dll!<unknown in igd10iumd64.dll> + 0xe319
     3  igd10iumd64.dll!<unknown in igd10iumd64.dll> + 0x395e7
     4  d3d11.dll!static void CContext::TID3D11DeviceContext_RSSetScissorRects_<2>(struct ID3D11DeviceContext4*,unsigned int,struct tagRECT const *) + 0x119
    

    Signature: <unknown in igd10iumd64.dll> | <unknown in igd10iumd64.dll> | <unknown in igd10iumd64.dll> | CContext::TID3D11DeviceContext_RSSetScissorRects_<2>

  • https://crash-stats.mozilla.org/report/index/65067595-d23a-4e57-8296-cee450201117

     0  igd10iumd64.dll!<unknown in igd10iumd64.dll> + 0x15ea
     1  igd10iumd64.dll!<unknown in igd10iumd64.dll> + 0x8f2d7
     2  igd10iumd64.dll!<unknown in igd10iumd64.dll> + 0x86155
     3  igd10iumd64.dll!<unknown in igd10iumd64.dll> + 0x5ab9a
     4  d3d11.dll!static void CContext::TID3D11DeviceContext_ClearView_<1>(struct ID3D11DeviceContext4*,struct ID3D11View *,float const * const,struct tagRECT const *,unsigned int) + 0x16b
    

    Signature: <unknown in igd10iumd64.dll> | <unknown in igd10iumd64.dll> | <unknown in igd10iumd64.dll> | CContext::TID3D11DeviceContext_ClearView_<1>

  • https://crash-stats.mozilla.org/report/index/0d33d130-e55f-4cb6-bd37-88c390201117

     0  igd10iumd64.dll!<unknown in igd10iumd64.dll> + 0x13e46
     1  igd10iumd64.dll!<unknown in igd10iumd64.dll> + 0x61af
     2  d3d11.dll!static void CContext::UMQueryVS_ConstBuf_(struct D3D10DDI_HRTCORELAYER,unsigned int,unsigned int) + 0x11e
    

    Signature: <unknown in igd10iumd64.dll> | <unknown in igd10iumd64.dll> | CContext::UMQueryVS_ConstBuf_

  • https://crash-stats.mozilla.org/report/index/ac4949e7-0bb4-433d-8677-1cf400201117#tab-rawdump

     0  igd10iumd64.dll!<unknown in igd10iumd64.dll> + 0xe3ed
     1  igd10iumd64.dll!<unknown in igd10iumd64.dll> + 0xe7f5
     2  igd10iumd64.dll!<unknown in igd10iumd64.dll> + 0x41aac
     3  d3d11.dll!int NDXGI::CDevice::Flush(unsigned int,enum D3DWDDM2_0DDI_CONTEXTTYPE_FLAG) + 0xa7
    

    Signature: <unknown in igd10iumd64.dll> | <unknown in igd10iumd64.dll> | <unknown in igd10iumd64.dll> | NDXGI::CDevice::Flush

Does this grouping look better?

edit: corrected the signatures

Flags: needinfo?(jmuizelaar)

Yes

Flags: needinfo?(jmuizelaar)

OK. Note that I've put on hold further scraping until we ship the updated dump_syms. It's no use to pile more signatures here.

No longer blocks: gfx-triage
Flags: needinfo?(jmathies)

Calixte has finished modifying dump_syms so that the new symbols will be as shown in the examples in comment 11. He'll be making a release today. Once that's done I'll reprocess all the symbols affected by this crash; this signature will most likely go away and we'll be able to re-triage all the resulting crashes under new, more meaningful signatures.

Nice!

Blocks: 1684166

#60 crash for Thunderbird 78.6.0

Whiteboard: [tbird crash]

Update: I haven't reprocessed the symbols yet because the format we chose to use seems to be tripping up Socorro's signature generation. See bug 1685178 for more info. I'll reprocess the graphics drivers' symbols once we figure that one out.

There are also

(In reply to Gabriele Svelto [:gsvelto] from comment #17)

... See bug 1685178 for more info. I'll reprocess the graphics drivers' symbols once we figure that one out.

Flags: needinfo?(gsvelto)

We've landed all the client-side and server-side bits to reprocess these crashes so I'll do it tonight.

Flags: needinfo?(gsvelto)

I've reprocessed the drivers that were accounting for the majority of the crashes here. The volume under this signature should go down dramatically from this point on. Some volume will remain because 32-bit drivers don't have any form of stack-walking information in them - just public symbols - and as such they will keep clumping under this crash signature.

(In reply to Gabriele Svelto [:gsvelto] from comment #20)

The signature volume had already dropped. What signatures should we now look for?

Flags: needinfo?(gsvelto)

Graphics driver crashes will now have signatures that start with something like <unknown in name_of_library.dll>. Many of the crashes that were here can be found with a query like this one. Note that there doesn't seem to be just a single high-volume crash there, which is to be expected. Those crashes come from different driver versions, with different bugs.
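
As a quick illustration of how such signatures could be picked out locally, here is a sketch that extracts the library name from the leading `<unknown in ...>` component and counts signatures per library. The helper name and the sample list are hypothetical; the signatures themselves are the ones from the examples earlier in this bug.

```python
import re
from collections import Counter

# Matches a signature whose first component is "<unknown in some_library>".
UNKNOWN_RE = re.compile(r"^<unknown in ([^>]+)>")


def driver_library(signature):
    """Return the library named in a leading '<unknown in ...>' component."""
    match = UNKNOWN_RE.match(signature)
    return match.group(1) if match else None


signatures = [
    "<unknown in igd10iumd64.dll> | <unknown in igd10iumd64.dll> | CContext::UMQueryVS_ConstBuf_",
    "<unknown in igd10iumd64.dll> | <unknown in igd10iumd64.dll> | <unknown in igd10iumd64.dll> | NDXGI::CDevice::Flush",
    "OpenAdapter10_2",
]

per_library = Counter(
    lib for lib in map(driver_library, signatures) if lib is not None
)
print(per_library)
```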

Flags: needinfo?(gsvelto)
Duplicate of this bug: 1684166