Closed Bug 1717909 Opened 4 years ago Closed 4 years ago

System symbols scraped from Windows symbol servers are missing CFI information

Categories

(Toolkit :: Crash Reporting, defect)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
94 Branch
Tracking Status
firefox94 --- fixed

People

(Reporter: gsvelto, Assigned: calixte)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

I was looking at some recent symbols that were scraped from Microsoft's servers and noticed that they are missing CFI information:

This suggests a bug in the symbol scraping task. I'll try to investigate today or tomorrow at the latest.

This is the task that generated the ntdll.sym symbols in comment 0. Now to figure out what went wrong...

This is super silly. We're making too many connections to the symbol server and the failures go unnoticed. This is from the task log:

[task 2021-06-23T00:03:15.609Z] WARNING:root:Error with https://msdl.microsoft.com/download/symbols/MLogHook.pdb/AE890A7047524698BC2CC65A0AC24C631/MLogHook.pdb: retry
[task 2021-06-23T00:03:15.610Z] ERROR:root:Cannot connect to host msdl.microsoft.com:443 ssl:default [Connection reset by peer]
[task 2021-06-23T00:03:15.610Z] Traceback (most recent call last):
[task 2021-06-23T00:03:15.610Z]   File "/usr/local/lib/python3.7/dist-packages/aiohttp/connector.py", line 936, in _wrap_create_connection
[task 2021-06-23T00:03:15.610Z]     return await self._loop.create_connection(*args, **kwargs)  # type: ignore  # noqa
[task 2021-06-23T00:03:15.610Z]   File "/usr/lib/python3.7/asyncio/base_events.py", line 986, in create_connection
[task 2021-06-23T00:03:15.610Z]     ssl_handshake_timeout=ssl_handshake_timeout)
[task 2021-06-23T00:03:15.610Z]   File "/usr/lib/python3.7/asyncio/base_events.py", line 1014, in _create_connection_transport
[task 2021-06-23T00:03:15.610Z]     await waiter
[task 2021-06-23T00:03:15.610Z]   File "/usr/lib/python3.7/asyncio/selector_events.py", line 801, in _read_ready__data_received
[task 2021-06-23T00:03:15.610Z]     data = self._sock.recv(self.max_size)
[task 2021-06-23T00:03:15.610Z] ConnectionResetError: [Errno 104] Connection reset by peer

We have to further reduce how many symbols we fetch at the same time.
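A minimal sketch of the idea, not the actual change to symsrv-fetch.py: cap the number of in-flight requests with an `asyncio.Semaphore` and retry on connection resets. The names `fetch_one`, `fetch_all`, and the `download` callable are hypothetical and only used for illustration. With aiohttp, a comparable per-host cap can be set through `aiohttp.TCPConnector(limit_per_host=...)`.

```python
import asyncio


async def fetch_one(url, download, semaphore, retries=3, delay=0.0):
    """Fetch one URL while holding the semaphore, retrying on resets."""
    async with semaphore:
        for attempt in range(1, retries + 1):
            try:
                return await download(url)
            except ConnectionResetError:
                if attempt == retries:
                    # Surface the failure instead of silently dropping the file.
                    raise
                # Back off a little longer after each failed attempt.
                await asyncio.sleep(delay * attempt)


async def fetch_all(urls, download, max_concurrent=1):
    """Fetch every URL with at most `max_concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [fetch_one(url, download, semaphore) for url in urls]
    # return_exceptions=True keeps one failed file from cancelling the rest;
    # the caller can inspect the results and report what is missing.
    return await asyncio.gather(*tasks, return_exceptions=True)
```

Returning the exceptions alongside the successful results makes it possible to log exactly which files were not fetched, so a later run can pick them up.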

Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Blocks: 1718294

I expected that 4 would not be so bad, and that if we miss a few PDBs we can get them the day after.

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5d32419d6b5b Fetch only one file at a time from Microsoft's symbol servers; r=marco,calixte

After thinking about it:

If we don't manage to get a file, we expect to get it in the next run, so I think that reducing limit_per_host will just increase the time spent in this task with no particular benefit.
Anyway, for the x86_64 target, the CFI info is in the DLL:
https://searchfox.org/mozilla-central/source/tools/crashreporter/system-symbols/win/symsrv-fetch.py#297

So I'd say that we manage to get the PDB but, for some reason, we don't have the DLL/EXE for it, and we end up with partial data in the .sym.
Or maybe for some reason the DLL/EXE doesn't contain CFI data (is that possible?), so we have the PDB and the DLL but no CFI in the .sym.
Normally we don't produce the .sym when the DLL is missing (only for the x86_64 target).

Maybe it's possible to have no CFI info when exception handling is disabled at compile time, because in that case CFI is pretty useless, but maybe I'm wrong.
If I'm correct, the fix could be to add an option to dump_syms to fail when CFI data is missing, because such a file is useless for the stack walker.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 91 Branch

(In reply to Calixte Denizet (:calixte) from comment #6)

> After thinking about it:
>
> If we don't manage to get a file, we expect to get it in the next run, so I think that reducing limit_per_host will just increase the time spent in this task with no particular benefit.
> Anyway, for the x86_64 target, the CFI info is in the DLL:
> https://searchfox.org/mozilla-central/source/tools/crashreporter/system-symbols/win/symsrv-fetch.py#297

I see, so I misinterpreted those; they're just warnings, right?

> So I'd say that we manage to get the PDB but, for some reason, we don't have the DLL/EXE for it, and we end up with partial data in the .sym.
> Or maybe for some reason the DLL/EXE doesn't contain CFI data (is that possible?), so we have the PDB and the DLL but no CFI in the .sym.
> Normally we don't produce the .sym when the DLL is missing (only for the x86_64 target).
>
> Maybe it's possible to have no CFI info when exception handling is disabled at compile time, because in that case CFI is pretty useless, but maybe I'm wrong.
> If I'm correct, the fix could be to add an option to dump_syms to fail when CFI data is missing, because such a file is useless for the stack walker.

Well, there's definitely something odd going on. If I look at what happened with the ntdll.dll from comment 0 I find these two lines in the task output:

[task 2021-06-23T00:05:37.729Z] INFO:root:To dump: ntdll.pdb/7FCBBFA95B3DDBC61A90F7C3FD4A7F371, ntdll.dll/088BF6211f5000 and has_code = False
[task 2021-06-23T00:05:37.729Z] INFO:root:To dump: ntdll.pdb/7FCBBFA95B3DDBC61A90F7C3FD4A7F371, / and has_code = 

They're odd; they look like some kind of duplication. Additionally, we seem to download the PDB twice:

[task 2021-06-23T00:05:37.816Z] DEBUG:root:Fetch url: https://msdl.microsoft.com/download/symbols/ntdll.pdb/7FCBBFA95B3DDBC61A90F7C3FD4A7F371/ntdll.pdb
[task 2021-06-23T00:05:37.816Z] DEBUG:root:Fetch url: https://msdl.microsoft.com/download/symbols/ntdll.pdb/7FCBBFA95B3DDBC61A90F7C3FD4A7F371/ntdll.pdb

Additionally the associated DLL is ntdll.dll with code id 088BF6211f5000 and it seems that it cannot be found. I tried to manually scrape it again today and it's not there either. So yeah, it seems like the correct fix is to fail if we only have the PDB, so that the scripts will try to re-scrape the file at a later time and hopefully will find the corresponding DLL. I'm re-opening this.
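As a rough illustration of "fail if we only have the PDB", a check like the following could reject a generated .sym file that carries no unwinding records. It relies on the Breakpad symbol format, where unwinding data appears in lines starting with `STACK CFI` or `STACK WIN`. The function names are hypothetical, and this is not necessarily how the actual fix was implemented.

```python
def has_unwind_info(sym_lines):
    """Return True if any line of a Breakpad .sym file is a STACK record."""
    return any(
        line.startswith(("STACK CFI", "STACK WIN")) for line in sym_lines
    )


def check_sym(sym_text):
    """Raise if the symbol file lacks unwinding data, so it can be retried."""
    if not has_unwind_info(sym_text.splitlines()):
        # Hypothetical policy: treat the file as a failure rather than
        # storing symbols the stack walker cannot use for unwinding.
        raise ValueError("symbol file has no STACK records; retry later")
    return sym_text
```

A caller could catch the `ValueError`, skip storing the file, and leave the module in the list of symbols to scrape again on the next run.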

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

I removed the tracking flags because this only affects our automation scripts, not Firefox per se.

Target Milestone: 91 Branch → ---

Handing this over to Calixte for the actual fix.

Assignee: gsvelto → cdenizet

I discovered something funny: sometimes we can't seem to find a PDB on Microsoft's symbol servers, but if we look for the DLL we also get the PDB. Here's an example. This command:

dump_syms rpcrt4.pdb --code-id 5DA53549140000 --store output --symbol-server "SRV*output*https://msdl.microsoft.com/download/symbols" --verbose error

Fails with:

Impossible to get file rpcrt4.pdb with id 5DA53549140000

But this one succeeds and also finds the corresponding PDB:

dump_syms rpcrt4.dll --code-id 5DA53549140000 --store output --symbol-server "SRV*output*https://msdl.microsoft.com/download/symbols" --verbose error

Maybe we should try both before giving up?
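The "try both" idea could be sketched as a small fallback loop. This is only an illustration under assumptions: `fetch_with_fallback` and the `fetch` callable are hypothetical names, and `fetch` is assumed to raise `LookupError` when the server does not have the requested file.

```python
def fetch_with_fallback(stem, code_id, fetch, extensions=(".pdb", ".dll")):
    """Try each candidate file name in turn and return the first success.

    `fetch(name, code_id)` is a caller-supplied function (hypothetical here)
    that returns the fetched data or raises LookupError if it is missing.
    """
    errors = []
    for ext in extensions:
        name = stem + ext
        try:
            return name, fetch(name, code_id)
        except LookupError as exc:
            # Remember why this candidate failed and move on to the next one.
            errors.append(f"{name}: {exc}")
    # Every candidate failed; report all of the reasons together.
    raise LookupError("; ".join(errors))
```

In practice `fetch` might wrap an invocation of dump_syms with the arguments shown in the commands above, translating a failed run into `LookupError`.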

Duh, I forgot to post the fix for this one.

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/bb0fd62efdd4 Ensure that symbol files for system libraries on Windows contain unwinding information r=calixte
Status: REOPENED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 94 Branch

FYI this is working as expected and catching a rather large number of instances where CFI data is missing, see this report.
