Open Bug 1524574 Opened 5 years ago Updated 1 year ago

mainthreadio markers missing for closing a file on Windows (NtClose)

Categories

(Core :: Gecko Profiler, enhancement, P2)

x86_64
Windows 10
enhancement

Tracking

()

People

(Reporter: florian, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [fxp])

Attachments

(2 files)

See this profile where most of the time is spent in NtClose, but there's no I/O marker for it: https://perfht.ml/2DSi7lv

We do have open/read/write/stat/fsync markers.

It looks like the IOInterposer code landed as part of bug 902587 doesn't interpose NtClose.

Long time ago, so hard for me to page in, but I see the unlanded patch 1 in bug 888534 adds these operations to IOInterposer. The specific hooks to PoisonIOInterposer would then need to be added via a subsequent patch.

If we do add it, keep in mind that NtClose is not just used for I/O, and in fact we won't want markers for most calls to that function.

Really we'd only want markers for NtClose that actually correspond to closing of file handles, so we should only log a marker when GetFileType() == FILE_TYPE_DISK. We probably should measure the perf impacts of such a check. Hopefully there aren't any, but should verify that.

(In reply to Aaron Klotz [:aklotz] from comment #1)

If we do add it, keep in mind that NtClose is not just used for I/O, and in fact we won't want markers for most calls to that function.

Really we'd only want markers for NtClose that actually correspond to closing of file handles

Is CloseHandle a better symbol to interpose?

In the profile from comment 0, the stack looks like:
NtClose (ntdll.dll)
CloseHandle (KERNELBASE.dll)
static <unnamed-tag> FileClose(struct PRFileDesc *) (nss3.dll)
<actual Firefox code> (xul.dll)

NtClose is the right one to use for this, as the other PoisonIOInterposer interceptions are at the native NT layer as well.

Blocks: 1527023
Priority: -- → P2

Here is a try run of the above patch with the main thread I/O startup test: https://treeherder.mozilla.org/#/jobs?repo=try&revision=84bfbc48d5ed482a273a0c9c1351551bf6978f88

On Windows 7 32bit it produces the expected output. On Windows 10 64bit, it's busted to the point that even the xpcshell fake http server fails to start.

I tried to debug this locally on my Windows 10 machine but didn't get anywhere. When using startup+shutdown profiling, I got profiles with correct 'close' FileIO markers. But when using startup profiling and capturing the profile while the browser was running, I never got a profile displayed, the Firefox Profiler UI was stuck on waiting to receive the profile from the Gecko profiler. This looks like there's a deadlock, but I couldn't figure out where.

Then I tried using a debug build, hoping it would give me some clue of what could be causing the deadlock... but I realized that just using startup profiling with the mainthreadio feature broke the browser, both with and without my patch. I filed bug 1543060.

Aaron, any idea of what I could have done wrong in this patch, or what could be broken elsewhere that my patch exposes?

Flags: needinfo?(aklotz)

Any time I see anything about deadlocks involving the profiler on 64-bit, it makes it sound to me like something is missing stackwalk suppressing.

Flags: needinfo?(aklotz)

(In reply to Aaron Klotz [:aklotz] from comment #6)

Any time I see anything about deadlocks involving the profiler on 64-bit, it makes it sound to me like something is missing stackwalk suppressing.

Seems like a good guess, thanks!

I tried adding "AutoSuppressStackWalking suppress;" in the functions that seem to play a role here (InterposedNtClose and locked_profiler_stream_json_for_this_process) but unfortunately that didn't fix the deadlock I'm encountering.

Unfortunately without stacks that's about the most that I can speculate.

See Also: → 1554667

Oops, I missed this bug, so I opened another one (now marked duplicate). Here's the info I wrote there, in case there's something useful:

Intercepting NtClose is possible, and would be very useful, because it is a suspected to sometimes have a heavy cost: See "Closing File Handles on Windows" in this article about surprisingly-slow things.
But it presents some technical difficulties:

  • NtClose may be used on any type of object, so we'd need to filter out non-file objects.
  • Also because NtClose may be called with any object, once in the interception routine, the interceptor could too easily be recursively called! In particular it makes stack-walking impossible (TBD), making the interception less useful.

I'm attaching an old WIP patch that may give some ideas.

Assignee: nobody → gsquelart

(Adding mention of NtClose in bug title.)

Summary: mainthreadio markers missing for closing a file on Windows → mainthreadio markers missing for closing a file on Windows (NtClose)
Assignee: mozbugz → nobody
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: