Open Bug 1908630 Opened 3 months ago Updated 14 days ago

Enable ForkServer during PGO

Categories

(Core :: Security: Process Sandboxing, defect, P2)

defect

Tracking

()

REOPENED

People

(Reporter: gerard-majax, Assigned: gerard-majax)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

Attachments

(1 file, 5 obsolete files)

From bug 1907883 we have been able to identify that forkserver during PGO profiling results in:

  • much less *.profraw data files (~80 with forkserver VS ~180 without forkserver)
  • less process being profiled (counted 42 processes without forkserver, 27 with forkserver)

It would be nice to understand why.

LVM Profile Error: Invalid profile data to merge
LLVM Profile Error: Profile Merging of file /home/alex/codaz/Mozilla/gecko-cinnabar/PGO_data/pgo_forkserver_4138445_random_2970392330336318723_0.profraw failed: Le fichier existe
LLVM Profile Error: Failed to write file "/home/alex/codaz/Mozilla/gecko-cinnabar/PGO_data/pgo_forkserver_4138445_random_2970392330336318723_0.profraw": Le fichier existe

Getting this when I try to run the instrumented build. 4138445 is the ForkServer PID, so likely LLVM profiling gets tricked on the PID of WebContent process after they came from ForkServer

See Also: → 1196094
See Also: → 1553850
Attachment #9417510 - Attachment description: WIP: Bug 1908630 - Disabling LLVM Profiling for manual handling → WIP: Bug 1908630 - Override LLVM Profiling for manual handling
Attached file WIP: Bug 1908630 - Enable PGO on NSS ? (obsolete) —
Summary: Investigate why ForkServer enabled during PGO profiling step results in so bad data → Enable ForkServer during PGO
Attachment #9417509 - Attachment is obsolete: true
Attachment #9417511 - Attachment is obsolete: true
Attachment #9417512 - Attachment is obsolete: true
Attachment #9417762 - Attachment is obsolete: true
Attachment #9417510 - Attachment description: WIP: Bug 1908630 - Override LLVM Profiling for manual handling → Bug 1908630 - Override LLVM Profiling for manual handling r?#build

https://treeherder.mozilla.org/jobs?repo=try&revision=324e0fda0a80879857dbc3ada3895ba8cd82a1a4
this shows the same amount of data for merged.profdata as compared to previously:

$ tar tvf profdata.tar.xz 
-rw-r--r-- worker/worker 106602448 2024-09-04 13:59 merged.profdata

With e.g., a mozilla-central run [https://treeherder.mozilla.org/jobs?repo=mozilla-central&searchStr=linux64%2Crun]:

$ tar tvf profdata.tar.xz 
-rw-r--r-- worker/worker 106487816 2024-09-05 07:18 merged.profdata
Attachment #9418043 - Attachment is obsolete: true
Blocks: 1874689
Pushed by alissy@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/819d4c9682d4 Override LLVM Profiling for manual handling r=firefox-build-system-reviewers,glandium

Backed out as requested for working in the incorrect sandbox.

Flags: needinfo?(lissyx+mozillians)
Flags: needinfo?(lissyx+mozillians)
Status: NEW → RESOLVED
Closed: 29 days ago
Resolution: --- → FIXED
Target Milestone: --- → 132 Branch
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 132 Branch → ---
Depends on: 1917575

So with the progress on bug 1917575, I rebased on top of that stack and I'm getting stuck into a kind of deadlock state at startup within updateFilemaneAfterFork:

$ grep --color -E "0x7f957bb06c20|0x564b7954be10" pgo_pthread_atfork.log 
[297869] [bool malloc_init_hard()] register _malloc_postfork_child@0x564b7954be10
[297869] [int CustomRegisterRuntime()] register updateFilenameAfterFork@0x7f957bb06c20
[297887] [void updateFilenameAfterFork()] running updateFilenameAfterFork@0x7f957bb06c20 START
[297904] [void updateFilenameAfterFork()] running updateFilenameAfterFork@0x7f957bb06c20 START
[297917] [void updateFilenameAfterFork()] running updateFilenameAfterFork@0x7f957bb06c20 START

Obviously _malloc_postfork_child is registered before updateFilenameAfterFork but does not get called before we call updateFilenameAfterFork and I suspect this may be because how we handle interposition of pthread_atfork and forwarding to the real atfork: the real_pthread_atfork symbol is resolved after the malloc init (because searching the symbol triggers some malloc so we would deadlock), so the interposer registers and will replay once things allows.

But our interposer does not looks to be applied to the LLVM bypass code, so the pthread_atfork it uses is directly the one from the glibc: ergo, even if malloc calls before, the actual moment the glibc really knows about it is after the registration of updateFilenameAfterFork and the child handlers are called in order of registration.

Tricking by just using the pthread_atfork child handler to atexit(...) with ... a function that will call updateFilenameAfterFork seems enough.

$ grep --color -E "0x7f7985506c40|0x5580a6468ec0" pgo_pthread_atfork.log 
[291301] [bool malloc_init_hard()] register _malloc_postfork_child@0x5580a6468ec0
[291301] [int CustomRegisterRuntime()] register registerAtExit@0x7f7985506c40
[291320] [void registerAtExit()] running registerAtExit@0x7f7985506c40 START
[291320] [void registerAtExit()] running registerAtExit@0x7f7985506c40 END
[291320] [void _malloc_postfork_child()] running _malloc_postfork_child@0x5580a6468ec0 START
[291320] [void _malloc_postfork_child()] running _malloc_postfork_child@0x5580a6468ec0 END
[291444] [void registerAtExit()] running registerAtExit@0x7f7985506c40 START
[291444] [void registerAtExit()] running registerAtExit@0x7f7985506c40 END
[291444] [void _malloc_postfork_child()] running _malloc_postfork_child@0x5580a6468ec0 START
[291444] [void _malloc_postfork_child()] running _malloc_postfork_child@0x5580a6468ec0 END
[291563] [void registerAtExit()] running registerAtExit@0x7f7985506c40 START
[291563] [void registerAtExit()] running registerAtExit@0x7f7985506c40 END
[291563] [void _malloc_postfork_child()] running _malloc_postfork_child@0x5580a6468ec0 START
[291563] [void _malloc_postfork_child()] running _malloc_postfork_child@0x5580a6468ec0 END
[291571] [void registerAtExit()] running registerAtExit@0x7f7985506c40 START
[291571] [void registerAtExit()] running registerAtExit@0x7f7985506c40 END
[291571] [void _malloc_postfork_child()] running _malloc_postfork_child@0x5580a6468ec0 START
[291571] [void _malloc_postfork_child()] running _malloc_postfork_child@0x5580a6468ec0 END
[291574] [void registerAtExit()] running registerAtExit@0x7f7985506c40 START
[291574] [void registerAtExit()] running registerAtExit@0x7f7985506c40 END
[291574] [void _malloc_postfork_child()] running _malloc_postfork_child@0x5580a6468ec0 START
[291574] [void _malloc_postfork_child()] running _malloc_postfork_child@0x5580a6468ec0 END
[291718] [void registerAtExit()] running registerAtExit@0x7f7985506c40 START
[291718] [void registerAtExit()] running registerAtExit@0x7f7985506c40 END
[291718] [void _malloc_postfork_child()] running _malloc_postfork_child@0x5580a6468ec0 START
[291718] [void _malloc_postfork_child()] running _malloc_postfork_child@0x5580a6468ec0 END
[291910] [void registerAtExit()] running registerAtExit@0x7f7985506c40 START
[291910] [void registerAtExit()] running registerAtExit@0x7f7985506c40 END
[291910] [void _malloc_postfork_child()] running _malloc_postfork_child@0x5580a6468ec0 START
[291910] [void _malloc_postfork_child()] running _malloc_postfork_child@0x5580a6468ec0 END

I guess that contrary to within the interposer code, we want to dlsym(nullptr, "pthread_atfork") here to find the interposed function and not the glibc one. This way we follow the same call order than the rest of the interposed calls and we are fine.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: