Closed Bug 1720746 Opened 3 years ago Closed 3 years ago

Assertion failures on RandomNum.cpp on Windows 10 2004 asan mochitest-media jobs

Categories

(Core :: MFBT, defect)

x86_64
Windows 10
defect

Tracking

()

RESOLVED FIXED
92 Branch
Tracking Status
firefox-esr91 --- fixed
firefox92 --- fixed

People

(Reporter: masterwayz, Assigned: toshi)

References

Details

Attachments

(1 file)

While moving the mochitest-media test suite over to Windows 10 2004 on Azure in Bug 1718297, the Windows 10 x64 asan WebRender opt jobs hit permanent failures, with many occurrences of Assertion failure: maybeRandomNum.isSome(), at /builds/worker/checkouts/gecko/mfbt/RandomNum.cpp:156 in the log files. Example:
https://treeherder.mozilla.org/jobs?repo=try&revision=079c7fcc8addaf4dee9f21df964fb4b74efe0d79&selectedTaskRun=A5Cdb1LRSsSl3VKUeEt8Jg.0
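
As background on the assertion text, the message reads like the common "OrDie" pattern, where a fallible generator returns an optional value and a wrapper aborts if the value is absent. The sketch below only illustrates that pattern under stated assumptions: it uses std::optional as a stand-in for the Maybe type, and FakeRandomUint64, FakeRandomUint64OrDie and the sourceAvailable flag are hypothetical names, not the code in mfbt/RandomNum.cpp.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <optional>
#include <random>

// Hypothetical fallible generator: returns nothing when the underlying
// entropy source is unavailable (simulated here by the flag).
std::optional<uint64_t> FakeRandomUint64(bool sourceAvailable) {
  if (!sourceAvailable) {
    return std::nullopt;
  }
  std::random_device rd;
  uint64_t hi = rd();
  uint64_t lo = rd();
  return (hi << 32) | (lo & 0xFFFFFFFFu);
}

// Hypothetical "OrDie" wrapper: aborts the process if no value was produced,
// which is the kind of failure the assertion message reports.
uint64_t FakeRandomUint64OrDie(bool sourceAvailable) {
  std::optional<uint64_t> maybeRandomNum = FakeRandomUint64(sourceAvailable);
  if (!maybeRandomNum.has_value()) {
    std::fprintf(stderr, "Assertion failure: maybeRandomNum.has_value()\n");
    std::abort();
  }
  return *maybeRandomNum;
}
```

In this model, any caller of the wrapper crashes as soon as the generator cannot reach its entropy source, regardless of what the caller was doing.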

Gijs recommended that I file a bug. What could cause this?
The differences are:

  • Upgrade of Windows 10 from an earlier version to 2004.
  • Move from AWS to Azure.
Flags: needinfo?(tkikuchi)

This is very strange. I ran a couple more tests on the same job above, and I can see that this happens with ASan only, and that TestRandomNum.exe in cppunit succeeded.

How can I run a job on Azure? To investigate further, I need to add extra logging code and see what happens.

Flags: needinfo?(tkikuchi) → needinfo?(michelle)

When using ./mach try, add the following at the end of the command: --worker-override="t-win10-64=gecko-t/win10-64-azure-2004"
Please note that I created this worker pool in Firefox CI manually. Whenever automation runs, the pool gets deleted and I need to re-add it manually, because the automation currently cannot run instances on Azure. So if it gets deleted and I don't catch it (I get pinged), feel free to poke me here, or on Matrix or Slack.

Flags: needinfo?(michelle) → needinfo?(tkikuchi)

I posted several instrumentation patches to observe the behavior. This job shows that the failing RandomUint64 was called by GenerateRandomPortName here. That call was introduced recently, but it should not fail, and I am still not sure why it did. One possible reason is that the stack pointer was out of range. I need to do more experiments to figure it out.

The failing tests appear to all be EME-related. Is there something about the GMP sandbox policy that could be causing problems?

There's also bug 1718348 about crashes in mozilla::RandomUint64OrDie() when security.sandbox.content.level is set to 20. Given the EME sandbox, maybe it could be the same issue.

See Also: → 1718348

I think I found something. Calling RtlGenRandom delay-loads bcryptPrimitives.dll, and the call is forwarded to bcryptPrimitives!ProcessPrng. As you can see in this job with instrumentation, however, the problem is that bcryptPrimitives.dll failed to load with ERROR_ACCESS_DENIED. I think that's why RtlGenRandom failed. It is probably because of the GMP sandbox policy, but I am not sure which policy blocks the module.

[task 2021-07-23T22:57:46.840Z] 22:57:46     INFO - GECKO(7136) | RandomUint64 failed: 0000000B97FFF0A0
[task 2021-07-23T22:57:46.840Z] 22:57:46     INFO - GECKO(7136) | MozGlueRandomUint64 failed - 00000000
[task 2021-07-23T22:57:46.841Z] 22:57:46     INFO - GECKO(7136) | LoadLibraryW(bcryptPrimitives) failed - 00000005
[task 2021-07-23T22:57:46.841Z] 22:57:46     INFO - GECKO(7136) | RtlGenRandom failed - 00000005
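
For reading logs like these, a small portable helper can pull out the trailing hexadecimal value and name the codes that matter in this bug. This is only a sketch: TrailingHexCode and DescribeCode are hypothetical helpers, and the table deliberately covers just the Win32 error 00000005 (ERROR_ACCESS_DENIED) seen above and the NTSTATUS c0000022 (STATUS_ACCESS_DENIED) reported later in this bug.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Extracts the hexadecimal value after the last " - " in a log line such as
// "RtlGenRandom failed - 00000005". Returns nothing if the separator is
// missing or the suffix is not 1 to 8 hex digits.
std::optional<uint32_t> TrailingHexCode(const std::string& line) {
  const std::string sep = " - ";
  std::string::size_type pos = line.rfind(sep);
  if (pos == std::string::npos) {
    return std::nullopt;
  }
  std::string digits = line.substr(pos + sep.size());
  if (digits.empty() || digits.size() > 8) {
    return std::nullopt;
  }
  uint32_t value = 0;
  for (char c : digits) {
    uint32_t nibble;
    if (c >= '0' && c <= '9') {
      nibble = static_cast<uint32_t>(c - '0');
    } else if (c >= 'a' && c <= 'f') {
      nibble = static_cast<uint32_t>(c - 'a' + 10);
    } else if (c >= 'A' && c <= 'F') {
      nibble = static_cast<uint32_t>(c - 'A' + 10);
    } else {
      return std::nullopt;
    }
    value = (value << 4) | nibble;
  }
  return value;
}

// Names only the two codes discussed in this bug; everything else is
// reported as "unknown" rather than guessed.
const char* DescribeCode(uint32_t code) {
  switch (code) {
    case 0x00000005u:
      return "ERROR_ACCESS_DENIED";
    case 0xC0000022u:
      return "STATUS_ACCESS_DENIED";
    default:
      return "unknown";
  }
}
```

Applied to the LoadLibraryW and RtlGenRandom lines above, the helper yields 5 in both cases, which matches the access-denied diagnosis.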
Flags: needinfo?(tkikuchi)

Interestingly, this issue was fixed by adding bcryptPrimitives.dll to the CIG allowlist in this job. This confirms that the failure of RtlGenRandom was caused by the failure to load bcryptPrimitives.dll in the plugin process.

Luckily, I could reproduce the loading failure with my local ASan build, which means this issue is not Azure-specific. So far I have only found that LdrLoadDll for bcryptPrimitives.dll returned c0000022 (STATUS_ACCESS_DENIED). I'll debug it further.

The call to RtlGenRandom delay-loads bcryptPrimitives.dll. The GMP process,
however, cannot load bcryptPrimitives.dll after process launch because its process
token is restricted. This is normally not a problem because bcryptPrimitives.dll
is loaded at an early stage, while the main thread still has a non-restricted
impersonation token.

With ASan, however, the first call to RandomUint64 happens in a non-main thread,
and it fails because the thread is not impersonated. We have PreloadLibs to mitigate
this kind of problem, but in this case adding bcryptPrimitives.dll to the list does
not help because the call to RandomUint64 happens before we load PreloadLibs.

The proposed fix is to explicitly call RandomUint64 in the main thread before
any call to RandomUint64 in the process.
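
The ordering problem and the proposed fix can be modeled in portable C++. This is a simplified simulation, not the landed patch: the impersonation token is reduced to a thread-local flag, the delay-loaded module to a process-wide boolean, and every name below (tIsImpersonating, EnsureModuleLoaded, FakeRandomCallSucceeds, CallOnWorkerThread, ResetSimulation) is hypothetical.

```cpp
#include <mutex>
#include <thread>

// Simulated per-thread state: true while the thread holds the less
// restricted impersonation token. New threads start without it.
thread_local bool tIsImpersonating = false;

// Simulated process-wide state: once the module is loaded, it stays loaded
// and is usable from every thread.
std::mutex gLoaderLock;
bool gModuleLoaded = false;

// Simulated delay-load: the first load succeeds only on an impersonating
// thread; later calls succeed anywhere because the module is already mapped.
bool EnsureModuleLoaded() {
  std::lock_guard<std::mutex> guard(gLoaderLock);
  if (gModuleLoaded) {
    return true;
  }
  if (!tIsImpersonating) {
    return false;  // models the access-denied load failure
  }
  gModuleLoaded = true;
  return true;
}

// Simulated random-number call that depends on the delay-loaded module.
bool FakeRandomCallSucceeds() { return EnsureModuleLoaded(); }

// Runs the call on a freshly created worker thread, which never impersonates.
bool CallOnWorkerThread() {
  bool ok = false;
  std::thread worker([&ok] { ok = FakeRandomCallSucceeds(); });
  worker.join();
  return ok;
}

// Returns the simulation to the "process just launched" state.
void ResetSimulation() {
  std::lock_guard<std::mutex> guard(gLoaderLock);
  gModuleLoaded = false;
}
```

In this model, calling CallOnWorkerThread() first fails, while calling FakeRandomCallSucceeds() on the main thread with tIsImpersonating set, and only then starting the worker, succeeds. That mirrors the reasoning behind making the first call on the main thread.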

Assignee: nobody → tkikuchi
Status: NEW → ASSIGNED

Here are the call stacks at the point where the call to RandomUint64 failed. More specifically, the call to NtCreateSection to open a section handle of bcryptPrimitives.dll in the KnownDlls directory failed with STATUS_ACCESS_DENIED. Thread 0 was impersonated, but thread 1, which tried to load the module, was not.

.  0  Id: a98.43d8 Suspend: 1 Teb: 000000c0`4f674000 Unfrozen
 # Child-SP          RetAddr           Call Site
00 000000c0`4ffff308 00007ffa`368919ce ntdll!NtWaitForSingleObject+0x14
01 000000c0`4ffff310 00007ff9`b7dc8036 KERNELBASE!WaitForSingleObjectEx+0x8e
02 000000c0`4ffff3b0 00007ff9`b7df76c3 xul!base::WaitableEvent::Wait+0x26
03 000000c0`4ffff3e0 00007ff9`b7df8efe xul!base::Thread::StartWithOptions+0x1b3
04 000000c0`4ffff4d0 00007ff9`b7ee5d56 xul!ChildProcess::ChildProcess+0x4e
05 000000c0`4ffff500 00007ff9`be330265 xul!mozilla::ipc::ProcessChild::ProcessChild+0x156
06 000000c0`4ffff600 00007ff9`c476601d xul!mozilla::gmp::GMPProcessChild::GMPProcessChild+0x15
07 (Inline Function) --------`-------- xul!mozilla::MakeUnique+0x25
08 000000c0`4ffff640 00007ff7`66651592 xul!XRE_InitChildProcess+0xfad
09 (Inline Function) --------`-------- plugin_container!content_process_main+0x18d
0a 000000c0`4ffffaa0 00007ff7`6665124c plugin_container!NS_internal_main+0x2b2
0b 000000c0`4ffffbc0 00007ff7`66716ab8 plugin_container!wmain+0x24c
0c (Inline Function) --------`-------- plugin_container!invoke_main+0x22
0d 000000c0`4ffffca0 00007ffa`38cf7034 plugin_container!__scrt_common_main_seh+0x10c
0e 000000c0`4ffffce0 00007ffa`38ec2651 kernel32!BaseThreadInitThunk+0x14
0f 000000c0`4ffffd10 00000000`00000000 ntdll!RtlUserThreadStart+0x21

   1  Id: a98.3edc Suspend: 1 Teb: 000000c0`4f676000 Unfrozen
 # Child-SP          RetAddr           Call Site
00 000000c0`507fe640 00007ffa`38e843ea ntdll!LdrpFindKnownDll+0x77
01 000000c0`507fe6b0 00007ffa`38edb1dd ntdll!LdrpLoadKnownDll+0x52
02 000000c0`507fe710 00007ffa`38e8fb31 ntdll!LdrpFindOrPrepareLoadingModule+0xbd
03 000000c0`507fe780 00007ffa`38e873e4 ntdll!LdrpLoadDllInternal+0x11d
04 000000c0`507fe800 00007ffa`38e86af4 ntdll!LdrpLoadDll+0xa8
05 000000c0`507fe9b0 00007ff9`d54a3d7a ntdll!LdrLoadDll+0xe4
06 (Inline Function) --------`-------- mozglue!mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared>,long (*)(wchar_t *, unsigned long *, _UNICODE_STRING *, void **)>::operator()+0x1f
07 000000c0`507feaa0 00007ffa`3689ad52 mozglue!patched_LdrLoadDll+0xa4a
08 000000c0`507ff0d0 00007ff9`d562c6b3 KERNELBASE!LoadLibraryExW+0x162
09 000000c0`507ff140 00007ff9`d562c84f mozglue!mozilla::RandomUint64+0x153
0a 000000c0`507ff220 00007ff9`b7ee33cc mozglue!mozilla::RandomUint64OrDie+0xbf
0b (Inline Function) --------`-------- xul!mozilla::ipc::RandomNodeName+0xd
0c 000000c0`507ff300 00007ff9`b7e02061 xul!mozilla::ipc::NodeController::InitChildProcess+0x15c
0d 000000c0`507ff680 00007ff9`b7df80f9 xul!ChildThread::Init+0x141
0e 000000c0`507ff7c0 00007ff9`b7dbf057 xul!base::Thread::ThreadMain+0x8c9
0f 000000c0`507ffb10 00007ff9`d449e378 xul!`anonymous namespace'::ThreadFunc+0x37
10 000000c0`507ffb40 00007ffa`38cf7034 clang_rt_asan_dynamic_x86_64!__asan::AsanThread::ThreadStart+0x98
11 000000c0`507ffb90 00007ff9`d54a45fd kernel32!BaseThreadInitThunk+0x14
12 (Inline Function) --------`-------- mozglue!mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared>,void (*)(int, void *, void *)>::operator()+0x1a
13 000000c0`507ffbc0 00007ffa`38ec2651 mozglue!patched_BaseThreadInitThunk+0x1ed
14 000000c0`507ffce0 00000000`00000000 ntdll!RtlUserThreadStart+0x21

Pushed by tkikuchi@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d74caf9f8941
Preload bcryptPrimitives.dll in the main thread of GMP. r=bobowen
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 92 Branch