Closed Bug 1508312 Opened 5 years ago Closed 5 years ago

Mochitest BC tests fail if MFBT function `mozilla::RandomUint64()` used in mozjemalloc CreateArena()

Categories

(Core :: Memory Allocator, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla66
Tracking Status
firefox65 --- wontfix
firefox66 --- fixed

People

(Reporter: cmartin, Assigned: cmartin)

Details

Attachments

(1 file)

Bug 1402282(https://bugzilla.mozilla.org/show_bug.cgi?id=1402282) was supposed to make any new arena allocated by mozjemalloc have a random id for hardening purposes.

When change was landed, several Mochitest BC tests started failing. (https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=db7059b57f929b08f3367c7815ba08b76728fe6b)

Debugging this issue was very time-consuming because the issue was not reproducible on my local build - It seemed to only happen on Try servers.

I've actually solved this issue now, so I am filing this bug retroactively for documentation's sake and in-case anyone ever runs into a similar issue.
Assignee: nobody → cmartin
Initially, I did several Try builds with logging to narrow down what specific part of my change was causing the issue. 

Step 1 - Let's see if the random arena id itself is causing the problem - So what happens if I generate a random number but then don't use it for anything? 

Still fails - So something about just calling "mozilla:RandomUint64()" causes the test failure.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1de117dda169cbfeaf4a76f0f96570e9bd4f68dc
Step 2 - Maybe the issue is some of the dependencies that `mozilla::RandomUint64()` brings it. Let's just call `RtlGenRandom()` directly.

Still fails. Also, note that in both situations the RNG works correctly for a while from the logs - You can see it generate tons of random numbers correctly.

But definitely at some point, just the act of calling `RtlGenRandom()` causes the test to fail.


https://treeherder.mozilla.org/#/jobs?repo=try&revision=768812fb2ee7eb5a7cbbb59318892d2a576f447a
Step 3 - Maybe the issue is some kind of loading or linking error? Try loading `RtlGenRandom()` dynamically with `LoadLibrary()` and `GetProcAddress()` and see if that works.

Nope - Still fails. Same thing too - Works well for a long time and then suddenly doesn't.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8f8b82498ef97b78ac976a3682897df8e29cd53d
Step 4 - Perhaps there's something about calling `RtlGenRandom()` over-and-over again that causes a race or something and crashes. Let's try calling `RtlGenRandom()` exactly one time right after initialization, and never again.

Nope - That still fails. But again it does show up in logs as working a bunch of times, telling me that multiple processes must be separately calling it, and that most of them are able to use `RtlGenRandom()` successfully. But at-least one of them definitely isn't.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=97200fd290cc6e06e252e184651ba11e77d1ae5c
Step 5 - Log the process name, process id, and thread id of everything that calls `CreateArena()` and calls `RtlGenRandom()`, and log both before-and-after the call to `RtlGenRandom()` to see which process doesn't return from it.

As it turns out, there is one single process that calls RtlGenRandom() and never returns from it: minidump-analyzer.exe

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b87509aeffe3412093c69ee4539bca0e15f0891f



To validate I was correct, I made a special change where it would look at the process name, and if it was "minidump-analyzer.exe" it would not call `RtlGenRandom()`, otherwise it would.

That build passed just fine - So "minidump-analyzer.exe" was definitely the problem

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9da28963161bf27110c01c453284153a6ca531f4
Step 6 - Took a debugger to minidump-analyzer.exe on my own machine and looked at the call stack for its arrival at CreateArena. I won't post every single build, but a did a linear search between the `minidump-analyzer::main()` function and `CreateArena()` to see at what point calling "RtlGenRandom()" started causing test failures.

The result was that anything after the call to the `mozglue/build` function `DllBlocklist_SetDllServices` was causing a hang.
Step 7 - I noticed there was some DLL hooking stuff in `DllBlocklist_SetDllServices`, so I figured maybe DLL loading had something to do with it. But if that's the case, why wasn't it hanging way long ago when I switched to dynamically loading "Advapi32.dll" back in Step 3?

I thought, "Hmm... Maybe Advapi32.dll is already loaded, so there's no hang. Let's load some silly DLL that definitely would not be already loaded.", so I had it load "dinput.dll" (The old DirectInput API DLL).

And that works - Just the act of me calling "LoadLibraryW(L"dinput.dll");" was enough to fail the test.

So now it seems likely the problem isn't `RtlGenRandom()`, it's that `RtlGenRandom()` is loading a DLL as part of its implementation.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=831fe08088bf33396079c6e66c6faecaa94b096f
Step 8 - To verify that it was the act of installing the DLL Loader Hook, I tried loading a DLL both before and after the snippet of code that installs the hook.

Sure enough - Loading the DLL works just fine right before the hook is installed, and causes test failure after the hook is installed.

Before: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3c52e23f7af1c2ea5115472af4f3f27039c39378

After: https://treeherder.mozilla.org/#/jobs?repo=try&revision=054b0ccf0582e8938aa3653ec0d84a8136a1489c
Step 9 - Looking around the code, I notice that `DllBlocklist_SetDllServices()` holds a mutex called `gDllServicesLock` for its entire duration. 

Likewise, that same mutex is also requested by `DllLoadNotification()` - The callback function that is triggered anytime a new DLL is loaded.

This means that if `DllBlocklist_SetDllServices()` calls anything that loads a DLL synchronously, that will cause a deadlock because the callback function will block on the mutex and wait for `DllBlocklist_SetDllServices()` to release it, but `DllBlocklist_SetDllServices()` won't release it until the callback function returns.

Unfortunately, thanks to our new change, that's exactly what happens.

This time near the bottom of `DllBlocklist_SetDllServices()` is the culprit:

```
if (glue::UntrustedDllsHandler::TakePendingEvents(events)) {
```

`TakePendingEvents()` dynamically allocates memory using "new()", which calls mozjemalloc's `CreateArena()`, which now calls `RtlGenRandom()`, which loads a DLL called 'C:\windows\system32\cryptbase.dll', which causes a deadlock.
Step 10 - Finally, I wanted to obtain a nice clean log showcasing exactly what my suspicion was. So I added logging before acquire, after acquire, and after release of the mutex in both `DllBlocklist_SetDllServices()` and the `DllLoadNotification()` callback.

Here is a part of the log that shows what the logging should look like in a process where this deadlock doesn't happen:

(Note that the [xxxx:xxxx] part is the [PID:TID] combo)

```

18:42:10     INFO - GECKO(5408) | RUSTAFARIAN: [2928:4628]: DllBlocklist_SetDllServices called. Process command line: 'Z:\task_1542564680\build\application\firefox\firefox.exe -no-remote -foreground -profile C:\Users\task_1542564680\AppData\Local\Temp\tmptlolbj.mozrunner\chrome_debugger_profile -chrome chrome://devtools/content/framework/toolbox-process-window.html'
18:42:10     INFO - GECKO(5408) | RUSTAFARIAN: [2928:4628]: DllBlocklist_SetDllServices about to acquire gDllServicesLock
18:42:10     INFO - GECKO(5408) | RUSTAFARIAN: [2928:4628]: DllBlocklist_SetDllServices acquired gDllServicesLock
18:42:10     INFO - GECKO(5408) | RUSTAFARIAN: [2928:4628]: DllBlocklist_SetDllServices released gDllServicesLock
18:42:10     INFO - GECKO(5408) | RUSTAFARIAN: [2928:6084]: Load notification for DLL 'C:\windows\system32\CLBCatQ.DLL'
18:42:10     INFO - GECKO(5408) | RUSTAFARIAN: [2928:6084]: DllLoadNotification about to acquire gDllServicesLock
18:42:10     INFO - GECKO(5408) | RUSTAFARIAN: [2928:6084]: DllLoadNotification acquired gDllServicesLock
18:42:10     INFO - GECKO(5408) | RUSTAFARIAN: [2928:6084]: DllLoadNotification released gDllServicesLock
18:42:10     INFO - GECKO(5408) | RUSTAFARIAN: [2928:6084]: Load notification for DLL 'C:\windows\system32\wbem\wbemprox.dll'
18:42:10     INFO - GECKO(5408) | RUSTAFARIAN: [2928:6084]: DllLoadNotification about to acquire gDllServicesLock
18:42:10     INFO - GECKO(5408) | RUSTAFARIAN: [2928:6084]: DllLoadNotification acquired gDllServicesLock
18:42:10     INFO - GECKO(5408) | RUSTAFARIAN: [2928:6084]: DllLoadNotification released gDllServicesLock

```

And here is the part of the log that confirms my suspicion:

```
18:49:06     INFO - GECKO(5408) | RUSTAFARIAN: [3440:5112]: DllBlocklist_SetDllServices called. Process command line: 'Z:\task_1542564680\build\application\firefox\minidump-analyzer.exe C:\Users\task_1542564680\AppData\Local\Temp\tmptlolbj.mozrunner\minidumps\81b8938b-b518-4cbb-b736-4008a4b5dedd.dmp'
18:49:06     INFO - GECKO(5408) | RUSTAFARIAN: [3440:5112]: DllBlocklist_SetDllServices about to acquire gDllServicesLock
18:49:06     INFO - GECKO(5408) | RUSTAFARIAN: [3440:5112]: DllBlocklist_SetDllServices acquired gDllServicesLock
18:49:06     INFO - GECKO(5408) | RUSTAFARIAN: [3440:5112]: About to call RtlGenRandom()
18:49:06     INFO - GECKO(5408) | RUSTAFARIAN: [3440:5112]: Load notification for DLL 'C:\windows\system32\cryptbase.dll'
18:49:06     INFO - GECKO(5408) | RUSTAFARIAN: [3440:5112]: DllLoadNotification about to acquire gDllServicesLock
```
There are likely several solutions that could be implemented to fix the problem - for example, maybe we can reduce the scope of that lock.

But right now, my main thought is this: It's a bit "unexpected behaviour" that calling `new()` in your code would result in a DLL being loaded, so I don't necessarily think that `DllBlocklist_SetDllServices()` is in-the-wrong here.

It seems like the root problem is that the `RtlGenRandom()` function lazily initializes itself, which unfortunately for us is at a time that we can't handle it because we're in the middle of sensitive DLL loader hooking code.

So the solution seems to me to just change it to eager initialization - We call `RtlGenRandom()` once as part of mozjemalloc's initialization code.

I'm about to run a Try build with such a change to see if it works, but does that seem like the right way to go?
(In reply to Chris Martin [:rustafarian] from comment #11)
> But right now, my main thought is this: It's a bit "unexpected behaviour"
> that calling `new()` in your code would result in a DLL being loaded, so I
> don't necessarily think that `DllBlocklist_SetDllServices()` is in-the-wrong
> here.

+1.
As previously discussed with Carl - Having an assert warn us if we attempt to re-enter a Slim RW lock (as opposed to the process just hanging) would've made catching this mistake much faster to debug.

Here is a treeherder job with everything passing: https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd6e90673f8d994dc0d4e2efe1f27902b6fdecde

Here is a treeherder job showing the same change I originally made with "RtlGenRandom()" with the new assert. In the "debug" build, we now get a message explaining the SRWLock situation, whereas the "opt" build still just hangs and gets terminated by watchdog: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c58a28a059c3442657f1f4efbd777cb227ed2f26
Flags: needinfo?(ccorcoran)

Nice solution, and cool to see it in action on try.

Flags: needinfo?(ccorcoran)
Keywords: checkin-needed

Pushed by ccoroiu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/493f7ee0ee00
Add assert to catch accidental re-enter of Slim RW lock r=ccorcoran,aklotz

Keywords: checkin-needed
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla66
You need to log in before you can comment on or make changes to this bug.