Closed
Bug 1508312
Opened 5 years ago
Closed 5 years ago
Mochitest BC tests fail if MFBT function `mozilla::RandomUint64()` used in mozjemalloc CreateArena()
Categories
(Core :: Memory Allocator, defect)
Core
Memory Allocator
Tracking
()
RESOLVED
FIXED
mozilla66
People
(Reporter: cmartin, Assigned: cmartin)
Details
Attachments
(1 file)
Bug 1402282(https://bugzilla.mozilla.org/show_bug.cgi?id=1402282) was supposed to make any new arena allocated by mozjemalloc have a random id for hardening purposes. When change was landed, several Mochitest BC tests started failing. (https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=db7059b57f929b08f3367c7815ba08b76728fe6b) Debugging this issue was very time-consuming because the issue was not reproducible on my local build - It seemed to only happen on Try servers. I've actually solved this issue now, so I am filing this bug retroactively for documentation's sake and in-case anyone ever runs into a similar issue.
Assignee | ||
Updated•5 years ago
|
Assignee: nobody → cmartin
Assignee | ||
Comment 1•5 years ago
|
||
Initially, I did several Try builds with logging to narrow down what specific part of my change was causing the issue. Step 1 - Let's see if the random arena id itself is causing the problem - So what happens if I generate a random number but then don't use it for anything? Still fails - So something about just calling "mozilla:RandomUint64()" causes the test failure. https://treeherder.mozilla.org/#/jobs?repo=try&revision=1de117dda169cbfeaf4a76f0f96570e9bd4f68dc
Assignee | ||
Comment 2•5 years ago
|
||
Step 2 - Maybe the issue is some of the dependencies that `mozilla::RandomUint64()` brings it. Let's just call `RtlGenRandom()` directly. Still fails. Also, note that in both situations the RNG works correctly for a while from the logs - You can see it generate tons of random numbers correctly. But definitely at some point, just the act of calling `RtlGenRandom()` causes the test to fail. https://treeherder.mozilla.org/#/jobs?repo=try&revision=768812fb2ee7eb5a7cbbb59318892d2a576f447a
Assignee | ||
Comment 3•5 years ago
|
||
Step 3 - Maybe the issue is some kind of loading or linking error? Try loading `RtlGenRandom()` dynamically with `LoadLibrary()` and `GetProcAddress()` and see if that works. Nope - Still fails. Same thing too - Works well for a long time and then suddenly doesn't. https://treeherder.mozilla.org/#/jobs?repo=try&revision=8f8b82498ef97b78ac976a3682897df8e29cd53d
Assignee | ||
Comment 4•5 years ago
|
||
Step 4 - Perhaps there's something about calling `RtlGenRandom()` over-and-over again that causes a race or something and crashes. Let's try calling `RtlGenRandom()` exactly one time right after initialization, and never again. Nope - That still fails. But again it does show up in logs as working a bunch of times, telling me that multiple processes must be separately calling it, and that most of them are able to use `RtlGenRandom()` successfully. But at-least one of them definitely isn't. https://treeherder.mozilla.org/#/jobs?repo=try&revision=97200fd290cc6e06e252e184651ba11e77d1ae5c
Assignee | ||
Comment 5•5 years ago
|
||
Step 5 - Log the process name, process id, and thread id of everything that calls `CreateArena()` and calls `RtlGenRandom()`, and log both before-and-after the call to `RtlGenRandom()` to see which process doesn't return from it. As it turns out, there is one single process that calls RtlGenRandom() and never returns from it: minidump-analyzer.exe https://treeherder.mozilla.org/#/jobs?repo=try&revision=b87509aeffe3412093c69ee4539bca0e15f0891f To validate I was correct, I made a special change where it would look at the process name, and if it was "minidump-analyzer.exe" it would not call `RtlGenRandom()`, otherwise it would. That build passed just fine - So "minidump-analyzer.exe" was definitely the problem https://treeherder.mozilla.org/#/jobs?repo=try&revision=9da28963161bf27110c01c453284153a6ca531f4
Assignee | ||
Comment 6•5 years ago
|
||
Step 6 - Took a debugger to minidump-analyzer.exe on my own machine and looked at the call stack for its arrival at CreateArena. I won't post every single build, but a did a linear search between the `minidump-analyzer::main()` function and `CreateArena()` to see at what point calling "RtlGenRandom()" started causing test failures. The result was that anything after the call to the `mozglue/build` function `DllBlocklist_SetDllServices` was causing a hang.
Assignee | ||
Comment 7•5 years ago
|
||
Step 7 - I noticed there was some DLL hooking stuff in `DllBlocklist_SetDllServices`, so I figured maybe DLL loading had something to do with it. But if that's the case, why wasn't it hanging way long ago when I switched to dynamically loading "Advapi32.dll" back in Step 3? I thought, "Hmm... Maybe Advapi32.dll is already loaded, so there's no hang. Let's load some silly DLL that definitely would not be already loaded.", so I had it load "dinput.dll" (The old DirectInput API DLL). And that works - Just the act of me calling "LoadLibraryW(L"dinput.dll");" was enough to fail the test. So now it seems likely the problem isn't `RtlGenRandom()`, it's that `RtlGenRandom()` is loading a DLL as part of its implementation. https://treeherder.mozilla.org/#/jobs?repo=try&revision=831fe08088bf33396079c6e66c6faecaa94b096f
Assignee | ||
Comment 8•5 years ago
|
||
Step 8 - To verify that it was the act of installing the DLL Loader Hook, I tried loading a DLL both before and after the snippet of code that installs the hook. Sure enough - Loading the DLL works just fine right before the hook is installed, and causes test failure after the hook is installed. Before: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3c52e23f7af1c2ea5115472af4f3f27039c39378 After: https://treeherder.mozilla.org/#/jobs?repo=try&revision=054b0ccf0582e8938aa3653ec0d84a8136a1489c
Assignee | ||
Comment 9•5 years ago
|
||
Step 9 - Looking around the code, I notice that `DllBlocklist_SetDllServices()` holds a mutex called `gDllServicesLock` for its entire duration. Likewise, that same mutex is also requested by `DllLoadNotification()` - The callback function that is triggered anytime a new DLL is loaded. This means that if `DllBlocklist_SetDllServices()` calls anything that loads a DLL synchronously, that will cause a deadlock because the callback function will block on the mutex and wait for `DllBlocklist_SetDllServices()` to release it, but `DllBlocklist_SetDllServices()` won't release it until the callback function returns. Unfortunately, thanks to our new change, that's exactly what happens. This time near the bottom of `DllBlocklist_SetDllServices()` is the culprit: ``` if (glue::UntrustedDllsHandler::TakePendingEvents(events)) { ``` `TakePendingEvents()` dynamically allocates memory using "new()", which calls mozjemalloc's `CreateArena()`, which now calls `RtlGenRandom()`, which loads a DLL called 'C:\windows\system32\cryptbase.dll', which causes a deadlock.
Assignee | ||
Comment 10•5 years ago
|
||
Step 10 - Finally, I wanted to obtain a nice clean log showcasing exactly what my suspicion was. So I added logging before acquire, after acquire, and after release of the mutex in both `DllBlocklist_SetDllServices()` and the `DllLoadNotification()` callback. Here is a part of the log that shows what the logging should look like in a process where this deadlock doesn't happen: (Note that the [xxxx:xxxx] part is the [PID:TID] combo) ``` 18:42:10 INFO - GECKO(5408) | RUSTAFARIAN: [2928:4628]: DllBlocklist_SetDllServices called. Process command line: 'Z:\task_1542564680\build\application\firefox\firefox.exe -no-remote -foreground -profile C:\Users\task_1542564680\AppData\Local\Temp\tmptlolbj.mozrunner\chrome_debugger_profile -chrome chrome://devtools/content/framework/toolbox-process-window.html' 18:42:10 INFO - GECKO(5408) | RUSTAFARIAN: [2928:4628]: DllBlocklist_SetDllServices about to acquire gDllServicesLock 18:42:10 INFO - GECKO(5408) | RUSTAFARIAN: [2928:4628]: DllBlocklist_SetDllServices acquired gDllServicesLock 18:42:10 INFO - GECKO(5408) | RUSTAFARIAN: [2928:4628]: DllBlocklist_SetDllServices released gDllServicesLock 18:42:10 INFO - GECKO(5408) | RUSTAFARIAN: [2928:6084]: Load notification for DLL 'C:\windows\system32\CLBCatQ.DLL' 18:42:10 INFO - GECKO(5408) | RUSTAFARIAN: [2928:6084]: DllLoadNotification about to acquire gDllServicesLock 18:42:10 INFO - GECKO(5408) | RUSTAFARIAN: [2928:6084]: DllLoadNotification acquired gDllServicesLock 18:42:10 INFO - GECKO(5408) | RUSTAFARIAN: [2928:6084]: DllLoadNotification released gDllServicesLock 18:42:10 INFO - GECKO(5408) | RUSTAFARIAN: [2928:6084]: Load notification for DLL 'C:\windows\system32\wbem\wbemprox.dll' 18:42:10 INFO - GECKO(5408) | RUSTAFARIAN: [2928:6084]: DllLoadNotification about to acquire gDllServicesLock 18:42:10 INFO - GECKO(5408) | RUSTAFARIAN: [2928:6084]: DllLoadNotification acquired gDllServicesLock 18:42:10 INFO - GECKO(5408) | RUSTAFARIAN: [2928:6084]: DllLoadNotification released gDllServicesLock ``` And here is the part of the log that confirms my suspicion: ``` 18:49:06 INFO - GECKO(5408) | RUSTAFARIAN: [3440:5112]: DllBlocklist_SetDllServices called. Process command line: 'Z:\task_1542564680\build\application\firefox\minidump-analyzer.exe C:\Users\task_1542564680\AppData\Local\Temp\tmptlolbj.mozrunner\minidumps\81b8938b-b518-4cbb-b736-4008a4b5dedd.dmp' 18:49:06 INFO - GECKO(5408) | RUSTAFARIAN: [3440:5112]: DllBlocklist_SetDllServices about to acquire gDllServicesLock 18:49:06 INFO - GECKO(5408) | RUSTAFARIAN: [3440:5112]: DllBlocklist_SetDllServices acquired gDllServicesLock 18:49:06 INFO - GECKO(5408) | RUSTAFARIAN: [3440:5112]: About to call RtlGenRandom() 18:49:06 INFO - GECKO(5408) | RUSTAFARIAN: [3440:5112]: Load notification for DLL 'C:\windows\system32\cryptbase.dll' 18:49:06 INFO - GECKO(5408) | RUSTAFARIAN: [3440:5112]: DllLoadNotification about to acquire gDllServicesLock ```
Assignee | ||
Comment 11•5 years ago
|
||
There are likely several solutions that could be implemented to fix the problem - for example, maybe we can reduce the scope of that lock. But right now, my main thought is this: It's a bit "unexpected behaviour" that calling `new()` in your code would result in a DLL being loaded, so I don't necessarily think that `DllBlocklist_SetDllServices()` is in-the-wrong here. It seems like the root problem is that the `RtlGenRandom()` function lazily initializes itself, which unfortunately for us is at a time that we can't handle it because we're in the middle of sensitive DLL loader hooking code. So the solution seems to me to just change it to eager initialization - We call `RtlGenRandom()` once as part of mozjemalloc's initialization code. I'm about to run a Try build with such a change to see if it works, but does that seem like the right way to go?
Comment 12•5 years ago
|
||
(In reply to Chris Martin [:rustafarian] from comment #11) > But right now, my main thought is this: It's a bit "unexpected behaviour" > that calling `new()` in your code would result in a DLL being loaded, so I > don't necessarily think that `DllBlocklist_SetDllServices()` is in-the-wrong > here. +1.
Assignee | ||
Comment 13•5 years ago
|
||
Assignee | ||
Comment 14•5 years ago
|
||
As previously discussed with Carl - Having an assert warn us if we attempt to re-enter a Slim RW lock (as opposed to the process just hanging) would've made catching this mistake much faster to debug. Here is a treeherder job with everything passing: https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd6e90673f8d994dc0d4e2efe1f27902b6fdecde Here is a treeherder job showing the same change I originally made with "RtlGenRandom()" with the new assert. In the "debug" build, we now get a message explaining the SRWLock situation, whereas the "opt" build still just hangs and gets terminated by watchdog: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c58a28a059c3442657f1f4efbd777cb227ed2f26
Flags: needinfo?(ccorcoran)
Comment 15•5 years ago
|
||
Nice solution, and cool to see it in action on try.
Flags: needinfo?(ccorcoran)
Assignee | ||
Updated•5 years ago
|
Keywords: checkin-needed
Comment 16•5 years ago
|
||
Pushed by ccoroiu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/493f7ee0ee00
Add assert to catch accidental re-enter of Slim RW lock r=ccorcoran,aklotz
Keywords: checkin-needed
Comment 17•5 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 5 years ago
status-firefox66:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla66
Updated•5 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•