Closed Bug 1502424 Opened 4 years ago Closed 4 years ago
Crash in std::vector<T>::_M
47 bytes, text/x-phabricator-request
|Details | Review|
This bug was filed from the Socorro interface and is report bp-863a06a3-8755-41a9-8e77-6c84d0181026. ============================================================= Seen while reviewing nightly crash stats: https://bit.ly/2PotBF4. This linux crash dates back to when nightly was in 64. At the top of the stack is BHMgr Processor. 105 crashes, but only 12 installations. No comments. A few tab related addons seem to be in the correlations: * (100.0% in signature vs 01.01% overall) Addon "Undo Close Tab" = true * (100.0% in signature vs 01.08% overall) Addon "Open Tabs Next to Current" = true * (100.0% in signature vs 01.87% overall) Addon "Snooze Tabs" = true Top 10 frames of crashing thread: 0 libxul.so void std::vector<unsigned char, google_breakpad::PageStdAllocator<unsigned char> >::_M_range_insert<unsigned char const*> clang/include/c++/4.9.4/ext/new_allocator.h:120 1 libxul.so google_breakpad::FileID::ElfFileIdentifierFromMappedFile clang/include/c++/4.9.4/bits/stl_vector.h:1377 2 libxul.so SharedLibraryAtPath toolkit/crashreporter/google-breakpad/src/common/linux/file_id.cc:162 3 libxul.so SharedLibraryInfo::GetInfoForSelf tools/profiler/core/shared-libraries-linux.cc:253 4 libxul.so mozilla::ProcessHangStackRunnable::Run toolkit/components/backgroundhangmonitor/HangDetails.cpp:334 5 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1245 6 libxul.so NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:530 7 libxul.so mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:364 8 libxul.so nsThread::ThreadFunc ipc/chromium/src/base/message_loop.cc:325 9 libnspr4.so _pt_root nsprpub/pr/src/pthreads/ptthread.c:201 =============================================================
This looks like a null deref crash inside breakpad. Ted, any ideas?
Component: General → Crash Reporting
Product: Core → Toolkit
So interestingly this is crashing in the background hang monitor, which calls into some Breakpad code just to get file identifiers for each loaded library. There are some inlined frames here, but it looks like `SharedLibraryAtPath` calls `getId` here: https://dxr.mozilla.org/mozilla-central/rev/c291143e24019097d087f9307e59b49facaf90cb/tools/profiler/core/shared-libraries-linux.cc#127 and `getId` calls `ElfFileIdentifier` here: https://dxr.mozilla.org/mozilla-central/rev/c291143e24019097d087f9307e59b49facaf90cb/tools/profiler/core/shared-libraries-linux.cc#105 The actual crash looks to be inside std::vector, it appears to be failing to allocate 14 bytes, I guess? The confounding factor is that this code works with `wasteful_vector`, which is a std::vector with a custom allocator that doesn't use malloc (so it's safe to use during a crash): https://hg.mozilla.org/mozilla-central/file/3cc04ee79005058d817daf66da7963dfac3f0a3a/toolkit/crashreporter/google-breakpad/src/common/memory_allocator.h#l216 Specifically the caller uses an `auto_wasteful_vector` here, which is a subclass that has some stack space (like `nsAutoTArray`): https://dxr.mozilla.org/mozilla-central/rev/c291143e24019097d087f9307e59b49facaf90cb/tools/profiler/core/shared-libraries-linux.cc#89 As it turns out `sizeof(MDGUID)` is not actually the best constant there (it actually wants `kDefaultBuildIdSize`, since it's almost certainly going to get an ELF Build ID and they're almost always a SHA-1 hash which is 20 bytes): https://dxr.mozilla.org/mozilla-central/rev/c291143e24019097d087f9307e59b49facaf90cb/toolkit/crashreporter/breakpad-client/linux/minidump_writer/minidump_writer.cc#634 so it winds up allocating, which goes through this code: https://hg.mozilla.org/mozilla-central/file/3cc04ee79005058d817daf66da7963dfac3f0a3a/toolkit/crashreporter/google-breakpad/src/common/memory_allocator.h#l74 but apparently that's returning NULL? I'm not sure what's happening there. We should definitely fix that code in `getId` to use the right constant which ought to avoid heap allocations in 99% of cases.
This is really weird, there doesn't seem to be any reasons why that allocation would fail. Also this seems to be happening only within content processes. BTW the allocator is not thread-safe but it's being instanced on the calling stack so a race shouldn't be possible here, right?
We've discussed this a bit on IRC and while what we think is happening is possible it's also quite unlikely. That being said all the crashes but one come from a single machine that has nightly installed under /root (yeah, you read that right). There has been another crash though from a different machine with a more sensible install (under a user's home directory) so this might be a valid bug. If it's what we think it is (mmap() failing under weird conditions) we can mitigate it but a fix would be much more complex.
BTW while triaging I run into other "weird" crashes, this is one of them: https://crash-stats.mozilla.com/signature/?product=Firefox&signature=mozilla%3A%3Aipc%3A%3AIToplevelProtocol%3A%3AToplevelState%3A%3ACreateSharedMemory&date=%3E%3D2018-10-26T17%3A39%3A13.000Z&date=%3C2018-11-02T16%3A39%3A13.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=cpu_info&_sort=-date&page=1 The CPU info and Firefox installation path match the one here. This crash is dereferencing a null pointer that has been returned in the form of a shared mapped segment; said segment was allocated via mmap() which most likely failed and lead us to return a null pointer responsible for the crash here: https://searchfox.org/mozilla-central/rev/50ba1dd30cf013bddce1ae756f1b3c95b26f0628/ipc/glue/Shmem.cpp#90 So it's pretty obvious to me that our friend here is running out of mmap()'able memory and we're dealing with this condition very poorly in various parts of our codebase. We should probably OOM-crash. So again, we can mitigate this bug (and we probably should) but the real fix would be to handle mmap() failures correctly. The problem is that would require an audit of almost the entire codebase since we're using it in a number of different places.
Crash Signature: [@ std::vector<T>::_M_range_insert<T>] → [@ std::vector<T>::_M_range_insert<T>] [@ OOM | large | std::vector<T>::_M_range_insert<T>]
I've given this a little thought and I finally realized what could cause this issue: since this user is most likely running Firefox as root it's possible that he has a different ulimit configuration than a regular user. In fact this might be happening in a container or something like that and one of the limits that can be set is the amount of virtual memory one can allocate. If that's the case then mmap() failing is a natural consequence. Let's mitigate the issue here as Ted has described then I'll open a separate bug for auditing mmap() call sites.
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Pushed by email@example.com: https://hg.mozilla.org/integration/autoland/rev/dc0584144f76 Accommodate for larger ELF identifiers to prevent dynamic allocations r=ted
You need to log in before you can comment on or make changes to this bug.