If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

Crash: mozalloc_abort | pthread_mutex_unlock | ?? (ABORT: NULL actor value passed to non-nullable param: file ipc/ipdl/PIndexedDBObject)

RESOLVED FIXED in Firefox 20

Status

()

Core
DOM: IndexedDB
P1
critical
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: m1, Assigned: Ben Turner (not reading bugmail, use the needinfo flag!))

Tracking

({crash})

unspecified
B2G C4 (2jan on)
ARM
Gonk (Firefox OS)
crash
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-basecamp:+, firefox17 wontfix, firefox18 wontfix, firefox19 wontfix, firefox20 fixed, b2g18 fixed)

Details

(Whiteboard: [b2g-crash], crash signature)

Attachments

(6 attachments)

Seen during stability testing.   Seen multiple times by dogfooders as well.  

e.g. bp-c9519aee-4476-4ec1-bd20-a186a2121130

---
Crash reason:  SIGSEGV
Crash address: 0x0

Thread 0 (crashed)
 0  libxul.so!mozalloc_abort [mozalloc_abort.cpp : 23 + 0x4]
     r4 = 0x4011a758    r5 = 0xbedb152c    r6 = 0x00000000    r7 = 0xffffffff
     r8 = 0x411730d5    r9 = 0x00000001   r10 = 0xbedb1140    fp = 0x41659efb
     sp = 0xbedb1130    lr = 0x400e8434    pc = 0x415eab28
    Found by: given as instruction pointer in context
 1  libc.so!pthread_mutex_unlock [pthread.c : 973 + 0x2]
     r4 = 0x4011a758    r5 = 0xbedb152c    r6 = 0x00000000    r7 = 0xffffffff
     r8 = 0x411730d5    r9 = 0x00000001   r10 = 0xbedb1140    fp = 0x41659efb
     sp = 0xbedb1130    pc = 0x400e8434
    Found by: call frame info
 2  0x2323231e
     r4 = 0xffffffff    r5 = 0x401b5040    r6 = 0x00000000    r7 = 0x4165a87f
     r8 = 0x7261505b    r9 = 0x20746e65   r10 = 0x5d353231    fp = 0x41659efb
     sp = 0xbedb1150    pc = 0x23232320
    Found by: call frame info
This is the video app, so I bet this is bug 821082.
basecamp+ Socorro states 189 crashes with this signature. Hoping cjones is right about this being a dup of bug 821082.
blocking-basecamp: ? → +
Target Milestone: --- → B2G C3 (12dec-1jan)
Are we still seeing this?  Please reopen if so.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 824230
(Reporter)

Comment 4

5 years ago
Re-opening.  Still seeing this on AU 161 [1].



[1] https://www.codeaurora.org/gitweb/quic/b2g/?p=manifest.git;a=commitdiff;h=f29f8ffd172e4bf693fb857e7728c5b5c2725507
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
(Reporter)

Comment 5

5 years ago
Created attachment 697078 [details]
decoded minidump of crash

Latest minidump from this crash.   Looks like it's the b2g process not the video app that crashed here.

Comment 6

5 years ago
Hey Michael,

Do you have logcat's of the crash, maybe run a debug build, or use helgrind against these builds?

Comment 7

5 years ago
We need an owner for this.
(I know Michael is currently on vacation but I wanted to get this into NEEDINFO)
Flags: needinfo?(mvines)
(Reporter)

Comment 9

5 years ago
I have a log that is 8MB compressed.  Bugzilla won't accept it, so please let me know who wants it via email!
Flags: needinfo?(mvines)
The backtrace here is misleading: the crash is a null deref of mutex state at an atomic decrement.  That means the mutex itself is most likely null.

The content process crash is a non-main thread, but the b2g process crash here is the main thread.  I suspect these are different bugs.

Can we try re-analyzing this most recent dump with the tool that scans the stack harder?  (Don't remember if that was ted's or glandium's.)

mvines, do you guys have this tool already?
(Reporter)

Comment 11

5 years ago
Do you mean dumplookup?   The output that it produces it at the end of the attachment 697078 [details] starting with this:

----------------------

0xbe8fd39c: libxul.so!js::RunScript [jsinterp.cpp : 324 + 0xb]
0xbe8fd3cc: libxul.so!js::InvokeKernel [Stack.h : 1137 + 0x1]
0xbe8fd3ec: libxul.so!nsAString_internal::ReplaceASCII [nsTSubstring.cpp : 507 + 0x9]
0xbe8fd404: libxul.so!nsAString_internal::AppendFunc [nsTSubstring.cpp : 766 + 0x1]
0xbe8fd414: libnspr4.so!FuncStuff [prprf.c : 1044 + 0x1]
0xbe8fd424: libnspr4.so!dosprintf [prprf.c : 1028 + 0xf]
0xbe8fd454: libmozglue.so!malloc_mutex_unlock [jemalloc.c : 1662 + 0x1]
0xbe8fd45c: libmozglue.so!arena_malloc [jemalloc.c : 4162 + 0x1]
...
...


But if there's another tool, happy to run that over the .dmp file too.  lmk.
Nope, that's the one!  Thanks.

The first chunk looks quite believable, basically a perfect backtrace for SendPIndexedDBCursorConstructor().  The line numbers don't match my local build so I can't quite tell what's going on.

0xbe8fd65c: libmozglue.so!realloc [jemalloc.c : 4684 + 0x7]
0xbe8fd6b4: libxul.so!mozilla::dom::indexedDB::PIndexedDBObjectStoreParent::Write [PIndexedDBObjectStoreParent.cpp : 895 + 0x17]
0xbe8fd6c4: libxul.so!IPC::WriteParam<uint32_t> [ipc_message_utils.h : 75 + 0x1]
0xbe8fd6d4: libxul.so!mozilla::dom::indexedDB::PIndexedDBObjectStoreParent::Write [PIndexedDBObjectStoreParent.cpp : 1136 + 0x1]
0xbe8fd6f4: libxul.so!mozilla::dom::indexedDB::PIndexedDBObjectStoreParent::Write [PIndexedDBObjectStoreParent.cpp : 676 + 0x1]
0xbe8fd714: libxul.so!mozilla::dom::indexedDB::PIndexedDBObjectStoreParent::SendPIndexedDBCursorConstructor [ipc_message.h : 148 + 0x1]
0xbe8fd74c: libxul.so!OpenCursorHelper::SendResponseToChildProcess [IndexedDBParent.h : 489 + 0xd]
0xbe8fd7d4: libxul.so!mozilla::dom::indexedDB::AsyncConnectionHelper::MaybeSendResponseToChildProcess [AsyncConnectionHelper.cpp : 557 + 0xb]
0xbe8fd7e4: libxul.so!mozilla::dom::indexedDB::AsyncConnectionHelper::Run [AsyncConnectionHelper.cpp : 207 + 0x1]
0xbe8fd804: libxul.so!nsThread::ProcessNextEvent [nsThread.cpp : 620 + 0x7]
0xbe8fd844: libxul.so!NS_ProcessNextEvent_P [nsError.h : 1069 + 0x1]
0xbe8fd854: libxul.so!mozilla::ipc::MessagePump::Run [MessagePump.cpp : 83 + 0x1]
0xbe8fd87c: libxul.so!MessageLoop::RunInternal [message_loop.cc : 216 + 0x1]
0xbe8fd884: libxul.so!MessageLoop::Run [message_loop.cc : 502 + 0x1]
0xbe8fd89c: libxul.so!nsBaseAppShell::Run [nsBaseAppShell.cpp : 165 + 0x1]
0xbe8fd8ac: libxul.so!nsAppStartup::Run [nsError.h : 1065 + 0x1]
0xbe8fd8b4: libxul.so!XREMain::XRE_mainRun [nsError.h : 1065 + 0x1]

After that, we go through some garbage but then end up with another believable stack snippet.  Maybe this is serializing some IDB data?  Or maybe it's just junk.

0xbe8fd4ac: libnspr4.so!PR_AtomicDecrement [pratom.c : 281 + 0x1]
0xbe8fd4b4: libxul.so!nsStringBuffer::Release [nsSubstring.cpp : 158 + 0x1]
0xbe8fd4bc: libxul.so!ReleaseData [nsSubstring.cpp : 86 + 0x7]
0xbe8fd4c4: libxul.so!nsACString_internal::ReplacePrepInternal [nsTSubstring.cpp : 189 + 0x9]
0xbe8fd4e4: libxul.so!nsACString_internal::ReplacePrep [nsTSubstring.h : 719 + 0x11]

Lastly we hit this insanity

0xbe8fd39c: libxul.so!js::RunScript [jsinterp.cpp : 324 + 0xb]
0xbe8fd3cc: libxul.so!js::InvokeKernel [Stack.h : 1137 + 0x1]

which unfortunately might make sense given the bare address in the stack, like 0x2323205b (JIT'd code?).

So looks like the ingredients here are IDB, IPC, and maybe JS (o_O).  bent, ring any bells?  Would it be easy to set up the IDB cross-process tests to run continuously, as stress tests, and inject some subprocess crashes?
Flags: needinfo?(bent.mozilla)
(Reporter)

Comment 13

5 years ago
(In reply to Chris Jones [:cjones] [:warhammer] from comment #12)
>  The line numbers don't match my local build so I can't quite tell what's going on.

Gecko SHA1 for the dump can be found via comment 4.
(hint: it starts with 1a362ba230c21)
Alas, that's not the problem, but thanks.
FYI dumps on Linux platforms contain /proc/self/maps, and you can run the minidump_dump tool on them (from the Breakpad tree) to get that out, if you want to see if a bare address was in executable memory.
(Reporter)

Comment 16

5 years ago
(this crash is now appearing *very* frequently in the stab runs overnight based of AU 163)
Severity: normal → critical
Priority: -- → P1
Based on the first chunk of comment 12, I think we're calling this:

[OpenCursorHelper::SendResponseToChildProcess] http://hg.mozilla.org/releases/mozilla-b2g18/file/42aac34da660/dom/indexedDB/IDBObjectStore.cpp#l3471 ->
[IndexedDBObjectStoreParent::OpenCursor] http://hg.mozilla.org/releases/mozilla-b2g18/file/42aac34da660/dom/indexedDB/ipc/IndexedDBParent.h#l489

That gets into generated code which looks like this:

  PIndexedDBCursorParent*
  PIndexedDBObjectStoreParent::SendPIndexedDBCursorConstructor(
          PIndexedDBCursorParent* actor,
          const ObjectStoreCursorConstructorParams& params)
  {
    // ...
    Write(params, __msg);

And that calls this:

  void
  PIndexedDBObjectStoreParent::Write(
          const ObjectStoreCursorConstructorParams& __v,
          Message* __msg)
  {
      Write((__v).requestParent(), __msg, false);
      // skipping actor field that's meaningless on this side
      Write((__v).direction(), __msg);
      Write((__v).key(), __msg);
      Write((__v).cloneInfo(), __msg);
      Write((__v).blobsParent(), __msg);
      // skipping actor field that's meaningless on this side
  }

Based on the line numbers and the remaining frames in the stack trace I think we're calling Write on the blobsParent() array. That looks like this:

  void
  PIndexedDBObjectStoreParent::Write(
          const InfallibleTArray<PBlobParent*>& __v,
          Message* __msg)
  {
      uint32_t length = (__v).Length();
      Write(length, __msg);

      for (uint32_t i = 0; (i) < (length); (++(i))) {
          Write(__v[i], __msg, false);
      }
  }

Then we call Write on each element of the array:

  void
  PIndexedDBObjectStoreParent::Write(
          PBlobParent* __v,
          Message* __msg,
          bool __nullable)
  {
      int32_t id;
      if ((!(__v))) {
          if ((!(__nullable))) {
              NS_RUNTIMEABORT("NULL actor value passed to non-nullable param");
          }
          id = 0;
      }
      else {
          id = (__v)->mId;
          if ((1) == (id)) {
              NS_RUNTIMEABORT("actor has been |delete|d");
          }
      }

      Write(id, __msg);
  }

That Write call for that int32_t would be defined here:

  class PIndexedDBObjectStoreParent
  {
      // ...
      template<typename T>
      void
      Write(
              const T& __v,
              Message* __msg)
      {
          IPC::WriteParam(__msg, __v);
      }
      // ...
  }

Which then calls through:

[IPC::ParamTraits<long>::Write] http://hg.mozilla.org/releases/mozilla-b2g18/file/42aac34da660/ipc/chromium/src/chrome/common/ipc_message_utils.h#l148 ->
[Pickle::WriteLong] http://hg.mozilla.org/releases/mozilla-b2g18/file/42aac34da660/ipc/chromium/src/base/pickle.h#l112 ->
[Pickle::WriteInt64] http://hg.mozilla.org/releases/mozilla-b2g18/file/42aac34da660/ipc/chromium/src/base/pickle.h#l131 ->
[Pickle::WriteBytes] http://hg.mozilla.org/releases/mozilla-b2g18/file/42aac34da660/ipc/chromium/src/base/pickle.cc#l453 ->
[Pickle::BeginWrite] http://hg.mozilla.org/releases/mozilla-b2g18/file/42aac34da660/ipc/chromium/src/base/pickle.cc#l419 ->
[Pickle::Resize] http://hg.mozilla.org/releases/mozilla-b2g18/file/42aac34da660/ipc/chromium/src/base/pickle.cc#l530 ->
[realloc]

Phew!

Anyway, are we just running out of memory here? That realloc will be infallible, right?
Flags: needinfo?(bent.mozilla)
Looking at the first stack trace, the PC looks like ASCII data. If I convert a few other ASCII'ish looking registers and endian convert, I see:


r8  = 0x7261505b = '[par'
r9  = 0x20746e65 = 'ent '
r10 = 0x5d353231 = '125]'
pc  = 0x23232320 = ' ###'

Since these values would all have been pulled from the stack, this seems like it might be a stack overwrite. It might also be stack show through, except that the PC just doesn't look legit, so I have to believe that the real return address stored on the stack was overwritten.

Based on the stack that does look valid, it doesn't appear that the code path itself is overwriting its own stack, so I would have to guess that some other thread is doing it.

And if that's the case, its probably been a stack address that was passed to another thread and what we're seeing in the crash is just a manifestation of the corruption and not related to to corruption itself.

Updated

5 years ago
Target Milestone: B2G C3 (12dec-1jan) → B2G C4 (2jan on)
bent, I don't trust the line numbers here very much.  Is there any other part of the Blob (the Key maybe?) that writes string or other variable-length data?
Flags: needinfo?(bent.mozilla)
(In reply to Chris Jones [:cjones] [:warhammer] from comment #19)
> bent, I don't trust the line numbers here very much.

Well, my guess wasn't just based on line numbers but also on the other frames that came after. But anyway,

> Is there any other
> part of the Blob (the Key maybe?) that writes string or other
> variable-length data?

IndexedDB keys are basically an encoded string, see http://mxr.mozilla.org/mozilla-central/source/dom/indexedDB/ipc/SerializationHelpers.h#25 for where we copy it when writing.
Flags: needinfo?(bent.mozilla)
Assigning to Ben for now. Re-assign or take as appropriate.
Assignee: nobody → bent.mozilla
OK so evidence is pointing to a fatal IPDL error.  And of course, we're not logging those by default :(.  We should fix that.

mozalloc_abort() is __attribute__((noreturn)), which IIRC messed with backtraces in the past.  That could explain the garbage frames above mozalloc_abort.

However, this might have been an NS_RUNTIMEABORT(), in which case we should have annotated the crash report with the abort message.

mvines, do you happen to have the .extra file that was generated along with this dump?
(Reporter)

Comment 23

5 years ago
oooo, i didn't realize the .extra file actually had goodies in it!   I see this in it:

xpcom_runtime_abort([Parent 5113] ###!!! ABORT: NULL actor value passed to non-nullable param: file /local/mnt/workspace/lnxbuild/project/release_dev_msm7627a_575508/checkout/out/target/product/msm7627a/obj/objdir-gecko/ipc/ipdl/PIndexedDBObjectStoreParent.cpp, line 888)
(Reporter)

Comment 24

5 years ago
Created attachment 697721 [details]
matching extra file
Great, got lucky :).
Created attachment 697732 [details] [diff] [review]
Try to produce better crash reports

m1, I'm going to try to get these changes in asap, but if you want to kick off a run with this patch it should hopefully produce better stacks and more log info.
Created attachment 697756 [details] [diff] [review]
Always log protocol errors

These messages aren't fantastically useful but can provide clues.
Attachment #697756 - Flags: review?(bent.mozilla)
Created attachment 697757 [details] [diff] [review]
Make mozalloc_abort() not MOZ_NORETURN and log errors to logcat

I may be misremembering MOZ_NORETURN and this may be superstition.  The logging is useful though.
Attachment #697757 - Flags: review?(mh+mozilla)
Attachment #697756 - Flags: review?(bent.mozilla) → review+
(Reporter)

Comment 29

5 years ago
(In reply to Chris Jones [:cjones] [:warhammer] from comment #26)
> m1, I'm going to try to get these changes in asap, but if you want to kick
> off a run with this patch it should hopefully produce better stacks and more
> log info.

Thanks.  At this point our next test cycle won't start until Sunday evening PST, so if we get this landed before mid-day Sunday then the normal workflow will pick up this patch.
Attachment #697757 - Flags: review?(mh+mozilla) → review+
I reproduced a crash with this signature when running MW1 [1], but it's a camera crash.  I think it's the same as comment 0.

I'm going to file a separate bug for that.  Keeping this one as the IDB crash.
Summary: Crash: mozalloc_abort | pthread_mutex_unlock | ?? → Crash: mozalloc_abort | pthread_mutex_unlock | ?? (ABORT: NULL actor value passed to non-nullable param: file ipc/ipdl/PIndexedDBObject)
Component: General → DOM: IndexedDB
Product: Boot2Gecko → Core
https://hg.mozilla.org/integration/mozilla-inbound/rev/4fd475c2a94a
https://hg.mozilla.org/integration/mozilla-inbound/rev/496d7707ded1
Whiteboard: [leave open]
https://hg.mozilla.org/mozilla-central/rev/4fd475c2a94a
https://hg.mozilla.org/mozilla-central/rev/496d7707ded1

Updated

5 years ago
Crash Signature: [@ mozalloc_abort | pthread_mutex_unlock]
Keywords: crash
Whiteboard: [leave open] → [leave open][b2g-crash]
I wonder if bug 806503 is related?
(In reply to Michael Vines [:m1] from comment #16)
> (this crash is now appearing *very* frequently in the stab runs overnight
> based of AU 163)

For the next batch of results, the .extra or dumplookup spew should distinguish this bug from bug 826829.
Created attachment 698581 [details] [diff] [review]
Patch, v1

This is my best guess. The non-nullable thing is what was killing the old logic.
Attachment #698581 - Flags: review?(jones.chris.g)
Comment on attachment 698581 [details] [diff] [review]
Patch, v1

\o/
Attachment #698581 - Flags: review?(jones.chris.g) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/fc7f83b1bc57
https://hg.mozilla.org/mozilla-central/rev/0719d446b16a
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago5 years ago
status-b2g18: --- → affected
status-firefox17: --- → affected
status-firefox18: --- → affected
status-firefox19: --- → affected
status-firefox20: --- → fixed
Resolution: --- → FIXED
Comment on attachment 698581 [details] [diff] [review]
Patch, v1

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 759427
User impact if declined: Edge case crash in main process caused by crash of child process
Testing completed: m-i
Risk to taking this patch (and alternatives if risky): Very safe patch. Converts a null pointer assertion to a failure code that uses the same code path as other failures.
String or UUID changes made by this patch: None
Attachment #698581 - Flags: approval-mozilla-b2g18?
Whiteboard: [leave open][b2g-crash] → [b2g-crash]
Note, this is bb+.  Please uplift the other two patches above along with this.
Do we know if the mozalloc_abort changes had any effect on the quality of crash reports? Or does comment 18 describe what went wrong here: just stomping on the stack and thus screwing our ability to get a useful stack trace.
We don't know, however that patch adds some logging that's very useful.  At worst the mozalloc_abort() stacks stay shite.
From past experience, I'd say it might improve crash reports, but is quite unlikely to make them worse.
https://hg.mozilla.org/releases/mozilla-b2g18/rev/57c6af0b0cc6
https://hg.mozilla.org/releases/mozilla-b2g18/rev/37dedf9a9a5e
https://hg.mozilla.org/releases/mozilla-b2g18/rev/1339d655c554
status-b2g18: affected → fixed
Attachment #698581 - Flags: approval-mozilla-b2g18?
https://hg.mozilla.org/mozilla-central/rev/fc7f83b1bc57
status-firefox17: affected → wontfix
status-firefox18: affected → wontfix
status-firefox19: affected → wontfix
(In reply to Chris Jones [:cjones] [:warhammer] from comment #28)
> Created attachment 697757 [details] [diff] [review]
> Make mozalloc_abort() not MOZ_NORETURN and log errors to logcat
> 
> I may be misremembering MOZ_NORETURN and this may be superstition.  The
> logging is useful though.

Was this change ^^ important?  It causes this build warning on Linux (w/ GCC 4.7.2 at least):
> mozalloc_abort.cpp: In function ‘void abort()’:
> mozalloc_abort.cpp:40:1: warning: ‘noreturn’ function does return [enabled by default]

This is because "abort()" is declared in system headers as being noreturn, and we're replacing the system impl with our implementation that (from the compiler's perspective) could still return (since all it does is call some mysterious mozalloc_abort function), unless we tell the compiler that mozalloc_abort is also 'noreturn'.
(In reply to Daniel Holbert [:dholbert] from comment #46)
> Was this change ^^ important?
(specifically: was removing MOZ_NORETURN important?  If not, I'll file a followup on adding it back.)
We started getting better stacks out of mozalloc_abort() after this landed.  (That's a correlation though, still no proof of causation.)
Cool, thanks for the explanation.  I've filed bug 829327 on adding the annotation back, and we can see if that regresses crash-stack-quality.  I suspect you won't object, given the end of comment 48, but if you do, let's have that discussion over on bug 829327.  Thanks!
Blocks: 829327
You need to log in before you can comment on or make changes to this bug.