Closed Bug 1472889 Opened 6 years ago Closed 6 years ago

Intermittent INFO - PROCESS-CRASH | automation.py | application crashed [@ mozilla::dom::ipc::SharedStringMap::SharedStringMap(mozilla::dom::ipc::SharedStringMapBuilder&&)]

Categories

(Core :: IPC, defect)

defect
Not set
critical

Tracking

()

RESOLVED INCOMPLETE
mozilla63
Tracking Status
firefox63 --- fixed

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: crash, intermittent-failure, regression, Whiteboard: [retriggered][stockwell needswork:owner])

Crash Data

Attachments

(1 file, 1 obsolete file)

Filed by: toros [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=186067456&repo=mozilla-central

https://queue.taskcluster.net/v1/task/f3aQ73HVTsOy9zLh5IJGzQ/runs/0/artifacts/public/logs/live_backing.log

[task 2018-07-02T23:09:31.379Z]     INFO - MochitestServer : launching [u'/builds/worker/workspace/build/tests/bin/xpcshell', '-g', '/builds/worker/workspace/build/application/firefox', '-f', '/builds/worker/workspace/build/tests/bin/components/httpd.js', '-e', "const _PROFILE_PATH = '/tmp/tmpFD87Ws.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', '/builds/worker/workspace/build/tests/mochitest/server.js']
[task 2018-07-02T23:09:31.380Z]     INFO - runtests.py | Server pid: 4586
[task 2018-07-02T23:09:31.399Z]     INFO - runtests.py | Websocket server pid: 4589
[task 2018-07-02T23:09:31.436Z]     INFO - runtests.py | SSL tunnel pid: 4592
[task 2018-07-02T23:09:31.585Z]     INFO - runtests.py | Running with e10s: True
[task 2018-07-02T23:09:31.587Z]     INFO - runtests.py | Running tests: start.
[task 2018-07-02T23:09:31.588Z]     INFO - 
[task 2018-07-02T23:09:31.605Z]     INFO - Application command: /builds/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmpFD87Ws.mozrunner
[task 2018-07-02T23:09:31.629Z]     INFO - runtests.py | Application pid: 4611
[task 2018-07-02T23:09:31.630Z]     INFO - TEST-INFO | started process GECKO(4611)
[task 2018-07-02T23:09:32.104Z]     INFO - GECKO(4611) | ExceptionHandler::GenerateDump cloned child 4638
[task 2018-07-02T23:09:32.104Z]     INFO - GECKO(4611) | ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2018-07-02T23:09:32.104Z]     INFO - GECKO(4611) | ExceptionHandler::WaitForContinueSignal waiting for continue signal...
[task 2018-07-02T23:12:31.634Z]     INFO - runtests.py | Waiting for browser...
[task 2018-07-02T23:12:31.635Z]     INFO - TEST-INFO | Main app process: exit 11
[task 2018-07-02T23:12:31.636Z]     INFO - Buffered messages finished
[task 2018-07-02T23:12:31.637Z]    ERROR - TEST-UNEXPECTED-FAIL | automation.py | application terminated with exit code 11
[task 2018-07-02T23:12:31.638Z]     INFO - runtests.py | Application ran for: 0:03:00.030697
[task 2018-07-02T23:12:31.638Z]     INFO - zombiecheck | Reading PID log: /tmp/tmpw3bHhspidlog
[task 2018-07-02T23:12:31.639Z]     INFO - ==> process 4611 launched child process 4633
[task 2018-07-02T23:12:31.640Z]     INFO - zombiecheck | Checking for orphan process with PID: 4633
[task 2018-07-02T23:12:31.641Z]     INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/Yex9d7yISmKY4DrqIAfN4A/artifacts/public/build/target.crashreporter-symbols.zip
[task 2018-07-02T23:12:37.456Z]     INFO - mozcrash Copy/paste: /usr/local/bin/linux64-minidump_stackwalk /tmp/tmpFD87Ws.mozrunner/minidumps/16bda064-7e66-b27f-7a49-2526ecc5b607.dmp /tmp/tmp9aOSGp
[task 2018-07-02T23:12:44.805Z]     INFO - mozcrash Saved minidump as /builds/worker/workspace/build/blobber_upload_dir/16bda064-7e66-b27f-7a49-2526ecc5b607.dmp
[task 2018-07-02T23:12:44.806Z]     INFO - mozcrash Saved app info as /builds/worker/workspace/build/blobber_upload_dir/16bda064-7e66-b27f-7a49-2526ecc5b607.extra
[task 2018-07-02T23:12:44.869Z]     INFO - PROCESS-CRASH | automation.py | application crashed [@ mozilla::dom::ipc::SharedStringMap::SharedStringMap(mozilla::dom::ipc::SharedStringMapBuilder&&)]
[task 2018-07-02T23:12:44.869Z]     INFO - Crash dump filename: /tmp/tmpFD87Ws.mozrunner/minidumps/16bda064-7e66-b27f-7a49-2526ecc5b607.dmp
[task 2018-07-02T23:12:44.869Z]     INFO - Operating system: Linux
[task 2018-07-02T23:12:44.869Z]     INFO -                   0.0.0 Linux 4.4.0-1014-aws #14taskcluster1-Ubuntu SMP Tue Apr 3 10:27:00 UTC 2018 x86_64
[task 2018-07-02T23:12:44.870Z]     INFO - CPU: amd64
[task 2018-07-02T23:12:44.870Z]     INFO -      family 6 model 62 stepping 4
[task 2018-07-02T23:12:44.871Z]     INFO -      2 CPUs
[task 2018-07-02T23:12:44.872Z]     INFO - 
[task 2018-07-02T23:12:44.872Z]     INFO - GPU: UNKNOWN
[task 2018-07-02T23:12:44.873Z]     INFO - 
[task 2018-07-02T23:12:44.874Z]     INFO - Crash reason:  SIGSEGV
[task 2018-07-02T23:12:44.875Z]     INFO - Crash address: 0x0
[task 2018-07-02T23:12:44.875Z]     INFO - Process uptime: not available
[task 2018-07-02T23:12:44.876Z]     INFO - 
[task 2018-07-02T23:12:44.877Z]     INFO - Thread 0 (crashed)
[task 2018-07-02T23:12:44.878Z]     INFO -  0  libxul.so!mozilla::dom::ipc::SharedStringMap::SharedStringMap(mozilla::dom::ipc::SharedStringMapBuilder&&) [SharedStringMap.cpp:7d20e7fae1039720f92db1a3a72bc2c7424b5f98 : 46 + 0x0]
[task 2018-07-02T23:12:44.878Z]     INFO -     rax = 0x0000000000635c80   rdx = 0x00007f9681b1cab8
[task 2018-07-02T23:12:44.879Z]     INFO -     rcx = 0x0000000000000003   rbx = 0x00007f9670f63820
[task 2018-07-02T23:12:44.880Z]     INFO -     rsi = 0x0000000000000000   rdi = 0x00007f968dd00020
[task 2018-07-02T23:12:44.880Z]     INFO -     rbp = 0x00007ffc1a681c90   rsp = 0x00007ffc1a681c80
[task 2018-07-02T23:12:44.881Z]     INFO -      r8 = 0x0000000000000000    r9 = 0x0000000000000000
[task 2018-07-02T23:12:44.881Z]     INFO -     r10 = 0x001b1554477eec6f   r11 = 0x0000000000000246
[task 2018-07-02T23:12:44.882Z]     INFO -     r12 = 0x00007f9670f63820   r13 = 0x00007ffc1a681d20
[task 2018-07-02T23:12:44.883Z]     INFO -     r14 = 0x00007ffc1a681cf0   r15 = 0x0000000000000000
[task 2018-07-02T23:12:44.884Z]     INFO -     rip = 0x00007f967eb7e3e9
[task 2018-07-02T23:12:44.884Z]     INFO -     Found by: given as instruction pointer in context
[task 2018-07-02T23:12:44.886Z]     INFO -  1  libxul.so!SharedStringBundle::LoadProperties [nsStringBundle.cpp:7d20e7fae1039720f92db1a3a72bc2c7424b5f98 : 561 + 0x8]
[task 2018-07-02T23:12:44.886Z]     INFO -     rbx = 0x00007f9672e380b0   rbp = 0x00007ffc1a681dc0
[task 2018-07-02T23:12:44.887Z]     INFO -     rsp = 0x00007ffc1a681ca0   r12 = 0x00007f9670f63820
[task 2018-07-02T23:12:44.888Z]     INFO -     r13 = 0x00007ffc1a681d20   r14 = 0x00007ffc1a681cf0
[task 2018-07-02T23:12:44.889Z]     INFO -     r15 = 0x0000000000000000   rip = 0x00007f967ce2a13c
[task 2018-07-02T23:12:44.890Z]     INFO -     Found by: call frame info
[task 2018-07-02T23:12:44.890Z]     INFO -  2  libxul.so!mozilla::detail::RunnableMethodImpl<nsStringBundleBase*, nsresult (nsStringBundleBase::*)(), true, (mozilla::RunnableKind)2>::Run() [nsThreadUtils.h:7d20e7fae1039720f92db1a3a72bc2c7424b5f98 : 1166 + 0x12]
[task 2018-07-02T23:12:44.892Z]     INFO -     rbx = 0x00007f968dce9100   rbp = 0x00007ffc1a681dd0
[task 2018-07-02T23:12:44.892Z]     INFO -     rsp = 0x00007ffc1a681dd0   r12 = 0x00007ffc1a681e90
[task 2018-07-02T23:12:44.893Z]     INFO -     r13 = 0x00007f9672e38100   r14 = 0x00007f967746b0a8
[task 2018-07-02T23:12:44.894Z]     INFO -     r15 = 0x00007ffc1a68233f   rip = 0x00007f967ce26bb7
[task 2018-07-02T23:12:44.895Z]     INFO -     Found by: call frame info
[task 2018-07-02T23:12:44.896Z]     INFO -  3  libxul.so!nsThread::ProcessNextEvent(bool, bool*) [nsThread.cpp:7d20e7fae1039720f92db1a3a72bc2c7424b5f98 : 1051 + 0xa]
[task 2018-07-02T23:12:44.896Z]     INFO -     rbx = 0x00007f968dce9100   rbp = 0x00007ffc1a682310
[task 2018-07-02T23:12:44.896Z]     INFO -     rsp = 0x00007ffc1a681de0   r12 = 0x00007ffc1a681e90
[task 2018-07-02T23:12:44.896Z]     INFO -     r13 = 0x00007f9672e38100   r14 = 0x00007f967746b0a8
[task 2018-07-02T23:12:44.897Z]     INFO -     r15 = 0x00007ffc1a68233f   rip = 0x00007f967cdef05e
[task 2018-07-02T23:12:44.898Z]     INFO -     Found by: call frame info
[task 2018-07-02T23:12:44.898Z]     INFO -  4  libxul.so!NS_ProcessNextEvent(nsIThread*, bool) [nsThreadUtils.cpp:7d20e7fae1039720f92db1a3a72bc2c7424b5f98 : 519 + 0xe]
[task 2018-07-02T23:12:44.899Z]     INFO -     rbx = 0x00007f9670f890a0   rbp = 0x00007ffc1a682340
[task 2018-07-02T23:12:44.900Z]     INFO -     rsp = 0x00007ffc1a682320   r12 = 0x0000000000000000
[task 2018-07-02T23:12:44.900Z]     INFO -     r13 = 0x0000000000000000   r14 = 0x00007f968dce9100
[task 2018-07-02T23:12:44.901Z]     INFO -     r15 = 0x00007ffc1a68235f   rip = 0x00007f967cde6c0b
[task 2018-07-02T23:12:44.902Z]     INFO -     Found by: call frame info
[task 2018-07-02T23:12:4
I think this is either a mochitest or marionette bug, though I might be wrong.
Component: Applications: MozharnessCore → Mochitest
Product: Release Engineering → Testing
QA Contact: jlund
Component: Mochitest → IPC
Product: Testing → Core
This might be regression by bug 1470365 or bug 1470771.  Adding kris.
This is unfortunately somewhat expected, for the moment. As I understand it, the Docker containers that we use on automation tend to limit the size of the tmpfs that backs /dev/shm, which means that our shared memory allocations fail even when the system has plenty of memory available to back them.

At some point, I'm going to migrate this code to use memory allocation APIs that don't rely on /dev/shm to avoid this problem.
So you mean it's related to bug 1245239? Also CC'ing Jed.
Flags: needinfo?(jld)
Flags: needinfo?(jld)
(In reply to Treeherder Bug Filer from comment #0)
> [task 2018-07-02T23:12:44.874Z]     INFO - Crash reason:  SIGSEGV

This is not SIGBUS; it's an assertion failure.  Are we sure that the failure from SharedStringMapBuilder::Finalize is ENOSPC?  I don't see anything in the log to either confirm or deny that.

Also, the Docker containers we use for CI generally haven't been a problem for this; the bug reports we've gotten have been other people using Docker's default of 64MB, which winds up not being enough for our graphics stack.  There's a lot of indirection in our CI so I don't know what the situation is there, but I'd be surprised if string bundles caused space exhaustion but all the cross-process textures didn't.
This is unfortunately somewhat expected, for the moment. As I understand it, the Docker containers that we use on automation tend to limit the size of the tmpfs that backs /dev/shm, which means that our shared memory allocations fail even when the system has plenty of memory available to back them.

At some point, I'm going to migrate this code to use memory allocation APIs that don't rely on /dev/shm to avoid this problem.
Depends on: 1440203
(In reply to Kris Maglione [:kmag] from comment #7)
> This is unfortunately somewhat expected, for the moment. As I understand it,
> the Docker containers that we use on automation tend to limit the size of
> the tmpfs that backs /dev/shm, which means that our shared memory
> allocations fail even when the system has plenty of memory available to back
> them.
> 
> At some point, I'm going to migrate this code to use memory allocation APIs
> that don't rely on /dev/shm to avoid this problem.

I'm not convinced that this is what's happening; see comment #5.  (I'm also not completely convinced that it's *not* what's happening, but there doesn't seem to be concrete evidence either way?)
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #9)
> I'm not convinced that this is what's happening; see comment #5.  (I'm also
> not completely convinced that it's *not* what's happening, but there doesn't
> seem to be concrete evidence either way?)

It's SIGSEGV rather than SIGBUS because we assert that the shmem creation was successful. Failing to allocate the shmem files is the only way to hit that assertion, so the only other possibility I can think of is OOM.

And since this is only happening on Linux so far, and the patch should actually significantly decrease our memory usage overall, that doesn't seem super likely. I suppose it's possible if our overall rate of OOM crashes is high on Linux automation, though.
From the failing tests resource-usage.json files, it doesn't look like we ever get above 30% of available RAM. Unfortunately, system-info.log doesn't seem to show tmpfs mounts, so I don't have much to go on...
(Er, ignore comment 7. Apparently session store saved my original comment and re-posted it when I added the dependency.)
Comment on attachment 8990893 [details]
Bug 1472889: Include all partition types in system-info.log.

https://reviewboard.mozilla.org/r/255892/#review262836
Attachment #8990893 - Flags: review?(gbrown) → review+
OK, apparently this is something slightly more prosaic.

Once every few hundred thousand calls, we fail to stat the read-only copy of the map file we just opened with EBADF. That... is very odd.

The open definitely succeeds, so we should be guaranteed a valid file descriptor. The only plausible explanation I can think of right now is that some background thread closes the wrong file, and sometimes does it at exactly the wrong time.
No longer depends on: 1440203
Attachment #8990893 - Attachment is obsolete: true
Comment on attachment 8991501 [details]
Bug 1472889: Work around sporadic automation failures on Linux.

https://reviewboard.mozilla.org/r/256394/#review263536

::: dom/ipc/MemMapSnapshot.cpp:124
(Diff revision 1)
>                                  getter_AddRefs(file)));
>  
> -  return aMem.init(file);
> +  auto result = aMem.init(file);
> +#ifdef XP_LINUX
> +  // On Linux automation runs, every few hundred thousand calls, our attempt to
> +  // stat the file that we just successfully opened fails with EBADF (bug

Should we fix this at that point where we stat instead?
Attachment #8991501 - Flags: review?(erahm) → review+
Comment on attachment 8991501 [details]
Bug 1472889: Work around sporadic automation failures on Linux.

https://reviewboard.mozilla.org/r/256394/#review263536

> Should we fix this at that point where we stat instead?

We have to re-open the file. The opening happens in init(nsIFile), the stat happens in initInternal(), so it would have to happen in init(). And it would get kind of ugly...
https://hg.mozilla.org/mozilla-central/rev/a7185bc9281c
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=188023941&repo=try&lineNumber=12388
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 1475612
https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: