Closed Bug 1200139 Opened 9 years ago Closed 9 years ago

Replace Mozmill CI master VMs with Ubuntu 64bit to hopefully stop low memory crashes of Java

Categories

(Mozilla QA Graveyard :: Infrastructure, defect)

Version: 43 Branch
Type: defect
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

Attachments

(1 file)

Last week I upgraded the staging and production machines from Ubuntu 12.04 to 14.04. While staging is working properly, I have already seen two crashes of Jenkins on the production machine since Friday afternoon, both times due to low memory:

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 32756 bytes for ChunkPool::allocate

It looks like the old bug we had a couple of months ago is back. :( Maybe the Java version newly installed during the upgrade is causing this problem. I will have to dive deeper into this issue today.

For now I will simply bring the machine back up. Let's see how long it survives.
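The full error log below lists possible mitigations such as decreasing the Java heap and thread stack sizes. A minimal sketch of how that could look, assuming the master runs Jenkins via the Ubuntu package and reads /etc/default/jenkins (paths and values are assumptions, not taken from this machine):

# /etc/default/jenkins -- example values only, assumes the Ubuntu Jenkins package
# Cap the heap and shrink per-thread stacks so the 32-bit JVM stays well
# below its address-space limit:
JAVA_ARGS="-Xmx1024m -Xms256m -Xss512k"

$ sudo service jenkins restart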
Details of the Java error log:

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 32756 bytes for ChunkPool::allocate
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (allocation.cpp:211), pid=4525, tid=102509376
#
# JRE version: Java(TM) SE Runtime Environment (7.0_80-b15) (build 1.7.0_80-b15)
# Java VM: Java HotSpot(TM) Server VM (24.80-b11 mixed mode linux-x86 )
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#

Current thread (0x08f42800):  JavaThread "Computer.threadPoolForRemoting [#3848] for mm-osx-106-1" daemon [_thread_in_vm, id=6072, stack(0x06172000,0x061c3000)]

Stack: [0x06172000,0x061c3000],  sp=0x061c13a0,  free space=316k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x7ea656]  VMError::report_and_die()+0x1a6
V  [libjvm.so+0x33eb82]  report_vm_out_of_memory(char const*, int, unsigned int, char const*)+0x72
V  [libjvm.so+0x18dc27]  ChunkPool::allocate(unsigned int, AllocFailStrategy::AllocFailEnum)+0x97
V  [libjvm.so+0x18d8cc]  Arena::grow(unsigned int, AllocFailStrategy::AllocFailEnum)+0x2c
V  [libjvm.so+0x7017e4]  resource_allocate_bytes(unsigned int, AllocFailStrategy::AllocFailEnum)+0x64
V  [libjvm.so+0x70c9bb]  ScopeDesc::sender() const+0x3b
V  [libjvm.so+0x7e49d9]  compiledVFrame::sender() const+0x69
V  [libjvm.so+0x7def2f]  vframe::java_sender() const+0xf
V  [libjvm.so+0x1d4eb7]  get_or_compute_monitor_info(JavaThread*)+0x247
V  [libjvm.so+0x1d571d]  revoke_bias(oopDesc*, bool, bool, JavaThread*)+0x1bd
V  [libjvm.so+0x1d5c35]  BiasedLocking::revoke_and_rebias(Handle, bool, Thread*)+0x3b5
V  [libjvm.so+0x76e35e]  ObjectSynchronizer::FastHashCode(Thread*, oopDesc*)+0x1ce
V  [libjvm.so+0x512f18]  JVM_IHashCode+0xa8
C  [libjava.so+0x10414]  Java_java_lang_System_identityHashCode+0x24
J 1052  java.lang.System.identityHashCode(Ljava/lang/Object;)I (0 bytes) @ 0xb37b50f5 [0xb37b5060+0x95]

Interesting here is that, once again, a method of the monitoring plugin (get_or_compute_monitor_info) shows up in the stack; it has caused issues like this before.
The command listed in https://github.com/mozilla/mozmill-ci/issues/450 to free up used memory no longer seems to work:

$ sync | sudo tee /proc/sys/vm/drop_caches
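For reference, that one-liner pipes sync's (empty) output into tee, so nothing actually gets written to drop_caches; the usual form echoes the value explicitly. A sketch of the variant I would expect to work (an assumption, not verified against the linked issue):

$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches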

Oh, also keep in mind that we had a network outage over the weekend. Maybe that also played a role here.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
One problem we have with the staging and production machines is that both run a 32-bit OS! That's not as critical for staging, but production has 8GB of memory, which a 32-bit system cannot make full use of.

$ uname -a
Linux mm-ci-production.qa.scl3.mozilla.com 3.13.0-62-generic #102-Ubuntu SMP Tue Aug 11 14:28:35 UTC 2015 i686 i686 i686 GNU/Linux
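For the record, a few standard commands to confirm the userland bitness and whether the (virtual) CPU could run a 64-bit OS; the expected output noted in the comments is an assumption based on the uname above:

$ getconf LONG_BIT             # prints 32 on this installation
$ dpkg --print-architecture    # i386 for a 32-bit Ubuntu userland
$ grep -qw lm /proc/cpuinfo && echo "CPU supports 64-bit (long mode)"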

$ cat /proc/meminfo 
MemTotal:        8283944 kB
MemFree:         2645960 kB
Buffers:          375304 kB
Cached:          4574996 kB
SwapCached:          828 kB
Active:          1447432 kB
Inactive:        3803424 kB
Active(anon):      61208 kB
Inactive(anon):   248116 kB
Active(file):    1386224 kB
Inactive(file):  3555308 kB
Unevictable:           0 kB
Mlocked:               0 kB
HighTotal:       7475144 kB
HighFree:        2590972 kB
LowTotal:         808800 kB
LowFree:           54988 kB
SwapTotal:       1046524 kB
SwapFree:        1037372 kB
Dirty:                 4 kB
Writeback:             0 kB
AnonPages:        299832 kB
Mapped:            71152 kB
Shmem:              8768 kB
Slab:             349608 kB
SReclaimable:     333948 kB
SUnreclaim:        15660 kB
KernelStack:        3104 kB
PageTables:         6324 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     5188496 kB
Committed_AS:    2175768 kB
VmallocTotal:     122880 kB
VmallocUsed:        9276 kB
VmallocChunk:      90092 kB
HardwareCorrupted:     0 kB
AnonHugePages:    129024 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       14328 kB
DirectMap2M:      899072 kB

Chris, can I get a new Ubuntu 14.04 64-bit VM created, which I can prepare and then use to replace mm-ci-production (DNS and IP address)?
Flags: needinfo?(cknowles)
> A 32-bit computer has a word size of 32 bits, this limits the memory theoretically to 4GB.
> This barrier has been extended through the use of 'Physical Address Extension' (or PAE)
> which increases the limit to 64GB although the memory access above 4GB will be slightly slower.

This might be one of the reasons why the box feels slower. We regularly exceed the 4GB limit, so PAE comes into play.
$ free -m
             total       used       free     shared    buffers     cached
Mem:       8283944    5620092    2663852       8768     358716    4575332
-/+ buffers/cache:     686044    7597900
Swap:      1046524       9332    1037192

$ sudo sysctl -w vm.drop_caches=3
vm.drop_caches = 3

mozauto@mm-ci-production:/data/mozmill-ci$ free -m
             total       used       free     shared    buffers     cached
Mem:          8089        428       7661          8          1         71
-/+ buffers/cache:        355       7734
Swap:         1021          9       1012

Jenkins is running again for now.
:whimboo: - Probably best for us to split that into a separate bug for tracking and "not crossing streams of questioning" purposes.

I'm assuming that you'll want it at the same specs (8GB, 2 CPU, 16G /, 50G /data)

Also, will you want to do the same thing to mm-ci-staging?  

Spin up a new bug (please) and answer my questions, and we can get rolling.
Flags: needinfo?(cknowles)
Thanks Chris, that was exactly my idea; I just wanted to get some feedback first. The memory usage numbers are looking fine at the moment, and I want to observe them for a while. There is one other big task for me this week, so I only want to tackle that bug if it is really necessary. I think we should know more later today or tomorrow; I will then file a new bug with the requirements.
Alright, great.
We've had similar issues with Buildbot from time to time. It doesn't crash, but it does slow down over time. Something we've done to help avoid hitting that is to restart the masters regularly during quiet or down times. Maybe something similar could help here? E.g. restart with a cronjob on the weekend, as sketched below.
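If we go that route, a minimal sketch of such a scheduled restart, assuming the master is managed as an init service called jenkins (the mozmill-ci setup may differ, since Jenkins runs from /data/mozmill-ci):

# /etc/cron.d/restart-jenkins -- example only, assumed service name
# Restart the Jenkins master every Sunday at 03:00 during the quiet window
0 3 * * 0  root  /usr/sbin/service jenkins restart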
Attached image usedMemory.png
So we had another crash today. It happened all of a sudden and was most likely triggered by Robert's ondemand config upload for the Beta today. The memory usage was just fine when it happened, with no critical values.

That makes 3 crashes in 3 days!

-rw-rw-r-- 1 mozauto mozauto 121582 Aug 28 05:00 hs_err_pid2863.log
-rw-rw-r-- 1 mozauto mozauto  10956 Aug 30 05:22 hs_err_pid4525.log
-rw-rw-r-- 1 mozauto mozauto 130519 Sep  1 07:56 hs_err_pid7472.log

This is not going to work. I will file the bug shortly to get a fresh 64-bit VM created, especially since the error log shows the issue clearly:

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 1057552 bytes for Chunk::new
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
Depends on: 1200800
Since we replaced the VMs last week, I have some updated stats for ci-production:

top - 04:03:34 up 5 days, 20:21,  1 user,  load average: 0.19, 0.06, 0.06
Tasks: 129 total,   2 running, 127 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.3 sy,  0.0 ni, 98.0 id,  1.2 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   8176980 total,  5966956 used,  2210024 free,   393552 buffers
KiB Swap:  4192252 total,        0 used,  4192252 free.  2223320 cached Mem

So we are currently using 6GB out of 8GB. Most of that is attributed to Java, but a good part of it is cached memory: the daily memory drops inside Java, as reported by the Jenkins monitoring plugin, bring the usage down from about 1.5GB to 300MB.
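For an occasional manual check, plain ps/free is enough to distinguish the resident size of the Jenkins JVM from what is merely cache (nothing here is specific to our setup):

$ ps -o pid,rss,vsz,etime,cmd -C java   # resident size of the Jenkins master process
$ free -m                               # shows how much of "used" is buffers/cache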

I will still keep an eye on those boxes, but I don't see any remaining actionable item for me on this bug. If a crash happens again I will file a new bug. But let's hope that won't happen.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Summary: Constant crashes of Jenkins on mm-ci-production → Replace Mozmill CI master VMs with Ubuntu 64bit to hopefully stop low memory crashes of Java
Product: Mozilla QA → Mozilla QA Graveyard