Closed Bug 1200139 Opened 7 years ago Closed 7 years ago
Replace Mozmill CI master VMs with Ubuntu 64bit to hopefully stop low memory crashes of Java
So by last week I upgraded the staging and production machines from Ubuntu 12.04 to 14.04. While staging is working properly, I have already seen 2 crashes of Jenkins on the production machine since Friday afternoon. Both times due to low memory:

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 32756 bytes for ChunkPool::allocate

As it looks, the old bug we had a couple of months ago is back. :( Maybe the newly installed Java version from the upgrade is causing this problem. I will have to dive deeper into this issue today. For now I will simply bring the machine back up. Let's see how long it survives.
Details of the Java error log:

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 32756 bytes for ChunkPool::allocate
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (allocation.cpp:211), pid=4525, tid=102509376
#
# JRE version: Java(TM) SE Runtime Environment (7.0_80-b15) (build 1.7.0_80-b15)
# Java VM: Java HotSpot(TM) Server VM (24.80-b11 mixed mode linux-x86 )
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

Current thread (0x08f42800):  JavaThread "Computer.threadPoolForRemoting [#3848] for mm-osx-106-1" daemon [_thread_in_vm, id=6072, stack(0x06172000,0x061c3000)]

Stack: [0x06172000,0x061c3000],  sp=0x061c13a0,  free space=316k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x7ea656]  VMError::report_and_die()+0x1a6
V  [libjvm.so+0x33eb82]  report_vm_out_of_memory(char const*, int, unsigned int, char const*)+0x72
V  [libjvm.so+0x18dc27]  ChunkPool::allocate(unsigned int, AllocFailStrategy::AllocFailEnum)+0x97
V  [libjvm.so+0x18d8cc]  Arena::grow(unsigned int, AllocFailStrategy::AllocFailEnum)+0x2c
V  [libjvm.so+0x7017e4]  resource_allocate_bytes(unsigned int, AllocFailStrategy::AllocFailEnum)+0x64
V  [libjvm.so+0x70c9bb]  ScopeDesc::sender() const+0x3b
V  [libjvm.so+0x7e49d9]  compiledVFrame::sender() const+0x69
V  [libjvm.so+0x7def2f]  vframe::java_sender() const+0xf
V  [libjvm.so+0x1d4eb7]  get_or_compute_monitor_info(JavaThread*)+0x247
V  [libjvm.so+0x1d571d]  revoke_bias(oopDesc*, bool, bool, JavaThread*)+0x1bd
V  [libjvm.so+0x1d5c35]  BiasedLocking::revoke_and_rebias(Handle, bool, Thread*)+0x3b5
V  [libjvm.so+0x76e35e]  ObjectSynchronizer::FastHashCode(Thread*, oopDesc*)+0x1ce
V  [libjvm.so+0x512f18]  JVM_IHashCode+0xa8
C  [libjava.so+0x10414]  Java_java_lang_System_identityHashCode+0x24
J 1052  java.lang.System.identityHashCode(Ljava/lang/Object;)I (0 bytes) @ 0xb37b50f5 [0xb37b5060+0x95]

Interesting here is again that we have a method of the monitoring plugin (get_or_compute_monitor_info) within the stack, which was causing issues like those before.
The listed command (https://github.com/mozilla/mozmill-ci/issues/450) to free up used memory doesn't seem to work any longer:

$ sync | sudo tee /proc/sys/vm/drop_caches

Oh, also keep in mind that we had a network outage over the weekend. Maybe that could also have played a role here.
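A likely reason that one-liner never does anything: as quoted, it pipes sync's (empty) stdout into tee, so zero bytes are written to /proc/sys/vm/drop_caches. The sketch below demonstrates the empty pipeline with a harmless sink; the commonly documented recipe (an assumption about what the GitHub issue intended, and it needs root) runs the two steps separately.

```shell
# sync prints nothing on stdout, so the pipe delivers 0 bytes to the sink.
# Using wc -c here instead of the real proc file so this is safe to run:
sync | wc -c

# The usual two-step form (assumption; the second step requires root):
#   sync
#   echo 3 | sudo tee /proc/sys/vm/drop_caches
```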
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
One problem we have here with the staging and production machines is that both run a 32-bit kernel! That's not so critical for staging, but production has 8GB of memory which cannot be fully addressed that way.

$ uname -a
Linux mm-ci-production.qa.scl3.mozilla.com 3.13.0-62-generic #102-Ubuntu SMP Tue Aug 11 14:28:35 UTC 2015 i686 i686 i686 GNU/Linux

$ cat /proc/meminfo
MemTotal:        8283944 kB
MemFree:         2645960 kB
Buffers:          375304 kB
Cached:          4574996 kB
SwapCached:          828 kB
Active:          1447432 kB
Inactive:        3803424 kB
Active(anon):      61208 kB
Inactive(anon):   248116 kB
Active(file):    1386224 kB
Inactive(file):  3555308 kB
Unevictable:           0 kB
Mlocked:               0 kB
HighTotal:       7475144 kB
HighFree:        2590972 kB
LowTotal:         808800 kB
LowFree:           54988 kB
SwapTotal:       1046524 kB
SwapFree:        1037372 kB
Dirty:                 4 kB
Writeback:             0 kB
AnonPages:        299832 kB
Mapped:            71152 kB
Shmem:              8768 kB
Slab:             349608 kB
SReclaimable:     333948 kB
SUnreclaim:        15660 kB
KernelStack:        3104 kB
PageTables:         6324 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     5188496 kB
Committed_AS:    2175768 kB
VmallocTotal:     122880 kB
VmallocUsed:        9276 kB
VmallocChunk:      90092 kB
HardwareCorrupted:     0 kB
AnonHugePages:    129024 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       14328 kB
DirectMap2M:      899072 kB

Chris, can I get a new Ubuntu 14.04 64-bit VM created which I can prepare and then use to replace mm-ci-production (DNS and IP address)?
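The LowTotal/LowFree lines in that snapshot are the crux on a 32-bit kernel: the kernel's own structures and parts of the JVM's native allocations compete for the small "low memory" zone, regardless of the 8GB installed. A small sketch converting those two figures to MB (values copied from the output above; the awk one-liner is just an illustration):

```shell
# Convert the 32-bit low-memory figures from the /proc/meminfo snapshot
# into MB. With only ~790 MB of low memory and ~54 MB of it free, native
# allocations can fail long before the full 8 GB of RAM is in use.
meminfo='LowTotal:         808800 kB
LowFree:           54988 kB'
echo "$meminfo" | awk '{printf "%s %d MB\n", $1, $2/1024}'
```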
> A 32-bit computer has a word size of 32 bits, this limits the memory theoretically to 4GB.
> This barrier has been extended through the use of 'Physical Address Extension' (or PAE)
> which increases the limit to 64GB although the memory access above 4GB will be slightly slower.

This might be one of the reasons why the box feels slower. With 8GB installed we clearly exceed the 4GB limit, so PAE comes into play.
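The 4GB figure in the quote follows directly from the pointer width; a quick shell arithmetic check:

```shell
# A 32-bit pointer can address 2^32 distinct bytes:
echo $((1 << 32))        # bytes addressable with 32-bit pointers
echo $((1 << 32 >> 30))  # the same figure in GiB (2^30 bytes per GiB)
```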
$ free -m
             total       used       free     shared    buffers     cached
Mem:       8283944    5620092    2663852       8768     358716    4575332
-/+ buffers/cache:     686044    7597900
Swap:      1046524       9332    1037192

$ sudo sysctl -w vm.drop_caches=3
vm.drop_caches = 3

mozauto@mm-ci-production:/data/mozmill-ci$ free -m
             total       used       free     shared    buffers     cached
Mem:          8089        428       7661          8          1         71
-/+ buffers/cache:        355       7734
Swap:         1021          9       1012

Jenkins is running again for now.
:whimboo: - Probably best for us to split that into a separate bug for tracking and "not crossing streams of questioning" purposes. I'm assuming that you'll want it at the same specs (8GB, 2 CPU, 16G /, 50G /data) Also, will you want to do the same thing to mm-ci-staging? Spin up a new bug (please) and answer my questions, and we can get rolling.
Thanks Chris. That was exactly my idea; I just wanted to get some feedback first. The memory usage numbers are looking fine at the moment, and I want to observe them for a while. There is one other big task for me this week, so I only want to tackle this bug if it is really necessary. I think later today or tomorrow we should know more. I will file a new bug then with the requirements.
We've had similar issues with Buildbot from time to time. It doesn't crash, but it does slow down over time. Something we've done to help avoid hitting it is restart them regularly during quiet or down times. Maybe something similar could help here? Eg: restart with a cronjob on the weekend.
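The restart-during-quiet-times idea could be sketched as a crontab entry. Everything here is an assumption: the schedule, and that Jenkins is managed as a "jenkins" init service on these boxes.

```shell
# Hypothetical root crontab entry: restart Jenkins every Sunday at 03:00,
# presumably a quiet time for the Mozmill CI queues. The service name
# "jenkins" is an assumption about how it is installed here.
0 3 * * 0  service jenkins restart
```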
So we had another crash today. It happened all of a sudden and most likely was triggered by the on-demand config upload from Robert for the Beta today. The memory usage was just fine when it happened; no critical values. So we have 3 days and 3 crashes!

-rw-rw-r-- 1 mozauto mozauto 121582 Aug 28 05:00 hs_err_pid2863.log
-rw-rw-r-- 1 mozauto mozauto  10956 Aug 30 05:22 hs_err_pid4525.log
-rw-rw-r-- 1 mozauto mozauto 130519 Sep  1 07:56 hs_err_pid7472.log

This is totally not going to work, so I will file the bug shortly to get a fresh 64-bit VM created. Also because the error log perfectly shows the issue:

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 1057552 bytes for Chunk::new
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
Since we replaced the VMs last week, I have some updated stats for ci-production:

top - 04:03:34 up 5 days, 20:21,  1 user,  load average: 0.19, 0.06, 0.06
Tasks: 129 total,   2 running, 127 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.3 sy,  0.0 ni, 98.0 id,  1.2 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   8176980 total,  5966956 used,  2210024 free,   393552 buffers
KiB Swap:  4192252 total,        0 used,  4192252 free.  2223320 cached Mem

So we are currently using 6GB out of 8GB. Most of it is used by Java, but a lot is cached memory: the recent memory drops (once daily) inside of Java, as reported by the Jenkins monitoring plugin, took the usage from 1.5GB down to 300MB. I will still keep an eye on those boxes, but I don't see any remaining actionable item for me on this bug. If a crash happens again I will file a new bug. But let's hope it will not happen.
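To keep Java well inside the 8GB on the new 64-bit VMs, the heap can be capped explicitly. This is a hypothetical fragment only; the exact file and variable names depend on how Jenkins was installed (the path shown follows the Debian/Ubuntu jenkins package convention), and the sizes are placeholders, not values taken from this bug.

```shell
# Hypothetical fragment of /etc/default/jenkins on the new 64-bit VM:
# cap the heap so Jenkins plus the OS page cache fit comfortably in 8 GB.
# -XX:MaxPermSize applies to the Java 7 JVM used here.
JAVA_ARGS="-Xmx2048m -XX:MaxPermSize=512m"
```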
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Summary: Constant crashes of Jenkins on mm-ci-production → Replace Mozmill CI master VMs with Ubuntu 64bit to hopefully stop low memory crashes of Java