Closed Bug 964880 Opened 10 years ago Closed 10 years ago

Add swap to linux build machines with <12GB RAM

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: x86 Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Assigned: massimo)

References

Details

Attachments

(3 files, 2 obsolete files)

started on: https://tbpl.mozilla.org/?tree=Mozilla-Aurora&rev=a581d53d51f4

other cases:
https://tbpl.mozilla.org/?tree=Mozilla-Aurora&rev=4936f09590a0
https://tbpl.mozilla.org/php/getParsedLog.php?id=33691486&tree=Mozilla-Aurora

example log:
https://tbpl.mozilla.org/php/getParsedLog.php?id=33695402&tree=Mozilla-Aurora&full=1

error context:
collect2: error: ld terminated with signal 9 [Killed]
make[5]: *** [libxul.so] Error 1
make[5]: Leaving directory `/builds/slave/m-aurora-l64-00000000000000000/build/obj-firefox/toolkit/library'
make[4]: *** [libs] Error 2
make[4]: Leaving directory `/builds/slave/m-aurora-l64-00000000000000000/build/obj-firefox'
make[3]: *** [default] Error 2
make[3]: Leaving directory `/builds/slave/m-aurora-l64-00000000000000000/build/obj-firefox'
make[2]: *** [realbuild] Error 2
make[2]: Leaving directory `/builds/slave/m-aurora-l64-00000000000000000/build'
make[1]: *** [profiledbuild] Error 2
make[1]: Leaving directory `/builds/slave/m-aurora-l64-00000000000000000/build'
make: *** [build] Error 2
State Changed: unlock buildroot
I am worried that this could be a memory issue: collect2: ld terminated with signal 9 [Killed]

Could this be due to a difference in memory sizes on our EC2 machines?
Flags: needinfo?(rail)
Summary: Mozilla Aurora Linux64 opt builds failing: make/compile step → Linux64 opt builds failing: make/compile step
We have recently switched EC2 instance types from m3.xlarge to c3.xlarge, dropping RAM from 15GB to 7.5GB.

Based on errors like this: collect2: error: ld terminated with signal 9 [Killed]

I have a feeling we are hitting memory limits on builds like linux64 opt.

Trees are currently down, so it is hard to verify this with current builds. However, one linux64 esr24 build is running on one of our 'in-house' Linux machines. These machines also have around 7.5GB RAM, but with an additional 4GB of swap space.

Taking a recent snapshot of free memory from one of these in-house machines at a similar point to where the builds have been failing (~85 min), we can see that free memory is getting scarce and swap is being used a fair bit:

[root@bld-linux64-ix-036.build.scl1.mozilla.com ~]# free -m 
             total       used       free     shared    buffers     cached
Mem:          7864       7753        110          0          0         35
-/+ buffers/cache:       7717        146
Swap:         4095       2249       1846
It sounds like an OOM kill. We may need to add swap (on the SSD instance storage for bld-linux64). I hope this is not an issue for the try-linux64-ec2 machines; otherwise we may need to use EBS for swap (instance storage is already mounted as /builds/slave).
Flags: needinfo?(rail)
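For reference, an OOM kill can usually be confirmed from the affected slave itself; a minimal sketch (the log path is the usual CentOS default, not something verified against these particular AMIs):

    dmesg | grep -iE 'out of memory|killed process'
    grep -i 'killed process' /var/log/messages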
Instead of re-partitioning to add swap, we could add a swap file on instance storage.
OS: Mac OS X → Linux
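A minimal sketch of what a swap file on instance storage could look like; the path and size here are placeholders (the add_swap script that eventually lands in this bug defaults to /swap_file and 4GB):

    SWAP_FILE=/builds/slave/swap_file                 # instance storage mount, per the previous comment
    dd if=/dev/zero of="$SWAP_FILE" bs=1M count=4096  # 4GB backing file (a faster alternative is suggested later in this bug)
    chmod 600 "$SWAP_FILE"                            # swapon warns about world-readable swap files
    mkswap "$SWAP_FILE"                               # write the swap signature
    swapon "$SWAP_FILE"                               # enable it; add an /etc/fstab entry if it should survive reboots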
Bumping priority on this. We're hitting these OOMs on !trunk very frequently, and it takes 2+ hours on a retrigger before we know if it'll die again.
Severity: normal → critical
Even better, now a lot of the retriggers aren't even working. But hey, it's not like we have a last-second crash landing on esr24 that needs builds and tests or something.
[23:03:08]	RyanVM	to recap: most !trunk linux64 builds OOM and die and we can't retrigger them
[23:03:25]	RyanVM	last_good_build is going to *love* this tomorrow
[23:03:31]	nthomas	so this was a new class of slave right ?
[23:03:34]	RyanVM	s/build/rev
[23:03:36]	RyanVM	yes
[23:03:38]	RyanVM	c3 or something

Is it worth reverting the c3.whatever instance type due to this bug?
Flags: needinfo?(taras.mozilla)
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
My bet is this is a PGO-only problem.
Probably, otherwise we'd be seeing it on the non-unified trunk builds. Or trunk PGO builds where unified is saving our bacon.
(In reply to Rail Aliiev [:rail] from comment #4)
> I hope this is not an issue for the try-linux64-ec2 machines

Doesn't appear to be - I ran 49 (because I can't count to 50) PGO non-unified trunk builds on try last night, and didn't hit this once.
(In reply to Justin Wood (:Callek) from comment #8)
> [23:03:08]	RyanVM	to recap: most !trunk linux64 builds OOM and die and we
> can't retrigger them
> [23:03:25]	RyanVM	last_good_build is going to *love* this tomorrow
> [23:03:31]	nthomas	so this was a new class of slave right ?
> [23:03:34]	RyanVM	s/build/rev
> [23:03:36]	RyanVM	yes
> [23:03:38]	RyanVM	c3 or something
> 
> Is it worth reverting the c3.whatever instance type due to this bug?
No. These new c3 nodes are slightly faster, have SSDs, and are much cheaper. Sorry this came up. We didn't notice this during testing due to unified builds.

catlee will add a swap file tomorrow.

catlee, ideally you'd use the fallocate utility to create the file. It's faster and better than using dd or a sparse file: https://ausinfotech.net/blog/fallocate-linux-utility-to-preallocate-space-to-a-file/
Flags: needinfo?(taras.mozilla)
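For comparison, the three ways of creating the backing file that the comment alludes to: fallocate reserves the blocks instantly, dd actually writes out the zeros, and a sparse file allocates nothing up front (and swapon rejects files with holes, so it is not usable for swap anyway):

    fallocate -l 4G /swap_file                        # instant: blocks reserved, nothing written
    dd if=/dev/zero of=/swap_file bs=1M count=4096    # slow: writes 4GB of zeros
    truncate -s 4G /swap_file                         # sparse: no blocks allocated; swapon refuses files with holes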
and indeed our build machines do have fallocate installed.

I watched a pair of machines through a build cycle on aurora after adding 4G swap, and neither ended up using more than 500k of swap. We're really close to the edge here!

Let's add 4G just to be safe for the future.
Flags: needinfo?(catlee)
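A rough sketch of how that measurement can be repeated if someone wants to watch another build cycle (the interval and log path are arbitrary):

    # Sample memory and swap once a minute while a build runs.
    while true; do
        date
        free -m
        sleep 60
    done >> /tmp/swap-watch.log 2>&1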
(In reply to Chris AtLee [:catlee] from comment #13)
> and indeed our build machines do have fallocate installed.
> 
> I watched a pair of machines through a build cycle on aurora after adding 4G
> swap, and neither ended up using more than 500k of swap. We're really close
> to the edge here!
> 
> Let's add 4G just to be safe for the future.

Sounds good. However, once we have proper pooling, let's pool builds that need more memory onto machines with swap files and keep other machines swap-free.
Blocks: 944113
Massimo and I are going to poke this issue today. The idea is to add a swap partition or file. If this approach doesn't work we may consider switching back to older hardware. :/
Flags: needinfo?(rail)
Added a script to create and mount a swap file for CentOS/xenhvm machines.
Attachment #8368632 - Flags: review?(dustin)
Assignee: nobody → mgervasini
Comment on attachment 8368632 [details] [diff] [review]
adding swap file to centos/xenhvm instances

Review of attachment 8368632 [details] [diff] [review]:
-----------------------------------------------------------------

Rail just made some comments on bug 924430 regarding use of 'xenhvm' to indicate an AWS instance.  It doesn't look like it's the best indicator.  Let's land this with the suggested modifications, and I'll file a new bug to add an aws_instance_type fact that can be used instead.

::: modules/tweaks/manifests/swap.pp
@@ +1,4 @@
> +# This Source Code Form is subject to the terms of the Mozilla Public
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +class tweaks::swap {

The name of this class suggests "I enable swap", when in fact for the vast majority of hosts it does no such thing. Can you rename it to something like tweaks::swap_on_instance_storage? That at least makes it clear what it's doing. An extra comment explaining this, and pointing to this bug, would help too.

The second bit will be to only include this class where necessary.  I think that will be easier to fix with an `aws_instance_type` fact, which can be used in toplevel::slave::build::mock to only include the class on `c3.*` instances.  Once that's in place, both of the case statements in this class can have defaults which fail(), instead of simply doing nothing.

@@ +19,5 @@
> +                    exec {
> +                        "/sbin/service add_swap start || true":
> +                            subscribe   => File["/etc/init.d/add_swap"],
> +                            refreshonly => true;
> +                    }

This bit shouldn't be necessary if you use ensure => running.  It might be good to add a 'status' command, or make starting idempotent.  I don't know what happens if you delete a swapfile and then run swapon on the new version..
Attachment #8368632 - Flags: review?(dustin) → review+
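For what it's worth, one way the init script could make 'start' idempotent and grow a 'status' action along the lines suggested above (the swapon -s check is an assumption for illustration, not what the attachment actually does; DEFAULT_SWAP_FILE is the variable already defined in the script):

    # In /etc/init.d/add_swap: do nothing if the swap file is already active.
    start() {
        if swapon -s | awk '{print $1}' | grep -qx "$DEFAULT_SWAP_FILE"; then
            echo "swap file $DEFAULT_SWAP_FILE already enabled"
            return 0
        fi
        # ... create the file, mkswap, swapon as before ...
    }

    # A cheap 'status' lets puppet's ensure => running decide whether to start it.
    status() {
        swapon -s | awk '{print $1}' | grep -qx "$DEFAULT_SWAP_FILE"
    }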
Comment on attachment 8368632 [details] [diff] [review]
adding swap file to centos/xenhvm instances

https://hg.mozilla.org/build/puppet/rev/235d596dfcae with some changes
Attachment #8368632 - Flags: checked-in+
I see swap enabled on freshly rebooted/started slaves. \o/

In a couple of hours the errors should disappear.
(In reply to Rail Aliiev [:rail] from comment #18)
> Comment on attachment 8368632 [details] [diff] [review]
> adding swap file to centos/xenhvm instances
> 
> https://hg.mozilla.org/build/puppet/rev/235d596dfcae with some changes

Why is this pattern-matching on instance type vs amount of ram < 12gb? Seems fragile
(In reply to Taras Glek (:taras) from comment #20)
> Why is this pattern-matching on instance type vs amount of ram < 12gb? Seems
> fragile

This is a good idea.

Massimo, can we use RAM instead, so we don't rely on whether a slave is in AWS or not?
Attached patch add_swap.patch (obsolete) — Splinter Review
Hi Rail,

this patch removes the Amazon box check; a swap file is created if MemTotal (from /proc/meminfo) is smaller than 12 GB.
Attachment #8373299 - Flags: review?(rail)
Comment on attachment 8373299 [details] [diff] [review]
add_swap.patch

Review of attachment 8373299 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/tweaks/files/add_swap
@@ +17,5 @@
>  PATH="/sbin:/bin:/usr/sbin:/usr/bin"
>  
>  DEFAULT_SWAP_FILE="/swap_file"
> +MIN_MEMORY_SIZE_GB=12
> +MIN_MEMORY_SIZE_KB="(($MIN_MEMORY_SIZE_GB * 1024 * 1024))"

You forgot "$" -- it should be $((...))

@@ +18,5 @@
>  
>  DEFAULT_SWAP_FILE="/swap_file"
> +MIN_MEMORY_SIZE_GB=12
> +MIN_MEMORY_SIZE_KB="(($MIN_MEMORY_SIZE_GB * 1024 * 1024))"
> +MEMORY_SIZE_KB="$(cat /proc/meminfo | grep MemTotal | sed -n 's/^.* \([[:digit:]]*\) .*$/\1/p')"

It's hard to parse this... Can you use something like 

MEMORY_SIZE_KB="$(grep ^MemTotal: /proc/meminfo | awk '{print $2}')"

@@ +24,5 @@
>  
>  set -e
>  
> +# Do we need a swap file? If not, exit without error.
> +if (($MEMORY_SIZE_KB > $MIN_MEMORY_SIZE_KB)); then exit 0; fi

I don't think this will work; you need to use [ .. ] for conditions.

I'd suggest something like this:

if [ -z $MEMORY_SIZE_KB ]; then
    log "Unable to detect RAM available, adding swap..."
elif [ $MEMORY_SIZE_KB -gt $MIN_MEMORY_SIZE_KB ]; then
    # check if we need a swap file; if not, exit without error
    log "No swap required, $MEMORY_SIZE_KB KB mem available"
    exit 0
fi
Attachment #8373299 - Flags: review?(rail) → review-
Summary: Linux64 opt builds failing: make/compile step → Add swap to linux build machines with <12GB RAM
Attached patch 964880.patchSplinter Review
Hi Rail,

I've updated the patch as requested, here is the new one.
Attachment #8373299 - Attachment is obsolete: true
Attachment #8374171 - Flags: review?(rail)
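The updated attachment isn't quoted here, but folding rail's suggestions back into the reviewed snippet, the check presumably ends up roughly like this (a sketch of the reviewed logic, not the literal patch; log is the script's existing logging helper used in the review):

    DEFAULT_SWAP_FILE="/swap_file"
    MIN_MEMORY_SIZE_GB=12
    MIN_MEMORY_SIZE_KB=$(($MIN_MEMORY_SIZE_GB * 1024 * 1024))
    MEMORY_SIZE_KB="$(grep ^MemTotal: /proc/meminfo | awk '{print $2}')"

    if [ -z "$MEMORY_SIZE_KB" ]; then
        log "Unable to detect RAM available, adding swap..."
    elif [ "$MEMORY_SIZE_KB" -gt "$MIN_MEMORY_SIZE_KB" ]; then
        # enough RAM, no swap file needed
        log "No swap required, $MEMORY_SIZE_KB KB mem available"
        exit 0
    fi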
Attachment #8374171 - Flags: review?(rail) → review+
Severity: critical → normal
Attachment #8374171 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
In bug 966070, we are moving from m1.large to m3.medium; we need to enable the same swap file mechanism on masters too.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 966070
Attached patch 964880.master.patch (obsolete) — Splinter Review
Enabling swap on buildbot master nodes too.

add_swap will be available on every buildbot master, but the script will create a swap file (4GB) only if the machine has less than 12GB RAM, as defined in the script itself.
Attachment #8382250 - Flags: review?(rail)
Comment on attachment 8382250 [details] [diff] [review]
964880.master.patch

Can you use the following file instead:
modules/toplevel/manifests/server/buildmaster/mozilla.pp

just to follow the same pattern as we introduced in modules/toplevel/manifests/slave/build/mock.pp

Additionally, can you also wrap it with | if $::virtual == "xenhvm" | (just being super cautious).
Attachment #8382250 - Flags: review?(rail) → review-
Thanks, rail!

Here's the updated patch.
Attachment #8382250 - Attachment is obsolete: true
Attachment #8382316 - Flags: review?(rail)
Attachment #8382316 - Flags: review?(rail) → review+
Attachment #8382316 - Flags: checked-in+
Tested on buildbot-master70. It works.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard