AWS Sanity Check lies about how long an instance was shut down for...

Status: RESOLVED FIXED
Product: Release Engineering
Component: General
Reported: 4 years ago
Last updated: 3 months ago

People: (Reporter: Callek, Assigned: massimo)

Tracking: Firefox Tracking Flags: (Not tracked)

Attachments

(4 attachments, 4 obsolete attachments)

39.16 KB, patch (rail: review+; massimo: checked-in+)
4.69 KB, patch (rail: review+; massimo: checked-in+)
3.94 KB, patch (rail: review+; massimo: checked-in+)
1.05 KB, patch (rail: review+; massimo: checked-in+)
(Reporter)

Description

4 years ago
So, when reviewing the AWS sanity check output I saw:

==== Instances stopped for a while ====
0 buildbot-master56 (i-b0c8df82, us-west-2): down for 8157 hours
1 buildbot-master58 (i-26f2e514, us-west-2): down for 8157 hours
2 buildbot-master63 (i-5eb82736, us-east-1): down for 8116 hours
3 buildbot-master59 (i-9b09d0f9, us-east-1): down for 8116 hours
4 buildbot-master64 (i-784b594a, us-west-2): down for 8115 hours
5 buildbot-master57 (i-02cc2669, us-east-1): down for 3745 hours

Upon realizing just how long that is, and knowing how recent https://bugzilla.mozilla.org/show_bug.cgi?id=966070#c48 was, we could tell something was wrong here.

I'm not sure if this is AWS lying or us miscomputing. Either way, we should fix it if we can.
(Assignee)

Comment 1

4 years ago
Here's how the script calculates the "down for" period [1]:

if state is stop:
  down_for = (now - launch_time) / 3600

The boto library provides a launch_time, but there is no "stop_time" (or similar) [2], so for stopped instances we can only provide a rough estimate.

[1] https://hg.mozilla.org/build/cloud-tools/file/6bdf0cf1e8bf/scripts/aws_sanity_checker.py#l133
[2] http://docs.pythonboto.org/en/latest/ref/ec2.html#module-boto.ec2.instanceinfo
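For illustration, here is a minimal Python 3 sketch of the estimate described in [1]; the function name is hypothetical, and the ISO-8601 launch_time string shape is an assumption about what boto returns:

```python
from datetime import datetime

def estimate_down_hours(launch_time_iso, now=None):
    """Rough 'down for' estimate used by the sanity checker: hours since
    launch_time. For an instance that ran for months before being stopped,
    this wildly overestimates the real downtime, which is what produces
    figures like the 8157 hours above."""
    launch = datetime.strptime(launch_time_iso, "%Y-%m-%dT%H:%M:%S.%fZ")
    now = now or datetime.utcnow()
    return int((now - launch).total_seconds() // 3600)
```

Note that the result measures time since launch, not time since stop, so it is only an upper bound on the downtime.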
Duplicate of this bug: 991936

Updated

3 years ago
Assignee: nobody → mgervasini
(Assignee)

Comment 3

3 years ago
Created attachment 8430187 [details] [diff] [review]
[cloud-tools] Bug 983813 - get StopInstances timestamp from local cache.patch

This patch gets all the StopInstances events from cloudtrail logs and stores them locally so they can be used by aws_sanity_checker.py reports.

There are two new scripts:
* aws_get_cloudtrail_logs.py: it downloads the cloudtrail logs into a cache directory, /builds/aws-cloudtrail-logs (current and previous months' logs)

* aws_process_cloudtrail_logs.py: this script parses logs and stores the results in /builds/aws-cloudtrail-logs/events/<Event-Name>/<Instance-Id> in a json file like the following:
{"eventName": "StopInstances", "instances": "i-309b1f1e", "eventTime": "2014-02-25T17:11:07Z"}

Of course, aws_sanity_checker.py has been modified to get the stop time from the "events" directory.

Once this gets into production we can add some optimizations:
1. add a clean-up script; we could end up with thousands of old instance JSON files
2. use a database (SQLAlchemy + SQLite?) to store the data in a smarter way
Attachment #8430187 - Flags: review?(rail)
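A Python 3 sketch of how a consumer such as aws_sanity_checker.py could read the cached stop time back; the helper name is hypothetical, and the file layout follows the description above (<events_dir>/StopInstances/<instance-id>):

```python
import json
import os
from datetime import datetime

def get_stop_time(events_dir, instance_id):
    """Look up the cached StopInstances event for an instance, as written
    by aws_process_cloudtrail_logs.py; returns the eventTime as a datetime,
    or None when no event has been cached for this instance."""
    path = os.path.join(events_dir, "StopInstances", instance_id)
    if not os.path.exists(path):
        return None
    with open(path) as f:
        event = json.load(f)
    return datetime.strptime(event["eventTime"], "%Y-%m-%dT%H:%M:%SZ")
```

With this in place, "down for" can be computed from the real stop time, falling back to the launch_time estimate when no event is cached.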
(Assignee)

Comment 4

3 years ago
Created attachment 8430191 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch

crontabs and directories changes for cloudtrail logs
Attachment #8430191 - Flags: review?(rail)
Comment on attachment 8430191 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch

Review of attachment 8430191 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/dirs/manifests/builds/aws_cloudtrail_logs.pp
@@ +2,5 @@
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +class dirs::builds::aws_cloudtrail_logs {
> +    include dirs::builds
> +    include users::builder

Shouldn't it be users::buildduty?

@@ +7,5 @@
> +    file {
> +            "/builds/aws-cloudtrail-logs":
> +            ensure => directory,
> +            owner => "$users::builder::username",
> +            group => "$users::builder::group",

the same here?
Comment on attachment 8430187 [details] [diff] [review]
[cloud-tools] Bug 983813 - get StopInstances timestamp from local cache.patch

Review of attachment 8430187 [details] [diff] [review]:
-----------------------------------------------------------------

::: scripts/aws_get_cloudtrail_logs.py
@@ +6,5 @@
> +from repoze.lru import lru_cache
> +from multiprocessing import Pool
> +
> +
> +CWD = os.getcwd()

CWD is unused, please remove

@@ +8,5 @@
> +
> +
> +CWD = os.getcwd()
> +CACHE_DIR = "/builds/aws-cloudtrail-logs"
> +BASE_PREFIX = "AWSLogs/314336048151/CloudTrail"

Can you pass CACHE_DIR and BASE_PREFIX via command line?

@@ +9,5 @@
> +
> +CWD = os.getcwd()
> +CACHE_DIR = "/builds/aws-cloudtrail-logs"
> +BASE_PREFIX = "AWSLogs/314336048151/CloudTrail"
> +REGIONS = ["us-west-2", "us-east-1"]

please reuse regions from cloudtools/__init__.py

@@ +24,5 @@
> +            yield i
> +
> +
> +@lru_cache(10)
> +def get_connection():

Move this to cloudtools/__init__.py, rename as get_s3_connection()

@@ +26,5 @@
> +
> +@lru_cache(10)
> +def get_connection():
> +    #return boto.s3.connection
> +    return boto.s3.connection.S3Connection()

you can use shorter boto.connect_s3()

@@ +29,5 @@
> +    #return boto.s3.connection
> +    return boto.s3.connection.S3Connection()
> +
> +
> +def mkdir_p(dst_dir, exist_ok=True):

can you move this to cloudtools/fileutil.py or something like that?

@@ +43,5 @@
> +            raise
> +
> +
> +def limit_logs(limit=LIMIT_MONTHS):
> +    import datetime

can you move the import to the beginning of the file?

@@ +45,5 @@
> +
> +def limit_logs(limit=LIMIT_MONTHS):
> +    import datetime
> +    now = datetime.datetime.now()
> +    start_date = datetime.datetime.now() + datetime.timedelta(-LIMIT_MONTHS * 30)

maybe datetime.datetime.now() - datetime.timedelta(LIMIT_MONTHS * 30)

@@ +50,5 @@
> +
> +    days = []
> +    days.append(start_date.strftime("%Y/%m"))
> +    days.append(now.strftime("%Y/%m"))
> +    return days

you return a list with start and now, but the function name tells me about limiting logs. Can you rename the function?

also this is a good candidate for cloudtools/cloudtrail.py

@@ +65,5 @@
> +def write_to_disk(key):
> +    dst = os.path.join(CACHE_DIR, key.name)
> +    mkdir_p(os.path.dirname(dst))
> +
> +    import signal

can you move the import to the beginning of the file?

@@ +69,5 @@
> +    import signal
> +    signal.signal(signal.SIGALRM, _timeout)
> +    if not os.path.exists(dst):
> +        logging.info('downloading: {0}'.format(key.name))
> +        signal.alarm(GET_CONTENTS_TO_FILENAME_TIMEOUT)

can you add some comments why you use alarm here

@@ +80,5 @@
> +
> +if __name__ == '__main__':
> +    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
> +    conn = get_connection()
> +    bucket = conn.get_bucket("mozilla-releng-aws-logs")

can you pass this via CLI?

::: scripts/aws_process_cloudtrail_logs.py
@@ +3,5 @@
> +import logging
> +import json
> +import os
> +from multiprocessing import Pool
> +from aws_get_cloudtrail_logs import CACHE_DIR as CACHE_DIR

can you pass this via CLI?

@@ +4,5 @@
> +import json
> +import os
> +from multiprocessing import Pool
> +from aws_get_cloudtrail_logs import CACHE_DIR as CACHE_DIR
> +from aws_get_cloudtrail_logs import mkdir_p as mkdir_p

"as mkdir_p" is redundant here

@@ +9,5 @@
> +
> +CLOUDTRAIL_DIR = os.path.join(CACHE_DIR, 'AWSLogs')
> +SPOT_DATA_DIR = os.path.join(CACHE_DIR, 'SpotInstanceDataFeed')
> +EVENTS_DIR = os.path.join(CACHE_DIR, 'events')
> +BAD_LOGS = os.path.join(EVENTS_DIR, 'bad_logs')

Can you move to them cloudtools/cloudtrail.py or pass via cli?

@@ +12,5 @@
> +EVENTS_DIR = os.path.join(CACHE_DIR, 'events')
> +BAD_LOGS = os.path.join(EVENTS_DIR, 'bad_logs')
> +
> +
> +def get_data_from_gz_file(filename):

move to cloudtools/fileutil.py

@@ +19,5 @@
> +        with gzip.open(filename, 'rb') as f:
> +            return f.read()
> +    except IOError as error:
> +        logging.info('filename: %s', filename)
> +        logging.error(error)

you can also use the following to replace 2 lines with 1:
logging.error('filename: %s', filename, exc_info=True)

@@ +22,5 @@
> +        logging.info('filename: %s', filename)
> +        logging.error(error)
> +
> +
> +def get_data_from_json_file(filename):

move to cloudtools/fileutil.py

@@ +27,5 @@
> +    """returns a json object from filename"""
> +    try:
> +        logging.debug(filename)
> +        with open(filename, 'rb') as f:
> +            return json.loads(f.read())

return json.load(f) maybe?

@@ +30,5 @@
> +        with open(filename, 'rb') as f:
> +            return json.loads(f.read())
> +    except ValueError:
> +        # discard log file if it's not a good json file
> +        move_to_bad_logs(filename)

return None  # to be explicit?

@@ +42,5 @@
> +    mkdir_p(BAD_LOGS)
> +    name = os.path.split(filename)[1]
> +    dst_file = os.path.join(EVENTS_DIR, name)
> +    logging.debug("%s => %s", filename, dst_file)
> +    os.rename(filename, dst_file)

os.rename doesn't work if dest is another mount, use shutil.move
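The cross-filesystem point can be sketched like this (Python 3; the helper name is hypothetical): shutil.move falls back to copy-and-delete where os.rename would fail with EXDEV:

```python
import os
import shutil

def move_to_bad_logs(filename, bad_logs_dir):
    """Move a broken log file aside for later inspection. Unlike os.rename,
    shutil.move copies and deletes when source and destination live on
    different mounts, so it works regardless of the filesystem layout."""
    os.makedirs(bad_logs_dir, exist_ok=True)
    dst_file = os.path.join(bad_logs_dir, os.path.basename(filename))
    shutil.move(filename, dst_file)
    return dst_file
```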

@@ +56,5 @@
> +        return
> +
> +    logging.debug('processing: %s', filename)
> +    for record in data['Records']:
> +        eventName = record['eventName']

can you make the following loop safer? e.g. if the keys you access are missing, the loop should not crash the script; instead, move the file to bad logs or something else

@@ +58,5 @@
> +    logging.debug('processing: %s', filename)
> +    for record in data['Records']:
> +        eventName = record['eventName']
> +        # just process stop events, skip StartInstances and TerminateInstances
> +        if eventName in ('StopInstances',):

# can you use the same comparison method for both ifs? == vs in

@@ +61,5 @@
> +        # just process stop events, skip StartInstances and TerminateInstances
> +        if eventName in ('StopInstances',):
> +            process_start_stop_record(record)
> +        elif eventName == 'RequestSpotInstances':
> +            process_request_spot(record)

# this is not used, can you just remove it?

@@ +65,5 @@
> +            process_request_spot(record)
> +
> +
> +def process_request_spot(record):
> +    pass

remove it, the followup patch will be easier to read

@@ +68,5 @@
> +def process_request_spot(record):
> +    pass
> +
> +
> +def process_start_stop_record(record):

move to cloudtools/cloudtrail.py?

@@ +73,5 @@
> +    """process a start/stop/terminate row"""
> +    # this metod works with Start/Stop/Terminate events too
> +    time_ = record['eventTime']
> +    for item in record['requestParameters']['instancesSet']['items']:
> +        instanceId = item['instanceId']

can you make this also safer?

@@ +80,5 @@
> +                'eventTime': time_}
> +        write_to_json(data)
> +
> +
> +def get_time_from_file(filename):

move to cloudtools/cloudtrail.py?

@@ +83,5 @@
> +
> +def get_time_from_file(filename):
> +    """returns the eventTime from filename"""
> +    data = get_data_from_json_file(filename)
> +    return data['eventTime']

it will fail if you have a bad JSON file

@@ +86,5 @@
> +    data = get_data_from_json_file(filename)
> +    return data['eventTime']
> +
> +
> +def write_to_json(data):

move to cloudtools/fileutil.py?

@@ +96,5 @@
> +    filename = os.path.join(EVENTS_DIR, event, instance)
> +    mkdir_p(os.path.dirname(filename))
> +    if not os.path.exists(filename):
> +        with open(filename, 'w') as f:
> +            json.dump(data, f)

json.dump(f, data)

@@ +97,5 @@
> +    mkdir_p(os.path.dirname(filename))
> +    if not os.path.exists(filename):
> +        with open(filename, 'w') as f:
> +            json.dump(data, f)
> +    if data['eventTime'] > get_time_from_file(filename):

use elif to avoid unnecessary get_time_from_file()

@@ +100,5 @@
> +            json.dump(data, f)
> +    if data['eventTime'] > get_time_from_file(filename):
> +        # replace old event with current one
> +        with open(filename, 'w') as f:
> +            json.dump(data, f)

json.dump(f, data)
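Putting the review notes on this function together, here is a Python 3 sketch (with a hypothetical name) that writes the file once and replaces it only when the incoming event is newer; zero-padded ISO-8601 UTC timestamps compare chronologically as plain strings, so no parsing is needed:

```python
import json
import os

def write_event_if_newer(filename, data):
    """Keep only the most recent event per instance: create the file when
    it does not exist yet, otherwise replace it only when the new
    eventTime is later than the cached one. The else branch avoids
    re-reading a file we just wrote."""
    if not os.path.exists(filename):
        with open(filename, "w") as f:
            json.dump(data, f)
    else:
        with open(filename) as f:
            cached = json.load(f)
        if data["eventTime"] > cached["eventTime"]:
            with open(filename, "w") as f:
                json.dump(data, f)
```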

@@ +103,5 @@
> +        with open(filename, 'w') as f:
> +            json.dump(data, f)
> +
> +
> +def process_spot_data(filename):

remove it, the followup patch will be easier to read
Attachment #8430187 - Flags: review?(rail) → review-
Comment on attachment 8430191 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch

I think with the changes to the scripts you will need to prep another patch with more command line args.
Attachment #8430191 - Flags: review?(rail)
(Assignee)

Comment 8

3 years ago
Created attachment 8439424 [details] [diff] [review]
[cloud-tools] Bug 983813 - use cloudtrail logs in AWS sanity check

Hi Rail,

* updated the code so bucket names, directories, ..., are passed as command line options. 
* in aws_sanity_checker.py, if --events-dir is not specified, the script works as before.
* as discussed, the impaired section reports only non slave hosts.

a new puppet patch will follow soon
Attachment #8430187 - Attachment is obsolete: true
Attachment #8439424 - Flags: review?(rail)
(Assignee)

Comment 9

3 years ago
Created attachment 8439434 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch

This patch takes care of creating the new required directories and adds new cloudtrail jobs in cron
Attachment #8430191 - Attachment is obsolete: true
Attachment #8439434 - Flags: review?(rail)
Comment on attachment 8439424 [details] [diff] [review]
[cloud-tools] Bug 983813 - use cloudtrail logs in AWS sanity check

Review of attachment 8439424 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm!
Attachment #8439424 - Flags: review?(rail) → review+
Comment on attachment 8439434 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch

Review of attachment 8439434 [details] [diff] [review]:
-----------------------------------------------------------------

Need just one fix, see my comments below.

::: modules/aws_manager/manifests/init.pp
@@ +6,4 @@
>      include aws_manager::secrets
>      include aws_manager::cron
>      include nrpe::check::check_stop_idle
> +    include dirs::builds::aws_cloudtrail_logs

see my comments for install.pp

::: modules/aws_manager/manifests/install.pp
@@ +63,5 @@
> +            group   => "${users::buildduty::group}";
> +        "${aws_manager::settings::events_dir}":
> +            ensure  => directory,
> +            owner   => "${users::buildduty::username}",
> +            group   => "${users::buildduty::group}";

These 2 entries duplicate entries in dirs::builds::aws_cloudtrail_logs. You can delete dirs::builds::aws_cloudtrail_logs entirely; no need to bloat that module, IMHO, especially if you use ${aws_manager::settings} variables.

::: modules/dirs/manifests/builds/aws_cloudtrail_logs.pp
@@ +6,5 @@
> +    include users::builder
> +    include aws_manager::settings
> +    file {
> +            "${aws_manager::settings::cloudtrail_logs_dir}":
> +                ensure => rectory,

s/rectory/directory/
Attachment #8439434 - Flags: review?(rail) → review-
(Assignee)

Comment 12

3 years ago
Created attachment 8440799 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch

Hi rail,

thanks for the review! This new patch removes modules/dirs/manifests/builds/aws_cloudtrail_logs.pp
Attachment #8439434 - Attachment is obsolete: true
Attachment #8440799 - Flags: review?(rail)
(Assignee)

Comment 13

3 years ago
Created attachment 8440820 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch
Attachment #8440820 - Flags: review?(rail)
Comment on attachment 8440799 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch

Review of attachment 8440799 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/aws_manager/manifests/cron.pp
@@ +69,5 @@
> +            minute         => '35,15',
> +            cwd            => "${aws_manager::settings::cloud_tools_dst}/scripts",
> +            virtualenv_dir => "${aws_manager::settings::root}",
> +            user           => "${users::buildduty::username}",
> +            params         => "--cache_dir ${aws_manager::settings::cloudtrail_logs_dir} --s3-base-prefix ${::config::cloudtrail_s3_base_prefix} --s3-bucket ${::config::cloudtrail_s3_bucket}";

Conditional r+ -- you need to add "include ::config" since you use it here.
Attachment #8440799 - Flags: review?(rail) → review+
Attachment #8440820 - Attachment is obsolete: true
Attachment #8440820 - Flags: review?(rail)
Comment on attachment 8440820 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch

Ooops, wrong patch.

Don't forget to "include ::config"
Attachment #8440820 - Attachment is obsolete: false
Attachment #8440820 - Flags: review+
Attachment #8440799 - Attachment is obsolete: true
Attachment #8440799 - Flags: review+
(Assignee)

Updated

3 years ago
Attachment #8439424 - Flags: checked-in+
(Assignee)

Updated

3 years ago
Attachment #8440820 - Flags: checked-in+
Comment on attachment 8440820 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch

Review of attachment 8440820 [details] [diff] [review]:
-----------------------------------------------------------------

::: manifests/moco-config.pp
@@ +226,5 @@
>      $selfserve_agent_carrot_queue = "buildapi-agent-rabbit2"
>  
>      $aws_manager_mail_to = "release+aws-manager@mozilla.com"
> +    $cloudtrail_s3_bucket = "mozilla-releng-aws-logs"
> +    $cloudtrail_s3_base_prefix = "AWSLogs/314336048151/CloudTrail"

These two also need to be added to config::base
(Assignee)

Comment 17

3 years ago
Hi dustin,

cloudtrail_s3_bucket and cloudtrail_s3_base_prefix are set to "" in modules/config/manifests/base.pp.
Do they need the real value instead of the empty string?
No, they don't -- I somehow didn't see the change to base.pp.  Don't mind me.
(Assignee)

Comment 19

3 years ago
Created attachment 8449065 [details] [diff] [review]
[cloud-tools] Bug 983813 - clean up scripts for cloudtrail logs.patch

This script removes cloudtrail logs (based on directory name) and events older than 60 days
Attachment #8449065 - Flags: review?(rail)
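The 60-day cutoff can be sketched as follows (Python 3; the function name and the 'YYYY/MM/DD' directory shape are assumptions based on the description above):

```python
from datetime import datetime, timedelta

def last_day_to_keep(numdays=60, base=None):
    """Compute the retention cutoff for the clean-up: any log directory
    whose date component sorts before this path is older than numdays
    and can be removed."""
    base = base or datetime.today()
    return (base - timedelta(days=numdays)).strftime("%Y/%m/%d")
```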
(Assignee)

Comment 20

3 years ago
Created attachment 8449069 [details] [diff] [review]
[puppet] Bug 983813 - added cloudtrail clean up scripts to crontab.patch

crontab entry for cloudtrail logs clean up script
Attachment #8449069 - Flags: review?(rail)
Comment on attachment 8449065 [details] [diff] [review]
[cloud-tools] Bug 983813 - clean up scripts for cloudtrail logs.patch

Review of attachment 8449065 [details] [diff] [review]:
-----------------------------------------------------------------

Conditional r+, see comments below

::: scripts/aws_clean_log_dir.py
@@ +18,5 @@
> +    """removes cloudtrail directories"""
> +    try:
> +        for directory in os.listdir(root_dir):
> +            full_path = os.path.join(root_dir, directory)
> +            if full_path < reference_dir:

can you add a comment explaining why you compare 2 directories and what it means... it sounds like a magical string-comparison trick.
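The string comparison works because zero-padded 'YYYY/MM/DD' components sort lexicographically in the same order as chronologically; a tiny sketch (hypothetical helper name) of the invariant:

```python
def is_older(log_dir, cutoff_dir):
    """For equal-shape, zero-padded date paths, lexicographic order equals
    chronological order, so a plain string comparison is safe here."""
    return log_dir < cutoff_dir
```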

@@ +26,5 @@
> +        # root dir does not exist, nothing to delete here
> +        pass
> +
> +
> +def delete_obsolete_json_files(json_file, numdays):

Can you rename the function to delete_obsolete_json_file since it accepts a file.

@@ +71,5 @@
> +    today = datetime.date.today()
> +
> +    numdays = 60
> +    base = datetime.datetime.today()
> +    last_day_to_keep = (base - datetime.timedelta(days=numdays))

the parentheses around the right part are redundant, feel free to remove them

@@ +98,5 @@
> +        root_dir = reference_dir
> +        reference_dir = os.path.join(reference_dir, day)
> +        delete_obsolete_logs(root_dir, reference_dir)
> +
> +    print "deleting obsolete event files"

Can you replace all print statements with log.debug()?
Attachment #8449065 - Flags: review?(rail) → review+
Attachment #8449069 - Flags: review?(rail) → review+
(Assignee)

Updated

3 years ago
Attachment #8449065 - Flags: checked-in+
(Assignee)

Updated

3 years ago
Attachment #8449069 - Flags: checked-in+
merged to production
(Assignee)

Updated

3 years ago
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Component: Tools → General
Product: Release Engineering → Release Engineering