Closed Bug 983813 Opened 10 years ago Closed 10 years ago

AWS Sanity Check lies about how long an instance was shut down for...

Categories

(Release Engineering :: General, defect)

x86_64
Windows 7
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: massimo)

References

Details

Attachments

(4 files, 4 obsolete files)

39.16 KB, patch (rail: review+, massimo: checked-in+)
4.69 KB, patch (rail: review+, massimo: checked-in+)
3.94 KB, patch (rail: review+, massimo: checked-in+)
1.05 KB, patch (rail: review+, massimo: checked-in+)
Sooo, when reviewing AWS sanity check I saw:

==== Instances stopped for a while ====
0 buildbot-master56 (i-b0c8df82, us-west-2): down for 8157 hours
1 buildbot-master58 (i-26f2e514, us-west-2): down for 8157 hours
2 buildbot-master63 (i-5eb82736, us-east-1): down for 8116 hours
3 buildbot-master59 (i-9b09d0f9, us-east-1): down for 8116 hours
4 buildbot-master64 (i-784b594a, us-west-2): down for 8115 hours
5 buildbot-master57 (i-02cc2669, us-east-1): down for 3745 hours

Upon realizing just how long that means, and knowing how recent https://bugzilla.mozilla.org/show_bug.cgi?id=966070#c48 was, we realized something was wrong here.

I'm not sure if this is AWS lying or us miscomputing, but either way we should fix it if we can.
Here's how the script calculates the "down for" period [1]:

if state == 'stopped':
    down_for = (now - launch_time) / 3600

The boto library provides a launch_time, but there is no "stop_time" (or similar) [2], so for stopped instances we only provide a rough estimate.
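
Roughly, in code (an untested sketch with simplified names, not the actual implementation):

import datetime

def hours_down(instance):
    # boto Instance objects expose launch_time as an ISO 8601 string,
    # e.g. "2014-02-25T17:11:07.000Z"
    launched = datetime.datetime.strptime(
        instance.launch_time, "%Y-%m-%dT%H:%M:%S.%fZ")
    # for a stopped instance this measures time since it last *launched*,
    # not since it stopped, hence the inflated numbers above
    return (datetime.datetime.utcnow() - launched).total_seconds() // 3600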

[1] https://hg.mozilla.org/build/cloud-tools/file/6bdf0cf1e8bf/scripts/aws_sanity_checker.py#l133
[2] http://docs.pythonboto.org/en/latest/ref/ec2.html#module-boto.ec2.instanceinfo
Assignee: nobody → mgervasini
This patch gets all the StopInstances events from cloudtrail logs and stores them locally so they can be used by aws_sanity_checker.py reports.

There are two new scripts:
* aws_get_cloudtrail_logs.py: downloads the cloudtrail logs (current and previous month) into a cache directory, /builds/aws-cloudtrail-logs

* aws_process_cloudtrail_logs.py: parses the logs and stores the results in /builds/aws-cloudtrail-logs/events/<Event-Name>/<Instance-Id> as a JSON file like the following:
{"eventName": "StopInstances", "instances": "i-309b1f1e", "eventTime": "2014-02-25T17:11:07Z"}

aws_sanity_checker.py has, of course, been modified to get the "stop time" from the "events" directory (see the sketch below).

Once this gets into production we can add some optimizations:
1. adding a clean-up script, since we could end up with thousands of old instance JSON files
2. using a database (SQLAlchemy + SQLite?) to store the data in a smarter way
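
For example, the sanity check can look up an instance's stop time roughly like this (untested sketch; get_stop_time is an illustration, not the actual function name):

import datetime
import json
import os

EVENTS_DIR = "/builds/aws-cloudtrail-logs/events"

def get_stop_time(instance_id):
    """Return the last recorded StopInstances time for instance_id, or None."""
    path = os.path.join(EVENTS_DIR, "StopInstances", instance_id)
    if not os.path.exists(path):
        return None
    with open(path) as f:
        event = json.load(f)
    # eventTime is stored as e.g. "2014-02-25T17:11:07Z"
    return datetime.datetime.strptime(event["eventTime"], "%Y-%m-%dT%H:%M:%SZ")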
Attachment #8430187 - Flags: review?(rail)
crontab and directory changes for cloudtrail logs
Attachment #8430191 - Flags: review?(rail)
Comment on attachment 8430191 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch

Review of attachment 8430191 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/dirs/manifests/builds/aws_cloudtrail_logs.pp
@@ +2,5 @@
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +class dirs::builds::aws_cloudtrail_logs {
> +    include dirs::builds
> +    include users::builder

Shouldn't it be users::buildduty?

@@ +7,5 @@
> +    file {
> +            "/builds/aws-cloudtrail-logs":
> +            ensure => directory,
> +            owner => "$users::builder::username",
> +            group => "$users::builder::group",

the same here?
Comment on attachment 8430187 [details] [diff] [review]
[cloud-tools] Bug 983813 - get StopInstances timestamp from local cache.patch

Review of attachment 8430187 [details] [diff] [review]:
-----------------------------------------------------------------

::: scripts/aws_get_cloudtrail_logs.py
@@ +6,5 @@
> +from repoze.lru import lru_cache
> +from multiprocessing import Pool
> +
> +
> +CWD = os.getcwd()

CWD is unused, please remove

@@ +8,5 @@
> +
> +
> +CWD = os.getcwd()
> +CACHE_DIR = "/builds/aws-cloudtrail-logs"
> +BASE_PREFIX = "AWSLogs/314336048151/CloudTrail"

Can you pass CACHE_DIR and BASE_PREFIX via command line?
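
e.g. something like (flag names are just a suggestion):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--cache-dir", default="/builds/aws-cloudtrail-logs")
parser.add_argument("--s3-base-prefix", required=True)
parser.add_argument("--s3-bucket", required=True)
args = parser.parse_args()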

@@ +9,5 @@
> +
> +CWD = os.getcwd()
> +CACHE_DIR = "/builds/aws-cloudtrail-logs"
> +BASE_PREFIX = "AWSLogs/314336048151/CloudTrail"
> +REGIONS = ["us-west-2", "us-east-1"]

please reuse regions from cloudtools/__init__.py

@@ +24,5 @@
> +            yield i
> +
> +
> +@lru_cache(10)
> +def get_connection():

Move this to cloudtools/__init__.py, rename as get_s3_connection()

@@ +26,5 @@
> +
> +@lru_cache(10)
> +def get_connection():
> +    #return boto.s3.connection
> +    return boto.s3.connection.S3Connection()

you can use shorter boto.connect_s3()

@@ +29,5 @@
> +    #return boto.s3.connection
> +    return boto.s3.connection.S3Connection()
> +
> +
> +def mkdir_p(dst_dir, exist_ok=True):

can you move this to cloudtools/fileutil.py or something like that?

@@ +43,5 @@
> +            raise
> +
> +
> +def limit_logs(limit=LIMIT_MONTHS):
> +    import datetime

can you move the import to the beginning of the file?

@@ +45,5 @@
> +
> +def limit_logs(limit=LIMIT_MONTHS):
> +    import datetime
> +    now = datetime.datetime.now()
> +    start_date = datetime.datetime.now() + datetime.timedelta(-LIMIT_MONTHS * 30)

maybe datetime.datetime.now() - datetime.timedelta(LIMIT_MONTHS * 30)

@@ +50,5 @@
> +
> +    days = []
> +    days.append(start_date.strftime("%Y/%m"))
> +    days.append(now.strftime("%Y/%m"))
> +    return days

you return a list with the start and now dates, but the function name suggests limiting logs. Can you rename the function?

also this is a good candidate for cloudtools/cloudtrail.py

@@ +65,5 @@
> +def write_to_disk(key):
> +    dst = os.path.join(CACHE_DIR, key.name)
> +    mkdir_p(os.path.dirname(dst))
> +
> +    import signal

can you move the import to the beginning of the file?

@@ +69,5 @@
> +    import signal
> +    signal.signal(signal.SIGALRM, _timeout)
> +    if not os.path.exists(dst):
> +        logging.info('downloading: {0}'.format(key.name))
> +        signal.alarm(GET_CONTENTS_TO_FILENAME_TIMEOUT)

can you add a comment explaining why you use alarm here?
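
e.g. something along these lines (untested):

def _timeout(signum, frame):
    # get_contents_to_filename can hang on a stalled connection;
    # SIGALRM lets the worker bail out instead of blocking forever
    raise IOError("S3 download timed out")

signal.signal(signal.SIGALRM, _timeout)
signal.alarm(GET_CONTENTS_TO_FILENAME_TIMEOUT)  # arm the watchdog
try:
    key.get_contents_to_filename(dst)
finally:
    signal.alarm(0)  # disarm once the download finishes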

@@ +80,5 @@
> +
> +if __name__ == '__main__':
> +    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
> +    conn = get_connection()
> +    bucket = conn.get_bucket("mozilla-releng-aws-logs")

can you pass this via CLI?

::: scripts/aws_process_cloudtrail_logs.py
@@ +3,5 @@
> +import logging
> +import json
> +import os
> +from multiprocessing import Pool
> +from aws_get_cloudtrail_logs import CACHE_DIR as CACHE_DIR

can you pass this via CLI?

@@ +4,5 @@
> +import json
> +import os
> +from multiprocessing import Pool
> +from aws_get_cloudtrail_logs import CACHE_DIR as CACHE_DIR
> +from aws_get_cloudtrail_logs import mkdir_p as mkdir_p

"as mkdir_p" is redundant here

@@ +9,5 @@
> +
> +CLOUDTRAIL_DIR = os.path.join(CACHE_DIR, 'AWSLogs')
> +SPOT_DATA_DIR = os.path.join(CACHE_DIR, 'SpotInstanceDataFeed')
> +EVENTS_DIR = os.path.join(CACHE_DIR, 'events')
> +BAD_LOGS = os.path.join(EVENTS_DIR, 'bad_logs')

Can you move to them cloudtools/cloudtrail.py or pass via cli?

@@ +12,5 @@
> +EVENTS_DIR = os.path.join(CACHE_DIR, 'events')
> +BAD_LOGS = os.path.join(EVENTS_DIR, 'bad_logs')
> +
> +
> +def get_data_from_gz_file(filename):

move to cloudtools/fileutil.py

@@ +19,5 @@
> +        with gzip.open(filename, 'rb') as f:
> +            return f.read()
> +    except IOError as error:
> +        logging.info('filename: %s', filename)
> +        logging.error(error)

you can also use the following to replace 2 lines with 1:
logging.error('filename: %s', filename, exc_info=True)

@@ +22,5 @@
> +        logging.info('filename: %s', filename)
> +        logging.error(error)
> +
> +
> +def get_data_from_json_file(filename):

move to cloudtools/fileutil.py

@@ +27,5 @@
> +    """returns a json object from filename"""
> +    try:
> +        logging.debug(filename)
> +        with open(filename, 'rb') as f:
> +            return json.loads(f.read())

return json.load(f) maybe?

@@ +30,5 @@
> +        with open(filename, 'rb') as f:
> +            return json.loads(f.read())
> +    except ValueError:
> +        # discard log file if it's not a good json file
> +        move_to_bad_logs(filename)

return None  # to be explicit?

@@ +42,5 @@
> +    mkdir_p(BAD_LOGS)
> +    name = os.path.split(filename)[1]
> +    dst_file = os.path.join(EVENTS_DIR, name)
> +    logging.debug("%s => %s", filename, dst_file)
> +    os.rename(filename, dst_file)

os.rename doesn't work if the destination is on another mount; use shutil.move(filename, dst_file) instead

@@ +56,5 @@
> +        return
> +
> +    logging.debug('processing: %s', filename)
> +    for record in data['Records']:
> +        eventName = record['eventName']

can you make the following loop safer? e.g. if the keys you access are missing, the loop should not blow up the script; instead, move the file to bad logs or something similar.
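
e.g. something like (untested):

for record in data.get('Records', []):
    eventName = record.get('eventName')
    if eventName is None:
        # malformed record: quarantine the file instead of crashing
        move_to_bad_logs(filename)
        return
    if eventName == 'StopInstances':
        process_start_stop_record(record)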

@@ +58,5 @@
> +    logging.debug('processing: %s', filename)
> +    for record in data['Records']:
> +        eventName = record['eventName']
> +        # just process stop events, skip StartInstances and TerminateInstances
> +        if eventName in ('StopInstances',):

can you use the same comparison method for both ifs? == vs in

@@ +61,5 @@
> +        # just process stop events, skip StartInstances and TerminateInstances
> +        if eventName in ('StopInstances',):
> +            process_start_stop_record(record)
> +        elif eventName == 'RequestSpotInstances':
> +            process_request_spot(record)

this is not used, can you just remove it?

@@ +65,5 @@
> +            process_request_spot(record)
> +
> +
> +def process_request_spot(record):
> +    pass

remove it, the followup patch will be easier to read

@@ +68,5 @@
> +def process_request_spot(record):
> +    pass
> +
> +
> +def process_start_stop_record(record):

move to cloudtools/cloudtrail.py?

@@ +73,5 @@
> +    """process a start/stop/terminate row"""
> +    # this method works with Start/Stop/Terminate events too
> +    time_ = record['eventTime']
> +    for item in record['requestParameters']['instancesSet']['items']:
> +        instanceId = item['instanceId']

can you make this also safer?

@@ +80,5 @@
> +                'eventTime': time_}
> +        write_to_json(data)
> +
> +
> +def get_time_from_file(filename):

move to cloudtools/cloudtrail.py?

@@ +83,5 @@
> +
> +def get_time_from_file(filename):
> +    """returns the eventTime from filename"""
> +    data = get_data_from_json_file(filename)
> +    return data['eventTime']

it will fail if you have a bad JSON file (get_data_from_json_file returns None in that case)
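
e.g.:

data = get_data_from_json_file(filename)
if data is None:
    # bad JSON has already been moved to bad_logs
    return None
return data.get('eventTime')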

@@ +86,5 @@
> +    data = get_data_from_json_file(filename)
> +    return data['eventTime']
> +
> +
> +def write_to_json(data):

move to cloudtools/fileutil.py?

@@ +96,5 @@
> +    filename = os.path.join(EVENTS_DIR, event, instance)
> +    mkdir_p(os.path.dirname(filename))
> +    if not os.path.exists(filename):
> +        with open(filename, 'w') as f:
> +            json.dump(data, f)

json.dump(data, f)

@@ +97,5 @@
> +    mkdir_p(os.path.dirname(filename))
> +    if not os.path.exists(filename):
> +        with open(filename, 'w') as f:
> +            json.dump(data, f)
> +    if data['eventTime'] > get_time_from_file(filename):

use elif to avoid unnecessary get_time_from_file()
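
i.e. (sketch):

if not os.path.exists(filename):
    with open(filename, 'w') as f:
        json.dump(data, f)
elif data['eventTime'] > get_time_from_file(filename):
    # replace the old event with the newer one
    with open(filename, 'w') as f:
        json.dump(data, f)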

@@ +100,5 @@
> +            json.dump(data, f)
> +    if data['eventTime'] > get_time_from_file(filename):
> +        # replace old event with current one
> +        with open(filename, 'w') as f:
> +            json.dump(data, f)

json.dump(data, f)

@@ +103,5 @@
> +        with open(filename, 'w') as f:
> +            json.dump(data, f)
> +
> +
> +def process_spot_data(filename):

remove it, the followup patch will be easier to read
Attachment #8430187 - Flags: review?(rail) → review-
Comment on attachment 8430191 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch

I think with the changes to the scripts you will need to prep another patch with more command line args.
Attachment #8430191 - Flags: review?(rail)
Hi Rail,

* updated the code so bucket names, directories, etc. are passed as command line options.
* in aws_sanity_checker.py, if --events-dir is not specified, the script works as before.
* as discussed, the impaired section now reports only non-slave hosts.

a new puppet patch will follow soon
Attachment #8430187 - Attachment is obsolete: true
Attachment #8439424 - Flags: review?(rail)
This patch takes care of creating the new required directories and adds the new cloudtrail jobs to cron
Attachment #8430191 - Attachment is obsolete: true
Attachment #8439434 - Flags: review?(rail)
Comment on attachment 8439424 [details] [diff] [review]
[cloud-tools] Bug 983813 - use cloudtrail logs in AWS sanity check

Review of attachment 8439424 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm!
Attachment #8439424 - Flags: review?(rail) → review+
Comment on attachment 8439434 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch

Review of attachment 8439434 [details] [diff] [review]:
-----------------------------------------------------------------

Need just one fix, see my comments below.

::: modules/aws_manager/manifests/init.pp
@@ +6,4 @@
>      include aws_manager::secrets
>      include aws_manager::cron
>      include nrpe::check::check_stop_idle
> +    include dirs::builds::aws_cloudtrail_logs

see my comments for install.pp

::: modules/aws_manager/manifests/install.pp
@@ +63,5 @@
> +            group   => "${users::buildduty::group}";
> +        "${aws_manager::settings::events_dir}":
> +            ensure  => directory,
> +            owner   => "${users::buildduty::username}",
> +            group   => "${users::buildduty::group}";

These 2 entries duplicate entries in dirs::builds::aws_cloudtrail_logs. You can completely delete dirs::builds::aws_cloudtrail_logs; no need to bloat that module, IMHO, especially if you use ${aws_manager::settings} variables.

::: modules/dirs/manifests/builds/aws_cloudtrail_logs.pp
@@ +6,5 @@
> +    include users::builder
> +    include aws_manager::settings
> +    file {
> +            "${aws_manager::settings::cloudtrail_logs_dir}":
> +                ensure => rectory,

s/rectory/directory/
Attachment #8439434 - Flags: review?(rail) → review-
Hi rail,

thanks for the review! This new patch removes modules/dirs/manifests/builds/aws_cloudtrail_logs.pp
Attachment #8439434 - Attachment is obsolete: true
Attachment #8440799 - Flags: review?(rail)
Comment on attachment 8440799 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch

Review of attachment 8440799 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/aws_manager/manifests/cron.pp
@@ +69,5 @@
> +            minute         => '35,15',
> +            cwd            => "${aws_manager::settings::cloud_tools_dst}/scripts",
> +            virtualenv_dir => "${aws_manager::settings::root}",
> +            user           => "${users::buildduty::username}",
> +            params         => "--cache_dir ${aws_manager::settings::cloudtrail_logs_dir} --s3-base-prefix ${::config::cloudtrail_s3_base_prefix} --s3-bucket ${::config::cloudtrail_s3_bucket}";

Conditional r+ -- you need to add "include ::config" since you use it here.
Attachment #8440799 - Flags: review?(rail) → review+
Attachment #8440820 - Attachment is obsolete: true
Attachment #8440820 - Flags: review?(rail)
Comment on attachment 8440820 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch

Oops, wrong patch.

Don't forget to "include ::config"
Attachment #8440820 - Attachment is obsolete: false
Attachment #8440820 - Flags: review+
Attachment #8440799 - Attachment is obsolete: true
Attachment #8440799 - Flags: review+
Attachment #8439424 - Flags: checked-in+
Attachment #8440820 - Flags: checked-in+
Comment on attachment 8440820 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch

Review of attachment 8440820 [details] [diff] [review]:
-----------------------------------------------------------------

::: manifests/moco-config.pp
@@ +226,5 @@
>      $selfserve_agent_carrot_queue = "buildapi-agent-rabbit2"
>  
>      $aws_manager_mail_to = "release+aws-manager@mozilla.com"
> +    $cloudtrail_s3_bucket = "mozilla-releng-aws-logs"
> +    $cloudtrail_s3_base_prefix = "AWSLogs/314336048151/CloudTrail"

These two also need to be added to config::base
Hi dustin,

cloudtrail_s3_bucket and cloudtrail_s3_base_prefix are set to "" in modules/config/manifests/base.pp.
Do they need the real value instead of the empty string?
No, they don't -- I somehow didn't see the change to base.pp.  Don't mind me.
This script removes cloudtrail logs (based on directory name) and events older than 60 days
Attachment #8449065 - Flags: review?(rail)
crontab entry for cloudtrail logs clean up script
Attachment #8449069 - Flags: review?(rail)
Comment on attachment 8449065 [details] [diff] [review]
[cloud-tools] Bug 983813 - clean up scripts for cloudtrail logs.patch

Review of attachment 8449065 [details] [diff] [review]:
-----------------------------------------------------------------

Conditional r+, see comments below

::: scripts/aws_clean_log_dir.py
@@ +18,5 @@
> +    """removes cloudtrail directories"""
> +    try:
> +        for directory in os.listdir(root_dir):
> +            full_path = os.path.join(root_dir, directory)
> +            if full_path < reference_dir:

can you add a comment explaining why you compare 2 directories and what it means... it looks like a magical string comparison trick.
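
Presumably it works because the path components are zero-padded dates, so lexicographic order matches chronological order:

>>> "2014/05/31" < "2014/06/01"
True

...but that's exactly what the comment should spell out.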

@@ +26,5 @@
> +        # root dir does not exist, nothing to delete here
> +        pass
> +
> +
> +def delete_obsolete_json_files(json_file, numdays):

Can you rename the function to delete_obsolete_json_file, since it accepts a single file?

@@ +71,5 @@
> +    today = datetime.date.today()
> +
> +    numdays = 60
> +    base = datetime.datetime.today()
> +    last_day_to_keep = (base - datetime.timedelta(days=numdays))

the parentheses around the right part are redundant, feel free to remove them

@@ +98,5 @@
> +        root_dir = reference_dir
> +        reference_dir = os.path.join(reference_dir, day)
> +        delete_obsolete_logs(root_dir, reference_dir)
> +
> +    print "deleting obsolete event files"

Can you replace all print statements with log.debug()?
Attachment #8449065 - Flags: review?(rail) → review+
Attachment #8449069 - Flags: review?(rail) → review+
Attachment #8449065 - Flags: checked-in+
Attachment #8449069 - Flags: checked-in+
merged to production
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: Tools → General