Closed
Bug 983813
Opened 11 years ago
Closed 11 years ago
AWS Sanity Check lies about how long an instance was shut down for...
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Callek, Assigned: massimo)
References
Details
Attachments
(4 files, 4 obsolete files)
39.16 KB, patch | rail: review+ | massimo: checked-in+ | Details | Diff | Splinter Review
4.69 KB, patch | rail: review+ | massimo: checked-in+ | Details | Diff | Splinter Review
3.94 KB, patch | rail: review+ | massimo: checked-in+ | Details | Diff | Splinter Review
1.05 KB, patch | rail: review+ | massimo: checked-in+ | Details | Diff | Splinter Review
Sooo, when reviewing the AWS sanity check report I saw:
==== Instances stopped for a while ====
0 buildbot-master56 (i-b0c8df82, us-west-2): down for 8157 hours
1 buildbot-master58 (i-26f2e514, us-west-2): down for 8157 hours
2 buildbot-master63 (i-5eb82736, us-east-1): down for 8116 hours
3 buildbot-master59 (i-9b09d0f9, us-east-1): down for 8116 hours
4 buildbot-master64 (i-784b594a, us-west-2): down for 8115 hours
5 buildbot-master57 (i-02cc2669, us-east-1): down for 3745 hours
Upon realizing just how long that actually is (8157 hours is roughly 11 months) and knowing how recent https://bugzilla.mozilla.org/show_bug.cgi?id=966070#c48 was, we could tell something was wrong here.
I'm not sure if this is AWS lying, us miscomputing, or something else. Either way, we should fix it if we can.
Assignee
Comment 1•11 years ago
Here's how the script calculates the "down for" period [1]:

if state is stopped:
    down_for = (now - launch_time) / 3600
The boto library provides a launch_time but there's no "stop_time" (or similar), so for stopped instances we provide a rough estimate.
[1] https://hg.mozilla.org/build/cloud-tools/file/6bdf0cf1e8bf/scripts/aws_sanity_checker.py#l133
[2] http://docs.pythonboto.org/en/latest/ref/ec2.html#module-boto.ec2.instanceinfo
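For illustration, a minimal sketch of that estimate (the attribute names and launch_time format are assumptions based on boto's EC2 instance objects, see [2]):

import calendar
import time

def estimate_down_for_hours(instance):
    """Rough 'down for' estimate currently used for stopped instances."""
    if instance.state == "stopped":
        # launch_time is when the instance last *started*, e.g.
        # "2013-06-20T12:00:00.000Z"; there is no stop_time, so this
        # badly overestimates downtime for instances that ran a long
        # time before being stopped.
        launched = calendar.timegm(
            time.strptime(instance.launch_time[:19], "%Y-%m-%dT%H:%M:%S"))
        return int((time.time() - launched) / 3600)
    return 0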
Updated•11 years ago
Assignee: nobody → mgervasini
Assignee
Comment 3•11 years ago
This patch gets all the StopInstances events from cloudtrail logs and stores them locally so they can be used by aws_sanity_checker.py reports.
There are two new scripts:
* aws_get_cloudtrail_logs.py: it downloads the cloudtrail logs into a cache directory: /builds/aws-cloudtrail-logs (current and previous month logs)
* aws_process_cloudtrail_logs.py: this script parses logs and stores the results in /builds/aws-cloudtrail-logs/events/<Event-Name>/<Instance-Id> in a json file like the following:
{"eventName": "StopInstances", "instances": "i-309b1f1e", "eventTime": "2014-02-25T17:11:07Z"}
Of course, aws_sanity_checker.py has been modified to get the "stop time" from the "events" directory.
Once this gets into production we can add some optimizations:
1. adding a clean-up script, since we could end up with thousands of old instance json files
2. using a database (SQLAlchemy + sqlite?) to store the data in a smarter way
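As a rough sketch of how the sanity checker can consume those cached events (paths and field names come from the example above; the helper name is hypothetical), falling back to the old launch_time estimate when it returns None:

import calendar
import json
import os
import time

EVENTS_DIR = "/builds/aws-cloudtrail-logs/events"

def get_stop_time(instance_id):
    """Return the epoch time of the newest cached StopInstances event
    for instance_id, or None if no event has been cached yet."""
    path = os.path.join(EVENTS_DIR, "StopInstances", instance_id)
    if not os.path.exists(path):
        return None
    with open(path) as f:
        event = json.load(f)
    # eventTime looks like "2014-02-25T17:11:07Z"
    return calendar.timegm(
        time.strptime(event["eventTime"], "%Y-%m-%dT%H:%M:%SZ"))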
Attachment #8430187 - Flags: review?(rail)
Assignee
Comment 4•11 years ago
Crontab and directory changes for the cloudtrail logs.
Attachment #8430191 - Flags: review?(rail)
Comment 5•11 years ago
Comment on attachment 8430191 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch
Review of attachment 8430191 [details] [diff] [review]:
-----------------------------------------------------------------
::: modules/dirs/manifests/builds/aws_cloudtrail_logs.pp
@@ +2,5 @@
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +class dirs::builds::aws_cloudtrail_logs {
> + include dirs::builds
> + include users::builder
Shouldn't it be users::buildduty?
@@ +7,5 @@
> + file {
> + "/builds/aws-cloudtrail-logs":
> + ensure => directory,
> + owner => "$users::builder::username",
> + group => "$users::builder::group",
the same here?
Comment 6•11 years ago
Comment on attachment 8430187 [details] [diff] [review]
[cloud-tools] Bug 983813 - get StopInstances timestamp from local cache.patch
Review of attachment 8430187 [details] [diff] [review]:
-----------------------------------------------------------------
::: scripts/aws_get_cloudtrail_logs.py
@@ +6,5 @@
> +from repoze.lru import lru_cache
> +from multiprocessing import Pool
> +
> +
> +CWD = os.getcwd()
CWD is unused, please remove
@@ +8,5 @@
> +
> +
> +CWD = os.getcwd()
> +CACHE_DIR = "/builds/aws-cloudtrail-logs"
> +BASE_PREFIX = "AWSLogs/314336048151/CloudTrail"
Can you pass CACHE_DIR and BASE_PREFIX via command line?
@@ +9,5 @@
> +
> +CWD = os.getcwd()
> +CACHE_DIR = "/builds/aws-cloudtrail-logs"
> +BASE_PREFIX = "AWSLogs/314336048151/CloudTrail"
> +REGIONS = ["us-west-2", "us-east-1"]
please reuse regions from cloudtools/__init__.py
@@ +24,5 @@
> + yield i
> +
> +
> +@lru_cache(10)
> +def get_connection():
Move this to cloudtools/__init__.py, rename as get_s3_connection()
@@ +26,5 @@
> +
> +@lru_cache(10)
> +def get_connection():
> + #return boto.s3.connection
> + return boto.s3.connection.S3Connection()
you can use shorter boto.connect_s3()
@@ +29,5 @@
> + #return boto.s3.connection
> + return boto.s3.connection.S3Connection()
> +
> +
> +def mkdir_p(dst_dir, exist_ok=True):
can you move this to cloudtools/fileutil.py or something like that?
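For reference, the usual idiom such a fileutil helper would contain (a sketch, not the patch's exact code):

import errno
import os

def mkdir_p(dst_dir):
    """Create dst_dir and any missing parents; succeed silently if it
    already exists, like `mkdir -p`."""
    try:
        os.makedirs(dst_dir)
    except OSError as e:
        if e.errno != errno.EEXIST or not os.path.isdir(dst_dir):
            raise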
@@ +43,5 @@
> + raise
> +
> +
> +def limit_logs(limit=LIMIT_MONTHS):
> + import datetime
can you move the import to the beginning of the file?
@@ +45,5 @@
> +
> +def limit_logs(limit=LIMIT_MONTHS):
> + import datetime
> + now = datetime.datetime.now()
> + start_date = datetime.datetime.now() + datetime.timedelta(-LIMIT_MONTHS * 30)
maybe datetime.datetime.now() - datetime.timedelta(LIMIT_MONTHS * 30)
@@ +50,5 @@
> +
> + days = []
> + days.append(start_date.strftime("%Y/%m"))
> + days.append(now.strftime("%Y/%m"))
> + return days
you return a list with start and now, but the function name tells me about limiting logs. Can you rename the function?
also this is a good candidate for cloudtools/cloudtrail.py
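Something along these lines would cover the rename, the subtraction fix above, and the move (the new name is hypothetical; 2 stands in for LIMIT_MONTHS):

import datetime

def get_log_month_prefixes(months_back=2):
    """Return the YYYY/MM key prefixes for the oldest and the current
    month we care about, so the S3 listing can be restricted."""
    now = datetime.datetime.now()
    start = now - datetime.timedelta(days=months_back * 30)
    return [start.strftime("%Y/%m"), now.strftime("%Y/%m")]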
@@ +65,5 @@
> +def write_to_disk(key):
> + dst = os.path.join(CACHE_DIR, key.name)
> + mkdir_p(os.path.dirname(dst))
> +
> + import signal
can you move the import to the beginning of the file?
@@ +69,5 @@
> + import signal
> + signal.signal(signal.SIGALRM, _timeout)
> + if not os.path.exists(dst):
> + logging.info('downloading: {0}'.format(key.name))
> + signal.alarm(GET_CONTENTS_TO_FILENAME_TIMEOUT)
can you add some comments why you use alarm here
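For example, a commented version of that pattern (a sketch; mkdir_p is the patch's helper, and the timeout value and handler are placeholders for the patch's GET_CONTENTS_TO_FILENAME_TIMEOUT and _timeout):

import logging
import os
import signal

GET_CONTENTS_TO_FILENAME_TIMEOUT = 60  # seconds, assumed value

def _timeout(signum, frame):
    raise IOError("S3 download timed out")

def write_to_disk(key, cache_dir):
    dst = os.path.join(cache_dir, key.name)
    mkdir_p(os.path.dirname(dst))
    # boto's get_contents_to_filename() has no timeout of its own and can
    # hang forever on a stalled connection; SIGALRM acts as a watchdog so
    # a single bad download cannot wedge the whole run.
    signal.signal(signal.SIGALRM, _timeout)
    if not os.path.exists(dst):
        logging.info('downloading: %s', key.name)
        signal.alarm(GET_CONTENTS_TO_FILENAME_TIMEOUT)
        try:
            key.get_contents_to_filename(dst)
        finally:
            signal.alarm(0)  # always cancel the pending alarm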
@@ +80,5 @@
> +
> +if __name__ == '__main__':
> + logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
> + conn = get_connection()
> + bucket = conn.get_bucket("mozilla-releng-aws-logs")
can you pass this via CLI?
::: scripts/aws_process_cloudtrail_logs.py
@@ +3,5 @@
> +import logging
> +import json
> +import os
> +from multiprocessing import Pool
> +from aws_get_cloudtrail_logs import CACHE_DIR as CACHE_DIR
can you pass this via CLI?
@@ +4,5 @@
> +import json
> +import os
> +from multiprocessing import Pool
> +from aws_get_cloudtrail_logs import CACHE_DIR as CACHE_DIR
> +from aws_get_cloudtrail_logs import mkdir_p as mkdir_p
"as mkdir_p" is redundant here
@@ +9,5 @@
> +
> +CLOUDTRAIL_DIR = os.path.join(CACHE_DIR, 'AWSLogs')
> +SPOT_DATA_DIR = os.path.join(CACHE_DIR, 'SpotInstanceDataFeed')
> +EVENTS_DIR = os.path.join(CACHE_DIR, 'events')
> +BAD_LOGS = os.path.join(EVENTS_DIR, 'bad_logs')
Can you move to them cloudtools/cloudtrail.py or pass via cli?
@@ +12,5 @@
> +EVENTS_DIR = os.path.join(CACHE_DIR, 'events')
> +BAD_LOGS = os.path.join(EVENTS_DIR, 'bad_logs')
> +
> +
> +def get_data_from_gz_file(filename):
move to cloudtools/fileutil.py
@@ +19,5 @@
> + with gzip.open(filename, 'rb') as f:
> + return f.read()
> + except IOError as error:
> + logging.info('filename: %s', filename)
> + logging.error(error)
you can also use the following to replace 2 lines with 1:
logging.error('filename: %s', filename, exc_info=True)
@@ +22,5 @@
> + logging.info('filename: %s', filename)
> + logging.error(error)
> +
> +
> +def get_data_from_json_file(filename):
move to cloudtools/fileutil.py
@@ +27,5 @@
> + """returns a json object from filename"""
> + try:
> + logging.debug(filename)
> + with open(filename, 'rb') as f:
> + return json.loads(f.read())
return json.load(f) maybe?
@@ +30,5 @@
> + with open(filename, 'rb') as f:
> + return json.loads(f.read())
> + except ValueError:
> + # discard log file if it's not a good json file
> + move_to_bad_logs(filename)
return None # to be explicit?
@@ +42,5 @@
> + mkdir_p(BAD_LOGS)
> + name = os.path.split(filename)[1]
> + dst_file = os.path.join(EVENTS_DIR, name)
> + logging.debug("%s => %s", filename, dst_file)
> + os.rename(filename, dst_file)
os.rename doesn't work if the destination is on another mount; use shutil.move
@@ +56,5 @@
> + return
> +
> + logging.debug('processing: %s', filename)
> + for record in data['Records']:
> + eventName = record['eventName']
can you make the following loop safer? e.g. if the keys you access are missing, the loop should not explode the script. instead move the file to bad logs or something else
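One defensive shape for that loop might be (get_data_from_json_file, process_start_stop_record, and move_to_bad_logs are the patch's own helpers):

def process_log(filename):
    data = get_data_from_json_file(filename)
    if data is None:
        return
    logging.debug('processing: %s', filename)
    # just process stop events; .get() avoids KeyError on odd records
    for record in data.get('Records', []):
        if record.get('eventName') == 'StopInstances':
            try:
                process_start_stop_record(record)
            except (KeyError, TypeError):
                # requestParameters/instancesSet/... missing or malformed
                logging.warning('bad record in %s', filename, exc_info=True)
                move_to_bad_logs(filename)
                break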
@@ +58,5 @@
> + logging.debug('processing: %s', filename)
> + for record in data['Records']:
> + eventName = record['eventName']
> + # just process stop events, skip StartInstances and TerminateInstances
> + if eventName in ('StopInstances',):
can you use the same comparison method for both ifs? == vs in
@@ +61,5 @@
> + # just process stop events, skip StartInstances and TerminateInstances
> + if eventName in ('StopInstances',):
> + process_start_stop_record(record)
> + elif eventName == 'RequestSpotInstances':
> + process_request_spot(record)
this is not used, can you just remove it?
@@ +65,5 @@
> + process_request_spot(record)
> +
> +
> +def process_request_spot(record):
> + pass
remove it, the followup patch will be easier to read
@@ +68,5 @@
> +def process_request_spot(record):
> + pass
> +
> +
> +def process_start_stop_record(record):
move to cloudtools/cloudtrail.py?
@@ +73,5 @@
> + """process a start/stop/terminate row"""
> + # this metod works with Start/Stop/Terminate events too
> + time_ = record['eventTime']
> + for item in record['requestParameters']['instancesSet']['items']:
> + instanceId = item['instanceId']
can you make this also safer?
@@ +80,5 @@
> + 'eventTime': time_}
> + write_to_json(data)
> +
> +
> +def get_time_from_file(filename):
move to cloudtools/cloudtrail.py?
@@ +83,5 @@
> +
> +def get_time_from_file(filename):
> + """returns the eventTime from filename"""
> + data = get_data_from_json_file(filename)
> + return data['eventTime']
it will fail if you have a bad JSON file
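One way to guard it, pairing with the explicit "return None" suggested above:

def get_time_from_file(filename):
    """Return the eventTime from filename, or None for unreadable JSON."""
    data = get_data_from_json_file(filename)
    if data is None:
        # the bad file was already moved aside by the JSON helper
        return None
    return data.get('eventTime')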
@@ +86,5 @@
> + data = get_data_from_json_file(filename)
> + return data['eventTime']
> +
> +
> +def write_to_json(data):
move to cloudtools/fileutil.py?
@@ +96,5 @@
> + filename = os.path.join(EVENTS_DIR, event, instance)
> + mkdir_p(os.path.dirname(filename))
> + if not os.path.exists(filename):
> + with open(filename, 'w') as f:
> + json.dump(data, f)
json.dump(f, data)
@@ +97,5 @@
> + mkdir_p(os.path.dirname(filename))
> + if not os.path.exists(filename):
> + with open(filename, 'w') as f:
> + json.dump(data, f)
> + if data['eventTime'] > get_time_from_file(filename):
use elif to avoid unnecessary get_time_from_file()
@@ +100,5 @@
> + json.dump(data, f)
> + if data['eventTime'] > get_time_from_file(filename):
> + # replace old event with current one
> + with open(filename, 'w') as f:
> + json.dump(data, f)
json.dump(f, data)
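Folding both comments together, write_to_json could look roughly like this sketch (note the stdlib signature is json.dump(obj, fp), object first; ISO 8601 eventTime strings compare chronologically as plain strings; EVENTS_DIR, mkdir_p, and get_time_from_file are the patch's names):

def write_to_json(data):
    """Cache data under events/<eventName>/<instanceId>, keeping only
    the newest event per instance."""
    filename = os.path.join(EVENTS_DIR, data['eventName'], data['instances'])
    mkdir_p(os.path.dirname(filename))
    if not os.path.exists(filename):
        with open(filename, 'w') as f:
            json.dump(data, f)
    elif data['eventTime'] > (get_time_from_file(filename) or ''):
        # replace the cached event with the newer one
        with open(filename, 'w') as f:
            json.dump(data, f)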
@@ +103,5 @@
> + with open(filename, 'w') as f:
> + json.dump(data, f)
> +
> +
> +def process_spot_data(filename):
remove it, the followup patch will be easier to read
Attachment #8430187 - Flags: review?(rail) → review-
Comment 7•11 years ago
Comment on attachment 8430191 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch
I think with the changes to the scripts you will need to prep another patch with more command line args.
Attachment #8430191 - Flags: review?(rail)
Assignee
Comment 8•11 years ago
Hi Rail,
* updated the code so bucket names, directories, ..., are passed as command line options.
* in aws_sanity_checker.py, if --events-dir is not specified, the script works as before.
* as discussed, the impaired section now reports only non-slave hosts.
A new puppet patch will follow soon.
Attachment #8430187 - Attachment is obsolete: true
Attachment #8439424 - Flags: review?(rail)
Assignee
Comment 9•11 years ago
This patch takes care of creating the new required directories and adds the new cloudtrail jobs to cron.
Attachment #8430191 - Attachment is obsolete: true
Attachment #8439434 - Flags: review?(rail)
Comment 10•11 years ago
Comment on attachment 8439424 [details] [diff] [review]
[cloud-tools] Bug 983813 - use cloudtrail logs in AWS sanity check
Review of attachment 8439424 [details] [diff] [review]:
-----------------------------------------------------------------
lgtm!
Attachment #8439424 - Flags: review?(rail) → review+
Comment 11•11 years ago
Comment on attachment 8439434 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch
Review of attachment 8439434 [details] [diff] [review]:
-----------------------------------------------------------------
Need just one fix, see my comments below.
::: modules/aws_manager/manifests/init.pp
@@ +6,4 @@
> include aws_manager::secrets
> include aws_manager::cron
> include nrpe::check::check_stop_idle
> + include dirs::builds::aws_cloudtrail_logs
see my comments for install.pp
::: modules/aws_manager/manifests/install.pp
@@ +63,5 @@
> + group => "${users::buildduty::group}";
> + "${aws_manager::settings::events_dir}":
> + ensure => directory,
> + owner => "${users::buildduty::username}",
> + group => "${users::buildduty::group}";
These 2 entries duplicate entries in dirs::builds::aws_cloudtrail_logs. You can completely delete dirs::builds::aws_cloudtrail_logs, no need to bloat that module, IMHO, esp if you use ${aws_manager::settings} variables.
::: modules/dirs/manifests/builds/aws_cloudtrail_logs.pp
@@ +6,5 @@
> + include users::builder
> + include aws_manager::settings
> + file {
> + "${aws_manager::settings::cloudtrail_logs_dir}":
> + ensure => rectory,
s/rectory/directory/
Attachment #8439434 - Flags: review?(rail) → review-
Assignee
Comment 12•11 years ago
Hi rail,
thanks for the review! This new patch removes modules/dirs/manifests/builds/aws_cloudtrail_logs.pp
Attachment #8439434 - Attachment is obsolete: true
Attachment #8440799 - Flags: review?(rail)
Assignee
Comment 13•11 years ago
Attachment #8440820 - Flags: review?(rail)
Comment 14•11 years ago
Comment on attachment 8440799 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch
Review of attachment 8440799 [details] [diff] [review]:
-----------------------------------------------------------------
::: modules/aws_manager/manifests/cron.pp
@@ +69,5 @@
> + minute => '35,15',
> + cwd => "${aws_manager::settings::cloud_tools_dst}/scripts",
> + virtualenv_dir => "${aws_manager::settings::root}",
> + user => "${users::buildduty::username}",
> + params => "--cache_dir ${aws_manager::settings::cloudtrail_logs_dir} --s3-base-prefix ${::config::cloudtrail_s3_base_prefix} --s3-bucket ${::config::cloudtrail_s3_bucket}";
Conditional r+ -- you need to add "include ::config" since you use it here.
Attachment #8440799 - Flags: review?(rail) → review+
Updated•11 years ago
Attachment #8440820 - Attachment is obsolete: true
Attachment #8440820 - Flags: review?(rail)
Comment 15•11 years ago
Comment on attachment 8440820 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch
Ooops, wrong patch.
Don't forget to "include ::config"
Attachment #8440820 - Attachment is obsolete: false
Attachment #8440820 - Flags: review+
Updated•11 years ago
Attachment #8440799 - Attachment is obsolete: true
Attachment #8440799 - Flags: review+
Assignee
Updated•11 years ago
Attachment #8439424 - Flags: checked-in+
Assignee
Updated•11 years ago
Attachment #8440820 - Flags: checked-in+
Comment 16•11 years ago
Comment on attachment 8440820 [details] [diff] [review]
[puppet] Bug 983813 - create directories and crontab jobs for cloudtrail logs parsing.patch
Review of attachment 8440820 [details] [diff] [review]:
-----------------------------------------------------------------
::: manifests/moco-config.pp
@@ +226,5 @@
> $selfserve_agent_carrot_queue = "buildapi-agent-rabbit2"
>
> $aws_manager_mail_to = "release+aws-manager@mozilla.com"
> + $cloudtrail_s3_bucket = "mozilla-releng-aws-logs"
> + $cloudtrail_s3_base_prefix = "AWSLogs/314336048151/CloudTrail"
These two also need to be added to config::base
Assignee
Comment 17•11 years ago
Hi dustin,
cloudtrail_s3_bucket and cloudtrail_s3_base_prefix are set to "" in modules/config/manifests/base.pp.
Do they need the real value instead of the empty string?
Comment 18•11 years ago
No, they don't -- I somehow didn't see the change to base.pp. Don't mind me.
Assignee
Comment 19•11 years ago
This script removes cloudtrail logs (based on directory name) and events older than 60 days
Attachment #8449065 - Flags: review?(rail)
Assignee
Comment 20•11 years ago
Crontab entry for the cloudtrail logs clean-up script.
Attachment #8449069 - Flags: review?(rail)
Comment 21•11 years ago
Comment on attachment 8449065 [details] [diff] [review]
[cloud-tools] Bug 983813 - clean up scripts for cloudtrail logs.patch
Review of attachment 8449065 [details] [diff] [review]:
-----------------------------------------------------------------
Conditional r+, see comments below
::: scripts/aws_clean_log_dir.py
@@ +18,5 @@
> + """removes cloudtrail directories"""
> + try:
> + for directory in os.listdir(root_dir):
> + full_path = os.path.join(root_dir, directory)
> + if full_path < reference_dir:
can you add a comment explaining why you compare two directories and what it means... It looks like a magical string-comparison trick.
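The trick works because CloudTrail keys are laid out as .../YYYY/MM/DD: with zero-padded date components, comparing full path strings is the same as comparing dates. A commented sketch:

import os
import shutil

def delete_obsolete_logs(root_dir, reference_dir):
    """Delete per-day log directories lexicographically below
    reference_dir; paths end in zero-padded YYYY/MM/DD components,
    so string order matches chronological order."""
    try:
        for directory in os.listdir(root_dir):
            full_path = os.path.join(root_dir, directory)
            if full_path < reference_dir:
                shutil.rmtree(full_path)
    except OSError:
        # root dir does not exist, nothing to delete here
        pass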
@@ +26,5 @@
> + # root dir does not exist, nothing to delete here
> + pass
> +
> +
> +def delete_obsolete_json_files(json_file, numdays):
Can you rename the function to delete_obsolete_json_file since it accepts a file.
@@ +71,5 @@
> + today = datetime.date.today()
> +
> + numdays = 60
> + base = datetime.datetime.today()
> + last_day_to_keep = (base - datetime.timedelta(days=numdays))
the parentheses around the right part are redundant, feel free to remove them
@@ +98,5 @@
> + root_dir = reference_dir
> + reference_dir = os.path.join(reference_dir, day)
> + delete_obsolete_logs(root_dir, reference_dir)
> +
> + print "deleting obsolete event files"
Can you replace all print statements with log.debug()?
Attachment #8449065 - Flags: review?(rail) → review+
Updated•11 years ago
Attachment #8449069 - Flags: review?(rail) → review+
Assignee
Updated•11 years ago
Attachment #8449065 - Flags: checked-in+
Assignee
Updated•11 years ago
Attachment #8449069 - Flags: checked-in+
Comment 22•11 years ago
merged to production
Assignee
Updated•11 years ago
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•8 years ago
Component: Tools → General