Closed Bug 1307095 Opened 8 years ago Closed 8 years ago

Logs for atmo v1 scheduled jobs are missing

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: rvitillo, Assigned: frank)

References

Details

User Story

It appears that some logs for jobs scheduled with atmo v1 are now entirely missing from S3, for example:

- "fennec summarize csv weekly"
- "mobile-android-addons-v1"
- "txp_install_counts"

Attachments

(1 file)

      No description provided.
Severity: normal → major
Flags: needinfo?(mdoglio)
Blocks: 1307096
Points: --- → 1
Priority: -- → P1
Assignee: nobody → fbertsch
Flags: needinfo?(mdoglio)
Again, cannot reproduce. Tried with TxP Event-based Install Counts New Bucket.ipynb, which had previously both failed and been missing its log files in S3. This time it ran successfully, and the log files are in S3.

Location: s3://telemetry-public-analysis-2/foo/logs

Successful Job: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-3PGK7BLKBENWU
Failed Job: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-36T7D4OV0EGBW
Aside from the number of machines in the clusters, everything else looks the same to me. I did notice that the elapsed time for the failed job is 2 hours and 7 minutes; maybe the job has a 2-hour timeout?

Sunah, could you please tell us what the timeout configuration for the "txp_install_counts" job is?
Flags: needinfo?(ssuh)
I confirm that the timeout for this job is set to 2h. Sunah, this job has been failing for a while; is it still being used?
So I just realized we can tell which jobs failed and which timed out by looking at the EMR step status: it's CANCELLED for a timeout and FAILED for an actual failure. I have a hunch that all of these timed out, but I'm checking into it.
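For reference, here's a quick way to check the step states from the command line. This is just a sketch, assuming the AWS CLI is configured with access to the account; the cluster IDs are the ones linked above:

# List each step's name and state for the two clusters linked above.
# Per the comment above, CANCELLED indicates a timeout and FAILED an actual job failure.
for cluster in j-3PGK7BLKBENWU j-36T7D4OV0EGBW; do
    aws emr list-steps --cluster-id "$cluster" --region us-west-2 \
        --query 'Steps[].{Name:Name,State:Status.State}' --output table
done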
I was keeping it around to verify the real-time count, but it's not really necessary (and I can just run it manually for verification)
Flags: needinfo?(ssuh)
Can confirm: these are all cancelled jobs that exceeded their time limits. The issue is that when a job times out, the cluster shuts down automatically, which does not give the step a chance to copy the logs over.
Tried fixing this by trapping SIGINT (and the other termination signals) and uploading the log files from the trap handler, but unfortunately it doesn't work. Example (at the beginning of batch.sh):

# LOG and S3_BASE are set earlier in batch.sh.
upload_log ()
{
    cd ..
    gzip "$LOG"
    aws s3 cp "${LOG}.gz" "$S3_BASE/logs/$(basename "$LOG").gz" --content-type "text/plain" --content-encoding gzip
}

trap upload_log SIGINT
trap upload_log SIGTERM
trap upload_log SIGKILL   # SIGKILL cannot actually be trapped, so this line has no effect
trap upload_log EXIT
I have a tentative fix that adds --job-name and --data-bucket parameters to telemetry.sh and copies the log files over before the cluster shuts down. Still testing to ensure that it works. Note that this will also require minor changes to telemetry-analysis-service.
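A rough sketch of what that could look like in telemetry.sh. Only the --job-name and --data-bucket parameters come from the comment above; the variable names, $LOG, the logs/ key layout, and the call site before shutdown are assumptions for illustration:

# Hypothetical sketch: parse the two new arguments.
JOB_NAME=""
DATA_BUCKET=""
while [ $# -gt 0 ]; do
    case "$1" in
        --job-name)    JOB_NAME="$2";    shift 2 ;;
        --data-bucket) DATA_BUCKET="$2"; shift 2 ;;
        *)             shift ;;
    esac
done

copy_logs ()
{
    # Push the job's log file to S3 before the cluster is torn down.
    gzip -f "$LOG"
    aws s3 cp "${LOG}.gz" "s3://${DATA_BUCKET}/${JOB_NAME}/logs/$(basename "$LOG").gz" \
        --content-type "text/plain" --content-encoding gzip
}

# Call this explicitly right before the scheduled shutdown, rather than
# relying on a signal trap that never gets a chance to run.
copy_logs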
We've decided that we won't fix this, and will just use the logs provided by bug 1307528.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Product: Cloud Services → Cloud Services Graveyard
