Closed Bug 1307095 Opened 8 years ago Closed 8 years ago

Logs for atmo v1 scheduled jobs are missing

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: rvitillo, Assigned: frank)

References

Details

User Story

It appears that some logs for jobs scheduled with atmo v1 are now entirely missing from S3, for example:

- "fennec summarize csv weekly"
- "mobile-android-addons-v1"
- "txp_install_counts"

Attachments

(1 file)

      No description provided.
Severity: normal → major
Flags: needinfo?(mdoglio)
Blocks: 1307096
Points: --- → 1
Priority: -- → P1
Assignee: nobody → fbertsch
Flags: needinfo?(mdoglio)
Again, cannot reproduce. Tried with TxP Event-based Install Counts New Bucket.ipynb, which had previously both failed and been missing its log files in S3. This time it ran successfully, and the log files are in S3.

Location: s3://telemetry-public-analysis-2/foo/logs

Successful Job: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-3PGK7BLKBENWU
Failed Job: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-36T7D4OV0EGBW
Aside from the number of machines in the clusters, everything else looks the same to me. I did notice that the elapsed time for the failed job is 2 hours and 7 minutes; maybe the job has a 2-hour timeout?

Sunah, could you please tell us what the timeout configuration for the "txp_install_counts" job is?
Flags: needinfo?(ssuh)
I confirm that the timeout for this job is set to 2h. Sunah, this job has been failing for a while; is it still being used?
So I just realized we can tell which jobs failed and which timed out by looking at the EMR step status: it's CANCELLED for a timeout and FAILED for an actual failure. I have a hunch that all of these timed out, but I'm checking into it.
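For reference, here's a quick way to check the step states from the command line. This is just a sketch, assuming the AWS CLI is configured with access to the account; the cluster IDs are the ones linked above:

# List each step's name and state for the two clusters linked above.
# Per the comment above, CANCELLED indicates a timeout and FAILED an actual job failure.
for cluster in j-3PGK7BLKBENWU j-36T7D4OV0EGBW; do
    aws emr list-steps --cluster-id "$cluster" --region us-west-2 \
        --query 'Steps[].{Name:Name,State:Status.State}' --output table
done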
I was keeping it around to verify the real-time count, but it's not really necessary (and I can just run it manually for verification)
Flags: needinfo?(ssuh)
Can confirm: these are all cancelled jobs that exceeded their time limits. The issue is that when a job times out, the cluster shuts down automatically, which does not give the step a chance to copy the logs over.
Tried fixing this by trapping SIGINT (and the other termination signals) and uploading the log files from the trap handler, but unfortunately it doesn't work. Example (at the beginning of batch.sh):

# LOG and S3_BASE are set earlier in batch.sh.
upload_log ()
{
    cd ..
    gzip "$LOG"
    aws s3 cp "${LOG}.gz" "$S3_BASE/logs/$(basename "$LOG").gz" --content-type "text/plain" --content-encoding gzip
}

trap upload_log SIGINT
trap upload_log SIGTERM
trap upload_log SIGKILL   # SIGKILL cannot actually be trapped, so this line has no effect
trap upload_log EXIT
I have a tentative fix that adds --job-name and --data-bucket parameters to telemetry.sh and copies the log files over before the cluster shuts down. Still testing to ensure that it works. Note that this will also require minor changes to telemetry-analysis-service.
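A rough sketch of what that could look like in telemetry.sh. Only the --job-name and --data-bucket parameters come from the comment above; the variable names, $LOG, the logs/ key layout, and the call site before shutdown are assumptions for illustration:

# Hypothetical sketch: parse the two new arguments.
JOB_NAME=""
DATA_BUCKET=""
while [ $# -gt 0 ]; do
    case "$1" in
        --job-name)    JOB_NAME="$2";    shift 2 ;;
        --data-bucket) DATA_BUCKET="$2"; shift 2 ;;
        *)             shift ;;
    esac
done

copy_logs ()
{
    # Push the job's log file to S3 before the cluster is torn down.
    gzip -f "$LOG"
    aws s3 cp "${LOG}.gz" "s3://${DATA_BUCKET}/${JOB_NAME}/logs/$(basename "$LOG").gz" \
        --content-type "text/plain" --content-encoding gzip
}

# Call this explicitly right before the scheduled shutdown, rather than
# relying on a signal trap that never gets a chance to run.
copy_logs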
We've decided that we won't fix this, and will just use the logs provided by bug 1307528.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Product: Cloud Services → Cloud Services Graveyard
