Logs for atmo v1 scheduled jobs are missing

Status

Product: Cloud Services
Component: Metrics: Pipeline
Priority: P1
Severity: major
Status: RESOLVED WONTFIX
Reported: 2 years ago
Last modified: 2 years ago

People

(Reporter: rvitillo, Assigned: frank)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

User Story

It appears that some logs for jobs scheduled with atmo v1 are now entirely missing from S3, for example:

- "fennec summarize csv weekly"
- "mobile-android-addons-v1"
- "txp_install_counts"

Attachments

(1 attachment)

(Reporter)

Updated

2 years ago
Severity: normal → major
(Reporter)

Updated

2 years ago
Flags: needinfo?(mdoglio)
(Reporter)

Updated

2 years ago
Blocks: 1307096
(Reporter)

Updated

2 years ago
Points: --- → 1
Priority: -- → P1
(Reporter)

Updated

2 years ago
Assignee: nobody → fbertsch
Flags: needinfo?(mdoglio)
(Assignee)

Comment 1

2 years ago
Created attachment 8797762 [details]
foo.20161004183138.log.gz

Again, I cannot reproduce this. I tried with TxP Event-based Install Counts New Bucket.ipynb, which had previously both failed and been missing its log files in S3. My run succeeded, and the log files are in S3.

Location: s3://telemetry-public-analysis-2/foo/logs

Successful Job: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-3PGK7BLKBENWU
Failed Job: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-36T7D4OV0EGBW
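
For anyone double-checking, a one-liner to list what actually landed under that prefix (this assumes AWS CLI credentials that can read the analysis bucket):

# List the log files written for this job under the prefix above.
aws s3 ls s3://telemetry-public-analysis-2/foo/logs/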
(Reporter)

Comment 2

2 years ago
Besides the number of machines in the clusters, the rest looks the same to me. I noticed, though, that the elapsed time for the failed job is 2 hours and 7 minutes; maybe the job has a timeout of 2 hours?
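
A quick way to double-check that elapsed time with the AWS CLI, as a sketch (the cluster id is the failed cluster linked in comment 1; this assumes credentials for the same account):

# Print when the failed cluster was created and when it ended; the
# difference is the elapsed time quoted above.
aws emr describe-cluster --cluster-id j-36T7D4OV0EGBW \
    --query 'Cluster.Status.Timeline' --output json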

Sunah, could you please tell us what the configuration of the timeout for the "txp_install_counts" job is?
Flags: needinfo?(ssuh)
(Reporter)

Comment 3

2 years ago
I confirm that the timeout for this job is set to 2h. Sunah, this job has been failing for a while; is it still being used?
(Assignee)

Comment 4

2 years ago
So I just realized we can tell which jobs failed and which timed out by looking at the step status: it's CANCELLED for a timeout and FAILED for a genuine failure. I have a hunch that all of these timed out, but I'm checking into it.
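
A sketch of one way to read those step states with the AWS CLI (the cluster id is the failed cluster from comment 1; the query and output formatting here are only illustrative):

# Show each step's name and final state; CANCELLED points to a
# timeout-triggered shutdown, FAILED to a genuine job failure.
aws emr list-steps --cluster-id j-36T7D4OV0EGBW \
    --query 'Steps[].{Name: Name, State: Status.State}' \
    --output table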

I was keeping it around to verify the real-time count, but it's not really necessary (and I can just run it manually for verification).
Flags: needinfo?(ssuh)
(Assignee)

Comment 6

2 years ago
I can confirm these are all cancelled jobs that exceeded their time limits. The issue is that when a job times out, the cluster automatically shuts down, which does not give the step a chance to move the logs over.
(Assignee)

Comment 7

2 years ago
I tried fixing this by trapping SIGINT (and related signals) and uploading the log files from the handler, but unfortunately it doesn't work. Example (at the beginning of batch.sh):


# Compress the job log and copy it to the job's S3 log prefix.
# $LOG and $S3_BASE are defined earlier in batch.sh.
upload_log ()
{
    cd ..
    gzip "$LOG"
    aws s3 cp "${LOG}.gz" "$S3_BASE/logs/$(basename "$LOG").gz" --content-type "text/plain" --content-encoding gzip
}

# Try to upload the log whenever the script is interrupted or exits.
# Note that SIGKILL cannot actually be trapped, and the EMR auto-shutdown
# appears to take the instance down before any of these handlers run.
trap upload_log SIGINT
trap upload_log SIGTERM
trap upload_log SIGKILL
trap upload_log EXIT
(Assignee)

Comment 8

2 years ago
I have a tentative fix that adds --job-name and --data-bucket parameters to telemetry.sh and copies the log files over before shutting down. I'm still testing to ensure that it works. Note that this will also require minor changes to telemetry-analysis-service.
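
Roughly the shape of the idea, as a sketch only (the argument handling, variable names, and upload destination shown here are assumptions; only the --job-name and --data-bucket parameters themselves are the actual proposed additions):

# Hypothetical handling of the new parameters in telemetry.sh.
while [[ $# -gt 0 ]]; do
    case "$1" in
        --job-name)    JOB_NAME="$2";    shift 2 ;;
        --data-bucket) DATA_BUCKET="$2"; shift 2 ;;
        *)             shift ;;
    esac
done

# Copy whatever log output exists before the cluster shuts down, so that
# timed-out runs still leave their logs in S3. $LOG is the job's log file,
# as in batch.sh.
copy_logs_before_shutdown ()
{
    gzip -f "$LOG"
    aws s3 cp "${LOG}.gz" "s3://${DATA_BUCKET}/${JOB_NAME}/logs/$(basename "$LOG").gz" \
        --content-type "text/plain" --content-encoding gzip
}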
(Assignee)

Comment 9

2 years ago
We've decided that we won't fix this, and will just use the logs provided by bug 1307528.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WONTFIX