Closed
Bug 1307095
Opened 8 years ago
Closed 8 years ago
Logs for atmo v1 scheduled jobs are missing
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: rvitillo, Assigned: frank)
References
Details
User Story
It appears that some logs for jobs scheduled with atmo v1 are now entirely missing from S3, for example:
- "fennec summarize csv weekly"
- "mobile-android-addons-v1"
- "txp_install_counts"
Attachments
(1 file)
328 bytes, application/x-gzip
No description provided.
Reporter
Updated•8 years ago
Severity: normal → major
Reporter
Updated•8 years ago
Flags: needinfo?(mdoglio)
Reporter
Updated•8 years ago
Points: --- → 1
Priority: -- → P1
Reporter
Updated•8 years ago
Assignee: nobody → fbertsch
Updated•8 years ago
Flags: needinfo?(mdoglio)
Assignee
Comment 1•8 years ago
Again, I cannot reproduce this. I tried with TxP Event-based Install Counts New Bucket.ipynb, which had previously both failed and been missing its log files in S3. My run completed successfully, and the log files are in S3.

Location: s3://telemetry-public-analysis-2/foo/logs
Successful job: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-3PGK7BLKBENWU
Failed job: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-36T7D4OV0EGBW
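For reference, one quick way to check whether the log files actually landed in S3 is to list that prefix with the AWS CLI (a sketch; the bucket and prefix here are the ones from this comment, adjust them for the job being checked):

    # List any log objects under the analysis logs prefix mentioned above.
    aws s3 ls s3://telemetry-public-analysis-2/foo/logs/ --recursive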
Reporter
Comment 2•8 years ago
Besides the number of machines in the clusters, the rest looks the same to me. I did notice, though, that the elapsed time for the failed job is 2 hours and 7 minutes; maybe the job has a timeout of 2 hours? Sunah, could you please tell us what the timeout configuration for the "txp_install_counts" job is?
Flags: needinfo?(ssuh)
Reporter
Comment 3•8 years ago
I can confirm that the timeout for this job is set to 2 hours. Sunah, this job has been failing for a while; is it still being used?
Assignee
Comment 4•8 years ago
So I just realized we can tell which jobs failed and which timed out by looking at the step status: it's "cancelled" for a timeout and "failed" for an actual failure. I have a hunch that all of these timed out, but I'm checking into it.
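For example, the step state can be read with the AWS CLI (a sketch; the cluster id here is the failed job linked in comment 1):

    # List step names and states; CANCELLED indicates a timeout, FAILED an actual failure.
    aws emr list-steps --cluster-id j-36T7D4OV0EGBW \
        --query 'Steps[].{Name:Name,State:Status.State}'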
I was keeping the job around to verify the real-time count, but it's not really necessary (and I can just run it manually for verification).
Flags: needinfo?(ssuh)
Assignee
Comment 6•8 years ago
Can confirm: these are all cancelled jobs that exceeded their time limits. The issue is that when a job times out, the cluster automatically shuts down, which does not give the step a chance to copy the logs over.
Assignee
Comment 7•8 years ago
Tried fixing this by catching SIGINT (and all the other varieties) and uploading the files from the signal handler, but unfortunately it doesn't work. Example (at the beginning of batch.sh):

    upload_log () {
        cd ..
        gzip "$LOG"
        # Upload the gzipped log to the job's log location in S3.
        aws s3 cp "${LOG}.gz" "$S3_BASE/logs/$(basename "$LOG").gz" --content-type "text/plain" --content-encoding gzip
    }

    trap upload_log SIGINT
    trap upload_log SIGTERM
    trap upload_log SIGKILL   # SIGKILL cannot actually be caught, so this trap never fires
    trap upload_log EXIT
Assignee
Comment 8•8 years ago
I have a tentative fix that adds --job-name and --data-bucket parameters to telemetry.sh and copies the log files over before shutting down. I'm still testing to make sure it works. Note that this will also require minor changes to telemetry-analysis-service.
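A minimal sketch of the idea (the variable names LOG, DATA_BUCKET, and JOB_NAME are illustrative only; the actual parameter handling in telemetry.sh and telemetry-analysis-service may differ):

    # Copy the job log to S3 explicitly before the cluster is shut down,
    # instead of relying on a signal trap.
    upload_logs_before_shutdown() {
        gzip -f "$LOG"
        aws s3 cp "${LOG}.gz" \
            "s3://${DATA_BUCKET}/${JOB_NAME}/logs/$(basename "$LOG").gz" \
            --content-type "text/plain" --content-encoding gzip
    }
    upload_logs_before_shutdown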
Assignee
Comment 9•8 years ago
We've decided that we won't fix this, and will just use the logs provided by bug 1307528.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard