I ran a Spark cluster on ATMO with an initial lifetime of 24 hours. About 12 hours in, I used the "extend lifetime" button a few times to push the cluster's termination time out by 3 days. However, at the end of the initial 24-hour lifetime the cluster was terminated. Steps leading to issue: 1) started a Spark cluster with 1 node, 24-hour lifetime 2) 12 hours in, used "extend lifetime" button three times (added 72 hours) 3) cluster terminated after only the initial 24-hour period
Transferred to https://github.com/mozilla/telemetry-analysis-service/issues/627 for tracking.
I looked into this. I manually repeated the same steps and also wrote a unit test, which passes, but I wasn't able to reproduce the problem. Have you experienced this again, or was it perhaps a fluke? Are there any more details you can provide to help reproduce this? When you extended the cluster, did you happen to notice whether the termination date/time on the page got updated?
Similar case(s) to report as bcase, only the most recent is fresh enough in my memory to report w/accuracy. I've had it happen with 3, 5, and 10 node clusters as well. Steps leading to issue: 1) started a spark cluster with 30 nodes, 24 hour lifetime 2) ~12 hours in, used "extend lifetime" button once (added 24 hours) 3) cluster terminated after only the initial 24 hour period
Note that Josh's cluster was terminated with "The master node was terminated by user" (screenshot: https://screenshots.firefox.com/LiWdmoYKUxPUaXvM/us-west-2.console.aws.amazon.com). This indicates to me that ATMO is shutting down these clusters.
Rob, I've experienced this issue more than once, always with 1-node clusters where I use the "extend lifetime" button about 12 hours in. The termination date/time on the page did update properly after using the button. Is there anything else I can help with for reproduction? I'm not sure if there are any other details that would be helpful.
So, according to CloudTrail, it's *not* the prod ATMO instance that's killing the clusters -- the kill request is coming from an old dev instance of ATMO. I'm going to kill that environment (I don't think it's being used anymore -- the code hasn't been updated in nearly a year) and hopefully that fixes the problem.
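For anyone retracing this kind of check, here is a minimal sketch of how already-exported CloudTrail records could be filtered to see which identity asked EMR to terminate a given cluster. It is not the query that was actually run for this bug. It assumes the records are plain dicts using the standard CloudTrail fields `eventSource`, `eventName`, `eventTime`, and `userIdentity.arn`, and it assumes the cluster ID appears under `requestParameters["jobFlowIds"]` for `TerminateJobFlows` calls. The function name, the role ARNs, and the cluster ID below are all made up for illustration.

```python
from collections import defaultdict


def terminate_callers(records, cluster_id):
    """Group EMR TerminateJobFlows events for one cluster by caller ARN.

    `records` is a list of CloudTrail event records as dicts. The
    `jobFlowIds` key in requestParameters is an assumption about the
    record layout, not something confirmed in this bug.
    """
    callers = defaultdict(list)
    for rec in records:
        if rec.get("eventSource") != "elasticmapreduce.amazonaws.com":
            continue
        if rec.get("eventName") != "TerminateJobFlows":
            continue
        params = rec.get("requestParameters") or {}
        if cluster_id not in params.get("jobFlowIds", []):
            continue
        arn = (rec.get("userIdentity") or {}).get("arn", "unknown")
        callers[arn].append(rec.get("eventTime"))
    return dict(callers)


# Hypothetical sample data: ARNs and cluster IDs are invented.
sample_records = [
    {
        "eventSource": "elasticmapreduce.amazonaws.com",
        "eventName": "TerminateJobFlows",
        "eventTime": "2017-01-01T12:00:00Z",
        "userIdentity": {"arn": "arn:aws:sts::000000000000:assumed-role/atmo-dev/i-0"},
        "requestParameters": {"jobFlowIds": ["j-EXAMPLE1"]},
    },
    {
        "eventSource": "elasticmapreduce.amazonaws.com",
        "eventName": "TerminateJobFlows",
        "eventTime": "2017-01-02T12:00:00Z",
        "userIdentity": {"arn": "arn:aws:sts::000000000000:assumed-role/atmo-prod/i-1"},
        "requestParameters": {"jobFlowIds": ["j-EXAMPLE2"]},
    },
    {
        "eventSource": "ec2.amazonaws.com",
        "eventName": "TerminateInstances",
        "eventTime": "2017-01-01T12:05:00Z",
        "userIdentity": {"arn": "arn:aws:iam::000000000000:user/someone"},
        "requestParameters": {},
    },
]

result = terminate_callers(sample_records, "j-EXAMPLE1")
```

If the caller ARN that comes back belongs to an environment other than prod, that points at a second deployment issuing the terminate request, which is the pattern described above.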
Assignee: nobody → ssuh
Closing bc this seems to be resolved, we can reopen if the issue comes back.
Status: NEW → RESOLVED
Last Resolved: 11 months ago
Resolution: --- → FIXED