Extending Spark Cluster Lifetime From ATMO Dashboard Not Working

RESOLVED FIXED

Status

Data Platform and Tools
Telemetry Analysis Service (ATMO)
11 months ago
11 months ago

People

(Reporter: Benton Case, Assigned: sunahsuh)

(Reporter)

Description

11 months ago
I ran a Spark cluster on ATMO with an initial lifetime of 24 hours. About 12 hours in, I used the "extend lifetime" button a few times to push the cluster's termination time out by 3 days. However, the cluster was terminated at the end of the initial 24-hour lifetime.

Steps leading to issue:
1) started a spark cluster with 1 node, 24 hour lifetime
2) 12 hours in, used "extend lifetime" button three times (added 72 hours)
3) cluster terminated after only the initial 24 hour period
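
For reference, the expected termination time from the steps above can be worked out as below. This is only a sketch of the arithmetic, assuming each click of "extend lifetime" adds 24 hours; the function name `expected_termination` and the start timestamp are made up for illustration and are not taken from ATMO's code.

```python
from datetime import datetime, timedelta


def expected_termination(started_at, initial_hours, extension_hours):
    """Return when the cluster should expire once all extensions are applied.

    extension_hours is a list with one entry per "extend lifetime" click.
    """
    total_hours = initial_hours + sum(extension_hours)
    return started_at + timedelta(hours=total_hours)


# Hypothetical start time; the bug report does not give one.
started = datetime(2018, 1, 1, 0, 0)

# 24-hour initial lifetime, extended three times by 24 hours each (72 hours).
expires = expected_termination(started, 24, [24, 24, 24])

print("expected lifetime:", expires - started)   # 96 hours in total
print("observed lifetime:", timedelta(hours=24))  # what actually happened
```

With these numbers the cluster should have lived for 96 hours, not 24.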

Comment 2

11 months ago
I looked into this. I manually tested the same steps and also wrote a unit test, which passes. I wasn't able to reproduce the problem.

I'm curious if you've experienced this again or if perhaps it was some fluke? Are there any more details you can provide to help reproduce this? When you extended the cluster did you happen to notice if the termination date/time on the page got updated?
Flags: needinfo?(bcase)

Comment 3

11 months ago
Similar case(s) to report as bcase; only the most recent is fresh enough in my memory to report with accuracy. I've had it happen with 3, 5, and 10 node clusters as well.

Steps leading to issue:
1) started a spark cluster with 30 nodes, 24 hour lifetime
2) ~12 hours in, used "extend lifetime" button once (added 24 hours)
3) cluster terminated after only the initial 24 hour period
Note that Josh's cluster was terminated with "The master node was terminated by user", see [0]. This indicates to me that ATMO is shutting down these clusters.

[0] https://screenshots.firefox.com/LiWdmoYKUxPUaXvM/us-west-2.console.aws.amazon.com
(Reporter)

Comment 5

11 months ago
Rob, 

I've experienced this issue more than once, always with 1-node clusters where I used the "extend lifetime" button about 12 hours in. The termination date/time on the page did update properly after using the button.

Is there anything else I can help with for reproduction? I'm not sure if there are any other details that would be helpful.
Flags: needinfo?(bcase)
(Assignee)

Comment 6

11 months ago
So, according to CloudTrail, it's *not* the prod ATMO instance that's killing the clusters -- the kill request is coming from an old dev instance of ATMO. I'm going to kill that environment (I don't think it's being used anymore -- the code hasn't been updated in nearly a year) and hopefully that fixes the problem.
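
A rough sketch of this kind of CloudTrail check, for anyone repeating it: group termination events by caller and source IP. It assumes the records have already been downloaded as the usual CloudTrail JSON (a "Records" list with `eventName`, `userIdentity` and `sourceIPAddress` fields), and it assumes the kill shows up as `TerminateJobFlows` or `TerminateInstances`. The sample records, ARNs and IP addresses are invented for illustration.

```python
from collections import Counter

# Assumption: the termination appears under one of these API event names.
TERMINATION_EVENTS = {"TerminateJobFlows", "TerminateInstances"}


def termination_callers(records):
    """Count termination events per (caller ARN, source IP address)."""
    callers = Counter()
    for record in records:
        if record.get("eventName") in TERMINATION_EVENTS:
            arn = record.get("userIdentity", {}).get("arn", "unknown")
            source_ip = record.get("sourceIPAddress", "unknown")
            callers[(arn, source_ip)] += 1
    return callers


# Invented sample in the shape of a CloudTrail log file's "Records" list.
sample_records = [
    {
        "eventName": "TerminateJobFlows",
        "userIdentity": {"arn": "arn:aws:iam::123456789012:role/atmo-dev"},
        "sourceIPAddress": "10.0.0.2",
    },
    {
        "eventName": "RunJobFlow",
        "userIdentity": {"arn": "arn:aws:iam::123456789012:role/atmo-prod"},
        "sourceIPAddress": "10.0.0.1",
    },
]

for (arn, source_ip), count in termination_callers(sample_records).items():
    print(arn, source_ip, count)
```

Any caller that is not the prod instance is the one to chase down.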
Assignee: nobody → ssuh
(Assignee)

Updated

11 months ago
Flags: needinfo?(mreid)
Closing because this seems to be resolved; we can reopen if the issue comes back.
Status: NEW → RESOLVED
Last Resolved: 11 months ago
Flags: needinfo?(mreid)
Resolution: --- → FIXED