If your company runs on AWS, AWS SageMaker is likely a central piece of the infrastructure you use daily. It’s fantastic how easy it is to start an instance and get plenty of CPU and GPU resources for your experimentation.

Stopping it is a different story, though. If you forget to stop the instance manually, it can cost you a lot of money.

This guide will teach you how to save money by automatically stopping SageMaker instances when they are inactive.

The solution:

  • Has a very simple setup (uses lifecycle scripts [1])
  • Is configurable (time to stop the machine)
  • Does not require any extra infrastructure (no Lambda or CloudWatch)
  • Leaves logs explaining why the instance was or was not shut down

Lifecycle configuration

We’ll create a new lifecycle configuration (or edit the one your instances already use). Whenever an instance with a lifecycle configuration starts, it runs a set of scripts as the root user.

As part of that lifecycle configuration, we’ll inject a script that checks whether your instance is active, and shuts it down if it’s not (by default after one hour of inactivity).

Let’s do that!

  • In the AWS console, go to SageMaker -> Lifecycle configurations
  • Create a new lifecycle configuration
  • Under the Scripts section, make sure the “Start notebook” tab is open
  • Paste this code:
#!/bin/bash
set -e

# PARAMETERS
IDLE_TIME=3600  # seconds of inactivity before the instance is stopped

# cron needs an absolute path, so remember where the script is downloaded
DIR="$(pwd)"

echo "Fetching the autostop script"
wget -O "$DIR/autostop.py" https://raw.githubusercontent.com/mariokostelac/sagemaker-setup/master/scripts/auto-stop-idle/autostop.py

echo "Starting the SageMaker autostop script in cron"
(crontab -l 2>/dev/null; echo "*/5 * * * * /bin/bash -c '/usr/bin/python3 $DIR/autostop.py --time ${IDLE_TIME} | tee -a /home/ec2-user/SageMaker/auto-stop-idle.log'") | crontab -

echo "Changing cloudwatch configuration"
curl https://raw.githubusercontent.com/mariokostelac/sagemaker-setup/master/scripts/publish-logs-to-cloudwatch/on-start.sh | sudo bash -s auto-stop-idle /home/ec2-user/SageMaker/auto-stop-idle.log

  • Save the configuration
  • If you’re creating a new instance, make sure you create it with that lifecycle configuration (shown in the picture below).
  • If you already have an instance, stop it, click Edit, open Additional configuration, and choose the lifecycle configuration you’ve created.

That’s it!

Every time an instance with this lifecycle configuration starts, it will install the autostop.py script. The script watches for Jupyter kernel activity and for connections established to the Jupyter server. If there are none, it starts counting down towards the shutdown.
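
To make the idea concrete, here is a minimal sketch of that kind of idle check. It is not the actual autostop.py code: the function name idle_decision, the fallback_activity parameter, and the reason strings are made up for illustration. The dictionary keys mirror the fields the Jupyter REST API reports for a kernel (execution_state, connections, last_activity), but treat that mapping as an assumption too.

```python
from datetime import datetime, timedelta, timezone


def idle_decision(kernels, now, idle_seconds, fallback_activity):
    """Return (should_stop, reason) for a list of kernel-status dicts.

    Hypothetical helper, not part of autostop.py. Each dict is assumed to
    carry 'execution_state' (str), 'connections' (int) and 'last_activity'
    (timezone-aware datetime). 'fallback_activity' is used when no kernels
    exist at all, e.g. the time the instance was started.
    """
    for kernel in kernels:
        if kernel["execution_state"] != "idle":
            return False, "a kernel is busy"
        if kernel["connections"] > 0:
            return False, "a client is connected to a kernel"

    # Nothing is running or connected: measure time since the latest activity.
    last_activity = max(
        (kernel["last_activity"] for kernel in kernels),
        default=fallback_activity,
    )
    idle_for = (now - last_activity).total_seconds()
    if idle_for >= idle_seconds:
        return True, f"idle for {idle_for:.0f}s, limit is {idle_seconds}s"
    return False, f"idle for only {idle_for:.0f}s, limit is {idle_seconds}s"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    kernels = [
        {
            "execution_state": "idle",
            "connections": 0,
            "last_activity": now - timedelta(minutes=90),
        }
    ]
    print(idle_decision(kernels, now, 3600, now - timedelta(hours=3)))
```

Returning the reason next to the decision is what makes it possible to log why the instance was or was not stopped on every run.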

Permissions

For the script to be able to shut down the instance, the IAM role of the instance running it needs permission to do so. Here’s how you can set that up:

  • Go to your instance and find the Permissions and encryption section
  • Click on the IAM role ARN
  • Click on Attach policies
  • Click the Create policy button
  • Paste the JSON (feel free to restrict the Resource part to your instance only)
    {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Sid": "VisualEditor0",
              "Effect": "Allow",
              "Action": [
                  "sagemaker:StopNotebookInstance",
                  "sagemaker:DescribeNotebookInstance"
              ],
              "Resource": "*"
          }
      ]
    }
    
  • Save the new policy with the name “sagemaker-autostop”
  • Go back to the screen that shows the IAM role ARN link and click it
  • Attach the sagemaker-autostop policy to the IAM role.
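
If you’d rather restrict the Resource part to a single instance, the policy can be generated instead of edited by hand. The sketch below is only an illustration: autostop_policy is a made-up helper, and the region, account ID, and instance name are placeholders. It assumes the usual notebook instance ARN layout (arn:aws:sagemaker:REGION:ACCOUNT:notebook-instance/NAME), so compare the result with the ARN shown on your instance page before using it.

```python
import json


def autostop_policy(region, account_id, instance_name):
    """Build the autostop policy restricted to one notebook instance.

    Hypothetical helper; the ARN layout is an assumption to verify
    against the ARN displayed in the SageMaker console.
    """
    arn = (
        f"arn:aws:sagemaker:{region}:{account_id}"
        f":notebook-instance/{instance_name}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowAutostop",
                "Effect": "Allow",
                "Action": [
                    "sagemaker:StopNotebookInstance",
                    "sagemaker:DescribeNotebookInstance",
                ],
                "Resource": arn,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values, replace them with your own.
    policy = autostop_policy("us-east-1", "123456789012", "my-notebook")
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the policy editor in place of the version with "Resource": "*".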

Configuring the script

By default, the script will turn your instance off after one hour of inactivity. If you want to change that, change the IDLE_TIME variable (in seconds; 3600 is one hour). It’s defined right after the # PARAMETERS line.

Logging

The script we’ve configured runs every five minutes and logs the result. Logs can be found in two places:

  • On the instance, in /home/ec2-user/SageMaker/auto-stop-idle.log
  • In CloudWatch
    • If you go to your instance and click on View Logs (under the Monitor section), you’ll see an auto-stop-idle log stream.

Is it safe to run curl <url> | bash?

In general, it’s not. It downloads arbitrary code and executes it straight away. It’s the same security problem as with all package registries: if my GitHub account gets hacked, somebody can replace the current code with something malicious.

That threat can be mitigated, though. You can make sure the script always downloads code you’ve reviewed by pointing the URL to a specific git commit instead of master.

Here is an example of such a script, pointing to the scripts published on the day of writing this blog post.

#!/bin/bash
set -e

# PARAMETERS
IDLE_TIME=3600  # seconds of inactivity before the instance is stopped
COMMIT_SHA="549a931b4cf219c49d93f9229876510c0407f374"

# cron needs an absolute path, so remember where the script is downloaded
DIR="$(pwd)"

echo "Fetching the autostop script"
wget -O "$DIR/autostop.py" https://raw.githubusercontent.com/mariokostelac/sagemaker-setup/${COMMIT_SHA}/scripts/auto-stop-idle/autostop.py

echo "Starting the SageMaker autostop script in cron"
(crontab -l 2>/dev/null; echo "*/5 * * * * /bin/bash -c '/usr/bin/python3 $DIR/autostop.py --time ${IDLE_TIME} | tee -a /home/ec2-user/SageMaker/auto-stop-idle.log'") | crontab -

echo "Changing cloudwatch configuration"
curl https://raw.githubusercontent.com/mariokostelac/sagemaker-setup/${COMMIT_SHA}/scripts/publish-logs-to-cloudwatch/on-start.sh | sudo bash -s auto-stop-idle /home/ec2-user/SageMaker/auto-stop-idle.log
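
If you generate these URLs from your own tooling, a small guard can catch an accidental branch name before it reaches the lifecycle script. This is only a sketch: pinned_url and RAW_BASE are made-up names, and the check merely confirms the reference looks like a full 40-character commit SHA. It does not prove the commit exists or that you have reviewed it.

```python
import re

# Repository used throughout this post.
RAW_BASE = "https://raw.githubusercontent.com/mariokostelac/sagemaker-setup"


def pinned_url(commit_sha, path):
    """Return a raw GitHub URL pinned to a full commit SHA.

    Hypothetical helper: rejects branch names such as 'master' and
    abbreviated SHAs, because those can point to different code later.
    """
    if not re.fullmatch(r"[0-9a-f]{40}", commit_sha):
        raise ValueError(f"not a full commit SHA: {commit_sha!r}")
    return f"{RAW_BASE}/{commit_sha}/{path.lstrip('/')}"


if __name__ == "__main__":
    sha = "549a931b4cf219c49d93f9229876510c0407f374"
    print(pinned_url(sha, "scripts/auto-stop-idle/autostop.py"))
```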

  1. The solution presented here is a modified version of Amazon’s auto-stop-idle script, changed to leave a trace of why it did or did not turn off the instance. I will work with AWS to try to get my changes merged.