Train a Deep Learning model with AWS Deep Learning Containers
Here I’m going to talk about how to train a TensorFlow machine learning model on an Amazon EC2 instance using AWS Deep Learning Containers. I assume you already have an AWS account; if you don’t, create one before continuing.
As you know, AWS Deep Learning Container images are hosted on Amazon Elastic Container Registry (ECR), a fully managed Docker container registry that makes it easy to store, manage, and deploy Docker container images.
Add IAM user permissions to access Amazon ECR
As the first step, we need to grant an existing IAM user permission to access ECR. Navigate to the AWS Management Console and select IAM.
Then select Users from the navigation pane.
Now add permissions either to a newly created IAM user or to an existing one. On the user’s Summary page, select Add permissions.
Here we are going to attach the ECS access policy. First, select Attach existing policies directly, then search for ECS_FullAccess in the search bar. After selecting it, click Review and then Add permissions.
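If you prefer the AWS CLI over the console, a roughly equivalent command is sketched below. It assumes your CLI is already configured with credentials that can manage IAM, <your_iam_username> is a placeholder, and the managed policy name is my assumption of the one selected in the console:
aws iam attach-user-policy --user-name <your_iam_username> --policy-arn arn:aws:iam::aws:policy/AmazonECS_FullAccess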
After adding that permission, we also need an inline policy, so click Add inline policy.
On the Create policy page, select the JSON tab and paste the following lines. This is the policy we are going to add here:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "ecr:*",
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}
Then click Review policy, name it ‘ECR’, and select Create policy. That completes all the IAM setup required for this tutorial.
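The same inline policy can also be attached from the CLI. A minimal sketch, assuming you saved the JSON above as ecr-policy.json and that <your_iam_username> is your user:
aws iam put-user-policy --user-name <your_iam_username> --policy-name ECR --policy-document file://ecr-policy.json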
Create an AWS Deep Learning Base AMI instance
To follow this step, navigate to the EC2 console and click the Launch Instance button. For this tutorial we need a Deep Learning Base AMI, so search for Deep Learning Base AMI. From the search results I’m going to choose the Deep Learning Base AMI (Ubuntu 18.04) Version 42.0. You can also select the Deep Learning Base AMI (Amazon Linux) instead.
Here I’m choosing a c5.large instance, but you can choose another instance type based on your requirements, including GPU-based p3 instances. Then launch your instance.
On the next screen, select Create a new key pair and give it a name. Then download your key pair.
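For reference, launching a comparable instance from the CLI would look roughly like the sketch below. The AMI ID is a placeholder, since Deep Learning Base AMI IDs differ by region and version, so look yours up first:
aws ec2 describe-images --owners amazon --filters "Name=name,Values=Deep Learning Base AMI (Ubuntu 18.04)*" --query "Images[*].[ImageId,Name]" --output table
aws ec2 run-instances --image-id <ami-id> --instance-type c5.large --key-name <your key pair name> --count 1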
Keep in mind that if you are not familiar with connecting to your instance using SSH on Windows 10, you should follow the AWS documentation for connecting to a Linux instance from Windows.
For this tutorial I’m working on Windows, but if you are using Linux or macOS you can use these commands to connect to your instance:
cd /Users/<your_username>/Downloads/
chmod 0400 <your .pem filename>
ssh -L localhost:8888:localhost:8888 -i <your .pem filename> ubuntu@<your instance DNS>
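On Windows 10, the built-in OpenSSH client in PowerShell accepts essentially the same command; the paths below are placeholders, and Windows may require you to restrict the .pem file’s permissions through its Security properties instead of chmod:
cd C:\Users\<your_username>\Downloads
ssh -L 8888:localhost:8888 -i <your .pem filename> ubuntu@<your instance DNS>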
Log in to Amazon ECR
To complete this step you need to configure your EC2 instance with your AWS credentials.
On your terminal, type aws configure. Then provide your Access Key ID and Secret Access Key.
If you don’t already have keys, go to the navigation bar at the top of the console, choose your username, and click My Security Credentials. This takes you to the Your Security Credentials page in the IAM console. Expand the Access keys (access key ID and secret access key) section, and you will find your access key ID and secret access key there.
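With the keys in hand, the aws configure exchange looks roughly like this (all values are placeholders; pick the region you plan to use, us-east-1 in this tutorial):
$ aws configure
AWS Access Key ID [None]: <your access key ID>
AWS Secret Access Key [None]: <your secret access key>
Default region name [None]: us-east-1
Default output format [None]: json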
Now we are going to log in to Amazon ECR. Use this command:
$(aws ecr get-login --region us-east-1 --no-include-email --registry-ids 763104351884)
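Note that newer versions of the AWS CLI (v2) removed aws ecr get-login; if the command above fails for that reason, the equivalent login looks like this:
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com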
Run TensorFlow training with Deep Learning Containers
Here we will use an AWS Deep Learning Containers image. I’m using the image for TensorFlow training on CPU instances with Python 3.6. Run the container image on your EC2 instance using the command below:
docker run -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.13-cpu-py36-ubuntu16.04
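Once the container starts you are dropped into a shell inside it. A quick sanity check, assuming the image ships TensorFlow 1.13 as its tag suggests, is to print the version:
python -c "import tensorflow as tf; print(tf.__version__)"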
Now pull an example model to train. As an example, I’m going to use the Keras repository, which includes example Python scripts for training models. Use the command below to clone the repository:
git clone https://github.com/awslabs/keras-apache-mxnet.git
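Once the clone finishes, you can list the bundled example scripts to confirm mnist_cnn.py is there:
ls keras-apache-mxnet/examples/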
Wait until the clone completes. Then start training the MNIST CNN model with the following command:
python keras-apache-mxnet/examples/mnist_cnn.py
Now you can see the training running in your terminal.
You have successfully completed training with your Deep Learning Container.
If you are done with all these steps, make sure to terminate any resources that are not actively being used, because resources left running can result in charges.
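When you are finished, you can exit the container with exit and terminate the instance from the EC2 console, or from the CLI with something like the following (the instance ID is a placeholder):
aws ec2 terminate-instances --instance-ids <your instance id>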