Machine learning studies computer algorithms that learn how to do things. We might, for instance, be interested in learning to complete a task, to make accurate predictions, or to behave intelligently. The learning is always based on some sort of observations or data, such as examples, direct experience, or instruction. So in general, machine learning is about learning to do better in the future based on what was experienced in the past.
Machine learning is used in many real-world applications for a variety of purposes. Many organizations are trying to integrate ML capabilities into the applications, platforms, or services they provide, and they face many constraints in doing so. If we want ML or AI to become a general-purpose capability, we have to work on democratizing AI by making it accessible, fast, and useful for enterprises and developers. This is what SageMaker helps us achieve. Let’s discuss some of the problems we face in democratizing AI.
- Infrastructure:
Managing infrastructure can be challenging for many teams, especially scaling and distributing workloads.
Hybrid: Brittle, opinionated infrastructure that is hard to productionize and breaks between cloud and on-prem.
- Skilled Talent:
Machine learning expertise is scarce, so the goal is to 10x the output of the ML experts you have.
Speed of innovation: Experimentation and reuse are limited by tools.
Internal COE: It is difficult to find and leverage existing solutions within your organization.
- Ease of setup:
Setting up and maintaining a stack optimized for deep learning is hard and time-consuming.
Samples: Finding relevant samples and components.
Data: Access to high-quality, ready-to-use data for ML.
How can machine learning in the cloud help?
- The cloud’s pay-per-use model makes it easy for enterprises to experiment with ML capabilities and scale up as projects go into production and demand increases.
- The cloud makes intelligent capabilities accessible without requiring advanced skills in artificial intelligence or data science.
- AWS, Microsoft Azure, and Google Cloud Platform offer many machine learning options that don’t require deep knowledge of AI, machine learning theory, or a team of data scientists.
Figure: AI and ML capabilities provided by AWS
What is SageMaker and what does it provide?
- Fully managed machine learning service.
- Quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment.
- Integrated Jupyter authoring notebook instance.
- Common machine learning algorithms that are optimized to run efficiently against extremely large data in a distributed environment.
- Bring-your-own-algorithms and frameworks
- Flexible distributed training options that adjust to your specific workflows.
Figure: Features SageMaker provides
Figure: Build, train, and deploy using Amazon SageMaker
Let’s dig into the features and functionality Amazon SageMaker provides in detail:
Amazon SageMaker: Open Source Containers
- Customize them
- Run them locally for development and testing
- Run them on SageMaker for training and prediction at scale
Amazon SageMaker: Bring Your Own Container
- Prepare the training code in Docker container
- Upload container image to Amazon Elastic Container Registry (ECR)
- Upload training dataset to Amazon S3/FSx/EFS
- Invoke Create Training Job API to execute a SageMaker training job
The SageMaker training job pulls the container image from Amazon ECR, reads the training data from the data source, configures the training job with hyperparameter inputs, trains a model, and saves the model to model_dir so that it can be deployed for inference later.
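The four steps above end with a call to the Create Training Job API. As a rough sketch of what that request might look like, the helper below assembles the keyword arguments that boto3's `create_training_job` accepts. The job name, image URI, role ARN, bucket paths, instance type, and hyperparameters are all hypothetical placeholders, not values from this article.

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_s3_uri, output_s3_uri, hyperparameters):
    """Assemble keyword arguments for the SageMaker CreateTrainingJob API."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # container image pushed to Amazon ECR
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "training",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # assumed instance type
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # The API expects hyperparameter values as strings.
        "HyperParameters": {k: str(v) for k, v in hyperparameters.items()},
    }


# Every value below is a made-up placeholder.
request = build_training_job_request(
    job_name="byoc-demo-job",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    train_s3_uri="s3://example-bucket/demo/train",
    output_s3_uri="s3://example-bucket/demo/output",
    hyperparameters={"epochs": 10, "learning_rate": 0.01},
)
```

With AWS credentials configured, the request could then be submitted with `boto3.client("sagemaker").create_training_job(**request)`.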
Figure: Training an ML model using Amazon SageMaker
Distributed Training At Scale on Amazon SageMaker
- Training on Amazon SageMaker can automatically distribute processing across a number of nodes, including P3 instances.
- You can choose from two data distribution types for training ML models:
1. Fully Replicated: This passes every file in the input to every machine.
2. Sharded S3 Key: This separates and distributes the files in the input across the training nodes.
- Overall, sharding can run faster, but it depends on the algorithm.
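To make the difference between the two distribution types concrete, here is a small pure-Python simulation. It is only an illustration: the file names are invented, and the round-robin assignment used for sharding is an assumption, since how SageMaker actually assigns keys to nodes is not described here.

```python
def fully_replicated(files, node_count):
    """Every node receives a copy of every input file."""
    return [list(files) for _ in range(node_count)]


def sharded_by_key(files, node_count):
    """Each node receives a disjoint subset of the input files.

    Round-robin over the sorted keys is an assumption made for illustration.
    """
    ordered = sorted(files)
    return [ordered[i::node_count] for i in range(node_count)]


# Hypothetical input files and a two-node training cluster.
files = ["part-0.csv", "part-1.csv", "part-2.csv", "part-3.csv"]
replicated = fully_replicated(files, node_count=2)
shards = sharded_by_key(files, node_count=2)

for node, (full, part) in enumerate(zip(replicated, shards)):
    print(f"node {node}: replicated={len(full)} files, sharded={len(part)} files")
```

With sharding, each node reads only part of the data, which is why it can run faster. An algorithm that needs to see the full dataset on every node still requires the fully replicated option.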
Amazon SageMaker: Local Mode Training
Local mode training speeds up experimentation and eases the path to production.
- Train with local notebooks
- Train on notebook instances
- Iterate faster on a small sample of the dataset locally, with no waiting for a new training cluster to be built each time.
- Emulate CPU (single and multi-instance) and GPU (single instance) in local mode.
- Go distributed with a single line of code.
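In the SageMaker Python SDK, switching between local mode and a managed cluster comes down to the instance type passed to the estimator: `"local"` and `"local_gpu"` select local mode. The helper below is a hypothetical sketch of that switch. The keyword names follow version 2 of the SDK (`instance_type`, `instance_count`), and the cluster instance type and count are assumptions.

```python
def estimator_instance_settings(run_locally, use_gpu=False):
    """Return instance settings for a SageMaker estimator (hypothetical helper)."""
    if run_locally:
        # "local" and "local_gpu" tell the SDK to run the container on this machine.
        return {
            "instance_type": "local_gpu" if use_gpu else "local",
            "instance_count": 1,
        }
    # Assumed cluster settings for a distributed run on P3 instances.
    return {"instance_type": "ml.p3.2xlarge", "instance_count": 2}


local_settings = estimator_instance_settings(run_locally=True)
cluster_settings = estimator_instance_settings(run_locally=False)
print(local_settings)
print(cluster_settings)
```

The returned dictionary could be unpacked into an estimator constructor, so moving from a local experiment to a distributed job means changing only the `run_locally` argument.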
Automatic Model Tuning on Amazon SageMaker
- Amazon SageMaker automatic model tuning predicts which hyperparameter values are likely to be most effective at improving model fit.
- Automatic model tuning can be used with Amazon SageMaker:
- Built-in algorithms,
- Pre-built deep learning frameworks, and
- Bring-your-own-algorithm containers
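As a rough idea of what a tuning job needs to know, the sketch below builds a configuration in the shape of the `HyperParameterTuningJobConfig` argument of boto3's `create_hyper_parameter_tuning_job`. The metric name, the hyperparameter names, and their ranges are hypothetical and depend entirely on the algorithm being tuned.

```python
def build_tuning_job_config(metric_name, max_jobs, max_parallel_jobs):
    """Build a tuning configuration: objective, budget, and search ranges."""
    if max_parallel_jobs > max_jobs:
        raise ValueError("max_parallel_jobs cannot exceed max_jobs")
    return {
        "Strategy": "Bayesian",
        "HyperParameterTuningJobObjective": {
            "Type": "Maximize",
            "MetricName": metric_name,
        },
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": max_jobs,
            "MaxParallelTrainingJobs": max_parallel_jobs,
        },
        # Range bounds are passed as strings. All names here are illustrative.
        "ParameterRanges": {
            "ContinuousParameterRanges": [
                {"Name": "learning_rate", "MinValue": "0.001", "MaxValue": "0.1"},
            ],
            "IntegerParameterRanges": [
                {"Name": "epochs", "MinValue": "5", "MaxValue": "50"},
            ],
            "CategoricalParameterRanges": [
                {"Name": "optimizer", "Values": ["sgd", "adam"]},
            ],
        },
    }


tuning_config = build_tuning_job_config(
    metric_name="validation:accuracy",  # hypothetical metric emitted by the training job
    max_jobs=20,
    max_parallel_jobs=2,
)
```

The tuning job then launches up to `max_jobs` training jobs, using the results of finished jobs to pick the next hyperparameter values to try.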
Amazon SageMaker: Accelerating ML Training
Amazon SageMaker offers different input modes that affect how quickly a training job starts and runs. It provides two input modes:
- File Mode
- Pipe Mode
Let’s see how these two differ:
- File Mode: S3 data source or file system data source:
- When using S3 as the data source, the training dataset is downloaded to EBS volumes before training starts.
- Use a file system data source (Amazon EFS or Amazon FSx for Lustre) for faster training startup and execution time.
- Different data formats are supported: CSV, protobuf, JSON, libsvm (check the algorithm docs!).
- Pipe Mode streams the dataset to the training instances:
- This allows you to process large datasets, and training starts faster.
- The dataset must be in RecordIO-encoded protobuf or CSV format.
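The input mode can be set per data channel in a training job request. The sketch below builds one channel entry and applies the format restriction for Pipe Mode described above. The helper name, the bucket path, and the mapping of formats to MIME content types are assumptions made for illustration.

```python
# Content types assumed to correspond to RecordIO-encoded protobuf and CSV.
PIPE_MODE_CONTENT_TYPES = {"application/x-recordio-protobuf", "text/csv"}


def build_channel(name, s3_uri, content_type, input_mode="File"):
    """Build one InputDataConfig channel, checking Pipe Mode's format limits."""
    if input_mode not in ("File", "Pipe"):
        raise ValueError(f"unsupported input mode: {input_mode}")
    if input_mode == "Pipe" and content_type not in PIPE_MODE_CONTENT_TYPES:
        raise ValueError(f"Pipe Mode does not accept content type {content_type}")
    return {
        "ChannelName": name,
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": s3_uri,
            "S3DataDistributionType": "FullyReplicated",
        }},
        "ContentType": content_type,
        "InputMode": input_mode,
    }


# Hypothetical bucket and prefix.
pipe_channel = build_channel(
    "training", "s3://example-bucket/demo/train", "text/csv", input_mode="Pipe"
)
```

Checking the format up front avoids waiting for a training cluster to start only to have the job fail on an unsupported dataset.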
Amazon SageMaker: Fully-Managed Spot Training
SageMaker’s fully managed spot training helps us reduce training costs at scale.
- Managed Spot Training on SageMaker can reduce training costs by up to 90% compared with on-demand instances.
- Managed Spot Training is available in all training configurations:
- All instance types supported by Amazon SageMaker.
- All models: built-in algorithms, built-in frameworks, and custom models.
- All configurations: single instance training, distributed training, and automatic model tuning.
- Setting it up is extremely simple:
- If you’re using the console, just switch the feature on.
- If you’re working with the Amazon SageMaker SDK, just set train_use_spot_instances to True in the Estimator constructor (the parameter is named use_spot_instances in version 2 and later of the SDK).
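At the API level, Managed Spot Training is a flag on the training job plus a maximum wait time. The sketch below shows those settings in the shape boto3's `create_training_job` expects, along with a way to estimate savings from the `TrainingTimeInSeconds` and `BillableTimeInSeconds` fields that `describe_training_job` returns. The helper names and the numbers in the example are hypothetical.

```python
def spot_training_settings(max_run_seconds, max_wait_seconds):
    """Return the training job fields that enable Managed Spot Training."""
    # The wait time covers both waiting for spot capacity and the run itself.
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait_seconds must be at least max_run_seconds")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_run_seconds,
            "MaxWaitTimeInSeconds": max_wait_seconds,
        },
    }


def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage of training time that was not billed."""
    return 100 * (training_seconds - billable_seconds) / training_seconds


settings = spot_training_settings(max_run_seconds=3600, max_wait_seconds=7200)
# Made-up example: the job trained for 1000 seconds but was billed for 300.
savings = spot_savings_percent(training_seconds=1000, billable_seconds=300)
print(f"Estimated savings: {savings:.0f}%")
```

Because spot capacity can be interrupted, long-running jobs usually also need checkpointing so that training can resume instead of starting over.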
Amazon SageMaker: Secure Machine Learning
- No retention of customer data
- SageMaker provides encryption in transit
- Encryption at rest everywhere
- Compute isolation: instances allocated for computation are never shared with others
- Network isolation: all compute instances run inside private service-managed VPCs
- Secure, fully managed infrastructure: Amazon SageMaker takes care of patching and keeping instances up to date
- Notebook security: Jupyter notebooks can be operated without internet access and bound to secure customer VPCs
How To Train a Model With Amazon SageMaker
To train a model in Amazon SageMaker, you will need the following:
- A dataset
- An algorithm: Here you can choose from the pre-optimized algorithms provided by Amazon SageMaker or use your own algorithm.
- An Amazon Simple Storage Service (Amazon S3) bucket to store the training data and the model artifacts.
- An Amazon SageMaker notebook instance to prepare and process data and to train and deploy a machine learning model.
- A Jupyter notebook to use with the notebook instance
- For model training, deployment, and validation, I will use the high-level Amazon SageMaker Python SDK.
Here are a few things you will need to configure before getting started:
- Create the S3 bucket
- Create an Amazon SageMaker Notebook instance by going here: https://console.aws.amazon.com/sagemaker/
- Choose Notebook instances, then choose Create notebook instance.
- On the Create notebook instance page, provide the notebook instance name and choose ml.t2.medium for the instance type (the least expensive instance). For IAM role, choose Create a new role, then choose Create role.
- Choose Create notebook instance.
In a few minutes, Amazon SageMaker launches an ML compute instance and attaches an ML storage volume to it. The notebook instance has a preconfigured Jupyter notebook server and a set of Anaconda libraries.
To train a model in Amazon SageMaker, you create a training job. The training job includes the following information:
- The URL of the Amazon Simple Storage Service (Amazon S3) bucket or the file system ID of the file system where you’ve stored the training data.
- The compute resources that you want Amazon SageMaker to use for model training. Compute resources are ML compute instances that are managed by Amazon SageMaker.
- The URL of the S3 bucket where you want to store the output of the job.
- The Amazon Elastic Container Registry path where the training code is stored.
- Provide the S3 bucket and prefix that you want to use for training and model artifacts. This should be in the same region as the notebook instance, training, and hosting.
- The IAM role arn used to give training and hosting access to your data
- Download the dataset.
- Amazon SageMaker’s implementations of algorithms take RecordIO-wrapped protobuf, whereas the data we have is a pickled NumPy array on disk.
- This data conversion will be handled by the Amazon SageMaker Python SDK, imported as sagemaker.
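The bucket and prefix mentioned above typically turn into two S3 locations: one for the training data and one for the model artifacts. Here is a minimal sketch of that bookkeeping. The bucket name, the prefix, and the `train` and `output` subfolder names are assumptions, not requirements of SageMaker.

```python
def training_s3_locations(bucket, prefix):
    """Derive S3 URIs for training data and model output from a bucket and prefix."""
    prefix = prefix.strip("/")  # tolerate leading or trailing slashes
    return {
        "train": f"s3://{bucket}/{prefix}/train",
        "output": f"s3://{bucket}/{prefix}/output",
    }


# Hypothetical bucket and prefix.
locations = training_s3_locations("example-bucket", "/sagemaker/demo/")
print(locations["train"])
print(locations["output"])
```

The `train` URI would be used as the training data location and the `output` URI as the place where the job stores its model artifacts.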
To train your model, you can use the Amazon SageMaker Python SDK, the AWS SDK for Python (Boto 3), or the AWS console.
I have created a demo with the SageMaker console; find it here:
Amazon SageMaker is a managed service from AWS that lets developers quickly and easily build and train machine learning models, and then deploy them directly into a production-ready hosted environment. Amazon SageMaker is a step in the right direction toward our objective of democratizing AI.
In the next article in this series, we will look at the latest additions, features, and services introduced in Amazon SageMaker and how they simplify building and deploying machine learning applications in production.