Trn1 instances powered by AWS Trainium chips deliver the highest performance on deep learning training of popular machine learning models on AWS, while offering up to 50% cost-to-train savings over comparable GPU-based instances
PyTorch, Helixon, and Money Forward among customers and partners using Trn1 instances
Amazon Web Services, Inc. (AWS), an Amazon.com, Inc. company (NASDAQ: AMZN), today announced the general availability of Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances powered by AWS-designed Trainium chips. Trn1 instances are purpose built for high-performance training of machine learning models in the cloud while offering up to 50% cost-to-train savings over comparable GPU-based instances. Trn1 instances provide the fastest time to train popular machine learning models on AWS, enabling customers to reduce training times, rapidly iterate on models to improve accuracy, and increase productivity for workloads like natural language processing, speech and image recognition, semantic search, recommendation engines, fraud detection, and forecasting. There are no minimum commitments or upfront fees to use Trn1 instances, and customers pay only for the amount of compute used. To get started with Trn1 instances, visit aws.amazon.com/ec2/instance-types/trn1.
More customers are building, training, and deploying machine learning models to power applications that have the potential to reinvent their businesses and customer experiences. These machine learning models are becoming increasingly complex and consume ever-growing amounts of training data to help improve accuracy. As a result, customers must scale their models across thousands of accelerators, which makes them more expensive to train. This directly impacts the ability of research and development teams to experiment and train different models, which limits how quickly customers are able to bring their innovations to market. AWS already provides the broadest and deepest choice of compute offerings featuring hardware accelerators for machine learning, including Inf1 instances with AWS-designed Inferentia chips, G5 instances, P4d instances, and DL1 instances. But even with the fastest accelerated instances available today, training more complex machine learning models can still be prohibitively expensive and time consuming.
New Trn1 instances powered by AWS Trainium chips offer the best price performance and the fastest machine learning model training on AWS, providing up to 50% lower cost to train deep learning models compared to the latest GPU-based P4d instances. AWS Neuron, the software development kit (SDK) for Trn1 instances, enables customers to get started with minimal code changes and is integrated into popular frameworks for machine learning like PyTorch and TensorFlow. Trn1 instances feature up to 16 AWS Trainium accelerators that are purpose built for training deep learning models. Trn1 instances are the first Amazon EC2 instances to offer up to 800 Gbps of networking bandwidth (lower latency and 2x faster than the latest EC2 GPU-based instances) using the second generation of AWS’s Elastic Fabric Adapter (EFA) network interface to improve scaling efficiency. Trn1 instances also use NeuronLink, a high-speed, intra-instance interconnect, for faster training. Customers deploy Trn1 instances in Amazon EC2 UltraClusters consisting of tens of thousands of Trainium accelerators to rapidly train even the most complex deep learning models with trillions of parameters. With EC2 UltraClusters, customers will be able to scale the training of machine learning models with up to 30,000 Trainium accelerators interconnected with EFA petabit-scale networking, which gives customers on-demand access to supercomputing-class performance to cut training time from months to days. Each Trn1 instance supports up to 8 TB of local NVMe SSD storage for fast access to large datasets. AWS Trainium supports a wide range of data types (FP32, TF32, BF16, FP16, and configurable FP8) and stochastic rounding, a way of rounding probabilistically that enables high performance and higher accuracy as compared to legacy rounding modes often used in deep learning training. AWS Trainium also supports dynamic tensor shapes and custom operators to deliver a flexible infrastructure designed to evolve with customers' training needs.
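To illustrate the idea behind stochastic rounding mentioned above, the following is a minimal sketch in plain Python. It is not how Trainium implements the feature in hardware; the coarse grid of step 1.0 is an assumed stand-in for a low-precision data type, and the function names (round_nearest, round_stochastic, accumulate) are hypothetical. It shows why probabilistic rounding can preserve many small updates that round-to-nearest would discard.

```python
import math
import random


def round_nearest(x: float, step: float) -> float:
    """Deterministically round x to the nearest multiple of step."""
    return round(x / step) * step


def round_stochastic(x: float, step: float, rng: random.Random) -> float:
    """Round x to one of its two neighbouring multiples of step.

    The upper neighbour is chosen with probability equal to x's fractional
    position between the two, so the result equals x on average.
    """
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower
    return (lower + (1 if rng.random() < frac else 0)) * step


def accumulate(update: float, n: int, step: float, rounder) -> float:
    """Add `update` n times, rounding the running total after every addition."""
    total = 0.0
    for _ in range(n):
        total = rounder(total + update, step)
    return total


# Illustrative values only: a coarse grid (step) and a much smaller update.
rng = random.Random(0)
step, update, n = 1.0, 0.1, 10_000

nearest_total = accumulate(update, n, step, round_nearest)
stochastic_total = accumulate(
    update, n, step, lambda x, s: round_stochastic(x, s, rng)
)

# Round-to-nearest discards every update, so the total never leaves 0.0.
# Stochastic rounding lands near the true sum (n * update = 1000) on average.
print("nearest:   ", nearest_total)
print("stochastic:", stochastic_total)
```

In this toy setting each update is smaller than half the grid spacing, so round-to-nearest loses it every time, while stochastic rounding keeps the running total unbiased. This is the same kind of effect that matters when small gradient updates are accumulated in a low-precision format.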
“Over the years we have seen machine learning go from a niche technology used by the largest enterprises to a core part of many of our customers' businesses, and we expect machine learning training will rapidly make up a large portion of their compute needs,” said David Brown, vice president of Amazon EC2 at AWS. “Building on the success of AWS Inferentia, our high-performance machine learning chip, AWS Trainium is our second-generation machine learning chip purpose built for high-performance training. Trn1 instances powered by AWS Trainium will help our customers reduce their training time from months to days, while being more cost efficient.”
Trn1 instances are built on the AWS Nitro System, a collection of AWS-designed hardware and software innovations that streamline the delivery of isolated multi-tenancy, private networking, and fast local storage. The AWS Nitro System offloads the CPU virtualization, storage, and networking functions to dedicated hardware and software, delivering performance that is nearly indistinguishable from bare metal. Trn1 instances will be available via additional AWS services including Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS Batch. Trn1 instances are available for purchase as On-Demand Instances, with Savings Plans, as Reserved Instances, or as Spot Instances. Trn1 instances are available today in US East (N. Virginia) and US West (Oregon), with availability in additional AWS Regions coming soon. For more information on Trn1 instances, visit aws.amazon.com/blogs/aws/amazon-ec2-trn1-instances-for-high-performance-model-training-are-now-available.
Amazon’s product search engine indexes billions of products, serves billions of customer queries daily, and is one of the most heavily used services in the world. “We are training large language models that are multi-modal, multilingual, multi-locale, pre-trained on multiple tasks, and span multiple entities (products, queries, brands, reviews, etc.) to improve the customer shopping experience,” said Trishul Chilimbi, senior principal scientist at Amazon Search. “Amazon EC2 Trn1 instances provide a more sustainable way to train large language models by delivering the best performance/watt compared to other accelerated machine learning solutions, and they offer us high performance at the lowest cost. We plan to explore the new configurable FP8 datatype and hardware accelerated stochastic rounding to further increase our training efficiency and development velocity.”
PyTorch is an open source machine learning framework that accelerates the path from research prototyping to production deployment. “At PyTorch, we want to accelerate taking machine learning from research prototyping to production ready for customers. We have collaborated extensively with AWS to provide native PyTorch support for new AWS Trainium-powered Trn1 instances. Developers building PyTorch models can start training on Trn1 instances with minimal code changes,” said Geeta Chauhan, Applied AI, engineering manager at PyTorch. “Additionally, we have worked with the OpenXLA community to enable PyTorch Distributed libraries for easy model migration from GPU-based instances to Trn1 instances. We are excited about the innovation that Trn1 instances bring to the PyTorch community, including more efficient data types, dynamic shapes, custom operators, hardware-optimized stochastic rounding, and eager debug mode. All these capabilities make Trn1 well suited for wide adoption by PyTorch developers, and we look forward to future joint contributions to PyTorch to further optimize training performance.”
Helixon builds next-generation artificial intelligence (AI) solutions for protein-based therapeutics, developing AI tools that empower scientists to decipher protein function and interaction, interrogate large-scale genomic datasets for target identification, and design therapeutics such as antibodies and cell therapies. “Today, we use training distribution libraries like Fully Sharded Data Parallel to parallelize model training over many GPU-based servers, but this still takes us weeks to train a single model,” said Jian Peng, CEO at Helixon. “We are excited to utilize Amazon EC2 Trn1 instances featuring the highest networking bandwidth available on AWS to improve the performance of our distributed training jobs and reduce our model training times, while also reducing our training costs.”
Money Forward, Inc. serves businesses and individuals with an open and fair financial platform. “We launched a large-scale AI chatbot service on the Amazon EC2 Inf1 instances and reduced our inference latency by 97% over comparable GPU-based instances while also reducing costs. As we keep fine-tuning tailored natural language processing models periodically, reducing model training times and costs is also important,” said Takuya Nakade, CTO at Money Forward. “Based on our experience from successful migration of inference workload on Inf1 instances and our initial work on AWS Trainium-based EC2 Trn1 instances, we expect Trn1 instances will provide additional value in improving end-to-end machine learning performance and cost.”
Magic is an integrated product and research company developing AI that feels like a colleague to make the world more productive. “Training large autoregressive transformer-based models is an essential component of our work. AWS Trainium-powered Trn1 instances are designed specifically for these workloads, offering near-infinite scalability, fast inter-node networking, and advanced support for 16-bit and 8-bit data types,” said Eric Steinberger, co-founder and CEO at Magic. “Trn1 instances will help us train large models faster, at a lower cost. We are particularly excited about the native support for BF16 stochastic rounding in Trainium, increasing performance while maintaining numerical accuracy indistinguishable from full precision.”
About Amazon Web Services
For over 15 years, Amazon Web Services has been the world’s most comprehensive and broadly adopted cloud offering. AWS has been continually expanding its services to support virtually any cloud workload, and it now has more than 200 fully featured services for compute, storage, databases, networking, analytics, machine learning and artificial intelligence (AI), Internet of Things (IoT), mobile, security, hybrid, virtual and augmented reality (VR and AR), media, and application development, deployment, and management from 87 Availability Zones within 27 geographic regions, with announced plans for 21 more Availability Zones and seven more AWS Regions in Australia, Canada, India, Israel, New Zealand, Spain, and Switzerland. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—trust AWS to power their infrastructure, become more agile, and lower costs. To learn more about AWS, visit aws.amazon.com.
About Amazon
Amazon is guided by four principles: customer obsession rather than competitor focus, passion for invention, commitment to operational excellence, and long-term thinking. Amazon strives to be Earth’s Most Customer-Centric Company, Earth’s Best Employer, and Earth’s Safest Place to Work. Customer reviews, 1-Click shopping, personalized recommendations, Prime, Fulfillment by Amazon, AWS, Kindle Direct Publishing, Kindle, Career Choice, Fire tablets, Fire TV, Amazon Echo, Alexa, Just Walk Out technology, Amazon Studios, and The Climate Pledge are some of the things pioneered by Amazon. For more information, visit amazon.com/about and follow @AmazonNews.