2022 AWS re:Invent Day 3: Machine Learning and AI Recap
Another day of re:Invent 2022 and another keynote is in the bag. This morning (3:30 am for those of us on the east coast of Australia) we had Swami Sivasubramanian (Vice President of Data and Machine Learning at AWS) deliver the second keynote of the week, surprisingly on “Data and Machine Learning” (I know, right, who could have seen that coming). We saw a LOT of new announcements in the SageMaker space along with a number around Spark, Redshift and Glue. In parallel to the keynote, re:Invent also has a bunch of other things going on, so to make sure we’re giving you complete coverage of the event, I’ll be covering all the announcements made overnight.
To start off, below is a list of all the announcements made overnight:
- Amazon SageMaker now supports geospatial ML (Preview)
- Amazon DocumentDB (with MongoDB compatibility) Elastic Clusters is now generally available
- Amazon Redshift now supports Multi-AZ (Preview) for RA3 clusters
- Amazon SageMaker Studio launches redesigned user interface
- Amazon Athena now supports Apache Spark
- Amazon GuardDuty RDS Protection (Preview)
- Announcing Trusted Language Extensions for PostgreSQL on Amazon Aurora and Amazon RDS
- Introducing Amazon SageMaker support for shadow testing
- Launch Amazon SageMaker Autopilot experiments from Amazon SageMaker Pipelines to easily automate MLOps workflows
- Deploy SageMaker Data Wrangler for real-time and batch inference and additional configurations to processing jobs
- Amazon Redshift data sharing now supports centralized access control with AWS Lake Formation (Preview)
- AWS Glue announces AWS Glue Data Quality (Preview)
- Amazon SageMaker Studio now supports the automatic conversion of notebook code to production-ready jobs
- Amazon Redshift now supports auto-copy from Amazon S3
- Amazon SageMaker Data Wrangler now provides built-in data preparation in notebooks
- Amazon SageMaker Data Wrangler now supports over 40 third-party applications as data sources
- Amazon SageMaker JumpStart now enables you to share ML artifacts within your organization
- Introducing AWS AI Service Cards – a new resource for responsible AI
- Introducing new ML governance tools for Amazon SageMaker
- Amazon AppFlow now supports over 50 Connectors
- AWS Machine Learning University announces educator enablement program for higher education
- Amazon SageMaker Studio now supports real-time collaboration
In this article, I’ll be covering the announcements made overnight in the world of data. Given how many Machine Learning announcements there were, I’ll be writing a separate article to dive deeper into those.
In the data space, we can see a number of quality-of-life improvements around some of AWS’s most popular and heavily used services.
Amazon DocumentDB Elastic Clusters
Taken directly from the official announcement: with Amazon DocumentDB Elastic Clusters, you can leverage the MongoDB sharding API to create scalable collections that can be petabytes in size. You can start with Amazon DocumentDB Elastic Clusters for your small applications and scale your clusters to handle millions of reads and writes per second, and PBs of storage capacity, as your applications grow. Scaling Amazon DocumentDB Elastic Clusters is as simple as changing the number of cluster shards in the console; the rest is handled by the Amazon DocumentDB service and can be as fast as minutes, compared to hours when done manually. You can also scale down at any time to save on cost.
In layman’s terms, this means you now have an easy way to scale your DocumentDB cluster based on your CPU, database storage and backup storage needs. This is a nice improvement and allows your environment to scale in line with your needs rather than locking you into particular instance sizes.
However, it appears to only be available in the US East (Ohio), US East (N. Virginia), US West (Oregon), Europe (Frankfurt) and Europe (Ireland) regions. No word yet on when/if it will come to other regions that currently have DocumentDB capability.
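To make the “changing the number of cluster shards” part concrete, here’s a minimal sketch of doing the same thing via the AWS SDK for Python (boto3) rather than the console. The `docdb-elastic` service name and the `update_cluster`/`shardCount` parameter names reflect my reading of the SDK docs for this brand-new service, so treat the exact call shape as an assumption; the cluster ARN is a placeholder.

```python
# Hedged sketch: scaling a DocumentDB Elastic Cluster by changing its shard
# count via boto3. The service/parameter names are assumptions from the SDK
# docs for this newly launched service.

def build_scale_params(cluster_arn: str, shard_count: int) -> dict:
    """Build the UpdateCluster request that changes the shard count."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    return {"clusterArn": cluster_arn, "shardCount": shard_count}

def scale_elastic_cluster(cluster_arn: str, shard_count: int):
    import boto3  # imported lazily so the sketch reads without the SDK installed
    client = boto3.client("docdb-elastic")
    # The service handles the re-sharding in the background; the call returns
    # while the scaling operation proceeds.
    return client.update_cluster(**build_scale_params(cluster_arn, shard_count))
```

The same call with a smaller `shardCount` is how you’d scale back down to save on cost.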
Redshift now supports Multi-AZ (Preview) for RA3 clusters
For those people who truly depend on always-available analytics services (and have the deep pockets to pay for them), AWS has announced Multi-AZ support for the RA3 line of Redshift nodes. This feature is actually pretty cool; at a high level, you now get the same type of experience as with a Multi-AZ RDS instance. You get a single endpoint with which to access your workload, and AWS performs all the work to replicate and spread your data across multiple Availability Zones. For workload owners, this means that in the event of an outage your workload isn’t aware of any issues and continues to operate as expected. It should therefore be a fairly simple feature to implement should you require higher reliability.
While the increased reliability is nice to have, it will obviously significantly increase the number of nodes you’ll need to run, so keep that in mind before turning the feature on. Also, keep in mind that this is currently a preview feature and is NOT for production use at this stage. If, however, you want to have a play and get started with it, it’s currently available in the US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Tokyo), Europe (Ireland) and Europe (Stockholm) regions.
Amazon Athena now supports Apache Spark
Not a lot to say about this one, to be honest. If you work with Spark, it’s a nice addition to your bag of tricks and doesn’t appear to come with any additional pricing. You are however limited to the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland) regions. AWS has said more regions will be brought online in the coming months.
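For the curious, Spark on Athena is driven through sessions and “calculation executions” against a Spark-enabled workgroup. Below is a rough boto3 sketch of submitting a PySpark snippet; the `start_session`/`start_calculation_execution` calls are the Athena APIs added for Spark support, but the workgroup name, DPU figure and the PySpark code itself are placeholders of mine, not from the announcement.

```python
# Hedged sketch: submitting PySpark code to a Spark-enabled Athena workgroup
# via boto3. Workgroup name, DPU count and the snippet are placeholders.

PYSPARK_SNIPPET = """
df = spark.read.parquet("s3://my-data-lake/events/")
df.groupBy("event_type").count().show()
""".strip()

def run_spark_calculation(workgroup: str, code: str) -> str:
    import boto3  # imported lazily so the sketch reads without the SDK installed
    athena = boto3.client("athena")
    session = athena.start_session(
        WorkGroupName=workgroup,  # must be an Apache Spark-enabled workgroup
        EngineConfiguration={"MaxConcurrentDpus": 20},
    )
    calc = athena.start_calculation_execution(
        SessionId=session["SessionId"],
        CodeBlock=code,
    )
    return calc["CalculationExecutionId"]
```

You can of course do all of this interactively from notebooks in the Athena console instead; the API route is there for automation.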
Amazon GuardDuty RDS Protection (Preview)
I’ll be playing with this one in the coming weeks, and if it’s as good as it sounds… it will most likely be making it onto my recommended implementation list for any customer storing customer data in an Aurora database.
Ultimately, it does pretty much what it says on the tin: it gives us Amazon GuardDuty capabilities across our Aurora databases and will alert on suspicious logins and activity. This could be extremely valuable, especially given the need to focus on security in the current climate.
However, Sydney misses out again with only US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland) regions getting access to the Preview for the time being. If you’re lucky enough to be running a workload in one of these regions that could benefit from it, then I’d suggest turning it on as there is no cost during the preview period so you’ve little to lose.
Announcing Trusted Language Extensions for PostgreSQL on Amazon Aurora and Amazon RDS
I have to admit, this one’s a little over my head, but taking a look at the official announcement from AWS (available here): “Trusted Language Extensions for PostgreSQL is a new open source development kit to help you build high-performance extensions that run safely on PostgreSQL. With Trusted Language Extensions, developers can install extensions written in a trusted language on Amazon Aurora PostgreSQL-Compatible Edition and Amazon Relational Database Service (RDS) for PostgreSQL.”
Amazon Redshift data sharing now supports centralised access control with AWS Lake Formation (Preview)
Another Redshift preview feature that’s not available in Sydney just yet (currently it’s available in Tokyo, Ohio, N. Virginia, Oregon, Ireland and Stockholm) is the ability to share data across Redshift data warehouses via AWS Lake Formation.
This one is a definite quality-of-life improvement for anybody leveraging RedShift and Lake Formation. By leveraging Lake Formation, you can scale permissions more easily with fine-grained security capabilities, including row- and cell-level permissions and tag-based access control.
You can find more information about the announcement here or take a look at the developer guide available here.
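To give a feel for what centralised control looks like, here’s a hedged boto3 sketch of granting a principal SELECT on a table through Lake Formation’s GrantPermissions API. The role ARN, database and table names are placeholders, and I’m assuming the standard table-grant shape applies to datashare-backed catalog tables the way it does to regular Glue tables; row- and cell-level filters would use the same API with a data filter resource instead.

```python
# Hedged sketch: a Lake Formation SELECT grant via boto3. All names are
# placeholders; applying this to a datashare-backed table is my assumption.

def build_grant(principal_arn: str, database: str, table: str) -> dict:
    """Build the GrantPermissions request for SELECT on a single table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }

def grant_select(principal_arn: str, database: str, table: str):
    import boto3  # imported lazily so the sketch reads without the SDK installed
    lf = boto3.client("lakeformation")
    return lf.grant_permissions(**build_grant(principal_arn, database, table))
```

The appeal is that this one grant model covers your lake and your warehouse shares, rather than managing Redshift GRANTs separately.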
AWS Glue announces AWS Glue Data Quality (Preview)
This is one of those announcements that needs a little hands-on time before I can decide whether or not it’s going to be a useful feature, but on the surface it sounds pretty interesting. Given that data quality and hygiene are so important to the analytical and forecasting tools we use in business, ensuring the quality of data entering our systems needs to be a key focus of any data engineering team.
Enter AWS Glue Data Quality, “a new capability that automatically measures and monitors data lake and data pipeline quality. AWS Glue is a serverless, scalable data integration service that makes it more efficient to discover, prepare, move, and integrate data from multiple sources”. As I said, it sounds really appealing, but I’ll need to get somebody from one of Matel Group’s data teams to put it through its paces.
In the meantime, it’s available in US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland) in Preview now (so remember, no production use cases just yet).
Amazon Redshift now supports auto-copy from Amazon S3
Last but not least, we have Redshift auto-copy from Amazon S3. This one’s pretty simple: it’s a way to have objects landing in an S3 bucket automatically loaded (via a COPY job) into an Amazon Redshift warehouse without the need to build custom solutions (I’m looking at you, event-driven Lambda functions).
However, like most things in today’s announcements, it’s only available in a handful of regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Tokyo), Europe (Ireland) and Europe (Stockholm).
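In practice, auto-copy is just a normal COPY statement with a trailing JOB CREATE clause, which tells Redshift to re-run the load whenever new files land under the S3 prefix. The sketch below builds that statement; the table, bucket, role and job names are placeholders, and the exact clause syntax is my reading of the preview docs, so it may shift before GA.

```python
# Hedged sketch: building a Redshift auto-copy statement. A COPY job is a
# COPY statement plus a JOB CREATE clause; AUTO ON makes Redshift re-run the
# load when new objects appear under the prefix. All names are placeholders.

def build_copy_job(table: str, s3_prefix: str, iam_role: str, job_name: str) -> str:
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS CSV "
        f"JOB CREATE {job_name} AUTO ON;"
    )

sql = build_copy_job(
    table="sales",
    s3_prefix="s3://my-ingest-bucket/sales/",
    iam_role="arn:aws:iam::123456789012:role/RedshiftCopyRole",
    job_name="sales_autocopy",
)
print(sql)
```

You’d run the resulting statement once from any Redshift client; from then on, the event-driven Lambda glue code this replaces can be retired.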
And that’s it on the data front. Nothing revolutionary that had me jumping out of my seat, but a number of things that make data a little easier to work with and a little more secure.
Keep an eye out in the coming weeks as we start to put some of these announcements through their paces and I’ll be back again shortly to dive deeper into the Machine Learning announcements from the Keynote.