Post by CMD Principal Consultant Michael Ransley.

AWS continues to raise the bar across a whole lot of technology segments, and with AWS Lake Formation they have created a one-stop shop for the creation of Data Lakes. As always, AWS is further abstracting their services to provide more and more customer value. The evolution of this process can be seen by looking at AWS Glue.

Note: This document was written in August 2019 – if you are reading this at some distant time in the future, functionality may (will?) have changed.

AWS Glue, a history

  1. Back in the day, when EC2 launched it was a massive game changer. The ability to provision instances in seconds through an API was a revolution. It was no surprise that many EC2 users ran Hadoop workloads, which could benefit from AWS scale to crunch large volumes of data and then terminate when the work was complete.
  2. AWS recognised this need and created EMR (Elastic MapReduce). No longer did we need to configure Hadoop clusters; now all we had to specify through the API was pretty much how many worker nodes we wanted, and AWS magic presented us with a cluster with all the node configuration done for us. Support was also added for layering additional software onto EMR at cluster creation time.
  3. Once again, AWS looked at the workloads and realised that many people were using EMR to run Apache Spark jobs. From this they created Glue, which is effectively a managed Spark implementation (with extensions). Interestingly, they moved away from a raw CPU pricing mechanism to a DPU (Data Processing Unit), which is 4 vCPUs and 16 GB of RAM – but your jobs are billed for a minimum of 2 DPUs and a minimum of 10 minutes (see pricing for more information).
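To make that billing model concrete, here is a small sketch of how a job's cost works out under the pricing quoted above. The function name and defaults are mine, not an AWS API:

```python
# Sketch of Glue's billing model as described above (August 2019 pricing):
# billed per DPU-hour, with a 2-DPU and 10-minute minimum.
# glue_job_cost is an illustrative helper, not an AWS API.
def glue_job_cost(dpus, runtime_minutes,
                  price_per_dpu_hour=0.44,
                  min_dpus=2, min_minutes=10):
    billed_dpus = max(dpus, min_dpus)
    billed_minutes = max(runtime_minutes, min_minutes)
    return billed_dpus * (billed_minutes / 60) * price_per_dpu_hour

# A 5-minute job on 2 DPUs is still billed for the full 10-minute minimum:
print(f"${glue_job_cost(2, 5):.4f}")  # $0.1467
```

Note that short jobs pay the same as a 10-minute job – the minimums dominate for small workloads.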

Back to Data Lakes

Obviously, a massive use case with AWS is Data Lakes and data processing platforms generally. A standard design for these data lakes was to use S3 for storage, EMR/Glue for data processing and the AWS Glue Data Catalog as a metadata store. AWS has rolled these services into a single unified data lake approach called AWS Lake Formation.

One thing to note about AWS Lake Formation is that while it is a product itself, it is more of an orchestration layer and interface across a whole lot of AWS tools, as shown in the diagram below:

AWS Lake Formation

You can see that Lake Formation contains the building blocks of the AWS data platform:

  • The Source crawlers are Glue Crawlers.
  • The ETL and Data Prep are probably provided by Glue Jobs.
  • The Data Catalog is probably the Glue Data Catalog.
  • The Security Settings and Access Control are probably provided by a combination of the Glue Data Catalog and AWS IAM.

The other components map to the existing AWS services named above. So what Lake Formation provides is an orchestration and management layer across these services. Once again, AWS is raising the bar in their platform and making it even more usable for customers.

Setup

AWS Lake Formation has a 3 step setup:

  1. Register your AWS storage.
  2. Create a database.
  3. Grant permissions.

AWS Lake Formation Setup

Register your AWS storage

Interestingly, this requires you to have already created an S3 bucket; it doesn’t create one for you. For the purposes of this investigation I have created a bucket called cmd-lake-formation-demo. Click on “Register Location” and enter the path of your bucket – s3://cmd-lake-formation-demo:

AWS Lake Formation Location

Clicking “Register Location” creates the location, but clicking on the dashboard again unfortunately doesn’t show me that a location has been set up. Anyway, I know that I have set up a location, so I will roll on to step 2 of the setup.
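Under the hood, the “Register Location” button drives Lake Formation’s RegisterResource API. A minimal sketch of the request you could pass to boto3’s lakeformation client – the helper function is mine, and the bucket is the demo bucket above:

```python
BUCKET = "cmd-lake-formation-demo"  # the demo bucket created above


def register_location_params(bucket, use_service_linked_role=True):
    """Build the request for Lake Formation's RegisterResource API,
    roughly what the 'Register Location' button submits. You would pass
    this as boto3.client('lakeformation').register_resource(**params)."""
    return {
        "ResourceArn": f"arn:aws:s3:::{bucket}",
        "UseServiceLinkedRole": use_service_linked_role,
    }


params = register_location_params(BUCKET)
print(params["ResourceArn"])  # arn:aws:s3:::cmd-lake-formation-demo
```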

Create a database

Once the storage is registered, you need to create a database to store the metadata. This is done by clicking on the “Create Database” button under stage 2. I have placed the database into the bucket that I created above and have gone with some sensible values, as shown below:

AWS Lake Formation Database
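The database itself lives in the Glue Data Catalog, so this step corresponds to Glue’s CreateDatabase API. A sketch of the request – the helper and the database name demo_db are hypothetical placeholders, not my exact console inputs:

```python
BUCKET = "cmd-lake-formation-demo"  # the demo bucket created above


def create_database_params(name, bucket):
    """Build the request for Glue's CreateDatabase API, which is what
    Lake Formation's 'Create database' step drives. You would pass this
    as boto3.client('glue').create_database(**params)."""
    return {
        "DatabaseInput": {
            "Name": name,
            "LocationUri": f"s3://{bucket}/{name}/",
        }
    }


# 'demo_db' is an illustrative name; use whatever you entered in the console.
params = create_database_params("demo_db", BUCKET)
print(params["DatabaseInput"]["LocationUri"])  # s3://cmd-lake-formation-demo/demo_db/
```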

Grant permissions

The next step is to grant permissions; once again I have chosen the database that we created earlier and gone with sensible defaults:

AWS Lake Formation Permissions

Interestingly, the screenshot above gives a little nugget of information – “Active Directory Users and Groups (EMR Beta Only)”. I am guessing that there is an EMR beta that hooks the software up to Active Directory for those applications that have a user interface – Spark and so on, but most importantly Zeppelin and Jupyter! Note: another post will follow on this when it becomes Generally Available.

Ignoring the above, I clicked “Grant” and it came back to the permissions section. The dashboard unfortunately still shows the setup steps, but I suppose there will be a need to add additional locations, permissions and databases as the system is used.
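For reference, the “Grant” button drives Lake Formation’s GrantPermissions API. A sketch of the request – the helper, the role ARN and the database name are all hypothetical placeholders, and the default permission set is just one reasonable choice:

```python
def grant_permissions_params(principal_arn, database,
                             permissions=("CREATE_TABLE", "ALTER", "DROP")):
    """Build the request for Lake Formation's GrantPermissions API,
    roughly what the 'Grant' button submits. You would pass this as
    boto3.client('lakeformation').grant_permissions(**params)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Database": {"Name": database}},
        "Permissions": list(permissions),
    }


# Both the role ARN and the database name below are illustrative placeholders.
params = grant_permissions_params(
    "arn:aws:iam::123456789012:role/data-engineer", "demo_db")
print(params["Permissions"])  # ['CREATE_TABLE', 'ALTER', 'DROP']
```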

Ingesting data

Lake Formation appears to have three methods for loading data into the lake.

  1. AWS Glue Crawlers.
  2. AWS Glue Jobs.
  3. Blueprints.
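The first of these boils down to Glue’s CreateCrawler API pointed at the registered bucket: the crawler scans an S3 path and writes table definitions into the catalog database. A sketch of the request, with all names and the IAM role as hypothetical placeholders:

```python
def create_crawler_params(name, role_arn, database, s3_path):
    """Build the request for Glue's CreateCrawler API; the crawler scans
    s3_path and writes table definitions into the given catalog database.
    You would pass this as boto3.client('glue').create_crawler(**params)."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


# All names below are illustrative placeholders.
params = create_crawler_params(
    "demo-crawler",
    "arn:aws:iam::123456789012:role/glue-crawler-role",
    "demo_db",
    "s3://cmd-lake-formation-demo/raw/",
)
print(params["Targets"]["S3Targets"][0]["Path"])  # s3://cmd-lake-formation-demo/raw/
```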

Obviously the crawlers and jobs are existing technology that has been around for a little while, but the blueprints are interesting…

AWS Lake Formation Blueprints

In my experience, the common ones are going to be the database snapshot and incremental database loads. Looking at the dialog, it is clear that this uses AWS Glue for the data movement, and hence you will be paying Glue pricing for the database synchronisations (i.e. 2 DPUs × USD $0.44/hr with a 10-minute minimum). This means that frequent syncing of your data may get pricey, especially if you are syncing lots of individual, small data sources.
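To put a floor under that claim: even if every sync finishes instantly, each run is billed at the 2-DPU / 10-minute minimum, so the costs accumulate. A quick sketch using the pricing above (the function is mine):

```python
PRICE_PER_DPU_HOUR = 0.44  # USD, Glue pricing as of August 2019
MIN_DPUS = 2
MIN_MINUTES = 10


def monthly_sync_floor(syncs_per_day, days=30):
    """Cheapest possible monthly cost for syncing one source, assuming
    every run only ever hits the 2-DPU / 10-minute billing minimum."""
    per_sync = MIN_DPUS * (MIN_MINUTES / 60) * PRICE_PER_DPU_HOUR
    return syncs_per_day * days * per_sync


# Hourly syncs of a single small source cost at least ~$105.60 per month:
print(f"${monthly_sync_floor(24):.2f}")
```

Multiply that by tens of small sources and the orchestration convenience starts to carry a real price tag.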

I shall look into some of these blueprints in a future post.