by Samer Shami
Senior Cloud Consultant at CMD Solutions

The new AWS Melbourne Region will open up possibilities for Active-Active style workloads with the Sydney Region. This blog post dives into the AWS services that can be used to make this possible. We’ll also discuss architectural designs focused on Kubernetes and Lambda Active-Active workloads.

As the AWS Melbourne Region launch date fast approaches, we thought it’d be a good time to reflect on the types of workloads that will benefit from having two Regions in close proximity. Australian AWS customers currently have only one Region – Sydney – to service their local Australian client base. All other Regions have traditionally been international, which introduces additional hurdles for multi-region workloads. The two AWS Regions in Australia should achieve sub-100ms round-trip latencies, which makes certain multi-region workloads far more practical. We also suspect that Spot instance interruption rates will drop in the short term, since a nearby Region frees up capacity.

In order to take advantage of the decreased latency, you could set up workloads in Active-Active (A-A) configuration across the two Regions.


The advantages to this setup are:

  • Avoid the reputational damage that comes with your services going offline. Consumers are now accustomed to having digital services online 24/7.
  • Easily handle a single region failure. If a whole region goes offline, users are all routed to the working region.
  • Easily handle a single service failure within a region. Should AWS Service _xyz_ have issues in one region, simply route users over to the working region.
  • Spread workloads across multiple Regions. What if the specific type of GPU-accelerated instance you want for ML is not available in a Region? Or the interruption rate on Spot instances is too high? You can now shift workloads to suit underlying AWS capacity.
  • DR. Those dreaded words. Are you 100% certain that your DR process works? So much time is spent implementing, documenting, discussing and testing DR in enterprises – and yet when it is actually required, chances are high that it won’t work as expected. Instead of implementing DR, it would be beneficial to focus your effort on A-A or Active-Passive (A-P) configurations. These get continual use, so DR “drills” become a thing of the past.


To that end, we’ll next explore which AWS services make excellent building blocks for multi-region workloads, particularly in A-A configurations. At CMD Solutions, we understand that not all workloads can handle A-A, so we’ll also discuss the potential for A-P workloads.

It is true that running A-A or A-P setups will increase your AWS bill. However, the unstated fact is that your enterprise will have to increase its IT maturity and velocity in order to run A-A setups in the first place. It takes considerable engineering effort to get to that point. Perhaps you’re still dealing with workloads that weren’t designed with the 12 Factor App (https://12factor.net/) in mind, so they won’t easily scale up. You also may not be at a stage in your enterprise’s growth where A-A or A-P setups make sense. Regardless, the lessons and design patterns in this blog post will lay the foundation for future growth and ensure you don’t start that trajectory with bad technical debt.

Our focus will be on stateless containerised and serverless workloads. These workloads have likely been written to be cloud native from the start, which makes achieving the coveted multi-region A-A setup much easier. We’ll go through potential solutions that make a multi-region A-A setup operationally easy to manage, and compare them to what you may currently have implemented. If a lot of your workload is still EC2-based, consider the design patterns below scaled down to a single Region across multiple AZs – they work equally well in both scenarios.

Commonly used AWS services to build your workloads on top of

Let’s break down the commonly used AWS services to build your workloads on top of. They’ll be grouped into tiers **purely** based on their ability to support multi-region workloads out of the box. This is not a reflection on the service’s quality. The more multi-region support a service has out of the box, the higher it ranks.

Tier A
Multi-region support out of the box. Active-Active setup supported.

Tier B
Multi-region support out of the box. Active-Passive setup supported.

Tier C
No multi-region support out of the box. You have to build it yourself. Depending on your solution, a service here can move into Tier A or B.

Tier A

AWS Service | AWS Aurora

AWS Aurora does support multi-master mode. There are specific requirements, conditions and data manipulation limitations around enabling it. You’ll want each Region to ideally work on independent data (i.e. different customer records or different shards) so that conflicting data is not written at both master instances. If you cannot meet these conditions, it is still possible to serve all reads from the local Read Replica and reach out to the master in the other Region for all writes. The lower round-trip time between Sydney and Melbourne makes this practical.
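To make the read-locally, write-remotely pattern concrete, here is a minimal sketch using psycopg2. The endpoint hostnames, credentials, table and Region codes are placeholders for illustration, and real code would add connection pooling and retries.

```python
import psycopg2

# Hypothetical endpoints: a local Aurora read replica in Melbourne and the
# single writer endpoint in Sydney. Replace with your own cluster endpoints.
LOCAL_READER = "mycluster.cluster-ro-xxxx.ap-southeast-4.rds.amazonaws.com"
REMOTE_WRITER = "mycluster.cluster-xxxx.ap-southeast-2.rds.amazonaws.com"

reader = psycopg2.connect(host=LOCAL_READER, dbname="app", user="app", password="secret")
writer = psycopg2.connect(host=REMOTE_WRITER, dbname="app", user="app", password="secret")

def get_customer(customer_id):
    # Reads stay in-region against the local read replica.
    with reader.cursor() as cur:
        cur.execute("SELECT name, email FROM customers WHERE id = %s", (customer_id,))
        return cur.fetchone()

def update_customer_email(customer_id, email):
    # Writes cross the Sydney-Melbourne link to the single master.
    with writer.cursor() as cur:
        cur.execute("UPDATE customers SET email = %s WHERE id = %s", (email, customer_id))
    writer.commit()
```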

AWS Service | AWS Global Accelerator

This is the ideal method for multi-region traffic control if the backend is an ALB, NLB or EC2 instance. The service gives you two fixed anycast IPs, and you control the routing over the AWS backbone from there.

AWS Service | CloudFront

A truly global CDN network with lots of break-out locations for even better local access. Great for hosting static content, with dynamic content served via AWS Global Accelerator where possible.

AWS Service | DynamoDB

AWS DynamoDB has Global Tables that support multi-master mode out of the box.
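As a sketch, adding a Melbourne replica to an existing table (global tables version 2019.11.21) might look like the following. The table name, key schema and Region codes are assumptions for illustration, and the table needs DynamoDB Streams enabled.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="ap-southeast-2")

# Add a replica of an existing Sydney table in the Melbourne Region.
# Assumes streams (NEW_AND_OLD_IMAGES) are enabled on the table.
dynamodb.update_table(
    TableName="orders",
    ReplicaUpdates=[
        {"Create": {"RegionName": "ap-southeast-4"}},
    ],
)

# Once the replica is ACTIVE, each Region reads and writes its local copy.
melbourne = boto3.resource("dynamodb", region_name="ap-southeast-4")
melbourne.Table("orders").put_item(Item={"order_id": "123", "status": "NEW"})
```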

AWS Service | EventBridge

You can use EventBridge to react to AWS API events in one Region as well as to custom messages. EventBridge handles cross-account and cross-region message passing out of the box, with a lot of flexibility in message filtering and message destinations. It can even archive events and replay them later – a great feature for regional outages.
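A minimal sketch of cross-region forwarding might look like the following: a rule on the Sydney default bus with a Melbourne event bus as its target. The event source, account ID and IAM role ARN are placeholders; the role must allow `events:PutEvents` on the destination bus.

```python
import boto3

events = boto3.client("events", region_name="ap-southeast-2")

# Match our application's custom events on the Sydney default bus.
events.put_rule(
    Name="forward-orders-to-melbourne",
    EventPattern='{"source": ["com.example.orders"]}',
    State="ENABLED",
)

# Forward matched events to the default event bus in the Melbourne Region.
events.put_targets(
    Rule="forward-orders-to-melbourne",
    Targets=[
        {
            "Id": "melbourne-bus",
            "Arn": "arn:aws:events:ap-southeast-4:123456789012:event-bus/default",
            "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-cross-region",
        }
    ],
)
```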

AWS Service | Route53

For endpoints that don’t support AWS Global Accelerator, you can control traffic to Regions via DNS responses based on geolocation or percentage-based (weighted) methods. Note that using DNS as a fail-over strategy is not guaranteed to take effect within a set time: you have no control over client-side DNS caches, so clients that don’t honour the TTL you specify may keep hitting old addresses.
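A rough sketch of the percentage-based approach using weighted records is shown below. The hosted zone ID, domain and load balancer DNS names are placeholders; dropping a Region’s weight to 0 drains traffic away from it (subject to client caching, as noted above).

```python
import boto3

route53 = boto3.client("route53")

# Split traffic 50/50 between the Sydney and Melbourne endpoints.
for region, weight, target in [
    ("sydney", 50, "sydney-alb-123.ap-southeast-2.elb.amazonaws.com"),
    ("melbourne", 50, "melbourne-alb-456.ap-southeast-4.elb.amazonaws.com"),
]:
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000000",
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": region,
                    "Weight": weight,  # set to 0 to drain a Region
                    "TTL": 60,         # keep TTLs short; clients may still ignore them
                    "ResourceRecords": [{"Value": target}],
                },
            }]
        },
    )
```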

AWS Service | S3

AWS S3 does support two-way replication with a small amount of configuration. Be aware that replication latencies are not guaranteed unless you opt for S3 Replication Time Control (S3 RTC) at additional cost. Also note that S3 replication latencies can be measured in minutes, so this should be factored in when designing solutions.
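The sketch below shows one half of a two-way setup – replicating from a Sydney bucket to a Melbourne bucket – with the optional RTC settings included; the mirror-image configuration would be applied to the Melbourne bucket. Bucket names and the role ARN are placeholders, and versioning must already be enabled on both buckets.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="app-data-sydney",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication",
        "Rules": [{
            "ID": "to-melbourne",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},  # replicate the whole bucket
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {
                "Bucket": "arn:aws:s3:::app-data-melbourne",
                # Optional S3 RTC (extra cost) for a 15-minute replication target.
                "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
                "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
            },
        }],
    },
)
```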

Tier B

AWS Service | AWS OpenSearch (formerly Elasticsearch)

OpenSearch does support cross-region replication. The replicated node is read-only, so adopt the write-back-to-master pattern if your workload can handle the latency.

AWS Service | EFS

Cross-region replication was recently announced by AWS. The documentation states that replication can take up to 15 minutes – so just like S3, this needs to be factored in when designing solutions.

AWS Service | Elastic Container Registry

ECR does support automatic cross-region replication for images, but it does not support two-way replication. You could have CI systems in both Regions, but ideally you push to a central account and Region and distribute images to the other Regions from there.
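Setting that up might look like the following registry-level configuration in the central (Sydney) account; the account ID is a placeholder, and replication is one-way by design.

```python
import boto3

ecr = boto3.client("ecr", region_name="ap-southeast-2")

# Replicate every image pushed to the central registry out to Melbourne.
ecr.put_replication_configuration(
    replicationConfiguration={
        "rules": [{
            "destinations": [
                {"region": "ap-southeast-4", "registryId": "123456789012"},
            ],
        }],
    }
)
```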

AWS Service | ElastiCache (Redis)

ElastiCache for Redis does support a read-only cross-region replica. It is important to note that ElastiCache for Memcached does not support this feature. Given the typical use case for this service, writing back to the master across Regions may not be suitable: in most cases Redis is used precisely because it provides a low-latency cache. It might be better to engineer the solution with isolated Redis clusters per Region.

AWS Service | KMS

KMS has support for multi-region keys. This makes it easier to move encrypted data between Regions without having to decrypt and re-encrypt in between the regions.
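As a minimal sketch, creating a multi-Region primary key in Sydney and replicating it to Melbourne might look like this; the description is a placeholder.

```python
import boto3

kms = boto3.client("kms", region_name="ap-southeast-2")

# Create a multi-Region primary key, then replicate it into Melbourne.
primary = kms.create_key(Description="app data key", MultiRegion=True)
key_id = primary["KeyMetadata"]["KeyId"]

kms.replicate_key(KeyId=key_id, ReplicaRegion="ap-southeast-4")

# Data encrypted under the primary key can now be decrypted in Melbourne using
# the replica key, without a decrypt/re-encrypt hop between Regions.
```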

AWS Service | Secrets Manager

Individual secrets can be replicated across regions. So things like DB keys can be rotated with confidence.
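Enabling replication for an existing secret is a one-liner; the secret name and Region code below are placeholders. Rotations applied to the primary secret are propagated to the replicas.

```python
import boto3

secrets = boto3.client("secretsmanager", region_name="ap-southeast-2")

# Replicate an existing secret from Sydney into Melbourne.
secrets.replicate_secret_to_regions(
    SecretId="prod/app/db-credentials",
    AddReplicaRegions=[{"Region": "ap-southeast-4"}],
)
```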

Tier C

AWS Service | AWS App Mesh

A service mesh can make service discovery and routing easier; however, App Mesh is currently a single-region solution.

AWS Service | AWS Cognito

It is currently a standalone service per Region, and there is no way to automatically replicate users across Regions. You will have to write your own functionality, via the Lambda hooks that Cognito provides, to ensure changes in one Region are synced to the other.
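A rough sketch of one such hook is shown below: a PostConfirmation Lambda trigger that copies a newly confirmed user into a second user pool in the other Region. The replica pool ID is a placeholder, and you would still need to handle password setup, profile updates and deletions separately.

```python
import boto3

melbourne_idp = boto3.client("cognito-idp", region_name="ap-southeast-4")
REPLICA_POOL_ID = "ap-southeast-4_XXXXXXXXX"  # hypothetical replica user pool

def handler(event, context):
    attributes = event["request"]["userAttributes"]
    melbourne_idp.admin_create_user(
        UserPoolId=REPLICA_POOL_ID,
        Username=event["userName"],
        UserAttributes=[
            {"Name": "email", "Value": attributes["email"]},
            {"Name": "email_verified", "Value": "true"},
        ],
        MessageAction="SUPPRESS",  # don't send a welcome email from the replica pool
    )
    return event  # Cognito triggers must return the event object
```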

AWS Service | Kinesis Streams

Single region solution. You would have to write your own Lambda functions to read from the stream and replicate to another region.
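A minimal version of that replication Lambda might look like the sketch below, assuming the source stream is configured as the function’s event source. The stream name is a placeholder, and error handling and de-duplication are omitted.

```python
import base64
import boto3

melbourne_kinesis = boto3.client("kinesis", region_name="ap-southeast-4")
STREAM_NAME = "orders"  # hypothetical stream of the same name in the other Region

def handler(event, context):
    # Decode each record from the source stream and replay it into Melbourne.
    records = [
        {
            "Data": base64.b64decode(record["kinesis"]["data"]),
            "PartitionKey": record["kinesis"]["partitionKey"],
        }
        for record in event["Records"]
    ]
    if records:
        melbourne_kinesis.put_records(StreamName=STREAM_NAME, Records=records)
```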

AWS Service | SNS, SQS

Single region solution for messaging. It would be up to your workloads to drop messages in the appropriate topics / queues. Depending on the type of workload, you should consider passing messages through EventBridge instead, since that service supports multi-region out of the box.

AWS Service | SSM Parameter Store

It is a single region solution for storing application config parameters. There is a pre-made CloudFormation template to automate the replication via a Lambda function and EventBridge.
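The core of that replication approach is small enough to sketch here: a Lambda subscribed (via an EventBridge rule) to Parameter Store change events in Sydney, mirroring each change into Melbourne. The event detail fields used below are assumptions for illustration.

```python
import boto3

local_ssm = boto3.client("ssm", region_name="ap-southeast-2")
remote_ssm = boto3.client("ssm", region_name="ap-southeast-4")

def handler(event, context):
    name = event["detail"]["name"]
    operation = event["detail"]["operation"]

    if operation == "Delete":
        remote_ssm.delete_parameter(Name=name)
        return

    # Read the current value locally and mirror it into the other Region.
    parameter = local_ssm.get_parameter(Name=name, WithDecryption=True)["Parameter"]
    remote_ssm.put_parameter(
        Name=name,
        Value=parameter["Value"],
        Type=parameter["Type"],
        Overwrite=True,
    )
```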

Common strategies to deal with multi-region data

“One common pattern is to have a local read-replica in the secondary region, and make writes back to the master in the primary region. Another pattern is to write data to SNS or Kinesis. The data is then synced locally or back to the master depending on the type of data.”

For applications that are not write heavy, you can use the local read-replica for reads and steer writes directly to the master source. Assuming the round-trip latency between Sydney and Melbourne is sub-100ms, this is practical. Should Region 1 in the diagram below go offline, you promote the Read-Replica in Region 2 to Master, start writing locally and continue normal operations. When Region 1 comes back online, you have to ensure that the previous Master in Region 1 is destroyed and a new Read-Replica syncing from Region 2 is created. The roles effectively swap at this point.
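For a plain RDS cross-Region read replica, the promotion step itself is a single API call, sketched below (Aurora Global Database uses a different failover mechanism). The instance identifier is a placeholder; once the instance is available, application write traffic is pointed at it.

```python
import boto3

rds = boto3.client("rds", region_name="ap-southeast-4")

# Promote the Melbourne read replica to a standalone writable instance.
rds.promote_read_replica(DBInstanceIdentifier="app-db-replica-melbourne")

# Wait until the promoted instance is available before shifting writes to it.
waiter = rds.get_waiter("db_instance_available")
waiter.wait(DBInstanceIdentifier="app-db-replica-melbourne")
```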

Another common method is to perform writes via a fan-out process to replicate the data across both regions. This is another strategy used for workloads that are not write heavy. The important thing to note is that if a Region goes offline, you will potentially have out-of-sync data when it comes back online. You will need to have a process to re-sync the data from the available Region to the previously offline Region. Depending on your use case this might actually be very hard.

Here is a slightly different fan-out process that uses Kinesis Streams and Lambda functions to provide a built-in ability to re-sync data after a regional failure. Route53 is used to control incoming requests: a certain percentage goes to the Secondary Region, and this could even be 50%. Both Regions receive data through API Gateway, forwarded to Kinesis Streams. In the Primary Region, those events are written to the destination data source: an RDS DB, DynamoDB or S3. That data is replicated across to the Secondary Region. During an outage of the Primary Region, you shift 100% of the weighted load in Route53 to the Secondary Region and have the Lambda Receiver write directly to the replicated data store. If the local datastore is a Read Replica DB, it would have to be promoted to Master status first. This makes the system more complicated – but it can deal with regional outages.

There are more design patterns that can be applied to manually build multi-region support on top of services that don’t natively support it. Feel free to touch base with CMD Solutions for potential solutions that would address your use case better than those shown.

So, now that we’ve discussed some possible design patterns, let’s move onto workload orchestration.

Lambda orchestration

“Scaling up Lambda to become multi-region is much easier due to its serverless nature. If using DynamoDB for data storage, multi-region is easily achieved just by adjusting the CI/CD system.”

AWS Lambda has continued to develop over the years, and it remains a powerful serverless compute platform. It is highly scalable, cost effective and reduces the amount of operational management an enterprise has to perform. Leveraging the AWS SDKs makes it extremely convenient to interact with other AWS services, and the ability to run custom container images means the language possibilities are immense.

AWS Lambdas are also extremely cheap to set up – if they don’t run, you’re not paying for any infrastructure. This makes them a great fit for A-P setups where it is not possible to run in A-A mode.

Because AWS Lambda is serverless, the biggest concern for running multi-region A-A setups comes down to any ancillary services it uses. Let’s assume that you currently have this setup:

Let’s take that base and extend it to be multi-region A-A:

User requests can be geographically routed via Route53 to the correct API Gateway endpoint. Running the Lambda is similar to the single region setup, except now you have to pay attention to how data is written.

Deployment can be handled by your serverless framework of choice, including AWS Amplify, AWS CDK, AWS Serverless Application Model (SAM) or the Serverless Framework. The benefit of these frameworks is that they can also create the surrounding AWS infrastructure in addition to the Lambda function. Your CI/CD system of choice can be triggered by a webhook event from GitHub/GitLab and deploy to both Regions.
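As a minimal sketch, the deploy step a CI/CD job might run after that webhook fires could simply loop the same SAM deployment over both Regions. The stack name and flags below are illustrative; CDK or the Serverless Framework could be looped in the same way.

```python
import subprocess

REGIONS = ["ap-southeast-2", "ap-southeast-4"]  # Sydney and Melbourne

for region in REGIONS:
    # Deploy the same SAM application into each Region.
    subprocess.run(
        [
            "sam", "deploy",
            "--stack-name", "orders-api",
            "--region", region,
            "--no-confirm-changeset",
        ],
        check=True,  # fail the pipeline if either regional deployment fails
    )
```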

Active-Active via Container orchestration

“There are excellent Kubernetes tools like ArgoCD / Flux and Crossplane that make multi-region workload orchestration much easier. “

Kubernetes is a rich platform used to automate deployment and management of containerised applications in large distributed clusters. It has become the dominant name in container orchestration over the last few years – with broad adoption, a great ecosystem of supporting tools, built-in security, extensibility and pluggability.

We would use the flexibility that it offers to manage our A-A setup. Using the numerous Service Mesh options available in Kubernetes, we could also make an A-P setup work by routing requests through the Service Mesh to the Active cluster. 

Assume your current single-region setup looks like the diagram below. You might or might not have a pilot-light setup in a different Region.

Let’s take that base, and extend it to become multi-region A-A:

The first change is to use AWS Global Accelerator to help route traffic between the Regions. Requests will be routed to either the Sydney or Melbourne Region based on geolocation. From there, they enter the appropriate Load Balancer associated with the EKS cluster. The Active-Active version of the DB is pictured.

Behind the scenes, we are going to automate deployments and infrastructure changes to both clusters using GitOps. Deployments and infrastructure changes are now git commits. We’ll make a commit to GitHub/GitLab with the Kubernetes manifest. That manifest will include the updated AWS ECR images, and also describe AWS resources in the form of custom resources (CRDs) – more on that below. A webhook will notify both ArgoCD/Flux instances, and they will then apply the changes to the Kubernetes cluster they run on. Kubernetes will then manage the application deployment, and Crossplane will pick up the custom resources and apply the changes to the surrounding AWS infrastructure.

There are indeed more moving parts than in a single-region solution. This is required in order to make deployments consistent between the two Regions. The benefit of this process is fully automated continuous delivery: the CI system makes a commit, and both application and infrastructure deployments are automatically handled by the EKS clusters.

If the workloads use other AWS resources that aren’t Tier A, then the data should be replicated based on the concepts presented in the Common strategies section above.

Conclusion

We’ve demonstrated some design patterns that you can use in Kubernetes and Lambda based workloads to take advantage of the new AWS Melbourne Region opening up. We believe this is a great opportunity to focus on Active-Active workload configurations. Prior to the Melbourne Region, the round trip time to the nearest international Region would have caused too many performance compromises for Active-Active setups. If you focus on AWS services in the highest tier, you’ll be more likely to achieve an Active-Active setup spread across the Sydney and Melbourne Regions without a large performance penalty. You’ll also enjoy the added benefits of increased fault tolerance without the need to constantly worry about DR setups.