Comparing Architecture Designs

Successful applications hosted at AWS build some degree of high availability and/or fault tolerance into the underlying design. As you prepare for the AWS Certified Solutions Architect - Associate (SAA-C02) exam, you need to understand these concepts in conjunction with the AWS services that can provide a higher level of redundancy and availability. After all, an application designed to be highly available has a higher level of redundancy; regardless of the failures that occur from time to time, the application continues to function at a working level without massive, unexpected downtime.

The same is true of a fault-tolerant design: if your application can tolerate operating issues such as added latency, reduced throughput, or degraded durability, it will overcome most of these issues most of the time. Nothing is perfect, of course, but you can achieve high availability and fault tolerance for hosted applications through proper application design.

Designing for High Availability

Imagine that your initial foray into the AWS cloud involves hosting an application on a single large EC2 instance. This would be the simplest approach in terms of architecture, but availability problems are inevitable: the instance can fail, or it may need to be taken down for maintenance. There are also design issues with storing the application state and data records on the same local instance.

As a first step, you could split the application into a web tier and a database tier with separate instances to provide some hardware redundancy; this improves availability because the two systems are unlikely to fail at the same time. If the mandate for this application is that it should be down for no more than 44 hours in any given 12-month period, this equates to 99.5% uptime. Say that tests show that the web server actually has 90% availability and the database server has 95% availability, as shown in Figure 3-1; these figures cover each tier's hardware, operating system, database engine, and network connectivity. Because the application needs both tiers to be up at the same time, the availabilities multiply, and this design results in a total availability of only 90% × 95% = 85.5%.

FIGURE 3-1 Availability Calculation for a Simple Hosted Application
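
You can verify the arithmetic with a few lines of Python. This is a minimal sketch of the serial-availability rule (components that must all be up multiply); the `serial_availability` helper is an illustrative name, not an AWS API.

```python
from math import prod

def serial_availability(*components: float) -> float:
    """Availability of components that must ALL be up: the values multiply."""
    return prod(components)

# Figure 3-1: a 90% web server in series with a 95% database server.
print(f"{serial_availability(0.90, 0.95):.1%}")  # 85.5%
```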

This amount of uptime and availability obviously needs to be increased to meet the availability mandate. Adding a second web server means the web tier fails only if both servers fail at once, raising that tier to 1 − (0.10 × 0.10) = 99%; combined with the 95% database server, the overall availability becomes 0.99 × 0.95 ≈ 94%, as shown in Figure 3-2.

FIGURE 3-2 Increasing Application Availability by Adding Compute

The design now has some additional fault tolerance at the web tier: one server can fail, and the application will still be available. However, you also need to increase the availability of the database tier because, after all, that is where the important data is stored. Adding two replica database servers to the database tier and a third web server to the web tier raises the web tier to 1 − 0.10³ = 99.9% and the database tier to 1 − 0.05³ ≈ 99.99%, for a combined availability of roughly 99.9%, which exceeds the 99.5% SLA goal, as shown in Figure 3-3.

FIGURE 3-3 Adding Availability to Both Tiers
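
The improvement in Figures 3-2 and 3-3 comes from parallel redundancy: a tier of N independent copies at availability a fails only when all N fail at once, so its availability is 1 − (1 − a)^N. Here is a minimal sketch of that rule, with an illustrative helper name, continuing the example above:

```python
def parallel_availability(availability: float, copies: int) -> float:
    """Availability of a tier of N redundant copies: it fails only if all fail."""
    return 1 - (1 - availability) ** copies

web_tier = parallel_availability(0.90, copies=3)  # 1 - 0.1**3 = 0.999
db_tier = parallel_availability(0.95, copies=3)   # 1 - 0.05**3 = 0.999875
print(f"{web_tier * db_tier:.2%}")                # ~99.89%, above the 99.5% goal
```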

Adding Fault Tolerance

You can apply the same availability principles just discussed to the AWS data centers that host the application tiers. To do so, you increase application availability and add fault tolerance by designing with multiple availability zones, as shown in Figure 3-4.

FIGURE 3-4 Hosting Across Multiple AZs to Increase Availability and Fault Tolerance

Remember that each AWS availability zone contains at least one physical data center. AWS data centers are built for high availability, are connected to each other with high-speed, low-latency links, and are located far enough apart that a single natural disaster won't affect all of the data centers in a region at the same time. Thanks to this design, your application achieves high availability, and that additional availability gives the application more tolerance for failures when they do occur.

Removing Single Points of Failure

Eliminating as many single points of failure as possible from your application stack also greatly improves availability and fault tolerance. A single point of failure is any component of your application stack whose failure brings down the entire integrated application. Take some time to review Table 3-2, which lists possible mitigation paths for common single points of failure.

Table 3-2 Avoiding Single Points of Failure

| Possible Single Point of Failure | Mitigation Plan | Reason |
| --- | --- | --- |
| On-premises DNS | Route 53 DNS | Anycast DNS services hosted across all AWS regions, with health checks and DNS failover |
| Third-party load balancer | Elastic Load Balancing (ELB) services | ELB instances form a massive regional server farm with elastic IP addresses for fast failover |
| Web/app server | ELB/Auto Scaling for each tier | Compute resources scale automatically up and down to meet demand |
| RDS database server | Redundant data nodes (primary/standby) | Synchronous replication between the primary and standby nodes provides two copies of updated data |
| EBS data storage | Snapshots and retention schedule | Copying snapshots across regions adds additional redundancy |
| Authentication problem | Redundant authentication nodes | Multiple Active Directory domain controllers provide alternate authentication options |
| Data center failure | Multiple availability zones | Each region has multiple AZs, providing high-availability and failover design options |
| Regional disaster | Multi-region deployment with Route 53 | Route 53 routing policies provide geo-redundancy for applications hosted across regions |
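
To make the Route 53 failover row concrete, the following hedged sketch shows how a primary/secondary failover record pair might be created with boto3. The hosted zone ID, health check ID, domain name, and IP addresses are placeholders for illustration, not values from this chapter.

```python
import boto3

route53 = boto3.client("route53")

# Hypothetical IDs; substitute your own hosted zone and health check.
ZONE_ID = "Z0000000EXAMPLE"
HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"

def failover_record(identifier: str, role: str, ip: str) -> dict:
    """Build one half of a PRIMARY/SECONDARY failover record pair."""
    record = {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": identifier,
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if role == "PRIMARY":
        # Route 53 fails over to the secondary when this health check fails.
        record["HealthCheckId"] = HEALTH_CHECK_ID
    return record

route53.change_resource_record_sets(
    HostedZoneId=ZONE_ID,
    ChangeBatch={
        "Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record("primary", "PRIMARY", "203.0.113.10")},
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record("secondary", "SECONDARY", "203.0.113.20")},
        ]
    },
)
```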

AWS recommends that you use a load balancer to distribute requests among multiple servers and to increase overall availability and reliability. The servers themselves are physically located in separate availability zones targeted directly by the load balancer. However, a single load-balancing appliance would itself be a single point of failure; if the load balancer failed, there would be no access to the application. You could add a second load balancer, but doing so would complicate the design: failing over to the redundant load balancer would require a DNS update for the client, and that takes time. AWS gets around this problem by using elastic IP addresses, which allow for rapid IP remapping, as shown in Figure 3-5. The elastic IP address is assigned to multiple load balancers, and if one load balancer is not available, the elastic IP address attaches itself to another load balancer.

FIGURE 3-5 Using Elastic IP Addresses to Provide High Availability

You can think of an elastic IP address as being able to float between resources as required. This capability is a mainstay of high-availability infrastructure at AWS. You can read further details about elastic IP addresses in Chapter 8, “Networking Solutions for Workloads.”
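
The remapping itself is a single API call. The sketch below uses made-up resource IDs to show how an elastic IP address could be moved to a healthy standby; `AllowReassociation=True` lets the address be claimed even while it is still associated with the failed node, which is what makes the floating-IP pattern fast.

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical IDs for illustration only.
EIP_ALLOCATION_ID = "eipalloc-0123456789abcdef0"
STANDBY_INSTANCE_ID = "i-0fedcba9876543210"

# Remap the elastic IP to the standby resource. AllowReassociation=True
# permits the address to move even if it is currently associated with
# another (failed) instance, enabling rapid failover without a DNS change.
ec2.associate_address(
    AllocationId=EIP_ALLOCATION_ID,
    InstanceId=STANDBY_INSTANCE_ID,
    AllowReassociation=True,
)
```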

Table 3-3 lists AWS services along with the features and designs that give each of them high availability, fault tolerance, and redundancy.

Table 3-3 Planning for High Availability, Fault Tolerance, and Redundancy

| AWS Service | High Availability | Fault Tolerance | Redundancy | Multi-Region |
| --- | --- | --- | --- | --- |
| EC2 instance | Additional instance | Multiple availability zones | Auto Scaling | Route 53 health checks |
| EBS volume | Cluster design | Snapshots | AMI | Copy AMI/snapshot across regions |
| Load balancer | Multiple AZs | Elastic IP addresses | Server farm | Route 53 geoproximity load balancing options |
| Containers | Elastic Container Service (ECS) | Fargate management | Application Load Balancer/Auto Scaling | Regional service, not multi-region |
| RDS deployment | Multiple AZs | Synchronous replication | Snapshots/backups of EBS data volumes and transaction records | Regional service, not multi-region |
| Custom EC2 database | Multiple AZs and replicas | Asynchronous/synchronous replication options | Snapshots/backups of EBS volumes | Custom high-availability and failover designs across regions with Route 53 traffic policies |
| Aurora (MySQL/PostgreSQL) | Replication across three AZs | Multiple writers | Clustered shared storage | Global database hosted and replicated across multiple AWS regions |
| DynamoDB (NoSQL) | Replication across three AZs | Multiple writers | Continuous backup to S3 | Global tables replicated across multiple AWS regions |
| Route 53 | Health checks | Failover routing | Multivalue answer routing | Geolocation/geoproximity routing |
| S3 bucket | Same-region replication | Built-in | Built-in | Cross-region replication |
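
As one concrete example from the table, the hedged sketch below shows how S3 cross-region replication might be enabled with boto3. The bucket names and IAM role ARN are placeholders, and both buckets must already exist with versioning enabled.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical names; both buckets must already have versioning enabled.
SOURCE_BUCKET = "example-source-bucket"
REPLICA_BUCKET_ARN = "arn:aws:s3:::example-replica-bucket"
REPLICATION_ROLE_ARN = "arn:aws:iam::123456789012:role/example-s3-replication"

s3.put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE_ARN,
        "Rules": [
            {
                "ID": "cross-region-redundancy",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # an empty filter replicates the whole bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": REPLICA_BUCKET_ARN},
            }
        ],
    },
)
```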
