Storing Data in AWS

By Marko Sluga
Jun 10, 2020

📄 Contents

␡

“Do I Know This Already?” Quiz
Foundation Topics
Storing Static Assets in AWS
Relational Versus Nonrelational Databases
Deploying Relational Databases in AWS
Handling Nonrelational Data in AWS
Caching Data in AWS
Exam Preparation Tasks
Review All Key Topics
Define Key Terms
Q & A

⎙ Print

< Back Page 5 of 11 Next >

This chapter is from the book 

AWS Certified Developer - Associate (DVA-C01) Cert Guide

Learn More Buy

Deploying Relational Databases in AWS

Many applications require the ability to store data in a relational database. From web services, business intelligence, and analytics to infrastructure management, many different tasks require the recording of data in a database. In AWS, you have two choices:

You can deploy an EC2 instance with a database server application installed.
You can choose to use Amazon Relational Database Service (RDS).

Amazon RDS

The choice between a standalone EC2 instance with a database on top and RDS is essentially the choice between an unmanaged environment where you have to manage everything yourself and a managed service where most of the management tasks are automated and complete control over deployment, backups, snapshots, restores, sizing, high availability, and replicas is as simple as making an API call. When developing in AWS, it always makes sense to lean toward using a managed service as the benefits of reducing the management overhead can be numerous. Aside from simplifying the management, another business driver can be increased flexibility and automation, which can be achieved by using the AWS CLI, the SDKs, and CloudFormation to deploy the database back end with very little effort or through an automated CI/CD system. Managed services essentially empower developers to take control of the infrastructure and design services that can be easily deployed and replicated and that can have auto-healing characteristics built into them.

Example 4-6 shows how the deployment of an RDS database can be integrated in a Java application by using the AWS Java SDK, giving you the ability to deploy the database and use the database string returned to connect to the newly created database.

Example 4-6 Java Script That Can Be Used to Build an RDS Database

// define the credentials
AWSCredentials credentials = new BasicAWSCredentials(
  "AJDEIX4EE8UER4",
  " D3huG40jThD3huG40jThNPaAx2P3py85NPaAx2P3py85"
);
AmazonRDSClientBuilder.standard().withCredentials(credentials) // pull the
credentials into RDS Builder
  .withRegion(Regions.US_EAST_2) // define the region as us-east-2
  .build();
CreateDBInstanceRequest request = new CreateDBInstanceRequest(); // define the
create request
request.setDBInstanceIdentifier("javadbinstance");  // give the database instance
(the server) a name
request.setDBInstanceClass("db.t3.small"); // define the size of the database
instance
request.setEngine("mysql"); // define the database engine type
request.setMultiAZ(true); // make the database highly available with MultiAZ
request.setMasterUsername("master"); // define the database master username
request.setMasterUserPassword("javadbpw"); // define the database master password
request.setDBName("masterdb"); // give the database a name
request.setStorageType("gp2");  // define the storage type - gp2 is general purpose
SSD
request.setAllocatedStorage(30); // define the storage size as 30 GB
amazonRDS.createDBInstance(request); // issue the request

Once the script is created, you can list all your instances with the DescribeDBInstanceResult class. You will want to get the instance identifier and the endpoint, which is the SQL endpoint URL that you can later use to connect to the database. You can do this by including the snippet shown in Example 4-7 in your Java code.

Example 4-7 Using the Java DescribeDBInstanceResult Class

DescribeDBInstancesResult result = amazonRDS.describeDBInstances();
List<DBInstance> instances = result.getDBInstances();
for (DBInstance instance : instances) {
    String identifier = instance.getDBInstanceIdentifier();
    Endpoint endpoint = instance.getEndpoint();
}

Supported Database Types

Currently the RDS service supports six different database engines that can be deployed from RDS:

MySQL
MariaDB
PostgreSQL
Amazon Aurora
Oracle
Microsoft SQL Server

RDS for MySQL, MariaDB, and PostgreSQL

MySQL, MariaDB, and PostgreSQL are the most popular open-source relational databases used in today’s enterprise environments. Being open source and requiring little or no licensing while still having enterprise-grade support available makes these databases a great choice for an enterprise looking to deploy applications in a more efficient manner. They can easily replace traditional databases that tend to have expensive licensing attached to them.

The MySQL, MariaDB, and PostgreSQL engines all have similar general characteristics and support highly available Multi-AZ deployment topologies with a synchronous master/slave pair across two availability zones. All of them also have the ability to deploy multiple read replicas in the same region or in another region. The RDS service supports the following versions of these open-source databases:

MySQL Community Edition versions 5.5+ and 8.0
MariaDB Server versions 10.0+
All PostgreSQL versions (though version 9.3.5t is required for Multi-AZ and read replicas)

Figure 4-5 illustrates synchronous replication in Multi-AZ RDS deployments.

Figure 4-5 A Multi-AZ RDS Deployment

The MySQL, MariaDB, and PostgreSQL databases all support the use of SSL connections for the encryption of data in transit and can be configured with built-in volume encryption for data at rest.

These three database types are limited in size to 16 TB per volume and can use numerous different RDS instance types so you can scale the size of an instance from small to 8xlarge.

Amazon Aurora

Amazon Aurora is the next-generation open-source engine currently supporting the MySQL and PostgreSQL database types. The benefit of Aurora is that it decouples the processing from the storage. All the data is stored on a synchronously replicated volume in three availability zones, and the processing of SQL requests is performed on the cluster instances. The instances have no local storage, and they all access the cluster volume at the same time, so the performance of the cluster can be linearly scaled by adding nodes.

The write node in an Aurora cluster, also called the primary instance, is used to process all write requests. The primary instance type needs to be scaled to the write performance requirements of your application and can be easily resized by promoting a larger read replica to the primary role. All other members of the cluster are called replica instances, and they can respond to read requests. The primary and the replicas have different DNS names to which you send requests, which means you can simply configure your application with two FQDN targets—one for the writes and another for the reads—and do not need to handle the read/write distribution on your own.

Because the primary and replica instances have access to the same synchronously replicated cluster volume, you can also instantly promote any read replica into the primary role if the primary instance fails or if the availability zone where the primary instance is running experiences difficulties. Figure 4-6 illustrates how the Aurora design ensures synchronous writes and decouples storage from the compute layer.

Figure 4-6 Design of an Aurora Database Cluster

An Aurora cluster can scale quite a bit because you can add up to 15 replicas to the primary instance while additionally adding another 16 asynchronous replicas in another region. The Aurora engine also extends the maximum cluster volume to 64 TB, delivering not only a performance advantage but also a capacity advantage over traditional open-source databases, while maintaining the ability to use SSL for encryption in transit and delivering built-in encryption at rest.

Aurora is now available in serverless on-demand mode as a pay-per-request service. This is a great option for any kind of transient SQL clusters where keeping the primary and replicas running 24/7 would cause unnecessary costs. The on-demand Aurora also handles all scaling and capacity management automatically so that you can send as many requests as you need and always get a response. This essentially allows you to also support very spiky applications where you are not sure of the performance required before the requests start rolling in.

Oracle and Microsoft SQL on RDS

Traditional enterprise databases are sometimes the only option, so RDS allows you to deploy an Oracle 11g or Microsoft 2008 or newer SQL server as a service. The cost of these two engine types can have the licensing included, so there is no need to spend large sums of money for licensing upfront. There is, of course, also an option to bring your own license for each.

While you have a lot of choice of RDS instance types to run on, the Oracle and Microsoft engines are limited to a Multi-AZ mode and provide no support for read replicas and a maximum size of 16 TB per volume. To protect data at rest and in transit, Transparent Data Encryption (TDE) is supported on both engine types.

Scaling Databases

There are four general ways to scale database performance:

Vertical scaling: You can give a single database engine more power by adding more CPU and RAM.
Horizontal scaling: You can give a database cluster more power by adding more instances.
Read offloading: You can add read replicas and redirect read traffic to them.
Sharding: You can distribute the data across multiple database engines, with each one holding one section, or shard, of data.

With relational databases, vertical scaling always works, but it has a maximum limit. In AWS, the maximum limit is the largest instance size that can be deployed in the service. An alternative is horizontal scaling, but generally relational databases are not the best at being able to scale horizontally. The nature of the atomicity of the SQL transactions usually means that the whole transaction must be processed by one server—or sometimes even in one thread on a single CPU.

If an RDS database is deployed in a Multi-AZ configuration, the resizing can be done transparently because the slave database is resized first, the data is synchronized, the connection fails over, and the slave becomes the master while the previous master instance is resized. When the resizing is complete, data is again synchronized, and a failover is performed to the previous master instance.

Example 4-8 uses the boto3 Python SDK to increase the instance size from db.t3.small to db-t3-medium for the instance created in the previous example.

Example 4-8 Python SDK (boto3) Script That Can Be Used to Create an RDS Instance

import boto3 # boto3 is the AWS SDK for python
client = boto3.client('rds') # define the RDS client to be used
response = client.modify_db_instance( # modify an existing instance
    DBInstanceIdentifier=' javadbinstance ', # specify the instance ID
    DBInstanceClass=' db.t3.medium ', # define the new size
    ApplyImmediately=True, # run the command immediately (will not impact the
availability since we set the database to be MultiAZ
)

Another way of scaling is to distribute the read and write transactions on multiple nodes. A typical relational database is more read intensive than write intensive, with a typical read-to-write ratio being 80:20 or even 90:10. By introducing one or more read replicas, you can offload 80% or even 90% of the traffic off your write node. Aurora excels at read replica scaling, whereas the other services that support read replicas support only asynchronous replication, which means the read data is not as easily distributed across the cluster because the data read from the replica might be stale. But even asynchronous replicas can be a great benefit for offloading your write master where historical analytics and business intelligence applications are concerned.

Typically the last resort for scaling relational databases is to shard the data. Essentially this means that a dataset is sliced up into meaningful chunks and distributed across multiple masters, thus linearly increasing write performance.

For example, imagine a phone directory in a database with names from A to Z. When you need more performance, you can simply split up the database into names starting with A to M and N to Z. This way, you have two databases to write to, thus theoretically doubling the performance. Figure 4-7 illustrates the principle of sharding RDS databases to achieve better performance.

Figure 4-7 Sharding a Phone Directory into Two Databases

However, the limitation of sharding is immediately apparent when you try to perform analytics as you need to access two databases, join the two tables together, and only then perform the analytics or BI operation. Figure 4-8 illustrates tables from sharded databases being joined to an analytical database.

Figure 4-8 Steps Required for Analytics on Sharded Databases