Improving file load performance in Amazon Aurora MySQL
In this post, we show techniques to improve data load performance when ingesting data from flat files into Amazon Aurora MySQL-Compatible Edition. A .csv file is a common example of a flat file.
Amazon Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud. Aurora combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open-source databases. Amazon Aurora MySQL is compatible with MySQL 5.6, MySQL 5.7, and MySQL 8.0.
Aurora MySQL-Compatible offers several methods to ingest text file data, but importing large files can become problematic as those files grow in size. A sub-optimal data model design increases the overhead of loading data, and some data load practices can prevent the engine from operating at its full potential.
To solve this problem, we make use of proper file placement, parallelism, and data modeling techniques to extract the best performance from data loads into Aurora MySQL.
We show how each of these factors impacts the overall performance of flat file ingestion in Amazon Aurora MySQL, and cover best practices and mechanisms you can use to load data efficiently into Aurora MySQL.
Determining factors for flat file data load performance
There are design choices and data loading mechanisms that will determine how fast the data can be loaded.
Below are the main factors that influence data loading performance:
- Data loading mechanisms
- Presence and design of a Primary Key
- Loading data in parallel
- File content order
- Number/design of Secondary Keys
- Integrity checks (unique_checks/foreign_key_checks)
Real-world situations combine several of these scenarios. Here we examine the relevance and impact of each item in isolation.
For the tests, we used a db.r6g.4xlarge instance running Aurora MySQL (MySQL 8.0 compatible). This instance offers 16 vCPUs and 128 GB of memory, powered by AWS Graviton2 processors.
Data Loading Mechanisms
In Aurora MySQL you can use two methods to load data from text files:
- LOAD DATA LOCAL INFILE
- LOAD DATA FROM S3
LOAD DATA LOCAL INFILE is a popular method used to ingest flat file data. This method reads from a file that is local to the database client machine and writes to a desired table. It offers good overall performance and is a popular data load method, not only in Aurora MySQL but in community MySQL as well, including Amazon Relational Database Service (Amazon RDS) for MySQL.
With this load method, the compute capacity, network and concurrent workload of your client machine has direct impact on the data load speed. You must also consider where to place your client machine. In the case of Amazon Elastic Compute Cloud (Amazon EC2) instances, you should place it on the same Availability Zone as the Aurora MySQL writer DB instance for improved network performance, as well as to reduce inter-AZ data transfer costs.
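A minimal example of this method looks like the following sketch, in which the table name (sales) and file path are placeholders for your own schema. Note that the local_infile capability must be enabled on both the client and the server for this statement to work:

```sql
-- Load a client-local CSV file into a table.
-- Table name and file path below are hypothetical placeholders.
LOAD DATA LOCAL INFILE '/data/sales.csv'
INTO TABLE sales
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the header row, if the file has one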
LOAD DATA FROM S3 is Amazon Aurora MySQL’s specific method to ingest flat file data. When using Amazon Simple Storage Service (Amazon S3) as your file store option you can set data lifecycle policies, replication, versioning, access controls and easily integrate with other parts of your application that runs on other services on the AWS infrastructure. Because Amazon S3 Buckets are not mapped to specific Availability Zones you don’t have to think about where to place the file (but the S3 Bucket must be in the same AWS Region as your Aurora MySQL-Compatible cluster). In this case it doesn’t matter if your client machine is small or busy with other tasks because the heavy work is done between Aurora and S3. Reliability is higher when placing the file on S3 because it doesn’t depend on how reliable the client/network is.
When employing the best practices, both options offer similar performance. LOAD DATA FROM S3 is the superior method in most situations because of the additional resiliency, less management overhead when using S3 built-in features described earlier and integration with other AWS services. Specially if the files must be generated by another system, allowing such files to be generated directly in Amazon S3 saves time and effort.
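The S3-based load uses Aurora-specific syntax, sketched below with a hypothetical bucket and table name. As a prerequisite, the Aurora cluster must have an associated IAM role that grants read access to the bucket (configured through the cluster's S3 integration settings):

```sql
-- Load a CSV file directly from Amazon S3 into a table.
-- Bucket and table names below are hypothetical placeholders.
LOAD DATA FROM S3 's3://my-example-bucket/data/sales.csv'
INTO TABLE sales
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
```

Aurora also accepts a PREFIX keyword after FROM S3 to load every object under a common key prefix in a single statement, which is convenient when the source data is split across multiple files.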
Presence and design of a Primary Key
In Aurora MySQL (or any MySQL-based engine), if you don’t explicitly define a primary key and there are no other unique constraints the database can use to satisfy the ordering requirement, the engine generates a hidden key itself and uses it for ordering. The mechanism that generates this key can hinder performance and even become a system-wide contention point. Therefore, declaring primary keys on Aurora MySQL tables is strongly recommended, even if you have to use a surrogate key.
Comparing a 20 million row import into a table that has a primary key against a similar table that doesn’t have a primary key shows the difference this makes on total elapsed data load time.
The overhead is more noticeable when using multiple threads to parallelize data load for efficiency. The following example runs a 10-thread parallel import from Amazon S3.
Figure: data load times for 10 parallel threads with and without a primary key — 246 seconds without a primary key versus 59 seconds with one.
As can be noted in the preceding comparison, in the absence of a primary key the process takes four times longer.
Having a primary key also helps reduce contention on the table. When multiple processes try to access a table at the same time, you will see a wait event called index_tree_rw_lock. It is more pronounced in tables that don’t have a primary key.
Figure: the first six bars correspond to the data load into the table without a primary key, which took 5 minutes and 30 seconds. The pair of data points at 00:24 and 00:25 corresponds to a load test with a primary key, which took 55 seconds.
The index_tree_rw_lock wait event means that there are many similar DMLs accessing the same database object at the same time. When using parallelism, ensure a Primary Key is in place and pay attention to the wait events and performance to find the ideal number of threads.
The relevance of a Primary Key for performance when loading data from flat files is clear, but it is important to pay attention to the columns to be used as Primary Keys.
A surrogate key is an artificial column that exists for the sole purpose of serving as the primary key of its table, while a natural key (or business key) is any column or set of columns that exists because of business needs but can also uniquely identify a row, and therefore can be used as a primary key. For MySQL, the ideal primary key is a linearly incrementing integer value. Because this rarely exists naturally in application designs, a conventional approach is to use surrogate keys.
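The surrogate-key approach can be sketched as follows, using a hypothetical orders table: a compact auto-incrementing integer serves as the primary key, while the business identifier is kept as a regular column (optionally with a unique constraint):

```sql
-- Hypothetical table using a surrogate primary key.
CREATE TABLE orders (
  id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- surrogate key
  order_ref   VARCHAR(40)     NOT NULL,                 -- natural/business key
  customer_id BIGINT UNSIGNED NOT NULL,
  created_at  DATETIME        NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uk_order_ref (order_ref)  -- still enforce business uniqueness
) ENGINE=InnoDB;
```

Secondary indexes on this table store only the compact integer key, keeping them small and cheap to maintain.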
Using a natural key is easier because it doesn’t require altering tables, but you are likely to end up with composite primary keys or columns with larger data types than necessary. On storage, multiple data rows are stored within each data page (a fixed size of 16 KB in Aurora MySQL).
Larger data types or composite primary keys reduce your per-page record density (hence requiring more data pages for the same number of rows), which increases your I/O needs when querying the table.
A large primary key also becomes a problem as you add secondary indexes to the table. Indexes other than the clustered index are known as secondary indexes. In InnoDB, each record in a secondary index contains the primary key columns for the row, as well as the columns specified for the secondary index. InnoDB uses this primary key value to search for the row in the clustered index.
If the Primary Key is long, the secondary indexes use more space on storage, require more IO when accessed and require more memory on the buffer pool, so it is advantageous to have a short primary key.
A popular solution in many applications is to use a column of MD5 hash values as the primary key.
MD5 hashes work because they are mostly unique (although collisions are possible), and it is not uncommon for such columns to already exist in the database.
In tests with a table using an integer column with sequential numbers as the primary key, the import proved to be almost 30% faster than with a VARCHAR column holding MD5 hashes. This is partly due to the size of the data type, but also because MD5 produces non-sequential values, which incur extra overhead from the B-tree page merge and page split operations needed to keep the clustered index ordered when inserting at random points (as opposed to inserting in order, as with a linearly increasing integer).
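The two table shapes compared in that test can be sketched as follows (table and column names are hypothetical):

```sql
-- Sequential integer key: new rows append to the right-hand edge of the
-- clustered index B-tree, so pages fill in order with no splits mid-tree.
CREATE TABLE t_int (
  id      BIGINT UNSIGNED NOT NULL,  -- sequential values from the file
  payload VARCHAR(100),
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- MD5 hash key: new rows land at random points in the B-tree, forcing
-- page splits and merges to keep the clustered index ordered.
CREATE TABLE t_md5 (
  id      CHAR(32) NOT NULL,  -- hex MD5 digest, 32 bytes vs. 8 for BIGINT
  payload VARCHAR(100),
  PRIMARY KEY (id)
) ENGINE=InnoDB;
```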
Loading data in parallel
Having a primary key defined on the table becomes more important when using multiple parallel threads to load data.
This section requires a clear distinction between loading from Amazon S3 and loading from a local file (such as from an Amazon EC2 instance).
It is important to be aware of the possible bottlenecks with each data load mechanism. When loading from S3, the process doesn’t depend on the client’s network performance, and the Aurora Writer node can load data as quickly as permitted by the instance type and other factors discussed in this document.
When using LOAD DATA LOCAL INFILE, parallel import operations require more compute capacity from the client machine. Loading from Amazon S3 reduces the likelihood of such client-side bottlenecks. If you use multiple client machines, you may end up in a situation similar to the one described for LOAD DATA FROM S3, in which the Aurora writer node's network throughput and engine-specific characteristics become the limiting factors for data load performance.
When loading from Amazon S3, the ideal file size for data load performance is between 100 MB and 1 GB. Performance can deteriorate with many small files due to the overhead of opening files, reading metadata, and closing files.
Another best practice is to set the number of parallel worker threads according to your instance's compute capacity. Setting this too high increases CPU usage without speeding up the data load. Ideally, you should account not only for the instance's number of vCPUs but also for any other workload running simultaneously on the cluster.
The set of best practices described here applies to both data load mechanisms.
We start with only one thread for reference and increase up to 16 threads, which is the total of vCPUs on the instance.
The preceding image shows how long the import takes with 1 to 16 threads and how much CPU is utilized. Ten threads is the optimal number: anything above that takes slightly longer and uses considerably more CPU. This happens because, with more processes trying to access the same object (table), there are more lock waits associated with that table. After a point, the benefits of additional threads are negated by the increasing contention.
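The parallel load itself is simply several LOAD statements running concurrently, each reading a different part of the data. A sketch of the pattern, with hypothetical file names and table, following the 100 MB–1 GB per-file guidance mentioned earlier:

```sql
-- Session 1: load the first part of the data set.
LOAD DATA FROM S3 's3://my-example-bucket/data/part-001.csv'
INTO TABLE sales
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- Session 2 (run concurrently in a separate connection):
LOAD DATA FROM S3 's3://my-example-bucket/data/part-002.csv'
INTO TABLE sales
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- ...one session per worker, up to the optimal thread count for the instance.
```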
File content order
Tables in MySQL are stored as organized data structures. The ordered b-tree that stores row data (fixed portion only) is called the clustered index. The clustered index is organized in primary key order if there is a primary key. Because of this, the order in which records are written to a table can change the data load performance.
If the rows are inserted in the order of Primary Key values, it becomes quicker to store them because each new row can be appended to the table without reorganizing the existing data. However, if new rows have primary key values lower than any of the existing rows, InnoDB must reorganize those existing rows to make room for new ones. This reorganization overhead not only lowers insert performance, but also reduces efficiency, leading to higher I/O consumption.
Having items in the file sorted in the Primary Key ascending order can further improve the performance of the data load process as shown below.
To summarize the preceding points, adding a primary key reduced load time by 76%, and ordering the input file in primary key order improved load speed by another 20%.
Combined, the load took nearly one-sixth of the time it took without a primary key and with a randomly ordered file. This happens because InnoDB can work faster when inserting items that are already ordered. In summary, for better performance, always prefer linearly increasing integer values for primary keys, and try to avoid random values or composite primary keys.
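If you control how the input file is generated, producing it in primary key order is often a one-line change. For instance, when exporting from another Aurora MySQL cluster, an ORDER BY on the primary key yields a file that is already in clustered-index order (bucket, table, and column names below are hypothetical):

```sql
-- Export rows sorted by primary key, so the subsequent
-- LOAD DATA FROM S3 inserts in clustered-index order.
SELECT id, order_ref, customer_id, created_at
FROM orders
ORDER BY id
INTO OUTFILE S3 's3://my-example-bucket/data/orders-sorted'
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n';
```

For files produced outside the database, a tool such as the Unix sort utility keyed on the primary key column achieves the same result.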
Number/design of Secondary Keys
When you insert rows into a table, the database synchronously updates the secondary indexes (any non-primary-key indexes) to keep them in sync with the new data. The more indexes the table has, and the more complex they are, the greater the overhead. Moreover, secondary key values often follow a different order than the primary key values, which makes index updates even more expensive due to out-of-order insertions.
The following example demonstrates data insert times with and without secondary indexes and with an increasing number of secondary indexes. The exact performance impact depends on the number, design, and composition of the indexes, but it can be substantial:
Here we see that the load time increased by 47% with one secondary index and doubled with just two secondary indexes.
With the secondary indexes potentially having a very high impact on data load performance, any time invested in optimizing and pruning them prior to loading data is time well spent. Consider removing those you no longer need, delete any duplicate and redundant ones, and optimize the remaining indexes by ensuring they contain the minimum number of columns required to satisfy read queries.
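A common pattern for large one-off loads is to drop nonessential secondary indexes before the load and recreate them afterward; building an index once over the loaded data is usually cheaper than maintaining it row by row during the load. A sketch with hypothetical table and index names:

```sql
-- Drop a secondary index before the bulk load...
ALTER TABLE sales DROP INDEX idx_customer;

-- ...run the LOAD DATA statement(s)...

-- ...then rebuild the index in one pass over the loaded rows.
ALTER TABLE sales ADD INDEX idx_customer (customer_id);
```

Weigh this against the downtime for queries that depend on the index while it is absent.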
Integrity checks (unique_checks/foreign_key_checks)
Another optimization is to disable the uniqueness and foreign key checks during the load:
set unique_checks = 0;
set foreign_key_checks = 0;
A 20 million row import was used to show the performance gain of these settings. The file was placed on Amazon S3, the target table has one primary key and one foreign key.
In the preceding images, you can see that the data load with default configurations finished in 10 minutes and 17 seconds (617 seconds), while the one with the checks disabled took 8 minutes and 50 seconds (530 seconds).
That's roughly 14% faster, which might not seem like much, but the benefit is amplified as you have more foreign keys.
If the import is coming from an existing database, you might already trust the integrity of the data, which doesn’t need to be re-validated during load. In such a case, you can safely disable the integrity checks to speed up the import. Use this optimization only if you can guarantee that your input files are consistent.
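Put together, a full load session with the checks disabled looks like the following sketch (the bucket and table names are hypothetical). The settings are session-scoped, but re-enabling them explicitly after the load is good hygiene, especially with connection pooling:

```sql
-- Disable integrity checks for this session only (trusted input data).
SET unique_checks = 0;
SET foreign_key_checks = 0;

LOAD DATA FROM S3 's3://my-example-bucket/data/sales.csv'
INTO TABLE sales
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

-- Re-enable the checks once the load is complete.
SET unique_checks = 1;
SET foreign_key_checks = 1;
```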
Conclusion
Different factors influence performance when loading data from files. To obtain the best results, take a holistic approach and consider them all. The choice of primary key and the set of secondary indexes deserve special attention because they have a great impact on both the overall performance of your data load and subsequent OLTP performance.
Under optimal circumstances, similar performance can be expected from both LOAD DATA FROM S3 and LOAD DATA LOCAL INFILE. However, importing data from Amazon S3 has many advantages: it allows for better scalability and resiliency, and it tends to be more cost-effective compared to block storage (Amazon EBS, for example). The ultimate decision depends on your use case, but consider the integration effort, any file replication needs, durability, and the data lifecycle strategies you must address.
In this post, we showed you the things that contribute to data ingestion load time and gave you numbers so that you can compare the relevance of each factor on your own workload and make informed choices on how to ingest data and how to optimize for performance.
To learn more about how to choose a good primary key, the consequences of field length for primary keys, and how these play into performance, data page density, and storage usage in MySQL, as well as planning I/O for Aurora MySQL and additional techniques for making the best use of your MySQL-based engine, go to the following links: