** This is where I would post the timeless Boromir meme, but I did not want to get into copyright issues … so I leave it up to your imagination **
But seriously, transitioning from a system specifically designed for Oracle databases is not a simple process, nor should it be taken lightly. There are consequences!
In the following sections, I address how cloud service provider (CSP) engineers gather workload data from an Exadata, then incorrectly interpret the gathered Exadata metrics and map them to their cloud shapes. We have seen CSP engineers’ estimates of throughput requirements undersized, in some instances, by more than 10x. This blog identifies their misrepresentations and explains how to estimate the resource requirements more accurately.
An Exadata Database Machine (DBM) represents Oracle’s flagship engineered system. The Exadata platform combines high-capacity hardware, Oracle database and specialized storage software. The storage software integrates the Oracle database, compute and storage into a highly efficient engineered system for Oracle database workloads. This unique combination results in a high-performance platform delivering scalability, resiliency and throughput for Oracle database workloads and applications. You can move an Oracle database from an Exadata, but you will sacrifice one of these characteristics.
I am constantly involved with database migrations from Exadata platforms to cloud environments. During these engagements, we are presented with metrics captured by various cloud providers, along with their perspective on moving the Oracle workload off an Exadata. While their arguments are interesting, they are mostly inaccurate and misleading to clients.
There are many features of an Exadata that make it unique and make workloads residing on an Exadata special. However, some of the most misunderstood, yet most prevalent, features are the following:
- Query Offloading (throughput and efficiency)
- Oracle Real Application Cluster (RAC)
- Physical Hardware Advances
As mentioned, there are many more features with respect to performance and consolidation; however, these three are the ones most commonly presented by the most popular cloud providers, and in most cases they are presented incorrectly. This is the first of a three-part blog pointing out Exadata migration challenges and how cloud professionals don’t always know what is best. I will not rehash the advantages of RAC or delve into the physical hardware features of the Exadata in this blog post. Let’s focus on query offloading and the performance impact it creates.
As I said, of all the features on the Exadata that are misrepresented (well, except for Storage Indexes), query offloading represents the largest challenge for the cloud service provider (CSP) engineers.
When looking at query offloading, as with any Oracle transaction, we look at throughput, mostly megabytes per second (MBPS). Oracle optimizes most operations, so trying to determine the performance of an Oracle database through IOPS is not as effective.
Anyway, back to query offloading and throughput…let’s take for example the following graph displaying throughput for a half-rack Exadata (4 compute, 7 storage cells).
On average, the throughput is around 2,500 MBPS, right? For an Exadata, that is correct; however, this average represents the absolute minimum throughput requirements for commodity storage for on-premises and cloud solutions. Why? Because the reported throughput reflects throughput AFTER offloading has occurred. Therefore, these values are not comparable to commodity storage in terms of one-to-one comparison. There are three primary considerations for this, all related to query offloading.
- First: In most cases, Exadata returns just the required result-set due to off-loading efficiency. That means, for conventional storage, the size returned is the full complement of all tables involved in a query to allow the compute node to perform the join process.
- Second and third: There is another set of metrics defining the offloading efficiency and effectiveness of a query. These two metrics indicate the percentage of transactions that COULD be offloaded and the percentage of transactions that ARE offloaded.
It is my experience that, generally, the CSP engineers completely discount the query offloading metrics, or at least indicate that most transactions are not offloaded. So, what is the truth and how is it interpreted? Well first, the actual average throughput on conventional storage is projected somewhere between 2,500 MBPS (2.5 GBPS) and 90,000 MBPS (90 GBPS). This is because we have to determine the number of transactions offloaded and then determine what the database would provide IF the query was not offloaded. But how close are we to either side of the spectrum between 2,500 MBPS and 90,000 MBPS? That is where efficiency and effectiveness provide the answer…
So, first we look at the percentage of the transactions that CAN be offloaded … consider the following graph presenting the metric – Offloading Eligibility.
It looks as though 70% – 75% of the transactions CAN be offloaded in this workload collection. So, that’s great, but how many are ACTUALLY offloaded? That comes from the following metric – Offloading Efficiency.
Well, there you go. On average, 80% of the transactions that are eligible for offloading are ACTUALLY offloaded. Let’s think about this for a minute: our average throughput is 2.5 GBPS, and that is the figure observed with 80% of the eligible transactions offloaded. What if we mapped that out on the throughput graph we presented before?
What if we overlaid multiple metrics to identify the potential throughput required if the transactions were not offloaded? Keep in mind that this is a hypothetical number, as some of these throughput requirements exceed the capabilities of most storage systems. But it shows the real impact of the Exadata offloading feature.
I have associated the above graph identifying the MBPS with a few other metrics to establish these hypothetical numbers. These metrics include the IO per Second, Offloading Eligibility and Offloading Efficiency. The following table provides the calculated values for the percent of transactions offloaded associated with each tagged period. In addition, the table presents the number of offloaded transactions and the resulting, projected, data throughput required.
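As a rough illustration of the arithmetic behind such a table, here is a minimal Python sketch. It is not the exact calculation used for the table: the `Period` fields, the function names, and especially the `amplification` parameter (an assumed ratio of data scanned to data returned for an offloaded request) are hypothetical. The sketch treats the share of offloaded transactions as eligibility multiplied by efficiency, and assumes that share of the observed throughput would grow by `amplification` if nothing were offloaded.

```python
from dataclasses import dataclass


@dataclass
class Period:
    """One tagged sample period. Field names are hypothetical."""
    label: str
    mbps: float         # observed throughput, measured AFTER offloading
    iops: float         # observed I/O requests per second
    eligibility: float  # fraction of transactions that COULD be offloaded (0..1)
    efficiency: float   # fraction of eligible transactions that WERE offloaded (0..1)


def offloaded_fraction(p: Period) -> float:
    """Share of all transactions actually offloaded."""
    return p.eligibility * p.efficiency


def offloaded_iops(p: Period) -> float:
    """Approximate number of offloaded requests per second."""
    return p.iops * offloaded_fraction(p)


def projected_mbps(p: Period, amplification: float) -> float:
    """Hypothetical throughput needed if nothing were offloaded.

    ASSUMPTION: the offloaded share of the observed throughput would be
    `amplification` times larger without offloading; the rest is unchanged.
    """
    if amplification < 1.0:
        raise ValueError("amplification must be >= 1")
    f = offloaded_fraction(p)
    return p.mbps * ((1.0 - f) + f * amplification)


# Illustrative values: 2,500 MBPS, ~72.5% eligibility and 80% efficiency come
# from the graphs discussed above; the IOPS figure and the 36x amplification
# (90,000 / 2,500) are assumptions made only for this example.
sample = Period("average", mbps=2500.0, iops=40000.0, eligibility=0.725, efficiency=0.80)
print(f"offloaded share: {offloaded_fraction(sample):.0%}")
print(f"offloaded IOPS:  {offloaded_iops(sample):,.0f}")
print(f"projected MBPS:  {projected_mbps(sample, amplification=36.0):,.0f}")
```

With an amplification of 1 the projection collapses to the observed 2,500 MBPS, and with every transaction offloaded at 36x it reaches 90,000 MBPS, so this simple model spans the same range discussed earlier. Real workloads will not follow a single amplification factor, so treat the result as a sizing signal rather than a measurement.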
As indicated, these projected throughput values represent the required throughput capacity of a commodity (non-Exadata) system. While this throughput does not exceed the capabilities of the current Exadata platforms, these projected throughput values do exceed most non-Exadata platforms. So, what happens for these transactions on a commodity system, such as the ones employed on most clouds?
High Throughput Impact
During these tests, we turned off the Exadata features to determine the impact of the increased throughput. When disabling the Exadata features, the changes affected the storage cell performance with respect to latency and bandwidth. What is not impacted is the physical configuration of the Exadata, including the InfiniBand bandwidth and throughput.
That being said, the following observations were made with respect to the compute node and database environment. I will put it in simple graphical format first …
Why does the CPU go down? I get that question the most, and the reason the CPU activity goes down is that the IO Wait activity increases.
Transaction Length obviously increases as the amount of data returned increases. The impact of the increased data and the transaction length includes the reduction of the available bandwidth. For the duration of the transaction, the throughput increased as more data traveled from the storage cell to the compute node. The reduction of bandwidth also affects concurrency, as multiple users are accessing the system and making requests. The additional users and requests increase contention for a limited amount of bandwidth.
With the increase in transaction length, the response time for the user also increased. Some of the batch processing and reports took more than twice as long. For example, a report that was taking 15 – 20 minutes would take more than an hour to complete due to the increased transaction time. Most user response times increased by 10 – 30 milliseconds.
The migration from an Exadata platform to a commodity platform is rarely as advertised. In most conversations I have had with cloud engineers, they capture the Exadata throughput and then map the same throughput to available shapes in their cloud environment. The Exadata efficiencies are either glossed over or discounted, which misleads the client and sets the migration up for failure.
The only CSP that offers Exadata is Oracle Cloud Infrastructure, so other CSPs either discount the Exadata capability or employ alternate tools to try to match the Exadata performance. These attempts add complexity to the solution, and I have not seen an alternate solution successfully achieve the performance capability of the Exadata. In future posts, we will discuss the hardware and RAC features of Exadata and the lack of these features on other CSPs.
When migrating from an Exadata to a non-Exadata platform, make sure the correct metrics are collected. As indicated, the following metrics were captured from the AWR repository in detailed values (remember, AWR reports round the metrics, which obscures the actual values):
- IO Megabytes per Second
- IO Transactions per Second
- Exadata Cell offloading Eligibility
- Exadata Cell offloading Efficiency
Additional metrics from the database include the Average Active Sessions to determine the impact of the bandwidth with respect to number of concurrent transactions.
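To make the link between bandwidth and concurrency concrete, the sketch below divides the available throughput evenly across the Average Active Sessions. Even sharing is a simplifying assumption, and the function names and sample numbers are hypothetical rather than taken from the AWR data.

```python
def per_session_mbps(available_mbps: float, average_active_sessions: float) -> float:
    """Bandwidth share per session, ASSUMING sessions share throughput evenly."""
    if available_mbps <= 0:
        raise ValueError("available_mbps must be positive")
    # Fewer than one active session on average still means a lone session
    # can use the full bandwidth, so clamp the divisor at 1.
    return available_mbps / max(average_active_sessions, 1.0)


def transfer_seconds(data_mb: float, available_mbps: float,
                     average_active_sessions: float) -> float:
    """Time to move `data_mb` megabytes at one session's share of the bandwidth."""
    return data_mb / per_session_mbps(available_mbps, average_active_sessions)


# Hypothetical example: a 4,000 MBPS storage ceiling shared by 16 active sessions.
share = per_session_mbps(4000.0, 16.0)
print(f"per-session share: {share:.0f} MBPS")
print(f"time to read 50,000 MB: {transfer_seconds(50000.0, 4000.0, 16.0):.0f} s")
```

The point of the exercise is that the same storage ceiling looks very different at 4 active sessions than at 40, which is why Average Active Sessions belongs next to the throughput metrics when sizing a target platform.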
For cloud migrations, our process is to create cloud “patterns”, identifying the required resources for each application. Therefore, when a cloud engineer requests application and database resource requirements, the patterns and requirements have already been established and do not allow an engineer to mislead or misinterpret the effectiveness of the Exadata platform.