November 2, 2023

Oracle RAC: Exadata’s often overlooked Gold Mine

By htullis

Okay – I was reminded last month that I was “old”! I was moving some couches (electric recliners) and hurt my back – who knew? I used to work for a moving company, so the doctor illustrating the proper lifting technique did not help my ego!

But the real reason I was reminded I was old – someone asked when I started working with Oracle Real Application Clusters (RAC). Of course, RAC is the foundation and forgotten benefit of Exadata, and it is seldom discussed during cloud migration planning. So, in this second installment of my Exadata series, I would like to remind people of the benefits of RAC and return to the question: what should you consider when migrating from Exadata?

So yes, I am old, although I like to say “seasoned”. I started working on a pre-release of Oracle Parallel Server (OPS) with Oracle 6.0.36 at Data General. The mixture of Data General AViiON servers and CLARiiON disk arrays was a compelling platform for OPS. The “cluster wars” of 1993 between Data General, Sequent and Pyramid were awesome times and great events at Oracle Open World. But the evolution of Oracle RAC from OPS is remarkable and often forgotten. RAC is also a feature that is either overlooked or not discussed by most Cloud Service Provider (CSP) engineers in sizing or migration discussions.

Before we begin, we need to make sure we are using the correct terms, as I have heard different terms used for RAC systems. The correct terminology for RAC is “Highly Resilient System” … so here are some terms…

Fault Tolerant System – Able to continue processing, regardless of the number of failures. No outage, connection loss or data loss.
Fault Resilient System – Able to continue processing in the event of a single system failure (99.999% availability), including disk, network, CPU or host. In-flight transactions are failed to another system, while there is no data loss.
High Availability System – Failover of processing in the event of a system failure. A highly available system (99.99% availability) will fail to a separate system with minimal outage and potential data loss.
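To put the two availability percentages above into perspective, here is a small sketch that converts each one into a yearly downtime budget. It assumes a 365-day year and treats availability as simple uptime divided by total time; the function name is mine, for illustration only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the yearly downtime budget implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    for label, pct in [("Fault resilient", 99.999), ("High availability", 99.99)]:
        minutes = downtime_minutes_per_year(pct)
        print(f"{label} ({pct}%): {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.999% works out to a little over five minutes per year, while 99.99% allows a little under an hour.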

Between Oracle 6.0.37 and Oracle 8, the Oracle cluster solution was about resiliency (fault resilient system). Some of these installations were a nationwide rail system and an emergency 911 system, which require high transactions with aggressive uptime requirements. However, this is a far cry from where we are now. The following bullets show the evolution of OPS to RAC and why we should question the migration of RAC environments to a non-RAC CSP.

Oracle Parallel Server (Oracle 6.0.36 to Oracle 8): Cluster Consistency via distributed cluster locks transferred through the cluster interconnect. Individual hardware vendors provided the Distributed Lock Manager (DLM) as part of the clusterware installation.
Oracle RAC (Oracle 9i – 11.1) – Oracle developed DLM with Parallel Cache Manager (cached locks) for cluster consistency, improved reliability and reduced lock latency (performance).
Oracle RAC (Oracle 11.2 – present) – Oracle enhanced distributed cluster system maintaining cluster consistency and reduced data latency. Data shared between nodes through cluster interconnect to reduce storage dependency on cluster resident data.

That is 30 years of evolution, from 1993 to 2023 – I told you I was old! The evolution is simple: the RAC solution has gone from a fault resilient platform to a high-performance platform. The RAC interconnect transitioned from slowly transporting lock and cluster health information to rapidly transporting lock, health and data between nodes. The introduction of Oracle Exadata exploited the Oracle 11.2 data transfer feature with high performance interconnects.

So back to the initial question … What should you consider when migrating from Exadata? With this installment, the benefits of Oracle RAC should definitely be considered when migrating from Exadata. These benefits are scalability, performance, and application resiliency. I don’t mean to pick on people, but I have not heard a correct description of the benefits of Oracle RAC from a CSP engineer over the past 5 years. I attribute the misstatement of RAC benefits to a lack of understanding, as I hope they are not leaving this information off on purpose.

Scalability

Unfortunately, horizontal scalability is not mentioned much during cloud migrations. Most of the database platforms focus on smaller shapes or potentially different architectures. We have seen Google Cloud Platform (GCP) produce impressive scalability with BigQuery, but we experienced an impact to latency at the time of the test. So, for applications and databases requiring low latency activity, Exadata provides an optimal and scalable solution.

As presented in the following diagram, we executed the same benchmark four times against a NON-EXADATA Oracle RAC system – each time adding a node to the cluster. We utilized the default settings for the SCAN (Single Client Access Name) – in other words, we did not isolate application services among nodes to reduce lock propagation. Simply out of the box, with no tuning, the application maintains a consistent slope from one node to four nodes, providing impressive scalability. The largest differences between the configuration of our test platform and Exadata are the speed of the interconnect (10 GB/sec) and the storage access (16 GB/sec) via Fibre Channel connections.

The scalability associated with RAC is difficult to achieve in a non-cluster architecture. The loosely coupled architecture of Oracle RAC integrated with the high-performance architecture of the Exadata delivers unique Oracle database performance, resiliency and scalability.
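One way to quantify a "consistent slope" is scaling efficiency: the throughput measured with N nodes divided by N times the single-node throughput. The sketch below is only an illustration. The benchmark numbers in it are made up, since the actual results are in the diagram and not reproduced here.

```python
def scaling_efficiency(throughputs):
    """Given throughput measured with 1, 2, 3, ... nodes, return the
    efficiency at each node count relative to perfect linear scaling."""
    base = throughputs[0]
    return [t / (n * base) for n, t in enumerate(throughputs, start=1)]


if __name__ == "__main__":
    # Hypothetical transactions/second for 1 to 4 nodes (NOT the real benchmark data).
    measured = [1000, 1900, 2750, 3600]
    for nodes, eff in enumerate(scaling_efficiency(measured), start=1):
        print(f"{nodes} node(s): {eff:.0%} of linear scaling")
```

An efficiency that stays close to 100% as nodes are added is what a straight line on the scalability chart represents.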

I/O Performance

The Exadata key performance indicator is throughput, measured in MBPS, which is a combination of “pipe” speed and “pipe” size. Most CSP engineers point to IOPS as a key performance indicator; however, Oracle, since Oracle 6, has focused on the reduction of IOPS through request efficiencies. These efficiencies include “piggyback writes” for non-committed transactions, read-ahead for sequential transactions and others. Most storage array providers still include read-ahead optimizations for database transactions – also reducing the reported number of IOPS. With the implementation of Exadata off-loading to storage cells, the IOPS reporting is not a valid reporting measure for database / system performance requirements. Leveraging Oracle’s efficiencies with requests per second, the throughput provides a more realistic resource requirement for performance.
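The relationship between IOPS and throughput is simple arithmetic: throughput is the request rate multiplied by the average request size. The sketch below uses hypothetical request rates and sizes (they are not measurements from any system discussed here) to show how a workload with far fewer, larger requests can move much more data.

```python
def throughput_mib_per_s(iops: float, request_kib: float) -> float:
    """Throughput in MiB/s for a given request rate and average request size in KiB."""
    return iops * request_kib / 1024


if __name__ == "__main__":
    # Hypothetical workloads: many small requests versus few large requests.
    small = throughput_mib_per_s(iops=40_000, request_kib=8)
    large = throughput_mib_per_s(iops=2_500, request_kib=1024)
    print(f"40,000 IOPS x 8 KiB  = {small:.1f} MiB/s")
    print(f" 2,500 IOPS x 1 MiB  = {large:.1f} MiB/s")
```

This is why an IOPS figure on its own says little about the resources a database workload actually needs.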

The following graph represents the throughput requirements for a half-rack Exadata X7 supporting a client’s data warehouse. While the average throughput hovers around 3.5 GB/second there are some considerable spikes for 15GB/second throughput and above.

One of the challenges of performance is reporting. Most Oracle performance analysis is performed through AWR reports provided by the client over a period of time. Most CSP engineers treat these reports as gospel because they don’t understand the data provided. The Oracle AWR report provides a glimpse of the workload for “health” reasons but does not provide beneficial reporting for workload sizing or analysis. The reason is simple: the AWR report averages the data for all “peaks” and “valleys” based on the duration of the report.

The AWR report for the same system and same timeline above paints a totally different picture of the resource requirements for throughput. The following graph presents the data provided by an AWR report. It shows the same 3.5GB/second average, but it misrepresents the 15GB/second peak as a 5 – 8 GB/second throughput requirement. Why is this so misleading? The duration of the AWR report is 30 days and the “collection” reporting interval is limited to minute data points. Therefore, there are fewer data points presented over a considerable period, and the report renders them as a smooth graph.
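The smoothing effect is easy to reproduce. The sketch below averages a synthetic series into coarser buckets and compares the peaks. The data is invented for illustration, and the bucketing is a generic stand-in for interval averaging, not a model of how AWR itself aggregates.

```python
def bucket_averages(samples, bucket_size):
    """Average consecutive groups of `bucket_size` samples (the last group may be shorter)."""
    averages = []
    for start in range(0, len(samples), bucket_size):
        bucket = samples[start:start + bucket_size]
        averages.append(sum(bucket) / len(bucket))
    return averages


if __name__ == "__main__":
    # Hypothetical data: 60 samples at 3 GB/s with two short 15 GB/s spikes.
    samples = [3.0] * 60
    samples[10] = 15.0
    samples[40] = 15.0

    print("overall average (GB/s):", sum(samples) / len(samples))
    print("true peak (GB/s):", max(samples))
    print("peak after 15-sample averaging (GB/s):", max(bucket_averages(samples, 15)))
```

With this data the true peak is 15 GB/s, but after averaging in groups of 15 the highest value left is 3.8 GB/s. Sizing from the averaged series would miss the spikes entirely.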

There are two primary takeaways on I/O performance and Oracle workloads. First, throughput (MBPS) is the key performance indicator – NOT request rate (IOPS). This is because Oracle optimizes IO to reduce “requests” and increase data flow. Second, reporting is important for workload sizing and analysis. While an AWR report is good for trends and optics, it is not sufficient when presenting Exadata workloads targeted for non-Exadata environments.

Interconnect Performance

But what about RAC? Remember, RAC is the goal of this discussion, so how does RAC factor into performance? Among other things, RAC performance is greatly dependent on the interconnect, which is a private network between the compute nodes. In the early days, the interconnect supported the cluster locking processes and cluster heartbeat. However, with the introduction of the Exadata and Oracle 11.2, the implementation and functionality of the interconnect changed.

With the introduction of InfiniBand for storage and interconnect, the RAC interconnect became a transport for data. When monitoring RAC implementations, we now see more traffic on the interconnect. The following graph presents the interconnect traffic on a half-rack Exadata, which has four (4) compute nodes.

The spikes and dynamic throughput indicate more than just lock traffic traversing the interconnect. As more data is sent and received by nodes, the impact on storage systems also becomes evident.

On a recent engagement, a three (3) node RAC environment based on commodity hardware was experiencing issues. Through analysis, I observed a few things:

The environment was experiencing very high IO waits for the number of active sessions. Due to the density of application schemas in multiple databases (small database footprint), the IO wait time spikes are consistent on all databases within the associated cluster – as well as databases on other clusters. [This is a standard consolidation configuration on Oracle, which is not bad – but there are limitations]

The amount of traffic traversing the interconnect correlates with the IO activity on the cluster. The following graph shows the interconnect activity spike prior to IO waits on disk. Potentially, the request from alternative instances occurs prior to IO transactions to disk.

As the following graph presents, the IO requests occur after activity is performed by the interconnect. Some of this is standard locking processes; however, there is additional activity and data traffic that occurs. As you can see, IO activity begins shortly after the interconnect spikes. I will want to investigate it more, but my hypothesis is the interconnect activity occurs first – prior to going to disk. This is consistent with NUMA features and I am making the assumption the cluster process remains consistent.
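One way to test the "interconnect first, disk second" hypothesis is to check at which time lag the two series line up best. The sketch below does this with a plain Pearson correlation over shifted copies of the data. The two series are synthetic, and the function names are mine; real samples would come from whatever monitoring tool captured the graphs.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series has no meaningful correlation
    return cov / (var_x * var_y) ** 0.5


def best_lag(leader, follower, max_lag):
    """Return (lag, scores): the shift, in samples, at which `follower`
    correlates best with earlier values of `leader`."""
    scores = {}
    for lag in range(max_lag + 1):
        n = len(leader) - lag
        scores[lag] = pearson(leader[:n], follower[lag:lag + n])
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Synthetic data: IO wait repeats the interconnect pattern two samples later.
    interconnect = [1, 1, 8, 1, 1, 1, 9, 1, 1, 1, 1, 7, 1, 1]
    io_wait = [1, 1] + interconnect[:-2]
    lag, _ = best_lag(interconnect, io_wait, max_lag=4)
    print("IO wait follows interconnect activity by", lag, "samples")
```

A best lag greater than zero supports the idea that interconnect activity leads the disk IO, although correlation alone does not prove the cause.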

Summary

As detailed above, the implementation of RAC consists of 20+ years of evolution to get where we are now. Although most CSP engineers like to frame RAC as a High Availability system, the reality is that RAC is a fault resilient platform with performance features included. These performance features resemble a NUMA architecture where the Parallel Cache is interrogated prior to an IO request – thus reducing IO contention and leveraging a faster interconnect.

As Oracle continues to perform as a consolidation platform, both on-premises and in the cloud, it is important to note these features. The consolidation of schemas reduces the reliance on CPU; however, you must ensure the IO features are available. As indicated, the Exadata platform provides a distributed IO platform built to reduce incoming and outgoing IO processing. These features are not available from a general cloud service provider – such as Azure or AWS. While some features of Google are general cloud, I also consider Google a specialized cloud platform, like Oracle cloud, for large data volumes with BigQuery and other features.

As more things move to cloud, my recommendation is to improve application and data agility. This includes reducing the number of application schemas per database – which allows the client to decide on the appropriate cloud based on data requirements and governance. I am a huge fan of distributed data and data mesh architectures, which bring the best technology to the client domain. Therefore, understanding application and data architecture is more than a simple infrastructure decision.

February 16, 2023

One DOES NOT simply “Lift and Shift” from Exadata

By htullis

** this is where I would post the timeless Boromir meme, but I did not want to get into copyright issues … so I leave it up to your imagination **

But seriously, transitioning from a system specifically designed for Oracle databases is not a simple process. Nor should it be taken lightly. There are consequences!

In the following sections, I address how cloud service provider (CSP) engineers gather workload data from an Exadata and then incorrectly interpret it and map it to their shapes based on the gathered Exadata metrics. We have seen the CSP engineer’s estimations of throughput requirements undersized, in some instances, by more than 10x. This blog identifies their misrepresentations and explains how to more accurately estimate the resource requirements.

An Exadata Database Machine (DBM) represents Oracle’s flagship engineered system. The Exadata platform combines high-capacity hardware, Oracle database and specialized storage software. The storage software integrates the Oracle database, compute and storage into a highly efficient engineered system for Oracle database workloads. This unique combination results in a high-performance platform delivering scalability, resiliency and throughput for Oracle database workloads and applications. You can move an Oracle database from an Exadata, but you will sacrifice one of these characteristics.

I am constantly involved with database migrations from Exadata platforms to cloud environments. During these engagements, we are presented with metrics captured by various cloud providers – with their perspective on moving the Oracle workload from an Exadata. While their arguments are interesting, they are mostly inaccurate and misleading to clients.

There are many features to an Exadata that make it unique and make workloads residing on an Exadata special. However, some of the most misunderstood, yet most prevalent, features are the following:

Query Offloading (throughput and efficiency)
Oracle Real Application Cluster (RAC)
Physical Hardware Advances

As mentioned, there are many more features with respect to performance and consolidation; however, these three features are the ones commonly presented by the most popular cloud providers and, in most cases, they are presented incorrectly. This is the first of a three-part blog to point out Exadata migration challenges and how cloud professionals don’t always know what is best. I will not rehash the advantages of RAC or delve into the physical hardware features of the Exadata in this blog post. Let’s focus on query offloading and the performance impact it creates.

Query Offloading

As I said, of all the features on the Exadata that are misrepresented (well, except for Storage Indexes), query offloading represents the largest challenge for the cloud service provider (CSP) engineers.

When looking at query offloading, as with any Oracle transaction, we look at throughput – mostly Megabytes per Second (MBPS). Oracle optimizes most operations, so trying to determine the performance of an Oracle database through IOPS is not as effective.

Anyway, back to query offloading and throughput…let’s take for example the following graph displaying throughput for a half-rack Exadata (4 compute, 7 storage cells).

On average, the throughput is around 2,500 MBPS, right? For an Exadata, that is correct; however, this average represents the absolute minimum throughput requirements for commodity storage for on-premises and cloud solutions. Why? Because the reported throughput reflects throughput AFTER offloading has occurred. Therefore, these values are not comparable to commodity storage in terms of one-to-one comparison. There are three primary considerations for this, all related to query offloading.

First: In most cases, Exadata returns just the required result-set due to off-loading efficiency. That means, for conventional storage, the size returned is the full complement of all tables involved in a query to allow the compute node to perform the join process.

Second and third: There is another set of metrics defining the offloading efficiency and effectiveness of a query. These two metrics indicate the percentage of transactions that COULD be offloaded and the percentage of transactions that ARE offloaded.

It is my experience that, generally, the CSP engineers completely discount the query offloading metrics, or at least indicate that most transactions are not offloaded. So, what is the truth and how is it interpreted? Well first, the actual average throughput on conventional storage is projected somewhere between 2,500 MBPS (2.5 GBPS) and 90,000 MBPS (90 GBPS). This is because we have to determine the number of transactions offloaded and then determine what the database would provide IF the query was not offloaded. But how close are we to either side of the spectrum between 2,500 MBPS and 90,000 MBPS? That is where efficiency and effectiveness provide the answer…

So, first we look at the percentage of the transactions that CAN be offloaded … consider the following graph presenting the metric – Offloading Eligibility.

It looks as though 70% – 75% of the transactions CAN be offloaded in this workload collection. So, that’s great, but how many are ACTUALLY offloaded? That comes from the following metric – Offloading Efficiency.

Well, there you go. On average, 80% of the transactions that are eligible for offloading are ACTUALLY offloaded. Let’s think about this for a minute: our average throughput is 2.5GBPS, and that figure already reflects 80% of the eligible transactions being optimized. What if we mapped out the throughput on the graph we presented before?

What if we overlaid multiple metrics to identify the potential throughput resources required if the transactions were not offloaded? Keep in mind, this is a hypothetical number, as some of these throughput requirements exceed the capabilities of most storage systems. But this shows the real impact of the Exadata offloading feature.

I have associated the above graph identifying the MBPS with a few other metrics to establish these hypothetical numbers. These metrics include the IO per Second, Offloading Eligibility and Offloading Efficiency. The following table provides the calculated values for the percent of transactions offloaded associated with each tagged period. In addition, the table presents the number of offloaded transactions and the resulting, projected, data throughput required.
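The arithmetic behind the percent of transactions offloaded can be sketched as eligibility multiplied by efficiency. The projection step is harder, because the exact formula behind the table is not spelled out here. The sketch below therefore uses a simple model of my own: it assumes the offloaded share of the reported throughput would have been `expansion_factor` times larger without offloading. Both that parameter and the model are assumptions for illustration, not the method used to build the table.

```python
def offloaded_fraction(eligibility: float, efficiency: float) -> float:
    """Share of all transactions actually offloaded (both inputs are 0..1)."""
    return eligibility * efficiency


def projected_throughput_mbps(reported_mbps, eligibility, efficiency, expansion_factor):
    """Hypothetical projection of non-offloaded throughput.

    Assumption: the offloaded share of the reported throughput would have
    been `expansion_factor` times larger if storage had returned full data.
    """
    f = offloaded_fraction(eligibility, efficiency)
    return reported_mbps * ((1 - f) + f * expansion_factor)


if __name__ == "__main__":
    # 75% eligible and 80% of those offloaded, as read from the graphs above.
    print(f"offloaded fraction: {offloaded_fraction(0.75, 0.80):.0%}")
    # The expansion factors below are made-up values to show the sensitivity.
    for factor in (1, 10, 30):
        mbps = projected_throughput_mbps(2500, 0.75, 0.80, factor)
        print(f"expansion x{factor}: {mbps:,.0f} MBPS")
```

With an expansion factor of 1 the projection equals the reported 2,500 MBPS, which matches the idea that the reported figure is the absolute minimum. How far above that the real requirement sits depends on how much data the storage cells were filtering out.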

As indicated, these projected throughput values represent the required throughput capacity of a commodity (non-Exadata) system. While this throughput does not exceed the capabilities of the current Exadata platforms, these projected throughput values do exceed most non-Exadata platforms. So, what happens for these transactions on a commodity system, such as the ones employed on most clouds?

High Throughput Impact

During these tests, we turned off the Exadata features to determine the impact of the increased throughput. When disabling the Exadata features, the changes affected the storage cell performance with respect to latency and bandwidth. What is not impacted is the physical configuration of the Exadata, including the InfiniBand bandwidth and throughput.

That being said, the following observations were made with respect to the compute node and database environment. I will put it in simple graphical format first …

Why does the CPU go down? I get that question the most. CPU activity goes down because IO wait activity increases: sessions spend more time waiting on storage and less time on CPU.

Transaction Length obviously increases as the amount of data returned increases. The impact of the increased data and the transaction length includes the reduction of the available bandwidth. For the duration of the transaction, the throughput increased as more data traveled from the storage cell to the compute node. The reduction of bandwidth also affects concurrency, as multiple users are accessing the system and making requests. The additional users and requests increase contention for a limited amount of bandwidth.
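The contention effect can be sketched with simple division: the time to move a result is the data volume divided by the session's share of the available bandwidth. The model below assumes the bandwidth is shared evenly among concurrent sessions, which is a simplification, and all of the numbers are hypothetical.

```python
def transfer_seconds(data_gb: float, link_gb_per_s: float, concurrent_sessions: int) -> float:
    """Seconds to move `data_gb`, assuming the link is split evenly
    among `concurrent_sessions` (a simplifying assumption)."""
    share = link_gb_per_s / concurrent_sessions
    return data_gb / share


if __name__ == "__main__":
    # Hypothetical: a 5 GB/s path between storage and compute.
    print("offloaded result, 1 session   :", transfer_seconds(10, 5, 1), "s")
    print("offloaded result, 10 sessions :", transfer_seconds(10, 5, 10), "s")
    print("full tables, 10 sessions      :", transfer_seconds(100, 5, 10), "s")
```

More data per transaction and more concurrent sessions multiply each other, which is why transaction length grows so quickly once offloading is turned off.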

With the increase of transaction length, the response time for the user also increased. Run times for some of the batch processing and reports more than doubled. For example, a report that was taking 15 – 20 minutes would take more than an hour to complete due to the increased transaction time. Most user response times increased by 10 – 30 milliseconds.

Conclusion

The migration from an Exadata platform to a commodity platform is rarely as advertised. In most conversations I have had with cloud engineers, they capture the Exadata throughput and then map the same throughput to available shapes in their cloud environment. The Exadata efficiencies are either glossed over or discounted, which misleads the client and sets the migration up for failure.

The only CSP that offers Exadata is Oracle Cloud Infrastructure, so other CSPs either discount the Exadata capability or employ alternate tools to try to match the Exadata performance. These attempts add complexity to the solution, and I have not seen an alternate solution successfully achieve the performance capability of the Exadata. In future posts, we will discuss the hardware and RAC features of Exadata and the lack of these features on other CSPs.

When migrating from an Exadata to a non-Exadata platform, make sure the correct metrics are collected. As indicated, the following metrics were captured from the AWR repository in detailed values (remember, AWR reports round the metrics, which obfuscates the actual values):

IO Megabytes per Second
IO Transactions per Second
Exadata Cell offloading Eligibility
Exadata Cell offloading Efficiency

Additional metrics from the database include the Average Active Sessions to determine the impact of the bandwidth with respect to number of concurrent transactions.
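Once the detailed samples are exported, it helps to summarize each metric by more than its average. The sketch below reports the mean, the 95th percentile and the maximum for one metric. It assumes the samples are already available as a list of numbers, and the values shown are invented for illustration.

```python
from statistics import mean, quantiles


def summarize(samples):
    """Return mean, 95th percentile and maximum for one metric's samples."""
    # With n=20 there are 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(samples, n=20, method="inclusive")[-1]
    return {"mean": mean(samples), "p95": p95, "max": max(samples)}


if __name__ == "__main__":
    # Hypothetical IO throughput samples in GB/s.
    io_gbps = [2.0, 2.5, 3.0, 3.5, 4.0, 15.0]
    for name, value in summarize(io_gbps).items():
        print(f"{name}: {value:.2f} GB/s")
```

Comparing the mean against the upper percentile and the maximum makes it obvious when an average is hiding the peaks that the target platform still has to absorb.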

For cloud migrations, our process is to create cloud “patterns”, identifying the required resources for each application. Therefore, when a cloud engineer requests application and database resource requirements, the patterns and requirements have already been established, which does not allow an engineer to mislead or misinterpret the effectiveness of the Exadata platform.

September 16, 2018

Are Requirements Needed in the Cloud?

By htullis

Back from vacation … wow, it always seems to take longer to recover from vacation than the last time, but had a great trip! I saw some amazing places and did some fun things. We got to visit the place they filmed Jurassic Park, but did not see any of the dinosaurs they left behind. Guess they were sleeping.

Right after vacation, I went to a conference on Healthcare technology. During one session, a consultant exclaimed – “We didn’t define requirements for moving to the cloud, because we weren’t going to submit an RFP.” First, I was impressed by the consultant’s score of the holy grail – consulting work without an RFP. Then, I was amazed by the comment. I was shocked by the number of other comments indicating that companies did not think requirements were needed for cloud work.

Let me ask a question … when you go to buy a car, do you just tell the salesperson “any car would be fine”? Yes, I am equating a car dealership with a cloud provider, but think about it…the dealership has different car types (patterns if you will), different performance levels and different capacities. So again, when you are planning to buy a new car, don’t you at least have an idea of what you want to buy? Things like how fast you want the car to go, how many people you want to place in the car and the reliability features of the car.

So, the next question – is buying a car more important than your company’s systems in the cloud?

Got your attention? I hope so, because requirements are more important for cloud implementations than they are for on-premises implementations! Remember, you are relying on someone else to run your business. Shouldn’t you have input on the way it is done?

I am constantly amazed that clients entering the cloud think they don’t have a choice about what they want. A main focus of our cloud migration process is the definition of requirements for the systems while in the cloud.

Why define requirements for the implementation, migration and steady state support for cloud implementations? The same reason you choose the type of car you want … in your mind, you are thinking you want this type of car…

[Image: GoodCar]

Unfortunately, if you have not communicated correctly with the sales person … you may end up with …

[Image: BadCar]

Seriously though, the requirements support multiple phases of the migration to the cloud. Keeping with the car analogy, the following table presents these phases and some of the areas where requirements impact the implementation of a cloud solution.

Requirements driven results	Car Analogy
Supports the appropriate selection of cloud providers (not just for RFP purposes)	What type of car are you going to purchase and where will you purchase?
Defines success criteria of the implementation including patterns and process	What is your experience with the buying and drive-off process?
Process and SLA surrounding the migration process	Do you know what the accessories do and how to use them?
Success and management surrounding post implementation support (steady state)	Is the car performing and where will you get it serviced?

Without requirements (explicit or implicit), how do you declare success or failure of a cloud implementation? The requirements support the identification of needed resources and processes.

These components define the foundation of the cloud platform and are the differing factor between a cloud deployment and a hosting deployment.

A cloud deployment leverages repeatable processes and agile implementation and management.

So, the next time a consultant says “we don’t need requirements”, you should look for another consultant. Implementations without requirements lead to extended schedules and poor experiences.

March 28, 2018

Cloud Preparation – More than Technology?

By htullis

Cloud is easy, right? Someone builds a data center, runs cables, slaps some hardware and software together and *poof*, you are running a cloud. If a company wants to join, they just sign some documents and push their applications into the deep blue and everyone is happy!

Of course, I am being sarcastic. When we engage a company about a failing cloud initiative or a new cloud initiative, we generally hear the same thoughts. Cloud is easy, and anyone can put one together. Even the companies / teams that have failed after 2+ years with nothing to show wonder why they have not delivered.

Over the past 6 years, I have been focused on deploying private cloud implementations for multiple clients. Although most have been on engineered or converged systems, there have been multiple with physical and / or virtual systems over many technologies. These engagements include either “bail-outs” of failed implementations from other big-name firms or a complete end-to-end deployment. For the failed engagements, we generally start with a system health check to identify the challenges. Our assumption was that cloud providers understand the process for on-boarding a client – which turned out to be false.

Over these engagements, we have implemented a methodology that has been very successful in the preparation and rapid delivery of cloud implementations. Over the next set of blog posts, I will discuss the methodology and some of the common errors that we have seen.

“We generally start everything with a health check – because everyone is ready for the cloud … right?”

Technology is fun and cloud technology is more fun. The issue that we find very often is that people understand cloud technology, but they don’t understand the cloud. Whether it is a public cloud or a private cloud, most people we talk with understand the benefits of implementing a cloud. The excitement over the benefits often overwhelms corporate leaders with promises of cost savings and rapid deployments. But, when we ask them what they want – generally, we get a blank stare. In fact, if you read my last blog, we get a lot of blank stares at the beginning. In reality, we find the client is ill prepared for the migration to the cloud, either through expectation or level of effort. The blank stare presents the moment when reality becomes apparent.

As a good friend and colleague told me today – “We need to learn better English…”. Perhaps that will remove the blank stare syndrome.

The blank stares will never cease, as I believe the disconnect is in the expectation, not the comprehension. Regardless of the topic, technology is supposed to be simple. Bottom line, cloud is simple if you are prepared. It is just like riding a bike! But just like riding a bike in today’s world, you have to know what you want and how you are going to use the bike. Then you have to be prepared for the hills, the elements, the virtual riding – so really, is cloud simple?

This is a picture of my bike, in my house “torture chamber”. I can guarantee that it is more complicated than just getting on and pedaling!

So, over the next few weeks we will talk about being prepared for the cloud. Whether you are going public cloud or private cloud, the preparation is nearly the same. If you are wondering “why do I need to prepare my environment for a cloud?”, please read on; it will change the way you deploy and save money.

Here is what I will discuss over the next few weeks …

“Are those requirements? There’s no requirements in the cloud!”
Does a reference architecture really help?
Patterns, not just for sewing anymore!
Service Catalog and Christmas List – hope eternal
Automation, Supply Chain and physics eternal

Technology is fun, and the cloud is fun. As an infrastructure engineer, database architect and application architect, I spend more time debating with myself than other people. They are good debates and I generally stop the discussion before they get violent. I hope we can have some fun as we go through this cloud journey and I welcome comments and thoughts!

March 2, 2018

Defining Disaster – What is missing in Disaster Recovery

By htullis

Over the past year, I have been involved in numerous Disaster Recovery (DR) engagements – including review and implementations. When I start, my first question is NOT “What is your RTO/RPO?” (although that is important).

My first question is “What is your goal?” It is amazing the blank stares I get from this question. Obviously, the thoughts are probably “Idiot, my goal is disaster recovery!” But, I then explain that there are many different types of disasters and many types of disaster recovery options. The most entertaining response I get when I ask this question is “Well, someone told us we needed a disaster recovery solution”. Of course, this is normally in Florida during hurricane season …

My mind drifts off to my favorite Jimmy Buffett song – “Tryin’ To Reason With Hurricane Season”

Anyway, I thought for a fun refresher, I would throw out the discussion of WHY we do a Disaster Recovery implementation, NOT HOW do we do a DR implementation. Of course, determining how to do a DR implementation is easy once we determine why we want to do a DR solution.

Seriously, most teams jump into the technology solution before considering the requirements or the goals of the solution. So let’s go and have some fun!

Goals of Disaster Recovery

Most people know me as an advocate of flexibility and a DR implementation is no different. The company DR goals should be defined by the type and scope of the disaster. Some types of disasters are obvious, including natural and man-made disasters resulting in a “smoking hole”. However, other failures may also require a disaster declaration or at least utilize the environment for recovering from an incident. The following graphic provides a focus on the goals of a disaster recovery solution.

[Graphic: goals of a disaster recovery solution]

DR implementations typically take an Enterprise focus, maintaining a secondary site for the production site. Most of the time, the disaster recovery site is a “lights out” location and is rarely tested. This implementation only satisfies a compliance person or a government compliance check box, which is important. However, how does this support the business objective or the customer’s experience?

The disaster recovery goal should focus on a) How does this support our customer experience and b) how does this allow us to drive business?

After all, the resources for disaster recovery are expensive. Also, the management of data centers and environments requires focus and support of tools and personnel. Why not make these environments work for the company as well?

The following graphic represents a goal-oriented focus on the disaster recovery solution. As the goals become more focused and flexible, the beneficiary transitions from internal operations to client experience and business focus.

[Graphic: scope of disaster recovery goals]

Expanding the goals of the disaster recovery implementation for flexibility removes the self-serving goal of checking a box. I have reviewed DR solutions which force customer applications to fail over to disaster recovery due to a business application failure. Or worse, I have evaluated a client environment that places each customer in a separate environment for isolation; however, the DR plan requires all customers to transition to DR in the event of a single customer outage.

Isolating applications into groups supporting distinct applications or customer installments is great. Providing a “Site Switching” strategy for each application group is excellent and improves the customer experience and confidence! The isolation of applications, databases and incidents provides an effective solution for disaster recovery. Moving a mountain during a disaster gets noticed, but causing a customer a multi-day outage due to a deleted table gets noticed even more.
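To make the scoping idea concrete, here is a minimal Python sketch. It assumes a hypothetical inventory that maps each application group to the components it depends on; the group and component names are invented for illustration and do not come from any real environment. The only point is that an incident should switch the groups that depend on the failed component, not every customer.

```python
from typing import Dict, List, Set

# Hypothetical inventory: application group -> components it depends on.
# All names are illustrative assumptions, not from a real environment.
APP_GROUPS: Dict[str, Set[str]] = {
    "customer_a": {"db_a", "app_a"},
    "customer_b": {"db_b", "app_b"},
    "back_office": {"db_fin", "app_fin"},
}


def groups_to_switch(failed_component: str,
                     groups: Dict[str, Set[str]] = APP_GROUPS) -> List[str]:
    """Return only the groups that depend on the failed component."""
    return sorted(name for name, deps in groups.items()
                  if failed_component in deps)


def enterprise_switch(groups: Dict[str, Set[str]] = APP_GROUPS) -> List[str]:
    """Return every group: the 'smoking hole' case where the whole site moves."""
    return sorted(groups)
```

With this made-up inventory, a failure of `db_fin` selects only the `back_office` group, so the customer groups stay where they are. The enterprise cutover remains available for the rare case that needs it.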

As a manager of operations for a workforce staffing company, I recall an incident with PeopleSoft. A very capable person was performing an upgrade of a PeopleSoft application and mistakenly deleted the Vendor table. Our ability to quickly rebuild this table in our DR location and place it back in production not only saved our butts, but also saved the company the 2 – 3 days of embarrassment it would otherwise have taken us to rebuild the table.

Disaster Recovery Testing Frequency

Testing a disaster recovery cut-over requires coordination and resources. I am often asked “how often should our company perform a DR test?”. The question is simple, but most of the time, misguided.

A disaster is not convenient or forgiving. In operations at the time of a disaster, everything is on fire and everyone is yelling. Only experience and muscle memory provide the difference between minutes and days.

Testing DR is not about proving the ability to move applications from one point to another; that is done daily. The POINT of testing DR is to give the individuals experience and a practiced process.

In one of my favorite movies – The Last Samurai – Tom Cruise’s character challenges a soldier to shoot him while being attacked. Of course the soldier panics and fails the challenge and the movie continues. But, the point is clear – things change in times of stress!

The following graphic provides a guideline for when to test the switchover for the efficiency of the team.

[Graphic: disaster recovery testing frequency]

As the testing flows down the graphic, the team’s efficiencies also improve. As we test the common Site Switching on a monthly or quarterly basis, the experience of the supporting teams increases as well. Therefore, the testing of the Enterprise cutover would encompass smaller flexible site transitions, which are tested frequently.

Can your team perform a disaster recovery test? Can your team DEFINE a disaster for your operations? If not, you have some homework!

Better question: when was the last time you checked your spare tire? When was the last time you changed your tire? Your tire will not go flat when it is convenient; it will only go flat when it is raining and you are on the freeway.

My last story … when I was a director of operations, I managed a team of DBAs and system administrators. I had a triage DBA role that rotated between application work and triage, and during triage they handled the weekly issues and requests. In the middle of this stressful period, I would write the name of a database, a date and a time on a piece of paper and put it on their desk. They would have to recover the database to that point in time while they were doing triage. The process was to keep them sharp, not to test their abilities.
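I used a piece of paper, but the same surprise drill is easy to sketch in a few lines of Python. This is only an illustration under my own assumptions: the function name, the seven-day window and the database names are hypothetical, and the script does not touch a real database. It simply picks a database and a point in time to recover to.

```python
import random
from datetime import datetime, timedelta
from typing import List, Optional


def make_recovery_drill(databases: List[str],
                        now: datetime,
                        max_days_back: int = 7,
                        rng: Optional[random.Random] = None) -> dict:
    """Pick a random database and a random past point in time to recover to."""
    if not databases:
        raise ValueError("need at least one database name")
    rng = rng or random.Random()
    # Choose a target between 1 minute and max_days_back days before 'now'.
    minutes_back = rng.randint(1, max_days_back * 24 * 60)
    target = now - timedelta(minutes=minutes_back)
    return {
        "database": rng.choice(databases),
        # Truncate to the minute, like a time written on a slip of paper.
        "recover_to": target.replace(second=0, microsecond=0),
    }
```

Calling `make_recovery_drill(["hr", "finance"], datetime.now())` returns a dictionary with a `database` and a `recover_to` timestamp that can be handed to whoever is on triage that week.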

Tags: disasterrecovery, dr, siteswitching | Posted in Disaster Recovery, Oracle, Site Switching, System Operations, Uncategorized

August 27, 2016

Best Practices – Lazy or Strategy

By htullis

I rarely get nostalgic and think of the “good ole times”, but perhaps this is one of those times. We at Enkitec still joke about the constant use of the term “Best Practices” as the attempts of vendors to sell more product. While we began to hear the term more and more, it always seemed to be a fun discussion when we talked to engineers and friends from Oracle about installing an engineered system. These discussions typically ended with our providing our experience and execution to our friends for consideration. However, lately the ballad of “Best Practices” has left the engineering discussion and moved to consultants.

The latest incident occurred a few weeks ago at a customer location. The vendor’s pre-sales consultants engaged us, and the customer, in a four-hour discussion about the installation of four engineered systems. This discussion focused on the physical installation; however, it did not cover the application requirements, system requirements or the customer’s infrastructure abilities. After the four-hour meeting, these “engineers” left and generated a 13-page document citing “Best Practices” recommendations. What was missing from this document were things like customer and application requirements, physical data points and application observations. During the discussion with these consultants, they could not provide solution benefits or experiences; they simply stated “Best Practices” as the answer to each question. I left the meeting desperately wanting those four hours back.

Now, as I ponder this, I lament … “Are the days of actually talking with a customer and defining the best solution for the customer’s situation … gone?”

The jokes around the Enkitec office circled around the laziness of installers, but I am starting to believe the use of “Best Practices” is more of a strategy than plain laziness.

Are “Best Practices” necessary?

I may be alone here, but I believe they are necessary. As a performance engineer for a vendor, I participated in many TPC and AIM benchmarks. Those benchmarks provided a decent baseline for performance in a controlled environment. I think the same is true for best practices. The concept for best practices identifies a baseline of a perfect system or application in a perfect configuration installed by an engineer that did not have anything else to do. As we all know, this is rarely the case within a customer’s infrastructure and application environment. However, the customer can evaluate the solution and the comments at the end of the best practices documentation. These comments provide the pros and cons of each solution. So yes, best practices are necessary, but they are not an excuse. Full disclosure is required.

Is it Lazy?

As I mentioned earlier, we used to joke about the use of best practices. We thought, at the time, that the individuals citing best practices were simply using someone else’s work as a reference. Unless the consultant could provide the full disclosure associated with the best practices comment, we generally knew two things: 1) the consultant has probably never installed the solution and 2) the consultant has probably never experienced the solution in the wild. They could stand behind the work of someone else and claim “Best Practices” without having to provide an adequate defense.

So, an inexperienced consultant could provide a “solution” without 1) providing a defense, 2) collecting or gathering data and 3) performing physical analysis of data. Then, to beat everything else, they also relinquish any responsibility for the “solution”. In the past, yes, we would call that lazy. But now, I think it has become a strategy of the consulting firm.

Strategy to level the playing field?

As I sat in my Georgia Tech MBA class on Global Product strategy today, I started wondering if it was not a smart strategy. I began to think, how can a vendor that does not have a history of experienced consulting compete against experienced consulting firms? Utilizing consultants with less than 5 years of operational experience to deliver a sound solution is challenging. However, if you provide the consultant an “equalizer”, such as “Best Practices” in every document, then it is easier to sell the “solution” as a sound solution for the customer. Therefore, the vendor no longer requires a solid staff of experienced engineers; it simply needs to define a generic solution and socialize the solution as a Best Practices solution.

Are Vendor Best Practices real?

The solution is real in most cases and, in most cases, a good idea. I believe someone sat in a lab and performed the processes defined in a best practices document. I am also sure that the solution, if performed correctly, provides the benefits as indicated. But, are they really best practices? After all, one primary characteristic of the “Best Practices” definition is the term – Widely Accepted – which means that the solution is widely accepted by the community. However, most vendors publish best practices at the same time a solution is published, which challenges the “widely accepted” requirement of defining a best practice. So, as we implement systems and solutions for customers, we should be wary of the term “Best Practices” as it comes from the specific vendors – as they may be accepted by the vendor, but not widely accepted by the community.

Why be wary?

With respect to citing “Best Practices” as the only way to go, is this a bad thing? As indicated above, there is a tendency to call new technology a type of best practice, although it is not widely used or accepted by the community. Also, the mantra of best practices validates consultants who may not ultimately understand the technology or its use. The inexperienced consultant will cite “Best Practices” as the reason for implementing a solution, regardless of the benefit or detriment to the client. Therefore, as with anything else, we have to do our homework to make sure a solution is widely accepted and is in the best interest of the customer.

It’s all about the customer

Why do we care about the socialization of best practices? Because, in the end, we end up having to rescue customers from the latest “Best Practice”. Most customers don’t have the luxury of a test lab and some don’t have the skilled resources dedicated to the solution. Most customer resources play the role of utility player, knowing how to support an assortment of products at a high level.

How do we approach a customer to know when to implement the best practices stated by these vendors? As consultants, we should do as we have always done:

Listen and understand: Listen to the customer and understand what their team can implement and support. Just because a best practice is written down does not mean that it will fit in a customer’s environment. As consultants, we are expected to provide the best solution for the customer’s environment.

Understand the technology: Don’t recommend a product because it is the latest technology. Recommend the product because it supports the customer’s requirements and provides flexibility. Sometimes a best practice uniquely leverages a vendor’s product, which limits the flexibility for growth or integration with other products.

Read the fine print: Most best practices come with multiple implementation options – just as most technologies. Although rarely stated by consultants, probably because they don’t understand, these options come with benefits or deficiencies. Some of these indicate the solution is complicated to implement or support. The issue may include a costly implementation due to licensing.

As I step away from my computer, I will maintain the traditions of most experienced consultants. I will continue to evaluate technology in terms of how it helps customers. Technology is a tool for us to use to meet our requirements and provide us benefit. Too often, technology sold to a customer becomes an entrapment into a vendor or solution and becomes a cage.

I guess, as I look at it, the above 3 bullets become the “Best Practices” for consultants. Remember, success is gauged by a successful customer implementation, not a technology implementation. There have been many successful technology implementations that served little purpose.

Posted in Uncategorized

oracle-bull.com
