Configuration can be a big stumbling block when it comes to availability.

So let’s face it, when we build projects, we make trade-offs. And many times those trade-offs come in the form of time and effort. We would all build the most perfect software ever… if time and budget were never a concern.

So along those lines, one thing that I find gets glossed over quickly, especially with Kubernetes and microservices, is configuration.

Configuration: likely you are reading that and saying, “That’s the most ridiculous thing I’ve ever heard.” We put our configuration in a YAML file, or a web.config, and manage those values through our build pipelines. And while that might seem like a great practice, in my experience it can cause a lot more headaches in the long run than you’re probably expecting.

The problem with storing configuration in YAML files, or Web.configs, is that they create an illusion of being able to change these settings on the fly. An illusion that can actually cause significant headaches when you start reaching for higher availability.

The problems these configuration files can cause are the following:

Changing these files is a deployment activity

If you need to change a value for these applications, it requires changing a configuration file. Changes to configuration files are usually tightly coupled to some kind of restart process. Take App Service as a primary example: if you store your configuration in a web.config and you make a change to that file, App Service will automatically trigger a restart, which will cause a downtime event for you and/or your customers.

This is even more difficult in a Kubernetes cluster, because if you use a YAML file, changing a value requires the deployment agent to change the cluster. That makes it very hard to change these values in response to a change in application behavior.

For example, say you wanted to switch your SQL database connection if performance degrades below a certain point. That is a lot harder to do when you are referencing a connection string in a config file on pods that are deployed across a cluster.

Storing Sensitive Configuration is a problem

Let’s face it, people make mistakes. And one of the biggest problems I’ve seen come up several times starts with the following statement: “We store normal configuration in a YAML file, and then sensitive configuration in a key vault.”

The problem here is that “sensitive” means different things to different people, so the odds of something being misclassified go up. It’s much easier to manage if you tell your team to treat all settings as sensitive. It makes management a lot easier and limits you to a single store.

So what do we do…

The best way I’ve found to mitigate these issues is to use an outside service, like Azure Key Vault or Azure’s configuration management service, to store your configuration settings.

But that’s just step 1. Step 2 is to cache the configuration settings for each microservice in memory in the container on startup, and make sure that you configure the cache to expire after a set amount of time.

This gives you a model whereby your microservices start up after deployment, reach out to a secure store, and cache the configuration settings in memory.

This also gains us several benefits that mitigate the problems above.

  • Allow for changing configuration settings on the fly: For example, if I wanted to change a connection string over to a read replica, that can be done by simply updating the configuration store, and allowing the application to move services over as they expire the cache. Or if you want even further control, you could build in a web hook that would force it to dump the configuration and re-pull it.
  • By treating all configuration as sensitive, you ensure there are no accidental leaks. This also ensures that you can manage these keys at deployment time, and not have them ever be seen by human eyes.
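
To make the expiring cache idea concrete, here is a minimal sketch in Python (chosen for brevity). The names CachedConfig, fetch, and invalidate are made up for illustration, and fetch stands in for whatever call reaches your configuration store; it is an assumption, not a real SDK call.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CachedConfig:
    """In-memory configuration cache whose entries expire after ttl_seconds."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical call into the configuration store
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now < cached[1]:
            return cached[0]         # still fresh: no call to the store
        value = self._fetch(key)     # missing or expired: re-pull from the store
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key: Optional[str] = None) -> None:
        """Drop one key, or everything; a web hook handler could call this."""
        if key is None:
            self._entries.clear()
        else:
            self._entries.pop(key, None)
```

With something like this in place, updating the connection string in the store takes effect as each service's cache entry expires, and calling invalidate() from a web hook forces an immediate re-pull.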

So this is all great, but what does this actually look like from an architecture standpoint?

For AKS, it’s a fairly easy implementation: create a sidecar for retrieving configuration, and then deploy that sidecar with any pod that is deployed.

Given this, it’s easy to see how you would implement a separate sidecar to handle this configuration. Each service within the pod is completely oblivious to how it gets its configuration; it just calls a microservice to get it.

I personally favor the sidecar implementation here, because it allows you to easily bundle this with your other containers and minimizes latency and excessive network communication.

Latency will be low because it’s local to every pod, and if you ever decide to change your configuration store, it’s easy to do.

Let’s take a sample here using Azure Key Vault. The following code shows how a configuration could be managed.

Here’s some sample code that could easily be wrapped in a container to serve your configuration from Key Vault:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.KeyVault;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

// IConfigurationProvider and IKeyVaultConfigurationSettings are assumed to be
// the application's own interfaces; they are not shown here.
public class KeyVaultConfigurationProvider : IConfigurationProvider
{
    // Defaults come from environment variables; the constructor overrides them.
    private string _clientId = Environment.GetEnvironmentVariable("clientId");
    private string _clientSecret = Environment.GetEnvironmentVariable("clientSecret");
    private string _kvUrl = Environment.GetEnvironmentVariable("kvUrl");

    public KeyVaultConfigurationProvider(IKeyVaultConfigurationSettings kvConfigurationSettings)
    {
        _clientId = kvConfigurationSettings.ClientID;
        _clientSecret = kvConfigurationSettings.ClientSecret;
        _kvUrl = kvConfigurationSettings.KeyVaultUrl;
    }

    public async Task<string> GetSetting(string key)
    {
        // Authenticate to Key Vault with the service principal's client ID and secret.
        KeyVaultClient kvClient = new KeyVaultClient(async (authority, resource, scope) =>
        {
            var adCredential = new ClientCredential(_clientId, _clientSecret);
            var authenticationContext = new AuthenticationContext(authority, null);
            return (await authenticationContext.AcquireTokenAsync(resource, adCredential)).AccessToken;
        });

        var path = $"{this._kvUrl}/secrets/{key}";

        var ret = await kvClient.GetSecretAsync(path);

        return ret.Value;
    }
}

Now the above code uses a single service principal to call Key Vault and pull configuration information. This could be modified to leverage pod-specific identities for even greater security and a cleaner implementation.

The next step of the above implementation would be to leverage a cache for your configuration. This could be done piecemeal as needed or in a group. There are a lot of directions you could take this, but it will ultimately help you to manage configuration more easily.

High Availability – a storage architecture

Hello all, so I’ve been doing a lot of work around availability in the cloud and how to build applications that are architected for resiliency. And one of the common questions that comes up is how do I architect for resiliency around storage.

So the scenario is this, and it’s a common one: I need to be able to write new files to blob storage, and read from my storage accounts, and I need it to be as resilient as possible.

So let’s start with the SLA. Currently, if you are running LRS storage, your SLA is 99.9%, which from a resiliency perspective isn’t ideal for a lot of applications. But if I use RA-GRS, my SLA for reads goes up to 99.99%.

Now, I want to be clear about storage SLAs: this SLA says that I will be able to read data from blob storage, and that it will be available 99.99% of the time using RA-GRS.

For those who are new to blob storage, let’s talk about the different types of storage available:

  • Locally Redundant Storage (LRS): The 3 copies of the data you put in blob storage are stored within a single datacenter in one region.
  • Zone Redundant Storage (ZRS): The 3 copies of the data you put in blob storage are stored across availability zones within one region.
  • Geo Redundant Storage (GRS): On top of the 3 copies in the primary region, the data you put in blob storage is replicated asynchronously to a secondary region, following Azure region pairings.
  • Read Access Geo Redundant Storage (RA-GRS): The same replication as GRS, but in this case you also get a read-only endpoint for the copy in the secondary region.

So based on the above, the recommendation is that for the best availability you would use RA-GRS, which is a feature that is unique to Azure. RA-GRS gives you a secondary endpoint where you can get read-only access to the backup copies that are saved in the secondary region.

For more details, look here.

So based on that, you gain the fact that if your storage account is called:

Your secondary read-access endpoint would be :
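
As a sketch of the naming convention, assuming the public Azure cloud’s blob.core.windows.net domain and a hypothetical account name mystorageaccount, the two endpoints can be derived like this (Python, for brevity):

```python
from typing import Tuple


def blob_endpoints(account_name: str) -> Tuple[str, str]:
    """Return (primary, secondary) blob endpoints for an RA-GRS storage account."""
    primary = f"https://{account_name}.blob.core.windows.net"
    # The read-only secondary endpoint appends "-secondary" to the account name.
    secondary = f"https://{account_name}-secondary.blob.core.windows.net"
    return primary, secondary


# "mystorageaccount" is a made-up account name for illustration.
primary, secondary = blob_endpoints("mystorageaccount")
print(primary)
print(secondary)
```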

So the next question is, “That’s great Kevin, but I need to be able to write and read.” I have an architecture pattern I recommend for that, and it is this:

So the above architecture is oversimplified, but it focuses on the storage account configuration for higher availability. In it, we have a web application that is deployed behind Traffic Manager, with an instance in a primary region and an instance in a secondary region.

Additionally, we have an Azure SQL database that is geo-replicated into a backup region.

Let’s say for the sake of argument, with the above:

  • Region A => East US
  • Region B => West US

But for storage, we do the following: Storage Account A will be in East US, which means that it will automatically replicate to West US.

Storage Account B will be in West US, which means it replicates to East US.

So let’s look at the Region A side:

  • New Blobs are written to Storage Account A
  • Blobs are read based on database entries.
  • The application tries to read from the storage account identified in the database; if that fails, it uses the “-secondary” endpoint.

So for Region B side:

  • New Blobs are written to Storage Account B
  • Blobs are read based on database entries
  • The application tries to read from the storage account identified in the database; if that fails, it uses the “-secondary” endpoint.

So in our databases I would recommend the following fields for every blob saved:

  • Storage Account Name
  • Container Name
  • Blob Name

This allows me to easily switch to the “-secondary” endpoint when it is required.
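
Here is a rough sketch of that read path in Python. BlobRecord mirrors the three database fields; download is a hypothetical stand-in for whatever HTTP or SDK call actually fetches the blob, and the sketch assumes it raises an OSError (for example ConnectionError) when the endpoint can’t be reached.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BlobRecord:
    """The three fields stored in the database for every blob."""
    storage_account: str
    container: str
    blob_name: str

    def url(self, secondary: bool = False) -> str:
        suffix = "-secondary" if secondary else ""
        return (f"https://{self.storage_account}{suffix}.blob.core.windows.net/"
                f"{self.container}/{self.blob_name}")


def read_blob(record: BlobRecord, download: Callable[[str], bytes]) -> bytes:
    """Try the primary endpoint first; fall back to the read-only secondary."""
    try:
        return download(record.url())
    except OSError:
        # Primary is unreachable: RA-GRS still serves reads from the paired region.
        return download(record.url(secondary=True))
```

One thing to keep in mind: geo-replication is asynchronous, so a blob written moments before a failure may not be on the secondary yet.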

So based on the above, let’s play out a series of events:

  • We are writing blobs to Storage Account A. (1,2,3)
  • There is a failure, we fail over to Region B.
  • We start writing new blobs to Storage Account B. (4,5,6)
  • If we want to read Blob 1, we do so through the “-secondary” endpoint from Storage Account A.
  • The issue resolves
  • We read Blob 1-3 from Storage Account A (primary endpoint)
  • If we read Blob 4-6 it would be from the “-secondary” endpoint of Storage Account B

Now some would ask the question, “When do we migrate the blobs from B to A?” I would make the argument that you don’t. At the end of the day, storage accounts themselves cost nothing, and you would incur additional charges to move the data to the other account for no benefit. As long as you store the location of each blob, you can always find it, so I don’t see a benefit from merging.

Weekly Links – 1/13/2020

Well it’s officially 2020, and this year wasted absolutely no time getting moving. My kids’ activities started ramping up, and life moved at a sprint speed out of the gate.

But down to business...


Fun Stuff:

So for this week, it’s nothing in nerd pop culture, but a fascinating video I saw. This was recommended by a friend, and it sparked a lot of research for me around personal growth (blog post coming soon). The video is “What does game theory teach us about war?” and there are a lot of parallels that you can draw to how your own life functions in terms of finite vs infinite games.

RT? – Making Sense of High Availability

Hello all, in keeping with the last post on the blog, I started doing some posts around High Availability, so ultimately the focus here is how do I architect my solution to ensure that it meets the availability demands of my customers.

So odds are if you’ve started down this direction, you’ve heard 3 acronyms:

  • SLA – Service Level Agreement
  • RTO – Recovery Time Objective
  • RPO – Recovery Point Objective

So what does each of these items mean, and how do they relate to your solution? For SLA, I covered this pretty extensively in my previous post, so I would direct you there for a definition and recommendations around how to approach that topic.

So the next question is really what are RTO and RPO? And how do they relate to High availability?

What is RTO?

RTO stands for Recovery Time Objective, and basically, in software terms, it answers the question: when something happens, how fast do you recover?

So let’s take an example, because I work best with examples. Say I have a solution that is deployed in multiple regions, uses Traffic Manager, and is replicated into another region. If Traffic Manager is checking the endpoint every 5 seconds, and 3 failures cause a failover, that means my RTO is 15 seconds.

By using a dual region deployment, I’m able to keep my RTO relatively low. Now the above example is pretty simplistic. Really we should do this analysis per service in our architecture to determine how long each failover takes, and then the longest of those is your solution’s RTO.
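
A small Python sketch of that per-service analysis follows. The function names are mine, and the 60-second database figure is a made-up number purely for illustration; only the 5-second probe and 3-failure threshold come from the example above.

```python
from typing import Dict


def detection_seconds(probe_interval_s: float, failures_to_fail_over: int) -> float:
    """Time for health probes to declare an endpoint down."""
    return probe_interval_s * failures_to_fail_over


def solution_rto(service_failover_s: Dict[str, float]) -> float:
    """The solution's RTO is the slowest service's failover time."""
    return max(service_failover_s.values())


failover_times = {
    "web (traffic routing)": detection_seconds(5, 3),  # the 15 seconds from the example
    "database": 60.0,                                   # made-up figure for illustration
}
print(solution_rto(failover_times))
```

In this made-up case the database, not the traffic routing, sets the RTO, which is exactly why the analysis has to be done per service.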

How do we improve RTO?

Now, remember that this is really a measure of continuity of business, so really looking at High Availability and Disaster Recovery. So ultimately we are talking about service uptime more than anything else.

So the best way to improve RTO is to enable replication and take steps to increase the speed of recovery. If you look at the last discussion of SLA, we took steps to minimize downtime by increasing SLA. This conversation is about how we minimize the downtime caused by those failovers.

The most important things involved in this are the following:

  • Monitoring
  • Response time
  • Data Replication
  • Failover

So the key metric to pay attention to is how long it takes to get up and running.

Monitoring is the cornerstone of your RTO target. If you don’t know there is a problem, you can’t fix it. Many blogs and articles will focus on the next 3 parts, but let’s be honest, if you don’t know there’s a problem, you can’t respond. If your logs operate on a 5-minute delay, then you need to factor those 5 minutes into your RTO.

From there the next piece is response time. And I mean this in the true sense of how quickly can you trigger a failover to your DR state. How quickly can you triage the problem and respond to the situation? The best RTO targets leverage as much automation as possible here.

Next, by looking at data replication, we can ensure that we are able to bring back up any data stores quickly and maintain continuity of business. This is important because every time we have to restore a data store, that takes time and pushes out our RTO. If you can fail over in 2 minutes, it doesn’t do you much good if it takes 20 minutes to get the database up.

Finally, failover. If you are in a state where you need to fail over, how long does that take, and what automation and steps can you take to shorten that time significantly?

Let’s take an example. Say I have a solution that is the following in one region:

  • Azure App Service
  • Azure SQL

Say I’m deployed in a single environment, and my DR plan is to stand up another region in the event of a disaster. That solution has a pretty high RTO: if it takes 15 minutes to stand up that environment and deploy to it, then the RTO is 15 minutes. If I wanted to lower that, there are a couple of things I can do:

  • I can increase the automation I use to reduce that time.
  • I can spin up another region ahead of time, or leverage options to do replication.
  • I can set up automation around detection and response.

What is RPO?

RPO stands for Recovery Point Objective, which really focuses on the idea of improving the ability to recover from a data perspective. So if you have a disaster, how much data would be lost? What would the impact be?

When looking at RPO, the key comes to data and potential data loss. So how do we minimize the window for data loss and lower the chances of lost transactions in your application?

There are a few key elements that can assist with this, starting with how your application handles eventual consistency. It is possible to get to an RPO of 0, as long as you have constant data replication in your solution.

Now the most important part of the replication is that the replication needs to be executed in a synchronous fashion, meaning that it must write and replicate the data before sending an acknowledgment. This means that eventual consistency will keep your RPO higher than zero because it means that the replication will “eventually” get there.
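
To see why, here is a toy model in Python. It is a deliberate simplification with made-up numbers: every write is acknowledged at a known time and reaches the replica a fixed lag later, and a lag of 0 stands in for synchronous replication.

```python
from typing import List


def lost_writes(ack_times: List[float], replication_lag_s: float,
                failure_time: float) -> int:
    """Count writes acknowledged to the client but not yet on the replica at failure.

    replication_lag_s = 0 models synchronous replication: a write is only
    acknowledged once the replica has it.
    """
    return sum(
        1 for t in ack_times
        if t <= failure_time < t + replication_lag_s
    )


acks = [0.0, 1.0, 2.0, 3.0, 4.0]   # one acknowledged write per second
print(lost_writes(acks, replication_lag_s=0.0, failure_time=4.5))  # synchronous
print(lost_writes(acks, replication_lag_s=2.0, failure_time=4.5))  # asynchronous
```

With synchronous replication nothing acknowledged is lost; with a 2-second lag, the writes acknowledged in the last 2 seconds before the failure are gone, and that window is your RPO.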

How do we improve RPO?

The most important factor here is replication and data consistency. So we really need to make sure that the strength of transactions is maintained and that consistency rules are enforced. This is why data stores like Cosmos DB gain popularity for requirements of zero RPO and low RTO, because they support models where you can enforce this type of logic.

Needless to say, this all comes down to operations and math and ultimately the requirements of your solution and balancing that against cost and impact. You really want to make sure you only take this to the level you need to as it can add a lot of cost and substantially raise the complexity of your solution.

Keeping the lights on! – Architecting for availability?

Hello all, it’s been a while since I did a blog post outside of the weekly updates. But I wanted to do one on a topic that I’ve been having a lot of conversations about lately and that seems to be largely universal: High Availability. More and more, software is becoming a critical part of every aspect of our lives. To that end, we as developers / engineers see that the following scenarios have become a constant reality:

  • For end customer software, not having access to an app or service for an extended timeframe can be the final nail in the coffin for a lot of users. Their tolerance for downtime continues to drop. If you don’t believe me, research the metrics around how long someone will wait for a video to load before leaving, according to YouTube.
  • For enterprises, organizations are becoming more and more reliant on software to function at the most basic level, meaning that outages or downtime windows have an even greater impact on their business, causing more parts of the organization to have to function at a diminished capacity or not at all during an outage.

The end result of these perceptions / realities is that the demands put on software solutions for maintaining availability are going higher and higher. And it becomes important to architect and plan for high availability from the start, because if you don’t, it can be very expensive and difficult to retrofit your applications to meet these demands.

This is a huge topic, and one that I’m not going to be able to cover in one blog post, but I’m hoping that we can identify ways to help if you are being tasked with meeting these demands.

Defining SLA

So the first part of this conversation, in my experience, always starts the same: “What’s our SLA?” So let’s talk through what an SLA is. SLA stands for Service Level Agreement, and this is a legal agreement on what level of service you are required to provide.

Now the key part of that is “legal agreement”. This is not strictly a software function or engineering concept, but a business agreement, in the sense that if an SLA is not met, there is a financial obligation on the organization to compensate the customer (in an enterprise setting).

Be Reasonable…

Let’s not get crazy!

So the most common mistake I hear when someone starts down this road is “we need 100% SLA”, which is a bad place to start this process. Realistically this is almost impossible; the idea that you will never have an outage is extreme. To get this level of resiliency you can expect to pay for it, and it’s easy to get upside down on your costs by starting out here. We really need to be realistic about the ask here.

Let’s walk through an example. Say you have software that provides grant processing for a municipality, and grant reviews are done Monday to Friday during business hours (8am-6pm). If your customer says “We need a 100% SLA”, I would make the counter argument of “Do you really?” If the system is down from 1-2am on a Saturday, does that really affect you and the nature of the business? Or is this just a matter of needing the solution to be up during those core business operating hours?

Conversely, let’s go the other way, and say that you are providing a solution for emergency service communication during a natural disaster. Would your customer be ok with a 5-minute downtime at 2am in the middle of a hurricane? Probably not. So tolerance should be measured in terms of actual impact to the end user and their ability to function.

High Availability is like insurance, I can get add-ons to my policy for everything that could ever happen, but that means that I will likely be paying for things I don’t need. I can get volcano insurance in Pennsylvania, but the odds of needing it are so low as to make it ridiculous.

So what we should be doing is finding a happy balance between what we can realistically do by following recommended processes, and weighing that against the business calculation and cost.

Let me give you a high level example. Let’s say I deploy my production environment to one region, and I’ve calculated the composite SLA (more on this later) to be 99.9% for one region. That means that right now I am telling my customers that I am expecting about 43.2 minutes of downtime in a 30-day month.

But if I stood up a secondary region, and built out a lot of automation around failover and monitoring (let’s say 80 hours of work), I could raise that SLA from 99.9% to 99.99%, which would mean a downtime of 4.32 minutes.

Now what I need to weigh is the following:

  • 80 hours worth of labor costs
  • opportunity cost of not using that labor resource on new features
  • doubling my environment costs (2 active regions)
  • Potential advantage by supporting a higher SLA.

And I look at that and say, I’m saving 38.88 minutes of downtime a month in the process. So the question is, does that help my business and make sense from a financial position, or am I “ok” having only 1 environment up, rolling the dice, and taking a financial hit by paying out if we are down for more than the SLA allows?
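
If you want to play with these numbers yourself, the downtime math is easy to script. This is a quick Python sketch assuming a 30-day month, which is what the 43.2 and 4.32 minute figures are based on; the function name is mine.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime per period that an availability percentage permits."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


single_region = allowed_downtime_minutes(99.9)
two_regions = allowed_downtime_minutes(99.99)
print(round(single_region, 2), round(two_regions, 2),
      round(single_region - two_regions, 2))
```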

I can’t say in the above discussion what the right answer is, because ultimately it depends on the type of business and resiliency of the application. You might be comfortable with that, you might not.

My point is that at the end of the day this is both an engineering problem and a business problem, and likely the right answer is somewhere in the middle.

Now to be clear, other times, especially in enterprise software, the customer may require a certain SLA, and at that point you might have to show that you meet that SLA by having specific redundancies in place. I’ll talk about this more in our next section.

Calculating a composite SLA

Another common question is “How do I calculate the SLA of my service?” And this is more straightforward than people realize. Let’s take the following example:

Note: You can find all of Azure’s SLAs here.

Service        SLA
App Service    99.95%
Azure SQL      99.99%

So based on the table above, the composite SLA would be:

.9995 * .9999 = .9994 = 99.94%

So that would imply that your cloud provider is standing behind these services to have downtime of no more than:

730 (hours per month) * (1 – .9994) = 0.438 hours = 26.28 minutes

Now the above is an estimate, but it is roughly what we could expect our monthly downtime to be. The calculation works the same way as you add more services: each additional SLA is multiplied in, so the composite SLA only goes down.
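
The same calculation as a short Python sketch (needs Python 3.8+ for math.prod). The function names are mine, and the SLA values are the ones from the table above:

```python
from math import prod
from typing import Iterable


def composite_sla(slas: Iterable[float]) -> float:
    """Availability of services that must ALL be up: the product of their SLAs."""
    return prod(slas)


def monthly_downtime_minutes(availability: float, hours_per_month: float = 730) -> float:
    """Convert an availability fraction into minutes of downtime per month."""
    return hours_per_month * 60 * (1 - availability)


sla = composite_sla([0.9995, 0.9999])   # App Service, Azure SQL
print(f"{sla:.4%}")
print(round(monthly_downtime_minutes(round(sla, 4)), 2))
```

Adding a third or fourth service is just another entry in the list, and every entry pulls the composite number down a little further.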

Now it’s important to note, this is the platform SLA, not your SLA. And I say that because at the end of the day, this is assuming that your application doesn’t have issues that cause downtime, so that should be considered as well.

How do we improve our SLA, start with “what is down?”

Now for many cloud services, Microsoft and every other cloud provider give recommendations to enhance resiliency and improve your SLA. One way to do that is to leverage items like Availability Zones and multi-region deployments. This allows you to spread your application out across multiple regions, and it makes the probability of an outage drop substantially.

Really the first step here is to do a failure mode analysis and determine critical functionality. What I mean by that is we need to define what constitutes the system being “down”. So let’s say you have an eCommerce platform, something like NopCommerce, and you have the following use cases:

  1. Browse the catalog
  2. Add items to shopping cart
  3. Purchase items
  4. Publish blogs
  5. Send out notifications of deals / sales
  6. Process Orders

Now based on the above, we could identify 1, 2, 3, and 6 as mission critical: if we can’t allow our customers to shop, buy, and receive their products, that means that we are out of business. If we can’t publish a blog when we want to, or if a sale notice goes out a little late, it’s not ideal, but it’s not the end of the world. And let’s say that we have Azure Functions sending the notifications, and the blogs and promotions are managed by Cosmos DB.

So now based on that, we need to examine our architecture and identify what components are required to maintain the 4 key use cases we identified. Notice I left off the elements that are not part of our key functionality for our SLA.

Let’s say we have the proposed architecture:

Now based on the above, I can calculate our primary region SLA to be:

Service                SLA
Application Gateway    99.95%
App Service            99.95%
Azure SQL              99.99%
Total SLA              99.89%

So as a result of the above, we need to examine what elements of our solution are critical to meeting our uptime SLA, and then do a failure analysis. Based on the above use cases, we can assume that Traffic Manager, Application Gateway, App Service, and Azure SQL are essential to meeting our SLA. For the sake of this example, let’s say that the caching layer follows industry recommendations and is used only for speed of access; if it is not available, the application will just reach out to the database.

So how do we calculate the compound SLA for the two regions? We do that with the following math:

We basically have to figure out the probability of both regions being offline at the same time. Each region’s “unavailability” is 100% – 99.89% = 0.11%, so we multiply it by itself:

0.11% * 0.11% = 0.0011 * 0.0011 = 0.00000121 = 0.000121%

Convert it back to availability:

100% – 0.000121% = 99.999879%

Now we take that multiplied by the Traffic Manager SLA (99.99%):

.99999879 * .9999 = .9999 = 99.99%
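
Here is the two-region math as a Python sketch, starting from the per-service SLAs in the table. The function name is mine, and the model leans on one big assumption: that the regions fail independently of each other, which real outages don’t always honor.

```python
def multi_region_sla(region_sla: float, regions: int, router_sla: float) -> float:
    """Availability of N redundant regions behind a single traffic router.

    Assumes regions fail independently of one another.
    """
    all_regions_down = (1 - region_sla) ** regions
    return (1 - all_regions_down) * router_sla


region = 0.9995 * 0.9995 * 0.9999   # Application Gateway, App Service, Azure SQL
total = multi_region_sla(region, regions=2, router_sla=0.9999)
print(f"{region:.2%} per region, {total:.2%} overall")
```

Notice that once you have two regions, the traffic router in front of them becomes the limiting factor on the overall number.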

Failure Mode Analysis:

A failure analysis means that we pick apart each element of the infrastructure and identify the following:

  • What potential failures could occur?
  • What different “modes” or “states” can this component be in?
  • How likely is a failure of this component?
  • What is the impact of each failure “mode” or “state” on the application?

After examining the above, you need to look at each of the “modes” or “states” and identify the following:

  • How will you respond and recover?
  • How will you monitor for this situation before, during, and after?

So let’s take an example, because to me that always helps. Let’s examine the above solution and take Azure SQL Database. If I were to do a failure mode analysis, I would find the following:

  • The database is offline in the following situations:
    • The database is offline due to a platform issue
    • The database is shut down
    • The database is deleted
  • The database is in a degraded state in the following situations:
    • Database is performing slowly due to high website demand.
    • Database is running slowly due to bad query optimization
    • Database is experiencing deadlocks

Now this is by no means an exhaustive list, but it hits the high points for our ecommerce site. Now for those states, I need to identify what to do in each scenario. So the question is how do we respond and recover. In the case of the database, the most common recommendations are to use a standard tier and to use active geo-replication.

So for “How do we respond and recover?” I would say we set up active geo-replication of our production database to a secondary region. In the event the database is “offline”, we fail over to the secondary region and leverage Traffic Manager to route to the backup site. We would see some data loss during the failover, but for this exercise, let’s say that is manageable.

The next question is the most important, how do we monitor for this? The answer is we could do this a couple of ways:

  • Set up alerts via Azure Monitor around specific metrics.
  • Set up alerts in Application Insights for dependency failures on database calls.
  • Build a page within our application that Traffic Manager can probe to identify when the database is unreachable and trigger failover.
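
For that last option, the logic behind the probe page can be very small. Here is a framework-agnostic Python sketch; check_database is a hypothetical callable (for example, something that runs a trivial query with a short timeout), and the sketch assumes the probe treats HTTP 200 as healthy and anything else as unhealthy.

```python
from typing import Callable, Tuple


def health_status(check_database: Callable[[], None]) -> Tuple[int, str]:
    """Return the HTTP status and body a health endpoint would send to the probe."""
    try:
        check_database()        # hypothetical: e.g. run "SELECT 1" with a short timeout
    except Exception as exc:    # any failure means this region should not take traffic
        return 503, f"database unreachable: {exc}"
    return 200, "ok"
```

Your web framework’s health route would just call this and return the status code, so the probe sees the database’s health and not only the web server’s.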

The next mode was “degraded”, and if we examine that, the response is to increase the performance tier of the database to respond to increased demand, or to do more in-depth analysis around the performance of the database. Again, the monitoring would be similar: setting up alerts around these conditions to make the appropriate staff aware.

So all kidding aside, this is a huge topic, and one I want to boil down further into how best to implement these solutions. This post didn’t begin to discuss the differences between RTO / RPO, or how you ensure resiliency through transient fault tolerance or distributed architectures, and that’s just scratching the surface, so more to come.