You’re Spending Too Much on Your Cloud Infrastructure


Well. You probably are, anyway: the cloud is tricky, and it’s easy to leak resources.

That’s not to say that it’s worse than infrastructure in the data center: the old world has its own set of problems, and won’t give you the option to dial down fixed costs like facilities and depreciation when utilization declines.

So, What’s Leaking?

Unless resources are deployed and cleaned up automatically, and unless you have really great reporting on how things are being used, then probably a lot.

There are some things that require strategic attention to resolve, like retrospectively adding autoscaling to a service that wasn’t designed for it. But there are others where small changes to code or good custodial behavior can still be highly impactful. All in all, this is a journey.

This article covers some of the things that you can do to make a fast impact, and I’ll write separately about the stuff that’ll take a bit more planning. I’ve used Amazon’s terminology here, but these principles apply to all of the public clouds.

These ideas are meant to be applied iteratively using the 80–20 rule. Once you’ve addressed as much of the low-hanging fruit as you can, continue to prioritize what’s left based on impact and effort.

Perhaps you’re doing some of these things already: that’s great. Perhaps there are other things that I’ve missed: in which case let me know!

Clean Up Abandoned Resources

When companies start their cloud journeys, they often either:

  • Treat the cloud like a fresh data center, so although there are tight access restrictions in place, resources tend to be administered by hand.
  • Give developers free rein to create what they want, so that everything is different, and often wrong.

Both of these approaches cause snowflakes, and it quickly becomes difficult to tell what is needed and what isn’t. At some point, it’s safer just to leave stuff there than risk causing an outage. This is a really horrible place to be: when you get here, things will only compound until you take drastic action.

In the first instance, it’s good to have data to quantify and categorize spend. Tag resources by type, business unit, team, application, whatever makes most sense. Add in cloud metadata like location, utilization, and so on. Use the tooling provided by your cloud provider (or one of a dozen or so third parties) to slice and dice spend by category.

Look for hotspots.
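As a rough sketch of what “slice and dice” can look like, the snippet below totals spend by a single tag and sorts the result so that hotspots float to the top. The record layout (`cost`, `tags`) and the `team` tag key are assumptions made for illustration; a real billing export from your provider will have its own column names.

```python
from collections import defaultdict


def spend_by_tag(line_items, tag_key):
    """Total cost per value of tag_key, largest first.

    Items without the tag are grouped under "untagged" so that
    unattributed spend stays visible instead of disappearing.
    """
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical billing records, not real data.
line_items = [
    {"resource": "i-001", "cost": 120.0, "tags": {"team": "payments"}},
    {"resource": "i-002", "cost": 480.0, "tags": {"team": "search"}},
    {"resource": "vol-003", "cost": 75.5, "tags": {}},
    {"resource": "i-004", "cost": 60.0, "tags": {"team": "payments"}},
]

for owner, cost in spend_by_tag(line_items, "team"):
    print(f"{owner:10s} {cost:8.2f}")
```

A large “untagged” bucket is itself a finding: it tells you how much of the bill nobody can currently account for.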

There’s nothing more disheartening than putting a lot of drudgework into something like this for the greater good, only to find that others are merrily adding new stuff that’s not tagged as fast as you’re tagging. Do your best to help others understand the consequences of shortsighted behaviors. If you want to be draconian, communicate that new resources will be deleted unless they are tagged to a published standard.
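If you do publish a tagging standard, checking against it is easy to automate. Here is a minimal sketch, assuming a hypothetical standard of three required tags and an inventory already exported as a list of dictionaries; it only reports offenders and leaves the decision about deleting anything to you.

```python
# Hypothetical published standard: every resource must carry these tags.
REQUIRED_TAGS = {"team", "application", "business-unit"}


def missing_tags(resource, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or empty on a resource."""
    tags = resource.get("tags", {})
    return sorted(key for key in required if not tags.get(key))


def audit(resources, required=REQUIRED_TAGS):
    """Map resource id to its missing tags, for non-compliant resources only."""
    report = {}
    for resource in resources:
        missing = missing_tags(resource, required)
        if missing:
            report[resource["id"]] = missing
    return report


# Hypothetical inventory, not real data.
inventory = [
    {"id": "i-001", "tags": {"team": "payments", "application": "api",
                             "business-unit": "retail"}},
    {"id": "i-002", "tags": {"team": "payments"}},
    {"id": "vol-003", "tags": {}},
]

for resource_id, missing in audit(inventory).items():
    print(resource_id, "is missing:", ", ".join(missing))
```

Running something like this on a schedule, and sending the report to the owning teams, is a gentler first step than deletion.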

Right-Size Resources

Whether something was created by hand, or automatically, there is a tendency to configure infrastructure a size or two too big. You know, “Just in case”.

Resources should be automatically terminated when they are not needed: this is not the data center, and creation and destruction have become trivial operations. This means making the most of autoscaling and, where autoscaling doesn’t apply, terminating or putting to sleep one-off infrastructure when it’s not actively being used. You can invest your own time in developing this functionality, or give a small cut of your savings to a company like ParkMyCloud.
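If you go the build-it-yourself route, the core decision logic is small. The sketch below assumes a simple weekday, working-hours schedule and a hypothetical `always_on` flag to exempt production; it only plans actions, and the actual start and stop calls to your provider are deliberately left out.

```python
from datetime import datetime


def should_be_running(now, start_hour=8, stop_hour=19, workdays=range(0, 5)):
    """True during working hours, Monday (0) to Friday (4)."""
    return now.weekday() in workdays and start_hour <= now.hour < stop_hour


def plan_actions(instances, now):
    """Return {instance id: "start" or "stop"} for instances in the wrong state.

    Instances flagged always_on (a made-up flag for this sketch) are skipped.
    """
    wanted = should_be_running(now)
    actions = {}
    for instance in instances:
        if instance.get("always_on"):
            continue
        if instance["state"] == "running" and not wanted:
            actions[instance["id"]] = "stop"
        elif instance["state"] == "stopped" and wanted:
            actions[instance["id"]] = "start"
    return actions


# Hypothetical fleet, not real data.
fleet = [
    {"id": "dev-1", "state": "running"},
    {"id": "dev-2", "state": "stopped"},
    {"id": "prod-1", "state": "running", "always_on": True},
]

print(plan_actions(fleet, datetime.now()))
```

A development box that only runs during an 11-hour weekday window is up for 55 of the 168 hours in a week, so roughly two thirds of its compute hours go away.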

Even when resources are provisioned automatically, developers may bump up the size to something silly because they saw a performance problem and didn’t have the time or expertise to determine the true bottleneck.

But right-sizing doesn’t just apply to compute resources. For example, Amazon’s DynamoDB lets you provision read and write capacity. Set it too low and you’ll likely start running into throttling errors; set it too high and everything will work, but you’ll be paying through the nose for capacity that is never used. In this case, consider on-demand capacity or auto scaling: it’s worth comparing the cost against what you provision today.
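To make that comparison concrete, here is a back-of-the-envelope sketch. Every price in it is a placeholder that I made up for illustration, so substitute the current rates for your region. It is also simplified: it treats each read or write as a single billable unit, which ignores item size and consistency settings.

```python
HOURS_PER_MONTH = 730  # common approximation for an average month


def provisioned_monthly_cost(read_units, write_units,
                             price_per_read_unit_hour,
                             price_per_write_unit_hour):
    """Cost of holding fixed capacity all month, used or not."""
    hourly = (read_units * price_per_read_unit_hour
              + write_units * price_per_write_unit_hour)
    return HOURS_PER_MONTH * hourly


def on_demand_monthly_cost(reads, writes,
                           price_per_million_reads,
                           price_per_million_writes):
    """Cost of paying per request for the traffic actually served."""
    return (reads / 1_000_000 * price_per_million_reads
            + writes / 1_000_000 * price_per_million_writes)


# Placeholder prices and traffic, NOT real rates.
provisioned = provisioned_monthly_cost(100, 50, 0.0001, 0.0005)
on_demand = on_demand_monthly_cost(20_000_000, 5_000_000, 0.25, 1.25)

print(f"provisioned: {provisioned:.2f}  on-demand: {on_demand:.2f}")
```

The general shape of the result matters more than the numbers: steady, well-utilized tables tend to favor provisioned capacity, while spiky or mostly idle tables tend to favor paying per request.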

Although the true fix is often to change how the software is run, teams that don’t have cost insights will never understand how their decisions affect the bottom line. If you have an SRE team, this is an advanced area where they can help identify where coaching is needed.

Use the Right Regions

It turns out that some locations are more expensive than others to run your infrastructure. Pay attention to the cost of resources, and where they live. It may be that you can save a lot of money with zero impact to service, simply by changing the regions where your resources run.

Apply Service Tiering and Retention Policy

Many companies store more data than they need. This is expensive, and it increases the risk of exposure should that data be compromised. But irrespective of how much data you should have, there are chunks that you need frequently, chunks that you need infrequently, chunks that you will need rarely (like when recovering from an outage or during an audit), and chunks that you will never need. Cloud storage is typically tiered: the less frequently or quickly you need to access data, the cheaper it becomes. See if you can apply these principles to the data that you store.
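The classification itself can be as simple as an age threshold. The sketch below assumes a hypothetical policy based on days since last access, with made-up tier names and cut-offs; in practice you would usually express the same rules in your provider’s storage lifecycle configuration rather than run code like this yourself.

```python
from datetime import date

# Hypothetical policy: (maximum age in days, tier). Anything older than
# the last entry falls out of retention. Thresholds are illustrative only.
POLICY = [
    (30, "frequent"),
    (90, "infrequent"),
    (365 * 7, "archive"),
]


def tier_for(last_accessed, today, policy=POLICY):
    """Pick the first tier whose age limit covers the data, else "delete"."""
    age_in_days = (today - last_accessed).days
    for max_age, tier in policy:
        if age_in_days <= max_age:
            return tier
    return "delete"


today = date(2024, 6, 1)
for last_accessed in (date(2024, 5, 20), date(2024, 3, 15),
                      date(2020, 1, 1), date(2010, 1, 1)):
    print(last_accessed, "->", tier_for(last_accessed, today))
```

Whatever the thresholds end up being, agree the retention period with whoever owns compliance before anything lands in the “delete” bucket.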

Use Reservations

Depending on your use case, reserved instances can save you a ton of money. But take it slowly and cautiously: it’s important not to make big, sweeping purchases that will later turn out to be shortsighted.
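One way to stay cautious is to work out the break-even point before buying. A reservation is paid for whether or not the instance runs, so it only wins if the instance would otherwise run on demand for more than a certain fraction of the term. The sketch below uses made-up hourly prices and assumes any upfront payment has already been spread across the term into an effective hourly rate.

```python
def break_even_utilization(on_demand_hourly, reserved_effective_hourly):
    """Fraction of the term an instance must run for a reservation to pay off."""
    return reserved_effective_hourly / on_demand_hourly


def reservation_saves_money(expected_utilization, on_demand_hourly,
                            reserved_effective_hourly):
    """True if expected usage is above the break-even fraction."""
    threshold = break_even_utilization(on_demand_hourly,
                                       reserved_effective_hourly)
    return expected_utilization > threshold


# Placeholder prices, NOT real rates.
ON_DEMAND = 0.10
RESERVED = 0.06

print(f"break-even: {break_even_utilization(ON_DEMAND, RESERVED):.0%}")
print("always-on service:", reservation_saves_money(0.8, ON_DEMAND, RESERVED))
print("office-hours box: ", reservation_saves_money(0.4, ON_DEMAND, RESERVED))
```

Note how this interacts with the earlier advice: once you start shutting things down out of hours, some workloads drop below break-even and are no longer good reservation candidates, which is another reason to clean up first and reserve second.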

Review Third-Party Contracts

Irrespective of whether your primary cloud provider is your biggest spend, it’s worthwhile taking an inventory of your third-party contracts and, although it might be boring, reviewing monthly spend to see where there is fat to trim. It’s easy to overlook these things. For example, at several companies, we found telecoms subscriptions that should have been canceled several years before.

Summary

If you haven’t moved to the cloud, then it’s probably a good thing to do, but it takes a lot of planning to do right. “Lift and shift” approaches, where data center infrastructure is replicated in the cloud, tend to be expensive and error-prone, since they cannot take advantage of the benefits of the cloud.

So take your time. Organizations that think about this stuff up-front tend to have fewer problems later.

If you’re already in the cloud and still not sure where to start on making things better, then let’s talk.

Software and Technology Nerd, DevOps Ninja, Maker of Things, Aerospace Enthusiast. https://orc.works/