A while ago, I wrote an article about some quick wins you can look at to cut your cloud spend. Perhaps you took a look at all of that low-hanging fruit, and are in better shape than you were. But there’s almost always more that can be done to save money.
Here are a few things to look at. If you know of others, then please leave a comment or send me a note!
So, What’s Next?
There are some things that require strategic attention to resolve, like retrospectively adding auto scaling to a service that wasn’t designed for it. But there are others where small changes to code, or good custodial behavior can still be highly impactful.
In the first instance, it’s good to have data to quantify and categorize spend. Tag resources by type, business unit, team, application, whatever makes most sense. Add in cloud metadata like location, utilization, and so on. Use the tooling provided by your cloud provider (or one of a dozen or so third parties) to slice and dice spend by category. Look for hot spots.
Oh — I’ve seen companies that make this exercise far more complex than is actually needed: we’re looking for a handful of labels that can help isolate these hot spots, not boil the ocean with complex categorizations that are added “just in case”. Don’t be like those guys. Also, avoid applying tags for things like region where there are other ways of obtaining that information from the reporting software.
Of course, if you can, make the tagging process completely automated, ideally through the provisioning process. This way, you’ll never have to worry about manually tagging again.
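As a minimal sketch of what “automated through the provisioning process” can look like, here’s a gate that refuses to create anything that isn’t fully tagged. The required tag set is just an assumption for illustration; use whatever labels fit your organization:

```python
# Hypothetical example: enforce a small, consistent tag set at provisioning time.
REQUIRED_TAGS = {"team", "application", "environment"}  # assumed label set

def validate_tags(tags: dict) -> list:
    """Return the required tags missing from a resource definition, sorted."""
    return sorted(REQUIRED_TAGS - tags.keys())

def provision(resource_name: str, tags: dict) -> str:
    """Refuse to provision anything that isn't fully tagged."""
    missing = validate_tags(tags)
    if missing:
        raise ValueError(f"{resource_name}: missing required tags {missing}")
    # ... real provisioning (Terraform, CloudFormation, API calls) goes here ...
    return f"{resource_name} provisioned"
```

Wiring a check like this into your pipeline means untagged resources simply can’t exist, so the reporting stays trustworthy.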
Now, dig in and fix what you can. Once you’ve addressed as much of the low-hanging fruit as possible, prioritize what’s left based on impact and effort.
Don’t Treat the Cloud Like the Data Center
More on this later, but if you rely on people to create and destroy resources, then stuff is going to get inconsistent quickly, and you will lose all the benefits of using the cloud. By automating the way that infrastructure is provisioned and maintained, you will save a lot of time and energy later. This automation will ensure consistency and predictability in delivery.
Use Reserved Instances or Savings Plans
Just like the insurance advert, you can save up to 70% on some resources if you switch. The idea is that if you know that you’re going to be using certain types of resources for the long-haul, then you can make a contractual commitment to your cloud provider that you’ll be using those resource types for a year or more. Just like most termed contracts, you can get some fairly deep discounts when you give your provider this extra degree of confidence in the business. There is nuance and a bunch of pitfalls to navigate of course, like what happens if your needs change? But with some care, attention, and a few tricks, it’s possible to use these deals to your long-term advantage.
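To make the trade-off concrete, here’s a back-of-envelope comparison. All of the rates are made-up placeholders; real pricing varies by provider, region, and instance type:

```python
# Illustrative numbers only; real rates vary by provider, region, and instance type.
on_demand_hourly = 0.10        # assumed on-demand rate ($/hr)
reserved_hourly = 0.06         # assumed effective 1-year commitment rate ($/hr)
hours_per_year = 24 * 365

on_demand_annual = on_demand_hourly * hours_per_year
reserved_annual = reserved_hourly * hours_per_year   # paid whether used or not

savings = on_demand_annual - reserved_annual
# Break-even: below this utilization, the commitment costs more than on-demand.
break_even_utilization = reserved_hourly / on_demand_hourly
```

The break-even line is the key number: at these assumed rates, if the instance runs less than 60% of the time, the commitment is a loss, which is exactly the “what if your needs change” pitfall.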
Use Spot Instances
Because customers add and remove resources on demand as their business needs change, there are times when your cloud provider has a surplus of infrastructure. Having hardware sitting around unused is bad for business, so they offer deep discounts on the understanding that the resource may be taken away from you at a moment’s notice, should they find someone who is prepared to pay more. So, again, there are tricks, and this only works if you provision and run in a highly automated and scalable way.
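One of those tricks is to checkpoint long-running work so that an interruption only loses the current chunk. A minimal sketch, using an in-memory dict as the checkpoint store (in practice you’d persist it somewhere durable):

```python
# Sketch: make batch work restartable so a spot interruption costs at most
# one chunk of progress. The checkpoint dict stands in for durable storage.
def process(items, checkpoint: dict, chunk_size: int = 100):
    """Resume from the last checkpoint; record progress after each chunk."""
    start = checkpoint.get("next_index", 0)
    for i in range(start, len(items), chunk_size):
        chunk = items[i:i + chunk_size]
        # ... do the real work on `chunk` here ...
        checkpoint["next_index"] = i + len(chunk)  # survives an interruption
    return checkpoint["next_index"]
```

If the instance disappears mid-run, a replacement picks up from `next_index` rather than starting over, which is what makes the discount worth the risk.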
Don’t Back-Haul to the Data Center
Perhaps it was done because of uncertainty, or a fear-based rather than risk-based approach, but if you designed your cloud as an “add-in” to what’s already there, where all of your cloud resources have to back-haul to the data center, then you’ll be spending a lot more on network costs and likely impacting service availability.
Perhaps you also know and love network appliances like those provided by companies such as F5, so you run virtual instances in the cloud to front your applications. These appliances have a lot of positives but, when you’re in the cloud, the biggest benefit is familiarity with what you know. Your cloud vendor will be able to provide alternatives that don’t carry such huge licensing fees, don’t cost as much to run, and will integrate seamlessly.
Constrain Number of Environments
Naturally, you need an environment where the production systems live. Then the developers said that they need somewhere to test their code. Then the QA people said that they needed an environment to do acceptance testing. Then the product teams said that they are responsible for approving everything before going to production. Then the QA team said that they are working with another set of teams that work on a different schedule, so of course they need a separate test environment there.
Because of duplication, each environment adds a lot to overall costs, and it’s easy to let the number of distinct set-ups be dictated by Conway’s Law. What can you do to cut the number of environments down to one or two?
There’s another side to this beyond the basic cost of resources. More environments:
- Require higher upkeep and oversight;
- Individuals spend more time thinking and discussing where a particular thing should go;
- There is an increased risk of inconsistencies between environments which often leads to unnecessary troubleshooting;
- It’ll just take gosh darn longer to get from that commit you made to production.
Consider Serverless
The real numbers are a little more nuanced, but if you have a simple API that can respond within 100ms, you’ll spend around $0.20 for every million requests serviced. How awesome is that? Even better, you don’t have to worry about building and running servers, or even scaling: it’s all taken care of for you.
Changing to serverless isn’t for the faint of heart: some languages are more effective than others, and designing APIs well takes some thinking about, as does making sure that you’re factoring in gotchas like cold-start times. That being said, the benefits can be huge.
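As a sanity check on that figure, here’s the arithmetic for a hypothetical traffic level. This covers the per-request charge only; compute time is billed separately and excluded here:

```python
# Back-of-envelope check on the $0.20-per-million figure (request charge only;
# compute duration is billed separately and not modeled here).
price_per_million_requests = 0.20   # assumed request pricing
monthly_requests = 50_000_000       # hypothetical traffic

request_cost = monthly_requests / 1_000_000 * price_per_million_requests
```

Fifty million requests a month for roughly ten dollars in request charges, and no idle servers to pay for, is the kind of number that makes the redesign effort worth evaluating.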
If you’re not into serverless, then autoscaling is a must. Never, ever, ever, ever deploy a bunch of static compute instances to run your business!
There was a company I worked with last year that had static infrastructure where requests were fronted by an F5. In order to update the application software, the operations team had to remove one server from the F5 pool, wait for traffic to quiesce, terminate the application, update the software, bring it back up, verify that it worked, then add it back into the pool. Then repeat until all machines had been updated.
Can you imagine how long this process takes? Can you imagine what higher value those engineers could be bringing to the company if the update process were a simple button push in something like Spinnaker?
Not to mention that you’ll be paying for all of those instances, even when they’re not needed. Who thinks that’s a good idea?
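The scaling decision that fixes the “paying for idle instances” problem is simple math. Here’s a minimal sketch of the target-tracking approach most autoscalers use; the bounds and utilization figures are hypothetical:

```python
import math

# Sketch of target-tracking scaling math: size the fleet so that average
# utilization lands on a target, clamped to sane bounds. Numbers are made up.
def desired_capacity(current: int, utilization: float, target: float,
                     minimum: int = 2, maximum: int = 20) -> int:
    """Return the instance count needed to bring utilization to the target."""
    raw = math.ceil(current * utilization / target)
    return max(minimum, min(maximum, raw))
```

For example, four instances running at 75% utilization against a 50% target should grow to six; the same formula shrinks the fleet overnight when utilization drops, so you stop paying for capacity nobody is using.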
Use the Right Resources, Use Them Effectively
Just because your cloud platform allows you to do something, it doesn’t mean that it’s great practice. Sometimes, you can do really, really horrible things and not realize what the impact is. Right now, I’m not even talking about what happens when you don’t secure your S3 buckets appropriately.
Here’s an example of what I am talking about.
We’ve worked with a couple of teams who wanted to avoid using a traditional database server for the service they were building, and picked DynamoDB instead.
It’s a great service for what it is, but it’s designed for fast, frequent lookups on a single primary key. This means that the applications that use it should really only ever search on that one key. If you treat it like a relational database, then you’ll end up querying on multiple attributes. If you do this often, then you’ll want more indexes. If you have more indexes, then you’ll very quickly start paying through the nose.
At this point you have two choices: either redesign your service to work better with Dynamo, or switch to a relational database. Neither option is really great, but either is potentially better than leaving that technical debt festering.
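A toy illustration of why the access pattern matters so much, with plain Python dicts standing in for a Dynamo table (the data and key scheme are made up):

```python
# Toy model: a key-based lookup touches one item, while querying on an
# un-indexed attribute has to read everything -- which, in DynamoDB, is
# capacity you pay for. Data and key scheme here are hypothetical.
table = {
    "user#1": {"name": "Ada", "plan": "pro"},
    "user#2": {"name": "Grace", "plan": "free"},
    "user#3": {"name": "Alan", "plan": "pro"},
}

def get_by_key(pk: str) -> dict:
    return table[pk]  # one item read

def scan_by_attribute(attr: str, value) -> list:
    # Touches every item; the real-world analogue consumes read capacity
    # for the whole table on every call.
    return [pk for pk, item in table.items() if item.get(attr) == value]
```

The first function stays cheap no matter how big the table gets; the cost of the second grows with the table, which is exactly the trap relational habits lead you into.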
Use Managed Services
You can either build and run your own set of Kubernetes clusters, or you can use the equivalent offered by your cloud provider. You’ll almost certainly have more control and nuance if you build and manage everything yourself, but do your customers really care? What can you do to fit the standard model to drive down costs and improve availability?
That being said, some managed services can actually be more expensive than doing it yourself: perhaps they provide more features or a higher level of availability than you need. It’s always good to do a cost comparison before jumping in.
…And You’re Really Sure About the Data Center?
Everyone has their own path, and there are rarely absolutes, but generally, yes!
From a finance perspective, for companies that report (for example) EBITDA, the numbers look great when deploying to the data center: you can do some great build-outs, run state-of-the-art hardware, and it has little consequence to the financials. But the reality is that, irrespective of capitalization, money has to be spent up-front in anticipation of future demand: once purchases are made, it’s not easy to go back, and you’ll be writing that investment down for a while. You also have the administrative burdens of tracking and replacing equipment, managing supporting infrastructure, HVAC, power, physical access, and so on.
Then think about what it costs to run active-active across multiple sites with diverse network routing, or even build out and manage a disaster recovery location.
On top of all that, you’re also spending the same amount of money for every hour of every day, even in the middle of the night when all of your customers are asleep. There’s huge benefit to having someone else worry about all of that for you, including how much it costs. Not to mention that all of that high availability stuff comes for free.
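A rough, entirely made-up comparison of a static fleet sized for peak versus one that follows demand shows why those overnight hours add up:

```python
# Rough illustration of paying for idle capacity. All numbers are invented.
hourly_rate = 0.10        # assumed per-instance rate ($/hr)
peak_instances = 20       # fleet size needed at peak
hours_per_month = 730

# Static fleet: sized for peak, billed around the clock.
static_cost = peak_instances * hourly_rate * hours_per_month

# Autoscaled fleet: hypothetical demand profile of peak for 8h/day,
# half load for 8h, and quarter load overnight.
scaled_instance_hours = (20 * 8 + 10 * 8 + 5 * 8) * 30   # per month
scaled_cost = scaled_instance_hours * hourly_rate
```

Under these assumptions the scaled fleet costs a bit over half of the static one for the same peak capacity, and that’s before counting the operational toil the static setup carries.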
Because of understandable cognitive biases, the guys who run the data center sometimes say that they have run an analysis and they have determined that it is (safer, cheaper, more reliable,… pick any combination) to stay in the data center. Although there is a lot of care that goes into getting cloud infrastructure right, assertions like this are usually a trap!
If you haven’t moved to the cloud, then it’s probably a good thing to do, but it takes a lot of planning to do right… “lift and shift” approaches where data center infrastructure is replicated in the cloud tend to be expensive and error prone, since they cannot apply any of the benefits of the cloud. So take your time.
If you’re already in the cloud but not sure where to start on making things better, then let’s talk.