Scale, Clouds and Where Things Go Terribly Wrong – a View from the Silicon Forest

Scale is a very important topic. I speak to our customers a lot about scale and the problems that come from doing epic things. I specialize in primary storage and replication, but the solutions I touch involve business-critical and cloud native applications that touch every part of the data center, wherever it lives. I have seen customers build wildly successful clouds from the ground up. I have seen others fail at scaling a single application or process. This is a summation of a few lessons learned over time from watching those successes and failures. It is how I advise my customers on solutions. Their behaviors in these categories drive the solutions I position with them.

The TL:DR:
1. Start with standardization and stick with it
2. Anything worth doing once is worth teaching someone/something else to do it.
3. The answer to the “what problems do you anticipate with this solution?” question is always scale.

Start with standardization and stick with it.

I’ve been on the ground in the Silicon Forest for a while now. I’ve lived and worked here for over a decade. I’ve supported just about every major company in the region in some capacity. There’s incredible value in meeting with all of these IT shops and seeing what they do and how. If you pay attention, there’s a lot to be learned in how people do things and how they don’t. Two of the most successful IT shops I have worked with had the same thing in common: standards for their hardware and software. Every major escalation I’ve worked has a commonality too: snowflakes. Holding on to that one special app from 1983 that runs on an OS that hasn’t been supported for a decade? I don’t care what it does for you, kill it. Kill it with fire. Kill it before it kills your business and my holiday weekend. PLEASE!
Ok, that was an extreme example. I’ve seen many minor settings, small software version discrepancies and other minor “special” tweaks on hosts cause huge performance issues, production outages that cost millions, data corruption and lost jobs. So what does good look like? In short, it looks just like converged infrastructure. If that isn’t an investment you want to make, you build it by choosing one or two hardware skews for server builds, extending the same model to Operating Systems and other data center components. Once you have that established, work with vendors at each layer to come up with a testing plan for all software and firmware upgrades. Put that stuff in a lab. Test. Verify. Go. The fewer variables you have in your data center, the easier it will be to get to that low time to resolution when an issue pops up. The goal is to provide consistent, reliable infrastructure and cut it like cookies. You are not a beautiful or unique snowflake. Being rigorous in enforcing standards will buy time for your IT shop to be flexible in driving the results your business needs. This is how two successful cloud companies started their journey, but even if you aren’t building a giant cloud, there is tremendous value in standardization. If you make exceptions, you are wasting time, money, energy and morale.

Anything worth doing once is worth the time spent teaching someone/something else to do it

This is a life philosophy for me. Any technical task done in the data center eventually gets repeated. (Except maybe accidentally hitting the power switch on your UPS when booting your laptop, but that’s a story for another time.)
I don’t like repetitive tasks. I like to do cool new things, learn and have fun. I’ve had coworkers and seen admins that like to keep their work a secret and do it manually thinking that they are somehow protecting their jobs. We are knowledge workers. We should be spending our time advancing our knowledge to ensure our value. Time spent on repetitive tasks or alone in our work is time we are squandering our value to our employers and our community. The worst I have seen in this scenario is a company who had to rebuild thousands of virtual machines without a single template/image/script. This hurts the standards plan too because no human can repeat a process like that flawlessly that many times. Steps get missed. Here’s what an A+ looks like: bundled automated OS load, scripted software installs based on server role, configuration of cluster via a one-to-many terminal setup, configuration management with naming conventions and processes so well documented (and easily accessible) that any member of the team could pick it up and run with it. Do you need all of this to be successful? No, but it sure doesn’t hurt. For my work, I document everything I do as I do it. This is critical to what I do because I support so many customers and have 9 years of solutions out there -remembering what I did yesterday can be a challenge. I also try to start with scripting my process. I take each command I run and put it in a simple batch script and/or a macro in a spreadsheet. The first pass is manual. Get the process, document and build a repeatable script. On the second pass, I use the base script with some waits and add some intelligence. If someone is shadowing me, I make them run the keyboard and tell them what to do/explain while I document. There are usually a ton of screenshots and hacks listed. I haven’t posted many of these because they were customer-specific, but I plan to convert them into something more generic and put them on github. I’ve never been the person to enjoy giving up my personal time. I’ve never been all that interested in doing the same thing all of my life. I try to share what I know, sometimes it has to be pulled out of me, but that has more to do with all of the things happening in my head than a willingness to share. I feel that the more people know how to do what I do, the more free I am to do whatever excites me next. Automation and knowledge sharing are key to my ability to move and learn more. these are all things you need if you move to the cloud as well. After all:

The answer to the “what problems do you anticipate with this solution?” question is always scale.

II try not to use words like always, but scale is such a big and nebulous thing it’s an easy answer to this question. I try to anticipate and plan for any bottlenecks up front, but the obvious issues are rarely the ones that pop up if you are any good at building a solution. Lab testing just doesn’t happen in life size, and even if it did there are many ways that scale can break a solution. I’ve seen several OpenStack efforts fail due to a lack of resources. As technologists we often underestimate the amount of time we need to spend on care and feeding for a given endeavor. Human scale is vitally important -all the more reason to standardize and automate to eliminate unnecessary work. For a given piece of hardware, there are limits to how high it can scale and limitations on making that choice. Sure, you can build a giant server to handle an app, but eventually that app can surpass the resources of a single server. Scaling up is great until you hit the ceiling and are stuck with an app that chokes on the bottleneck. So then what do we do? We scale out and throw more servers at the problem. Scaling out works for a while, but we almost always outgrow that too. Before the app outgrows the technology, we often find other issues with scale: Fix the component that was once the issue, find a new one. I asked a guy in our IT department about lessons learned from our HANA deployment a few years ago. He said that deploying an in-memory database was a great way to fully expose bad database queries. After years of storage performance reviews and escalations, his answer felt like justice. I wanted to record it and broadcast it outside the offices of all of my customers like John Cusack:
Storage admins, I’m still with you! Scale is the reason we have jobs. It’s what makes my job fun and frustrating. It enables me to revisit solutions every few years and solve the old problem in a brand new way. Every customer and project eventually has issues here. Those that handle the issues the best understand their workloads, standardize, automate and find ways to leverage technology such as non-disruptive migrations in order to account for growth and sudden changes in workload requirements. This is where we learn the most. This is why it is important to standardize and automate as much as possible from the beginning.

In closing, if you want to enjoy your life: standardize, share knowledge and automate. The use of converged infrastructure, public or private clouds can be an “easy button” for getting these things done, but it is possible to build a great environment without it. The issues always arise when operating at scale. Understanding your workload and leveraging advanced data services can help combat issues that arise at scale quickly -provided the proper groundwork has been done.

2 thoughts on “Scale, Clouds and Where Things Go Terribly Wrong – a View from the Silicon Forest”

  1. Melissa, Brilliant article, great perspective and insight. Converged infrastructures definitely help to reduce the number of variables and opportunities for error, especially if you standardize on a model that can scale. Plus the Cusack reference was cool.

Comments are closed.