When disaster strikes, proper preparation prevents poor performance

It's going to happen to you one day

As Benjamin Franklin famously said: "An ounce of prevention is worth a pound of cure," and that's especially true when it comes to disaster recovery.

Most organizations with a reasonably sized IT department have an incident response plan in place, but simply having one is far from sufficient. These plans must be updated continuously and tested regularly, and the testing should involve more than just the internal team. Companies also need remediation software on hand and backups ready to deploy.

Rick Vanover, Vice President of Product Strategy at disaster recovery firm Veeam, shared his perspective with The Register, reflecting on his early career as an IT professional. "Back then, testing was often a recipe for failure," he recalled. The team would gather on a Saturday night, fuel up on pizza provided by management, and try to work through the plan, only for more than half of the attempts to fail. The usual response was a shrug and the belief that things would improve next year, but he emphasized that such a mindset is no longer viable.

According to Vanover, the landscape has seen significant shifts over the past few decades. However, he noted that many companies continue to rely heavily on theoretical tabletop exercises, discussing "what if" scenarios without ever putting them into practice on live networks or even simulated environments.

Software plays a crucial role in improving preparedness and resilience. Back in 2011, Netflix pioneered Chaos Monkey, a tool designed to randomly shut down virtual machines and induce network disruptions. Over time, the company expanded this into a broader toolkit known as the Simian Army. Its members include Chaos Gorilla, which simulates the failure of an entire AWS availability zone, and Chaos Kong, which mimics the loss of an entire region, a scenario that is far from unheard of.
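The core idea behind Chaos Monkey is to randomly remove capacity and then check whether the service survives. A small in-memory simulation can illustrate it. This is a hypothetical sketch, not Netflix's code. `Instance`, `chaos_round`, `zone_outage` and `service_healthy` are illustrative names, the "fleet" is a Python list rather than real cloud VMs, and `zone_outage` is only a loose analogue of what Chaos Gorilla does at availability-zone scale.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a VM: a name, an availability zone, and a state."""
    name: str
    zone: str
    running: bool = True


def chaos_round(fleet, rng, opt_out=frozenset()):
    """Terminate one randomly chosen running instance (Chaos Monkey-style).

    Instances whose names are in `opt_out` are never touched. Returns the
    terminated instance, or None if nothing was eligible.
    """
    candidates = [i for i in fleet if i.running and i.name not in opt_out]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False  # stand-in for a real cloud "terminate" call
    return victim


def zone_outage(fleet, zone):
    """Stop every running instance in one zone (a loose Chaos Gorilla analogue).

    Returns how many instances were stopped.
    """
    stopped = 0
    for inst in fleet:
        if inst.zone == zone and inst.running:
            inst.running = False
            stopped += 1
    return stopped


def service_healthy(fleet, min_running):
    """Hypothetical health check: the service survives if enough instances remain."""
    return sum(i.running for i in fleet) >= min_running


if __name__ == "__main__":
    # Five web servers spread across zones a, b and c.
    fleet = [Instance(f"web-{n}", zone) for n, zone in enumerate("aabbc")]
    rng = random.Random(42)  # seeded so a run can be reproduced
    victim = chaos_round(fleet, rng)
    print(f"chaos round terminated {victim.name}; healthy: {service_healthy(fleet, 3)}")
    zone_outage(fleet, "b")
    print(f"after losing zone b; healthy: {service_healthy(fleet, 3)}")
```

The seeded random generator is a deliberate choice. It keeps each experiment random with respect to the fleet but lets a failed run be replayed exactly, which matters when the point of the exercise is to learn something before a real outage.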

Beyond Netflix's innovations, there is a wealth of other tools available, ranging from proprietary solutions like AWS's Fault Injection Service and Azure's Chaos Studio to open-source options such as Litmus for Kubernetes resilience testing, as well as commercial offerings like Gremlin.

IT administrators must also prioritize pre-scripted automation for remedial actions, ensuring these scripts are regularly updated and immediately deployable. Given the rapid pace at which attacks unfold, manual methods are simply impractical, according to Vanover.

Jacob Dorval, senior director at Secureworks Adversary Group and a former US Air Force network specialist, emphasized the importance of training. He pointed out that many organizations skip it because they see it as getting in the way of keeping the network running smoothly. However, practicing response procedures is essential, right down to physical tasks like disconnecting and reconnecting hardware.

Drawing from his military background, Dorval stressed the significance of repetition in training: "You train endlessly so that when a crisis occurs, you're prepared to act confidently. That single moment becomes manageable because you’ve practiced and know exactly what steps to take."

Know your network

It's also vital to know your territory, he added, and that means constantly updated network mapping. This information is essential not only for detecting problems as they occur but also for fixing them.

Networks seldom stick to the original design layout, with new hardware and software being added and older stuff getting removed. Checking the precise network topology is key to seeing what's happening, where, and how to isolate the issue and deal with it.
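One basic form of this check compares the documented asset list against what a network scan actually finds. The sketch below assumes both are available as plain lists of IP addresses, for example exported from an asset database and a scanner. `inventory_drift` and the sample addresses are hypothetical.

```python
import ipaddress


def inventory_drift(documented, discovered):
    """Compare documented hosts with hosts actually seen on the network.

    Returns addresses that are live but undocumented ("unknown") and
    addresses that are documented but were not seen ("missing"), each
    sorted numerically. Assumes a single address family (all IPv4 or all
    IPv6), since mixed families cannot be sorted together.
    """
    doc = {ipaddress.ip_address(a) for a in documented}
    seen = {ipaddress.ip_address(a) for a in discovered}
    return {
        "unknown": [str(a) for a in sorted(seen - doc)],
        "missing": [str(a) for a in sorted(doc - seen)],
    }


if __name__ == "__main__":
    # Iterating a dict yields its keys, so a host -> role map works directly.
    documented = {
        "10.0.0.1": "gateway",
        "10.0.0.2": "file server",
        "10.0.0.10": "old print server",
    }
    discovered = ["10.0.0.1", "10.0.0.2", "10.0.0.23"]
    print(inventory_drift(documented, discovered))
    # {'unknown': ['10.0.0.23'], 'missing': ['10.0.0.10']}
```

Unknown hosts are candidates for investigation. Missing hosts may be decommissioned kit that the map never caught up with, or devices that were simply offline during the scan.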

"You've got to make sure you have visibility. Because this entire game is all about detection and response," Dorval warned. "How quick can you detect the threat and then respond to the threat and neutralize the threat. And dwell times are getting shorter and shorter, so making sure that you have visibility is absolutely critical."

Hiring third parties to come in and test your network is also an increasingly popular option. A penetration testing crew will follow the attackers' playbook, quietly mapping out the network with minimal interference before making any moves, just as happens in most real-world attacks. It's a technique the US National Security Agency (NSA) has used for years, as Rob Joyce, the former head of its hacking crew, explained.

"If you really want to protect your network you have to know your network, including all the devices and technology in it," he told the Enigma security conference in 2016. "In many cases, we know networks better than the people who designed and run them."

Dorval agreed, calling it "absolutely critical," pointing out that a third party can sometimes find things that a network administrator might miss. But there are downsides to this approach.

These kinds of pen tests are a business, after all, and this can lead to problems if the contractor is trying too hard to impress the client with a huge list of vulnerabilities and weaknesses. The end result is that admins face a daunting task, unless the reporting is clear on what to prioritize and what can be left fallow for a while.

"Finding 1,000 problems instantly may not be the most instructional or illuminative," warned Dave Russell, SVP and head of strategy at Veeam. "Recommendations should be limited to a set of actionable feedback. But remember, there's no such thing as failing a disaster recovery test. You found out ahead of time that the parachute didn't open. That is not failure."

Recovery plans

The final and, in many cases, most important piece of the plan is backups, and while everyone makes them, they are still an all-too-common point of failure.

There are several key factors to take into account with backups. The backup process needs to be continuous, but Russell told us that when his team visits a site, they sometimes find backups that are out of date or improperly configured. If your budget allows, a second, off-site backup system is a wise investment too, in case of physical damage to a site.

Backups also have to be tested regularly. The Register has spoken to far too many admins who discovered, just when they needed their backups, that they were incomplete, corrupted, or even non-existent because no one had checked whether the backup system was working as it should.
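A minimal sketch of one such check records a SHA-256 checksum for every source file and then confirms that the backup copy exists and matches. This is an illustration under stated assumptions, not any vendor's method. It assumes the backup is a plain file-level copy of a directory tree, and `build_manifest` and `verify_backup` are hypothetical names. A mismatch flags a file that is either corrupted or out of date relative to the source. Either way it needs attention.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {str(p.relative_to(root)): sha256_of(p) for p in root.rglob("*") if p.is_file()}


def verify_backup(manifest, backup_root):
    """Return a list of problems; an empty list means every file checked out."""
    backup_root = Path(backup_root)
    if not backup_root.is_dir():
        return [f"backup location {backup_root} does not exist"]
    problems = []
    for rel, expected in manifest.items():
        copy = backup_root / rel
        if not copy.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(copy) != expected:
            problems.append(f"mismatch (corrupted or out of date): {rel}")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src, bak = Path(tmp, "src"), Path(tmp, "bak")
        src.mkdir()
        (src / "ledger.csv").write_text("id,amount\n1,100\n")
        shutil.copytree(src, bak)  # simulate a backup run
        manifest = build_manifest(src)
        print(verify_backup(manifest, bak))  # []
        (bak / "ledger.csv").write_text("garbage")  # simulate silent corruption
        print(verify_backup(manifest, bak))
        # ['mismatch (corrupted or out of date): ledger.csv']
```

A checksum comparison only proves the bytes are intact. A full restore test, which brings the data back up on a clean system, is still the stronger check.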

Backups also increasingly need to be protected on the network itself. The more advanced cybercriminals are making backups a priority target: infostealers go after them because that's where all the information is, and ransomware gangs because a victim is much more likely to pay up if their backups are locked down too.

Ultimately, there are always going to be incidents, human-made or otherwise, that can bring a network to its knees. But preparing for the worst makes faster, more effective responses possible when they do.

"You've been awake now for 29 hours," Russell explained. "You're pounding coffee, you're trying to remember where that encryption key is, and you're hoping you make good decisions time after time to get everything going. So that sounds very risky."
