Super Bowl night in America is practically a national holiday. This year’s game was interrupted by one of the most widely viewed power outages of the last decade: a delay of more than 30 minutes, and an incalculable financial impact on one of the most popular television spectacles in the United States. When outages like this occur, one cannot help but wonder about the redundancy and resiliency of infrastructure services as crucial as power. And while the power requirements of the Super Bowl surely exceed those of a typical business day, “the outage” raises a fair question: is your business’s IT infrastructure protected against a similar failure?
Businesses often install power systems similar to those one might suspect existed at the Superdome. For example, some datacenters have redundant power feeds coming into the building to limit the impact of an interruption on a single power grid. Behind the incoming power, businesses often use battery backup to ride through momentary losses of power, and a generator to sustain systems during a prolonged outage. Without any of these systems, or without ongoing maintenance, a business might suffer the same impact as was experienced Sunday night.
But even with all of these fail-safes, issues sometimes do occur. And when they happen, are your procedures up to date so that you can get all of your systems back up and running in a timely fashion? Your recovery time objectives are directly linked to your ability to regain power and sustain it through the recovery process. For instance, during the Superdome outage, we saw the lights come back on in stages, following a detailed checklist, with teams allowing each set of lights to warm up and reach operating current before moving on to the next set of units. The same strategy is often used in datacenter environments, because the startup current of most electronic devices far exceeds their operating current. Most datacenters are sized for the operating current of the devices in place, and do not necessarily supply enough electricity to start all devices in parallel. Having a startup order for your hardware assets after a power outage is a necessity in this situation, so that you can prioritize the workload and bring up critical services first. Powering on too many devices at one time can cause another outage, while being too conservative can prolong the outage.
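To make the staged power-on idea concrete, here is a minimal sketch in Python. It assumes you know a steady-state and a startup (inrush) current for each device; the device names, amp figures, the 30 A capacity, and the `plan_startup` helper are all hypothetical and only illustrate the reasoning. A batch of devices is started together only if the draw of everything already running, plus the startup draw of the batch, stays within capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    priority: int        # lower number = more critical, started earlier
    running_amps: float  # steady-state draw (hypothetical figure)
    inrush_amps: float   # peak draw while starting (hypothetical figure)


def plan_startup(devices, capacity_amps):
    """Group devices into sequential power-on batches, most critical first.

    A device joins the current batch only if the steady draw of all
    finished batches plus the inrush draw of the current batch stays
    within capacity_amps. Otherwise the batch is closed and a new one starts.
    """
    batches = []
    current = []
    settled = 0.0          # running amps of devices in finished batches
    current_inrush = 0.0   # inrush amps of devices in the current batch
    current_running = 0.0  # running amps of devices in the current batch

    for dev in sorted(devices, key=lambda d: d.priority):
        if current and settled + current_inrush + dev.inrush_amps > capacity_amps:
            # Let the current batch settle to its running draw first.
            batches.append(current)
            settled += current_running
            current, current_inrush, current_running = [], 0.0, 0.0
        if settled + dev.inrush_amps > capacity_amps:
            raise ValueError(
                f"{dev.name} cannot be started within {capacity_amps} A"
            )
        current.append(dev.name)
        current_inrush += dev.inrush_amps
        current_running += dev.running_amps

    if current:
        batches.append(current)
    return batches


# Hypothetical inventory for one 30 A circuit.
DEVICES = [
    Device("app-server", 3, 4.0, 10.0),
    Device("core-switch", 1, 2.0, 6.0),
    Device("backup-server", 4, 3.0, 8.0),
    Device("storage-array", 1, 8.0, 20.0),
    Device("db-server", 2, 4.0, 10.0),
]

if __name__ == "__main__":
    for step, batch in enumerate(plan_startup(DEVICES, 30.0), start=1):
        print(f"Step {step}: power on {', '.join(batch)}")
```

With these made-up numbers, all five devices draw only 21 A once running, yet they cannot be started at once: their combined startup draw is 54 A. The plan comes out as three steps, with the network and storage first. A real plan would also need a wait time between steps, which this sketch leaves out.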
Power distribution is an issue as well. Typical datacenter redundancy exists at the power supply of the server: plugging two cords into two separate power sources (e.g., two UPS units) distributes the load and avoids overloading either side during normal conditions. Unfortunately, some devices don’t have redundant power supplies, which causes balancing issues between the two power sources and can potentially overload one of them. Carefully monitoring what is plugged into each source, and tying that inventory to this type of recovery plan, is the only way to ensure that circuit overloading is not occurring.
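The balancing problem can be checked with some simple arithmetic, sketched below in Python. The sketch assumes that a dual-corded device splits its draw evenly across both feeds in normal conditions and pulls its full draw from the surviving feed when the other fails; real power supplies may not share load evenly. The device names, wattages, the 2000 W capacity, and the `feed_loads` and `overloaded_on_failover` helpers are hypothetical.

```python
def feed_loads(devices):
    """Return (normal, failover) load in watts per feed.

    devices: iterable of (name, watts, cords), where cords is a tuple of
    the feeds the device is plugged into, e.g. ("A", "B") or ("A",).
    normal[f]   = load on feed f while every feed is healthy
    failover[f] = load on feed f if it is the only feed left standing
    """
    normal = {}
    failover = {}
    for _name, watts, cords in devices:
        for feed in cords:
            # Assumption: draw is shared evenly across a device's cords.
            normal[feed] = normal.get(feed, 0.0) + watts / len(cords)
            # If this feed is the only survivor, it carries the full draw.
            failover[feed] = failover.get(feed, 0.0) + watts
    return normal, failover


def overloaded_on_failover(devices, capacity_watts):
    """List the feeds that would exceed capacity if they had to stand alone."""
    _normal, failover = feed_loads(devices)
    return sorted(f for f, load in failover.items() if load > capacity_watts)


# Hypothetical inventory: two dual-corded servers, two single-corded devices.
INVENTORY = [
    ("db-server", 800.0, ("A", "B")),
    ("app-server", 600.0, ("A", "B")),
    ("legacy-switch", 400.0, ("A",)),
    ("tape-library", 500.0, ("A",)),
]

if __name__ == "__main__":
    normal, failover = feed_loads(INVENTORY)
    print("Normal load:  ", normal)
    print("Failover load:", failover)
    print("At risk:      ", overloaded_on_failover(INVENTORY, 2000.0))
```

In this example feed A looks healthy day to day at 1600 W, but it would be asked for 2300 W if feed B were lost, because both single-corded devices sit on A. Moving one of them to B would bring both sides under the assumed 2000 W limit.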
Finally, as we saw with the Superdome, as well as with some of the recent natural disasters around the world, it pays to pre-plan for any potential incident. If brownouts are expected in the area due to high air conditioner utilization, if heavy snowstorms are forecast, or if a major event at your business will push power requirements very high, forethought and planning around datacenter redundancy and resiliency is always a good idea.
- Ensure that all systems (e.g., generator, UPS, batteries, HVAC) are up and running in peak operating condition
- Complete all regular maintenance on these units, and keep track of life expectancies
- Maintain a power-on list, with order and timing, for the complete datacenter in the event that many devices are lost in a single incident
- Constantly monitor power systems to ensure that none would be overloaded if it had to carry the load alone during a disaster
- Pre-plan for any major event or upcoming weather
While none of these would have specifically saved the Superdome, they may help your business avoid a major loss of service and a potentially very public outage.
by Jim Joseph