Top 10 things a great Data Center Manager will do
What does it take to be a Data Center manager with real-time responsibilities, supporting the hardware, systems, software and network performance critical to an organization?
I thought I would share the top 10 things our Data Center management team is focused on, and I hope to hear more great ideas and feedback from other Data Center managers!
1. Use multiple systems to monitor. No single monitoring system, no matter how well built, can possibly be responsible for ensuring up-time. Too often Data Center managers trust a single platform or vendor to completely manage disparate systems such as switch infrastructures, edge routers, virtualization platforms, etc. Despite all the sales and marketing lately from the industry about “Data Center 2.0” being a “single fabric” or “end to end platform” - there are still crucial metrics and key events that will not be tracked on a single platform. How full is my storage? How many multicast routes do I have from a particular trading exchange? How many firewall rules exist for a specific application? The list goes on and on. While it’s wise to have different software-based systems “cross monitoring” from both internal and external vantage points, internal monitoring systems such as embedded controller cards should not be overlooked. Alert agents that run directly inside the systems themselves can often provide unique insight into system health and can work independently of external SNMP, RMON or agent polling. The key to bringing all these different systems together is a solid business workflow, where the Data Center staff know which monitoring systems are responsible for what, and have a well-defined remediation policy in place so everyone knows “where to look” during an alert. Only by fully understanding the capabilities of each platform can your team be certain to be alerted when an outage or risk exists. Finally, do not forget a “pre-flight” check - at each shift change, or at least right before the start of the work day. This is a comprehensive deep dive into the performance and state of all systems and services that are critical.
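A “pre-flight” check like this can even be partially scripted. Here is a minimal sketch in Python - the check functions, thresholds and readings are all invented for illustration; in a real facility each check would poll its own monitoring platform:

```python
# Hypothetical "pre-flight" sketch: aggregate findings from several
# independent checks into one report before the work day starts.
# All names, thresholds and readings below are made up for illustration.

def check_storage(pct_used: float, threshold: float = 85.0) -> list[str]:
    """Flag storage that is nearly full."""
    return [f"storage at {pct_used:.0f}% used"] if pct_used >= threshold else []

def check_multicast_routes(route_count: int, expected: int) -> list[str]:
    """Flag a drop in multicast routes from a feed (e.g. a trading exchange)."""
    if route_count < expected:
        return [f"only {route_count}/{expected} multicast routes present"]
    return []

def preflight(readings: dict) -> list[str]:
    """Run every check and collect anything that needs attention."""
    findings = []
    findings += check_storage(readings["storage_pct_used"])
    findings += check_multicast_routes(readings["mcast_routes"],
                                       readings["mcast_expected"])
    return findings

# Example run with sample readings:
for issue in preflight({"storage_pct_used": 91.0,
                        "mcast_routes": 8, "mcast_expected": 10}):
    print("PRE-FLIGHT:", issue)
```

The point is not the script itself, but that each check corresponds to a different monitoring system, so no single vendor platform is a single point of blindness.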
2. Cross train your teams. “It must be the network” (sound familiar?). Too few Data Center Managers take the time to properly cross train different teams. The team managing the vSphere cluster should understand VLAN tagging and how 802.1q switch trunks work. The Network team should understand how a vMotion works across different network and storage systems. Your database administrators should have the list of IP endpoints the SQL Servers and clustered systems listen on and interconnect with. When something is down, having a well-rounded team is ESSENTIAL to quickly identifying the issue. Frequently, a Data Center Manager becomes a “translator” between different groups that often blame each other during a failed deployment or real-world outage. By making sure each team has enough knowledge to “cover the gaps”, the Data Center teams can rapidly pinpoint and resolve issues.
3. Have Logical, Physical and Wiring diagrams handy. Keep them up to date. In the era of partial or complete process outsourcing, having quick access to how the systems and network topology are cabled is just as important as understanding packet flow and firewall access. You may need to bring a vendor who is responsible for your facility (but has never set foot in it) up to speed quickly before troubleshooting can start. The diagram is often their only insight into what they are supporting. Successful Data Center Managers become “Librarians” of documentation about their facility - and the best ones have it available via a cloud-based portal or other offsite, yet easy-to-access secure system. It’s going to be hard to pull your Visio diagrams off of SharePoint if the core switches are down, huh? Google Apps is a secure, painless and often free way to ensure each member of the team can quickly reference any document in the technology organization. To keep documentation updated and accurate, assign “Update Managers” who are responsible for verifying and signing off on documentation updates weekly, monthly, or however often your facility schedules changes.
4. Take responsibility for outages. No one likes a mystery. Especially one that takes a business down. If you are aware of what caused the outage - document the problem thoroughly and inform all stakeholders without blaming anyone (even if someone should be blamed). The truth will always come out anyway, so it’s better to be truthful and let everyone in the technology team learn from the mistake. Do not blame Microsoft, Cisco or some vendor for an outage. These vendors make ALL necessary documentation and support personnel available, and you will not be taken seriously if your outage report says “Microsoft dropped the ball on this one”. A year ago, a major Microsoft Exchange outsourcing vendor sent us and our customer’s Executive team an outage report - they simply stated that, after speaking with Microsoft, it was clear their Exchange Mailbox Cluster servers were missing patches, which caused the cluster to fail over repeatedly between members of the mailbox cluster. Microsoft was not at fault, and the vendor was seen as more responsive to a serious business-day email outage for having taken ownership of the fact they didn’t have their systems patched properly - rather than blaming Microsoft.
5. Live and Breathe your backups. Data is the priceless capital that, above all, you are tasked with protecting. Data is your “VIP” and the Data Center Manager is the head of its security detail. No backup should be considered complete until a RESTORE test has been done, preferably offsite to a different set of systems. A great Data Center Manager will assign different team members daily tasks to ensure backups are accurate, complete and can be used to restore any file, system, database or anything else that exists under your control to its working state. Do not simply trust that an outsourced backup provider is “doing your backups”. Demand proof, in the form of both a login to their system (any good outsourced backup provider will have this available) and at least a quarterly restore of both the entire systems and the data that live on those systems. Many organizations hire a specialist firm to do their backups, yet have never done a restore test. Why not? This should be included for free as part of their service; if not, shop around - your firm deserves better! If you do your backups in house, assign different team members to cross-check each other’s backup tasks. Some organizations have a Database Backup policy or process that is different than how they backup and archive email. Do not let “backup completed successfully” emails stand in as a guarantee that both backups and restores will actually work. Too often the only thing that worked was that email.
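The restore-test discipline can be sketched in a few lines of Python: restore a file from backup to a scratch location and verify its checksum matches the original. The copy step below is a stand-in for whatever restore job your real backup tool runs - it is an assumption for illustration, not any vendor’s API:

```python
# Minimal restore-verification sketch. The shutil.copy call stands in for
# a real restore operation from your backup system of choice.
import hashlib
import os
import shutil
import tempfile

def sha256_of(path: str) -> str:
    """Checksum a file in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(original: str, backup_copy: str) -> bool:
    """Restore the backup to a scratch directory and compare checksums."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = os.path.join(scratch, os.path.basename(original))
        shutil.copy(backup_copy, restored)  # stand-in for the restore job
        return sha256_of(restored) == sha256_of(original)
```

A nightly job that runs checks like this - and alerts when they fail - is far more trustworthy than a “backup completed successfully” email.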
6. Make email alerts informative and useful. Too often we see alerts that simply say “line down, interface X/Y/Z” or “system alert - 95% disk utilization on Server A”. An engineer now needs to spend valuable time, while something is down, tracking down the root cause of the issue or figuring out who should be contacted to resolve it. Your alerts should clearly state as much information as necessary to allow the groups responsible to take action immediately. Wouldn’t an alert like this be much more useful (especially on a Saturday night when the Network Engineer on call is at Applebee’s with his family)?
Router Alert: Edge1-DC1-US.company, Interface G0/0 - Down - Circuit ID 10-9A89AB30 to Cogent down, Contact number 1-877-726-4368, option 3, 1, 2
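An alert like the one above can be produced by enriching the raw “interface down” event with a lookup against a circuit database your network team maintains. A minimal sketch, assuming a simple in-memory table (in practice this might live in your CMDB or a flat file):

```python
# Hypothetical circuit table keyed by (device, interface). The entries
# mirror the example alert above; any other details are placeholders.
CIRCUITS = {
    ("Edge1-DC1-US.company", "G0/0"): {
        "circuit_id": "10-9A89AB30",
        "carrier": "Cogent",
        "contact": "1-877-726-4368, option 3, 1, 2",
    },
}

def enrich_alert(device: str, interface: str) -> str:
    """Turn a bare 'interface down' event into an actionable alert."""
    info = CIRCUITS.get((device, interface))
    if info is None:
        # Unknown circuit: fall back to the bare alert rather than guessing.
        return f"Router Alert: {device}, Interface {interface} - Down"
    return (f"Router Alert: {device}, Interface {interface} - Down - "
            f"Circuit ID {info['circuit_id']} to {info['carrier']} down, "
            f"Contact number {info['contact']}")
```

The hard part is not the code - it is keeping the circuit table accurate, which is exactly the kind of task an “Update Manager” should own.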
7. Train your team on what to do during an outage. The first time something is down is a bad time to find out who the best Exchange Mail store troubleshooter or SQL Cluster expert in the organization is. Or who knows how Multicast PIM works differently between the different data feed vendors the company relies on. For each potential “hot spot” in your Data Center (anything that, if down or slow, someone will notice) - a well-defined remediation plan should be drilled on: “What interfaces does our switch receive multicast data on, and what interfaces does it send multicast data on?” “Which vSwitches have and use jumbo frames, and which SAN interfaces on which switch ports need jumbo frames enabled?” “Which LUNs need to be available to which VMware host for Exchange or SQL to be available?” “How does a VMware host look or respond if it can’t reach the storage, and how do we troubleshoot VMware NIC connectivity issues at both the host and guest levels?” A well-trained team is a pleasure to work with. A poorly trained team is obvious and won’t go unnoticed, especially when each minute of downtime is costing the company money and credibility.
8. Know your min, max and average power drain. Sudden changes in power utilization can be an indicator that your Data Center power feeds have become unstable, or that the temperature is unbalanced in hot and cold aisles, etc. If a server has dual power supplies, each homed to an “A” and “B” PDU in your cabinets, guess what happens if one of the server’s power supplies fails? The server will draw all the power it needs through the single remaining supply, putting more amps on a single PDU. Your team should monitor and alert on each PDU and, if possible, the entire aggregate of all PDUs in your facility, using a system such as APC’s InfraStruXure, or by summing power “Amperes in use” across many PDUs on a single graph. Work with your power provider to document total available power and how power is distributed in your facility. Create power alerts that indicate how well the Data Center is functioning.
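The A/B imbalance check and the facility-wide sum can both be sketched simply. The thresholds and readings here are invented for illustration; real values would come from polling your PDUs (e.g. via SNMP):

```python
# Sketch: warn when one PDU of an A/B pair is drawing far more than its
# partner - a common sign a dual-supply server has lost one power supply.
# Thresholds and readings are illustrative, not from any real facility.

def check_pdu_pair(amps_a: float, amps_b: float,
                   imbalance_pct: float = 30.0) -> list[str]:
    """Return warnings if one PDU carries far more than its share."""
    total = amps_a + amps_b
    if total == 0:
        return ["no load on either PDU - check feeds"]
    share_a = 100.0 * amps_a / total
    if abs(share_a - 50.0) > imbalance_pct / 2:
        return [f"imbalance: A={amps_a:.1f}A ({share_a:.0f}%), B={amps_b:.1f}A"]
    return []

def aggregate_amps(pdus: dict[str, float]) -> float:
    """Sum 'Amperes in use' across all PDUs for a facility-wide graph."""
    return sum(pdus.values())
```

Graphing `aggregate_amps` over time gives you the min, max and average drain the item title asks for; the pair check catches the failed-supply case before a second failure takes the server down.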
9. Familiarize staff with the physical layout of the Data Center. No one wants to trace cables or work through a nest of SAN cables the day a disk shelf is unavailable or a system is down! Who wants to try to hear what someone is yelling to them on the phone in a loud Data Center? (Assuming you have cell phone coverage there at all.) Take the time to show each member of your team where fiber, cat5/6 and telco lines come into the facility. Which patch panel numbers go with the tops of which cabinets if you use a centralized wiring scheme? Is “12-A-11A” rack “12” or row “12”? Which ToR (top of rack) switch or fabric extender runs back to which core switch, and how are the uplinks labeled? Nothing beats experience under fire, and nothing can substitute for familiarity with an environment when the minutes tick by during an outage. We call them “Data Center Parties” and we have them often! They are an opportunity, without the stress of an outage, for each person to see how things connect and how cables are labeled, know what all the lights mean, and appreciate how something they may not have known about is set up. Great Data Center managers all have a role in keeping the team familiar with how the facility is laid out.
10. Have a vendor escalation plan. Janet may be the sales rep at your SAN vendor who gets back to you quickly when you need a quote. Paul may be her manager who sent you that huge tin of caramel popcorn last December. Do NOT expect them to answer their cell phones at 3AM if your SAN is down. Each vendor for telco, hardware, software or process outsourcing should provide an easy-to-read, effective “Escalation Contact Sheet” that ensures you and your team can reach whoever you need to speak with to restore service at any hour, even on a weekend or holiday. Do not assume opening a support ticket with the 800 number of your vendor guarantees you will get help that fixes the problem in a timely fashion. Many well-known vendors use an answering service for late-night calls that acts as more of a call center than an engineering point of contact. You will then need to wait for an engineer to be available, or for them to do research on your issue, before getting a call back. It can be many hours before “someone who knows” is actually working your issue. If something is critical to the success of your Data Center’s operation - have second- and third-level 24-hour contacts available. A few years back, a large Telecom company gave us an Escalation Contact Sheet with 24-hour numbers for staff as high as two levels below the CEO, including home and cell phone numbers for those executives and their peers. While we don’t recommend calling them first - clearly use your discretion and best judgment - if your ticket has been sitting idle in a queue for a couple of hours and you’re getting nowhere, your management will expect you to be able to escalate the issue - so be ready and able to do so!
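An Escalation Contact Sheet can even be kept machine-readable, so the on-call engineer (or a ticketing integration) can look up who to call next based on how long a ticket has sat idle. A minimal sketch - the levels, timings, names and roles below are invented placeholders, not any real vendor’s contacts:

```python
# Hypothetical machine-readable escalation sheet: pick the highest
# escalation level warranted by how long the ticket has been idle.
# All entries are illustrative placeholders.

ESCALATION_SHEET = [
    {"level": 1, "after_minutes": 0,   "contact": "NOC 24x7 line"},
    {"level": 2, "after_minutes": 60,  "contact": "Duty engineering manager"},
    {"level": 3, "after_minutes": 120, "contact": "Director of support"},
]

def next_contact(idle_minutes: int) -> str:
    """Return the contact for the highest level the idle time justifies."""
    eligible = [e for e in ESCALATION_SHEET
                if idle_minutes >= e["after_minutes"]]
    return max(eligible, key=lambda e: e["level"])["contact"]
```

Whether or not you automate it, the underlying discipline is the same: agree on the idle-time thresholds with each vendor in advance, so escalating at 3AM is a lookup, not a debate.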
Thanks for reading - we would appreciate your feedback, and any additional areas you think we should be focused on!