So you need to move a Data Center?
Posted on November 3, 2016
By Dan Hawkins M.B.A., Director of Information Technology
Edited by Mike Grzesik, Infrastructure Security Specialist
My team and I just finished a project that left me exhausted, elated, proud, and much wiser than when we started. We moved a Web-based Data Center application from one Internet hosting provider to another over a 60-hour weekend. The specifics of what we did, why we did it that way, and what I would have done differently are the lessons I want to share. Three stages of the project played a major role in our success. These phases, Preparation, Move, and Go-Live Support, parallel terms from the medical field: some of you may have been through Pre-Op, Surgery, and Recovery. Whether the patient is a human or a virtualized stack of computers, mass data storage, switches, and firewalls doesn't matter. The goal of a successful outcome is much the same.
Phase I – Preparation
Any project manager worth his or her salt will tell you that the most important part of any project is the Definition phase. You will never understand all the details of what you are attempting unless you step through the plan over and over again. Preparation, to that end, came in the form of examining the affected (relocated) systems, how they currently operated, and what changes would affect them in the new environment. "Measure twice, cut once," as they say when building a house. Revisiting the plan helps head off unexpected outages or loss of functionality when you reach the end and are ready to press the "Go!" button. When you have spent more time than you thought possible documenting systems, functions, attributes, and interconnections, take one last look. You will find at least 20% more items you missed on the previous pass.
Once you have your definition to a comfortable level of acceptance, do one more thing: expect that something will go wrong and have a remediation strategy in place. Arrange for extra resources in a monitoring capacity. Pre-arrange hand-holding for key customers during the post-move phase. Involve members of the business teams impacted by the resources being moved. They will provide key insight and can smooth the migration in ways the technical team may not even be aware of.
Internal communication between all parties is critical. We had to ensure that internal notices were sent to the Customer Support group, and we worked with Product Management to craft "non-technical" messages to send to the customers. At some point we had to commit to the move day (which changed three times) and then notify everyone it was really going to happen.
External communication to customers is just as critical and should be made with as much notice as possible. To what extent will customers need to change what they do? In our case, the public IP addresses for our Internet-facing systems were changing. That meant many non-technical customers had to work with their very technical IT departments to verify and change firewall rules. If they didn't, their web portal access would be blocked when the application moved. Unfortunately, this did happen.
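A check like the one those customers needed is easy to script. The sketch below is a minimal example in Python: it attempts a TCP connection to each new public endpoint so a customer can confirm, before cutover day, that their firewall rules allow the traffic. The IP addresses and ports here are made-up documentation-range placeholders; the real values would come from the migration notice.

```python
import socket

# Hypothetical new public endpoints (203.0.113.0/24 is a documentation
# range); substitute the real addresses from the migration notice.
NEW_ENDPOINTS = [
    ("203.0.113.10", 443),  # web portal
    ("203.0.113.11", 443),  # API endpoint
]

def check_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def report(endpoints):
    """Print one status line per endpoint."""
    for host, port in endpoints:
        status = "open" if check_reachable(host, port) else "BLOCKED"
        print(f"{host}:{port} -> {status}")
```

Run from inside the customer's network before the maintenance window, any "BLOCKED" line points at a firewall rule that still needs updating, while there is time to fix it.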
Communication during and after the move is a big part of the planning. We kept a conference bridge open throughout the 60-hour weekend maintenance window so the teams in all locations could check in on current status.
Phase II – Move Maintenance
Looking back at the Move maintenance activities, I would be remiss if I didn't acknowledge the lack of sleep during the week leading up to the maintenance window. As much planning as was done, no matter how many times we went over the schedule, there were almost always items thrown in at the last minute that kept us up at night worrying. We could not practice the move; it existed only on paper. But the plan was simple: shut down the application and servers, back it all up to the remote storage, unplug the storage, and drive it to the new site. Then reverse the process at the new Data Center 30 miles away. Theory and practice turned out to be two different things.
As the Friday night maintenance window got underway, we started to bring systems offline. There was an ordered list for shutting down key computer systems so that they could be backed up and, in their new home, come back online with as little trouble as possible. This order was established in the Preparation phase but had to be executed "for real" during the move. Once all systems were down, the backup began in earnest. It was expected to take four hours at most; it actually took 14. The main culprit was an unexpected network bandwidth constraint: our original hosting provider went into a weekend maintenance mode, and our bandwidth dropped to 25% for almost two hours of the backup. Had we known to ask, we could have requested an exception. First lesson learned.
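Transfer time scales inversely with effective bandwidth, which is why a throttled link hurts so much. As a back-of-the-envelope sketch, the numbers below are hypothetical (10 TB over a 10 Gbps link; our actual figures differed), but the arithmetic is the same for any backup window estimate.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Rough hours to copy data_gb over a link, allowing for protocol overhead."""
    megabits = data_gb * 8 * 1000          # GB -> megabits
    effective_mbps = link_mbps * efficiency
    return megabits / effective_mbps / 3600

# Hypothetical numbers: 10 TB over a 10 Gbps link at 80% efficiency...
full_speed = transfer_hours(10_000, 10_000)   # ~2.8 hours
# ...versus the same copy if the provider throttles the link to 25%.
throttled = transfer_hours(10_000, 2_500)     # ~11.1 hours
```

The point of running numbers like these in the Preparation phase is that the throttled case, not the full-speed case, is what your maintenance window has to absorb.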
So, 7½ hours later than expected, we were ready to transport our systems to their new home. That part went off without a hitch, and the onsite team even had time for a Saturday afternoon taco lunch. At the new site, we racked and powered up our storage solution and pre-emptively made sure we didn't have any throughput challenges. The major hurdle turned out to be getting the storage to talk nicely to the virtualization platform in its new home. We had planned to have contracted support from the storage and virtualization vendors available, but after nine hours we still could not get the two systems to connect. This put us behind our original schedule, and it started to wear on the team physically and emotionally. One team member had been awake for nearly 36 hours; another had only about three hours of sleep since arriving at the site Friday afternoon. Everyone was losing focus, and we were 24 hours into a 60-hour maintenance window. As the project leader, I had to demand that the team get some rest. No amount of coffee and sugar can substitute for six to eight hours of sleep. Lesson #2: expect that you will need to order the team to stop at key points along the schedule, even if you are behind.
By Sunday morning there was still no solution to the storage challenge, and at this point we had to seriously consider failing back to the original hosting provider. We set a time to start bringing systems back online in the old data center so they would be ready if we had to cut back over; better to have the backup plan running in parallel with the move, just to be safe. Then the first real break came on Sunday afternoon: the storage finally connected, all the systems came online, and we could move forward. Lesson #3: always have contracted phone and email support with your hardware and software vendors, especially on complex projects.
By 5 PM Sunday, we were launching back into our server boot-up sequence and copying key servers over to the new storage. The team had rigged the network to deliver four times the bandwidth originally anticipated, and the faster copy put us back on schedule by about six hours. At 11 PM Sunday night, all the key systems were online and ready for validation testing, and the resources slated to assist with testing were contacted to begin their system checks. Lesson #4: separate your test team from your onsite technical group. This spreads out the responsibility and gives the last technical hurdle of the project a fresh set of eyes to help complete the cutover.
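An ordered boot-up (and shutdown) sequence like ours is really a dependency ordering: each service can only start once everything it depends on is up, and teardown is the same list reversed. As an illustration only, here is a minimal sketch using Python's standard-library graphlib with a hypothetical four-tier stack; a real dependency map would be considerably larger.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each service lists what must be up before it starts.
DEPENDS_ON = {
    "web":     ["app"],        # web portal needs the application tier
    "app":     ["db"],         # application tier needs the database
    "db":      ["storage"],    # database needs the storage array
    "storage": [],             # storage has no prerequisites
}

# static_order() yields prerequisites first: storage, db, app, web.
boot_order = list(TopologicalSorter(DEPENDS_ON).static_order())
shutdown_order = list(reversed(boot_order))  # tear down in the opposite order
```

Writing the order down this way in the Preparation phase, rather than as a flat checklist, also makes it obvious which systems can safely come up in parallel.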
While this was going on, the onsite technical team had some end-of-project cleanup to finish on server settings and network firewall changes. By 1:45 AM Monday, hours before the close of the maintenance window at 10:00 AM, we were 95% functional and had restored website and systems availability to our internal and external customers. That is when the onsite team called it a night and responsibility transitioned to the Customer Support team. Their task was to pick up the ball in the morning and handle any unexpected challenges during the Go-Live phase.
Phase III – Go-Live Support
On the morning of Go-Live, some three hours before the end of the scheduled maintenance, we held a conference bridge for all the teams to check in and give a go/no-go. Every resource checked in and reported that systems were functioning as expected: the same services that were available on Friday night at 10 PM were now available in a new home some 60 hours later. At the scheduled time we announced the completion of maintenance activities and prepared for any unforeseen issues. In our case, two systems needed extra tweaking to bring them to full capacity (one web server interface and our Exchange email system), but those issues were rectified within a couple of hours after the move. All underlying systems performed nominally or better, and only a few customers required extra assistance in finalizing the network requirements on either side of the new systems architecture. We mitigated the last of these a couple of days after the move; however, the vast majority (over 170 customers) saw no significant downtime outside of the initial maintenance window. Many reported system functionality and response times at or above the expected performance level.
If I had to do it all over again, I would make a few choices differently. Mainly, I would spend more time in the Preparation phase, and I would push for replication of the systems between the old home and the new home over the copy method we called "Lift and Drop." The 60-hour window also left little room for error. Not something I would like to relive.
As a result of the move, we compiled a more detailed inventory of all the systems that make up this hosted application. There is now a better description of every component, its contribution to the application, and how they all interact. By working with the systems that had to be moved, the team left the project with a better appreciation for what the systems do and how our customers use them. We also learned how the internal Customer Support teams leverage the tools and technology on a daily basis.
Toward the end, I also came to realize the value and tenacity of the onsite team. They faced tremendous physical and technical challenges in this move but stayed focused on resolving the immediate issues. Their push came from the fact that they didn't want to quit until the job was done. Failure was an option in this project, but it wasn't an option anyone relished. Their determination to succeed, married with their technical skill set, helped make the project a success.
Finally, never forget to celebrate success. The team that worked on the project and completed the job needs to see the overall appreciation of the company and customers. Posting a note of thanks on internal communications platforms, providing lunch (or dinner), or granting comp time are all good options. Sometimes, however, the most impactful thing is simply to say "Thank You" to every team member involved.