October 22, 2010
The recent downtimes in detail
To put it mildly, we’re not satisfied with the current availability of mite. To be honest, we’re deeply frustrated. One hour of downtime on October 15th, fifteen minutes on the 19th, and two hours last night: that’s simply not the level of quality that mite is known for and that you can and should expect. We owe you not only another apology, but also a detailed account of what went wrong and what we’re doing to prevent it from happening again.
What happened?
Hardware failures in the data center caused all three outages; the app itself was and is running smoothly. The first failure was not connected to the second and third ones. Bad luck and bad timing simply came together.
On October 15th, a power problem occurred in our primary data center, despite redundant power systems being in place, of course. The power systems were undergoing maintenance when a switch between the two systems failed, due to a combination of flawed documentation from the hardware supplier and a less-than-perfect emergency plan. Power was restored within half an hour, but the servers needed some more time to verify all data and properly resume their work.
The nightly outages on October 19th and 21st were caused by defective network switches. On the 19th, one of these switches broke; it was replaced within minutes. Last night, two switches in one IBM blade center failed simultaneously. Replacing the switches didn’t solve the problem, so the servers had to be moved to another blade center, which took some more precious time.
What will be done about it?
Two notes upfront: one, no hardware will ever work 100% of the time, not in our data center and not in any other. That’s simply not going to happen; it’s a reality we cannot change, as much as we’d love to, but we can change how we deal with this reality. Two, our top priority is to ensure that your data is completely safe at any given point in time. To uphold this principle, we will even accept a few more minutes of downtime, when in doubt.
What we can and will do is a) shed light on every single failure to really understand it, and thus be able to prevent it from happening in the future, and b) improve uptime by putting more redundancy in place.
In this particular case, after October 15th, the motor that switches between the different power systems was replaced. In addition, our hosting provider, the folks from the data center, and the manufacturer of the systems have joined forces to clarify the error in the documentation and fix it. They are also discussing implementing another redundant power system on top of the existing one.
The network switches that caused the downtimes of October 19th and 21st will undergo scheduled maintenance, probably next week. We’ll update you as soon as we have more information.
At the moment, we’re thinking about how to add even more redundancy on our side, e.g. by adding further systems that could take over in case of a hardware failure.
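To illustrate the idea, here’s a minimal sketch of what such an automated takeover could look like: a watchdog that checks the primary system and promotes a standby once it stops responding. This is purely illustrative, not our actual set-up; all hostnames, ports, and the promote() hook are hypothetical placeholders.

```python
# Illustrative sketch of automated failover, not mite's actual infrastructure.
# All hostnames, ports, and the promote() hook are hypothetical placeholders.
import socket
import time

PRIMARY = ("primary.example.com", 80)   # hypothetical primary server
STANDBY = ("standby.example.com", 80)   # hypothetical hot standby
CHECK_INTERVAL = 10                     # seconds between health checks
MAX_FAILURES = 3                        # tolerate short network blips

def is_reachable(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def promote(standby):
    """Placeholder: redirect traffic to the standby, e.g. via DNS or a load balancer."""
    print("Failing over to %s:%d" % standby)

failures = 0
while True:
    if is_reachable(*PRIMARY):
        failures = 0                # primary is healthy, reset the counter
    else:
        failures += 1
        if failures >= MAX_FAILURES:
            promote(STANDBY)        # primary considered down, switch over
            break
    time.sleep(CHECK_INTERVAL)
```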
On the bright side, we’d like to point out that we still trust our primary hosting partner, SysEleven, despite these numerous downtimes. Monitoring informed us within a minute. Technicians were on the case within five minutes. The CEO and the head of IT kept us updated on an ongoing basis, in detail and with full transparency. They are deeply sorry and definitely unsatisfied with the status quo as well. For the rest of 2010 they’ll focus on improving the current set-up; no new features will be taken on. All in all, their ten-year hosting history shows that this is not the norm, without question.
Uptime of mite in 2010: 99.93%
To conclude, we’d like to talk about the bigger picture. We analyzed previous downtimes to help you put this into perspective.
From January 1st, 2010 until today, mite was unexpectedly down for a total of 295 minutes. This amounts to an uptime of 99.93%. Even if we include scheduled maintenance, mite was up 99.89% of the time, all in all.
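For anyone who wants to verify the figure, here’s the arithmetic, assuming the period runs through today, October 22nd, which by coincidence spans exactly 295 days:

\[
\text{uptime} = 1 - \frac{295\ \text{min}}{295\ \text{days} \times 1440\ \text{min/day}} = 1 - \frac{295}{424{,}800} \approx 99.93\%
\]

Working backwards from the 99.89% figure, scheduled maintenance would account for roughly 170 additional minutes over the same period.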
The gap to 100.00% is not big, but it’s not satisfying either. We aim to do better. We’ll keep improving every little detail to maximize uptime even further. Please trust us: we will get better. If you’d like any further information, please get in touch!
Julia in Tech talk