Last downtimes in detail

To put it mildly, we’re not satisfied with the current availability of mite. To be honest, we’re deeply frustrated. One hour of downtime on October 15th, fifteen minutes on the 19th and two hours last night – that’s simply not the level of quality that mite is known for and that you can and should expect. We owe you. Not only another apology, but a detailed description of what went wrong and what we’re doing to prevent this from happening again.

What happened?

Hardware failures in the data center caused all three outages; the app itself was and is running smoothly. The first failure wasn’t connected to the second and third. Bad luck and bad timing: it all came together.

On October 15th, a power problem occurred in our primary data center, despite redundant power systems being in place, of course. The power systems were undergoing maintenance when a switch between the two systems failed, due to a combination of flawed documentation from the hardware supplier and an imperfect emergency plan. Power was restored within half an hour, but the servers needed some more time to check all data and resume their work properly.

The nightly outages on October 19th and 21st were caused by defective network switches. On the 19th, one of these switches broke. It was replaced within minutes. Last night, two switches in one IBM blade center failed simultaneously. Replacing the switches didn’t solve the problem. The servers had to be moved to another blade center, which took some more precious time.

What will be done about it?

Two notes upfront. One: no hardware will work 100% of the time, not in our data center and not in any other. That’s simply not going to happen; it’s a reality we cannot change, as much as we’d love to – but we can change how we deal with it. Two: our top priority is to ensure that your data is completely safe at any given point in time. To guarantee this, we’ll even put up with a few more minutes of downtime when in doubt.

What we can and will do is a) shed light on every little failure to really understand it and thus be able to prevent it from happening again, and b) improve uptime by putting more redundancy in place.

In this particular case, after October 15th, the motor that switches between the different power systems was replaced. In addition, our hoster, the folks from the data center and the manufacturer of the systems have joined forces to clarify and fix the error in the documentation. They are also discussing adding another redundant power system on top of the existing one.

The network switches that caused the downtimes of October 19th and 21st will undergo scheduled maintenance, probably during the next week. We’ll post an update as soon as we have more information.

At the moment, we’re thinking about how to add even more redundancy on our side, e.g. by adding further systems that could take over in case of a hardware failure.

On the bright side, we’d like to point out that we trust our primary hosting partner, SysEleven, despite these numerous downtimes. Monitoring informed us within a minute. Technicians were hands on within five minutes. Their CEO and head of IT kept us updated continuously, in detail and transparently. They are deeply sorry and definitely dissatisfied with the status quo as well. They’ll focus on improving the current set-up during the rest of 2010; no new features will be taken on. All in all, their 10-year hosting history shows that this is not the norm, without question.

Uptime of mite in 2010: 99.93%

To conclude, we’d like to talk about the bigger picture. We analyzed previous downtimes to help you put this into perspective.

From January 1st, 2010 until today, mite was unexpectedly down for a total of 295 minutes. This corresponds to an uptime of 99.93%. Even if we include scheduled maintenance, mite was up 99.89% of the time, all in all.
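For those who’d like to check the math, here is a minimal sketch of the calculation in Python. The 295 minutes come from the paragraph above. The end date of October 22nd, 2010 is an assumption on our part (the text only says “until today”), and the function name `uptime_percent` is purely illustrative.

```python
from datetime import date


def uptime_percent(downtime_minutes: float, start: date, end: date) -> float:
    """Percentage of the period from start (inclusive) to end (exclusive) with the service up."""
    total_minutes = (end - start).days * 24 * 60
    if total_minutes <= 0:
        raise ValueError("end must be after start")
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


# 295 minutes of unexpected downtime is the figure stated in the text.
# ASSUMED_END is an assumption: the post does not name the exact date.
ASSUMED_END = date(2010, 10, 22)
uptime = uptime_percent(295, date(2010, 1, 1), ASSUMED_END)
print(f"{uptime:.2f}%")
```

Under this assumption the period covers 294 days, and the result rounds to the 99.93% quoted above. Shifting the end date by a day in either direction does not change the rounded figure.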

The gap to 100.00% is not big, but it’s not satisfying either. We aim to be better than this. We’ll keep improving every little detail to maximize uptime even further. Please trust us: we will get better. If you’d like any further information, please get in touch!

Julia in Tech talk

Today’s service interruption

Update: mite has been back up since 01:17. Again: we’re so sorry! These outages cannot and won’t continue.

~~
Since 23:18 CEST, mite has been unavailable due to a hardware defect. Technicians are already hands on. Please visit Twitter for the latest information on this issue; we’ll update continuously. We’re terribly sorry, please excuse us!

Julia in Tech talk

Downtime

Update (12:41 pm CEST): mite is back up now. All of your data is fine, of course; there was never any real danger. Again: we’re terribly sorry for this brief outage! Hopefully, this downtime didn’t cause too much trouble on your side.

~~
Since 11:45 am CEST, we have been experiencing power supply issues in our data center. We informed the data center; their whole team is working on the issue.

A thousand apologies for this outage! Please visit Twitter for updates on the issue; we’ll update the status continuously.

Julia in Tech talk

Scheduled Maintenance: May 17th

Next Monday, between 1 am and 2 am CEST (what time is that for me?), some updates to our servers will be made. Therefore, mite will be unavailable for a very brief period of time. We expect the interruption to last for no longer than 10 to 15 minutes.

Maintenance will include updating the kernels, i.e. the heart of the server systems, as well as some improvements to the hardware, i.e. server rack restructurings. This maintenance takes place to reduce the possibility of future downtimes by tackling the root of past problems. We ask for your understanding.

Julia in Tech talk

Today's service interruption

Between 1:22pm and 2:20pm CEST, mite was down for all users. We’re terribly sorry, please accept our apologies!

The reason for this downtime was a problem in our data center: defective routers of the upstream provider interrupted the external connection. Three minutes after the downtime started, we began posting updates on the problem via Twitter. Within minutes, technicians started working on the hardware in the data center. It goes without saying that, together with our hosting partner SysEleven, we’ll keep looking into this to prevent similar problems in the future. Of course, your data was completely safe throughout this downtime.

Again: we’re so sorry! This shouldn’t happen.

Julia in Tech talk

Scheduled Maintenance, take two: March 21st

Update, March 21st, 7:53am: Maintenance went as planned.

One of our brand-spanking-new servers isn’t feeling quite at home yet. To prevent future problems up front, we’ll move it to a new server blade. Unfortunately, this requires another scheduled maintenance. Fortunately, this one requires only a very short period of time.

mite will be offline on Sunday, March 21st, between 7:30am and 7:50am CET. (What time is that for me?)

We’re so sorry for this second interruption! This decision didn’t come easily. But we’re convinced that this preventive step makes sense even though nothing serious has happened yet. We ask for your understanding.

Julia in Tech talk

Today's downtime

At about 11am CET, mite became extremely slow, so slow that the service went down between 11:24am and 11:35am. First of all: we are terribly sorry, this shouldn’t happen!

These problems were due to a disruption of the external internet connection of the data center where our servers are hosted. Our own alarm systems worked, as did those of our hosting partner SysEleven; we posted updates via Twitter. SysEleven managed to solve the problems with the IP uplink. All seems fine again. Nevertheless, you can definitely count on us to be extra vigilant today.

Again: we’re sorry. Please excuse this interruption!

Julia in Tech talk

Scheduled Maintenance: March 7th

Update, March 7th, 5:18am: Everything went as planned. Good time tracking on the new servers, everyone! Just in case you stumble over a bug: please get in touch with as many details as possible.

Safe, secure and lightning-fast: that’s how mite behaves today and should behave in the future, no matter how fast the user base grows. To make this happen, Sebastian, who takes care of the technical infrastructure here, has been preparing mite for the next step: this weekend, we’re moving the application to a new server cluster. Therefore,

mite will be offline on Sunday, March 7th between 3am and 5am CET. (What time is that for me?)

We’ll update this blog post and keep you posted on Twitter in real time.

Ideally, you won’t notice a thing about the new infrastructure. It goes without saying that all your data will be at your service exactly as before. Nevertheless, experience shows that in production we might have to tweak the system a little here and there to optimize its responsiveness and stability – despite really thorough testing, testing, testing up front. Therefore, not only we, but also the team of SysEleven, our new hosting partner, will be keeping an extra beady eye on things. Promised. Now, let’s get moving!

Julia in Tech talk

Domain name problem affecting user subset

[Update: June 10th, 7:25 a.m.] The regular domain *.yo.lk, which was erroneously suspended, is now working again. You should be able to access your mite.account under its standard URL. Of course, your data was safe at every moment.

It could take some time for DNS servers all over the world to reflect the lifted suspension. If you cannot access your account yet, please continue accessing mite via the emergency domain. We are deeply sorry for these problems, even if we were not directly responsible for them. We know that you depend on mite. We will do everything possible to regain your trust. Sorry.

Since this afternoon, some users have been experiencing problems accessing their accounts. Our servers are up and running, but there is a problem with our DNS entry. If you happen to be affected: first of all, we want to apologize!

We set up an alternative domain, under which you can access your account:
http://youraccountname.appmite.de

All your data will be waiting there for you, of course. If you create new time entries or any other data, it will be accessible again as soon as the regular domain address is available again.

Unfortunately, SSL is not available on this backup domain. Please access mite through HTTP while we work on restoring the regular domain.

We will get back to you via mail with detailed information. At the moment, we assume that a service provider became insolvent and took our DNS entry down with it. We’ll keep updating via Twitter.

We are terribly sorry for these problems. Please stay with us!

Julia in Tech talk

Downtime

Between 16:01 and 17:29 this Sunday afternoon, mite was down for all users due to a power interruption in our data center. We are terribly sorry for this interruption of the service! This shouldn’t happen. We’ll continue to look into the issue, together with the very capable people in Munich maintaining the infrastructure, to prevent something like this from happening again. Again: we are sorry for the inconvenience. Please stay with us.

Julia in Tech talk