mite.blog. Tech talk

January 20, 2012

Scheduled Maintenance

On Monday, January 23rd, mite won’t be available between 0:15 am and ~0:45 am CET (what time is that for me?). We’ll be moving the service to new, more powerful servers. We ask for your understanding.

Update, January 23rd: Maintenance went as planned.

Julia in Tech talk

June 7, 2011

Since 21:21 CEST (what time is that for you?), mite has been unavailable for some users due to a routing problem in our primary data center. We’re terribly sorry; please excuse us! We’ll do everything we can to get mite up and running again as soon as possible. Please visit Twitter for the latest information on this issue; we’ll post updates continuously.

~~
Update, 22:33 CEST: mite is back up for all users. Hardware problems at the data center caused this outage, with routing at the heart of the problem. We are working with our hoster to understand this interruption in detail and to prevent it from happening again. Again: we’re so sorry for causing you trouble!

Julia in Tech talk

November 25, 2010

Scheduled Maintenance: November 27th

Update 6:17 CET: Maintenance is complete, and mite is happy to track your time again. Thanks so much for your patience, everybody! Please get in touch if this maintenance affected you beyond an acceptable level. We’re really sorry for the delay.

Update 3:02 CET: Maintenance is taking longer than expected. We’re sorry!

~~
Tomorrow night, on November 27th between 1:00 and ~2:00 CET (what time is that for me?), mite won’t be available due to a move of our main servers to a more redundant server cage within our data center.

We don’t take our promise lightly: this maintenance is one of the necessary measures we’re taking in response to October’s downtimes. By putting redundant hardware in place, tomorrow’s steps will help us ensure a more stable mite in the future. We ask for your understanding!

Julia in Tech talk

October 22, 2010

Last downtimes in detail

To put it mildly, we’re not satisfied with the current availability of mite. To be honest, we’re deeply frustrated. One hour of downtime on October 15th, fifteen minutes on the 19th and two hours last night: that’s simply not the level of quality that mite is known for and that you can and should expect. We owe you. Not only another apology, but a detailed description of what went wrong and what we’re doing to prevent this from happening again.

What happened?

Hardware failures in the data center caused all three outages; the app itself was and is running smoothly. The first failure wasn’t connected to the second and third. Bad luck and bad timing: it all came together.

On October 15th, an electricity problem occurred in our primary data center, even though redundant power systems are in place, of course. The power systems were undergoing maintenance when a switch between the two systems failed, due to a combination of flawed documentation from the hardware supplier and an imperfect emergency plan. Power was restored within half an hour, but the servers needed some more time to check all data and resume their work properly.

The nightly outages on October 19th and 21st were caused by defective network switches. On the 19th, one of these switches broke and was replaced within minutes. Last night, two switches in one IBM blade center failed simultaneously. Replacing the switches didn’t solve the problem, so the servers had to be moved to another blade center, which took some more precious time.

What will be done about it?

Two notes upfront. One, no hardware will work 100% of the time, not in our data center and not in any other. That’s simply not going to happen, and it’s a reality we cannot change as much as we’d love to. But we can change how we deal with this reality. Two, our top priority is to ensure that your data is totally safe at any given point in time. To uphold this principle, we’ll even put up with a few more minutes of downtime when in doubt.

What we can and will do is a) shed light on every little failure to really understand it and thus be able to prevent it from happening again, and b) improve uptime by putting more redundancy in place.

In this particular case, the motor that switches between the different power systems was replaced after October 15th. In addition, our hoster, the folks from the data center and the manufacturer of the systems have joined forces to clarify and fix the error in the documentation. They are also discussing adding another redundant power system on top of the existing one.

The network switches that caused the downtimes of October 19th and 21st will undergo scheduled maintenance, probably next week. We’ll post an update as soon as we have more information.

At the moment, we’re thinking about how to add even more redundancy on our side, e.g. by adding further systems that could take over in case of a hardware failure.

On the bright side, we’d like to point out that we trust our primary hosting partner, SysEleven, despite these numerous downtimes. Monitoring informed us within a minute. Technicians were hands-on within five minutes. The CEO and the head of IT updated us on an ongoing basis, in detail and in a transparent way. They are deeply sorry and definitely unsatisfied with the status quo as well. They’ll focus on improving the current set-up during the rest of 2010; no new features will be taken on. All in all, their 10-year hosting history shows that this is not the norm, without question.

Uptime of mite in 2010: 99.93%

To conclude, we’d like to talk about the bigger picture. We analyzed previous downtimes to help you put this into perspective.

From January 1st, 2010 until today, mite was unexpectedly down for a total of 295 minutes. That corresponds to an uptime of 99.93%. Even including scheduled maintenance, mite was up 99.89% of the time.
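
For readers who want to check the arithmetic, here is a minimal sketch of how such an uptime figure can be derived. It assumes the measurement window runs from midnight on January 1st, 2010 to midnight on October 22nd, 2010 (the date of this post); the exact window boundaries and the function name `uptime_percent` are our assumptions for illustration, not a description of any monitoring tool.

```python
from datetime import datetime


def uptime_percent(start: datetime, end: datetime, downtime_minutes: float) -> float:
    """Return uptime as a percentage of the window between start and end."""
    total_minutes = (end - start).total_seconds() / 60
    return 100.0 * (1 - downtime_minutes / total_minutes)


# Assumed window: January 1st, 2010 up to the date of this post.
start = datetime(2010, 1, 1)
end = datetime(2010, 10, 22)

# 295 minutes of unexpected downtime, as stated above.
uptime = uptime_percent(start, end, 295)
print(f"{uptime:.2f}%")
```

Shifting the end of the window by a day in either direction does not change the rounded result.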

The gap to 100.00% is not big, but it’s not satisfying either. We aim to be better than this. We’ll keep improving every little detail to maximize uptime even further. Please trust us: we will get better. If you’d like any further information, please get in touch!

Julia in Tech talk

October 21, 2010

Today’s service interruption

Update: mite has been back up since 01:17. Again: we’re so sorry! These outages cannot and won’t continue.

~~
Since 23:18 CEST, mite has been unavailable due to a hardware defect. Technicians are already working on it. Please visit Twitter for the latest information on this issue; we’ll post updates continuously. We’re terribly sorry; please excuse us!

Julia in Tech talk

October 15, 2010

Downtime

Update (12:41 pm CEST): mite is back up now. All of your data is fine, of course; there was never any real danger. Again: we’re terribly sorry for this brief outage! Hopefully, this downtime didn’t cause too much trouble on your side.

~~
Since 11:45 am CEST, we have been experiencing electricity supply issues in our data center. We informed the data center, and their whole team is working on the issue.

A thousand apologies for this outage! Please visit Twitter for updates on the issue; we’ll update the status continuously.

Julia in Tech talk

May 15, 2010

Scheduled Maintenance: May 17th

Next Monday, between 1 am and 2 am CEST (what time is that for me?), some updates to our servers will be made. Therefore, mite will be unavailable for a very brief period of time. We expect the interruption to last for no longer than 10 to 15 minutes.

Maintenance will include updating the kernels, i.e. the heart of the server systems, as well as some improvements to the hardware, i.e. server rack restructurings. This maintenance takes place to reduce the possibility of future downtimes by tackling the root of past problems. We ask for your understanding.

Julia in Tech talk

April 29, 2010

Today's service interruption

Between 1:22pm and 2:20pm CEST, mite was down for all users. We’re terribly sorry, please accept our apologies!

This downtime was caused by problems in our data center: defective routers at the upstream provider interrupted the external connection. Three minutes after the downtime started, we began posting updates on the problem via Twitter. Within minutes, technicians started working on the hardware in the data center. Together with our hosting partner SysEleven, we’ll keep looking into this problem to prevent similar ones in the future; this goes without saying. Of course, your data was totally safe throughout this downtime.

Again: we’re so sorry! This shouldn’t happen.

Julia in Tech talk

March 20, 2010

Scheduled Maintenance, take two: March 21st

Update, March 21st, 7:53am: Maintenance went as planned.

One of our brand-spanking new servers isn’t feeling quite at home yet. To prevent future problems up front, we’ll therefore move it to a new server blade. Unfortunately, this action requires another scheduled maintenance. Fortunately, this one requires a very short period of time only.

mite will be offline on Sunday, March 21st, between 7:30am and 7:50am CET. (What time is that for me?)

We’re so sorry for this second interruption! This decision didn’t come easily. But we’re convinced that this preventive step makes sense even though nothing serious has happened yet. We ask for your understanding.

Julia in Tech talk

March 19, 2010

Today's downtime

At about 11am CET, mite became extremely slow, so slow that the service went down between 11:24am and 11:35am. First of all: we are terribly sorry, this shouldn’t happen!

These problems were due to a disruption of the external internet connection of the data center where our servers are hosted. Our own alarm systems worked, as did those of our hosting partner SysEleven; we posted updates via Twitter. SysEleven managed to solve the problems with the IP uplink. All seems fine again. Nevertheless, you can definitely count on us to be extra vigilant today.

Again: we’re sorry. Please excuse this interruption!

Julia in Tech talk
