I lead the engineering organization responsible for BPOS. My team builds, operates and supports our BPOS service, and over the last few days, we have not satisfied our customers’ needs. On Tuesday and today we experienced three separate service issues that impacted customers served from our Americas data center. All of these issues have been resolved and the service is now running smoothly. These incidents were unique to BPOS and not related to Office 365 or any other Microsoft services.
I’d like to apologize to you, our customers and partners, for the obvious inconveniences these issues caused. We know that email is a critical part of your business communication, and my team and I fully recognize our responsibility as your partner and service provider. We will provide a full post mortem, and will also provide additional updates on how our service level agreement (SLA) was impacted. We will be proactively issuing a service credit to our impacted customers.
I also want to provide more detail about the recent issues.
On Tuesday at 9:30am PDT, the BPOS-S Exchange service experienced an issue with one of the hub components due to malformed email traffic on the service. Exchange has the built-in capability to handle such traffic, but encountered an obscure case where that capability did not work correctly. The result was a growing backlog of email. By 12:00pm PDT, the malformed traffic was isolated and the mail queues cleared. The delays encountered by customers varied, on the order of 6-9 hours. A short-term mitigation was implemented and a fix was under development.
At 9:10am PDT today, service monitoring again detected malformed email traffic on the service. The problem was resolved at 10:03am PDT, but users experienced email delays of up to 45 minutes during this time. A second, but related issue was detected via monitoring at 11:35am PDT, resulting in email stuck in some end users’ outboxes. The issue was remediated at 12:04pm PDT. During this time, more than 1.5 million messages had queued on the service awaiting delivery. The backlog was 90% clear by 4:12pm PDT, but because of this large backlog of email, customers may have experienced delays of as long as 3 hours. We are implementing a comprehensive fix to both problems.
As a result of Tuesday’s incident, we feel we could have communicated earlier and been more specific. Effective today, we updated our communications procedures to be more extensive and timely. We understand that it is critical for our customers to be as fully informed as possible during service-impacting events. We will continue to improve the timeliness and specificity of our communications. The primary mechanism for communicating to our customers on issues has been and will continue to be the Service Health Dashboard. For North America, that dashboard is at https://health.noam.microsoftonline.com/.
In an unrelated incident, starting at 1:04am PDT, service monitoring detected a failure in the Domain Name Service (DNS) hosting the http://mail.microsoftonline.com domain. This failure prevented users from accessing Outlook Web Access hosted in the Americas, and partially impacted some functionality of Microsoft Outlook and Microsoft Exchange ActiveSync devices. The team diagnosed and fixed an underlying problem in the servers hosting DNS for the http://mail.microsoftonline.com domain, and restored service at 4:52am PDT. The team identified a number of improvements in our handling of problems associated with DNS, and will make a full post mortem of this incident available through Microsoft Support.
As I’ve said before, all of us in the BPOS team and at Microsoft appreciate the serious responsibility we have as a service provider to you, and we know that any issue with the service is a disruption to your business – that’s not acceptable. I want to assure you that we are investing the time and resources required to ensure we are living up to your – and our own – expectations for a quality service experience every day.
As always, if you are experiencing any service issues, we encourage you to check the Service Health Dashboard for the latest information or contact our customer support team. Our customer support is available 24 hours a day by telephone or via Service Requests submitted from the Microsoft Online Services Administration Center.
May 17, 2011 Update: I wanted to thank you for taking the time to read the comments, and though JRG has been responding to many of them, I wanted to address some of the common themes:
I want to again sincerely apologize for the inconvenience last week’s issues caused you, our customers and partners.
Corporate Vice-President, Microsoft Online Services
Because of the seemingly never ending Exchange Online fiasco this week, relationships with some of our multimillion dollar clients have been threatened. Halfway through migration to BPOS we have cancelled our account.
I appreciate the apologies. I'm just wondering why the health status website requires authentication. Is there a reason for that? I know you aren't going to change it because I'm asking, but why can't you make the Exchange health website open to everybody so I don't have to keep logging in after my page times out...
just my .02 cents
keep up the good work
Utah! Get me 2!
Thank you for taking the time to write about what happened. I'm sure a Post incident report would be available, however most of us wanted to know or have an idea what happened.
Pathetic Microsoft. My stock with my BPOS clients was cut in half today. We'll be migrating to Google Apps as soon as possible. Never experienced more than 5 min of consecutive downtime with Google.
"By 12:00am PDT, the malformed traffic..." I think you mean "12:00pm."
Thanks for the write-up. I understand that problems arise, and I remain a loyal customer. My biggest issue may surprise you: I find it really, really frustrating that you insist on putting your health status behind a password. Why in the world do you do this? No one else does: not Google, Salesforce, or any other SaaS provider. The two main frustrations I have with it are 1) it makes it much more difficult to refresh - I can't just hit a button - I have to hit refresh, retype my username/pw, click login, wait for the page, etc., and 2) I can't allow my users to check it in my absence, so they constantly ping me for updates. If it were publicly available, they could check it themselves.
Thanks, and best wishes to smooth sailing (for everyone!)
I have this weird feeling that the malformed email in question came from me... Any chance that the business it came from starts with a W?
Get rid of the dashboard password. What are you trying to hide? My users need to be able to check it as well
You say it takes about 4 hours to clear a backlog of 1.5 million emails?
Did you accidentally miss out the words, "per server"?
I agree. Please remove the password.
Can't wait for Office 365 to get official. I've never seen the beta go down, and it has 50x the features of BPOS. Completely blows away Google Apps. BTW, who do you call when you have problems with Google Apps? :P
Dave at the 5th paragraph you wrote "...A second, but related issue was" is that right or it should read "A second, but UNrelated issue was" ? Thanks
Yes please remove the password, and make the RSS feed contain the information from the dashboard.
Firstly, there is absolutely no reason to hide the dashboard behind a password.
Secondly it takes too long for the dashboard to be updated (4 hours into the DNS outage it was still Green!)
Here is my suggestion:
1. Have the current health status dashboard viewable to the public, without authentication. It seemingly only provides high level information that most of us already know if we are going to check it. "Service Degraded.... etc."
2. Have a lower level health status dashboard as part of the admin portal (that is single sign on!) that provides more detailed/in depth information to us network/email/system administrators that use the admin portal.
With this we can have end users able to help themselves by having a place to find out this information (after all, we can't email it to them), and network/email/system administrators can have USEFUL information to digest.
I appreciate the letter!