I lead the engineering organization responsible for BPOS. My team builds, operates and supports our BPOS service, and over the last few days, we have not satisfied our customers’ needs. On Tuesday and today we experienced three separate service issues that impacted customers served from our Americas data center. All of these issues have been resolved and the service is now running smoothly. These incidents were unique to BPOS and not related to Office 365 or any other Microsoft services.
I’d like to apologize to you, our customers and partners, for the obvious inconveniences these issues caused. We know that email is a critical part of your business communication, and my team and I fully recognize our responsibility as your partner and service provider. We will provide a full post mortem, and will also provide additional updates on how our service level agreement (SLA) was impacted. We will be proactively issuing a service credit to our impacted customers.
I also want to provide more detail about the recent issues.
On Tuesday at 9:30am PDT, the BPOS-S Exchange service experienced an issue with one of the hub components due to malformed email traffic on the service. Exchange has the built-in capability to handle such traffic, but encountered an obscure case where that capability did not work correctly. The result was a growing backlog of email. By 12:00pm PDT, the malformed traffic was isolated and the mail queues cleared. The delays encountered by customers varied, on the order of 6-9 hours. A short-term mitigation was implemented and a fix was under development.
At 9:10am PDT today, service monitoring again detected malformed email traffic on the service. The problem was resolved at 10:03am, but users experienced email delays of up to 45 minutes during this time. A second, related issue was detected via monitoring at 11:35am PDT, resulting in email stuck in some end users’ outboxes. The issue was remediated at 12:04pm PDT. During this time, more than 1.5 million messages had queued on the service awaiting delivery. The backlog was 90% clear by 4:12pm, but because of its size, customers may have experienced delays of as long as 3 hours. We are implementing a comprehensive fix to both problems.
As a result of Tuesday’s incident, we feel we could have communicated earlier and been more specific. Effective today, we updated our communications procedures to be more extensive and timely. We understand that it is critical for our customers to be as fully informed as possible during service impacting events. We will continue to improve the timeliness and specificity of our communications. The primary mechanism for communicating to our customers on issues has been and will continue to be the Service Health Dashboard. For North America, that dashboard is at https://health.noam.microsoftonline.com/.
In an unrelated incident, starting at 1:04am PDT, service monitoring detected a failure in the Domain Name Service (DNS) hosting the http://mail.microsoftonline.com domain. This failure prevented users from accessing Outlook Web Access hosted in the Americas, and partially impacted some functionality of Microsoft Outlook and Microsoft Exchange ActiveSync devices. The team diagnosed and fixed an underlying problem in the servers hosting DNS for the http://mail.microsoftonline.com domain, and restored service at 4:52am PDT. The team identified a number of improvements in our handling of problems associated with DNS, and will make a full post mortem of this incident available through Microsoft Support.
As I’ve said before, all of us in the BPOS team and at Microsoft appreciate the serious responsibility we have as a service provider to you, and we know that any issue with the service is a disruption to your business – that’s not acceptable. I want to assure you that we are investing the time and resources required to ensure we are living up to your – and our own – expectations for a quality service experience every day.
As always, if you are experiencing any service issues, we encourage customers to check the Service Health Dashboard for the latest information or contact our customer support team. Our customer support is available 24 hours a day by telephone or via Service Requests submitted from the Microsoft Online Services Administration Center.
May 17, 2011 Update: I wanted to thank you for taking the time to read the comments, and though JRG has been responding to many of them, I wanted to address some of the common themes:
I want to again sincerely apologize for the inconvenience last week’s issues caused you, our customers and partners.
Corporate Vice-President, Microsoft Online Services
you can utilize the RSS notification which helps keep you aware of new postings. i have that feed in our sharepoint dashboard home page so everyone has the ability to know what is going on. although the info is general and it directs you to the health dash, it helps keep everyone in the loop.
thanks for the detailed response. the dash makes my job easier when i can relay the issue and projected timeline of resolution to everyone when problems come up.
communication is key and in my opinion, MS is doing their part.
Also agree on the Health dashboard ... should not require a login ... it's very annoying.
I would like to say that as a BPOS customer for about 7 months now, I am deeply disappointed about the product and all these outages. This has not been the first e-mail outage ... we have had many ... but this was a big one that affected thousands of customers.
The Health dashboard as is now is pretty useless ... it just says basically what we already know ... service is down. Well yeah, we know that. Then they post things like "we will update in 2 hours" ... i check 2.5 hours later and still no update. Communication, communication, communication. I hope they do deliver on this promise as mentioned above.
As an IT professional, I have lost confidence in the service. And not just me .... my CFO, and all the other IT Managers all throughout the world that keep complaining to me about the service. We went from an in-house Exchange system to handing it over to someone who we felt could do a much better job than us. So far, that has not been the case.
Step it up, or you will lose many many customers. My migration to Office365 might actually be going to a different provider.
Thank you for the detailed information -- this is exactly the kind of open communications that many of us have been wanting, and I hope you'll continue to be so open in the future. I'd second what others have said -- the health dashboard should be updated more quickly and honestly, and there's no real reason to restrict access to service administrators. I mean, when there are service outages, it's going to be all over Twitter and Facebook anyway, so why not open up the health dashboard to provide an authoritative source of information? Anyway, thanks again -- I'm not sure that our company's relationship with MSOL is salvageable in the eyes of my company's leaders, but if it can be saved, it will be because of honest, open communications and an evident ability to learn from past missteps.
While there is no excuse or explanation to defend Microsoft, for the GOOG fans out there - just an FYI - Blogger from Google has been down for nearly 2 days now (as of May 13, 2011). We also all know about all the outages Amazon recently went through.
Again - others' woes don't make BPOS any better. But just don't compare to those that are same/worse.
Can't wait for Office 365.
"Our customer support is available 24 hours a day by telephone"
This is an absolute, unqualified falsehood. During the first outage the customer support was playing a recording stating that the office was closed and that people should call back during regular business hours. During yesterday's outage the recording was that you couldn't take our call, please try again later.
I wanted to set up the RSS feed for viewing updates to the health status dashboard. The password is an issue. Please take it off. It is only information on whether the server is up or down. Why do I need to log in to view it? If you can't remove the login procedure, please provide some other way to notify us of the status, e.g. send a status report to the admin account.
Hi Alfredo Saavedra. Thanks for your comments. We'll make a detailed report of the cause available through Support as soon as our investigation is complete.
Thanks Jake Harris, good eye. We corrected the am/pm error in the post.
Hi andrew m. Thanks for reminding folks about our RSS feed. BPOS customers can subscribe to this feed to receive service updates.
Thanks JoeyDee. It's great to hear your feedback on Office 365.
You did not mention what the credit we will receive will be. Will you just not charge us for the month, given what has occurred?
Thanks for the reimbursement to customers, but how about your partners? We spent dozens of man hours this week dealing with this outage and all we have to show for it are customers that are pissed at us for recommending your service...
Hi Steven Hibshman. Our team is working out the compensation details now. As Dave mentioned, we will issue a service credit to all customers who were impacted. If you do not receive a compensation notice, contact our Support team. We will make sure that all customers who were impacted receive compensation.
Hi Jorge Leitão. Thanks for your question. Dave's post is correct on this point - the issue was related. The nature of the malformed mail masked this related issue until the mail issue was resolved. We're still investigating the root cause of the related issue and how it was masked from our monitoring systems.
I learned of this post today after opening a service request on Tuesday (which was addressed on Thursday). I appreciate reading this, but wonder why it wasn't posted as a link in the Admin portal (which was working on Tuesday, correct?) that day.
My company is coming off an on-premise nightmare experience, where the server would go down and we wouldn't have any sort of update as to what was going on. I looked forward to our BPOS transition because that problem would go away.
You can never over-communicate in this type of situation. People initially just want to know if it's just them or if it's everybody...very similar to when the power goes out in your neighborhood.
I appreciate the post, and appreciate even more that JRG_MSFT has been replying to comments. That's pretty terrific of you, JRG_MSFT.
Here's to the weekend ;)