My name is Morgan Cole, and I lead a team at Microsoft whose mission is to make sure that BPOS customers have a great experience with our services. We aspire to deliver quality services, and in the last couple of weeks, we have fallen short of this aspiration. During this time, we experienced two network access issues in North America, and just yesterday, two brief periods of service degradation also affecting users served from North America. These incidents were unique to BPOS and not related to other Microsoft services.
I wanted to write here to apologize to you, our customers, for any inconvenience these issues may have caused. We know how important these services are to the daily operation of your business, and we take our responsibility as your partner and service provider very seriously.
I also want to provide a bit more detail about the recent issues.
Specific to the August 23 event: our proactive efforts to upgrade to next generation network infrastructure caused unforeseen problems that affected access to some services. Operations and Engineering quickly identified a design issue in the upgrade that caused unexpected impact, but the issue resulted in a 2-hour period of intermittent access for BPOS organizations served from North America.
The August 23 event was remediated, but the solution did not resolve another underlying issue, which created subsequent problems on September 3rd and 7th. BPOS customers experienced brief periods of service degradation, primarily affecting the sign-in service and administrative portals. The incident on the afternoon of September 7th affected more customers, although its duration was relatively short. We performed emergency maintenance to isolate suspect traffic, which has proven successful in stabilizing the service. We continue to monitor the network and all services to ensure stable operations. Needless to say, we, like you, find the events unacceptable and have 24/7 efforts underway to ensure we do not have a repeat of these events.
We appreciate the serious responsibility we have as a service provider to you, and we know that any issue with the service is a disruption to your business – and that’s not acceptable. I can assure you that we are investing the time and resources required to ensure we are living up to your – and our own – expectations for a quality service experience every day.
As always, if you are experiencing any service issues, we encourage you to contact us. Our customer support is available 24 hours a day by telephone or via Service Requests submitted from the Microsoft Online Services Administration Center.
Given that the 2-hour outage equates to 99.7% uptime for August, will you be honoring your pledge to refund affected users? My understanding was that the 99.9% uptime promise was backed by a money-back guarantee.
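For readers who want to check the 99.7% figure, here is a rough sketch of the arithmetic. It assumes the full 2 hours counts as downtime (the post describes the access as intermittent) and that availability is measured against every hour in the calendar month; how the actual SLA measures availability is not stated in this thread. The function name is purely illustrative.

```python
def monthly_uptime_percent(outage_hours: float, days_in_month: int) -> float:
    """Return uptime as a percentage of all hours in the month."""
    total_hours = days_in_month * 24
    return 100.0 * (total_hours - outage_hours) / total_hours


# August has 31 days (744 hours); assume the whole 2-hour period counts as downtime.
august_uptime = monthly_uptime_percent(outage_hours=2, days_in_month=31)
print(f"{august_uptime:.2f}%")  # 99.73%

# Downtime that a 99.9% target would allow in a 31-day month, in minutes.
allowed_minutes = 31 * 24 * 60 * (1 - 0.999)
print(f"{allowed_minutes:.1f} minutes")  # 44.6 minutes
```

Under these assumptions, a 2-hour outage is well over the roughly 45 minutes that a 99.9% monthly target would allow.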
What is being done to improve communication when there are issues? On 9/7, on the Online Services admin site, the Service Status showed services were "Healthy" during a time when the services were not accessible. Additionally, the information provided by the RSS feeds is frustratingly vague and not timely.
There was also an outage on August 10th... what was the cause of that outage?
Morgan, what is needed is better tools to measure Service Level; that will make a difference.
Hi Guy. Thanks for your question.
In the case of the widespread August 23 incident, we proactively provided a credit to all affected customers. However, in general practice, customers who believe that we have not met our service level agreement should contact Support to request an SLA credit. As stated in the post, customers can contact Support by phone, or by filing a service request in the Administrator portal.
Thanks again for your question.
JRG_MSFT
Hi David/One Guy
Your feedback is definitely appreciated. We're reviewing all of our communications and service level measurements to identify areas of improvement. One area of focus that we have is to build better tools to provide timely, accurate and targeted communications about service health. Per your statement about RSS feeds, we've introduced customization into the service, so we can provide more data about the situation. We're starting to use that feature more and more, as we did to describe the September 7 service event.
As we roll out new improvements to our incident communications, we'll definitely update customers and blog readers as soon as possible.
Thanks for taking the time to post.
Thank you for the post; it has been read with interest. Since we are connected to the EMEA BPOS we have not been affected by this, but it is still important to get an overview of what happened. It seems that this is because of the large number of users you have in North America. Will the solutions to the problems you have met now automatically be mirrored to EMEA, so that when the client base grows we would probably avoid problems like this? Also, with all the talk about mirrors and hot sites, why aren't the clients just moved to a new site when you have problems? Is there any documentation about the Exchange2007/Sharepoint setup/redundancy? Is it possible to do this or is it just for disaster recovery? Since we are now going all in with the clients, I really appreciate all the info you can provide.
Best regards a happy BPOS partner and user
Hi Lars T
Lots of questions there, but great to hear from a partner and customer! Let me see if I can address them all. The service issue indeed was isolated to customers served from our North American data center. And while our customer base is growing very fast, the issues we had were not related to capacity, but rather to network configuration. And all of the painful lessons we've learned from these service events carry over to the configuration, operation, support and maintenance of all of our data centers.
To your question about client movement related to potential service interruptions, our disaster recovery philosophy is to solve for data availability and data security. Failing over the service to a different data center may have required more time than the service interruption itself, and no data was ever lost during the event. That's the key design principle for us - protect customer data.
We have a number of different documents related to our service available for partners. You can go to our Quickstart for Online Service page here (www.quickstartonlineservices.com) for more information.
Thanks for taking the time to post.
Built by man, broke by man.......
I suggest you take a look at the transparency provided by Salesforce.
I'm evaluating a subscription to BPOS, and when I asked for an availability report, the reseller told me that MS doesn't deliver this kind of info. This behaviour does not make a positive impression on a potential customer.
Our experiences with the service center haven't been great. We have two outstanding replication issues with our GAL, one recurring, the other new. My colleague has had to explain, in great detail, the issue to several different people with absolutely no resolution. So far not impressed with your support.
Hi Tiffany, can you please send your SR numbers to BPOSCOM@microsoft.com
If you have not submitted an SR please visit this site http://cot.ag/aqE6bB or call 1-866-676-6546 option #2 or #3 for customer service.
Thank you and we will try our best to resolve your issues as soon as we can,
Morgan, you said this issue was caused by a CHANGE to the network infrastructure. It occurred on August 23, but you did not specify what time. Aug 23 is a Monday, a typically heavy day for businesses. Why were you making changes to the infrastructure mid-week, and without a quick backout or failover plan? In December, BING had a similar outage, caused by a mid-week, mid-day CHANGE that did not have a quick backout plan.
I apologize for being blunt on this, but in an enterprise-class data center, you do NOT make changes mid-week, mid-day. In enterprises, people/management are fired for such activity. Any change that has any degree of risk should be isolated to very finite (and highly published) maintenance windows, usually on Sunday mornings between 1-4 AM.
Does MS have a stringent Change Management policy? Does it really allow such changes to occur mid-week? What is MS doing to ensure that changes with potential impacts are NOT allowed during peak business times?
The Salesforce system status site seems to provide valuable insight into system performance. I appreciate the feedback; please check back here for updates on what we're planning to deliver to provide more information to our customers. We take our 'better every day' mantra very seriously, and that includes regular updates to service availability and communications regarding service status.
Thanks for your feedback.
Thanks for the feedback. Perhaps I should have been more specific.
The changes were made during our regularly scheduled maintenance window, which runs from late Saturday evening into early Sunday morning, Redmond standard time. The resulting customer impact occurred as our customers came into the office on Monday morning and began to put significant load on the system.
As you mention, planned maintenance is done in accordance with industry best practices to limit any interruption for customers. We have very strict maintenance procedures to ensure we limit any sort of mid-week changes that might impact our customers' productivity.
Again, we appreciate the feedback, and the opportunity to provide a bit of clarification.