• Using Word 2010 with TechNet Blog Platform

    This is obviously not a post about Exchange Server, but as a blogger I wanted to share this since it's been something that has been irritating me for some time…

    Basically I like to write in Word rather than Live Writer. Live Writer is great, but I spend a lot of time writing documentation in Microsoft Word 2010, so that is my preferred writing tool. Most of my blog posts start life in Word 2010 and then I have to transfer them over to Live Writer to get the content uploaded to my blog. The problem is that the formatting is often lost during the transfer and I have to spend time trying to make it look how I wanted in the first place… surely there must be a way to get Word 2010 to write to my TechNet blog directly? Surely?? :)

    So… after some searching and some trial and error I have finally managed to come up with a solution to connect Word 2010 with the TechNet Blog platform…

    Connecting Word 2010 to TechNet Blog Platform

     

    1. Open Microsoft Word 2010
    2. Click File -> New
    3. Pick Blog Post from the Available Templates
    4. Word should prompt you to Register a Blog Account at this stage
    5. Select Register Now
    6. In the Choose your blog provider drop down, select Other and click Next
    7. Select MetaWebLog in the API type
    8. Enter your TechNet blog URL with /metablog.ashx tagged on the end – in my case this ends up being http://blogs.technet.com/b/neiljohn/metablog.ashx
    9. Enter your username and password
    10. Click OK
    11. Your TechNet Blog account is now registered in Word 2010
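
    For anyone curious what Word is doing with the details entered in steps 7 to 9, here is a minimal Python sketch of a MetaWeblog request. It only builds the endpoint URL and serialises a metaWeblog.newPost call; it does not contact any server. The blog id, username, password and post content are placeholder assumptions, and whether TechNet expects exactly these values is not something the steps above confirm.

```python
import xmlrpc.client


def metaweblog_endpoint(blog_url: str) -> str:
    """Append /metablog.ashx to the blog URL, as described in step 8."""
    return blog_url.rstrip("/") + "/metablog.ashx"


def build_new_post_request(blog_id, username, password, title, html_body, publish=False):
    """Serialise a metaWeblog.newPost call to XML-RPC without sending it."""
    # "title" and "description" are the standard MetaWeblog post fields.
    post = {"title": title, "description": html_body}
    params = (blog_id, username, password, post, publish)
    return xmlrpc.client.dumps(params, methodname="metaWeblog.newPost")


# Placeholder values for illustration only (assumptions, not real credentials).
endpoint = metaweblog_endpoint("http://blogs.technet.com/b/neiljohn")
request_xml = build_new_post_request(
    "neiljohn", "user", "secret", "Hello from Word", "<p>Test post</p>"
)

print(endpoint)
print(request_xml)
```

    To actually send the request you would pass the endpoint to xmlrpc.client.ServerProxy and call metaWeblog.newPost on it, but that needs a live blog and real credentials, so it is left out here.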

    Hopefully someone else out there will find this useful :)

  • Database Maintenance in Exchange Server 2010

    If there is one area of Exchange 2010 that was poorly documented and poorly understood, it has to be database maintenance.  The problem stemmed from the significant changes that we made in the store for Exchange 2010 and a confusion of terms, which has left many people in the field unsure about what is actually going on in the various database maintenance activities.

    Well.. I am pleased to say that Ross Smith IV has written a fantastic article that explains what is actually going on inside these mystical processes and what they actually do for us…

    http://blogs.technet.com/b/exchange/archive/2011/12/14/database-maintenance-in-exchange-2010.aspx

  • Exchange Server 2010 Service Pack 2 (SP2) is released!

    Last week when the internal announcement was made to say that SP2 had been completed and was scheduled for release, I added a few blog articles onto my “to do” list.  Imagine my surprise this morning when I woke up to find that the internets were crammed full of SP2 news and that most of the articles I was planning to write had already been written! (If only that happened with my design deliverables!)

    So, I have decided to write a post today about all of the other posts instead.

    Firstly here is a link to download SP2 if that's all you need – I still recommend that you skim through the release notes, even if you are progressing through your own test plan since it may save you some time.

    Released: Exchange Server 2010 SP2

    The Exchange Team Blog (this is owned by the Microsoft Exchange Server Product Group if you were not aware) announced the release of SP2 on Dec 5th at 6.38pm with the following post…

    Some interesting things to note…

    1. It requires a schema update (just like every other Exchange SP since 2007!)
    2. It provides the following new features…

     

    Exchange Server 2010 SP2 Release Notes

    Before you consider deploying any new service pack for Exchange I strongly urge you to take a look at the release notes…

    For this release the following areas may require attention – all are discussed in the release notes.

    • Installation Prerequisites for Client Access Servers have changed
    • Possible issue with RBAC on non-SP2 servers after upgrading one server to SP2 in the organisation
    • Outlook Web App redirection issue after upgrading to SP2
    • Mailbox Replication Proxy
    • Hybrid Configuration Wizard error with domains starting with a numeric value

    It is worth noting here that Exchange Service packs are inclusive of all previous service packs and update rollups.  This means that Exchange Server 2010 SP2 includes the fixes and changes up to and including Update Rollup 6 for Exchange Server 2010 Service Pack 1.

    Exchange Server 2010 SP2 Internet Content

    This is a subset of the non-TechNet material I recommend reading to get up to speed on Exchange Server 2010 SP2

    Other Interesting Stuff

    I spotted that we also released a new version of the Exchange Server User Monitor yesterday – I don't know if this is required if you deploy SP2 but I thought I would include it here just in case :)

    Hopefully this provides enough information in one place to save you searching the rest of the internet for your Exchange Server SP2 news :)

  • Exchange Service Availability….For the greater good…

    I recently authored a post over at the Exchange Team Blog about Exchange ESE and Windows Disk Timeout values.

    Whilst writing that article and especially researching how ESE timeouts have evolved, I began thinking about the history of Exchange availability and how the product has evolved over my time working with it…

    In the beginning…

    Back when I began in enterprise infrastructure (1997) the name of the game was all about server uptime, not necessarily service uptime.  What was interesting back then was that our servers were monitored for availability but we didn't have the technology to record service availability very well.  We didn't really have many options to improve our service availability with Exchange 5.5 – I used to pack out my servers with resilient components such as redundant network cards, power supplies and RAID controllers to ensure that they would keep running for as long as possible, but if the server did go down our only option was to bring it back up as quickly as possible.  My first concern during an outage was always the health of my 30GB EDB files – I knew that if one of these was damaged or the storage was unrecoverable then I was in for a minimum 10 hour restore from tape.  I also knew that my manager would be pretty unimpressed when I delivered this news, so I would go to elaborate lengths to try and avoid it.

    Towards the end of my time with Exchange 5.5 we were encouraged (by Microsoft) to look at clustering technology.  I duly did this in my test lab (with real physical servers and a SCSI “Y” cable!) and I have to admit that I was pretty impressed initially.  With clustering I would be able to tolerate an entire server failure and be back online in a few minutes.  This was a huge step forward.

    It was only when I started to write my business case to justify my new hardware requirements that I realised that over the previous two years all of my serious outages had come from storage issues (RAID controller firmware, cache failure, multiple HDD failure).  I was hoping to use previous outage details to justify my request for new Exchange clusters, but the reality was that we had only had one real server failure (main board), and I was able to replace that fairly quickly with one from the test lab and have it back online in 2 hours.  All of the serious service outages had come from storage related failures where I had to revert to tape restoration.  I realised right at this point that clustering based on a shared storage model probably wasn't going to give me what I needed.

    The next level…

    A few years later I was going through the design phase for Exchange 2003.  My primary goal for the new platform was to improve our service levels, which I was informed by our service delivery manager were fairly steady at 99% for messaging.  I was still very aware at this point that to do this I needed to improve my restore times from tape.  Some progress had been made with the switch to DLT drives, but my EDB files were growing at an incredible rate and it only took 18 months before my restore times were back at 10 hours per server again.  I re-visited clustering during this timeframe but came to the same conclusion as before: it was just adding complexity and wasn't going to materially affect my service availability.  Instead I chose some fancy SAN technology which allowed me to use snapshots to recover my EDB files in minutes rather than hours.  This technology also allowed me to mirror my data off-site.  Management were duly impressed with my solution, until they realised how much it was going to cost!  Still, they eventually implemented the design and service levels did improve somewhat (although not as much as I had hoped).

    What I had totally neglected during this design was that my new storage technology was a lot more complicated than my old RAID controllers were.  I quickly discovered that it was very easy to break things on a monumental scale.  One wrong command given to my clever new storage technology would be sufficient to stop the entire messaging service.  This realization very quickly led me to adopt the “if it isn't broke, don't fix it” approach to maintaining service.  This brought its own issues though, and I was faced with an annual update task to bring the messaging servers back in line with our enterprise standards.  This was always a disaster and would entail almost a day's worth of downtime while we tried to get the magical combination of OS hotfixes, drivers, firmware and SAN revision right.  Then, once we had a shiny rack of twinkly green lights we would leave it alone for another year…

    Clustering that works?

    Exchange 2007 seemed to address all of my previous concerns.  I was now working for Microsoft so I was getting to see lots of customers, all struggling with the same basic issue I had previously experienced as a customer, i.e. how do we recover service quickly in the event of a database problem?  Exchange 2007 seemed to solve this elegantly with the introduction of CCR clusters.  I loved the simplicity of this solution – just copy all of the changes from the active server to the passive server and then fail over in the event of a problem.  It meant that even in a serious failure scenario we could just bring up our passive copy and be back online in minutes!  The solution was out-of-box and so there were no supportability issues either!  Customers loved CCR and I thought it was the best Exchange feature ever developed.

    Over time though I started to see issues with the CCR model.  The most frustrating was that we had to fail the whole server over to the passive node.  This meant that if we had an isolated storage failure on the active node we had a difficult decision to make… leave the users on the failed database offline until a maintenance window was available, or interrupt service for everyone on the server and fail its workload over to the passive node.  Not a great position to be in, and often it would depend on “who” was on the failed database and how important they were, rather than anything more scientific!

    Service Availability?

    Exchange 2010 took the CCR model and addressed many of the issues reported by customers.  Now we could fail over individual databases between server nodes.  This was a huge step forward and meant that many customers were now actually hitting 99.9% availability.  Provided that best practices were followed, it took an unusual situation to take Exchange 2010 offline for any significant period of time.
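
    To put the availability figures in this post into perspective, here is a small Python sketch that converts an availability percentage into the downtime it allows per year. The function name and the 365 day year are my own assumptions for illustration; real service level calculations often exclude planned maintenance and use a different measurement period.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9):
    hours = allowed_downtime_hours(target)
    print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

    At 99% the budget is roughly 87.6 hours a year, while at 99.9% it shrinks to roughly 8.76 hours.  A single 10 hour restore from tape would therefore use up more than a whole year of downtime at 99.9%, which is why fast database failover matters so much for that target.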

    Over the past couple of years I have noticed some interesting service outages for my customers though.  If the server experiences a clean failure of a storage component, the database hosted on that storage will simply move to an alternate copy and service will be resumed.  However, if the storage device does not fail cleanly and instead just begins responding slowly or intermittently, the database will not simply fail over.  This is quite common with JBOD or where the storage controller suffers an unusual failure, such as overheating or memory corruption.

    Self Sacrifice

    Now we return to the present day (well, earlier this year).  I was in a storage design meeting with a customer and their core storage vendors.  I began talking about changes in Exchange 2010 SP1 and how Exchange would force a BSOD (bugcheck) on a server if we didn't hear back from a LUN in 4 minutes.  At this point I could hear a sharp intake of breath from pretty much everyone involved.  The feeling in the room was that forcing a server to crash reboot was insane!

    I began to question this behaviour myself and started to think back to some of the failures I had seen and how, or if, this behaviour would have helped.  I have to admit that forcing a server to blue screen does seem pretty extreme; however, upon reflection I came to the conclusion that I quite liked this behaviour.  Given that we have multiple independent copies of our database in a DAG, would I rather a workload remained on a server with a storage I/O problem (4 minutes is a long time to get a response back from your storage!) or that it was moved to another copy?  Well, obviously I want it to be moved and actually I'm not even sure I want to wait 4 minutes!

    For me the decision is pretty obvious.  If I have multiple independent copies of my databases, I want Exchange to switch over intelligently if it detects a problem with the currently active copy.  It may seem counterintuitive to crash reboot a server hosting an active service to improve service availability, but the alternative is to leave the service running on a wounded server until a human being comes along and does the same thing.  Frequently in my support days I would arrive in the datacentre to find a hung server with a black screen.  I would try remote RDP access, maybe trigger a remote reboot via RPC, but all too frequently once a server gets into this state you need to press and hold the big red button on the front to get it to come back up and begin troubleshooting the event logs.  For as long as the server is in this hung state the service is unavailable.  With a triggered bugcheck someone still needs to troubleshoot the root cause, but at least the service is only interrupted for a few minutes rather than a few hours…
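
    The decision described above can be modelled in a few lines of Python.  This is only an illustrative sketch of the behaviour as I have described it, not how Exchange or ESE is actually implemented: the function, the enum and the idea of a simple "seconds since the last I/O response" input are all assumptions made for the example, and only the 4 minute figure comes from this post.

```python
from enum import Enum

HUNG_IO_TIMEOUT_SECONDS = 4 * 60  # the 4 minute figure quoted in this post


class Action(Enum):
    NONE = "keep serving from the active copy"
    FAILOVER = "activate another database copy"
    BUGCHECK = "bugcheck the server so another copy can take over"


def choose_action(storage_failed_cleanly: bool, seconds_since_last_io_response: float) -> Action:
    """Hypothetical model: pick a response to the state of the active copy's storage."""
    if storage_failed_cleanly:
        # A clean failure is easy to detect, so the database simply moves.
        return Action.FAILOVER
    if seconds_since_last_io_response >= HUNG_IO_TIMEOUT_SECONDS:
        # Storage is hung rather than failed: sacrifice the server.
        return Action.BUGCHECK
    return Action.NONE


print(choose_action(True, 0).value)
print(choose_action(False, 30).value)
print(choose_action(False, 300).value)
```

    The interesting case is the last one: the storage never reports a failure, so without a timeout the workload would stay on the wounded server indefinitely.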

  • Recommended Windows Hotfix for Database Availability Groups running Windows Server 2008 R2

    Scott just posted this article up on the team blog..

    A summary of the issue…

    “This hotfix is strongly recommended for all database availability groups that are stretched across multiple datacentres. For DAGs that are not stretched across multiple datacentres, this hotfix is good to have, as well. The article describes a race condition and cluster database deadlock issue that can occur when a Windows Failover cluster encounters a transient communication failure. There is a race condition within the reconnection logic of cluster nodes that manifests itself when the cluster has communication failures. When this occurs, it will cause the cluster database to hang, resulting in quorum loss in the failover cluster.”

    If you haven't already picked this up I strongly recommend taking a look and assessing if you need it on your production clusters.