SharePoint Ramblings

tips, tricks, thoughts & ponderings from a PFE living in the SharePoint trenches

What every SharePoint admin should know about troubleshooting and managing the newsfeed & distributed cache

If you’re using any of the new social capabilities in SharePoint 2013, you’re probably already somewhat familiar with the role distributed cache plays in the newsfeed functionality.  If you haven’t already, I’d also highly recommend reading Josh Gavant’s two-part blog series on the topic:

Part 1: AppFabric Caching and SharePoint: Concepts and Examples
Part 2: AppFabric Caching (and SharePoint): Configuration and Deployment

Done reading? Great, welcome back.  Now that you’re an AppFabric guru, I’m going to skip the basics & assume you’re already familiar with the concepts.  Pretty slick stuff, isn’t it?  Here’s my compilation of handy tips, tricks & scripts you’ll want in your toolbox for troubleshooting & managing the distributed cache in your own environment.

Understanding Data Loss

Figured I’d start here since preventing data loss is hopefully a priority for all of us.  So what exactly does distributed cache data loss look like when it comes to the newsfeed?  Well, if you don’t have at least one healthy cache host in your farm, basically the newsfeed won’t work.  You can post something to the newsfeed, refresh the page, and your post won’t show up.  Remember that most of these things still get persisted to the database, but surfacing that post back onto your newsfeed is a function of the distributed cache, & if you don’t have any healthy cache hosts, nothing will show up in your feed.  Users won’t like that – expect some calls.

So what if I have multiple cache hosts & some are healthy, some are not?  Most likely you’ll just notice things like some posts “dropping off” the newsfeed, while other posts persist.  Why is my post from 5 days ago still out there, but my post from 2 days ago is gone?  Things like that are the symptoms of having one or more unhealthy cache hosts in a multi host cache cluster.

Also note that some things like document following activity are not persisted to the database, so when you experience data loss on a cache host, those things are gone forever.

Checking Cache Host Statistics

So how do I find the unhealthy ones & what do I do about it?  Health analyzer has a rule for that, so check central admin & see if it has already done the work of identifying which host is unhealthy.  For the purpose of this blog I’m going to ignore that.  So, back to Josh’s blog: he’s got some great scripts for looking at all of your cache hosts’ statistics, configuration, and other details.  My favorite is this one, which returns some basic statistics from each cache host in the farm including name, size of cache, number of items in cache, request count, etc.  Very useful stuff – give it a shot.  You'll need to run the "Use-CacheCluster" cmdlet before running any of the cmdlets below or you will get a "cacheHostInfo is null" error when executing any of the AppFabric cmdlets.

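# Connect to the cache cluster configuration first; the AppFabric cmdlets need it
Use-CacheCluster

# Collect statistics from every cache host and tag each result with its server name
Get-AFCacheHostStatus | ForEach-Object {
    $ServerName = $_.HostName
    Get-AFCacheStatistics -ComputerName $_.HostName -CachePort $_.PortNo |
        Add-Member -MemberType NoteProperty -Name 'ServerName' -Value $ServerName -PassThru
} | Format-List -Property *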

Run that on any of your cache hosts & see if you can connect to the cache cluster & return details from all other cache hosts.  If it throws an error, you’re likely on the unhealthy host, so try it on one of the others.  If you do receive an error, another simple test is to just run the “Use-CacheCluster” cmdlet on its own.  If it can connect to the cache configuration, the command will succeed.  If it can’t, it sounds like you’ve found an unhealthy cache host, and you’ll see an error like this: Invalid provider and connection string read. Please provide the values manually
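If you want to script that quick connectivity test, here’s a minimal sketch.  It assumes you’re running it in the SharePoint 2013 Management Shell on the cache host you want to check, and that a failed connection surfaces as an error that -ErrorAction Stop turns into something try/catch can handle.  The messages it writes are purely illustrative.

```powershell
# Sketch: check whether this server can reach the cache cluster configuration.
# Assumes the SharePoint 2013 Management Shell on a cache host.
try {
    Use-CacheCluster -ErrorAction Stop
    Write-Host "$env:COMPUTERNAME connected to the cache cluster configuration."
}
catch {
    # An "Invalid provider and connection string read" error would land here
    Write-Warning "$env:COMPUTERNAME could not connect: $($_.Exception.Message)"
}
```

Run it on each cache host in turn; any host that hits the warning branch is a candidate for the repair steps that follow.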

So what do I do about it?

Stopping & Repairing a cache host

First, try a graceful shutdown on your unhealthy host.  The graceful process moves the cache items from that host to the other cache hosts in that cluster, so any time you need to reboot a cache host or take it offline, you’ll almost always want to do a graceful shutdown.  Otherwise you’ll lose the cache data from that host & those posts will no longer show in the newsfeed.  Basically the same symptoms as an unhealthy cache host – some posts show on the newsfeed, others don’t, because you lost what was on that cache host.  To save you a click, the graceful shutdown process consists of these two commands:

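# Move cached items to the other hosts, then stop the service on this host
Stop-SPDistributedCacheServiceInstance -Graceful

# Remove the service instance from this server
Remove-SPDistributedCacheServiceInstance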

Now, delete the service instance as described in the “repair a cache host” section of the Managing Distributed Cache article on TechNet.  The PowerShell below saves you the step of having to look up the GUID as the article does.  If the Remove-SPDistributedCacheServiceInstance cmdlet executed above removed the service instance successfully, this step won't be necessary & you'll just get a null reference exception when executing the delete method.  But if Remove-SPDistributedCacheServiceInstance wasn't able to delete the service instance for whatever reason, you'll want to complete this step before attempting to add the faulty host back to the cache cluster.

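# Name of the Distributed Cache service as returned by Service.ToString()
$instanceName = "SPDistributedCacheService Name=AppFabricCachingService";

# Find the Distributed Cache service instance on this server
$service = Get-SPServiceInstance | Where-Object { ($_.Service.ToString()) -eq $instanceName -and ($_.Server.Name) -eq $env:COMPUTERNAME };

$service.Delete();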

Adding back to the Cache Cluster

To add your cache host back to the farm, run the following cmdlet.  Never stop or start the service directly from the Services console or central admin; always use these cmdlets for managing the service.  Also note that you may need to give everything a few minutes before running the cmdlet below or you may end up getting a "TCP port 22234 is already in use" error.  If that happens, just delete the service instance again as described above, give it a few more minutes, then try again.

Add-SPDistributedCacheServiceInstance

At this point you may also want to manually kick off the timer job that handles the repopulation of the feed cache, which by default runs every 5 minutes.  The name of that job is “User Profile service application proxy - feed cache repopulation”.  Now try running the Get-AFCacheStatistics script from earlier to check statistics & you should see the item count starting to grow on the cache host you just added back to the cluster.  If you have other unhealthy cache hosts, just repeat the same process on the other hosts, but remember that the graceful process unloads one host’s cache to the other available hosts, so you probably only want to do this on one server at a time to avoid overloading your other hosts.  Also remember that once you hit your configured memory limits, things will start getting evicted from cache, which is another reason not to overload your other hosts.
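Kicking off that timer job by hand could look something like the sketch below.  It assumes the SharePoint 2013 Management Shell, and the display name pattern is my assumption based on the job name quoted above, so confirm it against the job list in your own farm before relying on it.

```powershell
# Sketch: start the feed cache repopulation timer job on demand.
# The display name pattern is an assumption; verify it in your own farm.
$job = Get-SPTimerJob | Where-Object { $_.DisplayName -like '*feed cache repopulation*' }

if ($job) {
    # Starts every matching job (one per User Profile service application proxy)
    $job | Start-SPTimerJob
}
else {
    Write-Warning 'No timer job matched; check the job name in Central Administration.'
}
```

Give the job a minute or two to run, then re-check the item counts on the host you just added back.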

At this point you should be able to sit back & behold the beauty of the newsfeed.  Ahhhh, sweet social.

Until next time, happy caching!

<update 5/13/13>   One more thing - Josh makes an important note in the eviction section of the 2nd post linked above, specifically:  "If and when less than 15% (by default) of server memory remains, an eviction run is initiated regardless of the local cache host’s watermark and size settings. That is, even though the host is not using all of its allowed memory, if available memory on the server is below 15% of the total physical memory, a full eviction run will begin, as if the high watermark had been passed."    Think about that for a minute....  Now take a look at the guidance posted in the planning for distributed cache article & decide if you really want to gamble & over-allocate your cache hosts.   Believe me, a full eviction of a cache host just because you didn’t properly plan for peak capacity on your servers leads to a bad user experience, and it will leave you scratching your head when all your cache hosts are reporting healthy yet users are reporting missing posts in their newsfeed.  And while we’re at it, how’s your monitoring situation?  Are you properly set up to alert on low memory conditions?  For your own sanity, I sure hope so...
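If you don’t have monitoring in place yet, here’s a rough sketch of the kind of check I mean.  The function names and parameters are made up for illustration, the 15% default comes from the quote above, and it treats free physical memory from Win32_OperatingSystem as a stand-in for "available memory", which may not be exactly what AppFabric measures.

```powershell
# Sketch: flag a server whose free physical memory is below the 15% default.
# Function names and parameters are illustrative, not part of any product.
function Test-EvictionRisk {
    param(
        [Parameter(Mandatory = $true)][double]$TotalKB,
        [Parameter(Mandatory = $true)][double]$FreeKB,
        [double]$ThresholdPercent = 15
    )
    $freePercent = ($FreeKB / $TotalKB) * 100
    # True when free memory has dropped below the threshold
    return ($freePercent -lt $ThresholdPercent)
}

function Get-LocalEvictionRisk {
    # Win32_OperatingSystem reports both values in kilobytes
    $os = Get-CimInstance -ClassName Win32_OperatingSystem
    Test-EvictionRisk -TotalKB $os.TotalVisibleMemorySize -FreeKB $os.FreePhysicalMemory
}
```

Calling Get-LocalEvictionRisk on a cache host returns True when free memory is under the threshold.  Treat it as a spot check only; a proper monitoring tool that alerts on sustained low memory is what you really want.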

 

Comments
  • Fantastic post, thanks Scott. DC is all new for SP admins to support. Appreciate the tips!
