A few years ago I was working for a major UK outsourcing company, looking after a key account within a large UK organisation. I was given a project to upgrade a customer's McAfee virus scanner to CA's eTrust version 6 (like I said, it was a few years back :) ). The project looked straightforward, although the timescales were quite aggressive: an oversight regarding licence renewal gave us only two months to completely replace the antivirus software for 12,000 clients.
I started the project like I do all others, with a good high-level design trying to encompass all the major goals of the project and identify the main risks. Now, antivirus software is not my normal job, but it does in theory fall into the category of systems management, so a colleague and I built up a model office environment to test the new product on a very old build; the customer used NT 4 Service Pack 3 (don't ask why, I would need to write 100 pages just to try and explain). The first major item the testing drew out was that I couldn't use both the McAfee and eTrust agents on the box at the same time. Well, I could, but it took 40 minutes to log in if you did. So we had the challenge of how to remove McAfee and install eTrust without losing antivirus protection for any length of time. So we decided to:
· Disable the services and functionality of McAfee
· Install the new eTrust software, test everything worked
· If it didn’t, disable any bits of eTrust, re-enable McAfee, and log the install as a failure requiring a check by an operator.
Now all this was done in a very complex batch script (which ended up at version 17 by the time we finished testing it and rolled it out to live!), and the script, along with the required software files, was delivered to the clients via software distribution on a rolling schedule to minimise impact on the network.
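The shape of that script was roughly as follows. This is only an illustrative sketch: the service names, installer path, and log share below are hypothetical placeholders, not the actual ones from the project (and the real version-17 script handled far more edge cases).

```bat
@echo off
rem Illustrative sketch only - all service names, paths and shares
rem below are hypothetical placeholders, not the real ones.

rem 1. Disable McAfee so the two agents never run side by side
net stop "McShield"
net stop "McAfee Task Manager"

rem 2. Install the new eTrust software silently and verify it worked
\\siteserver\av$\etrust\setup.exe /silent
if errorlevel 1 goto rollback
echo %COMPUTERNAME% SUCCESS >> \\siteserver\logs$\av_rollout.log
goto end

:rollback
rem 3. On failure: back out eTrust, re-enable McAfee, and flag the
rem    machine for manual checking by an operator
net stop "eTrust InoculateIT"
net start "McShield"
net start "McAfee Task Manager"
echo %COMPUTERNAME% FAILED - operator check required >> \\siteserver\logs$\av_rollout.log

:end
```

The key design point is the fallback path: at no stage is the client left with neither agent able to run, which is what the three steps above were protecting against.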
Now, you can imagine we did a fair bit of testing, and most of the two months we had allocated for the project went on testing and sample rollouts to small pilot groups of PCs. But we didn’t do enough: we failed to test one key area of the product, network bandwidth usage.
We made some allowances for this in the lab. We knew the product would need to get signature updates and send data back occasionally, but with most of the links being good and the client communication usually going back to a local site server, we didn’t believe we would have a problem… how wrong we were.
We started the deployment with some of the smaller sites. No problems; we were getting a good hit rate of about 95% success, the project picked up speed, and we got to the final large site in a matter of weeks. This site held about 3,500 clients. We started the rollout, and the following day all hell broke loose. Clients started reporting that it took anything from 10 minutes to an hour to log on each morning. We spent all day trying to figure out why; nothing major came up, and the problem seemed to disappear as the day went on. Next day, the same thing happened again. We added some network monitoring software onto the network, and the number of LDAP queries the clients were each generating to the Domain Controller every morning was horrendous. After a lot of testing we found the problem: because the clients were running SP3, the new antivirus program couldn’t mark a file as already virus-checked so that it wouldn’t bother with it again unless it changed; this feature of the OS didn’t come in until SP4.
…so every morning when the client started the antivirus program would go
… ok need to check this file
… now who owns it
… I know I’ll ask the DC
... for every system file it was trying to load into memory
The DCs would get swamped and the clients would start to queue their requests… After a few frantic calls to CA (after all, we were trying to do stuff with their product on an unsupported service pack) we got a hotfix and the problem went away.
OK, so what can we learn?
When you start to look at implementing a new product, try to think outside the box when drawing up your test plan. Factor in all parts of the network that may have contact with the product, not just the areas you are working on. Think about how you roll out the product, and continue to test the solution as you deploy to find any weaknesses. Or in other words: test, test and test again.
Would we have found this in the lab? Probably not, as we didn’t do network tracing as part of our acceptance tests for our products before they went live… but guess what, they do now…
A great story - certainly goes to show that you can't test too much. I've got to admit I wouldn't have thought about the network bandwidth saturation issue either!
Glad you managed to troubleshoot it though!