In my last blog I hypothesised that architectural analysis is slightly different from developer analysis and so needs a subtly different skill set and way of thinking. To demonstrate what I mean, let me describe a real-life example of an architectural problem and the different solutions to it.
I was called up late one Friday afternoon (why is it always on a Friday that these things happen?) by a distraught business manager whose biggest customer had a problem (I won’t say who it was, but they are a household name in the UK). They had decided to provide a new product over the telephone and so had built a customer / order processing system for a maximum of 200 telesales operatives using Microsoft products. They were going live on the Monday when the new product launched (and that wasn’t going to be easy to stop!). They had been in stress test for 3 weeks, and when they took the load up to 50 users the system crashed. They had tried to fix the problem and couldn’t, so it must be Microsoft’s fault; after all, they had read in the press that Windows didn’t scale and here was proof! The business manager wanted me at their office (a 3 hour drive) asap, not so much to fix the problem as to show that we were doing something. It seems to be a common misconception that putting technical people in cars or trains is a valuable use of their time, which I vigorously dispute; I feel that most problems can be solved more quickly over the phone. There was a short discussion about efficient problem solving techniques, he spoke to my manager and I was in the car. Why is my life so like a Dilbert cartoon?
Getting to the customer, a disaster scene met my eyes: paperwork everywhere, empty coffee cups, red eyed technical people, irascible managers, phones ringing, you know the sort of thing. The technical people just wanted me to say it was our technology's fault so they could go home. Managers wanted to nail me to a whiteboard and take turns with the whip; fun all around. However, as I am not into S&M, I insisted on looking at the application first (so maybe I am, just not that sort!).
It was a simple 3-tier app: a smart client, a business tier doing some business processing and a database with some simple stored procedures isolating the data access; nice and simple. There was, however, one strange thing: they had a second server running a piece of the business logic alongside the main business server. I asked why this was, and it transpired that they had profiled the application (nice but unusual in my experience) and found that one piece of code, which was doing some simple customer validation and generating a GUID, was taking about 30% of the CPU. They were concerned that it would become a bottleneck, so they had come up with the idea that, as the application was very well modularised, they could put that code on a second server and so distribute the load. They knew all about scale out.
The problem was that when the load got to 50 users the network stack on the server overflowed and so the system crashed. They had been on to product support and got patches to increase the network stack size (something I didn’t even know you could do!) but of course that didn’t fix the problem. Because it seemed to be something in the network layer they had spent ages in network tuning, putting in faster Ethernets and hubs etc. They were now convinced that it was an OS problem and Windows wasn’t scalable so why didn’t I admit it and let the blame fall on MS.
This is not a great career move at Microsoft, and anyway I thought I knew what the problem was. I suggested a quick rebuild of the application with a simple change and then a retest whilst I went and moved the car (I had left it on double yellows). By the time I got back they had done the modifications, stress tested and were able to meet the 200 user criteria easily (which either shows how productive our platform is or how difficult it is to find a parking place in the UK!). Congratulations all round, techies treating me like a guru, senior managers fetching me coffee and a much relieved business manager who carried my bag to the car. Sometimes I love this job!
So four questions:
1 What was causing the problem?
2 What was the fix?
3 How should it have been architected for scalability in the first place?
4 Why do I hate marketing messages?
Answers in the feedback
1 The GUID code on the separate server was doing hardly any work, yet it was being called every time any customer access happened. It took a lot longer to get through the network stack than it did to run the code, so the network became the bottleneck.
2 To put the GUID code back on the business server.
3 The GUID code should have been put in the client.
4 The message "Just distribute the code and that gives you scalability using scale out" is not very smart.
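To make answer 1 concrete, here is a minimal sketch in Python (not the original application, which used a DCOM component). It compares generating an identifier in-process with making the same tiny call across a simulated network hop. The function names and the 1 ms round trip are assumptions chosen purely for illustration.

```python
import time
import uuid


def generate_id_local() -> str:
    """Generate the identifier in-process: no network involved."""
    return str(uuid.uuid4())


def generate_id_remote(round_trip_seconds: float = 0.001) -> str:
    """Hypothetical stand-in for a call to a second server.

    The sleep models the network round trip (an assumed 1 ms);
    the actual work done is identical to the local version.
    """
    time.sleep(round_trip_seconds)
    return str(uuid.uuid4())


def time_calls(fn, n: int = 200) -> float:
    """Return the wall-clock seconds taken to call fn n times."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return time.perf_counter() - start


local_seconds = time_calls(generate_id_local)
remote_seconds = time_calls(generate_id_remote)

print(f"local:  {local_seconds:.4f}s for 200 calls")
print(f"remote: {remote_seconds:.4f}s for 200 calls")
```

When the work per call is this small, the hop costs far more than the work itself, so moving the code to another box adds load to the network rather than removing load from the server.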
I don't envy your job Michael.. ;)
One thing I don't understand: in your post you said "I suggested a quick rebuild of the application with a simple change and then a retest...", but then you said that the solution was to move the GUID code back to the server. I guess the GUID code was in a DLL, and that DLL could have been moved to the business server without recompiling, couldn't it?
Why the need to recompile ?
It was actually a DCOM component.
Nice post. I've been in similar situations myself, though perhaps not as much of a do or die situation.
Do you think that their profiling was "wrong" since you said "GUID code on the separate server was hardly doing anything"??
This is a very good post Michael. We can learn a lot from your experiences like this. Keep bloggin...
Good example of expanding your view, especially w.r.t. premature optimization. It seems a little crazy to profile and observe a bottleneck, apply a remedy and not reprofile to see if there was any effect (good or bad).
I do disagree with your answer to 3. Depending on how important GUIDs are to your application you may not want the client giving you the GUID.
What would be the impact if the client always gave you the same GUID? Or if different clients gave you the same GUID?
How trustworthy is your client? Are you writing the client? Will it be put into the hands of someone who could gain by modifying the client (pleasure, money, fame...)?
Letting your client manage a core precondition (uniqueness) of your application is dangerous.
The profiling was correct, their solution was wrong.
The GUID was a UUID so unique. The clients were all terminals for internal telesales reps so inside the firewall and locked down.
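For a client less trusted than the locked-down terminals described here, one hedged middle ground is to let the client generate the identifier but have the server check it before use. The sketch below is an assumption for illustration only: `OrderStore` and `accept` are hypothetical names, and a real system would more likely rely on a unique constraint in the database than on an in-memory set.

```python
import uuid


class OrderStore:
    """Hypothetical in-memory store that vets client-supplied identifiers."""

    def __init__(self) -> None:
        # Identifiers already accepted by this store.
        self._seen = set()

    def accept(self, client_supplied_id: str) -> bool:
        """Return True if the id is a well-formed UUID not seen before."""
        try:
            parsed = uuid.UUID(client_supplied_id)
        except ValueError:
            return False  # not a well-formed UUID string
        if parsed in self._seen:
            return False  # duplicate: reject rather than trust the client
        self._seen.add(parsed)
        return True


store = OrderStore()
new_id = str(uuid.uuid4())  # generated on the client in this scenario
print(store.accept(new_id))        # first use of this id
print(store.accept(new_id))        # same id sent again
print(store.accept("not-a-uuid"))  # malformed input
```

The check costs one lookup on a call the server was already handling, so it keeps the identifier generation off the server without adding a network hop.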
A little knowledge is a dangerous thing.
The problem is that the world is littered with a little knowledge all over the place.
But always remember that in some areas you will be the one with a little knowledge :).
supposedly they moved the GUID code OFF the main server in the beginning... so they moved it back to the main and it fixed it?
what's going on here?
and how could 50 users' guid generating use up 30% of cpu?
i'm assuming the profiling was throwing FAR more than 50 users at the app... So I guess somebody said, "since the cpu will 30% spike with 100,000,000,xxx users, this could become a bottleneck?" this is bizarre thinking.
i guess i'm a little confused as to the solution, and why the guid was even moved to a different server in the first place...
I like the comment about a little knowledge:
"A little knowledge is dangerous.
the world is full of a lot of little knowledge."
And it's so SO true.
I wasn't quite clear when I wrote this: it was 30% of the CPU usage by the application, not of the total. The app under the profile test was only using about 30% of the total CPU, the rest being system idle.
Actually even though this was a simple app I think there were problems in the UUID code to get to this level of utilisation.
The thing I find, in my experience, is that if you had offered that suggestion at the beginning of the project they would have argued and argued that it wouldn't work. Only when customers get in a real bind do they really start to listen, because only at that point do they truly understand that they don't have all the right answers.