MSCOM Operations gets lots of requests from both internal and external customers about how we operate www.microsoft.com, Microsoft Update, and the Microsoft Download Center (just to name a few). Those customers ask about a wide variety of topics that we may be able to help them with, such as the best practices we use in rolling out new technologies like Windows Server 2008 and IIS 7.0 to our production web environment, or how we use Peer-to-Peer replication in our SQL topologies. Sometimes they just want to chew the fat with fellow System Engineers about the common problems that we all face as SEs.
These customer interactions are one of the best parts of working in MSCOM Operations. We truly learn as much in these customer engagements as (hopefully) our customers learn from us. To help us target these discussions, we have created the following Infrastructure Architecture Straw Man. We provide this to customers who have pending engagements with us to try to get a sense of what their environments look like. This really helps us put the right Subject Matter Expert (SME) from our team in front of the customer, with a good idea of what direction the discussion is likely to go.
Hopefully this will be of some use to you as well.
Infrastructure Architecture Questions/Topics Straw Man
This is intended to provide a list of topics to address with customers to assist in providing infrastructure architectural guidance.
The first question to ask is which business problems need to be solved: the “must haves”. These then need to be prioritized. The infrastructure architecture can be very different depending on the result of this prioritization.
Next, try to get a data flow diagram. Where are the calls coming from? What are the expected results from those calls? Which components will need to be touched for each call? This will also help to flesh out the infrastructure.
Then try to get a high-level diagram of the number of hosting locations/data centers, server clusters, etc., some of which may not yet be known. Also try to determine up front the target audience (public internet, corporate intranet users, extranet partners) and the approximate number of end users.
Finally, get a list of the requirements the customer thinks they need, then ensure that they understand the ramifications of those requirements. Example: Customer: “We need 5 nines availability.” Architect: “Great, please understand that equates to about 26 seconds of downtime per month.”
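The arithmetic behind the five-nines example can be checked with a short calculation. The sketch below is only an illustration in Python, not part of the original straw man; the function names and the choice of a 30-day month and a 365-day year are assumptions made for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Period lengths in days. The 30-day month and 365-day year are assumptions.
PERIODS = {"year": 365, "month": 30, "week": 7}


def allowed_downtime_seconds(availability_percent, period_days):
    """Return the downtime budget in seconds for one period."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * SECONDS_PER_DAY


def downtime_table(availability_percent):
    """Return the downtime budget per year, month, and week, in seconds."""
    return {
        name: allowed_downtime_seconds(availability_percent, days)
        for name, days in PERIODS.items()
    }


if __name__ == "__main__":
    for name, seconds in downtime_table(99.999).items():
        print(f"Downtime per {name}: {seconds:.2f} s")
```

Under these assumptions, 99.999% availability works out to roughly 26 seconds per month, about 6 seconds per week, and a little over 5 minutes per year.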
Below is a non-prioritized list of requirement topics.
a. Defined as providing the required functional benefit to the users, not simply a “server is up” metric.
Downtime per year
Downtime per month*
Downtime per week
b. Each one comes with an associated cost.
a. Web Site
i. Size of the page
ii. Number of calls to render the page
iii. Type of calls (http, https etc.)
b. Web Services
i. Will this be used to call SQL
c. Back End (SQL)
i. 100s, 1,000s, or 1,000,000 req/sec
e. Performance testing – what is the acceptable
a. What needs to be monitored
iii. Web services
v. Events collected by monitoring should be “actionable”
a. Directly related to Ops costs (people are the most expensive component)
b. Simple cookie-cutter vs. complex one-offs
c. Server specific
i. Global Load Balancing
ii. Local Load Balancing
6. Content Distribution
a. Content delivery networks (CDN)
b. Caching strategies
c. Nature of content
iii. Make up
a. Geo Location (high latency links)
b. Server resource utilization under various loads (it is always good to stress-test a server until it starts to fail, to establish its maximum capacity)
a. Running “cool”, i.e. CPU 40%, memory (depends on server role)
i. Determine load thresholds
b. Running “normal”, i.e. CPU sustained 60-75%, memory (depends on server role)
i. Determine load thresholds
c. Running “hot”, i.e. CPU sustained 90%, memory (depends on server role)
a. Application security
b. Infrastructure security (DDoS, Intrusion detection, etc.)
iii. Router ACLs
9. Traffic Analysis
a. Volume of traffic
b. Geo location of traffic
c. Nature of traffic
i. ASP.NET, etc.
10. Geo-location of traffic
b. Type of client requests
i. For static content
ii. For dynamic content
11. Network (Frontend)
a. Latency concerns
b. Geo-redundancy needs
12. Network Backend
c. Clients/Application Servers
14. Identity Management/Authentication
a. Windows auth
b. Certificate based
c. Protecting Personally Identifiable Information (PII)
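The “cool”, “normal”, and “hot” utilization bands listed under server resource utilization can be turned into a simple check against collected CPU samples. This is a rough sketch only: the list gives representative values (about 40%, 60-75%, and 90%) rather than exact cut-offs, so the boundaries and the function name below are assumptions, and memory is left out because it depends on the server role.

```python
from statistics import mean

# Band boundaries are assumptions; the list gives only representative values
# (about 40% for cool, 60-75% for normal, 90% for hot).
NORMAL_FLOOR = 60.0  # sustained CPU % at or above this counts as "normal"
HOT_FLOOR = 90.0     # sustained CPU % at or above this counts as "hot"


def classify_cpu_load(samples):
    """Classify a window of CPU utilization samples (percent) into a band."""
    if not samples:
        raise ValueError("at least one CPU sample is required")
    # "Sustained" is approximated here by the average over the window.
    sustained = mean(samples)
    if sustained >= HOT_FLOOR:
        return "hot"
    if sustained >= NORMAL_FLOOR:
        return "normal"
    return "cool"


if __name__ == "__main__":
    print(classify_cpu_load([38, 41, 40]))
    print(classify_cpu_load([62, 70, 74]))
    print(classify_cpu_load([91, 93, 95]))
```

Averaging over a window rather than reacting to single samples keeps a short spike from being reported as a sustained “hot” condition. The right window length and thresholds would come from load testing each server role.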