Microsoft.com Operations

We are the operations team that runs the Microsoft.com sites.

Systems Engineering Architecture Consultation… “Help Us to Help You!”


MSCOM Operations gets lots of requests from both internal and external customers about how we operate www.microsoft.com, Microsoft Update, and the Microsoft Download Center (just to name a few). Those customers ask about a wide variety of topics that we may be able to help them with, such as the best practices we use in rolling out new technologies like Windows 2008 and IIS 7.0 to our production web environment, or how we use peer-to-peer replication in our SQL topologies. Sometimes they just want to chew the fat with fellow Systems Engineers about the common problems that we all face as SEs.

 

These customer interactions are one of the best parts of working in MSCOM Operations. We truly learn as much in these customer engagements as (hopefully) our customers learn from us. To help us target these discussions, we have created the following Infrastructure Architecture Straw Man. We provide this to customers who have pending engagements with us to try to get a sense of what their environments look like. This really helps us put the right Subject Matter Expert (SME) from our team in front of the customer, with a good idea of what direction the discussion is likely to go.

 

Hopefully this will be of some use to you as well.

 

Infrastructure Architecture Questions/Topics Straw Man

This is intended to provide a list of topics to address with customers to assist in providing infrastructure architectural guidance.

The first question to ask is what business problems need to be solved: the “must haves”. These then need to be prioritized. The infrastructure architecture can be very different depending on the result of this prioritization.

Next, try to get a data flow diagram. Where are the calls coming from, what are the expected results from those calls, and what components will need to be touched for each call? This will also help to flesh out the infrastructure.

Then try to get a high-level diagram of the number of hosting locations/data centers, server clusters, etc., which may or may not be known. Also try to determine up front the target audience (public internet, corporate intranet users, partner extranet) and the approximate number of end users.

Finally, get a list of requirements the customer thinks they need, then ensure that they understand the ramifications of those requirements. Example: Customer: “We need 5 nines availability.” Architect: “Great, please understand that equates to 25.9 sec of downtime per month.”
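The downtime that an availability target allows is simple arithmetic, and it can help to have it on hand during the conversation. Here is a minimal sketch in Python; the function names `downtime_seconds` and `humanize` are ours for illustration, and the 30-day month is an assumption (it matches the monthly figures in the Availability table below).

```python
# Downtime budget implied by an availability target.
# Assumption: a month is 30 days, which reproduces the table's monthly column.
SECONDS_PER_PERIOD = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,  # assumed 30-day month
    "week": 7 * 24 * 3600,
}


def downtime_seconds(availability_percent: float, period: str) -> float:
    """Return the allowed downtime, in seconds, for the given period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * SECONDS_PER_PERIOD[period]


def humanize(seconds: float) -> str:
    """Format a duration in the largest unit that keeps the value >= 1."""
    for unit, size in (("days", 86400), ("hours", 3600), ("min", 60)):
        if seconds >= size:
            return f"{seconds / size:.3g} {unit}"
    return f"{seconds:.3g} sec"


# Five nines over an (assumed) 30-day month:
print(humanize(downtime_seconds(99.999, "month")))  # 25.9 sec
```

Running the same call with 99.9999 shows how each extra nine cuts the budget by a factor of ten, which is usually the point where the cost discussion starts.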

Below is a non-prioritized list of requirement topics.

1.     Availability

a.     Defined as providing the required functional benefit to the users, not simply an “a server is up” metric.

Availability %    Downtime per year    Downtime per month*    Downtime per week
98%               7.30 days            14.4 hours             3.36 hours
99%               3.65 days            7.20 hours             1.68 hours
99.5%             1.83 days            3.60 hours             50.4 min
99.9%             8.76 hours           43.2 min               10.1 min
99.99%            52.6 min             4.32 min               1.01 min
99.999%           5.26 min             25.9 sec               6.05 sec
99.9999%          31.5 sec             2.59 sec               0.605 sec

* The monthly figures correspond to a 30-day month.

b.    Each availability level comes with an associated cost.

2.     Performance

a.     Web Site

                                          i.    Size of the page

                                         ii.    Number of calls to render the page

                                        iii.    Type of calls (http, https etc.)

b.    Web Services

                                          i.    Will this be used to call SQL

c.     Back End (SQL)

d.    Scalability

                                          i.    100s, 1,000s, or 1,000,000 requests/sec

e.     Performance testing – what is the acceptable

3.     Monitoring

a.     What needs to be monitored

                                          i.    Servers

                                         ii.    Applications

                                        iii.    Web services

                                        iv.    Connectivity

                                         v.    Events collected by monitoring should be “actionable”

4.     Manageability

a.     Directly related to Ops costs (people are the most expensive component)

b.    Simple cookie-cutter vs. complex one-offs

5.     Scale

a.     Firewall/DMZ

b.    Bandwidth

c.     Server specific

                                          i.    Global Load Balancing

                                         ii.    Local Load Balancing

6.     Content Distribution

a.     Content delivery networks (CDN)

b.    Caching strategies

c.     Nature of content

                                          i.    Static

                                         ii.    Dynamic

                                        iii.    Make up

                                                          1.    JPEGs

                                                          2.    GIFs

                                                          3.    Flash

                                                          4.    Silverlight

7.     Performance

a.     Geo Location (high latency links)

b.    Server resource utilization under various loads (it is always good to load-test a server until it starts to fail, to establish its maximum capacity levels)

                                          i.    Running “cool”, i.e. CPU 40%, memory (depends on server role)

                                                          1.    Determine load thresholds

                                         ii.    Running “normal”, i.e. CPU sustained 60-75%, memory (depends on server role)

                                                          1.    Determine load thresholds

                                        iii.    Running “hot”, i.e. CPU sustained 90%, memory (depends on server role)

                                                          1.    Determine load thresholds

8.     Security

a.     Application security

b.    Infrastructure security (DDoS, Intrusion detection, etc.)

                                          i.    Firewalls

                                         ii.    DMZ

                                        iii.    Router ACLs

9.     Traffic Analysis

a.     Volume of traffic

                                          i.    Requests/sec

b.    Geo location of traffic

c.     Nature of traffic

                                          i.    ASP.net, etc.

10.  Geo-location of traffic

a.     Type of client requests

                                          i.    For static content

                                         ii.    For dynamic content

11.  Network (Frontend)

a.     Latency concerns

b.    Geo-redundancy needs

12.  Network Backend

a.     Latency concerns

b.    Geo-redundancy needs

13.  Virtualization

a.     Clients/Application Servers

b.    Storage

c.     Applications

14.  Identity Management/Authentication

a.     Windows auth

b.    Certificate based

c.     Protecting Personally Identifiable Information (PII)

d.    Cookies
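
The “cool”, “normal”, and “hot” bands in topic 7 can be turned into a simple check against collected CPU samples. What follows is a rough sketch only: the outline gives about 40% for cool, a sustained 60-75% for normal, and a sustained 90% for hot, but it does not say where one band ends and the next begins. The cut points `NORMAL_FLOOR` and `HOT_FLOOR`, the function name `classify_cpu_load`, and the use of a window average to stand in for “sustained” are all assumptions made here for illustration. Memory is left out because the outline says it depends on the server role.

```python
from statistics import mean
from typing import Sequence

# Hypothetical cut points between the bands (assumptions, not from the outline):
# anything below NORMAL_FLOOR counts as cool, anything at or above HOT_FLOOR as hot.
NORMAL_FLOOR = 60.0
HOT_FLOOR = 90.0


def classify_cpu_load(samples: Sequence[float]) -> str:
    """Classify a window of CPU utilization samples (percent) as cool, normal, or hot."""
    if not samples:
        raise ValueError("need at least one CPU sample")
    # "Sustained" load is approximated by the average over the sample window.
    sustained = mean(samples)
    if sustained >= HOT_FLOOR:
        return "hot"
    if sustained >= NORMAL_FLOOR:
        return "normal"
    return "cool"


print(classify_cpu_load([38, 42, 40]))  # cool
print(classify_cpu_load([65, 70, 72]))  # normal
print(classify_cpu_load([91, 95, 93]))  # hot
```

The real thresholds should come out of load testing each server role, as the outline suggests, and not from fixed constants like these.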

 
