Microsoft.com Operations

We are the operations team that runs the Microsoft.com sites.

Comparing Trends in Unmatched Timed Data

Comparing Trends in Unmatched Timed Data

  • Comments 1
  • Likes

Introduction – Maybe You Have Been There Too

Ok, this could be another pretty dry topic, but here we go anyway. More programmer talk, which can put normal people to sleep.  J  A while ago I came up with a programmatic solution to a challenge I had faced many times. I imagine this to be a common challenge, worthy for comparing notes.

Behind the scenes at Microsoft.com, we commonly need to compare multiple data series’, over given time windows, to see if their trends correlate at all. For example, maybe you would like to compare network traffic throughput, on a group of web servers, with the number of HTTP get requests coming in per second on same set of servers, for a particular time frame. Let’s say the traffic throughput readings are taken every twenty-five seconds, but the get requests per second data you have is sampled once per minute. You have the raw data, date-time stamps with values, and need to compare them.

You can take both data series into Microsoft Excel, make two graphs, place one above the other, and visually compare them to see how their trends line up. But what if you want to repeat this process every month? What if you eventually end up comparing five different sets of data? And how about plotting all of the data on a single graph, instead of trying to “eyeball” a stack of graphs? You need some slick automation!

I bet you could use Excel macro programming to code up some help. And I have seen some Microsoft SQL Server Transact-SQL code that helps. However, I like to code in C#. So what follows is a description of how I automated part of this process, using C# and the .NET Framework.

The Challenge

I found the fundamental challenge to be dealing with data of differing time intervals, and maybe sometimes the interval varies slightly in the same series. To plot all of the data on one graph, on the same x-axis over time, with all data points on the exact same time intervals, takes some interpolation. Given an arbitrary standard new time interval to plot all data against, any data series that has points sampled less frequently will need more points inserted to fill-in the required standard points. Any series with more-frequent points will end up with less points plotted on the imposed standard scale. And given that it would be unlikely that any real data series would have times that exactly match arbitrary new time points, almost all plotted points would be interpolated new points.

My Solution

There must be numerous methods to interpolate the new points based on the original points. One important choice is how the new points will be related to the original points. A relatively simple algorithm for generating the new points uses linear interpolation. Using linear interpolation, each new point is placed along imaginary straight lines connecting the original points. The new points are dropped on the connecting lines at the points where imaginary vertical lines, representing the new standard intervals, intersect those connecting lines.

Other more complex algorithms might employ moving averages or maybe exponential smoothing, and could better represent some data. But I think the simpler linear interpolation is good enough to represent basic trends in my data.

Now, even if linear interpolation suffices for your data, you still have some options for generating the new data. One efficient algorithm might involve keeping position pointers on both the new and original data lists. A pointer in the original list would be maintained to always point to the member with a timestamp immediately preceding the current position in the new points list. Using the new timestamp list as a driver, when a value needs to be interpolated, the pointer in the original list could be used to find the two original points on either side of the new point. These two points define the straight line used to calculate the new point value.

However, I think I have come up with a possibly less-efficient, but simpler scheme to interpolate the new points. Using the new list of timestamps as the driver, I employ the .NET DataSet to locate the two original points on either side of each new point. With the original points stored in a DataTable within a DataSet, the DataTable.Select method can be used to readily determine the two original points. Here’s how I did this part:

//get previous data point before the interval mark

SelectExpression = "date <= #" + AddDateKey.ToString() + "#";

SelectSortOrder = "date DESC";

PointRowsBefore = PointSetTable.Select(SelectExpression, SelectSortOrder);

if (PointRowsBefore.Length > 0)

{

    PrevDate = DateTime.Parse(Convert.ToString(PointRowsBefore[0].ItemArray[0]));

    PrevValue = float.Parse(PointRowsBefore[0].ItemArray[1].ToString());

}

 

Once those two original points on either side of a new point are determined, interpolating the value for the new point is mostly straightforward. Relative to the original point preceding the interpolated point, I calculate a delta value to add to the original point value. This delta is a portion of the total difference between the two surrounding original point values. The portion is based on the percentage of how far across the gap in time the interpolated point is, from the preceding original point to the next original point.

    AddDelta = float.Parse(Convert.ToString(Convert.ToDecimal((AddDateKey.Subtract(PrevDate)).Ticks) / Convert.ToDecimal((NextDate.Subtract(PrevDate)).Ticks))) * (NextValue - PrevValue);

 

Example

Maybe a simplified example might help illustrate all of this. Let’s say you wanted to compare the prices of two stocks over the same time window, to see if the two prices were moving up and down in a similar way. For this example, we will consider two fictitious stocks, “A” and “B”. Here are the data values gathered for these two stocks in a ten minute span:

Time

8:00:00

8:01:00

8:02:00

8:03:00

8:04:00

8:05:00

8:06:00

8:07:00

8:08:00

8:09:00

Stock Price A

20

20.12

20.3

20.67

20.49

20.35

20.56

20.75

20.91

21.3

Time

8:00:20

8:02:18

8:04:22

8:06:21

8:08:19

Stock Price B

10.75

11.11

10.6

10.8

11.5

 

As you can see, the times don’t match up. Stock A prices are listed on the exact minute, but stock B prices were sampled about every two minutes. If you graphed these two series in Excel, it would look close to:

StockA 

 

These two stocks look like they are moving in a similar way, but let’s get a better idea by using the linear interpolation described earlier. After running my custom interpolation code against these two data series, imposing a standard one minute interval, I end up these two data sets:

Time

8:00

8:01

8:02

8:03

8:04

8:05

8:06

8:07

8:08

8:09

Stock A

20

20.1

20.3

20.67

20.49

20.35

20.56

20.8

20.91

21.3

Stock B

10.9

11.06

10.94

10.69

10.664

10.765

11

11.39

 

Now the times are all lined up, so we can plot these values on the same graph:

 

You can kind of see the trends, but the lines are too flat since the data values are far apart. To remedy this, we can artificially amplify the stock B prices to bring the two sets closer. A simple doubling of the values works pretty well, ending up with a new plot:

 

Now we can finally clearly see the relationships between these two stock prices. The stock B line should look similar to the original stock B line, but with more value changes. I included code listing below to give you more details on how I created this example, and to show you exactly how I do the interpolations. 

Conclusion

Using ideas from this article as a basis, you can develop your own code to normalize multiple time-based data sets, so you can plot them together to show how the data correlates.  You might try computing the moving averages, and you could explore more efficient algorithms. Either way, this example should help you get off to a good start.

Sample Program Listing

static void Main(string[] args)

{

    //define the input data series'

    SortedList StockPriceA = new SortedList(10);

    StockPriceA.Add(new DateTime(2007, 5, 30, 8, 0, 0), 20);

    StockPriceA.Add(new DateTime(2007, 5, 30, 8, 1, 0), 20.12);

    StockPriceA.Add(new DateTime(2007, 5, 30, 8, 2, 0), 20.3);

    StockPriceA.Add(new DateTime(2007, 5, 30, 8, 3, 0), 20.67);

    StockPriceA.Add(new DateTime(2007, 5, 30, 8, 4, 0), 20.49);

    StockPriceA.Add(new DateTime(2007, 5, 30, 8, 5, 0), 20.35);

    StockPriceA.Add(new DateTime(2007, 5, 30, 8, 6, 0), 20.56);

    StockPriceA.Add(new DateTime(2007, 5, 30, 8, 7, 0), 20.75);

    StockPriceA.Add(new DateTime(2007, 5, 30, 8, 8, 0), 20.91);

    StockPriceA.Add(new DateTime(2007, 5, 30, 8, 9, 0), 21.3);

    SortedList StockPriceB = new SortedList(5);

    StockPriceB.Add(new DateTime(2007, 5, 30, 8, 0, 20), 10.75);

    StockPriceB.Add(new DateTime(2007, 5, 30, 8, 2, 18), 11.11);

    StockPriceB.Add(new DateTime(2007, 5, 30, 8, 4, 22), 10.6);

    StockPriceB.Add(new DateTime(2007, 5, 30, 8, 6, 21), 10.8);

    StockPriceB.Add(new DateTime(2007, 5, 30, 8, 8, 19), 11.5);

 

    //set desired date range and interval to plot

    DateTime StartDate = new DateTime(2007, 5, 30, 7, 59, 0);

    DateTime EndDate = new DateTime(2007, 5, 30, 8, 10, 0);

    TimeSpan NormalInterval = new TimeSpan(0, 1, 0);

    //normalize both data series' to interpolate data points to the desdired plot

    SortedList StockPriceANormal = Normalize(StartDate, EndDate, StockPriceA, NormalInterval, 1);

    SortedList StockPriceBNormal = Normalize(StartDate, EndDate, StockPriceB, NormalInterval, 2);

 

    //display the data before it was normalized

    foreach (DictionaryEntry de in StockPriceA) Debug.Print(de.Key + "," + de.Value);

    foreach (DictionaryEntry de in StockPriceB) Debug.Print(de.Key + "," + de.Value);

    //display the data after it was normalized

    foreach (DictionaryEntry de in StockPriceANormal) Debug.Print(de.Key + "," + de.Value);

    foreach (DictionaryEntry de in StockPriceBNormal) Debug.Print(de.Key + "," + de.Value);

}

 

private static SortedList Normalize(DateTime StartDate, DateTime EndDate, SortedList DataSeries, TimeSpan NormalInterval, Double AmpFactor)

{

    if (AmpFactor <= 0) AmpFactor = 1;

 

    DateTime PrevDate = DateTime.MinValue;

    float PrevValue = float.NegativeInfinity;

    DateTime NextDate = DateTime.MinValue;

    float NextValue = float.NegativeInfinity;

    float AddValue = float.NegativeInfinity;

    float AddDelta = float.NegativeInfinity;

    int PointSub = 0;

    DataRow[] PointRowsBefore;

    DataRow[] PointRowsAfter;

    string SelectExpression = string.Empty;

    string SelectSortOrder = string.Empty;

    DateTime AddDateKey = StartDate;

 

    //define output data list

    TimeSpan DataTimeSpan = EndDate.Subtract(StartDate);

    long NormalPoints = DataTimeSpan.Ticks / NormalInterval.Ticks;

    if (NormalPoints > int.MaxValue) NormalPoints = (long)int.MaxValue;

    int NormalPointCount = Convert.ToInt32(NormalPoints);

    SortedList sl = new SortedList(NormalPointCount);

    //convert the input sorted list into a data.datatable

    DataTable PointSetTable = ConvertListToDataTable(DataSeries);

    //use the normalized date points to drive interpolation

    for (PointSub = 0; PointSub < NormalPointCount; PointSub++)

    {

        AddDateKey = AddDateKey.Add(NormalInterval);

        PrevDate = DateTime.MinValue;

        PrevValue = float.NegativeInfinity;

        NextDate = DateTime.MinValue;

        NextValue = float.NegativeInfinity;

        //get previous data point before the interval mark

        SelectExpression = "date <= #" + AddDateKey.ToString() + "#";

        SelectSortOrder = "date DESC";

        PointRowsBefore = PointSetTable.Select(SelectExpression, SelectSortOrder);

        if (PointRowsBefore.Length > 0)

        {

            PrevDate = DateTime.Parse(Convert.ToString(PointRowsBefore[0].ItemArray[0]));

            PrevValue = float.Parse(PointRowsBefore[0].ItemArray[1].ToString());

        }

        //get next data point after the interval mark

        SelectExpression = "date >= #" + AddDateKey.ToString() + "#";

        SelectSortOrder = "date ASC";

        PointRowsAfter = PointSetTable.Select(SelectExpression, SelectSortOrder);

        if (PointRowsAfter.Length > 0)

        {

            NextDate = DateTime.Parse(Convert.ToString(PointRowsAfter[0].ItemArray[0]));

            NextValue = float.Parse(PointRowsAfter[0].ItemArray[1].ToString());

        }

        /*

        interpolate a new point using a linear trend line

        between the two points before and after the new point

        */

        if (PointRowsBefore.Length > 0 && PointRowsAfter.Length > 0)

        {

            if (Convert.ToDecimal((AddDateKey.Subtract(PrevDate)).Ticks) != 0)

                AddDelta = float.Parse(Convert.ToString(Convert.ToDecimal((AddDateKey.Subtract(PrevDate)).Ticks) / Convert.ToDecimal((NextDate.Subtract(PrevDate)).Ticks))) * (NextValue - PrevValue);

            else

                AddDelta = 0;

            AddValue = PrevValue + AddDelta;

            if (AmpFactor != 1) AddValue = AddValue * (float)AmpFactor;

            sl.Add(AddDateKey, AddValue);

        }

    }

 

    return sl;

}

 

private static DataTable ConvertListToDataTable(SortedList DataSeries)

{

    DataTable dt = new DataTable();

    DataColumn dc;

    DataRow dr;

    // create date column   

    dc = new DataColumn();

    dc.DataType = System.Type.GetType("System.DateTime");

    dc.ColumnName = "date";

    dt.Columns.Add(dc);

    // create value column

    dc = new DataColumn();

    dc.DataType = Type.GetType("System.Double");

    dc.ColumnName = "value";

    dt.Columns.Add(dc);

    DateTime AddDate = DateTime.MinValue;

    string AddDateString = string.Empty;

    // create new DataRow objects and add to DataTable 

    for (int i = 0; i < DataSeries.Capacity; i++)

    {

        dr = dt.NewRow();

        AddDate = (DateTime)DataSeries.GetKey(i);

        AddDateString = AddDate.ToString();

        dr["date"] = AddDate;

        dr["value"] = DataSeries[AddDate];

        dt.Rows.Add(dr);

    }

    return dt;

}

 

Comments
  • I question the method of doubling the values in stock B to compare trend to stock A. If you double the values of stock B, you also double the rate of change between those values. dy/dx(x)=1, but dy/dx(2x)=2 for all x. Doubling the values to make rate of change (derivative) comparison is not a valid comparison; stock B in your last graph above is twice as curvy as it should be.

    Instead, add a constant to move the graph of one stream up or down relative to another.

    -john

Your comment has been posted.   Close
Thank you, your comment requires moderation so it may take a while to appear.   Close
Leave a Comment