Ward Pond's SQL Server blog

Ruminating on issues pertinent to the design and development of sound databases and processes under Microsoft SQL Server 2008, SQL Server 2005, and SQL Server 2000 (while reserving the right to vent about anything else that's on my mind)

Database Programming: Game, Set, Match -- Why Writing Code In Transact-SQL Is Different Than Writing In A Compiled Language

More tidbits from an internal discussion. It'll take us a while to get to the answer to the titular question; hopefully you'll find this a worthy journey.

Mike asked, in a nutshell, why this batch is taking so long to run:

SET NOCOUNT ON
DECLARE @Rows AS INTEGER
SET @Rows = 1
WHILE @Rows < 1000000
BEGIN
    INSERT INTO SomeTable VALUES (NEWID())
    SET @Rows = @Rows + 1
END

Mike saw this batch taking over 15 minutes to run on his single-proc sandbox; while he was expecting it to be I/O bound, Profiler reported low I/O rates and high CPU utilization.

Several folks on the alias ran some benchmarks; I'll reproduce mine here.

The first question was: is this a flow of control issue?  Mike's syntax above ran in 5:08 (five minutes, eight seconds) on my sandbox (it appears, at minimum, that Mike needs a new sandbox).  In order to run this test, I simply commented out the INSERT statement:

SET NOCOUNT ON
DECLARE @Rows AS INTEGER
SET @Rows = 1
WHILE @Rows < 1000000
BEGIN
--  INSERT INTO SomeTable VALUES (NEWID())
    SET @Rows = @Rows + 1
END

This syntax ran in 0:02 on my sandbox (single 3.2 GHz CPU, no HT; 1 GB RAM).  So, flow of control isn't the issue.

My next test was to wrap Mike's existing code in a transaction:

SET NOCOUNT ON
DECLARE @Rows AS INTEGER
SET @Rows = 1
BEGIN TRAN
WHILE @Rows < 1000000
BEGIN
    INSERT INTO SomeTable VALUES (NEWID())
    SET @Rows = @Rows + 1
END
COMMIT TRAN

This code ran in 0:44, which suggests to me that a good deal of the overhead here comes from transaction control. The lesson: a single one-million-row transaction is far cheaper than one million single-row transactions.  This makes sense intuitively, as we're saving 999,999 COMMITs.

I then wondered what impact the function call was having on our scenario, so I replaced the NEWID() call with a literal:

SET NOCOUNT ON
DECLARE @Rows AS INTEGER
SET @Rows = 1
BEGIN TRAN
WHILE @Rows < 1000000
BEGIN
    INSERT INTO SomeTable VALUES ('1')
    SET @Rows = @Rows + 1
END
COMMIT TRAN

This code ran in 0:36, which suggests that the function call has some overhead, which seems intuitively reasonable.

At this point, along came Shaun with a technique discovered in Dr. SQL's blog (and originated by Joe Celko):

;WITH digits(i) AS (
    SELECT 1 AS i UNION ALL SELECT 2 UNION ALL SELECT 3
    UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
    UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
    UNION ALL SELECT 0
),
-- generate 1M rows, each with a unique row number, i
sequence(i) AS (
    SELECT d1.i + (10 * d2.i) + (100 * d3.i) + (1000 * d4.i)
         + (10000 * d5.i) + (100000 * d6.i)
    FROM digits AS d1, digits AS d2, digits AS d3,
         digits AS d4, digits AS d5, digits AS d6
)
SELECT NEWID() AS uuid
INTO SomeTable
FROM sequence

Well, folks, this little nugget ran in 0:03 on my sandbox.  Three seconds!  One-twentieth of a minute.

NOW we're finally ready to answer the titular question, because this insight provides a stark example of why programming in T-SQL is different than a typical 3GL or 4GL.

nGLs are procedural animals.

T-SQL is a set-based animal.

This example shows that, in T-SQL, no matter what sort of gyrations you go through to generate a set of data, you'll get better performance working in a set-based metaphor than you will in a procedural metaphor.

By building a set of one million unique integers rather than counting to one million one-at-a-time procedurally, we introduce an order-of-magnitude improvement over our best procedural alternative.  Shaun's CTE avoids 2M statement executions (1M INSERTs and 1M SETs).
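The place-value arithmetic behind Shaun's CTE is easy to verify outside of SQL Server. Here's a quick sketch (in Python, purely illustrative, not part of the original discussion) showing that cross-joining six copies of the digits 0-9 and summing the place values yields exactly one million distinct integers, 0 through 999,999:

```python
from itertools import product

# Six copies of the digits 0-9, cross-joined -- the same shape as the
# six-way self-join of the "digits" CTE in Shaun's query.
digits = range(10)
sequence = {
    d1 + 10 * d2 + 100 * d3 + 1000 * d4 + 10000 * d5 + 100000 * d6
    for d1, d2, d3, d4, d5, d6 in product(digits, repeat=6)
}

print(len(sequence))                 # 1000000 distinct values...
print(min(sequence), max(sequence))  # ...covering 0 through 999999
```

Because each of the six digit positions is independent, every integer in [0, 999999] is produced exactly once; that's why the CTE's cross join needs no DISTINCT.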

To a T-SQL programmer who groks set-based thinking, Shaun's syntax has that special sparkle of brilliance about it.  To a practitioner of procedural coding, the approach might seem bizarre, or at the least contrived.

Well, back last week when we were talking about FOR XML EXPLICIT syntax, I wrote:

It seems to me that any syntax that produces the desired result with a close-to-optimal query plan is "proper", and that any further fine tuning devolves quickly into the relative merits of personal style.

Not only do I not find Shaun's code contrived in the least; when I filter it through the statement above, I find it to be the one true path in this scenario, in that it leverages T-SQL's strength in handling sets of data.  Its query plan is far closer to optimal than any of those we generated procedurally, and therein lies the rub.

For my money, this is why good nGL programmers (generally) make bad T-SQL programmers, and vice-versa.  It's very difficult (painful at times in my case) to veer between those two worlds.

So, since this is, after all, a SQL Server blog, I say (with a respectful tip of the hat to our procedural brethren), "Long live the set-based thinkers!"

Do you think you can get 1M distinct GUIDs into a table in under three seconds?  Drop me a line with your syntax, or any other comments you might have.

     -wp

Comments
  • Hi Ward,

    Yep, I think I can do faster.

    Five subsequent executions of your code on my sandbox, all with clean cache and buffers, took 890, 906, 940, 1030, and 906 ms respectively. (I discarded the very first execution; that one took about 3 seconds, but I think that the database files had to grow).

    I then tested the code posted below and got 813, 796, 810, 1000, and 846 ms on 5 subsequent executions. I can't really explain the one with 1000 ms - maybe CPU or I/O conflicts with one of the other processes running on my box?

    Anyway, here's the code. Note that I used the same "surrounding" code to measure the performance of both your version and mine. Also note that I tested on an empty database; if you have a database with at least 1,000 rows in sysobjects, you'll only need two cross-joins of said table.

    drop table SomeTable;
    go
    checkpoint;
    dbcc freeproccache;
    dbcc dropcleanbuffers;
    go
    declare @start datetime;
    set @start = getdate();
    select top (1000000) newid() as uuid
    into SomeTable
    from sysobjects a, sysobjects b, sysobjects c, sysobjects d;
    select datediff(ms, @start, getdate());
    go
    select count(*) from SomeTable;

    Best regards,

    Hugo Kornelis

