This post will close the loop on the "discredited" prime number syntax tangent, the code for which was first posted here and subsequently modified here (and briefly referenced here).
When we last left this discussion, I'd abandoned this approach as a contender for first place in Denis' Prime Number Challenge because the approaches Denis and Hugo used, both based on the Sieve of Eratosthenes, were fundamentally less computation-intensive. However, I had a feeling that this code should run way faster than the couple of hours I was anticipating for the last version, and I've indeed got two interesting versions of this syntax that, under our current rules of engagement, run in under 30 seconds.
Since the last post on this topic, I've made the following modifications:
This last point bears a little discussion, and was the primary performance issue with the previous version of the code. There's an old fable about a group of people running away from a hungry bear. The key to survival in this scenario is not to be the fastest human in the pack, but rather to not be the slowest individual.
Similarly, we don't need to build a list of factors for the numbers that aren't prime numbers; we just need to diagnose their lack of primacy. So, if a particular number has any factors greater than one but less than or equal to its square root, it's not a prime number. All of this hopefully makes good sense.
Now, To The Interesting Part..
I then coded up two alternatives for populating ComparisonMatrix: a cursor loop and a straight insert (in order to render this post easier to deal with, I've placed the code that builds and populates all the tables except ComparisonMatrix in a comment to this post here):
-- keep some house for what comes.. DECLARE @Now DATETIME, @PreviousInterval DECIMAL(12,2), @Interval DECIMAL(12,2), @Counter int, @CandidateValue int -- log a status message SELECT GETDATE() AS [Processing: INSERTing into ComparisonMatrix..] -- set timer SET @Now = GETDATE() -- initalize a counter SET @Counter = 1 -- declare a cursor to traverse the Candidate table DECLARE TraverseCandidateTable CURSOR FAST_FORWARD READ_ONLY FOR SELECT CandidateValue FROM Candidate ORDER BY CandidateValue -- open the cursor and get the first record OPEN TraverseCandidateTable FETCH NEXT FROM TraverseCandidateTable INTO @CandidateValue -- log a status message SELECT GETDATE() AS [Cursor Loop Starts..] BEGIN TRAN -- start a loop which will run as long as the cursor finds a record WHILE @@FETCH_STATUS = 0 BEGIN INSERT ComparisonMatrix ( Base, Factor ) SELECT @CandidateValue, Id FROM SetBuilder WHERE Id <= SQRT(@CandidateValue) AND @CandidateValue % Id = 0 UNION ALL SELECT @CandidateValue, @CandidateValue -- log a database commit and a status message and start a new transaction -- if it's been 20000 transactions since we did so -- (I experimented with batch sizes of 5K, 10K, 20K, 25K, and 50K; -- 20K provided the best performance on my sandbox) IF @Counter % 20000 = 0 BEGIN COMMIT TRAN SET @Interval = DATEDIFF (ss, @now, GETDATE()) SELECT GETDATE() AS [Processing: Status Datetime..], @Counter AS [Current Count..], @CandidateValue AS [Current Candidate..], @Interval AS [ElapsedSecondsThisBatch..], CASE WHEN @PreviousInterval = 0 THEN 100 ELSE (@Interval/@PreviousInterval) * 100 END AS [PercentPerformanceOfPreviousBatch] -- initialize loop variables for the next batch SET @Now = GETDATE() SET @PreviousInterval = @Interval BEGIN TRAN END -- get the next candidate value FETCH NEXT FROM TraverseCandidateTable INTO @CandidateValue -- increment the final transaction SET @Counter = @Counter + 1 END -- we're out of the loop; -- we still need to log one more status record and commit the final transaction SET @Interval = DATEDIFF (ss, @now, GETDATE()) SELECT GETDATE() AS [Processing: Status Datetime..], @Counter AS [Current Count..], @CandidateValue AS [Current Candidate..], @Interval AS [ElapsedSecondsThisBatch..], CASE WHEN @PreviousInterval = 0 THEN 100 ELSE (@Interval/@PreviousInterval) * 100 END AS [PercentPerformanceOfPreviousBatch] COMMIT TRAN -- clost and deallocate the cursor CLOSE TraverseCandidateTable DEALLOCATE TraverseCandidateTable -- log a status message SELECT GETDATE() AS [Cursor Loop Complete..]
-- keep some house for what comes..
DECLARE @Now DATETIME,
@PreviousInterval DECIMAL(12,2),
@Interval DECIMAL(12,2),
@Counter int,
@CandidateValue int
-- log a status message
SELECT GETDATE() AS [Processing: INSERTing into ComparisonMatrix..]
-- set timer
SET @Now = GETDATE()
-- initalize a counter
SET @Counter = 1
-- declare a cursor to traverse the Candidate table
DECLARE TraverseCandidateTable CURSOR FAST_FORWARD READ_ONLY FOR
SELECT CandidateValue
FROM Candidate
ORDER BY CandidateValue
-- open the cursor and get the first record
OPEN TraverseCandidateTable
FETCH NEXT
FROM TraverseCandidateTable
INTO @CandidateValue
SELECT GETDATE() AS [Cursor Loop Starts..]
BEGIN TRAN
-- start a loop which will run as long as the cursor finds a record
WHILE @@FETCH_STATUS = 0
BEGIN
INSERT ComparisonMatrix (
Base,
Factor
)
SELECT @CandidateValue,
Id
FROM SetBuilder
WHERE Id <= SQRT(@CandidateValue)
AND @CandidateValue % Id = 0
UNION ALL
@CandidateValue
-- log a database commit and a status message and start a new transaction
-- if it's been 20000 transactions since we did so
-- (I experimented with batch sizes of 5K, 10K, 20K, 25K, and 50K;
-- 20K provided the best performance on my sandbox)
IF @Counter % 20000 = 0
COMMIT TRAN
SET @Interval = DATEDIFF (ss, @now, GETDATE())
SELECT GETDATE() AS [Processing: Status Datetime..],
@Counter AS [Current Count..],
@CandidateValue AS [Current Candidate..],
@Interval AS [ElapsedSecondsThisBatch..],
CASE
WHEN @PreviousInterval = 0 THEN 100
ELSE (@Interval/@PreviousInterval) * 100
END AS [PercentPerformanceOfPreviousBatch]
-- initialize loop variables for the next batch
SET @PreviousInterval = @Interval
END
-- get the next candidate value
-- increment the final transaction
SET @Counter = @Counter + 1
-- we're out of the loop;
-- we still need to log one more status record and commit the final transaction
-- clost and deallocate the cursor
CLOSE TraverseCandidateTable
DEALLOCATE TraverseCandidateTable
SELECT GETDATE() AS [Cursor Loop Complete..]
-- keep some house for what comes.. DECLARE @Now DATETIME, @Interval DECIMAL(12,2) -- log a status message SELECT GETDATE() AS [Processing: INSERTing into ComparisonMatrix..] -- set timer SET @Now = GETDATE() -- thia INSERT/SELECT will get the factors from 1 to the SQRT -- (which ia the largest possible factor except the candidate itself) INSERT ComparisonMatrix ( Base, Factor ) SELECT c.CandidateValue, sb.Id FROM SetBuilder sb JOIN Candidate c -- use of a rendered column for the square root results in (essentially) -- a doubling of execution time! try it! ON sb.Id <= SQRT(c.CandidateValue) --c.SQRTCandidateValue AND c.CandidateValue % sb.Id = 0 -- log a status message SELECT GETDATE() AS [ComparisonMatrix First Insert: Completed At..], @@ROWCOUNT AS [Rows Inserted..], CAST(DATEDIFF (ms, @now, GETDATE()) AS DECIMAL (12,3))/1000 AS [ElapsedSeconds..] GO
@Interval DECIMAL(12,2)
-- thia INSERT/SELECT will get the factors from 1 to the SQRT
-- (which ia the largest possible factor except the candidate itself)
SELECT c.CandidateValue,
sb.Id
FROM SetBuilder sb
JOIN Candidate c
-- use of a rendered column for the square root results in (essentially)
-- a doubling of execution time! try it!
ON sb.Id <= SQRT(c.CandidateValue) --c.SQRTCandidateValue
AND c.CandidateValue % sb.Id = 0
SELECT GETDATE() AS [ComparisonMatrix First Insert: Completed At..],
@@ROWCOUNT AS [Rows Inserted..],
CAST(DATEDIFF (ms, @now, GETDATE()) AS DECIMAL (12,3))/1000 AS [ElapsedSeconds..]
GO
Once the ComparisonMatrix table is populated, all we need to do is run a SELECT against it to get the list of prime numbers (more on that in a minute).
What's really fascinating to me about this is that the cursor-based solution outperforms the full-set-based solution, every time. I get the best performance with 20K row batchs (20 seconds); I tested 5K, 10K, 20K, 25K, and 50K batch sizes and never got worse than 28 second response time. The set-based solution never ran under 31 seconds.
As Adam notes here, sometimes cursors help, and this is one of those times.
Note also that the SELECT statements are different for each alternative due to a slight difference in rendering style (that did not impact the results):
-- generate the list of prime numbers -- if the SUM(IsQuotientInteger)=2 (where IsQuotientInteger=1 when a particular member of the set -- is a precise integer factor of the candidate value), then that candidate is a prime number, -- because IsQuotientInteger will equal 1 for 1 and the candidate itself. All candidates with -- additional factors will thus be filtered out by the HAVING cluase. SELECT Base FROM ComparisonMatrix GROUP BY Base HAVING SUM(IsQuotientInteger) = 2 ORDER BY Base "Straight" INSERT Code
-- generate the list of prime numbers
-- if the SUM(IsQuotientInteger)=2 (where IsQuotientInteger=1 when a particular member of the set
-- is a precise integer factor of the candidate value), then that candidate is a prime number,
-- because IsQuotientInteger will equal 1 for 1 and the candidate itself. All candidates with
-- additional factors will thus be filtered out by the HAVING cluase.
SELECT Base
FROM ComparisonMatrix
GROUP BY Base
HAVING SUM(IsQuotientInteger) = 2
ORDER BY Base
-- if the SUM(IsQuotientInteger)=1 (where IsQuotientInteger=1 when a particular member of the set
-- because IsQuotientInteger will equal 1 for 1 and any factor other than candidate itself (the candidate itself is
-- assumed, but not INSERTed, for performance reasons (ALL numbers are factors of themselves, so I don't feel
--- the need to render that)). All candidates with additional factors will thus be filtered out by the HAVING cluase.
HAVING SUM(IsQuotientInteger) = 1
Conclusion
When I started on this project, I didn't know about the Sieve of Eratosthenes, so I approached it with something of a blunt stick, and my early results betrayed this state of affairs. Through careful tuning of the code, I was able to bring a process whose first draft would've taken two weeks to run down to a 20 second execution time.
Even though the Sieve of Eratosthenes-based solutions are obviously preferable for a production implementation of this functionality, tuning this code was a worthwhile exercise from an educational standpoint.
"Love your mistakes", indeed.
-wp