SQL Server, Analytics, .Net, Machine Learning, R, Python
Mitch Wheat has been working as a professional programmer since 1984, graduating with a honours degree in Mathematics from Warwick University, UK in 1986. He moved to Perth in 1995, having worked in software houses in London and Rotterdam. He has worked in the areas of mining, electronics, research, defence, financial, GIS, telecommunications, engineering, and information management. Mitch has worked mainly with Microsoft technologies (since Windows version 3.0) but has also used UNIX. He holds the following Microsoft certifications: MCPD (Web and Windows) using C# and SQL Server MCITP (Admin and Developer). His preferred development environment is C#, .Net Framework and SQL Server. Mitch has worked as an independent consultant for the last 10 years, and is currently involved with helping teams improve their Software Development Life Cycle. His areas of special interest lie in performance tuning
Saturday, August 27, 2011
LINQPad: Async Functions Interactive Tutorial
Joe Albahari has recently released an interactive tutorial on C# asynchronous functions. It's shipped as a LINQPad sample library. Download the latest LINQPad beta, then go to 'Samples' on the bottom left, click 'Download more samples' and choose the async functions tutorial.
Tuesday, August 16, 2011
Visual Studio Lightswitch
The latest issue of MSDN magazine has a short but interesting article “Advanced Programming Made Easy With Visual Studio Lightswitch” from Beth Massi. You can download the development sample from here.
Sunday, August 07, 2011
T-SQL: Generating Random Numbers, Random Sampling and Random ‘Goodness’
Note: I will use the word ‘random’ throughout to mean ‘pseudo-random’. If you want truly random numbers then you need to sample a physical source of noise, i.e. you need one of these devices: http://www.idquantique.com/true-random-number-generator/products-overview.html !
If you are working with large volumes of data, it’s a common requirement to select a smaller, random sample of that data. There are several techniques to do this using TSQL; we will examine several, and perform a statistical evaluation of the distribution of the samples that each technique produces.
There are two common scenarios. Consider a data warehouse that contains hundreds (or thousands) of millions of rows. Working with the entire dataset to perform statistical analysis or data mining can be very time-consuming. Instead, analysts often select a smaller, random sample of records to build a model or draw conclusions, and then test these predictions against the full population.
Alternatively, analysts could use a BI tool (such as OLAP or data mining) to study the entire dataset and identify patterns that might represent a business opportunity. By testing against small random subsets of data, analysts can verify that their hypotheses hold true.
If you search the internet for “TSQL Random Sampling” most paths lead back to this article “Random Sampling in T-SQL”, written by Brian Connolly back in 2004. As the code no longer appears to be available in its entirety, I’ve rewritten the code but followed some of the same techniques used in that article.
There are several techniques that can be used to implement random sampling with TSQL:
Before going any further, we can quickly rule out limiting result sets using TABLESAMPLE:
In other words, despite its name, if you want a truly random sample don’t use the built-in TABLESAMPLE!
Some considerations: Rand() is relatively expensive because it requires the use of INSERTs, cursors, or single-row UPDATEs. Seeded Rand() and NewID() are easier to use, but are they sufficiently random? What we want to determine is the quality of the statistical random samples they produce. There are many statistical tests to determine the randomness of a pseudorandom number generator, which we expect to have a uniform distribution. One of the most commonly used is the Chi-square Test:
The Chi-squared test divides the row range into a number of equally spaced "buckets" and ensures that on average the number of items in each bucket falls within the expected count (which for a uniform distribution is simply the number of rows divided by the number of buckets). For example, a random 1 percent sample of a RandomPopulation table with 100,000 rows will return 1000 row ids that are scattered between 1 and 100,000. Divide the row space into 40 "buckets" from 1-2500, 2501-5000 ... 97501-100000. A random 1 percent sample from RandomPopulation should place roughly 25 row ids into each bucket (some buckets will have more and some will have less). The Chi-square test measures this deviation to determine how much non-randomness is present.
First Attempt Using Rand()
So you have a problem where you need to select some random sampled data in TSQL, and therefore need a source of ‘random’ numbers. Your first attempt at generating random numbers might be something along the lines of:
-- Won't work: returns same value for all rows!
But you will quickly discover that a major drawback to using Rand() without some form of ‘row anchor’ is that it is only evaluated once per query. So the previous query returns the same random number repeated for each row.
The RandomPopulation table
Following the original implementation, we will use a similar table to implement and compare these techniques:
create table dbo.RandomPopulation
This table will contain the population to be sampled. The first group of columns hold the actual random values generated, while the second group hold rowid values when ordered by their respective random value column.
If there are 100,000 rows in the table, rowid will have values between 1 and 100,000 inclusive. The row with the smallest pure_random number will have a pure_random_order value of 1, and similarly the row with the largest pure_random number will have a pure_random_order value of 100,000. We use queries like the following to return a sample of RandomPopulation with @samplecount rows into a temporary table:
The other random value columns are processed similarly.
Method 1: Fill a column using rand() in a while loop
Since the only way to assign a different random number to every row in a table is to make a separate assignment for each row, one solution is to loop over the rows, updating the table one row at a time:
-- 1. Pure random value
Once all rows have been assigned a pure_random value, the pure_random_order column can be calculated:
This query pattern is likewise repeated for the other random value/order column pairs.
Method 2 and 4: rand() seeded with Newid(), and pure Newid()
In order for rand() to be evaluated for each row, we need to seed it with a row-dependent value. TSQL’s NewID() function is evaluated once per row, and the uniqueidentifier values (128 bit GUIDs) that it returns can be ordered. Both of these methods are shown below:
-- 2. Rand seeded with guid
-- 4. newid() value
Method 3: rand() seeded with a row anchor
As mentioned earlier, if rand() is seeded with a row-dependent value, it's evaluated not once per query, but rather once for each row. In Brian’s original article, he uses the following formula to seed rand() with a row-dependent value:
-- 3. rand seeded and anchored to row: DO NOT USE!!
This formula does not work at all well. You do not even need to perform a chi-square test on the data ; it is visibly predictable!
Instead, I used the following formula:
-- 3. rand seeded and anchored to row (2): also, DO NOT USE!!
The seed chosen for rand() is a function of the rowid and also the current nanoseconds. This is the one I’ve included in the results, BUT both versions perform very poorly, and should not be used in practice. Here’s what Brian had to say:
Although using a time based function to seed rand() might seem convenient, in addition to the above reasons for not doing so, there is also the issue that the ‘randomness’ is affected by the speed of your CPU!
Method 5: checksum(newid) as a float value
This method transforms the 128bit GUID returned from Newid() into a 32bit integer value using Checksum() and then into a floating point value between 0 and 1.
-- 5. checksum(newid()) value
The expression CAST(CHECKSUM(NEWID()) & 0x7fffffff AS float / CAST (0x7fffffff AS int) evaluates to a random float value between 0 and 1. The idea behind testing this is to check whether the restriction of the GUID values to 32 bits still renders them sufficiently random.
Note: This technique can be used directly to select a random sample of individual rows. In this example from MSDN, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:
SELECT * FROM Sales.SalesOrderDetail
The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float / CAST (0x7fffffff AS int) evaluates to a random float value between 0 and 1.
Sidebar: Picking a single random row
Picking a single random row using an ORDER BY clause like this:
SELECT TOP 1 * FROM table ORDER BY CHECKSUM(NEWID())
performs a table scan (because the random value associated with each row obviously needs to be calculated before the rows can be ordered), which can be slow for large tables. Using an indexed integer column (such as that commonly used for a primary key), and using:
SELECT TOP 1 * FROM table
works in constant time, provided the
Running the Tests
I have followed Brian’s original TSQL implementation to calculate the bucket frequencies and the resulting chi-square statistic. Please refer to the original article or the attached downloadable code for more details. The tests are wrapped in the following stored procedure (available in the download):
create procedure dbo.test_random_selection
All tests were run against SQL Server 2008 R2 SP1 Enterprise Edition (x64) (and also without SP1; the results were similar) on an Intel Quad core i7 based PC running Windows 7 x64 Ultimate.
Test1: test_random_selection(50, 100000, 0.01, 40): 50 runs, 100,000 RandomPopulation rows, a 1 percent sample, and 40 buckets.
Each 2,500 (= 100,000 / 40) interval range bucket should contain approximately 25 rowids. There are 39 degrees of freedom (1 less than the number of buckets). We can say that the row sample for each technique is independent and uniformly distributed at the 90% confidence level if the Chi-square statistic for each technique is Chi-Sq[.90, 39]. Consulting a standard Chi-square table gives a value of 50.660 for the Chi-Sq[.90, 39] threshold level of this statistic (and the lower Chi-Sq[.10, 39] = 28.196). As long as the Chi-square value for each technique is lower than 50.66, there is no reason to suspect that the sample is not random (Note: we expect that some samples will exceed the threshold value, because we are working with a confidence level of 90%. A few instances of crossing this value does not necessarily mean that the samples are non-random).
The results from this test are shown here:
The Rand() seeded with a row anchor technique is showing very predictable behaviour. In fact, it is producing an unusual number of equal values!. The averages, standard deviations, max and min values for the other techniques are very similar to each other in shape and bounds.
Out of interest, I ran the tests again using the rand anchored formula from the original article, and obtained a similar result to Brian:
Neither of the ‘rand() seeded with a row anchor’ methods produce anything like sufficiently random sample, and consequently should not be used in practice.
Test 2: test_random_selection(160, 1000000, 0.01, 100): 160 runs, 1,000,000 RandomPopulation rows, a 1 percent sample, and 100 buckets.
Consulting a standard Chi-square table gives a value of 117.407 for the Chi-Sq[.90, 99] threshold level of this statistic (and the lower Chi-Sq[.10, 99] = 81.449). I have omitted the seeded anchored results in the larger graph to improve the viewable scale (but have included with the same results below, reduced in size)
The seeded random anchored technique should never be used: the results were inconsistent and generally very poor.
The other 4 random sampling techniques produced very similar, acceptable results. They produced random samples that were sufficiently random as measured by a Chi-Square test. rand() has been designed to meet statistical standards so it is no surprise it produces random numbers whose distribution is uniform and satisfy a Chi-Square test. You should use rand() wherever possible, but bear in mind the overhead it has when used as a ‘row at a time’ operation. In testing, I found generating rand() per row was 60 times slower than the set-based operation of either using checksum(newid()) or even rand(checksum(newid())):
Filling a table column for 1 million rows:Rand() per row : 124 seconds
Set:CHECKSUM(NEWID()) : 2 seconds
Set:rand(checksum(newid())) : 2 seconds
So, if speed is a factor, then consider using of the newid()techniques. But heed Brian’s advice:
Note: none of the random number generation methods discussed here are sufficient for cryptographic purposes (which require much larger number generation spaces). SQL Server 2008 introduced the CRYPT_GEN_RANDOM(length [, seed ]) function, which returns a cryptographic random number generated by the Crypto API (CAPI). The output is a hexadecimal number of the specified number of bytes.
Note: Please be aware of the following issue, when using Newid() or any other non-deterministic function in a table expression (derived table, CTE, view, inline TVF):
Tuesday, August 02, 2011
SQL Server: Is it OK to use a uniqueidentifier (GUID) as a Primary Key?
I was recently discussing this topic with a colleague and less than a day later a similar was question was asked on StackOverflow. Then a few days later I was asked a related question by another colleague.
Let’s start with the answer to the question “Is it OK to use a uniqueidentifier (GUID) as a Primary Key?”: Yes, a uniqueidentifier (GUID) column can be fine as a Primary Key, BUT it is not a particularly good choice for the clustered index. In many cases, you will be better off creating the clustered index on a column (or columns) that are likely be used in range searches, and create a non-clustered index on the GUID column.
OK, that’s the answer out of the way! Now for the reasons.
A common reason for using a GUID column as a primary key (over an int or bigint) is the ability to create and assign these in the middle application tier without making a round trip to the database, while still avoiding key clashes between distributed clients.
A primary key is a column (or set of columns) that uniquely identifies each row in a table (or entity). When a column (or a set of columns) is defined as a primary key, SQL Server checks that ALL of the columns that make up the PRIMARY KEY meet the requirement of a PRIMARY KEY constraint which is that they must not contain any NULLs. If the table contains data, a check is made to ensure that this existing data meets the uniqueness constraint. If there are any duplicate rows, the addition of the constraint will fail, as will the creation of the primary key.To enforce this for new rows, SQL Server builds a UNIQUE index. If you do not specify the index type when adding the constraint, SQL Server creates a UNIQUE CLUSTERED index. Which is almost certainly why people sometimes confuse the clustered index with the primary key.
It’s important, so it’s worth repeating: If you don't specify the index type when designating a Primary Key, SQL Server creates a UNIQUE CLUSTERED index by default.
What’s the significance of the clustering key?
Notice point (1): Each index row in the non-clustered index contains the non-clustered key value and a row locator. This locator points to the data row in the clustered index or heap having the key value (Nonclustered Index Structures). So a wide clustering key adds overhead to every non-clustered index. This overhead can be significant.
What are good choices for the clustering key?
A uniqueidentifier (GUID) column requires 16 bytes of storage (4 times wider than an int identity column), so the space used per row is higher (and less rows fit on a database page), and bandwidth consumed by data transfer is higher.
If all you need is a key range larger than that supported by an int (4 bytes, 231 – 1 rows) then consider using a bigint (8 bytes), which supports 263 - 1 rows (which should be sufficient for most applications) and saves 8 bytes per row compared to using a GUID.
Are there any performance considerations? Yes. In addition to the increased storage issues, a clustered index based upon a non-ordered GUID will experience fragmentation due to page splits during inserts. If you absolutely have to have a (non-ordered) GUID as a primary key with a clustered index, make sure you have a regular index rebuild as part of your database maintenance plans, and maybe even consider leaving extra space per page by decreasing the Fill Factor for the clustered index.When an index is created or rebuilt, the fill-factor value determines the percentage of space on each leaf-level page to be filled with data, reserving the remainder on each page as free space for future growth, and reducing the number of page splits. [Note: The fill-factor setting applies only when the index is created, or rebuilt.]
Monday, August 01, 2011
Perth .NET User Group Meeting: Thurs 4th Aug, 5:30pm: Introduction to NServiceBus with Colin Scott
Join us at the Perth .NET user group, August 4th 5:30pm, where Colin Scott will give an introduction to NServiceBus covering key features and concepts such as asynchronous messaging, Publish/Subscribe, message handling and versioning and just what the heck is a service anyway. NServiceBus is an open source service bus built on .NET. It supports the construction of systems that are robust, loosely coupled and aligned to your business.
Colin Scott is a developer who's been working with .NET long enough to start telling the kids of today that they don't know how good they have it! Having worked for consulting companies since 1999 he's recently decided to see how the other half live by joining the development team at Quickflix as a Senior .NET Developer/Architect. When not making the world a better place for developers everywhere he can be found on Twitter at @AbstractCode.
SQL Server - Book Recommendations
If you’re an experienced developer/DBA coming from a background in another RDBMS or just want to improve your SQL Server knowledge, here are some SQL Server book recommendations:
If you have more of a development focus:
There are also some very good, free electronic books from Redgate:
to list a few.
MSN, Email: mitch døt wheat at gmail.com