Last week, I found myself simultaneously confounded and fascinated by a conversation on the internal Microsoft SQL Server discussion list. If you know me at all, you know that I had to toss in my two cents, but I’m getting ahead of myself.
The thread started with a question from an application development consultant in the field: which is better to hold a blank value: NULL or an empty string? Before I had a chance to chime in, a couple of other folks responded. One wrote “avoid NULLs if you can”, and the other said, “IMHO, NULLs are way overused.”
Well, I had to take a contrarian view. NULLity is good. In fact, NULLity is way good. If one understands it and how it works in the database, I don’t think it can responsibly be “avoided,” and if one is following sound database design practices, I don’t think it can be “overused.”
Later conversation raised the legitimate point that there are databases out in the world that were designed by people who perhaps didn’t possess the requisite skills for such an undertaking. A lot of folks are dealing with poor designs that they’ve inherited and can’t change. In this sense, at least, I’m fortunate to work in an ivory tower: there’s nobody around me who’s designing databases who shouldn’t be running with that particular pair of scissors. To use another “sharp object” metaphor, NULLity is a tool, like a chainsaw is a tool. Use a chainsaw correctly, and you can heat your house for the winter. Use a chainsaw incorrectly, and you won’t be able to walk to the woods to cut down your trees.
So, in the interest of promoting the responsible use of chainsaws.. I mean, NULLs.. I humbly present a brief discourse on what NULLity means, how it works in the database, how to use it, and examples of its misuse. This post is a refactoring of what I wrote to the SQL Server discussion list on the topic.
NULL means “unknown”. When designing a table, one of the properties we set on each column is whether this is a legal response. If we MUST have an answer (for either business rule or data integrity reasons, for example), then we set the column to NOT NULL -- we’re saying “I don’t know” is not a workable response for this column.
When designing a table, I apply the following maxim when setting the NULLity characteristic on a column: if there’s a possibility that we’re not going to receive user input for a value that the user knows and we don’t, then we must allow NULLs (this falls under the general category of “respecting our ignorance”). It’s when NOT NULL is erroneously applied to columns that “workarounds” are applied to the data (“corruptions” might be a more accurate description), most likely to harmful effect.
Let’s run through a couple of examples.
Consider an nvarchar(50) column, CityOfBirth. The intent of the column is to hold the name of a person’s birth city (“Torrance”, “Toronto”, “Bradenton”, etc.). There are times, though, when we might not know this information. What do we put in the column then? My answer would be NULL, of course, because we don’t know. But let’s imagine for a minute that CityOfBirth has been defined as NOT NULL. We MUST supply an answer. This is where the blank string comes in, and those unfamiliar with the tenets of sound database design might think it’s no big deal.
Well, I think it’s a big deal.
The great thing about NULLs is, since they symbolize the absence of data, they fall out of JOINs -- if a value is unknown to us, then it can’t be matched to any other value, can it? If we select two random people and we don’t know where they were born, does that mean that they were born in the same place? The blank string in the database says “yes”; the NULL says “we don’t know because we don’t have data”.
So, a blank string is data -- singularly unenlightening data, yes, but still data nonetheless. More significantly, since blank strings are data just like “Torrance” and “Toronto” and “Bradenton” are data, they won’t fall out of JOINs. As far as the database is concerned, two blank strings match; two NULLs do not. Never mind “good” or “way good” -- that’s why NULLs are beautiful.
“But blank strings are still no big deal,” you say. Okay, let’s try another column, a datetime column DateOfBirth, defined as NOT NULL. If the customer doesn’t wish to share this information (my mother was well known for this), what’s a reasonable “unknown” value now? May 11, 1992? June 17, 1927? March 16, 1957? February 29, 2032? Whatever we put in this column, we’re corrupting our data just as we did with the blank string for CityOfBirth -- it’s maybe just a bit more obvious now. If we allow NULLs into the column, though, the whole issue goes away, because we have a reasonable representation for “we don’t know”.
NOT NULL columns and default values are great and useful things, as long as the default value is a reasonable member of the range of values for the column (a default value says, “if the user doesn’t nominate a value, we’re going to assume she wants the one we’re going to provide.”). There are places where this is very sound practice (GETUTCDATE() is a fine default value for a NOT NULL column called DateCreated, for example), and there are places where it’s not (“Torrance”, “Toronto”, “Bradenton”, a blank string, or anything else, for CityOfBirth; any date for DateOfBirth).
The only reason I can imagine for having a blank string as a default on a character column like CityOfBirth is that we’ve misapplied the NULLity property. If we’re in any way “hacking” or “working around” the NULLity property on a column in this fashion -- if the default value is not a reasonable member of the range of data for the column, and fairly likely to be the proper and correct entry -- chances are very good that NULLity is not set correctly.
For each column in each table we’re designing, there will be exactly one unique correct answer to the question, “should we allow NULLs in this column?” If we’re designing databases, one of our jobs is to find all of those answers. Check your requirements carefully, and remember my maxim: if there’s a possibility that we’re not going to receive user input for a value that the user knows and we don’t, then we must allow NULLs. When in doubt, use the Cliff Notes version: we’re here to store the user’s data, not make it up.
Got a different opinion? Please leave me a comment!
Great article. I agree :)
I also tend to use NULL. However, I have seen others use the NOT NULL DEFAULT "" and insist it is better that NULLS. One of the main reasons they give is that a NULL value cannot be indexed, and therefore slows down searches. Anythoughts on this?
Thanks for your question, Eric. My answer got so involved, it turned into a post. Please let me know your thoughts..
More on NULLs: a reader asks if NULLs slow down searches.
I agree completely, but let me add a clarification: Blank should mean "none" or "does not apply" or whatever makes sense in context, while null means "unknown". Confusing these two things is huge problem in a lot of systems. To take an example I used in my book "A Sane Approach to Database Design" available at electrictactics.com (shameless plug for my book), suppose on a loan application we want to check the applicant's credit history and also his or her spouse's credit history. So we have a screen field for "Applicant Social Security Number" and another for "Spouse Social Security Number". If the second field is blank, does that mean that the applicant is not married and so there is no spouse credit history to check? Or does it mean that at the time he visited the bank to apply for the loan, he didn't know his wife's SSN?
Sometimes when I make a point like this, someone will object that we should just create another field to tell us, like in this case, a "marital status" field. But then we're creating two fields with overlapping information. What does it mean if we check the "Unmarried" box and then fill in a spouse SSN? Redundant data like this forces us to do additional error checking.
Similar things apply to numbers. Suppose a data entry screen asks us to enter, say, the number of years of SQL experience that a job applicant has. The person entering the data finds the number on the job seeker's application form illegible. If there is no way to say "unknown", what is the user supposed to enter? Zero? But saying a person has no experience and saying that you don't know how much experience they have are two very very different things.
I visited a real estate site recently that listed the selling price of a certain house as $0. Are they really giving this house away for free? What, is it haunted or something? I can pretty much guess that in this case 0 meant "unknown". But statisticians routinely calculate the average selling price of a house in this or that city. If they're not careful in their programming, a few $0 houses in the database could make it look like property values are plummetting.
Almost three years after the post was originally published in October of 2005, Mark Johansen, author
Very Nice post, quiet informative.
Database designing problems being the buzzwords these days, one of the most common designing mistakes is the “Poor Planning / Architecture” of the database or mart. Follow the link to know more…