Tuesday, 5 May 2009

Tech Day #4 – SQL (For Developers)

Looking at current trends in the market, with companies wanting to downsize and squeeze all they can from their staff – it has becoming increasingly more obvious that most developers are doing a LOT more database work.

This has been my “Achilles heel” for quite some time, however it has never really been an issue since there has always been people at work that I can call upon to give us a hand. However, I am not a fan of leaving weaknesses unchecked, so here I am and here is SQL on my Tech ToDo..

Why the “For Developers”?

I think my biggest problem (and I am sure I am not alone here) with SQL is that I am a developer first. I often struggle with SQL because I think in a more procedural fashion, whereas SQL is set-based. This is a real mental shift in how you approach problems.

Before we get cracking, this post is NOT geared towards complete SQL n00bs. You should be comfortable with the basics.. SELECT, INSERT, UPDATES, DELETES and JOINS etc.

Getting Started

Go ahead and grab the sample code from my codepit. In there you will find a SQL service database file called “Data.mdf”. CREATE A COPY OF THAT AND THE “DATA.LDF” FILE. You can then mess with the data to your hearts content :)

In order to access the data, you will also need to attach the database to a SQL server service instance. So if you don’t have SQL server installed, grab the Express Edition (it’s free).

You may find it easier to work in MSEE, so you may want to sit in that rather than Visual Studio.

In the database, you will find a few tables with some random data in (thank you random name generator for the help!). These tables will be used throughout the post, data loss is likely to occur, hence the reason for the backup :)

Procedural vs. Set-Based

The Scenario

We have had some changes to the staff table. We have some new staff members as well as a couple of promotions. We need to update the staff table with the new information.

The Procedural Way

The C# developer in me thinks along the lines of this (ProceduralThinking.cs):

   1: // Get the New/Amended Records
   2: var records = GetNewAndAmendedRecords();
   3:  
   4: foreach (var record in records)
   5: {
   6:     if (record.Exists)
   7:         UpdateRecord(record);
   8:     else
   9:         InsertRecord(record);
  10: }

Where the “UpdateRecord” method would perform an UPDATE query, and the “InsertRecord” an INSERT INTO. This could equate to [roughly] the following SQL (ProceduralThinking.sql):

   1: -- Create a Table with New/Updated Records
   2: DECLARE @records TABLE (
   3:  ID int,
   4:  FirstName varchar(50),
   5:  LastName varchar(100),
   6:  Salary int,
   7:  Role int
   8: )
   9:  
  10: INSERT INTO @records VALUES (0, 'Rob', 'Cooper', 1000000, 5); -- Yeah, Right.
  11: INSERT INTO @records VALUES (0, 'John', 'Guido', 30000, 2);
  12: INSERT INTO @records VALUES (2, 'Rachel', 'Jackson', 14750, 1);
  13: INSERT INTO @records VALUES (9, 'Jon', 'Caton', 32000, 4);
  14:  
  15: -- Iterate Through the Records
  16: -- INSERT/UPDATE as necessary.
  17: DECLARE @id int;
  18: SET @id = -1;
  19:  
  20: WHILE (@id IS NOT NULL)
  21: BEGIN
  22:     SET @id = (SELECT TOP 1 ID FROM @records);
  23:     IF (@id IS NOT NULL)
  24:     BEGIN
  25:         IF (@id = 0)
  26:         BEGIN
  27:             INSERT INTO Staff (FirstName, LastName, Salary, Role)
  28:                 SELECT TOP 1 FirstName, LastName, Salary, Role FROM @records;
  29:         END
  30:         
  31:         ELSE
  32:         BEGIN
  33:             DECLARE @firstName varchar(50);
  34:             DECLARE @lastName varchar(100);
  35:             DECLARE @salary int;
  36:             DECLARE @role int;
  37:             
  38:             SET @firstName = (SELECT TOP 1 FirstName FROM @records);
  39:             SET @lastName = (SELECT TOP 1 LastName FROM @records);
  40:             SET @salary = (SELECT TOP 1 Salary FROM @records);
  41:             SET @role = (SELECT TOP 1 Role FROM @records);
  42:             
  43:             UPDATE Staff
  44:             SET FirstName = @firstName,
  45:                 LastName = @lastName,
  46:                 Salary = @salary,
  47:                 Role = @role
  48:             WHERE
  49:                 ID = @id;
  50:         END
  51:     END
  52:     DELETE TOP (1) FROM @records;
  53: END

Sure, it looks nasty and it also takes a bloody long time to write! This method is known as “RBAR” or “Row By Agonising Row” (as you can see, for a good reason!).

The Set-Based Way

Think about it like this, we have two sets of data, the data we are updating (the Staff table) and the data we are updating it with (the @records temp table). We need to tell SQL server to mash these two sets together in a way that works as desired.

Looking at the data, we know the following:

  • New records that need to be INSERTed have an ID of 0, since they currently do not have one.
  • Existing records have their ID set to the record ID in the Staff table.

So, the “ID” column in the temp table is really the “hook” that distinguishes how we are supposed to act:

  • All records in the @records set with an ID of 0 must be inserted.
  • All records in the @records set with an ID that is non-zero must be updated.

So, let’s see if we can re-work the above (SetBasedThinking.sql)

   1: INSERT INTO Staff
   2:     SELECT FirstName, LastName, Salary, Role FROM @records
   3:     WHERE ID = 0;
   4:     
   5: UPDATE S
   6:     SET FirstName = R.FirstName,
   7:         LastName = R.LastName,
   8:         Salary = R.Salary,
   9:         Role = R.Role
  10:     FROM Staff S
  11:     INNER JOIN @records R
  12:     ON S.ID = R.ID;

See what we did? We simply performed the desired operations based on the data as a complete set. That is set-based.

I also came across this great post on “thinking set-based” and highly recommend reading it.

INSERTS – Just the One Please

Those on the ball with the above may have noticed something.. What if I was to execute the above script more than once?

We would have a problem. The updates would be fine (assuming the data is still current) however, the INSERTS would execute again, adding duplicate data to our database. Let’s revisit the SQL (SetBasedThinkingRevisited.sql):

   1: INSERT INTO Staff
   2:     SELECT FirstName, LastName, Salary, Role FROM @records R
   3:     WHERE ID = 0 AND NOT EXISTS
   4:     (SELECT * FROM Staff S
   5:         WHERE S.FirstName = R.FirstName AND
   6:             S.LastName = R.LastName AND
   7:             S.Salary = R.Salary AND
   8:             S.Role = R.Role);

Here we have added an NOT EXISTS clause. EXISTS will basically return “true” if the statement passed to it returns one or more records. Our SELECT within the EXISTS is pretty mundane since it is a simple data set, but would probably be a lot easier if we had more realistic data in there (e.g. we could simply look up a SSN).

Common Problem Scenarios

Selecting the “nth” Record

We need to know who has the third-highest salary in the company, can you quickly query the database to find out?

Now, this sounds REALLY easy right? In C# you might see code like the following (NthRecord.cs):

   1: var records = GetRecordsBySalary();
   2: int thirdHighest = records[2];

Here the problem is odd, the problem is because SQL is set-based, it has no concept of an individual row and it’s relationship to rows preceding and following it.

We could quickly establish the top three highest salaries by using this straightforward query:

   1: SELECT TOP 3 * FROM Staff
   2: ORDER BY Salary DESC

We could then use that set and turn it on it’s head, and only retrieve the top record like so (ThirdHighestSalary.sql):

   1: SELECT TOP 1 * FROM
   2: (SELECT TOP 3 ID, FirstName, LastName, Salary, Role
   3: FROM Staff ORDER BY Salary DESC) sort
   4: ORDER BY Salary;

I have been told that this is the correct way to do things pre-SQL2005, but with 2005 comes a better way for us to do things (ThirdHighestSalary2005.sql):

   1: SELECT TOP 1 ID, FirstName, LastName, Salary, Role FROM
   2: (SELECT *,
   3:     ROW_NUMBER() OVER(ORDER BY Salary DESC) AS _RN
   4:     FROM Staff) as S
   5: WHERE _RN = 3;

Here we are using a “Window Function” to figure out the row number of the set. We then take the results of the sub-query and only want the third row, so we filter based on that. NOTE: I am very new to window functions so please bear with me with any queries! Be sure to read the MSDN documentation on OVER and ROW_NUMBER.

Variables

I’m not going to regurgitate the SQL data types here, that would just be silly. But there are a few points I would like to make with regards to working with variables in SQL.

Strings are [almost] Limitless in C#

Always keep in mind that in many databases strings have a limited length.. In C# we are not really used to hitting limits with strings since we rarely run in environments starving of memory. Some weird errors can occur if you do not validate input against allowed field length before saving data.

OO is Great For Data Representation SQL is Great for Manipulation

Working with data on the fly within SQL can be real cumbersome. In C# you can easily create structs/classes to act as DTO’s and then quickly create a method to new up one based on input. In SQL, we do not have that luxury as well as it being tiresome to create standard variables (both DECLARE and SET required). Try to get everything you need sorted before hitting the SQL layer. Keep the process simple. Input in, sets out. Keep SQL for doing data-access. Do the heavy-lifting in C# where it’s easy and you have rich support. Leave SQL to just doing what it has to. READ: KEEP YOUR DAMN BUSINESS LOGIC OUT OF THE DATA ACCESS LAYER!!

Temp Tables Rock, Use Them!

Temp tables are fantastic for working with masses of data input. I have found myself using knocking together the code to churn out the SQL to create a new temp table based on objects. See FormattingCSOutput.cs in the sample code. I threw that together in <5 mins (in Snippet Compiler)and it allows me to quickly populate temporary tables to work with when testing/debugging SQL. This can easily piggyback off of your stubs for your unit tests to quickly get you testing your SQL and not messing around with trying to get dummy data to work with. You also have the obvious safety net of not actually making any changes to the data.

To Wrap Up..

  • I am not a SQL guru – far from from it. The above may not be “best practice” but as a developer I try to find ways to help keep me productive and as error-free as possible.
  • I do not intend to be a SQL guru – Companies need to stop penny-pinching and realise that if they want a great developer, they are not going to a great DBA as well (unless they are really, really lucky). Hell, I have heard horror stories of DBA’s that are worse DBA’s than developers – even if you lose a few milliseconds in your data access layer, you still get a kick-ass product right? ;)
  • I have found today useful just to write more SQL. I made so many screw ups (especially on the procedural vs. set-based stuff) as well as just having to double-check syntax. While this may not be a comprehensive guide, it has not been a waste of a day.
  • I am already thinking about righting a nice little C# generic helper to just take an object and dump the SQL to create a temp table :)
  • I still have a great deal to learn about SQL, but I’ll be a bit more pro-active and organised about using it now. I have a slightly better understanding of what SQL is good for, and what it is not good for. A small step forward is better than staying stationary.

I think I would have to say the realisation I got from today is this:

Write more SQL, just don’t try and write EVERYTHING in SQL.

Pick your shots ;)

1 comment:

  1. I come from a completely different world where anything between 30- 70% of my time is spent in SQL. Its a great tool and one well worth learning. I forget that 'developers' often stay away. Infact I'd say I start with the DB when approaching a problem.

    ReplyDelete