What Records Are Included in a Transaction Log Backup?

I had an interesting situation come up recently, and it boiled down to this. Suppose I start a transaction log backup at 10:00:00 and it runs for one minute, until 10:01:00. Is a transaction that happened at 10:00:30 included in the backup? Let’s do an experiment.

I have a sample database containing customers and orders, and an application that fires up a bunch of threads and inserts random data into the database. So I kicked off the application and let it run for several hours so as to generate a bunch of log records. I disabled the normal log backups for the duration of this test.

With the application still running, I kicked off a transaction log backup, waited briefly, and then ran a second log backup.

backup log CorpDB to disk = 'S:\Backup\SQL2014\Log\log1.trn' with init, compression;
waitfor delay '0:00:10';
backup log CorpDB to disk = 'S:\Backup\SQL2014\Log\log2.trn' with init, compression;

Now I want to look at the headers of my backup files.

restore headeronly from disk = 'S:\backup\sql2014\Log\log1.trn';
restore headeronly from disk = 'S:\backup\sql2014\Log\log2.trn';

RestoreHeader

I see that the first log backup started at 7:30:17 pm and ended at 7:31:37 pm, and the second backup started and finished within the same second, at 7:31:47. Unfortunately, these times are stored with only one-second resolution, but we can see that the first backup ran for about 80 seconds.
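As an aside, the same start and finish times are also recorded in msdb and can be cross-checked there:

select database_name, type, backup_start_date, backup_finish_date
from msdb.dbo.backupset
where database_name = 'CorpDB'
and type = 'L'
order by backup_start_date desc;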

So the question now is which backup contains a transaction that occurred at, say, 7:30:30?

I’m going to load the transaction log records into a couple of temp tables.

select row_number() over (order by [Current LSN]) rn, *
into #log1
from sys.fn_dump_dblog(null, null, 'disk', 1, 'S:\backup\sql2014\Log\log1.trn',
default, default, default, default, default, default, default, default, default, default, default, 
default, default, default, default, default, default, default, default, default, default, default, 
default, default, default, default, default, default, default, default, default, default, default, 
default, default, default, default, default, default, default, default, default, default, default, 
default, default, default, default, default, default, default, default, default, default, default, 
default, default, default, default, default, default, default, default);
 
create clustered index ix1_log1 on #log1 (rn);
 
select row_number() over (order by [Current LSN]) rn, *
into #log2
from sys.fn_dump_dblog(null, null, 'disk', 1, 'S:\backup\sql2014\Log\log2.trn',
default, default, default, default, default, default, default, default, default, default, default, 
default, default, default, default, default, default, default, default, default, default, default, 
default, default, default, default, default, default, default, default, default, default, default, 
default, default, default, default, default, default, default, default, default, default, default, 
default, default, default, default, default, default, default, default, default, default, default, 
default, default, default, default, default, default, default, default);
 
create clustered index ix1_log2 on #log2 (rn);

Now let’s take a look at the last 10 records in the first log file, the first 10 records in the second log, and the last 10 records in the second log.

select 1 log_nbr, rn, Operation, Context, [Current LSN], [Transaction ID], [Begin Time], [End Time]
from #log1
where rn >= (select max(rn) from #log1) - 9
union all
select 2 log_nbr, rn, Operation, Context, [Current LSN], [Transaction ID], [Begin Time], [End Time]
from #log2
where rn <= 10
union all
select 2 log_nbr, rn, Operation, Context, [Current LSN], [Transaction ID], [Begin Time], [End Time]
from #log2
where rn >= (select max(rn) from #log2) - 9
order by log_nbr, rn;

CombinedLogRecords

The first backup contains log records through 7:30:17.873, and the second backup starts with log records a few milliseconds later at 7:30:17.887 and ends at 7:31:47.327. In other words, the second log contains the records covering the 90 seconds between the start times of the two log backups.

To answer my specific question I can also find a transaction that happened at 7:30:30 pm in the second log file:

select top 1 rn, Operation, Context, [Current LSN], [Transaction ID], [Begin Time], [End Time]
from #log2
where [Begin Time] >= '2016/04/29 19:30:30'
order by [Begin Time];

SearchForSpecificTxn

From this, I conclude that a log backup includes transactions up to approximately the BackupStartDate contained in the header, and not the BackupFinishDate.

Caveat: I only tried this on SQL Server 2014 SP1 Developer Edition. It is entirely possible that other versions / service packs / editions behave differently.

A Catch-All Backup Job (with Automatic Restore Script Generation)

Job one as a DBA is to ensure that the systems under your control are backed up. Actually, job one is to make sure that systems are restorable, but that starts with robust backup jobs.

One job that I always set up on my personal SQL Server instances is a catch-all backup job. A job, that is, that backs up all databases except for ones that are specifically excluded. Why? I don’t always think about creating backup jobs just because I’ve added a new database. A shortcoming, to be sure, but even on those personal systems I want to be protected.

I am a firm believer that backups on enterprise systems should be done via dedicated backup jobs due to the variety of RPO/RTO requirements associated with those systems, but I suppose that a case could be made for a catch-all backup job there as well.

The Catch-All Backup Job

First, we need a table to store a list of databases that will be excluded from the backups. I put this in a database called Admin.

create table dbo.DatabasesExcludedFromBackup
(
	 DatabaseName sysname not null
);
 
insert dbo.DatabasesExcludedFromBackup (DatabaseName) values ('tempdb');

Be sure to add other databases that don’t need to be covered by this job, including databases that are backed up through other jobs, or databases like AdventureWorks that you can easily recreate from the original source.
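For example, to exclude a sample database and one that already has its own dedicated backup job (the names here are just placeholders for your environment):

insert dbo.DatabasesExcludedFromBackup (DatabaseName)
values ('AdventureWorks2014'), ('SomeOtherDatabase');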

Automated Restore Script Generation

I also like to throw in some code to generate restore scripts automatically. We need a couple of tables to store this (also in the Admin database).

create table dbo.BackupType
(
	BackupType varchar(10) not null,
	Description nvarchar(100) null,
	constraint pk_BackupType primary key clustered (BackupType)
);
 
create table dbo.BackupChainType
(
	BackupChainType varchar(25) not null,
	Description nvarchar(100) null,
	constraint pk_BackupChainType primary key clustered (BackupChainType)
);
 
insert dbo.BackupType (BackupType, Description)
	values ('Diff', 'Differential Backup'), ('Full', 'Full backup'), ('Log', 'Transaction log backup');
 
insert dbo.BackupChainType (BackupChainType, Description)
	values ('Full+Diff', 'Full backup plus differential backups'),
	('Full+Diff+Log', 'Full backup plus differential backups plus transaction log backups'),
	('Full+Log', 'Full backup plus transaction log backups'), 
	('FullOnly', 'Full backup only');
 
create table dbo.RestoreScript
(
	RestoreScriptId int identity(1,1) not null,
	DatabaseName sysname not null,
	BackupChainType varchar(25) not null,
	BackupType varchar(10) not null,
	TimeGenerated datetime2(7) not null constraint df_RestoreScript__TimeGenerated  default (sysdatetime()),
	ScriptText nvarchar(max) not null,
	constraint pk_RestoreScript primary key clustered (RestoreScriptId),
	constraint fk_RestoreScript__BackupChainType foreign key (BackupChainType) references dbo.BackupChainType (BackupChainType),
	constraint fk_RestoreScript__BackupType foreign key (BackupType) references dbo.BackupType (BackupType)
);

The Full Backup Job

Next, create a SQL Agent job. I call my job “Backup all databases – FULL” and add a job step with this script. Be sure to set the location of the backup folder at the beginning of the script, and also change all references to the “Admin” database if necessary.

declare	@dbName nvarchar(256),
		@recoveryModel nvarchar(60),
		@backupTime nvarchar(100),
		@fileName nvarchar(255),
		@sql nvarchar(4000),
		@sqlNoRecovery nvarchar(4000),
		@destinationFolder nvarchar(256);
 
-- Set this to the appropriate backup folder.  ** IMPORTANT ** It must end with a backslash!
select @destinationFolder = 'S:\SQL\Backups\Full\';
 
declare dbCsr cursor for select name, recovery_model_desc from sys.databases where name not in (select DatabaseName from Admin.dbo.DatabasesExcludedFromBackup)
	and source_database_id is null;
open dbCsr;
fetch next from dbCsr into @dbName, @recoveryModel;
while @@FETCH_STATUS = 0
begin
	select	@backupTime = replace(replace(replace(CONVERT(nvarchar(100), getdate(), 120), '-', ''), ':', ''), ' ', '_');
	select	@fileName = @destinationFolder + @dbName	+ '.' + @backupTime + '.bak'
	select	@sql = 'backup database [' + @dbName + '] to disk = ''' + @fileName + ''' with init, compression;';
	exec(@sql);
 
	delete Admin.dbo.RestoreScript where DatabaseName = @dbName;
 
	select @sql = 'restore database [' + @dbName + '] from disk = ''' + @fileName + ''' with replace';
	select @sqlNoRecovery = @sql + ', norecovery;';
	select @sql += ';';
 
	if @recoveryModel = 'SIMPLE'
	begin
		insert Admin.dbo.RestoreScript (DatabaseName, BackupChainType, BackupType, ScriptText)
		values (@dbName, 'FullOnly', 'Full', @sql);
 
		insert Admin.dbo.RestoreScript (DatabaseName, BackupChainType, BackupType, ScriptText)
		values (@dbName, 'Full+Diff', 'Full', @sqlNoRecovery);
	end
	else
	begin
		insert Admin.dbo.RestoreScript (DatabaseName, BackupChainType, BackupType, ScriptText)
		values (@dbName, 'Full+Log', 'Full', @sqlNoRecovery);
 
		insert Admin.dbo.RestoreScript (DatabaseName, BackupChainType, BackupType, ScriptText)
		values (@dbName, 'Full+Diff+Log', 'Full', @sqlNoRecovery);
	end
 
	fetch next from dbCsr into @dbName, @recoveryModel;
end
close dbCsr;
deallocate dbCsr;

Set an appropriate schedule on the job. For me, once per day works fine.
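If you would rather script the schedule than click through the SSMS dialogs, a daily 1:00 am schedule can be attached with something along these lines (a sketch; adjust the job name and start time as needed):

exec msdb.dbo.sp_add_jobschedule
	@job_name = N'Backup all databases – FULL',
	@name = N'Daily full backup',
	@freq_type = 4,			-- daily
	@freq_interval = 1,		-- every 1 day
	@active_start_time = 010000;	-- 1:00 am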

The Log Backup Job

Now create another job called “Backup all databases – LOG” and add this step:

declare	@dbName nvarchar(256),
		@backupTime nvarchar(100),
		@fileName nvarchar(255),
		@sql nvarchar(4000),
		@destinationFolder nvarchar(256);
 
-- Set this to the appropriate backup folder.  ** IMPORTANT ** It must end with a backslash!
select @destinationFolder = 'S:\SQL\Backups\Log\';
 
declare dbCsr cursor for
select db.name from sys.databases db
cross apply (select top 1 * from msdb.dbo.backupset bs where bs.database_name = db.name and bs.type = 'D' order by bs.backup_finish_date desc) bs
where db.recovery_model_desc != 'SIMPLE'
and db.name not in (select DatabaseName from Admin.dbo.DatabasesExcludedFromBackup)
and db.source_database_id is null;
 
open dbCsr;
fetch next from dbCsr into @dbName;
while @@FETCH_STATUS = 0
begin
	select	@backupTime = replace(replace(replace(CONVERT(nvarchar(100), getdate(), 120), '-', ''), ':', ''), ' ', '_');
	select	@fileName = @destinationFolder + @dbName + '.' + @backupTime + '.trn';
	select	@sql = 'backup log [' + @dbName + '] to disk = ''' + @fileName + ''' with init, compression;';
	exec(@sql);
 
	select @sql = 'restore log [' + @dbName + '] from disk = ''' + @fileName + ''' with norecovery;';
 
	insert Admin.dbo.RestoreScript (DatabaseName, BackupChainType, BackupType, ScriptText)
	values (@dbName, 'Full+Log', 'Log', @sql);
 
	insert Admin.dbo.RestoreScript (DatabaseName, BackupChainType, BackupType, ScriptText)
	values (@dbName, 'Full+Diff+Log', 'Log', @sql);
 
	fetch next from dbCsr into @dbName;
end
close dbCsr;
deallocate dbCsr;

Be sure to schedule the job according to your needs. I run log backups once per hour.

Additional Steps

You will want to add an extra step to your backup jobs to purge old backup files after a few days; otherwise they will keep accumulating forever.
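One option for the purge step (a sketch only; xp_delete_file is undocumented and unsupported, so test it carefully in your environment) is to mimic what maintenance plan cleanup tasks do:

-- Remove full backup files older than three days
-- (arguments: 0 = backup files, folder, extension, cutoff date, 0 = do not include subfolders)
declare @cutoff datetime = dateadd(day, -3, getdate());
exec master.dbo.xp_delete_file 0, N'S:\SQL\Backups\Full\', N'bak', @cutoff, 0;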

I also add a step to copy the files off-system because I am definitely a believer in getting backups off to another box as quickly as possible, just in case the machine blows up.

The Restore Script

The backup jobs generate restore scripts as they process the backups. You can find a list of the possible restore paths by executing:

select distinct DatabaseName, BackupChainType
from Admin.dbo.RestoreScript;

Then you can pass in a combination of DatabaseName and BackupChainType into a script such as the following:

select ScriptText
from Admin.dbo.RestoreScript
where DatabaseName = 'Admin'
and BackupChainType = 'Full+Log'
order by TimeGenerated asc;

This will output the commands to restore the database to the last log backup. You will need to recover the database manually, and you can easily select which logs to restore or add a STOPAT clause to recover to a point in time.
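For example, after running the generated script for a Full+Log chain, the last steps look something like this (the file name here is just a placeholder):

-- After the final generated log restore, bring the database online:
restore database [Admin] with recovery;

-- Or recover to a point in time by modifying the last log restore:
restore log [Admin] from disk = 'S:\SQL\Backups\Log\Admin.20160429_190000.trn'
	with stopat = '2016-04-29 19:30:00', recovery;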

I even have a small SSIS package which runs a Foreach Loop and extracts each restore script out to a text file. The few times I have needed to use it, these pre-generated scripts have made the restores trivially simple to execute.

Potential Enhancements

Even though the script tables allow for differential backups in the mix, there’s nothing here that really uses differentials. For my purposes, that’s overkill, but you can certainly add a “Backup all databases – DIFF” job and include logic to generate appropriate restore scripts.

First Impressions of anonymize()

In a recent post, I gave a method that I have used to create test data. As I was playing with the new anonymize() function introduced in SQL Server 2016 RC0, it occurred to me that it could be used for a similar purpose.

The Books Online page for anonymize() is somewhat skimpy as of this writing, but there’s enough to figure out the general idea. The function always requires a column as input, seemingly so that it can sample the data in the column to determine the range of output values, particularly with numeric data types. However, with fewer than three rows there just doesn’t seem to be enough to work with, and anonymize() will simply output the row values.

declare @t table (FirstName nvarchar(25), LastName nvarchar(25), Age int, DateOfBirth date);
insert @t (FirstName, LastName, Age, DateOfBirth)
values	('Mark', 'Smith', 25, '1990-09-01'),
	('Emily', 'Jones', 23, '1992-12-31');
 
select	anonymize(FirstName) as FirstName,
	anonymize(LastName) as LastName,
	anonymize(Age) as Age,
	anonymize(DateOfBirth) as DateOfBirth
from @t;

anonymize_2rows

But when we add a third row, the magic starts to happen. I’ll run this statement twice in a row:

declare @t table (FirstName nvarchar(25), LastName nvarchar(25), Age int, DateOfBirth date);
insert @t (FirstName, LastName, Age, DateOfBirth)
values	('Mark', 'Smith', 25, '1990-09-01'),
	('Emily', 'Jones', 23, '1992-12-31'),
	('Sara', 'Brown', 31, '1985-02-22');
 
select	anonymize(FirstName) as FirstName,
	anonymize(LastName),
	anonymize(Age) as Age,
	anonymize(DateOfBirth) as DateOfBirth
from @t;

anonymize_3rows_run1

anonymize_3rows_run2

So, a couple of observations. First, the ages that are displayed are in the range of the input data. That is, the inputs in the Age column are between 23 and 31 years old, and the outputs are similarly from 24 to 28 in the first result and from 25 to 30 in the second. Clearly, SQL Server is reading the input column and basing the anonymized output on the data in there. Similarly, the dates output are in the ranges of the input column. However, there is no relationship between Age and DateOfBirth, which makes sense since SQL Server has nothing to tie the two columns together. (Books Online makes it clear, by the way, that anonymize() will not work on computed columns.)

But the real magic happens with the name columns (both first and last). Notice that the output values are in no way derived from the inputs. I have to admit to being quite impressed that with only three rows, SQL Server can infer that these columns contain name data. And no, it is not based on the column names; when I rename the columns to c1 and c2, the above example still works fine.

Perhaps more impressive is that it can detect and generate addresses as well, even using obviously fake address data as input.

declare @t table (Address nvarchar(100), City nvarchar(40), State nvarchar(5));
insert @t (Address, City, State)
values ('1234 Main St', 'Anytown', 'PA'),
	('987 E 1st Ave', 'Podunkville', 'KS'),
	('456 S 654 W', 'By the Sea', 'CA');
 
select	anonymize(Address),
	anonymize(City),
	anonymize(State)
from @t;

anonymize_address

As with many other new features, the anonymize() function comes with a host of limitations. As previously mentioned, it doesn’t work on computed columns. As you would expect, you cannot use it in a WHERE clause, and the function is incompatible with most other new SQL Server features, including Always Encrypted, in-memory tables and temporal tables. Interesting things also happen when you use anonymize() with a TOP clause.

declare @t table (FirstName nvarchar(25), LastName nvarchar(25), Age int, DateOfBirth date);
insert @t (FirstName, LastName, Age, DateOfBirth)
values	('Mark', 'Smith', 25, '1990-09-01'),
	('Emily', 'Jones', 23, '1992-12-31'),
	('Sara', 'Brown', 31, '1985-02-22');
 
select	top 1 anonymize(FirstName) as FirstName,
	anonymize(LastName) as LastName,
	anonymize(DateOfBirth) as DateOfBirth
from @t;

anonymize_top

How I Create a Nums Table

You need a Nums or Tally table. You really do. OK, you can make do without, but having one really makes a lot of operations easier.

A Nums table is a good candidate for either an admin database or for the user database, depending on how tightly coupled it is with the stored procedures, functions, views, etc. On the one hand, you might want to make sure that the Nums table moves with a backup and restore of your primary database. On the other hand, you may not want the overhead of multiple Nums tables across several databases.

My preferred way to create and populate a Nums table is:

create table Nums
(
	n int not null,
	constraint pk_Nums primary key clustered (n)
);
 
with	l0 as (select 1 v union all select 1),
		l1 as (select a.v from l0 a, l0),
		l2 as (select a.v from l1 a, l1),
		l3 as (select a.v from l2 a, l2),
		l4 as (select a.v from l3 a, l3),
		l5 as (select a.v from l4 a, l4),
		n as (select row_number() over (order by (select null)) n from l5)
insert Nums(n)
select n.n
from n
where n.n <= 1000000;

The downside here is that the code is ugly, and unless you are familiar with what’s
going on it may take a bit to decipher the script.

The upside is that the script is fast (about 5 seconds on my machine to load 1,000,000 records) and is quite adaptable. If you want more than one million numbers in the Nums table, just increase the value in the script up to about 4 billion. And if for some reason you need more than that, just add another CTE for “l6” following the same pattern.

For a million numbers, this occupies less than 13 MB of storage, so in most cases the overhead for a single database is pretty trivial.
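A quick way to check the footprint on your own system:

exec sp_spaceused 'Nums';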

Finally, I really like the flexibility of taking this same method and generating a virtual Nums table. Yes, it adds a lot of ugliness to an otherwise nice script, but you also gain independence from the physical Nums table, which is useful in situations where you can’t have an Admin database and can’t modify the user database.

If I want to find gaps in the IDs in my Customer table, I can come up with something like this:

declare @maxCustomerId int = (select max(CustomerID) from Customer);
 
with	l0 as (select 1 v union all select 1),
		l1 as (select a.v from l0 a, l0),
		l2 as (select a.v from l1 a, l1),
		l3 as (select a.v from l2 a, l2),
		l4 as (select a.v from l3 a, l3),
		l5 as (select a.v from l4 a, l4),
		Nums as (select row_number() over (order by (select null)) n from l5)
select	Nums.n MissingCustomerId
from	Nums left join Customer cust on Nums.n = cust.CustomerID
where	Nums.n <= @maxCustomerId
and		cust.CustomerID is null;

The nice thing is that SQL Server quite efficiently optimizes the query and doesn’t evaluate any more CTEs than are really necessary based on the constraints used on the Nums CTE.
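As a quick illustration of that, asking the virtual Nums CTE for just ten numbers returns immediately instead of churning through the roughly four billion rows the stacked CTEs could theoretically produce:

with l0 as (select 1 v union all select 1), l1 as (select a.v from l0 a, l0), l2 as (select a.v from l1 a, l1),
l3 as (select a.v from l2 a, l2), l4 as (select a.v from l3 a, l3), l5 as (select a.v from l4 a, l4),
Nums as (select row_number() over (order by (select null)) n from l5)
select n from Nums where n <= 10;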

This also makes it easy and efficient to create a table-valued function instead of (or in addition to) a persisted Nums table:

if exists (select * from sys.objects where type = 'IF' and name = 'fnNums')
	drop function dbo.fnNums;
go
create function dbo.fnNums()
returns table
as
return
with	l0 as (select 1 v union all select 1),
		l1 as (select a.v from l0 a, l0),
		l2 as (select a.v from l1 a, l1),
		l3 as (select a.v from l2 a, l2),
		l4 as (select a.v from l3 a, l3),
		l5 as (select a.v from l4 a, l4),
		Nums as (select row_number() over (order by (select null)) n from l5)
select Nums.n
from Nums
go
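Usage then looks like any other table-valued function; for example, to get the numbers 1 through 100:

select n
from dbo.fnNums()
where n <= 100;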

Capturing output from SQL Server using C#

I have recently been doing some preliminary work on a tool that needs to read non-tabular output from SQL Server. This is the stuff that normally appears on the “messages” tab in Management Studio.

Consider the following query:

set nocount on;
print 'Now selecting from sys.databases';
select * from sys.databases;

When I run this in SSMS, I get the following on the messages tab:

Now selecting from sys.databases

So how do we capture the print statement? It turns out to be pretty easy.

using System;
using System.Data;
using System.Data.SqlClient;
 
namespace Playground
{
    class SqlOutputCapture
    {
        public static void Test()
        {
            SqlConnectionStringBuilder sb = new SqlConnectionStringBuilder();
            sb.DataSource = @"servername";
            sb.InitialCatalog = "master";
            sb.IntegratedSecurity = true;
 
            using (SqlConnection connection = new SqlConnection(sb.ToString()))
            {
                connection.InfoMessage += OnInfoMessageGenerated;
                string sql = @"set nocount on; print 'Now selecting from sys.databases';
					select * from sys.databases;";
                using (SqlCommand command = new SqlCommand(sql, connection))
                {
                    using (SqlDataAdapter adapter = new SqlDataAdapter(command))
                    {
                        using (DataSet set = new DataSet())
                        {
                            adapter.Fill(set);
                        }
                    }
                }
            }
        }
 
        private static void OnInfoMessageGenerated(object sender, SqlInfoMessageEventArgs args)
        {
            Console.WriteLine("{0}", args.Message);
        }
    }
}

This writes the following to the console.

Now selecting from sys.databases

So let’s flip the NOCOUNT to off:

                string sql = @"set nocount off; print 'Now selecting from sys.databases';
					select * from sys.databases;";

When we run the query in SSMS, we now get:

Now selecting from sys.databases

(33 row(s) affected)

But executing the C# code outputs:

Now selecting from sys.databases

So why don’t we get the message about the number of rows?

It turns out that not everything that appears in Management Studio’s “messages” tab comes directly from SQL Server, at least not as literal text. Setting NOCOUNT OFF doesn’t cause SQL Server to return the literal string “(33 row(s) affected)”; rather, it affects the row-count information sent in the underlying TDS network stream.

We can sort of emulate what SSMS does.

        public static void Test()
        {
            SqlConnectionStringBuilder sb = new SqlConnectionStringBuilder();
            sb.DataSource = @"servername";
            sb.InitialCatalog = "master";
            sb.IntegratedSecurity = true;
 
            using (SqlConnection connection = new SqlConnection(sb.ToString()))
            {
                connection.InfoMessage += OnInfoMessageGenerated;
                string sql = @"set nocount off; print 'Now selecting from sys.databases';
					select * from sys.databases;";
                using (SqlCommand command = new SqlCommand(sql, connection))
                {
                    command.StatementCompleted += OnStatementCompleted;
                    using (SqlDataAdapter adapter = new SqlDataAdapter(command))
                    {
                        using (DataSet set = new DataSet())
                        {
                            adapter.Fill(set);
                        }
                    }
                }
            }
        }
 
        static void OnStatementCompleted(object sender, StatementCompletedEventArgs args)
        {
            Console.WriteLine("({0} row(s) affected)", args.RecordCount);
        }

        private static void OnInfoMessageGenerated(object sender, SqlInfoMessageEventArgs args)
        {
            Console.WriteLine("{0}", args.Message);
        }

And now we get:

Now selecting from sys.databases
(33 row(s) affected)

Another option is to use the connection statistics:

        public static void Test()
        {
            SqlConnectionStringBuilder sb = new SqlConnectionStringBuilder();
            sb.DataSource = @"servername";
            sb.InitialCatalog = "master";
            sb.IntegratedSecurity = true;
 
            using (SqlConnection connection = new SqlConnection(sb.ToString()))
            {
                connection.InfoMessage += OnInfoMessageGenerated;
                connection.StatisticsEnabled = true;
                string sql = @"set nocount off; print 'Now selecting from sys.databases';
					select * from sys.databases;";
                using (SqlCommand command = new SqlCommand(sql, connection))
                {
                    using (SqlDataAdapter adapter = new SqlDataAdapter(command))
                    {
                        using (DataSet set = new DataSet())
                        {
                            adapter.Fill(set);
                        }
                    }
                }
 
                System.Collections.IDictionary statistics = connection.RetrieveStatistics();
                object rowsAffected = statistics["SelectRows"];
                Console.WriteLine("({0} row(s) affected)", rowsAffected);
            }
        }

Which also returns:

Now selecting from sys.databases
(33 row(s) affected)

However, be aware that the number of rows returned is cumulative for the connection. The statistics that are returned contain other interesting properties, including ConnectionTime, ExecutionTime, and ServerRoundtrips, as well as data about the network traffic, including BytesSent and BytesReceived.

It is worth noting that the SqlConnection.InfoMessage event delivers more than a simple message. The SqlInfoMessageEventArgs object also contains an Errors property, which is a collection of errors being returned to the client. Using this, we can capture error information similar to what is displayed in SSMS. Let’s put a couple of errors into our SQL statement.

        public static void Test()
        {
            SqlConnectionStringBuilder sb = new SqlConnectionStringBuilder();
            sb.DataSource = @"servername";
            sb.InitialCatalog = "master";
            sb.IntegratedSecurity = true;
 
            using (SqlConnection connection = new SqlConnection(sb.ToString()))
            {
                connection.Open();
                connection.InfoMessage += OnInfoMessageGenerated;
                connection.FireInfoMessageEventOnUserErrors = true;
                string sql = "printf 'Now selecting from sys.databasii'; select * from sys.databasii;";
                using (SqlCommand command = new SqlCommand(sql, connection))
                {
                    using (SqlDataAdapter adapter = new SqlDataAdapter(command))
                    {
                        using (DataSet set = new DataSet())
                        {
                            adapter.Fill(set);
                        }
                    }
                }
            }
        }
 
        private static void OnInfoMessageGenerated(object sender, SqlInfoMessageEventArgs args)
        {
            foreach (SqlError err in args.Errors)
            {
                Console.WriteLine("Msg {0}, Level {1}, State {2}, Line {3}",
                    err.Number, err.Class, err.State, err.LineNumber);
                Console.WriteLine("{0}", err.Message);
            }
        }

The output is:

Msg 2812, Level 16, State 62, Line 1
Could not find stored procedure 'printf'.
Msg 208, Level 16, State 1, Line 1
Invalid object name 'sys.databasii'.

Creating Realistic Test Data

There are a number of tools out there to create test data, but a few months ago I wanted to try my hand at rolling my own. My initial attempt was a combination of T-SQL and C#, and it works well, but in this blog post I want to demonstrate a pure T-SQL solution. We’ll create random customer data, including first and last names, street address, city and state. This is intended to create addresses that are United States-centric, but the same principles should be applicable anywhere.

Source Data

In order for the data to appear realistic, we need to have some basis for creating that data. The U.S. Census Bureau publishes lists of first and last names, and we’ll start with that as a basis.  To keep things simple, imagine that there are only 5 last names in the world:

Name       Frequency
Smith      1.006
Johnson    0.810
Williams   0.699
Brown      0.621
Jones      0.621

We need to “normalize” the frequencies so that they sum to 1. We will also keep a running total of the normalized frequency over all previous rows (there are rounding anomalies in the results).

Name       Frequency   Normalized Frequency   Running Total
Smith      1.006       0.2678                 0.0000
Johnson    0.810       0.2156                 0.2678
Williams   0.699       0.1861                 0.4834
Brown      0.621       0.1653                 0.6694
Jones      0.621       0.1653                 0.8347

Now we can generate a random number from 0 to 1 (more on that in a bit) and use that to select a record in the table. For instance, if the random number is 0.7421, we find the last record whose running total is less than or equal to 0.7421 (the first one when sorted in descending order), which is “Brown.”
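Here is a minimal sketch of that lookup, using a hypothetical table variable holding the five names and running totals from the table above:

declare @LastName table (Name varchar(20), RunningTotal float);
insert @LastName (Name, RunningTotal)
values ('Smith', 0.0000), ('Johnson', 0.2678), ('Williams', 0.4834), ('Brown', 0.6694), ('Jones', 0.8347);

declare @r float = 0.7421;	-- the random number
select top 1 Name
from @LastName
where RunningTotal <= @r
order by RunningTotal desc;	-- returns 'Brown'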

The raw data sources I used were:

Male first names (U.S. Census Bureau)
Female first names (U.S. Census Bureau)
Last names (U.S. Census Bureau)
Cities and states (U.S. Census Bureau)
Street names (I am having a hard time finding where I originally obtained this data. If I figure it out later, I will update this post.)

I have somewhat filtered and modified the data sets, mostly to eliminate frequency information that is rounded to 0 in the raw data. I have also included StreetPrefix and StreetSuffix data sets that I came up with myself.

Let’s create a few tables to store the data:

CREATE TABLE [dbo].[Name]
(
	[ID] int NOT NULL IDENTITY(1,1) PRIMARY KEY,
	[Name] nvarchar (40) NULL,
	[NameType] char (6) NULL,
	[Fraction] float NULL
);
 
CREATE TABLE [dbo].[City]
(
	[ID] int NOT NULL IDENTITY(1,1) PRIMARY KEY,
	[City] nvarchar (150) NULL,
	[State] char (2) NULL,
	[Population] int NULL
);
 
CREATE TABLE [dbo].[StreetPrefix]
(
	[ID] int NOT NULL IDENTITY(1,1) PRIMARY KEY,
	[Prefix] nvarchar (20) NULL,
	[Weight] float NULL
);
 
CREATE TABLE [dbo].[StreetSuffix]
(
	[ID] int NOT NULL IDENTITY(1,1) PRIMARY KEY,
	[Suffix] nvarchar (20) NULL,
	[Weight] float NULL
);
 
CREATE TABLE [dbo].[Street]
(
	[ID] int NOT NULL IDENTITY(1,1) PRIMARY KEY,
	[StreetName] nvarchar (150) NULL
);

Note that each of the data sets has some sort of weighting factor except for the Street data. I don’t have any sort of frequency information, so I will just make the assumption that each street is equally likely.

Once the tables are created, download the data file and run it. The data file is a SQL script with a bunch of inserts into the tables. It’s about 200,000 lines long, but it should run fine in SSMS; it just may take a minute or two. Also note that I have combined all of the first and last names into a single table.

Finally, let’s normalize the data sets. It’s easiest to use windowing functions, so this requires at least SQL Server 2012, but with some rewriting it can be done in earlier versions as well. Note that for the names data set we separate out last names from first names (male and female combined). Since the street data does not contain any weighting information, we just use the COUNT function rather than sum. Finally, we’ll create some useful nonclustered indexes.

alter table dbo.Name add NameGroup varchar(10);
 
update dbo.Name set NameGroup = 'Last' where NameType = 'Last';
update dbo.Name set NameGroup = 'First' where NameType = 'Female';
update dbo.Name set NameGroup = 'First' where NameType = 'Male';
 
alter table dbo.Name add FractionStart float;
alter table dbo.City add FractionStart float;
alter table dbo.StreetPrefix add FractionStart float;
alter table dbo.StreetSuffix add FractionStart float;
alter table dbo.Street add FractionStart float;
 
with NameData as
(
	select FractionStart,
		1.0 * isnull(sum(Fraction) over (partition by NameGroup order by (select null) rows between unbounded preceding and 1 preceding), 0) / sum(Fraction) over (partition by NameGroup) NewFractionStart
	from dbo.Name
)
update NameData
set FractionStart = NewFractionStart;
 
with CityData as
(
	select FractionStart,
		1.0 * isnull(sum(Population) over (order by (select null) rows between unbounded preceding and 1 preceding), 0) / sum(Population) over () NewFractionStart
	from dbo.City
)
update CityData
set FractionStart = NewFractionStart;
 
with PrefixData as
(
	select FractionStart,
		1.0 * isnull(sum(Weight) over (order by (select null) rows between unbounded preceding and 1 preceding), 0) / sum(Weight) over () NewFractionStart
	from dbo.StreetPrefix
)
update PrefixData
set FractionStart = NewFractionStart;
 
with SuffixData as
(
	select FractionStart,
		1.0 * isnull(sum(Weight) over (order by (select null) rows between unbounded preceding and 1 preceding), 0) / sum(Weight) over () NewFractionStart
	from dbo.StreetSuffix
)
update SuffixData
set FractionStart = NewFractionStart;
 
with StreetData as
(
	select FractionStart,
		1.0 * isnull(count(*) over (order by (select null) rows between unbounded preceding and 1 preceding), 0) / count(*) over () NewFractionStart
	from dbo.Street
)
update StreetData
set FractionStart = NewFractionStart;
 
create nonclustered index IX_Name_NormalizedFraction on dbo.Name (NameGroup, FractionStart);
create nonclustered index IX_City_NormalizedFraction on dbo.City (FractionStart);
create nonclustered index IX_StreetPrefix_NormalizedFraction on dbo.StreetPrefix (FractionStart);
create nonclustered index IX_StreetSuffix_NormalizedFraction on dbo.StreetSuffix (FractionStart);
create nonclustered index IX_Street_NormalizedFraction on dbo.Street (FractionStart);

Now that we have data, let’s take a look at the idea of randomness.

Randomness

Unfortunately, the built-in SQL Server rand() function has a lot of limitations, not the least of which is that the function returns the same value when called in set-based operations. For instance, try running

select ID, Suffix, Weight, FractionStart, rand() RandValue from StreetSuffix;

You’ll get a different value for the RandValue column every time you run this statement, but you’ll also get the same value in each row. This makes rand() OK for row-by-row operations, but fairly useless in the set-based world.

Fortunately, the newid() function comes to the rescue. For example, consider

select ID, Suffix, Weight, FractionStart, newid() RandGuid from StreetSuffix;

Every time you run this, you will get a different GUID in each row. To turn this into a random number, we can compute the binary_checksum() of the GUID. Since binary_checksum() returns a value in the range of int, we can normalize the random value by adding a number to it to ensure the result is non-negative and then dividing by the range of int values.

select ID, Suffix, Weight, FractionStart,
	(binary_checksum(newid()) + 2147483648.) / 4294967296. RandValue
from StreetSuffix;

It’s ugly, but we now get a random value that is greater than or equal to zero and less than one.

Now we can generate a random city:

select top 1 City, State
from dbo.City
where FractionStart <= (binary_checksum(newid()) + 2147483648.) / 4294967296.
order by FractionStart desc;

If you execute this statement a number of times, you will find that it tends to favor larger cities. This is exactly what we expect. All other things being equal, realistic data will show far more customers from Los Angeles, CA than from a small midwestern town.

Random Values over Ranges of Data

So far, we’ve covered how to select a random item out of a finite list of data, but what about generating random values over ranges? In our situation, we need to create a house number as part of the address. My initial arbitrary rules were:

  • The house number should contain between 1 and 5 digits. The specific number of digits is randomly determined.
  • If the house number has 5 digits, the first digit should be 1. (This is to avoid fairly unrealistic addresses such as 87369 Main St.)
  • There should be no leading zeros in the house number.

We can generate a single random digit from 0 to 9 with this statement:

select cast((binary_checksum(newid()) + 2147483648.) / 4294967296. * 10. as int);

Now, we could string together some code that follows the above rules exactly, but what they really come down to is that we need a number from 1 to 19,999:

select 1 + cast((binary_checksum(newid()) + 2147483648.) / 4294967296. * 19999. as int);

Certainly a case could be made that house numbers will tend toward smaller values (and even better, could be correlated with the size of the associated city), but we’ll keep things simple and stick with this requirement.

Putting It Together

Now we have the data that we need, as well as a means to produce data randomly. Using the same strategy that we used before to get a random city, we can get all the different bits and pieces of the customer record:

with FirstName as
(
	select top 1 Name FirstName
	from dbo.Name
	where NameGroup = 'First'
	and FractionStart <= (binary_checksum(newid()) + 2147483648.) / 4294967296.
	order by NameGroup, FractionStart desc
), LastName as
(
	select top 1 Name LastName
	from dbo.Name
	where NameGroup = 'Last'
	and FractionStart <= (binary_checksum(newid()) + 2147483648.) / 4294967296.
	order by FractionStart desc
), HouseNumber as
(
	select 1 + cast((binary_checksum(newid()) + 2147483648.) / 4294967296. * 19999. as int) HouseNumber
), StreetPrefix as
(
	select top 1 Prefix
	from dbo.StreetPrefix
	where FractionStart <= (binary_checksum(newid()) + 2147483648.) / 4294967296.
	order by FractionStart desc
), StreetName as
(
	select top 1 StreetName
	from dbo.Street
	where FractionStart <= (binary_checksum(newid()) + 2147483648.) / 4294967296.
	order by FractionStart desc
), StreetSuffix as
(
	select top 1 Suffix
	from dbo.StreetSuffix
	where FractionStart <= (binary_checksum(newid()) + 2147483648.) / 4294967296.
	order by FractionStart desc
), CityState as
(
	select top 1 City, State
	from dbo.City
	where FractionStart <= (binary_checksum(newid()) + 2147483648.) / 4294967296.
	order by FractionStart desc
), RandomCustomer as
(
	select	FirstName.FirstName, 
			LastName.LastName, 
			cast(HouseNumber.HouseNumber as varchar(5)) + ' ' + StreetPrefix.Prefix + ' ' + StreetName.StreetName + ' ' + StreetSuffix.Suffix Address,
			CityState.City,
			CityState.State
	from	FirstName, LastName, HouseNumber, StreetPrefix, StreetName, StreetSuffix, CityState
)
select *
from RandomCustomer;

It would be nice if we could generate multiple customers at once, but here we run into some trouble. If we try joining the above to a tally table, we might hope that we get a whole bunch of customers. But, alas, the optimizer outsmarts us and all we get is repeated rows (well, mostly repeated; the house number does in fact change from row to row, but otherwise it’s all the same output). Nor can we wrap the above statement into a table-valued function because the newid() function is not allowed.

There is a solution, however. It’s a bit clunky, but it does work. Let’s create a TVF where the random values are passed in as parameters:

if exists (select * from sys.objects where name = 'fnGenerateRandomCustomer' and type = 'IF')
	drop function fnGenerateRandomCustomer;
go
create function fnGenerateRandomCustomer(@r1 float, @r2 float, @r3 float, @r4 float, @r5 float, @r6 float, @r7 float)
returns table
as
return
with FirstName as
(
	select top 1 Name FirstName
	from dbo.Name
	where NameGroup = 'First'
	and FractionStart <= @r1
	order by NameGroup, FractionStart desc
), LastName as
(
	select top 1 Name LastName
	from dbo.Name
	where NameGroup = 'Last'
	and FractionStart <= @r2
	order by FractionStart desc
), HouseNumber as
(
	select 1 + cast(@r3 * 19999. as int) HouseNumber
), StreetPrefix as
(
	select top 1 Prefix
	from dbo.StreetPrefix
	where FractionStart <= @r4
	order by FractionStart desc
), StreetName as
(
	select top 1 StreetName
	from dbo.Street
	where FractionStart <= @r5
	order by FractionStart desc
), StreetSuffix as
(
	select top 1 Suffix
	from dbo.StreetSuffix
	where FractionStart <= @r6
	order by FractionStart desc
), CityState as
(
	select top 1 City, State
	from dbo.City
	where FractionStart <= @r7
	order by FractionStart desc
), RandomCustomer as
(
	select	FirstName.FirstName, 
			LastName.LastName, 
			cast(HouseNumber.HouseNumber as varchar(5)) + ' ' + StreetPrefix.Prefix + ' ' + StreetName.StreetName + ' ' + StreetSuffix.Suffix Address,
			CityState.City,
			CityState.State
	from	FirstName, LastName, HouseNumber, StreetPrefix, StreetName, StreetSuffix, CityState
)
select *
from RandomCustomer;

Then we invoke the function by generating the random values and passing them to the function. We’ll generate a virtual Nums table to control the number of customers produced, and then generate seven random numbers to be passed to the TVF we just created.

declare @customersToGenerate int = 10;
 
with l0 as (select 1 v union all select 1), l1 as (select a.v from l0 a, l0), l2 as (select a.v from l1 a, l1),
l3 as (select a.v from l2 a, l2), l4 as (select a.v from l3 a, l3), l5 as (select a.v from l4 a, l4),
Nums as (select row_number() over (order by (select null)) n from l5),
RandomValues as
(
	select	(binary_checksum(newid()) + 2147483648.) / 4294967296. r1,
			(binary_checksum(newid()) + 2147483648.) / 4294967296. r2,
			(binary_checksum(newid()) + 2147483648.) / 4294967296. r3,
			(binary_checksum(newid()) + 2147483648.) / 4294967296. r4,
			(binary_checksum(newid()) + 2147483648.) / 4294967296. r5,
			(binary_checksum(newid()) + 2147483648.) / 4294967296. r6,
			(binary_checksum(newid()) + 2147483648.) / 4294967296. r7
	from	Nums n
	where	n.n <= @customersToGenerate
)
select cust.FirstName, cust.LastName, cust.Address, cust.City, cust.State
from RandomValues
cross apply dbo.fnGenerateRandomCustomer(r1, r2, r3, r4, r5, r6, r7) cust;

Here is the output from running this script. As you would expect, the results are going to be different every time.

RandomCustomerData

If you look at the actual query plan, you can see that the system is doing an index seek and key lookup on each of the data tables for each customer generated, which is not surprising given the random nature of the query. This means that as the number of customers generated increases, there is a lot of overhead doing those key lookups. However, it is still reasonably efficient. On my machine I was able to insert 1,000,000 randomly generated customers into a table in about 30 seconds.

Summary

I have put together this post merely to present an example of how randomized test data can be generated. The focus has been on generating customer names and addresses, but the same principles can be applied to creating a wide variety of data types. It’s not perfect, but it can suffice nicely for a lot of situations.

Convert Local Time to UTC in SQL Server

I have occasionally had to convert from local time to UTC in SQL Server, but unfortunately there is no built-in functionality to do this. I will share one solution in this post to accomplish this conversion.

By far, the hardest part of the conversion has been accounting for daylight saving time. The rules behind DST vary greatly in different parts of the world and change periodically. The solution that I present here works only for the United States, as accounting for every possible combination would be far more complex. Even for the United States, this code doesn’t account for everything; however, for my purposes, this has been more than adequate.

A little research shows that in the United States, there was no uniform DST rule prior to 1966, and I am making the simplification to ignore DST from 1965 and earlier. One set of rules applied from 1966 to 1986 (last Sunday in April to last Sunday in October). There were some exceptions in 1974 and 1975, but I am ignoring that here. From 1987 to 2006, DST was in effect from the first Sunday in April to the last Sunday in October, and since 2007 it is in effect from the second Sunday in March until the first Sunday in November. The change always happens at 2:00 am local time.

In order to implement this code, I first create a small code table that tells us when DST starts and ends. This table, as well as all of the functions described in this post, should be created in an administrative database. The table structure is:

create table DstRanges
(
	RangeID int not null identity(1,1) primary key clustered,
	RangeStart datetime null,
	RangeEnd datetime null,
	IsDst bit not null
);
 
create nonclustered index ix_DstRanges__RangeStart_RangeEnd on DstRanges (RangeStart, RangeEnd);
go

Next, I create a helper function that will be used to populate the DstRanges table. This function can be dropped after the load is completed.

create function fn_GetDSTBeginEnd (@date datetime)
returns @dates table (dst_begin datetime, dst_end datetime)
as
begin
	declare	@year int,
			@dst_begin datetime,
			@dst_end datetime;
 
	-- determine if daylight saving time is in effect (all of the following rules based in the US)
 
	select	@year = datepart(year, @date);
 
	-- From 1966 to 1986, daylight saving in effect from:
	-- last Sunday in April until the last Sunday in October
	-- (some exception for 1974-1975, but those will not be accounted for here)
 
	if	@year >= 1966 and @year <= 1986
	begin
		-- start by getting May 1 and November 1
		select	@dst_begin = cast(cast(@year as varchar(20)) + '-05-01' as datetime),
				@dst_end = cast(cast(@year as varchar(20)) + '-11-01' as datetime);
		-- set to first Sunday in May and first Sunday in November
		select	@dst_begin = dateadd(day, 7 - ((datepart(weekday, @dst_begin) + @@datefirst - 1) % 7), @dst_begin),
				@dst_end = dateadd(day, 7 - ((datepart(weekday, @dst_end) + @@datefirst - 1) % 7), @dst_end);
		-- back one week to get last Sunday in April and last Sunday in October
		select	@dst_begin = dateadd(day, -7, @dst_begin),
				@dst_end = dateadd(day, -7, @dst_end);
		-- add two hours because the clock change happens at 2:00 am
		select	@dst_begin = dateadd(hour, 2, @dst_begin),
				@dst_end = dateadd(hour, 2, @dst_end);
	end;
 
	-- From 1987 to 2006, daylight saving in effect from:
	-- first Sunday in April until the last Sunday in October
 
	else if	@year >= 1987 and @year <= 2006
	begin
		-- start by getting April 1 and November 1
		select	@dst_begin = cast(cast(@year as varchar(20)) + '-04-01' as datetime),
				@dst_end = cast(cast(@year as varchar(20)) + '-11-01' as datetime);
		-- set to first Sunday in April and first Sunday in November
		select	@dst_begin = dateadd(day, 7 - ((datepart(weekday, @dst_begin) + @@datefirst - 1) % 7), @dst_begin),
				@dst_end = dateadd(day, 7 - ((datepart(weekday, @dst_end) + @@datefirst - 1) % 7), @dst_end);
		-- back off end date by one week to get last Sunday in October
		select	@dst_end = dateadd(day, -7, @dst_end);
		-- add two hours because the clock change happens at 2:00 am
		select	@dst_begin = dateadd(hour, 2, @dst_begin),
				@dst_end = dateadd(hour, 2, @dst_end);
	end;
 
	-- From 2007 onward, daylight saving in effect from:
	-- second Sunday in March until the first Sunday in November
 
	else if	@year >= 2007
	begin
		-- start by getting March 1 and November 1
		select	@dst_begin = cast(cast(@year as varchar(20)) + '-03-01' as datetime),
				@dst_end = cast(cast(@year as varchar(20)) + '-11-01' as datetime);
		-- set to first Sunday in March and first Sunday in November
		select	@dst_begin = dateadd(day, (7 - ((datepart(weekday, @dst_begin) + @@datefirst - 1) % 7))%7, @dst_begin),
				@dst_end = dateadd(day, (7 - ((datepart(weekday, @dst_end) + @@datefirst - 1) % 7))%7, @dst_end);
		-- forward the begin date by one week to get the second Sunday in March
		select	@dst_begin = dateadd(day, 7, @dst_begin);
		-- add two hours because the clock change happens at 2:00 am
		select	@dst_begin = dateadd(hour, 2, @dst_begin),
				@dst_end = dateadd(hour, 2, @dst_end);
	end;
 
	if  @date >= @dst_begin
	and	@date < @dst_end
	begin
		-- spring forward
		select	@date = dateadd(hour, 1, @date);
	end;
 
	insert	@dates (dst_begin, dst_end) values (@dst_begin, @dst_end);
	return;
end
go

And next, the table can be populated with this code. In this example the table will be loaded through the year 2099, assuming that the current rules are not changed.

create table #year
(
	YearNbr int not null,
	DstStart datetime null,
	DstEnd datetime null
);
 
with l0 as (select 1 v union all select 1), l1 as (select a.v from l0 a, l0), l2 as (select a.v from l1 a, l1),
l3 as (select a.v from l2 a, l2), l4 as (select a.v from l3 a, l3), l5 as (select a.v from l4 a, l4),
nums as (select row_number() over (order by (select null)) n from l5)
insert #year (YearNbr)
select n.n
from nums n
where n.n >= 1966
and n.n <= 2099;
 
update y
set y.DstStart = dst.dst_begin,
	y.DstEnd = dst.dst_end
from #year y cross apply (select * from fn_GetDSTBeginEnd(datefromparts(y.YearNbr, 1, 1))) dst;
 
with Dst as
(
	select	y.YearNbr,
			lag(y.DstEnd, 1, null) over (order by y.YearNbr) PrevDstEnd,
			y.DstStart,
			y.DstEnd
	from	#year y
)
insert DstRanges (RangeStart, RangeEnd, IsDst)
select Dst.PrevDstEnd RangeStart, DstStart RangeEnd, 0
from Dst
union all
select DstStart RangeStart, DstEnd RangeEnd, 1
from Dst
order by RangeStart;
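A quick sanity check on the loaded ranges:

select top 10 RangeID, RangeStart, RangeEnd, IsDst
from DstRanges
order by RangeStart;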

Finally, we define a pair of table-valued functions (one helper function, and one intended to be called by the user). Be sure to change the name of your Admin database in the two places it is referenced (once in fn_GetDSTInfo and once in fn_LocalTimeToUTC).

if exists (select * from sys.objects where name = 'fn_GetDSTInfo')
	drop function fn_GetDSTInfo
go
create function fn_GetDSTInfo (@TargetDate datetime2)
returns table
as
return
	select		td.TargetDate,
				isnull(r.IsDst, 0) IsDst
	from		(select @TargetDate TargetDate) td
	left join	Admin.dbo.DstRanges r
	on			r.RangeStart <= td.TargetDate
	and			r.RangeEnd > td.TargetDate;
go
if exists (select * from sys.objects where name = 'fn_LocalTimeToUTC')
	drop function fn_LocalTimeToUTC;
go
create function fn_LocalTimeToUTC (@LocalTime datetime2, @UtcOffsetHours int, @UsesDst bit)
returns table
as
return
	with OffsetInfo as
	(
		select		@LocalTime LocalTime,
					@UtcOffsetHours UtcOffsetHours,
					case when @UsesDst = 1 and dst.IsDst = 1 then 1 else 0 end DstOffsetHours
		from		Admin.dbo.fn_GetDSTInfo(@LocalTime) dst
	)
	select	OffsetInfo.LocalTime,
            OffsetInfo.UtcOffsetHours,
            OffsetInfo.DstOffsetHours,
			dateadd(hour, -(OffsetInfo.UtcOffsetHours + OffsetInfo.DstOffsetHours), OffsetInfo.LocalTime) UtcTime
	from	OffsetInfo;
go

To use this, we call the fn_LocalTimeToUTC function. It has three parameters.

  • The local time, expressed as a datetime2
  • The offset of local time (ignoring daylight saving) from UTC
  • A bit indicating if daylight saving time is observed locally

For example, I am in the Central Time Zone (-6 hours from UTC) and daylight saving is observed, so I would call:

select * from fn_LocalTimeToUTC(sysdatetime(), -6, 1);

And the output is:

LocalTimeToUTC

Note that there is ambiguity between 1:00 am and 2:00 am when DST ends in the fall, and there is no good way to resolve that ambiguity. This code presumes that DST is still in effect.
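For example (assuming the post-2007 rules as loaded above), 1:30 am on November 6, 2016 occurs twice, and this code treats it as still being on daylight saving time:

select * from fn_LocalTimeToUTC('2016-11-06 01:30:00', -6, 1);
-- DstOffsetHours is returned as 1, so the UTC time comes back as 2016-11-06 06:30:00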

Because fn_LocalTimeToUTC is written as a TVF, it can be efficiently used in set-based operations, usually using an OUTER APPLY operator. For example:

select oh.OrderId, oh.OrderDate, utc.UtcTime
from OrderHeader oh
outer apply Admin.dbo.fn_LocalTimeToUTC(oh.OrderDate, -6, 1) utc
where oh.OrderId >= 50
and oh.OrderId <= 59;

LocalTimeToUTCSetBased

Always Encrypted Storage Overhead

The new Always Encrypted feature in SQL Server 2016 adds some overhead to the storage requirements of a column. I wanted to do some analysis to determine just how much overhead is incurred for various data types. This test was run against CTP 3.2.

Methodology

After creating a database along with a column master key and the column encryption key, I created four tables to store the encrypted data.

    if exists (select * from sys.tables where name = 'VarcharAnalysis')
    	drop table VarcharAnalysis;
    create table VarcharAnalysis
    (
    	CharCount int null,
    	CharData varchar(8000) collate Latin1_General_BIN2
    		encrypted with (column_encryption_key = [CEK_Auto1], 
    		encryption_type = Deterministic, 
    		algorithm = 'AEAD_AES_256_CBC_HMAC_SHA_256') null
    );
     
    create clustered index cl_VarcharAnalysis on VarcharAnalysis (CharCount);
     
    if exists (select * from sys.tables where name = 'NVarcharAnalysis')
    	drop table NVarcharAnalysis;
    create table NVarcharAnalysis
    (
    	CharCount int null,
    	CharData nvarchar(4000) collate Latin1_General_BIN2
    		encrypted with (column_encryption_key = [CEK_Auto1], 
    		encryption_type = Deterministic, 
    		algorithm = 'AEAD_AES_256_CBC_HMAC_SHA_256') null
    );
     
    create clustered index cl_NVarcharAnalysis on NVarcharAnalysis (CharCount);
     
    if exists (select * from sys.tables where name = 'VarbinaryAnalysis')
    	drop table VarbinaryAnalysis;
    create table VarbinaryAnalysis
    (
    	ByteCount int null,
    	BinaryData varbinary(8000)
    		encrypted with (column_encryption_key = [CEK_Auto1], 
    		encryption_type = Deterministic, 
    		algorithm = 'AEAD_AES_256_CBC_HMAC_SHA_256') null
    );
     
    create clustered index cl_VarbinaryAnalysis on VarbinaryAnalysis (ByteCount);
     
    if exists (select * from sys.tables where name = 'FixedSizeAnalysis')
    	drop table FixedSizeAnalysis;
    create table FixedSizeAnalysis
    (
    	TinyIntData tinyint
    		encrypted with (column_encryption_key = [CEK_Auto1], 
    		encryption_type = Deterministic, 
    		algorithm = 'AEAD_AES_256_CBC_HMAC_SHA_256') null,
    	SmallIntData smallint
    		encrypted with (column_encryption_key = [CEK_Auto1], 
    		encryption_type = Deterministic, 
    		algorithm = 'AEAD_AES_256_CBC_HMAC_SHA_256') null,
    	IntData int
    		encrypted with (column_encryption_key = [CEK_Auto1], 
    		encryption_type = Deterministic, 
    		algorithm = 'AEAD_AES_256_CBC_HMAC_SHA_256') null,
    	BigIntData bigint
    		encrypted with (column_encryption_key = [CEK_Auto1], 
    		encryption_type = Deterministic, 
    		algorithm = 'AEAD_AES_256_CBC_HMAC_SHA_256') null,
    	DecimalData decimal(38,0)
    		encrypted with (column_encryption_key = [CEK_Auto1], 
    		encryption_type = Deterministic, 
    		algorithm = 'AEAD_AES_256_CBC_HMAC_SHA_256') null,
    	BitData bit
    		encrypted with (column_encryption_key = [CEK_Auto1], 
    		encryption_type = Deterministic, 
    		algorithm = 'AEAD_AES_256_CBC_HMAC_SHA_256') null,
    	DateTimeData datetime
    		encrypted with (column_encryption_key = [CEK_Auto1], 
    		encryption_type = Deterministic, 
    		algorithm = 'AEAD_AES_256_CBC_HMAC_SHA_256') null,
    	DateTime2Data datetime2
    		encrypted with (column_encryption_key = [CEK_Auto1], 
    		encryption_type = Deterministic, 
    		algorithm = 'AEAD_AES_256_CBC_HMAC_SHA_256') null,
    	GuidData uniqueidentifier
    		encrypted with (column_encryption_key = [CEK_Auto1], 
    		encryption_type = Deterministic, 
    		algorithm = 'AEAD_AES_256_CBC_HMAC_SHA_256') null,
    );

I then created a small C# application to first insert data into the tables, and then to extract the encrypted values out of the tables.

    using (Dal dal = new Dal(instanceName, databaseName, userName, password, true))
    {
    	dal.InsertVarcharAnalysis(null);
    	for (int i = 0; i <= 8000; i++)
    	{
    		string data = new string('*', i);
    		dal.InsertVarcharAnalysis(data);
    	}
    }
     
    using (Dal dal = new Dal(instanceName, databaseName, userName, password, true))
    {
    	dal.InsertNVarcharAnalysis(null);
    	for (int i = 0; i <= 4000; i++)
    	{
    		string data = new string('*', i);
    		dal.InsertNVarcharAnalysis(data);
    	}
    }
     
    using (Dal dal = new Dal(instanceName, databaseName, userName, password, true))
    {
    	dal.InsertVarbinaryAnalysis(null);
    	byte[] template = new byte[8000];
    	Enumerable.Range(0, 8000).ToList().ForEach(i => template[i] = (byte)(i % 256));
    	for (int i = 0; i <= 8000; i++)
    	{
    		byte[] data = new byte[i];
    		Array.Copy(template, data, i);
    		dal.InsertVarbinaryAnalysis(data);
    	}
    }
     
     
    using (Dal dal = new Dal(instanceName, databaseName, userName, password, true))
    {
    	dal.InsertFixedSizeAnalysis(null, null, null, null, null, null, null, null, null);
    	dal.InsertFixedSizeAnalysis(byte.MaxValue, short.MaxValue, int.MaxValue, long.MaxValue, decimal.MaxValue, true,
    			DateTime.Now, DateTime.Now, Guid.NewGuid());
    }

    In my Dal object’s constructor, the final parameter is a boolean value indicating whether to set the Column Encryption Setting to “Enabled”. This is necessary because the inserts must be done when encryption is enabled, but we need to read back the tables with encryption disabled so that we get back the raw, encrypted bytes instead of the decrypted data.
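
    For reference, here is a minimal sketch of how that final constructor parameter might map onto the connection. This assumes the connection is built with System.Data.SqlClient from .NET Framework 4.6 or later; the class shape and member names below are illustrative only, not the actual Dal implementation.

    using System;
    using System.Data.SqlClient;
     
    public sealed class DalSketch : IDisposable
    {
    	private readonly SqlConnection _connection;
     
    	public DalSketch(string instanceName, string databaseName, string userName, string password,
    		bool enableColumnEncryption)
    	{
    		var builder = new SqlConnectionStringBuilder
    		{
    			DataSource = instanceName,
    			InitialCatalog = databaseName,
    			UserID = userName,
    			Password = password,
    			// The final constructor parameter controls this setting; with it disabled,
    			// queries return the raw varbinary ciphertext rather than decrypted values.
    			ColumnEncryptionSetting = enableColumnEncryption
    				? SqlConnectionColumnEncryptionSetting.Enabled
    				: SqlConnectionColumnEncryptionSetting.Disabled
    		};
     
    		_connection = new SqlConnection(builder.ConnectionString);
    		_connection.Open();
    	}
     
    	public void Dispose()
    	{
    		_connection.Dispose();
    	}
    }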

    The analysis code looks like this:

    using (Dal dal = new Dal(instanceName, databaseName, userName, password, false))
    {
    	List<CharDataRecord> records = dal.ReadVarcharAnalysis();
    	int? previousRecordSize = int.MinValue;
    	foreach (CharDataRecord record in records)
    	{
    		if (record.Data?.Length != previousRecordSize)
    		{
    			Console.WriteLine("{0} - {1}", record.Length, record.Data?.Length);
    			previousRecordSize = record.Data?.Length;
    		}
    	}
    }
     
    using (Dal dal = new Dal(instanceName, databaseName, userName, password, false))
    {
    	List<CharDataRecord> records = dal.ReadNVarcharAnalysis();
    	int? previousRecordSize = int.MinValue;
    	foreach (CharDataRecord record in records)
    	{
    		if (record.Data?.Length != previousRecordSize)
    		{
    			Console.WriteLine("{0} - {1}", record.Length, record.Data?.Length);
    			previousRecordSize = record.Data?.Length;
    		}
    	}
    }
     
    using (Dal dal = new Dal(instanceName, databaseName, userName, password, false))
    {
    	List<CharDataRecord> records = dal.ReadVarbinaryAnalysis();
    	int? previousRecordSize = int.MinValue;
    	foreach (CharDataRecord record in records)
    	{
    		if (record.Data?.Length != previousRecordSize)
    		{
    			Console.WriteLine("{0} - {1}", record.Length, record.Data?.Length);
    			previousRecordSize = record.Data?.Length;
    		}
    	}
    }
     
    using (Dal dal = new Dal(instanceName, databaseName, userName, password, false))
    {
    	List<FixedSizeDataRecord> records = dal.ReadFixedSizeAnalysis();
    	foreach (FixedSizeDataRecord record in records)
    	{
    		Console.WriteLine("TinyInt: {0}", record.TinyIntData?.Length);
    		Console.WriteLine("SmallInt: {0}", record.SmallIntData?.Length);
    		Console.WriteLine("Int: {0}", record.IntData?.Length);
    		Console.WriteLine("BigInt: {0}", record.BigIntData?.Length);
    		Console.WriteLine("Decimal: {0}", record.DecimalData?.Length);
    		Console.WriteLine("Bit: {0}", record.BitData?.Length);
    		Console.WriteLine("DateTime: {0}", record.DateTimeData?.Length);
    		Console.WriteLine("DateTime2: {0}", record.DateTime2Data?.Length);
    		Console.WriteLine("Guid: {0}", record.GuidData?.Length);
    	}
    }

    Results

    Let’s start by looking at the varchar data. Here are the first few rows in the table:

    VarcharAnalysisTopRows

    One thing to note is that, not surprisingly, a NULL input to the encryption function yields a NULL output. This was true everywhere in this test. However, a blank (zero-length) string generates 65 bytes of output, as does an input containing 1 character, 2 characters, and so on. At 16 characters of input, the output grows to 81 bytes, and when the input reaches 32 characters the output is 97 bytes.

    Input Size (characters)    Output Size (bytes)
    0                          65
    16                         81
    32                         97
    48                         113
    64                         129
    80                         145
    96                         161
    112                        177
    128                        193

    The pattern is quickly apparent. We can express this relationship very simply as:

    C = 16 * floor(P / 16) + 65

    where P is the input plaintext size in characters and C is the output cryptotext size in bytes.
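
    As a quick sanity check, here is a small throwaway console program (not part of the test harness) that reproduces the observed sizes from this formula; note that C# integer division performs the floor for non-negative values:

    using System;
     
    static class CipherSizeDemo
    {
    	// C = 16 * floor(P / 16) + 65, where P is the plaintext size in bytes.
    	static int EncryptedSize(int plaintextBytes)
    	{
    		return 16 * (plaintextBytes / 16) + 65;
    	}
     
    	static void Main()
    	{
    		// Reproduces the table above: 0 -> 65, 16 -> 81, 32 -> 97, ..., 128 -> 193.
    		foreach (int p in new[] { 0, 16, 32, 48, 64, 80, 96, 112, 128 })
    			Console.WriteLine("{0,3} -> {1}", p, EncryptedSize(p));
    	}
    }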

    Next, the nvarchar data. This has similar growth patterns but, not surprisingly, the encryption output grows at 8-character boundaries.

    Input Size (characters)    Output Size (bytes)
    0                          65
    8                          81
    16                         97
    24                         113
    32                         129
    40                         145
    48                         161
    56                         177
    64                         193
    72                         209

    C = 16 * floor(P / 8) + 65

    This is the same result as before, once we take into account that each input character is now two bytes instead of one.

    Varbinary data, as you might expect, follows the same pattern as varchar data:

    Input Size (bytes)    Output Size (bytes)
    0                     65
    16                    81
    32                    97
    48                    113
    64                    129
    80                    145
    96                    161
    112                   177
    128                   193

    Finally, let’s look at fixed-width data. It turns out that the tinyint, smallint, int, and bigint data types each occupy 65 bytes once encrypted, as do the bit, datetime, and datetime2 columns. The decimal and uniqueidentifier data types require 81 bytes when they are encrypted. Again, this fits the pattern we have observed so far (a quick arithmetic check follows the table below).

    Data Type           Unencrypted (bytes)    Encrypted (bytes)
    tinyint             1                      65
    smallint            2                      65
    int                 4                      65
    bigint              8                      65
    decimal(38,0)       17                     81
    bit                 0.125                  65
    datetime            8                      65
    datetime2           8                      65
    uniqueidentifier    16                     81
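
    As promised, here is that quick arithmetic check (another throwaway sketch, not part of the test harness) showing that these values are just the same formula applied to each type’s storage size:

    using System;
    using System.Collections.Generic;
     
    static class FixedSizeCheck
    {
    	static int EncryptedSize(int storageBytes)
    	{
    		return 16 * (storageBytes / 16) + 65;
    	}
     
    	static void Main()
    	{
    		// Storage sizes, in bytes, of the fixed-width types tested above.
    		var storage = new Dictionary<string, int>
    		{
    			{ "tinyint", 1 }, { "smallint", 2 }, { "int", 4 }, { "bigint", 8 },
    			{ "decimal(38,0)", 17 }, { "datetime", 8 }, { "datetime2", 8 },
    			{ "uniqueidentifier", 16 }
    		};
     
    		foreach (var pair in storage)
    			Console.WriteLine("{0,-18} {1,2} -> {2}", pair.Key, pair.Value, EncryptedSize(pair.Value));
     
    		// bit occupies only a fraction of a byte on disk, but still encrypts to the 65-byte minimum.
    	}
    }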

    Conclusions

    Throughout this analysis, the size of the encrypted data is consistently

    C = 16 * floor(P / 16) + 65

    where P is the size, in bytes, of the unencrypted data, and C is the resulting size, again in bytes, of the encrypted data. This means that small inputs incur a significant amount of overhead, but as the input grows the overhead becomes only a tiny fraction. The greatest overhead is observed with the bit data type, where the encrypted data size is 520 times the unencrypted data size.
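
    To put that overhead figure in perspective, here is a quick back-of-the-envelope calculation (again a throwaway sketch, taking the bit storage size as 1/8 of a byte):

    using System;
     
    static class OverheadDemo
    {
    	static int EncryptedSize(int plaintextBytes)
    	{
    		return 16 * (plaintextBytes / 16) + 65;
    	}
     
    	static void Main()
    	{
    		// A bit column stores 1/8 of a byte but encrypts to 65 bytes: 65 / 0.125 = 520.
    		Console.WriteLine("bit overhead factor: {0}", 65 / 0.125);
     
    		// At the other extreme, 8000 bytes of plaintext encrypts to 16 * 500 + 65 = 8065 bytes,
    		// an overhead of less than one percent.
    		int c = EncryptedSize(8000);
    		Console.WriteLine("8000 -> {0} bytes ({1:P2} overhead)", c, (c - 8000) / 8000.0);
    	}
    }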

    Coda

    You may have noticed that I created all of the tables with the encryption type set to Deterministic. I re-ran this test using Randomized encryption throughout, and the results are identical.

    Observe that these results are based on the encrypted data size after the initial insert into the table. I’m not sure if there is anything later in the process, such as key rotation, that may cause the data size to grow, so keep that limitation in mind.

    Always Encrypted and sys.dm_exec_describe_first_result_set

    Suppose we have the following table created in SQL Server 2016 CTP 3.2 (presuming that the encryption keys have already been created):

    create table dbo.Customer
    (
    	CustomerID int NULL,
    	FirstName varchar(50) collate Latin1_General_BIN2
    		ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = CEK_Auto1,
    		ENCRYPTION_TYPE = Deterministic,
    		ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256') NULL,
    	LastName varchar(50) collate Latin1_General_BIN2
    		ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = CEK_Auto1,
    		ENCRYPTION_TYPE = Deterministic,
    		ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256') NULL,
    	Address varchar(50) NULL,
    	City varchar(50) NULL,
    	State varchar(50) NULL
    );

    Now suppose we connect to the database using a connection where Column Encryption Setting=Disabled and execute:

    select column_ordinal, name, system_type_name
    from sys.dm_exec_describe_first_result_set(N'select * from dbo.Customer', null, null);

    Then we get the following results:

    DescribeFirstResultSet_EncryptionDisabled

    The first and last name columns, rather than coming back as varchar(50), are instead shown as varbinary(113). This isn’t too surprising, given that we are on an unencrypted connection, and if we actually execute the select against the Customer table we will, in fact, get back binary data.
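
    Incidentally, that 113-byte size lines up with the size formula from the analysis above: for a varchar(50) column, 16 * floor(50 / 16) + 65 = 16 * 3 + 65 = 113.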

    So surely if we use a connection where Column Encryption Setting=Enabled, we will get the correct data type back, right?

    Nope.

    DescribeFirstResultSet_EncryptionEnabled

    We still get back varbinary(113).

    I’m not sure how I feel about these results. On the one hand, the results are consistent regardless of the column encryption setting, and they do reflect the true data type at the storage engine level. However, the real usefulness of sys.dm_exec_describe_first_result_set (and its companion stored procedure sys.sp_describe_first_result_set, which gives identical results) is the ability to discover the data types coming back from a particular T-SQL statement. If used as intended (that is, on a connection with the column encryption setting enabled), the metadata coming back doesn’t match what will be returned by the actual query, significantly reducing the value of this DMF.

    The official documentation for the DMF doesn’t address the impact of Always Encrypted, and the wording is sufficiently vague (“describes the metadata”) that it’s hard to say what the “correct” results should be in this case.

    Of course, it could very well be something that Microsoft is already planning to address before RTM.

    Defeating Dynamic Data Masking

    I can’t recall exactly where I heard it or who said it, but at some point over this past summer someone mentioned that the new dynamic data masking feature in SQL Server 2016 could easily be worked around by selecting the data into a temporary table. I shelved that thought for several months, and then this past weekend I started playing with it, only to find that the issue had been fixed in CTP 2.4. However, I wanted to document some of the things I tried, just to show that it isn’t so easy anymore to defeat the benefits of this new feature. These tests were run in CTP 3.2.

    Let’s start with the sample table from Books Online:

    CREATE TABLE Membership
      (MemberID int IDENTITY PRIMARY KEY,
       FirstName varchar(100) NULL,
       LastName varchar(100) NOT NULL,
       Phone# varchar(12) NULL,
       EMail varchar(100) NULL);
     
    ALTER TABLE Membership ALTER COLUMN FirstName ADD MASKED WITH (FUNCTION = 'partial(1,"XXXXXXX",0)');
    ALTER TABLE Membership ALTER COLUMN Phone# ADD MASKED WITH (FUNCTION = 'default()');
    ALTER TABLE Membership ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');
     
    INSERT Membership (FirstName, LastName, Phone#, EMail) VALUES 
    ('Roberto', 'Tamburello', '555.123.4567', 'RTamburello@contoso.com'),
    ('Janice', 'Galvin', '555.123.4568', 'JGalvin@contoso.com.co'),
    ('Zheng', 'Mu', '555.123.4569', 'ZMu@contoso.net');
     
    CREATE USER TestUser WITHOUT LOGIN;
    GRANT SELECT ON Membership TO TestUser;

    We can test that DDM is working by running:

    EXECUTE AS USER = 'TestUser';
    SELECT * FROM Membership;
    REVERT;

    The results are:

    ddmResults1

    Next, we will try to select the results into a temporary table and then select from the temp table.

    EXECUTE AS USER = 'TestUser';
    if object_id('tempdb..#Membership') is not null
    	drop table #Membership;
    SELECT * into #Membership FROM Membership;
    select * from #Membership;
    REVERT;

    The results are identical.

    ddmResults2

    In some cases attributes of the table get copied as part of a SELECT … INTO statement, so the next test is to explicitly create the temp table:

    EXECUTE AS USER = 'TestUser';
    if object_id('tempdb..#Membership') is not null
    	drop table #Membership;
    CREATE TABLE #Membership
      (MemberID int IDENTITY PRIMARY KEY,
       FirstName varchar(100) NULL,
       LastName varchar(100) NOT NULL,
       Phone# varchar(12) NULL,
       EMail varchar(100) NULL);
     
    set identity_insert #Membership on;
     
    insert #Membership (MemberID, FirstName, LastName, [Phone#], Email)
    SELECT MemberID, FirstName, LastName, [Phone#], Email FROM Membership;
     
    set identity_insert #Membership off;
     
    select * from #Membership;
    REVERT;

    The results are identical:

    ddmResults3

    What if we do something very similar, but use one of the columns in an expression?

    EXECUTE AS USER = 'TestUser';
    if object_id('tempdb..#Membership') is not null
    	drop table #Membership;
    CREATE TABLE #Membership
      (MemberID int IDENTITY PRIMARY KEY,
       FirstName varchar(100) NULL,
       LastName varchar(100) NOT NULL,
       Phone# varchar(12) NULL,
       EMail varchar(100) NULL);
    set identity_insert #Membership on;
     
    insert #Membership (MemberID, FirstName, LastName, [Phone#], Email)
    SELECT MemberID, FirstName + '', LastName, [Phone#], Email FROM Membership; 
    set identity_insert #Membership off;
    select * from #Membership;
    REVERT;

    ddmResults4

    Interestingly, this causes even the unmasked first character to be lost: the FirstName column now shows only “xxxx”!

    We can also check whether the “masked” attribute gets copied to a temp table created by a SELECT … INTO statement.

    EXECUTE AS USER = 'TestUser';
    if object_id('tempdb..#Membership') is not null
    	drop table #Membership;
    SELECT * into #Membership FROM Membership;
     
    alter table #Membership alter column FirstName drop masked;
     
    select * from #Membership;
    REVERT;

    Msg 16007, Level 16, State 0, Line 32
    The column 'FirstName' does not have a data masking function.

    Nope, the masking attribute doesn’t get copied to the temp table.

    Next, we try selecting into a variable first, then seeing what is in the variable:

    EXECUTE AS USER = 'TestUser';
    declare @FirstName varchar(100);
    select @FirstName = FirstName from Membership where MemberID = 1;
    select @FirstName;
    REVERT;

    Now we get:

    ddmResults5

    Let’s select the characters individually. Based on what we’ve seen to this point, this probably isn’t going to work.

    EXECUTE AS USER = 'TestUser';
    declare @c1 char(1), @c2 char(1), @c3 char(1), @c4 char(1), @c5 char(1), @c6 char(1), @c7 char(1)
    select @c1 = substring(FirstName, 1, 1), @c2 = substring(FirstName, 2, 1), @c3 = substring(FirstName, 3, 1), @c4 = substring(FirstName, 4, 1),
    	@c5 = substring(FirstName, 5, 1), @c6 = substring(FirstName, 6, 1), @c7 = substring(FirstName, 7, 1)
    from Membership where MemberID = 1;
    select @c1, @c2, @c3, @c4, @c5, @c6, @c7;
    REVERT;

    And indeed we only get:

    ddmResults6

    Finally, we will create another database and create a user as database owner:

    create database AnotherDatabase;
    GO
    use AnotherDatabase;
     
    create login AnotherDBUser with password = 'B@dP@$$w0rd';
    create user AnotherDBUser for login AnotherDBUser;
     
    alter role db_owner add member AnotherDBUser;

    Now switch back to the database containing the Membership table, and grant this login read-only access.

    create user AnotherDBUser for login AnotherDBUser;
    alter role [db_datareader] add member AnotherDBUser;

    And try it out:

    execute as login = 'AnotherDBUser';
    SELECT * FROM Membership;
    REVERT;
     
    execute as user = 'AnotherDBUser';
    SELECT * FROM Membership;
    REVERT;

    Both queries return the same masked result set.

    ddmResults7