sql - What data type is optimal for clustered index of a table published by using transactional replication? -


we have application stores data in sql server database. (currently support sql server 2005 , higher). our db has more 400 tables. structure of database not ideal. biggest problem have lot of tables guids (newid()) primary clustered keys. when asked our main database architect “why?”, said: “it because of replication”. our db should support transactional replication. initially, primary keys int identity(1,1) clustered. later when came replication support, fields replaced uniqueidentifier default newid(). said “otherwise nightmare deal replication”. newsequentialid() not supported sql 7/2000 @ time. have tables following structure:

create table table1(         table1_pid uniqueidentifier default newid() not null,         field1 varchar(50) null,         fieldn varchar(50) null,         constraint pk_table1 primary key clustered (table1_pid)     )     go  create table table2(     table2_pid uniqueidentifier default newid() not null,     table1_pid uniqueidentifier null,     field1 varchar(50) null,     fieldn varchar(50) null,     constraint pk_table2 primary key clustered (table2_pid),     constraint fk_table2_table1 foreign key (table1_pid) references table1 (table1_pid) ) go 

all tables have lot of fields (up 35) , 15 non-clustered indexes.

i know guid not sequential - 1 has it's values generated in client (using .net) or generated newid() sql function (like in our case) horribly bad choice clustered index 2 reasons:

  1. fragmentation
  2. size

i know clustering key is:

  1. unique,
  2. narrow,
  3. static,
  4. ever-increasing,
  5. non-nullable,
  6. and fixed-width

for more details on reasons behind this, check out following great video: http://technet.microsoft.com/en-us/sqlserver/gg508879.aspx.

so, int identity best choice. bigint identity good, typically int 2+ billion rows should sufficient vast majority of tables.

when our customers began suffering fragmentation, decided make primary keys non-clustered. result, tables remained without clustered index. in other words, tables turned heaps. don’t solution because sure heap tables not part of database design. please, check sql server best practices article: http://technet.microsoft.com/en-us/library/cc917672.aspx.

currently consider 2 options improve database structure:

the first option replace default newid() default newsequentialid() primary clustered key:

create table table1_guid (   table1_pid uniqueidentifier default newsequentialid() not null,   field1 varchar(50) null,   fieldn varchar(50) null,   constraint pk_table1 primary key clustered (table1_pid) ) go 

the second option add int identity column each table , make clustered unique index, leaving primary key not clustered. table1 like:

create table table1_int (   table1_id int identity(1,1) not null,   table1_pid uniqueidentifier default newsequentialid() not null,   field1 varchar(50) null,   fieldn varchar(50) null,   constraint pk_table1 primary key nonclustered (table1_pid),   constraint uk_table1 unique clustered (table1_id) ) go 

table1_pid used replication, (that’s why left pk), while table1_id not replicated @ all.

the long story short, after run benchmarks see approach better, found both solutions not good:

the first approach (table1_guid) revealed following shortcomings: although sequential guid's lot better regular random guids, still 4 times larger int (16 vs 4 byte) , factor in our case because have lots of rows in our tables (up 60 million), , lots of non-clustered indexes on tables (up 15). clustering key being added each , every non-clustered index, increases negative effect of having 16 vs. 4 bytes in size. more bytes means more pages on disk , in sql server ram , more disk i/o , more work sql server.

to more precise, after inserted 25mln rows of real data each table , created 15 non-clustered indexes on each table, saw big difference in space used tables:

exec sp_spaceused 'table1_guid' -- 14.85 gb exec sp_spaceused 'table1_int' -- 11.68 gb 

furthermore, test showed inserts table1_guid bit slower table1_int.

the second approach (table1_int) revealed in queries (select) joining 2 tables on table1_int.table1_pid = table2_int.table1_pid execution plan became worse because additional key lookup operator appeared.

now question: i believe there should better solution our problem. if recommend me or point me resource, appreciate greatly. thank in advance.

updated:

let me give example of select statement additional key lookup operator appears:

--create 2 tables int identity(1,1) clustered key. --these tables have one-to-many relationship. create table table1_int (     table1_id int identity(1,1) not null,     table1_pid uniqueidentifier default newsequentialid() not null,     field1 varchar(50) null,     fieldn varchar(50) null,     constraint pk_table1_int primary key nonclustered (table1_pid),     constraint uk_table1_int unique clustered (table1_id) ) go  create table table2_int(     table2_id int identity(1,1) not null,     table2_pid uniqueidentifier default newsequentialid() not null,     table1_pid uniqueidentifier null,     field1 varchar(50) null,     fieldn varchar(50) null,     constraint pk_table2_int primary key nonclustered (table2_pid),     constraint uk_table2_int unique clustered (table2_id),     constraint fk_table2_table1_int foreign key (table1_pid) references table1_int (table1_pid) ) go 

and create other 2 tables comperison:

--create same 2 tables, uniqueidentifier newsequentialid() clustered key. create table table1_guid (     table1_pid uniqueidentifier default newsequentialid() not null,     field1 varchar(50) null,     fieldn varchar(50) null,     constraint pk_table1_guid primary key clustered (table1_pid), ) go  create table table2_guid(     table2_pid uniqueidentifier default newsequentialid() not null,     table1_pid uniqueidentifier null,     field1 varchar(50) null,     fieldn varchar(50) null,     constraint pk_table2_guid primary key clustered (table2_pid),     constraint fk_table2_table1_guid foreign key (table1_pid) references table1_guid (table1_pid) ) go 

now run following select statements , @ execution plan compare:

select t1.field1, t2.fieldn table1_int t1      inner join table2_int t2          on t1.table1_pid = t2.table1_pid;  select t1.field1, t2.fieldn table1_guid t1      inner join table2_guid t2          on t1.table1_pid = t2.table1_pid; 

execution plan

i use int identity of primary , clustering keys.

you need keep apart primary key logical construct - uniquely identifies rows, has unique , stable , not null. guid works primary key, - since it's guaranteed unique. guid primary key choice if use sql server replication, since in case, need uniquely identifying guid column anyway.

the clustering key in sql server physical construct used physical ordering of data, , lot more difficult right. typically, queen of indexing on sql server, kimberly tripp, requires clustering key unique, stable, narrow possible, , ideally ever-increasing (which int identity is).

see articles on indexing here:

and see jimmy nilsson's the cost of guids primary key

a guid bad choice clustering key, since it's wide, totally random, , leads bad index fragmentation , poor performance. also, clustering key row(s) stored in each , every entry of each , every non-clustered (additional) index, want keep small - guid 16 byte vs. int 4 byte, , several non-clustered indices , several million rows, makes huge difference.

in sql server, primary key default clustering key - doesn't have be. can use guid non-clustered primary key, , int identity clustering key - takes bit of being aware of it.


Comments

Popular posts from this blog

SPSS keyboard combination alters encoding -

Add new record to the table by click on the button in Microsoft Access -

CSS3 Transition to highlight new elements created in JQuery -