SQL Server Normalization Tactic: Varchar VS Int Identity

Can you really use names as primary keys? Isn't there a high risk of several people with the same name?

If you really are so lucky that your name attribute can be used as a primary key, then by all means use it. Often, though, you will have to make something up, like a customer_id.

And finally: "NAME" is a reserved word in at least one DBMS, so consider using something else, e.g. fullname.
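A minimal sketch of the usual compromise (the table and column names are only illustrative): a made-up customer_id as the primary key, with the name kept as an ordinary attribute:

CREATE TABLE dbo.Customer
(
    customer_id int IDENTITY(1,1) PRIMARY KEY,  -- made-up surrogate key
    fullname nvarchar(100) NOT NULL             -- 'fullname' instead of the reserved word NAME
);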

SQL primary key: integer vs varchar

The primary key is supposed to represent the identity for the row and should not change over time.

I assume that the varchar is some sort of natural key - such as the name of the entity, an email address, or a serial number. If you use a natural key, it can sometimes happen that the key needs to change, for example because:

  • The data was incorrectly entered and needs to be fixed.
  • The user changes their name or email address.
  • The management suddenly decide that all customer reference numbers must be changed to another format for reasons that seem completely illogical to you, but they insist on making the change even after you explain the problems it will cause you.
  • Maybe even a country or state decides to change the spelling of its name - very unlikely, but not impossible.

By using a surrogate key you avoid problems caused by having to change primary keys.

Is there a REAL performance difference between INT and VARCHAR primary keys?

You make a good point that you can avoid some number of joined queries by using what's called a natural key instead of a surrogate key. Only you can assess if the benefit of this is significant in your application.

That is, you can measure the queries in your application that are the most important to be speedy, because they work with large volumes of data or they are executed very frequently. If these queries benefit from eliminating a join, and do not suffer by using a varchar primary key, then do it.
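To make the trade-off concrete, here is a hedged sketch (all table and column names are hypothetical): with a short natural key such as an ISO country code, the code can be read straight off the referencing table; with a surrogate key, the same query needs a join to the lookup table:

-- Natural key design: the code itself is the primary key.
CREATE TABLE dbo.Country_N (country_code char(2) PRIMARY KEY, country_name nvarchar(100) NOT NULL);
CREATE TABLE dbo.Order_N (order_id int IDENTITY PRIMARY KEY,
                          country_code char(2) NOT NULL REFERENCES dbo.Country_N (country_code));

-- Surrogate key design: an int key, the code becomes a unique attribute.
CREATE TABLE dbo.Country_S (country_id int IDENTITY PRIMARY KEY,
                            country_code char(2) NOT NULL UNIQUE,
                            country_name nvarchar(100) NOT NULL);
CREATE TABLE dbo.Order_S (order_id int IDENTITY PRIMARY KEY,
                          country_id int NOT NULL REFERENCES dbo.Country_S (country_id));

-- No join is needed with the natural key:
SELECT order_id, country_code FROM dbo.Order_N;

-- The surrogate key needs a join for the same result:
SELECT o.order_id, c.country_code
FROM dbo.Order_S o
JOIN dbo.Country_S c ON c.country_id = o.country_id;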

Don't use either strategy for all tables in your database. It's likely that in some cases, a natural key is better, but in other cases a surrogate key is better.

Other folks make a good point that it's rare in practice for a natural key to never change or have duplicates, so surrogate keys are usually worthwhile.

What's the best practice for primary keys in tables?

I follow a few rules:

  1. Primary keys should be as small as necessary. Prefer a numeric type because numeric types are stored in a much more compact format than character formats. This matters because most primary keys will also be foreign keys in other tables and will be used in multiple indexes. The smaller your key, the smaller the index, and the fewer pages in the cache you will use.
  2. Primary keys should never change. Updating a primary key should always be out of the question, because it is most likely used in multiple indexes and as a foreign key. Updating a single primary key could cause a ripple effect of changes.
  3. Do NOT use your problem domain's key (for example a passport number, social security number, or employee contract number) as your logical model's primary key, as these "natural keys" can change in real-world situations. Make sure to add UNIQUE constraints for these where necessary to enforce consistency (see the sketch after this answer).

On surrogate vs natural key, I refer to the rules above. If the natural key is small and will never change, it can be used as a primary key. If the natural key is large or likely to change, I use surrogate keys. If there is no natural key at all, I still create a surrogate key, because experience shows you will always add tables to your schema and wish you'd put a primary key in place.
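As a sketch of rule 3 above (the Employee table and its columns are hypothetical): keep a small surrogate as the primary key and enforce the natural identifiers with UNIQUE constraints instead:

CREATE TABLE dbo.Employee
(
    employee_id int IDENTITY(1,1) PRIMARY KEY,  -- small and never changes
    passport_number varchar(20) NOT NULL,
    contract_number varchar(20) NOT NULL
);

-- The natural identifiers stay unique, but can be corrected without touching the key.
ALTER TABLE dbo.Employee ADD CONSTRAINT UQ_Employee_Passport UNIQUE (passport_number);
ALTER TABLE dbo.Employee ADD CONSTRAINT UQ_Employee_Contract UNIQUE (contract_number);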

Normalize sql string combinations

One of my biggest complaints about string_split is that it lacks the ordinal position of each value; having that position makes situations like this one a lot easier to work with. Here is another approach to this. I am using the splitter from Jeff Moden which can be found here. There really is no need for a cursor here.

I also took the liberty of adding a GroupID column so you know which row each value belongs to once you parse them out. If the Fruits column is unique you could skip that, but it's hard to tell for sure.

CREATE TABLE #Fruits
(
      GroupID int identity
    , Fruits VARCHAR(100)
);

INSERT INTO #Fruits (Fruits)
VALUES ('banana,apple'),
       ('apple,banana'),
       ('kiwi,jackfruit'),
       ('jackfruit, kiwi');

-- Split each row, trim the values, and number them alphabetically within each GroupID.
with SortedResults as
(
    select f.GroupID
         , Item = ltrim(x.Item)
         , x.ItemNumber
         , RowNum = ROW_NUMBER() over(partition by f.GroupID order by ltrim(x.Item))
    from #Fruits f
    cross apply dbo.DelimitedSplit8K(f.Fruits, ',') x
)
-- Rebuild each pair in a consistent order so 'banana,apple' and 'apple,banana' normalize to the same string.
select Max(case when RowNum = 1 then Item end) + ', ' + max(case when RowNum = 2 then Item end)
from SortedResults
group by GroupID;

drop table #Fruits;
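As a side note: if you are on SQL Server 2022 or later (or Azure SQL Database), STRING_SPLIT accepts a third enable_ordinal argument that returns each value's position, which removes that complaint. A minimal sketch of the same approach against the #Fruits table above (run it before the DROP); the ordering here is still alphabetical, so only the value column is needed:

with SortedResults as
(
    select f.GroupID
         , Item = ltrim(s.value)
         , RowNum = ROW_NUMBER() over(partition by f.GroupID order by ltrim(s.value))
    from #Fruits f
    cross apply string_split(f.Fruits, ',', 1) s  -- 1 = enable_ordinal, SQL Server 2022+
)
select Max(case when RowNum = 1 then Item end) + ', ' + max(case when RowNum = 2 then Item end)
from SortedResults
group by GroupID;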

Is it ok to use character values for primary keys?

I'd stay away from using text as your key - what happens in the future when you want to change the team ID for some team? You'd have to cascade that key change all through your data, which is exactly the sort of thing a primary key lets you avoid. Also, though I don't have any empirical evidence, I'd think the INT key would be significantly faster than the text one.

Perhaps you can create views for your data that make it easier to consume, while still using a numeric primary key.
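For example (the table and view names here are hypothetical), a view can expose the readable team name while everything underneath keeps joining on the integer key:

CREATE TABLE dbo.Teams
(
    TeamID int IDENTITY(1,1) PRIMARY KEY,
    TeamName varchar(50) NOT NULL UNIQUE
);

CREATE TABLE dbo.Games
(
    GameID int IDENTITY(1,1) PRIMARY KEY,
    HomeTeamID int NOT NULL REFERENCES dbo.Teams (TeamID)
);
GO

-- Consumers read the friendly name; the key stays numeric underneath.
CREATE VIEW dbo.vGames
AS
SELECT g.GameID, t.TeamName AS HomeTeam
FROM dbo.Games g
JOIN dbo.Teams t ON t.TeamID = g.HomeTeamID;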

Split one large, denormalized table into a normalized database

Ok, I'm not an SQL Server expert, but here's the "strategy" I would suggest.

Calculate the personId on the staging table
As @Shnugo suggested before me, calculating the personId in the staging table will ease the next steps.

Use a sequence for the personID
From SQL Server 2012 onwards you can define sequences. If you use one for every person insert, you'll never risk overlapping IDs. If you have (as it seems) personIDs that were loaded before the sequence existed, you can create the sequence with the first free personID as its starting value.
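A minimal sketch, assuming the highest personID already loaded is 10000 (the sequence name and starting value are only illustrative):

CREATE SEQUENCE dbo.PersonIdSeq AS int
    START WITH 10001   -- first free personID after the data already loaded
    INCREMENT BY 1;

-- Every new person draws the next value, so generated IDs can never collide.
INSERT INTO People (personID)
VALUES (NEXT VALUE FOR dbo.PersonIdSeq);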

Create a numbers table
Create a utility table holding the numbers from 1 to n (you need n to be at least 50; you can look at this question for some implementations).
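One possible implementation, as a sketch (any of the approaches from that question will do):

-- Numbers 1..50 taken from a system view that is guaranteed to have enough rows.
SELECT TOP (50) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
INTO numbers
FROM sys.all_objects;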

Use set logic to do the insert
I'd avoid cursors and row-by-row logic: you are right that it is better to limit the number of accesses to the table, but I'd say you should strive for one access per target table.

You could proceed like this:

People:

INSERT INTO People (personID)
SELECT personId from staging;

Names:

INSERT INTO Names (personID, lName, fName)
SELECT personId, lName, fName from staging;

Licenses:
here we'll need the numbers table

INSERT INTO Licenses (personId, number, issuer)
SELECT * FROM (
    SELECT personId,
        case nbrs.n
            when 1 then licenseNumber1
            when 2 then licenseNumber2
            ...
            when 50 then licenseNumber50
        end as licenseNumber,
        case nbrs.n
            when 1 then licenseIssuer1
            when 2 then licenseIssuer2
            ...
            when 50 then licenseIssuer50
        end as licenseIssuer
    from staging
    cross join
        (select n from numbers where n >= 1 and n <= 50) nbrs
) t WHERE licenseNumber is not null;

Specialties:

INSERT INTO Specialties (personId, name, state)
SELECT * FROM (
    SELECT personId,
        case nbrs.n
            when 1 then specialtyName1
            when 2 then specialtyName2
            ...
            when 15 then specialtyName15
        end as specialtyName,
        case nbrs.n
            when 1 then specialtyState1
            when 2 then specialtyState2
            ...
            when 15 then specialtyState15
        end as specialtyState
    from staging
    cross join
        (select n from numbers where n >= 1 and n <= 15) nbrs
) t WHERE specialtyName is not null;

Identifiers:

INSERT INTO Identifiers (personId, value)
SELECT * FROM (
    SELECT personId,
        case nbrs.n
            when 1 then identifier1
            when 2 then identifier2
            ...
            when 15 then identifier15
        end as value
    from staging
    cross join
        (select n from numbers where n >= 1 and n <= 15) nbrs
) t WHERE value is not null;

Hope it helps.

Writing query to normalize a table

CREATE TABLE dbo.Territory
(id int identity PRIMARY KEY, STATE nvarchar(255), CITY nvarchar(255), ZIP nvarchar(255));

create table dbo.customer (CUS_ID int identity, territoryid int, CUS_PHONE varchar (12), CUS_NAME varchar(25))

alter table dbo.customer
ADD CONSTRAINT t_id
FOREIGN KEY (territoryid)
REFERENCES dbo.Territory(id);

Any territoryid must refer to a value that already exists in the id column in dbo.territory.

To insert a new customer:
insert into dbo.customer (territoryid, cus_phone, cus_name) values (3,'212-555-1212','Mary')
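
If the territory does not exist yet, one common pattern (sketched here with example values) is to insert it first and capture the generated id:

INSERT INTO dbo.Territory (STATE, CITY, ZIP)
VALUES ('NY', 'New York', '10001');

DECLARE @tid int = SCOPE_IDENTITY();  -- id generated for the territory just inserted

INSERT INTO dbo.customer (territoryid, cus_phone, cus_name)
VALUES (@tid, '212-555-1212', 'Mary');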

