What's the Most Efficient Way to Normalize Text from Column into a Table

If you have a known maximum number of columns, a little XML within a CROSS APPLY will do it.

If the count is unknown, you would have to go dynamic (a rough sketch follows the first example).

Example

Declare @YourTable Table ([ID] varchar(50),[SomeCol] varchar(50))
Insert Into @YourTable Values
(1,'[Key1:Value1:Value2:Value3:Value4:Value5]')
,(2,'[Key2:Value1:Value2:Value3:Value4:Value5]')
,(3,'[Key3:Value1:Value2:Value3:Value4:Value5]')

Select A.ID
,B.*
From @YourTable A
Cross Apply (
Select Pos1 = ltrim(rtrim(xDim.value('/x[1]','varchar(max)')))
,Pos2 = ltrim(rtrim(xDim.value('/x[2]','varchar(max)')))
,Pos3 = ltrim(rtrim(xDim.value('/x[3]','varchar(max)')))
,Pos4 = ltrim(rtrim(xDim.value('/x[4]','varchar(max)')))
,Pos5 = ltrim(rtrim(xDim.value('/x[5]','varchar(max)')))
,Pos6 = ltrim(rtrim(xDim.value('/x[6]','varchar(max)')))
,Pos7 = ltrim(rtrim(xDim.value('/x[7]','varchar(max)')))
,Pos8 = ltrim(rtrim(xDim.value('/x[8]','varchar(max)')))
,Pos9 = ltrim(rtrim(xDim.value('/x[9]','varchar(max)')))
From (Select Cast('<x>' + replace(replace(replace(SomeCol,'[',''),']',''),':','</x><x>')+'</x>' as xml) as xDim) as A
) B

Returns

ID  Pos1  Pos2    Pos3    Pos4    Pos5    Pos6    Pos7  Pos8  Pos9
1   Key1  Value1  Value2  Value3  Value4  Value5  NULL  NULL  NULL
2   Key2  Value1  Value2  Value3  Value4  Value5  NULL  NULL  NULL
3   Key3  Value1  Value2  Value3  Value4  Value5  NULL  NULL  NULL

EDIT

I should add that the ltrim(rtrim(...)) is optional, and varchar(max) is just my demonstrative default.
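
As noted up top, if the number of parts is unknown, you have to go dynamic. Below is a rough sketch of that route (illustrative, not production code): it counts the maximum number of delimited parts and builds the Pos columns on the fly. The sample data is copied into a temp table first, because dynamic SQL can't see a table variable declared outside it.

Select ID, SomeCol Into #YourTable From @YourTable  -- run together with the example above

Declare @MaxPos int, @N int = 1, @Cols varchar(max) = '', @SQL varchar(max)

-- highest part count = number of colons + 1
Select @MaxPos = max(len(SomeCol) - len(replace(SomeCol,':','')) + 1) From #YourTable

-- build "Pos1 = xDim.value('/x[1]','varchar(max)'),Pos2 = ..." up to @MaxPos
While @N <= @MaxPos
Begin
    Set @Cols = @Cols + case when @N > 1 then ',' else '' end
        + 'Pos' + cast(@N as varchar(10))
        + ' = xDim.value(''/x[' + cast(@N as varchar(10)) + ']'',''varchar(max)'')'
    Set @N = @N + 1
End

Set @SQL = '
Select A.ID, B.*
From #YourTable A
Cross Apply (
    Select ' + @Cols + '
    From (Select Cast(''<x>'' + replace(replace(replace(SomeCol,''['',''''),'']'',''''),'':'',''</x><x>'')+''</x>'' as xml) as xDim) as A
) B'

Exec (@SQL)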

EDIT - One String delimited with CRLF

Declare @S varchar(max)='
[Key1:Value1:Value2:Value3:Value4:Value5]
[Key2:Value1:Value2:Value3:Value4:Value5]
[Key3:Value1:Value2:Value3:Value4:Value5]
'

Select B.*
From (
Select RetSeq = Row_Number() over (Order By (Select null))
,RetVal = LTrim(RTrim(B.i.value('(./text())[1]', 'varchar(max)')))
From (Select x = Cast('<x>' + replace(@S,char(13)+char(10),'</x><x>')+'</x>' as xml).query('.')) as A
Cross Apply x.nodes('x') AS B(i)
) A
Cross Apply (
Select Pos1 = ltrim(rtrim(xDim.value('/x[1]','varchar(max)')))
,Pos2 = ltrim(rtrim(xDim.value('/x[2]','varchar(max)')))
,Pos3 = ltrim(rtrim(xDim.value('/x[3]','varchar(max)')))
,Pos4 = ltrim(rtrim(xDim.value('/x[4]','varchar(max)')))
,Pos5 = ltrim(rtrim(xDim.value('/x[5]','varchar(max)')))
,Pos6 = ltrim(rtrim(xDim.value('/x[6]','varchar(max)')))
,Pos7 = ltrim(rtrim(xDim.value('/x[7]','varchar(max)')))
,Pos8 = ltrim(rtrim(xDim.value('/x[8]','varchar(max)')))
,Pos9 = ltrim(rtrim(xDim.value('/x[9]','varchar(max)')))
From (Select Cast('<x>' + replace(replace(replace(RetVal,'[',''),']',''),':','</x><x>')+'</x>' as xml) as xDim) as A
) B
Where A.RetVal is not null
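
As an aside, on SQL Server 2016+ the row split could also be sketched with STRING_SPLIT, but beware: STRING_SPLIT does not guarantee the order of the returned rows (there is no ordinal option until SQL Server 2022), so the XML splitter above remains the safer choice when position matters.

-- replace CRLF with a single-character delimiter that STRING_SPLIT can handle
Select RetVal = ltrim(rtrim(value))
From string_split(replace(@S, char(13)+char(10), '|'), '|')
Where value <> ''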

What's best practice for normalisation of DB where a domain table has an Other option for free text?

After talking it over, we're going to do as suggested: run a text scan and use it to populate our lookups. Going forward, we'll discourage the use of free-text fields for lookups, storing the values in a separate table for the time being so we don't clutter our main tables.
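
For what it's worth, the text scan itself can be as simple as harvesting the distinct free-text values that aren't in the lookup yet. A minimal sketch, with hypothetical table and column names:

-- pull unseen "Other" entries into the lookup (MainTable/OtherText/Lookup are illustrative)
Insert Into Lookup (Description)
Select Distinct m.OtherText
From MainTable m
Where m.OtherText is not null
  and m.OtherText <> ''
  and not exists (Select 1 From Lookup l Where l.Description = m.OtherText)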

How can I normalize the data in a range of columns in my pandas dataframe

You can perform operations on a subset of rows or columns in pandas in a number of ways. One useful way is indexing:

# Assuming same lines from your example
cols_to_norm = ['Age','Height']
survey_data[cols_to_norm] = survey_data[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

This will apply the normalization only to the columns you want and assign the result back to those columns. Alternatively, you could write the results to new, normalized columns and keep the originals, as sketched below.
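
For example, to keep the originals alongside the normalized values (the data here is made up to mirror the columns above):

import pandas as pd

# hypothetical survey data with the same columns as the example
survey_data = pd.DataFrame({'Age': [25, 40, 31], 'Height': [160, 175, 182]})

cols_to_norm = ['Age', 'Height']
for col in cols_to_norm:
    # min-max normalize into a new column, leaving the original untouched
    c = survey_data[col]
    survey_data[col + '_norm'] = (c - c.min()) / (c.max() - c.min())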

Normalizing an extremely big table

You'll find other benefits to normalizing the data besides the speed of queries running against it, such as size and maintainability, which alone should justify normalizing it.

However, it will also likely improve the speed of queries; a single row containing 300 text columns is massive, and almost certainly exceeds the 8,060-byte limit for storing a row in-page... so the overflow is instead being stored in the ROW_OVERFLOW_DATA or LOB_DATA allocation units.

By reducing the size of each row through normalization, such as replacing redundant text data with a TINYINT foreign key, and by moving columns that aren't dependent on this large table's primary key into other tables, the data should no longer overflow, and you'll also be able to store more rows per page.
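
To make the foreign-key point concrete, here is a minimal sketch of the lookup pattern; the table and column names are hypothetical:

-- a 1-byte key replaces a repeated text value (names are illustrative)
Create Table StatusLookup (
    StatusID   tinyint identity(1,1) primary key,
    StatusText varchar(100) not null unique
)

Alter Table BigTable Add StatusID tinyint
    references StatusLookup (StatusID)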

As for the overhead added by performing JOINs to get the normalized data: if you properly index your tables, this shouldn't add a substantial amount. If it does prove unacceptable, you can then selectively denormalize the data as necessary.

Normalizing(Reshaping) data frame based on split and columns

This is relatively easy using data.table (and obviously fast).

require( data.table )
dt <- data.table( df )
dt[ , list( name = unlist( strsplit( name , "\n" ) ) ) , by = list( Date , score ) ]
# Date score name
#1: 12/09/2012 120 Mahesh
#2: 12/09/2012 120 Rahul
#3: 13/09/2012 110 abc
#4: 13/09/2012 110 xyz

As a note, I took df to be the following data (note the character classes rather than the factor classes that appear in your actual data):

df <- read.delim( text = "Date    name    score
12/09/2012 'Mahesh\nRahul' 120
13/09/2012 'abc\nxyz' 110" ,
sep = "" , h = TRUE , quote = "\'" , stringsAsFactors = FALSE )

How far to take normalization?

Denormalization has the advantage of fast SELECTs on large queries.

Disadvantages are:

  • It takes more coding and time to ensure integrity (which is most important in your case)

  • It's slower on DML (INSERT/UPDATE/DELETE)

  • It takes more space

As for optimization, you may optimize either for faster querying or for faster DML (as a rule, these two are antagonists).

Optimizing for faster querying often implies duplicating data, be it denormalization, indexes, extra tables, or whatever.

In the case of indexes, the RDBMS maintains the duplication for you, but in the case of denormalization, you'll need to code it yourself. What if a Department moves to another Office? You'll need to fix it in three tables instead of one, as the sketch below shows.
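
For instance, if the office name were denormalized onto three tables (hypothetical names), one department move means three updates that must be kept in sync, versus a single update in the normalized design:

Update Department Set OfficeName = 'Building B' Where DepartmentID = 7
Update Employee   Set OfficeName = 'Building B' Where DepartmentID = 7
Update Payroll    Set OfficeName = 'Building B' Where DepartmentID = 7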

As I can see from the names of your tables, there won't be millions of records there, so you'd better normalize your data; it will be simpler to manage.

How to normalise a table which can link to one or more tables?

This is not going to be an answer on how to put the tables into 1NF. I hope it will be helpful, though.

When creating a database, we usually don't think in terms of 1NF, 2NF, etc. We think about what to model and what entities there are. When we think this through, the database very often ends up in 5NF or so already. If in doubt, we can use the normal forms as a kind of checklist.

I don't know your exact requirements, so there is a lot of guessing or just general advice here. Maybe one of your problems is that you are using the noun "notes" which doesn't describe exactly what this is about. Later you call this "correspondence", but are all "notes" = "correspondence"?

Your database is about services you take from an agent or from a service company directly. So one entity that I see is this provider:

provider

  • provider_id
  • name
  • contact_person_name
  • phone
  • email
  • type (agent or final service provider)

If a provider can have multiple contacts, phones and emails, you'd make this a provider_contact table instead (a rough DDL sketch follows the two lists):

provider

  • provider_id
  • name
  • type (agent or final service provider)

provider_contact

  • provider_contact_id
  • name
  • phone
  • email
  • provider_id
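
In DDL, those two tables might look like this (the data types are my assumptions):

Create Table provider (
    provider_id int primary key,
    name        varchar(100) not null,
    type        varchar(30) not null  -- 'agent' or 'final service provider'
)

Create Table provider_contact (
    provider_contact_id int primary key,
    name                varchar(100),
    phone               varchar(30),
    email               varchar(100),
    provider_id         int not null references provider (provider_id)
)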

As to notes: well if there are notes on a provider ("Always ask for Jane; she's the most professional there.") or contact ("Jane has worked in London and Madrid."), I'd usually make this just one text column where you can enter whatever you like. You can even store HTML or a Word document. No need for multiple documents per provider, I suppose.

Now there are also services. If you need a list of who offers which service, add these tables: service (which services exist) and provider_service (who provides which service). But maybe you can do without them, because you know what services exist and who provides them anyway, and don't want to have this in your model.

I don't know if you want to enter service enquiries or only already-fixed service contracts. In any case you may want a service_contract table, either with a status column or without.

service_contract

  • service_contract_id
  • provider_id (or maybe provider_contact_id?)
  • service (free text or a service_id referencing a service table)
  • valid_from
  • valid_to

Here again you may have notes like "We are still waiting for the documents. Jane said, they will come in March 2018.", which again would be just one column.

Then you said you want the correspondence, which could be an additional table service_contract_correspondance (sketched in DDL after the list):

service_contract_correspondance

  • service_contract_correspondance_id
  • service_contract_id
  • type (received from provider or sent to provider)
  • text
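
Sketched as DDL, these two tables might look like this (types and lengths are my assumptions; [text] is bracketed to avoid clashing with the legacy type name):

Create Table service_contract (
    service_contract_id int primary key,
    provider_id         int not null references provider (provider_id),
    service             varchar(100),  -- free text, or a service_id instead
    valid_from          date,
    valid_to            date
)

Create Table service_contract_correspondance (
    service_contract_correspondance_id int primary key,
    service_contract_id int not null references service_contract (service_contract_id),
    type                varchar(20),   -- 'received' or 'sent'
    [text]              varchar(max)
)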

But then again, maybe you can do without it. Will you ever have to access single correspondences (e.g. "give me all correspondence from December 2017" or "delete all correspondence older than one year")? Or will there be thousands of correspondences on a single contract? Maybe not. Maybe you can again see this as one single document (text, HTML, Word, ...) and add it as a mere field to the table.

Having a text consisting of multiple dated entries like


January 2, 2018 Jane
She says they'll need two weeks to make an offer.

January 20, 2018
I asked about the status. Talked with some Thomas Smith. He says they'll call us tomorrow.

January 21, 2018 Jane
She has emailed the contract for us to sign.

is not per se against 1NF. As long as you are not interested in the details in your database (e.g. you'll never select only the January correspondence on a contract), then to the database this is atomic; no need to change it.

Same with contacts. If you have a single string column containing "Jane (sales person) 123-45678 jane@them.com, Jack (assistant) 123-98765 jack@them.com", this again is not per se against 1NF. As long as you don't want to select names only or check phone numbers, but always treat the string as a whole, as the contact, then just do so.

You see, it all boils down to what exactly you want to model and how you want to work with the data. Yes, there are agents and direct providers, but is the difference between the two so big that you need two different tables? Yes, there is a chronology of correspondence per contract, but do you really need it separated into single correspondences in your database?


