What's the most efficient way to normalize text from a column into a table?
If you have a known maximum number of columns, a little XML within a CROSS APPLY will do the trick.
If the column count is unknown, you would have to go dynamic (dynamic SQL).
Example
Declare @YourTable Table ([ID] varchar(50),[SomeCol] varchar(50))
Insert Into @YourTable Values
(1,'[Key1:Value1:Value2:Value3:Value4:Value5]')
,(2,'[Key2:Value1:Value2:Value3:Value4:Value5]')
,(3,'[Key3:Value1:Value2:Value3:Value4:Value5]')
Select A.ID
,B.*
From @YourTable A
Cross Apply (
Select Pos1 = ltrim(rtrim(xDim.value('/x[1]','varchar(max)')))
,Pos2 = ltrim(rtrim(xDim.value('/x[2]','varchar(max)')))
,Pos3 = ltrim(rtrim(xDim.value('/x[3]','varchar(max)')))
,Pos4 = ltrim(rtrim(xDim.value('/x[4]','varchar(max)')))
,Pos5 = ltrim(rtrim(xDim.value('/x[5]','varchar(max)')))
,Pos6 = ltrim(rtrim(xDim.value('/x[6]','varchar(max)')))
,Pos7 = ltrim(rtrim(xDim.value('/x[7]','varchar(max)')))
,Pos8 = ltrim(rtrim(xDim.value('/x[8]','varchar(max)')))
,Pos9 = ltrim(rtrim(xDim.value('/x[9]','varchar(max)')))
From (Select Cast('<x>' + replace(replace(replace(SomeCol,'[',''),']',''),':','</x><x>')+'</x>' as xml) as xDim) as A
) B
Returns
ID Pos1 Pos2 Pos3 Pos4 Pos5 Pos6 Pos7 Pos8 Pos9
1 Key1 Value1 Value2 Value3 Value4 Value5 NULL NULL NULL
2 Key2 Value1 Value2 Value3 Value4 Value5 NULL NULL NULL
3 Key3 Value1 Value2 Value3 Value4 Value5 NULL NULL NULL
EDIT
I should add that the ltrim(rtrim(...)) is optional and varchar(max) is just my demonstrative default.
EDIT - One String delimited with CRLF
Declare @S varchar(max)='
[Key1:Value1:Value2:Value3:Value4:Value5]
[Key2:Value1:Value2:Value3:Value4:Value5]
[Key3:Value1:Value2:Value3:Value4:Value5]
'
Select B.*
From (
Select RetSeq = Row_Number() over (Order By (Select null))
,RetVal = LTrim(RTrim(B.i.value('(./text())[1]', 'varchar(max)')))
From (Select x = Cast('<x>' + replace(@S,char(13)+char(10),'</x><x>')+'</x>' as xml).query('.')) as A
Cross Apply x.nodes('x') AS B(i)
) A
Cross Apply (
Select Pos1 = ltrim(rtrim(xDim.value('/x[1]','varchar(max)')))
,Pos2 = ltrim(rtrim(xDim.value('/x[2]','varchar(max)')))
,Pos3 = ltrim(rtrim(xDim.value('/x[3]','varchar(max)')))
,Pos4 = ltrim(rtrim(xDim.value('/x[4]','varchar(max)')))
,Pos5 = ltrim(rtrim(xDim.value('/x[5]','varchar(max)')))
,Pos6 = ltrim(rtrim(xDim.value('/x[6]','varchar(max)')))
,Pos7 = ltrim(rtrim(xDim.value('/x[7]','varchar(max)')))
,Pos8 = ltrim(rtrim(xDim.value('/x[8]','varchar(max)')))
,Pos9 = ltrim(rtrim(xDim.value('/x[9]','varchar(max)')))
From (Select Cast('<x>' + replace(replace(replace(RetVal,'[',''),']',''),':','</x><x>')+'</x>' as xml) as xDim) as A
) B
Where A.RetVal is not null
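For comparison only, the same two-level split (rows on CRLF, then fields on the colon) can be sketched outside SQL in Python; everything here is illustrative and not part of the T-SQL answer:

```python
# Purely illustrative Python sketch of the same parsing.
s = (
    "[Key1:Value1:Value2:Value3:Value4:Value5]\r\n"
    "[Key2:Value1:Value2:Value3:Value4:Value5]\r\n"
    "[Key3:Value1:Value2:Value3:Value4:Value5]"
)

# Split on CRLF first, then strip the brackets and split each row on ':'.
rows = [
    line.strip("[]").split(":")
    for line in s.split("\r\n")
    if line.strip()
]

print(rows[0])  # ['Key1', 'Value1', 'Value2', 'Value3', 'Value4', 'Value5']
```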
What's the best practice for normalisation of a DB where a domain table has an "Other" option for free text?
After talking it over, we're going to do as suggested: run a text scan and use it to populate our lookups. Going forward we'll try to discourage the use of free-text fields for lookups, storing the values in a separate table for the time being so we don't clutter our main tables.
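As a rough sketch of what such a text scan could look like (this is hypothetical Python/pandas with made-up column names, not something from the question):

```python
import pandas as pd

# Hypothetical example: 'category' is the domain column, and rows marked
# 'Other' carry free text in 'category_other'. All names are illustrative.
df = pd.DataFrame({
    "category": ["Email", "Other", "Other", "Phone", "Other"],
    "category_other": [None, "Fax", "fax ", None, "Carrier pigeon"],
})

# Scan the free text, normalize case and whitespace, and count occurrences;
# frequent values become candidates for promotion into the lookup table.
candidates = (
    df.loc[df["category"] == "Other", "category_other"]
      .dropna()
      .str.strip()
      .str.lower()
      .value_counts()
)
print(candidates)  # fax: 2, carrier pigeon: 1
```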
How can I normalize the data in a range of columns in my pandas dataframe
You can perform operations on a subset of rows or columns in pandas in a number of ways. One useful way is indexing:
# Assuming same lines from your example
cols_to_norm = ['Age','Height']
survey_data[cols_to_norm] = survey_data[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
This applies the normalization only to the columns you want and assigns the result back to those columns. Alternatively, you could write the results into new, normalized columns and keep the originals.
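If you want to keep the originals, one way (the suffix naming here is just a choice) is to write the normalized values into new columns:

```python
import pandas as pd

# Toy data standing in for survey_data; column names are from the question.
survey_data = pd.DataFrame({"Age": [20, 30, 40], "Height": [150, 175, 200]})
cols_to_norm = ["Age", "Height"]

# Min-max normalize into new '<col>_norm' columns, keeping the originals.
for col in cols_to_norm:
    x = survey_data[col]
    survey_data[col + "_norm"] = (x - x.min()) / (x.max() - x.min())

print(survey_data["Age_norm"].tolist())  # [0.0, 0.5, 1.0]
```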
Normalizing an extremely big table
You'll find benefits to normalizing the data beyond the speed of queries running against it, such as size and maintainability, which alone should justify normalizing it.
However, it will also likely improve query speed. A single row containing 300 text columns is massive, and is almost certainly past the 8,060-byte limit for in-row data on a page, so the excess is being stored in the ROW_OVERFLOW_DATA or LOB_DATA allocation units.
By reducing the size of each row through normalization, such as replacing redundant text data with a TINYINT foreign key, and by moving columns that aren't dependent on this large table's primary key into another table, the data should no longer overflow, and you'll also be able to store more rows per page.
As for the overhead added by performing JOINs to get the normalized data back: if you properly index your tables, this shouldn't add a substantial amount of overhead. However, if it does add an unacceptable overhead, you can then selectively denormalize the data as necessary.
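As a side illustration of why swapping repeated text for a small integer key shrinks rows: a pandas Categorical does the same thing in memory (each distinct string stored once, each row holding only a small integer code), which is a rough analogue of the TINYINT foreign key:

```python
import pandas as pd

# 100,000 rows of a repeated text value, as plain strings vs. as a
# categorical (the string is stored once; each row keeps a tiny code).
s_text = pd.Series(["Department of Redundancy Department"] * 100_000)
s_cat = s_text.astype("category")

# The categorical representation is far smaller than repeating the text.
print(s_text.memory_usage(deep=True) > s_cat.memory_usage(deep=True))  # True
```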
Normalizing (reshaping) a data frame based on split and columns
This is relatively easy using data.table (and fast, obviously).
require( data.table )
dt <- data.table( df )
dt[ , list( name = unlist( strsplit( name , "\n" ) ) ) , by = list( Date , score ) ]
# Date score name
#1: 12/09/2012 120 Mahesh
#2: 12/09/2012 120 Rahul
#3: 13/09/2012 110 abc
#4: 13/09/2012 110 xyz
As a note, I took df to be the following data (note the character classes rather than the factor classes that appear in your actual data):
df <- read.delim( text = "Date name score
12/09/2012 'Mahesh\nRahul' 120
13/09/2012 'abc\nxyz' 110" ,
sep = "" , h = TRUE , quote = "\'" , stringsAsFactors = FALSE )
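For readers who work in pandas rather than data.table, the same split-and-explode reshape can be sketched as follows (just a comparison, assuming the same toy data):

```python
import pandas as pd

# Same toy data as the R example above.
df = pd.DataFrame({
    "Date": ["12/09/2012", "13/09/2012"],
    "name": ["Mahesh\nRahul", "abc\nxyz"],
    "score": [120, 110],
})

# Split the multi-valued 'name' column on newlines, then give each value
# its own row (Series.explode requires pandas >= 0.25).
out = (
    df.assign(name=df["name"].str.split("\n"))
      .explode("name")
      .reset_index(drop=True)
)
print(out["name"].tolist())  # ['Mahesh', 'Rahul', 'abc', 'xyz']
```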
How far to take normalization?
Denormalization has the advantage of fast SELECTs on large queries.
Disadvantages are:
It takes more coding and time to ensure integrity (which is most important in your case)
It's slower on DML (INSERT/UPDATE/DELETE)
It takes more space
As for optimization, you may optimize either for faster querying or for faster DML (as a rule, these two are antagonists).
Optimizing for faster querying often implies duplicating data, be it through denormalization, indices, extra tables, or whatever.
In the case of indices, the RDBMS does it for you, but in the case of denormalization, you'll need to code it yourself. What if a Department moves to another Office? You'll need to fix it in three tables instead of one.
As I can see from the names of your tables, there won't be millions of records there, so you'd better normalize your data; it will be simpler to manage.
How to normalise a table which can link to one or more tables?
This is not going to be an answer on how to put the tables into 1NF. I hope it will be helpful, though.
When creating a database, we usually don't think in 1NF, 2NF etc. We think about what to model and what entities there are. When we think this through, then very often the database is already in 5NF or so. If in doubt we can use the NFs as a kind of checklist.
I don't know your exact requirements, so there is a lot of guessing or just general advice here. Maybe one of your problems is that you are using the noun "notes" which doesn't describe exactly what this is about. Later you call this "correspondence", but are all "notes" = "correspondence"?
Your database is about services you take from an agent or from a service company directly. So one entity that I see is this provider:
provider
- provider_id
- name
- contact_person_name
- phone
- type (agent or final service provider)
If a provider can have multiple contacts, phones and emails, you'd make this a provider_contact table instead:
provider
- provider_id
- name
- type (agent or final service provider)
provider_contact
- provider_contact_id
- name
- phone
- provider_id
As to notes: well if there are notes on a provider ("Always ask for Jane; she's the most professional there.") or contact ("Jane has worked in London and Madrid."), I'd usually make this just one text column where you can enter whatever you like. You can even store HTML or a Word document. No need for multiple documents per provider, I suppose.
Now there are also services. If you need a list of who offers which service, add these two tables: service (which services exist) and provider_service (who provides which service). But maybe you can do without them, because you know what services exist and who provides them anyway, and don't want this in your model.
I don't know if you want to enter service enquiries or only already fixed service contracts. In any case you may want a service_contract table, either with a status or without.
service_contract
- service_contract_id
- provider_id (or maybe provider_contact_id?)
- service (free text or a service_id referencing a service table)
- valid_from
- valid_to
Here again you may have notes like "We are still waiting for the documents. Jane said, they will come in March 2018.", which again would be just one column.
Then you said you want the correspondence, which could be an additional table service_contract_correspondance:
service_contract_correspondance
- service_contract_correspondance_id
- service_contract_id
- type (received from provider or sent to provider)
- text
But then again, maybe you can do without it. Will you ever have to access single correspondences (e.g. "give me all correspondences from December 2017" or "delete all correspondences older than one year")? Or will there be thousands of correspondences on a single contract? Maybe not. Maybe you can again see this as one single document (text, HTML, Word, ...) and add it as a mere field to the table.
Having a text consisting of multiple contacts like
January 2, 2018 Jane
She says they'll need two weeks to make an offer.
January 20, 2018
I asked about the status. Talked with some Thomas Smith. He says they'll call us tomorrow.
January 21, 2018 Jane
She has emailed the contract for us to sign.
is not per se against 1NF. As long as you are not interested in the details in your database (e.g. you'll never select all January correspondence alone on the contract), then to the database this is atomic; no need to change it.
Same with contacts. If you have a single string column with "Jane (sales person) 123-45678 jane@them.com, Jack (assistant) 123-98765 jack@them.com", this again is not per se against 1NF. As long as you don't want to select names only or check phone numbers, but always treat the string as a whole, as the contact, then just do so.
You see, it all boils down to what exactly you want to model and how you want to work with the data. Yes, there are agents and direct providers, but is the difference between the two so big that you need two different tables? Yes, there is a chronology of correspondence per contract, but do you really need it separated into single correspondences in your database?