Trimmining a Column with Bad Data

Trimmining a column with bad data

Use CHARINDEX(), LEN() and RIGHT() instead.

RIGHT(LPNumber, LEN(LPNumber) - CHARINDEX('#', LPNumber, 0))

Is there a way to TRIM all data in a SELECT * FROM statement?

You need to specify each string column by hand:

SELECT TRIM(col1),       --LTRIM(RTRIM(...)) If RDBMS is SQL Server
TRIM(col2),
TRIM(col3),
TRIM(col4)
-- ...
FROM table

There is another problem with your proposal. * is placeholder for each column in table so there will be problem with trimming date/decimal/spatial data ....

Addendum

Using Oracle 18c Polymorphic Table Functions(provided code is just PoC, there is a space for a lot of improvements):

CREATE TABLE tab(id INT, d DATE,
v1 VARCHAR2(100), v2 VARCHAR2(100), v3 VARCHAR2(100) );

INSERT INTO tab(id, d,v1, v2, v3)
VALUES (1, SYSDATE, ' aaaa ', ' b ', ' c');
INSERT INTO tab(id, d,v1, v2, v3)
VALUES (2, SYSDATE+1, ' afasd', ' ', ' d');
COMMIT;

SELECT * FROM tab;
-- Output
.----.-----------.-----------.-----------.-----.
| ID | D | V1 | V2 | V3 |
:----+-----------+-----------+-----------+-----:
| 1 | 02-MAR-18 | aaaa | b | c |
:----+-----------+-----------+-----------+-----:
| 2 | 03-MAR-18 | afasd | | d |
'----'-----------'-----------'-----------'-----'

And table function:

CREATE OR REPLACE PACKAGE ptf AS
FUNCTION describe(tab IN OUT dbms_tf.table_t)RETURN dbms_tf.describe_t;
PROCEDURE FETCH_ROWS;
END ptf;
/
CREATE OR REPLACE PACKAGE BODY ptf AS
FUNCTION describe(tab IN OUT dbms_tf.table_t) RETURN dbms_tf.describe_t AS
new_cols DBMS_TF.COLUMNS_NEW_T;
BEGIN
FOR i IN 1 .. tab.column.count LOOP
IF tab.column(i).description.type IN ( dbms_tf.type_varchar2) THEN
tab.column(i).pass_through:=FALSE;
tab.column(i).for_read:= TRUE;
NEW_COLS(i) :=
DBMS_TF.COLUMN_METADATA_T(name=> tab.column(i).description.name,
type => tab.column(i).description.type);
END IF;
END LOOP;
RETURN DBMS_TF.describe_t(new_columns=>new_cols, row_replication=>true);
END;

PROCEDURE FETCH_ROWS AS
inp_rs DBMS_TF.row_set_t;
out_rs DBMS_TF.row_set_t;
rows PLS_INTEGER;
BEGIN
DBMS_TF.get_row_set(inp_rs, rows);
FOR c IN 1 .. inp_rs.count() LOOP
FOR r IN 1 .. rows LOOP
out_rs(c).tab_varchar2(r) := TRIM(inp_rs(c).tab_varchar2(r));
END LOOP;
END LOOP;
DBMS_TF.put_row_set(out_rs, replication_factor => 1);
END;
END ptf;

And final call:

CREATE OR REPLACE FUNCTION trim_col(tab TABLE)
RETURN TABLE pipelined row polymorphic USING ptf;

SELECT *
FROM trim_col(tab); -- passing table as table function argument

.----.-----------.-------.-----.----.
| ID | D | V1 | V2 | V3 |
:----+-----------+-------+-----+----:
| 1 | 02-MAR-18 | aaaa | b | c |
:----+-----------+-------+-----+----:
| 2 | 03-MAR-18 | afasd | - | d |
'----'-----------'-------'-----'----'

db<>fiddle demo

Trim all database fields

No cursors. Copy and paste the output. Works also for SQL 2000, which doesn't have varchar(max). This can be easily extended to add a GO line to the end of each UPDATE if desired.

SELECT  SQL
FROM ( SELECT t.TABLE_CATALOG
, t.TABLE_SCHEMA
, t.TABLE_NAME
, 0 SORT
, 'UPDATE ' + QUOTENAME(t.TABLE_CATALOG) + '.' + QUOTENAME(t.TABLE_SCHEMA) + '.' + QUOTENAME(t.TABLE_NAME) SQL
FROM INFORMATION_SCHEMA.TABLES t
JOIN INFORMATION_SCHEMA.COLUMNS c
ON t.TABLE_CATALOG = c.TABLE_CATALOG
AND t.TABLE_SCHEMA = c.TABLE_SCHEMA
AND t.TABLE_NAME = c.TABLE_NAME
WHERE t.TABLE_TYPE = 'BASE TABLE'
AND c.DATA_TYPE IN ('char','nchar','varchar','nvarchar')
GROUP BY t.TABLE_CATALOG
, t.TABLE_SCHEMA
, t.TABLE_NAME
UNION ALL
SELECT x.TABLE_CATALOG
, x.TABLE_SCHEMA
, x.TABLE_NAME
, CASE WHEN x.COLUMN_NAME_MIN = y.COLUMN_NAME
THEN 1
ELSE 2
END SORT
, CASE WHEN x.COLUMN_NAME_MIN = y.COLUMN_NAME
THEN 'SET '
ELSE ' , '
END + y.SQL SQL
FROM ( SELECT t.TABLE_CATALOG
, t.TABLE_SCHEMA
, t.TABLE_NAME
, MIN(c.COLUMN_NAME) COLUMN_NAME_MIN
FROM INFORMATION_SCHEMA.TABLES t
JOIN INFORMATION_SCHEMA.COLUMNS c
ON t.TABLE_CATALOG = c.TABLE_CATALOG
AND t.TABLE_SCHEMA = c.TABLE_SCHEMA
AND t.TABLE_NAME = c.TABLE_NAME
WHERE t.TABLE_TYPE = 'BASE TABLE'
AND c.DATA_TYPE IN ('char','nchar','varchar','nvarchar')
GROUP BY t.TABLE_CATALOG
, t.TABLE_SCHEMA
, t.TABLE_NAME
) x
JOIN ( SELECT t.TABLE_CATALOG
, t.TABLE_SCHEMA
, t.TABLE_NAME
, c.COLUMN_NAME
, QUOTENAME(c.COLUMN_NAME) + ' = LTRIM(RTRIM(' + QUOTENAME(c.COLUMN_NAME) + '))' SQL
FROM INFORMATION_SCHEMA.TABLES t
JOIN INFORMATION_SCHEMA.COLUMNS c
ON t.TABLE_CATALOG = c.TABLE_CATALOG
AND t.TABLE_SCHEMA = c.TABLE_SCHEMA
AND t.TABLE_NAME = c.TABLE_NAME
WHERE t.TABLE_TYPE = 'BASE TABLE'
AND c.DATA_TYPE IN ('char','nchar','varchar','nvarchar')
) y
ON x.TABLE_CATALOG = y.TABLE_CATALOG
AND x.TABLE_SCHEMA = y.TABLE_SCHEMA
AND x.TABLE_NAME = y.TABLE_NAME
) x
ORDER BY x.TABLE_CATALOG
, x.TABLE_SCHEMA
, x.TABLE_NAME
, x.SORT
, x.SQL

when we need trim when we don't need in oracle

Looking at the Oracle Documentation on literals:

Text literals have properties of both the CHAR and VARCHAR2 datatypes:

  • Within expressions and conditions, Oracle treats text literals as though they have the datatype CHAR by comparing them using blank-padded comparison semantics.

and the documentation of blank-padded comparison semantics states:

With blank-padded semantics, if the two values have different lengths, then Oracle first adds blanks to the end of the shorter one so their lengths are equal. Oracle then compares the values character by character up to the first character that differs. The value with the greater character in the first differing position is considered greater. If two values have no differing characters, then they are considered equal. This rule means that two values are equal if they differ only in the number of trailing blanks. Oracle uses blank-padded comparison semantics only when both values in the comparison are either expressions of datatype CHAR, NCHAR, text literals, or values returned by the USER function.

So, for your examples:

SELECT * FROM AAA WHERE T1 = 'A'

Since the left-hand side of the comparison, T1 which equals 'A ', is a CHAR(3) and the right-hand side is a text literal 'A' then blank-padded comparison semantics are used and 'A ' = 'A' is true.

Then:

SELECT  * FROM AAA WHERE T2 = 'A  '

The left-hand side of the comparison, T2 which equals 'A ', is a VARCHAR2(3) and the right-hand side is a text literal 'A ' then blank-padded comparison semantics are not used (both sides are not CHAR or text literals) but the values are equal ('A ' = 'A ') so the condition is true.

And:

SELECT  * FROM AAA WHERE T2 = 'A    '

The LHS is a VARCHAR2 equal to 'A ' and the RHS is a text literal equal to 'A '; blank-padded comparison semantics will not be used and the values are not equal so the condition is not true and the row will not be returned.

Then:

SELECT  * FROM AAA WHERE TRIM(T2) = TRIM('A   ')

The TRIM() function states that:

The string returned is of VARCHAR2 datatype if trim_source is a character datatype and a LOB if trim_source is a LOB datatype.

Both values are trimmed so you have the condition 'A' = 'A' where both sides are a VARCAHR2 data type so blank-padded comparison semantics will not be used but the values are equal so the condition is true and the row will be returned.

Finally:

SELECT CASE WHEN 'A' = 'A ' THEN '1' ELSE '0' END T1, 
CASE WHEN 'A ' = 'A' THEN '1' ELSE '0' END T2,
CASE WHEN 'A ' = 'A ' THEN '1' ELSE '0' END T2
FROM DUAL;

In all cases both sides of the equality are text literals so blank-padded comparison semantics will be used so for

  • 'A' = 'A ' the LHS will be padded so the strings are the same length;
  • 'A ' = 'A' the RHS will be padded;
  • 'A ' = 'A ' both sides are equal length so no padding is needed.

And in each case the equalities are equal.

If you use CAST to change from text literals to a VARCHAR2 data type:

SELECT CASE WHEN 'A' = CAST( 'A ' AS VARCHAR2(5) ) THEN 1 ELSE 0 END AS T1
FROM DUAL

The LHS is a text literal but the RHS is a VARCHAR2 and blank-padded comparison semantics are not used and the equality is not true.

Trimming all cells in DataTable

OleDb Objects


Actually what I meant is, to get formatted/trimmed string values from the Excel Sheet and create a DataTable with DataColumn objects of string type only, use the forward-only OleDbDataReader to create both, DataColumn and DataRow objects as it reads. Doing so, the data will be modified and filled in one step hence no need to call another routine to loop again and waste some more time. Also, consider using asynchronous calls to speed up the process and avoid freezing the UI while executing the lengthy task.

Something might help you to go:

private async void TheCaller()
{
using (var ofd = new OpenFileDialog
{
Title = "Select File",
Filter = "Excel WorkBook|*.xlsx|Excel WorkBook 97 - 2003|*.xls|All Files(*.*)|*.*",
AutoUpgradeEnabled = true,
})
{
if (ofd.ShowDialog() != DialogResult.OK) return;

var conString = string.Empty;
var msg = "Loading... Please wait.";

try
{
switch (ofd.FilterIndex)
{
case 1: //xlsx
conString = $"Provider=Microsoft.ACE.OLEDB.12.0;Data Source={ofd.FileName};Extended Properties='Excel 12.0;HDR=Yes;IMEX=1;'";
break;
case 2: //xls
conString = $"Provider=Microsoft.Jet.OLEDB.4.0;Data Source={ofd.FileName};Extended Properties='Excel 8.0;HDR=Yes;IMEX=1;'";
break;
default:
throw new FileFormatException();
}

var sheetName = "sheet1";
var dt = new DataTable();

//Optional: a label to show the current status
//or maybe show a ProgressBar with ProgressBarStyle = Marquee
lblStatus.Text = msg;

await Task.Run(() =>
{
using (var con = new OleDbConnection(conString))
using (var cmd = new OleDbCommand($"SELECT * From [{sheetName}$]", con))
{
con.Open();

using (var r = cmd.ExecuteReader())
while (r.Read())
{
if (dt.Columns.Count == 0)
for (var i = 0; i < r.FieldCount; i++)
dt.Columns.Add(r.GetName(i).Trim(), typeof(string));

object[] values = new object[r.FieldCount];

r.GetValues(values);
dt.Rows.Add(values.Select(x => x?.ToString().Trim()).ToArray());
}
}
});

//If you want...
dataGridView1.DataSource = null;
dataGridView1.DataSource = dt;

msg = "Loading Completed";
}
catch (FileFormatException)
{
msg = "Unknown Excel file!";
}
catch (Exception ex)
{
msg = ex.Message;
}
finally
{
lblStatus.Text = msg;
}
}
}

Here's a demo, reading sheets with 8 columns and 5000 rows from both xls and xlsx files:

SOA62682518

Less than a second. Not bad.

However, this will not work correctly if the Sheet has mixed-types columns like your case where the third column has string and int values in different rows. That because the data type of a column is guessed in Excel by examining the first 8 rows by default. Changing this behavior requires changing the registry value of TypeGuessRows in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\x.0\Engines\Excel from 8 to 0 to force checking all the rows instead of just the first 8. This action will dramatically slow down the performance.

Office Interop Objects


Alternatively, you could use the Microsoft.Office.Interop.Excel objects to read the Excel Sheet, get and format the values of the cells regardless of their types.

using Excel = Microsoft.Office.Interop.Excel;
//...

private async void TheCaller()
{
using (var ofd = new OpenFileDialog
{
Title = "Select File",
Filter = "Excel WorkBook|*.xlsx|Excel WorkBook 97 - 2003|*.xls|All Files(*.*)|*.*",
AutoUpgradeEnabled = true,
})
{
if (ofd.ShowDialog() != DialogResult.OK) return;

var msg = "Loading... Please wait.";
Excel.Application xlApp = null;
Excel.Workbook xlWorkBook = null;

try
{
var dt = new DataTable();

lblStatus.Text = msg;

await Task.Run(() =>
{
xlApp = new Excel.Application();
xlWorkBook = xlApp.Workbooks.Open(ofd.FileName, Type.Missing, true);

var xlSheet = xlWorkBook.Sheets[1] as Excel.Worksheet;
var xlRange = xlSheet.UsedRange;

dt.Columns.AddRange((xlRange.Rows[xlRange.Row] as Excel.Range)
.Cells.Cast<Excel.Range>()
.Where(h => h.Value2 != null)
.Select(h => new DataColumn(h.Value2.ToString()
.Trim(), typeof(string))).ToArray());

foreach (var r in xlRange.Rows.Cast<Excel.Range>().Skip(1))
dt.Rows.Add(r.Cells.Cast<Excel.Range>()
.Take(dt.Columns.Count)
.Select(v => v.Value2 is null
? string.Empty
: v.Value2.ToString().Trim()).ToArray());
});

(dataGridView1.DataSource as DataTable)?.Dispose();
dataGridView1.DataSource = null;
dataGridView1.DataSource = dt;

msg = "Loading Completed";
}
catch (FileFormatException)
{
msg = "Unknown Excel file!";
}
catch (Exception ex)
{
msg = ex.Message;
}
finally
{
xlWorkBook?.Close(false);
xlApp?.Quit();

Marshal.FinalReleaseComObject(xlWorkBook);
Marshal.FinalReleaseComObject(xlApp);

xlWorkBook = null;
xlApp = null;

GC.Collect();
GC.WaitForPendingFinalizers();

lblStatus.Text = msg;
}
}
}

Note: You need to add reference to the mentioned library.

SOA62682518B

Not fast especially with a big number of cells but it gets the desired output.

How to avoid space(s) while inserting data in a table in oracle

You can use a trigger:

CREATE OR REPLACE TRIGGER EMPLOYEE_BIU
BEFORE INSERT OR UPDATE ON EMPLOYEE
FOR EACH ROW
BEGIN
:NEW.DEPT := TRIM(:NEW.DEPT);
END EMPLOYEE_BIU;

Trim a part of a String

Since you are using SQL Server 2016, you can use it's built in STRING_SPLIT method to convert your string to a table.
Then all you have to do is to select the value from the table that starts with CUSTOMER NAME:, like this:

DECLARE @s varchar(300) = 'QSTY-IOM-HFVNBJGYG | Mobile #: 9876541323 | CUSTOMER NAME: MNOP UNIPOUYTREA POIUY | INVOICE DATE:02/28/18 | EMP#: 101466 | EMPLOYEE NAME: ANGELINA CASIANO'

SELECT [value]
FROM STRING_SPLIT(@s, '|')
WHERE LTRIM([value]) LIKE 'CUSTOMER NAME:%'

However, you should read Is storing a delimited list in a database column really that bad? and normalize your database if possible.



Related Topics



Leave a reply



Submit