Import CSV File Error : Column Value containing column delimiter
A word of warning: I'm not a regular C# coder.
But anyway this code does the following:
It opens a file called C:\Input.TXT
It searches each line. If the line has more than 5 commas, it takes all the extra commas out of the third last field (notes)
It writes the result to C:\Output.TXT - that's the one you need to actually import
There are many improvements that could be made:
- Get file paths from connection managers
- Error handling
- An experienced C# programmer could probably do this in half the code
Keep in mind your package will need write access to the appropriate folder
public void Main()
{
    // Search the file and remove extra commas from the third last field
    // Extended from code at
    // http://stackoverflow.com/questions/1915632/open-a-file-and-replace-strings-in-c-sharp
    // Nick McDermaid
    string sInputLine;
    string sOutputLine;
    string sDelimiter = ",";
    string[] sData;
    int iIndex;

    // Open the input file for reading
    using (System.IO.FileStream inputStream = File.OpenRead("C:\\Input.txt"))
    {
        using (StreamReader inputReader = new StreamReader(inputStream))
        {
            // Open the output file
            using (StreamWriter outputWriter = File.AppendText("C:\\Output.txt"))
            {
                // Read each line
                while (null != (sInputLine = inputReader.ReadLine()))
                {
                    // Grab each field out
                    sData = sInputLine.Split(sDelimiter[0]);
                    if (sData.Length <= 6)
                    {
                        // 6 or fewer fields - just echo the line out
                        sOutputLine = sInputLine;
                    }
                    else
                    {
                        // The line has more than 6 pieces.
                        // We assume all of the extra commas are in the notes field.
                        // Put the first three fields together
                        sOutputLine =
                            sData[0] + sDelimiter +
                            sData[1] + sDelimiter +
                            sData[2] + sDelimiter;
                        // Put the middle notes pieces together, replacing each extra delimiter with a space
                        for (iIndex = 3; iIndex <= sData.Length - 3; iIndex++)
                        {
                            sOutputLine = sOutputLine + sData[iIndex] + " ";
                        }
                        // Tack on the last two fields
                        sOutputLine = sOutputLine +
                            sDelimiter + sData[sData.Length - 2] +
                            sDelimiter + sData[sData.Length - 1];
                    }
                    // We've evaluated the current line; now write it out
                    outputWriter.WriteLine(sOutputLine);
                }
            }
        }
    }
    Dts.TaskResult = (int)Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success;
}
SSIS error The column delimiter for column x was not found
You have some interesting features in your spreadsheet. The column names in the first row have trailing spaces in them. The column values in rows 2+ have spaces after the closing double quotes (3rd column), and leading spaces for the fourth and final two columns. This smells like a file generated from a mainframe and forced into a CSV.
Were it me, I would not treat this as a column-delimited file (based on the 2 lines of data). This is a ragged right file format that someone has added text qualifiers and column delimiters to.
- Use a LF as the header row delimiter
- header rows to skip = 1
- uncheck column names in first row
In the Columns tab, set a column break wherever the quotes and commas sit.
In the Advanced tab, name every other column "ignore"; the column after each one holds the actual data.
In my data flow, I pulled in the ignore columns to show that they did contain junk but in your package, you would not want to do this.
What's left?
Cleanup.
You can't convert your numbers to a number data type as those leading spaces will cause issues (if memory serves me correctly). Plus, the string data may have trailing spaces as well so I would have a derived column that applies TRIM to all the columns.
From that point, then try and convert to a strongly typed value and land it in your database.
Here's my sample input file (saved with unix/linefeed/LF endings)
"NAME ","OPEDT ","OBJ","PCT","MGR ","rr ","WCASH "
"AAAA","07/12/92","BELL" , 3.23,"AUJ", 0, 12364.00
"BBBB","01/05/91","PELL" , 0.78,"ACC", 0, 9879.00
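The setup above is SSIS-specific, but the trim-then-convert idea can be sketched in Python against that same sample file. This is just an illustration, not the SSIS answer itself: the csv module stands in for the Flat File Source, str.strip plays the role of the TRIM derived column, and only then do we convert to a strongly typed value:

```python
import csv
import io

# The sample file from the answer above, inlined instead of read from disk
sample = '''"NAME ","OPEDT ","OBJ","PCT","MGR ","rr ","WCASH "
"AAAA","07/12/92","BELL" , 3.23,"AUJ", 0, 12364.00
"BBBB","01/05/91","PELL" , 0.78,"ACC", 0, 9879.00'''

# skipinitialspace drops the spaces after commas; strip() cleans up the
# trailing spaces inside the quoted values (the TRIM step)
rows = [[field.strip() for field in row]
        for row in csv.reader(io.StringIO(sample), skipinitialspace=True)]
header, data = rows[0], rows[1:]

# Only after trimming is it safe to convert to a strongly typed value
wcash = [float(row[6]) for row in data]
```

The untrimmed fields would have carried junk like "BELL " and "NAME " straight into the destination; trimming first is what makes the type conversion reliable.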
Reading CSV with Separator in column values
on_bad_lines deprecates error_bad_lines, so if you're on an older version of pandas, you can just use that:
pd.read_csv("data.csv", sep = "|", error_bad_lines = False)
If you want to keep the bad lines, you can also use warn_bad_lines, extract the bad line numbers from the warnings, and read those lines separately into a single column:
import contextlib
import pandas as pd

# Capture the "Skipping line N: ..." warnings that pandas writes to stderr
with open('log.txt', 'w') as log:
    with contextlib.redirect_stderr(log):
        df = pd.read_csv('data.csv', sep='|',
                         error_bad_lines=False, warn_bad_lines=True)

with open('log.txt') as f:
    warnings = f.readlines()

# Each warning looks like "Skipping line N: expected X fields, saw Y";
# split out N (which may be more than one digit) and convert to a 0-based index
bad_lines = [int(x.split(':')[0]) - 1 for x in warnings[0].split('line ')[1:]]
df_bad_lines = pd.read_csv('data.csv', skiprows=lambda x: x not in bad_lines,
                           squeeze=True, header=None)
Problem in importing CSV file with column containig NULL values or comma
You have defined your EmpID field as Int64, which works great when there are digits there, but when there is no data (yet a row is still present), SSIS will try to convert the empty string to a number and that will fail.
If you add an error pathway from the Flat File Source for truncation/error/etc you'd see rows 5+ going down that path. For this data, I'd define everything as string as you need to get the data into the pipeline and then you'll need to take action on it based on whatever business rules make sense (no name/id, trash it).
As @alex points out in the comment, the final rows indicate there are three columns of data whereas you've defined two so when the flat file source gets to that, you'll blow up. SSIS won't be able to handle inconsistent file formats like that.
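The read-everything-as-string-then-apply-rules approach can be sketched outside SSIS as well. Here's a minimal pandas sketch; the sample data, column names, and the "no id, trash it" rule are assumptions standing in for your actual file and business rules:

```python
import io
import pandas as pd

# Hypothetical file: EmpID should be Int64 but is sometimes empty
sample = io.StringIO("EmpID,EmpName\n1,Alice\n2,Bob\n,NoId\n")

# Read everything as string so rows with empty EmpID still enter the pipeline
# (keep_default_na=False keeps empty fields as "" instead of NaN)
df = pd.read_csv(sample, dtype=str, keep_default_na=False)

# Business rule (an assumption here): no id means drop the row,
# then convert the survivors to a strongly typed column
df = df[df["EmpID"] != ""].copy()
df["EmpID"] = df["EmpID"].astype("int64")
```

Had EmpID been read as int64 directly, the empty row would have failed the load instead of being handled by a rule you chose.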
Column delimiter not found for the column
The file has to be changed. If the text qualifier of the file is a double-quote, then the actual data cannot contain double-quotes.
You can either talk to the person who creates the file and have them use a different text qualifier, or you can write a script task that edits the file and replaces the delimiting double-quotes with another qualifier, leaving the double-quotes in the data intact.
For a bit more detail, see here.
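That script-task idea can be sketched in a few lines (Python here for brevity; the same regexes port to a C# script task). The sample line and the pipe qualifier are assumptions, and the trick only works if qualifier quotes always sit against a comma or a line boundary:

```python
import re

# Hypothetical line: the text qualifier is a double-quote, but the second
# field also contains literal double-quotes in its data
line = '"1","He said "hi" to me","2020"'

# Quotes at the start/end of the line, or in a quote-comma-quote run,
# are delimiting; swap only those for a pipe and leave embedded quotes alone
fixed = re.sub(r'^"|"$', '|', line)
fixed = re.sub(r'","', '|,|', fixed)
```

The result uses | as the text qualifier while the embedded "hi" survives untouched, so the flat file source can be pointed at the rewritten file with | configured as the qualifier.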
Import CSV file where last column has many separators
I would read the file as one single column and parse manually:
df = pd.read_csv(filename, sep='\t')
pat = ','.join([f'(?P<{x}>[^\,]*)' for x in ['region','state','latitude','longitute']])
pat = '^'+ pat + ',(?P<status>.*)$'
df = df.iloc[:,0].str.extract(pat)
Output:
region state latitude longitute status
0 florida FL 27.8333 -81.717 open,for,activity
1 georgia GA 32.9866 -83.6487 open
2 hawaii HI 21.1098 -157.5311 illegal,stuff
3 iowa IA 42.0046 -93.214 medical,limited
Import file delimited to Database Column with delimiter
Try using OPENROWSET
SELECT * INTO #Temp
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0','Text;Database=C:\;HDR=Yes;FORMAT=Delimited(;)', 'SELECT * FROM [8AA31234-2215-4FE4-907F-AB3B3A417F72~WO2~29302~INSTAL PP~instal pp~1319.csv]')
This is a good approach for importing files that have commas embedded in the values.