Import CSV File Error: Column Value Containing Column Delimiter

A word of warning: I'm not a regular C# coder.

But anyway this code does the following:

It opens a file called C:\Input.TXT

It searches each line. If the line has more than 5 commas, it takes all the extra commas out of the third last field (notes)

It writes the result to C:\Output.TXT - that's the one you need to actually import

There are many improvements that could be made:

  • Get file paths from connection managers
  • Error handling
  • An experienced C# programmer could probably do this in half the code

Keep in mind your package will need write access to the appropriate folder

public void Main()
{
    // Search the file and remove extra commas from the third-last field
    // Extended from code at
    // http://stackoverflow.com/questions/1915632/open-a-file-and-replace-strings-in-c-sharp
    // Nick McDermaid

    string sInputLine;
    string sOutputLine;
    string sDelimiter = ",";
    string[] sData;
    int iIndex;

    // open the file for read
    using (System.IO.FileStream inputStream = File.OpenRead("C:\\Input.txt"))
    {
        using (StreamReader inputReader = new StreamReader(inputStream))
        {
            // open the output file
            using (StreamWriter outputWriter = File.AppendText("C:\\Output.txt"))
            {
                // Read each line
                while (null != (sInputLine = inputReader.ReadLine()))
                {
                    // Grab each field out
                    sData = sInputLine.Split(sDelimiter[0]);
                    if (sData.Length <= 6)
                    {
                        // 6 or fewer fields - just echo it out
                        sOutputLine = sInputLine;
                    }
                    else
                    {
                        // line has more than 6 pieces
                        // We assume all of the extra commas are in the notes field

                        // Put the first three fields together
                        sOutputLine =
                            sData[0] + sDelimiter +
                            sData[1] + sDelimiter +
                            sData[2] + sDelimiter;

                        // Join the middle notes fields with spaces instead of the delimiter
                        for (iIndex = 3; iIndex <= sData.Length - 3; iIndex++)
                        {
                            sOutputLine = sOutputLine + sData[iIndex];
                            if (iIndex < sData.Length - 3)
                            {
                                sOutputLine = sOutputLine + " ";
                            }
                        }

                        // Tack on the last two fields
                        sOutputLine = sOutputLine +
                            sDelimiter + sData[sData.Length - 2] +
                            sDelimiter + sData[sData.Length - 1];
                    }

                    // We've evaluated the current line, now write it out
                    outputWriter.WriteLine(sOutputLine);
                }
            }
        }
    }

    Dts.TaskResult = (int)Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success;
}

SSIS error The column delimiter for column x was not found

You have some interesting features in your spreadsheet. The column names in the first row have trailing spaces in them. The column values in rows 2+ have spaces after the closing double quotes (3rd column), and leading spaces for the fourth and final two columns. This smells like a file generated from a mainframe and forced into a CSV.

Were it me, I would not treat this as a comma-delimited file (based on the 2 lines of data). This is a ragged-right file format that someone has added text qualifiers and column delimiters to.

  • Use LF as the header row delimiter
  • header rows to skip = 1
  • uncheck column names in first row

Sample Image

In the Columns tab, add a column break wherever the quotes and commas fall.

Sample Image

In the Advanced tab, name every other column ignore; the column after each one will hold the actual data.

Sample Image

In my data flow, I pulled in the ignore columns to show that they did contain junk but in your package, you would not want to do this.

Sample Image

What's left?

Cleanup.

You can't convert your numbers to a number data type, as those leading spaces will cause issues (if memory serves me correctly). Plus, the string data may have trailing spaces as well, so I would add a derived column that applies TRIM to all the columns.

From that point, then try and convert to a strongly typed value and land it in your database.

Here's my sample input file (saved with unix/linefeed/LF endings)

"NAME  ","OPEDT ","OBJ","PCT","MGR   ","rr    ","WCASH "       
"AAAA","07/12/92","BELL" , 3.23,"AUJ", 0, 12364.00
"BBBB","01/05/91","PELL" , 0.78,"ACC", 0, 9879.00
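The trim-then-convert cleanup described above can be sketched in pandas (outside SSIS) against that same sample file; the file contents are inlined here, and exactly which columns you'd convert is an assumption for illustration:

```python
import io
import pandas as pd

# Inlined stand-in for the sample file above
raw = io.StringIO(
    '"NAME  ","OPEDT ","OBJ","PCT","MGR   ","rr    ","WCASH "\n'
    '"AAAA","07/12/92","BELL" , 3.23,"AUJ", 0, 12364.00\n'
    '"BBBB","01/05/91","PELL" , 0.78,"ACC", 0, 9879.00\n'
)

# Bring everything in as string; don't let the parser guess types yet
df = pd.read_csv(raw, dtype=str, skipinitialspace=True)

# TRIM every column, headers included, before any type conversion
df.columns = df.columns.str.strip()
df = df.apply(lambda col: col.str.strip())

# Only now attempt the strongly typed conversion
df["PCT"] = pd.to_numeric(df["PCT"])
df["WCASH"] = pd.to_numeric(df["WCASH"])
```

The order matters: converting before trimming is exactly what fails on the leading-space numerics.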

Reading CSV with Separator in column values

on_bad_lines replaces the now-deprecated error_bad_lines, so if you're on an older version of pandas, you can still use the old parameter:

pd.read_csv("data.csv", sep = "|", error_bad_lines = False)
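On pandas 1.3 and newer, the equivalent skip is spelled with on_bad_lines (shown here against an inlined stand-in for data.csv):

```python
import io
import pandas as pd

# Stand-in for data.csv with one bad row carrying an extra separator
raw = io.StringIO("a|b\n1|2\n3|4|5\n6|7\n")

# pandas >= 1.3: on_bad_lines="skip" replaces error_bad_lines=False
df = pd.read_csv(raw, sep="|", on_bad_lines="skip")
```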

If you want to keep bad lines, you can also use warn_bad_lines, extract bad lines from the warnings and read them separately in a single column:

import contextlib

import pandas as pd

with open('log.txt', 'w') as log:
    with contextlib.redirect_stderr(log):
        df = pd.read_csv('data.csv', sep='|', error_bad_lines=False, warn_bad_lines=True)

with open('log.txt') as f:
    warnings = f.readlines()

# Each warning reads "Skipping line N: ...", so pull out N (it may have more than one digit)
bad_lines = [int(x.split(':')[0]) - 1 for x in warnings[0].split('line ')[1:]]

df_bad_lines = pd.read_csv('data.csv', skiprows=lambda x: x not in bad_lines, squeeze=True, header=None)

Problem in importing CSV file with column containing NULL values or comma

You have defined your EmpID field as being Int64, which will work great when you have digits there, but in the case of no data (with a row still being present), SSIS is going to try to convert the empty string to a number and that will fail.

If you add an error pathway from the Flat File Source for truncation/error/etc you'd see rows 5+ going down that path. For this data, I'd define everything as string as you need to get the data into the pipeline and then you'll need to take action on it based on whatever business rules make sense (no name/id, trash it).

As @alex points out in the comment, the final rows indicate there are three columns of data whereas you've defined two so when the flat file source gets to that, you'll blow up. SSIS won't be able to handle inconsistent file formats like that.
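To illustrate the read-everything-as-string-then-apply-business-rules approach outside of SSIS, here is a minimal pandas sketch; the rows and the names (Alice, Bob) are made up for the example:

```python
import io
import pandas as pd

# Illustrative rows: EmpID is sometimes empty, as described above
raw = io.StringIO("EmpID,Name\n1,Alice\n,Bob\n3,\n")

# Bring everything into the pipeline as string first
df = pd.read_csv(raw, dtype=str, keep_default_na=False)

# Business rule from above: no name or no id -> trash the row
df = df[(df["EmpID"] != "") & (df["Name"] != "")]

# Only the surviving rows are safe to convert to a strongly typed value
df["EmpID"] = pd.to_numeric(df["EmpID"])
```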

Column delimiter not found for the column

The file has to be changed. If the text qualifier of the file is a double-quote, then the actual data cannot contain double-quotes.

You can either talk to the person who creates the file and have them use a different text qualifier, or you can write a script task that edits the file and replaces the delimiting double-quotes with another qualifier, leaving the double-quotes in the data intact.

For a bit more detail, see here.
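A minimal sketch of that script-task idea, per line, in Python; the ~ replacement qualifier and the sample line are assumptions, and it also assumes no field's text itself contains the exact sequence ",":

```python
# Hypothetical input line: the qualifier is ", but the data contains quotes too
line = '"Bob said "hi" to me","second field"'

fixed = line
# Replace only the delimiting quotes: at line start, line end,
# and the "," sequences that separate fields
if fixed.startswith('"'):
    fixed = '~' + fixed[1:]
if fixed.endswith('"'):
    fixed = fixed[:-1] + '~'
fixed = fixed.replace('","', '~,~')
# The embedded quotes around "hi" are left intact
```

You would then tell the flat file connection manager that ~ is the text qualifier.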

Import CSV file where last column has many separators

I would read the file as one single column and parse manually:

df = pd.read_csv(filename, sep='\t')

pat = ','.join([f'(?P<{x}>[^,]*)' for x in ['region','state','latitude','longitute']])
pat = '^' + pat + ',(?P<status>.*)$'

df = df.iloc[:,0].str.extract(pat)

Output:

    region state latitude  longitute             status
0  florida    FL  27.8333    -81.717  open,for,activity
1  georgia    GA  32.9866   -83.6487               open
2   hawaii    HI  21.1098  -157.5311      illegal,stuff
3     iowa    IA  42.0046    -93.214    medical,limited

Import file delimited to Database Column with delimiter

Try using OPENROWSET

SELECT * INTO #Temp
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
                'Text;Database=C:\;HDR=Yes;FORMAT=Delimited(;)',
                'SELECT * FROM [8AA31234-2215-4FE4-907F-AB3B3A417F72~WO2~29302~INSTAL PP~instal pp~1319.csv]')

This is a good way to import files that have commas embedded in the values.


