How to Robustly Parse Malformed CSV

Parse a malformed CSV line

I am not sure how malformed your data is, but here is one approach for that line.

> puts line
"Sensitive",2416,159,"Test "Malformed" Failure",2789,111,7-24-11,1800,0600,"R2","12323","",""
>
> puts line.scan /[\d.-]+|(?:"[^"]*"[^",]*)+/
"Sensitive"
2416
159
"Test "Malformed" Failure"
2789
111
7-24-11
1800
0600
"R2"
"12323"
""
""

Note: Tested on ruby 1.9.2p290
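
If you need the field values without their surrounding quotes, here is a small follow-up sketch (not part of the original answer): it strips the outer quote on each token while keeping the embedded ones.

fields = line.scan(/[\d.-]+|(?:"[^"]*"[^",]*)+/).map do |token|
  token.sub(/\A"/, '').sub(/"\z/, '') # drop only the outer quotes
end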

Google Apps Script: REGEX to fix malformed pipe delimited csv file runs too slowly

The first three replace lines can be merged into one: you just want to remove all \r\n occurrences that are not followed by 1 to 5 digits and a |, i.e. .replace(/\r\n(?!\d{1,5}\|)/g,"").

The last two replace lines can also be merged into one if you use alternation: .replace(/'\||\|'/g,"|").

Use:

function cleanCSV(csvFileId){
  // The file we receive has line breaks in the middle of the records; this removes
  // those line breaks and saves the cleaned content as a CSV file.
  // Note: csvFolderId and csvFileName are assumed to be defined elsewhere (e.g. as script globals).
  var content = DriveApp.getFileById(csvFileId).getBlob().getDataAsString();
  var newContent = content.replace(/\r\n(?!\d{1,5}\|)/g,""); // remove line endings not followed by 1-5 digits and |
  var csvContent = newContent.replace(/'\||\|'/g,"|"); // remove single quotes adjacent to a pipe
  //Logger.log(csvContent);
  var sheetId = DriveApp.getFolderById(csvFolderId).createFile(csvFileName, csvContent, MimeType.CSV).getId();
  return sheetId;
}

How can I further process the line of data that causes the Ruby FasterCSV library to throw a MalformedCSVError?


require 'csv' # CSV in ruby 1.9.2 is identical to FasterCSV

# File.open('test.txt','r').each do |line|
DATA.each do |line|
  begin
    CSV.parse(line) do |row|
      p row # handle row
    end
  rescue CSV::MalformedCSVError => er
    puts er.message
    puts "This one: #{line}"
    # and continue
  end
end

# Output:

# Unclosed quoted field on line 1.
# This one: 1,"aaa
# Illegal quoting on line 1.
# This one: aaa",valid
# Unclosed quoted field on line 1.
# This one: 2,"bbb
# ["bbb", "invalid"]
# ["3", "ccc", "valid"]

__END__
1,"aaa
aaa",valid
2,"bbb
bbb,invalid
3,ccc,valid

Just feed the file line by line to FasterCSV and rescue the error.

How can I read and parse CSV files in C++?

If you don't care about escaping commas and newlines,

AND you can't embed commas or newlines in quotes (if you can't escape them, then...),

then it's only about three lines of code (OK, 14, but it's only 15 to read the whole file).

std::vector<std::string> getNextLineAndSplitIntoTokens(std::istream& str)
{
    std::vector<std::string> result;
    std::string line;
    std::getline(str, line);

    std::stringstream lineStream(line);
    std::string cell;

    while (std::getline(lineStream, cell, ','))
    {
        result.push_back(cell);
    }
    // This checks for a trailing comma with no data after it.
    if (!lineStream && cell.empty())
    {
        // If there was a trailing comma then add an empty element.
        result.push_back("");
    }
    return result;
}
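
A minimal usage sketch (not part of the original answer), assuming getNextLineAndSplitIntoTokens from above is in scope and that "plop.csv" (the placeholder file used in the later examples) has at least four columns:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::ifstream file("plop.csv");

    while (file)
    {
        // Returns one vector of fields per line; a final call at end of file
        // yields a single empty element, which the size check skips.
        std::vector<std::string> tokens = getNextLineAndSplitIntoTokens(file);
        if (tokens.size() > 3)
            std::cout << "4th Element(" << tokens[3] << ")\n";
    }
}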

I would just create a class representing a row.

Then stream into that object:

#include <iterator>
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <string>
#include <string_view>

class CSVRow
{
    public:
        std::string_view operator[](std::size_t index) const
        {
            return std::string_view(&m_line[m_data[index] + 1], m_data[index + 1] - (m_data[index] + 1));
        }
        std::size_t size() const
        {
            return m_data.size() - 1;
        }
        void readNextRow(std::istream& str)
        {
            std::getline(str, m_line);

            m_data.clear();
            m_data.emplace_back(-1);
            std::string::size_type pos = 0;
            while ((pos = m_line.find(',', pos)) != std::string::npos)
            {
                m_data.emplace_back(pos);
                ++pos;
            }
            // This checks for a trailing comma with no data after it.
            pos = m_line.size();
            m_data.emplace_back(pos);
        }
    private:
        std::string m_line;
        std::vector<int> m_data;
};

std::istream& operator>>(std::istream& str, CSVRow& data)
{
    data.readNextRow(str);
    return str;
}

int main()
{
    std::ifstream file("plop.csv");

    CSVRow row;
    while (file >> row)
    {
        std::cout << "4th Element(" << row[3] << ")\n";
    }
}

But with a little work we could technically create an iterator:

class CSVIterator
{
    public:
        typedef std::input_iterator_tag iterator_category;
        typedef CSVRow value_type;
        typedef std::size_t difference_type;
        typedef CSVRow* pointer;
        typedef CSVRow& reference;

        CSVIterator(std::istream& str) : m_str(str.good() ? &str : nullptr) { ++(*this); }
        CSVIterator() : m_str(nullptr) {}

        // Pre increment
        CSVIterator& operator++() { if (m_str) { if (!((*m_str) >> m_row)) { m_str = nullptr; } } return *this; }
        // Post increment
        CSVIterator operator++(int) { CSVIterator tmp(*this); ++(*this); return tmp; }
        CSVRow const& operator*() const { return m_row; }
        CSVRow const* operator->() const { return &m_row; }

        bool operator==(CSVIterator const& rhs) { return ((this == &rhs) || ((this->m_str == nullptr) && (rhs.m_str == nullptr))); }
        bool operator!=(CSVIterator const& rhs) { return !((*this) == rhs); }
    private:
        std::istream* m_str;
        CSVRow m_row;
};


int main()
{
    std::ifstream file("plop.csv");

    for (CSVIterator loop(file); loop != CSVIterator(); ++loop)
    {
        std::cout << "4th Element(" << (*loop)[3] << ")\n";
    }
}

Now that we are in 2020, let's add a CSVRange object:

class CSVRange
{
    std::istream& stream;

    public:
        CSVRange(std::istream& str)
            : stream(str)
        {}
        CSVIterator begin() const { return CSVIterator{stream}; }
        CSVIterator end()   const { return CSVIterator{}; }
};

int main()
{
    std::ifstream file("plop.csv");

    for (auto& row : CSVRange(file))
    {
        std::cout << "4th Element(" << row[3] << ")\n";
    }
}

Reliably parse unpredictable CSV formats

Python's built-in csv module is pretty good at this. However, when dealing with unpredictable CSV formats, I expect you'll have to do maintenance no matter what library you pick.
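
A minimal sketch (not from the original answer) of one way the csv module helps with unpredictable input: csv.Sniffer can guess the delimiter and quoting from a sample of the file. "data.csv" is a placeholder path, and the Sniffer can still guess wrong, which is where the maintenance comes in.

import csv

with open("data.csv", newline="") as f:
    sample = f.read(4096)                  # small sample for dialect detection
    dialect = csv.Sniffer().sniff(sample)  # guesses delimiter/quotechar
    f.seek(0)
    for row in csv.reader(f, dialect):
        print(row)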

BeanIO unquotedQuotesAllowed in CSV not working

I'm going to assume that you want to preserve/retain the double quotes (").

The unquotedQuotesAllowed config option is only applicable to CSV streams, but based on your sample test data, you are using a pipe symbol (|) as a delimiter. Yes, you can change the delimiter for a CSV stream, but I think it would be better to just use a stream mapping configured as a delimited format. IMO this is easier to work with, and you don't need to comply with all the rules and subtleties of the CSV format.

I would then use the following:

<stream name="csvStream" format="delimited">
    <parser>
        <property name="delimiter" value="|"/>
    </parser>
    <record name="...">
        ....
    </record>
</stream>

Using the above mapping, I get the following output:

Field1: "TEST"/37326330, Field2: TEST2

FasterCSV - instead of getting the file content, getting the path to the file

Because CSV#parse actually parses the string you passed to it, not the file at the path that the string contains.
What you need is CSV#read: http://ruby-doc.org/stdlib-1.9.2/libdoc/csv/rdoc/CSV.html#method-c-read
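
A minimal sketch of the difference (the file name is a placeholder):

require 'csv'

rows_from_file   = CSV.read('data.csv')         # opens and parses the file at that path
rows_from_string = CSV.parse("a,b,c\n1,2,3\n")  # parses the string itself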


