Parse a malformed CSV line
I'm not sure how malformed your data is, but here is one approach for that line.
> puts line
"Sensitive",2416,159,"Test "Malformed" Failure",2789,111,7-24-11,1800,0600,"R2","12323","",""
>
> puts line.scan /[\d.-]+|(?:"[^"]*"[^",]*)+/
"Sensitive"
2416
159
"Test "Malformed" Failure"
2789
111
7-24-11
1800
0600
"R2"
"12323"
""
""
Note: Tested on ruby 1.9.2p290
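For reuse, the scan above could be wrapped in a small helper. This is only a sketch: the names FIELD_PATTERN and split_malformed_line are made up for illustration, and the pattern is the one from the answer, so it only recognises numeric-looking fields and double-quoted fields. Unquoted text fields and empty unquoted fields are silently skipped, which may or may not suit your data.

```ruby
# Pattern taken from the answer above: either a run of digits, dots and
# hyphens, or one or more quoted chunks (tolerating stray inner quotes).
FIELD_PATTERN = /[\d.-]+|(?:"[^"]*"[^",]*)+/

# Hypothetical helper: returns the fields of one malformed line as an array.
def split_malformed_line(line)
  line.scan(FIELD_PATTERN)
end

line = '"Sensitive",2416,159,"Test "Malformed" Failure",2789,111,7-24-11,1800,0600,"R2","12323","",""'
fields = split_malformed_line(line)

puts fields.length  # 13
puts fields[3]      # "Test "Malformed" Failure"

# Limitation: an unquoted text field such as abc is not matched at all.
p split_malformed_line('abc,42')  # ["42"]
```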
Google Apps Script: REGEX to fix malformed pipe delimited csv file runs too slowly
The first three replace lines can be merged into one: you just want to remove all \r\n occurrences that are not followed by 1 to 5 digits and a |, i.e. .replace(/\r\n(?!\d{1,5}\|)/g,"").
The last two replace lines can also be merged into one if you use alternation: .replace(/'\||\|'/g,"|").
Use
function cleanCSV(csvFileId){
  // The file we receive has line breaks in the middle of the records; this removes those line breaks and converts the file to a CSV.
  var content = DriveApp.getFileById(csvFileId).getBlob().getDataAsString();
  var newContent = content.replace(/\r\n(?!\d{1,5}\|)/g,""); // remove line endings not followed by 1-5 digits and |
  var csvContent = newContent.replace(/'\||\|'/g,"|"); // remove single quotes adjacent to a pipe delimiter
  //Logger.log(csvContent);
  // csvFolderId and csvFileName are not defined in this function; they are assumed to be defined elsewhere in the script.
  var sheetId = DriveApp.getFolderById(csvFolderId).createFile(csvFileName, csvContent, MimeType.CSV).getId();
  return sheetId;
}
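To sanity-check the two merged patterns outside Apps Script, the same regexes can be tried on a small string. The sketch below uses Ruby rather than JavaScript (gsub replaces every match, like the g flag), and both the helper name clean_pipe_csv and the sample data are invented for illustration. Note that a final \r\n at the end of the input is also removed, since nothing follows it.

```ruby
# Hypothetical helper mirroring the two merged replace calls.
def clean_pipe_csv(content)
  content
    .gsub(/\r\n(?!\d{1,5}\|)/, "")  # drop line endings not followed by 1-5 digits and |
    .gsub(/'\||\|'/, "|")           # drop a single quote that sits next to a pipe
end

# Invented sample: the first record is broken across two lines.
sample = "1|'foo\r\nbar'|x\r\n22|'baz'|y\r\n"
cleaned = clean_pipe_csv(sample)

p cleaned  # "1|foobar|x\r\n22|baz|y"
```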
How can I further process the line of data that causes the Ruby FasterCSV library to throw a MalformedCSVError?
require 'csv' #CSV in ruby 1.9.2 is identical to FasterCSV
# File.open('test.txt','r').each do |line|
DATA.each do |line|
  begin
    CSV.parse(line) do |row|
      p row #handle row
    end
  rescue CSV::MalformedCSVError => er
    puts er.message
    puts "This one: #{line}"
    # and continue
  end
end
# Output:
# Unclosed quoted field on line 1.
# This one: 1,"aaa
# Illegal quoting on line 1.
# This one: aaa",valid
# Unclosed quoted field on line 1.
# This one: 2,"bbb
# ["bbb", "invalid"]
# ["3", "ccc", "valid"]
__END__
1,"aaa
aaa",valid
2,"bbb
bbb,invalid
3,ccc,valid
Just feed the file line by line to FasterCSV and rescue the error.
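If the rejected lines need further processing, one option is to collect them instead of only printing them. The following is a sketch under that assumption: partition_csv_lines is a made-up helper name, and it relies on CSV.parse_line raising CSV::MalformedCSVError for a bad line, as in the example above. The exact error messages differ between versions of the csv library, so they are stored rather than compared.

```ruby
require 'csv'

# Hypothetical helper: parses each line on its own and returns
# [rows, rejected], where rejected holds [raw_line, error_message] pairs.
def partition_csv_lines(lines)
  rows = []
  rejected = []
  lines.each do |line|
    begin
      row = CSV.parse_line(line)
      rows << row unless row.nil?  # parse_line returns nil for a blank line
    rescue CSV::MalformedCSVError => e
      rejected << [line.chomp, e.message]
    end
  end
  [rows, rejected]
end

# Same five lines as the DATA section above.
input = %(1,"aaa\naaa",valid\n2,"bbb\nbbb,invalid\n3,ccc,valid\n)
rows, rejected = partition_csv_lines(input.each_line)

p rows
rejected.each { |raw, message| puts "#{message} -> #{raw}" }
```

The rejected array can then be written to a separate file or repaired by hand, while the valid rows continue through the normal pipeline.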
How can I read and parse CSV files in C++?
If you don't care about escaping commas and newlines,
AND you can't embed commas and newlines in quotes (if you can't escape them, then...),
then it's only about three lines of code (OK, 14, but it's only 15 to read the whole file).
std::vector<std::string> getNextLineAndSplitIntoTokens(std::istream& str)
{
    std::vector<std::string> result;
    std::string line;
    std::getline(str,line);

    std::stringstream lineStream(line);
    std::string cell;

    while(std::getline(lineStream,cell, ','))
    {
        result.push_back(cell);
    }
    // This checks for a trailing comma with no data after it.
    if (!lineStream && cell.empty())
    {
        // If there was a trailing comma then add an empty element.
        result.push_back("");
    }
    return result;
}
I would just create a class representing a row.
Then stream into that object:
#include <iterator>
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <string>
#include <string_view>

class CSVRow
{
    public:
        std::string_view operator[](std::size_t index) const
        {
            return std::string_view(&m_line[m_data[index] + 1], m_data[index + 1] - (m_data[index] + 1));
        }
        std::size_t size() const
        {
            return m_data.size() - 1;
        }
        void readNextRow(std::istream& str)
        {
            std::getline(str, m_line);

            // m_data holds the position of each separator, with -1 standing in
            // for a separator just before the first field.
            m_data.clear();
            m_data.emplace_back(-1);
            std::string::size_type pos = 0;
            while((pos = m_line.find(',', pos)) != std::string::npos)
            {
                m_data.emplace_back(pos);
                ++pos;
            }
            // Record the end of the line as the final boundary, so the last
            // field (empty if the line ends with a comma) is included.
            pos = m_line.size();
            m_data.emplace_back(pos);
        }
    private:
        std::string         m_line;
        std::vector<int>    m_data;
};

std::istream& operator>>(std::istream& str, CSVRow& data)
{
    data.readNextRow(str);
    return str;
}
int main()
{
    std::ifstream file("plop.csv");

    CSVRow row;
    while(file >> row)
    {
        std::cout << "4th Element(" << row[3] << ")\n";
    }
}
But with a little work we could technically create an iterator:
class CSVIterator
{
    public:
        typedef std::input_iterator_tag     iterator_category;
        typedef CSVRow                      value_type;
        typedef std::size_t                 difference_type;
        typedef CSVRow*                     pointer;
        typedef CSVRow&                     reference;

        CSVIterator(std::istream& str)  :m_str(str.good()?&str:nullptr) { ++(*this); }
        CSVIterator()                   :m_str(nullptr) {}

        // Pre Increment
        CSVIterator& operator++()               {if (m_str) { if (!((*m_str) >> m_row)){m_str = nullptr;}}return *this;}
        // Post increment
        CSVIterator operator++(int)             {CSVIterator tmp(*this);++(*this);return tmp;}
        CSVRow const& operator*()   const       {return m_row;}
        CSVRow const* operator->()  const       {return &m_row;}

        // Comparisons do not modify the iterator, so they are const.
        bool operator==(CSVIterator const& rhs) const {return ((this == &rhs) || ((this->m_str == nullptr) && (rhs.m_str == nullptr)));}
        bool operator!=(CSVIterator const& rhs) const {return !((*this) == rhs);}
    private:
        std::istream*       m_str;
        CSVRow              m_row;
};

int main()
{
    std::ifstream file("plop.csv");

    for(CSVIterator loop(file); loop != CSVIterator(); ++loop)
    {
        std::cout << "4th Element(" << (*loop)[3] << ")\n";
    }
}
Now that we are in 2020, let's add a CSVRange object:
class CSVRange
{
    std::istream&   stream;
    public:
        CSVRange(std::istream& str)
            : stream(str)
        {}
        CSVIterator begin() const {return CSVIterator{stream};}
        CSVIterator end()   const {return CSVIterator{};}
};

int main()
{
    std::ifstream file("plop.csv");

    for(auto& row: CSVRange(file))
    {
        std::cout << "4th Element(" << row[3] << ")\n";
    }
}
Reliably parse unpredictable CSV formats
Python's csv module is pretty good at this. However, when dealing with unpredictable CSV formats, I expect you'll have to do maintenance no matter what library you pick.
BeanIO unquotedQuotesAllowed in CSV not working
I'm going to assume that you want to preserve/retain the double quotes (").
The unquotedQuotesAllowed config option is only applicable to CSV streams, but based on your sample test data, you are using a pipe symbol (|) as a delimiter. Yes, you can change the delimiter for a CSV stream, but I think it would be better to just use a stream mapping configured as a delimited format. IMO this is easier to work with and you don't need to comply with all the rules and subtleties of the CSV format.
I would then use the following:
<stream name="csvStream" format="delimited">
  <parser>
    <property name="delimiter" value="|"/>
  </parser>
  <record name="...">
    ....
  </record>
</stream>
Using the above mapping, I get the following output:
Field1: "TEST"/37326330, Field2: TEST2
FasterCSV - instead of getting the content file getting the path to file
Because CSV.parse parses the string you pass to it, not the file at the path that the string contains.
What you need is CSV.read: http://ruby-doc.org/stdlib-1.9.2/libdoc/csv/rdoc/CSV.html#method-c-read
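The difference is easy to see side by side. This is a small sketch that writes a throwaway file (the name sample.csv and its contents are invented for the example) and passes the same path to both methods:

```ruby
require 'csv'
require 'tmpdir'
require 'fileutils'

# Create a temporary CSV file to read back (hypothetical name and contents).
dir = Dir.mktmpdir
path = File.join(dir, 'sample.csv')
File.write(path, "a,b\n1,2\n")

from_string = CSV.parse(path)  # treats the path text itself as CSV data
from_file   = CSV.read(path)   # opens the file and parses its contents

p from_string  # a single row containing the path
p from_file    # [["a", "b"], ["1", "2"]]

FileUtils.remove_entry(dir)
```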