How to Parse a CSV String With JavaScript, Which Contains Comma in Data

How can I parse a CSV string with JavaScript, which contains comma in data?

Disclaimer

2014-12-01 Update: The answer below works only for one very specific format of CSV. As correctly pointed out by DG in the comments, this solution does NOT fit the RFC 4180 definition of CSV and it also does NOT fit MS Excel format. This solution simply demonstrates how one can parse one (non-standard) CSV line of input which contains a mix of string types, where the strings may contain escaped quotes and commas.

A non-standard CSV solution

As austincheney correctly points out, you really need to parse the string from start to finish if you wish to properly handle quoted strings that may contain escaped characters. Also, the OP does not clearly define what a "CSV string" really is. First we must define what constitutes a valid CSV string and its individual values.

Given: "CSV String" Definition

For the purpose of this discussion, a "CSV string" consists of zero or more values, where multiple values are separated by a comma. Each value may consist of:

  1. A double quoted string. (may contain unescaped single quotes.)
  2. A single quoted string. (may contain unescaped double quotes.)
  3. A non-quoted string. (may NOT contain quotes, commas or backslashes.)
  4. An empty value. (An all whitespace value is considered empty.)

Rules/Notes:

  • Quoted values may contain commas.
  • Quoted values may contain escaped-anything, e.g. 'that\'s cool'.
  • Values containing quotes, commas, or backslashes must be quoted.
  • Values containing leading or trailing whitespace must be quoted.
  • The backslash is removed from all: \' in single quoted values.
  • The backslash is removed from all: \" in double quoted values.
  • Non-quoted strings are trimmed of any leading and trailing spaces.
  • The comma separator may have adjacent whitespace (which is ignored).

Find:

A JavaScript function which converts a valid CSV string (as defined above) into an array of string values.

Solution:

The regular expressions used by this solution are complex. And (IMHO) all non-trivial regexes should be presented in free-spacing mode with lots of comments and indentation. Unfortunately, JavaScript does not allow free-spacing mode. Thus, the regular expressions implemented by this solution are first presented in native regex syntax (expressed using Python's handy r"""...""" raw multi-line string syntax).

First, here is a regular expression which validates that a CSV string meets the above requirements:

Regex to validate a "CSV string":

re_valid = r"""
# Validate a CSV string having single, double or un-quoted values.
^                                   # Anchor to start of string.
\s*                                 # Allow whitespace before value.
(?:                                 # Group for value alternatives.
  '[^'\\]*(?:\\[\S\s][^'\\]*)*'     # Either Single quoted string,
| "[^"\\]*(?:\\[\S\s][^"\\]*)*"     # or Double quoted string,
| [^,'"\s\\]*(?:\s+[^,'"\s\\]+)*    # or Non-comma, non-quote stuff.
)                                   # End group of value alternatives.
\s*                                 # Allow whitespace after value.
(?:                                 # Zero or more additional values
  ,                                 # Values separated by a comma.
  \s*                               # Allow whitespace before value.
  (?:                               # Group for value alternatives.
    '[^'\\]*(?:\\[\S\s][^'\\]*)*'   # Either Single quoted string,
  | "[^"\\]*(?:\\[\S\s][^"\\]*)*"   # or Double quoted string,
  | [^,'"\s\\]*(?:\s+[^,'"\s\\]+)*  # or Non-comma, non-quote stuff.
  )                                 # End group of value alternatives.
  \s*                               # Allow whitespace after value.
)*                                  # Zero or more additional values
$                                   # Anchor to end of string.
"""

If a string matches the above regex, then that string is a valid CSV string (according to the rules previously stated) and may be parsed using the following regex. The following regex is then used to match one value from the CSV string. It is applied repeatedly until no more matches are found (and all values have been parsed).

Regex to parse one value from valid CSV string:

re_value = r"""
# Match one value in valid CSV string.
(?!\s*$)                            # Don't match empty last value.
\s*                                 # Strip whitespace before value.
(?:                                 # Group for value alternatives.
  '([^'\\]*(?:\\[\S\s][^'\\]*)*)'   # Either $1: Single quoted string,
| "([^"\\]*(?:\\[\S\s][^"\\]*)*)"   # or $2: Double quoted string,
| ([^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)  # or $3: Non-comma, non-quote stuff.
)                                   # End group of value alternatives.
\s*                                 # Strip whitespace after value.
(?:,|$)                             # Field ends on comma or EOS.
"""

Note that there is one special case value that this regex does not match - the very last value when that value is empty. This special "empty last value" case is tested for and handled by the js function which follows.

JavaScript function to parse CSV string:

// Return array of string values, or NULL if CSV string not well formed.
function CSVtoArray(text) {
    var re_valid = /^\s*(?:'[^'\\]*(?:\\[\S\s][^'\\]*)*'|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)\s*(?:,\s*(?:'[^'\\]*(?:\\[\S\s][^'\\]*)*'|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)\s*)*$/;
    var re_value = /(?!\s*$)\s*(?:'([^'\\]*(?:\\[\S\s][^'\\]*)*)'|"([^"\\]*(?:\\[\S\s][^"\\]*)*)"|([^,'"\s\\]*(?:\s+[^,'"\s\\]+)*))\s*(?:,|$)/g;
    // Return NULL if input string is not well formed CSV string.
    if (!re_valid.test(text)) return null;
    var a = [];                     // Initialize array to receive values.
    text.replace(re_value,          // "Walk" the string using replace with callback.
        function(m0, m1, m2, m3) {
            // Remove backslash from \' in single quoted values.
            if (m1 !== undefined) a.push(m1.replace(/\\'/g, "'"));
            // Remove backslash from \" in double quoted values.
            else if (m2 !== undefined) a.push(m2.replace(/\\"/g, '"'));
            else if (m3 !== undefined) a.push(m3);
            return '';              // Return empty string.
        });
    // Handle special case of empty last value.
    if (/,\s*$/.test(text)) a.push('');
    return a;
};
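
In environments that support String.prototype.matchAll (ES2020), the same walk can probably be written without the replace-with-callback trick. The following is only a sketch and not part of the original answer: the name csvToArrayMatchAll is made up for illustration, and the two regexes are copied unchanged from CSVtoArray.

```javascript
// Sketch only: csvToArrayMatchAll is a hypothetical name, not from the original answer.
// The regexes are the same as in CSVtoArray; only the loop over the matches differs.
function csvToArrayMatchAll(text) {
  const reValid = /^\s*(?:'[^'\\]*(?:\\[\S\s][^'\\]*)*'|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)\s*(?:,\s*(?:'[^'\\]*(?:\\[\S\s][^'\\]*)*'|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)\s*)*$/;
  const reValue = /(?!\s*$)\s*(?:'([^'\\]*(?:\\[\S\s][^'\\]*)*)'|"([^"\\]*(?:\\[\S\s][^"\\]*)*)"|([^,'"\s\\]*(?:\s+[^,'"\s\\]+)*))\s*(?:,|$)/g;

  // Return null if the input is not a well formed CSV string.
  if (!reValid.test(text)) return null;

  const values = [];
  for (const m of text.matchAll(reValue)) {
    if (m[1] !== undefined) values.push(m[1].replace(/\\'/g, "'"));      // single quoted
    else if (m[2] !== undefined) values.push(m[2].replace(/\\"/g, '"')); // double quoted
    else if (m[3] !== undefined) values.push(m[3]);                      // unquoted
  }

  // Handle the special case of an empty last value.
  if (/,\s*$/.test(text)) values.push('');
  return values;
}

console.log(csvToArrayMatchAll("'string, duppi, du', 23, lala"));
// [ 'string, duppi, du', '23', 'lala' ]
console.log(csvToArrayMatchAll("one, that's me!"));
// null (an unquoted value may not contain a quote)
```

The for...of loop makes it a little more obvious that the string itself is never modified; the matches are only read.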

Example input and output:

In the following examples, curly braces are used to delimit the {result strings}. (This is to help visualize leading/trailing spaces and zero-length strings.)

// Test 1: Test string from original question.
var test = "'string, duppi, du', 23, lala";
var a = CSVtoArray(test);
/* Array has 3 elements:
a[0] = {string, duppi, du}
a[1] = {23}
a[2] = {lala} */
// Test 2: Empty CSV string.
var test = "";
var a = CSVtoArray(test);
/* Array has 0 elements: */
// Test 3: CSV string with two empty values.
var test = ",";
var a = CSVtoArray(test);
/* Array has 2 elements:
a[0] = {}
a[1] = {} */
// Test 4: Double quoted CSV string having single quoted values.
// (In the JavaScript literal, \\' produces the two characters \' in the CSV string.)
var test = "'one','two with escaped \\' single quote', 'three, with, commas'";
var a = CSVtoArray(test);
/* Array has 3 elements:
a[0] = {one}
a[1] = {two with escaped ' single quote}
a[2] = {three, with, commas} */
// Test 5: Single quoted CSV string having double quoted values.
// (In the JavaScript literal, \\" produces the two characters \" in the CSV string.)
var test = '"one","two with escaped \\" double quote", "three, with, commas"';
var a = CSVtoArray(test);
/* Array has 3 elements:
a[0] = {one}
a[1] = {two with escaped " double quote}
a[2] = {three, with, commas} */
// Test 6: CSV string with whitespace in and around empty and non-empty values.
var test = " one , 'two' , , ' four' ,, 'six ', ' seven ' , ";
var a = CSVtoArray(test);
/* Array has 8 elements:
a[0] = {one}
a[1] = {two}
a[2] = {}
a[3] = { four}
a[4] = {}
a[5] = {six }
a[6] = { seven }
a[7] = {} */

Additional notes:

This solution requires that the CSV string be "valid". For example, unquoted values may not contain backslashes or quotes, e.g. the following CSV string is NOT valid:

var invalid1 = "one, that's me!, escaped \\, comma";

This is not really a limitation because any sub-string may be represented as either a single or double quoted value. Note also that this solution represents only one possible definition for: "Comma Separated Values".

Edit: 2014-05-19: Added disclaimer.
Edit: 2014-12-01: Moved disclaimer to top.

How to parse a string in JavaScript based on comma as a delimiter?

You can split the string on a comma:

let band = "John,Paul,George,Ringo";
let members = band.split(',');
let john = members[0];
let paul = members[1];
let george = members[2];
let ringo = members[3];

This will give you an array called members which lets you access them using an index either directly or via a loop.

Depending on what browsers/environments you may need to support, you can destructure the array too:

let band = "John,Paul,George,Ringo";
let [john, paul, george, ringo] = band.split(',');

If you want to loop over the members:

let band = "John,Paul,George,Ringo";
let members = band.split(',');

for (let i = 0; i < members.length; i++) {
  console.log(members[i]);
}
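
A small follow-up sketch (the sample strings and variable names are illustrative, not from the original answer): split keeps any whitespace around the values, so a trim is often added, and it has no notion of quoting, so a quoted value that contains a comma gets cut in two.

```javascript
// Illustrative data only; not from the original answer.
const spaced = "John, Paul , George,Ringo";
const names = spaced.split(',').map((name) => name.trim());
console.log(names); // [ 'John', 'Paul', 'George', 'Ringo' ]

// split() knows nothing about quotes: the quoted field is cut at its comma.
const quoted = 'Lennon,"Liverpool, UK",1940';
const pieces = quoted.split(',');
console.log(pieces.length); // 4, although the line holds only 3 CSV values
```

So plain split is fine for simple delimited strings, but not for CSV data that may contain quoted commas.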

How to split a string containing CSV data with arbitrary text into a JavaScript Array of Arrays?

This should do it:

let parseRow = function(row) {
  let isInQuotes = false;
  let values = [];
  let val = '';

  for (let i = 0; i < row.length; i++) {
    switch (row[i]) {
      case ',':
        if (isInQuotes) {
          val += row[i];
        } else {
          values.push(val);
          val = '';
        }
        break;

      case '"':
        if (isInQuotes && i + 1 < row.length && row[i+1] === '"') {
          val += '"';
          i++;
        } else {
          isInQuotes = !isInQuotes;
        }
        break;

      default:
        val += row[i];
        break;
    }
  }

  values.push(val);

  return values;
};

It will return the values in an array:

parseRow('512,"""Fake News"" and the ""Best Way"" to deal with A, B, and C", 1/18/2019,media');
// => ['512', '"Fake News" and the "Best Way" to deal with A, B, and C', ' 1/18/2019', 'media']

To get the requested array of arrays you can do:

let parsedCsv = theCsv.split(/\r?\n/).map(parseRow);

Explanation

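The code might look a little obscure, but the principal idea is simple: we parse the string character by character. When we encounter an opening " we set isInQuotes = true. This changes how , and "" are handled: while isInQuotes is true, a comma is added to the current value instead of ending it, and a doubled "" is read as one literal quote character. When we then encounter a single " we set isInQuotes = false again.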
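
One limitation worth noting: splitting the text on line breaks before calling parseRow also splits any quoted value that itself contains a line break. Below is a minimal sketch of the same character-by-character idea applied to the whole text. The function name parseCsvText is hypothetical, and the handling of \r\n and of a trailing line break is an assumption about what is wanted, not something the original answer specifies.

```javascript
// Sketch only: parseCsvText is a hypothetical helper, not from the original answer.
// It uses the same isInQuotes flag as parseRow, but also treats unquoted
// line breaks as row separators, so quoted values may span several lines.
function parseCsvText(text) {
  const rows = [];
  let row = [];
  let val = '';
  let isInQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === '"') {
      if (isInQuotes && text[i + 1] === '"') {
        val += '"'; // "" inside quotes is one literal quote
        i++;
      } else {
        isInQuotes = !isInQuotes;
      }
    } else if (ch === ',' && !isInQuotes) {
      row.push(val); // end of value
      val = '';
    } else if ((ch === '\n' || ch === '\r') && !isInQuotes) {
      if (ch === '\r' && text[i + 1] === '\n') i++; // treat \r\n as one break
      row.push(val); // end of value and end of row
      rows.push(row);
      row = [];
      val = '';
    } else {
      val += ch; // ordinary character, or comma/line break inside quotes
    }
  }

  // Flush the last row unless the text ended with a line break (assumption).
  if (val !== '' || row.length > 0) {
    row.push(val);
    rows.push(row);
  }
  return rows;
}

const parsed = parseCsvText('id,note\r\n1,"line one\nline two"\r\n2,"say ""hi"""\r\n');
console.log(parsed);
// [ [ 'id', 'note' ], [ '1', 'line one\nline two' ], [ '2', 'say "hi"' ] ]
```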

Split a string by commas but ignore commas within double-quotes using Javascript

Here's what I would do.

var str = 'a, b, c, "d, e, f", g, h';
var arr = str.match(/(".*?"|[^",\s]+)(?=\s*,|\s*$)/g);

/* will match:

    (
        ".*?"       double quotes + anything but double quotes + double quotes
        |           OR
        [^",\s]+    1 or more characters excl. double quotes, comma or spaces of any kind
    )
    (?=             FOLLOWED BY
        \s*,        0 or more empty spaces and a comma
        |           OR
        \s*$        0 or more empty spaces and nothing else (end of string)
    )

*/
arr = arr || [];
// this will prevent JS from throwing an error in
// the below loop when there are no matches
for (var i = 0; i < arr.length; i++) console.log('arr['+i+'] =',arr[i]);
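
Note that a quoted match keeps its surrounding double quotes. If only the inner text is wanted, a follow-up step can strip them. A minimal sketch, with illustrative variable names:

```javascript
const str = 'a, b, c, "d, e, f", g, h';
const matches = str.match(/(".*?"|[^",\s]+)(?=\s*,|\s*$)/g) || [];

// Drop one leading and one trailing double quote from each match, if present.
const values = matches.map((s) => s.replace(/^"|"$/g, ''));
console.log(values); // [ 'a', 'b', 'c', 'd, e, f', 'g', 'h' ]
```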

Capturing comma inside text of comma separated values

I would suggest you follow the suggestions given to you in the comments and not use regex.

However, if you did need to do this using regex, the following should do the trick:

(.*?)=("?)([^"]+?)\2(?:,|$)
  • (.*?)= Captures the key to the left of the = sign. It only captures one key because the ? makes it match as few characters as possible.
  • ("?) Captures whether or not the value is in quotes.
  • ([^"]+?)\2(?:,|$)
    • ([^"]+?) Captures one or more characters that are not ", but as few as possible.
    • \2(?:,|$) This stops either if there was a quote and it finds one again, or at the next comma or if the string has finished.
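
As a rough illustration of how the groups line up, the sketch below runs the pattern with the g flag over a made-up input string. The input and the variable names are assumptions for the example; the data from the original question is not shown here.

```javascript
// Made-up input for illustration only.
const input = 'a=1,b="x,y",c=3';
const re = /(.*?)=("?)([^"]+?)\2(?:,|$)/g;

const pairs = {};
for (const m of input.matchAll(re)) {
  // m[1] is the key, m[2] the optional opening quote, m[3] the value.
  pairs[m[1]] = m[3];
}
console.log(pairs); // { a: '1', b: 'x,y', c: '3' }
```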


Parse comma separated string using JavaScript?

Use match instead of split, and repeatedly match either non-comma, non-" characters, or match "s, followed by non-" characters (thus matching commas inside "s, as desired), followed by another ". Also use negative lookahead for a space at the beginning of the pattern to ensure that the first matched character is not a space:

const translate = str => console.log(
  str.match(/(?! )(?:[^",]+|"[^"]*")+/g)
);

[
  `12.0,trs,"xx-xx NY,US"`,
  `"12.0","trs","xx-xx NY,US"`,
  `"12.0","trs", "xx-xx NY,US"`
].forEach(translate);

Parsing a CSV string while ignoring commas inside the individual columns

Use uniVocity-parsers CsvParser for that instead of parsing it by hand. CSV is much harder than you think and there are many corner cases to cover. You just found one. In short, you NEED a library to read CSV reliably. uniVocity-parsers is used by other Scala projects (e.g. spark-csv).

I'll put an example using plain Java here, because I don't know Scala, but you'll get the idea:

public static void main(String ... args){
    CsvParserSettings settings = new CsvParserSettings(); //many options here, check the documentation
    CsvParser parser = new CsvParser(settings);
    String[] row = parser.parseLine("A,B,\"Hi,There\",C,D");
    for(String value : row){
        System.out.println(value);
    }
}

Output:

A
B
Hi,There
C
D

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

How to parse CSV data?

You can use the CSVToArray() function mentioned in this blog entry.

<script type="text/javascript">
// ref: http://stackoverflow.com/a/1293163/2343
// This will parse a delimited string into an array of
// arrays. The default delimiter is the comma, but this
// can be overridden in the second argument.
function CSVToArray( strData, strDelimiter ){
    // Check to see if the delimiter is defined. If not,
    // then default to comma.
    strDelimiter = (strDelimiter || ",");

    // Create a regular expression to parse the CSV values.
    var objPattern = new RegExp(
        (
            // Delimiters.
            "(\\" + strDelimiter + "|\\r?\\n|\\r|^)" +

            // Quoted fields.
            "(?:\"([^\"]*(?:\"\"[^\"]*)*)\"|" +

            // Standard fields.
            "([^\"\\" + strDelimiter + "\\r\\n]*))"
        ),
        "gi"
    );

    // Create an array to hold our data. Give the array
    // a default empty first row.
    var arrData = [[]];

    // Create an array to hold our individual pattern
    // matching groups.
    var arrMatches = null;

    // Keep looping over the regular expression matches
    // until we can no longer find a match.
    while (arrMatches = objPattern.exec( strData )){

        // Get the delimiter that was found.
        var strMatchedDelimiter = arrMatches[ 1 ];

        // Check to see if the given delimiter has a length
        // (is not the start of string) and if it matches
        // field delimiter. If it does not, then we know
        // that this delimiter is a row delimiter.
        if (
            strMatchedDelimiter.length &&
            strMatchedDelimiter !== strDelimiter
        ){

            // Since we have reached a new row of data,
            // add an empty row to our data array.
            arrData.push( [] );

        }

        var strMatchedValue;

        // Now that we have our delimiter out of the way,
        // let's check to see which kind of value we
        // captured (quoted or unquoted). Compare against
        // undefined so that an empty quoted value ("") is
        // still treated as quoted.
        if (arrMatches[ 2 ] !== undefined){

            // We found a quoted value. When we capture
            // this value, unescape any double quotes.
            strMatchedValue = arrMatches[ 2 ].replace(
                new RegExp( "\"\"", "g" ),
                "\""
            );

        } else {

            // We found a non-quoted value.
            strMatchedValue = arrMatches[ 3 ];

        }

        // Now that we have our value string, let's add
        // it to the data array.
        arrData[ arrData.length - 1 ].push( strMatchedValue );
    }

    // Return the parsed data.
    return( arrData );
}

</script>

Write a string containing commas and double quotes to CSV

It turns out that, according to the CSV specs, to include double quotes within a string that is already quoted, you need to use two double quotes (""). I changed:

itemDesc = itemDesc.replace(/"/g, '\"');

to

itemDesc = itemDesc.replace(/"/g, '""');

I also removed

itemDesc = itemDesc.replace(/,/g, '\,');
itemDesc = itemDesc.replace(/'/g, '\'');

These replacements are unnecessary because the column in the CSV is already being quoted.
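
A minimal sketch of that quoting rule as a reusable helper follows. The names toCsvField and toCsvRow are hypothetical (they are not from the original post), and quoting every field unconditionally is a simplifying assumption.

```javascript
// Sketch only: toCsvField and toCsvRow are hypothetical helper names.
function toCsvField(value) {
  // Double any embedded double quotes, then wrap the whole field in quotes.
  return '"' + String(value).replace(/"/g, '""') + '"';
}

function toCsvRow(values) {
  return values.map(toCsvField).join(',');
}

console.log(toCsvRow(['512', '"Fake News" and A, B', "it's"]));
// "512","""Fake News"" and A, B","it's"
```

Commas and single quotes inside a field need no escaping here, since the surrounding double quotes already protect them.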


