Parsing CSV Input with a Regex in Java

Parsing CSV input with a RegEx in Java

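Operator precedence is the issue. Alternation (|) has the lowest precedence of the regex operators, so it splits everything on either side of it within the enclosing group. In your pattern that means the | is applying to the closing-quote lookahead and the comma lookaround, not to the alternatives you intended. Wrap each alternative in its own group.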

Try:

(?:(?<=")([^"]*)(?="))|(?<=,|^)([^,]*)(?=,|$)
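To see why grouping matters, here is a minimal sketch of how alternation splits a whole pattern rather than just its neighbours. The two patterns and the class name AlternationDemo are illustrative assumptions and are not taken from the question.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // Without grouping, | splits the whole pattern: (^a) OR (b$).
    static final Pattern UNGROUPED = Pattern.compile("^a|b$");
    // With a group, both anchors apply to each alternative: ^(a|b)$.
    static final Pattern GROUPED = Pattern.compile("^(?:a|b)$");

    // Returns true if the pattern matches anywhere in the input.
    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found(UNGROUPED, "ab")); // matches through the "^a" branch
        System.out.println(found(GROUPED, "ab"));   // no match: the input is neither "a" nor "b"
    }
}
```

The same reasoning applies to lookarounds: anything that should belong to one alternative only has to sit inside that alternative's group.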

Regular Expression for reading CSV in Java

Regex pattern: (?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?)+?

Update for null values: (?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?|(?<=,)(),?)+?

Here is an example of it working. I know the input is only roughly CSV format, but as long as you don't write anything really weird, the pattern will match all of the values.

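import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvMatchDemo {
    public static void main(String[] args) {
        Matcher ma = Pattern.compile("(?:\\s*(?:\\\"([^\\\"]*)\\\"|([^,]+))\\s*,?)+?").matcher("   \"  ab  cd  \" ,    \"  efgh,ijk.\",  4,\"lmno\"");
        while (ma.find()) {
            // Group 1 holds a quoted value (without its quotes); group 2 holds an unquoted value.
            if (ma.group(1) == null) {
                System.out.println(ma.group(2));
            } else {
                System.out.println(ma.group(1));
            }
        }
    }
}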
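The "update for null values" pattern adds a third, empty capture group for fields that have nothing between two commas. A rough sketch of how the three groups could be read follows; the class name NullAwareCsvDemo and the helper fields are hypothetical, and the input is assumed to be a single well-formed line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NullAwareCsvDemo {
    // The null-value variant of the pattern, written as a Java string literal.
    private static final Pattern FIELD = Pattern.compile(
            "(?:\\s*(?:\"([^\"]*)\"|([^,]+))\\s*,?|(?<=,)(),?)+?");

    // Collects one value per match: quoted (group 1), unquoted (group 2) or empty (group 3).
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                out.add(m.group(1));
            } else if (m.group(2) != null) {
                out.add(m.group(2));
            } else {
                out.add(m.group(3)); // the empty group: nothing between two commas
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("a,,b"));
    }
}
```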

Edit: by the way, if you wanted us to write the code for you, don't just point us to an online regex tester. Doing that suggests you already know how to handle regexes. If you don't know how to do something, ask about that too.

Parsing CSV files using Regex in Java

Anyway I've found the fix myself, thanks guys for your suggestion and help.

This was my initial code

    if (pm.find()) {
        System.out.println(cs);
    }

Now changed this to

    while (pm.find()) {
        CharSequence css = pm.group();
        System.out.println(css); // print each match
    }

Also I used a different Regex. I'm getting the desired output now.
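The difference between the two versions is that if (find()) looks at the first match only, whereas while (find()) walks through every match. A small sketch, using a deliberately simple stand-in pattern ([^,]+) because the regex actually used is not shown; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllDemo {
    // Stand-in pattern (an assumption): one or more characters that are not commas.
    private static final Pattern FIELD = Pattern.compile("[^,]+");

    // Mirrors the initial code: a single find() call yields at most one match.
    static List<String> firstOnly(String line) {
        List<String> out = new ArrayList<>();
        Matcher pm = FIELD.matcher(line);
        if (pm.find()) {
            out.add(pm.group());
        }
        return out;
    }

    // Mirrors the fixed code: calling find() in a loop yields every match.
    static List<String> all(String line) {
        List<String> out = new ArrayList<>();
        Matcher pm = FIELD.matcher(line);
        while (pm.find()) {
            out.add(pm.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(firstOnly("a,b,c"));
        System.out.println(all("a,b,c"));
    }
}
```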

Parse a CSV file using Regex in Java with '|' as separator

Given your source, you could probably just replace the comma with a pipe. Judging from the comments, all that pattern does is split the string on a delimiter (ignoring delimiters inside double quotes).

For example, change it from

\\s*(\"[^\"]*\"|[^,]*)\\s*,?

to

\\s*(\"[^\"]*\"|[^|]*)\\s*\\|?
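A minimal sketch of the pipe variant in use is shown below. The class name PipeSeparatedDemo and the helper fields are hypothetical, and the loader from the question is not involved. Two things this sketch assumes or leaves out: group 1 keeps the surrounding quotes of a quoted value, and because the pattern can match the empty string at the end of the input, zero-length matches are skipped, which also drops a trailing empty field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipeSeparatedDemo {
    // The pipe-delimited pattern from above; | is literal inside [^|] and escaped outside it.
    private static final Pattern FIELD = Pattern.compile("\\s*(\"[^\"]*\"|[^|]*)\\s*\\|?");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            if (m.start() == m.end()) {
                continue; // zero-length match at the end of the input, not a real field
            }
            out.add(m.group(1)); // quoted values keep their quotes
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("a|\"b|c\"|d"));
    }
}
```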

As for your number exception, you need to debug the way you're calling the CSV loader.

I've never used that tool before, but if you look at line 352

for (int i = 0; i < types.length; ++i) {

Now look at the switch block that starts at line 362: it defines the type that each field should be cast to.

switch(types[i]) {
    case DOUBLE:
        prepStmt.setDouble(i+1, Double.parseDouble(field));
        break;
    ...

This type of conversion is likely going to cause issues if you don't properly specify the types.
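As a hedged illustration of that failure mode: Double.parseDouble throws a NumberFormatException when the field is not numeric, which is what happens if a text column is declared as DOUBLE. The class DoubleFieldCheck and the method parsesAsDouble are hypothetical names and are not part of the loader.

```java
class DoubleFieldCheck {
    // Returns true if the field can be converted the way the DOUBLE case converts it.
    static boolean parsesAsDouble(String field) {
        try {
            Double.parseDouble(field);
            return true;
        } catch (NumberFormatException e) {
            return false; // a non-numeric or empty field ends up here
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesAsDouble("2.99"));  // numeric field
        System.out.println(parsesAsDouble("Title")); // text field declared as DOUBLE by mistake
    }
}
```

Running a check like this over a sample row before loading can show which declared types do not match the data.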

Regex to split a CSV

Description

Instead of using a split, I think it would be easier to simply execute a match and process all the found matches.

This expression will:

  • divide your sample text on the comma delimiters
  • process empty values
  • ignore commas inside double quotes, provided the double quotes are not nested
  • trim the delimiting comma from the returned value
  • trim surrounding quotes from the returned value
  • return a null value in the first capture group if the string starts with a comma

Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)[^"]*|[^,"]*))"?(?=,|$)

Example

Sample Text

123,2.99,AMO024,Title,"Description, more info",,123987564

ASP example using the non-java expression

Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True
regEx.MultiLine = True
sourcestring = "your source string"
regEx.Pattern = "(?:^|,)(?=[^""]|("")?)""?((?(1)[^""]*|[^,""]*))""?(?=,|$)"
Set Matches = regEx.Execute(sourcestring)
For z = 0 to Matches.Count-1
    results = results & "Matches(" & z & ") = " & chr(34) & Server.HTMLEncode(Matches(z)) & chr(34) & chr(13)
    For zz = 0 to Matches(z).SubMatches.Count-1
        results = results & "Matches(" & z & ").SubMatches(" & zz & ") = " & chr(34) & Server.HTMLEncode(Matches(z).SubMatches(zz)) & chr(34) & chr(13)
    next
    results=Left(results,Len(results)-1) & chr(13)
next
Response.Write "<pre>" & results

Matches using the non-java expression

Group 0 gets the entire substring which includes the comma

Group 1 gets the quote if it's used

Group 2 gets the value not including the comma

[0][0] = 123
[0][1] =
[0][2] = 123

[1][0] = ,2.99
[1][1] =
[1][2] = 2.99

[2][0] = ,AMO024
[2][1] =
[2][2] = AMO024

[3][0] = ,Title
[3][1] =
[3][2] = Title

[4][0] = ,"Description, more info"
[4][1] = "
[4][2] = Description, more info

[5][0] = ,
[5][1] =
[5][2] =

[6][0] = ,123987564
[6][1] =
[6][2] = 123987564

Edited

As Boris pointed out, the CSV format escapes a double quote " as two double quotes "". Although this requirement wasn't included by the OP, if your text includes doubled double quotes then you'll want to use this modified expression:

Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)(?:[^"]|"")*|[^,"]*))"?(?=,|$)

See also: https://regex101.com/r/y8Ayag/1

It should also be pointed out that a regex is a pattern-matching tool, not a parsing engine. Therefore, if your text includes doubled double quotes, the captured text will still contain them after matching. With this solution you would still need to search the captured text for "" and replace each occurrence with a single ".
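The expressions above are labelled non-Java because they rely on a conditional, (?(1)...), which java.util.regex does not support. The following is a sketch of a Java-compatible variant, not the author's expression: it swaps the conditional for a plain alternation with one capture group per case, uses a lookbehind for the leading delimiter, and unescapes "" afterwards as described. The class name QuotedCsvDemo and the helper fields are hypothetical, and the input is assumed to be one well-formed line without a trailing line terminator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedCsvDemo {
    // Group 1: contents of a quoted field ("" allowed inside); group 2: an unquoted field.
    private static final Pattern FIELD = Pattern.compile(
            "(?<=^|,)(?:\"((?:[^\"]|\"\")*)\"|([^\",]*))(?=,|$)");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                // The regex only matches; collapsing "" to " is a separate step.
                out.add(m.group(1).replace("\"\"", "\""));
            } else {
                out.add(m.group(2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("123,2.99,AMO024,Title,\"Description, more info\",,123987564"));
    }
}
```

Because the delimiter is only looked at and never consumed, empty fields at the start, middle or end of the line each produce their own zero-length match.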

Validate csv file with regular expression in Java

Disclaimer: I didn't even try compiling my code, but this pattern has worked before.

When I can't see at a glance what a regex does, I break it out into lines so it's easier to figure out what's going on. Mismatched parens are more obvious and you can even add comments to it. Also, let's add the Java code around it so escaping oddities become clear.

^(\"[^,\"]*\")(,(\"[^,\"]*\"))*(.(\"[^,\"]*\")(,(\"[^,\"]*\")))*.$

becomes

String regex = "^" +
    "(\"[^,\"]*\")" +
    "(," +
        "(\"[^,\"]*\")" +
    ")*" +
    "(." +
        "(\"[^,\"]*\")" +
        "(," +
            "(\"[^,\"]*\")" +
        ")" +
    ")*" +
    ".$";

Much better. Now to business: the first thing I see is your regex for the quoted values. It doesn't allow for commas within the strings - which probably isn't what you want - so let's fix that. Let's also put it in its own variable so we don't mis-type it at some point. Lastly, let's add comments so we can verify what the regex is doing.

final String QUOTED_VALUE = "\"[^\"]*\""; // A double quote character, zero or more non-double quote characters, and another double quote
String regex = "^" + // The beginning of the string
    "(" + QUOTED_VALUE + ")" + // Capture the first value
    "(," + // Start a group, a comma
        "(" + QUOTED_VALUE + ")" + // Capture the next value
    ")*" + // Close the group. Allow zero or more of these
    "(." + // Start a group, any character
        "(" + QUOTED_VALUE + ")" + // Capture another value
        "(," + // Start a nested group, a comma
            "(" + QUOTED_VALUE + ")" + // Capture the next value
        ")" + // Close the nested group
    ")*" + // Close the group. Allow zero or more
    ".$"; // Any character, the end of the input

Things are getting even clearer. I see two big things here:

1) (I think) you're trying to match the newline in your input string. I'll play along, but it's cleaner and easier to split the input on newlines than what you're doing (that's an exercise you can do yourself, though). You also need to be mindful of the different newline conventions that different operating systems use.

2) You're capturing too much. You want to use non-capturing groups, or parsing your output is going to be difficult and error-prone.

final String QUOTED_VALUE = "\"[^\"]*\""; // A double quote character, zero or more non-double quote characters, and another double quote
final String NEWLINE = "(?:\r\n|\n|\r)"; // A newline for (almost) any OS: Windows (\r\n), *NIX (\n) or classic Mac (\r); non-capturing
String regex = "^" + // The beginning of the string
    "(" + QUOTED_VALUE + ")" + // Capture the first value
    "(?:," + // Start a non-capturing group, a comma
        "(" + QUOTED_VALUE + ")" + // Capture the next value
    ")*" + // Close the group. Allow zero or more of these
    "(?:" + NEWLINE + // Start a non-capturing group, a newline
        "(" + QUOTED_VALUE + ")" + // Capture another value
        "(?:," + // Start a nested non-capturing group, a comma
            "(" + QUOTED_VALUE + ")" + // Capture the next value
        ")" + // Close the nested group
    ")*" + // Close the group. Allow zero or more
    NEWLINE + "$"; // A trailing newline, the end of the input

From here, I see you duplicating work again. Let's fix that. This also fixes a missing * in your original regex. See if you can find it.

final String QUOTED_VALUE = "\"[^\"]*\""; // A double quote character, zero or more non-double quote characters, and another double quote
final String NEWLINE = "(?:\r\n|\n|\r)"; // A newline for (almost) any OS: Windows (\r\n), *NIX (\n) or classic Mac (\r); non-capturing
final String LINE = "(" + QUOTED_VALUE + ")" + // Capture the first value
    "(?:," + // Start a non-capturing group, a comma
        "(" + QUOTED_VALUE + ")" + // Capture the next value
    ")*"; // Close the group. Allow zero or more of these
String regex = "^" + // The beginning of the string
    LINE + // Read the first line, capture its values
    "(?:" + NEWLINE + // Start a group for the remaining lines
        LINE + // Read more lines, capture their values
    ")*" + // Close the group. Allow zero or more
    NEWLINE + "$"; // A trailing newline, the end of the input

That's a little easier to read, no? Now you can test your big nasty regex in pieces if it doesn't work.
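Testing in pieces could look roughly like the sketch below, which checks QUOTED_VALUE, LINE and the assembled expression separately. The class name RegexPieces, the constant name FILE and the helper matches are hypothetical; NEWLINE is written here as a non-capturing alternation of \r\n, \n and \r, which is an assumption about the intended line endings.

```java
import java.util.regex.Pattern;

class RegexPieces {
    static final String QUOTED_VALUE = "\"[^\"]*\"";
    static final String NEWLINE = "(?:\r\n|\n|\r)"; // assumed line endings, non-capturing
    static final String LINE = "(" + QUOTED_VALUE + ")(?:,(" + QUOTED_VALUE + "))*";
    static final String FILE = "^" + LINE + "(?:" + NEWLINE + LINE + ")*" + NEWLINE + "$";

    // True if the whole input matches the given piece.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Smallest piece first, then the pieces built on top of it.
        System.out.println(matches(QUOTED_VALUE, "\"a,b\""));
        System.out.println(matches(LINE, "\"a\",\"b\""));
        System.out.println(matches(FILE, "\"a\",\"b\"\r\n\"c\"\n"));
    }
}
```

If the full expression fails on an input, checking the same input against LINE and QUOTED_VALUE narrows down which piece is at fault.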

You can now compile the regex, get the matcher, and grab the groups from it. You still have a few issues though:

1) I said earlier that it would be easier to break on newlines. One reason is: how do you determine how many values you have per line? Hard-coding it will work, but it'll break as soon as your input changes. Maybe this isn't a problem for you, but it's still bad practice. Another reason: the regex is still too complex for my liking. You could really get away with stopping at LINE.

2) CSV files allow lines like this:

"some text","123",456,"some more text"

To handle this you might want to add another mini-regex that gets either a quoted value or a list of digits.
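Putting both points together, a rough sketch could split the input on line breaks and validate each line against a small value regex that accepts either a quoted value or a run of digits. The names CsvLineValidator, VALUE and isValid are hypothetical, and the \R linebreak matcher assumes Java 8 or later.

```java
import java.util.regex.Pattern;

class CsvLineValidator {
    // A quoted value or a list of digits.
    private static final String VALUE = "(?:\"[^\"]*\"|\\d+)";
    // One value, followed by zero or more comma-separated values.
    private static final Pattern LINE = Pattern.compile(VALUE + "(?:," + VALUE + ")*");

    // Splits on any line break and requires every line to match LINE in full.
    static boolean isValid(String csv) {
        for (String line : csv.split("\\R")) {
            if (!LINE.matcher(line).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("\"some text\",\"123\",456,\"some more text\""));
    }
}
```

Since each line is checked on its own, the number of values per line no longer has to be hard-coded into the expression.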


