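Parsing CSV input with a RegEx in Java
Operator precedence. Alternation (|) has the lowest precedence of any regex operator, so an ungrouped | splits everything around it into two alternatives. In your pattern the or (|) ends up applying to the closing quote lookahead and the comma lookahead. Group each alternative explicitly so the intent is unambiguous.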
Try:
(?:(?<=")([^"]*)(?="))|(?<=,|^)([^,]*)(?=,|$)
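To see how loosely alternation binds, here is a minimal sketch with two toy patterns (not the CSV pattern itself; the class and method names are made up for illustration). It only shows that an ungrouped | splits the whole expression, anchors included.

```java
import java.util.regex.Pattern;

public class AlternationPrecedence {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Ungrouped: parsed as (^a) | (b$), so "ab" matches through the first branch.
        System.out.println(found("^a|b$", "ab"));      // true
        // Grouped: exactly one "a" or one "b" between the anchors, so "ab" fails.
        System.out.println(found("^(?:a|b)$", "ab"));  // false
    }
}
```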
Regular Expression for reading CSV in Java
Regex pattern: (?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?)+?
Update for null values: (?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?|(?<=,)(),?)+?
Here is an example of it working. It isn't a strict CSV parser, but as long as the input isn't really unusual it will match every field.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Each find() consumes one field: group 1 holds a quoted value, group 2 an unquoted one.
Matcher ma = Pattern.compile("(?:\\s*(?:\\\"([^\\\"]*)\\\"|([^,]+))\\s*,?)+?").matcher(" \" ab cd \" , \" efgh,ijk.\", 4,\"lmno\"");
while (ma.find()) {
    if (ma.group(1) == null) {
        System.out.println(ma.group(2));
    } else {
        System.out.println(ma.group(1));
    }
}
Edit: by the way, if you wanted us to write the code for you, don't point us at an online regex tester. Mentioning one implies you already know how to handle the regex. If you have no idea how to do the Java part, ask about that too.
Parsing CSV files using Regex in Java
Anyway, I've found the fix myself. Thanks for your suggestions and help.
This was my initial code:
if (pm.find())
    System.out.println(cs);
I changed it to:
while (pm.find()) {
    CharSequence css = pm.group();
    // print css
}
I also used a different regex. I'm getting the desired output now.
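The difference between the two versions is that a single find() stops at the first match, while a loop keeps going until the input is exhausted. A minimal sketch of the loop, with a hypothetical findAll helper and a deliberately simple stand-in pattern (the poster's actual regex is not shown):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindAllSketch {
    // Hypothetical helper: collects every match instead of only the first one.
    static List<String> findAll(Pattern pattern, CharSequence input) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {          // each call resumes after the previous match
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in pattern: runs of non-comma characters.
        System.out.println(findAll(Pattern.compile("[^,]+"), "a,b,c")); // [a, b, c]
    }
}
```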
Parse a csv file using Regex in java with '|' as separator
Given your source, you could probably just replace the comma with a pipe. Judging from the comments, all that pattern does is split the string on a delimiter (except for delimiters inside double quotes).
For example, change
\\s*(\"[^\"]*\"|[^,]*)\\s*,?
to
\\s*(\"[^\"]*\"|[^|]*)\\s*\\|?
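As a rough check of the pipe version, the sketch below runs that pattern in a find() loop. The class name, method name, and sample line are assumptions for illustration. Note that the pattern can also produce an empty match at the very end of the input, so the loop stops there; a trailing empty field would need separate handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PipeSplitSketch {
    // The pipe-delimited pattern from the answer, written as a Java string.
    private static final Pattern FIELD =
            Pattern.compile("\\s*(\"[^\"]*\"|[^|]*)\\s*\\|?");

    // Hypothetical helper: returns group 1 of every match (quotes are kept).
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            if (m.start() == line.length()) {
                break; // empty match at end of input, not a real field
            }
            fields.add(m.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("a|\"b|c\"|d")); // [a, "b|c", d]
    }
}
```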
As for your number exception, you need to debug the way you're calling the CSV loader.
I've never used that tool before, but if you look at line 352
for (int i = 0; i < types.length; ++i) {
Now look at the switch block that starts at line 362: it defines the type that each field should be cast to.
    switch(types[i]) {
        case DOUBLE:
            prepStmt.setDouble(i+1, Double.parseDouble(field));
            break;
        ...
This kind of conversion throws a NumberFormatException whenever a field's declared type doesn't match its actual content, so make sure you specify the types correctly.
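To illustrate the failure mode, here is a small sketch (the class and helper names are hypothetical and not part of the CSV loader) showing that Double.parseDouble rejects anything that is not a plain number, including a field that still carries its surrounding quotes.

```java
import java.util.OptionalDouble;

public class FieldParseSketch {
    // Hypothetical helper: empty result instead of an exception for bad fields.
    static OptionalDouble tryParseDouble(String field) {
        try {
            return OptionalDouble.of(Double.parseDouble(field));
        } catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParseDouble("2.99").isPresent());      // true
        System.out.println(tryParseDouble("\"2.99\"").isPresent());  // false: quotes still attached
        System.out.println(tryParseDouble("Title").isPresent());     // false: column is not numeric
    }
}
```

Logging the column index alongside the rejected field is usually enough to spot which entry in the type list is wrong.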
Regex to split a CSV
Description
Instead of using a split, I think it would be easier to simply execute a match and process all the found matches.
This expression will:
- divide your sample text on the comma delimiters
- process empty values
- ignore commas inside double quotes, provided the double quotes are not nested
- trim the delimiting comma from the returned value
- trim surrounding quotes from the returned value
- return a null value in the first capture group if the string starts with a comma
Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)[^"]*|[^,"]*))"?(?=,|$)
Example
Sample Text
123,2.99,AMO024,Title,"Description, more info",,123987564
ASP example using the non-Java expression
Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True
regEx.MultiLine = True
sourcestring = "your source string"
regEx.Pattern = "(?:^|,)(?=[^""]|("")?)""?((?(1)[^""]*|[^,""]*))""?(?=,|$)"
Set Matches = regEx.Execute(sourcestring)
For z = 0 to Matches.Count-1
    results = results & "Matches(" & z & ") = " & chr(34) & Server.HTMLEncode(Matches(z)) & chr(34) & chr(13)
    For zz = 0 to Matches(z).SubMatches.Count-1
        results = results & "Matches(" & z & ").SubMatches(" & zz & ") = " & chr(34) & Server.HTMLEncode(Matches(z).SubMatches(zz)) & chr(34) & chr(13)
    next
    results=Left(results,Len(results)-1) & chr(13)
next
Response.Write "<pre>" & results
Matches using the non-Java expression
Group 0 gets the entire substring which includes the comma
Group 1 gets the quote if it's used
Group 2 gets the value not including the comma
[0][0] = 123
[0][1] =
[0][2] = 123
[1][0] = ,2.99
[1][1] =
[1][2] = 2.99
[2][0] = ,AMO024
[2][1] =
[2][2] = AMO024
[3][0] = ,Title
[3][1] =
[3][2] = Title
[4][0] = ,"Description, more info"
[4][1] = "
[4][2] = Description, more info
[5][0] = ,
[5][1] =
[5][2] =
[6][0] = ,123987564
[6][1] =
[6][2] = 123987564
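The expression above relies on a conditional, (?(1)...), which java.util.regex does not support. The following is a sketch of the same idea for Java, not the original author's Java expression: it swaps the conditional for a plain alternation and walks the line field by field with region() and lookingAt(). The class and method names are assumptions, and like the original it does not handle nested or escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvFieldsSketch {
    // Group 1: quoted value (quotes trimmed). Group 2: unquoted value.
    // Group 3: the delimiter, which is empty when the end of the line was reached.
    private static final Pattern FIELD =
            Pattern.compile("(?:\"([^\"]*)\"|([^\",]*))(,|$)");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        int pos = 0;
        while (true) {
            m.region(pos, line.length());
            if (!m.lookingAt()) {
                break; // malformed input, for example an unbalanced quote
            }
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
            if (m.group(3).isEmpty()) {
                break; // matched the end of the line instead of a comma
            }
            pos = m.end();
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "123,2.99,AMO024,Title,\"Description, more info\",,123987564";
        System.out.println(fields(sample).size()); // 7
    }
}
```

Anchoring each step at the current position avoids the situation where an empty match makes find() skip over a delimiter.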
Edited
As Boris pointed out, the CSV format escapes a double quote (") as a doubled double quote (""). Although this requirement wasn't included by the OP, if your text includes doubled double quotes then you'll want to use this modified expression:
Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)(?:[^"]|"")*|[^,"]*))"?(?=,|$)
See also: https://regex101.com/r/y8Ayag/1
It should also be pointed out that Regex is a pattern matching tool not a parsing engine. Therefore if your text includes double double quotes it will still contain the double double quotes after pattern matching is completed. With this solution you'd still need to search for the double double quotes and replace them in your captured text.
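In Java, that final clean-up step could look like the sketch below. The class and method names are made up for illustration; the input is assumed to be a captured value whose surrounding quotes have already been trimmed.

```java
public class QuoteUnescapeSketch {
    // Collapses each doubled double quote ("") into a single double quote (").
    static String unescape(String captured) {
        return captured.replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        // The captured text: say ""hi""
        System.out.println(unescape("say \"\"hi\"\"")); // say "hi"
    }
}
```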
Validate csv file with regular expression in Java
Disclaimer: I didn't even try compiling my code, but this pattern has worked before.
When I can't see at a glance what a regex does, I break it out into lines so it's easier to figure out what's going on. Mismatched parens are more obvious and you can even add comments to it. Also, let's add the Java code around it so escaping oddities become clear.
^(\"[^,\"]*\")(,(\"[^,\"]*\"))*(.(\"[^,\"]*\")(,(\"[^,\"]*\")))*.$
becomes
String regex = "^" +
        "(\"[^,\"]*\")" +
        "(," +
            "(\"[^,\"]*\")" +
        ")*" +
        "(." +
            "(\"[^,\"]*\")" +
            "(," +
                "(\"[^,\"]*\")" +
            ")" +
        ")*" +
        ".$";
Much better. Now to business: the first thing I see is your regex for the quoted values. It doesn't allow for commas within the strings - which probably isn't what you want - so let's fix that. Let's also put it in its own variable so we don't mis-type it at some point. Lastly, let's add comments so we can verify what the regex is doing.
final String QUOTED_VALUE = "\"[^\"]*\""; // A double quote character, zero or more non-double quote characters, and another double quote
String regex = "^" + // The beginning of the string
        "(" + QUOTED_VALUE + ")" + // Capture the first value
        "(," + // Start a group, a comma
            "(" + QUOTED_VALUE + ")" + // Capture the next value
        ")*" + // Close the group. Allow zero or more of these
        "(." + // Start a group, any character
            "(" + QUOTED_VALUE + ")" + // Capture another value
            "(," + // Start a nested group, a comma
                "(" + QUOTED_VALUE + ")" + // Capture the next value
            ")" + // Close the nested group
        ")*" + // Close the group. Allow zero or more
        ".$"; // Any character, the end of the input
Things are getting even clearer. I see two big things here:
1) (I think) you're trying to match the newline in your input string. I'll play along, but it's cleaner and easier to split the input on newlines than what you're doing (that's an exercise you can do yourself, though). You also need to be mindful of the different newline conventions that different operating systems use.
2) You're capturing too much. Use non-capturing groups, or parsing your output is going to be difficult and error-prone.
final String QUOTED_VALUE = "\"[^\"]*\""; // A double quote character, zero or more non-double quote characters, and another double quote
final String NEWLINE = "(?:\r\n|\n|\r)"; // A newline for (almost) any OS: Windows (\r\n), *NIX (\n) or classic Mac (\r); non-capturing
String regex = "^" + // The beginning of the string
        "(" + QUOTED_VALUE + ")" + // Capture the first value
        "(?:," + // Start a group, a comma
            "(" + QUOTED_VALUE + ")" + // Capture the next value
        ")*" + // Close the group. Allow zero or more of these
        "(?:" + NEWLINE + // Start a group, a newline
            "(" + QUOTED_VALUE + ")" + // Capture another value
            "(?:," + // Start a nested group, a comma
                "(" + QUOTED_VALUE + ")" + // Capture the next value
            ")" + // Close the nested group
        ")*" + // Close the group. Allow zero or more
        NEWLINE + "$"; // A trailing newline, the end of the input
From here, I see you duplicating work again. Let's fix that. This also fixes a missing * in your original regex. See if you can find it.
final String QUOTED_VALUE = "\"[^\"]*\""; // A double quote character, zero or more non-double quote characters, and another double quote
final String NEWLINE = "(?:\r\n|\n|\r)"; // A newline for (almost) any OS: Windows (\r\n), *NIX (\n) or classic Mac (\r); non-capturing
final String LINE = "(" + QUOTED_VALUE + ")" + // Capture the first value
        "(?:," + // Start a group, a comma
            "(" + QUOTED_VALUE + ")" + // Capture the next value
        ")*"; // Close the group. Allow zero or more of these
String regex = "^" + // The beginning of the string
        LINE + // Read the first line, capture its values
        "(?:" + NEWLINE + // Start a group for the remaining lines
            LINE + // Read more lines, capture their values
        ")*" + // Close the group. Allow zero or more
        NEWLINE + "$"; // A trailing newline, the end of the input
That's a little easier to read, no? Now you can test your big nasty regex in pieces if it doesn't work.
You can now compile the regex, get the matcher, and grab the groups from it. You still have a few issues though:
1) I said earlier that it would be easier to break on newlines. One reason is: how do you determine how many values you have per line? Hard-coding it will work, but it'll break as soon as your input changes. Maybe this isn't a problem for you, but it's still bad practice. Another reason: the regex is still too complex for my liking. You could really get away with stopping at LINE.
2) CSV files allow lines like this:
"some text","123",456,"some more text"
To handle this you might want to add another mini-regex that gets either a quoted value or a list of digits.
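Putting both remaining points together, a minimal sketch might split the input on newlines first and then check each line against a LINE-style pattern whose values are either quoted strings or runs of digits. All names here are illustrative, and the VALUE definition is one assumed way to write the "mini-regex" mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvLineSketch {
    // Either a quoted value or a list of digits.
    static final String VALUE = "\"[^\"]*\"|\\d+";
    static final Pattern LINE =
            Pattern.compile("(?:" + VALUE + ")(?:,(?:" + VALUE + "))*");
    static final Pattern FIELD = Pattern.compile(VALUE);

    // Validates the whole input one line at a time.
    static boolean isValid(String csv) {
        for (String line : csv.split("\\r?\\n|\\r")) {
            if (!LINE.matcher(line).matches()) {
                return false;
            }
        }
        return true;
    }

    // Extracts the values of one already validated line (quotes are kept).
    static List<String> values(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "\"some text\",\"123\",456,\"some more text\"";
        System.out.println(isValid(line + "\n"));
        System.out.println(values(line));
    }
}
```

Because each line is handled separately, the number of values per line no longer has to be hard-coded.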