Ignoring Comma in Field of CSV File with Awk

Ignoring comma in field of CSV file with awk

I think your requirement is the perfect use case for FPAT in GNU Awk.

Quoting as-is from the GNU Awk manual:

Normally, when using FS, gawk defines the fields as the parts of the record that occur in between each field separator. In other words, FS defines what a field is not, instead of what a field is. However, there are times when you really want to define the fields by what they are, and not by what they are not.

The most notorious such case is so-called comma-separated values (CSV) data. If commas only separated the data, there wouldn’t be an issue. The problem comes when one of the fields contains an embedded comma. In such cases, most programs embed the field in double quotes.

In the case of CSV data as presented here, each field is either “anything that is not a comma,” or “a double quote, anything that is not a double quote, and a closing double quote.” If written as a regular expression constant (see Regexp), we would have /([^,]+)|("[^"]+")/. Writing this as a string requires us to escape the double quotes, leading to:

FPAT = "([^,]+)|(\"[^\"]+\")"

Using that on your input file,

awk 'BEGIN{FPAT = "([^,]+)|(\"[^\"]+\")"}{print $1}' file
"Company Name, LLC"

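If you also want the surrounding double quotes stripped from the extracted field, a minimal variation (same input file assumed) removes them with gsub() before printing:

awk 'BEGIN{FPAT = "([^,]+)|(\"[^\"]+\")"} {gsub(/^"|"$/, "", $1); print $1}' file
Company Name, LLC
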
Parse a csv using awk and ignoring commas inside a field

The extra output you're getting from csv.awk is from demo code. It's intended that you use the functions within the script to do the parsing and then output it how you want.

At the end of csv.awk is the { ... } loop which demonstrates one of the functions. It's that code that's outputting the -> 2|.

Instead of most of that, just call the parsing function and do print csv[1], csv[2].

That part of the code would then look like:

{
    num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1);
    if (num_fields < 0) {
        printf "ERROR: %s (%d) -> %s\n", csverr, num_fields, $0;
    } else {
        # printf "%s -> ", $0;
        # printf "%s", num_fields;
        # for (i = 0;i < num_fields;i++) {
        #     printf "|%s", csv[i];
        # }
        # printf "|\n";
        print csv[1], csv[2]
    }
}

Save it as your_script (for example).

Do chmod +x your_script.

And cat is unnecessary. Also, you can do sort -u instead of sort | uniq.

Your command would then look like:

./your_script Buildings.csv | sort -u > floors.csv

How to make awk ignore the field delimiter inside double quotes?

From the GNU awk manual (http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content):

$ awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4}' file
"abc@xyz.com,www.example.com",field4
"def@xyz.com",field4

Also see What's the most robust way to efficiently parse CSV using awk? for more general CSV parsing, including fields that contain newlines, etc.

awk FPAT to ignore commas in csv

Could you please try the following.

awk -F"," -v OFS=',' 'BEGIN{FPAT="([^,]*)|(\"[^\"]+\")"} {print $1,$2,$3,$5,$6,$7}' Input_file

OR make better use of BEGIN :)

awk 'BEGIN{FS=OFS=",";FPAT="([^,]*)|(\"[^\"]+\")"} {print $1,$2,$3,$5,$6,$7}' Input_file
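For instance, on a hypothetical line whose fourth field contains an embedded comma, the FPAT-based command keeps that quoted field intact while printing the requested fields:

echo 'a,b,c,"d,e",f,g,h' | awk 'BEGIN{FS=OFS=",";FPAT="([^,]*)|(\"[^\"]+\")"} {print $1,$2,$3,$5,$6,$7}'
a,b,c,f,g,h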

Reason why OP's code is partially working: since you are using the END block and printing everything there, only the last row's values are available at that point, which is why it prints just the last row (though this behavior is not defined in some awks, AFAIK). How the END block works:

There are 3 main BLOCKS in awk:

  1. BEGIN BLOCK: runs before any Input_file is read; this is where you initialize variables, before the program starts reading the actual Input_file.
  2. Main {...} BLOCK: runs once for every record (line) that is read from the Input_file.
  3. END BLOCK: executed once the program is done reading the whole Input_file; all final work, e.g. calculations with arrays or printing values that depend on the complete Input_file, belongs here.

What man awk says:

Finally, after all the input is exhausted, gawk executes the code in
the END rule(s) (if any).
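
As a minimal illustration of the three blocks (Input_file here is just a placeholder name): initialize a counter in BEGIN, increment it for every record in the main block, and print the total once in END:

awk 'BEGIN{count=0} {count++} END{print "total records:", count}' Input_file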

Ignoring commas in attribute value while writing to csv file

Whenever you have name -> value pairs in your data, it's a good idea to first build an array that maps each name to its value and then print the values by name:

$ cat tst.awk
BEGIN {
    OFS = ","
    numFlds = split("uid omEntitytype sn givenName initials omUnit departmentNumber omCostCenter title omManager omaffiliatedaccount",flds)
    print "USERID,USER_TYPE,USER_LASTNAME,USER_FIRSTNAME,USER_INITIAL,USER_UNIT,USER_DEPT,CHARGE_UNIT,USER_JOB_TITLE,SPONSOR_USERID,ACCOUNT_ID"
}
{
    name = value = $0
    sub(/:.*/,"",name)
    sub(/[^:]*:[[:space:]]*/,"",value)
    name2value[name] = value
}
!NF { prtRec() }
END { prtRec() }

function prtRec() {
    for (i=1; i<=numFlds; i++) {
        printf "\"%s\"%s", name2value[flds[i]], (i<numFlds?OFS:ORS)
    }
    delete name2value
}

$ awk -f tst.awk file
USERID,USER_TYPE,USER_LASTNAME,USER_FIRSTNAME,USER_INITIAL,USER_UNIT,USER_DEPT,CHARGE_UNIT,USER_JOB_TITLE,SPONSOR_USERID,ACCOUNT_ID
"sample1","Contingent, off-Site","sample1 name1","sample1","P","07","123","10","Analyst","","12345"
"sample2","Contingent, On-Site","sample2 name2","sample2","P","07","123","10","PLAT MGR, ENGINE,HYD,ELECT,DRIVES","","12345"

You didn't actually tell us how you want the commas handled; the above quotes them, but if that's not what you want then just change prtRec() to do whatever you do want, e.g. maybe gsub(/,/,";",name2value[flds[i]]).
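
For example, if you would rather replace the embedded commas with semicolons instead of quoting the fields, a minimal sketch of a modified prtRec() (everything else in tst.awk unchanged) would be:

function prtRec() {
    for (i=1; i<=numFlds; i++) {
        # replace embedded commas so the unquoted output stays unambiguous
        gsub(/,/,";",name2value[flds[i]])
        printf "%s%s", name2value[flds[i]], (i<numFlds?OFS:ORS)
    }
    delete name2value
}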

Note that if your input data weren't missing some of the fields you want output, and/or if you just wanted the fields output in the order they appear in the input, then the above would be quite a bit simpler.

Awk output removes comma

There are 2 things happening:

  1. If you don't specify a field separator (e.g. FS=",") then awk splits on chains of white space, so the first field, $1, of your first input line is "0123," (including the trailing comma) rather than "0123", and
  2. When you perform a numeric operation on a string, awk strips all non-digits off the right side of that string and leading zeros off the left to turn it into a number, so "0123," becomes 123 (and "000173foo" would become 173).

So $1 is "0123," and therefore:

sprintf("%05d", $1) = sprintf("%05d", "0123,") = sprintf("%05d", "123") = 00123

which, when you assign that result to $1, replaces "0123," with "00123", hence the vanishing ",".
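
You can see that numeric coercion in isolation (the input line here is only illustrative):

echo '0123, 99' | awk '{ print $1+0 }'
123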

This is what you really wanted:

awk '
BEGIN { FS="[[:space:]]*,[[:space:]]*"; OFS=", " }
{ $1=sprintf("%05d", $1); $3=sprintf("%08d", $3) }
1' mycsv.txt

The above will accept input with any white space around the field-separating ,s and will ensure the output fields are all separated by exactly 1 comma followed by 1 blank. If you don't want the blanks in the output just change OFS=", " to OFS=",".
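
For example, on a hypothetical input line, the script above pads the first and third fields and restores clean comma separation:

echo '0123 ,  4, 0000567' | awk 'BEGIN { FS="[[:space:]]*,[[:space:]]*"; OFS=", " } { $1=sprintf("%05d", $1); $3=sprintf("%08d", $3) } 1'
00123, 4, 00000567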

awk split string on commas ignore if inside double quotes

GNU awk has a function called patsplit() that lets you do a split using an FPAT pattern:

$ awk '{ print "RECORD " NR ":"; n=patsplit($0, a, "([^,]*)|(\"[^\"]+\")"); for (i=1;i<=n;++i) {print i, "|" a[i] "|"}}' file
RECORD 1:
1 |Hi|
2 |I|
3 |"am,your"|
4 |father|
RECORD 2:
1 |maybe|
2 |you|
3 |knew|
4 |it|
RECORD 3:
1 |but|
2 |"I,wanted"|
3 |to|
4 |"be,sure"|
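
If you also want the surrounding quotes stripped and the fields re-joined with a delimiter that cannot clash with the embedded commas, a minimal variation on the same assumed input file is:

$ awk '{ n=patsplit($0, a, "([^,]*)|(\"[^\"]+\")"); out=""; for (i=1;i<=n;++i) { gsub(/^"|"$/, "", a[i]); out = out (i>1 ? ";" : "") a[i] } print out }' file
Hi;I;am,your;father
maybe;you;knew;it
but;I,wanted;to;be,sure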

