Ignoring comma in field of CSV file with awk
I think your requirement is the perfect use case for FPAT in GNU Awk. Quoting as-is from the man page:
Normally, when using FS, gawk defines the fields as the parts of the record that occur in between each field separator. In other words, FS defines what a field is not, instead of what a field is. However, there are times when you really want to define the fields by what they are, and not by what they are not.
The most notorious such case is so-called comma-separated values (CSV) data. If commas only separated the data, there wouldn’t be an issue. The problem comes when one of the fields contains an embedded comma. In such cases, most programs embed the field in double quotes.
In the case of CSV data as presented here, each field is either “anything that is not a comma,” or “a double quote, anything that is not a double quote, and a closing double quote.” If written as a regular expression constant (see Regexp), we would have /([^,]+)|("[^"]+")/. Writing this as a string requires us to escape the double quotes, leading to:
FPAT = "([^,]+)|(\"[^\"]+\")"
Using that on your input file,
awk 'BEGIN{FPAT = "([^,]+)|(\"[^\"]+\")"}{print $1}' file
"Company Name, LLC"
Parse a csv using awk and ignoring commas inside a field
The extra output you're getting from csv.awk is from demo code. It's intended that you use the functions within the script to do the parsing and then output it however you want.

At the end of csv.awk is the { ... } loop which demonstrates one of the functions. It's that code that's outputting the -> 2|.

Instead of most of that, just call the parsing function and do print csv[1], csv[2].

That part of the code would then look like:
{
num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1);
if (num_fields < 0) {
printf "ERROR: %s (%d) -> %s\n", csverr, num_fields, $0;
} else {
# printf "%s -> ", $0;
# printf "%s", num_fields;
# for (i = 0;i < num_fields;i++) {
# printf "|%s", csv[i];
# }
# printf "|\n";
print csv[1], csv[2]
}
}
Save it as your_script (for example) and do chmod +x your_script.

The cat is unnecessary. Also, you can use sort -u instead of sort | uniq.

Your command would then look like:

./your_script Buildings.csv | sort -u > floors.csv
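As a quick check that sort -u and sort | uniq agree (the three-line input is invented for the demo):

```shell
# Duplicate removal: sort -u sorts and collapses duplicates in one step.
printf 'b\na\nb\n' | sort -u
# sort | uniq produces the same result with an extra process.
printf 'b\na\nb\n' | sort | uniq
```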
How to make awk ignore the field delimiter inside double quotes?
From the GNU awk manual (http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content):
$ awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4}' file
"abc@xyz.com,www.example.com",field4
"def@xyz.com",field4
And see What's the most robust way to efficiently parse CSV using awk? for more general CSV parsing that handles newlines, etc. within fields.
awk FPAT to ignore commas in csv
Could you please try the following.
awk -F"," -v OFS=',' 'BEGIN{FPAT="([^,]*)|(\"[^\"]+\")"} {print $1,$2,$3,$5,$6,$7}' Input_file
Or make better use of the BEGIN block :)
awk 'BEGIN{FS=OFS=",";FPAT="([^,]*)|(\"[^\"]+\")"} {print $1,$2,$3,$5,$6,$7}' Input_file
Reason why OP's code is partially working: since you are using the END block and printing everything there, only the last row gets printed (though this behavior is undefined in some awk implementations, AFAIK). How the END block works:

There are 3 main BLOCKS in awk:

- BEGIN BLOCK: runs before any Input_file is read; it is useful for initializing variables before the program starts reading the actual Input_file.
- main BLOCK: where all Input_file records (lines) are read and processed.
- END BLOCK: the END block of an awk program is executed once the program is done reading the whole Input_file, so all kinds of calculations (e.g. with arrays) and printing of final values after processing the complete Input_file happen here.
What man awk says:
Finally, after all the input is exhausted, gawk executes the code in
the END rule(s) (if any).
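A minimal illustration of the three block types and their execution order (the two-line input stream here is made up):

```shell
printf 'a\nb\n' | awk '
    BEGIN { print "BEGIN: runs before any input" }     # initialization
    { print "main: record " NR ": " $0 }               # once per input line
    END   { print "END: processed " NR " records" }    # after all input
'
```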
Ignoring commas in attribute value while writing to csv file
Whenever you have name -> value pairs in your data it's a good idea to build a similar mapping array first and then print the values by name:
$ cat tst.awk
BEGIN {
OFS = ","
numFlds = split("uid omEntitytype sn givenName initials omUnit departmentNumber omCostCenter title omManager omaffiliatedaccount",flds)
print "USERID,USER_TYPE,USER_LASTNAME,USER_FIRSTNAME,USER_INITIAL,USER_UNIT,USER_DEPT,CHARGE_UNIT,USER_JOB_TITLE,SPONSOR_USERID,ACCOUNT_ID"
}
{
name = value = $0
sub(/:.*/,"",name)
sub(/[^:]*:[[:space:]]*/,"",value)
name2value[name] = value
}
!NF { prtRec() }
END { prtRec() }
function prtRec() {
for (i=1; i<=numFlds; i++) {
printf "\"%s\"%s", name2value[flds[i]], (i<numFlds?OFS:ORS)
}
delete name2value
}
$ awk -f tst.awk file
USERID,USER_TYPE,USER_LASTNAME,USER_FIRSTNAME,USER_INITIAL,USER_UNIT,USER_DEPT,CHARGE_UNIT,USER_JOB_TITLE,SPONSOR_USERID,ACCOUNT_ID
"sample1","Contingent, off-Site","sample1 name1","sample1","P","07","123","10","Analyst","","12345"
"sample2","Contingent, On-Site","sample2 name2","sample2","P","07","123","10","PLAT MGR, ENGINE,HYD,ELECT,DRIVES","","12345"
You didn't actually tell us how you want the commas handled; the above quotes them, but if that's not what you want then just change prtRec() to do whatever it is you do want, e.g. maybe gsub(/,/,";",name2value[flds[i]]).
Note that if your input data wasn't missing some fields that you want output and/or if you just wanted fields output in the order they appear in the input then the above would be quite a bit simpler.
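The gsub alternative mentioned above, shown in isolation (the sample value is invented):

```shell
# Replace embedded commas with semicolons instead of quoting the field.
awk 'BEGIN { v = "Contingent, off-Site"; gsub(/,/, ";", v); print v }'
# -> Contingent; off-Site
```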
Awk output removes comma
There are 2 things happening:

- If you don't specify a field separator (e.g. FS=",") then awk will use chains of white space, so the first field, $1, of your first input line is 0123, (with the trailing comma) rather than 0123, and
- When you perform a numeric operation on a string, awk strips all non-digits off the right side of that string and leading zeros off the left to turn it into a number, so 0123, becomes 123 (and 000173foo would become 173).
So $1 is 0123, and therefore:

sprintf("%05d", $1)
= sprintf("%05d", "0123,")
= sprintf("%05d", "123")
= 00123

which, when you assign that result to $1, replaces 0123, with 00123, hence the vanishing comma.
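That coercion is easy to observe directly in any POSIX awk:

```shell
awk 'BEGIN {
    s = "0123,"
    printf "as a number: %d\n", s    # trailing comma and leading zero stripped
    printf "re-padded:   %05d\n", s  # the original comma is gone for good
}'
# -> as a number: 123
# -> re-padded:   00123
```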
This is what you really wanted:
awk '
BEGIN { FS="[[:space:]]*,[[:space:]]*"; OFS=", " }
{ $1=sprintf("%05d", $1); $3=sprintf("%08d", $3) }
1' mycsv.txt
The above will accept input with any white space around the field-separating commas and will ensure the output fields are all separated by exactly 1 comma followed by 1 blank. If you don't want the blanks in the output, just change OFS=", " to OFS=",".
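Running that command on a made-up mycsv.txt (the field values are assumptions, not OP's data) shows the padding and the normalized separators:

```shell
# Hypothetical input line with inconsistent spacing around commas.
printf '0123, foo ,45\n' > /tmp/mycsv.txt

awk '
BEGIN { FS="[[:space:]]*,[[:space:]]*"; OFS=", " }
{ $1=sprintf("%05d", $1); $3=sprintf("%08d", $3) }
1' /tmp/mycsv.txt
# -> 00123, foo, 00000045
```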
awk split string on commas ignore if inside double quotes
GNU awk has a function called patsplit that lets you do a split using an FPAT pattern:
$ awk '{ print "RECORD " NR ":"; n=patsplit($0, a, "([^,]*)|(\"[^\"]+\")"); for (i=1;i<=n;++i) {print i, "|" a[i] "|"}}' file
RECORD 1:
1 |Hi|
2 |I|
3 |"am,your"|
4 |father|
RECORD 2:
1 |maybe|
2 |you|
3 |knew|
4 |it|
RECORD 3:
1 |but|
2 |"I,wanted"|
3 |to|
4 |"be,sure"|