What's a Semantically-Correct Way to Parse CSV from SQL Server 2008

The following uses a regexp and String#scan. I observe that in the broken CSV format you're dealing with, a " only has quoting properties when it appears at the very beginning and end of a field.

Scan moves through the string, successively matching the regexp, so the regexp can assume its starting match point is the beginning of a field. We construct the regexp so it can match either a balanced quoted field with no internal quotes (QUOTED) or a run of non-commas (UNQUOTED). Whichever alternative matches the field, it must be followed by a separator, which can be either a comma or the end of the string (SEP).

Because UNQUOTED can match a zero-length field before a separator, the scan always matches an empty field at the end, which we discard with [0...-1]. Scan produces an array of tuples; each tuple is an array of the capture groups, so we map over each element, picking the captured alternative with matches[0] || matches[1].

None of your example lines shows a field which contains both a comma and a quote -- I have no idea how such a field would be legally represented, and this code probably won't recognize it correctly.

SEP      = /(?:,|\Z)/   # separator: a comma or the end of the string
QUOTED   = /"([^"]*)"/  # balanced quoted field with no internal quotes
UNQUOTED = /([^,]*)/    # run of non-commas (possibly empty)

FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/

def ugly_parse(line)
  line.scan(FIELD)[0...-1].map { |matches| matches[0] || matches[1] }
end

lines.each do |l|
  puts l
  puts ugly_parse(l).inspect
  puts
end

# Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00
# ["Electrical", "197135021E", "SERVICE, OUTLETS", "1997-05-15 00:00:00"]
#
# Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
# ["Plumbing", "196222006P", "REPLACE LEAD WATER SERVICE W/1\" COPPER", "1996-08-09 00:00:00"]
#
# Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
# ["Construction", "197133031B", "MORGAN SHOES\" ALT", "1997-05-13 00:00:00"]

Ruby: How can I process a CSV file with bad commas?

Well, here's an idea: You could replace each instance of comma-followed-by-a-space with a unique character, then parse the CSV as usual, then go through the resulting rows and reverse the replace.
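
Here's a minimal sketch of that idea in Ruby; SENTINEL and parse_bad_csv are illustrative names, and it assumes the stray commas are always followed by a space while real separators never are:

require 'csv'

SENTINEL = "\u0001"  # assumed never to occur in the real data

def parse_bad_csv(raw)
  masked = raw.gsub(", ", SENTINEL)  # hide comma-plus-space pairs
  CSV.parse(masked).map do |row|
    row.map { |field| field && field.gsub(SENTINEL, ", ") }  # reverse the replace
  end
end

parse_bad_csv("Plumbing, Inc.,196222006P\n")
# => [["Plumbing, Inc.", "196222006P"]]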

Disabling automatic encapsulation on export

In Aspose.Cells for .NET v7.2.2, a new property called AlwaysQuoted was added to TxtSaveOptions.

Here's a sample of how to use this property during the save:

var options = new TxtSaveOptions(SaveFormat.CSV) { AlwaysQuoted = true };
workBook.Save(@"C:\Export Location\export.csv", options);

Setting the property to true will encapsulate every field, regardless of content. The default is to quote only where necessary, mimicking the CSV that Excel produces.
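
For comparison only (not the Aspose API): Ruby's standard CSV library exposes the same always-quote behavior through its force_quotes option:

require 'csv'

CSV.generate_line(["Electrical", "197135021E", "SERVICE, OUTLETS"], force_quotes: true)
# => "\"Electrical\",\"197135021E\",\"SERVICE, OUTLETS\"\n"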

How to escape commas inside CSV values when importing table to PostgreSQL?

Works for me (PostgreSQL 9.2, Linux); the csv option makes \copy honor double-quoted fields, so the embedded commas survive:

$ cat something.csv 
1, "some name", 2, "(1):IPC Sections - 147, 323, 352, 504, 506 , Other Details - Case no.283A/2000, A.C.J.M-5, Ghumangunj Ellahabad, UP, Dt.12.11.2000"

$ psql test
test=> CREATE TABLE candidates (Sno int, name varchar, cases int, case_details varchar);
CREATE TABLE
test=> \copy candidates from 'something.csv' with NULL AS ' ' csv ;
test=> select * from candidates ;
 sno |    name    | cases |                                                             case_details
-----+------------+-------+--------------------------------------------------------------------------------------------------------------------------------------
   1 | some name  |     2 | (1):IPC Sections - 147, 323, 352, 504, 506 , Other Details - Case no.283A/2000, A.C.J.M-5, Ghumangunj Ellahabad, UP, Dt.12.11.2000
(1 row)

Using IN in a WHERE clause where the number of items in the set is very large

I would always use

WHERE id IN (1,2,3,4,.....10000)

unless your IN clause was stupidly large, which shouldn't really happen from user input.

Edit: For instance, Rails does this a lot behind the scenes.
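
A rough illustration, assuming a vanilla Rails User model backed by a users table; ActiveRecord expands an array condition into exactly this kind of IN list:

ids = (1..10_000).to_a
User.where(id: ids).to_sql
# => SELECT "users".* FROM "users" WHERE "users"."id" IN (1, 2, 3, ..., 10000)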

It would definitely not be better to do separate update statements in a single transaction.

In SQL, how can you group by in ranges?

Neither of the highest-voted answers is correct on SQL Server 2000. Perhaps they were written against a different version.

Here are the correct versions of both of them on SQL Server 2000.

select t.range as [score range], count(*) as [number of occurrences]
from (
  select case
    when score between 0 and 9 then ' 0- 9'
    when score between 10 and 19 then '10-19'
    else '20-99' end as range
  from scores) t
group by t.range

or

select t.range as [score range], count(*) as [number of occurrences]
from (
  select user_id,
    case when score >= 0 and score < 10 then '0-9'
         when score >= 10 and score < 20 then '10-19'
         else '20-99' end as range
  from scores) t
group by t.range

