How to Read and Manipulate CSV File Data in C++

How can I read and manipulate CSV file data in C++?

If what you're really doing is manipulating a CSV file itself, Nelson's answer makes sense. However, my suspicion is that the CSV is simply an artifact of the problem you're solving. In C++, that probably means you have something like this as your data model:

struct Customer {
    int id;
    std::string first_name;
    std::string last_name;
    struct {
        std::string street;
        std::string unit;
    } address;
    char state[2];
    int zip;
};

Thus, when you're working with a collection of data, it makes sense to use std::vector<Customer> or std::set<Customer>.

With that in mind, think of your CSV handling as two operations:

// if you wanted to go nuts, you could use a forward iterator concept for both of these
class CSVReader {
public:
    CSVReader(const std::string &inputFile);
    bool hasNextLine();
    void readNextLine(std::vector<std::string> &fields);
private:
    /* secrets */
};

class CSVWriter {
public:
    CSVWriter(const std::string &outputFile);
    void writeNextLine(const std::vector<std::string> &fields);
private:
    /* more secrets */
};

void readCustomers(CSVReader &reader, std::vector<Customer> &customers);
void writeCustomers(CSVWriter &writer, const std::vector<Customer> &customers);

Read and write a single row at a time, rather than keeping a complete in-memory representation of the file itself. There are a few obvious benefits:

  1. Your data is represented in a form that makes sense for your problem (customers), rather than the current solution (CSV files).
  2. You can trivially add adapters for other data formats, such as bulk SQL import/export, Excel/OO spreadsheet files, or even an HTML <table> rendering.
  3. Your memory footprint is likely to be smaller (depends on relative sizeof(Customer) vs. the number of bytes in a single row).
  4. CSVReader and CSVWriter can be reused as the basis for an in-memory model (such as Nelson's) without loss of performance or functionality. The converse is not true.

Read .csv file in C

Hopefully this will get you started.

See it live on http://ideone.com/l23He (using stdin)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const char* getfield(char* line, int num)
{
    const char* tok;
    for (tok = strtok(line, ";\n");
            tok && *tok;
            tok = strtok(NULL, ";\n"))
    {
        if (!--num)
            return tok;
    }
    return NULL;
}

int main()
{
    FILE* stream = fopen("input", "r");
    if (!stream) {
        perror("input");
        return 1;
    }

    char line[1024];
    while (fgets(line, 1024, stream))
    {
        char* tmp = strdup(line);
        printf("Field 3 would be %s\n", getfield(tmp, 3));
        // NOTE strtok clobbers tmp
        free(tmp);
    }
    fclose(stream);
}

Output:

Field 3 would be nazwisko
Field 3 would be Kowalski
Field 3 would be Nowak
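One caveat with the strtok() approach: it treats a run of delimiters as a single one, so empty fields (a;;c) are silently skipped and the field numbering shifts. Where strsep() is available (POSIX/BSD, not standard C — an assumption here), a variant that preserves empty fields might look like this:

```c
#define _DEFAULT_SOURCE   // for strsep() on glibc
#include <string.h>

// Like getfield(), but empty fields count: "a;;c" has three fields.
// Clobbers line (like strtok); returns NULL if num is out of range.
const char *getfield_keep_empty(char *line, int num)
{
    char *tok;
    while ((tok = strsep(&line, ";\n")) != NULL) {
        if (!--num)
            return tok;
    }
    return NULL;
}
```

As with getfield(), the buffer is modified in place, so call it on a strdup()'d copy of the line.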

How to edit .csv files in C

The first thing I would do is encapsulate the data in a struct; that makes it
easier to map a line of a CSV file onto an object representing that line.

If city.csv and meteo.csv have different columns, I'd create a different
struct for each file. If both files have the same columns, you could use the
same struct for both. I assume that the files are different and that city.csv
has the format meteo_id,city_id,name.

typedef struct city_t {
    int meteo_id;
    int city_id;
    char name[100]; // assumes no city name is
                    // longer than 99 chars
} city_t;

typedef struct meteo_t {
    int meteo_id;
    int city_id;
    int tempt_max;
    int tempt_min;
    double humidity;
    double pressure;
    char date[11];
} meteo_t;

Let's assume that both files are well formatted; otherwise you would have to
write code that detects and handles errors. That would be the next step in the
exercise, so I'm going to write only a basic version with basic error
recognition.

#include <stdio.h>
#include <stdlib.h> // malloc, realloc, free
#include <string.h>
#include <errno.h>

// takes 2 params, the filename and a pointer
// to size_t where the number of cities is stored
city_t *read_cities(const char *filename, size_t *len)
{
    if(filename == NULL || len == NULL)
        return NULL;

    FILE *fp = fopen(filename, "r");
    if(fp == NULL)
    {
        fprintf(stderr, "Could not open %s: %s\n", filename, strerror(errno));
        return NULL;
    }

    city_t *arr = NULL, *tmp;
    *len = 0;

    // assuming that no line will be longer than 1023 chars
    char line[1024];

    while(fgets(line, sizeof line, fp))
    {
        tmp = realloc(arr, (*len + 1) * sizeof *arr);
        if(tmp == NULL)
        {
            fprintf(stderr, "could not parse the whole file %s\n", filename);
            // returning all parsed cities so far

            if(*len == 0)
            {
                free(arr);
                arr = NULL;
            }

            fclose(fp);
            return arr;
        }

        arr = tmp;

        // %99[^\n] reads up to 99 characters or until the end of the line
        if(sscanf(line, "%d,%d,%99[^\n]", &(arr[*len].meteo_id),
                  &(arr[*len].city_id), arr[*len].name) != 3)
        {
            fprintf(stderr, "Invalid line format (skipping line):\n%s\n", line);
            // skip this line; *len is not incremented
            continue;
        }

        // incrementing only when parsing of the line was OK
        (*len)++;
    }

    fclose(fp);

    // file is empty or
    // all lines have the wrong format
    if(*len == 0)
    {
        free(arr);
        arr = NULL;
    }

    return arr;
}

void print_cities(city_t *cities, size_t len, FILE *fp)
{
    if(cities == NULL || fp == NULL)
        return;

    for(size_t i = 0; i < len; ++i)
        fprintf(fp, "%d,%d,%s\n", cities[i].meteo_id, cities[i].city_id,
                cities[i].name);
}

Now I've written the read and write functions for the file city.csv, assuming
the format meteo_id,city_id,name. print_cities lets you print the CSV
content to the screen (passing stdout as the last argument) or to a file
(passing a FILE pointer as the last argument).

You can use these functions as templates for reading and writing meteo.csv; the
idea is the same.

You can use these functions as follows:

int main(void)
{
    size_t cities_len;
    city_t *cities = read_cities("city.csv", &cities_len);

    // error
    if(cities == NULL)
        return 1;

    do_something_with_cities(cities, cities_len);

    // update csv
    FILE *fp = fopen("city.csv", "w");

    if(fp == NULL)
    {
        fprintf(stderr, "Could not open city.csv for writing: %s\n",
                strerror(errno));

        free(cities);
        return 1;
    }

    print_cities(cities, cities_len, fp);

    fclose(fp);
    free(cities);
    return 0;
}

Now for your exercise: write a similar function that parses meteo.csv (using
my function as a template shouldn't be that difficult) and parse both files. Now
that you've got them in memory, it's easy to manipulate the data (insert,
update, delete). Then write the files like I did in the example and that's it.
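As a starting point for that exercise, here is a sketch of the per-line parse for meteo.csv. The column order meteo_id,city_id,tempt_max,tempt_min,humidity,pressure,date and the field names are assumptions; adjust them to the real file:

```c
#include <stdio.h>

// mirrors the meteo_t struct above; field names and column
// order are assumptions about meteo.csv
typedef struct {
    int meteo_id;
    int city_id;
    int tempt_max;
    int tempt_min;
    double humidity;
    double pressure;
    char date[11];
} meteo_t;

// parses one CSV line; returns 1 on success, 0 on a bad format.
// %10[^,\n] caps the date at 10 chars, matching char date[11].
int parse_meteo_line(const char *line, meteo_t *m)
{
    return sscanf(line, "%d,%d,%d,%d,%lf,%lf,%10[^,\n]",
                  &m->meteo_id, &m->city_id, &m->tempt_max, &m->tempt_min,
                  &m->humidity, &m->pressure, m->date) == 7;
}
```

Wrap this in the same fgets/realloc loop as read_cities and you have read_meteos.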

One last hint: how to search for a city:

// returns the index in the array, or -1 on error or when not found
int search_for_city_by_name(city_t *cities, size_t len, const char *name)
{
    if(cities == NULL || name == NULL)
        return -1;

    for(size_t i = 0; i < len; ++i)
        if(strcmp(name, cities[i].name) == 0)
            return (int)i;

    // not found
    return -1;
}

Now I have given you almost all the parts of the assignment; all you have to do
is stick them together and write the same functions for the meteo.csv file.

Reading CSV from text file in C

You just stored pointers into a local buffer. When you leave load(), that buffer is gone and no longer accessible.

You must allocate memory for name and email before you can copy it into the Owner struct.

char *tok;
size_t len;

tok = strtok(NULL, ",");
len = strlen(tok);
owner->name = malloc(len + 1);
strcpy(owner->name, tok);
...

[EDIT: you need to allocate len+1 bytes so you have space for the terminating NUL character. -Zack]
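As an aside, POSIX (and C23) provide strdup(), which collapses the measure-allocate-copy sequence into a single call. A portable sketch of the same helper, with the error checking the snippet above omits (dup_field is a made-up name):

```c
#include <stdlib.h>
#include <string.h>

// duplicate tok into freshly malloc'd storage; caller must free().
// dup_field is a hypothetical helper, equivalent to POSIX strdup().
char *dup_field(const char *tok)
{
    if (tok == NULL)
        return NULL;

    size_t len = strlen(tok);
    char *copy = malloc(len + 1);   // +1 for the terminating NUL
    if (copy != NULL)
        memcpy(copy, tok, len + 1); // copies the NUL too
    return copy;
}
```

Checking the return value matters here: strtok() returns NULL when the field is missing, and malloc() can fail.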

How do I read a .csv file and save it into a 2d array?

You can replace your current loop with this one:

for(i = 0; i < 4; i++){
    for(j = 0; j < 2; j++){
        char junk;

        if (j != 0) fgetc(in_data);

        fscanf(in_data, "%c%d%c", &junk, &trIn[i][j], &junk);
        printf("%d ", trIn[i][j]);
    }
    fgetc(in_data);
    printf("\n");
}

This works because fscanf(in_data, "%d", &trIn[i][j]) reads an int (%d) from the file in_data into the memory location of trIn[i][j]. fgetc didn't work because it only reads a single character, which was then printed as if it were an integer.

EDIT: Now, for each value in the .csv file, the surrounding " characters (and the , and \n separators) are read into a junk variable or discarded.

Line by line:

if (j != 0) fgetc(in_data); reads and discards the , separator (there is no comma before the first field on a line).

fscanf(in_data, "%c%d%c", &junk, &trIn[i][j], &junk); reads:

  • 1st: the opening " character into a junk variable
  • 2nd: the number (int: %d) into the memory location of trIn[i][j].
  • 3rd: the closing " character into a junk variable
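An aside: the scanf family also supports a * (assignment-suppression) flag, which reads and discards a conversion without needing a junk variable at all. A sketch on a single quoted value (read_quoted_int is a made-up helper):

```c
#include <stdio.h>

// reads an integer wrapped in double quotes, e.g. "42".
// %*c reads one character and throws it away (no variable needed);
// suppressed conversions do not count toward sscanf's return value.
// returns 1 on success, 0 on failure.
int read_quoted_int(const char *s, int *out)
{
    return sscanf(s, "%*c%d%*c", out) == 1;
}
```

The same idea applies to the fscanf loop above: "%*c%d%*c" replaces the two &junk arguments.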

Reading from a CSV file and separating the fields to store in a struct in C

strcspn can be used to find either double quotes or double quote plus comma.

The original string is not modified, so string literals can be utilized.

The position of the double quotes is not significant. They can be in any field.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main( void) {

    char *string[] = {
        "Jim,Hunter,9239234245,\"8/1 Hill Street, New Hampshire\""
        , "Jay,Rooney,92364434245,\"122 McKay Street, Old Town\""
        , "Ray,Bundy,923912345,NOT SPECIFIED"
        , "Ray,Bundy,\" double quote here\",NOT SPECIFIED"
    };

    for ( int each = 0; each < 4; ++each) {
        char *token = string[each];
        char *p = string[each];

        while ( *p) {
            if ( '\"' == *p) { //at a double quote
                p += strcspn ( p + 1, "\""); //advance to next double quote
                p += 2; //to include the opening and closing double quotes
            }
            else {
                p += strcspn ( p, ",\""); //advance to a comma or double quote
            }
            int span = ( int)( p - token);
            if ( span) {
                printf ( "token:%.*s\n", span, token); //print span characters

                //copy to another array
            }
            if ( *p) { //not at terminating zero
                ++p; //do not skip consecutive delimiters

                token = p; //start of next token
            }
        }
    }
    return 0;
}

EDIT: copy to variables

A counter can be used to keep track of fields as they are processed.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZENAME 21
#define SIZEID 11
#define SIZEADDR 151

typedef struct {
    char name[SIZENAME];
    char surname[SIZENAME];
    char uniqueId[SIZEID];
    char address[SIZEADDR];
} employee_t;

int main( void) {

    char *string[] = {
        "Jim,Hunter,9239234245,\"8/1 Hill Street, New Hampshire\""
        , "Jay,Rooney,92364434245,\"122 McKay Street, Old Town\""
        , "Ray,Bundy,923912345,NOT SPECIFIED"
        , "Ray,Bundy,\"quote\",NOT SPECIFIED"
    };
    employee_t *employees = malloc ( sizeof *employees * 4);
    if ( ! employees) {
        fprintf ( stderr, "problem malloc\n");
        return 1;
    }

    for ( int each = 0; each < 4; ++each) {
        char *token = string[each];
        char *p = string[each];
        int field = 0;

        while ( *p) {
            if ( '\"' == *p) {
                p += strcspn ( p + 1, "\""); //advance to a delimiter
                p += 2; //to include the opening and closing double quotes
            }
            else {
                p += strcspn ( p, ",\""); //advance to a delimiter
            }
            int span = ( int)( p - token);
            if ( span) {
                ++field;
                if ( 1 == field) {
                    if ( span < SIZENAME) {
                        strncpy ( employees[each].name, token, span);
                        employees[each].name[span] = 0;
                        printf ( "copied:%s\n", employees[each].name); //print span characters
                    }
                }
                if ( 2 == field) {
                    if ( span < SIZENAME) {
                        strncpy ( employees[each].surname, token, span);
                        employees[each].surname[span] = 0;
                        printf ( "copied:%s\n", employees[each].surname); //print span characters
                    }
                }
                if ( 3 == field) {
                    if ( span < SIZEID) {
                        strncpy ( employees[each].uniqueId, token, span);
                        employees[each].uniqueId[span] = 0;
                        printf ( "copied:%s\n", employees[each].uniqueId); //print span characters
                    }
                }
                if ( 4 == field) {
                    if ( span < SIZEADDR) {
                        strncpy ( employees[each].address, token, span);
                        employees[each].address[span] = 0;
                        printf ( "copied:%s\n", employees[each].address); //print span characters
                    }
                }
            }
            if ( *p) { //not at terminating zero
                ++p; //do not skip consecutive delimiters

                token = p; //start of next token
            }
        }
    }
    free ( employees);
    return 0;
}

Efficient data/string reading and copy from file (CSV) in C

Try scanning through your big file without storing it all in memory, just keeping one record at a time in local variables:

void csvReader(FILE *f) {
    T_structCDT c;
    int count = 0;
    c.string = malloc(1000);
    while (fscanf(f, "%d,%d,%d,%999[^,],%d\n", &c.a, &c.b, &c.vivienda, c.c, &c.d) == 5) {
        // nothing for now
        count++;
    }
    free(c.string);
    printf("%d records parsed\n", count);
}

Measure the time taken for this simplistic parser:

  • If it is quick enough, perform the selection tests and output the few matching records one at a time as they are found during the parse phase. The extra time for these steps should be fairly small, since only a few records match.

  • If the time is too long, you need a fancier CSV parser, which is a lot of work but can be done and made fast, especially if you can assume your input file uses this simple format for all records. This is too broad a subject to detail here, but the achievable speed should be close to that of cat csvfile > /dev/null or grep a_short_string_not_present csvfile.

On my system (average linux server with regular hard disk), it takes less than 20 seconds to parse 40 million lines totalling 2GB from a cold start, and less than 4 seconds the second time: disk I/O seems to be the bottleneck.

If you need to perform this selection very often, you should probably use a different data format, possibly a database system. If the scan is performed occasionally on data whose format is fixed, using faster storage such as SSD will help but don't expect miracles.
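To measure the parse phase from inside the program (rather than with the shell's time), a minimal wall-clock helper can be built on clock_gettime with CLOCK_MONOTONIC (POSIX; availability is an assumption, and now_sec is a made-up name):

```c
#define _POSIX_C_SOURCE 199309L   // for clock_gettime on glibc
#include <time.h>

// seconds elapsed since some fixed, arbitrary point in the past
double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

// intended usage around the parse phase:
//     double t0 = now_sec();
//     csvReader(f);
//     fprintf(stderr, "parse took %.3f s\n", now_sec() - t0);
```

CLOCK_MONOTONIC is preferable to clock() here because the workload is I/O-bound: clock() counts CPU time, which would hide the disk latency you are trying to measure.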

EDIT: To put these words into action, I wrote a simple generator and extractor:

Here is a simple program to generate CSV data:

#include <stdio.h>
#include <stdlib.h>

const char *dict[] = {
"Lorem", "ipsum", "dolor", "sit", "amet;", "consectetur", "adipiscing", "elit;",
"sed", "do", "eiusmod", "tempor", "incididunt", "ut", "labore", "et",
"dolore", "magna", "aliqua.", "Ut", "enim", "ad", "minim", "veniam;",
"quis", "nostrud", "exercitation", "ullamco", "laboris", "nisi", "ut", "aliquip",
"ex", "ea", "commodo", "consequat.", "Duis", "aute", "irure", "dolor",
"in", "reprehenderit", "in", "voluptate", "velit", "esse", "cillum", "dolore",
"eu", "fugiat", "nulla", "pariatur.", "Excepteur", "sint", "occaecat", "cupidatat",
"non", "proident;", "sunt", "in", "culpa", "qui", "officia", "deserunt",
"mollit", "anim", "id", "est", "laborum.",
};

int csvgen(const char *fmt, long lines) {
    char buf[1024];

    if (*fmt == '\0')
        return 1;

    while (lines > 0) {
        size_t pos = 0;
        int count = 0;
        for (const char *p = fmt; *p && pos < sizeof(buf); p++) {
            switch (*p) {
            case '0': case '1': case '2': case '3': case '4':
            case '5': case '6': case '7': case '8': case '9':
                count = count * 10 + *p - '0';
                continue;
            case 'd':
                if (!count) count = 101;
                pos += snprintf(buf + pos, sizeof(buf) - pos, "%d",
                                rand() % (2 + count - 1) - count + 1);
                count = 0;
                continue;
            case 'u':
                if (!count) count = 101;
                pos += snprintf(buf + pos, sizeof(buf) - pos, "%u",
                                rand() % count);
                count = 0;
                continue;
            case 's':
                if (!count) count = 4;
                count = rand() % count + 1;
                while (count-- > 0 && pos < sizeof(buf)) {
                    pos += snprintf(buf + pos, sizeof(buf) - pos, "%s ",
                                    dict[rand() % (sizeof(dict) / sizeof(*dict))]);
                }
                if (pos < sizeof(buf)) {
                    pos--;
                }
                count = 0;
                continue;
            default:
                buf[pos++] = *p;
                count = 0;
                continue;
            }
        }
        if (pos < sizeof(buf)) {
            buf[pos++] = '\n';
            fwrite(buf, 1, pos, stdout);
            lines--;
        }
    }
    return 0;
}

int main(int argc, char *argv[]) {
    if (argc < 3) {
        fprintf(stderr, "usage: csvgen format number\n");
        return 2;
    }
    return csvgen(argv[1], strtol(argv[2], NULL, 0));
}

Here is an extractor with 3 different parsing methods:

#include <stdio.h>
#include <stdlib.h>

static inline unsigned int getuint(const char *p, const char **pp) {
    unsigned int d, n = 0;
    while ((d = *p - '0') <= 9) {
        n = n * 10 + d;
        p++;
    }
    *pp = p;
    return n;
}

int csvgrep(FILE *f, int method) {
    struct {
        int a, b, c, d;
        int spos, slen;
        char s[1000];
    } c;
    int count = 0, line = 0;

    // select 500 out of 43M
#define select(c) ((c).a == 100 && (c).b == 100 && (c).c > 74 && (c).d > 50)

    if (method == 0) {
        // default method: fscanf
        while (fscanf(f, "%d,%d,%d,%999[^,],%d\n", &c.a, &c.b, &c.c, c.s, &c.d) == 5) {
            line++;
            if (select(c)) {
                count++;
                printf("%d,%d,%d,%s,%d\n", c.a, c.b, c.c, c.s, c.d);
            }
        }
    } else
    if (method == 1) {
        // use fgets and simple parser
        char buf[1024];
        while (fgets(buf, sizeof(buf), f)) {
            char *p = buf;
            int i;
            line++;
            c.a = strtol(p, &p, 10);
            p += (*p == ',');
            c.b = strtol(p, &p, 10);
            p += (*p == ',');
            c.c = strtol(p, &p, 10);
            p += (*p == ',');
            for (i = 0; *p && *p != ','; p++) {
                c.s[i++] = *p;
            }
            c.s[i] = '\0';
            p += (*p == ',');
            c.d = strtol(p, &p, 10);
            if (*p != '\n') {
                fprintf(stderr, "csvgrep: invalid format at line %d\n", line);
                continue;
            }
            if (select(c)) {
                count++;
                printf("%d,%d,%d,%s,%d\n", c.a, c.b, c.c, c.s, c.d);
            }
        }
    } else
    if (method == 2) {
        // use fgets and hand coded parser, positive numbers only, no string copy
        char buf[1024];
        while (fgets(buf, sizeof(buf), f)) {
            const char *p = buf;
            line++;
            c.a = getuint(p, &p);
            p += (*p == ',');
            c.b = getuint(p, &p);
            p += (*p == ',');
            c.c = getuint(p, &p);
            p += (*p == ',');
            c.spos = p - buf;
            while (*p && *p != ',') p++;
            c.slen = p - buf - c.spos;
            p += (*p == ',');
            c.d = getuint(p, &p);
            if (*p != '\n') {
                fprintf(stderr, "csvgrep: invalid format at line %d\n", line);
                continue;
            }
            if (select(c)) {
                count++;
                printf("%d,%d,%d,%.*s,%d\n", c.a, c.b, c.c, c.slen, buf + c.spos, c.d);
            }
        }
    } else {
        fprintf(stderr, "csvgrep: unknown method: %d\n", method);
        return 1;
    }
    fprintf(stderr, "csvgrep: %d records selected from %d lines\n", count, line);
    return 0;
}

int main(int argc, char *argv[]) {
    if (argc > 2 && strtol(argv[2], NULL, 0)) {
        // non zero second argument -> set a 1M I/O buffer
        setvbuf(stdin, NULL, _IOFBF, 1024 * 1024);
    }
    return csvgrep(stdin, argc > 1 ? strtol(argv[1], NULL, 0) : 0);
}

And here are some comparative benchmark figures:

$ time ./csvgen "u,u,u,s,u" 43000000 > 43m
real 0m34.428s user 0m32.911s sys 0m1.358s
$ time grep zz 43m
real 0m10.338s user 0m10.069s sys 0m0.211s
$ time wc -lc 43m
43000000 1195458701 43m
real 0m1.043s user 0m0.839s sys 0m0.196s
$ time cat 43m > /dev/null
real 0m0.201s user 0m0.004s sys 0m0.195s
$ time ./csvgrep 0 < 43m > x0
csvgrep: 508 records selected from 43000000 lines
real 0m14.271s user 0m13.856s sys 0m0.341s
$ time ./csvgrep 1 < 43m > x1
csvgrep: 508 records selected from 43000000 lines
real 0m8.235s user 0m7.856s sys 0m0.331s
$ time ./csvgrep 2 < 43m > x2
csvgrep: 508 records selected from 43000000 lines
real 0m3.892s user 0m3.555s sys 0m0.312s
$ time ./csvgrep 2 1 < 43m > x3
csvgrep: 508 records selected from 43000000 lines
real 0m3.706s user 0m3.488s sys 0m0.203s
$ cmp x0 x1
$ cmp x0 x2
$ cmp x0 x3

As you can see, specializing the parse method provides a gain of almost 50%, and hand coding the integer conversion and string scanning gains another 50%. Using a 1-megabyte buffer instead of the default size offers only a marginal gain of 0.2 seconds.

To further improve the speed, you can use mmap() to bypass the I/O streaming interface and make stronger assumptions about the file contents. In the above code, invalid formats are still handled gracefully, but you could remove some tests and shave an extra 5% from the execution time at the cost of reliability.

The above benchmark is performed on a system with an SSD drive and the file 43m fits in RAM, so the timings do not include much disk I/O latency. grep is surprisingly slow, and increasing the search string length makes it even worse. wc -lc sets a target for scanning performance (a further factor of 4), while cat seems out of reach.
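For completeness, a minimal sketch of the mmap() route mentioned above: map the file read-only and scan it with memchr(), here just counting lines (POSIX-only; assumes the file fits in the address space; count_lines_mmap is a made-up name):

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// count '\n' bytes in a file via mmap; returns -1 on error
long count_lines_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    if (st.st_size == 0)    { close(fd); return 0; }

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (data == MAP_FAILED)
        return -1;

    long lines = 0;
    for (const char *p = data, *end = data + st.st_size;
         (p = memchr(p, '\n', end - p)) != NULL; p++)
        lines++;

    munmap(data, st.st_size);
    return lines;
}
```

A field scanner would walk the mapped bytes the same way method 2 walks its fgets buffer, but without copying each line into a separate buffer first.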


