Reading a CSV File in .NET

How to read a CSV file into a .NET DataTable

Here's an excellent class that will copy CSV data into a DataTable, using the structure of the data to create the DataTable:

A portable and efficient generic parser for flat files

It's easy to configure and easy to use. I urge you to take a look.
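
If you just want something self-contained while evaluating that library, here is a minimal sketch of the same idea using only the BCL's TextFieldParser (covered in more detail in the next answer; this is not the linked library's API). It assumes the first row holds the column names:

using System.Data;
using System.Linq;
using Microsoft.VisualBasic.FileIO;

// Minimal sketch: build a DataTable whose columns come from the CSV header row.
// This uses the BCL's TextFieldParser, not the linked library's API.
static DataTable CsvToDataTable(string path)
{
    var table = new DataTable();
    using var parser = new TextFieldParser(path)
    {
        TextFieldType = FieldType.Delimited,
        HasFieldsEnclosedInQuotes = true,
    };
    parser.SetDelimiters(",");

    var headers = parser.ReadFields();          // first row -> column names
    if (headers == null) return table;          // empty file
    foreach (var header in headers)
        table.Columns.Add(header);

    while (!parser.EndOfData)
    {
        var fields = parser.ReadFields();
        if (fields != null)
            table.Rows.Add(fields.Cast<object>().ToArray());
    }
    return table;
}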

Reading CSV files using C#

Don't reinvent the wheel. Take advantage of what's already in the .NET BCL.

  • Add a reference to Microsoft.VisualBasic (yes, it says VisualBasic, but it works in C# just as well; remember that in the end it is all just IL).
  • Use the Microsoft.VisualBasic.FileIO.TextFieldParser class to parse the CSV file.

Here is the sample code:

using Microsoft.VisualBasic.FileIO;

using (TextFieldParser parser = new TextFieldParser(@"c:\temp\test.csv"))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    while (!parser.EndOfData)
    {
        // Process one row
        string[] fields = parser.ReadFields();
        foreach (string field in fields)
        {
            // TODO: Process field
        }
    }
}

It works great for me in my C# projects.

Here are some more links/information:

  • MSDN: Read From Comma-Delimited Text Files in Visual Basic
  • MSDN: TextFieldParser Class

Read a .csv file in C# efficiently?

Method 1

By using LINQ:

// File.ReadLines streams the file lazily; split each line on the delimiter once.
var lines = File.ReadLines("FilePath");
var csv = from line in lines
          select line.Split(',');

Method 2

As Jay Riggs stated in the answer above, the portable and efficient generic parser for flat files is an excellent class that will copy CSV data into a DataTable, using the structure of the data to create the DataTable. It's easy to configure and easy to use.

Method 3

Rolling your own CSV reader is a waste of time unless the files that you're reading are guaranteed to be very simple. Use a pre-existing, tried-and-tested implementation instead.
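
For example, one widely used off-the-shelf option is CsvHelper (my suggestion, not a library named in the answers above); a minimal sketch that maps each row onto a plain class might look like this:

using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using CsvHelper;

// Hypothetical row type; property names are assumed to match the CSV header.
class Person
{
    public string Name { get; set; }
    public int Age { get; set; }
}

static List<Person> LoadPeople(string path)
{
    using var reader = new StreamReader(path);
    using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);
    // GetRecords maps each row onto a Person using the header row.
    return csv.GetRecords<Person>().ToList();
}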

Reading a CSV file with 50M lines, how to improve performance

I have a lot of experience with CSV, and the bad news is that you aren't going to be able to make this a whole lot faster. CSV libraries aren't going to be of much assistance here. The difficult problem with CSV, which libraries attempt to handle, is dealing with fields that contain embedded commas or newlines, which require quoting and escaping. Your dataset doesn't have this issue, since none of the columns are strings.
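
To illustrate the quoting problem (not something your data needs, but it is what a general-purpose parser has to handle), here is a small sketch using the TextFieldParser shown earlier; the sample line is made up:

using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

// The second field contains an embedded comma and the third contains escaped
// quotes, so a naive Split(',') would break them apart.
var line = "123,\"Smith, John\",\"said \"\"hi\"\"\"";
using var parser = new TextFieldParser(new StringReader(line))
{
    TextFieldType = FieldType.Delimited,
    HasFieldsEnclosedInQuotes = true,
};
parser.SetDelimiters(",");
string[] fields = parser.ReadFields();
Console.WriteLine(string.Join(" | ", fields)); // 123 | Smith, John | said "hi"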

As you have discovered, the bulk of the time is spent in the parse methods. Andrew Morton had a good suggestion: using TryParseExact for DateTime values can be quite a bit faster than TryParse. My own CSV library, Sylvan.Data.Csv (which is the fastest available for .NET), uses an optimization where it parses primitive values directly out of the stream read buffer without converting to string first (only when running on .NET Core), which can also speed things up a bit. However, I wouldn't expect it to be possible to cut the processing time in half while sticking with CSV.
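
As a quick illustration of that suggestion (the sample value and format string here are assumptions, matching the format used further below):

using System;
using System.Globalization;

var value = "2021-03-04T12:34:56";   // hypothetical sample timestamp

// TryParse has to guess among many culture-dependent formats.
DateTime.TryParse(value, CultureInfo.InvariantCulture, DateTimeStyles.None, out var slow);

// TryParseExact matches a single known format, which is typically faster.
DateTime.TryParseExact(value, "yyyy-MM-ddTHH:mm:ss",
    CultureInfo.InvariantCulture, DateTimeStyles.None, out var fast);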

Here is an example of using my library, Sylvan.Data.Csv, to process the CSV in C#.

using System;
using System.Collections.Generic;
using System.IO;
using Sylvan.Data.Csv;

// Foo is the record type from the original code, with time, ID, A, B and C members.
static List<Foo> Read(string file)
{
    // estimate of the average row length based on Andrew Morton's 4GB/50m
    const int AverageRowLength = 80;

    using var textReader = File.OpenText(file);
    // specifying the DateFormat will cause TryParseExact to be used.
    var csvOpts = new CsvDataReaderOptions { DateFormat = "yyyy-MM-ddTHH:mm:ss" };
    using var csvReader = CsvDataReader.Create(textReader, csvOpts);

    // estimate the number of rows to avoid growing the list.
    var estimatedRows = (int)(textReader.BaseStream.Length / AverageRowLength);
    var data = new List<Foo>(estimatedRows);

    while (csvReader.Read())
    {
        // skip short/malformed rows
        if (csvReader.RowFieldCount < 5) continue;
        var item = new Foo()
        {
            time = csvReader.GetDateTime(0),
            ID = csvReader.GetInt64(1),
            A = csvReader.GetDouble(2),
            B = csvReader.GetDouble(3),
            C = csvReader.GetDouble(4)
        };
        data.Add(item);
    }
    return data;
}

I'd expect this to be somewhat faster than your current implementation, so long as you are running on .NET Core. On .NET Framework the difference, if any, wouldn't be significant. However, I don't expect this to be acceptably fast for your users; it will still likely take tens of seconds, or even minutes, to read the whole file.

Given that, my advice would be to abandon CSV altogether, which means you can abandon parsing, which is what is slowing things down. Instead, read and write the data in binary form. Your data records have a nice property in that they are fixed width: each record contains 5 fields that are 8 bytes (64 bits) wide, so each record requires exactly 40 bytes in binary form. 50m x 40 = 2GB. So, assuming Andrew Morton's estimate of 4GB for the CSV is correct, moving to binary will halve the storage needs. Immediately, that means there is half as much disk IO needed to read the same data. But beyond that, you won't need to parse anything; the binary representation of each value is essentially copied directly to memory.

Here are some examples of how to do this in C# (I don't know VB very well, sorry).


using System;
using System.Collections.Generic;
using System.IO;

static List<Foo> Read(string file)
{
    using var stream = File.OpenRead(file);
    using var br = new BinaryReader(stream);
    // the exact number of records can be determined from the length of the file.
    var recordCount = (int)(stream.Length / 40);
    var data = new List<Foo>(recordCount);
    for (int i = 0; i < recordCount; i++)
    {
        var ticks = br.ReadInt64();
        var id = br.ReadInt64();
        var a = br.ReadDouble();
        var b = br.ReadDouble();
        var c = br.ReadDouble();
        var f = new Foo()
        {
            time = new DateTime(ticks),
            ID = id,
            A = a,
            B = b,
            C = c,
        };
        data.Add(f);
    }
    return data;
}

static void Write(List<Foo> data, string file)
{
    using var stream = File.Create(file);
    using var bw = new BinaryWriter(stream);
    foreach (var item in data)
    {
        bw.Write(item.time.Ticks);
        bw.Write(item.ID);
        bw.Write(item.A);
        bw.Write(item.B);
        bw.Write(item.C);
    }
}

This should almost certainly be an order of magnitude faster than a CSV-based solution. The question then becomes: is there some reason that you must use CSV? If the source of the data is out of your control and you must use CSV, I would then ask: will the data file change every time, or will it only be appended to with new data? If it is only appended to, I would investigate a solution where, each time the app starts, you convert only the newly appended CSV data and add it to a binary file, then load everything from the binary file; a sketch of that idea follows below. Then you only have to pay the cost of processing the new CSV data each time, and you load everything quickly from the binary form.
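
A minimal sketch of that incremental conversion, assuming the 40-byte record layout above, no header row, and the same time, ID, A, B, C column order (the method and file names are hypothetical):

using System;
using System.IO;

static void AppendNewCsvRows(string csvFile, string binFile)
{
    // Each binary record is 40 bytes, so the binary file length tells us how
    // many CSV rows were already converted on previous runs.
    long alreadyConverted = File.Exists(binFile) ? new FileInfo(binFile).Length / 40 : 0;

    using var reader = new StreamReader(csvFile);
    using var bw = new BinaryWriter(new FileStream(binFile, FileMode.Append));

    // Skip the rows converted previously (assumes the CSV has no header row).
    for (long i = 0; i < alreadyConverted; i++) reader.ReadLine();

    string line;
    while ((line = reader.ReadLine()) != null)
    {
        var parts = line.Split(',');
        if (parts.Length < 5) continue;           // skip short/malformed rows
        bw.Write(DateTime.Parse(parts[0]).Ticks); // time
        bw.Write(long.Parse(parts[1]));           // ID
        bw.Write(double.Parse(parts[2]));         // A
        bw.Write(double.Parse(parts[3]));         // B
        bw.Write(double.Parse(parts[4]));         // C
    }
}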

This could be made even faster by creating a fixed-layout struct (Foo), allocating an array of them, and using span-based trickery to read the array data directly from the FileStream. This works because all of your data elements are "blittable". It would be the absolute fastest way to load this data into your program. Start with the BinaryReader/BinaryWriter approach, and if you find that still isn't fast enough, then investigate this.
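
For reference, a minimal sketch of that span-based approach (the struct name and field order are assumptions that must match the 40-byte layout written above, and FileStream.Read(Span<byte>) requires .NET Core or later):

using System;
using System.IO;
using System.Runtime.InteropServices;

// Blittable, fixed-layout record matching the binary file: 5 x 8 bytes = 40 bytes.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct FooRecord
{
    public long Ticks;
    public long ID;
    public double A;
    public double B;
    public double C;
}

static FooRecord[] ReadAll(string file)
{
    using var stream = File.OpenRead(file);
    var count = (int)(stream.Length / Marshal.SizeOf<FooRecord>());
    var records = new FooRecord[count];

    // Reinterpret the record array as raw bytes and fill it straight from the
    // stream, so no per-field parsing or copying is needed.
    var bytes = MemoryMarshal.AsBytes(records.AsSpan());
    while (bytes.Length > 0)
    {
        int read = stream.Read(bytes);
        if (read == 0) break;             // unexpected end of file
        bytes = bytes.Slice(read);
    }
    return records;
}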

If you find this solution to work, I'd love to hear the results.

How to Read CSV file

There is a class, TextFieldParser, located in the Microsoft.VisualBasic.FileIO namespace.

This class can be configured like this:

// "stream" is an already-opened Stream (a file path or TextReader also works).
using TextFieldParser parser = new TextFieldParser(stream)
{
    Delimiters = new[] { "," },
    HasFieldsEnclosedInQuotes = true,
    TextFieldType = FieldType.Delimited,
    TrimWhiteSpace = true,
};

Usage of the parser:

while (!parser.EndOfData)
{
    var fields = parser.ReadFields() ?? throw new InvalidOperationException("Parser unexpectedly returned no data");

    // your code
}

The benefits of using a CSV parser, as mentioned, are that you can process the file line by line, and memory consumption stays extremely low because you read the file sequentially instead of loading everything into memory.

Reading a CSV file with text as a parameter in C#

I haven't tested it, but this should do the trick. Change the first few lines to look like this:

private IEnumerable<Dictionary<string, EntityProperty>> ReadCSV(string data, IEnumerable<TableField> cols)
{
    using (TextReader reader = new StringReader(data))
    {
        // ... the rest of the original method is unchanged; it now reads from the in-memory string
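
A hypothetical call site would then read the file contents elsewhere and pass the text in, for example:

// cols comes from wherever the original code built its IEnumerable<TableField>.
var rows = ReadCSV(File.ReadAllText(@"c:\temp\test.csv"), cols);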

