How to Find Out Line-Endings in a Text File

How to find out line-endings in a text file?

You can use the file utility to give you an indication of the type of line endings.

Unix:

$ file testfile1.txt
testfile1.txt: ASCII text

"DOS":

$ file testfile2.txt
testfile2.txt: ASCII text, with CRLF line terminators

To convert from "DOS" to Unix:

$ dos2unix testfile2.txt

To convert from Unix to "DOS":

$ unix2dos testfile1.txt

Converting an already converted file has no effect, so it's safe to run blindly (i.e. without testing the format first), although the usual disclaimers apply, as always.
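
If you can't install dos2unix, the conversion itself is small enough to write yourself. Here's a minimal C sketch of the same idea, reading stdin and writing stdout; it's an illustration, not a drop-in replacement for the real tool. Like dos2unix, it's idempotent: input with no CRs passes through unchanged.

#include <stdio.h>

/* Minimal dos2unix-style filter: copies stdin to stdout, dropping
 * each '\r' that is immediately followed by '\n'. Lone '\r'
 * characters (old Mac line endings) pass through untouched.
 * Usage: ./crlf2lf < testfile2.txt > converted.txt
 */
int main(void)
{
    int c = getchar();
    while (c != EOF)
    {
        if (c == '\r')
        {
            int next = getchar();
            if (next != '\n')
                putchar('\r');  /* lone CR: keep it */
            if (next == EOF)
                break;
            c = next;
            continue;
        }
        putchar(c);
        c = getchar();
    }
    return 0;
}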

How to read a file with different line endings in C++

As Sebastian said, we need to read part of the file first and work out which line ending it uses. So we open the file in binary mode, read the first line, and inspect its last character.

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

// Detect the file's line ending by inspecting the first line(s).
// Reports a printable name in `newline` and the character to pass
// to getline() in `delimiter`.
void SetLineEnding(const char *filename, std::string &newline, char &delimiter)
{
    std::string str;
    std::ifstream chk(filename, std::ios::binary);
    // Default to Unix ("\n") unless we find evidence of '\r'.
    delimiter = '\n';
    newline = "\\n";
    if (getline(chk, str))
    {
        if (!str.empty() && str[str.size() - 1] == '\r')
        {
            // The ending is either "\r" alone or "\r\n".
            if (getline(chk, str))
            {
                // A second getline() succeeded, so '\n' occurs in
                // the file: the ending is "\r\n".
                delimiter = '\n';
                newline = "\\r\\n";
            }
            else
            {
                // The first getline() swallowed the whole file, so
                // '\r' alone is the terminator.
                delimiter = '\r';
                newline = "\\r";
            }
        }
    }
}

int main()
{
    string newLine;
    char delimiter = '\n';
    char filename[256];
    cin >> filename;
    SetLineEnding(filename, newLine, delimiter);

    std::ifstream inp(filename, std::ios::binary);
    if (!inp.is_open())
    {
        cout << "File not opened" << endl;
        return 0;
    }
    // getline() with an explicit delimiter. Note that for "\r\n"
    // files the delimiter is '\n', so each line keeps a trailing
    // '\r' that you may want to strip.
    string str;
    getline(inp, str, delimiter);

    return 0;
}

Now you can pass delimiter to getline() and read the file according to its line ending. (For "\r\n" files the delimiter is '\n', so remember to strip the trailing '\r' from each extracted line.)
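
If you're doing the same thing in C rather than C++, POSIX offers getdelim(), which takes the delimiter as an argument in much the same way. A minimal sketch, assuming a POSIX system (the detection step above is not repeated here):

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

/* Reads stdin using an explicit delimiter, analogous to passing
 * `delimiter` to std::getline() above. */
int main(void)
{
    char *line = NULL;
    size_t cap = 0;
    int delim = '\n';               /* use '\r' for CR-delimited files */
    ssize_t len;

    while ((len = getdelim(&line, &cap, delim, stdin)) != -1)
    {
        if (len > 0 && line[len - 1] == delim)
            line[--len] = '\0';     /* strip the delimiter itself */
        if (len > 0 && line[len - 1] == '\r')
            line[--len] = '\0';     /* strip trailing CR from CRLF input */
        printf("[%s]\n", line);
    }
    free(line);
    return 0;
}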

How do I read the last line of a text file in a C program?

Compare these two programs, one (mis)using feof() and one not using it at all. The first corresponds closely to the code in the question — it ignores the return value from fgets() to its detriment. The second only tests the return value from fgets(); it has no need to use feof().

eof53.c

#include <stdio.h>

int main(void)
{
    char buffer[256];

    fgets(buffer, sizeof(buffer), stdin);
    while (!feof(stdin))
    {
        printf("[%s]\n", buffer);
        fgets(buffer, sizeof(buffer), stdin);
    }
    return 0;
}

eof71.c

#include <stdio.h>

int main(void)
{
    char buffer[256];

    while (fgets(buffer, sizeof(buffer), stdin) != NULL)
        printf("[%s]\n", buffer);
    return 0;
}

Given a data file abc containing 3 bytes — 0x41 ('A'), 0x42 ('B'), 0x43 ('C') and no newline, I get the following results:

$ eof53 < abc
$ eof71 < abc
[ABC]
$

This was tested on macOS Big Sur 11.6.6.

Note that fgets() does not report EOF (by returning a null pointer) when reading the (only) incomplete line, but feof() does report EOF at that point. That's correct, since the input really has ended, even though fgets() returned a string (albeit not a complete line) of data. This is why eof53 prints nothing for this file while eof71 prints the partial line.

As explained in the canonical Q&A while (!feof(file)) is always wrong!, using feof() rather than testing the return value from the I/O functions leads to bad results.
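
To actually get the last line, you don't need feof() at all: keep overwriting a buffer while fgets() succeeds, and whatever it last stored is the final line. A minimal sketch along those lines (the fixed 256-byte buffer is an assumption; a longer line would be split and only its tail reported):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buffer[256];
    char last[256] = "";
    int seen = 0;

    /* Each successful read overwrites `last`; when fgets() finally
     * returns NULL, `last` holds the final line, whether or not it
     * ended with a newline. */
    while (fgets(buffer, sizeof(buffer), stdin) != NULL)
    {
        strcpy(last, buffer);
        seen = 1;
    }
    if (seen)
        printf("last line: [%s]\n", last);
    return 0;
}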

Get newline stats for a text file in Python

import sys


def calculate_line_endings(path):
    # order matters! (b'\r\n' must be tested before b'\n' and b'\r')
    endings = [
        b'\r\n',
        b'\n\r',
        b'\n',
        b'\r',
    ]
    counts = dict.fromkeys(endings, 0)

    # Note: even in binary mode Python splits lines on b'\n', so the
    # b'\n\r' and bare-b'\r' counters can only match on a final,
    # newline-free line.
    with open(path, 'rb') as fp:
        for line in fp:
            for x in endings:
                if line.endswith(x):
                    counts[x] += 1
                    break

    print('crlf: %d' % counts[b'\r\n'])
    print('lfcr: %d' % counts[b'\n\r'])
    print('cr: %d' % counts[b'\r'])
    print('lf: %d' % counts[b'\n'])


if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.exit('usage: %s <filepath>' % sys.argv[0])
    calculate_line_endings(sys.argv[1])

For your file this gives:

crlf: 1123
lfcr: 0
cr: 0
lf: 0

Is it enough?

Text Editor which shows \r\n?

With Notepad++, you can show end-of-line characters. It displays CR and LF rather than "\r" and "\n", but it gets the point across, and since the line breaks themselves are still rendered, you see each line's terminator right where it occurs.

To use Notepad++ for this, open the View menu, go to the Show Symbol submenu, and select either "Show All Characters" or "Show End of Line".


Converting text file with spaces between CR & LF

Line-by-line processing of a stream doesn't have to be a bottleneck if you implement it at the right point in your overall process.

When I've had to do this kind of preprocessing I put a folder watch on the inbound folder, then automatically pick up each file and process it upon arrival, putting the original into an archive folder and writing the processed file into another location from which data will be parsed or loaded into the database. Unless you have unusual real-time requirements, you'll never notice this kind of overhead. If you do have real-time requirements, this issue will pale in comparison to all the other issues you'll face with batched data files :)
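
If you take that route, the per-file transformation itself is small. Here's a minimal C sketch of the processing step, assuming (per the question) that the junk to remove is a run of spaces sitting between a CR and the following LF; anything else passes through untouched:

#include <stdio.h>

int main(void)
{
    int c;
    while ((c = getchar()) != EOF)
    {
        if (c == '\r')
        {
            /* Count the spaces that follow the CR. */
            int spaces = 0;
            while ((c = getchar()) == ' ')
                spaces++;
            if (c == '\n')
            {
                /* CR <spaces> LF: drop the padding, keep a clean CRLF. */
                fputs("\r\n", stdout);
            }
            else
            {
                /* Not the pattern we're fixing: replay what we consumed. */
                putchar('\r');
                while (spaces-- > 0)
                    putchar(' ');
                if (c != EOF)
                    ungetc(c, stdin);  /* let the main loop reprocess it */
            }
            continue;
        }
        putchar(c);
    }
    return 0;
}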

But you may not even have to go through a preprocessing step at all. You didn't indicate what database you will be using or how you plan to load the data, but many databases do include utilities to process fixed-length records. In the past, fixed-format files came with every imaginable kind of bizarre format (and contained all kinds of stuff that had to be stripped out or converted). As a result those utilities tend to be very efficient at this kind of task. In my experience they can easily be at least an order of magnitude faster than line-by-line processing, which can make a real difference on larger bulk loads.

If your database doesn't have good bulk import tools, there are a number of open-source or freeware utilities already written that do pretty much exactly what you need; you can find them on GitHub and elsewhere. Examples include NPM's replace package and zzzprojects' findandreplace.

For a quick-and-dirty approach that lets you preview all the changes while you develop a more robust solution, many text editors can find and replace across multiple files; I've used that approach successfully in the past. For example, Notepad++'s find-and-replace-in-files dialog lets you use regular expressions to remove or change whatever you like in all files matching defined criteria.



