Check Line for Unprintable Characters While Reading Text File

Check line for unprintable characters while reading text file

If you want to check whether a string has unprintable characters, you can use a regular expression:

[^\p{Print}]
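Note that \p{Print} is syntax from Java's regex flavor; C++'s std::regex (ECMAScript grammar) doesn't understand \p{...}, but the POSIX class [[:print:]] inside a bracket expression is a close stand-in. A minimal C++ sketch, assuming a hypothetical file name input.txt and single-byte characters (bytes above 0x7F may be classified differently depending on locale):

#include <fstream>
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::ifstream inFile("input.txt");            // hypothetical file name
    const std::regex unprintable("[^[:print:]]"); // rough equivalent of [^\p{Print}]
    std::string line;
    for (int lineNo = 1; std::getline(inFile, line); ++lineNo) {
        if (std::regex_search(line, unprintable))
            std::cout << "line " << lineNo << " contains an unprintable character\n";
    }
}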

How to find non-printable characters in a text file

Option #1 - Show All Characters

You can download Notepad++ and open the file there. Then, go to the menu and select View->Show Symbol->Show All Characters. All characters will become visible, but you will have to scroll through the whole file to see which character needs to be removed.

Unfortunately, Notepad++ will automatically convert line endings according to your Edit->EOL Conversion selection, so it won't help if your non-printable characters are CR or LF.

Option #2 - TextFX Zap Non-printable Chars

Alternatively, you could install the TextFX plugin from SourceForge, and use TextFX->TextFX Characters->Zap all non-printable characters to #. This will replace some non-printable characters with a pound sign, but not CR or LF.

Option #3 - Remove BOM Encoding

Lastly, you could use Notepad++, and use Encoding->Convert to UTF-8 without BOM. This removes the byte order mark (BOM), an invisible prefix that occasionally causes issues with certain renderers (VSO).
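If you'd rather strip the BOM outside an editor, a minimal C++ sketch (with hypothetical input/output file names) is to skip the three UTF-8 BOM bytes (EF BB BF) when present and copy the rest of the file unchanged:

#include <fstream>

int main() {
    std::ifstream in("input.txt", std::ios::binary);  // hypothetical names
    std::ofstream out("clean.txt", std::ios::binary);
    char head[3] = {};
    in.read(head, 3);
    // Write the first bytes back unless they are the UTF-8 BOM (EF BB BF).
    if (!(in.gcount() == 3 && head[0] == '\xEF' && head[1] == '\xBB' && head[2] == '\xBF'))
        out.write(head, in.gcount());
    out << in.rdbuf(); // copy the remainder of the file unchanged
}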

Reading lines of text file in C++ fails due to hidden/control characters

Based on your description, there is just no newline at the end of one of the two files. You can have a look at the files using, e.g., od -c file | less to see their exact contents, including the character codes.

That said, your approach to reading lines can probably be improved: Just read a line, check if it could be read, and process it. This way, there is no need to count the number of line endings up front:

for (std::string line; std::getline(inFile, line); ) {
    vec.push_back(std::stod(line)); // std::strtod() needs a second argument; std::stod is the C++11 way
}

Personally, I would probably just read the numbers in the first place, e.g.:

for (double value; inFile >> value; ) {
    vec.push_back(value);
}

Well, that's not really the way to read a sequence of doubles into a vector but this is:

std::vector<double> vec((std::istream_iterator<double>(inFile)),
                        std::istream_iterator<double>());

(Instead of the extra parentheses, you could use uniform initialization notation in C++11.)
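That is, assuming a C++11 compiler, the brace form sidesteps the most vexing parse without the extra parentheses:

std::vector<double> vec{std::istream_iterator<double>(inFile),
                        std::istream_iterator<double>()};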

How to evaluate escape character when reading a text file in Python

The file stores the two-character sequence \n literally rather than as a real newline, so replace it while reading:

with open('./sample.txt', 'r') as file_:
    txt = file_.read().replace('\\n', '\n')
    print(txt)

Output

"RUN DATE:  2/08/18   9:00:24     USER:XXXXXX        DISPLAY: MENULIST           PROG NAME: MH4567 PAGE 1
MENU: ADCS00 Visual Basic Things
Service
80 Printer / Message Control
90 Sign Off
Selection or
command
===>____________________________________________________________________________
____________________________________________________________________________
____
F3=Exit F4=Prompt F9=Retrieve F12=Previous
80 CALL PGM(GUCMD)
AUTHORIZED: DOAPROCESS FDOAPROCES FOESUPR FPROGRAMMR OESUPR PROGRAMMER

90 SIGNOFF
AUTHORIZED: DOAPROCESS FDOAPROCES FOESUPR FPROGRAMMR OESUPR PROGRAMMER
", "RUN DATE: 5/09/19 9:00:24 USER:XXXXXX DISPLAY:
MENULIST PROG NAME: MH4567 PAGE 2
MENU: APM001 Accounts Payable Menu
MENU
OPTIONS DISPLAY PROGRAMS
1 Radar Processing 30 Vendor
2 Prepaid Processing
31 Prepaid
"

When reading from a file I get special characters that are not in my text file

I've fixed up your code a little and added some debugging statements. It saves the contents of the input file into a malloc'ed buffer of size (sizeof(char)*filesize + 1), with the +1 to hold the null terminating character. It works on my machine with reasonably sized binary files.

You can uncomment the printf that prints the whole buffer to get what you were doing before. Otherwise, it will now loop through each byte in the buffer and print it as its hexadecimal equivalent. If you're still getting that 'junk', then it's just part of your input file and not a bug.

// get the file ("rb" avoids newline translation, so ftell matches what fread returns)
FILE *infp = fopen(infile, "rb");
if (infp != NULL) {
    // get length of file
    fseek(infp, 0L, SEEK_END);
    size_t filesize = (size_t)ftell(infp);
    printf("file length = %d\n", (int)filesize); // debug statement
    rewind(infp);
    if (filesize > 0) {
        char *buffer = (char*)malloc((sizeof(char) * filesize) + 1); // +1 for null char at the end
        size_t chars_read = fread(buffer, sizeof(char), filesize, infp);
        printf("chars read = %d\n", (int)chars_read); // debug statement (chars_read should equal filesize)
        buffer[chars_read] = '\0'; // properly terminate the char array
        fclose(infp);
        // output what you read (method 1, print string)
        //printf("the file=\"%s\"", buffer); // uncomment this statement to do what you did before
        // output what you read (method 2, byte by byte)
        if (chars_read > 0) {
            size_t i;
            for (i = 0; i < chars_read; i++) {
                // cast to unsigned char so bytes above 0x7F aren't sign-extended
                printf("char%d=%02x\n", (int)i, (unsigned char)buffer[i]);
            }
        } else {
            printf("problem with fread\n");
        }
        free(buffer); // indexing instead of buffer++ keeps the pointer valid for free()
    } else {
        printf("problem with filesize\n");
    }
} else {
    printf("problem opening the file\n");
}

The while loop would stop reading at the first null terminating char. The for loop will now read every single byte inside the file (in case you're trying to peek inside something that isn't necessarily .txt, like .jpg).


Have you tried checking the file from the command line to make sure it only has the characters you expect?

For example, run od -c on the file to view each byte as its ASCII character (or as an octal escape, if it is non-printable).

Reading non-ASCII characters from a text file

  1. First of all, detect the file's encoding:

from chardet import detect

# `line` is assumed to be a bytes object read from the file
encoding = lambda x: detect(x)['encoding']
print(encoding(line))

  2. Then, convert it to unicode, or to a str in your default encoding:

n_line = line.decode(encoding(line), errors='ignore')
print(n_line)
print(n_line.encode('utf8'))


Related Topics



Leave a reply



Submit