Check Line for Unprintable Characters While Reading Text File

Check line for unprintable characters while reading text file

If you want to check whether a string has unprintable characters, you can use a regular expression:

[^\p{Print}]
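Note that \p{Print} is syntax from Java's regex flavor; C++'s std::regex (ECMAScript grammar) doesn't understand \p{...}, but the POSIX class [[:print:]] inside a bracket expression is a close stand-in. A minimal C++ sketch, assuming a hypothetical file name input.txt and single-byte characters (bytes above 0x7F may be classified differently depending on locale):

#include <fstream>
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::ifstream inFile("input.txt");            // hypothetical file name
    const std::regex unprintable("[^[:print:]]"); // rough equivalent of [^\p{Print}]
    std::string line;
    for (int lineNo = 1; std::getline(inFile, line); ++lineNo) {
        if (std::regex_search(line, unprintable))
            std::cout << "line " << lineNo << " contains an unprintable character\n";
    }
}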

How to find non-printable characters in a text file

Option #1 - Show All Characters

You can download Notepad++ and open the file there. Then, go to the menu and select View->Show Symbol->Show All Characters. All characters will become visible, but you will have to scroll through the whole file to see which character needs to be removed.

Unfortunately, Notepad++ will automatically convert line endings according to your Edit->EOL Conversion selection, so it won't help if your non-printable characters are CR or LF.

Option #2 - TextFX Zap Non-printable Chars

Alternatively, you could install the TextFX plugin from SourceForge, and use TextFX->TextFX Characters->Zap all non-printable characters to #. This will replace some non-printable characters with a pound sign, but not CR or LF.

Option #3 - Remove BOM Encoding

Lastly, you could use Notepad++, and use Encoding->Convert to UTF-8 without BOM. This removes the byte order mark (BOM), an invisible prefix that occasionally causes issues with certain renderers (VSO).
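If you'd rather strip the BOM outside an editor, a minimal C++ sketch (with hypothetical input/output file names) is to skip the three UTF-8 BOM bytes (EF BB BF) when present and copy the rest of the file unchanged:

#include <fstream>

int main() {
    std::ifstream in("input.txt", std::ios::binary);  // hypothetical names
    std::ofstream out("clean.txt", std::ios::binary);
    char head[3] = {};
    in.read(head, 3);
    // Write the first bytes back unless they are the UTF-8 BOM (EF BB BF).
    if (!(in.gcount() == 3 && head[0] == '\xEF' && head[1] == '\xBB' && head[2] == '\xBF'))
        out.write(head, in.gcount());
    out << in.rdbuf(); // copy the remainder of the file unchanged
}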

Reading lines of text file in C++ fails due to hidden/control characters

Based on your description, there is just no newline at the end of one of the two files. You can have a look at the files using, e.g., od -c file | less to see their exact contents, including the character codes.

That said, your approach to reading lines can probably be improved: Just read a line, check if it could be read, and process it. This way, there is no need to count the number of line endings up front:

for (std::string line; std::getline(inFile, line); ) {
    vec.push_back(std::stod(line)); // std::strtod() needs a second argument; std::stod is the C++11 way
}

Personally, I would probably just read the numbers in the first place, e.g.:

for (double value; inFile >> value; ) {
    vec.push_back(value);
}

Well, that's not really the way to read a sequence of doubles into a vector but this is:

std::vector<double> vec((std::istream_iterator<double>(inFile)),
                        std::istream_iterator<double>());

(Instead of the extra parentheses, you could use uniform initialization notation in C++11.)
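That is, assuming a C++11 compiler, the brace form sidesteps the most vexing parse without the extra parentheses:

std::vector<double> vec{std::istream_iterator<double>(inFile),
                        std::istream_iterator<double>()};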

How to evaluate escape character when reading a text file in Python

The file stores the two-character sequence \n literally rather than as a real newline, so replace it while reading:

with open('./sample.txt', 'r') as file_:
    txt = file_.read().replace('\\n', '\n')
    print(txt)

Output

"RUN DATE:  2/08/18   9:00:24     USER:XXXXXX        DISPLAY: MENULIST           PROG NAME: MH4567 PAGE 1
MENU: ADCS00 Visual Basic Things
Service
80 Printer / Message Control
90 Sign Off
Selection or
command
===>____________________________________________________________________________
____________________________________________________________________________
____
F3=Exit F4=Prompt F9=Retrieve F12=Previous
80 CALL PGM(GUCMD)
AUTHORIZED: DOAPROCESS FDOAPROCES FOESUPR FPROGRAMMR OESUPR PROGRAMMER

90 SIGNOFF
AUTHORIZED: DOAPROCESS FDOAPROCES FOESUPR FPROGRAMMR OESUPR PROGRAMMER
", "RUN DATE: 5/09/19 9:00:24 USER:XXXXXX DISPLAY:
MENULIST PROG NAME: MH4567 PAGE 2
MENU: APM001 Accounts Payable Menu
MENU
OPTIONS DISPLAY PROGRAMS
1 Radar Processing 30 Vendor
2 Prepaid Processing
31 Prepaid
"

When reading from a file I get special characters that are not in my text file

I've fixed up your code a little and added some debugging statements. It saves the contents of the input file into a malloc'ed buffer of size (sizeof(char)*filesize + 1), with the +1 to hold the null terminating character. It works on my machine with reasonably sized binary files.

You can uncomment the printf that prints the whole buffer to get what you were doing before. Otherwise, it will now loop through each byte in the buffer and print it as its hexadecimal equivalent. If you're still getting that 'junk', then it's just part of your input file and not a bug.

// get the file ("rb" avoids newline translation, so ftell matches what fread returns)
FILE *infp = fopen(infile, "rb");
if (infp != NULL) {
    // get length of file
    fseek(infp, 0L, SEEK_END);
    size_t filesize = (size_t)ftell(infp);
    printf("file length = %d\n", (int)filesize); // debug statement
    rewind(infp);
    if (filesize > 0) {
        char *buffer = (char*)malloc((sizeof(char) * filesize) + 1); // +1 for null char at the end
        size_t chars_read = fread(buffer, sizeof(char), filesize, infp);
        printf("chars read = %d\n", (int)chars_read); // debug statement (chars_read should equal filesize)
        buffer[chars_read] = '\0'; // properly terminate the char array
        fclose(infp);
        // output what you read (method 1, print string)
        //printf("the file=\"%s\"", buffer); // uncomment this statement to do what you did before
        // output what you read (method 2, byte by byte)
        if (chars_read > 0) {
            size_t i;
            for (i = 0; i < chars_read; i++) {
                // cast to unsigned char so bytes above 0x7F aren't sign-extended
                printf("char%d=%02x\n", (int)i, (unsigned char)buffer[i]);
            }
        } else {
            printf("problem with fread\n");
        }
        free(buffer); // indexing instead of buffer++ keeps the pointer valid for free()
    } else {
        printf("problem with filesize\n");
    }
} else {
    printf("problem opening the file\n");
}

The while loop would stop reading at the first null terminating char. The for loop will now read every single byte inside the file (in case you're trying to peek inside something that isn't necessarily .txt, like .jpg).


Have you tried checking the file from the command line to make sure it only has the characters you expect?

For example, run od -c on the file to view each byte as its ASCII character (or as an octal escape, if it is non-printable).

Reading non-ASCII characters from a text file

  1. First of all, detect the file's encoding:

from chardet import detect

# `line` is assumed to be a bytes object read from the file
encoding = lambda x: detect(x)['encoding']
print(encoding(line))

  2. Then, convert it to unicode, or to a str in your default encoding:

n_line = line.decode(encoding(line), errors='ignore')
print(n_line)
print(n_line.encode('utf8'))


Related Topics



Leave a reply



Submit