Read/Write .Txt File with Special Characters

Read/write .txt file with special characters

It's the output console that doesn't support those characters. Since you're using Eclipse, ensure that it's configured to use UTF-8: go to Window > Preferences > General > Workspace > Text File Encoding and set it to UTF-8.

See also:

  • Unicode - How to get the characters right?

Update: as per the updated question and the comments, the UTF-8 BOM is apparently the culprit. Notepad adds a UTF-8 BOM on save by default, and it looks like the JRE on your HTC doesn't swallow it. You may want to use the UnicodeReader example outlined in this answer instead of InputStreamReader in your code. It autodetects and skips the BOM.

FileInputStream fis = new FileInputStream(new File(fileName));
UnicodeReader ur = new UnicodeReader(fis, "UTF-8");
BufferedReader in = new BufferedReader(ur);
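As an aside, the same BOM-skipping idea is easy to see in Python, whose utf-8-sig codec autodetects and strips the BOM the way UnicodeReader does (a minimal illustration, not the Java class itself; the filename is made up):

```python
import codecs

# Write a file the way Notepad does: UTF-8 with a leading BOM.
with open("bom_example.txt", "wb") as f:
    f.write(codecs.BOM_UTF8 + "héllo".encode("utf-8"))

# Plain utf-8 keeps the BOM as a stray '\ufeff' character...
with open("bom_example.txt", "r", encoding="utf-8") as f:
    raw = f.read()

# ...while utf-8-sig detects and skips it.
with open("bom_example.txt", "r", encoding="utf-8-sig") as f:
    clean = f.read()

print(repr(raw))    # '\ufeffhéllo'
print(repr(clean))  # 'héllo'
```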

Unrelated to the actual problem: it's good practice to close resources in the finally block, so that they are closed even when an exception is thrown.

BufferedReader reader = null;
try {
    reader = new BufferedReader(new UnicodeReader(new FileInputStream(fileName), "UTF-8"));
    // ...
} finally {
    if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
}

Also unrelated, I'd suggest putting Pattern p = Pattern.compile(","); outside the loop, or even making it a static constant, because compiling a pattern is relatively expensive and there's no need to redo it on every iteration.
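The same principle applies in any regex API; a quick Python sketch of hoisting the compile out of the loop (the data here is invented for the example):

```python
import re

# Compile once, at module level - the analogue of a static constant.
COMMA = re.compile(",")

lines = ["a,b,c", "1,2,3"]
rows = []
for line in lines:
    # Reuse the precompiled pattern instead of recompiling per iteration.
    rows.append(COMMA.split(line))

print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```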

Read in a dataframe from .txt file with special characters in R

You can specify the encoding while importing, or set it after importing the data.

Option 1

df <- read.table('path/file.ext', encoding = "UTF-8", ...)

Option 2

x <- c(
">like I don't understand< sorry like how old's your mom¿",
"°ye[a:h]°",
"°I don't know°")

Encoding(x) <- 'UTF-8'

print(x)

Read special characters from .txt file in python

When you open a text file in Python, the default encoding is platform-dependent (often a legacy ANSI code page on Windows), so it may not decode your é character correctly. Try

word_file = open("./words.txt", "r", encoding='utf-8')
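A minimal end-to-end sketch of the fix (the filename and contents are just examples): write the word with an explicit UTF-8 encoding, then read it back the same way.

```python
# Write a word containing 'é' with an explicit encoding...
with open("words.txt", "w", encoding="utf-8") as f:
    f.write("café\n")

# ...then read it back with the same encoding.
with open("words.txt", "r", encoding="utf-8") as f:
    word = f.read().strip()

print(word)  # café
```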

Python convert a .txt file with special characters into dataframe

delimiter and sep are actually aliases; you can use either of them. Use skiprows=1 to skip the first row. Note that a regex separator requires the python parser engine, so pass engine='python' explicitly (otherwise pandas falls back to it with a warning):

pd.read_csv('filename.txt', sep=r'\s*&\s*', engine='python', skiprows=1)

Output:

       #A      B      C       D      E
0    #foo  13.52  333.2  4504.4   0.00
1  #1 taw  13.49  314.6     4.6   1.29
2  #2 ewq  35.44    4.2     5.2   3.06
3  #3 asd  13.41    4.1     6.8   5.04
4   #4 er  13.37  230.0     7.1   7.07
5   #5 we  13.33  199.7     8.9   9.12
6  #6 wed  13.27  169.4     8.6  11.17
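Under the hood, the python engine simply splits each line on that regex. A stdlib-only sketch of the idea, using made-up sample lines shaped like the question's data (not the real file):

```python
import re

sample = """header to skip
#foo & 13.52 & 333.2 & 4504.4 & 0.00
#1 taw & 13.49 & 314.6 & 4.6 & 1.29"""

sep = re.compile(r"\s*&\s*")  # the same pattern passed to read_csv's sep=
# [1:] mimics skiprows=1
rows = [sep.split(line.strip()) for line in sample.splitlines()[1:]]

print(rows[0])  # ['#foo', '13.52', '333.2', '4504.4', '0.00']
```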

When reading from a file I get special characters, that are not in my text file

I've fixed up your code a little and added some debugging statements. It saves the contents of the input file into a malloc'ed buffer of size (sizeof(char) * filesize + 1), with the +1 holding the null terminator. It works on my machine with reasonably sized binary files.

You can uncomment the printf statement to get what you were doing before. Otherwise, it now loops through each byte in the buffer and prints its hexadecimal value. If you're still seeing that 'junk', then it's just part of your input file, not a bug.

// get the file ("rb" so bytes aren't translated when peeking at binary files)
FILE *infp = fopen(infile, "rb");
if (infp != NULL) {
    // get length of file
    fseek(infp, 0L, SEEK_END);
    size_t filesize = (size_t)ftell(infp);
    printf("file length = %d\n", (int)filesize); // debug statement
    rewind(infp);
    if (filesize > 0) {
        char *buffer = (char *)malloc(sizeof(char) * filesize + 1); // +1 for null char at the end
        size_t chars_read = fread(buffer, sizeof(char), filesize, infp);
        printf("chars read = %d\n", (int)chars_read); // debug statement (chars_read should equal filesize)
        buffer[chars_read] = '\0'; // properly terminate the char array
        fclose(infp);
        // output what you read (method 1, print string)
        //printf("the file=\"%s\"", buffer); // uncomment this statement to do what you did before
        // output what you read (method 2, byte-by-byte)
        if (chars_read > 0) {
            size_t i;
            for (i = 0; i < chars_read; i++) {
                // cast to unsigned char so %02x doesn't sign-extend bytes >= 0x80
                printf("char%d=%02x\n", (int)i, (unsigned char)buffer[i]);
            }
        } else { printf("problem with fread"); }
        free(buffer);
    } else { printf("problem with filesize"); }
} else { printf("problem opening the file"); }

The while loop would stop reading at the first null terminator. The for loop now reads every single byte in the file (in case you're trying to peek inside something that isn't necessarily .txt, like a .jpg).


Have you tried checking the file from the command line to make sure it only has the characters you expect?

For example, run od -c on the file to view each byte as its ASCII character (or its octal value, if non-printable).
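If od isn't available, a rough Python stand-in is easy to sketch; here the bytes come from an in-memory string instead of open(path, "rb").read(), and values are shown in hex rather than octal:

```python
data = "héllo\n".encode("utf-8")  # stand-in for open(path, "rb").read()

# Print each byte's hex value, 16 bytes per line, with an octal offset like od's.
lines = []
for offset in range(0, len(data), 16):
    chunk = data[offset:offset + 16]
    lines.append(f"{offset:07o}  " + " ".join(f"{b:02x}" for b in chunk))

print("\n".join(lines))  # 0000000  68 c3 a9 6c 6c 6f 0a
```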


