Read/Write .Txt File with Special Characters

Read/write .txt file with special characters

It's the output console that doesn't support those characters. Since you're using Eclipse, ensure that it's configured to use UTF-8: go to Window > Preferences > General > Workspace > Text File Encoding and set it to UTF-8.

See also:

  • Unicode - How to get the characters right?

Update: as per the updated question and the comments, the UTF-8 BOM is apparently the culprit. Notepad adds a UTF-8 BOM on save by default, and it looks like the JRE on your HTC doesn't swallow it. You may want to use the UnicodeReader example outlined in this answer instead of InputStreamReader in your code. It autodetects and skips the BOM.

FileInputStream fis = new FileInputStream(new File(fileName));
UnicodeReader ur = new UnicodeReader(fis, "UTF-8");
BufferedReader in = new BufferedReader(ur);
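As an aside, the same BOM-skipping idea is easy to see in Python, whose utf-8-sig codec autodetects and strips the BOM the way UnicodeReader does (a minimal illustration, not the Java class itself; the filename is made up):

```python
import codecs

# Write a file the way Notepad does: UTF-8 with a leading BOM.
with open("bom_example.txt", "wb") as f:
    f.write(codecs.BOM_UTF8 + "héllo".encode("utf-8"))

# Plain utf-8 keeps the BOM as a stray '\ufeff' character...
with open("bom_example.txt", "r", encoding="utf-8") as f:
    raw = f.read()

# ...while utf-8-sig detects and skips it.
with open("bom_example.txt", "r", encoding="utf-8-sig") as f:
    clean = f.read()

print(repr(raw))    # '\ufeffhéllo'
print(repr(clean))  # 'héllo'
```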

Unrelated to the actual problem: it's good practice to close resources in the finally block, so that they are closed even when an exception is thrown.

BufferedReader reader = null;
try {
    reader = new BufferedReader(new UnicodeReader(new FileInputStream(fileName), "UTF-8"));
    // ...
} finally {
    if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
}

Also unrelated, I'd suggest putting Pattern p = Pattern.compile(","); outside the loop, or even making it a static constant, because compiling a pattern is relatively expensive and there's no need to redo it on every iteration.
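The same principle applies in any regex API; a quick Python sketch of hoisting the compile out of the loop (the data here is invented for the example):

```python
import re

# Compile once, at module level - the analogue of a static constant.
COMMA = re.compile(",")

lines = ["a,b,c", "1,2,3"]
rows = []
for line in lines:
    # Reuse the precompiled pattern instead of recompiling per iteration.
    rows.append(COMMA.split(line))

print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```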

Read in a dataframe from .txt file with special characters in R

You can specify the encoding while importing, or set it after importing the data.

Option 1

df <- read.table('path/file.ext', encoding = "UTF-8", ...)

Option 2

x <- c(
">like I don't understand< sorry like how old's your mom¿",
"°ye[a:h]°",
"°I don't know°")

Encoding(x) <- 'UTF-8'

print(x)

Read special characters from .txt file in python

When you open a text file in Python, the default encoding is platform-dependent (often a legacy ANSI code page on Windows), so it may not decode your é character correctly. Try

word_file = open("./words.txt", "r", encoding='utf-8')
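A minimal end-to-end sketch of the fix (the filename and contents are just examples): write the word with an explicit UTF-8 encoding, then read it back the same way.

```python
# Write a word containing 'é' with an explicit encoding...
with open("words.txt", "w", encoding="utf-8") as f:
    f.write("café\n")

# ...then read it back with the same encoding.
with open("words.txt", "r", encoding="utf-8") as f:
    word = f.read().strip()

print(word)  # café
```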

Python convert a .txt file with special characters into dataframe

delimiter and sep are actually aliases; you can use either of them. Use skiprows=1 to skip the first row. Note that a regex separator requires the python parser engine, so pass engine='python' explicitly (otherwise pandas falls back to it with a warning):

pd.read_csv('filename.txt', sep=r'\s*&\s*', engine='python', skiprows=1)

Output:

       #A      B      C       D      E
0    #foo  13.52  333.2  4504.4   0.00
1  #1 taw  13.49  314.6     4.6   1.29
2  #2 ewq  35.44    4.2     5.2   3.06
3  #3 asd  13.41    4.1     6.8   5.04
4   #4 er  13.37  230.0     7.1   7.07
5   #5 we  13.33  199.7     8.9   9.12
6  #6 wed  13.27  169.4     8.6  11.17
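Under the hood, the python engine simply splits each line on that regex. A stdlib-only sketch of the idea, using made-up sample lines shaped like the question's data (not the real file):

```python
import re

sample = """header to skip
#foo & 13.52 & 333.2 & 4504.4 & 0.00
#1 taw & 13.49 & 314.6 & 4.6 & 1.29"""

sep = re.compile(r"\s*&\s*")  # the same pattern passed to read_csv's sep=
# [1:] mimics skiprows=1
rows = [sep.split(line.strip()) for line in sample.splitlines()[1:]]

print(rows[0])  # ['#foo', '13.52', '333.2', '4504.4', '0.00']
```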

When reading from a file I get special characters, that are not in my text file

I've fixed up your code a little and added some debugging statements. It saves the contents of the input file into a malloc'ed buffer of size (sizeof(char) * filesize + 1), with the +1 holding the null terminator. It works on my machine with reasonably sized binary files.

You can uncomment the printf statement to get what you were doing before. Otherwise, it now loops through each byte in the buffer and prints its hexadecimal value. If you're still seeing that 'junk', then it's just part of your input file, not a bug.

// get the file ("rb" so bytes aren't translated when peeking at binary files)
FILE *infp = fopen(infile, "rb");
if (infp != NULL) {
    // get length of file
    fseek(infp, 0L, SEEK_END);
    size_t filesize = (size_t)ftell(infp);
    printf("file length = %d\n", (int)filesize); // debug statement
    rewind(infp);
    if (filesize > 0) {
        char *buffer = (char *)malloc(sizeof(char) * filesize + 1); // +1 for null char at the end
        size_t chars_read = fread(buffer, sizeof(char), filesize, infp);
        printf("chars read = %d\n", (int)chars_read); // debug statement (chars_read should equal filesize)
        buffer[chars_read] = '\0'; // properly terminate the char array
        fclose(infp);
        // output what you read (method 1, print string)
        //printf("the file=\"%s\"", buffer); // uncomment this statement to do what you did before
        // output what you read (method 2, byte-by-byte)
        if (chars_read > 0) {
            size_t i;
            for (i = 0; i < chars_read; i++) {
                // cast to unsigned char so %02x doesn't sign-extend bytes >= 0x80
                printf("char%d=%02x\n", (int)i, (unsigned char)buffer[i]);
            }
        } else { printf("problem with fread"); }
        free(buffer);
    } else { printf("problem with filesize"); }
} else { printf("problem opening the file"); }

The while loop would stop reading at the first null terminator. The for loop now reads every single byte in the file (in case you're trying to peek inside something that isn't necessarily .txt, like a .jpg).


Have you tried checking the file from the command line to make sure it only has the characters you expect?

For example, run od -c on the file to view each byte as its ASCII character (or its octal value, if non-printable).
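If od isn't available, a rough Python stand-in is easy to sketch; here the bytes come from an in-memory string instead of open(path, "rb").read(), and values are shown in hex rather than octal:

```python
data = "héllo\n".encode("utf-8")  # stand-in for open(path, "rb").read()

# Print each byte's hex value, 16 bytes per line, with an octal offset like od's.
lines = []
for offset in range(0, len(data), 16):
    chunk = data[offset:offset + 16]
    lines.append(f"{offset:07o}  " + " ".join(f"{b:02x}" for b in chunk))

print("\n".join(lines))  # 0000000  68 c3 a9 6c 6c 6f 0a
```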


