How to Use 'Catdoc' to Display Doc File Encoded in UTF-8

Version control Word .docx files with docx2txt with Git on Mac OS X

Thanks to klang, I got this working. Now I can diff .docx files in Terminal.app on Mac OS X (10.9), although this setup doesn't work seamlessly with the SourceTree GUI. The steps below are basically the same as klang's, with minor corrections.

Download and install the docx2txt converter from http://docx2txt.sourceforge.net/

wget -O doc2txt.tar.gz 'http://docx2txt.cvs.sourceforge.net/viewvc/docx2txt/?view=tar'
tar zxf doc2txt.tar.gz
cd docx2txt/docx2txt/
sudo make

Then make a small wrapper script to make docx2txt output to STDOUT

echo '#!/bin/bash
docx2txt.pl "$1" -' > /usr/local/bin/docx2txt
chmod +x /usr/local/bin/docx2txt

Git attributes for (Word) .docx diffing in your repository

echo "*.docx diff=wordx" >> .gitattributes
git config diff.wordx.textconv docx2txt

Use .git/info/attributes if the setting should not be committed with the project.

Git attributes for (Word) .doc diffing

echo "*.doc diff=word" >> .gitattributes
git config diff.word.textconv strings

Detect and convert encoding for list of files

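This is easy with file, grep, and awk. The pipeline below prints the names of the files that file reports as ISO-8859:
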
file exports/invoice/* | grep "ISO-8859" | awk -F':' '{print $1}'
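
The pipeline above only covers the detection half. As a rough sketch of the conversion half, the Python below rewrites a file as UTF-8 when it does not already validate as UTF-8. The function name convert_to_utf8 and the default source encoding of ISO-8859-1 are assumptions made for illustration; if your files use a different legacy encoding, pass that instead.

```python
from pathlib import Path


def convert_to_utf8(path: Path, source_encoding: str = "iso-8859-1") -> bool:
    """Rewrite `path` as UTF-8 if it is not already valid UTF-8.

    Returns True if the file was rewritten, False if it was left untouched.
    """
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")  # already valid UTF-8 (this includes plain ASCII)
        return False
    except UnicodeDecodeError:
        pass
    # Assumption: anything that is not valid UTF-8 is in `source_encoding`.
    text = raw.decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
    return True


if __name__ == "__main__":
    import sys

    # Convert every file named on the command line.
    for name in sys.argv[1:]:
        if convert_to_utf8(Path(name)):
            print(f"converted {name}")
```

Saved under a name of your choosing (say, convert.py), it could be run as python convert.py exports/invoice/*. It overwrites files in place, so try it on a copy first.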

Write Velocity file forced in UTF-8

Use a FileOutputStream in combination with an OutputStreamWriter:

// Needs java.io.* and java.nio.charset.StandardCharsets (Java 7+).
final OutputStream out = new FileOutputStream(...);
final Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);

How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8, and ASCII

First, the easy cases:

ASCII

If your data contains no bytes above 0x7F, then it's ASCII. (Or a 7-bit ISO646 encoding, but those are very obsolete.)

UTF-8

If your data validates as UTF-8, then you can safely assume it is UTF-8. Due to UTF-8's strict validation rules, false positives are extremely rare.
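
The two easy cases above might be sketched in Python like this. The function name easy_case is made up for illustration; the checks are just the ones described: no byte above 0x7F means ASCII, and a successful strict decode means UTF-8.

```python
from typing import Optional


def easy_case(data: bytes) -> Optional[str]:
    """Return 'ascii' or 'utf-8' for the easy cases, or None if undecided."""
    if all(b <= 0x7F for b in data):
        return "ascii"
    try:
        # bytes.decode is strict by default and rejects malformed sequences.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None
```

A None result means the data is in some 8-bit legacy encoding, and the harder questions below apply.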

ISO-8859-1 vs. windows-1252

The only difference between these two encodings is that ISO-8859-1 has the C1 control characters where windows-1252 has the printable characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ. I've seen plenty of files that use curly quotes or dashes, but none that use C1 control characters. So don't bother with them, or with ISO-8859-1 at all; just detect windows-1252 instead.

That now leaves you with only one question.

How do you distinguish MacRoman from cp1252?

This is a lot trickier.

Undefined characters

The bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are not used in windows-1252. If they occur, then assume the data is MacRoman.

Identical characters

The bytes 0xA2 (¢), 0xA3 (£), 0xA9 (©), 0xB1 (±), 0xB5 (µ) happen to be the same in both encodings. If these are the only non-ASCII bytes, then it doesn't matter whether you choose MacRoman or cp1252.

Statistical approach

Count character (NOT byte!) frequencies in the data you know to be UTF-8. Determine the most frequent characters. Then use this data to determine whether the cp1252 or MacRoman characters are more common.

For example, in a search I just performed on 100 random English Wikipedia articles, the most common non-ASCII characters are ·•–é°®’èö—. Based on this fact,

  • The bytes 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, or 0xF6 suggest windows-1252.
  • The bytes 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, or 0xE1 suggest MacRoman.

Count up the cp1252-suggesting bytes and the MacRoman-suggesting bytes, and go with whichever count is greater.
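
Here is a minimal Python sketch of the MacRoman-versus-cp1252 heuristic, using the byte lists given above. The function name guess_legacy_encoding is hypothetical, and falling back to cp1252 on a tie is my own assumption, since the text does not say what to do when the counts are equal. It expects data that has already failed the ASCII and UTF-8 checks.

```python
# Bytes that are unassigned in windows-1252.
CP1252_UNDEFINED = frozenset(b"\x81\x8d\x8f\x90\x9d")
# Bytes of the ten most common characters when encoded as windows-1252.
CP1252_HINTS = frozenset(b"\x92\x95\x96\x97\xae\xb0\xb7\xe8\xe9\xf6")
# Bytes of the same ten characters when encoded as MacRoman.
MACROMAN_HINTS = frozenset(b"\x8e\x8f\x9a\xa1\xa5\xa8\xd0\xd1\xd5\xe1")


def guess_legacy_encoding(data: bytes) -> str:
    """Guess between 'mac_roman' and 'cp1252' (Python codec names)."""
    # A byte that windows-1252 does not define settles it immediately.
    if any(b in CP1252_UNDEFINED for b in data):
        return "mac_roman"
    cp1252_score = sum(b in CP1252_HINTS for b in data)
    macroman_score = sum(b in MACROMAN_HINTS for b in data)
    # Assumption: on a tie (including no hints at all), prefer cp1252.
    return "mac_roman" if macroman_score > cp1252_score else "cp1252"
```

The returned string can be passed straight to bytes.decode. The hint tables come from English text, so for other languages you would want to rebuild them from your own frequency counts.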

Unknown characters displaying while encoding UTF-8 words into JSON format using json_encode in PHP

JSON fully supports Unicode (or rather, the standard that parsers implement does). What you are seeing comes from the PHP side: by default, json_encode escapes every non-ASCII character.

To quote an answer to a related Stack Overflow question:

Some frameworks, including PHP's implementation of JSON, always do the safe numeric encodings on the encoder side. This is intended for maximum compatibility with buggy/limited transport mechanisms and the like. However, this should not be interpreted as an indication that JSON decoders have problems with UTF-8.

Those "unknown characters" that you are referring to are actually known as Unicode Escape Sequences, and are there for parsers built in programming languages that do not fully support Unicode. These sequences are also used in CSS files, for displaying Unicode characters (see CSS content property).
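
To see that the escaped and unescaped forms carry the same data, here is a small illustration using Python's json module rather than PHP. Python's encoder also escapes non-ASCII characters by default and turns that off with ensure_ascii=False; PHP's json_encode has a comparable JSON_UNESCAPED_UNICODE flag. The sample string is made up.

```python
import json

text = "naïve café"

escaped = json.dumps(text)                  # ensure_ascii=True is the default
raw = json.dumps(text, ensure_ascii=False)  # keep the characters as they are

print(escaped)  # "na\u00efve caf\u00e9"
print(raw)      # "naïve café"

# Both forms decode back to exactly the same string.
assert json.loads(escaped) == json.loads(raw) == text
```

Either form is valid JSON, so which one to emit is mostly a question of readability and of how much you trust the transport to handle non-ASCII bytes.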

If you want to display this in your client side app (I'm going to assume you're using Java), then I'll refer you to this question

tl;dr: There is nothing wrong with your JSON file. Those encodings are there to help the parser.


