Version control Word .docx files with docx2txt and Git on Mac OS X
Thanks to klang's answer, I got this working: I can now diff .docx files in Terminal.app on Mac OS X (10.9). It doesn't work seamlessly with the SourceTree GUI, though. The steps below are essentially klang's, with minor corrections.
Download and install the docx2txt converter from http://docx2txt.sourceforge.net/
wget -O docx2txt.tar.gz http://docx2txt.cvs.sourceforge.net/viewvc/docx2txt/?view=tar
tar zxf docx2txt.tar.gz
cd docx2txt/docx2txt/
sudo make
Then create a small wrapper script so that docx2txt writes to STDOUT:
echo '#!/bin/bash
docx2txt.pl "$1" -' > /usr/local/bin/docx2txt
chmod +x /usr/local/bin/docx2txt
Git attributes for (Word) .docx diffing in your repository
echo "*.docx diff=wordx" >> .gitattributes
git config diff.wordx.textconv docx2txt
Use .git/info/attributes instead if the setting should not be committed with the project.
Git attributes for (Word) .doc diffing
echo "*.doc diff=word" >> .gitattributes
git config diff.word.textconv strings
Detect and convert encoding for list of files
awk makes this easy. List the files detected as ISO-8859:
file exports/invoice/* | grep "ISO-8859" | awk -F':' '{print $1}'
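The pipeline above only detects; the "convert" half of the job can be sketched in Python. This is an assumption-laden sketch, not part of the original answer: it reuses the exports/invoice directory from the example, skips files that are already valid UTF-8, and treats everything else as ISO-8859-1 (which maps every byte, so the decode never fails).

```python
from pathlib import Path

def convert_latin1_to_utf8(path: Path) -> None:
    """Rewrite an ISO-8859-1 file as UTF-8 in place; leave valid UTF-8 alone."""
    data = path.read_bytes()
    try:
        data.decode("utf-8")   # already valid UTF-8: nothing to do
        return
    except UnicodeDecodeError:
        pass
    # ISO-8859-1 assigns a character to every byte, so this cannot fail.
    path.write_text(data.decode("iso-8859-1"), encoding="utf-8")

# Directory name taken from the file/awk example above.
target = Path("exports/invoice")
if target.is_dir():
    for f in target.glob("*"):
        if f.is_file():
            convert_latin1_to_utf8(f)
```

The same loop could of course be a one-liner with iconv; the Python version just makes the "skip files that are already UTF-8" guard explicit.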
Write Velocity file forced in UTF-8
Use a FileOutputStream in combination with an OutputStreamWriter:
final OutputStream out = new FileOutputStream(...);
final Writer writer = new OutputStreamWriter(out, Charset.forName("UTF-8"));
How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8, and ASCII
First, the easy cases:
ASCII
If your data contains no bytes above 0x7F, then it's ASCII. (Or a 7-bit ISO646 encoding, but those are very obsolete.)
UTF-8
If your data validates as UTF-8, then you can safely assume it is UTF-8. Due to UTF-8's strict validation rules, false positives are extremely rare.
ISO-8859-1 vs. windows-1252
The only difference between these two encodings is that ISO-8859-1 has the C1 control characters where windows-1252 has the printable characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ. I've seen plenty of files that use curly quotes or dashes, but none that use C1 control characters, so don't bother with those (or with ISO-8859-1); just detect windows-1252 instead.
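That difference is easy to see directly; here is a quick Python illustration (latin-1 is Python's name for ISO-8859-1):

```python
# The bytes 0x80-0x9F are the only ones where the two encodings differ:
# windows-1252 maps them to printable punctuation, while ISO-8859-1
# (latin-1) maps them to invisible C1 control characters.
sample = bytes([0x93, 0x94, 0x96])      # common "smart punctuation" bytes
print(sample.decode("windows-1252"))    # “”– (curly quotes, en dash)
print(repr(sample.decode("latin-1")))   # '\x93\x94\x96' (C1 controls)
```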
That now leaves you with only one question.
How do you distinguish MacRoman from cp1252?
This is a lot trickier.
Undefined characters
The bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are not used in windows-1252. If they occur, then assume the data is MacRoman.
Identical characters
The bytes 0xA2 (¢), 0xA3 (£), 0xA9 (©), 0xB1 (±), 0xB5 (µ) happen to be the same in both encodings. If these are the only non-ASCII bytes, then it doesn't matter whether you choose MacRoman or cp1252.
Statistical approach
Count character (NOT byte!) frequencies in the data you know to be UTF-8. Determine the most frequent characters. Then use this data to determine whether the cp1252 or MacRoman characters are more common.
For example, in a search I just performed on 100 random English Wikipedia articles, the most common non-ASCII characters were ·•–é°®’èö—. Based on this data:
- The bytes 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, or 0xF6 suggest windows-1252.
- The bytes 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, or 0xE1 suggest MacRoman.
Count up the cp1252-suggesting bytes and the MacRoman-suggesting bytes, and go with whichever is greatest.
Unknown characters displaying while encoding UTF-8 words into JSON format using json_encode in PHP
JSON fully supports Unicode (or rather, the standard that parsers implement does). The problem is that PHP does not fully support Unicode.
To quote from a related Stack Overflow question:
Some frameworks, including PHP's implementation of JSON, always do the safe numeric encodings on the encoder side. This is intended for maximum compatibility with buggy/limited transport mechanisms and the like. However, this should not be interpreted as an indication that JSON decoders have problems with UTF-8.
Those "unknown characters" that you are referring to are actually known as Unicode Escape Sequences, and are there for parsers built in programming languages that do not fully support Unicode. These sequences are also used in CSS files, for displaying Unicode characters (see CSS content property).
If you want to display this in your client-side app (I'm going to assume you're using Java), then I'll refer you to this question.
tl;dr: There is nothing wrong with your JSON file. Those encodings are there to help the parser.
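The behavior is easy to reproduce outside PHP. Python's json module also emits numeric escape sequences by default, and turning them off is the analogue of PHP's JSON_UNESCAPED_UNICODE flag (available since PHP 5.4):

```python
import json

# Like PHP's json_encode, Python escapes non-ASCII characters as
# \uXXXX sequences by default (ensure_ascii=True).
escaped = json.dumps({"city": "Zürich"})
print(escaped)   # {"city": "Z\u00fcrich"}

# The escaped form decodes back to the identical Unicode string,
# which is why decoders have no problem with it.
assert json.loads(escaped) == {"city": "Zürich"}

# Emitting raw UTF-8 instead, as JSON_UNESCAPED_UNICODE does in PHP.
raw = json.dumps({"city": "Zürich"}, ensure_ascii=False)
print(raw)       # {"city": "Zürich"}
```

Both outputs are valid JSON representing the same value; the escaped form merely survives transports that mangle non-ASCII bytes.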