Transliteration Script for Linux Shell

Not an answer, just to show a briefer, idiomatic way to populate the table[] array from @konsolebox's answer as discussed in the related comments:

BEGIN {
    split("a  e b", old)
    split("x ch o", new)
    for (i in old)
        table[old[i]] = new[i]
    FS = OFS = ""
}

This makes the mapping of old to new chars easy to see: each char in the first split() maps to the char(s) directly below it. For any other mapping you just change the string(s) passed to split() instead of changing 26-ish explicit assignments to table[].

You can even create a general script to do mappings and just pass in the old and new strings as variables:

BEGIN {
    split(o, old)
    split(n, new)
    for (i in old)
        table[old[i]] = new[i]
    FS = OFS = ""
}

then in the shell, something like this:

old="a  e b"
new="x ch o"
awk -v o="$old" -v n="$new" -f script.awk file

and you can protect yourself from your own mistakes populating the strings, e.g.:

BEGIN {
    numOld = split(o, old)
    numNew = split(n, new)

    if (numOld != numNew) {
        printf "ERROR: #old vals (%d) != #new vals (%d)\n", numOld, numNew | "cat>&2"
        exit 1
    }

    for (i=1; i <= numOld; i++) {
        if (old[i] in table) {
            printf "ERROR: \"%s\" duplicated at position %d in old string\n", old[i], i | "cat>&2"
            exit 1
        }
        if (newvals[new[i]]++) {
            printf "WARNING: \"%s\" duplicated at position %d in new string\n", new[i], i | "cat>&2"
        }
        table[old[i]] = new[i]
    }
}

Wouldn't it be good to know if you wrote that b maps to x and then later mistakenly wrote that b maps to y? The above really is the best way to do this but your call of course.

Here's one complete solution, as discussed in the comments below:

BEGIN {
    numOld = split("a  e b", old)
    numNew = split("x ch o", new)

    if (numOld != numNew) {
        printf "ERROR: #old vals (%d) != #new vals (%d)\n", numOld, numNew | "cat>&2"
        exit 1
    }

    for (i=1; i <= numOld; i++) {
        if (old[i] in table) {
            printf "ERROR: \"%s\" duplicated at position %d in old string\n", old[i], i | "cat>&2"
            exit 1
        }
        if (newvals[new[i]]++) {
            printf "WARNING: \"%s\" duplicated at position %d in new string\n", new[i], i | "cat>&2"
        }
        map[old[i]] = new[i]
    }

    FS = OFS = ""
}
{
    for (i = 1; i <= NF; ++i) {
        if ($i in map) {
            $i = map[$i]
        }
    }
    print
}

I renamed the table array to map simply because, IMHO, that better represents the purpose of the array.

Save the above in a file script.awk and run it as awk -f script.awk inputfile
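
A portable variant, as a sketch only: the empty FS that splits each record into single characters is an extension (gawk documents it) rather than something every awk guarantees. Assuming substr() handles your input's characters acceptably, the same map can be applied by walking each line with substr(). The function name translit_map and its two arguments are made up for this illustration.

```shell
# translit_map OLD NEW
# OLD and NEW are space-separated lists; reads text from stdin.
# Hypothetical helper, not part of the original answer.
translit_map() {
    awk -v o="$1" -v n="$2" '
    BEGIN {
        numOld = split(o, old, " ")
        split(n, new, " ")
        for (i = 1; i <= numOld; i++)
            map[old[i]] = new[i]
    }
    {
        out = ""
        # Walk the line one character at a time instead of relying on FS = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            out = out ((c in map) ? map[c] : c)
        }
        print out
    }'
}

printf 'abe\n' | translit_map 'a e b' 'x ch o'   # prints xoch
```

Note that awk -v interprets backslash escapes in the assigned value, so a mapping that contains backslashes would need extra care.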

Transliteration in sed

I think you misunderstand the description. What y does is this: whenever any character from the left-hand side occurs in the input, it is replaced with the corresponding character from the right-hand side.

Specifying one character multiple times doesn't really make sense and I'm not sure the behavior of sed is defined in this case, although your version apparently takes the first occurrence and uses that.

To illustrate:

$ echo HELLO WORLD | sed 'y/L/x/'
HExxO WORxD
$ echo HELLO WORLD | sed 'y/LL/xy/'
HExxO WORxD

Fundamentally, your problem is that it's impossible to accomplish this task with just transliteration.

Your case illustrates that quite nicely: 15 is really 1 and 5 and sed has no way of distinguishing between the two.
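
To sketch what that means in practice: a multi-character sequence has to be handled by s, which matches strings, before y handles the remaining single characters. The mapping below (15 to o, 1 to a, 5 to e) is invented purely for illustration, and it only works because the replacement o is not itself a source character of the later y command.

```shell
# Hypothetical mapping: "15" -> "o" first, then "1" -> "a" and "5" -> "e".
translit_sed() {
    sed -e 's/15/o/g' -e 'y/15/ae/'
}

printf '15 1 5\n' | translit_sed   # prints: o a e
```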

Translating strings to another language using bash

(0) Preliminarily, taking a word in language A and changing each letter (or sometimes letter pair) to the letter (or pair) with the (approximately) same sound in language B, but not changing to a word in language B, is not translating, it is transliterating. Also your 'table' file is not hashed or a hash; it is just a file containing the desired translations.

(1) Your script doesn't change anything because shell variables are not expanded within single-quotes; in fact nothing at all is given special meaning within single-quotes, as specified by this quite terse item in the bash manual:

Enclosing characters in single quotes (‘'’) preserves the literal value of each character within the quotes. A single quote may not occur between single quotes, even when preceded by a backslash.

Thus you are telling tr to replace $ with $, and g with l, and r with a, and e with i, and k with n. Since your input presumably doesn't contain any of $ g r e k this does nothing.

(2A) If you fix this by using double-quotes which do expand $var (and some other things not relevant here) it still won't work in some cases because tr replaces character by character. Thus if you run tr with first argument xi (one char, see next) and second argument KS (two chars) it will translate any (and all) xi to K and never use the S for anything.

To translate a single character to a string that may be more than one character, consider instead sed or something like awk or perl. Or since you want 'only bash' you can use bash's own string substitution like ${1//$greek/$latin}

(2B) Another possible problem is that many (but decidedly not all) systems with the GNU shell bash also have the GNU coreutils implementation of tr which does not support multi-byte characters i.e. UTF-8. Most 'multi-lingual' (more accurately non-English/non-ASCII) material nowadays is encoded in UTF-8. There is however an ISO-8859 single-octet code, variant -7, for Greek and if your input (script and data) is in 8859-7 or can be converted to that, then GNU tr could be usable except for multi-character cases.

(3) You don't need the multiple cut processes to parse your input lines; shell read can do it:

while IFS=, read -r flag greek latin0 latin1; do
  echo "${1//$greek/$latin0}" >>output
  if [ "$flag" = "1" ]; then echo "${1//$greek/$latin1}" >>output; fi
done <translationsfile

(4) echo can malfunction for some data, although that data is probably unlikely for your use case. The safer and more portable method is printf.

(5) You don't really need the flag column to tell you when the 'latin1' column exists, you could just test for (the value of) $latin1 being nonempty.

(6) Your logic creates a separate translation, or maybe two, for each letter. If the input name has e.g. 5 letters with none repeated, you will create 5 translations each with only one letter changed from Greek to Latin and another 20 or whatever it is (I didn't count) with no change at all. I have fairly often seen people use names with all letters transliterated to a different language that is presumably more convenient for at least some people, but a name with some letters in one language and one letter in another language seems to me to be inconvenient for everybody and thus useless. I would start from the input name and transliterate all the letters -- either all the ones in the value (perhaps with an actual hash table, which can be implemented in recent bash with an associative array) or all possible ones. I leave this so you can still do some of the work on your assignment.

(7) Last and least important, you never need to specify $PWD as the starting path for a file, since relative pathnames automatically start in the working directory; that's what 'working directory' means. If you want to emphasize that it is relative, a common convention is to start with ./relative/path/to/whatever which is technically still redundant but is a visible reminder.
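
Points (2A), (3), (4) and (5) can be combined into a short bash sketch. It assumes a table without the flag column, in the hypothetical form greek,latin0[,latin1], read from standard input; the function name variants is also made up. Like the original logic it still emits one line per table entry (the issue raised in point 6), so it illustrates the mechanics only.

```shell
#!/bin/bash
# variants NAME < table
# Hypothetical table format: greek,latin0[,latin1] (no flag column).
variants() {
    local name=$1 greek latin0 latin1
    while IFS=, read -r greek latin0 latin1; do
        # bash substitution can replace one character with a longer string
        printf '%s\n' "${name//"$greek"/$latin0}"
        # An empty third column means there is no second spelling
        if [ -n "$latin1" ]; then
            printf '%s\n' "${name//"$greek"/$latin1}"
        fi
    done
}

printf 'ξ,ks,x\n' | variants 'ταξί'   # prints ταksί, then ταxί
```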

Character Translation using Python (like the tr command)

See string.translate (Python 2):

import string
"abc".translate(string.maketrans("abc", "def")) # => "def"

Note the doc's comments about subtleties in the translation of unicode strings.

And for Python 3, you can use the str methods directly:

"abc".translate(str.maketrans("abc", "def")) # => "def"

Edit: Since tr is a bit more advanced, also consider using re.sub.

how to loop through string for patterns from linux shell?

A pipe through tr can split those strings out to separate lines:

grep -hx -- ':[:[:alnum:]]*:' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n'

This will also remove the colons and an empty line will be present in the output (easy to repair, note the empty line will always be the first one due to the leading :). Add sort -u to sort and remove duplicates, or awk '!seen[$0]++' to remove duplicates without sorting.
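
The difference between the two de-duplication options, sketched on made-up input (the tag names are placeholders, and sed '1d' is just one assumed way to drop the leading empty line):

```shell
# Sample line standing in for a matching line from the .mkd files.
tags() { printf ':beta:alpha:beta:\n' | tr -s ':' '\n' | sed '1d'; }

tags | sort -u             # alpha, beta (sorted)
tags | awk '!seen[$0]++'   # beta, alpha (first-seen order kept)
```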

An approach with sed:

sed '/^:/!d;s///;/:$/!d;s///;y/:/\n/' ~/Documents/wiki{,/diary}/*.mkd

This also removes colons, but avoids adding empty lines (by removing the leading/trailing : with s before using y to transliterate remaining : to <newline>). sed could be combined with tr:

sed '/:$/!d;/^:/!d;s///' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n'

Using awk to work with the : separated fields, removing duplicates:

awk -F: '/^:/ && /:$/ {for (i=2; i<NF; ++i) if (!seen[$i]++) print $i}' \
~/Documents/wiki{,/diary}/*.mkd

Script to convert ASCII chars to Uxxx unicode notation

Every char with file input

If you wanted to convert every character of a file to the unicode representation, then it would be this simple one-liner

while IFS= read -r -n1 c;do printf "<U%04X>" "'$c"; done < ./infile

Every char on STDIN

If you want to make a unix-like tool which converts input on STDIN to unicode-like output, then use this:

uni(){ c=$(cat); for((i=0;i<${#c};i++)); do printf "<U%04X>" "'${c:i:1}"; done; }

Proof of Concept

$ echo "abc" | uni
<U0061><U0062><U0063>

Only chars between double-quotes

#!/bin/bash

flag=0
while IFS= read -r -n1 c; do
    if [[ "$c" == '"' ]]; then
        ((flag^=1))
        printf "%c" "$c"
    elif [[ "$c" == $'\0' ]]; then
        echo
    elif ((flag)); then
        printf "<U%04X>" "'$c"
    else
        printf "%c" "$c"
    fi
done < /path/to/infile

Proof of Concept

$ cat ./unime
LC_TIME
d_t_fmt "%a %d %b %Y %T %Z"
d_fmt   "%d-%m-%Y"
t_fmt   "%T"
abday "Dom";"Seg";/
here is a string with "multiline
quotes";/

$ ./uni.sh
LC_TIME
d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
d_fmt   "<U0025><U0064><U002D><U0025><U006D><U002D><U0025><U0059>"
t_fmt   "<U0025><U0054>"
abday "<U0044><U006F><U006D>";"<U0053><U0065><U0067>";/
here is a string with "<U006D><U0075><U006C><U0074><U0069><U006C><U0069><U006E><U0065>
<U0071><U0075><U006F><U0074><U0065><U0073>";/

Explanation

Pretty simple, really:

while IFS= read -r -n1 c;: Iterate over the input one character at a time (via -n1) and store the char in the variable c. The IFS= and -r flags are there so that the read builtin doesn't try to do word splitting or interpret escape sequences, respectively.
if [[ "$c" == '"' ]];: If the current char is a double-quote
((flag^=1)): Invert the value of flag from 0->1 or 1->0
elif [[ "$c" == $'\0' ]];: In bash, $'\0' expands to the empty string, so this branch fires when read returns an empty c, which happens at each newline (read's delimiter); echo then puts the newline back
elif ((flag)): If flag is 1, then perform unicode transliteration
printf "<U%04X>" "'$c": The magic that does the unicode transliteration. Note that the single-quote before the $c is mandatory: it tells printf to use the numeric code of the character that follows as the argument value.
else printf "%c" "$c": Print out the character with no unicode transliteration performed
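
The leading-quote behavior of printf can be checked in isolation. This is a minimal sketch with a made-up helper name, codepoint; for non-ASCII characters the result depends on the shell and locale, so only ASCII is shown.

```shell
# Print the <Uxxxx> form of the single character in $1.
codepoint() {
    printf '<U%04X>\n' "'$1"
}

codepoint a     # prints <U0061>
codepoint '%'   # prints <U0025>
```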

Romanize generic Japanese in commandline

Looking at the HTML source of the http://nihongo.j-talk.com site, I've made a guess at its API.

Here are the steps:

1) Send a Japanese string to the server by wget and obtain the result in index.html.

2) Parse the index.html and extract Romaji strings.

Here is the sample code:

#!/bin/bash

string="日本語は、主に日本で使われている言語である。日本では法規によって「公用語」として規定されているわけではないが、各種法令（裁判所法第74条、会社計算規則第57条、特許法施行規則第2条など）において日本語を用いることが定められるなど事実上の公用語となっており、学校教育の「国語」でも教えられる。"

uniqid="46a7e5f7e7c7d8a7d9636ecb077da485479b66bc"

wget -N --post-data "uniqid=$uniqid&Submit='Translate Now'&kanji_parts=standard&kanji=$string&converter=spaced&kana_output=romaji" http://nihongo.j-talk.com/ > /dev/null 2>&1

perl -e ' 
$file = "index.html"; 
open(FH, $file) or die "$file: $!\n";

while (<FH>) {
    if (/<div id=.spaced. class=.romaji.>(.+)/) {
        ($str = $1) =~ s/<.*?>//g;
        $str =~ s/\&\#(\d+);/&utfconv($1)/eg;
        print $str, "\n";
    }
}

# numeric character reference to 2-byte UTF-8 (only valid for U+0080..U+07FF)
sub utfconv {
    $utf16 = shift;
    my $upper = ($utf16 >> 6) & 0b0001_1111 | 0b1100_0000;
    my $lower = $utf16 & 0b0011_1111 | 0b1000_0000;
    pack("C2", $upper, $lower);
}'

Some comments:

- I wrote the parser in Perl simply because I'm familiar with it; you can modify it or rewrite it in another language, as long as it reads the index.html file.

- The uniqid string is one I picked out of the site's HTML source. If it doesn't work, check which value is embedded in the HTML source.

Hope this helps.

Run command with space characters in bash script

The shell works fine with commands split over several lines, and using variables makes the code readable (and avoids horizontal scroll bars here):

while read -r line
do
    for i in 80 35 200
    do
        epsfile="Cards/$line"
        pngbase=$(basename "$line" .eps | tr ' A-Z' '_a-z')
        pngfile="../img/card/${pngbase}_${i}.png"
        convert "$epsfile" -size ${i}x${i} "$pngfile"
    done
done < card_list.txt

If you have to deal with filenames that contain spaces, you need to enclose the names in double quotes when passing them to commands. The complex code with all the file manipulation scrunched into a single line is near-enough impossible to read. You can do the whole of the transliteration in a single command, as shown above, which also simplifies things.
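
The name transformation can be tried on its own. In this sketch the helper png_name and the sample file name are invented; the tr call is the one from the loop above.

```shell
# Build the output path for one card file ($1) and one size ($2).
png_name() {
    # Spaces become underscores and upper case becomes lower case in one tr call
    pngbase=$(basename "$1" .eps | tr ' A-Z' '_a-z')
    printf '%s\n' "../img/card/${pngbase}_$2.png"
}

png_name 'Ace Of Spades.eps' 80   # prints ../img/card/ace_of_spades_80.png
```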

Even though 80 characters isn't a hard limit, it is worth keeping it as a rough limit, because if the line is much longer, it probably isn't readable, and code must be readable to be maintainable.

Bash: Convert non-ASCII characters to ASCII

Depending on your machine you can try piping your strings through

iconv -f utf-8 -t ascii//translit

(or whatever your encoding is, if it's not utf-8)
