Transliteration script for linux shell
Not an answer, just to show a briefer, idiomatic way to populate the table[]
array from @konsolebox's answer as discussed in the related comments:
BEGIN {
    split("a e b", old)
    split("x ch o", new)
    for (i in old)
        table[old[i]] = new[i]
    FS = OFS = ""
}
This clearly shows the mapping of old to new chars: each char in the first split() is mapped to the char(s) directly below it, and for any other mapping you want, you just change the string(s) in the split() calls rather than editing 26-ish explicit assignments to table[].
You can even create a general script to do mappings and just pass in the old and new strings as variables:
BEGIN {
    split(o, old)
    split(n, new)
    for (i in old)
        table[old[i]] = new[i]
    FS = OFS = ""
}
then in shell anything like this:
old="a e b"
new="x ch o"
awk -v o="$old" -v n="$new" -f script.awk file
and you can protect yourself from your own mistakes populating the strings, e.g.:
BEGIN {
    numOld = split(o, old)
    numNew = split(n, new)
    if (numOld != numNew) {
        printf "ERROR: #old vals (%d) != #new vals (%d)\n", numOld, numNew | "cat>&2"
        exit 1
    }
    for (i=1; i <= numOld; i++) {
        if (old[i] in table) {
            printf "ERROR: \"%s\" duplicated at position %d in old string\n", old[i], i | "cat>&2"
            exit 1
        }
        if (newvals[new[i]]++) {
            printf "WARNING: \"%s\" duplicated at position %d in new string\n", new[i], i | "cat>&2"
        }
        table[old[i]] = new[i]
    }
}
Wouldn't it be good to know if you wrote that b maps to x and then later mistakenly wrote that b maps to y? The above really is the best way to do this but your call of course.
Here's one complete solution, as discussed in the comments below:
BEGIN {
    numOld = split("a e b", old)
    numNew = split("x ch o", new)
    if (numOld != numNew) {
        printf "ERROR: #old vals (%d) != #new vals (%d)\n", numOld, numNew | "cat>&2"
        exit 1
    }
    for (i=1; i <= numOld; i++) {
        if (old[i] in map) {
            printf "ERROR: \"%s\" duplicated at position %d in old string\n", old[i], i | "cat>&2"
            exit 1
        }
        if (newvals[new[i]]++) {
            printf "WARNING: \"%s\" duplicated at position %d in new string\n", new[i], i | "cat>&2"
        }
        map[old[i]] = new[i]
    }
    FS = OFS = ""
}
{
    for (i = 1; i <= NF; ++i) {
        if ($i in map) {
            $i = map[$i]
        }
    }
    print
}
I renamed the table array to map just because IMHO that better represents the purpose of the array.
Save the above in a file script.awk and run it as awk -f script.awk inputfile.
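As a quick sanity check (assuming GNU awk, since the empty FS splitting each record into one-character fields is a gawk extension), the complete script above maps each a to x, e to ch, and b to o:

```shell
# Write the complete solution to script.awk (validation omitted for brevity),
# then run it on some sample input.
cat > script.awk <<'EOF'
BEGIN {
    split("a e b", old)
    split("x ch o", new)
    for (i in old)
        map[old[i]] = new[i]
    FS = OFS = ""
}
{
    for (i = 1; i <= NF; ++i)
        if ($i in map)
            $i = map[$i]
    print
}
EOF
echo "bab sea" | awk -f script.awk    # -> oxo schx
```

Note how a single input char (e) can expand to a multi-char replacement (ch), which tr alone cannot do.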
Transliteration in sed
I think you misunderstand the description. What y does is: whenever any character from the left-hand side occurs in the input, replace it with the corresponding character from the right-hand side. Specifying one character multiple times doesn't really make sense, and I'm not sure the behavior of sed is defined in this case, although your version apparently takes the first occurrence and uses that.
To illustrate:
$ echo HELLO WORLD | sed 'y/L/x/'
HExxO WORxD
$ echo HELLO WORLD | sed 'y/LL/xy/'
HExxO WORxD
Fundamentally, your problem is that it's impossible to accomplish this task with just transliteration. Your case illustrates that quite nicely: 15 is really 1 and 5, and sed has no way of distinguishing between the two.
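To make that concrete, here is a sketch showing that y/ acts on each character independently, while an s/ substitution can first consume the two-character string 15 as a unit:

```shell
# y/ transliterates character by character, so "15" is just "1" then "5":
echo "15 1 5" | sed 'y/15/AB/'             # -> AB A B
# s/ can match the two-character "15" before y/ handles the leftovers:
echo "15 1 5" | sed 's/15/C/g; y/15/AB/'   # -> C A B
```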
Translating strings to another language using bash
(0) Preliminarily, taking a word in language A and changing each letter (or sometimes letter pair) to the letter (or pair) with the (approximately) same sound in language B, but not changing to a word in language B, is not translating, it is transliterating. Also your 'table' file is not hashed or a hash; it is just a file containing the desired translations.
(1) Your script doesn't change anything because shell variables are not expanded within single-quotes; in fact nothing at all is given special meaning within single-quotes, as specified by this quite terse item in the bash manual:
Enclosing characters in single quotes (‘'’) preserves the literal value of each character within the quotes. A single quote may not occur between single quotes, even when preceded by a backslash.
Thus you are telling tr to replace $ with $, g with l, r with a, e with i, and k with n. Since your input presumably doesn't contain any of $ g r e k, this does nothing.
(2A) If you fix this by using double-quotes, which do expand $var (and some other things not relevant here), it still won't work in some cases because tr replaces character by character. Thus if you run tr with first argument xi (one char, see next) and second argument KS (two chars), it will translate any (and all) xi to K and never use the S for anything. To translate a single character to a string that may be more than one character, consider instead sed or something like awk or perl. Or since you want 'only bash' you can use bash's own string substitution like ${1//$greek/$latin}.
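A minimal sketch of that substitution, using a made-up two-character mapping (xi -> KS) of the kind tr could not handle:

```shell
# ${var//pat/repl} is a bash-ism, not POSIX sh, hence the explicit bash -c.
bash -c 'greek="xi"; latin="KS"; word="taxi"; echo "${word//$greek/$latin}"'
# -> taKS
```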
(2B) Another possible problem is that many (but decidedly not all) systems with the GNU shell bash also have the GNU coreutils implementation of tr, which does not support multi-byte characters, i.e. UTF-8. Most 'multi-lingual' (more accurately non-English/non-ASCII) material nowadays is encoded in UTF-8. There is however an ISO-8859 single-octet code, variant -7, for Greek, and if your input (script and data) is in 8859-7 or can be converted to that, then GNU tr could be usable except for multi-character cases.
(3) You don't need the multiple cut processes to parse your input lines; shell read can do it:
while IFS=, read flag greek latin0 latin1; do
    echo "${1//$greek/$latin0}" >>output
    if [ "$flag" == "1" ]; then echo "${1//$greek/$latin1}" >>output; fi
done <translationsfile
(4) echo can malfunction for some data, although that data is probably unlikely for your use case. The safer and more portable method is printf.
(5) You don't really need the flag column to tell you when the 'latin1' column exists; you could just test for (the value of) $latin1 being nonempty.
(6) Your logic creates a separate translation, or maybe two, for each letter. If the input name has e.g. 5 letters with none repeated, you will create 5 translations each with only one letter changed from Greek to Latin and another 20 or whatever it is (I didn't count) with no change at all. I have fairly often seen people use names with all letters transliterated to a different language that is presumably more convenient for at least some people, but a name with some letters in one language and one letter in another language seems to me to be inconvenient for everybody and thus useless. I would start from the input name and transliterate all the letters -- either all the ones in the value (perhaps with an actual hash table, which can be implemented in recent bash with an associative array) or all possible ones. I leave this so you can still do some of the work on your assignment.
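As a sketch of that last suggestion, an actual hash table via a bash 4+ associative array, transliterating every letter of the input in one pass. A toy ASCII mapping (borrowed from the awk answers above) is used here to stay locale-independent; in a UTF-8 locale the same loop works with Greek keys like [γ]=g:

```shell
#!/bin/bash
# Hypothetical per-character mapping; replace with the real Greek table.
declare -A map=( [a]=x [e]=ch [b]=o )
word="bea"
out=""
for ((i = 0; i < ${#word}; i++)); do
    c=${word:i:1}
    out+=${map[$c]:-$c}   # keep the char as-is if it has no mapping
done
printf '%s\n' "$out"      # -> ochx
```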
(7) Last and least important, you never need to specify $PWD as the starting path for a file, since relative pathnames automatically start in the working directory; that's what 'working directory' means. If you want to emphasize that it is relative, a common convention is to start with ./relative/path/to/whatever, which is technically still redundant but is a visible reminder.
Character Translation using Python (like the tr command)
See string.translate
import string
"abc".translate(string.maketrans("abc", "def")) # => "def"
Note the doc's comments about subtleties in the translation of unicode strings.
And for Python 3, you can use directly:
"abc".translate(str.maketrans("abc", "def")) # => "def"
Edit: Since tr is a bit more advanced, also consider using re.sub.
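A quick check of the Python 3 form from the shell (assuming a python3 interpreter is on PATH):

```shell
python3 -c 'print("abc".translate(str.maketrans("abc", "def")))'   # -> def
```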
how to loop through string for patterns from linux shell?
A pipe through tr can split those strings out to separate lines:
grep -hx -- ':[:[:alnum:]]*:' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n'
This will also remove the colons, and an empty line will be present in the output (easy to repair; note the empty line will always be the first one, due to the leading :). Add sort -u to sort and remove duplicates, or awk '!seen[$0]++' to remove duplicates without sorting.
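A quick illustration of the unsorted-dedup idiom: seen[] counts occurrences, so only a line's first occurrence (when the count is still zero) passes the !seen[$0]++ test:

```shell
printf '%s\n' b a b c a | awk '!seen[$0]++'   # -> b, a, c (one per line)
```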
An approach with sed:
sed '/^:/!d;s///;/:$/!d;s///;y/:/\n/' ~/Documents/wiki{,/diary}/*.mkd
This also removes colons, but avoids adding empty lines (by removing the leading/trailing : with s before using y to transliterate the remaining : to newlines). sed could also be combined with tr:
sed '/:$/!d;/^:/!d;s///' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n'
Using awk to work with the :-separated fields, removing duplicates:
awk -F: '/^:/ && /:$/ {for (i=2; i<NF; ++i) if (!seen[$i]++) print $i}' \
~/Documents/wiki{,/diary}/*.mkd
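A sketch of what that awk program does on a few sample lines (the tag names here are made up). With FS=: a line like :tag1:tag2: splits into empty first and last fields, so the loop from 2 to NF-1 visits exactly the tag names:

```shell
printf '%s\n' ':tag1:tag2:' 'plain text line' ':tag1:tag3:' |
    awk -F: '/^:/ && /:$/ {for (i=2; i<NF; ++i) if (!seen[$i]++) print $i}'
# -> tag1, tag2, tag3 (one per line; the repeated tag1 is dropped)
```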
Script to convert ASCII chars to Uxxx unicode notation
Every char with file input
If you wanted to convert every character of a file to its unicode representation, then it would be this simple one-liner:
while IFS= read -r -n1 c;do printf "<U%04X>" "'$c"; done < ./infile
Every char on STDIN
If you want to make a unix-like tool which converts input on STDIN to unicode-like output, then use this:
uni(){ c=$(cat); for((i=0;i<${#c};i++)); do printf "<U%04X>" "'${c:i:1}"; done; }
Proof of Concept
$ echo "abc" | uni
<U0061><U0062><U0063>
Only chars between double-quotes
#!/bin/bash
flag=0
while IFS= read -r -n1 c; do
if [[ "$c" == '"' ]]; then
((flag^=1))
printf "%c" "$c"
elif [[ "$c" == $'\0' ]]; then
echo
elif ((flag)); then
printf "<U%04X>" "'$c"
else
printf "%c" "$c"
fi
done < /path/to/infile
Proof of Concept
$ cat ./unime
LC_TIME
d_t_fmt "%a %d %b %Y %T %Z"
d_fmt "%d-%m-%Y"
t_fmt "%T"
abday "Dom";"Seg";/
here is a string with "multiline
quotes";/
$ ./uni.sh
LC_TIME
d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
d_fmt "<U0025><U0064><U002D><U0025><U006D><U002D><U0025><U0059>"
t_fmt "<U0025><U0054>"
abday "<U0044><U006F><U006D>";"<U0053><U0065><U0067>";/
here is a string with "<U006D><U0075><U006C><U0074><U0069><U006C><U0069><U006E><U0065>
<U0071><U0075><U006F><U0074><U0065><U0073>";/
Explanation
Pretty simple really:
- while IFS= read -r -n1 c : Iterate over the input one character at a time (via -n1) and store the char in the variable c. The IFS= and -r flags are there so that the read builtin doesn't try to do word splitting or interpret escape sequences, respectively.
- if [[ "$c" == '"' ]] : If the current char is a double-quote...
- ((flag^=1)) : ...invert the value of flag from 0->1 or 1->0.
- elif [[ "$c" == $'\0' ]] : If the current char is a NUL, then echo a newline.
- elif ((flag)) : If flag is 1, then perform unicode transliteration.
- printf "<U%04X>" "'$c" : The magic that does the unicode transliteration. Note that the single-quote before the $c is mandatory, as it tells printf that we are giving it the ASCII representation of a number.
- else printf "%c" "$c" : Print out the character with no unicode transliteration performed.
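The leading single-quote trick in that printf can be checked on its own; this is standard POSIX printf behavior, where a numeric argument of the form 'c yields the character code of c:

```shell
printf '<U%04X>\n' "'A"   # -> <U0041>  (ASCII 65)
printf '<U%04X>\n' "'a"   # -> <U0061>  (ASCII 97)
```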
Romanize generic Japanese in commandline
Taking a look at the source HTML of the http://nihongo.j-talk.com site, I've made a guess at the API.
Here are the steps:
1) Send a Japanese string to the server by wget and obtain the result in index.html.
2) Parse the index.html and extract Romaji strings.
Here is the sample code:
#!/bin/bash
string="日本語は、主に日本で使われている言語である。日本では法規によって「公用語」として規定されているわけではないが、各種法令(裁判所法第74条、会社計算規則第57条、特許法施行規則第2条など)において日本語を用いることが定められるなど事実>上の公用語となっており、学校教育の「国語」でも教えられる。"
uniqid="46a7e5f7e7c7d8a7d9636ecb077da485479b66bc"
wget -N --post-data "uniqid=$uniqid&Submit='Translate Now'&kanji_parts=standard&kanji=$string&converter=spaced&kana_output=romaji" http://nihongo.j-talk.com/ > /dev/null 2>&1
perl -e '
$file = "index.html";
open(FH, $file) or die "$file: $!\n";
while (<FH>) {
    if (/<div id=.spaced. class=.romaji.>(.+)/) {
        ($str = $1) =~ s/<.*?>//g;
        $str =~ s/\&\#(\d+);/&utfconv($1)/eg;
        print $str, "\n";
    }
}
# utf16 to utf8
sub utfconv {
    $utf16 = shift;
    my $upper = ($utf16 >> 6) & 0b0001_1111 | 0b1100_0000;
    my $lower = $utf16 & 0b0011_1111 | 0b1000_0000;
    pack("C2", $upper, $lower);
}'
Some comments:
- I wrote the parser in Perl just because it is rather familiar to me, but you may modify it or convert it to another language by reading the index.html file.
- The uniqid string is what I picked out of the html source of the site. If it doesn't work, check what is embedded in the html source.
Hope this helps.
Run command with space characters in bash script
Shell works fine with lines split, and with variables used to make the code readable (avoiding horizontal scroll bars here)...
while read -r line
do
    for i in 80 35 200
    do
        epsfile="Cards/$line"
        pngbase=$(basename "$line" .eps | tr ' A-Z' '_a-z')
        pngfile="../img/card/${pngbase}_${i}.png"
        convert "$epsfile" -size ${i}x${i} "$pngfile"
    done
done < card_list.txt
If you have to deal with filenames that contain spaces, you need to enclose the names in double quotes when passing them to commands. The complex code with all the file manipulation scrunched into a single line is near-enough impossible to read. You can do the whole of the transliteration in a single command, as shown above, which also simplifies things.
Even though 80 characters isn't a hard limit, it is worth keeping it as a rough limit, because if the line is much longer, it probably isn't readable, and code must be readable to be maintainable.
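The single tr call in the loop above does both jobs at once, since tr pairs up the two sets position by position: the space maps to an underscore, and A-Z maps to a-z:

```shell
# Hypothetical filename, just to show the transliteration:
echo "My Card Name.eps" | tr ' A-Z' '_a-z'   # -> my_card_name.eps
```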
Bash: Convert non-ASCII characters to ASCII
Depending on your machine you can try piping your strings through
iconv -f utf-8 -t ascii//translit
(or whatever your encoding is, if it's not utf-8)