Bash Script to Convert from HTML Entities to Characters

bash: convert html entities to UTF-8, but keep existing UTF-8

perl one-liner:

$ echo 'Arabic &amp; ٱلْعَرَبِيَّة' | perl -CS -MHTML::Entities -ne 'print decode_entities($_)'
Arabic & ٱلْعَرَبِيَّة

Requires the HTML::Entities module, which is part of the larger HTML::Parser bundle. Install through your OS package manager or favorite CPAN client.

How to convert text to html character codes with a bash script?

You can use printf to get the ASCII value of a character by putting a ' in front of it. This will, of course, produce &#62; instead of &gt;. You can use the code below to convert $1 to a string of HTML numeric character codes.

str=$1
for (( i=0; i<${#str}; i++ )); do
  c=${str:$i:1}
  printf "&#%d;" "'$c"  # the leading ' makes printf output the character's numeric code
done
echo ""
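
For comparison, here is a minimal Python sketch of the same idea. It relies on ord(), which returns the Unicode code point of a character, so it should not depend on the shell's locale for non-ASCII input. The function name to_numeric_entities is made up for this illustration.

```python
def to_numeric_entities(text: str) -> str:
    """Encode every character as a decimal numeric character reference."""
    # ord() gives the Unicode code point, e.g. ord(">") == 62
    return "".join(f"&#{ord(ch)};" for ch in text)


if __name__ == "__main__":
    print(to_numeric_entities("a>b"))
```

Like the shell loop, this encodes every character, including plain letters, which is valid HTML but makes the result much longer than the input.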

Short way to escape HTML in Bash?

Escaping HTML really just involves replacing three characters: <, >, and &. For extra points, you can also replace " and '. So, it's not a long sed script:

sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'
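
The order of the substitutions matters: & has to be replaced first, otherwise the ampersands introduced by the other replacements would be escaped a second time. A rough Python sketch of the same five replacements, assuming a hypothetical helper named escape_html:

```python
def escape_html(text: str) -> str:
    """Escape the five characters that are special in HTML text and attributes."""
    # "&" must come first so that the "&" in "&lt;" etc. is not escaped again.
    replacements = [
        ("&", "&amp;"),
        ("<", "&lt;"),
        (">", "&gt;"),
        ('"', "&quot;"),
        ("'", "&#39;"),
    ]
    for raw, entity in replacements:
        text = text.replace(raw, entity)
    return text


if __name__ == "__main__":
    print(escape_html('<a href="x">Tom & Jerry\'s</a>'))
```

Python's standard library offers html.escape() for the same job; it writes the apostrophe as &#x27; rather than &#39;, which is equivalent in HTML.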

Replacing HTML ascii codes via a bash script?


$ echo '&#33;' | recode html/..
!
$ echo '&lt;&#8734;&gt;' | recode html/..
<∞>

Convert HTML entities in plain text to characters

To decode HTML entities like the ones in your example, you could use the following Python code.

import html

html_encoded = 'Motorists could be charged for every mile they drive to raise &euro;35bn'
html_decoded = html.unescape(html_encoded)
print(html_decoded)

How convert html code to char in javascript?

Just remove the prefix ("&#") and suffix (";") and use String.fromCharCode.

function entityToChar(ent){
  return String.fromCharCode(ent.slice(2,-1));
}
console.log(entityToChar("&#97;"));
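
The function above only covers decimal references, and String.fromCharCode works on 16-bit code units, so code points above U+FFFF would not come out right. As a rough sketch of a more general approach, here is a Python version that handles both decimal (&#97;) and hexadecimal (&#x61;) references anywhere in a string. The name decode_numeric_refs is hypothetical, and named entities such as &amp; are deliberately left untouched.

```python
import re

# Matches &#123; (decimal) or &#x7B; (hexadecimal) references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_refs(text: str) -> str:
    """Replace numeric character references with the characters they name."""

    def _replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Outside the Unicode range: keep the reference as written.
            return match.group(0)

    return _NUMERIC_REF.sub(_replace, text)


if __name__ == "__main__":
    print(decode_numeric_refs("&#97; &#x2603; &amp;"))
```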

Windows tool to decode HTML entities in a file

You don't need extensive applications (like JREPL.bat or my own FindRepl.bat) or complicated programs in order to perform a replacement as simple as this one. The small Batch file below is an example that performs a replacement of 3 HTML entities:

@set @a=0 // & cscript //nologo //E:JScript "%~F0" < input.txt & goto :EOF

var rep = new Array();
rep["&copy;"]   = "\u00A9";
rep["&#xD306;"] = "\uD306";
rep["&#x2603;"] = "\u2603";

var f = new ActiveXObject("Scripting.FileSystemObject").CreateTextFile("output.txt", true, true);
f.Write(WScript.Stdin.ReadAll().replace(/&copy;|&#xD306;|&#x2603;/g,function (A) {return rep[A]}));
f.Close();

input.txt:

Foo &copy; bar &#xD306; baz &#x2603; qux

output.txt:

Foo © bar 팆 baz ☃ qux

You only need to add as many character equivalences as you want to convert...
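
The same table-driven technique can be sketched in Python: build one alternation pattern from the keys of a lookup table and replace each match with its table entry. The names REPLACEMENTS and replace_entities are invented for this example, and the table holds only the three entities used above.

```python
import re

# Lookup table: entity text -> replacement character.
REPLACEMENTS = {
    "&copy;": "\u00A9",
    "&#xD306;": "\uD306",
    "&#x2603;": "\u2603",
}

# One alternation of all keys, escaped so they are matched literally.
_PATTERN = re.compile("|".join(re.escape(key) for key in REPLACEMENTS))


def replace_entities(text: str) -> str:
    """Replace every entity listed in REPLACEMENTS; leave everything else as is."""
    return _PATTERN.sub(lambda match: REPLACEMENTS[match.group(0)], text)


if __name__ == "__main__":
    print(replace_entities("Foo &copy; bar &#xD306; baz &#x2603; qux"))
```

Entities missing from the table pass through unchanged, so the table can be extended one entry at a time.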

How can I decode HTML entities?

Take a look at HTML::Entities:

use HTML::Entities;

my $html = "Snoopy &amp; Charlie Brown";

print decode_entities($html), "\n";

You can guess the output.

Any good tool to convert HTML entities in HTML documents to plain UTF characters?

The GNU utility "recode" will do this, with the invocation

recode HTML..UTF-16LE < old.html > new.html

(or UTF-16BE, of course.)

http://ftp.gnu.org/gnu/recode/recode-3.6.tar.gz

Its use of HTML as a character set is a bit of a hack: HTML is treated as either ASCII or Latin-1, when it should be treated as a "surface" for any character set. If the input contains any UTF-8 characters, it can break, so I'm now withdrawing my recommendation. Use the first.

(You might expect recode UTF-8..HTML,HTML..UTF-16LE to work, but this first encodes the ampersands...)