Converting Named HTML Entities to Numeric HTML Entities

Java - convert named html entities to numbered xml entities

Have you tried with JTidy?

private String cleanData(String data) throws UnsupportedEncodingException {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setPrintBodyOnly(true); // only print the content
    tidy.setXmlOut(true); // to XML
    tidy.setSmartIndent(true); 
    ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    tidy.parseDOM(inputStream, outputStream);
    return outputStream.toString("UTF-8");
}

Note that JTidy will probably also repair any malformed HTML it finds in the input.

How do I convert named HTML entities to numeric HTML entities in javascript?

php.js can be useful for this:

http://phpjs.org/functions/html_entity_decode:424

http://phpjs.org/functions/htmlspecialchars_decode:427

http://phpjs.org/functions/htmlentities:425
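The underlying idea is the same in any language: look each entity name up in a table and emit the code point as a decimal reference. Below is a minimal sketch in Python rather than JavaScript, assuming the HTML 4 names in the standard library's html.entities.name2codepoint table are sufficient. The function name named_to_numeric is hypothetical.

```python
import re
from html.entities import name2codepoint

# Matches named references such as &eacute; but not numeric ones like &#233;
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Replace each known named entity with its decimal numeric reference."""
    def replace(match):
        codepoint = name2codepoint.get(match.group(1))
        # Leave unknown names untouched rather than guessing
        return match.group(0) if codepoint is None else f"&#{codepoint};"
    return _NAMED_REF.sub(replace, text)


print(named_to_numeric("caf&eacute; &amp; cr&egrave;me &bogus;"))
# prints: caf&#233; &#38; cr&#232;me &bogus;
```

Existing numeric references pass through unchanged because the pattern requires a letter directly after the ampersand.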

Decoding numeric html entities via PHP

html_entity_decode already does what you're looking for:

$string = '&#146;';

echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');

It will return the character:

U+0092   UTF-8 bytes in hex: c2 92

This is PRIVATE USE TWO (U+0092), a C1 control character. Because it is a private-use control character, your PHP configuration, version, or build might not return it at all.

Also there are some more quirks:

But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.

See: &#146; is getting converted as “\u0092” by nokogiri in ruby on rails
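The difference between the two interpretations of &#146; can be checked with a short sketch. It uses Python rather than PHP, and relies on the standard library's html.unescape, which follows the HTML5 remapping for this range.

```python
import html

# U+0092 itself encodes to the two bytes c2 92 in UTF-8
assert "\u0092".encode("utf-8") == b"\xc2\x92"

# Byte 0x92 in Windows-1252 is the right single quotation mark, U+2019
cp1252_char = bytes([0x92]).decode("cp1252")

# html.unescape applies the HTML5 rule, so &#146; becomes U+2019, not U+0092
decoded = html.unescape("&#146;")

print(hex(ord(cp1252_char)), hex(ord(decoded)))
# prints: 0x2019 0x2019
```

A decoder that follows XML rules instead would return U+0092 for the same input, which is the behaviour described above for PHP.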

PHP How to encode text to numeric entity?

Your converter is converting your LaTeX into MathML, not HTML entities. You need something that converts directly into HTML character references, or a MathML to HTML character reference converter.

You should be able to use htmlentities:

htmlentities($symbolsToEncode, ENT_XML1, 'UTF-8');

http://pt1.php.net/htmlentities

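You can also add ENT_SUBSTITUTE (for example ENT_XML1 | ENT_SUBSTITUTE). Invalid code unit sequences are then replaced with the Unicode replacement character U+FFFD (or &#xFFFD;) instead of making the function return an empty string.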

As an alternative, you could use strtr to convert the characters to something you specify:

$chars = array(
    // "\u{...}" needs PHP 7+; "\x8484" would mean byte 0x84 followed by the text "84"
    "\u{8484}" => "&#x8484;"
    ...
);

$convertedXML = strtr($xml, $chars);

http://php.net/strtr
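For comparison, here is a sketch of the same character-to-numeric-reference idea in Python, where no lookup table is needed: the built-in "xmlcharrefreplace" error handler turns every non-ASCII character into a decimal reference. The helper name to_numeric_refs is hypothetical.

```python
def to_numeric_refs(text):
    """Replace every non-ASCII character with a decimal character reference."""
    # ASCII characters (including &, < and >) are left as they are
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_numeric_refs("x \u2208 \u8484"))
# prints: x &#8712; &#33924;
```

This does not escape markup characters such as & or <, so it is only a replacement for the non-ASCII part of the job.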

Someone has done something similar on GitHub.

Convert HTML entities in plain text to characters

To decode HTML entities like the one in your example, you could use the following code.

import html

html_encoded = 'Motorists could be charged for every mile they drive to raise &euro;35bn'
html_decoded = html.unescape(html_encoded)  # &euro; becomes €
print(html_decoded)

Convert HTML entities in Json back to characters

Here is the solution. I needed to:

convert &amp; to & to standardize the encoding (this undoes double encoding);
convert all HTML entities to their corresponding characters.

Here is the final code. Many thanks to all for your comments and suggestions.

Full code and online test here: https://www.tehplayground.com/zythX4MUdF3ric4l

array_walk_recursive($data, function(&$item, $key) {
    if(is_string($item)) {
        $item = str_replace("&amp;", "&", $item); // 1. Replace &amp; with &
        $item = html_entity_decode($item); // 2. Convert HTML entities to their corresponding characters
    }
});
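The same two-step approach can be sketched in Python for comparison. This assumes the JSON has already been parsed into nested dicts and lists. The function name decode_entities and the sample data are hypothetical.

```python
import html
import json


def decode_entities(value):
    """Recursively decode HTML entities in every string of a parsed JSON value."""
    if isinstance(value, str):
        # 1. Replace &amp; with & to undo double encoding, 2. decode the entities
        return html.unescape(value.replace("&amp;", "&"))
    if isinstance(value, list):
        return [decode_entities(item) for item in value]
    if isinstance(value, dict):
        return {key: decode_entities(item) for key, item in value.items()}
    # Numbers, booleans and None are returned unchanged
    return value


data = json.loads('{"title": "Caf&amp;eacute;", "tags": ["a &lt; b", 3]}')
print(decode_entities(data))
# prints: {'title': 'Café', 'tags': ['a < b', 3]}
```

Unlike array_walk_recursive, this version returns a new structure instead of modifying the input in place.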