Java - convert named html entities to numbered xml entities
Have you tried with JTidy?
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import org.w3c.tidy.Tidy;

private String cleanData(String data) throws UnsupportedEncodingException {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setPrintBodyOnly(true); // only print the content of <body>
    tidy.setXmlOut(true);        // emit well-formed XML
    tidy.setSmartIndent(true);
    // Feed the string to JTidy as UTF-8 bytes and collect the cleaned output.
    ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    tidy.parseDOM(inputStream, outputStream);
    return outputStream.toString("UTF-8");
}
Note that JTidy will probably also repair your HTML if it finds anything malformed in it.
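If pulling in a full HTML cleaner is more than you need, the conversion itself can be sketched as a lookup plus a regular expression. The following is a minimal illustration in Python, not the JTidy approach above: the function name named_to_numeric is made up for this example, and it assumes the entity table in the standard html.entities module (HTML 4 names) covers your input. The entities XML predefines are deliberately left alone.

```python
import re
from html.entities import name2codepoint

# Entities that XML itself predefines; these stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named reference such as &eacute; (numeric ones start with '#').
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text: str) -> str:
    """Rewrite named HTML entities as decimal numeric references."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # unknown or XML-safe name: keep unchanged
        return f"&#{name2codepoint[name]};"

    return _NAMED_ENTITY.sub(replace, text)


print(named_to_numeric("caf&eacute; &amp; cr&egrave;me&nbsp;"))
# caf&#233; &amp; cr&#232;me&#160;
```

Unknown names are passed through rather than dropped, so nothing is lost if the input uses an entity the table does not know.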
How do I convert named HTML entities to numeric HTML entities in javascript?
php.js can be useful for this:
http://phpjs.org/functions/html_entity_decode:424
http://phpjs.org/functions/htmlspecialchars_decode:427
http://phpjs.org/functions/htmlentities:425
Decoding numeric html entities via PHP
html_entity_decode already does what you're looking for:
$string = '&#146;';
echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');
It will return the character:
’ binary hex: c292
That is PRIVATE USE TWO (U+0092), a C1 control character. Because it is reserved for private use, your PHP configuration, version, or build might not return it at all.
Also there are some more quirks:
But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.
See: &#146; is getting converted as “\u0092” by nokogiri in ruby on rails
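The quirk is easy to observe outside PHP as well. As a small illustration in Python (chosen only for the demo; the variable names are arbitrary), html.unescape follows the HTML5 rules and so applies the cp1252 remapping, while taking the code point literally gives the control character discussed above:

```python
import html

ref = "&#146;"

# HTML5 rules: 146 is treated as byte 0x92 in Windows-1252.
html5_char = html.unescape(ref)

# XML rules: 146 is simply code point U+0092.
xml_char = chr(146)

# What byte 0x92 means in cp1252.
cp1252_char = bytes([0x92]).decode("cp1252")

print(f"HTML5: U+{ord(html5_char):04X}")                      # HTML5: U+2019
print(f"XML:   U+{ord(xml_char):04X}",
      xml_char.encode("utf-8").hex())                         # XML:   U+0092 c292
print(html5_char == cp1252_char)                              # True
```

So the right single quotation mark (U+2019) is what a browser shows, and c292 is the UTF-8 encoding of the literal U+0092.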
PHP How to encode text to numeric entity?
Your converter is converting your LaTeX into MathML, not HTML entities. You need something that converts directly into HTML character references, or a MathML to HTML character reference converter.
You should be able to use htmlentities:
htmlentities($symbolsToEncode, ENT_XML1, 'UTF-8');
http://pt1.php.net/htmlentities
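You can change ENT_XML1 to ENT_SUBSTITUTE, which replaces invalid code unit sequences with the Unicode replacement character U+FFFD (for UTF-8) or the hex character reference &#xFFFD; (for other encodings) instead of returning an empty string.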
As an alternative, you could use strtr to convert the characters to something you specify:
$chars = array(
    "\u{8484}" => "&#x8484;" // PHP 7+ code point escape; "\x8484" would be byte 0x84 followed by "84"
    ...
);
$convertedXML = strtr($xml, $chars);
http://php.net/strtr
Someone has done something similar on GitHub.
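For comparison, here is how the same "encode everything non-ASCII as a numeric reference" idea can be sketched in Python. This is an illustration only, not a port of the PHP above: the helper name to_numeric_refs and its hexadecimal parameter are invented for the example. It relies on the built-in xmlcharrefreplace error handler for the decimal form, and it does not escape &, < or >.

```python
def to_numeric_refs(text: str, hexadecimal: bool = False) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    if not hexadecimal:
        # The codec error handler emits decimal references such as &#8712;
        return text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    # Hex form is built by hand, one character at a time.
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


print(to_numeric_refs("x ∈ 蒄"))                    # x &#8712; &#33924;
print(to_numeric_refs("x ∈ 蒄", hexadecimal=True))  # x &#x2208; &#x8484;
```

Unlike a hand-written strtr table, this needs no list of characters up front, at the cost of encoding every non-ASCII character rather than a chosen subset.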
Convert HTML entities in plain text to characters
To decode HTML entities like the one in your example, you could use the following code.
import html

html_encoded = 'Motorists could be charged for every mile they drive to raise &euro;35bn'
html_decoded = html.unescape(html_encoded)
print(html_decoded)
Convert HTML entities in Json back to characters
Here is the solution. I needed to
- convert &amp; to & to undo the double encoding and standardize the encoding;
- convert all HTML entities to their corresponding characters.
Here is the final code. Many thanks to everyone for the comments and suggestions.
Full code and online test here: https://www.tehplayground.com/zythX4MUdF3ric4l
array_walk_recursive($data, function (&$item, $key) {
    if (is_string($item)) {
        $item = str_replace("&amp;", "&", $item); // 1. Replace &amp; by &
        $item = html_entity_decode($item);        // 2. Convert HTML entities to their corresponding characters
    }
});
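The same two-step walk over decoded JSON can be sketched in Python for readers working outside PHP. This is a rough equivalent under one assumption: that the first step is meant to undo double-encoded ampersands (&amp;eacute; becoming &eacute;). The function name decode_entities and the sample data are invented for the example.

```python
import html


def decode_entities(value):
    """Recursively decode HTML entities in every string of a JSON-like value."""
    if isinstance(value, str):
        # 1. Undo double encoding, 2. turn entities into characters.
        return html.unescape(value.replace("&amp;", "&"))
    if isinstance(value, list):
        return [decode_entities(item) for item in value]
    if isinstance(value, dict):
        return {key: decode_entities(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through untouched


data = {"title": "Caf&amp;eacute;", "tags": ["Tom &amp;amp; Jerry", 3]}
print(decode_entities(data))
# {'title': 'Café', 'tags': ['Tom & Jerry', 3]}
```

It returns a new structure instead of modifying the input in place, which is the main difference from array_walk_recursive.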