Is There a JDK Class to Do HTML Encoding (But Not URL Encoding)

Is there a JDK class to do HTML encoding (but not URL encoding)?

Apparently, the answer is "No." Unfortunately, this was a case where I had to get something working and couldn't add a new external dependency for it, at least in the short term. I agree with everyone that using Commons Lang is the best long-term solution, and that is what I will go with once I can add a new library to the project.

It's a shame that something of such common use is not in the Java API.
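Until a library can be added, a small hand-written escaper is one possible stopgap. The following is only a minimal sketch: the class and method names (HtmlEscaper, escapeHtml) are made up for illustration and are not a JDK API, and it covers only the five characters that are significant in HTML text and attribute values.

```java
// Minimal HTML escaping with no external dependency.
// HtmlEscaper and escapeHtml are hypothetical names, not part of the JDK.
final class HtmlEscaper {

    // Replaces the five characters that have special meaning in HTML
    // text and quoted attribute values; everything else is copied as-is.
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("<a href=\"x\">Tom & Jerry's</a>"));
    }
}
```

This does not replace a maintained library: it makes no attempt to handle named entities for non-ASCII characters or contexts such as script blocks.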

Get encoded HTML content only from URL in Java

It's never a good idea to parse HTML using regex; that's a recipe for disaster.

So first look at this Q&A for HTML parsing in Java: Java HTML Parsing

Once you are able to parse the HTML and get the inner HTML text, you can encode it in one of these ways: Is there a JDK class to do HTML encoding (but not URL encoding)?

How to encode URL to avoid special characters in Java?

URL construction is tricky because different parts of the URL have different rules for what characters are allowed: for example, the plus sign is reserved in the query component of a URL because it represents a space, but in the path component of the URL, a plus sign has no special meaning and spaces are encoded as "%20".

RFC 2396 explains (in section 2.4.2) that a complete URL is always in its encoded form: you take the strings for the individual components (scheme, authority, path, etc.), encode each according to its own rules, and then combine them into the complete URL string. Trying to build a complete unencoded URL string and then encode it separately leads to subtle bugs, like spaces in the path being incorrectly changed to plus signs (which an RFC-compliant server will interpret as real plus signs, not encoded spaces).

In Java, the correct way to build a URL is with the URI class. Use one of the multi-argument constructors that takes the URL components as separate strings, and it'll escape each component correctly according to that component's rules. The toASCIIString() method gives you a properly-escaped and encoded string that you can send to a server. To decode a URL, construct a URI object using the single-string constructor and then use the accessor methods (such as getPath()) to retrieve the decoded components.
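As a rough illustration of that round trip, the sketch below builds a URI from separate components and then parses the encoded string back. The host example.com, the sample path, and the class name UriRoundTrip are assumptions made up for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; example.com and the path are made-up values.
final class UriRoundTrip {

    // Builds a URI from separate components so the path is escaped
    // according to the rules for paths.
    static URI build(String path) {
        try {
            // Arguments: scheme, host, path, fragment
            return new URI("http", "example.com", path, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws URISyntaxException {
        URI built = build("/docs/a+b c.pdf");
        String wire = built.toASCIIString();
        System.out.println(wire);                 // http://example.com/docs/a+b%20c.pdf

        // Decoding: parse the encoded string with the single-string constructor.
        URI parsed = new URI(wire);
        System.out.println(parsed.getRawPath());  // /docs/a+b%20c.pdf (still encoded)
        System.out.println(parsed.getPath());     // /docs/a+b c.pdf (decoded)
    }
}
```

Note that the plus sign in the path is left alone and comes back as a literal plus, while the space is sent as %20.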

Don't use the URLEncoder class! Despite the name, that class actually does HTML form encoding, not URL encoding. It's not correct to concatenate unencoded strings to make an "unencoded" URL and then pass it through a URLEncoder. Doing so will result in problems (particularly the aforementioned one regarding spaces and plus signs in the path).
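To make the space-versus-plus problem concrete, here is a small sketch that runs the same path through both APIs. The class name PathEncodingContrast, the helper names, and the sample file name are invented for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

// Hypothetical demo class contrasting form encoding with path encoding.
final class PathEncodingContrast {

    // HTML form encoding: meant for query parameter names and values.
    static String formEncode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Path encoding through the multi-argument URI constructor
    // (scheme, host and fragment are left null, giving a relative URI).
    static String pathEncode(String path) {
        try {
            return new URI(null, null, path, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String path = "/first book+v2.pdf";
        System.out.println(formEncode(path)); // %2Ffirst+book%2Bv2.pdf
        System.out.println(pathEncode(path)); // /first%20book+v2.pdf
    }
}
```

The form-encoded result escapes the slash and turns the space into a plus sign, which a server reading the path would treat as a literal plus.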

HTTP URL Address Encoding in Java

The java.net.URI class can help; in the documentation of URL you find

Note, the URI class does perform escaping of its component fields in certain circumstances. The recommended way to manage the encoding and decoding of URLs is to use URI, and to convert between these two classes using toURI() and URI.toURL().

Use one of the constructors with more than one argument, like:

    // Arguments: scheme, host, path, fragment
    URI uri = new URI(
        "http",
        "search.barnesandnoble.com",
        "/booksearch/first book.pdf",
        null);
    URL url = uri.toURL();
    // or: String request = uri.toString();

Note that the single-argument constructor of URI does NOT escape illegal characters.

The code above escapes only illegal characters; it does NOT escape non-ASCII characters.

The toASCIIString method can be used to get a String only with US-ASCII characters:

    URI uri = new URI(
        "http",
        "search.barnesandnoble.com",
        "/booksearch/é",
        null);
    String request = uri.toASCIIString();

For a URL with a query, like http://www.google.com/ig/api?weather=São Paulo, use the 5-parameter version of the constructor:

    // Arguments: scheme, authority, path, query, fragment
    URI uri = new URI(
        "http",
        "www.google.com",
        "/ig/api",
        "weather=São Paulo",
        null);
    String request = uri.toASCIIString();
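The difference between toString() and toASCIIString() can be seen side by side in the sketch below. It reuses the values from the example above; the class name AsciiVsUnicode is made up, and the ã is written as the escape \u00e3 only so that the source compiles the same way regardless of file encoding.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class comparing the two string forms of a URI.
final class AsciiVsUnicode {

    static URI weatherUri() {
        try {
            // "S\u00e3o Paulo" is "São Paulo".
            return new URI("http", "www.google.com", "/ig/api", "weather=S\u00e3o Paulo", null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = weatherUri();
        // Only the illegal space is escaped; the non-ASCII letter is kept.
        System.out.println(uri.toString());
        // The non-ASCII letter is also escaped, as its UTF-8 bytes:
        // http://www.google.com/ig/api?weather=S%C3%A3o%20Paulo
        System.out.println(uri.toASCIIString());
    }
}
```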

Url encoding not supported

Based on the error code (-1,1004), it seems the server might be answering with HTTP 403 Forbidden. Have you tried passing the URL through Uri.encode(String) before calling mediaPlayer.setDataSource(String url)?

From https://stackoverflow.com/a/4571518/262462:

Don't use the URLEncoder class! Despite the name, that class actually does HTML form encoding, not URL encoding.

Java URL encoding of query string parameters

URLEncoder is the way to go. Just keep in mind to encode only the individual query string parameter names and/or values, not the entire URL, and certainly not the parameter separator character & or the name-value separator character =.

String q = "random word £500 bank $";
String url = "https://example.com?q=" + URLEncoder.encode(q, StandardCharsets.UTF_8);

If you're not yet on Java 10 or newer, pass StandardCharsets.UTF_8.toString() as the charset argument; if you're not yet on Java 7 or newer, pass "UTF-8".


Note that spaces in query parameters are represented by +, not %20, which is valid. The %20 form is normally used for spaces in the URI itself (the part before the query string separator character ?), not in the query string (the part after ?).
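As a quick check of what the encoder produces for the value above, the following sketch encodes it and decodes it back. The class and helper names are invented; it uses the charset-name overloads so that it also compiles on Java versions before 10, and the pound sign is written as \u00a3 only to keep the source independent of file encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class: round trip of a single query parameter value.
final class QueryValueEncoding {

    static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    static String decodeValue(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        String q = "random word \u00a3500 bank $"; // \u00a3 is the pound sign
        String encoded = encodeValue(q);
        System.out.println(encoded);              // random+word+%C2%A3500+bank+%24
        System.out.println(decodeValue(encoded)); // the original value again
    }
}
```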

Also note that there are three encode() overloads: encode(String), which takes no charset and is deprecated; encode(String, String), which takes the charset name and throws a checked UnsupportedEncodingException; and encode(String, Charset), which is available since Java 10. Never use the deprecated one; always specify the charset. The javadoc even explicitly recommends the UTF-8 encoding, as mandated by RFC 3986 and W3C. It states:

All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.
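The byte-to-"%xy" step described in that passage can be reproduced by hand. The sketch below is only an illustration with made-up names (PercentBytes, percentEncodeAllBytes); unlike URLEncoder it escapes every byte, including safe ones, so it is not a replacement for the real class.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how an unsafe character becomes "%xy" triplets.
final class PercentBytes {

    // Converts the text to UTF-8 bytes and writes each byte as '%'
    // followed by its two-digit hexadecimal value.
    static String percentEncodeAllBytes(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            out.append('%').append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(percentEncodeAllBytes("\u00e9")); // é     -> %C3%A9
        System.out.println(percentEncodeAllBytes("\u20ac")); // euro  -> %E2%82%AC
    }
}
```

A character that needs two or three bytes in UTF-8 therefore expands to two or three "%xy" groups.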

See also:

  • What every web developer must know about URL encoding

Java.net.URI constructor is not encoding & character

Have you tried something like below?

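// Encode each parameter value on its own, so that '&' inside a value becomes %26.
String tomJerry = "name=" + URLEncoder.encode("tom&jerry", StandardCharsets.UTF_8.toString());
String episode = "episode=" + URLEncoder.encode("2", StandardCharsets.UTF_8.toString());
String query = tomJerry + '&' + episode;
// The multi-argument URI constructors always quote '%', so passing the encoded query to one of them would turn %26 into %2526.
// Build the part before '?' with the constructor, then append the already-encoded query.
URI base = new URI(scheme, null, host, port, path, null, null);
URI uri = new URI(base.toASCIIString() + '?' + query);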

A better way would be to loop over the query's key-value pairs, apply URLEncoder to each value, and join the pairs with & (or stream, map, then collect). The point is to encode the value part of each query string parameter.
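The stream, map, then collect idea could look roughly like this. It is a sketch under assumptions: the class QueryStrings and its methods are hypothetical helpers, and a LinkedHashMap is used only to keep the parameters in insertion order.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: builds an encoded query string from key-value pairs.
final class QueryStrings {

    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Encodes every name and value separately, then joins the pairs with '&'.
    static String toQuery(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("name", "tom&jerry");
        params.put("episode", "2");
        System.out.println(toQuery(params)); // name=tom%26jerry&episode=2
    }
}
```

The separators & and = are added only after encoding, so an ampersand inside a value can never be mistaken for a parameter boundary.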

java.net.URLEncoder.encode(String) is deprecated, what should I use instead?

Use the other encode method in URLEncoder:

URLEncoder.encode(String, String)

The first parameter is the text to encode; the second is the name of the character encoding to use (e.g., UTF-8). For example:

    System.out.println(
        URLEncoder.encode(
            "urlParameterString",
            java.nio.charset.StandardCharsets.UTF_8.toString()
        )
    );

