HTTP URL Address Encoding in Java

The java.net.URI class can help. The documentation of URL says:

Note, the URI class does perform escaping of its component fields in certain circumstances. The recommended way to manage the encoding and decoding of URLs is to use URI

Use one of the constructors with more than one argument, like:

URI uri = new URI(
    "http",
    "search.barnesandnoble.com",
    "/booksearch/first book.pdf",
    null);
URL url = uri.toURL();
// or: String request = uri.toString();

(the single-argument constructor of URI does NOT escape illegal characters)
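
To make the difference between the two kinds of constructor concrete, here is a minimal sketch. The class and method names are made up for illustration, and the checked URISyntaxException is wrapped only to keep the example short.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriPathEscapingDemo {

    // Builds the URL from separate components. The multi-argument constructor
    // quotes characters that are illegal in a component (here, the space).
    static String fromComponents(String host, String path) {
        try {
            return new URI("http", host, path, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns true if the single-argument constructor rejects the string.
    static boolean singleArgRejects(String url) {
        try {
            new URI(url);
            return false;
        } catch (URISyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromComponents(
                "search.barnesandnoble.com", "/booksearch/first book.pdf"));
        // prints http://search.barnesandnoble.com/booksearch/first%20book.pdf

        System.out.println(singleArgRejects(
                "http://search.barnesandnoble.com/booksearch/first book.pdf"));
        // prints true: the unescaped space is a syntax error here
    }
}
```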


The code above escapes only illegal characters; it does NOT escape non-ASCII characters.

The toASCIIString method can be used to get a String only with US-ASCII characters:

URI uri = new URI(
    "http",
    "search.barnesandnoble.com",
    "/booksearch/é",
    null);
String request = uri.toASCIIString();
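
As a quick check of the difference, the sketch below prints both string forms of the same URI. The class name is hypothetical, and é is written as the escape \u00e9 so the example does not depend on the source file encoding.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriAsciiDemo {

    // Builds the example URI from its components.
    static URI build(String path) {
        try {
            return new URI("http", "search.barnesandnoble.com", path, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = build("/booksearch/\u00e9");

        // toString() keeps the non-ASCII letter as it is.
        System.out.println(uri.toString());

        // toASCIIString() percent-encodes its UTF-8 bytes.
        System.out.println(uri.toASCIIString());
        // prints http://search.barnesandnoble.com/booksearch/%C3%A9
    }
}
```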

For a URL with a query like http://www.google.com/ig/api?weather=São Paulo, use the 5-parameter version of the constructor:

URI uri = new URI(
    "http",
    "www.google.com",
    "/ig/api",
    "weather=São Paulo",
    null);
String request = uri.toASCIIString();
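
One thing worth knowing about this constructor: it only quotes characters that are illegal in a query, so & and = pass through untouched. A rough sketch (the class name and the second query string are invented for illustration):

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriQueryDemo {

    // Uses the 5-argument constructor: scheme, authority, path, query, fragment.
    static String withQuery(String query) {
        try {
            return new URI("http", "www.google.com", "/ig/api", query, null)
                    .toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(withQuery("weather=S\u00e3o Paulo"));
        // prints http://www.google.com/ig/api?weather=S%C3%A3o%20Paulo

        // '&' and '=' are legal in a query, so they are not escaped.
        System.out.println(withQuery("q=a&b=c"));
        // prints http://www.google.com/ig/api?q=a&b=c
    }
}
```

So if a parameter value can itself contain & or =, that value has to be encoded on its own before it goes into the query string.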

Java URL encoding of query string parameters

URLEncoder is the way to go. Just keep in mind to encode only the individual query string parameter names and values, not the entire URL, and certainly not the parameter separator character & or the name-value separator character =.

String q = "random word £500 bank $";
String url = "https://example.com?q=" + URLEncoder.encode(q, StandardCharsets.UTF_8);

If you're not yet on Java 10 or newer, use StandardCharsets.UTF_8.toString() as the charset argument; if you're not yet on Java 7 or newer, use "UTF-8".


Note that spaces in query parameters are represented by +, not %20, which is perfectly valid. %20 is normally used to represent spaces in the URI path (the part before the query string separator character ?), not in the query string (the part after ?).

Also note that there are three encode() overloads: one without a charset argument, one taking the charset name as a String (which throws the checked UnsupportedEncodingException), and, since Java 10, one taking a Charset. The one without a charset argument is deprecated. Never use it; always specify the charset. The javadoc even explicitly recommends the UTF-8 encoding, as mandated by RFC 3986 and W3C:

All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.
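
Putting the advice above together, a query string can be assembled by encoding each name and value separately and only then joining them with = and &. This is just a sketch: QueryStringBuilder and the lang parameter are invented for the example, and it assumes Java 10+ for the Charset overload.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class QueryStringBuilder {

    // Encodes every name and value on its own; the separators stay literal.
    static String toQueryString(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joiner.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "random word \u00a3500 bank $");
        params.put("lang", "en&fr"); // hypothetical value containing a separator

        System.out.println("https://example.com?" + toQueryString(params));
        // prints https://example.com?q=random+word+%C2%A3500+bank+%24&lang=en%26fr
    }
}
```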

See also:

  • What every web developer must know about URL encoding

How to do URL decoding in Java?

This does not have anything to do with character encodings such as UTF-8 or ASCII. The string you have there is URL encoded. This kind of encoding is something entirely different than character encoding.

Try something like this:

try {
    String result = java.net.URLDecoder.decode(url, StandardCharsets.UTF_8.name());
} catch (UnsupportedEncodingException e) {
    // not going to happen - value came from JDK's own StandardCharsets
}

Java 10 added direct support for Charset to the API, meaning there's no need to catch UnsupportedEncodingException:

String result = java.net.URLDecoder.decode(url, StandardCharsets.UTF_8);

Note that a character encoding (such as UTF-8 or ASCII) is what determines the mapping of characters to raw bytes. For a good intro to character encodings, see this article.
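
For a feel of what the decoder does with actual input, here is a small sketch (the class name and sample strings are made up; Java 10+ is assumed for the Charset overload). Note that + comes back as a space, while a literal plus has to arrive as %2B.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

final class FormDecodingDemo {

    // Decodes an application/x-www-form-urlencoded string as UTF-8.
    static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // '+' becomes a space; %C3%A3 is the UTF-8 byte pair for 'ã'.
        System.out.println(decode("S%C3%A3o+Paulo"));

        // %2B and %3D turn back into '+' and '='.
        System.out.println(decode("1%2B1%3D2"));
        // prints 1+1=2
    }
}
```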

Encoding URL query parameters in Java

java.net.URLEncoder.encode(String s, String encoding) can help too. It follows the HTML form encoding application/x-www-form-urlencoded.

URLEncoder.encode(query, "UTF-8");

On the other hand, percent-encoding (also known as URL encoding) encodes a space as %20. The colon is a reserved character, so a : used in a URL component remains a colon after percent-encoding, whereas URLEncoder turns it into %3A.
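
The two behaviours can be compared side by side on the same input. In this sketch the class name, the host example.com and the sample text are assumptions for illustration; Java 10+ is assumed for the Charset overload.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SpaceAndColonDemo {

    // HTML form encoding, as done by URLEncoder.
    static String formEncode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Percent-encoding of a path, as done by the multi-argument URI constructor.
    static String pathEncode(String path) {
        try {
            return new URI("http", "example.com", path, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("a b:c"));
        // prints a+b%3Ac

        System.out.println(pathEncode("/a b:c"));
        // prints /a%20b:c
    }
}
```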

How to encode URL to avoid special characters in Java?

URL construction is tricky because different parts of the URL have different rules for what characters are allowed: for example, the plus sign is reserved in the query component of a URL because it represents a space, but in the path component of the URL, a plus sign has no special meaning and spaces are encoded as "%20".

RFC 2396 explains (in section 2.4.2) that a complete URL is always in its encoded form: you take the strings for the individual components (scheme, authority, path, etc.), encode each according to its own rules, and then combine them into the complete URL string. Trying to build a complete unencoded URL string and then encode it separately leads to subtle bugs, like spaces in the path being incorrectly changed to plus signs (which an RFC-compliant server will interpret as real plus signs, not encoded spaces).

In Java, the correct way to build a URL is with the URI class. Use one of the multi-argument constructors that takes the URL components as separate strings, and it'll escape each component correctly according to that component's rules. The toASCIIString() method gives you a properly-escaped and encoded string that you can send to a server. To decode a URL, construct a URI object using the single-string constructor and then use the accessor methods (such as getPath()) to retrieve the decoded components.
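
The decoding half of that advice might look like the sketch below. The URL and the class name are invented for the example; the accessors are the standard java.net.URI ones, where the getRaw... variants return the component still escaped.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriDecodingDemo {

    // Parses an already-encoded URL with the single-string constructor.
    static URI parse(String encodedUrl) {
        try {
            return new URI(encodedUrl);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = parse("http://example.com/docs/first%20book+v2.pdf?q=x%26y");

        System.out.println(uri.getRawPath()); // prints /docs/first%20book+v2.pdf
        System.out.println(uri.getPath());    // prints /docs/first book+v2.pdf
        System.out.println(uri.getRawQuery()); // prints q=x%26y
        System.out.println(uri.getQuery());    // prints q=x&y
    }
}
```

Note that the plus sign in the path stays a plus sign after decoding, which matches the rule described earlier.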

Don't use the URLEncoder class! Despite the name, that class actually does HTML form encoding, not URL encoding. It's not correct to concatenate unencoded strings to make an "unencoded" URL and then pass it through a URLEncoder. Doing so will result in problems (particularly the aforementioned one regarding spaces and plus signs in the path).
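
To see the kind of damage this causes, here is what happens when a whole URL is pushed through URLEncoder. This is a demonstration of the mistake, not a recommendation; the URL and class name are made up, and Java 10+ is assumed.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class WholeUrlMistakeDemo {

    // The anti-pattern: form-encoding a complete URL string.
    static String encodeWholeUrl(String url) {
        return URLEncoder.encode(url, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encodeWholeUrl("http://example.com/my file.pdf"));
        // prints http%3A%2F%2Fexample.com%2Fmy+file.pdf
        // The scheme and path separators are destroyed, and the space in the
        // path has become '+', which a server reads as a literal plus sign.
    }
}
```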

HTTP URL Address Encoding in Scala/Java

Try lemonlabsuk/scala-uri, for example

import io.lemonlabs.uri.Url

val urls1 = Url.parse("http://www.ins.gob.pe/insvirtual/images/otrpubs/pdf/ponzo%C3%B1osos.pdf")
val urls2 = Url.parse("http://www.ins.gob.pe/insvirtual/images/otrpubs/pdf/ponzoñosos.pdf")

println(urls1)
println(urls2)

outputs in both cases

http://www.ins.gob.pe/insvirtual/images/otrpubs/pdf/ponzo%C3%B1osos.pdf
http://www.ins.gob.pe/insvirtual/images/otrpubs/pdf/ponzo%C3%B1osos.pdf

so it seems it is able to detect if the URL is already encoded.

Escaping a URL in Java

I haven't found so far how to encode this string to match both storing in an HTML and encoded as a URL

That's because there isn't any, since those are two separate things.

Printing in HTML should generally be done by replacing only ', ", <, > and & with &#39;, &quot;, &lt;, &gt; and &amp;. Examples of doing that are in "Recommended method for escaping HTML in Java", the most trivial and easiest to reason about being

public static String encodeToHTML(String str) {
    return str
        // & must be replaced first, or the entities added below get escaped again
        .replace("&", "&amp;")
        .replace("'", "&#39;")
        .replace("\"", "&quot;")
        .replace("<", "&lt;")
        .replace(">", "&gt;");
}

Note that the character set declared in your page needs to match, and be aware that if you print the URL in an attribute, for example, the requirements are a bit different.

Encoding as a URL leaves a much shorter list of characters untouched. From the URLEncoder documentation:

The alphanumeric characters "a" through "z", "A" through "Z" and "0"
through "9" remain the same.

The special characters ".", "-", "*", and "_" remain the same.

The space character " " is converted into a plus sign "+".

All other characters are unsafe and are first converted into
one or more bytes using some encoding scheme. Then each byte is
represented by the 3-character string "%xy", where xy is the two-digit
hexadecimal representation of the byte.

The recommended encoding scheme to use is UTF-8.

You'd get those with

String encoded = java.net.URLEncoder.encode(url, "UTF-8");

The above will give you HTML form encoding, which is close to what URL encoding does, with a few notable differences, the most relevant being + vs %20. For that, you can do this on its output:

encoded = encoded.replace("+", "%20");

Note also that you don't want to URL-encode the whole http://BUCKET_ENDPOINT/PATH_1/PATH_2/PATH_3/PATH_4/PATH_5/TEST NAME COULD BE WITH & AND OTHER SPECIAL CHARS.zip, but only the last part of it, TEST NAME COULD BE WITH & AND OTHER SPECIAL CHARS.zip, and the individual path segments if they are not fixed.

If you need to generate the URL and print it in HTML, first encode it as a URL, then do the HTML escaping.
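
The two steps could be chained roughly like this. Everything named here is an assumption for illustration: the HrefBuilder class, the shortened file name, and the ?a=1&b=2 query, which is only there to show an ampersand being HTML-escaped. Java 10+ is assumed for the Charset overload.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class HrefBuilder {

    // Step 1: URL-encode one path segment (form encoding, then '+' to %20).
    static String encodePathSegment(String segment) {
        return URLEncoder.encode(segment, StandardCharsets.UTF_8)
                .replace("+", "%20");
    }

    // Step 2: HTML-escape the finished URL; '&' goes first to avoid double escaping.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
    }

    public static void main(String[] args) {
        String url = "http://BUCKET_ENDPOINT/PATH_1/"
                + encodePathSegment("TEST NAME & OTHER.zip")
                + "?a=1&b=2";
        System.out.println(url);
        // prints http://BUCKET_ENDPOINT/PATH_1/TEST%20NAME%20%26%20OTHER.zip?a=1&b=2

        System.out.println("<a href=\"" + escapeHtml(url) + "\">download</a>");
        // prints <a href="http://BUCKET_ENDPOINT/PATH_1/TEST%20NAME%20%26%20OTHER.zip?a=1&amp;b=2">download</a>
    }
}
```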

How do I encode a complete HTTP URL String correctly?

With external library:

import org.apache.commons.httpclient.util.URIUtil;
String myUrl_1= "http://one.two/three?four five";
System.out.println(URIUtil.encodeQuery(myUrl_1));

And the output:

http://one.two/three?four%20five

Or

String webResourceURL = "http://stackoverflow.com/search?q=<script>alert(1)</script> s";
System.out.println(URIUtil.encodeQuery(webResourceURL));

And the output:

http://stackoverflow.com/search?q=%3Cscript%3Ealert(1)%3C/script%3E%20s

And the Maven dependency

<dependency>
    <groupId>commons-httpclient</groupId>
    <artifactId>commons-httpclient</artifactId>
    <version>3.1</version>
</dependency>
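
If pulling in a library is not an option, something similar can be approximated with the JDK alone. The sketch below is not what URIUtil does internally; it is an assumption-laden alternative that splits the string with the component regex from RFC 3986 (Appendix B) and hands the pieces to the multi-argument URI constructor. The class name is made up. Because that constructor always quotes %, input that is already encoded would be encoded twice.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlComponentEncoder {

    // Component-splitting regex from RFC 3986, Appendix B.
    // Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
    private static final Pattern PARTS = Pattern.compile(
            "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Splits an unencoded URL and lets URI escape each component separately.
    static String encode(String rawUrl) {
        Matcher m = PARTS.matcher(rawUrl);
        if (!m.matches()) {
            throw new IllegalArgumentException("Cannot split: " + rawUrl);
        }
        try {
            return new URI(m.group(2), m.group(4), m.group(5),
                    m.group(7), m.group(9)).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("http://one.two/three?four five"));
        // prints http://one.two/three?four%20five

        System.out.println(encode(
                "http://stackoverflow.com/search?q=<script>alert(1)</script> s"));
        // prints http://stackoverflow.com/search?q=%3Cscript%3Ealert(1)%3C/script%3E%20s
    }
}
```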

Java library for URL encoding if necessary (like a browser)

What every web developer must know about URL encoding

Url Encoding Explained

Why do I need URL encoding?

The URL specification RFC 1738 specifies that only a small set of characters 
can be used in a URL. Those characters are:

A to Z (ABCDEFGHIJKLMNOPQRSTUVWXYZ)
a to z (abcdefghijklmnopqrstuvwxyz)
0 to 9 (0123456789)
$ (Dollar Sign)
- (Hyphen / Dash)
_ (Underscore)
. (Period)
+ (Plus sign)
! (Exclamation / Bang)
* (Asterisk / Star)
' (Single Quote)
( (Open Bracket)
) (Closing Bracket)

How does URL encoding work?

All offending characters are replaced by a % and a two digit hexadecimal value 
that represents the character in the proper ISO character set. Here are a
couple of examples:

$ (Dollar Sign) becomes %24
& (Ampersand) becomes %26
+ (Plus) becomes %2B
, (Comma) becomes %2C
: (Colon) becomes %3A
; (Semi-Colon) becomes %3B
= (Equals) becomes %3D
? (Question Mark) becomes %3F
@ (Commercial A / At) becomes %40
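
The mechanism itself is small enough to write by hand. The following is a teaching sketch, not a replacement for URI or URLEncoder: the class name is invented, it assumes UTF-8 rather than an ISO character set, and it keeps only letters, digits and -_.~ unencoded, which is stricter than the RFC 1738 list above.

```java
import java.nio.charset.StandardCharsets;

final class PercentEncoder {

    // Characters left as they are in this sketch (an assumption, see lead-in).
    private static final String KEEP =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~";
    private static final String HEX = "0123456789ABCDEF";

    // Replaces every other byte with '%' plus its two-digit hexadecimal value.
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int value = b & 0xFF;
            if (value < 0x80 && KEEP.indexOf((char) value) >= 0) {
                out.append((char) value);
            } else {
                out.append('%')
                   .append(HEX.charAt(value >> 4))
                   .append(HEX.charAt(value & 0x0F));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("$&+,:;=?@"));
        // prints %24%26%2B%2C%3A%3B%3D%3F%40

        System.out.println(encode("a b"));
        // prints a%20b
    }
}
```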

Simple Example:

import java.util.logging.Level;
import java.util.logging.Logger;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class TextHelper {
    // May be null on JDKs without a JavaScript engine (Nashorn was removed in Java 15).
    private static final ScriptEngine engine = new ScriptEngineManager()
            .getEngineByName("JavaScript");

    /**
     * Escapes the string with JavaScript's escape() and then restores
     * the characters :/;@<>=&%$#+,? so that they stay literal.
     *
     * @param str the string to encode
     * @return the encoded result, or null if the script fails
     */
    public static String escapeJavascript(String str) {
        try {
            return engine.eval(String.format("escape(\"%s\")",
                    str.replace("%20", " "))).toString()
                    // replace() is literal; replaceAll() would read "$" as a group reference
                    .replace("%3A", ":")
                    .replace("%2F", "/")
                    .replace("%3B", ";")
                    .replace("%40", "@")
                    .replace("%3C", "<")
                    .replace("%3E", ">")
                    .replace("%3D", "=")
                    .replace("%26", "&")
                    .replace("%25", "%")
                    .replace("%24", "$")
                    .replace("%23", "#")
                    .replace("%2B", "+")
                    .replace("%2C", ",")
                    .replace("%3F", "?");
        } catch (ScriptException ex) {
            Logger.getLogger(TextHelper.class.getName())
                    .log(Level.SEVERE, null, ex);
            return null;
        }
    }
}

