How to Do URL Decoding in Java

How to do URL decoding in Java?

This does not have anything to do with character encodings such as UTF-8 or ASCII. The string you have there is URL encoded. This kind of encoding is something entirely different than character encoding.

Try something like this:

String result;
try {
    result = java.net.URLDecoder.decode(url, StandardCharsets.UTF_8.name());
} catch (UnsupportedEncodingException e) {
    // Not going to happen: the value came from the JDK's own StandardCharsets.
    throw new AssertionError(e);
}

Java 10 added direct support for Charset to the API, meaning there's no need to catch UnsupportedEncodingException:

String result = java.net.URLDecoder.decode(url, StandardCharsets.UTF_8);

Note that a character encoding (such as UTF-8 or ASCII) is what determines the mapping of characters to raw bytes. For a good intro to character encodings, see this article.
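As a minimal sketch of the Java 10+ call above, the following wraps it in a helper and decodes two sample inputs. The class name, the helper name and the sample strings are made up for illustration; only URLDecoder.decode and StandardCharsets come from the JDK.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class UrlDecodeSketch {
    // Hypothetical helper; needs Java 10+ for the Charset overload.
    static String decodeUtf8(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // %C3%A9 is the UTF-8 byte pair for U+00E9, and '+' decodes to a space.
        System.out.println(decodeUtf8("caf%C3%A9+au+lait"));
        // A literal plus sign has to arrive as %2B to survive decoding.
        System.out.println(decodeUtf8("1%2B1%3D2")); // prints 1+1=2
    }
}
```

Note that URLDecoder follows the HTML form rules, so every + in the input becomes a space.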

Java URL encoding of query string parameters

URLEncoder is the way to go. Just remember to encode only the individual query string parameter names and values, not the entire URL, and certainly not the parameter separator character & or the name-value separator character =.

String q = "random word £500 bank $";
String url = "https://example.com?q=" + URLEncoder.encode(q, StandardCharsets.UTF_8);

If you're not yet on Java 10 or newer, use StandardCharsets.UTF_8.toString() as the charset argument; if you're not yet on Java 7 or newer, use "UTF-8".
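To illustrate encoding names and values one by one, here is a small sketch that builds a query string from a map. The class name, the buildQuery helper and the second sample parameter are assumptions for illustration, and the code assumes Java 10 or newer.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class QueryStringSketch {
    // Hypothetical helper: encodes each name and value separately,
    // then joins the pairs with the unencoded separators '=' and '&'.
    static String buildQuery(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joiner.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "random word \u00A3500 bank $"); // pound sign written as an escape
        params.put("a&b", "1=2");
        // prints https://example.com?q=random+word+%C2%A3500+bank+%24&a%26b=1%3D2
        System.out.println("https://example.com?" + buildQuery(params));
    }
}
```

Because the separators are added after encoding, an & or = inside a name or value cannot be mistaken for structure.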


Note that spaces in query parameters are represented by +, not %20, which is legitimately valid. The %20 is usually used to represent spaces in the URI path (the part before the query string separator character ?), not in the query string (the part after ?).
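If you do need %20, for example for a path segment, one common workaround is to post-process the URLEncoder output. The sketch below is an approximation under stated assumptions rather than a full RFC 3986 path encoder: URLEncoder targets form encoding, so it leaves * unescaped and escapes ~. The class and method names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class PathSegmentSketch {
    // Hypothetical helper. The replace is safe because a literal '+' in the
    // input has already been encoded as %2B, so every remaining '+' is a space.
    static String encodePathSegment(String segment) {
        return URLEncoder.encode(segment, StandardCharsets.UTF_8).replace("+", "%20");
    }

    public static void main(String[] args) {
        System.out.println(encodePathSegment("Tu Bin Bataye.mp3")); // prints Tu%20Bin%20Bataye.mp3
        System.out.println(encodePathSegment("a+b c"));             // prints a%2Bb%20c
    }
}
```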

Also note that there are three encode() methods: one without a second argument, one with a String as the second argument, which throws a checked exception, and one with a Charset as the second argument (Java 10 and newer). The one without a charset argument is deprecated. Never use it; always specify the charset. The javadoc even explicitly recommends UTF-8, as mandated by RFC 3986 and W3C:

All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.
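The byte-by-byte rule quoted above can be sketched by hand. The helper below is hypothetical and escapes every byte, including safe ones, purely to show how the chosen charset changes the result for the same character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PercentBytesSketch {
    // Hypothetical helper: converts the text to bytes with the given charset
    // and writes each byte as %XY, XY being its two-digit hex value.
    static String percentEncodeAll(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // the pound sign
        System.out.println(percentEncodeAll(pound, StandardCharsets.UTF_8));      // prints %C2%A3
        System.out.println(percentEncodeAll(pound, StandardCharsets.ISO_8859_1)); // prints %A3
    }
}
```

The same character yields two bytes in UTF-8 and one in ISO-8859-1, which is why encoder and decoder have to agree on the charset.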

See also:

  • What every web developer must know about URL encoding

How do you unescape URLs in Java?

This is not unescaped XML, this is URL encoded text. Looks to me like you want to use the following on the URL strings.

URLDecoder.decode(url, "UTF-8");

This will give you the correct text. The result of decoding the link you provided is this:

http://cliveg.bu.edu/people/sganguly/player/ Rang De Basanti - Tu Bin Bataye.mp3

The %20 is an escaped space character. To get the above I used the URLDecoder class.
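One caveat when decoding a whole URL this way: URLDecoder applies form rules and also turns + into a space, which is usually wrong for the path part. As an alternative sketch, java.net.URI can decode only the path and leaves + alone. The example URL and the class and method names are invented for illustration.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class PathDecodeSketch {
    // Hypothetical helper: URI.getPath() returns the path with %XX escapes decoded.
    static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    // Form-style decoding for comparison (Java 10+ overload).
    static String formDecode(String text) {
        return URLDecoder.decode(text, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String url = "http://example.com/music/Tu%20Bin+Bataye.mp3";
        System.out.println(decodedPath(url)); // prints /music/Tu Bin+Bataye.mp3
        System.out.println(formDecode(url));  // prints http://example.com/music/Tu Bin Bataye.mp3
    }
}
```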

URL decoding in Java for non-ASCII characters

Anv%E4ndare

As PopoFibo says this is not a valid UTF-8 encoded sequence.

You can do some tolerant best-guess decoding:

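import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Tries each charset in order and returns the first strict decode that succeeds.
public static String parse(String segment, Charset... encodings) {
    byte[] data = parse(segment);
    for (Charset encoding : encodings) {
        try {
            return encoding.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException notThisCharset_ignore) {}
    }
    // No charset matched: fall back to the undecoded input.
    return segment;
}

// Turns each %XX escape into a raw byte; all other text is copied as ASCII.
private static byte[] parse(String segment) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    Matcher matcher = Pattern.compile("%([A-Fa-f0-9][A-Fa-f0-9])")
            .matcher(segment);
    int last = 0;
    while (matcher.find()) {
        appendAscii(buf, segment.substring(last, matcher.start()));
        byte hex = (byte) Integer.parseInt(matcher.group(1), 16);
        buf.write(hex);
        last = matcher.end();
    }
    appendAscii(buf, segment.substring(last));
    return buf.toByteArray();
}

private static void appendAscii(ByteArrayOutputStream buf, String data) {
    byte[] b = data.getBytes(StandardCharsets.US_ASCII);
    buf.write(b, 0, b.length);
}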

This code will successfully decode the given strings:

for (String test : Arrays.asList("Fondation_Alliance_fran%C3%A7aise",
        "Anv%E4ndare")) {
    String result = parse(test, StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1);
    System.out.println(result);
}

Note that this isn't some foolproof system that allows you to ignore correct URL encoding. It works here because v%E4n - the byte sequence 76 E4 6E - is not a valid sequence as per the UTF-8 scheme and the decoder can detect this.

If you reverse the order of the encodings the first string can happily (but incorrectly) be decoded as ISO-8859-1.
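The two observations above can be checked at the byte level. This is a standalone sketch with a hypothetical strictDecode helper that returns null when the bytes are not valid in the given charset; it is not part of the code above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeSketch {
    // Hypothetical helper: strict decode, null if the bytes do not fit the charset.
    static String strictDecode(byte[] data, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] vAn = {0x76, (byte) 0xE4, 0x6E};    // v%E4n
        byte[] cedilla = {(byte) 0xC3, (byte) 0xA7}; // %C3%A7
        // E4 starts a three-byte UTF-8 sequence, but 6E is not a continuation byte.
        System.out.println(strictDecode(vAn, StandardCharsets.UTF_8) == null); // prints true
        // ISO-8859-1 maps every byte to a character, so it never rejects input:
        // the two bytes come back as two characters instead of one.
        System.out.println(strictDecode(cedilla, StandardCharsets.ISO_8859_1).length()); // prints 2
    }
}
```

Since ISO-8859-1 accepts any byte sequence, it only makes sense as the last charset in the list.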


Note: HTTP doesn't care about percent-encoding and you can write a web server that accepts http://foo/%%%%% as a valid form. The URI spec mandates UTF-8 but this was done retroactively. It is really up to the server to describe what form its URIs should be and if you have to handle arbitrary URIs you need to be aware of this legacy.

I've written a bit more about URLs and Java here.

URL Decode in Java 6

Now you need to specify the character encoding of your string. Based on the information on the URLDecoder page:

Note: The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.

The following should work for you:

java.net.URLDecoder.decode(url, "UTF-8");


UrlDecoder decode several times

monta[%25]C3[%25]B1a
monta % C3 % B1a which has a UTF-8 multi-byte sequence
monta ñ a

It is important to decode with the same Charset that was used for encoding.
Evidently the value was URL-encoded twice: first the UTF-8 bytes were percent-encoded, and then each % was encoded again as %25.

The double encoding should be fixed at its source; otherwise you are left with a puzzling workaround like this:

s = URLDecoder.decode(s, StandardCharsets.UTF_8);
s = URLDecoder.decode(s, StandardCharsets.UTF_8);
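As a sketch of what the two passes do to the value shown earlier (with the brackets removed), the following traces each step. The class and helper names are hypothetical, and the code assumes Java 10 or newer.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class DoubleDecodeSketch {
    // Hypothetical helper: a single decoding pass.
    static String decodeOnce(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String doubleEncoded = "monta%25C3%25B1a";
        String once = decodeOnce(doubleEncoded); // %25 becomes %, giving monta%C3%B1a
        String twice = decodeOnce(once);         // %C3%B1 becomes U+00F1
        System.out.println(once);                // prints monta%C3%B1a
        System.out.println(twice.length());      // prints 7
    }
}
```

Decoding twice is only safe when you know the value was encoded twice; a correctly encoded value that contains a literal % (sent as %25) would be corrupted or rejected by the second pass.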

UTF-8 URL Decode / Encode

The id value you see seems to have been decoded using the ISO-8859-1 charset instead of UTF-8. The encoding of the path part of a URL is not specified in Java EE and there is no standard API to set it. For parameters sent in the request body (such as POST form data) you can call request.setCharacterEncoding before accessing any parameters to have them decoded correctly; whether that also covers the query string depends on the container. The CharacterEncodingFilter does exactly that but has no influence on path parameters.

To make this work in Tomcat you have to set the URIEncoding attribute of the Connector element in its server.xml to "utf-8".
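The symptom described above can be reproduced without a server. The sketch below decodes a UTF-8 encoded value with the wrong charset and then shows a last-resort repair; it assumes the container really used ISO-8859-1, and the sample value and all names are made up for illustration.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchSketch {
    // Hypothetical helper: decode with an explicit charset (Java 10+).
    static String decodeAs(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    // Hypothetical repair: recover the original bytes, then reread them as UTF-8.
    // Only valid if the wrong decoding really was ISO-8859-1.
    static String repair(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String wrong = decodeAs("ni%C3%B1o", StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length());         // prints 5: two characters where one was meant
        System.out.println(repair(wrong).length()); // prints 4
    }
}
```

Fixing the container configuration is preferable; the repair step only undoes damage after the fact.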

All you ever wanted to know about character encoding in a Java web app can be found in this excellent answer to a similar question.


