How to do URL decoding in Java?
This does not have anything to do with character encodings such as UTF-8 or ASCII. The string you have there is URL encoded. This kind of encoding is something entirely different from character encoding.
Try something like this:
String result;
try {
    result = java.net.URLDecoder.decode(url, StandardCharsets.UTF_8.name());
} catch (UnsupportedEncodingException e) {
    // not going to happen - value came from JDK's own StandardCharsets
    throw new AssertionError(e);
}
Java 10 added direct support for Charset to the API, meaning there's no need to catch UnsupportedEncodingException:
String result = java.net.URLDecoder.decode(url, StandardCharsets.UTF_8);
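As a minimal sketch of the Java 10+ overload in use (the class name UrlDecodeSketch and the helper decodeUtf8 are made up for illustration), the following decodes a form-encoded value in which + stands for a space and %C2%A3 is the UTF-8 byte pair for the pound sign:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK.
final class UrlDecodeSketch {

    // Decodes an application/x-www-form-urlencoded value as UTF-8.
    // Uses the Charset overload, so this assumes Java 10 or newer.
    static String decodeUtf8(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // '+' becomes a space, %C2%A3 becomes the pound sign, %24 becomes '$'
        System.out.println(decodeUtf8("random+word+%C2%A3500+bank+%24"));
    }
}
```

Note that URLDecoder always turns + into a space, so it is suited to query string values rather than to arbitrary URI paths.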
Note that a character encoding (such as UTF-8 or ASCII) is what determines the mapping of characters to raw bytes. For a good intro to character encodings, see this article.
Java URL encoding of query string parameters
URLEncoder is the way to go. Just keep in mind to encode only the individual query string parameter names and/or values, not the entire URL, and certainly not the query string parameter separator character & or the parameter name-value separator character =.
String q = "random word £500 bank $";
String url = "https://example.com?q=" + URLEncoder.encode(q, StandardCharsets.UTF_8);
If you are not yet on Java 10 or newer, use StandardCharsets.UTF_8.toString() as the charset argument; if you are not yet on Java 7 or newer, use "UTF-8".
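To illustrate encoding only the names and values while leaving the & and = separators alone, here is a small sketch. The class QueryStringSketch and the method toQueryString are hypothetical names, and the Charset overload assumes Java 10 or newer:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper class, not part of the JDK.
final class QueryStringSketch {

    // Encodes each name and each value separately, then joins the pairs
    // with the literal (unencoded) separators '=' and '&'.
    static String toQueryString(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            String name = URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8);
            String value = URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8);
            joiner.add(name + "=" + value);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "random word \u00a3500 bank $");
        params.put("tag", "a&b=c"); // separators inside a value get escaped
        System.out.println("https://example.com?" + toQueryString(params));
    }
}
```

A LinkedHashMap is used here only so that the parameters come out in insertion order.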
Note that spaces in query parameters are represented by +, not %20, which is legitimately valid. The %20 is usually used to represent spaces in the URI itself (the part before the URI-query string separator character ?), not in the query string (the part after ?).
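The difference between the two space representations can be sketched as follows. URLEncoder has no mode for path segments, so the replace("+", "%20") step below is a common workaround rather than an official API; it is safe because a literal + in the input has already been escaped as %2B. The class and method names are assumptions made for this example:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK.
final class SpaceEncodingSketch {

    // Form encoding for query string values: spaces become '+'.
    static String encodeQueryValue(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Workaround for path segments: reuse URLEncoder, then turn '+' into "%20".
    // Any literal '+' in the input is already "%2B" at this point.
    static String encodePathSegment(String segment) {
        return URLEncoder.encode(segment, StandardCharsets.UTF_8).replace("+", "%20");
    }

    public static void main(String[] args) {
        String name = "Tu Bin Bataye.mp3";
        System.out.println(encodeQueryValue(name));   // spaces as '+'
        System.out.println(encodePathSegment(name));  // spaces as "%20"
    }
}
```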
Also note that there are three encode() methods: one without a second argument, one with a String charset name as second argument (which throws a checked exception), and one with a Charset as second argument. The one without a charset argument is deprecated. Never use it; always specify the charset. The javadoc even explicitly recommends using the UTF-8 encoding, as mandated by RFC3986 and W3C.
All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.
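The byte-to-"%xy" step described in that passage can be sketched by hand. Unlike URLEncoder, this illustrative helper (PercentEncodeSketch and percentEncodeAll are made-up names) escapes every byte, including safe characters, purely to show how the chosen charset changes the result:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK.
final class PercentEncodeSketch {

    // Converts the text to bytes in the given charset and writes every byte
    // as '%' followed by two uppercase hex digits.
    static String percentEncodeAll(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            out.append(String.format("%%%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String aUmlaut = "\u00e4"; // a with diaeresis
        System.out.println(percentEncodeAll(aUmlaut, StandardCharsets.UTF_8));      // two bytes
        System.out.println(percentEncodeAll(aUmlaut, StandardCharsets.ISO_8859_1)); // one byte
    }
}
```

The same character yields %C3%A4 in UTF-8 but %E4 in ISO-8859-1, which is why the encoder and the decoder have to agree on the charset.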
See also:
- What every web developer must know about URL encoding
How do you unescape URLs in Java?
This is not unescaped XML; it is URL encoded text. It looks like you want to use the following on the URL strings (the single-argument decode(url) is deprecated, so pass the charset explicitly).
URLDecoder.decode(url, "UTF-8");
This will give you the correct text. The result of decoding the link you provided is this.
http://cliveg.bu.edu/people/sganguly/player/ Rang De Basanti - Tu Bin Bataye.mp3
The %20 is an escaped space character. To get the above I used the URLDecoder class.
URL decoding in Java for non-ASCII characters
Anv%E4ndare
As PopoFibo says, this is not a valid UTF-8 encoded sequence.
You can do some tolerant best-guess decoding:
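import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Tries each charset in order and returns the first strict decode that succeeds.
public static String parse(String segment, Charset... encodings) {
    byte[] data = parse(segment);
    for (Charset encoding : encodings) {
        try {
            return encoding.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException notThisCharset_ignore) {}
    }
    // No charset matched: fall back to the undecoded input.
    return segment;
}

// Turns %XX escapes into raw bytes; everything else is copied as ASCII.
private static byte[] parse(String segment) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    Matcher matcher = Pattern.compile("%([A-Fa-f0-9][A-Fa-f0-9])")
            .matcher(segment);
    int last = 0;
    while (matcher.find()) {
        appendAscii(buf, segment.substring(last, matcher.start()));
        byte hex = (byte) Integer.parseInt(matcher.group(1), 16);
        buf.write(hex);
        last = matcher.end();
    }
    appendAscii(buf, segment.substring(last));
    return buf.toByteArray();
}

private static void appendAscii(ByteArrayOutputStream buf, String data) {
    byte[] b = data.getBytes(StandardCharsets.US_ASCII);
    buf.write(b, 0, b.length);
}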
This code will successfully decode the given strings:
for (String test : Arrays.asList("Fondation_Alliance_fran%C3%A7aise",
        "Anv%E4ndare")) {
    String result = parse(test, StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1);
    System.out.println(result);
}
Note that this isn't some foolproof system that allows you to ignore correct URL encoding. It works here because v%E4n - the byte sequence 76 E4 6E - is not a valid sequence as per the UTF-8 scheme and the decoder can detect this.
If you reverse the order of the encodings the first string can happily (but incorrectly) be decoded as ISO-8859-1.
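To see why the order matters, here is a rough sketch that decodes both test strings with each charset using the standard URLDecoder (Java 10+ Charset overload). The class CharsetOrderSketch and the method decodeAs are hypothetical names. ISO-8859-1 maps every byte to some character, so it never fails and can only be tried last:

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK.
final class CharsetOrderSketch {

    // Lenient decode: URLDecoder substitutes malformed bytes instead of failing.
    static String decodeAs(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) {
        String utf8Encoded = "Fondation_Alliance_fran%C3%A7aise";
        String latin1Encoded = "Anv%E4ndare";

        // Correct pairings
        System.out.println(decodeAs(utf8Encoded, StandardCharsets.UTF_8));
        System.out.println(decodeAs(latin1Encoded, StandardCharsets.ISO_8859_1));

        // Wrong pairing that "succeeds": the two UTF-8 bytes become two
        // unrelated Latin-1 characters instead of one c-cedilla.
        System.out.println(decodeAs(utf8Encoded, StandardCharsets.ISO_8859_1));

        // Wrong pairing that is detectable: the lone byte E4 is malformed
        // UTF-8, so a replacement character appears in the output.
        System.out.println(decodeAs(latin1Encoded, StandardCharsets.UTF_8));
    }
}
```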
Note: HTTP doesn't care about percent-encoding and you can write a web server that accepts http://foo/%%%%% as a valid form. The URI spec mandates UTF-8 but this was done retroactively. It is really up to the server to describe what form its URIs should be and if you have to handle arbitrary URIs you need to be aware of this legacy.
I've written a bit more about URLs and Java here.
URL Decode in Java 6
Now you need to specify the character encoding of your string. Based on the information on the URLDecoder page:
Note: The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.
The following should work for you:
java.net.URLDecoder.decode(url, "UTF-8");
UrlDecoder decode several times
monta[%25]C3[%25]B1a
monta % C3 % B1a which has a UTF-8 multi-byte sequence
monta ñ a
It is important to decode with the same Charset that was used for encoding.
Evidently the value was URL encoded twice: first the text was encoded as UTF-8 bytes, and then each resulting % was encoded once more as %25.
The double encoding should be fixed at its source; otherwise you are left with an opaque workaround like this:
s = URLDecoder.decode(s, StandardCharsets.UTF_8);
s = URLDecoder.decode(s, StandardCharsets.UTF_8);
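As a small sketch of what each pass does to the example value (the class DoubleDecodeSketch and the method decodeTimes are made-up names, and the Charset overload assumes Java 10 or newer):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK.
final class DoubleDecodeSketch {

    // Applies URL decoding a fixed number of times, always as UTF-8.
    static String decodeTimes(String value, int times) {
        String result = value;
        for (int i = 0; i < times; i++) {
            result = URLDecoder.decode(result, StandardCharsets.UTF_8);
        }
        return result;
    }

    public static void main(String[] args) {
        String doubleEncoded = "monta%25C3%25B1a";
        System.out.println(decodeTimes(doubleEncoded, 1)); // %25 -> %, still encoded once
        System.out.println(decodeTimes(doubleEncoded, 2)); // %C3%B1 -> n with tilde
    }
}
```

The pass count is fixed on purpose: decoding in a loop until the value stops changing would corrupt data that legitimately contains a % or a + after decoding.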
UTF-8 URL Decode / Encode
The id value you see seems to have been decoded using the ISO-8859-1 charset instead of UTF-8. The encoding of the path part of a URL is not specified in Java EE and there is no standard API to set it. For query parameters you can use request.setCharacterEncoding before accessing any parameters to have them decoded correctly. The CharacterEncodingFilter does exactly that but has no influence on path parameters.
To make this work in Tomcat you have to set the URIEncoding attribute of the Connector element in its server.xml to "utf-8".
All you ever wanted to know about character encoding in a java webapp can be found in this excellent answer to a similar question.