How to Get UTF-8 Working in Java Webapps

How to get UTF-8 working in Java webapps?

Answering myself as the FAQ of this site encourages it. This works for me:

Characters such as äåö are mostly not problematic, because the default character set used by browsers and Tomcat/Java for webapps is latin1 (i.e. ISO-8859-1), which "understands" those characters.

Getting UTF-8 working under Java + Tomcat + Linux/Windows + MySQL requires the following:

Configuring Tomcat's server.xml

The connector must be configured to use UTF-8 when decoding URL (GET request) parameters:

<Connector port="8080" maxHttpHeaderSize="8192"
maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
enableLookups="false" redirectPort="8443" acceptCount="100"
connectionTimeout="20000" disableUploadTimeout="true"
compression="on"
compressionMinSize="128"
noCompressionUserAgents="gozilla, traviata"
compressableMimeType="text/html,text/xml,text/plain,text/css,text/javascript,application/x-javascript,application/javascript"
URIEncoding="UTF-8"
/>

The key part is URIEncoding="UTF-8" in the above example. This guarantees that Tomcat handles all incoming GET parameters as UTF-8 encoded.
As a result, when the user writes the following to the address bar of the browser:

 https://localhost:8443/ID/Users?action=search&name=*ж*

the character ж is handled as UTF-8 and is percent-encoded (usually by the browser, before the request even reaches the server) as %D0%B6.

POST requests are not affected by this.
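The percent-encoding described above can be reproduced outside of Tomcat. The following is a minimal sketch, assuming Java 10 or later (for the Charset overloads of URLEncoder and URLDecoder); the class name UriEncodingDemo is hypothetical and not part of Tomcat or the webapp. It only illustrates the encoding step a browser performs and the decoding step the connector performs when URIEncoding="UTF-8" is set.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tomcat or the webapp.
class UriEncodingDemo {

    // Percent-encode a parameter value the way a UTF-8 aware browser would.
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Decode it again, roughly what the connector does with URIEncoding="UTF-8".
    static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String zhe = "\u0436"; // the Cyrillic letter ж, written as an escape
        String encoded = encode(zhe);
        System.out.println(encoded);                     // %D0%B6
        System.out.println(decode(encoded).equals(zhe)); // true
    }
}
```

The character is written as a Unicode escape so that the result does not depend on the encoding of the source file itself.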

CharsetFilter

Then it's time to force the Java webapp to handle all requests and responses as UTF-8 encoded. This requires that we define a character set filter like the following:

package fi.foo.filters;

import javax.servlet.*;
import java.io.IOException;

public class CharsetFilter implements Filter {

    private String encoding;

    public void init(FilterConfig config) throws ServletException {
        encoding = config.getInitParameter("requestEncoding");
        if (encoding == null) encoding = "UTF-8";
    }

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain next)
            throws IOException, ServletException {
        // Respect the client-specified character encoding
        // (see HTTP specification section 3.4.1)
        if (null == request.getCharacterEncoding()) {
            request.setCharacterEncoding(encoding);
        }

        // Set the default response content type and encoding
        response.setContentType("text/html; charset=UTF-8");
        response.setCharacterEncoding("UTF-8");

        next.doFilter(request, response);
    }

    public void destroy() {
    }
}

This filter makes sure that if the browser hasn't specified the encoding used in the request, it is set to UTF-8.

The other thing this filter does is set the default response encoding, i.e. the encoding of the returned HTML (or whatever else is returned). The alternative is to set the response encoding in each controller of the application.

This filter has to be added to web.xml, the deployment descriptor of the webapp:

 <!--CharsetFilter start--> 

<filter>
<filter-name>CharsetFilter</filter-name>
<filter-class>fi.foo.filters.CharsetFilter</filter-class>
<init-param>
<param-name>requestEncoding</param-name>
<param-value>UTF-8</param-value>
</init-param>
</filter>

<filter-mapping>
<filter-name>CharsetFilter</filter-name>
<url-pattern>/*</url-pattern>
</filter-mapping>

The instructions for making this filter are found on the Tomcat wiki (http://wiki.apache.org/tomcat/Tomcat/UTF-8).
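To see why the request encoding set by the filter matters, here is a small sketch of what happens when UTF-8 bytes from a request body are decoded with the latin1 default instead. The class name RequestEncodingDemo and the helper decodeBody are hypothetical; a real servlet container does this decoding internally when request parameters are first read.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; a container performs this decoding internally.
class RequestEncodingDemo {

    // Turn raw request body bytes into a string using the given charset.
    static String decodeBody(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // "Päivi" as a UTF-8 aware browser would send it in a POST body
        byte[] body = "P\u00e4ivi".getBytes(StandardCharsets.UTF_8);

        // Decoded with the charset the browser actually used: the original name
        System.out.println(decodeBody(body, StandardCharsets.UTF_8));

        // Decoded with the latin1 default: each UTF-8 byte becomes its own character
        System.out.println(decodeBody(body, StandardCharsets.ISO_8859_1));
    }
}
```

In the second case the two bytes of ä (0xC3 0xA4) are turned into two separate latin1 characters, which is the typical garbled output seen when the request encoding is left unset.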

JSP page encoding

In your web.xml, add the following:

<jsp-config>
<jsp-property-group>
<url-pattern>*.jsp</url-pattern>
<page-encoding>UTF-8</page-encoding>
</jsp-property-group>
</jsp-config>

Alternatively, all JSP-pages of the webapp would need to have the following at the top of them:

 <%@page pageEncoding="UTF-8" contentType="text/html; charset=UTF-8"%>

If some kind of a layout with different JSP-fragments is used, then this is needed in all of them.

HTML-meta tags

JSP page encoding tells the JVM to handle the characters in the JSP page in the correct encoding.
Then it's time to tell the browser which encoding the HTML page is in.

This is done with the following at the top of each XHTML page produced by the webapp:

   <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fi">
<head>
<meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />
...

JDBC-connection

When using a database, the connection has to be defined as using UTF-8 encoding. This is done in context.xml or wherever the JDBC connection is defined, as follows:

      <Resource name="jdbc/AppDB" 
auth="Container"
type="javax.sql.DataSource"
maxActive="20" maxIdle="10" maxWait="10000"
username="foo"
password="bar"
driverClassName="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost:3306/ID_development?useUnicode=true&amp;characterEncoding=UTF-8"
/>

MySQL database and tables

The database itself must use UTF-8 encoding. This is achieved by creating the database with the following:

   CREATE DATABASE `ID_development` 
/*!40100 DEFAULT CHARACTER SET utf8 COLLATE utf8_swedish_ci */;

Then, all of the tables need to be in UTF-8 also:

   CREATE TABLE  `Users` (
`id` int(10) unsigned NOT NULL auto_increment,
`name` varchar(30) collate utf8_swedish_ci default NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_swedish_ci ROW_FORMAT=DYNAMIC;

The key part being CHARSET=utf8.

MySQL server configuration

The MySQL server has to be configured as well. Typically this is done on Windows by modifying the my.ini file and on Linux by modifying the my.cnf file.
In those files it should be defined that all clients connected to the server use utf8 as the default character set and that the default character set used by the server is also utf8.

   [client]
port=3306
default-character-set=utf8

[mysql]
default-character-set=utf8

MySQL procedures and functions

These also need to have the character set defined. For example:

   DELIMITER $$

DROP FUNCTION IF EXISTS `pathToNode` $$
CREATE FUNCTION `pathToNode` (ryhma_id INT) RETURNS TEXT CHARACTER SET utf8
READS SQL DATA
BEGIN

DECLARE path VARCHAR(255) CHARACTER SET utf8;

SET path = NULL;

...

RETURN path;

END $$

DELIMITER ;

GET requests: latin1 and UTF-8

If and when it's defined in Tomcat's server.xml that GET request parameters are encoded in UTF-8, the following GET requests are handled properly:

   https://localhost:8443/ID/Users?action=search&name=Petteri
https://localhost:8443/ID/Users?action=search&name=ж

Because ASCII characters are encoded in the same way in both latin1 and UTF-8, the string "Petteri" is handled correctly.

The Cyrillic character ж cannot be represented in latin1 at all. Because Tomcat is instructed to handle request parameters as UTF-8, it correctly decodes %D0%B6 back to that character.

If and when browsers are instructed to read the pages in UTF-8 encoding (with HTTP headers and the HTML meta tag), at least Firefox 2/3 and other browsers from this period all encode the character themselves as %D0%B6.

The end result is that all users with the name "Petteri" are found, and all users with the name "ж" are found as well.

But what about äåö?

The HTTP specification defines that by default URLs are encoded as latin1. This results in Firefox 2, Firefox 3 etc. encoding the following

    https://localhost:8443/ID/Users?action=search&name=*Päivi*

into the encoded version

    https://localhost:8443/ID/Users?action=search&name=*P%E4ivi*

In latin1 the character ä is encoded as %E4, even though the page/request/everything is defined to use UTF-8. The UTF-8 encoded version of ä is %C3%A4.

The result of this is that it's practically impossible for the webapp to correctly handle the request parameters from GET requests, as some characters are encoded in latin1 and others in UTF-8.
Notice: POST requests do work, as browsers encode all request parameters from forms completely in UTF-8 if the page is defined as being UTF-8.
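The mismatch between the two encodings of ä can be illustrated with a short sketch, assuming Java 10 or later. The class name Latin1VersusUtf8Demo is hypothetical. The sketch uses the JDK's own URLDecoder, so it only approximates what a container does; Tomcat's decoder may treat malformed input differently.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing latin1 and UTF-8 percent-encoding.
class Latin1VersusUtf8Demo {

    static String encodeLatin1(String value) {
        return URLEncoder.encode(value, StandardCharsets.ISO_8859_1);
    }

    static String encodeUtf8(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    static String decodeUtf8(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "P\u00e4ivi"; // "Päivi"

        System.out.println(encodeLatin1(name)); // P%E4ivi
        System.out.println(encodeUtf8(name));   // P%C3%A4ivi

        // A latin1-encoded value decoded as UTF-8 does not give back the name:
        // the lone byte 0xE4 is not a valid UTF-8 sequence.
        System.out.println(decodeUtf8("P%E4ivi").equals(name)); // false
    }
}
```

This is the situation described above: a server decoding everything as UTF-8 cannot recover a parameter that the browser sent in latin1.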

Stuff to read

A very big thank you to the writers of the following for providing the answers to my problem:

  • http://tagunov.tripod.com/i18n/i18n.html
  • http://wiki.apache.org/tomcat/Tomcat/UTF-8
  • http://java.sun.com/developer/technicalArticles/Intl/HTTPCharset/
  • http://dev.mysql.com/doc/refman/5.0/en/charset-syntax.html
  • http://cagan327.blogspot.com/2006/05/utf-8-encoding-fix-tomcat-jsp-etc.html
  • http://cagan327.blogspot.com/2006/05/utf-8-encoding-fix-for-mysql-tomcat.html
  • http://jeppesn.dk/utf-8.html
  • http://www.nabble.com/request-parameters-mishandle-utf-8-encoding-td18720039.html
  • http://www.utoronto.ca/webdocs/HTMLdocs/NewHTML/iso_table.html
  • http://www.utf8-chartable.de/

Important Note

MySQL's utf8 character set supports only the Basic Multilingual Plane, using UTF-8 characters of up to 3 bytes. If you need to go outside of that (certain scripts require 4 bytes of UTF-8 per character), then you either need to use a flavor of VARBINARY column type or use the utf8mb4 character set (which requires MySQL 5.5.3 or later). Just be aware that using the utf8 character set in MySQL won't work 100% of the time.
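To check whether a given string fits into the 3-byte limit, the UTF-8 length of its characters can be measured. This is a minimal sketch; the class name Utf8LengthDemo and the sample characters are chosen for illustration only.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class measuring UTF-8 byte lengths.
class Utf8LengthDemo {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));      // 1 (ASCII)
        System.out.println(utf8Length("\u00e4")); // 2 (ä)
        System.out.println(utf8Length("\u0436")); // 2 (ж)
        System.out.println(utf8Length("\u20ac")); // 3 (euro sign, still inside the BMP)

        // U+1F600 lies outside the Basic Multilingual Plane and needs 4 bytes
        String outsideBmp = new String(Character.toChars(0x1F600));
        System.out.println(utf8Length(outsideBmp)); // 4
    }
}
```

Any character that reports 4 bytes here is one that the 3-byte utf8 character set cannot store.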

Tomcat with Apache

One more thing: if you are using Apache + Tomcat + the mod_jk connector, then you also need to make the following changes:

  1. Add URIEncoding="UTF-8" to the 8009 connector in Tomcat's server.xml file; it is used by the mod_jk connector. <Connector port="8009" protocol="AJP/1.3" redirectPort="8443" URIEncoding="UTF-8"/>
  2. Go to your Apache folder, e.g. /etc/httpd/conf, and add AddDefaultCharset utf-8 to the httpd.conf file. Note: first check whether the directive already exists. If it does, update it with this line; otherwise you can add this line at the bottom.

Webapp with UTF8

By default, the Servlet encoding is ISO-8859-1. Adding this argument to Tomcat's JVM options can solve your issue:

-Djavax.servlet.request.encoding=UTF-8 

Making web application use UTF-8

First of all, it's important to understand what exactly -Dfile.encoding=UTF-8 does. It is a Sun/Oracle JVM-specific setting (which thus doesn't necessarily work in all other JVMs!) which basically instructs the JVM to read the Java .class files using the given encoding instead of the platform default one. So, setting this would only solve any possible Mojibake problems which are caused by using "special characters" in Java class/variable names or hardcoded String values in Java classes (yes, you read it right: only Java classes, not other files, thus definitely not properties files or JSF XHTML files, etc).

In all honesty, I can hardly imagine that this is the right solution to your concrete problem. Why would one ever use special characters straight in Java classes? Class/variable names should be in all English and localized text should be placed in resource bundle files. Every self-respecting Java developer adheres to this convention.

Given that fact, and assuming that you are also not using special characters in Java classes at all, I thus believe that your concrete problem is caused by something else. The problem symptoms are not described specifically enough (at which step exactly does it fail? which characters exactly do you expect and get instead? etc) to see the possible root cause of your concrete problem. I can at least tell that the URL pattern of your Spring filter is completely wrong. It must be mapped on /*:

<filter-mapping>
<filter-name>encoder</filter-name>
<url-pattern>/*</url-pattern>
</filter-mapping>

(By the way, you don't necessarily need Spring for this; a custom Filter with only 2 or 3 lines in its doFilter() implementation is already sufficient.)

See also:

  • Unicode - How to get the characters right?

Setting default character encoding for all jsps in a Java Web Application

You can configure the default character encoding for all JSPs in the web.xml file, so that it's done globally:

<jsp-config>   
<jsp-property-group id="defaultUtf8Encoder">
<url-pattern>*.jsp</url-pattern>
<page-encoding>UTF-8</page-encoding>
</jsp-property-group>
</jsp-config>

What you can also do is create a Filter which sets the response character encoding (and possibly the content type), as shown below (the example sets only the character encoding):

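import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class CharsetFilter implements Filter {

    // Encoding applied to every response passing through the filter
    String encoding = "UTF-8";

    public void destroy() {
        /* Do nothing */
    }

    public void doFilter(ServletRequest request,
                         ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        // Set the response encoding before the rest of the chain writes any output
        response.setCharacterEncoding(encoding);
        chain.doFilter(request, response);
    }

    public void init(FilterConfig config) throws ServletException {
        /* Do nothing */
    }
}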

Then you define the filter in the web.xml file

<filter>
<filter-name>charsetFilter</filter-name>
<filter-class>your.filter.package.CharsetFilter</filter-class>
</filter>

<filter-mapping>
<filter-name>charsetFilter</filter-name>
<url-pattern>/*</url-pattern>
</filter-mapping>

Notice that I'm applying the filter to /*, which matches every web app resource. This is handy if you want the filter to affect every single web resource.

Hopefully that should sort you out.

Strange problem, Tomcat Webapp UTF-8 Character can't display correctly after each restart or each redeployment

If you're sure the content is UTF-8, this could work. Set this line in the catalina.sh file (for example, just after the huge initial comment, long before exporting them):

export CATALINA_OPTS="$CATALINA_OPTS -Dfile.encoding=UTF-8"

Also, we don't know if you're using data from a database. Check if you have put it in there properly.
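To check whether the option has actually reached the JVM after a restart or redeployment, the effective values can be printed. This is only a sketch: the class name DefaultCharsetCheck is hypothetical, and in a real webapp the same two calls would more likely be logged from a servlet or a startup listener than run from main.

```java
import java.nio.charset.Charset;

// Hypothetical helper reporting the encoding settings the JVM is running with.
class DefaultCharsetCheck {

    // Summarise the file.encoding system property and the resulting default charset.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the reported values differ between restarts, the option is probably not being applied consistently by the startup script.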


