UTF-8 Not Working in HTML Forms

UTF-8 not working in HTML forms

In your HTML, add this meta tag:

 <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

Also send this header from PHP at the top of the script, before any output:

 header("Content-Type: text/html;charset=UTF-8");

[EDIT]:

One more tip: save the file as UTF-8 without a BOM. You can do that in Notepad++ or any decent editor.
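If you are unsure whether an editor really saved the file without a BOM, you can check the first three bytes. The following is only a minimal sketch in Java; the class and method names (Utf8Bom, hasBom, stripBom) are made up for illustration and are not part of any library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class Utf8Bom {
    /** Returns true if the data starts with the UTF-8 byte order mark (EF BB BF). */
    static boolean hasBom(byte[] data) {
        return data.length >= 3
                && data[0] == (byte) 0xEF
                && data[1] == (byte) 0xBB
                && data[2] == (byte) 0xBF;
    }

    /** Returns the data without a leading BOM; returns it unchanged if there is none. */
    static byte[] stripBom(byte[] data) {
        return hasBom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of the file to check as the first argument.
        byte[] file = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(hasBom(file) ? "BOM present" : "no BOM");
    }
}
```

In a PHP file, those three bytes sit in front of the opening <?php tag, so they are sent as output before header() gets a chance to run.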

HTML : Form does not send UTF-8 format inputs

I added the meta tag : nothing changed.

It indeed doesn't have any effect when the page is served over HTTP rather than opened from the local disk (i.e. the page's URL is http://... instead of e.g. file://...). Over HTTP, the charset in the HTTP response header is used. You've already set it as below:

<%@page pageEncoding="UTF-8"%>

This will not only write out the HTTP response using UTF-8, but also set the charset attribute in the Content-Type response header.

The web browser uses this charset both to interpret the response and to encode any HTML form parameters it submits.



I added the accept-charset attribute in form : nothing changed.

It only has an effect in Microsoft Internet Explorer, and even there it behaves incorrectly. Never use it. All real web browsers instead use the charset attribute specified in the Content-Type header of the response. Even MSIE does it the right way as long as you do not specify the accept-charset attribute. As said before, you have already set it properly via pageEncoding.


Get rid of both the meta tag and the accept-charset attribute. They do not have any useful effect, they will only confuse you in the long term, and they even make things worse when the end user uses MSIE. Just stick to pageEncoding. Instead of repeating pageEncoding in every JSP page, you could also set it globally in web.xml as below:

<jsp-config>
    <jsp-property-group>
        <url-pattern>*.jsp</url-pattern>
        <page-encoding>UTF-8</page-encoding>
    </jsp-property-group>
</jsp-config>

As said, this tells the JSP engine to write the HTTP response output using UTF-8 and to set that charset in the HTTP response header too. The web browser will use the same charset to encode the HTTP request parameters before sending them back to the server.
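To see what "encoding the request parameters with the page charset" means on the wire, here is a small sketch that percent-encodes the same value under two charsets. It uses java.net.URLEncoder as an approximation of what a browser does for application/x-www-form-urlencoded; the class name FormEncodingDemo and the sample value are assumptions made for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class, for illustration only.
final class FormEncodingDemo {
    /** Percent-encodes a form value using the given charset name. */
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String name = "Fl\u00E1vio Jos\u00E9"; // "Flávio José", written with escapes
        // A page served as UTF-8: each accented letter becomes two octets.
        System.out.println(encode(name, "UTF-8"));      // Fl%C3%A1vio+Jos%C3%A9
        // A page served as ISO-8859-1: each accented letter becomes one octet.
        System.out.println(encode(name, "ISO-8859-1")); // Fl%E1vio+Jos%E9
    }
}
```

The server has to decode with the same charset the browser encoded with, otherwise the octets are misread.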

Your only missing step is to tell the server that it must use UTF-8 to decode the HTTP request parameters before returning them from getParameterXxx() calls. How to do that globally depends on the HTTP request method. Given that you're using the POST method, this is relatively easy to achieve with the servlet filter class below, which automatically hooks on all requests:

@WebFilter("/*")
public class CharacterEncodingFilter implements Filter {

    @Override
    public void init(FilterConfig config) throws ServletException {
        // NOOP.
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
        request.setCharacterEncoding("UTF-8");
        chain.doFilter(request, response);
    }

    @Override
    public void destroy() {
        // NOOP.
    }
}

That's all. In Servlet 3.0+ (Tomcat 7 and newer) you don't need additional web.xml configuration.

You only need to keep in mind that it's very important that the setCharacterEncoding() method is called before the POST request parameters are obtained for the first time using any of the getParameterXxx() methods. This is because they are parsed only once, on first access, and then cached in server memory.

So e.g. below sequence is wrong:

String foo = request.getParameter("foo"); // Wrong encoding.
// ...
request.setCharacterEncoding("UTF-8"); // Attempt to set it.
String bar = request.getParameter("bar"); // STILL wrong encoding!

Doing the setCharacterEncoding() job in a servlet filter guarantees that it runs in time (at least, before any servlet).
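The parse-once-then-cache behaviour can be modelled without a servlet container. The class below is a hypothetical, heavily simplified stand-in for a request object (it is not the servlet API, and real containers are more involved); it assumes ISO-8859-1 as the default when no encoding has been set.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a request; NOT the real servlet API.
final class LazyParamRequest {
    private final String body;               // raw application/x-www-form-urlencoded body
    private String encoding = "ISO-8859-1";  // assumed default when nothing is set
    private Map<String, String> cache;       // filled on the first parameter access

    LazyParamRequest(String body) {
        this.body = body;
    }

    void setCharacterEncoding(String encoding) {
        this.encoding = encoding;
    }

    String getParameter(String name) {
        if (cache == null) { // parse once with the current encoding, then reuse
            cache = new HashMap<>();
            for (String pair : body.split("&")) {
                String[] kv = pair.split("=", 2);
                cache.put(decode(kv[0]), kv.length > 1 ? decode(kv[1]) : "");
            }
        }
        return cache.get(name);
    }

    private String decode(String s) {
        try {
            return URLDecoder.decode(s, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        LazyParamRequest late = new LazyParamRequest("foo=%C3%A9&bar=%C3%A9");
        late.getParameter("foo");            // triggers parsing with the default
        late.setCharacterEncoding("UTF-8");  // too late: the cache is already built
        System.out.println(late.getParameter("bar").length());  // 2 (U+00C3 U+00A9)

        LazyParamRequest early = new LazyParamRequest("foo=%C3%A9&bar=%C3%A9");
        early.setCharacterEncoding("UTF-8"); // set before the first access
        System.out.println(early.getParameter("bar").length()); // 1 (U+00E9)
    }
}
```

The lengths are printed instead of the strings so that the result does not depend on the console encoding.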


If you'd also like the server to decode GET (not POST) request parameters as UTF-8 (the parameters that appear after the ? character in the URL), you need to configure that on the server side. It's not possible to configure it via the servlet API. If you're using Tomcat, for example, add the URIEncoding="UTF-8" attribute to the <Connector> element in Tomcat's own /conf/server.xml.

If you still see mojibake in the console output of System.out.println() calls, chances are that stdout itself is not configured to use UTF-8. How to fix that depends on what is responsible for interpreting and presenting stdout. If you're using Eclipse as your IDE, for example, set Window > Preferences > General > Workspace > Text File Encoding to UTF-8.
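Independently of the IDE setting, the Java side can be made explicit by wrapping stdout in a PrintStream with a fixed charset. This is only a sketch; the class name Utf8Stdout and the helper utf8() are made up for illustration, and whatever displays the output must still interpret the bytes as UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper, for illustration only.
final class Utf8Stdout {
    /** Wraps a stream so that printed text is always encoded as UTF-8. */
    static PrintStream utf8(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        out.println("Fl\u00E1vio Jos\u00E9"); // bytes written are UTF-8 regardless of the platform default
    }
}
```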

See also:

  • Unicode - How to get the characters right?

JSP not showing correct UTF-8 contents for HTML form POST

This is caused by Tomcat, but the root problem is the Java Servlet 4 specification, which is incorrect and outdated.

Originally HTML 4.01 said that application/x-www-form-urlencoded encoded octets should be decoded as US-ASCII. The servlet specification changed this to say that, if the request encoding is not specified, the octets should be decoded as ISO-8859-1. Tomcat is simply following the servlet specification.

There are two problems with the Java servlet specification. The first is that the modern interpretation of application/x-www-form-urlencoded is that encoded octets should be decoded using UTF-8. The second problem is that tying the octet decoding to the resource charset confuses two levels of decoding.

Take another look at this POST content:

fullName=Fl%C3%A1vio+Jos%C3%A9

You'll notice that it is ASCII! It doesn't matter whether you consider the charset of the POST HTTP request to be ISO-8859-1, UTF-8, or US-ASCII: you'll still wind up with exactly the same Unicode characters before decoding the octets. Which encoding is used to decode the encoded octets is a completely separate matter.

As a further example, let's say I download a text file instructions.txt that is clearly marked as ISO-8859-1, and it contains the URI https://example.com/example.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9. Just because the text file has a charset of ISO-8859-1, does that mean I need to decode %C3%A1 using ISO-8859-1? Of course not! The charset used for decoding URI characters is a separate level of decoding on top of the resource content type charset. Similarly, the octets of values encoded in application/x-www-form-urlencoded should be decoded using UTF-8, regardless of the underlying charset of the resource.
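The two levels can be tried out directly. The sketch below first checks that the raw body reads the same under all three charsets, then decodes the percent-encoded octets once as UTF-8 and once as ISO-8859-1. The class and method names are made up for illustration, and java.net.URLDecoder stands in for the container's own decoder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class FormDecodingDemo {
    /** Level 1: the raw body is plain ASCII, so these charsets all read it identically. */
    static boolean rawTextIsCharsetIndependent(String body) {
        byte[] raw = body.getBytes(StandardCharsets.US_ASCII);
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        String asUtf8 = new String(raw, StandardCharsets.UTF_8);
        return body.equals(asLatin1) && body.equals(asUtf8);
    }

    /** Level 2: the percent-encoded octets are turned into characters using the given charset. */
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String body = "Fl%C3%A1vio+Jos%C3%A9";
        System.out.println(rawTextIsCharsetIndependent(body));      // true
        System.out.println(decode(body, "UTF-8").length());         // 11 characters: the intended name
        System.out.println(decode(body, "ISO-8859-1").length());    // 13 characters: mojibake
    }
}
```

Decoding as ISO-8859-1 turns each two-octet sequence into two separate characters, which is exactly the mojibake seen in the JSP.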

There are several workarounds, some of them found by looking at the Tomcat character encoding FAQ on how to "use UTF-8 everywhere".

Set the request character encoding in your web.xml file.

Add the following to your WEB-INF/web.xml file:

<request-character-encoding>UTF-8</request-character-encoding>

This setting is agnostic of the servlet container implementation, and is set forth in the servlet specification. (Alternatively, you should be able to put it in Tomcat's conf/web.xml file, if you want a global setting and don't mind changing the Tomcat configuration.)

Set the SetCharacterEncodingFilter in your web.xml file.

Tomcat has a proprietary equivalent: use the org.apache.catalina.filters.SetCharacterEncodingFilter in the WEB-INF/web.xml file, as the Tomcat FAQ above mentions, and as illustrated by https://stackoverflow.com/a/37833977/421049, excerpted below:

<filter>
    <filter-name>setCharacterEncodingFilter</filter-name>
    <filter-class>org.apache.catalina.filters.SetCharacterEncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>UTF-8</param-value>
    </init-param>
</filter>

<filter-mapping>
    <filter-name>setCharacterEncodingFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>

This will make your web application only work on Tomcat, so it's better to put this in the Tomcat installation conf/web.xml file instead, as the post above mentions. In fact Tomcat's conf/web.xml installations have these two sections, but commented out; simply uncomment them and things should work.

Force the request character encoding to UTF-8 in the JSP or servlet.

You can force the character encoding of the servlet request to UTF-8, somewhere early in the JSP:

<% request.setCharacterEncoding("UTF-8"); %>

But that is ugly, unwieldy, error-prone, and goes against modern best practices—JSP scriptlets shouldn't be used anymore.

Hopefully we can get a newer Java servlet specification to remove any relationship between the resource charset and the decoding of application/x-www-form-urlencoded octets, and simply state that application/x-www-form-urlencoded octets must be decoded as UTF-8, as is modern practice as clarified by the latest W3C and WHATWG specifications.

Update: I've updated the Tomcat FAQ on Character Encoding Issues with this information.

Perl and HTML: UTF8 does not work in forms

It's due to a bug in your version of decode_utf8.

$ perl -Mutf8 -MEncode -E'
$u = $d = encode_utf8("é");
utf8::upgrade($u); # Changes how the string is stored internally
say $u eq $d ?1:0;
say decode_utf8($d) eq decode_utf8($u) ?1:0;
'
1
0

As you can see, $u and $d are equal, but your version of decode_utf8 decodes them differently. Specifically, it returns $u unchanged.

This has been fixed in newer versions of Encode. (2.53, I think.)

The easier way to address the problem is to fix your own bug. Using use open, you tell your program to decode STDIN from UTF-8 before unescaping the url-encoding and decoding from UTF-8 a second time.

Fix:

#!/usr/bin/perl

use utf8; # Source code is encoded using UTF-8.
use open ':encoding(UTF-8)'; # Set default encoding for file handles.
BEGIN { binmode(STDOUT, ':encoding(UTF-8)'); } # HTML
BEGIN { binmode(STDERR, ':encoding(UTF-8)'); } # Error log

use Encode;

# Save query-string in hash:
$querystring = $ENV{ 'QUERY_STRING' };
read(STDIN, my $poststring, $ENV{CONTENT_LENGTH});
if (($querystring ne "") && ($poststring ne "")) { $querystring .= "&$poststring"; }
else { $querystring .= $poststring; }

$querystring =~ s/&/=/gi;
%query = split( /=/, $querystring );
foreach $key ( keys( %query ) ) {
$query{$key} =~ tr/+/ /;
$query{$key} =~ s/%([\da-f][\da-f])/chr( hex($1) )/egi;
$uquer{$key} = decode_utf8( $query{$key} );
}

print "Content-Type: text/html; charset=\"UTF-8\"\n\n";
print <<END;
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" content="text/html; charset=utf-8">
</HEAD>
<BODY>
<FORM NAME="frmeing" METHOD="POST">
<INPUT NAME="df_kurs" TYPE="TEXT" VALUE="$uquer{'df_kurs'}">
<INPUT TYPE="SUBMIT">
</FORM>
</BODY>
</HTML>
END

But you really should use CGI.pm.

#!/usr/bin/perl

use strict; # Always!
use warnings; # Always!

use utf8; # Source code is encoded using UTF-8.
use open ':encoding(UTF-8)'; # Set default encoding for file handles.
BEGIN { binmode(STDOUT, ':encoding(UTF-8)'); } # HTML
BEGIN { binmode(STDERR, ':encoding(UTF-8)'); } # Error log

use CGI qw( -utf8 );
use Encode;

my $cgi = CGI->new();
my %uquer = $cgi->Vars();

print $cgi->header('text/html; charset=UTF-8');
print <<END;
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" content="text/html; charset=utf-8">
</HEAD>
<BODY>
<FORM NAME="frmeing" METHOD="POST">
<INPUT NAME="df_kurs" TYPE="TEXT" VALUE="$uquer{'df_kurs'}">
<INPUT TYPE="SUBMIT">
</FORM>
</BODY>
</HTML>
END

accept-charset=UTF-8 parameter doesn't do anything when used in a form

The question, as asked, is self-contradictory: the heading says that the accept-charset parameter does not do anything, whereas the question body says that when the accept-charset attribute (this is the correct term) is used, “the headers have different accept charset option in the request header”. I suppose a negation is missing from the latter statement.

Browsers send Accept-Charset parameters in HTTP request headers according to their own principles and settings. For example, my Chrome sends Accept-Charset:windows-1252,utf-8;q=0.7,*;q=0.3. Such a header is typically ignored by server-side software, but it could be used (and it was designed to be used) to determine which encoding is to be used in the server response, in case the server-side software (a form handler, in this case) is capable of using different encodings in the response.

The accept-charset attribute in a form element is not expected to affect HTTP request headers, and it does not. It is meant to specify the character encoding to be used for the form data in the request, and this is what it actually does. The HTML 4.01 spec is obscure about this, but the W3C HTML5 draft puts it much better, though for some odd reason it uses the plural: “gives the character encodings that are to be used for the submission”. I suppose the reason is that you could specify alternate encodings, to prepare for situations where a browser is unable to use your preferred encoding. And what actually happens in Chrome, for example, is that if you use accept-charset="foobar utf-8", then UTF-8 is used.

In practice, the attribute is used to make the encoding of the data submission different from the encoding of the page containing the form. Suppose your page is ISO-8859-1 encoded and someone types Greek or Hebrew letters into your form. Browsers will have to do some error recovery, since those characters cannot be represented in ISO-8859-1. (In practice they turn the characters into numeric character references, which is logically all wrong but pragmatically perhaps the best they can do.) Using <form accept-charset=utf-8> helps here: no matter what the encoding of the page is, the form data will be sent as UTF-8, which can handle any character.
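The error recovery described above can be imitated in a few lines. This is a sketch of the idea rather than of any browser's actual code: every code point that the target charset cannot represent is replaced with a decimal numeric character reference. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical demo class, for illustration only.
final class NcrFallback {
    /** Replaces each code point the charset cannot encode with a reference such as &#945;. */
    static String withNumericReferences(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            String ch = new String(Character.toChars(codePoint));
            if (encoder.canEncode(ch)) {
                out.append(ch);                                // representable: keep as is
            } else {
                out.append("&#").append(codePoint).append(';'); // not representable: fall back
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String typed = "caf\u00E9 \u03B1\u03B2"; // "café" followed by Greek alpha and beta
        // ISO-8859-1 has no Greek letters, so they turn into references.
        System.out.println(withNumericReferences(typed, "ISO-8859-1").endsWith("&#945;&#946;")); // true
        // UTF-8 can represent everything, so the text is left untouched.
        System.out.println(withNumericReferences(typed, "UTF-8").equals(typed));                 // true
    }
}
```

The server then receives the literal text &#945; and cannot tell whether the user typed a Greek letter or those six characters, which is why sending UTF-8 is the cleaner option.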

If you wish to tell the form handler which encoding it should use in its response, then you can add a hidden (or non-hidden) field into the form for that.

Charset UTF-8 not working on PHP contact form

You aren't setting the mail header there; you are setting the HTTP header. The header function sends a raw HTTP header and does nothing for the email you are sending:

   header('Content-Type: text/html;charset=UTF-8');

You need to add the header "Content-Type: text/html; charset=UTF-8" (for HTML Email bodies) or "Content-Type: text/plain; charset=UTF-8" (for Plain Text Email bodies) to your mail function. Like this.

$headers = "Content-Type: text/html; charset=UTF-8";
mail($to, $subject, $message, $headers);

Additionally, for email, each line should be separated with a CRLF (\r\n) instead of merely a linefeed (\n). A full example might look like this:

<?php

$crlf = "\r\n";

// Get data
$name = strip_tags($_POST['name']);
$email = strip_tags($_POST['email']);
$service = strip_tags($_POST['service']);
$phone = strip_tags($_POST['phone']);
$phoneconfirm = strip_tags($_POST['phoneconfirm']);
$priority = strip_tags($_POST['priority']);
$subject = strip_tags($_POST['subject']);
$message = strip_tags($_POST['message']);

// Parse/format/verify data
$to = "THETOEMAIL@GOES.HERE";
$from = 'THEFROMEMAIL@GOES.HERE';
// Use separate variables for the mail subject and body, so that the
// submitted $subject is not overwritten before it is put into the body.
$mailSubject = "Via Website";
$mailBody = "De: $name$crlf" .
    "E-Mail: $email$crlf" .
    "Serviço: $service$crlf" .
    "Telefone/Celular: $phone$crlf" .
    "Ligar/Retornar: $phoneconfirm$crlf" .
    "Prioridade: $priority$crlf" .
    "Assunto: $subject$crlf" .
    "Mensagem:$crlf" .
    $message;

// Set up the EMAIL headers, particularly to support UTF-8.
// These are sent out with the message for the receiving email client to use.
$headers = 'From: ' . $from . $crlf .
    'Reply-To: ' . $from . $crlf .
    'Content-Type: text/plain; charset=UTF-8' . $crlf .
    'Para: WebSite' . $crlf .
    'X-Mailer: PHP/' . phpversion();

// Then we pass the headers into our mail function
mail($to, $mailSubject, $mailBody, $headers);
?>

Reference:

  • header function
  • mail function

Is there any benefit to adding accept-charset=UTF-8 to HTML forms, if the page is already in UTF-8?

If the page is already interpreted by the browser as being UTF-8, setting accept-charset="utf-8" does nothing.

If you set the encoding of the page to UTF-8 in a <meta> and/or HTTP header, it will be interpreted as UTF-8, unless the user deliberately goes to the View->Encoding menu and selects a different encoding, overriding the one you specified.

In that case, accept-charset would have the effect of setting the submission encoding back to UTF-8 in the face of the user messing about with the page encoding. However, this still won't work in IE, due to the previously discussed problems with accept-charset in that browser.

So it's IMO doubtful whether it's worth including accept-charset to fix the case where a non-IE user has deliberately sabotaged the page encoding (possibly messing up more on your page than just the form).

Personally, I don't bother.


