
Re: [cinjug-users] Java i18n Weirdness

To: Justin Fister <jfister@xxxxxxxxx>
Subject: Re: [cinjug-users] Java i18n Weirdness
From: Troy Davis <troy@xxxxxxxxxxxxxxxxxx>
Date: Fri, 3 Jun 2005 16:11:26 -0400
Cc: users@xxxxxxxxxx
Delivered-to: mailing list users@cinjug.org
In-reply-to: <bb97f32c05060308187b534726@mail.gmail.com>
Mailing-list: contact users-help@cinjug.org; run by ezmlm
References: <bb97f32c05060308187b534726@mail.gmail.com>
Hi Justin,

I just went through the process of upgrading an existing Java app to handle Unicode text, and it was definitely a learning curve... (BTW, thank you to everyone who sent suggestions; most of them helped!) Since I completed the upgrade work, I've found myself copying and pasting text from just about anything, and much to my surprise it actually works. Even in MSIE, marvel of marvels.

One of the problems you'll find in trying to convert between Windows cp1252, latin1 and other older encodings is that there's no easy way to detect which character set any given string is in. Supposedly Microsoft invested a pretty significant amount of developer time for MSIE so that it could detect character sets based on heuristic analysis. But short of getting that code and porting it to Java, I'd recommend switching to UTF-8 instead.
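To make the detection problem concrete, here is a minimal sketch (class and method names are made up for illustration) that decodes the same raw bytes under different charsets. Every decode "succeeds", so nothing in the bytes themselves tells you which interpretation was intended:

```java
import java.nio.charset.Charset;

class CharsetAmbiguity {
    // Decode the same raw bytes under the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are curly double quotes in windows-1252,
        // but C1 control characters in ISO-8859-1.
        byte[] quoted = { (byte) 0x93, 'h', 'i', (byte) 0x94 };
        String asCp1252 = decode(quoted, "windows-1252");
        String asLatin1 = decode(quoted, "ISO-8859-1");

        // Same bytes, two different strings, and no error either way.
        System.out.println("same text? " + asCp1252.equals(asLatin1));

        // The UTF-8 encoding of e-acute is the two bytes 0xC3 0xA9.
        // Read back as ISO-8859-1 they become two separate characters,
        // which is the classic "garbage text" symptom.
        byte[] eAcuteUtf8 = { (byte) 0xC3, (byte) 0xA9 };
        System.out.println(decode(eAcuteUtf8, "UTF-8").length());      // one character
        System.out.println(decode(eAcuteUtf8, "ISO-8859-1").length()); // two characters
    }
}
```

This is also one plausible way for text to turn into garbage in a standalone JDBC program: bytes written under one charset and read back under another.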

In order to upgrade my company's app to be Unicode-safe, I had to address several different levels of concerns:

1. The database needed to be Unicode-safe. We use MySQL, but you have to use version 4.1.1+ to get that. Most hosting providers are still using 3.x or 4.0.x. One of our clients' sites is on a server that has 4.0.something, and it became a real roadblock. We wound up recompiling their jar file so that the DAO connection string pointed to our own database server. Slowed down the site a bit, but it works.

The keys to this bit of magic turned out to be four-fold:

    - MySQL >= 4.1.1.

    - Changing the connection string to look like
      jdbc:mysql://server.com/db_name?useUnicode=true&characterEncoding=utf8&autoReconnect=true

    - Exporting and converting the data to utf8.

    - Changing the create table clauses to include "ENGINE=MyISAM DEFAULT CHARSET=utf8;" at the end.

I also found myself typing "set names 'utf8';" at the command line quite a bit before uploading converted text.
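For the "exporting and converting" step, a conversion could be sketched in Java roughly as follows. This assumes you already know the exported data is windows-1252 (it cannot be detected reliably, as noted above); the class and method names are hypothetical, not taken from our app:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ToUtf8 {
    // Re-encode bytes from a known legacy charset as UTF-8.
    static byte[] transcode(byte[] legacy, Charset from) {
        String text = new String(legacy, from);       // bytes -> characters
        return text.getBytes(StandardCharsets.UTF_8); // characters -> UTF-8 bytes
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // "naiveté" as it would sit in a cp1252/latin1 export:
        // the e-acute is the single byte 0xE9.
        byte[] exported = { 'n', 'a', 'i', 'v', 'e', 't', (byte) 0xE9 };
        byte[] utf8 = transcode(exported, cp1252);

        // Prints "7 bytes in, 8 bytes out": e-acute takes two bytes in UTF-8.
        System.out.println(exported.length + " bytes in, " + utf8.length + " bytes out");
    }
}
```

For a whole dump file, the same method could be fed from Files.readAllBytes and its result written back with Files.write, as long as the file fits in memory.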

2. The JDBC driver needed to be a recent version, so I had to upgrade Connector/J. Not a big deal for our own servers, but some clients are on other companies' servers, and that took some time and persuasion.
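With a current Connector/J on the classpath, the connection string from item 1 might be assembled and used like this. It is only a sketch: the host, database name, and the class itself are placeholders, and open() is not called from main() because it needs the driver and a reachable server:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class Utf8Jdbc {
    // Build the connection URL with the Unicode-related parameters from item 1.
    static String url(String host, String db) {
        return "jdbc:mysql://" + host + "/" + db
                + "?useUnicode=true&characterEncoding=utf8&autoReconnect=true";
    }

    // Opens a connection; requires Connector/J and a running MySQL server.
    static Connection open(String host, String db, String user, String password)
            throws SQLException {
        return DriverManager.getConnection(url(host, db), user, password);
    }

    public static void main(String[] args) {
        // Placeholder host and database name.
        System.out.println(url("server.com", "db_name"));
    }
}
```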

3. Page headers must specify the UTF-8 character set, so your first line in a JSP file might look like this: <%@page language="java" contentType="text/html;charset=UTF-8" pageEncoding="UTF-8"%>

4. If you're going to have page headers that say UTF-8, your HTML content-type meta tags should be consistent, and appear just after the <head> tag: <meta http-equiv="content-type" content="text/html;charset=UTF-8">

5. Whatever processes your form data will require something like this: request.setCharacterEncoding("UTF-8");
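To see why that call matters: a browser submitting from a UTF-8 page sends non-ASCII characters as percent-encoded UTF-8 bytes, and if the request encoding is never set, a servlet container typically falls back to ISO-8859-1 when decoding them. The sketch below uses java.net.URLDecoder as a stand-in for the container's parameter parsing (it is not what Tomcat does internally, and the class name is made up):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class FormDecoding {
    // Decode one percent-encoded form value under the given encoding.
    static String decodeParam(String rawValue, String encoding) {
        try {
            return URLDecoder.decode(rawValue, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        // "naiveté" as submitted from a UTF-8 page: e-acute arrives as %C3%A9.
        String raw = "naivet%C3%A9";

        String right = decodeParam(raw, "UTF-8");      // 7 characters, ends in e-acute
        String wrong = decodeParam(raw, "ISO-8859-1"); // 8 characters, ends in two junk characters

        System.out.println(right.length() + " vs " + wrong.length());
    }
}
```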

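6. For #5 to take effect consistently, the encoding has to be set before anything reads the request parameters, and a servlet filter is the easiest way to guarantee that. You'll need SetCharacterEncodingFilter.class under your WEB-INF/classes directory (in a filters subdirectory, to match the package name below); WEB-INF/lib is only for jar files. Look in the Tomcat examples for a copy of this. You'll need to have a web.xml file that looks something like this: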

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE web-app
    PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
    "http://java.sun.com/dtd/web-app_2_3.dtd">

<web-app>
    <display-name>My App</display-name>
    <description>Something about My App.</description>
    <filter>
        <filter-name>Set Character Encoding</filter-name>
        <filter-class>filters.SetCharacterEncodingFilter</filter-class>
        <init-param>
            <param-name>encoding</param-name>
            <param-value>UTF-8</param-value>
        </init-param>
    </filter>
    <filter-mapping>
        <filter-name>Set Character Encoding</filter-name>
        <servlet-name>action</servlet-name>
    </filter-mapping>
</web-app>

HTH,
Troy

__________________
Troy Davis
Technology Director
Metaphor Studio
538 Reading Road
Loft 200
Cincinnati, Ohio 45202

Tel: 513-723-0290
Fax: 513-723-0670
http://metaphorstudio.com

On Jun 3, 2005, at 11:18 AM, Justin Fister wrote:

I have a question for any Java gurus with i18n experience.  I'm having
a hard time understanding the way things work with a webapp --
actually why it doesn't work.  Here's what's going on... I have a
web-based admin that contains an HTML textarea field in which users
enter in text.  Often the text contains special Windows characters
(such as curly quotes) and Latin-1 characters for words like
"naiveté".  In a servlet, I use the HttpServletRequest.getParameter()
method to retrieve the text and dump it into a MySQL database that
uses Latin1 as its default charset.  That works fine -- no problems.
The text can later be viewed fine through a web page as well as
through Mysql Control Center.

The problem occurs with another Java program I wrote which iterates
over the database records, does some string manipulation to the text,
and updates the records.  After this program is run, all of the
Windows characters and Latin1 characters show up as garbage text.

So, I'm wondering why, in each case, I do nothing special to convert
character sets, but it works for the initial insert, but not for the
update.  Why does my web-based app using
HttpServletRequest.getParameter() seem to handle character sets
differently than my standalone app using JDBC?  Each are run on the
same machine.

Any help would be appreciated.

Thanks!
Justin
