
Re: [cinjug-users] Java i18n Weirdness

To: Troy Davis <troy@xxxxxxxxxxxxxxxxxx>
Subject: Re: [cinjug-users] Java i18n Weirdness
From: Justin Fister <jfister@xxxxxxxxx>
Date: Fri, 3 Jun 2005 17:10:39 -0400
Cc: users@xxxxxxxxxx
Delivered-to: mailing list users@cinjug.org
Thanks for the excellent info!  I think that is the most thorough, yet
to-the-point, tutorial for converting a webapp to UTF-8 that I've
read.  It will probably be a little while before I bite the bullet and
undertake all of these conversions and upgrades.

In the meantime, I was able to use the java.nio classes to convert the
Latin1 text from the database into Java's native Unicode.  After some
String manipulation I then had to convert the Unicode back to Latin1
before calling the ResultSet.updateBytes method.  I ran into another
issue though when I had to insert the same Latin1 text into a new
record in a different table.  I couldn't find a way to send an SQL
string as a byte array using a Statement, so I had to use a
PreparedStatement with the setBytes() method.  What a pain!!!  I'd
love to know if anyone else dealing with this stuff has a more elegant
way of doing it.  I'm also still wondering why I was able to grab text
from a browser POST and put it into the database (without any charset
conversions or byte arrays) and not have it end up as garbage text.  I
may never know.
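
The workaround described above can be sketched roughly as follows. This is
a minimal illustration under stated assumptions, not the code that was
actually run: the class name, the method names, and the notes_copy table
with its body column are all hypothetical, and ISO-8859-1 stands in for
"Latin1" (if the data contains Windows curly quotes, windows-1252 may be
the charset that actually matches what is stored).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class Latin1RoundTrip {
    // Assumption: the stored bytes are ISO-8859-1.
    private static final Charset LATIN1 = Charset.forName("ISO-8859-1");

    // Raw Latin1 bytes from the database -> Java String (Unicode).
    public static String decode(byte[] raw) {
        return LATIN1.decode(ByteBuffer.wrap(raw)).toString();
    }

    // Java String -> Latin1 bytes, ready for updateBytes()/setBytes().
    public static byte[] encode(String text) {
        ByteBuffer buf = LATIN1.encode(CharBuffer.wrap(text));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // Hypothetical insert into another table; table and column names are made up.
    public static void insertCopy(Connection conn, byte[] latin1) throws SQLException {
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO notes_copy (body) VALUES (?)")) {
            ps.setBytes(1, latin1); // bytes are sent as-is, with no charset conversion
            ps.executeUpdate();
        }
    }

    public static void main(String[] args) {
        // "naiveté" as Latin1 bytes; 0xE9 is e-acute.
        byte[] fromDb = { 'n', 'a', 'i', 'v', 'e', 't', (byte) 0xE9 };
        String text = decode(fromDb).replace("naive", "Naive"); // some string manipulation
        byte[] back = encode(text);
        // The accented character survives the round trip as a single 0xE9 byte.
        System.out.println(back.length + " bytes, last = 0x"
                + Integer.toHexString(back[back.length - 1] & 0xFF));
    }
}
```

Keeping the decode and encode steps in one place at least means the charset
name is written once, so switching it later is a one-line change.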

Troy, thanks again for the information -- I will definitely be using
it as a guide on future i18n projects.

Justin

On 6/3/05, Troy Davis <troy@xxxxxxxxxxxxxxxxxx> wrote:
> Hi Justin,
> 
> I just went through the process of upgrading an existing java app to
> handle Unicode text, and it was definitely a learning curve...
> (BTW, thank you to everyone who sent suggestions; most of them
> helped!) Since I completed the upgrade work, I've found myself
> copying and pasting text from just about anything, and much to my
> surprise it actually works. Even in MSIE, marvel of marvels.
> 
> One of the problems you'll find in trying to convert between Windows
> cp1252, latin1 and other older encodings is that there's no easy way
> to detect which character set any given string is in. Supposedly
> Microsoft invested a pretty significant amount of developer time for
> MSIE so that it could detect character sets based on heuristic
> analysis. But short of getting that code and porting it to Java, I'd
> recommend switching to UTF-8 instead.
> 
> In order to upgrade my company's app to be Unicode-safe, I had to
> address several different levels of concerns:
> 
> 1. The database needed to be Unicode-safe. We use MySQL, but you have
> to use version 4.1.1+ to get that. Most hosting providers are still
> using 3.x or 4.0.x. One of our clients' sites is on a server that has
> 4.0.something, and it became a real roadblock. We wound up
> recompiling their jar file so that the DAO connection string pointed
> to our own database server. Slowed down the site a bit, but it works.
> 
> The keys to this bit of magic turned out to be four-fold:
> 
>      - MySQL >= 4.1.1.
> 
>      - Changing the connection string to look like:
>        jdbc:mysql://server.com/db_name?useUnicode=true&characterEncoding=utf8&autoReconnect=true
> 
>      - Exporting and converting the data to utf8.
> 
>      - Changing the create table clauses to include "ENGINE=MyISAM
> DEFAULT CHARSET=utf8;" at the end.
> 
> I also found myself typing "set names 'utf8';" at the command line
> quite a bit before uploading converted text.
> 
> 2. The JDBC driver needed to be a recent version, so I had to upgrade
> Connector/J. Not a big deal for our own servers, but some clients are
> on other companies' servers, and that took some time and persuasion.
> 
> 3. Page headers must specify the UTF-8 character set, so your first
> line in a JSP file might look like this:
>
>      <%@page language="java" contentType="text/html;charset=UTF-8" pageEncoding="UTF-8"%>
> 
> 4. If you're going to have page headers that say UTF-8, your HTML
> content-type meta tags should be consistent, and appear just after the
> <head> tag:
>
>      <meta http-equiv="content-type" content="text/html;charset=UTF-8">
> 
> 5. Whatever processes your form data will require something like
> this: request.setCharacterEncoding("UTF-8");
> 
> 6. In order for #5 to work, you'll need
> SetCharacterEncodingFilter.class under your WEB-INF/classes directory
> (in a filters/ subdirectory, to match the package name used below).
> Look in the Tomcat examples for a copy of this. You'll need to have a
> web.xml file that looks something like this:
> 
> <?xml version="1.0" encoding="UTF-8"?>
> 
> <!DOCTYPE web-app
>      PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
>      "http://java.sun.com/dtd/web-app_2_3.dtd">
> 
> <web-app>
>      <display-name>My App</display-name>
>      <description>Something about My App.</description>
>      <filter>
>          <filter-name>Set Character Encoding</filter-name>
>          <filter-class>filters.SetCharacterEncodingFilter</filter-class>
>          <init-param>
>              <param-name>encoding</param-name>
>              <param-value>UTF-8</param-value>
>          </init-param>
>      </filter>
>      <filter-mapping>
>          <filter-name>Set Character Encoding</filter-name>
>          <servlet-name>action</servlet-name>
>      </filter-mapping>
> </web-app>
> 
> HTH,
> Troy
> 
> __________________
> Troy Davis
> Technology Director
> Metaphor Studio
> 538 Reading Road
> Loft 200
> Cincinnati, Ohio 45202
> 
> Tel: 513-723-0290
> Fax: 513-723-0670
> http://metaphorstudio.com
> 
> On Jun 3, 2005, at 11:18 AM, Justin Fister wrote:
> 
> > I have a question for any Java gurus with i18n experience.  I'm having
> > a hard time understanding the way things work with a webapp --
> > actually why it doesn't work.  Here's what's going on... I have a
> > web-based admin that contains an HTML textarea field in which users
> > enter in text.  Often the text contains special Windows characters
> > (such as curly quotes) and Latin-1 characters for words like
> > "naiveté".  In a servlet, I use the HttpServletRequest.getParameter()
> > method to retrieve the text and dump it into a MySQL database that
> > uses Latin1 as its default charset.  That works fine -- no problems.
> > The text can later be viewed fine through a web page as well as
> > through MySQL Control Center.
> >
> > The problem occurs with another Java program I wrote which iterates
> > over the database records, does some string manipulation to the text,
> > and updates the records.  After this program is run, all of the
> > Windows characters and Latin1 characters show up as garbage text.
> >
> > So, I'm wondering why, in each case, I do nothing special to convert
> > character sets, but it works for the initial insert, but not for the
> > update.  Why does my web-based app using
> > HttpServletRequest.getParameter() seem to handle character sets
> > differently than my standalone app using JDBC?  Both are run on the
> > same machine.
> >
> > Any help would be appreciated.
> >
> > Thanks!
> > Justin
> >
