Thanks for the excellent info! I think that is the most thorough, yet
to-the-point, tutorial for converting a webapp to UTF-8 that I've
read. It will probably be a little while before I bite the bullet and
undertake all of these conversions and upgrades.

In the meantime, I was able to use the java.nio classes to convert the
Latin1 text from the database into Java's native Unicode. After some
String manipulation I then had to convert the Unicode back to Latin1
before calling the ResultSet.updateBytes method. I ran into another
issue though when I had to insert the same Latin1 text into a new
record in a different table. I couldn't find a way to send an SQL
string as a byte array using a Statement, so I had to use a
PreparedStatement with the setBytes() method. What a pain! I'd
love to know if anyone else dealing with this stuff has a more elegant
way of doing it. I'm also still wondering why I was able to grab text
from a browser POST and put it into the database (without any charset
conversions or byte arrays) and not have it end up as garbage text. I
may never know.
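
For anyone following along, the round trip described above can be sketched roughly like this. It is only a minimal sketch, not the actual code: the class name Latin1RoundTrip and its methods are hypothetical, ISO-8859-1 is assumed to be the Java charset that corresponds to the database's Latin1, and trim() stands in for the real string manipulation. The byte array that comes out is what would then go to ResultSet.updateBytes or PreparedStatement.setBytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decode Latin1 bytes, edit the text, encode back to Latin1.
final class Latin1RoundTrip {
    // Assumption: the database column holds ISO-8859-1 (Latin1) bytes.
    private static final Charset LATIN1 = StandardCharsets.ISO_8859_1;

    // Raw database bytes -> Java String (UTF-16 internally).
    static String decode(byte[] raw) {
        return LATIN1.decode(ByteBuffer.wrap(raw)).toString();
    }

    // Java String -> Latin1 bytes. Characters that Latin1 cannot represent
    // are replaced by the charset's replacement byte rather than raising an error.
    static byte[] encode(String text) {
        ByteBuffer buf = LATIN1.encode(text);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // Decode, manipulate, re-encode. trim() is a placeholder for the real work.
    static byte[] process(byte[] raw) {
        String edited = decode(raw).trim();
        return encode(edited);
    }
}
```

Because ISO-8859-1 maps every possible byte value to exactly one character, decoding and re-encoding untouched text gives back the original bytes.
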
Troy, thanks again for the information -- I will definitely be using
it as a guide on future i18n projects.

Justin

On 6/3/05, Troy Davis <troy@xxxxxxxxxxxxxxxxxx> wrote:
> Hi Justin,
>
> I just went through the process of upgrading an existing Java app to
> handle Unicode text, and it was definitely a learning curve...
> (BTW, thank you to everyone who sent suggestions; most of them
> helped!) Since I completed the upgrade work, I've found myself
> copying and pasting text from just about anything, and much to my
> surprise it actually works. Even in MSIE, marvel of marvels.
>
> One of the problems you'll find in trying to convert between Windows
> cp1252, latin1 and other older encodings is that there's no easy way
> to detect which character set any given string is in. Supposedly
> Microsoft invested a pretty significant amount of developer time for
> MSIE so that it could detect character sets based on heuristic
> analysis. But short of getting that code and porting it to Java, I'd
> recommend switching to UTF-8 instead.
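
The detection problem mentioned above can be illustrated with a small sketch. The class name AmbiguousBytes and the sample bytes are made up for illustration, and it assumes the JVM ships the windows-1252 charset, which is not one of the charsets every Java platform is required to provide. The same four bytes decode without any error under both cp1252 and Latin1, so nothing in the bytes themselves says which one was intended.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: identical bytes are valid in two legacy encodings.
final class AmbiguousBytes {
    static String decodeAs(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are curly double quotes in windows-1252,
        // but C1 control characters in ISO-8859-1.
        byte[] raw = {(byte) 0x93, 'h', 'i', (byte) 0x94};

        // Assumption: this JVM provides windows-1252.
        String asCp1252 = decodeAs(raw, Charset.forName("windows-1252"));
        String asLatin1 = decodeAs(raw, StandardCharsets.ISO_8859_1);

        // Both decodes succeed; only the resulting code points differ.
        System.out.printf("cp1252: U+%04X%n", (int) asCp1252.charAt(0));
        System.out.printf("latin1: U+%04X%n", (int) asLatin1.charAt(0));
    }
}
```
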
>
> In order to upgrade my company's app to be Unicode-safe, I had to
> address several different levels of concerns:
>
> 1. The database needed to be Unicode-safe. We use MySQL, but you have
> to use version 4.1.1+ to get that. Most hosting providers are still
> using 3.x or 4.0.x. One of our clients' sites is on a server that has
> 4.0.something, and it became a real roadblock. We wound up
> recompiling their jar file so that the DAO connection string pointed
> to our own database server. Slowed down the site a bit, but it works.
>
> The keys to this bit of magic turned out to be four-fold:
>
> - MySQL >= 4.1.1.
>
> - Changing the connection string to look like:
>   jdbc:mysql://server.com/db_name?useUnicode=true&characterEncoding=utf8&autoReconnect=true
>
> - Exporting and converting the data to utf8.
>
> - Changing the CREATE TABLE statements to include "ENGINE=MyISAM
> DEFAULT CHARSET=utf8;" at the end.
>
> I also found myself typing "set names 'utf8';" at the command line
> quite a bit before uploading converted text.
>
> 2. The JDBC driver needed to be a recent version, so I had to upgrade
> Connector/J. Not a big deal for our own servers, but some clients are
> on other companies' servers, and that took some time and persuasion.
>
> 3. Page headers must specify the UTF-8 character set, so your first
> line in a JSP file might look like this:
> <%@page language="java" contentType="text/html;charset=UTF-8" pageEncoding="UTF-8"%>
>
> 4. If you're going to have page headers that say UTF-8, your HTML
> content-type meta tags should be consistent, and appear just after the
> <head> tag:
> <meta http-equiv="content-type" content="text/html;charset=UTF-8">
>
> 5. Whatever processes your form data will require something like
> this: request.setCharacterEncoding("UTF-8");
>
> 6. In order for #5 to work, you'll need the
> SetCharacterEncodingFilter.class in your WEB-INF/classes/filters
> directory (matching the filters package named in web.xml). Look
> in the Tomcat examples for a copy of this. You'll need to have a
> web.xml file that looks something like this:
>
> <?xml version="1.0" encoding="UTF-8"?>
>
> <!DOCTYPE web-app
>     PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
>     "http://java.sun.com/dtd/web-app_2_3.dtd">
>
> <web-app>
>   <display-name>My App</display-name>
>   <description>Something about My App.</description>
>   <filter>
>     <filter-name>Set Character Encoding</filter-name>
>     <filter-class>filters.SetCharacterEncodingFilter</filter-class>
>     <init-param>
>       <param-name>encoding</param-name>
>       <param-value>UTF-8</param-value>
>     </init-param>
>   </filter>
>   <filter-mapping>
>     <filter-name>Set Character Encoding</filter-name>
>     <servlet-name>action</servlet-name>
>   </filter-mapping>
> </web-app>
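
To show why the request encoding in steps 5 and 6 matters, here is a rough sketch that needs no servlet container. It uses java.net.URLDecoder as a stand-in for the container's own form parsing, which is an assumption made purely for illustration; the class name FormDecodingDemo and the sample value are hypothetical. A browser submitting a UTF-8 page percent-encodes the UTF-8 bytes, and a container with no encoding set falls back to ISO-8859-1 when it decodes them.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo: the same POSTed bytes decoded with two request encodings.
final class FormDecodingDemo {
    // Wraps URLDecoder so callers do not have to handle the checked exception.
    static String decodeParam(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // A form field as sent by a UTF-8 page: two bytes each for the accented letters.
        String posted = "na%C3%AFvet%C3%A9";

        // Comparable to calling request.setCharacterEncoding("UTF-8") first.
        String right = decodeParam(posted, "UTF-8");

        // Comparable to the default when no request encoding is set.
        String wrong = decodeParam(posted, "ISO-8859-1");

        System.out.println(right.length()); // 7 characters
        System.out.println(wrong.length()); // 9 characters: each accented letter became two
    }
}
```
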
>
> HTH,
> Troy
>
> __________________
> Troy Davis
> Technology Director
> Metaphor Studio
> 538 Reading Road
> Loft 200
> Cincinnati, Ohio 45202
>
> Tel: 513-723-0290
> Fax: 513-723-0670
> http://metaphorstudio.com
>
> On Jun 3, 2005, at 11:18 AM, Justin Fister wrote:
>
> > I have a question for any Java gurus with i18n experience. I'm having
> > a hard time understanding the way things work with a webapp --
> > actually why it doesn't work. Here's what's going on... I have a
> > web-based admin that contains an HTML textarea field in which users
> > enter in text. Often the text contains special Windows characters
> > (such as curly quotes) and Latin-1 characters for words like
> > "naiveté". In a servlet, I use the HttpServletRequest.getParameter()
> > method to retrieve the text and dump it into a MySQL database that
> > uses Latin1 as its default charset. That works fine -- no problems.
> > The text can later be viewed fine through a web page as well as
> > through MySQL Control Center.
> >
> > The problem occurs with another Java program I wrote which iterates
> > over the database records, does some string manipulation to the text,
> > and updates the records. After this program is run, all of the
> > Windows characters and Latin1 characters show up as garbage text.
> >
> > So I'm wondering why, when I do nothing special to convert
> > character sets in either case, it works for the initial insert but
> > not for the update. Why does my web-based app using
> > HttpServletRequest.getParameter() seem to handle character sets
> > differently than my standalone app using JDBC? Both are run on the
> > same machine.
> >
> > Any help would be appreciated.
> >
> > Thanks!
> > Justin