Solid Block of Ise

Authentic Frontier Gibberish

Them Character Encoding Blues

Posted by isecore on October 6th, 2007

Back in 2003 I started running my own server. Partly for fun (I had an old machine that could function as a server) but mostly because I’m a huge nerd who loves having computers doing things.

When I started I installed Debian, and immediately got hooked on it. I didn’t really know much about running a server, but I’d played around with Linux since ~1994 and figured that I’d learn what I needed as I went along. Sure, this is a good way of learning things, but it ain’t the fastest way.

Looking back at these four years of server-administration I’ve come to the conclusion that had I known then what I now know I would’ve done things a lot differently. I’ve learned _a lot_ about the finer details of running a server. Especially a server such as mine, which performs a lot of different functions.

The biggest annoyance has been character encodings. When I started out I had no clue whatsoever how they worked and why some were used and some weren’t. I knew that Sweden generally used ISO-8859-1, but not why, or why it would cause problems. So I happily went on using that, and as time went by it caused a lot of problems, since pretty much everything these days is geared towards UTF-8. UTF-8 is the standard encoding on pretty much every operating system there is. Every real operating system at least; I wouldn’t be surprised if Microsoft still insists on some funky encoding even in Vista. I don’t know, since I’ve never run Vista and have no intention of doing so.

So, I’ve decided that from now on everything in this house that uses a character encoding will use UTF-8. I’m tired of seeing weird characters on the server that runs ISO-8859-1 just because some system somewhere else used UTF-8. UTF-8 is the future, most everything else is just old and busted.
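A minimal sketch of where those weird characters come from, assuming Python 3. The sample word and the function name `misread_as_latin1` are made up for illustration: UTF-8 stores each Swedish letter as two bytes, and a system that assumes ISO-8859-1 shows each of those bytes as a separate character.

```python
# Sketch: what UTF-8 text looks like when it is read as ISO-8859-1.
# The sample word and function name are illustrative, not from the post.

def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


sample = "Räksmörgås"

# The same text takes a different number of bytes in each encoding.
latin1_size = len(sample.encode("iso-8859-1"))
utf8_size = len(sample.encode("utf-8"))
print(latin1_size, utf8_size)

# Every non-ASCII letter turns into two odd-looking characters.
garbled = misread_as_latin1(sample)
print(garbled)

# As long as no bytes were lost, the mistake can be undone.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == sample)
```

The round trip at the end only works because ISO-8859-1 maps every byte value to some character, so nothing is thrown away when the text is misread.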

This means I’m going to have to go over my server with a fine-tooth comb and adjust everything to use UTF-8, and make sure it’s systemwide. That’s needed because my server was originally a Debian Woody (3.0) install that’s been upgraded two major releases since 2003 (first to Sarge, 3.1, and most recently to Etch, 4.0). It’s either that or reinstalling from scratch, and the latter is of even less interest since it means mounds and mounds of work to get everything up to speed again.
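One way the file-by-file part of such a cleanup could look is sketched below, assuming Python 3. The helper names `to_utf8` and `convert_file` are hypothetical, and the "already UTF-8?" check is only a heuristic: a few ISO-8859-1 byte sequences happen to be valid UTF-8 as well, so anything important deserves a backup first.

```python
from pathlib import Path


def to_utf8(raw: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Return raw re-encoded as UTF-8.

    Input that already decodes as UTF-8 (including plain ASCII) is
    returned unchanged, so running the conversion twice is harmless.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the legacy encoding and convert.
        return raw.decode(legacy_encoding).encode("utf-8")


def convert_file(path: Path) -> bool:
    """Rewrite one file as UTF-8 in place; return True if it changed."""
    original = path.read_bytes()
    converted = to_utf8(original)
    if converted == original:
        return False
    path.write_bytes(converted)
    return True
```

The same `to_utf8` helper could be applied to other byte strings as well, but file contents are the simplest place to start.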

And yes, this is probably not the most world-conscious entry I’ve posted so far. But there was a decided lack of nerdery in this blog and hopefully this makes up for it. This will also explain why some things occasionally might look funny while I figure out a way to convert the database holding this blog from ISO-8859-1 to UTF-8.

License

This work is published under a Creative Commons Attribution-NoDerivs 2.5 Sweden License.

3 Responses to “Them Character Encoding Blues”

  1. Mind Says:

    It’s funny, I was just looking up much of this stuff about character encoding while programming a chat-bot. It’s sort of my long-term project to do a megaHAL replacement that is slightly more modern and portable. Anyway, I can’t seem to find out exactly which character encoding “Unicode” means in the various C++ libraries. The encoding is simply referred to as “wide”. I know that UTF-8 is “teh shit”. It’s not really a lack of knowledge on my part about how character encodings work or what they are good for; it’s more a lack of documentation on the libraries’ part, or a lack of ability on my part to find the correct information. Really annoying. I have found a standard lib that I use now, and I suppose it is the same one that everyone who makes “Unicode”-supporting programs uses. But I am not certain, and I don’t really know what will happen when people start entering text into it with various keyboard settings and languages, or from different media such as the input box on a web form versus a well-formatted text file. It’s a jungle really… at least until I find more reliable information about it.

  2. Mind Says:

    Ah, I found some more info. Apparently, in Windows the standard is UTF-16. The pro is that most everyday characters are exactly 16 bits wide instead of varying in length as they do in UTF-8, which makes it easy to reserve space beforehand for a given number of characters. But UTF-16 is not UTF-8, and it is not truly fixed-width either: characters outside the Basic Multilingual Plane take two 16-bit units (a surrogate pair). Both encodings can represent the same set of Unicode characters.

    Now, Windows developers refer to UTF-16 as Unicode and Linux developers refer to UTF-8 as Unicode. Hence the confusion on my part. A more correct approach would be to not refer to UTF-16 as Unicode at all but rather keep using the term “wide” and nothing else. The standard libs for handling Unicode handle UTF-8 under Linux and UTF-16 on Windows. Pretty annoying; this probably means I have to get some external lib to be able to support UTF-8 on Windows.

  3. isecore Says:

    Well, that would make sense. I’ve read about several Windows users who discovered that their OS uses some pretty weird encodings, but UTF-16 would make sense in some ass-backwards way.

    Anyhoo, everything here is going to be converted to UTF-8 in the future. I’m tired of seeing weird characters in my filenames :)
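The UTF-8 versus UTF-16 question raised in the comments can be checked directly. A small sketch, assuming Python 3; the helper name `encoded_sizes` and the sample characters are made up for illustration. It prints how many bytes each character needs in both encodings.

```python
# Sketch: byte counts per character in UTF-8 and UTF-16.
# "utf-16-le" is used so that no byte-order mark is added to the count.

def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 bytes) needed to store ch."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in ["a", "å", "€", "😀"]:
    print(repr(ch), encoded_sizes(ch))

# Both encodings round-trip a character outside the 16-bit range.
emoji = "😀"
assert emoji.encode("utf-8").decode("utf-8") == emoji
assert emoji.encode("utf-16-le").decode("utf-16-le") == emoji
```

Neither encoding is fixed-width: UTF-8 uses one to four bytes per character, while UTF-16 uses two bytes for most characters and four for the rest.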
