jinfeng_wang


EBCDIC Encoding with .NET

After reading a post on the C# newsgroup asking for an EBCDIC-to-ASCII converter, and seeing one solution, I decided to write my own implementation. This page describes the implementation and its limitations, and a bit about EBCDIC itself.

EBCDIC

Unfortunately it appears to be fairly tricky to get hold of many concrete specifications of EBCDIC. This is what I've managed to glean from various websites:

  • Introduced by IBM, EBCDIC is an encoding mostly used on mainframes.
  • Like "OEM", EBCDIC isn't a single character encoding: there are many EBCDIC encodings, suited to different cultures.
  • It is primarily a single-byte encoding, i.e. each character is encoded as a single byte. However, there are two characters, "shift out" and "shift in" (0x0e and 0x0f respectively), which are used to change between this and a double-byte character set (DBCS). As far as I can tell, a single EBCDIC encoding doesn't specify which DBCS is to be used - in other words, you really need even more information before you can tell what's going on. Presumably the DBCS in question can't have any pairs beginning with byte 0x0f, as otherwise it would be confused with the "shift in" flag.
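
To make the shifting idea concrete, here is a minimal sketch of how a byte stream could be split into single-byte and double-byte runs. It is not part of the library described on this page: the class name ShiftScanner and the method DescribeRuns are hypothetical, and the sketch assumes (as guessed above) that no double-byte pair begins with 0x0f. It only counts bytes; it makes no attempt to decode the DBCS content.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class ShiftScanner
{
    public const byte ShiftOut = 0x0e; // switches to the double-byte set
    public const byte ShiftIn = 0x0f;  // switches back to single-byte EBCDIC

    // Splits data into runs and describes each as "SBCS:n" or "DBCS:n",
    // where n is the number of payload bytes (shift bytes are not counted).
    public static List<string> DescribeRuns(byte[] data)
    {
        List<string> runs = new List<string>();
        bool doubleByte = false;
        int count = 0;
        int i = 0;

        while (i < data.Length)
        {
            byte b = data[i];
            if ((!doubleByte && b == ShiftOut) || (doubleByte && b == ShiftIn))
            {
                AddRun(runs, doubleByte, count);
                doubleByte = !doubleByte;
                count = 0;
                i++;
            }
            else
            {
                // In double-byte mode only the first byte of each pair is
                // checked for shift in; a truncated final pair is counted
                // with whatever bytes remain.
                int step = Math.Min(doubleByte ? 2 : 1, data.Length - i);
                count += step;
                i += step;
            }
        }
        AddRun(runs, doubleByte, count);
        return runs;
    }

    private static void AddRun(List<string> runs, bool doubleByte, int count)
    {
        if (count > 0)
        {
            runs.Add((doubleByte ? "DBCS:" : "SBCS:") + count);
        }
    }

    public static void Main()
    {
        // Made-up data: three single-byte characters, two double-byte
        // pairs wrapped in shift out / shift in, then one more single byte.
        byte[] sample = { 0xc8, 0x85, 0x93, 0x0e, 0x42, 0x43, 0x44, 0x45, 0x0f, 0x4b };
        foreach (string run in DescribeRuns(sample))
        {
            Console.WriteLine(run); // SBCS:3, then DBCS:4, then SBCS:1
        }
    }
}
```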

If you have any more information, particularly about the DBCS aspect, please mail me at skeet@pobox.com.

My EBCDIC Encoding implementation

I managed to get hold of details of 47 EBCDIC encodings from http://std.dkuug.dk/i18n/charmaps/. To be honest, I don't really know what DKUUG is, so I'm really just hoping that the maps are accurate - they seem to be quite reasonable though. Each encoding has a name and several have aliases, although I currently ignore this aliasing.

My implementation consists of three projects, described below, of which only the middle one is of any interest to most people.

  • A character map reader. This simply finds all of the files in the current directory whose names begin with "EBCDIC-", reads them all in (warning of any oddities in the encoding, such as any non-zero byte having two distinct meanings) and writes out a resource file, ebcdic.dat. This is a console application built from a single C# source file.
  • An encoding library. This is a library built from two C# source files and the ebcdic.dat file generated by the reader. This library is all most users will need. More details are provided below.
  • A test program. This is a console application built from a single C# source file and requiring the library described above. Currently it just displays the encoded version of "hello" and then decodes it.

Using The Encoding Library

The encoding library is very simple to use, as the encoding class (JonSkeet.Ebcdic.EbcdicEncoding) is a subclass of the standard .NET System.Text.Encoding class. To obtain an instance of the appropriate encoding, call EbcdicEncoding.GetEncoding (String), passing it the name of the encoding you wish to use (e.g. EBCDIC-US). You can find the names of the available encodings using the EbcdicEncoding.AllNames property, which returns them as an array of strings.

Once you have obtained an EbcdicEncoding instance, use it like any other Encoding: call GetString, GetBytes etc. The encoding does not save any state between requests, and can safely be used by many threads simultaneously. There is no need (or indeed facility) to release an encoding's resources when it is no longer needed. All encodings are created on the first use of the EbcdicEncoding class, and maintained until the application domain is unloaded.
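
As a quick illustration of the calls just described, the following sketch lists the available encoding names and round-trips a short string. It uses only the members named above (GetEncoding and AllNames) plus the standard Encoding methods, and assumes the library assembly is referenced; the class name RoundTrip is made up for the example.

```csharp
using System;
using System.Text;
using JonSkeet.Ebcdic;

// Hypothetical demo class; requires a reference to the encoding library.
public class RoundTrip
{
    public static void Main()
    {
        // List every encoding name the library knows about.
        foreach (string name in EbcdicEncoding.AllNames)
        {
            Console.WriteLine(name);
        }

        // From here on the instance is used like any other Encoding.
        Encoding ebcdic = EbcdicEncoding.GetEncoding("EBCDIC-US");

        // Encode a string and show the resulting bytes in hex.
        byte[] encoded = ebcdic.GetBytes("hello");
        Console.WriteLine(BitConverter.ToString(encoded));

        // Decode the bytes back into a string.
        string decoded = ebcdic.GetString(encoded);
        Console.WriteLine(decoded);
    }
}
```

Because the encoding keeps no state between calls, the same instance could be stored in a static field and shared across threads.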

Sample Code

The following is a sample program to convert a file from EBCDIC-US to ASCII. It should be easy to see how to modify it to convert the other way, or to use a different encoding (e.g. from EBCDIC-UK, or to UTF-8).

using System;
using System.IO;
using System.Text;
using JonSkeet.Ebcdic;

public class ConvertFile
{
    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine 
                ("Usage: ConvertFile <ebcdic file (input)> <ascii file (output)>");
            return;
        }
        
        string inputFile = args[0];
        string outputFile = args[1];

        Encoding inputEncoding = EbcdicEncoding.GetEncoding ("EBCDIC-US");
        Encoding outputEncoding = Encoding.ASCII;
        
        try
        {
            // Create the reader and writer with appropriate encodings.
            using (StreamReader inputReader = 
                      new StreamReader (inputFile, inputEncoding))
            {
                using (StreamWriter outputWriter = 
                           new StreamWriter (outputFile, false, outputEncoding))
                {
                    // Create an 8K-char buffer
                    char[] buffer = new char[8192];
                    int len=0;
                    
                    // Repeatedly read into the buffer and then write it out
                    // until the reader has been exhausted.
                    while ( (len=inputReader.Read (buffer, 0, buffer.Length)) > 0)
                    {
                        outputWriter.Write (buffer, 0, len);
                    }
                }
            }
        }
        // Not much in the way of error handling here - you may well want
        // to do better handling yourself!
        catch (IOException e)
        {
            Console.WriteLine ("Exception during processing: {0}", e.Message);
        }
    }
}    

Limitations

Due to the lack of available information about the DBCS aspect of EBCDIC, this encoding class makes no effort whatsoever to simulate proper shifting. Shift out and shift in are merely encoded/decoded to/from their equivalent Unicode characters, and bytes between them are treated as if the shift had not taken place. (This means that a decoded byte array is always a string of the same length as the byte array, and vice versa).

Any byte not recognised to be from the specific encoding being used is decoded to the question mark character, '?'. Any character not recognised to be in the set of characters encoded by the specific encoding being used is encoded to the byte representing the question mark character, or to byte zero if the question mark character is not in the character set either.
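
The fallback rules above can be sketched with a toy lookup table. This is not the library's actual code: the class FallbackSketch and its three-entry map are hypothetical, chosen only to show the behaviour (the real maps come from ebcdic.dat). Every byte maps to exactly one character and vice versa, which is also why the lengths always match.

```csharp
using System;

// Hypothetical illustration of the replacement behaviour.
public static class FallbackSketch
{
    // Index = byte value; '\0' at a non-zero index means "unmapped".
    private static readonly char[] ByteToChar = BuildMap();

    private static char[] BuildMap()
    {
        char[] map = new char[256];
        map[0x6f] = '?'; // question mark in common EBCDIC variants
        map[0xc1] = 'A';
        map[0xc2] = 'B';
        return map;
    }

    // Unrecognised bytes decode to '?'.
    public static char Decode(byte b)
    {
        char c = ByteToChar[b];
        return (c == '\0' && b != 0) ? '?' : c;
    }

    // Unrecognised characters encode to the '?' byte, or to byte zero
    // if the map has no '?' either.
    public static byte Encode(char c)
    {
        int b = Find(c);
        if (b < 0)
        {
            b = Find('?');
        }
        return b < 0 ? (byte)0 : (byte)b;
    }

    // Returns the first byte value mapped to c, or -1 if there is none.
    private static int Find(char c)
    {
        for (int i = 0; i < ByteToChar.Length; i++)
        {
            if (ByteToChar[i] == c)
            {
                return i;
            }
        }
        return -1;
    }

    public static void Main()
    {
        Console.WriteLine(Decode(0xc1));                // A
        Console.WriteLine(Decode(0xff));                // ? (unmapped byte)
        Console.WriteLine(Encode('z').ToString("x2"));  // 6f (unmapped char)
    }
}
```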

The library doesn't currently have a strong name, so it can't be placed in the GAC. You may, however, download the source and modify

Licence

This was just an interesting half-day project. I have no desire to make any money out of this code whatsoever, but I hope it's interesting and useful to others. So, feel free to use it. If you have any questions about it, or just find it useful and wish to let me know, please mail me at skeet@pobox.com. You may use this code in commercial projects, either in binary or source form. You may change the namespace and the class names to suit your company, and modify the code if you wish. I'd rather you didn't try to pass it off as your own work, and specifically you may not sell just this code - at least not without asking me first. I make no claims whatsoever about this code - it comes with no warranty, not even the implied warranty of fitness for purpose, so don't sue me if it breaks something. (Mail me instead, so we can try to stop it from happening again.)

Downloads

History

  • August 31st 2003, v1.0.0.1 - no in-code changes, just made the XML documentation build correctly.
  • August 28th 2003, v1.0.0.1 - slight tweaking to remove unnecessary (and probably counterproductive) efficiency measure. No functional changes.
  • May 21st 2003, v1.0.0.0 - initial implementation.
Posted on 2006-01-18 19:58 by jinfeng_wang. Category: ZZ.Net
