Sunday, June 03, 2012

Been busy as of late.


So I wanted to cover real quickly the basic structure of the program that I will be using to watch socket packets for a TN5250 stream I will be connecting to.


/-----\         /----------\         /---------------\
| GUI |   <==>  | Activity |   <==>  | Packet reader |
\-----/         \----------/         \---------------/
                    /\
                    ||
                    ||  (passes back an open socket)
                    \/
                /----------------\
                | Socket Builder |
                \----------------/

So that's the basic structure. This isn't a TN5250 client so much as a packet sniffer. Basically the socket builder is made its own object just in case I need to change how a socket is made. The packet reader is separate for the same reason, just in case I need to read the packets differently. I'll be writing this on Sunday and testing it on one of the open AS400 systems on the Internet.
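To make the split concrete, here is a minimal sketch of how the socket builder and packet reader could be kept swappable behind small interfaces. Every name here (SocketBuilder, PacketReader, PlainSocketBuilder, RawPacketReader, HexDump) is made up for illustration, and the 4096-byte buffer is an arbitrary assumption; this is not the actual program.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;
import java.util.Arrays;

// Hypothetical interface: the Activity only asks for "an open socket".
interface SocketBuilder {
    Socket open(String host, int port) throws IOException;
}

// Hypothetical interface: the Activity only asks for "the next chunk of bytes".
interface PacketReader {
    // Returns the bytes read, or an empty array at end of stream.
    byte[] read(InputStream in) throws IOException;
}

// Simplest possible builder: a plain TCP socket.
class PlainSocketBuilder implements SocketBuilder {
    @Override
    public Socket open(String host, int port) throws IOException {
        return new Socket(host, port);
    }
}

// Simplest possible reader: one read() call, no protocol parsing at all.
class RawPacketReader implements PacketReader {
    private final byte[] buffer = new byte[4096]; // arbitrary size

    @Override
    public byte[] read(InputStream in) throws IOException {
        int n = in.read(buffer);
        if (n <= 0) {
            return new byte[0];
        }
        return Arrays.copyOf(buffer, n);
    }
}

// Helper for eyeballing raw bytes while sniffing.
class HexDump {
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            // Mask with 0xFF because Java bytes are signed.
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }
}
```

If the way sockets are made or packets are read has to change later, only a new implementation of the matching interface is needed; the Activity code that calls open() and read() stays the same.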

At the same time I want to cover another wonderful point of telnet and TN5250: code pages. Over the years we've all become very accustomed to a thing called Unicode and its UTF encodings. Basically, this has become the agreed-upon standard for encoding the languages of the world. I won't cover how UTF works, but it has greatly simplified how languages are encoded. However, AS400 systems talk in a very cryptic language called EBCDIC. So here's the deal. EBCDIC is eight bits (256 possible code points), so when building a TN5250 terminal, one needs to be able to take each byte value and convert it to the matching UTF-8 character. How does that work? Let's go over a little example.

If you head over to the Wikipedia page for EBCDIC, you will see that the letters A through I are 0xC1 through 0xC9. In binary that would be 11000001 through 11001001, and in decimal 193 through 201. To have a conversion from that to UTF-8 one could build an array like so...

//blah blah init code

public static char [] cp037 = new char[256];
//...
//...
cp037[193] = '\u0041';          //Letter A
cp037[194] = '\u0042';          //Letter B
cp037[195] = '\u0043';          //Letter C
cp037[196] = '\u0044';          //Letter D
cp037[197] = '\u0045';          //Letter E
cp037[198] = '\u0046';          //Letter F
cp037[199] = '\u0047';          //Letter G
cp037[200] = '\u0048';          //Letter H
cp037[201] = '\u0049';          //Letter I

//Okay here comes the fun part of EBCDIC...  
//The alphabet isn't contiguous, so the 
//next character after I is soft hyphen...  
//We don't get to J until 209.

cp037[202] = '\u00AD';          //Soft Hyphen (look it up if you don't know)
cp037[203] = '\u00F4';          //Letter ô
cp037[204] = '\u00F6';          //Letter ö
cp037[205] = '\u00F2';          //Letter ò
cp037[206] = '\u00F3';          //Letter ó
cp037[207] = '\u00F5';          //Letter õ, also this is 0xCF, next is 0xD0
cp037[208] = '\u007D';          // the } symbol
cp037[209] = '\u004A';          //Finally, the Letter J...

Of course if you are really slick you can just define the array and init the values all in one go. Also, you might want to make the array final.

public static final char [] cp037 = { 
//193 other elements come first (indices 0 through 192)
'\u0041', '\u0042', '\u0043', '\u0044', '\u0045', '\u0046', '\u0047',
//You get the point hopefully.
};
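Once a table like cp037 exists, using it is just an array lookup per byte. Here is a minimal sketch of that lookup, assuming a 256-entry table. The class name EbcdicDecoder and the partialTable() helper are made up for illustration, and the helper only fills in A through I (everything else is the Unicode replacement character), so it is not a real code page. The one catch worth showing is that Java bytes are signed, so 0xC1 arrives as -63 and has to be masked before it can be used as an index.

```java
import java.util.Arrays;

class EbcdicDecoder {
    // Looks up each incoming byte in a 256-entry table such as cp037.
    static String decode(byte[] data, char[] table) {
        if (table.length != 256) {
            throw new IllegalArgumentException("table must have 256 entries");
        }
        StringBuilder sb = new StringBuilder(data.length);
        for (byte b : data) {
            // b & 0xFF turns the signed byte (-128..127) into an index 0..255.
            sb.append(table[b & 0xFF]);
        }
        return sb.toString();
    }

    // Demo table only: A through I at 0xC1..0xC9, U+FFFD everywhere else.
    static char[] partialTable() {
        char[] t = new char[256];
        Arrays.fill(t, '\uFFFD');
        for (int i = 0; i < 9; i++) {
            t[0xC1 + i] = (char) ('A' + i);
        }
        return t;
    }
}
```

Calling EbcdicDecoder.decode(new byte[] {(byte) 0xC1, (byte) 0xC2, (byte) 0xC3}, EbcdicDecoder.partialTable()) gives back "ABC".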

Also I want to note one more hateful thing. You'll notice that I have called the array cp037... That's because EBCDIC is an 8-bit encoding, and 256 slots can't hold every alphabet, so there are multiple code pages, each covering a group of languages. EBCDIC code page 037 has enough characters for the languages of Australia, Brazil, Canada, New Zealand, Portugal, South Africa, and the USA, except that 037 has no Euro support. Code page 037 with the Euro symbol is known as code page 1140. The difference between 037 and 1140 is a single character. Code point 0x9F in CP 037 is \u00A4 or ¤. Code point 0x9F in CP 1140 is \u20AC or €. In other words the difference between the two in code is:


//Remember code point 9F = 159 in decimal.

cp037[159] = '\u00A4';          //The ¤ symbol
cp1140[159] = '\u20AC';         //The € symbol

Other than that, cp037 and cp1140 are exactly the same.
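Since the two pages differ in one slot, one way to avoid typing the table twice is to copy cp037 and overwrite that slot. This is only a sketch of the idea; the class name CodePages and the method withEuro() are made up for illustration.

```java
class CodePages {
    // Builds a cp1140-style table from a cp037-style table.
    static char[] withEuro(char[] cp037) {
        // clone() copies the array, so the original cp037 table is left untouched.
        char[] cp1140 = cp037.clone();
        cp1140[0x9F] = '\u20AC'; // the euro sign replaces the ¤ symbol at 159
        return cp1140;
    }
}
```

The same copy-and-patch trick should work for any pair of code pages that differ in only a handful of code points.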

To actually implement every single code page in EBCDIC would be very difficult by myself. However, the nice thing is that there are several resources on the Internet that quote the IBM guidebook on the layout of each code page. This is nice because A) It's in digital format, B) I can copy and paste and reformat outside the IDE and then copy paste into the IDE when I've got it actually looking like code.

Anyway, I've beaten this EBCDIC to UTF-8 conversion thing to death. Yet, I have not answered why one must do this... Simple, because IBM hates you. Actually, Java speaks Unicode down to the core (its strings are UTF-16 internally), but IBM speaks EBCDIC (the story is funny, because they made EBCDIC as a competitor to ASCII [guess who won]). So just like any program that has to speak to machines with different encodings, you have to convert between the two. Usually one doesn't have to worry about this kind of stuff because some sort of library already exists to do all of the conversion without your knowledge. Well, that's exactly what we are doing. In order to have a TN5250 terminal we need to write an EBCDIC to UTF-8 converter library. Now you might wonder why no one has already done this. Well, a couple of people have, but they didn't make it into a library per se. So I can't use those pieces because those pieces don't stand on their own. In other words, the EBCDIC to UTF-8 converter is hard coded into the software that uses it, and thus cannot be separated without a lot of pain. At that point it becomes a "lesser of two evils" kind of deal.
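Converting "between the two" also means going the other way, from Java text back to EBCDIC bytes. Here is a rough sketch of building the reverse lookup from the same decode table instead of typing a second one by hand. The class name EbcdicEncoder and the fallback parameter are made up for illustration; the caller chooses which byte stands in for characters the code page doesn't have.

```java
import java.util.HashMap;
import java.util.Map;

class EbcdicEncoder {
    private final Map<Character, Byte> reverse = new HashMap<>();
    private final byte fallback;

    // table is a decode table such as cp037; fallback is sent for unknown characters.
    EbcdicEncoder(char[] table, byte fallback) {
        for (int i = 0; i < table.length; i++) {
            // putIfAbsent keeps the lowest code point if a character appears twice.
            reverse.putIfAbsent(table[i], (byte) i);
        }
        this.fallback = fallback;
    }

    byte[] encode(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            Byte b = reverse.get(text.charAt(i));
            out[i] = (b != null) ? b : fallback;
        }
        return out;
    }
}
```

For example, with a table that only maps 0xC1 through 0xC9 to A through I and a fallback of 0x6F (the question mark in CP 037), encoding "ABZ" gives the bytes 0xC1, 0xC2, 0x6F.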

Well, I hope you all have enjoyed this little foray into encodings. Off to bed, and tomorrow morning some Android socket programming for me!
