UNICODE Text to String

Posted By: Benni003

UNICODE Text to String - 02/07/13 12:23

Hello, I hope someone can help me.
I want to read text from a Unicode text file into a string. (The Unicode part is important, because special characters are needed.)
It's not working and I don't know much about this.
Can anyone please help me with this? Thank you.

Demo to show you:
Example

On the linked page, click the smaller download button below the file details.

And here is the code separately:

STRING* str1 = "";
STRING* str2 = "";

function main()
{
var file;

file = file_open_read("file.txt");

file_str_readtow(file,str1," ",200);
file_str_readtow(file,str2," ",200);

file_close(file);

printf(_chr(str1));
printf(_chr(str2));
}
Posted By: fogman

Re: UNICODE Text to String - 02/07/13 13:24

printf doesn't seem to use a Unicode font.
The following works instead, try this:

#include <acknex.h>
#include <default.c>

STRING* str1 = "#8";
STRING* str2 = "#8";

FONT* fontArial = "Arial#20b"; // truetype font, 20 point bold characters

TEXT* txtTest =
{
    layer = 999;
    string = str1;
    flags = SHOW;
    font = fontArial;
}


function main()
{
    var file;

    file = file_open_read("file.txt");

    file_str_readtow(file,str1," ",200);
    file_str_readtow(file,str2," ",200);

    file_close(file);

    // printf(_chr(str1));
    // printf(_chr(str2));
}
Posted By: WretchedSid

Re: UNICODE Text to String - 02/07/13 13:53

Yes, finally, character encoding, I was already afraid that this topic would never come up here.

So, fasten your seatbelts and let's talk about character encoding. A string is composed of a number of bits and bytes, just like your average integer; however, the interpretation of a string is a much bigger clusterfuck than the interpretation of an integer (mainly because we agreed upon what an integer looks like, we just can't always agree on the byte order). It's a bit like the thousand and one floating point formats out there, except that it's even more fucked up, because usually floating point means IEEE 754 and it just works™.

When you type a string in Lite-C, it's encoded in the so called ASCII format. Each character is exactly one byte in size and has a value from 0-127, which is enough to encode all English characters, a few punctuation marks, some control characters, etc. (here is the complete list: http://www.asciitable.com/). Now, this obviously poses a problem, because as it happens there are many, many more characters in the world and people also want to use emojis in their text messages, so having just 128 characters is a bit meh. Luckily we came up with a billion ways to encode all kinds of characters, the most commonly used one being UTF8 (except for some IRC channels on freenode which will ban you if you use it). UTF8 at its core uses one byte per character and has the same first 128 characters as ASCII, so an ASCII string is also a valid UTF8 string (hooray, someone thought about compatibility). However, a character can also be larger than one byte in UTF8 (up to 4 bytes) and thus represent a character outside of the 128 ASCII characters (except that the string then stops being a valid ASCII string and becomes garbage). Oh yeah, by the way, UTF8 is a Unicode encoding. Just like UTF16 and UTF32.
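
A small illustration of that multi-byte behaviour (my own sketch in plain C, not part of the original post): the character ß is code point U+00DF, which doesn't fit into 7-bit ASCII, so UTF8 spreads it over the two bytes 0xC3 0x9F, while plain ASCII simply has no way to represent it.
Code:
#include <stdio.h>
#include <string.h>

int main(void)
{
    // "ß" written out as an explicit UTF8 byte sequence: 0xC3 0x9F (plus the terminating zero).
    const char utf8_sz[] = "\xC3\x9F";

    // strlen() counts bytes, not characters: one character, but two bytes.
    printf("bytes: %u\n", (unsigned)strlen(utf8_sz));
    return 0;
}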

So, what the fuck is Unicode? Unicode, or ISO 10646 if you like numbers, is a standard and not an encoding (it's a standard created because we had too many competing standards. Relevant xkcd: http://xkcd.com/927/). Unicode is basically a list of characters with room for 1,114,112 code points, containing everything from Latin characters, over most of the Asian characters, currency symbols, mathematical symbols and scientific symbols, to the good old smiling pile of poo emoji (code point U+1F4A9, if anyone wants to know).
Each character in the Unicode table has a so called code point, which basically is just a number, written in hex and prefixed with U+. The ß for example is U+00DF, or simply put 223 in decimal. The characters in the Unicode list are broken into so called planes; the first plane (everything that fits into 2 bytes, i.e. code points up to U+FFFF) is called the Basic Multilingual Plane and it contains the most common characters (i.e. the pile of poo is not there). Planes are broken into blocks, and the first block of the Basic Multilingual Plane is the Basic Latin block, and guess which 128 characters those are (that's what happens if you let Americans design shit... after they designed a bazillion other standards on how to write Latin strings).

Let's go back to UTF8 and its friends UTF16 and UTF32. All three encodings are Unicode encodings, meaning that they can be used to represent Unicode strings. The difference is the size of their code units: UTF8 uses 1 byte per unit, UTF16 uses 2 bytes and UTF32 uses 4 bytes, and UTF32 is the only one of the three that is able to hold every Unicode character in a single unit. But it also wastes a lot of space, because not every character needs 4 bytes (as already mentioned, the most commonly used English characters fit perfectly into 1 byte). I'm going to skip the UTF8 encoding details for now and get straight to UTF16, which is the encoding Gamestudio uses.
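
To make the size difference concrete (my own example, not from the original post, assuming the usual 2-byte short and 4-byte int), here are A (U+0041), ß (U+00DF) and the pile of poo (U+1F4A9) spelled out by hand in all three encodings:
Code:
#include <stdio.h>
#include <string.h>

int main(void)
{
    // The same three characters in all three encodings (little endian for UTF16/UTF32):
    const char           utf8[]  = "\x41\xC3\x9F\xF0\x9F\x92\xA9";          // 1 + 2 + 4 bytes
    const unsigned short utf16[] = { 0x0041, 0x00DF, 0xD83D, 0xDCA9 };      // 2 + 2 + (2 + 2) bytes
    const unsigned int   utf32[] = { 0x00000041, 0x000000DF, 0x0001F4A9 };  // 4 + 4 + 4 bytes

    printf("UTF8 : %u bytes\n", (unsigned)strlen(utf8));   // 7
    printf("UTF16: %u bytes\n", (unsigned)sizeof(utf16));  // 8
    printf("UTF32: %u bytes\n", (unsigned)sizeof(utf32));  // 12
    return 0;
}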

In UTF16 each character has a unit size of 2 bytes, but it can extend up to 4 bytes depending on what you want to encode (the smiling pile of poo for example uses 4 bytes). This is the point where it should start to become clear that a) character encoding is a fucked up art coming from a time when people couldn't make/afford RAM or hard drives larger than a few kilobytes, and b) UTF16 doesn't work with normal C strings, because their characters are expected to be 1 byte in size while UTF16 units are 2 bytes (so no, the problem is not that printf() uses a non-Unicode font, but that it works with C strings, which have a completely different unit size).
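
You can see the mismatch directly in memory (my own sketch, not from the original post): "Hi" in UTF16 little endian is the byte sequence 48 00 69 00 00 00, and any function that treats that buffer as a C string stops at the first zero byte, i.e. right after the H.
Code:
#include <stdio.h>
#include <string.h>

int main(void)
{
    // "Hi" in UTF16 little endian, byte by byte: H = 48 00, i = 69 00, terminator = 00 00.
    const char utf16le[] = { 0x48, 0x00, 0x69, 0x00, 0x00, 0x00 };

    // A C string function sees the 00 after the H as the end of the string.
    printf("length as seen by strlen: %u\n", (unsigned)strlen(utf16le)); // prints 1
    return 0;
}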

So, how does iterating over a string encoded in UTF16 work? The straightforward approach would be this:
Code:
short *myString = xyz;
while(*myString)
{
     myString ++;
}



Except not. Like mentioned before, not every character is 2 bytes in size, some can be larger, so unlike with C strings, making assumptions about the length of a string without looking at its content is bad, and you should feel bad if you do that. The good thing, like already mentioned, is that Unicode wasn't designed by complete lunatics, so the most commonly used characters, including Asian ones, are neatly put into the first plane of the Unicode table, and if you only use these characters you can assume that each character is 2 bytes and be happy. Except not, because third party input can't be trusted, so let's talk about digesting a UTF16 string.

The easy case is that your character is 2 bytes; this includes everything up to code point U+FFFF.
The hard case is everything else. When code points starting at U+10000 are encoded, UTF16 uses 4 bytes per character, broken into the so called lead and trail surrogate. This is done by first subtracting 0x10000 from the code point (leaving a 20 bit number), then adding the higher 10 bits to 0xD800 to form the lead surrogate and the lower 10 bits to 0xDC00 to form the trail surrogate. So the lead surrogate is a number between 0xD800 and 0xDBFF and the trail surrogate is a number between 0xDC00 and 0xDFFF.
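
A quick worked example of that arithmetic (mine, not from the original post), using the pile of poo from above:
Code:
#include <stdio.h>

int main(void)
{
    unsigned int codepoint = 0x1F4A9;             // the pile of poo
    unsigned int offset    = codepoint - 0x10000; // 0xF4A9, a 20 bit number

    unsigned short lead  = 0xD800 + (offset >> 10);    // high 10 bits -> 0xD83D
    unsigned short trail = 0xDC00 + (offset & 0x3FF);  // low  10 bits -> 0xDCA9

    printf("U+%X -> %04X %04X\n", codepoint, lead, trail); // U+1F4A9 -> D83D DCA9
    return 0;
}
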
Easy? Yep, so here is how to write a correct str_length() for UTF16 encoded strings:
Code:
unsigned int str_lengthw(unsigned short *string)
{
    unsigned int length = 0;
    while(*string)
    {
        // Check if this is a 4 byte character (lead surrogate).
        // Note: the pointer must be unsigned, otherwise the comparison against 0xD800 never matches.
        if(*string >= 0xD800 && *string <= 0xDBFF)
        {
            // Skip the trail surrogate as well, so the whole 4 byte character counts as one.
            // In reality you should then check if the trail surrogate is in the correct range as well, because, you know, third party input can't be trusted at all.
            string ++;
        }
    
        length ++;
        string ++;
    }
    
    return length;
}



By now two more things should be clear:
a) printf() with the %s format specifier and a UTF16 string doesn't work
b) UTF16 can't possibly fit into an ASCII string
(special c:) Don't use the string directly as the format string, for heaven's sake, third party input can't be trusted!!!! (see the snippet below)
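
To spell out point c (my own example, not from the original post): if a string comes from a file or from a user, never hand it to printf() as the first argument, because any stray % in it will be interpreted as a format specifier.
Code:
#include <stdio.h>

int main(void)
{
    // Imagine this came out of a text file; the % sequences inside are pure poison.
    const char *untrusted = "100% done%n";

    printf("%s\n", untrusted);  // fine: the input is only ever treated as data
    // printf(untrusted);       // don't: the input is treated as a format string
    return 0;
}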

Now, the way to make printf() work with a UTF16 string is by converting the UTF16 string into an ASCII string, which is lossy in most cases, because UTF16 can represent many, many more characters than ASCII possibly can. Writing such a conversion function is a nice exercise, and everyone here should be capable of doing that (now that they know what a UTF16 string looks like). If you can't be bothered with this, here you go (but please feel bad for about 10 minutes or so):
Click to reveal..

Code:
const char *str_UTF16ToASCII(unsigned short *string)
{
    unsigned int length = str_lengthw(string);
    char *buffer = malloc(length + 1);
    char *temp = buffer;
    
    while(*string)
    {
        // Everything outside the ASCII range gets replaced with '?'.
        char character = '?';
        if(*string <= 0x7F)
            character = (char)*string;

        *temp = character;
    
        // Lead surrogate: skip the trail surrogate too, it belongs to the same (non-ASCII) character.
        if(*string >= 0xD800 && *string <= 0xDBFF)
            string ++;
    
        string ++;
        temp ++;
    }
    
    *temp = '\0';
    STRING *tstring = str_printf(NULL, "%s", buffer);
    free(buffer);
    
    return _chr(tstring);
}


Posted By: Uhrwerk

Re: UNICODE Text to String - 02/07/13 23:02

This post is awesome. It should be printed on A1 paper and be put directly into the ACKNEX temple right next to the JCL poster.

I would have voted for the wiki in the first place, but that's offline.
Posted By: Benni003

Re: UNICODE Text to String - 02/08/13 08:43

Thanks for your help, it's working!
And thank you JustSid for your fantastic explanation!
Posted By: Benni003

Re: UNICODE Text to String - 02/08/13 09:34

Hm, now I've got another problem:
I want str_main and str_pointer to be compared, and if they are the same, the engine should exit.
I also have a Unicode text file named file.txt which contains just "NULL".
Can anyone help me with this?
It has to be Unicode, because I need special characters.

#include <acknex.h>
#include <default.c>

FONT* fontArial = "Arial Unicode MS#30";

PANEL* panel =
{
    pos_x = 0; pos_y = 0;
    flags = SHOW; layer = 3;
}

STRING* str_main;
STRING* str_pointer;

function main()
{
    str_main = str_create("");
    str_pointer = str_create("NULL");

    //---------------------------------------------
    var file = file_open_read("file.txt");

    file_str_readtow(file,str_main,NULL,5000);

    if(str_cmpni(str_main,str_pointer) == 1) { sys_exit(""); }

    file_close(file);
    //----------------------------

    pan_setstring(panel,0,0, 0,fontArial,str_main);
    pan_setstring(panel,0,0,20,fontArial,str_pointer);
}
Posted By: fogman

Re: UNICODE Text to String - 02/08/13 11:52

The problem is here:
str_pointer = str_create("NULL");

Basically, if you work with Unicode, you can't simply define a string directly in your .c or .h files. You have to read all strings from text files, even the ones used for comparisons. Think about it: you are trying to compare ASCII with Unicode, and that won't work.

Solution: read "NULL" from a Unicode text file into str_pointer.
Posted By: fogman

Re: UNICODE Text to String - 02/08/13 11:56

I bet you have to localize a game for a publisher? Contact me if you need help, I've done that four times already. If you haven't planned for Unicode from the start, it'll be a bunch of work. You can contact me at tf [at] zappadong.de
Posted By: Talemon

Re: UNICODE Text to String - 02/08/13 12:12

You can define Unicode strings in code, I do it like this:
short null_char = '\0';
STRING* str = str_createw(&null_char);
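
Building on that (my own sketch, not Talemon's code, assuming Lite-C initializes global arrays as written): since str_createw() takes a pointer to 16 bit characters, you can also spell out a whole word as a short array, for example the "NULL" marker from the earlier posts. This only works for characters you can type as plain character literals; anything fancier still has to come from a file or from explicit code point values.
Code:
// A UTF16 "NULL" literal built by hand: four characters plus the terminating zero.
short null_word[5] = { 'N', 'U', 'L', 'L', 0 };

STRING* str_pointer_w = str_createw(null_word);
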
Posted By: Benni003

Re: UNICODE Text to String - 02/08/13 13:05

Originally Posted By: fogman
The problem is here:
str_pointer = str_create("NULL");

Basically, if you work with Unicode, you can't simply define a string directly in your .c or .h files. You have to read all strings from text files, even the ones used for comparisons. Think about it: you are trying to compare ASCII with Unicode, and that won't work.

Solution: read "NULL" from a Unicode text file into str_pointer.


Your solution is good, but I tried this before.
It's not working correctly.

//--------------------------
1. Working:

Textfile:
"NULL"

file = file_open_read("file.txt");

file_str_readtow(file,str1,NULL,5000); // str1 includes "NULL" from textfile

if(str_cmpni(str1,str_pointer)==1){sys_exit("");} // str_pointer includes "NULL" from other file

file_close(file);

//--------------------------
2. NOT working: (this is the one I need)

Textfile:
"Hello"
"NULL"

file = file_open_read("file.txt");

file_str_readtow(file,str1,NULL,5000); // str1 includes "Hello" from textfile
file_str_readtow(file,str2,NULL,5000); // str2 includes "NULL" from textfile

if(str_cmpni(str2,str_pointer)==1){sys_exit("");} // str_pointer includes "NULL" from other file

file_close(file);

//------------------------------

It seems like it's not working if it's not the first string from the textfile.
Posted By: fogman

Re: UNICODE Text to String - 02/08/13 13:57

You're right!
This seems to be a bug, because the following works.
Here, I read str_pointer twice:


Content of "null.txt":
NULL
NULL

Content of "file.txt":
Hello
NULL

Code:
#include <acknex.h>
#include <default.c>

STRING* str1 = "#1";
STRING* str2 = "#1";
STRING* str_pointer = "#1";


function main()
{
 	var file;

	file = file_open_read("null.txt");
	file_str_readtow(file,str_pointer,NULL,5000); // NULL
	file_str_readtow(file,str_pointer,NULL,5000); // NULL
	file_close(file);
 	file = file_open_read("file.txt");
 
	file_str_readtow(file,str1,NULL,5000); // Hello
 	file_str_readtow(file,str2,NULL,5000); // NULL
	file_close(file);
	
	if(str_cmpi(str2,str_pointer)!=0){error("It works!");} // str_pointer includes "NULL" from other file
}



I mostly use txt_loadw for ingame text, not file_str_readtow, so I didn't come across this bug.
You should send a bug report to jcl.
Posted By: WretchedSid

Re: UNICODE Text to String - 02/08/13 15:00

It's not a bug per se, it works as intended! The first character in a Unicode text is the so called BOM (byte order mark), which is put there because you can encode a Unicode text as big endian or little endian, so the BOM is there to signal the endianness of the encoded text.

Now, you can argue that the read function should just ignore the BOM, but it's actually part of the text, just like every other control character, so you can also argue that it should be there. As a solution: open the file, read the first character (16 bit) and check whether it's the BOM (because some retards write editors that don't include the BOM for whatever reason), and then either seek back one character or just continue. The BOM has the code point U+FEFF, but you should read it as two single bytes and compare those, unless you know how to write a function that reverses the byte order of something.
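
A minimal sketch of that idea in Lite-C (my own code, not from the post above), assuming a little endian UTF16 file and using file_asc_read() to peek at the raw bytes; instead of seeking back, it simply reopens the file when no BOM is found:
Code:
// Opens a UTF16 (little endian) text file and swallows the BOM if there is one.
// Returns the file handle, positioned at the first real character.
var file_open_read_skip_bom(char* filename)
{
    var file = file_open_read(filename);
    if (!file) return 0;

    var byte1 = file_asc_read(file); // first raw byte
    var byte2 = file_asc_read(file); // second raw byte

    // U+FEFF stored little endian shows up as the byte pair 0xFF 0xFE.
    if (byte1 == 0xFF && byte2 == 0xFE)
        return file; // BOM consumed, the next read starts at the text

    // No BOM found: start over at the beginning of the file.
    file_close(file);
    return file_open_read(filename);
}
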
Posted By: Talemon

Re: UNICODE Text to String - 02/08/13 15:19

Originally Posted By: JustSid
It's not a bug per se, it works as intended! The first character in a Unicode text is the so called BOM (byte order mark), which is put there because you can encode a Unicode text as big endian or little endian, so the BOM is there to signal the endianness of the encoded text.

Now, you can argue that the read function should just ignore the BOM, but it's actually part of the text, just like every other control character, so you can also argue that it should be there. As a solution: open the file, read the first character (16 bit) and check whether it's the BOM (because some retards write editors that don't include the BOM for whatever reason), and then either seek back one character or just continue. The BOM has the code point U+FEFF, but you should read it as two single bytes and compare those, unless you know how to write a function that reverses the byte order of something.


Hah! I knew this day would come:
http://www.opserver.de/ubb7/ubbthreads.php?ubb=showflat&Number=413022#Post413022
© 2024 lite-C Forums