VASmalltalk – ICU and CharsetDetection

I have not posted anything about ICU for a while. I have switched to ICU49, and most of the tests are green. Some changed from yellow to green; some stayed yellow because I would expect a different result, but the community does not agree.

In this post I would like to show how to do charset detection for a given string. To summarize: the results are only hints, each with a confidence value from 0 to 100. In addition to the charset, the system also tries to detect the language of the text. Sometimes this works pretty well, and it is said that longer text gives more accurate guesses for the language.

I only want to show the high-level Smalltalk API here:

| results aStream |
results :=
 'Hello Marten. This is an english text, but I do not tell you this' icuAllMatchingCharsets.

aStream := WriteStream on: String new.
results do: [ :each | each printOn: aStream. aStream cr. ].
Transcript show: aStream contents

gives the result set:

CharsetMatch (language=[en],  name = [ISO-8859-1], confidence = [51])
CharsetMatch (language=[hu],  name = [ISO-8859-2], confidence = [46])
CharsetMatch (language=[tr],  name = [ISO-8859-9], confidence = [18])
CharsetMatch (language=[],  name = [UTF-8], confidence = [10])
CharsetMatch (language=[ja],  name = [Shift_JIS], confidence = [10])
CharsetMatch (language=[zh],  name = [GB18030], confidence = [10])
CharsetMatch (language=[ja],  name = [EUC-JP], confidence = [10])
CharsetMatch (language=[ko],  name = [EUC-KR], confidence = [10])
CharsetMatch (language=[zh],  name = [Big5], confidence = [10])
CharsetMatch (language=[ar],  name = [IBM420_ltr], confidence = [4])

And if we want to check a German text:

| results aStream |
results :=
 'Hallo Marten. Das ist ein deutscher Text' icuAllMatchingCharsets.

aStream := WriteStream on: String new.
results do: [ :each | each printOn: aStream. aStream cr. ].
Transcript show: aStream contents

gives the result set:

CharsetMatch (language=[de],  name = [ISO-8859-1], confidence = [97])
CharsetMatch (language=[tr],  name = [ISO-8859-9], confidence = [45])
CharsetMatch (language=[hu],  name = [ISO-8859-2], confidence = [30])
CharsetMatch (language=[],  name = [UTF-8], confidence = [10])
CharsetMatch (language=[ja],  name = [Shift_JIS], confidence = [10])
CharsetMatch (language=[zh],  name = [GB18030], confidence = [10])
CharsetMatch (language=[ja],  name = [EUC-JP], confidence = [10])
CharsetMatch (language=[ko],  name = [EUC-KR], confidence = [10])
CharsetMatch (language=[zh],  name = [Big5], confidence = [10])
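
Since the results are only hints, a caller will probably want to take the match with the highest confidence and ignore it if the value is too low. Here is a minimal sketch of that idea. Please note the assumptions: I assume that a CharsetMatch answers #confidence (a number) and #name and #language (strings); these selectors are guessed from the printed output above and are not confirmed API. The threshold of 50 is an arbitrary value chosen only for illustration.

```smalltalk
| results best minimumConfidence |
"Hypothetical threshold, chosen only for illustration."
minimumConfidence := 50.

results :=
 'Hallo Marten. Das ist ein deutscher Text' icuAllMatchingCharsets.

"Find the match with the highest confidence.
 Assumption: CharsetMatch answers #confidence; the selector is a guess."
best := nil.
results do: [ :each |
	(best isNil or: [ each confidence > best confidence ])
		ifTrue: [ best := each ] ].

"Only trust the best match when it reaches the threshold.
 Assumption: #name and #language answer strings; the selectors are guesses."
(best notNil and: [ best confidence >= minimumConfidence ])
	ifTrue: [
		Transcript
			show: 'Best guess: ', best name, ' (', best language, ')';
			cr ]
	ifFalse: [
		Transcript
			show: 'No sufficiently confident match';
			cr ]
```

With the two result sets shown above, such a threshold would accept the German text (97) without doubt, while the English text (51) passes only narrowly and is close to the second candidate (46). So the difference between the first two matches may also be worth a look.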

This service is available with version 08.05.01-49.01.02 of MSKICUApp.
