I have not posted anything about ICU for a while. I have moved to ICU 49, and most of the tests were green. Some changed from yellow to green; some stayed yellow, because I would expect a different result, but the community does not agree.
In this post I would like to show how to do charset detection for a given string. To summarize: the results are only hints, each with a confidence value from 0 to 100. In addition to the charset, the system also tries to detect the language of the text. Sometimes that works pretty well, and it is said that longer texts give more accurate language guesses.
I only want to show the high-level Smalltalk API here:
    | results aStream |
    results := 'Hello Marten. This is an english text, but I do not tell you this' icuAllMatchingCharsets.
    aStream := WriteStream on: String new.
    results do: [ :each |
        each printOn: aStream.
        aStream cr. ].
    Transcript show: aStream contents
gives the result set:
    CharsetMatch (language=[en], name = [ISO-8859-1], confidence = [51])
    CharsetMatch (language=[hu], name = [ISO-8859-2], confidence = [46])
    CharsetMatch (language=[tr], name = [ISO-8859-9], confidence = [18])
    CharsetMatch (language=[], name = [UTF-8], confidence = [10])
    CharsetMatch (language=[ja], name = [Shift_JIS], confidence = [10])
    CharsetMatch (language=[zh], name = [GB18030], confidence = [10])
    CharsetMatch (language=[ja], name = [EUC-JP], confidence = [10])
    CharsetMatch (language=[ko], name = [EUC-KR], confidence = [10])
    CharsetMatch (language=[zh], name = [Big5], confidence = [10])
    CharsetMatch (language=[ar], name = [IBM420_ltr], confidence = [4])
And if we want to check a German text:
    | results aStream |
    results := 'Hallo Marten. Das ist ein deutscher Text' icuAllMatchingCharsets.
    aStream := WriteStream on: String new.
    results do: [ :each |
        each printOn: aStream.
        aStream cr. ].
    Transcript show: aStream contents
gives the result set:
    CharsetMatch (language=[de], name = [ISO-8859-1], confidence = [97])
    CharsetMatch (language=[tr], name = [ISO-8859-9], confidence = [45])
    CharsetMatch (language=[hu], name = [ISO-8859-2], confidence = [30])
    CharsetMatch (language=[], name = [UTF-8], confidence = [10])
    CharsetMatch (language=[ja], name = [Shift_JIS], confidence = [10])
    CharsetMatch (language=[zh], name = [GB18030], confidence = [10])
    CharsetMatch (language=[ja], name = [EUC-JP], confidence = [10])
    CharsetMatch (language=[ko], name = [EUC-KR], confidence = [10])
    CharsetMatch (language=[zh], name = [Big5], confidence = [10])
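Since the results are only hints, in practice one often just wants the most likely candidate. The following is a minimal sketch, not part of the documented API: it assumes that icuAllMatchingCharsets answers a sequenceable collection (one that responds to first) and that the candidates arrive best guess first, as both result sets in this post suggest. Apart from icuAllMatchingCharsets itself it only uses standard collection and printing messages.

```smalltalk
| results best |
results := 'Hallo Marten. Das ist ein deutscher Text' icuAllMatchingCharsets.
"Assumption: the collection is ordered by descending confidence,
 so the first element is the best guess. This is inferred from the
 printed result sets, not from any documented guarantee."
results isEmpty
    ifTrue: [ Transcript show: 'no charset candidates'; cr ]
    ifFalse: [
        best := results first.
        "printString relies on the same printOn: used in the examples above"
        Transcript show: best printString; cr ]
```

Even the top candidate remains a guess. For short inputs like the English example, where the first two candidates are only a few confidence points apart, it seems wise to treat the answer with some care.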
This service is available with version 08.05.01-49.01.02 of MSKICUApp.