smalltalk - Squeak Monticello character-encoding -

June 15, 2011

For work I am on a project using headless Squeak on a (displayless, remote) Linux server, and using Squeak on a Windows developer machine.

The code on the developer machine is managed using Monticello. Unfortunately I have to copy the .mcz to the server using SFTP (e.g. having a push repository on the server is not possible for security reasons). The code can then be merged, e.g.:

MczInstaller installFileNamed: 'name-b.18.mcz'.

This works.

Unfortunately, our code base contains strings that contain umlauts and other non-ASCII characters. During the Monticello re-import some of them are replaced with other characters, and some are replaced with nothing.

I also tried e.g.

MczInstaller installStream: (FileStream readOnlyFileNamed: '...') binary

(Note that .mcz files are .zip files, so binary should be appropriate; I guess it is the default anyway.)

Finding out how to make Monticello's transfer preserve Squeak's internal encoding of non-ASCII characters is the main goal of this question. Changing all the source code to use ASCII-only strings (at least in this codebase) is much less desirable because of the manual labor involved. If you are interested in why it is not a simple grep-replace in this case, read this side note:

(Side note: in a simplified/special case, the codebase uses Seaside's #text: method to render strings that contain characters which have to be HTML-escaped. This works fine for our non-ASCII characters, e.g. it converts ä to &auml;. But if I grep-replace the literal ä's with &auml; explicitly, I would have to use the #html: method instead (otherwise it would double-escape), which would require me to replace all the other characters that have to be HTML-escaped (e.g. &), and the source code contains such characters as well. There are also other cases, where the #text: calls take third-party strings and so may not be replaced with #html: calls...)
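The double-escaping problem from the side note can be sketched outside Squeak. The following is Python, not Smalltalk, and `render_text` / `render_html` are hypothetical stand-ins for an escaping renderer and a raw renderer in the spirit of `#text:` and `#html:`; they are not Seaside's implementation, and the numeric `&#228;` form comes from Python's `xmlcharrefreplace` handler, not from Seaside.

```python
import html


def render_text(s: str) -> str:
    """Hypothetical escaping renderer: escape &, <, > and then
    replace every non-ASCII character with a numeric reference."""
    escaped = html.escape(s, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


def render_html(s: str) -> str:
    """Hypothetical raw renderer: emit the string untouched."""
    return s


literal = "M\u00e4dchen & Co"       # source string with a literal ä
pre_escaped = "M&auml;dchen & Co"   # same string after a grep-replace

print(render_text(literal))      # ä and & are both escaped once
print(render_text(pre_escaped))  # the & of &auml; is escaped a second time
print(render_html(pre_escaped))  # &auml; survives, but the bare & is left unescaped
```

The second and third calls show the two ways a plain grep-replace goes wrong: the escaping path mangles the entity, and the raw path stops escaping everything else.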

Squeak uses Unicode (ISO 10646) internally for encoding the characters in a String.
It might use the CP1252 extension for characters in the range 16r80 to: 16r9F, but I'm not sure anymore.

The character codes are written as-is on the stream source.st, and these codes are made of a single byte each for a ByteString, i.e. when all characters are <= 16rFF. In this case, the file should be considered encoded in ISO-8859-1 (Latin-1) or CP1252.

If you ever have character codes > 16rFF, then a WideString is used in Squeak. Once again the codes are written as-is on the stream source.st, but this time they are 32-bit codes (written in big-endian order). Technically, the encoding is UTF-32BE.
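To make the two byte layouts concrete, here is a minimal sketch in Python (not Smalltalk, and not Squeak's actual writer). `squeak_like_bytes` is a hypothetical helper that only mimics the rule described above: one byte per character when every code is <= 16rFF, otherwise four big-endian bytes per character.

```python
def squeak_like_bytes(text: str) -> bytes:
    """Hypothetical helper mimicking the layout described above.

    All codes <= 0xFF  -> one byte per character (same as Latin-1).
    Any code  >  0xFF  -> four big-endian bytes per character (UTF-32BE).
    """
    if all(ord(ch) <= 0xFF for ch in text):
        return text.encode("latin-1")
    return text.encode("utf-32-be")


byte_case = squeak_like_bytes("\u00e4")   # 'ä', code 16rE4
wide_case = squeak_like_bytes("\u20ac")   # '€', code 16r20AC

print(byte_case.hex())  # one byte for the umlaut
print(wide_case.hex())  # four bytes, most significant first
```

Note that a single wide character switches the whole string to the four-byte layout, which is why a reader has to know in advance which of the two forms it is looking at.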

Now what does MczInstaller do? It uses the snapshot/source.st file, and uses setConverterForCode for reading this file, which selects either UTF-8 or MacRoman... So non-ASCII characters might be changed, and worse, in the case of a WideString the content will be re-interpreted as a ByteString.
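The effect of reading such bytes with the wrong converter can be reproduced with Python's standard codecs. This is only an illustration of the byte-level mismatch, under the assumption that the file holds raw Latin-1 or UTF-32BE bytes as described above; `misread` is a hypothetical helper, and Squeak's own converters may handle invalid input differently from Python's `replace` and `ignore` modes.

```python
def misread(text: str, written_as: str, read_as: str, errors: str = "replace") -> str:
    """Hypothetical helper: encode with one codec, decode with another."""
    return text.encode(written_as).decode(read_as, errors=errors)


word = "M\u00e4dchen"  # contains ä (16rE4)

# Latin-1 bytes read as MacRoman: the umlaut turns into another character.
as_macroman = misread(word, "latin-1", "mac_roman")

# Latin-1 bytes read as UTF-8: 16rE4 starts a multi-byte sequence that never
# completes, so the decoder substitutes U+FFFD or drops the byte entirely.
as_utf8_replaced = misread(word, "latin-1", "utf-8", errors="replace")
as_utf8_dropped = misread(word, "latin-1", "utf-8", errors="ignore")

# UTF-32BE bytes of a wide character read one byte at a time.
wide_as_bytes = misread("\u20ac", "utf-32-be", "latin-1")

print(as_macroman, as_utf8_replaced, as_utf8_dropped, len(wide_as_bytes))
```

Both symptoms from the question show up here: a character replaced by a different one, and a character replaced by nothing.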

MC itself doesn't use the snapshot/source.st member in the archive.
It rather uses snapshot.bin (see the code in MCMczReader and MCMczWriter).
This is a binary file whose format is governed by DataStream.

So the snippet you should use is rather:

MCMczReader loadVersionFile: 'yourpackage-b.18.mcz'
