------------------------------------------------------------------------------
------------------------------------------------------------------------------
This file was saved as utf8. If you don't save or view this file as utf8
then the special characters will not appear correctly.
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Summary
we can change the cp programmatically such that latin1 always works, but
it only displays correctly in the console if the font is Lucida Console
the Lucida Console font can be expected to be installed on all Windows systems
there is no programmatic way to set the font
it may be possible to programmatically determine the font and print a warning
when it is not Lucida Console
for most european customers the default oem cp is 850 which does display
all latin1, so they work.
we could force some or all systems to a good codepage, or we could just
force systems that start on a bad codepage.
we could force no systems and display question marks for bad characters
and document that if they want it to work then they must change their
code page.
all you have to do to change the default oem codepage is to change your
system locale.
Start > Control Panel > Regional and Language Options > Advanced
if I change the locale to French (France) and reboot then the default
oem code page is 850 and the extended characters display correctly
in the raster font.
Confirmed that the default oem code page for the united kingdom is 850
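The idea of detecting the font and warning when it is not Lucida Console could be sketched as below. This assumes Windows Vista or later, where GetCurrentConsoleFontEx reports the face name. The names is_lucida_console and console_font_is_lucida are hypothetical and were not part of the tests in these notes. On a non-Windows build the query simply reports that it is unavailable.

```c
#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0600   /* GetCurrentConsoleFontEx needs Vista or later */
#endif
#include <windows.h>
#endif
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: 1 when the face name is exactly "Lucida Console". */
int is_lucida_console(const wchar_t *face_name)
{
    assert(face_name != NULL);
    return wcscmp(face_name, L"Lucida Console") == 0;
}

/* Hypothetical helper: 1 if the console font is Lucida Console, 0 if it is
   some other font, -1 if the font cannot be queried. */
int console_font_is_lucida(void)
{
#ifdef _WIN32
    CONSOLE_FONT_INFOEX info = {0};
    info.cbSize = sizeof info;   /* the API requires the size to be filled in */
    if (!GetCurrentConsoleFontEx(GetStdHandle(STD_OUTPUT_HANDLE), FALSE, &info))
    {
        return -1;
    }
    return is_lucida_console(info.FaceName);
#else
    return -1;   /* not a Windows build */
#endif
}
```

A caller would print the warning only when the result is 0, since -1 means the font could not be determined.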
------------------------------------------------------------------------------
------------------------------------------------------------------------------
References:
http://en.wikipedia.org/wiki/Code_page_437
http://en.wikipedia.org/wiki/Windows-1252
http://support.microsoft.com/kb/q65124/
http://www.microsoft.com/globaldev/reference/wincp.mspx
http://www.microsoft.com/globaldev/reference/oem.mspx
http://www.microsoft.com/globaldev/reference/iso.mspx
http://www.computerhope.com/chcphlp.htm
http://en.wikipedia.org/wiki/Windows_Alt_keycodes
http://support.microsoft.com/kb/108450
http://blogs.msdn.com/oldnewthing/archive/2005/03/08/389527.aspx
http://blogs.msdn.com/michkap/archive/2005/03/01/382289.aspx
http://blogs.msdn.com/michkap/archive/2005/02/08/369197.aspx
http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q247815
http://www.kpym.com/blog/2005/11/console-howto-change-console-font.html
http://www.microsoft.com/globaldev/getwr/steps/WRG_lclmdl.mspx
---------------------------------------------------------------
Windows OEM Code Pages
• 437 (US)
• 720 (Arabic)
• 737 (Greek)
• 775 (Baltic)
• 850 (Multilingual Latin I)
• 852 (Latin II)
• 855 (Cyrillic)
• 857 (Turkish)
• 858 (Multilingual Latin I + Euro)
• 862 (Hebrew)
• 866 (Russian)
Windows ANSI Code Pages
• 1250 (Central Europe)
• 1251 (Cyrillic)
• 1252 (Latin I)
• 1253 (Greek)
• 1254 (Turkish)
• 1255 (Hebrew)
• 1256 (Arabic)
• 1257 (Baltic)
• 1258 (Vietnam)
• 874 (Thai)
---------------------------------------------------------------
cp437     cp1252       cp737      cp850
PESETA    zWithCaron   smallEta   timesSign
₧         ž            η          ×
20A7      017E         03B7       00D7
158       158          158        158
x9E       x9E          x9E        x9E
Note that cp437 and cp737 do not contain ž
and cp737 and cp1252 do not contain ₧
and cp437 and cp1252 do not contain η
Note that cp437 and cp737 do not contain ×, but cp1252 does
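The table can be written down as a small lookup, which makes it easy to check what a received byte x9E means under each code page. This is a minimal sketch covering only the four code pages above; cp_entry, find_byte_9e, and byte_9e_name are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One row per code page from the table above: what byte 0x9E decodes to. */
struct cp_entry
{
    unsigned codepage;
    unsigned codepoint;   /* Unicode code point for byte 0x9E */
    const char *name;     /* informal name used in these notes */
};

static const struct cp_entry byte_9e_table[] =
{
    {  437, 0x20A7, "PESETA"     },
    { 1252, 0x017E, "zWithCaron" },
    {  737, 0x03B7, "smallEta"   },
    {  850, 0x00D7, "timesSign"  },
};

/* Returns the table row for `codepage`, or NULL when the code page is not
   one of the four listed above. */
const struct cp_entry *find_byte_9e(unsigned codepage)
{
    size_t i;
    for (i = 0; i < sizeof byte_9e_table / sizeof byte_9e_table[0]; i++)
    {
        if (byte_9e_table[i].codepage == codepage)
        {
            return &byte_9e_table[i];
        }
    }
    return NULL;
}

/* Informal name of byte 0x9E; the code page must be one from the table. */
const char *byte_9e_name(unsigned codepage)
{
    const struct cp_entry *entry = find_byte_9e(codepage);
    assert(entry != NULL);
    return entry->name;
}
```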
------------------------------------------------------------------------------
------------------------------------------------------------------------------
With: Active code page: 437, Font: Raster Fonts
INPUT      alt158   alt0158
DISPLAY    ₧        z         // PESETA and z
REDISPLAY  ₧        z         // PESETA and z
RECEIVED   x9E      x7A
DECIMAL    158      122
FROM OEM   20A7     007A
FROM ANSI  017E     007A
Here alt158 is asking for byte 158 from cp437 which is ₧
And alt0158 is asking for byte 158 from cp1252 which is ž
Because we are asking for ž from cp1252 when the active code page
is cp437 (in which it doesn't exist) it first undergoes a best-fit
mapping. This is not a desirable input method.
Since cp437 is an OEM codepage, the only way to always get correct
bytes is to use CP_OEMCP and perform the round-trip test. Characters
not in the active code page cannot be received.
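The rule observed here (alt# is looked up in the active code page, alt0# in the ANSI code page) can be modelled in a few lines. This is only a sketch of the behaviour seen in these tests, not of how Windows implements it; alt_code_page is a hypothetical name, and both code pages are passed in by the caller so that nothing is hard-coded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper modelling the observation above: an Alt code typed
   with a leading zero (alt0158) is looked up in the ANSI code page, one
   typed without it (alt158) in the console's active code page. */
unsigned alt_code_page(const char *digits, unsigned active_cp, unsigned ansi_cp)
{
    assert(digits != NULL && digits[0] != '\0');
    return digits[0] == '0' ? ansi_cp : active_cp;
}
```

With an active code page of 437 and an ANSI code page of 1252, "158" resolves to 437 and "0158" to 1252, which matches the two INPUT columns.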
---------------------------------------------------------------
With: Active code page: 437, Font: Lucida Console
INPUT      alt158   alt0158
DISPLAY    ₧        ž         // PESETA and zWithCaron
REDISPLAY  ₧        z         // PESETA and z
RECEIVED   x9E      x7A
DECIMAL    158      122
FROM OEM   20A7     007A
FROM ANSI  017E     007A
Here alt158 is asking for byte 158 from cp437 which is ₧
And alt0158 is asking for byte 158 from cp1252 which is ž
(i.e. same as above)
It appears that the font layer is a multi-byte layer that is above the
input layer. So the font knows we asked for ž and displays it, but
since the active code page does not support that character the best-fit
mapping still occurs.
Since cp437 is an OEM codepage, the only way to always get correct
bytes is to use CP_OEMCP and perform the round-trip test. Characters
not in the active code page cannot be received.
Note that using the Lucida font and entering bytes via alt0#
when cp1252 is not the active codepage will confuse the user,
because the requested character will be displayed but not received.
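These notes do not spell out the round-trip test. A closely related check, swapped in here as an assumption, is to ask WideCharToMultiByte whether a character has an exact encoding in a code page, with best-fit mapping turned off via WC_NO_BEST_FIT_CHARS. The name has_exact_mapping is hypothetical, and on a non-Windows build the sketch only reports that the check is unavailable.

```c
#include <assert.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: 1 if `ch` has an exact (not best-fit) encoding in
   code page `cp`, 0 if it does not, -1 if the check cannot be made
   (the conversion failed, or this is not a Windows build). */
int has_exact_mapping(unsigned cp, unsigned ch)
{
    assert(ch <= 0xFFFF);   /* one UTF-16 code unit only, no surrogate pairs */
#ifdef _WIN32
    {
        WCHAR wide = (WCHAR)ch;
        char bytes[8];
        BOOL used_default = FALSE;
        int n = WideCharToMultiByte(cp, WC_NO_BEST_FIT_CHARS, &wide, 1,
                                    bytes, (int)sizeof bytes,
                                    NULL, &used_default);
        if (n == 0)
        {
            return -1;
        }
        /* used_default is set when the character had to be replaced */
        return used_default ? 0 : 1;
    }
#else
    (void)cp;
    (void)ch;
    return -1;   /* not a Windows build */
#endif
}
```

Checking 017E (ž) against cp437 this way would be one means of warning that a character shown by the Lucida font cannot actually be received.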
---------------------------------------------------------------
With: Active code page: 1252, Font: Lucida Console
INPUT      alt158   alt0158
DISPLAY    ž        ž         // zWithCaron
REDISPLAY  ž        ž         // zWithCaron
RECEIVED   x9E      x9E
DECIMAL    158      158
FROM OEM   20A7     20A7
FROM ANSI  017E     017E
Here alt158 is asking for byte 158 from cp1252 which is ž
And alt0158 is asking for byte 158 from cp1252 which is ž
Since our active codepage is cp1252, both the alt# method and the
alt0# method are asking for the same byte, so there will never be
a best-fit mapping. And the Lucida font is able to display characters
from the active code page, so they always appear correctly.
Since cp1252 is an ANSI codepage, the only way to always get correct
bytes is to use CP_ACP. Here it may not be necessary to perform the
round-trip test because there may not be any method to input characters
not in this codepage. This is not true in general for ANSI codepages.
---------------------------------------------------------------
With: Active code page: 1252, Font: Raster Fonts
INPUT      alt158   alt0158
DISPLAY    ₧        ₧         // PESETA
REDISPLAY  ₧        ₧         // PESETA
RECEIVED   x9E      x9E
DECIMAL    158      158
FROM OEM   20A7     20A7
FROM ANSI  017E     017E
Here alt158 is asking for byte 158 from cp1252 which is ž
And alt0158 is asking for byte 158 from cp1252 which is ž
(i.e. same as above)
This confirms that Raster Fonts don't support changing the codepage.
It seems that the Raster Font is hard-coded to cp437, just as the
alt0# input method is hard-coded to cp1252.
Since our active codepage is cp1252, both the alt# method and the
alt0# method are asking for the same byte, so there will never be
a best-fit mapping. But since the font is hard-coded to cp437
it displays byte 158 as ₧ instead of ž.
This scenario should never be used, as it will definitely confuse
the customer. By using CP_ACP, one could get the byte for the character
that was displayed, but since that wasn't the character that was
requested, it seems unwise.
---------------------------------------------------------------
With: Active code page: 737, Font: Lucida Console
INPUT      alt158   alt0158
DISPLAY    η        ž         // smallEta and zWithCaron
REDISPLAY  η        ?         // smallEta and ?
RECEIVED   x9E      x3F
DECIMAL    158      63
FROM OEM   20A7     003F
FROM ANSI  017E     003F
Here alt158 is asking for byte 158 from cp737 which is η
And alt0158 is asking for byte 158 from cp1252 which is ž
Again, it looks like the input is multibyte and the font understands
that and displays ž correctly but since the active codepage is singlebyte
and ž doesn't exist, a best-fit mapping must occur. Because of this,
it is bad to use the Lucida font because the character displayed is
not the input received.
Bad news. It appears that CP_OEMCP and CP_ACP are hard-coded to
cp437 and cp1252 respectively. There is no way for us to get the
correct byte for η
---------------------------------------------------------------
With: Active code page: 737, Font: Raster Fonts
INPUT      alt158   alt0158
DISPLAY    ₧        ?         // PESETA and ?
REDISPLAY  ₧        ?         // PESETA and ?
RECEIVED   x9E      x3F
DECIMAL    158      63
FROM OEM   20A7     003F
FROM ANSI  017E     003F
Here alt158 is asking for byte 158 from cp737 which is η
And alt0158 is asking for byte 158 from cp1252 which is ž
(i.e. same as above)
Again, because the Raster font doesn't support ž, the best-fit mapping
occurs before display. This is good because the character displayed
is the character received.
Bad news worse than above. Not only are the CP_OEMCP and CP_ACP
conversions invalid, since the Raster font only displays characters
from cp437, it appears that ₧, not η, was entered.
------------------------------------------------------------------------------
------------------------------------------------------------------------------
In the above I have assumed that alt# and CP_OEMCP are hard-coded to
cp437, and that alt0# and CP_ACP are hard-coded to cp1252. It is likely
that these settings are configurable during install. All we have discovered
is that they are not affected by chcp.
We have confirmed that chcp is of no use to us. It is only useful if
you want to remain in that codepage with single-byte characters.
MultiByteToWideChar
uCodePage specifies the codepage to be used when performing the conversion.
The codepage can be any valid codepage number.
The codepage may also be one of the following values:
CP_ACP instructs the API to use the currently set default Windows ANSI codepage.
CP_OEMCP instructs the API to use the currently set default OEM codepage.
So, if we force the console to Lucida font
and a known latin1 codepage
and hardcode MultiByteToWideChar to that codepage
then it will all work
Or we can let them use their default setup
and advise against use of alt0# because of the best-fit conversion
and suggest use of alt#
and use CP_OEMCP
and forbid use of chcp because of invalid conversion
but not all latin1 will always be possible to input or display
so rfc escape sequences will be necessary
---------------------------------------------------------------
Let's confirm that we can change the default OEM and ANSI codepages.
For Latin 1 use chcp 1252 and Lucida Console.
HKEY_CURRENT_USER\Software\Microsoft\Command Processor\AutoRun="chcp 1252"
HKEY_CURRENT_USER\Console\ (change the font somehow)
(wprintf doesn't work)
You can do the same from code
SetConsoleOutputCP
SetConsoleCP
The identifiers of the code pages available on the local computer are stored in the registry under the following key.
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage
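Setting the code page from code, but only on systems that start on a code page without full Latin-1 coverage, might look like the sketch below. The names cp_covers_latin1 and force_latin1_if_needed are hypothetical, the list of Latin-1-capable code pages is limited to ones that appear in these notes, and the choice of target code page is left to the caller. Display still depends on the console font.

```c
#include <assert.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: 1 if the code page can represent every printable
   Latin-1 character. Only code pages named in these notes are listed. */
int cp_covers_latin1(unsigned cp)
{
    return cp == 850 || cp == 858 || cp == 1252;
}

/* Hypothetical helper: switch the console to `target_cp` only when the
   current output code page cannot represent Latin-1. Returns the code page
   believed to be in effect afterwards, or 0 when there is no Windows
   console to query. */
unsigned force_latin1_if_needed(unsigned target_cp)
{
    assert(cp_covers_latin1(target_cp));
#ifdef _WIN32
    {
        unsigned cp = GetConsoleOutputCP();
        if (cp != 0 && !cp_covers_latin1(cp)
            && SetConsoleCP(target_cp) && SetConsoleOutputCP(target_cp))
        {
            cp = target_cp;
        }
        return cp;
    }
#else
    return 0;   /* not a Windows build */
#endif
}
```

A system that starts on 850 is left alone, while one that starts on 437 or 737 is moved to the target.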
---------------------------------------------------------------
cp437     cp1252       cp737      cp850
PESETA    zWithCaron   smallEta   timesSign
₧         ž            η          ×
20A7      017E         03B7       00D7
158       158          158        158
x9E       x9E          x9E        x9E
---------------------------------------------------------------
With Raster Font
INITIAL CP   437      437      850      737
CODE SET CP  437      850      737      1252
INPUT        alt158   alt158   alt158   alt158
DISPLAY      ₧        ₧        ₧        ₧
REDISPLAY    ₧        ₧        ₧        ₧
RECEIVED     x9E      x9E      x9E      x9E
DECIMAL      158      158      158      158
FROM REAL    20A7     00D7     03B7     017E
FROM OEM     20A7     20A7     20A7     20A7
FROM ANSI    017E     017E     017E     017E
FINAL CP     437      850      737      1252
With Lucida Console Font (mostly the same)
DISPLAY      ₧        ×        η        ž
REDISPLAY    ₧        ×        η        ž
So,
SetConsoleCP(cp);
SetConsoleOutputCP(cp);
These work great.
And we should always do this for console,
MultiByteToWideChar( GetConsoleCP(), MB_PRECOMPOSED | MB_ERR_INVALID_CHARS, ptr, strlen(ptr), wide, strlen(ptr) );
Because CP_OEMCP indicates the default cp, not the current cp.
So, if we're willing to force the console to 1252
then we are able to input and display all characters
but the font must always be Lucida.
You CANNOT control the font programmatically.
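The conversion rule above (use the console's current code page, not CP_OEMCP) has one wrinkle worth guarding against: CP_ACP has the value 0, so a GetConsoleCP() result of 0 must not be passed straight through. The sketch below assumes GetConsoleCP() yields 0 when no console is attached and falls back to the default OEM code page in that case. The names pick_input_cp and console_bytes_to_wide are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: choose the code page for decoding console input.
   Assumption: a console code page of 0 means "could not be determined";
   0 is also the value of CP_ACP, so it is replaced by the fallback. */
unsigned pick_input_cp(unsigned console_cp, unsigned default_oem_cp)
{
    return console_cp != 0 ? console_cp : default_oem_cp;
}

/* Hypothetical helper: decode `len` console bytes into `wide`, which has
   room for `cap` UTF-16 units. Returns the number of units written, or 0
   on failure. */
int console_bytes_to_wide(const char *bytes, int len, wchar_t *wide, int cap)
{
    assert(bytes != NULL && wide != NULL);
#ifdef _WIN32
    return MultiByteToWideChar(pick_input_cp(GetConsoleCP(), GetOEMCP()),
                               MB_PRECOMPOSED | MB_ERR_INVALID_CHARS,
                               bytes, len, wide, cap);
#else
    (void)len;
    (void)cap;
    return 0;   /* not a Windows build */
#endif
}
```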
------------------------------------------------------------------------------
------------------------------------------------------------------------------
-- Windows English Defaults ------------------------------------
cmd chcp
Active code page: 437
regedit
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage
ACP 1252
OEMCP 437
-- Windows French Defaults ------------------------------------
cmd chcp
Page de codes active : 850
regedit
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage
ACP 1252
OEMCP 850
-- English Keyboard -----------------------------------
`1234567890-= ~!@#$%^&*()_+
qwertyuiop[]\ QWERTYUIOP{}|
asdfghjkl;' ASDFGHJKL:"
zxcvbnm,./ ZXCVBNM<>?
-- German Keyboard ------------------------------------
^1234567890ß´ °!"§$%&/()=?` ²³{[]}\
qwertzuiopü+# QWERTZUIOPÜ*' @€~µ
asdfghjklöä ASDFGHJKLÖÄ
yxcvbnm,.- YXCVBNM;:_
-- French Keyboard ------------------------------------
²&é"'(-è_çà)= 1234567890°+ ~#{[|`\^@]}
azertyuiop^$* AZERTYUIOP¨£µ €¤
qsdfghjklmù QSDFGHJKLM%
wxcvbn,;:! WXCVBN?./§
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Here is a simple App that will let you test this stuff
---------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <string.h>
#include <windows.h>

#define LINE_LEN 256

static void asHex( const char * str );
static void printWide( UINT cp, const char * str );

int main(int argc, char** argv)
{
    char line[LINE_LEN];
    const char quit[] = "exit";   /* not named "exit": that would shadow exit() */
    UINT cp = GetConsoleCP();

    /* Optional argument: the code page to switch the console to. */
    if ( argc == 2 ) { cp = (UINT)atoi(argv[1]); }

    if ( !SetConsoleCP(cp) )
    {
        printf( "SetConsoleCP(%u) failed, error %lu\n", cp, GetLastError() );
    }
    printf( "Console CP set to %u\n", GetConsoleCP() );
    if ( !SetConsoleOutputCP(cp) )
    {
        printf( "SetConsoleOutputCP(%u) failed, error %lu\n", cp, GetLastError() );
    }
    printf( "Console Output CP set to %u\n", GetConsoleOutputCP() );

    printf("Enter input to see it echoed.\n");
    printf("Enter 'exit' to quit.\n\n");
    for (;;)
    {
        printf("> ");
        fflush(stdout);
        /* fgets bounds the read; gets cannot, and was removed in C11 */
        if ( fgets( line, sizeof line, stdin ) == NULL ) { break; }
        line[strcspn( line, "\r\n" )] = 0;   /* strip the trailing newline */
        if ( strcmp( line, quit ) == 0 ) { break; }
        asHex(line);
        printf("\n");
    }
    return 0;
}

/* Print the input as characters, hex bytes, decimal bytes, and then as
   UTF-16 code units decoded three ways: with the console's current code
   page, with the default OEM code page, and with the default ANSI code page.
   Every column is five characters wide so the rows line up. */
static void asHex( const char * str )
{
    const char * ptr;

    // string -------------------------------
    for ( ptr = str; *ptr != 0; ptr++ )
    {
        printf( "   %c ", *ptr );
    }
    printf("\n");
    // hex -------------------------------
    for ( ptr = str; *ptr != 0; ptr++ )
    {
        printf( " x%02X ", (unsigned int)(unsigned char)*ptr );
    }
    printf("\n");
    // decimal -------------------------------
    for ( ptr = str; *ptr != 0; ptr++ )
    {
        printf( " %3u ", (unsigned int)(unsigned char)*ptr );
    }
    printf("\n");
    // wide actual, wide oem, wide ansi -------------------------------
    printWide( GetConsoleCP(), str );
    printWide( CP_OEMCP, str );
    printWide( CP_ACP, str );
}

/* Decode str with code page cp and print each UTF-16 code unit in hex. */
static void printWide( UINT cp, const char * str )
{
    WCHAR wide[LINE_LEN];
    int len = (int)strlen(str);
    int size;
    int i;

    /* MultiByteToWideChar rejects a zero-length input, so handle it here */
    if ( len == 0 ) { printf("\n"); return; }

    size = MultiByteToWideChar( cp, MB_PRECOMPOSED | MB_ERR_INVALID_CHARS, str, len, wide, LINE_LEN );
    if ( size == 0 )
    {
        printf( "(conversion failed, error %lu)\n", GetLastError() );
        return;
    }
    /* print only the units actually written, not strlen(str) of them */
    for ( i = 0; i < size; i++ )
    {
        printf( "%04X ", (unsigned int)wide[i] );
    }
    printf("\n");
}
------------------------------------------------------------------------------
------------------------------------------------------------------------------