utf-8

Version: 2001-05-11 (openSuse - 09/10/07)

Section: 7 (Miscellaneous)

Name

UTF-8 - an ASCII-compatible multibyte Unicode encoding

Description

The Unicode 3.0 character set occupies a 16-bit code space. The most obvious Unicode encoding (known as UCS-2) consists of a sequence of 16-bit words. Such strings can contain, as parts of many 16-bit characters, bytes such as '\0' or '/', which have a special meaning in filenames and in other C library function arguments. In addition, the majority of UNIX tools expect ASCII files and cannot read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode in filenames, text files, environment variables, and so on. The ISO 10646 Universal Character Set (UCS), a superset of Unicode, occupies a 31-bit code space, and the obvious UCS-4 encoding for it (a sequence of 32-bit words) has the same problems.

The UTF-8 encoding of Unicode and UCS does not have these problems and is the common way in which Unicode is used on Unix-style operating systems.
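
To make the UCS-2 problem concrete, the following sketch serializes one 16-bit code unit as two bytes and reports whether either byte collides with '\0' or '/'. The helper name ucs2_has_special_byte and the big-endian byte order are assumptions made for illustration only.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if the big-endian two-byte form of the
 * UCS-2 code unit `unit` contains a byte equal to '\0' (0x00) or '/' (0x2f),
 * and 0 otherwise. */
int ucs2_has_special_byte(uint16_t unit)
{
    unsigned char hi = (unsigned char)(unit >> 8);
    unsigned char lo = (unsigned char)(unit & 0xff);

    return hi == 0x00 || lo == 0x00 || hi == 0x2f || lo == 0x2f;
}
```

For instance, the letter 'A' (0x0041) becomes the bytes 0x00 0x41 and so embeds a NUL, and 0x012f becomes 0x01 0x2f and so embeds a '/' byte.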

Properties

The UTF-8 encoding has the following useful properties:
* UCS characters 0x00000000 to 0x0000007f (the classic US-ASCII characters) are encoded simply as bytes 0x00 to 0x7f (ASCII compatibility). This means that files and strings that contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
* All UCS characters greater than 0x7f are encoded as a multibyte sequence consisting only of bytes in the range 0x80 to 0xfd. No ASCII byte can therefore appear as part of another character, and there are no problems with, for example, '\0' or '/'.
* The lexicographic sorting order of UCS-4 strings is preserved.
* All 2^31 possible UCS codes can be encoded using UTF-8.
* The bytes 0xfe and 0xff are never used in the UTF-8 encoding.
* The first byte of a multibyte sequence that represents a single non-ASCII UCS character is always in the range 0xc0 to 0xfd and indicates how long the sequence is. All further bytes in a multibyte sequence are in the range 0x80 to 0xbf. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
* UTF-8 encoded UCS characters may be up to six bytes long. However, the Unicode standard specifies no characters above 0x10ffff, so Unicode characters can be at most four bytes long in UTF-8.
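
The lead-byte and resynchronization properties can be sketched in a few lines of C. The function names utf8_seq_len and utf8_resync are hypothetical, and the byte ranges are taken directly from the list above; this is an illustration rather than a validating parser.

```c
#include <stddef.h>

/* Length (1..6) of the sequence introduced by lead byte `b`, or 0 if `b`
 * cannot start a sequence (continuation byte 0x80..0xbf, or 0xfe/0xff). */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;   /* 0xxxxxxx: ASCII */
    if (b < 0xc0) return 0;   /* 10xxxxxx: continuation byte */
    if (b < 0xe0) return 2;   /* 110xxxxx */
    if (b < 0xf0) return 3;   /* 1110xxxx */
    if (b < 0xf8) return 4;   /* 11110xxx */
    if (b < 0xfc) return 5;   /* 111110xx */
    if (b < 0xfe) return 6;   /* 1111110x */
    return 0;                 /* 0xfe and 0xff are never used */
}

/* Resynchronize after landing mid-character: starting at index `i`, skip
 * continuation bytes and return the index of the first byte that is not
 * one (or `n` if the buffer ends first). */
size_t utf8_resync(const unsigned char *s, size_t n, size_t i)
{
    while (i < n && s[i] >= 0x80 && s[i] <= 0xbf)
        i++;
    return i;
}
```

Because no state has to be carried between characters, a reader that starts in the middle of a stream loses at most one character before it is back in step.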

Encoding

The following byte sequences are used to represent a character. The sequence to be used depends on the UCS code number of the character:
0x00000000 - 0x0000007F:
0xxxxxxx
0x00000080 - 0x000007FF:
110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:
1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x00200000 - 0x03FFFFFF:
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 - 0x7FFFFFFF:
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character code number in binary representation. Only the shortest possible multibyte sequence that can represent the code number of the character may be used.

The UCS code values 0xd800-0xdfff (UTF-16 surrogates) as well as 0xfffe and 0xffff (UCS non-characters) should not appear in conforming UTF-8 streams.
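
The table translates fairly directly into code. The following is a minimal sketch of an encoder, not a reference implementation: the name ucs_to_utf8 and its return convention are assumptions, and it simply refuses the code values that should not appear in a conforming stream.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode the UCS code `c` into `out` (room for 6 bytes), always choosing
 * the shortest sequence. Returns the number of bytes written, or 0 if `c`
 * exceeds 31 bits, is a UTF-16 surrogate, or is 0xfffe/0xffff. */
size_t ucs_to_utf8(uint32_t c, unsigned char out[6])
{
    size_t len, i;

    if (c > 0x7fffffff)
        return 0;
    if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff)
        return 0;

    if (c < 0x80) {                 /* 0xxxxxxx */
        out[0] = (unsigned char)c;
        return 1;
    }

    /* Pick the row of the table from the code range. */
    if      (c < 0x800)     len = 2;
    else if (c < 0x10000)   len = 3;
    else if (c < 0x200000)  len = 4;
    else if (c < 0x4000000) len = 5;
    else                    len = 6;

    /* Fill the continuation bytes from the end: 10xxxxxx, six bits each. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (c & 0x3f));
        c >>= 6;
    }
    /* Lead byte: `len` one-bits, a zero bit, then the remaining high bits. */
    out[0] = (unsigned char)((0xff00 >> len) | c);
    return len;
}
```

The expression 0xff00 >> len, truncated to one byte, yields the prefixes 0xc0, 0xe0, 0xf0, 0xf8 and 0xfc for lengths 2 to 6.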

Examples

The Unicode character 0xa9 = 1010 1001 (the copyright sign) is encoded in UTF-8 as

11000010 10101001 = 0xc2 0xa9

and the character 0x2260 = 0010 0010 0110 0000 (the "not equal" sign) is encoded as

11100010 10001001 10100000 = 0xe2 0x89 0xa0
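
The intermediate step in the second example is the regrouping of the 16 code bits into fields of 4, 6 and 6 bits. The sketch below exposes that step for the three-byte case only; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Payload bit groups of a three-byte sequence (codes 0x0800..0xffff). */
struct utf8_groups3 {
    unsigned char top4;   /* bits 15..12, placed in 1110xxxx */
    unsigned char mid6;   /* bits 11..6,  placed in 10xxxxxx */
    unsigned char low6;   /* bits 5..0,   placed in 10xxxxxx */
};

/* Split a 16-bit code into the three payload fields. */
struct utf8_groups3 utf8_split3(uint16_t c)
{
    struct utf8_groups3 g;

    g.top4 = (unsigned char)(c >> 12);
    g.mid6 = (unsigned char)((c >> 6) & 0x3f);
    g.low6 = (unsigned char)(c & 0x3f);
    return g;
}
```

For 0x2260 the fields are 0010, 001001 and 100000; prefixing them with 1110, 10 and 10 gives the bytes 0xe2, 0x89 and 0xa0.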

Application notes

To activate UTF-8 support in applications, users have to select a UTF-8 locale, for example with
export LANG=en_GB.UTF-8

Application software that has to be aware of the character encoding in use should always set the locale, for example with

setlocale(LC_CTYPE, "")

Programmers can then test the following expression to determine whether a UTF-8 locale has been selected, and therefore whether all plain-text standard input and output, terminal communication, plain-text file content, filenames, and environment variables are encoded in UTF-8:

strcmp(nl_langinfo(CODESET), "UTF-8") == 0
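
Put together, the two calls might be wrapped as follows. This is a minimal sketch: the wrapper name locale_is_utf8 is hypothetical, while setlocale and nl_langinfo are the library calls named in the text (declared in <locale.h> and <langinfo.h>).

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Returns 1 if the locale selected through the environment uses UTF-8 as
 * its character encoding, and 0 otherwise. */
int locale_is_utf8(void)
{
    /* Adopt the character-type locale chosen by LC_ALL, LC_CTYPE or LANG.
     * If this fails, the previous locale simply stays in effect. */
    setlocale(LC_CTYPE, "");

    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

A program would typically call this once at startup and choose its text-handling paths accordingly.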

Programmers accustomed to single-byte encodings such as US-ASCII or ISO 8859 have to be aware that two assumptions made so far are no longer valid in UTF-8 locales. First, a single byte does not necessarily correspond to a single character any more. Second, since modern terminal emulators in UTF-8 mode also support Chinese, Japanese, and Korean double-width characters as well as non-spacing combining characters, outputting a single character does not necessarily advance the cursor by one position as it did in ASCII. Library functions such as mbsrtowcs(3) and wcswidth(3) should be used today to count characters and cursor positions.
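
As an illustration of the first point, the sketch below counts characters rather than bytes with mbsrtowcs(3). The function name mb_char_count is hypothetical, and the caller is assumed to have already set the locale. Column counting with wcswidth(3) is left out here; it could be applied to the converted wide string in a similar way.

```c
#include <string.h>
#include <wchar.h>

/* Count the characters (not bytes) in the multibyte string `s`, decoded
 * according to the current LC_CTYPE locale. Returns (size_t)-1 if `s`
 * contains an invalid multibyte sequence. */
size_t mb_char_count(const char *s)
{
    mbstate_t state;
    const char *p = s;

    memset(&state, 0, sizeof state);   /* start in the initial shift state */

    /* With a null destination, mbsrtowcs only counts the wide characters
     * and ignores the length argument. */
    return mbsrtowcs(NULL, &p, 0, &state);
}
```

In a UTF-8 locale, the string "\xc2\xa9" has a strlen of 2 but a character count of 1.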

The official escape sequence to switch from an ISO 2022 encoding scheme (as used, for instance, by VT100 terminals) to UTF-8 is ESC % G ("\x1b%G"). The corresponding return sequence from UTF-8 to ISO 2022 is ESC % @ ("\x1b%@"). Other ISO 2022 sequences (such as those for switching the G0 and G1 sets) are not applicable in UTF-8 mode.
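
A program that drives such a terminal could emit the two sequences along these lines. The macro and function names are hypothetical; only the byte sequences themselves come from the text.

```c
#include <stdio.h>

/* The two switch sequences quoted above. */
#define ISO2022_TO_UTF8 "\x1b%G"
#define UTF8_TO_ISO2022 "\x1b%@"

/* Write the appropriate switch sequence to `term`.
 * Returns 0 on success and -1 on a write error. */
int terminal_set_utf8(FILE *term, int enable)
{
    /* fputs is used instead of printf because the sequences contain '%',
     * which printf would treat as the start of a conversion. */
    if (fputs(enable ? ISO2022_TO_UTF8 : UTF8_TO_ISO2022, term) == EOF)
        return -1;
    return fflush(term) == EOF ? -1 : 0;
}
```

Whether a given terminal emulator honors these sequences depends on the emulator.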

It can be hoped that in the foreseeable future UTF-8 will replace ASCII and ISO 8859 at all levels as the common character encoding on POSIX systems, leading to a significantly richer environment for handling plain text.

Security

The Unicode and UCS standards require that producers of UTF-8 use the shortest form possible; for example, producing a two-byte sequence with first byte 0xc0 is non-conforming. Unicode 3.1 added the requirement that conforming programs must not accept non-shortest forms in their input. This is for security reasons: if user input is checked for possible security violations, a program might check only for the ASCII version of "/../" or ";" or NUL and overlook that there are many non-ASCII ways to represent these characters in a non-shortest UTF-8 encoding.
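
One way to enforce the shortest-form rule is to compare each decoded value against the smallest code that legitimately needs that sequence length. The decoder below is a sketch under that assumption; the name utf8_decode_strict and its return convention are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one character from s[0..n-1] into *out. Returns the number of
 * bytes consumed, or 0 if the input is truncated, malformed, or not the
 * shortest form for its code value. */
size_t utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code value that legitimately needs each sequence length. */
    static const uint32_t min_code[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    size_t len, i;
    uint32_t c;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) { *out = s[0]; return 1; }
    else if (s[0] < 0xc0) return 0;              /* stray continuation byte */
    else if (s[0] < 0xe0) { len = 2; c = s[0] & 0x1f; }
    else if (s[0] < 0xf0) { len = 3; c = s[0] & 0x0f; }
    else if (s[0] < 0xf8) { len = 4; c = s[0] & 0x07; }
    else if (s[0] < 0xfc) { len = 5; c = s[0] & 0x03; }
    else if (s[0] < 0xfe) { len = 6; c = s[0] & 0x01; }
    else return 0;                               /* 0xfe, 0xff never valid */

    if (n < len)
        return 0;                                /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return 0;                            /* must be 10xxxxxx */
        c = (c << 6) | (s[i] & 0x3f);
    }
    if (c < min_code[len])
        return 0;                                /* non-shortest form */

    *out = c;
    return len;
}
```

With this check, the two-byte sequence 0xc0 0xaf, a non-shortest spelling of '/', is rejected instead of being decoded to 0x2f.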

Conforming to

ISO/IEC 10646-1:2000, Unicode 3.1, RFC 2279, Plan 9.

See also

nl_langinfo(3), setlocale(3), charsets(7), unicode(7)