*multibyte.txt* For Vim version 6.0aa.  Last change: 2001 Apr 03


		  VIM REFERENCE MANUAL	  by Bram Moolenaar et al.


Multi-byte support				*multibyte* *multi-byte*

						*Chinese* *Japanese* *Korean*
There are languages which have many characters that can not be represented
using one byte (one octet).  These are Chinese (simplified or traditional),
Japanese and Korean.  These languages use more than one byte to represent a
character.

This is limited information on the support in Vim to edit files that use more
than one byte per character.  Actually, only two-byte codes are currently
supported.

For changing the messages and menus that Vim uses see |multilang.txt|.

Also see |+multi_byte| and |'fileencoding'|.

1. Introduction				|multibyte-intro|
2. Compiling				|multibyte-compiling|
3. Options				|multibyte-options|
4. Display (X fontset support)		|multibyte-display|
5. Input (XIM support)			|multibyte-input|
6. Input (Windows IME support)		|multibyte-ime|
7. UTF-8				|UTF-8|
8. UTF-8 in XFree86 xterm		|UTF8-xterm|
9. Unfinished snippets			|mbyte-snippets|

==============================================================================
1. Introduction						*multibyte-intro*

LOCALE
							*locale-multibyte*
There are many languages in the world, and at least as many different
cultures and environments.  The linguistic environment corresponding to an
area is called a "|locale|".  The POSIX standard defines the concept of a
|locale|, which includes information about the |charset|, the collating order
for sorting, the date format, the currency format and so on.

Your system needs to support the |locale| system and the language |locale| of
your choice.  Some systems have only a few language |locale|s, so the |locale|
for the language you want to use may not be on your system.  If so, you have
to add the language |locale|.  But on some systems it is not possible to add
other |locale|s.  In this case, install X |locale|s by installing X compiled
with X_LOCALE.  Add "-DX_LOCALE" to the CFLAGS if your X lib supports
X_LOCALE.  For example, when you are using a Linux system and you want to use
Japanese, set up your system in one of the following ways:
    - libc5     + X compiled with X_LOCALE
    - glibc-2.0 + libwcsmbs + X compiled without X_LOCALE
    - glibc-2.1 + locale-ja + X compiled without X_LOCALE

The location in which the |locale|s are installed varies from system to
system, for example "/usr/share/locale" or "/usr/lib/locale".  See your
system's setlocale() man page.

					*locale-name* *$LANG-multibyte*
The format of a |locale| name is:
    language[_territory[.codeset]]
Territory means the country, codeset means the |charset|.  For example, the
|locale| name "ja_JP.eucJP" means the language is Japanese, the country is
Japan and the codeset is EUC-JP.  But it could also be "ja", "ja_JP.EUC",
"ja_JP.ujis", etc.  Unfortunately, the |locale| name for a specific language,
territory and codeset is not standardized and depends on your system.  This
name is used as the value of the LANG environment variable.  When you want to
use Korean and the |locale| name is "ko", do this:
    sh:  export LANG=ko
    csh: setenv LANG ko
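
The name format above can be taken apart mechanically.  The following is a
minimal sketch in Python, not something Vim does itself; the function name
parse_locale_name() and the regular expression are assumptions made for
illustration, and real systems may accept names this pattern rejects.

```python
import re

# Hypothetical pattern for language[_territory[.codeset]] as described above.
LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z]+)"
    r"(?:\.(?P<codeset>[A-Za-z0-9_-]+))?)?$"
)


def parse_locale_name(name):
    """Split a locale name into (language, territory, codeset).

    Parts that are absent from the name are returned as None.
    """
    match = LOCALE_RE.match(name)
    if match is None:
        raise ValueError("not in language[_territory[.codeset]] form: " + name)
    return (match.group("language"),
            match.group("territory"),
            match.group("codeset"))


if __name__ == "__main__":
    for name in ("ja_JP.eucJP", "ja_JP.Shift_JIS", "ko"):
        print(name, "->", parse_locale_name(name))
```

Because the name for a given language and codeset depends on the system, the
parsed codeset is only a hint and not a normalized |charset| name.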

Examples of locale name:
    |charset|	    language		  |locale-name|
    GB2312	    Chinese (simplified)  zh_CN.EUC, zh_CN.GB2312
    Big5	    Chinese (traditional) zh_TW.BIG5, zh_TW.Big5
    CNS-11643	    Chinese (traditional) zh_TW
    EUC-JP	    Japanese		  ja, ja_JP.EUC, ja_JP.ujis, ja_JP.eucJP
    Shift_JIS	    Japanese		  ja_JP.SJIS, ja_JP.Shift_JIS
    EUC-KR	    Korean		  ko, ko_KR.EUC

Even if your system does not have the multibyte language |locale| of your
choice, or has only an incomplete implementation of the locale, Vim can still
handle the multibyte languages.  Add the "--enable-broken-locale" flag at
compile time.


CODED CHARACTER SET (CCS)
					*coded-character-set* *CCS*
|CCS| is a mapping from a set of characters to a set of integers.  For
example, ((65, A), (66, B), (67, C)) is a |CCS| and ((0x41, A), (0x42, B),
(0x43, C)) is also a |CCS|.  Examples of |CCS| are ISO 10646, US-ASCII,
ISO-8859 series, JIS X 0208, JIS X 0201, KS C 5601 (KS X 1001) and KS C 5636
(KS X 1003).

The term "integer" means code point or character number and is different from
octets or bit combination.

Typically, a |CCS| is a character table.  The column and line numbers of a
character, written as hexadecimal digits, form its code point.  For example,
the US-ASCII CCS is an 8x16 character table: the column numbers run from 0 to
7 and the line numbers run from 0 to F.  The code point of the character at
4/1 is 0x41.
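
The column/line rule for US-ASCII can be written as a short calculation.  This
is only an illustrative sketch in Python; the function name code_point() is an
assumption, and the range checks cover the 8x16 US-ASCII table only.

```python
def code_point(column, line):
    """Return the code point of the character at column/line.

    The column number becomes the high hexadecimal digit and the line
    number the low hexadecimal digit.
    """
    if not 0 <= column <= 7:
        raise ValueError("column must be in the range 0..7")
    if not 0 <= line <= 0xF:
        raise ValueError("line must be in the range 0..F")
    return (column << 4) | line


if __name__ == "__main__":
    # Position 4/1 in the US-ASCII table.
    print(hex(code_point(4, 1)), chr(code_point(4, 1)))
```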


CHARACTER ENCODING SCHEME (CES)

					*character-encode-scheme* *CES*
|CES| is a mapping from a sequence of elements in one or more |CCS|es to a
sequence of octets.  Examples of |CES| are EUC-JP, EUC-KR, EUC-CN (GB 2312),
EUC-TW (CNS-11643), ISO-2022-JP, ISO-2022-KR, ISO-2022-CN, UTF-8, etc.


CHARSET
							*charset*
|charset| is a method of converting a sequence of octets into a sequence of
characters, the combination of one or more |CCS|es and a |CES|.  For example,
ISO-2022-JP |charset| is the combination of ASCII, JIS X 0201, JIS X 0208
|CCS|es and ISO-2022-JP |CES|.  Examples of |charset| are US-ASCII, ISO-8859
series, GB2312, EUC-JP, EUC-KR, Shift_JIS, Big5, UTF-8, etc.

Note that this is not a term used by other standards bodies, such as ISO, but
a term defined in RFC 2130.  The term "codeset" in POSIX has the same meaning
as |charset| here.  |charset| does not mean character set (a set of
characters) and the term "character repertoire" means a collection of distinct
characters.  There are historical reasons, see RFC 2130.

						*charset-conversion*
A language can have several |charset|s.  For example, Japanese has the
ISO-2022-JP, EUC-JP and Shift_JIS |charset|s.  ISO-2022-JP is used mainly for
internet messages, because it is a 7-bit encoding.  EUC-JP is mainly used on
Unix; Shift_JIS is mainly used on Windows and MacOS.

Vim does not convert automatically to the locale's |charset| at display time.
So, if a file's |charset| differs from your locale's |charset|, the file is
not displayed correctly.  You must find out the file's |charset| in some way
(guessing, using a utility, etc.) and convert the file to the locale's
|charset| manually.

Useful utilities for converting the |charset|:
    Japanese:	    nkf
	Nkf is the "Network Kanji code conversion Filter".  Its most
	distinctive feature is that it guesses the Kanji code of the input,
	so you don't need to know what the input file's |charset| is.  To
	convert from ISO-2022-JP or Shift_JIS to EUC-JP, simply use the
	following command in Vim:
	    :%!nkf -e
	Nkf can be found at:
	http://www.sfc.wide.ad.jp/~max/FreeBSD/ports/distfiles/nkf-1.62.tar.gz
    Chinese:	    hc
	Hc is the "Hanzi Converter".  Hc converts a GB file to a Big5 file, or
	a Big5 file to a GB file.  Hc can be found at:
	ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/unix/convert/hc-30.tar.gz
    Korean:	    hmconv
	Hmconv is a Korean code conversion utility, especially for e-mail.  It
	can convert between EUC-KR and ISO-2022-KR.  Hmconv can be found at:
	ftp://ftp.kaist.ac.kr/pub/hangul/code/hmconv/hmconv1.0pl3
    Multilingual:   lv
	Lv is a powerful multilingual file viewer.  It can also be used as a
	|charset| converter.  Supported |charset|s: ISO-2022-CN, ISO-2022-JP,
	ISO-2022-KR, EUC-CN, EUC-JP, EUC-KR, EUC-TW, UTF-7, UTF-8, ISO-8859
	series, Shift_JIS, Big5 and HZ.  Lv can be found at:
	http://www.ff.iij4u.or.jp/~nrt/freeware/lv4493.tar.gz


X LOGICAL FONT DESCRIPTION (XLFD)
							*XLFD*
XLFD is the X font name and contains the information about the font size,
|CCS|, etc.  The name is in this format:

FOUNDRY-FAMILY-WEIGHT-SLANT-WIDTH-STYLE-PIXEL-POINT-X-Y-SPACE-AVE-CR-CE

Each field means:

- FOUNDRY:  FOUNDRY field.  The company that created the font.
- FAMILY:   FAMILY_NAME field.  Basic font family name.  (helvetica, gothic,
	    times, etc)
- WEIGHT:   WEIGHT_NAME field.  How thick the letters are.  (light, medium,
	    bold, etc)
- SLANT:    SLANT field.
		r:  Roman
		i:  Italic
		o:  Oblique
		ri: Reverse Italic
		ro: Reverse Oblique
		ot: Other
		number:	Scaled font
- WIDTH:    SETWIDTH_NAME field.  Width of characters.  (normal, condensed,
	    narrow, double wide)
- STYLE:    ADD_STYLE_NAME field.  Extra info to describe font.  (Serif, Sans
	    Serif, Informal, Decorated, etc)
- PIXEL:    PIXEL_SIZE field.  Height, in pixels, of characters.
- POINT:    POINT_SIZE field.  Ten times height of characters in points.
- X:	    RESOLUTION_X field.  X resolution (dots per inch).
- Y:	    RESOLUTION_Y field.  Y resolution (dots per inch).
- SPACE:    SPACING field.
		p:  Proportional
		m:  Monospaced
		c:  CharCell
- AVE:	    AVERAGE_WIDTH field.  Ten times average width in pixels.
- CR:	    CHARSET_REGISTRY field.  The name of the font's |CCS|.
- CE:	    CHARSET_ENCODING field.  In some CCSes, such as the ISO-8859
	    series, this field is part of the |CCS| name.  In other CCSes, such
	    as JIS X 0208, this field tells how code points are encoded: if it
	    is 0 the code points have the same values as GL, if it is 1 the
	    same values as GR.

For example, a 16-dot font corresponding to JIS X 0208 is written like:
    -misc-fixed-medium-r-normal--16-110-100-100-c-160-jisx0208.1990-0
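
To read such a name field by field, it can be split on the "-" separators.
The sketch below is in Python and is only an illustration: parse_xlfd() and
XLFD_FIELDS are hypothetical names, the labels are the short names from the
field list above, and the code assumes a complete name without wildcards and
without "-" inside any field value.

```python
# Short field labels, in the order given in the format line above.
XLFD_FIELDS = (
    "FOUNDRY", "FAMILY", "WEIGHT", "SLANT", "WIDTH", "STYLE", "PIXEL",
    "POINT", "X", "Y", "SPACE", "AVE", "CR", "CE",
)


def parse_xlfd(name):
    """Split a full XLFD font name into a dict keyed by the short labels."""
    if not name.startswith("-"):
        raise ValueError("an XLFD name starts with '-'")
    parts = name[1:].split("-")
    if len(parts) != len(XLFD_FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(XLFD_FIELDS), len(parts)))
    return dict(zip(XLFD_FIELDS, parts))


if __name__ == "__main__":
    font = "-misc-fixed-medium-r-normal--16-110-100-100-c-160-jisx0208.1990-0"
    for label, value in parse_xlfd(font).items():
        print("%-8s %r" % (label, value))
```

For the example name this gives an empty STYLE field, a PIXEL field of "16",
"jisx0208.1990" for CR and "0" for CE.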


X FONTSET
						*fontset* *xfontset*
A |CCS| is typically associated with one font.  Languages which use
multiple |CCS|es need multiple fonts.  In X11R5, FontSet was introduced for
the internationalization of the output API.  With it, Xlib takes care of
switching fonts for the display.  Up to X11R4, applications had to manage
this themselves.

The |locale| database has the information about the |charset| of the |locale|:
which |CCS|(es) are needed and which |CES| the locale uses.  When you use a
locale which needs multiple |CCS|es, you have to specify a font for each
|CCS| in the 'guifontset' option.

Example:
    |charset| language		    |CCS|es
    GB2312    Chinese (simplified)  ISO-8859-1 and GB 2312
    Big5      Chinese (traditional) ISO-8859-1 and Big5
    CNS-11643 Chinese (traditional) ISO-8859-1, CNS 11643-1 and CNS 11643-2
    EUC-JP    Japanese		    JIS X 0201 and JIS X 0208
    EUC-KR    Korean		    ISO-8859-1 and KS C 5601 (KS X 1001)

The |XLFD| contains the name of the |CCS|.  So, by searching in fonts.dir,
you can find the fonts for a |CCS|.  The fonts.dir file is in the fonts
directory (e.g. /usr/X11R6/lib/X11/fonts/*).  The format of the file is:
    First line:	the number of fonts which are listed in this fonts.dir
    other lines:	FILENAME  |XLFD|
Or you can search for fonts using the xlsfonts command.  For example, to
search for fonts for KS C 5601: >
    xlsfonts | grep ksc5601
This lists the matching fonts.

						*base_font_name_list*
In the 'guifontset' option and in ~/.Xdefaults you specify the
|base_font_name_list|, which is a comma-separated list of |XLFD| font names
that Xlib uses to load the fonts needed for the |locale|.

For example, the ja_JP.eucJP |locale| requires the JIS X 0201 and JIS X 0208
|CCS|es.  You could supply a |base_font_name_list| that explicitly specifies
the charsets, like: >

 :set guifontset=-misc-fixed-medium-r-normal--14-130-75-75-c-140-jisx0208.1983-0,
	\-misc-fixed-medium-r-normal--14-130-75-75-c-70-jisx0201.1976-0

Alternatively, the user could supply a base font name list that omits the
|CCS| name, letting Xlib select font characters required for the locale. For
example: >

 :set guifontset=-misc-fixed-medium-r-normal--14-130-75-75-c-140,
	\-misc-fixed-medium-r-normal--14-130-75-75-c-70

Alternatively, the user could supply a single base font name that allows Xlib
to select from all available fonts.  For example: >

 :set guifontset=-misc-fixed-medium-r-normal--14-*

Alternatively, the user could specify the alias name.  See fonts.alias in
the fonts directory. >

 :set guifontset=k14,r14

Note that in East Asian fonts, the standard character cell is square.  When
mixing a Latin font and an East Asian font, the East Asian font should be
twice as wide as the Latin font.  Also, GVIM needs fixed-width fonts.


X INPUT METHOD (XIM)				*XIM* *xim* *x-input-method*

XIM (X Input Method) is an international input module for X.  There are two
kinds of structure: the Xlib unit type and the |IM-server| (Input-Method
server) type.  The |IM-server| type is suitable for complex input, such as
CJK input.

- IM-server
							*IM-server*
  In |IM-server| type input structures, input events are handled in one of
  two ways: the FrontEnd system or the BackEnd system.  In the FrontEnd
  system, input events are caught by the |IM-server| first, then the
  |IM-server| gives the application the result of the input.  The BackEnd
  system works in the reverse order.  MS Windows uses the BackEnd system.  In
  X, most |IM-server|s use the FrontEnd system.  The disadvantage of the
  BackEnd system is the large communication overhead, but it provides safe
  synchronization with no restrictions on applications.

  For example, xwnmo and kinput2 are Japanese |IM-server|s; both are FrontEnd
  systems.  Xwnmo is distributed with Wnn (see below), kinput2 can be found
  at: ftp://ftp.sra.co.jp/pub/x11/kinput2/

  For Chinese, there is a great XIM server named "xcin", with which you can
  input both Traditional and Simplified Chinese characters.  It can also
  accept other locales if you make a correct input table.  Xcin can be found
  at: http://xcin.linux.org.tw/

- Conversion Server
							*conversion-server*
  Some systems need an additional server: a conversion server.  Most Japanese
  |IM-server|s need a Kana-Kanji conversion server.  For Chinese input it
  depends on the input method; some methods need a PinYin or ZhuYin to HanZi
  conversion server.  For Korean input, a Hangul-Hanja conversion server is
  needed if you want to input Hanja.

  For example, the Japanese input process is divided into two steps.  There
  are very many Kanji characters (6349 Kanji characters are defined in JIS X
  0208) but only 76 Hira-gana characters.  So, first, we pre-input the text
  as pronounced in Hira-gana.  Second, we convert Hira-gana to Kanji or
  Kata-Kana, if needed.  There are several Kana-Kanji conversion servers:
  jserver (distributed with Wnn, see below) and canna.  Canna can be found at:
  ftp://ftp.nec.co.jp/pub/Canna/

There is a good input system: Wnn 4.2.  Wnn 4.2 contains:
    xwnmo (|multilingualized| |IM-server|)
    jserver (Japanese Kana-Kanji conversion server)
    cserver (Chinese PinYin or ZhuYin to simplified HanZi conversion server)
    tserver (Chinese PinYin or ZhuYin to traditional HanZi conversion server)
    kserver (Hangul-Hanja conversion server)
Wnn 4.2 can be found at:
    ftp://ftp.FreeBSD.ORG/pub/FreeBSD/ports/distfiles/Wnn4.2.tar.gz


- Input Style
							*xim-input-style*
  When inputting CJK, there are four areas:
      1. The area to display the input while it is being composed.
      2. The area to display the currently active input mode.
      3. The area to display the next candidate for the selection.
      4. The area to display other tools.

  The third area is needed when converting.  For example, in Japanese
  input, multiple Kanji characters can have the same pronunciation, so a
  sequence of Hira-gana characters can map to several distinct sequences of
  Kanji characters.

  The first and second areas are defined in international input of X with the
  names of "Preedit Area", "Status Area" respectively.  The third and fourth
  areas are not defined and are left to be managed by the |IM-server|.  In the
  international input, four input styles have been defined using combinations
  of Preedit Area and Status Area: |OnTheSpot|, |OffTheSpot|, |OverTheSpot|
  and |Root|.

  Currently, GUI Vim supports three styles: |OverTheSpot|, |OffTheSpot| and
  |Root|.

*.  on-the-spot						*OnTheSpot*
    Preedit Area and Status Area are performed by the client application in
    the area of application.  The client application is directed by the
    |IM-server| to display all pre-edit data at the location of text
    insertion. The client registers callbacks invoked by the input method
    during pre-editing.
*.  over-the-spot					*OverTheSpot*
    Status Area is created in a fixed position within the area of application,
    in case of Vim, the position is the additional status line.  Preedit Area
    is made at present input position of application.  The input method
    displays pre-edit data in a window which it brings up directly over the
    text insertion position.
*.  off-the-spot					*OffTheSpot*
    Preedit Area and Status Area are performed in the area of application, in
    case of Vim, the area is additional status line.  The client application
    provides display windows for the pre-edit data to the input method which
    displays into them directly.
*.  root-window						*Root*
    Preedit Area and Status Area are outside of the application.  The input
    method displays all pre-edit data in a separate area of the screen in a
    window specific to the input method.


LOCALIZATION, INTERNATIONALIZATION AND MULTILINGUALIZATION

					*localized* *Localization* *L10N*
Localization (L10N)		To fit a system or an application with a
				specific language.
			    *internationalized* *Internationalization* *I18N*
Internationalization (I18N)	To enable a system or an application to fit
				with a specific language according to the
				|locale|.
			    *multilingualized* *Multilingualization* *M17N*
Multilingualization (M17N)	To enable a system or an application to be
				able to use multiple languages at the same
				time.
For example, JVim (Japanized version of Vim 3.0) is a |localized| application
for Japanese.  Cxterm (|localized| xterm for Chinese), kterm (|localized|
xterm for Japanese) and hanterm (|localized| xterm for Korean) are also
|localized| applications.  Gnome is an |internationalized| application.  It
can be |localized| for many languages according to the |locale|.  Mule
(Multilingual Enhancement for GNU Emacs) is a |multilingualized| application.
It can handle multiple |charset|s and can maintain a mixture of languages in
a single buffer.

Vim is an |internationalized| application.  So, you can change the language
by specifying the |locale| and some options at start time.

==============================================================================
2. Compiling						*multibyte-compiling*

-.  Before you start to compile Vim, be sure that your system has the language
    |locale| of your choice.  You might need to add "-DX_LOCALE" to CFLAGS.

-.  Compiling Vim: >
	./configure --with-x --enable-multibyte --enable-fontset --enable-xim
	make

-.  You can use multi-byte in the Vim GUI, which fully supports the
    |+multi_byte| feature.  If you only use console Vim, low-level multibyte
    input/output depends on your console.  For example, if you run Vim in an
    xterm, you should use a |localized| xterm or an xterm which supports
    |XIM|.  Examples of |localized| xterms are kterm (Kanji term) and hanterm
    (for Korean).  Xterms known to support |XIM| are Eterm (Enlightened
    terminal) and rxvt.

==============================================================================
3. Options						*multibyte-options*

These options are relevant for editing multi-byte files.  Check the help in
options.txt for detailed information.

'encoding'	Encoding used for the keyboard and display.  It is also the
		default encoding for files.
'fileencoding'	Encoding of a file.  When it's different from 'encoding'
		conversion is done when reading or writing the file.
'fileencodings'	List of possible encodings of a file.  When opening a file
		these will be tried and the first one that doesn't cause an
		error is used for 'fileencoding'.
'charconvert'	Expression used to convert files from one encoding to another.

'formatoptions' The 'm' flag can be included to have formatting break a line
		at a multibyte character of 256 or higher.  This is useful for
		languages where a sequence of characters can be broken
		anywhere.

==============================================================================
4. Display						*multibyte-display*

Note that Display and Input are independent.  It is possible to see your
language even though you have no input method for it.

Multibyte output uses the |xfontset| feature.

-.  Be sure that your system has the fonts corresponding to the |CCS|es, which
    the |locale| needs to manage.  See: |xfontset|.

-.  The following is required to use a multibyte language.

    If needed, insert the lines below in your $HOME/.Xdefaults file.

    These 3 lines are specific for Vim:

	Vim.font: |base_font_name_list|
	Vim*fontSet: |base_font_name_list|
	Vim*fontList: your_language_font:

	Note: Vim.font is for text area.
	      Vim*fontSet is for menu.
	      Vim*fontList is for menu (for Motif GUI)

	For example, when you are using Japanese and 14 dots font, >

	Vim.font: -misc-fixed-medium-r-normal--14-*
	Vim*fontSet: -misc-fixed-medium-r-normal--14-*
	Vim*fontList: -misc-fixed-medium-r-normal--14-*
<
	or >

	Vim.font: k14,r14
	Vim*fontSet: k14,r14
	Vim*fontList: k14
<
    The GTK+ version of GUI Vim does not use .Xdefaults, use ~/.gtkrc instead.
    The default mostly works OK.  But for the menus you might have to change
    it.  Example: >

	style "default"
	{
		fontset="-*-*-medium-r-normal--14-*-*-*-c-*-*-*"
	}
	widget_class "*" style "default"
<

    You should set the 'guifontset' option to display a multi-byte language.
    Example: >

	:set guifontset=|base_font_name_list|

<	For example, when you are using Japanese and 14 dots font, >

	set guifontset=-misc-fixed-medium-r-normal--14-*

<	or >

	set guifontset=k14,r14

<	Note: You can not use IM unless you specify 'guifontset'.
	      Therefore, users of Latin-script languages also have to set
	      'guifontset' if they use IM.

    You should not set 'guifont'.  If it is set, Vim ignores 'guifontset'.
    This means Vim runs without fontset support and only English text is
    displayed properly; multi-byte characters are displayed corrupted.

    After the |+xfontset| feature is enabled as explained above, Vim does not
    allow using a single font for the "font" argument of |:highlight|.  For
    example, if you use: >
	:set guifontset=eng_font,your_font
<   in your .gvimrc, then you should use for highlighting: >
	:hi Comment font=another_eng_font,another_your_font
<   If you would do >
	:hi Comment font=another_eng_font
<   Vim will also try to use it as a fontset.  So, if it cannot display your
    |locale| dependent codeset, you will see an error message.

-.  In your .vimrc, add this >
	set fileencoding=korea
<   You can change "korea" to some other name, such as "japan" or "taiwan".
    See |'fileencoding'| for the supported encodings.

-.  If a file's charset is different from your |locale|'s charset, you need to
    convert the charset.  See |charset-conversion|.

==============================================================================
5. Input (XIM, X Input Method support)			*multibyte-input*

Note that Display and Input are independent.  It is possible to see your
language even though you have no input method for it.  But when your Display
method doesn't match your Input method, the text will be displayed wrong.

-.  To input your language you should run the |IM-server| which supports your
    language and |conversion-server| if needed.  Multibyte input uses |XIM|
    feature.

    The next three lines are common to all X applications which use |XIM|.
    If you already use |XIM| you can skip this. >

	*international: True
	*.inputMethod: your_input_server_name
	*.preeditType: your_input_style
<
	Note: your_input_server_name is your |IM-server| name (check your
	      |IM-server| manual).
	      your_input_style is one of |OverTheSpot|, |OffTheSpot|, |Root|.
	      See also |xim-input-style|.
	      *international may not be necessary if you use X11R6.
	      *.inputMethod and *.preeditType are optional if you use X11R6.

	For example, when you are using kinput2 as |IM-server|, >

	*international: True
	*.inputMethod: kinput2
	*.preeditType: OverTheSpot
<
    When using |OverTheSpot|, GUI Vim always connects to the IM Server even in
    Normal mode, so you can input your language with commands like "f" and
    "r".  But when using one of the other two methods, GUI Vim connects to the
    IM Server only if it is not in Normal mode.

    If your IM Server does not support |OverTheSpot|, and if you want to use
    your language with some Normal mode command like "f" or "r", then you
    should use a |localized| xterm or an xterm which supports |XIM|.

-.  If needed, you can set the XMODIFIERS environment variable:

	sh:  export XMODIFIERS="@im=input_server_name"
	csh: setenv XMODIFIERS "@im=input_server_name"

	For example, when you are using kinput2 as |IM-server| and sh, >

	export XMODIFIERS="@im=kinput2"


Contributions specifically for the multi-byte features by:
	Chi-Deok Hwang <hwang@mizi.co.kr>
	Nam SungHyun <namsh@lge.com>
	K.Nagano <nagano@atese.advantest.co.jp>
	Taro Muraoka  <koron@tka.att.ne.jp>
	Yasuhiro Matsumoto <mattn@mail.goo.ne.jp>

==============================================================================
6. Input (Windows IME support)				*multibyte-ime*

{only works in the Windows GUI and when compiled with the |+multi_byte_ime|
feature}

To input multibyte characters on Windows, you have to use an Input Method
Editor (IME).  While editing text you must switch the IME on and off very
often.  Because an IME that is switched on intercepts all of your key input,
you cannot type 'j', 'k', or almost any other key to Vim directly.

The |+multi_byte_ime| feature helps with this.  It reduces the number of times
you have to switch the IME manually.  In Normal mode there is almost never a
need for the IME, even when editing multibyte text.  So when you leave Insert
mode with ESC, Vim remembers the last status of the IME and forces the IME
off.  When you enter Insert mode again, Vim automatically restores the IME to
the remembered status.

This works not only when switching between Insert and Normal mode, but also
for entering a search command and for Replace mode.

Cursor color when IME is on				*CursorIM*
    There is a nice little feature for the IME: the cursor can indicate the
    status of the IME by changing its color.  Usually the status of the IME
    is indicated by a small icon in a corner of the desktop (or taskbar),
    where it is not easy to see.  This feature helps with that.

    You can select the cursor color for when the IME is on with the highlight
    group CursorIM.  For example, add these lines to your _gvimrc: >

	if has('multi_byte_ime')
	    highlight Cursor guibg=Green guifg=NONE
	    highlight CursorIM guibg=Purple guifg=NONE
	endif
<
    With this the cursor is green when the IME is off and purple when the IME
    is on.

WHAT IS IME
    The IME is a part of East Asian versions of Windows.  It helps you to
    input multibyte characters.  English and other language versions of
    Windows do not have an IME (usually there is no need for one).  But there
    is one called Microsoft Global IME.  Global IME is a part of Internet
    Explorer 4.0 or above.  You can get more information about Global IME at
    the URLs below.

WHAT IS GLOBAL IME					*global-ime*
    Global IME makes it possible to input Chinese, Japanese, and Korean text
    into a Vim buffer on any language version of Windows 98, Windows 95, and
    Windows NT 4.0.  Please see the URLs below for details of Global IME.  You
    can also find the various language versions of Global IME at the same
    place.

    - Global IME detailed information.
	http://www.microsoft.com/windows/ie/features/ime.asp

    - Active Input Method Manager (Global IME)
	http://msdn.microsoft.com/workshop/misc/AIMM/aimm.asp

    Support for Global IME is an experimental feature.

==============================================================================
7. UTF-8						*UTF-8* *utf-8*

Vim has comprehensive UTF-8 support.  It appears to work in:
- xterm with utf-8 support enabled
- Athena, Motif and GTK GUI
- MS-Windows GUI

Double-width characters are supported.  This works best with 'guifontwide' or
'guifontset'.  When using only 'guifont' the wide characters are drawn in the
normal width and a space to fill the gap.

Up to two combining characters can be used.  The combining character is drawn
on top of the preceding character.  When editing text a composing character is
mostly considered part of the preceding character.  For example "x" will
delete a character and its following composing characters by default. If the
'delcombine' option is on, then pressing 'x' will delete the combining
characters, one at a time, then the base character.  But when inserting, you
type the first character and the following composing characters separately,
after which they will be joined.  The "r" command will not allow you to type a
combining character, because it doesn't know one is coming.  Use "R" instead.

Bytes which are not part of a valid UTF-8 byte sequence are handled like a
single character and displayed as <xx>, where "xx" is the hex value of the
byte.

Overlong sequences are not handled specially and displayed like a valid
character.  However, search patterns may not match on an overlong sequence.
(an overlong sequence is where more bytes are used than required for the
character.)  An exception is NUL (zero) which is displayed as "<00>".
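
To make the notion of an overlong sequence concrete, here is a minimal sketch
in Python.  It is not how Vim is implemented; the function name is_overlong()
is an assumption, and the sketch handles only sequences of one to four bytes,
not the longer forms that a 31-bit value would need.

```python
def is_overlong(seq):
    """Return True if the UTF-8 style byte sequence uses more bytes than
    the value it encodes requires.  seq must hold exactly one sequence."""
    if len(seq) == 0:
        raise ValueError("empty sequence")
    first = seq[0]
    if first < 0x80:
        length, value = 1, first
    elif (first & 0xE0) == 0xC0:
        length, value = 2, first & 0x1F
    elif (first & 0xF0) == 0xE0:
        length, value = 3, first & 0x0F
    elif (first & 0xF8) == 0xF0:
        length, value = 4, first & 0x07
    else:
        raise ValueError("not a lead byte handled by this sketch")
    if len(seq) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for byte in seq[1:]:
        if (byte & 0xC0) != 0x80:
            raise ValueError("invalid trail byte")
        value = (value << 6) | (byte & 0x3F)
    # Smallest value that genuinely needs 1, 2, 3 or 4 bytes.
    minimum = (0, 0x80, 0x800, 0x10000)[length - 1]
    return value < minimum


if __name__ == "__main__":
    print(is_overlong(b"/"))          # the one-byte form of "/"
    print(is_overlong(b"\xc0\xaf"))   # "/" stored in two bytes
    print(is_overlong(b"\xc0\x80"))   # NUL stored in two bytes
```

Python's own strict "utf-8" decoder rejects such sequences, which is why the
sketch decodes the bytes by hand.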

In the file and buffer the full range of Unicode characters can be used (31
bits).  However, displaying only works for 16 bit characters, and only for the
characters present in the selected font.

Useful commands:
- "ga" shows the decimal, hexadecimal and octal value of the character under
  the cursor.  If there are composing characters these are shown too. (if the
  message is truncated, use ":messages").
- "g8" shows the bytes used in a UTF-8 character, also the composing
  characters, as hex numbers.
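
The values these two commands report can be computed outside Vim as well.  The
following Python sketch is only an illustration and does not reproduce the
exact layout of Vim's messages; the function names char_values() and
utf8_hex() are assumptions.

```python
def char_values(ch):
    """Return the decimal, hexadecimal and octal value of one character,
    the three numbers that "ga" reports."""
    number = ord(ch)
    return number, format(number, "x"), format(number, "o")


def utf8_hex(ch):
    """Return the bytes of the UTF-8 encoding of ch as hex numbers, the
    kind of information "g8" reports."""
    return " ".join(format(byte, "02x") for byte in ch.encode("utf-8"))


if __name__ == "__main__":
    for ch in ("A", "\u00e9", "\u3042"):
        print(char_values(ch), utf8_hex(ch))
```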


USING UTF-8

If your current locale is in a utf-8 encoding, Vim will automatically start
in utf-8 mode.

If you are using another locale: >

	set encoding=utf-8

You might also want to select the font used for the menus.  Unfortunately this
doesn't always work.  See the system specific remarks below, and 'langmenu'.


USING UTF-8 IN AN XTERM					*utf-8-in-xterm*

When using Vim in an xterm, the xterm must have been compiled with utf-8
support.  See |UTF8-xterm|.

Start the xterm with the "-u8" argument.  You might also need to specify a
font.  Example: >

   xterm -u8 -fn -misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1


USING UTF-8 IN X-Windows				*utf-8-in-xwindows*

You need to specify a font to be used.  For double-wide characters another
font is required, which is exactly twice as wide.  There are three ways to do
this:

1. Set 'guifont' and let Vim find a matching 'guifontwide'
2. Set 'guifont' and 'guifontwide'
3. Set 'guifontset'

See the documentation for each option for details.  Example: >

   :set guifont=-misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1

You might also want to set the font used for the menus.  This only works for
Motif.  Use the ":hi Menu font={fontname}" command for this. |:highlight|


USING UTF-8 IN MS-WINDOWS				*utf-8-in-mswindows*

Sorry, I don't know how to select Unicode fonts for MS-Windows.  It probably
works a lot better on MS-Windows NT and 2000, compared to 95/98/ME.


USING UTF-8 IN A NON-UNICODE TERMINAL			*utf-8-in-others*

You can also edit utf-8 files in a normal terminal or in an environment that
uses a different encoding.  Do this: >

	let &termencoding = &encoding
	set encoding=utf-8

This makes it possible to edit utf-8 files while your locale is, for example,
in a Japanese encoding or latin1.

The first line causes Vim to translate characters from/to utf-8 when
communicating with the terminal.  This only works when the conversion is
possible (for "latin1" it's done internally, other conversions require the
|+iconv| feature).

Characters which can't be converted will be displayed with a '?', '_' or other
character.  Use the "ga" or "g8" command to find out which character it really
is.
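
If these commands go into your vimrc file, a more defensive variant can be
used.  This is only a sketch: it does nothing when the |+multi_byte| feature
is missing, and it keeps a 'termencoding' value that was already set: >

	if has("multi_byte")
	  if &termencoding == ""
	    let &termencoding = &encoding
	  endif
	  set encoding=utf-8
	endif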


TYPING UTF-8						*utf-8-typing*

If you are using X-Windows, you should find an input method that supports
utf-8.

If your system does not provide support for typing utf-8, you can use the
'keymap' feature.  This allows writing a keymap file, which defines a utf-8
character as a sequence of ASCII characters.  See |multilang-typing|.

If everything else fails, you can type any character as four hex digits: >

	CTRL-V u 1234

"1234" is interpreted as a hex number.  You must type four digits; prepend
zeros if necessary.
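
For example (the character values are taken from the Unicode standard): >

	CTRL-V u 00e9		e with acute accent, U+00E9
	CTRL-V u 20ac		euro sign, U+20AC

Afterwards the "ga" command on the inserted character can be used to check
its value.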


COMMAND ARGUMENTS					*utf-8-char-arg*

Commands like |f|, |F|, |t| and |r| take an argument of one character.  For
UTF-8 this argument may include one or two composing characters.  These need
to be produced together with the base character: Vim doesn't wait for the next
character to be typed to find out if it is a composing character or not.
Using 'keymap' or |:lmap| is a nice way to type these characters.

The commands that search for a character in a line handle composing characters
as follows.  When searching for a character without a composing character,
this will find matches in the text with or without composing characters.  When
searching for a character with a composing character, this will only find
matches with that composing character.  It was implemented this way because
not everybody is able to type a composing character.
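
To illustrate, assume a line contains a plain "e" and also an "e" followed by
the composing character U+0301 (combining acute accent):

	typed argument		matches in the text ~
	e			both the plain "e" and "e" + U+0301
	e + U+0301		only "e" + U+0301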

==============================================================================
8. UTF-8 in XFree86 xterm				*UTF8-xterm*

This is a short explanation of how to use UTF-8 character encoding in the
xterm that comes with XFree86 by Thomas Dickey (text by Markus Kuhn).

Get the latest xterm version, which now has UTF-8 support:

	http://www.clark.net/pub/dickey/xterm/xterm.tar.gz

Compile it with "./configure --enable-wide-chars ; make"

Also get the ISO 10646-1 version of various fonts, which is available at

	http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz

and install the fonts as described in the README file.

Now start xterm with >

  xterm -u8 -fn -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1
or, for bigger characters: >
  xterm -u8 -fn -misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1

and you will have a working UTF-8 terminal emulator. Try both >

   cat utf-8-demo.txt
   vim utf-8-demo.txt

with the demo text that comes with ucs-fonts.tar.gz in order to see
whether there are any problems with UTF-8 in your xterm.

For Vim you may need to set 'encoding' to "utf-8", see |utf-8-in-xterm|.

==============================================================================
9. Unfinished snippets					*mbyte-snippets*

							*encoding-names*
Vim can use many different character encodings.  There are three major groups:

1   8bit	Single-byte encodings, 256 different characters.  Mostly used
		in USA and Europe.  Example: ISO-8859-1 (Latin1).  All
		characters occupy one screen cell only.

2   2byte	Double-byte encodings, over 10000 different characters.
		Mostly used in Asian countries.  Example: euc-jp (Japanese).
		The number of screen cells is equal to the number of bytes.

u   Unicode	Universal encoding, should replace all others.  ISO 10646.
		Millions of different characters.  Example: UTF-8.  The
		relation between bytes and screen cells is complex.

Other encodings cannot be used by Vim internally.  Files in other encodings
can be edited by using conversion, see 'fileencoding'.
Note that all encodings must use ASCII for the characters below 128 (except
when compiled for EBCDIC).

Supported 'encoding' values are:
1   latin1	8-bit characters (ISO 8859-1)
1   iso-8859-n	ISO_8859 variant (n = 2 to 15)
1   8bit-{name} any 8-bit encoding (Vim specific name)
1   cp{number}	MS-Windows: any installed single-byte codepage
2   cp932	Japanese (Windows only)
2   euc-jp	Japanese (Unix only)
2   sjis	Japanese (Unix only)
2   cp949	Korean (Windows only)
2   euc-kr	Korean (Unix only)
2   cp936	simplified Chinese (Windows only)
2   chinese	simplified Chinese (Unix only, on Windows an alias for cp936)
2   cp950	traditional Chinese (on Unix alias for big5)
2   big5	traditional Chinese (on Windows alias for cp950)
2   euc-tw	traditional Chinese (Unix only)
2   2byte-{name} Unix: any double-byte encoding (Vim specific name)
2   cp{number}	MS-Windows: any installed double-byte codepage
u   utf-8	32 bit UTF-8 encoded Unicode (ISO/IEC 10646-1)
u   ucs-2	16 bit UCS-2 encoded Unicode (ISO/IEC 10646-1)
u   ucs-2le	like ucs-2, little endian
u   ucs-4	32 bit UCS-4 encoded Unicode (ISO/IEC 10646-1)
u   ucs-4le	like ucs-4, little endian
    cp{number}	MS-Windows: any installed codepage which is single-byte or
		double-byte

The {name} can be any encoding name that your system supports.  It is passed
to iconv() to convert between the encoding of the file and the current locale.
For MS-Windows "cp{number}" means using codepage {number}.
Examples: >
		:set encoding=8bit-cp1252
		:set encoding=2byte-cp932
<
Several aliases can be used; they are translated to one of the names above.
An incomplete list:

1   ansi	same as latin1 (obsolete, for backward compatibility)
2   japan	Japanese: on Unix "euc-jp", on MS-Windows shift-JIS (cp932)
2   korea	Korean: on Unix "euc-kr", on MS-Windows cp949
2   prc		simplified Chinese: on Unix "chinese", on MS-Windows cp936
2   taiwan	traditional Chinese: on Unix "euc-tw", on MS-Windows cp950
u   utf8	same as utf-8
u   unicode	same as ucs-2
u   ucs2be	same as ucs-2 (big endian)
u   ucs-2be	same as ucs-2 (big endian)
u   ucs-4be	same as ucs-4 (big endian)

For the UCS codes the byte order matters.  This is tricky; use UTF-8 whenever
you can.  The default is to use big-endian (most significant byte comes
first):
	    name	bytes		char ~
	    ucs-2	      11 22	    1122
	    ucs-2be	      11 22	    1122
	    ucs-2le	      22 11	    1122
	    ucs-4	11 22 33 44	11223344
	    ucs-4be	11 22 33 44	11223344
	    ucs-4le	44 33 22 11	11223344

On MS-Windows systems you often want to use "ucs-2le", because it uses little
endian UCS-2.
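
As a worked example, the euro sign (U+20AC) would be stored like this.  The
UTF-8 bytes are included for comparison; they do not depend on byte order:

	    name	bytes		char ~
	    ucs-2	      20 ac	    20ac
	    ucs-2le	      ac 20	    20ac
	    ucs-4	00 00 20 ac	000020ac
	    ucs-4le	ac 20 00 00	000020ac
	    utf-8	   e2 82 ac	    20ac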


							*encoding-table*
Normally 'encoding' is equal to your current locale and 'termencoding' is
empty.  This means that your keyboard and display work with characters encoded
in your current locale, and Vim uses the same characters internally.
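
To see which values are currently in effect, both options can be queried at
once: >
	:set encoding? termencoding?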

You can make Vim use characters in a different encoding by setting the
'encoding' option to a different value.  Since the keyboard and display still
use the current locale, conversion needs to be done.  The 'termencoding' then
takes over the value of the current locale, so Vim converts between 'encoding'
and 'termencoding'.  Example: >
	:let &termencoding = &encoding
	:set encoding=utf-8

However, not all combinations of values are possible.  The table below tells
you how each of the nine combinations works.  This is further restricted by
not all conversions being possible, iconv() being present, etc.  Since this
depends on the system used, no detailed list can be given.

('tenc' is the short name for 'termencoding' and 'enc' short for 'encoding')

'tenc'	    'enc'	remark ~

 8bit	    8bit	Works.  When 'termencoding' is different from
			'encoding', typing and displaying may be wrong for some
			characters; Vim does NOT perform conversion (set
			'encoding' to "utf-8" to get this).
 8bit      2byte	MS-Windows: works for all codepages installed on your
			system; you can only type 8bit characters;
			Other systems: does NOT work.
 8bit	   Unicode	Works, but you can only type 8bit characters; in a
			terminal you can only see 8bit characters; the GUI can
			show all characters that the 'guifont' supports.

 2byte	    8bit	Works, but typing non-ASCII characters might
			be a problem.
 2byte	   2byte	MS-Windows: works for all codepages installed on your
			system; typing characters might be a problem when
			locale is different from 'encoding'.
			Other systems: Only works when 'termencoding' is equal
			to 'encoding', you might as well leave it empty.
 2byte	   Unicode	Works; Vim will translate typed characters.

 Unicode    8bit	Works (unusual).
 Unicode    2byte	Does NOT work.
 Unicode   Unicode	Works very well (leaving 'termencoding' empty works
			the same way, because all Unicode is handled
			internally as UTF-8).
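
As an illustration of the "2byte" 'tenc' with "Unicode" 'enc' row, a sketch
for a terminal that uses the Japanese euc-jp encoding.  This assumes the
conversion between euc-jp and utf-8 is available, which requires the |+iconv|
feature: >
	:set termencoding=euc-jp
	:set encoding=utf-8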

 vim:tw=78:
