Unicode -- How to compute UTF-16 from a code point
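Copyright notice: this is an original work. Reposting is permitted, provided that the original source, the author information, and this notice are indicated with a hyperlink. Otherwise legal action will be pursued. http://h2appy.blog.51cto.com/609721/144639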
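UTF-16 is what people commonly call "Unicode". Calling UTF-16 "Unicode" is not really appropriate and easily causes confusion, because Unicode is a character set, not an actual storage encoding scheme. UTF-16 is a variable-length encoding scheme.

For example, take the character whose Unicode code point is 2F92B. Saved as UTF-16 (which is what Notepad on Windows XP calls "Unicode"), it becomes the bytes 7E D8 2B DD; in big endian it is D8 7E DD 2B. Where does this value come from?

For Unicode characters in the range 0-FFFF, UTF-16 uses a single two-byte code unit whose value is the code point itself. For Unicode characters in the range 10000-10FFFF, UTF-16 uses a surrogate pair, that is, two 16-bit code units. The conversion works as follows: subtract 0x10000 from the code point to get a 20-bit value, add the high 10 bits to 0xD800 to get the first surrogate, and add the low 10 bits to 0xDC00 to get the second surrogate.

Below, the Unicode character with code point U+64321 (hexadecimal) is encoded as UTF-16. Because it is greater than U+FFFF, it must be encoded as a surrogate pair:

v = 0x64321
v' = v - 0x10000 = 0x54321 = 0101 0100 0011 0010 0001 (binary)
high 10 bits = 0101010000 = 0x150, low 10 bits = 1100100001 = 0x321
w1 = 0xD800 + 0x150 = 0xD950
w2 = 0xDC00 + 0x321 = 0xDF21

So U+64321 is encoded as the code unit sequence 0xD950 0xDF21. The same steps applied to 2F92B give 0xD87E 0xDD2B, which is the byte sequence shown above.

Detailed description: The improvement that UTF-16 made over UCS-2 is its ability to encode characters in planes 1–16, not just those in plane 0 (BMP).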
UTF-16 represents non-BMP characters (those from U+10000 through U+10FFFF) using a pair of 16-bit words, known as a surrogate pair. First 0x10000
is subtracted from the code point to give a 20-bit value. This is then
split into two separate 10-bit values each of which is represented as a
surrogate with the most significant half placed in the first surrogate.
To allow safe use of simple word-oriented
string processing, separate ranges of values are used for the two
surrogates: 0xD800–0xDBFF for the first, most significant surrogate and
0xDC00-0xDFFF for the second, least significant surrogate.
For example, the character at code point U+10000 becomes the code
unit sequence 0xD800 0xDC00, and the character at U+10FFFD, the last
private-use code point just below the Unicode upper limit U+10FFFF,
becomes the sequence 0xDBFF 0xDFFD. Unicode and
ISO/IEC 10646 do not, and will never, assign characters to any of the
code points in the U+D800–U+DFFF range, so an individual code value
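from a surrogate pair does not ever represent a character.

The calculation above can be done with the scientific mode of the calculator that ships with Windows, or of course you can write a small program of your own :)

To type characters in the range 10000-10FFFF, you can use the Microsoft Pinyin IME, which has a feature for entering characters by their Unicode code.

To display the Chinese characters among them, you can install Unifont; see the Haifeng Wubi (海峰五笔) website.

For background on encodings, search Google for the article series "Java中的字符集编码入门" (an introduction to character set encodings in Java), which is very well written.

One caveat: on the .NET Framework, if a string variable name contains two characters, one in the range 0-FFFF and the other in the range 10000-10FFFF, then the length of name is 3 rather than 2, because name occupies 6 bytes, that is, three 16-bit code units.

The small conversion program mentioned above, written as a Windows Forms application:

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;

namespace CodePoint2UTF16
{
    // The controls tbUnicodeCodePoint, tbUTF16Code and btnConvert are expected
    // to be declared in the designer-generated half of this partial class.
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void btnConvert_Click(object sender, EventArgs e)
        {
            String cp = tbUnicodeCodePoint.Text.Trim();

            try
            {
                // Parse the input as a hexadecimal code point.
                int n = Convert.ToInt32(cp, 16);
                if (n < 0 || n > 0x10FFFF)
                {
                    MessageBox.Show(cp + " is not in 0x0 - 0x10FFFF");
                    return;
                }
                // Code points in U+D800-U+DFFF never represent characters.
                if (n >= 0xD800 && n <= 0xDFFF)
                {
                    MessageBox.Show(cp + " is a surrogate code point, not a character");
                    return;
                }
                if (n < 0x10000)
                {
                    // BMP: the single code unit equals the code point.
                    tbUTF16Code.Text = Convert.ToString(n, 16);
                    return;
                }
                else
                {
                    n -= 0x10000;       // 20-bit value
                    int h = n >> 10;    // high 10 bits
                    int l = n & 0x3FF;  // low 10 bits
                    h |= 0xD800;        // first (most significant) surrogate
                    l |= 0xDC00;        // second (least significant) surrogate
                    tbUTF16Code.Text = Convert.ToString(h, 16) + " " + Convert.ToString(l, 16);
                }
            }
            catch (Exception ex)
            {
                MessageBox.Show("Invalid text: " + cp + Environment.NewLine + ex.Message);
            }
        }
    }
}
```

This article is from the "GONE WITH THE WIND" blog on 51CTO.COM; please be sure to keep this source reference: http://h2appy.blog.51cto.com/609721/144639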
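The caveat about string length on .NET can be checked with a short console program. The following is only a minimal sketch, not part of the original program: the class name Utf16Demo and the helper Hex are illustrative names chosen here, and the code point 2F92B is simply reused from the example earlier in the article. It relies on the standard library calls char.ConvertFromUtf32, StringInfo.LengthInTextElements, Encoding.Unicode and Encoding.BigEndianUnicode.

```csharp
using System;
using System.Globalization;
using System.Text;

// Illustrative sketch; Utf16Demo and Hex are hypothetical names.
public static class Utf16Demo
{
    // Formats bytes as space-separated uppercase hex, e.g. "7E D8 2B DD".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    public static void Main()
    {
        // char.ConvertFromUtf32 returns the surrogate pair for a code point above U+FFFF.
        string supplementary = char.ConvertFromUtf32(0x2F92B);

        // One character from 0-FFFF plus one character from 10000-10FFFF.
        string name = "A" + supplementary;

        // Length counts 16-bit code units, not characters: prints 3.
        Console.WriteLine(name.Length);

        // StringInfo counts text elements, so the surrogate pair counts once: prints 2.
        Console.WriteLine(new StringInfo(name).LengthInTextElements);

        // UTF-16 little endian bytes (no byte order mark): prints 7E D8 2B DD.
        Console.WriteLine(Hex(Encoding.Unicode.GetBytes(supplementary)));

        // UTF-16 big endian bytes: prints D8 7E DD 2B.
        Console.WriteLine(Hex(Encoding.BigEndianUnicode.GetBytes(supplementary)));
    }
}
```

If the number of user-visible characters matters, counting with StringInfo (or walking the string and skipping low surrogates with char.IsLowSurrogate) is one option; Length remains the right value when indexing into the string or sizing buffers.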

