Explanation of UTF-8 and Unicode

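This article introduces Unicode and UTF-8. Unicode is a standard character set, and UTF-8 is one way of encoding it; for English (ASCII) characters the two coincide. The article also explains why UTF-8 is preferred, covers its encoding length and important features, and gives the rules for representing Unicode in UTF-8 together with worked examples.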

What is Unicode?

  A mapping between characters and numeric indexes (code points). A code point is written as U+xxxx, where xxxx is its hexadecimal value.
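The mapping can be inspected from Java. The following is a minimal sketch, not part of the original article; the class name CodePointDemo and the helper notation are made up for illustration. It prints the code point of each character in U+xxxx form, together with the character name the JDK knows for it.

```java
class CodePointDemo {
    // Formats a code point in the conventional U+XXXX notation.
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // Letter A, copyright sign, not-equal sign (written as escapes so the
        // source file itself stays pure ASCII).
        String text = "A\u00A9\u2260";
        // codePoints() yields one int per Unicode character.
        text.codePoints().forEach(cp ->
                System.out.println(notation(cp) + "  " + Character.getName(cp)));
    }
}
```

Note that the number printed here is only the index in the character set; it says nothing yet about how the character is stored as bytes.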

Confused about Unicode and UTF-8?
    Unicode is a standard character set; UTF-8 is one way of encoding it as bytes. Other encodings such as UCS-2 and UCS-4 exist, but UTF-8 has become the standard one. Note one thing: for English characters, the code point and its UTF-8 byte have the same value, that is:

U-00000000 - U-0000007F:  0xxxxxxx

    Programmers in particular often find U-00000000 - U-0000007F sufficient for daily use (the 26 English letters plus some symbols), so for them there is no visible difference between the character set standard (Unicode) and the encoding standard (UTF-8). When they talk to you and use the two terms interchangeably, you may get confused.

Why UTF-8?
    You may ask: why not use UCS-4 or UCS-2? Do people simply like the number 8 more (in Cantonese, it sounds like "getting rich")?
   
    The answer is no. Using UCS-2 (or UCS-4) under Unix would lead to very severe problems. Strings in these encodings can contain, as parts of many wide characters, bytes like '\0' or '/', which have a special meaning in filenames and other C library function parameters.

(An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every ASCII byte. If we want to have a UCS-4 file, we have to insert three 0x00 bytes instead before every ASCII byte.)
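The byte-level problem can be made visible with a small sketch. This is an illustration under stated assumptions: Java's UTF-16BE charset is used as a stand-in for big-endian UCS-2 (the two agree for BMP characters), and the class name WideEncodingHazard and the helper containsByte are invented for this example. U+012F is chosen because its low byte happens to be 0x2F, the ASCII code of '/'.

```java
import java.nio.charset.StandardCharsets;

class WideEncodingHazard {
    // Returns true if the encoded bytes contain the given byte value.
    static boolean containsByte(byte[] bytes, int value) {
        for (byte b : bytes) {
            if ((b & 0xFF) == value) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // UTF-16BE stands in for big-endian UCS-2 here (an assumption of this sketch).
        byte[] asciiWide = "A".getBytes(StandardCharsets.UTF_16BE);    // 00 41
        byte[] wide = "\u012F".getBytes(StandardCharsets.UTF_16BE);    // 01 2F
        byte[] utf8 = "\u012F".getBytes(StandardCharsets.UTF_8);       // C4 AF

        System.out.println("16-bit 'A' contains 0x00:    " + containsByte(asciiWide, 0x00));
        System.out.println("16-bit U+012F contains '/':  " + containsByte(wide, 0x2F));
        System.out.println("UTF-8  U+012F contains '/':  " + containsByte(utf8, 0x2F));
    }
}
```

A C function that treats 0x00 as the end of a string, or 0x2F as a path separator, would misread the first two byte sequences but not the UTF-8 one.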

    In addition, the majority of UNIX tools expect ASCII files and can't read 16-bit words as characters without major modifications. In UTF-8, by contrast, U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.

------------Prove that ASCII and UTF-8 are the same---------
package unicode;

import java.nio.charset.StandardCharsets;

public class CharTest {
    public static void main(String[] args) {
        String str = "\u007F";
        System.out.println("within 0000 - 007F : " + str);
        // For characters below U+0080, ISO-8859-1 and UTF-8 produce the same
        // bytes, so the two encodings are compatible in this range.
        System.out.println("   UTF-8 -> ISO-8859-1 "
                + new String(str.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1));
        System.out.println("   ISO-8859-1 -> UTF-8 "
                + new String(str.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8));

        str = "\u00F2";
        // The principle above does not apply to characters above U+007F:
        // both conversions below print something other than the original.
        System.out.println("out of 0000 - 007F : " + str);
        System.out.println("   UTF-8 -> ISO-8859-1 "
                + new String(str.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1));
        System.out.println("   ISO-8859-1 -> UTF-8 "
                + new String(str.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8));
    }
}
---------------------------------------------------------------------------------

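How long is a UTF-8 encoding?
    Theoretically, it can be up to 6 bytes, but in practice 3 bytes are enough for us, since no BMP character needs more than 3. (The most commonly used characters, including all those found in major older encoding standards, have been placed into the first plane (0x0000 to 0xFFFD), which is called the Basic Multilingual Plane (BMP).) Characters outside the BMP, up to U+10FFFF, take 4 bytes; RFC 3629 restricts UTF-8 to that range, so 5- and 6-byte sequences are no longer valid in current UTF-8.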
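The encoded length is easy to check against the JDK's own UTF-8 encoder. A minimal sketch follows; the class name Utf8Length and the helper encodedLength are hypothetical, and the sample code points are chosen only for illustration (U+1F600 lies outside the BMP).

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Number of bytes the JDK's UTF-8 encoder produces for one code point.
    static int encodedLength(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xA9, 0x2260, 0xFFFD, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, encodedLength(cp));
        }
    }
}
```

The first four samples are BMP characters and need at most 3 bytes; the last one needs 4.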
   

Important UTF-8 features:
  1. UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
  2. All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
  3. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes. (?? This point needs further investigation; I cannot fully explain it at the moment.)
  4. All possible 2^31 UCS codes can be encoded.
  5. UTF-8 encoded characters may theoretically be up to six bytes long, however 16-bit BMP characters are only up to three bytes long.
  6. The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
------------Prove the features (1, 2, 3)-----------------
package unicode;

import java.nio.charset.StandardCharsets;

public class UTF8Features {
    public static void main(String[] args) {
        // Why not write non-ASCII characters directly in the source?
        // Because the compiler may read the file in the platform's encoding,
        // which is not necessarily UTF-8 as you might imagine.
        String str = "\u007F";
        System.out.println("Point 1 : " + str);
        System.out.println("   UTF-8 -> ISO-8859-1 "
                + new String(str.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1));
        System.out.println("   ISO-8859-1 -> UTF-8 "
                + new String(str.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8));
        System.out.println();

        str = "\uE840";
        System.out.println("Point 2 : " + str);
        // Just a sample; use this method to verify more characters.
        System.out.println("   No byte less than 7F   " + getHexString(str));

        str = "\u2260";
        // Just a sample; use this method to verify more characters.
        System.out.println("Point 3 : " + str);
        System.out.println("   Range of 1st byte      " + getHexString(str));
    }

    // Returns the UTF-8 bytes of the text as space-separated hex values.
    public static String getHexString(String text) {
        StringBuilder sb = new StringBuilder();
        // UTF-8 must be specified here; otherwise the default encoding is used,
        // which depends on your environment.
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            // Mask with 0xFF to read the signed byte as an unsigned value.
            sb.append(Integer.toHexString(b & 0xFF).toUpperCase()).append(' ');
        }
        return sb.toString();
    }
}

---------------------------------------------------------------------------------
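The resynchronization claim in feature 3 follows from the byte ranges: every continuation byte matches the bit pattern 10xxxxxx, and no first byte does. A decoder that lands in the middle of a character, or loses a byte, can therefore skip continuation bytes until it reaches the start of the next character. The sketch below illustrates that idea only; the class name Utf8Resync and its methods are made up for this example and are not a full decoder.

```java
class Utf8Resync {
    // A continuation byte has the bit pattern 10xxxxxx (0x80 to 0xBF).
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Returns the index of the first byte at or after 'from' that can start
    // a character, or bytes.length if there is none.
    static int nextCharStart(byte[] bytes, int from) {
        int i = from;
        while (i < bytes.length && isContinuation(bytes[i])) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // U+2260 (E2 89 A0) followed by 'A' (41), with the first byte E2 lost.
        byte[] damaged = {(byte) 0x89, (byte) 0xA0, 0x41};
        int start = nextCharStart(damaged, 0);
        System.out.println("Decoding can resume at index " + start);
    }
}
```

Only the damaged character is lost; everything after it decodes normally, and no state from earlier bytes is needed to find the boundary.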
Principle of representing a Unicode code point in UTF-8:

U-00000000 - U-0000007F:  0xxxxxxx 
U-00000080 - U-000007FF:  110xxxxx 10xxxxxx 
U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx 
U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 
U-00200000 - U-03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 
U-04000000 - U-7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 
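The table translates almost directly into code. The following is a rough sketch that follows the six rows literally, not a production encoder: it performs no validation (for instance, it does not reject surrogates or values above U+10FFFF, which current UTF-8 disallows), and the class name Utf8Encoder and its methods are hypothetical.

```java
class Utf8Encoder {
    // Encodes one code point using the bit patterns from the table.
    static byte[] encode(int cp) {
        if (cp < 0) {
            throw new IllegalArgumentException("negative code point");
        }
        if (cp < 0x80) {
            return new byte[]{(byte) cp};               // 0xxxxxxx
        }
        // Pick the row of the table by range.
        int length = cp < 0x800 ? 2
                : cp < 0x10000 ? 3
                : cp < 0x200000 ? 4
                : cp < 0x4000000 ? 5 : 6;
        byte[] out = new byte[length];
        // Fill the 10xxxxxx bytes from the low end, six payload bits each.
        for (int i = length - 1; i > 0; i--) {
            out[i] = (byte) (0x80 | (cp & 0x3F));
            cp >>>= 6;
        }
        // First byte: 'length' one-bits, a zero bit, then the remaining payload.
        int marker = (0xFF << (8 - length)) & 0xFF;
        out[0] = (byte) (marker | cp);
        return out;
    }

    // Space-separated, zero-padded hex dump of the bytes.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println("U+00A9 -> " + hex(encode(0xA9)));     // C2 A9
        System.out.println("U+2260 -> " + hex(encode(0x2260)));   // E2 89 A0
    }
}
```

In real code, String.getBytes(StandardCharsets.UTF_8) does this work; the manual version is only useful for seeing where each bit goes.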

How to use the principle above?

Sample:
The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as

    11000010 10101001 = 0xC2 0xA9

Explanation:

A: 1010

9: 1001

Rule 2 applies, because 00000080 <= 00A9 <= 000007FF, so the pattern is 110xxxxx 10xxxxxx.

Fill in the x bits from low to high:

1. The low byte has 6 x bits. Take the last 6 bits of 10101001 (A9), which are 101001.

2. The high byte has 5 x bits. Take the remaining 2 bits of A9, which are 10, and pad them on the left with three 0 bits to get 00010.

Complete the low byte with the prefix 10: (10) combined with (101001) -> 10101001

Complete the high byte with the prefix 110: (110) combined with (00010) -> 11000010

The result is

11000010 10101001 = 0xC2 0xA9

You can verify the following code point against rule 3 in the same way:

U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx 

The character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:

    11100010 10001001 10100000 = 0xE2 0x89 0xA0
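Both worked examples can be checked in the reverse direction by stripping the prefixes and gluing the x bits back together. The following is a small sketch limited to sequences of 1 to 4 bytes; the class name Utf8Decoder is made up, and the code is meant to mirror the hand calculation rather than to serve as a robust decoder.

```java
class Utf8Decoder {
    // Decodes the single UTF-8 sequence that starts at bytes[0].
    static int decode(byte[] bytes) {
        int first = bytes[0] & 0xFF;
        if (first < 0x80) {
            return first;                               // 0xxxxxxx
        }
        int length;
        if ((first & 0xE0) == 0xC0) {
            length = 2;                                 // 110xxxxx
        } else if ((first & 0xF0) == 0xE0) {
            length = 3;                                 // 1110xxxx
        } else if ((first & 0xF8) == 0xF0) {
            length = 4;                                 // 11110xxx
        } else {
            throw new IllegalArgumentException("unsupported first byte");
        }
        if (bytes.length < length) {
            throw new IllegalArgumentException("truncated sequence");
        }
        // Keep only the x bits of the first byte.
        int cp = first & (0xFF >> (length + 1));
        for (int i = 1; i < length; i++) {
            int b = bytes[i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                throw new IllegalArgumentException("expected a 10xxxxxx byte");
            }
            // Append the six x bits of each following byte.
            cp = (cp << 6) | (b & 0x3F);
        }
        return cp;
    }

    public static void main(String[] args) {
        int copyright = decode(new byte[]{(byte) 0xC2, (byte) 0xA9});
        int notEqual = decode(new byte[]{(byte) 0xE2, (byte) 0x89, (byte) 0xA0});
        System.out.printf("C2 A9    -> U+%04X%n", copyright);   // U+00A9
        System.out.printf("E2 89 A0 -> U+%04X%n", notEqual);    // U+2260
    }
}
```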

Reference:

http://www.cl.cam.ac.uk/~mgk25/unicode.html#unicode

