In this computer science video you will learn about text files. Specifically, you will see how Unicode code points are encoded into binary, and why the byte order (the endianness) of some Unicode Transformation Formats can be an important consideration if you are a programmer handling text data or if you build websites.
The video demonstrates how Unicode code points are encoded in ASCII, UCS-2, UTF-16, UCS-4, UTF-32 and UTF-8, and it discusses some of the advantages and disadvantages of these encodings. The UTF-16 high surrogate and low surrogate format is explained, including its effect on the available range of code points. The UTF-8 bit patterns are also described in detail.
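As a rough illustration of the surrogate split and the UTF-8 bit patterns mentioned above, here is a minimal Python sketch. The function names `utf16_surrogates` and `utf8_bytes` are made up for this example; in practice Python's built-in `str.encode` does this work for you.

```python
def utf16_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


def utf8_bytes(code_point: int) -> bytes:
    """Encode one code point with the 1- to 4-byte UTF-8 bit patterns."""
    if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if code_point < 0x80:                # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:               # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:             # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


high, low = utf16_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")           # D83D DE00
print(utf8_bytes(0x1F600).hex(" "))      # f0 9f 98 80
```

The surrogate ranges D800 to DFFF are reserved for this purpose, which is why they are not available as ordinary code points.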
When saving UTF-16 or UTF-32 text files, it is possible to specify the byte order, which can be either big-endian or little-endian. The need for a byte order mark (BOM) in a UTF-16 text file is demonstrated by examining its encoding as hexadecimal data. The so-called UTF-8 with BOM format is also discussed.
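To see the byte order issue for yourself, the following Python sketch prints the same character in both byte orders and then guesses an encoding from a leading BOM. The helper name `sniff_bom` is hypothetical, and this is only a simplified check: it tests the UTF-32 marks first because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# The letter "A" (U+0041) in both UTF-16 byte orders, without a BOM.
print("A".encode("utf-16-be").hex(" "))   # 00 41
print("A".encode("utf-16-le").hex(" "))   # 41 00

# Known BOMs, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),   # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),   # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8-sig"),       # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),   # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),   # FF FE
]


def sniff_bom(data: bytes):
    """Return an encoding name guessed from a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


sample = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
print(sample.hex(" "))                    # ff fe 41 00
print(sniff_bom(sample))                  # utf-16-le

# Python's "utf-8-sig" codec writes the UTF-8 with BOM format.
print("A".encode("utf-8-sig").hex(" "))   # ef bb bf 41
```

Without the BOM, the bytes 41 00 could be read as U+0041 or as U+4100, depending on which byte order the reader assumes.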
In this computer science video, you will also see why it is important to include a charset meta tag in the head section of a web page to specify the character encoding. Problems that might occur if, for example, a web page has been encoded with ISO-8859-1 or Windows-1252 are demonstrated.
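The kind of problem a missing or wrong charset declaration can cause is easy to reproduce. The Python sketch below uses an arbitrary sample word and simulates what happens when bytes saved in one encoding are read as another; it is an illustration, not a model of how any particular browser picks an encoding. In HTML, the usual declaration is `<meta charset="utf-8">` placed early in the head section.

```python
original = "café"

# Saved as UTF-8, but read as if it were Windows-1252.
utf8_data = original.encode("utf-8")      # é becomes the two bytes C3 A9
misread = utf8_data.decode("cp1252")
print(misread)                            # cafÃ©

# In this case the damage is reversible by undoing the wrong decode.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired)                           # café

# Saved as Windows-1252, but read as UTF-8: the lone byte E9 is not
# valid UTF-8, so a lenient decoder substitutes U+FFFD.
cp1252_data = original.encode("cp1252")   # é becomes the single byte E9
lenient = cp1252_data.decode("utf-8", errors="replace")
print(lenient)
```

The second case loses information, because the replacement character no longer records which byte was there.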
Chapters:
00:00 Introduction
00:59 Unicode code points
01:45 ASCII
02:24 Universal Character Set UCS-2
03:50 Unicode Transformation Format UTF-16
09:26 Unicode Transformation Format UTF-32
10:48 Unicode Transformation Format UTF-8
14:03 Byte Order
15:07 Byte Order Mark BOM Demonstration
19:47 UTF-8 with BOM Demonstration
20:54 Web page character encoding
24:04 Summary