The Unicode Consortium (Unicode, Inc.) takes the honor of being selected as the 2019 winner of the Annual Bruessard Award. This not-for-profit organization was selected in recognition of its tireless effort to globally simplify and standardize written human communications. Its public arm, Unicode.org, devised a uniform and universal standard for identifying characters, including an electronic transmission encoding scheme. The Unicode Consortium accomplished this feat by devising a system known as Unicode Transformation Format, or UTF, encodings. The Unicode Consortium has assigned (and is poised to assign) a unique identifier to every written character, number, and symbol used by humans in the past, present, and conceivably in the future. Through its work, the Unicode Consortium is making all of the world's characters, numbers, and symbols amenable to electronic entry in a standardized format. Further, by making a one-to-one linkage between the characters, numbers, and symbols of different languages, it becomes possible to more easily and more consistently translate, convert, and swap documents created in different languages. More importantly, the Unicode Consortium has made it possible for humans across the globe to more easily consume the human stock of knowledge in their native language or tongue. In effect, the Unicode Consortium is further fostering the knowledge-for-all phenomenon made possible by the emergence of the World Wide Web.
Unicode is language. Unicode speaks the world's languages.
Check out this bonus multilingual application, which demonstrates Unicode / UTF in action. Because this bonus multilingual application is being presented on the World Wide Web, it has been restricted to the world's most commonly spoken languages (in terms of millions of speakers), including the 6 official languages of the United Nations, namely, English, Arabic, Russian, Chinese, French, and Spanish.
Written human communications began humbly enough. Written human communications began with characters or letters. In turn, letters formed words, and words formed sentences. And, of course, sentences were the building blocks of human communications. Though a complicated human feat, written communications is as simple as that.
What is the origin of human communications? How did humans come to speak many languages? There are two prevailing perspectives. The first one is the religious perspective. The second one is the scientific perspective.
Take the religious perspective, in general, and the Christian perspective, in particular. According to the Christian religious faith and its sacred Holy Bible text, in the beginning, humans only spoke one language. In its explanation of how humans came to speak many languages, the Holy Bible recounts the Tower of Babel story. According to the Tower of Babel story, in effect, human arrogance and egoism greatly disillusioned God. Consequently, these undesirable human traits led God to cause a rift between humans, principally, by making humans speak many languages. The following video provides a simple recounting of the Tower of Babel story.
Now, consider the scientific perspective. Scientists, generally speaking, think that spoken human languages originated as far back as 100,000 years ago, or around the same time humans first appeared on Earth (see, for instance, the article titled Language on Wikipedia.org). Contrary to Christian Biblical dogma, science contends that humans have never really spoken one, universal language. The following images provide a general synopsis of the scientific perspective on the evolution of spoken and written human languages.
To be sure, according to Wikipedia.org, Cuneiform is perhaps the earliest known human writing system. Cuneiform was created by the Sumerian people around 3200 BC (or before the appearance of Jesus Christ on Earth) in southern Mesopotamia or what is now recognized as southern Iraq.
Now, fast forward from 100,000 years ago to the 20th century (or, more specifically, fast forward to the late 1900s). In 1969, the USA successfully tested ARPANET, the precursor of the Internet. In 1977, USA microcomputer makers such as Commodore, Apple, and Tandy began introducing personal computers to the world. In 1981, USA computer maker IBM popularized the use of personal computers. In 1989, the World Wide Web (WWW) was created by Sir Tim Berners-Lee, and it was implemented in 1990. The Internet in conjunction with the personal computer and the World Wide Web later would combine to revolutionize human communications and interactions on a global scale by merging and unifying all kinds of technologies (such as radio, television, telephone, electronic mail, text chats, webcam chats, teleconference chats, file sharing, document collaboration, and so forth) under a single, unified umbrella known as cyberspace.
What is the common denominator here? The common denominator is this: The Internet, personal computers, and the World Wide Web, by far and predominantly, had English-speaking origins. As a result, logically, the Internet, personal computers, and the World Wide Web, initially, were launched with English-speaking users in mind. Concomitantly, most of the early websites and computer operating systems appeared in English. It did not take long, however, before the personal computer and World Wide Web became global phenomena. Non-English computer character encoding systems and websites began appearing in countries all across the globe. The World Wide Web, suddenly, had become multilingual. However, there was one big problem, and it was this: each country, more or less, had its own unique way of displaying information and implementing websites in each country's respective language. To adapt computers and websites to languages other than English or the Latin alphabets, various encoding schemes were created. The outcome of this development posed a problem because documents and websites created in one language or country could not always easily be converted, viewed, and consumed on a computer that used a different language. Words and sentences got lost in the conversion or translation. The source of the problem was the different character encoding schemes being deployed in different countries across the globe.
How was this problem of multiple language encodings to be resolved? To address this encoding problem, in 1991, the Unicode Consortium made its debut. Unicode facilitated the emergence of truly multilingual software applications and a multilingual World Wide Web. The development of Unicode had the effect of making it possible for the world's population to tap into and benefit from the gigantic stock of human knowledge via the use of personal computers and the World Wide Web. Unicode.org brought order to this encoding chaos by devising a universal alphabet system and an encoding scheme to match. In this scheme, each character, number, and symbol, regardless of language, was assigned a unique hexadecimal identifier.
To summarize, computers can only understand and manipulate zeroes and ones, which is known as binary code. Inside computers, the number 1 represents the "on" signal, and the number 0 represents the "off" signal. Each of the two digits (0 and 1) represents 1 bit. Computers transport, store, and manipulate data in chunks of bytes, whereby 8 bits (a series of eight zeroes and ones) combine to form a 1-byte unit. Encoding is a process of taking the numbers, characters, and symbols used in everyday life and transforming them into their binary equivalents that computers can understand and manipulate. And, as explained by Chris Hager, "Unicode uses 16 bits (2 bytes) per code-point and furthermore associates each code-point with one of 17 planes. Therefore Unicode provides 2^16 = 65,536 unique code-points per plane, with 2^16 * 17 = 1,114,112 maximum total unique code-points." In other words, in its original inception, Unicode was conceived as a 16-bit encoding system.
To further conceptualize this discussion, the following graphic illustrates how the word Wikipedia gets translated into its equivalent binary code as read and understood by computers.
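For readers who prefer to try this themselves, the translation of the word Wikipedia into binary can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original graphic; the function name `to_binary` is made up for this example, and it assumes the text is encoded as UTF-8 (for plain English letters this yields the same bytes as ASCII).

```python
def to_binary(text: str) -> str:
    """Return the UTF-8 bytes of `text` as space-separated groups of 8 bits."""
    # str.encode turns characters into bytes; each byte is then
    # formatted as an 8-digit binary number (zero-padded on the left).
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))


if __name__ == "__main__":
    # One 8-bit group is printed per letter of the word.
    print(to_binary("Wikipedia"))
```

Each letter of Wikipedia occupies exactly one byte here, so the output contains nine groups of eight bits, the first being 01010111 for the capital W.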
Again, Unicode originally was envisioned as constituting a 16-bit binary codespace, which would have made it possible to encode 65,536 characters (2^16). It soon became apparent that 65,536 characters or code points would not be enough to capture the world's languages. So, Unicode.org extended its codespace by an additional 1,048,576 characters or code points (2^20). The original 65,536 code points along with the additional 1,048,576 code points combined to expand Unicode's codespace to its current capacity of 1,114,112 code points (65,536 + 1,048,576 = 1,114,112). These 1,114,112 code points are laid out to span 17 planes, with each plane containing 65,536 code points.
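The codespace arithmetic above can be checked with a short Python sketch. The constant names and the helper `plane_of` are hypothetical and chosen only for this illustration; the sketch assumes a Python 3 interpreter, where `sys.maxunicode` reports the highest code point the language supports.

```python
import sys

PLANE_SIZE = 2 ** 16      # 65,536 code points in each plane
SUPPLEMENTARY = 2 ** 20   # 1,048,576 code points added beyond Plane 0
TOTAL = PLANE_SIZE + SUPPLEMENTARY
PLANES = TOTAL // PLANE_SIZE


def plane_of(char: str) -> int:
    """Return the plane number (0 to 16) that holds a single character."""
    # Dropping the low 16 bits of the code point leaves the plane number.
    return ord(char) >> 16


if __name__ == "__main__":
    print(TOTAL)                     # total code points in the codespace
    print(PLANES)                    # number of planes
    print(sys.maxunicode + 1)        # same total, as reported by Python
    print(plane_of("A"))             # a Latin letter lives in Plane 0
    print(plane_of("\U0001F600"))    # an emoji lives outside Plane 0
```

The letter A falls in Plane 0, while the grinning-face emoji (U+1F600) falls in Plane 1, which is consistent with emojis consuming codespace beyond the original 65,536 code points.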
There are several ways to view how Unicode is structured or laid out. At a high level, Unicode's layout variously consists of planes, blocks, scripts, charts, and characters. Most of the attention, typically, is focused on Unicode's original Plane 0, where its initial 65,536 code points reside. Emojis are becoming increasingly popular. As emojis expand, they are expected to consume a greater amount of Unicode's overall codespace. The following table and graphics illustrate Unicode's structure.
NOTE: Significant portions of the above BMP datatable were taken from EntryLevelProgrammer.com.
The Unicode way of encoding does get a bit complicated. One of the complications arises from the fact that the American alphabet and numbers (known as the ASCII standard) can be fully represented with just 128 code points. Therefore, for USA computer makers using English as the base language, an 8-bit (2^8 = 256 code points) character encoding scheme was more than sufficient for transmitting data to English-oriented computers for processing (with an extra 128 code points to spare). This 8-bit English orientation of computers initially gave rise to an 8-bit character encoding scheme.
Unicode's rendition of an 8-bit encoding scheme is known as UTF-8. The complication relates to Unicode's unique expansion of UTF-8 to stretch beyond the range of 256 code points. The expansion enabled data to be transmitted in 1, 2, 3, or 4-byte packets to incorporate non-English characters such as the characters used by, say, the Japanese, Chinese, Korean, Arabic, Hindi, and so forth, languages. Some refer to this expansion of UTF-8 as the UTF-8 hack. Adding to Unicode's complexity was its further adoption of other encoding schemes such as its 16-bit and 32-bit methods of encoding known as UTF-16 and UTF-32, not to mention other obscure challenges such as correctly searching and sorting different Unicode characters. The following graphic illustrates the UTF-8 encoding strategy for capturing all code points.
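The variable-length behavior described above can be observed directly. The following Python sketch is only an illustration under stated assumptions: the helper names `encoded_lengths` and `utf8_hex` are invented for this example, and the little-endian variants of UTF-16 and UTF-32 are used so that no byte-order mark is counted in the lengths.

```python
def encoded_lengths(char: str) -> dict:
    """Return how many bytes one character needs in each UTF encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(char.encode(name)) for name in encodings}


def utf8_hex(char: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in char.encode("utf-8"))


if __name__ == "__main__":
    samples = {
        "Latin A": "A",
        "e with acute accent": "\u00e9",
        "Chinese character zhong": "\u4e2d",
        "grinning face emoji": "\U0001F600",
    }
    for label, char in samples.items():
        print(label, utf8_hex(char), encoded_lengths(char))
```

In this sample, the English letter takes 1 byte in UTF-8, the accented letter 2 bytes, the Chinese character 3 bytes, and the emoji 4 bytes, whereas UTF-32 always spends 4 bytes per character.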
To confuse the situation further, Unicode adopted the hexadecimal format (generally understood by mathematicians) to represent its code points rather than the decimal format understood by less mathematically inclined humans or the binary format understood by computers. One reason for selecting the hexadecimal format to represent code points was that hexadecimal is a base 16 number system, in which each hexadecimal digit stands for exactly 4 bits. Unicode initially was conceived to be a 16-bit encoding system, so any of its code points could be written with exactly four hexadecimal digits. This "interchangeable" factor of 16-bit Unicode with base 16 hexadecimal played a role in the selection of hexadecimal as a code point representational format.
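To make the relationship between the three notations concrete, here is a small Python sketch that shows one code point in hexadecimal, decimal, and binary. The function name `describe` is hypothetical and used only for this illustration; the `U+` prefix with at least four hexadecimal digits follows the customary way of writing code points.

```python
def describe(char: str) -> tuple:
    """Return a character's code point as (U+hex label, decimal, binary)."""
    code_point = ord(char)
    # Four hex digits cover 16 bits, since each hex digit stands for 4 bits.
    return f"U+{code_point:04X}", code_point, f"{code_point:b}"


if __name__ == "__main__":
    for char in ("A", "\u20ac", "\uffff"):
        label, decimal, binary = describe(char)
        print(label, decimal, binary)
```

The letter A is U+0041, or 65 in decimal. The last code point of Plane 0, U+FFFF, is 65,535 in decimal and sixteen ones in binary, which shows how four hexadecimal digits line up with a 16-bit value.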
I do not purport to be any type of an authority or subject-matter expert on the technical aspects of the inner workings of Unicode and computers. Therefore, I will leave it to the following videos to explain how Unicode, encoding, and computers work in a little more depth.
Much like sign language exists as a form of communications for the deaf and Braille exists as a form of communications for the blind, Unicode exists as another form of communications for the computer. Suffice it to say that Unicode.org was not the first or only organization or corporation to attempt to devise a universal alphabet. For instance, the TRON Project preceded the Unicode Consortium in this multilingual endeavor. The Unicode encoding standard emerged as the popularly accepted universal encoding standard mainly because large, influential, multinational American computer corporations (such as Xerox, Apple, Sun, Microsoft, IBM etc.) supported and contributed to the Unicode encoding standard. The Unicode encoding standard emerged as the popularly accepted standard because prominent USA hardware and software makers were some of the early adopters of the Unicode standard as they propelled the computer industry forward through countless innovations.
The following slide show provides a brief overview of some leading events in contemporary human communications leading to the debut of Unicode. Also, the next two videos pay tribute to Unicode.
Unicode's Adopt a Character
According to W3Techs, when it comes to website encoding, as of 2019, Unicode comprised a whopping 95% market share of all character encoding schemes in use on the World Wide Web. The next graphic also illustrates Unicode's UTF-8 growth trend in terms of its use on the World Wide Web.
What is the biggest takeaway from this tribute to Unicode.org? The biggest takeaway resides in the fact that Unicode.org attempts to bridge the human divide through facilitating global human communications. The Unicode endeavor neatly fits into a broader human endeavor of cooperation and unity rather than one of human bickering and division. There is so much confusion, disagreement, misunderstanding, and ignorance within the human family. Humans must find a way to overcome their multifarious existential challenges to both human civilization and life on Earth. A starting point would be for humans genuinely to show courtesy and respect for one another despite their multifarious differences. Thanks to the creation of the World Wide Web and the emergence of organizations such as Unicode.org, perhaps a glimmer of light flickers brightly at the end of the tunnel.
Moving beyond multilingual Unicode, it would be remiss of me to not take this opportunity to revisit the big picture. Humans now have moved into the 21st century. A new millennium has begun. Contemporary humans are presented with an array of challenges and disputes to overcome, ranging from natural disasters, climate change, poverty, disease, migration, trafficking, substance abuse, wars, the prospect of nuclear warfare, torture, violence, gunplay, hatred, etc. Given the billions of humans on Earth and the racial, religious, political, cultural, and so forth, differences to divide them, how will humans ever accomplish the gargantuan task of getting on the "same page" of communications? How will humans ever accomplish the gargantuan task of understanding one another and fostering an enduring life of harmony and prosperity on Earth?
Another challenge for 21st century humans involves taking it a step farther than the existence of a universal alphabet. Already humans have fostered a somewhat universal measurement system known as the metric system. The next step involves the creation of a universal language, thus, completing the Tower of Babel language circle, so to speak. With the advent of new technologies such as artificial intelligence and deep learning, computer companies such as the following ones are making giant strides at completing the language circle:
These machine translation companies presently can take any language or many languages and instantaneously convert them into another language. Conceivably, a common human language should lead to increased human understanding. On the one hand, it is widely held that the World Wide Web possesses the potential for being the great human unifier. It is thought that, by giving all humans access to this global informational network, humans would proceed to freely exchange ideas for the betterment of humanity. In reality, there is a lot of good stuff and equally a lot of bad stuff on the World Wide Web. Although originally conceived to be a great social unifier, the social media phenomenon has exposed both the good and bad aspects of human nature. Social media, at times, can be more polarizing than harmonizing. Rather than leading to greater cooperation and unity among humans, some social media activities, in some instances, appear to elicit some of the worst aspects of human behavior, leading to deeper human divisions and schisms.
Here's to unity and peace on Earth:
The question becomes this: Now that you have arrived into being, how will you choose to make use of the privilege to participate in Earth's grandiose miracle of life? Hopefully, you will choose to live a constructive, productive, and positive span of life. For, as the saying goes, time waits for no one. Always remember that it is never too late to turn it into something good no matter how old or young you happen to be.
Who will be next? The next Annual Bruessard Award winner will be announced on 1-December-2020. Stay tuned.