How to Create a Shape-Based Input Method

Statement: This article is not a software development tutorial and will not teach you how to develop input software like "Some Dog Input Method" or "Some Baidu Input Method." It is about designing an encoding scheme and is meant as a source of ideas. Input method experts are welcome to provide guidance.

If you have never encountered a shape input method, you may first read these two articles by Beiqiao: After Trying Seven Shape Code Input Methods, I Want to Talk About Using Wubi in 2022 and To Type More Smoothly, I Learned a Niche Input Method Pursuing Extreme Performance. The introduction of this article also provides a brief overview; if you still have questions, feel free to leave a message, and I will add more information.

About me: I use Starry Sky Keyboard for chatting and typing, and I use Sanxu (a modified version of Xu Code for personal use) for formal typing, with a typing speed of 60 characters per minute. I have previously dabbled in Xiaohe Sound Shape, Cangjie Fifth Generation, and Tiger Code, but abandoned them due to different needs.

Introduction#

Definition of Chinese Character Input Method#

It is actually quite simple: a function that takes an encoding as input and outputs a set of Chinese characters is a Chinese character input method.

Let $Z$ be a set of $n$ Chinese characters, $C$ a set of $l$ subsets of $Z$ with $c_i$ its $i$-th element, and $M$ a set of $l$ encodings with $m_i$ its $i$-th element. Then the mapping $f: M \rightarrow C$ with $f(m_i) = c_i$ is the input method.
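As a rough illustration of this definition, the mapping $f$ can be sketched as a plain dictionary from codes to candidate lists. The codes below are made up for the example and do not belong to any real scheme; `CODE_TABLE` and `lookup` are hypothetical names.

```python
from typing import Dict, List

# Hypothetical toy code table: each encoding m_i maps to its candidate set c_i.
# These codes are invented for illustration only.
CODE_TABLE: Dict[str, List[str]] = {
    "aa": ["工"],
    "ab": ["叶", "音"],  # two characters sharing one code
}


def lookup(code: str) -> List[str]:
    """The input method f: M -> C; an unknown code yields no candidates."""
    return CODE_TABLE.get(code, [])
```

A code that maps to more than one character, like "ab" here, is exactly the situation that the later performance indicators measure.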

Nowadays, most people in simplified character regions use Pinyin input methods, which take Pinyin as input and output Chinese characters. A small number of people use Wubi input methods, which take codes of no more than four letters from A to Y.

In traditional Chinese character regions, there are also the Cangjie input method, which takes codes of no more than five letters, and the Zhuyin input method, which takes codes that use most of the keys on the keyboard.

In fact, typing using Unicode encoding can also be considered a type of input method.

Shape Input Method vs. Sound Input Method#

The Pinyin and Zhuyin mentioned above belong to sound input methods (hereinafter referred to as sound codes) that use the pronunciation of Chinese characters for encoding; Wubi and Cangjie belong to shape input methods (hereinafter referred to as shape codes) that use the shapes of Chinese characters for encoding; Xiaohe Sound Shape uses sound first and then shape for encoding, so it belongs to sound-shape codes.

Today, although sound codes equipped with large dictionaries can meet the typing needs of daily communication, shape codes still hold value due to their low redundancy (few characters share the same code) and their independence from pronunciation, so enthusiasts continue to develop new shape codes.

Roots and Splitting#

Roots#

Roots are the basic shape units for splitting characters in shape codes. They are similar to the concepts of radicals and components learned in elementary school, but the range of roots is often broader than that of radicals and components.

For example, the 86 version of the Wubi input method has a total of 234 roots, among which "王," "土," and "日" are familiar radicals, while the upper part of "炙" and the middle part of "互" may seem a bit unfamiliar.

86 Version Wubi Roots Diagram

In some shape codes, Chinese characters with complex structures may also be set as roots to reduce the user's burden of splitting characters. For example, in Xu Code, "爾," "鹵," and "黽" are all roots.

Splitting Characters#

Splitting characters means using the established roots to decompose Chinese characters. This is actually something we experienced in elementary school—when learning new characters, the language teacher would teach us that they are composed of previously learned characters.

How to split structurally simple characters should not be controversial: "叶" naturally splits into "口" and "十," and "音" naturally splits into "立" and "日." However, if the structure of a character is slightly more complex, it becomes tricky: should "戊" be split as "戈丿" or as "厂㇂丿丶"? At this point, we need to introduce rules that limit the splitting methods to prevent a single Chinese character from having multiple splits.

Taking the 86 Wubi as an example, its rules are, in priority order: considering intuitiveness, following stroke order, taking larger roots first, connecting rather than intersecting, and dispersing rather than connecting.

Starting with taking larger roots first and following stroke order:

"世" can be split as "一凵𠃊" or "廿𠃊." Because the first root of the latter is larger than that of the former, we split it as "廿𠃊."
"夷" can be split as "一弓人" or "大弓," but since "大弓" does not follow stroke order, and following stroke order takes priority over taking larger roots, we split it as "一弓人."

"Dispersing rather than connecting" and "connecting rather than intersecting" mean that splits where roots are not connected are preferred over those where roots are connected, and those where roots are connected are preferred over those where roots intersect. Some input methods also introduce a further rule of intersecting rather than breaking, meaning that splits where roots intersect are preferred over those where a single stroke is broken. These three rules are collectively referred to as dispersing, connecting, intersecting, and breaking.

Considering intuitiveness is somewhat difficult to understand. It has become a catch-all rule in Wubi, under which the author decides which splits count as intuitive. For example, "或" can be split as "戈口一" without following stroke order, while "戊" must be split as "厂㇂丿丶" to follow stroke order. In the 86 Wubi, such issues are numerous: why does "里" split as "日土" instead of "甲二"? Why does "匹" split as "匚儿" instead of "兀𠃊"?

Such vague rules are absolutely unacceptable; it is now recommended to use clearer and more rational rules. For example, the splitting rule for "匹" can be explained by introducing complete structure (roots with full or half-enclosing structures like "囗日目勹冂匚コ凵" should not be split apart). Additionally, the rule of original shape roots (if a root's vertical stroke becomes a left-falling stroke when used as a radical, or if a horizontal stroke becomes a rising stroke, it is considered a lower-priority variant root) can also explain some splitting ambiguities.
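Once rules are explicit, choosing a split becomes a mechanical comparison. The following is a minimal sketch of the two rules illustrated by "世" and "夷" earlier in this section. It assumes that "larger" is measured by stroke count and that whether a split follows stroke order is already known; `Split`, `priority`, and `choose` are hypothetical names, not part of any existing tool.

```python
from typing import List, NamedTuple, Tuple


class Split(NamedTuple):
    roots: Tuple[str, ...]
    follows_stroke_order: bool


# Stroke counts of the roots used in the two examples.
STROKES = {"一": 1, "凵": 2, "𠃊": 1, "廿": 4, "弓": 3, "人": 2, "大": 3}


def priority(split: Split):
    """Sort key: stroke order outranks root size (assumed: size = stroke count)."""
    # Negated sizes so that larger earlier roots compare as smaller keys.
    sizes = tuple(-STROKES[root] for root in split.roots)
    return (not split.follows_stroke_order, sizes)


def choose(candidates: List[Split]) -> Split:
    """Pick the candidate split with the best (smallest) priority key."""
    return min(candidates, key=priority)
```

The remaining rules (dispersing, connecting, intersecting, breaking) could be added as further tuple fields in the same key, in whatever priority order the scheme defines.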

However, splitting rules should not be overly detailed; for instance, Zhang Code complicates rules to reduce redundancy, which burdens users; Cangjie’s splitting method is also unique in the input method community. Input method authors should establish the most easily understood rules while ensuring the uniqueness of splits to reduce the cognitive burden on users.

Cangjie Splitting Overview, excerpted from the Cangjie textbook on Wikipedia

After splitting the commonly used 3500 characters (the first level of the General Standard Chinese Character List), we obtain a usable splitting table.

Adding and Removing Roots#

Mr. Wang Yongmin, the author of Wubi, once said: "Generally speaking, a shape scheme using 26 keys should select 150 to 250 roots." Most new input methods now fall within this range after merging roots, in accordance with statistical principles.

In four-code schemes, the roots involved in encoding a single character are usually only the first three and the last one, meaning the intermediate roots are practically useless. If a root never takes part in encoding, it can of course be deleted directly. If multiple characters share the same first three and last roots, such as "赢," "嬴," and "贏," we need to consider adding roots so that their selected roots differ, which eliminates the redundancy at the splitting stage.
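A minimal sketch of this check is shown below. The splits for the three characters are illustrative assumptions (a real scheme may decompose them differently), and `selected_roots` and `find_collisions` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def selected_roots(roots: List[str]) -> Tuple[str, ...]:
    """Keep the first three roots and the last one, as in four-code schemes."""
    if len(roots) <= 4:
        return tuple(roots)
    return tuple(roots[:3] + roots[-1:])


def find_collisions(splits: Dict[str, List[str]]) -> Dict[Tuple[str, ...], List[str]]:
    """Group characters whose selected roots are identical."""
    groups = defaultdict(list)
    for char, roots in splits.items():
        groups[selected_roots(roots)].append(char)
    return {key: chars for key, chars in groups.items() if len(chars) > 1}


# Illustrative splits only; they differ in the fourth root, which is dropped.
SPLITS = {
    "赢": ["亡", "口", "月", "贝", "凡"],
    "嬴": ["亡", "口", "月", "女", "凡"],
    "贏": ["亡", "口", "月", "貝", "凡"],
}
```

Running `find_collisions(SPLITS)` groups all three characters under the same selected roots, which is the signal that a new, larger root is worth considering.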

If you want to learn more about the theoretical basis for quantitative analysis during the splitting phase, you can refer to Lan Luoxiao’s article Quantitative Evaluation of Splitting Tables.

Here’s also a shoutout for Lan Luoxiao: If you are interested in creating shape codes and still don’t know where to start, please consider using the Automatic Chinese Character Splitting System developed by Lan Luoxiao (please ignore the registration feature, just select an example to enter the main interface), which has no barriers to UI operation.

Automatic Chinese Character Splitting System

Encoding Methods#

Root Encoding#

Encoding roots is a prerequisite for encoding single characters. Root encoding generally follows certain rules; for example, Wubi uses the first two strokes to partition on the QWERTY keyboard, Zheng Code uses the first two strokes corresponding to letter order, Cangjie uses "日月金木水火土" corresponding to letter order, and shape codes use the shapes of roots corresponding to letters. These methods of systematically encoding roots based on shape characteristics are called shape support. Just as there are sound codes and shape codes, root encoding also has sound support and shape support, with Xiaohe Sound Shape using sound support.

Some input methods, for better performance metrics, use completely unordered (hereinafter referred to as random) methods to encode roots, such as the Sapphire input method.

There are various ways to encode roots, listed directly below:

- Single encoding, represented by Wubi, where each root is represented by one letter. For example, the roots "十" and "二" are both encoded as F, and the roots "五" and "一" are both encoded as G.
- Double encoding, where most roots are represented by two letters, referred to as the large code and the small code.
  - Small code shape support, represented by Zheng Code, where the small code relates to the shape of the Chinese character. For example, the small code for "耳" is E because its structure contains "十."
  - Small code sound support, where the small code relates to the pronunciation of the Chinese character. For example, in Xu Code, the small code for "自" is Z because its Pinyin is zi. Using sound support can reduce the memory load, making the memory difficulty of double-encoded roots approach that of single encoding.
- Three encodings and above. In fact, the length of root encoding can be arbitrary, but if users cannot even remember the encoding of roots, there is no point in discussing the subsequent single character encoding.

Single Character Encoding#

First, we need to determine how many roots participate in encoding a character. This article adopts the method of mainstream shape codes, which is the previously mentioned first three and last one. What if there are fewer than four roots? We need to use other shape characteristics of the character to supplement the encoding: in Wubi, these are the strokes of the roots or the structure of the character, while in Zheng Code, it is the small code of the roots. Of course, we can also use the pronunciation of the character to supplement the encoding, but that would turn it into a shape-sound code.

For example, for pairs like "呗员" and "吧邑," whose characters have identical roots, we need structural codes to differentiate them, using B for top-bottom structure and N for left-right structure. However, introducing structural codes creates new memory points for users. This is where double-encoded roots come into play: appending the small code of the first root to the encoding of the less frequently used character separates the two encodings.

Assume a Chinese character can be split into several roots, each with an encoding. The root encodings are labeled in order with uppercase Latin letters ABCD...WXYZ, where Y and Z specifically denote the second-to-last and last roots. The strokes of a root are labeled with lowercase Greek letters αβ...ω, where ω specifically denotes the last stroke. Ω represents the structure encoding of the character.

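The encoding table for Wubi is as follows:

| Character type | Case | Encoding |
| --- | --- | --- |
| Single root character | Representative root | AAAA |
| Single root character | Non-representative root | Aαβω |
| Multiple root character | Two roots | AZΩ |
| Multiple root character | Three roots | ABZΩ |
| Multiple root character | Four roots and above | ABCZ |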
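The multiple-root rules listed above can be sketched as a small function. This is only an illustration of the AZΩ / ABZΩ / ABCZ pattern; `encode_multi_root` is a hypothetical name, and the single-root rules are left out because they need stroke data.

```python
from typing import List


def encode_multi_root(root_codes: List[str], structure_code: str) -> str:
    """Encode a character that has two or more roots.

    root_codes: one letter per root, in splitting order.
    structure_code: the letter for Ω, used only when there are fewer than four roots.
    """
    count = len(root_codes)
    if count < 2:
        raise ValueError("single root characters follow separate rules")
    if count == 2:
        return root_codes[0] + root_codes[1] + structure_code  # A Z Ω
    if count == 3:
        return "".join(root_codes) + structure_code  # A B Z Ω
    return "".join(root_codes[:3]) + root_codes[-1]  # A B C Z
```

As a usage example, reading the Wubi code rqyy for "的" as three root letters r, q, y followed by Ω = y (this reading is my assumption), `encode_multi_root(["r", "q", "y"], "y")` reproduces "rqyy".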

If you want to learn more about encoding methods for input methods, you can refer to Zhu Yuhao's article Common Encoding Rules for Shape Code Input Methods.

Word Encoding (Optional)#

If you don’t want to type only single characters, word encoding is essential. Fortunately, there is currently a recognized method for word encoding in the input method community, so you don’t need to think of a new method. Just as roots form single characters, single characters form words, so the method for encoding words is similar to that for single characters:

Each character has an encoding. The first codes of the characters are labeled in order with uppercase Latin letters ABCD...WXYZ, where Y and Z specifically denote the second-to-last and last characters. The second code of each character is labeled with the corresponding lowercase Latin letter abcd...wxyz.

The word encoding table for shape codes is as follows:

| Word length | Encoding |
| --- | --- |
| Two-character words | AaBb |
| Three-character words | ABCc |
| Four characters and above | ABCZ |
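The word rules above translate directly into code. The sketch below assumes every character already has a full code of at least two letters; `encode_word` is a hypothetical name and the codes used in the example are placeholders.

```python
from typing import List


def encode_word(char_codes: List[str]) -> str:
    """Build a word code from the full codes of its characters."""
    count = len(char_codes)
    if count < 2:
        raise ValueError("a word needs at least two characters")
    if count == 2:
        first, second = char_codes
        return first[:2] + second[:2]  # AaBb
    if count == 3:
        first, second, third = char_codes
        return first[0] + second[0] + third[:2]  # ABCc
    # Four characters and above: first codes of the first three and the last.
    head = "".join(code[0] for code in char_codes[:3])
    return head + char_codes[-1][0]  # ABCZ
```

For instance, with placeholder character codes "abcd" and "efgh", the two-character word code is "abef".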

Performance Optimization#

Performance Indicators#

There is also a saying circulating in the community from Mr. Wang Yongmin, the author of Wubi: "A high-level 'shape code' scheme must possess the three characteristics of compatibility, regularity, and harmony." Compatibility means low redundancy; regularity means easy to learn; harmony means good feel—these three cannot be achieved simultaneously. Often, input methods sacrifice one advantage to gain another; for example, the currently popular random input methods (Tiger Code, Sapphire, Yima) have sacrificed regularity to enhance the other two, while the Xu Code I use has largely sacrificed harmony to pursue compatibility with a large character set.

Static redundancy count: Traverse all encodings and count the characters in every output set that contains at least two characters. It reflects compatibility.
Dynamic redundancy rate: Sort each output set of Chinese characters by character frequency, remove the first element, and sum the frequencies of the remaining elements. It reflects compatibility.
Average code length: The sum over all characters of encoding length multiplied by character frequency. Note that the encoding length of a non-preferred character is increased by one, so this indicator is positively correlated with the dynamic redundancy rate. The number of characters typed per minute = the number of keystrokes per minute / average code length.
Speed equivalent: The comfort of continuous keystrokes, derived from over two million experimental data points. For details, refer to the paper Research on Speed Equivalents Related to Keystrokes. It reflects harmony.
Alternating hand strike rate: Take all encodings that are typed with strictly alternating left and right hands and sum the frequencies of their characters within the character set. It reflects harmony.
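As a small illustration of the last indicator, the sketch below assumes the standard touch-typing split of a QWERTY keyboard (the left hand covers `qwertasdfgzxcvb`) and a code table with exactly one code per character; `alternates` and `alternating_rate` are hypothetical names.

```python
from typing import Dict

# Assumption: standard touch typing on QWERTY; every other letter is right-hand.
LEFT_HAND = set("qwertasdfgzxcvb")


def alternates(code: str) -> bool:
    """True if every pair of consecutive keys is struck by different hands."""
    hands = [key in LEFT_HAND for key in code]
    return all(a != b for a, b in zip(hands, hands[1:]))


def alternating_rate(code_of: Dict[str, str], freq: Dict[str, float]) -> float:
    """Sum the frequencies of characters whose code alternates hands throughout."""
    return sum(p for char, p in freq.items() if alternates(code_of[char]))
```

Note that a one-letter code counts as alternating in this sketch, since it contains no same-hand pair; a stricter definition might exclude it.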

Do you remember the mathematical definition of input methods in the introduction? We can derive the mathematical definitions of the above indicators from it.

Assume $p: Z \rightarrow [0, 1]$ maps each Chinese character to its single character frequency in a given corpus. Let $I = \{1, \dots, l\}$ index the encodings and $J_i = \{1, \dots, |c_i|\}$ index the characters of $c_i$. Sort each character set in $C$ by character frequency in descending order, so that $c_{ij}$ is the $j$-th Chinese character encoded as $m_i$, where $i \in I$, $j \in J_i$, and $a \leq b$ implies $p(c_{ia}) \geq p(c_{ib})$.

Static redundancy count: $N_{s} = \left| \{ c_{ia} : i \in I,\ a, b \in J_i,\ a \neq b \} \right|.$

Dynamic redundancy rate: $N_{d} = \sum\limits_{i \in I,\ j \in J_i \setminus \{1\}} p(c_{ij}).$
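These indicators are straightforward to compute from a code table and a frequency table. The sketch below is one possible reading, not a reference implementation: it takes the static redundancy count to be the number of characters that share a code with at least one other character, and it charges one extra keystroke to every non-preferred character. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_code(code_of: Dict[str, str], freq: Dict[str, float]) -> Dict[str, List[str]]:
    """Build C: for each code m_i, its characters sorted by descending frequency."""
    groups = defaultdict(list)
    for char, code in code_of.items():
        groups[code].append(char)
    return {code: sorted(chars, key=lambda c: -freq[c]) for code, chars in groups.items()}


def static_redundancy(groups: Dict[str, List[str]]) -> int:
    """N_s: characters that share their code with at least one other character."""
    return sum(len(chars) for chars in groups.values() if len(chars) > 1)


def dynamic_redundancy(groups: Dict[str, List[str]], freq: Dict[str, float]) -> float:
    """N_d: total frequency of every character that is not first for its code."""
    return sum(freq[c] for chars in groups.values() for c in chars[1:])


def average_code_length(groups: Dict[str, List[str]], freq: Dict[str, float]) -> float:
    """Frequency-weighted code length; non-preferred characters cost one more key."""
    total = 0.0
    for code, chars in groups.items():
        for rank, char in enumerate(chars):
            total += freq[char] * (len(code) + (1 if rank > 0 else 0))
    return total
```

With a toy table where "甲" and "乙" share the code "ab" (frequencies 0.5 and 0.25) and "丙" has the code "cd" (frequency 0.25), this gives a static count of 2, a dynamic rate of 0.25, and an average code length of 2.25.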

If you don’t understand how to calculate these performance indicators, you can also use the online Tiger Code Assessment Website.

Tiger Code Assessment Website

Simplified Code for Full Code#

The lengths of Wubi's single character encoding and word encoding are both greater than three because space was left for simplified codes when designing the encoding rules. Simplified codes are shorter encodings. For example, the Wubi encoding for "的" is rqyy, but in practice, you only need to type r and then press the space bar to output "的." Simplified codes that retain only the first code are called first-level simplified codes (abbreviated as one simplified), and similarly, there are two simplified, three simplified, etc.

Simplified codes are the simplest way to improve input efficiency. The total frequency of the 26 most commonly used characters is 0.26. If all single character encodings are four codes, a one simplified code takes two keystrokes (one letter plus the space bar) and therefore saves two, so one simplified codes alone reduce the average code length by about 0.5. The total frequency of the characters ranked 27th to 702nd is 0.57, and each two simplified code saves one keystroke, so one and two simplified codes together reduce the average code length by 1.09. Since typing speed = number of keystrokes / code length, ideally one and two simplified codes improve typing speed by roughly one-third.
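The arithmetic behind these figures can be checked in a few lines. The two frequencies come from the text; the assumption that a simplified code of k letters costs k + 1 keystrokes (letters plus the space bar) follows the "的" example, and the variable names are mine.

```python
FULL_LENGTH = 4      # keystrokes for a full four-code character
P_TOP_26 = 0.26      # total frequency of ranks 1 to 26 (from the text)
P_27_TO_702 = 0.57   # total frequency of ranks 27 to 702 (from the text)

# Assumption: a simplified code of k letters costs k + 1 keystrokes.
saving_level1 = P_TOP_26 * (FULL_LENGTH - 2)      # one letter + space
saving_level2 = P_27_TO_702 * (FULL_LENGTH - 3)   # two letters + space
total_saving = saving_level1 + saving_level2

new_length = FULL_LENGTH - total_saving
speedup = FULL_LENGTH / new_length                # same keystrokes per minute
```

Here `total_saving` comes out at 1.09 and `speedup` at about 1.37, which matches the "one-third" estimate. The 702 is simply 26 one-letter codes plus 26 × 26 two-letter codes.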

The benefits of using simplified codes go far beyond that. In contrast to simplified codes, the original encoding of a single character is called full code. Since the character with the highest frequency at that full code position has already been simplified, the character with the second-highest frequency can naturally take its place; this practice is called letting full code, meaning allowing the full code position of a character with an existing simplified code to be taken by another character. This way, we can eliminate existing redundancy; even if three simplified codes do not yield benefits in code length, they can still be used to reduce redundancy.

With so many benefits, what is the cost? Simplified codes are essentially a type of irrational code (a code that cannot be derived from the rules and must be memorized); the more you use, the heavier the user's memory burden becomes. If you can’t remember them, you will waste time looking at the candidate box, which lowers the number of keystrokes per minute and is counterproductive. It is recommended to set only one simplified codes and let users define two simplified codes themselves, or to use simplified codes without letting full codes, and to avoid setting three simplified codes.

Global Optimization#

Global optimization generally uses simulated annealing algorithms; for the program, please refer to Principles and Applications of Simulated Annealing Algorithms, Input Method Optimization, Character Frequency, Word Frequency Statistical Algorithm Source Code Sharing, Yuhao Input Method Development Technical Documentation, and this article will not elaborate further.

Since the principle of the annealing algorithm is based on probability, setting some constraints (such as roots that cannot be on the same key and roots that must be on the same key) can effectively reduce the arrangement of useless roots, thereby improving algorithm efficiency. When setting constraints, it is advisable to adopt the wisdom of predecessors. For example, before designing Sanxu, I was always curious why regular double-encoded shape codes must set two small codes under a large code as fixed irregular main roots. After I started optimizing, I found that if two main roots were not used, redundancy would inevitably increase.
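To make the idea concrete, here is a toy simulated annealing loop that assigns roots to keys while respecting "must not share a key" constraints. It is a sketch under strong simplifications, not the method used by any of the programs cited above: every root gets a single letter, a character's code is the concatenation of its root letters, and the cost is the number of colliding characters. The function names, the cooling schedule, and all parameters are assumptions for illustration.

```python
import math
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def collision_count(layout: Dict[str, str], splits: Dict[str, List[str]]) -> int:
    """Cost: number of characters that are not the first to claim their code."""
    codes = Counter("".join(layout[r] for r in roots) for roots in splits.values())
    return sum(n - 1 for n in codes.values())


def violates(layout: Dict[str, str], apart: Sequence[Tuple[str, str]]) -> bool:
    """True if any pair of roots that must be kept apart shares a key."""
    return any(layout[a] == layout[b] for a, b in apart)


def anneal(splits, keys, apart=(), steps=2000, t0=1.0, cooling=0.995, seed=0):
    rng = random.Random(seed)
    roots = sorted({r for rs in splits.values() for r in rs})
    # Start from a simple round-robin layout.
    layout = {r: keys[i % len(keys)] for i, r in enumerate(roots)}
    if violates(layout, apart):
        raise ValueError("starting layout breaks a constraint")
    cost = collision_count(layout, splits)
    best, best_cost = dict(layout), cost
    temperature = t0
    for _ in range(steps):
        root = rng.choice(roots)
        old_key = layout[root]
        layout[root] = rng.choice(keys)
        if violates(layout, apart):
            layout[root] = old_key  # constraint pruning: never evaluate this move
        else:
            new_cost = collision_count(layout, splits)
            delta = new_cost - cost
            # Always accept improvements; accept worse moves with a probability
            # that shrinks as the temperature falls.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = dict(layout), cost
            else:
                layout[root] = old_key
        temperature *= cooling
    return best, best_cost
```

A call such as `anneal(splits, "fj", apart=[("p", "q")])` returns the best layout found and its collision count. In a real scheme the cost would combine several of the performance indicators rather than collisions alone.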

Conclusion#

When your input method is complete, don’t forget to export the code table for enthusiasts to use: the format supported by most input platforms is one entry per line, with the character, a tab, and then the encoding, while a few input platforms reverse the order. If you are confident in your input method, you can post to the Wubi forum for promotion. Input methods rely on community support; with users, you can adjust based on feedback to make the input method even better.
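Exporting in that tab-separated format takes only a few lines. The sketch below writes to any text stream; `export_code_table` and its `code_first` flag are hypothetical names, and a real platform may require extra header lines that are not covered here.

```python
import io
from typing import Dict, TextIO


def export_code_table(code_of: Dict[str, str], out: TextIO, code_first: bool = False) -> None:
    """Write one entry per line: character, tab, code (or reversed if code_first)."""
    for char, code in code_of.items():
        left, right = (code, char) if code_first else (char, code)
        out.write(f"{left}\t{right}\n")


# Usage: write to an in-memory buffer; a file opened with encoding="utf-8" works the same.
buffer = io.StringIO()
export_code_table({"的": "rqyy"}, buffer)
```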

Of course, there is an old saying in cryptography: "Do not design your own encryption algorithm," and I believe the same applies to input methods. Before designing, it might be wise to check if there are existing input methods that meet your needs; using ready-made solutions is always simpler.

Casual Talk: The Philosophy of Choosing an Input Method#

Currently, Unicode includes about 100,000 Chinese characters, meaning that electronic devices can display 100,000 Chinese characters once fonts are installed, and this number is expected to continue to grow (over 4,000 new characters have been added in CJK Extension J). Not all of these characters have pronunciations that have been passed down to the present, and some have pronunciations that are generally unknown without consulting a dictionary. To type these characters, one must use shape codes. In summary, is a large character set important? Not really. Even if your name contains rare characters, you can simply add that character to the code table; there’s no need to look for an input method that can type the entire character set. Personally, as a Chinese character enthusiast, I find the process of splitting a large character set enjoyable, but it is not a necessity.

Is typing speed important? Of course, but consider whether you have the perseverance to practice. The barrel effect applies clearly to typing speed: the shortest stave sets the limit, and the performance of the input method is actually the longest stave. To improve typing speed, regardless of which input method you use, only long-term practice works. Don’t think that choosing the best-performing input method will solve everything. Speed experts did not become experts because they chose a better input method; rather, people who were already fast invented better input methods to improve further. If you can’t even reach a hundred characters per minute on non-watered-down text, how can you discuss input method performance?

Is the ability to type in both simplified and traditional characters important? Certainly not; it is a niche demand within a niche demand. The main issue is that OpenCC conversion has its shortcomings; some variant characters do not need conversion: for example, I believe the left-right structure "群" displays better on electronic devices than "羣."

In the end, is shape code important? Actually, it is not that important. The learning cost of double pinyin is low, and its reduction in code length compared with full pinyin takes effect immediately. Input methods exist to input text, and the knowledge carried by text is the ladder to human understanding.