Chinese characters: A built-in, quantitative matrix for organizing and locating information online and enabling inter-language, concept-based searches

This is my latest pet project (manuscript in preparation as of Fall 2008) in the "emergent intelligent networks/information organization field." It is a somewhat quirky but potentially very powerful interdisciplinary application of several esoteric fields and lines of investigation (the complex radical structure of Chinese characters, comparative linguistics, historical etymology, human cognitive mimicry, and so-called "innate linguistic symbology and algorithms") as the basis of a powerful, quantifiable algorithm for conducting concept-based searches across multiple languages. Hopefully this will be the first real big break that plowing through these dozen languages actually provides. Enjoy.

-- J. Wes Ulm


Application of novel Chinese character-based relational network for concept-based (CONBASE) and interlingual searching

{alternate title} Application of the Chinese Characters as a Systematic, Quantitative "Universal Background Language" for Meaning- and Concept-Based (CONBASE) Data Searches and Interlingual Searching of High Sensitivity and Specificity


Background: Fun with Etymology

As a lot of you reading this Blog and my Website know, I've long been a devotee of the obscure art of learning more languages than should ever be crammed into a given brain (a dozen at this point), to speak them fluently enough that I can communicate my ideas, no matter how complex, in a variety of important global standards. For obvious reasons, I've had to focus on a subset of these that are most relevant to whatever I'm working on (while using fluency in one language to "bridge" fluency in a related one, as for example I've done in learning Portuguese and Italian while using Spanish as a "base," or Dutch, Danish, and Swedish from German). Among all these languages, Mandarin Chinese has stood out for the incredible "simplicity in its complexity": It's an ancient language that's evolved to tackle the modern world with remarkable efficiency. Since Chinese was the language used at the heart of a sprawling ancient Empire, much like Latin or the Koine Greek of Alexander the Great's domains (later to evolve into the Byzantine Empire), it's been a "prestige language" for over 2,000 years, ever since the first Emperor, Qin Shi Huangdi, unified and founded China as an intact political entity. Except that unlike the Roman and Byzantine Empires, which fell apart, with the consequent fragmentation of ancient Latin and Greek into distinct language families of their own, China has remained intact and powerful since 221 B.C.

And since Chinese has never stopped being an international standard, even during the worst nadirs of Chinese history (during the Mongol and Manchu conquests, for example), the Chinese language has evolved to encompass a rapidly changing world, while basically maintaining its original structure and vocabulary. Unlike most other modern languages, Chinese rarely imports a new loanword fully intact (as a phonetic addition as well as supplying meaning); instead, when new technology, terminology, and slang arise, the Chinese language just creates a new metaphor involving Chinese characters and meaning elements already in existence. (Sometimes, in fact, these new words were coined first in Japanese, which is written in part using modified Chinese characters; an example is the word for telephone, "diàn huà" in Mandarin or "denwa" in Japanese, which simply means "electric speech.")

And this is where the Chinese language comes in so handy for information organization and categorization: its writing system, as cumbersome as it may seem at times, is a kind of built-in, tailor-made structure to classify information. The logographic script of Chinese combines a "radical" (which carries a meaning element and classifies the word along with others containing the same radical) with a "phonetic" (which provides a hint about the pronunciation by linking the radical to a subset of a few hundred easily learned basic characters, while at times also contributing a meaning element of its own). Furthermore, the radicals aren't stand-alone; instead, complex radicals often incorporate elements from more basic ones, so that there are "radical families" that chain related conceptual groups together. As a result, the Chinese language in general is structured so that a web of specific, quantifiable information relations is evident within the visual representations and "nested sets" provided by the characters. In English and other European languages, we can also relate words together via commonalities among Greek, Latin, and Germanic roots, for example. However, in contrast to Chinese, phonetic languages such as our own did not evolve to provide such obvious connections among word stems, and this is especially acute for English, much more than for the Romance languages. Those evolved out of a popular version of classical Latin, a historical "prestige" language (as French and Italian would themselves later become), and were thus far less inclined than the Germanic and Slavic languages to import loanwords from another language considered "culturally superior" throughout Europe since the fragmentation of the Roman Empire. English thus evolved as a kind of "jumble" of word roots.

More specifically: It's only since about the early 1960s that English has been a primary international language (superseding French for that purpose, and German in the sciences for periods before that); in the nearly 1,600 years since the fall of the Roman Empire, Latin, Italian, Greek, and French (and even Arabic during a brief period during the early Middle Ages) had the status of lingua francas, for intellectual pursuits and often far more (especially in the case of French and Latin). So English and German, both important, international modern Germanic languages, assumed a complicated overlay of Greco-Latinate roots on top of their Germanic core, with the result that there is no means to glance at a sentence in English, or at different words in a dictionary, and relate them through a "common structure" connecting up the various word roots, which have multiple sources. As will be noted, even more conservative languages which have historically had the role of prestigious international standards themselves, such as French, Italian, and Arabic, suffer from the same problem to varying degrees if they are represented phonetically.

Chinese, in contrast, has maintained its international prestige in varying forms since 221 B.C. (when Emperor Qin Shi Huangdi first standardized the Chinese character set, among multiple competing representations), and as such, Chinese has preferred to coin new words, including abstract concepts and specialized technological lingo, by melding together two (or more) syllables, corresponding to the Chinese characters, that already exist within the language. As stated above, the logographic script and radical families also contribute to this overwhelming propensity in Chinese. As a result, the Chinese language has evolved in an intricate manner so that it can be quite literal and specific in describing a concept, avoiding the "homonymic trap" of English by easily distinguishing, for example, a "letter" of the alphabet from a "letter" sent by mail, or Gaelic (as a language) from Gaelic (as a cultural designation for football and festivals), via the use of "category markers" in the form of specific Chinese character radical designations. But Chinese can also be quite metaphorical, abstract, and artistic, by building on simpler radicals to form more complex ones, which, at the same time, embody the "history of thought" that led to new characters in the form of simpler radicals contained within more complex ones. Therefore, the Chinese language has integrated a scheme to conveniently chain concepts together and to surmount the limitations of text- or string-based searching alone.


The Chinese Characters: A Tailor-Made "Conceptual Family Tree" to Allow Meaningful CONBASE Searches

Therefore, from the standpoint of optimized information-bearing networks, Chinese is ideal as a kind of behind-the-scenes, "universal language" for classifying everything from Webpage content, to images, to videos and their own complex content, and this is especially true in regard to the tags that are applied to Blogs, video, images and other labels for various sites. This confers upon the characters a remarkable capacity to organize information in a way that allows for high-sensitivity searching (finding what you're looking for) along with high-specificity searching, i.e. weeding out what you don't want. (I'll focus initially on the sensitivity aspect, finishing with the specificity advantages.) The Chinese characters, in other words, allow for high-resolution, fine-grained needle-in-a-haystack searches using concepts, rather than merely search strings themselves, and in so doing, in fact, bring the automated search process a step closer to resembling the more meaning-based recognition systems that are innate to the human mind. Because of this, utilizing the radical structure of Chinese characters may provide a way to perform concept-based searches far superior to tools that are currently available, which mainly rely on so-called "clustering" of other words around a search term of interest among the pages that are catalogued.

The inherent "classifier tree" structure of the Chinese characters is, for all practical purposes, set up to establish rapid and quantifiable relations among concepts and tags that would be ostensibly unrelated on the basis of spelling or phonetics alone. Text searches and tags (for videos, images, and scientific publications, for example) that are interrelated in their meaning, but with no clear, quantifiable textual links in English and most other languages, can be easily co-classified in Chinese. An example would be a tag pairing like "strengthen" (加強 in traditional Chinese characters or 加强 in simplified characters, pronounced "jiā qiáng" in the pinyin Romanization of Mandarin) and "increase" (增加, same in both traditional and simplified characters, pronounced "zēng jiā"). Let's say we were doing a search for a paper in economics on the strengthening of the Euro currency in recent years, and as one of the tags in our search, we'd like to seek out papers that refer in some way to the concept of the Euro gaining in value, i.e. strengthening or increasing in value against other currencies, or a broad variety of other, relatively equivalent words. We'd ideally like to input just one search term to cover all these related terms pertaining to the same concept. Notice that if you were to see these two tags together, "strengthen" and "increase," you'd know, based on our human grasp of language, that the concepts behind these tags are closely related: Both entail some "enhancement" of a particular quantity. In the context of a search engine, or a specialized search algorithm for a technical database (such as Pubmed, the National Library of Medicine's search tool for the biomedical literature, where I first conceived this idea about 2½ years ago), it would be ideal, when doing a "concept-based search" for related results, if inputting any of these terms could help us find pages, papers, videos, or other content which is conceptually linked to the original term.

The problem is that with English and other phonetically-represented languages, there is no built-in means to transfer our own mental conceptual grasp of this interrelation ("strengthen and increase both mean similar things") into a searching protocol that can do the same. While techniques such as clustering can help to allow some level of concept searching via word associations, direct conceptual linkages are virtually impossible. This is especially true for English, due to the historical reasons explained above. The word "strengthen" is rooted in the Germanic base of English (cognate with the German word streng, "severe," for example), while "increase" is of Romance language heritage and thus Greco-Latin origin (compare Spanish "crecer" and Portuguese "crescer"), one of the thousands of French loanwords that streamed into Middle English during Geoffrey Chaucer's time, in the late Middle Ages. Therefore phonetically, despite their similar meanings, there is no "code" to link up results tagged with "strengthen" to those containing "increase."

Moreover, this problem is not solved even if we were to restrict ourselves to using search terms of common etymological background in English (nor is it much easier if we are searching in a Romance language such as French or Italian, which, for historical reasons, is closer to the "classical Western prestige language" of Latin and therefore has a more consistent root structure in its vocabulary). For example, consider the trio "poor," "poverty," and "pauper." These three terms all stem from the very same Latin root; "poor" and "poverty" entered into English via Old French, in the wake of the Norman Conquest, while "pauper" was later imported directly from Latin. All three of these words are very closely related conceptually, with similar meanings that we as human readers obviously recognize. Nevertheless, because of the phonetic structure of English and the way these words are rendered (exacerbated by the notoriously irregular orthographical patterns used in the English language), there is no easy way to conduct a search that would "net" related results, using the other two terms, by initiating a search with any one of them.

Now, contrast this with the case in Mandarin Chinese. The word "poor" in Chinese is 貧窮 (贫穷 in simplified characters), pronounced "pín qióng" (rising tones on both syllables). "Poverty" is 貧困 (贫困), pronounced "pín kùn," while "pauper" is 貧民 (贫民), "pín mín." As with the "strengthen/increase" pair, the Chinese language formalizes and standardizes conceptual relationships among distinct words. Furthermore, since Mandarin Chinese very rarely imports phonetic-based loanwords (with exceptions mostly in the case of foreign foods, e.g. coffee, salad, hamburger, or proper names), the Chinese language provides a kind of naturally evolved conceptual database for classifying and searching for distinct words and phrases that have a related meaning, even if no phonetic links are readily apparent. In the above cases, the same Chinese character is present in all 3 words (which are distinct in English and most other European languages, without an obvious phonetic "binder" to link them together for search purposes), and a search for one could obtain articles or pages containing the other terms due to this relationship. A quantitative search protocol can be established by, for example, assigning points to results which possess the relevant Chinese characters (with a weighting system giving higher point values for exact matches). Furthermore, this kind of procedure can be done for any language, not just Mandarin Chinese. Tags can be fairly easily converted from many different languages into Chinese equivalents, owing to the precision of Chinese character combinations (and thus their high specificity) in distinguishing homonymic word combinations (more on this later). Thus, Chinese can operate as a kind of "background database classifier" for the article and page tags.
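To make the point-assignment idea concrete, here is a minimal Python sketch. The tag-to-character table, the point values (10 for an exact match, 3 per shared character), and the function names are all illustrative assumptions of mine, not the actual CONBASE weights or database; a real system would need a curated dictionary.

```python
# Hypothetical lookup from English tags to simplified Chinese equivalents.
# These five entries come from the examples in the text; a real table
# would be far larger and curated by hand.
TAG_TO_HANZI = {
    "poor": "贫穷",
    "poverty": "贫困",
    "pauper": "贫民",
    "strengthen": "加强",
    "increase": "增加",
}

# Assumed weights: exact matches score higher than partial overlaps.
EXACT_MATCH_POINTS = 10
SHARED_CHAR_POINTS = 3


def score(query_tag, candidate_tag):
    """Return points for a candidate tag given a query tag."""
    query = TAG_TO_HANZI[query_tag]
    candidate = TAG_TO_HANZI[candidate_tag]
    if query == candidate:
        return EXACT_MATCH_POINTS
    # Count the distinct characters the two words have in common.
    return SHARED_CHAR_POINTS * len(set(query) & set(candidate))


def search(query_tag):
    """Rank all known tags against the query, dropping zero-point results."""
    scored = [(tag, score(query_tag, tag)) for tag in TAG_TO_HANZI]
    hits = [pair for pair in scored if pair[1] > 0]
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for tag, points in search("poor"):
        print(tag, points)
```

A query for "poor" in this toy table returns "poor" itself plus "pauper" and "poverty" (which share 贫), while "strengthen" and "increase" find each other through the shared 加.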

Even the above demonstration does not begin to convey the remarkably quantitative power of the Chinese characters as classifiers for tags and page/article searches. In Chinese, not only do particular characters help to link distinct concepts when used in different words (i.e., combinations with other characters); in addition, the information contained within each character can be used to make still finer-grained classifications, and establish a highly intricate web of conceptual relationships that can be easily searched. That's because first of all, as noted above, Chinese characters in general are composed of a radical and phonetic pairing, and the radical establishes a "class" into which a variety of different words (both single characters and words composed of multiple characters) can be placed. While this categorization system is not perfect (many characters have e.g. grammatical or other functional roles that the radicals may not hint at), it is still convenient in the vast majority of cases to refine the point-value system depicted above to do fine-grained concept-based searching. Furthermore, the more intricate Chinese characters are largely composites: Their radicals are composed of two or more simpler "base radicals" assembled into the main radical itself, which means that still another level of fine-graining can be obtained even between disparate radical families.

As an example, in English, the words "juice," "perspiration," "a tear (cried)," "wash," and "flow" all have an obvious conceptual linkage: They represent something wet or, more specifically, something that is liquid and flows. In English and almost any other language, if we wanted to do a search for "something wet" or "something fluid," we really have no path to do so efficiently; words in English may be related and we, the human speakers of the language, may know this when we see the words or phrases, but there is no practical means to perform an automated concept-based search for "something wet" or "something flowing." In Mandarin Chinese, however, this kind of a concept-based search is straightforward, since all of these words fit within the same radical family: 汁 (same in both traditional and simplified characters), pronounced zhī, "juice"; 汗 hàn, "perspiration"; 淚 (泪) lèi, "a tear"; 洗 xǐ, "wash"; 流 liú, "flow." All of these characters have the "water" radical and, so, are conceptually organized in the symbols that represent them in Chinese.
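As a rough illustration of a radical-family lookup, the following Python sketch groups characters by radical and retrieves everything filed under "water." The character-to-radical table is hand-entered for this example only (an assumption; a real system would draw on a full radical dictionary), and the function names are hypothetical.

```python
from collections import defaultdict

# Hand-entered, illustrative table: character -> radical.
# 水 stands for the "water" radical (written 氵 on the left of a character).
CHAR_TO_RADICAL = {
    "汁": "水",
    "汗": "水",
    "泪": "水",
    "洗": "水",
    "流": "水",
    "贫": "贝",  # "poor": filed under the "shell/money" radical, not water
}

# English glosses for display purposes only.
CHAR_TO_GLOSS = {
    "汁": "juice",
    "汗": "perspiration",
    "泪": "tear",
    "洗": "wash",
    "流": "flow",
    "贫": "poor",
}


def build_radical_index(char_to_radical):
    """Invert the table: radical -> list of characters in that family."""
    index = defaultdict(list)
    for char, radical in char_to_radical.items():
        index[radical].append(char)
    return index


def concept_search(radical, index, glosses):
    """Return the glosses of every character filed under the given radical."""
    return sorted(glosses[char] for char in index.get(radical, []))


if __name__ == "__main__":
    index = build_radical_index(CHAR_TO_RADICAL)
    print(concept_search("水", index, CHAR_TO_GLOSS))
```

Searching on the water radical pulls back all five "wet" words at once and leaves "poor" out, which is the kind of concept-level grouping that string matching on the English tags cannot provide.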

We can go even further with character composites. For example, a "toe" in Chinese is 止, pronounced "zhǐ," and this "toe radical" (with its own family of characters, in conjunction with phonetics) can serve as the base for producing other composite radicals with their families, thus linking the "radical cousins" to each other. For example, the character 走, pronounced "zǒu," means "to walk," and is a composite containing 止 (in modified form) and another radical indicating forward motion or inclination. This character can also function as a radical in other characters when paired with a phonetic: 趕 (赶) "gǎn," which means "to hurry" or "catch up to"; 超 "chāo," which means "to exceed or surpass"; 起 "qǐ," which means "to rise" or "get up." Notice how each of these more complex characters relates conceptually in some way to the original 止: They all represent, literally or metaphorically, some kind of action that is performed "on one's toes." This can involve walking, stopping (from a walk), getting up, speeding up, or surpassing someone else in motion. In this way, even the metaphors that are built on concrete objects (little stories that are used in all languages to construct complex, abstract new vocabulary) are quantitatively related to each other in the symbology used for writing Chinese.

There is one more respect in which the Chinese character CONBASE searching provides a powerful, incredibly valuable system for concept-based searching: increased specificity. As briefly mentioned above, if we enter a search term like "letter" into an English language search, we wind up with the frustrating complication that "letter" could apply to "a letter of the alphabet" or "a letter that we write (to a friend) and mail." The same goes for "trip" (and fall) or "trip" (the kind that you book a reservation for). They're two completely different things in each case, but in English they're exactly the same word: a homophone or homonym. The English language is utterly full of such homonyms, again an artifact of the historical evolution of the language, and this fact introduces a particularly vexing headache for text-based searches: If we are searching for a "letter" tag, it is extremely difficult to distinguish between the two different usages, both in the search string and in the results. Outside of adding more specific additional terms, there is not much to narrow it down. To be sure, most other European languages do not suffer as much as English from homonymic mix-ups, but English is hardly alone in this regard.

Chinese, in contrast, is inherently organized to distinguish between words on the basis of meaning and conceptual background. Thus, the two (very different) senses of "letter" and "trip" in English are represented by distinct Chinese characters, even in cases when the pronounced syllable would sound the same in speech. When two different terms mean distinct things, Mandarin Chinese places them into distinct radical families, which allows useful concept-based searching to take place. Chinese is not entirely immune to the homonymic complication, and some Chinese characters are so-called "duoyinzi," with more than one possible pronunciation. Nevertheless, the vast majority of Chinese characters correspond to just one syllable and one or a few closely related meaning elements, with the inherent penchant toward classification enabling fine and meaningful distinctions among similar-sounding words that might be confused with each other. Thus, Chinese character-driven CONBASE searching is specific as well as highly sensitive.
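The specificity gain can be sketched in a few lines of Python. The sense labels, the choice of Chinese equivalents (字母 for an alphabet letter, 信 for a mailed letter, 绊倒 for stumbling, 旅行 for a journey), the document tags, and the function name are my own illustrative assumptions; how a user or system picks the intended sense is left open here.

```python
# Hypothetical sense table: one English homonym maps to several
# distinct Chinese equivalents, one per meaning.
SENSES = {
    "letter": {"alphabet symbol": "字母", "mailed message": "信"},
    "trip": {"stumble": "绊倒", "journey": "旅行"},
}

# Made-up documents, each already tagged with a Chinese equivalent
# in the background.
TAGGED_DOCS = {
    "doc_a": "字母",
    "doc_b": "信",
    "doc_c": "旅行",
}


def search_by_sense(word, sense):
    """Return documents whose background tag shares a character with the chosen sense."""
    target = SENSES[word][sense]
    return sorted(
        doc for doc, tag in TAGGED_DOCS.items() if set(tag) & set(target)
    )


if __name__ == "__main__":
    # The plain English string "letter" cannot separate these two sets.
    print(search_by_sense("letter", "alphabet symbol"))
    print(search_by_sense("letter", "mailed message"))
```

Because the two senses of "letter" share no character, a search on one sense does not pull in documents about the other.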


Applications: Proof of Principle in Technical Databases, Expansion to General Searching

What I've provided above is a brief summary of years of work on this topic.  In practice, making a system like this practical has required merging the innate advantages of the Chinese character radical families for concept-based searching, with often esoteric and seemingly arcane fields such as historical etymology, comparative linguistics, human cognitive mimicry, and "innate linguistic symbology within the mind," to develop a system that is both workable and quantitative.  It has also required a painstaking analysis and application of the system for each particular character, radical, and radical family: For historical reasons in the development of the Chinese language (including contingent events such as Imperial dynastic transitions, wars in the East Asian region, and legal and bureaucratic innovations), different radical families tell different "narratives" in the logographic metaphors that they represent.  Therefore, understanding these narratives and their historical evolution has been critical in establishing a truly robust concept-based search algorithm utilizing the Chinese characters. 

Thus in practice, of course, constructing protocols to incorporate these relationships into search engines has required substantial work to actually quantify "how related" different composite radicals are to each other, and how much to "weight" different characters containing the same radical, a kind of database that has been my project for the last several years. But if we consider different character combinations in the "conceptual family tree" as depicted above, and weight the relationships accordingly (with "siblings" containing the same character, "cousins" the same radical, and "second cousins" containing the same base radical in composites), we can get a sense of how to assign points to any given search result when a particular character is inputted. And again, since the Chinese characters are more or less operating in the "background," with tagwords and keywords in any language (without obvious conceptual links) converted to their Chinese character equivalents (with the clearly established conceptual links), we can enter a search tag in any language, opt for a "CONBASE" (concept-based) search, and obtain a rich series of meaningful results.
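The sibling/cousin/second-cousin weighting might look something like the Python sketch below. The point values (10, 5, 2), the small radical and decomposition tables, and the function names are assumptions made for illustration; they are not the weights from the actual database.

```python
# Hand-entered, illustrative data: character -> its classifying radical.
RADICAL = {
    "走": "走", "赶": "走", "超": "走", "起": "走",  # "walk" family
    "止": "止", "步": "止",                          # "toe/stop" family
    "汁": "水", "汗": "水",                          # "water" family
}

# Composite radicals and the simpler base radicals they contain.
# 走 is treated here as containing 止 (in modified form).
BASE_RADICALS = {"走": {"止"}}

# Assumed point values for each degree of kinship.
SIBLING, COUSIN, SECOND_COUSIN = 10, 5, 2


def radical_lineage(char):
    """Return the character's radical plus any base radicals inside it."""
    radical = RADICAL[char]
    return {radical} | BASE_RADICALS.get(radical, set())


def kinship(a, b):
    """Score two characters by how closely their radicals are related."""
    if a == b:
        return SIBLING
    if RADICAL[a] == RADICAL[b]:
        return COUSIN
    if radical_lineage(a) & radical_lineage(b):
        return SECOND_COUSIN
    return 0


def word_score(word_a, word_b):
    """Score two words by their best-matching pair of characters."""
    return max(kinship(a, b) for a in word_a for b in word_b)


if __name__ == "__main__":
    print(kinship("赶", "超"))  # same radical
    print(kinship("步", "赶"))  # linked only through the base radical
    print(kinship("汁", "赶"))  # unrelated families
```

Taking the maximum over character pairs is only one possible way to combine character scores into a word score; summing or averaging would be equally defensible choices.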

I'd suspect that as an initial proof of principle, the Chinese-character CONBASE search system would be most useful in searches of technical databases, such as the National Library of Medicine's Pubmed and specialty databases for engineering, archaeology, and other fields.  The technical literature in such cases is written in a formalized language with very little slang or colloquialisms, rich in the kind of abstract Greco-Latin wordstock that has been fundamental for scientific communication across many different European languages, and which is easy to convert to corresponding Chinese characters.  Moreover, as a matter of the scientific paper-submission and publication process itself, authors submit a variety of keywords that assist in subsequent searches and classification.  Nevertheless, searches turn up hits only when the exact keyword (or a close grammatical equivalent) is inputted; concept-based searches are, as of yet, not feasible. 

Chinese-character CONBASE searching solves this problem for such technical databases, yielding more meaningful and useful searches no matter what the input language by allowing for meaning- and concept-based searching. All tags can simply be converted into their Chinese character equivalents (again, in the background, something that a user would not generally see), with the point assignments as indicated above providing "weighted hits" based on the degree of conceptual linkage between the related Chinese characters. From specialized technical databases, then, the Chinese-character based CONBASE search can be refined to apply it to more general text-based searches of the Web on common search engines.

Frontiers: Interlingual searching and queries beyond text alone

Then, perhaps, come the most exciting search applications; since the Chinese characters would be operating in the background as a common "linker language" for concepts in any language, this would allow the first application of interlingual searching. An obvious limit of all search engines at the present time is that searches can essentially be done in only one language, and this limitation is inherently connected to the present inability to perform meaning- and concept-based searching (with only specific words or character strings allowing results to be obtained, through exact matches). Chinese character CONBASE searching solves this problem by providing a universal conceptual tree into which tags, labels, and search terms in any language can be converted, then exhaustively and efficiently searched, yielding the most relevant results. Once again, those conducting the queries don't see the "Chinese character tree" in the background; it's merely used as a behind-the-scenes scaffolding to establish pertinence of results. But it can be used to allow searches and results in any combination of languages, since they all feed into the common, quantitative conceptual tree afforded by the Chinese characters.
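A toy version of the interlingual flow, in Python, could look like this. The multilingual lexicon, the sample documents, the point values, and the function names are all hypothetical stand-ins; the only idea taken from the text is that every tag is converted to a Chinese equivalent in the background before matching.

```python
# Hypothetical lexicon: (language code, tag) -> background Chinese equivalent.
LEXICON = {
    ("en", "poverty"): "贫困",
    ("es", "pobreza"): "贫困",
    ("de", "Armut"): "贫困",
    ("en", "poor"): "贫穷",
    ("fr", "pauvre"): "贫穷",
    ("en", "juice"): "汁",
}

# Made-up documents tagged in different languages: (id, language, tag).
DOCS = [
    ("doc1", "es", "pobreza"),
    ("doc2", "de", "Armut"),
    ("doc3", "fr", "pauvre"),
    ("doc4", "en", "juice"),
]

EXACT_MATCH_POINTS = 10   # assumed weight
SHARED_CHAR_POINTS = 3    # assumed weight


def to_hanzi(lang, tag):
    """Convert a tag to its background Chinese form, or None if unknown."""
    return LEXICON.get((lang, tag))


def interlingual_search(lang, tag):
    """Rank documents in any language against a query in any language."""
    query = to_hanzi(lang, tag)
    if query is None:
        return []
    hits = []
    for doc_id, doc_lang, doc_tag in DOCS:
        doc_form = to_hanzi(doc_lang, doc_tag)
        if doc_form is None:
            continue
        if doc_form == query:
            points = EXACT_MATCH_POINTS
        else:
            points = SHARED_CHAR_POINTS * len(set(query) & set(doc_form))
        if points:
            hits.append((doc_id, points))
    # Highest score first; ties broken by document id.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


if __name__ == "__main__":
    print(interlingual_search("en", "poverty"))
```

An English query for "poverty" here retrieves the Spanish and German documents as exact conceptual matches and the French "pauvre" document as a weaker, related hit, without the user ever seeing the characters.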

Perhaps even searches beyond text and tags alone could be amenable to increased efficiency using the Chinese character conceptual matrix. Images and video content, let alone the contents of real-world 3-dimensional spaces, are at present very difficult to search using automated systems. Even with the most sophisticated AI, it's quite difficult to generate algorithms that can provide sensible, valuable summary information in the way that an observing human mind can. However, as we arrive closer to the point that we can effectively teach "observer algorithms" to derive and classify meaningful data about a non-text space (with "Pavlovian algorithms" that reward correct observations and summaries, for example), using a Chinese character conceptual tree can help to further establish meaningful, quantifiable relationships among distinct searched and catalogued spaces, images, and videos. As is the case with purely text searches, using the Chinese character relational matrix can enable high-value CONBASE searches of many different environments, including object spaces outside of digital milieus themselves.


© 2008  J. Wes Ulm, MD, PhD

All rights reserved.

Feel free to cite, print, copy, or republish this original article with proper acknowledgment of author and site.  Please cite as: "Application of novel Chinese character-based relational network for concept-based (CONBASE) and interlingual searching," J. Wes Ulm, MD, PhD, Harvard University personal website, © 2008,