Date of Award

5-29-2007

Document Type

Thesis (Undergraduate)

Department or Program

Department of Computer Science

First Advisor

Devin Balkcom

Abstract

Chinese characters are used daily by well over a billion people. They constitute the main writing system of China and Taiwan, form a major part of written Japanese, and are also used in South Korea. Anything more than a cursory glance reveals that these characters are highly structured, yet computing systems currently have no means of operating on that structure. Existing character databases and dictionaries treat characters as numerical code points and associate with them additional 'hand-computed' data, such as stroke count, stroke order, and other information to aid in specific searches. Searching by a character's 'shape' is effectively impossible in these systems. I propose a new approach to representing these characters through an XML-based language called SCML. By encoding an abstract form of a character, this language allows the direct retrieval of important information such as stroke count and stroke order, and permits useful but previously impossible automated analysis of characters. In addition, the system allows the design of a view that takes abstract SCML representations as character models and outputs glyphs based on an aesthetic, facilitating the creation of 'meta-fonts' for Chinese characters. Finally, through the creation of a specialized database, SCML allows efficient structural queries to be performed against the body of inserted characters, so that people can search by the most obvious of a character's characteristics: its shape.

Comments

Originally posted in the Dartmouth College Computer Science Technical Report Series, number TR2007-592.
