There are no orthographic boundaries between words in Chinese. This is the main difficulty of working with Chinese computationally (in addition to the bewildering array of encodings used for Chinese and the simplified/traditional script controversy). A Chinese word frequently consists of two, three or more characters, and the definition of what counts as a word in Chinese is the subject of intense debate. The same is true for other languages, though: constructions like "as well as" or "give up" have all the properties of a single word, and names like "White House" carry their intended meaning only when taken as a whole.
You can download the following resources:
The lack of a single accepted definition of words also creates problems for part-of-speech tagging: if the words output by a segmenter do not match what a tagger treats as a word, the accuracy of the tagger drops substantially. Four examples in question are:
The segmenter implements a simple longest word lookup algorithm with a couple of built-in heuristics for dealing with cases like 据报道 (where the first character is more likely to be a single token) and 就是 (where the two characters may be a single word in some contexts). The algorithm is simple, but it achieves an accuracy of 94-95% on the test files from the SIGHAN 2005 Bakeoff competition. Nevertheless, much more can be achieved by clever statistical techniques, such as those described by participants in the SIGHAN competition (see the overview). The algorithm relies on a dictionary obtained from a segmented corpus, so its performance on out-of-vocabulary words is poor. Please get in touch if you want to contribute to the open-source development of the segmenting tool listed on this page.
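The longest word lookup idea can be illustrated with a minimal sketch. This is not the code of the tool itself: the function name `segment`, the `max_len` parameter and the toy dictionary are assumptions made for illustration, and the heuristics for cases like 据报道 and 就是 are not reproduced.

```python
def segment(text, dictionary, max_len=None):
    """Greedy forward longest-match segmentation (illustrative sketch only)."""
    if max_len is None:
        # Longest dictionary entry bounds how far ahead we need to look.
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted, so unknown material
            # falls back to one-character tokens.
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical toy dictionary; a real one would come from a segmented corpus.
toy_dictionary = {"中国", "中国人", "人民", "银行"}
print(segment("中国人民银行", toy_dictionary))
```

With this toy dictionary the call returns 中国人 / 民 / 银行 rather than 中国 / 人民 / 银行: the greedy match takes the longest entry at each position, which shows why extra heuristics or statistical techniques are needed. Characters not covered by the dictionary come out as single-character tokens, which is one reason why performance on out-of-vocabulary words is poor.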
For the Internet and LCMC corpora I also computed their frequency lists: Internet corpus and LCMC corpus.
The resources and tools downloadable from this page all use UTF-8. They have been designed to work with the simplified script, though some provisions for the traditional script have been added as well.
The resources have been developed by Serge Sharoff; contact me at s.sharoff@leeds.ac.uk if you have further queries. The tokeniser is based on the original code developed by Erik Peterson, Mandarin Tools; advice on tagging was provided by Martin Thomas and Daming Wu. The tools are provided under the GNU General Public License.