Online Chinese Resources

Mar. 1997

During the last few years, the rapid growth of the Internet community has made it possible to access a large volume of text and speech resources through the Net at very little cost. Research institutes in the natural language processing (NLP) community thus have more opportunity to work on real-world tasks using such network resources than they had earlier. The only problem is that the legal status of accessing such publicly available resources is still unsettled. For instance, some electronic lexicons were once available on the Net, but they are no longer accessible, probably because of legal problems.

In spite of the possible legal problems, we briefly list a few known Chinese resources that are publicly accessible. Researchers are encouraged to contact the resource providers and to use such Net resources as far as the laws of the respective countries permit.

Since the list of accessible online resources is growing day by day, the following information will go out of date quickly. The authors will try to maintain a live list of such resources at the ROCLING home page, currently at http://www.bdc.com.tw/~rocling/ (including both Chinese and non-Chinese resources). The following paragraphs therefore list only a few typical kinds of resource, so that new NLP researchers have a good starting point for finding the resources relevant to their tasks. (Most resources outside the Taiwan area may not be listed here because of the authors' limited knowledge of other countries.)

Briefly, the following types of resource are accessible from the Net.

Online Electronic News

Many major Chinese newspaper publishers and radio and cable TV news departments in Taiwan, Hong Kong, China, Singapore, Malaysia, and the United States have moved their publications from paper copy or voice broadcast to electronic forms (including text and speech).
Such text resources provide the largest volume of timely, well-organized articles on politics, economy, recreation, literature, and science and technology development. They are therefore the first choice for most NLP researchers. Most news providers offer monolingual text, so these resources are well suited to monolingual research. However, many non-local news items are simply translations of the same stories supplied by international news agencies. It may therefore become possible to use such resources, together with their counterparts in other languages, for multilingual research.

A few major Chinese news providers are listed below for reference. Readers can easily find more links by starting from these sites, by using a search engine, or by visiting the home pages of the major Internet Service Providers (ISPs) of the various countries.

[China]

http://www.peopledaily.co.cn/ (GB)
http://www.egis.com/gb/people_daily/ (GB)
http://www.egis.com/big5/people_daily/ (Big5)
  - People Daily (人民日報)

http://www.asia1.com.sg/gzbao/ (GB)
  - Guangzhou Daily (廣州日報)

http://info.bta.net.cn/young/you_main.htm (GB)
  - Beijing Youth Daily (北京青年報)

[Taiwan]

http://www.chinatimes.com.tw/ (Big5)
  - China Times (中國時報) Group (including Commercial Times (工商時報), Infotimes (時報資訊))

http://uen.globalnet.com.tw/ (Big5)
http://www.sinanet.com/minsheng/ (Big5)
  - United Daily News (聯合報) Group (including Ming-Sheng Daily (民生日報), United Evening News (聯合晚報))

http://www.libertytimes.com.tw/ (Big5)
http://www2.nsysu.edu.tw/
  - The Liberty Times (自由時報)

http://www.cna.com.tw/ (Big5, GB)
http://ww3.sinanet.com/rtn/ (GIF)
  - Central News Agency (中央社) Real Time News

http://www.tpg.gov.tw/twnews/ (Big5)
  - Taiwan Shin Wen Daily News (台灣新聞報)

http://www.aide.gov.tw/ (Big5)
  - Ming-Sheng Daily, United Daily, Liberty Times

http://www.era.com.tw/
  - The Era (TVBS) Cable TV News (年代 TVBS 電視新聞)

http://www.cts.com.tw/
  - The Chinese TV System News (華視新聞)
http://www.bcc.com.tw/ (Real Audio)
  - Broadcasting Corporation of China (中廣新聞)

[Hong Kong]

http://www.mingpao.com/newspaper/ (Big5)
  - Ming-Pao (明報)

http://www.singtao.com/ (Big5)
  - Sing Tao Electronic Daily (星島日報)

http://www.chinanews.com/ (Big5)
http://www.chinanews.com/project/group_list/
  - China News Service, Hong Kong China News Agency (香港中新社)
  - a large list of Chinese media (traditional or electronic) is being constructed here

[Singapore]

http://www.asia1.com.sg/zaobao/ (GB)
http://www.asia1.com.sg/cgi-bin/cweb/g2b.pl (Big5)
  - Lian-Hao Zhaobao (聯合早報)

[Malaysia]

http://www.founder.net.my/sinchew/
http://web3.asia1.com.sg/sinchew/
  - Sin Chew Jit Poh (星洲日報) (Big5, GB)

http://www.asia-online.com/nsp/ (GB)
  - Nanyang Siang Pau (南洋商報)

Online Electronic Magazines

Online electronic magazines are another kind of well-organized text resource, although they are updated less frequently than news. Most such magazines, like their paper counterparts, focus on a particular subdomain for a particular readership. The domains include personal computers, political commentary, recreation (such as cars, sports, and music), and so on. Such resources are therefore useful for acquiring domain-specific information. A few known e-magazines accessible through the Net are listed below:

http://www.cw.com.tw/ (Economy, Politics)
  - Common Wealth Magazine (天下雜誌)

http://www.infopro.com.tw/ (PC)
  - PC Week, 資訊傳真, PC Magazine, etc.

http://udn.com.tw/service/pcnews/infoweekly/ (PC)
  - United Daily Info Weekly (聯合報資訊專刊)

http://www.cnd.org/
http://www.cnd.org:8009/HXWZ/ (Big5, GB, HZ)
  - China News Digest, Hwa Xia Wen Zhai (華夏文摘)

http://www.rpi.edu/~cheny6/java.html (GB, Big5)
  - Chinese Poetry Magazine, with links to many e-news sites and magazines

News Groups, Mailing Lists and Bulletin Board Systems

There are thousands of newsgroups, mailing lists (discussion lists), and bulletin board systems (BBS) on the Net, which provide mostly dialogue-based articles.
Each newsgroup, list, or board represents a subdomain whose readership is even narrower than that of an e-magazine, and many of these subdomains rarely appear to the public in the form of a newspaper or magazine. Such resources are therefore potential sources for very specialized sublanguages. Their articles are characterized by very new vocabulary and by slang that may never appear in more formal writing. Since such resources are dialogue-based, they provide good scripts for real-world dialogue and question-answering systems.

One particular application of such text is training the error models of an error detection (or correction) system, because the articles contain various types of typographic errors. For instance, in Chinese BBS articles it is easy to find typing errors, intentional or unintentional, that result from homophonic Chinese characters.

Search Engines

Because there are so many articles on the Net, it is difficult for a researcher to find relevant material without a list of resources like the one above, or when that list is too short to cover the general interests of the NLP community. In that case, a search engine is very helpful for finding relevant articles and information providers.

In fact, a search engine can itself be used by researchers to find the particular contexts in which words occur. A search engine also has a medium-sized or large corpus behind it. Using search engines for NLP research is therefore a way to gather language information without the researchers having to collect a large corpus themselves. Most search engines provide exact string match, case-insensitive string match, and AND/OR operators for combining queries; more advanced search engines also accept natural language queries.
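To make the query features just described concrete, the following is a minimal sketch in Python of exact substring matching, an optional case-insensitive mode, AND/OR combination of terms, and a keyword-in-context lookup over a small in-memory corpus. It does not describe how any of the engines mentioned here work internally. The function names (matches, search, kwic) and the three sample sentences are hypothetical and were invented for illustration.

```python
def matches(text, terms, mode="AND", ignore_case=False):
    """Return True if `text` satisfies an exact-substring boolean query."""
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    if ignore_case:
        text = text.lower()
        terms = [t.lower() for t in terms]
    hits = [t in text for t in terms]
    return all(hits) if mode == "AND" else any(hits)


def search(docs, terms, mode="AND", ignore_case=False):
    """Return the documents that satisfy the query, in their original order."""
    return [d for d in docs if matches(d, terms, mode, ignore_case)]


def kwic(text, keyword, width=5):
    """Keyword in context: one (left, keyword, right) tuple per occurrence."""
    if not keyword:
        return []
    out = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        out.append((left, keyword, right))
        start = text.find(keyword, end)
    return out


# Hypothetical three-sentence corpus, invented for illustration only.
DOCS = [
    "中文新聞可以從 Internet 取得",
    "中文雜誌的主題比較專門",
    "BBS 文章常見同音錯字",
]

if __name__ == "__main__":
    print(search(DOCS, ["中文", "新聞"], mode="AND"))
    print(search(DOCS, ["新聞", "雜誌"], mode="OR"))
    print(search(DOCS, ["internet"], ignore_case=True))
    print(kwic(DOCS[0], "新聞", width=2))
```

The context window in kwic is measured in characters rather than words, which avoids assuming any Chinese word segmentation; a real study of word contexts would probably segment the text first.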
A few search engines in Taiwan for Chinese text search are listed here for reference:

http://csmart.iis.sinica.edu.tw/cna.html/
  - Csmart search for the CNA News
  - provides natural language query

http://www.sinica.edu.tw/csmart/
  - Csmart search for Chinese Lexicon (國語辭典) and other databases

http://www.sinica.edu.tw/ftms-bin/kiwi.sh
  - The Academia Sinica Balanced Corpus (中研院平衡語料庫) search engine
  - search by keywords with other specifications such as part-of-speech and semantic features

http://gais.cs.ccu.edu.tw/cgais.html
  - Global Area Information System
  - search for general Internet text resources such as BBS articles (of the Taiwan and Asia areas)

http://udn.com.tw/
  - United Daily Full Text Indexing for Info Weekly (聯合報資訊專刊索引)

http://taiwan.yam.org.tw/b5/yam/
http://www.hello.com.tw/
http://www.whatsite.com.tw/
  - a few commonly used commercial search engines

Special Online Resources

Most of the resources above are text resources. However, natural language also appears in other forms, such as speech. For instance, a Mandarin Chinese text-to-speech system was recently announced at

http://www.bell-labs.com/project/tts/mandarin.html (Big5 page)
http://www.bell-labs.com/project/tts/mandarin-gb.html (GB page)

It converts Chinese text into speech.
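Many of the entries in this list are labelled GB, Big5, or HZ, which are different character encodings for Chinese text, so text gathered from these sites has to be decoded with the matching encoding before it can be combined into one corpus. The following is a minimal sketch in Python, assuming that the codec names "gb2312", "big5", and "hz" from Python's standard library correspond to those labels; the actual pages may use variants of these encodings. The helper name round_trip and the sample string are hypothetical.

```python
def round_trip(text, encoding):
    """Encode `text` to bytes in `encoding`, then decode it back to a string."""
    return text.encode(encoding).decode(encoding)


# Sample string chosen because both characters exist in GB2312 and in Big5.
SAMPLE = "中文"

# The same two characters become different byte sequences in each encoding.
GB_BYTES = SAMPLE.encode("gb2312")
BIG5_BYTES = SAMPLE.encode("big5")

# Decoding with the wrong codec does not recover the original text.
MISREAD = BIG5_BYTES.decode("gb2312", errors="replace")

if __name__ == "__main__":
    for name in ("gb2312", "big5", "hz"):
        print(name, SAMPLE.encode(name), round_trip(SAMPLE, name) == SAMPLE)
    print("Big5 bytes read as GB:", MISREAD)
```

Decoding every source into one internal representation before any counting or matching keeps GB and Big5 material comparable. Mapping between simplified and traditional character forms is a separate step that this sketch does not attempt.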