Online Chinese Resources		Mar. 1997

During the last few years, the rapid growth of
the Internet community has made it possible to access
a large volume of text and speech resources through
the Net at very little cost. Research institutes
in the natural language processing community thus have
more opportunities to work on real-world tasks using
such network resources than they had earlier.

The main problem is that the
legal status of using such publicly available
resources is still unsettled.
For instance, some electronic lexicons were once
available on the Net, but they are no longer accessible,
probably because of legal problems.

In spite of the possible legal problems, we briefly
list a few known Chinese resources that are publicly
accessible. Researchers are encouraged to contact the
resource providers and to use such net resources as far as
the laws of the respective countries permit.

Since the list of accessible online resources is growing
day by day, the following information will quickly be out of
date. The authors will try to maintain a live list of such
resources (including both Chinese and non-Chinese resources)
on the ROCLING home page. The following paragraphs
therefore list only a few typical kinds of resources so that
new NLP researchers have a good starting point for finding
the resources relevant to their tasks. (Many resources
outside the Taiwan area may not be listed here because of the
authors' limited knowledge of other countries.) Briefly,
the following types of resources are accessible from the Net.

Online Electronic News

Many major Chinese newspaper publishers and radio and cable TV news
departments in Taiwan, Hong Kong, China, Singapore, Malaysia, and the
United States have moved their publications from paper
copy or voice broadcast to electronic forms (including text
and speech).

Such text resources provide the largest volume of frequently updated,
well-organized articles on politics, the economy, recreation,
literature, and science and technology. They are therefore
usually the first choice for NLP researchers.

Most news providers offer monolingual text,
so such resources are well suited to
monolingual research. However, many non-local news items are
simply translations of the same stories supplied by international
news agencies. It may therefore become possible in the future to
use such resources together with their counterparts in other
languages for multilingual research.

A few major Chinese news providers are listed below for
reference. Readers can easily find more links
by starting from these sites, by using a search engine, or
by visiting the home pages of the major Internet Service Providers
(ISPs) in the various countries.

[China]                   (GB)		(GB)		(Big5)

	- People Daily (人民日報)			(GB)

	- Guangzhou Daily (廣州日報)	(GB)

	- Beijing Youth Daily (北京青年報)

[Taiwan]		(Big5)

	- China Times (中國時報) Group
	(including Commercial Times (工商時報), Infotimes (時報資訊))		(Big5)	(Big5)

	- United Daily News (聯合報) Group
	  (including Ming-Sheng Daily (民生日報),
	  United Evening News (聯合晚報))		(Big5)

	- The Liberty Times (自由時報)			(Big5, GB)		(GIF)

	- Central News Agency (中央社) Real Time News		(Big5)

	- Taiwan Shin Wen Daily News (台灣新聞報)			(Big5)

	- Ming-Sheng Daily, United Daily, Liberty Times

	- The Era (TVBS) Cable TV News (年代 TVBS 電視新聞)

	- The Chinese TV System News (華視新聞)			(Real Audio)

	- Broadcasting Corporation of China (中廣新聞)

[Hong Kong]

	- Ming-Pao (明報)			(Big5)

	- Sing Tao Electronic Daily (星島日報)	(Big5)			(Big5)

	- China News Service, Hong Kong China News Agency (香港中新社)
	- a large list of Chinese media (traditional or electronic)
		is being constructed here

[Singapore]			(GB)	(Big5)

	- Lianhe Zaobao (聯合早報)


	- Sin Chew Jit Poh (星洲日報)			(Big5, GB)			(GB)

	- Nanyang Siang Pau (南洋商報)

Online Electronic Magazines

Online electronic magazines represent another kind of well-organized
but less frequently updated text resource. Most such magazines, like
their paper counterparts, focus on a particular subdomain for a
particular type of reader. The domains include personal computers,
political commentary, recreation (such as cars, sports, music) and so on.
It is therefore useful to use such resources for acquiring domain-specific

For instance, a few known E-magazines accessible through
the Net are listed as follows:				(Economy, Politics)

	- Common Wealth Magazine (天下雜誌)			(PC)

	- PC Week, 資訊傳真, PC Magazine, etc.

	- United Daily Info Weekly (聯合報資訊專刊)		(PC)			(Big5,GB,HZ)

	- China News Digest, Hwa Xia Wen Zhai (華夏文摘)		(GB,Big5)

	- Chinese Poetry Magazine, with links to many E-News and Magazines

News Groups, Mailing Lists and Bulletin Board Systems

There are thousands of newsgroups, mailing lists (discussion
lists) and bulletin board systems (BBSs) on the Net, which mostly
provide dialogue-based articles. Each newsgroup, list, or board
represents a subdomain whose readership is even narrower
than that of an E-magazine, and many of these subdomains rarely
appear in public in the form of a newspaper or magazine. Such
resources are therefore potential sources for very specialized
sublanguages. Such articles are characterized by
very new vocabulary and slang that may never appear in more formal
articles. Because such resources are dialogue-based, they provide good
scripts for real-world dialogue and question-answering systems.

One particular application of such text material is
training the error model of an error detection (or
correction) system, because such articles contain various types
of typographic errors. For instance, it is easy to find typing errors
on Chinese BBSs, intentional or unintentional, that result from
homophonic Chinese characters.
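
As a rough illustration of what training an error model could mean
here, the sketch below counts character substitutions between what
was typed and what was intended, and turns the counts into
relative-frequency estimates of P(observed | intended). It assumes
that pairs of observed and corrected sentences are already available
and have equal length; the three sample pairs and the function names
are made up for illustration and do not come from any real BBS.

```python
from collections import Counter, defaultdict

def count_confusions(pairs):
    """Count observed characters for each intended character."""
    counts = defaultdict(Counter)
    for observed, intended in pairs:
        if len(observed) != len(intended):
            continue  # this sketch only handles one-for-one substitutions
        for o, i in zip(observed, intended):
            counts[i][o] += 1
    return counts

def error_model(counts):
    """Relative-frequency estimate of P(observed | intended)."""
    model = {}
    for intended, observed_counts in counts.items():
        total = sum(observed_counts.values())
        model[intended] = {o: n / total for o, n in observed_counts.items()}
    return model

# Hypothetical (observed, intended) pairs with homophone substitutions.
pairs = [
    ("我再家", "我在家"),   # 再 typed for 在
    ("他在家", "他在家"),   # typed correctly
    ("跑的快", "跑得快"),   # 的 typed for 得
]
model = error_model(count_confusions(pairs))
print(model["在"])
```

With real data the pairs would have to be aligned first, and the
estimates smoothed, before such a table could be used in a detection
or correction system.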

Search Engines

Because there are so many articles on the Net, it is difficult
for a researcher to find relevant material if he or she does not
have a list of resources like the one above, or if that
list is too short to cover the general interests of the NLP community.
In that case, a search engine is very helpful for finding
relevant articles and information providers. In fact, a search
engine can itself be used by researchers to find the
contexts of particular words, since a search engine is backed
by a medium-sized or large corpus.
Using search engines for NLP research is therefore a way
of gathering language information
without the researchers having to collect a large corpus themselves.
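
Finding the contexts of a particular word amounts to a
keyword-in-context lookup. The sketch below shows the idea on a tiny
in-memory list of sentences standing in for the corpus behind a
search engine; the function name kwic, the window width, and the two
sample sentences are assumptions made for illustration. Because
Chinese text has no spaces between words, the window is measured in
characters.

```python
def kwic(corpus, keyword, width=5):
    """Return (left, keyword, right) for every occurrence of keyword."""
    hits = []
    for text in corpus:
        start = text.find(keyword)
        while start != -1:
            end = start + len(keyword)
            left = text[max(0, start - width):start]   # characters before the hit
            right = text[end:end + width]              # characters after the hit
            hits.append((left, keyword, right))
            start = text.find(keyword, start + 1)      # look for the next hit
    return hits

# Hypothetical two-sentence corpus.
corpus = ["今天的新聞報導很多", "網路新聞更新很快"]
for left, key, right in kwic(corpus, "新聞", width=3):
    print(f"{left} [{key}] {right}")
```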

Most search engines provide exact string match, case-insensitive
string match, and AND/OR operators for combining queries; more
advanced search engines also accept natural language queries.
A few search engines in Taiwan for Chinese text search are
listed here for reference:

	- Csmart search for the CNA News
	- provides natural language query

	- Csmart search for Chinese Lexicon (國語辭典) and other databases

	- The Academia Sinica Balanced Corpus (中研院平衡語料庫) search engine
	- searches by keywords with other specifications such
		as part-of-speech and semantic features

	- Global Area Information System
	- searches general Internet text resources such as BBS articles
	 (of the Taiwan and Asia areas)

	- United Daily Full Text Indexing for Info Weekly (聯合報資訊專刊索引)

	- a few commonly used commercial search engines
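
The basic query operations mentioned above (exact string match,
case-insensitive match, and AND/OR combination) can be sketched in a
few lines. This is only an illustration over an in-memory list of
strings; it does not reproduce the query syntax of any engine listed
here, and the function names and sample documents are made up.

```python
def matches(text, term, ignore_case=False):
    """Exact substring match, optionally ignoring letter case."""
    if ignore_case:
        return term.casefold() in text.casefold()
    return term in text

def search(docs, terms, mode="AND", ignore_case=False):
    """Return the documents containing all (AND) or any (OR) of the terms."""
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    combine = all if mode == "AND" else any
    return [d for d in docs
            if combine(matches(d, t, ignore_case) for t in terms)]

# Hypothetical documents mixing Latin and Chinese text.
docs = ["Internet 新聞", "internet 雜誌", "電視新聞"]
and_hits = search(docs, ["Internet", "新聞"])
or_hits = search(docs, ["internet", "新聞"], mode="OR", ignore_case=True)
print(and_hits)
print(or_hits)
```

Case folding only affects the Latin letters in a query; Chinese
characters have no case and are matched as they are.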

Special Online Resources

Most of the above resources are text resources.
However, natural language may also appear in other forms
such as speech. For instance, a Mandarin Chinese Text-to-Speech
system, which converts Chinese text into speech, was announced
recently, with both a Big5 page and a GB page.