Gui Shichun, Yang Huizhong - Corpus


Code  Category  Error type               Description
np2   nn phr    set phrase               omission or replacement of a fixed element that goes after a certain noun
np3   nn phr    agreement                number agreement of a noun with its determiner or a word that refers to it
np4   nn phr    case                     possessive case error: form or use
np5   nn phr    countability             uncountable noun used as countable noun
np6   nn phr    number                   countable noun used with no determiner or -s; a or -s with plural noun
np7   nn phr    article                  a/an confusion or definite/indefinite confusion
np8   nn phr    quantifiers              misuse or confusion between many/much, (a) few/(a) little, some/any, etc.
np9   nn phr    other determiners        misuse or confusion of demonstratives, wh- determiners, numerals, etc.
pr1   pron      reference                incorrect/ambiguous pronoun reference/anaphoric
pr2   pron      anticipatory it          improper or wrong use of anticipatory it; it replaced by a demonstrative, etc.
pr3   pron      agreement                number agreement with a noun it refers to
pr4   pron      case                     case error of any personal pronoun
pr5   pron      wh-                      misuse or confusion of interrogative, relative and conjunctive pronouns
pr6   pron      indefinite               misuse or confusion of indefinite pronouns such as all/both, few/little, some/any, either/neither, etc.
aj1   adj       pattern                  error in the combination with other words/grammatical
aj2   adj       set phrase               error in the idiomatic use of an adjectival phrase; omission or replacement of a fixed element that goes after a certain adjective
aj3   adj       degree                   adjective degree error: form and use
aj4   adj       -ed/-ing confusion       -ed adjective for -ing adjective or vice versa
aj5   adj       predicative/attributive  predicative adjective used as attributive adjective
ad1   adv       order                    improper adverb placement/wrong position
ad2   adv       modification             adjective modifier used as verb modifier; other kinds of confusion
ad3   adv       degree                   adverb degree error: form and use
pp1   prep      pattern                  unacceptable combination with other words/grammatical
pp2   prep      set phrase               error in the formation or use of an idiomatic prepositional phrase
cj1   conj      pattern                  unacceptable combination with other words/grammatical
cj2   conj      set phrase               error in the formation or use of a phrase functioning as a conjunction
wd1   word      order                    misplacement of any word other than an adverb
wd2   word      part of speech           error in part of speech: right root but wrong word class
wd3   word      substitution             error in word choice: right word class but wrong selection (any part of speech)
wd4   word      absence                  omission of a word (any part of speech)
wd5   word      redundancy               oversuppliance of a word (any part of speech)
wd6   word      repetition               unnecessary repeating of a word
wd7   word      ambiguity                not clear word meaning/semantic
cc1   notional  n/n collocation          improper noun (phrase) and noun (phrase) combination/semantic
cc2   notional  n/v collocation          improper noun (phrase) and verb (phrase) combination/semantic
cc3   notional  v/n collocation          improper verb and noun (phrase) combination/semantic
cc4   notional  a/n collocation          improper adjective and noun (phrase) combination/semantic
cc5   notional  v/ad collocation         improper verb and adverb (or ad/v) combination/semantic
cc6   notional  ad/a collocation         improper adverb and adjective combination/semantic
sn1   sentence  run-on sentence          improper addition of clauses/fused sentence
sn2   sentence  sentence fragment        subordinate clause as a sentence; any phrase as a sentence
sn3   sentence  dangling modifier        illogical adverbial modification of a clause
sn4   sentence  illogical comparison     error in the comparison of words or phrases in a sentence which cannot be compared
sn5   sentence  topic prominence         the co-occurrence of an initial noun phrase and its equivalent (usually a pronoun) in the same sentence
sn6   sentence  coordination             faulty parallelism of clauses (or words/phrases) in a sentence
sn7   sentence  subordination            faulty attachment of a subordinate clause to the main clause
sn8   sentence  structural deficiency    error in the grammatical construction of a sentence: improper splitting, pattern shifting, confusing structure, etc.
sn9   sentence  punctuation              overuse, absence, choice, apostrophe, comma splice, etc.


4. Tools for Building the Corpus

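A corpus is a database implemented on a computer, and it must be processed with suitable software. Quite a few such programs exist, including WordCruncher, MicroConcord, Longman’s Concordancer, Concordance, Concordancer, Lexa, TACT, and Wordsmith. After experimenting with and comparing them, we decided to use TACT and Wordsmith, because they are comparatively powerful and are free software or shareware. However, we have special annotation requirements, and most of these programs cannot handle Chinese (although our LC is in English, it occasionally contains Chinese characters, which interfere with file processing). We therefore also wrote some dedicated programs: corpfind (for annotation; some colleagues instead used the AutoText function of Word to build the error classification table, so that once an error is found the code is entered with a mouse click, which also works well), cbrowser (for retrieval), cleantxt (for removing Chinese characters and symbols), paragraph (for removing line-break characters), merge (for merging word lists and computing statistics on them), PosTagger (for grammatical tagging), lemma (for lemmatization), and wordlist (for merging word lists after spelling correction). All of these programs require the corpus files to be in plain text (.txt) format. In addition, we find the table-making function of Excel in Microsoft Office very powerful. All the tables we produced are in Excel .xls format and can be opened only if Excel is installed. We do not convert these tables any further, so that users can process the data in Excel. If necessary, users can save the files in another format from within Excel. Excel itself can also do some statistical and charting work; where further statistical analysis and charting were needed, we used SPSS, Statistica, and Harvard Chart.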
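As an illustration of the kind of cleanup a tool like cleantxt performs, here is a minimal Python sketch that strips Chinese characters and full-width symbols from a plain-text file's contents. It is not the authors' program: the function name `clean_text`, the chosen Unicode ranges, and the assumption that the text has already been decoded to Unicode are all ours.

```python
import re

# Assumed ranges: CJK symbols and punctuation, CJK unified ideographs,
# and half-width/full-width forms. The original tool's rules are not
# described in the text, so these ranges are only a plausible choice.
CJK_PATTERN = re.compile(r"[\u3000-\u303f\u4e00-\u9fff\uff00-\uffef]+")


def clean_text(text):
    """Replace runs of Chinese characters/symbols with a space, then tidy spaces."""
    without_cjk = CJK_PATTERN.sub(" ", text)
    # Collapse the runs of spaces left behind by the removal.
    return re.sub(r" {2,}", " ", without_cjk).strip()


cleaned = clean_text("I like 饺子 very much。")
```

Replacing each run with a space rather than deleting it keeps the neighbouring English words from being fused together.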

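Both TACT and Wordsmith can carry out statistical analysis of a corpus and run concordance searches. TACT, however, allows search conditions to be specified (for example, the whole corpus or only the texts of one type of learner) when searching for words or errors, while Wordsmith has a special function called keyness, which compares the word frequencies of two corpora and finds the words that are overused or underused relative to a reference corpus. For example, we can compare the word lists of the five types of learners with the word list of a reference corpus to see which words each type of learner overuses or underuses. Both programs are provided on the CD-ROM; to use the full functionality of Wordsmith, registration is required.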
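The idea behind a keyness comparison can be sketched in Python as follows. The text does not say which statistic Wordsmith applies, so this sketch assumes the two-term log-likelihood form that is widely used for comparing corpora; the function names and the toy counts are hypothetical.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-term log-likelihood (G2) for one word in a study and a reference corpus."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    g2 = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # the term 0 * log(0) is taken as 0
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def keyness_table(study_counts, ref_counts):
    """Return (word, signed G2) pairs; positive = overused, negative = underused."""
    size_study = sum(study_counts.values())
    size_ref = sum(ref_counts.values())
    rows = []
    for word in set(study_counts) | set(ref_counts):
        a = study_counts.get(word, 0)
        b = ref_counts.get(word, 0)
        g2 = log_likelihood(a, size_study, b, size_ref)
        overused = a / size_study > b / size_ref
        rows.append((word, g2 if overused else -g2))
    # Strongest differences first, regardless of direction.
    rows.sort(key=lambda row: abs(row[1]), reverse=True)
    return rows


# Hypothetical toy word lists (not CLEC data).
learner = {"the": 50, "very": 30, "thus": 20}
reference = {"the": 50, "very": 10, "thus": 40}
rows = keyness_table(learner, reference)
```

The sign is added only for readability: the statistic itself measures the size of the difference, and the direction comes from comparing the relative frequencies.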

III. Statistical Analysis of CLEC

1. Statistical Lists

(1) Word Frequency Rank List (by frequency)

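The word frequency rank list (Rank List) arranges the word types of the corpus from highest to lowest frequency; for example, the has the highest frequency, 61787 occurrences, and ranks first. Word frequencies can also be arranged alphabetically, which gives the Alphabetical List. The two lists contain the same data and differ only in their ordering. This book provides only the frequency-ordered rank list, numbered II; the CD-ROM also provides the alphabetically ordered list, numbered III. In order to compare the CLEC rank list with the rank lists of other ECNS, we had to filter the CLEC texts. The texts contain many proper nouns in Hanyu Pinyin and the error tags we added to the corpus, as well as many spelling errors; for example, *abilitical, *abilitities, *abilitys, *abillities, *ablelity, *ablity, and *abtilities are all different misspellings of ability and abilities. If we counted all of these as word types in a rank list to be compared with the ECNS rank lists, the vocabulary size of the Chinese learners would clearly be inflated. So in compiling the rank list we removed the Pinyin proper nouns and the error tags and corrected the spelling errors. After this processing, the number of tokens (the total number of occurrences of all words in the corpus) fell from 1207879 to 1070602, and the number of types (all identically spelled continuous character strings in the corpus; for example, do, does, did, doing, and done are five types) fell from 25562 to 15313. These changes were made only for compiling the rank list; the original texts have been neither reduced nor corrected, so as to preserve their original state. When the concordancer is used for other statistics, the count is still based on the original 1207879 words, and readers should bear this in mind.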
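The filtering and counting just described can be sketched in Python. This is only an illustration under stated assumptions, not the authors' procedure: it assumes error tags are enclosed in square brackets (this section does not specify the tag syntax), and the correction table, the exclusion list, and the sample sentence are hypothetical.

```python
import re
from collections import Counter

# Assumption: error tags look like [np6]; the real tag syntax may differ.
TAG_PATTERN = re.compile(r"\[[^\]]*\]")
WORD_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def frequency_list(text, corrections=None, excluded=None):
    """Count word types after removing tags, fixing spellings, dropping names."""
    corrections = corrections or {}
    excluded = excluded or set()
    text = TAG_PATTERN.sub(" ", text.lower())
    counts = Counter()
    for word in WORD_PATTERN.findall(text):
        word = corrections.get(word, word)  # map a misspelling to its correct form
        if word not in excluded:            # e.g. Pinyin proper nouns
            counts[word] += 1
    return counts


sample = "He have [np6] the ablity to do it. She does the work in Guangzhou."
counts = frequency_list(
    sample,
    corrections={"ablity": "ability"},
    excluded={"guangzhou"},
)
tokens = sum(counts.values())    # total running words kept
types = len(counts)              # distinct spellings kept
rank_list = counts.most_common() # highest frequency first
```

Here do and does remain separate types, as in the text; merging them is the job of lemmatization, not of this step.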

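The word frequency rank lists of corpora generally supply some important parameters, such as frequency and dispersion. AHI additionally supplies the U value (the theoretical frequency of a word per 1,000,000 words) and the Standard Frequency Index (SFI). We adopted several of the AHI parameters in organizing our rank list. The specific formulas and their meanings are given in the notes preceding the rank list.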
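For orientation, the following Python sketch shows the forms in which two of these parameters are usually cited: Carroll's entropy-based dispersion measure and the SFI as 10(log10 U + 4). Treating these as the book's definitions is an assumption; the notes preceding the rank list are authoritative and may differ in detail. The function names are ours, and U is simply passed in because its computation is not described here.

```python
import math


def carroll_d(freqs, sizes):
    """Dispersion of a word over n subcorpora: 0 = in one part only, 1 = even."""
    if len(freqs) != len(sizes) or len(freqs) < 2:
        raise ValueError("need matching lists for at least two subcorpora")
    # Relative frequency of the word in each subcorpus.
    props = [f / s for f, s in zip(freqs, sizes)]
    total = sum(props)
    if total == 0:
        return 0.0
    weighted = sum(p * math.log2(p) for p in props if p > 0)
    return (math.log2(total) - weighted / total) / math.log2(len(props))


def sfi(u_value):
    """Standard Frequency Index from a per-million U value (assumed form)."""
    return 10.0 * (math.log10(u_value) + 4.0)


even = carroll_d([10, 10, 10, 10, 10], [1000] * 5)
clumped = carroll_d([50, 0, 0, 0, 0], [1000] * 5)
```

Under this form, a word expected once per million words has an SFI of 40, and each tenfold increase in U adds 10.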

(2) Spelling Error List

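The spelling error list is numbered IV. In compiling the rank list, we corrected the learners' spelling errors in order to gauge the size of the vocabulary they used. But the spelling errors of the different types of learners are a valuable reference for teaching, so we list the misspelled forms that were corrected in the rank list separately, as a spelling error list. The spelling errors amount to 10540 tokens and 5810 types. The list gives the correct spelling first, followed by the erroneous forms produced by each type of learner. We can see that some common words are easily misspelled by learners, such as knowledge (22 variants), society (21), important (13), government (13), opinion (12), beautiful (12), because (11), industry (11), and people (11).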

(3) Lemma List

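The lemma list is numbered V. The word types in the rank list come from the original corpus, so take, took, taken, and taking are each counted as a separate type. These different forms need to be merged into lemmas, a process called lemmatization. The aim is to find out how many words the learners actually use.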

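In compiling the lemma list, we took the E_lemma list compiled by Yasumasa Someya in 1998 as our basis and wrote a dedicated program around it. In E_lemma, pronouns and adverbs are not merged. The statistics in the lemma list still use the parameters set for the rank list; see the notes preceding the rank list.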
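Table-driven lemmatization of this kind can be sketched in Python as below. The sketch does not read the E_lemma file and is not the authors' lemma program; the small lemma table, the counts, and the function names are hypothetical stand-ins.

```python
from collections import Counter


def build_form_index(lemma_forms):
    """Invert {lemma: [inflected forms]} into {form: lemma}."""
    index = {}
    for lemma, forms in lemma_forms.items():
        index[lemma] = lemma
        for form in forms:
            # Keep the first lemma seen for a form listed under several lemmas.
            index.setdefault(form, lemma)
    return index


def lemmatize_counts(type_counts, form_index):
    """Add up type frequencies under their lemma; unknown words stay as they are."""
    lemma_counts = Counter()
    for word, freq in type_counts.items():
        lemma_counts[form_index.get(word, word)] += freq
    return lemma_counts


# Hypothetical miniature lemma table and type counts.
toy_lemmas = {
    "take": ["takes", "took", "taken", "taking"],
    "do": ["does", "did", "doing", "done"],
}
type_counts = Counter(
    {"take": 5, "took": 3, "taken": 2, "taking": 1, "do": 4, "did": 2, "quickly": 1}
)
lemma_counts = lemmatize_counts(type_counts, build_form_index(toy_lemmas))
```

Words missing from the table, such as the adverb quickly here, pass through unchanged, which matches the remark that pronouns and adverbs are not merged.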

After lemmatization, the number of word types falls by slightly more than one third overall; see Table 3.1:

Table 3.1 Changes before and after lemmatization

Learner type            St2   St3   St4   St5   St6   Whole corpus
Before lemmatization    5844  5343  5481  8459  9978  15313
After lemmatization     3981  3578  3891  5726  6781  9861

* See footnote 3 on p. 5.
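The proportional reduction for each column of Table 3.1 can be checked with a few lines of Python. The figures are copied from the table; the variable and function names are ours.

```python
# Type counts before and after lemmatization, copied from Table 3.1.
before = {"St2": 5844, "St3": 5343, "St4": 5481, "St5": 8459, "St6": 9978,
          "Whole corpus": 15313}
after = {"St2": 3981, "St3": 3578, "St4": 3891, "St5": 5726, "St6": 6781,
         "Whole corpus": 9861}


def reduction(count_before, count_after):
    """Share of word types removed by lemmatization."""
    return (count_before - count_after) / count_before


reductions = {name: reduction(before[name], after[name]) for name in before}
for name, share in reductions.items():
    print(f"{name}: {share:.1%}")
```

For the whole corpus this gives (15313 - 9861) / 15313, or about 35.6%; the individual learner types fall between roughly 29% and 33%.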
