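Syntactic parsing is both a central topic and a major difficulty in natural language processing research. To address the difficulties that full syntactic parsing encounters on large-scale real-world text, researchers proposed the concept of shallow parsing. Chunk recognition is one of the main tasks of shallow parsing: its output simplifies sentence structure and reduces the complexity of syntactic parsing, and, as a highly reliable partial analysis, it can also resolve most local structural ambiguities.

This thesis studies chunk recognition for Chinese sentences. It first reviews the state of research on chunk recognition and its significance. Building on previous work, it then defines chunks for Chinese sentences, the chunk tag categories, and the chunk annotation specification, and analyzes the internal structure of chunks. Finally, it designs and implements a chunk recognition system for Chinese sentences that includes two recognition models, one rule-based and one statistical.

A rule set is obtained from rule templates and written to a grammar file in a custom format that resolves rule redundancy. Compiling this file produces a state transition diagram, which is combined with a first-in, last-out state stack to construct a rule-based ATN (augmented transition network) model. Statistical analysis of a large-scale, chunk-tagged training corpus yields chunk feature information, and a chunk classification standard based on support vector machines (SVM) is defined. Several combinations of chunk features and several multi-class decomposition methods are selected, and training produces the statistical SVM model. The thesis presents the algorithms of both models and their recognition results on Chinese sentences, and analyzes the respective characteristics of the ATN and SVM models.

Experimental results show that both chunk recognition models achieve good precision and recall, which demonstrates their effectiveness. Part of this work has been applied in a practical translation system, where it simplifies sentence structure and improves the overall performance of the machine translation system. It can also be applied to other areas of natural language processing, such as information retrieval and text classification.

Keywords: natural language processing; chunk recognition; automaton; support vector machine; syntactic parsing

Chunk Parsing for Chinese Sentences

Abstract

Syntactic parsing is a key problem in natural language processing. Because complete syntactic parsing has difficulty analyzing large-scale real-world corpora, the concept of shallow parsing was put forward. Chunk parsing plays an important role in shallow parsing. As a partial parse with high certainty, the output of chunk parsing can simplify sentence structure and reduce the complexity of syntactic parsing. Chunk parsing can also resolve most local structural ambiguities.

This paper investigates chunk parsing for Chinese sentences. First, the current state of chunk parsing research is introduced and its significance is analyzed. Building on previous work, the definition of a chunk in Chinese sentences, the chunk tag categories, and the chunk annotation specification are then established, and the internal structure of chunks is analyzed. A system for recognizing chunks in Chinese sentences is designed and implemented, comprising a rule-based model and a statistical model.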
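The precision and recall used to evaluate chunk recognition can be illustrated with a minimal sketch. It assumes that each chunk is represented as a (start, end, label) triple over token positions and that a predicted chunk counts as correct only when its boundaries and label both match a gold chunk exactly. This representation, the matching criterion, the function name `chunk_precision_recall`, and the labels `NP` and `VP` are assumptions made for illustration; they are not the tag set or scoring procedure defined in this work.

```python
from typing import Iterable, Tuple

# Hypothetical chunk representation:
# (start token index, end token index, chunk label)
Chunk = Tuple[int, int, str]


def chunk_precision_recall(
    predicted: Iterable[Chunk], gold: Iterable[Chunk]
) -> Tuple[float, float]:
    """Exact-match precision and recall over chunk spans."""
    pred_set = set(predicted)
    gold_set = set(gold)
    # A chunk is correct only if boundaries and label both match.
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


# Illustrative data only; "NP" and "VP" are placeholder labels.
gold = [(0, 1, "NP"), (2, 2, "VP"), (3, 5, "NP")]
predicted = [(0, 1, "NP"), (2, 2, "VP"), (3, 4, "NP"), (5, 5, "NP")]

precision, recall = chunk_precision_recall(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this example the last gold chunk is split into two predicted chunks, so it counts as missed for recall and both fragments count against precision.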
A rule set is obtained from rule templates and written to a grammar file in a custom format that resolves the rule redundancy introduced by the templates. Compiling the grammar file yields a state transition diagram, which is combined with a first-in, last-out state stack to build the rule-based ATN (augmented transition network) model. Chunk feature information is obtained by statistical analysis of a large-scale corpus in which chunks have already been tagged, and a multi-class chunk classification standard based on support vector machines (SVM) is defined. Several combinations of chunk features and several multi-class decomposition methods are then selected, and training yields the statistical SVM models. This paper gives the algorithms and recognition results of the two models and analyzes the respective characteristics of the ATN model and the SVM model. The results of the two different models for chunk p |