An Introduction to Several Natural Language Processing Tools for Python

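Python is popular with developers for its clear, concise syntax, its ease of use and extensibility, and its large collection of libraries. Its powerful machine learning and math libraries make it a natural choice for natural language processing development.

If you do natural language processing in Python, these are the tools you should know.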

Several natural language processing tools for Python

NLTK: NLTK is a leading platform for processing language data with Python. It provides easy-to-use interfaces to lexical resources such as WordNet, along with text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

Pattern: Pattern's natural language processing tools include a part-of-speech tagger, n-gram search, sentiment analysis, and WordNet. For machine learning it supports the vector space model, clustering, and support vector machines (SVM).

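TextBlob: TextBlob is a Python library for processing text data. It provides a simple API for common natural language processing tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and translation.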

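Gensim: Gensim provides topic modeling, document indexing, and similarity retrieval for large corpora. It can process data that is larger than RAM. Its authors describe it as "the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text."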

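PyNLPl: The full name is the Python Natural Language Processing Library (pronounced "pineapple"). It is a collection of modules for various natural language processing tasks. PyNLPl can be used for n-gram search, for computing frequency lists and distributions, and for building language models. It also handles more complex data structures such as priority queues and more complex algorithms such as beam search.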

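spaCy: spaCy is commercial open-source software. Written in Python and Cython, it offers industrial-strength natural language processing. Its developers describe it as the fastest and most advanced natural language processing tool in the field.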

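Polyglot: Polyglot supports processing of massive text collections in many languages. It supports tokenization for 165 languages, language detection for 196 languages, named entity recognition for 40 languages, part-of-speech tagging for 16 languages, sentiment analysis for 136 languages, word embeddings for 137 languages, morphological analysis for 135 languages, and transliteration for 69 languages.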

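MontyLingua: MontyLingua is a free, pre-trained, end-to-end processor for English text. Feed raw English text into MontyLingua and it returns a semantic interpretation of that text. It is suited to tasks such as information retrieval and extraction, request processing, and question answering. From English text it can extract subject/verb/object tuples; adjective, noun, and verb phrases; names of people, places, and events; dates and times; and other semantic information.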

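BLLIP Parser: BLLIP Parser (also known as the Charniak-Johnson parser) is a statistical natural language parser that combines a generative constituent parser with a maximum entropy reranker. It includes both a command-line interface and a Python interface.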

Quepy: Quepy is a Python framework for transforming natural language questions into queries in a database query language. It can be adapted easily to different kinds of natural language questions and database query languages, so with Quepy you can build your own natural language database query system by changing only a few lines of code. GitHub: https://github.com/machinalis/quepy

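HanLP: HanLP is a Java toolkit made up of a collection of models and algorithms. Its goal is to spread the use of natural language processing in production environments. It goes beyond word segmentation to provide a full set of features, including lexical analysis, syntactic parsing, and semantic understanding. HanLP is feature-complete, efficient, cleanly architected, trained on up-to-date corpora, and customizable. For usage instructions, see the articles "Calling the HanLP natural language processing package from Python" and "How beginners can call HanLP".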

OpenNLP: Chinese named entity recognition

OpenNLP is a full-featured Java natural language processing API from Apache. The following walks through using OpenNLP for named entity recognition on a Chinese corpus.

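First comes preprocessing. Word segmentation, stop word removal, and similar steps are not covered in detail here. In practice it is enough to separate the segmented words with spaces; OpenNLP can then process a corpus in this form the same way it processes English. Some points to watch in character handling are mentioned later.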
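The preprocessing step above can be sketched roughly as follows. This is a minimal illustration, not code from the original project: the class name `SegmentedTextFormatter`, the method `toSpaceSeparated`, and the sample tokens (pinyin stand-ins for segmented Chinese words, used to keep the source ASCII-only) are all assumptions. It takes the output of some word segmenter, drops stop words, and joins what remains with single spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper; not part of OpenNLP or of the original project. */
final class SegmentedTextFormatter {

    /** Drops stop words and empty tokens, then joins the rest with single spaces. */
    static String toSpaceSeparated(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty() && !stopWords.contains(trimmed)) {
                kept.add(trimmed);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Assumed segmenter output for one sentence (pinyin stand-ins).
        List<String> tokens = Arrays.asList("wo", "zai", "Beijing", "de", "daxue", "xuexi");
        // Assumed stop word list.
        Set<String> stopWords = new HashSet<>(Arrays.asList("zai", "de"));
        System.out.println(toSpaceSeparated(tokens, stopWords));
    }
}
```

Each output line can then be treated as one whitespace-tokenized sentence, which is the form the rest of this walkthrough assumes.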

Next, prepare a lexicon for each named entity category. Each lexicon is stored in a text file whose name is the TypeName of that category. The two functions below load the words of one named entity lexicon and derive the category name from the file.

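// Requires: java.io.BufferedReader, java.io.File, java.io.FileReader,
//           java.util.ArrayList, java.util.List

/**
 * Loads the named entities in a lexicon file, one entry per line.
 *
 * @param nameListFile the lexicon file
 * @return the entries, or null if the file does not exist or is a directory
 * @throws Exception if reading the file fails
 */
public static List<String> loadNameWords(File nameListFile)
        throws Exception {
    List<String> nameWords = new ArrayList<String>();
    if (!nameListFile.exists() || nameListFile.isDirectory()) {
        System.err.println("File does not exist: " + nameListFile);
        return null;
    }
    // try-with-resources closes the reader even if readLine() throws
    try (BufferedReader br = new BufferedReader(new FileReader(nameListFile))) {
        String line;
        while ((line = br.readLine()) != null) {
            nameWords.add(line);
        }
    }
    return nameWords;
}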

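/**
 * Returns the named entity type: the lexicon file name without its extension.
 *
 * @param nameListFile the lexicon file
 * @return the entity type name
 */
public static String getNameType(File nameListFile) {
    String nameType = nameListFile.getName();
    int dot = nameType.lastIndexOf(".");
    // Guard against file names that have no extension
    return dot < 0 ? nameType : nameType.substring(0, dot);
}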
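To show how the two functions fit together, here is a small sketch that loads every lexicon file in a directory into a map from entity type to words. It is an assumption about how the pieces could be combined, not code from the original project: `LexiconDirectoryLoader`, `typeNameOf`, and `loadAll` are hypothetical names, and the files are assumed to be UTF-8 encoded with one entry per line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper; not part of OpenNLP or of the original project. */
final class LexiconDirectoryLoader {

    /** The entity type is the file name without its extension. */
    static String typeNameOf(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot > 0 ? name.substring(0, dot) : name;
    }

    /** Reads every regular file in the directory as one lexicon. */
    static Map<String, List<String>> loadAll(Path lexiconDir) throws IOException {
        Map<String, List<String>> lexicons = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(lexiconDir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    lexicons.put(typeNameOf(file),
                            Files.readAllLines(file, StandardCharsets.UTF_8));
                }
            }
        }
        return lexicons;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway lexicon directory so the sketch runs on its own.
        Path dir = Files.createTempDirectory("lexicons");
        Files.write(dir.resolve("Person.txt"),
                Arrays.asList("ZhangSan", "LiSi"), StandardCharsets.UTF_8);
        Files.write(dir.resolve("Place.txt"),
                Arrays.asList("Beijing"), StandardCharsets.UTF_8);
        System.out.println(loadAll(dir));
    }
}
```

Reading with an explicit charset is a deliberate choice here, since Chinese lexicons read with the platform default encoding can silently turn into garbage.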

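OpenNLP expects the training corpus to look like this:

XXXXXX <START:Person> ???? <END> XXXXXXXXX <START:Action> ???? <END> XXXXXXX

Each annotated named entity sits between <START> and <END>, and the start tag also gives the entity's category. Next comes training the named entity recognition model. Here is the code first: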
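As a rough illustration of how such training lines could be produced from a segmented sentence and the lexicons, consider the sketch below. It is not the project's own annotation code (which is not shown in this article): `NameSampleAnnotator` and its methods are hypothetical, the sample tokens are pinyin stand-ins, and only single-token entities are handled.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of OpenNLP or of the original project. */
final class NameSampleAnnotator {

    /** Wraps every token found in a lexicon as "<START:type> token <END>". */
    static String annotate(String sentence, Map<String, List<String>> lexicons) {
        StringBuilder out = new StringBuilder();
        for (String token : sentence.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            String type = typeOf(token, lexicons);
            if (type == null) {
                out.append(token);
            } else {
                // The tags are separated from the token by spaces.
                out.append("<START:").append(type).append("> ")
                   .append(token).append(" <END>");
            }
        }
        return out.toString();
    }

    /** Returns the first entity type whose lexicon contains the token, or null. */
    private static String typeOf(String token, Map<String, List<String>> lexicons) {
        for (Map.Entry<String, List<String>> entry : lexicons.entrySet()) {
            if (entry.getValue().contains(token)) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, List<String>> lexicons = new LinkedHashMap<>();
        lexicons.put("Person", Arrays.asList("ZhangSan"));
        lexicons.put("Place", Arrays.asList("Beijing"));
        // prints: <START:Person> ZhangSan <END> zai <START:Place> Beijing <END> gongzuo
        System.out.println(annotate("ZhangSan zai Beijing gongzuo", lexicons));
    }
}
```

A real annotator would also need to handle entities that span several tokens, which this sketch leaves out on purpose.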

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.featuregen.AggregatedFeatureGenerator;
import opennlp.tools.util.featuregen.PreviousMapFeatureGenerator;
import opennlp.tools.util.featuregen.TokenClassFeatureGenerator;
import opennlp.tools.util.featuregen.TokenFeatureGenerator;
import opennlp.tools.util.featuregen.WindowFeatureGenerator;

/**
 * Training component for a Chinese named entity recognition model.
 *
 * @author ddlovehy
 */
public class NamedEntityMultiFindTrainer {

    // Default parameters
    private int iterations = 80;
    private int cutoff = 5;
    private String langCode = "general";
    private String type = "default";

    // Parameters the caller must set
    private String nameWordsPath; // path to the named entity lexicons
    private String dataPath;      // path to the word-segmented training corpus
    private String modelPath;     // path where the model is stored

    public NamedEntityMultiFindTrainer() {
        super();
    }

    public NamedEntityMultiFindTrainer(String nameWordsPath, String dataPath,
            String modelPath) {
        super();
        this.nameWordsPath = nameWordsPath;
        this.dataPath = dataPath;
        this.modelPath = modelPath;
    }

    public NamedEntityMultiFindTrainer(int iterations, int cutoff,
            String langCode, String type, String nameWordsPath,
            String dataPath, String modelPath) {
        super();
        this.iterations = iterations;
        this.cutoff = cutoff;
        this.langCode = langCode;
        this.type = type;
        this.nameWordsPath = nameWordsPath;
        this.dataPath = dataPath;
        this.modelPath = modelPath;
    }

    /**
     * Builds the custom feature generators.
     *
     * @return the aggregated feature generator
     */
    public AggregatedFeatureGenerator prodFeatureGenerators() {
        // Token and token-class features over a window of 2 previous and 2 next tokens
        AggregatedFeatureGenerator featureGenerators = new AggregatedFeatureGenerator(
                new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
                new WindowFeatureGenerator(new TokenClassFeatureGenerator(), 2,
                        2), new PreviousMapFeatureGenerator());
        return featureGenerators;
    }

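    /**
     * Writes the model to disk.
     *
     * @param model the trained model
     * @throws Exception if the model cannot be written
     */
    public void writeModelIntoDisk(TokenNameFinderModel model) throws Exception {
        File outModelFile = new File(this.getModelPath());
        // try-with-resources closes the stream once the model is serialized
        try (FileOutputStream outModelStream = new FileOutputStream(outModelFile)) {
            model.serialize(outModelStream);
        }
    }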

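    /**
     * Reads the annotated training corpus.
     *
     * @return the annotated training data as a single string
     * @throws Exception if the corpus cannot be read or annotated
     */
    public String getTrainCorpusDataStr() throws Exception {
        // TODO consider loading persisted annotated data directly, and incremental training
        // NameEntityTextFactory belongs to the same project and is not shown here
        String trainDataStr = null;
        trainDataStr = NameEntityTextFactory.prodNameFindTrainText(
                this.getNameWordsPath(), this.getDataPath(), null);
        return trainDataStr;
    }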

    /**
     * Trains the model.
     *
     * @param trainDataStr the annotated training data as a single string
     * @return the trained name finder model
     * @throws Exception if training fails
     */
    public TokenNameFinderModel trainNameEntitySamples(String trainDataStr)
            throws Exception {
        // Wrap the annotated text in a stream of NameSample objects, one per line
        ObjectStream<NameSample> nameEntitySample = new NameSampleDataStream(
                new PlainTextByLineStream(new StringReader(trainDataStr)));
        // Debug output: print the annotated training data
        System.out.println("**************************************");
        System.out.println(trainDataStr);
        TokenNameFinderModel nameFinderModel = NameFinderME.train(
                this.getLangCode(), this.getType(), nameEntitySample,
                this.prodFeatureGenerators(),
                Collections.<String, Object> emptyMap(), this.getIterations(),
                this.getCutoff());
        return nameFinderModel;
    }

    /**
     * Entry point of the training component.
     *
     * @return true if the model was trained and saved, false otherwise
     */
    public boolean execNameFindTrainer() {
        try {
            String trainDataStr = this.getTrainCorpusDataStr();
            TokenNameFinderModel nameFinderModel = this
                    .trainNameEntitySamples(trainDataStr);
            // System.out.println(nameFinderModel);
            this.writeModelIntoDisk(nameFinderModel);
            return true;
        } catch (Exception e) {
            e.printStackTrace();
            return false;
        }
    }

    // Getters and setters for the fields above are omitted from this listing.
}

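Notes:

Parameters: iterations is the number of iterations of the training algorithm. Too few and the model is under-trained; too many and it can overfit, so try a few values and compare the results.

cutoff: the feature frequency threshold. Features that occur fewer than cutoff times in the training data are discarded. A value of 5 is usually fine. A lower value keeps more rare features, which makes the model larger and training slower.

langCode and type: langCode is the language code and type is the entity type. Because the author found no code specifically for Chinese, langCode is set to "general". Because the goal is one model that recognizes several entity types, type is set to "default".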
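The effect of a frequency cutoff can be illustrated with a small standalone sketch. This is not OpenNLP's internal code; `FeatureCutoffDemo`, `keepFrequent`, and the feature strings are hypothetical, and the sketch only shows the general idea of discarding features seen fewer than `cutoff` times.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of a feature frequency cutoff. */
final class FeatureCutoffDemo {

    /** Keeps only the features observed at least {@code cutoff} times. */
    static Set<String> keepFrequent(List<String> observedFeatures, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String feature : observedFeatures) {
            counts.merge(feature, 1, Integer::sum);
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed feature observations collected from a tiny training set.
        List<String> observed = Arrays.asList(
                "w=Beijing", "w=Beijing", "w=zai", "w=Beijing", "w=gongzuo", "w=zai");
        // prints: [w=Beijing, w=zai]
        System.out.println(keepFrequent(observed, 2));
    }
}
```

With a cutoff of 2, the feature seen only once is dropped; with the article's value of 5, a corpus this small would lose every feature, which is why very small training sets may call for a lower cutoff.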

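Explanation:

The prodFeatureGenerators() method builds a custom feature generator, which determines how much context the model sees around each token. The code uses a window of 5 tokens: features are computed from the two tokens before and the two tokens after the candidate token, plus the token itself. There may be a deeper or more precise interpretation; corrections are welcome.

The trainNameEntitySamples() method is the core of model training. It first wraps the annotated training string described above in a character stream, then passes the parameters set earlier, the custom feature generator, and so on to the train() method of NameFinderME. For the resources map, passing an empty Map as the default is enough.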
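The two-before, two-after window described for prodFeatureGenerators() can be sketched on its own as below. This is only an illustration of the windowing idea: `WindowFeatureDemo`, `windowFeatures`, and the feature naming scheme ("w=", "p1w=", "n1w=") are assumptions made for this example and are not claimed to match the strings OpenNLP generates internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of window-based token features. */
final class WindowFeatureDemo {

    /**
     * Collects the token at {@code index} plus up to {@code prev} tokens
     * before it and up to {@code next} tokens after it.
     */
    static List<String> windowFeatures(String[] tokens, int index, int prev, int next) {
        List<String> features = new ArrayList<>();
        features.add("w=" + tokens[index]);
        for (int i = 1; i <= prev; i++) {
            if (index - i >= 0) {
                features.add("p" + i + "w=" + tokens[index - i]);
            }
        }
        for (int i = 1; i <= next; i++) {
            if (index + i < tokens.length) {
                features.add("n" + i + "w=" + tokens[index + i]);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // Pinyin stand-ins for a segmented sentence.
        String[] tokens = {"ZhangSan", "zuotian", "zai", "Beijing", "gongzuo"};
        // prints: [w=zai, p1w=zuotian, p2w=ZhangSan, n1w=Beijing, n2w=gongzuo]
        System.out.println(windowFeatures(tokens, 2, 2, 2));
    }
}
```

Near the start or end of a sentence the window is simply truncated, so a token at index 0 only gets features for itself and the tokens that follow it.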

The source code is available at https://github.com/Ailab403/ailab-mltk4j. The test package contains a complete calling demo, and the file folder contains the test corpus and an already trained model.
