This guide gives an overview of quanteda's basic features and functionality. For more detailed introductions, see the articles at quanteda.io.

Package installation

quanteda is available on CRAN, so it can be installed using your GUI's R package installer, or by executing:

install.packages("quanteda") 

See the instructions at https://github.com/kbenoit/quanteda for installing the GitHub version.

Recommended installations:

We recommend installing the following packages in order to better support and extend quanteda's functionality:
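
(The list of recommended packages is not reproduced here; as one example, the readtext package used later in this guide can be installed in the same way. A minimal sketch:)

install.packages("readtext")  # companion package for reading text files, used below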

Creating a corpus

Load quanteda in order to use the data and functions in the package.

library(quanteda)

Currently available corpus sources

quanteda has a simple and powerful companion package for loading text files: readtext. The main function in this package, readtext(), reads a file or set of files from disk or from a URL and returns a data.frame that can be used directly with the corpus() constructor to create a quanteda corpus.

readtext() can read:

  • plain text (.txt) files;
  • comma-separated-value (.csv) files;
  • XML data;
  • JSON-format data from the Facebook API;
  • JSON-format data from the Twitter API;
  • generic JSON data.

The corpus constructor corpus() works directly on:

  • a vector of character objects, for example texts you have already loaded into your R workspace using other tools;
  • a VCorpus corpus object from the tm package;
  • a data.frame containing a text column and any other document-level metadata (see the sketch after this list).
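
As a minimal sketch of the data.frame case (the column names and values here are invented for illustration; by default corpus() looks for a text column named "text" and treats the remaining columns as docvars):

df <- data.frame(text = c("An example text.", "A second example text."),
                 party = c("A", "B"),
                 stringsAsFactors = FALSE)
corpus(df)   # the "party" column becomes a docvar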

Example: building a corpus from a character vector

The simplest case is to create a corpus from a vector of texts already present in R's memory. This gives the advanced R user complete flexibility over the choice of text inputs, as there are many ways to get a vector of texts into R.

If we already have the texts in this form, we can call the corpus constructor function directly. As an example, we use the texts on immigration policy extracted from the built-in 2010 election manifestos of the UK political parties (data_char_ukimmig2010):

myCorpus <- corpus(data_char_ukimmig2010)  # build a corpus from the texts
summary(myCorpus)
## Corpus consisting of 9 documents:
## 
##          Text Types Tokens Sentences
##           BNP  1125   3280        88
##     Coalition   142    260         4
##  Conservative   251    499        15
##        Greens   322    679        21
##        Labour   298    683        29
##        LibDem   251    483        14
##            PC    77    114         5
##           SNP    88    134         4
##          UKIP   346    723        27
## 
## Source:  /Users/kbenoit/tmp/quanteda/docs/articles/pkgdown/examples/* on x86_64 by kbenoit
## Created: Fri Dec  8 16:38:04 2017
## Notes:

We could also add some document-level variables – what quanteda calls docvars – to this corpus.

We can do this using R's names() function to read the names of the character vector data_char_ukimmig2010 and assign them to a document variable (docvar).

docvars(myCorpus, "Party") <- names(data_char_ukimmig2010)
docvars(myCorpus, "Year") <- 2010
summary(myCorpus)
## Corpus consisting of 9 documents:
## 
##          Text Types Tokens Sentences        Party Year
##           BNP  1125   3280        88          BNP 2010
##     Coalition   142    260         4    Coalition 2010
##  Conservative   251    499        15 Conservative 2010
##        Greens   322    679        21       Greens 2010
##        Labour   298    683        29       Labour 2010
##        LibDem   251    483        14       LibDem 2010
##            PC    77    114         5           PC 2010
##           SNP    88    134         4          SNP 2010
##          UKIP   346    723        27         UKIP 2010
## 
## Source:  /Users/kbenoit/tmp/quanteda/docs/articles/pkgdown/examples/* on x86_64 by kbenoit
## Created: Fri Dec  8 16:38:04 2017
## Notes:

If we wanted to tag each document with additional metadata – attributes of the documents that we want to record – we can also add these to the corpus.

metadoc(myCorpus, "language") <- "english"
metadoc(myCorpus, "docsource")  <- paste("data_char_ukimmig2010", 1:ndoc(myCorpus), sep = "_")
summary(myCorpus, showmeta = TRUE)
## Corpus consisting of 9 documents:
## 
##          Text Types Tokens Sentences        Party Year _language
##           BNP  1125   3280        88          BNP 2010   english
##     Coalition   142    260         4    Coalition 2010   english
##  Conservative   251    499        15 Conservative 2010   english
##        Greens   322    679        21       Greens 2010   english
##        Labour   298    683        29       Labour 2010   english
##        LibDem   251    483        14       LibDem 2010   english
##            PC    77    114         5           PC 2010   english
##           SNP    88    134         4          SNP 2010   english
##          UKIP   346    723        27         UKIP 2010   english
##               _docsource
##  data_char_ukimmig2010_1
##  data_char_ukimmig2010_2
##  data_char_ukimmig2010_3
##  data_char_ukimmig2010_4
##  data_char_ukimmig2010_5
##  data_char_ukimmig2010_6
##  data_char_ukimmig2010_7
##  data_char_ukimmig2010_8
##  data_char_ukimmig2010_9
## 
## Source:  /Users/kbenoit/tmp/quanteda/docs/articles/pkgdown/examples/* on x86_64 by kbenoit
## Created: Fri Dec  8 16:38:04 2017
## Notes:

The metadoc command above allows you to define your own document metadata fields. Note that when assigning the single value "english", R recycles it until it matches the number of documents in the corpus. In creating a simple tag for our custom metadoc field docsource, we used the quanteda function ndoc() to retrieve the number of documents in the corpus. This function is deliberately designed to work like functions you may already use in R, such as nrow() and ncol().
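
A quick illustration of that analogy (a minimal sketch added here for clarity):

ndoc(myCorpus)            # number of documents in the corpus, analogous to nrow()
nrow(docvars(myCorpus))   # the docvars data.frame has one row per document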

Example: loading in files using the readtext package

require(readtext)

# Twitter json
mytf1 <- readtext("~/Dropbox/QUANTESS/social media/zombies/tweets.json")
myCorpusTwitter <- corpus(mytf1)
summary(myCorpusTwitter, 5)
# generic json - requires a "textfield" specifier
mytf2 <- readtext("~/Dropbox/QUANTESS/Manuscripts/collocations/Corpora/sotu/sotu.json",
                  textfield = "text")
summary(corpus(mytf2), 5)
# a text file
mytf3 <- readtext("~/Dropbox/QUANTESS/corpora/project_gutenberg/pg2701.txt", cache = FALSE)
summary(corpus(mytf3), 5)
# multiple text files
mytf4 <- readtext("~/Dropbox/QUANTESS/corpora/inaugural/*.txt", cache = FALSE)
summary(corpus(mytf4), 5)
# multiple text files with docvars taken from the filenames
mytf5 <- readtext("~/Dropbox/QUANTESS/corpora/inaugural/*.txt", 
                  docvarsfrom = "filenames", sep = "-", docvarnames = c("Year", "President"))
summary(corpus(mytf5), 5)
# XML data
mytf6 <- readtext("~/Dropbox/QUANTESS/quanteda_working_files/xmlData/plant_catalog.xml", 
                  textfield = "COMMON")
summary(corpus(mytf6), 5)
# a csv file
write.csv(data.frame(inaugSpeech = texts(data_corpus_inaugural), 
                     docvars(data_corpus_inaugural)),
          file = "/tmp/inaug_texts.csv", row.names = FALSE)
mytf7 <- readtext("/tmp/inaug_texts.csv", textfield = "inaugSpeech")
summary(corpus(mytf7), 5)

How a quanteda corpus works

Corpus principles

A corpus is designed to be a "library" of original documents that have been converted to plain, UTF-8 encoded text and stored along with metadata at the corpus level and at the document level. We have a special name for the document-level metadata: docvars. These are variables or features that describe attributes of each document.

From the point of view of processing and analysis, a corpus is designed to be a fairly static container of texts. This means that the texts in a corpus are not meant to be changed internally through (for example) cleaning or pre-processing steps such as stemming or removing punctuation. Rather, texts can be extracted from the corpus as part of processing and assigned to new objects; the idea is that the corpus remains as an original reference copy, so that other analyses – for instance those that require stems and punctuation, such as computing a reading-ease index – can be performed on the same corpus.
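
To illustrate this design with the corpus created above (a minimal sketch, not part of the original guide): processing creates new, derived objects, while the corpus keeps the original texts intact.

toks <- tokens(myCorpus, remove_punct = TRUE)   # a new, derived object without punctuation
substr(texts(myCorpus)[1], 1, 60)               # the corpus itself still holds the raw text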

To extract texts from a corpus, we use an extractor called texts().

texts(data_corpus_inaugural)[2]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              1793-Washington 
## "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America.\n\nPrevious to the execution of any official act of the President the Constitution requires an oath of office. This oath I am now about to take, and in your presence: That if it shall be found during my administration of the Government I have in any instance violated willingly or knowingly the injunctions thereof, I may (besides incurring constitutional punishment) be subject to the upbraidings of all who are now witnesses of the present solemn ceremony.\n\n "

To summarise the texts in a corpus, we can call a summary() method defined for corpus objects.

summary(data_corpus_irishbudget2010)
## Corpus consisting of 14 documents:
## 
##                                   Text Types Tokens Sentences year debate
##        2010_BUDGET_01_Brian_Lenihan_FF  1953   8641       374 2010 BUDGET
##       2010_BUDGET_02_Richard_Bruton_FG  1040   4446       217 2010 BUDGET
##         2010_BUDGET_03_Joan_Burton_LAB  1624   6393       307 2010 BUDGET
##        2010_BUDGET_04_Arthur_Morgan_SF  1595   7107       343 2010 BUDGET
##          2010_BUDGET_05_Brian_Cowen_FF  1629   6599       250 2010 BUDGET
##           2010_BUDGET_06_Enda_Kenny_FG  1148   4232       153 2010 BUDGET
##      2010_BUDGET_07_Kieran_ODonnell_FG   678   2297       133 2010 BUDGET
##       2010_BUDGET_08_Eamon_Gilmore_LAB  1181   4177       201 2010 BUDGET
##     2010_BUDGET_09_Michael_Higgins_LAB   488   1286        44 2010 BUDGET
##        2010_BUDGET_10_Ruairi_Quinn_LAB   439   1284        59 2010 BUDGET
##      2010_BUDGET_11_John_Gormley_Green   401   1030        49 2010 BUDGET
##        2010_BUDGET_12_Eamon_Ryan_Green   510   1643        90 2010 BUDGET
##      2010_BUDGET_13_Ciaran_Cuffe_Green   442   1240        45 2010 BUDGET
##  2010_BUDGET_14_Caoimhghin_OCaolain_SF  1188   4044       176 2010 BUDGET
##  number      foren     name party
##      01      Brian  Lenihan    FF
##      02    Richard   Bruton    FG
##      03       Joan   Burton   LAB
##      04     Arthur   Morgan    SF
##      05      Brian    Cowen    FF
##      06       Enda    Kenny    FG
##      07     Kieran ODonnell    FG
##      08      Eamon  Gilmore   LAB
##      09    Michael  Higgins   LAB
##      10     Ruairi    Quinn   LAB
##      11       John  Gormley Green
##      12      Eamon     Ryan Green
##      13     Ciaran    Cuffe Green
##      14 Caoimhghin OCaolain    SF
## 
## Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/* on x86_64 by kbenoit
## Created: Wed Jun 28 22:04:18 2017
## Notes:

We can save the output of the summary command as a data frame and plot some basic descriptive statistics from this information:

tokenInfo <- summary(data_corpus_inaugural)
if (require(ggplot2))
    ggplot(data=tokenInfo, aes(x = Year, y = Tokens, group = 1)) + geom_line() + geom_point() +
        scale_x_discrete(labels = c(seq(1789,2012,12)), breaks = seq(1789,2012,12) ) 
## Loading required package: ggplot2

# the longest inaugural address: William Henry Harrison
tokenInfo[which.max(tokenInfo$Tokens), ] 
## Corpus consisting of 58 documents:
## 
##           Text Types Tokens Sentences Year President     FirstName
##  1841-Harrison  1896   9144       210 1841  Harrison William Henry
## 
## Source:  Gerhard Peters and John T. Woolley. The American Presidency Project.
## Created: Tue Jun 13 14:51:47 2017
## Notes:   http://www.presidency.ucsb.edu/inaugurals.php

Tools for handling corpus objects

Adding two corpus objects together

The + operator provides a simple method for concatenating two corpus objects. If they contain different sets of document-level variables, these will be stitched together so that no information is lost. Corpus-level metadata is also concatenated.

library(quanteda)
mycorpus1 <- corpus(data_corpus_inaugural[1:5])
mycorpus2 <- corpus(data_corpus_inaugural[53:58])
mycorpus3 <- mycorpus1 + mycorpus2
summary(mycorpus3)
## Corpus consisting of 11 documents:
## 
##             Text Types Tokens Sentences
##  1789-Washington   625   1538        23
##  1793-Washington    96    147         4
##       1797-Adams   826   2578        37
##   1801-Jefferson   717   1927        41
##   1805-Jefferson   804   2381        45
##     1997-Clinton   773   2449       111
##        2001-Bush   621   1808        97
##        2005-Bush   773   2319       100
##       2009-Obama   938   2711       110
##       2013-Obama   814   2317        88
##       2017-Trump   582   1660        88
## 
## Source:  Combination of corpuses mycorpus1 and mycorpus2
## Created: Fri Dec  8 16:38:05 2017
## Notes:

Subsetting corpus objects

corpus_subset() is a function defined for corpus objects, used to extract a subset of a corpus based on logical conditions applied to docvars:

summary(corpus_subset(data_corpus_inaugural, Year > 1990))
## Corpus consisting of 7 documents:
## 
##          Text Types Tokens Sentences Year President FirstName
##  1993-Clinton   642   1833        81 1993   Clinton      Bill
##  1997-Clinton   773   2449       111 1997   Clinton      Bill
##     2001-Bush   621   1808        97 2001      Bush George W.
##     2005-Bush   773   2319       100 2005      Bush George W.
##    2009-Obama   938   2711       110 2009     Obama    Barack
##    2013-Obama   814   2317        88 2013     Obama    Barack
##    2017-Trump   582   1660        88 2017     Trump Donald J.
## 
## Source:  Gerhard Peters and John T. Woolley. The American Presidency Project.
## Created: Tue Jun 13 14:51:47 2017
## Notes:   http://www.presidency.ucsb.edu/inaugurals.php
summary(corpus_subset(data_corpus_inaugural, President == "Adams"))
## Corpus consisting of 2 documents:
## 
##        Text Types Tokens Sentences Year President   FirstName
##  1797-Adams   826   2578        37 1797     Adams        John
##  1825-Adams  1003   3152        74 1825     Adams John Quincy
## 
## Source:  Gerhard Peters and John T. Woolley. The American Presidency Project.
## Created: Tue Jun 13 14:51:47 2017
## Notes:   http://www.presidency.ucsb.edu/inaugurals.php

Exploring corpus texts

The kwic function (keywords-in-context) searches for a specified word and shows the contexts in which it occurs:

kwic(data_corpus_inaugural, "terror")
##                                                                     
##     [1797-Adams, 1325]              fraud or violence, by | terror |
##  [1933-Roosevelt, 112] nameless, unreasoning, unjustified | terror |
##  [1941-Roosevelt, 287]      seemed frozen by a fatalistic | terror |
##    [1961-Kennedy, 866]    alter that uncertain balance of | terror |
##     [1981-Reagan, 813]     freeing all Americans from the | terror |
##   [1997-Clinton, 1055]        They fuel the fanaticism of | terror |
##   [1997-Clinton, 1655]  maintain a strong defense against | terror |
##     [2009-Obama, 1632]     advance their aims by inducing | terror |
##                                   
##  , intrigue, or venality          
##  which paralyzes needed efforts to
##  , we proved that this            
##  that stays the hand of           
##  of runaway living costs.         
##  . And they torment the           
##  and destruction. Our children    
##  and slaughtering innocents, we
kwic(data_corpus_inaugural, "terror", valuetype = "regex")
##                                                                           
##     [1797-Adams, 1325]                   fraud or violence, by |  terror  
##  [1933-Roosevelt, 112]      nameless, unreasoning, unjustified |  terror  
##  [1941-Roosevelt, 287]           seemed frozen by a fatalistic |  terror  
##    [1961-Kennedy, 866]         alter that uncertain balance of |  terror  
##    [1961-Kennedy, 990]               of science instead of its |  terrors 
##     [1981-Reagan, 813]          freeing all Americans from the |  terror  
##    [1981-Reagan, 2196]        understood by those who practice | terrorism
##   [1997-Clinton, 1055]             They fuel the fanaticism of |  terror  
##   [1997-Clinton, 1655]       maintain a strong defense against |  terror  
##     [2009-Obama, 1632]          advance their aims by inducing |  terror  
##     [2017-Trump, 1117] civilized world against radical Islamic | terrorism
##                                     
##  | , intrigue, or venality          
##  | which paralyzes needed efforts to
##  | , we proved that this            
##  | that stays the hand of           
##  | . Together let us explore        
##  | of runaway living costs.         
##  | and prey upon their neighbors    
##  | . And they torment the           
##  | and destruction. Our children    
##  | and slaughtering innocents, we   
##  | , which we will eradicate
kwic(data_corpus_inaugural, "communist*")
##                                                                   
##   [1949-Truman, 834] the actions resulting from the | Communist  |
##  [1961-Kennedy, 519]             -- not because the | Communists |
##                            
##  philosophy are a threat to
##  may be doing it,

In the summaries above, Year and President are variables associated with each document. We can access these variables with the docvars() function.

# inspect the document-level variables
head(docvars(data_corpus_inaugural))
##                 Year  President FirstName
## 1789-Washington 1789 Washington    George
## 1793-Washington 1793 Washington    George
## 1797-Adams      1797      Adams      John
## 1801-Jefferson  1801  Jefferson    Thomas
## 1805-Jefferson  1805  Jefferson    Thomas
## 1809-Madison    1809    Madison     James
# inspect the corpus-level metadata
metacorpus(data_corpus_inaugural)
## $source
## [1] "Gerhard Peters and John T. Woolley. The American Presidency Project."
## 
## $notes
## [1] "http://www.presidency.ucsb.edu/inaugurals.php"
## 
## $created
## [1] "Tue Jun 13 14:51:47 2017"

More corpus resources are available from the quantedaData package.

Extracting features from a corpus

In order to perform statistical analyses such as document scaling, we must extract a matrix associating certain features with documents. In quanteda, the dfm function produces such a matrix. "dfm" is short for document-feature matrix; its rows are always documents and its columns are "features". We fix the rows and columns this way because it is standard in data analysis to treat the unit of analysis as a row and the features or variables pertaining to each unit as columns. We call the columns "features" rather than "terms" because features are more general: terms can be defined as raw terms, stemmed terms, part-of-speech-tagged terms, terms after stopword removal, or the dictionary class to which a term belongs. Features can be entirely general, such as ngrams or syntactic dependencies, and we leave the definition open-ended.
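
As a minimal toy sketch (not in the original guide) of what such a matrix looks like, with documents as rows and features as columns of counts:

dfm(c(doc1 = "apple banana apple", doc2 = "banana cherry"))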

Tokenizing texts

To simply tokenize a text, quanteda provides a powerful command called tokens(). This produces a list of tokens in the form of character vectors, where each list element corresponds to an input document.

tokens() is deliberately conservative, meaning that it does not remove anything from the text unless told to do so.

txt <- c(text1 = "This is $10 in 999 different ways,\n up and down; left and right!", 
         text2 = "@kenbenoit working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.")
tokens(txt)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "$"         "10"        "in"       
##  [6] "999"       "different" "ways"      ","         "up"       
## [11] "and"       "down"      ";"         "left"      "and"      
## [16] "right"     "!"        
## 
## text2 :
##  [1] "@kenbenoit"     "working"        ":"              "on"            
##  [5] "#quanteda"      "2day"           "4ever"          ","             
##  [9] "http"           ":"              "/"              "/"             
## [13] "textasdata.com" "?"              "page"           "="             
## [17] "123"            "."
tokens(txt, remove_numbers = TRUE,  remove_punct = TRUE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "in"        "different" "ways"     
##  [6] "up"        "and"       "down"      "left"      "and"      
## [11] "right"    
## 
## text2 :
## [1] "@kenbenoit"     "working"        "on"             "#quanteda"     
## [5] "2day"           "4ever"          "http"           "textasdata.com"
## [9] "page"
tokens(txt, remove_numbers = FALSE, remove_punct = TRUE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "10"        "in"        "999"      
##  [6] "different" "ways"      "up"        "and"       "down"     
## [11] "left"      "and"       "right"    
## 
## text2 :
##  [1] "@kenbenoit"     "working"        "on"             "#quanteda"     
##  [5] "2day"           "4ever"          "http"           "textasdata.com"
##  [9] "page"           "123"
tokens(txt, remove_numbers = TRUE,  remove_punct = FALSE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "$"         "in"        "different"
##  [6] "ways"      ","         "up"        "and"       "down"     
## [11] ";"         "left"      "and"       "right"     "!"        
## 
## text2 :
##  [1] "@kenbenoit"     "working"        ":"              "on"            
##  [5] "#quanteda"      "2day"           "4ever"          ","             
##  [9] "http"           ":"              "/"              "/"             
## [13] "textasdata.com" "?"              "page"           "="             
## [17] "."
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "$"         "10"        "in"       
##  [6] "999"       "different" "ways"      ","         "up"       
## [11] "and"       "down"      ";"         "left"      "and"      
## [16] "right"     "!"        
## 
## text2 :
##  [1] "@kenbenoit"     "working"        ":"              "on"            
##  [5] "#quanteda"      "2day"           "4ever"          ","             
##  [9] "http"           ":"              "/"              "/"             
## [13] "textasdata.com" "?"              "page"           "="             
## [17] "123"            "."
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE, remove_separators = FALSE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      " "         "is"        " "         "$"        
##  [6] "10"        " "         "in"        " "         "999"      
## [11] " "         "different" " "         "ways"      ","        
## [16] "\n"        " "         "up"        " "         "and"      
## [21] " "         "down"      ";"         " "         "left"     
## [26] " "         "and"       " "         "right"     "!"        
## 
## text2 :
##  [1] "@kenbenoit"     " "              "working"        ":"             
##  [5] " "              "on"             " "              "#quanteda"     
##  [9] " "              "2day"           "\t"             "4ever"         
## [13] ","              " "              "http"           ":"             
## [17] "/"              "/"              "textasdata.com" "?"             
## [21] "page"           "="              "123"            "."

We can also tokenize by character:

tokens("Great website: http://textasdata.com?page=123.", what = "character")
## tokens from 1 document.
## text1 :
##  [1] "G" "r" "e" "a" "t" "w" "e" "b" "s" "i" "t" "e" ":" "h" "t" "t" "p"
## [18] ":" "/" "/" "t" "e" "x" "t" "a" "s" "d" "a" "t" "a" "." "c" "o" "m"
## [35] "?" "p" "a" "g" "e" "=" "1" "2" "3" "."
tokens("Great website: http://textasdata.com?page=123.", what = "character", 
         remove_separators = FALSE)
## tokens from 1 document.
## text1 :
##  [1] "G" "r" "e" "a" "t" " " "w" "e" "b" "s" "i" "t" "e" ":" " " "h" "t"
## [18] "t" "p" ":" "/" "/" "t" "e" "x" "t" "a" "s" "d" "a" "t" "a" "." "c"
## [35] "o" "m" "?" "p" "a" "g" "e" "=" "1" "2" "3" "."

As well as by sentence:

# sentence level         
tokens(c("Kurt Vongeut said; only assholes use semi-colons.", 
           "Today is Thursday in Canberra:  It is yesterday in London.", 
           "En el caso de que no puedas ir con ellos, ¿quieres ir con nosotros?"), 
          what = "sentence")
## tokens from 3 documents.
## text1 :
## [1] "Kurt Vongeut said; only assholes use semi-colons."
## 
## text2 :
## [1] "Today is Thursday in Canberra:  It is yesterday in London."
## 
## text3 :
## [1] "En el caso de que no puedas ir con ellos, ¿quieres ir con nosotros?"

Constructing a document-feature matrix

Tokenizing texts is only an intermediate step, and most users will want to go straight to constructing a document-feature matrix. For this we provide a Swiss-army-knife function, dfm(), which performs tokenization and tabulates the extracted features into a document-feature matrix. Unlike the conservative approach taken by tokens(), dfm() applies certain options by default, such as tolower() – a separate function for lower-casing texts – and removal of punctuation. All of the options available in tokens() can, however, be passed to dfm().

myCorpus <- corpus_subset(data_corpus_inaugural, Year > 1990)

# make a dfm
myDfm <- dfm(myCorpus)
myDfm[, 1:5]
## Document-feature matrix of: 7 documents, 5 features (0% sparse).
## 7 x 5 sparse Matrix of class "dfm"
##               features
## docs           my fellow citizens   , today
##   1993-Clinton  7      5        2 139    10
##   1997-Clinton  6      7        7 131     5
##   2001-Bush     3      1        9 110     2
##   2005-Bush     2      3        6 120     3
##   2009-Obama    2      1        1 130     6
##   2013-Obama    3      3        6  99     4
##   2017-Trump    1      1        4  96     4

Other options for dfm() include removing stopwords and stemming the tokens.

# make a dfm, removing stopwords and applying stemming

myStemMat <- dfm(myCorpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
myStemMat[, 1:5]
## Document-feature matrix of: 7 documents, 5 features (17.1% sparse).
## 7 x 5 sparse Matrix of class "dfm"
##               features
## docs           fellow citizen today celebr mysteri
##   1993-Clinton      5       2    10      4       1
##   1997-Clinton      7       8     6      1       0
##   2001-Bush         1      10     2      0       0
##   2005-Bush         3       7     3      2       0
##   2009-Obama        1       1     6      2       0
##   2013-Obama        3       8     6      1       0
##   2017-Trump        1       4     5      3       1

The remove option supplies a list of tokens to be discarded. Most users will supply a pre-defined list of "stop words", defined for numerous languages and accessible through the stopwords() function:

head(stopwords("english"), 20)
##  [1] "i"          "me"         "my"         "myself"     "we"        
##  [6] "our"        "ours"       "ourselves"  "you"        "your"      
## [11] "yours"      "yourself"   "yourselves" "he"         "him"       
## [16] "his"        "himself"    "she"        "her"        "hers"
head(stopwords("russian"), 10)
##  [1] "и"   "в"   "во"  "не"  "что" "он"  "на"  "я"   "с"   "со"
head(stopwords("arabic"), 10)
##  [1] "فى"  "في"  "كل"  "لم"  "لن"  "له"  "من"  "هو"  "هي"  "قوة"

Viewing the document-feature matrix

The dfm can be inspected in the Environment pane of RStudio, or by calling R's View function. Calling plot on a dfm will display a word cloud using the wordcloud package.

mydfm <- dfm(data_char_ukimmig2010, remove = stopwords("english"), remove_punct = TRUE)
mydfm
## Document-feature matrix of: 9 documents, 1,547 features (83.8% sparse).

To access a list of the most frequently occurring features, we can use topfeatures():

topfeatures(mydfm, 20)  # 20 most frequently occurring words
## immigration     british      people      asylum     britain          uk 
##          66          37          35          29          28          27 
##      system  population     country         new  immigrants      ensure 
##          27          21          20          19          17          17 
##       shall citizenship      social    national         bnp     illegal 
##          17          16          14          14          13          13 
##        work     percent 
##          13          12

A word cloud of a dfm object can be plotted with textplot_wordcloud(). This function passes arguments through to wordcloud() from the wordcloud package, and the same arguments can be used to fine-tune the plot:

set.seed(100)
textplot_wordcloud(mydfm, min.freq = 6, random.order = FALSE,
                   rot.per = .25, 
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))

Grouping documents by document variable

Often we are interested in analysing how texts differ according to substantive factors that may be encoded in the document variables, rather than simply by the boundaries of the document files. We can group documents that share the same value of a document variable when creating a dfm:

byPartyDfm <- dfm(data_corpus_irishbudget2010, groups = "party", 
                  remove = stopwords("english"), remove_punct = TRUE)

We can sort this dfm and inspect it:

dfm_sort(byPartyDfm)[, 1:10]
## Document-feature matrix of: 5 documents, 10 features (0% sparse).
## 5 x 10 sparse Matrix of class "dfm"
##        features
## docs    people budget government public minister tax economy pay jobs
##   FF        23     44         47     65       11  60      37  41   41
##   FG        78     71         61     47       62  11      20  29   17
##   LAB       69     66         36     32       54  47      37  24   20
##   SF        81     53         73     31       39  34      50  24   27
##   Green     15     26         19      4        4  11      16   4   15
##        features
## docs    billion
##   FF         32
##   FG         21
##   LAB        34
##   SF         29
##   Green       3

Note that the most frequently occurring feature is "will", a word usually found on English stop lists, but one that is not included in quanteda's built-in English stopword list.
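
If we wanted to drop "will" as well, one option (a sketch, not part of the original guide; the object name here is made up) is simply to add it to the removal list when building the grouped dfm:

byPartyDfmNoWill <- dfm(data_corpus_irishbudget2010, groups = "party",
                        remove = c(stopwords("english"), "will"), remove_punct = TRUE)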

Grouping words by dictionary or equivalence class

For some applications we have prior knowledge of sets of words in the texts that we are interested in. For example, a general list of positive words might indicate positive sentiment in a movie review, or we might have a dictionary of political terms associated with a particular ideological stance. In these cases it is useful, for the purposes of analysis, to treat these groups of words as equivalent and to sum their counts into classes.

For example, let's look at how words associated with terrorism and words associated with the economy vary by president in the inaugural-address corpus. From the original corpus, we select the presidents since Clinton:

recentCorpus <- corpus_subset(data_corpus_inaugural, Year > 1991)

Now we define a demonstration dictionary:

myDict <- dictionary(list(terror = c("terrorism", "terrorists", "threat"),
                          economy = c("jobs", "business", "grow", "work")))

We can use the dictionary when making the dfm:

byPresMat <- dfm(recentCorpus, dictionary = myDict)
byPresMat
## Document-feature matrix of: 7 documents, 2 features (14.3% sparse).
## 7 x 2 sparse Matrix of class "dfm"
##               features
## docs           terror economy
##   1993-Clinton      0       8
##   1997-Clinton      1       8
##   2001-Bush         0       4
##   2005-Bush         1       6
##   2009-Obama        1      10
##   2013-Obama        1       6
##   2017-Trump        1       5

The constructor dictionary() also works with two common "foreign" dictionary formats: LIWC and Provalis Research's Wordstat. For example, we can load the LIWC dictionary and apply it to the presidential inaugural-address corpus:

liwcdict <- dictionary(file = "~/Dropbox/QUANTESS/dictionaries/LIWC/LIWC2001_English.dic",
                       format = "LIWC")
liwcdfm <- dfm(data_corpus_inaugural[52:58], dictionary = liwcdict)
liwcdfm[, 1:10]

Further examples

Similarities between texts

presDfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980), 
               remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
obamaSimil <- textstat_simil(presDfm, c("2009-Obama" , "2013-Obama"), 
                             margin = "documents", method = "cosine")
obamaSimil
##              2009-Obama 2013-Obama
## 2009-Obama    1.0000000  0.6815711
## 2013-Obama    0.6815711  1.0000000
## 1981-Reagan   0.6229949  0.6376412
## 1985-Reagan   0.6434472  0.6629428
## 1989-Bush     0.6253944  0.5784290
## 1993-Clinton  0.6280946  0.6265428
## 1997-Clinton  0.6593018  0.6466030
## 2001-Bush     0.6018113  0.6193608
## 2005-Bush     0.5266249  0.5867178
## 2017-Trump    0.5192075  0.5160104
# dotchart(as.list(obamaSimil)$"2009-Obama", xlab = "Cosine similarity")

We can use these distances to plot a dendrogram, clustering the presidents:

data(data_corpus_SOTU, package = "quantedaData")
presDfm <- dfm(corpus_subset(data_corpus_SOTU, Date > as.Date("1980-01-01")), 
               stem = TRUE, remove_punct = TRUE,
               remove = stopwords("english"))
presDfm <- dfm_trim(presDfm, min_count = 5, min_docfreq = 3)

# hierarchical clustering - get distances on the normalized dfm
presDistMat <- textstat_dist(dfm_weight(presDfm, "relfreq"))
# cluster the texts by distance
presCluster <- hclust(presDistMat)
# label with document names
presCluster$labels <- docnames(presDfm)
# plot as a dendrogram
plot(presCluster, xlab = "", sub = "", main = "Euclidean Distance on Normalized Token Frequency")

(Try it!)

We can also look at feature similarities:

sim <- textstat_simil(presDfm, c("fair", "health", "terror"), method = "cosine", margin = "features")
lapply(as.list(sim), head, 10)
## $fair
##   economi     begin jefferson    author     faith      call   struggl 
## 0.9080252 0.9075951 0.8981462 0.8944272 0.8866586 0.8608285 0.8451543 
##      best     creat    courag 
## 0.8366600 0.8347300 0.8326664

Scaling document positions

We have done a lot of development work on the textmodel() functions; here we demonstrate only unsupervised document scaling using the "wordfish" model:

# make prettier document names
ieDfm <- dfm(data_corpus_irishbudget2010)
textmodel_wordfish(ieDfm, dir = c(2, 1))
## Fitted wordfish model:
## Call:
##  textmodel_wordfish.dfm(x = ieDfm, dir = c(2, 1))
## 
## Estimated document positions:
## 
##                                Documents      theta         SE       lower
## 1        2010_BUDGET_01_Brian_Lenihan_FF  1.8209510 0.02032339  1.78111713
## 2       2010_BUDGET_02_Richard_Bruton_FG -0.5932802 0.02818842 -0.64852951
## 3         2010_BUDGET_03_Joan_Burton_LAB -1.1136779 0.01540263 -1.14386709
## 4        2010_BUDGET_04_Arthur_Morgan_SF -0.1219298 0.02846327 -0.17771780
## 5          2010_BUDGET_05_Brian_Cowen_FF  1.7724207 0.02364089  1.72608459
## 6           2010_BUDGET_06_Enda_Kenny_FG -0.7145794 0.02650264 -0.76652460
## 7      2010_BUDGET_07_Kieran_ODonnell_FG -0.4844830 0.04171487 -0.56624419
## 8       2010_BUDGET_08_Eamon_Gilmore_LAB -0.5616689 0.02967367 -0.61982934
## 9     2010_BUDGET_09_Michael_Higgins_LAB -0.9703121 0.03850559 -1.04578302
## 10       2010_BUDGET_10_Ruairi_Quinn_LAB -0.9589243 0.03892391 -1.03521514
## 11     2010_BUDGET_11_John_Gormley_Green  1.1807221 0.07221447  1.03918176
## 12       2010_BUDGET_12_Eamon_Ryan_Green  0.1866466 0.06294124  0.06328178
## 13     2010_BUDGET_13_Ciaran_Cuffe_Green  0.7421919 0.07245427  0.60018149
## 14 2010_BUDGET_14_Caoimhghin_OCaolain_SF -0.1840766 0.03666272 -0.25593554
##         upper
## 1   1.8607848
## 2  -0.5380309
## 3  -1.0834888
## 4  -0.0661418
## 5   1.8187569
## 6  -0.6626343
## 7  -0.4027219
## 8  -0.5035086
## 9  -0.8948411
## 10 -0.8826334
## 11  1.3222625
## 12  0.3100115
## 13  0.8842022
## 14 -0.1122177
## 
## Estimated feature scores: showing first 30 beta-hats for features
## 
##            when               i       presented             the 
##     -0.09920434      0.38802251      0.39879175      0.25594319 
##   supplementary          budget              to            this 
##      1.11586298      0.09915369      0.37007812      0.30693462 
##           house            last           april               , 
##      0.19906993      0.28971698     -0.09526246      0.34535194 
##            said              we           could            work 
##     -0.71931049      0.47992225     -0.52976488      0.58227295 
##             our             way         through          period 
##      0.74373366      0.33611523      0.65982719      0.55621770 
##              of          severe        economic        distress 
##      0.33932222      1.27910737      0.47867020      1.84454705 
##               .           today             can          report 
##      0.27352513      0.17419740      0.36378497      0.69176130 
##            that notwithstanding 
##      0.08833468      1.84454705

Topic models

quanteda also makes it very easy to fit topic models:

quantdfm <- dfm(data_corpus_irishbudget2010, 
                remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english"))
quantdfm <- dfm_trim(quantdfm, min_count = 4, max_docfreq = 10)
quantdfm
## Document-feature matrix of: 14 documents, 1,263 features (64.5% sparse).
if (require(topicmodels)) {
    myLDAfit20 <- LDA(convert(quantdfm, to = "topicmodels"), k = 20)
    get_terms(myLDAfit20, 5)
}
## Loading required package: topicmodels
##      Topic 1       Topic 2      Topic 3      Topic 4    Topic 5   
## [1,] "kind"        "million"    "spending"   "million"  "welfare" 
## [2,] "imagination" "support"    "scheme"     "welfare"  "fáil"    
## [3,] "policies"    "welfare"    "measures"   "measures" "stimulus"
## [4,] "wit"         "investment" "reduction"  "system"   "parties" 
## [5,] "create"      "back"       "investment" "reduce"   "system"  
##      Topic 6       Topic 7     Topic 8     Topic 9   Topic 10       
## [1,] "alternative" "taoiseach" "taoiseach" "review"  "families"     
## [2,] "citizenship" "employees" "fine"      "million" "million"      
## [3,] "wealth"      "rate"      "gael"      "bill"    "benefit"      
## [4,] "adjustment"  "referred"  "may"       "pension" "international"
## [5,] "breaks"      "debate"    "irish"     "per"     "believe"      
##      Topic 11   Topic 12 Topic 13       Topic 14    Topic 15   
## [1,] "fianna"   "fianna" "society"      "care"      "welfare"  
## [2,] "fáil"     "fáil"   "enterprising" "per"       "private"  
## [3,] "national" "side"   "sense"        "allowance" "families" 
## [4,] "irish"    "level"  "equal"        "million"   "provision"
## [5,] "support"  "third"  "nation"       "hospital"  "fianna"   
##      Topic 16     Topic 17    Topic 18   Topic 19  Topic 20 
## [1,] "confidence" "failed"    "system"   "levy"    "child"  
## [2,] "programme"  "strategy"  "welfare"  "million" "benefit"
## [3,] "even"       "needed"    "package"  "carbon"  "day"    
## [4,] "change"     "ministers" "taxation" "change"  "today"  
## [5,] "real"       "system"    "sinn"     "welfare" "bank"

Note: the guide above was translated from the English-language quickstart.

Working with Chinese documents in quanteda

Chinese stop words: taken from the Baidu stopword list

# load Chinese stopwords
csw <- stopwords("chinese")

tokens("中华人民共和国成立于1949 年")
## tokens from 1 document.
## text1 :
## [1] "中华"   "人民"   "共和国" "成立"   "于"     "1949"   "年"
# remove stopwords
tokens("中华人民共和国成立于1949 年") %>%
    tokens_remove(stopwords("chinese"))
## tokens from 1 document.
## text1 :
## [1] "中华"   "人民"   "共和国" "成立"   "1949"

Example: analysis of Chinese documents

Forty-nine "Government Work Report" speeches by Chinese premiers, 1954–2017

# load the data file
load("data/data_corpus_chinesegovreport.rda")
summary(data_corpus_chinesegovreport, 10)
## Warning in nsentence.character(object, ...): nsentence() does not correctly
## count sentences in all lower-cased text
## Corpus consisting of 49 documents, showing 10 documents:
## 
##    Text Types Tokens Sentences                           doc_id Year
##   text1  2230  14093       453      1954政府工作报告_周恩来.txt 1954
##   text2  3053  35127       981      1955政府工作报告_李富春.txt 1955
##   text3  1514  13164       495      1956政府工作报告_李先念.txt 1956
##   text4  2585  21737       704      1957政府工作报告_周恩来.txt 1957
##   text5  2185  15164       412      1958政府工作报告_薄一波.txt 1958
##   text6  2512  19217       577      1959政府工作报告_周恩来.txt 1959
##   text7  1302   6268       164      1960政府工作报告_谭震林.txt 1960
##   text8  1892  11676       387 1964政府工作报告_周恩来_摘要.txt 1964
##   text9   966   3187       125      1975政府工作报告_周恩来.txt 1975
##  text10  2964  19119       659      1978政府工作报告_华国锋.txt 1978
##  Premier
##   周恩来
##   李富春
##   李先念
##   周恩来
##   薄一波
##   周恩来
##   谭震林
##   周恩来
##   周恩来
##   华国锋
## 
## Source:  /home/wangh52/quanteda_2/vignettes/pkgdown_only/* on x86_64 by wangh52
## Created: Mon Nov 27 13:29:27 2017
## Notes:
# tokenize
chtoks <- 
    tokens(data_corpus_chinesegovreport, remove_punct = TRUE) %>%
    tokens_remove(stopwords("chinese"))

# create a dfm
chdfm <- dfm(chtoks)
topfeatures(chdfm)
## 发展 经济 社会 建设 改革 人民 主义 工作 企业 国家 
## 5627 5036 4255 4248 2931 2897 2817 2642 2627 2595

# plot a word cloud
set.seed(100)
chdfm_tf <- dfm_trim(chdfm, min_count = 500)
# set a font suitable for macOS
if (Sys.info()['sysname'] == "Darwin") par(family = "STSong") 
textplot_wordcloud(chdfm_tf, min.freq = 6, random.order = FALSE,
                   rot.per = .25, max.words = 100, scale = c(2.8, .5),
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))

Feature co-occurrence matrix

# fcm within a co-occurrence window of 5, 2017 government work report
ch17txt <- corpus_subset(data_corpus_chinesegovreport, Year == "2017")
ch17toks <- 
    tokens(ch17txt, remove_punct = TRUE) %>% 
    tokens_remove(stopwords("chinese"))
chfcm <- fcm(ch17toks, context = "window")  
topfeatures(chfcm["改革", ])
## 推进 制度   性 体制 完善   试   点 供给   侧 结构 
##   23   19   11   11   10    8    8    7    7    7
# ordered fcm within the window: "改革" followed by another word
chfcm <- fcm(ch17toks, context = "window", ordered = TRUE, tri = FALSE)
topfeatures(chfcm["改革", ])
## 完善 发展   试   点 制度 全面 推进 深化 社会 开放 
##   10    8    8    8    8    7    7    6    5    5

Text scaling model

wfm <- textmodel_wordfish(chdfm)
y <- 1954:2017
y <- y[-which(y==1963 | y==1961 |y==1962 | (y>1964 & y<1975) | y==1976 |y==1977)]
plot(y, wfm@theta, xlab = "Year", ylab = "Position")

Collocations - bigrams / trigrams, etc.

# bigram collocations across all reports
ch_mw <- textstat_collocations(chtoks, size = 2, min_count = 20, tolower = TRUE)
head(ch_mw, 10)
##    collocation count count_nested length   lambda         z
## 1    社会 主义  1787            0      2 5.661200 128.61741
## 2        亿 元   689            0      2 7.445259  93.01487
## 3      现代 化   632            0      2 6.950471  83.55186
## 4    体制 改革   504            0      2 5.193065  77.37908
## 5    五年 计划   341            0      2 5.359068  71.64727
## 6    各级 政府   306            0      2 6.110596  66.63454
## 7    增长 百分   300            0      2 5.520761  65.88051
## 8        万 吨   212            0      2 6.589957  62.56344
## 9    国民 经济   589            0      2 6.014672  61.80426
## 10   充分 发挥   191            0      2 6.584129  61.24885
# bigram collocations in the 2017 report
ch_mw17 <- textstat_collocations(ch17toks, size = 2, tolower = TRUE)
head(ch_mw17, 10)
##    collocation count count_nested length   lambda        z
## 1    人民 群众    12            0      2 5.406485 12.89405
## 2        亿 元    14            0      2 8.302483 12.62130
## 3        调 控    11            0      2 7.593472 12.41243
## 4    政府 工作     9            0      2 4.709870 11.07905
## 5    深入 实施     8            0      2 5.018234 10.92377
## 6      党 中央     7            0      2 5.746878 10.90837
## 7    体制 改革    11            0      2 5.317035 10.53518
## 8    国内 生产     6            0      2 6.166520 10.48816
## 9      现代 化     8            0      2 5.705688 10.43435
## 10   基础 设施     7            0      2 7.549273 10.42465

Note: this section was translated from the English-language chinese vignette.