
The figure was generated from the data in this article.

Original article; please credit the source when reposting.


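Text mining, also called text data mining, is just what the name suggests: mining and analyzing text data. It typically covers text classification, text clustering, concept and entity extraction, natural language processing, and so on. Below I use a simple example to walk through the general text mining workflow in R and introduce some text mining concepts along the way.

We mainly use the tm package for text mining in R. First, load the packages: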

library(tm)  ## framework package for text mining in R
library(ggplot2)

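1. Loading the Corpus


A Corpus is a collection of documents and is the data structure the tm package uses to manage them. Typically we first import a batch of documents into a Corpus and only then move on to further processing and analysis.

Documents come in a great many formats, and tm supports only a small fraction of them, roughly: plain text, PDF, Microsoft Word and XML. If the documents you need to process are in some other format, you have to convert them first. Personally, I recommend converting to plain text or XML, which are easier to work with. We can list the readers tm provides: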

getReaders()
## [1] "readDOC"                 "readPDF"
## [3] "readReut21578XML"        "readReut21578XMLasPlain"
## [5] "readPlain"               "readRCV1"
## [7] "readRCV1asPlain"         "readTabular"
## [9] "readXML"

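tm distinguishes two kinds of Corpus. A Volatile Corpus is kept in memory as an R object and is created with VCorpus() or Corpus(). A Permanent Corpus is stored outside R and is created with PCorpus(). Which one to choose depends on how much memory is available and how fast the processing needs to be.

Here we use the XML documents that ship with the tm package for the demonstration: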

xml <- system.file("texts", "crude", package = "tm")  ## directory containing the data
docs <- Corpus(DirSource(xml), readerControl = list(reader = readReut21578XML))

The data source used here is DirSource; other sources are available as well and can be listed with getSources():

getSources()
## [1] "DataframeSource" "DirSource"       "ReutersSource"   "URISource"
## [5] "VectorSource"
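
As a small illustration of a source other than DirSource, the sketch below builds a volatile corpus from an in-memory character vector via VectorSource. The two sample sentences are made up for the example, and the call to content() assumes tm 0.6 or later.

```r
library(tm)

## Two made-up documents held in an ordinary character vector
texts <- c("Crude oil prices fell today.",
           "OPEC members meet next week.")

## VectorSource treats each element of the vector as one document
corpus <- VCorpus(VectorSource(texts))

n_docs <- length(corpus)           ## number of documents in the corpus
first_doc <- content(corpus[[1]])  ## text of the first document (tm >= 0.6)
print(n_docs)
print(first_doc)
```

This is handy for quick experiments, since no files on disk are needed.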

To read other formats, a few additional arguments are needed. Below, path stands for the directory containing the data:

## plain text
docs <- Corpus(DirSource(path))
## PDF
docs <- Corpus(DirSource(path), readerControl = list(reader = readPDF))
## other formats are handled similarly

2. Inspecting the Corpus


Once the data has been imported as a Corpus, we can take a look at it.

docs  ## prints only the number of documents in the corpus
## A corpus with 20 text documents
names(docs)[1:3]  ## names of the first three documents
## [1] "reut-00001.xml" "reut-00002.xml" "reut-00004.xml"
summary(docs)  ## shows more metadata, but not the source text
## A corpus with 20 text documents
##
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
##   create_date creator
## Available variables in the data frame are:
##   MetaID
inspect(docs[1])  ## show the complete content of the first document
## A corpus with 1 text document
##
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
##   create_date creator
## Available variables in the data frame are:
##   MetaID
##
## $`reut-00001.xml`
## $doc
## $file
## [1] "<buffer>"
##
## $version
## [1] "1.0"
##
## $children
## $children$REUTERS
## <REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5670" NEWID="127">
##  <DATE>26-FEB-1987 17:00:56.04</DATE>
##  <TOPICS>
##   <D>crude</D>
##  </TOPICS>
##  <PLACES>
##   <D>usa</D>
##  </PLACES>
##  <PEOPLE/>
##  <ORGS/>
##  <EXCHANGES/>
##  <COMPANIES/>
##  <UNKNOWN>Y
##    f0119 reute
## u f BC-DIAMOND-SHAMROCK-(DIA   02-26 0097</UNKNOWN>
##  <TEXT>
##   <TITLE>DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES</TITLE>
##   <DATELINE>NEW YORK, FEB 26 -</DATELINE>
##   <BODY>Diamond Shamrock Corp said that
## effective today it had cut its contract prices for crude oil by
## 1.50 dlrs a barrel.
##     The reduction brings its posted price for West Texas
## Intermediate to 16.00 dlrs a barrel, the copany said.
##     &quot;The price reduction today was made in the light of falling
## oil product prices and a weak crude oil market,&quot; a company
## spokeswoman said.
##     Diamond is the latest in a line of U.S. oil companies that
## have cut its contract, or posted, prices over the last two days
## citing weak oil markets.
##  Reuter</BODY>
##  </TEXT>
## </REUTERS>
##
##
## attr(,"class")
## [1] "XMLDocumentContent"
##
## $dtd
## $external
## NULL
##
## $internal
## NULL
##
## attr(,"class")
## [1] "DTDList"
##
## attr(,"Author")
## character(0)
## attr(,"DateTimeStamp")
## [1] "1987-02-26 17:00:56 GMT"
## attr(,"Description")
## [1] ""
## attr(,"Heading")
## [1] "DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES"
## attr(,"ID")
## [1] "127"
## attr(,"Language")
## [1] "en"
## attr(,"LocalMetaData")
## attr(,"LocalMetaData")$TOPICS
## [1] "YES"
##
## attr(,"LocalMetaData")$LEWISSPLIT
## [1] "TRAIN"
##
## attr(,"LocalMetaData")$CGISPLIT
## [1] "TRAINING-SET"
##
## attr(,"LocalMetaData")$OLDID
## [1] "5670"
##
## attr(,"LocalMetaData")$Topics
## [1] "crude"
##
## attr(,"LocalMetaData")$Places
## [1] "usa"
##
## attr(,"LocalMetaData")$People
## character(0)
##
## attr(,"LocalMetaData")$Orgs
## character(0)
##
## attr(,"LocalMetaData")$Exchanges
## character(0)
##
## attr(,"Origin")
## [1] "Reuters-21578 XML"
## attr(,"class")
## [1] "Reuters21578Document" "TextDocument"         "XMLDocument"
## [4] "XMLAbstractDocument"  "oldClass"
## inspect(docs) shows the complete content of every document, but the output is very long
docs[[1]]  ## extract the first document
## $doc
## $file
## [1] "<buffer>"
##
## $version
## [1] "1.0"
##
## $children
## $children$REUTERS
## <REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5670" NEWID="127">
##  <DATE>26-FEB-1987 17:00:56.04</DATE>
##  <TOPICS>
##   <D>crude</D>
##  </TOPICS>
##  <PLACES>
##   <D>usa</D>
##  </PLACES>
##  <PEOPLE/>
##  <ORGS/>
##  <EXCHANGES/>
##  <COMPANIES/>
##  <UNKNOWN>Y
##    f0119 reute
## u f BC-DIAMOND-SHAMROCK-(DIA   02-26 0097</UNKNOWN>
##  <TEXT>
##   <TITLE>DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES</TITLE>
##   <DATELINE>NEW YORK, FEB 26 -</DATELINE>
##   <BODY>Diamond Shamrock Corp said that
## effective today it had cut its contract prices for crude oil by
## 1.50 dlrs a barrel.
##     The reduction brings its posted price for West Texas
## Intermediate to 16.00 dlrs a barrel, the copany said.
##     &quot;The price reduction today was made in the light of falling
## oil product prices and a weak crude oil market,&quot; a company
## spokeswoman said.
##     Diamond is the latest in a line of U.S. oil companies that
## have cut its contract, or posted, prices over the last two days
## citing weak oil markets.
##  Reuter</BODY>
##  </TEXT>
## </REUTERS>
##
##
## attr(,"class")
## [1] "XMLDocumentContent"
##
## $dtd
## $external
## NULL
##
## $internal
## NULL
##
## attr(,"class")
## [1] "DTDList"
##
## attr(,"Author")
## character(0)
## attr(,"DateTimeStamp")
## [1] "1987-02-26 17:00:56 GMT"
## attr(,"Description")
## [1] ""
## attr(,"Heading")
## [1] "DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES"
## attr(,"ID")
## [1] "127"
## attr(,"Language")
## [1] "en"
## attr(,"LocalMetaData")
## attr(,"LocalMetaData")$TOPICS
## [1] "YES"
##
## attr(,"LocalMetaData")$LEWISSPLIT
## [1] "TRAIN"
##
## attr(,"LocalMetaData")$CGISPLIT
## [1] "TRAINING-SET"
##
## attr(,"LocalMetaData")$OLDID
## [1] "5670"
##
## attr(,"LocalMetaData")$Topics
## [1] "crude"
##
## attr(,"LocalMetaData")$Places
## [1] "usa"
##
## attr(,"LocalMetaData")$People
## character(0)
##
## attr(,"LocalMetaData")$Orgs
## character(0)
##
## attr(,"LocalMetaData")$Exchanges
## character(0)
##
## attr(,"Origin")
## [1] "Reuters-21578 XML"
## attr(,"class")
## [1] "Reuters21578Document" "TextDocument"         "XMLDocument"
## [4] "XMLAbstractDocument"  "oldClass"
## docs[['reut-00001.xml']] also extracts the first document

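3. Transformations


With the Corpus created, the next step is to modify it, for example by removing punctuation and stopwords. This is done with tm_map(), which applies a transformation function to every document in the Corpus.

1. Converting to plain text

If the Corpus holds data that is not plain text, such as XML, it first has to be converted to plain text: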

docs <- tm_map(docs, as.PlainTextDocument)
docs[[1]]
## DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES
## NEW YORK, FEB 26 -
## Diamond Shamrock Corp said that
## effective today it had cut its contract prices for crude oil by
## 1.50 dlrs a barrel.
##     The reduction brings its posted price for West Texas
## Intermediate to 16.00 dlrs a barrel, the copany said.
##     "The price reduction today was made in the light of falling
## oil product prices and a weak crude oil market," a company
## spokeswoman said.
##     Diamond is the latest in a line of U.S. oil companies that
## have cut its contract, or posted, prices over the last two days
## citing weak oil markets.
##  Reuter

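2. Replacing special characters

The documents may contain special characters such as "/", "@" and "-" (for example, inside an e-mail address). We replace them with spaces: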

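## content_transformer() (tm >= 0.6) wraps a function so that tm_map() keeps each
## element a text document instead of turning it into a bare character vector
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "-")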

More complex replacements can be handled with regular expressions, which are not covered here.
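
As a rough illustration of the regular-expression route, a character class can replace several special characters in a single call. The sample string is invented for the example, and the sketch works on a plain character vector rather than on a Corpus.

```r
## Made-up input containing the three characters handled above
x <- "user@host, 02-26, a/b"

## [/@-] is a character class matching "/", "@" or "-";
## the hyphen is placed last so that it is taken literally
cleaned <- gsub("[/@-]", " ", x)
print(cleaned)
```

Each matched character is replaced by one space, so the string keeps its original length.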

3. Converting to lower case

As the name suggests, this converts all text to lower case, which makes the later analysis easier.

docs <- tm_map(docs, content_transformer(tolower))  ## the wrapper is needed in tm >= 0.6
docs[[1]]  ## check the result
## diamond shamrock (dia) cuts crude prices
## new york, feb 26
## diamond shamrock corp said that
## effective today it had cut its contract prices for crude oil by
## 1.50 dlrs a barrel.
##     the reduction brings its posted price for west texas
## intermediate to 16.00 dlrs a barrel, the copany said.
##     "the price reduction today was made in the light of falling
## oil product prices and a weak crude oil market," a company
## spokeswoman said.
##     diamond is the latest in a line of u.s. oil companies that
## have cut its contract, or posted, prices over the last two days
## citing weak oil markets.
##  reuter

4. Removing numbers

Sometimes we need to remove the numbers from the documents:

docs <- tm_map(docs, removeNumbers)
docs[[1]]
## diamond shamrock (dia) cuts crude prices
## new york, feb
## diamond shamrock corp said that
## effective today it had cut its contract prices for crude oil by
## . dlrs a barrel.
##     the reduction brings its posted price for west texas
## intermediate to . dlrs a barrel, the copany said.
##     "the price reduction today was made in the light of falling
## oil product prices and a weak crude oil market," a company
## spokeswoman said.
##     diamond is the latest in a line of u.s. oil companies that
## have cut its contract, or posted, prices over the last two days
## citing weak oil markets.
##  reuter

5. Removing stopwords

docs <- tm_map(docs, removeWords, stopwords("english"))
docs[[1]]
## diamond shamrock (dia) cuts crude prices
## new york, feb
## diamond shamrock corp said
## effective today   cut  contract prices  crude oil
## . dlrs  barrel.
##      reduction brings  posted price  west texas
## intermediate  . dlrs  barrel,  copany said.
##     " price reduction today  made   light  falling
## oil product prices   weak crude oil market,"  company
## spokeswoman said.
##     diamond   latest   line  u.s. oil companies
##  cut  contract,  posted, prices   last two days
## citing weak oil markets.
##  reuter

6. Removing punctuation

docs <- tm_map(docs, removePunctuation)
docs[[1]]
## diamond shamrock dia cuts crude prices
## new york feb
## diamond shamrock corp said
## effective today   cut  contract prices  crude oil
##  dlrs  barrel
##      reduction brings  posted price  west texas
## intermediate   dlrs  barrel  copany said
##      price reduction today  made   light  falling
## oil product prices   weak crude oil market  company
## spokeswoman said
##     diamond   latest   line  us oil companies
##  cut  contract  posted prices   last two days
## citing weak oil markets
##  reuter

7. Stripping extra whitespace

docs <- tm_map(docs, stripWhitespace)
docs[[1]]
## diamond shamrock dia cuts crude prices
## new york feb
## diamond shamrock corp said effective today cut contract prices crude oil dlrs barrel reduction brings posted price west texas intermediate dlrs barrel copany said price reduction today made light falling oil product prices weak crude oil market company spokeswoman said diamond latest line us oil companies cut contract posted prices last two days citing weak oil markets reuter
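
To make the effect of steps 3 to 7 concrete, here is a minimal base-R sketch that applies comparable operations to a single string. It is not how tm implements these transformations; the helper name clean_text and the three-word stopword list are assumptions made for the illustration (the list returned by stopwords("english") is much longer).

```r
## Hypothetical helper mimicking steps 3-7 on a plain character vector
clean_text <- function(x, stop_words = c("the", "a", "its")) {
  x <- tolower(x)                       ## step 3: lower case
  x <- gsub("[[:digit:]]+", "", x)      ## step 4: drop numbers
  ## step 5: drop stopwords, matched as whole words only
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("[[:punct:]]+", "", x)      ## step 6: drop punctuation
  x <- gsub("[[:space:]]+", " ", x)     ## step 7: squeeze runs of whitespace
  trimws(x)                             ## also trim both ends
}

result <- clean_text("The reduction brings its posted price to 16.00 dlrs a barrel.")
print(result)
```

The order matters here in the same way as in the walkthrough: stopwords are removed before punctuation, so a word list containing apostrophes would still match.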

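8. Stemming

First, what is stemming? An English word can appear in many forms, such as the plural or the past participle. They look different but are really the same word, so for analysis they should be reduced to a common stem. In R the SnowballC package handles stemming. For example: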

require(SnowballC)
## Loading required package: SnowballC
exam <- c("prices, price, doing")
stemDocument(exam)
## [1] "prices, price, do"
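
In the output above only "doing" changes: the commas attached to "prices," and "price," seem to stop the stemmer from treating them as plain words. As a small sketch, stemming the words one at a time with wordStem() from SnowballC (left at its default stemmer) shows the effect more clearly:

```r
library(SnowballC)

## The same three words, passed as separate elements without punctuation
words <- c("prices", "price", "doing")

## wordStem() stems each element of a character vector independently
stems <- wordStem(words)
print(stems)
```

This is also why punctuation is removed from the corpus before stemming in the walkthrough.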

Applied to our example:

require(SnowballC)
docs <- tm_map(docs, stemDocument)
docs[[1]]
## diamond shamrock dia cut crude price
## new york feb
## diamond shamrock corp said effect today cut contract price crude oil dlrs barrel reduct bring post price west texa intermedi dlrs barrel copani said price reduct today made light fall oil product price weak crude oil market compani spokeswoman said diamond latest line us oil compani cut contract post price last two day cite weak oil market reuter

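4. Creating the document-term matrix


In text mining, the document-term matrix is the basis for building models; all subsequent analysis and modelling is built on it. Let us first get to know this matrix through an example:

Suppose we have two documents, whose contents are "text mining" and "data mining and text mining". The corresponding matrix is: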

d.Exam <- c("text mining", "data mining and text mining")
doc.Exam <- Corpus(VectorSource(d.Exam))
dtm.Exam <- DocumentTermMatrix(doc.Exam)
inspect(dtm.Exam)
## A document-term matrix (2 documents, 4 terms)
##
## Non-/sparse entries: 6/2
## Sparsity           : 25%
## Maximal term length: 6
## Weighting          : term frequency (tf)
##
##     Terms
## Docs and data mining text
##    1   0    0      1    1
##    2   1    1      2    1

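As you can see, a document-term matrix has one row per document and one column per term; each cell holds the number of times the corresponding term occurs in the corresponding document.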
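
To see how the counts arise, the same matrix can be rebuilt by hand in base R. This is only an illustrative sketch, not the way tm builds the matrix; it splits on single spaces and does no other tokenisation, and the names docs_raw, count_terms and dtm_manual are made up for the example.

```r
docs_raw <- c("text mining", "data mining and text mining")

## Split each document into words on single spaces
tokens <- strsplit(docs_raw, " ", fixed = TRUE)

## The vocabulary: all distinct words, sorted alphabetically
terms <- sort(unique(unlist(tokens)))

## Count how often each vocabulary term occurs in one document
count_terms <- function(words) {
  vapply(terms, function(term) sum(words == term), integer(1))
}

## One row per document, one column per term
dtm_manual <- do.call(rbind, lapply(tokens, count_terms))
rownames(dtm_manual) <- seq_along(docs_raw)
print(dtm_manual)
```

The second row shows a 2 under "mining" because that word occurs twice in the second document.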

For our example, the document-term matrix can be generated like this:

dtm <- DocumentTermMatrix(docs)
inspect(dtm[1:5, 1:10])
## A document-term matrix (5 documents, 10 terms)
##
## Non-/sparse entries: 1/49
## Sparsity           : 98%
## Maximal term length: 6
## Weighting          : term frequency (tf)
##
##      Terms
## Docs  abdul abil abl abroad abu accept accord across activ add
##   127     0    0   0      0   0      0      0      0     0   0
##   144     0    2   0      0   0      0      0      0     0   0
##   191     0    0   0      0   0      0      0      0     0   0
##   194     0    0   0      0   0      0      0      0     0   0
##   211     0    0   0      0   0      0      0      0     0   0
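
The "Non-/sparse entries" and "Sparsity" lines in the output can be reproduced with simple arithmetic. The sketch below rebuilds the 5 x 10 slice shown above as an ordinary matrix (the values are copied from the printed output) and counts its zero cells; the variable names are made up for the example.

```r
## The 5 x 10 slice printed above: all zeros except "abil" in document 144
slice <- matrix(0, nrow = 5, ncol = 10,
                dimnames = list(c("127", "144", "191", "194", "211"),
                                c("abdul", "abil", "abl", "abroad", "abu",
                                  "accept", "accord", "across", "activ", "add")))
slice["144", "abil"] <- 2

non_sparse <- sum(slice != 0)                    ## cells holding a count
sparse <- sum(slice == 0)                        ## empty cells
sparsity <- round(100 * sparse / length(slice))  ## share of empty cells, in percent
print(c(non_sparse, sparse, sparsity))
```

With 1 filled cell out of 50, 49 cells are empty, which rounds to the 98% reported by inspect().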

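At this point we have generated the document-term matrix. Next time we will look at how to manipulate this matrix and how to use it for subsequent modelling and analysis.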