
How do I access the raw documents from the Brown corpus?

Stack Overflow user
Asked 2017-11-15 14:55:03
Answers: 2, Views: 7.9K, Followers: 0, Votes: 5

For all other NLTK corpora, calling corpus.raw() returns the raw text from the files. For example:

>>> from nltk.corpus import webtext
>>> webtext.raw()[:10]
'Cookie Man'

However, when you call brown.raw(), you get tagged text:

>>> from nltk.corpus import brown
>>> brown.raw()[:10]
'\n\n\tThe/at '

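I have read all the documentation I could find, but I can't find an obvious explanation or a way to get the untagged version. Is there a reason this corpus is tagged while the others are not?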


2 Answers

Stack Overflow user

Posted 2017-11-15 15:25:07

TL;DR

import nltk
nltk.download('brown')
nltk.download('nonbreaking_prefixes')
nltk.download('perluniprops')

from nltk.corpus import brown
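# Note: nltk.tokenize.moses was removed from later NLTK releases; there the
# same class is available from the separate sacremoses package instead.
from nltk.tokenize.moses import MosesDetokenizer

mdetok = MosesDetokenizer()

# Join each sentence, normalize the Brown-style quotes, then detokenize.
brown_natural = [
    mdetok.detokenize(
        ' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'").split(),
        return_str=True,
    )
    for sent in brown.sents()
]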

for sent in brown_natural:
    print(sent)

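In long

That's because the "raw" version of the Brown corpus is already tokenized and tagged; that is, the corpus ships in tagged form, and that is its raw form =)

You can take a look at the individual files in the nltk_data directory: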

$ head -n10 nltk_data/corpora/brown/ca01


    The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.


    The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl Atlanta/np-tl ''/'' for/in the/at manner/nn in/in which/wdt the/at election/nn was/bedz conducted/vbn ./.


    The/at September-October/np term/nn jury/nn had/hvd been/ben charged/vbn by/in Fulton/np-tl Superior/jj-tl Court/nn-tl Judge/nn-tl Durwood/np Pye/np to/to investigate/vb reports/nns of/in possible/jj ``/`` irregularities/nns ''/'' in/in the/at hard-fought/jj primary/nn which/wdt was/bedz won/vbn by/in Mayor-nominate/nn-tl Ivan/np Allen/np Jr./np ./.
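
To make the file format concrete, here is a minimal sketch of stripping the tags from a line like the ones above. It is not NLTK's actual reader code, and the helper names split_tagged_token and untag_line are made up for illustration. It assumes every token has the form word/tag and that the tag follows the last slash.

```python
def split_tagged_token(token):
    """Split a Brown-style 'word/tag' token on its last slash."""
    word, sep, tag = token.rpartition("/")
    if not sep:
        # No slash at all: treat the whole token as an untagged word.
        return token, None
    return word, tag


def untag_line(line):
    """Return only the words of one tagged line, dropping the tags."""
    return [split_tagged_token(tok)[0] for tok in line.split()]


sample = "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr ./."
print(untag_line(sample))
# ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', '.']
```

Splitting on the last slash rather than the first keeps punctuation tokens such as ./. and ``/`` intact.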

If you want the words in the corpus, you can use brown.words(), e.g.:

>>> from nltk.corpus import brown

>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]

>>> ' '.join(brown.words()[:30])
u"The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . The jury further said in"

To get the words from a specific file:

>>> brown.fileids()[:10] # The first 10 fileids from brown.
[u'ca01', u'ca02', u'ca03', u'ca04', u'ca05', u'ca06', u'ca07', u'ca08', u'ca09', u'ca10']

>>> ' '.join(brown.words('ca01')[:30]) # First 30 words from the 'ca01' file.
u"The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . The jury further said in"

And the sentences from a specific file:

>>> brown.sents('ca01')
[[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.'], [u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise', u'and', u'thanks', u'of', u'the', u'City', u'of', u'Atlanta', u"''", u'for', u'the', u'manner', u'in', u'which', u'the', u'election', u'was', u'conducted', u'.'], ...]

To print out the individual sentences:

>>> for sent in brown.sents('ca01')[:5]: # First 5 sentences.
...     print(' '.join(sent))
... 
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .
`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .
The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' .

Trying to detokenize a tokenized corpus is rather messy and may or may not work, but you can try the MosesDetokenizer.

First, download the data that MosesDetokenizer needs:

>>> import nltk
>>> nltk.download('perluniprops')
[nltk_data] Downloading package perluniprops to
[nltk_data]     /home/ltan/nltk_data...
[nltk_data]   Unzipping misc/perluniprops.zip.
True
>>> nltk.download('nonbreaking_prefixes')
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /home/ltan/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!
True

Then initialize the MosesDetokenizer:

>>> # On NLTK releases without nltk.tokenize.moses, use: from sacremoses import MosesDetokenizer
>>> from nltk.tokenize.moses import MosesDetokenizer
>>> mdetok = MosesDetokenizer()

And use MosesDetokenizer.detokenize():

>>> for sent in brown.sents('ca01')[:5]: # First 5 sentences.
...     # Join the words in sentences and convert the `` -> "
...     # also convert '' -> " and ` -> '
...     munged_sentence = ' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'")
...     print(mdetok.detokenize(munged_sentence.split(), return_str=True)) # MosesDetokenizer expects a list of strings as input.
... 
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place.
The jury further said in term-end presentments that the City Executive Committee, which had over-all charge of the election, "deserves the praise and thanks of the City of Atlanta" for the manner in which the election was conducted.
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible "irregularities" in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr..
"Only a relative handful of such reports was received", the jury said, "considering the widespread interest in the election, the number of voters and the size of this city".
The jury said it did find that many of Georgia's registration and election laws "are outmoded or inadequate and often ambiguous".

To convert every sentence in brown into naturally readable text:

from nltk.tokenize.moses import MosesDetokenizer
mdetok = MosesDetokenizer()
brown_natural = [mdetok.detokenize(' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'").split(), return_str=True)  for sent in brown.sents()]

Output:

>>> for sent in brown_natural:
...     print(sent)
...     break
... 
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place.
Votes: 5

Stack Overflow user

Posted 2017-11-16 04:12:07

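The tagged text is the raw document, that is, the actual content of the Brown corpus files. The raw() method shows you exactly what is stored in the file; it only returns plain text for "plaintext" corpora, not for "all other corpora" as you assume. For example, try nltk.corpus.treebank.raw('wsj_0001.mrg') or nltk.corpus.conll2000.raw("train.txt"), and you will see trees and "IOB-format" text, respectively.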

Now, if your goal is to reconstruct readable text, joining on spaces is usually good enough for me:

for sent in brown.sents():
    print(" ".join(sent))

You'll get output like this:

`` Only a relative handful of such reports was received '' , the jury said , `` considering
the widespread interest in the election , the number of voters and the size of this 
city '' .

If you don't like it this way, see alvas's answer for a more ambitious reconstruction.
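
As a middle ground between plain joining and a full detokenizer, a couple of string substitutions can already tidy up the quotes and the spaces before punctuation. This is only a rough sketch: simple_detokenize is a hypothetical helper, not part of NLTK, and it assumes Brown-style tokens where `` opens a quotation and '' closes it. It will miss plenty of cases (brackets, a closing quote at the very start of a sentence, and so on).

```python
import re


def simple_detokenize(tokens):
    """Join tokens and apply a few cosmetic fixes; not a real detokenizer."""
    text = " ".join(tokens)
    # Brown marks opening quotes as `` and closing quotes as ''.
    text = text.replace("`` ", '"').replace(" ''", '"')
    # Drop the space that joining leaves before common punctuation marks.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text


tokens = ["``", "Only", "a", "handful", "''", ",", "the", "jury", "said", "."]
print(simple_detokenize(tokens))
# "Only a handful", the jury said.
```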

Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/47301140
