
"Must use *unicode* string as text to tag" error when tagging with TreeTagger?

Stack Overflow user
Asked 2016-04-17 17:58:32
1 answer, 549 views, 0 followers, score 1

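I created a directory and downloaded the specified files from the TreeTagger website. I then installed treetaggerwrapper and, following its documentation, tried tagging some text as follows: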

In [40]:
import treetaggerwrapper
tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
tags = tagger.TagText("This is a very short text to tag.")
print tags

I then got the following warnings, followed by an error:

WARNING:TreeTagger:Abbreviation file not found: english-abbreviations
WARNING:TreeTagger:Processing without abbreviations file.
ERROR:TreeTagger:Must use *unicode* string as text to tag, not <type 'str'>.

---------------------------------------------------------------------------
TreeTaggerError                           Traceback (most recent call last)
<ipython-input-40-37b912126580> in <module>()
      1 import treetaggerwrapper
      2 tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
----> 3 tags = tagger.TagText("This is a very short text to tag.")
      4 print tags

/usr/local/lib/python2.7/site-packages/treetaggerwrapper.pyc in TagText(self, text, numlines, tagonly, prepronly, tagblanks, notagurl, notagemail, notagip, notagdns, encoding, errors)
   1236         return self.tag_text(text, numlines=numlines, tagonly=tagonly,
   1237                  prepronly=prepronly, tagblanks=tagblanks, notagurl=notagurl,
-> 1238                  notagemail=notagemail, notagip=notagip, notagdns=notagdns)
   1239 
   1240     # --------------------------------------------------------------------------

/usr/local/lib/python2.7/site-packages/treetaggerwrapper.pyc in tag_text(self, text, numlines, tagonly, prepronly, tagblanks, notagurl, notagemail, notagip, notagdns, nosgmlsplit)
   1302             # Raise exception now, with an explicit message.
   1303             logger.error("Must use *unicode* string as text to tag, not %s.", type(text))
-> 1304             raise TreeTaggerError("Must use *unicode* string as text to tag.")
   1305 
   1306         if isinstance(text, six.text_type):

TreeTaggerError: Must use *unicode* string as text to tag.

Where can I download the English and Spanish abbreviation files, and how do I install treetaggerwrapper correctly?


1 Answer

Stack Overflow user

Accepted answer

Posted 2016-04-17 18:01:07

The method only accepts unicode strings. Add a u prefix to your string literal to make it a unicode string:

tags = tagger.TagText(u"This is a very short text to tag.")

In Python 2, "This is a very short text to tag." is of type str; once you add the u prefix, it is of type unicode:

In [12]: type("This is a very short text to tag.")
Out[12]: str

In [13]: type(u"This is a very short text to tag.")
Out[13]: unicode

If you get a str from another source, you need to decode it:

In [15]: s = "This is a very short text to tag."

In [16]: type(s)
Out[16]: str

In [17]: type(s.decode("utf-8"))
Out[17]: unicode
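
When the input may arrive as either type, the decode step can be wrapped in a small helper. This is only a sketch: `ensure_text` is a hypothetical name and is not part of treetaggerwrapper, and UTF-8 is an assumed default that should match however the bytes were actually encoded. It relies on `bytes` being an alias of `str` on Python 2, so the same code behaves the same way on Python 3.

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as a text (unicode) string.

    Byte strings are decoded with the given encoding (UTF-8 is an
    assumption here); text strings are returned unchanged.
    """
    # On Python 2, bytes is an alias of str, so this catches plain
    # "..." literals as well as data read from files or sockets.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = b"ma\xc3\xb1ana"      # UTF-8 encoded bytes from some external source
text = ensure_text(raw)      # decoded to a text string
same = ensure_text(u"tag")   # already text, passed through untouched
```

The result could then be passed on as `tagger.TagText(ensure_text(s))`, which should avoid the type error regardless of where `s` came from.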

The tagging scripts can be downloaded here.

Score: 3
Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/36680144
