Converting from POS-tagged word tokens to POS-tagged phrases

Code Review user

Asked 2015-07-13 07:03:03


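I have sentences represented as part-of-speech (POS) tagged words. I would like to join all the phrases together with underscores. I want each phrase to carry the POS tag of its final word. This is not because it is linguistically correct, but because it does the right thing in my system when it has to "unstem"/"unlemmatize" derived words and phrases.

For example, if I have the following sentence (based on the first sentence in the Microsoft Research Paraphrase Corpus):

PCCW's chief operating officer, Mike Butcher, and the Arena brothers, the chief financial officers, will report directly to the police officer.

Then the POS-tagged text is: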

[('PCCW', 'NNP'), ("'s", 'POS'), ('chief', 'NN'),('operating', 'VBG'), ('officer', 'NN'),(',', ','),('Mike', 'NNP'),('Butcher', 'NNP'), (',', ','),('and', 'CC'),('the', 'DT'),('Arena', 'NNP'),('brothers', 'NNS'),(',', ','),('the', 'DT'),('chief', 'JJ'), ('financial', 'JJ'),('officers', 'NNS'),(',', ','),('will', 'MD'),('report', 'VB'),('directly','RB'),('to', 'TO'),('the', 'DT'), ('police', 'NN'),('officer', 'NN'), ('.', '.')]

And the POS-tagged phrases (i.e. the output of my function) are:

[('PCCW', 'NNP'), ("'s", 'POS'), ('chief_operating_officer', 'NN'), (',', ','), ('Mike', 'NNP'), ('Butcher', 'NNP'), (',', ','), ('and', 'CC'), ('the', 'DT'), ('Arena', 'NNP'), ('brothers', 'NNS'), (',', ','), ('the', 'DT'), ('chief_financial_officers', 'NNS'), (',', ','), ('will', 'MD'), ('report', 'VB'), ('directly', 'RB'), ('to', 'TO'), ('the', 'DT'), ('police_officer', 'NN'), ('.', '.')]

I am prepared to accept WordNet as the ground truth for whether or not something is a phrase.

from nltk.corpus import wordnet as wn    
def get_tagged_phrases(tagged_sent, max_phrase_length):

    tagged_phrase_sent = list(tagged_sent)
    for phrase_len in range(max_phrase_length,1,-1): #Go from largest to smallest to keep information
        for indexes in n_wise(phrase_len, range(len(tagged_sent))):
            tagged_words = [tagged_phrase_sent[index] for index in indexes]
            if not(any([tagged_word is None for tagged_word in tagged_words])):
                words, tags = zip(*tagged_words)
                possible_phrase = "_".join(words)
                if wn.synsets(possible_phrase): #If there are any, then it is a phrase
                    for index in indexes:
                        tagged_phrase_sent[index] = None #Blank them out with Nones which we will remove later
                    pos = tags[-1] #Use final tag, it will be the one we need for handling plurals
                    tagged_phrase_sent[indexes[0]] = (possible_phrase, pos)
    return [tagged_phrase for tagged_phrase in tagged_phrase_sent if not tagged_phrase is None]

The obvious code smell is that it is nested about 5 deep. Possibly it is also a lot of state to keep track of.

This is in Python 2.
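The `n_wise` helper called in the function is not shown here. A minimal sketch, assuming it yields every run of `n` consecutive items as a tuple (the name comes from the code above; the body is an assumption, not the original implementation):

```python
def n_wise(n, iterable):
    """Yield every run of n consecutive items from iterable as a tuple.

    Assumed behaviour: the original helper is not shown in the post.
    """
    items = list(iterable)
    # There are len(items) - n + 1 windows; none if the input is too short.
    for start in range(len(items) - n + 1):
        yield tuple(items[start:start + n])
```

With this definition, `n_wise(2, range(4))` walks over the index pairs of adjacent words, which is how the function above uses it.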

python

python-2.x

natural-language-processing

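1 Answer

Code Review user

Answered 2015-07-13 12:27:44

Apart from choosing slightly more optimal functions (xrange, in), I think this looks okay. More nesting could be eliminated by inverting the conditions and using continue instead. Possibly the final return statement could be faster if the intermediate results from the inner loops were collected in a separate result, but I might have read that wrong.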

from nltk.corpus import wordnet as wn


def get_tagged_phrases(tagged_sent, max_phrase_length):
    tagged_sent = list(tagged_sent)

    for phrase_len in xrange(max_phrase_length, 1, -1): #Go from largest to smallest to keep information
        for indexes in n_wise(phrase_len, xrange(len(tagged_sent))):
            tagged_words = [tagged_sent[index] for index in indexes]

            if None in tagged_words:
                continue

            words, tags = zip(*tagged_words)
            possible_phrase = "_".join(words)

            if not wn.synsets(possible_phrase): #No WordNet entry, so it is not a phrase
                continue

            for index in indexes:
                tagged_sent[index] = None #Blank them out with Nones which we will remove later

            pos = tags[-1] #Use final tag, it will be the one we need for handling plurals
            tagged_sent[indexes[0]] = (possible_phrase, pos)

    return [tagged_phrase for tagged_phrase in tagged_sent if tagged_phrase is not None]
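
The last suggestion, collecting results in a separate list instead of blanking entries with None, is not shown in the code above. A rough sketch of one way it might look, assuming a greedy left-to-right scan rather than the global longest-first passes, so the output can differ when candidate phrases overlap. The names `merge_phrases` and `is_phrase` are hypothetical; with WordNet the predicate would be something like `lambda p: bool(wn.synsets(p))`.

```python
def merge_phrases(tagged_sent, max_phrase_length, is_phrase):
    """Greedy left-to-right merge of known phrases.

    is_phrase is a caller-supplied predicate (an assumption of this sketch)
    that says whether an underscore-joined candidate is a known phrase.
    """
    tagged_sent = list(tagged_sent)
    result = []
    i = 0
    while i < len(tagged_sent):
        # Try the longest window first, shrinking down to two words.
        longest = min(max_phrase_length, len(tagged_sent) - i)
        for length in range(longest, 1, -1):
            window = tagged_sent[i:i + length]
            candidate = "_".join(word for word, _ in window)
            if is_phrase(candidate):
                # Keep the tag of the final word, as in the question.
                result.append((candidate, window[-1][1]))
                i += length
                break
        else:
            # No multi-word phrase starts here; keep the token unchanged.
            result.append(tagged_sent[i])
            i += 1
    return result
```

Passing the predicate in also makes the function easy to try without WordNet installed, for example with `is_phrase={"police_officer"}.__contains__`.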


The original content of this page was provided by Code Review.

Original link:

https://codereview.stackexchange.com/questions/96753
