
Using NGram sentiment analysis - can't get the top 5 words

Stack Overflow user
Asked on 2019-12-11 21:15:10
2 answers · 462 views · 0 following · 1 vote

I set up my CountVectorizer as follows:

Code language: python
cv = CountVectorizer(binary=True)
X = cv.fit_transform(train_text)
X_test = cv.transform(test_text)

When I use an SVM, I can print out the top 5 words from my sentiment analysis:

Code language: python
final_svm  = LinearSVC(C=best_c)
final_svm.fit(X, target)
final_accuracy = final_svm.predict(X_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print ("Final SVM Accuracy: %s" % final_accuracy_score)
Report_Matricies.accuracy(target_test, final_accuracy)
feature_names = zip(cv.get_feature_names(), final_svm.coef_[0])
feature_to_coef = {
    word: coef for word, coef in feature_names
}
itemz = feature_to_coef.items()
list_positive = sorted(
    itemz, 
    key=lambda x: x[1], 
    reverse=True)[:number_we_are_interested_in]

This works. But when I try to write similar code for NGrams, I get random words:

Code language: python
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, no_of_words))
X = ngram_vectorizer.fit_transform(train_text)
X_test = ngram_vectorizer.transform(test_text)
best_c = Logistic_Regression.get_best_hyperparameter(X_train, y_train, y_val, X_val)
final_ngram = LogisticRegression(C=best_c)
final_ngram.fit(X, target)
final_accuracy = final_ngram.predict(X_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print ("Final NGram Accuracy: %s" % final_accuracy_score)
Report_Matricies.accuracy(target_test, final_accuracy)
feature_names = zip(cv.get_feature_names(), final_ngram.coef_[0])
feature_to_coef = {
    word: coef for word, coef in feature_names
}
itemz = feature_to_coef.items()
list_positive = sorted(
    itemz,
    key=lambda x: x[1],
    reverse=True)

The accuracy figures for my NGram analysis and my SVM are similar, so the code I use for NGrams doesn't seem suited to extracting the kind of words I want, i.e. I get random words rather than positive ones. What code should I use instead? Similar code can be found at the reference below, but the example in part 2 doesn't print the top 5 words for NGrams. https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184

2 Answers

Stack Overflow user

Accepted answer

Posted on 2019-12-18 16:03:58

As aberger has already answered, you probably need to replace:

  • feature_names = zip(cv.get_feature_names(), final_ngram.coef_[0])

with

  • feature_names = zip(ngram_vectorizer.get_feature_names(), final_ngram.coef_[0])
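To see why the vectorizer and the model must match, here is a minimal, self-contained sketch of the extraction step. The feature names and coefficient values below are made up for illustration (they are not output from the models above); the point is that the i-th entry of coef_[0] only lines up with the i-th feature of the vectorizer that produced the matrix the model was fitted on.

```python
# Hypothetical feature names, standing in for what
# ngram_vectorizer.get_feature_names() would return, and made-up
# coefficients standing in for final_ngram.coef_[0].
feature_names = ["boxing", "boxing wizards", "five", "lazy"]
coefs = [0.13, 0.12, 0.15, -0.20]

# Pair each feature with its coefficient, then keep the top 2
# by coefficient value (largest = most "positive").
feature_to_coef = dict(zip(feature_names, coefs))
top = sorted(feature_to_coef.items(), key=lambda x: x[1], reverse=True)[:2]
print(top)  # [('five', 0.15), ('boxing', 0.13)]
```

If the names came from a different vectorizer, the zip would silently pair coefficients with the wrong words, which is exactly the "random words" symptom in the question.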

A few additional notes:

In natural language processing, an NGram treats N consecutive words as a single token. NGrams are used to "tokenize" your text corpus so that the corpus can be consumed by a machine learning algorithm, but they have nothing to do with the algorithm itself.
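As a library-free illustration of that idea (this is just a sketch of what tokenizing with ngram_range=(1, 2) produces, not sklearn's implementation):

```python
def ngrams(tokens, n_max):
    """Return all 1-grams up to n_max-grams as space-joined strings."""
    out = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

print(ngrams("the five boxing wizards".split(), 2))
# ['the', 'five', 'boxing', 'wizards',
#  'the five', 'five boxing', 'boxing wizards']
```

Each bigram such as 'five boxing' then becomes one column of the document-term matrix, exactly like a single word.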

SVM and logistic regression are two algorithms mainly used for classification (logistic regression is a regression; it is the way we use it that turns it into a classification algorithm).

I have tried to illustrate this with meaningless data (which you can replace with your own), so that you can run this code directly and observe the results.

As you can see, using NGrams gives almost the same top words, apart from the bigrams and trigrams, when I run it myself:

  • Logistic regression without NGrams: ('…', 0.22492305532420143), ('boxing', 0.22366726197682427), ('jump', 0.22366726197682427), ('wizards', 0.22366726197682427), ('five', 0.21116962061694416)
  • Logistic regression with NGrams: ('…', 0.1549468448457053), ('five', 0.15263348614045338), ('boxing', 0.12657434061922093), ('boxing wizards', 0.12657434061922093), ('boxing wizards jump', 0.12657434061922093)
  • Logistic regression with NGrams, but sorting only unigrams: ('…', 0.1549468448457053), ('five', 0.15263348614045338), ('boxing', 0.12657434061922093), ('jump', 0.12657434061922093), ('wizards', 0.12657434061922093) <- gives almost the same thing as "Logistic regression without NGrams" (not exactly the same, since the model has learned different tokens, i.e. the additional NGrams here)

Code language: python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

text_train = ["The quick brown fox jumps over a lazy dog",
        "Pack my box with five dozen liquor jugs",
        "How quickly daft jumping zebras vex",
        "The five boxing wizards jump quickly",
        "the fox of my friend it the most lazy one I have seen in the past five years"]

text_test = ["just for a test"]

target_train = [1, 1, 0, 1, 0]

target_test = [1]

#######################################################################
##       OBSERVING TOKENIZATION OF DATA WITH AND WITHOUT NGRAMS      ##
#######################################################################

## WITHOUT NGRAMS

cv = CountVectorizer()
count_vector = cv.fit_transform(text_train)
#Display the dictionary pairing each single word and its position in the
#"vectorized" version of our text corpus, without any count.
print("")
print(cv.vocabulary_)
print("")
print("")
print(dict(zip(cv.get_feature_names(), count_vector.toarray().sum(axis=0))))

##  WITH NGRAMS

#Now let's also add, as meaningful entities, all pairs and all trios of
#words using NGrams
cv = CountVectorizer(ngram_range=(1,3))
count_vector = cv.fit_transform(text_train)
#Observe that now "jump quickly" and "five boxing wizards", for instance,
#are considered as sort-of-meaningful unique "words" composed of several
#unique words.
print("")
print("")
print(cv.vocabulary_)
print("")
print("")
#List all words and count their occurrences
print(dict(zip(cv.get_feature_names(), count_vector.toarray().sum(axis=0))))

#######################################################################
##                    YOUR ATTEMPT WITH LINEARSVC                    ##
#######################################################################
cv1 = CountVectorizer(binary=True)
count_vector_train = cv1.fit_transform(text_train)
count_vector_test = cv1.transform(text_test)

final_svm  = LinearSVC(C=1.0)
final_svm.fit(count_vector_train, target_train)
final_accuracy = final_svm.predict(count_vector_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print("")
print("")
print ("Final SVM without NGrams Accuracy: %s" % final_accuracy_score)
feature_names = zip(cv1.get_feature_names(), final_svm.coef_[0])
feature_to_coef = {
    word: coef for word, coef in feature_names
}
itemz = feature_to_coef.items()
list_positive = sorted(
    itemz, 
    key=lambda x: x[1], 
    reverse=True)[:5] #Here you can choose the top 5
print("")
print("SVM without NGrams")
print(list_positive)

#######################################################################
##              YOUR ATTEMPT WITH LOGISTIC REGRESSION                ##
#######################################################################
cv2 = CountVectorizer(binary=True)
count_vector_train = cv2.fit_transform(text_train)
count_vector_test = cv2.transform(text_test)

final_lr  = LogisticRegression(C=1.0)
final_lr.fit(count_vector_train, target_train)
final_accuracy = final_lr.predict(count_vector_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print("")
print("")
print ("Final Logistic regression without NGrams Accuracy: %s" % final_accuracy_score)
feature_names = zip(cv2.get_feature_names(), final_lr.coef_[0])
feature_to_coef = {
    word: coef for word, coef in feature_names
}
itemz = feature_to_coef.items()
list_positive = sorted(
    itemz, 
    key=lambda x: x[1], 
    reverse=True)[:5] #Here you can choose the top 5
print("")
print("Logistic regression without NGrams")
print(list_positive)

#######################################################################
##         YOUR ATTEMPT WITH LOGISTIC REGRESSION AND NGRAMS          ##
#######################################################################
cv3 = CountVectorizer(binary=True, ngram_range=(1,3))
count_vector_train = cv3.fit_transform(text_train)
count_vector_test = cv3.transform(text_test)

final_lr  = LogisticRegression(C=1.0)
final_lr.fit(count_vector_train, target_train)
final_accuracy = final_lr.predict(count_vector_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print("")
print("")
print ("Final Logistic regression with NGrams Accuracy: %s" % final_accuracy_score)
feature_names = zip(cv3.get_feature_names(), final_lr.coef_[0])
feature_to_coef = {
    word: coef for word, coef in feature_names
}
itemz = feature_to_coef.items()
list_positive = sorted(
    itemz, 
    key=lambda x: x[1], 
    reverse=True)[:5] #Here you can choose the top 5
print("")
print("Logistic regression with NGrams")
print(list_positive)

#######################################################################
##         YOUR ATTEMPT WITH LOGISTIC REGRESSION AND NGRAMS          ##
##                BUT EXTRACTS ONLY REAL UNIQUE WORDS                ##
#######################################################################
cv4 = CountVectorizer(binary=True, ngram_range=(1,3))
count_vector_train = cv4.fit_transform(text_train)
count_vector_test = cv4.transform(text_test)

final_lr  = LogisticRegression(C=1.0)
final_lr.fit(count_vector_train, target_train)
final_accuracy = final_lr.predict(count_vector_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print("")
print("")
print ("Final Logistic regression with NGrams Accuracy: %s" % final_accuracy_score)
feature_names = zip(cv4.get_feature_names(), final_lr.coef_[0])
feature_names_unigrams = [(a, b) for a, b in feature_names if len(a.split()) < 2]
feature_to_coef = {
    word: coef for word, coef in feature_names_unigrams
}
itemz = feature_to_coef.items()

list_positive = sorted(
    itemz,
    key=lambda x: x[1], 
    reverse=True)[:5] #Here you can choose the top 5
print("")
print("Logistic regression with NGrams but only getting unigrams")
print(list_positive)
1 vote

Stack Overflow user

Posted on 2019-12-18 15:44:46

It looks like you've done a bit too much copy/pasting while implementing the Logistic Regression model. When you get the feature_names from this model, you are using the binary CountVectorizer, cv, instead of ngram_vectorizer. I think you need to change the line

feature_names = zip(cv.get_feature_names(), final_ngram.coef_[0])

to

feature_names = zip(ngram_vectorizer.get_feature_names(), final_ngram.coef_[0])

1 vote
Original page content provided by Stack Overflow; translation supported by Tencent Cloud.
Original link: https://stackoverflow.com/questions/59294196
