Q: TF-IDF vectors can be generated at different levels of input tokens (words, characters, n-grams). Which one should we use?

Stack Overflow user

Asked 2020-07-18 23:00:37

1 answer, 231 views, 0 followers, 1 vote

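a. Word-level TF-IDF: a matrix of the tf-idf scores of every term across the documents.

b. N-gram level TF-IDF: N-grams are combinations of N terms together. This matrix holds the tf-idf scores of the N-grams.

c. Character-level TF-IDF: a matrix of the tf-idf scores of character-level n-grams in the dataset.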

from sklearn.feature_extraction.text import TfidfVectorizer

# trainDF, train_x and valid_x are assumed to be defined earlier (not shown in the question).

# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(trainDF['texts'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)


# n-gram level tf-idf: word 2-grams and 3-grams
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2, 3),
                                   max_features=5000)
tfidf_vect_ngram.fit(trainDF['texts'])
xtrain_tfidf_ngram = tfidf_vect_ngram.transform(train_x)
xvalid_tfidf_ngram = tfidf_vect_ngram.transform(valid_x)


# character level tf-idf: character 2-grams and 3-grams
# (token_pattern is dropped here because scikit-learn only uses it when analyzer='word')
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2, 3), max_features=5000)
tfidf_vect_ngram_chars.fit(trainDF['texts'])
xtrain_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(train_x)
xvalid_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(valid_x)
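
To make the difference between the three levels concrete, here is a minimal sketch that asks each vectorizer configuration for its analyzer and runs it on a toy string. The string `sample` and the variable names are made up for illustration; `build_analyzer()` is the scikit-learn method that returns the preprocessing and tokenization callable a vectorizer would use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy input, not taken from the question's dataset.
sample = "red apple"

# Same settings as the three vectorizers in the question
# (token_pattern is omitted for analyzer='char', where it has no effect).
word_analyzer = TfidfVectorizer(
    analyzer="word", token_pattern=r"\w{1,}"
).build_analyzer()
ngram_analyzer = TfidfVectorizer(
    analyzer="word", token_pattern=r"\w{1,}", ngram_range=(2, 3)
).build_analyzer()
char_analyzer = TfidfVectorizer(
    analyzer="char", ngram_range=(2, 3)
).build_analyzer()

word_tokens = word_analyzer(sample)    # individual words
ngram_tokens = ngram_analyzer(sample)  # runs of 2 to 3 consecutive words
char_tokens = char_analyzer(sample)    # runs of 2 to 3 consecutive characters

print(word_tokens)
print(ngram_tokens)
print(char_tokens)
```

Each distinct token an analyzer emits over the training texts becomes one candidate column of the TF-IDF matrix, which is why the three settings can give very different feature spaces for the same documents.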

machine-learning

tf-idf

1 Answer

Stack Overflow user

Accepted answer

Answered 2020-07-18 23:25:57

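There is no single right answer for all cases. The approach will depend on the nature of the data.

You should use GridSearchCV to identify the approach that works best for your case. The official scikit-learn documentation has a good example of a pipeline for text feature extraction.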
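
As a rough sketch of that suggestion (not the example from the documentation), the tokenization level can be treated as a hyperparameter of a `Pipeline` and compared with `GridSearchCV`. The toy `texts` and `labels`, the choice of `LogisticRegression` as the classifier, and the two-fold split are all assumptions made for illustration; with real data you would substitute your own corpus, model, and scoring metric.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = positive review, 0 = negative review.
texts = [
    "the movie was wonderful and moving",
    "a truly great film with great acting",
    "wonderful acting and a great story",
    "I loved this wonderful film",
    "the movie was terrible and boring",
    "a truly awful film with bad acting",
    "boring story and terrible acting",
    "I hated this awful film",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier are tuned together, so the vectorizer is
# refit on the training part of every cross-validation split.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# One grid per analyzer: word unigrams, word 2- to 3-grams,
# and character 2- to 3-grams (the three levels from the question).
param_grid = [
    {"tfidf__analyzer": ["word"], "tfidf__ngram_range": [(1, 1), (2, 3)]},
    {"tfidf__analyzer": ["char"], "tfidf__ngram_range": [(2, 3)]},
]

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Which setting wins on eight toy sentences says nothing about real data; the point is only that the comparison is made by cross-validated score rather than by a fixed rule.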

Votes: 1

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/62970035
