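a. Word-level TF-IDF: a matrix of the tf-idf scores of every term across the documents.
b. N-gram-level TF-IDF: an N-gram is a combination of N terms; this matrix holds the tf-idf scores of the N-grams.
c. Character-level TF-IDF: a matrix of the tf-idf scores of character-level n-grams.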
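To make the three granularities concrete, here is a minimal sketch on a two-sentence toy corpus. The corpus and the names `docs`, `vectorizers`, and `vocabularies` are assumptions made for illustration; they are not part of the original code. It only shows which features each setting extracts, not which one performs better.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, assumed purely for illustration.
docs = ["the cat sat", "the dog sat"]

vectorizers = {
    # single words
    "word": TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}"),
    # sequences of 2 to 3 consecutive words
    "ngram": TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}", ngram_range=(2, 3)),
    # sequences of 2 to 3 consecutive characters (spaces included)
    "char": TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),
}

vocabularies = {}
for name, vect in vectorizers.items():
    matrix = vect.fit_transform(docs)  # one row per document, one column per feature
    vocabularies[name] = sorted(vect.vocabulary_)
    print(name, matrix.shape, vocabularies[name][:6])
```

Each matrix has one row per document; only the definition of a "feature" (word, word sequence, or character sequence) changes between the three.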
from sklearn.feature_extraction.text import TfidfVectorizer

# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(trainDF['texts'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
# ngram level tf-idf: N-grams are combinations of N terms; this matrix holds the tf-idf scores of the N-grams
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2, 3), max_features=5000)
tfidf_vect_ngram.fit(trainDF['texts'])
xtrain_tfidf_ngram = tfidf_vect_ngram.transform(train_x)
xvalid_tfidf_ngram = tfidf_vect_ngram.transform(valid_x)
# character level tf-idf: matrix of the tf-idf scores of character-level n-grams in the dataset
# (token_pattern is dropped here because it only applies when analyzer='word')
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2, 3), max_features=5000)
tfidf_vect_ngram_chars.fit(trainDF['texts'])
xtrain_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(train_x)
xvalid_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(valid_x)

Posted on 2020-07-18 23:25:57
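There is no single right answer that fits every case. The best approach depends on the nature of your data.
You should use GridSearchCV to identify which approach works best for your case. The official documentation has a good example of the pipeline for text feature extraction.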
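As a rough illustration of that suggestion, the sketch below puts the vectorizer and a classifier in one `Pipeline` and lets `GridSearchCV` compare the word, n-gram, and character settings. The toy `texts` and `labels`, the choice of `LogisticRegression`, and `cv=3` are all assumptions made so the example runs on its own; substitute your own data, model, and scoring.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, assumed for illustration only.
texts = [
    "great movie loved it", "wonderful acting and story", "really enjoyed this film",
    "fantastic and fun", "loved the cast", "great fun overall",
    "terrible movie hated it", "awful acting and plot", "really disliked this film",
    "boring and slow", "hated the cast", "awful waste overall",
]
labels = [1] * 6 + [0] * 6

# The vectorizer lives inside the pipeline, so it is refit on each training fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),  # classifier choice is an assumption
])

# Two sub-grids: token_pattern only applies to the word analyzer.
param_grid = [
    {
        "tfidf__analyzer": ["word"],
        "tfidf__token_pattern": [r"\w{1,}"],
        "tfidf__ngram_range": [(1, 1), (2, 3)],  # word level vs. n-gram level
    },
    {
        "tfidf__analyzer": ["char"],
        "tfidf__ngram_range": [(2, 3)],  # character level
    },
]

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

On a corpus this small the scores mean little; the point is only the structure: one grid entry per vectorizer configuration, with cross-validation deciding between them.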
https://stackoverflow.com/questions/62970035