Question: scikit-learn: investigating incorrectly classified data

Stack Overflow user

Asked 2015-12-31 18:54:58

Answers: 1 | Views: 547 | Followers: 0 | Score: 2

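I want to analyze the data that a scikit-learn model has misclassified so that I can improve my feature generation. I have a method that works, but I am new to both scikit-learn and pandas, so I would like to know whether there is a more efficient or more direct way to do this. It seems like it should be part of a standard workflow, but in my research I found nothing that directly addresses mapping backward from a model's classifications to the original data.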

Here is the context/workflow I am using and the solution I devised. Sample code follows.

Context. My workflow looks like this:

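1. Start with a bunch of JSON blobs, the raw data. This is a pandas DataFrame.
2. Extract the pieces relevant to modeling; call this data. This is a pandas object (a Series in the code below).
3. We also have ground truth for all of the data; call it truth or y.
4. Create a feature matrix in scikit-learn; call it X. This is a large sparse matrix.
5. Create a random forest object; call it forest.
6. Use scikit-learn's train_test_split() function to create random subsets of the feature matrix for training and testing.
7. Train the forest on the training data, X_train, which is a large sparse matrix.
8. Get the indices of the false positive and false negative results. These are indices into the sparse matrix X_test.
9. Go from a false positive index into X_test back to the original data.
10. If necessary, go from data back to the raw data.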

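Solution.

1. Pass an index array into train_test_split(). The function applies the same randomization to the index array and returns it split into indices for the training and test data (idx_train, idx_test).
2. Collect the indices of the false positives and false negatives; these are NumPy ndarrays.
3. Use them to look up the original position in the index array, e.g. index = idx_test[false_example] for false_example in the false_neg array.
4. Use that index to look up the original data: data.iloc[index] is the original data.
5. Then, if needed, data.index[index] gives the index value back into the raw data.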
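As an illustration of steps 3 to 5 above, here is a minimal sketch on toy data. The Series contents, its labels, and the values in idx_test and false_neg are all made up for the example; the point is only that NumPy fancy indexing can do the lookup for every misclassified row at once, and that data.iloc works by position while data.index returns the labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: labels (10..14) deliberately differ from positions (0..4).
data = pd.Series(["a", "b", "c", "d", "e"], index=[10, 11, 12, 13, 14])

# Positions of the test rows within `data`, in the shuffled order that
# train_test_split would hand back for the index array (values assumed).
idx_test = np.array([3, 0, 4])

# Positions *within the test set* that were misclassified (values assumed).
false_neg = np.array([0, 2])

positions = idx_test[false_neg]   # positions in `data`
rows = data.iloc[positions]       # the misclassified rows, looked up by position
labels = data.index[positions]    # their labels, usable against the raw DataFrame

print(rows)
print(list(labels))
```

Because idx_test[false_neg] is itself an array, no explicit loop over the false negatives is needed.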

Below is the code for an example using tweets. Again, this works, but is there a more direct or smarter way to do it?

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

import tweet_fns  # my own module; check_predictions is shown below

# tweet_df (the DataFrame of tweets) and state (the random seed) are defined elsewhere

# take a sample of our original data
data = tweet_df[0:100]['texts']
y = tweet_df[0:100]['truth']

# create the feature vectors
vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
X = vec.fit_transform(data)  # this is now the feature matrix

# split the feature matrix into train/test subsets, keeping the indices back
# into the original X by splitting an array of row positions alongside it
indices = np.arange(X.shape[0])
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, indices, test_size=0.2, random_state=state)

# fit and test a model
forest = RandomForestClassifier()
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)

# get the indices for false negatives and false positives in the test set
false_neg, false_pos = tweet_fns.check_predictions(predictions, y_test)

# map the false negative indices in the test set (which is features) back to
# its original data (text)
print("False negatives: \n")
pd.options.display.max_colwidth = 140
for i in false_neg:
    original_index = idx_test[i]
    print(data.iloc[original_index])

And the check_predictions function:

import numpy as np

def check_predictions(predictions, truth):
    # Take a 1-dim array of predictions from a model and a 1-dim truth vector,
    # print how well they agree, and return the indices of the false negatives
    # and false positives in the predictions.

    # np.asarray drops any pandas index, so the masks below are purely positional
    truth = np.asarray(truth).astype(bool)
    predictions = np.asarray(predictions).astype(bool)
    matches = np.sum(predictions == truth)
    print(matches, "of", len(truth), "or", matches / len(truth), "match")

    # false positives
    print("false positives:", np.sum(predictions & ~truth))
    # false negatives
    print("false negatives:", np.sum(~predictions & truth))
    false_neg = np.nonzero(~predictions & truth)  # these are tuples of arrays
    false_pos = np.nonzero(predictions & ~truth)
    return false_neg[0], false_pos[0]  # we just want the arrays to return
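For reference, the same index extraction can be written a little more compactly with np.flatnonzero, which returns the flat index array directly instead of a tuple. This is only a sketch on made-up boolean arrays, not a replacement for the function above.

```python
import numpy as np

# Made-up predictions and ground truth for five test rows.
predictions = np.array([1, 0, 1, 0, 1], dtype=bool)
truth = np.array([1, 1, 0, 0, 1], dtype=bool)

# flatnonzero returns the positions where the mask is True as a 1-dim array.
false_neg = np.flatnonzero(~predictions & truth)  # predicted False, actually True
false_pos = np.flatnonzero(predictions & ~truth)  # predicted True, actually False

# Fraction of rows where prediction and truth agree.
accuracy = np.mean(predictions == truth)

print(false_neg, false_pos, accuracy)
```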

python

machine-learning

scikit-learn

1 Answer

Stack Overflow user

Accepted answer

Posted 2015-12-31 22:10:57

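Your workflow is:

raw data -> features -> split -> train -> predict -> error analysis on the labels

There is a row-for-row correspondence between the predictions and the feature matrix, so if you want to do error analysis on the features, there should be no problem. If you want to see which raw data is associated with the errors, you have to either perform the split on the raw data or keep track of which data rows map to which test rows (your current approach).

The first option looks like:

fit the transformer on the raw data -> split the raw data -> transform train/test separately -> train/test -> ...

That is, it calls fit before splitting and transform after splitting, which lets you partition the raw data the same way as the labels.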
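A minimal sketch of that first option, under stated assumptions: the texts and labels are invented toy data, and the vectorizer and classifier settings simply mirror the question. Because the raw text itself is split, the misclassified raw rows can be selected with a boolean mask and no separate index array is needed.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Invented toy data standing in for the tweets and their ground truth.
texts = pd.Series([
    "great game tonight", "buy cheap pills now", "lovely weather today",
    "win money fast", "see you at lunch", "free prize click here",
    "reading a good book", "limited offer act now",
])
truth = pd.Series([0, 1, 0, 1, 0, 1, 0, 1])

# Fit the transformer on the raw data before splitting.
vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
vec.fit(texts)

# Split the raw data (not the feature matrix) together with the labels.
text_train, text_test, y_train, y_test = train_test_split(
    texts, truth, test_size=0.25, random_state=0)

# Transform train and test separately after the split.
X_train = vec.transform(text_train)
X_test = vec.transform(text_test)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)

# Rows of text_test line up one-to-one with rows of X_test and predictions.
wrong = predictions != y_test.to_numpy()
misclassified = text_test[wrong]
print(misclassified)
```

The Series returned in misclassified keeps its original pandas index, so it can be joined back to the raw DataFrame directly. Note that fitting on all of the raw data lets the vectorizer see the test text; fitting on text_train only is a common alternative when that matters.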

Score: 0

Original page content provided by Stack Overflow. The Chinese translation was supported by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/34550577
