SHAP values with cross-validation - should the SHAP values be concatenated on axis 0 or axis 1?

Stack Overflow user
Asked on 2022-06-22 18:02:41
1 answer · 226 views · 0 followers · score 1

My goal is to identify the most important features of my model using SHAP together with cross-validation.

I have the following code:

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
import shap
import pandas as pd
import numpy as np


#loading and preparing the data
iris = load_breast_cancer()
X = iris.data
y = iris.target
columns = iris.feature_names
#if you don't shuffle you won't need to keep track of test_index, but I think
#it is always good practice to shuffle your data
kf = KFold(n_splits=2,shuffle=True)

list_shap_values = list()
list_test_sets = list()
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    X_train = pd.DataFrame(X_train,columns=columns)
    X_test = pd.DataFrame(X_test,columns=columns)

    #training model
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X_train, y_train)

    #explaining model
    explainer = shap.TreeExplainer(clf)
    shap_values = explainer.shap_values(X_test)
    #for each iteration we save the test_set index and the shap_values
    list_shap_values.append(shap_values)
    list_test_sets.append(test_index)


#combining results from all iterations
test_set = list_test_sets[0]
shap_values = np.array(list_shap_values[0])

for i in range(1,len(list_test_sets)):
    test_set = np.concatenate((test_set,list_test_sets[i]),axis=0)
    shap_values = np.concatenate((shap_values,np.array(list_shap_values[i])),axis=1)

#bringing back variable names    
X_test_df = pd.DataFrame(X[test_set],columns=columns)
cols = X_test_df.columns
sv = np.abs(shap_values[1,:,:]).mean(0)

importance_df = pd.DataFrame({
    "column_name": cols,
    "shap_values": sv
})

#expected result
importance_df.sort_values("shap_values", ascending=False)

print(importance_df)

May I ask whether I have implemented this correctly? Specifically, are these lines right?

    test_set = np.concatenate((test_set,list_test_sets[i]),axis=0)
    shap_values = np.concatenate((shap_values,np.array(list_shap_values[i])),axis=1)

I saw this in example code here, but I don't understand why test_set is concatenated on axis 0 while the SHAP values use axis 1. I asked a question about a bug I ran into here, and the fix came up in the comments, but I am not clear how to code it correctly.
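For context on the axis question, here is a minimal NumPy sketch (my own illustration with toy arrays, not part of the original post). For a binary classifier, `TreeExplainer.shap_values` returns a list with one `(n_samples, n_features)` array per class, so wrapping it in `np.array` produces a 3-D array where the sample dimension is axis 1:

```python
import numpy as np

# Simulate the per-fold output of shap_values for a 2-class model:
# a list [class0_array, class1_array], each of shape (n_samples, n_features).
# Two hypothetical folds with 3 and 2 test samples, 4 features:
fold1 = [np.zeros((3, 4)), np.ones((3, 4))]
fold2 = [np.zeros((2, 4)), np.ones((2, 4))]

# np.array stacks the per-class arrays into (n_classes, n_samples, n_features)
a1, a2 = np.array(fold1), np.array(fold2)
print(a1.shape)  # (2, 3, 4)

# test indices are 1-D arrays, so folds join on axis 0;
# in the 3-D SHAP array the sample dimension is axis 1, so folds join on axis 1
combined = np.concatenate((a1, a2), axis=1)
print(combined.shape)  # (2, 5, 4)
```

This is why the two `np.concatenate` calls use different axes: both are concatenating along the sample dimension, which sits at a different position in each array.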

1 Answer

Stack Overflow user

Accepted answer

Posted on 2022-06-25 08:28:19

I would do it like this:

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
import shap
import pandas as pd
import numpy as np


#loading and preparing the data
iris = load_breast_cancer()
X = iris.data
y = iris.target
columns = iris.feature_names
#if you don't shuffle you won't need to keep track of test_index, but I think
#it is always good practice to shuffle your data
kf = KFold(n_splits=2,shuffle=True)

list_shap_values = list()
list_test_sets = list()
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    X_train = pd.DataFrame(X_train,columns=columns)
    X_test = pd.DataFrame(X_test,columns=columns)

    #training model
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X_train, y_train)

    #explaining model
    explainer = shap.TreeExplainer(clf)
    shap_values = explainer.shap_values(X_test)
    #for each iteration we save the test_set index and the shap_values
    list_shap_values.append(shap_values)


# flatten list of lists, pick the sv for 1 class, stack the result
shap_values = np.vstack([sv[1] for sv in list_shap_values])
sv = np.abs(shap_values).mean(0)  # <-- error corrected    
importance_df = pd.DataFrame({
    "column_name": columns,
    "shap_values": sv
})
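As a sanity check (my own toy arrays, not from the original answer), the `np.vstack` approach selects exactly the same class-1 values as the question's axis-1 concatenation followed by slicing:

```python
import numpy as np

rng = np.random.default_rng(0)
# two folds, each a list [class0, class1] of (n_samples, n_features) arrays
list_shap_values = [[rng.random((4, 3)) for _ in range(2)] for _ in range(2)]

# answer's approach: pick class 1 per fold, stack along the sample axis
v1 = np.vstack([sv[1] for sv in list_shap_values])

# question's approach: concatenate the 3-D per-fold arrays on axis 1,
# then slice out class 1
arr = np.concatenate([np.array(sv) for sv in list_shap_values], axis=1)
v2 = arr[1, :, :]

print(np.allclose(v1, v2))  # True
```

The vstack version just skips building the full `(n_classes, n_samples, n_features)` array when only one class's values are needed.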
Score: 1
Original content from Stack Overflow.
Original link:
https://stackoverflow.com/questions/72720106
