Tutorial

In this tutorial we fairly compare a number of ensemble methods using EI’s built in nested cross-validation implementation, and show how predictions can be made with the selected final model. We then show how we can intepret the model by calculating feature rankings.

Performance analysis and selection of ensemble methods

First of all let’s import some sklearn models, EnsembleIntegration and some additional ensemble methods:

[44]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
import pandas as pd
from eipy.ei import EnsembleIntegration
from eipy.additional_ensembles import MeanAggregation, CES

Next load the multi-modal youth diabetes dataset.

[45]:
from eipy.datasets import load_diabetes

data = load_diabetes()
X_train, X_test, y_train, y_test = data['X_train'], data['X_test'], data['y_train'], data['y_test']
print('Modalities available are', list(X_train.keys()))
Modalities available are ['Sociodemographic', 'Health status', 'Diet', 'Other lifestyle behaviors']

Define metrics of interest. fmax_score is a custom metric that outputs both a score and a corresponding threshold.

[46]:
from eipy.metrics import fmax_score
from sklearn.metrics import roc_auc_score, matthews_corrcoef

metrics = {
            'f_max': fmax_score,
            'auc': roc_auc_score,
            'mcc': matthews_corrcoef
            }

Define base predictors:

[47]:
base_predictors = {
                    'ADAB': AdaBoostClassifier(),
                    'XGB': XGBClassifier(),
                    'DT': DecisionTreeClassifier(),
                    'RF': RandomForestClassifier(),
                    'GB': GradientBoostingClassifier(),
                    'KNN': KNeighborsClassifier(),
                    'LR': LogisticRegression(),
                    'NB': GaussianNB(),
                    'MLP': MLPClassifier(),
                    'SVM': SVC(probability=True),
}

Initialise Ensemble Integration:

[48]:
EI = EnsembleIntegration(
                        base_predictors=base_predictors,
                        k_outer=5,
                        k_inner=5,
                        n_samples=1,
                        sampling_strategy="undersampling",
                        sampling_aggregation=None,
                        n_jobs=-1,
                        metrics=metrics,
                        random_state=38,
                        project_name="diabetes",
                        model_building=True,
                        )

Fit base predictors on each modality. Remember to include the unique modality name.

[49]:
for name, modality in X_train.items():
    EI.fit_base(modality, y_train, modality_name=name)
Training base predictors on Sociodemographic...

... for ensemble performance analysis...
Generating ensemble training data: |██████████|100%
Generating ensemble test data: |██████████|100%

... for final ensemble...
Generating ensemble training data: |██████████|100%
Training final base predictors: |██████████|100%


Training base predictors on Health status...

... for ensemble performance analysis...
Generating ensemble training data: |██████████|100%
Generating ensemble test data: |██████████|100%

... for final ensemble...
Generating ensemble training data: |██████████|100%
Training final base predictors: |██████████|100%


Training base predictors on Diet...

... for ensemble performance analysis...
Generating ensemble training data: |██████████|100%
Generating ensemble test data: |██████████|100%

... for final ensemble...
Generating ensemble training data: |██████████|100%
Training final base predictors: |██████████|100%


Training base predictors on Other lifestyle behaviors...

... for ensemble performance analysis...
Generating ensemble training data: |██████████|100%
Generating ensemble test data: |██████████|100%

... for final ensemble...
Generating ensemble training data: |██████████|100%
Training final base predictors: |██████████|100%


We can check the cross validated performance of each base predictor on each modality with the base_summary dictionary. The metric scores are stored in a dataframe and can be accessed with the metrics key. The corresponding threshold values used to threshold the probability vector can be accessed with the thresholds key.

[50]:
EI.base_summary['metrics']
[50]:
modality Diet ... Sociodemographic
base predictor ADAB DT GB KNN LR MLP NB RF SVM XGB ... ADAB DT GB KNN LR MLP NB RF SVM XGB
f_max 0.235738 0.222865 0.235282 0.224144 0.251120 0.228173 0.225817 0.234442 0.239102 0.231233 ... 0.264833 0.227289 0.271730 0.235114 0.274428 0.248038 0.247952 0.264974 0.271551 0.254824
auc 0.559842 0.525266 0.564876 0.520080 0.592825 0.537618 0.532803 0.556643 0.568244 0.554326 ... 0.626564 0.538970 0.630678 0.562166 0.630872 0.584445 0.587617 0.618888 0.629168 0.594908
mcc 0.062157 0.033347 0.059942 0.018450 0.094138 0.041208 0.038954 0.054486 0.062126 0.057688 ... 0.115635 0.051554 0.125628 0.063908 0.130140 0.090261 0.081534 0.113642 0.128070 0.098644

3 rows × 40 columns

Now let’s define some ensemble models for stacked generalization. We add an “S.” prefix to the keys of stacking algorithms.

[51]:
ensemble_predictors = {
                    'Mean' : MeanAggregation(),
                    'CES' : CES(scoring=lambda y_test, y_pred: fmax_score(y_test, y_pred)[0]),
                    'S.ADAB': AdaBoostClassifier(),
                    'S.XGB': XGBClassifier(),
                    'S.DT': DecisionTreeClassifier(),
                    "S.RF": RandomForestClassifier(),
                    'S.GB': GradientBoostingClassifier(),
                    'S.KNN': KNeighborsClassifier(),
                    'S.LR': LogisticRegression(),
                    'S.NB': GaussianNB(),
                    'S.MLP': MLPClassifier(),
                    'S.SVM': SVC(probability=True),
}

Fit ensemble models:

[52]:
EI.fit_ensemble(ensemble_predictors=ensemble_predictors)
Analyzing ensembles: |          |  0%
Analyzing ensembles: |██████████|100%
Training final ensemble models: |██████████|100%
[52]:
<eipy.ei.EnsembleIntegration at 0x7f05c036b010>

Check the ensemble summary with ensemble_summary:

[53]:
EI.ensemble_summary['metrics']
[53]:
Mean CES S.ADAB S.XGB S.DT S.RF S.GB S.KNN S.LR S.NB S.MLP S.SVM
f_max 0.279307 0.297275 0.292202 0.260111 0.222865 0.27744 0.295322 0.228696 0.304192 0.300064 0.278088 0.240511
auc 0.648915 0.664662 0.652463 0.611726 0.523156 0.64466 0.667541 0.552092 0.678055 0.673285 0.642210 0.573379
mcc 0.140871 0.154157 0.060963 0.057347 0.046134 0.03141 0.053177 0.050426 0.014030 0.168266 0.056389 0.004238

The LR stacking algorithm has the best \(\text{F}_\text{max}\) performance (the preferred metric for imbalanced datasets) so let’s select it as our final model.

Predictions on unseen data

Since we ran EI with model_building=True, we can make predictions. Let’s predict the test set and apply the \(\text{F}_\text{max}\) threshold calculated during training:

[54]:
y_pred = EI.predict(X_dict=X_test, ensemble_model_key='S.LR')

threshold = EI.ensemble_summary['thresholds']['S.LR']['f_max']

y_pred[y_pred>=threshold] = 1
y_pred[y_pred<threshold] = 0

print(y_pred)
[0. 1. 1. ... 0. 1. 1.]

Interpreting the final model

We now use PermutationInterpreter to interpret the final LR stacked generalization model. Let’s first import PermutationInterpreter and our chosen metric, and initialise the interpreter:

[55]:
from eipy.interpretation import PermutationInterpreter

interpreter = PermutationInterpreter(EI=EI,
                                     metric=lambda y_test, y_pred: fmax_score(y_test, y_pred)[0],
                                     ensemble_predictor_keys=['S.LR'],
                                     n_jobs=-1)

Calculate feature importance scores:

[56]:
interpreter.rank_product_score(X_dict=X_test, y=y_test)
Interpreting ensembles...

Calculating local feature ranks: |██████████|100%
Calculating local model ranks: |██████████|100%
Calculating combined rank product score...
... complete!

[56]:
<eipy.interpretation.PermutationInterpreter at 0x7f06008e1590>

We can now inspect the most important features for model prediction:

[57]:
ranking_dataframe = interpreter.ensemble_feature_ranking['S.LR']

ranking_dataframe
[57]:
modality feature RPS feature rank ensemble method
38 Health status BMXHT 0.104007 1.0 S.LR
21 Sociodemographic FdStmp 0.126047 2.0 S.LR
0 Sociodemographic RIAGENDR 0.126740 3.0 S.LR
1 Sociodemographic RIDAGEYR 0.154443 4.0 S.LR
29 Sociodemographic OvntPT 0.160676 5.0 S.LR
... ... ... ... ... ...
35 Sociodemographic InSchool 0.380912 104.0 S.LR
3 Sociodemographic DMDHRGND 0.383682 105.0 S.LR
71 Diet VDrkGr 0.394518 106.0 S.LR
105 Other lifestyle behaviors HHSmkNum 0.422812 107.0 S.LR
88 Diet Pmps 0.430820 108.0 S.LR

108 rows × 5 columns