Tutorial

In this tutorial we fairly compare a number of ensemble methods using EI’s built in nested cross-validation implementation, and show how predictions can be made with the selected final model. We then show how we can intepret the model by calculating feature rankings.

Performance analysis and selection of ensemble methods

First of all let’s import some sklearn models, EnsembleIntegration and some additional ensemble methods:

[44]:

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
import pandas as pd
from eipy.ei import EnsembleIntegration
from eipy.additional_ensembles import MeanAggregation, CES

Next load the multi-modal youth diabetes dataset.

[45]:

from eipy.datasets import load_diabetes

data = load_diabetes()
X_train, X_test, y_train, y_test = data['X_train'], data['X_test'], data['y_train'], data['y_test']
print('Modalities available are', list(X_train.keys()))

Modalities available are ['Sociodemographic', 'Health status', 'Diet', 'Other lifestyle behaviors']

Define metrics of interest. fmax_score is a custom metric that outputs both a score and a corresponding threshold.

[46]:

from eipy.metrics import fmax_score
from sklearn.metrics import roc_auc_score, matthews_corrcoef

metrics = {
            'f_max': fmax_score,
            'auc': roc_auc_score,
            'mcc': matthews_corrcoef
            }

Define base predictors:

[47]:

base_predictors = {
                    'ADAB': AdaBoostClassifier(),
                    'XGB': XGBClassifier(),
                    'DT': DecisionTreeClassifier(),
                    'RF': RandomForestClassifier(),
                    'GB': GradientBoostingClassifier(),
                    'KNN': KNeighborsClassifier(),
                    'LR': LogisticRegression(),
                    'NB': GaussianNB(),
                    'MLP': MLPClassifier(),
                    'SVM': SVC(probability=True),
}

Initialise Ensemble Integration:

[48]:

EI = EnsembleIntegration(
                        base_predictors=base_predictors,
                        k_outer=5,
                        k_inner=5,
                        n_samples=1,
                        sampling_strategy="undersampling",
                        sampling_aggregation=None,
                        n_jobs=-1,
                        metrics=metrics,
                        random_state=38,
                        project_name="diabetes",
                        model_building=True,
                        )

Fit base predictors on each modality. Remember to include the unique modality name.

[49]:

for name, modality in X_train.items():
    EI.fit_base(modality, y_train, modality_name=name)

Training base predictors on Sociodemographic...

... for ensemble performance analysis...

Generating ensemble training data: |██████████|100%
Generating ensemble test data: |██████████|100%


... for final ensemble...

Generating ensemble training data: |██████████|100%
Training final base predictors: |██████████|100%



Training base predictors on Health status...

... for ensemble performance analysis...

Generating ensemble training data: |██████████|100%
Generating ensemble test data: |██████████|100%


... for final ensemble...

Generating ensemble training data: |██████████|100%
Training final base predictors: |██████████|100%



Training base predictors on Diet...

... for ensemble performance analysis...

Generating ensemble training data: |██████████|100%
Generating ensemble test data: |██████████|100%


... for final ensemble...

Generating ensemble training data: |██████████|100%
Training final base predictors: |██████████|100%



Training base predictors on Other lifestyle behaviors...

... for ensemble performance analysis...

Generating ensemble training data: |██████████|100%
Generating ensemble test data: |██████████|100%


... for final ensemble...

Generating ensemble training data: |██████████|100%
Training final base predictors: |██████████|100%

We can check the cross validated performance of each base predictor on each modality with the base_summary dictionary. The metric scores are stored in a dataframe and can be accessed with the metrics key. The corresponding threshold values used to threshold the probability vector can be accessed with the thresholds key.

[50]:

EI.base_summary['metrics']

[50]:

modality	Diet										...	Sociodemographic
base predictor	ADAB	DT	GB	KNN	LR	MLP	NB	RF	SVM	XGB	...	ADAB	DT	GB	KNN	LR	MLP	NB	RF	SVM	XGB
f_max	0.235738	0.222865	0.235282	0.224144	0.251120	0.228173	0.225817	0.234442	0.239102	0.231233	...	0.264833	0.227289	0.271730	0.235114	0.274428	0.248038	0.247952	0.264974	0.271551	0.254824
auc	0.559842	0.525266	0.564876	0.520080	0.592825	0.537618	0.532803	0.556643	0.568244	0.554326	...	0.626564	0.538970	0.630678	0.562166	0.630872	0.584445	0.587617	0.618888	0.629168	0.594908
mcc	0.062157	0.033347	0.059942	0.018450	0.094138	0.041208	0.038954	0.054486	0.062126	0.057688	...	0.115635	0.051554	0.125628	0.063908	0.130140	0.090261	0.081534	0.113642	0.128070	0.098644

3 rows × 40 columns

Now let’s define some ensemble models for stacked generalization. We add an “S.” prefix to the keys of stacking algorithms.

[51]:

ensemble_predictors = {
                    'Mean' : MeanAggregation(),
                    'CES' : CES(scoring=lambda y_test, y_pred: fmax_score(y_test, y_pred)[0]),
                    'S.ADAB': AdaBoostClassifier(),
                    'S.XGB': XGBClassifier(),
                    'S.DT': DecisionTreeClassifier(),
                    "S.RF": RandomForestClassifier(),
                    'S.GB': GradientBoostingClassifier(),
                    'S.KNN': KNeighborsClassifier(),
                    'S.LR': LogisticRegression(),
                    'S.NB': GaussianNB(),
                    'S.MLP': MLPClassifier(),
                    'S.SVM': SVC(probability=True),
}

Fit ensemble models:

[52]:

EI.fit_ensemble(ensemble_predictors=ensemble_predictors)

Analyzing ensembles: |          |  0%

Analyzing ensembles: |██████████|100%
Training final ensemble models: |██████████|100%

[52]:

<eipy.ei.EnsembleIntegration at 0x7f05c036b010>

Check the ensemble summary with ensemble_summary:

[53]:

EI.ensemble_summary['metrics']

[53]:

	Mean	CES	S.ADAB	S.XGB	S.DT	S.RF	S.GB	S.KNN	S.LR	S.NB	S.MLP	S.SVM
f_max	0.279307	0.297275	0.292202	0.260111	0.222865	0.27744	0.295322	0.228696	0.304192	0.300064	0.278088	0.240511
auc	0.648915	0.664662	0.652463	0.611726	0.523156	0.64466	0.667541	0.552092	0.678055	0.673285	0.642210	0.573379
mcc	0.140871	0.154157	0.060963	0.057347	0.046134	0.03141	0.053177	0.050426	0.014030	0.168266	0.056389	0.004238

The LR stacking algorithm has the best \(\text{F}_\text{max}\) performance (the preferred metric for imbalanced datasets) so let’s select it as our final model.

Predictions on unseen data

Since we ran EI with model_building=True, we can make predictions. Let’s predict the test set and apply the \(\text{F}_\text{max}\) threshold calculated during training:

[54]:

y_pred = EI.predict(X_dict=X_test, ensemble_model_key='S.LR')

threshold = EI.ensemble_summary['thresholds']['S.LR']['f_max']

y_pred[y_pred>=threshold] = 1
y_pred[y_pred<threshold] = 0

print(y_pred)

[0. 1. 1. ... 0. 1. 1.]

Interpreting the final model

We now use PermutationInterpreter to interpret the final LR stacked generalization model. Let’s first import PermutationInterpreter and our chosen metric, and initialise the interpreter:

[55]:

from eipy.interpretation import PermutationInterpreter

interpreter = PermutationInterpreter(EI=EI,
                                     metric=lambda y_test, y_pred: fmax_score(y_test, y_pred)[0],
                                     ensemble_predictor_keys=['S.LR'],
                                     n_jobs=-1)

Calculate feature importance scores:

[56]:

interpreter.rank_product_score(X_dict=X_test, y=y_test)

Interpreting ensembles...

Calculating local feature ranks: |██████████|100%
Calculating local model ranks: |██████████|100%

Calculating combined rank product score...
... complete!

[56]:

<eipy.interpretation.PermutationInterpreter at 0x7f06008e1590>

We can now inspect the most important features for model prediction:

[57]:

ranking_dataframe = interpreter.ensemble_feature_ranking['S.LR']

ranking_dataframe

[57]:

	modality	feature	RPS	feature rank	ensemble method
38	Health status	BMXHT	0.104007	1.0	S.LR
21	Sociodemographic	FdStmp	0.126047	2.0	S.LR
0	Sociodemographic	RIAGENDR	0.126740	3.0	S.LR
1	Sociodemographic	RIDAGEYR	0.154443	4.0	S.LR
29	Sociodemographic	OvntPT	0.160676	5.0	S.LR
...	...	...	...	...	...
35	Sociodemographic	InSchool	0.380912	104.0	S.LR
3	Sociodemographic	DMDHRGND	0.383682	105.0	S.LR
71	Diet	VDrkGr	0.394518	106.0	S.LR
105	Other lifestyle behaviors	HHSmkNum	0.422812	107.0	S.LR
88	Diet	Pmps	0.430820	108.0	S.LR

108 rows × 5 columns