Benchmarking

In this notebook we show how you can use CARLA for benchmarking recourse methods. First, we need to load some factuals and generate counterfactuals for them. For a more detailed explanation of these steps, please take a look at our How to use CARLA notebook.

[1]:
from carla import Benchmark
import carla.evaluation.catalog as evaluation_catalog
from carla.data.catalog import OnlineCatalog
from carla.models.catalog import MLModelCatalog
from carla.models.negative_instances import predict_negative_instances
import carla.recourse_methods.catalog as recourse_catalog

import warnings
warnings.filterwarnings("ignore")

%load_ext autoreload
%autoreload 2
Using TensorFlow backend.
[INFO] Using Python-MIP package version 1.12.0 [model.py <module>]

Generating counterfactuals

Before we can benchmark anything, we need some data, a classification model and a recourse method.

[2]:
data_name = "adult"
dataset = OnlineCatalog(data_name)
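
The catalog dataset exposes the underlying data as a pandas DataFrame (we use it below as dataset.df to select factuals). If you want a quick look at it first, the following is a minimal sketch, assuming the dataset object loaded above:

# peek at the loaded adult dataset
print(dataset.df.head())
print(dataset.df.shape)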
[3]:
# load a pretrained catalog model (ANN with PyTorch backend)
model_type = "ann"
ml_model = MLModelCatalog(
    dataset,
    model_type=model_type,
    load_online=True,
    backend="pytorch"
)
[4]:
hyperparams = {
    "data_name": data_name,
    "vae_params": {
        "layers": [len(ml_model.feature_input_order), 512, 256, 8],
    },
}

# define your recourse method
recourse_method = recourse_catalog.CCHVAE(ml_model, hyperparams)
[INFO] Start training of Variational Autoencoder... [models.py fit]
[INFO] [Epoch: 0/5] [objective: 0.375] [models.py fit]
[INFO] [ELBO train: 0.38] [models.py fit]
[INFO] [ELBO train: 0.13] [models.py fit]
[INFO] [ELBO train: 0.12] [models.py fit]
[INFO] [ELBO train: 0.12] [models.py fit]
[INFO] [ELBO train: 0.12] [models.py fit]
[INFO] ... finished training of Variational Autoencoder. [models.py fit]
[5]:
# get factuals: instances the model predicts as the negative class
factuals = predict_negative_instances(ml_model, dataset.df)
factuals = factuals[:100]

# find counterfactuals
counterfactuals = recourse_method.get_counterfactuals(factuals)

Benchmarking

[6]:
# first initialize the benchmarking class by passing the
# black-box model, recourse method, and factuals into it
benchmark = Benchmark(ml_model, recourse_method, factuals)

# now you can decide if you want to run all measurements
# or just specific ones.
evaluation_measures = [
    evaluation_catalog.YNN(benchmark.mlmodel, {"y": 5, "cf_label": 1}),
    evaluation_catalog.Distance(benchmark.mlmodel),
    evaluation_catalog.SuccessRate(),
    evaluation_catalog.Redundancy(benchmark.mlmodel, {"cf_label": 1}),
    evaluation_catalog.ConstraintViolation(benchmark.mlmodel),
    evaluation_catalog.AvgTime({"time": benchmark.timer}),
]

# now run the chosen measurements and create a
# DataFrame that contains all results
results = benchmark.run_benchmark(evaluation_measures)

display(results.head(5))
   y-Nearest-Neighbours  L0_distance  L1_distance  L2_distance  Linf_distance  Success_Rate  Redundancy  Constraint_Violation  avg_time
0                  0.21         10.0     4.348742     4.036693            1.0           1.0           6                     1   1.52311
1                   NaN          8.0     2.235326     2.016212            1.0           NaN           3                     1       NaN
2                   NaN          7.0     1.564834     1.106475            1.0           NaN           5                     1       NaN
3                   NaN         11.0     5.938143     5.252787            1.0           NaN           6                     2       NaN
4                   NaN          8.0     2.435960     2.054773            1.0           NaN           6                     1       NaN
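
Note that y-Nearest-Neighbours, Success_Rate, and avg_time are reported once per benchmark run (first row only), while the distance, redundancy, and constraint-violation columns are reported per counterfactual, which is why the remaining rows show NaN. If you want a single summary row, you can aggregate the columns yourself with plain pandas; the following is a minimal sketch operating on the results DataFrame from above:

# collapse the benchmark results into one summary row;
# NaNs in the run-level columns are ignored by mean()
summary = results.mean(numeric_only=True)
print(summary)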
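You also do not have to evaluate the full list every time: run_benchmark accepts any subset of the measures defined above. As a minimal sketch, assuming the benchmark object from cell [6], this would compute only the distance metrics:

# re-run the benchmark with a single measure to get
# only the L0/L1/L2/Linf distances per counterfactual
distance_results = benchmark.run_benchmark(
    [evaluation_catalog.Distance(benchmark.mlmodel)]
)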