Data

This data wrapper provides several ways to supply a dataset, described in the sections below.

Example implementations for the first two use-cases can be found in our section Examples.

Interface

class data.api.data.Data

Abstract class to implement arbitrary datasets, which are provided by the user. This is the general data object that is used in CARLA.

Attributes
categorical

Provides the column names of categorical data.

continuous

Provides the column names of continuous data.

df

The full Dataframe.

df_test

The testing split Dataframe.

df_train

The training split Dataframe.

immutables

Provides the column names of immutable data.

target

Provides the name of the label column.

Methods

inverse_transform(df)

Inverts transform operation.

transform(df)

Data transformation, for example normalization of continuous features and encoding of categorical features.

abstract property categorical

Provides the column names of categorical data. Column names do not contain encoded information such as that produced by pandas get_dummies() (e.g., sex_female).

Label name is not included.

Returns
list of Strings

List of all categorical columns
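To make the distinction between original and encoded column names concrete, here is a small sketch using pandas `get_dummies`. The frame and its column names (`age`, `sex`, `income`) are invented for illustration and do not come from any dataset in the library.

```python
import pandas as pd

# Hypothetical raw data; the column names are illustrative only.
raw = pd.DataFrame(
    {
        "age": [39, 50, 38],
        "sex": ["male", "female", "male"],
        "income": [0, 1, 0],  # label column
    }
)

# What a `categorical` property is documented to return: the original
# feature names, without the label column.
categorical = ["sex"]

# One-hot encoding derives new column names such as "sex_female".
encoded = pd.get_dummies(raw, columns=categorical)
encoded_columns = list(encoded.columns)
print(encoded_columns)
```

The encoded frame contains `sex_female` and `sex_male`, while the `categorical` list still names only `sex`.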

abstract property continuous

Provides the column names of continuous data.

Label name is not included.

Returns
list of Strings

List of all continuous columns

abstract property df

The full Dataframe.

Returns
pd.DataFrame
abstract property df_test

The testing split Dataframe.

Returns
pd.DataFrame
abstract property df_train

The training split Dataframe.

Returns
pd.DataFrame
abstract property immutables

Provides the column names of immutable data.

Label name is not included.

Returns
list of Strings

List of all immutable columns

abstract inverse_transform(df)

Inverts transform operation.

Parameters
df: pd.DataFrame
Returns
pd.DataFrame
abstract property target

Provides the name of the label column.

Returns
str

Target label name

abstract transform(df)

Data transformation, for example normalization of continuous features and encoding of categorical features.

Parameters
df: pd.DataFrame
Returns
pd.DataFrame
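The members listed above can be sketched as a small concrete class. This is a minimal illustration under stated assumptions: `ToyData`, its constructor arguments, and the column names are hypothetical, the class does not import or subclass the library's abstract class, and only min-max normalization of continuous columns is shown (categorical columns are left untouched).

```python
import pandas as pd


class ToyData:
    """Illustrative stand-in; a real implementation would subclass the abstract Data class."""

    def __init__(self, df: pd.DataFrame, n_train: int):
        self._df = df.reset_index(drop=True)
        self._df_train = self._df.iloc[:n_train]
        self._df_test = self._df.iloc[n_train:]
        # Min-max statistics, fitted on the full frame (a choice of this sketch).
        self._min = self._df[self.continuous].min()
        self._span = self._df[self.continuous].max() - self._min

    @property
    def categorical(self):
        return ["sex"]

    @property
    def continuous(self):
        return ["age", "hours"]

    @property
    def immutables(self):
        return ["sex"]

    @property
    def target(self):
        return "income"

    @property
    def df(self):
        return self._df.copy()

    @property
    def df_train(self):
        return self._df_train.copy()

    @property
    def df_test(self):
        return self._df_test.copy()

    def transform(self, df):
        # Normalize continuous columns to [0, 1].
        out = df.copy()
        out[self.continuous] = (df[self.continuous] - self._min) / self._span
        return out

    def inverse_transform(self, df):
        # Undo the normalization applied by transform.
        out = df.copy()
        out[self.continuous] = df[self.continuous] * self._span + self._min
        return out


raw = pd.DataFrame(
    {
        "age": [20.0, 30.0, 40.0, 60.0],
        "hours": [10.0, 40.0, 25.0, 50.0],
        "sex": ["male", "female", "male", "female"],
        "income": [0, 1, 0, 1],
    }
)
data = ToyData(raw, n_train=3)
encoded = data.transform(data.df)
restored = data.inverse_transform(encoded)
```

Applying `inverse_transform` to the output of `transform` recovers the original continuous values up to floating-point error.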

DataCatalog

class data.catalog.catalog.DataCatalog(data_name, df, df_train, df_test, scaling_method='MinMax', encoding_method='OneHot_drop_binary')

Generic framework for datasets, using sklearn processing. This class is implemented by OnlineCatalog and CsvCatalog. OnlineCatalog allows the user to easily load online datasets, while CsvCatalog allows easy use of local datasets.

Parameters
data_name: str

What name the dataset should have.

df: pd.DataFrame

The complete Dataframe. This is equivalent to the combination of df_train and df_test, although not shuffled.

df_train: pd.DataFrame

Training portion of the complete Dataframe.

df_test: pd.DataFrame

Testing portion of the complete Dataframe.

scaling_method: str, default: MinMax

Type of used sklearn scaler. Can be set with the property setter to any sklearn scaler. Set to “Identity” for no scaling.

encoding_method: str, default: OneHot_drop_binary

Type of one-hot encoding, one of {OneHot, OneHot_drop_binary}. The OneHot_drop_binary variant additionally drops one column for binary features. Can be set with the property setter to any sklearn encoder. Set to “Identity” for no encoding.

Returns
Data
Attributes
categorical

Provides the column names of categorical data.

continuous

Provides the column names of continuous data.

df

The full Dataframe.

df_test

The testing split Dataframe.

df_train

The training split Dataframe.

encoder

Contains a fitted sklearn encoder.

immutables

Provides the column names of immutable data.

scaler

Contains a fitted sklearn scaler.

target

Provides the name of the label column.

Methods

get_pipeline_element(key)

Returns a specific element of the transformation pipeline.

inverse_transform(df)

Transforms output after prediction back into original form.

transform(df)

Transforms input for prediction into correct form.

property df: pandas.core.frame.DataFrame

The full Dataframe.

Returns
pd.DataFrame
Return type

DataFrame

property df_test: pandas.core.frame.DataFrame

The testing split Dataframe.

Returns
pd.DataFrame
Return type

DataFrame

property df_train: pandas.core.frame.DataFrame

The training split Dataframe.

Returns
pd.DataFrame
Return type

DataFrame

property encoder: sklearn.base.BaseEstimator

Contains a fitted sklearn encoder.

Returns
sklearn.base.BaseEstimator
Return type

BaseEstimator

get_pipeline_element(key)

Returns a specific element of the transformation pipeline.

Parameters
key: str

Element of the pipeline we want to return

Returns
Pipeline element
Return type

Callable

inverse_transform(df)

Transforms output after prediction back into original form. Only possible for DataFrames with preprocessing steps.

Parameters
df: pd.DataFrame

Contains normalized and encoded data.

Returns
output: pd.DataFrame

Prediction output denormalized and decoded

Return type

DataFrame

property scaler: sklearn.base.BaseEstimator

Contains a fitted sklearn scaler.

Returns
sklearn.base.BaseEstimator
Return type

BaseEstimator

transform(df)

Transforms input for prediction into correct form. Only possible for DataFrames without preprocessing steps.

Using this method is recommended, as it keeps encodings and normalization consistent.

Parameters
df: pd.DataFrame

Contains raw (not normalized and not encoded) data.

Returns
output: pd.DataFrame

Prediction input normalized and encoded

Return type

DataFrame
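The round trip between `transform` and `inverse_transform` can be sketched directly with sklearn. This is an illustration only: it assumes that the documented `MinMax` and `OneHot_drop_binary` settings behave like sklearn's `MinMaxScaler` and `OneHotEncoder(drop="if_binary")`, which the text above does not confirm, and the feature values are invented.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical raw features: one continuous column (age) and two
# categorical columns (a binary one and a three-valued one).
age = np.array([[20.0], [30.0], [40.0], [60.0]])
cats = np.array(
    [
        ["male", "private"],
        ["female", "public"],
        ["male", "self"],
        ["female", "private"],
    ],
    dtype=object,
)

# Fit the scaler and encoder once, as a catalog would hold fitted instances.
scaler = MinMaxScaler().fit(age)
encoder = OneHotEncoder(drop="if_binary").fit(cats)

# transform: raw values -> normalized and encoded values
age_scaled = scaler.transform(age)
cats_encoded = encoder.transform(cats).toarray()

# inverse_transform: normalized and encoded values -> raw values
age_restored = scaler.inverse_transform(age_scaled)
cats_restored = encoder.inverse_transform(cats_encoded)

print(cats_encoded.shape)
```

The binary feature collapses to a single encoded column while the three-valued feature keeps three, so the encoded categorical block has four columns.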

OnlineCatalog

class data.catalog.online_catalog.OnlineCatalog(data_name, scaling_method='MinMax', encoding_method='OneHot_drop_binary')

Implements DataCatalog using already implemented datasets. These datasets are loaded from an online repository.

Parameters
data_name: {‘adult’, ‘compas’, ‘give_me_some_credit’, ‘heloc’}

Used to get the correct dataset from online repository.

Returns
DataCatalog
Attributes
categorical
continuous
immutables
target

Methods

__call__(*args, **kwargs)

Call self as a function.

property categorical: List[str]
Return type

List[str]

property continuous: List[str]
Return type

List[str]

property immutables: List[str]
Return type

List[str]

property target: str
Return type

str

CsvCatalog

class data.catalog.csv_catalog.CsvCatalog(file_path, categorical, continuous, immutables, target, scaling_method='MinMax', encoding_method='OneHot_drop_binary')

Implements DataCatalog using local csv files. Using this class is the easiest way to use your own dataset. Apart from data transformation, no other preprocessing is done; for example, the user should remove NaNs beforehand.

Parameters
file_path: str

Path of the csv file.

categorical: list[str]

List containing the column names of the categorical features.

continuous: list[str]

List containing the column names of the continuous features.

immutables: list[str]

List containing the column names of the immutable features.

target: str

Column name of the target.

Returns
DataCatalog
Attributes
categorical

Provides the column names of categorical data.

continuous

Provides the column names of continuous data.

df

The full Dataframe.

df_test

The testing split Dataframe.

df_train

The training split Dataframe.

encoder

Contains a fitted sklearn encoder.

immutables

Provides the column names of immutable data.

scaler

Contains a fitted sklearn scaler.

target

Provides the name of the label column.

Methods

get_pipeline_element(key)

Returns a specific element of the transformation pipeline.

inverse_transform(df)

Transforms output after prediction back into original form.

transform(df)

Transforms input for prediction into correct form.

property categorical: List[str]

Provides the column names of categorical data. Column names do not contain encoded information such as that produced by pandas get_dummies() (e.g., sex_female).

Label name is not included.

Returns
list of Strings

List of all categorical columns

Return type

List[str]

property continuous: List[str]

Provides the column names of continuous data.

Label name is not included.

Returns
list of Strings

List of all continuous columns

Return type

List[str]

property immutables: List[str]

Provides the column names of immutable data.

Label name is not included.

Returns
list of Strings

List of all immutable columns

Return type

List[str]

property target: str

Provides the name of the label column.

Returns
str

Target label name

Return type

str
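Since CsvCatalog performs no preprocessing beyond data transformation, a csv file usually needs cleaning first. The sketch below shows one way to drop rows with missing values using pandas. The csv content, column names, and file name are invented, and the final constructor call is shown only as a comment because the import path is not assumed here.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical csv content with one missing value in the "age" column.
raw_csv = io.StringIO(
    "age,sex,income\n"
    "39,male,0\n"
    ",female,1\n"
    "38,male,0\n"
)

df = pd.read_csv(raw_csv)

# Remove rows containing NaNs before handing the file to the catalog.
clean = df.dropna().reset_index(drop=True)

# Write the cleaned data to a new file.
clean_path = os.path.join(tempfile.mkdtemp(), "clean.csv")
clean.to_csv(clean_path, index=False)

# Arguments matching the documented constructor parameters.
catalog_kwargs = dict(
    file_path=clean_path,
    categorical=["sex"],
    continuous=["age"],
    immutables=["sex"],
    target="income",
)
# With the class imported, this would be: CsvCatalog(**catalog_kwargs)
```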

Causal Model

class data.causal_model.causal_model.CausalModel(scm_class)

Class with topological methods given a structural causal model. Uses the StructuralCausalModel and CausalGraphicalModel from https://github.com/ijmbarr/causalgraphicalmodels

Parameters
scm_class: str

Name of the structural causal model

Attributes
scm: StructuralCausalModel
StructuralCausalModel from assignment of the form { variable: Function(parents) }.
cgm: CausalGraphicalModel
scm_class: str

Name of the structural causal model

structural_equations_np: dict

Contains the equations for the features in Numpy format.

structural_equations_ts: dict

Contains the equations for the features in TensorFlow format.

noise_distributions: dict

Defines the noise variables.

Methods

generate_dataset(size)

Generates a Data object using the structural causal equations

get_ancestors(node)

Returns all nodes having a path to node.

get_children(node)

Returns the successor nodes of node.

get_descendents(node)

Returns all nodes reachable from node.

get_non_descendents(node)

Returns all nodes not reachable from node.

get_parents(node[, return_sorted])

Returns the set of predecessor nodes of node.

get_topological_ordering([node_type])

Returns a generator of nodes in topologically sorted order.

visualize_graph([experiment_folder_name])

Visualize the causal graph.

property cgm: causalgraphicalmodels.cgm.CausalGraphicalModel
Returns
CausalGraphicalModel
Return type

CausalGraphicalModel

property endogenous: List[str]

Get the endogenous nodes, i.e. the signal nodes.

Returns
List[str]
Return type

List[str]

property exogenous: List[str]

Get the exogenous nodes, i.e. the noise nodes.

Returns
List[str]
Return type

List[str]

generate_dataset(size)

Generates a Data object using the structural causal equations

Parameters
size: int

Number of samples in the dataset

Returns
ScmDataset

a Data object filled with samples

Return type

ScmDataset
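Generating samples from structural causal equations can be sketched with the standard library alone. Everything in this example is hypothetical: the three-variable model, its coefficients, the Gaussian noise, and the helper `generate_rows` are chosen for illustration and are not the library's equations or implementation.

```python
import random

# Hypothetical SCM with the chain x1 -> x2 -> x3 (and x1 -> x3).
# Each equation takes its own noise term first, then its parents.
structural_equations = {
    "x1": lambda u1: u1,
    "x2": lambda u2, x1: -x1 + u2,
    "x3": lambda u3, x1, x2: 0.5 * (0.1 * x1 + 0.5 * x2) + u3,
}

# One noise distribution per exogenous variable.
noise_distributions = {
    "u1": lambda rng: rng.gauss(0.0, 1.0),
    "u2": lambda rng: rng.gauss(0.0, 1.0),
    "u3": lambda rng: rng.gauss(0.0, 1.0),
}


def generate_rows(size, seed=0):
    """Sample `size` rows by evaluating the equations in topological order."""
    rng = random.Random(seed)
    rows = []
    for _ in range(size):
        # Draw the exogenous (noise) variables first.
        u = {name: draw(rng) for name, draw in noise_distributions.items()}
        # Evaluate the endogenous variables, parents before children.
        x = {}
        x["x1"] = structural_equations["x1"](u["u1"])
        x["x2"] = structural_equations["x2"](u["u2"], x["x1"])
        x["x3"] = structural_equations["x3"](u["u3"], x["x1"], x["x2"])
        rows.append({**u, **x})
    return rows


rows = generate_rows(5)
```

Keeping the sampled noise alongside the features mirrors the idea of a dataset that exposes both, so each row can be checked against its generating equations.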

get_ancestors(node)

Returns all nodes having a path to node.

Parameters
node: str

A node in the graph

Returns
set()

The ancestors of node

Return type

set

get_children(node)

Returns the successor nodes of node.

A successor of node is a node m such that there exists a directed edge from node to m.

Parameters
node: str

A node in the graph

Return type

set

get_descendents(node)

Returns all nodes reachable from node.

Parameters
node: str

A node in the graph

Returns
set()

The descendants of node

Return type

set

get_non_descendents(node)

Returns all nodes not reachable from node.

Parameters
node: str

A node in the graph

Returns
set()

The non-descendants of node

Return type

set

get_parents(node, return_sorted=True)

Returns the set of predecessor nodes of node.

A predecessor of node is a node m such that there exists a directed edge from m to node.

Parameters
node: str

A node in the graph

return_sorted: bool

Whether to return the parents in sorted order.
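The graph relations documented above (parents, children, ancestors, descendants, non-descendants) can be illustrated on a small directed acyclic graph with plain Python. The graph and the function bodies are a sketch, not the library's implementation. In particular, excluding the node itself from its non-descendants and returning a sorted list when `return_sorted` is true are choices made for this example.

```python
# Hypothetical DAG as an adjacency mapping: node -> set of children.
children_of = {
    "u1": {"x1"},
    "u2": {"x2"},
    "u3": {"x3"},
    "x1": {"x2", "x3"},
    "x2": {"x3"},
    "x3": set(),
}


def get_children(node):
    # Nodes m with a directed edge node -> m.
    return set(children_of[node])


def get_parents(node, return_sorted=True):
    # Nodes m with a directed edge m -> node.
    parents = {p for p, kids in children_of.items() if node in kids}
    return sorted(parents) if return_sorted else parents


def get_descendents(node):
    # All nodes reachable from node, found by depth-first traversal.
    seen, stack = set(), [node]
    while stack:
        for child in children_of[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def get_ancestors(node):
    # All nodes that have a directed path to node.
    return {other for other in children_of if node in get_descendents(other)}


def get_non_descendents(node):
    # All nodes not reachable from node (node itself excluded in this sketch).
    return set(children_of) - get_descendents(node) - {node}


print(get_parents("x3"))
print(get_ancestors("x2"))
```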

get_topological_ordering(node_type='endogenous')

Returns a generator of nodes in topologically sorted order.

A topological sort is a non-unique permutation of the nodes such that an edge from u to v implies that u appears before v in the topological sort order.

Parameters
node_type: str

“endogenous” or “exogenous”, i.e. nodes with “x” or “u” prefix respectively

Returns
iterable

An iterable of node names in topological sorted order.
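A topological ordering filtered by node type can be sketched with the standard library's `graphlib` module (Python 3.9 or later). The graph is invented, and filtering by the “x” or “u” prefix follows the naming convention described in the parameter text. This is an illustration, not the library's code.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG as a mapping: node -> set of its parents (predecessors).
parents_of = {
    "u1": set(),
    "u2": set(),
    "u3": set(),
    "x1": {"u1"},
    "x2": {"u2", "x1"},
    "x3": {"u3", "x1", "x2"},
}


def get_topological_ordering(node_type="endogenous"):
    """Yield nodes of one type so that every parent precedes its children."""
    prefix = {"endogenous": "x", "exogenous": "u"}[node_type]
    for node in TopologicalSorter(parents_of).static_order():
        if node.startswith(prefix):
            yield node


endogenous_order = list(get_topological_ordering())
exogenous_order = list(get_topological_ordering("exogenous"))
print(endogenous_order)
```

Because the endogenous nodes form the chain x1, x2, x3, their order is unique here. The order of the exogenous nodes is not, since none of them depends on another.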

property noise_distributions: dict

Defines the noise variables.

Returns
dict
Return type

dict

property scm: causalgraphicalmodels.csm.StructuralCausalModel
Returns
StructuralCausalModel
Return type

StructuralCausalModel

property scm_class: str

Name of the structural causal model used to define the CausalModel

Returns
str
Return type

str

property structural_equations_np: dict

Contains the equations for the features in Numpy format.

Returns
dict
Return type

dict

property structural_equations_ts: dict

Contains the equations for the features in TensorFlow format.

Returns
dict
Return type

dict

visualize_graph(experiment_folder_name=None)

Visualize the causal graph.

Parameters
experiment_folder_name: str

Where to save figure.

Synthetic Data

class data.causal_model.synthethic_data.ScmDataset(scm, size)

Generate a dataset from structural equations

Parameters
scm: CausalModel

Structural causal model

size: int

Number of samples in the dataset

Attributes
categorical
categorical_noise

Provides the column names of the categorical data.

continuous
continuous_noise

Provides the column names of the continuous data.

df
df_test
df_train
immutables
noise
noise_test
noise_train
target

Methods

__call__(*args, **kwargs)

Call self as a function.

inverse_transform

transform

property categorical: List[str]
Return type

List[str]

property categorical_noise: List[str]

Provides the column names of the categorical data.

Returns
List[str]
Return type

List[str]

property continuous: List[str]
Return type

List[str]

property continuous_noise: List[str]

Provides the column names of the continuous data.

Returns
List[str]
Return type

List[str]

property df: pandas.core.frame.DataFrame
Return type

DataFrame

property df_test: pandas.core.frame.DataFrame
Return type

DataFrame

property df_train: pandas.core.frame.DataFrame
Return type

DataFrame

property immutables: List[str]
Return type

List[str]

inverse_transform(df)
Return type

DataFrame

property noise: pandas.core.frame.DataFrame
Return type

DataFrame

property noise_test: pandas.core.frame.DataFrame
Return type

DataFrame

property noise_train: pandas.core.frame.DataFrame
Return type

DataFrame

property target: str
Return type

str

transform(df)
Return type

DataFrame