Data ¶

This data wrapper offers three ways to provide a dataset:

use pre-processed datasets from our OnlineCatalog,
use local datasets via CsvCatalog, or
implement an arbitrary dataset by inheriting from our DataCatalog or, for even more control, from the Interface.

Example implementations for the first two use-cases can be found in our section Examples.

Interface ¶

class data.api.data.Data¶

Abstract class to implement arbitrary datasets, which are provided by the user. This is the general data object that is used in CARLA.

Attributes

categorical: Provides the column names of categorical data.
continuous: Provides the column names of continuous data.
df: The full Dataframe.
df_test: The testing split Dataframe.
df_train: The training split Dataframe.
immutables: Provides the column names of immutable data.
target: Provides the name of the label column.

Methods

`inverse_transform`(df)	Inverts transform operation.
`transform`(df)	Data transformation, for example normalization of continuous features and encoding of categorical features.

abstract property categorical¶

Provides the column names of categorical data. Column names do not contain encoded information of the kind produced by a method such as pandas get_dummies() (e.g., sex_female).

Label name is not included.

Returns

list of Strings: List of all categorical columns

abstract property continuous¶

Provides the column names of continuous data.

Label name is not included.

Returns

list of Strings: List of all continuous columns

abstract property df¶

The full Dataframe.

Returns

pd.DataFrame

abstract property df_test¶

The testing split Dataframe.

Returns

pd.DataFrame

abstract property df_train¶

The training split Dataframe.

Returns

pd.DataFrame

abstract property immutables¶

Provides the column names of immutable data.

Label name is not included.

Returns

list of Strings: List of all immutable columns

abstract inverse_transform(df)¶

Inverts transform operation.

Parameters

df: pd.DataFrame

Returns

pd.DataFrame

abstract property target¶

Provides the name of the label column.

Returns

str: Target label name

abstract transform(df)¶

Data transformation, for example normalization of continuous features and encoding of categorical features.

Parameters

df: pd.DataFrame

Returns

pd.DataFrame
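
As a rough illustration of what implementing the interface involves, the sketch below defines a local stand-in for the abstract class and a hypothetical ToyData subclass. The stand-in mirrors only the members documented above, because this page does not confirm how the real class is imported. The column names, the train/test split, and the divide-by-maximum transform are all assumptions made for the example, and categorical encoding is left out for brevity.

```python
from abc import ABC, abstractmethod

import pandas as pd


class Data(ABC):
    """Local stand-in that mirrors the documented abstract members."""

    @property
    @abstractmethod
    def categorical(self): ...

    @property
    @abstractmethod
    def continuous(self): ...

    @property
    @abstractmethod
    def immutables(self): ...

    @property
    @abstractmethod
    def target(self): ...

    @property
    @abstractmethod
    def df(self): ...

    @property
    @abstractmethod
    def df_train(self): ...

    @property
    @abstractmethod
    def df_test(self): ...

    @abstractmethod
    def transform(self, df): ...

    @abstractmethod
    def inverse_transform(self, df): ...


class ToyData(Data):
    """Hypothetical dataset with one continuous and one categorical feature."""

    def __init__(self):
        self._df = pd.DataFrame(
            {
                "age": [20.0, 30.0, 40.0, 50.0],
                "sex": ["female", "male", "female", "male"],
                "label": [0, 1, 0, 1],
            }
        )
        # Scale factor taken from the full data (assumed scaling scheme).
        self._age_max = self._df["age"].max()

    # Column names exclude the label, as the interface requires.
    categorical = property(lambda self: ["sex"])
    continuous = property(lambda self: ["age"])
    immutables = property(lambda self: ["sex"])
    target = property(lambda self: "label")
    df = property(lambda self: self._df)
    df_train = property(lambda self: self._df.iloc[:3])
    df_test = property(lambda self: self._df.iloc[3:])

    def transform(self, df):
        out = df.copy()
        out["age"] = out["age"] / self._age_max  # normalize continuous feature
        return out

    def inverse_transform(self, df):
        out = df.copy()
        out["age"] = out["age"] * self._age_max  # undo the normalization
        return out


data = ToyData()
scaled = data.transform(data.df)
restored = data.inverse_transform(scaled)
```

Because every abstract member is overridden, ToyData can be instantiated, whereas instantiating the stand-in Data directly raises a TypeError.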

DataCatalog ¶

class data.catalog.catalog.DataCatalog(data_name, df, df_train, df_test, scaling_method='MinMax', encoding_method='OneHot_drop_binary')¶

Generic framework for datasets, using sklearn processing. This class is implemented by OnlineCatalog and CsvCatalog. OnlineCatalog allows the user to easily load online datasets, while CsvCatalog allows easy use of local datasets.

Parameters

data_name: str: What name the dataset should have.
df: pd.DataFrame: The complete Dataframe. This is equivalent to the combination of df_train and df_test, although not shuffled.
df_train: pd.DataFrame: Training portion of the complete Dataframe.
df_test: pd.DataFrame: Testing portion of the complete Dataframe.
scaling_method: str, default: MinMax: Type of sklearn scaler to use. Can be set with the property setter to any sklearn scaler. Set to “Identity” for no scaling.
encoding_method: str, default: OneHot_drop_binary: Type of one-hot encoding, one of {OneHot, OneHot_drop_binary}. The OneHot_drop_binary variant additionally drops one column for binary features. Can be set with the property setter to any sklearn encoder. Set to “Identity” for no encoding.

Returns

Data
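
To make the difference between the two encoding options concrete, here is a plain-Python sketch of one-hot encoding with and without dropping a column for binary features. It only illustrates the idea and is not the library's implementation; the helper name one_hot, the column naming scheme, and the choice of which binary category to drop are assumptions.

```python
def one_hot(rows, column, drop_binary=False):
    """One-hot encode `column` in a list of dict rows (hypothetical helper)."""
    categories = sorted({row[column] for row in rows})
    # For a binary feature a single indicator column carries all the
    # information, so the first category is dropped (assumed convention).
    if drop_binary and len(categories) == 2:
        categories = categories[1:]
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1.0 if row[column] == cat else 0.0
        encoded.append(new_row)
    return encoded


rows = [{"sex": "female", "age": 30}, {"sex": "male", "age": 41}]
full = one_hot(rows, "sex")                       # keeps sex_female and sex_male
compact = one_hot(rows, "sex", drop_binary=True)  # keeps only sex_male
```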

Attributes

categorical: Provides the column names of categorical data.
continuous: Provides the column names of continuous data.
df: The full Dataframe.
df_test: The testing split Dataframe.
df_train: The training split Dataframe.
encoder: Contains a fitted sklearn encoder.
immutables: Provides the column names of immutable data.
scaler: Contains a fitted sklearn scaler.
target: Provides the name of the label column.

Methods

`get_pipeline_element`(key)	Returns a specific element of the transformation pipeline.
`inverse_transform`(df)	Transforms output after prediction back into original form.
`transform`(df)	Transforms input for prediction into correct form.

property df: pandas.core.frame.DataFrame¶

The full Dataframe.

Returns

pd.DataFrame

Return type: DataFrame

property df_test: pandas.core.frame.DataFrame¶

The testing split Dataframe.

Returns

pd.DataFrame

Return type: DataFrame

property df_train: pandas.core.frame.DataFrame¶

The training split Dataframe.

Returns

pd.DataFrame

Return type: DataFrame

property encoder: sklearn.base.BaseEstimator¶

Contains a fitted sklearn encoder.

Returns

sklearn.base.BaseEstimator

Return type: BaseEstimator

get_pipeline_element(key)¶

Returns a specific element of the transformation pipeline.

Parameters

key: str: Element of the pipeline to return

Returns

Pipeline element

Return type: Callable

inverse_transform(df)¶

Transforms output after prediction back into original form. Only possible for DataFrames that have already been preprocessed.

Parameters

df: pd.DataFrame: Contains normalized and encoded data.

Returns

output: pd.DataFrame: Prediction output, denormalized and decoded

Return type: DataFrame

property scaler: sklearn.base.BaseEstimator¶

Contains a fitted sklearn scaler.

Returns

sklearn.base.BaseEstimator

Return type: BaseEstimator

transform(df)¶

Transforms input for prediction into correct form. Only possible for DataFrames that have not been preprocessed yet.

Using this method is recommended to keep encodings and normalization consistent.

Parameters

df: pd.DataFrame: Contains raw (not normalized and not encoded) data.

Returns

output: pd.DataFrame: Prediction input, normalized and encoded

Return type: DataFrame
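
The relationship between transform and inverse_transform can be illustrated with a minimal min-max scaler in plain Python. This is a sketch of the concept for a single continuous column, not the library's code; the class name MinMax1D is made up for the example, and it assumes the fitted values are not all identical.

```python
class MinMax1D:
    """Hypothetical min-max scaler for one list of numbers."""

    def fit(self, values):
        # Assumes at least two distinct values, so the span is non-zero.
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        # Map [lo, hi] onto [0, 1]; expects raw (unscaled) values.
        span = self.hi - self.lo
        return [(v - self.lo) / span for v in values]

    def inverse_transform(self, values):
        # Map [0, 1] back onto [lo, hi]; expects already scaled values.
        span = self.hi - self.lo
        return [v * span + self.lo for v in values]


ages = [20.0, 35.0, 50.0]
scaler = MinMax1D().fit(ages)
scaled = scaler.transform(ages)
restored = scaler.inverse_transform(scaled)
```

Applying transform to data that is already scaled would scale it a second time, which is why each direction is only meaningful for data in the matching state.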

OnlineCatalog ¶

class data.catalog.online_catalog.OnlineCatalog(data_name, scaling_method='MinMax', encoding_method='OneHot_drop_binary')¶

Implements DataCatalog using already implemented datasets. These datasets are loaded from an online repository.

Parameters

data_name: {‘adult’, ‘compas’, ‘give_me_some_credit’, ‘heloc’}: Used to get the correct dataset from the online repository.

Returns

DataCatalog

Attributes

categorical
continuous
immutables
target

Methods

__call__(*args, **kwargs)

Call self as a function.

property categorical: List[str]¶

Return type: List[str]

property continuous: List[str]¶

Return type: List[str]

property immutables: List[str]¶

Return type: List[str]

property target: str¶

Return type: str

CsvCatalog ¶

class data.catalog.csv_catalog.CsvCatalog(file_path, categorical, continuous, immutables, target, scaling_method='MinMax', encoding_method='OneHot_drop_binary')¶

Implements DataCatalog using local CSV files. Using this class is the easiest way to use your own dataset. Besides data transformation, no other preprocessing is done; for example, the user should remove NaNs beforehand.

Parameters

file_path: str: Path of the csv file.
categorical: list[str]: List containing the column names of the categorical features.
continuous: list[str]: List containing the column names of the continuous features.
immutables: list[str]: List containing the column names of the immutable features.
target: str: Column name of the target.

Returns

DataCatalog

Attributes

categorical: Provides the column names of categorical data.
continuous: Provides the column names of continuous data.
df: The full Dataframe.
df_test: The testing split Dataframe.
df_train: The training split Dataframe.
encoder: Contains a fitted sklearn encoder.
immutables: Provides the column names of immutable data.
scaler: Contains a fitted sklearn scaler.
target: Provides the name of the label column.

Methods

`get_pipeline_element`(key)	Returns a specific element of the transformation pipeline.
`inverse_transform`(df)	Transforms output after prediction back into original form.
`transform`(df)	Transforms input for prediction into correct form.

property categorical: List[str]¶

Provides the column names of categorical data. Column names do not contain encoded information of the kind produced by a method such as pandas get_dummies() (e.g., sex_female).

Label name is not included.

Returns

list of Strings: List of all categorical columns

Return type: List[str]

property continuous: List[str]¶

Provides the column names of continuous data.

Label name is not included.

Returns

list of Strings: List of all continuous columns

Return type: List[str]

property immutables: List[str]¶

Provides the column names of immutable data.

Label name is not included.

Returns

list of Strings: List of all immutable columns

Return type: List[str]

property target: str¶

Provides the name of the label column.

Returns

str: Target label name

Return type: str

Causal Model ¶

class data.causal_model.causal_model.CausalModel(scm_class)¶

Class with topological methods given a structural causal model. Uses the StructuralCausalModel and CausalGraphicalModel from https://github.com/ijmbarr/causalgraphicalmodels

Parameters

scm_class: str: Name of the structural causal model

Attributes

scm: StructuralCausalModel: Structural causal model built from assignments of the form { variable: Function(parents) }.
cgm: CausalGraphicalModel
scm_class: str: Name of the structural causal model
structural_equations_np: dict: Contains the equations for the features in NumPy format.
structural_equations_ts: dict: Contains the equations for the features in TensorFlow format.
noise_distributions: dict: Defines the noise variables.

Methods

`generate_dataset`(size)	Generates a Data object using the structural causal equations
`get_ancestors`(node)	Returns all nodes having a path to node.
`get_children`(node)	Returns the set of successor nodes of node.
`get_descendents`(node)	Returns all nodes reachable from node.
`get_non_descendents`(node)	Returns all nodes not reachable from node.
`get_parents`(node[, return_sorted])	Returns the set of predecessor nodes of node.
`get_topological_ordering`([node_type])	Returns a generator of nodes in topologically sorted order.
`visualize_graph`([experiment_folder_name])	Visualize the causal graph.

property cgm: causalgraphicalmodels.cgm.CausalGraphicalModel¶

Returns

CausalGraphicalModel

Return type: CausalGraphicalModel

property endogenous: List[str]¶

Get the endogenous nodes, i.e. the signal nodes.

Returns

List[str]

Return type: List[str]

property exogenous: List[str]¶

Get the exogenous nodes, i.e. the noise nodes.

Returns

List[str]

Return type: List[str]

generate_dataset(size)¶

Generates a Data object using the structural causal equations

Parameters

size: int: Number of samples in the dataset

Returns

ScmDataset: a Data object filled with samples

Return type: ScmDataset
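
Conceptually, generating a dataset from a structural causal model means sampling the noise variables and then evaluating each structural equation in topological order. The following plain-Python sketch shows this for a hypothetical two-variable model. The equations, the noise distributions, and the function name sample_scm are assumptions for illustration and do not reproduce the library's code.

```python
import random

# Hypothetical structural equations: {variable: function(parent values, noise)}.
structural_equations = {
    "x1": lambda values, u: u,
    "x2": lambda values, u: 2.0 * values["x1"] + u,
}
# Hypothetical noise distributions, one sampler per noise variable.
noise_distributions = {
    "u1": lambda rng: rng.gauss(0.0, 1.0),
    "u2": lambda rng: rng.gauss(0.0, 0.5),
}


def sample_scm(size, seed=0):
    """Draw `size` samples; returns (features, noise) as lists of dicts."""
    rng = random.Random(seed)
    features, noise = [], []
    for _ in range(size):
        u = {name: draw(rng) for name, draw in noise_distributions.items()}
        x = {}
        # The dict order x1, x2 is a valid topological order for this model,
        # so every parent value exists before it is needed.
        for var, equation in structural_equations.items():
            x[var] = equation(x, u["u" + var[1:]])
        features.append(x)
        noise.append(u)
    return features, noise


features, noise = sample_scm(size=5)
```

Keeping the sampled noise next to the features mirrors the noise attributes that the synthetic dataset class documents alongside its DataFrames.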

get_ancestors(node)¶

Returns all nodes having a path to node.

Parameters

node: str: A node in the graph

Returns

set(): The ancestors of node

Return type: set

get_children(node)¶

Returns the set of successor nodes of node.

A successor of node is a node m such that there exists a directed edge from node to m.

Parameters

node: str: A node in the graph

Return type: set

get_descendents(node)¶

Returns all nodes reachable from node.

Parameters

node: str: A node in the graph

Returns

set(): The descendants of node

Return type: set

get_non_descendents(node)¶

Returns all nodes not reachable from node.

Parameters

node: str: A node in the graph

Returns

set(): The non-descendants of node

Return type: set

get_parents(node, return_sorted=True)¶

Returns the set of predecessor nodes of node.

A predecessor of node is a node m such that there exists a directed edge from m to node.

Parameters

node: str: A node in the graph
return_sorted: bool: Return the set sorted
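
The graph queries documented above can be illustrated on a small directed acyclic graph stored as a plain dictionary. This sketch is independent of the library; the example graph and the helper functions are hypothetical and only show how parents differ from ancestors and children from descendants.

```python
# Hypothetical causal chain x1 -> x2 -> x3: each node maps to its children.
children = {"x1": {"x2"}, "x2": {"x3"}, "x3": set()}


def get_parents(node):
    """Nodes m with a directed edge m -> node."""
    return {m for m, kids in children.items() if node in kids}


def get_descendants(node):
    """All nodes reachable from node via directed edges."""
    found, stack = set(), list(children[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children[current])
    return found


def get_ancestors(node):
    """All nodes that have a directed path to node."""
    return {m for m in children if node in get_descendants(m)}


parents_of_x3 = get_parents("x3")          # direct predecessors only
ancestors_of_x3 = get_ancestors("x3")      # direct and indirect predecessors
descendants_of_x1 = get_descendants("x1")  # direct and indirect successors
```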

get_topological_ordering(node_type='endogenous')¶

Returns a generator of nodes in topologically sorted order.

A topological sort is a non-unique permutation of the nodes such that an edge from u to v implies that u appears before v in the topological sort order.

Parameters

node_type: str: “endogenous” or “exogenous”, i.e. nodes with “x” or “u” prefix respectively

Returns

iterable: An iterable of node names in topological sorted order.
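
For readers unfamiliar with topological ordering, the following sketch uses graphlib from the Python standard library (Python 3.9 or later) on a hypothetical three-node graph. It illustrates the ordering property described above and is not how the library computes it.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each node maps to the set of its parents (predecessors).
parents = {"x1": set(), "x2": {"x1"}, "x3": {"x1", "x2"}}

# static_order() yields every node after all of its predecessors.
order = list(TopologicalSorter(parents).static_order())
```

For this particular graph only one valid ordering exists; in general several orderings can satisfy the same constraints.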

property noise_distributions: dict¶

Defines the noise variables.

Returns

dict

Return type: dict

property scm: causalgraphicalmodels.csm.StructuralCausalModel¶

Returns

StructuralCausalModel

Return type: StructuralCausalModel

property scm_class: str¶

Name of the structural causal model used to define the CausalModel

Returns

str

Return type: str

property structural_equations_np: dict¶

Contains the equations for the features in NumPy format.

Returns

dict

Return type: dict

property structural_equations_ts: dict¶

Contains the equations for the features in TensorFlow format.

Returns

dict

Return type: dict

visualize_graph(experiment_folder_name=None)¶

Visualize the causal graph.

Parameters

experiment_folder_name: str: Where to save figure.

Synthetic Data ¶

class data.causal_model.synthethic_data.ScmDataset(scm, size)¶

Generate a dataset from structural equations

Parameters

scm: CausalModel: Structural causal model
size: int: Number of samples in the dataset

Attributes

categorical
categorical_noise: Provides the column names of the categorical data.
continuous
continuous_noise: Provides the column names of the continuous data.
df
df_test
df_train
immutables
noise
noise_test
noise_train
target

Methods

__call__(*args, **kwargs)

Call self as a function.

inverse_transform
transform

property categorical: List[str]¶

Return type: List[str]

property categorical_noise: List[str]¶

Provides the column names of the categorical data.

Returns

List[str]

Return type: List[str]

property continuous: List[str]¶

Return type: List[str]

property continuous_noise: List[str]¶

Provides the column names of the continuous data.

Returns

List[str]

Return type: List[str]

property df: pandas.core.frame.DataFrame¶

Return type: DataFrame

property df_test: pandas.core.frame.DataFrame¶

Return type: DataFrame

property df_train: pandas.core.frame.DataFrame¶

Return type: DataFrame

property immutables: List[str]¶

Return type: List[str]

inverse_transform(df)¶

Return type: DataFrame

property noise: pandas.core.frame.DataFrame¶

Return type: DataFrame

property noise_test: pandas.core.frame.DataFrame¶

Return type: DataFrame

property noise_train: pandas.core.frame.DataFrame¶

Return type: DataFrame

property target: str¶

Return type: str

transform(df)¶

Return type: DataFrame
