Data
The data wrapper offers three ways to work with a dataset:
- use pre-processed datasets from our OnlineCatalog,
- use local datasets through CsvCatalog, or
- implement any other dataset by inheriting from our DataCatalog or, for even more control, from the Interface.
Example implementations for the first two use cases can be found in our Examples section.
Interface
- class data.api.data.Data
Abstract class for implementing arbitrary user-provided datasets. This is the general data object used in CARLA.
- Attributes
categorical
Provides the column names of categorical data.
continuous
Provides the column names of continuous data.
df
The full DataFrame.
df_test
The testing split DataFrame.
df_train
The training split DataFrame.
immutables
Provides the column names of immutable data.
target
Provides the name of the label column.
Methods
inverse_transform(df)
Inverts transform operation.
transform(df)
Data transformation, for example normalization of continuous features and encoding of categorical features.
- abstract property categorical
Provides the column names of categorical data. Column names do not contain encoded information as produced by pandas get_dummies() (e.g., sex_female).
Label name is not included.
- Returns
- list of Strings
List of all categorical columns
- abstract property continuous
Provides the column names of continuous data.
Label name is not included.
- Returns
- list of Strings
List of all continuous columns
- abstract property df
The full DataFrame.
- Returns
- pd.DataFrame
- abstract property df_test
The testing split DataFrame.
- Returns
- pd.DataFrame
- abstract property df_train
The training split DataFrame.
- Returns
- pd.DataFrame
- abstract property immutables
Provides the column names of immutable data.
Label name is not included.
- Returns
- list of Strings
List of all immutable columns
- abstract inverse_transform(df)
Inverts transform operation.
- Parameters
- df: pd.DataFrame
- Returns
- pd.DataFrame
- abstract property target
Provides the name of the label column.
- Returns
- str
Target label name
- abstract transform(df)
Data transformation, for example normalization of continuous features and encoding of categorical features.
- Parameters
- df: pd.DataFrame
- Returns
- pd.DataFrame
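A minimal sketch of what a custom implementation of such an interface might look like. It does not import CARLA: DataSketch is a hypothetical stand-in that mirrors only a few of the documented members, ToyData and its column names are made up, and rows are plain dictionaries rather than a pd.DataFrame so that the snippet runs with the standard library alone.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Row = Dict[str, float]  # assumption: one dict stands in for one DataFrame row


class DataSketch(ABC):
    """Hypothetical mirror of part of the documented interface."""

    @property
    @abstractmethod
    def continuous(self) -> List[str]:
        """Column names of continuous features (label not included)."""

    @property
    @abstractmethod
    def target(self) -> str:
        """Name of the label column."""

    @abstractmethod
    def transform(self, rows: List[Row]) -> List[Row]:
        """Normalize the continuous features."""

    @abstractmethod
    def inverse_transform(self, rows: List[Row]) -> List[Row]:
        """Invert the transform operation."""


class ToyData(DataSketch):
    """Min-max scales a single continuous column with hand-set bounds."""

    def __init__(self, low: float, high: float) -> None:
        self._low, self._high = low, high

    @property
    def continuous(self) -> List[str]:
        return ["age"]

    @property
    def target(self) -> str:
        return "income"

    def transform(self, rows: List[Row]) -> List[Row]:
        span = self._high - self._low
        return [{**r, "age": (r["age"] - self._low) / span} for r in rows]

    def inverse_transform(self, rows: List[Row]) -> List[Row]:
        span = self._high - self._low
        return [{**r, "age": r["age"] * span + self._low} for r in rows]


data = ToyData(low=20.0, high=60.0)
raw = [{"age": 30.0, "income": 1.0}]
scaled = data.transform(raw)                # age becomes 0.25
restored = data.inverse_transform(scaled)   # back to the raw values
```

Because every member of DataSketch is abstract, a subclass that omits one of them cannot be instantiated, which is the main benefit of declaring the interface as an abstract class.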
DataCatalog
- class data.catalog.catalog.DataCatalog(data_name, df, df_train, df_test, scaling_method='MinMax', encoding_method='OneHot_drop_binary')
Generic framework for datasets, using sklearn processing. This class is implemented by OnlineCatalog and CsvCatalog. OnlineCatalog allows the user to easily load online datasets, while CsvCatalog allows easy use of local datasets.
- Parameters
- data_name: str
What name the dataset should have.
- df: pd.DataFrame
The complete DataFrame. This is equivalent to the combination of df_train and df_test, although not shuffled.
- df_train: pd.DataFrame
Training portion of the complete DataFrame.
- df_test: pd.DataFrame
Testing portion of the complete DataFrame.
- scaling_method: str, default: MinMax
Type of sklearn scaler to use. Can be set with the property setter to any sklearn scaler. Set to “Identity” for no scaling.
- encoding_method: str, default: OneHot_drop_binary
Type of one-hot encoding, one of {OneHot, OneHot_drop_binary}. With OneHot_drop_binary, one column is additionally dropped for binary features. Can be set with the property setter to any sklearn encoder. Set to “Identity” for no encoding.
- Returns
- Data
- Attributes
categorical
Provides the column names of categorical data.
continuous
Provides the column names of continuous data.
df
The full DataFrame.
df_test
The testing split DataFrame.
df_train
The training split DataFrame.
encoder
Contains a fitted sklearn encoder.
immutables
Provides the column names of immutable data.
scaler
Contains a fitted sklearn scaler.
target
Provides the name of the label column.
Methods
get_pipeline_element(key)
Returns a specific element of the transformation pipeline.
inverse_transform(df)
Transforms output after prediction back into original form.
transform(df)
Transforms input for prediction into correct form.
- property df: pandas.core.frame.DataFrame
The full DataFrame.
- Returns
- pd.DataFrame
- Return type
DataFrame
- property df_test: pandas.core.frame.DataFrame
The testing split DataFrame.
- Returns
- pd.DataFrame
- Return type
DataFrame
- property df_train: pandas.core.frame.DataFrame
The training split DataFrame.
- Returns
- pd.DataFrame
- Return type
DataFrame
- property encoder: sklearn.base.BaseEstimator
Contains a fitted sklearn encoder.
- Returns
- sklearn.base.BaseEstimator
- Return type
BaseEstimator
- get_pipeline_element(key)
Returns a specific element of the transformation pipeline.
- Parameters
- key: str
Element of the pipeline to return
- Returns
- Pipeline element
- Return type
Callable
- inverse_transform(df)
Transforms output after prediction back into original form. Only applicable to DataFrames that have gone through the preprocessing steps.
- Parameters
- df: pd.DataFrame
Contains normalized and encoded data.
- Returns
- output: pd.DataFrame
Prediction output denormalized and decoded
- Return type
DataFrame
- property scaler: sklearn.base.BaseEstimator
Contains a fitted sklearn scaler.
- Returns
- sklearn.base.BaseEstimator
- Return type
BaseEstimator
- transform(df)
Transforms input for prediction into correct form. Only applicable to DataFrames that have not yet been preprocessed.
Recommended for keeping encodings and normalization correct.
- Parameters
- df: pd.DataFrame
Contains raw (not normalized and not encoded) data.
- Returns
- output: pd.DataFrame
Prediction input normalized and encoded
- Return type
DataFrame
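The round trip between transform and inverse_transform can be illustrated without CARLA or sklearn. The sketch below is an assumption about what 'MinMax' scaling and 'OneHot_drop_binary' encoding amount to for single values: continuous values are mapped to [0, 1] using the bounds seen during fitting, and for a binary feature only one indicator column is kept. All helper names are hypothetical, and the choice to drop the first category's column is an assumption of this sketch, not a statement about the library.

```python
from typing import Dict, List, Tuple


def fit_min_max(values: List[float]) -> Tuple[float, float]:
    """Learn the bounds a min-max scaler needs from the training values."""
    return min(values), max(values)


def scale(value: float, bounds: Tuple[float, float]) -> float:
    """Map value from [low, high] to [0, 1]."""
    low, high = bounds
    return (value - low) / (high - low)


def unscale(value: float, bounds: Tuple[float, float]) -> float:
    """Inverse of scale."""
    low, high = bounds
    return value * (high - low) + low


def kept_categories(categories: List[str], drop_binary: bool) -> List[str]:
    """Assumption: for binary features the first category's column is dropped."""
    if drop_binary and len(categories) == 2:
        return categories[1:]
    return categories


def encode(feature: str, value: str, categories: List[str],
           drop_binary: bool = True) -> Dict[str, int]:
    """One-hot encode value into columns named feature_category."""
    kept = kept_categories(categories, drop_binary)
    return {f"{feature}_{c}": int(value == c) for c in kept}


def decode(feature: str, encoded: Dict[str, int], categories: List[str],
           drop_binary: bool = True) -> str:
    """Recover the category from its indicator columns."""
    for c in kept_categories(categories, drop_binary):
        if encoded[f"{feature}_{c}"] == 1:
            return c
    # No kept column is set: the value must be the dropped first category.
    return categories[0]


bounds = fit_min_max([20.0, 30.0, 60.0])
age_scaled = scale(30.0, bounds)  # 0.25
sex_encoded = encode("sex", "female", ["female", "male"])
sex_full = encode("sex", "female", ["female", "male"], drop_binary=False)
```

Here sex_encoded holds the single column sex_male, while sex_full holds both sex_female and sex_male. Dropping one column for binary features loses no information, because the remaining indicator determines the dropped one.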
OnlineCatalog
- class data.catalog.online_catalog.OnlineCatalog(data_name, scaling_method='MinMax', encoding_method='OneHot_drop_binary')
Implements DataCatalog using pre-implemented datasets, which are loaded from an online repository.
- Parameters
- data_name: {‘adult’, ‘compas’, ‘give_me_some_credit’, ‘heloc’}
Name used to fetch the corresponding dataset from the online repository.
- Returns
- DataCatalog
- Attributes
- categorical
- continuous
- immutables
- target
Methods
__call__(*args, **kwargs)
Call self as a function.
- property categorical: List[str]
- Return type
List[str]
- property continuous: List[str]
- Return type
List[str]
- property immutables: List[str]
- Return type
List[str]
- property target: str
- Return type
str
CsvCatalog
- class data.catalog.csv_catalog.CsvCatalog(file_path, categorical, continuous, immutables, target, scaling_method='MinMax', encoding_method='OneHot_drop_binary')
Implements DataCatalog using local csv files. Using this class is the easiest way to use your own dataset. Besides data transformation, no other preprocessing is done; for example, the user should remove NaNs beforehand.
- Parameters
- file_path: str
Path of the csv file.
- categorical: list[str]
List containing the column names of the categorical features.
- continuous: list[str]
List containing the column names of the continuous features.
- immutables: list[str]
List containing the column names of the immutable features.
- target: str
Column name of the target.
- Returns
- DataCatalog
- Attributes
categorical
Provides the column names of categorical data.
continuous
Provides the column names of continuous data.
df
The full DataFrame.
df_test
The testing split DataFrame.
df_train
The training split DataFrame.
encoder
Contains a fitted sklearn encoder.
immutables
Provides the column names of immutable data.
scaler
Contains a fitted sklearn scaler.
target
Provides the name of the label column.
Methods
get_pipeline_element(key)
Returns a specific element of the transformation pipeline.
inverse_transform(df)
Transforms output after prediction back into original form.
transform(df)
Transforms input for prediction into correct form.
- property categorical: List[str]
Provides the column names of categorical data. Column names do not contain encoded information as produced by pandas get_dummies() (e.g., sex_female).
Label name is not included.
- Returns
- list of Strings
List of all categorical columns
- Return type
List[str]
- property continuous: List[str]
Provides the column names of continuous data.
Label name is not included.
- Returns
- list of Strings
List of all continuous columns
- Return type
List[str]
- property immutables: List[str]
Provides the column names of immutable data.
Label name is not included.
- Returns
- list of Strings
List of all immutable columns
- Return type
List[str]
- property target: str
Provides the name of the label column.
- Returns
- str
Target label name
- Return type
str
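Because CsvCatalog leaves cleaning to the user, rows with missing values have to be removed before the file is handed over. A minimal sketch of such a cleaning step using only the standard library; the column names, the sample rows and the set of markers treated as missing are assumptions chosen for illustration.

```python
import csv
import io
from typing import Dict, List

# Assumption: these spellings mark a missing value in the raw file.
MISSING = {"", "nan", "na", "null"}


def drop_incomplete_rows(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text and keep only rows in which every field is present."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if all(
            isinstance(value, str) and value.strip().lower() not in MISSING
            for value in row.values()
        )
    ]


def to_csv(rows: List[Dict[str, str]], fieldnames: List[str]) -> str:
    """Serialize the cleaned rows back to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


raw_text = "age,sex,income\n39,male,0\nNaN,female,1\n52,,1\n28,female,0\n"
clean_rows = drop_incomplete_rows(raw_text)  # rows 2 and 3 are dropped
clean_text = to_csv(clean_rows, ["age", "sex", "income"])
```

The cleaned text could then be written to disk and its path passed as file_path, together with the lists of categorical, continuous and immutable column names and the target.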
Causal Model
- class data.causal_model.causal_model.CausalModel(scm_class)
Class with topological methods given a structural causal model. Uses the StructuralCausalModel and CausalGraphicalModel from https://github.com/ijmbarr/causalgraphicalmodels
- Parameters
- scm_class: str
Name of the structural causal model
- Attributes
- scm: StructuralCausalModel
StructuralCausalModel built from an assignment of the form { variable: Function(parents) }.
- cgm: CausalGraphicalModel
- scm_class: str
Name of the structural causal model
- structural_equations_np: dict
Contains the equations for the features in NumPy format.
- structural_equations_ts: dict
Contains the equations for the features in TensorFlow format.
- noise_distributions: dict
Defines the noise variables.
Methods
generate_dataset(size)
Generates a Data object using the structural causal equations.
get_ancestors(node)
Returns all nodes having a path to node.
get_children(node)
Returns the successor nodes of node.
get_descendents(node)
Returns all nodes reachable from node.
get_non_descendents(node)
Returns all nodes not reachable from node.
get_parents(node[, return_sorted])
Returns the predecessor nodes of node.
get_topological_ordering([node_type])
Returns a generator of nodes in topologically sorted order.
visualize_graph([experiment_folder_name])
Visualize the causal graph.
- property cgm: causalgraphicalmodels.cgm.CausalGraphicalModel
- Returns
- CausalGraphicalModel
- Return type
CausalGraphicalModel
- property endogenous: List[str]
Get the endogenous nodes, i.e. the signal nodes.
- Returns
- List[str]
- Return type
List[str]
- property exogenous: List[str]
Get the exogenous nodes, i.e. the noise nodes.
- Returns
- List[str]
- Return type
List[str]
- generate_dataset(size)
Generates a Data object using the structural causal equations.
- Parameters
- size: int
Number of samples in the dataset
- Returns
- ScmDataset
a Data object filled with samples
- Return type
ScmDataset
- get_ancestors(node)
Returns all nodes having a path to node.
- Parameters
- node: str
A node in the graph
- Returns
- set()
The ancestors of node
- Return type
set
- get_children(node)
Returns the successor nodes of node.
A successor of node is a node m such that there exists a directed edge from node to m.
- Parameters
- node: str
A node in the graph
- Return type
set
- get_descendents(node)
Returns all nodes reachable from node.
- Parameters
- node: str
A node in the graph
- Returns
- set()
The descendants of node
- Return type
set
- get_non_descendents(node)
Returns all nodes not reachable from node.
- Parameters
- node: str
A node in the graph
- Returns
- set()
The non-descendants of node
- Return type
set
- get_parents(node, return_sorted=True)
Returns the predecessor nodes of node.
A predecessor of node is a node m such that there exists a directed edge from m to node.
- Parameters
- node: str
A node in the graph
- return_sorted: bool
Whether to return the predecessors sorted
- get_topological_ordering(node_type='endogenous')
Returns a generator of nodes in topologically sorted order.
A topological sort is a non-unique permutation of the nodes such that an edge from u to v implies that u appears before v in the topological sort order.
- Parameters
- node_type: str
“endogenous” or “exogenous”, i.e. nodes with “x” or “u” prefix respectively
- Returns
- iterable
An iterable of node names in topological sorted order.
- property noise_distributions: dict
Defines the noise variables.
- Returns
- dict
- Return type
dict
- property scm: causalgraphicalmodels.csm.StructuralCausalModel
- Returns
- StructuralCausalModel
- Return type
StructuralCausalModel
- property scm_class: str
Name of the structural causal model used to define the CausalModel
- Returns
- str
- Return type
str
- property structural_equations_np: dict
Contains the equations for the features in NumPy format.
- Returns
- dict
- Return type
dict
- property structural_equations_ts: dict
Contains the equations for the features in TensorFlow format.
- Returns
- dict
- Return type
dict
- visualize_graph(experiment_folder_name=None)
Visualize the causal graph.
- Parameters
- experiment_folder_name: str
Where to save figure.
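The topological methods of this class correspond to standard queries on a directed acyclic graph. The sketch below runs them on a hand-written three-node graph, independent of CARLA and of causalgraphicalmodels. The graph, the node names and the function names are made up for illustration, and the library's own implementation may differ; in particular, excluding the node itself from its non-descendants is an assumption of this sketch.

```python
from typing import Dict, List, Set

# Directed edges parent -> children for a made-up graph:
# x1 -> x2, x1 -> x3, x2 -> x3
CHILDREN: Dict[str, Set[str]] = {"x1": {"x2", "x3"}, "x2": {"x3"}, "x3": set()}


def parents(node: str) -> Set[str]:
    """Nodes m with a directed edge from m to node."""
    return {m for m, kids in CHILDREN.items() if node in kids}


def descendants(node: str) -> Set[str]:
    """All nodes reachable from node by following edges forward."""
    found: Set[str] = set()
    stack = list(CHILDREN[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(CHILDREN[current])
    return found


def ancestors(node: str) -> Set[str]:
    """All nodes that have a path to node."""
    return {m for m in CHILDREN if node in descendants(m)}


def non_descendants(node: str) -> Set[str]:
    """All other nodes that are not reachable from node."""
    return set(CHILDREN) - descendants(node) - {node}


def topological_order() -> List[str]:
    """Kahn's algorithm: repeatedly emit a node with no unprocessed parents."""
    in_degree = {n: len(parents(n)) for n in CHILDREN}
    ready = sorted(n for n, degree in in_degree.items() if degree == 0)
    order: List[str] = []
    while ready:
        node = ready.pop(0)
        order.append(node)
        for child in sorted(CHILDREN[node]):
            in_degree[child] -= 1
            if in_degree[child] == 0:
                ready.append(child)
    return order


order = topological_order()  # ['x1', 'x2', 'x3']
```

Every edge from u to v places u before v in the resulting order, which is the property structural equations rely on: each node can be computed once all of its parents are known.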
Synthetic Data
- class data.causal_model.synthethic_data.ScmDataset(scm, size)
Generate a dataset from structural equations.
- Parameters
- scm: CausalModel
Structural causal model
- size: int
Number of samples in the dataset
- Attributes
- categorical
- categorical_noise
Provides the column names of the categorical data.
- continuous
- continuous_noise
Provides the column names of the continuous data.
- df
- df_test
- df_train
- immutables
- noise
- noise_test
- noise_train
- target
Methods
__call__(*args, **kwargs)
Call self as a function.
inverse_transform(df)
transform(df)
- property categorical: List[str]
- Return type
List[str]
- property categorical_noise: List[str]
Provides the column names of the categorical data.
- Returns
- List[str]
- Return type
List[str]
- property continuous: List[str]
- Return type
List[str]
- property continuous_noise: List[str]
Provides the column names of the continuous data.
- Returns
- List[str]
- Return type
List[str]
- property df: pandas.core.frame.DataFrame
- Return type
DataFrame
- property df_test: pandas.core.frame.DataFrame
- Return type
DataFrame
- property df_train: pandas.core.frame.DataFrame
- Return type
DataFrame
- property immutables: List[str]
- Return type
List[str]
- inverse_transform(df)
- Return type
DataFrame
- property noise: pandas.core.frame.DataFrame
- Return type
DataFrame
- property noise_test: pandas.core.frame.DataFrame
- Return type
DataFrame
- property noise_train: pandas.core.frame.DataFrame
- Return type
DataFrame
- property target: str
- Return type
str
- transform(df)
- Return type
DataFrame
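How a dataset can be generated from structural equations and noise distributions may be easier to see on a toy model. The sketch below does not use CARLA: the two equations, the Gaussian noise, the seed and the 80/20 train/test split are all assumptions made for illustration, and rows are plain dictionaries instead of DataFrames. It follows the naming used in this section, with an "x" prefix for endogenous nodes and a "u" prefix for exogenous (noise) nodes.

```python
import random
from typing import Callable, Dict, List

# One sampler per exogenous node (assumed Gaussian noise).
noise_distributions: Dict[str, Callable[[random.Random], float]] = {
    "u1": lambda rng: rng.gauss(0.0, 1.0),
    "u2": lambda rng: rng.gauss(0.0, 0.5),
}

# Each endogenous node is a function of its noise term and its parents.
structural_equations: Dict[str, Callable[..., float]] = {
    "x1": lambda u1: u1,
    "x2": lambda u2, x1: 2.0 * x1 + u2,
}


def generate_rows(size: int, seed: int = 0) -> List[Dict[str, float]]:
    """Sample noise first, then evaluate the equations in topological order."""
    rng = random.Random(seed)
    rows: List[Dict[str, float]] = []
    for _ in range(size):
        u1 = noise_distributions["u1"](rng)
        u2 = noise_distributions["u2"](rng)
        x1 = structural_equations["x1"](u1)
        x2 = structural_equations["x2"](u2, x1)
        rows.append({"u1": u1, "u2": u2, "x1": x1, "x2": x2})
    return rows


rows = generate_rows(size=10)
split = len(rows) * 4 // 5                 # assumed 80/20 split
train, test = rows[:split], rows[split:]
```

Keeping the sampled noise next to the generated features mirrors the separate noise, noise_train and noise_test attributes listed above, and a fixed seed makes the generated rows reproducible.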