API documentation

EmbeddingEncoder class

class embedding_encoder.core.EmbeddingEncoder(task, numeric_vars=None, dimensions=None, layers_units=None, dropout=0.2, classif_classes=None, classif_loss=None, optimizer='adam', epochs=5, batch_size=32, validation_split=0.2, verbose=0, mapping_path=None, pretrained=False, keep_model=False)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Obtain numeric embeddings from categorical variables.

Embedding Encoder trains a small neural network with categorical inputs passed through embedding layers. Numeric variables can be included as additional inputs by setting numeric_vars.

By default, Embedding Encoder sizes the embedding of each categorical variable at (unique_values + 1) / 2 dimensions, with a minimum of 2 and a maximum of 50. This can be overridden by passing a list of integers to dimensions.
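The default sizing rule can be sketched in a few lines. The helper name default_embedding_dim is hypothetical, and the lower bound of 2 is taken from the description above rather than from the library's source, so treat this as an illustration and not the library's implementation.

```python
import math


def default_embedding_dim(unique_values: int) -> int:
    """Illustrative default: ceil((unique_values + 1) / 2), clamped to [2, 50]."""
    raw = math.ceil((unique_values + 1) / 2)
    # Clamp to the documented minimum of 2 and maximum of 50.
    return max(2, min(50, raw))


# Print the rule for a small, a medium and a high-cardinality variable.
for n_unique in (1, 10, 200):
    print(n_unique, default_embedding_dim(n_unique))
```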

The neural network architecture and training loop can be partially modified. layers_units takes a list of integers, each adding one dense layer; e.g., [32, 24, 16] creates 3 hidden layers with 32, 24 and 16 units, with dropout layers interleaved. dropout controls the dropout rate.
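One plausible reading of how layers_units and dropout combine is sketched below as a plain list of layer descriptions. The function name hidden_stack and the tuple representation are hypothetical, and placing a dropout layer after every dense layer (including the last) is an assumption; the library may order the layers differently.

```python
from typing import List, Tuple, Union

Layer = Tuple[str, Union[int, float]]


def hidden_stack(layers_units: List[int], dropout: float = 0.2) -> List[Layer]:
    """Describe the hidden layers as (kind, setting) pairs (illustrative only)."""
    plan: List[Layer] = []
    for units in layers_units:
        plan.append(("dense", units))
        # Assumption: one dropout layer follows each dense layer.
        plan.append(("dropout", dropout))
    return plan


print(hidden_stack([32, 24, 16]))
```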

While Embedding Encoder will try to infer the appropriate number of units for the output layer and the model’s loss for classification tasks, these can be set with classif_classes and classif_loss. Regression tasks will always have 1 unit in the output layer and mean squared error loss.

optimizer and batch_size are passed directly to Keras.

validation_split is also passed to Keras. If it is greater than 0, validation loss is used to decide whether to stop training early; otherwise training loss is used.
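The choice of monitored quantity can be sketched as a one-line rule. The function name is hypothetical; "val_loss" and "loss" are the standard Keras metric names, but whether the library passes them to an early-stopping callback in exactly this way is an assumption.

```python
def early_stopping_monitor(validation_split: float) -> str:
    """Pick the metric an early-stopping callback would watch (illustrative)."""
    # Keras only reports "val_loss" when part of the data is held out.
    return "val_loss" if validation_split > 0 else "loss"


# The result could then be handed to a callback, for example
# keras.callbacks.EarlyStopping(monitor=early_stopping_monitor(0.2)).
print(early_stopping_monitor(0.2))
print(early_stopping_monitor(0.0))
```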

mapping_path is the path to a JSON file where the embedding mapping will be saved. If pretrained is set to True, the mapping will be loaded from this file and no model will be trained.
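The save-then-reload cycle implied by mapping_path and pretrained can be illustrated with the standard library alone. The JSON layout used here (variable, then category, then embedding vector) is hypothetical and is not the file format the library writes; the sketch only shows the idea of persisting a mapping once and reading it back without training.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical layout: variable -> category -> embedding vector.
mapping = {"color": {"red": [0.12, -0.40], "blue": [0.55, 0.03]}}

with tempfile.TemporaryDirectory() as tmp:
    mapping_path = Path(tmp) / "mapping.json"
    # Save step: what a fitted encoder would do when mapping_path is set.
    mapping_path.write_text(json.dumps(mapping))
    # Load step: what pretrained=True implies (read the file, train nothing).
    restored = json.loads(mapping_path.read_text())

print(restored == mapping)
```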

Parameters
  • task (str) – “regression” or “classification”. This determines the units in the head layer, loss and metrics used.

  • numeric_vars (Optional[List[str]]) – Array-like of strings containing the names of the numeric variables that will be included as inputs to the network.

  • dimensions (Optional[List[int]]) – Array-like of integers containing the number of embedding dimensions for each categorical feature. If None, each dimension defaults to min(50, int(np.ceil((unique + 1) / 2))), where unique is the number of unique values of that feature.

  • layers_units (Optional[List[int]]) – Array-like of integers which define how many dense layers to include and how many units they should have. By default None, which creates two hidden layers with 24 and 12 units.

  • dropout (float) – Dropout rate used between dense layers.

  • classif_classes (Optional[int]) – Number of classes in y for classification tasks.

  • classif_loss (Optional[str], optional) – Loss function for classification tasks.

  • optimizer (str) – Optimizer, default “adam”.

  • epochs (int) – Number of epochs, default 5.

  • batch_size (int) – Batch size, default 32.

  • validation_split (float) – Passed to Keras Model.fit.

  • verbose (int) – Verbosity of the Keras Model.fit, default 0.

  • mapping_path (Union[str, Path, None]) – Path to a JSON file where the mapping from categorical variables to embeddings will be saved. If pretrained is True, the mapping will be loaded from this file and no model will be trained.

  • pretrained (bool) – Whether to use pretrained embeddings found in the JSON at mapping_path.

  • keep_model (bool) – Whether to assign the TensorFlow model to _model. Setting to True will prevent the EmbeddingEncoder from being pickled. Default False. Please note that the model’s history dict is available at _history.

_history

Keras model.history.history containing training data.

Type

dict

_model

Keras model. Only available if keep_model is True.

Type

keras.Model

_embeddings_mapping

Dictionary mapping categorical variables to their embeddings.

Type

dict

Raises
  • ValueError – If task is not “regression” or “classification”.

  • ValueError – If classif_classes or classif_loss are specified for regression tasks.

  • ValueError – If classif_classes is specified but classif_loss is not.
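The three checks listed above can be sketched as a standalone validator. The function name check_arguments and the error messages are hypothetical; only the conditions come from the list of raised errors.

```python
from typing import Optional


def check_arguments(
    task: str,
    classif_classes: Optional[int] = None,
    classif_loss: Optional[str] = None,
) -> None:
    """Raise ValueError for the argument combinations documented as invalid."""
    if task not in ("regression", "classification"):
        raise ValueError("task must be 'regression' or 'classification'.")
    if task == "regression" and (
        classif_classes is not None or classif_loss is not None
    ):
        raise ValueError("classif_* arguments do not apply to regression tasks.")
    if classif_classes is not None and classif_loss is None:
        raise ValueError("classif_loss is required when classif_classes is set.")


# A valid classification setup passes silently.
check_arguments("classification", classif_classes=3, classif_loss="categorical_crossentropy")
```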

fit(X, y)

Fit the EmbeddingEncoder to X.

Parameters
  • X (DataFrame) – The data to process. It can include numeric variables that will not be encoded but will be used in the neural network as additional inputs.

  • y (Union[DataFrame, Series]) – Target data. Used as target in the neural network.

Returns

self – Fitted Embedding Encoder.

Return type

object

mapping_to_json()

Return type

None

mapping_from_json()

Return type

Dict[str, DataFrame]

transform(X)

Transform X using computed variable embeddings.

Parameters

X (DataFrame) – The data to process.

Returns

embeddings – Vector embeddings for each categorical variable.
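A minimal end-to-end sketch of fit followed by transform, under stated assumptions: the toy column names and values are invented for illustration, both columns are treated as categorical because numeric_vars is left unset, and the import path follows the class path given in the signature above. Imports live inside the function because running it needs pandas, TensorFlow and embedding-encoder installed; read it as a usage illustration and not a tested recipe.

```python
# Toy data, invented for illustration only.
TOY_FEATURES = {
    "city": ["lima", "quito", "lima", "bogota", "quito", "bogota", "lima", "quito"],
    "kind": ["flat", "house", "house", "flat", "flat", "house", "flat", "house"],
}
TOY_TARGET = [100.0, 150.0, 140.0, 90.0, 95.0, 160.0, 105.0, 155.0]


def encode_example():
    """Fit an EmbeddingEncoder on the toy data and return the transformed output."""
    # Local imports: these third-party packages are only needed when called.
    import pandas as pd
    from embedding_encoder.core import EmbeddingEncoder

    X = pd.DataFrame(TOY_FEATURES)
    y = pd.Series(TOY_TARGET, name="price")

    # numeric_vars is left as None, so (by assumption) every column is embedded.
    encoder = EmbeddingEncoder(task="regression", epochs=5, batch_size=32)
    encoder.fit(X, y)
    return encoder.transform(X)
```

Calling encode_example() in an environment with those packages installed should return the embedded representation of the two toy columns.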

inverse_transform(X)

Inverse transform X using computed variable embeddings.

Parameters

X (Union[DataFrame, ndarray]) – The data to process.

Returns

Original DataFrame.

get_feature_names_out(input_features=None)

get_feature_names(input_features=None)

plot_embeddings(variable, model='pca')

Create a 2D scatterplot of a variable’s embeddings. Each dot represents a category.

Parameters
  • variable (str) – Variable to plot. Please note that scikit-learn’s Pipeline might strip column names.

  • model (str, optional) – Dimensionality reduction model. Either “tsne” or “pca”. Default “pca”.

Returns

Seaborn scatterplot (Matplotlib axes)

Return type

matplotlib.axes._subplots.AxesSubplot

Raises
  • ValueError – If selected variable has fewer than 3 unique values.

  • ValueError – If selected model is not “tsne” or “pca”.

  • ImportError – If seaborn is not installed.

Utilities

class embedding_encoder.utils.compose.ColumnTransformerWithNames(transformers, *, remainder='drop', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False, verbose_feature_names_out=True)

Bases: sklearn.compose._column_transformer.ColumnTransformer

A ColumnTransformer that retains DataFrame column names. Obtained from https://stackoverflow.com/questions/61079602/how-do-i-get-feature-names-using-a-column-transformer/68671424#68671424

get_feature_names()

Get feature names from all transformers.

Returns

feature_names – Names of the features produced by transform.

Return type

List[str]

transform(X)

Transform X separately by each transformer, concatenate results.

Parameters

X ({array-like, dataframe} of shape (n_samples, n_features)) – The data to be transformed by subset.

Returns

X_t – Horizontally stacked results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

Return type

{array-like, sparse matrix} of shape (n_samples, sum_n_components)

fit_transform(X, y=None)

Fit all transformers, transform the data and concatenate results.

Parameters
  • X ({array-like, dataframe} of shape (n_samples, n_features)) – Input data, of which specified subsets are used to fit the transformers.

  • y (array-like of shape (n_samples,), default=None) – Targets for supervised learning.

Returns

X_t – Horizontally stacked results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

Return type

{array-like, sparse matrix} of shape (n_samples, sum_n_components)

steps: List[Any]