DataFrame Extension Deep Dive

We define a custom kxy pandas accessor below, namely the class Accessor, that extends the pandas DataFrame class with all our analyses, thereby allowing data scientists to tap into the power of the kxy toolkit within the comfort of their favorite data structure.

All methods defined in the Accessor class are accessible from any DataFrame instance as df.kxy.<method_name>, so long as the kxy python package is imported alongside pandas.
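For example (the file name and target column 'y' below are hypothetical), importing kxy alongside pandas is all it takes to register the accessor:

    import pandas as pd
    import kxy  # the import alone registers the 'kxy' accessor on DataFrames

    df = pd.read_csv('my_data.csv')  # hypothetical dataset with a target column 'y'
    achievable = df.kxy.data_valuation('y', problem_type='regression')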

class kxy.pandas_extension.accessor.Accessor(pandas_obj)

Bases: kxy.pandas_extension.pre_learning_accessor.PreLearningAccessor, kxy.pandas_extension.learning_accessor.LearningAccessor, kxy.pandas_extension.post_learning_accessor.PostLearningAccessor, kxy.pandas_extension.finance_accessor.FinanceAccessor, kxy.pandas_extension.features_accessor.FeaturesAccessor

Extension of the pandas.DataFrame class with the full capabilities of the kxy platform.

class kxy.pandas_extension.base_accessor.BaseAccessor(pandas_obj)

Base class inherited by our custom accessors.

anonymize(columns_to_exclude=[])

Anonymize the dataframe in a manner that leaves all pre-learning and post-learning analyses (including data valuation, variable selection, model-driven improvability, data-driven improvability and model explanation) invariant.

Any transformation on continuous variables that preserves ranks will not change our pre-learning and post-learning analyses. The same holds for any 1-to-1 transformation on categorical variables.

This implementation replaces ordinal values (i.e. any column that can be cast as a float) with their within-column Gaussian scores. For each non-ordinal column, we form the set of all possible values, assign a unique integer index to each value in the set, and systematically replace each occurrence of a value in the dataframe with the hexadecimal code of its integer index.

For regression problems, accurate estimation of RMSE-related metrics requires that the target column (and the prediction column for post-learning analyses) not be anonymized.

Parameters

columns_to_exclude (list (optional)) – List of columns not to anonymize (e.g. target and prediction columns for regression problems).

Returns

result – The anonymized dataframe.

Return type

pandas.DataFrame
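
A minimal usage sketch, assuming a regression problem with a hypothetical target column 'y':

    # Keep 'y' intact so RMSE-related metrics remain accurate;
    # every other column is anonymized.
    anonymized_df = df.kxy.anonymize(columns_to_exclude=['y'])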

is_categorical(column)

Determine whether the input column contains categorical (i.e. non-ordinal) observations.

is_discrete(column)

Determine whether the input column contains discrete (as opposed to continuous) observations.

class kxy.pandas_extension.features_accessor.FeaturesAccessor(pandas_obj)

Bases: kxy.pandas_extension.base_accessor.BaseAccessor

Extension of the pandas.DataFrame class with various feature engineering functionalities.

This class defines the kxy_features pandas accessor.

All its methods are accessible from any DataFrame instance as df.kxy_features.<method_name>, so long as the kxy python package is imported alongside pandas.

deviation_features(exclude=[], means=None, quantiles=None, return_baselines=False)

Extend the dataframe with deviations of ordinal columns from row-wise aggregates such as the mean, median, and 25th and 75th percentiles.

Parameters
  • exclude (list) – A list of columns to exclude from feature transformations.

  • means (pandas.DataFrame | None) – Which values, if any, to use as means.

  • quantiles (pandas.DataFrame | None) – Which values, if any, to use as 25th, 50th, 75th percentiles.

  • return_baselines (bool) – Whether to return which baselines have been used.

Returns

result – The original dataframe extended with computed features.

Return type

pandas.DataFrame
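
A hedged sketch of how baselines might be reused across training and test frames; the (features, means, quantiles) return shape with return_baselines=True is an assumption, as are the frame and column names:

    # Compute deviation features on a training frame, then reuse the same
    # means and quantiles on a test frame for consistent features.
    train_features, means, quantiles = train_df.kxy_features.deviation_features(
        exclude=['y'], return_baselines=True)
    test_features = test_df.kxy_features.deviation_features(
        exclude=['y'], means=means, quantiles=quantiles)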

entity_features(entity, exclude=[], entity_name='*', filter_target=None, filter_target_gt=None, filter_target_lt=None, include_filter_target=False)

Group rows corresponding to the same entity and apply aggregation functions.

For each ordinal column, we apply the following aggregation functions to rows corresponding to the same entity: mean, standard deviation, median, skewness, kurtosis, 25th and 75th percentiles, minimum, maximum, and the difference between maximum and minimum.

For each non-ordinal column, we apply the following aggregation functions to rows corresponding to the same entity: mode and its frequency, second most frequent label and its frequency, least frequent label and its frequency.

Parameters
  • entity (str) – The column mapping rows to entities.

  • exclude (list) – A list of columns to exclude from feature transformations.

  • filter_target (str | None) – When specified, the column used to restrict the dataframe before generating features.

  • filter_target_gt (str | None) – When specified, only rows with filter_target greater than filter_target_gt will be considered for feature generation.

  • filter_target_lt (str | None) – When specified, only rows with filter_target smaller than filter_target_lt will be considered for feature generation.

  • include_filter_target (bool) – Whether to use filter_target for features generation.

Returns

result – The dataframe of features.

Return type

pandas.DataFrame
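
For example (the 'customer_id' and 'amount' columns are hypothetical, and passing a numeric threshold as filter_target_gt is an assumption):

    # Aggregate rows per customer, using only rows with amount > 0.0
    # for feature generation.
    agg_df = df.kxy_features.entity_features(
        'customer_id', exclude=['y'],
        filter_target='amount', filter_target_gt=0.0)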

generate_features(entity=None, encoding_method='one_hot', index=None, max_lag=None, exclude=[], means=None, quantiles=None, return_baselines=False, entity_name='*', filter_target=None, filter_target_gt=None, filter_target_lt=None, include_filter_target=False, fill_na=False, temporal_groupby=None, temporal_sort_by=None, time_columns=None)

Generate a wide range of candidate features to search from.

We first compute entity features if needed.

Then we extend the resulting dataframe with deviations of ordinal columns from row-wise aggregates such as the mean, median, and 25th and 75th percentiles.

Finally, we ordinally-encode the resulting dataframe and apply temporal transformations if required.

Parameters
  • entity (str) – The column mapping rows to entities.

  • filter_target (str | None) – When specified, the column used to restrict the dataframe before generating entity features.

  • filter_target_gt (str | None) – When specified, only rows with filter_target greater than filter_target_gt will be considered for entity feature generation.

  • filter_target_lt (str | None) – When specified, only rows with filter_target smaller than filter_target_lt will be considered for entity feature generation.

  • include_filter_target (bool) – Whether to use filter_target for features generation.

  • encoding_method ('one_hot' (default) | 'binary') – The encoding method to use for categorical variables.

  • exclude (list) – A list of columns to exclude from feature transformations.

  • max_lag (int | None) – The largest lag, if any, to consider for temporal features. Set to None to avoid temporal features.

  • index (str | None (default)) – The column, if any, to set as index and sort before computing temporal features.

  • means (pandas.DataFrame | None) – Which values, if any, to use as means for deviation features.

  • quantiles (pandas.DataFrame | None) – Which values, if any, to use as 25th, 50th, 75th percentiles for deviation features.

  • return_baselines (bool) – Whether to return which baselines have been used for deviation features.

  • temporal_groupby (str | None) – If provided, we will use this column to perform a groupby before temporal aggregation.

  • temporal_sort_by (str | None) – If provided, we will use this column to sort the dataframe before rolling when computing temporal features.

  • time_columns (list | None) – The list of columns that correspond to times and from which we should extract features such as hour, day of week etc.

Returns

result – The original dataframe extended with computed temporal features.

Return type

pandas.DataFrame
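
A sketch of a typical call (all column names are hypothetical):

    # Chain entity aggregation, deviation features, ordinal encoding and
    # temporal transformations in a single call.
    candidates = df.kxy_features.generate_features(
        entity='customer_id', encoding_method='one_hot',
        index='date', max_lag=5, exclude=['y'],
        time_columns=['signup_ts'])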

ordinally_encode(target_column=None, method='one_hot')

Encode categorical (non-numeric) data.

Parameters
  • target_column (str) – The name of the column containing labels. When this column is categorical, each label is replaced by a distinct integer.

  • method ('one_hot' (default) | 'binary') – Whether to use one-hot encoding or binary encoding to encode categorical variables.

Returns

result – The ordinally encoded dataframe.

Return type

pandas.DataFrame
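
For example (the target column 'y' is hypothetical):

    # Encode non-numeric columns; labels in 'y' are mapped to distinct integers.
    encoded = df.kxy_features.ordinally_encode(target_column='y', method='one_hot')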

process_time_columns(columns)

Extract features from timestamp columns such as: Month, Day, Day of Week, Hour, AM/PM.

Parameters

columns (list) – The list of columns that should be interpreted as UTC epoch timestamps.

Returns

result – The features dataframe (it does not include the original dataframe).

Return type

pandas.DataFrame
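
For example (the 'event_ts' column is hypothetical):

    # Extract month, day of week, hour, AM/PM, etc. from UTC epoch timestamps.
    time_features = df.kxy_features.process_time_columns(['event_ts'])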

temporal_features(max_lag=10, exclude=[], index=None, groupby=None, sort_by=None)

Extend the dataframe with some rolling statistics (e.g. rolling average, rolling min, rolling max, rolling max-rolling min, etc.) for all lags from 2 to the configured maximum lag.

Parameters
  • exclude (list) – A list of columns to exclude from feature transformations.

  • max_lag (int) – The largest lag to consider.

  • index (str | None (default)) – The column, if any, to set as index and sort before computing rolling statistics.

  • groupby (str | None) – If provided, we will use this column to perform a groupby before temporal aggregation.

  • sort_by (str | list | None) – Columns, if any, we need to sort the dataframe by, prior to rolling.

Returns

result – The original dataframe extended with computed temporal features.

Return type

pandas.DataFrame
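
For example (the 'asset_id' and 'date' columns are hypothetical):

    # Rolling statistics computed per asset, after sorting by date.
    rolled = df.kxy_features.temporal_features(
        max_lag=10, exclude=['y'], groupby='asset_id', sort_by='date')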

class kxy.pandas_extension.pre_learning_accessor.PreLearningAccessor(pandas_obj)

Bases: kxy.pandas_extension.base_accessor.BaseAccessor

Extension of the pandas.DataFrame class with various analytics for pre-learning in supervised learning problems.

This class defines the kxy_pre_learning pandas accessor.

All its methods are accessible from any DataFrame instance as df.kxy_pre_learning.<method_name>, so long as the kxy python package is imported alongside pandas.

data_valuation(target_column, problem_type=None, anonymize=None, snr='auto', include_mutual_information=False, file_name=None)

Estimate the highest performance metrics achievable when predicting the target_column using all other columns.

When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not target_column is categorical.

Parameters
  • target_column (str) – The name of the column containing true labels.

  • problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.

  • anonymize (None | bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost). When set to None (the default), your data will be anonymized when it is too big.

  • include_mutual_information (bool) – Whether to include the mutual information between target and explanatory variables in the result.

Returns

achievable_performance – The result is a pandas.DataFrame with columns (where applicable):

  • 'Achievable Accuracy': The highest classification accuracy that can be achieved by a model using provided inputs to predict the label.

  • 'Achievable R^2': The highest \(R^2\) that can be achieved by a model using provided inputs to predict the label.

  • 'Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a model using provided inputs to predict the label.

  • 'Achievable Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a model using provided inputs to predict the label.

Return type

pandas.DataFrame

Theoretical Foundation

Section 1 - Achievable Performance.
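
For example (the target column 'y' is hypothetical):

    # Upper bounds on achievable performance, before training any model.
    achievable = df.kxy_pre_learning.data_valuation('y', problem_type='regression')
    print(achievable[['Achievable R^2', 'Achievable RMSE']])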

variable_selection(target_column, problem_type=None, anonymize=None, snr='auto', file_name=None)

Runs the model-free variable selection analysis.

When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not target_column is categorical.

Parameters
  • target_column (str) – The name of the column containing true labels.

  • problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.

  • anonymize (None | bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost). When set to None (the default), your data will be anonymized when it is too big.

Returns

result – The result is a pandas.DataFrame with columns (where applicable):

  • 'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.

  • 'Variable': The column name corresponding to the input variable.

  • 'Running Achievable R^2': The highest \(R^2\) that can be achieved by a model using all variables selected so far, including this one.

  • 'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.

  • 'Running Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a regression model using all variables selected so far, including this one.

Return type

pandas.DataFrame

Theoretical Foundation

Section 2 - Variable Selection Analysis.
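
For example (the target column 'y' is hypothetical):

    # Model-free ranking of explanatory variables for a classification problem.
    selection = df.kxy_pre_learning.variable_selection('y', problem_type='classification')
    print(selection[['Selection Order', 'Variable', 'Running Achievable Accuracy']])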

class kxy.pandas_extension.learning_accessor.LearningAccessor(pandas_obj)

Bases: kxy.pandas_extension.base_accessor.BaseAccessor

Extension of the pandas.DataFrame class with various analytics for automatically training predictive models.

This class defines the kxy_learning pandas accessor.

All its methods are accessible from any DataFrame instance as df.kxy_learning.<method_name>, so long as the kxy python package is imported alongside pandas.

fit(target_column, learner_func, problem_type=None, snr='auto', train_frac=0.8, random_state=0, max_n_features=None, min_n_features=None, start_n_features=None, anonymize=False, benchmark_feature=None, missing_value_imputation=False, score='auto', n_down_perf_before_stop=3, regression_baseline='mean', additive_learning=False, regression_error_type='additive', return_scores=False, start_n_features_perf_frac=0.9, feature_selection_method='leanml', rfe_n_features=None, boruta_pval=0.5, boruta_n_evaluations=20, max_duration=None, val_performance_buffer=0.0, path=None, data_identifier=None, pca_energy_loss_frac=0.05, pfs_p=None)

Train a lean boosted supervised learner, bringing in variables one at a time, in decreasing order of importance (as per df.kxy.variable_selection), until doing so no longer improves validation performance or another stopping criterion is met.

Specifically, training proceeds as follows. First, KXY’s model-free variable selection is run (i.e. df.kxy.variable_selection).

Then we train a model (an instance returned by learner_func) using the start_n_features most important features/variables to predict the target (defined by target_column).

When start_n_features is None, the initial set of variables is the smallest set with which we may achieve a fraction start_n_features_perf_frac of the performance achievable using all variables (as per df.kxy.variable_selection).

Next we consider adding one variable at a time to fix the mistakes made by the previously trained model when additive_learning is True.

If doing so improves performance on the validation set, we keep going until either performance no longer improves on the validation set n_down_perf_before_stop consecutive times, or we’ve selected max_n_features features.

When additive_learning is set to False (the default), after adding a new variable, we train the new model on the original problem, rather than trying to improve residuals.

Parameters
  • target_column (str) – The name of the column containing true labels.

  • learner_func (function) – A function returning an instance of a base learner. The returned instance should define a fit(x, y) method and a predict(x) method.

  • problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.

  • snr ('auto' | 'low' | 'high') – Set to low if the problem is difficult (i.e. has a low signal-to-noise ratio) or the number of rows is small relative to the number of columns. Only used for model-free variable selection.

  • train_frac (float) – The fraction of rows used for training and validation.

  • random_state (int) – The seed to use for random training/validation/testing split.

  • min_n_features (int | None) – Boosting will not stop until at least this many features/explanatory variables are selected.

  • max_n_features (int | None) – Boosting will stop as soon as this many features/explanatory variables are selected.

  • start_n_features (int) – The number of most important features boosting will start with.

  • anonymize (bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost).

  • benchmark_feature (str | None) – When not None, ‘benchmark’ performance metrics using this column as predictor will be reported in the output dictionary.

  • missing_value_imputation (bool) – When set to True, replace missing values with medians.

  • n_down_perf_before_stop (int) – Number of consecutive down performances to observe before boosting stops.

  • regression_baseline (str (mean | median)) – Whether to use the unconditional mean or median as the best predictor in the absence of explanatory variables. Choosing the mean corresponds to minimizing the L2 norm, whereas choosing the median corresponds to minimizing the L1 norm.

  • additive_learning (bool) – When a new variable is added, whether errors/residuals should be fixed or a new model should be learned from scratch.

  • regression_error_type (str ('additive' | 'multiplicative')) – For regression problems with additive learning, this determines whether the final model should be additive (pruning tries to reduce regressor residuals) or multiplicative (i.e. pruning tries to bring the ratio between true and predicted labels as close to 1 as possible).

  • start_n_features_perf_frac (float (between 0 and 1)) – When start_n_features is not specified, it is set to the number of variables required to achieve a fraction start_n_features_perf_frac of the maximum performance achievable (as per df.kxy.variable_selection).

  • return_scores (bool (Default False)) – Whether to return training, validation and testing performance after lean boosting.

  • feature_selection_method (str (leanml | rfe | boruta | pfs | pca | none. Default leanml)) – Do not change this unless you want to try out Boruta, Recursive Feature Elimination, PCA or Principal Feature Selection. The leanml method outperforms all four.

  • rfe_n_features (int) – The number of features to keep when the feature selection method is rfe.

  • boruta_pval (float) – The quantile level to use when the feature selection method is boruta.

  • boruta_n_evaluations (int) – The number of trials to use when the feature selection method is boruta.

  • max_duration (float | None (default)) – If not None, then Boruta and RFE will stop after this many seconds.

  • val_performance_buffer (float (Default 0.0)) – In LeanML feature selection, this is the threshold by which the new validation performance needs to exceed the previously evaluated validation performance to consider increasing the number of features.

  • score (str | func) – The validation metric to use to determine if a new feature should be added. When set to 'auto' (the default), the \(R^2\) is used for regression problems and the classification accuracy is used for classification problems. Any other string should be the name of a globally accessible callable.

  • pca_energy_loss_frac (float (Default 0.05)) – The maximum fraction of energy (or variance) that left-out principal directions should account for when PCA is the feature selection method chosen.

  • pfs_p (int | None (Default)) – The number of principal features to learn when using PFS. A value that is not None automatically selects the one-shot flavor of PFS instead of the PCA-style.

Returns

result – Dictionary containing selected variables, as well as training, validation and testing performance, and the trained model.

Return type

dict
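
A hedged sketch of a learner_func; the scikit-learn random forest and the **kwargs signature are illustrative assumptions, not requirements:

    from sklearn.ensemble import RandomForestRegressor

    def learner_func(**kwargs):
        # Any object exposing fit(x, y) and predict(x) works; **kwargs
        # absorbs whatever arguments kxy may pass (exact signature assumed).
        return RandomForestRegressor(n_estimators=100, random_state=0)

    results = df.kxy_learning.fit('y', learner_func, problem_type='regression')
    print(results.keys())  # selected variables, performance metrics, trained model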

predict(obj, memory_bound=False)

Make predictions using the fitted model.

Parameters
  • obj (pandas.DataFrame) – A dataframe containing test explanatory variables about which we want to make predictions.

  • memory_bound (bool (Default False)) – Whether we should try to save memory.

Returns

result – A dataframe with the same index as obj, and with one column whose name is the target_column used for training.

Return type

pandas.DataFrame
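
A usage sketch, assuming the accessor retains the model fitted above and that 'y' was the training target:

    # Predictions come back indexed like test_df, in a column named 'y'.
    predictions = df.kxy_learning.predict(test_df)
    y_hat = predictions['y']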

class kxy.pandas_extension.post_learning_accessor.PostLearningAccessor(pandas_obj)

Bases: kxy.pandas_extension.base_accessor.BaseAccessor

Extension of the pandas.DataFrame class with various analytics for post-learning in supervised learning problems.

This class defines the kxy_post_learning pandas accessor.

All its methods are accessible from any DataFrame instance as df.kxy_post_learning.<method_name>, so long as the kxy python package is imported alongside pandas.

data_driven_improvability(target_column, new_variables, problem_type=None, anonymize=None, snr='auto', file_name=None)

Estimate the potential performance boost that a set of new explanatory variables can bring about.

Parameters
  • target_column (str) – The name of the column containing true labels.

  • new_variables (list) – The names of the columns to use as new explanatory variables.

  • problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from whether or not target_column is categorical.

  • anonymize (None | bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost). When set to None (the default), your data will be anonymized when it is too big.

Returns

result – The result is a pandas.DataFrame with columns (where applicable):

  • 'Accuracy Boost': The classification accuracy boost that the new explanatory variables can bring about.

  • 'R-Squared Boost': The \(R^2\) boost that the new explanatory variables can bring about.

  • 'RMSE Reduction': The reduction in Root Mean Square Error that the new explanatory variables can bring about.

  • 'Log-Likelihood Per Sample Boost': The boost in log-likelihood per sample that the new explanatory variables can bring about.

Return type

pandas.DataFrame

Theoretical Foundation

Section 3 - Model Improvability.

See also

kxy.post_learning.improvability.data_driven_improvability
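
For example (the target and candidate columns are hypothetical):

    # Estimated performance boost from two candidate explanatory variables.
    boost = df.kxy_post_learning.data_driven_improvability(
        'y', ['new_var_1', 'new_var_2'], problem_type='regression')
    print(boost['R-Squared Boost'])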

model_driven_improvability(target_column, prediction_column, problem_type=None, anonymize=None, snr='auto', file_name=None)

Estimate the extent to which a trained supervised learner may be improved in a model-driven fashion (i.e. without resorting to additional explanatory variables).

Parameters
  • target_column (str) – The name of the column containing true labels.

  • prediction_column (str) – The name of the column containing model predictions.

  • problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from whether or not target_column is categorical.

  • anonymize (None | bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost). When set to None (the default), your data will be anonymized when it is too big.

Returns

result – The result is a pandas.DataFrame with columns (where applicable):

  • 'Lost Accuracy': The amount of classification accuracy that was irreversibly lost when training the supervised learner.

  • 'Lost R-Squared': The amount of \(R^2\) that was irreversibly lost when training the supervised learner.

  • 'Lost RMSE': The amount of Root Mean Square Error that was irreversibly lost when training the supervised learner.

  • 'Lost Log-Likelihood Per Sample': The amount of true log-likelihood per sample that was irreversibly lost when training the supervised learner.

  • 'Residual R-Squared': For regression problems, this is the highest \(R^2\) that may be achieved when using explanatory variables to predict regression residuals.

  • 'Residual RMSE': For regression problems, this is the lowest Root Mean Square Error that may be achieved when using explanatory variables to predict regression residuals.

  • 'Residual Log-Likelihood Per Sample': For regression problems, this is the highest log-likelihood per sample that may be achieved when using explanatory variables to predict regression residuals.

Return type

pandas.DataFrame

Theoretical Foundation

Section 3 - Model Improvability.

See also

kxy.post_learning.improvability.model_driven_improvability
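
For example ('y' and the prediction column 'y_pred' are hypothetical):

    # How much performance was irreversibly lost during training, and how
    # much structure remains in the residuals?
    lost = df.kxy_post_learning.model_driven_improvability(
        'y', 'y_pred', problem_type='regression')
    print(lost[['Lost R-Squared', 'Residual R-Squared']])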

model_explanation(prediction_column, problem_type=None, anonymize=None, snr='auto', file_name=None)

Analyzes the variables that a model relies on the most in a brute-force fashion.

The first variable is the variable the model relies on the most. The second variable is the variable that complements the first variable the most in explaining model decisions etc.

Running performances should be understood as the performance achievable when trying to guess model predictions using variables whose selection order is less than or equal to that of the row.

When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not prediction_column is categorical.

Parameters
  • prediction_column (str) – The name of the column containing model predictions.

  • problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.

  • anonymize (None | bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost). When set to None (the default), your data will be anonymized when it is too big.

Returns

result – The result is a pandas.DataFrame with columns (where applicable):

  • 'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.

  • 'Variable': The column name corresponding to the input variable.

  • 'Running Achievable R^2': The highest \(R^2\) that can be achieved by a model using all variables selected so far, including this one.

  • 'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.

  • 'Running Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a regression model using all variables selected so far, including this one.

Return type

pandas.DataFrame

Theoretical Foundation

Section 2 - Variable Selection Analysis.
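
For example (the prediction column 'y_pred' is hypothetical):

    # Rank the variables the trained model relies on most.
    explanation = df.kxy_post_learning.model_explanation(
        'y_pred', problem_type='regression')
    print(explanation[['Selection Order', 'Variable', 'Running Achievable R^2']])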

class kxy.pandas_extension.finance_accessor.FinanceAccessor(pandas_obj)

Bases: kxy.pandas_extension.base_accessor.BaseAccessor

Extension of the pandas.DataFrame class with various finance-specific analytics.

This class defines the kxy_finance pandas accessor.

All its methods are accessible from any DataFrame instance as df.kxy_finance.<method_name>, so long as the kxy python package is imported alongside pandas.

information_adjusted_beta(market_column, asset_column, anonymize=False)

Estimate the information-adjusted beta of an asset return \(r\) relative to the market return \(r_m\): \(\text{IA-}\beta := \text{IA-Corr}\left(r, r_m \right) \sqrt{\frac{\text{Var}(r)}{\text{Var}(r_m)}}\), where \(\text{IA-Corr}\left(r, r_m \right) := \text{sgn}\left(\text{Corr}\left(r, r_m \right) \right) \left[1 - e^{-2I(r, r_m)} \right]\) denotes the information-adjusted correlation coefficient, with \(\text{sgn}\left(\text{Corr}\left(r, r_m \right) \right)\) the sign of the Pearson correlation coefficient.

Unlike the traditional beta coefficient, namely \(\beta := \text{Corr}\left(r, r_m \right) \sqrt{\frac{\text{Var}(r)}{\text{Var}(r_m)}}\), which only captures linear relations between market and asset returns and is 0 if and only if the two are decorrelated, \(\text{IA-}\beta\) captures any relationship between asset return and market return, linear or nonlinear, and is 0 if and only if the two variables are statistically independent.

Parameters
  • market_column (str) – The name of the column containing market returns.

  • asset_column (str) – The name of the column containing asset returns.

  • anonymize (bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost).

Returns

result – The information-adjusted beta coefficient.

Return type

float

information_adjusted_correlation(market_column, asset_column, anonymize=False)

Estimate the information-adjusted correlation between an asset return \(r\) and the market return \(r_m\): \(\text{IA-Corr}\left(r, r_m \right) := \text{sgn}\left(\text{Corr}\left(r, r_m \right) \right) \left[1 - e^{-2I(r, r_m)} \right]\), where \(\text{sgn}\left(\text{Corr}\left(r, r_m \right) \right)\) is the sign of the Pearson correlation coefficient.

Unlike Pearson’s correlation coefficient, which is 0 if and only if asset return and market return are decorrelated (i.e. they exhibit no linear relation), information-adjusted correlation is 0 if and only if market and asset returns are statistically independent (i.e. they exhibit no relation, linear or nonlinear).

Parameters
  • market_column (str) – The name of the column containing market returns.

  • asset_column (str) – The name of the column containing asset returns.

  • anonymize (bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost).

Returns

result – The information-adjusted correlation.

Return type

float
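
A usage sketch covering both estimators (the returns frame and its column names are hypothetical):

    # Nonlinear analogues of beta and Pearson correlation between asset
    # and market returns.
    ia_beta = returns_df.kxy_finance.information_adjusted_beta('market', 'asset')
    ia_corr = returns_df.kxy_finance.information_adjusted_correlation('market', 'asset')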