DataFrame Extension Deep Dive¶
We define a custom kxy pandas accessor below, namely the class Accessor, that extends the pandas DataFrame class with all our analyses, thereby allowing data scientists to tap into the power of the kxy toolkit within the comfort of their favorite data structure.
All methods defined in the Accessor class are accessible from any DataFrame instance as df.kxy.<method_name>, so long as the kxy python package is imported alongside pandas.
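As a rough illustration of the mechanism described above, the sketch below registers a toy accessor with pandas' public extension API. The accessor name demo, the class DemoAccessor and its n_missing method are hypothetical and are not part of the kxy package; only the registration pattern is the point.

```python
import pandas as pd


# "demo" is a hypothetical accessor name chosen for this sketch.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    """Toy accessor that only illustrates the df.<name>.<method> mechanism."""

    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame instance the accessor is attached to.
        self._obj = pandas_obj

    def n_missing(self):
        """Count missing values per column."""
        return self._obj.isna().sum()


df = pd.DataFrame({"x": [1.0, None, 3.0], "y": ["a", "b", None]})
missing = df.demo.n_missing()
```

Once the module defining the accessor has been imported, every DataFrame exposes it, which is why importing kxy alongside pandas is enough to make df.kxy available.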
- class kxy.pandas_extension.accessor.Accessor(pandas_obj)¶
Bases: kxy.pandas_extension.pre_learning_accessor.PreLearningAccessor, kxy.pandas_extension.learning_accessor.LearningAccessor, kxy.pandas_extension.post_learning_accessor.PostLearningAccessor, kxy.pandas_extension.finance_accessor.FinanceAccessor, kxy.pandas_extension.features_accessor.FeaturesAccessor
Extension of the pandas.DataFrame class with the full capabilities of the kxy platform.
- class kxy.pandas_extension.base_accessor.BaseAccessor(pandas_obj)¶
Base class inherited by our custom accessors.
- anonymize(columns_to_exclude=[])¶
Anonymize the dataframe in a manner that leaves all pre-learning and post-learning analyses (including data valuation, variable selection, model-driven improvability, data-driven improvability and model explanation) invariant.
Any transformation on continuous variables that preserves ranks will not change our pre-learning and post-learning analyses. The same holds for any 1-to-1 transformation on categorical variables.
This implementation replaces ordinal values (i.e. any column that can be cast as a float) with their within-column Gaussian score. For each non-ordinal column, we form the set of all possible values, we assign a unique integer index to each value in the set, and we systematically replace said value appearing in the dataframe by the hexadecimal code of its associated integer index.
For regression problems, accurate estimation of RMSE-related metrics requires the target column (and the prediction column for post-learning analyses) not to be anonymized.
- Parameters
columns_to_exclude (list (optional)) – List of columns not to anonymize (e.g. target and prediction columns for regression problems).
- Returns
result – The anonymized dataframe.
- Return type
pandas.DataFrame
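The two transformations described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the Gaussian score is assumed here to be the standard normal quantile of rank / (n + 1), and the helper names gaussian_scores and hex_codes are hypothetical.

```python
from statistics import NormalDist

import pandas as pd


def gaussian_scores(column):
    """Map values to standard normal quantiles of their within-column ranks."""
    n = column.notna().sum()
    ranks = column.rank(method="average")  # ranks in 1..n, missing values stay missing
    # Assumption: score = inverse normal CDF of rank / (n + 1), which lies in (0, 1).
    return (ranks / (n + 1)).map(NormalDist().inv_cdf, na_action="ignore")


def hex_codes(column):
    """Replace each distinct label with the hexadecimal code of an integer index."""
    codes, _ = pd.factorize(column)  # -1 marks missing values
    return pd.Series(codes, index=column.index).map(lambda c: None if c < 0 else hex(c))


df = pd.DataFrame({
    "income": [30.0, 55.0, 42.0, 90.0],
    "city": ["Paris", "Lagos", "Paris", "Lima"],
})
anonymized = pd.DataFrame({
    "income": gaussian_scores(df["income"]),
    "city": hex_codes(df["city"]),
})
```

Because the normal quantile function is strictly increasing, ranks within the continuous column are unchanged, and the label mapping is 1-to-1, which are the two invariance conditions stated above.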
- is_categorical(column)¶
Determine whether the input column contains categorical (i.e. non-ordinal) observations.
- is_discrete(column)¶
Determine whether the input column contains discrete (as opposed to continuous) observations.
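The anonymize documentation above treats a column as ordinal when it can be cast as a float. A minimal sketch of a categorical check built on that criterion might look like the following; the helper name can_cast_to_float is hypothetical, and the actual is_categorical method may apply additional rules.

```python
import pandas as pd


def can_cast_to_float(column):
    """Return True when every non-missing value in the column can be cast to float."""
    try:
        column.dropna().astype(float)
        return True
    except (TypeError, ValueError):
        return False


df = pd.DataFrame({"age": ["31", "47", "25"], "color": ["red", "blue", "red"]})
# Under this criterion, a column is categorical when the cast fails.
categorical = {name: not can_cast_to_float(df[name]) for name in df.columns}
```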
- class kxy.pandas_extension.features_accessor.FeaturesAccessor(pandas_obj)¶
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various feature engineering functionalities.
This class defines the kxy_features pandas accessor.
All its methods are accessible from any DataFrame instance as df.kxy_features.<method_name>, so long as the kxy python package is imported alongside pandas.
- deviation_features(exclude=[], means=None, quantiles=None, return_baselines=False)¶
Extend the dataframe with deviations of ordinal columns from row-wise aggregates such as mean, median, 25th and 75th percentiles.
- Parameters
exclude (list) – A list of columns to exclude from feature transformations.
means (pandas.DataFrame | None) – Which values, if any, to use as means.
quantiles (pandas.DataFrame | None) – Which values, if any, to use as 25th, 50th, 75th percentiles.
return_baselines (bool) – Whether to return which baselines have been used.
- Returns
result – The original dataframe extended with computed features.
- Return type
pandas.DataFrame
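A minimal sketch of deviation features is given below. It assumes that the aggregates are computed across rows, separately for each numeric column; the function name deviation_sketch and the suffix naming scheme are hypothetical.

```python
import pandas as pd


def deviation_sketch(df, exclude=()):
    """Append deviations of numeric columns from their mean and quartiles."""
    numeric = df.select_dtypes("number").drop(columns=list(exclude), errors="ignore")
    # One baseline value per column, computed across all rows.
    baselines = {
        "MEAN": numeric.mean(),
        "Q25": numeric.quantile(0.25),
        "MEDIAN": numeric.quantile(0.50),
        "Q75": numeric.quantile(0.75),
    }
    features = [
        (numeric - baseline).add_suffix(f".DEV_FROM_{name}")  # hypothetical naming
        for name, baseline in baselines.items()
    ]
    return pd.concat([df] + features, axis=1)


df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 6.0], "label": ["a", "b", "a", "b"]})
out = deviation_sketch(df)
```

Passing precomputed baselines, as the means and quantiles parameters allow, would let the same reference values be reused on another dataframe.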
- entity_features(entity, exclude=[], entity_name='*', filter_target=None, filter_target_gt=None, filter_target_lt=None, include_filter_target=False)¶
Group rows corresponding to the same entity and apply aggregation functions.
For each ordinal column, we apply the following aggregation functions to rows corresponding to the same entity: mean, standard deviation, median, skewness, kurtosis, 25th and 75th percentiles, minimum, maximum, and the difference between maximum and minimum.
For each non-ordinal column, we apply the following aggregation functions to rows corresponding to the same entity: mode and its frequency, second most frequent label and its frequency, least frequent label and its frequency.
- Parameters
entity (str) – The column mapping rows to entities.
exclude (list) – A list of columns to exclude from feature transformations.
filter_target (str | None) – When specified, this is a column based on which we need to restrict the dataframe before generating features.
filter_target_gt (str | None) – When specified, only rows with filter_target greater than filter_target_gt will be considered for feature generation.
filter_target_lt (str | None) – When specified, only rows with filter_target smaller than filter_target_lt will be considered for feature generation.
include_filter_target (bool) – Whether to use filter_target for features generation.
- Returns
result – The dataframe of features.
- Return type
pandas.DataFrame
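The grouping described above can be approximated with plain pandas. The sketch below covers only a subset of the listed aggregations; the column names and the output naming scheme are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["a", "a", "a", "b", "b"],
    "amount": [10.0, 20.0, 60.0, 5.0, 7.0],
    "channel": ["web", "web", "app", "app", "app"],
})

grouped = df.groupby("customer")

# Ordinal column: a subset of the aggregations listed above.
ordinal = grouped["amount"].agg(["mean", "std", "median", "min", "max"])
ordinal["range"] = ordinal["max"] - ordinal["min"]
ordinal = ordinal.add_prefix("amount.")

# Non-ordinal column: most frequent label and its relative frequency.
mode = grouped["channel"].agg(lambda labels: labels.value_counts().index[0])
mode_freq = grouped["channel"].agg(
    lambda labels: labels.value_counts(normalize=True).iloc[0]
)

entity_features = ordinal.join(
    pd.DataFrame({"channel.mode": mode, "channel.mode_freq": mode_freq})
)
```

The result has one row per entity, which is the shape one would expect from the dataframe of features returned by entity_features.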
- generate_features(entity=None, encoding_method='one_hot', index=None, max_lag=None, exclude=[], means=None, quantiles=None, return_baselines=False, entity_name='*', filter_target=None, filter_target_gt=None, filter_target_lt=None, include_filter_target=False, fill_na=False, temporal_groupby=None, temporal_sort_by=None, time_columns=None)¶
Generate a wide range of candidate features to search from.
We first compute entity features if needed.
Then we extend the resulting dataframe with deviations of ordinal columns from row-wise aggregates such as mean, median, 25th and 75th percentiles.
Finally, we ordinally-encode the resulting dataframe and apply temporal transformations if required.
- Parameters
entity (str) – The column mapping rows to entities.
filter_target (str | None) – When specified, this is a column based on which we need to restrict the dataframe before generating entity features.
filter_target_gt (str | None) – When specified, only rows with filter_target greater than filter_target_gt will be considered for entity feature generation.
filter_target_lt (str | None) – When specified, only rows with filter_target smaller than filter_target_lt will be considered for entity feature generation.
include_filter_target (bool) – Whether to use filter_target for features generation.
encoding_method ('one_hot' (default) | 'binary') – The encoding method to use for categorical variables.
exclude (list) – A list of columns to exclude from feature transformations.
max_lag (int | None) – The largest lag, if any, to consider for temporal features. Set to None to avoid temporal features.
index (str | None (default)) – The column, if any, to set as index and sort before computing temporal features.
means (pandas.DataFrame | None) – Which values, if any, to use as means for deviation features.
quantiles (pandas.DataFrame | None) – Which values, if any, to use as 25th, 50th, 75th percentiles for deviation features.
return_baselines (bool) – Whether to return which baselines have been used for deviation features.
temporal_groupby (str | None) – If provided, we will use this column to perform a groupby before temporal aggregation.
temporal_sort_by (str | None) – If provided, we will use this column to sort the dataframe before rolling when computing temporal features.
time_columns (list | None) – The list of columns that correspond to times and from which we should extract features such as hour, day of week etc.
- Returns
result – The original dataframe extended with computed temporal features.
- Return type
pandas.DataFrame
- ordinally_encode(target_column=None, method='one_hot')¶
Encode categorical (non-numeric) data.
- Parameters
target_column (str) – The name of the column containing labels. When this column is categorical, each label is replaced by a distinct integer.
method ('one_hot' (default) | 'binary') – Whether to use one-hot encoding or binary encoding to encode categorical variables.
- Returns
result – The ordinally encoded dataframe.
- Return type
pandas.DataFrame
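For intuition, the one-hot flavor of the encoding described above can be sketched with standard pandas calls. This is an illustration under assumptions rather than the package's implementation; the column names are hypothetical and the binary method is not shown.

```python
import pandas as pd

df = pd.DataFrame({
    "size": [1.0, 2.5, 3.0],
    "color": ["red", "blue", "red"],
    "label": ["yes", "no", "yes"],
})

# Target labels: each distinct label becomes a distinct integer.
target_codes, target_labels = pd.factorize(df["label"])

# Explanatory variables: one-hot encode the non-numeric column only.
features = pd.get_dummies(df.drop(columns="label"), columns=["color"], dtype=float)
encoded = features.assign(label=target_codes)
```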
- process_time_columns(columns)¶
Extract features from timestamp columns such as: Month, Day, Day of Week, Hour, AM/PM.
- Parameters
columns (list) – The list of columns that should be interpreted as UTC epoch timestamps.
- Returns
result – The features dataframe (does not include the original dataframe)
- Return type
pandas.DataFrame
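A minimal sketch of this kind of extraction follows. It assumes the epoch timestamps are expressed in seconds, which the documentation above does not specify; the function name time_features_sketch and the output column names are hypothetical.

```python
import pandas as pd


def time_features_sketch(df, columns):
    """Extract calendar features from columns holding UTC epoch timestamps."""
    features = {}
    for name in columns:
        # Assumption: timestamps are in seconds since the epoch.
        stamps = pd.to_datetime(df[name], unit="s", utc=True)
        features[f"{name}.MONTH"] = stamps.dt.month
        features[f"{name}.DAY"] = stamps.dt.day
        features[f"{name}.DAY_OF_WEEK"] = stamps.dt.dayofweek  # Monday is 0
        features[f"{name}.HOUR"] = stamps.dt.hour
        features[f"{name}.IS_PM"] = (stamps.dt.hour >= 12).astype(int)
    # Only the extracted features are returned, not the original columns.
    return pd.DataFrame(features, index=df.index)


df = pd.DataFrame({"created": [0, 1_600_000_000]})
time_features = time_features_sketch(df, ["created"])
```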
- temporal_features(max_lag=10, exclude=[], index=None, groupby=None, sort_by=None)¶
Extend the dataframe with some rolling statistics (e.g. rolling average, rolling min, rolling max, rolling max-rolling min, etc.) for all lags from 2 to the configured maximum lag.
- Parameters
exclude (list) – A list of columns to exclude from feature transformations.
max_lag (int) – The largest lag to consider.
index (str | None (default)) – The column, if any, to set as index and sort before computing rolling statistics.
groupby (str | None) – If provided, we will use this column to perform a groupby before temporal aggregation.
sort_by (str | list | None) – Columns, if any, we need to sort the dataframe by, prior to rolling.
- Returns
result – The original dataframe extended with computed temporal features.
- Return type
pandas.DataFrame
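The rolling statistics described above can be sketched with pandas' rolling windows. The sketch assumes the dataframe is already sorted, ignores the index, groupby and sort_by options, and uses a hypothetical naming scheme; rolling_sketch is not a kxy function.

```python
import pandas as pd


def rolling_sketch(df, max_lag=3, exclude=()):
    """Append rolling mean, min, max and range for every window from 2 to max_lag."""
    numeric = df.select_dtypes("number").drop(columns=list(exclude), errors="ignore")
    features = [df]
    for lag in range(2, max_lag + 1):
        window = numeric.rolling(lag)
        low, high = window.min(), window.max()
        features += [
            window.mean().add_suffix(f".ROLLING_MEAN_{lag}"),
            low.add_suffix(f".ROLLING_MIN_{lag}"),
            high.add_suffix(f".ROLLING_MAX_{lag}"),
            (high - low).add_suffix(f".ROLLING_RANGE_{lag}"),
        ]
    return pd.concat(features, axis=1)


df = pd.DataFrame({"price": [1.0, 3.0, 2.0, 6.0]})
rolled = rolling_sketch(df, max_lag=3)
```

The first lag - 1 rows of each rolling column are missing because the window is not yet full.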
- class kxy.pandas_extension.pre_learning_accessor.PreLearningAccessor(pandas_obj)¶
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various analytics for pre-learning in supervised learning problems.
This class defines the kxy_pre_learning pandas accessor.
All its methods are accessible from any DataFrame instance as df.kxy_pre_learning.<method_name>, so long as the kxy python package is imported alongside pandas.
- data_valuation(target_column, problem_type=None, anonymize=None, snr='auto', include_mutual_information=False, file_name=None)¶
Estimate the highest performance metrics achievable when predicting the target_column using all other columns.
When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not target_column is categorical.
- Parameters
target_column (str) – The name of the column containing true labels.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
anonymize (None | bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost). When set to None (the default), your data will be anonymized when it is too big.
include_mutual_information (bool) – Whether to include the mutual information between target and explanatory variables in the result.
- Returns
achievable_performance – The result is a pandas.DataFrame with columns (where applicable):
'Achievable Accuracy': The highest classification accuracy that can be achieved by a model using provided inputs to predict the label.
'Achievable R^2': The highest \(R^2\) that can be achieved by a model using provided inputs to predict the label.
'Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a model using provided inputs to predict the label.
'Achievable Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a model using provided inputs to predict the label.
- Return type
pandas.DataFrame
Theoretical Foundation
Section 1 - Achievable Performance.
- variable_selection(target_column, problem_type=None, anonymize=None, snr='auto', file_name=None)¶
Runs the model-free variable selection analysis.
When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not target_column is categorical.
- Parameters
target_column (str) – The name of the column containing true labels.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
anonymize (None | bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost). When set to None (the default), your data will be anonymized when it is too big.
- Returns
result – The result is a pandas.DataFrame with columns (where applicable):
'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.
'Variable': The column name corresponding to the input variable.
'Running Achievable R^2': The highest \(R^2\) that can be achieved by a model using all variables selected so far, including this one.
'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.
'Running Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a model using all variables selected so far, including this one.
- Return type
pandas.DataFrame
Theoretical Foundation
Section 2 - Variable Selection Analysis.
- class kxy.pandas_extension.learning_accessor.LearningAccessor(pandas_obj)¶
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various analytics for automatically training predictive models.
This class defines the kxy_learning pandas accessor.
All its methods are accessible from any DataFrame instance as df.kxy_learning.<method_name>, so long as the kxy python package is imported alongside pandas.
- fit(target_column, learner_func, problem_type=None, snr='auto', train_frac=0.8, random_state=0, max_n_features=None, min_n_features=None, start_n_features=None, anonymize=False, benchmark_feature=None, missing_value_imputation=False, score='auto', n_down_perf_before_stop=3, regression_baseline='mean', additive_learning=False, regression_error_type='additive', return_scores=False, start_n_features_perf_frac=0.9, feature_selection_method='leanml', rfe_n_features=None, boruta_pval=0.5, boruta_n_evaluations=20, max_duration=None, val_performance_buffer=0.0, path=None, data_identifier=None, pca_energy_loss_frac=0.05, pfs_p=None)¶
Train a lean boosted supervised learner, bringing in variables one at a time, in decreasing order of importance (as per df.kxy.variable_selection), until doing so no longer improves validation performance or another stopping criterion is met.
Specifically, training proceeds as follows. First, KXY’s model-free variable selection is run (i.e. df.kxy.variable_selection).
Then we train a model (instance returned by learner_func) using the start_n_features most important features/variables to predict the target (defined by target_column).
When start_n_features is None, the initial set of variables is the smallest set of variables with which we may achieve start_n_features_perf_frac of the performance we could achieve using all variables (as per df.kxy.variable_selection).
Next we consider adding one variable at a time to fix the mistakes made by the previously trained model when additive_learning is True.
If doing so improves performance on the validation set, we keep going until either performance no longer improves on the validation set n_down_perf_before_stop consecutive times, or we’ve selected max_n_features features.
When additive_learning is set to False (the default), after adding a new variable, we train the new model on the original problem, rather than trying to improve residuals.
- Parameters
target_column (str) – The name of the column containing true labels.
learner_func (function) – A function returning an instance of a base learner. The instance should define a fit(x, y) method and a predict(x) method.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
snr ('auto' | 'low' | 'high') – Set to low if the problem is difficult (i.e. has a low signal-to-noise ratio) or the number of rows is small relative to the number of columns. Only used for model-free variable selection.
train_frac (float) – The fraction of rows used for training and validation.
random_state (int) – The seed to use for random training/validation/testing split.
min_n_features (int | None) – Boosting will not stop until at least this many features/explanatory variables are selected.
max_n_features (int | None) – Boosting will stop as soon as this many features/explanatory variables are selected.
start_n_features (int) – The number of most important features boosting will start with.
anonymize (bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost).
benchmark_feature (str | None) – When not None, ‘benchmark’ performance metrics using this column as predictor will be reported in the output dictionary.
missing_value_imputation (bool) – When set to True, replace missing values with medians.
n_down_perf_before_stop (int) – Number of consecutive down performances to observe before boosting stops.
regression_baseline (str (mean | median)) – Whether to use the unconditional mean or median as the best predictor in the absence of explanatory variables. Choosing the mean corresponds to minimizing the L2 norm, whereas choosing the median corresponds to minimizing the L1 norm.
additive_learning (bool) – When a new variable is added, whether errors/residuals should be fixed or a new model should be learned from scratch.
regression_error_type (str ('additive' | 'multiplicative')) – For regression problems with additive learning, this determines whether the final model should be additive (pruning tries to reduce regressor residuals) or multiplicative (i.e. pruning tries to bring the ratio between true and predicted labels as close to 1 as possible).
start_n_features_perf_frac (float (between 0 and 1)) – When start_n_features is not specified, it is set to the number of variables required to achieve a fraction start_n_features_perf_frac of the maximum performance achievable (as per df.kxy.variable_selection).
return_scores (bool (Default False)) – Whether to return training, validation and testing performance after lean boosting.
feature_selection_method (str (leanml | rfe | boruta | pfs | pca | none. Default leanml)) – Do not change this unless you want to try out Boruta, Recursive Feature Elimination, PCA or Principal Feature Selection. The leanml method outperforms all four.
rfe_n_features (int) – The number of features to keep when the feature selection method is rfe.
boruta_pval (float) – The quantile level to use when the feature selection method is boruta.
boruta_n_evaluations (int) – The number of trials to use when the feature selection method is boruta.
max_duration (float | None (default)) – If not None, then Boruta and RFE will stop after this many seconds.
val_performance_buffer (float (Default 0.0)) – In LeanML feature selection, this is the threshold by which the new validation performance needs to exceed the previously evaluated validation performance to consider increasing the number of features.
score (str | func) – The validation metric to use to determine if a new feature should be added. When set to 'auto' (the default), the \(R^2\) is used for regression problems and the classification accuracy is used for classification problems. Any other string should be the name of a globally accessible callable.
pca_energy_loss_frac (float (Default 0.05)) – The maximum fraction of energy (or variance) that left-out principal directions should account for when PCA is the feature selection method chosen.
pfs_p (int | None (Default)) – The number of principal features to learn when using PFS. A value that is not None automatically selects the one-shot flavor of PFS instead of the PCA-style.
- Returns
result – Dictionary containing selected variables, as well as training, validation and testing performance, and the trained model.
- Return type
dict
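To make the procedure above more concrete, here is a heavily simplified sketch of the non-additive variant (additive_learning=False) for a regression problem. It is not the package's implementation: the importance ranking is supplied by hand instead of coming from variable selection, min_n_features and the other options are ignored, and LeastSquares, r_squared and lean_fit_sketch are hypothetical names. It does show what a learner_func with the required fit(x, y) and predict(x) interface can look like.

```python
import numpy as np


class LeastSquares:
    """Tiny base learner exposing the fit(x, y) / predict(x) interface."""

    def fit(self, x, y):
        design = np.column_stack([np.ones(len(x)), x])
        self.coef_, *_ = np.linalg.lstsq(design, y, rcond=None)
        return self

    def predict(self, x):
        return np.column_stack([np.ones(len(x)), x]) @ self.coef_


def learner_func():
    """Return a fresh, unfitted base learner."""
    return LeastSquares()


def r_squared(y_true, y_pred):
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)


def lean_fit_sketch(x_train, y_train, x_val, y_val, ranked, start_n_features=1,
                    n_down_perf_before_stop=3, max_n_features=None):
    """Add columns in the given importance order, retraining from scratch each time."""
    max_n = len(ranked) if max_n_features is None else min(max_n_features, len(ranked))
    best_score, best_n, best_model, n_down = -np.inf, 0, None, 0
    for n in range(start_n_features, max_n + 1):
        cols = ranked[:n]
        model = learner_func().fit(x_train[:, cols], y_train)
        score = r_squared(y_val, model.predict(x_val[:, cols]))
        if score > best_score:
            best_score, best_n, best_model, n_down = score, n, model, 0
        else:
            n_down += 1  # one more consecutive step without improvement
            if n_down >= n_down_perf_before_stop:
                break
    return {"selected": ranked[:best_n], "val_score": best_score, "model": best_model}


rng = np.random.default_rng(0)
x = rng.normal(size=(400, 4))
y = 2.0 * x[:, 0] - x[:, 2] + 0.1 * rng.normal(size=400)
ranked = [0, 2, 1, 3]  # hypothetical importance order, e.g. from a variable selection step
result = lean_fit_sketch(x[:300], y[:300], x[300:], y[300:], ranked)
```

Only columns 0 and 2 carry signal in this synthetic data, so the validation score jumps when the second ranked column is added and changes very little afterwards.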
- predict(obj, memory_bound=False)¶
Make predictions using the fitted model.
- Parameters
obj (pandas.DataFrame) – A dataframe containing test explanatory variables about which we want to make predictions.
memory_bound (bool (Default False)) – Whether we should try to save memory.
- Returns
result – A dataframe with the same index as obj, and with one column whose name is the target_column used for training.
- Return type
pandas.DataFrame
- class kxy.pandas_extension.post_learning_accessor.PostLearningAccessor(pandas_obj)¶
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various analytics for post-learning in supervised learning problems.
This class defines the kxy_post_learning pandas accessor.
All its methods are accessible from any DataFrame instance as df.kxy_post_learning.<method_name>, so long as the kxy python package is imported alongside pandas.
- data_driven_improvability(target_column, new_variables, problem_type=None, anonymize=None, snr='auto', file_name=None)¶
Estimate the potential performance boost that a set of new explanatory variables can bring about.
- Parameters
target_column (str) – The name of the column containing true labels.
new_variables (list) – The names of the columns to use as new explanatory variables.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from whether or not target_column is categorical.
anonymize (bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost). When set to None (the default), your data will be anonymized when it is too big.
- Returns
result – The result is a pandas.DataFrame with columns (where applicable):
'Accuracy Boost': The classification accuracy boost that the new explanatory variables can bring about.
'R-Squared Boost': The \(R^2\) boost that the new explanatory variables can bring about.
'RMSE Reduction': The reduction in Root Mean Square Error that the new explanatory variables can bring about.
'Log-Likelihood Per Sample Boost': The boost in log-likelihood per sample that the new explanatory variables can bring about.
- Return type
pandas.DataFrame
Theoretical Foundation
Section 3 - Model Improvability.
See also
kxy.post_learning.improvability.data_driven_improvability
- model_driven_improvability(target_column, prediction_column, problem_type=None, anonymize=None, snr='auto', file_name=None)¶
Estimate the extent to which a trained supervised learner may be improved in a model-driven fashion (i.e. without resorting to additional explanatory variables).
- Parameters
target_column (str) – The name of the column containing true labels.
prediction_column (str) – The name of the column containing model predictions.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from whether or not target_column is categorical.
anonymize (None | bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost). When set to None (the default), your data will be anonymized when it is too big.
- Returns
result – The result is a pandas.DataFrame with columns (where applicable):
'Lost Accuracy': The amount of classification accuracy that was irreversibly lost when training the supervised learner.
'Lost R-Squared': The amount of \(R^2\) that was irreversibly lost when training the supervised learner.
'Lost RMSE': The amount of Root Mean Square Error that was irreversibly lost when training the supervised learner.
'Lost Log-Likelihood Per Sample': The amount of true log-likelihood per sample that was irreversibly lost when training the supervised learner.
'Residual R-Squared': For regression problems, this is the highest \(R^2\) that may be achieved when using explanatory variables to predict regression residuals.
'Residual RMSE': For regression problems, this is the lowest Root Mean Square Error that may be achieved when using explanatory variables to predict regression residuals.
'Residual Log-Likelihood Per Sample': For regression problems, this is the highest log-likelihood per sample that may be achieved when using explanatory variables to predict regression residuals.
- Return type
pandas.DataFrame
Theoretical Foundation
Section 3 - Model Improvability.
See also
kxy.post_learning.improvability.model_driven_improvability
- model_explanation(prediction_column, problem_type=None, anonymize=None, snr='auto', file_name=None)¶
Analyzes the variables that a model relies on the most in a brute-force fashion.
The first variable is the variable the model relies on the most. The second variable is the variable that complements the first variable the most in explaining model decisions etc.
Running performances should be understood as the performance achievable when trying to guess model predictions using variables with selection order smaller or equal to that of the row.
When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not prediction_column is categorical.
- Parameters
prediction_column (str) – The name of the column containing model predictions.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
anonymize (None | bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost). When set to None (the default), your data will be anonymized when it is too big.
- Returns
result – The result is a pandas.DataFrame with columns (where applicable):
'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.
'Variable': The column name corresponding to the input variable.
'Running Achievable R^2': The highest \(R^2\) that can be achieved by a model using all variables selected so far, including this one.
'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.
'Running Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a model using all variables selected so far, including this one.
- Return type
pandas.DataFrame
Theoretical Foundation
Section 2 - Variable Selection Analysis.
- class kxy.pandas_extension.finance_accessor.FinanceAccessor(pandas_obj)¶
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various finance-specific analytics.
This class defines the kxy_finance pandas accessor.
All its methods are accessible from any DataFrame instance as df.kxy_finance.<method_name>, so long as the kxy python package is imported alongside pandas.
- information_adjusted_beta(market_column, asset_column, anonymize=False)¶
Estimate the information-adjusted beta of an asset return \(r\) relative to the market return \(r_m\): \(\text{IA-}\beta := \text{IA-Corr}\left(r, r_m \right) \sqrt{\frac{\text{Var}(r)}{\text{Var}(r_m)}}\), where \(\text{IA-Corr}\left(r, r_m \right) := \text{sgn}\left(\text{Corr}\left(r, r_m \right) \right) \left[1 - e^{-2I(r, r_m)} \right]\) denotes the information-adjusted correlation coefficient, with \(\text{sgn}\left(\text{Corr}\left(r, r_m \right) \right)\) the sign of the Pearson correlation coefficient.
Unlike the traditional beta coefficient, namely \(\beta := \text{Corr}\left(r, r_m \right) \sqrt{\frac{\text{Var}(r)}{\text{Var}(r_m)}}\), that only captures linear relations between market and asset returns, and that is 0 if and only if the two are decorrelated, \(\text{IA-}\beta\) captures any relationship between asset return and market return, linear or nonlinear, and is 0 if and only if the two variables are statistically independent.
- Parameters
market_column (str) – The name of the column containing market returns.
asset_column (str) – The name of the column containing asset returns.
anonymize (bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost).
- Returns
result – The information-adjusted beta coefficient.
- Return type
float
- information_adjusted_correlation(market_column, asset_column, anonymize=False)¶
Estimate the information-adjusted correlation between an asset return \(r\) and the market return \(r_m\): \(\text{IA-Corr}\left(r, r_m \right) := \text{sgn}\left(\text{Corr}\left(r, r_m \right) \right) \left[1 - e^{-2I(r, r_m)} \right]\), where \(\text{sgn}\left(\text{Corr}\left(r, r_m \right) \right)\) is the sign of the Pearson correlation coefficient.
Unlike Pearson’s correlation coefficient, which is 0 if and only if asset return and market return are decorrelated (i.e. they exhibit no linear relation), information-adjusted correlation is 0 if and only if market and asset returns are statistically independent (i.e. they exhibit no relation, linear or nonlinear).
- Parameters
market_column (str) – The name of the column containing market returns.
asset_column (str) – The name of the column containing asset returns.
anonymize (bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost).
- Returns
result – The information-adjusted correlation.
- Return type
float
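The two formulas above can be evaluated directly once an estimate of the mutual information \(I(r, r_m)\) is available. The sketch below takes that estimate as an input, assumed to be expressed in nats, and does not attempt to estimate it; the function names ia_correlation and ia_beta are hypothetical and the sample returns are made up for illustration.

```python
import math

import numpy as np


def ia_correlation(pearson_corr, mutual_information):
    """sgn(Corr) * [1 - exp(-2 I)], with the mutual information I in nats."""
    return float(np.sign(pearson_corr)) * (1.0 - math.exp(-2.0 * mutual_information))


def ia_beta(asset, market, mutual_information):
    """IA-Corr(r, r_m) * sqrt(Var(r) / Var(r_m))."""
    corr = np.corrcoef(asset, market)[0, 1]
    scale = math.sqrt(np.var(asset) / np.var(market))
    return ia_correlation(corr, mutual_information) * scale


market = np.array([0.01, -0.02, 0.03, 0.0])
asset = 2.0 * market
# The mutual information value below is an arbitrary input for the example.
beta = ia_beta(asset, market, mutual_information=0.5)
```

With zero mutual information both quantities are 0, consistent with the independence property stated above.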