Cheatsheet¶
Imports¶
import pandas as pd
import kxy
From now on, df refers to a Pandas dataframe object, y_column is the column of df to be used as the target, and all columns of df other than y_column are treated as explanatory variables. problem_type is a variable taking the value 'regression' for regression problems and 'classification' for classification problems.
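For instance, building on the imports above, a minimal setup might look like this (the file name and column name below are hypothetical placeholders):
# Hypothetical setup: load training data, then name the target and problem type.
df = pd.read_csv('training_data.csv')
y_column = 'target'
problem_type = 'regression'  # or 'classification'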
Data Valuation¶
df.kxy.data_valuation(y_column, problem_type=problem_type)
By default, your data is transmitted to our backend in the clear. To anonymize your data before performing data valuation, simply set anonymize=True.
df.kxy.data_valuation(y_column, problem_type=problem_type, anonymize=True) # Data valuation using anonymized data.
Automatic (Model-Free) Feature Selection¶
df.kxy.variable_selection(y_column, problem_type=problem_type)
By default, your data is transmitted to our backend in the clear. To anonymize your data before performing automatic feature selection, simply set anonymize=True.
df.kxy.variable_selection(y_column, problem_type=problem_type, anonymize=True) # Variable selection using anonymized data.
Model Compression¶
Here’s how to wrap feature selection around LightGBM in Python.
from kxy.learning import get_lightgbm_learner_learning_api
params = {
    'objective': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 100,
    'n_jobs': -1,
    'learning_rate': 0.1,
    'verbose': -1,
}
learner_func = get_lightgbm_learner_learning_api(params, num_boost_round=10000,
    early_stopping_rounds=50, verbose_eval=50)
# feature_selection_method is an argument of df.kxy.fit, not of lightgbm.train.
results = df.kxy.fit(y_column, learner_func, problem_type=problem_type,
    feature_selection_method='leanml')
# The trained model
predictor = results['predictor']
# Feature columns selected
selected_variables = predictor.selected_variables
# Make predictions from a dataframe of test data.
predictions = predictor.predict(test_df)
Parameters of get_lightgbm_learner_learning_api should be the same as those of lightgbm.train; see the LightGBM documentation.
Wrapping feature selection around another model in Python is identical, except for learner_func. Here’s how to create learner_func for other models.
For XGBoost:
from kxy.learning import get_xgboost_learner
# Use 'xgboost.XGBClassifier' for classification problems.
xgboost_learner_func = get_xgboost_learner('xgboost.XGBRegressor')
Parameters of get_xgboost_learner should be those you’d pass to instantiate xgboost.XGBRegressor or xgboost.XGBClassifier; see the XGBoost documentation.
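As noted above, only learner_func changes; the fit call is the same as for LightGBM:
results = df.kxy.fit(y_column, xgboost_learner_func, problem_type=problem_type)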
For Scikit-Learn models:
from kxy.learning import get_sklearn_learner
# Replace 'sklearn.ensemble.RandomForestRegressor' with the import path of the sklearn model you want to use.
rf_learner_func = get_sklearn_learner('sklearn.ensemble.RandomForestRegressor',
    min_samples_split=0.01, max_samples=0.5, n_estimators=100)
df.kxy.fit(y_column, rf_learner_func, problem_type=problem_type)
Parameters of get_sklearn_learner should be those you’d pass to instantiate the scikit-learn model.
Model-Driven Improvability¶
For the model-driven improvability analysis, predictions made by the production model should be contained in a column of df, and the variable prediction_column refers to that column. All columns in df other than y_column and prediction_column are considered to be the explanatory variables/features used to train the production model.
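As an illustration, such a column could be populated as follows, assuming a trained production_model object with a scikit-learn-style predict method (both the object and the column name are hypothetical):
# Hypothetical sketch: store the production model's predictions in a dedicated column.
df['prediction'] = production_model.predict(df.drop(columns=[y_column]))
prediction_column = 'prediction'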
anonymize = False # Set to True to anonymize your data before model-driven improvability
df.kxy.model_driven_improvability(y_column, prediction_column, problem_type=problem_type, anonymize=anonymize)
Data-Driven Improvability¶
For the data-driven improvability analysis, the list of columns representing the new features/explanatory variables to consider (new_variables) should be provided. All columns in df that are neither y_column nor contained in new_variables are assumed to be the explanatory variables/features used to train the production model.
anonymize = False # Set to True to anonymize your data before data-driven improvability
df.kxy.data_driven_improvability(y_column, new_variables, problem_type=problem_type, anonymize=anonymize)