Data Valuation
Estimation of the highest performance achievable in a supervised learning problem, e.g. \(R^2\), RMSE, classification accuracy, or true log-likelihood per observation.
- kxy.pre_learning.achievable_performance.data_valuation(data_df, target_column, problem_type, snr='auto', include_mutual_information=False, file_name=None)
Estimate the highest performance metrics achievable when predicting the target_column using all other columns.
When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not target_column is categorical.
- Parameters
data_df (pandas.DataFrame) – The pandas DataFrame containing the data.
target_column (str) – The name of the column containing true labels.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
include_mutual_information (bool) – Whether to include the mutual information between target and explanatory variables in the result.
file_name (None | str) – A unique identifier characterizing data_df in the form of a file name. Do not set this unless you know why.
- Returns
achievable_performance – The result is a pandas.DataFrame with columns (where applicable):
'Achievable Accuracy': The highest classification accuracy that can be achieved by a model using the provided inputs to predict the label.
'Achievable R-Squared': The highest \(R^2\) that can be achieved by a model using the provided inputs to predict the label.
'Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a model using the provided inputs to predict the label.
'Achievable Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a model using the provided inputs to predict the label.
- Return type
pandas.DataFrame
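Example (illustrative sketch). The snippet below shows one way the function documented above might be called on a small synthetic regression dataset. The helper names make_toy_regression_data and estimate_achievable_performance, the column names x1, x2 and y, and the data itself are hypothetical and are not part of kxy. Only the import path and the arguments listed in the signature above are taken from this page. It is also an assumption that the kxy package is installed and that any account or API key it may need has been configured beforehand.

```python
import numpy as np
import pandas as pd


def make_toy_regression_data(n_rows=500, seed=0):
    """Build a small synthetic DataFrame (hypothetical data, for illustration only).

    The target y depends linearly on x1 plus Gaussian noise; x2 is unrelated to y.
    """
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n_rows)
    x2 = rng.normal(size=n_rows)
    y = 2.0 * x1 + rng.normal(scale=0.5, size=n_rows)
    return pd.DataFrame({"x1": x1, "x2": x2, "y": y})


def estimate_achievable_performance(data_df, target_column):
    """Call data_valuation using only the arguments documented above."""
    # Imported lazily so that the data helper works without kxy installed.
    # Assumption: kxy is installed and configured in the current environment.
    from kxy.pre_learning.achievable_performance import data_valuation

    return data_valuation(
        data_df,
        target_column,
        problem_type="regression",        # set explicitly instead of relying on inference
        include_mutual_information=True,  # also request the mutual information
    )
```

Calling estimate_achievable_performance(make_toy_regression_data(), "y") should, per the Returns description above, yield a pandas.DataFrame with the applicable achievable-performance columns. The actual values depend on the library's estimate and are not reproduced here.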
Theoretical Foundation
Section 1 - Achievable Performance.