Bank Note (UCI, Classification, n=1372, d=4, 2 classes)

Loading The Data

In [1]:

from kxy_datasets.uci_classifications import BankNote # pip install kxy_datasets

In [2]:

dataset = BankNote()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'

In [3]:

df.kxy.describe() # Visualize a summary of the data


---------------
Column: Entropy
---------------
Type:   Continuous
Max:    2.4
p75:    0.4
Mean:   -1.2
Median: -0.6
p25:    -2.4
Min:    -8.5

---------------
Column: Is Fake
---------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.4
Median: 0.0
p25:    0.0
Min:    0.0

----------------
Column: Kurtosis
----------------
Type:   Continuous
Max:    17
p75:    3.2
Mean:   1.4
Median: 0.6
p25:    -1.6
Min:    -5.3

----------------
Column: Skewness
----------------
Type:   Continuous
Max:    12
p75:    6.8
Mean:   1.9
Median: 2.3
p25:    -1.7
Min:    -13.8

----------------
Column: Variance
----------------
Type:   Continuous
Max:    6.8
p75:    2.8
Mean:   0.4
Median: 0.5
p25:    -1.8
Min:    -7.0
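
For readers without the kxy accessor at hand, the fields in the summary above (Max, p75, Mean, Median, p25, Min) can be approximated with the standard library. This is a minimal sketch, not the kxy implementation: the `summarize` helper and the toy column are hypothetical, and the quartile method (linear interpolation, `method="inclusive"`) is an assumption, since the output above does not say how kxy computes percentiles.

```python
import statistics


def summarize(values):
    """Return max, p75, mean, median, p25 and min for a numeric column."""
    # Assumption: quartiles by linear interpolation over the observed range.
    p25, median, p75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Max": max(values),
        "p75": p75,
        "Mean": statistics.fmean(values),
        "Median": median,
        "p25": p25,
        "Min": min(values),
    }


# Hypothetical toy column, not taken from the Bank Note data.
toy_column = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = summarize(toy_column)

# Print in the same field order as the summary above.
for name, value in summary.items():
    print(f"{name + ':':<8}{value:.1f}")
```

In the notebook, the same helper could be applied to a real column, for example `summarize(df['Variance'].tolist())`, assuming the column holds no missing values.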

Data Valuation

In [4]:

df.kxy.data_valuation(y_column, problem_type=problem_type)

Out[4]:

	Achievable R-Squared	Achievable Log-Likelihood Per Sample	Achievable Accuracy
0	0.75	0.00	1.00

Automatic (Model-Free) Variable Selection

In [5]:

df.kxy.variable_selection(y_column, problem_type=problem_type)

Out[5]:

Selection Order	Variable	Running Achievable R-Squared	Running Achievable Accuracy
0	No Variable	0.00	0.56
1	Variance	0.51	0.90
2	Skewness	0.58	0.93
3	Kurtosis	0.75	1.00
4	Entropy	0.75	1.00
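
The running columns are cumulative, so the contribution of each variable is easier to read as the difference from the previous row. The sketch below computes those differences from the figures printed above; the `marginal_gains` helper is hypothetical and not part of kxy, and the inputs are the rounded values shown in the table, so the gains carry the same rounding.

```python
# Rows copied from the variable-selection output:
# (variable, running achievable R-squared, running achievable accuracy)
rows = [
    ("No Variable", 0.00, 0.56),
    ("Variance", 0.51, 0.90),
    ("Skewness", 0.58, 0.93),
    ("Kurtosis", 0.75, 1.00),
    ("Entropy", 0.75, 1.00),
]


def marginal_gains(rows):
    """Return (variable, R-squared gain, accuracy gain) relative to the previous row."""
    gains = []
    for (_, prev_r2, prev_acc), (name, r2, acc) in zip(rows, rows[1:]):
        # Round to two decimals to match the precision of the table.
        gains.append((name, round(r2 - prev_r2, 2), round(acc - prev_acc, 2)))
    return gains


gains = marginal_gains(rows)
for name, d_r2, d_acc in gains:
    print(f"{name:<10} R-Squared +{d_r2:.2f}  Accuracy +{d_acc:.2f}")
```

Read this way, Entropy adds nothing to either running figure once Variance, Skewness and Kurtosis have been selected.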