Avila (UCI, Classification, n=20867, d=10, 12 classes)
Loading The Data
In [1]:
from kxy_datasets.uci_classifications import Avila # pip install kxy_datasets
In [2]:
dataset = Avila()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data
-------------
Column: Class
-------------
Type: Categorical
Frequency: 41.08%, Label: A
Frequency: 18.80%, Label: F
Frequency: 10.50%, Label: E
Frequency: 7.97%, Label: I
Frequency: 5.00%, Label: X
Frequency: 4.98%, Label: H
Frequency: 4.28%, Label: G
Other Labels: 7.39%
----------
Column: F1
----------
Type: Continuous
Max: 11
p75: 0.2
Mean: -0.0
Median: 0.1
p25: -0.1
Min: -3.5
-----------
Column: F10
-----------
Type: Continuous
Max: 11
p75: 0.5
Mean: 0.0
Median: -0.0
p25: -0.5
Min: -6.7
----------
Column: F2
----------
Type: Continuous
Max: 386
p75: 0.2
Mean: 0.0
Median: -0.1
p25: -0.3
Min: -2.4
----------
Column: F3
----------
Type: Continuous
Max: 50
p75: 0.4
Mean: 0.0
Median: 0.2
p25: 0.1
Min: -3.2
----------
Column: F4
----------
Type: Continuous
Max: 4.0
p75: 0.6
Mean: 0.0
Median: 0.1
p25: -0.5
Min: -5.4
----------
Column: F5
----------
Type: Continuous
Max: 1.1
p75: 0.3
Mean: 0.0
Median: 0.3
p25: 0.2
Min: -4.9
----------
Column: F6
----------
Type: Continuous
Max: 53
p75: 0.6
Mean: 0.0
Median: -0.1
p25: -0.6
Min: -7.5
----------
Column: F7
----------
Type: Continuous
Max: 83
p75: 0.4
Mean: 0.0
Median: 0.2
p25: -0.0
Min: -11.9
----------
Column: F8
----------
Type: Continuous
Max: 13
p75: 0.6
Mean: 0.0
Median: 0.1
p25: -0.5
Min: -4.2
----------
Column: F9
----------
Type: Continuous
Max: 44
p75: 0.5
Mean: 0.0
Median: 0.1
p25: -0.4
Min: -5.5
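The six statistics reported for each continuous column (Max, p75, Mean, Median, p25, Min) can be cross-checked with plain pandas. The following is a minimal sketch, not the `kxy` implementation: the helper `summarize_continuous` and the `toy_feature` values are hypothetical and exist only to illustrate the computation.

```python
import pandas as pd


def summarize_continuous(series: pd.Series) -> dict:
    """Return the same six statistics shown for continuous columns above.

    Hypothetical helper for illustration; it is not part of kxy.
    """
    quartiles = series.quantile([0.25, 0.5, 0.75])
    return {
        "Max": series.max(),
        "p75": quartiles.loc[0.75],
        "Mean": series.mean(),
        "Median": quartiles.loc[0.5],
        "p25": quartiles.loc[0.25],
        "Min": series.min(),
    }


# Made-up values (NOT taken from Avila), chosen so that a few extreme
# points sit far from the bulk of the data.
toy_feature = pd.Series([-3.5, -0.1, 0.1, 0.2, 11.0], name="toy_feature")
summary = summarize_continuous(toy_feature)

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

With the real data, `summarize_continuous(df['F2'])` would be the analogous call. In the summary above, every feature has a mean close to 0 and quartiles within roughly one unit of it, while some maxima are far larger (386 for F2), so the Max and Min lines are worth reading alongside the quartiles.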
Data Valuation
In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
| | Achievable R-Squared | Achievable Log-Likelihood Per Sample | Achievable Accuracy |
|---|---|---|---|
| 0 | 0.97 | 4.44e-16 | 1.00 |
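The reported achievable log-likelihood per sample, 4.44e-16, is on the order of double-precision machine epsilon. One plausible reading, offered here as an assumption rather than a statement about how `kxy` computes the figure, is that it is a rounding artifact around 0, which is the log-likelihood of a classifier that assigns probability 1 to the correct class and is consistent with the achievable accuracy of 1.00. The check below only compares magnitudes.

```python
import sys

# Machine epsilon for IEEE 754 double precision (about 2.22e-16).
eps = sys.float_info.epsilon

# Value copied from the data valuation table above.
reported_log_likelihood = 4.44e-16

# How many machine epsilons is the reported value?
ratio = reported_log_likelihood / eps

# Twice machine epsilon, formatted the same way as the table entry.
formatted_two_eps = f"{2 * eps:.2e}"

print(f"eps = {eps:.3e}")
print(f"reported / eps = {ratio:.2f}")
print(f"2 * eps = {formatted_two_eps}")
```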
Automatic (Model-Free) Variable Selection
In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
| Selection Order | Variable | Running Achievable R-Squared | Running Achievable Accuracy |
|---|---|---|---|
| 0 | No Variable | 0.00 | 0.41 |
| 1 | F4 | 0.97 | 1.00 |
| 2 | F10 | 0.97 | 1.00 |
| 3 | F8 | 0.97 | 1.00 |
| 4 | F6 | 0.97 | 1.00 |
| 5 | F7 | 0.97 | 1.00 |
| 6 | F1 | 0.97 | 1.00 |
| 7 | F2 | 0.97 | 1.00 |
| 8 | F3 | 0.97 | 1.00 |
| 9 | F5 | 0.97 | 1.00 |
| 10 | F9 | 0.97 | 1.00 |
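Two readings of the table can be checked with a few lines of pandas. First, the "No Variable" accuracy of 0.41 matches the frequency of the most common class (A, 41.08%), which is what always predicting the majority label would score. Second, the running accuracy stops improving after F4. The sketch below hard-codes numbers copied from the outputs above; the 0.005 threshold and all variable names are assumptions made for illustration, and nothing here claims to reproduce how `kxy` derives its estimates.

```python
import pandas as pd

# Class frequencies copied from the describe() output, as fractions.
# The aggregated "Other Labels" bucket (7.39%) is left out because it
# groups several labels, each rarer than G.
named_class_frequencies = {
    "A": 0.4108, "F": 0.1880, "E": 0.1050, "I": 0.0797,
    "X": 0.0500, "H": 0.0498, "G": 0.0428,
}

# Accuracy of a classifier that always predicts the most frequent label.
majority_label = max(named_class_frequencies, key=named_class_frequencies.get)
baseline_accuracy = named_class_frequencies[majority_label]

# Running achievable accuracy copied from the variable selection table.
selection = pd.DataFrame({
    "Variable": ["No Variable", "F4", "F10", "F8", "F6", "F7",
                 "F1", "F2", "F3", "F5", "F9"],
    "Running Achievable Accuracy": [0.41] + [1.00] * 10,
})

# Gain contributed by each variable on top of those selected before it.
selection["Marginal Gain"] = (
    selection["Running Achievable Accuracy"].diff().fillna(0.0)
)

# Keep variables whose gain exceeds an arbitrary, illustrative threshold.
informative = selection.loc[
    selection["Marginal Gain"] > 0.005, "Variable"
].tolist()

print(f"Majority label: {majority_label}, baseline: {baseline_accuracy:.2f}")
print(f"Variables with a visible marginal gain: {informative}")
```

Because the table is rounded to two decimals, gains smaller than the rounding step are invisible in this check; it only shows that, at the displayed precision, the variables after F4 add nothing.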