Power Plant (UCI, Regression, n=9568, d=4)
Loading The Data
In [1]:
from kxy_datasets.uci_regressions import PowerPlant # pip install kxy_datasets
In [2]:
dataset = PowerPlant()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data
----------
Column: AP
----------
Type: Continuous
Max: 1,033
p75: 1,017
Mean: 1,013
Median: 1,012
p25: 1,009
Min: 992
----------
Column: AT
----------
Type: Continuous
Max: 37
p75: 25
Mean: 19
Median: 20
p25: 13
Min: 1.8
----------
Column: PE
----------
Type: Continuous
Max: 495
p75: 468
Mean: 454
Median: 451
p25: 439
Min: 420
----------
Column: RH
----------
Type: Continuous
Max: 100
p75: 84
Mean: 73
Median: 74
p25: 63
Min: 25
---------
Column: V
---------
Type: Continuous
Max: 81
p75: 66
Mean: 54
Median: 52
p25: 41
Min: 25
Data Valuation
In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
| | Achievable R-Squared | Achievable Log-Likelihood Per Sample | Achievable RMSE |
|---|---|---|---|
| 0 | 0.94 | 1.38 | 4.31 |
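As a rough sanity check, the achievable R-squared and achievable RMSE can be related through the usual identity RMSE ≈ std(y) · sqrt(1 − R²). The sketch below assumes that this identity applies here and that the "No Variable" RMSE of 17.1 reported in the variable selection output approximates the standard deviation of the target. Neither assumption is confirmed by the kxy output itself, and `rmse_from_r2` is a hypothetical helper, not part of the kxy API.

```python
import math


def rmse_from_r2(r2: float, target_std: float) -> float:
    """RMSE implied by an R-squared, assuming RMSE = std(y) * sqrt(1 - R^2)."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie in [0, 1]")
    return target_std * math.sqrt(1.0 - r2)


# Assumption: the 'No Variable' RMSE (17.1) approximates the target's
# standard deviation, since no variable can explain any variance there.
TARGET_STD = 17.1

implied = rmse_from_r2(0.94, TARGET_STD)
print(f"Implied RMSE at R^2 = 0.94: {implied:.2f}")
```

The implied value comes out slightly below the reported 4.31. A gap of that size is what one would expect if the displayed R-squared has been rounded to two decimals, so the two figures look mutually consistent under these assumptions.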
Automatic (Model-Free) Variable Selection
In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
| Selection Order | Variable | Running Achievable R-Squared | Running Achievable RMSE |
|---|---|---|---|
| 0 | No Variable | 0.00 | 17.1 |
| 1 | AT | 0.90 | 5.47 |
| 2 | V | 0.93 | 4.60 |
| 3 | RH | 0.93 | 4.60 |
| 4 | AP | 0.94 | 4.31 |
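One way to read the running figures is to look at how much each variable adds on top of the ones selected before it. The following minimal sketch does that arithmetic on the values copied from the output above. `marginal_gains` is a hypothetical helper written for illustration, and the gains are only as precise as the two-decimal values shown.

```python
# Running achievable R-squared, copied from the variable selection output.
running_r2 = [
    ("No Variable", 0.00),
    ("AT", 0.90),
    ("V", 0.93),
    ("RH", 0.93),
    ("AP", 0.94),
]


def marginal_gains(rows):
    """Return (variable, gain) pairs: each variable's increase in running R-squared."""
    gains = []
    for (_, previous), (name, current) in zip(rows, rows[1:]):
        # Round to the precision of the displayed values to hide float noise.
        gains.append((name, round(current - previous, 2)))
    return gains


for name, gain in marginal_gains(running_r2):
    print(f"{name}: +{gain:.2f}")
```

At this precision AT accounts for almost all of the achievable R-squared, V adds a further 0.03, and RH adds nothing visible. The true contribution of RH may still be nonzero below the rounding threshold.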