Concrete (UCI, Regression, n=1030, d=8)
Loading The Data
In [1]:
import kxy # pip install kxy; importing it registers the df.kxy accessor used below
from kxy_datasets.uci_regressions import Concrete # pip install kxy_datasets
In [2]:
dataset = Concrete()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data
-----------
Column: Age
-----------
Type: Continuous
Max: 365
p75: 56
Mean: 45
Median: 28
p25: 7.0
Min: 1.0
--------------------------
Column: Blast Furnace Slag
--------------------------
Type: Continuous
Max: 359
p75: 142
Mean: 73
Median: 22
p25: 0.0
Min: 0.0
--------------
Column: Cement
--------------
Type: Continuous
Max: 540
p75: 350
Mean: 281
Median: 272
p25: 192
Min: 102
------------------------
Column: Coarse Aggregate
------------------------
Type: Continuous
Max: 1,145
p75: 1,029
Mean: 972
Median: 968
p25: 932
Min: 801
-------------------------------------
Column: Concrete Compressive Strength
-------------------------------------
Type: Continuous
Max: 82
p75: 46
Mean: 35
Median: 34
p25: 23
Min: 2.3
----------------------
Column: Fine Aggregate
----------------------
Type: Continuous
Max: 992
p75: 824
Mean: 773
Median: 779
p25: 730
Min: 594
---------------
Column: Fly Ash
---------------
Type: Continuous
Max: 200
p75: 118
Mean: 54
Median: 0.0
p25: 0.0
Min: 0.0
------------------------
Column: Superplasticizer
------------------------
Type: Continuous
Max: 32
p75: 10
Mean: 6.2
Median: 6.3
p25: 0.0
Min: 0.0
-------------
Column: Water
-------------
Type: Continuous
Max: 247
p75: 192
Mean: 181
Median: 185
p25: 164
Min: 121
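For readers without the `kxy` package at hand, the same six statistics can be approximated with plain pandas. This is a minimal sketch, not the implementation behind `df.kxy.describe()`; the helper name `summarize` and the toy values are made up for illustration and are not rows of the Concrete dataset.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return max, quartiles, mean and min for each numeric column.

    Hypothetical helper: mirrors the fields printed above, one row per column,
    with columns sorted alphabetically.
    """
    stats = {
        "Max": df.max(),
        "p75": df.quantile(0.75),
        "Mean": df.mean(),
        "Median": df.median(),
        "p25": df.quantile(0.25),
        "Min": df.min(),
    }
    return pd.DataFrame(stats).sort_index()


# Toy data for illustration only (not taken from the Concrete dataset).
toy = pd.DataFrame(
    {
        "Water": [150.0, 160.0, 170.0, 180.0, 190.0],
        "Cement": [100.0, 200.0, 300.0, 400.0, 500.0],
    }
)
summary = summarize(toy)
print(summary)
```

On the real data, `summarize(df)` should give figures close to the ones above, up to the rounding applied in the printed summary.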
Data Valuation
In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
| | Achievable R-Squared | Achievable Log-Likelihood Per Sample | Achievable RMSE |
|---|---|---|---|
| 0 | 1.00 | -7.98e-01 | 5.35e-01 |
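The reported R-squared of 1.00 is a rounded figure. A rough way to see this is to assume the two metrics are linked by RMSE = std(y) * sqrt(1 - R²), and to take std(y) from the "No Variable" row of the variable selection table in this notebook (RMSE 1.67e+01 at R-squared 0.00). That relationship is an assumption of this sketch, although the rows of that table agree with it up to rounding. The function names below are illustrative.

```python
import math


def rmse_from_r2(r2: float, target_std: float) -> float:
    """RMSE implied by an R-squared, assuming RMSE = std(y) * sqrt(1 - R^2)."""
    return target_std * math.sqrt(1.0 - r2)


def r2_from_rmse(rmse: float, target_std: float) -> float:
    """Inverse of rmse_from_r2 under the same assumption."""
    return 1.0 - (rmse / target_std) ** 2


# Standard deviation of the target, read off the "No Variable" row (1.67e+01).
TARGET_STD = 16.7

# R-squared implied by the achievable RMSE of 5.35e-01 reported above.
implied_r2 = r2_from_rmse(0.535, TARGET_STD)
print(f"Implied achievable R-squared: {implied_r2:.4f}")

# Sanity check against the "Age" row: R-squared 0.55, RMSE 1.12e+01.
age_rmse = rmse_from_r2(0.55, TARGET_STD)
print(f"RMSE implied by R-squared 0.55: {age_rmse:.2f}")
```

Under this assumption the achievable R-squared is just below 1, which displays as 1.00 at two decimals.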
Automatic (Model-Free) Variable Selection
In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
| Selection Order | Variable | Running Achievable R-Squared | Running Achievable RMSE |
|---|---|---|---|
| 0 | No Variable | 0.00 | 1.67e+01 |
| 1 | Age | 0.55 | 1.12e+01 |
| 2 | Cement | 0.69 | 9.35 |
| 3 | Superplasticizer | 0.80 | 7.49 |
| 4 | Blast Furnace Slag | 0.85 | 6.37 |
| 5 | Water | 0.85 | 6.37 |
| 6 | Fine Aggregate | 0.90 | 5.23 |
| 7 | Coarse Aggregate | 0.90 | 5.23 |
| 8 | Fly Ash | 1.00 | 0.53 |
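The running figures are cumulative, so the contribution of each variable at its selection step is the difference between consecutive rows. The sketch below copies the table into a pandas dataframe by hand (it does not call `kxy`) and computes those differences; the column names `Running R2` and `R2 Gain` are chosen here for illustration.

```python
import pandas as pd

# Values copied by hand from the variable selection table above.
selection = pd.DataFrame(
    {
        "Variable": [
            "No Variable", "Age", "Cement", "Superplasticizer",
            "Blast Furnace Slag", "Water", "Fine Aggregate",
            "Coarse Aggregate", "Fly Ash",
        ],
        "Running R2": [0.00, 0.55, 0.69, 0.80, 0.85, 0.85, 0.90, 0.90, 1.00],
        "Running RMSE": [16.7, 11.2, 9.35, 7.49, 6.37, 6.37, 5.23, 5.23, 0.53],
    }
)

# Gain in running R-squared at each step, rounded to the table's precision.
selection["R2 Gain"] = selection["Running R2"].diff().fillna(0.0).round(2)

# Variables whose step leaves the running R-squared unchanged.
is_selected = selection.index > 0
no_gain = selection.loc[is_selected & (selection["R2 Gain"] == 0.0), "Variable"].tolist()

print(selection[["Variable", "R2 Gain"]])
print("No gain at reported precision:", no_gain)
```

A zero gain only says that the variable added nothing measurable on top of the variables selected before it, at two-decimal precision. It does not say the variable is uninformative on its own.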