Concrete (UCI, Regression, n=1030, d=8)

Loading The Data

In [1]:
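import kxy # pip install kxy; registers the df.kxy accessor used below (assumed not to be imported by kxy_datasets)
from kxy_datasets.uci_regressions import Concrete # pip install kxy_datasets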
In [2]:
dataset = Concrete()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data

-----------
Column: Age
-----------
Type:   Continuous
Max:    365
p75:    56
Mean:   45
Median: 28
p25:    7.0
Min:    1.0

--------------------------
Column: Blast Furnace Slag
--------------------------
Type:   Continuous
Max:    359
p75:    142
Mean:   73
Median: 22
p25:    0.0
Min:    0.0

--------------
Column: Cement
--------------
Type:   Continuous
Max:    540
p75:    350
Mean:   281
Median: 272
p25:    192
Min:    102

------------------------
Column: Coarse Aggregate
------------------------
Type:   Continuous
Max:    1,145
p75:    1,029
Mean:   972
Median: 968
p25:    932
Min:    801

-------------------------------------
Column: Concrete Compressive Strength
-------------------------------------
Type:   Continuous
Max:    82
p75:    46
Mean:   35
Median: 34
p25:    23
Min:    2.3

----------------------
Column: Fine Aggregate
----------------------
Type:   Continuous
Max:    992
p75:    824
Mean:   773
Median: 779
p25:    730
Min:    594

---------------
Column: Fly Ash
---------------
Type:   Continuous
Max:    200
p75:    118
Mean:   54
Median: 0.0
p25:    0.0
Min:    0.0

------------------------
Column: Superplasticizer
------------------------
Type:   Continuous
Max:    32
p75:    10
Mean:   6.2
Median: 6.3
p25:    0.0
Min:    0.0

-------------
Column: Water
-------------
Type:   Continuous
Max:    247
p75:    192
Mean:   181
Median: 185
p25:    164
Min:    121

Data Valuation

In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable RMSE
0  1.00                  -7.98e-01                             5.35e-01

Automatic (Model-Free) Variable Selection

In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
Selection Order  Variable            Running Achievable R-Squared  Running Achievable RMSE
0                No Variable         0.00                          1.67e+01
1                Age                 0.55                          1.12e+01
2                Cement              0.69                          9.35
3                Superplasticizer    0.80                          7.49
4                Blast Furnace Slag  0.85                          6.37
5                Water               0.85                          6.37
6                Fine Aggregate      0.90                          5.23
7                Coarse Aggregate    0.90                          5.23
8                Fly Ash             1.00                          0.53
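
The two running columns in the table above appear to be linked by the standard identity R-Squared = 1 - MSE / Var(y), so that RMSE = std(y) * sqrt(1 - R-Squared). The sketch below is not part of the kxy API; the helper names `rmse_from_r2` and `r2_from_rmse` are hypothetical, and it assumes the "No Variable" RMSE (1.67e+01) can stand in for the standard deviation of the target. Under that assumption it recomputes a few rows from the displayed R-Squared values; small differences from the table are expected because those values are rounded to two decimals.

```python
import math


def rmse_from_r2(r_squared: float, target_std: float) -> float:
    """RMSE implied by an R-squared, given the target's standard deviation.

    Uses the standard identity R^2 = 1 - MSE / Var(y).
    """
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must lie in [0, 1]")
    return target_std * math.sqrt(1.0 - r_squared)


def r2_from_rmse(rmse: float, target_std: float) -> float:
    """Inverse of rmse_from_r2: the R-squared implied by an RMSE."""
    return 1.0 - (rmse / target_std) ** 2


# Assumption: the "No Variable" row (RMSE of predicting the mean)
# approximates the standard deviation of the target.
baseline_rmse = 16.7

# Rounded R-squared values copied from the variable selection table.
rows = [("Age", 0.55), ("Cement", 0.69), ("Superplasticizer", 0.80)]
for name, r2 in rows:
    print(f"{name}: implied RMSE = {rmse_from_r2(r2, baseline_rmse):.2f}")

# The final RMSE of 0.535 implies an R-squared just below 1,
# which is displayed as 1.00 after rounding.
print(f"Implied final R-squared = {r2_from_rmse(0.535, baseline_rmse):.4f}")
```

Read this way, an R-Squared displayed as 1.00 does not mean zero error: the unrounded value is slightly below 1, which is why the achievable RMSE stays positive.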