Concrete (UCI, Regression, n=1030, d=8)

Loading The Data

In [1]:

import kxy # pip install kxy; importing it registers the df.kxy accessor used below
from kxy_datasets.uci_regressions import Concrete # pip install kxy_datasets

In [2]:

dataset = Concrete()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'

In [3]:

df.kxy.describe() # Visualize a summary of the data


-----------
Column: Age
-----------
Type:   Continuous
Max:    365
p75:    56
Mean:   45
Median: 28
p25:    7.0
Min:    1.0

--------------------------
Column: Blast Furnace Slag
--------------------------
Type:   Continuous
Max:    359
p75:    142
Mean:   73
Median: 22
p25:    0.0
Min:    0.0

--------------
Column: Cement
--------------
Type:   Continuous
Max:    540
p75:    350
Mean:   281
Median: 272
p25:    192
Min:    102

------------------------
Column: Coarse Aggregate
------------------------
Type:   Continuous
Max:    1,145
p75:    1,029
Mean:   972
Median: 968
p25:    932
Min:    801

-------------------------------------
Column: Concrete Compressive Strength
-------------------------------------
Type:   Continuous
Max:    82
p75:    46
Mean:   35
Median: 34
p25:    23
Min:    2.3

----------------------
Column: Fine Aggregate
----------------------
Type:   Continuous
Max:    992
p75:    824
Mean:   773
Median: 779
p25:    730
Min:    594

---------------
Column: Fly Ash
---------------
Type:   Continuous
Max:    200
p75:    118
Mean:   54
Median: 0.0
p25:    0.0
Min:    0.0

------------------------
Column: Superplasticizer
------------------------
Type:   Continuous
Max:    32
p75:    10
Mean:   6.2
Median: 6.3
p25:    0.0
Min:    0.0

-------------
Column: Water
-------------
Type:   Continuous
Max:    247
p75:    192
Mean:   181
Median: 185
p25:    164
Min:    121
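
For readers without the kxy accessor, a comparable per-column summary can be sketched with plain pandas. The helper name `summarize` is hypothetical, and pandas' default quantile interpolation is assumed to be close enough to whatever `df.kxy.describe()` uses; small differences in the percentiles are possible.

```python
import pandas as pd


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return Max, p75, Mean, Median, p25 and Min for each numeric column.

    Hypothetical helper: a plain-pandas approximation of the summary above,
    not the kxy implementation.
    """
    numeric = frame.select_dtypes(include="number")
    # describe() yields count/mean/std/min/25%/50%/75%/max; transpose so that
    # each row is one column of the original frame.
    stats = numeric.describe(percentiles=[0.25, 0.5, 0.75]).T
    # Keep only the statistics printed above, in the same order.
    stats = stats[["max", "75%", "mean", "50%", "25%", "min"]]
    stats.columns = ["Max", "p75", "Mean", "Median", "p25", "Min"]
    # The summary above lists columns alphabetically.
    return stats.sort_index()
```

Calling `summarize(df)` on the dataframe loaded in the earlier cell should give one row per column, in the same alphabetical order as the printed summary.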

Data Valuation

In [4]:

df.kxy.data_valuation(y_column, problem_type=problem_type)

Out[4]:

   Achievable R-Squared   Achievable Log-Likelihood Per Sample   Achievable RMSE
0  1.00                   -7.98e-01                              5.35e-01

Automatic (Model-Free) Variable Selection

In [5]:

df.kxy.variable_selection(y_column, problem_type=problem_type)

Out[5]:

Selection Order   Variable             Running Achievable R-Squared   Running Achievable RMSE
0                 No Variable          0.00                           1.67e+01
1                 Age                  0.55                           1.12e+01
2                 Cement               0.69                           9.35
3                 Superplasticizer     0.80                           7.49
4                 Blast Furnace Slag   0.85                           6.37
5                 Water                0.85                           6.37
6                 Fine Aggregate       0.90                           5.23
7                 Coarse Aggregate     0.90                           5.23
8                 Fly Ash              1.00                           0.53
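
The two running columns appear to be linked by the usual identity RMSE = baseline RMSE × sqrt(1 − R²), where the baseline is the "No Variable" row. The sketch below copies the values from Out[5], recomputes the RMSE implied by each R², and lists the marginal R² gain of each variable. Whether kxy computes the RMSE column this way is an assumption. Because R² is printed to two decimals, the implied values only roughly match the reported ones, and the last row (R² shown as 1.00) implies an RMSE of 0 rather than 0.53.

```python
import math

# (variable, running achievable R-squared, running achievable RMSE), copied from Out[5].
selection = [
    ("No Variable", 0.00, 16.7),
    ("Age", 0.55, 11.2),
    ("Cement", 0.69, 9.35),
    ("Superplasticizer", 0.80, 7.49),
    ("Blast Furnace Slag", 0.85, 6.37),
    ("Water", 0.85, 6.37),
    ("Fine Aggregate", 0.90, 5.23),
    ("Coarse Aggregate", 0.90, 5.23),
    ("Fly Ash", 1.00, 0.53),
]

# RMSE achievable with no explanatory variable, used as the baseline.
baseline_rmse = selection[0][2]


def implied_rmse(r2, baseline=baseline_rmse):
    """RMSE implied by an R-squared, assuming RMSE = baseline * sqrt(1 - R^2)."""
    return baseline * math.sqrt(max(0.0, 1.0 - r2))


rows = []
previous_r2 = 0.0
for name, r2, reported in selection:
    gain = round(r2 - previous_r2, 2)  # marginal R-squared added by this variable
    rows.append((name, gain, round(implied_rmse(r2), 2), reported))
    previous_r2 = r2

for name, gain, implied, reported in rows:
    print(f"{name:<20} gain={gain:+.2f}  implied RMSE={implied:6.2f}  reported={reported:6.2f}")
```

Read this way, Water and Coarse Aggregate add no achievable R² at the printed precision once the variables selected before them are in place, while Age alone accounts for 0.55.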