White Wine Quality (UCI, Regression, n=4898, d=11)
Loading The Data
In [1]:
from kxy_datasets.uci_regressions import WhiteWineQuality # pip install kxy_datasets
In [2]:
dataset = WhiteWineQuality()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data
---------------
Column: alcohol
---------------
Type: Continuous
Max: 14
p75: 11
Mean: 10
Median: 10
p25: 9.5
Min: 8.0
-----------------
Column: chlorides
-----------------
Type: Continuous
Max: 0.3
p75: 0.1
Mean: 0.0
Median: 0.0
p25: 0.0
Min: 0.0
-------------------
Column: citric acid
-------------------
Type: Continuous
Max: 1.7
p75: 0.4
Mean: 0.3
Median: 0.3
p25: 0.3
Min: 0.0
---------------
Column: density
---------------
Type: Continuous
Max: 1.0
p75: 1.0
Mean: 1.0
Median: 1.0
p25: 1.0
Min: 1.0
---------------------
Column: fixed acidity
---------------------
Type: Continuous
Max: 14
p75: 7.3
Mean: 6.9
Median: 6.8
p25: 6.3
Min: 3.8
---------------------------
Column: free sulfur dioxide
---------------------------
Type: Continuous
Max: 289
p75: 46
Mean: 35
Median: 34
p25: 23
Min: 2.0
----------
Column: pH
----------
Type: Continuous
Max: 3.8
p75: 3.3
Mean: 3.2
Median: 3.2
p25: 3.1
Min: 2.7
---------------
Column: quality
---------------
Type: Continuous
Max: 9.0
p75: 6.0
Mean: 5.9
Median: 6.0
p25: 5.0
Min: 3.0
----------------------
Column: residual sugar
----------------------
Type: Continuous
Max: 65
p75: 9.9
Mean: 6.4
Median: 5.2
p25: 1.7
Min: 0.6
-----------------
Column: sulphates
-----------------
Type: Continuous
Max: 1.1
p75: 0.6
Mean: 0.5
Median: 0.5
p25: 0.4
Min: 0.2
----------------------------
Column: total sulfur dioxide
----------------------------
Type: Continuous
Max: 440
p75: 167
Mean: 138
Median: 134
p25: 108
Min: 9.0
------------------------
Column: volatile acidity
------------------------
Type: Continuous
Max: 1.1
p75: 0.3
Mean: 0.3
Median: 0.3
p25: 0.2
Min: 0.1
Data Valuation
In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
| | Achievable R-Squared | Achievable Log-Likelihood Per Sample | Achievable RMSE |
|---|---|---|---|
| 0 | 0.33 | -1.41 | 7.27e-01 |
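As a sanity check on these figures, the achievable RMSE can be related to the achievable R-squared. The sketch below assumes the textbook definition of R-squared (one minus the ratio of residual variance to target variance) and takes the RMSE of the "No Variable" row of the variable selection table, 0.89, as the target's standard deviation. It is a consistency check on the rounded numbers shown in this notebook, not a description of how `kxy` computes them; `rmse_from_r2` is a hypothetical helper name.

```python
import math

# Values copied from the tables in this notebook (rounded as displayed).
baseline_rmse = 0.89   # RMSE of the "No Variable" row, used here as the target's std dev
achievable_r2 = 0.33   # achievable R-squared from the data valuation output


def rmse_from_r2(r2, baseline):
    """RMSE implied by an R-squared, given the RMSE of the no-variable baseline.

    Assumes R^2 = 1 - MSE / Var(y), so RMSE = std(y) * sqrt(1 - R^2).
    """
    return baseline * math.sqrt(1.0 - r2)


implied_rmse = rmse_from_r2(achievable_r2, baseline_rmse)
print(f"Implied achievable RMSE: {implied_rmse:.3f}")
```

Because both inputs are rounded to two decimals, the implied value should only be expected to agree with the reported 7.27e-01 up to rounding.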
Automatic (Model-Free) Variable Selection
In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
| Selection Order | Variable | Running Achievable R-Squared | Running Achievable RMSE |
|---|---|---|---|
| 0 | No Variable | 0.00 | 0.89 |
| 1 | alcohol | 0.26 | 0.76 |
| 2 | volatile acidity | 0.29 | 0.75 |
| 3 | free sulfur dioxide | 0.29 | 0.75 |
| 4 | pH | 0.32 | 0.73 |
| 5 | residual sugar | 0.32 | 0.73 |
| 6 | citric acid | 0.32 | 0.73 |
| 7 | density | 0.33 | 0.73 |
| 8 | fixed acidity | 0.33 | 0.73 |
| 9 | sulphates | 0.33 | 0.73 |
| 10 | chlorides | 0.33 | 0.73 |
| 11 | total sulfur dioxide | 0.33 | 0.73 |
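One way to read this table is to keep the shortest prefix of variables whose running achievable R-squared is already close to the final value. The sketch below hard-codes the rounded figures from the table and uses a hypothetical helper, `shortlist`, with an arbitrary tolerance of 0.01; neither is part of the `kxy` API. Since the running values are rounded to two decimals, a displayed gain of 0.00 may hide a small positive contribution.

```python
# (variable, running achievable R-squared) pairs copied from the table above.
selection = [
    ("alcohol", 0.26),
    ("volatile acidity", 0.29),
    ("free sulfur dioxide", 0.29),
    ("pH", 0.32),
    ("residual sugar", 0.32),
    ("citric acid", 0.32),
    ("density", 0.33),
    ("fixed acidity", 0.33),
    ("sulphates", 0.33),
    ("chlorides", 0.33),
    ("total sulfur dioxide", 0.33),
]


def shortlist(selection, tolerance=0.01):
    """Return the shortest prefix of variables whose running R-squared
    is within `tolerance` of the value reached with all variables."""
    final_r2 = selection[-1][1]
    kept = []
    for name, running_r2 in selection:
        kept.append(name)
        # Small slack guards against floating-point error in the subtraction.
        if final_r2 - running_r2 <= tolerance + 1e-9:
            break
    return kept


top_variables = shortlist(selection)
print(top_variables)
```

With the default tolerance the shortlist stops at the first variable whose running R-squared reaches 0.32; setting `tolerance=0.0` keeps variables until the running value first equals the final 0.33.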