Water Quality (Kaggle, Classification, n=3276, d=9, 2 classes)
Loading The Data
In [1]:
from kxy_datasets.kaggle_classifications import WaterQuality # pip install kxy_datasets
In [2]:
dataset = WaterQuality()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data
-------------------
Column: Chloramines
-------------------
Type: Continuous
Max: 13
p75: 8.1
Mean: 7.1
Median: 7.1
p25: 6.1
Min: 0.4
--------------------
Column: Conductivity
--------------------
Type: Continuous
Max: 753
p75: 481
Mean: 426
Median: 421
p25: 365
Min: 181
----------------
Column: Hardness
----------------
Type: Continuous
Max: 323
p75: 216
Mean: 196
Median: 196
p25: 176
Min: 47
----------------------
Column: Organic_carbon
----------------------
Type: Continuous
Max: 28
p75: 16
Mean: 14
Median: 14
p25: 12
Min: 2.2
------------------
Column: Potability
------------------
Type: Continuous
Max: 1.0
p75: 1.0
Mean: 0.4
Median: 0.0
p25: 0.0
Min: 0.0
--------------
Column: Solids
--------------
Type: Continuous
Max: 61,227
p75: 27,332
Mean: 22,014
Median: 20,927
p25: 15,666
Min: 320
---------------
Column: Sulfate
---------------
Type: Continuous
Max: 481
p75: 359
Mean: 333
Median: 333
p25: 307
Min: 129
-----------------------
Column: Trihalomethanes
-----------------------
Type: Continuous
Max: 124
p75: 77
Mean: 66
Median: 66
p25: 55
Min: 0.7
-----------------
Column: Turbidity
-----------------
Type: Continuous
Max: 6.7
p75: 4.5
Mean: 4.0
Median: 4.0
p25: 3.4
Min: 1.4
----------
Column: ph
----------
Type: Continuous
Max: 13
p75: 8.1
Mean: 7.1
Median: 7.0
p25: 6.1
Min: 0.0
Data Valuation
In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
| | Achievable R-Squared | Achievable Log-Likelihood Per Sample | Achievable Accuracy |
|---|---|---|---|
| 0 | 0.04 | -6.50e-01 | 0.65 |
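To put these achievable figures in context, it can help to compare them with what a predictor that ignores every input would score. The sketch below is an illustration only and is not part of `kxy`. It assumes a positive-class rate of 0.39 (inferred from the 0.61 "No Variable" accuracy in the variable selection table further down, since the summary rounds the `Potability` mean to 0.4), and it assumes the log-likelihood is measured in nats. The helper name `baseline_scores` is hypothetical.

```python
import math
from typing import Tuple


def baseline_scores(positive_rate: float) -> Tuple[float, float]:
    """Scores of a constant predictor on a binary target.

    Returns (accuracy, mean log-likelihood per sample in nats) for a
    predictor that always outputs the class frequencies.
    """
    p = positive_rate
    # Always guessing the majority class is right max(p, 1 - p) of the time.
    accuracy = max(p, 1.0 - p)
    # Expected log-likelihood of predicting the class frequencies,
    # i.e. the negative entropy of a Bernoulli(p) variable.
    log_likelihood = p * math.log(p) + (1.0 - p) * math.log(1.0 - p)
    return accuracy, log_likelihood


# Assumption: roughly 39% of samples are labelled potable.
acc, ll = baseline_scores(0.39)
print(f"baseline accuracy: {acc:.2f}")
print(f"baseline log-likelihood per sample: {ll:.2f}")
```

Under these assumptions the constant predictor reaches about 0.61 accuracy and about -0.67 log-likelihood per sample, so the achievable 0.65 and -0.65 reported above would be only a small improvement over ignoring the inputs.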
Automatic (Model-Free) Variable Selection
In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
| Selection Order | Variable | Running Achievable R-Squared | Running Achievable Accuracy |
|---|---|---|---|
| 0 | No Variable | 0.00 | 0.61 |
| 1 | ph | 0.01 | 0.62 |
| 2 | Sulfate | 0.04 | 0.65 |
| 3 | Chloramines | 0.04 | 0.65 |
| 4 | Hardness | 0.04 | 0.65 |
| 5 | Solids | 0.04 | 0.65 |
| 6 | Organic_carbon | 0.04 | 0.65 |
| 7 | Conductivity | 0.04 | 0.65 |
| 8 | Turbidity | 0.04 | 0.65 |
| 9 | Trihalomethanes | 0.04 | 0.65 |
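One way to read this table is to look at how much each variable adds to the running achievable accuracy over the row before it. The sketch below does that arithmetic on the values copied from the table. Because the table rounds to two decimals, the gains are approximate, and the helper name `marginal_gains` is hypothetical rather than part of `kxy`.

```python
# Running achievable accuracy, copied from the table above (rounded to 2 decimals).
running_accuracy = [
    ("No Variable", 0.61),
    ("ph", 0.62),
    ("Sulfate", 0.65),
    ("Chloramines", 0.65),
    ("Hardness", 0.65),
    ("Solids", 0.65),
    ("Organic_carbon", 0.65),
    ("Conductivity", 0.65),
    ("Turbidity", 0.65),
    ("Trihalomethanes", 0.65),
]


def marginal_gains(rows):
    """Pair each variable with the accuracy gained over the previous row."""
    gains = []
    for (_, previous), (name, current) in zip(rows, rows[1:]):
        # Round to the table's precision to hide floating-point noise.
        gains.append((name, round(current - previous, 2)))
    return gains


gains = marginal_gains(running_accuracy)
# Variables whose addition moves the rounded running accuracy at all.
useful = [name for name, gain in gains if gain > 0]
print(gains)
print(useful)
```

At the table's precision, only `ph` and `Sulfate` change the running accuracy. The remaining seven variables show no visible gain once those two have been selected, although gains smaller than the rounding step would not appear here.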