Heart Attack (Kaggle, Classification, n=303, d=13, 2 classes)
Loading The Data
In [1]:
from kxy_datasets.kaggle_classifications import HeartAttack # pip install kxy_datasets
In [2]:
dataset = HeartAttack()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data
-----------
Column: age
-----------
Type: Continuous
Max: 77
p75: 61
Mean: 54
Median: 55
p25: 47
Min: 29
-----------
Column: caa
-----------
Type: Continuous
Max: 4.0
p75: 1.0
Mean: 0.7
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: chol
------------
Type: Continuous
Max: 564
p75: 274
Mean: 246
Median: 240
p25: 211
Min: 126
----------
Column: cp
----------
Type: Continuous
Max: 3.0
p75: 2.0
Mean: 1.0
Median: 1.0
p25: 0.0
Min: 0.0
------------
Column: exng
------------
Type: Continuous
Max: 1.0
p75: 1.0
Mean: 0.3
Median: 0.0
p25: 0.0
Min: 0.0
-----------
Column: fbs
-----------
Type: Continuous
Max: 1.0
p75: 0.0
Mean: 0.1
Median: 0.0
p25: 0.0
Min: 0.0
---------------
Column: oldpeak
---------------
Type: Continuous
Max: 6.2
p75: 1.6
Mean: 1.0
Median: 0.8
p25: 0.0
Min: 0.0
--------------
Column: output
--------------
Type: Continuous
Max: 1.0
p75: 1.0
Mean: 0.5
Median: 1.0
p25: 0.0
Min: 0.0
---------------
Column: restecg
---------------
Type: Continuous
Max: 2.0
p75: 1.0
Mean: 0.5
Median: 1.0
p25: 0.0
Min: 0.0
-----------
Column: sex
-----------
Type: Continuous
Max: 1.0
p75: 1.0
Mean: 0.7
Median: 1.0
p25: 0.0
Min: 0.0
-----------
Column: slp
-----------
Type: Continuous
Max: 2.0
p75: 2.0
Mean: 1.4
Median: 1.0
p25: 1.0
Min: 0.0
----------------
Column: thalachh
----------------
Type: Continuous
Max: 202
p75: 166
Mean: 149
Median: 153
p25: 133
Min: 71
-------------
Column: thall
-------------
Type: Continuous
Max: 3.0
p75: 3.0
Mean: 2.3
Median: 2.0
p25: 2.0
Min: 0.0
--------------
Column: trtbps
--------------
Type: Continuous
Max: 200
p75: 140
Mean: 131
Median: 130
p25: 120
Min: 94
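The summary above reports six statistics per column. As a rough cross-check that does not rely on the `kxy` accessor, similar numbers can be computed with plain pandas. This is only a sketch under stated assumptions: `summarize` is a hypothetical helper, the `toy` frame holds made-up values rather than rows of the Heart Attack dataset, and pandas' default linear-interpolation quantiles may not match whatever `df.kxy.describe()` uses internally.

```python
import pandas as pd

def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return max, p75, mean, median, p25 and min for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    return pd.DataFrame({
        "Max": numeric.max(),
        "p75": numeric.quantile(0.75),
        "Mean": numeric.mean(),
        "Median": numeric.median(),
        "p25": numeric.quantile(0.25),
        "Min": numeric.min(),
    })

# Made-up values for illustration only (not taken from the real dataset).
toy = pd.DataFrame({"age": [29, 47, 55, 61, 77], "sex": [0, 1, 1, 1, 0]})
summary = summarize(toy)
print(summary)
```

Calling `summarize(df)` on the dataframe loaded earlier should give one row per column, in the same order of statistics as the listing above.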
Data Valuation
In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
| | Achievable R-Squared | Achievable Log-Likelihood Per Sample | Achievable Accuracy |
|---|---|---|---|
| 0 | 0.64 | -1.85e-01 | 0.95 |
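A per-sample log-likelihood can be hard to read on its own. One way to interpret it, assuming the value is expressed in nats (natural logarithm), which the output does not state, is to exponentiate it: the result is the geometric mean of the probability a model assigns to the true class. The variable names below are illustrative and not part of the `kxy` API.

```python
import math

# Value copied from the "Achievable Log-Likelihood Per Sample" column above.
achievable_ll_per_sample = -1.85e-01

# Assumption: the log-likelihood is in nats, so exp() inverts it.
# The result is the geometric-mean probability assigned to the true label.
geo_mean_prob = math.exp(achievable_ll_per_sample)
print(f"{geo_mean_prob:.3f}")
```

Under that assumption, the best achievable model would assign a geometric-mean probability of about 0.83 to the correct label.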
Automatic (Model-Free) Variable Selection
In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
| Selection Order | Variable | Running Achievable R-Squared | Running Achievable Accuracy |
|---|---|---|---|
| 0 | No Variable | 0.00 | 0.54 |
| 1 | thall | 0.25 | 0.76 |
| 2 | caa | 0.39 | 0.84 |
| 3 | oldpeak | 0.39 | 0.84 |
| 4 | cp | 0.48 | 0.88 |
| 5 | slp | 0.55 | 0.92 |
| 6 | restecg | 0.60 | 0.94 |
| 7 | sex | 0.64 | 0.95 |
| 8 | exng | 0.64 | 0.95 |
| 9 | fbs | 0.64 | 0.95 |
| 10 | thalachh | 0.64 | 0.95 |
| 11 | trtbps | 0.64 | 0.95 |
| 12 | chol | 0.64 | 0.95 |
| 13 | age | 0.64 | 0.95 |
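The running columns are cumulative, so the contribution of each variable is the difference between its row and the previous one. The sketch below works that out from the figures in the table and finds the shortest prefix of variables that already reaches the final achievable accuracy. `marginal_gains` and `smallest_sufficient_prefix` are hypothetical helpers written for this illustration, not `kxy` functions, and because the table is rounded to two decimals a gain of 0.0 may hide a small positive improvement.

```python
# (selection order, variable, running R-squared, running accuracy),
# copied from the variable selection table above.
rows = [
    (0, "No Variable", 0.00, 0.54),
    (1, "thall", 0.25, 0.76),
    (2, "caa", 0.39, 0.84),
    (3, "oldpeak", 0.39, 0.84),
    (4, "cp", 0.48, 0.88),
    (5, "slp", 0.55, 0.92),
    (6, "restecg", 0.60, 0.94),
    (7, "sex", 0.64, 0.95),
    (8, "exng", 0.64, 0.95),
    (9, "fbs", 0.64, 0.95),
    (10, "thalachh", 0.64, 0.95),
    (11, "trtbps", 0.64, 0.95),
    (12, "chol", 0.64, 0.95),
    (13, "age", 0.64, 0.95),
]

def marginal_gains(rows):
    """Accuracy gained by each variable relative to the previous row."""
    return [(cur[1], round(cur[3] - prev[3], 2))
            for prev, cur in zip(rows, rows[1:])]

def smallest_sufficient_prefix(rows):
    """Variables, in selection order, up to the first row reaching the final accuracy."""
    final_acc = rows[-1][3]
    for order, _, _, acc in rows:
        if acc >= final_acc:
            return [r[1] for r in rows[1:order + 1]]
    return [r[1] for r in rows[1:]]

gains = marginal_gains(rows)
top = smallest_sufficient_prefix(rows)
print(gains)
print(top)
```

On these figures the first seven variables (through `sex`) already reach the final achievable accuracy of 0.95, and the remaining six add nothing visible at two-decimal precision.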