Real Estate (UCI, Regression, n=414, d=6)
Loading The Data
In [1]:
from kxy_datasets.uci_regressions import RealEstate # pip install kxy_datasets
In [2]:
dataset = RealEstate()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data
---------------------------
Column: X1 transaction date
---------------------------
Type: Continuous
Max: 2,013
p75: 2,013
Mean: 2,013
Median: 2,013
p25: 2,012
Min: 2,012
--------------------
Column: X2 house age
--------------------
Type: Continuous
Max: 43
p75: 28
Mean: 17
Median: 16
p25: 9.0
Min: 0.0
----------------------------------------------
Column: X3 distance to the nearest MRT station
----------------------------------------------
Type: Continuous
Max: 6,488
p75: 1,454
Mean: 1,083
Median: 492
p25: 289
Min: 23
---------------------------------------
Column: X4 number of convenience stores
---------------------------------------
Type: Continuous
Max: 10
p75: 6.0
Mean: 4.1
Median: 4.0
p25: 1.0
Min: 0.0
-------------------
Column: X5 latitude
-------------------
Type: Continuous
Max: 25
p75: 24
Mean: 24
Median: 24
p25: 24
Min: 24
--------------------
Column: X6 longitude
--------------------
Type: Continuous
Max: 121
p75: 121
Mean: 121
Median: 121
p25: 121
Min: 121
----------------------------------
Column: Y house price of unit area
----------------------------------
Type: Continuous
Max: 117
p75: 46
Mean: 37
Median: 38
p25: 27
Min: 7.6
Data Valuation
In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
| | Achievable R-Squared | Achievable Log-Likelihood Per Sample | Achievable RMSE |
|---|---|---|---|
| 0 | 0.80 | -3.23 | 6.05 |
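As a sanity check on how the R-squared and RMSE columns relate, the sketch below applies the textbook identity R² = 1 - MSE / Var(y), so that RMSE = baseline RMSE × sqrt(1 - R²). It assumes the reported figures follow this identity, which the notebook does not state, and it takes the baseline from the "No Variable" row of the variable selection table (`1.36e+01`, i.e. 13.6). The helper names `rmse_from_r2` and `r2_from_rmse` are hypothetical and are not part of the `kxy` package.

```python
import math


def rmse_from_r2(r2: float, baseline_rmse: float) -> float:
    """RMSE implied by an R-squared, given the RMSE of the constant (mean) predictor.

    Assumes the textbook identity R^2 = 1 - MSE / Var(y).
    """
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie in [0, 1]")
    return baseline_rmse * math.sqrt(1.0 - r2)


def r2_from_rmse(rmse: float, baseline_rmse: float) -> float:
    """Inverse of rmse_from_r2: R-squared implied by an RMSE and a baseline RMSE."""
    return 1.0 - (rmse / baseline_rmse) ** 2


# Baseline taken from the "No Variable" row of the variable selection table.
BASELINE_RMSE = 13.6

# R-squared implied by the two RMSE levels reported in this notebook.
print(round(r2_from_rmse(6.05, BASELINE_RMSE), 2))  # 0.8
print(round(r2_from_rmse(7.49, BASELINE_RMSE), 2))  # 0.7
```

Both implied values match the rounded R-squared figures in the tables, which suggests (but does not prove) that the two columns are linked this way.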
Automatic (Model-Free) Variable Selection
In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
| Selection Order | Variable | Running Achievable R-Squared | Running Achievable RMSE |
|---|---|---|---|
| 0 | No Variable | 0.00 | 1.36e+01 |
| 1 | X3 distance to the nearest MRT station | 0.70 | 7.49 |
| 2 | X5 latitude | 0.70 | 7.49 |
| 3 | X2 house age | 0.70 | 7.49 |
| 4 | X1 transaction date | 0.80 | 6.05 |
| 5 | X4 number of convenience stores | 0.80 | 6.05 |
| 6 | X6 longitude | 0.80 | 6.05 |
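One way to read this table is to look at the marginal gain in running achievable R-squared at each step. The sketch below is a minimal illustration in plain Python, with the rows copied by hand from the output above. The helper names `marginal_gains` and `select_variables` and the `min_gain` threshold of 0.01 are assumptions made for illustration and are not part of the `kxy` API.

```python
# Rows copied from the variable selection output: (variable, running R^2, running RMSE).
ROWS = [
    ("No Variable", 0.00, 13.6),
    ("X3 distance to the nearest MRT station", 0.70, 7.49),
    ("X5 latitude", 0.70, 7.49),
    ("X2 house age", 0.70, 7.49),
    ("X1 transaction date", 0.80, 6.05),
    ("X4 number of convenience stores", 0.80, 6.05),
    ("X6 longitude", 0.80, 6.05),
]


def marginal_gains(rows):
    """Return (variable, gain in running R^2 over the previous row) for each step."""
    gains = []
    for (_, prev_r2, _), (name, r2, _) in zip(rows, rows[1:]):
        gains.append((name, r2 - prev_r2))
    return gains


def select_variables(rows, min_gain=0.01):
    """Keep variables whose marginal R^2 gain reaches min_gain (an arbitrary cutoff)."""
    return [name for name, gain in marginal_gains(rows) if gain >= min_gain]


selected = select_variables(ROWS)
print(selected)
# ['X3 distance to the nearest MRT station', 'X1 transaction date']
```

Two caveats apply. The table is rounded to two decimals, so gains below roughly 0.005 are invisible here. Each gain is also conditional on the variables selected before it, so the table alone does not guarantee that X3 and X1 on their own reach the full 0.80.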