Facebook Comments (UCI, Regression, n=209074, d=53)
Loading The Data
In [1]:
import kxy # Registers the df.kxy pandas accessor used below (pip install kxy)
from kxy_datasets.uci_regressions import FacebookComments # pip install kxy_datasets
In [2]:
dataset = FacebookComments()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data
-----------
Column: x_0
-----------
Type: Continuous
Max: 486,972,297
p75: 1,341,299
Mean: 1,451,183
Median: 313,452
p25: 43,331
Min: 36
-----------
Column: x_1
-----------
Type: Continuous
Max: 1,100,558
p75: 99
Mean: 4,748
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: x_10
------------
Type: Continuous
Max: 2,783
p75: 599
Mean: 392
Median: 193
p25: 41
Min: 0.0
------------
Column: x_11
------------
Type: Continuous
Max: 1,672
p75: 29
Mean: 24
Median: 9.4
p25: 2.0
Min: 0.0
------------
Column: x_12
------------
Type: Continuous
Max: 1,672
p75: 8.0
Mean: 8.8
Median: 3.0
p25: 0.0
Min: 0.0
------------
Column: x_13
------------
Type: Continuous
Max: 1,101
p75: 66
Mean: 43
Median: 20
p25: 5.0
Min: 0.0
------------
Column: x_14
------------
Type: Continuous
Max: 2,218
p75: 0.0
Mean: 0.1
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: x_15
------------
Type: Continuous
Max: 2,455
p75: 565
Mean: 376
Median: 197
p25: 38
Min: 0.0
------------
Column: x_16
------------
Type: Continuous
Max: 2,218
p75: 26
Mean: 20
Median: 7.9
p25: 2.0
Min: 0.0
------------
Column: x_17
------------
Type: Continuous
Max: 2,218
p75: 5.0
Mean: 4.8
Median: 1.0
p25: 0.0
Min: 0.0
------------
Column: x_18
------------
Type: Continuous
Max: 912
p75: 59
Mean: 41
Median: 19
p25: 4.7
Min: 0.0
------------
Column: x_19
------------
Type: Continuous
Max: 2,405
p75: 0.0
Mean: 0.9
Median: 0.0
p25: 0.0
Min: 0.0
-----------
Column: x_2
-----------
Type: Continuous
Max: 6,784,263
p75: 54,849
Mean: 56,123
Median: 9,484
p25: 770
Min: 0.0
------------
Column: x_20
------------
Type: Continuous
Max: 2,783
p75: 705
Mean: 447
Median: 233
p25: 44
Min: 0.0
------------
Column: x_21
------------
Type: Continuous
Max: 2,405
p75: 74
Mean: 54
Median: 23
p25: 5.3
Min: 0.0
------------
Column: x_22
------------
Type: Continuous
Max: 2,405
p75: 40
Mean: 34
Median: 12
p25: 2.0
Min: 0.0
------------
Column: x_23
------------
Type: Continuous
Max: 1,101
p75: 100
Mean: 66
Median: 31
p25: 7.7
Min: 0.0
------------
Column: x_24
------------
Type: Continuous
Max: 1,660
p75: -33.0
Mean: -320.6
Median: -163.0
p25: -455.0
Min: -2119.0
------------
Column: x_25
------------
Type: Continuous
Max: 2,783
p75: 594
Mean: 387
Median: 182
p25: 41
Min: -1861.0
------------
Column: x_26
------------
Type: Continuous
Max: 1,660
p75: 1.7
Mean: 4.1
Median: 0.3
p25: -0.0
Min: -1861.0
------------
Column: x_27
------------
Type: Continuous
Max: 1,660
p75: 0.0
Mean: -0.6
Median: 0.0
p25: -2.0
Min: -1861.0
------------
Column: x_28
------------
Type: Continuous
Max: 1,386
p75: 87
Mean: 59
Median: 27
p25: 6.9
Min: 0.0
------------
Column: x_29
------------
Type: Continuous
Max: 2,858
p75: 47
Mean: 58
Median: 11
p25: 2.0
Min: 0.0
-----------
Column: x_3
-----------
Type: Continuous
Max: 107
p75: 32
Mean: 24
Median: 18
p25: 9.0
Min: 1.0
------------
Column: x_30
------------
Type: Continuous
Max: 2,783
p75: 12
Mean: 24
Median: 2.0
p25: 0.0
Min: 0.0
------------
Column: x_31
------------
Type: Continuous
Max: 2,455
p75: 8.0
Mean: 20
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: x_32
------------
Type: Continuous
Max: 2,783
p75: 45
Mean: 54
Median: 11
p25: 2.0
Min: 0.0
------------
Column: x_33
------------
Type: Continuous
Max: 2,783
p75: 4.0
Mean: 4.1
Median: 0.0
p25: -6.0
Min: -2119.0
------------
Column: x_34
------------
Type: Continuous
Max: 72
p75: 53
Mean: 34
Median: 34
p25: 16
Min: 0.0
------------
Column: x_35
------------
Type: Continuous
Max: 21,480
p75: 172
Mean: 163
Median: 97
p25: 38
Min: 0.0
------------
Column: x_36
------------
Type: Continuous
Max: 144,860
p75: 62
Mean: 119
Median: 14
p25: 2.0
Min: 1.0
------------
Column: x_37
------------
Type: Continuous
Max: 0.0
p75: 0.0
Mean: 0.0
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: x_38
------------
Type: Continuous
Max: 24
p75: 24
Mean: 23
Median: 24
p25: 24
Min: 0.0
------------
Column: x_39
------------
Type: Continuous
Max: 1.0
p75: 0.0
Mean: 0.1
Median: 0.0
p25: 0.0
Min: 0.0
-----------
Column: x_4
-----------
Type: Continuous
Max: 2,575
p75: 0.0
Mean: 0.9
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: x_40
------------
Type: Continuous
Max: 1.0
p75: 0.0
Mean: 0.1
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: x_41
------------
Type: Continuous
Max: 1.0
p75: 0.0
Mean: 0.1
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: x_42
------------
Type: Continuous
Max: 1.0
p75: 0.0
Mean: 0.1
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: x_43
------------
Type: Continuous
Max: 1.0
p75: 0.0
Mean: 0.1
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: x_44
------------
Type: Continuous
Max: 1.0
p75: 0.0
Mean: 0.2
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: x_45
------------
Type: Continuous
Max: 1.0
p75: 0.0
Mean: 0.1
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: x_46
------------
Type: Continuous
Max: 1.0
p75: 0.0
Mean: 0.1
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: x_47
------------
Type: Continuous
Max: 1.0
p75: 0.0
Mean: 0.1
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: x_48
------------
Type: Continuous
Max: 1.0
p75: 0.0
Mean: 0.1
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: x_49
------------
Type: Continuous
Max: 1.0
p75: 0.0
Mean: 0.1
Median: 0.0
p25: 0.0
Min: 0.0
-----------
Column: x_5
-----------
Type: Continuous
Max: 2,858
p75: 803
Mean: 497
Median: 258
p25: 50
Min: 0.0
------------
Column: x_50
------------
Type: Continuous
Max: 1.0
p75: 0.0
Mean: 0.2
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: x_51
------------
Type: Continuous
Max: 1.0
p75: 0.0
Mean: 0.1
Median: 0.0
p25: 0.0
Min: 0.0
------------
Column: x_52
------------
Type: Continuous
Max: 1.0
p75: 0.0
Mean: 0.1
Median: 0.0
p25: 0.0
Min: 0.0
-----------
Column: x_6
-----------
Type: Continuous
Max: 2,575
p75: 78
Mean: 58
Median: 24
p25: 5.6
Min: 0.0
-----------
Column: x_7
-----------
Type: Continuous
Max: 2,575
p75: 42
Mean: 36
Median: 12
p25: 2.0
Min: 0.0
-----------
Column: x_8
-----------
Type: Continuous
Max: 1,101
p75: 107
Mean: 71
Median: 36
p25: 8.1
Min: 0.0
-----------
Column: x_9
-----------
Type: Continuous
Max: 1,672
p75: 0.0
Mean: 0.3
Median: 0.0
p25: 0.0
Min: 0.0
---------
Column: y
---------
Type: Continuous
Max: 2,412
p75: 3.0
Mean: 8.2
Median: 0.0
p25: 0.0
Min: 0.0
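The summary above lists Max, p75, Mean, Median, p25, and Min for each column, and it shows that x_37 is constant (its Min and Max are both 0.0). As a rough sketch, the same statistics can be reproduced with plain pandas, and constant columns can be flagged before modeling. The helper names `summarize` and `constant_columns` and the toy dataframe are hypothetical, introduced only for illustration; this is not how `df.kxy.describe()` is implemented.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return Max/p75/Mean/Median/p25/Min for each numeric column (one row per column)."""
    stats = df.describe(percentiles=[0.25, 0.5, 0.75]).T
    stats = stats[["max", "75%", "mean", "50%", "25%", "min"]]
    return stats.rename(columns={
        "max": "Max", "75%": "p75", "mean": "Mean",
        "50%": "Median", "25%": "p25", "min": "Min",
    })


def constant_columns(df: pd.DataFrame) -> list:
    """Return the columns whose minimum equals their maximum."""
    return [c for c in df.columns if df[c].min() == df[c].max()]


# Synthetic stand-in data (hypothetical, NOT the Facebook Comments dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x_a": rng.exponential(scale=10.0, size=1000),
    "x_b": np.zeros(1000),  # constant column, analogous to x_37 above
    "y": rng.poisson(lam=3.0, size=1000).astype(float),
})

summary = summarize(toy)
print(summary.round(2))
print("Constant columns:", constant_columns(toy))
```

A constant column carries no information about the target, so dropping it before model fitting is usually harmless.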
Data Valuation
In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
| | Achievable R-Squared | Achievable Log-Likelihood Per Sample | Achievable RMSE |
|---|---|---|---|
| 0 | 0.98 | -7.48 | 6.46 |
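The achievable R-Squared and achievable RMSE can be cross-checked against each other. The sketch below assumes the textbook identity R² = 1 - (RMSE / std(y))², and it takes the RMSE reported for "No Variable" in the variable selection table (4.22e+01, i.e. 42.2) as the standard deviation of y. Whether kxy derives its reported RMSE this way is an assumption; the function names are hypothetical.

```python
import math


def r2_from_rmse(rmse: float, y_std: float) -> float:
    """R-squared implied by an RMSE, given the target's standard deviation."""
    return 1.0 - (rmse / y_std) ** 2


def rmse_from_r2(r2: float, y_std: float) -> float:
    """RMSE implied by an R-squared, given the target's standard deviation."""
    return y_std * math.sqrt(1.0 - r2)


Y_STD = 42.2            # RMSE with no explanatory variable (assumed equal to std(y))
ACHIEVABLE_RMSE = 6.46  # From the data valuation output

implied_r2 = r2_from_rmse(ACHIEVABLE_RMSE, Y_STD)
print(f"Implied R-squared: {implied_r2:.4f}")
print(f"Rounded to two decimals: {round(implied_r2, 2)}")
```

Under these assumptions the implied R-squared rounds to the reported 0.98. Going the other way from the rounded 0.98 gives a slightly smaller RMSE than 6.46, because the two-decimal rounding of R-squared discards precision.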
Automatic (Model-Free) Variable Selection
In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
Selection Order | Variable | Running Achievable R-Squared | Running Achievable RMSE |
---|---|---|---|
0 | No Variable | 0.00 | 42.2 |
1 | x_30 | 0.57 | 27.8 |
2 | x_34 | 0.67 | 24.1 |
3 | x_31 | 0.67 | 24.1 |
4 | x_17 | 0.67 | 24.1 |
5 | x_38 | 0.67 | 24.1 |
6 | x_33 | 0.67 | 24.1 |
7 | x_29 | 0.78 | 20.0 |
8 | x_36 | 0.78 | 20.0 |
9 | x_24 | 0.78 | 20.0 |
10 | x_12 | 0.78 | 20.0 |
11 | x_46 | 0.89 | 13.7 |
12 | x_1 | 0.95 | 9.41 |
13 | x_15 | 0.98 | 6.46 |
14 | x_10 | 0.98 | 6.46 |
15 | x_2 | 0.98 | 6.46 |
16 | x_26 | 0.98 | 6.46 |
17 | x_52 | 0.98 | 6.46 |
18 | x_22 | 0.98 | 6.46 |
19 | x_7 | 0.98 | 6.46 |
20 | x_23 | 0.98 | 6.46 |
21 | x_8 | 0.98 | 6.46 |
22 | x_5 | 0.98 | 6.46 |
23 | x_21 | 0.98 | 6.46 |
24 | x_28 | 0.98 | 6.46 |
25 | x_25 | 0.98 | 6.46 |
26 | x_3 | 0.98 | 6.46 |
27 | x_16 | 0.98 | 6.46 |
28 | x_18 | 0.98 | 6.46 |
29 | x_11 | 0.98 | 6.46 |
30 | x_0 | 0.98 | 6.46 |
31 | x_32 | 0.98 | 6.46 |
32 | x_14 | 0.98 | 6.46 |
33 | x_9 | 0.98 | 6.46 |
34 | x_19 | 0.98 | 6.46 |
35 | x_27 | 0.98 | 6.46 |
36 | x_51 | 0.98 | 6.46 |
37 | x_48 | 0.98 | 6.46 |
38 | x_47 | 0.98 | 6.46 |
39 | x_50 | 0.98 | 6.46 |
40 | x_42 | 0.98 | 6.46 |
41 | x_49 | 0.98 | 6.46 |
42 | x_41 | 0.98 | 6.46 |
43 | x_43 | 0.98 | 6.46 |
44 | x_44 | 0.98 | 6.46 |
45 | x_40 | 0.98 | 6.46 |
46 | x_39 | 0.98 | 6.46 |
47 | x_45 | 0.98 | 6.46 |
48 | x_35 | 0.98 | 6.46 |
49 | x_4 | 0.98 | 6.46 |
50 | x_13 | 0.98 | 6.46 |
51 | x_6 | 0.98 | 6.46 |
52 | x_20 | 0.98 | 6.46 |
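The running achievable R-Squared stops improving once x_15 is added at selection order 13, so the remaining variables add nothing at the displayed precision. A minimal sketch of how one might truncate the ranking at that plateau is shown below. The dataframe is typed in by hand from the first rows of the table above, and the tolerance values are arbitrary choices for illustration, not part of the kxy API.

```python
import pandas as pd

# First rows of the variable selection table, copied by hand from the output above.
selection = pd.DataFrame({
    "Variable": [
        "No Variable", "x_30", "x_34", "x_31", "x_17", "x_38", "x_33", "x_29",
        "x_36", "x_24", "x_12", "x_46", "x_1", "x_15", "x_10",
    ],
    "Running Achievable R-Squared": [
        0.00, 0.57, 0.67, 0.67, 0.67, 0.67, 0.67, 0.78,
        0.78, 0.78, 0.78, 0.89, 0.95, 0.98, 0.98,
    ],
})

r2 = selection["Running Achievable R-Squared"]

# Position of the first row whose running R-squared reaches the final plateau.
cutoff = int((r2 >= r2.max() - 1e-9).to_numpy().argmax())

# Keep every variable up to the plateau, skipping the "No Variable" baseline row.
selected = selection["Variable"].iloc[1:cutoff + 1].tolist()

# Steps at which the running R-squared visibly increases (gain above 0.005).
gains = r2.diff()
improving_steps = selection.loc[gains > 0.005, "Variable"].tolist()

print("Variables kept:", selected)
print("Steps with a visible gain:", improving_steps)
```

The running values describe the variables selected so far taken together, so keeping only the steps with a visible gain is not guaranteed to preserve the achievable R-Squared; truncating the ranking at the plateau is the safer reading of the table.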