Your Data
How We Use Your Data
To run our analyses, the KXY backend needs your data. The methods below are the only ones involved in sharing your data with us. The kxy package uploads your data only if and when needed.
- kxy.api.data_transfer.generate_upload_url(file_name)
Requests a pre-signed URL to upload a dataset.
- Parameters
file_name (str) – A string that uniquely identifies the content of the file.
- Returns
d – The dictionary containing the pre-signed URL.
- Return type
dict or None
- kxy.api.data_transfer.upload_data(df, file_name=None)
Uploads a dataframe to KXY servers.
- Parameters
df (pd.DataFrame) – The dataframe to upload.
- Returns
d – Whether the upload was successful.
- Return type
bool
Anonymizing Your Data
Fortunately, our analyses are invariant under various transformations that can completely anonymize your data.
You may simply run df_anonymized = df.kxy.anonymize() on any dataframe df to anonymize it, and work with df_anonymized instead of df.
Check out the function below for more information on how we anonymize your data.
- BaseAccessor.anonymize(columns_to_exclude=[])
Anonymize the dataframe in a manner that leaves all pre-learning and post-learning analyses (including data valuation, variable selection, model-driven improvability, data-driven improvability and model explanation) invariant.
Any transformation on continuous variables that preserves ranks will not change our pre-learning and post-learning analyses. The same holds for any 1-to-1 transformation on categorical variables.
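The invariance claim above can be illustrated with a small, dependency-free sketch. It uses Spearman's rank correlation as a stand-in for a rank-based statistic; this is an assumption made for illustration and not a description of how the KXY backend computes its analyses. The helpers `ranks` and `spearman` are hypothetical names introduced here.

```python
import math
import random


def ranks(values):
    """Return the 0-based rank of each value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free samples of equal length."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Synthetic data: y is a noisy copy of x.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [xi + rng.gauss(0.0, 0.5) for xi in x]

# Strictly increasing maps change the values but not their order.
x_transformed = [math.exp(xi) for xi in x]
y_transformed = [yi ** 3 for yi in y]

rho_before = spearman(x, y)
rho_after = spearman(x_transformed, y_transformed)
print(rho_before == rho_after)
```

Because the statistic depends on the data only through ranks, the transformed columns give the same value as the originals even though none of the original values remain.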
This implementation replaces ordinal values (i.e. any column that can be cast as a float) with their within-column Gaussian score. For each non-ordinal column, we form the set of all possible values, we assign a unique integer index to each value in the set, and we systematically replace said value appearing in the dataframe by the hexadecimal code of its associated integer index.
For regression problems, accurate estimation of RMSE-related metrics requires that the target column (and the prediction column for post-learning analyses) not be anonymized.
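The scheme described above can be sketched as follows. This is a minimal illustration and not the kxy implementation: `anonymize_sketch` is a hypothetical name, the Gaussian score is assumed to be the standard normal quantile of rank / (n + 1), and distinct categorical values are indexed in the order of their string representation, which is an arbitrary choice. Missing values in non-ordinal columns are not handled specially.

```python
from statistics import NormalDist

import pandas as pd


def anonymize_sketch(df, columns_to_exclude=None):
    """Illustrative stand-in for the anonymization described above (not the kxy code)."""
    excluded = set(columns_to_exclude or [])
    out = df.copy()
    for col in df.columns:
        if col in excluded:
            # Left untouched, e.g. target and prediction columns in regression.
            continue
        try:
            numeric = df[col].astype(float)
        except (ValueError, TypeError):
            # Non-ordinal column: give each distinct value an integer index
            # and replace the value with the hexadecimal code of that index.
            distinct = sorted(df[col].unique(), key=str)
            codes = {value: hex(i) for i, value in enumerate(distinct)}
            out[col] = df[col].map(codes)
        else:
            # Ordinal column: rank -> quantile in (0, 1) -> standard normal score.
            quantiles = numeric.rank(method="average") / (len(numeric) + 1)
            out[col] = quantiles.map(NormalDist().inv_cdf, na_action="ignore")
    return out


df = pd.DataFrame({
    "age": [31, 45, 27, 52],
    "city": ["Paris", "Lagos", "Paris", "Lima"],
    "price": [10.0, 30.0, 20.0, 40.0],
})
anonymized = anonymize_sketch(df, columns_to_exclude=["price"])
print(anonymized)
```

In this sketch the age column keeps its ordering but none of its original values, city labels become opaque hexadecimal codes, and price is returned unchanged because it was excluded.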
- Parameters
columns_to_exclude (list (optional)) – List of columns not to anonymize (e.g. target and prediction columns for regression problems).
- Returns
result – The anonymized dataframe.
- Return type
pandas.DataFrame