A Powerful Serverless Analysis Toolkit That Takes Trial and Error Out of Machine Learning Projects

The kxy package utilizes information theory to take trial and error out of machine learning projects.

Project Feasibility

From the get-go, the data valuation analysis of the kxy package tells data scientists whether their datasets are informative enough to reach a target performance (e.g. \(R^2\), RMSE, maximum log-likelihood, or classification error) in a classification or regression problem and, if so, the best performance that can be achieved using those datasets. Only spend time and compute resources on a project once you know it can yield the desired business impact.
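To make the idea of a best achievable performance concrete, here is a minimal sketch for one special case. It is not the kxy implementation or API. It assumes the label and the inputs are jointly Gaussian, in which case the mutual information \(I\) (in nats) and the highest achievable \(R^2\) are linked by \(R^2 = 1 - e^{-2I}\). The function names are hypothetical and chosen for illustration only.

```python
import math


def gaussian_mutual_information(rho: float) -> float:
    """Mutual information (in nats) between two jointly Gaussian
    variables whose correlation is rho. Assumes |rho| < 1."""
    return -0.5 * math.log(1.0 - rho ** 2)


def best_r_squared(mutual_information: float) -> float:
    """Highest R^2 consistent with the given mutual information,
    under the jointly Gaussian assumption stated above."""
    return 1.0 - math.exp(-2.0 * mutual_information)


def is_feasible(mutual_information: float, target_r_squared: float) -> bool:
    """True if the target R^2 does not exceed the achievable bound."""
    return best_r_squared(mutual_information) >= target_r_squared


# Illustrative numbers only: a correlation of 0.8 between label and input.
mi = gaussian_mutual_information(0.8)
print(f"mutual information: {mi:.4f} nats")
print(f"best achievable R^2: {best_r_squared(mi):.2f}")
print("target R^2 of 0.70 feasible:", is_feasible(mi, 0.70))
```

In this toy setting a project that needs an \(R^2\) of 0.70 would be flagged as infeasible before any model is trained, which is the kind of decision the data valuation analysis is meant to support.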

Automatic (Model-Free) Feature Selection

The model-free variable selection analysis provided by the kxy package allows data scientists to train smaller models faster and more cheaply, and to achieve higher performance than they would by throwing all inputs into one big model or proceeding by trial and error.

Production Model Improvability Analyses

Model-Driven Improvability: Once a model has been trained, the kxy model-driven improvability analysis quantifies the extent to which the trained model can be improved without resorting to additional features. This allows data scientists to focus their modeling efforts on high-ROI initiatives. Only throw the might of your ML team and platform at improving the fit of your production model when you know it can be improved. Never again will you spend weeks, if not months, and thousands of dollars in cloud compute implementing the latest models on specialized hardware to improve your production model, only to find out that its fit cannot be improved.

Data-Driven Improvability: Once the fit of a production model is optimal (i.e. it has successfully extracted all the value in using a given set of features to predict the label), the kxy data-driven improvability analysis allows data scientists to quickly quantify the performance increase (e.g. in \(R^2\), RMSE, maximum log-likelihood, or classification error) that a new dataset may bring about. Only retrain models with additional features when you know they can bring about a meaningful performance boost.

Reducing Time and Resources Spent on Overfitted Models

We provide callbacks for the major Python machine learning libraries that terminate training when the running best performance seems unrealistic (i.e. far exceeds the theoretical best achievable). These callbacks save time and compute resources on models that we can reliably determine, well before training ends, will overfit once fully trained. This is a cost-effective alternative to cross-validation.
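The stopping rule described above can be sketched in a framework-agnostic way. This is an illustration of the logic only, not the kxy callbacks themselves. The class name, the `tolerance` parameter, and the stand-in training loop are all hypothetical, and the monitored score is assumed to be one where higher is better (such as \(R^2\)).

```python
class TheoreticalBestStopper:
    """Hypothetical helper: signals that training should stop once the
    running best score exceeds a theoretical-best score by more than
    a tolerance. Assumes higher scores are better."""

    def __init__(self, theoretical_best: float, tolerance: float = 0.05):
        self.theoretical_best = theoretical_best
        self.tolerance = tolerance
        self.running_best = float("-inf")
        self.stopped_epoch = None

    def update(self, epoch: int, score: float) -> bool:
        """Record the score for one epoch; return True if training
        should be terminated."""
        self.running_best = max(self.running_best, score)
        if self.running_best > self.theoretical_best + self.tolerance:
            self.stopped_epoch = epoch
            return True
        return False


def train(scores, stopper):
    """Stand-in training loop: `scores` plays the role of the score
    observed after each epoch of a real training run."""
    for epoch, score in enumerate(scores):
        if stopper.update(epoch, score):
            break
    return stopper.stopped_epoch


# Illustrative numbers only: the theoretical-best R^2 is taken to be 0.64.
stopper = TheoreticalBestStopper(theoretical_best=0.64, tolerance=0.05)
stopped_at = train([0.30, 0.55, 0.66, 0.80, 0.95], stopper)
print("stopped at epoch:", stopped_at)
```

In a real project the same check would sit inside the callback hook of whichever training library is in use, with the theoretical-best value coming from a prior data valuation analysis. The tolerance keeps small estimation errors in that value from ending training prematurely.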