Model Documentation & Architecture

Understanding the Prediction Pipeline

This documentation explains the machine learning architecture, preprocessing pipeline, optimization strategy, and the reasoning behind simplifying healthcare features for both predictive performance and end-user usability.

Why Logistic Regression (OvR)?

The prediction system uses Logistic Regression with One-vs-Rest (OvR) because the task involves multi-class classification across patient readmission risk categories.

Logistic Regression was selected because it provides strong interpretability, stable behavior on structured healthcare data, and reliable probabilistic outputs.

Interpretable Predictions: Logistic Regression makes it easier to see how each patient variable influences a prediction, which is valuable in healthcare systems.
One-vs-Rest (OvR) Strategy: OvR trains one binary classifier per risk category (that category versus all the others), which handles multiple classes while keeping each model simple.
Reliable Structured Data Performance: Logistic Regression performs well on tabular healthcare datasets with engineered numerical and categorical features.
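The OvR setup described above can be sketched with scikit-learn as follows. This is a minimal illustration rather than the project's actual training code: the data is synthetic, and the choice of three risk classes, the feature count, and `max_iter=1000` are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the patient feature matrix.
# Assumption: three risk categories and eight numeric features.
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=5,
    n_classes=3,
    random_state=0,
)

# Scale the features, then fit one binary LogisticRegression per class.
model = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(X, y)

# The fitted OvR wrapper holds one underlying classifier per risk category.
ovr = model.named_steps["onevsrestclassifier"]
n_binary_models = len(ovr.estimators_)

# Per-class probabilities for the first five rows (each row sums to 1).
probabilities = model.predict_proba(X[:5])
```

Each entry in `ovr.estimators_` exposes its own `coef_`, so the influence of a feature on a single risk category can be inspected separately.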
Why Not Linear Regression?

Linear Regression predicts continuous numerical values, whereas this project assigns patients to discrete risk categories. Logistic Regression, which models class probabilities directly, is therefore the better fit.

Binary Encoding

Binary Encoding was used to efficiently transform categorical variables into numerical form while minimizing unnecessary feature expansion.

Space Efficient: Compared to One-Hot Encoding, Binary Encoding creates fewer columns (roughly log2(n) instead of n for a variable with n categories), reducing memory usage and feature-space size.
Reduced Curse of Dimensionality: Slower dimensional growth reduces sparsity and helps the model learn more efficiently from high-cardinality categorical data.
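To make the column savings concrete, here is a small pure-Python sketch of the idea behind binary encoding. It is not the encoder used in the project (the source does not name one), and a library implementation may assign ordinals differently. The function `binary_encode` and the sample specialty values are hypothetical.

```python
def binary_encode(values):
    """Map each category to an ordinal, then spread its binary digits over columns."""
    categories = sorted(set(values))
    # Ordinals start at 1 so the all-zero row stays free for unseen values.
    ordinals = {category: index + 1 for index, category in enumerate(categories)}
    # Number of bits needed to represent the largest ordinal.
    width = max(1, len(categories).bit_length())
    rows = [
        [int(bit) for bit in format(ordinals[value], f"0{width}b")]
        for value in values
    ]
    return rows, width


# Hypothetical categorical column with five distinct values.
specialties = [
    "Cardiology", "Surgery", "Cardiology", "Nephrology",
    "Emergency", "Family", "Surgery",
]

rows, width = binary_encode(specialties)
one_hot_width = len(set(specialties))

print(f"binary columns: {width}, one-hot columns: {one_hot_width}")
```

With five distinct categories the sketch needs three columns where one-hot encoding would need five, and the gap widens as cardinality grows.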
RandomizedSearchCV

RandomizedSearchCV was used for efficient hyperparameter tuning to identify stronger-performing parameter combinations without exhaustive computational search.

Faster Optimization: Sampling a fixed number of random parameter combinations costs less than exhaustively evaluating every point on a grid.
Practical Scalability: Random search helps discover reliable parameter combinations efficiently while maintaining strong predictive performance.
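A randomized search over the regularization strength might look like the sketch below. The search space, `n_iter=10`, `cv=3`, and the accuracy metric are assumptions chosen for illustration; the source does not state which hyperparameters or scoring were actually tuned.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in data (assumption: three risk categories).
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=5,
    n_classes=3,
    random_state=0,
)

search = RandomizedSearchCV(
    estimator=OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    # "estimator__C" reaches the C parameter of the wrapped LogisticRegression.
    param_distributions={"estimator__C": loguniform(1e-3, 1e2)},
    n_iter=10,        # only 10 sampled candidates, not a full grid
    cv=3,             # 3-fold cross-validation per candidate
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

best_C = search.best_params_["estimator__C"]
n_candidates = len(search.cv_results_["params"])
```

The cost is set by `n_iter` rather than by the size of the search space, so adding more hyperparameters does not multiply the number of model fits.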
Simplifying Features for the Model and the End User

One important design decision in this project was simplifying raw clinical variables into cleaner, more meaningful feature groups.

Healthcare datasets often contain fragmented, inconsistent, or highly detailed medical values. Using raw features directly can increase complexity and dimensionality while reducing interpretability.

1. Diagnosis Categorization: Raw diagnosis codes were grouped into broader medical categories such as Diabetes, Respiratory, Circulatory, Digestive, Injury, and Metabolic conditions.
2. Binary Feature Simplification: Several clinical variables were converted into Yes/No indicators to reduce unnecessary complexity and simplify user interaction.
3. Standardization & Encoding: Numerical variables were scaled using StandardScaler, while categorical variables were transformed using Binary Encoding for efficient machine learning behavior.
4. Improved End-User Experience: Simplified features make the prediction interface easier to understand, reducing confusion while preserving important clinical information.
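The first two simplification steps could be sketched as below. The source does not say which coding system the raw diagnosis values use. This example assumes ICD-9 codes and uses the standard ICD-9 chapter ranges. The helper names `diagnosis_group` and `yes_no`, the fallback "Other" category, and the sample values are hypothetical.

```python
def diagnosis_group(code):
    """Map a raw diagnosis code to a broad category (ICD-9 ranges assumed)."""
    try:
        value = float(code)
    except (TypeError, ValueError):
        # Missing values and alphanumeric codes (e.g. "V45") land here.
        return "Other"
    if 250 <= value < 251:
        return "Diabetes"      # 250.xx
    if 240 <= value < 280:
        return "Metabolic"     # endocrine, nutritional, metabolic (excluding 250.xx)
    if 390 <= value < 460:
        return "Circulatory"
    if 460 <= value < 520:
        return "Respiratory"
    if 520 <= value < 580:
        return "Digestive"
    if 800 <= value < 1000:
        return "Injury"
    return "Other"


def yes_no(value, negative=("No", "None", "", None)):
    """Collapse a detailed clinical value into a Yes/No indicator."""
    return "No" if value in negative else "Yes"


# Hypothetical raw values.
groups = [diagnosis_group(code) for code in ["250.83", "428", "486", "V45"]]
flags = [yes_no(value) for value in ["Steady", "No", "Up"]]
```

Grouping this way replaces hundreds of distinct codes with a handful of categories, which keeps both the encoded feature space and the input form small.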
Design Philosophy

The objective was not only to improve predictive performance but also to create a healthcare-oriented AI system that remains understandable, structured, and usable in real-world interaction.

Clinical & Educational Disclaimer: This platform is intended for educational, analytical, and demonstration purposes only. Predictions generated by the system should not replace physician evaluation, medical diagnosis, or professional healthcare decision-making.