Cohort Augmentation

Tue May 03 2022

It is often said that a data scientist's job is 20% modeling and 80% data cleaning. Every dataset presents its own challenges, most of which come down to the feasibility of the dataset itself. One of the sub-questions we ask is: are there enough useable examples for our predictive models to generalize across wider populations?

By useable examples, we mean non-null examples: each example has non-null input feature values as well as a non-null output value.
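To make the definition concrete, here is a minimal sketch that counts useable examples in a toy cohort. The records, the feature names (`age`, `bmi`) and the helper `is_useable` are hypothetical and chosen only for illustration; `None` stands in for a missing value.

```python
# Hypothetical toy cohort; None marks a missing value.
cohort = [
    {"age": 54.0, "bmi": 27.1, "phenotype": 1},
    {"age": 61.0, "bmi": None, "phenotype": 0},
    {"age": None, "bmi": 22.4, "phenotype": None},
    {"age": 47.0, "bmi": 30.2, "phenotype": 0},
]

def is_useable(record, features, target):
    """True when every input feature and the output value are non-null."""
    has_inputs = all(record[f] is not None for f in features)
    return has_inputs and record[target] is not None

useable = [r for r in cohort if is_useable(r, ["age", "bmi"], "phenotype")]
print(f"{len(useable)} of {len(cohort)} examples are useable")
```

In this toy cohort only two of the four records qualify, because a single missing value is enough to disqualify a record.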

This question is about combating overfitting: the phenomenon where a model performs well only on the training set because it homes in on the noise (rather than the signal) in the training data. As a result, when we validate the model on a different set of data, it performs poorly.
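The gap between training and validation performance can be illustrated with a small, hand-built example. Everything below is an assumption made for illustration: the one-dimensional data, the two deliberately mislabelled training points, and the two toy models (`memorizer`, which copies the label of the nearest training point, and `threshold_rule`, whose cut-off is chosen by hand).

```python
# Hypothetical 1-D toy data as (x, label) pairs. The underlying signal is
# "label is 1 when x >= 5"; (2.0, 1) and (7.0, 0) are deliberately noisy labels.
train = [(1.0, 0), (2.0, 1), (3.0, 0), (4.0, 0),
         (6.0, 1), (7.0, 0), (8.0, 1), (9.0, 1)]
validation = [(1.4, 0), (2.2, 0), (3.4, 0), (4.4, 0),
              (5.6, 1), (6.8, 1), (7.7, 1), (9.5, 1)]

def memorizer(x):
    """Overfit model: copy the label of the closest training point, noise included."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def threshold_rule(x):
    """Simple model that captures only the signal (threshold chosen by hand)."""
    return 1 if x >= 5 else 0

def accuracy(model, data):
    """Fraction of (x, label) pairs the model predicts correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

for name, model in [("memorizer", memorizer), ("threshold rule", threshold_rule)]:
    print(f"{name}: train={accuracy(model, train):.2f}, "
          f"validation={accuracy(model, validation):.2f}")
```

The memorizer is perfect on the training set but reproduces the noisy labels on nearby validation points, while the simpler rule scores lower on the training set and higher on validation.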

Among the many strategies for combating overfitting is data augmentation. Data augmentation is based on the premise that we can increase the number of useable examples in our dataset (Shorten & Khoshgoftaar, 2019). With more examples, our models stand a better chance of generalizing over a wider population.

At biotx.ai, we have developed a strategy that applies the principles of data augmentation to cohort data, which we call cohort augmentation. Cohort augmentation consists of two imputation steps.

1.  Feature Imputation: We impute any missing cohort feature values.

2.  Phenotype Imputation: We predict the phenotype from the cohort feature values using a Machine Learning model.
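The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the biotx.ai pipeline: the post does not say which imputation method or model is used, so mean imputation and a nearest-centroid classifier are stand-ins, and the cohort, feature names and function names are hypothetical.

```python
from statistics import mean

# Hypothetical toy cohort; None marks a missing value.
cohort = [
    {"age": 54.0, "bmi": 27.0, "phenotype": 1},
    {"age": 61.0, "bmi": None, "phenotype": 1},
    {"age": 30.0, "bmi": 21.0, "phenotype": 0},
    {"age": None, "bmi": 22.0, "phenotype": 0},
    {"age": 58.0, "bmi": 29.0, "phenotype": None},
    {"age": 28.0, "bmi": None, "phenotype": None},
]
FEATURES = ["age", "bmi"]

def impute_features(rows, features):
    """Step 1 (stand-in): fill each missing feature with its observed mean."""
    means = {f: mean(r[f] for r in rows if r[f] is not None) for f in features}
    return [
        {**r, **{f: means[f] if r[f] is None else r[f] for f in features}}
        for r in rows
    ]

def impute_phenotype(rows, features):
    """Step 2 (stand-in): predict missing phenotypes with a nearest-centroid model.

    Features are left unscaled here to keep the sketch short.
    """
    labelled = [r for r in rows if r["phenotype"] is not None]
    labels = {r["phenotype"] for r in labelled}
    centroids = {
        label: [mean(r[f] for r in labelled if r["phenotype"] == label)
                for f in features]
        for label in labels
    }

    def predict(row):
        # Pick the label whose centroid has the smallest squared distance.
        return min(
            centroids,
            key=lambda label: sum(
                (row[f] - c) ** 2 for f, c in zip(features, centroids[label])
            ),
        )

    return [
        r if r["phenotype"] is not None else {**r, "phenotype": predict(r)}
        for r in rows
    ]

augmented = impute_phenotype(impute_features(cohort, FEATURES), FEATURES)
for row in augmented:
    print(row)
```

Before augmentation only two of the six records are fully observed; afterwards all six have feature and phenotype values, with the caveat that the filled-in values are estimates rather than observations.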

Imputation helps to reduce possible bias. Without it, discarding records with missing values can quickly lead to a substantial reduction in sample size and a consequent loss of statistical information (Carpenter & Smuk, 2021).
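A back-of-the-envelope calculation shows how quickly the sample shrinks when incomplete records are dropped. The sketch assumes, purely for illustration, that every feature is missing independently with the same probability; real cohorts rarely behave this neatly, and the function name is hypothetical.

```python
def complete_case_fraction(missing_rate: float, n_features: int) -> float:
    """Expected share of rows with no missing feature, assuming each feature
    is missing independently with the same probability (an assumption)."""
    return (1.0 - missing_rate) ** n_features

for n_features in (1, 5, 10, 20):
    share = complete_case_fraction(0.05, n_features)
    print(f"{n_features:>2} features at 5% missing each: "
          f"{share:.0%} of rows are complete")
```

Under these assumptions, a modest 5% missing rate per feature leaves roughly 60% of rows complete with 10 features and roughly 36% with 20.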

Imputation is a trusted method for reducing model bias and for increasing both sample size and statistical power. Furthermore, it ensures that the effort involved in collecting data on an individual is not discarded, even if one or more of their variables is missing (Carpenter & Smuk, 2021).

In the realm of propensity score analysis, Leyrat et al. (2019) show that feature imputation is a preferred approach for estimating treatment effects. This approach has been applied across a variety of medical studies, which use imputation as a reliable method for increasing the number of useable examples in their cohorts and reducing the amount of missing data in potential confounding features (Brauer et al., 2020; Ali et al., 2019).


Ali, M. S., Prieto-Alhambra, D., Lopes, L. C., Ramos, D., Bispo, N., Ichihara, M. Y., ... et al. (2019). Propensity score methods in health technology assessment: principles, extended applications, and recent advances. Frontiers in Pharmacology, 973.

Brauer, R., Wei, L., Ma, T., Athauda, D., Girges, C., Vijiaratnam, N., Auld, G., Whittlesea, C., Wong, I., Foltynie, T. (2020). Diabetes medications and risk of Parkinson’s disease: a cohort study of patients with diabetes. Brain, 143 (10), 3067–3076.

Carpenter, J. R., & Smuk, M. (2021). Missing data: A statistical framework for practice. Biometrical Journal, 63 (5), 915–947.

Leyrat, C., Seaman, S. R., White, I. R., Douglas, I., Smeeth, L., Kim, J., Resche-Rigon, M., Carpenter, J. R., Williamson, E. J. (2019). Propensity score analysis with partially observed covariates: How should multiple imputation be used? Statistical Methods in Medical Research, 28 (1), 3–19.

Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6 (1), 1–48.