Minimum sample size for developing a multivariable prediction model: PART II - binary and time-to-event outcomes

Abstract

When designing a study to develop a new prediction model with binary or time‐to‐event outcomes, researchers should ensure their sample size is adequate in terms of the number of participants (n) and outcome events (E) relative to the number of predictor parameters (p) considered for inclusion. We propose that the minimum values of n and E (and subsequently the minimum number of events per predictor parameter, EPP) should be calculated to meet the following three criteria: (i) small optimism in predictor effect estimates as defined by a global shrinkage factor of ≥0.9, (ii) small absolute difference of ≤0.05 in the model's apparent and adjusted Nagelkerke's R², and (iii) precise estimation of the overall risk in the population. Criteria (i) and (ii) aim to reduce overfitting conditional on a chosen p, and require prespecification of the model's anticipated Cox‐Snell R², which we show can be obtained from previous studies. The values of n and E that meet all three criteria provide the minimum sample size required for model development. Upon application of our approach, a new diagnostic model for Chagas disease requires an EPP of at least 4.8 and a new prognostic model for recurrent venous thromboembolism requires an EPP of at least 23. This reinforces why rules of thumb (eg, 10 EPP) should be avoided. Researchers might additionally ensure the sample size gives precise estimates of key predictor effects; this is especially important when key categorical predictors have few events in some categories, as this may substantially increase the numbers required.
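
The three criteria can be illustrated for a binary outcome with a short Python sketch. The abstract does not state the formulas, so the closed-form expressions below are an assumption, written as they are commonly given for this approach: n = p / ((S − 1) ln(1 − R²_CS / S)) for criterion (i), S = R²_CS / (R²_CS + δ · max R²_CS) for criterion (ii), and n = (1.96 / margin)² φ(1 − φ) for criterion (iii), where φ is the anticipated outcome proportion. The function names, the 0.05 margin of error in criterion (iii), and the input values are hypothetical and are not taken from the article's worked examples.

```python
import math


def n_for_shrinkage(p, r2_cs, shrinkage=0.9):
    """Criterion (i): n giving an expected global shrinkage factor >= `shrinkage`.

    Assumed formula: n = p / ((S - 1) * ln(1 - R2_CS / S)).
    """
    return math.ceil(p / ((shrinkage - 1.0) * math.log(1.0 - r2_cs / shrinkage)))


def max_r2_cs(phi):
    """Largest Cox-Snell R2 attainable for a binary outcome with proportion phi.

    Uses max R2_CS = 1 - exp(2 * lnL_null / n), where lnL_null / n is the
    per-participant log-likelihood of the intercept-only model.
    """
    lnl_null_per_n = phi * math.log(phi) + (1.0 - phi) * math.log(1.0 - phi)
    return 1.0 - math.exp(2.0 * lnl_null_per_n)


def n_for_r2_difference(p, r2_cs, phi, delta=0.05):
    """Criterion (ii): n keeping apparent minus adjusted Nagelkerke R2 <= delta.

    The target difference is converted into a required shrinkage factor,
    which is then passed to the criterion (i) formula.
    """
    shrinkage = r2_cs / (r2_cs + delta * max_r2_cs(phi))
    return n_for_shrinkage(p, r2_cs, shrinkage)


def n_for_overall_risk(phi, margin=0.05):
    """Criterion (iii): n estimating the overall risk to within +/- margin.

    The margin of 0.05 is an illustrative assumption (95% confidence level).
    """
    return math.ceil((1.96 / margin) ** 2 * phi * (1.0 - phi))


def minimum_sample_size(p, r2_cs, phi):
    """Return (n, E, EPP) meeting all three criteria for a binary outcome."""
    n = max(
        n_for_shrinkage(p, r2_cs),
        n_for_r2_difference(p, r2_cs, phi),
        n_for_overall_risk(phi),
    )
    events = n * phi  # expected number of outcome events
    return n, events, events / p


if __name__ == "__main__":
    # Hypothetical inputs: 20 predictor parameters, anticipated
    # Cox-Snell R2 of 0.2, anticipated outcome proportion of 0.3.
    n, events, epp = minimum_sample_size(p=20, r2_cs=0.2, phi=0.3)
    print(f"n = {n}, E = {events:.1f}, EPP = {epp:.1f}")
```

In this sketch the required EPP is an output of the calculation rather than an input, which is why it varies with the anticipated Cox‐Snell R² and outcome proportion instead of being fixed at a value such as 10. Time‐to‐event outcomes would need a different null log-likelihood and are not covered here.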

Author and article information

Journal

Title: Statistics in Medicine

Abbreviated Title: Statistics in Medicine

Publisher: Wiley

ISSN: 0277-6715

Publication date (Electronic): October 24, 2018

Affiliations

[1] Centre for Prognosis Research, Research Institute for Primary Care and Health Sciences; Keele University; Staffordshire UK

[2] Department of Biostatistics; Vanderbilt University School of Medicine; Nashville Tennessee

[3] Julius Centre for Health Sciences and Primary Care; University Medical Centre Utrecht; Utrecht The Netherlands

[4] Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences; University of Oxford; Oxford UK