This blog post is a summary of a paper I enjoyed reading immensely – Verification, Validation and Confirmation of Numerical Models in the Earth Sciences, first-authored by the brilliant Naomi Oreskes. The authors’ ultimate question is ‘what makes a good model?’ They also expand on the philosophical hoops one jumps through to ‘confirm’ or ‘validate’ a complex model. This is as relevant to Quantitative Systems Pharmacology (QSP) as it is to Earth-science modelling (the context of the paper).
Even though ‘validation’ of a QSP model seems to imply that its predictions can be trusted, this is only true in a narrow sense. Even if a model is able to predict previously unseen data (the generally accepted standard for model validation), such results hold only within a narrow, previously defined context (e.g., the model is calibrated to dosing regimen A and is able to predict the effects of a somewhat similar dosing regimen B in the same population).
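To make that narrow sense concrete, here is a minimal sketch (my own toy illustration, not anything from the paper or a real QSP workflow): a hypothetical one-compartment PK model is calibrated to synthetic data from a made-up regimen A, then checked against held-out synthetic data from a regimen B. Every parameter value, dose, and data point below is invented purely to show the mechanics.

```python
# Toy illustration only: synthetic data, invented parameters, hypothetical regimens.
import numpy as np
from scipy.optimize import curve_fit

def one_compartment(t, dose, ka, ke, V):
    # Concentration-time profile after a single oral dose (standard analytical form).
    return dose * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

rng = np.random.default_rng(0)
t = np.linspace(0.5, 24, 12)                      # sampling times (h)

# 'Regimen A' (100 mg): synthetic calibration data with ~10% noise.
true_ka, true_ke, true_V = 1.2, 0.15, 30.0
obs_A = one_compartment(t, 100, true_ka, true_ke, true_V) \
        * (1 + 0.1 * rng.standard_normal(len(t)))

# Calibrate the model to regimen A only.
fit_A = lambda t, ka, ke, V: one_compartment(t, 100, ka, ke, V)
params, _ = curve_fit(fit_A, t, obs_A, p0=[1.0, 0.1, 20.0])

# 'Validation' in the narrow sense: predict regimen B (50 mg) and compare
# against held-out synthetic observations from the same 'population'.
obs_B = one_compartment(t, 50, true_ka, true_ke, true_V) \
        * (1 + 0.1 * rng.standard_normal(len(t)))
pred_B = one_compartment(t, 50, *params)
print("mean % error on regimen B:", 100 * np.mean(np.abs(pred_B - obs_B) / obs_B))
```

A good match on regimen B only shows that the model holds in this pre-specified context; it says nothing about other doses, populations, or mechanisms outside it.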
The Oreskes paper draws the distinction that ‘verification’ is used to establish the general reliability of a model as a basis for decision-making, while ‘validation’, a lower bar, indicates that the model does not contain known or detectable flaws and is internally consistent, but need not increase the reliability of its predictions. In short, verification (a theoretical construct without real-world application) and validation (which does not necessarily translate to credibility) are not the key measures modellers are looking for to put a stamp of authority on their model.
What, then, does a modeller need to do to rate their creation? Firstly, models are built and used by different teams, and at a minimum, sharing among those teams the choices and trade-offs made during model development is as important as sharing the results. Which parameters and interactions are grounded in quantitative data that have been reproduced multiple times in relevant systems, which have only been qualitatively established, and which are straight-up heuristics developed to match the data²?
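As a purely hypothetical sketch of what sharing those choices might look like (my own illustration, not something proposed in the paper), one could carry a small provenance record alongside the model, tagging each parameter with the kind of evidence behind it. The parameter names, values, and sources below are all made up.

```python
# Hypothetical sketch: every name, value, and source below is invented for illustration.
from dataclasses import dataclass

@dataclass
class ParameterProvenance:
    name: str
    value: float
    units: str
    evidence: str   # "quantitative-reproduced", "qualitative", or "heuristic"
    source: str     # citation, dataset, or rationale for the choice

provenance = [
    ParameterProvenance("k_on", 1.0e5, "1/M/s", "quantitative-reproduced",
                        "binding kinetics replicated in two independent assays"),
    ParameterProvenance("EC50_pathway", 2.3, "nM", "qualitative",
                        "direction of effect from literature; magnitude assumed"),
    ParameterProvenance("feedback_gain", 0.7, "dimensionless", "heuristic",
                        "tuned so simulations match the calibration time course"),
]

for p in provenance:
    print(f"{p.name:15s} {p.evidence:25s} {p.source}")
```

Even this crude three-tier labelling makes it obvious to a downstream team which predictions lean on reproduced measurements and which lean on conveniences.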
The paper goes on to propose that a model can be characterized as empirically adequate when it reproduces all the data that is deemed necessary (‘qualification’). In the context of QSP, part of good process is to identify, as the initial step of model development, a set of data and relevant biology that all stakeholders agree makes the model fit-for-purpose. Then engineers develop equations and tweak¹ the model until satisfactory fits to the identified data are obtained. That model is then carried forward to predictive simulations to aid in decision-making.
Even if we can all arrive at a set of rules that results in an empirically adequate model, the key goal in research is ‘Can the model inform you of some new science that you couldn’t have obtained without the modelling?’ and not ‘Do we have a validated QSP model?’ Consider the case of a model developed for clinical decision-making that is calibrated and validated to everyone’s satisfaction. If the modelling team recommends a particular course of action based on simulations (especially one that differs from the team’s default decision), there has to be a clear and defensible explanation from first principles as to why that is reasonable; ‘our validated model has predicted this’ is never convincing enough.
It is possible that insights are gained in the process of model design and data collection, well before we get to the step of model validation and running predictive simulations. While validation and verification can bring rigour to model building, sometimes (and, in QSP, often) those terms may not be meaningful by themselves. Other aspects of model rating/scoring are equally or more important, along with the willingness to let go of that rigour on occasion to allow the rare insight to shake free and fall out of the modelling endeavour!
This paper is short and, for QSP practitioners, well worth the read, footnotes and all. My favourite quotes:
(1) If we compare a result predicted by a model with observational data and the comparison is unfavourable, then we know that something is wrong, and we may or may not be able to determine what it is (18). Typically, we continue to work on the model until we achieve a fit (19). / Somebody had to say it! What is the threshold at which one ‘throws in the towel’ on a model structure and proposes new hypotheses to simultaneously fit all the data? And when has a true ‘knowledge gap’ been identified, as opposed to inadequate effort from the modelling team?
(2) The additional assumptions, inferences, and input parameters required to make a model work are known as “auxiliary hypotheses” (17). / Useful concept. The ‘reasonableness’ of auxiliary hypotheses is critical.
(3) The greater the number and diversity of confirming observations, the more probable it is that the conceptualization embodied in the model is not flawed (38). But confirming observations do not demonstrate the veracity of a model or hypothesis, they only support its probability (39). / The diversity of data is critical in QSP: a variety of sources (proprietary data, public data, and opinions from experts) and multiple scales (clinical, mechanistic) all contribute to robust model design.
(4) And finally, a dig at the scientific community: “We have never seen a paper in which the authors wrote, ‘the empirical data invalidate this model.’”