Introduction

Mineral prospectivity mapping (MPM) is an activity that derives spatial information regarding the prospectivity of an area for target mineral (or ore) deposits. Many frameworks encapsulate the activities in pursuit of MPM, depending on the approach used; a high-level example is the exploration information system (EIS) framework (Yousefi et al., 2021). Data-driven MPM is a sub-domain of practice within MPM, whose activities are increasingly formulated as artificial intelligence tasks due to the usefulness of artificial intelligence algorithms, the abundance of data, and the emerging domain of geodata science. The use of artificial intelligence methods in data-driven activities typically adheres to a data science framework, which commonly consists of data collection, data preparation, exploratory data analysis, predictive modeling, and deployment and reporting (Fig. 1) (e.g., Shearer, 2000; Hazzan & Mike, 2023). This type of framework is designed to cater to the characteristics of experimentation, which include: (1) flexibility in methodology (e.g., workflow design), and (2) a high risk of having to execute all stages to reach project feasibility analysis (Fig. 1). A hidden type of uncertainty results from heuristic implementations of the data science workflow in the construction of MPM products, which we call 'workflow-induced uncertainty' (Fig. 1). In particular, it is unknown how sensitive MPM products are to common variability in workflow design, and to what extent workflow uncertainty can be anticipated and mitigated. Fundamentally, this is because workflow design is guided by variable domain performance metrics in the data science framework, whereas spatial domain characteristics of models are only weakly coupled to it and assessed in a post-hoc manner.

Figure 1

A common data science framework guiding workflow design and execution associated with the generation and usage of predictive models, with a specific geospatial extension to the workflow

In all sciences, models of reality advance through two modes of scientific inquiry—dissent and consensus. Dissent occurs in the form of new discoveries, breakthroughs and otherwise disruptive information or data that supersede, invalidate or weaken existing models (Solomon, 1994; e.g., general relativity vs. Newtonian gravity). Consensus re-affirms something that is already known and occurs in the form of replications, validations and successful applications of models, which increase model confidence (Laudan, 1984). Both modes are necessary to achieve scientific advancement, which often occurs through experiments that generate new evidence. Evaluations are then made to assess the degree to which new evidence constitutes dissent or consensus relative to a previous model. Replicable findings are important to determine the validity of evidence and the degree of either dissent or consensus. Therefore, experimental methodology is designed to maximize outcome replicability. The value of experimental findings is determined through replication, which is a form of scientific consensus (e.g., follow-up studies, replications and post-hoc data analysis). Even dissenting evidence must be replicable to retain scientific value. In the hypothesis-driven domain of science, the models under scientific inquiry are causal or physical. In the data-driven domain, models are commonly inferential.

Data-driven MPM is a developing sub-domain (e.g., Zuo et al., 2023), which means that there is no canonical, static methodology or standard operating procedure to produce models (which are usually visualized as maps). For example, the construction of evidence layers is seldom comparable across studies (Yousefi et al., 2021; Zuo et al., 2021). The most standardized portion of the data-driven MPM workflow is the data science framework, which recognizes and emphasizes workflow experimentation guided by model performance. In particular, data science explicitly recognizes that the applicative branch of this sub-domain is experimental in nature, which is unavoidable for data re-purposing (Grossi et al., 2021). This is entirely the case for MPM, which, by its breadth, is the biggest non-ephemeral user of multidisciplinary geodata. Experimentation favors method development. However, for method deployment, where an MPM map would be used to guide exploration, rigor in the spatial domain is desirable, as in the case of mineral resource estimation, which is governed by a rigorous framework for resource assessment (e.g., OSC, 2016; SAMREC, 2016). The typical aspatial data science framework is incomplete and not designed for MPM because most algorithms model data in the variable domain, whose performance is used as guidance in model selection. Spatial connotations of algorithm and model selection cannot be generally addressed within the data science framework because, e.g., hyperparameter tuning of non-spatially aware algorithms cannot be generally and robustly related to spatial characteristics. Application of data science to each discipline requires addressing challenges that arise from the integration of different bodies of knowledge and traditions (Hazzan & Mike, 2023). Thus far, a geospatial extension has been added to the data science framework for MPM (and similar) tasks, where additional criteria are used after the execution of a data science workflow to contextualize modeling outcomes (Fig. 1). Integration of data science into the geosciences has resulted in at least two approaches: (1) an additional component appended to data science (e.g., Zuo, 2020), and (2) integration of spatial considerations into data modeling (e.g., Hoffimann et al., 2021). Presently, relationships between variable domain performance (the data science outcome) and spatial characteristics (a geoscience outcome) remain to be fully explored.

Within the modeling stage of the data science framework, several activities must occur whose implementation can be varied, including: (1) details of feature engineering or extraction; (2) algorithm and model selection, including the choice of the performance metric for model selection through hyperparameter tuning; and (3) details of performance assessment (e.g., varying strategies of train-test data splitting) (Fig. 1). The data science framework permits variation within any workflow, provided that some benchmark of performance is achieved. From the perspective of data science, there is an unlimited number of combinations of (1)–(3) that can achieve a target level of model performance. Since it is impossible to explore all combinations exhaustively, it is unknown whether variations in choices across MPM practitioners result in substantial differences in prospectivity maps, and to what extent heuristic choices are appropriate in general. This means that it is not possible to achieve scientific consensus on MPM-based exploration models without explicit planning and methodology design. The result is uncertainty that arises from workflow design, with unknown impacts on the characteristics of MPM products. The closest known type of uncertainty to workflow-induced uncertainty is judgment-related uncertainty, which was identified for GIS-based approaches in the context of human cognitive biases and heuristics (Zuo et al., 2021). Workflow uncertainty is also not directly covered by the aleatoric-and-epistemic decomposition of model uncertainty, although it can impact both components (Hüllermeier & Waegeman, 2021).

This study aims to demonstrate that workflow-induced uncertainty exists and is significant, and proposes experimental consensus as a solution. Specifically, this study modulates a data-driven MPM workflow in a manner that mimics common variability in workflow design, to determine the isolated and systemic effects on the variable and spatial domain characteristics of the resulting models. The modulated components include: (1) feature space dimensionality; (2) the choice of algorithms; and (3) the choice of performance metrics during model selection. We specifically examine the relationship between metrics of model performance during performance assessment in the data science framework and the spatial characteristics of MPM products in the geodata science framework. Our findings are intended to illustrate the complex interactions across key tasks in data science workflows for MPM, and their impact on geospatial outcomes.

Our results demonstrate that: (1) it is possible to reach high model performance through many different combinations of feature-algorithm pairings, implying that there are many local minima in workflow design; (2) feature space dimensionality impacts model performance significantly, to a degree that is algorithm dependent; (3) the choice of algorithm impacts model performance, but standard metrics of model performance are insufficient to distinguish the quality of the resulting maps as measured by spatial selectivity; and (4) there is an inconsistent relationship between model performance and spatial selectivity, which implies that MPM products should not be derived solely using a data-driven methodology (e.g., the data science framework), but must involve some spatial or knowledge constraints. Based on our analysis, we propose that workflow modulation can be employed to propagate uncertainty and achieve consensus in data-driven MPM.

To ensure replicability and relevance for modern exploration targets, we employed the dataset published by Lawley et al. (2021), which contains various data layers and labels pertaining primarily to Mississippi Valley-type (MVT) and clastic-dominated Pb–Zn deposits. It is important to distinguish the purpose of our study from that of other studies (e.g., Lawley et al., 2022): ours is a theoretical/experimental MPM study, whereas, e.g., the Lawley et al. (2022) study is an applicative one. Readers interested in the applicative domain of MPM should refer to Lawley et al. (2022) for applied MPM products.

Review of Data-Driven MPM Workflows

MPM can be conducted in a manner guided by knowledge, data or some combination thereof (Yousefi et al., 2021). In geodata science-based MPM, machine learning algorithms are commonly used to model relationships between covariates and target labels because MPM is a particularly data-rich domain within the geosciences and machine learning as a discipline is primarily intended for big data analysis (Zuo, 2020; Yousefi et al., 2021). However, the extension of data science into geodata science is more recent and ongoing (e.g., Zuo, 2020; Yousefi et al., 2021) (Fig. 1). Within MPM, and strictly for the purpose of data analysis, machine learning is primarily used to: (1) identify mineralization-related anomalies through unsupervised learning (e.g., Nwaila et al., 2022; Zhang et al., 2022b); (2) predict targets that are similar to known occurrences through supervised learning (e.g., Zuo & Carranza, 2011; Zhang et al., 2021; Senanayake et al., 2023); and (3) predict targets using reinforcement learning (e.g., Shi et al., 2023). Outside of data analysis, machine learning is also beginning to be used in: (1) data generation (e.g., Zhang et al., 2022b; Bourdeau et al., 2023); (2) data processing (e.g., Song et al., 2020; Nwaila et al., 2023; Zhang et al., 2023); and (3) simulations (e.g., Song et al., 2021). Shallow learning and statistical approaches are very common (Zuo, 2020). Deep learning is also common in MPM because it makes few a priori assumptions about the statistical properties of data and because sizable datasets are available at regional to national scales (e.g., Xiong et al., 2018; Li et al., 2020; Sun et al., 2020; Shi et al., 2023). Among deep learning algorithms, autoencoders (e.g., Chen, 2015; Xiong et al., 2018) and convolutional neural networks (Li et al., 2020; Sun et al., 2020) are prominent (Zuo, 2020). For autoencoders, this is because they are the most general-purpose dimensionality reduction and data reconstruction tools, which means they can be used both to extract features and to detect high-dimensional anomalies. For convolutional neural networks, this is because they are the most general-purpose learning algorithms that are applicable to rasterized (image-like) data, can perform automated feature extraction, and can extract spatial relationships in addition to those in the variable domain.

This study focuses solely on classification-based data-driven MPM using supervised machine learning algorithms in the variable domain (or spectrum-based methods, see Zuo & Xu, 2024). This class of approaches uses labels in the form of known mineral occurrences (e.g., deposits, prospects and showings) as training data to guide model construction (Yousefi et al., 2021). Supervised data-driven methods are commonly biased by the locations of training data labels (Yousefi et al., 2021). In addition, the high complexity of geological systems, combined with an unknown lower bound on the number of labels required, makes it easy to produce overfitted models, particularly with deep learning algorithms (Dietterich, 1995; Porwal et al., 2004; Coolbaugh et al., 2007; Skabar, 2007; Srivastava et al., 2014; Chen, 2015; Porwal et al., 2015; Zuo et al., 2019, 2021; Yousefi et al., 2021). Consequently, models that are produced using data-driven methods carry exploration biases and uncertainties (Yousefi & Nykänen, 2016). The formulation of MPM into artificial intelligence tasks is an emerging domain of research because there are generally unsolved challenges, including: (1) the selection of algorithm, architecture and optimal parameters; (2) training data sufficiency, achieving class balance and representativity of negative labels; and (3) rigorous utilization of data science (Yousefi et al., 2021). There are further unsolved issues, e.g., the spatial transferability of models and performance expectations, and specifically the lack of geostatistical learning algorithms (e.g., Hoffimann et al., 2021).

An absolute scientific requirement is that measurements or findings must be associated with a notion of uncertainty. Here, we summarize uncertainty in MPM products strictly in the data-driven domain. Data-driven MPM products exhibit at least two classes of uncertainty: (1) aleatoric, or data related; and (2) epistemic, or model related (Zuo et al., 2021). Aleatoric uncertainty arises from imperfect data, and can be further categorized into components related to the spatial or variable domains. It is important to distinguish between “data quality” as it is referred to in the MPM literature (e.g., Zuo et al., 2021) and in data management (e.g., DAMA DMBOK; Henderson et al., 2017). The former aligns more closely with GIS terminology, while the latter is a broader framework that aligns with data management and analytics. The latter definition is constructed using the delineation of roles along the data lifecycle, by treating data as a product (e.g., similar to a car from design to manufacturing) that is passed from data generators to users. In this definition, characteristics of data from generation to management that adversely impact analytics performance are considered quality-related. We hereafter standardize the terminology per DAMA’s definition (Henderson et al., 2017) because this study considers data-driven MPM using machine learning algorithms as a specific use-case of artificial intelligence, which is a form of advanced analytics. Therefore, completeness, resolution and availability are part of data quality, which differs from the GIS definition (Zuo et al., 2021). In this definition, spatial domain uncertainty can be affected by (non-exhaustively): coverage and coverage rate, especially relative to the areal coverage of litho-diversity classes, and imprecision of coordinate measurements. Variable domain uncertainty can be affected by (non-exhaustively): metrological characteristics, such as measurement accuracy and precision; sampling practices; and geostatistical characteristics, such as the nugget and random noise effects. Epistemic uncertainty relates to how models are typically undercomplete relative to the full behavior of mineral systems, due to model capability, availability of algorithms, knowledge of target behavior and naturally irreducible variability (e.g., complex system behavior). For example, algorithms that result in models with greater degrees of freedom, such as neural networks, are capable of modeling more complex phenomena than simpler algorithms, such as linear regression.

In addition to the preceding classes of uncertainty, there is an additional class (3) that captures judgment-related uncertainty, which is the result of cognitive heuristics, experiential bias and other primarily user-related effects, in a GIS context (Zuo et al., 2021). This class has no direct equivalent in a pure data science framework (Fig. 1) because that framework was purposefully created to remove user judgment by selecting models based on objective metric scores and permitting workflow modulations to achieve higher scores. A metric-driven search for model candidates creates a collection of models, each containing error that is decomposable into bias, variance and irreducible error (Kohavi & Wolpert, 1996). However, judgment-related uncertainty exists in the geodata science framework (Fig. 1) because the spatial basis for model selection is EIS or GIS based (e.g., Yousefi et al., 2021), but model construction through hyperparameter tuning is not. Therefore, performance as assessed in the variable domain using common scoring metrics is decoupled from spatial characteristics or performance. Consequently, it is possible to produce equiprobable models whose performance in the data science framework is comparable, but whose spatial domain characteristics are not. This is compounded by the fact that only a tiny minority of models per workflow are assessed in the geospatial sense (e.g., the vast majority of models explored through grid search are never assessed in the spatial domain). Therefore, variations in workflow design create uncertainty in MPM products because user judgment or perceived best practices (e.g., interpolation) guide the implementation of at least some stages of the geodata science framework (Fig. 1). In addition, workflow tuning based on metric scores implies that models are produced using a workflow-wide gradient descent toward the best possible combination of choices. However, given the large exploration space for workflow design, reaching the global minimum is impossible in general; instead, equiprobable models correspond to local minima in workflow design. This type of uncertainty would not be an issue if spatial domain characteristics were monotonically dependent on variable domain performance, in which case optimization in one domain would be equivalent to optimization in the other. However, this has never been examined empirically, nor is it theoretically obvious. Presently, it is unknown how typical variability in the construction of data science-based MPM workflows translates into spatial characteristics.

Within the domain of supervised artificial intelligence-based MPM, we consider algorithm selection to be the biggest documented source of variability. This is because performance evaluations are typically made by experimenting with algorithm selection while controlling other portions of the workflow. This is exemplified by a sizable body of literature whose raison d’être is to demonstrate the effectiveness of novel algorithms in MPM tasks (Chen & Wu, 2017; Xiong et al., 2018; Chen et al., 2020; Wang et al., 2020; Yang et al., 2022; Yin & Li, 2022; Zuo et al., 2022; Gharehchopogh et al., 2023; Li et al., 2023; Yin et al., 2023). There have also been recent efforts to examine specifically this type of uncertainty in a GIS knowledge-driven framework (Daviran et al., 2022). The range of all possible algorithms is unknowable because there are emerging algorithms and variations of existing ones, either as architectural modifications (e.g., changes in neural network architecture) or as add-ons (e.g., optimization algorithms; Chen et al., 2020; Yin & Li, 2022; Gharehchopogh et al., 2023). An empirical analysis revealed that algorithms used by various authors include (non-exhaustively): Bayes networks (Porwal et al., 2006; Yin & Li, 2022); logistic regression (Agterberg & Bonham-Carter, 1999; Carranza & Hale, 2001; Karbalaei Ramezanali et al., 2020; Lin et al., 2020; Zhang et al., 2022c); support vector machines (Zuo & Carranza, 2011; Zhang et al., 2021; Senanayake et al., 2023); tree-based methods, such as random forest, extra trees and XGBoost (Chen & Wu, 2019; Sun et al., 2019; Zhang et al., 2022a); artificial neural networks, such as extreme learning machines (Chen & Wu, 2017); deep learning methods (Xiong et al., 2018; Wang et al., 2020; Yang et al., 2022; Zuo et al., 2022, 2023; Li et al., 2023; Yin et al., 2023; Zuo & Xu, 2023); and reinforcement learning (Shi et al., 2023). There are also applicative MPM studies that employed ensemble learning, which is an approach to improve outcome reliability by integrating the output of multiple independent models (e.g., Senanayake et al., 2023; Shetty et al., 2023). The diversity of algorithms reflects a diversity of practitioners, scales of MPM tasks, computational and data capabilities, and algorithmic complexity choices, all of which emphasize the experimental nature of data-driven MPM. Algorithm choice primarily controls the explanatory power of the model (but also model complexity and generalizability), which implies that modulating the algorithmic choice impacts epistemic uncertainty (Hüllermeier & Waegeman, 2021).

A second major source of variability in data-driven workflows is the construction of covariates, which are referred to as ‘evidence layers’ in the geospatial realm. Covariate data can range from mono-disciplinary, such as geochemical (e.g., Zhang et al., 2022b) or spectral (e.g., Nwaila et al., 2022), up to sizable combinations of qualitative and quantitative geoscientific data (e.g., Lawley et al., 2022). The variability encountered in covariate construction falls into two major categories: (1) the number of covariates; and (2) their quality. Category (1) variability primarily controls the threshold between overfitting and underfitting, which differs on a per-algorithm basis (through the curse of dimensionality; Jia et al., 2022; Márquez, 2022), which further implies that the selection of data and feature engineering impacts algorithm selection. The presence of category (2) variability is theoretically easy to deduce because factors such as the support type (in a geostatistical sense), the quality of interpolation, the data coverage and coverage rate (e.g., density of samples and their distribution over space), as well as metrological properties (e.g., measurement accuracy and precision), all impact the quality of models. The impact of data quality on models is a part of aleatoric uncertainty (Hüllermeier & Waegeman, 2021). However, the impact of data quality on downstream MPM products is not fully understood because such interactions are complicated to model in general and, even for specific applications, require repeated executions of the data science framework through uncertainty propagation (Fig. 1) (e.g., Yang et al., 2023). Consequently, pre-data modeling processes typically occur in a feed-forward fashion. For example, spatial interpolation of sparse data into evidence layers is often performed using heuristics or best practices, although the rationale is inconsistently documented (e.g., Senanayake et al., 2023). Creation of evidence layers in this manner is considered feed-forward because much of the best practice predates the coupling of traditional geoscientific data to artificial intelligence algorithms. Consequently, it is not generally known whether such best practices remain optimal in a data re-purposing context because data usage methods have changed. Purely data-related workflow variability is outside the scope of this study because this study uses a published and rasterized dataset. For example, however, varying how data are collected and processed clearly contributes to aleatoric uncertainty and could theoretically be varied to modulate its effects on MPM products (e.g., partially via simulations, see Yang et al., 2023). A pre-data modeling task that is often performed is dimensionality reduction, because of the high dimensionality of evidence layers relative to the paucity of training labels (in supervised approaches). If left unattended, using the native number of evidence layers can lead to overfitting because the degrees of freedom of complex models are not constrained by a statistically robust number of samples. Feature space dimensionality reduction is performed using algorithms that range from knowledge-based feature engineering and simple principal component analysis to knowledge-constrained variational autoencoders (e.g., Zuo et al., 2022; Senanayake et al., 2023). Technically, the modulation of feature space dimensionality impacts both variability categories (1) and (2) because this process affects both the number of features available for predictive modeling and the quality of the features, through a controlled loss in feature-explanatory power.

Popular choices of performance metrics include those that measure model discrimination power, such as the area under the curve of the receiver operating characteristic (AUC-ROC), and those that measure the quality of predictions, such as the accuracy and F1 metrics. Metrics are intended to be a robust criterion to guide model selection or hyperparameter tuning. However, in the geodata science extension (Fig. 1), this intention is weakened because spatial domain metrics, such as the prediction-area curve (Yousefi & Carranza, 2015a, 2015b), do not directly guide hyperparameter optimization (or model selection; e.g., Yin & Li, 2022). Part of the issue results from the fact that training datasets are not necessarily (and generally are not) spatially contiguous and of sufficient extent, such that spatial characteristics could be robustly determined during hyperparameter optimization. Consequently, spatial metrics are typically computed on the testing dataset (or more appropriately, the whole dataset), which means that spatial performance assessed post-hoc of hyperparameter optimization cannot be used to rigorously guide model selection, but only to assess model performance in the spatial domain (e.g., Yin & Li, 2022). Therefore, the relationships between the choice of variable domain metrics and spatial characteristics are not clear. The guidance for choosing feature space dimensionality is similarly weak because the extent to which the explanatory power of covariates is preserved during feature engineering lies along a smooth continuum. Once this choice is made, it is not generally documented to have been revisited explicitly, which implies that the impact of this choice on downstream metrics is not generally intelligible. Consequently, the complex interactions between feature space dimensionality and algorithm performance, particularly the relationships between variable domain performance (in a pure data science framework) and spatial domain performance (in a geodata science framework), remain to be explored.
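As a concrete illustration of how these variable domain metrics behave, the following minimal sketch (assuming scikit-learn and synthetic placeholder labels and scores) computes the AUC-ROC, which is threshold-invariant, alongside accuracy and the class-weighted F1 score, which require a hard classification threshold.

```python
# Minimal sketch of the variable domain metrics discussed above (synthetic toy data).
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                             # placeholder binary labels
y_score = np.clip(0.35 * y_true + 0.65 * rng.random(1000), 0, 1)   # scores correlated with labels

# AUC-ROC is computed directly from the scores and is classification-threshold-invariant.
auc = roc_auc_score(y_true, y_score)

# Accuracy and class-weighted F1 require a hard threshold (0.5 here, for illustration).
y_pred = (y_score >= 0.5).astype(int)
acc = accuracy_score(y_true, y_pred)
f1w = f1_score(y_true, y_pred, average="weighted")

print(f"AUC-ROC = {auc:.3f}, accuracy = {acc:.3f}, weighted F1 = {f1w:.3f}")
```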

Methodology

Data Sources

The dataset employed in this study was published and made digitally available by Lawley et al. (2021) as a single file, or as a compilation of GIS-compatible files from McCafferty et al. (2023) (Fig. 2). The dataset was managed to facilitate its re-use and re-purposing by data scientists because the master data, metadata and reference data files are fully machine readable, and full documentation exists to guide non-geoscientists in understanding the data. By employing this dataset, we were able to experimentally control the data collection to preparation stages of the data science workflow. Here, we provide a description of the dataset (Fig. 2). Bedrock geological data for the United States were sourced from the State Geologic Map Compilation (SGMC) digital database covering the conterminous United States (Horton et al., 2017), as well as the Geologic Map of Alaska (Wilson et al., 2015). In the case of Canada, the geological datasets are a compilation of 20 previously published national, provincial and territorial geological map databases (Lawley et al., 2021). It is important to note that the geological data from the United States and Canada should be considered a collection of individual bedrock geology maps with highly variable mapping scales (1:50,000 to 1:5,000,000) and boundary artifacts (e.g., induced by mapping subjectivity). In contrast, geological data for Australia are seamless and were extracted from the 1:1 million scale national bedrock geology database (Raymond et al., 2012). Lithological information from all 23 source maps was re-categorized into four main types (sedimentary, igneous, metamorphic, and other) and 31 subtypes for the purposes of prospectivity modeling (Fig. 2). New data dictionaries were also used to identify the presence or absence of up to 17 geological properties (such as coarse clastic, fine clastic, calcareous, carbonaceous, evaporitic, cherty, red beds, sedimentary, ultramafic to mafic composition, intermediate composition, felsic composition, pegmatitic, alkalic, igneous, schistose, gneissose, and anatectic) from the available geoscientific text data.

Figure 2

Geological maps of Canada and the United States (top), and Australia (bottom). Figure was modified from Lawley et al. (2021). The H3 cells are color-coded using the hierarchical rock classification developed as part of Lawley et al. (2021), with data sourced from multiple national, state, provincial/territorial databases. Rock subtypes were used during mineral prospectivity mapping as mappable proxies for the sources and traps of mineral systems. Known Pb–Zn Mississippi Valley-type and clastic-dominated deposits and mineral occurrences used for training are shown for reference

Geochronological data were reformatted and compiled from the 23 geological map compilations described above. They were then combined with plate tectonic models to estimate the paleo-latitude and -longitude of rocks at the time of their deposition or emplacement (Scotese, 2021). The fault compilation was based on global sources (Chorlton, 2007; Styron & Pagani, 2020), national databases (Raymond et al., 2012), and the 1:5,000,000 scale Geologic Map of North America digital database (Reed et al., 2005) (Fig. 2). Duplicate faults from these different data sources were not removed prior to converting fault traces into a proximity surface. Proximity calculations were also performed for passive margins and the point locations of carbonaceous sedimentary rocks (e.g., black shales) using the global compilations from Bradley (2008) and Granitto et al. (2017), respectively.

Geophysical datasets represent the second major source of information for prospectivity modeling. Seismic datasets were sourced from a range of survey types, including active controlled-source seismic refraction and passive teleseismic surveys. Depth estimates for the seismogenic Moho were extracted from national datasets specific to each region, i.e., Canada (Schetselaar & Snyder, 2017), the contiguous United States (Shen & Ritzwoller, 2016), Alaska (Zhang et al., 2019), and Australia (Kennett et al., 2011). Global models, such as Szwillus et al. (2019), Reguzzoni & Sampietro (2015), and Laske et al. (2013), were also used as secondary sources for Moho depth estimates. Seismic velocity data for the upper mantle and estimates for the lithosphere–asthenosphere boundary were sourced from studies conducted by Debayle et al. (2020), Hoggard et al. (2020), and Priestley et al. (2018). The spatial resolution of the source seismic datasets is highly variable (e.g., seismic stations may be separated by 10–100s of km), but the depth estimates for the seismogenic Moho and lithosphere–asthenosphere boundary were expected to be within a few kilometers of their true values (Snyder et al., 2018).

Gravity datasets were sourced from satellite, airborne, and ground-based surveys. Satellite gravity data are based on the European Space Agency (ESA) Gravity field and steady-state Ocean Circulation Explorer (GOCE) mission and its derivative products, which focus on highlighting the curvature and shape of gravity anomalies, as described in Ebbing et al. (2018). The Bouguer anomaly grid is based on national gravity databases for Canada, Alaska, the contiguous United States, and Australia (Phillips et al., 1993; Saltus et al., 2008; Geological Survey of Canada, 2017; Lane et al., 2020). Individual stations within this gravity measurement network tend to be distributed every few kilometers but can be more widely spaced in the more remote parts of North America and Australia. Multiple derivative products were then calculated to highlight density variations at different crustal depths, including the horizontal gradient magnitude (HGM) and an upward-continued version of the Bouguer anomaly grid. In addition to these grids and their derivatives, we identified points tracking the maxima of the HGM values for both the Bouguer gravity and its upward-continued field, following the approach outlined by Phillips (2007). The proximity to these tracks, often referred to as gravity “worms”, was used as a proxy for identifying steeply dipping geological structures and the contacts between rocks with varying densities.

Magnetic datasets were combined from both modern sources (Alaska Division of Geological and Geophysical Surveys, 2016; Miles & Oneschuk, 2016) and legacy surveys, such as the North America magnetic map (Ravat et al., 2009), to create an updated and merged residual magnetic anomaly map for North America. The spatial resolution of these source geophysical surveys is highly variable, and, in the case of aeromagnetic data, can include line spacings ranging from hundreds of meters to several kilometers. A differential RTP (reduction-to-pole) methodology included with the Geosoft Montaj software was used to make the adjustment for the magnetic north pole, as described by Arkani-Hamed (2007). An identical workflow was followed to produce an RTP grid and its derivative products from the residual magnetic anomaly map of Australia. In addition, several derivative products were calculated from the RTP grids, following the procedures outlined in Phillips (2007). These calculations included the first vertical derivative (1VD) to assess and map magnetic variations in the shallow crust. Shallow magnetic sources were further highlighted by calculating the horizontal gradient magnitude (HGM) of the pseudo-gravity field and the proximity to the paths of HGM maxima, often referred to as magnetic “worms”. Deep magnetic sources are based on long-wavelength anomalies from the RTP grids using a matched filtering technique described by Syberg (1972) and implemented with code developed by Phillips (2007). These long-wavelength RTP anomalies were then transformed into a pseudo-gravity field, and the HGM was calculated from this field to accentuate the boundaries and edges of deep magnetic sources. The proximity to long-wavelength “worms,” which trace the HGM maxima, served as proxies for mapping the edges of deep magnetic sources within the deep-to-mid-crust.

The dataset was published using the H3 Discrete Global Grid System (DGGS), an open-source library developed by Uber Technologies Inc. (2020), to convert the latitude and longitude coordinates of the source geological and geophysical datasets into unique H3 addresses (Fig. 2). The H3 DGGS is a hierarchical and hexagonal grid system, which offers global coverage at multiple resolutions. For spatial indexing, this study used resolution 7 of the H3 DGGS, resulting in 98,825,162 unique H3 addresses worldwide. At this resolution, the average hexagon covers an area of 5.16 km2, with an average edge length of 1.22 km.
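For readers unfamiliar with H3 indexing, the following minimal sketch (assuming the h3-py package; the coordinates are illustrative and the function name differs between major package versions) shows how a latitude–longitude pair maps to a resolution-7 H3 address.

```python
# Minimal sketch: indexing a point to an H3 resolution-7 cell (illustrative coordinates).
import h3

lat, lng, resolution = 62.45, -114.37, 7

try:
    cell = h3.latlng_to_cell(lat, lng, resolution)  # h3-py v4 API
except AttributeError:
    cell = h3.geo_to_h3(lat, lng, resolution)       # h3-py v3 API

print(cell)  # a unique H3 address (hexadecimal string)
```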

Data Processing

Data pre-processing was trivial because of the high quality of the dataset. There were a total of 82 evidence layers, of which 31 contained null values for at least one data record. These layers were removed because this study does not aim to examine the variability introduced by imputation methods, which is a part of data preparation. Additionally, the intention of this study was not to provide updated prospectivity maps, which was done by Lawley et al. (2022) for the exact same dataset. Consequently, the employed data pre-processing may result in a loss of model performance or spatial realism of the derived maps, under the conditions that: (1) data imputation is feasible and benefits predictive modeling; and (2) the affected data layers are useful additional covariates of the target label. After the removal of partially empty evidence layers, the data were re-scaled to span an equal range. The labels in the dataset describe two types of Zn–Pb deposits, that is, MVT and clastic-dominated (CD) deposits. For the purpose of our study, we wished to maximize the amount of training data available. Therefore, it was irrelevant to the study’s purpose whether the targets were MVT or CD (or any other deposit type). Consequently, we merged the two deposit types into a single class, implying that MPM targets all Zn–Pb deposit types. Training data were purposely generated to achieve class balance. Hence, there were a total of 2474 positive labels and the same number of negative labels, which were randomly sampled from the much larger pool of presumed negative labels. However, the negative training labels were kept consistent across workflow variations to control training data variability. This was sufficient for the purpose of pure (as opposed to applied) MPM research to understand workflow variability, excluding variability induced by data label generation. In addition, with the exception of class balancing, this was identical to the original methodology used in Lawley et al. (2022) because true negatives (areas devoid of mineralization) are generally unavailable across all areas, due to limited borehole coverage.
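A minimal sketch of this pre-processing, assuming the compiled evidence layers and deposit labels are available as a flat table, is given below. The file name and label column names are hypothetical placeholders rather than the actual identifiers in the published dataset.

```python
# Sketch of the pre-processing described above (file and column names are placeholders).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("evidence_layers_h3.csv")           # hypothetical export of the H3-indexed dataset
label_cols = ["MVT", "CD"]
feature_cols = [c for c in df.columns if c not in label_cols]

# 1) Remove evidence layers containing null values for at least one record.
complete_cols = [c for c in feature_cols if df[c].notna().all()]

# 2) Re-scale the remaining layers to span an equal range.
X = MinMaxScaler().fit_transform(df[complete_cols])

# 3) Merge MVT and clastic-dominated labels into a single positive class.
y = ((df["MVT"] == 1) | (df["CD"] == 1)).astype(int)

# 4) Balance classes: keep all positives and randomly sample an equal number of
#    presumed negatives with a fixed seed, so negatives are identical across workflows.
pos_idx = y[y == 1].index
neg_idx = y[y == 0].sample(n=len(pos_idx), random_state=42).index
train_idx = pos_idx.union(neg_idx)
```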

Data Science Workflow and Algorithm Descriptions

The workflow contained three key stages related to the experimental variables: (1) feature extraction and feature space dimensionality modulation using autoencoders; (2) predictive modeling using machine learning algorithms; and (3) statistical analysis and geospatial visualization. For (1), there is a collection of algorithms in the data-driven domain that are theoretically useful (Jia et al., 2022), but research on their empirical effects on MPM is exceedingly rare (Wang et al., 2022). For our purpose, the experimental constraint to maximize the amount of information contained in the covariates at every feature space dimensionality implies that we require a general-purpose feature extraction and engineering algorithm that is agnostic to downstream algorithms. Therefore, we cannot rely on feature elimination methods because their outcomes are optimized for downstream algorithms. Other methods popular in MPM include mainly principal component analysis (PCA) (e.g., Wang et al., 2022). However, PCA is not the most general because it assumes linearity of relationships in data, which, if violated, can lead to counter-productive results (Abbott, 2014). Consequently, for dimensionality reduction, we employed autoencoders, which are generalizations of PCA, because, among other characteristics, they make no assumptions about the types of relationships exhibited by data (Kramer, 1991). Therefore, we ensured that features extracted from the evidence layers were maximally compact, de-correlated (linearly and nonlinearly), information-dense and useful for all downstream algorithms. This enabled us to ensure that feature space dimensionality could be explicitly controlled and that information was maximally retained at any feature space dimensionality. Consequently, we were able to modulate feature space dimensionality by changing the number of extracted features to study its impact on algorithms and maps.

Autoencoders are a general-purpose dimensionality reduction tool based on artificial neural networks (ANNs), used in the context of unsupervised feature extraction via deep learning. The classic function of autoencoders is to encode data into a compact latent space representation, such that the reconstructed data and input data are as similar as possible (Kramer, 1991). ANNs mimic biological neural networks and feature multiple layers of varying numbers of neural nodes, which, in the case of feed-forward ANNs, are connected in a single direction from the input side of the network to the output. Abstract information flows along the connections. At each node, an activation function computes the nodal output given the sum of all weighted inputs from other nodes. Typical activation functions include the rectified linear (relu), hyperbolic tangent (tanh), logistic and linear functions. Where autoencoders are used for dimensionality reduction, they are a generalization (e.g., nonlinear and non-parametric) of PCA (Kramer, 1991). The bottleneck of the autoencoder, which is the narrowest layer of the ANN, also known as the coding layer, is purposely designed to be narrower than the dimensionality of the input data, such that the typical autoencoder is undercomplete—meaning that the neural network is unable to fully replicate all information contained in the data. Hence, by modulating the coding layer size, autoencoders can discard variable amounts of information in the data. This behavior is similar to truncating the number of principal components. With decreasing coding layer size, reconstructed data exhibit an increasing loss of noise and of data and physical anomalies, and increasingly retain only the more common patterns. Consequently, known uses of autoencoders include image denoising, geochemical anomaly detection and seismic data denoising (e.g., Gondara, 2016; Luo et al., 2020; Saad & Chen, 2020, respectively). Key hyperparameters of ANNs include the number of hidden layers and their sizes, the choice of the activation function, the ability to stop training early if performance fails to improve (early stopping), the regularization strength, and the random number generator seed that controls the initialization of all connection weights.
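A minimal sketch of an undercomplete autoencoder with a tunable coding layer is shown below. It uses scikit-learn's MLPRegressor trained to reconstruct its own input as a lightweight stand-in for a dedicated deep learning framework; the layer sizes and placeholder data are illustrative rather than the tuned architecture reported in the results, and extracting the coding-layer activations themselves would require a manual forward pass through the encoder half (not shown).

```python
# Sketch: an undercomplete autoencoder whose coding-layer size can be modulated.
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_autoencoder(X, coding_size, seed=0):
    """Fit a symmetric encoder-decoder MLP that reconstructs its input."""
    n_features = X.shape[1]
    hidden = (n_features // 2, coding_size, n_features // 2)  # narrow to the coding layer, then widen
    model = MLPRegressor(hidden_layer_sizes=hidden,
                         activation="tanh",
                         alpha=1e-4,            # L2 regularization strength
                         early_stopping=True,   # stop when the validation score stops improving
                         random_state=seed,     # controls weight initialization
                         max_iter=2000)
    model.fit(X, X)                             # target = input, i.e., reconstruction
    return model

# Illustrative usage on placeholder data with 51 re-scaled evidence layers.
X = np.random.default_rng(0).random((500, 51))
ae = fit_autoencoder(X, coding_size=10)
X_hat = ae.predict(X)                           # reconstructed data
```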

Shallow learning algorithms are suitable for predictive modeling but are incapable of automated feature extraction. Unlike deep learning, their usage necessitates discipline-specific expertise and knowledge to craft effective features for the task and algorithms considered. In this study, we explored a variety of common shallow learning algorithms with a range of algorithmic complexity that include: logistic regression (LR; Cramer, 2002); k-nearest neighbors (kNN; Tikhonov, 1943; Fix & Hodges, 1951; Cover & Hart, 1967); Gaussian process (GP; Rasmussen & Williams, 2006; Kotsiantis et al., 2007); artificial neural networks (when not used for deep learning, ANN; Curry, 1944; Rosenblatt, 1961; Rumelhart et al., 1985; Hastie et al., 2009; Lemaréchal, 2012); support vector machines (SVM); random forest (RF) and adaptive boosting of decision trees (AdaBoost or AB; Ho, 1995; Breiman, 1996a, 1996b; Freund & Schapire, 1997; Breiman, 2001; Kotsiantis, 2014; Sagi & Rokach, 2018); and extra trees (ET; Geurts et al., 2006). It is worth mentioning that ANN can be used for both shallow and deep learning purposes and, where it is used for both feature extraction and predictive modeling, the combination is considered deep learning.

In this study, we forgo an exhaustive technical description of the machine learning algorithms for brevity, but provide a general description of them and their key hyperparameters. The LR algorithm is a parametric method that attempts to fit data labels to a logistic function. In the classic case, the output of the model is binary, which suits classification tasks. Its key hyperparameters include the type of regularization, which could be none, the L1-norm, the L2-norm or a combination (elastic net) thereof. The strength of regularization is controlled by a parameter C. The kNN algorithm is a non-parametric method that predicts the label of a sample using an average of similar samples, where similarity is computed through a distance metric (typically the L2-norm). Its key hyperparameters include the number of neighbors (k) and the distance weighting scheme, which may be uniform or inversely weighted by distance. GP is a non-parametric, feature interpolation-based method that probabilistically assigns labels to samples by attempting to model the underlying stochastic process using a kernel. The kernel specifies the covariance function of the data. The type of kernel and its parameters are key hyperparameters. Common kernel choices include the radial basis function (RBF), which has two typical parameters that specify the length scale and amplitude. The SVM algorithm is a non-probabilistic linear method that attempts to maximize the separation distance between classes. Because it can be formulated using inner products, it is often used with kernels to effectively expand the dimensionality of the problem without data transformation and seek a decision boundary in a higher dimensional space, which, when projected back to the original dimensions of the problem, becomes a nonlinear decision boundary. Key SVM hyperparameters include: the choice of the kernel, which could be RBF, linear or polynomial; C, which defines a penalty for misclassifying support vectors, with higher values increasing the decision boundary complexity; and γ, which specifies the nonlinear kernel’s coefficient.

Ensemble methods include RF, AB and ET. Although technically any weak classifier could be ensembled to produce a strong classifier, RF, AB and ET typically refer to ensembles of decision trees, which are flowchart-like hierarchical structures that recursively partition the data. Learning occurs by growing the trees through splitting branches (at nodes). The splitting is metric-driven to maximize the difference between the resulting leaves. RF utilizes bagging to create a forest of de-correlated trees, from which the averaged output is the prediction. De-correlation occurs through bootstrap sampling of the data and random subsetting of features. The maximum number of features per tree, the splitting metric, the number of trees, and the minimum number of samples per split are model hyperparameters, in addition to the tree depth parameter that is inherited from decision trees. AB uses adaptive boosting (Freund & Schapire, 1995), which is a type of predictor–corrector algorithm in the sense that trees are constructed through adaptation, which adjusts the sample weights for subsequent trees to focus on cases that are more difficult. The output of AB is a weighted sum of the individual trees. In addition to the hyperparameters shared with RF, the rate of adaptation is an AB-specific hyperparameter. ET is similar to RF but randomizes candidate split thresholds instead of optimizing them, retaining the best of the random thresholds as the splitting rule. The hyperparameter grid for all algorithms is provided in Table 1. The hyperparameter ranges are not intended to be exhaustive but are intended to cover the probable usage of each algorithm in this task. In addition, the hyperparameter ranges were iteratively refined by observing model choices (e.g., if a model was consistently at the end of a parameter range, that parameter range was extended, if possible, to increase model diversity).
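Table 1 contains the grids actually used; the snippet below is only a hedged sketch of how such grids might be expressed for a few of the algorithms using scikit-learn estimators, with illustrative values that are not taken from Table 1.

```python
# Illustrative hyperparameter grids for a subset of the algorithms (values are examples only).
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

param_grids = {
    "LR": (LogisticRegression(max_iter=5000),
           {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10], "solver": ["liblinear"]}),
    "SVM": (SVC(probability=True),
            {"kernel": ["rbf", "linear"], "C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1]}),
    "RF": (RandomForestClassifier(),
           {"n_estimators": [100, 300], "max_depth": [None, 5, 10],
            "max_features": ["sqrt", "log2"], "min_samples_split": [2, 5]}),
}
```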

Table 1 Hyperparameter grid for all algorithms

Model selection occurs through cross-validation, whose strategy is an experimental control. Because the data were abundant, we employed four-fold cross-validation. We used a variety of metrics to assess model performance, which included accuracy, the class-weighted F1 score and the AUC-ROC (Fawcett, 2006). The F1 score is the harmonic mean of precision and recall. The ROC (receiver operating characteristic) is a curve that describes the relationship between the true positive rate and the false positive rate over a range of discrimination thresholds, whose integral is the AUC (area under the curve)-ROC score. In essence, the AUC-ROC measures the degree of cross-over (overlap) between the predicted scores of the two classes given a model. However, the AUC-ROC metric can give misleading results in class-imbalanced cases. Consequently, the AUC-ROC metric is better for examining the quality of a model, whereas the class-weighted F1 score is robust to class imbalance and is better for assessing the quality of predictions (e.g., outputs of a deployed model). AUC-ROC scores are easy to interpret; e.g., a score of 0.5 implies that the model is no better than a random guess. The AUC-ROC score is desirable for classification tasks because it is classification-threshold-invariant and describes the model's discrimination quality. This further implies that MPM models derived from different implementations of the data science framework can be compared purely on the basis of their ability to separate prospective and non-prospective sites, regardless of the details of the workflow. However, the classification-threshold invariance can lead to issues with the usability of prospectivity maps because the investment cost associated with false negatives is zero, compared with the non-zero investment cost associated with false positives. Because we controlled class imbalance by purposely constructing two identically sized positive and negative label sets, the effect of class imbalance is minimized for all performance metrics. A summary of the MPM workflow, particularly as it pertains to the experimental controls and variables within the workflow, is shown in Figure 3.
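A minimal sketch of this model selection step, assuming scikit-learn's GridSearchCV and the illustrative grids sketched above, is given below; the scorer strings 'roc_auc' and 'f1_weighted' correspond to the two tuning metrics discussed, and X_train and y_train are placeholders.

```python
# Sketch: four-fold cross-validated model selection under a chosen tuning metric.
from sklearn.model_selection import GridSearchCV

def select_model(estimator, grid, X_train, y_train, metric="roc_auc"):
    """Return the best estimator and its cross-validated score under `metric`
    ('roc_auc' or 'f1_weighted')."""
    search = GridSearchCV(estimator, grid, scoring=metric, cv=4, n_jobs=-1)
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_score_

# Illustrative usage with the grids sketched earlier (placeholder training data):
# best_svm, cv_score = select_model(*param_grids["SVM"], X_train, y_train, metric="f1_weighted")
```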

Figure 3

Illustration of the experimental MPM workflow, depicting the experimental controls and variables, as well as the result analysis methods. The modulated experimental variables and their choices are provided in set notation (members in curly brackets)

Results

Feature Extraction

The optimized autoencoder within the grid range has the structure (102, 51, 34, 25, nodes in the coding layer, 25, 34, 51, 102) (Fig. 4), with early stopping enabled, a regularization strength of 0.0001 and the tanh activation function. The coding layer size can be controlled to modulate the performance of the autoencoder. Varying the coding layer size from 25 to 5, with a total of 20 runs (with randomized initializations of the nodal weights) per size, reveals a gradual loss of autoencoder performance (Fig. 5). This loss has two distinguishable regimes, with a shallow trend from 25 to 10 and a steeper trend from 10 to 5 (Fig. 5). The reconstruction performance is often above a coefficient of determination (CoD) value of 0.95 for coding layer sizes above 11, which implies that less than 5% of the variability of the data was lost. This implies that the evidence layers are highly compressible and that about 11 fully de-correlated latent features contain about 95% of the original variability. It is important to recognize that the rate of information loss (determined through the CoD metric) is significantly slower than the rate of dimensionality loss, especially down to a coding layer size of 8 (see the shallow slope of the curve, Fig. 5). Consequently, the main difference imparted onto downstream algorithms is attributable more to changes in feature space dimensionality than to feature information-content.
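A hedged sketch of this coding-layer sweep, reusing the fit_autoencoder helper sketched in the methodology and measuring reconstruction quality with the coefficient of determination, is shown below; the run count and size range follow the description above, but the data array is a placeholder.

```python
# Sketch of the coding-layer-size sweep with repeated random initializations.
import numpy as np
from sklearn.metrics import r2_score

def sweep_coding_sizes(X, sizes=range(5, 26), n_runs=20):
    """Return the mean and standard deviation of the reconstruction CoD for each coding size."""
    results = {}
    for k in sizes:
        scores = []
        for seed in range(n_runs):
            ae = fit_autoencoder(X, coding_size=k, seed=seed)  # helper sketched earlier
            scores.append(r2_score(X, ae.predict(X)))          # CoD of the reconstruction
        results[k] = (float(np.mean(scores)), float(np.std(scores)))
    return results
```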

Figure 4

Network configuration for the optimized autoencoder. The coding layer size is shown as “x”. The black arrows abbreviate full connectivity between neurons in adjacent layers (fully connected layers)

Figure 5

Autoencoder’s performance as measured through the coefficient of determination (CoD) metric, as a function of the coding layer size of the autoencoder

To provide readers with a qualitative visualization of the latent data, we visualized two of the extracted features for coding layer sizes of 5 and 25 (Fig. 6). The features exhibited clear spatial variability that roughly followed geological terranes (e.g., around the Canadian Shield, across the interior plains and the Cordilleran range). However, the exact nature of the features was not geologically interpretable because they were complex nonlinear combinations and transformations of the input data. Moreover, it was theoretically clear and empirically observable (Fig. 6) that nonlinear combinations resulted in features exhibiting nonlinear spatial variability. This may also be the case for many evidence layers even prior to encoding, because many types of data are discontinuous across terranes, faults or other types of boundaries (e.g., geochronological age and rock type). Incidentally, this implies that linear geostatistical methods (including interpolation) are not generally reliable on nonlinear latent data.
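For completeness, a hedged sketch of how such a latent feature can be rendered over the H3 grid is given below: each cell address is converted back to its centroid and plotted as a colored point. The h3-py function names differ between major versions, and the inputs are placeholders.

```python
# Sketch: plot one latent feature over the H3 grid via cell centroids (inputs are placeholders).
import h3
import matplotlib.pyplot as plt

def cell_centroid(cell):
    try:
        return h3.cell_to_latlng(cell)   # h3-py v4 API, returns (lat, lng)
    except AttributeError:
        return h3.h3_to_geo(cell)        # h3-py v3 API, returns (lat, lng)

def plot_latent_feature(h3_addresses, feature_values, title="AE feature"):
    lats, lngs = zip(*(cell_centroid(c) for c in h3_addresses))
    plt.scatter(lngs, lats, c=feature_values, s=1, cmap="viridis")
    plt.colorbar(label=title)
    plt.xlabel("Longitude")
    plt.ylabel("Latitude")
    plt.show()
```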

Figure 6

Visualization of the autoencoder’s extracted features (AE features) 1 in (a) and (b), and 21 in (c) and (d) for coding layer sizes 5 and 25

Predictive Modeling

For each set of autoencoder features, from 5 to 25, we performed predictive modeling using all eight machine learning algorithms (Table 1), with model selection driven by either the AUC-ROC or the weighted F1 metric. This resulted in two sets of 168 optimal models with a range of performance (Figs. 7 and 8). Two categories of algorithms are evident based on the results—those whose performance was sensitive or insensitive to the number of features used. The first category consisted of SVM, LR, ANN and GP, all of which generally exhibited an increase in all metric scores up to about 20 features or coding layer nodes (Figs. 7b, f, g, and h and 8b, f, g, and h). Beyond a coding layer size of 20 nodes, model performance in this category plateaued or decreased slightly for models that were selected using the AUC-ROC metric (Fig. 7). For models selected using the weighted F1 metric, the trend was somewhat similar, but the transition was not as clear for some algorithms, such as the ANN algorithm, which demonstrated a performance plateau past 20 nodes (Fig. 8g). For other algorithms, the performance plateau was not reached until 21 or 22 nodes (e.g., AB and SVM; Fig. 8e and b, respectively) or slightly more (e.g., ET and GP; Fig. 8d and h). The second category consisted of kNN, RF, ET and AB. This category tended to exhibit the same model performance, with minor variance, regardless of the number of features used (Fig. 7a, c, d, and e). This is an important observation because, clearly, the strategy to perform dimensionality reduction depends on the choice of algorithm and, to a lesser extent, the choice of performance metric for performance assessment. An additional observation was that hyperparameter tuning using the AUC-ROC metric increased the AUC-ROC score of the best models for each algorithm, whereas tuning using the weighted F1 metric resulted in generally lower scores across all performance metrics used (Figs. 7 and 8). It is also worth noting that hyperparameter tuning using the weighted F1 metric in our study resulted in essentially identical metric scores for all metrics used, but this would not be expected where class imbalance existed.
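Conceptually, the experiment amounts to the nested loop sketched below (21 coding-layer sizes, eight algorithms in the actual study, and two tuning metrics). The sketch reuses the helpers from the methodology section; encode is a hypothetical stand-in for the encoder forward pass, and the training arrays are placeholders.

```python
# Conceptual sketch of the experiment loop (placeholder data and helpers).
results = []
for k in range(5, 26):                                     # coding-layer sizes 5..25
    ae = fit_autoencoder(X_train_full, coding_size=k)      # feature extraction
    Z_train = encode(ae, X_train_full)                     # hypothetical encoder forward pass
    for name, (estimator, grid) in param_grids.items():    # eight algorithms in the real study
        for metric in ("roc_auc", "f1_weighted"):
            model, cv_score = select_model(estimator, grid, Z_train, y_train, metric=metric)
            results.append({"coding_size": k, "algorithm": name,
                            "tuning_metric": metric, "cv_score": cv_score, "model": model})
```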

Figure 7

Performance of various algorithms with hyperparameters tuned using the AUC-ROC metric during predictive modeling, as a function of the number of autoencoder features used (as controlled by the coding layer size)

Figure 8

Performance of various algorithms with hyperparameters tuned using the weighted F1 metric during predictive modeling, as a function of the number of autoencoder features used (as controlled by the coding layer size)

Analyzing prospectivity maps solely in the spatial domain is challenging to standardize because there are no data science metrics for this task, and it is impossible to craft universal metrics because there is no universal notion of spatial quality. The closest metric is related to the prediction–area curve, which simultaneously assesses both spatial and variable domain characteristics (Yousefi & Carranza, 2015a, 2015b). The ratio of the class-dependent prediction rate (sites per class divided by the total number of sites) to its corresponding occupied area (area of the class divided by the total area) is known as the normalized density (Mihalasky & Bonham-Carter, 2001). Consequently, normalized density is a composite metric that is formulated like a posterior probability, with the numerator incorporating variable domain performance and the denominator the spatial selectivity of models. Prediction rate is somewhat comparable, but not identical, to the accuracy metric, and is incomparable to either the F1 or AUC-ROC metrics. This is a problem for our intended usage because we cannot explicitly and generally relate model quality in terms of purely data science metrics, which are always used to guide model construction, to model quality in terms of purely geospatial characteristics. Therefore, we extended the idea of the normalized density metric, as a ratio of two independent metrics (across domains), to incorporate typical data science metrics, eschewing prediction rate. It was also unnecessary for the numerator–denominator assignment to be kept the same as it was in normalized density. Spatial selectivity was assessed through the occupied area metric, which was calculated in this study as the fraction of sites predicted prospective (fraction prospective). Consequently, visualizing spatial selectivity as a function of metrics in the data science framework revealed the spatial evolution of models given workflow variability. Therefore, the spatial–variable domain coupling can be directly related to model tuning. An excessively overfitted model would predict few sites beyond the positive training labels as prospective. Therefore, it was possible to compare, among other things, the fitting quality of models in response to changes in feature space dimensionality between algorithms, by examining the changes in spatial selectivity vs. model performance. A physical upper bound on spatial selectivity (occupied area) is unknown because the total area of mineral deposits cannot be known, but heuristic order-of-magnitude guesses may be possible (e.g., Fermi’s piano tuner problem; Zipf’s Law; Merriam et al., 2004).
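A minimal sketch of the fraction prospective calculation, and of pairing it with a variable domain score in the spirit of the normalized density ratio, is given below; the fitted model, the feature matrix of all H3 cells and the threshold are placeholders.

```python
# Sketch: spatial selectivity as the fraction of all sites predicted prospective.
import numpy as np

def fraction_prospective(model, Z_all, threshold=0.5):
    """Fraction of all sites classified prospective (a proxy for occupied area)."""
    prob = model.predict_proba(Z_all)[:, 1]
    return float(np.mean(prob >= threshold))

# Illustrative pairing with a variable domain metric (placeholders for fitted model and data):
# selectivity = fraction_prospective(best_model, Z_all_cells)
# composite = test_auc / selectivity   # a normalized-density-style ratio, for illustration only
```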

In terms of spatial selectivity alone, the tree-based methods (group 1) exhibited the highest selectivity within the range of parameters studied, across all sizes of feature dimensions (Fig. 9). The moderately selective algorithms included kNN, ANN, SVM and GP (group 2); however, the variance of selectivity within this group was relatively large compared to that of the tree-based group (Fig. 9). The least selective group (group 3) consisted solely of LR, which did not overlap with any other group at any size of feature dimensionality (Fig. 9). Within groups, there was a common trend of increasing spatial selectivity with increasing coding layer size (feature space dimensionality), although at feature space dimensions beyond about 17, there was a gradual plateau and loss in selectivity for the kNN, ET, AB and RF algorithms, which was prominent for models selected using the weighted F1 metric (Fig. 9b). The net change in selectivity (average slope) differed between groups: the average slopes for the tree-based algorithms were shallower than those of the other groups, with the exception of the kNN algorithm, which was comparable (Fig. 9). To provide a robust summary, we performed a sensitivity analysis of the spatial variability induced by changes in workflow, which we quantified using the mean range metric (the mean of the absolute range of scores). The fraction prospective metric was most sensitive to the choice of algorithm, which resulted in a mean range of 0.15 across all feature space dimensions and model selection metrics. The second-most sensitive variable was the feature space dimensionality (coding layer size), with a mean range of 0.05 across all algorithms and model selection metrics. The least sensitive variable was the choice of tuning metric, with a mean range of 0.01 across all algorithms and feature space dimensions.
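
The mean range summary could be computed along the following lines, assuming a hypothetical results table with one row per optimal model; the column names are illustrative only, not the data structure used in this study.

```python
# Minimal sketch of the mean range sensitivity summary described above.
# 'results' is assumed to have columns: 'algorithm', 'n_features',
# 'tuning_metric', 'frac_prospective'.
import pandas as pd

def mean_range(results: pd.DataFrame, vary: str) -> float:
    """Mean absolute range of fraction prospective when only `vary` changes."""
    fixed = [c for c in ("algorithm", "n_features", "tuning_metric") if c != vary]
    ranges = results.groupby(fixed)["frac_prospective"].agg(lambda s: s.max() - s.min())
    return float(ranges.mean())

# Sensitivity of spatial selectivity to each workflow choice, e.g.:
# mean_range(results, "algorithm")      -> ~0.15 in this study
# mean_range(results, "n_features")     -> ~0.05
# mean_range(results, "tuning_metric")  -> ~0.01
```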

Figure 9

The fraction of sites that are deemed prospective (fraction prospective) as a function of the number of features used (as controlled by the coding layer size) for hyperparameter tuning using the AUC-ROC metric (a) and the weighted F1 metric (b)

The relationships between spatial selectivity and model performance (the numerator and denominator of the composite metric) were noisy and varied (for the AUC-ROC metric, see Fig. 10; the results were qualitatively similar for the weighted F1 metric). There were two distinct categories: (1) algorithms that exhibited clear relationships between spatial and variable domain characteristics, which included SVM, LR, ANN and GP (Fig. 10b, f, g, and h); and (2) algorithms that exhibited no strong relationship between spatial and variable domain characteristics, which included kNN, RF, ET and AB (Fig. 10a, c, d, and e). In effect, only some algorithms resulted in models whose spatial selectivity was sensitive to variable domain performance (the “spatial sensitivity issue”). In addition, the tree-based algorithms (RF, ET and AB) produced models that exhibited an inflection in their variable–spatial domain relationships, observable as curvature in the scatter plots (Fig. 10c, d, and e). The peak variable domain performance for the tree-based methods occurred at roughly the mid-range of spatial selectivity; away from the mid-range, performance was generally lower as measured through any performance metric. Therefore, the relationship between variable domain performance and spatial selectivity was not monotonic and exhibited at least one peak (the “selectivity ambiguity issue”). The existence of the peak is a telltale sign of gradual overfitting of the models that may not be detectable within heuristic designs of the workflow. Consequently, because of both the sensitivity and ambiguity issues, variable domain metrics cannot generally guide model selection toward a target degree of spatial selectivity.
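
One possible way to quantify such a coupling, not used explicitly in this study, is a rank correlation between spatial selectivity and a variable domain score per algorithm, as sketched below with the same hypothetical results table as above; a weak or non-monotonic relationship (as seen for the tree-based algorithms) would yield a low absolute correlation.

```python
# Minimal sketch: rank correlation between fraction prospective and an
# AUC-ROC score for all optimal models of one algorithm (hypothetical table).
from scipy.stats import spearmanr

def selectivity_performance_coupling(results, algorithm):
    """Spearman correlation between spatial selectivity and variable domain score."""
    subset = results[results["algorithm"] == algorithm]
    rho, pvalue = spearmanr(subset["frac_prospective"], subset["auc_roc"])
    return rho, pvalue

# e.g., selectivity_performance_coupling(results, "RF")
```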

Figure 10

Relationships between spatial selectivity as measured by the fraction of sites predicted as prospective (fraction prospective) and variable domain metric scores

Because MPM results in maps that are often used visually, it is also important to examine the resulting prospectivity maps qualitatively. For each set of maps resulting from a unique combination of the number of features and the model-tuning metric, there were observable differences between the best models of each algorithm (e.g., Fig. 11). The differences were greatest between the best and worst performing models, regardless of the performance metric used to tune the models (Fig. 9). To understand the degree of consensus within the two sets of 168 optimal models, we summarized each set using a linearly weighted average, with weights based on the tuning metric used (AUC-ROC or weighted F1 score). Weighted averaging seeks to reduce the disproportionate impact of the poorest performing models. This yields a set of maps that are not intended to be prospectivity maps per se (because this is not an applicative study), but are experimental ‘consensus maps’ (equivalent to a workflow uncertainty analysis). Values closer to 1 imply consensus for a positive label, values closer to 0 imply consensus for a negative label, and intermediate values indicate varying degrees of dissent or disagreement. To create a range in variable domain performance, we combined only models above a given performance level, as measured by their hyperparameter-tuning metric, producing three sets of maps: all models (cut-off threshold of 0.0) and only those scoring above 0.85 and 0.90 (Figs. 12 and 13). These thresholds were not classification thresholds, but performance criteria for model merging. It is visually clear that consensus was better at higher model performance, seen as a loss of intermediate-colored sites (colors near the center of the color bar) and an increase in bright (more yellow) and dark (deeper blue) sites (compare Fig. 12e with Fig. 12a). Regardless of the performance threshold, it was also visually obvious that consensus was significant in key prospective zones, which were consistently colorized yellow (e.g., western Canada extending into Alaska; Fig. 12). Conversely, consensus was poor, for example, in the western to central United States (Fig. 12).
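
A minimal sketch of this performance-weighted merging is given below; the map arrays, score vectors and helper names are illustrative assumptions, not the implementation used in this study.

```python
# Minimal sketch of the consensus-map construction described above: a
# linearly weighted average of per-model prospectivity maps, keeping only
# models whose tuning metric score exceeds a merge threshold.
import numpy as np

def consensus_map(maps, scores, threshold=0.0):
    """maps: iterable of binary (or probabilistic) prediction grids;
    scores: the tuning metric score of each corresponding model."""
    maps = np.asarray(list(maps), dtype=float)
    scores = np.asarray(scores, dtype=float)
    keep = scores > threshold
    weights = scores[keep] / scores[keep].sum()  # linear weighting by score
    return np.tensordot(weights, maps[keep], axes=1)  # values in [0, 1]

# e.g., consensus_map(model_maps, auc_scores, threshold=0.90)
```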

Figure 11

Visual comparison of the prospectivity maps produced using 19 features and the AUC-ROC metric. Results are shown for ANN in (a), RF in (b), GP in (c) and SVM in (d)

Figure 12

Weighted average maps (consensus) of all models using the AUC-ROC metric with a performance score: > 0.0 (a, d); > 0.85 (b, e); > 0.90 (c, f)

Figure 13

Weighted average maps (consensus) of all models using the weighted F1 metric with a performance score: > 0.0 (a, d); > 0.85 (b, e); > 0.90 (c, f)

It is visually discernible that models tuned using the weighted F1 metric were more spatially selective (e.g., Fig. 13f compared with Fig. 12f). Indeed, the quantitative difference between maps can be substantial (Fig. 14). For most algorithms, with the exception of GP and kNN, hyperparameter tuning using the weighted F1 metric resulted in the most spatially selective models across the range of feature space dimensionality (Fig. 15). Spatial selectivity also increased more rapidly with increasing AUC-ROC threshold for models tuned using the weighted F1 metric than for those tuned using the AUC-ROC metric (Fig. 16). This observation, combined with the observation that hyperparameter tuning using the weighted F1 metric resulted in generally lower AUC-ROC scores (Figs. 7 and 8), implies that although there was a rough relationship between spatial selectivity and model performance metric scores on average, it varied depending on the performance metric used and was algorithm dependent (Fig. 16). A strategy of tuning models to maximize the AUC-ROC score was therefore ineffective for creating models with the highest spatial selectivity for our workflow and dataset.
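
The two comparisons could be computed along the lines sketched below, re-using the hypothetical consensus_map() helper from the earlier sketch; the array shapes (one grid per model) and names are assumptions for illustration.

```python
# Minimal sketch: (1) a difference map between the AUC-ROC- and F1-tuned
# consensus maps, and (2) spatial selectivity as a function of the
# performance threshold used for model merging.
import numpy as np

def consensus_difference(maps_auc, scores_auc, maps_f1, scores_f1, threshold=0.90):
    """AUC-ROC-tuned consensus minus weighted-F1-tuned consensus (cf. Fig. 14)."""
    return (consensus_map(maps_auc, scores_auc, threshold)
            - consensus_map(maps_f1, scores_f1, threshold))

def selectivity_vs_threshold(maps, scores, thresholds):
    """Mean fraction prospective of all models above each threshold (cf. Fig. 16)."""
    maps = np.asarray(list(maps), dtype=float)
    scores = np.asarray(scores, dtype=float)
    out = []
    for t in thresholds:
        keep = scores > t
        frac = maps[keep].mean(axis=(1, 2)).mean() if keep.any() else np.nan
        out.append((t, frac, int(keep.sum())))
    return out  # (threshold, mean fraction prospective, number of models)
```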

Figure 14

Differences in the consensus maps produced by tuning hyperparameters using the AUC-ROC metric vs. the weighted F1 metric (using only models with metric scores above 0.90). The difference is calculated as the consensus map produced using the AUC-ROC metric minus that of the F1 metric

Figure 15

Fraction of each map deemed prospective for models tuned using the weighted F1 metric vs. those tuned using the AUC-ROC metric

Figure 16

Comparison of the fraction of sites classified as prospective, averaged over all models meeting the performance threshold (a), and the number of models satisfying the performance threshold (b), both as functions of the AUC-ROC threshold value

Discussion

In applicative domains of data science, model performance is usually stipulated by a client and, through effectively unlimited choices in workflow design, the data scientist attempts to locate at least one model satisfying the performance constraint. In this research study, we adopted a reasonable performance threshold of 0.90 as assessed through the AUC-ROC metric, and models meeting this criterion were considered satisfactory in the data science framework. Results suggest that there always exists at least one combination of feature space density/dimensionality and algorithm that produces an AUC-ROC score above 0.90 (Fig. 7). This is not the case for models produced using the weighted F1 metric, because all LR and GP models fail this criterion (Fig. 8). Consequently, we demonstrated that it is highly possible for MPM workflows to differ in key stages, including feature selection (or extraction using deep learning), predictive modeling algorithms and performance metrics, while still producing equiprobable models (in the case of classification tasks). The effect of feature space dimensionality was also significant. The increase in spatial selectivity of all algorithms with increasing feature space dimensionality, at least up to 20 features, was expected given the increase in explanatory power of the features (Fig. 5). However, beyond 20 to 22 features, the curse of dimensionality likely caused plateaus or slight losses in model performance (Figs. 7 and 8). This implies that feature space density became too low, such that, despite increases in the explanatory power of the features, model performance declined because of the increasing sparsity of training samples relative to feature space dimensionality. In general, there was a noticeable to significant difference in prospectivity maps, both quantitatively and qualitatively, as a function of changes in algorithm and changes in feature space dimensionality. An implication of this finding is that feature engineering and algorithm interactions should be experimentally modified using feedback as part of the workflow, in the sense that a range of feature space dimensions should be explored for each algorithm to optimize model performance. Another implication is that, where MPM adopts machine learning, a comprehensive implementation of a data science framework should be a minimal requirement for model deployment settings (e.g., Fig. 1). This implication may also apply to data pre-processing, which could include the construction of evidence layers using discipline-specific methods, such as geostatistics, because differences in data pre-processing would clearly also propagate to machine learning features. However, this exploration was outside the scope of this study.

Another major finding was that there was not always a sensitive or monotonic relationship between spatial selectivity and model performance as assessed through the AUC-ROC, F1 or accuracy metrics. This is a potential problem because the geospatial extension to the data science framework is a post-hoc one. For example, it is unknown how spatially guided model construction (e.g., hyperparameter tuning) could be integrated into the data science framework in general (i.e., without necessitating spatially aware algorithms). In our case, for the explored tree-based algorithms, which are popular in current MPM practices, optimizing model performance via any metric did not necessarily result in more spatially selective maps. On average, optimizing the F1 score, by using the F1 metric to perform model selection through hyperparameter tuning, produced more spatially selective maps; however, it produced models with generally lower AUC-ROC (as well as accuracy and F1) scores. This implies that the choice of metric during hyperparameter tuning was a significant source of variability in prospectivity maps in the spatial domain. Because the overarching objective of MPM is to narrow the search space for mineral resources, this also implies that the value of MPM products cannot be determined within the data science framework, because this framework employs solely variable domain metrics to select algorithms and models. Consequently, the rigor with which the geospatial extension (Fig. 1) is practiced significantly controls the value of MPM products. However, it is not obvious whether extending the data science framework with post-hoc geospatial analysis can create effective feedback while preserving the metric-driven character of the framework. For example, metrics in the spatial domain that rely on large portions of maps could violate the train-test independence of datasets in data science. As such, the style of geospatial extension practiced in this study for the task of MPM cannot be treated as a rigorous geodata science framework, because cyclical tasks, such as iterating between data preparation and modeling, cannot generally be conducted using metric-driven feedback and experimentation. For prospectivity mapping alone using this type of framework, some measure of spatial concordance between prospective sites and actual mineral occurrences would be an effective validation of any MPM product. However, because positive labels are scarce for many commodities or deposit types and constrain training data abundance, it is generally not practical to set aside a spatially extensive and statistically large number of labels for robust model testing.

Unfortunately, the inconsistent and sometimes unreliable relationship between the performance of models in the variable and spatial domains implies that out-of-sample testing and metric-driven model selection are insufficient to control models’ spatial selectivity, the (unconstrained) optimization of which is the fundamental goal of MPM. This is demonstrably true for supervised classification approaches using spectrum-based (aspatial) algorithms, and we expect that it would also hold for regression-based MPM, although this remains to be empirically examined. It is likely more problematic for singleton or experiential studies (studies that result in single models or are based on subjective experience). This further implies that the uncertainty of MPM products is generally unknown in terms of their spatial characteristics, e.g., how to define a desirable extent of reduction in the exploration search space and how to relate it to the data science framework. With the current state of theoretical development of the geodata science framework, it is not yet possible to design an objective workflow and its constraints (e.g., the number of features, algorithms and otherwise) that definitively produce an MPM product meeting or exceeding spatial requirements. Physical feedback approaches are not generally scalable to MPM products. For example, in mineral resource estimation, additional sampling after the resource modeling stage can be used to delineate between resources and reserves; for MPM, this cannot generally happen at the scale necessary to validate maps. Therefore, feed-forward approaches are probably the only viable solution for now. Our results indicate that a possible way to de-risk MPM products is via consensus of the outputs of multiple workflows whose key experimental variables are as de-correlated as possible, as was utilized in this study (Fig. 17). Any workflow component could be varied, beyond those examined in this study (e.g., the interpolation of sparse data into evidence layers). As such, we propose that a practical mitigation measure for workflow-induced variability is to build a large variety of experimental models, each of which is optimized through a single variation of a data science workflow (e.g., the highest scoring model under a set of experimental controls and variables; Fig. 3), from which a consensual model is derived that depicts regions of varying consensus. In this way, exploration risk can be stratified (categorized) based on the degree of consensus or a level of tolerance of dissent. The optimal exploration model is then the subset of regions depicted by the consensual model that meets or exceeds a target degree of risk, which is defined by the users of MPM products (e.g., 99% consensus or 1% dissent). This means that prospectivity maps produced using singular workflows should probably be treated as isolated experimental findings, in the sense that they require either ensemble-type or subsequent replications via de-correlated confirmation studies or workflows to construct a scientific consensus. This would make sense given that MPM is technically a type of data-driven experimentation. Consensus is likely to be a deciding factor for downstream investment in greenfield settings, where no additional knowledge is available to constrain data-driven findings. Quantifying uncertainty would only require modulating the workflow to produce a set of (equiprobable) models or maps. To mitigate uncertainty, a threshold could be used to eliminate prospective areas of high disagreement from further consideration.
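
A minimal sketch of this consensus-based stratification is given below; the bin edges and the 0.99 target are illustrative, user-defined values consistent with the example above, not prescriptions from this study.

```python
# Minimal sketch of consensus-based risk stratification: given a consensus
# map with values in [0, 1], categorize sites by degree of consensus and
# retain only those meeting a target consensus level.
import numpy as np

def stratify_risk(consensus, bins=(0.5, 0.85, 0.99)):
    """Categorize exploration risk by degree of consensus (higher bin = lower risk)."""
    return np.digitize(consensus, bins)

def target_regions(consensus, min_consensus=0.99):
    """Boolean mask of sites meeting or exceeding the target consensus."""
    return np.asarray(consensus) >= min_consensus

# e.g., risk_classes = stratify_risk(consensus); mask = target_regions(consensus, 0.99)
```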

Figure 17

Summary of uncertainty propagation through workflow modulation to create consensus of mineral prospectivity mapping (MPM) products. The key stages depicted are the identification of subjective choices, the modulation of experimental variables and controls among those choices, and finally, the propagation of uncertainty to determine the degree of consensus (i.e., the uncertainty) of MPM products. A consensual model can then be constructed from many individual models to understand the exploration risks created by workflow design

Conclusions

Uncertainty in MPM products derived from data-driven methods can be traced to the guiding frameworks. For MPM methods that are formulated as artificial intelligence tasks, the data science framework is the most common framework guiding model construction. The decoupling of data science practices, such as experimentation with algorithms and models, from geoscience needs, such as the spatial characteristics of models, prevents effective feedback to guide objective and extensive experimentation within data-driven MPM workflows. However, it remains a matter of research how spatial constraints can be integrated into the data science framework in a manner that respects the automatability, objectivity and rigor of the framework. Therefore, although the data science framework was originally designed to standardize inferential data modeling, to remove operator subjectivity and to increase replicability, extending it via post-hoc additions of geospatial analysis does not preserve this intention because the extension cannot be used to steer workflow design. This is the context in which heuristic choices in workflow design create unknown and hidden uncertainty, which is not addressed by the data science framework and whose closest existing concept is judgment-related uncertainty in GIS-based MPM. Given heuristic workflow design, variability in MPM products is significant, and individual MPM outcomes (e.g., models or maps) contain unknown uncertainty with respect to choices in workflow design.

This study contributes to the understanding of uncertainty in MPM products. Specifically, we empirically demonstrated the existence of workflow-induced uncertainty for MPM workflows adhering to the geodata science framework. Moreover, we showed that its spatial effects are significant and unintuitive and exhibit complex relationships with model performance metrics. In particular, we found that: (1) it is possible to reach high model performance through distinct feature–algorithm pairings, implying that there are many local minima in workflow design; (2) feature space dimensionality impacts model performance significantly, to a degree that is algorithm dependent; (3) the choice of algorithm impacts model performance, but model performance alone is not sufficient to distinguish the quality of the resulting maps as measured by spatial selectivity; and (4) there is an inconsistent relationship between model performance and spatial selectivity, which implies that MPM maps should not be derived solely using a data-driven methodology (e.g., the data science framework), but must involve some spatial or knowledge constraints.

We also identified a solution to minimize workflow-induced uncertainty, which leverages the notion of scientific consensus and reframes data science-based MPM within the greater experimental science framework. We demonstrated that consensus across workflows is realizable by treating the output of each workflow variation as an independent experimental measurement or outcome, and then analyzing the degree of consensus across outcomes. Therefore, by carefully choosing experimental controls and variables, it is possible both to isolate the uncertainty contribution associated with a particular stage of the workflow and to study its interactions with those of other stages. In effect, we demonstrated that modulation of workflow design enables uncertainty propagation to the resulting MPM products. We expect our findings and solution to be generalizable to most known types of data science-based MPM because our assumptions are not specific to our data or algorithms, and because we purposely explored a large (multi-continental scale) dataset and a wide range of workflow implementations. We anticipate that the key requirements for adopting our solution are computational capacity and high-performance computing expertise, as modulating workflows is computationally intensive even when fully automated. However, uncertainty analysis does not require that all sources of variability be propagated, only those with a dominant impact. Finally, in a philosophical interpretation, because data-driven MPM is a type of data-driven experimentation, it would be consistent with experimental science practices to ascribe value to outcomes using scientific consensus. An implication of this finding is that, where geodata science is used to derive scientific outcomes, the geodata science framework should be embedded within an experimental science framework to properly assess the value of data-driven products.