This study is part of the MACRO project, which is scheduled to end in approximately summer 2023. In brief, the project explores the application of machine learning to mapping and regionalization in hydrogeology.

Figure 0.1: MACRO project logo






1 Abstract

Information on the spatial distribution of hydrogeochemical parameters is crucial for decision making. Machine-learning-based methods for mapping hydrogeochemical parameter concentrations have been studied for many years as a step beyond deterministic and geostatistical interpolation methods. However, it is often difficult to represent all relevant processes on which the target variables depend, because suitable features are mostly insufficiently determined and/or unavailable. This is especially true when only freely accessible data are used.

In this study, we apply an extreme gradient boosting learner (XGBoost) to map major ion concentrations across Germany. The training data consists of water samples from approximately 35K observation wells across Germany and a wide range of environmental data as predictors. The water samples were collected between the 1950s and 2005 at anthropogenically undisturbed locations.

The environmental data include hydrogeological units and parameters, soil type, lithology, a digital elevation model (DEM), DEM-derived parameters, and others. The values of these features at each water sample location were extracted on the basis of a polygon that approximately represents the area influencing the target variable (ion concentration). For comparison, different polygon shapes are used.

2 Workflow

The workflow from data preprocessing to model evaluation is shown schematically in fig. 2.1. The individual steps are described in more detail in the following sections. This study is still in progress. The preprocessing of the hydrogeochemical background values data set is described in section Hydrogeochemical Parameters, the feature extraction in section Feature Extraction, the modelling pipeline in section Model Training, and the evaluation in section Evaluation.

All data processing and modelling were done in the R programming language with multiple packages (see chapter References).

Figure 2.1: flowchart

3 Data

3.1 Hydrogeochemical Parameters

The target variables of the training data are based on a data set that initially contained approximately 53,000 measurements of hydrogeochemical parameters from groundwater samples, predominantly taken between the second half of the 20th century and 2010. The number of samples used for training the models was reduced by the following steps:

3.1.1 Preprocessing


Intersection with Study Area

The sample locations were intersected with the administrative border of Germany.


Filter by Sample Date

Only samples taken between 1990-01-01 and 2010-01-01 were kept to limit the data set to the most recent period. The distribution of the sample date after applying the previous processing steps is shown in fig. 3.1.

Figure 3.1: Distribution of sample date


Filter by Sample Depth

The sample depth was calculated as the mean of the screen top and screen bottom depths where these were provided. It thus reflects the depth of the screen center below ground level for an observation well. For model training, the sample depth was used as a feature (predictor variable).

Only samples with a sample depth between ground level and 100 m, or with a value of 1 in the column lage, were kept in the data set to exclude deeper aquifers. The distribution of the sample depth after applying the previous processing steps is shown in figure 3.2.

Figure 3.2: Distribution of sample depth
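
The depth rule described in this step can be sketched as follows. The study itself was implemented in R; this Python sketch only illustrates the logic. The function names, the handling of missing screen depths (returning None) and the inclusive bounds are assumptions, not taken from the original pipeline.

```python
from typing import Optional


def screen_center_depth(screen_top_m: Optional[float],
                        screen_bottom_m: Optional[float]) -> Optional[float]:
    """Mean of screen top and bottom depth in m below ground level.

    Assumption: if either value is missing, no depth is derived (None).
    """
    if screen_top_m is None or screen_bottom_m is None:
        return None
    return (screen_top_m + screen_bottom_m) / 2.0


def keep_sample(sample_depth_m: Optional[float],
                lage: Optional[int],
                max_depth_m: float = 100.0) -> bool:
    """Keep a sample if lage == 1 or its depth lies within [0, max_depth_m]."""
    if lage == 1:
        return True
    # Assumption: both bounds are inclusive and a missing depth fails the test.
    return sample_depth_m is not None and 0.0 <= sample_depth_m <= max_depth_m


# Hypothetical records: (screen top, screen bottom, lage)
records = [(10.0, 20.0, None), (140.0, 160.0, 1), (140.0, 160.0, 0)]
kept = [r for r in records if keep_sample(screen_center_depth(r[0], r[1]), r[2])]
```

Because the lage condition is an alternative to the depth range, samples deeper than 100 m can remain in the data set, which is consistent with the maximum sample depth reported in table 3.3.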


Aggregation of multiple measurements per sample site

Some sample sites have multiple measurements over time; these were aggregated by calculating the mean per site. The distribution of the number of measurements per sample site after applying the previous processing steps is shown in figure 3.3.

Figure 3.3: Distribution of multiple measurements per sample site
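
A minimal sketch of the per-site aggregation, written in Python for illustration only (the study used R). The input layout, the function name `aggregate_by_site` and the choice to ignore missing values when averaging are assumptions.

```python
from collections import defaultdict
from statistics import fmean


def aggregate_by_site(measurements):
    """Average repeated measurements per station and parameter.

    measurements: iterable of (station_id, {parameter: value or None}).
    Returns {station_id: {parameter: mean or None}}.
    Assumption: missing values (None) are skipped; a parameter that was
    never measured at a site stays None.
    """
    collected = defaultdict(lambda: defaultdict(list))
    for station_id, values in measurements:
        for parameter, value in values.items():
            bucket = collected[station_id][parameter]  # created even if empty
            if value is not None:
                bucket.append(value)
    return {
        station_id: {p: (fmean(v) if v else None) for p, v in params.items()}
        for station_id, params in collected.items()
    }


# Hypothetical measurements for two sites
example = [
    ("site_a", {"ca_mg_l": 40.0, "fe_mg_l": None}),
    ("site_a", {"ca_mg_l": 50.0, "fe_mg_l": None}),
    ("site_b", {"ca_mg_l": 80.0, "fe_mg_l": 2.0}),
]
aggregated = aggregate_by_site(example)
```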

From all measured parameters, the ten ions with the most samples were selected as target variables to be modelled (Ca, Cl, Fe, HCO3, K, Mg, Mn, Na, NO3, SO4; see fig. 3.4). After all preprocessing steps, 34536 samples and 12 columns remain for model training: one column for the station ID (station_id), one for the sample depth (sample_depth) and one for each of the ten target variables.

Figure 3.4: Samples per hydrogeochemical parameter.

3.1.2 Location

The locations of the sample sites used for modelling are shown in figure 3.5 as the number of sample sites per hexagon. The spatial distribution of sampling locations is unbalanced: some regions have few locations, others a high density. The dense regions are mainly concentrated around larger cities in Germany such as Berlin, Hamburg and Frankfurt. The eastern and northern parts of Germany also generally have more sampling sites than southern or central Germany.

Figure 3.5: Sample site locations

3.1.3 Data Summary

The first three rows of the data set containing the target variables after all preprocessing steps are shown in table 3.1 as an example.

Table 3.1: First three rows of the data set containing the target variables
station_id ca_mg_l cl_mg_l fe_mg_l hco3_mg_l k_mg_l mg_mg_l mn_mg_l na_mg_l no3_mg_l so4_mg_l
110_1015 44.7 19.5 17.2 164.7 1.7 6.5 0.75 9.4 1.0 29.5
110_1016 82.9 46.0 15.0 207.4 6.5 13.9 2.80 25.0 0.1 126.0
110_1017 90.0 35.0 10.8 323.3 4.0 9.5 0.65 23.0 0.1 33.0

Figure 3.6 gives an overview of the occurrence of missing values in that data set. The number of missing values varies between the target variables, which leads to different sample sizes when each target is modelled separately.

Figure 3.6: Missing values across the target variables

More details on the column statistics are shown in the following summary.

Table 3.2: Data summary
Name Piped data
Number of rows 34536
Number of columns 10
_______________________
Column type frequency:
numeric 10
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ca_mg_l 4303 0.88 88.39 103.78 0 40.50 74.20 112.00 3670.0 ▇▁▁▁▁
cl_mg_l 3025 0.91 250.48 4212.52 0 13.00 25.00 45.87 177000.0 ▇▁▁▁▁
fe_mg_l 13758 0.60 3.18 12.42 0 0.10 0.86 2.80 1170.0 ▇▁▁▁▁
hco3_mg_l 4578 0.87 220.75 175.41 0 99.00 207.40 318.10 7755.7 ▇▁▁▁▁
k_mg_l 9258 0.73 5.62 31.51 0 1.10 2.00 4.00 1440.0 ▇▁▁▁▁
mg_mg_l 4264 0.88 17.28 43.32 0 5.00 10.10 19.30 1801.3 ▇▁▁▁▁
mn_mg_l 9379 0.73 0.29 1.30 0 0.01 0.11 0.29 156.0 ▇▁▁▁▁
na_mg_l 10451 0.70 180.93 2978.77 0 7.00 12.20 24.00 116000.0 ▇▁▁▁▁
no3_mg_l 9548 0.72 14.47 26.88 0 0.10 3.00 18.00 708.0 ▇▁▁▁▁
so4_mg_l 4276 0.88 89.93 212.27 0 18.60 43.00 90.50 6880.0 ▇▁▁▁▁
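
The n_missing and complete_rate columns of table 3.2 translate directly into the sample size available for each separately modelled target. The following Python sketch is illustrative only; the function names are assumptions, and the two example rows are copied from table 3.2.

```python
def complete_rate(n_missing: int, n_rows: int) -> float:
    """Share of rows in which the variable is present."""
    return 1.0 - n_missing / n_rows


def usable_samples(n_missing: int, n_rows: int) -> int:
    """Number of rows left when modelling this target on its own."""
    return n_rows - n_missing


N_ROWS = 34536  # number of rows after preprocessing (table 3.2)
n_missing = {"ca_mg_l": 4303, "fe_mg_l": 13758}  # taken from table 3.2

for name, missing in n_missing.items():
    print(name, usable_samples(missing, N_ROWS),
          round(complete_rate(missing, N_ROWS), 2))
```

For these two rows, a model for Ca can use 30233 samples, while a model for Fe is limited to 20778.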

The distribution of the target values is shown as a violin plot in figure 3.7.

Figure 3.7: Distribution of the target variable values

3.2 Features

3.2.1 Feature Extraction

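In addition to this data set, geophysical attributes were extracted from other spatial data sources and used as features. The feature groups can be read from the column-name prefixes in table 3.3 (lulc, gwrecharge, seepage, temperature, precipitation, hydrounits, geology, soilunits, hydrogeology, elevation, slope, aspect and mohp).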

The features were extracted for a 1 km buffer around each sample location, which serves as an approximation of the groundwater contributing area (Knoll, Breuer, and Bach 2019) (see figure 3.8). For categorical data, the proportion of each class within the buffer was calculated. This has the advantage of encoding the data as numerical features, but it also creates many sparse features for rare classes. For numerical data, the mean over the buffer was calculated.

Figure 3.8: Example of feature extraction based on circular buffer around sample sites (red) (e.g. the land use and land cover data as shown here)
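
The buffer-based extraction can be illustrated on a small raster. This is a simplified Python sketch under stated assumptions, not the GIS workflow used in the study: the grid layout, the rule that a cell counts when its centre lies inside the buffer, and all function names are hypothetical.

```python
from collections import Counter
from math import hypot
from statistics import fmean


def cells_in_buffer(grid, cell_size_m, center_xy, radius_m):
    """Return the values of all cells whose centre lies inside the buffer.

    grid[row][col] holds a class label or a number; the centre of cell
    (row, col) is at ((col + 0.5) * cell_size_m, (row + 0.5) * cell_size_m).
    """
    cx, cy = center_xy
    selected = []
    for row_idx, row in enumerate(grid):
        for col_idx, value in enumerate(row):
            x = (col_idx + 0.5) * cell_size_m
            y = (row_idx + 0.5) * cell_size_m
            if hypot(x - cx, y - cy) <= radius_m:
                selected.append(value)
    return selected


def class_proportions(values):
    """Categorical data: proportion of each class among the selected cells."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def buffer_mean(values):
    """Numerical data: mean over the selected cells (None if empty)."""
    return fmean(values) if values else None


# Hypothetical 4 x 4 land cover raster with 500 m cells
land_cover = [
    ["forest", "agri", "agri", "forest"],
    ["agri", "agri", "agri", "agri"],
    ["agri", "urban", "urban", "agri"],
    ["forest", "urban", "urban", "forest"],
]
inside = cells_in_buffer(land_cover, 500.0, (1000.0, 1000.0), 1000.0)
proportions = class_proportions(inside)
```

In this example the four corner cells fall outside the 1 km buffer, so the class "forest" receives no share at all. Classes that are absent from most buffers are what produces the sparse features mentioned above.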

3.2.2 Data Summary

The previously described method of extracting the features results in 165 features. A summary of the statistics of the features is provided in tab. 3.3.

Table 3.3: Data summary
Name Piped data
Number of rows 34536
Number of columns 165
_______________________
Column type frequency:
numeric 165
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
sampledepth_sampledepth 0 1 41.71 36.13 0.00 17.00 42.00 50.00 644.00 ▇▁▁▁▁
lulc_agriculturalareas 0 1 0.53 0.33 0.00 0.24 0.56 0.83 1.00 ▆▅▅▆▇
lulc_forestandseminaturalareas 0 1 0.30 0.32 0.00 0.00 0.19 0.52 1.00 ▇▂▂▂▂
lulc_artificialsurfaces 0 1 0.15 0.24 0.00 0.00 0.05 0.20 1.00 ▇▁▁▁▁
lulc_waterbodies 0 1 0.02 0.06 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
lulc_wetlands 0 1 0.00 0.02 0.00 0.00 0.00 0.00 0.84 ▇▁▁▁▁
gwrecharge_gwrecharge 0 1 121.53 71.54 0.00 73.44 109.89 158.57 879.34 ▇▂▁▁▁
seepage_seepage 2 1 253.65 187.94 -99.89 131.54 223.25 335.48 2559.33 ▇▁▁▁▁
temperature_temperature 0 1 84.56 8.72 8.82 80.65 85.08 89.54 107.44 ▁▁▁▇▆
precipitation_precipitation 0 1 719.63 191.05 453.25 574.28 691.90 800.75 2646.69 ▇▁▁▁▁
hydrounits_14 0 1 0.21 0.40 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
hydrounits_13 0 1 0.12 0.31 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_17 0 1 0.03 0.17 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_15 0 1 0.14 0.34 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_63 0 1 0.02 0.13 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_31 0 1 0.05 0.21 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_62 0 1 0.02 0.14 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_101 0 1 0.02 0.13 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_41 0 1 0.04 0.20 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_64 0 1 0.01 0.11 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_65 0 1 0.00 0.04 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_0 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.43 ▇▁▁▁▁
hydrounits_66 0 1 0.01 0.10 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_95 0 1 0.02 0.14 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_71 0 1 0.01 0.10 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_92 0 1 0.04 0.20 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_96 0 1 0.01 0.08 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_32 0 1 0.01 0.11 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_52 0 1 0.01 0.11 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_12 0 1 0.01 0.10 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_11 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.06 ▇▁▁▁▁
hydrounits_81 0 1 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_33 0 1 0.01 0.11 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_54 0 1 0.02 0.13 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_51 0 1 0.02 0.13 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_16 0 1 0.03 0.18 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_83 0 1 0.00 0.03 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_22 0 1 0.02 0.13 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_21 0 1 0.02 0.13 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_53 0 1 0.00 0.04 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_23 0 1 0.01 0.11 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_61 0 1 0.01 0.08 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_82 0 1 0.00 0.04 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_94 0 1 0.00 0.05 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_91 0 1 0.01 0.11 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_93 0 1 0.01 0.08 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrounits_97 0 1 0.00 0.01 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_114 0 1 0.37 0.45 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅
geology_111 0 1 0.14 0.31 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_113 0 1 0.16 0.35 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_115 0 1 0.01 0.08 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_223 0 1 0.00 0.06 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_231 0 1 0.02 0.15 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_232 0 1 0.02 0.11 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_233 0 1 0.04 0.18 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_600 0 1 0.02 0.14 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_120 0 1 0.02 0.14 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_221 0 1 0.01 0.10 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_130 0 1 0.01 0.07 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_330 0 1 0.00 0.03 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_230 0 1 0.00 0.04 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_400 0 1 0.03 0.17 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_222 0 1 0.00 0.06 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_500 0 1 0.01 0.10 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_510 0 1 0.01 0.08 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_312 0 1 0.01 0.11 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_320 0 1 0.01 0.08 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_220 0 1 0.00 0.03 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_112 0 1 0.01 0.09 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_211 0 1 0.02 0.13 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_210 0 1 0.00 0.05 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_350 0 1 0.01 0.10 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_360 0 1 0.01 0.08 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_888 0 1 0.00 0.01 0.00 0.00 0.00 0.00 0.83 ▇▁▁▁▁
geology_333 0 1 0.02 0.14 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_332 0 1 0.01 0.09 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_331 0 1 0.00 0.05 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_300 0 1 0.00 0.04 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_311 0 1 0.00 0.04 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_212 0 1 0.00 0.05 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
geology_340 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.50 ▇▁▁▁▁
soilunits_19 0 1 0.14 0.34 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_12 0 1 0.03 0.15 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_28 0 1 0.11 0.31 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_17 0 1 0.11 0.30 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_33 0 1 0.04 0.19 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_31 0 1 0.08 0.26 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_8 0 1 0.06 0.23 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_36 0 1 0.04 0.18 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_27 0 1 0.01 0.08 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_41 0 1 0.02 0.14 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_11 0 1 0.03 0.16 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_40 0 1 0.05 0.20 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_63 0 1 0.01 0.12 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_55 0 1 0.04 0.20 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_44 0 1 0.01 0.09 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_21 0 1 0.01 0.09 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_49 0 1 0.01 0.12 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_61 0 1 0.03 0.17 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_30 0 1 0.01 0.08 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_54 0 1 0.04 0.20 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_15 0 1 0.00 0.03 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_52 0 1 0.00 0.06 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_72 0 1 0.00 0.02 0.00 0.00 0.00 0.00 0.96 ▇▁▁▁▁
soilunits_69 0 1 0.00 0.07 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_18 0 1 0.02 0.14 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_59 0 1 0.03 0.18 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_68 0 1 0.00 0.02 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_4 0 1 0.01 0.09 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_2 0 1 0.00 0.01 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_5 0 1 0.00 0.04 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_60 0 1 0.01 0.10 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_66 0 1 0.02 0.14 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_51 0 1 0.00 0.06 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_34 0 1 0.00 0.03 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_46 0 1 0.00 0.05 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_42 0 1 0.01 0.07 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
soilunits_22 0 1 0.01 0.11 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologygc_s 0 1 0.78 0.38 0.00 0.73 1.00 1.00 1.00 ▂▁▁▁▇
hydrogeologygc_Gew 0 1 0.00 0.04 0.00 0.00 0.00 0.00 0.82 ▇▁▁▁▁
hydrogeologygc_m 0 1 0.13 0.30 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologygc_a 0 1 0.00 0.03 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologygc_so 0 1 0.01 0.07 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologygc_kA 0 1 0.00 0.02 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologygc_k 0 1 0.06 0.21 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologygc_g 0 1 0.01 0.09 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologygc_h 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.14 ▇▁▁▁▁
hydrogeologygc_gh 0 1 0.00 0.01 0.00 0.00 0.00 0.00 0.90 ▇▁▁▁▁
hydrogeologykf_3 0 1 0.41 0.44 0.00 0.00 0.15 0.99 1.00 ▇▁▁▁▅
hydrogeologykf_9 0 1 0.22 0.36 0.00 0.00 0.00 0.32 1.00 ▇▁▁▁▂
hydrogeologykf_10 0 1 0.09 0.26 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologykf_4 0 1 0.06 0.18 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologykf_6 0 1 0.03 0.15 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologykf_99 0 1 0.00 0.04 0.00 0.00 0.00 0.00 0.82 ▇▁▁▁▁
hydrogeologykf_11 0 1 0.02 0.10 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologykf_12 0 1 0.05 0.18 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologykf_0 0 1 0.00 0.02 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologykf_5 0 1 0.06 0.19 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologykf_2 0 1 0.04 0.18 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologykf_7 0 1 0.00 0.02 0.00 0.00 0.00 0.00 0.95 ▇▁▁▁▁
hydrogeologykf_8 0 1 0.00 0.01 0.00 0.00 0.00 0.00 0.53 ▇▁▁▁▁
hydrogeologyga_S 0 1 0.90 0.28 0.00 1.00 1.00 1.00 1.00 ▁▁▁▁▇
hydrogeologyga_G 0 1 0.00 0.04 0.00 0.00 0.00 0.00 0.82 ▇▁▁▁▁
hydrogeologyga_kA 0 1 0.00 0.02 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologyga_Me 0 1 0.05 0.20 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hydrogeologyga_Ma 0 1 0.04 0.17 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
elevation_elevation 0 1 178.52 200.25 -3.75 42.90 89.15 283.62 1846.03 ▇▂▁▁▁
slope_slope 0 1 2.14 3.38 0.00 0.15 0.57 2.72 43.65 ▇▁▁▁▁
aspect_aspect 0 1 178.40 27.61 35.84 162.94 179.11 194.25 301.42 ▁▁▇▃▁
mohplp_streamorder1 12 1 5235.34 2961.90 0.00 2762.00 5273.00 7896.00 10030.00 ▆▆▆▇▇
mohplp_streamorder2 5 1 5428.72 2906.28 0.00 3045.00 5574.00 7998.00 10030.00 ▅▆▆▇▇
mohplp_streamorder3 3 1 5505.89 2920.10 0.00 3068.35 5735.00 8137.00 10030.00 ▅▆▆▆▇
mohplp_streamorder4 3 1 5486.75 2934.16 0.00 3061.00 5682.67 8145.00 10007.00 ▅▆▆▆▇
mohplp_streamorder5 3 1 5444.35 3002.40 0.00 2881.00 5758.00 8150.00 10001.00 ▆▅▅▆▇
mohplp_streamorder6 3 1 5543.70 2952.94 0.00 3123.00 5870.00 8180.00 10000.00 ▅▅▆▆▇
mohplp_streamorder7 3 1 5314.65 3058.03 0.00 2657.00 5427.00 8128.00 10000.00 ▆▅▆▆▇
mohplp_streamorder8 3 1 7679.21 1413.66 385.00 6760.00 7644.00 8814.00 9999.00 ▁▁▂▇▇
mohpdsd_streamorder1 12 1 5235.34 2961.90 0.00 2762.00 5273.00 7896.00 10030.00 ▆▆▆▇▇
mohpdsd_streamorder2 5 1 5428.72 2906.28 0.00 3045.00 5574.00 7998.00 10030.00 ▅▆▆▇▇
mohpdsd_streamorder3 3 1 5505.89 2920.10 0.00 3068.35 5735.00 8137.00 10030.00 ▅▆▆▆▇
mohpdsd_streamorder4 3 1 5486.75 2934.16 0.00 3061.00 5682.67 8145.00 10007.00 ▅▆▆▆▇
mohpdsd_streamorder5 3 1 5444.35 3002.40 0.00 2881.00 5758.00 8150.00 10001.00 ▆▅▅▆▇
mohpdsd_streamorder6 3 1 5543.70 2952.94 0.00 3123.00 5870.00 8180.00 10000.00 ▅▅▆▆▇
mohpdsd_streamorder7 3 1 5314.65 3058.03 0.00 2657.00 5427.00 8128.00 10000.00 ▆▅▆▆▇
mohpdsd_streamorder8 3 1 7679.21 1413.66 385.00 6760.00 7644.00 8814.00 9999.00 ▁▁▂▇▇

The first three rows of the data set containing the features are shown in table 3.4 as an example. This table was then joined with the table holding the target variables by station_id.

Table 3.4: First three rows of the data set containing the features
station_id sampledepth_sampledepth lulc_agriculturalareas lulc_forestandseminaturalareas lulc_artificialsurfaces lulc_waterbodies lulc_wetlands gwrecharge_gwrecharge seepage_seepage temperature_temperature precipitation_precipitation hydrounits_14 hydrounits_13 hydrounits_17 hydrounits_15 hydrounits_63 hydrounits_31 hydrounits_62 hydrounits_101 hydrounits_41 hydrounits_64 hydrounits_65 hydrounits_0 hydrounits_66 hydrounits_95 hydrounits_71 hydrounits_92 hydrounits_96 hydrounits_32 hydrounits_52 hydrounits_12 hydrounits_11 hydrounits_81 hydrounits_33 hydrounits_54 hydrounits_51 hydrounits_16 hydrounits_83 hydrounits_22 hydrounits_21 hydrounits_53 hydrounits_23 hydrounits_61 hydrounits_82 hydrounits_94 hydrounits_91 hydrounits_93 hydrounits_97 geology_114 geology_111 geology_113 geology_115 geology_223 geology_231 geology_232 geology_233 geology_600 geology_120 geology_221 geology_130 geology_330 geology_230 geology_400 geology_222 geology_500 geology_510 geology_312 geology_320 geology_220 geology_112 geology_211 geology_210 geology_350 geology_360 geology_888 geology_333 geology_332 geology_331 geology_300 geology_311 geology_212 geology_340 soilunits_19 soilunits_12 soilunits_28 soilunits_17 soilunits_33 soilunits_31 soilunits_8 soilunits_36 soilunits_27 soilunits_41 soilunits_11 soilunits_40 soilunits_63 soilunits_55 soilunits_44 soilunits_21 soilunits_49 soilunits_61 soilunits_30 soilunits_54 soilunits_15 soilunits_52 soilunits_72 soilunits_69 soilunits_18 soilunits_59 soilunits_68 soilunits_4 soilunits_2 soilunits_5 soilunits_60 soilunits_66 soilunits_51 soilunits_34 soilunits_46 soilunits_42 soilunits_22 hydrogeologygc_s hydrogeologygc_Gew hydrogeologygc_m hydrogeologygc_a hydrogeologygc_so hydrogeologygc_kA hydrogeologygc_k hydrogeologygc_g hydrogeologygc_h hydrogeologygc_gh hydrogeologykf_3 hydrogeologykf_9 hydrogeologykf_10 hydrogeologykf_4 hydrogeologykf_6 hydrogeologykf_99 hydrogeologykf_11 hydrogeologykf_12 hydrogeologykf_0 hydrogeologykf_5 
hydrogeologykf_2 hydrogeologykf_7 hydrogeologykf_8 hydrogeologyga_S hydrogeologyga_G hydrogeologyga_kA hydrogeologyga_Me hydrogeologyga_Ma elevation_elevation slope_slope aspect_aspect mohplp_streamorder1 mohplp_streamorder2 mohplp_streamorder3 mohplp_streamorder4 mohplp_streamorder5 mohplp_streamorder6 mohplp_streamorder7 mohplp_streamorder8 mohpdsd_streamorder1 mohpdsd_streamorder2 mohpdsd_streamorder3 mohpdsd_streamorder4 mohpdsd_streamorder5 mohpdsd_streamorder6 mohpdsd_streamorder7 mohpdsd_streamorder8
110_1 37.8 0.7103502 0.2896498 0.0000000 0 0 48.711914 147.50668 77.02868 566.8224 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 96.92465 0.2502644 165.2706 4936 1831 7135 9685 9381 5714 5603 9149 4936 1831 7135 9685 9381 5714 5603 9149
110_10 24.2 0.7080190 0.1805797 0.1114013 0 0 80.524147 174.22295 77.97757 559.4204 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 87.36120 0.6016899 180.6347 1716 5411 8938 8842 9281 5718 5557 8989 1716 5411 8938 8842 9281 5718 5557 8989
110_100 20.5 1.0000000 0.0000000 0.0000000 0 0 1.587526 -52.47207 88.00000 535.1705 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 28.67915 0.0000000 185.9964 8771 8771 1481 1281 705 3764 7040 7561 8771 8771 1481 1281 705 3764 7040 7561
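
The join of the feature table with the target table by station_id can be sketched as follows. The sketch is in Python for illustration (the study used R), and it assumes an inner join, which the text does not state. The station IDs and values are made up.

```python
def join_by_station_id(targets, features):
    """Inner join of two tables stored as {station_id: {column: value}}.

    Assumption: stations missing from either table are dropped.
    """
    joined = {}
    for station_id, target_row in targets.items():
        feature_row = features.get(station_id)
        if feature_row is not None:
            joined[station_id] = {**feature_row, **target_row}
    return joined


# Hypothetical rows, not taken from the data set
targets = {
    "site_a": {"ca_mg_l": 44.7, "cl_mg_l": 19.5},
    "site_b": {"ca_mg_l": 82.9, "cl_mg_l": 46.0},
}
features = {
    "site_a": {"sampledepth_sampledepth": 37.8, "lulc_agriculturalareas": 0.71},
    "site_c": {"sampledepth_sampledepth": 20.5, "lulc_agriculturalareas": 1.0},
}
training_table = join_by_station_id(targets, features)
```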

4 Model Training

Like the data preprocessing, all modelling was done in the R programming language, using the tidymodels package. The modelling pipeline includes the following steps:

  • Normalization (Standardization) of all features
  • Removal of highly correlated and/or sparse features
  • 80/20 train-test split
  • 5-fold cross-validation with stratification by the target variable (spatial cross-validation not yet implemented)
  • Hyperparameter tuning of the model parameters for an XGBoost learner (number of trees fixed to 1000) with 50 parameter combinations from a space-filling parameter grid
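
The resampling steps listed above can be sketched in plain Python. The study used tidymodels in R, so this is only an illustration of the idea. It assumes that stratification on a continuous target works by binning the target into quartiles and that both the train-test split and the folds are stratified. All function names are hypothetical.

```python
import random
from statistics import fmean, stdev


def standardize(column):
    """Centre to mean 0 and scale to sample standard deviation 1."""
    mu, sigma = fmean(column), stdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


def quantile_bins(target, n_bins=4):
    """Bin index per value, assigned by rank so bins are near-equal in size."""
    order = sorted(range(len(target)), key=lambda i: target[i])
    bins = [0] * len(target)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(target)
    return bins


def stratified_split(target, test_fraction=0.2, n_bins=4, seed=1):
    """80/20 split that draws the test share from every target bin."""
    rng = random.Random(seed)
    bins = quantile_bins(target, n_bins)
    train, test = [], []
    for b in range(n_bins):
        members = [i for i, bin_id in enumerate(bins) if bin_id == b]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


def stratified_folds(indices, target, k=5, n_bins=4, seed=1):
    """Split the training indices into k validation folds, bin by bin."""
    rng = random.Random(seed)
    bins = quantile_bins([target[i] for i in indices], n_bins)
    folds = [[] for _ in range(k)]
    for b in range(n_bins):
        members = [indices[j] for j, bin_id in enumerate(bins) if bin_id == b]
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    return folds


# Hypothetical target values
target = [float(i) for i in range(100)]
train_idx, test_idx = stratified_split(target)
folds = stratified_folds(train_idx, target)
```

Stratifying by target bins keeps the strongly right-skewed concentration distributions (see table 3.2) similarly represented in every fold.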

5 Results

5.1 Hyperparameter Tuning

The following table summarizes the hyperparameters of the best model configuration for each target variable according to the tuning process.
parameter min_n tree_depth learn_rate loss_reduction .config
Ca 34 10 0.0145681 0.0036223 Preprocessor1_Model48
Cl 7 9 0.0315631 0.0000001 Preprocessor1_Model24
Fe 37 14 0.0029476 1.9521420 Preprocessor1_Model33
HCO3 34 10 0.0145681 0.0036223 Preprocessor1_Model48
K 12 8 0.0278855 0.0000000 Preprocessor1_Model20
Mg 7 9 0.0315631 0.0000001 Preprocessor1_Model24
Mn 37 14 0.0029476 1.9521420 Preprocessor1_Model33
Na 3 6 0.0067682 0.0000005 Preprocessor1_Model01
NO3 34 10 0.0145681 0.0036223 Preprocessor1_Model48
SO4 7 9 0.0315631 0.0000001 Preprocessor1_Model24
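
The candidate configurations were drawn from a space-filling grid. The text does not say which design was used, so the Python sketch below uses a Latin hypercube purely as one common example of such a design. The parameter ranges are placeholders chosen for illustration, not the ranges used in the study.

```python
import random


def latin_hypercube(n_points, n_dims, seed=1):
    """Return n_points tuples in the unit cube with one point per stratum.

    Each dimension is cut into n_points equal strata; every stratum is
    used exactly once, and the strata are shuffled independently per
    dimension.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        strata = [(i + rng.random()) / n_points for i in range(n_points)]
        rng.shuffle(strata)
        columns.append(strata)
    return [tuple(col[p] for col in columns) for p in range(n_points)]


def to_int_range(u, low, high):
    """Map u in [0, 1] to an integer in [low, high]."""
    return min(high, low + int(u * (high - low + 1)))


def to_log10_range(u, low_exp, high_exp):
    """Map u in [0, 1] to 10 ** exponent, exponent in [low_exp, high_exp]."""
    return 10.0 ** (low_exp + u * (high_exp - low_exp))


# 50 candidates over four tuned parameters (placeholder ranges)
candidates = [
    {
        "min_n": to_int_range(u1, 2, 40),
        "tree_depth": to_int_range(u2, 1, 15),
        "learn_rate": to_log10_range(u3, -3.0, -1.0),
        "loss_reduction": to_log10_range(u4, -10.0, 1.0),
    }
    for u1, u2, u3, u4 in latin_hypercube(50, 4)
]
```

Compared with a regular grid, such a design covers each parameter's range evenly with far fewer combinations, which matters when every candidate requires fitting 1000 trees in each of five folds.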

5.2 Evaluation

Disclaimer: As these results are preliminary and this study is still work in progress, the plain performance metrics are provided without any further statements or interpretations.

5.2.1 Performance

[Performance plots, one panel per target variable: Ca, Cl, Fe, HCO3, K, Mg, Mn, Na, NO3, SO4]

5.2.2 Feature Importance

[Feature importance plots, one panel per target variable: Ca, Cl, Fe, HCO3, K, Mg, Mn, Na, NO3, SO4]

6 Conclusion and Outlook

This study aims to set up a benchmark model as a basis for developing more sophisticated machine learning models. With this relatively simple approach, model performance varies strongly between the target variables, which indicates that further features are required to reflect the geophysical characteristics involved in the processes that drive the concentration of each target variable. Further steps to improve this preliminary model setup are:

  • Inclusion of more and relevant geophysical attributes as features
  • Evaluation of different feature extraction/engineering methods
  • Set-up of a model ensemble to approximate an XGBoost multi-output model and to model all parameters in a single model
  • Implement spatial cross-validation for more realistic model performance estimation
  • Evaluate deep learning approaches
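
One simple way to approach the spatial cross-validation named in the list above is to assign whole grid blocks, rather than single sites, to folds. The Python sketch below shows this idea under stated assumptions: projected coordinates in metres, square blocks, and a hypothetical function name. It is not part of the current pipeline.

```python
import random
from math import floor


def spatial_block_folds(coords, block_size_m, k=5, seed=1):
    """Return a fold index per site so that sites in one block share a fold.

    coords: list of (x, y) in a projected coordinate system (metres).
    """
    rng = random.Random(seed)
    block_of = [(floor(x / block_size_m), floor(y / block_size_m))
                for x, y in coords]
    blocks = sorted(set(block_of))
    rng.shuffle(blocks)
    # Deal the shuffled blocks out to the k folds in turn
    fold_of_block = {block: i % k for i, block in enumerate(blocks)}
    return [fold_of_block[b] for b in block_of]


# Hypothetical site coordinates and 1 km blocks
sites = [(100.0, 100.0), (900.0, 900.0), (1500.0, 200.0),
         (5200.0, 5100.0), (5900.0, 5900.0)]
fold_ids = spatial_block_folds(sites, block_size_m=1000.0)
```

Because neighbouring sites never end up on both sides of a train-validation split, the resulting error estimate is less affected by spatial autocorrelation than the one from random folds, especially given the clustered sampling shown in figure 3.5.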

References

Allaire, J. J., Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2021. Rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.
Belitz, Kenneth, Richard B. Moore, Terri L. Arnold, Jennifer B. Sharpe, and J. J. Starn. 2019. “Multiorder Hydrologic Position in the Conterminous United States: A Set of Metrics in Support of Groundwater Mapping at Regional and National Scales.” Water Resources Research 55 (12): 11188–207. https://doi.org/10.1029/2019WR025908.
BGR. 2003. “Swr1000 V1.0, (C) BGR, Hannover, 2003.” Hannover.
BGR. 2014. “Bgl5000 V3.0, (C) BGR, Hannover, 2014.” Hannover.
BGR. 2019. “Gwn1000 (c) BGR Hannover 2019.” Hannover.
Chen, Tianqi, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, et al. 2020. Xgboost: Extreme Gradient Boosting. https://CRAN.R-project.org/package=xgboost.
Cowgill, Matt. 2020. Ggannotate: Interactively Annotate Ggplot2 Plots. https://github.com/MattCowgill/ggannotate.
DeSimone, Leslie A., Jason P. Pope, and Katherine M. Ransom. 2020. “Machine-Learning Models to Map pH and Redox Conditions in Groundwater in a Layered Aquifer System, Northern Atlantic Coastal Plain, Eastern USA.” Journal of Hydrology: Regional Studies 30 (August): 100697. https://doi.org/10.1016/j.ejrh.2020.100697.
“EU-Hydro - River Network Database.” 2021. Copernicus Land Monitoring Service. https://land.copernicus.eu/imagery-in-situ/eu-hydro/eu-hydro-river-network-database?tab=download.
European Union. 2018. “CORINE Land Cover (CLC).” © European Union, Copernicus Land Monitoring Service 2018, European Environment Agency (EEA).
European Union. n.d.a. “Aspect.” © European Union, Copernicus Land Monitoring Service 2018, European Environment Agency (EEA).
European Union. n.d.b. “EU-DEM.” © European Union, Copernicus Land Monitoring Service 2018, European Environment Agency (EEA).
European Union. n.d.c. “Slope.” © European Union, Copernicus Land Monitoring Service 2018, European Environment Agency (EEA).
Knoll, Lukas, Lutz Breuer, and Martin Bach. 2019. “Large Scale Prediction of Groundwater Nitrate Concentrations from Spatial Data Using Machine Learning.” Science of The Total Environment 668 (June): 1317–27. https://doi.org/10.1016/j.scitotenv.2019.03.045.
Kremer, Lukas P. M. 2019. Ggpointdensity: A Cross Between a 2d Density Plot and a Scatter Plot. https://CRAN.R-project.org/package=ggpointdensity.
Kuhn, Max, and Hadley Wickham. 2020. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.
Landau, William Michael. 2021. “The Targets R Package: A Dynamic Make-Like Function-Oriented Pipeline Toolkit for Reproducibility and High-Performance Computing.” Journal of Open Source Software 6 (57): 2959. https://doi.org/10.21105/joss.02959.
Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.
R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
“Raster Der Vieljährigen Mittel Der Niederschlagshöhe für Deutschland 1961-90.” n.d. DWD Climate Data Center (CDC).
BGR & SGD. 2019. “HÜK250 V1.0.3, (C) BGR & SGD 2019.”
BGR & SGD. 2021. “GÜK250 V3.0 © BGR & SGD 2021.”
“Vieljährige Mittlere Raster Der Lufttemperatur (2m) für Deutschland 1971-2000.” n.d. DWD Climate Data Center (CDC).
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wilke, Claus O. 2020. Ggtext: Improved Text Rendering Support for ’Ggplot2’. https://CRAN.R-project.org/package=ggtext.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown.
Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.