This study is part of the MACRO project, which is scheduled to end in approximately summer 2023. In brief, the project explores the application of machine learning to mapping and regionalization in hydrogeology.
Figure 0.1: MACRO project logo
Information on the spatial distribution of hydrogeochemical parameters is crucial for decision making. Machine learning methods for mapping hydrogeochemical parameter concentrations have been studied for many years as a development beyond deterministic and geostatistical interpolation methods. However, it is often difficult to reflect all relevant processes on which the target variables depend, because suitable features are mostly insufficiently determined and/or unavailable. This is especially true if only freely accessible data are used.
In this study, we apply an extreme gradient boosting learner (XGBoost) to map major ion concentrations across Germany. The training data consists of water samples from approximately 35K observation wells across Germany and a wide range of environmental data as predictors. The water samples were collected between the 1950s and 2005 at anthropogenically undisturbed locations.
The environmental data include hydrogeological units and parameters, soil type, lithology, a digital elevation model (DEM) and DEM-derived parameters, among others. The values of these features at each water sample location were extracted on the basis of a polygon that approximately represents the area influencing the target variable (ion concentration). For comparison, different polygon shapes are used.
The workflow from data preprocessing to model evaluation is shown schematically in fig. 2.1. The individual steps are described in more detail in the following sections. This study is still in progress. The preprocessing of the hydrogeochemical background values data set is described in Hydrogeochemical Parameters, the feature extraction in Feature Extraction, the modelling pipeline in Model Training and the evaluation in Evaluation.
All data processing and modelling was done in the R programming language with multiple packages (see chapter References).
Figure 2.1: flowchart
The target variables of the training data are based on a data set that initially contained approximately 53000 measurements of hydrogeochemical parameters from groundwater samples, taken predominantly between the second half of the 20th century and 2010. The number of samples used for training the models was reduced by the following steps:
Intersection with Study Area
The sample locations were intersected with the administrative border of Germany.
Filter by Sample Date
Only samples taken between 1990-01-01 and 2010-01-01 were kept to limit the data set to the most recent time period. The distribution of the sample date after applying the previous processing steps is shown in fig. 3.1.
Figure 3.1: Distribution of sample date
Filter by Sample Depth
The sample depth was calculated as the mean of the depths of the screen top and the screen bottom, if these were provided. Thus, it reflects the depth of the screen center below ground level at an observation well. For the model training, the sample depth was used as a feature (predictor variable).
Only samples with a sample depth between ground level and 100 m, or with a value of 1 in the column lage, were kept in the data set to exclude deeper aquifers. The distribution of the sample depth after applying the previous processing steps is shown in figure 3.2.
Figure 3.2: Distribution of sample depth
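The date and depth filters described above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the study itself used R) and is not the project's code. The function and argument names are hypothetical, and treating the interval bounds as inclusive is an assumption; only the thresholds and the lage column come from the text.

```python
from datetime import date

# Thresholds taken from the text; inclusive bounds are an assumption.
START, END = date(1990, 1, 1), date(2010, 1, 1)
MAX_DEPTH_M = 100.0


def screen_center_depth(top_m, bottom_m):
    """Mean of screen top and bottom depth (m below ground level).

    Returns None if either value is missing.
    """
    if top_m is None or bottom_m is None:
        return None
    return (top_m + bottom_m) / 2.0


def keep_sample(sample_date, depth_m, lage):
    """Apply the date filter, then the depth filter (or lage == 1)."""
    in_period = START <= sample_date <= END
    shallow = depth_m is not None and 0.0 <= depth_m <= MAX_DEPTH_M
    return in_period and (shallow or lage == 1)


depth = screen_center_depth(10.0, 20.0)          # screen from 10 m to 20 m
print(depth, keep_sample(date(1995, 6, 1), depth, 0))
```

Note that, because of the lage condition, samples deeper than 100 m can remain in the data set.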
Aggregation of Multiple Measurements per Sample Site
Some of the sample sites have multiple measurements over time which were aggregated by calculating the mean. The distribution of the number of measurements per sample site after applying the previous processing steps is shown in figure 3.3.
Figure 3.3: Distribution of multiple measurements per sample site
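As an illustration of the aggregation step, the following Python sketch (the study itself used R) averages repeated measurements per sample site. The input layout and the function name are hypothetical, and skipping missing values before averaging is an assumption that the text does not confirm.

```python
from collections import defaultdict
from statistics import fmean


def aggregate_by_site(measurements):
    """Average repeated measurements per site.

    measurements: iterable of (station_id, value_or_None) pairs.
    Returns {station_id: mean, or None if the site has no valid value}.
    """
    grouped = defaultdict(list)
    for station_id, value in measurements:
        values = grouped[station_id]   # registers the site even if value is missing
        if value is not None:
            values.append(value)
    return {sid: (fmean(vals) if vals else None) for sid, vals in grouped.items()}


# Hypothetical measurements for three sites
rows = [("A", 44.0), ("A", 46.0), ("B", 82.9), ("C", None)]
print(aggregate_by_site(rows))
```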
From all measured parameters, the ten ions with the most samples were selected as target variables to be modelled (Ca, Cl, Fe, HCO3, K, Mg, Mn, Na, NO3, SO4; see fig. 3.4). Across all these parameters and after all preprocessing steps, 34536 samples and 12 columns remain for model training: one for the station ID (station_id), one for the sample depth (sample_depth) and one for each of the ten target variables.
Figure 3.4: Samples per hydrogeochemical parameter
The locations of the sample sites used for modelling are shown in figure 3.5 as the number of sample sites per hexagon. The spatial distribution of sampling locations is unbalanced: some regions have few locations, while others have a high density. The latter are mainly concentrated around larger cities in Germany such as Berlin, Hamburg and Frankfurt. The eastern and northern areas of Germany also generally have more sampling sites than southern or central Germany.
Figure 3.5: Sample site locations
The first three rows of the data set containing the target variables after all preprocessing steps are shown in table 3.1 as an example.
station_id | ca_mg_l | cl_mg_l | fe_mg_l | hco3_mg_l | k_mg_l | mg_mg_l | mn_mg_l | na_mg_l | no3_mg_l | so4_mg_l |
---|---|---|---|---|---|---|---|---|---|---|
110_1015 | 44.7 | 19.5 | 17.2 | 164.7 | 1.7 | 6.5 | 0.75 | 9.4 | 1.0 | 29.5 |
110_1016 | 82.9 | 46.0 | 15.0 | 207.4 | 6.5 | 13.9 | 2.80 | 25.0 | 0.1 | 126.0 |
110_1017 | 90.0 | 35.0 | 10.8 | 323.3 | 4.0 | 9.5 | 0.65 | 23.0 | 0.1 | 33.0 |
Figure 3.6 gives an overview of the occurrence of missing values in that data set. The number of missing values varies between the target variables, which leads to different sample sizes when each target is modelled separately.
Figure 3.6: Missing values across the target variables
More details on the column statistics are shown in the following summary:
Property | Value |
---|---|
Name | Piped data |
Number of rows | 34536 |
Number of columns | 10 |
Column type frequency: numeric | 10 |
Group variables | None |

Variable type: numeric

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ca_mg_l | 4303 | 0.88 | 88.39 | 103.78 | 0 | 40.50 | 74.20 | 112.00 | 3670.0 | ▇▁▁▁▁ |
cl_mg_l | 3025 | 0.91 | 250.48 | 4212.52 | 0 | 13.00 | 25.00 | 45.87 | 177000.0 | ▇▁▁▁▁ |
fe_mg_l | 13758 | 0.60 | 3.18 | 12.42 | 0 | 0.10 | 0.86 | 2.80 | 1170.0 | ▇▁▁▁▁ |
hco3_mg_l | 4578 | 0.87 | 220.75 | 175.41 | 0 | 99.00 | 207.40 | 318.10 | 7755.7 | ▇▁▁▁▁ |
k_mg_l | 9258 | 0.73 | 5.62 | 31.51 | 0 | 1.10 | 2.00 | 4.00 | 1440.0 | ▇▁▁▁▁ |
mg_mg_l | 4264 | 0.88 | 17.28 | 43.32 | 0 | 5.00 | 10.10 | 19.30 | 1801.3 | ▇▁▁▁▁ |
mn_mg_l | 9379 | 0.73 | 0.29 | 1.30 | 0 | 0.01 | 0.11 | 0.29 | 156.0 | ▇▁▁▁▁ |
na_mg_l | 10451 | 0.70 | 180.93 | 2978.77 | 0 | 7.00 | 12.20 | 24.00 | 116000.0 | ▇▁▁▁▁ |
no3_mg_l | 9548 | 0.72 | 14.47 | 26.88 | 0 | 0.10 | 3.00 | 18.00 | 708.0 | ▇▁▁▁▁ |
so4_mg_l | 4276 | 0.88 | 89.93 | 212.27 | 0 | 18.60 | 43.00 | 90.50 | 6880.0 | ▇▁▁▁▁ |
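Because each target is modelled separately, the number of usable samples per target follows directly from the n_missing column above. The short Python sketch below (for illustration only; the study itself used R) recomputes these sample sizes and the completeness rates from the table values. The function name is hypothetical.

```python
N_ROWS = 34536  # rows after preprocessing (from the summary above)

# n_missing per target, copied from the summary table
N_MISSING = {
    "ca_mg_l": 4303, "cl_mg_l": 3025, "fe_mg_l": 13758, "hco3_mg_l": 4578,
    "k_mg_l": 9258, "mg_mg_l": 4264, "mn_mg_l": 9379, "na_mg_l": 10451,
    "no3_mg_l": 9548, "so4_mg_l": 4276,
}


def effective_sample_sizes(n_rows, n_missing):
    """Rows with a non-missing value, per target variable."""
    return {name: n_rows - missing for name, missing in n_missing.items()}


sizes = effective_sample_sizes(N_ROWS, N_MISSING)
for name, n in sizes.items():
    print(f"{name}: n = {n}, complete_rate = {n / N_ROWS:.2f}")
```

For example, 20778 samples remain for Fe, the target with the most missing values, and 31511 for Cl, the target with the fewest.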
The distribution of the target values is shown as a violin chart in figure 3.7.
Figure 3.7: Distribution of the target variable values
In addition to this dataset, geophysical attributes were extracted from other spatial data sources (see the following list) and used as features:
The features were extracted for a 1 km buffer around every sample location as an approximation of the groundwater contributing area (Knoll, Breuer, and Bach 2019) (see figure 3.8). For categorical data, the proportion of each class in the buffer was calculated. An advantage of this approach is that it encodes the data as numerical features; a drawback is that rare classes produce many sparse features. For numerical data, the mean over the buffer was calculated.
Figure 3.8: Example of feature extraction based on circular buffer around sample sites (red) (e.g. the land use and land cover data as shown here)
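The buffer-based extraction can be sketched as follows. This is a simplified Python illustration (the study itself used R), not the project's implementation: it assumes the spatial data are available as equal-area cells with centre coordinates in metres, and it approximates area proportions by counting the cell centres that fall inside the buffer. All names are hypothetical.

```python
from collections import Counter
from math import hypot


def buffer_features(site_xy, cells, radius_m=1000.0):
    """Class proportions and numeric mean within a circular buffer.

    cells: iterable of (x, y, category, numeric_value) for equal-area cells.
    Returns (proportions_by_category, mean_value), or ({}, None) if the
    buffer contains no cell centre.
    """
    sx, sy = site_xy
    inside = [(cat, val) for x, y, cat, val in cells
              if hypot(x - sx, y - sy) <= radius_m]
    if not inside:
        return {}, None
    counts = Counter(cat for cat, _ in inside)
    proportions = {cat: n / len(inside) for cat, n in counts.items()}
    mean_value = sum(val for _, val in inside) / len(inside)
    return proportions, mean_value


# Hypothetical cells: (x, y, land cover class, groundwater recharge)
cells = [
    (0, 0, "agri", 100.0), (500, 0, "agri", 120.0),
    (0, 500, "forest", 80.0), (-500, 0, "forest", 100.0),
    (2000, 0, "urban", 50.0),   # outside the 1 km buffer
]
print(buffer_features((0, 0), cells))
```

Each category becomes one numeric column holding its proportion, which is why rare classes yield columns that are zero for almost all sites.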
The previously described method of extracting the features results in 165 features. A summary of the statistics of the features is provided in tab. 3.3.
Property | Value |
---|---|
Name | Piped data |
Number of rows | 34536 |
Number of columns | 165 |
Column type frequency: numeric | 165 |
Group variables | None |

Variable type: numeric

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
sampledepth_sampledepth | 0 | 1 | 41.71 | 36.13 | 0.00 | 17.00 | 42.00 | 50.00 | 644.00 | ▇▁▁▁▁ |
lulc_agriculturalareas | 0 | 1 | 0.53 | 0.33 | 0.00 | 0.24 | 0.56 | 0.83 | 1.00 | ▆▅▅▆▇ |
lulc_forestandseminaturalareas | 0 | 1 | 0.30 | 0.32 | 0.00 | 0.00 | 0.19 | 0.52 | 1.00 | ▇▂▂▂▂ |
lulc_artificialsurfaces | 0 | 1 | 0.15 | 0.24 | 0.00 | 0.00 | 0.05 | 0.20 | 1.00 | ▇▁▁▁▁ |
lulc_waterbodies | 0 | 1 | 0.02 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
lulc_wetlands | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.84 | ▇▁▁▁▁ |
gwrecharge_gwrecharge | 0 | 1 | 121.53 | 71.54 | 0.00 | 73.44 | 109.89 | 158.57 | 879.34 | ▇▂▁▁▁ |
seepage_seepage | 2 | 1 | 253.65 | 187.94 | -99.89 | 131.54 | 223.25 | 335.48 | 2559.33 | ▇▁▁▁▁ |
temperature_temperature | 0 | 1 | 84.56 | 8.72 | 8.82 | 80.65 | 85.08 | 89.54 | 107.44 | ▁▁▁▇▆ |
precipitation_precipitation | 0 | 1 | 719.63 | 191.05 | 453.25 | 574.28 | 691.90 | 800.75 | 2646.69 | ▇▁▁▁▁ |
hydrounits_14 | 0 | 1 | 0.21 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
hydrounits_13 | 0 | 1 | 0.12 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_17 | 0 | 1 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_15 | 0 | 1 | 0.14 | 0.34 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_63 | 0 | 1 | 0.02 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_31 | 0 | 1 | 0.05 | 0.21 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_62 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_101 | 0 | 1 | 0.02 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_41 | 0 | 1 | 0.04 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_64 | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_65 | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_0 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.43 | ▇▁▁▁▁ |
hydrounits_66 | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_95 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_71 | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_92 | 0 | 1 | 0.04 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_96 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_32 | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_52 | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_12 | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_11 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.06 | ▇▁▁▁▁ |
hydrounits_81 | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_33 | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_54 | 0 | 1 | 0.02 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_51 | 0 | 1 | 0.02 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_16 | 0 | 1 | 0.03 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_83 | 0 | 1 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_22 | 0 | 1 | 0.02 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_21 | 0 | 1 | 0.02 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_53 | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_23 | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_61 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_82 | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_94 | 0 | 1 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_91 | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_93 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrounits_97 | 0 | 1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_114 | 0 | 1 | 0.37 | 0.45 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
geology_111 | 0 | 1 | 0.14 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_113 | 0 | 1 | 0.16 | 0.35 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_115 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_223 | 0 | 1 | 0.00 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_231 | 0 | 1 | 0.02 | 0.15 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_232 | 0 | 1 | 0.02 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_233 | 0 | 1 | 0.04 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_600 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_120 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_221 | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_130 | 0 | 1 | 0.01 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_330 | 0 | 1 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_230 | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_400 | 0 | 1 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_222 | 0 | 1 | 0.00 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_500 | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_510 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_312 | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_320 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_220 | 0 | 1 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_112 | 0 | 1 | 0.01 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_211 | 0 | 1 | 0.02 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_210 | 0 | 1 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_350 | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_360 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_888 | 0 | 1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.83 | ▇▁▁▁▁ |
geology_333 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_332 | 0 | 1 | 0.01 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_331 | 0 | 1 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_300 | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_311 | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_212 | 0 | 1 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
geology_340 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.50 | ▇▁▁▁▁ |
soilunits_19 | 0 | 1 | 0.14 | 0.34 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_12 | 0 | 1 | 0.03 | 0.15 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_28 | 0 | 1 | 0.11 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_17 | 0 | 1 | 0.11 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_33 | 0 | 1 | 0.04 | 0.19 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_31 | 0 | 1 | 0.08 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_8 | 0 | 1 | 0.06 | 0.23 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_36 | 0 | 1 | 0.04 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_27 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_41 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_11 | 0 | 1 | 0.03 | 0.16 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_40 | 0 | 1 | 0.05 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_63 | 0 | 1 | 0.01 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_55 | 0 | 1 | 0.04 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_44 | 0 | 1 | 0.01 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_21 | 0 | 1 | 0.01 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_49 | 0 | 1 | 0.01 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_61 | 0 | 1 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_30 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_54 | 0 | 1 | 0.04 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_15 | 0 | 1 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_52 | 0 | 1 | 0.00 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_72 | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.96 | ▇▁▁▁▁ |
soilunits_69 | 0 | 1 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_18 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_59 | 0 | 1 | 0.03 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_68 | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_4 | 0 | 1 | 0.01 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_2 | 0 | 1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_5 | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_60 | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_66 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_51 | 0 | 1 | 0.00 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_34 | 0 | 1 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_46 | 0 | 1 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_42 | 0 | 1 | 0.01 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
soilunits_22 | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologygc_s | 0 | 1 | 0.78 | 0.38 | 0.00 | 0.73 | 1.00 | 1.00 | 1.00 | ▂▁▁▁▇ |
hydrogeologygc_Gew | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.82 | ▇▁▁▁▁ |
hydrogeologygc_m | 0 | 1 | 0.13 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologygc_a | 0 | 1 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologygc_so | 0 | 1 | 0.01 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologygc_kA | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologygc_k | 0 | 1 | 0.06 | 0.21 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologygc_g | 0 | 1 | 0.01 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologygc_h | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.14 | ▇▁▁▁▁ |
hydrogeologygc_gh | 0 | 1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.90 | ▇▁▁▁▁ |
hydrogeologykf_3 | 0 | 1 | 0.41 | 0.44 | 0.00 | 0.00 | 0.15 | 0.99 | 1.00 | ▇▁▁▁▅ |
hydrogeologykf_9 | 0 | 1 | 0.22 | 0.36 | 0.00 | 0.00 | 0.00 | 0.32 | 1.00 | ▇▁▁▁▂ |
hydrogeologykf_10 | 0 | 1 | 0.09 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologykf_4 | 0 | 1 | 0.06 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologykf_6 | 0 | 1 | 0.03 | 0.15 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologykf_99 | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.82 | ▇▁▁▁▁ |
hydrogeologykf_11 | 0 | 1 | 0.02 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologykf_12 | 0 | 1 | 0.05 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologykf_0 | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologykf_5 | 0 | 1 | 0.06 | 0.19 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologykf_2 | 0 | 1 | 0.04 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologykf_7 | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.95 | ▇▁▁▁▁ |
hydrogeologykf_8 | 0 | 1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.53 | ▇▁▁▁▁ |
hydrogeologyga_S | 0 | 1 | 0.90 | 0.28 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
hydrogeologyga_G | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.82 | ▇▁▁▁▁ |
hydrogeologyga_kA | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologyga_Me | 0 | 1 | 0.05 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hydrogeologyga_Ma | 0 | 1 | 0.04 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
elevation_elevation | 0 | 1 | 178.52 | 200.25 | -3.75 | 42.90 | 89.15 | 283.62 | 1846.03 | ▇▂▁▁▁ |
slope_slope | 0 | 1 | 2.14 | 3.38 | 0.00 | 0.15 | 0.57 | 2.72 | 43.65 | ▇▁▁▁▁ |
aspect_aspect | 0 | 1 | 178.40 | 27.61 | 35.84 | 162.94 | 179.11 | 194.25 | 301.42 | ▁▁▇▃▁ |
mohplp_streamorder1 | 12 | 1 | 5235.34 | 2961.90 | 0.00 | 2762.00 | 5273.00 | 7896.00 | 10030.00 | ▆▆▆▇▇ |
mohplp_streamorder2 | 5 | 1 | 5428.72 | 2906.28 | 0.00 | 3045.00 | 5574.00 | 7998.00 | 10030.00 | ▅▆▆▇▇ |
mohplp_streamorder3 | 3 | 1 | 5505.89 | 2920.10 | 0.00 | 3068.35 | 5735.00 | 8137.00 | 10030.00 | ▅▆▆▆▇ |
mohplp_streamorder4 | 3 | 1 | 5486.75 | 2934.16 | 0.00 | 3061.00 | 5682.67 | 8145.00 | 10007.00 | ▅▆▆▆▇ |
mohplp_streamorder5 | 3 | 1 | 5444.35 | 3002.40 | 0.00 | 2881.00 | 5758.00 | 8150.00 | 10001.00 | ▆▅▅▆▇ |
mohplp_streamorder6 | 3 | 1 | 5543.70 | 2952.94 | 0.00 | 3123.00 | 5870.00 | 8180.00 | 10000.00 | ▅▅▆▆▇ |
mohplp_streamorder7 | 3 | 1 | 5314.65 | 3058.03 | 0.00 | 2657.00 | 5427.00 | 8128.00 | 10000.00 | ▆▅▆▆▇ |
mohplp_streamorder8 | 3 | 1 | 7679.21 | 1413.66 | 385.00 | 6760.00 | 7644.00 | 8814.00 | 9999.00 | ▁▁▂▇▇ |
mohpdsd_streamorder1 | 12 | 1 | 5235.34 | 2961.90 | 0.00 | 2762.00 | 5273.00 | 7896.00 | 10030.00 | ▆▆▆▇▇ |
mohpdsd_streamorder2 | 5 | 1 | 5428.72 | 2906.28 | 0.00 | 3045.00 | 5574.00 | 7998.00 | 10030.00 | ▅▆▆▇▇ |
mohpdsd_streamorder3 | 3 | 1 | 5505.89 | 2920.10 | 0.00 | 3068.35 | 5735.00 | 8137.00 | 10030.00 | ▅▆▆▆▇ |
mohpdsd_streamorder4 | 3 | 1 | 5486.75 | 2934.16 | 0.00 | 3061.00 | 5682.67 | 8145.00 | 10007.00 | ▅▆▆▆▇ |
mohpdsd_streamorder5 | 3 | 1 | 5444.35 | 3002.40 | 0.00 | 2881.00 | 5758.00 | 8150.00 | 10001.00 | ▆▅▅▆▇ |
mohpdsd_streamorder6 | 3 | 1 | 5543.70 | 2952.94 | 0.00 | 3123.00 | 5870.00 | 8180.00 | 10000.00 | ▅▅▆▆▇ |
mohpdsd_streamorder7 | 3 | 1 | 5314.65 | 3058.03 | 0.00 | 2657.00 | 5427.00 | 8128.00 | 10000.00 | ▆▅▆▆▇ |
mohpdsd_streamorder8 | 3 | 1 | 7679.21 | 1413.66 | 385.00 | 6760.00 | 7644.00 | 8814.00 | 9999.00 | ▁▁▂▇▇ |
The first three rows of the data set containing the features are shown in table 3.4 as an example. This table was then joined with the table holding the target variables by station_id.
station_id | sampledepth_sampledepth | lulc_agriculturalareas | lulc_forestandseminaturalareas | lulc_artificialsurfaces | lulc_waterbodies | lulc_wetlands | gwrecharge_gwrecharge | seepage_seepage | temperature_temperature | precipitation_precipitation | hydrounits_14 | hydrounits_13 | hydrounits_17 | hydrounits_15 | hydrounits_63 | hydrounits_31 | hydrounits_62 | hydrounits_101 | hydrounits_41 | hydrounits_64 | hydrounits_65 | hydrounits_0 | hydrounits_66 | hydrounits_95 | hydrounits_71 | hydrounits_92 | hydrounits_96 | hydrounits_32 | hydrounits_52 | hydrounits_12 | hydrounits_11 | hydrounits_81 | hydrounits_33 | hydrounits_54 | hydrounits_51 | hydrounits_16 | hydrounits_83 | hydrounits_22 | hydrounits_21 | hydrounits_53 | hydrounits_23 | hydrounits_61 | hydrounits_82 | hydrounits_94 | hydrounits_91 | hydrounits_93 | hydrounits_97 | geology_114 | geology_111 | geology_113 | geology_115 | geology_223 | geology_231 | geology_232 | geology_233 | geology_600 | geology_120 | geology_221 | geology_130 | geology_330 | geology_230 | geology_400 | geology_222 | geology_500 | geology_510 | geology_312 | geology_320 | geology_220 | geology_112 | geology_211 | geology_210 | geology_350 | geology_360 | geology_888 | geology_333 | geology_332 | geology_331 | geology_300 | geology_311 | geology_212 | geology_340 | soilunits_19 | soilunits_12 | soilunits_28 | soilunits_17 | soilunits_33 | soilunits_31 | soilunits_8 | soilunits_36 | soilunits_27 | soilunits_41 | soilunits_11 | soilunits_40 | soilunits_63 | soilunits_55 | soilunits_44 | soilunits_21 | soilunits_49 | soilunits_61 | soilunits_30 | soilunits_54 | soilunits_15 | soilunits_52 | soilunits_72 | soilunits_69 | soilunits_18 | soilunits_59 | soilunits_68 | soilunits_4 | soilunits_2 | soilunits_5 | soilunits_60 | soilunits_66 | soilunits_51 | soilunits_34 | soilunits_46 | soilunits_42 | soilunits_22 | hydrogeologygc_s | hydrogeologygc_Gew | hydrogeologygc_m | hydrogeologygc_a | hydrogeologygc_so | hydrogeologygc_kA | hydrogeologygc_k | hydrogeologygc_g | hydrogeologygc_h | hydrogeologygc_gh | hydrogeologykf_3 | hydrogeologykf_9 | hydrogeologykf_10 | hydrogeologykf_4 | hydrogeologykf_6 | hydrogeologykf_99 | hydrogeologykf_11 | hydrogeologykf_12 | hydrogeologykf_0 | hydrogeologykf_5 | hydrogeologykf_2 | hydrogeologykf_7 | hydrogeologykf_8 | hydrogeologyga_S | hydrogeologyga_G | hydrogeologyga_kA | hydrogeologyga_Me | hydrogeologyga_Ma | elevation_elevation | slope_slope | aspect_aspect | mohplp_streamorder1 | mohplp_streamorder2 | mohplp_streamorder3 | mohplp_streamorder4 | mohplp_streamorder5 | mohplp_streamorder6 | mohplp_streamorder7 | mohplp_streamorder8 | mohpdsd_streamorder1 | mohpdsd_streamorder2 | mohpdsd_streamorder3 | mohpdsd_streamorder4 | mohpdsd_streamorder5 | mohpdsd_streamorder6 | mohpdsd_streamorder7 | mohpdsd_streamorder8 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
110_1 | 37.8 | 0.7103502 | 0.2896498 | 0.0000000 | 0 | 0 | 48.711914 | 147.50668 | 77.02868 | 566.8224 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 96.92465 | 0.2502644 | 165.2706 | 4936 | 1831 | 7135 | 9685 | 9381 | 5714 | 5603 | 9149 | 4936 | 1831 | 7135 | 9685 | 9381 | 5714 | 5603 | 9149 |
110_10 | 24.2 | 0.7080190 | 0.1805797 | 0.1114013 | 0 | 0 | 80.524147 | 174.22295 | 77.97757 | 559.4204 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 87.36120 | 0.6016899 | 180.6347 | 1716 | 5411 | 8938 | 8842 | 9281 | 5718 | 5557 | 8989 | 1716 | 5411 | 8938 | 8842 | 9281 | 5718 | 5557 | 8989 |
110_100 | 20.5 | 1.0000000 | 0.0000000 | 0.0000000 | 0 | 0 | 1.587526 | -52.47207 | 88.00000 | 535.1705 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 28.67915 | 0.0000000 | 185.9964 | 8771 | 8771 | 1481 | 1281 | 705 | 3764 | 7040 | 7561 | 8771 | 8771 | 1481 | 1281 | 705 | 3764 | 7040 | 7561 |
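The join of the feature table with the target table by station_id can be sketched as below. This Python snippet is for illustration only (the study itself used R). The text does not state the join type, so the inner join is an assumption, and the target values in the example are hypothetical. The feature values are taken from the rows above.

```python
def join_by_station(features, targets):
    """Inner join of two tables keyed by station_id.

    features, targets: {station_id: {column: value}}.
    Only stations present in both tables are returned.
    """
    return {sid: {**features[sid], **targets[sid]}
            for sid in features if sid in targets}


features = {
    "110_1": {"sampledepth_sampledepth": 37.8, "elevation_elevation": 96.92465},
    "110_10": {"sampledepth_sampledepth": 24.2, "elevation_elevation": 87.36120},
}
# Hypothetical target values, for illustration only
targets = {
    "110_1": {"ca_mg_l": 50.0},
    "110_999": {"ca_mg_l": 70.0},   # no matching feature row
}
print(join_by_station(features, targets))
```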
As with the data preprocessing, all modelling was done in the R programming language, using the tidymodels package.
The modelling pipeline includes the following steps:
- min_n: minimum sum of instance weight (hessian) needed in a child (Chen et al. 2020a)
- tree_depth: maximum depth of a tree (Chen et al. 2020a)
- learn_rate: learning rate (Chen et al. 2020a)
- loss_reduction: minimum loss reduction required to make a further partition on a leaf node of the tree (Chen et al. 2020a)

parameter | min_n | tree_depth | learn_rate | loss_reduction | .config |
---|---|---|---|---|---|
Ca | 34 | 10 | 0.0145681 | 0.0036223 | Preprocessor1_Model48 |
Cl | 7 | 9 | 0.0315631 | 0.0000001 | Preprocessor1_Model24 |
Fe | 37 | 14 | 0.0029476 | 1.9521420 | Preprocessor1_Model33 |
HCO3 | 34 | 10 | 0.0145681 | 0.0036223 | Preprocessor1_Model48 |
K | 12 | 8 | 0.0278855 | 0.0000000 | Preprocessor1_Model20 |
Mg | 7 | 9 | 0.0315631 | 0.0000001 | Preprocessor1_Model24 |
Mn | 37 | 14 | 0.0029476 | 1.9521420 | Preprocessor1_Model33 |
Na | 3 | 6 | 0.0067682 | 0.0000005 | Preprocessor1_Model01 |
NO3 | 34 | 10 | 0.0145681 | 0.0036223 | Preprocessor1_Model48 |
SO4 | 7 | 9 | 0.0315631 | 0.0000001 | Preprocessor1_Model24 |
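To make the roles of min_n and loss_reduction more concrete, the following Python sketch (for illustration only; the study itself used R) evaluates a single candidate split with the standard XGBoost gain formula. In tidymodels, min_n corresponds to the XGBoost parameter min_child_weight and loss_reduction to gamma. The L2 regularization value of 1.0 is the XGBoost default and an assumption here, since the text does not report it. The gradient and hessian sums are hypothetical. With squared-error loss, each sample has a hessian of 1, so the hessian sum of a child equals its number of samples.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0):
    """Loss reduction of a candidate split, from gradient and hessian sums."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))


def split_allowed(g_left, h_left, g_right, h_right,
                  min_n, loss_reduction, reg_lambda=1.0):
    """A split is kept only if both children are heavy enough (min_n)
    and the gain reaches the required loss reduction."""
    if h_left < min_n or h_right < min_n:
        return False
    gain = split_gain(g_left, h_left, g_right, h_right, reg_lambda)
    return gain >= loss_reduction


# Hypothetical split: five samples on each side (squared-error loss)
print(split_gain(-10.0, 5.0, 10.0, 5.0))
print(split_allowed(-10.0, 5.0, 10.0, 5.0, min_n=7, loss_reduction=1e-7))
print(split_allowed(-10.0, 5.0, 10.0, 5.0, min_n=3, loss_reduction=1.952142))
```

With min_n = 7 (as selected for Cl, Mg and SO4) this split is rejected because each child holds only five samples, whereas with min_n = 3 (as selected for Na) it passes the weight check and is then judged on its gain.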
Disclaimer: As these results are preliminary and this study is still work in progress, the plain performance metrics are provided without any further statements or interpretations.
This study aims to set up a benchmark model as a basis for the further development of more sophisticated machine learning models. The relatively simple modelling approach shows that model performance varies strongly between the target variables, indicating that further features are required to reflect the geophysical characteristics that play a role in the processes driving the concentration of the target variable. Further steps to improve this preliminary model setup are: