In France, the clean-up threshold (or objective) is defined individually for each site according to health constraints and the characteristics of the contamination (spatial location, total pollutant mass, etc.). This threshold is flexible: it factors in the economics of treatment costs and land constraints, and goes hand in hand with the notion of residual pollution, which corresponds to the pollution left on site. The ability to ‘customize’ a site’s decontamination threshold is valuable because it provides elasticity and allows for reasoned, case-by-case adjustment. But it also brings a constraint: that of characterizing the contamination as precisely as possible!
The environmental characterization of a site classically proceeds as follows:
- A research phase into the history of the site’s activities and the location of areas at risk of contamination;
- A field phase: carrying out surveys with drill rigs and taking soil samples;
- Analysis of the soil samples by (certified) laboratories;
- Interpretation of the results by a consultant.
Field operations and sample analysis are the most expensive stages of characterization. As a result, they are sensitive items when building characterization programs, and budget constraints often limit the number of surveys and analyses.
Added to this first, financial constraint is another notion: representativeness. A sample sent to the laboratory weighs between 200 and 500 g. The raw result of the analysis is then applied – regardless of the change in scale – to soil masses in the order of ten tonnes. The ratio between these two quantities is 10⁶. By way of comparison, this is like extrapolating from the surface of a handkerchief to that of a football field. If this ratio is already dizzying, consider that the analysis of our sample is performed on only a tenth of the jar, that the heterogeneity of the soils has not been factored in (we are among geologists, after all!), nor have sampling errors, and the list goes on… In short, laboratory data are relative, but it is possible to work with them under good conditions.
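The orders of magnitude behind this extrapolation can be laid out in a few lines. This is an illustrative sketch only: the ~10 g analyzed aliquot is an assumption on our part, chosen to be consistent with the 10⁶ ratio quoted above.

```python
# Orders of magnitude behind the representativeness problem.
# Illustrative figures: the ~10 g analyzed aliquot is an assumption,
# consistent with the 10^6 ratio quoted in the text.

lot_mass_g = 10e6        # ~10 tonnes of soil, in grams
sample_mass_g = 300.0    # sample sent to the laboratory (200-500 g)
aliquot_mass_g = 10.0    # portion actually analyzed (assumed)

extrapolation_ratio = lot_mass_g / aliquot_mass_g
print(f"Sample covers {sample_mass_g / lot_mass_g:.1e} of the lot")
print(f"Extrapolation factor: {extrapolation_ratio:.0e}")
```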
As you may have guessed, these observations lead to the notion of uncertainty in laboratory measurement – and to the need to integrate these uncertainties at the scale of a contaminated site. There are techniques that can quantify, and even reduce, the uncertainties related to sample representativeness. Of particular note are:
- On-site measurements. These are all the types of measurement that can be made in real time on site: XRF, REMSCAN, PID, colorimetric kits, etc. These measurements complement laboratory tests. They are less accurate but also significantly cheaper and faster, allowing the amount of data to be multiplied in a single field day. They therefore make it possible to characterize contaminated sites much more finely, thus reducing uncertainties.
However, these measurements are considered inferior to laboratory tests, which have legal standing. Regulatory authorities in the field of contaminated sites (DREAL¹, ministries, etc.) will rely on the latter as a priority to validate the conclusions of environmental characterizations.
¹ Regional Directorates of Environment, Planning and Housing
- Geostatistics, which can quantify uncertainty. Spatializing the information and modeling the contamination in three dimensions makes it possible to put laboratory data into perspective: geostatistics puts the consistency of the impacted volume as a whole ahead of the raw values of individual surveys.
This tool is powerful and still underutilized in the industry, but it requires a certain number of surveys and analyses to be fully effective (as a rule of thumb, a minimum of around thirty analyses). Geostatistics alone does not solve the problem of the number of surveys, which is often too small to characterize contaminated sites accurately.
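To make the geostatistical idea concrete, here is a minimal sketch on synthetic data, using Gaussian process regression (the machine-learning formulation of kriging) from scikit-learn rather than a dedicated geostatistics package. The thirty borehole locations, the kernel choice and the concentration model are all assumptions for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# ~30 synthetic borehole locations (x, y in metres) and concentrations (mg/kg),
# matching the rule-of-thumb minimum of around thirty analyses
coords = rng.uniform(0, 100, size=(30, 2))
conc = 5 + 0.05 * coords[:, 0] + rng.gamma(2.0, 1.0, size=30)

# Gaussian process regression plays the role of kriging here: the RBF kernel
# stands in for the variogram model, the WhiteKernel for the nugget effect
gp = GaussianProcessRegressor(kernel=1.0 * RBF(30.0) + WhiteKernel(1.0),
                              normalize_y=True).fit(coords, conc)

# Estimate the concentration AND its uncertainty anywhere on the site
grid = np.array([[50.0, 50.0], [90.0, 10.0]])
mean, std = gp.predict(grid, return_std=True)
for (x, y), m, s in zip(grid, mean, std):
    print(f"({x:.0f}, {y:.0f}) m: {m:.1f} ± {s:.1f} mg/kg")
```

The point is the `± s` term: the spatial model turns a handful of point analyses into an estimate with a quantified uncertainty at every location.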
This brings us, finally, to the problem that artificial intelligence, or machine-learning, can solve: establishing a relationship between on-site measurements and laboratory data.
Machine-learning to predict laboratory results from in-situ measurements
Making use of in-situ measurements – complementary to laboratory analyses, less precise but much faster and cheaper – is becoming an important challenge in adapting to innovations in the field of contaminated sites and soils. There are two possible approaches:
- The geostatistical processing of these measurements as co-variables of conventional laboratory analyses (cokriging in particular). The advantage of the in-situ measurements is then to strengthen the volume estimates by providing a more detailed mesh of the site under study.
The laboratory value remains the benchmark; the field measurements only support the estimates.
- The processing of these measurements using machine-learning approaches to predict the laboratory analysis. This approach is completely different, because it is about transforming in-situ measurements into a laboratory equivalent – giving them legal standing.
In this context, the in-situ measurements become the benchmark – and can go so far as to replace laboratory tests.
Why talk about machine-learning – often associated with the concepts of Big Data – rather than a statistical approach? Are we just surfing on a buzzword?
Yes and no! Of course, some of what was called statistics yesterday is now part of the machine-learning family, but it is properties specific to this field of expertise that we are looking for:
- Machine-learning models effectively handle erratic datasets – common in the geosciences (heterogeneity is an understatement in our field).
As we said, in-situ measurements are less accurate than laboratory analyses. The figure below is an example: laboratory analyses effectively discriminate the lithologies of Greater Paris by their geochemical signatures. Conversely, on the same samples, field measurements are not precise enough to distinguish the lithologies effectively.
Complex algorithms allow us to work with such data: for example, the Random Forest family, which brings together algorithms regularly used on geoscientific datasets for their flexibility.
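As an illustration of the idea, the sketch below trains a Random Forest to map noisy in-situ XRF readings to a laboratory concentration. The data are entirely synthetic (the three XRF channels and the non-linear relationship are assumptions); the point is only that the forest tolerates erratic, interacting predictors.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic campaign: noisy in-situ XRF readings for 3 elements per sample,
# and a "laboratory" concentration driven by a non-linear mix of them
n = 300
xrf = rng.lognormal(mean=2.0, sigma=0.6, size=(n, 3))
lab = 0.4 * xrf[:, 0] + 0.1 * xrf[:, 1] * xrf[:, 2] + rng.normal(0, 1.0, n)

X_train, X_test, y_train, y_test = train_test_split(xrf, lab, random_state=0)

# Random forests handle the heterogeneous, erratic data typical of geosciences
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(f"R² on held-out samples: {model.score(X_test, y_test):.2f}")
```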
- Machine-learning models support the multiplication of predictors, prioritize them, and allow experimentation without too many constraints…
There are a multitude of ways to handle totally different sources of information: from photos to emission spectra to field observations. Algorithms can rely on all these sources to pursue an objective within a defined framework or not (we speak of supervised or unsupervised algorithms). While, in general, a large amount of data is needed for stable models, there are ways to work with smaller datasets. In the latter case, the key is to model the uncertainties on the prediction – notably by producing a distribution of values rather than a “fixed” estimate.
The illustration below shows the correlation between the predictions of a GLM-type algorithm and laboratory analyses. The objective was to predict the antimony concentration of samples from in-situ measurements taken with a field XRF device. The predictions – which follow a Gamma distribution (blue dot: median of the distribution, red cross: mean of the distribution) – allow us to account for the model’s margin of error. Such a model cannot make reliable predictions for high concentrations (say, above 1 mg/kg): for samples with such antimony concentrations, the algorithm’s response range spans 0.5 to 11 mg/kg!
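A minimal version of this kind of model can be sketched with scikit-learn’s Gamma GLM. Everything here is synthetic and assumed: the single XRF predictor, the log-link relationship, and the fixed Gamma shape parameter used to turn the predicted mean into an interval; the real model described above is richer.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import GammaRegressor

rng = np.random.default_rng(1)

# Synthetic stand-in: field XRF reading (mg/kg) vs laboratory antimony result
xrf = rng.uniform(0.1, 5.0, size=(200, 1))
shape = 4.0                                   # assumed Gamma shape (dispersion)
mu_true = np.exp(0.2 + 0.5 * xrf[:, 0])
lab = rng.gamma(shape, mu_true / shape)

# GLM with a Gamma family and log link, as described in the text
glm = GammaRegressor(alpha=0.0).fit(xrf, lab)

# Predict a distribution rather than a fixed value: mean plus a 90% interval
x_new = np.array([[3.0]])
mu = glm.predict(x_new)[0]
lo, hi = stats.gamma.ppf([0.05, 0.95], a=shape, scale=mu / shape)
print(f"Predicted mean: {mu:.2f} mg/kg, 90% interval: [{lo:.2f}, {hi:.2f}]")
```

Reporting the interval rather than the point estimate is what makes the model’s margin of error visible, exactly as in the figure.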
One of the methods used to improve the prediction was to include among the predictors a treatment of the photos of the samples. This step alone divides the width of the distributions by 3, improving the accuracy of the predictions.
With this example, we touch on a specific aspect of the field of contaminated sites and soils: our stakes lie in the extreme values, the outliers, because they are what indicate the presence of contamination. We are therefore more interested in precisely modeling the few occurrences of high values (in the example above, values greater than 1 mg/kg, which indicate “contamination” by antimony) than in the distribution itself. Here again, the large machine-learning family has the edge over purely statistical methodologies, through outlier-detection algorithms for example.
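One such outlier-detection algorithm is the Isolation Forest, a tree-based relative of the Random Forest. The sketch below is a toy illustration on synthetic antimony-like data (the distributions and the contamination rate are assumptions), showing how the rare high values that signal contamination get flagged.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Mostly "clean" samples, plus a few contaminated outliers (high antimony)
clean = rng.gamma(2.0, 0.2, size=(200, 1))   # bulk well below 1 mg/kg
hot = rng.uniform(3.0, 11.0, size=(5, 1))    # rare contaminated samples
conc = np.vstack([clean, hot])

# Isolation Forest flags the rare extreme values that indicate contamination
detector = IsolationForest(contamination=0.03, random_state=0).fit(conc)
flags = detector.predict(conc)               # -1 = outlier, 1 = inlier

print(f"Flagged {np.sum(flags == -1)} of {len(conc)} samples as outliers")
```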
An example of industrialization: work on the tunnellers of Greater Paris
On this project, we were able to develop machine-learning algorithms that can determine the soil disposal route instantly from in-situ measurements. The success rate compared with laboratory tests is around 95%.
The financial gain is substantial because the disposal route can be known even before the soil has left the tunneller. By comparison, the traditional method is to store the excavated material, carry out laboratory analyses (with delays of several days) and then send it to the appropriate centers according to the results of the analyses.
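In outline, the task is a classification problem: predict the disposal route from in-situ measurements, with laboratory results as the ground truth. The sketch below is a hypothetical, synthetic reconstruction of that setup (the four measurement channels, the three route categories and the thresholds are all assumptions, not the project’s actual model).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Synthetic tunneller muck: in-situ measurements (e.g. XRF channels) and a
# disposal-route label normally obtained from laboratory analyses
n = 600
X = rng.normal(size=(n, 4))
score = X[:, 0] + 0.8 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(0, 0.3, n)
route = np.digitize(score, [-1.0, 1.0])  # 0: inert, 1: non-hazardous, 2: hazardous

X_tr, X_te, y_tr, y_te = train_test_split(X, route, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Agreement with the (simulated) laboratory ground truth
print(f"Accuracy: {clf.score(X_te, y_te):.0%}")
```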
The main technical difficulty in this project was that the algorithms had to be trained on the variability of the values they have to predict. To put it more graphically, imagine that you have to guess what an image represents with only 10% of its pixels visible. If those 10% of pixels are spread all over the image, it is easier to predict what is represented than if they are all confined to the bottom-left corner.
Since a tunneller advances linearly, it is difficult to predict pockets of heterogeneity (common in geology) before encountering them. The algorithm can then handle this heterogeneity only in proportion to the robustness (and, to some extent, the size) of its training set. Regularly updating the models increases their robustness.
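The update loop can be sketched as follows: as excavation advances and new laboratory-confirmed samples come in, they are folded into the training set and the model is refit. This is a schematic assumption about the workflow, not the project’s actual pipeline; the `new_ring_data` helper and its `shift` parameter, which mimics an unseen geological pocket, are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)

def new_ring_data(n, shift):
    """Simulated samples from the tunneller face; `shift` mimics a new
    geological pocket absent from earlier training data (assumption)."""
    X = rng.normal(loc=shift, size=(n, 3))
    y = (X[:, 0] + X[:, 1] > 2 * shift).astype(int)
    return X, y

# Initial training set from the first rings
X_all, y_all = new_ring_data(200, shift=0.0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_all, y_all)

# As excavation enters a heterogeneous pocket, fold the new laboratory-
# confirmed samples into the training set and retrain: robustness grows
# with the amount and variability of the data seen
X_new, y_new = new_ring_data(50, shift=2.0)
X_all, y_all = np.vstack([X_all, X_new]), np.concatenate([y_all, y_new])
model.fit(X_all, y_all)
print(f"Training samples after update: {len(X_all)}")
```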
Finally: passing fad or fundamental shift?
AI and machine-learning technologies provide powerful tools for processing environmental data and allow value to be extracted from information sources that are too often set aside. In France, with regulations that treat contaminated sites on a case-by-case basis, we have much to gain from using and developing our expertise in data processing. Geostatistics and machine-learning are levers for improving the characterization of contaminated sites, reducing the associated costs and managing our territories’ environment more efficiently.
Machine-learning responds to a need and, I believe, to a fundamental shift rather than a passing fad. Attitudes are already changing, and we are gradually becoming aware that data is a resource that must be used.