Among its many responsibilities, Rijkswaterstaat (RWS) maintains the primary infrastructure. To make a good trade-off between costs, performance and risks when maintaining assets, RWS uses a tool called RCMCost.
Too many variables
RWS approached Cmotions with a request concerning the input data of its RCMCost model. With this input, RWS can review the expected maintenance per location. Aggregated analyses over many similar locations allow RWS to estimate and manage costs better. However, the registrations of costs and descriptions per object are not standardized: individuals within the organization fill in the maintenance interface differently.
This is where the complication arises: with hundreds of locations, multiple unstandardized registration fields and dozens of different experts filling in the registrations, analyzing costs across locations is time-consuming and limited. RCMCost modelling experts from RWS and data scientists from Cmotions collaborated to develop and implement a data-driven solution that automates and standardizes the cost model input data.
The solution
Cmotions implemented and evaluated several different NLP techniques to standardize the input of the RCMCost model.
First, an exploratory data analysis (EDA) was conducted to find patterns and frequently occurring keywords. Based on these keywords, we were able to narrow down the textual input by selecting one keyword. Moreover, we were able to map these textual entities to the properties of the model, which makes the cost analysis much more efficient.
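The keyword step can be illustrated with a minimal Python sketch. It assumes that "selecting one keyword" means keeping the most frequent known keyword per description. The example descriptions, the stopword list and the keyword-to-property mapping are all hypothetical, since the actual RWS registrations and model properties are not shown here.

```python
import re
from collections import Counter

# Hypothetical free-text registrations (not real RWS data).
descriptions = [
    "Replace pump in cellar",
    "pump inspection yearly",
    "Inspection of lighting",
    "Lighting replacement LED",
    "Clean drainage pump",
]

# Hypothetical stopwords and keyword-to-property mapping.
STOPWORDS = {"in", "of", "the"}
KEYWORD_TO_PROPERTY = {"pump": "Pump", "lighting": "Lighting"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens, minus stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def keyword_counts(texts):
    """Count in how many descriptions each token occurs."""
    counts = Counter()
    for text in texts:
        counts.update(set(tokenize(text)))
    return counts


def select_keyword(text, keyword_map, counts):
    """Return the model property of the most frequent known keyword, or None."""
    candidates = [t for t in tokenize(text) if t in keyword_map]
    if not candidates:
        return None
    return keyword_map[max(candidates, key=lambda t: counts[t])]


counts = keyword_counts(descriptions)
properties = [select_keyword(d, KEYWORD_TO_PROPERTY, counts) for d in descriptions]
```

Descriptions without a known keyword return None, so they can be passed on to the more advanced techniques instead of being forced into a category.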
Second, we used a semantic similarity technique to identify and pair textual fields that are semantically similar. The goal of this step is to merge different terms and label them with a standardized version. Semantic similarity techniques look at the meaning of terms rather than their literal spelling, which makes it possible to group words that do not look the same but have the same meaning. If you would like to know more about semantic similarity and how it works, the following blog gives a great overview.
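As a rough illustration of this grouping step, the sketch below merges terms whose vectors have a high cosine similarity and labels each group with its first member. The three-dimensional vectors are hand-made toy values standing in for the output of a real embedding model, and both the threshold and the greedy grouping strategy are assumptions, not the method used in the project.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_similar(vectors, threshold=0.8):
    """Greedy grouping: a term joins the first group whose representative
    is at least `threshold` similar, otherwise it starts a new group."""
    groups = {}  # representative term -> list of member terms
    for term, vec in vectors.items():
        for rep in groups:
            if cosine(vec, vectors[rep]) >= threshold:
                groups[rep].append(term)
                break
        else:
            groups[term] = [term]
    return groups


# Toy vectors (assumption): a real pipeline would obtain these from an
# embedding model that captures the meaning of each term.
toy_vectors = {
    "pump": [0.9, 0.1, 0.0],
    "water pump unit": [0.85, 0.2, 0.05],
    "lamp": [0.05, 0.1, 0.95],
    "light fixture": [0.1, 0.15, 0.9],
}

groups = group_similar(toy_vectors)
# Map every term to the standardized label of its group.
standard_label = {m: rep for rep, members in groups.items() for m in members}
```

Using the first member as the label keeps the sketch short. In practice a domain expert would likely pick or confirm the standardized name for each group.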
Finally, for fields where the textual variety was very high and the results based on semantic similarity were still not accurate enough, a more advanced technique was used. With the help of Generative AI, we were able to map this variety of field values to a small number of standardized values. Large Language Models have a deep understanding of language and can be used to identify meaningful categorisations of words or sentences. In our case, this helped to identify a set of valuable group labels and descriptions for the functions that parts of infrastructural locations have.
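One way to picture the mapping step is the sketch below, which builds a classification prompt, sends it to a language model and accepts the answer only if it is one of the allowed labels. The label set, the prompt wording, the fallback value and the `call_llm` parameter are all hypothetical. The article does not say which model or prompts were used, so `fake_llm` is only a stand-in that makes the example runnable without any external service.

```python
from typing import Callable, Sequence

# Hypothetical standardized labels; the real label set is not given here.
ALLOWED_LABELS = ["drainage", "lighting", "traffic control"]
FALLBACK = "unmapped"


def build_prompt(text: str, labels: Sequence[str]) -> str:
    """Build a prompt asking the model to pick exactly one allowed label."""
    options = ", ".join(labels)
    return (
        "Classify the function described below into exactly one of these labels: "
        f"{options}.\nAnswer with the label only.\n\nDescription: {text}"
    )


def map_to_label(
    text: str,
    call_llm: Callable[[str], str],
    labels: Sequence[str] = ALLOWED_LABELS,
) -> str:
    """Ask the model for a label and reject answers outside the allowed set."""
    answer = call_llm(build_prompt(text, labels)).strip().lower()
    return answer if answer in labels else FALLBACK


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption), used for demonstration only."""
    return "Lighting\n" if "lamp" in prompt.lower() else "something else"


label = map_to_label("LED lamp above tunnel entrance", fake_llm)
```

Validating the answer against a fixed label set means unexpected model output ends up in the fallback category, where it can be reviewed by hand rather than silently entering the cost model.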
Standardization leads to greater efficiency
In total, using NLP techniques ranging from keyword selection to Generative AI, we were able to add standardized labels to the textual fields used as input for the RCMCost model. RWS can use these results to standardize values for efficient RCMCost modelling analyses and to define the standardized values to offer in its interface in the future. Key to the success of the project were the close collaboration between RWS business experts and Cmotions NLP experts, the iterative approach and the broad set of NLP techniques considered.