Metadata Extractor
This stage applies a regular expression to extract date or text information from documents. The extraction can be applied to specific metadata fields or to the document body.
Please note that extraction from the body field only produces meaningful results if the document text has already been extracted in an earlier step.
The configuration of this stage is given as a JSON array. The following JSON example contains both examples described below:
[
  {
    "sourceField": "body",
    "targetField": "lastModificationDate",
    "regex": ".Updated on.?(\\d+ .+? \\d\\d\\d\\d).*",
    "override": false,
    "isDate": true,
    "dateLocale": "en",
    "dateFormat": "dd LLL yyyy"
  },
  {
    "sourceField": "field1",
    "targetField": "taxonomy",
    "regex": "^(.*?),.*$",
    "override": false,
    "replaceValue": "$1"
  }
]
There are two kinds of extractions which you can apply, text extraction and date extraction.
Text Extraction
Here, you apply a regular expression to the contents of a source field in order to extract text or numeric values. If it matches, the transformed result is stored in the target field. The parameters are the following:
sourceField can be an arbitrary name of a metadata field. If you enter “body”, the extraction is applied to the document body.
targetField is the name of the field where the result should be stored, if the regular expression below matches.
regex is a Java regular expression which is used to match the input (cf. https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html ). It can contain one or more matching groups, which are relevant for replaceValue.
replaceValue defines the transformation which is applied to the source field's contents before they are stored in the target field. Here, you can use $1, $2, etc. to include matching groups from regex.
override defines whether the target field is overwritten if it already contains a value. If override is false and the target field is non-empty, the transformed result is not stored at all.
Example
The following example takes a string like “pizza, lasagne, carpaccio”, extracts “pizza” from it, and stores “pizza” in “taxonomy”. The lazy group (.*?) stops at the first comma; a greedy (.*) would capture “pizza, lasagne” instead.
[
  {
    "sourceField": "field1",
    "targetField": "taxonomy",
    "regex": "^(.*?),.*$",
    "replaceValue": "$1",
    "override": false
  }
]
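The interplay of regex, replaceValue and override can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the stage's actual implementation: the class TextExtractionSketch, the method apply and the use of a Map as a stand-in for a document's metadata are hypothetical, and it is an assumption that the stage searches for the pattern within the field. The sketch uses the lazy group (.*?) so that $1 ends at the first comma; a greedy (.*) would capture everything up to the last comma.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a text extraction rule; not the stage's real code.
class TextExtractionSketch {

    // Applies one rule to a document represented as a field-name -> value map.
    static void apply(Map<String, String> doc, String sourceField, String targetField,
                      String regex, String replaceValue, boolean override) {
        String source = doc.get(sourceField);
        if (source == null) {
            return; // nothing to extract from
        }
        String existing = doc.get(targetField);
        if (!override && existing != null && !existing.isEmpty()) {
            return; // target is non-empty and override is false: store nothing
        }
        Matcher matcher = Pattern.compile(regex).matcher(source);
        if (matcher.find()) {
            // $1, $2, ... in replaceValue refer to the matching groups of regex
            doc.put(targetField, matcher.replaceFirst(replaceValue));
        }
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("field1", "pizza, lasagne, carpaccio");
        apply(doc, "field1", "taxonomy", "^(.*?),.*$", "$1", false);
        System.out.println(doc.get("taxonomy")); // prints "pizza"
    }
}
```

Because override is false here, a second call with a different source value would leave “taxonomy” unchanged once it has been filled.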
Date Extraction
Here, you apply a regular expression to the contents of a source field in order to extract a date. If it matches, the date is stored in the target field. The parameters are the following:
isDate must be set to true; it switches the rule to date extraction mode.
sourceField can be an arbitrary name of a metadata field. If you enter “body”, the extraction is applied to the document body.
targetField is the name of the field where the result should be stored, if the regular expression below matches.
regex is a Java regular expression which is used to match the input (cf. https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html ). It must contain exactly one matching group. The contents of this group must be parseable with the date format below.
dateFormat defines a simple date format (cf. https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html ) which is applied to matching group 1 of regex.
dateLocale defines which locale is used when parsing the date. Please use a language code like “en”, “fr” or “de” here. It is important to define the language code: an expression like “18 Dec 2023” is English, and interpreting it as German (where the month abbreviation would be “Dez”) will not work.
override defines whether the target field is overwritten if it already contains a value. If override is false and the target field is non-empty, the transformed result is not stored at all.
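Why the language code matters can be illustrated with a small Java sketch. It assumes that dateFormat and dateLocale are handed to java.text.SimpleDateFormat, which the linked date format tutorial suggests but this page does not state explicitly; the class DateLocaleSketch and the method parses are hypothetical names used only for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

// Hypothetical sketch showing the effect of dateLocale on date parsing.
class DateLocaleSketch {

    // Returns true if the text can be parsed with the given pattern and language code.
    static boolean parses(String text, String dateFormat, String dateLocale) {
        SimpleDateFormat format =
                new SimpleDateFormat(dateFormat, Locale.forLanguageTag(dateLocale));
        try {
            format.parse(text);
            return true;
        } catch (ParseException e) {
            return false; // month name is unknown in this locale
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("18 Dec 2023", "dd LLL yyyy", "en")); // true
        System.out.println(parses("18 Dec 2023", "dd LLL yyyy", "de")); // false
    }
}
```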
Example
The following example takes a string like “Release Notes Updated on 23 Apr 2024 5 Minutes to read” and first detects “23 Apr 2024”. Afterwards it applies the date format parsing to transform it (internally) to “2024-04-23T00:00:00.000Z”.
[
  {
    "isDate": true,
    "sourceField": "body",
    "targetField": "lastModificationDate",
    "regex": ".Updated on.?(\\d+ .+? \\d\\d\\d\\d).*",
    "override": false,
    "dateLocale": "en",
    "dateFormat": "dd LLL yyyy"
  }
]
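The two steps of the date extraction (detect matching group 1, then parse it) can be sketched in Java like this. The sketch is hypothetical and rests on several assumptions that this page does not confirm: that the pattern is searched for within the field (Matcher.find), that parsing uses java.text.SimpleDateFormat, and that dates are interpreted in UTC. The names DateExtractionSketch and extract are made up for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a date extraction rule; not the stage's real code.
class DateExtractionSketch {

    // Returns the extracted date as an ISO-8601 UTC string, or null if nothing matched.
    static String extract(String source, String regex, String dateFormat, String dateLocale) {
        Matcher matcher = Pattern.compile(regex).matcher(source);
        if (!matcher.find()) {
            return null; // regex did not match: nothing is stored
        }
        TimeZone utc = TimeZone.getTimeZone("UTC"); // assumption: dates are read as UTC
        SimpleDateFormat parser =
                new SimpleDateFormat(dateFormat, Locale.forLanguageTag(dateLocale));
        parser.setTimeZone(utc);
        Date date;
        try {
            date = parser.parse(matcher.group(1)); // matching group 1 holds the date text
        } catch (ParseException e) {
            return null; // group 1 does not fit dateFormat / dateLocale
        }
        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", Locale.ROOT);
        iso.setTimeZone(utc);
        return iso.format(date);
    }

    public static void main(String[] args) {
        String body = "Release Notes Updated on 23 Apr 2024 5 Minutes to read";
        String regex = ".Updated on.?(\\d+ .+? \\d\\d\\d\\d).*";
        // prints "2024-04-23T00:00:00.000Z"
        System.out.println(extract(body, regex, "dd LLL yyyy", "en"));
    }
}
```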