Index Data Reader "Pdf"
This index data reader extracts native text content from PDF files. Extracting text from embedded image data, which would require OCR processing, is not supported.
When extracting text content, the structure of the PDF format must be taken into account. PDF files do not have contiguous text areas. Instead, each string (usually an individual word) is a separate text fragment. To extract fragments, each one must be addressed by its X/Y coordinates on the page. The coordinates can be determined in three ways: by measuring them manually (e.g., on a paper printout that is true to scale), by using a helper method of the Config Web Service (see property ConfigService.Activate), or by reading them out with a wizard in the Configurator.
| Property | Description |
|---|---|
| ProcessReadIndex[].Tolerance | Tolerance range in millimeters. The tolerance range specifies how far the coordinates of a text fragment may deviate from a given value and still be considered a match to that value. Default value: |
| ProcessReadIndex[].LineFeed(#) | Line-break characters used within extracted text areas that span multiple lines. Default value: |
| ProcessReadIndex[].ItemPerLine | Boolean value determining whether an extracted text area assigned to a table field is divided into a separate value per line in order to generate a corresponding number of table lines from it. Otherwise, multiple values for tabular target fields can only be read by using one of the syntax variants for the … Default value: |
To define the .InputName (i.e., the text fragment or range of fragments to be read), use the special syntax explained below.
Absolute Position
In the simplest case, a single fragment is addressed via the absolute position (i.e., via the X and Y coordinates) with the unit of measurement being millimeters.
(X,Y)
The X/Y coordinates are relative to the page origin. In contrast to a normal coordinate system, the page origin is located at the top left of the page; this corresponds to the direction of reading. The top left corner of the page has the coordinates (0,0).
The coordinates of a text fragment always refer to the starting position of the fragment (i.e., its lower left corner). For example, the coordinates (20,100) are used to search for a fragment whose start is 20 mm from the left page margin and whose bottom edge is 100 mm from the top page margin. These specifications refer to a fragment that is aligned straight. For a rotated fragment, the desired coordinates must be measured using the correspondingly rotated page so that the fragment itself is aligned straight again. Only rotations at a 90-degree angle are supported. Rotations of the entire page instead of its constituent fragments do not need to be taken into account, since these are implicitly compensated for by the program.
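The lookup of a single fragment by absolute position can be illustrated with a short Python sketch. It is not the product's implementation: the `Fragment` type and the `find_at` function are hypothetical names, and the sketch assumes that the tolerance from the .Tolerance property is applied to the X and Y coordinates independently.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical representation of one PDF text fragment."""
    text: str
    x: float  # mm from the left page edge to the start of the fragment
    y: float  # mm from the top page edge to the bottom edge of the fragment


def find_at(fragments, x, y, tolerance):
    """Return all fragments whose start position deviates from (x, y)
    by at most `tolerance` mm on each axis (assumed matching rule)."""
    return [
        f for f in fragments
        if abs(f.x - x) <= tolerance and abs(f.y - y) <= tolerance
    ]


page = [
    Fragment("Invoice", 20.3, 99.8),   # slightly off the nominal position
    Fragment("Date", 120.0, 100.0),
]
matches = find_at(page, 20, 100, tolerance=0.5)
print([f.text for f in matches])
```

With a tolerance of 0.5 mm, the fragment at (20.3, 99.8) still counts as a match for (20,100), while the fragment at X = 120 does not.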
When searching for a fragment at the specified position, the .Tolerance property is taken into account. To define a larger search area or deliberately extract multiple fragments from a larger area, the width and height of this area can also be specified:
(X,Y,{width},{height})
The height is specified as a positive value that extends upward from the Y-position, even though the Y-axis itself runs downward from the top of the page. The Y-position therefore specifies the lower edge of the area: with a Y-value of 100 and a height of 40, the upper edge of the area is at 60, not at 140. The coordinates (20,100,80,40) thus describe, for example, an area whose upper edge is at 60, lower edge at 100, left edge at 20, and right edge at 100.
All fragments whose starting position lies within this range are considered to belong to the range. This assignment is made regardless of whether a fragment extends beyond the range. When specifying a range, unlike a single position, the .Tolerance property is not taken into account because the range can already be defined to account for certain variations in the positioning of fragments.
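The edge arithmetic can be checked with a few lines of Python. The function name `area_edges` is hypothetical and only serves to illustrate the calculation described here.

```python
def area_edges(x, y, width, height):
    """Translate an (X,Y,width,height) specification into page edges in mm.

    Y is the lower edge of the area; the height extends upward from it,
    i.e. toward smaller Y values, because the page origin is at the top left.
    """
    return {
        "left": x,
        "top": y - height,
        "right": x + width,
        "bottom": y,
    }


print(area_edges(20, 100, 80, 40))
# {'left': 20, 'top': 60, 'right': 100, 'bottom': 100}
```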
If a range includes multiple fragments, the extraction result for that range is a composite string of those fragments. The fragments are sorted according to their left-to-right and top-to-bottom positions, which usually restores the normal text flow, provided that the Y-positions of fragments belonging to one line of text do not fluctuate by more than the configured tolerance value.
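The following Python sketch illustrates how fragments in a range might be collected and reassembled into text. It is a simplified model under several assumptions that the description does not confirm: fragments are plain `(text, x, y)` tuples, the range boundaries are inclusive, fragments on one line are joined with a space, and lines are joined with a line-feed string standing in for the .LineFeed property. The name `extract_area` is hypothetical.

```python
def extract_area(fragments, x, y, width, height, tolerance=1.0, line_feed="\n"):
    """Collect fragments starting inside the area and join them in reading order.

    fragments: iterable of (text, x_mm, y_mm) tuples, where the coordinates
    denote the lower left corner of each fragment.
    """
    left, right = x, x + width
    top, bottom = y - height, y  # the height extends upward from Y

    # Only the starting position decides whether a fragment belongs to the range.
    hits = [f for f in fragments if left <= f[1] <= right and top <= f[2] <= bottom]

    # Group into text lines: fragments whose Y differs from the first fragment
    # of the current line by no more than the tolerance share that line.
    hits.sort(key=lambda f: f[2])
    lines = []
    for frag in hits:
        if lines and abs(frag[2] - lines[-1][0][2]) <= tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    # Within each line, order the fragments from left to right.
    return line_feed.join(
        " ".join(f[0] for f in sorted(line, key=lambda f: f[1]))
        for line in lines
    )


page = [
    ("World", 45, 80.2),
    ("Hello", 22, 80.0),
    ("line", 40, 90.0),
    ("Second", 22, 90.1),
    ("outside", 22, 120.0),  # starts below the lower edge, so it is ignored
]
print(extract_area(page, 20, 100, 80, 40))
```

In this example the small Y deviations of 0.1 mm and 0.2 mm stay within the assumed tolerance of 1 mm, so the output consists of the two lines "Hello World" and "Second line".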
Relative Position
Fragments can be addressed using a relative position instead of the absolute position. The position is relative to a keyword that is searched for. For this, extend the syntax as follows:
"Keyword",[First|Last|All],(X,Y,{Width},{Height})
For example, "invoice number",First,(30,0,20,0) is used to search for the first occurrence of the word "invoice number" on a page. The extraction area starts 30 mm away from the start position of the found word in the X direction and has a width of 20 mm. No offset is specified for the Y direction in this example, so the term to be extracted must be in the same text line.
You can specify the occurrence. This specification determines which occurrences of the keyword on a page are used:
First: the first occurrence of the keyword is used (default value).
Last: the last occurrence of the keyword is used.
All: all occurrences of the keyword are used, which may result in multiple extraction values.
If the keyword consists of several single words, the words can be separated with a space (e.g., invoice number). Implicitly, the system then searches for two consecutive fragments that have the values "invoice" and "number", respectively. The reference point for the relative offset is the first of the two fragments.
Instead of fixed keywords, wildcard expressions (with wildcards *, ? and #) and regular expressions (embedded in / characters) can also be used for the search (e.g., Inv*). In the case of a combined search over several fragments, the syntax with the space between the keywords is retained. Applied to the example above, the keyword can thus be written as Inv* Num*.
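The keyword search and the relative offset can be sketched in Python as follows. This is an illustration, not the product's algorithm. The function names are hypothetical, and the sketch makes these assumptions: fragments are `(text, x, y)` tuples in reading order, `#` stands for a single digit, matching is case-insensitive, each pattern must match a whole fragment, and a regular expression in / characters contains no spaces.

```python
import re


def token_to_regex(token):
    """Convert one keyword token (fixed word, wildcard, or /regex/) to a pattern."""
    if len(token) >= 2 and token.startswith("/") and token.endswith("/"):
        return re.compile(token[1:-1], re.IGNORECASE)
    parts = []
    for ch in token:
        if ch == "*":
            parts.append(".*")      # any number of characters
        elif ch == "?":
            parts.append(".")       # exactly one character
        elif ch == "#":
            parts.append(r"\d")     # exactly one digit (assumption)
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def find_keyword(fragments, keyword, occurrence="First"):
    """Return the anchor fragments (first fragment of each keyword match)."""
    patterns = [token_to_regex(t) for t in keyword.split()]
    if not patterns:
        return []
    n = len(patterns)
    hits = [
        fragments[i]
        for i in range(len(fragments) - n + 1)
        if all(p.fullmatch(fragments[i + k][0]) for k, p in enumerate(patterns))
    ]
    if occurrence == "All" or not hits:
        return hits
    return [hits[0]] if occurrence == "First" else [hits[-1]]


def relative_area(anchor, dx, dy, width, height):
    """Shift the area by the start position of the anchor fragment."""
    _, ax, ay = anchor
    return (ax + dx, ay + dy, width, height)


page = [
    ("Invoice", 20, 50), ("Number", 35, 50), ("4711", 52, 50),
    ("Invoice", 20, 200), ("Number", 35, 200), ("4712", 52, 200),
]
anchor = find_keyword(page, "Inv* Num*", "First")[0]
print(relative_area(anchor, 30, 0, 20, 0))
# (50, 50, 20, 0)
```

The resulting tuple is an absolute area specification: for the offset (30,0,20,0), the extraction area starts 30 mm to the right of the first keyword fragment and lies on the same text line.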
Specifying pages
To specify pages on which the extraction is performed, extend the syntax as follows:
(X,Y,{width},{height}),{page range}
This syntax extension can be combined with the previously described syntax extension for addressing a relative position.
The page range can be described either by using the terms First (default value), Last, and All, or by specifying concrete page numbers (e.g., 1;2;3 or 1-3).
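A page range specification of this kind could be resolved into concrete page numbers as in the following Python sketch. The function name `resolve_pages` is hypothetical. The sketch assumes that page numbers are 1-based, that the terms are case-insensitive, that the ; and - notations may be mixed, and that page numbers beyond the end of the document are silently skipped; none of these details are confirmed by the description above.

```python
def resolve_pages(spec, page_count):
    """Resolve a page range such as 'First', 'All', '1;2;3' or '1-3'
    into a list of 1-based page numbers."""
    if page_count < 1:
        return []
    spec = (spec or "First").strip()  # First is the default value
    key = spec.lower()
    if key == "first":
        return [1]
    if key == "last":
        return [page_count]
    if key == "all":
        return list(range(1, page_count + 1))

    pages = []
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(v) for v in part.split("-", 1))
        else:
            start = end = int(part)
        for page in range(start, end + 1):
            # Skip pages outside the document and avoid duplicates.
            if 1 <= page <= page_count and page not in pages:
                pages.append(page)
    return pages


print(resolve_pages("1-3", 5))
# [1, 2, 3]
```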
Searching across multiple pages may result in multiple extraction values. If the target field is a header field, an array of values will then be assigned to that field according to the ReadMultiValues configuration property. If the target field is a table field, the values are split into an appropriate number of table rows.