xSuite Interface Windows Prism 5.x – Online Help

Input Format "Pdf"

This input format allows you to split a multi-page PDF file into several individual files. The split is performed based on certain characteristics of the PDF file's text content, which is evaluated page by page. Only the native content of the PDF file is used to perform the split. Text from embedded image data that requires OCR processing is not included.

Property

Description

InputFormat[].SplitMode

Definition of the split mode

The split mode defines how the pages at which a new partial file begins are identified. When a split is performed, the original document is discarded. In its place, a copy of the document is generated for each partial file, and that partial file is added to the copy as an additional attachment.

The split documents that are generated receive the name suffix .splitN, where N is a sequential number.

The following modes are available:

  • None: no separation (default value)

  • FixedPageNo: separation into files with a fixed number of pages

  • StartKey: start of a new file for every page that contains a keyword

  • EndKey: start of a new file after each page that contains a keyword (i.e., the term is located on the last page in each case)

  • RepeatKey: combination of all consecutive pages that contain an identical keyword into one file each

For the FixedPageNo mode, specify the desired number of pages as a text value in the .SplitValue[] property. For the StartKey and EndKey modes, define one or more alternative keywords in the .SplitValue[] property. A page satisfies the split condition if it contains at least one of these keywords.
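The grouping behavior of the FixedPageNo, StartKey, and EndKey modes can be illustrated with a small Python sketch. This is not product code. The function names are hypothetical, pages are represented by zero-based indexes, and keywords are matched as plain substrings. How the product treats pages before the first StartKey match or after the last EndKey match is not documented here, so the sketch assumes they simply form their own part.

```python
from typing import List, Sequence


def split_fixed(page_count: int, pages_per_file: int) -> List[List[int]]:
    """FixedPageNo: group page indexes into chunks of a fixed size."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    return [
        list(range(start, min(start + pages_per_file, page_count)))
        for start in range(0, page_count, pages_per_file)
    ]


def page_has_keyword(text: str, keywords: Sequence[str]) -> bool:
    """A page qualifies if it contains at least one of the alternative keywords."""
    return any(keyword in text for keyword in keywords)


def split_start_key(pages: Sequence[str], keywords: Sequence[str]) -> List[List[int]]:
    """StartKey: every page containing a keyword starts a new part."""
    parts: List[List[int]] = []
    for index, text in enumerate(pages):
        # Assumption: pages before the first match form a leading part.
        if not parts or page_has_keyword(text, keywords):
            parts.append([])
        parts[-1].append(index)
    return parts


def split_end_key(pages: Sequence[str], keywords: Sequence[str]) -> List[List[int]]:
    """EndKey: a page containing a keyword is the last page of its part."""
    parts: List[List[int]] = []
    current: List[int] = []
    for index, text in enumerate(pages):
        current.append(index)
        if page_has_keyword(text, keywords):
            parts.append(current)
            current = []
    # Assumption: pages after the last match form a trailing part.
    if current:
        parts.append(current)
    return parts
```

Because the number of pages for FixedPageNo is configured as a text value, a caller of this sketch would convert it first, for example split_fixed(5, int("2")).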

In RepeatKey mode, the .SplitValue[] property has no relevance because fixed terms are not searched for. Instead, a term that is at a certain position is dynamically extracted and compared to the same term on the previous page. If the term has changed, a new file will start. A term that is not found on a page is not considered a valid split criterion.
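The RepeatKey comparison can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's implementation. The function name is hypothetical, the extracted term of each page is passed in as a string or None (term not found), and a page without a term is assumed to stay in the current part while the comparison continues against the most recently found term.

```python
from typing import List, Optional, Sequence


def split_repeat_key(terms: Sequence[Optional[str]]) -> List[List[int]]:
    """Group zero-based page indexes by the term extracted from each page.

    terms[i] is the value extracted from page i, or None if nothing was found.
    """
    parts: List[List[int]] = []
    last_term: Optional[str] = None
    for index, term in enumerate(terms):
        # A missing term is not a valid split criterion, so only a found
        # term that differs from the last found term starts a new part.
        changed = term is not None and last_term is not None and term != last_term
        if not parts or changed:
            parts.append([])
        parts[-1].append(index)
        if term is not None:
            last_term = term  # assumption: remember the most recent found term
    return parts
```

With this sketch, a page on which the term is missing never triggers a split on its own. It is kept together with the pages that precede it.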

The .SplitFieldDef property defines the value to extract in RepeatKey mode. Optionally, this property can also be used in StartKey and EndKey modes to limit the search for the .SplitValue[] entries to a specific page position or page range. By default, the entire page is searched.

InputFormat[].SplitFieldDef(*)

Definition of the extraction range in RepeatKey mode and optionally in StartKey and EndKey modes

Use the same syntax here as for the PDF index data reader (see Index Data Reader "Pdf"). A page range need not be specified (the evaluation is performed implicitly for each page in the present context).

InputFormat[].SplitValue[](*)

Definition of one or more search terms in StartKey and EndKey modes and definition of the number of pages in FixedPageNo mode

Caution: Specify the number of pages for the FixedPageNo mode here as a text value in quotes. If it is not specified as a text value, the definition will not be recognized.

The search terms can be wildcard expressions (with the wildcards *, ?, and #) or regular expressions (enclosed in / characters). Such an expression is matched against the entire composite page content, not against each individual text fragment that makes up a PDF page. This allows one expression to span multiple fragments at the same time. However, it also means that a term such as Invoice, which would match a single fragment on its own, does not match the page. Search for *Invoice* instead, because in the context of the entire page the term is embedded in preceding and following text.
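The effect of matching against the entire page content can be illustrated with a Python sketch. It rests on assumptions that this page does not confirm: * is taken to stand for any sequence of characters, ? for any single character, and # for a single digit. A wildcard expression must match the whole page text, while a regular expression in / characters is searched anywhere in the page. The function names are hypothetical.

```python
import re

# Assumed wildcard meanings (not confirmed by this page).
WILDCARDS = {"*": ".*", "?": ".", "#": r"\d"}


def compile_search_term(term: str):
    """Return (compiled pattern, is_regex) for one search term."""
    if len(term) >= 2 and term.startswith("/") and term.endswith("/"):
        # Regular expression enclosed in / characters.
        return re.compile(term[1:-1], re.DOTALL), True
    # Wildcard expression: translate wildcards, escape everything else.
    pattern = "".join(WILDCARDS.get(ch, re.escape(ch)) for ch in term)
    return re.compile(pattern, re.DOTALL), False


def page_matches(page_text: str, term: str) -> bool:
    """Check one search term against the composite text of a page."""
    pattern, is_regex = compile_search_term(term)
    if is_regex:
        return pattern.search(page_text) is not None
    # Wildcard expressions are compared with the whole page content.
    return pattern.fullmatch(page_text) is not None
```

Under these assumptions, page_matches(text, "Invoice") only succeeds if the page consists of exactly that word, whereas page_matches(text, "*Invoice*") succeeds for any page that contains it.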

InputFormat[].Tolerance

Tolerance range in millimeters.

The tolerance range specifies how far the coordinates of a text fragment may deviate from a given value and still be considered a match to that value.

Default value: 1
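As a rough illustration of the tolerance check, the following Python sketch compares a fragment position with an expected position. The function names are hypothetical, all values are in millimeters, and the sketch assumes that the tolerance is applied to each coordinate separately and that a deviation exactly equal to the tolerance still counts as a match. Neither detail is stated on this page.

```python
def within_tolerance(actual_mm: float, expected_mm: float, tolerance_mm: float = 1.0) -> bool:
    """True if a coordinate deviates from the expected value by at most the tolerance."""
    return abs(actual_mm - expected_mm) <= tolerance_mm


def fragment_matches(actual_xy, expected_xy, tolerance_mm: float = 1.0) -> bool:
    """Apply the tolerance to the x and y coordinates independently (assumption)."""
    return all(
        within_tolerance(actual, expected, tolerance_mm)
        for actual, expected in zip(actual_xy, expected_xy)
    )
```

With the default of 1, a fragment at (20.4, 35.0) would match an expected position of (20.0, 35.5) in this sketch, while a fragment at (21.5, 35.0) would not.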