XML split

Purpose

xmlsplit is designed to split large (> 25 MB) and very large (> 1 GB) XML documents into a number of smaller XML documents. The smaller XML documents can then be processed in memory. The split process works by making a single pass through the source XML document.

While passing through the source, an output document is constructed by selecting elements or groups of elements. The last ‘selection’ in a group of selections (known as a map) causes a flush of the output model, which delivers a split output document. Multiple maps can also be defined, allowing a second stage that recombines all flushed maps into a final output document.

Document selections are defined using XPath 1.0 expressions and supplied to the XML split action using XML parameters.

Methods

Binding name: xmlsplit

Method: void transform(Map configuration, Closure mapNotify)

Initialise and run an xmlsplit transformation, using the given configuration Map and notify closure (optional; set to null if no callback is required).

Details

A configuration map (configuration) must be provided.

Configuration Map

Configuration Name	Description
parameterXml	Mandatory: XML parameter block
inputUri	Mandatory: URI of the input file to split

XML parameter block

    <parameters>
      <map>
        <selector>XPath V1 Expression</selector>
      </map>
    </parameters>

Parameter Block Syntax

The map element is mandatory and may repeat.

A map must have at least one child element, a selector.

A selector specifies a valid XPath 1.0 expression selecting either a single node value or a group of nodes.

When the last selector in a map returns results, all results selected within the map are combined into an output document.
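As an illustration of the flush rule above, the following is a minimal sketch of a map with two selectors. The element names (Order, OrderHeader, OrderTotal) are hypothetical and assume an input in which each Order contains exactly one of each.

```xml
<parameters>
  <map>
    <!-- Hypothetical element names, for illustration only. -->
    <!-- First selector: its result is held in the map until the map is flushed. -->
    <selector>//Order/OrderHeader</selector>
    <!-- Last selector: when it returns results, everything selected within
         the map is combined into one output document. -->
    <selector>//Order/OrderTotal</selector>
  </map>
</parameters>
```

With an input of that shape, each Order would be expected to yield one split document containing its header and its total.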

When multiple map elements are defined, the resulting output documents are combined either in their evaluation sequence or by using a further XPath expression that is run against each map document (see the indexXpath selector attribute).

Optional Parameter Elements

Continue processing if errors are encountered in subsequent transformer steps: defaults to ‘false’. When true the transformer context contains accumulated error details. (See elements: XSPLIT_ERRORS_STEP and XSPLIT_ERRORS).

    <noError>true/false</noError>

To specify that selector results should be processed in memory (greatly speeds up processing but can result in OOM exceptions): defaults to ‘false’.

    <selectedInMemory>true/false</selectedInMemory>

To specify the name of the root element on all output documents: defaults to ‘xmlsplit’.

    <documentRoot>ElementName</documentRoot>

To specify how aggressive the split process should be. This is a value between 1 and 10. 10 is the MOST aggressive: defaults to ‘8’.

    <aggression>ValueBetween1and10</aggression>

To specify namespaces to use in XPath selectors: defaults to ‘xmlns=http://www.w3.org/2000/xmlns/’. This is a comma-separated list of namespace ‘prefix = uri’ pairs.

    <namespaces>prefix=uri</namespaces>
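The optional elements above can be combined in one parameter block. The following is a sketch only: the `ord` prefix, its URI, and the Order element names are hypothetical, and it assumes that a declared prefix is used inside a selector in the usual XPath way.

```xml
<parameters>
  <!-- Continue when a later transformer step fails (default: false). -->
  <noError>true</noError>
  <!-- Do not process selector results in memory (the default);
       "true" is faster but can run out of memory on large inputs. -->
  <selectedInMemory>false</selectedInMemory>
  <!-- Root element name for every output document (default: xmlsplit). -->
  <documentRoot>Order</documentRoot>
  <!-- Split aggression, 1 to 10 (default: 8). -->
  <aggression>5</aggression>
  <!-- Comma-separated prefix=uri pairs; the second pair is hypothetical. -->
  <namespaces>xsd=http://www.w3.org/2001/XMLSchema,ord=http://example.com/order</namespaces>
  <map>
    <!-- Assumption: declared prefixes can be used in the selector XPath. -->
    <selector>//ord:Order/ord:OrderHeader</selector>
  </map>
</parameters>
```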

Optional ‘map’ Element Attributes

id: A name for the map. This is useful during debugging and testing: defaults to ‘unamedmap_nnnnn’
selectorsResetAfter: The name of the selector which will cause all current selector results within the map to be discarded (reset): defaults to ‘null’. The default behavior is to reset the map once the final selector has results. Overriding this behavior allows documents to be produced that combine a repeating child element selection with one or more parent selections.
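A sketch of how the map attributes might be used to pair a parent selection with a repeating child selection. The map id is hypothetical, and the behaviour described in the comments is one reading of the attribute description above rather than confirmed behaviour.

```xml
<parameters>
  <!-- Hypothetical map id. The reset is tied to the parent selector instead of
       the final selector, so (on this reading) the parent result is kept
       while the child selector repeats. -->
  <map id="invoicePerCustomer" selectorsResetAfter="customerName">
    <!-- Parent selection: expected once per customer. -->
    <selector id="customerName">//Customer/Division/Department/CustomerName</selector>
    <!-- Repeating child selection: as the last selector, each result flushes
         an output document. -->
    <selector id="invoiceHeader">//Customer/Division/Department/InvoiceHeader</selector>
  </map>
</parameters>
```

If this reading is right, each InvoiceHeader would be delivered together with the most recently selected CustomerName.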

Optional ‘selector’ Element Attributes

id: A name for the selector. This is useful during debugging and testing: defaults to ‘unamedselector_nnnnn’
indexXpath: A valid XPath V.2 expression that will be used to determine an index value for the current map. The XPath will be run against the selector results document. This attribute is ONLY processed if it is on the last selector within a map. It is used to name the resulting map so maps with the same name can be recombined into output documents.
allowedRepeats: The number of times a selector's results will be added to the current map: defaults to ‘1’
keepFirst: Determines what happens when selector results repeat more often than allowedRepeats. When true, the first occurrence(s) are kept and the latest result is discarded; when false, the latest result overwrites a previous occurrence: defaults to ‘true’
mandatory: Results for this selector must be found before the map generates an output document: defaults to ‘true’. This attribute is ignored on the last selector in a map. If mandatory selectors have not generated results when the last selector in a map generates results, an error log entry is generated and all results relating to that map are discarded.
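The selector attributes above could be combined as in the following sketch. The map id, the InvoiceNumber element, and the indexXpath expression are hypothetical; in particular, the expression assumes that the selector results document contains the selected InvoiceHeader element.

```xml
<parameters>
  <map id="invoiceMap">
    <!-- Optional selection: the map can still produce output without it. -->
    <selector id="customerContact" mandatory="false">//Customer/Division/Department/CustomerContact</selector>
    <!-- Up to two results are added to the map; with keepFirst="false" a
         further result overwrites a previous one instead of being discarded. -->
    <selector id="customerName" allowedRepeats="2" keepFirst="false">//Customer/Division/Department/CustomerName</selector>
    <!-- Last selector: indexXpath is only processed here. The value it returns
         names the resulting map, so maps with the same name can be recombined. -->
    <selector id="invoiceHeader" indexXpath="//InvoiceHeader/InvoiceNumber">//Customer/Division/Department/InvoiceHeader</selector>
  </map>
</parameters>
```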

Parameter Element Attributes For Debugging And Testing

trace: An attribute of both map and selector elements. It allows details of each ‘selection’ and ‘map’ to be logged using DEBUG level log4j logging: defaults to ‘false’
traceResetAfter: Only valid on the map element. Sets trace to false after the given number of mappings: defaults to ‘0’
stopParsingAfter: Only valid on the map element. Causes the splitter to stop after the given number of mappings: defaults to ‘0’

Examples

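// Configuration map: both entries are mandatory
def cnf = [
    parameterXml: resource.get('SPLITPARAMS'),   // XML parameter block (shown below)
    inputUri: 'file:${B2BOX_DATA}/resources/big.xml',   // URI of the input file to split
]

// The closure receives the bytes of a split document
xmlsplit.transform( cnf ) { bytes ->
    def split = new String(bytes, "UTF-8")
    println split
}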

SPLITPARAMS

<parameters>
    <documentRoot>Invoice</documentRoot>
    <aggression>10</aggression>
    <namespaces>xsd=http://www.w3.org/2001/XMLSchema</namespaces>
    <map trace="true" traceResetAfter="3" stopParsingAfter="3" >
        <selector id="customerName" mandatory="true" trace="true">//Customer/Division/Department/CustomerName</selector>
        <selector id="customerContact" trace="true">//Customer/Division/Department/CustomerContact</selector>
        <selector id="invoiceHeader" mandatory="true" trace="true">//Customer/Division/Department/InvoiceHeader</selector>
    </map>
</parameters>