Monday, May 19, 2014

Processing large text file using smooks in wso2esb

Smooks  library supports large file processing which is used in ESB to process file which are in text/xml format.
Text files can be a CSV file or with a space delimiter or a fixed length file. To process large file, smooks write the output stream , which is a single line record, to a temporary file or a jms queue. After writing all files to the temporary location, we need to process all those records/files again. This is how we can avoid large memory growth/out of heap space issue which occurs when we try to process large file as in- memory process.

Following is a sample configuration to process fixed length large text file in wso2esb.

Proxy configuration

To pick and process the file form a file location , we could use VFS transport.

<proxy xmlns="http://ws.apache.org/ns/synapse"
       name="cor_file_pull"
       transports="https,http,vfs"
       statistics="disable"
       trace="disable"
       startOnLoad="true"target>
     <inSequence>
  <property name="DISABLE_SMOOKS_RESULT_PAYLOAD"
                   value="true"
                   scope="default"
                   type="STRING"/>
          <smooks config-key="simple-smook">
            <input type="text"/>
           <output type="xml"/>
         </smooks>
       </inSequence>
      <outSequence/>
   </target>
   <parameter name="transport.vfs.Streaming">true&lt;/parameter>
   <parameter name="transport.PollInterval">10&lt;/parameter>
   <parameter name="transport.vfs.ActionAfterProcess">MOVE&lt;/parameter>
   <parameter name="transport.vfs.MoveAfterProcess">C:/Users/TOSH/Desktop/success</parameter>
   <parameter name="transport.vfs.FileURI">C:/Users/TOSH/Desktop/in&lt;/parameter>
   <parameter name="transport.vfs.MoveAfterFailure">C:/Users/TOSH/Desktop/failure</parameter>
   <parameter name="transport.vfs.Locking">disable&lt;/parameter>
   <parameter name="transport.vfs.FileNamePattern">^(cor).+</parameter>
   <parameter name="transport.vfs.ContentType">text/plain</parameter>
   <parameter name="transport.vfs.ActionAfterFailure">MOVE</parameter>
   <description/>
</proxy>

Smooks configuration

simple-smook localentry

<smooks-resource-list xmlns="http://www.milyn.org/xsd/smooks-1.1.xsd" xmlns:ftl="http://www.milyn.org/xsd/smooks/freemarker-1.1.xsd" xmlns:fl="http://www.milyn.org/xsd/smooks/fixed-length-1.3.xsd" xmlns:file="http://www.milyn.org/xsd/smooks/file-routing-1.1.xsd">
<params>
            <param name="stream.filter.type">SAX </param>
           <param name="default.serialization.on">false </param>
         </params>

<fl:reader fields="RecordId[4],CorpName[60],FileNumber[9]" skipLines="0">
</fl:reader>
      <resource-config selector="record">
         <resource> org.milyn.delivery.DomModelCreator</resource>
     </resource-config>
      <file:outputStream openOnElement="record" resourceName="fileSplitStream">
         <file:fileNamePattern> ${.vars["record"].FileNumber}.xml</file:fileNamePattern>
         <file:destinationDirectoryPattern> C:\Users\TOSH\Desktop\processed</file:destinationDirectoryPattern>
         <file:highWaterMark mark="10000000"> </file:highWaterMark>
      </file:outputStream>
      <ftl:freemarker applyOnElement="record">
         <ftl:template> /repository/resources/smooks/record_as_xml.ftl</ftl:template>
         <ftl:use>
            <ftl:outputTo outputStreamResource="fileSplitStream"> </ftl:outputTo>
         </ftl:use>
      </ftl:freemarker>
   </smooks-resource-list>

Each record will   be written to the specified file location   "C:\Users\TOSH\Desktop\processed" with the name constructed through the parameter "  ${.vars["record"].FileNumber}.xml ".  Here it is a xml file, which will be named with a file number    record in the text file. We have to select a unique parameter to write the processed files else there will be file overwriting issue will be thrown in smooks.
To process large file we have to use 'SAX' filter type. DOM needs more memory and it will lead to out of memory issue.
In this sample, i have used DOM+SAX mixing model to keep the lower footprint of memory.
High  water mark provides the maximum file records need to be written. provide enough number based on the lines in the text file.
The property ;
 <property name="DISABLE_SMOOKS_RESULT_PAYLOAD"   value="true" scope="default"     type="STRING"/>

will be available in ESB 4.9.0. If you use lower version, you might need a patch to be applied to your system.

XML template file
record_as_xml.ftl
<#assign rec = .vars["record"]>
<p:insert_acc_data_file xmlns:p="http://ws.wso2.org/dataservice"><p:RecordId>${rec.RecordId}</p:RecordId>
<p:CorpName>${rec.CorpName}&lt;/p:CorpName>
<p:FileNumber>${rec.FileNumber}&lt;/p:FileNumber></p:insert_acc_data_file>

Based on above template the records will be formatted and will be written to the .

Using another vfs proxy, we can pick the entries written to the  'processed' file and do further processing.

3 comments:

  1. This is very helpful. Thank you.
    Is it possible to have the include the original file name of the larger source file?

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete
  3. Hi,

    Im trying using smooks to process a huge file. I have installed the patch for the smooks mediator but when I am processing a file of 3 Gb I get a java heap space.

    Too I follow the steps to configure Transferring large files of documentation: https://docs.wso2.com/display/ESB481/VFS+Transport

    Any suggest to fix it?

    Thank you

    ReplyDelete