Use JPedal and JBoss Drools for Flexible PDF Extraction

By admin on Oct 16, 2007 in Java, Programming, open source

In one of my previous projects, I needed to extract information from PDF files in a very flexible way. Depending on the page header, I had to extract certain regions from some of the pages. A region could hold single-line text, multi-line text, or a multi-column table. I also had to parse the extracted content to check whether it was the information I wanted. The requirements were complicated enough that capturing them in a static configuration file was nearly impossible.

After some trial and error, I decided to use Drools and JPedal. JPedal handles the PDF extraction, whereas Drools serves as the rule engine in which the extraction logic is coded. Any change in the extraction logic only requires modifying a rule, which keeps the extraction flexible.

For extraction from the PDF file, I decided to use JPedal, which is a Java PDF viewer with PDF extraction and PDF workflow tools. I also looked into other tools such as PDFBox, but in the end I chose JPedal. I felt it was more flexible because I could extract a specific region of a PDF page, on a page-by-page basis.

For the extraction logic, I used Drools from JBoss. Drools is a business rule management system (BRMS) with a rules engine, ReteOO, which is an enhanced implementation of Charles Forgy’s Rete algorithm tailored for the Java language.

By combining JPedal and Drools, I was able to make my PDF text extraction extremely flexible. A simple example follows.

Let’s say I have a very simple PDF, as shown below.

Of course, in the real world you would have a more complicated PDF layout, but the same programming logic can be applied.

I have a RuleParser class that reads the Drools rule files listed in an XML configuration file, as shown below.

Configuration File


<?xml version="1.0" encoding="ISO-8859-1" ?>

<config>
    <ruleset>
        <rule-file>/extraction/extraction.drl</rule-file>
    </ruleset>

</config>
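The AppConfiguration class that reads this file is not shown in the post. As a rough illustration of what it has to do, here is a minimal sketch that pulls every rule-file entry out of the same XML layout using only the JDK's DOM parser. The class name RuleFileConfigReader and the second rule file, tables.drl, are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the original download.
final class RuleFileConfigReader {

    /** Returns the trimmed, non-blank text of every <rule-file> element. */
    static List<String> readRuleFiles(InputStream xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(xml);
        NodeList nodes = doc.getElementsByTagName("rule-file");
        List<String> ruleFiles = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            String path = nodes.item(i).getTextContent().trim();
            if (!path.isEmpty()) {
                ruleFiles.add(path);
            }
        }
        return ruleFiles;
    }

    public static void main(String[] args) throws Exception {
        // Same layout as the configuration file, with a second (made-up) rule file.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>"
                + "<config><ruleset>"
                + "<rule-file>/extraction/extraction.drl</rule-file>"
                + "<rule-file>/extraction/tables.drl</rule-file>"
                + "</ruleset></config>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(readRuleFiles(in));
    }
}
```

Adding another rule file is then just one more rule-file element inside ruleset; no Java code has to change.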

RuleParser class

import java.io.FileNotFoundException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.drools.RuleBase;
import org.drools.RuleBaseFactory;
import org.drools.compiler.PackageBuilder;
import org.drools.rule.Package;

// AppConfiguration and RuleException are project classes that are not shown here.
public class RuleParser {

    /**
     * Logging object
     */
    private static Log log = LogFactory.getLog(RuleParser.class);

    /**
     * Singleton object
     */
    private static RuleParser instance;

    /**
     * Drools RuleBase object.
     */
    private RuleBase ruleBase;

    /**
     * Constructor
     *
     * @throws RuleException Rule exception
     */
    private RuleParser() throws RuleException {
        init();
    }

    /**
     * Initialization.
     *
     * @throws RuleException Rule exception.
     */
    private void init() throws RuleException {
        try {
            // The property is a collection when several rule files are
            // configured, and a plain string when there is only one.
            AppConfiguration config = AppConfiguration.getInstance();
            Object obj = config.getProperty("ruleset.rule-file");
            List<String> ruleList = null;
            if (obj instanceof Collection) {
                int size = ((Collection) obj).size();
                ruleList = new ArrayList<String>(size);
                for (int i = 0; i < size; i++) {
                    String ruleFile = config.getString("ruleset.rule-file(" + i + ")");
                    if (StringUtils.isNotBlank(ruleFile)) {
                        ruleList.add(ruleFile);
                    }
                }
            } else if (obj instanceof String) {
                ruleList = new ArrayList<String>(1);
                String ruleFile = config.getString("ruleset.rule-file");
                if (StringUtils.isNotBlank(ruleFile)) {
                    ruleList.add(ruleFile);
                }
            }

            // Rule package builder
            PackageBuilder builder = new PackageBuilder();

            // The rulebase that the compiled package is deployed to.
            ruleBase = RuleBaseFactory.newRuleBase();

            if (ruleList != null && ruleList.size() > 0) {
                for (String ruleFile : ruleList) {
                    if (log.isDebugEnabled()) {
                        log.debug("Loading rules from " + ruleFile);
                    }
                    // Read in the source (open the resource only once)
                    InputStream in = RuleParser.class.getResourceAsStream(ruleFile);
                    if (in == null) {
                        throw new FileNotFoundException("Rule file not found on classpath: " + ruleFile);
                    }
                    Reader source = new InputStreamReader(in);
                    try {
                        // This will parse and compile in one step
                        builder.addPackageFromDrl(source);
                    } finally {
                        source.close();
                    }
                }
                // Get the compiled package (which is serializable)
                Package pkg = builder.getPackage();
                ruleBase.addPackage(pkg);
            }
        } catch (Exception ex) {
            // Pass the exception to the logger instead of printing the stack trace.
            log.error("Error loading rules. " + ex.getMessage(), ex);
            throw new RuleException(ex.getMessage(), ex);
        }
    }

    /**
     * Singleton access method. Synchronized so that it is consistent with
     * reload() and never creates two instances.
     *
     * @return Singleton object
     * @throws RuleException Rule exception.
     */
    public static synchronized RuleParser getInstance() throws RuleException {
        if (instance == null) {
            instance = new RuleParser();
        }
        return instance;
    }

    /**
     * Get rule base
     *
     * @return Rule base
     */
    public RuleBase getRuleBase() {
        return ruleBase;
    }

    /**
     * Set rule base
     *
     * @param ruleBase Rule base
     */
    public void setRuleBase(RuleBase ruleBase) {
        this.ruleBase = ruleBase;
    }

    /**
     * Reload the rule engine, and recompile all the rule files.
     * The old instance is only replaced once the new one has loaded.
     *
     * @throws RuleException Rule exception
     */
    public static synchronized void reload() throws RuleException {
        instance = new RuleParser();
    }

}
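The lazy singleton plus reload() idea can be shown in isolation, without Drools on the classpath. The following is only a sketch of the pattern: ReloadableHolder is a hypothetical generic class, and the string "rules-vN" stands in for a compiled rule base. It assumes the loader never returns null.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical generic holder; in this article's setting T would be the compiled rule base.
final class ReloadableHolder<T> {
    private final Supplier<T> loader;
    private volatile T current;

    ReloadableHolder(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Returns the current value, loading it on first use. */
    T get() {
        T value = current;
        if (value == null) {
            synchronized (this) {
                value = current;
                if (value == null) {
                    value = loader.get();
                    current = value;
                }
            }
        }
        return value;
    }

    /** Builds a fresh value first, then publishes it; if loading fails the old value stays. */
    synchronized void reload() {
        current = loader.get();
    }

    public static void main(String[] args) {
        AtomicInteger version = new AtomicInteger();
        ReloadableHolder<String> holder =
                new ReloadableHolder<>(() -> "rules-v" + version.incrementAndGet());
        System.out.println(holder.get()); // first call triggers the load
        System.out.println(holder.get()); // same value, nothing is reloaded
        holder.reload();
        System.out.println(holder.get()); // freshly loaded value
    }
}
```

Because the new value is built before it is assigned, callers that are in the middle of using the old rule base are not disturbed by a reload.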

The extraction logic is coded in the extraction.drl rule file. The extraction itself is done using JPedal.

rule "Sample Extraction"
    no-loop true
    when
        pdfPages : List()
    then
        Log log = LogFactory.getLog(PdfLoaderMain.class);

        PdfDecoder decodePdf = new PdfDecoder(false);
        PdfDecoder.useTextExtraction();
        decodePdf.setExtractionMode(PdfDecoder.TEXT); // extract just text
        decodePdf.init(true);
        decodePdf.openPdfFile("c:/temp/RuleEngine/src/sample.pdf");
        if (!decodePdf.isExtractionAllowed()) {
            log.error("Text extraction not allowed");
        } else if (decodePdf.isEncrypted() && !decodePdf.isPasswordSupplied()) {
            log.error("Pdf is encrypted");
        } else {
            int startPage = 1;
            int endPage = decodePdf.getPageCount();
            PdfDecoder.enforceFontSubstitution = true;
            // Rectangle to extract from every page
            int x1 = 122, y1 = 736, x2 = 425, y2 = 696;
            for (int page = startPage; page <= endPage; page++) {
                decodePdf.decodePage(page);
                PdfGroupingAlgorithms currentGrouping = decodePdf.getGroupingObject();
                try {
                    String text = currentGrouping.extractTextInRectangle(
                            x1, y1, x2, y2, page, false, true);
                    PdfPage pdfPage = new PdfPage();
                    pdfPage.setContent(text);
                    pdfPages.add(pdfPage);
                } catch (PdfException e) {
                    // Log while the file is still open, then stop; the file is closed once below.
                    log.error("Exception " + e.getMessage() + " in file "
                            + decodePdf.getObjectStore().fullFileName);
                    break;
                }
            }
            decodePdf.flushObjectValues(false);
        }
        decodePdf.closePdfFile();
end
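The four numbers in the rule are easier to reason about once you remember that default PDF page coordinates start at the bottom-left corner, with y growing upward. I read (122, 736) as the top-left corner and (425, 696) as the bottom-right corner of the region, which is an assumption based on y1 being larger than y2. The small sketch below uses a hypothetical ExtractionRegion class (not a JPedal type) to normalize such a rectangle and check its size.

```java
// Hypothetical value type describing an extraction rectangle in PDF page
// coordinates, where the origin is the bottom-left corner and y grows upward.
final class ExtractionRegion {
    final int left;
    final int top;
    final int right;
    final int bottom;

    /** Accepts the two corners in any order and normalizes them. */
    ExtractionRegion(int x1, int y1, int x2, int y2) {
        this.left = Math.min(x1, x2);
        this.right = Math.max(x1, x2);
        this.top = Math.max(y1, y2);    // the larger y is higher on the page
        this.bottom = Math.min(y1, y2);
    }

    int width() {
        return right - left;
    }

    int height() {
        return top - bottom;
    }

    /** True if the point lies inside the rectangle or on its border. */
    boolean contains(int x, int y) {
        return x >= left && x <= right && y >= bottom && y <= top;
    }

    public static void main(String[] args) {
        // Same numbers as in the rule: (122, 736) and (425, 696).
        ExtractionRegion region = new ExtractionRegion(122, 736, 425, 696);
        System.out.println(region.width() + " x " + region.height()); // 303 x 40
        System.out.println(region.contains(200, 700)); // true, inside the band
        System.out.println(region.contains(200, 600)); // false, below the band
    }
}
```

So the sample rule reads a band 303 units wide and 40 units high near the top of each page, which is where a page header would typically sit.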

The PdfProcessor class invokes the rule engine to process the PDF file and returns the content in a list.

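import java.util.ArrayList;
import java.util.List;

import org.drools.RuleBase;
import org.drools.StatefulSession;
import org.drools.base.RuleNameStartsWithAgendaFilter;
import org.jpedal.exception.PdfException; // assumed: the same JPedal exception the rule catches

public class PdfProcessor {

    public List<PdfPage> processPdf() throws PdfException {
        try {
            List<PdfPage> pdfPages = new ArrayList<PdfPage>(10);
            RuleBase ruleBase = RuleParser.getInstance().getRuleBase();
            StatefulSession session = ruleBase.newStatefulSession();
            try {
                // The list is the fact that the rule matches on and fills.
                session.insert(pdfPages);
                session.fireAllRules(new RuleNameStartsWithAgendaFilter("Sample Extraction"));
            } finally {
                session.dispose(); // release the session's resources
            }
            return pdfPages;
        } catch (RuleException e) {
            throw new PdfException("Error processing Pdf file: " + e.getMessage());
        }
    }
}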
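Two things happen in processPdf(): an empty list is handed to the engine as a fact, and fireAllRules(new RuleNameStartsWithAgendaFilter(...)) restricts execution to rules whose names start with the given text. The plain-Java sketch below imitates only that behaviour so you can see the effect of the prefix. It is not how Drools works internally, and the extra rule names are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Plain-Java illustration only; this is not how Drools is implemented.
final class PrefixFilteredRules {

    /** Runs every action whose name starts with the prefix, in registration order. */
    static <T> int fireMatching(Map<String, Consumer<T>> rules, String prefix, T fact) {
        int fired = 0;
        for (Map.Entry<String, Consumer<T>> rule : rules.entrySet()) {
            if (rule.getKey().startsWith(prefix)) {
                rule.getValue().accept(fact);
                fired++;
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        // Rule names other than "Sample Extraction" are made up.
        Map<String, Consumer<List<String>>> rules = new LinkedHashMap<>();
        rules.put("Sample Extraction", pages -> pages.add("header text"));
        rules.put("Sample Extraction Tables", pages -> pages.add("table text"));
        rules.put("Invoice Extraction", pages -> pages.add("invoice text"));

        // The shared list plays the role of the fact that the rules fill.
        List<String> pages = new ArrayList<>();
        int fired = fireMatching(rules, "Sample Extraction", pages);
        System.out.println(fired + " rules fired, collected " + pages);
    }
}
```

Note that a prefix match also selects "Sample Extraction Tables", so rule names are worth choosing carefully if several rule sets live in the same rule base.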

Testing the extraction is very straightforward.

    public static void main(String[] args) {
        try {
            PdfProcessor pdfProcessor = new PdfProcessor();
            // Typed list, so the for-each loop below compiles
            List<PdfPage> pdfPages = pdfProcessor.processPdf();

            for (PdfPage pdfPage : pdfPages) {
                log.info("Content: " + pdfPage.getContent());
            }

        } catch (Exception e) {
            System.out.println(e.getMessage());
        }

    }

You can download the code and try it out yourself. Enjoy!
