Closed

PDF Text Parser duplicates text on multi-page documents

description

I tested the PDFTextParser on a 5-page PDF. The text extraction itself works well, but the output is duplicated cumulatively from page to page:

Page 1: Everything looks fine
Page 2: Text from page 1 is combined with text from page 2
Page 3: Text from pages 1 and 2 is combined with text from page 3
Page 4: Text from pages 1, 2, and 3 is combined with text from page 4
Page 5: Text from pages 1, 2, 3, and 4 is combined with text from page 5

The text is actually combined line by line: the first line of page 1 is included in the first line of page 2, the first lines of pages 1 and 2 are combined with the first line of page 3, and so on.
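
The symptom described above is what one would expect if a per-line text buffer were shared across pages and never reset. The sketch below only illustrates that pattern. It is written in Python rather than the library's own language, the function names are hypothetical, and it is not taken from the PDFTextParser source, so the actual cause may differ.

```python
def extract_pages_buggy(pages):
    """Hypothetical reproduction: pages is a list of pages, each a list of line strings."""
    line_buffers = {}  # line index -> accumulated text; created once and never cleared
    output = []
    for page in pages:
        for i, line in enumerate(page):
            # Text from earlier pages is still in the buffer, so it gets carried forward.
            line_buffers[i] = line_buffers.get(i, "") + line
        output.append([line_buffers[i] for i in range(len(page))])
    return output


def extract_pages_fixed(pages):
    """Same logic, but the buffer is created fresh for every page."""
    output = []
    for page in pages:
        line_buffers = {}  # reset per page
        for i, line in enumerate(page):
            line_buffers[i] = line_buffers.get(i, "") + line
        output.append([line_buffers[i] for i in range(len(page))])
    return output


pages = [["A1", "A2"], ["B1", "B2"], ["C1", "C2"]]
print(extract_pages_buggy(pages))  # page 3 comes out as ['A1B1C1', 'A2B2C2']
print(extract_pages_fixed(pages))  # each page contains only its own lines
```

With the shared buffer, line 1 of page 3 contains line 1 of pages 1 and 2 as well, matching the line-by-line combination reported here. Resetting the buffer at the start of each page removes the duplication in this sketch.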

I can't include the input or output files because the PDF contains sensitive information, but the issue also reproduces on another 7-page PDF that I parsed.

I tested using the compiled 1.0 DLL. The currently available source doesn't include the PDFTextParser, so I can't debug the issue myself.
Closed Dec 22, 2014 at 12:01 AM by tonyqus

comments

tonyqus wrote Jul 7, 2014 at 5:17 AM

Can you upload the PDF you are using?

tonyqus wrote Jul 8, 2014 at 5:16 AM

You are right. It's a bug. I'll fix it in the next release. The latest code is located at https://github.com/tonyqus/toxy