AspPDF.NET is capable of extracting raw text information
from PDF documents for searching and indexing purposes. Text is extracted
from an individual page via the ExtractText method
of the PdfPage object. ExtractText takes an optional
parameter object or parameter string (described below).
This method always returns text strings
in Unicode format.
Text extraction with coordinates, introduced in Version 2.8, is described in Section 17.7 - Structured Text Extraction.
9.4.1 Code Sample
The following code sample extracts and prints out text data from
all pages of a PDF (we use the 1-page file 1040es.pdf from Section 9.2):
C#
PdfManager objPdf = new PdfManager();
// Open a PDF file for text extraction
PdfDocument objDoc = objPdf.OpenDocument( Server.MapPath("1040es.pdf") );
string strText = "";
foreach( PdfPage objPage in objDoc.Pages )
{
    strText += objPage.ExtractText();
}
lblResult.Text = Server.HtmlEncode( strText );
VB.NET
Dim objPdf As PdfManager = New PdfManager()
' Open a PDF file for text extraction
Dim objDoc As PdfDocument = objPdf.OpenDocument( Server.MapPath("1040es.pdf") )
Dim strText As String = ""
For Each objPage As PdfPage In objDoc.Pages
    strText = strText + objPage.ExtractText()
Next
lblResult.Text = Server.HtmlEncode( strText )
Click the links below to run this code sample:
http://localhost/asppdf.net/manual_09/09_extract.cs.aspx
http://localhost/asppdf.net/manual_09/09_extract.vb.aspx
9.4.2 Possible Text Extraction Problems
PDF text extraction is not always reliable. It sometimes produces split
and conjoined words, or even unreadable gibberish.
9.4.2.1 Split and Conjoined Words
Unlike HTML or Word documents, PDFs do not usually contain
blocks of meaningful, readable text. Instead, they contain
text drawing operators that reference short phrases, individual
words, word parts and even separate characters.
As a result, an attempt to extract text information from a PDF document often
yields split and conjoined words. For example, the phrase "Brown dog"
may come out as "Browndog" (conjoined words) or "Bro wn d og"
(split words).
9.4.2.2 Gibberish
Many PDF documents, especially those using non-Latin alphabets, do not
use strings of readable characters to display text at all.
Instead, they use "glyph codes" which are numbers identifying character
appearances in a font file. "Good" PDF documents also provide mapping
tables (referred to as ToUnicode maps) enabling a consumer application
to convert those codes back to human-readable characters. However, not every
PDF document is "good". Those that lack these mapping tables cannot be
reliably converted back to readable text, and an attempt to extract text
from such a document yields gibberish.
Copying information from such a file via clipboard from Acrobat Reader
will fail as well.
9.4.2.3 Unknown Encoding
Certain foreign-language PDF documents use 8-bit character codes in the 129 - 255
range (beyond standard ASCII) to display text information. Copying and pasting from
such documents with Acrobat Reader usually produces unreadable text. AspPDF.NET is
capable of extracting text from these documents and converting it into
Unicode, provided that a code page is passed to the
ExtractText method via the CodePage parameter, such as "CodePage=1251" (Cyrillic)
or "CodePage=1256" (Arabic).
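As a minimal sketch of the CodePage parameter described above, the following helper extracts text from every page using a caller-supplied code page. The class name CodePageExtraction, the method name ExtractWithCodePage and the use of StringBuilder are illustrative assumptions and not part of AspPDF.NET. The Persits.PDF namespace is assumed to be the one exposed by the component's assembly, and the null check is purely defensive.

```csharp
using System;
using System.Text;
using Persits.PDF; // assumed AspPDF.NET namespace; the project must reference the component

public static class CodePageExtraction
{
    // Hypothetical helper: extracts text from all pages of a PDF whose text
    // uses 8-bit character codes, decoding them with the given code page
    // (for example 1251 for Cyrillic or 1256 for Arabic).
    public static string ExtractWithCodePage(string path, int codePage)
    {
        PdfManager objPdf = new PdfManager();
        PdfDocument objDoc = objPdf.OpenDocument(path);
        if (objDoc == null)
        {
            // Defensive check only; we do not assume when, if ever, null is returned.
            throw new InvalidOperationException("Document could not be opened: " + path);
        }

        StringBuilder sb = new StringBuilder();
        foreach (PdfPage objPage in objDoc.Pages)
        {
            // Parameter-string form shown in the text: "CodePage=<number>"
            sb.Append(objPage.ExtractText("CodePage=" + codePage));
        }
        return sb.ToString();
    }
}
```

A call such as CodePageExtraction.ExtractWithCodePage(Server.MapPath("russian.pdf"), 1251) would then return the page text as a Unicode string; the file name here is a placeholder.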
9.4.3 Permission Issues
A secure document may disallow content extraction by clearing Bit 5
of its permission flags (see Section 8.1.2).
To comply with Adobe PDF licensing requirements, AspPDF.NET
enforces this permission flag. For the content extraction functionality to work,
a secure document with Bit 5 cleared must be opened with the owner
password; otherwise, an exception will be thrown.
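The sketch below illustrates one way to handle this requirement: the document is opened with the owner password, and any exception raised during extraction is caught and reported. It assumes that OpenDocument accepts the password as a second argument. Because the text above does not name the exception type, the sketch catches the general Exception class. The names SecureExtraction and TryExtract are hypothetical.

```csharp
using System;
using System.Text;
using Persits.PDF; // assumed AspPDF.NET namespace; the project must reference the component

public static class SecureExtraction
{
    // Hypothetical helper: returns the extracted text, or null if the
    // document could not be opened or content extraction was refused.
    public static string TryExtract(string path, string ownerPassword, out string error)
    {
        error = null;
        try
        {
            PdfManager objPdf = new PdfManager();

            // Assumption: the password is passed as the second argument of OpenDocument.
            PdfDocument objDoc = objPdf.OpenDocument(path, ownerPassword);
            if (objDoc == null)
            {
                // Defensive check only; we do not assume when, if ever, null is returned.
                error = "Document could not be opened.";
                return null;
            }

            StringBuilder sb = new StringBuilder();
            foreach (PdfPage objPage in objDoc.Pages)
            {
                sb.Append(objPage.ExtractText());
            }
            return sb.ToString();
        }
        catch (Exception ex)
        {
            // Raised, for example, when Bit 5 is cleared and the document
            // was not opened with the owner password.
            error = ex.Message;
            return null;
        }
    }
}
```

Returning the error message through an out parameter keeps the calling page free of try/catch blocks; rethrowing or logging the exception would be an equally valid design.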