Extract Text from PDF Document Based on Position C++

Extract text from PDF document based on position c++

First of all, when trying to extract text from a PDF based on its position, while supplying only tx and ty, it does not suffice to only consider the text matrix (which you set using the Tm operator you already found). You also have to consider the current transformation matrix!

I assume when you refer to a position as given in default user space coordinates.

To avoid the device-dependent effects of specifying objects in device space, PDF defines a device-independent coordinate system that always bears the same relationship to the current page, regardless of the output device on which printing or displaying occurs. This device-independent coordinate system is called user space.

The user space coordinate system shall be initialized to a default state for each page of a document. The CropBox entry in the page dictionary shall specify the rectangle of user space corresponding to the visible area of the intended output medium (display window or printed page). The positive x axis extends horizontally to the right and the positive y axis vertically upward

(section 8.3.2.3, ISO 32000-1:2008)

As we only see the x and y coordinates, we see the position as a vector (x, y) in R². Internally, though, PDFs consider this plane embedded in R³ with a constant z coordinate value 1, i.e. [x, y, 1]. This is because PDF wants to allow numerous kinds of transformations (translations, rotations, scaling, skewing, ...) but on the other hand wants to limit the required mathematical operations as far as possible. Incidentally after embedding our plane as [x, y, 1] into R³ all these transformations are possible by means of matrix multiplications:

Single transformation

Here you already see those numbers a, b, c, d, e, and f you asked about.

Now, before taking the text specific transformations into account, you have to take into account the manipulations of the current (text independent) transformation matrix. This matrix is manipulated by the cm operators:

a b c d e f cm Modify the current transformation matrix (CTM) by concatenating the specified matrix (see 8.3.2, "Coordinate Spaces"). Although the operands specify a matrix, they shall be written as six separate numbers, not as an array.

(section 8.4.4, ISO 32000-1:2008)

This implies, BTW, that you have to consider all cm operators currently in action, i.e. all presented since the start of the page content, with the exception of those revoked by restoring a former graphics state (cf. the operators q and Q pushing and restoring graphic states, section 8.4.2, ISO 32000-1:2008).

Only now you can consider the text specific transformation matrices:

At the beginning of a text object, Tm shall be the identity matrix; therefore, the origin of text space shall be initially the same as that of user space. The text-positioning operators, described in Table 108, alter Tm and thereby control the placement of glyphs that are subsequently painted. Also, the text-showing operators, described in Table 109, update Tm (by altering its e and f translation components) to take into account the horizontal or vertical displacement of each glyph painted as well as any character or word-spacing parameters in the text state.

Additionally, within a text object, a conforming reader shall keep track of a text line matrix, Tlm, which captures the value of Tm at the beginning of a line of text. The text-positioning and text-showing operators shall read and set Tlm on specific occasions mentioned in Tables 108 and 109

(section 9.4.2, ISO 32000-1:2008)

Thus, inside of a text object you have to keep track of the text matrix which primarily is set using the Tm operator you found with the operands arranged in the matrix as shown above but which also is manipulated as an effect of other text positioning and text showing operators.

And there still are additional parameters determining the final position of the text, the text state parameters Tfs (the text font size), Th (the horizontal scaling), and Trise (the text rise), cf. section 9.3.1, ISO 32000-1:2008.

Conceptually, the entire transformation from text space to device space [or in your case to the default user space] may be represented by a text rendering matrix, Trm:

Text rendering matrix

Trm is a temporary matrix; conceptually, it is recomputed before each glyph is painted during a text-showing operation.

(section 9.4.2, ISO 32000-1:2008)

Thus, your coordinates (x, y) conceptually result from the text space coordinates by multiplication with Trm:

[x, y, 1] = [xts, yts, 1] x Trm

where (xts, yts) are (0, 0) at the glyphs origin. For every glyph printed you have a glyph displacement to get to the point where the next glyph origin will be positioned:

glyph displacement

The text matrix shall be updated by these glyph displacement values as follows:

Text matrix update by glyph displacement

(section 9.4.4, ISO 32000-1:2008)

I quoted a number of paragraphs from the current PDF specification ISO 32000-1:2008. I gather this is preferable to using the PDF Reference 1.4 which es quite ancient; furthermore it has been called "not normative in nature" by Adobe personal.

EDIT Some clarifications in answer to comments

device space and user space, what is the distinction between them, isn't the device space reffering to printer/ video display? and user space to a way of overcoming every device's particularities? like the user page being the document page that I see?

Yes, the device space is a fixed coordinate system essentially determined by the properties of the device at hand. And yes, the user space is a coordinate system independant from the target device. But no, it is not "the document page you see" because you see it on some device (or after being processed by some device).

The user space coordinate system is an independent coordinate system the coordinates of of a point of which can be translated to the device coordinates by means of a matrix multiplication with the current transformation matrix (CTM).

UserCoords x CTM = DeviceCoords

The user space coordinate system is initialized to a state where the CropBox entry in the page dictionary specifies the rectangle of user space corresponding to the visible area (see above) by initializing the CTM accordingly.

But as the choice of words already indicates ("current transformation matrix", "the coordinate system is initialized"), the user space coordinate system is a dynamic, everchanging coordinate system.

The default user space provides a consistent, dependable starting place for PDF page descriptions regardless of the output device used. If necessary, a PDF content stream may modify user space to be more suitable to its needs by applying the coordinate transformation operator, cm (see 8.4.4, "Graphics State Operators"). Thus, what may appear to be absolute coordinates in a content stream are not absolute with respect to the current page because they are expressed in a coordinate system that may slide around and shrink or expand. Coordinate system transformation not only enhances device-independence but is a useful tool in its own right.

(section 8.3.2.3, ISO 32000-1:2008)

Thus, when a PdfReader stumbles upon a cm operator with its parameters representing some matrix M, the CTM changes:

CTMnew = M x CTMold

and coordinates present in following operators are interpreted according to this new matrix CTMnew:

UserCoords x CTMnew = DeviceCoords

So now the user space coordinate system might be very different from its former state, scaled, rotated, skewed, whatever.

The coordinates you are essentially interested in most likely are those in the coordinate system the user space is initialized as, i.e. the device coordinate system for a virtual device for which the CTM is initialized as identity matrix.

where does text space and glyph space start and end.

The coordinates of text are specified in text space. The transformation from text space to user space are defined by a text matrix in combination with several text-related parameters in the graphics state (see 9.4.2, "Text-Positioning Operators").

The text matrix TM is initialized as the identity matrix at the start of a text object but changes during the execution of text operations, most visibly when you use the Tm operator, implicitly when you use others. This matrix is manipulated by a matrix TR containing the text-related parameters font size, horizontal scaling, and text rise. For details see the text rendering matrix TRM above. Thus,

DeviceCoords = UserCoords x CTM = TextCoords x TR x TM x CTM

The transformation from glyph space to text space shall be defined by the font matrix. For most types of fonts, this matrix shall be predefined to map 1000 units of glyph space to 1 unit of text space; for Type 3 fonts, the font matrix shall be given explicitly in the font dictionary (see 9.6.5, "Type 3 Fonts").

Thus, this transformation depends on the current font. The font matrix FM from the font dictionary would act like this:

DeviceCoords = GlyphCoords x FM x TR x TM x CTM

You do not want to locate the device coordinates of a single segment of a glyph, so these coordinates do not seem to interest. The glyph widths, though, are to be interpreted in glyph space. Unless you are dealing with Type 3 fonts, though, this merely means that you have to divide them by 1000...

And how does parameters w0 and w1 evolve during glyph painting? are they initially (0,0)

w0 and w1 denote the glyph's horizontal and vertical displacements. In horizontal writing mode, w0 is the glyph widths transformed to text mode (i.e. most often merely divided by 1000) and w1 is 0. For vertical writing mode text inspect sections 9.2.4 and 9.7.4.3 in ISO 32000-1:2008.

does text space have the same origin as the first glyph space? and they get updated with the calculated (tx,ty)?

As the glyph space coordinates are merely multiplied by the font matrix to result in text space coordinates and the font matrix in all cases but for Type 3 fonts merely compresses by a factor of 1000, see above, the glyph origin is mapped to the text space origin.

But tx and ty are used to update the text matrix itself. Thus, the text spece coordinate system moves for each glyph and for each (non-Type 3) glyph origin maps to origin... of a slightly changed text space coordinate system.

Extract PDF text by coordinates

Well, thank you for your effort anyone.

I got it using Apache's PDFBox on top of IKVM compilation, and this is the final code:

PDDocument doc = PDDocument.load(@"c:\invoice.pdf");

PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.addRegion("testRegion", new java.awt.Rectangle(0, 10, 100, 100));
stripper.extractRegions((PDPage)doc.getDocumentCatalog().getAllPages().get(0));

string text = stripper.getTextForRegion("testRegion");

And it works like a charm.

Thank you anyway and I hope my own answer will help others. If you need further details, just comment out here and I'll update this answer.

Is there a C++ library to extract text from a PDF file like PDFBox for Java?

Xpdf is a C++ application/library which includes tools to extract plain text from a PDF file.



Related Topics



Leave a reply



Submit