Extract text, glyphs, words and metrics from PDF documents with PHP

SetaPDF-Extractor

PDF Text extraction with PHP

The SetaPDF-Extractor component is written in PHP and allows PHP developers to extract textual content from existing PDF documents.

Besides extracting text, it can also extract words and glyphs together with their positions and bounding boxes.

A simple text extraction from a single page looks like this:

PHP
<?php
require_once("library/SetaPDF/Autoload.php");

// create a document instance
$document = SetaPDF_Core_Document::loadByFilename('Laboratory-Report.pdf');

// create an extractor instance
$extractor = new SetaPDF_Extractor($document);

// get the plain text from page 1
$result = $extractor->getResultByPageNumber(1);

// output
echo '<pre>';
echo htmlspecialchars($result);
echo '</pre>';

In Action

Phrase Search

Create a phrase search with the SetaPDF-Extractor component.
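
As a rough illustration of the idea, a phrase search can be run on plain text that has already been extracted page by page. This is only a minimal sketch under stated assumptions: findPhrase() and the $pageTexts array are hypothetical names and not part of SetaPDF, and the sketch reports page numbers only. It does not reproduce the demo, and it gives no positions on the page.

```php
<?php
// Hypothetical helper (not part of SetaPDF): returns the page numbers whose
// extracted text contains the phrase. $pageTexts maps page number => plain
// text, e.g. filled with the strings returned by getResultByPageNumber().
function findPhrase(array $pageTexts, string $phrase): array
{
    // split the phrase into words so that line breaks or multiple spaces
    // in the extracted text do not prevent a match
    $parts = preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
    if (empty($parts)) {
        return [];
    }

    // escape each word and allow any run of whitespace between the words
    $parts = array_map(function ($part) {
        return preg_quote($part, '/');
    }, $parts);
    $pattern = '/' . implode('\s+', $parts) . '/iu';

    $hits = [];
    foreach ($pageTexts as $pageNo => $text) {
        if (preg_match($pattern, $text) === 1) {
            $hits[] = $pageNo;
        }
    }

    return $hits;
}

// sample data standing in for extracted page text
$pageTexts = [
    1 => "Blood test\nresults were normal.",
    2 => "No findings.",
];
$hits = findPhrase($pageTexts, 'test results');
echo 'Found on page(s): ' . implode(', ', $hits) . "\n";
```

The match is case-insensitive and tolerant of whitespace differences, which helps because extracted text often breaks lines in the middle of a phrase.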

Count Words

Count words in a PDF document with PHP.
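
Counting words can be approximated on the extracted plain text alone. The following is a minimal sketch, not the demo's implementation: countWords() is a hypothetical helper, and it assumes that a "word" is any run of letters or digits.

```php
<?php
// Hypothetical helper (not part of SetaPDF): counts how often each word
// occurs in plain text, such as the string returned by
// getResultByPageNumber().
function countWords(string $text): array
{
    // split on everything that is not a letter or a digit (Unicode-aware);
    // strtolower() only folds ASCII letters, which is enough for this sketch
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $counts = array_count_values($words);
    arsort($counts); // most frequent words first

    return $counts;
}

$counts = countWords('The sample was tested. The result: negative.');
echo array_sum($counts) . ' words, ' . count($counts) . " distinct\n";
```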

Examples of Usage

  • Create a search index for PDF documents 
    Extract the plain text from PDF documents to create a search index.
  • Extract data from specific locations on a PDF page
    For example, an invoice number, sender name, PO number, ...
  • Highlight words in a PDF document
    A fully indexed search catalog may allow your customers to highlight the words in the PDF document with a highlight annotation.
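
The first use case, a search index, could be sketched as an inverted index that maps each word to the pages it occurs on. This is an illustration under assumptions rather than a recommended design: buildIndex() and $pageTexts are hypothetical names, and the page texts are assumed to have been collected beforehand, for example by calling getResultByPageNumber() for each page.

```php
<?php
// Hypothetical helper (not part of SetaPDF): builds a simple inverted index
// (word => list of page numbers) from already extracted page texts.
function buildIndex(array $pageTexts): array
{
    $index = [];
    foreach ($pageTexts as $pageNo => $text) {
        // same assumption as before: a word is a run of letters or digits
        $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

        // record each word once per page
        foreach (array_unique($words) as $word) {
            $index[$word][] = $pageNo;
        }
    }

    return $index;
}

// sample data standing in for extracted page text
$index = buildIndex([
    1 => 'Invoice number 42',
    2 => 'Invoice date',
]);
echo 'Pages containing "invoice": ' . implode(', ', $index['invoice']) . "\n";
```

A real index would usually live in a database or a search engine. The array form only shows the shape of the data.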

Miscellaneous

Questions?

If you are looking for a feature or have any questions about this or any other product, feel free to contact us at support@setasign.com.