Extract text, glyphs, words and metrics from PDF documents with PHP

SetaPDF-Extractor

Downloads and Changelogs of the SetaPDF-Extractor

The following table lists all changelogs and available downloads for the SetaPDF-Extractor component. A full overview of all your licenses is available in your personal Pickup Depot.

Version 2.24.0.1049

Release date: 2017-05-08

Rev. 1026 to 1049

SetaPDF-Extractor Component
Feature
  • Implemented handling of invisible glyphs above visible glyphs. They are now ignored and no longer trigger a word break in either the Word or the ExactPlain strategy.
  • Added setCleanStreamCallback() and getCleanStreamCallback() in the abstract strategy class.
  • Added SetaPDF_Extractor_ContentStreamCleaner class and implemented it as the default callback in all strategies. It allows you to clean up a content stream before it is parsed, which improves performance for documents with, for example, a very large number of vector operations.
Bugfix
  • Pass $_ignoreSpaceCharacter property in SetaPDF_Extractor_Strategy_ExactPlain to sub-instance.
Tweak
  • Optimized code in sorting functions.
SetaPDF-Core Component
Feature
  • Added setTabOrder() and getTabOrder() in SetaPDF_Core_Document_Page_Annotations class.
  • Added method to normalize line breaks: SetaPDF_Core_Text::normalizeLineBreaks().
  • Added SetaPDF_Core_Document::getSaveMethod() and renamed $update parameter in save() method to $saveMethod (meaning is the same as before).
  • Added support for PDF documents with invalid data before their PDF file header.
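To illustrate the last item: tolerating stray bytes before the file header comes down to searching for the "%PDF-" marker instead of requiring it at byte 0. The following is a minimal sketch in plain PHP, not SetaPDF's actual implementation; the function name findPdfHeaderOffset() and the 1024-byte search window are assumptions made for this example.

```php
<?php
/**
 * Hypothetical helper (not part of SetaPDF): returns the byte offset of the
 * "%PDF-" header marker within the first $window bytes, or null if absent.
 */
function findPdfHeaderOffset(string $data, int $window = 1024): ?int
{
    // Only inspect the start of the data; the window size is an assumption.
    $head = substr($data, 0, $window);
    $pos = strpos($head, '%PDF-');

    // strpos() returns false when the marker is missing; 0 is a valid offset.
    return $pos === false ? null : $pos;
}

// Usage: data with stray bytes in front of the header.
$bytes = "\r\n<junk>%PDF-1.4\n";
$offset = findPdfHeaderOffset($bytes);
if ($offset !== null) {
    // Drop everything before the header so later offsets are relative to it.
    $bytes = substr($bytes, $offset);
}
```

A reader that accepts such files would typically also have to add this offset to the byte positions stored in the cross-reference table, since those are counted from the header in many faulty files.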
Bugfix
  • Fixed a bug that occurred when AES encryption was used and the Length value of a stream was an indirect object that had already been written because of the SetaPDF_Core_Document::SAVE_METHOD_REWRITE_ALL save method.
  • Prevent "String offset cast occurred" notice in abstract reader class.
  • Fixed SetaPDF_Core_Document_Destination::findByName() when name was not found.
Tweak
  • Normalized line breaks in SetaPDF_Core_Text_Block class.
  • Check for null bytes in file names before passing them to PHP functions to prevent warnings or notices.
  • Added support for faulty objects with a generation number higher than zero in object streams.
  • Allow date strings without "D:" prefix.
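The last item can be pictured with a small parser that treats the "D:" prefix as optional. This is a sketch in plain PHP under stated assumptions, not the code used by SetaPDF: the function name parsePdfDate() is hypothetical, missing date parts fall back to their lowest value, and a missing time zone is treated as UTC for simplicity.

```php
<?php
/**
 * Hypothetical helper (not part of SetaPDF): parses a PDF date string such as
 * "D:20170508120000+02'00'" and also accepts it without the "D:" prefix.
 * Returns null if the string does not match the expected layout.
 */
function parsePdfDate(string $value): ?DateTimeImmutable
{
    // Groups: 1 year, 2 month, 3 day, 4 hour, 5 minute, 6 second,
    // 7 "Z", 8 offset sign, 9 offset hours, 10 offset minutes.
    $pattern = '/^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?'
        . '(?:(Z)|([+-])(\d{2})\'?(\d{2})?\'?)?$/';
    if (!preg_match($pattern, trim($value), $m)) {
        return null;
    }

    // Unmatched optional groups are either missing or empty strings.
    $get = static function (int $i, string $default) use ($m): string {
        return isset($m[$i]) && $m[$i] !== '' ? $m[$i] : $default;
    };

    // Assumption: no offset (or "Z") is treated as UTC.
    $offset = $get(8, '') === ''
        ? '+00:00'
        : $get(8, '+') . $get(9, '00') . ':' . $get(10, '00');

    $iso = sprintf(
        '%s-%s-%sT%s:%s:%s%s',
        $m[1], $get(2, '01'), $get(3, '01'),
        $get(4, '00'), $get(5, '00'), $get(6, '00'),
        $offset
    );

    $date = DateTimeImmutable::createFromFormat('Y-m-d\TH:i:sP', $iso);

    return $date ?: null;
}

// Usage: both spellings are accepted.
$withPrefix = parsePdfDate("D:20170508120000+02'00'");
$withoutPrefix = parsePdfDate('20170508');
```

Making the prefix optional in the pattern keeps a single code path for both spellings, so strict and lenient input are validated in the same way.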
Demo
  • Updated get image sizes demo to handle rotated pages correctly.