Extract text, glyphs, words and metrics from PDF documents with PHP

SetaPDF-Extractor

Downloads and Changelogs of the SetaPDF-Extractor

The following table lists all changelogs and available downloads for the SetaPDF-Extractor component. A full overview of all your licenses is available in your personal Pickup Depot.

Version 2.36.0.1597

Release date: 2020-12-22

Rev. 1508 to 1597

SetaPDF-Extractor Component
Feature
  • Added setKeepIntersectingSpaces() and getKeepIntersectingSpaces() methods to control how overlapping space characters are handled.
Change
  • Moved SetaPDF_Extractor_TextItem::_getFontBBoxVector() to the SetaPDF_Core_Font class in the Core component.
Bugfix
  • Do not process text if no font is registered in the graphics state in all strategies extending the Glyph strategy.
Tweak
  • Optimized some if-logic in the Plain strategy.
  • Added SetaPDF_Extractor_Sorter_Baseline::(set|get)BaselineThreshold() methods to allow modification of the threshold value.
  • Optimized handling of inline images.
  • Optimized interpretation of obsolete empty strings (recurring spaces) in all strategies that inherit the Glyph strategy (all but the Plain text strategy).
SetaPDF-Core Component
Feature
  • Added a TempStream writer class that uses buffering and a temporary stream internally. Combined, these give the best CPU and memory usage.
  • Updated Http writer to extend the new TempStream writer class to reduce CPU and memory usage.
  • Added support for multiple quadrilaterals in the setQuadPoints() method of all annotation classes that support quad points.
  • Added SetaPDF_Core_Canvas_Draw::polygon() method.
Bugfix
  • Fixed fallback logic in Flate filter class.
  • Ignore invalid references in AcroForm data and handle them as if they did not exist.
  • Allow DateTimeInterface instances in set*Date() methods in Annotation classes.
  • Handle missing encoding information in Type0 fonts instead of throwing a fatal error.
  • Handle empty Keyword array in XmpHelper class.
  • Resolve values of FontBBox array in several font classes (instead of accessing it directly).
  • Fixed access to undefined annotation appearance streams.
  • Added compatibility for PHP 8.
  • Handle invalid /Parent values in form field structures.
Tweak
  • Recreate FirstChar or LastChar entries in faulty Type1 and TrueType fonts (if possible).
  • Added support for DateTimeInterface in SetaPDF_Core_DataStructure_Date class.
  • Return color space instance in SetaPDF_Core_TransparencyGroup::getColorSpace() instead of raw value.
  • Fixed too many recursions in GIF frame handling.
  • Micro-optimizations in the PDF parser classes.
  • Ignore empty strings or streams in AES encrypted documents.
  • Moved SetaPDF_Core_Font::getFontBBox($recalculate=true) to its own method (recalculateFontBBox).
  • Optimized performance of implementations of SetaPDF_Core_Font_FontInterface::getFontBBox().
  • Optimized performance of SetaPDF_Core_Geometry Rectangle and Vector.
  • Ensure that SetaPDF_Core_Document_Info::getDictionary(true) resolves the dictionary, or creates it if needed.
  • Speed optimization for parsing of JPEG files.
  • Optimized SetaPDF_Core_Document_Catalog::setOpenAction() to accept "null" as a parameter, which removes the OpenAction entry from the catalog dictionary.
  • Optimized handling of custom lists in SetaPDF_Core_Font_Glyph_List.
  • Optimized handling of the /Differences array in Simple and Type3 fonts.
  • Reduce page boundaries that extend beyond the media box to their intersection with the media box.
  • Optimized SetaPDF_Core_Type_Dictionary_Helper::resolveAttribute() to ignore references that cannot be found.
Demos
  • Completely rewritten with a structured GUI and now hosted on GitHub.