Concatenate or split existing PDF documents with PHP

SetaPDF-Merger


Optimize Resources

Sometimes you only need to extract or merge a single page or a specific page range from a PDF document. Depending on the internal structure of the PDF document, the resulting document may have nearly the same file size as the original.

Why does that happen?

Each page that uses images, fonts or any other resource type needs access to a resource dictionary that maps explicit names to individual objects (e.g. images or fonts). Technically, this entry may reference a global resource dictionary that is shared by, e.g., all pages of a document. The entry may also be inherited by all pages from an intermediate page tree node. If a page is extracted, the SetaPDF-Merger component copies all resources defined in the resource dictionary as well (whether they are used or not). Because of this, the file size of the result may be nearly the same as that of the whole document.
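The inheritance described above can be illustrated with a minimal sketch. It models pages and page tree nodes as plain PHP arrays; the function `resolveResources()` and the array layout are hypothetical and are not part of the SetaPDF API.

```php
<?php

/**
 * Hypothetical model: each node is an array with an optional 'Resources'
 * entry and an optional 'Parent' node. This is NOT the SetaPDF API.
 */
function resolveResources(array $node): array
{
    // Walk up the page tree until a node defines a Resources entry.
    while (true) {
        if (isset($node['Resources'])) {
            return $node['Resources'];
        }

        if (!isset($node['Parent'])) {
            return [];
        }

        $node = $node['Parent'];
    }
}

// One resource dictionary on the page tree root, inherited by every page.
$root = [
    'Resources' => [
        'XObject' => ['Im1' => 'image 1', 'Im2' => 'image 2'],
        'Font'    => ['F1' => 'font 1', 'F2' => 'font 2'],
    ],
];

// This page only uses /F1 and /Im1 and defines no resources of its own.
$page1 = [
    'Parent'   => $root,
    'Contents' => 'BT /F1 12 Tf (Hello) Tj ET /Im1 Do',
];

$resources = resolveResources($page1);

// All four entries are reachable from the page, although it uses only two.
echo count($resources['XObject']) + count($resources['Font']), " resources reachable\n";
// prints "4 resources reachable"
```

A component that copies a page together with its resolved resource dictionary therefore copies all four objects, even though the content of the page references only two of them.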

This demo goes a step further: it analyzes the resulting document and removes unused resources (fonts and images) from the resulting resource dictionaries. This feature is not enabled by default because parsing the content stream of every page adds overhead that slows down the whole process.
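The underlying idea can be sketched without any PDF library: collect the names a content stream passes to the `Do` (paint XObject) and `Tf` (select font) operators, then compare them with the names declared in the resource dictionary. The functions `findUsedResourceNames()` and `findUnusedResourceNames()` are hypothetical names used for illustration only. The regular expressions are a deliberately naive stand-in for a real content stream parser and would be misled by, e.g., operator-like text inside string literals.

```php
<?php

/**
 * Naive scan of a content stream for names used by the "Do" and "Tf"
 * operators. Hypothetical helper, not a full content stream parser.
 */
function findUsedResourceNames(string $contentStream): array
{
    $used = ['XObject' => [], 'Font' => []];

    // "/Name Do" paints the XObject registered under /Name.
    if (preg_match_all('~/([^\s/()<>\[\]{}%]+)\s+Do\b~', $contentStream, $matches)) {
        foreach ($matches[1] as $name) {
            $used['XObject'][$name] = $name;
        }
    }

    // "/Name size Tf" selects the font registered under /Name.
    if (preg_match_all('~/([^\s/()<>\[\]{}%]+)\s+[-+]?[\d.]+\s+Tf\b~', $contentStream, $matches)) {
        foreach ($matches[1] as $name) {
            $used['Font'][$name] = $name;
        }
    }

    return $used;
}

/**
 * Returns, per resource type, the declared names that are never used.
 */
function findUnusedResourceNames(array $declared, array $used): array
{
    $unused = [];
    foreach ($declared as $type => $names) {
        $usedNames = isset($used[$type]) ? $used[$type] : [];
        $unused[$type] = array_values(array_diff($names, $usedNames));
    }

    return $unused;
}

// Names declared in a shared resource dictionary (made-up example data).
$declared = [
    'XObject' => ['Im1', 'Im2', 'Im3', 'Im4'],
    'Font'    => ['F1', 'F2', 'F3', 'F4'],
];

// Content stream of a single page that draws one image and one line of text.
$stream = 'q 200 0 0 100 50 600 cm /Im2 Do Q BT /F2 12 Tf 50 550 Td (Page 2) Tj ET';

$used   = findUsedResourceNames($stream);
$unused = findUnusedResourceNames($declared, $used);

echo implode(', ', $unused['XObject']), "\n"; // prints "Im1, Im3, Im4"
echo implode(', ', $unused['Font']), "\n";    // prints "F1, F3, F4"
```

The unused names are the ones that can be dropped from the page's copy of the resource dictionary without changing how the page is rendered.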

We created a helper class that will remove unused resources (fonts and images) from a document instance:

PHP
<?php

/**
 * Class OptimizePagesResources
 *
 * This class will analyse the content stream of all pages of a document and will remove unused resources (XObjects
 * and Fonts) from the resource dictionaries.
 */
class OptimizePagesResources
{
    /**
     * The document instance
     *
     * @var \SetaPDF_Core_Document
     */
    protected $_document;

    /**
     * An array mapping resource dictionary identifiers to the numbers of the pages that use them
     *
     * @var array
     */
    protected $_pagesByResourceDictionary = array();

    /**
     * An array mapping page numbers to the resource names used by that page, grouped by resource type
     *
     * @var array
     */
    protected $_resourcesUsedByPage = array(
        'XObject' => array(),
        'Font' => array(),
    );

    /**
     * An array with counts of removed resources
     *
     * @var array
     */
    protected $_removedResources = array();

    /**
     * The current page number
     *
     * @var int
     */
    protected $_currentPageNo;

    /**
     * Removes unused resources from all pages resource dictionaries.
     *
     * @param \SetaPDF_Core_Document $document
     * @return array
     */
    public static function optimize(\SetaPDF_Core_Document $document)
    {
        $instance = new self($document);
        $instance->_parse();
        $instance->_removeUnusedResources();

        return $instance->_removedResources;
    }

    /**
     * The constructor.
     *
     * @param \SetaPDF_Core_Document $document
     */
    private function __construct(\SetaPDF_Core_Document $document)
    {
        $this->_document = $document;
    }

    /**
     * Parse all pages and extract all required information.
     */
    protected function _parse()
    {
        $pages = $this->_document->getCatalog()->getPages();
        for ($this->_currentPageNo = 1, $pageCount = $pages->count(); $this->_currentPageNo <= $pageCount; $this->_currentPageNo++) {
            $page = $pages->getPage($this->_currentPageNo);

            // let's parse all used resources
            if (!isset($this->_resourcesUsedByPage[$this->_currentPageNo])) {
                $this->_resourcesUsedByPage[$this->_currentPageNo] = array();
            }

            $contentParser = new \SetaPDF_Core_Parser_Content($page->getContents()->getStream());
            $contentParser->registerOperator('Do', function($arguments) {
                $resourceName = $arguments[0]->getValue();
                $this->_resourcesUsedByPage[$this->_currentPageNo]['XObject'][$resourceName] = $resourceName;
            });
            $contentParser->registerOperator('Tf', function($arguments) {
                $resourceName = $arguments[0]->getValue();
                $this->_resourcesUsedByPage[$this->_currentPageNo]['Font'][$resourceName] = $resourceName;
            });
            $contentParser->process();


            // group resource dictionaries by page numbers and clone resource dictionaries
            $resources = $page->getAttribute('Resources')->getValue();
            if ($resources instanceof \SetaPDF_Core_Type_Dictionary) {
                $page->getObject(true)->ensure()->offsetSet('Resources', clone $resources);
                $this->_pagesByResourceDictionary['direct-' . $this->_currentPageNo] = array($this->_currentPageNo);

            } elseif ($resources instanceof \SetaPDF_Core_Type_IndirectObjectInterface) {
                $resources = $this->_document->cloneIndirectObject($resources->getValue());
                $page->getObject(true)->ensure()->offsetSet('Resources', $resources);

                $ident = $resources->getObjectIdent();
                if (!isset($this->_pagesByResourceDictionary[$ident])) {
                    $this->_pagesByResourceDictionary[$ident] = array();
                }

                $this->_pagesByResourceDictionary[$ident][] = $this->_currentPageNo;
            }
        }
    }

    /**
     * This method removes the unused resources from the resource dictionaries.
     */
    protected function _removeUnusedResources()
    {
        $pages = $this->_document->getCatalog()->getPages();

        foreach ($this->_pagesByResourceDictionary AS $ident => $pageNumbers) {
            // collect used resources by pages
            $usedResources = array();
            foreach ($pageNumbers AS $pageNumber) {
                $resourceTypes = $this->_resourcesUsedByPage[$pageNumber];
                foreach ($resourceTypes AS $resourceType => $resourceNames) {
                    foreach ($resourceNames AS $resourceName) {
                        $usedResources[$resourceType][$resourceName] = $resourceName;
                    }
                }
            }

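            /**
             * All pages of a group share the same resource dictionary,
             * so it can be fetched through the first page of the group.
             *
             * @var \SetaPDF_Core_Type_Dictionary $resources
             */
            $resources = $pages->getPage($pageNumbers[0])->getAttribute('Resources')->ensure();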

            // only XObject and Font usage is tracked while parsing, so only these types are cleaned up
            $resourceTypes = array('XObject', 'Font');
            foreach ($resourceTypes AS $resourceType) {
                /**
                 * @var \SetaPDF_Core_Type_Dictionary $resourcesByType
                 */
                $resourcesByType = $resources->getValue($resourceType);
                if ($resourcesByType) {
                    $resourcesByType = $resourcesByType->ensure();
                    if (!$resourcesByType instanceof \SetaPDF_Core_Type_Dictionary) {
                        continue;
                    }

                    foreach ($resourcesByType->getKeys() AS $resourceName) {
                        if (!isset($usedResources[$resourceType][$resourceName])) {
                            $resourcesByType->offsetUnset($resourceName);
                            if (!isset($this->_removedResources[$resourceType])) {
                                $this->_removedResources[$resourceType] = 0;
                            }

                            $this->_removedResources[$resourceType]++;
                        }
                    }
                }
            }
        }
    }
}

The following demo takes a 4-page PDF document that displays 4 different images and text written in 4 different fonts, all defined in a single resource dictionary.

(You can view the original PDF document here)
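A minimal sketch of how the helper class might be combined with the SetaPDF-Merger component to extract a single page and optimize the result. The autoloader path, the file names and the file that holds the helper class are assumptions and have to be adapted to your environment.

```php
<?php

// Assumption: SetaPDF is installed and its autoloader is located here.
require_once 'library/SetaPDF/Autoload.php';
// Assumption: the helper class shown above was saved under this file name.
require_once 'OptimizePagesResources.php';

// Hypothetical input file: 4 pages sharing a single resource dictionary.
$filename = 'files/4-pages-shared-resources.pdf';

$merger = new \SetaPDF_Merger();
// Extract only the first page of the source document.
$merger->addFile($filename, array(1));
$merger->merge();

$document = $merger->getDocument();

// Remove all XObject and Font entries that the page content never references.
$removedResources = OptimizePagesResources::optimize($document);

// Hypothetical output file name.
$document->setWriter(new \SetaPDF_Core_Writer_File('page-1-optimized.pdf'));
$document->save()->finish();

// Report how many entries were removed per resource type.
foreach ($removedResources as $resourceType => $count) {
    echo $resourceType, ': ', $count, " removed\n";
}
```

Saving the same extraction once with and once without the call to `OptimizePagesResources::optimize()` and comparing the file sizes is a simple way to check the effect on a given document.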