Friday, September 5, 2014

Conversion to PDF, text extraction and thumbnail generation for Office, OpenOffice and other standard documents without Microsoft Office installed

Introduction

My team and I at Mikrocop d.o.o. have been testing a variety of products for use in our enterprise content management system. We needed a product that would let us extract document contents for building a search index and convert those documents to PDF or thumbnails, so that we could provide a unified interface through which our users can view documents on various devices.

Content extraction

A cornerstone of every content management system is the search engine used to locate documents once they are stored. To give users the widest range of search queries, it is necessary to build a search index that contains not only the metadata of each file, but also its contents.
Since we were required to support a wide variety of file types, including Microsoft Office, OpenOffice, TIF, PDF, MSG and RTF files, we were looking for products that would support as many of those file types as possible, without dependencies that would prevent them from being used on servers. We tested a variety of products, of which Aspose proved to be both the easiest to use and, in most cases, the fastest at extracting text. Aspose proved capable of extracting content from all of those file types, including any attachments embedded within them. Support for older Microsoft Office formats will also allow us to provide the same functionality for older files archived within our system.
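As a rough illustration of an index that covers both metadata and contents, here is a minimal sketch in Python. It is not our production code and it does not use Aspose: it assumes the text has already been extracted from each file, and every name in it (`build_index`, `search`, the sample documents) is hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Build an inverted index: term -> set of document ids.

    Terms are taken from every metadata value and from the extracted
    content, so a query can match either one.
    """
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        fields = list(doc["metadata"].values()) + [doc["content"]]
        for field in fields:
            for term in tokenize(str(field)):
                index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample data; "content" stands for already extracted text.
documents = {
    "doc-1": {
        "metadata": {"title": "Invoice 2014", "author": "Ana"},
        "content": "Payment due in thirty days",
    },
    "doc-2": {
        "metadata": {"title": "Meeting notes", "author": "Marko"},
        "content": "Discussed invoice archiving",
    },
}

index = build_index(documents)
```

With this sample data, `search(index, "invoice")` matches both documents, the first through its title and the second through its contents, which is the reason for indexing both.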

Thumbnail generation

To make it easier for our customers to find the documents they are looking for, and to reduce the number of documents downloaded, we decided to convert the first page of each document to a thumbnail. This allows users to preview a document before they start a download, reducing the time it takes them to locate the document they were looking for.
As with text extraction, we were looking for a product that would support a wide variety of document types without any external dependencies. Aspose once again proved to be the best choice. We have not found any other product that matches Aspose in terms of speed or ease of use.
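One small detail of thumbnail generation is choosing the output size so that the first page keeps its proportions. The sketch below shows only that arithmetic, under the assumption that the rendering itself is done by the conversion library; the function name and the 200 x 200 bounding box are hypothetical and not taken from our system.

```python
def thumbnail_size(page_width, page_height, max_width, max_height):
    """Fit a page into a bounding box while preserving its aspect ratio.

    Pages that are already smaller than the box are not enlarged.
    """
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    # Use the tighter of the two constraints; cap at 1.0 to avoid upscaling.
    scale = min(max_width / page_width, max_height / page_height, 1.0)
    width = max(1, round(page_width * scale))
    height = max(1, round(page_height * scale))
    return width, height


# Example: a portrait page of 2480 x 3508 pixels, fitted into 200 x 200.
portrait = thumbnail_size(2480, 3508, 200, 200)
```

For a portrait page the height is the limiting side, so the thumbnail fills the full height of the box and is narrower than the box is wide.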

Conversion to PDF

Due to the wide variety of devices our system is required to support, from desktop computers to tablets, we faced the challenge of providing a unified interface for reading documents. We decided to convert documents to PDF and display them using PDF.js.
The excellent support for document conversion in the Aspose framework proved more than capable of this task and, just as with thumbnail generation, proved faster than the competition.

Conclusion

Aspose.Total supports all our needs, making it easy to handle various types of documents. It offers more than decent performance, which is critical in big data cloud systems. While reviewing the components we also ran into some problems, which were quickly resolved by Aspose technical support.