I know this may seem like a simple task, and you can probably find references on the web about how to do it. But I ran into this problem today, so I'm writing a blog post about it anyway.

So, if you have a PDF file and don't know how to read data from it, here is what you could do.

First of all, you'll need some DLLs that will help you manipulate PDF files. I came across PDFBox. What is PDFBox? To quote their website: "PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. PDFBox also includes several command line utilities."

Oh, nice, you'll say, but I need a .NET solution. Don't worry. Even though PDFBox is written in Java, a .NET version is also available. It uses IKVM (a very interesting project in its own right: an implementation of Java for the .NET Framework and Mono) to provide a fully functioning PDF library for the .NET Framework. The released version contains a bin directory with all of the required DLL files.

So you’ll have to download the PDFBox package. In this package you’ll find a bin directory. To read your PDF file, you’ll need the following files:

  • IKVM.GNU.Classpath.dll
  • PDFBox-0.7.3.dll
  • FontBox-0.1.0-dev.dll
  • IKVM.Runtime.dll

 

You'll have to add a reference to the first two (IKVM.GNU.Classpath.dll and PDFBox-0.7.3.dll) in your project. You'll also have to copy the last two (FontBox-0.1.0-dev.dll and IKVM.Runtime.dll) into your project's bin directory.

The program will look something like this (if you’re working with a Console application):

 

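using System;
using org.pdfbox.pdmodel;
using org.pdfbox.util;

namespace PDFReader
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load the PDF; the file is expected next to the executable.
            PDDocument doc = PDDocument.load("lopreacamasa.pdf");
            try
            {
                // PDFTextStripper extracts the text content of the document.
                PDFTextStripper pdfStripper = new PDFTextStripper();
                Console.Write(pdfStripper.getText(doc));
            }
            finally
            {
                // Always close the document to release the file.
                doc.close();
            }
        }
    }
}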

 

Of course, in a real project you'd keep the path to the PDF in your App.config file, but for simplicity I've assumed here that the PDF file is in the bin directory of your project. The next article will be the second part of my "Sidebar gadget" article. So stay tuned!
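
As a footnote to the App.config remark: here is a minimal sketch of how that could look. The key name "PdfPath" and the sample path are my own assumptions for illustration, not something PDFBox requires, and the project needs a reference to System.Configuration.dll for ConfigurationManager to be available.

```csharp
// Assumed App.config content (the "PdfPath" key and its value are hypothetical):
//
// <configuration>
//   <appSettings>
//     <add key="PdfPath" value="C:\docs\lopreacamasa.pdf" />
//   </appSettings>
// </configuration>

using System;
using System.Configuration; // needs a reference to System.Configuration.dll
using org.pdfbox.pdmodel;
using org.pdfbox.util;

namespace PDFReader
{
    class Program
    {
        static void Main(string[] args)
        {
            // Read the path from App.config; "PdfPath" is a made-up key name.
            string path = ConfigurationManager.AppSettings["PdfPath"];
            if (string.IsNullOrEmpty(path))
            {
                // Fall back to the file sitting next to the executable.
                path = "lopreacamasa.pdf";
            }

            PDDocument doc = PDDocument.load(path);
            try
            {
                PDFTextStripper pdfStripper = new PDFTextStripper();
                Console.Write(pdfStripper.getText(doc));
            }
            finally
            {
                // Close the document whether or not extraction succeeded.
                doc.close();
            }
        }
    }
}
```

The fallback keeps the program working when the setting is missing, which is handy while you're still experimenting.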
