
How to Translate PDF Documents (Using a Heterogeneous Set of Tools)


By Daniel Storbeck. Submitted on February 19, 2009

About the author: Daniel Storbeck has a PhD in linguistics and works as a freelance programmer and translator. As a translator, he specializes in computer science, IT, general science, linguistics and speech technology.



Abstract

This article describes a workflow for translating text documents which are only available in the PDF format. A heterogeneous set of tools is used, consisting of free and non-free software. The workflow is ad hoc and it is meant to illustrate what you can do, not what you should do.

Motivation

PDF is a format intended for publication, not for editing. Therefore some well-known systems for computer-aided translation (CAT) don't support PDF as a source format. But sometimes the PDF version of a document is all you have, and you need to transform it somehow into a format that your CAT tool accepts. After translation, the resulting document is usually in an editable format, which may not be what you want to deliver, so you also need a way to transform it back to PDF.

Assumptions about the reader

To profit from reading this article you should already know, or be willing to learn, what a command shell is and how to issue commands at its command prompt. Ideally you have already written a couple of shell scripts. Part of the article is a script for the bash command shell, which is common on unix-like systems. In this article I only provide background information about the specific tools involved in the workflow. If you feel you need more information about general tools like operating systems or command shells, please look it up in places like Wikipedia or via a search engine like Google.

Copyright issues

In what follows we assume that you are the author of the PDF document you want to translate or that you have the author's permission to do so.

Workflow overview

The described workflow illustrates only one possible way to translate PDF documents, and I don't present it as recommended practice, for two reasons. The first is that I carried out the process on two kinds of systems, namely a unix-like operating system and a Microsoft Windows operating system, because some of the tools were available only on one platform and the rest only on the other. The second is that although some of the tools are available under a free software license, for others you will have to pay directly or indirectly. So it may be desirable for you to find variations of the workflow which allow you to do the entire process on only one platform or using only free software.

The workflow I'm going to describe is as follows:

  1. The PDF source document is transformed to the PS format.
  2. From the PS document the individual pages are extracted.
  3. The individual pages are converted to JPEG images.
  4. OCR software running in batch mode produces a Microsoft Word document from the JPEG images.
  5. You translate the Word document with your CAT tool.
  6. The target document is transformed back to the PDF format.

One further remark is in order. You may think that the detour via conversion to images and OCR isn't necessary, as there are free tools that convert PDF to HTML, which is a common source format for some CAT tools. This may be true for documents consisting entirely of continuous text, but there are also elements whose structure is transformed incorrectly or lost completely, like programming language source code listings or tables. So the described workflow assumes worst-case documents that require OCR.

Detailed description

I did the first three steps on a GNU/Linux system and steps 4 to 6 on a Windows system just because the software I used was already installed on these systems. In the examples that follow the source document is called mydoc.pdf.

Steps 1 to 3

The three tools needed for the first three steps are part of three different software packages: xpdf, psutils and ImageMagick. The following three sections present the tools and their basic usage on the command line. After this I will show how to automate this part of the conversion with a simple shell script.

1. As a first step the source document is transformed from the Portable Document Format (PDF) to the PostScript format (PS). Like PDF, PS is not intended for editing documents. We do this conversion only because the next tool we need works with PS documents. The conversion to PS is done like this on the command line:

pdftops mydoc.pdf

This command produces a PS document with the same name but with the filename extension ps. The pdftops command is part of the xpdf software package, see [1].

2. Next we need to extract the individual pages of the PS document. This step will produce a set of further PS documents, each of which will contain a single page of the source document. To extract the first page of the source document on the command line and store it in an individual PS document we use the psselect command like this:

psselect -p1 mydoc.ps mydoc1.ps

where 1 is the number of the page and mydoc1.ps is the name of the generated document that will contain the extracted page. You will have noticed that doing this step over and over again for a document of many pages is tedious and needs to be automated. The command psselect is part of the psutils software package, see [2].
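To extract every page rather than only the first, the psselect call can be wrapped in a loop. The following is a minimal sketch for illustration, not the script mentioned later in this article. The function names page_file and extract_pages and the PSSELECT override variable are made up here, and the page count has to be passed in by hand.

```shell
# Sketch: split BASE.ps into one PS file per page using psselect.
# page_file, extract_pages and PSSELECT are illustrative names only.

# page_file BASE PAGE: print the output filename for one page.
page_file() {
    printf '%s%d.ps\n' "$1" "$2"
}

# extract_pages BASE COUNT: run psselect once for each page 1..COUNT.
# Set PSSELECT=echo for a dry run that only prints the arguments.
extract_pages() {
    local base="$1" count="$2" page=1
    while [ "$page" -le "$count" ]; do
        "${PSSELECT:-psselect}" -p"$page" "$base.ps" \
            "$(page_file "$base" "$page")" || return 1
        page=$((page + 1))
    done
}

# Example: extract_pages mydoc 12
```

The filenames produced here are not zero-padded, so they suffer from the ordering problem discussed in the section on automation.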

3. By now we have a set of PS documents containing individual pages. In the next step we want to convert each of them to a format that can be read by optical character recognition (OCR) software. Here I will target the JPEG format. To convert one PS document from the set we use the convert command like this:

convert -density 300 mydoc1.ps mydoc1.jpg

where the density option specifies the resolution in dots per inch. Here we run into the same problem as in step 2: the conversion of the whole set of source documents needs to be automated unless you only want to convert a very small number of pages. The convert command is part of the ImageMagick software package, see [3].
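A batch version of this conversion could look like the sketch below. It is an illustration under assumptions, not the author's script: jpg_name, convert_pages and the CONVERT override variable are hypothetical names. In ImageMagick 7 the same operation is also reachable through the magick command, which the override can select.

```shell
# Sketch: convert a list of single-page PS files to JPEG images.
# jpg_name, convert_pages and CONVERT are illustrative names only.

# jpg_name FILE.ps: print the matching JPEG filename.
jpg_name() {
    printf '%s\n' "${1%.ps}.jpg"
}

# convert_pages DENSITY FILE.ps...: convert each file at DENSITY dpi.
# Set CONVERT=echo for a dry run that only prints the arguments.
convert_pages() {
    local density="$1" ps
    shift
    for ps in "$@"; do
        "${CONVERT:-convert}" -density "$density" "$ps" \
            "$(jpg_name "$ps")" || return 1
    done
}

# Example: convert_pages 300 mydoc[0-9]*.ps
```

Stopping at the first failure (the return 1) keeps a partially converted set from going unnoticed.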

Automating the previous steps

For documents with more than a handful of pages, steps 2 and 3 are tedious to execute by hand. Furthermore, the individual JPEG image files need to be numbered in a way that preserves the order of the pages in the original document. For a document with 1000 pages we don't want filenames like mydoc1.jpg, mydoc11.jpg, mydoc111.jpg etc. because directory listings are sorted character by character: mydoc11.jpg, for example, will be listed after mydoc100.jpg. This matters in later steps, where tools that process all files in a directory assume that the order of the files represents the order of the pages in the document. I have provided a bash script that takes care of this problem as well. It creates filenames in which the page number is padded with leading zeros to the number of digits in the document's page count, so that page 11 of 1000 becomes mydoc0011.jpg.
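The padding scheme can be sketched in a few lines of shell. This is an illustration only, not the script referred to above; padded_name is a hypothetical helper, and the padding width is taken from the number of digits in the total page count.

```shell
# Sketch: build page filenames whose numeric part is zero-padded.
# padded_name is an illustrative name, not part of any package.

# padded_name BASE PAGE COUNT EXT: pad PAGE to the digit count of COUNT.
padded_name() {
    local width="${#3}"    # e.g. COUNT=1000 gives a width of 4
    printf "%s%0${width}d.%s\n" "$1" "$2" "$4"
}

# padded_name mydoc 11 1000 jpg   prints mydoc0011.jpg
# padded_name mydoc 100 1000 jpg  prints mydoc0100.jpg
```

With these names a plain character-by-character sort places mydoc0011.jpg before mydoc0100.jpg, which matches the page order.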

Please note that this script works on my system and its particular setup, but may fail on yours for any number of reasons. So if you want to use the script, review it first and adapt it to your working environment. If you don't know how to do that, don't execute it; treat it as merely illustrative code instead.

That said, please note also that the script has been written for a unix-like environment. I executed it under a bash shell but didn't test it with other shells. The script expects exactly one argument which should be the name of a PDF file in the same directory. The filename must end with '.pdf' and the whole name must not contain whitespace, i.e. don't use it with filenames like "How to translate PDF documents.pdf". The script assumes that a number of other tools are available on your system, like pdfinfo ([1]), grep, awk etc.
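The constraints just listed can be checked before any conversion starts. The sketch below is an illustration only, not the script described above. The function names are hypothetical, and it assumes that pdfinfo reports the page count on a line beginning with "Pages:".

```shell
# Sketch: pre-flight checks for a PDF conversion script.
# All function names here are illustrative.

# check_pdf_name ARGS...: succeed only for exactly one argument that
# ends in .pdf and contains no whitespace.
check_pdf_name() {
    [ "$#" -eq 1 ] || return 1
    case "$1" in
        *[[:space:]]*) return 1 ;;
        ?*.pdf) return 0 ;;
        *) return 1 ;;
    esac
}

# require_tools NAME...: fail on the first tool missing from the PATH.
require_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing tool: $tool" >&2
            return 1
        }
    done
}

# parse_page_count: read pdfinfo output on stdin, print the page count.
parse_page_count() {
    grep '^Pages:' | awk '{ print $2 }'
}

# page_count FILE.pdf: print the number of pages in the PDF file.
page_count() {
    pdfinfo "$1" | parse_page_count
}

# Example:
#   check_pdf_name "$@" || { echo "usage: $0 file.pdf" >&2; exit 1; }
#   require_tools pdfinfo pdftops psselect convert grep awk || exit 1
```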

Variation of the previous stages

Up to the point where we obtain the set of JPEG files, all commands have been executed in a unix-like environment. The following steps will be executed on a Windows system, so it would be sensible to do the entire process on a Windows machine. You have two possibilities:

1. The tools mentioned under [1, 2, 3] are available as compiled binaries for Windows, although I can't tell whether the binaries for psutils are up to date. I didn't try them. You could install them and write an equivalent script that automates steps 1 to 3 for the cmd command prompt.

2. You could install Cygwin, which is a unix compatibility layer including its own collection of software packages. Please refer to [4] for documentation on how to install and use Cygwin. The good news is that Cygwin's software distribution contains all the software needed for steps 1 to 3. But please note that I didn't test the script in a Cygwin environment. It should work, though.

Steps 4 to 6

I can only give very general information about steps 4 and 5 because in the present case they rely on non-free software and for each program there are numerous alternatives.

4. The set of JPEG files was transferred to a Windows system that had OCR software installed. In this case it was a program that came with a scanner as bonus software. This is an example of commercial OEM software you pay for indirectly. The software was able to produce a Word document from the set of JPEG files. How this happens exactly depends entirely on the software you have on your system.

Also, the process of generating the editable document may vary with a given piece of OCR software. This is because the process may require user intervention during character recognition and there may also be several options regarding the formatting and other properties of the original document that will appear in the target document.

5. Here, essentially the same remarks apply as for step 4. The only thing we assume is that you start with a Word document, translate it with your favorite CAT tool, and end up with a Word document as well.

6. The last step consists of converting the document back to its original format, namely PDF. We are again going to use free tools to do this. The Windows system I used had two tools which are straightforward to install and to use: Ghostscript and GSview, see [5]. In addition, at least one printer driver for a PostScript-capable printer must be installed on your system.

With this setup, having your translation open in Word, go to the 'print ...' item in the file menu. In the dialog select the option that prints the document not to the printer but to a file. This file will have the extension prn. Next, start GSview, open the prn file and select the 'convert' item in the file menu. In the dialog select the PDF writer as output device and convert.
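The same conversion can also be run without the GSview dialog by calling Ghostscript directly with its pdfwrite output device. The sketch below is an illustration under assumptions: prn_to_pdf and the GS override variable are hypothetical names, the command is written for a unix-like shell such as the one Cygwin provides, and the Ghostscript executable is assumed to be called gs (on Windows the console executable is typically named gswin32c or gswin64c).

```shell
# Sketch: convert a PostScript print file to PDF with Ghostscript.
# prn_to_pdf and GS are illustrative names only.

# prn_to_pdf INPUT OUTPUT: write OUTPUT as PDF from the PS data in INPUT.
# Set GS to the name of the Ghostscript executable on your system.
prn_to_pdf() {
    "${GS:-gs}" -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
        -sOutputFile="$2" "$1"
}

# Example: prn_to_pdf mydoc.prn mydoc.pdf
```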

Variation of the previous stages

The earlier section of the same name indicated ways to do the whole conversion process on Windows systems. It would also be interesting to do the whole process on a unix-like system using only free software. I have to admit that I haven't even looked for OCR software that runs on unix-like systems. As for CAT tools, there is a list on Wikipedia, see [6], that also includes free alternatives for unix-like systems.

Conclusion

The workflow described above is a complete PDF-to-PDF round trip via a number of transformation steps, each of which can be adapted to your needs.

References

[1] xpdf: http://www.foolabs.com/xpdf

[2] psutils: http://www.tardis.ed.ac.uk/~ajcd/psutils

[3] ImageMagick: http://www.imagemagick.org

[4] Cygwin: http://www.cygwin.com

[5] Ghostscript, GSview: http://pages.cs.wisc.edu/~ghost

[6] Computer-assisted translation (Wikipedia): http://en.wikipedia.org/wiki/Computer-assisted_translation

