Abstract
This article describes a workflow for translating text documents which
are only available in the PDF format. A heterogeneous set of tools is
used, consisting of free and non-free software. The workflow is ad
hoc and it is meant to illustrate what you can do, not what you should
do.
Motivation
PDF is a format intended for publication, not for editing. Therefore
some well-known systems for Computer Aided Translation don't support
PDF as a source format. But sometimes the PDF version of a document
is all you have and you need to transform it somehow to a format that
your CAT tool accepts. After the translation, the resulting
document is usually editable, which may not be what you want, so you
need a way to transform it back to PDF.
Assumptions about the reader
To profit from reading this article you should already know or be
willing to learn what a command shell is and how to issue commands on
its command prompt / command line. Ideally you have already written a
couple of shell scripts. Part of the article is a script for the bash
command shell which is common on unix-like systems. In this article I
only provide background information about the specific tools involved
in the workflow. If you feel you need more information about general
tools like operating systems or command shells, please look it up in
places like Wikipedia or via a search engine like Google.
Copyright issues
In what follows we assume that you are the author of the PDF document
you want to translate or that you have the author's permission to do
so.
Workflow overview
The described workflow illustrates only one possible way to translate
PDF documents and I don't present it as recommended practice for several
reasons. The first reason is that I did the whole process on two
kinds of systems, namely a unix-like operating system and a Microsoft
Windows operating system. This is because some of the tools were
available only on the one, and others only on the other platform. The
second reason is that although some of the tools are available under a
free software license, for others you will have to pay directly or
indirectly. So it may be desirable for you to find variations of the
workflow which allow you to do the entire process on only one platform
or using only free software.
The workflow I'm going to describe is as follows:
1. The PDF source document is transformed to the PS format.
2. From the PS document the individual pages are extracted.
3. The individual pages are converted to JPEG images.
4. OCR software in batch mode produces a Microsoft Word document
from the JPEG images.
5. You translate the Word document with your CAT tool.
6. The target document is transformed back to the PDF format.
One further remark is in order. You may think that the indirection
via conversion to images and via OCR isn't necessary as there are free
tools that convert PDF to HTML which is a common source format for
some CAT tools. This may be true for documents consisting entirely of
continuous text, but there are also elements whose structure is not
properly transformed or is lost completely, like programming language
source code listings or tables. So the described workflow assumes
worst case documents that require OCR.
Detailed description
I did the first three steps on a GNU/Linux system and steps 4 to 6 on
a Windows system just because the software I used was already
installed on these systems. In the examples that follow the source
document is called mydoc.pdf.
Steps 1 to 3
The three tools needed for the first three steps are part of three
different software packages: xpdf, psutils and ImageMagick. The
following three sections present the tools and their basic usage on
the command line. After this I will show how
to automate this part of the conversion with a simple shell script.
1. As a first step the source document is transformed from the
Portable Document Format (PDF) to the PostScript format (PS). Like
PDF, PS is not intended for editing documents, but we do this
conversion because the next tool we need works with PS
documents. The conversion to PS is done like this on the command
line:
pdftops mydoc.pdf
This command produces a PS document with the same name but with the
filename extension ps. The pdftops command is part of the xpdf
software package, see [1].
2. Next we need to extract the individual pages of the PS document.
This step will produce a set of further PS documents, each of which
will contain a single page of the source document. To extract the
first page of the source document on the command line and store it in
an individual PS document we use the psselect command like this:
psselect -p1 mydoc.ps mydoc1.ps
where 1 is the number of the page and mydoc1.ps is the name of the
generated document that will contain the extracted page. You will
have noticed that doing this step over and over again for a document
of many pages is tedious and needs to be automated. The command
psselect is part of the psutils software package,
see [2].
3. By now we have a set of PS documents containing individual pages.
In the next step we want to convert each of them to a format that can
be read by optical character recognition (OCR) software. Here I
will target the JPEG format. To convert one PS document from the set
we use the convert command like this:
convert -density 300 mydoc1.ps mydoc1.jpg
where the density option specifies the resolution in dots per inch.
Here we run into the same problem as in step 2: the conversion of the
whole set of source documents needs to be automated unless you only
want to convert a very small number of pages. The convert command is
part of the ImageMagick software package,
see [3].
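To get a feeling for what the density value means, the pixel size of the resulting image can be estimated from the page size. The following is only a rough sketch: the function name points_to_pixels is made up for illustration, it assumes that the page size is given in PostScript points (1/72 inch), and the size ImageMagick actually produces may differ by a pixel because of rounding.

```shell
# Convert a length in PostScript points (1/72 inch) to pixels at a
# given density in dots per inch, rounded to the nearest integer.
# points_to_pixels is a hypothetical helper, not part of ImageMagick.
points_to_pixels() {
    local points="$1" dpi="$2"
    echo $(( (points * dpi + 36) / 72 ))
}

# An A4 page measures 595 x 842 points.
points_to_pixels 595 300   # approximate width in pixels
points_to_pixels 842 300   # approximate height in pixels
```

At 300 dots per inch an A4 page therefore becomes an image of roughly 2479 by 3508 pixels. A higher density gives the OCR software more detail to work with but also produces larger files.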
Automating the previous steps
For documents with more than a handful of pages steps 2 and 3 are
tedious to execute. Furthermore the individual JPEG image files need
to be numbered in a way that preserves the ordering of the pages in
the original document. That means for a document with 1000 pages we
don't want filenames like mydoc1.jpg, mydoc11.jpg, mydoc111.jpg
etc. because, for example, when you list the content of the directory
containing the files, mydoc11.jpg will be listed after mydoc100.jpg.
This is crucial in later steps where other tools that process all
files / pages in a directory assume that the order in the directory
represents the order of the pages in the document. I have provided
a bash script that takes care of this
problem as well. It creates filenames where the numeric part is
padded with leading zeros so that every page number has as many digits
as the total number of pages in the document. For a document with
1000 pages, page 11 is stored as mydoc0011.jpg.
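The padding rule can be sketched in a few lines of shell code. This is an illustration only, not an excerpt from the script mentioned above; the function name padded_name and its three arguments are assumptions made for this example.

```shell
# Print the JPEG filename for one page. The page number is padded with
# leading zeros to the number of digits of the total page count.
# padded_name is a hypothetical helper written for this illustration.
padded_name() {
    local base="$1" page="$2" total="$3"
    local width=${#total}          # 1000 pages -> 4 digits
    printf "%s%0${width}d.jpg\n" "$base" "$page"
}

padded_name mydoc 1 1000     # prints mydoc0001.jpg
padded_name mydoc 11 1000    # prints mydoc0011.jpg
padded_name mydoc 100 1000   # prints mydoc0100.jpg
```

With names like these an alphabetical directory listing shows the files in page order.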
Please note that this script works on my system and its particular
setup, but may not, for some reason, work on yours, where 'some
reason' could be anything. So if you want to use the script, review
it first, adapt it to your working environment and if you don't know
how to do that, don't execute it and take it instead as mere
illustrative code.
That said, please note also that the script has been written for a
unix-like environment. I executed it under a bash shell but didn't
test it with other shells. The script expects exactly one argument
which should be the name of a PDF file in the same directory. The
filename must end with '.pdf' and the whole name must not contain
whitespace, i.e. don't use it with filenames like "How to translate
PDF documents.pdf". The script assumes that a number of other tools
are available on your system, like pdfinfo ([1]), grep, awk etc.
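To give an idea of how steps 1 to 3 can be chained together, here is a minimal sketch that follows the description above. It is not the original script. The function names page_count and pdf_to_jpegs are made up for this example, the density of 300 is taken from step 3, and the sketch assumes that pdfinfo reports the number of pages on a line starting with 'Pages:'. Review it and adapt it before using it on your own files.

```shell
# Read the output of pdfinfo on standard input and print the page count.
# (hypothetical helper; assumes a line of the form "Pages:   12")
page_count() {
    grep '^Pages:' | awk '{print $2}'
}

# Steps 1 to 3 for a single PDF file in the current directory.
# (hypothetical function name, not the original script)
pdf_to_jpegs() {
    # Expect exactly one argument that ends in .pdf and has no whitespace.
    if [ "$#" -ne 1 ]; then
        echo "usage: pdf_to_jpegs FILE.pdf" >&2
        return 1
    fi
    case "$1" in
        *[[:space:]]*) echo "filename must not contain whitespace" >&2; return 1 ;;
        *.pdf) ;;
        *) echo "filename must end with .pdf" >&2; return 1 ;;
    esac
    [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }

    local pdf="$1"
    local base="${pdf%.pdf}"
    local pages width page name

    pages=$(pdfinfo "$pdf" | page_count)
    [ -n "$pages" ] || { echo "could not determine page count" >&2; return 1; }
    width=${#pages}                       # digits needed for padding

    pdftops "$pdf" "$base.ps" || return 1                         # step 1
    page=1
    while [ "$page" -le "$pages" ]; do
        name=$(printf "%s%0${width}d" "$base" "$page")
        psselect -p"$page" "$base.ps" "$name.ps" || return 1      # step 2
        convert -density 300 "$name.ps" "$name.jpg" || return 1   # step 3
        page=$((page + 1))
    done
}
```

After loading these definitions into a bash session, a call like `pdf_to_jpegs mydoc.pdf` should leave one JPEG file per page next to the source document, together with the intermediate PS files.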
Variation of the previous stages
Up to this point where we obtain the set of JPEG files all commands
have been executed in a unix-like environment. The following steps
will be executed on a Windows system. So it would be sensible to do
the entire process on a Windows machine. You have two possibilities:
1. The tools mentioned under [1, 2, 3] are available as compiled
binaries for Windows, although I can't tell if the binaries for
psutils are up to date. I didn't try them. You could try to install
them and write an equivalent script that automates steps 1 to 3 for the
cmd command prompt.
2. You could install Cygwin, which is a unix compatibility layer
including its own collection of software packages. Please refer to
[4] for documentation on how to
install and use Cygwin. The good news is that Cygwin's software
distribution contains all the software needed for steps 1 to 3. But
please note that I didn't test the script in a Cygwin environment. It
should work, though.
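Whichever environment you choose, it may help to check first that the required commands can be found. The following sketch does this with the shell's own `command -v`; the function name missing_tools is an assumption made for this example, and the list of tools is simply the one mentioned in this article.

```shell
# Print the names of those commands that are not found on the PATH.
# missing_tools is a hypothetical helper written for this illustration.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the tools used in steps 1 to 3.
missing_tools pdftops pdfinfo psselect convert grep awk
```

If the last command prints nothing, all of the listed tools are available. Any name it does print belongs to a tool that still has to be installed.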
Steps 4 to 6
I can only give very general information about steps 4 and 5 because
in the present case they rely on non-free software and for each
program there are numerous alternatives.
4. The set of JPEG files was transferred to a Windows system that had
OCR software installed. In this case it was a program
that came with a scanner as bonus software. This is an example of
commercial OEM software you pay for indirectly. The software was able
to produce a Word document from the set of JPEG files. How this
happens exactly depends entirely on the software you have on your
system.
Also, the process of generating the editable document may vary with a
given piece of OCR software. This is because the process may require
user intervention during character recognition and there may also be
several options regarding the formatting and other properties of the
original document that will appear in the target document.
5. Here, essentially the same remarks apply as for step 4. The only
thing we assume here is that you start with a Word document, translate
it with your favorite CAT tool and also end up with a
Word document.
6. The last step consists in converting the document back to its
original format, namely PDF. We are also going to use a free tool to
do this. The Windows system I used had two tools which are
straightforward to install and to use: Ghostscript and GSview,
see [5]. In addition
there must be at least one printer driver for PS capable printers
installed on your system.
With this setup, having your translation open in Word, go to the
'print ...' item in the file menu. In the dialog select the option
that prints the document not to the printer but to a file. This file
will have the extension prn. Next, start GSview, open the prn file
and select the 'convert' item in the file menu. In the dialog select
the PDF writer as output device and convert.
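If you prefer the command line, Ghostscript also ships a small wrapper called ps2pdf that converts a PostScript file to PDF. This is not part of the workflow described above; it is a sketch that assumes the prn file contains PostScript and that ps2pdf is on your PATH, for example on a unix-like system or under Cygwin. The function names pdf_name_for and prn_to_pdf are made up for this example.

```shell
# Derive the name of the PDF file from the name of the print file.
# (hypothetical helper)
pdf_name_for() {
    printf '%s.pdf\n' "${1%.*}"
}

# Convert a PostScript print file to PDF with Ghostscript's ps2pdf.
# (hypothetical wrapper; ps2pdf itself is part of Ghostscript)
prn_to_pdf() {
    ps2pdf "$1" "$(pdf_name_for "$1")"
}
```

A call like `prn_to_pdf translation.prn` would then write translation.pdf to the same directory.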
Variation of the previous stages
The earlier section of the same name indicated possibilities to do the
whole conversion process on Windows systems. It would also be
interesting to do the whole process on a unix-like system using only
free software. I have to admit that I haven't even looked for OCR
software available under unix-like systems. As for CAT tools there is
a list on Wikipedia,
see [6],
that also includes free alternatives for unix-like systems.
Conclusion
The previous workflow is a complete PDF to PDF conversion roundtrip
via a number of transformation steps which can be adapted to your
needs.
References
[1] xpdf: http://www.foolabs.com/xpdf
[2] psutils: http://www.tardis.ed.ac.uk/~ajcd/psutils
[3] ImageMagick: http://www.imagemagick.org
[4] Cygwin: http://www.cygwin.com
[5] Ghostscript, GSview: http://pages.cs.wisc.edu/~ghost
[6] Wikipedia, Computer-assisted translation: http://en.wikipedia.org/wiki/Computer-assisted_translation