RE: JRC-ACQUIS Multilingual Parallel Corpus
Thanks Dragomir...
After looking into this a bit more, there are two data sets over there: one is the TMX one (the DGT Translation Memory) and the other is the JRC-Acquis, which is the one I was referring to.
I went with this choice because the page @ http://langtech.jrc.it/JRC-Acquis.html said:
What is the difference between the DGT Translation Memory and the JRC-Acquis?
The two resources are rather similar in nature as they are both based on the Acquis Communautaire, but they are not identical and can both serve different purposes. The main differences are the following:
- The collection of documents of both resources should mostly be the same, but they are not identical as both resources were collected in different ways. None of the resources is exactly equivalent to the Acquis Communautaire. The criteria for the collection of the JRC-Acquis were rather loose (all documents were collected which were available in at least ten languages of which at least three 'new' EU languages) so that the JRC-Acquis is bigger.
That said, the automated tool seems to work only with the DGT TM.
With some ugly hacking, I've been able to use a script to get the terms into TMX format. I also have the data as two aligned files, one source and one target, each with one term per line.
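For anyone wanting to try the same thing, here is a minimal sketch of that kind of conversion, not the actual script. It assumes two UTF-8 files that are aligned line by line, and it writes several smaller TMX files instead of one big one. The function name, the file names, the language codes and the chunk size are all made up for illustration, and whether smaller files are any easier for a given TM tool to import is an untested assumption.

```python
import itertools
from xml.sax.saxutils import escape

# Minimal TMX 1.4 header; the creationtool name is a placeholder.
TMX_HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<tmx version="1.4">\n'
    '  <header creationtool="aligned2tmx" creationtoolversion="0.1" '
    'segtype="sentence" o-tmf="plain" adminlang="en" '
    'srclang="{src}" datatype="plaintext"/>\n'
    '  <body>\n'
)
TMX_FOOTER = '  </body>\n</tmx>\n'


def write_tmx_chunks(src_lines, tgt_lines, src_lang, tgt_lang,
                     out_prefix, chunk_size=50000):
    """Write aligned source/target lines as numbered TMX files.

    src_lines and tgt_lines can be open file objects, so the input is
    streamed; only one chunk of pairs is held in memory at a time.
    Returns the list of file paths written.
    """
    # zip stops at the shorter input, so a length mismatch is silently cut off.
    pairs = zip(src_lines, tgt_lines)
    paths = []
    for index in itertools.count(1):
        chunk = list(itertools.islice(pairs, chunk_size))
        if not chunk:
            break
        path = f"{out_prefix}_{index:03d}.tmx"
        with open(path, "w", encoding="utf-8") as out:
            out.write(TMX_HEADER.format(src=src_lang))
            for src, tgt in chunk:
                src, tgt = src.strip(), tgt.strip()
                if not src or not tgt:
                    continue  # skip pairs with an empty side
                # escape() handles &, < and > so the XML stays well-formed.
                out.write(
                    '    <tu>\n'
                    f'      <tuv xml:lang="{src_lang}"><seg>{escape(src)}</seg></tuv>\n'
                    f'      <tuv xml:lang="{tgt_lang}"><seg>{escape(tgt)}</seg></tuv>\n'
                    '    </tu>\n'
                )
            out.write(TMX_FOOTER)
        paths.append(path)
    return paths


# Hypothetical usage with placeholder file names and language codes:
# with open("source.txt", encoding="utf-8") as s, \
#         open("target.txt", encoding="utf-8") as t:
#     write_tmx_chunks(s, t, "en", "fr", "acquis", chunk_size=50000)
```

Each output file is a complete TMX document on its own, so the files can be imported one at a time.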
The problem is that no TM tool I have found can deal with a 100 MB text file. Deja Vu crashes right away. Stingray just runs forever; I killed it after 3 hours.
I looked at OmegaT, but that crashes too.
And Oliphant looks nice, but I cannot for the life of me figure out how to import two aligned files.
Does anyone know of a tool that can handle large text files for importing?
Thanks,
John