a mechanism for automated translation..

For the discussion of language mechanics, grammar, vocabulary, trends, and other such linguistic topics, in english and other languages.

Moderators: gmalivuk, Moderators General, Prelates

inhahe
Posts: 59
Joined: Sun Feb 22, 2009 11:16 pm UTC

a mechanism for automated translation..

Postby inhahe » Thu Mar 26, 2009 5:22 am UTC

Here's a copy of a post I just blogged. (or is that a blog I just posted?)

Automatic Translation

I've yet to see an automated translation service that doesn't totally suck.

One idea for a very effective translator would be to feed a self-teaching AI program tons and tons of documents that already have existing translations, and have it automatically generate rules for proper translation. This would *automatically* accommodate correct grammar, loose grammar, idioms, jargon, etc. It would require *a lot* of computing power, but it only has to be done *once*. Training documents can be found in existing corpora, or translated by hand specifically for the project. Two possible ways to generate rules would be genetic algorithms, or some sort of exhaustive search over possible rule formulations (this could be bootstrapped with various types of data, for example a word-sense->part-of-speech key and a word-sense->popularity-of-use key). Incidentally, a week after I had this idea I heard a couple of people were working on just such a project, but I've yet to see the fruits of their work anywhere.
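To make the learning idea concrete, here is a toy sketch (the four-pair English/Spanish corpus is invented, and real systems need vastly more data and smarter models) of inducing word-level translation "rules" purely from co-occurrence counts over parallel sentence pairs:

```python
from collections import defaultdict

def learn_word_rules(parallel_pairs):
    """Count how often each source word co-occurs with each target word
    across sentence pairs, then keep the most frequent pairing as a rule."""
    cooc = defaultdict(lambda: defaultdict(int))
    for src, tgt in parallel_pairs:
        for s in src.lower().split():
            for t in tgt.lower().split():
                cooc[s][t] += 1
    # For each source word, the target word it co-occurred with most often.
    return {s: max(tgts, key=tgts.get) for s, tgts in cooc.items()}

# Tiny invented English/Spanish corpus, purely for illustration.
corpus = [
    ("the cat", "el gato"),
    ("the dog", "el perro"),
    ("a cat", "un gato"),
    ("a dog", "un perro"),
]
rules = learn_word_rules(corpus)  # e.g. rules["cat"] -> "gato"
```

Even this crude counting recovers the right word pairings from four sentences, because each source word co-occurs with its true translation more often than with anything else; the real difficulty is everything this sketch ignores (word order, multi-word phrases, morphology).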

Rather than determining rules for translating to and from each possible combination of two languages, it's probably best to come up with *one* language that all languages can be translated to/from with no loss. Just making a collation of linguistic categories for words and clauses in each known language, and using these in an X-bar kind of structure, should be enough. Any given language would be translated to this intermediate language, and then from that to the target language. This greatly reduces the cost of furnishing texts for the learning algorithm and of running it. This intermediate language's lexicon should be the superset of all the senses of all the words of the input languages, but with words identical in grammatical function and alike in meaning grouped into synsets, where each word in a synset is linked to each other word in the synset with a particular weight: their level of similarity (this may have to be done by hand). A word in a source text would, via a table, point to a word in some synset (if the word has any synonyms), and then the closest word to that (weight-wise), or the word itself, that some word in the target language points to, would be used.

A problem arises when a language possessing a certain tense/aspect/modality is translated to a language that lacks it. Possible solutions are to compromise and translate to a similar tense/aspect/modality that gets the point across, or to rearrange the semantics of the sentence entirely in the resultant text. This should not be too difficult, given that the algorithm fully groks the grammatical structure of the sentence in the first place. Similarly, some words won't exist in all languages. They can be handled by using a similar-enough word that won't cause too much misunderstanding, or by substituting a phrase for the given word (or, in some languages, possibly using agglutination).
Obviously I'm not implying that the semantic rearranging or phrase substitution would be wholly "figured out" by the translator; it would rely on a pre-programmed (or self-learned, via particular patterns found in the training texts) ruleset for such rearrangements. "Similar-enough" words could be implemented using a weight mechanism just like the one used within a synset, but applied across synsets/non-synonyms. (In fact, we might as well do away with a categorical treatment of synonym sets altogether.) Just enough vague linkages have to be drawn to accommodate all combinations of source/target languages. For the sake of laziness, perhaps unlimited-length chains of weight-linkages could be used when necessary. I suppose this requires a function for generating an overall priority value based on X number of Y-valued weights. For example, would we favor an a--(1)-->b--(2)-->c link, or an a--(3)-->d link? (1 means highest priority, because there is no bottom limit.) In this case, it would do to specify weights as arbitrary decimals rather than simply as orders of priority.
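The chain-of-weights lookup is essentially a shortest-path search. A minimal sketch, assuming the overall priority of a chain is the *sum* of its link weights and lower totals win (so the a--(1)-->b--(2)-->c chain and the a--(3)-->d link from the example above tie at 3):

```python
import heapq

def closest_target_word(graph, source_word, target_vocab):
    """Dijkstra over the cross-synset similarity graph.  Lower weight =
    closer in meaning; a chain's cost is the sum of its link weights.
    Returns the reachable word in the target language's vocabulary with
    the smallest total cost, or None if none is reachable."""
    dist = {source_word: 0.0}
    heap = [(0.0, source_word)]
    while heap:
        d, w = heapq.heappop(heap)
        if d > dist.get(w, float("inf")):
            continue  # stale heap entry
        if w in target_vocab:
            return w, d  # first target word settled has minimal total weight
        for nbr, weight in graph.get(w, []):
            nd = d + weight
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return None

# Hypothetical weight links mirroring the a--(1)-->b--(2)-->c vs a--(3)-->d case.
graph = {
    "a": [("b", 1.0), ("d", 3.0)],
    "b": [("c", 2.0)],
}
```

One nice property: if the source word itself exists in the target language, Dijkstra returns it immediately at cost 0, which matches the "or the word itself" case above.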

We could effectively have myriad ready-made translation texts available for training in this one-language approach by creating a pretty darn good English<->the-one-language translator, and then using texts translated to and from English (it probably being the most widely translated language), with the English half being converted to/from the one language for the purposes of training. It remains an open question how much trouble we should go through, if any, to make the program aware of whether a given training pair was actively translated from X to Y or from Y to X. This goes for the non-one-language approach too.

Machine learning may not be necessary: humans could perhaps construct perfectly comprehensive BNF notations (including idioms) and use a GLR parser, but I don't know how well this would handle the (not-so-atypical) taking of grammatical liberties. If this approach is taken, the machine should obviously be programmed to understand affixes, so that base words and inflections can be deduced for inflected words that aren't specifically in any dictionary. Another possible adaptation could be Damerau–Levenshtein distance or similar, to account for typos, misspellings, spelling variants, and OCR errors. A list of commonly misused words might also be helpful, though maybe not.
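For reference, here is a standard dynamic-programming implementation of the restricted Damerau–Levenshtein distance (the "optimal string alignment" variant), which counts insertions, deletions, substitutions, and transpositions of adjacent characters, each at cost 1:

```python
def damerau_levenshtein(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # deleting i characters from a
    for j in range(len(b) + 1):
        d[0][j] = j  # inserting j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

The transposition case is what makes it well suited to typos: "teh" -> "the" and "recieve" -> "receive" each cost 1 instead of the 2 that plain Levenshtein would charge.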

One trick for this translating could be to resolve ambiguous meanings, or connotations, of words in a sentence based on surrounding sentences: if the word is used in such a way in a surrounding sentence that it definitely, or probably, means this or that, then we can infer that it probably means the same in the given sentence, too. It could even be determined (by the given sentence or by a surrounding sentence) based on pattern recognition afforded by the training process. (This may even include subtle and holistic inferences.) Semantic resolution can go both ways: grammar can help determine the sense of a word, and a known word sense can help determine the grammar of a sentence. Connotation inferences (whether done as such, or effectively for consideration purposes but not tokenized, per se, on that level) can even help pick the most germane translation synonym. We *may* even want to layer this cross-sentence meaning-resolution by paragraph, chapter, document, author/source, and/or genre, but that's probably overkill beyond a sentence-level tier and a document-level tier. Actually, genre and source seem good too. And I guess a sub-sentence-level tier could be relevant as well (because a word could be used twice in the same sentence), but it would be treated a little differently of course, since individual syntax trees (generally) start and end at the sentence level.
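The surrounding-sentence trick is roughly what the classic Lesk algorithm does with dictionary glosses: score each candidate sense by how many words its gloss shares with the context. A minimal sketch, with hypothetical glosses for the ambiguous word "bank":

```python
import re

def tokens(text):
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def disambiguate(context_sentences, sense_glosses):
    """Pick the sense whose gloss shares the most words with the
    surrounding sentences (a simplified Lesk-style overlap)."""
    context = set()
    for sent in context_sentences:
        context |= tokens(sent)
    return max(sense_glosses,
               key=lambda s: len(context & tokens(sense_glosses[s])))

# Hypothetical glosses for "bank" (invented for illustration).
glosses = {
    "riverbank": "sloping land beside a river or stream",
    "bank_institution": "an institution that accepts deposits and lends money",
}
sense = disambiguate(
    ["We spread a blanket on the grass beside the river.",
     "The bank was muddy after the rain."],
    glosses,
)
```

Here the surrounding sentence mentioning "river" and "beside" overlaps the riverbank gloss, so that sense wins even though nothing in the sentence containing "bank" itself settles the question.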

People can arbitrarily create new words on the fly in an agglutinating language. These would be hard for a translator to automatically replace with defining phrases... but it would be easy to simply use a form of pseudo-agglutination in the given target language: for example, if "poltergeist" weren't already a well-known and established word, it could be translated into English as "rumble-ghost." Perhaps a little awkward, but I think it's pretty effective for getting the point across.
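The "rumble-ghost" trick can be sketched as greedy compound splitting over a morpheme-to-gloss table (the two-entry German lexicon here is purely illustrative, and real compound splitting needs linking elements and ambiguity handling):

```python
def gloss_compound(word, morpheme_glosses):
    """Greedily split a compound into known morphemes (longest prefix
    first), gloss each part, and re-join with hyphens: pseudo-agglutination."""
    word = word.lower()
    parts = []
    while word:
        for end in range(len(word), 0, -1):
            if word[:end] in morpheme_glosses:
                parts.append(morpheme_glosses[word[:end]])
                word = word[end:]
                break
        else:
            return None  # unglossable remainder: give up rather than guess
    return "-".join(parts)

# Toy German morpheme lexicon, invented for this example.
lexicon = {"polter": "rumble", "geist": "ghost"}
```

With this table, `gloss_compound("Poltergeist", lexicon)` yields "rumble-ghost", and a compound containing an unknown morpheme is left untranslated rather than mangled.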

stolid
Posts: 167
Joined: Mon Sep 15, 2008 3:18 am UTC
Location: 25th state

Re: a mechanism for automated translation..

Postby stolid » Thu Mar 26, 2009 8:29 am UTC

In my brain's many idle cycles, I've also thought about this. I'd love to work on an AI language system, but I have no idea where to start. It would definitely be the best way to translate (except for a fluent human, of course). A learning translator would reduce the amount of work to do and keep up with all the changes in languages.
Registered Linux User #555399

gmalivuk
GNU Terry Pratchett
Posts: 26826
Joined: Wed Feb 28, 2007 6:02 pm UTC
Location: Here and There

Re: a mechanism for automated translation..

Postby gmalivuk » Thu Mar 26, 2009 3:09 pm UTC

Please don't just post copypasta from your own blogs, especially if it is a wall of text that doesn't seem to ask discussion questions at all.
Unless stated otherwise, I do not care whether a statement, by itself, constitutes a persuasive political argument. I care whether it's true.
---
If this post has math that doesn't work for you, use TeX the World for Firefox or Chrome

(he/him/his)

kirkedal
Posts: 22
Joined: Fri Jun 26, 2009 1:52 pm UTC

Re: a mechanism for automated translation..

Postby kirkedal » Fri Jun 26, 2009 2:18 pm UTC

Some good points in your blog/post.

The concept of "one language to rule them all" is often called an interlingua, and it's something rule-based machine translation research is working toward. The problem is that an interlingua has to express concepts independent of any specific language, and currently logic is used to express these concepts. However, to my mind there are some issues with expressing something that should be language-independent using language.

A problem with the machine learning approach is - as you probably know - that texts are translated with regard to meaning, not by word-to-word, phrase-to-phrase, or sentence-to-sentence correlation. Machine learning would require parallel texts with sentence alignment to create training data, and there are precious few of those.
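For what it's worth, sentence alignment itself can be approximated surprisingly well from sentence lengths alone. A stripped-down sketch in the spirit of the Gale–Church aligner (the gap penalty is an arbitrary assumption, and real aligners model length ratios probabilistically and allow 2-1 merges):

```python
def align_sentences(src, tgt, gap_penalty=25):
    """Dynamic-programming alignment of two sentence lists by character
    length.  Allows 1-1 matches plus skipped sentences on either side;
    returns the (i, j) index pairs of 1-1 matches."""
    INF = float("inf")
    n, m = len(src), len(tgt)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):          # forward relaxation: all moves go
        for j in range(m + 1):      # to later (i, j), so this order works
            if cost[i][j] == INF:
                continue
            if i < n and j < m:     # match src[i] with tgt[j]
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1] = c
                    back[i + 1][j + 1] = (i, j, "match")
            if i < n:               # leave src[i] unmatched
                if cost[i][j] + gap_penalty < cost[i + 1][j]:
                    cost[i + 1][j] = cost[i][j] + gap_penalty
                    back[i + 1][j] = (i, j, "skip_src")
            if j < m:               # leave tgt[j] unmatched
                if cost[i][j] + gap_penalty < cost[i][j + 1]:
                    cost[i][j + 1] = cost[i][j] + gap_penalty
                    back[i][j + 1] = (i, j, "skip_tgt")
    pairs, i, j = [], n, m          # trace the cheapest path back
    while back[i][j] is not None:
        pi, pj, op = back[i][j]
        if op == "match":
            pairs.append((pi, pj))
        i, j = pi, pj
    return list(reversed(pairs))
```

The insight behind Gale–Church is simply that translations of long sentences tend to be long and translations of short sentences short, so length mismatch is a usable alignment cost even with zero lexical knowledge.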

inhahe
Posts: 59
Joined: Sun Feb 22, 2009 11:16 pm UTC

Re: a mechanism for automated translation..

Postby inhahe » Sat Jun 27, 2009 2:14 am UTC

kirkedal wrote:Some good points in your blog/post.

The concept of "one language to rule them all" is often called an interlingua, and it's something rule-based machine translation research is working toward. The problem is that an interlingua has to express concepts independent of any specific language, and currently logic is used to express these concepts. However, to my mind there are some issues with expressing something that should be language-independent using language.

A problem with the machine learning approach is - as you probably know - that texts are translated with regard to meaning, not by word-to-word, phrase-to-phrase, or sentence-to-sentence correlation. Machine learning would require parallel texts with sentence alignment to create training data, and there are precious few of those.


Oh, I see what you mean about needing sentence alignment... good point!

6453893
Posts: 557
Joined: Wed Dec 13, 2006 2:40 am UTC
Location: Australia

Re: a mechanism for automated translation..

Postby 6453893 » Sat Jun 27, 2009 4:23 am UTC

A machine will not be capable of properly translating a text until it is capable of writing one.

sparks
Posts: 119
Joined: Sat May 17, 2008 7:24 pm UTC

Re: a mechanism for automated translation..

Postby sparks » Sat Jun 27, 2009 6:03 pm UTC

The thing is that even a very powerful AI machine would not be able to translate everything. I've seen books where references to American pop culture (or any other country's; that's just an example) aren't quite handled. An AI system would have to know not to translate those literally, but rather to keep the original with a footnote, or to translate them in context. Besides, some words have so many possible meanings that the system would have to process the general context and look it up in the reference texts, which are unlikely to contain the exact same combination for many words. That, and I believe a human touch is important for keeping the tone of a text consistent (this probably doesn't include business texts, actually). That's probably one of the biggest challenges usually faced by translators, for example when translating an Irvine Welsh or Ernest Hemingway novel.
(icon by clockwork-harlequin.net)
"An idea that is not dangerous is unworthy of being called an idea at all." ~ Oscar Wilde

Roĝer
Posts: 445
Joined: Sun Oct 05, 2008 9:36 pm UTC
Location: Many worlds, but mostly Copenhagen.

Re: a mechanism for automated translation..

Postby Roĝer » Sun Jun 28, 2009 10:22 pm UTC

Some useful ideas, but really, if you are going to pick one language please let it not be English. Way, way too much ambiguity. Pick one that relies more on inflection than syntax, or use a constructed language that removes as much ambiguity as possible.

Besides that, what would be the difference between this method and statistical machine translation?
Ik ben niet koppig, ik heb gewoon gelijk.

kirkedal
Posts: 22
Joined: Fri Jun 26, 2009 1:52 pm UTC

Re: a mechanism for automated translation..

Postby kirkedal » Mon Jun 29, 2009 9:17 am UTC

Roĝer wrote:Some useful ideas, but really, if you are going to pick one language please let it not be English. Way, way too much ambiguity. Pick one that relies more on inflection than syntax, or use a constructed language that removes as much ambiguity as possible.

Besides that, what would be the difference between this method and statistical machine translation?



Statistical MT does not use AI or reference texts. SMT relies on a large phrase table with associated probabilities for translation into a phrase in a given foreign language.
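A toy illustration of the phrase-table idea (monotone greedy decoding with invented probabilities, and English-to-English entries purely for readability; real SMT decoders search over segmentations and reorderings with a language model):

```python
def translate_greedy(sentence, phrase_table, max_len=3):
    """Monotone greedy decoding over a toy phrase table: at each position,
    take the longest known source phrase and its most probable translation;
    unknown words pass through unchanged."""
    words = sentence.lower().split()
    out, i = [], 0
    while i < len(words):
        for span in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + span])
            if phrase in phrase_table:
                options = phrase_table[phrase]
                out.append(max(options, key=options.get))  # most probable
                i += span
                break
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

# Invented phrase table with made-up probabilities.
table = {
    "kick the bucket": {"die": 0.7, "kick the bucket": 0.3},
    "the": {"the": 0.9},
    "old": {"old": 0.9},
    "man": {"man": 0.8},
}
```

Because "kick the bucket" is stored as a whole phrase with its own probabilities, the idiom translates as a unit rather than word by word, which is exactly the advantage phrase tables have over word-level translation.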

Also, when creating MT systems, it would be unusual to bypass a language such as English, which is probably the language most documents are translated into. Ambiguity can be handled by rule-based systems, and SMT ignores semantics but still produces the best results.

