## How to decode a code when I have pairs of (coded, decoded)?

A place to discuss the science of computers and programs, from algorithms to computability.

Formal proofs preferred.

Moderators: phlip, Moderators General, Prelates

jacques01
Posts: 42
Joined: Thu Oct 08, 2015 4:56 am UTC

### How to decode a code when I have pairs of (coded, decoded)?

Hi,

I have some Strings which I'd like to know if they are actually code for another String, e.g. the code is just a base 64 encoding of the actual String.

I am fortunate that I have pairs which I know the coded String and what it looks like decoded.

However, I don't know if the code is actually a code or just a complex key for a database lookup.

What algorithm/program can I run on my pairs of (code, decoded) to see if there's a way to transform

Code: Select all

`code`
into

Code: Select all

`decoded`
?

Here's a simple example where it's obvious what the "transformation" is. Pairs are (code, decoded). My goal is given a code, get the decoded String.

(gab, bag)
(god, dog)
(raboof, foobar)

So for this example the encoding rule is: Take the String s and reverse it's characters. To decode the encoded String c, reverse its characters.

Here is a less obvious example, where I want to know if such a transformation exists, or the coded String is just random (no mapping between the pairs):

(ABDBEF449F9F-FFF, Apple)
(27f7sf8sdfjfsdjsfd, Foobar)
(f2i39f9w9f99f2020, Turkey)

Note I just typed random characters for the coded String. Because without knowing anything, it appears the coded string is just random to me.

Supposing I can continuously sample new pairs (Coded, Decoded), what algorithm can I use that will tell me if there is a mapping/bijection between the two, or if there's no mapping with high probability?

Flumble
Yes Man
Posts: 2249
Joined: Sun Aug 05, 2012 9:35 pm UTC

### Re: How to decode a code when I have pairs of (coded, decoded)?

See this and related questions.

jacques01
Posts: 42
Joined: Thu Oct 08, 2015 4:56 am UTC

### Re: How to decode a code when I have pairs of (coded, decoded)?

Thank you. That seems very relevant, except my case seems to be a lot easier.

I can keep getting pairs (coded, decoded) without limit. That is, I can know the "ground truth." But I need to know if there's actually a mapping between the two, or it's just random (e.g. the "coded" is actually just a key to a database and not an encoding of the decoded).

So the algorithm should return a probability after given N pairs of (coded, decoded), seeing if there is a possible mapping.

Tub
Posts: 472
Joined: Wed Jul 27, 2011 3:13 pm UTC

### Re: How to decode a code when I have pairs of (coded, decoded)?

what algorithm can I use that will tell me if there is a mapping/bijection between the two

If you put it that way, most unix-like systems already have such an algorithm installed. It's in /usr/bin/yes.

Your original question is too ambiguous to have a precise answer, so it cannot be turned into an algorithm. Your first attempt at a precise formulation turned out to provide useless answers.
In this case, coming up with a question that is both precise and useful is incredibly difficult, and even if you could find one, I would guess that your question is either undecidable, or any algorithm that could answer it would run several orders of magnitude longer than you intend to wait. If you've ever suffered through a course of theoretical computer science, then you've probably seen a lot of proofs relevant to your problem.

Best you can do is follow the advice in the thread you've been linked: forget about having an algorithm, make some educated guesses and see if any of them pan out.

jacques01
Posts: 42
Joined: Thu Oct 08, 2015 4:56 am UTC

### Re: How to decode a code when I have pairs of (coded, decoded)?

Your original question is too ambiguous to have a precise answer, so it cannot be turned into an algorithm. Your first attempt at a precise formulation turned out to provide useless answers.

I am not asking for an exact "yes" or "no." I'm looking for an approximate/probabilistic solution. So it doesn't even have to return the correct answer. But it should get closer to the correct answer if I keep providing it with additional learning samples of (coded, decoded).

If you've ever suffered through a course of theoretical computer science, then you've probably seen a lot of proofs relevant to your problem.

I have but none of them cover non-deterministic / approximate / probabilistic algorithms. That is, algorithms which sacrifice the guarantee of "yes" or "no" with probability = 1.0 in exchange for being "tractable".

As an example of what I'm saying: I would solve the halting problem (which can't be solved) using machine learning to try to guess if a program terminates or not on a given input assuming I had training pairs or some kind of success function. It'd learn something, but of course can't ever be perfect.

Best you can do is follow the advice in the thread you've been linked: forget about having an algorithm, make some educated guesses and see if any of them pan out.

I did and unfortunately the programs I tried said the "coded" wasn't in any base encoding.

In this case, coming up with a question that is both precise and useful is incredibly difficult, and even if you could find one, I would guess that your question is either undecidable, or any algorithm that could answer it would run several orders of magnitude longer than you intend to wait

Well, this kind of problem is actually solved every day by human children. They receive sound waves as inputs and eventually learn to map these to a language's phonology and grammar. I'd argue their problem is even harder because nobody tells them the (coded, decoded) pairs. So, if babies can solve this, I don't see why a computer can't in a much easier situation.

Here, I have a solution which is probabilistic and may not even work, but it only is polynomial time using a sequential state model, e.g. HMM, CRF, whatever you want.

Here's my intuition of why my solution could work:

Let's suppose we need to do a grapheme to phoneme conversion. That is, given a word we need to produce its phonetic pronunciation.

We have pairs like this:

(dog, dag)
(the, T^)
etc.

If I train a CRF/HMM on the observable states (orthographic representation) and have the hidden states being the individual phonemes, I'll probably get an accuracy of 80% at least, with doing no feature engineering.

The baseline does so well because there is some kind of mapping/function between how a word is written and how it is spelt. Granted English has very irregular spelling but the observation holds true.

On the other hand, if I asked you to tell me whether there is a mapping between the name for an animal and the name for its meat, the trained CRF/HMM will have terrible accuracy because it won't learn anything--because the null hypothesis is true. That is, based on how an animal is written in English from just its characters you cannot predict how the meat of the animal is written.

(cow, beef)
(lamb, mutton)
(pig, pork)

So, my hypothesis is that for pairs of (coded, decoded), if one trains a sequential model and then tests, if the accuracy/F1 is low enough, it confirms the null hypothesis. If it's high enough, then it could be evidence for there being some deterministic mapping.

OF course, it works for grapheme to phoneme because we know how to model it intuitively. But for arbitrary data? It may be the "features" are mathematical functions that aren't discoverable without some kind of intuition about how such things work.

So, I've given an actual idea thats actionable. It probably won't work, but you see at least it's going in the right direction.

I hope this provides some more grounds for discussion so I could hear about what kind of approaches you'd have in mind.

Tub
Posts: 472
Joined: Wed Jul 27, 2011 3:13 pm UTC

### Re: How to decode a code when I have pairs of (coded, decoded)?

jacques01 wrote:I am not asking for an exact "yes" or "no." I'm looking for an approximate/probabilistic solution. So it doesn't even have to return the correct answer. But it should get closer to the correct answer if I keep providing it with additional learning samples of (coded, decoded).

Being ok with an approximate solution does not free you from the burden of formulating the problem precisely. The way you've worded it the only reasonable answer is still "100%", and you've yet to come up with a better question.

That being said, if you think machine learning can solve the halting problem, then you should do that; it will lead you directly to a solution for one possible version of the question you intended to ask. Whether the runtime is acceptable remains to be seen.

jacques01 wrote:I have but none of them cover non-deterministic / approximate / probabilistic algorithms. That is, algorithms which sacrifice the guarantee of "yes" or "no" with probability = 1.0 in exchange for being "tractable".

Well, you still have a countable number of algorithms against an uncountable number of problems. Even if you allow a small but bounded error, undecideable problems remain.

jacques01 wrote:I did and unfortunately the programs I tried said the "coded" wasn't in any base encoding.

If I were you, I'd take that result and stick with it.

jacques01 wrote:Well, this kind of problem is actually solved every day by human children. They receive sound waves as inputs and eventually learn to map these to a language's phonology and grammar.

Memorizing and recognizing (sound, meaning) pairs is a different problem from yours. You want a child to learn a handful of (sound, meaning) pairs and then deduce the meaning of every other word of the language without ever having heard it. You also require the child to learn the meaning of every sound that is not a word in the language. If you don't see why that's harder, you still haven't understood your question.

You also keep bringing up examples with a small and finite number of pairs, and those will lead your intuition in all the wrong directions.

Xanthir
My HERO!!!
Posts: 5413
Joined: Tue Feb 20, 2007 12:49 am UTC
Contact:

### Re: How to decode a code when I have pairs of (coded, decoded)?

Tub is being obtuse. To answer the question in the spirit of what it's asking: it depends.

If the ciphertext is generated by any decent encryption/hashing algo (off-the-shelf and readily available in all major programming languages for the past decade), statistical tests won't find any connection. That's sorta the point of those algos, to maximize apparent entropy. Finding the correlations you're asking for is actually an integral part of cracking encryption schemes.

It's possible they're instead using some home-grown encryption/obfuscation scheme, but that's actually more difficult than just doing it right. Unlikely to be the case.

(Which is probably good, since this once again sounds like you're trying to do something shady, particularly since you're once again masking details of what/why you're doing it.)
(defun fibs (n &optional (a 1) (b 1)) (take n (unfold '+ a b)))

chridd
Has a vermicelli title
Posts: 843
Joined: Tue Aug 19, 2008 10:07 am UTC
Location: ...Earth, I guess?
Contact:

### Re: How to decode a code when I have pairs of (coded, decoded)?

If coded strings are all the same length, but decoded strings are sometimes way longer than coded strings, then it's probably not an encoding of the string that you can recover the original from (could be a hash, or some database key that doesn't relate to the input, or an encoding of part of the input)
If coded strings are longer for longer decoded strings, then it's probably actually encoding something. If it's using actual encryption, then it probably won't be possible to actually decode it yourself, though.
If decoded strings are all short, or all the same length as each other, then you might not be able to tell.

Otherwise, you could try checking to see if it's any well-known algorithm (test each algorithm individually). Some possible ways to generate database keys that don't actually encode stuff:
• random numbers (which are indistinguishable from encrypted data)
• hashes (could be a common hashing algorithm directly applied to the data; otherwise, could be indistinguishable from random numbers)
• various encodings of the time/date (e.g., UNIX time)
GUID's
• numbers assigned sequentially
~ chri d. d. /tʃɹɪ.di.di/ (Phonotactics, schmphonotactics) · she · Forum game scores
mittfh wrote:I wish this post was very quotable...