1683: "Digital Data"


Tyndmyr
Posts: 11443
Joined: Wed Jul 25, 2012 8:38 pm UTC

Re: 1683: "Digital Data"

Postby Tyndmyr » Mon May 23, 2016 9:42 pm UTC

commodorejohn wrote:No, but you'll note that A. you're talking specifically about plain text, whereas the original question was general with a specific bent towards multimedia data (videos/etc.) and B. you're suggesting a solution like Subversion which strictly targets a specific set of files or folders rather than generally tracking all document activity. Imagine if the system made a new copy (or even a new partial copy) every time there was a change in your bash history or logfiles or whatever.


If it's binary data, who really cares what kind of file it is? Subversion, etc, will store the diff regardless.

Now, this runs into some messiness with compression, but that's more a theoretical problem, not one specific to the application. Leaving aside git-specific filters and other increasingly complicated tasks, if you're worried about size, just use uncompressed formats. Sure, they're heavier, but duplicating for frequent changes will quickly outstrip the savings.

ucim
Posts: 6890
Joined: Fri Sep 28, 2012 3:23 pm UTC
Location: The One True Thread

Re: 1683: "Digital Data"

Postby ucim » Tue May 24, 2016 1:05 am UTC

mikael wrote:Now that I think of it, every application that has an "undo" command basically behaves this way.
Audacity essentially saves the undo information. But yes, if you export to a different file format, it's lost.

So the question becomes whether or not you want that to even be possible.

Jose
Order of the Sillies, Honoris Causam - bestowed by charlie_grumbles on NP 859 * OTTscar winner: Wordsmith - bestowed by yappobiscuts and the OTT on NP 1832 * Ecclesiastical Calendar of the Order of the Holy Contradiction * Heartfelt thanks from addams and from me - you really made a difference.

mikael
Posts: 28
Joined: Mon Feb 16, 2015 6:56 pm UTC
Location: Avignon, France

Re: 1683: "Digital Data"

Postby mikael » Tue May 24, 2016 11:36 am UTC

ucim wrote:Audacity essentially saves the undo information. But yes, if you export to a different file format, it's lost.

So the question becomes whether or not you want that to even be possible.

Well, let's see: what would it mean to make the undo history available outside the application?

From the user's point of view, it would mean that for any file that has such a history "attached", it would be possible to undo the actions that have been taken to make that file, and thus explore that file's history. I guess that would look pretty much like a Git repo with auto-commit, for every file.

So yes, I would like that to be possible, assuming it doesn't end up eating up 90% of my resources and that it provides some usable navigation interface.

On the other hand, I don't think a version control system such as Git (or even a versioning filesystem, like commodorejohn suggested) is in any way concerned with the process by which data is transformed. It knows in what state the data was before and after the commit, but nothing about the transformation itself. With Git, when committing manually, you give it a few words to explain that transformation, but that's about it.

Now in the case of Git, that process presumably takes place inside the programmer's brain, so it's hard to get it out. But when manipulating most binary data, we use high-level tools which the computer should know everything about since they are just interactive programs that it runs. So in principle, it could store the history of every file as a chain of commits, with an actual description of the process attached to each commit. In effect, it could redo every action automatically.
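
To make the "chain of commits with a description of the process" idea concrete, here is a minimal sketch. It assumes a toy text editor whose every action is one of three hypothetical named operations (`append`, `replace`, `upper`); the `History` class and the operation table are illustrative inventions, not how Git or any real editor works.

```python
from dataclasses import dataclass, field

# Hypothetical editing operations; each is a pure function of (text, argument).
OPERATIONS = {
    "append": lambda text, arg: text + arg,
    "replace": lambda text, arg: text.replace(arg[0], arg[1]),
    "upper": lambda text, arg: text.upper(),
}

@dataclass
class History:
    initial: str
    steps: list = field(default_factory=list)  # (operation name, argument) pairs

    def record(self, name, arg=None):
        """Append a description of the action instead of a snapshot of the result."""
        self.steps.append((name, arg))

    def replay(self, upto=None):
        """Rebuild the document state after the first `upto` steps (all if None)."""
        text = self.initial
        for name, arg in self.steps[:upto]:
            text = OPERATIONS[name](text, arg)
        return text

history = History("hello")
history.record("append", " world")
history.record("replace", ("world", "forum"))
history.record("upper")

print(history.replay())    # state after every recorded step
print(history.replay(1))   # state after the first step only, i.e. two undos
```

Because the log stores the actions rather than the results, any intermediate state can be rebuilt by replaying a prefix of it, provided every operation is deterministic.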

Does that make sense?

ucim
Posts: 6890
Joined: Fri Sep 28, 2012 3:23 pm UTC
Location: The One True Thread

Re: 1683: "Digital Data"

Postby ucim » Tue May 24, 2016 12:27 pm UTC

mikael wrote:From the user's point of view, it would mean that for any file that has such a history "attached", it would be possible to undo the actions that have been taken to make that file, and thus explore that file's history. [...] So yes, I would like that to be possible.
So, when you send a letter to your boss or your competitor or the IRS, you want them to be able to "explore the history of the file"? And, presumably, every other file that it comes from? You want it to not be possible to hide this information by creating a "fresh virgin" of the file?

Sit down and have a think.

mikael wrote:But when manipulating most binary data, we use high-level tools which the computer should know everything about since they are just interactive programs that it runs. So in principle, it could store the history of every file as a chain of commits...
For this to be effective, the computer would need also a copy of every version of every program that has ever interacted with the file. That is, every program on everybody else's machine that was used to create or modify the file. Under every operating system.

First, put on your white beret and have a think.

Now put on your black hat and have a think.

Jose

Tyndmyr
Posts: 11443
Joined: Wed Jul 25, 2012 8:38 pm UTC

Re: 1683: "Digital Data"

Postby Tyndmyr » Tue May 24, 2016 12:37 pm UTC

mikael wrote:
ucim wrote:Audacity essentially saves the undo information. But yes, if you export to a different file format, it's lost.

So the question becomes whether or not you want that to even be possible.

Well, let's see: what would it mean to make the undo history available outside the application?


It's already essentially available outside of any given instance of the application. He's asking if you want a system where it's ALWAYS available, not where it may or may not be available at your preference.

I do not think you can guarantee that it's always available. At a bare minimum, the ol' picture of the monitor has a way of stripping out all non-visible data. Securing data FROM the creator is generally not very enforceable.

mikael
Posts: 28
Joined: Mon Feb 16, 2015 6:56 pm UTC
Location: Avignon, France

Re: 1683: "Digital Data"

Postby mikael » Tue May 24, 2016 1:54 pm UTC

ucim wrote:So, when you send a letter to your boss or your competitor or the IRS, you want them to be able to "explore the history of the file"? And, presumably, every other file that it comes from? You want it to not be possible to hide this information by creating a "fresh virgin" of the file?

No, of course not. You can't force anyone to provide data if they don't want to. All I want is a framework for collecting and managing this data which I believe to be useful.

ucim wrote:For this to be effective, the computer would need also a copy of every version of every program that has ever interacted with the file. That is, every program on everybody else's machine that was used to create or modify the file. Under every operating system.

That's right, and depending on the actual program, specifying its effect on the file might range from pretty straightforward (e.g. a SoX command line) to dauntingly difficult (e.g. a session in a proprietary DAW like Logic, running on a proprietary operating system). But in principle that should always be possible, given some sane (deterministic) runtime environment.

Tyndmyr wrote:It's already essentially available outside of any given instance of the application. He's asking if you want a system where it's ALWAYS available, not where it may or may not be available at your preference.

Yes, I figured that much. But how is this "already essentially available"? Could you give me an example of a file that could come with full history, should the user wish that to be public?

Tyndmyr
Posts: 11443
Joined: Wed Jul 25, 2012 8:38 pm UTC

Re: 1683: "Digital Data"

Postby Tyndmyr » Tue May 24, 2016 2:23 pm UTC

mikael wrote:
Tyndmyr wrote:It's already essentially available outside of any given instance of the application. He's asking if you want a system where it's ALWAYS available, not where it may or may not be available at your preference.

Yes, I figured that much. But how is this "already essentially available"? Could you give me an example of a file that could come with full history, should the user wish that to be public?


Many document formats come with an "undo" history, which functionally serves that purpose. They may or may not be indefinite, but they're usually large enough to pretty efficiently track change history, and it can be stored as part of a file.

For instance, Vim has a persistent undo (the 'undofile' option) that gets saved to a file when you write the buffer, should you choose to use it. Office, I believe, saves them in a .wml.

PDF stores it in the same doc itself, if memory serves.

So yeah, it's often already being tracked, if you care to utilize it. Just, most people don't.

ucim
Posts: 6890
Joined: Fri Sep 28, 2012 3:23 pm UTC
Location: The One True Thread

Re: 1683: "Digital Data"

Postby ucim » Tue May 24, 2016 4:15 pm UTC

mikael wrote:Could you give me an example of a file that could come with full history, should the user wish that to be public?
Microsoft Word documents, if you aren't careful with your settings, save versioning info, much to the embarrassment (or more) of people who have sent them. Perhaps it's changed; I no longer use Word.

Jose

HES
Posts: 4896
Joined: Fri May 10, 2013 7:13 pm UTC
Location: England

Re: 1683: "Digital Data"

Postby HES » Tue May 24, 2016 4:20 pm UTC

ucim wrote:Microsoft Word documents, if you aren't careful with your settings, save versioning info, much to the embarrassment (or more) of people who have sent them.

I did once send out my CV without clearing Track Changes first. Thankfully my supervisor caught the mistake before distributing it any further, and I got the job.
He/Him/His

orthogon
Posts: 3102
Joined: Thu May 17, 2012 7:52 am UTC
Location: The Airy 1830 ellipsoid

Re: 1683: "Digital Data"

Postby orthogon » Tue May 24, 2016 4:35 pm UTC

ucim wrote:
mikael wrote:Could you give me an example of a file that could come with full history, should the user wish that to be public?
Microsoft Word documents, if you aren't careful with your settings, save versioning info, much to the embarrassment (or more) of people who have sent them. Perhaps it's changed; I no longer use Word.

Jose

I was thinking there was a recent case of this, but on re-reading the article I see that was a more glaring error: they'd used "track changes" and left all that history in the document. D'oh.

(EDIT: Sorry, HES, I see that's what you did too!)
xtifr wrote:... and orthogon merely sounds undecided.

mikael
Posts: 28
Joined: Mon Feb 16, 2015 6:56 pm UTC
Location: Avignon, France

Re: 1683: "Digital Data"

Postby mikael » Wed May 25, 2016 6:25 pm UTC

Thanks for all your comments. I've been a little taken aback by all the negativity about the idea of sharing file history, so please excuse me if I ramble about this some more.

The central idea in Randall's strip is that digital data, although it is in some sense eternal, seems to "degrade" over time through transmission by careless actors.

By "careless actors", I mean users that get that data somewhere and then further propagate it, but after changing it in some way. Some might not even realize they are changing it while some might be doing so in a very deliberate way. The changes might range from the very slight to the drastic. But in the end, the details are irrelevant: in the digital realm, it's either the same data or it's not.

And if it's not the same data, I argue that we currently don't have an "efficient" way to "link" the two together. Again, as ucim aptly put it:
More troubling for me is that when I search on the net for something, I find hundreds of nearly identical undated versions that are clearly ripoffs of one another, but with no indication of which (if any) are the original, and who is the original author.

In other words, wouldn't it just be great if we could somehow know when some file has been derived from some other file? Also, if we could do that for all files, we could just "follow the path" up to the original version. And finally, I argue that if we could do that, it would restore the eternal nature of digital data even in the presence of careless actors. Problem solved.

OK, so how can you know that some file was derived from some other file? I could just tell you that's the case, perhaps even arguing that "I should know since I'm the one who did it", but then I would be instantly contradicted by countless other people whose interests in the matter differ from mine. So in order for that to work, I would actually have to prove it to you.

Now that's where things start getting really interesting. But long story short, the easiest way for me to prove to you that file B was derived from file A is to give you the process by which the file was derived in the first place, inside a suitable runtime environment so that you can replicate it by yourself.

To sum up: assuming that I want to prove to you that file B was derived from file A, it is much easier to do so if I can just give you the "redo history". That's all about how I could prove this to you, should I wish to do so. The question of why I would want to prove this to you in the first place is another one entirely.

I hope this clears things up a bit.

ucim
Posts: 6890
Joined: Fri Sep 28, 2012 3:23 pm UTC
Location: The One True Thread

Re: 1683: "Digital Data"

Postby ucim » Thu May 26, 2016 4:17 am UTC

It comes to identifying just what the problem is. And that depends on where you sit.

If the problem is people uploading (copies of) copyrighted material to youtube (or to their own phones), then yes, this would provide an ironclad way to lock all those pirates and criminals up for the rest of their lives, and fine them everything they own plus interest. No trial even needed; the file proves itself.

If the problem is websites take too long to download, then adding all the metadata and versioning info as well as the original of the (unimportant) original file from which the eye candy is taken makes the problem a zillion times worse.

If the problem is that there's a petabyte of sky data, and I'm interested in just this one tiny quadrant, and it becomes easier to see after this file transformation, and I want to email my colleague about my discovery, it's much easier (and just as useful) to send just the altered version of the teeny piece of the sky, and tell her (in text) where it came from.

If the problem is that the original disappears after this, then, well, yeah, it would have been nice to have a copy in my email... except that the email would still be transmitting long after I've retired.

If the problem is that there are a zillion ripoffs of some original on the net, and I want to find the original, well, good luck. Most of the ripoffs do not want you to find the original; they want you to find their own ripoff (and watch the ads). That's what they're made for.

For another angle on a similar problem, consider hotlinking vs rehosting. Which is better? Why? The answer is "it depends" and it's not always clear what it depends on (and it's not always bandwidth vs storage either).

In the end, for some files it matters. For some files it doesn't matter. And for some files, it depends whose ox is gored.

Jose

gmalivuk
GNU Terry Pratchett
Posts: 26823
Joined: Wed Feb 28, 2007 6:02 pm UTC
Location: Here and There

Re: 1683: "Digital Data"

Postby gmalivuk » Thu May 26, 2016 11:49 am UTC

Yeah, sites like 9gag rehost other people's pictures without credit and even add their own watermark. They wouldn't include file history data even if it was more readily available.
Unless stated otherwise, I do not care whether a statement, by itself, constitutes a persuasive political argument. I care whether it's true.
---
If this post has math that doesn't work for you, use TeX the World for Firefox or Chrome

(he/him/his)

UpGoing Kerbal
Posts: 3
Joined: Tue Jul 22, 2014 3:06 pm UTC

Re: 1683: "Digital Data"

Postby UpGoing Kerbal » Thu May 26, 2016 12:12 pm UTC

Digital data truly is forever with proper care.
Capture.JPG


But with layers of compression, changes in resolution and aspect ratio, as well as the odd photographed screen, it's not impossible for it to degrade.

mikael
Posts: 28
Joined: Mon Feb 16, 2015 6:56 pm UTC
Location: Avignon, France

Re: 1683: "Digital Data"

Postby mikael » Thu May 26, 2016 12:47 pm UTC

ucim wrote:It comes to identifying just what the problem is. And that depends on where you sit.

OK, maybe I was being too abstract. Here's my motivating example:

Let's assume that we have a content-addressed file sharing network (like eDonkey, BitTorrent or IPFS) that allows us to download any file present in the system given that file's hash. If I download a 1 GB file, change a few bytes and then re-upload it, I will have "created" 1 GB of fresh data out of thin air. Because the new file's hash is different from the old one's, the system is unable to realize that most of the data in the two files is the same. Thus, much storage is wasted.

That particular case could be solved by somehow sharding the files and computing the hash on every shard. Because most of the shards in the two files would be identical, they would be de-duplicated.
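
As a rough sketch of that sharding idea, assuming fixed-size shards and SHA-256 as the hash (the `ShardStore` class and the 4-byte shard size are made up for illustration, and none of the networks named above is claimed to work exactly this way):

```python
import hashlib

SHARD_SIZE = 4  # bytes; tiny for illustration, a real system would use far larger shards

def shard_hashes(data: bytes, size: int = SHARD_SIZE):
    """Split data into fixed-size shards and return (hash, shard) pairs."""
    return [
        (hashlib.sha256(data[i:i + size]).hexdigest(), data[i:i + size])
        for i in range(0, len(data), size)
    ]

class ShardStore:
    def __init__(self):
        self.shards = {}  # shard hash -> shard bytes, each stored only once

    def add_file(self, data: bytes):
        """Store a file and return its manifest: the ordered list of shard hashes."""
        manifest = []
        for digest, shard in shard_hashes(data):
            self.shards.setdefault(digest, shard)
            manifest.append(digest)
        return manifest

    def get_file(self, manifest):
        """Reassemble a file from its manifest."""
        return b"".join(self.shards[d] for d in manifest)

store = ShardStore()
original = b"AAAABBBBCCCCDDDD"
modified = b"AAAABBBBXXXXDDDD"   # one shard's worth of bytes changed in place
m1 = store.add_file(original)
m2 = store.add_file(modified)
print(len(store.shards))  # distinct shards stored for both files
```

Note that fixed-size shards only de-duplicate in-place changes: inserting a single byte near the start shifts every later shard boundary, which is why some systems pick the boundaries from the content itself instead.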

But what if I changed every byte in the file, for example by changing the contrast and lighting in a picture? Then, if we could somehow give that information to the network, the redundant information would not need to be stored anymore.

The information I'm talking about is "picture B is just picture A with function X applied to it". Here, A and B are file hashes and X would be a program that modifies the contrast and lighting of a picture.

Now if someone wanted to download picture B, there would be two options: either find someone who has B and download it directly, or find someone who has A, download it and apply X to it. In the extreme case, instead of having to store both A and B, the network could just store A and recompute B on the fly when needed. Storage wouldn't be wasted anymore.
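
A minimal sketch of "store A, recompute B on the fly", assuming a picture is just a run of 8-bit pixel values and X is a hypothetical `brighten` function. The `DerivingStore` class and the transform registry are illustrative names only:

```python
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Hypothetical transform X: brighten 8-bit pixel values, clamped at 255.
def brighten(pixels: bytes, amount: int) -> bytes:
    return bytes(min(255, p + amount) for p in pixels)

TRANSFORMS = {"brighten": brighten}

class DerivingStore:
    def __init__(self):
        self.blobs = {}    # hash -> bytes actually stored
        self.recipes = {}  # hash of B -> (hash of A, transform name, argument)

    def put(self, data: bytes) -> str:
        h = digest(data)
        self.blobs[h] = data
        return h

    def put_derived(self, source_hash: str, name: str, arg) -> str:
        """Record 'B is A with X applied' and keep only the recipe, not B itself."""
        result = TRANSFORMS[name](self.get(source_hash), arg)
        h = digest(result)
        self.recipes[h] = (source_hash, name, arg)
        return h

    def get(self, h: str) -> bytes:
        if h in self.blobs:
            return self.blobs[h]
        source_hash, name, arg = self.recipes[h]
        data = TRANSFORMS[name](self.get(source_hash), arg)
        if digest(data) != h:  # the recomputation must reproduce B bit for bit
            raise ValueError("transform is not reproducible")
        return data

store = DerivingStore()
a = store.put(bytes([10, 200, 250]))
b = store.put_derived(a, "brighten", 20)
print(list(store.get(b)))  # recomputed from A on demand
```

The hash check in `get` is the important part: the recipe only counts as "B" if re-running it yields exactly the bytes whose hash was recorded.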

So that's my scenario. As you can see, it's mostly about using the network's resources efficiently in the face of digital transformation of contents, but it can't do anything about the analog hole, for example.

HES
Posts: 4896
Joined: Fri May 10, 2013 7:13 pm UTC
Location: England

Re: 1683: "Digital Data"

Postby HES » Thu May 26, 2016 1:15 pm UTC

mikael wrote:Storage wouldn't be wasted anymore.

But processing power would. Which is more readily available?

mikael
Posts: 28
Joined: Mon Feb 16, 2015 6:56 pm UTC
Location: Avignon, France

Re: 1683: "Digital Data"

Postby mikael » Thu May 26, 2016 2:11 pm UTC

HES wrote:
mikael wrote:Storage wouldn't be wasted anymore.

But processing power would. Which is more readily available?

That depends on the process: some are CPU-intensive and produce little data (like protein folding) and some are very light on processing but produce large datasets (like color-tuning a picture). Optimization is hard, but the system would at least provide a generic way to trade one for the other.

ucim
Posts: 6890
Joined: Fri Sep 28, 2012 3:23 pm UTC
Location: The One True Thread

Re: 1683: "Digital Data"

Postby ucim » Thu May 26, 2016 2:21 pm UTC

You not only need the original data, and the transform, but you need the program that does the transform. So what happens when the programmer upgrades that program by changing a few features? This may have the intended (or unintended) effect of altering the way this particular transform happens, or it may not affect this particular transform. Or, like LZW compression, it may become illegal.

And what if some version of the program also contains a security vulnerability making it unsafe to run?

So now, I have the original, and I know what the transform is, but I can no longer apply the transform.

I see where you're going, but I don't see this as any sort of universal solution.

Jose

Flumble
Yes Man
Posts: 2264
Joined: Sun Aug 05, 2012 9:35 pm UTC

Re: 1683: "Digital Data"

Postby Flumble » Thu May 26, 2016 3:18 pm UTC

ucim wrote:I see where you're going, but I don't see this as any sort of universal solution.

Indeed. Even expressing the transform as a pure function (so it doesn't require any external state other than the previous state of the data) implemented in a universal machine language has its problems.
For one, we don't have a universal machine language. In practice Java bytecode comes close, but even that one introduces new features with newer versions which old JVMs can't handle.
For another, making transforms pure (which you really want, since you can't rely on external sources for data) leads to a metric fuckton of code duplication. Sure, we have plenty of disk space nowadays, but the whole idea is to reuse stuff, right? (at the very least it is for programming in general)
For yet another, what comprises the original and what a transform? Is it the data on which you apply the transform? Is it the transform too? Is it none or only the transform? I could say that a photo is a transform of nothingness, cropping it and adding a watermark is another transform, lossily compressing it is another transform and the combination (and optimization) of these transforms is also a transform. Now my original is [] and the transform is "making it a crappy 9gag image".

It's a can of ants best left alone. The best we have now are program-specific implementations, like those that Word, Vim and some other editing software support, and generic software in which the transform is simply "these bytes are replaced with these bytes".

orthogon
Posts: 3102
Joined: Thu May 17, 2012 7:52 am UTC
Location: The Airy 1830 ellipsoid

Re: 1683: "Digital Data"

Postby orthogon » Thu May 26, 2016 3:34 pm UTC

Flumble wrote:metric fuckton

Isn't that just a fucktonne?

HES
Posts: 4896
Joined: Fri May 10, 2013 7:13 pm UTC
Location: England

Re: 1683: "Digital Data"

Postby HES » Thu May 26, 2016 4:00 pm UTC

orthogon wrote:
Flumble wrote:metric fuckton

Isn't that just a fucktonne?

It's the fuck that's metric, not the ton.

commodorejohn
Posts: 1198
Joined: Thu Dec 10, 2009 6:21 pm UTC
Location: Placerville, CA

Re: 1683: "Digital Data"

Postby commodorejohn » Thu May 26, 2016 4:00 pm UTC

I believe it's equivalent to ~1.1 US fucktons, no?
"'Legacy code' often differs from its suggested alternative by actually working and scaling."
- Bjarne Stroustrup
www.commodorejohn.com - in case you were wondering, which you probably weren't.

ucim
Posts: 6890
Joined: Fri Sep 28, 2012 3:23 pm UTC
Location: The One True Thread

Re: 1683: "Digital Data"

Postby ucim » Thu May 26, 2016 4:36 pm UTC

ucim wrote: This may have the intended (or unintended) effect of altering the way this particular transform happens...
Quoting myself - let's see if I get a notification that I've been quoted.

This is a side-effect of the same transform being applied by different programs (specifically, different implementations of the <canvas> HTML5 tag in different browsers, or even the same browser on different machines, or even the same browser on the same machine with different versions of fonts....). The results are different enough to be used as a surreptitious tracking device on the web, that would be virtually impossible to avoid without disabling the <canvas> tag functionality, and websites could easily weave it into their code in such a way as to make it required (such as using the canvas to draw the basic navigation elements).

Draw your own conclusions.

As to the fuckton(ne)s... I think it's more important whether or not it's a European or an American fuckton(ne). American fuckton(ne)s are a bunch of shittons, but European ones are a shitton of shitton(ne)s. And like giga vs gibi, a shitton of shittonnes is different from a shittonne of shittonnes by a little bit.

Jose
ETA: I was not notified that I quoted myself. Interesting. Somebody's thinking!
ETA2: It wasn't even the case that I was, and it was automatically marked read because I read my own post. It just didn't happen.
Last edited by ucim on Thu May 26, 2016 5:32 pm UTC, edited 2 times in total.

commodorejohn
Posts: 1198
Joined: Thu Dec 10, 2009 6:21 pm UTC
Location: Placerville, CA

Re: 1683: "Digital Data"

Postby commodorejohn » Thu May 26, 2016 4:45 pm UTC

Ah, yet another terrific reason to use NoScript...

mikael
Posts: 28
Joined: Mon Feb 16, 2015 6:56 pm UTC
Location: Avignon, France

Re: 1683: "Digital Data"

Postby mikael » Fri May 27, 2016 10:38 am UTC

Yes, I know, there's a big gaping hole in my pretty picture. Actually, there are a few of them, but this one is a doozy: for the whole thing to work, we need a "perfect" runtime environment.

It needs to be deterministic, so that running the same code on the same data always produces the same result, and it needs to be sandboxed, so that nothing that runs inside it can "creep out". As Flumble said, that's called a pure function. I've become quite obsessed with these since I learned Haskell.

The reason we need that environment is that we want to "equate" data with the code that produced this data. But making such an environment is hard, and the ones that do exist are very restrictive: you wouldn't just run Audacity in them.

But as I said earlier, the easiest way to obtain the information that some data is the result of running some program is to record it when the data is first produced, and that means producing it inside our runtime environment.

So how do we do that? I'm not sure. Maybe by creating some kind of "pure computational module" that apps that want to generate data deterministically could defer the work to?

ucim wrote:You not only need the original data, and the transform, but you need the program that does the transform. So what happens when the programmer upgrades that program by changing a few features?
So now, I have the original, and I know what the transform is, but I can no longer apply the transform.

The idea here is that all computations are performed inside some immutable "base" environment. When data is produced, it is the result of some transform applied by some program, itself running on some OS, and so on until you reach your base environment. As long as you keep track of all intermediate code, running it always produces the same data.

And before you ask: yes, that's only legally possible if all said code is free software.

Flumble wrote:For another, making transforms pure (which you really want, since you can't rely on external sources for data) leads to a metric fuckton of code duplication.

I don't understand why that's true. Would you care to explain?

rmsgrey
Posts: 3655
Joined: Wed Nov 16, 2011 6:35 pm UTC

Re: 1683: "Digital Data"

Postby rmsgrey » Fri May 27, 2016 4:17 pm UTC

mikael wrote:
Flumble wrote:For another, making transforms pure (which you really want, since you can't rely on external sources for data) leads to a metric fuckton of code duplication.
I don't understand why that's true. Would you care to explain?


Imagine if every image file came with a copy of MS Paint in order to be able to view it - or if every mp3 came with a lightweight audio-player, or every .txt came with a separate copy of Notepad.

If every transform is pure, that means that, rather than, say, 100 bytes saying "apply standard function FOO from standard library BAR with parameter BAZ", you have to provide, say, 100kB of the code that has the effect of BAR.FOO(BAZ) (which is still less than the, say, 10MB that BAR takes up). If you have a thousand documents all using functions that could be found in BAR, then it's more efficient to just have your own copy of BAR and let each of them reference that, than to have a separate copy of part of BAR for each document. It gets even better if, rather than needing the whole of BAR, the five documents you have that use FOO can each share one copy of BAR.FOO.
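
Plugging in those illustrative figures (100 bytes per reference, 100 kB per bundled copy of BAR.FOO, 10 MB for the whole of BAR, a thousand documents, with kB and MB taken as powers of ten), the arithmetic works out roughly like this:

```python
import math

REFERENCE = 100           # bytes to say "apply BAR.FOO(BAZ)"
FOO_CODE = 100_000        # bytes of code bundled with each document (100 kB)
BAR_LIBRARY = 10_000_000  # bytes for one shared copy of BAR (10 MB)
DOCUMENTS = 1000

# Every document carries its own copy of the code it needs.
bundled = DOCUMENTS * FOO_CODE

# One shared copy of BAR, plus a short reference in every document.
shared = BAR_LIBRARY + DOCUMENTS * REFERENCE

# Smallest whole number of documents at which sharing the whole of BAR wins.
break_even = math.ceil(BAR_LIBRARY / (FOO_CODE - REFERENCE))

print(bundled, shared, break_even)
```

With these numbers the bundled approach costs about ten times as much storage as the shared one, and sharing the whole library already pays off at around a hundred documents.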

Of course, shared dependencies aren't all upside - you can run into version conflicts and other problems and end up with a hundred subtly different versions of BAR and be worse off again.

Still, the main point is that over a large enough number of files with their own pure transforms, there will be a lot of times different files do the same (or very similar) things, so you'll end up with a lot of chunks of code that are very similar, even if the overall output is very different. In other words, lots of code duplication.

SuicideJunkie
Posts: 429
Joined: Sun Feb 22, 2015 2:40 pm UTC

Re: 1683: "Digital Data"

Postby SuicideJunkie » Fri May 27, 2016 5:06 pm UTC

richP wrote: * Electronic document approval systems (no, printing the title block, having the chief engineer and QA lead initial and date, then re-scan does not count).
What a coincidence; that is *exactly* what I'm getting paid to do today!
Print the cover sheet, walk around the office for signatures, then scan and paste that image into the doc.

mikael
Posts: 28
Joined: Mon Feb 16, 2015 6:56 pm UTC
Location: Avignon, France

Re: 1683: "Digital Data"

Postby mikael » Fri May 27, 2016 5:27 pm UTC

rmsgrey wrote:Imagine if every image file came with a copy of MS Paint in order to be able to view it - or if every mp3 came with a lightweight audio-player, or every .txt came with a separate copy of Notepad.

Thanks for the great explanation. If every piece of data came "bundled" with the code that produced it, that would lead to tremendous code duplication indeed.

But I don't see why that would be necessary: all we need is for the code to be immutable, so couldn't we just identify it through its hash? Your example would become "apply function FOO from the library whose hash is BARHASH with parameter the data whose hash is BAZHASH". That would be very little data and it would still identify the whole transform uniquely, right?
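
A sketch of what such a hash-based reference could look like, under heavy assumptions: the `network` dictionary stands in for a content-addressed store, BAR is a made-up one-function library published as Python source, and `exec` is used with no sandboxing at all, purely for illustration.

```python
import hashlib
import json

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Stand-in for the content-addressed network: hash -> bytes.
network = {}

def publish(data: bytes) -> str:
    h = digest(data)
    network[h] = data
    return h

# Hypothetical library BAR, published as source code like any other data.
BAR_SOURCE = b"def FOO(data):\n    return data[::-1]\n"
bar_hash = publish(BAR_SOURCE)
baz_hash = publish(b"some input data")

# The whole transform is named by one small descriptor.
descriptor = json.dumps(
    {"function": "FOO", "library": bar_hash, "input": baz_hash},
    sort_keys=True,
).encode()

def apply_transform(descriptor: bytes) -> bytes:
    spec = json.loads(descriptor)
    source = network[spec["library"]]
    if digest(source) != spec["library"]:  # never run code that fails its hash
        raise ValueError("library does not match its hash")
    namespace = {}
    exec(source.decode(), namespace)       # illustration only: no sandboxing here
    return namespace[spec["function"]](network[spec["input"]])

print(len(descriptor))              # size of the reference in bytes
print(apply_transform(descriptor))  # result of FOO applied to the input
```

The descriptor stays small however large the library is, but it is only usable by someone who can actually fetch the bytes behind both hashes.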

Does that somehow not qualify as a pure transform?
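
A minimal sketch of what such a hash-based reference could look like, assuming SHA-256 as the hash and an in-memory dictionary standing in for the storage. The `ContentStore` class, the JSON layout of the descriptor and the placeholder byte strings are all hypothetical, chosen only to show that the descriptor stays small however large the library is.

```python
import hashlib
import json


def content_hash(blob: bytes) -> str:
    """Identify a blob by the SHA-256 digest of its bytes."""
    return hashlib.sha256(blob).hexdigest()


class ContentStore:
    """Toy content-addressed store: the key is always the hash of the value."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, blob: bytes) -> str:
        key = content_hash(blob)
        self._blobs[key] = blob  # storing the same bytes twice is a no-op
        return key

    def get(self, key: str) -> bytes:
        blob = self._blobs[key]
        # Anyone fetching a blob can check that it matches the requested hash.
        if content_hash(blob) != key:
            raise ValueError("stored blob does not match its hash")
        return blob


store = ContentStore()
library_bar = b"\x00" * 1_000_000      # placeholder standing in for library BAR
data_baz = b"parameter data for FOO"   # placeholder standing in for BAZ

bar_hash = store.put(library_bar)
baz_hash = store.put(data_baz)

# "apply function FOO from the library whose hash is BARHASH
#  with parameter the data whose hash is BAZHASH"
descriptor = json.dumps(
    {"function": "FOO", "library": bar_hash, "parameter": baz_hash}
)

print(len(descriptor))        # a couple of hundred bytes
print(len(library_bar))       # versus the size of the library itself
```

The descriptor pins down exactly which bytes are meant, but whoever applies it still has to be able to fetch those bytes from somewhere.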

rmsgrey
Posts: 3655
Joined: Wed Nov 16, 2011 6:35 pm UTC

Re: 1683: "Digital Data"

Postby rmsgrey » Fri May 27, 2016 6:12 pm UTC

mikael wrote:
rmsgrey wrote:Imagine if every image file came with a copy of MS Paint in order to be able to view it - or if every mp3 came with a lightweight audio-player, or every .txt came with a separate copy of Notepad.

Thanks for the great explanation. If every piece of data came "bundled" with the code that produced it, that would lead to tremendous code duplication indeed.

But I don't see why that would be necessary: all we need is for the code to be immutable, so couldn't we just identify it through its hash? Your example would become "apply function FOO from the library whose hash is BARHASH with parameter the data whose hash is BAZHASH". That would be very little data and it would still identify the whole transform uniquely, right?

Does that somehow not qualify as a pure transform?


The idea of a pure transform is that you don't need any external data to use it - if you don't supply the library, then whoever's trying to apply the transform will need to go find a copy of the library somewhere, and it stops being a pure transform (as Flumble used the term).

Obviously, in practice, you need to make some assumptions - you need the recipient's machine to be able to run whatever code you use for the transform - but assuming they have a particular library function available is a pretty big assumption.

mikael
Posts: 28
Joined: Mon Feb 16, 2015 6:56 pm UTC
Location: Avignon, France

Re: 1683: "Digital Data"

Postby mikael » Fri May 27, 2016 9:06 pm UTC

rmsgrey wrote:Obviously, in practice, you need to make some assumptions - you need the recipient's machine to be able to run whatever code you use for the transform - but assuming they have a particular library function available is a pretty big assumption.

OK, so I guess I must explicitly include the assumption that we have a content-addressed storage network at our disposal, which allows us to retrieve any chunk of data that has been made public (including code) based on the hash of its content.

ps.02
Posts: 378
Joined: Fri Apr 05, 2013 8:02 pm UTC

Re: 1683: "Digital Data"

Postby ps.02 » Fri May 27, 2016 9:25 pm UTC

mikael wrote:OK, so I guess I must explicitly include the assumption that we have a content-addressed storage network at our disposal, which allows us to retrieve any chunk of data that has been made public (including code) based on the hash of its content.

So, I use the NoScript browser plugin, because I don't really trust my sandboxes. You're saying that, in your world, I would basically be unable to view any media ever?

rmsgrey
Posts: 3655
Joined: Wed Nov 16, 2011 6:35 pm UTC

Re: 1683: "Digital Data"

Postby rmsgrey » Fri May 27, 2016 9:28 pm UTC

mikael wrote:
rmsgrey wrote:Obviously, in practice, you need to make some assumptions - you need the recipient's machine to be able to run whatever code you use for the transform - but assuming they have a particular library function available is a pretty big assumption.

OK, so I guess I must explicitly include the assumption that we have a content-addressed storage network at our disposal, which allows us to retrieve any chunk of data that has been made public (including code) based on the hash of its content.


Sounds like you're heading into Linux territory with package repositories.

mikael
Posts: 28
Joined: Mon Feb 16, 2015 6:56 pm UTC
Location: Avignon, France

Re: 1683: "Digital Data"

Postby mikael » Fri May 27, 2016 10:14 pm UTC

ps.02 wrote:So, I use the NoScript browser plugin, because I don't really trust my sandboxes. You're saying that, in your world, I would basically be unable to view any media ever?

No, only that you would have to download the static content every time. But the better option would be to make the sandbox secure enough for you to trust it.

rmsgrey wrote:Sounds like you're heading into Linux territory with package repositories.

Packages are not addressed by content, so they are not immutable. But they are cryptographically signed, so it would take a malicious or really clumsy maintainer to change some package while keeping the version identical.
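
As a rough illustration of the difference (not a model of any real package manager), the sketch below keeps a toy repository keyed by name and version and shows that its contents can change under the same key, whereas a hash recorded earlier exposes the change. The package name, version and byte strings are invented, and signing is deliberately left out.

```python
import hashlib


def content_hash(blob: bytes) -> str:
    """Identify a blob by the SHA-256 digest of its bytes."""
    return hashlib.sha256(blob).hexdigest()


# Toy repository addressed by (name, version), as with ordinary packages.
repository = {("bar", "1.0"): b"original build of bar"}

# A document that depends on the package records the hash it was built against.
pinned_hash = content_hash(repository[("bar", "1.0")])

# The maintainer replaces the package while keeping the version identical.
repository[("bar", "1.0")] = b"rebuilt bar with different contents"

# Looking up by name and version still succeeds and returns the new bytes...
fetched = repository[("bar", "1.0")]

# ...but comparing against the pinned hash reveals that the content changed.
content_changed = content_hash(fetched) != pinned_hash
print(content_changed)  # True
```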

Also, the Linux environment is too powerful to safely run untrusted code. Again, the code I run should only be able to produce the data it's supposed to, and not have any side effects.

But now that I think about it, if it could dynamically download (static) content from the network at runtime, that would provide it with a return channel that could leak data...

commodorejohn
Posts: 1198
Joined: Thu Dec 10, 2009 6:21 pm UTC
Location: Placerville, CA

Re: 1683: "Digital Data"

Postby commodorejohn » Fri May 27, 2016 10:21 pm UTC

Mostly I'm still just trying to fathom why you would go to all this trouble in the first place. I seriously doubt that much space is being wasted on files that are more-or-less functional duplicates.
"'Legacy code' often differs from its suggested alternative by actually working and scaling."
- Bjarne Stroustrup
www.commodorejohn.com - in case you were wondering, which you probably weren't.

ps.02
Posts: 378
Joined: Fri Apr 05, 2013 8:02 pm UTC

Re: 1683: "Digital Data"

Postby ps.02 » Fri May 27, 2016 10:33 pm UTC

mikael wrote:
ps.02 wrote:So, I use the NoScript browser plugin, because I don't really trust my sandboxes. You're saying that, in your world, I would basically be unable to view any media ever?

No, only that you would have to download the static content every time. But the better option would be to make the sandbox secure enough for you to trust it.

Oh, that's all? Why didn't the computer industry think of that sometime in the past 30 years?

Seriously, I think we're a long ways away from a world where it's reasonable to ask people to enable macros for every Word document they get in their email.

ucim
Posts: 6890
Joined: Fri Sep 28, 2012 3:23 pm UTC
Location: The One True Thread

Re: 1683: "Digital Data"

Postby ucim » Fri May 27, 2016 11:15 pm UTC

SuicideJunkie wrote:
richP wrote: * Electronic document approval systems (no, printing the title block, having the chief engineer and QA lead initial and date, then re-scan does not count).
What a coincidence; that is *exactly* what I'm getting paid to do today!
Print the cover sheet, walk around the office for signatures, then scan and paste that image into the doc.
Or, into any other doc you feel like, after a little gimping. :)

ps.02 wrote:Oh, that's all? Why didn't the computer industry think of that sometime in the past 30 years?
Because it was dominated by Microsoft? :twisted:

mikael wrote:But the better option would be to make the sandbox secure enough for you to trust it.
Sorry, I have the uncontrollable giggles now.

Ok, let's turn this around. *poof!* - you have the system you dream of. Pretend you're a black hat. How long before total world domination?

Pretend you're a white hat and want to create a 16x16 icon of a rose to be displayed as a favicon. The perfect rose is the thirty-fifth flower to the left of the second row of bushes around the green house in panorama.pic. You are limited to a file size of 1k. How do you do it?

Jose

cdxf6465
Posts: 8
Joined: Fri Nov 30, 2012 8:15 am UTC

Re: 1683: "Digital Data"

Postby cdxf6465 » Sat May 28, 2016 7:08 am UTC

rmsgrey wrote:Of course, shared dependencies aren't all upside - you can run into version conflicts and other problems and end up with a hundred subtly different versions of BAR and be worse off again.

Still, the main point is that over a large enough number of files with their own pure transforms, there will be a lot of times different files do the same (or very similar) things, so you'll end up with a lot of chunks of code that are very similar, even if the overall output is very different. In other words, lots of code duplication.


That's kind of like WinSxS :lol:

mikael
Posts: 28
Joined: Mon Feb 16, 2015 6:56 pm UTC
Location: Avignon, France

Re: 1683: "Digital Data"

Postby mikael » Tue May 31, 2016 9:53 am UTC

commodorejohn wrote:Mostly I'm still just trying to fathom why you would go to all this trouble in the first place. I seriously doubt that much space is being wasted on files that are more-or-less functional duplicates.

If you mean globally, you're probably right, if only because the vast majority of storage is wasted on exact duplicates. But inside any de-duplicated system such as Dropbox or BitTorrent, I'm guessing it's a major source of wasted space.

As for why I want to save on storage in the first place... I don't know. Hard drives are cheap, right?
I guess it's just my gut telling me that swamping the Internet with big blobs of data generated with a few mouse clicks is "bad".

ucim
Posts: 6890
Joined: Fri Sep 28, 2012 3:23 pm UTC
Location: The One True Thread

Re: 1683: "Digital Data"

Postby ucim » Tue May 31, 2016 4:15 pm UTC

mikael wrote:If you mean globally, you're probably right, if only because the vast majority of storage is wasted on exact duplicates.
Call them "backups" and you have a feature, not a bug.

Jose

scarletmanuka
Posts: 533
Joined: Wed Oct 17, 2007 4:29 am UTC
Location: Perth, Western Australia

Re: 1683: "Digital Data"

Postby scarletmanuka » Wed Jun 08, 2016 1:25 am UTC

mikael wrote:OK, so how can you know that some file was derived from some other file? I could just tell you that's the case, perhaps even arguing that "I should know since I'm the one who did it", but then I would be instantly contradicted by countless other people whose interests in the matter differ from mine. So in order for that to work, I would actually have to prove it to you.

Now that's where things start getting really interesting. But long story short, the easiest way for me to prove to you that file B was derived from file A is to give you the process by which the file was derived in the first place, inside a suitable runtime environment so that you can replicate it by yourself.

One of the problems with this - particularly if you want to apply it to things like copy protection - is that it will also be possible to provide a sequence of transformations that produces file A from file B.

Another issue is that in many cases there will be more than one source file at each stage (for example in the "add a watermark" process).
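
The first problem can be seen with any invertible transform. The sketch below uses a repeating-key XOR purely as a stand-in (the file contents and key are made up): the same replayable process that shows B follows from A also shows A follows from B, so being able to replay a derivation does not by itself say which file came first.

```python
def xor_transform(data: bytes, key: bytes) -> bytes:
    """XOR every byte with a repeating key; applying it twice undoes it."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


file_a = b"original artwork"   # hypothetical contents of file A
key = b"\x5a\x13\xc7"          # hypothetical transform parameter

file_b = xor_transform(file_a, key)

# Claim 1: "B was derived from A" -- replaying the transform on A gives B.
forward_ok = xor_transform(file_a, key) == file_b

# Claim 2: "A was derived from B" -- replaying the same transform on B gives A.
reverse_ok = xor_transform(file_b, key) == file_a

print(forward_ok, reverse_ok)  # True True
```

Both claims check out equally well, so a verifier needs something beyond the transform itself to settle the direction.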

