
2054: "Data Pipeline"

Posted: Wed Oct 03, 2018 3:25 pm UTC
by Hiferator
Image
Title text: "Is the pipeline literally running from your laptop?" "Don't be silly, my laptop disconnects far too often to host a service we rely on. It's running on my phone."

Is this a reference to some specific example?

(Created with chridd's xkcd thread formatter.)

Re: 2054: "Data Pipeline"

Posted: Wed Oct 03, 2018 3:45 pm UTC
by pebkac
I am a data warehouse administrator, and I approve of this message.

Re: 2054: "Data Pipeline"

Posted: Wed Oct 03, 2018 4:01 pm UTC
by alanbbent
Ouch, this one is pointed. I'm always the one saying "I can automate the collection and parsing of that data!"

In my defense, I never intend for those scripts to be used by anyone but me. Because yeah, web sites and servers change, and the script needs to be updated. I'm just... the guy who is in charge of getting that data, and I automate it. And even if the script needs fixing once or twice a year, I still think it's way better than throwing my hands up and saying "We shouldn't try to automate this! Sometimes inputs change and it breaks the automation!"

I guess I've never run into anyone like Ponytail who points out how fragile the whole idea is. I'd probably look a lot like panel 3 if I did.

Re: 2054: "Data Pipeline"

Posted: Wed Oct 03, 2018 4:15 pm UTC
by HES
alanbbent wrote:In my defense, I never intend for those scripts to be used by anyone but me. Because yeah, web sites and servers change, and the script needs to be updated. I'm just... the guy who is in charge of getting that data, and I automate it. And even if the script needs fixing once or twice a year, I still think it's way better than throwing my hands up and saying "We shouldn't try to automate this! Sometimes inputs change and it breaks the automation!"

I mean, as long as you're within the bounds of https://xkcd.com/1205/ and not https://xkcd.com/1319/ , then what could go wrong?

Re: 2054: "Data Pipeline"

Posted: Wed Oct 03, 2018 4:23 pm UTC
by rmsgrey
So long as you plan for the scripts to break periodically, there's nothing wrong with the basic idea. The trick is planning for them to break and keeping the conglomeration robust rather than having everything collapse when something weird comes in...
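
A minimal sketch of what "planning for them to break" could look like, assuming a pipeline that handles one record at a time. The stage functions and field names (`parse_price`, `add_total`, `price`, `qty`) are made up for illustration. A record that breaks any stage is set aside with the reason, and the rest of the batch carries on.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]
Stage = Callable[[Record], Record]


def run_pipeline(
    records: Iterable[Record], stages: List[Stage]
) -> Tuple[List[Record], List[Tuple[Record, str]]]:
    """Run each record through every stage; quarantine failures instead of aborting."""
    good: List[Record] = []
    quarantined: List[Tuple[Record, str]] = []
    for record in records:
        try:
            current = record
            for stage in stages:
                current = stage(current)
            good.append(current)
        except Exception as exc:  # any stage may choke on weird input
            # Keep the original record and the reason so a human can look later.
            quarantined.append((record, f"{type(exc).__name__}: {exc}"))
    return good, quarantined


# Hypothetical stages, for illustration only.
def parse_price(record: Record) -> Record:
    return {**record, "price": float(record["price"])}


def add_total(record: Record) -> Record:
    return {**record, "total": record["price"] * record["qty"]}


rows = [
    {"price": "2.50", "qty": 4},
    {"price": "n/a", "qty": 1},  # unparsable price
    {"qty": 3},                  # missing field
]
good, quarantined = run_pipeline(rows, [parse_price, add_total])
```

With this shape the two odd rows end up in `quarantined` with their error messages, and the one sane row still makes it through.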

Re: 2054: "Data Pipeline"

Posted: Wed Oct 03, 2018 4:42 pm UTC
by richP
Hiferator wrote:...
Is this a reference to some specific example?

(Created with chridd's xkcd thread formatter.)

No universal example, but most of us have an example in our past.
I always thought of the process as less "house of cards" and more "meat grinder to sausage stuffer". Of course, my process usually involved grep, sed, awk, and maybe a PERL one-liner if things got really hairy.

Re: 2054: "Data Pipeline"

Posted: Wed Oct 03, 2018 5:01 pm UTC
by cellocgw
richP wrote:
Hiferator wrote:...
Is this a reference to some specific example?

No universal example, but most of us have an example in our past.
I always thought of the process as less "house of cards" and more "meat grinder to sausage stuffer". Of course, my process usually involved grep, sed, awk, and maybe a PERL one-liner if things got really hairy.


Contest Time!

Write a one-line shell command (80 char or less) using all of grep, sed, and awk that actually does something recognizable.

Prizes will be given on the basis of both style and output. Points taken off if the output is useful. Judges' determination of "useful" is final.

Re: 2054: "Data Pipeline"

Posted: Wed Oct 03, 2018 5:06 pm UTC
by ucim
cellocgw wrote: Points taken off if the output is useful.
Is it useful to win points?

Jose

Re: 2054: "Data Pipeline"

Posted: Wed Oct 03, 2018 5:18 pm UTC
by Sableagle
rmsgrey wrote:So long as you plan for the scripts to break periodically, there's nothing wrong with the basic idea. The trick is planning for them to break and keeping the conglomeration robust rather than having everything collapse when something weird comes in...

A supermarket chain over here failed to include "check the input makes sense" lines and someone added a new product to their system with volume 1 litre and mass 350 kg. Fortunately, the part of the system that decided each of these needed assigning its own van to transport it was followed by a part of the system that would let staff combine assigned vanloads indefinitely and pile "hundreds of tonnes" into 1 van.
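
For what it's worth, a "check the input makes sense" line for that case can be tiny. This is only a sketch: the function name and the 25 kg/L ceiling are my own assumptions (the ceiling sits just above osmium, at roughly 22.6 kg/L), not anything from the chain's real system.

```python
# Assumed ceiling: nothing on a shelf should be denser than osmium (~22.6 kg/L).
MAX_PLAUSIBLE_DENSITY_KG_PER_L = 25.0


def validate_product(name, volume_l, mass_kg):
    """Return a list of problems with a new product entry (empty list means OK)."""
    problems = []
    if volume_l <= 0:
        problems.append("volume must be positive")
    if mass_kg <= 0:
        problems.append("mass must be positive")
    # Only compute density once both numbers are known to be usable.
    if not problems:
        density = mass_kg / volume_l
        if density > MAX_PLAUSIBLE_DENSITY_KG_PER_L:
            problems.append(f"density {density:.0f} kg/L is implausible")
    return problems


suspect = validate_product("mystery item", volume_l=1.0, mass_kg=350.0)
fine = validate_product("milk", volume_l=1.0, mass_kg=1.03)
```

The 1 litre / 350 kg entry gets flagged before anything downstream starts assigning vans to it.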

Daggerfall had a system under which the mana cost of a spell from any school of magic got lower as the caster's skill in that school got higher. There was a check to make sure the total cost was at least +5, but by maxing out Destruction and always including a mighty firebolt in every custom spell, a player could get a super-awesome shield spell to cost (+380) + (-540) = (-160), which the check then floored to 5 mana.

Then there was Heartbleed, in which the server code was streamlined by not bothering to check that it had even received the correct length of string in a protocol that sent a string back and forth to make sure the connection was alright. :roll:
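
The missing check is easy to show in miniature. The sketch below uses a deliberately simplified, made-up message layout (a 2-byte big-endian claimed length followed by the payload); the real TLS heartbeat message also has a type byte and padding, so treat this as an illustration of the bounds check only. Python slicing cannot read neighbouring memory the way the C code did, so the point here is just the comparison between the claimed and the actual length.

```python
import struct


def heartbeat_response(message: bytes) -> bytes:
    """Echo the payload back, but only if the claimed length matches what arrived."""
    if len(message) < 2:
        raise ValueError("message too short to contain a length field")
    (claimed_len,) = struct.unpack(">H", message[:2])
    payload = message[2:]
    # The check that was missing: never trust the sender's length field.
    if claimed_len > len(payload):
        raise ValueError(
            f"claimed payload length {claimed_len} exceeds received {len(payload)} bytes"
        )
    return message[:2] + payload[:claimed_len]


honest_reply = heartbeat_response(b"\x00\x04bird")
```

A request claiming 65535 bytes while sending only four is rejected instead of being answered with whatever happens to sit next to the buffer.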

Re: 2054: "Data Pipeline"

Posted: Wed Oct 03, 2018 6:32 pm UTC
by Flumble
cellocgw wrote:Write a one-line shell command (80 char or less) using all of grep, sed, and awk that actually does something recognizable.

Prizes will be given on the basis of both style and output. Points taken off if the output is useful. Judges' determination of "useful" is final.

Code: Select all

echo "'awk!' sed grep -e 'bit me finger!'"

It's a stylish 42 characters and it does absolutely nothing of interest. :D "Using" awk, grep and sed in the broadest sense possible.

...I have no idea how to use awk. It seems like a heavily outdated language and interpreter that should only exist today to support code that was written and checked 30 years ago.

Re: 2054: "Data Pipeline"

Posted: Wed Oct 03, 2018 6:52 pm UTC
by richP
Sableagle wrote:A supermarket chain over here failed to include "check the input makes sense" lines and someone added a new product to their system with volume 1 litre and mass 350 kg.

350 kg/L? Your supermarket sells dark matter? or Liquid black holes?

Re: 2054: "Data Pipeline"

Posted: Wed Oct 03, 2018 7:21 pm UTC
by NotAllThere
First comic I've laughed out loud over for a long time. And there's something I really need to address in my code tomorrow... :oops:

Re: 2054: "Data Pipeline"

Posted: Wed Oct 03, 2018 10:47 pm UTC
by Archgeek
richP wrote:
Sableagle wrote:A supermarket chain over here failed to include "check the input makes sense" lines and someone added a new product to their system with volume 1 litre and mass 350 kg.

350 kg/L? Your supermarket sells dark matter? or Liquid black holes?

Nah, even common electron degenerate matter in white dwarf stars piles in at around a million kg/L. This is still nearly 15.5 x the density of osmium, though, so I'm going to guess they're selling liter containers of very compressed dense gas or highly compressible liquid, if anything can be crushed that hard.

Re: 2054: "Data Pipeline"

Posted: Wed Oct 03, 2018 10:57 pm UTC
by Mikeski
richP wrote:
Sableagle wrote:A supermarket chain over here failed to include "check the input makes sense" lines and someone added a new product to their system with volume 1 litre and mass 350 kg.

350 kg/L? Your supermarket sells dark matter? or Liquid black holes?


That's only 18 times the density of tungsten. No unproven dark-matter physics required, just get the delivery van up to 0.998c.

A black hole of that mass would have a volume about 70 orders of magnitude smaller than 1 liter.
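
A quick back-of-the-envelope check of those figures, as a sketch. The assumptions are mine: tungsten at about 19.25 kg/L, only length contraction counted (so the density seen from the roadside scales with the Lorentz factor), and a non-rotating black hole whose "volume" is taken as a sphere of the Schwarzschild radius.

```python
import math

G = 6.674e-11               # gravitational constant, m^3 kg^-1 s^-2
C = 2.998e8                 # speed of light, m/s
TUNGSTEN_KG_PER_L = 19.25   # approximate density of tungsten

mass_kg = 350.0
volume_l = 1.0

# How many times denser than tungsten is the mystery product?
density_ratio = (mass_kg / volume_l) / TUNGSTEN_KG_PER_L   # roughly 18

# Speed (as a fraction of c) at which the Lorentz factor equals that ratio,
# i.e. a van-load of tungsten contracted into the listed volume.
beta = math.sqrt(1.0 - 1.0 / density_ratio**2)             # roughly 0.998

# Schwarzschild radius and the volume of a sphere of that radius.
r_s = 2.0 * G * mass_kg / C**2
black_hole_volume_m3 = (4.0 / 3.0) * math.pi * r_s**3

# Orders of magnitude between 1 litre (1e-3 m^3) and that volume.
orders_smaller = math.log10(1e-3 / black_hole_volume_m3)   # a bit over 69

print(f"{density_ratio:.1f}x tungsten, v = {beta:.4f} c, "
      f"{orders_smaller:.1f} orders of magnitude smaller than a litre")
```

Under those assumptions the numbers land where the post says: about 18 times tungsten, about 0.998c, and a black hole volume close to 70 orders of magnitude below a litre.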

Re: 2054: "Data Pipeline"

Posted: Thu Oct 04, 2018 12:23 am UTC
by Soupspoon
Regardless, probably a lot more of it gets sold than needs to be delivered to the store, while every avocado mysteriously evaporates from stock.

(Half the first Google page for that search I just made pointed at Aussies doing it, I had to search down a bit to find the articles from home that I knew existed!)

Re: 2054: "Data Pipeline"

Posted: Thu Oct 04, 2018 12:29 am UTC
by freezeblade
Soupspoon wrote:Regardless, probably a lot more of it gets sold than needs to be delivered to the store, while every avocado mysteriously evaporates from stock.

(Half the first Google page for that search I just made pointed at Aussies doing it, I had to search down a bit to find the articles from home that I knew existed!)


I see this commonly in the US, except rung up as "bananas" in my area, which are typically somewhere around 19-29 cents a pound.

Re: 2054: "Data Pipeline"

Posted: Thu Oct 04, 2018 3:54 am UTC
by fluffysheap
Mikeski wrote:That's only 18 times the density of tungsten. No unproven dark-matter physics required, just get the delivery van up to 0.998c.

A black hole of that mass would have a volume about 70 orders of magnitude smaller than 1 liter.

I was actually wondering about that. Tungsten, osmium, gold, or whatever conventional heavy materials are much too light, but then exotic materials (at least exotic on Earth) are much too heavy. There doesn't seem to be any physically reasonable material with that density.

Maybe stellar core material? What would have to be fusing to get the right density?

Re: 2054: "Data Pipeline"

Posted: Thu Oct 04, 2018 11:13 am UTC
by pscottdv
Hiferator wrote:Image
Title text: "Is the pipeline literally running from your laptop?" "Don't be silly, my laptop disconnects far too often to host a service we rely on. It's running on my phone."

Is this a reference to some specific example?

(Created with chridd's xkcd thread formatter.)


Obviously, he has worked for my company.

Re: 2054: "Data Pipeline"

Posted: Thu Oct 04, 2018 11:19 am UTC
by pscottdv
Flumble wrote:
cellocgw wrote:Write a one-line shell command (80 char or less) using all of grep, sed, and awk that actually does something recognizable.

Prizes will be given on the basis of both style and output. Points taken off if the output is useful. Judges' determination of "useful" is final.

Code: Select all

echo "'awk!' sed grep -e 'bit me finger!'"

It's a stylish 42 characters and it does absolutely nothing of interest. :D "Using" awk, grep and sed in the broadest sense possible.

...I have no idea how to use awk. It seems like a heavily outdated language and interpreter that should only exist today to support code that was written and checked 30 years ago.


I don't know. I never seem to get cut to work the way I want. awk seems to "just work"™.

Re: 2054: "Data Pipeline"

Posted: Thu Oct 04, 2018 3:01 pm UTC
by Zamfir
I once stumbled on the following, written by some seemingly sane and highly respected venture capital guy:
The example I often give here is of a VP of Something or Other in a big company who every month downloads data from an internal system into a CSV, imports that into Excel and makes charts, pastes the charts into PowerPoint and makes slides and bullets, and then emails the PPT to 20 people. Tell this person that they could switch to Google Docs and they’ll laugh at you; tell them that they could do it on an iPad and they’ll fall off their chair laughing. But really, that monthly PowerPoint status report should be a live SaaS dashboard that’s always up-to-date, machine learning should trigger alerts for any unexpected and important changes, and the 10 meg email should be a Slack channel. Now ask them again if they want an iPad.


And I thought NOOOOOOOOOOOOO. The VP guy has a working, robust system that he understands. You want to replace it with a black box that breaks the next time someone changes a column name in the CSV?
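
If someone does build the dashboard, the least it can do is fail loudly and name the column that changed. A rough sketch, with an invented schema (`month`, `region`, `revenue`) and an invented function name, using only Python's standard `csv` module:

```python
import csv
import io

# Hypothetical schema: the columns the monthly report is assumed to rely on.
EXPECTED_COLUMNS = {"month", "region", "revenue"}


def load_report(csv_text):
    """Parse the export, refusing to continue if an expected column is gone."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        # Name the missing columns so the failure is not a black box.
        raise ValueError(f"CSV is missing expected columns: {sorted(missing)}")
    return [{**row, "revenue": float(row["revenue"])} for row in reader]


rows = load_report("month,region,revenue\n2018-09,EU,12.5\n")
```

Feeding it an export where `revenue` was renamed raises an error that says which column disappeared, which is about as much as the VP could ask of a script.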

Re: 2054: "Data Pipeline"

Posted: Thu Oct 04, 2018 3:39 pm UTC
by SuicideJunkie
I think the biggest problem my pipeline has had is dealing with IT changes to the network.

The fact that most of the inputs I trust aren't touched by humans helps a lot.
And of the inputs that are touched by humans, the scripts are mostly doing error checking and reporting on the problems they find.

Also, my pipeline is less of a krazy-straw and more of a 50 pack of regular straws filling the second desktop with a rainbow of status colors and countdown timers.

Re: 2054: "Data Pipeline"

Posted: Thu Oct 04, 2018 3:44 pm UTC
by Flumble
Zamfir wrote:I once stumbled on the following, written by some seemingly sane and highly respected venture capital guy:
... But really, that monthly PowerPoint status report should be a live SaaS dashboard ...


That part of the quote absolutely makes sense to me: do all the information processing on the server and export as little info as feasible. (and of course only expose that dashboard to the internal network, preferably only to the VP's account) That VP is a walking liability with a detailed CSV on his laptop that he takes to conventions and probably has the password 'welcome1!'.

Re: 2054: "Data Pipeline"

Posted: Thu Oct 04, 2018 4:09 pm UTC
by Zamfir
If it works, that would be fine. But it's going to break, like the comic says. And then it's a black box.

Programming people underestimate this power of excel: regular people understand it.

Re: 2054: "Data Pipeline"

Posted: Thu Oct 04, 2018 5:05 pm UTC
by cellocgw
Zamfir wrote:If it works, that would be fine. But it's going to break, like the comic says. And then it's a black box.

Programming people underestimate this power of excel: regular people understand it.


No, regular people think they understand it. In reality, they understand just enough to produce results which look great but are often wrong.

Re: 2054: "Data Pipeline"

Posted: Thu Oct 04, 2018 5:20 pm UTC
by Zamfir
Nah, there are lots of people who make competent use of excel, but who could not write a simple script, let alone some server-based app with all the surrounding complications.

Re: 2054: "Data Pipeline"

Posted: Thu Oct 04, 2018 5:30 pm UTC
by Viqsi
That's so totally us. Me as the ponytail, my father with the laptop, and our supervisor with the hat. (He doesn't wear a hat, but I do wear my hair in a ponytail. And yes, my father and I are both on the same development team.)

Really, all Laptop Guy has to do is throw in something about how it's software built for folks who know what they're doing rather than for end users (probably with some sort of meretricious reference to the controls of a fighter plane) and the comparison becomes exact.

Re: 2054: "Data Pipeline"

Posted: Thu Oct 04, 2018 10:40 pm UTC
by Old Bruce
SuicideJunkie wrote:... a 50 pack of regular straws filling the second desktop with a rainbow of status colors and countdown timers.

I want to work where you work and I would do crazy amounts of drugs all day.

Re: 2054: "Data Pipeline"

Posted: Thu Oct 04, 2018 11:32 pm UTC
by Tub
cellocgw wrote:Contest Time!

Write a one-line shell command (80 char or less) using all of grep, sed, and awk that actually does something recognizable.

Prizes will be given on the basis of both style and output. Points taken off if the output is useful. Judges' determination of "useful" is final.

Here's one:

Code: Select all

cat .bash_history | grep grep | awk '/awk/' | sed -e 's/sed/sed/;t;d' | sort -u

Purpose: find candidates for submission to this contest
Usefulness: Yields no results on all logins I've tried, so the output clearly isn't useful. Even if there were output, its use is questionable.

Re: 2054: "Data Pipeline"

Posted: Sat Oct 06, 2018 4:09 am UTC
by CatCube
cellocgw wrote:
Zamfir wrote:If it works, that would be fine. But it's going to break, like the comic says. And then it's a black box.

Programming people underestimate this power of excel: regular people understand it.


No, regular people think they understand it. In reality, they understand just enough to produce results which look great but are often wrong.


What will implementing this as a software black box fix? If you have a programmer implement a bad model, then now you have a bad model that literally nobody can inspect, but with a shinier interface. Plus, if it becomes apparent that something needs to change, instead of the user being able to do it, now you have to have a programmer do it, who may not be involved in the problem on a day-to-day basis and will have to relearn it. Or you make your shiny interface the lord and master of your organization, and everybody else is reduced to "Computer says no."

I'm a structural engineer, and I'm always pissed off when I have to move my work from Excel to Python (usually because of size limitations). In Excel I can follow the logic line-by-line in a convenient tabular format, whereas when I have Python chewing on it I have to struggle with the debugger and its rather limited view of the internal state of the program and data to follow what's going on. (This is separate from dealing with finite element programs, which can make things easy so long as you understand the modeling assumptions--and don't make mistakes subtle enough to lead you astray.)

Python is obviously still better than not doing the task due to Excel's limitations, but it's frustrating compared to being able to see what's going on. At the end of the day, I'm trying to use my computer as a computing machine. That is, I just want it to do the arithmetic bitchwork for me. Excel is great for that.

Re: 2054: "Data Pipeline"

Posted: Sat Oct 06, 2018 5:28 pm UTC
by Zamfir
I'm a structural engineer, and I'm always pissed off when I have to move my work from Excel to Python (usually because of size limitations), because in Excel I can follow the logic line-by-line in a convenient tabular format, where when I have Python chewing on it I have to struggle with the debugger and its rather limited vie

Yeah, I know exactly this problem. I partially tackle this by running Python in Spyder, a MATLAB-like IDE with a variable viewer etc. It encourages MATLAB-style programming where everything and its mother goes to the global scope, for easy inspection. But it's still Python, so it's much easier than in MATLAB to switch to proper limited scoping when needed.

The other aid for me is Pandas, a table-based data library. You can use it for "excel" style work, where you add a new column for every derived variable.

Re: 2054: "Data Pipeline"

Posted: Mon Oct 08, 2018 11:42 am UTC
by Hafting
Tub wrote:Here's one:

Code: Select all

cat .bash_history | grep grep | awk '/awk/' | sed -e 's/sed/sed/;t;d' | sort -u

Purpose: find candidates for submission to this contest
Usefulness: Yields no results on all logins I've tried, so the output clearly isn't useful. Even if there were output, its use is questionable.


A slight modification turns up at least one example:

Code: Select all

history|grep grep | awk '/awk/' | sed -e 's/sed/sed/;t;d' | sort -u