Is linguistics a real science?

Eebster the Great
Re: Is linguistics a real science?

It's worth pointing out that what a "p-value" means is just the probability of getting results at least that extreme if the null hypothesis is true. Very low p-values would be unlikely if the null hypothesis were true, so they suggest it is false. If your p-value is below your α (basically your cutoff), you reject the null hypothesis.

So even if all your statistics were done correctly, all they would demonstrate would be that the null hypothesis was (very probably) false. In the first case, your null hypothesis was a completely random assortment of letters (save the assumption that words are at least 2 letters long). In the second case, your null hypothesis is that rivers are given Croatian words starting in k(v)r no more often than Croatian words in general. This is a better null hypothesis, but even if it's false, that doesn't do very much to show your hypothesis is true. Unless you have specific evidence for your purported root, it makes no sense to assert it. There are other possibilities that also need to be rejected, such as the PIE root meaning "cut" from above.

gmalivuk
GNU Terry Pratchett
Re: Is linguistics a real science?

Yeah, basically the null hypothesis is, "There's nothing unusual about this observation," and the alternate hypothesis is, "There is something unusual about it." So even if you're correct to conclude that there's something unusual, that doesn't get you to any particular hypothesis.

Eebster the Great wrote:all they would demonstrate would be that the null hypothesis was (very probably) false.

They wouldn't even do that, really. A low p-value just means that if the null hypothesis were true, this would be a very unlikely observation. If lottery balls are chosen randomly, then any particular number winning will be extremely unlikely. And yet, some particular number must be chosen.

The lottery also gives a good example of why coming up with statistical tests after you see all the data can be problematic.
Randomly pulling 16, 19, 25, 32, 49, 18 in the Powerball lottery has a likelihood of slightly better than 1 in 300,000,000. But if you picked those numbers to "test" after seeing that those were in fact yesterday's numbers, the tiny likelihood tells you exactly nothing about any hypotheses about how the numbers are chosen.
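For anyone who wants to check the "1 in 300,000,000" figure, here is a minimal sketch. It assumes the Powerball format of 5 white balls drawn from 69 plus 1 red ball drawn from 26, which is not stated in this thread; the constant and function names are made up for illustration.

```python
from math import comb

# Assumed game format (not stated in the thread): 5 white balls drawn
# from 69 without regard to order, plus 1 red Powerball drawn from 26.
WHITE_POOL, WHITE_DRAWN, RED_POOL = 69, 5, 26


def jackpot_combinations() -> int:
    """Number of equally likely jackpot outcomes under the assumed format."""
    return comb(WHITE_POOL, WHITE_DRAWN) * RED_POOL


total = jackpot_combinations()
print(f"Any one specific ticket wins with probability 1 in {total:,}")
```

Every ticket has the same tiny probability, which is why the probability of yesterday's numbers, computed after the fact, says nothing about how the numbers are chosen.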

FlatAssembler takes this into account somewhat when considering that any of the other consonant pairs would be equally surprising if it showed up 5 or 6 times at the start of river names, but I'd say that still doesn't go far enough. For example, wouldn't consonant pairs at the end of the words also be surprising? After all, some languages (like English) usually put the word "river" after the proper name rather than before. What about in the middle of the word? If we're starting with the assumption that most river names are borrowed from other, possibly dead languages, we can't really make any assumptions about where morphemes go, and maybe the word for "flow" is an infix.

This is the problem with anomaly hunting. When you don't say, ahead of time, what will count as an "anomaly", then you have to be extremely careful about what you count after the fact.

And since you weren't careful, your initial data and calculations are very nearly worthless. If you want to provide evidence for a "k-r" root in some old language, look for the same sequence in river names outside Croatia.
gmalivuk
GNU Terry Pratchett
Re: Is linguistics a real science?

Suppose you flip 10 (ordered) coins 10 times, and you're surprised to see that one of those times was HTHTHTHTHT.

This particular sequence has a likelihood of 1/1024 each time, and a likelihood of about 0.97% of happening at least once in 10 rounds.
So the most naive p-value is 0.0097. This is very small so we should reject the null hypothesis that the coins are totally random!

What FlatAssembler does is something like acknowledging that THTHTHTHTH would be equally surprising, so really the likelihood is 1/512 each time.
The correct p-value is therefore 0.019. This is still very small so we should reject the hypothesis that the coins are totally random!

However, HHHHHHHHHH and TTTTTTTTTT would also be quite surprising, no? (p = 0.038)

HHTTHHTTHH and TTHHTTHHTT as well (p = 0.057)

HHHHHTTTTT and TTTTTHHHHH (p = 0.075)

HTTHTTHTTH and THHTHHTHHT (0.093)
HTTTTTTTTH and THHHHHHHHT (0.111)
and so on
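The figures above can be reproduced with a short sketch. It assumes 10 independent rounds of 10 fair coin flips and treats the p-value as the chance that at least one round lands on any of the sequences counted as "surprising"; the function name is hypothetical.

```python
def p_at_least_once(num_patterns: int, flips: int = 10, rounds: int = 10) -> float:
    """Chance that at least one of `rounds` rounds of `flips` fair coin
    flips matches any of `num_patterns` distinct target sequences."""
    per_round = num_patterns / 2 ** flips
    return 1 - (1 - per_round) ** rounds


# Each step admits two more sequences as "surprising" after the fact.
for num_patterns in (1, 2, 4, 6, 8, 10, 12):
    print(f"{num_patterns:2d} surprising sequences: p = {p_at_least_once(num_patterns):.4f}")
```

The p-value keeps climbing as more sequences are admitted, which is the whole point: the answer depends on how many outcomes you would have counted as anomalies.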
Eebster the Great
Re: Is linguistics a real science?

gmalivuk wrote:
Eebster the Great wrote:all they would demonstrate would be that the null hypothesis was (very probably) false.

They wouldn't even do that, really. A low p-value just means that if the null hypothesis were true, this would be a very unlikely observation. If lottery balls are chosen randomly, then any particular number winning will be extremely unlikely. And yet, some particular number must be chosen.

Yes I know, I kind of turned around the statement there. You should usually adjust your belief based on this new information, but it may well be that you still determine it is more likely that you got those extreme results by chance than that the null hypothesis is false.

And it is certainly true that if you keep hunting for anomalous data from a large set, you are bound to find some. It's like finding words in the Bible by finding letters equally spaced by other letters.

FlatAssembler
Re: Is linguistics a real science?

gmalivuk wrote:If nothing else, it's not really honest to count the repeated name twice.

Why not? The fact that there are two rivers named "Karašica" in Croatia does indeed strongly suggest that "Karašica" meant "river", rather than something else.

Eebster the Great wrote:This is a better null hypothesis

Well, like I've said, to me it seems that a much more reasonable null-hypothesis would be to assign a high a-priori probability to a hypothetical d-n-pattern (as in the Proto-Indo-European for "river", *danu), a hypothetical h-p-pattern (as in Proto-Indo-European for "body of water", *h2ep), a hypothetical p-l-pattern (as in the Proto-Indo-European for "to flow", *plew), and a hypothetical v-d-pattern (as in the Croatian for "water", "voda", and Proto-Indo-European for "water", *wed). Because that's what the mainstream linguistic theories actually predict, right? There would really be nothing unusual if six river names matched the p-l-pattern or the d-n-pattern, right? Yet, we have more river names matching the k-r-pattern than any of those patterns mainstream linguistics predicts would occur, and that's what requires an explanation.

gmalivuk
GNU Terry Pratchett
Re: Is linguistics a real science?

FlatAssembler wrote:
If nothing else, it's not really honest to count the repeated name twice.

Why not? The fact that there are two rivers named "Karašica" in Croatia does indeed strongly suggest that "Karašica" meant "river", rather than something else.

I already explained why not in the part you didn't quote.

Karaš is the name of one river and part of the name of two other rivers in Croatia, the name of a river in Serbia, and part of the name of a region and a city and an ethnic group in Romania. That strongly suggests that it means *something*, but not necessarily "river". Plus even if Karaš means "river" that doesn't get you to "k-r" means "flow".

(Incidentally, one of the Wikipedia suggestions for Karašica is that it comes from the Turkish for "black water". There are nine Blackwater Rivers in the United States, so it doesn't seem too crazy to suppose there are some similar names elsewhere. And the fact that a river is remarkably clear now for much of its length doesn't mean it's clear everywhere or always has been.)
The lack of those other roots just goes to suggest that people don't generally use the word "river" in the names of their rivers. But none of those are null hypotheses anyway.

Look, handwaving and bad statistics aside, if you want to convince anyone that there's something to your k-r hypothesis, show that it's disproportionate outside of Croatia as well. Your first data set can't provide very good evidence for a hypothesis you only came up with after looking at it. All it did was *suggest* a hypothesis. Now check Serbian rivers to see if there's more there than a random anomaly.

(Of course even if there are "too many" rivers in other places with k-r, you'll still have to do more work to argue against the "cut" etymology, which already does a decent job of explaining why that root might be common in river names.)
FlatAssembler
Re: Is linguistics a real science?

gmalivuk wrote:Incidentally, one of the Wikipedia suggestions for Karašica is that it comes from the Turkish for "black water".

OK, then, from which ancient Turkic language? The Proto-Turkic for "water" was *sub.

gmalivuk wrote:The lack of those other roots just goes to suggest that people don't generally use the word "river" in the names of their rivers.

Or, a much simpler explanation, the Illyrian word for "river" was different from the word for "river" in other Indo-European languages. Scythians seem to have been quite fond of naming rivers after their word derived from Proto-Indo-European *danu: Danube, Dniester, Dnieper, Don, and no doubt the names of some smaller rivers.

gmalivuk wrote:if you want to convince anyone that there's something to your k-r hypothesis, show that it's disproportionate outside of Croatia as well.

Well, as you note, there is a river flowing through Serbia and Romania named "Karaš". And there is a river named "Krka" in Slovenia. Now, I haven't checked if there are more such river names, and I don't think that's too important. After all, the more river names I look at, the more likely it is that one would start with k(v)r by chance.

Anyway, I've researched this a bit. What's your measure of "uniform"? As I've tested using simple computer programs, the probability of choosing the same consonant two times in a row in an English text is 1/11, while the probability of that happening in an English word-list is 1/13. The probability of choosing the same consonant two times in a row in a Croatian text is 1/13, while the probability of that happening in a Croatian word-list is 1/14. Clearly the frequencies of consonants become more uniform once you eliminate the syntax of a language. Eliminate the morphology, and that probability will become very close to 1/20.
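The programs behind those numbers aren't shown here, so the following is only a guess at the kind of measure involved: the probability that two consonants picked independently at random (with replacement) from a text are the same, which is the sum of the squared relative frequencies. The 20-letter consonant inventory and the function name are assumptions made for illustration.

```python
from collections import Counter

# Assumed inventory of 20 consonant letters; 'y' is left out so that a
# perfectly uniform distribution gives exactly 1/20.
CONSONANTS = set("bcdfghjklmnpqrstvwxz")


def same_consonant_probability(text: str, consonants: set = CONSONANTS) -> float:
    """Probability that two consonants drawn independently (with
    replacement) from `text` are the same letter: sum of p_i squared."""
    counts = Counter(ch for ch in text.lower() if ch in consonants)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no consonants found in text")
    return sum((n / total) ** 2 for n in counts.values())


sample = "the quick brown fox jumps over the lazy dog"
print(f"roughly 1 in {1 / same_consonant_probability(sample):.1f}")
```

With 20 consonants this value cannot go below 1/20, and it only reaches 1/20 when every consonant is exactly equally frequent, so a measured 1/13 or 1/14 means the distribution is some distance from uniform.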

gmalivuk
GNU Terry Pratchett
Re: Is linguistics a real science?

FlatAssembler wrote:Now, I haven't checked if there are more such river names, and I don't think that's too important. After all, the more river names I look at, the more likely it is that one would start with k(v)r by chance.

It's not just important, it's necessary.

What you have now is an apparent anomaly that you noticed in one data set. That's not terribly interesting no matter how much you reach for explanations. Every data set will have some things that look unusual. This is especially true when you don't limit yourself in any way before you start looking. (As I said before, you likely would have noticed a consonant pair that showed up a lot at the end of the names. You also probably would have noticed a common consonant+vowel pair if it appeared a lot at the beginning or the end. And who knows how many other things would have similarly caught your attention?)
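A rough sketch of how much the unrestricted search matters, using a Bonferroni-style union bound. All the numbers are hypothetical: 100 river names, 6 hits, and the disputed uniform 1/400 chance per consonant pair are placeholders, not measured values, and the function names are made up.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summing the upper tail directly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def bonferroni_bound(single_p: float, num_patterns: int) -> float:
    """Union bound on the chance that at least one of `num_patterns`
    equally likely patterns looks this extreme."""
    return min(1.0, single_p * num_patterns)


# Hypothetical inputs, for illustration only.
n_names, hits, p_pair = 100, 6, 1 / 400
single = prob_at_least(hits, n_names, p_pair)

for label, m in [
    ("one pair named in advance", 1),
    ("any of 400 pairs at the start", 400),
    ("any of 400 pairs at the start or the end", 800),
]:
    print(f"{label}: p <= {bonferroni_bound(single, m):.2e}")
```

Every extra kind of pattern that would have caught your attention multiplies the bound, and that is before replacing the uniform 1/400 with realistic, non-uniform frequencies.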

If k-r is more common than expected in other places outside of Croatia, *that* might mean something, though there's still the question of how common we should expect it.

FlatAssembler wrote:Eliminate the morphology, and that probability will become very close to 1/20.

You have zero justification for this assumption. You just want it to be true because it would support your initial gross overestimate of how unlikely the observations are.

You can't eliminate morphology because wherever those words came from they're Croatian words *now*, and so will still tend to be influenced by the way Croatians say and write things. As I already explained, even in borrowed words there are sequences of letters that English rarely if ever uses, not by chance but because those sequences don't comply with how English works. That means there are other sequences that appear more frequently than they would randomly.

Like I said before, if you want to run the statistics for just proper nouns or just toponyms in general, go ahead and do that. I am certain you will find nothing close to uniformity, but even in a highly non-uniform distribution some things can be close to average. (In other words, though there are going to be some pairs more likely than 1 in 100 and others less likely than 1 in 1000, perhaps k-r will in fact be near 1/400.)

But whatever you do, there are two things necessary for this to be anything like real linguistic science:
1) The expectation you attach to k-r must be based on actual observations, not baseless assumptions of uniformity.
2) You have to check those expectations against a *different* set of data than the list of Croatian rivers where you noticed it first.
gmalivuk
GNU Terry Pratchett
Re: Is linguistics a real science?

This list of Serbian rivers has 177 rivers and subrivers on it. I counted 5 that begin with Kr or K-r.

At a probability of 1/60, that's p=0.175.

At your observed probability for Croatian rivers, 6/100, the likelihood of so *few* in Serbia would be 0.042.
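Those two figures look like binomial tail probabilities, so here is a minimal sketch for checking them. It assumes each of the 177 names independently starts with Kr or K-r with a fixed probability, which is a simplification, and the function names are hypothetical.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_most(k: int, n: int, p: float) -> float:
    """P(X <= k): chance of seeing k or fewer matching names."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k): chance of seeing k or more matching names."""
    return 1.0 - prob_at_most(k - 1, n, p)


n_rivers, observed = 177, 5

# Five or more matches if the per-name chance is 1/60.
print(f"P(X >= 5 | p = 1/60)  = {prob_at_least(observed, n_rivers, 1 / 60):.3f}")
# Five or fewer matches if the Croatian rate of 6/100 applied in Serbia.
print(f"P(X <= 5 | p = 6/100) = {prob_at_most(observed, n_rivers, 6 / 100):.3f}")
```

Read this way, the Serbian count is unremarkable at 1/60 and surprisingly low at the Croatian rate of 6/100.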
FlatAssembler
Re: Is linguistics a real science?

OK, then, if my etymologies are pseudoscience, then all the other etymologies of the river names, including Krahe's Old European Hydronymy, which is presented without any statistical argumentation, are even more so.

Eebster the Great
Re: Is linguistics a real science?

Are they actually presented without argument, or have you just not checked what those arguments are?

gmalivuk
GNU Terry Pratchett
Re: Is linguistics a real science?

One extremely common mark of pseudoscience is a refusal to admit when a hypothesis might be wrong...
Lots of etymology does have statistical backup, even if it's not always presented along with the etymology in whatever source you've found.

However, even if they didn't do statistics, connecting a modern word to a similar sounding root that we know really exists is very different from what you're doing, where you propose a completely new root from an unknown language based on an observation of one small data set and a refusal to use any proper statistics.

And maybe yes, some of those proposed etymologies are garbage. Your development of an additional garbage proposal doesn't help anyone.
FlatAssembler
Re: Is linguistics a real science?

And, what do you guys here think, what causes the relative frequencies of consonants?
I've also asked this question on StackExchange.

Heimhenge
Re: Is linguistics a real science?

I'd stick with my earlier comment. You could maybe conclude that similar relative frequencies of consonants are evidence of a common root for two languages. But the question of what "causes" those frequencies might not be meaningful.

Consonants are just a sub-class of invented written symbols that came after the spoken word. How do you classify a symbol as a consonant? Is it any sound that can't be expressed as a continuous waveform (like a vowel)? OK then, there's a multiplicity of sounds outside of the standard Western alphabet that also meet that definition. And probably others that haven't been used yet.

The reason why specific sounds got assigned to specific written symbols is a mystery. Why is the percussive sound "tee" assigned to the (Western) letter "T"? I don't think that question has any answer that could be evidence of some linguistic "law".

I think what you're asking here is equivalent to asking: Why is there a peak in the Gaussian distribution of word lengths at N letters? The "cause" might just be random evolution to an optimal state. It might just be the roll of the dice.

gmalivuk
GNU Terry Pratchett
Re: Is linguistics a real science?

Heimhenge wrote:
Consonants are just a sub-class of invented written symbols that came after the spoken word. How do you classify a symbol as a consonant? Is it any sound that can't be expressed as a continuous waveform (like a vowel)?

You're mixing up consonant sounds and consonant letters here. The symbols 'y' and 'h' and 'u' and 'o' can all spell both consonant sounds and vowel sounds at the beginning of English words, after all.
Eebster the Great
Re: Is linguistics a real science?

I'm having a hard time thinking of an example of O as a consonant in English words, unless you count foreign proper names like Oaxaca.

gmalivuk
GNU Terry Pratchett
Re: Is linguistics a real science?

one

(Like 'u', 'o' can spell a consonant-vowel pair of sounds, rather than being a consonant on its own like 'y'.)

Edit: I forgot about 'e' in words like "Europe" and "euphoria".
Eebster the Great
Re: Is linguistics a real science?

Ah, good one. Is A the only English letter that never represents a consonant (or consonant + vowel)?

Pfhorrest
Re: Is linguistics a real science?

That would be funny if so because the letter A evolved from what was originally a consonant.
DavidSh
Re: Is linguistics a real science?

But didn't all letters in the English alphabet derive from letters that were consonants? Phoenician script, like Hebrew or Arabic, doesn't have any vowels.

Pfhorrest
Re: Is linguistics a real science?

That is a good point.
