@nostupidquestions How does generative AI handle creoles? Does it get confused and respond like it’s one of the parent languages?
What you won’t find in the dictionary is that the difference between dialects or creoles is historically racialized. A good example is Louisiana. Technically, Cajun is a subset of Creole. But in practice, Cajuns are white and Creoles are black.
https://hnoc.org/publishing/first-draft/whats-difference-between-cajun-and-creole-or-there-one
Black dialects (in America) have historically been treated as bastardizations, pollution, or just plain ignorance of proper grammar and syntax. I witnessed the short life of “Ebonics” until it was ridiculed into oblivion. I can remember plenty of people who spoke a Southern dialect only loosely associated with English mercilessly mocking and doing impressions of AAVE.
Common parlance like “They don’t think it be that way, but it do” is a contemporary example of a Black dialect with syntax and rules as complex as any language being mocked as stupidity, ignorance, and an inability to speak English.
I’m not the racist police. If you’ve laughed at these jokes or told them, I don’t think you’re a bad person. I’ve laughed at them and spread them myself. This is just an FYI.
I know that the current generation of LLMs has a language-agnostic knowledge base, which is damn awesome, but I don’t know how the language layer works.
@whaleross what does that agnosticism mean? Don’t you need to store that knowledge somehow, in, say, a language?
Yeah, but not in one “master” language. Any knowledge it can relate across languages is available in all of them. LLMs (and computers) don’t care about the data, they just process it. Humans would translate it and compile it into an ordered encyclopedia. For an LLM it is all just an insane number of references and cross-references, available from anywhere a link has been established. The input/output in the desired language, formulation, and so on is a different part of the system.
As far as I understand it.
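One way to picture that cross-lingual linking is a shared embedding space, where words meaning the same thing land near each other regardless of language. Here is a toy sketch of that idea; the three-dimensional vectors below are invented for illustration (real models learn hundreds or thousands of dimensions from data):

```python
import math

# Made-up 3-d "embeddings"; real models learn these during training.
embeddings = {
    "dog":   (0.90, 0.10, 0.20),  # English
    "hund":  (0.88, 0.12, 0.19),  # Norwegian, same concept -> nearby vector
    "house": (0.10, 0.90, 0.30),  # unrelated concept -> distant vector
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, ~0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine(embeddings["dog"], embeddings["hund"]))   # close to 1.0
print(cosine(embeddings["dog"], embeddings["house"]))  # much lower
```

Because “dog” and “hund” sit near each other, facts attached to one are reachable from the other, without either language being the “master” copy.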
I don’t know about creoles, but with dialects, in my experience it mixes the standard language, the dialect I use with it, and other dialects. It’s probably confusing that they all use the same writing system and share some common vocabulary.
As an aside, I was surprised to see that when I wrote ChatGPT a sentence in an obscure (my own) Norwegian dialect, peppered with slang and nonstandard contractions, it actually understood me very well. It seems LLMs are really good at checking possible root words and their origins to infer meaning when they don’t recognize a word.
For example, I wrote “veitkji”, which is a contraction of “veit ikkji”, normally written “vet ikke”. ChatGPT correctly interpreted it as “some Scandinavian non-standard dialect for ‘don’t know’”.
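That fits with how LLM tokenizers handle unseen words: a word like “veitkji” that never appeared whole in training gets split into smaller known pieces. A toy greedy longest-match segmenter in that spirit (the vocabulary here is invented for illustration; real BPE/WordPiece vocabularies are learned from data):

```python
# Invented toy vocabulary; a real tokenizer's vocab has tens of thousands
# of learned subword pieces.
VOCAB = {"veit", "ikkji", "kji", "vet", "ikke"}

def segment(word):
    """Split a word into the longest known subword pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest match first
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

print(segment("veitkji"))  # ['veit', 'kji']
```

Since the piece “veit” shows up in other dialect text the model has seen, it can connect the fragments back to the standard “vet ikke” and infer the meaning.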
I am no expert, but I think it depends on how present those are in the training data. If there is barely any representation, I wouldn’t be surprised if the LLM treats it as one of the parent languages instead.