Earlier this month, I stumbled upon two articles discussing the disparities in tokenization across languages: "All languages are NOT created (tokenized) equal" and "Why is GPT-3 15.77x more expensive for certain languages?".
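To make that disparity concrete, here is a minimal sketch of my own (not taken from either article) that counts how many tokens the same short greeting costs in English, Hindi, and Nepali. It assumes the tiktoken package and the cl100k_base encoding purely as an illustration; the exact ratios will differ by tokenizer and text.

```python
# Minimal sketch: compare token counts for the same greeting across languages.
# Assumes the `tiktoken` package and the cl100k_base encoding (illustrative choice).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you?",
    "Hindi":   "नमस्ते, आप कैसे हैं?",
    "Nepali":  "नमस्ते, तपाईंलाई कस्तो छ?",
}

baseline = len(enc.encode(samples["English"]))
for lang, text in samples.items():
    n = len(enc.encode(text))
    # The ratio relative to English hints at the cost disparity the articles describe.
    print(f"{lang:8s} {n:3d} tokens  ({n / baseline:.1f}x English)")
```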
Hi, thanks for this very comprehensive write-up. Some really interesting insights. I just wish there were more citations. One I'm looking for in particular is about the claim that sub-optimal tokenization impacts accuracy. Do you have a reference for this? Or are you working on this, because evaluating the correlation between model performance and tokenization (for Nepali, for example) sounds like something worth exploring.
Hi Sharad, thank you for reading the blog. I don't have a paper that I can cite, but an experiment to prove this statistically is on my TODO list. That being said, I have personally observed GPT often mixing up Hindi inflections for Nepali, which was the basis of my original statement in the blog.
I'm not surprised about the inflections part (the curse of multilinguality; I'm personally slightly against clubbing languages together), but how does it really relate to accuracy? Do you mean that such inflections (possibly artifacts of a multilingual tokenizer) imply a decrease in downstream accuracy (close-enough but inaccurate words)?
I've been trying to find small, interesting language experiments to do, and this tokenization vs. performance eval looks like an interesting problem. If you're looking for collaborators to work on it, please let me know.
Yes, by accuracy here I mean the accuracy of generation: the words that are formed are at times completely wrong and don't exist in the vocabulary (and sometimes they are not even close enough to a root to make sense). I don't know whether this would affect other downstream tasks. I would love to collaborate if that is still an option.
Great! Please DM me your email, or whichever contact method you prefer, so we can touch base.