<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[#icodeformyभाषा: Research]]></title><description><![CDATA[Research]]></description><link>https://www.icodeformybhasa.com/s/research</link><image><url>https://substackcdn.com/image/fetch/$s_!DMde!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d92160-5f82-40ca-8233-b069f42bbba6_1080x1080.png</url><title>#icodeformyभाषा: Research</title><link>https://www.icodeformybhasa.com/s/research</link></image><generator>Substack</generator><lastBuildDate>Mon, 11 May 2026 13:58:07 GMT</lastBuildDate><atom:link href="https://www.icodeformybhasa.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[#icodeformyभाषा]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[icodeformybhasa@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[icodeformybhasa@substack.com]]></itunes:email><itunes:name><![CDATA[Shreeya]]></itunes:name></itunes:owner><itunes:author><![CDATA[Shreeya]]></itunes:author><googleplay:owner><![CDATA[icodeformybhasa@substack.com]]></googleplay:owner><googleplay:email><![CDATA[icodeformybhasa@substack.com]]></googleplay:email><googleplay:author><![CDATA[Shreeya]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Cost of Ideas]]></title><description><![CDATA[While proofreading the "How LLMs Break Down Language from Text to Tokens" section from my last blog, Nolan asked me a thought-provoking question: "Then what would the cost of ideas for ideographic languages be?" 
His question highlighted the significant differences between two writing systems: ideographic languages, where characters represent ideas, and orthographic languages, where characters represent sounds.]]></description><link>https://www.icodeformybhasa.com/p/the-cost-of-ideas</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/the-cost-of-ideas</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Sat, 30 Mar 2024 03:05:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>While proofreading the <a href="https://www.icodeformybhasa.com/i/142259391/how-llms-break-down-language-from-text-to-tokens">"How LLMs Break Down Language from Text to Tokens"</a> section from my last blog, <a href="https://www.linkedin.com/in/nolan-kramer-85536890/">Nolan</a> asked me a thought-provoking question: "Then what would the cost of ideas for ideographic languages be?" His question highlighted the significant differences between two writing systems: ideographic languages, where characters represent ideas, and orthographic languages, where characters represent sounds. This further prompted an investigation into whether this distinction affects the overall cost of ideas for large language models (LLMs).</p><p>In this blog post, we will try to answer Nolan's question and explore the following:</p><ul><li><p><strong>Ideographic vs. 
Orthographic Languages:</strong> We will explore the fundamental differences between ideographic languages like Chinese and Japanese and orthographic languages like English and Nepali.</p></li><li><p><strong>The Trade-Off between Conciseness and Digital Footprint:</strong> We will briefly explore the relationship between character count and byte count, acknowledging that ideographic languages might be more concise on paper but not necessarily in terms of digital storage.</p></li></ul><p>Additionally, we will discuss:</p><ul><li><p><strong>The "Cost of Ideas" for LLMs:</strong> We will define the "cost of ideas" in the context of LLMs as the number of tokens required to represent the same idea.</p></li></ul><p>In my <a href="https://www.icodeformybhasa.com/p/beyond-the-abcs-exploring-the-nuances">previous post</a>, I showed that large language models (LLMs) heavily optimized for English and Latin-based languages exhibit consistently higher token counts when processing non-Latin scripts. This discrepancy in token counts translates to increased computational costs for operating LLM-based applications like ChatGPT on non-Latin languages. Building on this foundation, my focus in this post will be on investigating the "Cost of Ideas" in ideographic and orthographic writing systems. The primary objective is to explore how the fundamental distinctions between these two language categories influence their digital footprint in representing and conveying concepts and ideas.</p><h2>Ideographic vs. Orthographic Languages</h2><p>Ideographic languages are writing systems where each character or symbol represents a complete word or concept. Chinese, which uses thousands of characters (called hanzi), is the most prominent example of this type of writing system. 
</p><p><strong>Key Features of Ideographic Languages:</strong></p><ul><li><p><strong>Characters Represent Ideas:</strong> Instead of corresponding to phonetic sounds, each character in an ideographic language symbolizes an entire word or concept. For example, the character <strong>&#26408;</strong> in Chinese represents the word "tree" or the concept of "wood."</p></li><li><p><strong>Extensive Character Set:</strong> To encompass a language's full vocabulary, ideographic writing systems require a vast number of characters, often numbering in the thousands or tens of thousands. The Kangxi dictionary, one of the most comprehensive Chinese dictionaries, contains <strong>over 47,000 characters.</strong></p></li><li><p><strong>Context Matters:</strong> While some characters may have multiple pronunciations or meanings, the context in which they appear often provides clues to their intended interpretation. The character <strong>&#34892;</strong> can mean "to walk" or "behavior," depending on the context.</p></li><li><p><strong>Character Composition:</strong> In some languages (like Chinese), characters can be built from simpler components that offer hints about meaning or pronunciation. The character <strong>&#26519;</strong>, meaning "forest," is composed of two instances of the character <strong>&#26408;</strong> (tree).</p></li></ul><p>In contrast, orthographic languages utilize a set of symbols, typically letters or syllabic characters, to represent the individual sounds (phonemes) that make up spoken words. These symbols are then combined to form words based on their phonetic values. The most familiar example of an orthographic language is English.</p><p><strong>Key Features of Orthographic Languages:</strong></p><ul><li><p><strong>Phoneme Representation:</strong> The letters or symbols in these writing systems correspond to the smallest units of sound (phonemes) in the spoken language. 
For example, the letter "c" represents the /k/ sound in the word "cat."</p></li><li><p><strong>Limited Symbol Set:</strong> Compared to ideographic languages, orthographic systems typically require a relatively small number of symbols to function. The English alphabet has just 26 letters.</p></li><li><p><strong>Phonetic Combination:</strong> Words are formed by combining these symbols based on their phonetic values, creating a more direct link between sound and written word. The word "book" is composed of the letters "b," "o," "o," and "k," representing the sounds /b/, /&#650;/, /k/.</p></li></ul><p>In addition, some languages, like Japanese, incorporate both ideograms (kanji characters borrowed from Chinese) and phonetic scripts (hiragana and katakana) within their writing system. For example, the Japanese word for "computer" is written as &#12467;&#12531;&#12500;&#12517;&#12540;&#12479;&#12540; (using katakana) or &#38651;&#33075; (using kanji characters).</p><h2>Universal Declaration of Human Rights as a Lens for Comparing "Cost of Ideas"</h2><p>To analyze the "cost of ideas" across ideographic and orthographic languages, we will leverage the Universal Declaration of Human Rights (UDHR) as a parallel corpus in Nepali (orthographic), English (orthographic), Japanese (hybrid with ideographic kanji and syllabic kana), and Chinese (ideographic).</p><p>The UDHR translations are maintained and overseen by the United Nations (UN), ensuring that the UDHR's articles are conveyed accurately and with semantic equivalence across all languages. 
This ensures that any differences observed in the "cost of ideas" are primarily due to the inherent characteristics of the writing systems themselves, rather than discrepancies in translation.</p><p>We will examine the parallel translations across these languages to understand how the same ideas, when represented using different writing systems, vary in terms of the cost that a digital system has to bear.</p><p><strong>Preprocessing the Data</strong></p><p>This plain-text version of the UDHR was originally prepared and hosted by the Unicode Consortium under the "UDHR in Unicode" project. Although the Unicode Consortium stopped hosting the project as of January 2024, the XML files with translations in multiple languages remain available at <a href="http://efele.net/udhr/">UDHR in XML</a>.</p><p>I pre-processed the XML files for Mandarin Chinese (Simplified), Mandarin Chinese (Traditional), English, Japanese, and Nepali. The processed dataset includes 31 rows for each language: one for the preamble and one for each of the 30 articles of the UDHR.</p><h2><strong>The Trade-Off: Conciseness vs. Digital Footprint</strong></h2><p>The trade-off between conciseness and digital footprint becomes particularly evident when comparing ideographic writing systems, like Chinese, with orthographic systems, like English or Nepali. 
Let's delve deeper into this trade-off by examining the grapheme and byte counts for the text in our dataset.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8LXg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8LXg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8LXg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png" width="1200" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33610,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8LXg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Average grapheme and byte counts across languages in the UDHR</figcaption></figure></div><h4><strong>Grapheme Count and Conciseness</strong></h4><p>Grapheme count refers to the number of characters, such as letters or ideographs, needed to represent a word or concept. Ideographic scripts like Chinese exhibit a significant advantage in conciseness. In our dataset, Traditional Chinese has an average grapheme count of 82.54, and Simplified Chinese has 82.45. In contrast, the average grapheme count for English is considerably higher at 321.70. 
This conciseness in ideographic scripts stems from their ability to convey complex ideas and concepts through a single ideographic character, reducing the need for multiple graphemes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nwRT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nwRT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 424w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 848w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 1272w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nwRT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png" width="1456" height="1312" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1312,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:299282,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nwRT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 424w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 848w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 1272w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Grapheme counts across each article of the UDHR for English and Chinese</figcaption></figure></div><h4>Byte Count and Digital Footprint</h4><p>Byte count, by contrast, represents the number of bytes required to encode the text digitally. Despite their lower grapheme counts, Traditional and Simplified Chinese texts required an average of 246.41 and 240.77 bytes, respectively, to encode their characters. This higher byte count is a consequence of the multi-byte encodings ideographic scripts require: in UTF-8, for example, each Chinese character occupies three bytes, while a basic Latin letter occupies one. Conciseness on the page thus comes at the cost of an increased digital footprint. 
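</p><p>These figures can be reproduced in miniature. The sketch below, using only Python's standard library, compares code-point counts, an approximate grapheme count (it simply skips combining marks; a full implementation would follow Unicode's grapheme-cluster rules), and UTF-8 byte counts. The sample phrases are my own illustrative picks, not drawn from the UDHR dataset.</p>

```python
import unicodedata

def text_costs(text: str) -> dict:
    """Return rough size metrics for a piece of text.

    The grapheme figure is an approximation: code points in the
    combining-mark categories (Mn/Mc/Me) are not counted, so a base
    character plus its diacritics counts as one grapheme.
    """
    graphemes = sum(
        1 for ch in text if not unicodedata.category(ch).startswith("M")
    )
    return {
        "code_points": len(text),
        "graphemes_approx": graphemes,
        "utf8_bytes": len(text.encode("utf-8")),
    }

samples = {
    "English": "human rights",  # ASCII: one byte per character
    "Chinese": "人权",  # two ideographs, three UTF-8 bytes each
    "Nepali": "मानव अधिकार",  # Devanagari: matras add bytes, not graphemes
}

for language, text in samples.items():
    print(language, text_costs(text))
```

<p>The two-grapheme Chinese phrase already costs six bytes, while the twelve-letter English phrase costs exactly twelve: the trade-off described above, in miniature.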
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tURU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tURU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 424w, https://substackcdn.com/image/fetch/$s_!tURU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 848w, https://substackcdn.com/image/fetch/$s_!tURU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 1272w, https://substackcdn.com/image/fetch/$s_!tURU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tURU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png" width="1454" height="1300" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1300,&quot;width&quot;:1454,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:295041,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tURU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 424w, https://substackcdn.com/image/fetch/$s_!tURU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 848w, https://substackcdn.com/image/fetch/$s_!tURU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 1272w, https://substackcdn.com/image/fetch/$s_!tURU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Byte counts across each article of the UDHR for English and Chinese</figcaption></figure></div><p><strong>English and the Trade-Off</strong></p><p>On average, English requires 3.9 times more graphemes than Traditional and Simplified Chinese to convey the same concepts. However, when it comes to the byte counts needed for digital encoding, the gap narrows drastically: English requires only 1.31 times more bytes than Traditional Chinese, and 1.34 times more bytes than Simplified Chinese. 
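</p><p>A quick sketch confirms these ratios from the average counts reported in this post (English averages 321.70 graphemes and 321.96 bytes per article):</p>

```python
# Average per-article grapheme and UTF-8 byte counts from the UDHR
# dataset discussed in this post, as (graphemes, bytes).
averages = {
    "English": (321.70, 321.96),
    "Traditional Chinese": (82.54, 246.41),
    "Simplified Chinese": (82.45, 240.77),
}

english_graphemes, english_bytes = averages["English"]
for name, (graphemes, utf8_bytes) in averages.items():
    if name == "English":
        continue
    print(f"English needs {english_graphemes / graphemes:.1f}x the graphemes "
          f"but only {english_bytes / utf8_bytes:.2f}x the bytes of {name}")
```

<p>Note how the roughly 3.9x grapheme gap collapses to about 1.3x in bytes once UTF-8 encoding is taken into account. 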
This highlights the trade-off: while Chinese is far more concise, requiring fewer graphemes, English benefits from a simpler encoding that needs fewer bytes per grapheme than the ideographic Chinese scripts.</p><p><strong>The Case of Japanese</strong></p><p>Japanese, which combines ideographic kanji characters borrowed from Chinese with the phonetic scripts hiragana and katakana, has an average grapheme count of 124.35 in our dataset, lower than English. However, the byte count for Japanese text jumps to 371.83, exceeding even that of English. This increase can, again, be attributed to the multi-byte encodings Japanese characters require.</p><p>In essence, while ideographic scripts like Chinese and Japanese offer conciseness in terms of grapheme counts, they often require more bytes to encode digitally, resulting in a trade-off between conciseness and digital footprint. This trade-off has implications for tasks such as text storage, transmission, and processing within language technologies and applications.</p><h3>Script Intricacies and Their Impact on Digital Footprint</h3><p>While ideographic scripts like Chinese exhibit a clear trade-off between conciseness in grapheme counts and an increased digital footprint due to their complex character encodings, the case of Nepali presents a different challenge.</p><p>Even though Nepali, like English, is an orthographic language, its text characteristics in our dataset differ significantly in terms of grapheme and byte counts. While Nepali uses far fewer graphemes on average (194.80) than English (321.70), this efficiency stems from the unique features of the Devanagari script. Unlike the Latin script, where a syllable often requires separate letters for its consonant and vowel sounds, Devanagari generally packs a whole syllable into a single character. This is because Devanagari consonants typically carry an inherent vowel sound, a characteristic the Latin script lacks, which allows Nepali text to be represented with fewer graphemes on average than its English counterpart.</p><p>However, the byte count tells a different story. Nepali text required a staggering 759.09 bytes on average to encode digitally, over 2.3 times the 321.96 bytes needed for English text. This disproportionately high byte count, despite Nepali's lower grapheme count, highlights the complexity involved in digitally encoding the Devanagari script's intricate system of consonant clusters, vowel diacritics, and combining characters.</p><h3><strong>The "Cost of Ideas" for LLMs</strong></h3><p>As explored in <a href="https://www.icodeformybhasa.com/p/beyond-the-abcs-exploring-the-nuances#%C2%A7how-llms-break-down-language-from-text-to-tokens">my previous work</a>, the number of tokens required to represent ideas in LLMs can vary significantly across languages, depending largely on how each model's tokenizer was trained. While the inherent characteristics of a language influence the number of graphemes needed to represent ideas, the tokenization method plays a crucial role in determining the actual token counts within the LLM. If the tokenizer is trained on a diverse dataset that includes a good representation of ideographic languages like Chinese, it can potentially learn to tokenize these languages more efficiently. 
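</p><p>To see why byte counts matter for token counts, consider the worst case: a tokenizer whose vocabulary has no coverage of a script degrades to roughly one token per UTF-8 byte, as byte-level BPE does for unseen text. The sketch below models only that crude bound; it is not a real tokenizer run, and the phrases (the openings of UDHR Article 1) are just illustrations.</p>

```python
def byte_fallback_tokens(text: str) -> int:
    """Worst-case token count if every UTF-8 byte becomes one token,
    as in byte-level BPE with no learned merges for the script."""
    return len(text.encode("utf-8"))

# The opening of UDHR Article 1 in English and Chinese.
english = "All human beings are born free"
chinese = "人人生而自由"

print(byte_fallback_tokens(english))  # 30 bytes, so at worst 30 tokens
print(byte_fallback_tokens(chinese))  # 6 characters become 18 byte-tokens
```

<p>A tokenizer trained with good coverage of Chinese can instead map whole characters, or even multi-character words, to single tokens, collapsing that worst-case 18 towards 6 or fewer. 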
This can lead to lower token counts and a cost advantage for representing ideas in the LLM.</p><p>You can find the <a href="https://huggingface.co/spaces/shreeyad/tokenizers-multilingual">visualizer here</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q-qQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q-qQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 424w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 848w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 1272w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q-qQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png" width="1456" height="436" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:436,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:284476,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q-qQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 424w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 848w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 1272w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Tokenizing Article 1 in English and Chinese with the XLM-RoBERTa tokenizer</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UTMY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UTMY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 424w, https://substackcdn.com/image/fetch/$s_!UTMY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 
848w, https://substackcdn.com/image/fetch/$s_!UTMY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 1272w, https://substackcdn.com/image/fetch/$s_!UTMY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UTMY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png" width="1456" height="422" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:422,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:286448,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UTMY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 424w, https://substackcdn.com/image/fetch/$s_!UTMY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 848w, 
https://substackcdn.com/image/fetch/$s_!UTMY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 1272w, https://substackcdn.com/image/fetch/$s_!UTMY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tokenizing Article 1 in English and Chinese using NLLB Tokenizer</figcaption></figure></div><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vvi-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vvi-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 424w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 848w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 1272w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vvi-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png" width="1456" height="432" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:432,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:287963,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vvi-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 424w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 848w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 1272w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tokenizing Article 1 in English and Chinese using GPT-4 Tokenizer</figcaption></figure></div><p>If the tokenizer is trained on a diverse dataset that includes a good representation of ideographic languages like Chinese, it can potentially learn to tokenize these languages more efficiently, resulting in lower token counts and, consequently, a cost advantage for representing ideas within the LLM.</p><p>Conversely, a tokenizer trained on data skewed towards certain languages or writing systems may struggle to tokenize other languages optimally. This can result in higher token counts and increased costs for representing ideas in those languages. 
This seems to be the case with GPT-4 tokenization, where it exhibits sub-optimal performance when tokenizing texts in non-Latin languages.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wcAg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wcAg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wcAg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png" width="1200" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:37636,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!wcAg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Average Token counts for XML-Roberta, NLLB and GPT-4 in UDHR</figcaption></figure></div><p>This observation highlights the importance of carefully curating the training dataset, as well as tailoring the tokenization process when developing large language models. By ensuring that the tokenizer is exposed to a diverse range of languages, including ideographic scripts, during the training process, LLMs can potentially leverage the inherent advantages of certain writing systems. For example, they can exploit the compact representation of ideas offered by ideographic languages like Chinese. 
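To make the byte arithmetic behind these token counts concrete, here is a minimal, self-contained sketch (not the code used in this study; the sample sentences are illustrative stand-ins, not the UDHR text) comparing character counts with UTF-8 byte counts, the worst-case token count for a byte-level BPE tokenizer that has learned no merges for a script:

```python
# Compare how many Unicode characters vs. UTF-8 bytes the "same idea" needs
# in different scripts. The byte count is the worst-case token count for a
# byte-level BPE tokenizer with no learned merges for that script.
samples = {
    "English": "All human beings are born free",
    "Chinese": "人人生而自由",
    "Nepali": "सबै मानिस जन्मजात स्वतन्त्र हुन्",
}

for lang, text in samples.items():
    n_chars = len(text)                  # Unicode characters
    n_bytes = len(text.encode("utf-8"))  # bytes = worst-case byte-level tokens
    print(f"{lang}: {n_chars} chars, {n_bytes} bytes")
```

Chinese packs the idea into the fewest characters, but each CJK character costs three bytes in UTF-8, so the tokenizer's learned merges, not the script alone, determine the final token count.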
Ultimately, the tokenization method and the quality of the training data can significantly impact the cost and efficiency of representing ideas across different languages within large language models.</p><h3>Acknowledgements</h3><p><a href="https://www.linkedin.com/in/nolan-kramer-85536890/">Nolan Kramer</a>, for not just asking the question that became the basis of this post but also for the discussions throughout the time I was working on this project.</p><p><a href="https://www.linkedin.com/in/gwendolyngillingham">Gwendolyn Gillingham</a>, for helping me with the study and providing the idea of using the UDHR dataset.</p>]]></content:encoded></item><item><title><![CDATA[Beyond the ABCs: Exploring the nuances of tokenization in diverse languages]]></title><description><![CDATA[Earlier this month, I stumbled upon two articles that discussed the disparities in tokenization among languages titled "All languages are NOT created (tokenized) equal" and &#8220;Why is GPT-3 15.77x more expensive for certain languages?&#8221;.]]></description><link>https://www.icodeformybhasa.com/p/beyond-the-abcs-exploring-the-nuances</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/beyond-the-abcs-exploring-the-nuances</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Wed, 13 Mar 2024 03:40:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Earlier this month, I stumbled upon two articles that discussed the disparities in tokenization among languages titled "<a href="https://www.artfish.ai/p/all-languages-are-not-created-tokenized">All languages are NOT created (tokenized) equal</a>" and &#8220;<a href="https://denyslinkov.medium.com/why-is-gpt-3-15-77x-more-expensive-for-certain-languages-2b19a4adc4bc">Why is GPT-3 15.77x more expensive for certain languages?</a>&#8221;. This piqued my interest and motivated me to conduct further investigations on my own.
</p><p>In this article, I'll discuss Byte-Pair Encoding (BPE) based tokenization and the disparities in the tokenization process across different languages. Using the Indo-European language family as a case study, I will show how these discrepancies arise <strong>not</strong> from inherent language family differences but rather from the training data and the representation of characters in Unicode for each language. In addition, I will:</p><ol><li><p>Explore GPT-4 vocab and compare it to XML-RoBERTa and NLLB-200-distilled-600M.</p></li><li><p>Explore token length distribution for Indo-European languages: English, French, Spanish, Hindi, and Nepali.</p></li><li><p>Explore the relationship between grapheme counts and token lengths across the languages.</p></li><li><p>Compare the speed of tokenization for the three tokenizers across the five languages mentioned above.</p></li></ol><h3>How LLMs break down language from text to tokens</h3><p><em><a href="https://www.youtube.com/watch?v=zduSFxRajkE">Let&#8217;s build the GPT Tokenizer</a> by Andrej Karpathy was very helpful in understanding the tokenizers used by LLMs.</em></p><p>Tokenization is a fundamental process that involves breaking down a
text into smaller units called tokens, typically words or subwords. LLMs like GPT-4 utilize a technique called byte pair encoding (BPE) for tokenization. It iteratively merges the most frequently occurring pairs of consecutive characters into single units, forming a dynamic vocabulary that adapts to the unique characteristics of the training data. This approach enables LLMs to handle rare words effectively and improves their computational efficiency compared to traditional word-based methods. </p><p>In addition, instead of treating text as sequences of individual characters, GPT-4 uses <strong>byte-level BPE</strong> for tokenization and leverages the properties of UTF-8 encoding, which encodes each Unicode <a href="https://en.wikipedia.org/wiki/Code_point">code point</a> as a sequence of one to four bytes. </p><h4>Byte-Level BPE in GPT Models</h4><p>By working with bytes instead of characters, these models achieve the following advantages, in addition to dynamic vocabulary building:</p><ol><li><p><strong>Compact Vocabulary:</strong> The tokenizer starts with a base vocabulary of 256 tokens, one for each possible byte value in UTF-8 encoding. This small base vocabulary translates to computational efficiency and faster processing.</p></li><li><p><strong>Universal Character Representation:</strong> This ensures all characters, regardless of their origin, can be represented using a combination of bytes, effectively eliminating the need for "unknown tokens." This allows the models to handle diverse text from various languages and writing systems seamlessly.</p></li></ol><h4>Decoding GPT-4 vocab</h4><p>To understand the discrepancies discussed above, I first looked into the vocab used by GPT-4. Tokens in the original vocab file <a href="https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken">cl100k_base.tiktoken</a>, used by the <code>cl100k_base</code> tokenizer (the BPE tokenizer behind GPT-4), <strong>are encoded in base64</strong>. 
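As an illustration of decoding such entries (the ranks below are made up for the example; only the one-entry-per-line "&lt;base64-token&gt; &lt;rank&gt;" layout reflects the published vocab file), each line can be decoded with Python's standard library:

```python
import base64

# Illustrative vocab lines in the "<base64-token> <rank>" layout.
# The ranks here are invented for the example, not taken from the real file.
sample_lines = [
    "aGVsbG8= 1000",  # decodes to the complete UTF-8 string "hello"
    "4KQ= 2000",      # decodes to bytes 0xE0 0xA4, an incomplete UTF-8 sequence
]

for line in sample_lines:
    b64_token, rank = line.split()
    raw = base64.b64decode(b64_token)
    try:
        print(rank, raw.decode("utf-8"))
    except UnicodeDecodeError:
        # Partial byte sequences surface exactly as decoding errors like this.
        print(rank, raw, "(not a valid UTF-8 sequence on its own)")
```

The second entry shows how a vocab token can be a byte fragment rather than a whole character, which is what produces the decoding errors described below.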
I converted the vocabulary to UTF-8 for my analysis. Some tokens raised decoding errors because they contain incomplete byte sequences, highlighting the limitations of byte-level BPE in handling uncommon text and writing systems other than Latin.</p><ul><li><p>The decoded vocabulary comprises 70,988 entries containing only Latin characters. This suggests a potential bias towards Latin-based languages in GPT-4's training data.</p></li><li><p>There are 29,268 entries containing at least one non-Latin character. This indicates that the model was exposed to other languages during training. </p></li><li><p>Among these non-Latin entries, 803 entries contain partial byte sequences. </p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hDVI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hDVI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png 424w, https://substackcdn.com/image/fetch/$s_!hDVI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png 848w, https://substackcdn.com/image/fetch/$s_!hDVI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png 1272w, 
https://substackcdn.com/image/fetch/$s_!hDVI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hDVI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png" width="1422" height="1144" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1144,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:213834,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hDVI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png 424w, https://substackcdn.com/image/fetch/$s_!hDVI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png 848w, https://substackcdn.com/image/fetch/$s_!hDVI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png 1272w, 
https://substackcdn.com/image/fetch/$s_!hDVI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">GPT-4 Tokenization Visualization from <a href="https://platform.openai.com/tokenizer">Open AI&#8217;s Tokenizer playground</a></figcaption></figure></div><h4>Limitations in representing uncommon texts and other writing systems</h4><p>While byte-level BPE effectively eliminates the need for unknown tokens with a compact vocabulary, there are 
some limitations, especially in representing uncommon texts and texts in writing systems other than Latin.</p><p>Despite universal character representation, byte-level BPE might struggle to tokenize uncommon texts not seen during training. For extremely rare character combinations or characters from under-represented writing systems, it might resort to suboptimal tokenization, such as breaking the sequence down into individual bytes, <strong>which can impact accuracy</strong>.</p><p>English letters are assigned a one-byte encoding in UTF-8. However, this is not true for all languages; some languages use multiple bytes per character. Hindi and Nepali are examples of such languages. Both use the Devanagari script, which has a larger character set than the basic Latin alphabet used in English, so these languages need more unique symbols to represent their characters. UTF-8 encodes characters using a variable number of bytes depending on their code point: ASCII characters take one byte, while characters further along in Unicode, including all of Devanagari, take two, three, or even four bytes. Since a byte-level BPE model initially treats each byte as a separate token, a letter in languages like Hindi or Nepali can be broken down into multiple tokens, potentially impacting the model's understanding and generation capabilities. The impact of this process on the model&#8217;s understanding is out of the scope of this article.</p><p>Let's explore how byte-level BPE tokenization can lead to the issue I discussed, using the Nepali word "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;" (sombaar, meaning "Monday") as an example.</p><ol><li><p><strong>Unicode Code Points:</strong> The word "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;" is represented by the following Unicode code points in hexadecimal:</p><pre><code><code>&#2360;: 0x0938
&#2379;: 0x094B
&#2350;: 0x092E
&#2348;: 0x092C
&#2366;: 0x093E
&#2352;: 0x0930</code></code></pre></li><li><p><strong>UTF-8 Encoding:</strong> When encoded using UTF-8, the word "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;" becomes the following byte sequence:</p><pre><code><code>&#2360;: 0xE0 0xA4 0xB8
&#2379;: 0xE0 0xA5 0x8B
&#2350;: 0xE0 0xA4 0xAE
&#2348;: 0xE0 0xA4 0xAC
&#2366;: 0xE0 0xA4 0xBE
&#2352;: 0xE0 0xA4 0xB0</code></code></pre></li></ol><ol start="3"><li><p><strong>Byte-based BPE Tokenization:</strong> During training, BPE can, for example for the character <code>&#2348;</code>, merge the bytes <code>0xE0 0xA4</code> into a single token and leave <code>0xAC</code> as a separate token, depending on the data it has seen. This causes the vocab to contain byte sequences that do not make up a valid code point. So let&#8217;s assume that after several iterations we have the following vocabulary:</p><pre><code><code>0xE0 0xA4 0xB8
0xE0 0xA5 0x8B
0xE0 0xA4 0xAE
0xE0 0xA4 --&gt; incomplete
0xAC --&gt; incomplete
0xE0 0xA4 0xBE
0xE0 0xA4 0xB0</code></code></pre></li><li><p><strong>Tokenization of "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;":</strong> When the tokenizer tries to tokenize the word "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;", it would then generate the following sequence of tokens:</p><pre><code><code>['0xE0 0xA4 0xB8', '0xE0 0xA5 0x8B', '0xE0 0xA4 0xAE', '0xE0 0xA4', '0xAC', '0xE0 0xA4 0xBE', '0xE0 0xA4 0xB0']
Decoding at token id level:
['&#2360;', '&#2379;', '&#2350;', '&#65533;', '&#65533;', '&#2366;', '&#2352;']</code></code></pre><p><br>When decoding the tokens individually, the incomplete byte sequences raise a Unicode decoding error, and by default they are rendered as the Unicode replacement character (&#65533;). See more on this <a href="https://docs.python.org/3/library/stdtypes.html#bytes.decode">here</a>.</p><p><strong>Note:</strong> A slightly different case is when BPE combines byte sequences belonging to multiple characters into a single vocabulary entry, which causes a similar issue.</p><pre><code><code>Decoding at token level:
['&#2360;', '&#2379;', '&#2350;', '&#65533;', '&#65533;', '&#2366;', '&#2352;']
Decoding at input level:
&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;</code></code></pre><p>However, when you decode the entire sequence of tokens together, the tokenizer can correctly reconstruct the original word "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;" by combining the individual byte sequences represented by each token.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ey4g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ey4g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png 424w, https://substackcdn.com/image/fetch/$s_!Ey4g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png 848w, https://substackcdn.com/image/fetch/$s_!Ey4g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!Ey4g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ey4g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png" width="1412" height="1154" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1154,&quot;width&quot;:1412,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:318730,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ey4g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png 424w, https://substackcdn.com/image/fetch/$s_!Ey4g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png 848w, https://substackcdn.com/image/fetch/$s_!Ey4g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!Ey4g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">OpenAI&#8217;s tiktoken tokenizer visualizer for Nepali. Note the several decoding issues here: many of these tokens are represented by incomplete UTF-8 byte sequences. Also note that the number of tokens is greater than the number of characters, because some characters are encoded as multiple bytes and, for some of these byte sequences, the tiktoken tokenizer treats each byte as a separate token.</figcaption></figure></div><h4>Factors influencing high invalid byte sequences in vocab </h4><p>The quality and size of the training data in a particular language can lead the algorithm to learn a sub-optimal vocabulary. Languages with less diverse or smaller training datasets may exhibit higher rates of invalid byte sequences due to insufficient coverage of character combinations or linguistic phenomena. 
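The per-token decoding failure walked through above can be reproduced in a few lines of Python. This is a minimal sketch: the token split below mirrors the hypothetical vocabulary from the example, not the output of any real tokenizer.

```python
# Hypothetical byte-level token split for "सोमबार" (sombaar), mirroring the
# vocabulary sketched above: "ब" (0xE0 0xA4 0xAC) is split across two tokens.
tokens = [
    b"\xe0\xa4\xb8",  # स
    b"\xe0\xa5\x8b",  # ो
    b"\xe0\xa4\xae",  # म
    b"\xe0\xa4",      # incomplete prefix of ब
    b"\xac",          # dangling continuation byte of ब
    b"\xe0\xa4\xbe",  # ा
    b"\xe0\xa4\xb0",  # र
]

# Decoding each token individually: incomplete sequences become U+FFFD (�).
per_token = [t.decode("utf-8", errors="replace") for t in tokens]
print(per_token)  # ['स', 'ो', 'म', '�', '�', 'ा', 'र']

# Decoding the concatenated byte sequence reconstructs the original word.
print(b"".join(tokens).decode("utf-8"))  # सोमबार
```

Decoding the full byte stream at once works because the incomplete prefix and the dangling continuation byte are adjacent, so together they form the valid sequence for "ब".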
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BWQo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BWQo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png 424w, https://substackcdn.com/image/fetch/$s_!BWQo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png 848w, https://substackcdn.com/image/fetch/$s_!BWQo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png 1272w, https://substackcdn.com/image/fetch/$s_!BWQo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BWQo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png" width="1412" height="580" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:580,&quot;width&quot;:1412,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52875,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BWQo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png 424w, https://substackcdn.com/image/fetch/$s_!BWQo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png 848w, https://substackcdn.com/image/fetch/$s_!BWQo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png 1272w, https://substackcdn.com/image/fetch/$s_!BWQo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">No decoding errors for Latin characters, which have one-byte sequences in UTF-8 encoding</figcaption></figure></div><p>In addition to the quality and size of training data, some of the factors that influence high invalid byte sequences in vocab are:</p><ol><li><p><strong>Script Complexity: </strong>Languages with more complex scripts, such as those with non-Latin scripts like Devanagari, Thai, or Chinese characters, may have a higher likelihood of invalid byte sequences being represented in the vocab. 
These scripts often have a larger number of characters and more complex character compositions, leading to a wider range of possible byte sequences and potential challenges in tokenization.</p></li><li><p><strong>Character Frequency: </strong>Characters that are less frequent in the training data may have their byte sequences split more frequently during merging, increasing the likelihood of incomplete tokens.</p></li><li><p><strong>Word Morphology:</strong> Languages with rich morphology, such as agglutinative languages, may exhibit a larger number of morphemes or affixes, leading to more opportunities for byte sequences to be split during tokenization.</p></li></ol><h3>Can training with more multi-lingual data solve this?</h3><p>Looking at the vocab, we can infer that GPT-4 was heavily optimized towards English. In this section, I will compare the GPT-4 tokenizer with two other byte-based BPE tokenizers trained on multilingual data: <a href="https://huggingface.co/docs/transformers/v4.17.0/en/model_doc/xlm-roberta#transformers.XLMRobertaTokenizer">XLM-RoBERTa</a> and <a href="https://huggingface.co/docs/transformers/en/model_doc/nllb#nllbtokenizer">NLLB-200-distilled-600M</a>. The purpose of this study is to see if and how exposure to more multilingual data during training affects tokenization. I chose these two tokenizers in particular because <a href="https://denyslinkov.medium.com/why-is-gpt-3-15-77x-more-expensive-for-certain-languages-2b19a4adc4bc">Denys Linkov&#8217;s blog</a> shows that the ratio between the largest and smallest token counts is lowest for these two tokenizers among the ones he compared. 
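One way to quantify how many vocabulary entries are "incomplete" in the sense above is to check whether an entry's raw bytes decode to valid UTF-8. A minimal sketch (my own illustrative helper, not the exact methodology of this study):

```python
def is_complete_utf8(token_bytes: bytes) -> bool:
    """True if the entry's bytes form complete, valid UTF-8 sequences."""
    try:
        token_bytes.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A toy vocabulary mixing complete entries with the incomplete ones
# from the earlier Devanagari example.
vocab = [b"the", b"\xe0\xa4\xb8", b"\xe0\xa4", b"\xac"]
incomplete = [v for v in vocab if not is_complete_utf8(v)]
print(len(incomplete))  # 2
```

Applying such a check to each entry of a byte-level BPE vocabulary gives the counts of invalid byte sequences discussed in this section.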
</p><h4>A more diverse and distributed vocabulary</h4><p>NLLB and XLM-RoBERTa demonstrate significantly more diverse vocabularies compared to GPT-4&#8217;s cl100k_base vocab:</p><ul><li><p><strong>Non-Latin characters:</strong> NLLB and XLM-RoBERTa contain roughly <strong>79.53%</strong> and <strong>83.62%</strong> non-Latin entries respectively, while cl100k_base only has <strong>29.2%</strong>. This indicates that NLLB and XLM-RoBERTa can handle a wider range of languages beyond Latin-based ones.</p></li><li><p><strong>Vocabulary size:</strong> NLLB and XLM-RoBERTa have much larger vocabularies, with <strong>2.55</strong> and <strong>2.49 times</strong> more entries than cl100k_base, and with sub-tokens more evenly distributed across languages.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zOuv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zOuv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png 424w, https://substackcdn.com/image/fetch/$s_!zOuv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png 848w, https://substackcdn.com/image/fetch/$s_!zOuv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zOuv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zOuv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png" width="1404" height="982" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:982,&quot;width&quot;:1404,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:164755,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zOuv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png 424w, https://substackcdn.com/image/fetch/$s_!zOuv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png 848w, https://substackcdn.com/image/fetch/$s_!zOuv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zOuv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Vocab counts for cl100k_base, NLLB-200-distilled-600M, and XLM-RoBERTa. 
Note: non-Latin entries also include vocab items containing non-Latin characters that do not necessarily belong to a specific language, like &#8220;&gt;&#8221;.</figcaption></figure></div><p>I also found that the cl100k_base vocab contains a significantly higher number of entries representing incomplete byte sequences: roughly <strong>29.7 times</strong> and <strong>25.1 times</strong> more than NLLB and XLM-RoBERTa respectively. The limited exposure to non-Latin byte sequences during training might explain the large number of incomplete sequences in the cl100k_base vocab. As mentioned in an earlier section, a smaller or less diverse multilingual corpus could restrict the model's ability to learn proper representations of uncommon text sequences.</p><h3>Aya Dataset</h3><p>For this study, I used the <a href="https://arxiv.org/pdf/2402.06619.pdf">Aya Dataset</a>, which contains human-curated prompt-completion pairs in 65 languages written by fluent speakers of those languages. I chose this dataset for three reasons: 1. it has diverse sequences in terms of length and topic, 2. it contains all languages of interest, and 3. since it is human-curated, the dataset is of high quality, which is what I observed for English, Nepali, and Hindi.</p><p>I took the texts in the <code>inputs</code> column of the dataset and kept a maximum of 1500 samples for each language. 
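The per-language capping step can be sketched as follows (illustrative only; in the actual study the rows come from the Aya Dataset's `inputs` column, and the language names below are hypothetical examples):

```python
from collections import defaultdict

def cap_per_language(rows, max_per_lang=1500):
    """Keep at most max_per_lang input texts per language.

    rows: iterable of (language, text) pairs, e.g. drawn from a
    prompt-completion dataset grouped by language.
    """
    kept = defaultdict(list)
    for lang, text in rows:
        if len(kept[lang]) < max_per_lang:
            kept[lang].append(text)
    return dict(kept)

# Small usage example with a cap of 2 samples per language.
sample = [("nep", "नमस्ते"), ("eng", "hello"), ("eng", "world"), ("eng", "again")]
capped = cap_per_language(sample, max_per_lang=2)
print({k: len(v) for k, v in capped.items()})  # {'nep': 1, 'eng': 2}
```

Languages with fewer than the cap simply keep everything they have, which is why some of the final counts below fall short of 1500.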
The total samples in the final split for the five languages of interest are:</p><ul><li><p>English - 1499</p></li><li><p>French - 1349</p></li><li><p>Spanish - 1500</p></li><li><p>Hindi - 1087</p></li><li><p>Nepali - 1500</p></li></ul><h3>Results</h3><h4>Tokenization Trends Across Languages</h4><p>Although I used a different dataset, I observed a similar trend in token length distribution as discussed in the articles "<a href="https://www.artfish.ai/p/all-languages-are-not-created-tokenized">All languages are NOT created (tokenized) equal</a>" and &#8220;<a href="https://denyslinkov.medium.com/why-is-gpt-3-15-77x-more-expensive-for-certain-languages-2b19a4adc4bc">Why is GPT-3 15.77x more expensive for certain languages?</a>&#8221;. </p><p><strong>Inspired by the first work, I have created <a href="https://huggingface.co/spaces/shreeyad/tokenizers-multilingual">a similar dashboard</a> for this work.</strong></p><p>The distributions of token lengths for the non-English languages (French, Spanish, Hindi, and Nepali) were closer to English for the NLLB and XLM-RoBERTa tokenizers than for the GPT-4 tokenizer. 
However, for the GPT-4 tokenizer, the token distributions for the non-Latin languages (Hindi and Nepali) were very different from that of English, with consistently higher token counts across the samples.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nYgc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nYgc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 424w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 848w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 1272w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nYgc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png" width="1418" height="918" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:918,&quot;width&quot;:1418,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316689,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nYgc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 424w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 848w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 1272w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2m-N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2m-N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 424w, https://substackcdn.com/image/fetch/$s_!2m-N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 848w, 
https://substackcdn.com/image/fetch/$s_!2m-N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 1272w, https://substackcdn.com/image/fetch/$s_!2m-N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2m-N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png" width="1400" height="892" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:892,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:299749,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2m-N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 424w, https://substackcdn.com/image/fetch/$s_!2m-N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 848w, 
https://substackcdn.com/image/fetch/$s_!2m-N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 1272w, https://substackcdn.com/image/fetch/$s_!2m-N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The median token length for non-Latin languages (Hindi and Nepali) is only slightly higher than for Latin languages (English, French, 
Spanish): 17 vs. 16 for the NLLB and XLM-RoBERTa tokenizers. However, the GPT-4 tokenizer exhibits a significantly larger difference, with median token lengths of 62 for Hindi and Nepali vs. 16 for English, French, and Spanish.</p><p>This observation suggests that <strong>training on a more comprehensive multilingual corpus can influence token length distribution</strong>. NLLB and XLM-RoBERTa, likely trained on broader datasets, show a smaller difference in token lengths between Latin and non-Latin languages compared to the GPT-4 tokenizer, which might have been trained on a less diverse corpus.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DU86!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DU86!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 424w, https://substackcdn.com/image/fetch/$s_!DU86!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 848w, https://substackcdn.com/image/fetch/$s_!DU86!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 1272w, https://substackcdn.com/image/fetch/$s_!DU86!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!DU86!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png" width="1408" height="932" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:932,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:266109,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DU86!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 424w, https://substackcdn.com/image/fetch/$s_!DU86!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 848w, https://substackcdn.com/image/fetch/$s_!DU86!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 1272w, https://substackcdn.com/image/fetch/$s_!DU86!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There was no replacement token for any sample in the dataset for NLLB and XLM-RoBERTa, while there were a fair number of replacement tokens for non-Latin languages with the GPT-4 tokenizer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6NOj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!6NOj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 424w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 848w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 1272w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6NOj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png" width="1412" height="932" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:932,&quot;width&quot;:1412,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:253758,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!6NOj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 424w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 848w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 1272w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Graphemes vs Token Counts</strong></h3><p>I compared the grapheme count (number of written characters) to the token count (number of tokens after tokenization) for the above tokenizers and observed that GPT-4&#8217;s tokenizer stands out with a much higher token count compared to its grapheme count for Hindi and Nepali. </p><p>While all three models utilize BPE (Byte Pair Encoding), NLLB and RoBERTa tokenizers, likely trained on broader multilingual datasets, would have encountered various writing systems and grammatical structures. This exposure allows them to adapt their tokenization strategies to handle the complexities of non-Latin languages.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DBr1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DBr1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 424w, https://substackcdn.com/image/fetch/$s_!DBr1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 848w, 
https://substackcdn.com/image/fetch/$s_!DBr1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 1272w, https://substackcdn.com/image/fetch/$s_!DBr1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DBr1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png" width="1422" height="836" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:836,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:503063,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DBr1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 424w, https://substackcdn.com/image/fetch/$s_!DBr1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 848w, 
https://substackcdn.com/image/fetch/$s_!DBr1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 1272w, https://substackcdn.com/image/fetch/$s_!DBr1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!D48Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D48Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 424w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 848w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 1272w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D48Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png" width="1432" height="850" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:850,&quot;width&quot;:1432,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:505051,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D48Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 424w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 848w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 1272w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>GPT-4 tokenizer seems to have been heavily optimized for English and might not have been adequately exposed to the specific characteristics of non-Latin languages.  
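</p><p>To make this concrete, here is a minimal sketch in plain Python, with no tokenizer libraries; the example word and the simplified grapheme rule are my own illustration, not part of the experiments above. Every Devanagari code point occupies 3 bytes in UTF-8, so a tokenizer that falls back to byte-level pieces can emit several times more tokens than there are graphemes.</p><pre><code>import unicodedata

def grapheme_count(text):
    """Rough grapheme count: combining marks (categories Mn, Mc), such
    as vowel signs and the virama, attach to the preceding base char."""
    return sum(1 for ch in text
               if unicodedata.category(ch) not in ("Mn", "Mc"))

word = "नमस्ते"  # "namaste" in Devanagari
graphemes = grapheme_count(word)        # 4
codepoints = len(word)                  # 6
utf8_bytes = len(word.encode("utf-8"))  # 18

# A byte-level fallback can emit up to one token per byte:
# up to 18 tokens for a 4-grapheme word, vs. 7 bytes for "namaste".
print(graphemes, codepoints, utf8_bytes)</code></pre><p>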
As a way to manage unseen characters or complex word structures, the GPT-4 tokenizer seems to split words excessively into subwords, such that many of the subwords are incomplete, partial byte sequences, inflating the token count compared to graphemes.</p><p>Nepali and Hindi both have a complex morphology involving prefixes, suffixes, and other meaningful units, and limited exposure to such structures during training could hinder the GPT-4 tokenizer&#8217;s ability to effectively tokenize these languages.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ywML!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ywML!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 424w, https://substackcdn.com/image/fetch/$s_!ywML!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 848w, https://substackcdn.com/image/fetch/$s_!ywML!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 1272w, https://substackcdn.com/image/fetch/$s_!ywML!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ywML!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png" width="1456" height="846" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:846,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:412695,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ywML!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 424w, https://substackcdn.com/image/fetch/$s_!ywML!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 848w, https://substackcdn.com/image/fetch/$s_!ywML!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 1272w, https://substackcdn.com/image/fetch/$s_!ywML!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Does this affect the overall tokenization time?</h4><p>The peak of the distribution of tokenization times for NLLB and RoBERTa is around <strong>2.2 seconds</strong> and <strong>2.0 seconds</strong>, respectively. cl100k_base is significantly faster, with a peak at <strong>0.0006 seconds</strong>. However, for a given tokenizer, tokenization speed varies only slightly across languages. 
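</p><p>For reference, per-sample timings like these can be gathered with a small harness along the following lines. This is a sketch of the measurement approach, not the benchmark code used above; the whitespace tokenizer is a stand-in for a real tokenizer&#8217;s encode function (e.g. from Hugging Face tokenizers or tiktoken).</p><pre><code>import statistics
import time

def time_tokenization(tokenize, samples, repeats=5):
    """Median wall-clock seconds to tokenize each sample."""
    timings = []
    for text in samples:
        runs = []
        for _ in range(repeats):
            start = time.perf_counter()
            tokenize(text)  # only the tokenizer call is timed
            runs.append(time.perf_counter() - start)
        timings.append(statistics.median(runs))
    return timings

# Stand-in tokenizer: swap in a real encode function, e.g.
# tiktoken.get_encoding("cl100k_base").encode
samples = ["The cost of ideas", "विचारों की लागत"]
timings = time_tokenization(str.split, samples)</code></pre><p>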
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9krw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9krw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 424w, https://substackcdn.com/image/fetch/$s_!9krw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 848w, https://substackcdn.com/image/fetch/$s_!9krw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 1272w, https://substackcdn.com/image/fetch/$s_!9krw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9krw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png" width="1416" height="920" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:920,&quot;width&quot;:1416,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:312823,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9krw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 424w, https://substackcdn.com/image/fetch/$s_!9krw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 848w, https://substackcdn.com/image/fetch/$s_!9krw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 1272w, https://substackcdn.com/image/fetch/$s_!9krw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!afNC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!afNC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 424w, https://substackcdn.com/image/fetch/$s_!afNC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 848w, 
https://substackcdn.com/image/fetch/$s_!afNC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 1272w, https://substackcdn.com/image/fetch/$s_!afNC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!afNC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png" width="1422" height="910" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:910,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:340804,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!afNC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 424w, https://substackcdn.com/image/fetch/$s_!afNC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 848w, 
https://substackcdn.com/image/fetch/$s_!afNC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 1272w, https://substackcdn.com/image/fetch/$s_!afNC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This metric was collected from a device with the following configuration:</p><p><em>Courtesy: Infinity Technology 
Inc.</em></p><pre><code>Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  48
  On-line CPU(s) list:   0-47
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
    CPU family:          6
    Model:               63
    Thread(s) per core:  2
    Core(s) per socket:  12
    Socket(s):           2
    Stepping:            2
    CPU max MHz:         3500.0000
    CPU min MHz:         1200.0000
Caches (sum of all):     
  L1d:                   768 KiB (24 instances)
  L1i:                   768 KiB (24 instances)
  L2:                    6 MiB (24 instances)
  L3:                    60 MiB (2 instances)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hCFb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hCFb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 424w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 848w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 1272w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hCFb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png" width="1422" height="948" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:948,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:248487,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hCFb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 424w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 848w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 1272w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>NLLB and RoBERTa both have larger vocabularies, so it is natural that they take more time to map the input text to the corresponding tokens. However, the relationship between tokenization speed and vocabulary size is not linear.</p><h3>Conclusion</h3><p>To conclude, in this article I explored the impact of training data and character representation on tokenization discrepancies for byte-based BPE tokenizers across languages. While the analysis focused on the Indo-European language family, the findings suggest that these disparities stem primarily from the models' exposure during training and from how characters are represented in Unicode.</p><p><strong>Key observations:</strong></p><ul><li><p>The GPT-4 vocabulary differed significantly from NLLB-200-distilled-600M and XLM-RoBERTa, both in size and in the distribution of non-Latin tokens.</p></li><li><p>Token length distribution varied across languages for all the tokenizers. 
While the variation is not significant for NLLB-200-distilled-600M and XLM-RoBERTa, token counts for non-Latin languages are much higher with the GPT-4 tokenizer.</p></li><li><p>The GPT-4 tokenizer showed a large discrepancy between grapheme count and token length for non-Latin languages.</p></li><li><p>Tokenization speed varied only slightly across languages for each tokenizer, but the GPT-4 tokenizer was at least twice as fast as the NLLB-200-distilled-600M and XLM-RoBERTa tokenizers.</p></li></ul><p><strong>Future considerations:</strong></p><ul><li><p>Investigating the tokenization strategies employed by each model in more detail.</p></li><li><p>Exploring the performance of these models on downstream tasks involving non-Latin languages.</p></li><li><p>Exploring whether a high token count in GPT-4 translates to poorer generation and language-understanding capabilities.</p></li></ul>]]></content:encoded></item></channel></rss>