<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[#icodeformyभाषा: Research]]></title><description><![CDATA[Research]]></description><link>https://www.icodeformybhasa.com/s/research</link><image><url>https://substackcdn.com/image/fetch/$s_!DMde!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d92160-5f82-40ca-8233-b069f42bbba6_1080x1080.png</url><title>#icodeformyभाषा: Research</title><link>https://www.icodeformybhasa.com/s/research</link></image><generator>Substack</generator><lastBuildDate>Mon, 11 May 2026 13:58:07 GMT</lastBuildDate><atom:link href="https://www.icodeformybhasa.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[#icodeformyभाषा]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[icodeformybhasa@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[icodeformybhasa@substack.com]]></itunes:email><itunes:name><![CDATA[Shreeya]]></itunes:name></itunes:owner><itunes:author><![CDATA[Shreeya]]></itunes:author><googleplay:owner><![CDATA[icodeformybhasa@substack.com]]></googleplay:owner><googleplay:email><![CDATA[icodeformybhasa@substack.com]]></googleplay:email><googleplay:author><![CDATA[Shreeya]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Cost of Ideas]]></title><description><![CDATA[While proofreading the "How LLMs Break Down Language from Text to Tokens" section from my last blog, Nolan asked me a thought-provoking question: "Then what would the cost of ideas for ideographic languages be?" 
His question highlighted the significant differences between two writing systems: ideographic languages, where characters represent ideas, and orthographic languages, where characters represent sounds.]]></description><link>https://www.icodeformybhasa.com/p/the-cost-of-ideas</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/the-cost-of-ideas</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Sat, 30 Mar 2024 03:05:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>While proofreading the <a href="https://www.icodeformybhasa.com/i/142259391/how-llms-break-down-language-from-text-to-tokens">"How LLMs Break Down Language from Text to Tokens"</a> section from my last blog, <a href="https://www.linkedin.com/in/nolan-kramer-85536890/">Nolan</a> asked me a thought-provoking question: "Then what would the cost of ideas for ideographic languages be?" His question highlighted the significant differences between two writing systems: ideographic languages, where characters represent ideas, and orthographic languages, where characters represent sounds. This further prompted an investigation into whether this distinction affects the overall cost of ideas for large language models (LLMs).</p><p>In this blog post, we will try to answer Nolan's question and explore the following:</p><ul><li><p><strong>Ideographic vs. 
Orthographic Languages:</strong> We will explore the fundamental differences between ideographic languages like Chinese and Japanese and orthographic languages like English and Nepali.</p></li><li><p><strong>The Trade-Off between Conciseness and Digital Footprint:</strong> We will briefly explore the relationship between character count and byte count, acknowledging that ideographic languages might be more concise on paper but not necessarily in terms of digital storage.</p></li></ul><p>Additionally, we will discuss:</p><ul><li><p><strong>The "Cost of Ideas" for LLMs:</strong> We will define the "cost of ideas" in the context of LLMs as the number of tokens required to represent the same idea.</p></li></ul><p>In my <a href="https://www.icodeformybhasa.com/p/beyond-the-abcs-exploring-the-nuances">previous post</a>, I showed that large language models (LLMs) heavily optimized for English and Latin-based languages exhibit consistently higher token counts when processing non-Latin scripts. This discrepancy in token counts translates to increased computational costs for operating LLM-based applications like ChatGPT on non-Latin languages. Building on this foundation, my focus in this post will be on investigating the "Cost of Ideas" in ideographic and orthographic writing systems. The primary objective is to explore how the fundamental distinctions between these two language categories influence their digital footprint in representing and conveying concepts and ideas.</p><h2>Ideographic vs. Orthographic Languages</h2><p>Ideographic languages are writing systems where each character or symbol represents a complete word or concept. Chinese, which uses thousands of characters (called hanzi), is the most prominent example of this type of writing system. 
</p><p><strong>Key Features of Ideographic Languages:</strong></p><ul><li><p><strong>Characters Represent Ideas:</strong> Instead of corresponding to phonetic sounds, each character in an ideographic language symbolizes an entire word or concept. For example, the character <strong>&#26408;</strong> in Chinese represents the word "tree" or the concept of "wood."</p></li><li><p><strong>Extensive Character Set:</strong> To encompass a language's full vocabulary, ideographic writing systems require a vast number of characters, often numbering in the thousands or tens of thousands. The Kangxi dictionary, one of the most comprehensive Chinese dictionaries, contains <strong>over 47,000 characters.</strong></p></li><li><p><strong>Context Matters:</strong> While some characters may have multiple pronunciations or meanings, the context in which they appear often provides clues to their intended interpretation. The character <strong>&#34892;</strong> can mean "to walk" or "behavior," depending on the context.</p></li><li><p><strong>Character Composition:</strong> In some languages (like Chinese), characters can be built from simpler components that offer hints about meaning or pronunciation. The character <strong>&#26519;</strong>, meaning "forest," is composed of two instances of the character <strong>&#26408;</strong> (tree).</p></li></ul><p>In contrast, orthographic languages utilize a set of symbols, typically letters or syllabic characters, to represent the individual sounds (phonemes) that make up spoken words. These symbols are then combined to form words based on their phonetic values. The most familiar example of an orthographic language is English.</p><p><strong>Key Features of Orthographic Languages:</strong></p><ul><li><p><strong>Phoneme Representation:</strong> The letters or symbols in these writing systems correspond to the smallest units of sound (phonemes) in the spoken language. 
For example, the letter "c" represents the /k/ sound in the word "cat."</p></li><li><p><strong>Limited Symbol Set:</strong> Compared to ideographic languages, orthographic systems typically require a relatively small number of symbols to function. The English alphabet has just 26 letters.</p></li><li><p><strong>Phonetic Combination:</strong> Words are formed by combining these symbols based on their phonetic values, creating a more direct link between sound and written word. The word "book" is composed of the letters "b," "o," "o," and "k," representing the sounds /b/, /&#650;/, /k/.</p></li></ul><p>In addition, some languages, like Japanese, incorporate both ideograms (kanji characters borrowed from Chinese) and phonetic scripts (hiragana and katakana) within their writing system. For example, the Japanese word for "computer" is written as &#12467;&#12531;&#12500;&#12517;&#12540;&#12479;&#12540; (using katakana) or &#38651;&#33075; (using kanji characters).</p><h2>Universal Declaration of Human Rights as a Lens for Comparing "Cost of Ideas"</h2><p>To analyze the "cost of ideas" across ideographic and orthographic languages, we will leverage the Universal Declaration of Human Rights (UDHR) as a parallel corpus in Nepali (orthographic), English (orthographic), Japanese (hybrid with ideographic kanji and syllabic kana), and Chinese (ideographic).</p><p>The UDHR translations are maintained and overseen by the United Nations (UN), ensuring that the UDHR's articles are conveyed accurately and with semantic equivalence across all languages. 
This ensures that any differences observed in the "cost of ideas" are primarily due to the inherent characteristics of the writing systems themselves, rather than discrepancies in translation.</p><p>We will examine the parallel translations across these languages to understand how the same ideas, when represented using different writing systems, vary in terms of the cost that a digital system has to bear.</p><p><strong>Preprocessing the Data</strong></p><p>This plain-text version of the UDHR was originally prepared and hosted by the Unicode Consortium under the "UDHR in Unicode" project. Although the Unicode Consortium stopped hosting the project as of January 2024, the XML files with translations in multiple languages remain available at <a href="http://efele.net/udhr/">UDHR in XML</a>.</p><p>I pre-processed the XML files for Mandarin Chinese (Simplified), Mandarin Chinese (Traditional), English, Japanese, and Nepali. The processed dataset includes 31 rows for each language: one for the preamble and one for each of the 30 articles of the UDHR.</p><h2><strong>The Trade-Off: Conciseness vs. Digital Footprint</strong></h2><p>The trade-off between conciseness and digital footprint becomes particularly evident when comparing ideographic writing systems, like Chinese, with orthographic systems, like English or Nepali. 
Let's delve deeper into this trade-off by examining the grapheme and byte counts for the text in our dataset.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8LXg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8LXg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8LXg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png" width="1200" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33610,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8LXg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Average grapheme and byte counts across languages in the UDHR</figcaption></figure></div><h4><strong>Grapheme Count and Conciseness</strong></h4><p>Grapheme count refers to the number of characters, such as letters or ideographs, needed to represent a word or concept. Ideographic scripts like Chinese exhibit a significant advantage in conciseness. In our dataset, Traditional Chinese has an average grapheme count of 82.54, and Simplified Chinese has 82.45. In contrast, the average grapheme count for English is considerably higher at 321.70. 
This conciseness in ideographic scripts stems from their ability to convey complex ideas and concepts through a single ideographic character, reducing the need for multiple graphemes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nwRT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nwRT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 424w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 848w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 1272w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nwRT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png" width="1456" height="1312" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1312,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:299282,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nwRT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 424w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 848w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 1272w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Grapheme counts across each article of the UDHR for English and Chinese</figcaption></figure></div><h4>Byte Count and Digital Footprint</h4><p>Byte count, by contrast, represents the number of bytes required to encode the text digitally. Despite their lower grapheme counts, Traditional and Simplified Chinese texts required an average of 246.41 and 240.77 bytes, respectively, to encode their characters. This higher byte count is a consequence of the multi-byte encodings ideographic scripts require: in UTF-8, for example, each Chinese character occupies three bytes, while a basic Latin letter occupies one. Conciseness on the page thus comes at the cost of an increased digital footprint. 
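</p><p>These figures can be reproduced in miniature. The sketch below, using only Python's standard library, compares code-point counts, an approximate grapheme count (it simply skips combining marks; a full implementation would follow Unicode's grapheme-cluster rules), and UTF-8 byte counts. The sample phrases are my own illustrative picks, not drawn from the UDHR dataset.</p>

```python
import unicodedata

def text_costs(text: str) -> dict:
    """Return rough size metrics for a piece of text.

    The grapheme figure is an approximation: code points in the
    combining-mark categories (Mn/Mc/Me) are not counted, so a base
    character plus its diacritics counts as one grapheme.
    """
    graphemes = sum(
        1 for ch in text if not unicodedata.category(ch).startswith("M")
    )
    return {
        "code_points": len(text),
        "graphemes_approx": graphemes,
        "utf8_bytes": len(text.encode("utf-8")),
    }

samples = {
    "English": "human rights",  # ASCII: one byte per character
    "Chinese": "人权",  # two ideographs, three UTF-8 bytes each
    "Nepali": "मानव अधिकार",  # Devanagari: matras add bytes, not graphemes
}

for language, text in samples.items():
    print(language, text_costs(text))
```

<p>The two-grapheme Chinese phrase already costs six bytes, while the twelve-letter English phrase costs exactly twelve: the trade-off described above, in miniature.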
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tURU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tURU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 424w, https://substackcdn.com/image/fetch/$s_!tURU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 848w, https://substackcdn.com/image/fetch/$s_!tURU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 1272w, https://substackcdn.com/image/fetch/$s_!tURU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tURU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png" width="1454" height="1300" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1300,&quot;width&quot;:1454,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:295041,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tURU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 424w, https://substackcdn.com/image/fetch/$s_!tURU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 848w, https://substackcdn.com/image/fetch/$s_!tURU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 1272w, https://substackcdn.com/image/fetch/$s_!tURU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Byte counts across each article of the UDHR for English and Chinese</figcaption></figure></div><p><strong>English and the Trade-Off</strong></p><p>On average, English requires 3.9 times more graphemes than Traditional and Simplified Chinese to convey the same concepts. However, when it comes to the byte counts needed for digital encoding, the gap narrows drastically: English requires only 1.31 times more bytes than Traditional Chinese, and 1.34 times more bytes than Simplified Chinese. 
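</p><p>A quick sketch confirms these ratios from the average counts reported in this post (English averages 321.70 graphemes and 321.96 bytes per article):</p>

```python
# Average per-article grapheme and UTF-8 byte counts from the UDHR
# dataset discussed in this post, as (graphemes, bytes).
averages = {
    "English": (321.70, 321.96),
    "Traditional Chinese": (82.54, 246.41),
    "Simplified Chinese": (82.45, 240.77),
}

english_graphemes, english_bytes = averages["English"]
for name, (graphemes, utf8_bytes) in averages.items():
    if name == "English":
        continue
    print(f"English needs {english_graphemes / graphemes:.1f}x the graphemes "
          f"but only {english_bytes / utf8_bytes:.2f}x the bytes of {name}")
```

<p>Note how the roughly 3.9x grapheme gap collapses to about 1.3x in bytes once UTF-8 encoding is taken into account. 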
This highlights the trade-off: while Chinese is far more concise, requiring fewer graphemes, English benefits from a simpler encoding that needs fewer bytes per grapheme than the ideographic Chinese scripts.</p><p><strong>The Case of Japanese</strong></p><p>Japanese, which combines ideographic kanji characters borrowed from Chinese with the phonetic scripts hiragana and katakana, has an average grapheme count of 124.35 in our dataset, lower than English. However, the byte count for Japanese text jumps to 371.83, exceeding even that of English. This increase can, again, be attributed to the multi-byte encodings Japanese characters require.</p><p>In essence, while ideographic scripts like Chinese and Japanese offer conciseness in terms of grapheme counts, they often require more bytes to encode digitally, resulting in a trade-off between conciseness and digital footprint. This trade-off has implications for tasks such as text storage, transmission, and processing within language technologies and applications.</p><h3>Script Intricacies and Their Impact on Digital Footprint</h3><p>While ideographic scripts like Chinese exhibit a clear trade-off between conciseness in grapheme counts and an increased digital footprint due to their complex character encodings, the case of Nepali presents a different challenge.</p><p>Even though Nepali, like English, is an orthographic language, its text characteristics in our dataset differ significantly in terms of grapheme and byte counts. While Nepali uses far fewer graphemes on average (194.80) than English (321.70), this efficiency stems from the unique features of the Devanagari script. Unlike the Latin script, where a syllable often requires separate letters for its consonant and vowel sounds, Devanagari generally packs a whole syllable into a single character. This is because Devanagari consonants typically carry an inherent vowel sound, a characteristic the Latin script lacks, which allows Nepali text to be represented with fewer graphemes on average than its English counterpart.</p><p>However, the byte count tells a different story. Nepali text required a staggering 759.09 bytes on average to encode digitally, over 2.3 times the 321.96 bytes needed for English text. This disproportionately high byte count, despite Nepali's lower grapheme count, highlights the complexity involved in digitally encoding the Devanagari script's intricate system of consonant clusters, vowel diacritics, and combining characters.</p><h3><strong>The "Cost of Ideas" for LLMs</strong></h3><p>As explored in <a href="https://www.icodeformybhasa.com/p/beyond-the-abcs-exploring-the-nuances#%C2%A7how-llms-break-down-language-from-text-to-tokens">my previous work</a>, the number of tokens required to represent ideas in LLMs can vary significantly across languages, depending largely on how each model's tokenizer was trained. While the inherent characteristics of a language influence the number of graphemes needed to represent ideas, the tokenization method plays a crucial role in determining the actual token counts within the LLM. If the tokenizer is trained on a diverse dataset that includes a good representation of ideographic languages like Chinese, it can potentially learn to tokenize these languages more efficiently. 
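</p><p>To see why byte counts matter for token counts, consider the worst case: a tokenizer whose vocabulary has no coverage of a script degrades to roughly one token per UTF-8 byte, as byte-level BPE does for unseen text. The sketch below models only that crude bound; it is not a real tokenizer run, and the phrases (the openings of UDHR Article 1) are just illustrations.</p>

```python
def byte_fallback_tokens(text: str) -> int:
    """Worst-case token count if every UTF-8 byte becomes one token,
    as in byte-level BPE with no learned merges for the script."""
    return len(text.encode("utf-8"))

# The opening of UDHR Article 1 in English and Chinese.
english = "All human beings are born free"
chinese = "人人生而自由"

print(byte_fallback_tokens(english))  # 30 bytes, so at worst 30 tokens
print(byte_fallback_tokens(chinese))  # 6 characters become 18 byte-tokens
```

<p>A tokenizer trained with good coverage of Chinese can instead map whole characters, or even multi-character words, to single tokens, collapsing that worst-case 18 towards 6 or fewer. 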
This can lead to lower token counts and a cost advantage for representing ideas in the LLM.</p><p>You can find the <a href="https://huggingface.co/spaces/shreeyad/tokenizers-multilingual">visualizer here</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q-qQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q-qQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 424w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 848w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 1272w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q-qQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png" width="1456" height="436" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:436,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:284476,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q-qQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 424w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 848w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 1272w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Tokenizing Article 1 in English and Chinese with the XLM-RoBERTa tokenizer</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UTMY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UTMY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 424w, https://substackcdn.com/image/fetch/$s_!UTMY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 
848w, https://substackcdn.com/image/fetch/$s_!UTMY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 1272w, https://substackcdn.com/image/fetch/$s_!UTMY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UTMY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png" width="1456" height="422" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:422,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:286448,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UTMY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 424w, https://substackcdn.com/image/fetch/$s_!UTMY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 848w, 
https://substackcdn.com/image/fetch/$s_!UTMY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 1272w, https://substackcdn.com/image/fetch/$s_!UTMY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tokenizing Article 1 in English and Chinese using NLLB Tokenizer</figcaption></figure></div><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vvi-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vvi-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 424w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 848w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 1272w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vvi-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png" width="1456" height="432" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:432,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:287963,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vvi-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 424w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 848w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 1272w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tokenizing Article 1 in English and Chinese using GPT-4 Tokenizer</figcaption></figure></div><p>If the tokenizer is trained on a diverse dataset that includes a good representation of ideographic languages like Chinese, it can potentially learn to tokenize these languages more efficiently, resulting in lower token counts and, consequently, a cost advantage for representing ideas within the LLM.</p><p>Conversely, a tokenizer trained on data skewed towards certain languages or writing systems may struggle to tokenize other languages optimally. This can result in higher token counts and increased costs for representing ideas in those languages. 
This seems to be the case with GPT-4 tokenization, where it exhibits sub-optimal performance when tokenizing texts in non-Latin languages.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wcAg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wcAg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wcAg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png" width="1200" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:37636,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!wcAg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Average Token counts for XML-Roberta, NLLB and GPT-4 in UDHR</figcaption></figure></div><p>This observation highlights the importance of carefully curating the training dataset, as well as tailoring the tokenization process when developing large language models. By ensuring that the tokenizer is exposed to a diverse range of languages, including ideographic scripts, during the training process, LLMs can potentially leverage the inherent advantages of certain writing systems. For example, they can exploit the compact representation of ideas offered by ideographic languages like Chinese. 
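To make the byte arithmetic behind these token counts concrete, here is a minimal, self-contained sketch (not the code used in this study; the sample sentences are illustrative stand-ins, not the UDHR text) comparing character counts with UTF-8 byte counts, the worst-case token count for a byte-level BPE tokenizer that has learned no merges for a script:

```python
# Compare how many Unicode characters vs. UTF-8 bytes the "same idea" needs
# in different scripts. The byte count is the worst-case token count for a
# byte-level BPE tokenizer with no learned merges for that script.
samples = {
    "English": "All human beings are born free",
    "Chinese": "人人生而自由",
    "Nepali": "सबै मानिस जन्मजात स्वतन्त्र हुन्",
}

for lang, text in samples.items():
    n_chars = len(text)                  # Unicode characters
    n_bytes = len(text.encode("utf-8"))  # bytes = worst-case byte-level tokens
    print(f"{lang}: {n_chars} chars, {n_bytes} bytes")
```

Chinese packs the idea into the fewest characters, but each CJK character costs three bytes in UTF-8, so the tokenizer's learned merges, not the script alone, determine the final token count.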
Ultimately, the tokenization method and the quality of the training data can significantly impact the cost and efficiency of representing ideas across different languages within large language models.</p><h3>Acknowledgements</h3><p><a href="https://www.linkedin.com/in/nolan-kramer-85536890/">Nolan Kramer</a>, for not just asking the question that became the basis of this post but also for the discussions throughout the time I was working on this project.</p><p><a href="https://www.linkedin.com/in/gwendolyngillingham">Gwendolyn Gillingham</a>, for helping me with the study and providing the idea of using the UDHR dataset.</p>]]></content:encoded></item><item><title><![CDATA[Beyond the ABCs: Exploring the nuances of tokenization in diverse languages]]></title><description><![CDATA[Earlier this month, I stumbled upon two articles that discussed the disparities in tokenization among languages titled "All languages are NOT created (tokenized) equal" and &#8220;Why is GPT-3 15.77x more expensive for certain languages?&#8221;.]]></description><link>https://www.icodeformybhasa.com/p/beyond-the-abcs-exploring-the-nuances</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/beyond-the-abcs-exploring-the-nuances</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Wed, 13 Mar 2024 03:40:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Earlier this month, I stumbled upon two articles that discussed the disparities in tokenization among languages titled "<a href="https://www.artfish.ai/p/all-languages-are-not-created-tokenized">All languages are NOT created (tokenized) equal</a>" and &#8220;<a href="https://denyslinkov.medium.com/why-is-gpt-3-15-77x-more-expensive-for-certain-languages-2b19a4adc4bc">Why is GPT-3 15.77x more expensive for certain languages?</a>&#8221;. This piqued my interest and motivated me to conduct further investigations on my own.
</p><p>In this article, I'll discuss Byte-Pair Encoding (BPE) based tokenization and the disparities in the tokenization process across different languages. Using the Indo-European language family as a case study, I will show how these discrepancies arise <strong>not</strong> from inherent language family differences but rather from the training data and the representation of characters in Unicode for each language. In addition, I will:</p><ol><li><p>Explore GPT-4 vocab and compare it to XML-RoBERTa and NLLB-200-distilled-600M.</p></li><li><p>Explore token length distribution for Indo-European languages: English, French, Spanish, Hindi, and Nepali.</p></li><li><p>Explore the relationship between grapheme counts and token lengths across the languages.</p></li><li><p>Compare the speed of tokenization for the three tokenizers across the five languages mentioned above.</p></li></ol><h3>How LLMs break down language from text to tokens</h3><p><em><a href="https://www.youtube.com/watch?v=zduSFxRajkE">Let&#8217;s build the GPT Tokenizer</a> by Andrej Karpathy was very helpful in understanding the tokenizers used by LLMs.</em></p><p>Tokenization is a fundamental process that involves breaking down a
text into smaller units called tokens, typically words or subwords. LLMs like GPT-4 utilize a technique called byte pair encoding (BPE) for tokenization. It iteratively merges the most frequently occurring pairs of consecutive characters into single units, forming a dynamic vocabulary that adapts to the unique characteristics of the training data. This approach enables LLMs to handle rare words effectively and improves their computational efficiency compared to traditional word-based methods. </p><p>In addition, instead of treating text as sequences of individual characters, GPT-4 uses <strong>byte-level BPE</strong> for tokenization and leverages the properties of UTF-8 encoding, which encodes each Unicode <a href="https://en.wikipedia.org/wiki/Code_point">code point</a> as a sequence of one to four bytes. </p><h4>Byte-Level BPE in GPT Models</h4><p>By working with bytes instead of characters, these models achieve the following advantages, in addition to dynamic vocabulary building:</p><ol><li><p><strong>Compact Vocabulary:</strong> The tokenizer starts with a base vocabulary of 256 tokens, one for each possible byte value in UTF-8 encoding. This small base vocabulary translates to computational efficiency and faster processing.</p></li><li><p><strong>Universal Character Representation:</strong> This ensures all characters, regardless of their origin, can be represented using a combination of bytes, effectively eliminating the need for "unknown tokens." This allows the models to handle diverse text from various languages and writing systems seamlessly.</p></li></ol><h4>Decoding GPT-4 vocab</h4><p>To understand the discrepancies discussed above, I first looked into the vocab used by GPT-4. Tokens in the original vocab file <a href="https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken">cl100k_base.tiktoken</a>, used by the <code>cl100k_base</code> tokenizer (the BPE tokenizer behind GPT-4), <strong>are encoded in base64</strong>. 
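As an illustration of decoding such entries (the ranks below are made up for the example; only the one-entry-per-line "&lt;base64-token&gt; &lt;rank&gt;" layout reflects the published vocab file), each line can be decoded with Python's standard library:

```python
import base64

# Illustrative vocab lines in the "<base64-token> <rank>" layout.
# The ranks here are invented for the example, not taken from the real file.
sample_lines = [
    "aGVsbG8= 1000",  # decodes to the complete UTF-8 string "hello"
    "4KQ= 2000",      # decodes to bytes 0xE0 0xA4, an incomplete UTF-8 sequence
]

for line in sample_lines:
    b64_token, rank = line.split()
    raw = base64.b64decode(b64_token)
    try:
        print(rank, raw.decode("utf-8"))
    except UnicodeDecodeError:
        # Partial byte sequences surface exactly as decoding errors like this.
        print(rank, raw, "(not a valid UTF-8 sequence on its own)")
```

The second entry shows how a vocab token can be a byte fragment rather than a whole character, which is what produces the decoding errors described below.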
I converted the vocabulary to UTF-8 for my analysis. Some tokens raised decoding errors because they contain incomplete byte sequences, highlighting the limitations of byte-level BPE in handling uncommon text and writing systems other than Latin.</p><ul><li><p>The decoded vocabulary comprises 70,988 entries containing only Latin characters. This suggests a potential bias towards Latin-based languages in GPT-4's training data.</p></li><li><p>There are 29,268 entries containing at least one non-Latin character. This indicates that the model was exposed to other languages during training. </p></li><li><p>Among these non-Latin entries, 803 entries contain partial byte sequences. </p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hDVI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hDVI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png 424w, https://substackcdn.com/image/fetch/$s_!hDVI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png 848w, https://substackcdn.com/image/fetch/$s_!hDVI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png 1272w, 
https://substackcdn.com/image/fetch/$s_!hDVI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hDVI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png" width="1422" height="1144" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1144,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:213834,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hDVI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png 424w, https://substackcdn.com/image/fetch/$s_!hDVI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png 848w, https://substackcdn.com/image/fetch/$s_!hDVI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png 1272w, 
https://substackcdn.com/image/fetch/$s_!hDVI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">GPT-4 Tokenization Visualization from <a href="https://platform.openai.com/tokenizer">Open AI&#8217;s Tokenizer playground</a></figcaption></figure></div><h4>Limitations in representing uncommon texts and other writing systems</h4><p>While byte-level BPE effectively eliminates the need for unknown tokens with a compact vocabulary, there are 
some limitations, especially in representing uncommon texts and texts in writing systems other than Latin.</p><p>Despite universal character representation, byte-level BPE might struggle to tokenize uncommon texts not seen during training. For extremely rare character combinations or characters from under-represented writing systems, it might resort to suboptimal tokenization, such as breaking the sequence down into individual bytes, <strong>which can impact accuracy</strong>.</p><p>English letters are assigned a one-byte encoding in UTF-8. However, this is not true for all languages; some languages use multiple bytes per character. Hindi and Nepali are examples of such languages. Both use the Devanagari script, which has a larger character set than the basic Latin alphabet used in English, so these languages need more unique symbols to represent their characters. UTF-8 encodes characters using a variable number of bytes depending on their code point: ASCII characters take one byte, while characters further along in Unicode, including all of Devanagari, take two, three, or even four bytes. Since a byte-level BPE model initially treats each byte as a separate token, a letter in languages like Hindi or Nepali can be broken down into multiple tokens, potentially impacting the model's understanding and generation capabilities. The impact of this process on the model&#8217;s understanding is out of the scope of this article.</p><p>Let's explore how byte-level BPE tokenization can lead to the issue I discussed, using the Nepali word "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;" (sombaar, meaning "Monday") as an example.</p><ol><li><p><strong>Unicode Code Points:</strong> The word "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;" is represented by the following Unicode code points in hexadecimal:</p><pre><code><code>&#2360;: 0x0938
&#2379;: 0x094B
&#2350;: 0x092E
&#2348;: 0x092C
&#2366;: 0x093E
&#2352;: 0x0930</code></code></pre></li><li><p><strong>UTF-8 Encoding:</strong> When encoded using UTF-8, the word "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;" becomes the following byte sequence:</p><pre><code><code>&#2360;: 0xE0 0xA4 0xB8
&#2379;: 0xE0 0xA5 0x8B
&#2350;: 0xE0 0xA4 0xAE
&#2348;: 0xE0 0xA4 0xAC
&#2366;: 0xE0 0xA4 0xBE
&#2352;: 0xE0 0xA4 0xB0</code></code></pre></li></ol><ol start="3"><li><p><strong>Byte-based BPE Tokenization:</strong> During training, BPE can, for example for the character <code>&#2348;</code>, merge the bytes <code>0xE0 0xA4</code> into a single token and leave <code>0xAC</code> as a separate token, depending on the data it has seen. This causes the vocab to contain byte sequences that do not make up a valid code point. So let&#8217;s assume that after several iterations we have the following vocabulary:</p><pre><code><code>0xE0 0xA4 0xB8
0xE0 0xA5 0x8B
0xE0 0xA4 0xAE
0xE0 0xA4 --&gt; incomplete
0xAC --&gt; incomplete
0xE0 0xA4 0xBE
0xE0 0xA4 0xB0</code></code></pre></li><li><p><strong>Tokenization of "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;":</strong> When the tokenizer tries to tokenize the word "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;", it would then generate the following sequence of tokens:</p><pre><code><code>['0xE0 0xA4 0xB8', '0xE0 0xA5 0x8B', '0xE0 0xA4 0xAE', '0xE0 0xA4', '0xAC', '0xE0 0xA4 0xBE', '0xE0 0xA4 0xB0']
Decoding at token id level:
['&#2360;', '&#2379;', '&#2350;', '&#65533;', '&#65533;', '&#2366;', '&#2352;']</code></code></pre><p><br>When decoding the tokens individually, the incomplete byte sequences raise a Unicode decoding error, and by default they are rendered as the Unicode replacement character (&#65533;). See more on this <a href="https://docs.python.org/3/library/stdtypes.html#bytes.decode">here</a>.</p><p><strong>Note:</strong> A slightly different case is when BPE combines byte sequences belonging to multiple characters into a single vocabulary entry, which causes a similar issue.</p><pre><code><code>Decoding at token level:
['&#2360;', '&#2379;', '&#2350;', '&#65533;', '&#65533;', '&#2366;', '&#2352;']
Decoding at input level:
&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;</code></code></pre><p>However, when you decode the entire sequence of tokens together, the tokenizer can correctly reconstruct the original word "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;" by combining the individual byte sequences represented by each token.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ey4g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ey4g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png 424w, https://substackcdn.com/image/fetch/$s_!Ey4g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png 848w, https://substackcdn.com/image/fetch/$s_!Ey4g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!Ey4g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ey4g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png" width="1412" height="1154" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1154,&quot;width&quot;:1412,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:318730,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ey4g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png 424w, https://substackcdn.com/image/fetch/$s_!Ey4g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png 848w, https://substackcdn.com/image/fetch/$s_!Ey4g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!Ey4g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">OpenAI&#8217;s tiktoken tokenizer visualizer for Nepali. Note the several decoding issues here: many of these tokens are represented by incomplete UTF-8 byte sequences. Also note that the number of tokens is greater than the number of characters, because some characters are encoded as multiple bytes and, for some of these byte sequences, the tiktoken tokenizer treats each byte as a separate token.</figcaption></figure></div><h4>Factors influencing high invalid byte sequences in vocab </h4><p>The quality and size of the training data in a particular language can lead the algorithm to learn a sub-optimal vocabulary. Languages with less diverse or smaller training datasets may exhibit higher rates of invalid byte sequences due to insufficient coverage of character combinations or linguistic phenomena. 
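The per-token decoding failure walked through above can be reproduced in a few lines of Python. This is a minimal sketch: the token split below mirrors the hypothetical vocabulary from the example, not the output of any real tokenizer.

```python
# Hypothetical byte-level token split for "सोमबार" (sombaar), mirroring the
# vocabulary sketched above: "ब" (0xE0 0xA4 0xAC) is split across two tokens.
tokens = [
    b"\xe0\xa4\xb8",  # स
    b"\xe0\xa5\x8b",  # ो
    b"\xe0\xa4\xae",  # म
    b"\xe0\xa4",      # incomplete prefix of ब
    b"\xac",          # dangling continuation byte of ब
    b"\xe0\xa4\xbe",  # ा
    b"\xe0\xa4\xb0",  # र
]

# Decoding each token individually: incomplete sequences become U+FFFD (�).
per_token = [t.decode("utf-8", errors="replace") for t in tokens]
print(per_token)  # ['स', 'ो', 'म', '�', '�', 'ा', 'र']

# Decoding the concatenated byte sequence reconstructs the original word.
print(b"".join(tokens).decode("utf-8"))  # सोमबार
```

Decoding the full byte stream at once works because the incomplete prefix and the dangling continuation byte are adjacent, so together they form the valid sequence for "ब".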
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BWQo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BWQo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png 424w, https://substackcdn.com/image/fetch/$s_!BWQo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png 848w, https://substackcdn.com/image/fetch/$s_!BWQo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png 1272w, https://substackcdn.com/image/fetch/$s_!BWQo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BWQo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png" width="1412" height="580" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:580,&quot;width&quot;:1412,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52875,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BWQo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png 424w, https://substackcdn.com/image/fetch/$s_!BWQo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png 848w, https://substackcdn.com/image/fetch/$s_!BWQo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png 1272w, https://substackcdn.com/image/fetch/$s_!BWQo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">No decoding errors for Latin characters, which have one-byte sequences in UTF-8 encoding</figcaption></figure></div><p>In addition to the quality and size of training data, some of the factors that influence high invalid byte sequences in vocab are:</p><ol><li><p><strong>Script Complexity: </strong>Languages with more complex scripts, such as those with non-Latin scripts like Devanagari, Thai, or Chinese characters, may have a higher likelihood of invalid byte sequences being represented in the vocab. 
These scripts often have a larger number of characters and more complex character compositions, leading to a wider range of possible byte sequences and potential challenges in tokenization.</p></li><li><p><strong>Character Frequency: </strong>Characters that are less frequent in the training data may have their byte sequences split more frequently during merging, increasing the likelihood of incomplete tokens.</p></li><li><p><strong>Word Morphology:</strong> Languages with rich morphology, such as agglutinative languages, may exhibit a larger number of morphemes or affixes, leading to more opportunities for byte sequences to be split during tokenization.</p></li></ol><h3>Can training with more multi-lingual data solve this?</h3><p>Looking at the vocab, we can infer that GPT-4 was heavily optimized towards English. In this section, I will compare the GPT-4 tokenizer with two other byte-based BPE tokenizers trained on multilingual data: <a href="https://huggingface.co/docs/transformers/v4.17.0/en/model_doc/xlm-roberta#transformers.XLMRobertaTokenizer">XLM-RoBERTa</a> and <a href="https://huggingface.co/docs/transformers/en/model_doc/nllb#nllbtokenizer">NLLB-200-distilled-600M</a>. The purpose of this study is to see if and how exposure to more multilingual data during training affects tokenization. I chose these two tokenizers in particular because <a href="https://denyslinkov.medium.com/why-is-gpt-3-15-77x-more-expensive-for-certain-languages-2b19a4adc4bc">Denys Linkov&#8217;s blog</a> shows that the ratio between the largest and smallest token counts is lowest for these two tokenizers among the ones he compared. 
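One way to quantify how many vocabulary entries are "incomplete" in the sense above is to check whether an entry's raw bytes decode to valid UTF-8. A minimal sketch (my own illustrative helper, not the exact methodology of this study):

```python
def is_complete_utf8(token_bytes: bytes) -> bool:
    """True if the entry's bytes form complete, valid UTF-8 sequences."""
    try:
        token_bytes.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A toy vocabulary mixing complete entries with the incomplete ones
# from the earlier Devanagari example.
vocab = [b"the", b"\xe0\xa4\xb8", b"\xe0\xa4", b"\xac"]
incomplete = [v for v in vocab if not is_complete_utf8(v)]
print(len(incomplete))  # 2
```

Applying such a check to each entry of a byte-level BPE vocabulary gives the counts of invalid byte sequences discussed in this section.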
</p><h4>A more diverse and distributed vocabulary</h4><p>NLLB and XLM-RoBERTa demonstrate significantly more diverse vocabularies compared to GPT-4&#8217;s cl100k_base vocab:</p><ul><li><p><strong>Non-Latin characters:</strong> NLLB and XLM-RoBERTa contain roughly <strong>79.53%</strong> and <strong>83.62%</strong> non-Latin entries respectively, while cl100k_base only has <strong>29.2%</strong>. This indicates that NLLB and XLM-RoBERTa can handle a wider range of languages beyond Latin-based ones.</p></li><li><p><strong>Vocabulary size:</strong> NLLB and XLM-RoBERTa have much larger vocabularies, with <strong>2.55</strong> and <strong>2.49 times</strong> more entries than cl100k_base, and with sub-tokens more evenly distributed across languages.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zOuv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zOuv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png 424w, https://substackcdn.com/image/fetch/$s_!zOuv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png 848w, https://substackcdn.com/image/fetch/$s_!zOuv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zOuv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zOuv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png" width="1404" height="982" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:982,&quot;width&quot;:1404,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:164755,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zOuv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png 424w, https://substackcdn.com/image/fetch/$s_!zOuv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png 848w, https://substackcdn.com/image/fetch/$s_!zOuv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zOuv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Vocab counts for cl100k_base, NLLB-200-distilled-600M, and XLM-RoBERTa. 
Note: non-Latin entries also include vocab items containing non-Latin characters that do not necessarily belong to a specific language, like &#8220;&gt;&#8221;.</figcaption></figure></div><p>I also found that the cl100k_base vocab contains a significantly higher number of entries representing incomplete byte sequences: roughly <strong>29.7 times</strong> and <strong>25.1 times</strong> more than NLLB and XLM-RoBERTa respectively. The limited exposure to non-Latin byte sequences during training might explain the large number of incomplete sequences in the cl100k_base vocab. As mentioned in an earlier section, a smaller or less diverse multilingual corpus could restrict the model's ability to learn proper representations of uncommon text sequences.</p><h3>Aya Dataset</h3><p>For this study, I used the <a href="https://arxiv.org/pdf/2402.06619.pdf">Aya Dataset</a>, which contains human-curated prompt-completion pairs in 65 languages written by fluent speakers of those languages. I chose this dataset for three reasons: 1. it has diverse sequences in terms of length and topic, 2. it contains all languages of interest, and 3. since it is human-curated, the dataset is of high quality, which is what I observed for English, Nepali, and Hindi.</p><p>I took the texts in the <code>inputs</code> column of the dataset and kept a maximum of 1500 samples for each language. 
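The per-language capping step can be sketched as follows (illustrative only; in the actual study the rows come from the Aya Dataset's `inputs` column, and the language names below are hypothetical examples):

```python
from collections import defaultdict

def cap_per_language(rows, max_per_lang=1500):
    """Keep at most max_per_lang input texts per language.

    rows: iterable of (language, text) pairs, e.g. drawn from a
    prompt-completion dataset grouped by language.
    """
    kept = defaultdict(list)
    for lang, text in rows:
        if len(kept[lang]) < max_per_lang:
            kept[lang].append(text)
    return dict(kept)

# Small usage example with a cap of 2 samples per language.
sample = [("nep", "नमस्ते"), ("eng", "hello"), ("eng", "world"), ("eng", "again")]
capped = cap_per_language(sample, max_per_lang=2)
print({k: len(v) for k, v in capped.items()})  # {'nep': 1, 'eng': 2}
```

Languages with fewer than the cap simply keep everything they have, which is why some of the final counts below fall short of 1500.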
The total samples in the final split for the five languages of interest are:</p><ul><li><p>English - 1499</p></li><li><p>French - 1349</p></li><li><p>Spanish - 1500</p></li><li><p>Hindi - 1087</p></li><li><p>Nepali - 1500</p></li></ul><h3>Results</h3><h4>Tokenization Trends Across Languages</h4><p>Although I used a different dataset, I observed a similar trend in token length distribution as discussed in the articles "<a href="https://www.artfish.ai/p/all-languages-are-not-created-tokenized">All languages are NOT created (tokenized) equal</a>" and &#8220;<a href="https://denyslinkov.medium.com/why-is-gpt-3-15-77x-more-expensive-for-certain-languages-2b19a4adc4bc">Why is GPT-3 15.77x more expensive for certain languages?</a>&#8221;. </p><p><strong>Inspired by the first work, I have created <a href="https://huggingface.co/spaces/shreeyad/tokenizers-multilingual">a similar dashboard</a> for this work.</strong></p><p>The distributions of token lengths for the non-English languages (French, Spanish, Hindi, and Nepali) were closer to English for the NLLB and XLM-RoBERTa tokenizers than for the GPT-4 tokenizer. 
However, for the GPT-4 tokenizer, the token distributions for the non-Latin languages (Hindi and Nepali) were very different from that of English, with consistently higher token counts across the samples.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nYgc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nYgc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 424w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 848w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 1272w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nYgc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png" width="1418" height="918" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:918,&quot;width&quot;:1418,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316689,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nYgc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 424w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 848w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 1272w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2m-N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2m-N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 424w, https://substackcdn.com/image/fetch/$s_!2m-N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 848w, 
https://substackcdn.com/image/fetch/$s_!2m-N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 1272w, https://substackcdn.com/image/fetch/$s_!2m-N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2m-N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png" width="1400" height="892" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:892,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:299749,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2m-N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 424w, https://substackcdn.com/image/fetch/$s_!2m-N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 848w, 
https://substackcdn.com/image/fetch/$s_!2m-N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 1272w, https://substackcdn.com/image/fetch/$s_!2m-N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The median token length for non-Latin languages (Hindi and Nepali) is only slightly higher than for Latin languages (English, French, 
Spanish): 17 vs. 16 for the NLLB and XLM-RoBERTa tokenizers. However, the GPT-4 tokenizer exhibits a significantly larger difference, with median token lengths of 62 for Hindi and Nepali vs. 16 for English, French, and Spanish.</p><p>This observation suggests that <strong>training on a more comprehensive multilingual corpus can influence token length distribution</strong>. NLLB and XLM-RoBERTa, likely trained on broader datasets, show a smaller difference in token lengths between Latin and non-Latin languages compared to the GPT-4 tokenizer, which might have been trained on a less diverse corpus.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DU86!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DU86!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 424w, https://substackcdn.com/image/fetch/$s_!DU86!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 848w, https://substackcdn.com/image/fetch/$s_!DU86!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 1272w, https://substackcdn.com/image/fetch/$s_!DU86!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!DU86!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png" width="1408" height="932" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:932,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:266109,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DU86!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 424w, https://substackcdn.com/image/fetch/$s_!DU86!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 848w, https://substackcdn.com/image/fetch/$s_!DU86!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 1272w, https://substackcdn.com/image/fetch/$s_!DU86!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There was no replacement token for any sample in the dataset for NLLB and XLM-RoBERTa, while there were a fair number of replacement tokens for non-Latin languages with the GPT-4 tokenizer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6NOj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!6NOj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 424w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 848w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 1272w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6NOj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png" width="1412" height="932" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:932,&quot;width&quot;:1412,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:253758,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!6NOj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 424w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 848w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 1272w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Graphemes vs Token Counts</strong></h3><p>I compared the grapheme count (number of written characters) to the token count (number of tokens after tokenization) for the above tokenizers and observed that GPT-4&#8217;s tokenizer stands out with a much higher token count compared to its grapheme count for Hindi and Nepali. </p><p>While all three models utilize BPE (Byte Pair Encoding), NLLB and RoBERTa tokenizers, likely trained on broader multilingual datasets, would have encountered various writing systems and grammatical structures. This exposure allows them to adapt their tokenization strategies to handle the complexities of non-Latin languages.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DBr1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DBr1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 424w, https://substackcdn.com/image/fetch/$s_!DBr1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 848w, 
https://substackcdn.com/image/fetch/$s_!DBr1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 1272w, https://substackcdn.com/image/fetch/$s_!DBr1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DBr1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png" width="1422" height="836" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:836,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:503063,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DBr1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 424w, https://substackcdn.com/image/fetch/$s_!DBr1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 848w, 
https://substackcdn.com/image/fetch/$s_!DBr1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 1272w, https://substackcdn.com/image/fetch/$s_!DBr1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!D48Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D48Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 424w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 848w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 1272w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D48Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png" width="1432" height="850" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:850,&quot;width&quot;:1432,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:505051,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D48Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 424w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 848w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 1272w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>GPT-4 tokenizer seems to have been heavily optimized for English and might not have been adequately exposed to the specific characteristics of non-Latin languages.  
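</p><p>To make this concrete, here is a minimal sketch in plain Python, with no tokenizer libraries; the example word and the simplified grapheme rule are my own illustration, not part of the experiments above. Every Devanagari code point occupies 3 bytes in UTF-8, so a tokenizer that falls back to byte-level pieces can emit several times more tokens than there are graphemes.</p><pre><code>import unicodedata

def grapheme_count(text):
    """Rough grapheme count: combining marks (categories Mn, Mc), such
    as vowel signs and the virama, attach to the preceding base char."""
    return sum(1 for ch in text
               if unicodedata.category(ch) not in ("Mn", "Mc"))

word = "नमस्ते"  # "namaste" in Devanagari
graphemes = grapheme_count(word)        # 4
codepoints = len(word)                  # 6
utf8_bytes = len(word.encode("utf-8"))  # 18

# A byte-level fallback can emit up to one token per byte:
# up to 18 tokens for a 4-grapheme word, vs. 7 bytes for "namaste".
print(graphemes, codepoints, utf8_bytes)</code></pre><p>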
As a way to manage unseen characters or complex word structures, the GPT-4 tokenizer seems to split words excessively into subwords, such that many of the subwords are incomplete, partial byte sequences, inflating the token count compared to graphemes.</p><p>Nepali and Hindi both have a complex morphology involving prefixes, suffixes, and other meaningful units, and limited exposure to such structures during training could hinder the GPT-4 tokenizer&#8217;s ability to effectively tokenize these languages.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ywML!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ywML!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 424w, https://substackcdn.com/image/fetch/$s_!ywML!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 848w, https://substackcdn.com/image/fetch/$s_!ywML!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 1272w, https://substackcdn.com/image/fetch/$s_!ywML!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ywML!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png" width="1456" height="846" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:846,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:412695,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ywML!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 424w, https://substackcdn.com/image/fetch/$s_!ywML!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 848w, https://substackcdn.com/image/fetch/$s_!ywML!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 1272w, https://substackcdn.com/image/fetch/$s_!ywML!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Does this affect the overall tokenization time?</h4><p>The peak of the distribution of tokenization times for NLLB and RoBERTa is around <strong>2.2 seconds</strong> and <strong>2.0 seconds</strong>, respectively. cl100k_base is significantly faster, with a peak at <strong>0.0006 seconds</strong>. However, for a given tokenizer, tokenization speed varies only slightly across languages. 
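</p><p>For reference, per-sample timings like these can be gathered with a small harness along the following lines. This is a sketch of the measurement approach, not the benchmark code used above; the whitespace tokenizer is a stand-in for a real tokenizer&#8217;s encode function (e.g. from Hugging Face tokenizers or tiktoken).</p><pre><code>import statistics
import time

def time_tokenization(tokenize, samples, repeats=5):
    """Median wall-clock seconds to tokenize each sample."""
    timings = []
    for text in samples:
        runs = []
        for _ in range(repeats):
            start = time.perf_counter()
            tokenize(text)  # only the tokenizer call is timed
            runs.append(time.perf_counter() - start)
        timings.append(statistics.median(runs))
    return timings

# Stand-in tokenizer: swap in a real encode function, e.g.
# tiktoken.get_encoding("cl100k_base").encode
samples = ["The cost of ideas", "विचारों की लागत"]
timings = time_tokenization(str.split, samples)</code></pre><p>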
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9krw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9krw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 424w, https://substackcdn.com/image/fetch/$s_!9krw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 848w, https://substackcdn.com/image/fetch/$s_!9krw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 1272w, https://substackcdn.com/image/fetch/$s_!9krw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9krw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png" width="1416" height="920" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:920,&quot;width&quot;:1416,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:312823,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9krw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 424w, https://substackcdn.com/image/fetch/$s_!9krw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 848w, https://substackcdn.com/image/fetch/$s_!9krw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 1272w, https://substackcdn.com/image/fetch/$s_!9krw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!afNC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!afNC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 424w, https://substackcdn.com/image/fetch/$s_!afNC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 848w, 
https://substackcdn.com/image/fetch/$s_!afNC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 1272w, https://substackcdn.com/image/fetch/$s_!afNC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!afNC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png" width="1422" height="910" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:910,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:340804,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!afNC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 424w, https://substackcdn.com/image/fetch/$s_!afNC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 848w, 
https://substackcdn.com/image/fetch/$s_!afNC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 1272w, https://substackcdn.com/image/fetch/$s_!afNC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This metric was collected from a device with the following configuration:</p><p><em>Courtesy: Infinity Technology 
Inc.</em></p><pre><code>Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  48
  On-line CPU(s) list:   0-47
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
    CPU family:          6
    Model:               63
    Thread(s) per core:  2
    Core(s) per socket:  12
    Socket(s):           2
    Stepping:            2
    CPU max MHz:         3500.0000
    CPU min MHz:         1200.0000
Caches (sum of all):     
  L1d:                   768 KiB (24 instances)
  L1i:                   768 KiB (24 instances)
  L2:                    6 MiB (24 instances)
  L3:                    60 MiB (2 instances)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hCFb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hCFb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 424w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 848w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 1272w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hCFb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png" width="1422" height="948" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:948,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:248487,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hCFb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 424w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 848w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 1272w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>NLLB and RoBERTa both have larger vocabularies, so it is natural that they take more time to map the input text to the corresponding tokens. However, the relationship between tokenization speed and vocabulary size is not linear.</p><h3>Conclusion</h3><p>To conclude, in this article I explored the impact of training data and character representation on tokenization discrepancies for byte-based BPE tokenizers across languages. While the analysis focused on the Indo-European language family, the findings suggest that these disparities stem primarily from the models' exposure during training and from how characters are represented in Unicode.</p><p><strong>Key observations:</strong></p><ul><li><p>The GPT-4 vocabulary differed significantly from NLLB-200-distilled-600M and XLM-RoBERTa, both in size and in the distribution of non-Latin tokens.</p></li><li><p>Token length distribution varied across languages for all the tokenizers. 
While the variation is not significant for NLLB-200-distilled-600M and XLM-RoBERTa, token counts for non-Latin languages are much higher with the GPT-4 tokenizer.</p></li><li><p>The GPT-4 tokenizer showed a large discrepancy between grapheme count and token length for non-Latin languages.</p></li><li><p>Tokenization speed varied only slightly across languages for each tokenizer, but the GPT-4 tokenizer was at least twice as fast as the NLLB-200-distilled-600M and XLM-RoBERTa tokenizers.</p></li></ul><p><strong>Future considerations:</strong></p><ul><li><p>Investigating the tokenization strategies employed by each model in more detail.</p></li><li><p>Exploring the performance of these models on downstream tasks involving non-Latin languages.</p></li><li><p>Exploring whether a high token count in GPT-4 translates to poorer generation and language-understanding capabilities.</p></li></ul>]]></content:encoded></item></channel></rss>