<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[#icodeformyभाषा]]></title><description><![CDATA[#icodeformyभाषा | Nepali and Low-Resource NLP
         ]]></description><link>https://www.icodeformybhasa.com</link><image><url>https://substackcdn.com/image/fetch/$s_!DMde!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d92160-5f82-40ca-8233-b069f42bbba6_1080x1080.png</url><title>#icodeformyभाषा</title><link>https://www.icodeformybhasa.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 05 May 2026 12:28:30 GMT</lastBuildDate><atom:link href="https://www.icodeformybhasa.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[#icodeformyभाषा]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[icodeformybhasa@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[icodeformybhasa@substack.com]]></itunes:email><itunes:name><![CDATA[Shreeya]]></itunes:name></itunes:owner><itunes:author><![CDATA[Shreeya]]></itunes:author><googleplay:owner><![CDATA[icodeformybhasa@substack.com]]></googleplay:owner><googleplay:email><![CDATA[icodeformybhasa@substack.com]]></googleplay:email><googleplay:author><![CDATA[Shreeya]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Low-Resource NLP in the Era of LLMs - Introduction]]></title><description><![CDATA[There have, undoubtedly, been drastic shifts in the landscape of Natural Language Processing (NLP) research and development with the breakthrough of Large Language Models (LLMs) like ChatGPT.]]></description><link>https://www.icodeformybhasa.com/p/low-resource-nlp-in-the-era-of-llms</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/low-resource-nlp-in-the-era-of-llms</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Sat, 01 Mar 2025 07:18:08 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b828b488-d3e9-4e6f-89a9-8a56275f7f84.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There have, undoubtedly, been drastic shifts in the landscape 
of Natural Language Processing (NLP) research and development with the breakthrough of Large Language Models (LLMs) like ChatGPT. However, the majority of LLMs are optimized for a few high-resource languages such as English. This is because these LLMs are pretrained on large corpora of text from the internet, and the digital footprint of a select few languages is significantly larger than that of the majority. According to the <a href="https://w3techs.com/technologies/overview/content_language">W3Techs surveys, just five languages (English, Spanish, German, Japanese and French) account for over two-thirds of the content on the web</a>. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TVhT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1cd78b0-9fd8-484f-b568-f3125e7aa8b3_850x398.png"><img src="https://substackcdn.com/image/fetch/$s_!TVhT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1cd78b0-9fd8-484f-b568-f3125e7aa8b3_850x398.png" width="434" alt=""></a><figcaption class="image-caption">The percentages of websites using various content languages. The reports are updated daily - these numbers are from Feb 23, 2025.</figcaption></figure></div><p>One thing to note is that this digital divide is not representative of the global distribution of language speakers. For example, <a href="https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers">Mandarin Chinese has over 1.1 billion</a> speakers globally but is not even among the top 5 languages for web content, while English, with approximately 1.5 billion speakers, accounts for nearly half of all web content. The digital world is heavily skewed towards Western European languages: among the top 5 content languages, Japanese is the only non-Western European language. </p><p>This is the first post in my <em>&#8220;Low-Resource NLP in the era of LLMs&#8221;</em> series. In this post, we will introduce challenges in developing NLP systems for low-resource languages and opportunities in the space with the emergence of LLMs. 
In the subsequent posts, I will focus on hands-on experimentation to apply LLMs to different parts of the workflow for various low-resource NLP tasks.</p><div><hr></div><p>Before jumping into the topics, let&#8217;s talk a bit more about low-resource languages and why we care about developing LLMs for these languages.</p><h3>The Six Kinds of Languages</h3><p><a href="https://aclanthology.org/2020.acl-main.560.pdf">Joshi et al., 2020</a> define six classes of languages, from 0 to 5, based on their resource availability for NLP tasks: Class <em>0</em>, or <em>&#8220;The Left-Behinds&#8221;</em>, comprises languages with exceptionally limited data, while Class <em>5</em>, or <em>&#8220;The Winners&#8221;</em>, comprises languages with a dominant online presence and massive corporate and government investment in the development of language technologies. Class 0 languages benefit the least from the LLM breakthrough, and the authors claim that <em>&#8220;it is probably impossible to lift them up in the digital space.&#8221;</em></p><p><em>The following figure shows the six classes of languages based on their resource availability for NLP tasks:</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Cx8G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58722b9-037d-40f5-8563-67ae1a712ff4_780x772.png"><img src="https://substackcdn.com/image/fetch/$s_!Cx8G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58722b9-037d-40f5-8563-67ae1a712ff4_780x772.png" width="406" alt="" loading="lazy"></a><figcaption class="image-caption">From <a href="https://aclanthology.org/2020.acl-main.560.pdf">The State and Fate of Linguistic Diversity and Inclusion in the NLP World</a></figcaption></figure></div><p>While Class 0, or &#8220;The Left-Behinds&#8221;, has practically no labelled or unlabelled data, it is the largest group of languages and represents 15% of all speakers across classes. This means that even with LLMs and the recent breakthroughs in NLP, this disparity in language resource distribution poses a difficult challenge for global AI adoption.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_2N2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71ff84bd-99fd-4f8a-aa8e-0bf321c4bf86_1502x418.png"><img src="https://substackcdn.com/image/fetch/$s_!_2N2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71ff84bd-99fd-4f8a-aa8e-0bf321c4bf86_1502x418.png" width="1456" height="405" alt="" loading="lazy"></a><figcaption class="image-caption">From <a href="https://aclanthology.org/2020.acl-main.560.pdf">The State and Fate of Linguistic Diversity and Inclusion 
in the NLP World</a></figcaption></figure></div><p>Additionally, the absence of high-quality NLP systems for low-resource languages extends far beyond the missed opportunities for broader global AI adoption. The impact of globalization and the dominance of a few high-resource languages often lead to the continued marginalization and, in some cases, even the disappearance of languages in the digital space. </p><h3>The Data Gap</h3><p>Clearly, the availability of data, both labelled and unlabelled, is one of the most critical challenges when developing LLMs/NLP systems for low-resource languages. In fact, it is hard even to evaluate these models for quality, as high-quality evaluation data is often unavailable for some of these languages. </p><p>This, however, is a complex cycle: we need data to develop LLMs/NLP systems, but if these systems don&#8217;t support a language, use of that language in the digital space tends to decline, which in turn decreases the data available for model training. Disrupting this cycle will take a substantial amount of effort from multiple stakeholders, including language experts and domain experts. Collecting high-quality data is expensive: it requires acquiring data from a wide range of sources and domains, then inspecting, cleaning, and refining it. </p><p>Approaches to bridging this data gap involve crowdsourcing, mining the web, and data augmentation using techniques like repetition of blocks of text, minor editing of text, or back-translation. With the emergence of LLMs, more recent approaches leverage the inherent abilities of LLMs to bridge both the data and performance gaps for low-resource languages.  </p><h3>Leveraging LLMs for Low-Resource Languages</h3><p>LLMs pre-trained on large corpora of multilingual texts display multilingual abilities, whereby these models can understand and generate text in multiple languages. 
These models also exhibit cross-lingual transferability, meaning they can transfer knowledge learned from one language to improve performance on tasks in another. </p><p>Although it is often associated with pre-trained models, transfer learning is not a recent concept - it has been an <a href="http://cubs.buffalo.edu/govind/CSE705-SeminarPapers/12.pdf">actively researched area since as early as 1995</a>. In the low-resource context, high-resource languages/domains have been leveraged to transfer knowledge and improve performance on low-resource languages/domains. Recent works have demonstrated strategies like multilingual pre-training (<a href="https://arxiv.org/pdf/1911.02116">Conneau et al., 2020</a>, <a href="https://aclanthology.org/2021.naacl-main.41.pdf">Xue et al., 2020</a>, <a href="https://arxiv.org/pdf/2211.05100">Big Science BLOOM</a>) and cross-lingual alignment (<a href="https://aclanthology.org/2023.acl-long.346.pdf">Tanwar et al., 2023</a>), which help improve LLMs on tasks across multiple languages - including low-resource languages. Various Parameter Efficient Fine-Tuning (PEFT) methods (<a href="https://arxiv.org/pdf/2104.08691">Lester et al., 2021</a>, <a href="https://arxiv.org/pdf/2106.09685">Hu et al., 2021</a>, <a href="https://arxiv.org/pdf/2110.07602">Liu et al., 2022</a>, <a href="https://arxiv.org/pdf/2103.10385">Liu et al., 2023</a>) have also emerged as a potential solution, allowing effective fine-tuning of models with smaller amounts of labelled data.</p><div><hr></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;97fb2771-95a9-4f9e-a83d-ae5c25333d96&quot;,&quot;caption&quot;:&quot;The space of open-source and open-weights Large Language Models (LLMs) is growing and it is great news for practitioners, researchers and consumers of these advanced AI models. 
Now, individuals and organizations, who otherwise have limited financial resources to cover the substantial costs associated with pre-training&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Low-Rank Adaptation of LLaMA 3 for Nepali and Hindi&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:104468357,&quot;name&quot;:&quot;Shreeya&quot;,&quot;bio&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ccf9e7e-b918-482e-b8ed-800ad52e084e_1176x982.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-09-05T03:14:54.638Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26100a98-0095-4445-9e2c-b7e803777258_2048x2048.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.icodeformybhasa.com/p/low-rank-adaptation-of-llama-3-for&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:145013710,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:4,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;#icodeformy&#2349;&#2366;&#2359;&#2366;&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d92160-5f82-40ca-8233-b069f42bbba6_1080x1080.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><em>Read one of my previous posts, Low-Rank Adaptation of LLaMA 3 for Nepali and Hindi, where we discuss different Parameter Efficient Fine-Tuning (PEFT) techniques and share results from fine-tuning LLaMA 3 for Nepali and Hindi, two 
South Asian languages, one of which is a high-resource language and the other a low-resource language.</em></p><div><hr></div><p>In addition to improving the model itself for low-resource languages, LLMs are also being used to enrich various other parts of the NLP pipeline. Existing research has shown that LLMs are few- and zero-shot learners (<a href="https://aclanthology.org/D19-1077.pdf">Wu and Dredze, 2019</a>, <a href="https://arxiv.org/pdf/2005.14165">Brown et al., 2020</a>, <a href="https://arxiv.org/pdf/2112.10668">Lin et al., 2022</a>). This ability seems to provide an opportunity to bridge the data gap by using LLMs to generate synthetic data (<a href="https://aclanthology.org/2021.emnlp-main.555.pdf">Schick et al., 2021</a>, <a href="https://arxiv.org/pdf/2202.04538">Meng et al., 2022</a>) in low-resource languages. Recently, LLMs have been used as judges to evaluate model output (<a href="https://arxiv.org/pdf/2306.05685">Zheng et al., 2023</a>); a similar setup can be used in low-resource settings to speed up both data creation and model evaluation.</p><h3>Conclusion</h3><p>Clearly, the digital divide between high- and low-resource languages is vast, and it has led to a lack of robust, high-performing NLP systems and LLMs for low-resource languages. A large number of languages around the globe continue to be under-represented on the web and in LLMs. Most LLMs that we have today are optimized for a select set of high-resource languages and perform poorly on low-resource languages. Recent studies have found several strategies to enrich LLMs and support NLP R&amp;D for low-resource languages, from multilingual pretraining to leveraging LLMs&#8217; multilingual, cross-lingual, and zero/few-shot abilities. However, synthetic data generated from LLMs is not enough, and we need to foster local efforts in data sourcing at scale. 
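To make the synthetic-data idea above concrete, here is a minimal Python sketch that only constructs the few-shot prompt one would send to a chat-style LLM; the sentiment task, the Nepali seed sentences, and the prompt wording are my own illustrative assumptions, not from any of the cited papers:

```python
# Sketch: build a few-shot prompt asking an LLM to generate synthetic
# labelled data for a low-resource language. The task (sentiment), the
# seed sentences, and the prompt wording are illustrative assumptions.

def build_fewshot_prompt(seed_examples, label_set, n_new=5, language="Nepali"):
    """Format seed (text, label) pairs into a data-generation prompt."""
    lines = [
        f"You are a fluent {language} speaker helping build a sentiment dataset.",
        f"Valid labels: {', '.join(label_set)}.",
        "Here are some labelled examples:",
    ]
    for text, label in seed_examples:
        lines.append(f"Text: {text}\nLabel: {label}")
    lines.append(
        f"Generate {n_new} new, diverse {language} examples in the same "
        "Text/Label format. Do not copy the examples above."
    )
    return "\n\n".join(lines)

seeds = [
    ("खाना एकदम मीठो थियो।", "positive"),                # "The food was delicious."
    ("सेवा निकै ढिलो र झर्को लाग्दो थियो।", "negative"),  # "The service was slow and annoying."
]
prompt = build_fewshot_prompt(seeds, ["positive", "negative"], n_new=3)
print(prompt)
```

The returned string can be sent to any instruction-tuned LLM; the model's Text/Label completions can then be parsed into a candidate dataset and, ideally, filtered by native speakers before use.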
</p><h3>What Next in This Series?</h3><p>There is limited research dedicated to low-resource NLP and LLMs, and there is still a lack of clarity on how much we can achieve with LLMs alone. Although synthetic data could help research on low-resource NLP and LLMs, it will not capture nuances specific to different languages well. This means that for many use-cases where these nuances matter, NLP systems and LLMs will soon hit a performance ceiling, and the accuracy of the models will not be good enough for real-world use. </p><p>In the upcoming posts, we will work with examples to apply LLMs at different stages of the NLP pipeline, and understand what problems we can solve in the process. We will most likely rely on transfer learning and PEFT methods to develop tools and resources for Nepali, a low-resource language.</p>]]></content:encoded></item><item><title><![CDATA[Do LLMs Engage in True Reasoning?]]></title><description><![CDATA[Can LLMs &#8220;truly&#8221; reason?]]></description><link>https://www.icodeformybhasa.com/p/do-llms-engage-in-true-reasoning</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/do-llms-engage-in-true-reasoning</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Thu, 30 Jan 2025 09:11:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/84e09c94-db28-4064-9680-f0e6ddb2a84e.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Can LLMs <strong>&#8220;truly&#8221; </strong>reason? This question of whether LLMs are <em>truly</em> capable of reasoning is one of the most widely discussed questions in the field of AI today. There is a significant number of studies and claims both for and against LLMs exhibiting reasoning capabilities. This discussion is not limited to the AI research community but extends to AI practitioners, adopters, and general users. As new LLMs are released that beat existing benchmarks and solve a wider range of tasks, there are those who believe that LLMs are capable of complex human-like reasoning; there are also those who doubt whether these capabilities reflect true reasoning or pattern matching based on what the LLMs have seen during training. 
In this post, I summarize some of the recent papers that discuss reasoning in LLMs.</p><h3>Reasoning</h3><p>Reasoning is a systematic, cognitive process of using existing knowledge to make inferences and evaluations for solving problems. Reasoning is a key aspect of human intelligence, and one of the major goals of AI research is to enable AI systems to exhibit human-like reasoning capabilities. </p><p>Based on philosophy and on context in Natural Language Processing (NLP), the authors of <a href="https://arxiv.org/pdf/2303.14725">Natural Language Reasoning, A Survey</a> define Natural Language Reasoning (NLR) as <em>&#8220;a process to integrate multiple knowledge to derive some new conclusions about the world. Knowledge can be from both explicit and implicit sources. 
Conclusions are assertions or events assumed to be true in the world, or practical actions.&#8221;</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hFgR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcad25dd2-d1ef-4190-b7ed-f55017e872eb_1940x442.png"><img src="https://substackcdn.com/image/fetch/$s_!hFgR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcad25dd2-d1ef-4190-b7ed-f55017e872eb_1940x442.png" width="1456" height="332" alt=""></a><figcaption class="image-caption">From <a href="https://arxiv.org/pdf/2303.14725">Natural Language Reasoning, A Survey</a></figcaption></figure></div><p>In this work, the authors also suggest <em>&#8220;what isn&#8217;t reasoning in NLP&#8221;</em> and <em>&#8220;what NLP reasoning can do.&#8221; </em>Reaching a solution based on memorization from training data, 
information from a knowledge base, or the provided context is not considered reasoning. Reasoning in LLMs allows the models to solve complex and novel problems, make informed decisions, and generalize to new domains and problem spaces.</p><blockquote><p><em>Definition 2.4 (NLP reasoning). Natural language reasoning is a process to integrate multiple knowledge (e.g. encyclopedic knowledge and commonsense knowledge) to derive some new conclusions about the (realistic or hypothetical) world. Knowledge can be from both explicit and implicit sources. Conclusions are assertions or events assumed to be true in the world, or practical actions.</em></p><p><em>Description 2.3 (NLP negation-based). Natural language reasoning is to derive new assertions, events, or actions without direct recourse to models&#8217; memorization, knowledge base storage and the provided context.</em></p><p><em>Description 2.4 (NLP task-based). Reasoning is an important method to arrive at the required answers or solutions. It is effective when what we need is neither provided by context nor memorized by models and stored by knowledge bases, but reachable by integrating available information.</em> </p><p>- <a href="https://arxiv.org/pdf/2303.14725">Natural Language Reasoning, A Survey</a></p></blockquote><h3>Reasoning in LLMs</h3><p>LLMs that can reason are robust and generalizable across domains, while those that rely on pattern matching learned from training data exhibit high variance in their responses, lack generalizability, and are not trustworthy.</p><p>Existing research has found that sufficiently large LLMs show <em><a href="https://arxiv.org/pdf/2206.07682">emergent abilities that are not present in smaller models</a>, </em>including complex multi-step reasoning abilities. 
These reasoning abilities in models can be unlocked with techniques like <em><a href="https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf">chain-of-thought prompting</a> (CoT)</em> and similar techniques in which the model relies on prior reasoning steps to produce later steps and the final answer (<a href="https://arxiv.org/abs/2305.10601">Yao et al., 2023a</a>, <a href="https://arxiv.org/abs/2305.04091">Wang et al., 2023</a>). This works even in models that are not explicitly trained for reasoning.</p><p>A CoT prompt includes <em>input, chain of thought, and output, </em>where a chain of thought is a series of intermediate natural language reasoning steps that lead to the final output.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U0Te!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d7143a-5c21-4b45-af85-10177185b70b_1345x740.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U0Te!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d7143a-5c21-4b45-af85-10177185b70b_1345x740.png 424w, https://substackcdn.com/image/fetch/$s_!U0Te!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d7143a-5c21-4b45-af85-10177185b70b_1345x740.png 848w, https://substackcdn.com/image/fetch/$s_!U0Te!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d7143a-5c21-4b45-af85-10177185b70b_1345x740.png 1272w, 
https://substackcdn.com/image/fetch/$s_!U0Te!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d7143a-5c21-4b45-af85-10177185b70b_1345x740.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!U0Te!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d7143a-5c21-4b45-af85-10177185b70b_1345x740.png" width="1345" height="740" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c5d7143a-5c21-4b45-af85-10177185b70b_1345x740.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:740,&quot;width&quot;:1345,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:222130,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!U0Te!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d7143a-5c21-4b45-af85-10177185b70b_1345x740.png 424w, https://substackcdn.com/image/fetch/$s_!U0Te!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d7143a-5c21-4b45-af85-10177185b70b_1345x740.png 848w, https://substackcdn.com/image/fetch/$s_!U0Te!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d7143a-5c21-4b45-af85-10177185b70b_1345x740.png 1272w, 
https://substackcdn.com/image/fetch/$s_!U0Te!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d7143a-5c21-4b45-af85-10177185b70b_1345x740.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">From <a href="https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf">Chain-of-Thought Prompting Elicits Reasoning in Large Language Models</a></figcaption></figure></div><p>CoT prompts are carefully constructed and contain instructions and few-shot examples for the LLMs to learn from, 
but in <a href="https://openreview.net/pdf?id=e2TBb5y0yFf">Kojima et al. (2022)</a> the authors show that LLMs are <em>&#8220;zero-shot reasoners&#8221;</em> that can significantly improve performance on reasoning benchmarks by simply adding &#8220;Let&#8217;s think step by step&#8221; before each answer, which they call zero-shot-CoT.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mP1P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5babd082-ea38-4a70-a885-d5c1458ccb91_2311x1642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mP1P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5babd082-ea38-4a70-a885-d5c1458ccb91_2311x1642.png 424w, https://substackcdn.com/image/fetch/$s_!mP1P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5babd082-ea38-4a70-a885-d5c1458ccb91_2311x1642.png 848w, https://substackcdn.com/image/fetch/$s_!mP1P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5babd082-ea38-4a70-a885-d5c1458ccb91_2311x1642.png 1272w, https://substackcdn.com/image/fetch/$s_!mP1P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5babd082-ea38-4a70-a885-d5c1458ccb91_2311x1642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mP1P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5babd082-ea38-4a70-a885-d5c1458ccb91_2311x1642.png" width="2311" height="1642" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5babd082-ea38-4a70-a885-d5c1458ccb91_2311x1642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1642,&quot;width&quot;:2311,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:816360,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mP1P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5babd082-ea38-4a70-a885-d5c1458ccb91_2311x1642.png 424w, https://substackcdn.com/image/fetch/$s_!mP1P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5babd082-ea38-4a70-a885-d5c1458ccb91_2311x1642.png 848w, https://substackcdn.com/image/fetch/$s_!mP1P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5babd082-ea38-4a70-a885-d5c1458ccb91_2311x1642.png 1272w, https://substackcdn.com/image/fetch/$s_!mP1P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5babd082-ea38-4a70-a885-d5c1458ccb91_2311x1642.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From <a href="https://openreview.net/pdf?id=e2TBb5y0yFf">Large Language Models are Zero-Shot Reasoners</a></figcaption></figure></div><p>In both Wei et al. (2022b) and Kojima et al. (2022), the authors found that CoT and zero-shot-CoT prompting drastically improves LLMs&#8217; ability to perform complex reasoning and CoT reasoning is an emergent property of model scale that is not present in smaller models. While the authors in Wei et al. 
(2022b) show that CoT prompting improves reasoning in LLMs and elicits human-like reasoning processes, they refrain from claiming that the model is actually reasoning.</p><blockquote><p><em>&#8220;As for limitations, we first qualify that although chain of thought emulates the thought processes of human reasoners, this does not answer whether the neural network is actually &#8220;reasoning,&#8221; which we leave as an open question.</em></p><p><em>- <a href="https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf">Chain-of-Thought Prompting Elicits Reasoning in Large Language Models</a></em></p></blockquote><h3><strong>Reasoning or Reciting - How reliable are they?</strong></h3><p>Several works have evaluated LLMs on a diverse set of reasoning tasks, and while LLMs seem to be getting better at solving problems in diverse domains by generating step-by-step chain-of-thought reasoning, existing research has questioned whether this step-by-step reasoning is actual reasoning or a repetition of patterns the models learned during training. </p><p><a href="https://arxiv.org/pdf/2307.02477">Wu et al. (2024)</a> designed a suite of 11 counterfactual tasks to evaluate whether the reasoning abilities of LLMs are general and transferable, or specialized to the specific tasks seen during pretraining. They found that model performance degrades on the counterfactual tasks compared to the default versions of the tasks. This suggests that the reasoning and problem-solving abilities LLMs display often rely on narrow, non-transferable procedures for solving problems that are learned during training. 
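</p><p>To make the counterfactual setup concrete: one of the tasks in Wu et al. (2024) is two-digit addition in unusual bases. The sketch below is my own minimal illustration of that idea, not code from the paper; it builds matched default (base-10) and counterfactual (base-9) addition problems together with their ground-truth answers, so that a model&#8217;s accuracy can be compared across the two conditions.</p>

```python
import random

def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base (e.g. 110 -> "132" in base 9)."""
    if n == 0:
        return "0"
    digits = []
    while n:
        digits.append(str(n % base))
        n //= base
    return "".join(reversed(digits))

def make_addition_problem(base: int, rng: random.Random):
    """Two-digit addition in `base`; returns (prompt, ground-truth answer string)."""
    a = rng.randrange(base, base * base)  # exactly two digits in the target base
    b = rng.randrange(base, base * base)
    prompt = (f"Assume all numbers are in base-{base}. "
              f"What is {to_base(a, base)} + {to_base(b, base)}?")
    return prompt, to_base(a + b, base)

rng = random.Random(0)
default_prompt, default_answer = make_addition_problem(10, rng)   # default task
counterfactual_prompt, cf_answer = make_addition_problem(9, rng)  # counterfactual task
```

<p>A model that has internalized a general addition procedure should handle both conditions equally well; Wu et al. found that performance drops sharply in the counterfactual bases.</p><p>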
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0--g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f7bbc7-bd13-4f9b-81a2-386c01e2a9fa_1568x1342.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0--g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f7bbc7-bd13-4f9b-81a2-386c01e2a9fa_1568x1342.png 424w, https://substackcdn.com/image/fetch/$s_!0--g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f7bbc7-bd13-4f9b-81a2-386c01e2a9fa_1568x1342.png 848w, https://substackcdn.com/image/fetch/$s_!0--g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f7bbc7-bd13-4f9b-81a2-386c01e2a9fa_1568x1342.png 1272w, https://substackcdn.com/image/fetch/$s_!0--g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f7bbc7-bd13-4f9b-81a2-386c01e2a9fa_1568x1342.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0--g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f7bbc7-bd13-4f9b-81a2-386c01e2a9fa_1568x1342.png" width="1456" height="1246" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29f7bbc7-bd13-4f9b-81a2-386c01e2a9fa_1568x1342.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1246,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:899398,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0--g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f7bbc7-bd13-4f9b-81a2-386c01e2a9fa_1568x1342.png 424w, https://substackcdn.com/image/fetch/$s_!0--g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f7bbc7-bd13-4f9b-81a2-386c01e2a9fa_1568x1342.png 848w, https://substackcdn.com/image/fetch/$s_!0--g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f7bbc7-bd13-4f9b-81a2-386c01e2a9fa_1568x1342.png 1272w, https://substackcdn.com/image/fetch/$s_!0--g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f7bbc7-bd13-4f9b-81a2-386c01e2a9fa_1568x1342.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From <a href="https://arxiv.org/pdf/2307.02477">Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks</a></figcaption></figure></div><p>In <a href="https://arxiv.org/pdf/2406.11050">Jiang et al. (2024)</a>, the authors generate synthetic data, perform systematic token perturbations, and run comparative evaluations of LLMs to show that LLMs are subject to token bias, which creates an illusion of reasoning. 
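</p><p>Jiang et al. probe this with classic logic problems, such as the conjunction fallacy (the &#8220;Linda problem&#8221;), whose correct answer is invariant to the surface tokens used. The snippet below is a simplified sketch of that perturbation idea; the template wording and helper names are mine, not the paper&#8217;s.</p>

```python
import random

# A synthetic logic item whose correct answer is independent of surface tokens:
# P(A) >= P(A and B) for any events A and B (the conjunction rule).
TEMPLATE = ("{name} is 31, single, outspoken, and deeply concerned with social "
            "justice. Which is more probable? (a) {name} is a {job}. "
            "(b) {name} is a {job} and is active in the feminist movement.")
CORRECT = "a"

def perturbed_variants(names, jobs, k, seed=0):
    """Generate k copies of the same problem with perturbed surface tokens."""
    rng = random.Random(seed)
    return [TEMPLATE.format(name=rng.choice(names), job=rng.choice(jobs))
            for _ in range(k)]

variants = perturbed_variants(["Linda", "Luka", "Aisha"], ["bank teller", "nurse"], k=5)
```

<p>A model that genuinely reasons should answer &#8220;a&#8221; on every variant; large accuracy swings across such logically equivalent variants are the token bias the authors report.</p><p>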
The experiments in this work reveal that the reasoning abilities displayed by the LLMs are not consistent and rely heavily on token bias during response generation, suggesting that the reasoning process in LLMs is closer to pattern matching than to genuine reasoning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0oUd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b3c00ef-893f-4726-bf0c-004eec042d1d_748x1222.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0oUd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b3c00ef-893f-4726-bf0c-004eec042d1d_748x1222.png 424w, https://substackcdn.com/image/fetch/$s_!0oUd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b3c00ef-893f-4726-bf0c-004eec042d1d_748x1222.png 848w, https://substackcdn.com/image/fetch/$s_!0oUd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b3c00ef-893f-4726-bf0c-004eec042d1d_748x1222.png 1272w, https://substackcdn.com/image/fetch/$s_!0oUd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b3c00ef-893f-4726-bf0c-004eec042d1d_748x1222.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0oUd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b3c00ef-893f-4726-bf0c-004eec042d1d_748x1222.png" width="314" height="512.9786096256685" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b3c00ef-893f-4726-bf0c-004eec042d1d_748x1222.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1222,&quot;width&quot;:748,&quot;resizeWidth&quot;:314,&quot;bytes&quot;:417848,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0oUd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b3c00ef-893f-4726-bf0c-004eec042d1d_748x1222.png 424w, https://substackcdn.com/image/fetch/$s_!0oUd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b3c00ef-893f-4726-bf0c-004eec042d1d_748x1222.png 848w, https://substackcdn.com/image/fetch/$s_!0oUd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b3c00ef-893f-4726-bf0c-004eec042d1d_748x1222.png 1272w, https://substackcdn.com/image/fetch/$s_!0oUd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b3c00ef-893f-4726-bf0c-004eec042d1d_748x1222.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From <a href="https://arxiv.org/pdf/2406.11050">A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners</a></figcaption></figure></div><p><a href="https://arxiv.org/pdf/2410.05229">Mirzadeh et al. (2024)</a> conducted a large-scale study evaluating the mathematical reasoning abilities of 25 state-of-the-art open and closed models. In this work they introduce GSM-Symbolic, an enhanced benchmark with diverse variants of GSM8K questions generated using symbolic templates.<a href="https://huggingface.co/datasets/openai/gsm8k"> GSM8K</a> is a popular evaluation benchmark of grade-school math questions and has been used by several existing studies to demonstrate mathematical reasoning abilities in LLMs. 
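</p><p>The idea behind the symbolic templates can be sketched in a few lines: each GSM8K-style question becomes a template whose names and numbers are resampled under constraints that keep the question well-posed, with the ground-truth answer recomputed for every instance. The toy template below is my own, not one from the benchmark.</p>

```python
import random

def sample_gsm_variant(seed=None):
    """Instantiate one GSM8K-style question from a symbolic template."""
    rng = random.Random(seed)
    name = rng.choice(["Sophie", "Ravi", "Mina"])
    x = rng.randint(5, 20)          # toys bought on Monday
    y = rng.randint(2, 10)          # toys bought on Tuesday
    z = rng.randint(1, min(4, x))   # toys given away; constraint keeps z <= x
    question = (f"{name} bought {x} toys on Monday and {y} more on Tuesday, "
                f"then gave {z} away. How many toys does {name} have now?")
    answer = x + y - z              # ground truth recomputed for this instance
    return question, answer

q, a = sample_gsm_variant(seed=42)
```

<p>Averaging a model&#8217;s accuracy over many such instantiations, rather than over one fixed test set, is what exposes the performance variance the paper reports.</p><p>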
Additionally, they also introduce the GSM-NoOp dataset, created by adding &#8220;<em>seemingly relevant but ultimately irrelevant information to problems.&#8221;</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kAjD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c87538-b704-4ec7-abac-827db245c2a0_1754x1426.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kAjD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c87538-b704-4ec7-abac-827db245c2a0_1754x1426.png 424w, https://substackcdn.com/image/fetch/$s_!kAjD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c87538-b704-4ec7-abac-827db245c2a0_1754x1426.png 848w, https://substackcdn.com/image/fetch/$s_!kAjD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c87538-b704-4ec7-abac-827db245c2a0_1754x1426.png 1272w, https://substackcdn.com/image/fetch/$s_!kAjD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c87538-b704-4ec7-abac-827db245c2a0_1754x1426.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kAjD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c87538-b704-4ec7-abac-827db245c2a0_1754x1426.png" width="1456" height="1184" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83c87538-b704-4ec7-abac-827db245c2a0_1754x1426.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1184,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:521360,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kAjD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c87538-b704-4ec7-abac-827db245c2a0_1754x1426.png 424w, https://substackcdn.com/image/fetch/$s_!kAjD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c87538-b704-4ec7-abac-827db245c2a0_1754x1426.png 848w, https://substackcdn.com/image/fetch/$s_!kAjD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c87538-b704-4ec7-abac-827db245c2a0_1754x1426.png 1272w, https://substackcdn.com/image/fetch/$s_!kAjD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c87538-b704-4ec7-abac-827db245c2a0_1754x1426.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From <a href="https://arxiv.org/pdf/2410.05229">GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models</a></figcaption></figure></div><p>The findings from the GSM-Symbolic experiments reveal that all state-of-the-art LLMs exhibit a performance decline on GSM-Symbolic compared to the original GSM8K benchmark, &#8220;<em>hinting at potential data contamination.&#8221;</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6v37!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb9b0d-7926-4de7-9b41-3ce225dc2e57_1770x894.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!6v37!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb9b0d-7926-4de7-9b41-3ce225dc2e57_1770x894.png 424w, https://substackcdn.com/image/fetch/$s_!6v37!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb9b0d-7926-4de7-9b41-3ce225dc2e57_1770x894.png 848w, https://substackcdn.com/image/fetch/$s_!6v37!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb9b0d-7926-4de7-9b41-3ce225dc2e57_1770x894.png 1272w, https://substackcdn.com/image/fetch/$s_!6v37!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb9b0d-7926-4de7-9b41-3ce225dc2e57_1770x894.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6v37!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb9b0d-7926-4de7-9b41-3ce225dc2e57_1770x894.png" width="1456" height="735" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49bb9b0d-7926-4de7-9b41-3ce225dc2e57_1770x894.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:735,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:231775,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!6v37!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb9b0d-7926-4de7-9b41-3ce225dc2e57_1770x894.png 424w, https://substackcdn.com/image/fetch/$s_!6v37!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb9b0d-7926-4de7-9b41-3ce225dc2e57_1770x894.png 848w, https://substackcdn.com/image/fetch/$s_!6v37!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb9b0d-7926-4de7-9b41-3ce225dc2e57_1770x894.png 1272w, https://substackcdn.com/image/fetch/$s_!6v37!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb9b0d-7926-4de7-9b41-3ce225dc2e57_1770x894.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From <a href="https://arxiv.org/pdf/2410.05229">GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models</a></figcaption></figure></div><p>Additionally, they observed a substantial performance drop (up to 65%) on the GSM-NoOp dataset, which suggests that the models lack the ability to understand mathematical concepts and to discern the information relevant to solving a problem. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JUC5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd27b9778-49ab-43f9-9484-c61630fb6ef4_688x948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JUC5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd27b9778-49ab-43f9-9484-c61630fb6ef4_688x948.png 424w, https://substackcdn.com/image/fetch/$s_!JUC5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd27b9778-49ab-43f9-9484-c61630fb6ef4_688x948.png 848w, https://substackcdn.com/image/fetch/$s_!JUC5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd27b9778-49ab-43f9-9484-c61630fb6ef4_688x948.png 1272w, 
https://substackcdn.com/image/fetch/$s_!JUC5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd27b9778-49ab-43f9-9484-c61630fb6ef4_688x948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JUC5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd27b9778-49ab-43f9-9484-c61630fb6ef4_688x948.png" width="314" height="432.66279069767444" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d27b9778-49ab-43f9-9484-c61630fb6ef4_688x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:948,&quot;width&quot;:688,&quot;resizeWidth&quot;:314,&quot;bytes&quot;:146611,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JUC5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd27b9778-49ab-43f9-9484-c61630fb6ef4_688x948.png 424w, https://substackcdn.com/image/fetch/$s_!JUC5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd27b9778-49ab-43f9-9484-c61630fb6ef4_688x948.png 848w, https://substackcdn.com/image/fetch/$s_!JUC5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd27b9778-49ab-43f9-9484-c61630fb6ef4_688x948.png 1272w, 
https://substackcdn.com/image/fetch/$s_!JUC5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd27b9778-49ab-43f9-9484-c61630fb6ef4_688x948.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">From <a href="https://arxiv.org/pdf/2410.05229">GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models</a></figcaption></figure></div><p>In addition to what seems like recitation rather than reasoning, existing studies have also shown that LLM explanations themselves might not be faithful; see <a 
href="https://arxiv.org/pdf/2305.04388">Turpin et al. (2023)</a>, <a href="https://arxiv.org/abs/2307.13702">Lanham et al. (2023)</a>, <a href="https://www.lesswrong.com/posts/FX5JmftqL2j6K8dn4/shapley-value-attribution-in-chain-of-thought">Shapley Value Attribution in Chain of Thought</a>. </p><p>Turpin et al. (2023) reveals that CoT explanations from LLMs can be systematically misleading and may not reflect the true reason for a model&#8217;s prediction. In their work, they show that biasing the model&#8217;s input towards incorrect answers causes LLMs to fail and to generate CoT explanations that rationalize the incorrect answers. For example, by just reordering the multiple-choice options in a few-shot prompt to make the answer always &#8220;(A)&#8221;, an LLM could be tricked into predicting A with a CoT explanation that is consistent with the prediction but is not faithful to the model's decision process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b6oH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33f58436-7869-44a3-b0ad-96613f9b3150_2394x934.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b6oH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33f58436-7869-44a3-b0ad-96613f9b3150_2394x934.png 424w, https://substackcdn.com/image/fetch/$s_!b6oH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33f58436-7869-44a3-b0ad-96613f9b3150_2394x934.png 848w, 
https://substackcdn.com/image/fetch/$s_!b6oH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33f58436-7869-44a3-b0ad-96613f9b3150_2394x934.png 1272w, https://substackcdn.com/image/fetch/$s_!b6oH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33f58436-7869-44a3-b0ad-96613f9b3150_2394x934.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b6oH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33f58436-7869-44a3-b0ad-96613f9b3150_2394x934.png" width="1456" height="568" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33f58436-7869-44a3-b0ad-96613f9b3150_2394x934.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:568,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:425547,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b6oH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33f58436-7869-44a3-b0ad-96613f9b3150_2394x934.png 424w, https://substackcdn.com/image/fetch/$s_!b6oH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33f58436-7869-44a3-b0ad-96613f9b3150_2394x934.png 848w, 
https://substackcdn.com/image/fetch/$s_!b6oH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33f58436-7869-44a3-b0ad-96613f9b3150_2394x934.png 1272w, https://substackcdn.com/image/fetch/$s_!b6oH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33f58436-7869-44a3-b0ad-96613f9b3150_2394x934.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From <a href="https://arxiv.org/pdf/2305.04388">Language Models Don&#8217;t Always Say What They Think: Unfaithful 
Explanations in Chain-of-Thought Prompting</a></figcaption></figure></div><p>In contrast, Lanham et al. (2023) takes a non-adversarial approach to studying CoT faithfulness in LLMs: they truncate the original CoT, insert mistakes or filler tokens into it, or paraphrase it, and then check whether the final answer changes. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z6CS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094e566c-e2f2-431c-bbf6-ec480c37a40c_804x1228.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z6CS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094e566c-e2f2-431c-bbf6-ec480c37a40c_804x1228.png 424w, https://substackcdn.com/image/fetch/$s_!Z6CS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094e566c-e2f2-431c-bbf6-ec480c37a40c_804x1228.png 848w, https://substackcdn.com/image/fetch/$s_!Z6CS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094e566c-e2f2-431c-bbf6-ec480c37a40c_804x1228.png 1272w, https://substackcdn.com/image/fetch/$s_!Z6CS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094e566c-e2f2-431c-bbf6-ec480c37a40c_804x1228.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z6CS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094e566c-e2f2-431c-bbf6-ec480c37a40c_804x1228.png" width="322" height="491.81094527363183" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/094e566c-e2f2-431c-bbf6-ec480c37a40c_804x1228.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1228,&quot;width&quot;:804,&quot;resizeWidth&quot;:322,&quot;bytes&quot;:223673,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z6CS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094e566c-e2f2-431c-bbf6-ec480c37a40c_804x1228.png 424w, https://substackcdn.com/image/fetch/$s_!Z6CS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094e566c-e2f2-431c-bbf6-ec480c37a40c_804x1228.png 848w, https://substackcdn.com/image/fetch/$s_!Z6CS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094e566c-e2f2-431c-bbf6-ec480c37a40c_804x1228.png 1272w, https://substackcdn.com/image/fetch/$s_!Z6CS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094e566c-e2f2-431c-bbf6-ec480c37a40c_804x1228.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From <a href="https://arxiv.org/abs/2307.13702">Measuring Faithfulness in Chain-of-Thought Reasoning</a></figcaption></figure></div><p>The authors found that some tasks rely heavily on reasoning steps to reach the final answer (e.g. <a href="https://aclanthology.org/P17-1015.pdf">AQuA</a>) while others don&#8217;t (e.g. <a href="https://arxiv.org/abs/1803.05457">ARC</a>), and that reasoning faithfulness is significantly lower for some tasks. Surprisingly, they observed that post-hoc reasoning (i.e. reasoning generated after an answer is decided, which is more likely to be unfaithful) increases with model size on each task, and increases for easier tasks at the same model size. They also observed that the amount of post-hoc reasoning doesn&#8217;t always predict how much CoT improves task performance. In fact, faithful reasoning does not always mean higher performance. 
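</p><p>The truncation probe is straightforward to sketch in code. The snippet below is a toy illustration of the idea, not the paper&#8217;s implementation; <code>answer_with_cot</code> is a hypothetical stand-in for a real model call.</p>

```python
def answer_with_cot(question, cot_steps):
    """Hypothetical stand-in for an LLM call: returns an answer given a
    (possibly truncated) chain of thought. This toy model's answer is
    driven entirely by the last reasoning step it is shown."""
    return cot_steps[-1] if cot_steps else "no answer"

def same_answer_fraction(question, full_cot):
    """Truncation probe in the spirit of Lanham et al.: cut the CoT at
    every prefix length and check whether the final answer survives.
    A fraction near 1.0 suggests post-hoc reasoning (the answer never
    depended on the stated CoT)."""
    final = answer_with_cot(question, full_cot)
    same = sum(
        answer_with_cot(question, full_cot[:k]) == final
        for k in range(len(full_cot))
    )
    return same / len(full_cot)

cot = ["12 * 3 = 36", "36 + 4 = 40", "answer: 40"]
frac = same_answer_fraction("What is 12 * 3 + 4?", cot)
# Here every truncation changes the answer, so frac == 0.0: this toy
# model's answer tracks its stated reasoning rather than being post-hoc.
```

<p>Run against a real model, a high fraction of unchanged answers under truncation is the signal the authors read as post-hoc reasoning.</p><p>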
</p><h3>Conclusion</h3><p>Existing research has shown that LLMs can effectively mimic human reasoning processes when solving complex problems. By decomposing a complex task into smaller, more manageable subtasks, LLMs can solve it via a sequence of logical steps, the chain of thought (CoT). The chain of thought showcases how LLMs engage in human-like step-by-step problem solving, which is often cited as evidence of reasoning abilities in LLMs.</p><p>However, more recent studies have found some CoT explanations to be unfaithful and post-hoc, meaning the LLMs&#8217; reasoning was generated after an answer was decided. One could argue that post-hoc explanations still provide evidence of reasoning ability, but there is mounting evidence that LLMs rely heavily on token bias and are not true reasoners. They are susceptible to small changes in the input prompt and show high variance across different versions of a question from benchmarks they are good at, which suggests that they have memorized some of the solutions from training.</p><p>Yes, over the years, LLMs&#8217; performance on several benchmarks across various domains has improved significantly, and they are now able to solve more complex tasks more efficiently. 
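</p><p>That variance can be probed GSM-Symbolic-style, by templating a question and resampling its names and numbers. The sketch below is purely illustrative (it is not the benchmark&#8217;s actual generator):</p>

```python
import random

# One underlying problem, many surface forms: resample names and numbers
# the way GSM-Symbolic builds variants from a GSM8K-style template.
TEMPLATE = ("{name} picks {a} apples on Monday and {b} more on Tuesday. "
            "How many apples does {name} have now?")

def make_variant(rng):
    """Return one surface variant of the problem and its ground truth."""
    name = rng.choice(["Sita", "Ram", "Maya", "Hari"])
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    return TEMPLATE.format(name=name, a=a, b=b), a + b

rng = random.Random(0)
variants = [make_variant(rng) for _ in range(5)]
# A model that truly reasons should be equally accurate on every variant;
# large accuracy swings across such resamples point to memorization.
```

<p>Scoring a model on many such resamples, rather than on the single canonical benchmark item, is what exposes the variance described above.</p><p>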
However, the fact that a simple adversarial prompt, such as reordering the multiple-choice options in a few-shot prompt so that the answer is always &#8220;(A)&#8221;, can trick an LLM into predicting A even when it is incorrect shows that LLMs are susceptible to manipulation, are incapable of discerning misleading patterns, and rely heavily on surface patterns and input token biases rather than engaging in true reasoning.</p>]]></content:encoded></item><item><title><![CDATA[Strawberry (o1) - Does changing language affect its reasoning?]]></title><description><![CDATA[What is going on inside OpenAI's o1 model and whether changing language affects its reasoning - a short bilingual experiment.]]></description><link>https://www.icodeformybhasa.com/p/strawberry-o1-does-changing-language</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/strawberry-o1-does-changing-language</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Sun, 22 Sep 2024 19:06:18 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8a89e234-82e9-4f4e-90d7-807b66117c66_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On September 12, OpenAI<a 
href="https://openai.com/index/learning-to-reason-with-llms/"> </a><em><a href="https://openai.com/index/introducing-openai-o1-preview/">released</a></em> its new series of AI models trained with<a href="https://openai.com/index/learning-to-reason-with-llms/"> </a><em><a href="https://openai.com/index/learning-to-reason-with-llms/">reinforcement learning to perform complex reasoning</a></em>, called o1. Two versions of the model were released, o1-preview and o1-mini. These models are trained to &#8220;think&#8221; before answering, that is, to generate a long chain of thought (CoT) before responding. This allows them to handle complex, reasoning-intensive tasks. With o1, we will see a shift in focus from scaling pretraining to scaling inference compute.</p><blockquote><p><em>&#8220;o1 marks the start of a new era in AI, where models are trained to "think" before answering through a private chain of thought. The more time they take to think, the better they handle complex reasoning. We're no longer limited by pretraining paradigm; now, we can scale through inference compute, opening up new possibilities for capabilities and alignment.</em></p><p><em>- Part of a tweet by Mira Murati, CTO, OpenAI on <a href="https://x.com/miramurati/status/1834425979680555280">Twitter</a></em> </p></blockquote><p>There is much speculation about the o1 series of models and how they were potentially trained to think, although the specific techniques used by OpenAI are still not public. In this post, we will first summarize what we know from official sources - the<a href="https://openai.com/index/learning-to-reason-with-llms/"> release note</a> and the AMA by the OpenAI o1 team on<a href="https://x.com/OpenAIDevs/status/1834608585151594537"> Twitter</a>. 
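</p><p>One simple, well-known way to spend more inference compute is self-consistency: sample several chains of thought and take a majority vote over the final answers. The toy simulation below illustrates why this helps; it is an illustration of inference-time scaling in general, not of how o1 itself works.</p>

```python
import random
from collections import Counter

def sample_answer(rng, correct=42, p_correct=0.6):
    """Stand-in for one stochastic CoT sample from a model: correct 60%
    of the time, otherwise a nearby wrong answer."""
    return correct if rng.random() < p_correct else correct + rng.choice([-1, 1])

def majority_vote(rng, n_samples):
    """Spend n_samples worth of inference compute, then take the mode."""
    answers = [sample_answer(rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

rng = random.Random(0)
trials = 500
acc_1 = sum(majority_vote(rng, 1) == 42 for _ in range(trials)) / trials
acc_15 = sum(majority_vote(rng, 15) == 42 for _ in range(trials)) / trials
# As long as a single sample is right more often than not, accuracy
# climbs as more samples are aggregated, i.e. as inference compute grows.
```

<p>o1 reportedly gets its gains from a learned, private chain of thought rather than naive voting, but the broader pattern is the same: more test-time compute, better answers.</p><p>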
We will also look into<a href="https://arxiv.org/pdf/2305.20050"> </a><em><a href="https://arxiv.org/pdf/2305.20050">&#8220;Let&#8217;s Verify Step by Step&#8221;</a>,</em> published last year, which many, including me, believe could provide insights into the techniques used to train the o1 family of models. Finally, I will share my experience of using the o1-mini model on Mathematics and Science questions from SEE exams in Nepal. I will also ask the same set of questions in Nepali and English to see whether changing the language affects the model&#8217;s reasoning.</p><p>Before we jump into the rest of the blog, I highly recommend reading the following:</p><ul><li><p><a href="https://openai.com/index/learning-to-reason-with-llms/">Learning to Reason with LLMs</a>, September 12, 2024 by OpenAI</p></li><li><p><a href="https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/">OpenAI o1-mini, Advancing cost-efficient reasoning</a>, September 12, 2024 by OpenAI</p></li><li><p><a href="https://arxiv.org/pdf/2305.20050">Let&#8217;s Verify Step by Step</a>, 31 May 2023 by OpenAI</p></li><li><p><a href="https://www.interconnects.ai/p/reverse-engineering-openai-o1">Reverse engineering OpenAI&#8217;s o1</a>, September 16, 2024 by Nathan Lambert</p></li></ul><h3>Straight From The Source</h3><p>The o1 series models are trained with reinforcement learning to think before they answer, which allows them to perform sophisticated reasoning. 
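</p><p>The core idea of &#8220;Let&#8217;s Verify Step by Step&#8221;, process supervision, is to score every step of a candidate solution with a process reward model (PRM) and rank candidates by the aggregated step scores. A toy sketch, where <code>step_score</code> is a hypothetical stand-in for a trained PRM:</p>

```python
import math

def step_score(step):
    """Hypothetical stand-in for a process reward model (PRM): the
    probability that a single reasoning step is correct. Here we just
    penalize steps tagged as wrong in this toy example."""
    return 0.1 if "WRONG" in step else 0.95

def solution_score(steps):
    """Aggregate per-step scores; taking the product rewards chains in
    which every individual step looks sound."""
    return math.prod(step_score(s) for s in steps)

candidates = [
    ["2 + 3 = 5", "5 * 4 = 20"],          # sound chain
    ["2 + 3 = 6 WRONG", "6 * 4 = 24"],    # early mistake taints the chain
]
best = max(candidates, key=solution_score)
# Best-of-N with a PRM keeps the candidate whose every step checks out.
```

<p>Whether o1 uses exactly this recipe is unknown; step-level reward signals of this kind are simply why the paper is widely read as a hint at how o1 may have been trained.</p><p>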
They produce an internal chain of thought before responding, which allows these models to provide highly accurate results - especially in domains like science, mathematics, programming and analytics that require structured thinking and step-by-step problem solving.</p><blockquote><p>&#8220;<em>o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).</em></p><p>- <a href="https://openai.com/index/learning-to-reason-with-llms/">Learning to Reason with LLMs</a>, September 12, 2024 by OpenAI</p></blockquote><p>The o1 family of models is trained with reinforcement learning, likely using large datasets annotated for correctness at every step of the reasoning process. This allows the model to learn to reason before returning an answer. In their research, OpenAI found that the performance of the model consistently<em> improved with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CA2L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5d88c0-4ffb-4b6c-8bf7-b9291b707b0c_1540x886.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CA2L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5d88c0-4ffb-4b6c-8bf7-b9291b707b0c_1540x886.png 424w, 
https://substackcdn.com/image/fetch/$s_!CA2L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5d88c0-4ffb-4b6c-8bf7-b9291b707b0c_1540x886.png 848w, https://substackcdn.com/image/fetch/$s_!CA2L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5d88c0-4ffb-4b6c-8bf7-b9291b707b0c_1540x886.png 1272w, https://substackcdn.com/image/fetch/$s_!CA2L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5d88c0-4ffb-4b6c-8bf7-b9291b707b0c_1540x886.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CA2L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5d88c0-4ffb-4b6c-8bf7-b9291b707b0c_1540x886.png" width="618" height="355.68956043956047" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c5d88c0-4ffb-4b6c-8bf7-b9291b707b0c_1540x886.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:838,&quot;width&quot;:1456,&quot;resizeWidth&quot;:618,&quot;bytes&quot;:152127,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CA2L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5d88c0-4ffb-4b6c-8bf7-b9291b707b0c_1540x886.png 424w, 
https://substackcdn.com/image/fetch/$s_!CA2L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5d88c0-4ffb-4b6c-8bf7-b9291b707b0c_1540x886.png 848w, https://substackcdn.com/image/fetch/$s_!CA2L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5d88c0-4ffb-4b6c-8bf7-b9291b707b0c_1540x886.png 1272w, https://substackcdn.com/image/fetch/$s_!CA2L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5d88c0-4ffb-4b6c-8bf7-b9291b707b0c_1540x886.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From <a href="https://openai.com/index/learning-to-reason-with-llms/">Learning to Reason with LLMs</a></figcaption></figure></div><p>In contrast to most existing LLMs, where very little compute is dedicated to inference, o1 deliberately spends more time on inference. The o1 model marks a paradigm shift in how compute resources are allocated in LLMs. As Jim Fan pointed out in his <a href="https://x.com/DrJimFan/status/1834279865933332752">Twitter post</a> - we now see a <em>paradigm of inference-time scaling deployed in production. </em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T1YI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eee84ca-ce30-4a68-8eec-ad04399c63aa_1088x537.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T1YI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eee84ca-ce30-4a68-8eec-ad04399c63aa_1088x537.png 424w, https://substackcdn.com/image/fetch/$s_!T1YI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eee84ca-ce30-4a68-8eec-ad04399c63aa_1088x537.png 848w, https://substackcdn.com/image/fetch/$s_!T1YI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eee84ca-ce30-4a68-8eec-ad04399c63aa_1088x537.png 1272w, https://substackcdn.com/image/fetch/$s_!T1YI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eee84ca-ce30-4a68-8eec-ad04399c63aa_1088x537.png 1456w"
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T1YI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eee84ca-ce30-4a68-8eec-ad04399c63aa_1088x537.png" width="512" height="252.7058823529412" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3eee84ca-ce30-4a68-8eec-ad04399c63aa_1088x537.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:537,&quot;width&quot;:1088,&quot;resizeWidth&quot;:512,&quot;bytes&quot;:123776,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T1YI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eee84ca-ce30-4a68-8eec-ad04399c63aa_1088x537.png 424w, https://substackcdn.com/image/fetch/$s_!T1YI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eee84ca-ce30-4a68-8eec-ad04399c63aa_1088x537.png 848w, https://substackcdn.com/image/fetch/$s_!T1YI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eee84ca-ce30-4a68-8eec-ad04399c63aa_1088x537.png 1272w, https://substackcdn.com/image/fetch/$s_!T1YI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eee84ca-ce30-4a68-8eec-ad04399c63aa_1088x537.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From Jim Fan&#8217;s <a href="https://x.com/DrJimFan/status/1834279865933332752">Twitter post</a> </figcaption></figure></div><h3>In a League of Its Own</h3><p>The evaluation results shared in the blog show that o1&#8217;s reasoning is clearly superior to GPT-4o&#8217;s: it significantly outperforms GPT-4o in 54 out of 57 MMLU subcategories.
According to OpenAI - <em>&#8220;o1 models offer significant<a href="https://openai.com/index/introducing-openai-o1-preview/"> advancements</a> in reasoning, but they <strong>are not intended to replace GPT-4o in all use-cases</strong>.&#8221;</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9E2a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ec37a5-a240-49d8-8f09-034824b69143_1458x678.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9E2a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ec37a5-a240-49d8-8f09-034824b69143_1458x678.png 424w, https://substackcdn.com/image/fetch/$s_!9E2a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ec37a5-a240-49d8-8f09-034824b69143_1458x678.png 848w, https://substackcdn.com/image/fetch/$s_!9E2a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ec37a5-a240-49d8-8f09-034824b69143_1458x678.png 1272w, https://substackcdn.com/image/fetch/$s_!9E2a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ec37a5-a240-49d8-8f09-034824b69143_1458x678.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9E2a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ec37a5-a240-49d8-8f09-034824b69143_1458x678.png" width="1456" height="677" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2ec37a5-a240-49d8-8f09-034824b69143_1458x678.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:677,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183781,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9E2a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ec37a5-a240-49d8-8f09-034824b69143_1458x678.png 424w, https://substackcdn.com/image/fetch/$s_!9E2a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ec37a5-a240-49d8-8f09-034824b69143_1458x678.png 848w, https://substackcdn.com/image/fetch/$s_!9E2a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ec37a5-a240-49d8-8f09-034824b69143_1458x678.png 1272w, https://substackcdn.com/image/fetch/$s_!9E2a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ec37a5-a240-49d8-8f09-034824b69143_1458x678.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From <a href="https://openai.com/index/learning-to-reason-with-llms/">Learning to Reason with LLMs</a></figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fCvt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97021f54-81b6-4d87-803b-967ba44770c8_1492x1160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fCvt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97021f54-81b6-4d87-803b-967ba44770c8_1492x1160.png 424w, 
https://substackcdn.com/image/fetch/$s_!fCvt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97021f54-81b6-4d87-803b-967ba44770c8_1492x1160.png 848w, https://substackcdn.com/image/fetch/$s_!fCvt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97021f54-81b6-4d87-803b-967ba44770c8_1492x1160.png 1272w, https://substackcdn.com/image/fetch/$s_!fCvt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97021f54-81b6-4d87-803b-967ba44770c8_1492x1160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fCvt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97021f54-81b6-4d87-803b-967ba44770c8_1492x1160.png" width="1456" height="1132" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97021f54-81b6-4d87-803b-967ba44770c8_1492x1160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1132,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:374283,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fCvt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97021f54-81b6-4d87-803b-967ba44770c8_1492x1160.png 424w, 
https://substackcdn.com/image/fetch/$s_!fCvt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97021f54-81b6-4d87-803b-967ba44770c8_1492x1160.png 848w, https://substackcdn.com/image/fetch/$s_!fCvt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97021f54-81b6-4d87-803b-967ba44770c8_1492x1160.png 1272w, https://substackcdn.com/image/fetch/$s_!fCvt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97021f54-81b6-4d87-803b-967ba44770c8_1492x1160.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From <a href="https://openai.com/index/learning-to-reason-with-llms/">Learning to Reason with LLMs</a></figcaption></figure></div><p>The results are impressive, although the model&#8217;s training process remains undisclosed - there is, nevertheless, plenty of speculation. In the next section, we will briefly discuss <em><a href="https://arxiv.org/pdf/2305.20050">&#8220;Let&#8217;s Verify Step by Step&#8221;</a>, </em>which may hint at the techniques used in training the o1 family of models. </p><h3>Let&#8217;s Verify Step by Step</h3><p>Let&#8217;s Verify Step by Step looks into methods for training reliable reward models that can detect hallucinated reasoning in LLMs. In this work, the researchers investigate two methods for training reward models: outcome supervision and process supervision. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z38p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04e7da60-2753-49b2-a8f6-c837d8f21fc9_992x886.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z38p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04e7da60-2753-49b2-a8f6-c837d8f21fc9_992x886.png 424w, https://substackcdn.com/image/fetch/$s_!Z38p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04e7da60-2753-49b2-a8f6-c837d8f21fc9_992x886.png 848w,
https://substackcdn.com/image/fetch/$s_!Z38p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04e7da60-2753-49b2-a8f6-c837d8f21fc9_992x886.png 1272w, https://substackcdn.com/image/fetch/$s_!Z38p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04e7da60-2753-49b2-a8f6-c837d8f21fc9_992x886.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z38p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04e7da60-2753-49b2-a8f6-c837d8f21fc9_992x886.png" width="468" height="417.991935483871" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/04e7da60-2753-49b2-a8f6-c837d8f21fc9_992x886.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:886,&quot;width&quot;:992,&quot;resizeWidth&quot;:468,&quot;bytes&quot;:101562,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z38p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04e7da60-2753-49b2-a8f6-c837d8f21fc9_992x886.png 424w, https://substackcdn.com/image/fetch/$s_!Z38p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04e7da60-2753-49b2-a8f6-c837d8f21fc9_992x886.png 848w, 
https://substackcdn.com/image/fetch/$s_!Z38p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04e7da60-2753-49b2-a8f6-c837d8f21fc9_992x886.png 1272w, https://substackcdn.com/image/fetch/$s_!Z38p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04e7da60-2753-49b2-a8f6-c837d8f21fc9_992x886.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>An outcome supervision reward model provides feedback only on the final result from the LLM.
The LLM tries different CoT paths and receives feedback based on the final outcome. Correct responses get rewarded while incorrect ones get penalized, and over time the model learns to discard unsuccessful paths and reinforce successful ones. </p><p>Process supervision, on the other hand, provides feedback for each intermediate reasoning step, not just the final state. This trains the model to break problems down into logical steps - similar to what a human would do. It not only makes the model&#8217;s reasoning more structured and logical, it also allows the model to identify mistakes as they occur at individual reasoning steps.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YgPq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf4ffd95-492f-4721-ae40-7484a15847f9_1572x1148.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YgPq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf4ffd95-492f-4721-ae40-7484a15847f9_1572x1148.png 424w, https://substackcdn.com/image/fetch/$s_!YgPq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf4ffd95-492f-4721-ae40-7484a15847f9_1572x1148.png 848w, https://substackcdn.com/image/fetch/$s_!YgPq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf4ffd95-492f-4721-ae40-7484a15847f9_1572x1148.png 1272w, https://substackcdn.com/image/fetch/$s_!YgPq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf4ffd95-492f-4721-ae40-7484a15847f9_1572x1148.png 1456w"
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YgPq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf4ffd95-492f-4721-ae40-7484a15847f9_1572x1148.png" width="1456" height="1063" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af4ffd95-492f-4721-ae40-7484a15847f9_1572x1148.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1063,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:747673,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YgPq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf4ffd95-492f-4721-ae40-7484a15847f9_1572x1148.png 424w, https://substackcdn.com/image/fetch/$s_!YgPq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf4ffd95-492f-4721-ae40-7484a15847f9_1572x1148.png 848w, https://substackcdn.com/image/fetch/$s_!YgPq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf4ffd95-492f-4721-ae40-7484a15847f9_1572x1148.png 1272w, https://substackcdn.com/image/fetch/$s_!YgPq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf4ffd95-492f-4721-ae40-7484a15847f9_1572x1148.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From <a href="https://arxiv.org/pdf/2305.20050">Let&#8217;s Verify Step by Step</a></figcaption></figure></div><p>So we know that o1 thinks before responding and this thinking process involves reinforcement learning to train the model to generate a private chain of thought (CoT). Based on this, it is fair to speculate that the o1 model likely iterates through and evaluates each action state in its CoT, which allows the model to reason and generate the best CoT path that leads to a correct solution. There is a possibility that OpenAI could have used either of the methods or a combination of both outcome and process supervision. 
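To make the distinction between the two schemes concrete, here is a toy sketch (the function names and reward values are my own, not from the paper): outcome supervision emits one scalar for the entire chain, while process supervision emits a label per step and so localizes exactly where the chain went wrong.

```python
# Toy illustration of the two reward-model styles; all names and
# numbers are invented for illustration only.

def outcome_reward(steps, final_answer, gold_answer):
    """Outcome supervision: a single signal for the whole chain."""
    return 1.0 if final_answer == gold_answer else 0.0

def process_rewards(step_labels):
    """Process supervision: one signal per reasoning step.
    step_labels[i] is True if a human/verifier judged step i correct."""
    return [1.0 if ok else -1.0 for ok in step_labels]

# A 4-step chain where step 2 (index 1) introduces the error.
chain = ["split the problem", "mis-apply formula", "carry the error", "final answer"]
labels = [True, False, False, True]  # judged step by step

print(outcome_reward(chain, final_answer=41, gold_answer=42))  # 0.0: chain failed, but where?
print(process_rewards(labels))  # [1.0, -1.0, -1.0, 1.0]: error localized to step 2
```

Under outcome supervision the model only learns that this chain failed; under process supervision it learns which step to avoid, which is why per-step feedback helps on long reasoning chains.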
I am leaning towards the latter because, in their investigation in Let&#8217;s Verify Step by Step, they found that models trained with process supervision significantly outperformed those trained with outcome supervision on challenging reasoning tasks.</p><p>However, with a process supervision reward model alone, the model might keep generating paths that don&#8217;t lead to an answer - so an outcome supervision reward model might have been used alongside it to give feedback on whether the LLM got the answer right, or some kind of stopping criterion could have been applied. By combining these two models, the system gets feedback on each state as well as on the final answer - this helps the model correct its path if any of the intermediary steps go wrong. Additionally, the<a href="https://arxiv.org/pdf/2406.14283"> Q*</a> framework could be employed during CoT generation to help the model choose the next best reasoning state, making the overall process more efficient and strategic.</p><h3>o1 Inference</h3><p>While o1 outperforms existing models on several benchmarks, it is still an extremely expensive model in comparison. o1-preview is available via the chat completions endpoint and costs $15 per 1 million input tokens and $60 per 1 million output tokens - much higher than GPT-4o, which costs $5 per 1 million input tokens and $15 per 1 million output tokens.</p><p><strong>Access. </strong>As of now, two reasoning models are available in beta: o1-preview and o1-mini, both with limited features. The preview model is only an early checkpoint of the o1 model. Many chat completions API parameters are <a href="https://platform.openai.com/docs/guides/reasoning">not available yet</a>. Additionally, to access the o1 models via the chat completions API you need to be a Tier 5 user.
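To see what those restrictions look like in practice, here is a hypothetical sketch of an o1-preview request payload, based on the constraints listed in OpenAI&#8217;s reasoning guide at the time (no system message, no temperature/top_p, and max_completion_tokens instead of max_tokens); the helper function is my own, not part of the OpenAI SDK.

```python
# Hypothetical helper that builds an o1-preview chat completions payload.
# Assumptions (from OpenAI's reasoning guide at the time of writing):
# - only a "user" message is accepted, no "system" role;
# - sampling parameters like temperature/top_p are not supported;
# - "max_completion_tokens" replaces "max_tokens" and must also cover the
#   hidden reasoning tokens, which are billed but never returned.

def build_o1_request(question: str, token_budget: int = 4096) -> dict:
    return {
        "model": "o1-preview",
        "messages": [{"role": "user", "content": question}],  # user role only
        "max_completion_tokens": token_budget,  # hidden reasoning + visible answer
    }

payload = build_o1_request("Solve: 2x + 3 = 11")
assert "temperature" not in payload and "max_tokens" not in payload
```

The important practical point is the last assumption: if the token budget is spent entirely on hidden reasoning, you pay for those tokens and still get a truncated visible answer.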
</p><ul><li><p><code>o1-preview</code><strong>:</strong> o1 model, designed to reason about hard problems using broad general knowledge about the world; Limit: 30 messages a week.</p></li><li><p><code>o1-mini</code><strong>:</strong> a faster and cheaper version of o1, particularly adept at coding, math, and science tasks where extensive general knowledge isn't required; Limit: 50 messages a week.</p></li></ul><p>Scaling o1 for general use cases and widespread deployment is hard, which is very likely the reason for the limited access. The o1 model is also a lot more expensive than GPT-4o. Users are also charged for the reasoning tokens, which they don&#8217;t have access to. While the improvements are great, the inaccessibility is not.</p><blockquote><p>- &#8220;<em>While reasoning tokens are not visible via the API, they still occupy space in the model's context window and are billed as <a href="https://openai.com/pricing">output tokens</a>.</em></p><p><em>- <a href="https://platform.openai.com/docs/guides/reasoning/reasoning">Reasoning models</a> by OpenAI</em></p></blockquote><p>We know that o1 is 6x more expensive than GPT-4o but offers advanced reasoning capabilities. While tokenization and pricing have improved for GPT-4o, especially with recent tokenizer updates, tokenization in diverse, non-Latin languages like Nepali<a href="https://www.icodeformybhasa.com/p/beyond-the-abcs-exploring-the-nuances"> remains suboptimal and more expensive</a>. So now, let's test whether o1 is worth the extra cost for a non-Latin language like Nepali. We will also see whether changing the language affects o1&#8217;s reasoning ability.</p><h3>A Short Bilingual Experiment</h3><p>In this brief experiment, I will present GPT-4o and o1-preview with 10 SEE examination questions via the chat application - consisting of 5 questions in science and 5 in mathematics. Each question is in Nepali and English.
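One intuition for why Nepali questions cost more per request: Devanagari characters occupy three UTF-8 bytes each, and byte-level BPE tokenizers, trained mostly on Latin-script text, tend to fragment such text into many more tokens. The stdlib-only sketch below uses bytes per character as a crude proxy for that cost (the Nepali sentence is my rough translation; the real token gap depends on the tokenizer).

```python
# Rough proxy for tokenizer cost: UTF-8 bytes per character.
# Byte-level BPE vocabularies are dominated by Latin-script merges, so
# scripts with heavier byte footprints usually fragment into more tokens.
english = "What is the boiling point of water?"
nepali = "पानीको उम्लने बिन्दु कति हो?"  # roughly the same question in Nepali

def bytes_per_char(text: str) -> float:
    return len(text.encode("utf-8")) / len(text)

print(bytes_per_char(english))  # 1.0: pure ASCII, one byte per character
print(bytes_per_char(nepali))   # well above 2: mostly 3-byte Devanagari codepoints
```

For an actual token count, the same comparison can be run with a real tokenizer such as tiktoken, which is what the estimates later in this post use.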
This will help us explore o1&#8217;s reasoning abilities and assess how it handles questions in a non-Latin, low-resource language compared to English.</p><p>Note: I am OpenAI-poor, so these examples were run through the regular ChatGPT interface with GPT-4o and o1-preview rather than via the API, and the total output token counts were computed with tiktoken for comparison - so this is an estimated difference, not an exact one.</p><p><strong>Correctness. </strong>Out of the 5 general science questions in English and Nepali, both GPT-4o and o1-preview got all 5 right. On the 5 mathematics questions, each model got one question wrong - different questions. Upon inspection, I found that the mistake in o1-preview was not due to its reasoning or any problem-solving step, but due to a translation error that propagated into the final answer. The error with GPT-4o, however, was due to a failure to understand a parameter in the question.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0KYa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d78b16-0717-432e-bea7-a102ab745a94_1556x1232.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0KYa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d78b16-0717-432e-bea7-a102ab745a94_1556x1232.png 424w, https://substackcdn.com/image/fetch/$s_!0KYa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d78b16-0717-432e-bea7-a102ab745a94_1556x1232.png 848w,
https://substackcdn.com/image/fetch/$s_!0KYa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d78b16-0717-432e-bea7-a102ab745a94_1556x1232.png 1272w, https://substackcdn.com/image/fetch/$s_!0KYa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d78b16-0717-432e-bea7-a102ab745a94_1556x1232.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0KYa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d78b16-0717-432e-bea7-a102ab745a94_1556x1232.png" width="1456" height="1153" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05d78b16-0717-432e-bea7-a102ab745a94_1556x1232.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1153,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:264443,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0KYa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d78b16-0717-432e-bea7-a102ab745a94_1556x1232.png 424w, https://substackcdn.com/image/fetch/$s_!0KYa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d78b16-0717-432e-bea7-a102ab745a94_1556x1232.png 848w, 
https://substackcdn.com/image/fetch/$s_!0KYa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d78b16-0717-432e-bea7-a102ab745a94_1556x1232.png 1272w, https://substackcdn.com/image/fetch/$s_!0KYa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d78b16-0717-432e-bea7-a102ab745a94_1556x1232.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">o1-preview failing to correctly translate 
&#8220;&#2309;&#2352;&#2381;&#2328;&#2357;&#2381;&#2351;&#2366;&#2360;&#8221;, which results in an error in the final answer; still, it is impressive that it tried to clarify the terminology, even though it failed</figcaption></figure></div><blockquote><p>For Example:</p><p>&#2319;&#2313;&#2335;&#2366; &#2348;&#2375;&#2354;&#2344;&#2366;&#2325;&#2379; &#2309;&#2352;&#2381;&#2328;&#2357;&#2381;&#2351;&#2366;&#2360; 35 &#2360;&#2375;.&#2350;&#2367;. &#2352; &#2309;&#2352;&#2381;&#2343;&#2357;&#2381;&#2351;&#2366;&#2360; &#2352; &#2313;&#2330;&#2366;&#2311;&#2325;&#2379; &#2351;&#2379;&#2327;&#2347;&#2354; 65 &#2360;&#2375;.&#2350;&#2367;. &#2349;&#2319; &#2360;&#2379; &#2348;&#2375;&#2354;&#2344;&#2366;&#2325;&#2379; &#2357;&#2325;&#2381;&#2352;&#2360;&#2340;&#2361;&#2325;&#2379; &#2325;&#2381;&#2359;&#2375;&#2340;&#2381;&#2352;&#2347;&#2354; &#2346;&#2340;&#2381;&#2340;&#2366; &#2354;&#2327;&#2366;&#2313;&#2344;&#2369;&#2361;&#2379;&#2360;&#2381;&#8204; &#2404;</p><p>The question translates to: The radius of a cylinder is 35 cm, and the sum of its radius and height is 65 cm. Find the curved surface area of the cylinder.</p><p>However, when o1-preview was asked this question, &#2309;&#2352;&#2381;&#2328;&#2357;&#2381;&#2351;&#2366;&#2360; was translated to diameter instead of radius. Also, consider the following block from the reasoning section (refer to the image above for the reasoning summary):</p><p><em><strong>Revisiting the measure</strong></em></p><p><em>I&#8217;m clarifying the terminology in Nepali, noting that "&#2309;&#2352;&#2381;&#2341;&#2357;&#2381;&#2351;&#2366;&#2360;" means diameter and "&#2309;&#2352;&#2381;&#2343;&#2357;&#2381;&#2351;&#2366;&#2360;" means radius, likely a typo. 
The diameter is 35 cm, and the sum of the radius and height is 65 cm.</em></p><p>&#2309;&#2352;&#2381;&#2343;&#2357;&#2381;&#2351;&#2366;&#2360; was somehow mapped to &#2309;&#2352;&#2381;&#2341;&#2357;&#2381;&#2351;&#2366;&#2360;, which was then translated to diameter - which is incorrect in any case. Since we don&#8217;t see the full reasoning stack, it is hard to say what happened here. My guess is that &#2309;&#2352;&#2381;&#2343;&#2357;&#2381;&#2351;&#2366;&#2360; could have been tokenized incorrectly, which may have confused the model when interpreting the word.</p></blockquote><p><strong>Output token and word count. </strong>For both the Nepali and English variants of the questions, o1-preview generated significantly more output tokens and words than GPT-4o. In English, o1-preview generated 4.5 times more tokens (711.9 vs 158.9) and 3 times more words (344.3 vs 112.9) than GPT-4o. Although less pronounced, a similar trend was observed in Nepali, where o1-preview produced 1.8 times more tokens (495.2 vs 270.4) and 1.6 times more words (150.7 vs 91.4) than GPT-4o. This shows that o1-preview has a tendency to return detailed responses, which is in line with what I observed during my manual analysis of the output. Note: These numbers exclude the thinking tokens that are visible to users. Including those tokens would further increase the counts for o1-preview. Given that the actual reasoning tokens exceed what is shown to users, the true difference in token count - and hence the cost of solving the same problem - is substantially higher for o1-preview compared to GPT-4o. </p><p><strong>Thinking or reasoning token and word count. </strong>With o1-preview you get &#8220;<em>a summary&#8221;</em> of the model&#8217;s reasoning process. 
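To get a rough sense of what the output-token gap above means in money terms, the average token counts can be combined with the roughly 6x per-token price difference mentioned earlier. This is back-of-the-envelope arithmetic only - real billing also includes the hidden reasoning tokens, so the true o1-preview cost would be higher still:

```python
# Average output tokens per question, from the experiment above.
avg_tokens = {
    "gpt-4o": {"english": 158.9, "nepali": 270.4},
    "o1-preview": {"english": 711.9, "nepali": 495.2},
}

PRICE_RATIO = 6.0  # assumed: o1-preview tokens priced ~6x GPT-4o tokens

for lang in ("english", "nepali"):
    rel_cost = avg_tokens["o1-preview"][lang] * PRICE_RATIO / avg_tokens["gpt-4o"][lang]
    print(f"{lang}: o1-preview costs ~{rel_cost:.1f}x GPT-4o per question")

# Nepali is pricier than English even on the same model:
premium = avg_tokens["gpt-4o"]["nepali"] / avg_tokens["gpt-4o"]["english"]
print(f"GPT-4o Nepali premium: {premium:.2f}x")
```

Under these assumptions the per-question cost multiple is far larger for English than for Nepali, simply because o1-preview's extra verbosity is most pronounced in English.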
We know that the actual reasoning tokens are hidden from users, so this looks like a summarized version of the model's reasoning stack.</p><p>Interestingly, despite the output token and word counts being significantly higher for English compared to Nepali, the reasoning token and word counts are only slightly higher for English. This suggests that while the model engages in more extensive reasoning and elaboration when generating English responses, the underlying reasoning steps might be more consistent across both languages. One thing I did notice during my qualitative analysis was that queries in Nepali were often translated to English, which means that even when the query is in Nepali, the model's reasoning happens in English. As a result, while the model can still handle Nepali input, it is likely to lose some of the original context during translation, which can lead to errors or less precise responses. This probably happens with GPT-4o as well.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ny5f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88836825-fae7-4c8c-9cae-6e33310ad0d5_1448x1100.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ny5f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88836825-fae7-4c8c-9cae-6e33310ad0d5_1448x1100.png 424w, https://substackcdn.com/image/fetch/$s_!Ny5f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88836825-fae7-4c8c-9cae-6e33310ad0d5_1448x1100.png 848w, 
https://substackcdn.com/image/fetch/$s_!Ny5f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88836825-fae7-4c8c-9cae-6e33310ad0d5_1448x1100.png 1272w, https://substackcdn.com/image/fetch/$s_!Ny5f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88836825-fae7-4c8c-9cae-6e33310ad0d5_1448x1100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ny5f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88836825-fae7-4c8c-9cae-6e33310ad0d5_1448x1100.png" width="654" height="496.8232044198895" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88836825-fae7-4c8c-9cae-6e33310ad0d5_1448x1100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1100,&quot;width&quot;:1448,&quot;resizeWidth&quot;:654,&quot;bytes&quot;:219784,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ny5f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88836825-fae7-4c8c-9cae-6e33310ad0d5_1448x1100.png 424w, https://substackcdn.com/image/fetch/$s_!Ny5f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88836825-fae7-4c8c-9cae-6e33310ad0d5_1448x1100.png 848w, 
https://substackcdn.com/image/fetch/$s_!Ny5f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88836825-fae7-4c8c-9cae-6e33310ad0d5_1448x1100.png 1272w, https://substackcdn.com/image/fetch/$s_!Ny5f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88836825-fae7-4c8c-9cae-6e33310ad0d5_1448x1100.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">o1-preview first translating a Nepali question to English and then reasoning through the 
solution.</figcaption></figure></div><p><strong>Reasoning time. </strong>Given that the reasoning unfolds in English regardless of the language of the original query, the reasoning time is similar: 8.9 seconds for Nepali vs. 8.6 seconds for English.</p><h3>Conclusion</h3><p>The o1 family of models is a new class of models trained with the ability to reason. While there is little to no information on how the models were trained, piecing together recent publications, theories of LLMs, and the information in the release notes yields speculation about how the o1 models were potentially trained: they could have been trained using process supervision reward models, outcome supervision reward models, or a combination of both. A short bilingual experiment with 5 science and 5 mathematics questions in Nepali and English shows that both GPT-4o and o1-preview generate more verbose responses in English. This suggests the models are better at explaining their output and generating responses in English than in Nepali. However, most of the reasoning happens in English regardless of the language of the query, so a translation error at the beginning can render the entire reasoning chain futile. Also, for the use case I tested, GPT-4o was almost as good as o1-preview; however, this is a very small experiment and the questions are on the simpler side. It would be fun to test this on more challenging Nepali questions.</p>]]></content:encoded></item><item><title><![CDATA[Low-Rank Adaptation of LLaMA 3 for Nepali and Hindi]]></title><description><![CDATA[PEFT Techniques and Findings from Fine-tuning LLaMA 3 with Low-Rank Adaptation for Nepali and Hindi]]></description><link>https://www.icodeformybhasa.com/p/low-rank-adaptation-of-llama-3-for</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/low-rank-adaptation-of-llama-3-for</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Thu, 05 Sep 2024 03:14:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CgGo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26100a98-0095-4445-9e2c-b7e803777258_2048x2048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The space of open-source and open-weights Large Language Models (LLMs) is growing, and that is great news for practitioners, researchers, and consumers of these advanced AI models. Now, individuals and organizations who otherwise have limited financial resources to cover the substantial costs associated with pre-training <em>&#8220;can leverage&#8221;</em> these open-source and open-weights LLMs for their specialized use cases.<em> </em>However, adapting these LLMs for specialized use cases remains an involved and expensive process, especially when implementing them in low-resource settings. These adaptations often require significant computational resources and expertise, which can be challenging to obtain in resource-constrained environments. 
This is particularly true when dealing with low-resource languages and domains that have not been seen or have had little exposure during the pre-training process. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CgGo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26100a98-0095-4445-9e2c-b7e803777258_2048x2048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CgGo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26100a98-0095-4445-9e2c-b7e803777258_2048x2048.png 424w, https://substackcdn.com/image/fetch/$s_!CgGo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26100a98-0095-4445-9e2c-b7e803777258_2048x2048.png 848w, https://substackcdn.com/image/fetch/$s_!CgGo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26100a98-0095-4445-9e2c-b7e803777258_2048x2048.png 1272w, https://substackcdn.com/image/fetch/$s_!CgGo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26100a98-0095-4445-9e2c-b7e803777258_2048x2048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CgGo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26100a98-0095-4445-9e2c-b7e803777258_2048x2048.png" width="488" height="488" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26100a98-0095-4445-9e2c-b7e803777258_2048x2048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:488,&quot;bytes&quot;:5495310,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CgGo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26100a98-0095-4445-9e2c-b7e803777258_2048x2048.png 424w, https://substackcdn.com/image/fetch/$s_!CgGo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26100a98-0095-4445-9e2c-b7e803777258_2048x2048.png 848w, https://substackcdn.com/image/fetch/$s_!CgGo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26100a98-0095-4445-9e2c-b7e803777258_2048x2048.png 1272w, https://substackcdn.com/image/fetch/$s_!CgGo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26100a98-0095-4445-9e2c-b7e803777258_2048x2048.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Generated using Midjourney</figcaption></figure></div><p>In the <a href="https://www.icodeformybhasa.com/p/aligning-llms-fine-tuning-llama-with">previous post</a>, we discussed different methods used in aligning LLMs to specific tasks and desired behaviors, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), both of which are expensive and challenging tasks that involve a complex process and a carefully curated, high-quality dataset.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;64b8177d-33d7-4564-8852-63822216f179&quot;,&quot;caption&quot;:&quot;Large Language Models (LLMs) like LLaMA are pretrained with large amounts of unlabeled text with self-supervised training objectives like next token prediction. 
Pretraining LLMs with self-supervised objectives allows the model to learn rich representation in language and across different domain from the large volume of unlabeled text that is readily ava&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Aligning LLMs - Fine-Tuning LLaMA with SFT and RHLF &quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:104468357,&quot;name&quot;:&quot;Shreeya Dhakal&quot;,&quot;bio&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ccf9e7e-b918-482e-b8ed-800ad52e084e_1176x982.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-07-04T05:15:36.608Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d90e86-893e-4931-8b37-ff33eb6c93b1_1456x816.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.icodeformybhasa.com/p/aligning-llms-fine-tuning-llama-with&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:145452654,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:1,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;#icodeformy&#2349;&#2366;&#2359;&#2366;&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d92160-5f82-40ca-8233-b069f42bbba6_1080x1080.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>In this post, we will focus on methods for fine-tuning LLMs to adapt them to specific domains and use cases in resource-constrained settings. 
We will discuss different Parameter Efficient Fine-Tuning (PEFT) techniques and share our results from fine-tuning LLaMA 3 for Nepali and Hindi, two South Asian languages - one comparatively high-resource and the other low-resource. </p><h3>Fine-Tuning LLMs</h3><p>A language model is pretrained on massive corpora of unlabelled text, during which it learns rich representations of language and acquires general knowledge across multiple domains without relying on explicit annotations. After the initial pretraining, LLMs are fine-tuned, often using human-curated data, for specialized goals and use cases. Fine-tuning is computationally cheaper than pretraining, but full fine-tuning of LLMs can still be computationally infeasible for a large number of individuals and institutions. As pretrained language models continue to grow in size, research into computationally efficient methods to fine-tune these LLMs continues to evolve.</p><h3>Parameter Efficient Fine-Tuning</h3><p>During full fine-tuning, a pretrained model is initialized with the pretrained weights and all of the model parameters and layers are trained and updated. This requires a substantial amount of curated data and computational resources, as every parameter of the model is updated for the specific use case. 
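To make that scale concrete, here is a back-of-the-envelope count of trainable parameters for an ~8B-parameter model, comparing full fine-tuning with a rank-16 low-rank update on the four attention projections per layer. The figures are illustrative assumptions, not exact LLaMA 3 8B dimensions (the real model uses grouped-query attention, so its k/v projections are smaller than assumed here):

```python
# Full fine-tuning trains every parameter; a rank-r low-rank update
# trains only two thin factors (hidden x r and r x hidden) per matrix.
hidden_size = 4096
num_layers = 32
total_params = 8_000_000_000  # ~8B, rough

rank = 16
# 4 attention projections (q, k, v, o) per layer, 2 factors each,
# treating all four as hidden x hidden for simplicity.
lora_params = num_layers * 4 * (2 * hidden_size * rank)

print(f"full fine-tuning: {total_params:,} trainable parameters")
print(f"rank-{rank} update: {lora_params:,} trainable parameters "
      f"({lora_params / total_params:.2%} of the model)")
```

Under these assumptions the low-rank update trains roughly 0.2% of the parameters, which is the kind of reduction that makes fine-tuning feasible on a single GPU.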
In contrast, Parameter Efficient Fine-Tuning (PEFT) methods aim to adapt LLMs by reducing the number of trainable parameters - updating a small number of additional parameters or modifying only a subset of the pretrained parameters - while maintaining performance comparable to full fine-tuning.</p><h4>PEFT Methods</h4><p>Several PEFT methods have evolved over the years, each offering a way to adapt large pretrained models without the need for full fine-tuning of the model parameters.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EjSk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e88d32d-54f9-4d6c-b331-3d909c9deead_1910x1378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EjSk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e88d32d-54f9-4d6c-b331-3d909c9deead_1910x1378.png 424w, https://substackcdn.com/image/fetch/$s_!EjSk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e88d32d-54f9-4d6c-b331-3d909c9deead_1910x1378.png 848w, https://substackcdn.com/image/fetch/$s_!EjSk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e88d32d-54f9-4d6c-b331-3d909c9deead_1910x1378.png 1272w, https://substackcdn.com/image/fetch/$s_!EjSk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e88d32d-54f9-4d6c-b331-3d909c9deead_1910x1378.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!EjSk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e88d32d-54f9-4d6c-b331-3d909c9deead_1910x1378.png" width="1456" height="1050" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e88d32d-54f9-4d6c-b331-3d909c9deead_1910x1378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1050,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:461142,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EjSk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e88d32d-54f9-4d6c-b331-3d909c9deead_1910x1378.png 424w, https://substackcdn.com/image/fetch/$s_!EjSk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e88d32d-54f9-4d6c-b331-3d909c9deead_1910x1378.png 848w, https://substackcdn.com/image/fetch/$s_!EjSk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e88d32d-54f9-4d6c-b331-3d909c9deead_1910x1378.png 1272w, https://substackcdn.com/image/fetch/$s_!EjSk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e88d32d-54f9-4d6c-b331-3d909c9deead_1910x1378.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From <a href="https://arxiv.org/pdf/2312.12148">Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment</a></figcaption></figure></div><p>Some popular PEFT methods as also discussed in <a href="https://arxiv.org/pdf/2312.12148">Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment</a> involve:</p><p><strong>Additive fine-tuning</strong> involves adding new trainable parameters to the pretrained models while keeping the original pre-trained weights frozen. Some popular examples of additive fine-tuning include adapters, prefix tuning and prompt-based fine-tuning. 
</p><p><strong>Partial or selective fine-tuning </strong>involves selecting and updating only a subset of the pretrained model parameters. Layer-wise bias updates and sparse fine-tuning of selected weights are some methods of partial fine-tuning of pretrained models. </p><p><strong>Reparameterized fine-tuning </strong>methods<strong> </strong>leverage low-rank transformations of high-dimensional matrices, enabling the efficient adaptation of LLMs by introducing a small number of new parameters that interact with the original model weights. Low-Rank Decomposition and LoRA are popular examples of reparameterized fine-tuning.</p><p>For this work, we will be fine-tuning a quantized LLaMA 3 8B model using an efficient reparameterized fine-tuning method called Low-Rank Adaptation (LoRA).</p><h4>Low-Rank Adaptation (LoRA)</h4><p>Low-Rank Adaptation, proposed in <a href="https://arxiv.org/pdf/2106.09685">LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS</a>, is one of the most popular PEFT methods used in adapting LLMs. LoRA is based on the concept of intrinsic dimensionality and draws on the findings from two key studies: <a href="https://arxiv.org/pdf/1804.08838">MEASURING THE INTRINSIC DIMENSION OF OBJECTIVE LANDSCAPES</a> and <a href="https://arxiv.org/pdf/2012.13255">INTRINSIC DIMENSIONALITY EXPLAINS THE EFFECTIVENESS OF LANGUAGE MODEL FINE-TUNING</a>. These studies demonstrated that over-parameterized pre-trained models possess very low intrinsic dimensions, i.e., the minimum number of parameters needed to achieve optimal performance is significantly lower than the total parameter count of the models.</p><blockquote><p><em>&#8220;We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned over-parametrized models in fact reside on a low intrinsic dimension.&#8221; 
<br>- </em><a href="https://arxiv.org/pdf/2106.09685">LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS</a></p></blockquote><p>LoRA allows efficient fine-tuning of large over-parameterized models like LLMs by exploiting this low intrinsic dimensionality. It works by decomposing the weight update matrices into two low-rank matrices, which drastically reduces the number of parameters that need to be trained during fine-tuning. This also decreases the amount of compute required to adapt the model, while maintaining model performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!00Aa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9c342f-786e-4bc9-82d8-eb693fa9956d_554x644.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!00Aa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9c342f-786e-4bc9-82d8-eb693fa9956d_554x644.png 424w, https://substackcdn.com/image/fetch/$s_!00Aa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9c342f-786e-4bc9-82d8-eb693fa9956d_554x644.png 848w, https://substackcdn.com/image/fetch/$s_!00Aa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9c342f-786e-4bc9-82d8-eb693fa9956d_554x644.png 1272w, https://substackcdn.com/image/fetch/$s_!00Aa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9c342f-786e-4bc9-82d8-eb693fa9956d_554x644.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!00Aa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9c342f-786e-4bc9-82d8-eb693fa9956d_554x644.png" width="252" height="292.9386281588448" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a9c342f-786e-4bc9-82d8-eb693fa9956d_554x644.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:644,&quot;width&quot;:554,&quot;resizeWidth&quot;:252,&quot;bytes&quot;:67698,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!00Aa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9c342f-786e-4bc9-82d8-eb693fa9956d_554x644.png 424w, https://substackcdn.com/image/fetch/$s_!00Aa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9c342f-786e-4bc9-82d8-eb693fa9956d_554x644.png 848w, https://substackcdn.com/image/fetch/$s_!00Aa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9c342f-786e-4bc9-82d8-eb693fa9956d_554x644.png 1272w, https://substackcdn.com/image/fetch/$s_!00Aa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9c342f-786e-4bc9-82d8-eb693fa9956d_554x644.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From <a href="https://arxiv.org/pdf/2106.09685">LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS</a></figcaption></figure></div><p><strong>Full fine-tuning v/s LoRA.</strong> During full fine-tuning of LLMs, the weight change &#916;W is computed and applied directly to the pretrained weights: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;W := W + &#916;W&quot;,&quot;id&quot;:&quot;IQZYUJPRNZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>In the case of LoRA, the weight update matrix is decomposed into two low-rank matrices, A and B. 
Thus, the weight update can be generalized as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;W := W + BA, \n\\;where \\; B&#8712;R^{d&#215;r},A&#8712;R^{r&#215;k}, \\; and \\; r&#8810;min(d,k)&quot;,&quot;id&quot;:&quot;PVEUCUIKKE&quot;}" data-component-name="LatexBlockToDOM"></div><p> This decomposition drastically reduces the number of parameters that need to be updated. </p><p>Let&#8217;s consider the LLaMA 3 8B model as an example. The model has ~8 billion parameters, i.e., <em>n &#8776; 8 &#215; 10^9 parameters. </em>For simplicity, we will assume we have a single weight matrix with ~<em>8 &#215; 10^9 parameters </em>that needs to be updated. </p><p>During full fine-tuning:</p><ul><li><p>We would update all 8 billion parameters.</p></li><li><p>The weight change matrix, <em>&#916;W</em>, would also be an 8 billion parameter matrix.</p></li><li><p>Both <em>W</em> and <em>&#916;W</em> contain 8&nbsp;billion parameters.</p></li></ul><p>LoRA, in contrast, decomposes the weight change into two low-rank matrices, A and B:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;B&#8712;R^{d&#215;r}, \\;A&#8712;R^{r&#215;k}, \\; and \\; r&#8810;min(d,k)&quot;,&quot;id&quot;:&quot;OCPJRELSCE&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, d is the number of rows and k is the number of columns of the original matrix, and r is the rank of the decomposition, which is much smaller than both. The weight update is approximated as: &#916;W &#8776; BA. 
The number of parameters in A and B combined (d&#215;r + r&#215;k) is much smaller than the original 8 &#215; 10^9 parameters.</p><ul><li><p>Say the original matrix has dimensions d = 80,000 and k = 100,000. For full fine-tuning, the number of trainable parameters is d &#215; k = 80,000 &#215; 100,000 = 8,000,000,000 (8 billion).</p></li><li><p>With LoRA at r = 16, the number of trainable parameters is 80,000 &#215; 16 + 16 &#215; 100,000 = 2,880,000.</p></li></ul><p>Thus, LoRA drastically reduces the number of trainable parameters, which offers memory and compute efficiency during fine-tuning. However, it's important to note that in practice the percentage reduction would differ from our simplified example, given that real models have multiple weight matrices distributed across different layers instead of just one.</p><p>In addition to reducing the number of trainable parameters, LoRA has added advantages including:</p><ul><li><p>A pre-trained model can be frozen and shared across many small LoRA modules that can be efficiently swapped for different tasks. This significantly reduces storage requirements and task-switching overhead.</p></li><li><p>Since LoRA only optimizes the low-rank decomposition matrices, the memory requirement is much lower and training is more efficient.</p></li><li><p>LoRA does not add any inference latency, since the trainable matrices can be merged with the frozen pretrained weights.</p></li><li><p>LoRA fine-tuning can be combined with other PEFT methods like prefix tuning.</p></li></ul><h4>Quantized Low-Rank Adaptation (QLoRA)</h4><p>Quantized Low-Rank Adaptation, introduced in <a href="https://arxiv.org/pdf/2305.14314">QLORA: Efficient Finetuning of Quantized LLMs</a>, is a variant of LoRA that further reduces the memory used during model fine-tuning. Pretrained model weights are typically stored in 16- or 32-bit precision, which consumes significant memory. 
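</p><p>The arithmetic above, and the memory pressure that motivates quantization, can be sketched in a few lines. The byte counts are rough illustrations only: they ignore activations, gradients, and optimizer state.</p><pre><code>d, k, r = 80_000, 100_000, 16

full_ft_params = d * k          # full fine-tuning updates the entire matrix
lora_params = d * r + r * k     # LoRA trains only B (d x r) and A (r x k)
print(full_ft_params)           # 8000000000  (8 billion)
print(lora_params)              # 2880000     (~0.036% of the parameters)

n = 8e9                         # the ~8B frozen weights still occupy memory
print(n * 4 / 2**30)            # ~29.8 GiB at 32-bit precision
print(n * 0.5 / 2**30)          # ~3.7 GiB at 4-bit precision</code></pre><p>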
While LoRA reduces the memory requirement in comparison to full fine-tuning, it does not suffice for training very large models on consumer devices.  QLoRA addresses this by first quantizing a pretrained model to 4-bit precision and then training LoRA on top of this. </p><blockquote><p><em>&#8220;QLORA reduces the average memory requirements of finetuning a 65B parameter model from &gt;780GB of GPU memory to &lt;48GB without degrading the runtime or predictive performance compared to a 16- bit fully finetuned baseline.</em></p><p><em>- <a href="https://arxiv.org/pdf/2305.14314">QLORA: Efficient Finetuning of Quantized LLMs</a></em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-8LK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86629e53-54de-44e9-ba98-78d20b1b3b86_1948x982.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-8LK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86629e53-54de-44e9-ba98-78d20b1b3b86_1948x982.png 424w, https://substackcdn.com/image/fetch/$s_!-8LK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86629e53-54de-44e9-ba98-78d20b1b3b86_1948x982.png 848w, https://substackcdn.com/image/fetch/$s_!-8LK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86629e53-54de-44e9-ba98-78d20b1b3b86_1948x982.png 1272w, https://substackcdn.com/image/fetch/$s_!-8LK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86629e53-54de-44e9-ba98-78d20b1b3b86_1948x982.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-8LK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86629e53-54de-44e9-ba98-78d20b1b3b86_1948x982.png" width="1456" height="734" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/86629e53-54de-44e9-ba98-78d20b1b3b86_1948x982.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:734,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:231193,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-8LK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86629e53-54de-44e9-ba98-78d20b1b3b86_1948x982.png 424w, https://substackcdn.com/image/fetch/$s_!-8LK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86629e53-54de-44e9-ba98-78d20b1b3b86_1948x982.png 848w, https://substackcdn.com/image/fetch/$s_!-8LK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86629e53-54de-44e9-ba98-78d20b1b3b86_1948x982.png 1272w, https://substackcdn.com/image/fetch/$s_!-8LK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86629e53-54de-44e9-ba98-78d20b1b3b86_1948x982.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From <a href="https://arxiv.org/pdf/2305.14314">QLORA: Efficient Finetuning of Quantized LLMs</a></figcaption></figure></div><p>QLoRA offers reduced memory usage without sacrificing performance and it achieves this through three key techniques: <strong>4-bit NormalFloat Quantization, Double Quantization, and Paged Optimizers.</strong></p><blockquote><p><em>&#8220;QLORA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) Double Quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) Paged Optimizers to manage memory spikes. 
</em></p><p><em>- <a href="https://arxiv.org/pdf/2305.14314">QLORA: Efficient Finetuning of Quantized LLMs</a></em></p></blockquote><p><em>More on model quantization in the upcoming series &#8220;TinyBits&#8221;, where I will explore tiny models and discuss the latest techniques and research focused on optimizing smaller models for performance and efficiency!</em></p><p>In the following sections, we will present our experiments and findings from QLoRA fine-tuning of the LLaMA 3 8B model for Hindi and Nepali, two South Asian languages written using Devanagari script. Our previous analysis of the LLaMA 3 tokenizer revealed improved tokenization for these languages compared to the LLaMA 2 variant. However, we also observed that the tokenization quality still lags significantly behind that of English. For more details on this tokenization analysis, refer to the post below.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;eeb50cf1-5f26-49ff-9e25-635f64ef0776&quot;,&quot;caption&quot;:&quot;Meta AI launched LLaMA 3 earlier on Thursday: LLaMA-8B and 70B models. While I can't wait to conduct a comprehensive study of the model's multilingual abilities, in this introductory blog post, I will briefly discuss how it differs from LLaMA 2. Much of the information shared here is already available as part of the model's release notes. 
However, this&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Exploring multilingual aspects and vocabulary of LLaMA 3 compared to LLaMA 2&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:104468357,&quot;name&quot;:&quot;Shreeya Dhakal&quot;,&quot;bio&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ccf9e7e-b918-482e-b8ed-800ad52e084e_1176x982.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-22T05:04:38.990Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98a85f1-3ba5-476e-a7f9-cb914c26d7ec_2048x2048.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.icodeformybhasa.com/p/exploring-multilingual-aspects-and&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:143725873,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:1,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;#icodeformy&#2349;&#2366;&#2359;&#2366;&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d92160-5f82-40ca-8233-b069f42bbba6_1080x1080.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3>Experiment: QLoRA Fine-tuning LLaMA 3 8B Model</h3><p>The next set of experiments is to assess the baseline capabilities of LLaMA 3  for Hindi and Nepali and evaluate the effectiveness of QLoRA fine-tuning for these languages.  
We have strategically chosen these two languages to explore the model's capabilities in non-Latin scripts, representing different resource scenarios:</p><ul><li><p>Hindi: A high-resource language with a significant amount of digital content.</p></li><li><p>Nepali: A comparatively low-resource language with limited digital content.</p></li></ul><p>We used the <a href="https://github.com/unslothai/unsloth">unsloth library</a> for our experiments because it offers a straightforward way of fine-tuning LLMs along with enhanced speed during fine-tuning. Additionally, the availability of pre-quantized models makes unsloth an attractive choice for our experiments.</p><h4>Experimental Setup</h4><p><strong>Dataset. </strong>We utilized Hindi and Nepali translations of the <a href="https://github.com/tatsu-lab/stanford_alpaca">Alpaca Dataset</a>, a popular instruction-tuning dataset released as a part of the Stanford Alpaca Project, which aims to build an instruction-following LLaMA model. We split this dataset into train and test sets and fine-tuned the models on the train split.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-b57!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ce5e2d-90b9-4595-8f4f-e831b22ff75b_1966x428.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-b57!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ce5e2d-90b9-4595-8f4f-e831b22ff75b_1966x428.png 424w, https://substackcdn.com/image/fetch/$s_!-b57!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ce5e2d-90b9-4595-8f4f-e831b22ff75b_1966x428.png 848w, 
https://substackcdn.com/image/fetch/$s_!-b57!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ce5e2d-90b9-4595-8f4f-e831b22ff75b_1966x428.png 1272w, https://substackcdn.com/image/fetch/$s_!-b57!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ce5e2d-90b9-4595-8f4f-e831b22ff75b_1966x428.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-b57!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ce5e2d-90b9-4595-8f4f-e831b22ff75b_1966x428.png" width="1456" height="317" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/81ce5e2d-90b9-4595-8f4f-e831b22ff75b_1966x428.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:317,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:161840,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-b57!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ce5e2d-90b9-4595-8f4f-e831b22ff75b_1966x428.png 424w, https://substackcdn.com/image/fetch/$s_!-b57!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ce5e2d-90b9-4595-8f4f-e831b22ff75b_1966x428.png 848w, 
https://substackcdn.com/image/fetch/$s_!-b57!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ce5e2d-90b9-4595-8f4f-e831b22ff75b_1966x428.png 1272w, https://substackcdn.com/image/fetch/$s_!-b57!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ce5e2d-90b9-4595-8f4f-e831b22ff75b_1966x428.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Samples from Alpaca Dataset translated to Nepali</figcaption></figure></div><p><strong>QLoRA Setup. </strong>All the experiments and results in this post use the following configuration for QLoRA fine-tuning:</p><pre><code>model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                                  # LoRA rank of the decomposition matrices
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],  # attention and MLP projections
    lora_alpha = 16,                         # scaling factor for the LoRA updates
    lora_dropout = 0,                        # no dropout on LoRA layers
    bias = "none",                           # keep bias terms frozen
    use_gradient_checkpointing = "unsloth",  # unsloth's memory-efficient checkpointing
    random_state = 3407,                     # seed for reproducibility
    use_rslora = False,                      # rank-stabilized LoRA disabled
    loftq_config = None,                     # no LoftQ quantization-aware initialization
)</code></pre><p><strong>Metrics and Evaluation Criteria.</strong> We calculated ROUGE-L and BLEURT scores with respect to the ground-truth responses. In addition, we used ChatGPT as a judge to grade the responses on each of the following criteria. We manually reviewed the scores from different GPT versions and found that GPT-4o is relatively more reliable in grading the responses, so the results discussed below are from GPT-4o. The responses were graded on a 5-point scale, with 1 being the lowest and 5 being the highest score.</p><ul><li><p><strong>Relevance to instruction: </strong>How relevant the response generated by LLaMA 3 is to the instruction and input given to the model.</p></li><li><p><strong>Clarity and coherence: </strong>How clear and coherent the generated response is.</p></li><li><p><strong>Syntax and grammar: </strong>How correct the syntax and grammar of the generated response are.</p></li><li><p><strong>Completeness: </strong>How complete the response is.</p></li></ul><p>Since in our early analysis we observed several instances of hallucination, we also asked GPT to identify whether any hallucination exists in the response.</p><ul><li><p><strong>Hallucination Type: </strong>Identify whether the model response contains any type of hallucination: factual inaccuracies, nonsensical responses, contradictions, repetitions, or others.</p></li></ul><h4>Results</h4><p>Overall, the baseline LLaMA 3 4-bit model showed poor ability at generating both Nepali and Hindi, with low scores across all metrics and criteria. Surprisingly, the performance in Hindi (high-resource and one of the eight languages supported by LLaMA 3) was sometimes worse than in Nepali (low-resource). This suggests that the challenges in adapting to non-English languages extend beyond resource availability. 
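</p><p>For reference, the ROUGE-L metric used above can be sketched as an F-score over the longest common subsequence (LCS) of tokens. This is a simplified, whitespace-tokenized illustration of the idea; production implementations (and the scores reported here) add details such as stemming and recall weighting.</p><pre><code>def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L: F1 over the LCS of whitespace-split tokens."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)

print(round(rouge_l_f1("the cat sat on the mat", "the cat on the mat"), 3))  # 0.909</code></pre><p>BLEURT, in contrast, is a learned metric that requires a trained model checkpoint rather than a few lines of string matching.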
</p><blockquote><p><em>&#8220;Llama 3 supports 8 languages &#8212; English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai, although the underlying foundation model has been trained on a broader collection of languages.</em></p><p><em>- <a href="https://scontent-sea1-1.xx.fbcdn.net/v/t39.2365-6/453304228_1160109801904614_7143520450792086005_n.pdf?_nc_cat=108&amp;ccb=1-7&amp;_nc_sid=3c67a6&amp;_nc_ohc=HbUYp0un48IQ7kNvgEZ74Dq&amp;_nc_ht=scontent-sea1-1.xx&amp;oh=00_AYCjzpR8D31rp8pyDB6TEjXpysFaKb9944AZBQ4fetflgQ&amp;oe=66DDE0C7">The Llama 3 Herd of Models</a></em></p></blockquote><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/LccDR/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa7652d2-acc4-490e-997e-9f299d37875c_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:468,&quot;title&quot;:&quot;Table 1:  Results for Baseline LLaMA 3 and Fine-tuned QLoRA Model for Nepali and Hindi&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/LccDR/1/" width="730" height="468" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>QLoRA fine-tuning for both Hindi and Nepali significantly improved performance across all metrics and criteria. 
Score distributions for the fine-tuned model (orange) shifted towards higher values for all four criteria: <em>syntax, completeness, clarity and coherence, and relevance.</em> </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PcOW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ba8bf4-adf6-4772-8b3c-bbd615b593c8_2240x1260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PcOW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ba8bf4-adf6-4772-8b3c-bbd615b593c8_2240x1260.png 424w, https://substackcdn.com/image/fetch/$s_!PcOW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ba8bf4-adf6-4772-8b3c-bbd615b593c8_2240x1260.png 848w, https://substackcdn.com/image/fetch/$s_!PcOW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ba8bf4-adf6-4772-8b3c-bbd615b593c8_2240x1260.png 1272w, https://substackcdn.com/image/fetch/$s_!PcOW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ba8bf4-adf6-4772-8b3c-bbd615b593c8_2240x1260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PcOW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ba8bf4-adf6-4772-8b3c-bbd615b593c8_2240x1260.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00ba8bf4-adf6-4772-8b3c-bbd615b593c8_2240x1260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:415501,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PcOW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ba8bf4-adf6-4772-8b3c-bbd615b593c8_2240x1260.png 424w, https://substackcdn.com/image/fetch/$s_!PcOW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ba8bf4-adf6-4772-8b3c-bbd615b593c8_2240x1260.png 848w, https://substackcdn.com/image/fetch/$s_!PcOW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ba8bf4-adf6-4772-8b3c-bbd615b593c8_2240x1260.png 1272w, https://substackcdn.com/image/fetch/$s_!PcOW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ba8bf4-adf6-4772-8b3c-bbd615b593c8_2240x1260.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Score distributions for the fine-tuned QLoRA (orange) v/s Baseline LLaMA (blue) on <em>relevance, clarity and coherence, syntax and completeness </em>of generation for Nepali.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1u7X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c52cfb6-8bef-48de-a9c6-9ad1b18bbf38_2240x1260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1u7X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c52cfb6-8bef-48de-a9c6-9ad1b18bbf38_2240x1260.png 424w, 
https://substackcdn.com/image/fetch/$s_!1u7X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c52cfb6-8bef-48de-a9c6-9ad1b18bbf38_2240x1260.png 848w, https://substackcdn.com/image/fetch/$s_!1u7X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c52cfb6-8bef-48de-a9c6-9ad1b18bbf38_2240x1260.png 1272w, https://substackcdn.com/image/fetch/$s_!1u7X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c52cfb6-8bef-48de-a9c6-9ad1b18bbf38_2240x1260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1u7X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c52cfb6-8bef-48de-a9c6-9ad1b18bbf38_2240x1260.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c52cfb6-8bef-48de-a9c6-9ad1b18bbf38_2240x1260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:431902,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1u7X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c52cfb6-8bef-48de-a9c6-9ad1b18bbf38_2240x1260.png 424w, 
https://substackcdn.com/image/fetch/$s_!1u7X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c52cfb6-8bef-48de-a9c6-9ad1b18bbf38_2240x1260.png 848w, https://substackcdn.com/image/fetch/$s_!1u7X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c52cfb6-8bef-48de-a9c6-9ad1b18bbf38_2240x1260.png 1272w, https://substackcdn.com/image/fetch/$s_!1u7X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c52cfb6-8bef-48de-a9c6-9ad1b18bbf38_2240x1260.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
</line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Score distributions for the fine-tuned QLoRA (orange) v/s Baseline LLaMA (blue) on relevance, clarity and coherence, syntax and completeness of generation for Hindi.</figcaption></figure></div><p>Additionally, there is a significant reduction in nonsensical hallucinations, contradictions, and repetitions for the fine-tuned model for Nepali. This suggests that fine-tuning helped the model better understand Nepali text and related linguistic nuances, which may not have been well captured in the original pre-trained model. However, there is an increase in factual hallucinations after fine-tuning, possibly because the model overfit on specific factual details during fine-tuning. </p><p>The fine-tuned Hindi model also produces fewer nonsensical and repeated responses than the baseline. Note that we do not observe the increase in factual hallucination seen for Nepali; instead, contradictions increase for Hindi. Since we fine-tuned each model on a translated version of the Alpaca dataset, the Nepali and Hindi translations may contain ambiguities or inconsistencies that propagate during fine-tuning, increasing factual hallucinations in the case of Nepali and contradictions in the case of Hindi. 
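The comparisons above come from scoring each generation with an LLM judge on a small rubric and labeling its hallucination type. As a minimal sketch of how such judge outputs can be tallied into score and hallucination-type distributions (the field names, score scale, and labels below are assumptions for illustration, not the actual schema used in this work):

```python
from collections import Counter

# Hypothetical rubric fields and labels; the actual judge prompt/schema
# used in this work is not shown, so these names are assumptions.
CRITERIA = ["relevance", "clarity_coherence", "syntax", "completeness"]

def score_distributions(judged):
    """Tally per-criterion judge scores (e.g. on a 1-5 scale) into count distributions."""
    dists = {c: Counter() for c in CRITERIA}
    for row in judged:
        for c in CRITERIA:
            dists[c][row[c]] += 1
    return dists

def hallucination_counts(judged):
    """Count hallucination-type labels (e.g. 'none', 'factual', 'nonsensical')."""
    return Counter(row["hallucination"] for row in judged)

# Toy judge outputs for two generations (illustrative only).
judged = [
    {"relevance": 4, "clarity_coherence": 5, "syntax": 4, "completeness": 3,
     "hallucination": "none"},
    {"relevance": 2, "clarity_coherence": 3, "syntax": 4, "completeness": 2,
     "hallucination": "factual"},
]
relevance_dist = score_distributions(judged)["relevance"]  # counts: {4: 1, 2: 1}
halluc_dist = hallucination_counts(judged)                 # counts: {'none': 1, 'factual': 1}
```

Plotting these per-criterion counts side by side for the baseline and fine-tuned runs yields histograms like the ones shown here.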
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IrUQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3803f2-b2a9-492f-9ef4-5e844723f33c_1456x826.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IrUQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3803f2-b2a9-492f-9ef4-5e844723f33c_1456x826.png 424w, https://substackcdn.com/image/fetch/$s_!IrUQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3803f2-b2a9-492f-9ef4-5e844723f33c_1456x826.png 848w, https://substackcdn.com/image/fetch/$s_!IrUQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3803f2-b2a9-492f-9ef4-5e844723f33c_1456x826.png 1272w, https://substackcdn.com/image/fetch/$s_!IrUQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3803f2-b2a9-492f-9ef4-5e844723f33c_1456x826.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IrUQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3803f2-b2a9-492f-9ef4-5e844723f33c_1456x826.png" width="1456" height="826" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f3803f2-b2a9-492f-9ef4-5e844723f33c_1456x826.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:826,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:116042,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IrUQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3803f2-b2a9-492f-9ef4-5e844723f33c_1456x826.png 424w, https://substackcdn.com/image/fetch/$s_!IrUQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3803f2-b2a9-492f-9ef4-5e844723f33c_1456x826.png 848w, https://substackcdn.com/image/fetch/$s_!IrUQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3803f2-b2a9-492f-9ef4-5e844723f33c_1456x826.png 1272w, https://substackcdn.com/image/fetch/$s_!IrUQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3803f2-b2a9-492f-9ef4-5e844723f33c_1456x826.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
<g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Distribution of <em>hallucination type</em> for the fine-tuned LoRA (orange) v/s Baseline LLaMA (blue) for Nepali</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tGqc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf468dd6-ce72-49d5-b93d-3c3f99672daf_1478x854.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tGqc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf468dd6-ce72-49d5-b93d-3c3f99672daf_1478x854.png 424w, 
https://substackcdn.com/image/fetch/$s_!tGqc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf468dd6-ce72-49d5-b93d-3c3f99672daf_1478x854.png 848w, https://substackcdn.com/image/fetch/$s_!tGqc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf468dd6-ce72-49d5-b93d-3c3f99672daf_1478x854.png 1272w, https://substackcdn.com/image/fetch/$s_!tGqc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf468dd6-ce72-49d5-b93d-3c3f99672daf_1478x854.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tGqc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf468dd6-ce72-49d5-b93d-3c3f99672daf_1478x854.png" width="1456" height="841" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af468dd6-ce72-49d5-b93d-3c3f99672daf_1478x854.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:841,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:119521,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tGqc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf468dd6-ce72-49d5-b93d-3c3f99672daf_1478x854.png 424w, 
https://substackcdn.com/image/fetch/$s_!tGqc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf468dd6-ce72-49d5-b93d-3c3f99672daf_1478x854.png 848w, https://substackcdn.com/image/fetch/$s_!tGqc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf468dd6-ce72-49d5-b93d-3c3f99672daf_1478x854.png 1272w, https://substackcdn.com/image/fetch/$s_!tGqc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf468dd6-ce72-49d5-b93d-3c3f99672daf_1478x854.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Distribution of <em>hallucination type</em> for the fine-tuned LoRA (orange) v/s Baseline LLaMA (blue) model for Hindi</figcaption></figure></div><h4>Limitations and Future Work</h4><p>While we observed significant performance gains for both Hindi and Nepali text generation using QLoRA fine-tuning, there are several limitations and avenues for future work. The quality of the fine-tuning dataset strongly determines the quality of generation; since we used a translated dataset, translation errors may be amplified during LoRA fine-tuning. A human-curated, high-quality, and diverse dataset, for both fine-tuning and evaluation, should be a consideration for future work. In the quick experiments in this work, we used GPT as a judge to score responses; having humans grade the models' outputs would provide better insight into model performance. Additionally, techniques to regularize the fine-tuning process, such as adversarial training and controlled generation, could help mitigate factual hallucinations and contradictions.</p><h4>Acknowledgements</h4><p>On this blog, I collaborated with <a href="https://www.linkedin.com/in/astroshilpa/">Shilpa Bhandari</a>, who helped me with the preliminary analysis of baseline results and the fine-tuning and analysis of the Hindi results. </p><p>Shilpa graduated with a Bachelor's in Mathematics and a Bachelor's in Computer Science from Youngstown State University, Ohio, in 2021. She currently works as a data analyst reporting on schedule and financial data at a utility company. 
As the founder of the <a href="https://nepaleseintech.org">Nepalese in Tech</a> Discord community with 1000+ members, she is passionate about technical research involving the Nepalese language and the Nepalese community.</p>]]></content:encoded></item><item><title><![CDATA[Aligning LLMs - Fine-Tuning LLaMA with SFT and RLHF ]]></title><description><![CDATA[Part 3: Understanding LLM alignment with Supervised Fine-Tuning and Reinforcement Learning from Human Feedback]]></description><link>https://www.icodeformybhasa.com/p/aligning-llms-fine-tuning-llama-with</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/aligning-llms-fine-tuning-llama-with</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Thu, 04 Jul 2024 05:15:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kcjR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d90e86-893e-4931-8b37-ff33eb6c93b1_1456x816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Large Language Models (LLMs) like LLaMA are pretrained on large amounts of unlabeled text with self-supervised training objectives like next token 
prediction. Pretraining LLMs with self-supervised objectives allows the model to learn rich representations of language across different domains from the large volume of unlabeled text that is readily available. However, such pretrained models might not be optimized for specific downstream tasks, domains, or desired behaviors, and this is where aligning large pretrained language models comes into play. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kcjR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d90e86-893e-4931-8b37-ff33eb6c93b1_1456x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kcjR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d90e86-893e-4931-8b37-ff33eb6c93b1_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!kcjR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d90e86-893e-4931-8b37-ff33eb6c93b1_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!kcjR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d90e86-893e-4931-8b37-ff33eb6c93b1_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!kcjR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d90e86-893e-4931-8b37-ff33eb6c93b1_1456x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kcjR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d90e86-893e-4931-8b37-ff33eb6c93b1_1456x816.png" 
width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92d90e86-893e-4931-8b37-ff33eb6c93b1_1456x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1701921,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kcjR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d90e86-893e-4931-8b37-ff33eb6c93b1_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!kcjR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d90e86-893e-4931-8b37-ff33eb6c93b1_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!kcjR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d90e86-893e-4931-8b37-ff33eb6c93b1_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!kcjR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d90e86-893e-4931-8b37-ff33eb6c93b1_1456x816.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
<g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Generated using Midjourney</figcaption></figure></div><p>This is the third post of the blog series on the LLaMA family of models. In <a href="https://www.icodeformybhasa.com/p/exploring-multilingual-aspects-and">part one</a>, we briefly compared LLaMA 2 and LLaMA 3 and discussed the improvements in LLaMA 3, particularly the improved tokenizer and the increased focus on the multilingual ability of this third iteration of the model. </p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;a4e07a0e-fde6-4b45-8ad0-dd380d5e9853&quot;,&quot;caption&quot;:&quot;Meta AI launched LLaMA 3 earlier on Thursday: LLaMA-8B and 70B models. While I can't wait to conduct a comprehensive study of the model's multilingual abilities, in this introductory blog post, I will briefly discuss how it differs from LLaMA 2. Much of the information shared here is already available as part of the model's release notes. 
However, this&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Exploring multilingual aspects and vocabulary of LLaMA 3 compared to LLaMA 2&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:104468357,&quot;name&quot;:&quot;Shreeya Dhakal&quot;,&quot;bio&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ccf9e7e-b918-482e-b8ed-800ad52e084e_1176x982.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-22T05:04:38.990Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98a85f1-3ba5-476e-a7f9-cb914c26d7ec_2048x2048.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.icodeformybhasa.com/p/exploring-multilingual-aspects-and&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:143725873,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:1,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;#icodeformy&#2349;&#2366;&#2359;&#2366;&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d92160-5f82-40ca-8233-b069f42bbba6_1080x1080.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.icodeformybhasa.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p 
class="cta-caption">Thanks for reading #icodeformy&#2349;&#2366;&#2359;&#2366;! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><a href="https://www.icodeformybhasa.com/p/the-llama-family-of-models-model">Part two</a> of the series discusses the architecture of the LLaMA family of models along with the modifications that are made on top of the original transformer model and the techniques employed for efficient pretraining of the LLaMA models.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;b40752c9-aeff-47ad-9b38-1bd3cc028725&quot;,&quot;caption&quot;:&quot;Since February 2023, Meta has open-sourced three versions of their LLaMA language model. This has enabled thousands of people in the AI and NLP communities to explore and build upon the LLaMA models for their use-cases. 
On April 18, 2024, Meta open-sourced&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The LLaMA Family of Models, Model Architecture, Size, and Scaling Laws &quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:104468357,&quot;name&quot;:&quot;Shreeya Dhakal&quot;,&quot;bio&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ccf9e7e-b918-482e-b8ed-800ad52e084e_1176x982.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-05-05T21:19:08.993Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79826e3-d938-417a-8e68-b7a3ed3e7bff_1280x1280.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.icodeformybhasa.com/p/the-llama-family-of-models-model&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:143924377,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:1,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;#icodeformy&#2349;&#2366;&#2359;&#2366;&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d92160-5f82-40ca-8233-b069f42bbba6_1080x1080.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>In this post, we will be focusing on methods used in aligning LLMs to specific tasks and desired behaviors. 
We will also walk through fine-tuning details, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), as discussed in the <a href="https://arxiv.org/pdf/2307.09288">LLaMA 2 paper</a>. Additionally, we will discuss the LLM safety approaches employed by the LLaMA 2 model during the fine-tuning process. Fine-tuning details for LLaMA 3 are out of scope for this post, as the paper for the model has not been released yet.</p><h2>RECAP: Large Language Models</h2><p>Large language models (LLMs) are deep neural network models trained on large corpora of text to understand and generate natural language. The Transformer, introduced in the 2017 paper <a href="https://arxiv.org/pdf/1706.03762">&#8220;Attention Is All You Need&#8221;</a>, is now the most widely adopted architecture for language modeling. The following resources give a comprehensive understanding of language modeling and transformer-based language models, and form an important foundation for the rest of the blog.</p><ul><li><p><a href="https://wandb.ai/madhana/Language-Models/reports/Language-Modeling-A-Beginner-s-Guide---VmlldzozMzk3NjI3">Language Modeling: A Beginner's Guide</a></p></li><li><p><a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a></p></li><li><p><a href="https://magazine.sebastianraschka.com/p/understanding-large-language-models">Understanding Large Language Models</a></p></li><li><p><a href="https://www.linkedin.com/pulse/transformer-architectures-dummies-part-2-decoder-only-qi6vc/">Transformer Architectures for Dummies - Part 2 (Decoder Only Architectures)</a></p></li></ul><h2>Customizing and Aligning LLMs</h2><p>Training LLMs involves multiple stages. The first is pretraining the models on a large unlabeled corpus using self-supervised techniques like next-token/sentence prediction and/or masked language modeling. 
Pretraining helps models learn rich representations of language and acquire general knowledge across multiple domains without relying on explicit annotations. However, pretraining LLMs is computationally expensive and unattainable for many individuals and organizations. These pretrained LLMs also often need to be adapted for specialized goals and use-cases.</p><blockquote><p><em>&#8220;According to AI Index estimates, the training costs of state-of-the-art AI models have reached unprecedented levels. For example, OpenAI&#8217;s GPT-4 used an estimated $78 million worth of compute to train, while Google&#8217;s Gemini Ultra cost $191 million for compute.&#8221;</em></p><p>- <em><a href="https://aiindex.stanford.edu/wp-content/uploads/2024/04/HAI_2024_AI-Index-Report.pdf">Artificial Intelligence Index Report 2024</a> by HAI Stanford</em></p></blockquote><p>After the initial pretraining, LLMs are fine-tuned, or sometimes further pretrained, to customize them for specific tasks, domains, or goals. In addition to adapting LLMs to specific tasks and domains, fine-tuning allows models to align with desired behaviors and safety considerations. This is achieved through techniques like Supervised Fine-Tuning (SFT) and/or Reinforcement Learning from Human Feedback (RLHF), which are computationally cheaper than pretraining but time-consuming, given that these processes typically require high-quality, human-curated datasets. </p><p>A three-step alignment process described in the <a href="https://arxiv.org/pdf/2203.02155">InstructGPT paper</a> is a widely adopted method for training language models to exhibit desired behaviors. The first step, supervised fine-tuning (SFT), involves teaching the model to follow instructions. Steps two and three involve learning from human feedback what a desired behavior is and optimizing the model to produce outputs that align with the desired behaviors. 
<em>Note</em> that only steps 1 and 3 involve modifying model parameters; in the second step, the data to train the reward model is created, and the reward model that scores responses during RLHF fine-tuning is trained.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WrYT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a15c8da-151e-4cdd-b8d1-515c67cf547e_1422x982.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WrYT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a15c8da-151e-4cdd-b8d1-515c67cf547e_1422x982.png 424w, https://substackcdn.com/image/fetch/$s_!WrYT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a15c8da-151e-4cdd-b8d1-515c67cf547e_1422x982.png 848w, https://substackcdn.com/image/fetch/$s_!WrYT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a15c8da-151e-4cdd-b8d1-515c67cf547e_1422x982.png 1272w, https://substackcdn.com/image/fetch/$s_!WrYT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a15c8da-151e-4cdd-b8d1-515c67cf547e_1422x982.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WrYT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a15c8da-151e-4cdd-b8d1-515c67cf547e_1422x982.png" width="1422" height="982" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a15c8da-151e-4cdd-b8d1-515c67cf547e_1422x982.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:982,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:275919,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WrYT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a15c8da-151e-4cdd-b8d1-515c67cf547e_1422x982.png 424w, https://substackcdn.com/image/fetch/$s_!WrYT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a15c8da-151e-4cdd-b8d1-515c67cf547e_1422x982.png 848w, https://substackcdn.com/image/fetch/$s_!WrYT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a15c8da-151e-4cdd-b8d1-515c67cf547e_1422x982.png 1272w, https://substackcdn.com/image/fetch/$s_!WrYT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a15c8da-151e-4cdd-b8d1-515c67cf547e_1422x982.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Three-step model alignment process from <a href="https://arxiv.org/pdf/2203.02155">InstructGPT paper</a>.</figcaption></figure></div><p>LLaMA 2 chat was aligned towards dialog-style instructions using the three-step process. 
It was aligned for helpfulness, i.e. &#8220;<em>how well Llama 2-Chat responses fulfill users&#8217; requests and provide requested information,&#8221; </em>and safety, which refers to &#8220;<em>whether Llama 2-Chat&#8217;s responses are unsafe.&#8221; </em>The LLaMA 3 release note leaves out the fine-tuning details, but, speculating based on the information in the note, it appears to follow a similar process to LLaMA 2 chat.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MR2K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b0f2c0-c4b2-4d06-b5dc-fffab46d5bc0_2508x1136.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MR2K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b0f2c0-c4b2-4d06-b5dc-fffab46d5bc0_2508x1136.png 424w, https://substackcdn.com/image/fetch/$s_!MR2K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b0f2c0-c4b2-4d06-b5dc-fffab46d5bc0_2508x1136.png 848w, https://substackcdn.com/image/fetch/$s_!MR2K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b0f2c0-c4b2-4d06-b5dc-fffab46d5bc0_2508x1136.png 1272w, https://substackcdn.com/image/fetch/$s_!MR2K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b0f2c0-c4b2-4d06-b5dc-fffab46d5bc0_2508x1136.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!MR2K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b0f2c0-c4b2-4d06-b5dc-fffab46d5bc0_2508x1136.png" width="1456" height="659" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88b0f2c0-c4b2-4d06-b5dc-fffab46d5bc0_2508x1136.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:659,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:401493,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MR2K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b0f2c0-c4b2-4d06-b5dc-fffab46d5bc0_2508x1136.png 424w, https://substackcdn.com/image/fetch/$s_!MR2K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b0f2c0-c4b2-4d06-b5dc-fffab46d5bc0_2508x1136.png 848w, https://substackcdn.com/image/fetch/$s_!MR2K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b0f2c0-c4b2-4d06-b5dc-fffab46d5bc0_2508x1136.png 1272w, https://substackcdn.com/image/fetch/$s_!MR2K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b0f2c0-c4b2-4d06-b5dc-fffab46d5bc0_2508x1136.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">LLaMA 2 Chat Fine-tuning from <a href="https://arxiv.org/pdf/2307.09288">LLaMA 2 Paper</a></figcaption></figure></div><h4>Supervised Fine-Tuning</h4><p>Supervised Fine-Tuning (SFT) is a technique for adapting general pretrained LLMs to a specific downstream task, domain, or application. It involves further training a pretrained model using a smaller, labeled dataset with a next-token prediction objective, similar to the original pretraining. Since SFT uses a high-quality, curated dataset to update the model&#8217;s parameters for a specialized task/application, it is called &#8220;supervised fine-tuning.&#8221; </p><blockquote><p><em>&#8220;Quality Is All You Need. 
</em></p><p><em>&#8220;We found that SFT annotations in the order of tens of thousands was enough to achieve a high-quality result.</em></p><p><em>- <a href="https://arxiv.org/pdf/2307.09288">LLaMA 2 Paper</a></em></p></blockquote><p>For the LLaMA 2 chat model, the authors initially conducted SFT with <a href="https://arxiv.org/pdf/2210.11416">publicly available instruction tuning data</a>. However, they discovered that despite its large size, this dataset lacked the diversity and quality needed to produce good results. After analyzing the results from this initial SFT phase, they decided to fine-tune the model using a smaller but high-quality dataset. Their findings show that using a small, curated, high-quality dataset is enough to produce high-quality results. They also highlight that &#8220;<strong>the outputs from the fine-tuned model were often competitive with SFT data handwritten by human annotators</strong>,&#8221; which is why they stopped annotating data for SFT at 27K and decided to <strong>devote more annotation effort to preference-based annotation for RLHF</strong>. Similarly, in the LLaMA 3 release note, they share a similar observation: the biggest gains in model quality came from the quality of the data.</p><blockquote><p><em>&#8220;Some of our biggest improvements in model quality came from carefully curating this data and performing multiple rounds of quality assurance on annotations provided by human annotators.</em></p><p><em>- <a href="https://ai.meta.com/blog/meta-llama-3/">LLaMA 3 Release Note</a></em></p></blockquote><p>The LLaMA 2 chat model was fine-tuned for two epochs. During this fine-tuning process, each sample was constructed by concatenating a prompt and its corresponding response, separated by a special token. 
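</p><p>A minimal sketch of assembling such a sample (the token IDs and separator ID below are made up for illustration): the prompt and response are concatenated, and the training labels mask out the prompt positions so that only response tokens contribute to the loss.</p>

```python
IGNORE_INDEX = -100  # a common "skip this position in the loss" label

def build_sft_sample(prompt_ids, response_ids, sep_id):
    """Concatenate prompt and response around a separator token.

    Labels mirror the inputs, except that prompt and separator
    positions are masked so the loss is computed on response tokens only.
    """
    input_ids = prompt_ids + [sep_id] + response_ids
    labels = [IGNORE_INDEX] * (len(prompt_ids) + 1) + response_ids
    return input_ids, labels

# Hypothetical token IDs for a (prompt, response) pair:
inputs, labels = build_sft_sample([5, 6, 7], [8, 9], sep_id=0)
# inputs -> [5, 6, 7, 0, 8, 9]; labels -> [-100, -100, -100, -100, 8, 9]
```

<p>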
The model was then trained using the next-token prediction objective, and the loss on prompt tokens was not considered during backpropagation.</p><h4>Reinforcement Learning from Human Feedback (RLHF)</h4><p>The next two steps in the three-step process for model alignment focus on training the model to select the optimal response. A prompt can have multiple plausible responses; given this, the goal is to teach the model to distinguish between the responses and choose the best one. The curated data used in SFT training only tells the model what a plausible response looks like; it does not teach the model which of the responses is the best one. This limitation in SFT is addressed by reinforcement learning from human feedback (RLHF), where a model is fine-tuned directly on human feedback on the model&#8217;s responses.</p><p>The two steps in RLHF are:</p><ol><li><p>Train a reward model, which is used to score responses. </p></li><li><p>Optimize the pretrained or supervised fine-tuned LLM to generate responses that receive high scores from the reward model.</p></li></ol><h4>Reward Model (RM)</h4><p>For a (prompt, response) pair, a reward model outputs a score based on the alignment objective. In LLaMA 2, there were two reward models, one for assigning a helpfulness score and the other a safety score. Sometimes a single reward model can be trained for multiple alignment objectives if the objectives are compatible. The decision to use a single or multiple reward models depends on the alignment goals and whether those goals can be meaningfully combined. Additional factors like compute and end-user requirements may also play a role in the decision-making. </p><p><strong>Data collection. </strong>The overall data curation process involves annotators writing prompts based on the alignment objective, like <em>helpfulness</em>, and the LLM generating multiple responses to each prompt. 
These responses are then reviewed by the annotators, who either score them or mark a preferred response, depending on the requirement. While training the model to assign a score is not in itself a challenging task, curating a dataset for each objective with consistent scores among the human annotators is difficult. For this reason, some approaches skip scoring the responses; instead, the annotators are asked to choose the best response among the choices. The preferred response, or the score for a response, is decided based on how close the response is to the alignment objective. </p><p>The human preference data curated for LLaMA 2 consisted of pairwise comparisons. Each instance of the comparison data included a prompt, a preferred response, and a rejected response. The annotators also rated the degree of preference using a four-point scale: significantly better, better, slightly better, or negligibly better. The data was collected in batches, and RLHF models were fine-tuned iteratively. 
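</p><p>One plausible shape for a single pairwise comparison record is sketched below (the field names and margin values are illustrative, not the actual annotation schema or the paper&#8217;s margin values):</p>

```python
# One hypothetical pairwise-comparison record:
comparison = {
    "prompt": "How do I politely decline a meeting?",
    "chosen": "You could reply: 'Thank you for the invite, but ...'",
    "rejected": "Just don't show up.",
    "degree": "significantly better",  # four-point preference scale
}

# The stated degree of preference can later map to a margin term in the
# reward model's ranking loss (the numeric values here are made up):
MARGINS = {
    "significantly better": 1.0,
    "better": 0.666,
    "slightly better": 0.333,
    "negligibly better": 0.0,
}
margin = MARGINS[comparison["degree"]]
```

<p>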
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-svY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf231c2b-5e29-48a8-a781-97e9653cfa77_1586x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-svY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf231c2b-5e29-48a8-a781-97e9653cfa77_1586x800.png 424w, https://substackcdn.com/image/fetch/$s_!-svY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf231c2b-5e29-48a8-a781-97e9653cfa77_1586x800.png 848w, https://substackcdn.com/image/fetch/$s_!-svY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf231c2b-5e29-48a8-a781-97e9653cfa77_1586x800.png 1272w, https://substackcdn.com/image/fetch/$s_!-svY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf231c2b-5e29-48a8-a781-97e9653cfa77_1586x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-svY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf231c2b-5e29-48a8-a781-97e9653cfa77_1586x800.png" width="1456" height="734" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf231c2b-5e29-48a8-a781-97e9653cfa77_1586x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:734,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:271645,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-svY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf231c2b-5e29-48a8-a781-97e9653cfa77_1586x800.png 424w, https://substackcdn.com/image/fetch/$s_!-svY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf231c2b-5e29-48a8-a781-97e9653cfa77_1586x800.png 848w, https://substackcdn.com/image/fetch/$s_!-svY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf231c2b-5e29-48a8-a781-97e9653cfa77_1586x800.png 1272w, https://substackcdn.com/image/fetch/$s_!-svY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf231c2b-5e29-48a8-a781-97e9653cfa77_1586x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From <a href="https://arxiv.org/pdf/2307.09288">LLaMA 2 Paper</a></figcaption></figure></div><p><strong>Reward model training. </strong>The reward model is trained on the human preference dataset to output a preference score, which indicates the quality of the model&#8217;s output on specific alignment objectives. To achieve this, the model&#8217;s classification layer used for next-token prediction is replaced by a regression layer, and the model is optimized using a loss function designed to incentivize the model to assign higher scores to preferred responses and lower scores to non-preferred ones. </p><p>In LLaMA 2, two reward models were trained, one for helpfulness and another for safety. 
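</p><p>Conceptually, turning a chat model into a reward model only changes the output head: instead of a distribution over the vocabulary, the network emits a single scalar. A toy stand-in (the featurizer below is a placeholder for the transformer backbone, and the weights are hypothetical):</p>

```python
class RewardModel:
    """Toy stand-in for a reward model: a backbone maps (prompt, response)
    to a feature vector, and a regression "head" (here, a dot product with
    learned weights) maps it to one scalar score."""

    def __init__(self, weights):
        self.weights = weights  # hypothetical learned regression weights

    def features(self, prompt, response):
        # Placeholder featurizer; a real model would run the transformer here.
        text = prompt + " " + response
        return [len(text) / 100.0, float(text.count("sorry"))]

    def score(self, prompt, response):
        f = self.features(prompt, response)
        return sum(w * x for w, x in zip(self.weights, f))

rm = RewardModel(weights=[0.3, -2.0])
s = rm.score("How are you?", "I am fine, thanks!")  # one scalar, not a distribution
```

<p>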
Each model was initialized from pre-trained chat model checkpoints and was trained by first converting the pairwise preference dataset into a binary ranking format, where each pair consisted of a "chosen" (preferred) and a "rejected" (less preferred) response. The reward model was then trained using a binary ranking loss function, which penalizes the model when it assigns a higher score to the rejected response. This effectively teaches the model to favor the preferred response over the rejected one. The binary ranking loss function takes the following form:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{ranking} = -\\log(\\sigma(r_\\theta(x, y_c) - r_\\theta(x, y_r)))&quot;,&quot;id&quot;:&quot;UJOTHRJWFZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Since the pairwise preference data collected for reward modeling in LLaMA 2 was also annotated on a four-level preference scale, a margin component, <em>m(r),</em> was added to the binary ranking loss to help the reward model assign more distinct scores, with larger gaps for generations that have bigger quality differences.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{ranking} = -\\log(\\sigma(r_\\theta(x, y_c) - r_\\theta(x, y_r) - m(r)))&quot;,&quot;id&quot;:&quot;NDREKGCWEZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>In addition to the pairwise preference data that was curated for LLaMA 2, the authors also used open source preference data for reward model training, because they did not observe any negative transfer from the public datasets on the reward model&#8217;s performance. 
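</p><p>The binary ranking loss with margin described above can be written out directly as a scalar sketch (the scores stand in for the reward model&#8217;s outputs for the chosen and rejected responses):</p>

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ranking_loss(r_chosen, r_rejected, margin=0.0):
    """-log(sigmoid(r_chosen - r_rejected - margin)).

    The loss shrinks as the chosen score exceeds the rejected score by
    more than the margin, pushing scores for pairs with a stronger
    stated preference further apart.
    """
    return -math.log(sigmoid(r_chosen - r_rejected - margin))

small_gap = ranking_loss(1.2, 1.0)               # close scores -> higher loss
large_gap = ranking_loss(3.0, 1.0, margin=0.5)   # clear separation -> lower loss
```

<p>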
The authors also claimed that incorporating open source data could allow better generalization and mitigate the risk of reward hacking, which would occur when the LLaMA Chat model exploits the weaknesses of the reward functions and artificially inflates the score without actually improving model performance.</p><blockquote><p>&#8220;However, in our experiments, we do not observe negative transfer from the open-source preference datasets. Thus, we have decided to keep them in our data mixture, as they could enable better generalization for the reward model and prevent reward hacking, i.e. Llama 2-Chat taking advantage of some weaknesses of our reward, and so artificially inflating the score despite performing less well.</p><p>- <em><a href="https://arxiv.org/pdf/2307.09288">LLaMA 2 Paper</a></em></p></blockquote><p>The reward models were trained for one epoch over the training data with the same optimizer parameters as for the base model.</p><h4>Iterative RLHF fine-tuning</h4><p>During RLHF, given a prompt, the model generates a response that is scored by the Reward Model (RM) trained in step 2 of the three-step model alignment process. As the model is fine-tuned to maximize these scores, it can diverge from its original behavior due to factors such as overfitting to reward signals, exposure bias, limited training data, or inaccurate scoring by the RM. This divergence could potentially compromise the broad knowledge base originally acquired during the pretraining phase. Therefore, it is crucial to ensure that the model trained during RLHF does not perform worse than or deviate from its original behavior.</p><p><strong><a href="https://openai.com/index/openai-baselines-ppo/">Proximal Policy Optimization (PPO)</a></strong> is a standard RL algorithm that helps constrain the model so that the model fine-tuned in this stage does not deviate too much from the original behavior. 
One of the ways PPO ensures that the model does not deviate too far is by adding a KL divergence penalty to the objective.</p><p>Here are some papers that are helpful in understanding the PPO algorithm:</p><ol><li><p><a href="https://arxiv.org/pdf/1602.01783">Asynchronous Methods for Deep Reinforcement Learning</a></p></li><li><p><a href="https://arxiv.org/pdf/1707.06347">Proximal Policy Optimization Algorithms</a></p></li><li><p><a href="https://arxiv.org/pdf/1909.08593">Fine-Tuning Language Models from Human Preferences</a></p></li><li><p><a href="https://arxiv.org/pdf/2009.01325">Learning to summarize from human feedback</a></p></li></ol><p>In addition to PPO, in LLaMA 2 the authors also explored <a href="https://arxiv.org/pdf/2204.05862">rejection sampling</a> for RLHF. </p><p><strong>Rejection sampling </strong>involves generating multiple candidate responses for each prompt and selecting the one with the highest reward score for the gradient update. In LLaMA 2, rejection sampling was done with the 70B LLaMA 2 Chat model. All the smaller models were fine-tuned on rejection-sampled data from the 70B LLaMA 2 Chat model, effectively distilling the capabilities of the larger model into the smaller ones. The authors trained five successive versions of RLHF models, RLHF-V1 to RLHF-V5. In LLaMA 2, instead of fine-tuning the model solely on the best candidate response from the previous iteration, the authors included the best candidate from the current iteration along with top-performing samples from all prior iterations. This choice was made to address a regression that they observed in experiments where they sampled the best candidate from the preceding iteration only.</p><p>Up until RLHF-V4, only rejection sampling was used for fine-tuning. Subsequently, they applied PPO on top of rejection sampling to refine the model checkpoint obtained from rejection sampling before proceeding with further sampling. 
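</p><p>Rejection sampling itself reduces to &#8220;generate k candidates, keep the highest-scoring one&#8221; (a sketch; <code>generate</code> and <code>reward</code> below are toy stand-ins for the chat model and the reward model):</p>

```python
import itertools

def rejection_sample(prompt, generate, reward, k=4):
    """Draw k candidate responses; keep the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda g: reward(prompt, g))

# Toy stand-ins: a "model" that cycles canned replies and a lookup-table "reward":
canned = itertools.cycle(["meh", "okay", "great", "bad"])
scores = {"meh": 0.2, "okay": 0.5, "great": 0.9, "bad": 0.0}
best = rejection_sample("hi", lambda p: next(canned), lambda p, g: scores[g])
# best == "great"
```

<p>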
</p><p>The objective being optimized is to maximize the expected reward with respect to the policy &#960; as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\arg \\max_{\\pi} \\mathbb{E}_{p \\sim D, g \\sim \\pi} [R(g \\mid p)]\n&quot;,&quot;id&quot;:&quot;QAKWABRFUR&quot;}" data-component-name="LatexBlockToDOM"></div><p>The reward function used during PPO is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;R(g \\mid p) = \\tilde{R}_c(g \\mid p) - \\beta D_{KL}(\\pi_\\theta(g \\mid p) \\parallel \\pi_0(g \\mid p))\n&quot;,&quot;id&quot;:&quot;KTKWHWPEDU&quot;}" data-component-name="LatexBlockToDOM"></div><p>Note that the second term in the final objective contains a KL divergence penalty that helps constrain the model such that it does not diverge from the original policy <em>&#960;_0. </em>Also,<em> Rc</em> is a piecewise combination of the safety (Rs) and helpfulness <em>(Rh)</em> reward models. </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;R_c(g \\mid p) = \\begin{cases}\nRs(g \\mid p) &amp; \\text{if } \\text{is_safety}(p) \\text{ or } Rs(g \\mid p) < 0.15 \\\\\nRh(g \\mid p) &amp; \\text{otherwise}\n\\end{cases}&quot;,&quot;id&quot;:&quot;KTNTAZAILQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>For prompts that are tagged as potentially eliciting unsafe responses, <em>Rc</em> prioritizes scores from the safety model. 
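</p><p>Putting the pieces above together in a toy sketch (omitting the whitening and logit transformation applied in the paper; <code>kl_div</code> is assumed to be computed elsewhere per sample):</p>

```python
def combined_reward(r_safety, r_helpful, is_safety_prompt, beta, kl_div):
    """Piecewise Rc followed by the KL penalty from the PPO objective."""
    SAFETY_THRESHOLD = 0.15  # threshold from the LLaMA 2 paper
    if is_safety_prompt or r_safety < SAFETY_THRESHOLD:
        rc = r_safety    # prioritize the safety reward model's score
    else:
        rc = r_helpful   # otherwise use the helpfulness score
    return rc - beta * kl_div  # penalize divergence from the original policy

# Helpfulness branch applies (non-safety prompt, safety score above threshold):
r = combined_reward(r_safety=0.9, r_helpful=0.7, is_safety_prompt=False,
                    beta=0.01, kl_div=2.0)  # 0.7 - 0.01 * 2.0
```

<p>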
If a prompt is tagged as a safety prompt, or the safety score for the response falls below a threshold of 0.15, <em>Rc</em> uses the score from the safety reward model instead of the helpfulness one.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\widetilde{R}_c(g \\mid p) = \\text{whiten}(\\text{logit}(R_c(g \\mid p)))&quot;,&quot;id&quot;:&quot;FQBRZKWVZQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>All the RLHF models were fine-tuned for between 200 and 400 iterations, and early stopping was applied based on evaluations on held-out prompts.</p><h4>Safety fine-tuning in LLaMA 2</h4><p>Safety was one of the two primary alignment objectives in LLaMA 2. The safety fine-tuning process is similar to the general fine-tuning methods, with additional efforts to ensure that the models avoid displaying unsafe behaviors. To this end, adversarial prompts and safe demonstrations were included in the SFT and RLHF processes. The RLHF pipeline was also refined with context distillation by prefixing a prompt with a safety instruction, such as &#8220;You are a safe and responsible assistant,&#8221; and then fine-tuning the model on the safer responses without the safety instruction. Additionally, the annotation team was guided to create adversarial prompts along two dimensions: risk categories and attack vectors.</p><blockquote><p>&#8220;we design instructions for our annotation team to create adversarial prompts along two dimensions: a risk category, or potential topic about which the LLM could produce unsafe content; and an attack vector, or question style to cover different varieties of prompts that could elicit bad model behaviors.</p><p><em>- <a href="https://arxiv.org/pdf/2307.09288">LLaMA 2 Paper</a></em></p></blockquote><p>They identified three risk categories: <em>illicit and criminal, hateful and harmful activities, and unqualified advice. 
</em>Several attack vectors were explored, including <em>psychological manipulation</em>, <em>logic manipulation, syntactic manipulation, semantic manipulation, perspective manipulation, and non-English languages, </em>among others. This comprehensive approach to data curation and RLHF fine-tuning was designed to systematically surface unsafe behaviors and train the model for safety and reliability. </p><h4>Conclusion</h4><p>Aligning LLMs is important for adapting pre-trained LLMs to specific tasks and desired behaviors. While the standard three-step model alignment process may seem straightforward, it is an involved process that requires a comprehensive fine-tuning dataset, rigorous evaluation processes, and multiple training iterations to ensure that the model's performance aligns with intended goals. Additionally, the concept of desired behaviors in LLMs seems to be evolving with increased emphasis on safety.</p><p>In the next post in this series, we will work on a Low Rank Adaptation (LoRA) fine-tuning of the LLaMA 3 model for Nepali and Hindi. We'll examine the model's multilingual capabilities and assess its performance on low-resource languages like Nepali. So stay tuned!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.icodeformybhasa.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading #icodeformy&#2349;&#2366;&#2359;&#2366;! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Large Language Models - A Curated Reading List]]></title><description><![CDATA[While I am working on the blog series on the LLaMA family of models, I have also put together a curated reading list of papers that chart the evolution of large language models.]]></description><link>https://www.icodeformybhasa.com/p/large-language-models-a-curated-reading</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/large-language-models-a-curated-reading</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Sun, 16 Jun 2024 02:31:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/009e2e4d-e3bd-44a3-9cbe-2b8496cac5cf_2500x1500.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>While I am working on the blog series on the LLaMA family of models, I have also put together a curated reading list of papers that chart the evolution of large language models. These papers provide crucial context for understanding the foundations of large language models and landscape of LLMs that are the backbone of systems like meta.ai, ChatGPT, and Claude, among others.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;4b436455-f9ee-4d05-ba5d-2d92c53f5d30&quot;,&quot;caption&quot;:&quot;Since February 2023, Meta has open-sourced three versions of their LLaMA language model. This has enabled thousands of people in the AI and NLP communities to explore and build upon the LLaMA models for their use-cases. 
On April 18, 2024, Meta open-sourced&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The LLaMA Family of Models, Model Architecture, Size, and Scaling Laws &quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:104468357,&quot;name&quot;:&quot;Shreeya Dhakal&quot;,&quot;bio&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ccf9e7e-b918-482e-b8ed-800ad52e084e_1176x982.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-05-05T21:19:08.993Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79826e3-d938-417a-8e68-b7a3ed3e7bff_1280x1280.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.icodeformybhasa.com/p/the-llama-family-of-models-model&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:143924377,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:1,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;#icodeformy&#2349;&#2366;&#2359;&#2366;&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d92160-5f82-40ca-8233-b069f42bbba6_1080x1080.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>In this blog, my attempt is to create a comprehensive reading list, including some that are featured in my blog above. 
</p><h3>Key research papers on deep learning architectures</h3><ol><li><p><a href="https://arxiv.org/pdf/1409.3215v3">Sequence to Sequence Learning with Neural Networks.</a> Sutskever et al., 2014. Google</p></li><li><p><a href="https://arxiv.org/pdf/1706.03762">Attention is All You Need. </a>Vaswani et al., 2017. Google</p></li><li><p><a href="https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf">Improving Language Understanding by Generative Pre-Training.</a> Radford et al., 2018. OpenAI - GPT-1</p></li><li><p><a href="https://aclanthology.org/N19-1423.pdf">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.</a> Devlin et al., 2018. Google</p></li><li><p><a href="https://arxiv.org/pdf/1910.13461">BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.</a> Lewis et al., 2019. Facebook</p></li><li><p><a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf">Language Models are Unsupervised Multitask Learners</a>. Radford et al., 2019. OpenAI - GPT-2</p></li><li><p><a href="https://arxiv.org/pdf/1910.10683">Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. </a>Raffel et al., 2019. Google - T5</p></li><li><p><a href="https://arxiv.org/pdf/2005.14165">Language Models are Few-Shot Learners.</a> Brown et al., 2020. OpenAI - GPT-3</p></li><li><p><a href="https://arxiv.org/abs/2002.04745">On Layer Normalization in the Transformer Architecture.</a> Xiong et al., 2020. </p></li></ol><p><em><strong>Survey Papers</strong></em></p><ol><li><p><a href="https://arxiv.org/abs/2303.18223">A Survey of Large Language Models.</a> Zhao et al., 2023. [<a href="https://github.com/RUCAIBox/LLMSurvey">Github</a>]</p></li><li><p><a href="https://arxiv.org/pdf/2404.04925">Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers.</a> Qin et al., 2024. 
</p></li><li><p><a href="https://arxiv.org/pdf/2402.06196">Large Language Models: A Survey. </a>Minaee et al., 2024.</p></li></ol><h3>Efficient pre-training and scaling laws</h3><ol><li><p><a href="https://arxiv.org/pdf/2001.08361">Scaling Laws for Neural Language Models.</a> Kaplan et al., 2020. OpenAI</p></li><li><p><a href="https://arxiv.org/pdf/2112.11446">Scaling Language Models: Methods, Analysis &amp; Insights from Training Gopher.</a> Rae et al., 2021. DeepMind</p></li><li><p><a href="https://arxiv.org/pdf/2203.15556">Training Compute-Optimal Large Language Models. </a>Hoffmann et al., 2022. DeepMind - Chinchilla</p></li><li><p><a href="https://arxiv.org/abs/2205.14135">FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. </a>Dao et al., 2022. Stanford University</p></li><li><p><a href="https://arxiv.org/pdf/2204.02311">PaLM: Scaling Language Modeling with Pathways. </a>Chowdhery et al., 2022. Google</p></li><li><p><a href="https://arxiv.org/pdf/2212.14034">Cramming: Training a Language Model on a Single GPU in One Day.</a> Geiping et al., 2022. University of Maryland, College Park</p></li></ol><p><em><strong>Survey Papers</strong></em></p><ol><li><p><a href="https://arxiv.org/pdf/2009.06732">Efficient Transformers: A Survey. </a>Tay et al., 2020 (Revised 2022). Google</p></li><li><p><a href="https://arxiv.org/pdf/2302.01107">A Survey on Efficient Training of Transformers. </a>Zhuang et al., 2023.</p></li></ol><h3>Fine-tuning and parameter-efficient transfer learning</h3><ol><li><p><a href="https://github.com/google-research/adapter-bert">Parameter-Efficient Transfer Learning for NLP.</a> Houlsby et al., 2019. Google [<a href="https://github.com/google-research/adapter-bert">Github</a>]</p></li><li><p><a href="https://arxiv.org/pdf/2109.01652">Finetuned Language Models are Zero-Shot Learners. </a>Wei et al., 2021. 
Google</p></li><li><p><a href="https://arxiv.org/pdf/2106.09685">LoRA: Low-Rank Adaptation of Large Language Models.</a> Hu et al., 2021. Microsoft [<a href="https://github.com/microsoft/LoRA">Github</a>] [<a href="https://www.youtube.com/watch?v=DhRoTONcyZE">Video</a>]</p></li><li><p><a href="https://arxiv.org/pdf/2305.14314">QLoRA: Efficient Finetuning of Quantized LLMs.</a> Dettmers et al., 2023. University of Washington [<a href="https://github.com/artidoro/qlora">Github</a>]</p></li></ol><p><em><strong>Survey Papers</strong></em></p><ol><li><p><a href="https://arxiv.org/pdf/2103.13630">A Survey of Quantization Methods for Efficient Neural Network Inference.</a> Gholami et al., 2021. UC Berkeley</p></li><li><p><a href="https://arxiv.org/pdf/2303.15647">Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning. </a>Lialin et al., 2022. UMass Lowell</p></li><li><p><a href="https://arxiv.org/pdf/2304.13712">Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond. </a>Yang et al., 2023. [<a href="https://github.com/Mooler0410/LLMsPracticalGuide">Github</a>]</p></li><li><p><a href="https://arxiv.org/abs/2312.00678">The Efficiency Spectrum of Large Language Models: An Algorithmic Survey.</a> Ding et al., 2024. Microsoft</p></li></ol><h3>Aligning LLMs</h3><ol><li><p><a href="https://arxiv.org/pdf/1706.03741">Deep Reinforcement Learning from Human Preferences.</a> Christiano et al., 2017 (Revised 2023). OpenAI and DeepMind</p></li><li><p><a href="https://arxiv.org/pdf/1909.08593">Fine-Tuning Language Models from Human Preferences.</a> Ziegler et al., 2019 (Revised 2020)</p></li><li><p><a href="https://arxiv.org/abs/2203.02155">Training Language Models to Follow Instructions with Human Feedback. </a>Ouyang et al., 2022. OpenAI - InstructGPT</p></li><li><p><a href="https://arxiv.org/pdf/2204.05862">Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. </a>Bai et al., 2022. 
Anthropic</p></li><li><p><a href="https://arxiv.org/pdf/2209.07858">Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. </a>Ganguli et al., 2022. Anthropic</p></li><li><p><a href="https://arxiv.org/pdf/2212.08073">Constitutional AI: Harmlessness from AI Feedback.</a> Bai et al., 2022. Anthropic</p></li></ol><p><em><strong>Survey Papers</strong></em></p><ol><li><p><a href="https://arxiv.org/pdf/2307.12966">Aligning Large Language Models with Human: A Survey.</a> Wang et al., 2023. Huawei Noah&#8217;s Ark Lab</p></li><li><p><a href="https://arxiv.org/pdf/2309.15025">Large Language Model Alignment: A Survey.</a> Shen et al., 2023. Tianjin University</p></li></ol><p><em>Note:</em> this is not an exhaustive reading list; I will update it as I come across new papers while I work on #icodeformy&#2349;&#2366;&#2359;&#2366;! There will be more such reading lists in the future for different areas within natural language processing. </p>]]></content:encoded></item><item><title><![CDATA[Whose Weights and Biases?]]></title><description><![CDATA[Who decides how AI thinks and behaves?]]></description><link>https://www.icodeformybhasa.com/p/whose-weights-and-biases</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/whose-weights-and-biases</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Sun, 19 May 2024 16:37:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!533A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee8111e-3dd0-4765-b6e4-165bbc8f02ca_1336x1374.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On May 13, OpenAI launched GPT-4o, their <a href="https://openai.com/index/hello-gpt-4o/">new flagship model that can reason across audio, vision, and text in real time</a>, and the next day, on May 15, the scientists <a href="https://www.vox.com/future-perfect/2024/5/17/24158403/openai-resignations-ai-safety-ilya-sutskever-jan-leike-artificial-intelligence">leading the superalignment team at OpenAI, Ilya Sutskever and Jan Leike, departed from the company.</a> While I am impressed by the future of human-tech interactions, I can&#8217;t help but wonder what influence OpenAI and similar enterprises will have over the information modern society consumes. The global information exchanged via news, social media, and even books is heavily polarized, fragmenting our knowledge of the world into small, disconnected, and biased information silos. 
Much of the AI at our disposal is trained on different versions of this data and has been shown to exhibit societal biases similar to those present in the data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!533A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee8111e-3dd0-4765-b6e4-165bbc8f02ca_1336x1374.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!533A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee8111e-3dd0-4765-b6e4-165bbc8f02ca_1336x1374.png 424w, https://substackcdn.com/image/fetch/$s_!533A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee8111e-3dd0-4765-b6e4-165bbc8f02ca_1336x1374.png 848w, https://substackcdn.com/image/fetch/$s_!533A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee8111e-3dd0-4765-b6e4-165bbc8f02ca_1336x1374.png 1272w, https://substackcdn.com/image/fetch/$s_!533A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee8111e-3dd0-4765-b6e4-165bbc8f02ca_1336x1374.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!533A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee8111e-3dd0-4765-b6e4-165bbc8f02ca_1336x1374.png" width="578" height="594.4401197604791" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ee8111e-3dd0-4765-b6e4-165bbc8f02ca_1336x1374.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1374,&quot;width&quot;:1336,&quot;resizeWidth&quot;:578,&quot;bytes&quot;:1906203,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!533A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee8111e-3dd0-4765-b6e4-165bbc8f02ca_1336x1374.png 424w, https://substackcdn.com/image/fetch/$s_!533A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee8111e-3dd0-4765-b6e4-165bbc8f02ca_1336x1374.png 848w, https://substackcdn.com/image/fetch/$s_!533A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee8111e-3dd0-4765-b6e4-165bbc8f02ca_1336x1374.png 1272w, https://substackcdn.com/image/fetch/$s_!533A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee8111e-3dd0-4765-b6e4-165bbc8f02ca_1336x1374.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Wordcloud generated using <em><a href="https://www.anthropic.com/news/collective-constitutional-ai-aligning-a-language-model-with-public-input">Collective Constitutional AI: Aligning a Language Model with Public Input</a></em></figcaption></figure></div><p>In this opinion piece, I will discuss the bias AI models encode and how it impacts society. I will also briefly talk about how firms like Open AI, Meta, Google and Anthropic are taking ethics and safety into considerations when training and deploying AI models. Drawing from the recent research and news, I will evaluate how the narrow lens of global tech might still fall short in developing AI for all. 
Finally, I will discuss AI regulations around the world and touch briefly on how the Global South lags behind in making the best use of AI systems.</p><h3>What is bias?</h3><p>Bias, in the context of AI, means prejudices that influence AI decision-making, leading to unfair outcomes that favor one group over another. These biases can stem from different sources: the data used to train the AI, the design of the algorithms, and the biases of the developers.</p><h4>It is everywhere!</h4><blockquote><p><em>&#8220;Bias occurs in many algorithms and AI systems &#8212; from <a href="https://time.com/5209144/google-search-engine-algorithm-bias-racism/">sexist and racist search results</a> to facial recognition systems that <a href="https://www.scientificamerican.com/article/police-facial-recognition-technology-cant-tell-black-people-apart/">perform worse on Black faces</a>.</em></p><p>- <em><a href="https://restofworld.org/2023/ai-image-stereotypes/">How AI reduces the world to stereotypes</a></em> <em>by <a href="https://restofworld.org/author/victoria-turk/">Victoria Turk</a></em></p></blockquote><p>Biases are not exclusive to AI models. Every form of information we consume has been filtered through the perceptions and interpretations of an individual or a group of individuals. From the books and articles we read to the videos we consume, every piece of information carries the imprint of its source. Even education systems and institutions impart knowledge influenced by cultural, societal, and historical contexts. The weights and biases shaping artificial intelligence reflect the predispositions of the humans involved in its development, along with disparities in representation present in online data sources. 
This disparity could stem from unequal access to the internet, varying propensities to discuss topics openly, or simply selection bias, among other factors.</p><p>Regardless of the source, AI biases cast an increasingly far-reaching impact as these models penetrate crucial areas like education, healthcare, and the legal system. AI is no longer just shaping our perceptions of the world around us; it now directly influences decisions that change people&#8217;s lives. </p><h4>From societal representations to encoded beliefs</h4><blockquote><p><em>&#8220;The world according to Stable Diffusion is run by White male CEOs. Women are rarely doctors, lawyers or judges. Men with dark skin commit crimes, while women with dark skin flip burgers.</em></p><p><em>- <a href="https://www.bloomberg.com/graphics/2023-generative-ai-bias/">Humans Are Biased. Generative AI Is Even Worse</a> by <a href="https://twitter.com/Leonardonclt">Leonardo Nicoletti</a> and <a href="https://twitter.com/dinabass">Dina Bass</a> for Bloomberg Technology + Equality</em></p></blockquote><p>Existing studies have shown that AI encodes human-like biases [<a href="https://purehost.bath.ac.uk/ws/portalfiles/portal/168480066/CaliskanEtAl_authors_full.pdf">1</a>,<a href="https://arxiv.org/pdf/2006.03955"> 2</a>,<a href="https://arxiv.org/pdf/1607.06520"> 3</a>,<a href="https://arxiv.org/pdf/2303.11408"> 4</a>]. A more<a href="https://www.bloomberg.com/graphics/2023-generative-ai-bias/"> recent study</a> by Bloomberg on Stable Diffusion shows that AI models no longer just mirror societal representation; they in fact distort it, exhibiting biases worse than those that exist in society. Currently, the data used to train AI is skewed towards the viewpoints of majority groups, often overshadowing marginalized communities. 
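</p><p>The association tests used in studies like [1] can be sketched in a few lines: compare how close a target word's embedding sits to one attribute set versus another. The vectors below are made-up 2-D toys, not real embeddings; real tests use trained word vectors and many words per set.</p>

```python
import numpy as np

# Toy sketch of a word-embedding association test (the idea behind WEAT):
# a word leans toward attribute set A if its mean cosine similarity to A
# exceeds its mean similarity to B. All vectors here are fabricated.
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(word_vec, attrs_a, attrs_b):
    # Positive score: closer to set A; negative: closer to set B
    return (np.mean([cosine(word_vec, a) for a in attrs_a])
            - np.mean([cosine(word_vec, b) for b in attrs_b]))

career = np.array([0.9, 0.1])
family = np.array([0.1, 0.9])
male_attrs = [np.array([0.8, 0.2])]
female_attrs = [np.array([0.2, 0.8])]

print(round(association(career, male_attrs, female_attrs), 3))  # → 0.643
print(round(association(family, male_attrs, female_attrs), 3))  # → -0.643
```

<p>An embedding model trained on skewed text shows exactly this kind of asymmetry for ostensibly neutral words, which is what the studies above measure at scale.</p><p>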
This perpetuates biased narratives that disadvantage vulnerable groups, reinforcing the inequalities and biases present in society.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6fIV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdecffc18-e851-47fa-a016-2f3e8dd182f2_1982x1450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6fIV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdecffc18-e851-47fa-a016-2f3e8dd182f2_1982x1450.png 424w, https://substackcdn.com/image/fetch/$s_!6fIV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdecffc18-e851-47fa-a016-2f3e8dd182f2_1982x1450.png 848w, https://substackcdn.com/image/fetch/$s_!6fIV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdecffc18-e851-47fa-a016-2f3e8dd182f2_1982x1450.png 1272w, https://substackcdn.com/image/fetch/$s_!6fIV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdecffc18-e851-47fa-a016-2f3e8dd182f2_1982x1450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6fIV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdecffc18-e851-47fa-a016-2f3e8dd182f2_1982x1450.png" width="1456" height="1065" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/decffc18-e851-47fa-a016-2f3e8dd182f2_1982x1450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1065,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:253992,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6fIV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdecffc18-e851-47fa-a016-2f3e8dd182f2_1982x1450.png 424w, https://substackcdn.com/image/fetch/$s_!6fIV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdecffc18-e851-47fa-a016-2f3e8dd182f2_1982x1450.png 848w, https://substackcdn.com/image/fetch/$s_!6fIV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdecffc18-e851-47fa-a016-2f3e8dd182f2_1982x1450.png 1272w, https://substackcdn.com/image/fetch/$s_!6fIV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdecffc18-e851-47fa-a016-2f3e8dd182f2_1982x1450.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <em> <a href="https://www.bloomberg.com/graphics/2023-generative-ai-bias/">Humans Are Biased. Generative AI Is Even Worse</a> </em></figcaption></figure></div><p>While there are structured processes to measure these biases, debiasing the AI models, especially those deployed to across different communities, is hard as the notion of right and wrong is itself fluid, and evolving over time - differently across cultures and regions.</p><blockquote><p><em>&#8220;the idea of correcting even prejudiced biases is also problematic. That is because societal understanding of prejudice is constantly evolving, along with our understanding of humanity and human rights, and also varies between cultures. 
It is therefore hard or impossible to specify algorithmically what is prejudiced.</em></p><p><em>- <a href="https://arxiv.org/pdf/1608.07187">Semantics derived automatically from language corpora contain human-like biases</a></em></p></blockquote><p>The nature of the biases in our society, and hence in the data we have, is complex and multifaceted. A holistic approach to eliminating biases from models trained on such data is hard, and attempts at correcting these models while viewing the problem through a limited lens can result in the unintended encoding of the biases of the humans who create them. The <a href="https://www.theverge.com/2024/2/21/24079371/google-ai-gemini-generative-inaccurate-historical">recent Gemini incident</a> is an example of the impact such tunnel vision can have: Google's overcorrection resulted in the model generating inaccurate and problematic portrayals of history.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lkir!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c666a88-9934-480d-b1ea-a0532c06e651_682x750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lkir!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c666a88-9934-480d-b1ea-a0532c06e651_682x750.png 424w, https://substackcdn.com/image/fetch/$s_!lkir!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c666a88-9934-480d-b1ea-a0532c06e651_682x750.png 848w, 
https://substackcdn.com/image/fetch/$s_!lkir!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c666a88-9934-480d-b1ea-a0532c06e651_682x750.png 1272w, https://substackcdn.com/image/fetch/$s_!lkir!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c666a88-9934-480d-b1ea-a0532c06e651_682x750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lkir!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c666a88-9934-480d-b1ea-a0532c06e651_682x750.png" width="414" height="455.27859237536654" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c666a88-9934-480d-b1ea-a0532c06e651_682x750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:750,&quot;width&quot;:682,&quot;resizeWidth&quot;:414,&quot;bytes&quot;:698274,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lkir!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c666a88-9934-480d-b1ea-a0532c06e651_682x750.png 424w, https://substackcdn.com/image/fetch/$s_!lkir!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c666a88-9934-480d-b1ea-a0532c06e651_682x750.png 848w, 
https://substackcdn.com/image/fetch/$s_!lkir!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c666a88-9934-480d-b1ea-a0532c06e651_682x750.png 1272w, https://substackcdn.com/image/fetch/$s_!lkir!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c666a88-9934-480d-b1ea-a0532c06e651_682x750.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <em><a 
href="https://www.theverge.com/2024/2/21/24079371/google-ai-gemini-generative-inaccurate-historical">Google apologizes for &#8216;missing the mark&#8217; after Gemini generated racially diverse Nazis</a></em></figcaption></figure></div><h3>Who is regulating and what?</h3><p>It is an accepted consensus among developers, researchers, and consumers of these models that guardrails are needed to regulate bias in these models. But what should those regulations look like, and who gets to decide? The big names in the space, OpenAI, Meta, Google, and Anthropic, all have one recommendation when it comes to developing AI systems responsibly: engaging a diverse set of users throughout the development of these systems.</p><blockquote><p><em>&#8220;Through regular collaboration with subject matter experts, policy stakeholders and people with lived experiences, we&#8217;re continuously building and testing approaches to help ensure our machine learning (ML) systems are designed and used responsibly.</em></p><p><em>- <a href="https://ai.meta.com/responsible-ai/">Driven by our belief that AI should benefit everyone</a> by Meta</em></p></blockquote><blockquote><p><em>&#8220;We believe that many decisions about our defaults and hard bounds should be made collectively, and while practical implementation is a challenge, we aim to include as many perspectives as possible.</em></p><p>- <em><a href="https://openai.com/index/how-should-ai-systems-behave/">How should AI systems behave, and who should decide?</a></em> by OpenAI</p></blockquote><blockquote><p><em>&#8220;Engage with a diverse set of users and use-case scenarios, and incorporate feedback before and throughout project development. 
This will build a rich variety of user perspectives into the project and increase the number of people who benefit from the technology.</em></p><p><em>- <a href="https://ai.google/responsibility/responsible-ai-practices/">Responsible AI practices</a> by Google</em></p></blockquote><p>Anthropic and the <a href="https://cip.org/">Collective Intelligence Project</a> recently conducted a public input process involving approximately <em><strong>1,000 Americans to formulate a constitution for an AI system.</strong></em> Although this is a great initiative, it falls short of representing the global user base of systems like Claude, as Anthropic itself acknowledges. Just as the American constitution may not be suitable for the rest of the world, an AI constitution derived from the input of 1,000 Americans is unlikely to be universally applicable. In fact, it may not even be effective for a country as diverse as the United States.</p><blockquote><p><em>&#8220;The United States leads China, the EU, and the U.K. as the leading source of top AI models. 
In 2023, 61 notable AI models originated from U.S.-based institutions, far outpacing the European Union&#8217;s 21 and China&#8217;s 15.</em></p><p>- <em><a href="https://aiindex.stanford.edu/wp-content/uploads/2024/04/HAI_2024_AI-Index-Report.pdf">Artificial Intelligence Index Report 2024 </a>by HAI Stanford</em></p></blockquote><p>With just a few key players, mostly based in developed regions like the US, China, and Europe, it is no surprise that a narrow outlook shapes how these models are expected to behave, leaving out perspectives and considerations from other parts of the world.</p><h3>Global governance for global systems</h3><p>A <em><strong>survey</strong></em>* by Stanford researchers and Accenture found that 88% of over 1,000 global organizations believe the entities developing foundation models should be responsible for mitigating all associated risks. However, in addition to this expectation that AI creators be accountable for addressing biases in their models, 86% of respondents also recognized a need for global governance frameworks to address the broader risks involved with generative AI.</p><p><em><strong>Note: </strong>the full <strong>survey*</strong></em> <em>is not yet available, but the <a href="https://aiindex.stanford.edu/wp-content/uploads/2024/04/HAI_2024_AI-Index-Report.pdf">Artificial Intelligence Index Report 2024</a> by HAI Stanford reports some of its findings.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E3SW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222a898c-0766-4fcd-ad62-1019c83f9ddc_1726x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!E3SW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222a898c-0766-4fcd-ad62-1019c83f9ddc_1726x978.png 424w, https://substackcdn.com/image/fetch/$s_!E3SW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222a898c-0766-4fcd-ad62-1019c83f9ddc_1726x978.png 848w, https://substackcdn.com/image/fetch/$s_!E3SW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222a898c-0766-4fcd-ad62-1019c83f9ddc_1726x978.png 1272w, https://substackcdn.com/image/fetch/$s_!E3SW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222a898c-0766-4fcd-ad62-1019c83f9ddc_1726x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E3SW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222a898c-0766-4fcd-ad62-1019c83f9ddc_1726x978.png" width="1456" height="825" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/222a898c-0766-4fcd-ad62-1019c83f9ddc_1726x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:825,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:155300,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!E3SW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222a898c-0766-4fcd-ad62-1019c83f9ddc_1726x978.png 424w, https://substackcdn.com/image/fetch/$s_!E3SW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222a898c-0766-4fcd-ad62-1019c83f9ddc_1726x978.png 848w, https://substackcdn.com/image/fetch/$s_!E3SW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222a898c-0766-4fcd-ad62-1019c83f9ddc_1726x978.png 1272w, https://substackcdn.com/image/fetch/$s_!E3SW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222a898c-0766-4fcd-ad62-1019c83f9ddc_1726x978.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <em><a href="https://aiindex.stanford.edu/wp-content/uploads/2024/04/HAI_2024_AI-Index-Report.pdf">Artificial Intelligence Index Report 2024</a> by HAI Stanford</em></figcaption></figure></div><p>Given that most of the corporations leading AI advancements are inclined toward gated access to their models, it is natural that global consumers and organizations want greater accountability from model developers and stricter governance frameworks to mitigate the associated risks. However, relying too heavily on the makers of AI to regulate the capabilities of these models would concentrate power in the hands of the few monopolistic tech giants that have the resources to develop such sophisticated AI models.</p><blockquote><p>&#8220;<em>Until 2014, most significant machine learning models were released by academia. Since then, industry has taken over. In 2022, there were 32 significant industry-produced machine learning models compared to just three produced by academia.</em></p><p>- <em><a href="https://aiindex.stanford.edu/wp-content/uploads/2024/04/HAI_2024_AI-Index-Report.pdf">Artificial Intelligence Index Report 2024 </a>by HAI Stanford</em></p></blockquote><p>Today, the majority of AI development is led by industry, and most of the resulting models are proprietary. Much of what goes into developing these models, from the data to the pretraining and post-training practices, is not known to the public. 
While all these companies have stated their intention of building responsible AI, their priorities do not seem to align with this intention, and newer iterations of models still display biases and harmful content similar to their predecessors&#8217;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q9t0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F934f5bd4-3431-499c-b73e-37226efb8cbc_1610x1138.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q9t0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F934f5bd4-3431-499c-b73e-37226efb8cbc_1610x1138.png 424w, https://substackcdn.com/image/fetch/$s_!q9t0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F934f5bd4-3431-499c-b73e-37226efb8cbc_1610x1138.png 848w, https://substackcdn.com/image/fetch/$s_!q9t0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F934f5bd4-3431-499c-b73e-37226efb8cbc_1610x1138.png 1272w, https://substackcdn.com/image/fetch/$s_!q9t0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F934f5bd4-3431-499c-b73e-37226efb8cbc_1610x1138.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q9t0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F934f5bd4-3431-499c-b73e-37226efb8cbc_1610x1138.png" width="1456" height="1029" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/934f5bd4-3431-499c-b73e-37226efb8cbc_1610x1138.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1029,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1248038,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!q9t0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F934f5bd4-3431-499c-b73e-37226efb8cbc_1610x1138.png 424w, https://substackcdn.com/image/fetch/$s_!q9t0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F934f5bd4-3431-499c-b73e-37226efb8cbc_1610x1138.png 848w, https://substackcdn.com/image/fetch/$s_!q9t0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F934f5bd4-3431-499c-b73e-37226efb8cbc_1610x1138.png 1272w, https://substackcdn.com/image/fetch/$s_!q9t0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F934f5bd4-3431-499c-b73e-37226efb8cbc_1610x1138.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <a href="https://arxiv.org/pdf/2307.15043v2">Universal and Transferable Adversarial Attacks on Aligned Language Models</a></figcaption></figure></div><blockquote><p><em>&#8220;Safety and security teams are being <a href="https://www.theinformation.com/articles/musks-x-cuts-half-of-election-integrity-team-after-promising-to-expand-it">downsized</a> or <a href="https://www.theregister.com/2023/02/13/in_brief_ai/">sidelined</a> to bring AI products to market.</em></p><p>- <em><a href="https://www.justsecurity.org/89432/the-tragedy-of-ai-governance/">The Tragedy of AI Governance</a> by <a href="https://www.justsecurity.org/author/chestermansimon/">Simon Chesterman</a></em></p></blockquote><p>In the fierce race to release the next best model, ethical considerations in AI development often take a back seat. 
Safety and security teams are often viewed as <a href="https://www.washingtonpost.com/technology/2023/03/30/tech-companies-cut-ai-ethics/">cost centers</a>, or even compliance burdens. Hence, it is important to have multiple stakeholders making decisions on the values and principles that should guide how AI systems are built and how they behave.</p><h4>AI regulations around the world</h4><blockquote><p><em>&#8220;Countries and governments tend to align national priorities with their approaches to AI regulation. As a result, the flavor of ethical AI policies issued by current leaders in AI regulation - namely, the U.S., EU, and China - are very different.</em></p><p><em>- <a href="https://igarape.org.br/en/responsible-and-safe-ai-a-primer-for-policymakers-in-the-global-south/">Responsible and Safe AI: A Primer for Policymakers in the Global South</a> by Igarap&#233; Institute</em></p></blockquote><p>In recent years, we have seen a growth in the number of AI regulations and recommendations from nations, economic blocs, experts, and non-profit organizations. However, these policies and guidelines are shaped by the respective priorities and value systems of each nation or region, resulting in distinct approaches to AI regulation. This, I believe, works well, as localized AI policy can provide a more nuanced approach, with regulations that fit each country&#8217;s culture, values, and economic goals. One concern, however, is that localized AI policy can lead to fragmentation in the regulatory landscape. 
Additionally, <em><a href="https://igarape.org.br/wp-content/uploads/2024/01/Global-Futures-Bulletin-Responsible-Artificial-Intelligence-Efforts-In-the-Global-South.pdf">"developing world faces several disadvantages that make it more difficult for countries to formulate and enforce responsible AI policies."</a></em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4e1P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8b6c5ef-b765-4113-983e-376b657a03ac_1248x1630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4e1P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8b6c5ef-b765-4113-983e-376b657a03ac_1248x1630.png 424w, https://substackcdn.com/image/fetch/$s_!4e1P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8b6c5ef-b765-4113-983e-376b657a03ac_1248x1630.png 848w, https://substackcdn.com/image/fetch/$s_!4e1P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8b6c5ef-b765-4113-983e-376b657a03ac_1248x1630.png 1272w, https://substackcdn.com/image/fetch/$s_!4e1P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8b6c5ef-b765-4113-983e-376b657a03ac_1248x1630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4e1P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8b6c5ef-b765-4113-983e-376b657a03ac_1248x1630.png" width="510" height="666.1057692307693" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b8b6c5ef-b765-4113-983e-376b657a03ac_1248x1630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1630,&quot;width&quot;:1248,&quot;resizeWidth&quot;:510,&quot;bytes&quot;:625379,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4e1P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8b6c5ef-b765-4113-983e-376b657a03ac_1248x1630.png 424w, https://substackcdn.com/image/fetch/$s_!4e1P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8b6c5ef-b765-4113-983e-376b657a03ac_1248x1630.png 848w, https://substackcdn.com/image/fetch/$s_!4e1P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8b6c5ef-b765-4113-983e-376b657a03ac_1248x1630.png 1272w, https://substackcdn.com/image/fetch/$s_!4e1P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8b6c5ef-b765-4113-983e-376b657a03ac_1248x1630.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <em><a href="https://www.mindfoundry.ai/blog/ai-regulations-around-the-world">AI Regulations around the World</a></em></figcaption></figure></div><p>Efforts to develop policies around AI regulation are still in their early stages, and what global AI governance will look like remains an open question. While the US leads AI model development, significant executive action was taken only last October, with the <a href="https://www.whitehouse.gov/briefing-room/statements-releases/2023/10/30/fact-sheet-president-biden-issues-executive-order-on-safe-secure-and-trustworthy-artificial-intelligence/">Biden administration signing an executive order on AI</a>. In fact, <a href="https://carnegieendowment.org/research/2024/02/tracing-the-roots-of-chinas-ai-regulations?lang=en">China was the first country</a> to issue enforceable interim regulations governing applications of AI. 
Meanwhile, with the approval of the <a href="https://artificialintelligenceact.eu/the-act/">European Union AI Act</a>, the EU has taken the lead in establishing a comprehensive regulatory framework for AI, one that categorizes applications by potential risk and applies regulations accordingly. The act was first proposed in April 2021 and approved in March 2024. These are all remarkable feats in their own right, but the major AI policy efforts are driven by developed nations, while developing nations, especially those in the global south, are and will remain behind, as they lack the necessary resources, expertise, and structures to develop localized AI regulations. This disparity in AI governance capabilities could widen the digital divide and hinder the equitable distribution of the benefits of AI.</p><h4>AI policy guidelines must be adaptable</h4><blockquote><p><em>&#8220;the dominant policy prescriptions to encourage explainable AI are often rooted in Western perspectives.</em></p><p>- <em><a href="https://carnegieendowment.org/research/2024/04/a-global-south-perspective-on-explainable-ai?lang=en">A Global South Perspective on Explainable AI</a> by Jake Okechukwu Effoduh</em></p></blockquote><p>Personally, I am in favor of localized AI policy, but not all nations and regions can afford to build an AI policy from scratch, so we need international frameworks that establish core principles while allowing for adaptation to the specific needs of each country. As with most of AI, even the policies and guidelines concerning AI safety, fairness, and explainability are based on narratives from the global north. Hence, the international frameworks must be a multi-stakeholder effort involving leaders, researchers, and representatives from diverse nations and communities around the globe. 
These frameworks should be grounded in research and an international understanding of what constitutes human well-being, fundamental rights, and equitable treatment across all races, genders, ages, abilities, and socioeconomic backgrounds.</p><p>Establishing universal standards and regulations for ethical AI governance is challenging given the disparities in technological access, cultural nuances, and economic conditions around the globe. However, it is important that we have international standards to make sure that AI systems are deployed responsibly even in nations that are not wealthy enough to devise their own policies and regulations surrounding AI.</p><h4>Ethical alignment and benchmarking of models should begin early</h4><p>With any innovation, the laws and policies governing it tend to follow only after a few concerning incidents have occurred and highlighted the need for regulation. As we develop AI systems at a speed and scale never seen before, we need a more proactive approach to tackling concerns in AI. Ensuring AI systems align with global values and goals requires collaboration from diverse stakeholders and careful ethical considerations, including (but not limited to) protecting users' privacy, mitigating biases, and ensuring safety and transparency throughout the lifecycle of the AI systems. 
Along with a clear alignment goal, proactive vetting of data and algorithms and active red-teaming with experts from diverse domains and backgrounds must be in place to ensure that AI adheres to the global ethical alignment.</p><blockquote><p><em>&#8220;development of mature AI safety benchmarks that are both effective and trusted is not possible without the involvement of the community.</em></p><p>- <a href="https://research.google/blog/supporting-benchmarks-for-ai-safety-with-mlcommons/">Supporting benchmarks for AI safety with MLCommons</a>, by Google</p></blockquote><p>To fully leverage the benefits of AI systems, we need effective and comprehensive AI safety benchmarks to rigorously test these systems before they are released for general access. Existing benchmarks (e.g., <a href="https://crfm.stanford.edu/helm/classic/latest/">HELM</a>, <a href="https://github.com/google/BIG-bench">BIG-bench</a>) cover only small subsets of fairness and safety issues, are mostly limited to English, and are not comprehensive. Developing comprehensive, inclusive benchmarks requires active and collaborative efforts from researchers, experts, developers, and communities around the globe. As AI systems evolve rapidly in a "build-fast-break-fast" cycle, continuous testing and multiple iterations of the model are essential.</p><h3>Conclusion</h3><p>AI models not only exhibit human-like biases; studies have shown that they in fact amplify the biases that exist in our society. The data used to train these models represents the majority viewpoint, and combined with developers&#8217; implicit or explicit biases in selecting the data and training processes, the models continue to over-represent dominant narratives. The proprietary nature of AI models makes understanding the sources of these biases impossible and often leaves most of the world behind, especially the already marginalized communities. 
In addition, the huge cost of training these models means the ability to train large AI systems is concentrated among a few players, making it challenging to ethically align these models to the values and goals of the larger population. Collaboration among diverse stakeholders from around the globe is important in setting clear alignment goals before AI development begins. This collective approach helps ensure that AI is aligned with people and their well-being, and not limited to profits and earnings for the corporations developing it.</p><p>Furthermore, existing AI regulations are mainly developed from the perspective of the global north, and developing nations, especially those in the global south, don&#8217;t have the resources or expertise to invest in research and development of AI policies. We need international AI regulation frameworks that can be adapted to the needs of developing nations with little effort. The responsibility of making AI responsible lies with all of us: industry leaders, researchers, experts, policymakers, and the users of these systems. A collective effort is needed to develop AI safety benchmarks and hold the developers of these models accountable. 
In addition, it is critical to establish universal standards for ethical AI that can be adapted to local communities to ensure the transparent and safe development of these models.</p><div><hr></div><h3>Acknowledgements</h3><p><a href="https://www.linkedin.com/in/gwendolyngillingham/">Gwendolyn Gillingham</a>: Thank you for your feedback and help with editing this blog.<br><a href="https://www.linkedin.com/feed/#">Rishabh Bhardwaj</a> and <a href="https://www.linkedin.com/feed/#">Kshitiz Karki</a>: Thank you for pointing me to the reading materials.</p>]]></content:encoded></item><item><title><![CDATA[The LLaMA Family of Models, Model Architecture, Size, and Scaling Laws ]]></title><description><![CDATA[Part 2: A look into Meta's LLaMA family of models before we deep-dive into each component from a multilingual lens]]></description><link>https://www.icodeformybhasa.com/p/the-llama-family-of-models-model</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/the-llama-family-of-models-model</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Sun, 05 May 2024 21:19:08 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!Ioe6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79826e3-d938-417a-8e68-b7a3ed3e7bff_1280x1280.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Since February 2023, Meta has open-sourced three versions of its <a href="https://llama.meta.com/">LLaMA language model</a>. This has enabled thousands of people in the AI and NLP communities to explore and build upon the LLaMA models for their use cases.</p><p>On April 18, 2024, Meta open-sourced <a href="https://ai.meta.com/blog/meta-llama-3/">LLaMA 3</a>, which it claims is "<strong>the most capable openly available large language model to date,"</strong> backed by its performance across multiple benchmarks. In my <a href="https://www.icodeformybhasa.com/p/exploring-multilingual-aspects-and">previous post</a>, we briefly talked about and compared LLaMA 2 and LLaMA 3. In that post, we also touched upon the focus on multilingual ability in the latest release. 
Additionally, we discussed the expanded vocab and improved tokenizer in LLaMA 3.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ioe6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79826e3-d938-417a-8e68-b7a3ed3e7bff_1280x1280.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ioe6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79826e3-d938-417a-8e68-b7a3ed3e7bff_1280x1280.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Ioe6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79826e3-d938-417a-8e68-b7a3ed3e7bff_1280x1280.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Ioe6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79826e3-d938-417a-8e68-b7a3ed3e7bff_1280x1280.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Ioe6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79826e3-d938-417a-8e68-b7a3ed3e7bff_1280x1280.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ioe6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79826e3-d938-417a-8e68-b7a3ed3e7bff_1280x1280.jpeg" width="530" height="530" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f79826e3-d938-417a-8e68-b7a3ed3e7bff_1280x1280.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1280,&quot;width&quot;:1280,&quot;resizeWidth&quot;:530,&quot;bytes&quot;:283841,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ioe6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79826e3-d938-417a-8e68-b7a3ed3e7bff_1280x1280.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Ioe6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79826e3-d938-417a-8e68-b7a3ed3e7bff_1280x1280.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Ioe6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79826e3-d938-417a-8e68-b7a3ed3e7bff_1280x1280.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Ioe6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79826e3-d938-417a-8e68-b7a3ed3e7bff_1280x1280.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image generated using Meta AI</figcaption></figure></div><p>In this post, we will discuss the architecture of the LLaMA family of models and focus on the modifications made on top of the original transformer model. In addition, we will also discuss how the second and third iterations of the model differ from LLaMA 1. <em>Note: since the LLaMA 3 paper is not out yet, the comparison will be limited to what was released as part of the model&#8217;s <a href="https://ai.meta.com/blog/meta-llama-3/">release note</a>.</em></p><p>The fine-tuning methodology and LLM safety measures employed by LLaMA 2 and LLaMA 3 are out of the scope of this blog post. </p><h3>Large Language Models</h3><p>Large language models (LLMs) are deep neural network models that are trained on large corpora of text to understand and generate natural language. 
Transformer, introduced in the 2017 paper <a href="https://arxiv.org/pdf/1706.03762">&#8220;Attention Is All You Need&#8221;</a>, is now the most widely adopted architecture for language modeling. The following resources give a comprehensive understanding of language modeling and transformer-based language models, and form an important foundation for the rest of the blog.</p><ul><li><p><a href="https://wandb.ai/madhana/Language-Models/reports/Language-Modeling-A-Beginner-s-Guide---VmlldzozMzk3NjI3">Language Modeling: A Beginner's Guide</a></p></li><li><p><a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a></p></li><li><p><a href="https://magazine.sebastianraschka.com/p/understanding-large-language-models">Understanding Large Language Models</a></p></li><li><p><a href="https://www.linkedin.com/pulse/transformer-architectures-dummies-part-2-decoder-only-qi6vc/">Transformer Architectures for Dummies - Part 2 (Decoder Only Architectures)</a></p></li></ul><p>Now, let's look into the LLaMA family of models.</p><h3>Model Architecture</h3><p>The LLaMA family of models are <strong>auto-regressive decoder-only models</strong>. These models are based on the transformer architecture with some modifications. We will start by looking into the LLaMA 1 architecture and discuss how it differs from the transformer model. Then, we will build up from there to the LLaMA 2 and LLaMA 3 models, comparing how the newer iterations improve on the previous ones. </p><h4>LLaMA 1 </h4><p>LLaMA 1 works by predicting the next token given a sequence of input tokens, similar to any other transformer-based decoder-only model. 
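</p>
<p>To make that next-token loop concrete, here is a minimal sketch in plain Python with NumPy. The <code>next_token_logits</code> function is a toy stand-in of my own; in a real model, a full transformer forward pass would produce the logits:</p>

```python
import numpy as np

vocab = ["<s>", "the", "model", "predicts", "tokens", "."]

def next_token_logits(token_ids):
    # Toy stand-in for a decoder-only LM: returns one score per vocab entry.
    # A real LLaMA forward pass over `token_ids` would go here.
    rng = np.random.default_rng(sum(token_ids))  # deterministic toy scores
    return rng.normal(size=len(vocab))

def generate(prompt_ids, max_new_tokens=4):
    # Auto-regressive decoding: feed the growing sequence back in,
    # append the highest-scoring (greedy) token each step.
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        ids.append(int(np.argmax(logits)))
    return ids

out = generate([0, 1])
print([vocab[i] for i in out])
```

<p>Real systems replace the greedy <code>argmax</code> with sampling strategies (temperature, top-p), but the loop structure is the same.</p>
<p>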
However, LLaMA 1 incorporates several architectural modifications that set it apart, including pre-normalization of the input with <a href="https://arxiv.org/pdf/1910.07467">RMSNorm</a>, the use of the SwiGLU activation function, and <a href="https://arxiv.org/pdf/2104.09864">rotary positional embedding (RoPE)</a>. </p><h4>Pre-normalization of Input</h4><p>LLaMA 1 and the subsequent iterations of the model use pre-normalization to enhance model training stability and performance. Normalization is applied to the input of each sub-layer, unlike the original transformer model where the normalization is applied after each sub-layer. </p><p>Pre-normalization in deep models like LLaMA is motivated by its ability to facilitate more efficient gradient flow during the backpropagation process by allowing error gradients to flow directly from the top layers to the bottom layers without passing through the normalization operations. The residual connection acts as a bypass, mitigating the risk of vanishing or exploding gradients in deep networks.</p><p>In <a href="https://arxiv.org/pdf/1906.01787">Learning Deep Transformer Models for Machine Translation</a>, the authors find that pre-normalization allows more efficient training of deeper Transformer models.</p><blockquote><p><em>&#8220;More specifically, we find that prenorm is more efficient for training than post-norm if the model goes deeper.</em></p><p>- <em><a href="https://arxiv.org/pdf/1906.01787">Learning Deep Transformer Models for Machine Translation</a></em></p></blockquote><p>LLaMA also uses <a href="https://arxiv.org/pdf/1910.07467">RMSNorm (Root Mean Square Normalization)</a> as the normalization function instead of <a href="https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html">LayerNorm</a>. In deep neural networks, changing parameters in one layer can cause shifts in the input distributions for subsequent layers. 
These internal distribution shifts are known as <a href="https://arxiv.org/pdf/1502.03167">internal covariate shift</a> and can make it more challenging for models, especially those with a large number of layers, to learn and converge. LayerNorm stabilizes the training of deep neural networks by addressing internal covariate shift but introduces additional overhead, which becomes substantial for deeper networks. LayerNorm has two key properties:</p><ul><li><p> <strong>re-centering invariance,</strong> which makes the model insensitive to shift noises in inputs and weights, and</p></li><li><p><strong>re-scaling invariance,</strong> which preserves output representations when inputs and weights are randomly scaled.</p></li></ul><p>The authors of <a href="https://arxiv.org/pdf/1910.07467">Root Mean Square Layer Normalization</a> show that <strong>re-centering</strong> <strong>has little impact on stabilizing model training and re-scaling alone gives similar or more effective results</strong> when training deeper neural networks.</p><blockquote><p><em>&#8220;Although RMSNorm does not re-center the summed inputs as in LayerNorm, we demonstrate through experiments that this property is not fundamental to the success of LayerNorm, and that RMSNorm is similarly or more effective.</em></p><p><em>- <a href="https://arxiv.org/pdf/1910.07467">Root Mean Square Layer Normalization</a></em> </p></blockquote><p>Additional reading materials on normalization in deep neural networks:</p><ul><li><p><a href="https://arxiv.org/pdf/1502.03167">Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</a></p></li><li><p><a href="https://arxiv.org/pdf/1607.06450">Layer Normalization</a></p></li><li><p><a href="https://aclanthology.org/2019.iwslt-1.17.pdf">Transformers without Tears: Improving the Normalization of Self-Attention</a></p></li><li><p><a 
href="https://www.linkedin.com/pulse/demystifying-normalization-deep-learning-easy-code-shrishrimal/">Demystifying Normalization in Deep Learning - Easy Visualizations and PyTorch Code</a> </p></li></ul><h4>SwiGLU Activation</h4><p>Inspired by <a href="https://arxiv.org/pdf/2204.02311">PaLM: Scaling Language Modeling with Pathways</a>, LLaMA uses the SwiGLU (Swish-Gated Linear Unit) activation. </p><blockquote><p><em>&#8220;SwiGLU Activation &#8211; We use SwiGLU activations (Swish(xW )&#183; xV ) for the MLP intermediate activations because they have been shown to significantly increase quality compared to standard ReLU, GeLU, or Swish activations (Shazeer, 2020) - Section 2</em></p><p>- <a href="https://arxiv.org/pdf/2204.02311">PaLM: Scaling Language Modeling with Pathways</a></p></blockquote><p>SwiGLU combines the Swish and GLU (Gated Linear Unit) activations. GLU is a component-wise product of two linear transformations of the input, where one is sigmoid-gated:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;GLU(x) = (Wx+b) \\otimes \\sigma(Vx+c)\n\n\n&quot;,&quot;id&quot;:&quot;QZPUWIDUIV&quot;}" data-component-name="LatexBlockToDOM"></div><p>The Swish activation function, introduced in the paper <a href="https://arxiv.org/pdf/1710.05941v1">SWISH: A SELF-GATED ACTIVATION FUNCTION</a>, is a smooth, non-monotonic activation that has been shown to outperform or match the widely-used ReLU activation across a variety of deep learning models. 
It is defined by:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Swish(x) = x \\cdot \\sigma(\\beta x)&quot;,&quot;id&quot;:&quot;BDVBVFIGKQ&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5RcO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ced8d4-38de-431c-9c32-fb9775809722_1024x462.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5RcO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ced8d4-38de-431c-9c32-fb9775809722_1024x462.png 424w, https://substackcdn.com/image/fetch/$s_!5RcO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ced8d4-38de-431c-9c32-fb9775809722_1024x462.png 848w, https://substackcdn.com/image/fetch/$s_!5RcO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ced8d4-38de-431c-9c32-fb9775809722_1024x462.png 1272w, https://substackcdn.com/image/fetch/$s_!5RcO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ced8d4-38de-431c-9c32-fb9775809722_1024x462.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5RcO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ced8d4-38de-431c-9c32-fb9775809722_1024x462.png" width="488" height="220.171875" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/10ced8d4-38de-431c-9c32-fb9775809722_1024x462.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:462,&quot;width&quot;:1024,&quot;resizeWidth&quot;:488,&quot;bytes&quot;:44009,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5RcO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ced8d4-38de-431c-9c32-fb9775809722_1024x462.png 424w, https://substackcdn.com/image/fetch/$s_!5RcO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ced8d4-38de-431c-9c32-fb9775809722_1024x462.png 848w, https://substackcdn.com/image/fetch/$s_!5RcO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ced8d4-38de-431c-9c32-fb9775809722_1024x462.png 1272w, https://substackcdn.com/image/fetch/$s_!5RcO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ced8d4-38de-431c-9c32-fb9775809722_1024x462.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Swish Activation Function; Image from <a href="https://arxiv.org/pdf/1710.05941v1">SWISH: A SELF-GATED ACTIVATION FUNCTION</a></figcaption></figure></div><p>SwiGLU introduced in <a href="https://arxiv.org/pdf/2002.05202">GLU Variants Improve Transformer</a> is defined by:</p><div class="latex-rendered" 
data-attrs="{&quot;persistentExpression&quot;:&quot;SwiGLU(x) = x\\sigma(\\beta x) + (1-\\sigma(\\beta*x))*(Wx +b)&quot;,&quot;id&quot;:&quot;WJFIDLHCEH&quot;}" data-component-name="LatexBlockToDOM"></div><p><em>where, W, b, and &#946; is a trainable parameter.</em></p><p>In the paper,  <a href="https://arxiv.org/pdf/2002.05202">GLU Variants Improve Transformer</a> the author experimented with different variations of GLU and showed that variants of GLU (including SwiGLU) shows improvements in quality compared to ReLU or GELU activations. </p><blockquote><p><em>&#8220;We test these variants in the feed-forward sublayers of the Transformer [Vaswani et al., 2017] sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations. - Abstract</em></p><p><em>&#8220;We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence. - Conclusion</em></p><p>- <a href="https://arxiv.org/pdf/2002.05202">GLU Variants Improve Transformer</a> </p></blockquote><p>Evaluating the performance impact of one activation function over another is a challenging task. It is difficult to study activation functions in isolation and pinpoint the exact reasons why one function outperforms another. Several variables can influence the training process, making it hard to draw definitive conclusions. However, certain properties of the SwiGLU activation function may offer insights into its potential advantages. SwiGLU is a smoother function in comparison which could allow better optimization and convergence, while its non-monotonic nature enables capturing complex non-linear relationships. 
Additionally, SwiGLU uses a gating mechanism that selectively activates neurons based on the received input, reducing overfitting and improving generalization.</p><blockquote><p><em>&#8220;Our experiments show that Swish consistently outperforms or matches the ReLU function on a variety of deep models. While it is difficult to prove why one activation function outperforms another because of the many confounding factors that affect training, we believe that the properties of Swish being unbounded above, bounded below, non-monotonic, and smooth are all advantageous. - Section 2.1</em></p><p>- <a href="https://arxiv.org/pdf/1710.05941v1">SWISH: A SELF-GATED ACTIVATION FUNCTION</a></p></blockquote><h4><strong>Rotary Positional Embeddings (RoPE)</strong></h4><p>Positional embeddings help transformers distinguish between occurrences of the same word at different positions in a sequence. The LLaMA family of models uses RoPE instead of absolute positional embeddings. Absolute positional embeddings add position information by adding a position vector to the token embedding, and they do not take into account how one position in the sequence relates to another. </p><p>In contrast, Rotary Positional Embedding (RoPE), introduced in <a href="https://arxiv.org/pdf/2104.09864">ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING</a>, applies rotations to word vectors based on where they occur in the sequence. Instead of directly adding positional embeddings to word vectors, RoPE rotates the word vectors by an angle proportional to their position. Specifically, if &#952; is the base angle of rotation, a word occurring at position m is rotated by m &#215; &#952;. This rotation preserves the advantage of absolute positional embeddings by maintaining a unique representation for words at each position. 
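</p>
<p>The relative-position property is easy to verify numerically. Below is a minimal sketch for a single 2-D feature pair (real RoPE applies such a rotation, with different base angles, to every pair of dimensions in the query and key vectors); the vectors and the base angle are illustrative:</p>

```python
import numpy as np

def rotate(vec2, pos, theta=0.5):
    # Rotate a 2-D feature pair by pos * theta radians (RoPE for one pair).
    angle = pos * theta
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]]) @ vec2

q = np.array([1.0, 0.3])
k = np.array([0.5, -0.2])

# The attention score between positions m and n depends only on (m - n):
score_a = rotate(q, 5) @ rotate(k, 3)    # distance 2
score_b = rotate(q, 10) @ rotate(k, 8)   # distance 2 again
print(np.isclose(score_a, score_b))      # True
```

<p>Because rotations are orthogonal, R(m)q &#183; R(n)k = q &#183; R(n&#8722;m)k, so shifting both words by the same offset leaves their dot-product score unchanged.</p>
<p>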
Additionally, RoPE gains the benefit of relative positional embeddings, as two word vectors are rotated relative to each other by the same amount as long as the distance between the two words remains constant.</p><p>Since RoPE maintains a contextual relationship between two tokens, it is able to capture long-range dependencies, enabling improved performance and faster convergence in tasks involving long texts/documents.</p><blockquote><p><em>&#8220;Notably, RoPE enables valuable properties, including the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding. - Abstract</em></p><p>- <a href="https://arxiv.org/pdf/2104.09864">ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING</a></p></blockquote><h4>LLaMA 2 and LLaMA 3:</h4><p>The LLaMA family of models share the majority of their architecture. Here is how the three iterations of the model differ from one another:</p><ul><li><p>The context length of LLaMA 1 is 2K tokens; LLaMA 2 extends it to 4K, and the latest LLaMA 3 has an 8K context.</p></li><li><p>LLaMA 2 and LLaMA 3 adopt Grouped Query Attention (GQA). LLaMA 2 only uses GQA in its larger parameter models, while LLaMA 3 uses it in all versions.</p></li></ul><h4>Grouped Query Attention (GQA) in LLaMA 2 and LLaMA 3</h4><p>Traditional Multi-Head Attention (MHA) used by the Transformer has a high memory overhead, as all attention keys and values need to be loaded during each decoding step. <a href="https://arxiv.org/pdf/2305.13245">Grouped Query Attention</a> addresses that overhead by grouping query heads into G groups such that each group shares a key head and a value head. 
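</p>
<p>The grouping can be sketched in NumPy by storing only a few key/value heads and repeating each one across its group of query heads. The head counts and dimensions below are illustrative toy values, not LLaMA's actual configuration:</p>

```python
import numpy as np

n_q_heads, n_kv_heads, seq, d = 8, 2, 4, 16
group = n_q_heads // n_kv_heads             # 4 query heads per KV head

rng = np.random.default_rng(0)
Q = rng.normal(size=(n_q_heads, seq, d))
K = rng.normal(size=(n_kv_heads, seq, d))   # far fewer K/V heads to cache
V = rng.normal(size=(n_kv_heads, seq, d))

# Each KV head is shared by `group` consecutive query heads.
K_shared = np.repeat(K, group, axis=0)      # (n_q_heads, seq, d)
V_shared = np.repeat(V, group, axis=0)

scores = Q @ K_shared.transpose(0, 2, 1) / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax
out = weights @ V_shared
print(out.shape)  # (8, 4, 16)
```

<p>Only the small K and V tensors need to be cached during decoding; the repeat is a view-style broadcast in optimized implementations, shrinking the KV cache by the group factor. 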
This reduces computational and memory overhead and allows faster inference time while maintaining the quality on par with the MHA models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!44h8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2971c-c17a-4d61-a2b3-481f0d5352c2_1622x648.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!44h8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2971c-c17a-4d61-a2b3-481f0d5352c2_1622x648.png 424w, https://substackcdn.com/image/fetch/$s_!44h8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2971c-c17a-4d61-a2b3-481f0d5352c2_1622x648.png 848w, https://substackcdn.com/image/fetch/$s_!44h8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2971c-c17a-4d61-a2b3-481f0d5352c2_1622x648.png 1272w, https://substackcdn.com/image/fetch/$s_!44h8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2971c-c17a-4d61-a2b3-481f0d5352c2_1622x648.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!44h8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2971c-c17a-4d61-a2b3-481f0d5352c2_1622x648.png" width="1456" height="582" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5cd2971c-c17a-4d61-a2b3-481f0d5352c2_1622x648.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149957,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!44h8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2971c-c17a-4d61-a2b3-481f0d5352c2_1622x648.png 424w, https://substackcdn.com/image/fetch/$s_!44h8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2971c-c17a-4d61-a2b3-481f0d5352c2_1622x648.png 848w, https://substackcdn.com/image/fetch/$s_!44h8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2971c-c17a-4d61-a2b3-481f0d5352c2_1622x648.png 1272w, https://substackcdn.com/image/fetch/$s_!44h8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2971c-c17a-4d61-a2b3-481f0d5352c2_1622x648.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://arxiv.org/pdf/2305.13245">GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints</a></figcaption></figure></div><blockquote><p>&#8220;Grouped-query attention divides query heads into G groups, each of which shares a single key head and value head.</p><p>-<a href="https://arxiv.org/pdf/2305.13245"> GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints</a></p></blockquote><h3>Efficient Pretraining</h3><p>The self-attention mechanism used in Transformers can become a bottleneck in terms of computation time and memory usage when training LLMs with long sequences. This is because the time and memory complexity of standard self-attention scales quadratically with the sequence length, making it increasingly expensive for longer sequences. 
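</p>
<p>A quick back-of-the-envelope calculation makes the quadratic growth tangible. The head count and 2-byte (fp16) precision below are illustrative assumptions, not figures from the LLaMA papers:</p>

```python
def attention_scores_bytes(seq_len, n_heads=32, bytes_per_el=2):
    # Memory for the full seq_len x seq_len attention-score matrices
    # of a single layer: grows with the square of the sequence length.
    return n_heads * seq_len * seq_len * bytes_per_el

for n in (2_048, 8_192, 32_768):
    print(f"{n:>6} tokens: {attention_scores_bytes(n) / 2**20:,.0f} MiB")
```

<p>Quadrupling the sequence length multiplies the score-matrix memory by sixteen, which is exactly the cost the optimizations below avoid by never materializing the full matrix.</p>
<p>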
To address this issue, LLaMA 1 and its variants employ two key optimizations: <a href="https://arxiv.org/abs/2112.05682">memory efficient attention</a> and <a href="https://arxiv.org/abs/2205.14135">FlashAttention</a>.</p><blockquote><p><em>&#8220;First, we use an efficient implementation of the causal multi-head attention to reduce memory usage and runtime. This implementation, available in the xformers library,2 is inspired by Rabe and Staats (2021) and uses the backward from Dao et al. (2022).</em> </p><p>-<a href="https://arxiv.org/pdf/2302.13971"> LLaMA 1</a></p></blockquote><p><strong>Memory efficient attention</strong>: In causal LLMs like LLaMA, the self-attention mechanism does not need to attend to future (mask) tokens, so using causal multi-head attention can reduce memory usage and runtime. LLaMA uses an efficient implementation of causal multi-head attention that avoids explicitly creating and storing large attention mask tensors. It leverages an <a href="https://facebookresearch.github.io/xformers/components/ops.html">AttentionBias</a> object to hardcode the mask pattern into the computation kernels, leading to substantial memory savings, particularly for long input sequences.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xHpD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa4cf46-528c-48ad-b723-4ff37212aaea_861x854.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xHpD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa4cf46-528c-48ad-b723-4ff37212aaea_861x854.png 424w, 
https://substackcdn.com/image/fetch/$s_!xHpD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa4cf46-528c-48ad-b723-4ff37212aaea_861x854.png 848w, https://substackcdn.com/image/fetch/$s_!xHpD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa4cf46-528c-48ad-b723-4ff37212aaea_861x854.png 1272w, https://substackcdn.com/image/fetch/$s_!xHpD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa4cf46-528c-48ad-b723-4ff37212aaea_861x854.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xHpD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa4cf46-528c-48ad-b723-4ff37212aaea_861x854.png" width="366" height="363.0243902439024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1aa4cf46-528c-48ad-b723-4ff37212aaea_861x854.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:854,&quot;width&quot;:861,&quot;resizeWidth&quot;:366,&quot;bytes&quot;:36106,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xHpD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa4cf46-528c-48ad-b723-4ff37212aaea_861x854.png 424w, 
https://substackcdn.com/image/fetch/$s_!xHpD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa4cf46-528c-48ad-b723-4ff37212aaea_861x854.png 848w, https://substackcdn.com/image/fetch/$s_!xHpD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa4cf46-528c-48ad-b723-4ff37212aaea_861x854.png 1272w, https://substackcdn.com/image/fetch/$s_!xHpD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa4cf46-528c-48ad-b723-4ff37212aaea_861x854.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Memory Efficient Causal Self-Attention</figcaption></figure></div><p><strong>FlashAttention: </strong>LLaMA training also uses the backward pass from the <a href="https://arxiv.org/abs/2205.14135">FlashAttention</a> algorithm, which optimizes the self-attention computation in Transformers. The optimization in FlashAttention is achieved by utilizing the memory hierarchy and intelligently mapping the computation to minimize memory transfers. It does so through two key techniques: <strong>tiling and recomputation.</strong> Tiling splits input sequences and attention matrices into smaller blocks/tiles, enabling computations to reside in fast on-chip memory while minimizing slow off-chip memory accesses. Recomputation avoids storing and transferring large intermediate attention matrices between memory levels by recomputing them on-the-fly during the backward pass using statistics stored from the forward pass.</p><blockquote><p><em>&#8220;To further improve training efficiency, we reduced the amount of activations that are recomputed during the backward pass with checkpointing. More precisely, we save the activations that are expensive to compute, such as the outputs of linear layers.</em> </p><p>-<a href="https://arxiv.org/pdf/2302.13971"> LLaMA 1</a></p></blockquote><p><strong>Activation Checkpointing: </strong>In addition, LLaMA also leverages activation checkpointing. The authors save the activations that are expensive to compute, such as the outputs of linear layers, instead of recomputing them during the backward pass.</p><h3>Pretraining with Public Datasets</h3><p>The LLaMA family of models were all trained using only publicly available datasets. The training dataset for all three variants is composed mostly of English data. Starting with LLaMA 2, the focus has shifted heavily towards using higher-quality and larger datasets to train the models. 
Data from sources known to contain personal information was removed from the training data, and data from factual sources was upsampled for model training. </p><p>LLaMA 3 was pretrained on over 15T tokens, again all from publicly available sources. With LLaMA 3, the focus has also shifted towards multilingual use cases, and to that end the training data for LLaMA 3 contains over 5% high-quality non-English data spanning more than 30 languages. It also contains four times more code than LLaMA 2.</p><p>Here are the details on the size of the pretraining dataset for each LLaMA variant so far:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!myGM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f0a3cf-878d-4131-be80-d8af1a0692ab_1220x372.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!myGM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f0a3cf-878d-4131-be80-d8af1a0692ab_1220x372.png 424w, https://substackcdn.com/image/fetch/$s_!myGM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f0a3cf-878d-4131-be80-d8af1a0692ab_1220x372.png 848w, https://substackcdn.com/image/fetch/$s_!myGM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f0a3cf-878d-4131-be80-d8af1a0692ab_1220x372.png 1272w, https://substackcdn.com/image/fetch/$s_!myGM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f0a3cf-878d-4131-be80-d8af1a0692ab_1220x372.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!myGM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f0a3cf-878d-4131-be80-d8af1a0692ab_1220x372.png" width="640" height="195.14754098360655" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8f0a3cf-878d-4131-be80-d8af1a0692ab_1220x372.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:372,&quot;width&quot;:1220,&quot;resizeWidth&quot;:640,&quot;bytes&quot;:80040,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!myGM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f0a3cf-878d-4131-be80-d8af1a0692ab_1220x372.png 424w, https://substackcdn.com/image/fetch/$s_!myGM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f0a3cf-878d-4131-be80-d8af1a0692ab_1220x372.png 848w, https://substackcdn.com/image/fetch/$s_!myGM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f0a3cf-878d-4131-be80-d8af1a0692ab_1220x372.png 1272w, https://substackcdn.com/image/fetch/$s_!myGM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f0a3cf-878d-4131-be80-d8af1a0692ab_1220x372.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Number of training tokens for LLaMA 1 and LLaMA 2 models; Table Source: <a 
href="https://arxiv.org/pdf/2307.09288">LLaMA 2 paper</a></figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Mhhg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fbcc304-fb15-4367-9fca-61e43e6a9426_1100x234.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Mhhg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fbcc304-fb15-4367-9fca-61e43e6a9426_1100x234.png 424w, https://substackcdn.com/image/fetch/$s_!Mhhg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fbcc304-fb15-4367-9fca-61e43e6a9426_1100x234.png 848w, https://substackcdn.com/image/fetch/$s_!Mhhg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fbcc304-fb15-4367-9fca-61e43e6a9426_1100x234.png 1272w, https://substackcdn.com/image/fetch/$s_!Mhhg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fbcc304-fb15-4367-9fca-61e43e6a9426_1100x234.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Mhhg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fbcc304-fb15-4367-9fca-61e43e6a9426_1100x234.png" width="626" height="133.16727272727272" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fbcc304-fb15-4367-9fca-61e43e6a9426_1100x234.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:234,&quot;width&quot;:1100,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:47614,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Mhhg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fbcc304-fb15-4367-9fca-61e43e6a9426_1100x234.png 424w, https://substackcdn.com/image/fetch/$s_!Mhhg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fbcc304-fb15-4367-9fca-61e43e6a9426_1100x234.png 848w, https://substackcdn.com/image/fetch/$s_!Mhhg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fbcc304-fb15-4367-9fca-61e43e6a9426_1100x234.png 1272w, https://substackcdn.com/image/fetch/$s_!Mhhg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fbcc304-fb15-4367-9fca-61e43e6a9426_1100x234.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Number of training tokens for LLaMA 3 models</figcaption></figure></div><h4>Scaling laws</h4><p>One thing that I was excited about in the LLaMA 1 paper and their subsequent works was their focus on optimizing the inference budget by training the smaller models on larger number of tokens. 
Smaller models with faster inference speed would mean people can use LLMs on-device, which would ensure:</p><ol><li><p>Improved privacy, as users&#8217; personal data remains on-device and is not sent to remote servers.</p></li><li><p>Access to LLMs in offline mode, especially useful in remote places with limited internet connectivity.</p></li><li><p>Better personalization, as users can fine-tune and personalize the models on their own data and use cases without sharing personal information with third-party servers.</p></li><li><p>Faster inference, as it would no longer be necessary to send data to and from remote servers.</p></li><li><p>Seamless integration into on-device user applications.</p></li></ol><p>Personally, I am GPU-poor, and so is the majority of the world; smaller models that are cheaper at inference are useful for individuals like me who have limited computational resources.</p><blockquote><p><em>&#8220;The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used.&#8221;</em> - Introduction</p><p>-<a href="https://arxiv.org/pdf/2302.13971"> LLaMA 1</a></p></blockquote><p>The smallest LLaMA 1 7B was trained on 1T tokens, LLaMA 2 7B on 2T tokens, and the latest LLaMA 3 8B on 15T tokens, which is <strong>75 times Chinchilla&#8217;s compute-optimal threshold of 200B tokens recommended for a 10B model. 
</strong>The latest results for LLaMA 3 across several benchmarks demonstrate that continuing to scale up the amount of training data and training for longer can yield significant performance gains, even for relatively small model sizes, when optimizing for the inference budget rather than just the training budget.</p><p>You can read more on scaling laws here:</p><ul><li><p><a href="https://arxiv.org/pdf/2001.08361">Scaling Laws for Neural Language Models</a></p></li><li><p><a href="https://arxiv.org/pdf/2203.15556">Training Compute-Optimal Large Language Models</a></p></li><li><p><a href="https://www.harmdevries.com/post/model-size-vs-compute-overhead/">Go smol or go home</a></p></li><li><p><a href="https://espadrine.github.io/blog/posts/chinchilla-s-death.html">Chinchilla&#8217;s Death</a></p></li></ul><h3>Stay Tuned!</h3><p>The LLaMA work's consistent emphasis on a cheaper inference budget is promising, especially for enabling wider access to large language models across low-resource languages and use cases - a domain I'm interested in. I am looking forward to doing more work with LLaMA and comparing it against other small, open-source models. Next in this series, we will talk about each component of the LLaMA models (where it makes sense) through a multilingual lens. When we get to fine-tuning LLaMA 3 ourselves, we will dive deeper into the fine-tuning methodology and LLM safety measures employed by the LLaMA models. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.icodeformybhasa.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading #icodeformy&#2349;&#2366;&#2359;&#2366;! 
Subscribe for free to receive new posts and support my work.</p></div></div></div>]]></content:encoded></item><item><title><![CDATA[Exploring multilingual aspects and vocabulary of LLaMA 3 compared to LLaMA 2]]></title><description><![CDATA[This is the first part of a series where I will discuss the multilingual abilities of the LLaMA 3 model.]]></description><link>https://www.icodeformybhasa.com/p/exploring-multilingual-aspects-and</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/exploring-multilingual-aspects-and</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Mon, 22 Apr 2024 05:04:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!iTRz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98a85f1-3ba5-476e-a7f9-cb914c26d7ec_2048x2048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Meta AI launched <a href="https://ai.meta.com/blog/meta-llama-3/">LLaMA 3</a> on Thursday: the LLaMA 3 8B and 70B models. While I can't wait to conduct a comprehensive study of the model's multilingual abilities, in this introductory blog post I will briefly discuss how it differs from LLaMA 2. Much of the information shared here is already available as part of the model's release notes; however, this post narrows down the information shared by Meta to a multilingual perspective. 
Additionally, we will also explore LLaMA 3's vocabulary and tokenizer and compare it to those of LLaMA 2.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iTRz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98a85f1-3ba5-476e-a7f9-cb914c26d7ec_2048x2048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iTRz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98a85f1-3ba5-476e-a7f9-cb914c26d7ec_2048x2048.png 424w, https://substackcdn.com/image/fetch/$s_!iTRz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98a85f1-3ba5-476e-a7f9-cb914c26d7ec_2048x2048.png 848w, https://substackcdn.com/image/fetch/$s_!iTRz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98a85f1-3ba5-476e-a7f9-cb914c26d7ec_2048x2048.png 1272w, https://substackcdn.com/image/fetch/$s_!iTRz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98a85f1-3ba5-476e-a7f9-cb914c26d7ec_2048x2048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iTRz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98a85f1-3ba5-476e-a7f9-cb914c26d7ec_2048x2048.png" width="482" height="482" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d98a85f1-3ba5-476e-a7f9-cb914c26d7ec_2048x2048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:482,&quot;bytes&quot;:6498033,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iTRz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98a85f1-3ba5-476e-a7f9-cb914c26d7ec_2048x2048.png 424w, https://substackcdn.com/image/fetch/$s_!iTRz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98a85f1-3ba5-476e-a7f9-cb914c26d7ec_2048x2048.png 848w, https://substackcdn.com/image/fetch/$s_!iTRz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98a85f1-3ba5-476e-a7f9-cb914c26d7ec_2048x2048.png 1272w, https://substackcdn.com/image/fetch/$s_!iTRz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98a85f1-3ba5-476e-a7f9-cb914c26d7ec_2048x2048.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">This image was generated using Midjourney</figcaption></figure></div><h3>Summary of the updates</h3><ul><li><p>&#8220;The most capable openly available LLM to date&#8221; - Meta</p></li><li><p>8B and 70B parameter models, state-of-the-art at those scales</p></li><li><p>Improved tokenizer with a vocabulary of 128K tokens</p></li><li><p>Grouped Query Attention in both models</p></li><li><p>8,192-token sequence length</p></li><li><p>Pretrained on 15 trillion tokens, with 4x more code and over 5% non-English data</p></li></ul><h3>Focus on multilingual capabilities</h3><p>Both LLaMA 3 8B and 70B, released on Thursday, are text-based models, and they improve significantly over their predecessor, LLaMA 2. 
In this blog, I will only discuss the following enhancements highlighted in the LLaMA 3 release:</p><ol><li><p><strong>Improved Tokenizer: </strong>LLaMA 3 features an updated tokenizer with an expanded 128K vocabulary, compared to LLaMA 2's 32K vocab. The new tokenizer produces up to 15% fewer tokens than LLaMA 2's on the same text.</p></li><li><p><strong>Larger Training Data: </strong>LLaMA 3 was trained with 15 trillion tokens, in contrast to LLaMA 2, which was trained with 2 trillion. </p><ul><li><p>LLaMA 3's training data contains four times more code </p></li><li><p>In LLaMA 3, over 5% of the total data is in non-English languages </p></li><li><p>LLaMA 2 contains 10.3% non-English data, but 8.38% of the total is classified as unknown, a category partially made up of code</p></li></ul></li></ol><p>While the <a href="https://arxiv.org/pdf/2307.09288.pdf">LLaMA 2 paper </a>explicitly listed <strong>"Use in languages other than English" as out-of-scope</strong>, the LLaMA 3 release blog indicates they have specifically prepared for upcoming multilingual use cases by including over 5% high-quality non-English data spanning more than 30 languages in the pretraining dataset for LLaMA 3.</p><h3>Expanded vocabulary from 32K to 128K</h3><p>LLaMA 3 uses a vocabulary that is <strong>four times larger</strong> than LLaMA 2's vocab. 
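The parameter cost of this vocabulary expansion can be estimated directly. The sketch below assumes a hidden size of 4096 for the 7B/8B models and untied input and output embedding matrices, and uses LLaMA 3's released vocabulary size of 128,256; these figures are my assumptions, not from the release notes:

```python
# Rough estimate of the embedding parameters added by growing the
# vocabulary from LLaMA 2 (32,000) to LLaMA 3 (128,256).
D_MODEL = 4096            # assumed hidden size of the 7B/8B models
VOCAB_LLAMA2 = 32_000
VOCAB_LLAMA3 = 128_256

def embedding_params(vocab_size: int) -> int:
    # input embedding matrix + output (unembedding) projection
    return 2 * vocab_size * D_MODEL

extra = embedding_params(VOCAB_LLAMA3) - embedding_params(VOCAB_LLAMA2)
print(f"Added embedding parameters: {extra / 1e9:.2f}B")  # -> 0.79B
```

Under these assumptions, the embeddings alone account for roughly 0.8B of the headline growth from 7B to 8B parameters, with the remainder coming from other architectural changes.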
This substantial increase in vocabulary results in a considerably <strong>larger embedding matrix</strong>, consequently increasing the overall model parameters from <strong>7 billion in LLaMA 2 to 8 billion in LLaMA 3.</strong></p><p>Considering LLaMA 3 was trained with more multi-lingual data, let's explore how this increased exposure to data in different languages might be reflected in its vocabulary distribution.</p><p><em>Note: These subwords were categorized into a specific language class using<a href="https://pypi.org/project/langdetect/"> langdetect</a>.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!US1p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14941ca2-a87e-4f0d-8afa-f548ed5d0bc0_1910x592.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!US1p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14941ca2-a87e-4f0d-8afa-f548ed5d0bc0_1910x592.png 424w, https://substackcdn.com/image/fetch/$s_!US1p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14941ca2-a87e-4f0d-8afa-f548ed5d0bc0_1910x592.png 848w, https://substackcdn.com/image/fetch/$s_!US1p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14941ca2-a87e-4f0d-8afa-f548ed5d0bc0_1910x592.png 1272w, https://substackcdn.com/image/fetch/$s_!US1p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14941ca2-a87e-4f0d-8afa-f548ed5d0bc0_1910x592.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!US1p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14941ca2-a87e-4f0d-8afa-f548ed5d0bc0_1910x592.png" width="1456" height="451" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14941ca2-a87e-4f0d-8afa-f548ed5d0bc0_1910x592.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:451,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:131151,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!US1p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14941ca2-a87e-4f0d-8afa-f548ed5d0bc0_1910x592.png 424w, https://substackcdn.com/image/fetch/$s_!US1p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14941ca2-a87e-4f0d-8afa-f548ed5d0bc0_1910x592.png 848w, https://substackcdn.com/image/fetch/$s_!US1p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14941ca2-a87e-4f0d-8afa-f548ed5d0bc0_1910x592.png 1272w, https://substackcdn.com/image/fetch/$s_!US1p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14941ca2-a87e-4f0d-8afa-f548ed5d0bc0_1910x592.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Subword distribution by language in LLaMA 2 (left) and LLaMA 3 vocab (right)</figcaption></figure></div><p>The overall shape of the distribution of subwords across languages follows a power-law distribution for both LLaMA 2 (left) and LLaMA 3 (right). The LLaMA 3 vocab demonstrates impressive language coverage: it includes a wider set of languages with higher subword counts. Notably, it includes several languages with subword counts between 5,000 and 10,000, signifying stronger support for multiple high-resource languages. 
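A note on methodology: the categorization above uses langdetect, which guesses a language per subword. As a rough, dependency-free alternative, one can bucket subwords by the Unicode script of their first alphabetic character; the sample tokens below are hypothetical stand-ins for entries of a real tokenizer vocabulary:

```python
# Bucket tokens by Unicode script as a crude proxy for language.
import unicodedata
from collections import Counter

def script_of(token: str) -> str:
    for ch in token:
        if ch.isalpha():
            # Unicode character names begin with the script name,
            # e.g. "LATIN SMALL LETTER A", "DEVANAGARI LETTER NA".
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "OTHER"  # digits, punctuation, byte-level pieces, etc.

sample = ["the", "नेपाल", "ación", "中国", "##ing", "123"]
print(Counter(script_of(t) for t in sample))
```

Script bucketing cannot separate languages that share a script (e.g. Nepali and Hindi both use Devanagari), which is exactly where a detector like langdetect earns its keep.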
However, there is still a large number of languages with fewer than 2,500 subword types, suggesting the need to improve representation for low-resource languages.</p><h3>Tokenization times</h3><p>While LLaMA 3 offers wider language coverage in its vocab, this comes at the cost of slower tokenization. Moving from LLaMA 2 to LLaMA 3 results in a significant increase in tokenization time, from a median of 0.28 to 0.81 seconds per sequence. This can be seen in the diagrams below, as the curves for both English and other languages have moved towards the right.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HQo-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19ea610-7abd-414d-8564-0f669dc09f95_1915x622.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HQo-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19ea610-7abd-414d-8564-0f669dc09f95_1915x622.png 424w, https://substackcdn.com/image/fetch/$s_!HQo-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19ea610-7abd-414d-8564-0f669dc09f95_1915x622.png 848w, https://substackcdn.com/image/fetch/$s_!HQo-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19ea610-7abd-414d-8564-0f669dc09f95_1915x622.png 1272w, https://substackcdn.com/image/fetch/$s_!HQo-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19ea610-7abd-414d-8564-0f669dc09f95_1915x622.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!HQo-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19ea610-7abd-414d-8564-0f669dc09f95_1915x622.png" width="1456" height="473" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a19ea610-7abd-414d-8564-0f669dc09f95_1915x622.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:473,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:180822,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HQo-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19ea610-7abd-414d-8564-0f669dc09f95_1915x622.png 424w, https://substackcdn.com/image/fetch/$s_!HQo-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19ea610-7abd-414d-8564-0f669dc09f95_1915x622.png 848w, https://substackcdn.com/image/fetch/$s_!HQo-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19ea610-7abd-414d-8564-0f669dc09f95_1915x622.png 1272w, https://substackcdn.com/image/fetch/$s_!HQo-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19ea610-7abd-414d-8564-0f669dc09f95_1915x622.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tokenization time for English: LLaMA 2 (left) and LLaMA 3 (right)</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R6Wh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14937c24-1621-4cca-bc54-89918e1d75bb_1905x603.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R6Wh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14937c24-1621-4cca-bc54-89918e1d75bb_1905x603.png 424w, 
https://substackcdn.com/image/fetch/$s_!R6Wh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14937c24-1621-4cca-bc54-89918e1d75bb_1905x603.png 848w, https://substackcdn.com/image/fetch/$s_!R6Wh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14937c24-1621-4cca-bc54-89918e1d75bb_1905x603.png 1272w, https://substackcdn.com/image/fetch/$s_!R6Wh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14937c24-1621-4cca-bc54-89918e1d75bb_1905x603.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R6Wh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14937c24-1621-4cca-bc54-89918e1d75bb_1905x603.png" width="1456" height="461" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14937c24-1621-4cca-bc54-89918e1d75bb_1905x603.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:461,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:192820,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!R6Wh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14937c24-1621-4cca-bc54-89918e1d75bb_1905x603.png 424w, 
https://substackcdn.com/image/fetch/$s_!R6Wh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14937c24-1621-4cca-bc54-89918e1d75bb_1905x603.png 848w, https://substackcdn.com/image/fetch/$s_!R6Wh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14937c24-1621-4cca-bc54-89918e1d75bb_1905x603.png 1272w, https://substackcdn.com/image/fetch/$s_!R6Wh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14937c24-1621-4cca-bc54-89918e1d75bb_1905x603.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tokenization time for Hindi and Nepali: LLaMA 2 (left) and LLaMA 3 (right)</figcaption></figure></div><p>For both LLaMA 2 and LLaMA 3, the average and median tokenization times are consistent between English and other languages like Hindi and Nepali. The distribution for LLaMA 2 is, overall, taller and slimmer, which implies a narrower range of tokenization times. This also means that LLaMA 2 has more consistent and efficient tokenization times across sequences, not only for English but for other languages as well.</p><p>This might not be a serious issue overall, depending on how much time tokenization takes up in the entire model execution pipeline. We will take a closer look at this in a subsequent post.</p><h3>Tokenization trends across languages</h3><p>The following curves show the distribution of token lengths for English, Hindi and Nepali for the LLaMA 2 and LLaMA 3 tokenizers. Both tokenizers have taller and narrower curves for English. 
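</p><p>As a rough illustration, per-sequence tokenization timings like these can be collected with a small timing harness. The sketch below is my own, not the code behind the plots: it times a stand-in whitespace tokenizer so it runs anywhere, and the <code>repeats</code> parameter is an assumption. With Hugging Face <code>transformers</code> installed, a LLaMA tokenizer's <code>encode</code> method could be passed in the same way.</p>

```python
import statistics
import time

def time_tokenization(tokenize, texts, repeats=5):
    """Time `tokenize` on each text (best of `repeats` runs, in milliseconds)."""
    per_text_ms = []
    for text in texts:
        runs = []
        for _ in range(repeats):
            start = time.perf_counter()
            tokenize(text)  # result discarded; we only measure elapsed time
            runs.append((time.perf_counter() - start) * 1000)
        per_text_ms.append(min(runs))  # best-of-N reduces scheduler noise
    return {
        "mean_ms": statistics.mean(per_text_ms),
        "median_ms": statistics.median(per_text_ms),
        "stdev_ms": statistics.stdev(per_text_ms) if len(per_text_ms) > 1 else 0.0,
    }

# Stand-in tokenizer; swap in e.g. AutoTokenizer.from_pretrained(...).encode
stats = time_tokenization(str.split, ["राम्रो काम", "tokenize this sentence"])
print(stats)
```

<p>Plotting the per-sequence timings, rather than only the summary statistics, is what reveals the differences in distribution shape discussed above.</p><p>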
The median token length for English is similar (14 vs. 13) for LLaMA 2 and LLaMA 3, which shows that tokenization lengths for English are consistent across versions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OAjo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df669ab-2c43-459f-b629-bb4db07afee4_1912x617.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OAjo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df669ab-2c43-459f-b629-bb4db07afee4_1912x617.png 424w, https://substackcdn.com/image/fetch/$s_!OAjo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df669ab-2c43-459f-b629-bb4db07afee4_1912x617.png 848w, https://substackcdn.com/image/fetch/$s_!OAjo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df669ab-2c43-459f-b629-bb4db07afee4_1912x617.png 1272w, https://substackcdn.com/image/fetch/$s_!OAjo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df669ab-2c43-459f-b629-bb4db07afee4_1912x617.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OAjo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df669ab-2c43-459f-b629-bb4db07afee4_1912x617.png" width="1456" height="470" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3df669ab-2c43-459f-b629-bb4db07afee4_1912x617.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:470,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:216055,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OAjo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df669ab-2c43-459f-b629-bb4db07afee4_1912x617.png 424w, https://substackcdn.com/image/fetch/$s_!OAjo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df669ab-2c43-459f-b629-bb4db07afee4_1912x617.png 848w, https://substackcdn.com/image/fetch/$s_!OAjo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df669ab-2c43-459f-b629-bb4db07afee4_1912x617.png 1272w, https://substackcdn.com/image/fetch/$s_!OAjo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df669ab-2c43-459f-b629-bb4db07afee4_1912x617.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Token lengths distribution for English: LLaMA 2 (left) and LLaMA 3 (right)</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ntzv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F123c1c77-e00d-4521-bff3-48ada357df33_1914x612.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ntzv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F123c1c77-e00d-4521-bff3-48ada357df33_1914x612.png 424w, 
https://substackcdn.com/image/fetch/$s_!ntzv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F123c1c77-e00d-4521-bff3-48ada357df33_1914x612.png 848w, https://substackcdn.com/image/fetch/$s_!ntzv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F123c1c77-e00d-4521-bff3-48ada357df33_1914x612.png 1272w, https://substackcdn.com/image/fetch/$s_!ntzv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F123c1c77-e00d-4521-bff3-48ada357df33_1914x612.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ntzv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F123c1c77-e00d-4521-bff3-48ada357df33_1914x612.png" width="1456" height="466" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/123c1c77-e00d-4521-bff3-48ada357df33_1914x612.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:466,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:222074,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ntzv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F123c1c77-e00d-4521-bff3-48ada357df33_1914x612.png 424w, 
https://substackcdn.com/image/fetch/$s_!ntzv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F123c1c77-e00d-4521-bff3-48ada357df33_1914x612.png 848w, https://substackcdn.com/image/fetch/$s_!ntzv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F123c1c77-e00d-4521-bff3-48ada357df33_1914x612.png 1272w, https://substackcdn.com/image/fetch/$s_!ntzv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F123c1c77-e00d-4521-bff3-48ada357df33_1914x612.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Token lengths distribution for Hindi and Nepali: LLaMA 2 (left) and LLaMA 3 (right)</figcaption></figure></div><p>In comparison, for both LLaMA 2 and LLaMA 3, Hindi and Nepali have shorter and wider token-length distribution curves. This suggests that Hindi and Nepali texts are tokenized into longer token sequences by both versions of LLaMA. However, there's a noticeable difference in the median token length for Hindi and Nepali: from 64 for LLaMA 2 down to 33 for LLaMA 3. The curves are also flatter and wider for LLaMA 2 in comparison. This indicates that LLaMA 3 has learned from its additional multilingual data to better segment sequences in languages other than English. Given that Nepali and Hindi are morphologically more complex than English, this shows that LLaMA 3 has an improved ability to handle complex languages.</p><h3>Conclusion</h3><p>In conclusion, LLaMA 3 offers a substantial leap in vocabulary size and language coverage compared to LLaMA 2. This expansion improves handling of morphologically complex languages but comes at the cost of less efficient tokenization times. LLaMA 3 demonstrates the benefit of multilingual data, achieving more accurate segmentation in languages like Hindi and Nepali. These findings show a trade-off between multilingual tokenization capabilities and processing speed. However, the regression in tokenization time might not be as significant when we look at the inference time for the entire pipeline.</p><h3>What Next in This Series?</h3><p>In the next parts of this series, we will dive deeper into the different components of the LLaMA 3 model and examine each from a multilingual lens. This will help us understand how LLaMA 3 handles different languages.</p><p>We will also fine-tune LLaMA 3 8B for Hindi and Nepali. This will help us assess its capabilities for non-Latin languages. 
In particular, by fine-tuning for Nepali, we will also gain insight into how effectively, if at all, LLaMA 3 can be adapted to low-resource languages.</p><p>In addition to blogs, I will prepare notebooks and visualizers for each component I look into in this series. So stay tuned!</p>]]></content:encoded></item><item><title><![CDATA[The Cost of Ideas]]></title><description><![CDATA[While proofreading the "How LLMs Break Down Language from Text to Tokens" section from my last blog, Nolan asked me a thought-provoking question: "Then what would the cost of ideas for ideographic languages be?" 
His question highlighted the significant differences between two writing systems: ideographic languages, where characters represent ideas, and orthographic languages, where characters represent sounds.]]></description><link>https://www.icodeformybhasa.com/p/the-cost-of-ideas</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/the-cost-of-ideas</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Sat, 30 Mar 2024 03:05:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>While proofreading the <a href="https://www.icodeformybhasa.com/i/142259391/how-llms-break-down-language-from-text-to-tokens">"How LLMs Break Down Language from Text to Tokens"</a> section from my last blog, <a href="https://www.linkedin.com/in/nolan-kramer-85536890/">Nolan</a> asked me a thought-provoking question: "Then what would the cost of ideas for ideographic languages be?" His question highlighted the significant differences between two writing systems: ideographic languages, where characters represent ideas, and orthographic languages, where characters represent sounds. This further prompted an investigation into whether this distinction affects the overall cost of ideas for large language models (LLMs).</p><p>In this blog post, we will try to answer Nolan's question and explore the following:</p><ul><li><p><strong>Ideographic vs. 
Orthographic Languages:</strong> We will explore the fundamental differences between ideographic languages like Chinese and Japanese and orthographic languages like English and Nepali.</p></li><li><p><strong>The Trade-Off between Conciseness and Digital Footprint:</strong> We will briefly explore the relationship between character count and byte count, acknowledging that ideographic languages might be more concise on paper but not necessarily in terms of digital storage.</p></li></ul><p>Additionally, we will discuss:</p><ul><li><p><strong>The "Cost of Ideas" for LLMs:</strong> We will define the "cost of ideas" in the context of LLMs as the number of tokens required to represent the same idea.</p></li></ul><p>In my <a href="https://www.icodeformybhasa.com/p/beyond-the-abcs-exploring-the-nuances">previous post</a>, I showed that large language models (LLMs) heavily optimized for English and Latin-based languages exhibit consistently higher token counts when processing non-Latin scripts. This discrepancy in token counts translates to increased computational costs for operating LLM-based applications like ChatGPT on non-Latin languages. Building on this foundation, my focus in this post will be on investigating the "Cost of Ideas" in ideographic and orthographic writing systems. The primary objective is to explore how the fundamental distinctions between these two categories of languages influence their digital footprint in representing and conveying concepts and ideas.</p><h2>Ideographic vs. Orthographic Languages</h2><p>Ideographic languages are writing systems where each character or symbol represents a complete word or concept. Chinese, which uses thousands of characters (called hanzi), is the most prominent example of this type of writing system. 
</p><p><strong>Key Features of Ideographic Languages:</strong></p><ul><li><p><strong>Characters Represent Ideas:</strong> Instead of corresponding to phonetic sounds, each character in an ideographic language symbolizes an entire word or concept. For example, the character <strong>&#26408;</strong> in Chinese represents the word "tree" or the concept of "wood."</p></li><li><p><strong>Extensive Character Set:</strong> To encompass a language's full vocabulary, ideographic writing systems require a vast number of characters, often numbering in the thousands or tens of thousands. The Kangxi dictionary, one of the most comprehensive Chinese dictionaries, contains <strong>over 47,000 characters.</strong></p></li><li><p><strong>Context Matters:</strong> While some characters may have multiple pronunciations or meanings, the context in which they appear often provides clues to their intended interpretation. The character <strong>&#34892;</strong> can mean "to walk" or "behavior," depending on the context.</p></li><li><p><strong>Character Composition:</strong> In some languages (like Chinese), characters can be built from simpler components that offer hints about meaning or pronunciation. The character <strong>&#26519;</strong>, meaning "forest," is composed of two instances of the character <strong>&#26408;</strong> (tree).</p></li></ul><p>In contrast, orthographic languages utilize a set of symbols, typically letters or syllabic characters, to represent the individual sounds (phonemes) that make up spoken words. These symbols are then combined to form words based on their phonetic values. The most familiar example of an orthographic language is English. </p><p><strong>Key Features of Orthographic Languages:</strong></p><ul><li><p><strong>Phoneme Representation:</strong> The letters or symbols in these writing systems correspond to the smallest units of sound (phonemes) in the spoken language. 
For example, the letter "c" represents the /k/ sound in the word "cat."</p></li><li><p><strong>Limited Symbol Set:</strong> Compared to ideographic languages, orthographic systems typically require a relatively small number of symbols to function. The English alphabet has 26 letters.</p></li><li><p><strong>Phonetic Combination:</strong> Words are formed by combining these symbols based on their phonetic values, creating a more direct link between sound and written word. The word "book" is composed of the letters "b," "o," "o," and "k," representing the sounds /b/, /&#650;/, /k/.</p></li></ul><p>In addition, there are some languages like Japanese that incorporate both ideograms (kanji characters borrowed from Chinese) and phonetic scripts (hiragana and katakana) within their writing system. For example, the Japanese word for "computer" is written as &#12467;&#12531;&#12500;&#12517;&#12540;&#12479;&#12540; (using katakana) or &#38651;&#33075; (using kanji characters).</p><h2>The Universal Declaration of Human Rights as a Lens for Comparing "Cost of Ideas"</h2><p>To analyze the "cost of ideas" across ideographic and orthographic languages, we will leverage the Universal Declaration of Human Rights (UDHR) as a parallel corpus in Nepali (orthographic), English (orthographic), Japanese (hybrid with ideographic kanji and syllabic kana), and Chinese (ideographic).</p><p>The UDHR translations are maintained and overseen by the United Nations (UN), ensuring that the UDHR's articles are conveyed accurately and with semantic equivalence across all languages. 
This ensures that any differences observed in the "cost of ideas" are primarily due to the inherent characteristics of the writing systems themselves, rather than discrepancies in translation.</p><p>We will examine the parallel translations across these languages to understand how the same ideas, when represented in different writing systems, vary in the cost that a digital system has to bear.</p><p><strong>Preprocessing the Data</strong></p><p>The plain-text version of the UDHR was originally prepared and hosted by the Unicode Consortium under the "UDHR in Unicode" project. Although, as of January 2024, the Unicode Consortium no longer hosts the UDHR in Unicode project, the XML files with translations in multiple languages are available at <a href="http://efele.net/udhr/">UDHR in XML</a>.</p><p>I pre-processed the XML files for Mandarin Chinese (Simplified), Mandarin Chinese (Traditional), English, Japanese, and Nepali. The processed dataset includes 31 rows for each language: the preamble and the 30 articles defined in the UDHR.</p><h2><strong>The Trade-Off: Conciseness vs. Digital Footprint</strong></h2><p>The trade-off between conciseness and digital footprint becomes particularly evident when comparing ideographic writing systems, like Chinese, with orthographic systems, like English or Nepali. 
Let's delve deeper into this trade-off by examining the grapheme and byte counts for the text in our dataset.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8LXg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8LXg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8LXg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png" width="1200" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33610,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8LXg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!8LXg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1cd779-2b97-4a9d-a1db-dd4ee8af7888_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Average grapheme and byte counts across languages in the UDHR</figcaption></figure></div><h4><strong>Grapheme Count and Conciseness</strong></h4><p>Grapheme count refers to the number of characters, such as letters or ideographs, needed to represent a word or concept. Ideographic scripts like Chinese exhibit a significant advantage in conciseness. In our dataset, Traditional Chinese has an average grapheme count of 82.54, and Simplified Chinese has 82.45. In contrast, the average grapheme count for English is considerably higher at 321.70. 
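</p><p>A minimal sketch of how such counts can be computed, with one simplification I'm adding for illustration: Python's <code>len</code> counts code points, which only approximates grapheme counts, since combining marks (common in Devanagari) are counted separately.</p>

```python
def text_costs(text):
    # Code points approximate graphemes (full grapheme-cluster segmentation
    # would need the Unicode rules); UTF-8 bytes measure the digital footprint.
    return {"codepoints": len(text), "utf8_bytes": len(text.encode("utf-8"))}

# "tree": a single ideograph in Chinese vs. four letters in English.
print(text_costs("木"))     # {'codepoints': 1, 'utf8_bytes': 3}
print(text_costs("tree"))   # {'codepoints': 4, 'utf8_bytes': 4}
```

<p>The single character 木 is far more concise on paper, yet in UTF-8 it already costs three bytes; summed over whole articles, this is what drives up the byte counts for Chinese despite its much lower grapheme counts.</p><p>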
This conciseness in ideographic scripts stems from their ability to convey complex ideas and concepts through a single ideographic character, reducing the need for multiple graphemes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nwRT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nwRT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 424w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 848w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 1272w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nwRT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png" width="1456" height="1312" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1312,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:299282,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nwRT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 424w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 848w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 1272w, https://substackcdn.com/image/fetch/$s_!nwRT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cc9af4b-97d2-4119-ad36-126e14e54f8f_1460x1316.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Grapheme counts across each article of the UDHR for English and Chinese</figcaption></figure></div><h4>Byte Count and Digital Footprint</h4><p>The byte count, by contrast, represents the number of bytes required to encode the text digitally. Despite their lower grapheme counts, Traditional and Simplified Chinese texts required an average of 246.41 and 240.77 bytes, respectively, to encode their characters. This higher byte count is a consequence of the multi-byte character encodings required for ideographic scripts: conciseness comes at the cost of an increased digital footprint. 
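These counts are easy to reproduce with Python's built-ins. The sketch below uses a sentence from Article 1 of the UDHR; the English and Simplified Chinese sample strings are assumed renderings, and code points are used as a stand-in for graphemes (exact for these two scripts, where each grapheme is a single code point).

```python
def text_stats(text: str):
    """Return (code points, UTF-8 bytes) for a string.

    Code points approximate graphemes here: English letters and
    Chinese ideographs each occupy exactly one code point.
    """
    return len(text), len(text.encode("utf-8"))

# Assumed renderings of a sentence from Article 1 of the UDHR
english = "All human beings are born free and equal in dignity and rights."
chinese = "人人生而自由，在尊严和权利上一律平等。"

for label, text in [("English", english), ("Chinese", chinese)]:
    graphemes, nbytes = text_stats(text)
    print(f"{label}: {graphemes} graphemes, {nbytes} bytes")
```

English here is pure ASCII (one byte per grapheme), while every character in the Chinese sample encodes to three UTF-8 bytes, so the large grapheme gap shrinks to a small byte gap.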
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tURU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tURU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 424w, https://substackcdn.com/image/fetch/$s_!tURU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 848w, https://substackcdn.com/image/fetch/$s_!tURU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 1272w, https://substackcdn.com/image/fetch/$s_!tURU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tURU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png" width="1454" height="1300" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1300,&quot;width&quot;:1454,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:295041,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tURU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 424w, https://substackcdn.com/image/fetch/$s_!tURU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 848w, https://substackcdn.com/image/fetch/$s_!tURU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 1272w, https://substackcdn.com/image/fetch/$s_!tURU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a514d65-f067-4f93-89d5-bee96afccf59_1454x1300.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Byte counts across each article of the UDHR for English and Chinese</figcaption></figure></div><p><strong>English and the Trade-Off</strong></p><p>On average, English requires 3.9 times more graphemes than Traditional and Simplified Chinese to convey the same concepts. When it comes to the byte counts needed for digital encoding, however, the gap narrows drastically: English requires only 1.31 times more bytes than Traditional Chinese, and 1.34 times more bytes than Simplified Chinese. 
This highlights the trade-off: Chinese is far more concise, requiring fewer graphemes, while English benefits from a simpler encoding that needs fewer bytes per grapheme than the ideographic Chinese scripts.</p><p><strong>The Case of Japanese</strong></p><p>Similarly, Japanese, which combines ideographic kanji characters borrowed from Chinese with the syllabic hiragana and katakana scripts, has an average grapheme count of 124.35 in our dataset, lower than that of English. However, the byte count for Japanese text jumps to 371.83, exceeding even that of English. This significant increase can, again, be attributed to the multi-byte character encodings for Japanese characters.</p><p>In essence, while ideographic scripts like Chinese and Japanese offer conciseness in terms of grapheme counts, they often require more bytes to encode digitally, resulting in a trade-off between conciseness and digital footprint. This trade-off has implications for tasks such as text storage, transmission, and processing within language technologies and applications.</p><h3>Script Intricacies and Their Impact on Digital Footprint</h3><p>While ideographic scripts like Chinese exhibit a clear trade-off between conciseness in grapheme counts and an increased digital footprint due to their complex character encodings, the case of Nepali presents a different challenge.</p><p>Even though Nepali, like English, is an orthographic language, its text characteristics in our dataset differ significantly in terms of grapheme and byte count. Nepali uses far fewer graphemes on average (194.80) than English (321.70), and this efficiency stems from the unique features of the Devanagari script. Unlike the Latin script, where consonants often need additional characters to represent syllables or consonant clusters, Devanagari generally uses a single character per sound. 
This is because Devanagari consonants typically carry an inherent vowel sound, a characteristic not always present in the Latin script. This allows Nepali texts to be represented with fewer graphemes on average than their English counterparts. </p><p>However, the byte count tells a different story. Nepali text required a staggering 759.09 bytes on average to encode digitally, over 2.3 times the 321.96 bytes needed for English text. This disproportionately high byte count, despite Nepali's lower grapheme count, highlights the complexity of digitally encoding the intricate system of consonant clusters, vowel diacritics, and combining characters in the Devanagari script.</p><h3><strong>The "Cost of Ideas" for LLMs</strong></h3><p>As explored in <a href="https://www.icodeformybhasa.com/p/beyond-the-abcs-exploring-the-nuances#%C2%A7how-llms-break-down-language-from-text-to-tokens">my previous work</a>, the number of tokens required to represent ideas in LLMs can vary significantly across languages, depending on how each model's tokenizer was trained. While the inherent characteristics of a language influence the number of graphemes needed to represent ideas, the tokenization method plays a crucial role in determining the actual token counts within the LLM. If the tokenizer is trained on a diverse dataset that includes a good representation of ideographic languages like Chinese, it can potentially learn to tokenize these languages more efficiently. 
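As a quick aside, the Devanagari encoding behaviour described above can be checked directly. The sketch below approximates grapheme counts by folding Unicode combining marks (the vowel signs, categories Mn/Mc) into the preceding consonant; the word नेपाली ("Nepali") is an assumed example, and conjuncts formed with virama are ignored, so this is only a rough approximation.

```python
import unicodedata

def devanagari_stats(text: str):
    """Return (code points, UTF-8 bytes, approximate graphemes).

    Combining marks (Unicode categories starting with 'M') attach to
    the previous character, so they are not counted as separate
    graphemes. Conjunct handling via virama is deliberately ignored.
    """
    code_points = len(text)
    nbytes = len(text.encode("utf-8"))  # Devanagari: 3 bytes per code point
    graphemes = sum(
        1 for ch in text if not unicodedata.category(ch).startswith("M")
    )
    return code_points, nbytes, graphemes

print(devanagari_stats("नेपाली"))  # → (6, 18, 3)
```

Three perceived graphemes (ने, पा, ली) expand to six code points and eighteen bytes, which is exactly the kind of inflation behind Nepali's outsized byte counts.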
This can lead to lower token counts and a cost advantage for representing ideas in the LLM.</p><p>You can find the <a href="https://huggingface.co/spaces/shreeyad/tokenizers-multilingual">visualizer here</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q-qQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q-qQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 424w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 848w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 1272w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q-qQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png" width="1456" height="436" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:436,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:284476,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q-qQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 424w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 848w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 1272w, https://substackcdn.com/image/fetch/$s_!q-qQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe40df143-8670-498e-b30e-ab723bfd7fa7_1913x573.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tokenizing Article 1 in English and Chinese using XML-Roberta Tokenizer</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UTMY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UTMY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 424w, https://substackcdn.com/image/fetch/$s_!UTMY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 
848w, https://substackcdn.com/image/fetch/$s_!UTMY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 1272w, https://substackcdn.com/image/fetch/$s_!UTMY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UTMY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png" width="1456" height="422" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:422,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:286448,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UTMY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 424w, https://substackcdn.com/image/fetch/$s_!UTMY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 848w, 
https://substackcdn.com/image/fetch/$s_!UTMY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 1272w, https://substackcdn.com/image/fetch/$s_!UTMY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae81a7cc-b5f9-420e-95a7-b53b50048748_1914x555.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tokenizing Article 1 in English and Chinese using NLLB Tokenizer</figcaption></figure></div><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vvi-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vvi-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 424w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 848w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 1272w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vvi-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png" width="1456" height="432" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:432,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:287963,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vvi-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 424w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 848w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 1272w, https://substackcdn.com/image/fetch/$s_!Vvi-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc971e67f-5e13-43b8-be84-ce05b58607db_1901x564.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tokenizing Article 1 in English and Chinese using GPT-4 Tokenizer</figcaption></figure></div><p>Conversely, a tokenizer trained on data skewed towards certain languages or writing systems may struggle to tokenize other languages optimally. This can result in higher token counts and increased costs for representing ideas in those languages. 
This seems to be the case with GPT-4 tokenization, where it exhibits sub-optimal performance when tokenizing texts in non-Latin languages.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wcAg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wcAg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wcAg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png" width="1200" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:37636,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!wcAg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!wcAg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9fdafb2-9e42-42cc-a929-4224da6562c1_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Average Token counts for XML-Roberta, NLLB and GPT-4 in UDHR</figcaption></figure></div><p>This observation highlights the importance of carefully curating the training dataset, as well as tailoring the tokenization process when developing large language models. By ensuring that the tokenizer is exposed to a diverse range of languages, including ideographic scripts, during the training process, LLMs can potentially leverage the inherent advantages of certain writing systems. For example, they can exploit the compact representation of ideas offered by ideographic languages like Chinese. 
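One way to see why a skewed vocabulary matters is a toy greedy tokenizer with byte-level fallback. With a hypothetical English-only vocabulary, an English sentence compresses into a handful of word tokens, while a Devanagari word degenerates to one token per UTF-8 byte. Everything here — the vocabulary and the sample strings — is illustrative, not any real model's tokenizer.

```python
def count_tokens(text: str, vocab: set) -> int:
    """Greedy longest-match tokenization over UTF-8 bytes.

    Any byte sequence not covered by the vocabulary falls back to
    one token per byte, mimicking a byte-level BPE that learned no
    merges for that script.
    """
    data = text.encode("utf-8")
    i, tokens = 0, 0
    while i < len(data):
        # Try the longest possible match starting at position i.
        for j in range(len(data), i, -1):
            if data[i:j] in vocab:
                break
        else:
            j = i + 1  # unknown byte: emit a single-byte token
        tokens += 1
        i = j
    return tokens

# Hypothetical vocabulary skewed entirely towards English
vocab = {b"all", b"humans", b"are", b"born", b"free", b" "}

english = "all humans are born free"  # 5 word tokens + 4 space tokens
nepali = "नेपाली"                     # no Devanagari merges: byte fallback
print(count_tokens(english, vocab), count_tokens(nepali, vocab))  # → 9 18
```

The English sentence costs 9 tokens, while the six-grapheme Nepali word costs 18 — one per byte — which is the shape of the cost disparity the articles above measure in real tokenizers.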
Ultimately, the tokenization method and the quality of the training data can significantly impact the cost and efficiency of representing ideas across different languages within large language models.</p><h3>Acknowledgements</h3><p><a href="https://www.linkedin.com/in/nolan-kramer-85536890/">Nolan Kramer</a>, for not just asking the question that became the basis of this post, but also for the discussions throughout the time I was working on this project.</p><p><a href="https://www.linkedin.com/in/gwendolyngillingham">Gwendolyn Gillingham</a>, for helping me with the study and providing the idea of using the UDHR dataset.</p>]]></content:encoded></item><item><title><![CDATA[Beyond the ABCs: Exploring the nuances of tokenization in diverse languages]]></title><description><![CDATA[Earlier this month, I stumbled upon two articles that discussed the disparities in tokenization among languages titled "All languages are NOT created (tokenized) equal" and &#8220;Why is GPT-3 15.77x more expensive for certain languages?&#8221;.]]></description><link>https://www.icodeformybhasa.com/p/beyond-the-abcs-exploring-the-nuances</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/beyond-the-abcs-exploring-the-nuances</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Wed, 13 Mar 2024 03:40:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Earlier this month, I stumbled upon two articles that discussed the disparities in tokenization among languages titled "<a href="https://www.artfish.ai/p/all-languages-are-not-created-tokenized">All languages are NOT created (tokenized) equal</a>" and &#8220;<a href="https://denyslinkov.medium.com/why-is-gpt-3-15-77x-more-expensive-for-certain-languages-2b19a4adc4bc">Why is GPT-3 15.77x more expensive for certain languages?</a>&#8221;. This piqued my interest and motivated me to conduct further investigations on my own. 
</p><p>In this article, I'll discuss Byte-Pair Encoding (BPE) based tokenization and the disparities in the tokenization process across different languages. Using the Indo-European language family as a case study, I will show how these discrepancies arise <strong>not</strong> from inherent language family differences but rather from the training data and the representation of characters in Unicode for each language. In addition, I will:</p><ol><li><p>Explore the GPT-4 vocab and compare it to those of XLM-RoBERTa and NLLB-200-distilled-600M.</p></li><li><p>Explore token length distributions for five Indo-European languages: English, French, Spanish, Hindi, and Nepali.</p></li><li><p>Explore the relationship between grapheme counts and token lengths across these languages.</p></li><li><p>Compare the speed of tokenization for the three tokenizers across the five languages mentioned above.</p></li></ol><h3>How LLMs break down language from text to tokens</h3><p><em><a href="https://www.youtube.com/watch?v=zduSFxRajkE">Let&#8217;s build the GPT Tokenizer</a> by Andrej Karpathy was very helpful in understanding the tokenizers used by LLMs.</em></p><p>Tokenization is a fundamental process that involves breaking down a
text into smaller units called tokens, typically words or subwords. LLMs like GPT-4 utilize a technique called byte pair encoding (BPE) for tokenization. BPE iteratively merges the most frequently occurring pairs of consecutive symbols into single units, forming a dynamic vocabulary that adapts to the unique characteristics of the training data. This approach enables LLMs to handle rare words effectively and improves computational efficiency compared to traditional word-based methods. </p><p>In addition, instead of treating text as sequences of individual characters, GPT-4 uses <strong>byte-level BPE</strong> for tokenization and leverages the properties of UTF-8 encoding, which represents each Unicode <a href="https://en.wikipedia.org/wiki/Code_point">code point</a> as a sequence of one to four bytes. </p><h4>Byte-Level BPE in GPT Models</h4><p>By working with bytes instead of characters, these models achieve the following advantages, in addition to dynamic vocabulary building:</p><ol><li><p><strong>Compact Vocabulary:</strong> It starts with a base vocabulary consisting of 256 individual bytes, representing all possible byte values in UTF-8 encoding. This small vocabulary size translates to computational efficiency and faster processing.</p></li><li><p><strong>Universal Character Representation:</strong> This ensures all characters, regardless of their origin, can be represented using a combination of bytes, effectively eliminating the need for "unknown tokens." This allows the models to handle diverse text from various languages and writing systems seamlessly.</p></li></ol><h4>Decoding GPT-4 vocab</h4><p>To understand the discrepancies discussed above, I first looked into the vocab used by GPT-4. The tokens in the original vocab file <a href="https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken">cl100k_base.tiktoken</a>, used by the <code>cl100k_base</code> tokenizer (the BPE tokenizer behind GPT-4), <strong>are encoded in base64</strong>. 
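</p><p>As a minimal sketch of that conversion step (my own illustration, not the exact script I used; the sample entries below are hypothetical but follow the file&#8217;s one-base64-token-per-line format):</p>

```python
import base64

def decode_entry(tok_b64: str):
    """Decode a base64 vocab entry; return its UTF-8 string,
    or None if the bytes do not form complete code points."""
    raw = base64.b64decode(tok_b64)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None  # partial byte sequence

print(decode_entry("aGVsbG8="))  # b"hello" -> "hello"
print(decode_entry("4KS4"))      # b"\xe0\xa4\xb8" -> "स", a complete Devanagari letter
print(decode_entry("4KQ="))      # b"\xe0\xa4" -> None, an incomplete UTF-8 sequence
```

<p>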
I converted the vocabulary to UTF-8 for my analysis. Some tokens resulted in decoding errors because they contain incomplete byte sequences, highlighting limitations of byte-level BPE in handling uncommon texts and text in writing systems other than Latin.</p><ul><li><p>The decoded vocabulary comprises 70,988 entries containing only Latin characters. This suggests a potential bias towards Latin-based languages in GPT-4's training data.</p></li><li><p>There are 29,268 entries containing at least one non-Latin character. This indicates that the model was exposed to other languages during training. </p></li><li><p>Among these non-Latin entries, 803 contain partial byte sequences. </p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hDVI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png"><img src="https://substackcdn.com/image/fetch/$s_!hDVI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a75b5c-1112-4cb1-8b53-e3a71c236000_1422x1144.png" width="1422" height="1144" alt=""></a><figcaption class="image-caption">GPT-4 Tokenization Visualization from <a href="https://platform.openai.com/tokenizer">Open AI&#8217;s Tokenizer playground</a></figcaption></figure></div><h4>Limitations in representing uncommon texts and other writing systems</h4><p>While byte-level BPE effectively eliminates the need for unknown tokens with a compact vocabulary, there are
some limitations, especially in representing uncommon texts and texts in writing systems other than Latin.</p><p>Despite universal character representation, byte-level BPE might struggle with tokenizing uncommon texts not seen during training. For cases involving extremely rare combinations or characters from under-represented writing systems, BPE might resort to suboptimal tokenization, like breaking the sequence down into individual bytes, <strong>which can impact accuracy</strong>.</p><p>English letters are assigned a one-byte encoding in UTF-8. However, this is not true for all languages; some languages need multiple bytes per character. Hindi and Nepali are examples of such languages. Both use the Devanagari script, which has a larger character set than the basic Latin alphabet used in English, so these languages need more unique symbols to represent their characters. UTF-8 encodes characters using a variable number of bytes depending on their code point: to represent these less common characters, UTF-8 uses two, three, or even four bytes. Since a byte-level BPE model initially treats each byte as a separate token, a letter in languages like Hindi or Nepali can be broken down into multiple tokens, potentially impacting the model's understanding and generation capabilities. The impact of this process on the model&#8217;s understanding is out of the scope of this article.</p><p>Let&#8217;s explore how the byte-based BPE tokenization process can lead to the issue I discussed, with the Nepali word "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;" (sombaar, meaning "Monday") as an example.</p><ol><li><p><strong>Unicode Code Points:</strong> The word "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;" is represented by the following Unicode code points in hexadecimal:</p><pre><code><code>&#2360;: 0x0938
&#2379;: 0x094B
&#2350;: 0x092E
&#2348;: 0x092C
&#2366;: 0x093E
&#2352;: 0x0930</code></code></pre></li><li><p><strong>UTF-8 Encoding:</strong> When encoded using UTF-8, the word "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;" becomes the following byte sequence:</p><pre><code><code>&#2360;: 0xE0 0xA4 0xB8
&#2379;: 0xE0 0xA5 0x8B
&#2350;: 0xE0 0xA4 0xAE
&#2348;: 0xE0 0xA4 0xAC
&#2366;: 0xE0 0xA4 0xBE
&#2352;: 0xE0 0xA4 0xB0</code></code></pre></li></ol><ol start="3"><li><p><strong>Byte-based BPE Tokenization:</strong> During training, BPE can, for example, merge the first two bytes <code>0xE0 0xA4</code> of the character <code>&#2348;</code> into a single token and leave out <code>0xAC</code> as a separate token, depending on the data it has seen. This causes the vocab to have byte sequences that do not make up a valid code point. So let&#8217;s assume that after several iterations we have the following vocabulary. </p><pre><code><code>0xE0 0xA4 0xB8
0xE0 0xA5 0x8B
0xE0 0xA4 0xAE
0xE0 0xA4 --&gt; incomplete
0xAC --&gt; incomplete
0xE0 0xA4 0xBE
0xE0 0xA4 0xB0</code></code></pre></li><li><p><strong>Tokenization of "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;":</strong> When the tokenizer tries to tokenize the word "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;", it would then generate the following sequence of tokens:</p><pre><code><code>['0xE0 0xA4 0xB8', '0xE0 0xA5 0x8B', '0xE0 0xA4 0xAE', '0xE0 0xA4', '0xAC', '0xE0 0xA4 0xBE', '0xE0 0xA4 0xB0']
Decoding at token id level:
['&#2360;', '&#2379;', '&#2350;', '&#65533;', '&#65533;', '&#2366;', '&#2352;']</code></code></pre><p><br>When decoding the tokens individually, you would encounter a Unicode decoding error; by default, the invalid bytes are replaced with the Unicode replacement character (&#65533;). See more on this <a href="https://docs.python.org/3/library/stdtypes.html#bytes.decode">here</a>.</p><p><strong>Note:</strong> A slightly different case would be where byte sequences for multiple characters are combined by BPE into one vocab entry, which would also cause a similar issue.</p><pre><code><code>Decoding at token level:
['&#2360;', '&#2379;', '&#2350;', '&#65533;', '&#65533;', '&#2366;', '&#2352;']
Decoding at input level:
&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;</code></code></pre><p>However, when you decode the entire sequence of tokens together, the tokenizer can correctly reconstruct the original word "&#2360;&#2379;&#2350;&#2348;&#2366;&#2352;" by combining the individual byte sequences represented by each token.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ey4g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png"><img src="https://substackcdn.com/image/fetch/$s_!Ey4g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe6287d-8c38-4a6d-aee7-deccc053a2e9_1412x1154.png" width="1412" height="1154" alt=""></a><figcaption class="image-caption">Open AI&#8217;s tiktoken Tokenizer Visualizer for Nepali. Note that there are several decoding issues here, because many of these tokens hold incomplete UTF-8 byte sequences. Also note that the number of tokens is greater than the number of characters: some characters are encoded as multiple bytes, and for some of these byte sequences the tiktoken tokenizer treats each byte as a separate token.</figcaption></figure></div><h4>Factors influencing high invalid byte sequences in vocab</h4><p>The quality and size of the training data in a particular language can lead the algorithm to learn a sub-optimal vocabulary. Languages with less diverse or smaller training datasets may exhibit higher rates of invalid byte sequences due to insufficient coverage of character combinations or linguistic phenomena.
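</p><p>The incomplete-sequence failure mode described above is easy to reproduce directly. A small sketch (my own illustration, relying only on Python&#8217;s built-in UTF-8 handling):</p>

```python
word = "सोमबार"  # "Monday" in Nepali: 6 Devanagari code points
data = word.encode("utf-8")
print(len(word), len(data))  # 6 characters, 18 bytes (3 bytes per character)

# A slice that ends mid-character mimics a token holding an incomplete sequence:
prefix = data[:10]  # 9 bytes cover "सोम"; the 10th is the first byte of "ब"
print(prefix.decode("utf-8", errors="replace"))  # "सोम" plus U+FFFD

# Decoding the full byte sequence reconstructs the word cleanly:
print(data.decode("utf-8"))
```

<p>Decoding each token&#8217;s bytes in isolation fails the same way, while concatenating all the token bytes before decoding succeeds, which matches the walkthrough above.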
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BWQo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png"><img src="https://substackcdn.com/image/fetch/$s_!BWQo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79184d26-49c1-4290-8c06-141b4b4dd209_1412x580.png" width="1412" height="580" alt=""></a><figcaption class="image-caption">No decoding error for Latin characters, which have a one-byte sequence in UTF-8 encoding</figcaption></figure></div><p>In addition to the quality and size of training data, some of the factors that influence high invalid byte sequences in vocab are:</p><ol><li><p><strong>Script Complexity: </strong>Languages with more complex scripts, such as those written in non-Latin scripts like Devanagari, Thai, or Chinese characters, may have a higher likelihood of invalid byte sequences in the vocab.
These scripts often have a larger number of characters and more complex character compositions, leading to a wider range of possible byte sequences and potential challenges in tokenization.</p></li><li><p><strong>Character Frequency: </strong>Characters that are less frequent in the training data may have their byte sequences split more frequently during merging, increasing the likelihood of incomplete tokens.</p></li><li><p><strong>Word Morphology:</strong> Languages with rich morphology, such as agglutinative languages, may exhibit a larger number of morphemes or affixes, leading to more opportunities for byte sequences to be split during tokenization.</p></li></ol><h3>Can training with more multi-lingual data solve this?</h3><p>Looking at the vocab, we can infer that GPT-4 was heavily optimized towards English. In this section, I will compare the GPT-4 tokenizer with two other byte-based BPE tokenizers that were trained with multilingual data: <a href="https://huggingface.co/docs/transformers/v4.17.0/en/model_doc/xlm-roberta#transformers.XLMRobertaTokenizer">XLM-RoBERTa</a> and <a href="https://huggingface.co/docs/transformers/en/model_doc/nllb#nllbtokenizer">NLLB-200-distilled-600M</a>. The purpose of this study is to see if and how exposure to more multilingual data during training affects tokenization. I chose these two tokenizers in particular because <a href="https://denyslinkov.medium.com/why-is-gpt-3-15-77x-more-expensive-for-certain-languages-2b19a4adc4bc">Denys Linkov&#8217;s blog</a> shows that the ratio between the largest and smallest token counts is lowest for these two tokenizers in comparison to the others he compared. 
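</p><p>Classifying vocab entries by script, as in the comparison that follows, can be sketched with a simple heuristic. This is my own illustration using Unicode character names, not the exact method used in any of these analyses:</p>

```python
import unicodedata

def has_non_latin(token: str) -> bool:
    """True if any alphabetic character in the token is from a non-Latin script."""
    return any(
        ch.isalpha() and "LATIN" not in unicodedata.name(ch, "")
        for ch in token
    )

# Toy stand-in for a real tokenizer vocabulary:
vocab = ["the", "ation", "né", "सोम", "中"]
print([t for t in vocab if has_non_latin(t)])  # ['सोम', '中']
```

<p>Counting such entries over each decoded vocabulary gives a rough picture of how much of it is devoted to non-Latin scripts.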
</p><h4>A more diverse and distributed vocabulary</h4><p>NLLB and XLM-RoBERTa demonstrate significantly more diverse vocabularies compared to GPT-4&#8217;s cl100k_base vocab:</p><ul><li><p><strong>Non-Latin characters:</strong> NLLB and XLM-RoBERTa contain roughly <strong>79.53%</strong> and <strong>83.62%</strong> non-Latin entries respectively, while cl100k_base only has <strong>29.2%</strong>. This indicates that NLLB and XLM-RoBERTa can handle a wider range of languages beyond Latin-based ones.</p></li><li><p><strong>Vocabulary size:</strong> NLLB and XLM-RoBERTa have much larger vocabularies, with <strong>2.55</strong> and <strong>2.49 times</strong> more entries than cl100k_base and sub-tokens more evenly distributed across languages.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zOuv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png"><img src="https://substackcdn.com/image/fetch/$s_!zOuv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702d5f48-9d67-4216-ae18-5987bd99dbef_1404x982.png" width="1404" height="982" alt=""></a><figcaption class="image-caption">Vocab counts for cl100k_base, NLLB-200-distilled-600M, and XLM-RoBERTa. Note that non-Latin entries also include vocab entries with non-Latin characters that do not necessarily belong to a specific language, like &#8220;&gt;&#8221;.</figcaption></figure></div><p>I also found that the cl100k_base vocab contains a significantly higher number of entries representing incomplete byte sequences: roughly <strong>29.7 times</strong> and <strong>25.1 times</strong> more than NLLB and XLM-RoBERTa respectively. The limited exposure to non-Latin byte sequences during training might explain the large number of incomplete sequences in the cl100k_base vocab. As mentioned in an earlier section, a smaller or less diverse multilingual corpus could restrict the model's ability to learn the proper representation of uncommon text sequences.</p><h3>Aya Dataset</h3><p>For this study, I used the <a href="https://arxiv.org/pdf/2402.06619.pdf">Aya Dataset</a>, which contains human-curated prompt-completion pairs in 65 languages written by fluent speakers of those languages. I chose this dataset for three reasons: (1) it has diverse sequences in terms of lengths and topics, (2) it contains all languages of interest, and (3) since it is human-curated, the dataset is of high quality, which is what I observed for English, Nepali, and Hindi.</p><p>I took the texts in the <code>inputs</code> column of the dataset and kept a maximum of 1500 samples for each language. 
The total samples in the final split for the five languages of interest are:</p><ul><li><p>English - 1499</p></li><li><p>French - 1349</p></li><li><p>Spanish - 1500</p></li><li><p>Hindi - 1087</p></li><li><p>Nepali - 1500</p></li></ul><h3>Results</h3><h4>Tokenization Trends Across Languages</h4><p>Although I used a different dataset, I observed a similar trend in token length distribution as discussed in the articles &#8220;<a href="https://www.artfish.ai/p/all-languages-are-not-created-tokenized">All languages are NOT created (tokenized) equal</a>&#8221; and &#8220;<a href="https://denyslinkov.medium.com/why-is-gpt-3-15-77x-more-expensive-for-certain-languages-2b19a4adc4bc">Why is GPT-3 15.77x more expensive for certain languages?</a>&#8221;. </p><p><strong>Inspired by the first work, I have created <a href="https://huggingface.co/spaces/shreeyad/tokenizers-multilingual">a similar dashboard</a> for this work.</strong></p><p>The distribution of token lengths for the non-English languages (French, Spanish, Hindi, and Nepali) was closer to that of English for the NLLB and XLM-RoBERTa tokenizers than for the GPT-4 tokenizer. 
However, for the GPT-4 tokenizer, the token distributions for the non-Latin languages (Hindi and Nepali) were very different from that of English, with consistently higher token counts across the samples.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nYgc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nYgc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 424w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 848w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 1272w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nYgc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png" width="1418" height="918" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:918,&quot;width&quot;:1418,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316689,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nYgc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 424w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 848w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 1272w, https://substackcdn.com/image/fetch/$s_!nYgc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa128301b-befa-4b99-a947-2fedc7bdd09a_1418x918.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2m-N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2m-N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 424w, https://substackcdn.com/image/fetch/$s_!2m-N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 848w, 
https://substackcdn.com/image/fetch/$s_!2m-N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 1272w, https://substackcdn.com/image/fetch/$s_!2m-N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2m-N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png" width="1400" height="892" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:892,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:299749,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2m-N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 424w, https://substackcdn.com/image/fetch/$s_!2m-N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 848w, 
https://substackcdn.com/image/fetch/$s_!2m-N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 1272w, https://substackcdn.com/image/fetch/$s_!2m-N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbb5f007-ada4-43f0-b249-1b78005d3290_1400x892.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The median token length for non-Latin languages (Hindi and Nepali) is only slightly higher than for Latin languages (English, French, 
Spanish), 17 vs. 16, for the NLLB and RoBERTa tokenizers. However, the GPT-4 tokenizer exhibits a significantly larger difference, with median token lengths of 62 for Hindi and Nepali vs. 16 for English, French, and Spanish.</p><p>This observation suggests that <strong>training on a more comprehensive multilingual corpus can influence token length distribution</strong>. NLLB and RoBERTa, likely trained on broader datasets, show a smaller difference in token lengths between Latin and non-Latin languages compared to the GPT-4 tokenizer, which might have been trained on a less diverse corpus.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DU86!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DU86!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 424w, https://substackcdn.com/image/fetch/$s_!DU86!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 848w, https://substackcdn.com/image/fetch/$s_!DU86!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 1272w, https://substackcdn.com/image/fetch/$s_!DU86!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!DU86!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png" width="1408" height="932" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:932,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:266109,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DU86!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 424w, https://substackcdn.com/image/fetch/$s_!DU86!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 848w, https://substackcdn.com/image/fetch/$s_!DU86!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 1272w, https://substackcdn.com/image/fetch/$s_!DU86!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e7b4fff-6b94-4f0a-99b6-e08000f876e5_1408x932.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There were no replacement tokens for any sample in the dataset for NLLB and XLM-RoBERTa, while there was a fair number of replacement tokens for the non-Latin languages with the GPT-4 tokenizer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6NOj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!6NOj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 424w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 848w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 1272w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6NOj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png" width="1412" height="932" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:932,&quot;width&quot;:1412,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:253758,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!6NOj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 424w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 848w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 1272w, https://substackcdn.com/image/fetch/$s_!6NOj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F750431df-146b-4cd3-a4c8-b8674d91e825_1412x932.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Graphemes vs Token Counts</strong></h3><p>I compared the grapheme count (number of written characters) to the token count (number of tokens after tokenization) for the above tokenizers and observed that GPT-4&#8217;s tokenizer stands out with a much higher token count compared to its grapheme count for Hindi and Nepali. </p><p>While all three models utilize BPE (Byte Pair Encoding), NLLB and RoBERTa tokenizers, likely trained on broader multilingual datasets, would have encountered various writing systems and grammatical structures. This exposure allows them to adapt their tokenization strategies to handle the complexities of non-Latin languages.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DBr1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DBr1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 424w, https://substackcdn.com/image/fetch/$s_!DBr1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 848w, 
https://substackcdn.com/image/fetch/$s_!DBr1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 1272w, https://substackcdn.com/image/fetch/$s_!DBr1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DBr1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png" width="1422" height="836" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:836,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:503063,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DBr1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 424w, https://substackcdn.com/image/fetch/$s_!DBr1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 848w, 
https://substackcdn.com/image/fetch/$s_!DBr1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 1272w, https://substackcdn.com/image/fetch/$s_!DBr1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e5fef7-7396-4fef-8072-552a4c83e123_1422x836.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!D48Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D48Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 424w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 848w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 1272w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D48Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png" width="1432" height="850" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:850,&quot;width&quot;:1432,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:505051,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D48Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 424w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 848w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 1272w, https://substackcdn.com/image/fetch/$s_!D48Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64de6172-085a-404a-aaa8-6a88f79b39c9_1432x850.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>GPT-4 tokenizer seems to have been heavily optimized for English and might not have been adequately exposed to the specific characteristics of non-Latin languages.  
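</p>

<p>For reference, the grapheme counts used in this comparison can be approximated with the Python standard library. This is a minimal sketch under a simplifying assumption: a grapheme is a base character plus any Unicode combining marks, which is how Devanagari vowel signs attach to their consonants.</p>

```python
import unicodedata

def grapheme_count(text):
    # Approximate grapheme clusters: do not count Unicode combining
    # marks (general categories Mn, Mc, Me) as separate characters,
    # so a consonant plus its vowel sign counts as one grapheme.
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))
```

<p>For example, <code>len("नेपाली")</code> is 6 codepoints, while <code>grapheme_count("नेपाली")</code> is 3, matching the three written characters.</p>

<p>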
As a way to manage unseen characters or complex word structures, the GPT-4 tokenizer seems to split words excessively into subwords, such that many of the subwords are incomplete or sub-byte sequences, inflating the token count relative to the grapheme count.</p><p>Nepali and Hindi both have complex morphology involving prefixes, suffixes, and other meaningful units, and limited exposure to such structures during training could hinder the GPT-4 tokenizer&#8217;s ability to tokenize these languages effectively.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ywML!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ywML!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 424w, https://substackcdn.com/image/fetch/$s_!ywML!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 848w, https://substackcdn.com/image/fetch/$s_!ywML!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 1272w, https://substackcdn.com/image/fetch/$s_!ywML!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ywML!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png" width="1456" height="846" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:846,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:412695,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ywML!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 424w, https://substackcdn.com/image/fetch/$s_!ywML!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 848w, https://substackcdn.com/image/fetch/$s_!ywML!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 1272w, https://substackcdn.com/image/fetch/$s_!ywML!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbae78e48-d797-494e-bda1-7e4d140ddca5_1456x846.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Does this affect the overall tokenization time?</h4><p>The peak distribution of time taken for tokenization by NLLB and RoBERTa is around <strong>2.2 seconds</strong> and <strong>2.0 seconds</strong>, respectively. cl100k_base is significantly faster, with a peak distribution of time at <strong>0.0006 seconds</strong>. However, tokenization speed varies only slightly across languages for any given tokenizer. 
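</p><p>Such timings can be gathered with a small harness built on the standard library alone. The sketch below is a hypothetical illustration, not the original benchmark code: str.split is only a stand-in tokenizer, and you would substitute the encode call of the tokenizer under test (e.g. tiktoken's cl100k_base or a Hugging Face tokenizer).</p><pre><code>import statistics
import time

def median_tokenize_time(tokenize, texts, repeats=5):
    """Median wall-clock time (seconds) to tokenize a corpus once."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            tokenize(text)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Stand-in corpus and stand-in tokenizer; swap str.split for a real encoder.
corpus = ["नेपाली भाषाको टोकन गणना", "tokenization speed test"] * 2000
print(median_tokenize_time(str.split, corpus))
</code></pre><p>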
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9krw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9krw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 424w, https://substackcdn.com/image/fetch/$s_!9krw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 848w, https://substackcdn.com/image/fetch/$s_!9krw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 1272w, https://substackcdn.com/image/fetch/$s_!9krw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9krw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png" width="1416" height="920" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:920,&quot;width&quot;:1416,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:312823,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9krw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 424w, https://substackcdn.com/image/fetch/$s_!9krw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 848w, https://substackcdn.com/image/fetch/$s_!9krw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 1272w, https://substackcdn.com/image/fetch/$s_!9krw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c20503e-e34a-4909-b544-2e41435bbaba_1416x920.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!afNC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!afNC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 424w, https://substackcdn.com/image/fetch/$s_!afNC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 848w, 
https://substackcdn.com/image/fetch/$s_!afNC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 1272w, https://substackcdn.com/image/fetch/$s_!afNC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!afNC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png" width="1422" height="910" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:910,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:340804,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!afNC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 424w, https://substackcdn.com/image/fetch/$s_!afNC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 848w, 
https://substackcdn.com/image/fetch/$s_!afNC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 1272w, https://substackcdn.com/image/fetch/$s_!afNC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d457671-e8bc-40d9-8e2d-1a836731c943_1422x910.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This metric was collected on a device with the following configuration:</p><p><em>Courtesy: Infinity Technology
Inc.</em></p><pre><code>Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  48
  On-line CPU(s) list:   0-47
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
    CPU family:          6
    Model:               63
    Thread(s) per core:  2
    Core(s) per socket:  12
    Socket(s):           2
    Stepping:            2
    CPU max MHz:         3500.0000
    CPU min MHz:         1200.0000
Caches (sum of all):     
  L1d:                   768 KiB (24 instances)
  L1i:                   768 KiB (24 instances)
  L2:                    6 MiB (24 instances)
  L3:                    60 MiB (2 instances)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hCFb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hCFb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 424w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 848w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 1272w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hCFb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png" width="1422" height="948" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:948,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:248487,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hCFb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 424w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 848w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 1272w, https://substackcdn.com/image/fetch/$s_!hCFb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332261d5-c053-461d-aef5-cd2d68bcad73_1422x948.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>NLLB and RoBERTa both have larger vocabularies, so it is natural that they take more time to map the input text to the corresponding tokens. However, the relationship between tokenization speed and vocabulary size is not linear. </p><h3>Conclusion</h3><p>To conclude, in this article I explored the impact of training data and character representation on tokenization discrepancies for byte-based BPE tokenizers across languages. While the analysis focused on the Indo-European language family, the findings suggest that these disparities primarily stem from the models' exposure during training and from how characters are represented in Unicode.</p><p><strong>Key observations:</strong> </p><ul><li><p>The GPT-4 vocabulary differed significantly from those of NLLB-200-distilled-600M and XLM-RoBERTa, both in size and in the distribution of non-Latin tokens.</p></li><li><p>Token length distribution varied across languages for all the tokenizers. 
While the variation is not significant for NLLB-200-distilled-600M and XLM-RoBERTa, the token counts for non-Latin languages are much higher for the GPT-4 tokenizer.</p></li><li><p>The GPT-4 tokenizer showed a large discrepancy between grapheme count and token length for non-Latin languages.</p></li><li><p>Tokenization speed varied only slightly across the languages for each tokenizer, but the GPT-4 tokenizer was faster by orders of magnitude compared to the NLLB-200-distilled-600M and XLM-RoBERTa tokenizers.</p></li></ul><p><strong>Future considerations:</strong></p><ul><li><p>Investigating the tokenization strategies employed by each model in more detail.</p></li><li><p>Exploring the performance of these models on downstream tasks involving non-Latin languages.</p></li><li><p>Exploring whether the high token counts in GPT-4 translate to poor generation and language understanding capabilities.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Scientific Research Ontology]]></title><description><![CDATA[The primary goal of a research paper is to convince its readers and scientific communities of the novelty and relevance of the study presented in the paper [1].]]></description><link>https://www.icodeformybhasa.com/p/scientific-research-ontology</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/scientific-research-ontology</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Sun, 18 Apr 2021 11:22:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1bBj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0475d4cb-b408-48db-9ee0-a5bca7e39b82_1475x1428.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The primary goal of a research paper is to convince its readers and scientific communities of the novelty and relevance of the study presented in the paper [1]. Scientific publications are targeted towards a certain community of readers, which is why scientific argumentation takes a predictable structure [2]. In this document, I describe two different models of the structure of scientific texts: rhetorical and argumentative. The nodes we describe here, for both argumentation and rhetorical models, are taken directly from the annotations of the Dr. Inventor corpus [3]. 
I am working on finding new relations that hold between rhetorical components.</p><h3>Scientific Argumentation</h3><p>In this section I will attempt to describe scientific texts in terms of argumentative components and relations. Argumentative components form the nodes of the argumentation graph, and the relations form the arcs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1bBj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0475d4cb-b408-48db-9ee0-a5bca7e39b82_1475x1428.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1bBj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0475d4cb-b408-48db-9ee0-a5bca7e39b82_1475x1428.jpeg 424w, https://substackcdn.com/image/fetch/$s_!1bBj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0475d4cb-b408-48db-9ee0-a5bca7e39b82_1475x1428.jpeg 848w, https://substackcdn.com/image/fetch/$s_!1bBj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0475d4cb-b408-48db-9ee0-a5bca7e39b82_1475x1428.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!1bBj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0475d4cb-b408-48db-9ee0-a5bca7e39b82_1475x1428.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1bBj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0475d4cb-b408-48db-9ee0-a5bca7e39b82_1475x1428.jpeg" width="592" height="573.2967032967033" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0475d4cb-b408-48db-9ee0-a5bca7e39b82_1475x1428.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1410,&quot;width&quot;:1456,&quot;resizeWidth&quot;:592,&quot;bytes&quot;:210650,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1bBj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0475d4cb-b408-48db-9ee0-a5bca7e39b82_1475x1428.jpeg 424w, https://substackcdn.com/image/fetch/$s_!1bBj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0475d4cb-b408-48db-9ee0-a5bca7e39b82_1475x1428.jpeg 848w, https://substackcdn.com/image/fetch/$s_!1bBj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0475d4cb-b408-48db-9ee0-a5bca7e39b82_1475x1428.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!1bBj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0475d4cb-b408-48db-9ee0-a5bca7e39b82_1475x1428.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hfh8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90c986a2-3108-47de-bb51-daf21f5320d0_1097x1500.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hfh8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90c986a2-3108-47de-bb51-daf21f5320d0_1097x1500.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Hfh8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90c986a2-3108-47de-bb51-daf21f5320d0_1097x1500.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!Hfh8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90c986a2-3108-47de-bb51-daf21f5320d0_1097x1500.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Hfh8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90c986a2-3108-47de-bb51-daf21f5320d0_1097x1500.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hfh8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90c986a2-3108-47de-bb51-daf21f5320d0_1097x1500.jpeg" width="598" height="817.6845943482224" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90c986a2-3108-47de-bb51-daf21f5320d0_1097x1500.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1500,&quot;width&quot;:1097,&quot;resizeWidth&quot;:598,&quot;bytes&quot;:306190,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hfh8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90c986a2-3108-47de-bb51-daf21f5320d0_1097x1500.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Hfh8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90c986a2-3108-47de-bb51-daf21f5320d0_1097x1500.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!Hfh8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90c986a2-3108-47de-bb51-daf21f5320d0_1097x1500.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Hfh8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90c986a2-3108-47de-bb51-daf21f5320d0_1097x1500.jpeg 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Argumentative Components</h3><p>Dr. Inventor has been annotated for three argumentative components by [4]. 
The span of an argumentative component is not limited to sentence boundaries; components can be of any length and may span multiple sentences. The three argumentative components are described below.</p><ol><li><p><strong>Own claims</strong> represent general argumentative statements or claims made by the author that closely relate to the author's own work.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OC7K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e2f906-5351-4e9d-952b-c0f65ad1f760_1489x676.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OC7K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e2f906-5351-4e9d-952b-c0f65ad1f760_1489x676.jpeg 424w, https://substackcdn.com/image/fetch/$s_!OC7K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e2f906-5351-4e9d-952b-c0f65ad1f760_1489x676.jpeg 848w, https://substackcdn.com/image/fetch/$s_!OC7K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e2f906-5351-4e9d-952b-c0f65ad1f760_1489x676.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!OC7K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e2f906-5351-4e9d-952b-c0f65ad1f760_1489x676.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OC7K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e2f906-5351-4e9d-952b-c0f65ad1f760_1489x676.jpeg" width="458" 
height="207.92445054945054" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70e2f906-5351-4e9d-952b-c0f65ad1f760_1489x676.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:661,&quot;width&quot;:1456,&quot;resizeWidth&quot;:458,&quot;bytes&quot;:87959,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OC7K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e2f906-5351-4e9d-952b-c0f65ad1f760_1489x676.jpeg 424w, https://substackcdn.com/image/fetch/$s_!OC7K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e2f906-5351-4e9d-952b-c0f65ad1f760_1489x676.jpeg 848w, https://substackcdn.com/image/fetch/$s_!OC7K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e2f906-5351-4e9d-952b-c0f65ad1f760_1489x676.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!OC7K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e2f906-5351-4e9d-952b-c0f65ad1f760_1489x676.jpeg 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><ol start="2"><li><p><strong>Background claims</strong> are general claims that relate to the background of the author's work.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!AYFM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c7ffb4-6790-42f5-895b-fb5e96fa12c4_1500x844.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AYFM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c7ffb4-6790-42f5-895b-fb5e96fa12c4_1500x844.jpeg 424w, https://substackcdn.com/image/fetch/$s_!AYFM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c7ffb4-6790-42f5-895b-fb5e96fa12c4_1500x844.jpeg 848w, https://substackcdn.com/image/fetch/$s_!AYFM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c7ffb4-6790-42f5-895b-fb5e96fa12c4_1500x844.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!AYFM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c7ffb4-6790-42f5-895b-fb5e96fa12c4_1500x844.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AYFM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c7ffb4-6790-42f5-895b-fb5e96fa12c4_1500x844.jpeg" width="462" height="259.875" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f2c7ffb4-6790-42f5-895b-fb5e96fa12c4_1500x844.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:462,&quot;bytes&quot;:93431,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AYFM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c7ffb4-6790-42f5-895b-fb5e96fa12c4_1500x844.jpeg 424w, https://substackcdn.com/image/fetch/$s_!AYFM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c7ffb4-6790-42f5-895b-fb5e96fa12c4_1500x844.jpeg 848w, https://substackcdn.com/image/fetch/$s_!AYFM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c7ffb4-6790-42f5-895b-fb5e96fa12c4_1500x844.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!AYFM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c7ffb4-6790-42f5-895b-fb5e96fa12c4_1500x844.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol start="3"><li><p><strong>Data</strong> node represents facts and information that support or contest a claim.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qiQV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2810eab-55f9-444e-a8d8-eb400de15c77_1500x671.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qiQV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2810eab-55f9-444e-a8d8-eb400de15c77_1500x671.jpeg 424w, https://substackcdn.com/image/fetch/$s_!qiQV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2810eab-55f9-444e-a8d8-eb400de15c77_1500x671.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!qiQV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2810eab-55f9-444e-a8d8-eb400de15c77_1500x671.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!qiQV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2810eab-55f9-444e-a8d8-eb400de15c77_1500x671.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qiQV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2810eab-55f9-444e-a8d8-eb400de15c77_1500x671.jpeg" width="474" height="211.93269230769232" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f2810eab-55f9-444e-a8d8-eb400de15c77_1500x671.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:651,&quot;width&quot;:1456,&quot;resizeWidth&quot;:474,&quot;bytes&quot;:74269,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qiQV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2810eab-55f9-444e-a8d8-eb400de15c77_1500x671.jpeg 424w, https://substackcdn.com/image/fetch/$s_!qiQV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2810eab-55f9-444e-a8d8-eb400de15c77_1500x671.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!qiQV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2810eab-55f9-444e-a8d8-eb400de15c77_1500x671.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!qiQV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2810eab-55f9-444e-a8d8-eb400de15c77_1500x671.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li></ol><h3>Argumentative Relations</h3><p>In the corpus, argumentative relations are composed of two argumentative text units, <em>Arg1</em> and <em>Arg2</em>. For our ontology, we plan to introduce a text field for each relation arc that contains the reasons for the relation. Three types of argumentative relations have been defined:</p><ol><li><p><strong>Support</strong> relations hold between two argument components, <em>Arg1</em> and <em>Arg2</em>. <em>Arg1 supports Arg2</em> means <em>Arg1</em> strengthens the claim being made in <em>Arg2</em>. 
</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NcbU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028433d2-c2fb-4215-8d03-b1b25a0d000d_1366x649.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NcbU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028433d2-c2fb-4215-8d03-b1b25a0d000d_1366x649.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NcbU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028433d2-c2fb-4215-8d03-b1b25a0d000d_1366x649.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NcbU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028433d2-c2fb-4215-8d03-b1b25a0d000d_1366x649.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NcbU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028433d2-c2fb-4215-8d03-b1b25a0d000d_1366x649.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NcbU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028433d2-c2fb-4215-8d03-b1b25a0d000d_1366x649.jpeg" width="618" height="293.61786237188875" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/028433d2-c2fb-4215-8d03-b1b25a0d000d_1366x649.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:649,&quot;width&quot;:1366,&quot;resizeWidth&quot;:618,&quot;bytes&quot;:86243,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NcbU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028433d2-c2fb-4215-8d03-b1b25a0d000d_1366x649.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NcbU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028433d2-c2fb-4215-8d03-b1b25a0d000d_1366x649.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NcbU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028433d2-c2fb-4215-8d03-b1b25a0d000d_1366x649.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NcbU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028433d2-c2fb-4215-8d03-b1b25a0d000d_1366x649.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol start="2"><li><p><strong>Oppose</strong> relations, like support, hold between two argument components, <em>Arg1</em> and <em>Arg2</em>. 
However, <em>Arg1 opposes Arg2</em> means <em>Arg1</em> weakens the claim being made in <em>Arg2</em>.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rj5I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a340bf9-6330-48a5-a798-08aaca628d57_1467x1236.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rj5I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a340bf9-6330-48a5-a798-08aaca628d57_1467x1236.jpeg 424w, https://substackcdn.com/image/fetch/$s_!rj5I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a340bf9-6330-48a5-a798-08aaca628d57_1467x1236.jpeg 848w, https://substackcdn.com/image/fetch/$s_!rj5I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a340bf9-6330-48a5-a798-08aaca628d57_1467x1236.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!rj5I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a340bf9-6330-48a5-a798-08aaca628d57_1467x1236.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rj5I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a340bf9-6330-48a5-a798-08aaca628d57_1467x1236.jpeg" width="510" height="429.7870879120879" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a340bf9-6330-48a5-a798-08aaca628d57_1467x1236.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1227,&quot;width&quot;:1456,&quot;resizeWidth&quot;:510,&quot;bytes&quot;:129008,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rj5I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a340bf9-6330-48a5-a798-08aaca628d57_1467x1236.jpeg 424w, https://substackcdn.com/image/fetch/$s_!rj5I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a340bf9-6330-48a5-a798-08aaca628d57_1467x1236.jpeg 848w, https://substackcdn.com/image/fetch/$s_!rj5I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a340bf9-6330-48a5-a798-08aaca628d57_1467x1236.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!rj5I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a340bf9-6330-48a5-a798-08aaca628d57_1467x1236.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol start="3"><li><p><strong>Sematically </strong>same relations hold between two argument components, <em>Arg1</em> and <em>Arg2</em> that are two occurrences of the same claim or data.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1OnC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db94196-e536-4dff-84b9-a7361b0ff36e_1446x601.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1OnC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db94196-e536-4dff-84b9-a7361b0ff36e_1446x601.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!1OnC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db94196-e536-4dff-84b9-a7361b0ff36e_1446x601.jpeg 848w, https://substackcdn.com/image/fetch/$s_!1OnC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db94196-e536-4dff-84b9-a7361b0ff36e_1446x601.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!1OnC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db94196-e536-4dff-84b9-a7361b0ff36e_1446x601.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1OnC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db94196-e536-4dff-84b9-a7361b0ff36e_1446x601.jpeg" width="568" height="236.0774550484094" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8db94196-e536-4dff-84b9-a7361b0ff36e_1446x601.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:601,&quot;width&quot;:1446,&quot;resizeWidth&quot;:568,&quot;bytes&quot;:118332,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1OnC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db94196-e536-4dff-84b9-a7361b0ff36e_1446x601.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!1OnC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db94196-e536-4dff-84b9-a7361b0ff36e_1446x601.jpeg 848w, https://substackcdn.com/image/fetch/$s_!1OnC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db94196-e536-4dff-84b9-a7361b0ff36e_1446x601.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!1OnC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db94196-e536-4dff-84b9-a7361b0ff36e_1446x601.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h3>Rhetorical View of Science</h3><p>Scientific publications are designed to persuade. Persuading a reader of the value and novelty of the research described in a scientific publication requires the author to make different rhetorical moves. The rhetorical view of science treats scientific texts as collections of rhetorical components, each associated with a specific rhetorical move. These rhetorical components are described in this section.</p><h4>Rhetorical Components</h4><p>Rhetorical components are sentences that serve different rhetorical roles in writing. Sentences in Dr. Inventor have been annotated for rhetorical roles. The span of a rhetorical component is limited to sentence boundaries, and no relations have been defined between these components. Based on the Dr. Inventor annotations, we have the following eight rhetorical components.</p><ol><li><p><strong>Background:</strong> Represents sentences that help the reader understand the overall problem being addressed in the publication. These include sentences that describe the commonly accepted knowledge and related work in the area of research.</p><p><em>Example: An early contribution concerning the animation of deformable objects is [Magnenat-Thalmann et al. 
1988], which considers the movement of a human hand.</em></p></li><li><p><strong>Approach:</strong> Sentences that describe the models, frameworks, and experimental setup of the research discussed in the publication.</p><p><em>Example: Our basic idea is to change the interpolation domain: we interpolate transformations itself instead of transformed vertex positions.</em></p></li><li><p><strong>Challenge:</strong> Describes the problem statement, current challenge, and gap in the area of research, as well as the motivation of the study.</p><p><em>Example: Although LBS is very fast and advantageous to graphics hardware, it suffers from inherent artifacts, known as &#8220;collapsing joints&#8221;, &#8220;twisting elbow problem&#8221; or a &#8220;candy-wrapper artifact&#8221;.</em></p></li><li><p><strong>Challenge_Hypothesis:</strong> Represents sentences that describe a current challenge and how it can possibly be addressed.</p><p><em>Example: We observe that we can help avoid the collapse problem by avoiding blending transformations that are so dissimilar.</em></p></li><li><p><strong>Challenge_Goal:</strong> Sentences that describe the current challenge/gap that is being addressed in the publication.</p><p><em>Example: The paper discusses also theoretical properties of rotation interpolation, essential to spherical blend skinning.</em></p></li><li><p><strong>Outcome:</strong> Includes sentences that describe the findings of the research.</p><p><em>Example: For small deformations, both algorithms produce similar results, as in the second row of Figure 6 (although a small loss of volume is noticeable even there).</em></p></li><li><p><strong>Outcome_Contribution:</strong> Special outcome sentences that describe how the research contributes to the area of research.</p><p><em>Example: In contrast to other methods, the SBS does not need any additional information, such as the example skins.</em></p></li><li><p><strong>Future Work:</strong> Describes the future research 
that can be done to improve the solution described in the publication.</p><p><em>Example: It would be interesting to find out how much can be the SBS results improved by a set of weights especially designed for SBS.</em></p></li></ol><h3>References</h3><p>[1] Swales, J. M. (1990). Genre analysis: English in academic and research settings. Cambridge University Press.</p><p>[2] Kintsch, W., &amp; van Dijk, T. A. (1978). Toward a model of text comprehension and production. Psychological review, 85(5), 363.</p><p>[3] Fisas, B., &amp; Padr&#243;, L. (2016). Multi-level annotation of rhetorical entities, relations and structures in scientific articles. In Proceedings of the 10th Linguistic Annotation Workshop (pp. 102-111).</p><p>[4] Lauscher, A., &amp; Glava&#353;, G. (2018). Argument Component Annotation: Towards Efficient Argument Analysis and Retrieval in Large Corpora. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 2558-2569).</p>]]></content:encoded></item><item><title><![CDATA[Understanding Toulmin Argumentation]]></title><description><![CDATA[Argumentation is core to human communication and decision-making processes.]]></description><link>https://www.icodeformybhasa.com/p/understanding-toulmin-argumentation</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/understanding-toulmin-argumentation</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Thu, 10 Dec 2020 20:06:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fd7d5f9c-fba2-4c1e-b61a-e954a62b25e1_940x788.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Argumentation is core to human communication and decision-making processes. Developed by philosopher Stephen Toulmin, the Toulmin model of Argumentation theory is an influential work in the field. This model provides a framework for analyzing and constructing arguments.</p><h3>What is the Toulmin Model? 
</h3><p>The Toulmin model presents arguments as more than just statements of fact or opinion. It models the complex interplay of claims, evidence, warrants, backing, qualifiers, and rebuttals that make up arguments. The Toulmin model emphasizes the importance of reasoning and evidence in supporting claims. It also provides a systematic approach to evaluating the strength of arguments. </p><h3>Key Components of the Toulmin Model</h3><p>The Toulmin model identifies six components in arguments:</p><ol><li><p>Claim: The assertion that the arguer seeks to demonstrate.</p></li><li><p>Data: The grounds that support the claim. This can include facts, statistics, examples, anecdotes, and expert opinions.</p></li><li><p>Warrant: The implicit assumption or explicit reasoning that connects the data to the claim. </p></li><li><p>Backing: Additional evidence or reasoning that supports the warrant. In cases where warrants are implied, backing provides support for the warrant by giving additional justification. </p></li><li><p>Qualifier: The qualifier specifies the degree of certainty associated with the claim. It acknowledges the limitations or conditions under which the claim may be valid.</p></li><li><p>Rebuttal: Counterarguments that challenge the claim. Rebuttals anticipate potential objections and address them preemptively.</p></li></ol><h3>Application of the Toulmin Model</h3><p>The Toulmin model can be applied to a variety of contexts. Common examples include academic writing, persuasive speeches, and legal arguments. By identifying and analyzing the components of an argument, individuals can evaluate the strengths and weaknesses of different positions and construct more effective arguments.</p><p>Writers and speakers can use the model to develop clear and coherent arguments, provide sufficient evidence to support their claims, and anticipate and address potential objections. 
The Toulmin model provides a structured approach for crafting compelling arguments.</p><h3>Example</h3><p>In the following example, I use the Toulmin model to argue that investing in low-resource language technologies is important for ensuring inclusivity and bridging the global digital divide.</p><p><strong>Claim:</strong> Investing in low-resource language technologies is critical for fostering linguistic diversity and ensuring digital inclusivity worldwide.</p><p><strong>Grounds:</strong> Presently, a considerable segment of the global population lacks adequate access to computing resources, primarily due to language disparities and limited technological infrastructure. Studies by UNESCO reveal that over half of the world's languages face the risk of extinction by the century's end, with many spoken by marginalized communities lacking technological access.</p><p><strong>Warrant:</strong> Through investments in low-resource language support, we can develop tools and resources tailored to the diverse linguistic needs of populations worldwide. Such efforts would enable individuals from all linguistic backgrounds to access information and engage in digital communication.</p><p><strong>Backing:</strong> Initiatives such as the Global Voices project and UNESCO's efforts to promote linguistic diversity highlight the importance of supporting low-resource languages in digital environments. These endeavors aim to develop NLP technologies that cater to diverse linguistic needs, enabling individuals from all backgrounds to access and contribute to digital content.</p><p><strong>Qualifier:</strong> Although investing in low-resource language support may require initial funding and collaborative efforts among various stakeholders, the long-term benefits are manifold. 
By ensuring equitable access to language technologies, we can promote linguistic diversity, preserve cultural heritage, and foster social inclusion globally.</p><p><strong>Rebuttal:</strong> Some may argue against investing in low-resource language support, citing concerns about cost-effectiveness compared to focusing on widely spoken languages or commercially lucrative markets. However, this perspective overlooks the ethical obligation to provide equitable access to technological advancements and digital opportunities, irrespective of linguistic backgrounds or geographical locations.</p>]]></content:encoded></item><item><title><![CDATA[Derivation in Nepali]]></title><description><![CDATA[In linguistics, the formation of new words from free morphemes is a very common process.]]></description><link>https://www.icodeformybhasa.com/p/an-introduction-to-derivation-in</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/an-introduction-to-derivation-in</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Wed, 20 Dec 2017 00:33:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/24236fe9-a17b-4cfb-8edd-1b0e3a593c6d_1267x942.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In linguistics, the formation of new words from free morphemes is a very common process. The process of word formation (&#2358;&#2348;&#2381;&#2342;&#2344;&#2367;&#2352;&#2381;&#2350;&#2366;&#2339; / &#2358;&#2348;&#2381;&#2342;&#2352;&#2330;&#2344;&#2366;) in Nepali can be divided into inflection (&#2352;&#2369;&#2346;&#2366;&#2351;&#2344;) and derivation (&#2357;&#2381;&#2351;&#2369;&#2340;&#2381;&#2346;&#2366;&#2342;&#2344;). 
In this blog post, we'll briefly discuss derivation and its types in Nepali.</p><p>You can read about inflections in Nepali at:</p><ul><li><p><a href="https://icodeformybhasa.github.io/intro-to-written-nepali">Exploring the Building Blocks: A Simple Guide to Morphemes in Nepali</a></p></li><li><p><a href="https://icodeformybhasa.github.io/nominal-inflection-in-nepali">Nominal Inflections: The Morphological Adaptations of Nepali Nouns, Pronouns and Adjectives</a></p></li><li><p><a href="https://icodeformybhasa.github.io/verbal-inflections-in-nepali">Verbal Inflections: The Varied Forms of Nepali Verbs</a></p></li></ul><h2><strong>Derivation in Nepali</strong></h2><p>Derivation is a process of forming new words from existing roots. In contrast to inflection, which maintains the lexical category of the root, derivation can change the lexical category of the root. Derivation in Nepali is a highly productive process. It can be categorized into four types:</p><ul><li><p>Affixation (&#2360;&#2352;&#2381;&#2327; &#2346;&#2381;&#2352;&#2325;&#2381;&#2352;&#2367;&#2351;&#2366;)</p></li><li><p>Compounding (&#2360;&#2350;&#2366;&#2360; &#2346;&#2381;&#2352;&#2325;&#2381;&#2352;&#2367;&#2351;&#2366;)</p></li><li><p>Reduplication (&#2342;&#2381;&#2357;&#2367;&#2340;&#2381;&#2351; &#2346;&#2381;&#2352;&#2325;&#2381;&#2352;&#2367;&#2351;&#2366;)</p></li><li><p>Euphonic Combination (&#2360;&#2344;&#2381;&#2343;&#2367; &#2346;&#2381;&#2352;&#2325;&#2381;&#2352;&#2367;&#2351;&#2366;)</p></li></ul><h2><strong>Affixation</strong></h2><p>Affixation is the process of word formation by attaching bound morphemes (affixes) to roots. In Nepali, there are two types of affixes, prefix (&#2313;&#2346;&#2360;&#2352;&#2381;&#2327;) and suffix (&#2346;&#2381;&#2352;&#2340;&#2381;&#2351;&#2351;), and hence two types of affixation, namely prefixation and suffixation.</p><pre><code><code># Suffixation
&#2312;&#2326; + &#2310;&#2354;&#2369; = &#2312;&#2326;&#2366;&#2354;&#2369;
&#2346;&#2361;&#2366;&#2337; + &#2311;&#2351;&#2366; = &#2346;&#2361;&#2366;&#2337;&#2367;&#2351;&#2366;
&#2346;&#2370;&#2352;&#2381;&#2357; + &#2319;&#2354;&#2368; = &#2346;&#2370;&#2352;&#2381;&#2357;&#2375;&#2354;&#2368;

# Prefixation
&#2309;&#2343;&#2367; + &#2325;&#2371;&#2340; = &#2309;&#2343;&#2367;&#2325;&#2371;&#2340;
&#2313;&#2346; + &#2344;&#2366;&#2350; = &#2313;&#2346;&#2344;&#2366;&#2350;
&#2342;&#2369;&#2360;&#2381; + &#2360;&#2366;&#2361;&#2360; = &#2342;&#2369;&#2360;&#2381;&#2360;&#2366;&#2361;&#2360;
</code></code></pre><h4><strong>Suffixation</strong></h4><p>Suffixation is a process of word formation by the addition of a bound morpheme to the end of the root. The bound morpheme involved in suffixation is called a suffix (&#2346;&#2381;&#2352;&#2340;&#2381;&#2351;&#2351;/&#2346;&#2352;&#2360;&#2352;&#2381;&#2327;). On the basis of the root they attach to, there are two types of suffixes in Nepali:</p><ul><li><p>Primary Suffix (&#2325;&#2371;&#2340;&#2381; &#2346;&#2381;&#2352;&#2340;&#2381;&#2351;&#2351;)</p></li><li><p>Secondary Suffix (&#2340;&#2342;&#2381;&#2357;&#2367;&#2340; &#2346;&#2381;&#2352;&#2340;&#2381;&#2351;&#2351;)</p></li></ul><p>A&nbsp;<strong>primary suffix (&#2325;&#2371;&#2340;&#2381; &#2346;&#2381;&#2352;&#2340;&#2381;&#2351;&#2351;)</strong>&nbsp;is added to the end of a verb root to form a primary derivative word (&#2325;&#2371;&#2342;&#2344;&#2381;&#2340; &#2358;&#2348;&#2381;&#2342;). A primary derivative word is a noun (&#2344;&#2366;&#2350;), verb (&#2325;&#2381;&#2352;&#2367;&#2351;&#2366;&#2346;&#2342;), adjective (&#2357;&#2367;&#2358;&#2375;&#2359;&#2339;) or adverb (&#2325;&#2381;&#2352;&#2367;&#2351;&#2366;&#2357;&#2367;&#2358;&#2375;&#2359;&#2339;).</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/GySns/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11951f85-2135-4643-b683-30de3ccf9652_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:523,&quot;title&quot;:&quot;Table 1: Examples of Primary Derivation&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/GySns/1/" width="730" height="523" frameborder="0" scrolling="no"></iframe><script 
type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>A&nbsp;<strong>secondary suffix (&#2340;&#2342;&#2381;&#2357;&#2367;&#2340; &#2346;&#2381;&#2352;&#2340;&#2381;&#2351;&#2351;)</strong>&nbsp;can be applied to anything but a verb root to form a secondary derivative word (&#2340;&#2342;&#2381;&#2357;&#2367;&#2340;&#2366;&#2344;&#2381;&#2340; &#2358;&#2348;&#2381;&#2342;). Such suffixes act on noun, pronoun, adjective, adverb and primary derivative word.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/VBRgG/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/303ce6de-d860-4222-89cf-267df9871700_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:613,&quot;title&quot;:&quot;Table 2: Examples of Secondary Derivation&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/VBRgG/1/" width="730" height="613" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>On the basis of their ability to maintain or change the lexical category of the base they get attached 
to, suffixes in Nepali can be of two types. They are:</p><ul><li><p>Class Maintaining Suffix</p></li><li><p>Class Changing Suffix</p></li></ul><p>A&nbsp;<strong>class maintaining suffix</strong>&nbsp;maintains the lexical category of the root after suffixation. Some of the class maintaining suffixes in Nepali are given in Table 3.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/BGU8K/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e1c4235-7c7a-4485-ac32-67ca760902a2_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:352,&quot;title&quot;:&quot;Table 3: Examples of Class Maintaining Suffixation&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/BGU8K/1/" width="730" height="352" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>A&nbsp;<strong>class changing suffix</strong>&nbsp;changes the lexical category of the root after suffixation. 
Some of the class changing suffixes in Nepali are given in Table 4.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/gsMBQ/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8680b06-34de-4a8b-ba82-227e3e18c393_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:492,&quot;title&quot;:&quot;Table 4: Examples of Class Changing Suffixation&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/gsMBQ/2/" width="730" height="492" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h4><strong>Prefixation</strong></h4><p>Prefixation is the process of word formation in which a bound morpheme is attached to the beginning of a root. A prefix (&#2313;&#2346;&#2360;&#2352;&#2381;&#2327;) is the bound morpheme involved in the process of prefixation. In Nepali, the process of prefixation produces only nouns, adjectives and adverbs.</p><p>Prefixes in Nepali can have various effects on the meaning of the base they get attached to. 
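</p><p>Computationally, affixation can be sketched as simple string operations. The toy Python functions below are my own illustration and deliberately ignore the sound changes (sandhi) that many of the real examples above involve:</p>

```python
# Toy sketch of affixation as plain string concatenation.
# Real Nepali affixation frequently involves sound changes
# (e.g. the suffixation examples above), which this ignores.

def attach_prefix(prefix: str, root: str) -> str:
    """Prefixation: a bound morpheme attached before the root."""
    return prefix + root

def attach_suffix(root: str, suffix: str) -> str:
    """Suffixation: a bound morpheme attached after the root."""
    return root + suffix

# These particular prefixation examples happen to be pure concatenation:
print(attach_prefix("उप", "नाम"))   # उपनाम
print(attach_prefix("अधि", "कृत"))  # अधिकृत
```

A realistic morphological generator would add a rule layer on top of this concatenation to handle the stem alternations shown in the examples.
<p>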
Some of the possible effects of prefixation on the meaning of the root are listed below.</p><ul><li><p>Expressing the lack of something</p></li><li><p>Expressing an excess of something</p></li><li><p>Negating the meaning of the root</p></li><li><p>Adding a special meaning to the root</p></li></ul><p>Prefixes in Nepali can be divided into three types:</p><ul><li><p>Prefixes from Sanskrit (&#2340;&#2340;&#2381;&#2360;&#2350; &#2313;&#2346;&#2360;&#2352;&#2381;&#2327;)</p></li><li><p>Prefixes from Nepali (&#2340;&#2340;&#2381;&#2349;&#2357; &#2313;&#2346;&#2360;&#2352;&#2381;&#2327;)</p></li><li><p>Prefixes from other sources (&#2310;&#2327;&#2344;&#2381;&#2340;&#2369;&#2325; &#2313;&#2346;&#2360;&#2352;&#2381;&#2327;)</p></li></ul><p><strong>Prefixes from Sanskrit</strong>&nbsp;are the prefixes taken from the Sanskrit language. Some such prefixes and their meanings are given in Table 5.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/EHJHm/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ee7d9073-074a-4e76-92ef-36bcf6929dbd_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:512,&quot;title&quot;:&quot;Table 5: Prefixes from Sanskrit and their Meanings&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/EHJHm/1/" width="730" height="512" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var 
r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p><strong>Prefixes from Nepali</strong>&nbsp;are the prefixes that come from the Nepali language itself. Some such prefixes and their meanings are given in Table 6.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/fcVf6/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f39092c0-3dfb-4f48-a49c-e404f7662c4e_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:422,&quot;title&quot;:&quot;Table 6: Prefixes from Nepali and their Meanings&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/fcVf6/1/" width="730" height="422" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p><strong>Prefixes from other sources</strong>&nbsp;are prefixes taken from sources other than Sanskrit and Nepali. 
Some of such prefixes and their meanings are given in Table 7.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/Jg6ZR/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a31c63c-1c6c-4dc5-9833-66df2e35c422_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:428,&quot;title&quot;:&quot;Table 7: Prefixes from other sources and their Meanings&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/Jg6ZR/1/" width="730" height="428" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h2><strong>Compounding</strong></h2><p>Compounding occurs when two or more words combine to form a new word. 
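</p><p>As a rough computational analogy, a compound can be sketched as the concatenation of its member words. The Python below is a toy illustration with function names of my own; it ignores the sound changes that many real compounds involve:</p>

```python
# Toy sketch: a compound as the concatenation of free words.
# Many Nepali compounds involve sound changes (e.g.
# सात कोसीको समूह -> सप्तकोसी below), which this toy ignores.

def compound(*words: str) -> str:
    """Join two or more independent words into one compound word."""
    return "".join(words)

# Examples from later sections that are pure concatenation:
print(compound("प्रिय", "जन"))  # प्रियजन
print(compound("रात", "दिन"))  # रातदिन
```

The interesting part of compounding, computationally, is exactly what this sketch leaves out: deciding which words may combine and which sound changes apply at the seam.
<p>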
In Nepali, there are six types of compounding, all of which are listed below:</p><ul><li><p>Determinative Compound (&#2340;&#2340;&#2381;&#2346;&#2369;&#2352;&#2369;&#2360; &#2360;&#2350;&#2366;&#2360;)</p></li><li><p>Numeral Compound (&#2342;&#2381;&#2357;&#2367;&#2327;&#2369; &#2360;&#2350;&#2366;&#2360;)</p></li><li><p>Appositional Compound (&#2325;&#2352;&#2381;&#2350;&#2343;&#2366;&#2352;&#2351; &#2360;&#2350;&#2366;&#2360;)</p></li><li><p>Adverbial Compound (&#2309;&#2357;&#2381;&#2351;&#2381;&#2351;&#2368;&#2349;&#2366;&#2357; &#2360;&#2350;&#2366;&#2360;)</p></li><li><p>Attributive Compound (&#2348;&#2361;&#2369;&#2348;&#2381;&#2352;&#2368;&#2361;&#2367; &#2360;&#2350;&#2366;&#2360;)</p></li><li><p>Copulative Compound (&#2342;&#2381;&#2357;&#2344;&#2381;&#2342;&#2381;&#2357; &#2360;&#2350;&#2366;&#2360;)</p></li></ul><pre><code><code>Determinative Compounding
&#2325;&#2366;&#2350;&#2354;&#2366;&#2312; &#2330;&#2379;&#2352; = &#2325;&#2366;&#2350;&#2330;&#2379;&#2352;
&#2332;&#2327;&#2381;&#2327;&#2366;&#2325;&#2379; &#2343;&#2344;&#2368; = &#2332;&#2327;&#2381;&#2327;&#2366;&#2343;&#2344;&#2368;
&#2349;&#2379;&#2325;&#2354;&#2375; &#2350;&#2352;&#2368; = &#2349;&#2379;&#2325;&#2350;&#2352;&#2368;
&#2352;&#2379;&#2327;&#2348;&#2366;&#2335; &#2350;&#2369;&#2325;&#2381;&#2340; = &#2352;&#2379;&#2327;&#2350;&#2369;&#2325;&#2381;&#2340;

Numeral Compounding
&#2360;&#2366;&#2340; &#2325;&#2379;&#2360;&#2368;&#2325;&#2379; &#2360;&#2350;&#2370;&#2361; = &#2360;&#2346;&#2381;&#2340;&#2325;&#2379;&#2360;&#2368;
&#2346;&#2366;&#2305;&#2330;&#2357;&#2335;&#2366; &#2346;&#2340;&#2381;&#2352;&#2325;&#2379; &#2360;&#2350;&#2370;&#2361; = &#2346;&#2334;&#2381;&#2330;&#2346;&#2366;&#2340;&#2381;&#2352;&#2379;
&#2330;&#2366;&#2352; &#2326;&#2339;&#2381;&#2337;&#2325;&#2379; &#2360;&#2350;&#2370;&#2361; = &#2330;&#2380;&#2326;&#2339;&#2381;&#2337;

Appositional Compounding
&#2347;&#2369;&#2347;&#2370; + &#2342;&#2367;&#2342;&#2368; = &#2347;&#2369;&#2347;&#2370;&#2342;&#2367;&#2342;&#2368;
&#2310;&#2350;&#2366; + &#2331;&#2379;&#2352;&#2368; = &#2310;&#2350;&#2366;&#2331;&#2379;&#2352;&#2368; 
&#2346;&#2381;&#2352;&#2367;&#2351; + &#2332;&#2344; = &#2346;&#2381;&#2352;&#2367;&#2351;&#2332;&#2344;

Adverbial Compounding
&#2351;&#2340;&#2366; + &#2313;&#2340;&#2366; = &#2351;&#2340;&#2366;&#2313;&#2340;&#2366;
&#2343;&#2381;&#2351;&#2366;&#2344; + &#2346;&#2370;&#2352;&#2381;&#2357;&#2325; = &#2343;&#2381;&#2351;&#2366;&#2344;&#2346;&#2370;&#2352;&#2381;&#2357;&#2325;
&#2351;&#2360; + &#2309;&#2352;&#2381;&#2341; = &#2351;&#2360;&#2352;&#2381;&#2341;

Attributive Compounding
&#2354;&#2366;&#2350;&#2379; &#2346;&#2366;&#2340; &#2349;&#2319;&#2325;&#2379; = &#2354;&#2350;&#2346;&#2366;&#2340;&#2375;
&#2330;&#2366;&#2352;&#2357;&#2335;&#2366; &#2350;&#2369;&#2326; &#2349;&#2319;&#2325;&#2379; = &#2330;&#2380;&#2350;&#2369;&#2326;&#2368;
&#2354;&#2366;&#2332; &#2344;&#2349;&#2319;&#2325;&#2379; = &#2344;&#2367;&#2352;&#2381;&#2354;&#2332;&#2381;&#2332;

Copulative Compounding
&#2330;&#2366;&#2354; &#2352; &#2330;&#2354;&#2344; = &#2330;&#2366;&#2354;&#2330;&#2354;&#2344; 
&#2340;&#2354; &#2352; &#2350;&#2366;&#2341;&#2367; = &#2340;&#2354;&#2350;&#2366;&#2341;&#2367; 
&#2343;&#2344; &#2352; &#2360;&#2350;&#2381;&#2346;&#2340;&#2381;&#2340;&#2367; = &#2343;&#2344;&#2360;&#2350;&#2381;&#2346;&#2340;&#2381;&#2340;&#2367;
</code></code></pre><h4><strong>Determinative Compound</strong></h4><p>A&nbsp;<strong>determinative compound</strong>&nbsp;is formed when the first word loses its case marker (&#2357;&#2367;&#2349;&#2325;&#2381;&#2340;&#2367; &#2330;&#2367;&#2344;&#2381;&#2361;) when combining with the second root. Such compounds express the meaning of the second root and the meaning of the first word is lost in the process. There are six types of determinative compounds, all of which are listed and described in Table 8.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/HNE5E/3/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b87f04bf-17e3-46b7-9303-043b64dd8802_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:739,&quot;title&quot;:&quot;Table 8: Determinative Compound&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/HNE5E/3/" width="730" height="739" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h4><strong>Numeral Compound</strong></h4><p>When a&nbsp;<strong>numeral adjective (&#2360;&#2329;&#2381;&#2326;&#2381;&#2351;&#2366;&#2357;&#2366;&#2330;&#2325; &#2357;&#2367;&#2358;&#2375;&#2359;&#2339;)</strong>&nbsp;and&nbsp;<strong>collective noun (&#2360;&#2350;&#2370;&#2361;&#2357;&#2366;&#2330;&#2325; &#2344;&#2366;&#2350;)</strong>&nbsp;combine, a numeral compound 
is formed. Some of such compounds are given below:</p><pre><code><code>Independent Form =&gt; Compound Word
&#2344;&#2380; &#2327;&#2375;&#2337;&#2366;&#2325;&#2379; &#2360;&#2350;&#2370;&#2361; =&gt; &#2344;&#2380;&#2327;&#2375;&#2337;&#2368;
&#2310;&#2336; &#2310;&#2344;&#2366;&#2325;&#2379; &#2360;&#2350;&#2370;&#2361; =&gt; &#2309;&#2336;&#2344;&#2381;&#2344;&#2368;
&#2340;&#2368;&#2344; &#2325;&#2379;&#2339;&#2325;&#2379; &#2360;&#2350;&#2370;&#2361; =&gt; &#2340;&#2381;&#2352;&#2367;&#2325;&#2379;&#2339;
&#2360;&#2366;&#2340; &#2315;&#2359;&#2367;&#2325;&#2379; &#2360;&#2350;&#2370;&#2361; =&gt; &#2360;&#2346;&#2381;&#2340;&#2315;&#2359;&#2367;
&#2344;&#2380;&#2357;&#2335;&#2366; &#2352;&#2366;&#2340;&#2381;&#2352;&#2367;&#2325;&#2379; &#2360;&#2350;&#2370;&#2361; =&gt; &#2344;&#2357;&#2352;&#2366;&#2340;&#2381;&#2352;&#2368;
</code></code></pre><h4><strong>Appositional Compound</strong></h4><p>An&nbsp;<strong>appositional compound</strong>&nbsp;is formed when:</p><ul><li><p>An adjective (&#2357;&#2367;&#2358;&#2375;&#2359;&#2339;) combines with an adjective</p></li><li><p>A noun (&#2344;&#2366;&#2350;) combines with a noun</p></li><li><p>An adjective combines with a noun</p></li><li><p>A metaphor (&#2313;&#2346;&#2350;&#2366;) combines with a word to be compared (&#2313;&#2346;&#2350;&#2375;&#2351;)</p></li><li><p>An attribute (&#2310;&#2352;&#2379;&#2346;) combines with a word to be attributed (&#2310;&#2352;&#2379;&#2346;&#2381;&#2351;)</p></li></ul><p>All the independent words forming an appositional compound have equal importance. Types and example of appositional compounds are given in Table 9.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/dVbGc/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93a0341e-85ef-410a-bf87-9f7d3d326772_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:357,&quot;title&quot;:&quot;Table 9: Appositional Compound&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/dVbGc/2/" width="730" height="357" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h4><strong>Adverbial Compound</strong></h4><p>An&nbsp;<strong>adverbial 
compound</strong>&nbsp;is usually an indeclinable word (&#2309;&#2357;&#2381;&#2351;&#2351;) formed when two indeclinable words or any other words combine. Some of such compounds are listed below:</p><pre><code><code>Independent Form =&gt; Compound Word
&#2310;&#2332; + &#2349;&#2379;&#2354;&#2367; =&gt; &#2310;&#2332;&#2349;&#2379;&#2354;&#2367;
&#2348;&#2367;&#2344;&#2366; + &#2346;&#2376;&#2360;&#2366; =&gt; &#2348;&#2367;&#2344;&#2366;&#2346;&#2376;&#2360;&#2366;
&#2352;&#2366;&#2340; + &#2342;&#2367;&#2344; =&gt; &#2352;&#2366;&#2340;&#2342;&#2367;&#2344;
&#2347;&#2354; + &#2360;&#2381;&#2357;&#2352;&#2370;&#2346; =&gt; &#2347;&#2354;&#2360;&#2381;&#2357;&#2352;&#2370;&#2346;
&#2351;&#2360; + &#2325;&#2366;&#2352;&#2339; =&gt; &#2351;&#2360;&#2325;&#2366;&#2352;&#2339;
</code></code></pre><h4><strong>Attributive Compound</strong></h4><p>An&nbsp;<strong>attributive compound</strong>&nbsp;is an adjective, which is formed when two or more words combine. An attributive compound expresses a new meaning, which is different from that of the independent words. The types of attributive compound are described in Table 10.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/tlttq/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c137d3fd-126c-4ac6-9884-c14e05aef659_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:563,&quot;title&quot;:&quot;Table 10: Attributive Compound&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/tlttq/2/" width="730" height="563" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h4><strong>Copulative Compound</strong></h4><p><strong>Copulative compounds</strong>&nbsp;are formed when two or more words connected by a conjunction combine by losing the conjunction. 
The types of copulative compound are listed in Table 11.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/6IBoI/3/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51f77ab2-e5a8-48c0-bd94-cd22951a0dbd_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:727,&quot;title&quot;:&quot;Table 11:  Copulative Compound&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/6IBoI/3/" width="730" height="727" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h2><strong>Reduplication</strong></h2><p>Reduplication (&#2342;&#2381;&#2357;&#2367;&#2340;&#2381;&#2357;) means repetition. It is a process of word formation in which a root or a part of the root repeats in order to derive a new word. Reduplication can be classified into three distinct types and they are:</p><ul><li><p>Total Reduplication (&#2346;&#2370;&#2352;&#2381;&#2339; &#2342;&#2381;&#2357;&#2367;&#2340;&#2381;&#2357;)</p></li><li><p>Partial Reduplication (&#2310;&#2306;&#2358;&#2367;&#2325; &#2342;&#2381;&#2357;&#2367;&#2340;&#2381;&#2357;)</p></li><li><p>Echo Reduplication (&#2310;&#2346;&#2352;&#2367;&#2357;&#2352;&#2381;&#2340;&#2367;&#2340; &#2342;&#2381;&#2357;&#2367;&#2340;&#2381;&#2357;)</p></li></ul><pre><code><code>Total Reduplication
&#2348;&#2366;&#2335;&#2379; + &#2348;&#2366;&#2335;&#2379; = &#2348;&#2366;&#2335;&#2376;&#2348;&#2366;&#2335;&#2379;
&#2340;&#2366;&#2344; + &#2340;&#2366;&#2344; = &#2340;&#2366;&#2344;&#2366;&#2340;&#2366;&#2344;
&#2340;&#2305; + &#2340;&#2305; = &#2340;&#2305;&#2340;&#2305;

Partial Reduplication
&#2360;&#2350;&#2381;&#2350;&#2366;&#2344; + &#2360;&#2350;&#2381;&#2350;&#2366;&#2344; = &#2360;&#2360;&#2350;&#2381;&#2350;&#2366;&#2344;
&#2333;&#2327;&#2337;&#2366; + &#2333;&#2327;&#2337;&#2366; = &#2333;&#2376;&#2333;&#2327;&#2337;&#2366;
&#2340;&#2351;&#2366;&#2352; + &#2340;&#2351;&#2366;&#2352; = &#2340;&#2350;&#2381;&#2340;&#2351;&#2366;&#2352;

Echo Reduplication
&#2340;&#2375;&#2354; + &#2340;&#2375;&#2354; = &#2340;&#2375;&#2354;&#2360;&#2375;&#2354;
&#2326;&#2335;&#2344; + &#2326;&#2335;&#2344; = &#2326;&#2335;&#2344;&#2346;&#2335;&#2344;
&#2328;&#2352; + &#2328;&#2352; = &#2328;&#2352;&#2360;&#2352;
</code></code></pre><h4><strong>Total Reduplication</strong></h4><p>New words are formed when an entire word repeats in the process of word formation. Such words add varied levels of intensity to the meaning of the root. Some examples are given below:</p><pre><code><code>Independent Form =&gt; Compound Word
&#2326;&#2354; &#2326;&#2354; =&gt; &#2326;&#2354;&#2326;&#2354;
&#2326;&#2375;&#2342; &#2326;&#2375;&#2342; =&gt; &#2326;&#2375;&#2342;&#2366;&#2326;&#2375;&#2342;
&#2340;&#2366;&#2344; &#2340;&#2366;&#2344; =&gt; &#2340;&#2366;&#2344;&#2366;&#2340;&#2366;&#2344;
</code></code></pre><h4><strong>Partial Reduplication</strong></h4><p>In partial reduplication, only a part of a word is repeated to form a new word. Some of such compounds are given below:</p><pre><code><code>Independent Form =&gt; Compound Word
&#2333;&#2327;&#2337;&#2366; &#2333;&#2327;&#2337;&#2366; =&gt; &#2333;&#2376;&#2333;&#2327;&#2337;&#2366;
&#2325;&#2360;&#2354;&#2375; &#2325;&#2360;&#2354;&#2375; =&gt; &#2325;-&#2325;&#2360;&#2354;&#2375;
&#2360;&#2354;&#2381;&#2354;&#2366;&#2361; &#2360;&#2354;&#2381;&#2354;&#2366;&#2361; =&gt; &#2360;&#2352;&#2360;&#2354;&#2381;&#2354;&#2366;&#2361;
</code></code></pre><h4><strong>Echo Reduplication</strong></h4><p>A word formed by echo reduplication is similar to one formed by total reduplication. The only difference is that the initial syllable of the second copy of the root is replaced by a similar-sounding substring. Echo reduplication expresses a higher level of intensity of the meaning of the root. Some such compounds are given below.</p><pre><code><code>Independent Form =&gt; Compound Word
&#2325;&#2375;&#2335;&#2379; &#2325;&#2375;&#2335;&#2379; =&gt; &#2325;&#2375;&#2335;&#2379;&#2360;&#2375;&#2335;&#2379;
&#2344;&#2352;&#2350; &#2344;&#2352;&#2350; =&gt; &#2344;&#2352;&#2350;&#2325;&#2352;&#2350;
&#2335;&#2366;&#2354;&#2381; &#2335;&#2366;&#2354;&#2381; =&gt; &#2335;&#2366;&#2354;&#2335;&#2369;&#2354;</code></code></pre><h2><strong>Euphonic Combination</strong></h2><p>Euphonic combination is the joining of two phonemes from two different words to derive a single word. Nepali has two categories of euphonic combination. The first contains the types inherited from Sanskrit (&#2340;&#2340;&#2381;&#2360;&#2350; &#2360;&#2344;&#2381;&#2343;&#2367;), since Nepali is derived from Sanskrit. The other contains the types specific to the Nepali language (&#2340;&#2340;&#2381;&#2349;&#2357; &#2360;&#2344;&#2381;&#2343;&#2367;).</p><pre><code><code>Sanskrit Euphonic Combination
&#2313;&#2346; + &#2309;&#2343;&#2381;&#2351;&#2325;&#2381;&#2359; = &#2313;&#2346;&#2366;&#2343;&#2381;&#2351;&#2325;&#2381;&#2359;
&#2350;&#2361;&#2366; + &#2311;&#2344;&#2381;&#2342;&#2381;&#2352; = &#2350;&#2361;&#2375;&#2344;&#2381;&#2342;&#2381;&#2352;
&#2309;&#2343;&#2367; + &#2310;&#2340;&#2381;&#2350;&#2366; = &#2309;&#2343;&#2381;&#2351;&#2366;&#2340;&#2381;&#2350;

Nepali Euphonic Combination
&#2350;&#2379;&#2335;&#2366; + &#2310;&#2311; = &#2350;&#2379;&#2335;&#2366;&#2311; 
&#2332;&#2366;&#2313; + &#2309;&#2340; = &#2332;&#2366;&#2357;&#2340;
&#2352;&#2366;&#2340;&#2368; + &#2323;&#2354;&#2368; = &#2352;&#2340;&#2381;&#2351;&#2380;&#2354;&#2368;</code></code></pre><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.icodeformybhasa.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading #icodeformy&#2349;&#2366;&#2359;&#2366;! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Negation in Nepali Verbs]]></title><description><![CDATA[Negation is used to express the opposite meaning of affirmative sentences.]]></description><link>https://www.icodeformybhasa.com/p/negation-in-nepali-verbs</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/negation-in-nepali-verbs</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Sun, 19 Nov 2017 21:52:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/bd350289-5627-4b15-a035-79909d940d80_1024x512.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Negation is used to express the opposite meaning of affirmative sentences. Negation in Nepali verbs takes place due to affixation(suffixation and prefixation). The negative case marker &#2344;(na) is either prefixed or suffixed with verb roots or verb forms to express negation.</p><h2><strong>Negation due to Prefixation</strong></h2><p>Negation in some verb forms occurs when the morpheme &#2344;(na) is prefixed to the verbs. 
Some of such verb forms are given below.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/WEvkH/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38490399-0d74-4d8a-b7b4-e9a8183852be_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:612,&quot;title&quot;:&quot;Table 1: Negation in Nepali Verbs due to Prefixation&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/WEvkH/1/" width="730" height="612" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h2><strong>Negation due to Suffixation</strong></h2><p>Negation in Nepali verbs can also occur due to suffixation of the morpheme &#2344;(na) to verb forms/ verb roots. 
The morpheme can occur either at the end or in the middle of the negated verb.</p><p>For the following verb forms, negation occurs due to suffixation.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/YlWZe/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/103a6b09-696c-46d8-bc93-5f9d6a08637e_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:371,&quot;title&quot;:&quot;Table 2: Negation in Nepali Verbs due to Suffixation&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/YlWZe/1/" width="730" height="371" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h2><strong>References</strong></h2><p><a href="https://ojs.ub.uni-konstanz.de/jsal/dissertations/diss-balaram.pdf">A Computational Analysis of Nepali Morphology: A Model for Natural Language Processing</a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.icodeformybhasa.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading #icodeformy&#2349;&#2366;&#2359;&#2366;! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Verbal Inflections: The Varied Forms of Nepali Verbs]]></title><description><![CDATA[In Nepali, verbs exhibit rich inflectional patterns, allowing for intricate nuances in meaning and expression.]]></description><link>https://www.icodeformybhasa.com/p/verbal-inflections-the-varied-forms</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/verbal-inflections-the-varied-forms</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Fri, 11 Aug 2017 17:56:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/85abf5e1-a370-42af-8c70-f81c5271b359_514x283.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In Nepali, verbs exhibit rich inflectional patterns, allowing for intricate nuances in meaning and expression. This richness stems from the language's extensive conjugation system, where verbs undergo various inflectional changes to convey details such as tense, aspect, mood, person, number, and honorifics. The combination of suffixation and sometimes auxiliary verbs enables Nepali verbs to capture a wide array of cultural nuances. For example, verbs can inflect to denote not only the time of an action but also its duration, completion status, and the speaker's attitude towards it. 
Additionally, Nepali verbs inflect for honorifics to reflect the speaker's respect for the subject.</p><p>Nepali verbs inflect to encode: Tense, Aspect, Mood, Person, Number, Gender and Honorifics.</p><h2><strong>Tense (&#2325;&#2366;&#2354;)</strong></h2><p>In Nepali, verbs exhibit three tenses to signify the timing of an action in relation to the moment of speaking: Past (&#2349;&#2369;&#2340; &#2325;&#2366;&#2354;), Present (&#2357;&#2352;&#2381;&#2340;&#2350;&#2366;&#2344; &#2325;&#2366;&#2354;) and Future (&#2349;&#2357;&#2367;&#2359;&#2381;&#2351;&#2340; &#2325;&#2366;&#2354;).</p><pre><code><code>Past (&#2349;&#2369;&#2340; &#2325;&#2366;&#2354;): &#2326;&#2366;&#2351;&#2379; (khaa-yo)
Present (&#2357;&#2352;&#2381;&#2340;&#2350;&#2366;&#2344; &#2325;&#2366;&#2354;): &#2326;&#2366;&#2344;&#2381;&#2331; (khaa-n-cha)
Future (&#2349;&#2357;&#2367;&#2359;&#2381;&#2351;&#2340; &#2325;&#2366;&#2354;): &#2326;&#2366;&#2344;&#2375;&#2331;&#2369; (khaa-ne-chu)
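
As a rough illustration, the three tense forms above can be generated by attaching romanized suffixes to the verb root. This toy table covers only the specific forms listed here (real suffixes also vary with person, number, and honorifics), and the names are hypothetical.

```python
# Toy suffix table for the forms listed above (root "khaa", to eat).
# Real conjugation also depends on person, number, and honorifics.
TENSE_SUFFIX = {
    "past": "yo",       # khaa + yo    -> khaayo    (third person singular)
    "present": "ncha",  # khaa + ncha  -> khaancha  (third person singular)
    "future": "nechu",  # khaa + nechu -> khaanechu (first person singular)
}

def conjugate(root, tense):
    return root + TENSE_SUFFIX[tense]

print(conjugate("khaa", "past"))    # khaayo
print(conjugate("khaa", "future"))  # khaanechu
```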
</code></code></pre><h2><strong>Aspect (&#2346;&#2325;&#2381;&#2359;)</strong></h2><p>While tense tells us when an action occurs, aspect tells us how that action unfolds or is perceived. In Nepali, verbs can be inflected to express 4 aspects: Perfect (&#2346;&#2370;&#2352;&#2381;&#2339; &#2346;&#2325;&#2381;&#2359;), Habitual (&#2309;&#2349;&#2381;&#2351;&#2360;&#2381;&#2340; &#2346;&#2325;&#2381;&#2359;), Imperfect (&#2309;&#2346;&#2370;&#2352;&#2381;&#2339; &#2346;&#2325;&#2381;&#2359;) and Inferential (&#2309;&#2332;&#2381;&#2334;&#2366;&#2340; &#2346;&#2325;&#2381;&#2359;).</p><h3><strong>Perfect Aspect (&#2346;&#2370;&#2352;&#2381;&#2339; &#2346;&#2325;&#2381;&#2359;)</strong></h3><p>The perfect aspect of a verb indicates that the action has been completed prior to the current moment. In Nepali grammar, the presence of the suffix "&#2319;&#2325;&#2379;" (eko) signals the perfect aspect. Notably, the suffix "&#2319;&#2325;&#2379;" (eko) undergoes inflection to match the number (singular or plural) and gender (masculine or feminine) of the subject involved in the action. Examples: &#2326;&#2366;&#2319;&#2325;&#2379; (khaa-eko), &#2326;&#2366;&#2319;&#2325;&#2366; (khaa-yeka)</p><pre><code><code>Examples of Perfect Aspect (&#2346;&#2370;&#2352;&#2381;&#2339; &#2346;&#2325;&#2381;&#2359;) Verbs
&#2326;&#2366;&#2319;&#2325;&#2379; (kha-eko) - masculine, singular
&#2326;&#2366;&#2319;&#2325;&#2366; (kha-eka) - masculine, plural
&#2326;&#2366;&#2319;&#2325;&#2368; (kha-eki) - feminine, singular
&#2326;&#2366;&#2319;&#2325;&#2366; (kha-eka) - feminine, plural
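
The agreement pattern in this list is regular enough to express as a small function. The sketch below (romanized; the function name is an assumption) picks the -eko/-eka/-eki variant from the subject's gender and number.

```python
def perfect_participle(stem, gender="masculine", plural=False):
    # The perfect aspect marker agrees with the subject:
    #   -eko  masculine singular (khaeko)
    #   -eki  feminine singular  (khaeki)
    #   -eka  plural, either gender (khaeka)
    if plural:
        suffix = "eka"
    elif gender == "feminine":
        suffix = "eki"
    else:
        suffix = "eko"
    return stem + suffix

print(perfect_participle("kha"))               # khaeko
print(perfect_participle("kha", "feminine"))   # khaeki
print(perfect_participle("kha", plural=True))  # khaeka
```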
</code></code></pre><h3><strong>Imperfect Aspect (&#2309;&#2346;&#2370;&#2352;&#2381;&#2339; &#2346;&#2325;&#2381;&#2359;)</strong></h3><p>Imperfect aspect denotes an action that is in progress or ongoing. In Nepali grammar, this aspect is indicated by specific suffixes such as "&#2342;&#2376;" (dai), "&#2342;&#2379;" (do), "&#2342;&#2368;" (dii), and "&#2342;&#2366;" (da). These suffixes serve to highlight the ongoing nature of the action, suggesting that it is not yet complete or finalized. Examples: &#2326;&#2366;&#2305;&#2342;&#2376; (khaa-dai), &#2326;&#2366;&#2305;&#2342;&#2366; (khaa-daa)</p><pre><code><code>Examples of Imperfect Aspect (&#2309;&#2346;&#2370;&#2352;&#2381;&#2339; &#2346;&#2325;&#2381;&#2359;) Verbs
&#2326;&#2366;&#2305;&#2342;&#2379;(kha-do), &#2326;&#2366;&#2305;&#2342;&#2376;(kha-dai) - masculine, singular
&#2326;&#2366;&#2305;&#2342;&#2366;(kha-da), &#2326;&#2366;&#2305;&#2342;&#2376;(kha-dai) - masculine, plural
&#2326;&#2366;&#2305;&#2342;&#2368;(kha-di), &#2326;&#2366;&#2305;&#2342;&#2376;(kha-dai) - feminine, singular
&#2326;&#2366;&#2305;&#2342;&#2366;(kha-da), &#2326;&#2366;&#2305;&#2342;&#2376;(kha-dai) - feminine, plural
</code></code></pre><h3><strong>Habitual Aspect (&#2309;&#2349;&#2381;&#2351;&#2360;&#2381;&#2340; &#2346;&#2325;&#2381;&#2359;)</strong></h3><p>Habitual aspect depicts actions that occur regularly or habitually. It conveys the idea of actions that are customary or repeatedly performed. In Nepali grammar, the habitual aspect is represented by variants of the suffixes "&#2331;" (cha) and "&#2341;&#2381;" (th). These suffixes inflect to encode the person, number, gender, and levels of honorifics associated with the subject.</p><pre><code><code>Examples of Habitual Aspect (&#2309;&#2349;&#2381;&#2351;&#2360;&#2381;&#2340; &#2346;&#2325;&#2381;&#2359;) Verbs
&#2326;&#2366;&#2344;&#2381;&#2331;&#2369; (kha-n-chu), &#2326;&#2366;&#2344;&#2381;&#2341;&#2375; (kha-n-the) - first person, inclusive, singular
&#2326;&#2366;&#2344;&#2381;&#2331;&#2380; (kha-n-chau) - first person, inclusive, plural
&#2326;&#2366;&#2344;&#2381;&#2341;&#2367;&#2360;&#2381; (kha-n-this) - second person, masculine, singular
&#2326;&#2366;&#2344;&#2381;&#2341;&#2381;&#2351;&#2380; (kha-n-theu) - second person, masculine, plural
&#2326;&#2366;&#2344;&#2381;&#2341;&#2381;&#2351;&#2379; (kha-n-thyo), &#2326;&#2366;&#2344;&#2381;&#2341;&#2375; (kha-n-the) - third person, masculine, singular
&#2326;&#2366;&#2344;&#2381;&#2341;&#2375; (kha-n-the) - third person, inclusive, plural
&#2326;&#2366;&#2344;&#2381;&#2341;&#2367; (kha-n-thi), &#2326;&#2366;&#2344;&#2381;&#2341;&#2367;&#2344;&#2381; (kha-n-thin) - third person, feminine, singular
</code></code></pre><h3><strong>Inferential Aspect (&#2309;&#2332;&#2381;&#2334;&#2366;&#2340; &#2346;&#2325;&#2381;&#2359;)</strong></h3><p>Inferential aspect in Nepali is used to describe actions that have occurred in the past but are only known or inferred in the present, often based on indirect evidence or hearsay. In Nepali grammar, the occurrence of the suffixes "&#2319;" (e) and "&#2311;" (i) between the verb stem and the forms of "&#2331;" (cha) indicates the inferential aspect. These suffixes serve to imply that the action, while not directly observed or experienced, is deduced or inferred to have taken place.</p><pre><code><code>Examples of Inferential Aspect (&#2309;&#2332;&#2381;&#2334;&#2366;&#2340; &#2346;&#2325;&#2381;&#2359;) Verbs
&#2326;&#2366;&#2319;&#2331;&#2369; (kha-e-chu) - first person, inclusive, singular
&#2326;&#2366;&#2319;&#2331;&#2380;&#2305; (kha-e-chau)- first person, inclusive, plural
&#2326;&#2366;&#2319;&#2331;&#2360;&#2381; (kha-e-chas) - second person, masculine, singular
&#2326;&#2366;&#2311;&#2331;&#2360;&#2381; (kha-i-chas)  - second person, feminine, singular
&#2326;&#2366;&#2319;&#2331;&#2380; (kha-e-chau) - second person, inclusive, plural
&#2326;&#2366;&#2319;&#2331; (kha-e-cha) - third person, masculine, singular
&#2326;&#2366;&#2311;&#2331; (kha-i-cha)  - third person, feminine, singular
&#2326;&#2366;&#2319;&#2331;&#2344;&#2381; (kha-e-chan) - third person, inclusive, plural
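
The pattern in these examples, an "e" or "i" infix chosen by gender followed by a form of cha, can be sketched as follows (romanized; the helper name and the simplified handling of the cha forms are my own assumptions).

```python
def inferential(stem, cha_form, feminine=False):
    # Inferential aspect: stem + infix + a form of "cha".
    # The infix is "i" for feminine singular subjects, otherwise "e",
    # e.g. kha + e + cha -> khaecha, kha + i + cha -> khaicha.
    infix = "i" if feminine else "e"
    return stem + infix + cha_form

print(inferential("kha", "cha"))                 # khaecha  (3rd person masc.)
print(inferential("kha", "cha", feminine=True))  # khaicha  (3rd person fem.)
print(inferential("kha", "chan"))                # khaechan (3rd person pl.)
```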
</code></code></pre><h2><strong>Mood (&#2349;&#2366;&#2357;)</strong></h2><p>Mood conveys the speaker's attitude or stance towards the content being communicated. In Nepali, verbs exhibit three moods, each indicating a different perspective or intention of the speaker: Imperative (&#2310;&#2332;&#2381;&#2334;&#2366;&#2352;&#2381;&#2341; &#2349;&#2366;&#2357;), Optative (&#2311;&#2330;&#2381;&#2331;&#2366;&#2352;&#2381;&#2341; &#2349;&#2366;&#2357;), and Potential (&#2360;&#2350;&#2381;&#2349;&#2366;&#2357;&#2344;&#2366;&#2352;&#2381;&#2341; &#2349;&#2366;&#2357;).</p><h3><strong>Imperative Mood (&#2310;&#2332;&#2381;&#2334;&#2366;&#2352;&#2381;&#2341; &#2349;&#2366;&#2357;)</strong></h3><p>The imperative mood expresses commands or requests. It is a directive form that instructs the listener to perform a certain action. Verbs in the imperative mood undergo inflection to match the number and honorifics associated with the subject or the individual being addressed.</p><pre><code><code>Examples of Imperative Mood (&#2310;&#2332;&#2381;&#2334;&#2366;&#2352;&#2381;&#2341; &#2349;&#2366;&#2357;) Verbs
&#2326;&#2366; (kha) - singular, informal
&#2326;&#2366;&#2344;&#2369;&#2361;&#2379;&#2360;&#2381; (khaanu hos) - singular, formal
&#2326;&#2366;&#2323;(kha-o) - plural, inclusive
&#2326;&#2366;&#2344;&#2369;&#2361;&#2379;&#2360;&#2381; (khaanu hos) - plural, formal, inclusive
</code></code></pre><h3><strong>Optative Mood (&#2311;&#2330;&#2381;&#2331;&#2366;&#2352;&#2381;&#2341; &#2349;&#2366;&#2357;)</strong></h3><p>Optative mood indicates wishes, hopes, or desires. It allows speakers to express their desires regarding a particular situation or outcome. Optative forms also undergo inflection to align with the person, number, and level of honorifics involved in the communication.</p><pre><code><code>Examples of Optative Mood (&#2311;&#2330;&#2381;&#2331;&#2366;&#2352;&#2381;&#2341; &#2349;&#2366;&#2357;) Verbs
&#2326;&#2366;&#2313;&#2305; (kha-u) - first person, singular/plural
&#2326;&#2366;&#2319;&#2360;&#2381; (kha-esh) - second person, singular
&#2326;&#2366;&#2319; (kha-e) - second person, plural
&#2326;&#2366;&#2323;&#2360;&#2381; (kha-os) - third person, singular
&#2326;&#2366;&#2313;&#2344;&#2381; (kha-un) -  third person, plural
</code></code></pre><h3><strong>Potential Mood (&#2360;&#2350;&#2381;&#2349;&#2366;&#2357;&#2344;&#2366;&#2352;&#2381;&#2341; &#2349;&#2366;&#2357;)</strong></h3><p>Potential mood in language serves as a means to convey the possibility or likelihood of an action occurring. This mood allows for the expression of hypothetical scenarios or speculative statements about what may happen. In Nepali grammar, verbs inflect to indicate the potential mood, adapting to align with the person, gender, and number of the subject involved in the action.</p><pre><code><code>Examples of Potential Mood (&#2360;&#2350;&#2381;&#2349;&#2366;&#2357;&#2344;&#2366;&#2352;&#2381;&#2341; &#2349;&#2366;&#2357;) Verbs
&#2326;&#2366;&#2313;&#2305;&#2354;&#2366; (kha-u-la) - first person, inclusive, singular
&#2326;&#2366;&#2324;&#2305;&#2354;&#2366; (kha-aula) - first person, inclusive, plural
&#2326;&#2366;&#2354;&#2366;&#2360;&#2381; (kha-lash) - second person, masculine, singular
&#2326;&#2366;&#2354;&#2367;&#2360;&#2381; (kha-lish) - second person, feminine, singular
&#2326;&#2366;&#2324;&#2354;&#2366; (kha-aula) - second person, inclusive, plural
&#2326;&#2366;&#2354;&#2366; (kha-la) - third person, masculine, singular
&#2326;&#2366;&#2354;&#2368; (kha-li) - third person, feminine, singular
&#2326;&#2366;&#2354;&#2366;&#2344;&#2381; (kha-lan) - third person, inclusive, plural
</code></code></pre><h2><strong>Person</strong></h2><p>In Nepali grammar, verbs undergo inflection in agreement with person, distinguishing between the first person (the speaker), the second person (the recipient or addressee), and the third person (others not involved in the conversation). Verb forms for each person also vary with mood, tense, aspect, and number. In addition, second and third person verbs inflect for gender and grades of honorifics.</p><pre><code><code>Examples of First Person Verbs
&#2326;&#2366;&#2319; (kha-e) - past tense, singular
&#2326;&#2366;&#2344;&#2381;&#2331;&#2369; (kha-n-chu) - present tense, singular
&#2326;&#2366;&#2344;&#2375;&#2331;&#2369; (kha-ne-chu) - future tense, singular
&#2326;&#2366;&#2351;&#2380;&#2305; (kha-you)  - past tense, plural
&#2326;&#2366;&#2344;&#2381;&#2331;&#2380;&#2305; (kha-n-chau) - present tense, plural
&#2326;&#2366;&#2344;&#2375; &#2331;&#2380;&#2305; (kha-ne-chau) - future tense, plural
</code></code></pre><pre><code><code>Examples of Second Person Verbs
&#2326;&#2366;&#2311;&#2360;&#2381; (kha-i-sh) - past tense, inclusive, singular
&#2326;&#2366;&#2344;&#2381;&#2331;&#2360;&#2381; (kha-n-chas) - present tense, masculine, singular
&#2326;&#2366;&#2344;&#2381;&#2331;&#2375;&#2360;&#2381; (kha-n-ches) - present tense, feminine, singular
&#2326;&#2366;&#2344;&#2375;&#2331;&#2360;&#2381; (kha-ne-chas) - future tense, masculine, singular
&#2326;&#2366;&#2344;&#2375;&#2331;&#2375;&#2360;&#2381; (kha-ne-ches) - future tense, feminine, singular
&#2326;&#2366;&#2351;&#2380; (kha-you)  - past tense, inclusive, plural
&#2326;&#2366;&#2344;&#2381;&#2331;&#2380;(kha-n-chau) - present tense, inclusive, plural
&#2326;&#2366;&#2344;&#2375; &#2331;&#2380; (kha-ne-chau) - future tense, inclusive, plural
</code></code></pre><pre><code><code>Examples of Third Person Verbs
&#2326;&#2366;&#2351;&#2379; (kha-yo) - past tense, masculine, singular
&#2326;&#2366;&#2312; (kha-i) - past tense, feminine, singular
&#2326;&#2366;&#2344;&#2381;&#2331; (kha-n-cha) - present tense, masculine, singular
&#2326;&#2366;&#2344;&#2381;&#2331;&#2375; (kha-n-che) - present tense, feminine, singular
&#2326;&#2366;&#2344;&#2375;&#2331; (kha-ne-cha) - future tense, masculine, singular
&#2326;&#2366;&#2319; (kha-e) - past tense, inclusive, plural
&#2326;&#2366;&#2344;&#2381;&#2331;&#2344;&#2381; (kha-n-chan) - present tense, inclusive, plural
&#2326;&#2366;&#2344;&#2375;&#2331;&#2344;&#2381; (kha-ne-chan) - future tense, inclusive, plural
</code></code></pre><h2><strong>Number</strong></h2><p>Nepali verbs exhibit agreement with the subject in terms of number: singular and plural.</p><pre><code><code>Examples of Singular Verbs
&#2326;&#2366;&#2311;&#2360;&#2381; (kha-i-sh) - past tense, inclusive
&#2326;&#2366;&#2344;&#2381;&#2331;&#2360;&#2381; (kha-n-chas) - present tense, masculine
&#2326;&#2366;&#2344;&#2381;&#2331;&#2375;&#2360;&#2381; (kha-n-ches) - present tense, feminine
Examples of Plural Verbs
&#2326;&#2366;&#2351;&#2380; (kha-you)  - past tense, inclusive, plural
&#2326;&#2366;&#2344;&#2381;&#2331;&#2380;(kha-n-chau) - present tense, inclusive, plural
&#2326;&#2366;&#2344;&#2375; &#2331;&#2380; (kha-ne-chau) - future tense, inclusive, plural
</code></code></pre><h2><strong>Gender</strong></h2><p>Verbs inflect to demonstrate two forms of gender agreement: masculine and feminine, reflecting the gender of the subject. This distinction is particularly evident in the conjugation of verbs for the second and third person singular forms.</p><h2><strong>Honorifics</strong></h2><p>In Nepali language, honorifics play a significant role in social interactions, influencing verb forms to express different levels of respect in speech. Second and third person verb forms can undergo inflection to reflect three distinct grades of honorifics: casual, familiar, and respectful. These honorifics are chosen based on the level of formality and the relationship between the speaker and the listener.</p><h3><strong>Casual Honorifics</strong></h3><p>Casual honorifics are used when addressing juniors or friends in a casual setting. The language used is informal and relaxed, reflecting a close relationship between the speaker and the listener.</p><pre><code><code>&#2340;&#2305; (You, singular) - Casual Honorific
  &#2340;&#2305; &#2326;&#2366;&#2344;&#2381;&#2331;&#2360;&#2381; (kha-n-chas)
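
All three grades named above select different pronouns and verb forms. A toy lookup, using the romanized second-person forms that appear in this section (the dictionary and function names are my own assumptions), makes the choice explicit:

```python
# Second-person forms of "eat" by honorific grade (romanized),
# taken from the examples in this section.
HONORIFIC_FORMS = {
    "casual": ("ta", "khaanchas"),         # informal, juniors and friends
    "familiar": ("timi", "khaanchau"),     # most everyday interactions
    "respectful": ("hajur", "khaanuhos"),  # seniors, elders, formal settings
}

def address(grade):
    pronoun, verb = HONORIFIC_FORMS[grade]
    return f"{pronoun} {verb}"

print(address("familiar"))  # timi khaanchau
```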
</code></code></pre><h3><strong>Familiar Honorifics</strong></h3><p>The second grade of honorific, familiar honorifics, is employed when speaking to friends, acquaintances, or colleagues. It is more formal compared to casual honorifics and is considered appropriate for most social interactions.</p><pre><code><code>&#2340;&#2367;&#2350;&#2368; (You, singular) - Familiar Honorific
  &#2340;&#2367;&#2350;&#2368; &#2326;&#2366;&#2344;&#2381;&#2331;&#2380; (kha-n-chau)
</code></code></pre><h3><strong>Respectful Honorifics</strong></h3><p>Respectful honorifics are used while addressing seniors or elders, as well as individuals deserving of utmost respect. Verbs inflected with respectful honorifics exhibit a high level of respect and formality.</p><pre><code><code>&#2361;&#2332;&#2369;&#2352; (You, singular) - Respectful Honorific
  &#2361;&#2332;&#2369;&#2352; &#2326;&#2366;&#2344;&#2369;&#2361;&#2379;&#2360;&#2381; (kha-nu-hos)</code></code></pre><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.icodeformybhasa.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading #icodeformy&#2349;&#2366;&#2359;&#2366;! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Nominal Inflections: The Morphological Adaptations of Nepali Nouns, Pronouns and Adjectives]]></title><description><![CDATA[Nepali is an inflectionally rich language.]]></description><link>https://www.icodeformybhasa.com/p/nominal-inflections-the-morphological</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/nominal-inflections-the-morphological</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Sat, 05 Aug 2017 05:52:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f18c1d43-5c95-40d5-be30-227feb4d44bc_1024x512.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Nepali is an inflectionally rich language. 
In this post, we will look into inflections in Nepali Nouns, Pronouns, and Adjectives.</p><h2><strong>Recap: Inflectional Morphemes</strong></h2><ul><li><p>Inflectional morphemes encode grammatical meaning like gender, number, tense, person, and levels of honorifics.</p></li><li><p>Inflectional forms of a word have the same meaning as the root.</p></li><li><p>Inflectional morphemes have a transparent and regular function such that they do not change the lexical category.</p></li><li><p>Since inflections encode grammatical meaning, they do not precede derivational morphemes.</p></li><li><p>In Nepali, there are only inflectional suffixes.</p></li></ul><h2><strong>Inflections in Nepali Nouns and Pronouns</strong></h2><p>Nepali nouns and pronouns inflect for seven cases (&#2325;&#2366;&#2352;&#2325;): Nominative, Accusative, Instrumental, Dative, Ablative, Genitive and Locative, and for number (&#2357;&#2330;&#2344;): Singular and Plural. All of these inflectional forms are marked by postpositions.</p><ol><li><p>Nominative Case (&#2325;&#2352;&#2381;&#2340;&#2366;) The nominative case indicates the subject of a sentence, the doer of an action.</p></li></ol><pre><code><code>&#2360;&#2366;&#2341;&#2368;&#2354;&#2375; (sathi-le) - noun with nominative, singular
&#2360;&#2366;&#2341;&#2368;&#2361;&#2352;&#2369;&#2354;&#2375; (sathi-haru-le) - noun nominative case, plural
&#2340;&#2367;&#2350;&#2368;&#2354;&#2375; (timi-le) - pronoun nominative case, singular
&#2340;&#2367;&#2350;&#2368;&#2361;&#2352;&#2369;&#2354;&#2375; (timi-haru-le) - pronoun nominative case, plural
</code></code></pre><ol start="2"><li><p>Accusative Case (&#2325;&#2352;&#2381;&#2350;) The accusative case marks the direct object of a verb, the receiver of the action.</p></li></ol><pre><code><code>&#2360;&#2366;&#2341;&#2368;&#2354;&#2366;&#2312; (sathi-lai:) - noun with accusative, singular
&#2340;&#2367;&#2350;&#2368;&#2354;&#2366;&#2312; (timi-lai:) - pronoun with accusative, singular
&#2360;&#2366;&#2341;&#2368;&#2361;&#2352;&#2369;&#2354;&#2366;&#2312; (sathi-haru-lai:) - noun with accusative, plural
&#2340;&#2367;&#2350;&#2368;&#2361;&#2352;&#2369;&#2354;&#2366;&#2312; (timi-haru-lai:) - pronoun with accusative, plural
</code></code></pre><ol start="3"><li><p>Instrumental Case (&#2325;&#2352;&#2339;) The instrumental case denotes the instrument or means by which an action is performed.</p></li></ol><pre><code><code>&#2360;&#2366;&#2341;&#2368;&#2354;&#2375; (sathi-le) - noun with instrumental, singular
&#2360;&#2366;&#2341;&#2368;&#2361;&#2352;&#2369;&#2354;&#2375; (sathi-haru-le) - noun instrumental case, plural
&#2340;&#2367;&#2350;&#2368;&#2354;&#2375; (timi-le) - pronoun instrumental case, singular
&#2340;&#2367;&#2350;&#2368;&#2361;&#2352;&#2369;&#2354;&#2375; (timi-haru-le) - pronoun instrumental case, plural
</code></code></pre><ol start="4"><li><p>Dative Case (&#2360;&#2350;&#2381;&#2346;&#2381;&#2352;&#2342;&#2366;&#2344;) Dative case indicates the indirect object of a verb, the recipient of the action.</p></li></ol><pre><code><code>&#2360;&#2366;&#2341;&#2368;&#2354;&#2366;&#2312; (sathi-lai:) - noun with dative, singular
&#2340;&#2367;&#2350;&#2368;&#2354;&#2366;&#2312; (timi-lai:) - pronoun with dative, singular
&#2360;&#2366;&#2341;&#2368;&#2361;&#2352;&#2369;&#2354;&#2366;&#2312; (sathi-haru-lai:) - noun with dative, plural
&#2340;&#2367;&#2350;&#2368;&#2361;&#2352;&#2369;&#2354;&#2366;&#2312; (timi-haru-lai:) - pronoun with dative, plural
</code></code></pre><ol start="5"><li><p>Ablative Case (&#2309;&#2346;&#2366;&#2342;&#2366;&#2344;) The ablative case expresses the origin, source, or cause of an action.</p></li></ol><pre><code><code>&#2360;&#2366;&#2341;&#2368;&#2342;&#2375;&#2326;&#2367; (sathi-dekhi), &#2360;&#2366;&#2341;&#2368;&#2348;&#2366;&#2335; (sathi-bata) - noun with ablative, singular
&#2340;&#2367;&#2350;&#2368;&#2342;&#2375;&#2326;&#2367; (timi-dekhi), &#2340;&#2367;&#2350;&#2368;&#2348;&#2366;&#2335; (timi-bata) - pronoun with ablative, singular
&#2360;&#2366;&#2341;&#2368;&#2361;&#2352;&#2369;&#2342;&#2375;&#2326;&#2367; (sathi-haru-dekhi), &#2360;&#2366;&#2341;&#2368;&#2361;&#2352;&#2369;&#2348;&#2366;&#2335; (sathi-haru-bata) - noun with ablative, plural
&#2340;&#2367;&#2350;&#2368;&#2361;&#2352;&#2369;&#2342;&#2375;&#2326;&#2367; (timi-haru-dekhi),  &#2340;&#2367;&#2350;&#2368;&#2361;&#2352;&#2369;&#2348;&#2366;&#2335; (timi-haru-bata) - pronoun with ablative, plural
</code></code></pre><ol start="6"><li><p>Genitive Case (&#2360;&#2350;&#2381;&#2348;&#2344;&#2381;&#2343;) The genitive case signifies possession, association, or belonging.</p></li></ol><pre><code><code>&#2360;&#2366;&#2341;&#2368;&#2325;&#2379; (sathi-ko) - noun with genitive, singular
&#2313;&#2344;&#2325;&#2379; (un-ko) - pronoun with genitive, singular
&#2360;&#2366;&#2341;&#2368;&#2361;&#2352;&#2369;&#2325;&#2379; (sathi-haru-ko) - noun with genitive, plural
&#2313;&#2344;&#2368;&#2361;&#2352;&#2369;&#2325;&#2379; (uni-haru-ko) - pronoun with genitive, plural
</code></code></pre><ol start="7"><li><p>Locative (&#2309;&#2343;&#2367;&#2325;&#2352;&#2339;) The locative case indicates location or place where something happens or is located.</p></li></ol><pre><code><code>&#2360;&#2366;&#2341;&#2368;&#2350;&#2366; (sathi-ma) - noun with locative, singular
&#2313;&#2344;&#2368;&#2350;&#2366; (uni-ma) - pronoun with locative, singular
&#2360;&#2366;&#2341;&#2368;&#2361;&#2352;&#2369;&#2350;&#2366; (sathi-haru-ma) - noun with locative, plural
&#2313;&#2344;&#2368;&#2361;&#2352;&#2369;&#2350;&#2366; (uni-haru-ma) - pronoun with locative, plural
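# The agglutinative pattern above can be sketched in code: a case
# suffix attaches after the (optional) plural marker -haru. A toy
# inflector over romanized stems (the function name and hyphenated
# output are illustrative, not from this post):
def inflect(stem, case_suffix, plural=False):
    parts = [stem] + (['haru'] if plural else []) + [case_suffix]
    return '-'.join(parts)

# inflect('sathi', 'ma', plural=True) gives 'sathi-haru-ma'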
</code></code></pre><h2><strong>Inflections in Nepali Adjectives</strong></h2><p>Nepali adjectives inflect for number (Singular, Plural), gender (Masculine and Feminine), and the seven cases of the noun they modify. Inflection involves transforming singular adjectives ending in "o" into their plural counterparts ending in "a", and masculine adjectives ending in "o" into feminine adjectives ending in "i". For instance, the adjective "&#2352;&#2366;&#2350;&#2381;&#2352;&#2379;" (ramro), meaning "good" in singular masculine form, becomes "&#2352;&#2366;&#2350;&#2381;&#2352;&#2366;" (ramra) in its plural form and "&#2352;&#2366;&#2350;&#2381;&#2352;&#2368;" (ramri) in its feminine form. See Table 2 for a detailed overview of inflectional patterns in Nepali adjectives.</p><pre><code><code>&#2352;&#2366;&#2350;&#2381;&#2352;&#2379; (ramr-o) - singular, masculine
&#2352;&#2366;&#2350;&#2381;&#2352;&#2368; (ramr-i) - singular, feminine
&#2352;&#2366;&#2350;&#2381;&#2352;&#2366; (ramr-a) - plural, masculine
&#2352;&#2366;&#2350;&#2381;&#2352;&#2366; (ramra) - plural, feminine</code></code></pre><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.icodeformybhasa.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading #icodeformy&#2349;&#2366;&#2359;&#2366;! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Exploring the Building Blocks: A Simple Guide to Morphemes in Nepali]]></title><description><![CDATA[In this post, we will explore morphemes, which are the smallest units of meaning in language. Words in Nepali can be comprised of: 1) a free morpheme; 2) one or more bound morphemes attached to a free morpheme; or 3) two or more bound morphemes. 
The process by which a bound morpheme attaches to a free morpheme or another bound morpheme to form a word can be categorized into inflection (&#2352;&#2369;&#2346;&#2366;&#2351;&#2344;) and derivation (&#2357;&#2381;&#2351;&#2369;&#2340;&#2381;&#2346;&#2366;&#2342;&#2344;), resulting in two types of bound morphemes: inflectional and derivational.]]></description><link>https://www.icodeformybhasa.com/p/exploring-the-building-blocks-a-simple</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/exploring-the-building-blocks-a-simple</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Tue, 01 Aug 2017 13:48:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/17675a34-4ed9-48e6-ad34-558853ea709e_1024x512.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this post, we will explore morphemes, which are the smallest units of meaning in language. Words in Nepali can be comprised of: 1) a free morpheme; 2) one or more bound morphemes attached to a free morpheme; or 3) two or more bound morphemes. The process by which a bound morpheme attaches to a free morpheme or another bound morpheme to form a word can be categorized into inflection (&#2352;&#2369;&#2346;&#2366;&#2351;&#2344;) and derivation (&#2357;&#2381;&#2351;&#2369;&#2340;&#2381;&#2346;&#2366;&#2342;&#2344;), resulting in two types of bound morphemes: inflectional and derivational.</p><p>Note:&nbsp;<em>A free morpheme is a standalone word and has a meaning of its own. 
However, a bound morpheme by itself cannot be considered a word even though it may carry some meaning.</em></p><h2><strong>Inflections v/s Derivations</strong></h2><ul><li><p>Inflectional morphemes encode grammatical meaning, hence the inflectional forms of a word have the same meaning as the root.</p></li><li><p>Inflectional morphemes have a transparent and regular function such that they do not change the lexical category.</p></li><li><p>Since inflections encode grammatical meaning, they do not precede derivational morphemes.</p></li><li><p>In Nepali, there are only inflectional suffixes.</p></li></ul><pre><code><code>&#2360;&#2381;&#2325;&#2369;&#2354; (school) + &#2361;&#2352;&#2369; (haru) = &#2360;&#2381;&#2325;&#2369;&#2354;&#2361;&#2352;&#2369; (school-haru) # Plural Form
&#2360;&#2381;&#2325;&#2369;&#2354; (school) + &#2325;&#2379; (ko) = &#2360;&#2381;&#2325;&#2369;&#2354;&#2325;&#2379; (school-ko) # Genitive Form
</code></code></pre><ul><li><p>Derivational morphemes carry lexical meaning, hence when such morphemes combine with free morphemes they form new words.</p></li><li><p>Derivational morphemes may also change the lexical category of the root.</p></li><li><p>In Nepali, there are derivational prefixes and suffixes.</p></li></ul><pre><code><code># Prefix
&#2309;&#2344;&#2369; (anu) + &#2352;&#2369;&#2346; (rup) = &#2309;&#2344;&#2369;&#2352;&#2369;&#2346; (anu-rup)
&#2309;&#2349;&#2367; (abhi) + &#2351;&#2379;&#2327; (yog) = &#2309;&#2349;&#2367;&#2351;&#2379;&#2327; (abhi-yog)

# Suffix
&#2348;&#2330;&#2381; (bach) + &#2309;&#2340; (ata) = &#2348;&#2330;&#2340; (bach-ata) # Verb to Noun
&#2336;&#2327;&#2381; (thag) + &#2310;&#2361;&#2366; (aahaa) = &#2336;&#2327;&#2366;&#2361;&#2366; (thag-aahaa) # Verb to Adjective

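# Derivation can be sketched as plain affix concatenation on romanized
# roots (purely illustrative; real forms may drop the halanta or fuse
# sounds, as in thag + aahaa above):
def derive(prefix, root, suffix=''):
    return prefix + root + suffix

# derive('anu', 'rup') gives 'anurup'; derive('abhi', 'yog') gives 'abhiyog'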
</code></code></pre><p>NOTE: There are no derivational/inflectional infixes and circumfixes in Nepali.</p><h2><strong>Form and Structure Classes</strong></h2><p>In addition to inflection and derivation, it's crucial to understand the concept of form and structure classes in Nepali. Form classes refer to words that primarily convey content and carry meaning, such as nouns, verbs, adjectives, and adverbs. These words usually undergo inflectional and derivational processes.</p><p>Structure classes, on the other hand, include function words like articles, conjunctions, prepositions, and pronouns. They contribute to the grammatical structure and organization of sentences but do not typically undergo inflectional or derivational changes.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.icodeformybhasa.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading #icodeformy&#2349;&#2366;&#2359;&#2366;! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[An Introduction to Written Nepali]]></title><description><![CDATA[Nepali is an Indo-Aryan language.]]></description><link>https://www.icodeformybhasa.com/p/an-introduction-to-written-nepali</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/an-introduction-to-written-nepali</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Sun, 23 Jul 2017 18:37:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/bc8ffb4f-63c2-491c-b431-330c82b9d1be_665x291.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><code>Nepali is an Indo-Aryan language. It is written in the Devanagari script. Hindi, Bengali, Marathi and Sanskrit are some other languages written in the same script. It follows the Subject + Object + Verb (SOV) pattern in writing and is written from left to right. This blog post will serve as an introduction to written Nepali.</code></p><p><code>Nepali is written using vowels, consonants, modifiers (&#2350;&#2366;&#2340;&#2381;&#2352;&#2366;) and symbols (&#2330;&#2367;&#2344;&#2381;&#2361;).</code></p><h2><code>The Nepali Alphabet</code></h2><p><code>The Nepali alphabet consists of 11 vowels and 33 consonants. 
All the vowels and consonants, in Nepali, are listed in Table 1.</code></p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/7TxS7/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/12771134-bff7-4f45-833b-e40d5541033c_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:178,&quot;title&quot;:&quot;| Created with Datawrapper&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/7TxS7/1/" width="730" height="178" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p><code>&#2325;&#2381;&#2359;, &#2340;&#2381;&#2352; and &#2332;&#2381;&#2334; are usually mistaken for consonants but these letters are special combination of two different consonants and halanta "&#2381;".</code></p><pre><code><code>&#2325; + &#2381; + &#2359; = &#2325;&#2381;&#2359;&#9;
&#2340; + &#2381; + &#2352; = &#2340;&#2381;&#2352;
&#2332; + &#2381; + &#2334; = &#2332;&#2381;&#2334;
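# At the codepoint level these conjuncts really are two consonants
# joined by halanta (U+094D), which can be checked in Python:
assert '\u0915' + '\u094d' + '\u0937' == 'क्ष'
assert '\u0924' + '\u094d' + '\u0930' == 'त्र'
assert '\u091c' + '\u094d' + '\u091e' == 'ज्ञ'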
</code></code></pre><p><code>NOTE:&nbsp;In Nepali, there is no provision for uppercase and lowercase letters.</code></p><h2><code>Modifiers(&#2350;&#2366;&#2340;&#2381;&#2352;&#2366;)</code></h2><p><code>Modifiers, in Nepali, are representations for vowels. A modifier is used when the vowel is preceded by a consonant. Vowels and their respective modifiers are given in Table 2.</code></p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/4AnEC/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31b7ce43-5ec4-4ef0-9369-b9a97426da95_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:195,&quot;title&quot;:&quot;Table 2: Vowels and their Modifiers&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/4AnEC/1/" width="730" height="195" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p><code>These modifiers can appear at the foot, top or middle of a consonant. They can also appear before and after a consonant.</code></p><p><code>&#2369; and &#2370; can appear at the foot or middle of a consonant and &#2371; appears at the foot of a consonant.</code></p><pre><code><code>At foot:

&#2325; + &#2369; = &#2325;&#2369;
&#2325; + &#2370; = &#2325;&#2370;
&#2325; + &#2371; = &#2325;&#2371;

At middle:

&#2352; + &#2369; = &#2352;&#2369;
&#2352; + &#2370; = &#2352;&#2370;
</code></code></pre><p><code>&#2367; appears before a consonant.</code></p><pre><code><code>&#2325; + &#2367; = &#2325;&#2367;
</code></code></pre><p><code>&#2366;, &#2368;, &#2379; and &#2380; appear after a consonant.</code></p><pre><code><code>&#2325; + &#2366; = &#2325;&#2366;
&#2325; + &#2368; = &#2325;&#2368;
&#2325; + &#2379; = &#2325;&#2379;
&#2325; + &#2380; = &#2325;&#2380;
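# In computer encoding these combinations are plain codepoint
# sequences; the consonant is always stored first, even when the
# matra (like ि) is drawn to its left:
assert '\u0915' + '\u093f' == 'कि'   # क + ि
assert '\u0915' + '\u094b' == 'को'   # क + ो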
</code></code></pre><p><code>NOTE:&nbsp;Regardless of where the modifiers appear in writing, in computerized text the consonant precedes the modifier.</code></p><h2><code>Symbols(&#2330;&#2367;&#2344;&#2381;&#2361;)</code></h2><p><code>In addition to alphabets and modifiers, there are other symbols used in Nepali. Such symbols are listed below.</code></p><pre><code><code>Chandrabindu(&#2330;&#2344;&#2381;&#2342;&#2381;&#2352;&#2357;&#2367;&#2344;&#2381;&#2342;&#2369;) : &#2305;  
Sirbindu(&#2358;&#2367;&#2352;&#2357;&#2367;&#2344;&#2381;&#2342;&#2369;): &#2306; 
Halanta(&#2361;&#2354;&#2344;&#2381;&#2340;): &#2381; 
Visarga(&#2357;&#2367;&#2360;&#2352;&#2381;&#2327;): : 
</code></code></pre><h2><code>Chandrabindu and Sirbindu</code></h2><p><code>Both Chandrabindu and Sirbindu are used to indicate nasalization. Chandrabindu is usually used with vowels, indicating that the preceding vowel is nasalized.</code></p><pre><code><code>&#2332;&#2366;&#2313;&#2305;&#2354;&#2366;
&#2310;&#2313;&#2305;&#2331;&#2369;
</code></code></pre><p><code>Sirbindu is usually used before &#2351;, &#2352;, &#2354;, &#2357;, &#2358;, &#2359;, &#2360; and &#2361;.</code></p><pre><code><code>&#2360;&#2306;&#2351;&#2369;&#2325;&#2381;&#2340;
&#2360;&#2306;&#2352;&#2330;&#2344;&#2366;
&#2360;&#2306;&#2354;&#2327;&#2381;&#2344;
&#2360;&#2306;&#2357;&#2366;&#2342;
&#2357;&#2306;&#2358;
&#2360;&#2306;&#2360;&#2366;&#2352;
&#2361;&#2306;&#2358;
</code></code></pre><h2><code>Halanta</code></h2><p><code>Every consonant has an inherent vowel; halanta is used to suppress or cancel this inherent vowel. When halanta occurs between two consonants, say C1 and C2, there are three different cases of conjoined consonants.</code></p><p><code>In the first case, C1 and C2 fully conjoin to form a consonant with a visible halanta occurring at the foot of C1.</code></p><pre><code><code>&#2331;&#2329;&#2381;&#2331;&#2329;
&#2360;&#2329;&#2381;&#2354;&#2379;
</code></code></pre><p><code>Second, the half or modified form of C1 and the full form of C2 conjoin to give a consonant.</code></p><pre><code><code>&#2352;&#2340;&#2381;&#2344;
&#2344;&#2367;&#2361;&#2340;&#2381;&#2341;&#2366;
&#2349;&#2369;&#2354;&#2381;&#2325;
</code></code></pre><p><code>Third, the full form of C1 and the half or modified form of C2 conjoin to give a consonant.</code></p><pre><code><code>&#2346;&#2381;&#2352;&#2339;&#2366;&#2350;
&#2348;&#2332;&#2381;&#2352;
&#2331;&#2366;&#2346;&#2381;&#2352;&#2379;
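# Whatever the rendered shape, each conjoined form above contains an
# explicit halanta (U+094D) in its underlying codepoint sequence:
for word in ['रत्न', 'बज्र', 'प्रणाम', 'छाप्रो']:
    assert '\u094d' in word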
</code></code></pre><h2><code>Visarga</code></h2><p><code>Visarga is a frequent occurrence in Sanskrit texts. In Nepali, a handful of words borrowed from Sanskrit contain visarga. A few such words are listed below.</code></p><pre><code><code>&#2342;&#2369;:&#2326;
&#2344;&#2367;:&#2360;&#2381;&#2357;&#2366;&#2352;&#2381;&#2341;
&#2350;&#2370;&#2354;&#2340;:
&#2346;&#2381;&#2352;&#2341;&#2350;&#2340;:
</code></code></pre><p><code>This sums up the basic introduction to written Nepali.</code></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.icodeformybhasa.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading #icodeformy&#2349;&#2366;&#2359;&#2366;! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Tokenization in Nepali]]></title><description><![CDATA[Tokenization is generally the first step in text analysis applications.]]></description><link>https://www.icodeformybhasa.com/p/tokenization-in-nepali</link><guid isPermaLink="false">https://www.icodeformybhasa.com/p/tokenization-in-nepali</guid><dc:creator><![CDATA[Shreeya]]></dc:creator><pubDate>Mon, 17 Jul 2017 04:22:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fd6e7a89-206d-4cdb-b5d0-58bba1b2d7cb_781x319.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Tokenization is generally the first step in text analysis applications. It is the process of splitting the given string into units, called tokens. A token is a sequence of characters, usually a word or sentence that is semantically significant for text analysis. Tokenization is a language-specific task; for instance, a language like Chinese that has no space between words requires a language-specific approach towards tokenization. 
Nepali, however, separates words with white space, so the general boundary-based tokenization method can be used.</p><h2><strong>Sentence-level Tokenization</strong></h2><p>In cases where we want our tokens to be sentences, the possible token boundaries are &#2346;&#2370;&#2352;&#2381;&#2339;&#2357;&#2367;&#2352;&#2366;&#2350; (&#2404;), &#2346;&#2381;&#2352;&#2358;&#2381;&#2344; &#2330;&#2367;&#2344;&#2381;&#2361; (?) and &#2357;&#2367;&#2360;&#2381;&#2350;&#2351;&#2366;&#2342;&#2367;&#2348;&#2379;&#2343;&#2325; (!). So, by splitting the given text at these marks, sentence-level tokenization can be achieved.</p><p>&#2346;&#2370;&#2352;&#2381;&#2339;&#2357;&#2367;&#2352;&#2366;&#2350; (&#2404;) in Nepali is equivalent to a full stop (.) in English.</p><pre><code><code># Split at ?, &#2404; or !
import re

def sent_tokenize(text):
    return re.split('(?&lt;=[&#2404;?!]) +', text)

INPUT: 
&#2346;&#2352;&#2367;&#2358;&#2381;&#2352;&#2350; &#2344;&#2327;&#2352;&#2368; &#2361;&#2369;&#2344;&#2381;&#2331;? &#2346;&#2352;&#2367;&#2358;&#2381;&#2352;&#2350; &#2360;&#2347;&#2354;&#2340;&#2366;&#2325;&#2379; &#2319;&#2325; &#2350;&#2366;&#2340;&#2381;&#2352; &#2348;&#2366;&#2335;&#2379; &#2361;&#2379;&#2404; &#2332;&#2379; &#2346;&#2352;&#2367;&#2358;&#2381;&#2352;&#2350; &#2327;&#2352;&#2381;&#2331;, &#2313;&#2361;&#2368; &#2360;&#2347;&#2354; &#2361;&#2369;&#2344;&#2381;&#2331;&#2404; &#2309;&#2348; &#2340; &#2346;&#2352;&#2367;&#2358;&#2381;&#2352;&#2350; &#2327;&#2352;&#2381;&#2331;&#2380; &#2344;&#2367;? &#2344;&#2327;&#2352;&#2368; &#2325;&#2361;&#2366;&#2305; &#2361;&#2369;&#2344;&#2381;&#2331; &#2340;!

OUTPUT:
['&#2346;&#2352;&#2367;&#2358;&#2381;&#2352;&#2350; &#2344;&#2327;&#2352;&#2368; &#2361;&#2369;&#2344;&#2381;&#2331;?', '&#2346;&#2352;&#2367;&#2358;&#2381;&#2352;&#2350; &#2360;&#2347;&#2354;&#2340;&#2366;&#2325;&#2379; &#2319;&#2325; &#2350;&#2366;&#2340;&#2381;&#2352; &#2348;&#2366;&#2335;&#2379; &#2361;&#2379;&#2404;', '&#2332;&#2379; &#2346;&#2352;&#2367;&#2358;&#2381;&#2352;&#2350; &#2327;&#2352;&#2381;&#2331;, &#2313;&#2361;&#2368; &#2360;&#2347;&#2354; &#2361;&#2369;&#2344;&#2381;&#2331;&#2404;', '&#2309;&#2348; &#2340; &#2346;&#2352;&#2367;&#2358;&#2381;&#2352;&#2350; &#2327;&#2352;&#2381;&#2331;&#2380; &#2344;&#2367;?', '&#2344;&#2327;&#2352;&#2368; &#2325;&#2361;&#2366;&#2305; &#2361;&#2369;&#2344;&#2381;&#2331; &#2340;!']
</code></code></pre><p>You can find out more about punctuation marks in Nepali at <a href="https://nepalgo.tumblr.com/post/71951951192/punctuation-marks">Punctuation in Nepali</a>.</p><h2><strong>Word-level Tokenization</strong></h2><p>In languages whose words are separated by blank spaces, the token boundary for word-level tokenization is the blank/white space. Nepali is one such language, so word-level tokenization in Nepali can be achieved by splitting the text at those white spaces. However, it is not as simple as it looks, especially when dealing with punctuation. Some punctuation marks are easy to handle: simply replace them with a white space.</p><p>Cases involving the hyphen (-) and colon (:) need to be handled differently.</p><h2><strong>Replacing Punctuation with White Space</strong></h2><p>The following punctuation marks can be handled by simply replacing them with a white space.</p><pre><code><code>,
)
(
{
}
[]
!
&#8216;
&#8217;
&#8220;
&#8221;
:-
?
&#2404;
/
&#8212;
</code></code></pre><pre><code><code>INPUT:
&#2352;&#2366;&#2350; &#2310;&#2351;&#2379; &#2352; &#2349;&#2344;&#2381;&#2351;&#2379;, "&#2342;&#2366;&#2311; &#2352; &#2342;&#2367;&#2342;&#2367;, &#2351;&#2340;&#2367; &#2342;&#2367;&#2344; &#2325;&#2340;&#2366; &#2361;&#2369;&#2344;&#2369;&#2361;&#2369;&#2344;&#2381;&#2341;&#2381;&#2351;&#2379;?" &#2340;&#2348;, &#2342;&#2366;&#2311;/&#2342;&#2367;&#2342;&#2367;&#2354;&#2375; &#2325;&#2375;&#2361;&#2367; &#2344;&#2348;&#2379;&#2354;&#2368; &#2310;&#2347;&#2381;&#2344;&#2379; (&#2348;&#2366;&#2335;&#2379;) &#2354;&#2366;&#2327;&#2381;&#2344;&#2369;&#2349;&#2351;&#2379;&#2404;

OUTPUT:
['&#2352;&#2366;&#2350;', '&#2310;&#2351;&#2379;', '&#2352;', '&#2349;&#2344;&#2381;&#2351;&#2379;', '&#2342;&#2366;&#2311;', '&#2352;', '&#2342;&#2367;&#2342;&#2367;', '&#2351;&#2340;&#2367;', '&#2342;&#2367;&#2344;', '&#2325;&#2340;&#2366;', '&#2361;&#2369;&#2344;&#2369;&#2361;&#2369;&#2344;&#2381;&#2341;&#2381;&#2351;&#2379;', '&#2340;&#2348;', '&#2342;&#2366;&#2311;', '&#2342;&#2367;&#2342;&#2367;&#2354;&#2375;', '&#2325;&#2375;&#2361;&#2367;', '&#2344;&#2348;&#2379;&#2354;&#2368;', '&#2310;&#2347;&#2381;&#2344;&#2379;', '&#2348;&#2366;&#2335;&#2379;', '&#2354;&#2366;&#2327;&#2381;&#2344;&#2369;&#2349;&#2351;&#2379;']
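# A minimal word tokenizer along the lines described above: replace
# the listed punctuation marks with spaces, then split on whitespace.
# (Hyphen and colon handling from the next sections is omitted here;
# the function name is illustrative.)
def word_tokenize(text):
    text = text.replace(':-', ' ')
    for mark in ',(){}[]!‘’“”?।/—':
        text = text.replace(mark, ' ')
    return text.split()

# word_tokenize('राम आयो, र गयो।') gives ['राम', 'आयो', 'र', 'गयो']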
</code></code></pre><h2><strong>Hyphen (-)</strong></h2><p>In Nepali, a hyphen (&#2351;&#2379;&#2332;&#2325; &#2330;&#2367;&#2344;&#2381;&#2361;) is used to link word pairs: opposites, analogies, or similar words. In such cases, the hyphen is considered part of the token itself.</p><pre><code><code>&#2332;&#2368;&#2357;&#2344;-&#2346;&#2381;&#2352;&#2325;&#2381;&#2352;&#2367;&#2351;&#2366;

&#2310;-&#2310;&#2347;&#2381;&#2344;&#2379;

&#2360;-&#2360;&#2366;&#2344;&#2379;
</code></code></pre><p>However, in many online texts, hyphens are also used in place of one of the &#2344;&#2367;&#2352;&#2381;&#2342;&#2375;&#2358;&#2325; &#2330;&#2367;&#2344;&#2381;&#2361;, the dash (&#8212;). In this case, a hyphen can be used in two ways.</p><p>In the first case, the hyphen occurs independently.</p><pre><code><code># Independent hyphen i.e. not attached to the word
&#2310;&#2325;&#2381;&#2352;&#2379;&#2358; - &#2328;&#2366;&#2350;
</code></code></pre><p>In cases like the one shown above, the hyphen is simply replaced with a white space.</p><p>In the second case, the hyphen is attached to the end of a word. Here the hyphen is first tokenized with the word, and then any hyphen that occurs as the last character of a token is removed.</p><pre><code><code>if word.endswith('-'):
    words.append(word[:-1])
else:
    words.append(word)

</code></code></pre><pre><code><code>INPUT: 

&#2310;-&#2310;&#2347;&#2381;&#2344;&#2379; &#2348;&#2366;&#2335;&#2379; &#2354;&#2366;&#2327;&#2375;- &#2349;&#2344;&#2381;&#2342;&#2376; - &#2358;&#2381;&#2351;&#2366;&#2350;&#2346;&#2344;&#2367; &#2310;&#2347;&#2381;&#2344;&#2379; &#2348;&#2366;&#2335;&#2379; &#2361;&#2367;&#2337;&#2381;&#2351;&#2379;&#2404;

OUTPUT:

['&#2310;-&#2310;&#2347;&#2381;&#2344;&#2379;', '&#2348;&#2366;&#2335;&#2379;', '&#2354;&#2366;&#2327;&#2375;', '&#2349;&#2344;&#2381;&#2342;&#2376;', '&#2358;&#2381;&#2351;&#2366;&#2350;&#2346;&#2344;&#2367;', '&#2310;&#2347;&#2381;&#2344;&#2379;', '&#2348;&#2366;&#2335;&#2379;', '&#2361;&#2367;&#2337;&#2381;&#2351;&#2379;']
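# The second case can be expressed compactly (the helper name is
# illustrative): word-internal hyphens survive, a trailing one is cut.
def strip_trailing_hyphen(tokens):
    return [t[:-1] if t.endswith('-') else t for t in tokens]

# strip_trailing_hyphen(['आ-आफ्नो', 'लागे-']) gives ['आ-आफ्नो', 'लागे']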
</code></code></pre><h2><strong>Colon (:)</strong></h2><p>In Nepali, a colon can occur as part of a word or as a &#2344;&#2367;&#2352;&#2381;&#2342;&#2375;&#2358;&#2325; &#2330;&#2367;&#2344;&#2381;&#2361; in place of a dash (&#8212;). A colon that is part of a word can appear either in the middle or at the end of the word. Some such words are listed below.</p><pre><code><code>&#2342;&#2369;:&#2326;
&#2344;&#2367;:&#2360;&#2381;&#2357;&#2366;&#2352;&#2381;&#2341;
&#2350;&#2370;&#2354;&#2340;:
&#2346;&#2381;&#2352;&#2341;&#2350;&#2340;:
&#2325;&#2381;&#2352;&#2350;&#2358;:
</code></code></pre><p>Colons used in place of a dash fall into two cases, similar to those discussed above for the hyphen. The first case is handled the same way as the first case of the hyphen, i.e., by replacing the punctuation with a white space. Handling the second case, where the colon appears as the last character of a word, is a little tricky because it is difficult to tell whether the colon is part of the word or used in place of a dash. Since only a handful of Nepali words end with a colon, the current tokenizer uses a lexicon of such words to decide whether an end-of-word colon is part of the word.</p><p>The trailing colon is removed if the word does not exist in the lexicon.</p><pre><code><code>if word.endswith(':') and word not in colon_lexicon:
    words.append(word[:-1])
else:
    words.append(word)
</code></code></pre><pre><code><code>INPUT:

&#2351;&#2360;&#2381;&#2340;&#2366; &#2331;&#2344;&#2381;: &#2346;&#2381;&#2352;&#2341;&#2350;&#2340;: &#2344;&#2367;:&#2360;&#2381;&#2357;&#2366;&#2352;&#2381;&#2341;&#2404;

OUTPUT: 

['&#2351;&#2360;&#2381;&#2340;&#2366;', '&#2331;&#2344;&#2381;', '&#2346;&#2381;&#2352;&#2341;&#2350;&#2340;:', '&#2344;&#2367;:&#2360;&#2381;&#2357;&#2366;&#2352;&#2381;&#2341;']
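# The lexicon check can be wrapped as a helper (the lexicon contents
# and helper name here are a small illustrative sample, not the full
# list used by the tokenizer):
colon_lexicon = {'मूलत:', 'प्रथमत:', 'क्रमश:'}

def strip_trailing_colon(token):
    if token.endswith(':') and token not in colon_lexicon:
        return token[:-1]
    return token

# strip_trailing_colon('छन्:') gives 'छन्'; 'प्रथमत:' is kept intact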
</code></code></pre><h2><strong>Period (.)</strong></h2><p>The period is another punctuation mark used in Nepali. Unlike in English, where a period can both end a sentence (i.e., act as a full stop) and appear as part of an abbreviation, in Nepali it is used only for the latter.</p><pre><code><code>&#2327;&#2366;.&#2357;&#2367;.&#2360;.
&#2337;&#2366;.
&#2325;&#2367;.&#2350;&#2368;.
</code></code></pre><p>So, for tokenization of Nepali texts, period is considered as a part of the token itself.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.icodeformybhasa.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading #icodeformy&#2349;&#2366;&#2359;&#2366;! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item></channel></rss>