> [!META]- Inline Metadata
> [status:: boat]
> [source:: [[Democratizing LLMs for Low-Resource Languages by Leveraging Their English Dominant Abilities With Linguistically Diverse Prompts]]]
> [tags:: #note/evergreen #state/boat #concepts/programming/machine-learning/large-language-models #concepts/programming/machine-learning/machine-translation/low-resource-languages/techniques]
> [up:: [[Large Language Model MOC]]]
**Linguistically Diverse Prompting** is a technique for doing unsupervised zero-shot generative translation and summarization tasks with low-resource languages. It makes three assumptions:
1. LLMs have already learned most of the knowledge and task concepts implicitly during pre-training[^1]
2. LLMs implicitly learn to encode and understand language before they learn to generate it, which would mean that it is mainly generative ability that improves later as more data is seen.
3. LLMs can already exhibit near-human generative abilities in the dominant language $E$, where pre-training data is orders of magnitude larger than for other languages. This means that translation tasks between $E$ and a minority language $X$ are not symmetric, but are understood as the following tasks:
	1. $X \rightarrow E$ translation is a natural language understanding (NLU) task in $X$
	2. $E \rightarrow X$ translation is a natural language generation (NLG) task in $X$, which is harder than NLU. While input in $E$ is easy to encode, generating the intended result in $X$ is challenging if the model has not seen enough text in $X$.
With this in mind, LDP works by presenting in-context exemplars to the model so that the model locates the task of "translate from *any language* $X$ into $E$."
The exemplars are `[input]\n[output]` pairs, where the inputs come from a linguistically and geographically diverse set of high-resource languages with various scripts, and each target `[output]` is its $E$ equivalent, which can be produced with existing multilingual unsupervised machine translation models if reference translations are not available. A test input in the low-resource language $X$ is then appended, and the model translates it into English.
![[Pasted image 20230704211653.png]]
## LDP for Translation Tasks
### X $\rightarrow$ E task
Denote this prompted translator $\mathcal{L}_{X \rightarrow E}^{mt}$. To build it, gather $n$ $Z_i \rightarrow E$ exemplar pairs, where each $Z_i$ belongs to a diverse set of languages with varied writing systems, lexical features, and regional characteristics, none of which are English or the low-resource target language. These can be gathered by randomly selecting a single sentence from unlabeled data in each $Z_i$ and, if reference translations are not available, translating it into $E$ with unsupervised MT models.
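A minimal sketch of how such a prompt might be assembled. The exemplar sentences, the test input, and the `generate` call are illustrative placeholders (not the paper's actual data or decoding setup); the languages used are the high-resource ones mentioned in the benchmarks below.

```python
# Minimal sketch of an X -> E Linguistically Diverse Prompt. The exemplar
# sentences, the test input, and the `generate` call are illustrative
# placeholders, not the paper's data or decoding setup.

def build_ldp_prompt(exemplars: list[tuple[str, str]], test_input: str) -> str:
    """Join exemplar (input, output) pairs with blank lines, then append the test input."""
    blocks = [f"{src}\n{tgt}" for src, tgt in exemplars]
    blocks.append(test_input)          # the model continues with the E translation
    return "\n\n".join(blocks) + "\n"

# Exemplars from linguistically and geographically diverse high-resource
# languages (none of them English or the low-resource target language X).
exemplars = [
    ("Je suis très heureux de vous rencontrer.", "I am very happy to meet you."),   # French
    ("今天的天气非常好。", "The weather is very nice today."),                         # Chinese
    ("أين أقرب محطة قطار؟", "Where is the nearest train station?"),                  # Arabic
    ("Tôi muốn đặt một phòng khách sạn.", "I would like to book a hotel room."),     # Vietnamese
]

prompt = build_ldp_prompt(exemplars, test_input="<sentence in low-resource language X>")
# english_output = generate(prompt)   # any frozen LLM
```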
### E $\rightarrow$ X task
For this direction, use $\mathcal{L}_{X \rightarrow E}^{mt}$ (the task above) to build intra-lingual prompts from unlabeled data in the target language $X$. Given $m$ unlabeled texts $s_X^j$ from a monolingual corpus in $X$, produce synthetic **back-translation** (BT) outputs $s_E^j = \mathcal{L}_{X \rightarrow E}^{mt}(s_X^j)$, which serve as the English side of each exemplar pair. Then use those synthetic pairs as in-context exemplars for the $E \rightarrow X$ translation task on an input $s_E$.
> [!NOTE]
> This back-translation strategy can also be used for $X \rightarrow E$ translation tasks, but the source paper found it gives no additional benefit over $\mathcal{L}_{X \rightarrow E}^{mt}$.
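A minimal sketch of this back-translation prompt construction, assuming `ldp_translate_x_to_e` is a stand-in callable for $\mathcal{L}_{X \rightarrow E}^{mt}$ and the monolingual sentences come from any corpus in $X$:

```python
# Sketch of the intra-lingual E -> X prompt built from synthetic back-translation.
# `ldp_translate_x_to_e` stands in for the prompted X -> E translator above; the
# corpus sentences and the downstream `generate` call are hypothetical placeholders.

from typing import Callable

def build_e_to_x_prompt(monolingual_x: list[str], test_input_e: str,
                        ldp_translate_x_to_e: Callable[[str], str]) -> str:
    # 1) Back-translate m unlabeled X sentences into synthetic English sources.
    synthetic_pairs = [(ldp_translate_x_to_e(s_x), s_x) for s_x in monolingual_x]
    # 2) Use the (synthetic E, real X) pairs as in-context exemplars, then append
    #    the real English test input; the model continues with its X translation.
    blocks = [f"{s_e}\n{s_x}" for s_e, s_x in synthetic_pairs]
    blocks.append(test_input_e)
    return "\n\n".join(blocks) + "\n"
```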
### Unsupervised query-key-value (QKV) fine-tuning
LDP lets us create high volumes of synthetic $X-E$ data using $\mathcal{L}_{X \rightarrow E}^{mt}$, which can then be used to fine-tune the LLM for translation so that no in-context prompting is needed at inference time. During training, compute the loss only on the `[output]` tokens so the model learns to generate the right language. The source paper finds that the model fails to learn to generate the low-resource languages unless the learnable parameter count is increased enough to negate any efficiency gains from approaches like parameter-efficient fine-tuning (PEFT).
This data follows the template `[input]<lang-tag>[output]` and is used to directly fine-tune the query-key-value linear weights of all attention layers.
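A hedged sketch of what this could look like with a BLOOM-style model in Hugging Face Transformers, where the fused QKV projection happens to be named `query_key_value`; the checkpoint, tokenizer behaviour, and optional language tag here are assumptions, not the paper's exact setup.

```python
# Hedged sketch: format synthetic pairs as `[input]<lang-tag>[output]` and mark
# only the fused query-key-value projections of a BLOOM-style model as trainable.
# The checkpoint and the "query_key_value" module name follow Hugging Face's
# BLOOM implementation; other architectures name these projections differently.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-7b1")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-7b1")

# Freeze everything except the QKV linear weights of every attention layer.
for name, param in model.named_parameters():
    param.requires_grad = "query_key_value" in name

def make_example(src: str, tgt: str, lang_tag: str = "") -> dict:
    """Tokenize `[input]<lang-tag>[output]`, masking the loss on everything before [output]."""
    prompt_ids = tokenizer(f"{src}{lang_tag}", add_special_tokens=False).input_ids
    target_ids = tokenizer(tgt + tokenizer.eos_token, add_special_tokens=False).input_ids
    return {
        "input_ids": prompt_ids + target_ids,
        "labels": [-100] * len(prompt_ids) + target_ids,  # loss only on [output]
    }
```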
## Benchmarking and Tests
Used [chrF++](https://huggingface.co/spaces/evaluate-metric/chrf) and BLEU as metrics for comparing performance against supervised prompting[[Democratizing LLMs for Low-Resource Languages by Leveraging Their English Dominant Abilities With Linguistically Diverse Prompts#^3geuz9|*]].
High-resource languages used for synthetic back-translation: Arabic, Chinese, Vietnamese, and French.
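A quick sketch of scoring with the linked Hugging Face `evaluate` metric; setting `word_order=2` turns plain chrF into chrF++, and the hypothesis/reference strings are placeholders.

```python
# Sketch of computing chrF++ with Hugging Face `evaluate`; word_order=2 adds word
# unigrams/bigrams on top of character n-grams (i.e. chrF++). Strings are placeholders.
import evaluate

chrf = evaluate.load("chrf")
result = chrf.compute(
    predictions=["model translation of the test sentence"],
    references=[["reference translation of the test sentence"]],
    word_order=2,
)
print(result["score"])
```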
> [!NOTE]
> Language tags did not give any performance boost.
### Low-Resource $\leftrightarrow$ English Translation
- ROOTS corpus
- BLOOM model
- Also used [[Linguistically Diverse Prompting#Unsupervised query-key-value (QKV) fine-tuning|unsupervised fine-tuning (QKV)]]
![[Pasted image 20230704231334.png]]
### Non-English-centric Translation
### Translation with LLaMA
Evaluated similarly to [[Linguistically Diverse Prompting#Low-Resource leftrightarrow English Translation|above]], with similar trends. Overall scores were higher for non-Latin languages.
### Zero-shot Multilingual Summarization
LDP methods outperform the XLT English-pivoting instruction approach.
## Usage Considerations
- For 10 Indic LRLs, choosing a single related language (in this case, Hindi) for cross-lingual prompting caused the model to translate the prompt language rather than the test language. A single distant language gives better results, but the optimal choice is a wide variety of languages across different regions.
- Using the English language tag for each exemplar language confuses the model when translating. Native tags with back-translation, or no tags with back-translation, seem to work best for in-context prompts.
# Source
- [[Democratizing LLMs for Low-Resource Languages by Leveraging Their English Dominant Abilities With Linguistically Diverse Prompts]]
- [Paper link](https://www.researchgate.net/publication/371728889_Democratizing_LLMs_for_Low-Resource_Languages_by_Leveraging_their_English_Dominant_Abilities_with_Linguistically-Diverse_Prompts/fulltext/64926c52b9ed6874a5c36110/Democratizing-LLMs-for-Low-Resource-Languages-by-Leveraging-their-English-Dominant-Abilities-with-Linguistically-Diverse-Prompts.pdf?origin=publication_detail)
[^1]: ![[Democratizing LLMs for Low-Resource Languages by Leveraging Their English Dominant Abilities With Linguistically Diverse Prompts#^w8799k]]