Series NLP1: Understanding what’s happening behind ChatGPT, Bard, HuggingChat and others.
Episode 1 – Large Language Models
ChatGPT, Bard and HuggingChat are conversational generative artificial intelligence chatbots that have attracted widespread public attention. This is understandable, as they have shown amazing conversational abilities. Under the hood, they are based on Large Language Models (on models from the GPT, PaLM and LLaMA series, respectively) carefully adapted for dialogue. So, what is a Large Language Model (LLM)?
LLMs are pre-trained language models (PLMs) that contain tens or hundreds of billions (or more) of parameters and are trained on massive text corpora. Technically, they are built upon the Transformer architecture and consist of multi-head attention layers stacked in a very deep neural network. That said, if you have followed the evolution of NLP research in recent years, you certainly know past groundbreaking Transformer-based models like BERT, and GPT-1, the ancestor of GPT-4. So you surely ask yourself: are those models also LLMs?
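To make "multi-head attention layers stacked in a very deep neural network" a bit more concrete, here is a minimal NumPy sketch of a single multi-head attention layer. The random projection matrices stand in for learned weights; the function name and shapes are our own illustration, not any library's API:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, n_heads, rng):
    # x: (seq_len, d_model). Random weights stand in for learned
    # parameters: this sketch shows shapes and data flow only.
    seq_len, d_model = x.shape
    assert d_model % n_heads == 0
    d_head = d_model // n_heads
    W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.02
                          for _ in range(4))

    def heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = heads(x @ W_q), heads(x @ W_k), heads(x @ W_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # scaled dot-product
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    out = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ W_o, weights

rng = np.random.default_rng(0)
out, attn = multi_head_attention(rng.standard_normal((5, 8)), n_heads=2, rng=rng)
print(out.shape, attn.shape)  # (5, 8) (2, 5, 5)
```

A real Transformer block adds residual connections, layer normalization and a feed-forward sublayer, and an LLM stacks dozens of such blocks.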
Not exactly. BERT and GPT-1/GPT-2 are certainly pre-trained language models, but strictly speaking, they are not LLMs because they are not big enough. Although there is no formal consensus on the minimum parameter scale for LLMs, a loose threshold of 10B parameters can be used to distinguish LLMs from small-scale PLMs. So LLMs are a subset of PLMs that are big enough, but small-scale PLMs are not LLMs. At this point, you may wonder why it was useful to invent the term LLM and distinguish large-scale from small-scale PLMs at all.
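To get a feel for where models sit relative to that 10B threshold, a common back-of-the-envelope estimate puts a Transformer's non-embedding parameter count at roughly 12 · n_layers · d_model² (about 4d² for the attention projections plus 8d² for the feed-forward block with its usual 4× expansion). The formula is a rough approximation, not an exact count:

```python
def approx_params(n_layers, d_model):
    # Rough non-embedding parameter count of a Transformer:
    # ~4*d^2 (attention projections) + ~8*d^2 (feed-forward with
    # 4x hidden expansion) per layer. Embeddings and biases omitted.
    return 12 * n_layers * d_model ** 2

# GPT-2 small shape (12 layers, d_model = 768): well below the threshold.
print(f"{approx_params(12, 768) / 1e9:.3f}B")    # 0.085B

# GPT-3 shape (96 layers, d_model = 12288): far above it.
print(f"{approx_params(96, 12288) / 1e9:.0f}B")  # 174B, close to the reported 175B
```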
Because LLMs usually achieve a significant performance improvement over small-scale PLMs and exhibit some new, special abilities. LLMs scale up the model size, the pre-training data, and the total compute. Thanks to their better understanding of natural language, they can generate high-quality text based on the given context. While scaling laws can partially describe some of these improvements, other abilities, observed only once the model size exceeds a certain level, cannot be explained by them. It is important to point out that an LLM is not necessarily more capable than a small PLM, and emergent abilities may not occur in every LLM. Now, you should be eager to know: what are these emergent abilities?
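To illustrate the smooth part of the story, the scaling law of Kaplan et al. (2020) fits test loss as a power law in the number of non-embedding parameters N: L(N) = (N_c / N)^α, with the paper reporting α ≈ 0.076 and N_c ≈ 8.8 × 10¹³. The sketch below simply evaluates that published fit; the constants are the paper's, used illustratively:

```python
def kaplan_loss(n_params, n_c=8.8e13, alpha=0.076):
    # Power-law fit L(N) = (N_c / N)^alpha from Kaplan et al. (2020).
    # Predicts a smooth, steady loss decrease as parameters grow.
    return (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {kaplan_loss(n):.2f}")
```

The curve is smooth and predictable, which is exactly why abrupt, qualitative jumps in ability at certain scales (the emergent abilities mentioned above) cannot be read off it.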
Well, let’s save the question for later. That’s all for this post. Make sure to stay tuned for the next episode.
Credit: https://arxiv.org/abs/2303.18223