Large language models (LLMs) have shown remarkable performance across a wide range of tasks. However, the mechanisms by which these models encode tasks of varying complexity remain poorly understood. In this paper, we explore the hypothesis that LLMs process concepts of varying complexity in different layers, introducing the idea of "Concept Depth" to capture the observation that more complex concepts are typically acquired in deeper layers.
Specifically, we categorize concepts by their level of abstraction, ordering them by increasing complexity across factual, emotional, and inferential tasks. We conduct extensive probing experiments on layer-wise representations from several LLM families (Gemma, LLaMA, Qwen) over datasets spanning these three task domains.
Our findings reveal that simpler tasks can be decoded accurately by probes in shallow layers, while more complex tasks typically require deeper layers for accurate understanding. We also examine how external factors, such as adding noise to the input and quantizing the model weights, affect layer-wise representations. Our findings suggest that these factors can delay the emergence of conceptual understanding, pushing it to deeper layers.
We hope that our proposed concept and experimental insights will enhance the understanding of the mechanisms underlying LLMs. Our code is available at https://github.com/Luckfort/CD.
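As intuition for the probing setup described above, the snippet below gives a minimal sketch: extract the last-token hidden state from every layer and fit a simple logistic-regression probe per layer. The checkpoint name, last-token pooling, and cross-validation settings are illustrative assumptions rather than the paper's exact pipeline.

```python
# Minimal sketch of layer-wise linear probing (illustrative, not the authors' exact code).
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any Gemma / LLaMA / Qwen checkpoint from the families studied should work here.
model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def last_token_states(text):
    """Return the last-token hidden state of every layer for one input string."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states is a tuple with one (1, seq_len, dim) tensor per layer
    return [h[0, -1].float().cpu().numpy() for h in out.hidden_states]

def layerwise_probe_accuracy(texts, labels):
    """Fit a logistic-regression probe per layer; return one accuracy per layer."""
    per_layer = list(zip(*[last_token_states(t) for t in texts]))
    accs = []
    for feats in per_layer:
        X, y = np.stack(feats), np.array(labels)
        accs.append(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())
    return accs

# Usage (requires a real labeled dataset, e.g. Cities statements with True/False labels):
# depth_curve = layerwise_probe_accuracy(statements, labels)
```

The resulting accuracy-over-depth curve is what the Concept Depth analysis reads off: the layer where accuracy jumps and the layer where it converges.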
We use the following datasets in our experiments.
We let the LLM solve true-or-false problems grounded in objective facts.
🌆 Cities: Consists of statements about the location of cities and their veracity labels (e.g., The city of Zagreb is in Japan, which is wrong).
🔎 Common-Claim: A dataset of boolean statements, each labeled by two humans as common-knowledge-true, common-knowledge-false, or neither.
💥 Counterfact: Includes myriad counterfactuals that allow quantitative testing of specificity and generalization when learning a counterfactual.
Given a social media post or movie review, the LLM solves a binary sentiment classification problem.
💢 HateEval: Contains tweets annotated hierarchically for hate speech.
🎬 STSA: Includes movie reviews with positive and negative labels, reflecting the writer's overall sentiment.
📽️ IMDb: A benchmark dataset for binary sentiment classification. We use 2000 of these samples.
🤭 Sarcasm: A high-quality news headline dataset labeled for whether each headline is sarcastic.
Examining the LLM's reasoning skills via binary classification.
🗝️ StrategyQA: Contains creative and diverse yes/no questions spanning many knowledge domains that require implicit reasoning steps.
🪙 Coinflip: Includes coin flipping queries, asking if a coin remains heads up after it is either flipped or left unflipped by individuals.
✅ Judge the statement is True or False. The city of Tokyo is in Japan.
❌ Judge the statement is True or False. The city of Lodz is in the Dominican Republic.
✅ Here it is not about Refugees or Illegal immigrants. It is about whether one has documents before 1971 or not. Now, it is difficult for slum people and beggars to show valid documents, except the name in voter list. According to the comment, tell whether they present hate speech or not.
❌ Labor migrants transfer almost $10 billion a year to Ukraine. According to the comment, tell whether they present hate speech or not.
✅ A coin is heads up. Whitney flips the coin. Erika does not flip the coin. Tj does not flip the coin. Benito flips the coin. Is the coin still heads up? Note that "flip" here means "reverse". According to the flipping process above, determine if a coin remains heads up after it is either flipped or left unflipped by individuals. Therefore, the answer (Yes or No) is?
❌ A coin is heads up. Lucky does not flip the coin. Mireya flips the coin. Jj flips the coin. Kc flips the coin. Is the coin still heads up? Note that "flip" here means "reverse". According to the flipping process above, determine if a coin remains heads up after it is either flipped or left unflipped by individuals. Therefore, the answer (Yes or No) is?
To ascertain the learning difficulty of each dataset, we use the LLaMA3-8B-Instruct, GPT-4o-mini, and Qwen2-7B-Instruct models. Our approach tests each sample in the datasets as a binary classification problem via prompting.
The model generates a response for each sample, from which we infer a judgment, categorizing it as either "Yes" or "No". By comparing these judgments with the actual labels, we compute the accuracy for each dataset.
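A sketch of this prompting-based check is shown below with an open-weight instruct model; the exact prompt suffix and the string matching of "Yes"/"No" are assumptions for illustration, not necessarily the prompts used in the paper.

```python
# Sketch of the prompting-based difficulty check (prompt wording and answer parsing are assumed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-7B-Instruct"  # or a LLaMA-3 Instruct checkpoint / a GPT-4o-mini endpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def predict_yes_no(question: str) -> bool:
    """Ask the model a binary question and map its reply to True (Yes) / False (No)."""
    messages = [{"role": "user", "content": question + " Answer with Yes or No only."}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    reply = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return "yes" in reply.lower()

def dataset_accuracy(questions, labels):
    """Compare model judgments against gold labels to estimate task difficulty."""
    preds = [predict_yes_no(q) for q in questions]
    return sum(p == bool(l) for p, l in zip(preds, labels)) / len(labels)
```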
RQ1: Do different LLMs' Concept Depths behave consistently on the same dataset?
We categorize the performances into three types. 1) For Cities, STSA, IMDb, and Sarcasm, the LLMs suddenly understand the tasks at intermediate layers. 2) For CommonClaim and HateEval, the LLMs already understand the tasks in shallower layers. 3) For Counterfact, StrategyQA, and Coinflip, the tasks are more difficult to understand than the others. We therefore consider the tasks in types 1 and 2 easy and those in type 3 complex.
RQ2: Do different size LLMs in the same family (e.g., the LLaMA family) have consistent Concept Depth?
Comparing models of different sizes within the same LLM family yields two observations. 1) As the number of parameters increases, peak accuracy gradually increases and the converging point moves to earlier layers. 2) Larger models grasp the concepts earlier and better.
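One simple way to operationalize the converging point, sketched below, is the earliest layer whose probe accuracy reaches a fixed fraction of the peak accuracy; the 0.95 threshold is an illustrative choice, not necessarily the paper's exact definition.

```python
# Hedged sketch: locate the "converging point" of a layer-wise accuracy curve.
import numpy as np

def converging_point(layer_accuracies, threshold=0.95):
    """Return (peak accuracy, earliest layer reaching threshold * peak)."""
    accs = np.asarray(layer_accuracies, dtype=float)
    peak = accs.max()
    layer = int(np.argmax(accs >= threshold * peak))  # first index where the condition holds
    return peak, layer

# Example: accuracy jumps around layer 12 of a 32-layer model
curve = [0.55] * 10 + [0.70, 0.85] + [0.92] * 20
print(converging_point(curve))  # -> (0.92, 12)
```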
RQ3: Do LLMs of the same size have consistent Concept Depths?
With the same number of model parameters, the models generally have a comparable understanding of the datasets.
Ablation Study: How do quantization (lower model precision) and noise (testing robustness) affect LLMs' Concept Depths?
Noise or 8-bit quantization causes accuracy to converge more slowly, whereas compressing the LLMs to 16 bits does little harm to the understanding process. The layer-wise representations of LLMs are susceptible to noise and high-ratio quantization, so high-ratio quantization should be applied cautiously at inference time.
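The snippet below sketches how such an ablation could be set up with Hugging Face Transformers: loading 8-bit weights through bitsandbytes and injecting Gaussian noise into the input embeddings before re-running the same layer-wise probes. The noise scale and the choice to perturb embeddings rather than raw text are assumptions.

```python
# Sketch of the two ablations (assumed setup): 8-bit quantization and input noise.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# (1) 8-bit quantization: re-run the layer-wise probes on this model and compare curves.
quantized = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# (2) Noisy inputs: add Gaussian noise to the token embeddings (sigma is an assumption).
def noisy_hidden_states(model, text, sigma=0.1):
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    embeds = model.get_input_embeddings()(inputs["input_ids"])
    embeds = embeds + sigma * torch.randn_like(embeds)
    with torch.no_grad():
        out = model(inputs_embeds=embeds, output_hidden_states=True)
    return out.hidden_states  # feed these to the same probes as the clean run

# e.g. noisy_hidden_states(quantized, "The city of Tokyo is in Japan.")
```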
This paper proposes Concept Depth, the phenomenon that different concepts are learned in different layers of LLMs, i.e., more difficult concepts are fully acquired only at deeper layers.
We conducted several experiments around Concept Depth using probing techniques. Our research suggests that LLMs tend to classify easy tasks effectively, indicating that these concepts are learned in the first few layers.
In contrast, complex tasks may only be recognizable (if at all) in deeper layers, and LLMs of the same size perform largely consistently across datasets regarding Concept Depth.
Compressing model weights to 16-bit representations is also a promising way to save computation memory in future LLM designs.
@article{jin2024exploring,
author = {Jin, Mingyu and Yu, Qinkai and Huang, Jingyuan and Zeng, Qingcheng and Wang, Zhenting and Hua, Wenyue and Zhao, Haiyan and Mei, Kai and Meng, Yanda and Ding, Kaize and others},
title = {Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?},
journal = {arXiv preprint arXiv:2404.07066},
year = {2024},
}