Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?

1Rutgers University, 2University of Liverpool, 3Northwestern University, 4New Jersey Institute of Technology, 5University of Exeter, 6Wake Forest University
🌟Equal Contribution
COLING 2025 Accepted

Concept Depth characterizes the comprehension ability of large language models (LLMs) and how difficult a given concept is to understand. We apply probing techniques🔍 to each layer's embeddings and measure the per-layer accuracy, F1-score, and AUC on the classification task.
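As a concrete illustration, the sketch below trains a simple linear probe on each layer's hidden states and reports the three metrics per layer. It is a minimal example, not the exact code from our repository: the checkpoint name is only an assumption, the last-token embedding is used as the layer representation, and load_binary_dataset() is a hypothetical loader returning texts with 0/1 labels.

# Minimal layer-wise probing sketch (illustrative; not the repository's exact pipeline).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True).eval()

def layer_features(texts):
    """Return one [n_samples, hidden_dim] feature list per transformer layer."""
    per_layer = None
    for text in texts:
        inputs = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # out.hidden_states holds the embedding layer plus every transformer layer;
        # we keep the last-token representation of each transformer layer.
        feats = [h[0, -1, :].float().numpy() for h in out.hidden_states[1:]]
        per_layer = per_layer or [[] for _ in feats]
        for bucket, f in zip(per_layer, feats):
            bucket.append(f)
    return per_layer

texts, labels = load_binary_dataset()  # hypothetical loader: texts and 0/1 labels

for depth, X in enumerate(layer_features(texts), start=1):
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    pred, score = probe.predict(X_te), probe.predict_proba(X_te)[:, 1]
    print(f"layer {depth}: acc={accuracy_score(y_te, pred):.3f} "
          f"f1={f1_score(y_te, pred):.3f} auc={roc_auc_score(y_te, score):.3f}")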




LLMs are asked to handle both easy and complex tasks. The more complex a task is, the deeper the layers an LLM needs to understand it. The stronger the LLM, the more complex the tasks it can learn.

Abstract

Large language models (LLMs) have shown remarkable performance across a wide range of tasks. However, the mechanisms by which these models encode tasks of varying complexities remain poorly understood. In this paper, we explore the hypothesis that LLMs process concepts of varying complexities in different layers, introducing the idea of "Concept Depth" to suggest that more complex concepts are typically acquired in deeper layers.

Specifically, we categorize concepts based on their level of abstraction, defining them in the order of increasing complexity within factual, emotional, and inferential tasks. We conduct extensive probing experiments using layer-wise representations across various LLM families (Gemma, LLaMA, Qwen) on various datasets spanning the three domains of tasks.

Our findings reveal that simpler tasks can be probed accurately from shallow layers, while more complex tasks typically necessitate deeper layers for accurate understanding. Additionally, we examine how external factors, such as adding noise to the input and quantizing the model weights, might affect layer-wise representations. Our findings suggest that these factors can impede the development of a conceptual understanding within LLMs until deeper layers are reached.

We hope that our proposed concept and experimental insights will enhance the understanding of the mechanisms underlying LLMs. Our codes are available at https://github.com/Luckfort/CD.



🗃️Dataset Information

We use the following datasets in our experiments.

💡Fact

We let the LLM solve true-or-false problems grounded in objective facts.

🌆 Cities: Consists of statements about the locations of cities together with their veracity labels (e.g., "The city of Zagreb is in Japan," which is labeled false).

🔎 Common-Claim: A dataset of boolean statements, each labeled by two humans as common-knowledge-true, common-knowledge-false, or neither.

💥 Counterfact: Includes a wide range of counterfactual statements that allow quantitative testing of specificity and generalization when learning a counterfactual.

❤️Emotion

Given a social media post or a movie review, the LLM solves a binary sentiment problem.

💢 HateEval: Contains tweets that were annotated hierarchically for hate speech.

🎬 STSA: Includes movie reviews labeled as positive or negative, reflecting the writer's overall intention for the review.

📽️ IMDb: A benchmark dataset for binary sentiment classification. We use 2000 of these samples.

🤭 Sarcasm: A high-quality news headline dataset that labels whether each headline is sarcastic.

🧠Reasoning

Examining the LLM's reasoning skills on binary classification tasks.

🗝️ StrategyQA: Contains creative and diverse yes/no questions spanning a wide range of knowledge domains, all requiring implicit reasoning steps.

🪙 Coinflip: Includes coin flipping queries, asking if a coin remains heads up after it is either flipped or left unflipped by individuals.

👀Case Examples

Fact: Cities

✅ Judge the statement is True or False. The city of Tokyo is in Japan.

❌ Judge the statement is True or False. The city of Lodz is in the Dominican Republic.

Emotion: HateEval

Here it is not about Refugees or Illegal immigrants. It is about whether one has documents before 1971 or not. Now, it is difficult for slum people and beggars to show valid documents, except the name in voter list. According to the comment, tell whether they present hate speech or not.

Labor migrants transfer almost $10 billion a year to Ukraine. According to the comment, tell whether they present hate speech or not.

Reasoning: Coin-Flip

A coin is heads up. Whitney flips the coin. Erika does not flip the coin. Tj does not flip the coin. Benito flips the coin. Is the coin still heads up? Note that "flip" here means "reverse". According to the flipping process above, determine if a coin remains heads up after it is either flipped or left unflipped by individuals. Therefore, the answer (Yes or No) is?

A coin is heads up. Lucky does not flip the coin. Mireya flips the coin. Jj flips the coin. Kc flips the coin. Is the coin still heads up? Note that "flip" here means "reverse". According to the flipping process above, determine if a coin remains heads up after it is either flipped or left unflipped by individuals. Therefore, the answer (Yes or No) is?

⚓️Anchoring Difficulties


The dataset with the highest accuracy, IMDb, is deemed the easiest dataset to classify. Conversely, the dataset with the lowest accuracy, Coin-Flip, is considered the most difficult to classify.



To ascertain the learning difficulty of each dataset, we utilize the LLaMA3-8B-Instruct, GPT-4o-mini, and Qwen2-7B-Instruct models. Our approach treats each sample in the datasets as a binary classification problem posed via prompting.

The model generates a response for each sample, from which we infer a judgment, categorizing it as either "Yes" or "No". By comparing these judgments with the actual labels, we compute the accuracy for each dataset.
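A minimal sketch of this prompting-based check is shown below. It assumes a locally hosted Qwen2-7B-Instruct model via Hugging Face transformers; the prompt suffix, the regex used to map the reply to "Yes"/"No", and load_dataset_samples() are illustrative assumptions rather than the exact pipeline.

# Illustrative prompting-based difficulty check (assumptions noted above).
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-7B-Instruct"  # one of the three judge models listed above
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

def ask_yes_no(question: str) -> str:
    """Prompt the model with a binary question and map its reply to 'Yes' or 'No'."""
    messages = [{"role": "user", "content": question + " Answer with Yes or No only."}]
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    reply = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return "Yes" if re.search(r"\byes\b", reply, re.IGNORECASE) else "No"

questions, gold = load_dataset_samples()  # hypothetical loader: questions and "Yes"/"No" labels
accuracy = sum(ask_yes_no(q) == y for q, y in zip(questions, gold)) / len(gold)
print(f"dataset accuracy = {accuracy:.3f}")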

📊Experiments

RQ1: Do different LLMs' Concept Depths behave consistently in the same dataset?

We categorize the performances into three types. 1) For Cities, STSA, IMDb, and Sarcasm, the LLMs suddenly understand the tasks at intermediate layers. 2) For CommonClaim and HateEval, the LLMs already understand the tasks in shallower layers. 3) For Counterfact, StrategyQA, and Coinflip, the tasks are more difficult to understand than the others. Therefore, we consider the tasks in types 1 and 2 easy and those in type 3 complex.

Linear probing accuracy of Gemma-7B, LLaMA-7B, Qwen-7B on nine datasets.



RQ2: Do LLMs of different sizes within the same family (e.g., the LLaMA family) have consistent Concept Depth?

Comparing models of different sizes within the same LLM family yields two observations. 1) As the number of parameters increases, the peak accuracy gradually increases and the converging point gradually moves to earlier layers. 2) Larger models grasp the concepts earlier and better.

The peak accuracy of each dataset on Gemma, LLaMA, and Qwen, represented as a percentage of model depth.



The converging point of each dataset on Gemma, LLaMA, and Qwen, represented as a percentage of model depth.
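The two depth metrics above can be derived from a per-layer probing-accuracy curve. The sketch below uses a simplified rule (the converging point is taken as the first layer whose accuracy comes within a small tolerance of the peak), which is an assumption for illustration rather than the paper's exact definition.

# Sketch: peak-accuracy depth and converging point as percentages of model depth.
def depth_metrics(layer_acc, tol=0.01):
    """layer_acc: probing accuracies ordered from the shallowest to the deepest layer."""
    n_layers = len(layer_acc)
    peak_acc = max(layer_acc)
    peak_layer = layer_acc.index(peak_acc) + 1  # 1-indexed layer of the peak
    # Assumed rule: converging point = first layer within `tol` of the peak accuracy.
    converge_layer = next(i + 1 for i, acc in enumerate(layer_acc) if acc >= peak_acc - tol)
    return {
        "peak_accuracy": peak_acc,
        "peak_depth_pct": 100.0 * peak_layer / n_layers,
        "converge_depth_pct": 100.0 * converge_layer / n_layers,
    }

# Example: accuracy jumps at the middle layers and plateaus afterwards.
print(depth_metrics([0.55, 0.58, 0.62, 0.80, 0.91, 0.92, 0.92, 0.93]))
# -> peak at 100% depth, converging point at 75% depth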



RQ3: Do LLMs of the same size behave consistently in Concept Depth?

With the same number of model parameters, the models generally have a comparable understanding of the datasets.

Linear probing accuracy of Gemma-7B, LLaMA-7B, Qwen-7B on nine datasets.



Ablation Study: How do quantization (lower model precision) and noise (testing robustness) affect LLMs' Concept Depth?

Noise or 8-bit quantization causes the accuracy to converge more slowly, whereas compressing the LLMs to 16-bit weights barely harms the understanding process. The layer-wise representations of LLMs are thus susceptible to noise and high-ratio quantization, so high-ratio quantization should be applied cautiously at inference time.
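For reference, the sketch below sets up the two perturbations before re-running the same layer-wise probes: loading an 8-bit-quantized model through bitsandbytes, and injecting Gaussian noise into the input embeddings of a full-precision model. The checkpoint name and the noise scale are assumptions, not the exact settings used in the paper.

# Illustrative setup for the two ablations (8-bit quantization and input noise).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)

# Ablation 1: 8-bit quantized weights (requires the bitsandbytes package).
quant_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
    output_hidden_states=True,
)

# Ablation 2: Gaussian noise added to the input embeddings of a full-precision model.
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True).eval()

def noisy_hidden_states(text, noise_std=0.01):
    """Forward pass with Gaussian noise (assumed scale) added to the input embeddings."""
    inputs = tok(text, return_tensors="pt")
    embeds = model.get_input_embeddings()(inputs["input_ids"])
    embeds = embeds + noise_std * torch.randn_like(embeds)
    with torch.no_grad():
        out = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"])
    return out.hidden_states  # feed these into the same layer-wise probes as before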

Conclusion

This paper proposes Concept Depth, the phenomenon that different concepts are learned in different layers of LLMs, i.e., more difficult concepts are only fully acquired at deeper layers.

We conducted several experiments around Concept Depth using probing techniques. Our research suggests that LLMs tend to categorize easy tasks effectively, indicating that these concepts are learned in the initial layers.

In contrast, complex tasks may only be recognizable (if at all) in deeper layers, and LLMs of the same size perform largely consistently across datasets regarding Concept Depth.

Compressing model weights to 16-bit representations is also a promising way for future LLM designs to save computation memory.

BibTeX

@article{jin2024exploring,
  author    = {Jin, Mingyu and Yu, Qinkai and Huang, Jingyuan and Zeng, Qingcheng and Wang, Zhenting and Hua, Wenyue and Zhao, Haiyan and Mei, Kai and Meng, Yanda and Ding, Kaize and others},
  title     = {Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?},
  journal   = {arXiv preprint arXiv:2404.07066},
  year      = {2024},
}