Reading List
This is my NLP paper reading list! I maintain a list of papers (and posts) that I consider important for understanding the fundamentals of NLP, plus the papers I've enjoyed most in different sub-domains. I try to update the list frequently.
Fundamentals - Neural Language Models, Transformers, BERT, etc.
- On the difficulty of training recurrent neural networks - Analyzes the vanishing and exploding gradient problems in RNNs. (2013)
- A super easy-to-read blog post for understanding LSTMs. (2023)
- Neural machine translation by jointly learning to align and translate - Introduces the attention mechanism. (2015)
- Attention Is All You Need - The super famous transformers paper! (2017)
- Understanding attention and transformer - Blog post 1, Blog post 2. (2021)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019)
- Roformer: Enhanced transformer with rotary position embedding - An excellent paper on positional embedding. (2021)
- Deep contextualized word representations - ELMo word embeddings (2018)
- Finetuned language models are zero-shot learners - Jason Wei’s paper on Instruction Tuning (FLAN) (2022)
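Since several entries above revolve around attention, here is a minimal single-head NumPy sketch of the scaled dot-product attention equation from the Transformer paper (no masking, no multi-head projections; variable names are my own):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the core equation of 'Attention Is All You Need'."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_queries, n_keys) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row is a distribution over keys
    return weights @ V, weights

# Toy example: 3 queries attend over 4 key/value pairs of dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
out, weights = scaled_dot_product_attention(Q, K, V)
```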
Parameter-efficient Adaptation of LLMs
- LoRA: Low-Rank Adaptation of Large Language Models (2021)
- The Power of Scale for Parameter-Efficient Prompt Tuning (2021)
- QLoRA: Efficient Finetuning of Quantized LLMs (2023)
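A minimal NumPy sketch of the LoRA reparameterization from the paper above: the pretrained weight W stays frozen and only the low-rank factors A and B are trained, with B zero-initialized so the update starts at zero (the function name is my own):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0, r=4):
    """y = x W^T + (alpha / r) * x A^T B^T: frozen weight W plus a trainable rank-r update B A."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

d_out, d_in, r = 6, 5, 2                  # toy sizes; real LLMs use large d and small r
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))        # pretrained weight, kept frozen
A = rng.normal(size=(r, d_in)) * 0.01     # trainable, small Gaussian init
B = np.zeros((d_out, r))                  # trainable, zero init: the delta starts at zero
x = rng.normal(size=(3, d_in))
y = lora_forward(x, W, A, B, alpha=16.0, r=r)
```

Because B starts at zero, the adapted model initially matches the frozen one exactly; only 2 x r x d parameters are trained per layer.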
LLM Alignment and Reinforcement Learning
- Training language models to follow instructions with human feedback - RLHF Paper (2022)
- Constitutional AI: Harmlessness from AI Feedback (2022)
- An excellent, easy-to-read blog explaining the need for RLHF. (2023)
- RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback (2023)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2023)
- Fine-tuning Language Models for Factuality (2023)
- Alignment for Honesty (2023)
- Are aligned neural networks adversarially aligned? (2023)
- Understanding the Effects of RLHF on LLM Generalisation and Diversity (2024)
- Huggingface’s blog post on DPO vs. IPO vs. KTO. (2024)
- Weak-to-strong extrapolation expedites alignment (2024)
- Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback (2024)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025)
- One of the best videos I’ve seen about DeepSeek-R1. (2025)
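The DPO paper above replaces the reward-model-plus-PPO pipeline with a simple classification-style loss on preference pairs. A minimal sketch of the per-pair loss, assuming you already have policy and reference log-probs for the chosen and rejected responses (names are mine):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * implicit reward margin), where a response's
    implicit reward is its policy log-prob minus its reference-model log-prob."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Loss is small when the policy prefers the chosen response more than the reference does.
low  = dpo_loss(pi_chosen=-10.0, pi_rejected=-30.0, ref_chosen=-20.0, ref_rejected=-20.0)
high = dpo_loss(pi_chosen=-30.0, pi_rejected=-10.0, ref_chosen=-20.0, ref_rejected=-20.0)
```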
Evaluation, Factuality, and Long-context
- BLEURT: Learning Robust Metrics for Text Generation (2020)
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (2023)
- FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation (2023)
- AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web (2023)
- Long-form factuality in large language models - SAFE Score (2024)
- BooookScore: A systematic exploration of book-length summarization in the era of LLMs (2024)
- Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (2024)
- Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices (2024)
- Enabling Language Models to Implicitly Learn Self-Improvement (2024)
- VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation (2024)
- On Positional Bias of Faithfulness for Long-form Summarization (2024)
- One Thousand and One Pairs: A “novel” challenge for long-context language models - Introduces the NoCha long-context benchmark. (2024)
- HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly (2024)
- Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation (2024)
- Retrieval or Global Context Understanding? On Many-Shot In-Context Learning for Long-Context Evaluation (2024)
- Evaluating large language models at evaluating instruction following (2024)
- Evaluating correctness and faithfulness of instruction-following models for question answering (2024)
- HALoGEN: Fantastic LLM Hallucinations and Where to Find Them (2025)
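The FActScore and VeriScore papers above both reduce long-form factuality to "decompose into atomic claims, then verify each one." The scoring step is just a precision; a sketch with the model/retrieval-based decomposition and verification stubbed out as pre-judged pairs (my own simplification):

```python
def fact_precision(judged_facts):
    """FActScore-style score: fraction of atomic facts judged supported by a knowledge
    source. In the papers, decomposition and verification are done by an LM plus
    retrieval; here they are stubbed out as (fact, supported) pairs."""
    if not judged_facts:
        return 0.0
    return sum(supported for _, supported in judged_facts) / len(judged_facts)

judged = [("Marie Curie was born in Warsaw.", True),
          ("She won three Nobel Prizes.", False),  # she won two
          ("She won the 1903 Nobel Prize in Physics.", True)]
score = fact_precision(judged)  # 2/3
```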
Creativity and Narrative Writing
- The Generative AI Paradox: “What It Can Create, It May Not Understand” (2023)
- It’s not Rocket Science: Interpreting Figurative Language in Narratives (2022)
- AI as Humanity’s Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text (2024)
- Are Large Language Models Capable of Generating Human-Level Narratives? (2024)
- Art or Artifice? Large Language Models and the False Promise of Creativity (2024)
- A Design Space for Intelligent and Interactive Writing Assistants (2024)
- Echoes in AI: Quantifying Lack of Plot Diversity in LLM Outputs (2024)
RAG
- REALM: Retrieval-Augmented Language Model Pre-Training (2020)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation (2023)
- Evaluating Retrieval Quality in Retrieval-Augmented Generation (2024)
- RECOMP: Improving Retrieval-Augmented LMs with Context Compression and Selective Augmentation (2024)
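The common skeleton behind the RAG papers above is retrieve-then-generate: score documents against the query, stuff the top hits into the prompt. A toy sketch using bag-of-words cosine similarity as a stand-in for the learned dense retrievers these papers actually use (all names are mine):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bags of words (Counters)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda c: math.sqrt(sum(v * v for v in c.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

def retrieve(query, docs, k=1):
    """Rank documents by similarity to the query and return the top k."""
    q = Counter(query.lower().split())
    return sorted(docs, key=lambda d: cosine(q, Counter(d.lower().split())), reverse=True)[:k]

docs = ["REALM pretrains a retriever jointly with the language model.",
        "LoRA adds low-rank adapters to frozen weights.",
        "FreshLLMs augments prompts with live search results."]
context = retrieve("how does realm train its retriever", docs, k=1)
prompt = f"Answer using the context.\nContext: {context[0]}\nQuestion: How does REALM train its retriever?"
```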
LLM Personalization
- LaMP: When Large Language Models Meet Personalization (2023)
- Are personalized stochastic parrots more dangerous? evaluating persona biases in dialogue systems (2023)
- LongLaMP: A Benchmark for Personalized Long-form Text Generation (2024)
- Exploring Safety-Utility Trade-Offs in Personalized Language Models (2024)
- Learning to Rewrite Prompts for Personalized Text Generation (2024)
- Learning Personalized Alignment in Evaluating Open-ended Text Generation (2024)
Prompt Engineering and In-context Learning
- Rethinking the role of demonstrations: What Makes In-Context Learning Work? (2022)
- Blog on different prompting strategies. (2023)
- What learning algorithm is in-context learning? Investigations with linear models (2023)
- LLMs Are In-Context Reinforcement Learners (2024)
- More Samples or More Prompts? Exploring Effective Few-Shot In-Context Learning for LLMs with In-Context Sampling (2024)
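What these papers mean by a "demonstration" is just input-label pairs concatenated in a fixed format, with the test input appended and its label slot left open. A minimal sketch (the sentiment template is my own illustration):

```python
def few_shot_prompt(demos, query):
    """Assemble an in-context learning prompt from (input, label) demonstrations,
    ending with the test input so the model completes the label."""
    blocks = [f"Review: {x}\nSentiment: {y}" for x, y in demos]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

demos = [("A gorgeous, moving film.", "positive"),
         ("Two hours I will never get back.", "negative")]
prompt = few_shot_prompt(demos, "I was pleasantly surprised.")
```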
Detecting LLM-generated Text
- A Watermark for Large Language Models (2023)
- Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense (2023)
- PostMark: A Robust Blackbox Watermark for Large Language Models (2024)
- Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text (2024)
- Red Teaming Language Model Detectors with Language Models (2024)
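A sketch of the detection side of the red/green-list scheme from "A Watermark for Large Language Models": a hash of the previous token pseudo-randomly splits the vocabulary into a green list (boosted at generation time) and a red list, and the detector checks whether green tokens are over-represented. This toy version hashes token strings rather than vocabulary ids, and skips the generation-time logit bias entirely:

```python
import hashlib
import math

def is_green(prev_token, token, gamma=0.5):
    """Pseudo-randomly assign `token` to the green list, seeded by the previous token.
    gamma is the fraction of the vocabulary that is green."""
    h = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return h[0] < gamma * 256

def z_score(tokens, gamma=0.5):
    """Detection statistic: standardized excess of green tokens over the gamma * T
    expected in unwatermarked text. A large positive z suggests watermarked text."""
    t = len(tokens) - 1
    greens = sum(is_green(a, b, gamma) for a, b in zip(tokens, tokens[1:]))
    return (greens - gamma * t) / math.sqrt(t * gamma * (1 - gamma))
```

The paraphrasing paper above attacks exactly this statistic: rewording shuffles which (previous, current) pairs occur, pushing z back toward zero.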
Reasoning and Chain-of-Thought
- Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought (2022)
- To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning (2024)
- Iterative reasoning preference optimization (2024)
- It’s Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning (2024)
- Language Models Still Struggle to Zero-shot Reason about Time Series (2024)
- Language models can improve event prediction by few-shot abductive reasoning (2024)
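One decoding trick that often comes up alongside these CoT papers (Wang et al.'s self-consistency, not itself listed above) is trivially small in code: sample several reasoning traces, parse out each final answer, and take a majority vote:

```python
from collections import Counter

def self_consistent_answer(sampled_answers):
    """Majority vote over final answers parsed from independently sampled CoT traces."""
    return Counter(sampled_answers).most_common(1)[0][0]

votes = ["18", "18", "17", "18", "20"]  # answers extracted from 5 sampled traces
answer = self_consistent_answer(votes)  # -> "18"
```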
Bio-NLP
- Paper Plain: Making Medical Research Papers Approachable to Healthcare Consumers with Natural Language Processing (2023)
- On-the-fly Definition Augmentation of LLMs for Biomedical NER (2024)
- JMLR: Joint Medical LLM and Retrieval Training for Enhancing Reasoning and Professional Question Answering Capability (2024)