First Look at Retrieval-Augmented Generation Technology

0. Step Why We Need RAG

When we press Enter in ChatBox and send a request "Can excess credits from core general education courses at the University of Electronic Science and Technology of China be converted to elective credits?", it is predictable that we won't get an effective conclusion: in ChatGPT Web, GPT 5.2 Thinking retrieves relevant school documents and campus forum posts through powerful search capabilities and finally concludes "No"; DeepSeek in ChatBox answers randomly

Another example, asking AI "The original sentence or paragraph where the Bodhisattva tells Sun Wukong that someone will save him, with the chapter title indicated", a very specific question: in ChatGPT Web, thinking for more than 2 minutes, checking 106 web pages, answering irrelevantly, seemingly not understanding the question

The answer we want should be from "Chapter 8: The Buddha Creates Scriptures to Transmit to Paradise; Guanyin Receives Orders to Go to Chang'an": "When the Bodhisattva heard these words, she was filled with joy. She said to the Great Sage: "The scriptures say: 'If the words spoken are good, then they will be answered from thousands of miles away; if the words spoken are not good, then they will be violated from thousands of miles away.' Since you have this intention, when I arrive in the Eastern Land of the Great Tang to find a scripture-seeker, I will teach him to save you. You can become his disciple, uphold the teachings, enter my Buddhist school, and cultivate the true fruit again, how about that?" The Great Sage said repeatedly: "I wish to go! I wish to go!""

Books like Journey to the West are mostly used as training data, but even so, they cannot provide accurate information, and if using web search, it may not necessarily handle factual answers well

The introduction of RAG can help solve such problems:
Lack of domain knowledge or private knowledge, some professional knowledge not being trained, or belonging to internal knowledge bases that the model cannot access
Knowledge has a cutoff date, unable to obtain the latest knowledge

We only need the school's official training manual to provide credit-related knowledge; we only need the complete novel to provide semantically similar paragraphs

1. Intro What is RAG

RAG, short for Retrieval-Augmented Generation, is a technology that combines information retrieval and LLM. It retrieves relevant information from knowledge bases, such as internal documents, databases, etc., to "augment" the LLM's generation capability, making its answers more accurate, real-time, and relevant to specific domains or private data, without retraining the entire model, effectively solving the large model's knowledge limitations and "hallucination" problems

As the name suggests, it is also divided into three stages: Retrieval, Augmentation, and Generation
Retrieval: When a user asks a question, the system searches for relevant information fragments in the knowledge base
Augmentation: Combining the retrieved relevant information with the user's original query to form an augmented Prompt
Generation: Feeding this augmented input prompt to the LLM to generate more accurate and evidence-based answers

2. Simple RAG

As shown in the figure, this is the most basic RAG workflow

Divided into three major stages:
Indexing phase: chunking raw knowledge documents and converting them into vector format
Retrieval phase: converting user queries into vector format and querying the most relevant document knowledge chunks
Generation phase: producing "accurate" answers based on retrieved information

2.1 Indexing Phase

The indexing phase is responsible for chunking raw knowledge documents into knowledge chunks and converting them into vector format through the Embedding model, storing them in a vector database for retrieval

2.1.1 Document Chunking

Document chunking is responsible for cutting raw knowledge documents into text knowledge chunks through different chunking strategies

The core requirement is semantic coherence

There are the following chunking schemes:

Chunk by fixed character count or Token count
Chunk by punctuation marks
Chunk by sentences
Chunk by paragraphs
Chunk by semantics
Overlapping chunking

In short, there are various methods, mainly depending on the document type, appropriately chunking it into small pieces to ensure semantic coherence

2.1.2 Text Vectorization

The purpose of text vectorization Embedding is to convert the chunked text chunks into numerical representations that computers can easily calculate, i.e., vectors

Naive methods include: deduplicated bag-of-words, non-deduplicated bag-of-words, TF-IDF

The bag-of-words method first extracts tokens from the corpus, treating each token as a dimension
Deduplicated bag-of-words: if the token exists in the text, the corresponding dimension is marked as 1, otherwise 0
Non-deduplicated bag-of-words: each token's corresponding dimension records the number of times the token appears in the text

TF-IDF: On the basis of word frequency, introduces inverse document frequency weighting. Common but low-distinction words are down-weighted, while words that better reflect text differences are given higher weights

But in any case, they utilize dimensions too sparsely. For a novel, there may be hundreds of thousands of tokens, meaning a vector may have hundreds of thousands of dimensions, but for a specific knowledge chunk, most dimensions are 0 with only a few having values. Thus, dense vectors were developed

Dense vectors, i.e., the Embedding models we now use, compress the meaning of text into a fixed-dimensional numerical space. Each dimension does not correspond to a specific token but uses lower dimensions to express semantic features. It can capture that happy and glad are synonyms, and knows the relationship between Beijing and China, Paris and France

In our usage, we only need to input text into the Embedding model to get the corresponding text vector
You can choose appropriate models on HuggingFace's leaderboard, such as
Qwen3-Embedding-8B
gemini-embedding-001

2.1.3 Vector Storage

After obtaining text vectors, we need a place to store them and corresponding retrieval methods. We usually use vector databases to store these vectors

Vector databases support storing vectors and provide similarity-based vector retrieval capabilities, such as TopK nearest neighbor queries, used to quickly find content most relevant to the query from massive chunks

Generally, three types of information are saved in vector databases:

Vector itself: the embedding vector of the chunk
Original content: the chunk for direct concatenation in augmented prompts
Metadata: storing source documents, chapters, directories, page numbers, offsets, timestamps, etc.

Vector databases are quite diverse; you can search and use them yourself, such as: Chroma
Qdrant

2.2 Retrieval Phase

The retrieval phase is responsible for converting user queries into text vectors and querying the most relevant document knowledge chunks

Since we have encoded text semantics into vectors, the core of retrieving based on text similarity is to see how close two vectors are. Common methods mainly include: cosine similarity, Euclidean distance, inner product

Cosine similarity focuses on whether the directions of two vectors are consistent, i.e., their cosine value, with a range of [-1, 1]. The closer to 1, the more consistent the direction
Euclidean distance focuses on the straight-line distance between two vectors in space. The smaller the distance, the closer they are
Inner product is affected by both direction and length. The more consistent the direction and the longer the length, the larger the inner product tends to be

In addition, there is keyword-based retrieval, i.e., extracting keywords from user queries to match corresponding documents, which is more effective in some scenarios, but not introduced in detail here

2.3 Generation Phase

The generation phase is responsible for concatenating the information retrieved in the previous step with the user query to form an augmented prompt Augmented Prompt and handing it to the LLM, allowing it to produce "accurate" answers with evidence

There is not much to say about this stage; it mainly involves selecting and organizing context, i.e., concatenating the retrieved chunks with user queries, and finally handing them to the LLM

However, it also involves some optimizations, such as:

Obtaining different topK chunks results through various retrieval methods, how to filter more appropriate topK chunks
How to make prompt templates clearer and more explicit
How to make the LLM correctly cite relevant evidence when generating answers

2.4 Practical Implementation

The theoretical part of Simple RAG is finished. Next, we implement a simple novel Q&A RAG, using the Journey to the West question we introduced at the beginning as a case study

I used LM Studio locally to run Embedding and LLM services, which can provide OpenAI Compatible format API interfaces

I used text-embedding-qwen3-embedding-0.6b and qwen/qwen3-4b-2507 as the models for this practical implementation

Install the following dependencies: pip install openai chromadb

Where chromadb is an open-source vector database, openai is added for convenient API usage

Indexing Phase

For chunking, first divide chapters according to the title format "Chapter xx", then use overlapping chunking for each chapter, chunking by CHUNK_SIZE, but the starting point moves forward by OVERLAP each time. That is, if a chapter has 2500 characters, then the chunks are: [0, 800), [650, 1450), [1300, 2100), [1950, 2750), [2600, none), the last empty block will be filtered out

Then vectorize each chunk and add it to the vector database. meta_data can include chapter titles

Distance metrics include: cosine, l2, ip (cosine similarity, Euclidean distance, inner product)

Here we choose cosine similarity; chromadb must set the metric when creating the collection

Retrieval Phase

Vectorize the user query and find the topK most similar results in the vector database

Generation Phase

Concatenate user messages and retrieved results context to form an augmented prompt, and hand it to the LLM to generate answers

CodeBlock Loading...

I tested the answers with topK set to 1, 2, and 3 respectively. They were basically usable. Actually, the correct original text paragraph was top1. In the top2 case, the model probably didn't understand what "someone will save him" meant. Using the deepseek-chat API gave answers that met my expectations

top2 case

The original sentence where the Bodhisattva tells Sun Wukong that someone will save him is: "I was ordered by the Buddha to go to the Eastern Land to find a scripture-seeker, and passing by here, I specially stopped to see you."
Source: Chapter 8: The Buddha Creates Scriptures to Transmit to Paradise; Guanyin Receives Orders to Go to Chang'an

top3 case

The original sentence where the Bodhisattva tells Sun Wukong that someone will save him is: "Since you have this intention, when I arrive in the Eastern Land of the Great Tang to find a scripture-seeker, I will teach him to save you. You can become his disciple, uphold the teachings, enter my Buddhist school, and cultivate the true fruit again, how about that?"
Chapter title: Chapter 8: The Buddha Creates Scriptures to Transmit to Paradise; Guanyin Receives Orders to Go to Chang'an

[References]
Chapter 8: The Buddha Creates Scriptures to Transmit to Paradise; Guanyin Receives Orders to Go to Chang'an (Similarity: 70.0%)
Chapter 14: The Mind-Monkey Returns to the Right; The Six Thieves Vanish Without Trace (Similarity: 69.9%)
Chapter 21: The Dharma-Protector Sets Up a Manor to Keep the Great Sage; Sumeru's Lingji Settles the Wind Demon (Similarity: 69.9%)

The similarity here is because ChromaDB returns 1 - similarity for cosine distance, so this is just restoring it.

2.5 Summary

At this point, we have successfully run through the simple RAG workflow: offline indexing (chunking, vectorization, storage) + online Q&A (retrieval, augmented prompt, generation)

Meanwhile, through a series of practices, we can also see many shortcomings. In fact, there are many areas for improvement, which is what "Advanced RAG" aims to solve

Chunking: How to chunk to ensure semantic coherence (semantic chunking)
Retrieval: How to improve retrieval precision and recall (multi-way recall, Rerank, etc.)
Query: Making the query used for matching clearer and more retrievable (query rewriting, coreference resolution)
Generation: How to write prompts and organize context (prompt templates, citations)
Evaluation: How to evaluate RAG with metrics (precision, recall, faithfulness)