# SemanticChunk

## Overview

The `SemanticChunk` Block splits a document into semantic chunks, with each chunk being a group of semantically related sentences. The Block uses an embedding model to evaluate the semantic similarity between sentences and starts a new chunk whenever a configurable dissimilarity threshold is exceeded.

This Block is useful for breaking large documents into smaller, meaningful sections for tasks like summarization, topic modeling, or information extraction.
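To make the mechanism concrete, here is a minimal sketch of percentile-based semantic chunking. It is an illustration of the idea, not the Block's implementation: it assumes the `sentence-transformers` package, uses a naive regex sentence splitter, and omits `buffer_size` grouping for brevity.

```python
# Minimal sketch of percentile-based semantic chunking.
# Assumes the sentence-transformers package; illustration only,
# not the Block's actual implementation.
import re

import numpy as np
from sentence_transformers import SentenceTransformer


def semantic_chunks(text: str, breakpoint_percentile_threshold: int = 95) -> list[str]:
    # Naive sentence splitter; the real Block likely uses something more robust.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return [text]

    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    embeddings = model.encode(sentences, normalize_embeddings=True)

    # Cosine dissimilarity between each sentence and the next
    # (dot product equals cosine similarity for normalized vectors).
    dissimilarities = 1.0 - np.sum(embeddings[:-1] * embeddings[1:], axis=1)

    # Break wherever dissimilarity exceeds the chosen percentile.
    threshold = np.percentile(dissimilarities, breakpoint_percentile_threshold)
    chunks, current = [], [sentences[0]]
    for sentence, d in zip(sentences[1:], dissimilarities):
        if d > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```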
## Description

Semantic chunk parser.

Splits a document into chunks, with each chunk being a group of semantically related sentences.

Args:

- `buffer_size` (int): number of sentences to group together when evaluating semantic similarity.
- `chunk_model` (BaseEmbedding): embedding model to use; defaults to `BAAI/bge-small-en-v1.5`.
- `breakpoint_percentile_threshold` (int): the percentile of cosine dissimilarity that must be exceeded between a group of sentences and the next to form a new chunk. The smaller this number, the more chunks will be generated.
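This description closely matches the semantic splitter found in LlamaIndex. If the Block wraps that parser (an assumption, not confirmed by this page), configuring it directly would look roughly like this:

```python
# Sketch assuming the Block wraps LlamaIndex's SemanticSplitterNodeParser;
# this mapping of options is an assumption, not confirmed by this page.
from llama_index.core import Document
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
parser = SemanticSplitterNodeParser(
    buffer_size=1,                       # sentences grouped per comparison
    breakpoint_percentile_threshold=95,  # dissimilarity percentile that triggers a split
    embed_model=embed_model,
)
nodes = parser.get_nodes_from_documents([Document(text="...")])
chunks = [node.get_content() for node in nodes]
```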
## Metadata
- Category: Function
## Configuration Options

| Name | Data Type | Description | Default Value |
|---|---|---|---|
| buffer_size | int | Number of sentences to group together when evaluating semantic similarity | 1 |
| breakpoint_percentile_threshold | int | Percentile of cosine dissimilarity that must be exceeded to start a new chunk | 95 |
| model_name | str | Name of the embedding model to use | BAAI/bge-small-en-v1.5 |
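Putting the options together, a configuration with non-default values might look like the dictionary below; the exact shape of the Block's configuration is an assumption here.

```python
# Hypothetical configuration payload; the Block's actual config schema may differ.
semantic_chunk_config = {
    "buffer_size": 2,                        # group sentences in pairs before embedding
    "breakpoint_percentile_threshold": 90,   # lower value -> more, smaller chunks
    "model_name": "BAAI/bge-small-en-v1.5",  # default embedding model
}
```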
## Inputs

| Name | Data Type | Description |
|---|---|---|
| text | str or list[str] | The document text to chunk, or a list of document texts |
## Outputs

| Name | Data Type | Description |
|---|---|---|
| result | list[str] | The list of semantic chunks |
## State Variables
No state variables available.
## Example(s)

### Example 1: Split a document into chunks

- Create a `SemanticChunk` Block.
- Set `buffer_size` to `2` (group sentences in batches of 2).
- Set `breakpoint_percentile_threshold` to `90`.
- Provide the input text: `"The quick brown fox jumps over the lazy dog. This is a test sentence. Semantic chunking is a powerful tool."`
- The Block will output a list of chunks, such as `["The quick brown fox jumps over the lazy dog. This is a test sentence.", "Semantic chunking is a powerful tool."]` (a hypothetical invocation is sketched after this list).
### Example 2: Handle a list of documents

- Set up a `SemanticChunk` Block.
- Provide a list of text documents: `["Document 1: The sky is blue.", "Document 2: The sun is bright."]`
- The Block splits each document into semantic chunks and returns the combined result (see the sketch after this list): `["Document 1: The sky is blue.", "Document 2: The sun is bright."]`. Here each short document yields a single chunk.
### Example 3: Use a custom embedding model

- Set `model_name` to `"custom/embedding-model"` to use a specific embedding model for chunking.
- Provide the text to be chunked: `"This is a test for using a custom model."`
- The Block will use the custom model to embed and chunk the text (sketched after this list).
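As before, the constructor is hypothetical, and `"custom/embedding-model"` is the placeholder name from the example, not a real model identifier.

```python
# Hypothetical: selecting a different embedding model by name.
# "custom/embedding-model" is a placeholder, not a real model identifier.
block = SemanticChunk(model_name="custom/embedding-model")
result = block(text="This is a test for using a custom model.")
```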
## Error Handling

- If the input text is invalid or an error occurs during chunking, the Block raises a `RuntimeError` with a descriptive error message (a defensive calling pattern is sketched after this list).
- If no chunks are generated, the Block returns a list containing a single empty string.
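Based on those two behaviors, a caller might guard the invocation like this, carrying over the hypothetical invocation style from the examples above:

```python
# Hypothetical defensive pattern based on the documented error behavior.
try:
    chunks = block(text=document_text)
except RuntimeError as err:
    # Raised for invalid input or a failure during chunking.
    print(f"Chunking failed: {err}")
    chunks = []

if chunks == [""]:
    # A single empty string signals that no meaningful chunks were found.
    chunks = []
```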
## FAQ

**What does the `buffer_size` parameter do?**

The `buffer_size` parameter determines how many sentences are grouped together when evaluating semantic similarity. A larger buffer size results in larger chunks, while a smaller buffer size creates more granular chunks.
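One plausible way to picture this grouping (an illustration of the idea, not the Block's code) is a sliding window that joins each sentence with its neighbors before embedding:

```python
# Illustration of buffer grouping: each sentence is combined with
# buffer_size neighbors on each side before being embedded.
def buffered_groups(sentences: list[str], buffer_size: int) -> list[str]:
    groups = []
    for i in range(len(sentences)):
        window = sentences[max(0, i - buffer_size): i + buffer_size + 1]
        groups.append(" ".join(window))
    return groups

print(buffered_groups(["A.", "B.", "C.", "D."], buffer_size=1))
# ['A. B.', 'A. B. C.', 'B. C. D.', 'C. D.']
```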
**What is the `breakpoint_percentile_threshold`?**

The `breakpoint_percentile_threshold` is the percentile of cosine dissimilarity that must be exceeded between a group of sentences and the next to form a new chunk. A lower threshold creates more chunks, while a higher threshold creates fewer, larger chunks.
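With the dissimilarity scores between consecutive sentence groups in hand, the breakpoint is just a percentile cutoff. The scores below are hypothetical:

```python
import numpy as np

# Hypothetical dissimilarity scores between consecutive sentence groups.
dissimilarities = np.array([0.10, 0.15, 0.80, 0.12, 0.90, 0.11])

print(np.percentile(dissimilarities, 95))  # ~0.875: only the 0.90 gap splits -> 2 chunks
print(np.percentile(dissimilarities, 50))  # ~0.135: 0.15, 0.80, and 0.90 all split -> 4 chunks
```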
**Can I use a custom embedding model for semantic chunking?**

Yes. Specify a custom embedding model by setting the `model_name` parameter to the name of the model you want to use. The default model is `BAAI/bge-small-en-v1.5`.
**What happens if no semantic chunks are created?**

If no semantic chunks are created, the Block returns a list containing a single empty string to indicate that no meaningful chunks were found.