SemanticChunk

Overview

The SemanticChunk Block is used to split a document into semantic chunks, with each chunk being a group of semantically related sentences. The Block utilizes an embedding model to evaluate the semantic similarity between sentences and decides when to form a new chunk based on a configurable dissimilarity threshold.

This Block is useful for breaking down large documents into smaller, meaningful sections for tasks like summarization, topic modeling, or information extraction.

Description

Semantic chunk parser.

Splits a document into chunks, with each chunk being a group of semantically related sentences.

Args:

  • buffer_size (int): number of sentences to group together when evaluating semantic similarity.
  • chunk_model (BaseEmbedding): embedding model to use; defaults to BAAI/bge-small-en-v1.5.
  • breakpoint_percentile_threshold (int): the percentile of cosine dissimilarity that must be exceeded between a group of sentences and the next to form a new chunk. The smaller this number, the more chunks are generated.
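The mechanism described above can be sketched in Python. This is a simplified illustration, not the Block's actual implementation: `embed` stands in for the embedding model (a real deployment would call something like BAAI/bge-small-en-v1.5), and the buffering and percentile logic are assumptions based on the parameter descriptions.

```python
import numpy as np

def semantic_chunk(sentences, embed, buffer_size=1,
                   breakpoint_percentile_threshold=95):
    """Split sentences into chunks at points of high semantic dissimilarity.

    `embed` maps a string to a 1-D vector; here it is a stand-in for a
    real embedding model.
    """
    if len(sentences) <= 1:
        return [" ".join(sentences)] if sentences else [""]
    # Embed each sentence together with its neighbours for stability.
    groups = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vecs = [np.asarray(embed(g), dtype=float) for g in groups]
    # Cosine dissimilarity between each adjacent pair of groups.
    dissims = []
    for a, b in zip(vecs, vecs[1:]):
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        dissims.append(1.0 - cos)
    # Break wherever dissimilarity exceeds the chosen percentile.
    threshold = np.percentile(dissims, breakpoint_percentile_threshold)
    chunks, start = [], 0
    for i, d in enumerate(dissims):
        if d > threshold:
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks
```

With a toy embedding function, two similar sentences followed by a dissimilar one split into two chunks at the point of greatest dissimilarity.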

Metadata

  • Category: Function

Configuration Options

Name                             Data Type  Description                                                                    Default Value
buffer_size                      int        Number of sentences to group together when evaluating semantic similarity     1
breakpoint_percentile_threshold  int        Percentile of cosine dissimilarity that must be exceeded to form a new chunk  95
model_name                       str        Name of the embedding model to use                                             BAAI/bge-small-en-v1.5

Inputs

Name  Data Type        Description
text  str or list[str] The text to split: a single document or a list of documents

Outputs

Name    Data Type  Description
result  list[str]  The resulting list of semantic chunks

State Variables

No state variables available.

Example(s)

Example 1: Split a document into chunks

  • Create a SemanticChunk Block.
  • Set the buffer_size to 2 (group sentences in batches of 2).
  • Set the breakpoint_percentile_threshold to 90.
  • Provide the input text: "The quick brown fox jumps over the lazy dog. This is a test sentence. Semantic chunking is a powerful tool."
  • The Block will output a list of chunks, such as:
    [
      "The quick brown fox jumps over the lazy dog. This is a test sentence.",
      "Semantic chunking is a powerful tool."
    ]
    

Example 2: Handle a list of documents

  • Set up a SemanticChunk Block.
  • Provide a list of text documents:
    [
      "Document 1: The sky is blue.",
      "Document 2: The sun is bright."
    ]
    
  • The Block will split each document into semantic chunks. Since each document here is a single short sentence, each one forms its own chunk:
    [
      "Document 1: The sky is blue.",
      "Document 2: The sun is bright."
    ]
    

Example 3: Use a custom embedding model

  • Set the model_name to "custom/embedding-model" to use a specific embedding model for chunking.
  • Provide the text to be chunked: "This is a test for using a custom model."
  • The Block will use the custom model for embedding and chunking the text.

Error Handling

  • If the input text is invalid or there is an error during the chunking process, the Block will raise a RuntimeError with a descriptive error message.
  • If no chunks are generated, the Block will return a list containing an empty string.

FAQ

What does the buffer_size parameter do?

The buffer_size parameter determines the number of sentences that are grouped together when evaluating semantic similarity. A higher buffer size will result in larger chunks, while a smaller buffer size will create more granular chunks.
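The grouping implied by buffer_size can be sketched as follows. This is a hedged illustration of one plausible windowing scheme (each sentence embedded together with buffer_size neighbours on each side), not the Block's verified internals:

```python
def buffered_groups(sentences, buffer_size):
    # Each sentence is joined with up to `buffer_size` neighbours on
    # each side, so its embedding reflects local context rather than
    # a single sentence in isolation.
    return [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]

buffered_groups(["S1.", "S2.", "S3."], 1)
# → ["S1. S2.", "S1. S2. S3.", "S2. S3."]
```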

What is the breakpoint_percentile_threshold?

The breakpoint_percentile_threshold is the percentile of cosine dissimilarity that must be exceeded between a group of sentences and the next to form a new chunk. A lower threshold will create more chunks, while a higher threshold will create fewer, larger chunks.

Can I use a custom embedding model for semantic chunking?

Yes, you can specify a custom embedding model by setting the model_name parameter to the name of the model you want to use. The default model is "BAAI/bge-small-en-v1.5".

What happens if no semantic chunks are created?

If no semantic chunks are created, the Block will return a list containing an empty string to indicate that no meaningful chunks were found.