SentenceChunk
Overview¶
The SentenceChunk Block parses text with a preference for keeping complete sentences and paragraphs together. It aims to create chunks of text that maintain sentence integrity, making it useful for tasks where meaningful text divisions are important, such as summarization or feeding token-limited models.
The Block uses a combination of sentence splitting, token counting, and customizable chunk sizes and overlaps. It also supports a secondary chunking regex for additional control over sentence splitting.
Description¶
Parse text with a preference for complete sentences.
In general, this class tries to keep sentences and paragraphs together. Compared to the original TokenTextSplitter, it is therefore less likely to leave hanging sentences or sentence fragments at the end of a chunk.
Args:
- chunk_size: The number of tokens to include in each chunk (default: 200).
- chunk_overlap: The number of tokens that overlap between consecutive chunks (default: 10).
- separator: Default separator for splitting into words (default: " ").
- paragraph_separator: Separator between paragraphs (default: "\n\n\n").
- secondary_chunking_regex: Backup regex for splitting into sentences (default: "[^,.;]+[,.;]?").
Steps:
1. Break the text into splits smaller than chunk_size, based on the separators and regex.
2. Combine the splits into chunks of at most chunk_size tokens.
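The two steps above can be sketched in Python. This is a simplified illustration, not the Block's actual implementation: `tokenize=str.split` is a whitespace stand-in for the model tokenizer, so real token counts will differ.

```python
import re

def sentence_chunk(text, chunk_size=200, chunk_overlap=10,
                   paragraph_separator="\n\n\n",
                   secondary_chunking_regex=r"[^,.;]+[,.;]?",
                   tokenize=str.split):
    """Sentence-preserving chunker (simplified sketch).

    `tokenize` stands in for the Block's model tokenizer; whitespace
    splitting only approximates real token counts.
    """
    # Step 1: break the text into sentence-sized splits, first on the
    # paragraph separator, then with the secondary chunking regex.
    splits = []
    for paragraph in text.split(paragraph_separator):
        splits.extend(p.strip()
                      for p in re.findall(secondary_chunking_regex, paragraph)
                      if p.strip())

    # Step 2: greedily combine splits into chunks of at most chunk_size
    # tokens, carrying chunk_overlap tokens into the next chunk.
    chunks, current, current_len = [], [], 0
    for piece in splits:
        n = len(tokenize(piece))
        if current and current_len + n > chunk_size:
            chunks.append(" ".join(current))
            # Keep the trailing tokens of the finished chunk as overlap.
            overlap = tokenize(chunks[-1])[-chunk_overlap:] if chunk_overlap else []
            if overlap:
                current, current_len = [" ".join(overlap)], len(overlap)
            else:
                current, current_len = [], 0
        current.append(piece)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks or [""]
```

Note that the combine step is greedy: a single split longer than `chunk_size` still becomes its own (oversized) chunk rather than being cut mid-sentence.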
Metadata¶
- Category: Function
Configuration Options¶
Name | Data Type | Description | Default Value |
---|---|---|---|
chunk_size | int | Number of tokens to include in each chunk. | 200 |
chunk_overlap | int | Number of tokens that overlap between consecutive chunks. | 10 |
separator | str | Separator for splitting into words. | " " (space) |
paragraph_separator | str | Separator between paragraphs. | \n\n\n |
model_name | str | Model whose tokenizer is used to count tokens. | gpt-3.5-turbo |
secondary_chunking_regex | str | Backup regex for splitting into sentences. | [^,.;。?!]+[,.;。?!]? |
Inputs¶
Name | Data Type | Description |
---|---|---|
text | str or list[str] | The text (or list of texts) to chunk. |
Outputs¶
Name | Data Type | Description |
---|---|---|
result | list[str] | The resulting list of text chunks. |
State Variables¶
No state variables available.
Example(s)¶
Example 1: Chunk a document with default settings¶
- Create a SentenceChunk Block.
- Set the chunk_size to 200 tokens and chunk_overlap to 10.
- Provide the input text: "This is the first sentence. This is the second sentence. This is the third sentence."
- The Block will output chunks, keeping sentences together while staying within the token limit, such as: ["This is the first sentence. This is the second sentence.", "This is the second sentence. This is the third sentence."]
Example 2: Use a custom paragraph separator¶
- Set up a SentenceChunk Block.
- Set the paragraph_separator to "\n\n".
- Provide a text input with paragraphs separated by double newlines: "Paragraph one text.\n\nParagraph two text."
- The Block will chunk the paragraphs based on the provided separator.
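The paragraph-splitting part of this example can be reproduced directly with Python's string methods (a sketch of the first splitting stage only; the Block then goes on to sentence splitting and token-based combining):

```python
# Splitting on a custom paragraph_separator of "\n\n" (double newline).
text = "Paragraph one text.\n\nParagraph two text."
paragraphs = text.split("\n\n")
```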
Example 3: Handle complex sentence structures with custom regex¶
- Create a SentenceChunk Block.
- Set the secondary_chunking_regex to "[.!?]+" to split based on sentence-ending punctuation.
- Provide the input: "Complex sentences can have multiple clauses; splitting them requires attention to detail."
- The Block will split the text at appropriate points while maintaining sentence integrity.
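As a sanity check, the Block's default secondary regex can be applied to the same input with Python's `re` module to see where the clause boundaries fall (a sketch of the regex stage only; the Block applies this internally):

```python
import re

# The Block's default secondary_chunking_regex: each match is a run of
# non-delimiter characters followed by an optional trailing delimiter,
# so every clause stays attached to its punctuation.
default_regex = r"[^,.;。?!]+[,.;。?!]?"
text = ("Complex sentences can have multiple clauses; "
        "splitting them requires attention to detail.")
pieces = [p.strip() for p in re.findall(default_regex, text)]
```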
Error Handling¶
- If the tokenizer cannot be loaded for the specified model, the Block raises a RuntimeError with an appropriate error message.
- If an issue occurs during chunking, the Block raises a RuntimeError describing the problem.
FAQ¶
What does the chunk_size parameter do?
The chunk_size parameter controls the number of tokens each chunk should contain. The Block attempts to keep chunks within this size while preserving complete sentences and paragraphs.
What is the chunk_overlap parameter?
The chunk_overlap parameter specifies the number of tokens that should overlap between consecutive chunks. This overlap preserves context at chunk boundaries, so information that spans a boundary appears in both chunks.
Can I customize the separators used for splitting text?
Yes. You can customize both the separator for splitting words and the paragraph_separator for splitting paragraphs. By default, the word separator is a space (" ") and the paragraph separator is "\n\n\n".
What happens if no valid chunks are created?
If no valid chunks are created, the Block will return a list containing an empty string to indicate that no meaningful chunks were generated from the input text.
Can I use a custom model for tokenization?
Yes. Specify a custom model with the model_name parameter; the Block will use that model's tokenizer to encode the text into tokens.