SentenceChunk
Overview¶
The SentenceChunk Block parses text with a preference for keeping complete sentences and paragraphs together. It aims to create chunks of text that maintain sentence integrity, making it useful for tasks where meaningful text divisions are important, such as summarization or feeding token-limited models.
The Block uses a combination of sentence splitting, token counting, and customizable chunk sizes and overlaps. It also supports a secondary chunking regex for additional control over sentence splitting.
Description¶
Parse text with a preference for complete sentences.
In general, this class tries to keep sentences and paragraphs together. Compared to the original TokenTextSplitter, it is therefore less likely to leave hanging sentences or sentence fragments at the end of a chunk.
Args:
- chunk_size: The number of tokens to include in each chunk (default: 200).
- chunk_overlap: The number of tokens that overlap between consecutive chunks (default: 10).
- separator: Default separator for splitting into words (default: " ").
- paragraph_separator: Separator between paragraphs (default: "\n\n\n").
- secondary_chunking_regex: Backup regex for splitting into sentences (default: "[^,.;]+[,.;]?").
Steps:
1. Break the text into splits smaller than chunk_size, based on the separators and regex.
2. Combine the splits into chunks of at most chunk_size tokens.
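The two steps above can be sketched in Python. This is a simplified illustration, not the Block's actual implementation: `tokenize=str.split` is a whitespace stand-in for the model tokenizer, so real token counts will differ.

```python
import re

def sentence_chunk(text, chunk_size=200, chunk_overlap=10,
                   paragraph_separator="\n\n\n",
                   secondary_chunking_regex=r"[^,.;]+[,.;]?",
                   tokenize=str.split):
    """Sentence-preserving chunker (simplified sketch).

    `tokenize` stands in for the Block's model tokenizer; whitespace
    splitting only approximates real token counts.
    """
    # Step 1: break the text into sentence-sized splits, first on the
    # paragraph separator, then with the secondary chunking regex.
    splits = []
    for paragraph in text.split(paragraph_separator):
        splits.extend(p.strip()
                      for p in re.findall(secondary_chunking_regex, paragraph)
                      if p.strip())

    # Step 2: greedily combine splits into chunks of at most chunk_size
    # tokens, carrying chunk_overlap tokens into the next chunk.
    chunks, current, current_len = [], [], 0
    for piece in splits:
        n = len(tokenize(piece))
        if current and current_len + n > chunk_size:
            chunks.append(" ".join(current))
            # Keep the trailing tokens of the finished chunk as overlap.
            overlap = tokenize(chunks[-1])[-chunk_overlap:] if chunk_overlap else []
            if overlap:
                current, current_len = [" ".join(overlap)], len(overlap)
            else:
                current, current_len = [], 0
        current.append(piece)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks or [""]
```

Note that the combine step is greedy: a single split longer than `chunk_size` still becomes its own (oversized) chunk rather than being cut mid-sentence.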
Metadata¶
- Category: Function
Configuration Options¶
Name | Data Type | Description | Default Value |
---|---|---|---|
chunk_size | int | Number of tokens to include in each chunk. | 200 |
chunk_overlap | int | Number of tokens that overlap between consecutive chunks. | 10 |
separator | str | Separator for splitting into words. | " " (space) |
paragraph_separator | str | Separator between paragraphs. | \n\n\n |
model_name | str | Model whose tokenizer is used to count tokens. | gpt-3.5-turbo |
secondary_chunking_regex | str | Backup regex for splitting into sentences. | [^,.;。?!]+[,.;。?!]? |
Inputs¶
Name | Data Type | Description |
---|---|---|
text | str or list[str] | The text (or list of texts) to chunk. |
Outputs¶
Name | Data Type | Description |
---|---|---|
result | list[str] | The resulting list of text chunks. |
State Variables¶
No state variables available.
Example(s)¶
Example 1: Chunk a document with default settings¶
- Create a SentenceChunk Block.
- Set the chunk_size to 200 tokens and chunk_overlap to 10.
- Provide the input text: "This is the first sentence. This is the second sentence. This is the third sentence."
- The Block will output chunks, keeping sentences together while staying within the token limit, such as: ["This is the first sentence. This is the second sentence.", "This is the second sentence. This is the third sentence."]
Example 2: Use a custom paragraph separator¶
- Set up a SentenceChunk Block.
- Set the paragraph_separator to "\n\n".
- Provide a text input with paragraphs separated by double newlines: "Paragraph one text.\n\nParagraph two text."
- The Block will chunk the paragraphs based on the provided separator.
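The paragraph-splitting part of this example can be reproduced directly with Python's string methods (a sketch of the first splitting stage only; the Block then goes on to sentence splitting and token-based combining):

```python
# Splitting on a custom paragraph_separator of "\n\n" (double newline).
text = "Paragraph one text.\n\nParagraph two text."
paragraphs = text.split("\n\n")
```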
Example 3: Handle complex sentence structures with custom regex¶
- Create a SentenceChunk Block.
- Set the secondary_chunking_regex to "[.!?]+" to split based on sentence-ending punctuation.
- Provide the input: "Complex sentences can have multiple clauses; splitting them requires attention to detail."
- The Block will split the text at appropriate points while maintaining sentence integrity.
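As a sanity check, the Block's default secondary regex can be applied to the same input with Python's `re` module to see where the clause boundaries fall (a sketch of the regex stage only; the Block applies this internally):

```python
import re

# The Block's default secondary_chunking_regex: each match is a run of
# non-delimiter characters followed by an optional trailing delimiter,
# so every clause stays attached to its punctuation.
default_regex = r"[^,.;。?!]+[,.;。?!]?"
text = ("Complex sentences can have multiple clauses; "
        "splitting them requires attention to detail.")
pieces = [p.strip() for p in re.findall(default_regex, text)]
```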
Error Handling¶
- If the tokenizer cannot be loaded for the specified model, the Block raises a RuntimeError with an appropriate error message.
- If an issue occurs during chunking, the Block raises a RuntimeError describing the problem.
FAQ¶
What does the chunk_size parameter do?
The chunk_size parameter controls the number of tokens each chunk should contain. The Block attempts to keep chunks within this size while preserving complete sentences and paragraphs.
What is the chunk_overlap parameter?
The chunk_overlap parameter specifies the number of tokens that should overlap between consecutive chunks. This overlap preserves context at chunk boundaries, so information that spans a boundary appears in both chunks.
Can I customize the separators used for splitting text?
Yes. You can customize both the separator for splitting words and the paragraph_separator for splitting paragraphs. By default, the word separator is a space (" ") and the paragraph separator is "\n\n\n".
What happens if no valid chunks are created?
If no valid chunks are created, the Block will return a list containing an empty string to indicate that no meaningful chunks were generated from the input text.
Can I use a custom model for tokenization?
Yes. Specify a custom model with the model_name parameter; the Block will use that model's tokenizer to encode the text into tokens.