TokenChunk
Overview¶
The TokenChunk Block splits a document into chunks of a fixed token size. This ensures that each chunk contains a consistent number of tokens, making it ideal when precise control over chunk size is necessary, such as when working with models that have specific token limits. However, this method may split sentences or words, which can affect the coherence of the resulting chunks.
This Block uses the specified model's tokenizer to count tokens and supports overlapping chunks for better continuity between consecutive chunks.
Description¶
Parses the document text into chunks of a fixed token size.

Args:

- chunk_size: The number of tokens to include in each chunk (default: 200).
- chunk_overlap: The number of tokens that overlap between consecutive chunks (default: 10).
- separator: The separator used to split the text into words (default: " ").
This chunking method is particularly useful when:

- You need precise control over the size of each chunk.
- You're working with models that have specific token limits.
- You want to ensure consistent chunk sizes across different types of text.
Note: While this method provides consistent chunk sizes, it may split sentences or even words, which could affect the coherence of each chunk. Consider the trade-off between consistent size and semantic coherence when using this method.
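The core idea can be sketched in a few lines of Python. The snippet below assumes the tiktoken library handles tokenization (this page does not name the underlying library); the function name `token_chunk` and its internals are illustrative, not the Block's actual implementation.

```python
# A minimal sketch of fixed-size token chunking with overlap.
# Assumes tiktoken; `token_chunk` is an illustrative name, not the Block's code.
import tiktoken

def token_chunk(text: str, chunk_size: int = 200, chunk_overlap: int = 10,
                model_name: str = "gpt-3.5-turbo") -> list[str]:
    encoding = tiktoken.encoding_for_model(model_name)
    tokens = encoding.encode(text)
    if len(tokens) <= chunk_size:
        # Short inputs come back as a single chunk; empty text yields [""].
        return [text]
    step = chunk_size - chunk_overlap  # each chunk starts `step` tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(encoding.decode(tokens[start:start + chunk_size]))
    return chunks
```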
Metadata¶
- Category: Function
Configuration Options¶
Name | Data Type | Description | Default Value |
---|---|---|---|
chunk_size | int | The number of tokens to include in each chunk. | 200 |
chunk_overlap | int | The number of tokens that overlap between consecutive chunks. | 10 |
separator | str | The separator used to split the text into words. | " " (space) |
model_name | str | The model whose tokenizer is used to split the text into tokens. | gpt-3.5-turbo |
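For reference, a configuration mirroring the defaults above might look like the following; the dictionary form is illustrative, since this page does not document the exact configuration mechanism.

```python
# Illustrative configuration mirroring the defaults in the table above.
config = {
    "chunk_size": 200,              # tokens per chunk
    "chunk_overlap": 10,            # tokens shared between consecutive chunks
    "separator": " ",               # word separator
    "model_name": "gpt-3.5-turbo",  # source of the tokenizer
}
```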
Inputs¶
Name | Data Type | Description |
---|---|---|
text | str or list[str] | The document text, or a list of documents, to be chunked. |
Outputs¶
Name | Data Type | Description |
---|---|---|
result | list[str] | The resulting token-based chunks. |
State Variables¶
No state variables available.
Example(s)¶
Example 1: Chunk a document with default settings¶
- Create a TokenChunk Block.
- Set `chunk_size` to 200 tokens and `chunk_overlap` to 10.
- Provide the input text: "This is a long document that needs to be split into chunks based on tokens."
- The Block will output chunks of the text, each containing approximately 200 tokens, with an overlap of 10 tokens between consecutive chunks (see the sketch after this list).
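A hedged walk-through of this example, reusing the `token_chunk` sketch from the Description section:

```python
# Reusing the illustrative token_chunk sketch defined earlier.
result = token_chunk(
    "This is a long document that needs to be split into chunks based on tokens.",
    chunk_size=200,
    chunk_overlap=10,
)
# This particular input is far shorter than 200 tokens, so it is
# returned as a single chunk (see the FAQ on short inputs below).
print(result)
```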
Example 2: Customize the word separator¶
- Set up a TokenChunk Block.
- Set `separator` to `"\n"` to split the text on newline characters.
- Provide input text with newline-separated sections: "Section 1: This is the first section.\nSection 2: This is the second section."
- The Block will split the text on newline characters and return token-based chunks (see the snippet after this list).
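The snippet below only illustrates how the separator changes word splitting; the page states the separator is used for splitting text into words, and the exact downstream behavior inside the Block is an assumption here.

```python
# Illustrative only: how a "\n" separator changes word splitting.
text = "Section 1: This is the first section.\nSection 2: This is the second section."
words_default = text.split(" ")   # separator = " " (default): many short words
words_newline = text.split("\n")  # separator = "\n": the two sections
print(len(words_default), len(words_newline))
```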
Example 3: Handle a list of documents¶
- Create a TokenChunk Block.
- Provide a list of documents to be chunked: ["Document 1: This is the first document.", "Document 2: This is the second document."]
- The Block will chunk each document individually and return the token-based chunks for each one (see the sketch after this list).
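A sketch of list handling, again using the illustrative `token_chunk` function from above. The output type `list[str]` suggests the chunks from all documents are returned in one flat list; that reading is an inference, not something this page states explicitly.

```python
docs = [
    "Document 1: This is the first document.",
    "Document 2: This is the second document.",
]
# Chunk each document independently, then flatten into a single list[str].
result = [chunk for doc in docs for chunk in token_chunk(doc)]
```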
Error Handling¶
- If the tokenizer cannot be loaded for the specified model, the Block raises a `RuntimeError` with an appropriate error message.
- If an issue occurs during the chunking process, the Block raises a `RuntimeError` describing the problem (a handling sketch follows this list).
- If no valid chunks are created, the Block returns a list containing an empty string.
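A sketch of how a tokenizer-loading failure might be wrapped, assuming tiktoken underneath (an assumption; the page only specifies that a `RuntimeError` is raised):

```python
import tiktoken

def load_tokenizer(model_name: str):
    """Sketch: wrap tokenizer-loading failures in a RuntimeError, as described above."""
    try:
        return tiktoken.encoding_for_model(model_name)
    except KeyError as err:  # tiktoken raises KeyError for unrecognized model names
        raise RuntimeError(f"Could not load tokenizer for {model_name!r}: {err}") from err
```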
FAQ¶
What does the `chunk_size` parameter control?

The `chunk_size` parameter defines the number of tokens to include in each chunk. This allows you to ensure that each chunk stays within a specified token limit, which can be important for models with token restrictions.
What is the purpose of the `chunk_overlap` parameter?

The `chunk_overlap` parameter specifies the number of tokens that overlap between consecutive chunks. This overlap ensures continuity between chunks, which can be useful for maintaining context in language models.
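For example, with the default `chunk_size` of 200 and `chunk_overlap` of 10, each new chunk begins 190 tokens after the previous one, repeating the final 10 tokens of the preceding chunk.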
Can I use custom models for tokenization?

Yes, you can specify a custom model for tokenization by setting the `model_name` parameter. The Block will use the tokenizer from the specified model to split the text into tokens.
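If the tokenizer comes from tiktoken (an assumption; the page does not name the library), selecting a model-specific tokenizer looks like this:

```python
import tiktoken

# encoding_for_model maps a model name to its tokenizer encoding.
enc = tiktoken.encoding_for_model("gpt-4")
print(len(enc.encode("Custom models use their own token boundaries.")))
```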
What happens if the input text is too short to fill a chunk?

If the input text is shorter than the specified `chunk_size`, the Block returns the text as a single chunk without splitting it. If the text is empty or invalid, the Block returns a list containing an empty string.
Does this Block ensure that sentences are not split?

No, the `TokenChunk` Block focuses on creating chunks with a consistent token size. It may split sentences or even words depending on the token boundaries. If you need sentence preservation, consider using a sentence-based chunking method.