FileStore
Overview¶
The FileStore
Block manages the storage and retrieval of files, allowing for content chunking, embedding generation, and semantic search over the stored files. It uses an in-memory vector database to store the embeddings of file chunks, enabling quick and efficient search based on the semantic similarity between a query and the stored content.
This Block is useful for applications that require storing documents, generating embeddings for content, and performing similarity-based searches over stored text.
Description¶
Store and search embeddings in an in-memory vector database.
Args: top_k: The number of results to return in search.
Steps: 1: Update index with embeddings. 2: Search index to return relevant documents.
Metadata¶
- Category: Function
Configuration Options¶
Name | Data Type | Description | Default Value |
---|---|---|---|
top_k | int |
5 |
Inputs¶
Name | Data Type | Description |
---|---|---|
files | list[File] |
|
filename | str |
|
query | str |
Outputs¶
Name | Data Type | Description |
---|---|---|
all_files | list[FileInfo] |
|
files | list[FileInfo] |
|
file_content | str |
|
chunks | list[str] |
State Variables¶
Name | Data Type | Description |
---|---|---|
data | Any |
|
files_state | list[ChunkedFile] |
|
pending_file_count | int |
|
all_files_state | list[FileInfo] |
|
new_files_state | list[FileInfo] |
Example(s)¶
Example 1: Add files and store their content¶
- Create a
FileStore
Block. - Provide a list of files to the
add_files()
step. - The Block will extract content from each file, convert it into chunks, generate embeddings, and store the results.
- The
all_files
output will contain information about all stored files, and thefiles
output will contain details about the newly added files.
Example 2: Retrieve file content by filename¶
- Set up a
FileStore
Block with files already added. - Use the
get()
step with a specific filename, such as"document1.txt"
. - The Block will return the full content of the specified file.
Example 3: Perform a semantic search over stored content¶
- Use a
FileStore
Block with stored file content and embeddings. - Provide a query, such as
"Find information about AI ethics."
to thesemantic_search()
step. - The Block will return the most relevant chunks of content based on the semantic similarity between the query and stored chunks.
Error Handling¶
- If a file is not found when calling the
get()
step, the Block will return a message indicating that the file was not found. - If no valid chunks are returned for a file, the Block will handle the missing content gracefully.
- If an unsupported file format is provided, the Block will process the file using a default content extraction method.
FAQ¶
How does the Block handle different file formats?
The Block can process files in various formats, including converting certain files to markdown using Pandoc. Unsupported formats are processed using default extraction methods.
Can I perform searches on large document collections?
Yes, the Block is designed to handle large collections of documents by storing content in chunks and generating embeddings for each chunk. Semantic search allows for efficient querying across large datasets.
What happens if I add the same file multiple times?
The Block treats each file addition independently. If the same file is added multiple times, it will be processed and stored as separate entries.
How does semantic search work?
The Block generates an embedding for the input query and compares it with the embeddings of stored chunks using cosine similarity. The top k
most similar chunks are returned based on the top_k
configuration.