GetFileContent

Overview¶

The GetFileContent Block extracts text content from a file by determining its type and applying the appropriate extraction method. It supports PDF and other file types, providing a seamless way to handle document files and retrieve their text content.

The block interacts with a BlobService to retrieve file data from a URI and processes the content based on the detected file type.

Description¶

Metadata¶

Category: Data

Configuration Options¶

No configuration options available.

Inputs¶

Name	Data Type	Description
file	`File`

Outputs¶

Name	Data Type	Description
content	`str`
file_name	`str`

State Variables¶

No state variables available.

Example(s)¶

Example 1: Extract text from a PDF file¶

Create a GetFileContent Block.
Provide a PDF file via a File object.
The Block will extract the text from the PDF file and output the text content, as well as the file name.

Example 2: Extract text from a non-PDF file¶

Set up a GetFileContent Block.
Provide a text or other non-PDF file via a File object.
The Block will extract and convert the text content using the appropriate method and send it to the content output, along with the file name.

Error Handling¶

If the file type is not detected or cannot be processed, the Block will raise an error.
PDF text extraction may be slower for large or complex documents due to the nature of the PDF structure.
If no file name is available, an empty string will be sent to the file_name output.

FAQ¶

What file types are supported?

This Block supports PDF files and other file types that can be partitioned and converted to text. It uses specialized methods for extracting text from PDFs and more general methods for other file types.

What happens if the file type cannot be detected?

If the file type cannot be detected, the Block will use a fallback method to handle the file, or it will raise an error if the file type is unsupported.

How does PDF text extraction work?

PDF text extraction uses the pypdf library to extract text from each page. The extracted text is then combined into a single string separated by double newlines for readability.

Can I use this block for large files?

Yes, but be aware that PDF text extraction can be slow for large or complex documents. For large files, consider handling the text extraction asynchronously to avoid blocking workflows.