LLMDataEnrich
Overview¶
The LLMDataEnrich
Block uses a Language Learning Model (LLM) to enrich documents with additional information based on user-defined instructions, outputting structured data according to a specified schema. It extracts and categorizes information from input text into specified fields, where each field can contain multiple relevant entries as a list of strings. This block is particularly useful for entity extraction, data parsing, and enriching unstructured content with structured metadata.
Block LLMDataEnrich not found
Example(s)¶
Example 1: Extract entities from a document¶
- Create an
LLMDataEnrich
Block. - Set
field_names
to["people", "organizations", "locations"]
. - Configure the
llm_config
with appropriate model settings. - Provide input text:
"John Smith from Microsoft visited the Seattle office yesterday."
- The Block will extract:
{"people": ["John Smith"], "organizations": ["Microsoft"], "locations": ["Seattle"]}
.
Example 2: Parse product information¶
- Create an
LLMDataEnrich
Block. - Set
field_names
to["product_names", "prices", "features"]
. - Provide input:
"The iPhone 15 Pro costs $999 and features a titanium design with 5x zoom camera."
- The Block will extract structured data for each field.
Example 3: Extract multiple values per field¶
- Create an
LLMDataEnrich
Block. - Set
field_names
to["skills", "certifications", "experience_years"]
. - Provide a resume or CV as input.
- The Block will extract multiple skills, certifications, and experience mentions as lists.
Example 4: Process content items with files¶
- Create an
LLMDataEnrich
Block. - Configure with desired extraction fields.
- Provide a list of
ContentItem
objects or a singleContentItem
. - The Block will process the content and extract structured information according to the schema.
Error Handling¶
- If the LLM does not return a tool call response, the Block will raise a
BlockError
. - If the LLM returns empty arguments for the tool call, an error will be raised.
- The Block validates that responses match the expected schema structure.
FAQ¶
How are the extraction fields defined?
Fields are defined in the field_names
list. Each field name becomes a property in the output schema, with the type list[str]
to allow multiple values per field.
What types of input does the Block accept?
The Block accepts:
- Simple strings
- Single ContentItem
objects
- Lists of ContentItem
objects
ContentItems can include text and file references.
Can I customize the extraction behavior?
Yes, through the llm_config
you can:
- Set a custom pre-prompt to guide extraction
- Choose different LLM models
- Configure model parameters
- Enable thread history for context-aware extraction
Why are all fields lists of strings?
This design allows maximum flexibility - each field can contain zero, one, or multiple extracted entities. For example, a document might mention multiple people, locations, or dates. The list structure accommodates all these cases.
How does this differ from the standard LLM block?
While the standard LLM block can output structured data, LLMDataEnrich is specifically optimized for entity extraction and data enrichment tasks. It automatically creates the appropriate schema based on field names and uses tool calling to ensure structured output.