WebsiteScraper
Overview
The `WebsiteScraper` Block scrapes the content of a given website and retrieves both raw content and metadata about the pages visited. This Block is ideal for extracting text from websites for analysis or processing, with an option to limit the number of pages visited.
Description
Scrapes the content of a website. Returns both the raw content and the content with metadata.
Metadata
- Category: Misc
- Icon: fa-globe
- Label: website scraper, web crawler, content extractor, web harvester, site parser
Configuration Options
Name | Data Type | Description | Default Value |
---|---|---|---|
page_limit | int | Maximum number of pages to scrape. | 3 |
Inputs
Name | Data Type | Description |
---|---|---|
base_url | str | URL of the website to scrape. Scraping starts from this page. |
Outputs
Name | Data Type | Description |
---|---|---|
website_content | list[str] | Raw text content of each scraped page. |
website_details | list[WebsiteDetails] | Metadata for each scraped page: title, URL, and content. |
State Variables
No state variables available.
Example(s)
Example 1: Scrape content from a website
- Create a `WebsiteScraper` Block.
- Provide the base URL, such as `"https://example.com"`.
- Set the `page_limit` configuration to `3`.
- The Block will output:
    - `website_content`: A list of raw text content from the pages.
    - `website_details`: A list of detailed metadata, including the title, URL, and content of each page.
Example 2: Limit the number of pages to scrape
- Set up a `WebsiteScraper` Block.
- Provide a base URL and set `page_limit` to `1`.
- The Block will scrape only the page at the base URL and output its content and metadata.
Error Handling
- If the URL is invalid or unreachable, the Block will log an error and skip the page.
- If no pages are successfully scraped, the Block will output empty lists for both `website_content` and `website_details`.
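The skip-and-continue behavior described above can be sketched as follows. This is a minimal illustration, not the Block's actual implementation: `scrape_pages` and the injected `fetch` callable are hypothetical names, and the real Block may catch narrower exception types.

```python
import logging
from typing import Callable

logger = logging.getLogger("website_scraper_sketch")


def scrape_pages(urls: list[str], fetch: Callable[[str], str]) -> list[str]:
    """Return the text of every page that could be fetched, skipping failures.

    `fetch` is a hypothetical callable that returns page text or raises.
    """
    contents: list[str] = []
    for url in urls:
        try:
            contents.append(fetch(url))
        except Exception as exc:  # invalid or unreachable URL
            # Log the problem and move on to the next page.
            logger.error("Skipping %s: %s", url, exc)
    return contents
```

If every fetch fails, the function returns an empty list, which mirrors the empty outputs described above.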
FAQ
Can the Block scrape multiple pages from a website?
Yes, the Block will scrape the base URL and additional pages linked from it, up to the `page_limit` configuration. Only pages within the same domain as the base URL will be scraped.
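As a rough sketch of how a same-domain crawl bounded by `page_limit` could work: the function name `crawl`, the injected `get_links` callable, and the breadth-first visiting order are all assumptions made for illustration, since this page does not describe the Block's internal traversal.

```python
from collections import deque
from typing import Callable
from urllib.parse import urljoin, urlparse


def crawl(base_url: str, page_limit: int,
          get_links: Callable[[str], list[str]]) -> list[str]:
    """Return up to `page_limit` same-domain URLs, starting from `base_url`.

    `get_links` is a hypothetical callable returning the hrefs found on a page.
    """
    if not base_url:
        return []  # an empty base URL yields empty results
    domain = urlparse(base_url).netloc
    queue = deque([base_url])
    seen = {base_url}
    visited: list[str] = []
    while queue and len(visited) < page_limit:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            link = urljoin(url, href)  # resolve relative links
            # Keep only unseen links on the same domain as the base URL.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

With `page_limit` set to `1`, the loop stops after the base URL, which matches Example 2.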
What happens if the base URL is empty or invalid?
If the base URL is empty, the Block will output empty results. For invalid or unreachable URLs, the Block will log an error and skip the page.
How does the Block handle rate limiting?
The Block uses a short delay (1 second) between batches of scraping tasks to avoid overwhelming the server.
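One way to picture the delay between batches is the sketch below. It assumes an `asyncio`-based scraper and a hypothetical `batch_size` of 5. Neither detail is stated on this page; only the 1-second delay comes from the answer above.

```python
import asyncio
from typing import Awaitable, Callable


async def scrape_in_batches(
    urls: list[str],
    scrape: Callable[[str], Awaitable[str]],
    batch_size: int = 5,         # hypothetical value, not documented
    delay_seconds: float = 1.0,  # the 1-second delay mentioned above
) -> list[str]:
    """Scrape URLs batch by batch, sleeping between consecutive batches."""
    results: list[str] = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        # Run all tasks of the current batch concurrently.
        results.extend(await asyncio.gather(*(scrape(u) for u in batch)))
        if start + batch_size < len(urls):
            await asyncio.sleep(delay_seconds)  # pause before the next batch
    return results
```

The sleep is skipped after the final batch, so a crawl that fits in a single batch finishes without any added delay.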
What metadata is included in the `website_details` output?
The `website_details` output includes the following for each page:
- `title`: The title of the webpage.
- `url`: The URL of the webpage.
- `content`: The text content extracted from the page, excluding scripts and styles.
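The three fields above could be modelled and filled in as in the following sketch. The `WebsiteDetails` definition shown here is an assumed shape based only on the field list, and the `build_details` helper and the standard-library `html.parser` extraction are illustrative choices, not the Block's confirmed implementation. Leaving the title text out of `content` is likewise an assumption.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class WebsiteDetails:
    # Assumed shape, derived from the documented field list.
    title: str
    url: str
    content: str


class _TextExtractor(HTMLParser):
    """Collects the <title> and visible text, ignoring <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.parts: list[str] = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_title:
            self.title += text
        elif not self._skip_depth and text:
            self.parts.append(text)


def build_details(url: str, html: str) -> WebsiteDetails:
    """Hypothetical helper: parse one page into a WebsiteDetails record."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return WebsiteDetails(title=parser.title, url=url,
                          content=" ".join(parser.parts))
```

Under these assumptions, `website_content` would correspond to the `content` field of each record, collected into a plain list of strings.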