
Commit 791f675

feat: add integration for smartcrawler

1 parent: b80b0be

7 files changed: +457 -6 lines

README.md (0 additions, 4 deletions)

```diff
@@ -9,10 +9,6 @@ Supercharge your LangChain agents with AI-powered web scraping capabilities. Lan
 ## 🔗 ScrapeGraph API & SDKs
 If you are looking for a quick solution to integrate ScrapeGraph in your system, check out our powerful API [here!](https://dashboard.scrapegraphai.com/login)
 
-<p align="center">
-  <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/api-banner.png" alt="ScrapeGraph API Banner" style="width: 70%;">
-</p>
-
 We offer SDKs in both Python and Node.js, making it easy to integrate into your projects. Check them out below:
 
 | SDK | Language | GitHub Link |
```
examples/readme.md (198 additions, 0 deletions; new file)
# LangChain ScrapeGraph Examples

This directory contains comprehensive examples demonstrating how to use the LangChain ScrapeGraph tools for web scraping and data extraction.

## Prerequisites

Before running these examples, make sure you have:

1. **API Key**: Set your ScrapeGraph AI API key as an environment variable:

   ```bash
   export SGAI_API_KEY="your-api-key-here"
   ```

2. **Dependencies**: Install the required packages:

   ```bash
   pip install langchain-scrapegraph scrapegraph-py
   ```

   For the agent example, you'll also need:

   ```bash
   pip install langchain-openai langchain python-dotenv
   ```
## Available Examples

### 1. Agent Integration (`agent_example.py`)

**Purpose**: Demonstrates how to integrate ScrapeGraph tools with a LangChain agent for conversational web scraping.

**Features**:
- Uses OpenAI's function-calling capabilities
- Combines multiple tools (SmartScraper, GetCredits, SearchScraper)
- Provides a conversational interface for web scraping tasks
- Includes verbose output to show the agent's reasoning

**Usage**:

```bash
python agent_example.py
```
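The `agent_example.py` file itself is not shown in this commit view, so the following is only a plausible sketch of the setup described above; the model name, prompt wording, and question are assumptions:

```python
from dotenv import load_dotenv
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

from langchain_scrapegraph.tools import (
    GetCreditsTool,
    SearchScraperTool,
    SmartScraperTool,
)

load_dotenv()  # expects SGAI_API_KEY and OPENAI_API_KEY

# The three tools the example combines
tools = [SmartScraperTool(), GetCreditsTool(), SearchScraperTool()]

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful web scraping assistant."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),  # required for tool-calling agents
])

# Function calling via OpenAI; verbose=True surfaces the agent's reasoning
llm = ChatOpenAI(model="gpt-4o-mini")
agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

result = executor.invoke({"input": "How many API credits do I have left?"})
print(result["output"])
```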
### 2. Basic Tool Examples

#### Get Credits Tool (`get_credits_tool.py`)

**Purpose**: Check your remaining API credits.

**Features**:
- Simple API credit checking
- No parameters required
- Returns the current credit balance

**Usage**:

```bash
python get_credits_tool.py
```
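Invoking the tool directly is a single call. A minimal sketch, assuming an empty input dict is accepted since the tool takes no parameters (see the reference below):

```python
from langchain_scrapegraph.tools import GetCreditsTool

# Reads SGAI_API_KEY from the environment, like the other tools
credits = GetCreditsTool().invoke({})
print(credits)
```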
#### Markdownify Tool (`markdownify_tool.py`)

**Purpose**: Convert website content to clean markdown format.

**Features**:
- Converts HTML to markdown
- Cleans and structures content
- Preserves formatting and links

**Usage**:

```bash
python markdownify_tool.py
```
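A minimal direct invocation, using the `website_url` parameter documented in the reference below (the target URL is just an example):

```python
from langchain_scrapegraph.tools import MarkdownifyTool

tool = MarkdownifyTool()

# Convert a page to clean markdown
markdown = tool.invoke({"website_url": "https://scrapegraphai.com/"})
print(markdown)
```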
#### Smart Scraper Tool (`smartscraper_tool.py`)

**Purpose**: Extract specific information from a single webpage using AI.

**Features**:
- Targets specific websites
- Uses natural language prompts
- Extracts structured data
- Supports both URL and HTML content

**Usage**:

```bash
python smartscraper_tool.py
```
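A minimal direct invocation sketch; the prompt and URL are illustrative:

```python
from langchain_scrapegraph.tools import SmartScraperTool

tool = SmartScraperTool()

# Extract information from a single page with a natural language prompt
result = tool.invoke({
    "website_url": "https://scrapegraphai.com/",
    "user_prompt": "Extract the page title and main description",
})
print(result)
```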
#### Search Scraper Tool (`searchscraper_tool.py`)

**Purpose**: Search the web and extract information based on a query.

**Features**:
- Web search capabilities
- AI-powered content extraction
- No specific URL required
- Returns relevant information from multiple sources

**Usage**:

```bash
python searchscraper_tool.py
```
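A minimal direct invocation sketch; note that no URL is needed, only the query (the query text is illustrative):

```python
from langchain_scrapegraph.tools import SearchScraperTool

tool = SearchScraperTool()

# Search the web and extract an answer from multiple sources
result = tool.invoke({
    "user_prompt": "What SDKs does ScrapeGraph AI offer, and for which languages?",
})
print(result)
```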
#### Smart Crawler Tool (`smartcrawler_tool.py`)

**Purpose**: Crawl multiple pages of a website and extract comprehensive information.

**Features**:
- Multi-page crawling
- Configurable depth and page limits
- Domain restriction options
- Website caching for efficiency
- Extracts information from multiple related pages

**Usage**:

```bash
python smartcrawler_tool.py
```
### 3. Structured Output Examples

All tools support structured output using Pydantic models. These examples show how to define schemas for consistent, typed responses.

#### Search Scraper with Schema (`searchscraper_tool_schema.py`)

**Purpose**: Extract product information with structured output.

**Schema Features**:
- Product name and description
- Feature lists with structured details
- Pricing information with multiple plans
- Reference URLs for verification

**Key Schema Classes**:
- `Feature`: Product feature details
- `PricingPlan`: Pricing tier information
- `ProductInfo`: Complete product information
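A sketch of how these classes might fit together; only the three class names come from the example file, so the individual fields shown here are assumptions:

```python
from typing import List

from pydantic import BaseModel, Field

from langchain_scrapegraph.tools import SearchScraperTool


class Feature(BaseModel):
    name: str = Field(description="Feature name")  # field names are assumed
    details: str = Field(description="What the feature does")


class PricingPlan(BaseModel):
    name: str = Field(description="Pricing tier name")
    price: str = Field(description="Price of the tier")


class ProductInfo(BaseModel):
    name: str = Field(description="Product name")
    description: str = Field(description="Product description")
    features: List[Feature] = Field(description="Product features")
    pricing: List[PricingPlan] = Field(description="Available pricing plans")
    reference_urls: List[str] = Field(description="Source URLs for verification")


# Pass the schema at construction time, as in the SmartCrawler schema example below
tool = SearchScraperTool(llm_output_schema=ProductInfo)
```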
#### Smart Scraper with Schema (`smartscraper_tool_schema.py`)

**Purpose**: Extract website information with structured output.

**Schema Features**:
- Website title and description
- URL extraction from the page
- Support for both URL and HTML input

**Key Schema Classes**:
- `WebsiteInfo`: Complete website information structure
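A sketch along the same lines; the `WebsiteInfo` name comes from the example file, while its fields are assumptions based on the listed schema features:

```python
from typing import List

from pydantic import BaseModel, Field

from langchain_scrapegraph.tools import SmartScraperTool


class WebsiteInfo(BaseModel):
    title: str = Field(description="Website title")  # fields are assumed
    description: str = Field(description="Website description")
    urls: List[str] = Field(description="URLs extracted from the page")


tool = SmartScraperTool(llm_output_schema=WebsiteInfo)
result = tool.invoke({
    "website_url": "https://scrapegraphai.com/",
    "user_prompt": "Extract the title, description, and links",
})
print(result)
```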
#### Smart Crawler with Schema (`smartcrawler_tool_schema.py`)

**Purpose**: Extract company information from multiple pages with structured output.

**Schema Features**:
- Company description
- Privacy policy content
- Terms of service content
- Multi-page content aggregation

**Key Schema Classes**:
- `CompanyInfo`: Company information structure
## Tool Parameters Reference

### SmartScraperTool
- `website_url`: Target website URL
- `user_prompt`: What information to extract
- `website_html`: (Optional) HTML content instead of URL
- `llm_output_schema`: (Optional) Pydantic model for structured output
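Because `website_html` substitutes for `website_url`, you can run extraction on HTML you already hold in memory. A small sketch, assuming the two parameters are mutually exclusive (the HTML string is illustrative):

```python
from langchain_scrapegraph.tools import SmartScraperTool

html = "<html><body><h1>ACME Corp</h1><p>We build rockets.</p></body></html>"

tool = SmartScraperTool()
result = tool.invoke({
    "user_prompt": "What does this company build?",
    "website_html": html,  # no network fetch; scrape the provided HTML
})
print(result)
```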
### SearchScraperTool
- `user_prompt`: Search query and extraction instructions
- `llm_output_schema`: (Optional) Pydantic model for structured output

### SmartCrawlerTool
- `url`: Starting URL for crawling
- `prompt`: What information to extract
- `cache_website`: (Optional) Cache pages for efficiency
- `depth`: (Optional) Maximum crawling depth
- `max_pages`: (Optional) Maximum pages to crawl
- `same_domain_only`: (Optional) Restrict to same domain
- `llm_output_schema`: (Optional) Pydantic model for structured output

### GetCreditsTool
- No parameters required

### MarkdownifyTool
- `website_url`: Target website URL
## Best Practices

1. **Error Handling**: Always wrap tool calls in try/except blocks for production use (see the sketch after this list)
2. **Rate Limiting**: Be mindful of API rate limits when making multiple requests
3. **Caching**: Use website caching with SmartCrawlerTool when processing multiple pages
4. **Structured Output**: Use Pydantic schemas for consistent, typed responses
5. **Logging**: Enable logging to debug and monitor tool performance
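A minimal error-handling pattern for point 1; the exception types surfaced depend on the underlying `scrapegraph-py` client, so this sketch catches broadly:

```python
from langchain_scrapegraph.tools import SmartScraperTool

tool = SmartScraperTool()
try:
    result = tool.invoke({
        "website_url": "https://scrapegraphai.com/",
        "user_prompt": "Extract the page title",
    })
except Exception as exc:  # narrow this once you know the client's exception types
    print(f"Scraping failed: {exc}")
else:
    print(result)
```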
## Troubleshooting

- **Authentication Issues**: Ensure `SGAI_API_KEY` is properly set
- **Import Errors**: Install all required dependencies
- **Timeout Issues**: Increase timeout values for complex crawling operations
- **Rate Limiting**: Implement delays between requests if hitting rate limits

## Additional Resources

- [ScrapeGraph AI Documentation](https://docs.scrapegraphai.com/)
- [LangChain Documentation](https://python.langchain.com/)
- [Pydantic Documentation](https://pydantic-docs.helpmanual.io/)

examples/smartcrawler_tool.py (25 additions, 0 deletions; new file)

```python
import json

from scrapegraph_py.logger import sgai_logger

from langchain_scrapegraph.tools import SmartCrawlerTool

sgai_logger.set_logging(level="INFO")

# The tool reads SGAI_API_KEY from the environment automatically
tool = SmartCrawlerTool()

# Crawl the site and extract company, privacy, and terms information
url = "https://scrapegraphai.com/"
prompt = (
    "What does the company do? Also extract the text content of "
    "their privacy policy and terms of service."
)

# Invoke the tool with crawling parameters
result = tool.invoke({
    "url": url,
    "prompt": prompt,
    "cache_website": True,
    "depth": 2,
    "max_pages": 2,
    "same_domain_only": True,
})

print(json.dumps(result, indent=2))
```
examples/smartcrawler_tool_schema.py (39 additions, 0 deletions; new file)

```python
import json

from pydantic import BaseModel, Field
from scrapegraph_py.logger import sgai_logger

from langchain_scrapegraph.tools import SmartCrawlerTool

sgai_logger.set_logging(level="INFO")


# Define the structured output schema
class CompanyInfo(BaseModel):
    company_description: str = Field(description="What the company does")
    privacy_policy: str = Field(description="Privacy policy content")
    terms_of_service: str = Field(description="Terms of service content")


# Initialize the tool with the schema
tool = SmartCrawlerTool(llm_output_schema=CompanyInfo)

# Crawl the site and extract structured company information
url = "https://scrapegraphai.com/"
prompt = (
    "What does the company do? Also extract the text content of "
    "their privacy policy and terms of service."
)

# Invoke the tool with crawling parameters and structured output
result = tool.invoke({
    "url": url,
    "prompt": prompt,
    "cache_website": True,
    "depth": 2,
    "max_pages": 2,
    "same_domain_only": True,
})

print(json.dumps(result, indent=2))

# The output is structured according to the CompanyInfo schema:
# {
#     "company_description": "...",
#     "privacy_policy": "...",
#     "terms_of_service": "..."
# }
```
langchain_scrapegraph/tools/__init__.py (2 additions, 1 deletion)

```diff
@@ -1,6 +1,7 @@
 from .credits import GetCreditsTool
 from .markdownify import MarkdownifyTool
 from .searchscraper import SearchScraperTool
+from .smartcrawler import SmartCrawlerTool
 from .smartscraper import SmartScraperTool
 
-__all__ = ["SmartScraperTool", "GetCreditsTool", "MarkdownifyTool", "SearchScraperTool"]
+__all__ = ["SmartScraperTool", "SmartCrawlerTool", "GetCreditsTool", "MarkdownifyTool", "SearchScraperTool"]
```
