# LangChain ScrapeGraph Examples

This directory contains comprehensive examples demonstrating how to use the LangChain ScrapeGraph tools for web scraping and data extraction.

## Prerequisites

Before running these examples, make sure you have:

1. **API Key**: Set your ScrapeGraph AI API key as an environment variable:
   ```bash
   export SGAI_API_KEY="your-api-key-here"
   ```

2. **Dependencies**: Install the required packages:
   ```bash
   pip install langchain-scrapegraph scrapegraph-py
   ```

   For the agent example, you'll also need:
   ```bash
   pip install langchain-openai langchain python-dotenv
   ```
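
The agent example lists `python-dotenv` as a dependency. A minimal sketch for verifying the key is available before running any script, assuming you keep it in a `.env` file:

```python
import os

from dotenv import load_dotenv  # optional: only needed if the key lives in a .env file

load_dotenv()  # copies SGAI_API_KEY from .env into the process environment, if present
assert os.getenv("SGAI_API_KEY"), "Set SGAI_API_KEY before running the examples"
```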

## Available Examples

### 1. Agent Integration (`agent_example.py`)
**Purpose**: Demonstrates how to integrate ScrapeGraph tools with a LangChain agent for conversational web scraping.

**Features**:
- Uses OpenAI's function calling capabilities
- Combines multiple tools (SmartScraper, GetCredits, SearchScraper)
- Provides a conversational interface for web scraping tasks
- Includes verbose output to show agent reasoning

**Usage**:
```bash
python agent_example.py
```
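
A minimal sketch of this pattern, shown for orientation rather than as a copy of the script; the model name, system prompt, and question are placeholders:

```python
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_scrapegraph.tools import GetCreditsTool, SearchScraperTool, SmartScraperTool

# The ScrapeGraph tools read SGAI_API_KEY from the environment.
tools = [SmartScraperTool(), SearchScraperTool(), GetCreditsTool()]
llm = ChatOpenAI(model="gpt-4o-mini")

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a web scraping assistant. Use the tools to answer questions."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)  # verbose shows agent reasoning

print(executor.invoke({"input": "How many API credits do I have left?"}))
```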

### 2. Basic Tool Examples

#### Get Credits Tool (`get_credits_tool.py`)
**Purpose**: Check your remaining API credits.

**Features**:
- Simple API credit checking
- No parameters required
- Returns current credit balance

**Usage**:
```bash
python get_credits_tool.py
```
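
The underlying call is straightforward; a minimal sketch:

```python
from langchain_scrapegraph.tools import GetCreditsTool

credits_tool = GetCreditsTool()  # reads SGAI_API_KEY from the environment
print(credits_tool.invoke({}))   # no arguments needed; returns the remaining credit balance
```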

#### Markdownify Tool (`markdownify_tool.py`)
**Purpose**: Convert website content to clean markdown format.

**Features**:
- Converts HTML to markdown
- Cleans and structures content
- Preserves formatting and links

**Usage**:
```bash
python markdownify_tool.py
```
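
A minimal sketch of a direct call (the URL is only an example):

```python
from langchain_scrapegraph.tools import MarkdownifyTool

markdownify = MarkdownifyTool()
markdown = markdownify.invoke({"website_url": "https://scrapegraphai.com"})
print(markdown)  # the page content as clean markdown
```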

#### Smart Scraper Tool (`smartscraper_tool.py`)
**Purpose**: Extract specific information from a single webpage using AI.

**Features**:
- Target specific websites
- Use natural language prompts
- Extract structured data
- Support for both URL and HTML content

**Usage**:
```bash
python smartscraper_tool.py
```
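
A minimal sketch of a direct call (URL and prompt are illustrative):

```python
from langchain_scrapegraph.tools import SmartScraperTool

scraper = SmartScraperTool()
result = scraper.invoke({
    "website_url": "https://scrapegraphai.com",  # or pass "website_html" with raw HTML instead
    "user_prompt": "Extract the page title and main description",
})
print(result)
```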

#### Search Scraper Tool (`searchscraper_tool.py`)
**Purpose**: Search the web and extract information based on a query.

**Features**:
- Web search capabilities
- AI-powered content extraction
- No specific URL required
- Returns relevant information from multiple sources

**Usage**:
```bash
python searchscraper_tool.py
```
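
A minimal sketch (the query is illustrative):

```python
from langchain_scrapegraph.tools import SearchScraperTool

search = SearchScraperTool()
result = search.invoke({
    "user_prompt": "What are the main features of the ScrapeGraph AI API?",
})
print(result)
```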

#### Smart Crawler Tool (`smartcrawler_tool.py`)
**Purpose**: Crawl multiple pages of a website and extract comprehensive information.

**Features**:
- Multi-page crawling
- Configurable depth and page limits
- Domain restriction options
- Website caching for efficiency
- Extract information from multiple related pages

**Usage**:
```bash
python smartcrawler_tool.py
```
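
A minimal sketch using the parameters from the reference section below (all values are illustrative):

```python
from langchain_scrapegraph.tools import SmartCrawlerTool

crawler = SmartCrawlerTool()
result = crawler.invoke({
    "url": "https://scrapegraphai.com",   # example starting point
    "prompt": "Summarize the product and collect contact information",
    "cache_website": True,                # reuse fetched pages across the crawl
    "depth": 2,                           # how many links deep to follow
    "max_pages": 5,                       # hard cap on pages visited
    "same_domain_only": True,             # stay on the starting domain
})
print(result)
```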

### 3. Structured Output Examples

The scraping tools (SmartScraper, SearchScraper, and SmartCrawler) support structured output using Pydantic models. These examples show how to define schemas for consistent, typed responses.

#### Search Scraper with Schema (`searchscraper_tool_schema.py`)
**Purpose**: Extract product information with structured output.

**Schema Features**:
- Product name and description
- Feature lists with structured details
- Pricing information with multiple plans
- Reference URLs for verification

**Key Schema Classes**:
- `Feature`: Product feature details
- `PricingPlan`: Pricing tier information
- `ProductInfo`: Complete product information
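
A sketch of how such a schema might be defined and attached. The field names are illustrative rather than the exact ones used in the example script, and the schema is assumed to be passed when the tool is constructed:

```python
from typing import List

from pydantic import BaseModel, Field
from langchain_scrapegraph.tools import SearchScraperTool

class Feature(BaseModel):
    name: str = Field(description="Name of the product feature")
    description: str = Field(description="What the feature does")

class PricingPlan(BaseModel):
    name: str = Field(description="Name of the pricing tier")
    price: str = Field(description="Price of the tier")

class ProductInfo(BaseModel):
    name: str = Field(description="Product name")
    description: str = Field(description="Short product description")
    features: List[Feature] = Field(description="Key product features")
    pricing_plans: List[PricingPlan] = Field(description="Available pricing plans")
    reference_urls: List[str] = Field(description="URLs used to verify the information")

search = SearchScraperTool(llm_output_schema=ProductInfo)
result = search.invoke({"user_prompt": "Describe the ScrapeGraph AI service and its pricing"})
print(result)
```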

#### Smart Scraper with Schema (`smartscraper_tool_schema.py`)
**Purpose**: Extract website information with structured output.

**Schema Features**:
- Website title and description
- URL extraction from page
- Support for both URL and HTML input

**Key Schema Classes**:
- `WebsiteInfo`: Complete website information structure
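
A sketch along the same lines (field names are illustrative):

```python
from typing import List

from pydantic import BaseModel, Field
from langchain_scrapegraph.tools import SmartScraperTool

class WebsiteInfo(BaseModel):
    title: str = Field(description="Main title of the webpage")
    description: str = Field(description="Main description or first paragraph")
    urls: List[str] = Field(description="URLs found on the page")

scraper = SmartScraperTool(llm_output_schema=WebsiteInfo)
result = scraper.invoke({
    "website_url": "https://scrapegraphai.com",  # example target
    "user_prompt": "Extract the title, description, and all URLs",
})
print(result)
```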

#### Smart Crawler with Schema (`smartcrawler_tool_schema.py`)
**Purpose**: Extract company information from multiple pages with structured output.

**Schema Features**:
- Company description
- Privacy policy content
- Terms of service content
- Multi-page content aggregation

**Key Schema Classes**:
- `CompanyInfo`: Company information structure
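
A sketch of the multi-page variant (again, field names and parameter values are illustrative):

```python
from pydantic import BaseModel, Field
from langchain_scrapegraph.tools import SmartCrawlerTool

class CompanyInfo(BaseModel):
    company_description: str = Field(description="What the company does")
    privacy_policy: str = Field(description="Summary of the privacy policy")
    terms_of_service: str = Field(description="Summary of the terms of service")

crawler = SmartCrawlerTool(llm_output_schema=CompanyInfo)
result = crawler.invoke({
    "url": "https://scrapegraphai.com",  # example starting point
    "prompt": "Extract the company description, privacy policy, and terms of service",
    "depth": 2,
    "max_pages": 5,
    "same_domain_only": True,
})
print(result)
```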

## Tool Parameters Reference

### SmartScraperTool
- `website_url`: Target website URL
- `user_prompt`: What information to extract
- `website_html`: (Optional) HTML content instead of URL
- `llm_output_schema`: (Optional) Pydantic model for structured output

### SearchScraperTool
- `user_prompt`: Search query and extraction instructions
- `llm_output_schema`: (Optional) Pydantic model for structured output

### SmartCrawlerTool
- `url`: Starting URL for crawling
- `prompt`: What information to extract
- `cache_website`: (Optional) Cache pages for efficiency
- `depth`: (Optional) Maximum crawling depth
- `max_pages`: (Optional) Maximum pages to crawl
- `same_domain_only`: (Optional) Restrict to same domain
- `llm_output_schema`: (Optional) Pydantic model for structured output

### GetCreditsTool
- No parameters required

### MarkdownifyTool
- `website_url`: Target website URL

## Best Practices

1. **Error Handling**: Always wrap tool calls in try/except blocks for production use (see the sketch after this list)
2. **Rate Limiting**: Be mindful of API rate limits when making multiple requests
3. **Caching**: Use website caching with SmartCrawlerTool when processing multiple pages
4. **Structured Output**: Use Pydantic schemas for consistent, typed responses
5. **Logging**: Enable logging to debug and monitor tool performance
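
A minimal sketch combining the error-handling and logging recommendations; the broad `except Exception` is only a placeholder, narrow it to the client's actual exception types in real code:

```python
import logging

from langchain_scrapegraph.tools import SmartScraperTool

logging.basicConfig(level=logging.INFO)

tool = SmartScraperTool()
try:
    result = tool.invoke({
        "website_url": "https://example.com",
        "user_prompt": "Extract the page title",
    })
    logging.info("Extraction succeeded: %s", result)
except Exception:  # placeholder: catch the client's specific exceptions instead
    logging.exception("Scraping failed")
```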

## Troubleshooting

- **Authentication Issues**: Ensure `SGAI_API_KEY` is properly set
- **Import Errors**: Install all required dependencies
- **Timeout Issues**: Increase timeout values for complex crawling operations
- **Rate Limiting**: Implement delays between requests if hitting rate limits

## Additional Resources

- [ScrapeGraph AI Documentation](https://docs.scrapegraphai.com/)
- [LangChain Documentation](https://python.langchain.com/)
- [Pydantic Documentation](https://pydantic-docs.helpmanual.io/)