feat: Jan copy (improved README, simplified configuration, handle validation) #26

Merged: 21 commits, Nov 13, 2024
32 changes: 16 additions & 16 deletions .actor/actor.json
@@ -9,17 +9,17 @@
     "storages": {
         "dataset": {
             "actorSpecification": 1,
-            "title": "RAG Web browser",
-            "description": "Too see all scraped properties, export the whole dataset or select All fields instead of Overview",
+            "title": "RAG Web Browser",
+            "description": "To see all scraped properties, export the whole dataset or select All fields instead of Overview.",
             "views": {
                 "overview": {
                     "title": "Overview",
-                    "description": "Selected fields from the dataset",
+                    "description": "A view showing just basic properties for simplicity.",
                     "transformation": {
                         "fields": [
                             "metadata.url",
                             "metadata.title",
-                            "text"
+                            "markdown"
                         ],
                         "flatten": ["metadata"]
                     },
@@ -31,39 +31,39 @@
                             "format": "text"
                         },
                         "metadata.title": {
-                            "label": "Page Title",
+                            "label": "Page title",
                             "format": "text"
                         },
                         "text": {
-                            "label": "Extracted text",
+                            "label": "Extracted markdown",
                             "format": "text"
                         }
                     }
                 }
             },
-            "googleSearchResults": {
-                "title": "Google Search Results",
-                "description": "Title, Description and URL of the Google Search Results",
+            "searchResults": {
+                "title": "Search results",
+                "description": "A view showing just the Google Search results, without the page content.",
                 "transformation": {
                     "fields": [
-                        "googleSearchResult.description",
-                        "googleSearchResult.title",
-                        "googleSearchResult.url"
+                        "searchResult.description",
+                        "searchResult.title",
+                        "searchResult.url"
                     ],
-                    "flatten": ["googleSearchResult"]
+                    "flatten": ["searchResult"]
                 },
                 "display": {
                     "component": "table",
                     "properties": {
-                        "googleSearchResult.description": {
+                        "searchResult.description": {
                             "label": "Description",
                             "format": "text"
                         },
-                        "googleSearchResult.title": {
+                        "searchResult.title": {
                             "label": "Title",
                             "format": "text"
                         },
-                        "googleSearchResult.url": {
+                        "searchResult.url": {
                             "label": "URL",
                             "format": "text"
                         }
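The `overview` view above selects `metadata.url`, `metadata.title`, and `markdown`, and its `"flatten": ["metadata"]` transformation lifts the children of the nested `metadata` object into top-level, dot-separated columns. Roughly, the transformation behaves like this sketch (a simplified local approximation for illustration, not Apify's actual server-side implementation):

```python
def flatten_record(record: dict, keys_to_flatten: list[str]) -> dict:
    """Approximate the dataset view "flatten" transformation: children of the
    listed nested objects become top-level keys joined with a dot."""
    out = {}
    for key, value in record.items():
        if key in keys_to_flatten and isinstance(value, dict):
            for child_key, child_value in value.items():
                out[f"{key}.{child_key}"] = child_value
        else:
            out[key] = value
    return out
```

Applied to a dataset item like `{"metadata": {"url": ..., "title": ...}, "markdown": ...}`, this yields the flat `metadata.url` / `metadata.title` columns the view's `display.properties` refer to.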
102 changes: 49 additions & 53 deletions .actor/input_schema.json
@@ -1,133 +1,129 @@
 {
     "title": "RAG Web Browser",
-    "description": "RAG Web Browser for a retrieval augmented generation workflows. Retrieve and return website content from the top Google Search Results Pages",
+    "description": "Here you can test the Actor and its various settings. Enter the search terms or URL below and click *Start* to see the results. For production applications, we recommend using the Standby mode and calling the Actor via HTTP server for faster response times.",
     "type": "object",
     "schemaVersion": 1,
     "properties": {
         "query": {
-            "title": "Search term(s)",
+            "title": "Search term or URL",
             "type": "string",
-            "description": "You can Use regular search words or enter Google Search URLs. Additionally, you can also apply [advanced Google search techniques](https://blog.apify.com/how-to-scrape-google-like-a-pro/). For example:\n\n - Search for results from a particular website: <code>llm site:openai.com</code> (note: there should be no space between `site`, the colon, and the domain openai.com; also the `.com` is required).\n\n - Search for results related to <code>javascript OR python</code>",
-            "prefill": "apify rag browser",
-            "editor": "textarea",
+            "description": "Enter Google Search keywords or a URL to a specific web page. The keywords might include the [advanced search operators](https://blog.apify.com/how-to-scrape-google-like-a-pro/). Examples:\n\n- <code>san francisco weather</code>\n- <code>https://www.cnn.com</code>\n- <code>function calling site:openai.com</code>",
+            "prefill": "san francisco weather",
+            "editor": "textfield",
+            "pattern": "[^\\s]+"
         },
         "maxResults": {
-            "title": "Number of top search results to return from Google. Only organic results are returned and counted",
+            "title": "Maximum results",
             "type": "integer",
-            "description": "The number of top organic search results to return and scrape text from",
-            "prefill": 3,
+            "description": "The maximum number of top organic Google Search results whose web pages will be extracted. If `query` is a URL, then this field is ignored and the Actor only fetches the specific web page.",
+            "default": 3,
             "minimum": 1,
             "maximum": 100
         },
         "outputFormats": {
             "title": "Output formats",
             "type": "array",
-            "description": "Select the desired output formats for the retrieved content",
+            "description": "Select one or more formats to which the target web pages will be extracted and saved in the resulting dataset.",
             "editor": "select",
-            "default": ["text"],
+            "default": ["markdown"],
             "items": {
                 "type": "string",
                 "enum": ["text", "markdown", "html"],
                 "enumTitles": ["Plain text", "Markdown", "HTML"]
             }
         },
         "requestTimeoutSecs": {
-            "title": "Request timeout in seconds",
+            "title": "Request timeout",
             "type": "integer",
-            "description": "The maximum time (in seconds) allowed for request. If the request exceeds this time, it will be marked as failed and only already finished results will be returned",
+            "description": "The maximum time in seconds available for the request, including querying Google Search and scraping the target web pages. For example, OpenAI allows only [45 seconds](https://platform.openai.com/docs/actions/production#timeouts) for custom actions. If a target page loading and extraction exceeds this timeout, the corresponding page will be skipped in results to ensure at least some results are returned within the timeout. If no page is extracted within the timeout, the whole request fails.",
             "minimum": 1,
-            "maximum": 600,
-            "default": 45
+            "maximum": 300,
+            "default": 40,
+            "unit": "seconds"
         },
-        "proxyGroupSearch": {
-            "title": "Search Proxy Group",
+        "serpProxyGroup": {
+            "title": "Google SERP proxy group",
             "type": "string",
-            "description": "Select the proxy group for loading search results",
+            "description": "Enables overriding the default Apify Proxy group used for fetching Google Search results.",
             "editor": "select",
             "default": "GOOGLE_SERP",
             "enum": ["GOOGLE_SERP", "SHADER"],
-            "sectionCaption": "Google Search Settings"
+            "sectionCaption": "Google Search scraping settings"
         },
-        "maxRequestRetriesSearch": {
-            "title": "Maximum number of retries for Google search request on network / server errors",
+        "serpMaxRetries": {
+            "title": "Google SERP maximum retries",
             "type": "integer",
-            "description": "The maximum number of times the Google search crawler will retry the request on network, proxy or server errors. If the (n+1)-th request still fails, the crawler will mark this request as failed.",
+            "description": "The maximum number of times the Actor will retry fetching the Google Search results on error. If the last attempt fails, the entire request fails.",
             "minimum": 0,
             "maximum": 5,
-            "default": 3
+            "default": 2
         },
         "proxyConfiguration": {
-            "title": "Crawler: Proxy configuration",
+            "title": "Proxy configuration",
             "type": "object",
-            "description": "Enables loading the websites from IP addresses in specific geographies and to circumvent blocking.",
+            "description": "Apify Proxy configuration used for scraping the target web pages.",
             "default": {
                 "useApifyProxy": true
             },
             "prefill": {
                 "useApifyProxy": true
             },
             "editor": "proxy",
-            "sectionCaption": "Content Crawler Settings"
+            "sectionCaption": "Target pages scraping settings"
         },
         "initialConcurrency": {
-            "title": "Initial concurrency",
+            "title": "Initial browsing concurrency",
             "type": "integer",
-            "description": "Initial number of Playwright browsers running in parallel. The system scales this value based on CPU and memory usage.",
+            "description": "The initial number of web browsers running in parallel. The system automatically scales the number based on the CPU and memory usage, in the range specified by `minConcurrency` and `maxConcurrency`. If the initial value is `0`, the Actor picks the number automatically based on the available memory.",
             "minimum": 0,
             "maximum": 50,
-            "default": 5
+            "default": 4,
+            "editor": "hidden"
         },
         "minConcurrency": {
-            "title": "Minimal concurrency",
+            "title": "Minimum browsing concurrency",
             "type": "integer",
-            "description": "Minimum number of Playwright browsers running in parallel. Useful for defining a base level of parallelism.",
+            "description": "The minimum number of web browsers running in parallel.",
             "minimum": 1,
             "maximum": 50,
-            "default": 3
+            "default": 1,
+            "editor": "hidden"
         },
         "maxConcurrency": {
-            "title": "Maximal concurrency",
+            "title": "Maximum browsing concurrency",
             "type": "integer",
-            "description": "Maximum number of browsers or clients running in parallel to avoid overloading target websites.",
+            "description": "The maximum number of web browsers running in parallel.",
             "minimum": 1,
-            "maximum": 50,
-            "default": 20
+            "maximum": 100,
+            "default": 50,
+            "editor": "hidden"
         },
         "maxRequestRetries": {
-            "title": "Maximum number of retries for Playwright content crawler",
+            "title": "Target page max retries",
             "type": "integer",
-            "description": "Maximum number of retry attempts on network, proxy, or server errors. If the (n+1)-th request fails, it will be marked as failed.",
+            "description": "The maximum number of times the Actor will retry loading the target web page on error. If the last attempt fails, the page will be skipped in the results.",
             "minimum": 0,
             "maximum": 3,
             "default": 1
         },
-        "requestTimeoutContentCrawlSecs": {
-            "title": "Request timeout for content crawling",
-            "type": "integer",
-            "description": "Timeout (in seconds) for making requests for each search result, including fetching and processing its content.\n\nThe value must be smaller than the 'Request timeout in seconds' setting.",
-            "minimum": 1,
-            "maximum": 60,
-            "default": 30
-        },
         "dynamicContentWaitSecs": {
-            "title": "Wait for dynamic content (seconds)",
+            "title": "Target page dynamic content timeout",
             "type": "integer",
-            "description": "Maximum time (in seconds) to wait for dynamic content to load. The crawler processes the page once this time elapses or when the network becomes idle.",
-            "default": 10
+            "description": "The maximum time in seconds to wait for dynamic page content to load. The Actor considers the web page as fully loaded once this time elapses or when the network becomes idle.",
+            "default": 10,
+            "unit": "seconds"
         },
         "removeCookieWarnings": {
             "title": "Remove cookie warnings",
             "type": "boolean",
-            "description": "If enabled, removes cookie consent dialogs to improve text extraction accuracy. Note that this will impact latency.",
+            "description": "If enabled, the Actor attempts to close or remove cookie consent dialogs to improve the quality of extracted text. Note that this setting increases the latency.",
             "default": true
         },
         "debugMode": {
-            "title": "Debug mode (stores debugging information in dataset)",
+            "title": "Enable debug mode",
             "type": "boolean",
-            "description": "If enabled, the Actor will store debugging information in the dataset's debug field",
-            "default": false,
-            "sectionCaption": "Debug Settings"
+            "description": "If enabled, the Actor will store debugging information into the resulting dataset under the `debug` field.",
+            "default": false
         }
     }
 }
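With the schema above, a run input needs a non-whitespace `query` (per the `"pattern": "[^\s]+"` added in this PR) and uses the renamed fields `serpProxyGroup` and `serpMaxRetries`. The following sketch shows how a caller might assemble an input and mirror a few of the schema's constraints client-side; `build_run_input` is a hypothetical helper for illustration, not part of the Actor:

```python
import re

# Defaults and bounds copied from the updated input_schema.json above.
DEFAULTS = {
    "maxResults": 3,
    "outputFormats": ["markdown"],
    "requestTimeoutSecs": 40,
    "serpProxyGroup": "GOOGLE_SERP",
    "serpMaxRetries": 2,
}

def build_run_input(query: str, **overrides) -> dict:
    """Return a run input dict, enforcing a few of the schema constraints."""
    # JSON Schema "pattern" is a partial match, so "[^\s]+" just requires
    # at least one non-whitespace character somewhere in the query.
    if re.search(r"[^\s]+", query) is None:
        raise ValueError("query must contain a non-whitespace character")
    run_input = {**DEFAULTS, "query": query, **overrides}
    if not 1 <= run_input["maxResults"] <= 100:
        raise ValueError("maxResults must be between 1 and 100")
    if not 1 <= run_input["requestTimeoutSecs"] <= 300:
        raise ValueError("requestTimeoutSecs must be between 1 and 300")
    return run_input
```

Because `query` may also be a plain URL (in which case `maxResults` is ignored by the Actor), the same helper covers both `build_run_input("san francisco weather")` and `build_run_input("https://www.cnn.com")`.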
17 changes: 17 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,22 @@
 This changelog summarizes all changes of the RAG Web Browser
 
+### 2024-11-13
+
+🚀 Features
+- Improve README.md and simplify configuration
+- Add an AWS Lambda function
+- Hide variables initialConcurrency, minConcurrency, and maxConcurrency in the Actor input and remove them from README.md
+- Remove requestTimeoutContentCrawlSecs and use only requestTimeoutSecs
+- Ensure there is enough time left to wait for dynamic content before the Actor timeout (normal mode)
+- Rename googleSearchResults to searchResults and searchProxyGroup to serpProxyGroup
+- Implement input validation
+
 ### 2024-11-08
 
 🚀 Features
 - Add functionality to extract content from a specific URL
 - Update README.md to include new functionality and provide examples
 
 ### 2024-10-17
 
 🚀 Features