can I scrape a local html file? #929
Comments
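To resolve the error related to the missing import, import os at the top of the script (json is also needed for the final json.dump call):
import os
import json
from scrapegraphai.graphs import SmartScraperGraph
# Read the API key from the environment
openai_api_key = os.getenv("OPENAI_API_KEY")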
# Configuration for the scraping pipeline
graph_config = {
    "llm": {
        "api_key": openai_api_key,
        "model": "openai/gpt-4-0125-preview",
    },
    "verbose": True,
    "headless": True,
}
# Path to your local HTML file
html_file_path = "slack_html/0221-onboarding-bitbucket.html"
# Read the file locally
with open(html_file_path, 'r', encoding='utf-8') as file:
    html_content = file.read()
# Initialize the SmartScraperGraph
smart_scraper = SmartScraperGraph(
    prompt="Extract all messages with their timestamps and usernames from the Slack conversation.",
    source=html_content,
    config=graph_config,
)
# Create output directory if it doesn't exist
output_folder_path = "cleaned_slack_html"
os.makedirs(output_folder_path, exist_ok=True)
# Execute the scraping pipeline
try:
    result = smart_scraper.run()
    # Prepare output file path
    output_file_name = "cleaned_" + os.path.basename(html_file_path).replace(".html", ".json")
    output_file_path = os.path.join(output_folder_path, output_file_name)
    # Save results
    print("Extraction completed. Saving results...")
    with open(output_file_path, "w", encoding='utf-8') as f:
        json.dump(result, f, indent=4, ensure_ascii=False)
    print(f"Results saved to: {output_file_path}")
except Exception as e:
    print(f"An error occurred: {e}")
This approach reads the local HTML file and passes its content directly as the source.
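As a side note on the output file name in the script above: the .replace(".html", ".json") call replaces every occurrence of ".html" in the name, not only the extension. A minimal sketch of the same naming step using pathlib follows; the helper name output_path_for is hypothetical and not part of scrapegraphai.

```python
from pathlib import Path


def output_path_for(html_file_path: str, output_folder: str) -> Path:
    """Build '<output_folder>/cleaned_<name>.json' from an HTML file path.

    Hypothetical helper for illustration only; it is not part of scrapegraphai.
    """
    source = Path(html_file_path)
    # .stem drops only the final extension, so a name such as
    # "notes.html.html" becomes "notes.html" rather than losing both parts.
    return Path(output_folder) / f"cleaned_{source.stem}.json"


if __name__ == "__main__":
    print(output_path_for("slack_html/0221-onboarding-bitbucket.html", "cleaned_slack_html"))
```

The returned Path can be passed straight to open() in place of output_file_path.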
that code by dosubot throws
as does my attempt to use document scraper
Can you send me the errors?
To resolve the 'Input to PromptTemplate is missing variables {"content"}' error, ensure that you are using a compatible version of the library:
pip install scrapegraphai==1.40.1
If you are using version 1.39.0 and still encounter this error, consider downgrading to version 1.37.0:
pip install scrapegraphai==1.37.0
Additionally, ensure that the ...
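Before pinning a version it may help to confirm which one is actually installed in the environment running the script. A minimal sketch using only the standard library (importlib.metadata, Python 3.8+); the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it only reads package metadata.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("scrapegraphai:", installed_version("scrapegraphai") or "not installed")
```

If the printed version differs from the one you expected, the script is probably running in a different virtual environment than the one pip installed into.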
keeps throwing something about content