why are my results so bad here? #949

Open
nyck33 opened this issue Mar 14, 2025 · 2 comments

nyck33 commented Mar 14, 2025

"""
Example of Search Graph
"""

import os

from dotenv import load_dotenv

from scrapegraphai.graphs import SearchGraph

from china_unis import universities

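# Load variables from .env and let them override values already set.
# (os.environ.clear() would also wipe PATH, HOME, and any proxy settings.)
load_dotenv(override=True)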

# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_API_KEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "openai/gpt-4o-2024-08-06",
    },
    "max_results": 2,
    "verbose": True,
}

prompt = f"""
Get me the contact email addresses of the following universities:
{universities[:10]}
"""

# ************************************************
# Create the SearchGraph instance and run it
# ************************************************

search_graph = SearchGraph(
    prompt=prompt, config=graph_config
)

result = search_graph.run()

print(result)


# Save results to both JSON and TXT formats for flexibility
import json
from pathlib import Path
from datetime import datetime

# Create output directory if it doesn't exist
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)

# Generate timestamp for unique filenames
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Save as JSON
json_path = output_dir / f"university_contacts_{timestamp}.json"
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2, ensure_ascii=False)

# Save as TXT
txt_path = output_dir / f"university_contacts_{timestamp}.txt"
with open(txt_path, "w", encoding="utf-8") as f:
    f.write(str(result))

print("\nResults saved to:")
print(f"JSON: {json_path}")
print(f"TXT: {txt_path}")

input:

universities = [
    "Beijing Foreign Studies University",
    "Beijing Jiaotong University",
    "Beijing Language and Culture University",
    "Beijing Radio and Television University",
    "Beijing University of Chinese Medicine",
    "Beijing University of Posts and Telecommunications",
    "Central China Normal University",
    "Chong Qing University",
    "Donghua University",
    "East China Normal University",
    "Harbin Engineering University",
    "Harbin Institute of Technology Shenzhen Graduate School",
    "Henan University",
    "Hubei University",
    "Jiangxi Normal University",
    "Jilin University",
    "Nanjing University",
    "Ningbo University",
    "Northeast Normal University",
    "Northwest University",
    "Northwestern Polytechnical University",
    "Ocean University of China",
    "Peking University",
    "Renmin University of China",
    "Shaanxi Normal University",
    "Shanghai International Studies University",
    "Shanghai Normal University",
    "Shanghai University",
    "Shanghai University of Traditional Chinese Medicine",
    "Sichuan Normal University",
    "Sichuan University",
    "Sun Yat-sen University",
    "The Central Academy of Drama",
    "Tianjin University",
    "Tianjin University of Finance and Economics",
    "Tsinghua University",
    "Wuhan University",
    "Yanbian University",
    "Yangzhou University",
    "Zhejiang University",
    "Zhongnan University of Economics and Law",
    "Zhuhai College of Jilin University",
    "Shanghai University",
    "Sichuan Normal University",
    "Chong Qing University",
    "Shanghai University of Finance & Economics",
    "Beijing Institute of Technology",
    "North China University of Technology",
    "Beijing University of Chemical Technology",
    "Shantou University",
    "China Medical University",
    "Chinese Culture University",
    "Dharma Drum Buddhist College",
    "Feng Chia University",
    "Fo Guang University",
    "Nanhua University",
    "National Central University",
    "National Cheng Kung University",
    "National Chengchi University",
    "National Taipei University",
    "National Taipei University of Technology",
    "National Taiwan Normal University",
    "National Taiwan University",
    "Shih Chien University",
    "Tatung University",
    "Tzu Chi University",
    "Chung Yuan Christian University",
    "Southern Taiwan University of Science and Technology",
    "National Taiwan University",
    "National University of Kaohsiung",
    "Asia University",
    "University of Taipei",
    "Lingnan University",
    "The Hong Kong Institute of Education"
]

gets me only:

{
  "Beijing Foreign Studies University": [
    "[email protected]",
    "[email protected]"
  ],
  "Beijing Jiaotong University": "NA",
  "Beijing Language and Culture University": "NA",
  "Beijing Radio and Television University": "NA",
  "Beijing University of Chinese Medicine": "NA",
  "Beijing University of Posts and Telecommunications": "NA",
  "Central China Normal University": "NA",
  "Chong Qing University": "NA",
  "Donghua University": "NA",
  "East China Normal University": "NA",
  "sources": [
    "https://iss.bfsu.edu.cn/notice_intro.php?id=84",
    "https://osao.bfsu.edu.cn/info/1042/2097.htm",
    "https://greatyop.com/chinese-universities-agency-no-province/",
    "https://freestudyinchina.com/silk-road-scholarship-beijing-jiaotong-university/"
  ]
}
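
To see at a glance which universities came back empty, the result can be split into found and missing entries, so the missing ones can be retried separately. This is a minimal sketch, assuming the result keeps the shape shown above (university names as keys, "NA" for misses, plus a "sources" list). `split_result` is a hypothetical helper, not part of scrapegraphai, and the sample addresses are placeholders.

```python
def split_result(result: dict) -> tuple:
    """Separate universities that have emails from those returned as "NA"."""
    found, missing = {}, []
    for name, value in result.items():
        if name == "sources":  # metadata key, not a university
            continue
        if value == "NA" or not value:
            missing.append(name)
        else:
            found[name] = value
    return found, missing


# Placeholder data in the same shape as the output above.
sample = {
    "Beijing Foreign Studies University": ["office@example.edu"],
    "Beijing Jiaotong University": "NA",
    "sources": ["https://example.edu/contact"],
}
found, missing = split_result(sample)
print(f"found {len(found)}, missing {len(missing)}: {missing}")
```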

dosubot bot commented Mar 14, 2025

The issue with receiving "NA" for many universities in your SearchGraph results could be due to several factors:

  1. Model Limitations: The model specified in your configuration, "openai/gpt-4o-2024-08-06", may not reliably extract contact emails from the web pages it is given. Try a different model or configuration to see whether the results improve.

  2. Search Limitations: The max_results parameter is set to 2, which limits the number of search results the graph can process. Increasing this number might provide more opportunities to find the required information.

  3. Prompt Specificity: The prompt might not be specific enough to guide the model in finding the correct information. You could try refining the prompt to be more explicit about the type of contact information you are seeking.

  4. Data Availability: The contact emails for some universities might not be readily available or easily accessible on the web, leading to "NA" results.

  5. Error Handling: The script has no error handling or logging beyond "verbose": True, so it gives little insight into why a result is "NA". Adding logging could help diagnose the issue.

Consider adjusting these factors to potentially improve the quality of your results.
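
Points 2 and 3 can be combined by running one search per university, with a more explicit prompt and a larger `max_results`. The following is a minimal sketch, not a confirmed fix. It reuses the `SearchGraph(prompt=..., config=...)` call and the config keys from the script above. The helpers `build_prompt`, `build_config`, and `collect_contacts`, and the value `max_results=5`, are hypothetical choices made for illustration.

```python
import os


def build_prompt(university: str) -> str:
    """Ask about a single university and spell out the expected answer."""
    return (
        f"Find the official contact email addresses of {university} "
        "(for example the admissions or international students office). "
        'Return a list of email addresses, or "NA" if none are found.'
    )


def build_config(api_key, max_results=5):
    """Same keys as the original config, with a larger max_results (assumed value)."""
    return {
        "llm": {"api_key": api_key, "model": "openai/gpt-4o-2024-08-06"},
        "max_results": max_results,
        "verbose": True,
    }


def collect_contacts(universities, config):
    """Run one SearchGraph per university so each name gets its own search."""
    # Imported lazily so the helpers above work without scrapegraphai installed.
    from scrapegraphai.graphs import SearchGraph

    results = {}
    for name in universities:
        try:
            graph = SearchGraph(prompt=build_prompt(name), config=config)
            results[name] = graph.run()
        except Exception as exc:  # record the failure and keep going
            results[name] = f"error: {exc}"
    return results


if __name__ == "__main__":
    api_key = os.getenv("OPENAI_API_KEY")
    if api_key:  # only hit the network when a key is configured
        config = build_config(api_key)
        print(collect_contacts(["Beijing Jiaotong University"], config))
```

This issues one search and one LLM run per university, so it costs more API calls than the single-prompt version. Recording the exception text per university also gives a first hint about why a lookup failed.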


VinciGit00 (Collaborator) commented

Try with gpt-4o-mini.
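
In terms of the script above, this suggestion amounts to changing the "model" entry in `graph_config`. A small sketch follows, assuming the identifier is `openai/gpt-4o-mini`, which follows the `openai/<model>` pattern of the original config. The exact string is an assumption and should be checked against the scrapegraphai documentation. `with_model` is a hypothetical helper.

```python
import copy


def with_model(config: dict, model: str) -> dict:
    """Return a copy of the config that uses a different LLM model name."""
    new_config = copy.deepcopy(config)  # leave the original config untouched
    new_config["llm"]["model"] = model
    return new_config


graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",  # placeholder, not a real key
        "model": "openai/gpt-4o-2024-08-06",
    },
    "max_results": 2,
    "verbose": True,
}

# Assumed identifier for GPT-4o mini; not confirmed in this thread.
mini_config = with_model(graph_config, "openai/gpt-4o-mini")
print(mini_config["llm"]["model"])
```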
