- Introduction
- Key Features
- Frameworks
- Pipeline Flowchart
- Setup Instructions
- Google Sheets Integration
- Setting up Environment Variables
- Dashboard UI
This end-to-end tool automates data retrieval from the web: it runs searches, preprocesses and filters the results, scrapes content, extracts relevant context, and structures the data in a user-friendly format. The dashboard combines AI-powered and web scraping capabilities and lets users define custom search queries to retrieve the most relevant data from online sources.
- Web Scraping and Filtering: Automates Google search, URL filtering, and content scraping.
- Webpage Parsing: Parses both HTML and PDF content to extract relevant context.
- Contextual Data Retrieval: Uses embeddings to retrieve and structure relevant data.
- Asynchronous Processing: Improves efficiency for large datasets.
- Google Sheets Integration: Supports importing spreadsheets from Google Sheets.
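The contextual retrieval feature can be illustrated with a minimal sketch. It assumes each scraped text chunk already has an embedding vector and ranks chunks by cosine similarity to the query vector. The function names and the toy 3-dimensional vectors are hypothetical; the project's actual embedding model and ranking logic are not shown in this README.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, treat as unrelated
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunks, k=2):
    """Return the k (text, score) pairs most similar to the query.

    `chunks` is a list of (text, embedding) pairs.
    """
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (illustrative values only).
chunks = [
    ("Acme Corp was founded in 1999.", [0.9, 0.1, 0.0]),
    ("Contact us at the address below.", [0.0, 0.2, 0.9]),
    ("Acme's headquarters are in Berlin.", [0.7, 0.6, 0.1]),
]
query = [1.0, 0.2, 0.0]
best = top_k_chunks(query, chunks, k=2)
```

In a real run the vectors would come from an embedding model, and only the top-ranked chunks would be passed on as context for structuring the answer.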
- Backend: Python
- API: FastAPI
- UI: Streamlit
- Data Handling: Pandas
- Google Sheets Integration: gspread
- Web Search: Custom Google Search Module
- Web Scraping: Beautiful Soup, PyPDF2, Newspaper4k (Newspaper4k yields better scraped data but takes more time)
- LLM: OpenAI (gpt-4o-mini; use "gpt-4o" for better consistency)
- Agents: Langchain
Prerequisites:
- Python 3.10 or higher
- Pip
Steps:
- Clone the repository:
  git clone https://github.com/suryanshgupta9933/breakoutai-assesment.git
  cd breakoutai-assesment
- Create and activate a virtual environment:
  python -m venv venv
  source venv/bin/activate
- Install the required dependencies:
  pip install -r requirements.txt
- Start the application:
  - Run the FastAPI backend:
    python routes.py
  - Run the Streamlit dashboard:
    streamlit run dashboard.py
This setup uses Docker Compose and requires minimal configuration.
Prerequisites:
- Docker
Steps:
- Build and run the Docker containers:
  docker-compose up --build
- Access the Streamlit dashboard at http://localhost:8501.
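After starting the services, a quick reachability check can confirm that both are listening. The sketch below uses only the standard library. Port 8501 is the dashboard address given above, while port 8000 for the backend is an assumption taken from the endpoint URLs in the example environment file. The helper names are hypothetical and not part of the repository.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def service_status(services, host="localhost"):
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    # Assumed ports: 8000 for the FastAPI backend, 8501 for Streamlit.
    ports = {"FastAPI backend": 8000, "Streamlit dashboard": 8501}
    for name, up in service_status(ports).items():
        print(f"{name}: {'reachable' if up else 'not reachable'}")
```

This only checks that something accepts TCP connections on each port; it does not verify that the application behind it is healthy.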
In the current implementation, Google Sheets integration is supported through a service account. A service account is a special type of Google account intended to represent a non-human user that needs to authenticate and be authorized to access data in Google APIs.
Note: Because it is a separate account, it has no access to any spreadsheet until you share the spreadsheet with it, just like any other Google account.
- Head over to the Google Cloud Console and create a new project (or select an existing one).
- Search for "APIs & Services" in the search bar and click on "Enable APIs and Services".
- Search for "Google Sheets API" and click on "Enable".
- Search for "Google Drive API" and click on "Enable".
You have successfully enabled the Google Sheets API and Google Drive API for your project.
- Go to "APIs & Services" > "Credentials" and click on "Create Credentials".
- Select "Service Account" and fill in the details.
Note: Copy the Service Account Email ID. You will share your Google Sheets with this account.
- Click the ⋮ (three-dot) menu next to the newly created service account, select "Keys", and then click on "Add Key" > "Create new key".
- Select the JSON key type and press "Create".
- Your Service Account JSON key will be downloaded.
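If the Service Account Email ID was not copied earlier, it can be read back from the downloaded key, since Google service account key files store it in the `client_email` field. The following is a small sketch using only the standard library; the function name is hypothetical.

```python
import json


def service_account_email(key_path):
    """Read the service account's email address from a JSON key file."""
    with open(key_path, encoding="utf-8") as f:
        key = json.load(f)
    # Service account key files declare their type explicitly.
    if key.get("type") != "service_account":
        raise ValueError(f"{key_path} does not look like a service account key")
    return key["client_email"]
```

Calling `service_account_email("path/to/your/service-account-key.json")` returns the address to share your spreadsheets with.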
- Rename the .env.example file to .env and update the environment variables:
  UPLOAD_ENDPOINT="http://localhost:8000/upload-csv"
  PIPELINE_ENDPOINT="http://localhost:8000/pipeline"
  OPENAI_API_KEY="your-openai-api-key"
  SERVICE_ACCOUNT_KEY="path/to/your/service-account-key.json"
- Add your OpenAI API key and the path to your Service Account JSON key to the .env file.
- You have successfully set up Google Sheets Integration. Share your Google Sheets with the Service Account Email ID and you are good to go.
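To show how the pieces fit together, here is a minimal sketch that checks the four environment variables listed above and then reads a shared sheet with gspread. It assumes the variables are already present in the process environment (for example, loaded from the .env file by a tool such as python-dotenv) and that the data lives in the first worksheet. The function names and the `client_factory` parameter are hypothetical and exist only to keep the sketch testable; this is not the repository's own loading code.

```python
import os

REQUIRED_VARS = (
    "UPLOAD_ENDPOINT",
    "PIPELINE_ENDPOINT",
    "OPENAI_API_KEY",
    "SERVICE_ACCOUNT_KEY",
)


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def load_sheet_records(sheet_url, env=None, client_factory=None):
    """Fetch all rows of the first worksheet as a list of dicts."""
    env = os.environ if env is None else env
    missing = missing_env_vars(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    if client_factory is None:
        import gspread  # third-party; imported lazily so the check above works without it

        client_factory = gspread.service_account
    # Authenticate with the service account key downloaded earlier.
    client = client_factory(filename=env["SERVICE_ACCOUNT_KEY"])
    worksheet = client.open_by_url(sheet_url).sheet1
    return worksheet.get_all_records()
```

The returned list of dictionaries can be passed to `pandas.DataFrame(...)` for further processing. If the sheet has not been shared with the service account, the `open_by_url` call will fail with a permission error.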