webscraper-using-langchain-and-chromaDB

This is a small demo project showing how to build a chatbot that can answer questions about a scraped website. It uses LangChain as the chatbot framework, Gradio for a user-friendly interface, OpenAI's gpt-3.5-turbo model as the LLM, and ChromaDB as the vector store.

Getting started

This project supports both pip and pipenv. I recommend pipenv for the best (and least error-prone) experience.

Installation

Pip

Install the dependencies with

pip install -r requirements.txt

Pipenv

Install the dependencies with

pipenv install

then run pipenv shell to start a shell with the installed packages.

Environment variables

Create a new .env file from the .env.example file and set OPENAI_API_KEY in it. An API key can be created on OpenAI's platform.
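
As a minimal sketch of how the key might be picked up at runtime, assuming the .env file holds simple KEY=VALUE lines: the helper name load_env is hypothetical, and the project itself may load the file differently (for instance through a library such as python-dotenv).

```python
import os
from pathlib import Path


def load_env(path=".env"):
    """Read KEY=VALUE lines from a .env-style file into os.environ.

    Hypothetical helper for illustration. Blank lines, comment lines
    starting with '#', and lines without '=' are skipped. Variables that
    are already set in the environment are left untouched.
    """
    loaded = {}
    env_file = Path(path)
    if not env_file.exists():
        return loaded
    for raw in env_file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Remove optional surrounding quotes from the value.
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded
```

Calling load_env() before any OpenAI request and then checking os.environ.get("OPENAI_API_KEY") gives an early, readable failure when the key is missing.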

Web scraping

To scrape a site, run

python scrape.py --site <site_url> --depth <int>

This scrapes the given URL and, recursively, every link found there, up to the specified depth. Only pages with the same origin as <site_url> are scraped; for example, scraping https://python.langchain.com/docs will only scrape pages at https://python.langchain.com.
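
The depth limit and same-origin rule can be sketched as a breadth-first crawl. This is an illustration under stated assumptions, not the code in scrape.py: the names LinkParser, origin, and crawl are hypothetical, the fetch argument is a stand-in for whatever HTTP client is used, and a depth of 0 is assumed to mean "the start page only".

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def origin(url):
    """Return the (scheme, host[:port]) pair identifying a URL's origin."""
    parts = urlsplit(url)
    return (parts.scheme, parts.netloc)


def crawl(start_url, depth, fetch):
    """Fetch start_url and same-origin pages at most `depth` links away.

    `fetch` is any callable mapping a URL to its HTML text (an assumption
    made so the network layer stays swappable). Returns {url: html}.
    """
    pages = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, level = queue.popleft()
        url, _ = urldefrag(url)  # drop '#fragment' so pages are not revisited
        if url in pages or origin(url) != origin(start_url):
            continue
        html = fetch(url)
        pages[url] = html
        if level >= depth:
            continue  # depth limit reached: keep the page, ignore its links
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            queue.append((urljoin(url, href), level + 1))
    return pages
```

Breadth-first order is used so that each page is first reached by its shortest link path, which keeps the depth limit from cutting off pages that are also reachable by a shorter route.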

The data will be stored in a new scrape/ directory.
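
For reference, the --site and --depth options of the scrape command could be parsed along these lines. This is a sketch only: the function name parse_args, the help texts, and the default depth of 1 are assumptions rather than details taken from scrape.py.

```python
import argparse


def parse_args(argv=None):
    """Parse the --site and --depth options of the scrape command."""
    parser = argparse.ArgumentParser(description="Scrape a site recursively.")
    parser.add_argument("--site", required=True, help="URL to start scraping from")
    # The default of 1 is an assumption made for this sketch.
    parser.add_argument("--depth", type=int, default=1, help="maximum link depth to follow")
    return parser.parse_args(argv)
```

Declaring --depth with type=int lets argparse reject non-numeric values before any request is made.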

Data embeddings

To generate and persist the embeddings and create a vector store, run

python embed.py

A new persisted vector store will be created in the chroma/ directory.
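
To show what "embed, persist, then search" means in principle, here is a toy vector store built only on the standard library. It is deliberately not ChromaDB and does not call OpenAI: the word-count embedding, the JSON file format, and the function names embed, cosine, build_store, and search are all assumptions chosen for illustration.

```python
import json
import math
from collections import Counter
from pathlib import Path


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_store(documents, path):
    """Embed each document and persist text plus vector as JSON at `path`."""
    records = [{"text": doc, "vector": dict(embed(doc))} for doc in documents]
    Path(path).write_text(json.dumps(records), encoding="utf-8")


def search(query, path, k=1):
    """Load the persisted store and return the k texts most similar to query."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    query_vec = embed(query)
    ranked = sorted(
        records,
        key=lambda r: cosine(query_vec, Counter(r["vector"])),
        reverse=True,
    )
    return [r["text"] for r in ranked[:k]]
```

A real embedding model replaces the word counts with dense vectors that capture meaning, but the flow is the same: vectors are computed once, written to disk, and later compared against the embedded query.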

Launching the chatbot

To launch the chatbot, run

python main.py

This starts a Gradio server at http://127.0.0.1:7860, where we can chat with the scraped website data held in the vector store.

NOTE: this only works after we have both scraped a site and persisted a vector store.
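
A small pre-launch check for this requirement might look as follows. It is a sketch that assumes the scrape/ and chroma/ directories sit in the project root and are non-empty once their steps have run; the function name missing_prerequisites is hypothetical.

```python
from pathlib import Path


def missing_prerequisites(root="."):
    """Return the names of required directories that are absent or empty."""
    missing = []
    for name in ("scrape", "chroma"):
        directory = Path(root) / name
        if not directory.is_dir() or not any(directory.iterdir()):
            missing.append(name)
    return missing
```

If the returned list is non-empty, the corresponding step (scrape.py for scrape, embed.py for chroma) still needs to be run before starting main.py.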

