A GenAI system that interprets and semantically links embedded images within narrative documents for visual question answering and retrieval.
- Accepts input formats: PDF, DOCX.
- Uses a vision-language model (Gemma 3) to extract semantic meaning from images.
- Implements chunking logic that binds pre-image and post-image text with image metadata (a minimal sketch follows this list).
- Stores enriched chunks in a vector store (FAISS).
- Enables semantic search and retrieval using vector similarity.
- Builds a chatbot using retrieval-augmented generation (RAG) over vector data.
- Ensures chunk provenance (page, image position) is preserved.
- Implements access control and role-based user permissions.
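The enriched-chunk idea can be illustrated with a short sketch. The class and function names below are illustrative, not the repo's actual API; it assumes a sentence-transformers embedder and an exact inner-product FAISS index.

```python
# A minimal sketch of the enriched-chunk structure and FAISS indexing,
# assuming a sentence-transformers embedder; names are illustrative and
# may not match the repo's actual code.
from dataclasses import dataclass

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedder


@dataclass
class EnrichedChunk:
    pre_text: str           # narrative text immediately before the image
    post_text: str          # narrative text immediately after the image
    image_description: str  # description generated by the vision-language model
    source_file: str        # provenance: originating document
    page: int               # provenance: page number
    image_index: int        # provenance: position of the image on the page

    def as_passage(self) -> str:
        # Embed the surrounding narrative together with the image description
        # so vector search captures both textual and visual content.
        return f"{self.pre_text}\n[IMAGE] {self.image_description}\n{self.post_text}"


def build_index(chunks: list[EnrichedChunk]):
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = embedder.encode([c.as_passage() for c in chunks], normalize_embeddings=True)
    # Inner product on normalized vectors is cosine similarity.
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(np.asarray(vectors, dtype=np.float32))
    return index, embedder
```

Once the index is built, a question embedded with the same model can be matched against the enriched chunks by vector similarity; the later sketches in this README reuse this index.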
Take, for example, an auto brochure for the 2006 Porsche Cayenne Turbo.

The system identifies each image and the relevant text surrounding it.

A description is also generated for each image. Whenever the user asks for an image, the most relevant picture is returned.
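Description generation might look like the sketch below. It assumes the Gemma 3 model is served locally through Ollama under the name "gemma3"; the function name and prompt are illustrative, and the repo's actual call may differ.

```python
# Hedged sketch of per-image description generation via Ollama's Python
# client (pip install ollama). The extracted image file is passed alongside
# the prompt; "gemma3" is an assumed model name for a local Gemma 3 deployment.
import ollama


def describe_image(image_path: str) -> str:
    response = ollama.chat(
        model="gemma3",
        messages=[{
            "role": "user",
            "content": "Describe this image in two or three factual sentences.",
            "images": [image_path],  # Ollama accepts local file paths here
        }],
    )
    return response["message"]["content"]
```

Because this description is embedded together with the surrounding text, a query like "show me the interior" can match a photo whose generated description mentions the interior, even if the nearby narrative does not.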

The project supports Q&A retrieval over multiple DOCX and PDF files.

Step 1: Add your desired documents to the VisionDOC-AI/documents directory (remove any existing documents you don't intend to use).
Step 2: Run extraction/documents_extraction.py to extract text and image information from the documents.
Step 3: Run db_build.py to store the extracted information in the vector store.
Step 4: Run app.py to launch the web chatbot interface.
- Log in with user 'admin' and password 'adminpwd' for admin permissions.
- You can now ask the model about any image that exists in the documents; a sketch of the answering step follows.
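Under the hood, the chatbot's retrieval-augmented generation step can be sketched as below, reusing the index, embedder, and EnrichedChunk objects from the first sketch and again assuming a local Ollama deployment of Gemma 3; the exact prompt and client in the repo may differ.

```python
# Hedged sketch of the RAG step: embed the question, fetch the top-k
# enriched chunks from FAISS, and let the model answer from that context.
import numpy as np
import ollama  # assumed serving layer, as in the description sketch


def answer(question: str, index, chunks, embedder, k: int = 3) -> str:
    query = embedder.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(query, dtype=np.float32), k)
    context = "\n\n".join(chunks[i].as_passage() for i in ids[0])
    prompt = (
        "Answer using only the context below, and mention the source page "
        "when it is relevant.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    reply = ollama.chat(model="gemma3", messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]
```

The same search also drives image requests: the top-ranked chunk's provenance fields (source_file, page, image_index) identify which picture to display alongside the answer.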
