A GenAI system that interprets and semantically links embedded images within narrative documents for visual question answering and retrieval.
- Accepts input formats: PDF, DOCX.
- Uses a vision-language model (Gemma 3) to extract semantic meaning from images.
- Implements chunking logic that binds pre-image and post-image text with image metadata (a minimal sketch follows this list).
- Stores enriched chunks in a vector store (FAISS).
- Enables semantic search and retrieval using vector similarity.
- Builds a chatbot using retrieval-augmented generation (RAG) over vector data.
- Ensures chunk provenance (page, image position) is preserved.
- Implements access control and role-based user permissions.
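The enriched-chunk idea can be illustrated with a short sketch. The class and function names below are illustrative, not the repo's actual API; it assumes a sentence-transformers embedder and an exact inner-product FAISS index.

```python
# A minimal sketch of the enriched-chunk structure and FAISS indexing,
# assuming a sentence-transformers embedder; names are illustrative and
# may not match the repo's actual code.
from dataclasses import dataclass

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedder


@dataclass
class EnrichedChunk:
    pre_text: str           # narrative text immediately before the image
    post_text: str          # narrative text immediately after the image
    image_description: str  # description generated by the vision-language model
    source_file: str        # provenance: originating document
    page: int               # provenance: page number
    image_index: int        # provenance: position of the image on the page

    def as_passage(self) -> str:
        # Embed the surrounding narrative together with the image description
        # so vector search captures both textual and visual content.
        return f"{self.pre_text}\n[IMAGE] {self.image_description}\n{self.post_text}"


def build_index(chunks: list[EnrichedChunk]):
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = embedder.encode([c.as_passage() for c in chunks], normalize_embeddings=True)
    # Inner product on normalized vectors is cosine similarity.
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(np.asarray(vectors, dtype=np.float32))
    return index, embedder
```

Once the index is built, a question embedded with the same model can be matched against the enriched chunks by vector similarity; the later sketches in this README reuse this index.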
Take, for example, an auto brochure for the 2006 Porsche Cayenne Turbo.

The system identifies each image and the relevant text surrounding it.

A description is also generated for each image. Whenever the user asks for an image, the most relevant picture is returned.
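Description generation might look like the sketch below. It assumes the Gemma 3 model is served locally through Ollama under the name "gemma3"; the function name and prompt are illustrative, and the repo's actual call may differ.

```python
# Hedged sketch of per-image description generation via Ollama's Python
# client (pip install ollama). The extracted image file is passed alongside
# the prompt; "gemma3" is an assumed model name for a local Gemma 3 deployment.
import ollama


def describe_image(image_path: str) -> str:
    response = ollama.chat(
        model="gemma3",
        messages=[{
            "role": "user",
            "content": "Describe this image in two or three factual sentences.",
            "images": [image_path],  # Ollama accepts local file paths here
        }],
    )
    return response["message"]["content"]
```

Because this description is embedded together with the surrounding text, a query like "show me the interior" can match a photo whose generated description mentions the interior, even if the nearby narrative does not.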

The project supports Q&A retrieval over multiple DOCX and PDF files.

Step 1: Add your desired documents to the VisionDOC-AI/documents directory (remove any existing documents you don't intend to use).
Step 2: Run extraction/documents_extraction.py to extract text and image information from the documents.
Step 3: Run db_build.py to store the extracted information in the vector store.
Step 4: Run app.py to launch the web chatbot interface.
- Log in with user 'admin' and password 'adminpwd' for admin permissions.
- You can now ask the model about any image that exists in the documents; a sketch of the answering step follows.
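Under the hood, the chatbot's retrieval-augmented generation step can be sketched as below, reusing the index, embedder, and EnrichedChunk objects from the first sketch and again assuming a local Ollama deployment of Gemma 3; the exact prompt and client in the repo may differ.

```python
# Hedged sketch of the RAG step: embed the question, fetch the top-k
# enriched chunks from FAISS, and let the model answer from that context.
import numpy as np
import ollama  # assumed serving layer, as in the description sketch


def answer(question: str, index, chunks, embedder, k: int = 3) -> str:
    query = embedder.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(query, dtype=np.float32), k)
    context = "\n\n".join(chunks[i].as_passage() for i in ids[0])
    prompt = (
        "Answer using only the context below, and mention the source page "
        "when it is relevant.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    reply = ollama.chat(model="gemma3", messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]
```

The same search also drives image requests: the top-ranked chunk's provenance fields (source_file, page, image_index) identify which picture to display alongside the answer.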
