The application, built with Streamlit, provides a multi-faceted interface for interacting with the knowledge graph.
Execute a curated list of complex, pre-written Cypher queries. Examples include:
- Finding "Super-Subs": Identify players who scored a goal after entering a match as a substitute.
- Team Logistics: List all stadiums and cities where a specific team played.
- Performance Analytics: Find players who received a card but never scored a goal.
- Manager Insights: Discover the manager of the tournament-winning team and see which awards their players have won.
Beyond simple data retrieval, you can find entities that are semantically similar based on their roles and relationships within the graph. This feature lets you compare three similarity algorithms and visualize how the top result connects to your query:
- Jaccard Similarity (Structural): Finds entities that share the most common neighbors in the graph. Calculated directly in Neo4j using the Graph Data Science (GDS) library.
- KGE - TransE (Distributional): Uses embeddings from a trained TransE model to find entities that are close in the learned vector space, based on translational relationships.
- KGE - ComplEx (Distributional): Uses embeddings from a trained ComplEx (Complex Embeddings) model, which can capture both symmetric and asymmetric relationships.
- Shortest Path Visualization: Automatically computes and displays the shortest path between your query entity and the top-ranked similar entity, revealing the hidden connections that link them.
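To make the difference between the structural and distributional measures concrete, here is a minimal pure-Python sketch. It is not the app's code: the app computes Jaccard inside Neo4j with GDS and reads embeddings from the trained models. The node names and neighbor sets below are hypothetical toy data.

```python
import math
from typing import Sequence, Set


def jaccard(neighbors_a: Set[str], neighbors_b: Set[str]) -> float:
    """Jaccard similarity: |A intersect B| / |A union B| (0.0 if both sets are empty)."""
    union = neighbors_a | neighbors_b
    if not union:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)


def transe_distance(
    head: Sequence[float], relation: Sequence[float], tail: Sequence[float]
) -> float:
    """TransE translation distance ||head + relation - tail||_2.

    A smaller distance means the triple (head, relation, tail) is more plausible.
    """
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hypothetical toy neighborhoods, for illustration only.
neighbors = {
    "player_a": {"team_x", "match_1", "award_1"},
    "player_b": {"team_x", "match_1", "match_2"},
    "player_c": {"team_y"},
}

# player_a and player_b share 2 of 4 distinct neighbors.
print(jaccard(neighbors["player_a"], neighbors["player_b"]))
# A perfect translation: head + relation lands exactly on tail.
print(transe_distance([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]))
```

Jaccard only looks at shared neighbors, so two entities with no common neighbor always score 0, while embedding-based measures can still place them close together if they play similar roles in the graph.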
Interact with the knowledge graph using natural language. This feature employs a Retrieval-Augmented Generation (RAG) pipeline:
- Text-to-Cypher: A Large Language Model (Google's Gemini) translates your plain English question into a precise Cypher query.
- Graph Retrieval: The generated query is executed against the Neo4j database to retrieve relevant data.
- Natural Language Answer: The retrieved data is passed back to the LLM, which synthesizes it into a clear, concise answer to your original question.
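The three steps above can be sketched as a single function. This is only an outline under stated assumptions: `generate` and `run_cypher` are hypothetical callables standing in for the Gemini call and the Neo4j query execution, and the prompts are illustrative rather than the ones used in `app.py`.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def answer_question(
    question: str,
    generate: Callable[[str], str],          # hypothetical LLM wrapper
    run_cypher: Callable[[str], List[Row]],  # hypothetical Neo4j wrapper
) -> str:
    # Step 1: Text-to-Cypher. Ask the LLM to translate the question.
    cypher = generate(f"Translate this question into a Cypher query:\n{question}").strip()
    # Step 2: Graph retrieval. Execute the generated query.
    rows = run_cypher(cypher)
    # Step 3: Natural language answer. Let the LLM summarize the rows.
    prompt = (
        f"Question: {question}\n"
        f"Query results: {rows}\n"
        "Answer concisely using only these results."
    )
    return generate(prompt)


# Stubs so the sketch runs without an API key or a database.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("Translate"):
        return "MATCH (p:Player) RETURN count(p) AS n"
    return "There are 3 players."


def fake_db(query: str) -> List[Row]:
    return [{"n": 3}]


print(answer_question("How many players are there?", fake_llm, fake_db))
```

Passing the LLM and the database in as callables keeps the pipeline easy to test with stubs like the ones above.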
Find players who look similar using an image-based search:
- Upload an Image: Provide a photo of a player.
- Embedding Extraction: The application calculates a vector embedding of the uploaded image using a pre-trained EfficientNet-B0 model.
- Similarity Search: It then queries the Neo4j database to find the top 5 players whose pre-calculated image embeddings have the highest Cosine Similarity to the query image's embedding, powered by the GDS library.
- Visual Results: The results are displayed showing the similar players' photos, names, and similarity scores.
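The ranking step can be illustrated in plain Python. In the app this comparison runs inside Neo4j via GDS; the sketch below only shows the same idea on hypothetical two-dimensional vectors (real EfficientNet-B0 embeddings are much longer), and the player names are placeholders.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: Sequence[float], stored: Dict[str, Sequence[float]], k: int = 5
) -> List[Tuple[str, float]]:
    """Return the k stored items with the highest cosine similarity to the query."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical pre-calculated embeddings, for illustration only.
stored_embeddings = {
    "player_a": [1.0, 0.0],
    "player_b": [0.0, 1.0],
    "player_c": [1.0, 1.0],
}

for name, score in top_k_similar([1.0, 0.0], stored_embeddings, k=2):
    print(f"{name}: {score:.3f}")
```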
Follow these steps to set up the project locally.
First, create and activate a Python virtual environment:
# Create the virtual environment
python -m venv venv
# Activate it (on Windows)
.\venv\Scripts\activate
# Activate it (on macOS/Linux)
source venv/bin/activate
Then, install the required dependencies:
pip install -r requirements.txt
You'll need Neo4j Desktop with an active DBMS instance (Enterprise Edition is recommended to use GDS).
- Create a local DBMS.
- Configure Settings:
  - Open the settings for your DBMS.
  - Add the following line to enable Neosemantics (n10s) RDF procedures:
    dbms.unmanaged_extension_classes=n10s.endpoint=/rdf
  - Add the following line to grant unrestricted access to GDS, APOC, and n10s procedures:
    dbms.security.procedures.unrestricted=jwt.security.*,n10s.*,apoc.*,gds.*
- Install Plugins:
  - In your DBMS view, click Open folder -> Plugins.
  - Download the JAR files for APOC, Neosemantics (n10s), and Graph Data Science (GDS) compatible with your Neo4j version.
  - Place the downloaded .jar files into this plugins folder.
- Configure APOC:
  - Go back and click Open folder -> Configuration.
  - Create a file named apoc.conf (if it doesn't exist).
  - Add the following lines to the file to enable file import/export capabilities:
    apoc.import.file.enabled=true
    apoc.export.file.enabled=true
- Restart the DBMS for the changes to take effect.
Before running the similarity algorithms, you need to project your graph into an in-memory format optimized for GDS. This needs to be done once per DBMS session: projected graphs live in memory and are dropped when the DBMS restarts.
Open the Neo4j Browser and run the following Cypher query:
CALL gds.graph.project(
  'fifa_graph', // The name we give to the in-memory graph
  '*',          // Use all node labels
  '*'           // Use all relationship types
)
YIELD graphName, nodeCount, relationshipCount
This command creates an in-memory graph named fifa_graph, which will be used by the Jaccard similarity functions.
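Since the projection is held in memory, it can be convenient to create it from Python only when it is missing. The sketch below is one possible way to do that, not something the project ships: `run` is a hypothetical callable that executes a Cypher query with parameters and returns rows as dictionaries (for example, a thin wrapper around a Neo4j driver session), and `FakeCatalog` exists only so the example runs without a database.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Runner = Callable[[str, Dict[str, object]], List[Row]]

EXISTS_QUERY = "CALL gds.graph.exists($name) YIELD exists RETURN exists"
PROJECT_QUERY = (
    "CALL gds.graph.project($name, '*', '*') "
    "YIELD graphName, nodeCount, relationshipCount "
    "RETURN graphName, nodeCount, relationshipCount"
)


def ensure_projection(run: Runner, name: str = "fifa_graph") -> bool:
    """Project the graph only if it is not in the GDS catalog yet.

    Returns True when a new projection was created, False otherwise.
    """
    rows = run(EXISTS_QUERY, {"name": name})
    if rows and rows[0]["exists"]:
        return False
    run(PROJECT_QUERY, {"name": name})
    return True


class FakeCatalog:
    """In-memory stand-in for the database, for demonstration only."""

    def __init__(self) -> None:
        self.graphs = set()

    def __call__(self, query: str, params: Dict[str, object]) -> List[Row]:
        name = params["name"]
        if "gds.graph.exists" in query:
            return [{"exists": name in self.graphs}]
        self.graphs.add(name)
        return [{"graphName": name, "nodeCount": 0, "relationshipCount": 0}]


catalog = FakeCatalog()
print(ensure_projection(catalog))  # first call creates the projection
print(ensure_projection(catalog))  # second call finds it and does nothing
```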
Create a .env file in the root of the project directory and add your credentials:
NEO4J_URI="bolt://localhost:7687"
NEO4J_USER="neo4j"
NEO4J_PASSWORD="your_password"
NEO4J_DATABASE="your_database_name"
GOOGLE_API_KEY="your_google_api_key"
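A quick way to confirm the file is complete before launching the app is to parse it and check for missing keys. The helper below is a minimal stdlib-only sketch that assumes simple KEY="value" lines like the ones above; the function names are hypothetical, and the app itself may load the file differently (for example with a dotenv library).

```python
from typing import Dict, List

REQUIRED_KEYS = (
    "NEO4J_URI",
    "NEO4J_USER",
    "NEO4J_PASSWORD",
    "NEO4J_DATABASE",
    "GOOGLE_API_KEY",
)


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY="value" lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


sample = 'NEO4J_URI="bolt://localhost:7687"\nNEO4J_USER="neo4j"\n'
print(missing_keys(parse_env(sample)))
```

To check a real file, read it first, for example `parse_env(open(".env").read())`.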
Once the setup is complete, launch the Streamlit app with the following command:
streamlit run app.py
.
├── data_preprocessing/  # Notebook for preprocessing and merging datasets
│   ├── data_preprocessing_competition_stats.ipynb
│   ├── data_preprocessing_player_stats.ipynb
│   ├── data_preprocessing.ipynb
│   ├── data/
│   ├── competition_stats.csv
│   ├── player_stats.csv
│   ├── world_cup/
│   ├── world_cup_complete.csv
│   ├── processed/
│   ├── class_mappings.csv
│   └── object_properties_mappings.csv
│
├── evaluation/
│   ├── images/  # Contains 5 images for each of the 32 players for evaluation
│   ├── models/  # Assets for KGE model training and evaluation
│   ├── create_ground_truth.py  # Script to create the ground truth for semantic sim. evaluation
│   ├── fifa_triplets.tsv  # Triplets used for KGE model training
│   ├── ground_truth_images.csv  # Ground truth for image search evaluation
│   ├── image_sim_evaluation.py  # Evaluation script for image semantic similarity
│   ├── semantic_evaluation.py  # Evaluation script for entity semantic similarity
│   └── semantic_ground_truth.csv  # Ground truth for entity semantic similarity
│
├── models/
│   ├── complex_fifa.pt  # Trained ComplEx model artifact
│   └── transe_fifa.pt  # Trained TransE model artifact
│
├── player_images/  # Contains 10 images for players from each team in every group
│
├── app.py  # Main Streamlit application file
├── fifa_triplets.tsv  # Triplets for KGE model training (used by app.py)
├── fifa_wc_ontology.ttl  # Domain ontology created with Protégé
├── image_embedder.py  # Script to calculate image embeddings
├── image_embedding_loader.py  # Script to load embeddings as properties for Player nodes
├── image_embeddings_avg.json  # Calculated average player embeddings
├── kg_loader.py  # Script for Ontology-aware KG creation in Neo4j
├── requirements.txt  # Python dependencies for the virtual environment
└── train_kge.py  # Script for training the KGE models

