An ontology-aware Knowledge Graph for the FIFA World Cup 2022, explorable through a Streamlit application. It enables deep analysis via structural queries, semantic similarity (TransE, ComplEx), player search by image, and a RAG pipeline for natural language querying.

vlb20/FIFA_WC_2022_Knowledge_Graph

FIFA World Cup 2022 Knowledge Graph

Features

The application, built with Streamlit, provides a multi-faceted interface for interacting with the knowledge graph.

1. Structural Query Retrieval

Execute a curated list of complex, pre-written Cypher queries. Examples include:

  • Finding "Super-Subs": Identify players who scored a goal after entering a match as a substitute.
  • Team Logistics: List all stadiums and cities where a specific team played.
  • Performance Analytics: Find players who received a card but never scored a goal.
  • Manager Insights: Discover the manager of the tournament-winning team and see which awards their players have won.
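
As a rough illustration of what one curated query might look like, the sketch below pairs a "Super-Subs" Cypher string with a small runner. The labels and relationship types (`Player`, `Match`, `Goal`, `CAME_ON_IN`, `SCORED`, `IN_MATCH`) and the helper `run_curated_query` are hypothetical placeholders, not the names defined in the project's ontology or in app.py.

```python
# Hypothetical schema: every label, relationship type, and property below is
# a placeholder chosen for illustration, not taken from fifa_wc_ontology.ttl.
SUPER_SUBS_QUERY = """
MATCH (p:Player)-[:CAME_ON_IN]->(m:Match),
      (p)-[:SCORED]->(:Goal)-[:IN_MATCH]->(m)
RETURN DISTINCT p.name AS player, m.name AS match
ORDER BY player
"""
# Note: this only checks that the goal and the substitution share a match.
# If the graph stores minutes, a stricter version would also compare them.


def run_curated_query(session, query, **params):
    """Run one pre-written Cypher query and return its rows as dicts.

    `session` is expected to behave like a neo4j.Session: run() returns an
    iterable of records, and each record exposes .data().
    """
    return [record.data() for record in session.run(query, **params)]
```

Keeping each query as a named constant makes it straightforward to list them in a Streamlit select box and run the chosen one through a single function.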

2. Semantic Similarity Search

Beyond simple data retrieval, you can find entities that are semantically similar based on their roles and relationships within the graph. This feature lets you compare three similarity algorithms and visualize the path that connects the matched entities:

  • Jaccard Similarity (Structural): Finds entities that share the most common neighbors in the graph. Calculated directly in Neo4j using the Graph Data Science (GDS) library.
  • KGE - TransE (Distributional): Uses embeddings from a trained TransE model to find entities that are close in the learned vector space, based on translational relationships.
  • KGE - ComplEx (Distributional): Uses embeddings from a trained ComplEx (Complex Embeddings) model, whose complex-valued vectors can represent both symmetric and asymmetric relations.
  • Shortest Path Visualization: Automatically computes and displays the shortest path between your query entity and the top-ranked similar entity, revealing the hidden connections that link them.
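
To make the three measures more concrete, here is a minimal pure-Python sketch of the quantities behind them: Jaccard overlap of neighbor sets, and the standard TransE and ComplEx triple-scoring functions. It is not the application's code (the app computes Jaccard in Neo4j via GDS and reads embeddings from trained models), and the function names are chosen here for illustration only.

```python
import math


def jaccard(neighbors_a, neighbors_b):
    """|A ∩ B| / |A ∪ B| over two neighbor sets; 0.0 when both are empty."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def transe_score(h, r, t):
    """TransE plausibility: negative L2 distance between h + r and t.

    A triple (head, relation, tail) is plausible when h + r lands near t,
    so scores closer to zero are better.
    """
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def complex_score(h, r, t):
    """ComplEx plausibility: Re(sum_i h_i * r_i * conj(t_i)).

    All three arguments are sequences of Python complex numbers. The
    conjugate on the tail is what allows score(h, r, t) != score(t, r, h).
    """
    return sum((hi * ri * ti.conjugate()).real for hi, ri, ti in zip(h, r, t))
```

With a purely imaginary relation vector, swapping head and tail flips the sign of the ComplEx score, which is how the model can treat a relation as asymmetric; TransE has no comparable mechanism for symmetric relations unless the relation vector is close to zero.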

3. LLM-driven Graph Querying (RAG)

Interact with the knowledge graph using natural language. This feature employs a Retrieval-Augmented Generation (RAG) pipeline:

  1. Text-to-Cypher: A Large Language Model (Google's Gemini) translates your plain English question into a precise Cypher query.
  2. Graph Retrieval: The generated query is executed against the Neo4j database to retrieve relevant data.
  3. Natural Language Answer: The retrieved data is passed back to the LLM, which synthesizes it into a clear, concise answer to your original question.
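
The three steps above can be sketched as a single function. This is an outline under stated assumptions, not the code in app.py: `generate` stands in for whatever call sends a prompt to Gemini and returns text, `run_cypher` stands in for the Neo4j query call, and the prompt wording is invented for illustration.

```python
def answer_question(question, schema, generate, run_cypher):
    """Text-to-Cypher, graph retrieval, then answer synthesis.

    generate(prompt) -> str     hypothetical wrapper around the LLM call
    run_cypher(query) -> list   hypothetical wrapper around the Neo4j session
    """
    # Step 1: ask the LLM to translate the question into Cypher.
    cypher_prompt = (
        "Translate the question into a single Cypher query.\n"
        f"Graph schema:\n{schema}\n"
        f"Question: {question}\n"
        "Return only the query."
    )
    cypher = generate(cypher_prompt).strip()

    # Step 2: execute the generated query against the graph.
    rows = run_cypher(cypher)

    # Step 3: hand the rows back to the LLM to phrase the final answer.
    answer_prompt = (
        f"Question: {question}\n"
        f"Query results: {rows}\n"
        "Answer the question using only these results."
    )
    return generate(answer_prompt).strip()
```

Passing the two callables in as arguments keeps the pipeline testable with stubs; in practice it is also worth validating or restricting the generated Cypher (for example, to read-only queries) before running it.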

4. Player Similarity by Image (Query-by-Example)

Find players who look similar using an image-based search:

  1. Upload an Image: Provide a photo of a player.
  2. Embedding Extraction: The application calculates a vector embedding of the uploaded image using a pre-trained EfficientNet-B0 model.
  3. Similarity Search: It then queries the Neo4j database to find the top 5 players whose pre-calculated image embeddings have the highest Cosine Similarity to the query image's embedding, powered by the GDS library.
  4. Visual Results: The results are displayed showing the similar players' photos, names, and similarity scores.
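
The ranking in step 3 runs inside Neo4j, but the underlying computation is easy to state in plain Python. The sketch below is an equivalent in-memory version for illustration; the function names and the dictionary layout (player name mapped to embedding vector) are assumptions, not the project's data format.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(query_embedding, player_embeddings, k=5):
    """Return the k (name, score) pairs with the highest cosine similarity."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in player_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Cosine similarity ignores vector length and compares direction only, which suits image embeddings whose magnitudes can vary between photos.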

Setup and Installation

Follow these steps to set up the project locally.

1. Python Environment

First, create and activate a Python virtual environment:

# Create the virtual environment
python -m venv venv

# Activate it (on Windows)
.\venv\Scripts\activate

# Activate it (on macOS/Linux)
source venv/bin/activate

Then, install the required dependencies:

pip install -r requirements.txt

2. Neo4j Desktop Setup

You'll need Neo4j Desktop with an active DBMS instance (Enterprise Edition is recommended for using GDS).

  1. Create a local DBMS

  2. Configure Settings:

    • Open the settings for your DBMS.
    • Add the following line to enable Neosemantics (n10s) RDF procedures:
      dbms.unmanaged_extension_classes=n10s.endpoint=/rdf
    • Add the following line to grant unrestricted access to GDS, APOC, and n10s procedures:
      dbms.security.procedures.unrestricted=jwt.security.*,n10s.*,apoc.*,gds.*
  3. Install Plugins:

    • In your DBMS view, click Open folder -> Plugins.
    • Download the JAR files for APOC, Neosemantics (n10s), and Graph Data Science (GDS) compatible with your Neo4j version.
    • Place the downloaded .jar files into this plugins folder.
  4. Configure APOC:

    • Go back and click Open folder -> Configuration.
    • Create a file named apoc.conf (if it doesn't exist).
    • Add the following lines to the file to enable file import/export capabilities:
      apoc.import.file.enabled=true
      apoc.export.file.enabled=true
  5. Restart the DBMS for the changes to take effect.

3. One-Time GDS Graph Projection

Before running the similarity algorithms, you need to project your graph into an in-memory format optimized for GDS. Run the projection once per DBMS start: it lives in memory, so it is lost when the DBMS restarts and must then be recreated.

Open the Neo4j Browser and run the following Cypher query:

CALL gds.graph.project(
    'fifa_graph',   // The name we give to the in-memory graph
    '*',            // Use all node labels
    '*'             // Use all relationship types
)
YIELD graphName, nodeCount, relationshipCount

This command creates an in-memory graph named fifa_graph, which will be used by the Jaccard similarity functions.
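
If you would rather trigger the projection from Python, a helper along the following lines could make the step safe to re-run. It is a sketch, not part of the repository: `ensure_projection` is a hypothetical name, and `session` is assumed to behave like a `neo4j.Session` from the official Python driver.

```python
def ensure_projection(session, graph_name="fifa_graph"):
    """Project the whole graph into the GDS catalog unless it already exists.

    Returns True if a new projection was created, False if one was found.
    """
    # gds.graph.exists reports whether the named graph is in the catalog.
    record = session.run(
        "CALL gds.graph.exists($name) YIELD exists RETURN exists",
        name=graph_name,
    ).single()
    if record["exists"]:
        return False

    # Same projection as the Cypher above: all labels, all relationship types.
    session.run(
        "CALL gds.graph.project($name, '*', '*') "
        "YIELD graphName, nodeCount, relationshipCount "
        "RETURN graphName",
        name=graph_name,
    ).consume()
    return True
```

Calling this once at application startup avoids the "graph already exists" error that a second plain `gds.graph.project` call would raise.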

4. Environment Variables

Create a .env file in the root of the project directory and add your credentials:

NEO4J_URI="bolt://localhost:7687"
NEO4J_USER="neo4j"
NEO4J_PASSWORD="your_password"
NEO4J_DATABASE="your_database_name"
GOOGLE_API_KEY="your_google_api_key"
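
A small check like the one below can catch a missing variable before the app tries to connect. It is only a sketch: `load_settings` is a hypothetical helper, and it assumes the values from .env have already been loaded into the process environment (for example by a dotenv loader), which this README does not specify.

```python
import os

REQUIRED_SETTINGS = (
    "NEO4J_URI",
    "NEO4J_USER",
    "NEO4J_PASSWORD",
    "NEO4J_DATABASE",
    "GOOGLE_API_KEY",
)


def load_settings(env=None):
    """Return the required settings as a dict, or raise if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SETTINGS}
```

Failing fast with the list of missing names is usually easier to debug than a connection error raised later by the driver or the LLM client.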

Running the Application

Once the setup is complete, launch the Streamlit app with the following command:

streamlit run app.py

Project Structure

.
├── 📂 data_preprocessing/       # Notebooks for preprocessing and merging datasets
│   ├── data_preprocessing_competition_stats.ipynb
│   ├── data_preprocessing_player_stats.ipynb
│   ├── data_preprocessing.ipynb
│   └── 📂 data/
│       ├── competition_stats.csv
│       ├── player_stats.csv
│       └── 📂 world_cup/
│           ├── world_cup_complete.csv
│           └── 📂 processed/
│               ├── class_mappings.csv
│               └── object_properties_mappings.csv
│
├── 📂 evaluation/
│   ├── 📂 images/              # Contains 5 images for each of the 32 players for evaluation
│   └── 📂 models/              # Assets for KGE model training and evaluation
│       ├── create_ground_truth.py        # Script to create the ground truth for semantic sim. evaluation
│       ├── fifa_triplets.tsv             # Triplets used for KGE model training
│       ├── ground_truth_images.csv       # Ground truth for image search evaluation
│       ├── image_sim_evaluation.py       # Evaluation script for image semantic similarity
│       ├── semantic_evaluation.py        # Evaluation script for entity semantic similarity
│       └── semantic_ground_truth.csv     # Ground truth for entity semantic similarity
│
├── 📂 models/
│   ├── complex_fifa.pt         # Trained ComplEx model artifact
│   └── transe_fifa.pt          # Trained TransE model artifact
│
├── 🖼️ player_images/            # Contains 10 images for players from each team in every group
│
├── app.py                      # Main Streamlit application file
├── fifa_triplets.tsv           # Triplets for KGE model training (used by app.py)
├── fifa_wc_ontology.ttl        # Domain ontology created with Protégé
├── image_embedder.py           # Script to calculate image embeddings
├── image_embedding_loader.py   # Script to load embeddings as properties for Player nodes
├── image_embeddings_avg.json   # Calculated average player embeddings
├── kg_loader.py                # Script for Ontology-aware KG creation in Neo4j
├── requirements.txt            # Python dependencies for the virtual environment
└── train_kge.py                # Script for training the KGE models
