
Semantics Platform

Team 12: Semantic Seekers.

Description

Semantics Platform is a system for automated collection, NLP processing, and full-text search of news articles. It targets the analysis of Azerbaijani-language news, with a focus on identifying risks and events.

Key Features

  • Automated crawling: collects articles from news websites
  • Named Entity Recognition (NER): extracts persons, organizations, locations, dates, and monetary amounts
  • Risk analysis: classifies risks such as corruption, fraud, money laundering, sanctions, and bankruptcy
  • Event detection: detects appointments, resignations, court cases, deals, and sanctions
  • Full-text search: PostgreSQL tsvector search with filtering by entity/risk/event type
  • REST API: FastAPI endpoints with Swagger UI documentation
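The tsvector-based search can be illustrated with a small sketch. The table and column names (comp.news_data_processed.search_vector, comp.news_data.language) come from the schema later in this README, but the exact SQL the service executes is an assumption; the text-search configuration ('simple') and ranking function are illustrative.

```python
# Sketch of the kind of full-text query the API likely issues.
# Table/column names follow the schema in this README; the precise
# statement used by the service is an assumption.

def build_search_sql(with_language: bool = False) -> str:
    """Build a parameterized tsvector search statement (%(name)s placeholders)."""
    sql = (
        "SELECT p.id, p.news_title, "
        "ts_rank(p.search_vector, plainto_tsquery('simple', %(query)s)) AS similarity "
        "FROM comp.news_data_processed p "
        "JOIN comp.news_data n ON n.id = p.news_data_id "
        "WHERE p.search_vector @@ plainto_tsquery('simple', %(query)s)"
    )
    if with_language:
        sql += " AND n.language = %(language)s"
    sql += " ORDER BY similarity DESC LIMIT %(limit)s OFFSET %(offset)s"
    return sql
```

The same statement serves both GET and POST search paths; only the parameter binding differs.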

Technology Stack

  • Infrastructure: Docker Compose, PostgreSQL 16, RabbitMQ
  • Backend: Python 3.12+, FastAPI, SQLAlchemy, Alembic
  • NLP: Hugging Face Transformers (multilingual BERT for NER)
  • Package Manager: uv

Architecture

flowchart TB
    subgraph Docker["Docker Compose"]
        subgraph Infra["Infrastructure"]
            PG[(PostgreSQL<br/>:5432)]
            RMQ[RabbitMQ<br/>:5672/:15672]
        end

        subgraph Apps["Services"]
            CR[Crawlers]
            NLP[NLP Worker<br/>consumer]
            API[FastAPI<br/>:8000]
        end
    end

    WEB((News<br/>Websites<br/>report.az, trend.az)) --> CR
    CR -->|INSERT raw data| PG
    CR -->|publish news_id| RMQ
    RMQ -->|consume| NLP
    NLP -->|INSERT processed| PG
    NLP -->|UPDATE status| PG
    API -->|SELECT + FTS| PG
    USER((User)) -->|search| API

Data Flow

sequenceDiagram
    participant W as Website
    participant C as Crawler
    participant DB as PostgreSQL
    participant Q as RabbitMQ
    participant N as NLP Worker
    participant A as API
    participant U as User

    C->>W: Scrape articles
    W-->>C: HTML content
    C->>DB: INSERT into news_data
    C->>Q: Publish news_id
    Q-->>N: Consume message
    N->>N: Extract entities & risks
    N->>DB: INSERT into news_data_processed
    N->>DB: INSERT into news_data_subjects
    N->>DB: INSERT into news_data_risks
    N->>DB: UPDATE news_data status
    U->>A: GET /search?query=...
    A->>DB: Fulltext search (tsvector)
    DB-->>A: Results with entities & risks
    A-->>U: JSON response
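The crawler-to-worker handoff above can be sketched as a minimal consumer callback. The wire format (a JSON body carrying news_id) is an assumption based on the diagram; the actual payload the crawler publishes is not specified in this README.

```python
import json

def handle_queue_message(body: bytes) -> str:
    """Decode one queue message and return the news_id to process.

    Assumes the crawler publishes a JSON body like {"news_id": "<uuid>"};
    the real payload format may differ.
    """
    payload = json.loads(body.decode("utf-8"))
    news_id = payload.get("news_id")
    if not isinstance(news_id, str) or not news_id:
        raise ValueError("message is missing a usable news_id")
    return news_id
```

In the actual worker, a function like this would be registered as the RabbitMQ consumer callback (e.g. via pika's basic_consume) before the NLP pipeline loads the article by id and writes the processed rows.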

Database Schema

erDiagram
    crawler_checkpoint {
        int8 id PK
        int8 inserted_count
        timestamptz timestamp
        text website
    }
    
    dict_entity {
        varchar(100) entity_group PK
        text name
    }

    dict_event_type {
        varchar(50) event_type PK
        text name
    }

    dict_risk {
        varchar(100) risk_group PK
        text name
    }

    news_data {
        uuid id PK
        uuid hash
        text url
        text title
        varchar(10) language
        text text_content
        text summary
        timestamptz published_date
        timestamptz modified_date
        text author
        text category
        jsonb tags
        jsonb meta_keywords
        text og_type
        timestamptz crawled_at
        varchar(20) status
        timestamptz loading_date
        timestamptz processing_date
        timestamptz uploading_date
    }

    news_data_processed {
        uuid id PK
        uuid news_data_id FK
        text news_title
        text news_text_content
        text news_summary
        jsonb news_tags
        jsonb news_meta_keywords
        varchar(100) model
        varchar(50) model_version
        varchar(20) status
        varchar(20) error_code
        text error_text
        timestamptz created_at
        tsvector search_vector
    }

    news_data_events {
        uuid id PK
        uuid news_data_processed_id FK
        varchar(50) event_type FK
        text person
        text organization
        text position
        jsonb details
        float8 confidence
    }

    news_data_risks {
        uuid id PK
        uuid news_data_processed_id FK
        varchar(100) risk_group FK
        float8 score
    }

    news_data_subjects {
        uuid id PK
        uuid news_data_processed_id FK
        varchar(100) entity_group FK
        text word
        int4 start
        int4 end
        float8 score
    }

    news_data_processed ||--|| news_data : "1:1"
    news_data_processed ||--o{ news_data_events : "1:N"
    news_data_processed ||--o{ news_data_subjects : "1:N"
    news_data_processed ||--o{ news_data_risks : "1:N"
    dict_risk ||--o{ news_data_risks : "1:N"
    dict_event_type ||--o{ news_data_events : "1:N"
    dict_entity ||--o{ news_data_subjects : "1:N"

Quick Start

# Clone with submodules
git clone --recurse-submodules git@github.com:semantic-seekers/main-app.git

# Or if already cloned, initialize submodules
git submodule update --init --recursive

# Copy environment file
cp env.example .env

# Start all services
docker compose up -d

# View logs
docker compose logs -f

# Stop services
docker compose down

Services

| Service | Port | Description |
|---|---|---|
| PostgreSQL | 5432 | Database with pg_trgm extension |
| RabbitMQ | 5672, 15672 | Message broker (15672 = management UI) |
| Crawler | - | News scraper (runs via cron) |
| NLP Worker | - | Text processing worker |
| API | 8000 | FastAPI search endpoint |

API Endpoints

Base URL

http://localhost:8000

Endpoints

| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Health check |
| GET | /docs | Swagger UI documentation |
| GET | /search | Full-text search (query params) |
| POST | /search | Full-text search (JSON body) |
| GET | /entities | List available entity types |
| GET | /risks | List available risk types |
| GET | /event-types | List available event types |
| GET | /stats | Database statistics |

Search Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| query | string | yes | Search query (1-500 chars) |
| limit | int | no | Results per page (1-100, default: 20) |
| offset | int | no | Pagination offset (default: 0) |
| language | string | no | Filter by language (az, ru, en) |
| entity_group | string | no | Filter by entity type (LABEL_0..LABEL_24) |
| risk_group | string | no | Filter by risk type (LABEL_0..LABEL_7) |
| event_type | string | no | Filter by event type |
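These constraints can be enforced client-side before calling the API. A minimal sketch: the ranges mirror the table above, while the function name and the choice of which parameters to include are illustrative.

```python
from typing import Optional

def build_search_params(query: str, limit: int = 20, offset: int = 0,
                        language: Optional[str] = None,
                        risk_group: Optional[str] = None) -> dict:
    """Validate search parameters against the documented ranges."""
    if not 1 <= len(query) <= 500:
        raise ValueError("query must be 1-500 characters")
    if not 1 <= limit <= 100:
        raise ValueError("limit must be in 1..100")
    if offset < 0:
        raise ValueError("offset must be >= 0")
    if language is not None and language not in {"az", "ru", "en"}:
        raise ValueError("language must be one of: az, ru, en")
    params = {"query": query, "limit": limit, "offset": offset}
    if language:
        params["language"] = language
    if risk_group:
        params["risk_group"] = risk_group
    return params
```

The resulting dict can be passed directly as query parameters to GET /search, or serialized as the JSON body for POST /search.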

Examples

# Health check
curl "http://localhost:8000/health"

# Simple search (English)
curl "http://localhost:8000/search?query=Azerbaijan&limit=5"

# Search with language filter (Russian)
curl "http://localhost:8000/search?query=банк&language=ru&limit=10"

# Search with risk filter (fraud)
curl "http://localhost:8000/search?query=money&risk_group=LABEL_2&limit=5"

# Search with entity filter (persons)
curl "http://localhost:8000/search?query=president&entity_group=LABEL_1&limit=5"

# Combined filters
curl "http://localhost:8000/search?query=investment&language=en&risk_group=LABEL_2&limit=10"

# POST search with JSON body
curl -X POST "http://localhost:8000/search" \
  -H "Content-Type: application/json" \
  -d '{"query": "corruption", "limit": 10, "risk_group": "LABEL_0"}'

# Get statistics
curl "http://localhost:8000/stats"

# List entity types
curl "http://localhost:8000/entities"

# List risk types
curl "http://localhost:8000/risks"

# List event types
curl "http://localhost:8000/event-types"

# Search with event type filter (deals)
curl "http://localhost:8000/search?query=contract&event_type=deal&limit=5"

# Format JSON output (requires jq)
curl -s "http://localhost:8000/search?query=bank&limit=2" | jq

Risk Types

| risk_group | Name | Description |
|---|---|---|
| LABEL_0 | corruption and bribery | Corruption and bribery |
| LABEL_1 | money laundering | Money laundering |
| LABEL_2 | fraud | Fraud |
| LABEL_3 | sanctions and fines | Sanctions and fines |
| LABEL_4 | litigation | Litigation |
| LABEL_5 | bankruptcy | Bankruptcy |
| LABEL_6 | organized crime | Ties to organized crime |
| LABEL_7 | conflict of interest | Conflict of interest |
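Because the API returns opaque LABEL_* codes, clients may want a local lookup. The mapping below is copied from the table above; the /risks endpoint returns the authoritative set, so treat this as a convenience sketch.

```python
# Risk labels as listed in this README; GET /risks is the source of truth.
RISK_NAMES = {
    "LABEL_0": "corruption and bribery",
    "LABEL_1": "money laundering",
    "LABEL_2": "fraud",
    "LABEL_3": "sanctions and fines",
    "LABEL_4": "litigation",
    "LABEL_5": "bankruptcy",
    "LABEL_6": "organized crime",
    "LABEL_7": "conflict of interest",
}

def risk_name(risk_group: str) -> str:
    """Map a risk_group code to its human-readable name, falling back to the code."""
    return RISK_NAMES.get(risk_group, risk_group)
```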

Entity Types (common)

| entity_group | Name | Description |
|---|---|---|
| LABEL_1 | PERSON | Names of individuals |
| LABEL_2 | LOCATION | Geographical locations |
| LABEL_3 | ORGANISATION | Companies, institutions |
| LABEL_4 | DATE | Dates or periods |
| LABEL_6 | MONEY | Monetary values |
| LABEL_14 | GPE | Countries, cities |
| LABEL_9 | PRODUCT | Products and goods |

Event Types

| event_type | Description |
|---|---|
| appointment | Appointment to a position |
| resignation | Resignation |
| court_case | Court case |
| deal | Deal / contract |
| sanction | Sanctions |

Response Example

{
  "query": "test",
  "total": 15,
  "limit": 20,
  "offset": 0,
  "results": [
    {
      "id": "uuid",
      "news_data_id": "uuid",
      "title": "Article title",
      "text_content": "Article text (truncated to 500 chars)...",
      "summary": "Article summary",
      "url": "https://example.com/article",
      "language": "ru",
      "published_date": "2025-12-20T12:00:00Z",
      "similarity": 0.85,
      "entities": [
        {
          "entity_group": "LABEL_1",
          "word": "Person Name",
          "score": 0.95,
          "start": 10,
          "end": 21
        }
      ],
      "risks": [
        {
          "risk_group": "LABEL_0",
          "risk_name": "corruption and bribery",
          "score": 0.8
        }
      ],
      "events": [
        {
          "event_type": "deal",
          "event_name": "Сделка/Контракт",
          "person": null,
          "organization": "Siemens AG",
          "position": null,
          "event_date": "2024-05-20",
          "amount": 250000000,
          "currency": "EUR",
          "confidence": 0.7
        }
      ],
      "model": "azerbaijani-nlp",
      "model_version": "1.0.0",
      "processed_at": "2025-12-20T12:05:00Z"
    }
  ]
}
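A response like the one above can be reduced to a compact summary on the client. This sketch relies only on field names shown in the example (title, similarity, risks[].risk_name, risks[].score); the helper name is illustrative.

```python
def summarize_results(response: dict) -> list:
    """Flatten a /search response into (title, top_risk_name, similarity) tuples."""
    rows = []
    for item in response.get("results", []):
        risks = item.get("risks", [])
        # Pick the highest-scoring risk, if the article has any.
        top = max(risks, key=lambda r: r["score"])["risk_name"] if risks else None
        rows.append((item["title"], top, item["similarity"]))
    return rows
```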

Database Tables

  • comp.news_data - Raw news articles
  • comp.news_data_processed - Processed articles with NLP results
  • comp.news_data_subjects - Extracted entities (NER)
  • comp.news_data_risks - Identified risks
  • comp.news_data_events - Extracted events (appointments, deals, etc.)
  • comp.dict_entity - Entity type dictionary
  • comp.dict_risk - Risk type dictionary
  • comp.dict_event_type - Event type dictionary
  • comp.crawler_checkpoint - Crawler run history

Running Migrations

# Enter crawler container
docker compose exec crawler bash

# Run migrations
.venv/bin/alembic upgrade head

Development

Project Structure

semantics/
├── docker-compose.yaml    # Main orchestration
├── env.example            # Environment template
├── pg/                    # PostgreSQL init scripts
├── crawler-report/        # News crawler report.az (submodule)
├── crawler-trend/         # News crawler trend.az (submodule)
├── nlp-worker/            # NLP processing service
├── api/                   # FastAPI search service
└── NLP.-Events-entities-extraction/  # Original NLP notebooks (submodule)

Git Submodules

# Clone repository with all submodules
git clone --recurse-submodules <repository-url>

# Initialize submodules (if already cloned)
git submodule update --init --recursive

# Update all submodules to latest
git submodule update --remote --merge

# Add a new submodule
git submodule add <submodule-repo-url> <path>

# Check submodule status
git submodule status

Building Individual Services

# Build specific service
docker compose build crawler

# Rebuild all
docker compose build

# Rebuild without cache
docker compose build --no-cache

Configuration

See env.example for all available environment variables.

License

Private project for MIPT.
