Semantics Platform is a platform for automatic collection, NLP processing, and full-text search of news articles. The system targets the analysis of Azerbaijani-language news, with an emphasis on detecting risks and events.
NLP: Hugging Face Transformers (multilingual BERT для NER)
Package Manager: uv
Architecture
flowchart TB
subgraph Docker["Docker Compose"]
subgraph Infra["Infrastructure"]
PG[(PostgreSQL<br/>:5432)]
RMQ[RabbitMQ<br/>:5672/:15672]
end
subgraph Apps["Services"]
CR[Crawlers]
NLP[NLP Worker<br/>consumer]
API[FastAPI<br/>:8000]
end
end
WEB((News<br/>Websites<br/>report.az, trend.az)) --> CR
CR -->|INSERT raw data| PG
CR -->|publish news_id| RMQ
RMQ -->|consume| NLP
NLP -->|INSERT processed| PG
NLP -->|UPDATE status| PG
API -->|SELECT + FTS| PG
USER((User)) -->|search| API
Data Flow
sequenceDiagram
participant W as Website
participant C as Crawler
participant DB as PostgreSQL
participant Q as RabbitMQ
participant N as NLP Worker
participant A as API
participant U as User
C->>W: Scrape articles
W-->>C: HTML content
C->>DB: INSERT into news_data
C->>Q: Publish news_id
Q-->>N: Consume message
N->>N: Extract entities & risks
N->>DB: INSERT into news_data_processed
N->>DB: INSERT into news_data_subjects
N->>DB: INSERT into news_data_risks
N->>DB: UPDATE news_data status
U->>A: GET /search?query=...
A->>DB: Fulltext search (tsvector)
DB-->>A: Results with entities & risks
A-->>U: JSON response
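The hand-off between the crawler and the NLP worker in the sequence above can be approximated in memory. This is a minimal sketch, not the project's code: `queue.Queue` stands in for RabbitMQ, plain dicts stand in for the `news_data` and `news_data_processed` tables, and the function names and the status values `"new"` and `"processed"` are assumptions made for illustration.

```python
import queue
import uuid

# In-memory stand-ins (assumptions): dicts replace the PostgreSQL tables,
# queue.Queue replaces the RabbitMQ queue.
news_data = {}
news_data_processed = {}
broker = queue.Queue()


def crawl(url, title, text):
    """Crawler step: store the raw article, then publish only its id."""
    news_id = str(uuid.uuid4())
    news_data[news_id] = {
        "url": url,
        "title": title,
        "text_content": text,
        "status": "new",  # hypothetical status value
    }
    broker.put(news_id)
    return news_id


def process_next():
    """Worker step: consume one id, write a processed row, update the status."""
    news_id = broker.get()
    raw = news_data[news_id]
    processed_id = str(uuid.uuid4())
    news_data_processed[processed_id] = {
        "news_data_id": news_id,
        "news_title": raw["title"],
        "news_text_content": raw["text_content"],
    }
    raw["status"] = "processed"  # hypothetical status value
    broker.task_done()
    return processed_id


article_id = crawl("https://example.com/a1", "Sample title", "Sample body")
row_id = process_next()
print(news_data[article_id]["status"])
```

Publishing only the `news_id` keeps queue messages small; the worker reads the full article from the database when it picks the message up.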
Database Schema
erDiagram
crawler_checkpoint {
int8 id PK
int8 inserted_count
timestamptz timestamp
text website
}
dict_entity {
varchar(100) entity_group PK
text name
}
dict_event_type {
varchar(50) event_type PK
text name
}
dict_risk {
varchar(100) risk_group PK
text name
}
news_data {
uuid id PK
uuid hash
text url
text title
varchar(10) language
text text_content
text summary
timestamptz published_date
timestamptz modified_date
text author
text category
jsonb tags
jsonb meta_keywords
text og_type
timestamptz crawled_at
varchar(20) status
timestamptz loading_date
timestamptz processing_date
timestamptz uploading_date
}
news_data_processed {
uuid id PK
uuid news_data_id FK
text news_title
text news_text_content
text news_summary
jsonb news_tags
jsonb news_meta_keywords
varchar(100) model
varchar(50) model_version
varchar(20) status
varchar(20) error_code
text error_text
timestamptz created_at
tsvector search_vector
}
news_data_events {
uuid id PK
uuid news_data_processed_id FK
varchar(50) event_type FK
text person
text organization
text position
jsonb details
float8 confidence
}
news_data_risks {
uuid id PK
uuid news_data_processed_id FK
varchar(100) risk_group FK
float8 score
}
news_data_subjects {
uuid id PK
uuid news_data_processed_id FK
varchar(100) entity_group FK
text word
int4 start
int4 end
float8 score
}
news_data_processed ||--|| news_data : "1:1"
news_data_processed ||--o{ news_data_events : "1:N"
news_data_processed ||--o{ news_data_subjects : "1:N"
news_data_processed ||--o{ news_data_risks : "1:N"
dict_risk ||--o{ news_data_risks : "1:N"
dict_event_type ||--o{ news_data_events : "1:N"
dict_entity ||--o{ news_data_subjects : "1:N"
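To see how the dictionary tables relate to the per-article tables, the following sketch recreates three of them and joins a processed article to its risk labels. It is an approximation under stated assumptions: SQLite (from the Python standard library) stands in for PostgreSQL, column types are simplified (`uuid` becomes `TEXT`, `float8` becomes `REAL`), only a subset of columns is included, and the inserted rows are placeholder data.

```python
import sqlite3
import uuid

# SQLite is a stand-in for PostgreSQL here; types are simplified.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dict_risk (risk_group TEXT PRIMARY KEY, name TEXT);
CREATE TABLE news_data_processed (
    id TEXT PRIMARY KEY,
    news_data_id TEXT,
    news_title TEXT
);
CREATE TABLE news_data_risks (
    id TEXT PRIMARY KEY,
    news_data_processed_id TEXT REFERENCES news_data_processed(id),
    risk_group TEXT REFERENCES dict_risk(risk_group),
    score REAL
);
""")

# Placeholder rows, not real project data.
processed_id = str(uuid.uuid4())
conn.execute("INSERT INTO dict_risk VALUES (?, ?)",
             ("LABEL_0", "placeholder risk name"))
conn.execute("INSERT INTO news_data_processed VALUES (?, ?, ?)",
             (processed_id, str(uuid.uuid4()), "Sample title"))
conn.execute("INSERT INTO news_data_risks VALUES (?, ?, ?, ?)",
             (str(uuid.uuid4()), processed_id, "LABEL_0", 0.9))

# One processed article can carry many risk rows (1:N); each risk row
# resolves its label through dict_risk.
rows = conn.execute("""
SELECT p.news_title, d.name, r.score
FROM news_data_processed AS p
JOIN news_data_risks AS r ON r.news_data_processed_id = p.id
JOIN dict_risk AS d ON d.risk_group = r.risk_group
""").fetchall()
print(rows)
```

The same join shape applies to `news_data_subjects` with `dict_entity` and to `news_data_events` with `dict_event_type`.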
Quick Start
# Clone with submodules
git clone --recurse-submodules git@github.com:semantic-seekers/main-app.git
# Or if already cloned, initialize submodules
git submodule update --init --recursive
# Copy environment file
cp env.example .env
# Start all services
docker compose up -d
# View logs
docker compose logs -f
# Stop services
docker compose down
Services
| Service | Port | Description |
| --- | --- | --- |
| PostgreSQL | 5432 | Database with pg_trgm extension |
| RabbitMQ | 5672, 15672 | Message broker (15672 = management UI) |
| Crawler | - | News scraper (runs via cron) |
| NLP Worker | - | Text processing worker |
| API | 8000 | FastAPI search endpoint |
API Endpoints
Base URL
http://localhost:8000
Endpoints
| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Health check |
| GET | /docs | Swagger UI documentation |
| GET | /search | Fulltext search (query params) |
| POST | /search | Fulltext search (JSON body) |
| GET | /entities | List available entity types |
| GET | /risks | List available risk types |
| GET | /event-types | List available event types |
| GET | /stats | Database statistics |
Search Parameters
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| query | string | ✅ | Search query (1-500 chars) |
| limit | int | ❌ | Results per page (1-100, default: 20) |
| offset | int | ❌ | Pagination offset (default: 0) |
| language | string | ❌ | Filter by language (az, ru, en) |
| entity_group | string | ❌ | Filter by entity type (LABEL_0..LABEL_24) |
| risk_group | string | ❌ | Filter by risk type (LABEL_0..LABEL_7) |
| event_type | string | ❌ | Filter by event type |
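The documented ranges can be checked on the client side before a request is sent. The helper below is a hypothetical sketch, not part of the API: `build_search_params` and `ALLOWED_LANGUAGES` are names invented for illustration, and the non-negative check on `offset` is an assumption, since the table above only states its default.

```python
ALLOWED_LANGUAGES = {"az", "ru", "en"}


def build_search_params(query, limit=20, offset=0, language=None,
                        entity_group=None, risk_group=None, event_type=None):
    """Validate values against the documented ranges and return the set params."""
    if not 1 <= len(query) <= 500:
        raise ValueError("query must be 1-500 characters")
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    if offset < 0:  # assumption: the docs only give the default (0)
        raise ValueError("offset must be non-negative")
    if language is not None and language not in ALLOWED_LANGUAGES:
        raise ValueError("language must be one of az, ru, en")

    params = {"query": query, "limit": limit, "offset": offset}
    optional = {
        "language": language,
        "entity_group": entity_group,
        "risk_group": risk_group,
        "event_type": event_type,
    }
    # Only include optional filters that were actually provided.
    params.update({k: v for k, v in optional.items() if v is not None})
    return params


print(build_search_params("bank", limit=5, language="ru"))
```

The server remains the authority on validation; a helper like this only surfaces obvious mistakes earlier.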
Examples
# Health check
curl "http://localhost:8000/health"
# Simple search (English)
curl "http://localhost:8000/search?query=Azerbaijan&limit=5"
# Search with language filter (Russian)
curl "http://localhost:8000/search?query=банк&language=ru&limit=10"
# Search with risk filter (fraud)
curl "http://localhost:8000/search?query=money&risk_group=LABEL_2&limit=5"
# Search with entity filter (persons)
curl "http://localhost:8000/search?query=president&entity_group=LABEL_1&limit=5"
# Combined filters
curl "http://localhost:8000/search?query=investment&language=en&risk_group=LABEL_2&limit=10"
# POST search with JSON body
curl -X POST "http://localhost:8000/search" \
-H "Content-Type: application/json" \
-d '{"query": "corruption", "limit": 10, "risk_group": "LABEL_0"}'
# Get statistics
curl "http://localhost:8000/stats"
# List entity types
curl "http://localhost:8000/entities"
# List risk types
curl "http://localhost:8000/risks"
# List event types
curl "http://localhost:8000/event-types"
# Search with event type filter (deals)
curl "http://localhost:8000/search?query=contract&event_type=deal&limit=5"
# Format JSON output (requires jq)
curl -s "http://localhost:8000/search?query=bank&limit=2" | jq
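The same requests can be prepared from Python using only the standard library. This is a sketch under stated assumptions: `search_url` and `search_post_request` are hypothetical helper names, and nothing is sent over the network here; the code only builds the URL and the request object. `urlencode` percent-encodes non-ASCII queries such as `банк`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"


def search_url(**params):
    """Return a GET /search URL with percent-encoded query parameters."""
    return f"{BASE_URL}/search?{urlencode(params)}"


def search_post_request(payload):
    """Return an unsent POST /search request carrying a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        f"{BASE_URL}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


url = search_url(query="банк", language="ru", limit=10)
req = search_post_request(
    {"query": "corruption", "limit": 10, "risk_group": "LABEL_0"}
)
print(url)
print(req.get_method(), req.full_url)
```

With the services running, either value can be passed to `urllib.request.urlopen` to perform the call.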