@abkrim abkrim commented Aug 12, 2025

Objective

Provide native idempotent creates (dedupe) via Elasticsearch op_type=create so concurrent writers do not introduce duplicates when targeting the same document.

Motivation

  • In distributed environments, multiple nodes may attempt to write the same document at the same time.
  • Elasticsearch supports op_type=create to fail fast when a document with the same _id already exists. This PR exposes that behavior idiomatically via Eloquent-like APIs.

Changes

  • Query DSL
    • Add DslFactory::createOperation() for bulk action create.
    • Grammar::compileInsert() detects op_type from per-document fields (_op_type/op_type) or the builder option insert_op_type and emits create instead of index.
  • Eloquent Builder
    • createOnly(): forces op_type=create for inserts performed by the builder.
    • createOrFail(array $attributes): shortcut that applies createOnly() and fails on duplicate _id.
    • withRefresh(true|false|'wait_for'): explicit control over refresh.
  • Response processing
    • Processor::processBulkInsert() now handles both index and create actions consistently.
  • Exceptions
    • BulkInsertQueryException:
      • Formats errors for create as well as index.
      • Returns HTTP code 409 when version_conflict_engine_exception is detected (duplicate id), otherwise 400.
  • Minor
    • DslBuilder::setRefresh(bool|string) accepts 'wait_for' for refresh.
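The op_type detection described above can be sketched roughly as follows. This is a standalone illustration, not the library's Grammar::compileInsert(): the function name buildBulkBody, the precedence of the per-document field over the builder-level default, and the removal of id from the source are all assumptions made for the example.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for the op_type detection; NOT the library's code.
 * Returns the flat list of action/source pairs the Elasticsearch Bulk API expects.
 */
function buildBulkBody(string $index, array $documents, ?string $defaultOpType = null): array
{
    $body = [];

    foreach ($documents as $doc) {
        // Assumption: a per-document field wins over the builder-level default.
        $opType = $doc['_op_type'] ?? $doc['op_type'] ?? $defaultOpType ?? 'index';
        $action = $opType === 'create' ? 'create' : 'index';

        // Control fields must not leak into the stored document source.
        unset($doc['_op_type'], $doc['op_type']);

        $meta = ['_index' => $index];
        if (isset($doc['id'])) {
            // Assumption: the model id is moved into the bulk metadata as _id.
            $meta['_id'] = (string) $doc['id'];
            unset($doc['id']);
        }

        $body[] = [$action => $meta]; // action line: "create" or "index"
        $body[] = $doc;               // source line
    }

    return $body;
}

$body = buildBulkBody('datasets', [
    ['id' => 'a', '_op_type' => 'create', 'name' => 'Doc Create'],
    ['id' => 'b', 'name' => 'Doc Index'],
]);
```

With a create action, Elasticsearch rejects the item if a document with that _id already exists; with index, it overwrites the existing document.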

Usage

  • Deterministic id with create-only
Model::query()
  ->createOnly()
  ->withRefresh('wait_for')
  ->create([
    'id' => 'dataset:check-1:2025-01-01T00:00:00Z',
    'name' => 'First Insert',
  ]);
  • Per-document op_type
Model::create([
  'id' => 'dataset:check-2:2025-01-01T00:00:00Z',
  '_op_type' => 'create',
  'name' => 'Doc Create',
]);
  • Duplicate behavior
    • Duplicate creates raise BulkInsertQueryException with code 409.

Tests

  • tests/CreateOpTypeTest.php
    • Verifies createOnly() creates the doc and rejects duplicates (409).
    • Verifies per-document _op_type=create and duplicate rejection.

Compatibility

  • No breaking changes to existing public APIs.
  • Note: consumers who rely on status codes should expect 409 for version conflict (duplicate) errors.
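For consumers who branch on the status code, the 409/400 distinction can be sketched as below. This is a minimal illustration of reading a Bulk API response, not the logic inside BulkInsertQueryException; the helper name bulkErrorCode is hypothetical, and the sample response is hand-written in the shape Elasticsearch returns for bulk requests.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper; NOT the library's implementation.
 * Returns 409 if any bulk item failed with a version conflict,
 * 400 if items failed for other reasons, and null if nothing failed.
 */
function bulkErrorCode(array $response): ?int
{
    if (empty($response['errors'])) {
        return null;
    }

    $code = null;
    foreach ($response['items'] ?? [] as $item) {
        // Each item is keyed by its action name: "create" or "index".
        $result = $item['create'] ?? $item['index'] ?? [];
        $type = $result['error']['type'] ?? null;

        if ($type === null) {
            continue; // this item succeeded
        }
        if ($type === 'version_conflict_engine_exception') {
            return 409; // duplicate _id under op_type=create
        }
        $code = 400;
    }

    return $code;
}

// Hand-written sample response for a rejected duplicate create.
$conflict = [
    'errors' => true,
    'items' => [
        ['create' => [
            '_id' => 'dataset:check-1:2025-01-01T00:00:00Z',
            'status' => 409,
            'error' => ['type' => 'version_conflict_engine_exception'],
        ]],
    ],
];

$status = bulkErrorCode($conflict);
```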

Reference

  • Base repo: pdphilip/laravel-elasticsearch

Best regards

@abkrim abkrim requested a review from pdphilip as a code owner August 12, 2025 06:23
@pdphilip

Hey @abkrim, please walk me through the real-world problem you face:

  1. Are you generating _ids in your Laravel App or is the elasticsearch engine doing it?
  2. Is your App distributed or ES or both?
  3. What is the source or event that triggers a write?

Then just take me through a simplified trace of events that leads to double entries of the same record.

Thanks

@abkrim

abkrim commented Aug 14, 2025

Hey! Thanks for asking for details. Let me walk you through the scenario:

1. ID Generation:
I generate the _ids myself in the Laravel app. I need full control over ID generation for idempotency.

2. Distribution:
Both the app and ES are distributed. The platform runs multiple Elasticsearch nodes, and regional app nodes write documents to them. I need control over the IDs to keep them consistent across regions.

3. Event Source:
Writes are triggered by distributed jobs that run on RabbitMQ services in different regions.

The Problem:
In principle there shouldn't be duplicates, since I generate UUID v7 identifiers, but I can't risk even one. When I reviewed the code I found no way to set op_type, which is why I proposed it: with op_type=create, Elasticsearch returns an error instead of overwriting an existing document. Catching that error lets me retry the write with a new UUID.

Scale Context:
We're talking about potentially 5,000-20,000 docs per minute from 10 different nodes.

Why This Matters:
The possibility is small, but these documents are very important and each one MUST be UNIQUE. I need something like a newOrFail option or similar behavior.

Simplified Event Trace:

  1. Regional job triggers document creation
  2. Laravel app generates UUID v7 for _id
  3. Document gets sent to ES cluster
  4. Risk point: If network issues cause retry logic or race conditions between regional nodes, there's a tiny chance of duplicate attempts
  5. Without op_type=create, ES would overwrite instead of failing
  6. I need the failure to detect this edge case and regenerate with a fresh UUID
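The trace above can be sketched as a small retry loop. This is only an illustration under stated assumptions: createWithFreshId is a hypothetical helper, the ID generator and the create-only write are injected as callables so the example runs without Elasticsearch, and a plain RuntimeException with code 409 stands in for the BulkInsertQueryException described in the PR.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: attempt a create-only write and, on a 409 conflict,
 * regenerate the id and try again. Any other failure is rethrown as-is.
 */
function createWithFreshId(
    callable $generateId,
    callable $createOnly,
    array $attributes,
    int $maxAttempts = 3
): string {
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $id = $generateId();

        try {
            $createOnly(['id' => $id] + $attributes);

            return $id; // write accepted, the id was unused
        } catch (RuntimeException $e) {
            if ($e->getCode() !== 409) {
                throw $e; // not a duplicate, do not retry blindly
            }
            // 409: the id already exists, loop and generate a new one.
        }
    }

    throw new RuntimeException("No unused id after {$maxAttempts} attempts", 409);
}

// In-memory stand-ins so the sketch runs without a cluster.
$store = ['id-1' => true];   // pretend "id-1" is already indexed
$ids = ['id-1', 'id-2'];     // the generator collides once, then succeeds

$generateId = function () use (&$ids): string {
    return array_shift($ids);
};

$createOnly = function (array $doc) use (&$store): void {
    if (isset($store[$doc['id']])) {
        throw new RuntimeException('version_conflict_engine_exception', 409);
    }
    $store[$doc['id']] = $doc;
};

$id = createWithFreshId($generateId, $createOnly, ['name' => 'First Insert']);
```

In application code the two callables would presumably wrap a UUID v7 generator and the createOnly()->create() call from this PR. Whether a 409 should trigger a fresh id or be treated as "already written" depends on whether the conflict came from a retry of the same job, so that decision is left to the caller.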

The feature would provide an essential safety net for critical document uniqueness in a high-throughput distributed environment.

Best regards.

@pdphilip pdphilip merged commit e2124c8 into pdphilip:main Aug 20, 2025
2 checks passed