Contributor

@hlolli commented Aug 27, 2025

Import missing data for existing transactions

Summary

This PR allows nodes to import transaction data when posting a transaction that already exists in the mempool but is missing its data. This addresses the edge case where a format 2 transaction is initially received through gossip (without data) but later submitted with inline data that should be imported.

Problem

Currently, when a transaction already exists in the mempool, any subsequent POST to /tx immediately returns HTTP 208 "Transaction already processed" without processing the request body. This prevents importing data for transactions that were initially received without data (e.g., through gossip).

For format 2 transactions, this creates a problem:

  1. Transaction is gossiped without data (normal behavior)
  2. Later, someone tries to POST the same transaction with inline data
  3. The data is rejected with 208, leaving the transaction without its data permanently

Solution

Implemented a two-tier ignore registry system that prevents spam amplification while allowing one-time data import:

  • Enhanced the ignore registry with add_with_data/1 and permanent_member_with_data/1 functions
  • Modified post_tx_parse_id/2 to check if data import should be allowed before returning 208
  • Added lightweight header-based checks to avoid I/O operations in the POST handler
  • Transactions can only be processed "with data" once, preventing repeated processing and spam
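
The two-tier idea can be sketched as follows. This is a minimal, purely functional illustration, not the code in this PR: the module name, the explicit `Registry` argument, and `should_process/3` are hypothetical, and the real `ar_ignore_registry` functions take only the TXID.

```erlang
-module(two_tier_registry_sketch).
-export([new/0, add/2, add_with_data/2, member/2, member_with_data/2,
		should_process/3]).

%% Hypothetical registry: tier 1 holds every TXID seen so far,
%% tier 2 holds TXIDs that were already processed together with data.
new() ->
	{sets:new(), sets:new()}.

%% Record a TXID in the first tier only.
add(TXID, {Seen, WithData}) ->
	{sets:add_element(TXID, Seen), WithData}.

%% Record a TXID in both tiers.
add_with_data(TXID, {Seen, WithData}) ->
	{sets:add_element(TXID, Seen), sets:add_element(TXID, WithData)}.

member(TXID, {Seen, _WithData}) ->
	sets:is_element(TXID, Seen).

member_with_data(TXID, {_Seen, WithData}) ->
	sets:is_element(TXID, WithData).

%% Decide whether a POST should be processed:
%% - an unknown TXID is always processed;
%% - a known TXID is processed again only if the request carries data
%%   and the TXID is not yet in the second tier.
should_process(TXID, HasData, Registry) ->
	case {member(TXID, Registry), HasData, member_with_data(TXID, Registry)} of
		{false, _, _} -> true;
		{true, true, false} -> true;
		_ -> false
	end.
```

Once `add_with_data/2` has recorded a TXID, `should_process/3` returns `false` for it regardless of the payload, which is what bounds reprocessing to one attempt per transaction.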

Changes

apps/arweave/src/ar_ignore_registry.erl

  • Added add_with_data/1 to mark transactions as processed with data
  • Added permanent_member_with_data/1 to check if transaction was already processed with data

apps/arweave/src/ar_http_iface_middleware.erl

  • Modified post_tx_parse_id/2 to call should_accept_tx_with_data/2 for existing transactions
  • Added should_accept_tx_with_data/2 to determine if data import should be allowed based on:
    • Whether transaction was already processed with data (two-tier registry check)
    • Whether the Content-Length header indicates a substantial data payload
  • Modified handle_post_tx_accepted/3 to mark format 2 transactions with data in the registry
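
A rough sketch of the header-based decision, under stated assumptions: the module name, the argument shapes, and the `MinBodySize` threshold are hypothetical (this description does not say what counts as "substantial"), and the real `should_accept_tx_with_data/2` has a different signature.

```erlang
-module(accept_with_data_sketch).
-export([should_accept/3]).

%% AlreadyWithData: result of the second-tier registry lookup.
%% ContentLength: raw header value as a binary, or undefined when absent.
%% MinBodySize: hypothetical threshold in bytes; not taken from the PR.
should_accept(true, _ContentLength, _MinBodySize) ->
	%% Already processed with data once: never accept again.
	false;
should_accept(false, undefined, _MinBodySize) ->
	%% No Content-Length header: nothing suggests an inline payload.
	false;
should_accept(false, ContentLength, MinBodySize) when is_binary(ContentLength) ->
	%% Parse the header without touching the body or the disk.
	try binary_to_integer(ContentLength) of
		Size -> Size >= MinBodySize
	catch
		error:badarg -> false
	end.
```

For example, `accept_with_data_sketch:should_accept(false, <<"5000">>, 1024)` evaluates to `true`, while a malformed header such as `<<"abc">>` is treated as "do not accept".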

apps/arweave/test/ar_http_iface_tests.erl

  • Added test_import_missing_data_for_existing_tx/1 to test the edge case
  • Verifies that data can be imported for existing transactions exactly once
  • Confirms normal 208 behavior is preserved after data import

Key Features

  1. Prevents spam amplification: the two-tier registry ensures each transaction is processed with data at most once
  2. No I/O in the handler: the decision uses a lightweight Content-Length header check instead of I/O operations
  3. Backward compatible: the existing 208 response is preserved for every case other than the one-time data import

Testing

The new test case validates:

  1. Transaction posted without data returns 200 and is accepted
  2. Data endpoint returns 404 (data missing)
  3. Same transaction posted with data returns 200 (not 208) and imports data
  4. Data endpoint now returns the imported data
  5. Subsequent posts return 208 as expected (prevents repeated processing)
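
The expected status codes for steps 1, 3 and 5 can be modelled with a tiny state machine. This is only an illustration of the sequence the test checks, with a hypothetical module and an in-memory map standing in for the node; it does not exercise the HTTP interface or the data endpoint.

```erlang
-module(post_tx_flow_sketch).
-export([post/3, run/0]).

%% State is a map of TXID => no_data | with_data (a stand-in for the node).
%% Returns {StatusCode, NewState}.
post(TXID, HasData, State) ->
	case {maps:find(TXID, State), HasData} of
		%% First sighting of the transaction: accept it.
		{error, true} -> {200, State#{TXID => with_data}};
		{error, false} -> {200, State#{TXID => no_data}};
		%% Known transaction without data, now posted with data: import once.
		{{ok, no_data}, true} -> {200, State#{TXID => with_data}};
		%% Everything else is already processed.
		{{ok, _}, _} -> {208, State}
	end.

%% Post without data, then with data, then with data again.
run() ->
	S0 = #{},
	{C1, S1} = post(<<"tx">>, false, S0),
	{C2, S2} = post(<<"tx">>, true, S1),
	{C3, _S3} = post(<<"tx">>, true, S2),
	[C1, C2, C3].
```

`post_tx_flow_sketch:run()` yields `[200, 200, 208]`, matching the accept, import, and reject steps listed above.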

Backward Compatibility

This change is fully backward compatible:

  • Existing behavior is preserved for all current use cases
  • Only adds new functionality for the specific edge case of missing data import
  • No changes to API contracts or response formats
  • Maintains network spam protection through the two-tier registry system

ar_ignore_registry:add_ref(TXID, Ref, 5000),
post_tx_parse_id(read_body, {TXID, Req, Pid, Encoding})
end;
case ar_mempool:is_known_tx(TXID) of
Member

This breaks the mechanism where we process every tx only once and thus makes the network amplify spam.

We would need to do something like a two-tier ignore registry where a tx with data is accepted if it is not yet recorded in the second tier.

tx_has_missing_data(#tx{ format = 2, data_size = DataSize, data = Data, id = ID })
		when DataSize > 0, byte_size(Data) > 0 ->
	% Check if we have the data for this tx
	case ar_storage:read_tx_data(ID) of
Member

We do not use the file storage anymore; use ar_data_sync:get_tx_data/1 to fetch the transaction data.

In any case, we are interested in the mempool here, and we do not put unconfirmed transactions on disk. Also, we should not do any I/O in the POST /tx handler, for performance reasons.

I think the ignore registry upgrade I propose above naturally solves this problem.

Contributor Author

This function is now gone in favour of the ignore registry.

@hlolli hlolli force-pushed the always-import-inlined-tx-data branch from e0b55d9 to 1c6de00 Compare August 27, 2025 14:50
@hlolli hlolli force-pushed the always-import-inlined-tx-data branch from 1c6de00 to cf77b61 Compare August 27, 2025 14:51
ar_ignore_registry:remove_ref(TXID, Ref),
ar_ignore_registry:add_temporary(TXID, 10 * 60 * 1000),
ok.
%% Exclude successful requests with valid transactions from the
Collaborator

Small thing: the Arweave codebase uses tabs for indenting.

3 participants