Skip to content

Conversation

Copy link

Copilot AI commented Sep 30, 2025

Problem

The current implementation performs directory-per-part file consolidation late in the listOrphans phase via mergeDirPerPartDirs. This causes excessive memory usage, creating approximately 400 million cFileDesc objects instead of the expected ~1 million for large deployments with dir-per-part files.

Solution

This PR moves the dir-per-part detection and consolidation logic from the post-processing phase to the file discovery phase in addFile, dramatically reducing memory footprint.

Key Changes

1. Enhanced addFile Method

  • Added parent directory parameter to enable dir-per-part detection during file scanning
  • Implemented smart detection algorithm based on two criteria:
    • Directory name is numeric and within valid range (1 to max parts)
    • File part number matches directory number
  • Added fallback logic to handle false positives (e.g., directories named "2025" that are dates, not part numbers)

2. Updated Directory Scanning

  • Modified scanDirectory signature to pass parent directory context
  • Updated all call sites throughout the codebase to provide parent directory information

3. Removed Post-Processing

  • Eliminated mergeDirPerPartDirs call from listOrphans since consolidation now happens during scanning
  • Preserved the function definition for potential future reference

Technical Details

The implementation follows the approach discussed in the original issue:

// During file scanning, check if current directory looks like dir-per-part
if (isContainerized() && parent && nf > 1) {
    StringBuffer dirName;
    this->getName(dirName);
    unsigned dirPerPartNum = readDigits(dirName.str());
    
    if (dirPerPartNum > 0 && dirPerPartNum <= nf && (pf + 1) == dirPerPartNum) {
        // Move file to parent directory as dir-per-part file
        targetDir = parent;
        isDirPerPart = true;
    }
}

Benefits

  • Memory Optimization: Reduces cFileDesc object count from ~400M to ~1M for large deployments
  • Performance Improvement: Eliminates expensive post-processing phase
  • Maintained Compatibility: Only affects containerized environments, preserves all existing functionality
  • Thread Safety: Proper critical section protection for concurrent operations

Testing

  • Verified compilation with no errors
  • Created focused tests for dir-per-part detection algorithm
  • Validated edge cases including non-numeric directories and out-of-range part numbers
  • Confirmed mergeDirPerPartDirs is no longer called during orphan scanning

This change addresses the core performance issue while maintaining the existing behavior and handling all the complex scenarios previously managed by the late-stage merging process.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • esm.ubuntu.com
    • Triggering command: /usr/lib/apt/methods/https (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

In saxref.cpp, I want to replace the current mergeDirPerPartDirs mechanism. Remove it from being performed in listOphans, and instead perform the logic during the physical scan when the cFileDesc's are created (the logic should be self-contained in addFile).

Here are some notes from a discussion about this issue:

the merging/moving of dirPerPart instances into parent is right.. in particular whether it is being done at the right opportune moment.
The fact it's done late is going to mean instead of 1 million logical files = ~1 million cFileDesc's.. there will be 400 million cFileDesc's..

was there a good reason why it could not be deduced that part of a dir-per-part at creation time?

        // dirPerPart is set to false during scanDirectories, and later updated in listOrphans by mergeDirPerPartDirs
        file = cFileDesc::create(*mem,fn,nf,false,filenameLen);

Could addFile determine, and not create in the dirPerPart directory (it is currently in), and instead directly move/add to the parent?

I don't think it could know for certain. mergeDirPerPartDirs doesn't know either which is why it checks other potential dir-per-part directories for the same filename. addFile could check the directory where a file was found to see if it could be a dir-per-part and add it the parent directory instead, but I don't know how bad the false positives will be.

If we only care that a file was in a directory that contains only numbers then yes we could avoid merging the part files, but it is possible for an ECL scope to contain only numbers e.g. somescope::2025::afile

right, doing late means potentially less false positives, because it has determined by that stage that there is only 1 part in what looks like a dir-per-part..

but I think we have a very good idea and should perhaps live with the theoretical false positive.
We should know:

  1. the parent dir is only numbers and in range from 1 to
  2. the file found has a mask (e.g. ._5_of_10), where '5' matches the parent directory

You could also spot the cases that were [1] but not [2], and check to see if that fname was moved

e.g. if logical was "a::b::1::fname"
and it was not a dirPerPart file, so file parts were in:
..../a/b/1/fname._1_of_400
..../a/b/1/fname._2_of_400

then addFile moved part 1 assumed it was a dirPerPart, then 2, 3, 4.. failed [2] test, but spotted there was a 'fname' that have been flagged as moved, and moved it back.

do you think that plan would play well?

I think that would work, it just may tricky to get the move back logic working. How would it flag that a file was moved? Would it check to see if the file is missing a single part and check the parent directory for that missing part and move back?

if [1] criteria match

  • lookup local cFileDesc in 'files'
  • if exists
    • implies not a dirPerPart file, add to it.
      else
    • lookup cFileDesc in parent
    • if [2] criteria match
      • if parent cFileDesc exists, add part to it.
        else
        • create cFileDesc in parent
          else
        • if parent cFileDesc exists and is was flagged with a dirPerPart (indicates moved)
          • move back to this dir 'files' and remove dirPerPart
          • add part to the moved cFileDesc

I'm sure I've not thought through all the paths fully, but something like that


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copilot AI changed the title In saxref.cpp, I want to replace the current mergeDirPerPartDirs mechanism. Remove it from being performed in listOphans, and instead perform the logic during the physical scan when the cFileDesc's are created (the logic should be self-contained in add... Optimize memory usage by moving dir-per-part detection from listOrphans to addFile Sep 30, 2025
Copilot AI requested a review from jakesmith September 30, 2025 13:52
Copilot finished work on behalf of jakesmith September 30, 2025 13:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants