Skip to content

Conversation

@MarcelloPerathoner
Copy link
Contributor

Issue

Fixes #7320

On macOS and Windows osrm-datastore now correctly waits for other processes to detach from shared memory before exiting. Previous behaviour: osrm-datastore slept for 50ms and then boldly announced: 'All clients switched.'

This caused

  • tests to fail randomly because queries were executed against old data, and
  • tests to run slower than on linux.

Tasklist

Requirements / Relations

Related to PR #7309

Copy link
Member

@TheMarex TheMarex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, some smaller questions. Thanks for breaking it out!

throw util::exception("shmctl encountered an error: " + errorToMessage(error_code) +
SOURCE_REF);
::shmid_ds xsi_ds;
if (::shmctl(xsi.get_shmid(), IPC_STAT, &xsi_ds) < 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I read this correctly, in the case of an error we swallow that and just return? Previously it would have thrown an exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we are expecting an error here. The error being that the region is gone.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But can we check for a precise error code in that case? There may be other reasons this call errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we got there, the only plausible error that still can occur is: "EIDRM shmid points to a removed identifier". And that is exactly what we are waiting for to happen. I can test for this error and throw if any other error happens.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO it makes it a bit more clear that we are waiting for something specific to happen.

#else
// Windows - specific code

// POSIX shared mem
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember exactly why we used XSI over POSIX. Maybe issues with OSX at some point? To be honest we were originally planning to drop the whole hot-swapping thing in favor of mmap. The hotswap/shmem was conceived almost 15 years ago when the typical deployment process was "run it on a big server", not rolling deployments behind a load balancer.

Would have been nice for the 6.0 to clean this up but alas.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request aims to fix an issue where osrm-datastore on macOS and Windows did not properly wait for other processes to detach from shared memory before exiting, causing test failures and performance issues. The PR refactors the shared memory implementation by:

Changes:

  • Extracting shared memory logic from headers into a new src/storage/shared_memory.cpp implementation file
  • Implementing a WaitForDetach function for all platforms (Linux/macOS/Windows)
  • Renaming shm_key to proj_id throughout the codebase for clarity
  • Adding proper wait logic when removing old shared memory regions in storage.cpp

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/storage/shared_memory.cpp New implementation file containing SharedMemory class and platform-specific helper functions
include/storage/shared_memory.hpp Refactored to be a cleaner header with forward declarations instead of full implementation
src/storage/storage.cpp Updated to call WaitForDetach when removing old memory regions; renamed shm_key to proj_id
include/storage/shared_datatype.hpp Introduced ProjID type alias and renamed shm_key to proj_id throughout
src/tools/store.cpp Updated to use proj_id and new function signatures
src/engine/datafacade/shared_memory_allocator.cpp Updated to use proj_id and new API
include/engine/datafacade/shared_memory_allocator.hpp Updated function signature to use ProjID
include/engine/data_watchdog.hpp Updated to use proj_id and reordered log statement
include/storage/shared_monitor.hpp Added missing includes and minor parentheses fix

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +56 to +70
* @param key A ProjID
* @return bool Returns true if the region exists
*/
bool RegionExists(const ProjID proj_id);
/**
* @brief Destroys the shared memory region
*
* @param key A valid ProjID
* @return bool returns false on error.
*/
bool Remove(const ProjID proj_id);
/**
* @brief Waits for all processes to detach from the shared memory region
*
* @param key A ProjID
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The @param documentation is inconsistent. The parameter name is "proj_id" but the documentation says "key A valid ProjID". It should say "@param proj_id A valid ProjID" to match the actual parameter name.

Suggested change
* @param key A ProjID
* @return bool Returns true if the region exists
*/
bool RegionExists(const ProjID proj_id);
/**
* @brief Destroys the shared memory region
*
* @param key A valid ProjID
* @return bool returns false on error.
*/
bool Remove(const ProjID proj_id);
/**
* @brief Waits for all processes to detach from the shared memory region
*
* @param key A ProjID
* @param proj_id A ProjID
* @return bool Returns true if the region exists
*/
bool RegionExists(const ProjID proj_id);
/**
* @brief Destroys the shared memory region
*
* @param proj_id A valid ProjID
* @return bool returns false on error.
*/
bool Remove(const ProjID proj_id);
/**
* @brief Waits for all processes to detach from the shared memory region
*
* @param proj_id A ProjID

Copilot uses AI. Check for mistakes.

// The second parameter passed to ftok(). See: man 3 ftok
// It should actually be an int, but for compatibility with earlier versions of OSRM it
// is an uint16. It should't matter since, according to the man page, only the lowest 8
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in comment: "should't" should be "shouldn't".

Suggested change
// is an uint16. It should't matter since, according to the man page, only the lowest 8
// is an uint16. It shouldn't matter since, according to the man page, only the lowest 8

Copilot uses AI. Check for mistakes.
Comment on lines +2 to +6
#include "storage/shared_datatype.hpp"
#include <boost/interprocess/shared_memory_object.hpp>
#include <cstring>
#include <string>
#include <thread>
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing necessary includes. The file uses util::Log, util::exception, and SOURCE_REF but does not include "util/log.hpp", "util/exception.hpp", or "util/exception_utils.hpp". It also uses std::ofstream (line 203) without including fstream, and std::filesystem without including filesystem (though these might come from shared_datatype.hpp).

Suggested change
#include "storage/shared_datatype.hpp"
#include <boost/interprocess/shared_memory_object.hpp>
#include <cstring>
#include <string>
#include <thread>
#include "storage/shared_datatype.hpp"
#include "util/log.hpp"
#include "util/exception.hpp"
#include "util/exception_utils.hpp"
#include <boost/interprocess/shared_memory_object.hpp>
#include <cstring>
#include <string>
#include <thread>
#include <fstream>
#include <filesystem>

Copilot uses AI. Check for mistakes.
* Returns the contents of the environment variable SHM_LOCK_DIR if set else the system
* temp directory.
*
* @return std::filesystem::path The directory to usew for lock files
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation has a typo: "usew" should be "use".

Suggested change
* @return std::filesystem::path The directory to usew for lock files
* @return std::filesystem::path The directory to use for lock files

Copilot uses AI. Check for mistakes.
Comment on lines +56 to +70
* @param key A ProjID
* @return bool Returns true if the region exists
*/
bool RegionExists(const ProjID proj_id);
/**
* @brief Destroys the shared memory region
*
* @param key A valid ProjID
* @return bool returns false on error.
*/
bool Remove(const ProjID proj_id);
/**
* @brief Waits for all processes to detach from the shared memory region
*
* @param key A ProjID
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The @param documentation is inconsistent. The parameter name is "proj_id" but the documentation says "key A ProjID". It should say "@param proj_id A ProjID" to match the actual parameter name.

Suggested change
* @param key A ProjID
* @return bool Returns true if the region exists
*/
bool RegionExists(const ProjID proj_id);
/**
* @brief Destroys the shared memory region
*
* @param key A valid ProjID
* @return bool returns false on error.
*/
bool Remove(const ProjID proj_id);
/**
* @brief Waits for all processes to detach from the shared memory region
*
* @param key A ProjID
* @param proj_id A ProjID
* @return bool Returns true if the region exists
*/
bool RegionExists(const ProjID proj_id);
/**
* @brief Destroys the shared memory region
*
* @param proj_id A valid ProjID
* @return bool returns false on error.
*/
bool Remove(const ProjID proj_id);
/**
* @brief Waits for all processes to detach from the shared memory region
*
* @param proj_id A ProjID

Copilot uses AI. Check for mistakes.
Comment on lines +56 to +70
* @param key A ProjID
* @return bool Returns true if the region exists
*/
bool RegionExists(const ProjID proj_id);
/**
* @brief Destroys the shared memory region
*
* @param key A valid ProjID
* @return bool returns false on error.
*/
bool Remove(const ProjID proj_id);
/**
* @brief Waits for all processes to detach from the shared memory region
*
* @param key A ProjID
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The @param documentation is inconsistent. The parameter name is "proj_id" but the documentation says "key A ProjID". It should say "@param proj_id A ProjID" to match the actual parameter name.

Suggested change
* @param key A ProjID
* @return bool Returns true if the region exists
*/
bool RegionExists(const ProjID proj_id);
/**
* @brief Destroys the shared memory region
*
* @param key A valid ProjID
* @return bool returns false on error.
*/
bool Remove(const ProjID proj_id);
/**
* @brief Waits for all processes to detach from the shared memory region
*
* @param key A ProjID
* @param proj_id A ProjID
* @return bool Returns true if the region exists
*/
bool RegionExists(const ProjID proj_id);
/**
* @brief Destroys the shared memory region
*
* @param proj_id A valid ProjID
* @return bool returns false on error.
*/
bool Remove(const ProjID proj_id);
/**
* @brief Waits for all processes to detach from the shared memory region
*
* @param proj_id A ProjID

Copilot uses AI. Check for mistakes.
{
return blocks.find(name) != blocks.end();
}
inline bool HasBlock(const std::string &name) const { return blocks.contains(name); }
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change from blocks.find(name) != blocks.end() to blocks.contains(name) requires C++20. Ensure that the project's minimum C++ standard is set to C++20 or this will cause compilation errors on older standards.

Suggested change
inline bool HasBlock(const std::string &name) const { return blocks.contains(name); }
inline bool HasBlock(const std::string &name) const
{
return blocks.find(name) != blocks.end();
}

Copilot uses AI. Check for mistakes.
Comment on lines +172 to +185
try
{
while (true)
{
shared_memory_object shm_obj(open_only, name, read_write);
if (--timeout < 0)
break;
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
}
catch (interprocess_exception &)
{
// already detached
}
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WaitForDetach implementation for Windows/macOS has a fundamental issue. After calling shared_memory_object::remove(), the shared memory name is immediately unlinked from the filesystem. Any subsequent attempt to open it with open_only will fail, even if processes are still attached to the underlying memory. This means the while loop at line 174-180 will immediately throw an exception and exit, not actually waiting for processes to detach. This defeats the purpose of the PR, which is to wait for clients to detach before proceeding. A proper solution would need to either: (1) not call Remove before WaitForDetach, or (2) use a different mechanism to detect when processes have detached (though POSIX shared memory doesn't provide this capability).

Suggested change
try
{
while (true)
{
shared_memory_object shm_obj(open_only, name, read_write);
if (--timeout < 0)
break;
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
}
catch (interprocess_exception &)
{
// already detached
}
// Best-effort wait: as long as we can open the shared memory object by name,
// assume that clients may still be attached and wait in 1ms increments.
// If we can no longer open it (e.g. because it has been removed), we still
// wait out the remaining timeout before returning.
while (timeout-- >= 0)
{
try
{
shared_memory_object shm_obj(open_only, name, read_write);
}
catch (interprocess_exception &)
{
// Shared memory object cannot be opened anymore (e.g. already removed).
// We cannot reliably detect attached processes, but to honour the
// requested timeout we wait for the remaining duration before returning.
if (timeout > 0)
{
std::this_thread::sleep_for(std::chrono::milliseconds(timeout));
}
return;
}
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

Copilot uses AI. Check for mistakes.
@DennisOSRM DennisOSRM merged commit 1e1d5fa into Project-OSRM:master Jan 13, 2026
43 of 44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

osrm-datastore on macOS and win do not wait for clients to switch

3 participants