Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COLLECTIONS-803: prevent duplicate call to convertKey on put for CaseInsensitiveMap #276

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

Simulant87
Copy link

@Simulant87 Simulant87 commented Feb 3, 2022

Improves the performance by re-using the once converted key again when creating a new entry.
Explained in more detail in https://issues.apache.org/jira/browse/COLLECTIONS-803

@garydgregory
Copy link
Member

Hello @Simulant87
You'll need to add a test case to this PR that shows at least one failing test when run without the main changes.

@Simulant87
Copy link
Author

@garydgregory Thank you for the feedback.
I added a test that verifies the convertKey method is only called once, which fails currently on the master branch without the production code changes provided.
I kindly request another review. Please let me know if there is anything else missing.

@freya022
Copy link

I may also suggest changing the convertKey method to use key.toString().toLowerCase(Locale.ROOT) to convert the non-null key, instead of the char loop.

Benefits include better performance (measured a 2 times increase on my pc), and this is more fitted for the use case, as per the Character#toLowerCase documentation:

In general, String.toLowerCase() should be used to map characters to lowercase. 
String case mapping methods have several benefits over Character case mapping methods. 
String case mapping methods can perform locale-sensitive mappings, context-sensitive mappings, and 1:M character mappings, whereas the Character case mapping methods cannot

@Simulant87
Copy link
Author

@freya022 thank you for your suggestion. This sounds like another high performance improvement.
However it also sounds like the behaviour might slightly change in some edge cases. Something I expect the maintainers want to avoid and could cause a longer discussion. Therefore I would like to keep your suggestion separated from my changes. As my implementation does not cahnge the behaviour of the key conversion, it should be simpler to review and approve. I encourage your to create a ticket in Jira and create a separate pull request for that change.

@Simulant87
Copy link
Author

Hello @kinow,
I saw that you are able to merge in this repository in another recent PR. Would you also like to review my suggested changes in this PR, please?

@kinow
Copy link
Member

kinow commented Feb 24, 2022

Hello @kinow, I saw that you are able to merge in this repository in another recent PR. Would you also like to review my suggested changes in this PR, please?

Sorry, busy with other pull requests & issues 😥 🙇 code looks good from a brief look, but would need more time to really understand the change and confirm it's a good improvement 👍

@codecov-commenter
Copy link

codecov-commenter commented May 10, 2022

Codecov Report

Merging #276 (2193363) into master (1677dac) will decrease coverage by 4.66%.
The diff coverage is 92.96%.

@@             Coverage Diff              @@
##             master     #276      +/-   ##
============================================
- Coverage     85.87%   81.21%   -4.66%     
+ Complexity     4676     4609      -67     
============================================
  Files           292      288       -4     
  Lines         13469    13442      -27     
  Branches       1955     1984      +29     
============================================
- Hits          11566    10917     -649     
- Misses         1326     1932     +606     
- Partials        577      593      +16     
Impacted Files Coverage Δ
.../org/apache/commons/collections4/ClosureUtils.java 100.00% <ø> (ø)
...g/apache/commons/collections4/ComparatorUtils.java 81.48% <0.00%> (+2.91%) ⬆️
...rg/apache/commons/collections4/FluentIterable.java 100.00% <ø> (ø)
...org/apache/commons/collections4/IterableUtils.java 96.62% <0.00%> (ø)
...org/apache/commons/collections4/IteratorUtils.java 92.47% <0.00%> (ø)
...org/apache/commons/collections4/MultiMapUtils.java 60.00% <ø> (ø)
...rg/apache/commons/collections4/PredicateUtils.java 83.87% <ø> (ø)
...java/org/apache/commons/collections4/SetUtils.java 72.97% <ø> (ø)
...ache/commons/collections4/bag/SynchronizedBag.java 66.66% <ø> (-25.00%) ⬇️
...ommons/collections4/bag/SynchronizedSortedBag.java 25.00% <ø> (ø)
... and 94 more

... and 22 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Simulant87
Copy link
Author

@garydgregory May I request another review, to get my PR merged? I think the PR is complete with a test covering the new code, no conflicts to the main branch, and the pipeline is green.

@Simulant87 Simulant87 changed the base branch from master to collections March 22, 2023 20:44
@Simulant87 Simulant87 changed the base branch from collections to master March 22, 2023 20:45
@Simulant87
Copy link
Author

@Claudenw I saw you were active on other pull requests. Could you please review my pull request here as well? I just updated it to resolve a conflict.

Copy link
Contributor

@Claudenw Claudenw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the one issue, it looks good to me.

Copy link
Contributor

@aherbert aherbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code for this is fine but the javadoc has all been copied from the parent class and a lot is not relevant. Can you update it to describe what the methods are actually doing.

Simulant87 and others added 2 commits March 23, 2023 12:35
in order to allow overwrites by subclasses.
Updated JavaDoc and parameter names to point out required key conversion.
@Simulant87
Copy link
Author

Thank you @aherbert and @Claudenw for your valuable comments.
I addressed them all with my recent changes. I decided to make my private methods protected now to grant sub-classes the control to fine tune the implementation with single overwrites.

@Simulant87 Simulant87 requested review from Claudenw and removed request for Claudenw March 23, 2023 20:32
@Simulant87 Simulant87 requested review from aherbert and Claudenw and removed request for aherbert and Claudenw March 23, 2023 20:33
@aherbert
Copy link
Contributor

I have put a comment on the jira ticket about this change. I believe the best approach is not to add new protected (or private methods) but to override the minimum of the current API to achieve the functionality. Note that any version of this improvement would be a behavioural change for any derived classes. The private approach can only be circumvented by reimplementing the public put method that calls new private methods. The protected approach will pollute the public API. Both can be circumvented by downstream derived classes but it does require a code update to workaround the new functionality.

The question remains as to whether breaking functional compatibility in derived classes is an acceptable change for the performance benefit. Downstream users who have extended this class would have to check if the change breaks their code and update it as appropriate. Any users of the CaseInsensitiveMap class would see a performance improvement.

@Simulant87 Simulant87 requested review from Claudenw and removed request for aherbert March 28, 2023 19:04
@Claudenw
Copy link
Contributor

@aherbert I am torn here. I think that the changes proposed for CaseInsensitiveMap should be applied to AbstractHashedMap as any class that overrides convertKey will benefit. However, this means changes to addMapping and createEntry

The changes proposed are, within themselves, clean. However, changing the original implementations would be a breaking change for anyone that has overridden them. My thought was to modify AbstractHashedMap.put() to work as the proposed CaseInsensitiveMap.put() works. But this introduces 2 new methods to AbstractHashedMap. I don't like the proposed names but I also don't see any standardised names that make sense though perhaps then names _addMapping and _createEntry would work. We could then mark addMapping and createEntry for deprecation.

What is your opinion?

@aherbert
Copy link
Contributor

If you modify AbstractHashMap to just use the converted key in put:

    protected void addMapping(final int hashIndex, final int hashCode, final K key, final V value) {
        addConvertedMapping(hashIndex, hashCode, convertKey(key), value);
    }

    protected void addConvertedMapping(final int hashIndex, final int hashCode, final Object convertedKey, final V value) {
        modCount++;
        final HashEntry<K, V> entry = createConvertedEntry(data[hashIndex], hashCode, convertedKey, value);
        addEntry(entry, hashIndex);
        size++;
        checkCapacity();
    }

    protected HashEntry<K, V> createConvertedEntry(final HashEntry<K, V> next, final int hashCode, final Object convertedKey, final V value) {
        return new HashEntry<>(next, hashCode, convertedKey, value);
    }

With this change the existing createEntry method is not called in the codebase. The LRUMap and MultiKeyMap call addMapping.

Any child classes that do override convertKey will inherit the performance change from put. Any classes that use addMapping will not see a performance change if they have converted the key already (or do not have to).

Since this will not enhance the performance of any of the classes in Collections other than CaseInsensitiveMap then I do not see the point of adding more methods to the API and deprecating current ones. I think my suggestion on the Jira ticket to simply cache the converted key privately in CaseInsensitiveMap is the most conservative change. The new field to cache the converted key can be marked as transient to avoid serialisation issues. It just needs to be created before deserialisation by overriding doReadObject.

@aherbert
Copy link
Contributor

On further thought, caching the key conversion is not always going to work. The conversion must be done for each new call to the map with a key as it may be mutable.

This test works on the current implementation. It fails when the conversion is cached as the StringBuilder object is the same, even though it now is a different converted key:

@Test
public void testCache() {
    CaseInsensitiveMap<Object, Integer> map = new CaseInsensitiveMap<>();
    Assertions.assertNull(map.get("foo"));
    StringBuilder sb = new StringBuilder();
    map.put(sb, 1);
    sb.append("foo");
    Assertions.assertNull(map.get(sb)); // fails with a cached conversion as the map returns 1
}

Note that adding mutable items, changing them, and adding again is not a typical use case. But for the CaseInsensitiveMap it can be as all keys are stored converted.

So this leads us back to adding more methods to reuse the key converted in put when creating a new entry. Wherever this change is introduced, it will potentially be a functionally breaking change as methods that could have been overridden may no longer be used. So the minimum breaking change would be to make the change in CaseInsensitiveMap. Only an external library that extends this class, and overrides put, addMapping and/or createEntry will be affected. Adding new methods to AbstractHashMap to replace those which duplicate key conversion may reveal more issues when this path is followed. For example note that AbstractLinkedMap also calls convertKey in an override of createEntry, but it doesn't override put. So this would be affected.

Currently I am not convinced that this performance enhancement can be cleanly integrated with the current code. It would be easier for an end user to just extend the map, cache key conversion and not deliberately break it with mutable keys. This should be done if put is the largest part of the performance overhead of using the map.

@garydgregory
Copy link
Member

If the claim is that this change is for performance, then a JMH test to back up that claim would help.

@aherbert
Copy link
Contributor

If the claim is that this change is for performance, then a JMH test to back up that claim would help.

I posted a performance table for a quick JMH benchmark in the Jira ticket. I can add the benchmark to git master if required. It adds JMH to the pom to be run in a profile as per Commons Text. The benchmark must run in the same package for the various implementations to have access to package-private data structures.

@Claudenw
Copy link
Contributor

I think that we can make a change work for AbstractHashedMap.

If we simply change line 286 in the AbstractHashedMap.put() and inline the addMapping() call using the code proposed in this pull request replacing the call to createEntry() with an inline creation of the Entry object using the already converted key.

This has the advantages of

  1. Calling the conversion once.
  2. Improving the performance of implementations that do not override the put() method.
  3. Does not impact overriding implementations unless the overriding class implements addMapping() or createEntry() and does not override put().

There is a possible implementation using a ThreadLocal variable to store the key and and override createEntry() to check for it.

@aherbert
Copy link
Contributor

I think that we can make a change work for AbstractHashedMap.

This is another variant of having to second guess what a user may have overridden. If we ignore anything outside of Collections (i.e. user code that has extended classes) then we have to support what we have. Some classes extend AbstractHashedMap and override put (e.g. AbstractReferenceMap); some override createEntry (AbstractLinkedMap). AbstractLinkedMap even overrides createEntry where the convertKey method is called the second time. There does not seem to be an easy point at which to insert new code that avoids a duplicate conversion.

Improving the performance of implementations that do not override the put() method.

Performance improvement is negligible for Collections as no map overrides convertKey except CaseInsensitveMap. So changing a lot of the library to support unknown downstream users may be fixing a non-issue.

There is a possible implementation using a ThreadLocal variable

IIUC ThreadLocal is not relevant for this as the map is not thread safe for put anyway. Any cache implementation that does not call convertKey and reuses an existing conversion would have to be robust to clearing its own cache. Otherwise you can get a stale converted key for a mutable key object.

Currently I would favour the documentation of how to implement a class that perform convertKey using a cache. The user can then be made aware of downsides to possible implementations and can choose the best method for their use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants