Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ArrayUtils.binarySearch() with a key extractor #1270

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

kvr000
Copy link

@kvr000 kvr000 commented Aug 31, 2024

There is Collections.binarySearch() in Java core. However, it searches for exact element which is often not usable, as the elements are typically sorted by particular field.

These methods introduce binarySearch() on arrays and (sorted) List, with ability to provide sorting key extraction function and key comparator.

The rest of semantics, behavior, parameters and return values remain consistent with Collections.binarySearch() from Java core.

Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @kvr000
Thank you for your PR.
Please see my comments.
Waiting for community feedback as well...

*
* @author Zbynek Vyskovsky
*/
public class SortedListUtils {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collection utilities do not belong in Commons Lang IMO. This would be for Commons Collections unless this functionality is already there.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it. Unsure whether this was the best choice, as this is about the algorithm, not about implementing collections.

@kvr000 kvr000 force-pushed the sorted-list-binary-search branch 2 times, most recently from ed7f900 to ae97b6f Compare September 4, 2024 01:42
Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @kvr000,

This code should be in ArrayUtils.
TY!

@kvr000
Copy link
Author

kvr000 commented Sep 11, 2024

Hello @kvr000,

This code should be in ArrayUtils. TY!

@garydgregory : Yeah, got it, was busy to finish that. Fixed now, thank you.

@garydgregory
Copy link
Member

Hello @kvr000,
This code should be in ArrayUtils. TY!

@garydgregory : Yeah, got it, was busy to finish that. Fixed now, thank you.

@kvr000

Run mvn (by itself) before you push to catch build errors.

Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @kvr000

  • See my scattered comments.
  • Add a method called createSortedList(int...) that guarantees sorted results and use it.
  • Add tests for non-sorted input, make sure the results are predictable, we do not want a DOS for an infinite loop for example.

public class ArrayUtilsBinarySearchTest extends AbstractLangTest {

@Test
public void binarySearch_whenEmpty_returnM1() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since all tests in the class call binarySearch() and the class is already called "binary search", you do not need to repeat that name in each test method name, that'll de-clutter names a bit.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree with this comment. The methods names are intentionally descriptive with pattern name-condition-expected. This improves readability. While there are truly only binarySearch now, it doesn't have to be forever and vice versa - this class may be possibly merged into more generic ones which would work with sorted arrays (which was the very original name).

@kvr000 kvr000 force-pushed the sorted-list-binary-search branch 4 times, most recently from 4c16d1a to 9063592 Compare September 12, 2024 02:32
@kvr000
Copy link
Author

kvr000 commented Sep 12, 2024

Hello @kvr000

  • See my scattered comments.
  • Add a method called createSortedList(int...) that guarantees sorted results and use it.
  • Add tests for non-sorted input, make sure the results are predictable, we do not want a DOS for an infinite loop for example.

@garydgregory :

The createdSortedList() won't have any value. In the test, I have to see what is returned index - that's impossible with unsorted array. Additionally, now it's used also for intentionally unsorted array.

Good point for non-sorted input, I added a test. The code always does l = m+1 or h = m-1 , so it's safe.

Copy link
Contributor

@sebbASF sebbASF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some documentation clarifications

* @return
* index of the search key, if it is contained in the array within specified range; otherwise,
* (-first_greater - 1). The first_greater is the index of lowest greater element in the list - if all elements
* are lower, the first_greater is defined as toIndex.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above para needs rework: there is no specified range here, and no toIndex

Copy link
Member

@garydgregory garydgregory Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result should be documented as undefined if all array elements are equals; like the Arrays version.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to match non-index version.

@garydgregory : That is great point. In fact, one of my use cases is array (list) with duplicities. So this can be done better. I introduce binarySearchFirst and binarySearchLast methods which return consistently first and last matching element respectively.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WRT binarySearchFirst and Last, let's not add those.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@garydgregory : I already did. I dropped binarySearch as the randomness of choosing any of equal element effectively disqualifies it from such use cases. Predictable behavior of binarySearchFirst and binarySearchLast is huge benefit and comes at almost no cost (is slower only in cases when the lookup is quickly "lucky").

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use caee, I thought, was a version of the 'Arrays' method plus key extraction. I would not necessarily want an API that always allocates an array to return a value. Maybe do this later, as a second API only if needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of the 'Arrays' method, the entire entry is matched, so to some extent it does not matter which index is returned. That is not the case where a partial key is used, so I think in the user will often want to know all the matches.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sudden thought: surely key extraction can take place in the Comparator, so why is this method needed at all?

Or maybe I am missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sudden thought: surely key extraction can take place in the Comparator, so why is this method needed at all?

Or maybe I am missing something?

Good point @sebbASF !

We might not need this at all...

Copy link
Author

@kvr000 kvr000 Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sebbASF @garydgregory :

In the case of the 'Arrays' method, the entire entry is matched, so to some extent it does not matter which index is returned. That is not the case where a partial key is used, so I think in the user will often want to know all the matches.

In my particular case, I have list of GPS locations sorted by time and track distance, with distance unchanged if there was no move (causing duplicates). One use case is to find point for particular distance and if there are duplicates, I want the result to be predictable (here last but can be first in different use cases). If the result is unpredictable, it will lead to different results based on initial conditions for the search. I don't care about whole range, though I can imagine someone would appreciate such function with added complexity.

Sudden thought: surely key extraction can take place in the Comparator, so why is this method needed at all?

No, unless we're able to construct the full object which is not always possible - note that Comparator takes two arguments of the same type. It is technically achievable with hacking and casting one argument to T and the other one to K, relying on internal implementation passing the arguments in expected order but that would be super ugly and fragile to implementation changes.

@garydgregory :

Please revert the First/Last APIs.

For the reasons above, I want to keep those as they better match intended use case. I think we actually don't need original unpredictable binarySearch() .

src/main/java/org/apache/commons/lang3/ArrayUtils.java Outdated Show resolved Hide resolved
return m;
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if there are no higher matches?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then l reaches h+1 and exits the loop, returning -toIndex - 1 as documented.

@garydgregory garydgregory changed the title binarySearch on sorted arrays and sorted lists binarySearch on sorted arrays Sep 12, 2024
@garydgregory garydgregory changed the title binarySearch on sorted arrays Add ArrayUtils.binarySearch with a key extractor Sep 13, 2024
@garydgregory garydgregory changed the title Add ArrayUtils.binarySearch with a key extractor Add ArrayUtils.binarySearch() with a key extractor Sep 13, 2024
@garydgregory
Copy link
Member

@kvr000
Please revert the First/Last APIs.

@kvr000
Copy link
Author

kvr000 commented Sep 18, 2024

@kvr000 Please revert the First/Last APIs.

@garydgregory : As explained in #1270 , I want to keep First/Last as I find them more useful due to their predictability. I don't think we really need original unpredictable simple binarySearch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants