Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bag.top(bottom)Occurrences #1471

Open
dxxvi opened this issue Jun 25, 2023 · 4 comments
Open

Bag.top(bottom)Occurrences #1471

dxxvi opened this issue Jun 25, 2023 · 4 comments

Comments

@dxxvi
Copy link

dxxvi commented Jun 25, 2023

The API documentation says that these methods return ListIterable<ObjectInPair<T>> but it didn't say if the element at the beginning of the returned list would have the most or the least occurrence. I think it's a bit unclear there. The unit test MutableBagTestCase.topOccurences says that the 1st element in the returned list has the most occurrences. Is there a unit test for this case (mentioned in the API doc) "In the event of a tie, all the items with the number of occurrences that match the occurrences of the last item will be returned"?

The implementation for those top(Bottom)Occurrences, if I'm not wrong, is in AbstractBag.occurrencesSortingBy. In this method, all elements in the bag are put into a list with occurrences. Then this list is sorted by some function. Then to get the top/bottom, e.g. 5, 5 first element in that list are returned. IMO, in some cases, this is inefficient: to get the top/bottom 5 occurrences, we have to sort the whole list.

@dxxvi dxxvi changed the title Discuss about Bag.top(bottom)Occurrences Bag.top(bottom)Occurrences Jun 25, 2023
@motlin
Copy link
Contributor

motlin commented Jun 30, 2023

the topOccurrences() method has an inefficient implementation that ought to be optimized. The current implementation sorts, as you said, making it O(n log n). The selection algorithm is O(n) and sorting would take O(k log k) in the absence of ties.

@motlin
Copy link
Contributor

motlin commented Jun 30, 2023

I took a look at Python's most_common and it takes the front element from a heap k times, making it O(n * k). This is a nice compromise in that it's faster than our implementation, but still pretty easy to implement.

@mohrezaei
Copy link
Member

I took a look at Python's most_common and it takes the front element from a heap k times, making it O(n * k). This is a nice compromise in that it's faster than our implementation, but still pretty easy to implement.

That's only decent if k < log(n). For k near n, it's impossible to do better than sorting (simply because the result has to be sorted) and we don't want O(n^2). So the algorithm needs to be smart based on k vs n.

I wonder how bad it would be to stuff the first k elements in a tree, then remove the worst one as new ones are added (the tree would have to be a list at each node to contain the equals).

@motlin
Copy link
Contributor

motlin commented Jul 1, 2023

Sorry that was a typo, I forgot the logarithm.

One implementation is to keep a min heap of fixed size k. Insert every (item, count) pair into it and remove the minimum whenever the heap size exceeds k. Inserting and removing in this heap is O(log k) since its fixed to size k. Handling ties is tricky in this implementation but can be done.

Another implementation is to heapify all items into a max heap, and then pop the max until we're done. This approach makes it easier to handle ties, since they come at the end. Heapify is O(n). Popping the max from a max heap is O(log n). So overall, this approach is O(k log n).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants