Consider approaches to sentence-based deduplication #18

Open
rahulbot opened this issue May 7, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

rahulbot commented May 7, 2024

As documented in mediacloud/story-indexer#278, we're seeing instances of headlines appearing at the tail end of stories and polluting results. We should consider the original idea of moving this stage of deduplication (which we used to do in the legacy system) to a sous-chef feature. The idea would be to (a) tokenize each story by sentence, (b) remove duplicate sentences after their first appearance, and (c) remove stories from the corpus that no longer match the query after sentence deduplication. This is non-trivial and will take some design work.
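To make steps (a) through (c) concrete, here is a rough sketch in plain Python. It is a starting point for discussion, not a proposed implementation and not how the legacy system worked. Everything named here is hypothetical: `split_sentences` is a naive regex splitter standing in for a real sentence tokenizer, `matches_query` is a caller-supplied stand-in for re-running the actual query, and the `shared` flag only illustrates one open design question, namely whether "first appearance" is scoped to a single story or to the whole result set.

```python
import re

# Naive boundary: split on whitespace that follows ., ! or ?.
# Assumption: a real implementation would use a proper sentence tokenizer.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """(a) Tokenize a story into sentences."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]


def normalize(sentence):
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(sentence.casefold().split())


def dedup_sentences(text, seen=None):
    """(b) Keep only the first appearance of each sentence.

    `seen` holds normalized sentences already encountered. Pass the same set
    across stories to dedup corpus-wide; omit it to dedup within one story.
    """
    if seen is None:
        seen = set()
    kept = []
    for sentence in split_sentences(text):
        key = normalize(sentence)
        if key in seen:
            continue
        seen.add(key)
        kept.append(sentence)
    return " ".join(kept)


def dedup_and_filter(stories, matches_query, shared=False):
    """(c) Drop stories that no longer match the query after dedup.

    `matches_query` is a hypothetical callable taking the cleaned text and
    returning True or False.
    """
    seen = set() if shared else None
    result = []
    for story in stories:
        cleaned = dedup_sentences(story, seen)
        if matches_query(cleaned):
            result.append(cleaned)
    return result
```

With `shared=True`, a story whose only query match was a headline already seen in an earlier story gets dropped, which is the behavior described in (c). Order of processing matters in that mode, so the design would also need to decide which story counts as the "first" appearance.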

rahulbot added the enhancement (New feature or request) label May 7, 2024