Consider approaches to sentence-based deduplication #18

rahulbot · 2024-05-07T14:56:42Z

As documented in mediacloud/story-indexer#278, we're seeing instances of headlines appearing at the tail end of stories and polluting results. We should consider the original idea of moving this stage of deduplication (which we used to do in the legacy system) to a sous-chef feature. The idea would be to do something like (a) tokenize by sentence, (b) remove duplicate sentences from stories after their first appearance, and (c) remove stories from the corpus that no longer match the query after sentence deduplication. This is non-trivial and will take some design work.
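As a starting point for that design discussion, here is a rough sketch of steps (a) through (c) in Python. Everything in it is an assumption rather than existing sous-chef or story-indexer code: the function names `split_sentences`, `normalize`, and `dedup_corpus` are hypothetical, the regex splitter is a naive stand-in for a real sentence tokenizer, "first appearance" is taken to mean first appearance across the whole list of stories in processing order, and the query is modeled as a plain callable.

```python
import re
from typing import Callable, List

# Naive splitter (assumption): break after ., ! or ? followed by whitespace,
# or on newlines. A real tokenizer would handle abbreviations, quotes, etc.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+|\n+")


def split_sentences(text: str) -> List[str]:
    """(a) Tokenize a story into sentences."""
    return [s.strip() for s in _SENTENCE_BREAK.split(text) if s and s.strip()]


def normalize(sentence: str) -> str:
    """Key used to decide whether two sentences are 'the same'."""
    return " ".join(sentence.lower().split())


def dedup_corpus(stories: List[str], matches: Callable[[str], bool]) -> List[str]:
    """(b) Drop sentences already seen earlier in the corpus, then
    (c) drop stories that no longer satisfy the query."""
    seen = set()
    kept = []
    for text in stories:
        unique = []
        for sentence in split_sentences(text):
            key = normalize(sentence)
            if key in seen:
                continue  # duplicate of an earlier sentence
            seen.add(key)
            unique.append(sentence)
        deduped = " ".join(unique)
        if matches(deduped):
            kept.append(deduped)
    return kept


# Hypothetical example: the second story only matches "bridge" because
# another story's headline was appended to its tail.
stories = [
    "Mayor opens new bridge. The bridge spans the river.",
    "Council debates budget. Votes are due Friday. Mayor opens new bridge.",
]
result = dedup_corpus(stories, lambda text: "bridge" in text.lower())
```

Note that the outcome depends on processing order, since whichever story is seen first keeps the shared sentence. Whether the "seen" set should be scoped per source, per query result, or per time window is one of the open design questions.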

rahulbot added the "enhancement" (New feature or request) label May 7, 2024
