I'm also considering indexing only the text of the Readability-extracted content instead of all the text that appears in a document. This cuts out the header, footer, and sidebar contents, so the remaining text is relevant and very clean. However, in rare cases it can exclude valid content as well.
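To illustrate the idea, here is a minimal sketch that drops text nested inside page-chrome tags before indexing. It is not Readability and not Hister's implementation: the `SKIPPED_TAGS` set and the `extract_main_text` helper are assumptions made up for this example, and real content extraction uses much richer heuristics.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as page chrome (an assumption of this sketch).
SKIPPED_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # number of skipped tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Hypothetical helper: return only the text outside skipped tags."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The same sketch also shows the downside mentioned above: anything an author places inside an `<aside>`, even if it is valid content, would be left out of the index.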
When I read the first post of this discussion, I immediately thought of Mozilla's Readability.js :) I'm glad to see that you already implemented it in 078548b, nice!
I don't like that search expects whole words. In search engines like DuckDuckGo or Google this normally happens only when I put the word in quotes, but Hister seems to split my query into words and then search for whole words; quoting phrases did not work either. I briefly tried https://pagefind.app/ and like how it works, e.g. I can search for [...]. I see that Hister is using https://blevesearch.com/, so making it behave more like existing search engines may be a matter of configuring tokenizers.
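As a rough illustration of what a tokenizer change could buy, the sketch below indexes every prefix of each word (edge n-grams), so a partial query term matches without quotes or wildcards. It is plain Python, not Bleve's API, and the helper names (`tokenize`, `edge_ngrams`, `build_index`, `search`) are hypothetical; whether Hister's index could be configured this way is an assumption.

```python
import re


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def edge_ngrams(token, min_len=2):
    """All prefixes of a token, starting at min_len characters (or shorter tokens whole)."""
    start = min(min_len, len(token))
    return [token[:n] for n in range(start, len(token) + 1)]


def build_index(docs, min_len=2):
    """Map every indexed prefix to the ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text):
            for gram in edge_ngrams(token, min_len):
                index.setdefault(gram, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that match every query term as a word prefix."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


docs = {
    1: "Configuring tokenizers in a search index",
    2: "Token bucket rate limiting",
}
index = build_index(docs)
print(search(index, "tokeniz"))  # matches document 1 only
```

The trade-off is index size: every word contributes one entry per prefix, and prefixes only help with the start of a word, so a query like `okeniz` would still find nothing.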
[...]
The underlying problem is that generated HTML is not database/indexer-friendly by nature (it is written by programmers to please humans). Two other (obvious) answers to the ugly-HTML problem (aside from Readability.js):
The benefit of APIs is that, once authentication is resolved, structured data solves both the ugly-HTML problem and the crawler-blocking problem.
The number of irrelevant results for a given search query increases as the index grows. I'd like to prevent (or at least reduce) this behavior.
First, we need to identify the root causes. As I see it:
Please share your observations about other possible issues that can reduce search quality.
Second, we need solutions for these issues:
What do you think about these solutions? Do you have any better or complementary ideas?