In our previous post, we introduced a funnel for deduplicating web documents within a search index, addressing the dual problems of exact-duplicate and near-duplicate identification. By chaining together several methods of increasing specificity, we arrived at a system that provides sufficient precision and recall at minimal computational cost.
In this post, we look at a second challenge of maintaining a continually evolving corpus of web documents: content freshness. Roughly, freshness can be broken down into two categories: search tuning and corpus freshness.