Remove-Duplicate Algorithm Based on Meta Search Result

2018 
According to the characteristics of duplicate web pages in meta search engine results, a duplicate web page detection algorithm is proposed based on the page URL, title, and abstract, with a different similarity computation method for each. First, the page URL is normalized. For title detection, the algorithm improves a fuzzy string-matching method and computes similarity weighted by the frequency of each query term. For the abstract judgment, similarity is computed sentence by sentence: each sentence is assigned three weights, and the similarity score is obtained from the weighted combination over the summary sentences. The effect of the algorithm is significant; experiments verify that it is superior to the traditional algorithm in both precision and recall.
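The paper gives no pseudocode, but the first two steps it describes (URL normalization, then query-term-weighted title similarity) can be sketched roughly as follows. The specific normalization rules (lowercasing, dropping default ports, fragments, and trailing slashes) and the term-weighting formula are assumptions for illustration, not the paper's exact method:

```python
from urllib.parse import urlsplit, urlunsplit
from collections import Counter

def normalize_url(url):
    # Hypothetical normalization: lowercase scheme and host, drop
    # default ports, fragments, and trailing slashes. The paper does
    # not specify its exact rules.
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[: -len(":80")]
    if scheme == "https" and netloc.endswith(":443"):
        netloc = netloc[: -len(":443")]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))

def title_similarity(title_a, title_b, query_terms):
    # Weight shared title words by their frequency in the query, as
    # the abstract suggests; this particular weighted-Jaccard form is
    # an assumption.
    weights = Counter(t.lower() for t in query_terms)
    words_a = set(title_a.lower().split())
    words_b = set(title_b.lower().split())
    shared = words_a & words_b
    union = words_a | words_b
    num = sum(1 + weights[t] for t in shared)
    den = sum(1 + weights[t] for t in union)
    return num / den if den else 0.0
```

Two result URLs that normalize to the same string, or whose titles score above a threshold under the query-weighted similarity, would then be passed on to the abstract-level check.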