Metadata-Version: 2.4
Name: compare_urls
Version: 0.1
Summary: A script to compare the content of two urls and return the similarity using simhash
Home-page: https://github.com/aaronmefford/CompareUrls
Author: Aaron Mefford
Author-email: aaron@mefford.org
License: MIT
Keywords: simhash similarity near-duplicate
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Similarity, Hashing
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
License-File: LICENSE
Requires-Dist: mmh3
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: summary


Compare Urls
^^^^^^^^^^^^^^^^

A script that compares the contents of two URLs and returns a similarity score between 0 and 1, approximating their Jaccard similarity using simhash.

By default, the script tokenizes the contents into words and builds rolling shingles of 8 words each.  Parameters can change the number of tokens per shingle and switch the tokenization from words to characters.  The default hash is Python's built-in ``hash`` function, but MurmurHash (via the ``mmh3`` package) can be selected instead.
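
The idea behind shingling and simhash can be sketched as follows.  This is a minimal illustration, not the code in ``compare.py``: the function names ``shingles``, ``simhash`` and ``similarity`` are hypothetical, MD5 from ``hashlib`` stands in for the script's hash options so that results are repeatable, and the score shown here is simply the fraction of matching fingerprint bits, which may differ from how the script computes its score::

  import hashlib


  def shingles(text, size=8):
      """Return the set of rolling word shingles of `size` words."""
      words = text.split()
      if len(words) <= size:
          # Short input: the whole text is the only shingle.
          return {" ".join(words)}
      return {" ".join(words[i:i + size])
              for i in range(len(words) - size + 1)}


  def simhash(features, bits=64):
      """Fold a collection of string features into one fingerprint."""
      counts = [0] * bits
      for feature in features:
          # Assumption: MD5 is used only as a stable stand-in hash.
          h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16)
          for i in range(bits):
              # Each feature votes +1 or -1 on every bit position.
              counts[i] += 1 if (h >> i) & 1 else -1
      fingerprint = 0
      for i, count in enumerate(counts):
          if count > 0:
              fingerprint |= 1 << i
      return fingerprint


  def similarity(a, b, bits=64):
      """Return the fraction of bits on which two fingerprints agree."""
      distance = bin(a ^ b).count("1")  # Hamming distance
      return 1.0 - distance / float(bits)


  page1 = "the quick brown fox jumps over the lazy dog"
  page2 = "the quick brown fox jumps over the lazy cat"
  score = similarity(simhash(shingles(page1, size=3)),
                     simhash(shingles(page2, size=3)))
  print(score)

Identical inputs always score 1.0 in this sketch, and texts that share most of their shingles tend to produce fingerprints that differ in only a few bits.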


To install the dependencies, run::

  ./setup.py install


To run the script::

  ./compare.py http://mysite.com/page1.html http://yoursite.com/page3.html

Or, to use MurmurHash with 32-character shingles::

  ./compare.py -x murmur -l 32 -s '' http://mysite.com/page1.html http://yoursite.com/page3.html







  
