Metadata-Version: 2.4
Name: ioweb
Version: 0.0.21
Summary: Web Scraping Framework
Home-page: https://github.com/lorien/ioweb
Download-URL: https://github.com/lorien/ioweb/releases
Author: Gregory Petukhov
Author-email: lorien@lorien.name
Maintainer: Gregory Petukhov
Maintainer-email: lorien@lorien.name
License: MIT
Keywords: web scraping network crawling crawler spider pycurl
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Description-Content-Type: text/markdown
Requires-Dist: urllib3<=1.25.6
Requires-Dist: pyopenssl
Requires-Dist: cryptography
Requires-Dist: idna
Requires-Dist: certifi
Requires-Dist: cachetools
Requires-Dist: gevent
Requires-Dist: pysocks
Requires-Dist: lxml
Requires-Dist: defusedxml
Requires-Dist: selection
Requires-Dist: cssselect
Requires-Dist: python-json-logger
Requires-Dist: psutil
Requires-Dist: pymongo
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: download-url
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: maintainer
Dynamic: maintainer-email
Dynamic: requires-dist
Dynamic: summary

## IOWeb Framework

![pytest status](https://github.com/lorien/ioweb/workflows/pytest/badge.svg)
![pytype status](https://github.com/lorien/ioweb/workflows/pytype/badge.svg)

A Python framework for building web crawlers.

Good things:

 * designed to run a large number of network threads (e.g. 100 or 500) on a
    single CPU core
 * ability to combine items into chunks and then process each chunk at once
    (e.g. a MongoDB bulk write)
 * asynchronous network operations powered by gevent
 * network requests handled with urllib3
 * HTML parsed with lxml
 * ability to run CSS/XPath queries against the DOM tree of a downloaded
    HTML document
 * ability to extract certificate details
 * ability to resolve a particular domain to a custom IP address
 * stat module for counting events
 * logging of statistics to InfluxDB
 * retrying on network errors
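The chunk-processing feature can be illustrated with a minimal stand-alone sketch. Note this is an illustrative implementation of the idea (buffer items, flush them in fixed-size batches, as a crawler might do before a bulk database write), not ioweb's actual API; the `process_in_chunks` helper is a hypothetical name:

```python
# Illustrative chunker: accumulate items and hand them to a handler in
# fixed-size batches, flushing the final partial batch at the end.
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")

def process_in_chunks(
    items: Iterable[T],
    chunk_size: int,
    handler: Callable[[List[T]], None],
) -> None:
    """Call ``handler`` once per full chunk and once for the remainder."""
    buf: List[T] = []
    for item in items:
        buf.append(item)
        if len(buf) >= chunk_size:
            handler(buf)
            buf = []  # start a fresh buffer; the handler keeps its own copy
    if buf:
        handler(buf)  # flush the final partial chunk

# Example: record the chunk sizes produced for 10 items in chunks of 3
sizes: List[int] = []
process_in_chunks(range(10), 3, lambda chunk: sizes.append(len(chunk)))
# sizes is now [3, 3, 3, 1]
```

With a real storage backend the handler would be something like a MongoDB `bulk_write` call instead of a list append.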
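Retrying on network errors can likewise be sketched in a few lines. Again, this is only an illustration of the pattern, not ioweb's implementation, and `retry_on_error` is a hypothetical helper name:

```python
# Illustrative retry helper: re-run a callable when it raises a
# network-style error, up to a fixed number of attempts.
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

R = TypeVar("R")

def retry_on_error(
    func: Callable[[], R],
    retries: int = 3,
    errors: Tuple[Type[BaseException], ...] = (OSError,),
    delay: float = 0.0,
) -> R:
    """Return ``func()``, retrying up to ``retries`` times on ``errors``."""
    last_exc: Optional[BaseException] = None
    for _attempt in range(retries):
        try:
            return func()
        except errors as exc:
            last_exc = exc
            time.sleep(delay)  # back off before the next attempt
    assert last_exc is not None
    raise last_exc  # all attempts failed: re-raise the last error
```

In a crawler the callable would be the network request itself, and the error tuple would cover connection resets, timeouts, and similar transient failures.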

Bad things:

 * not fully covered by tests
 * no documentation

## Feedback

 * [t.me/grablab](https://t.me/grablab) - English chat about web scraping
 * [t.me/grablab_ru](https://t.me/grablab_ru) - Russian chat about web scraping
