Metadata-Version: 2.4
Name: alcazar
Version: 0.5.3
Summary: Web scraper framework
Home-page: https://github.com/saintamh/alcazar/
Author: Hervé Saint-Amand
Author-email: alcazar@saintamh.org
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Software Development :: Libraries
Description-Content-Type: text/markdown
Requires-Dist: lxml<5,>=4
Requires-Dist: cssselect<2,>=1.0
Requires-Dist: jmespath<1,>=0.9.3
Requires-Dist: requests<3,>=2
Requires-Dist: urllib3<2,>=1.17
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: summary

[![Build Status](https://travis-ci.org/saintamh/alcazar.svg?branch=master)](https://travis-ci.org/saintamh/alcazar)
[![PyPI version](https://badge.fury.io/py/alcazar.svg)](https://pypi.org/project/alcazar/)

Alcazar is a Python library that simplifies the task of writing web scrapers.

Some of its core features are:

* *succinct syntax* for locating relevant data within an HTML page, JSON document, string of text
* *HTTP caching to disk* for exact replay of scrapes without resubmitting HTTP requests
* *Throttling* of requests to the same host
* *Automatic retries* when an HTTP request fails, or when a page fails to parse as expected
* *Crawler* facilities for maintaining a queue of URLs to visit
* *fail-fast*: by default, we'd rather crash than save incorrect or incomplete data

Alcazar brings together the following libraries:

* [Requests](https://github.com/requests/requests)
* [lxml](https://lxml.de/) (including [cssselect](https://lxml.de/cssselect.html))
* [JMESPath](http://jmespath.org/)

Getting Started
===============

Alcazar is [available on PyPi](https://pypi.org/project/alcazar/) so it can be installed it using `pip`:

```
pip install alcazar
```

The simplest way to use the library is to instantiate a `Scraper` and call its `fetch` method:

```python
>>> import alcazar
>>> scraper = alcazar.Scraper()
>>> page = scraper.fetch('https://en.wikipedia.org/wiki/Gorgie')
>>> print(page.one('div[@id="toc"]/preceding-sibling::p[./b]').text.normalized)
Gorgie (/ˈɡɔːrɡiː/ GOR-gee) is a densely populated area of Edinburgh, Scotland. It is located in the west of the city and borders Murrayfield, Ardmillan and Dalry.
```

In this snippet:

* we've fetched the HTML for the page
  * if any network error or HTTP error happens, we'll retry to fetch it a few times, sleeping increasing delays between every attempt
* we've parsed the HTML into a tree
  * using lxml's excellent handling and recovery from "broken" HTML, as seen in the wild
* we've located the element we're interested in
  * here using an XPath expression, but we could've used a CSS selector too
  * we've checked that there was one and only one element that matched our query
  * else an exception would've been thrown, ensuring we capture only exactly what we wanted
* we've extracted its text, removed all tags from it, and normalized its whitespace

See the `samples` directory for a taste of how Alcazar works.
