Python interfaces

Archiving

Tools to download and save URLs.

archive

Archives the HTML from the provided URL.

storytracker.archive(url, verify=True, minify=True, extend_urls=True, compress=True, output_dir=None)
Parameters:
  • url (str) – The URL of the page to archive
  • verify (bool) – Verify that HTML is in the response’s content-type header
  • minify (bool) – Minify the HTML response to reduce its size
  • extend_urls (bool) – Extend relative URLs discovered in the HTML response to be absolute
  • compress (bool) – Compress the HTML response using gzip if an output_dir is provided
  • output_dir (str or None) – Provide a directory for the archived data to be stored
Returns:

An ArchivedURL object

Return type:

ArchivedURL

Raises ValueError:

If the response is not verified as HTML

Example usage:

>>> import storytracker

>>> # This returns an ArchivedURL object holding the archived page
>>> obj = storytracker.archive("http://www.latimes.com")
>>> obj
<ArchivedURL: http://www.latimes.com@2014-07-17 04:08:32.169810+00:00>

>>> # You can save it to an automatically named file in a directory you provide
>>> obj = storytracker.archive("http://www.latimes.com", output_dir="./")
>>> obj.archive_path
'./http!www.latimes.com!!!!@2014-07-17T04:09:21.835271+00:00.gz'
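
The effect of the extend_urls option can be pictured with the standard library alone. This is a minimal sketch, not storytracker's own code: the helper name extend_url is hypothetical, and it assumes relative links are resolved against the URL of the page being archived.

```python
from urllib.parse import urljoin


def extend_url(page_url, href):
    """Resolve a possibly relative href against the page it was found on.

    Hypothetical helper for illustration; storytracker may do this differently.
    """
    return urljoin(page_url, href)


# A root-relative link becomes absolute; an absolute link is left unchanged.
relative_result = extend_url("http://www.latimes.com/", "/local/story.html")
absolute_result = extend_url("http://www.latimes.com/", "http://example.com/a")
```

Here relative_result is "http://www.latimes.com/local/story.html", while absolute_result keeps its original value.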

get

Retrieves the HTML from the provided URL.

storytracker.get(url, verify=True)
Parameters:
  • url (str) – The URL of the page to retrieve
  • verify (bool) – Verify that HTML is in the response’s content-type header
Returns:

The content of the HTML response

Return type:

str

Raises ValueError:

If the response is not verified as HTML

Example usage:

>>> import storytracker

>>> html = storytracker.get("http://www.latimes.com")
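
The verify option is described as a check that HTML appears in the response's content-type header. The sketch below shows one way such a check could look. The function name verify_html and the exact matching rule (a case-insensitive substring test) are assumptions made for illustration, not the library's implementation.

```python
def verify_html(content_type):
    """Raise ValueError unless the Content-Type header value mentions HTML.

    Hypothetical helper: the substring test is an assumption about how
    "HTML is in the response's content-type header" might be checked.
    """
    if "html" not in (content_type or "").lower():
        raise ValueError("Response does not appear to be HTML")
    return True


# A typical HTML response header passes the check.
ok = verify_html("text/html; charset=utf-8")
```

A header such as "image/png", or a missing header, would raise ValueError under this sketch.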

Analysis

ArchivedURL

A URL’s archived HTML with tools for analysis.

class ArchivedURL(url, timestamp, html, gzip_archive_path=None, html_archive_path=None, browser_width=1024, browser_height=768, browser_driver="PhantomJS")

Initialization arguments

url

The URL that was archived

timestamp

The date and time when the URL was archived

html

The HTML archived

Optional initialization options

gzip_archive_path

A file path leading to an archive of the URL stored in a gzipped file.

html_archive_path

A file path leading to an archive of the URL stored in a raw HTML file.

browser_width

The width of the browser that will be opened to inspect the URL’s HTML. By default it is 1024.

browser_height

The height of the browser that will be opened to inspect the URL’s HTML. By default it is 768.

browser_driver

The name of the browser that Selenium will use to open up HTML files. By default it is PhantomJS.

Other attributes

height

The height of the page in pixels after the URL is opened in a web browser

width

The width of the page in pixels after the URL is opened in a web browser

gzip

Returns the archived HTML as a stream of gzipped data

archive_filename

Returns a file name for this archive using the conventions of storytracker.create_archive_filename().

A list of all the hyperlinks extracted from the HTML

images

A list of all the images extracted from the HTML

largest_headline

Returns the story hyperlink with the largest area on the page. If there is a tie, returns the one that appears first on the page.

largest_image

The largest image extracted from the HTML

A list of all the hyperlinks extracted from the HTML that are estimated to lead to news stories.

summary_statistics

Returns a dictionary with basic summary statistics about hyperlinks and images on the page
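
The tie rule described for largest_headline (largest area wins, and the earliest item on the page wins a tie) can be sketched in a few lines. This is an illustration only: the function name, the dictionary keys, and the assumption that the items are already sorted in page order are all hypothetical.

```python
def largest_by_area(items):
    """Return the item with the greatest width * height, or None if empty.

    max() returns the first maximal element it encounters, so when the
    items are in page order the earliest one wins a tie.
    """
    return max(items, key=lambda item: item["width"] * item["height"], default=None)


# Hypothetical links in page order; "a" and "b" have the same area.
links = [
    {"href": "a", "width": 100, "height": 50},
    {"href": "b", "width": 50, "height": 100},
    {"href": "c", "width": 10, "height": 10},
]
winner = largest_by_area(links)
```

With the sample data, winner is the entry whose href is "a", because it ties with "b" on area and appears first.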

Analysis methods

analyze()

Opens the URL’s HTML in a web browser and runs all of the analysis methods that use it.

get_cell(x, y, cell_size=256)

Returns the grid cell where the provided x and y coordinates appear on the page. Cells are sized as squares, with 256 pixels as the default.

The value is returned in the style of algebraic notation used in a game of chess.

>>> obj.get_cell(1, 1)
'a1'
>>> obj.get_cell(257, 1)
'b1'
>>> obj.get_cell(1, 513)
'a3'
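
The mapping from coordinates to a chess-style cell can be reproduced with integer division. The following is a sketch that matches the three examples above, not the library's code. The function name grid_cell is hypothetical, and both the treatment of coordinates that fall exactly on a cell boundary and the spreadsheet-style lettering beyond column z are assumptions.

```python
import string


def grid_cell(x, y, cell_size=256):
    """Return a cell label such as 'a1' for pixel coordinates (x, y)."""
    column = int(x // cell_size)   # 0-based column index, left to right
    row = int(y // cell_size) + 1  # 1-based row number, top to bottom

    # Convert the column index to letters: a..z, then aa, ab, ... (assumed).
    letters = ""
    n = column
    while True:
        n, remainder = divmod(n, 26)
        letters = string.ascii_lowercase[remainder] + letters
        if n == 0:
            break
        n -= 1
    return "%s%d" % (letters, row)


first = grid_cell(1, 1)
second = grid_cell(257, 1)
third = grid_cell(1, 513)
```

Under these assumptions the three calls give 'a1', 'b1' and 'a3', the same values as the get_cell examples.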

Returns the Hyperlink object that matches the submitted href, if it exists.

open_browser()

Opens the URL’s HTML in a web browser so it can be analyzed.

close_browser()

Closes the web browser opened to analyze the URL’s HTML

Output methods

Returns the provided file object with a ready-to-serve CSV list of all hyperlinks extracted from the HTML.

write_gzip_to_directory(path)

Writes gzipped HTML data to a file in the provided directory path

write_html_to_directory(path)

Writes HTML data to a file in the provided directory path

write_illustration_to_directory(path)

Writes out a visualization of the hyperlinks and images on the page as a JPG to the provided directory path.
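
As an illustration of what writing gzipped HTML into a directory involves, here is a standard-library sketch. It is not the implementation of write_gzip_to_directory: the function name write_gzip, the explicit filename argument, the ".gz" suffix handling, and the UTF-8 encoding are all assumptions.

```python
import gzip
import os
import tempfile


def write_gzip(html, directory, filename):
    """Write html as gzipped UTF-8 text to directory/filename.gz; return the path."""
    path = os.path.join(directory, filename + ".gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(html)
    return path


# Round trip through a temporary directory to show the file is readable again.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = write_gzip("<html></html>", tmp, "example")
    saved_name = os.path.basename(saved_path)
    with gzip.open(saved_path, "rt", encoding="utf-8") as handle:
        round_trip = handle.read()
```

The sample uses a plain file name. Archive names that contain characters such as ":" may not be valid on every filesystem.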

Example usage:

>>> import storytracker

>>> obj = storytracker.open_archive_filepath('/home/ben/archive/http!www.latimes.com!!!!@2014-07-06T16:31:57.697250.gz')
>>> obj.url
'http://www.latimes.com'

>>> obj.timestamp
datetime.datetime(2014, 7, 6, 16, 31, 57, 697250)

ArchivedURLSet

A list of ArchivedURL objects.

class ArchivedURLSet(list)

List items added to the set must be unique ArchivedURL objects.
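
A list subclass that enforces uniqueness could look like the sketch below. The class name UniqueList is hypothetical, and raising ValueError on a duplicate is an assumption; the source does not say how ArchivedURLSet reacts to a repeated item.

```python
class UniqueList(list):
    """A list that refuses to hold an equal item twice.

    Only append() is guarded here; extend(), insert() and item assignment
    would need the same check in a complete implementation.
    """

    def append(self, item):
        if item in self:
            raise ValueError("%r is already in the list" % (item,))
        super().append(item)


pages = UniqueList()
pages.append("http://www.latimes.com")
try:
    pages.append("http://www.latimes.com")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

After this runs, pages still holds a single item and duplicate_rejected is True.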

Parses all of the hyperlinks from the HTML of all the archived URLs and returns a list of the distinct href hyperlinks with a series of statistics attached that describe how they are positioned.

summary_statistics

Returns a dictionary of summary statistics about the whole set of archived URLs.

print_href_analysis(href)

Outputs a human-readable analysis of the submitted href’s position across the set of archived URLs.

write_href_gif_to_directory(href, path, duration=0.5)

Writes out an animation of a hyperlink on the page as a GIF to the provided directory path.

Returns the provided file object with a ready-to-serve CSV list of all hyperlinks extracted from the HTML.

Example usage:

>>> import storytracker

>>> obj_list = storytracker.open_archive_directory('/home/ben/archive/')

>>> obj_list[0].url
'http://www.latimes.com'

>>> obj_list[1].timestamp
datetime.datetime(2014, 7, 6, 16, 31, 57, 697250)

Image

class Image(src)

An image extracted from an archived URL.

Initialization arguments

src

The src attribute of the image tag

x

The x coordinate of the object’s location on the page.

y

The y coordinate of the object’s location on the page.

width

The width of the object’s size on the page.

height

The height of the object’s size on the page.

cell

The grid cell where the provided x and y coordinates appear on the page. Cells are sized as squares, with 256 pixels as the default.

The value is returned in the style of algebraic notation used in a game of chess.

Analysis methods

area

Returns the area of the image (its width multiplied by its height)

orientation

Returns a string describing the shape of the image.

‘square’ means the width and height are equal

‘landscape’ is a horizontal image with width greater than height

‘portrait’ is a vertical image with height greater than width

None means there are no size attributes to test
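
The orientation rules above translate directly into a small function. This is a sketch under the assumption that missing size attributes are represented as None. The standalone function name and signature are hypothetical, since in the library orientation belongs to the Image object.

```python
def orientation(width, height):
    """Classify an image shape as 'square', 'landscape', 'portrait' or None."""
    if width is None or height is None:
        return None  # no size attributes to test
    if width == height:
        return "square"
    if width > height:
        return "landscape"
    return "portrait"


shapes = [orientation(300, 300), orientation(400, 200), orientation(200, 400)]
```

For the three sample sizes, shapes holds 'square', 'landscape' and 'portrait' in that order.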

File handling

Functions for naming, saving and retrieving archived URLs.

create_archive_filename

Returns a string that combines a URL and a timestamp for naming archives saved to the filesystem.

storytracker.create_archive_filename(url, timestamp)
Parameters:
  • url (str) – The URL of the page that is being archived
  • timestamp (datetime) – A timestamp recording approximately when the URL was archived
Returns:

A string that combines the two arguments into a structure that can be reversed back into Python

Return type:

str

Example usage:

>>> import storytracker
>>> from datetime import datetime
>>> storytracker.create_archive_filename("http://www.latimes.com", datetime.now())
'http!www.latimes.com!!!!@2014-07-06T16:31:57.697250'
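
The sample output suggests a simple structure: the six components of the parsed URL joined by "!" and followed by "@" and the ISO 8601 timestamp. The sketch below reproduces that output for the example URL. It is inferred from the sample, not taken from the library, and the function name is hypothetical.

```python
from datetime import datetime
from urllib.parse import urlparse


def sketch_archive_filename(url, timestamp):
    """Join the six urlparse() components with '!' and append '@' + timestamp.

    Assumption: this mirrors the sample output only. A URL with a path would
    put '/' into the name, which a real implementation must escape somehow;
    that case is not handled here.
    """
    return "!".join(urlparse(url)) + "@" + timestamp.isoformat()


name = sketch_archive_filename(
    "http://www.latimes.com",
    datetime(2014, 7, 6, 16, 31, 57, 697250),
)
```

With a fixed timestamp in place of datetime.now(), name equals the string shown in the example above.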

open_archive_directory

Accepts a directory path and returns an ArchivedURLSet list filled with an ArchivedURL object that corresponds to every archived file it finds.

storytracker.open_archive_directory(path)
Parameters:
  • path (str) – The path to the directory containing archived files
Returns:

An ArchivedURLSet list

Return type:

ArchivedURLSet

Example usage:

>>> import storytracker
>>> obj_list = storytracker.open_archive_directory('/home/ben/archive/')

open_archive_filepath

Accepts a file path and returns an ArchivedURL object

storytracker.open_archive_filepath(path)
Parameters:
  • path (str) – The path to the archived file. Its file name must conform to the conventions of storytracker.create_archive_filename().
Returns:

An ArchivedURL object

Return type:

ArchivedURL

Raises ArchiveFileNameError:

If the file’s name cannot be parsed using the conventions of storytracker.create_archive_filename().

Example usage:

>>> import storytracker
>>> obj = storytracker.open_archive_filepath('/home/ben/archive/http!www.latimes.com!!!!@2014-07-06T16:31:57.697250.gz')

open_wayback_machine_url

Accepts a URL from the Internet Archive’s Wayback Machine and returns an ArchivedURL object

storytracker.open_wayback_machine_url(url)
Parameters:
  • url (str) – A URL from the Wayback Machine that links directly to an archive. An example is https://web.archive.org/web/20010911213814/http://www.cnn.com/.
Returns:

An ArchivedURL object

Return type:

ArchivedURL

Raises ArchiveFileNameError:

If the file’s name cannot be parsed.

Example usage:

>>> import storytracker
>>> obj = storytracker.open_wayback_machine_url('https://web.archive.org/web/20010911213814/http://www.cnn.com/')

reverse_archive_filename

Accepts a filename created using the rules of storytracker.create_archive_filename() and converts it back into Python objects. Returns a tuple containing the URL string and a timestamp. Do not include the file extension in the filename you provide.

storytracker.reverse_archive_filename(filename)
Parameters:
  • filename (str) – A filename structured using the style of the storytracker.create_archive_filename() function
Returns:

A tuple containing the URL of the archived page as a string and a datetime object of the archive’s timestamp

Return type:

tuple

Example usage:

>>> import storytracker
>>> storytracker.reverse_archive_filename('http!www.latimes.com!!!!@2014-07-06T16:31:57.697250')
('http://www.latimes.com', datetime.datetime(2014, 7, 6, 16, 31, 57, 697250))
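
Reversing a name of this shape amounts to splitting at the last "@", rebuilding the URL from its "!"-separated parts, and parsing the ISO 8601 timestamp. The following sketch handles the example above. The function name is hypothetical, and the approach is inferred from the sample filename and is not the library's code.

```python
from datetime import datetime
from urllib.parse import urlunparse


def sketch_reverse_filename(filename):
    """Split 'url-parts@timestamp' back into (url, datetime).

    Assumptions: the URL part holds exactly six '!'-separated components and
    the URL itself contains no '!' characters.
    """
    url_part, timestamp_part = filename.rsplit("@", 1)
    url = urlunparse(url_part.split("!"))
    return url, datetime.fromisoformat(timestamp_part)


result = sketch_reverse_filename(
    "http!www.latimes.com!!!!@2014-07-06T16:31:57.697250"
)
```

For the sample filename, result is the same tuple as in the example: the URL string and a datetime with microsecond precision.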

reverse_wayback_machine_url

Accepts a URL from the Internet Archive’s Wayback Machine and returns a tuple with the archived URL string and a timestamp.

storytracker.reverse_wayback_machine_url(url)
Parameters:
  • url (str) – A URL from the Wayback Machine that links directly to an archive. An example is https://web.archive.org/web/20010911213814/http://www.cnn.com/.
Returns:

A tuple containing the URL of the archived page as a string and a datetime object of the archive’s timestamp

Return type:

tuple

Example usage:

>>> import storytracker
>>> storytracker.reverse_wayback_machine_url('https://web.archive.org/web/20010911213814/http://www.cnn.com/')
('http://www.cnn.com/', datetime.datetime(2001, 9, 11, 21, 38, 14))
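
A Wayback Machine archive URL embeds a 14-digit timestamp (year, month, day, hour, minute, second) ahead of the original URL, so it can be taken apart with a regular expression. The sketch below is one possible approach, not the library's implementation. The function name, the pattern, and the use of ValueError for unparseable input are assumptions, and URLs carrying modifier suffixes after the timestamp are not handled.

```python
import re
from datetime import datetime

# Matches https://web.archive.org/web/<14 digits>/<original URL>
WAYBACK_PATTERN = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def sketch_reverse_wayback_url(url):
    """Return (archived_url, datetime) for a direct Wayback Machine link."""
    match = WAYBACK_PATTERN.match(url)
    if match is None:
        raise ValueError("Not a direct Wayback Machine archive URL: %r" % (url,))
    stamp, archived_url = match.groups()
    return archived_url, datetime.strptime(stamp, "%Y%m%d%H%M%S")


parsed = sketch_reverse_wayback_url(
    "https://web.archive.org/web/20010911213814/http://www.cnn.com/"
)
```

For the sample link, parsed matches the tuple shown in the example above.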