Python interfaces¶
Archiving¶
Tools to download and save URLs.
archive¶
Archive the HTML from the provided URL
- storytracker.archive(url, verify=True, minify=True, extend_urls=True, compress=True, output_dir=None)¶
Parameters: - url (str) – The URL of the page to archive
- verify (bool) – Verify that HTML is in the response’s content-type header
- minify (bool) – Minify the HTML response to reduce its size
- extend_urls (bool) – Extend relative URLs discovered in the HTML response to be absolute
- compress (bool) – Compress the HTML response using gzip if an output_dir is provided
- output_dir (str or None) – Provide a directory for the archived data to be stored
Returns: An ArchivedURL object
Return type: ArchivedURL
Raises ValueError: If the response is not verified as HTML
Example usage:
>>> import storytracker
>>> # This returns an ArchivedURL object holding the page's content
>>> obj = storytracker.archive("http://www.latimes.com")
>>> obj
<ArchivedURL: http://www.latimes.com@2014-07-17 04:08:32.169810+00:00>
>>> # You can save it to an automatically named file in a directory you provide
>>> obj = storytracker.archive("http://www.latimes.com", output_dir="./")
>>> obj.archive_path
'./http!www.latimes.com!!!!@2014-07-17T04:09:21.835271+00:00.gz'
get¶
Retrieves HTML from the provided URL
- storytracker.get(url, verify=True)¶
Parameters: - url (str) – The URL of the page to archive
- verify (bool) – Verify that HTML is in the response’s content-type header
Returns: The content of the HTML response
Return type: str
Raises ValueError: If the response is not verified as HTML
Example usage:
>>> import storytracker
>>> html = storytracker.get("http://www.latimes.com")
Analysis¶
ArchivedURL¶
A URL’s archived HTML with tools for analysis.
- class ArchivedURL(url, timestamp, html, gzip_archive_path=None, html_archive_path=None, browser_width=1024, browser_height=768, browser_driver="PhantomJS")¶
Initialization arguments
- url¶
The URL archived
- timestamp¶
The date and time when the URL was archived
- html¶
The HTML archived
Optional initialization options
- gzip_archive_path¶
A file path leading to an archive of the URL stored in a gzipped file.
- html_archive_path¶
A file path leading to an archive of the URL stored in a raw HTML file.
- browser_width¶
The width of the browser that will be opened to inspect the URL’s HTML. By default it is 1024.
- browser_height¶
The height of the browser that will be opened to inspect the URL’s HTML. By default it is 768.
- browser_driver¶
The name of the browser that Selenium will use to open up HTML files. By default it is PhantomJS.
Other attributes
- height¶
The height of the page in pixels after the URL is opened in a web browser
- width¶
The width of the page in pixels after the URL is opened in a web browser
- gzip¶
Returns the archived HTML as a stream of gzipped data
- archive_filename¶
Returns a file name for this archive using the conventions of storytracker.create_archive_filename().
- hyperlinks¶
A list of all the hyperlinks extracted from the HTML
- images¶
A list of all the images extracted from the HTML
- largest_headline¶
Returns the story hyperlink with the largest area on the page. If there is a tie, returns the one that appears first on the page.
- largest_image¶
The largest image extracted from the HTML
- story_links¶
A list of all the hyperlinks extracted from the HTML that are estimated to lead to news stories.
- summary_statistics¶
Returns a dictionary with basic summary statistics about hyperlinks and images on the page
Analysis methods
- analyze()¶
Opens the URL’s HTML in a web browser and runs all of the analysis methods that use it.
- get_cell(x, y, cell_size=256)¶
Returns the grid cell where the provided x and y coordinates appear on the page. Cells are sized as squares, with 256 pixels as the default.
The value is returned in the style of algebraic notation used in a game of chess.
>>> obj.get_cell(1, 1)
'a1'
>>> obj.get_cell(257, 1)
'b1'
>>> obj.get_cell(1, 513)
'a3'
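The chess-style notation above can be sketched in plain Python. This is a minimal re-implementation for illustration, assuming columns are lettered from the left edge and rows are numbered from the top; it is not storytracker's actual code.

```python
from string import ascii_lowercase

def get_cell(x, y, cell_size=256):
    """Return a chess-style cell label for a page coordinate.

    The page is divided into squares of ``cell_size`` pixels, with
    columns lettered left to right and rows numbered top to bottom.
    """
    column = ascii_lowercase[x // cell_size]  # 0-255 -> 'a', 256-511 -> 'b', ...
    row = (y // cell_size) + 1                # 0-255 -> 1, 256-511 -> 2, ...
    return "%s%s" % (column, row)

print(get_cell(1, 1))    # -> 'a1'
print(get_cell(257, 1))  # -> 'b1'
print(get_cell(1, 513))  # -> 'a3'
```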
- get_hyperlink_by_href(href, fails_silently=True)¶
Returns the Hyperlink object that matches the submitted href, if it exists.
- open_browser()¶
Opens the URL’s HTML in a web browser so it can be analyzed.
- close_browser()¶
Closes the web browser opened to analyze the URL’s HTML
Output methods
- write_hyperlinks_csv_to_file(file, encoding="utf-8")¶
Returns the provided file object with a ready-to-serve CSV list of all hyperlinks extracted from the HTML.
- write_gzip_to_directory(path)¶
Writes gzipped HTML data to a file in the provided directory path
- write_html_to_directory(path)¶
Writes HTML data to a file in the provided directory path
- write_illustration_to_directory(path)¶
Writes out a visualization of the hyperlinks and images on the page as a JPG to the provided directory path.
Example usage:
>>> import storytracker
>>> obj = storytracker.open_archive_filepath('/home/ben/archive/http!www.latimes.com!!!!@2014-07-06T16:31:57.697250.gz')
>>> obj.url
'http://www.latimes.com'
>>> obj.timestamp
datetime.datetime(2014, 7, 6, 16, 31, 57, 697250)
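The CSV output method write_hyperlinks_csv_to_file can be approximated with the standard library. The sketch below is illustrative only: the rows are hypothetical stand-ins for the __csv__ values of real Hyperlink objects, and the helper is not storytracker's actual implementation.

```python
import csv
import io

def write_hyperlinks_csv_to_file(rows, file):
    """Write hyperlink rows to the provided file object as CSV and return it."""
    writer = csv.writer(file)
    for row in rows:
        writer.writerow(row)
    file.seek(0)  # rewind so the caller gets a ready-to-read file
    return file

# Hypothetical stand-ins for Hyperlink.__csv__ values: (href, string, index)
hyperlinks = [
    ("http://www.latimes.com/local/", "Local", 0),
    ("http://www.latimes.com/sports/", "Sports", 1),
]
f = write_hyperlinks_csv_to_file(hyperlinks, io.StringIO())
print(f.read())
```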
ArchivedURLSet¶
A list of ArchivedURL objects.
- class ArchivedURLSet(list)¶
List items added to the set must be unique ArchivedURL objects.
- hyperlinks¶
Parses all of the hyperlinks from the HTML of all the archived URLs and returns a list of the distinct href hyperlinks with a series of statistics attached that describe how they are positioned.
- summary_statistics¶
Returns a dictionary of summary statistics about the whole set of archived URLs.
- print_href_analysis(href)¶
Outputs a human-readable analysis of the submitted href’s position across the set of archived URLs.
- write_href_gif_to_directory(href, path, duration=0.5)¶
Writes out an animation of a hyperlink’s position on the page as a GIF to the provided directory path
- write_hyperlinks_csv_to_file(file, encoding="utf-8")¶
Returns the provided file object with a ready-to-serve CSV list of all hyperlinks extracted from the HTML.
Example usage:
>>> import storytracker
>>> obj_list = storytracker.open_archive_directory('/home/ben/archive/')
>>> obj_list[0].url
'http://www.latimes.com'
>>> obj_list[1].timestamp
datetime.datetime(2014, 7, 6, 16, 31, 57, 697250)
Hyperlink¶
A hyperlink extracted from an ArchivedURL object.
- class Hyperlink(href, string, index, images=[], x=None, y=None, width=None, height=None, cell=None, font_size=None)¶
Initialization arguments
- href¶
The URL the hyperlink references
- string¶
The string contents of the anchor tag
- index¶
The index value of the links order within its source HTML. Starts counting at zero.
- x¶
The x coordinate of the object’s location on the page.
- y¶
The y coordinate of the object’s location on the page.
- width¶
The width of the object’s size on the page.
- height¶
The height of the object’s size on the page.
- cell¶
The grid cell where the provided x and y coordinates appear on the page. Cells are sized as squares, with 256 pixels as the default.
The value is returned in the style of algebraic notation used in a game of chess.
- font_size¶
The size of the font of the text inside the hyperlink.
Other attributes
- __csv__¶
Returns a list of values ready to be written to a CSV file object
- domain¶
The domain of the href
- is_story¶
Returns a boolean estimate of whether the object’s href attribute links to a news story. Guess provided by storysniffer, a library developed as a companion to this project.
Image¶
- class Image(src, x=None, y=None, width=None, height=None, cell=None)¶
An image extracted from an archived URL.
Initialization arguments
- src¶
The src attribute of the image tag
- x¶
The x coordinate of the object’s location on the page.
- y¶
The y coordinate of the object’s location on the page.
- width¶
The width of the object’s size on the page.
- height¶
The height of the object’s size on the page.
- cell¶
The grid cell where the provided x and y coordinates appear on the page. Cells are sized as squares, with 256 pixels as the default.
The value is returned in the style of algebraic notation used in a game of chess.
Analysis methods
- area¶
Returns the area of the image in square pixels
- orientation¶
Returns a string describing the shape of the image.
‘square’ means the width and height are equal
‘landscape’ is a horizontal image with width greater than height
‘portrait’ is a vertical image with height greater than width
None means there are no size attributes to test
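The orientation rules above can be expressed as a small function. This is an illustrative re-implementation, not storytracker's own code.

```python
def orientation(width, height):
    """Classify an image's shape from its width and height in pixels."""
    if width is None or height is None:
        return None  # no size attributes to test
    if width == height:
        return "square"
    if width > height:
        return "landscape"
    return "portrait"

print(orientation(100, 100))   # -> square
print(orientation(640, 480))   # -> landscape
print(orientation(480, 640))   # -> portrait
print(orientation(None, 480))  # -> None
```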
File handling¶
Functions for naming, saving and retrieving archived URLs.
create_archive_filename¶
Returns a string that combines a URL and a timestamp for naming archives saved to the filesystem.
- storytracker.create_archive_filename(url, timestamp)¶
Parameters: - url (str) – The URL of the page that is being archived
- timestamp (datetime) – A timestamp recording approximately when the URL was archived
Returns: A string that combines the two arguments into a structure that can be reversed back into Python objects
Return type: str
Example usage:
>>> import storytracker
>>> from datetime import datetime
>>> storytracker.create_archive_filename("http://www.latimes.com", datetime.now())
'http!www.latimes.com!!!!@2014-07-06T16:31:57.697250'
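Judging from the example output, the convention appears to join the six components of the parsed URL with exclamation points and append an ISO-format timestamp after an @ sign. A rough sketch of that encoding, for illustration only:

```python
from datetime import datetime
from urllib.parse import urlparse

def create_archive_filename(url, timestamp):
    """Encode a URL and timestamp into a filesystem-safe archive name."""
    bits = urlparse(url)
    # Join the six URL components with "!" so "/" never appears in the name
    urlparts = "!".join([
        bits.scheme, bits.netloc, bits.path,
        bits.params, bits.query, bits.fragment,
    ])
    return "%s@%s" % (urlparts, timestamp.isoformat())

name = create_archive_filename(
    "http://www.latimes.com",
    datetime(2014, 7, 6, 16, 31, 57, 697250),
)
print(name)  # http!www.latimes.com!!!!@2014-07-06T16:31:57.697250
```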
open_archive_directory¶
Accepts a directory path and returns an ArchivedURLSet list filled with ArchivedURL objects corresponding to every archived file it finds.
- storytracker.open_archive_directory(path)¶
Parameters: path (str) – The path to a directory containing archived files
Returns: An ArchivedURLSet list
Return type: ArchivedURLSet
Example usage:
>>> import storytracker
>>> obj_list = storytracker.open_archive_directory('/home/ben/archive/')
open_archive_filepath¶
Accepts a file path and returns an ArchivedURL object
- storytracker.open_archive_filepath(path)¶
Parameters: path (str) – The path to the archived file. Its file name must conform to the conventions of storytracker.create_archive_filename().
Returns: An ArchivedURL object
Return type: ArchivedURL
Raises ArchiveFileNameError: If the file’s name cannot be parsed using the conventions of storytracker.create_archive_filename().
Example usage:
>>> import storytracker
>>> obj = storytracker.open_archive_filepath('/home/ben/archive/http!www.latimes.com!!!!@2014-07-06T16:31:57.697250.gz')
open_wayback_machine_url¶
Accepts a URL from the Internet Archive’s Wayback Machine and returns an ArchivedURL object
- storytracker.open_wayback_machine_url(url)¶
Parameters: url (str) – A URL from the Wayback Machine that links directly to an archive. An example is https://web.archive.org/web/20010911213814/http://www.cnn.com/.
Returns: An ArchivedURL object
Return type: ArchivedURL
Raises ArchiveFileNameError: If the URL cannot be parsed.
Example usage:
>>> import storytracker
>>> obj = storytracker.open_wayback_machine_url('https://web.archive.org/web/20010911213814/http://www.cnn.com/')
reverse_archive_filename¶
Accepts a filename created using the rules of storytracker.create_archive_filename() and converts it back into Python objects. Returns a tuple of the URL string and a timestamp. Do not include the file extension when providing a string.
- storytracker.reverse_archive_filename(filename)¶
Parameters: filename (str) – A filename structured using the style of the storytracker.create_archive_filename() function
Returns: A tuple containing the URL of the archived page as a string and a datetime object of the archive’s timestamp
Return type: tuple
Example usage:
>>> import storytracker
>>> storytracker.reverse_archive_filename('http!www.latimes.com!!!!@2014-07-06T16:31:57.697250')
('http://www.latimes.com', datetime.datetime(2014, 7, 6, 16, 31, 57, 697250))
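The reversal can be sketched as the mirror image of the naming convention: split off the timestamp at the @ sign, then rejoin the "!"-separated pieces into a URL. An illustrative re-implementation, assuming a naive ISO timestamp like the one in the example:

```python
from datetime import datetime
from urllib.parse import urlunparse

def reverse_archive_filename(filename):
    """Split an archive filename back into its URL and timestamp."""
    urlparts, timestring = filename.rsplit("@", 1)
    # The six "!"-separated pieces map back onto a parsed-URL tuple
    url = urlunparse(urlparts.split("!"))
    timestamp = datetime.strptime(timestring, "%Y-%m-%dT%H:%M:%S.%f")
    return url, timestamp

print(reverse_archive_filename("http!www.latimes.com!!!!@2014-07-06T16:31:57.697250"))
# ('http://www.latimes.com', datetime.datetime(2014, 7, 6, 16, 31, 57, 697250))
```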
reverse_wayback_machine_url¶
Accepts a URL from the Internet Archive’s Wayback Machine and returns a tuple with the archived URL string and a timestamp.
- storytracker.reverse_wayback_machine_url(url)¶
Parameters: url (str) – A URL from the Wayback Machine that links directly to an archive. An example is https://web.archive.org/web/20010911213814/http://www.cnn.com/.
Returns: A tuple containing the URL of the archived page as a string and a datetime object of the archive’s timestamp
Return type: tuple
Example usage:
>>> import storytracker
>>> storytracker.reverse_wayback_machine_url('https://web.archive.org/web/20010911213814/http://www.cnn.com/')
('http://www.cnn.com/', datetime.datetime(2001, 9, 11, 21, 38, 14))
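A Wayback Machine archive URL embeds a 14-digit timestamp (YYYYMMDDHHMMSS) between the /web/ prefix and the archived URL, so the reversal reduces to string splitting. A minimal sketch, not storytracker's actual implementation:

```python
from datetime import datetime

def reverse_wayback_machine_url(url):
    """Split a Wayback Machine URL into the archived URL and its timestamp."""
    # e.g. https://web.archive.org/web/20010911213814/http://www.cnn.com/
    prefix = "https://web.archive.org/web/"
    timestring, archived_url = url[len(prefix):].split("/", 1)
    timestamp = datetime.strptime(timestring, "%Y%m%d%H%M%S")
    return archived_url, timestamp

print(reverse_wayback_machine_url(
    "https://web.archive.org/web/20010911213814/http://www.cnn.com/"
))
# ('http://www.cnn.com/', datetime.datetime(2001, 9, 11, 21, 38, 14))
```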