Table of Contents

  1. Introduction
    1. About navigating
  2. Data collection
    1. Scraping
    2. Capturing and exploring data
  3. Data organization and maintenance
    1. Data reading
    2. Data maintenance
  4. Data analysis and visualization
    1. Visualization
    2. Analysis
  5. Sources / see more
    1. University of North Carolina Digital Humanities Tools list
    2. Duke University Digital Humanities Tools list
    3. DHtech’s Awesome Digital Humanities tools list
    4. University of Amsterdam Digital Methods Initiative’s tool database
    5. Sciences Po médialab tools
    6. dbohan’s Awesome Structured Text Tools list


A list of digital tools, cribbed from a bunch of resources (listed in section 5, “Sources / see more”) and put together. Created in collaboration with Dr. Greg Elmer.

About navigating

This document is organized to follow the flow of a typical digital methods research project: data collection first, then data organization and maintenance, then analysis and visualization. If you’re crunched for time, your best bet is probably to search for a keyword (if you’re reading this in a browser, something like Ctrl+F or Cmd+F should pull up a search box; if you’re reading this outside of a browser somehow, you probably know how to grep for text).

Data collection


This PHP script allows you to enter one or more ASINs and crawl their recommendations up to a user-specified depth.

Amazon Book Explorer

Provides different analytics for Amazon’s book search.

App Tracker explorer

The DMI’s App Tracker Tracker is a tool to detect a set of predefined fingerprints of known tracking technologies or other software libraries.

.csv Get

Scrape elements from a website and generate a .csv file.

Use: grab select data like headlines, categories, etc.

Disqus Comment Scraper

This tool scrapes threads and comments from websites implementing the Disqus commenting system.

Github organizations meta-data lookup

Extract the meta-data of organizations on Github

Github repositories meta-data lookup

Extract the meta-data of Github repositories

Github repositories scraper

Scrape Github for forks of projects

Github scraper

Scrape Github for user interactions and user to repository relations

Github user meta-data lookup

Extract meta-data about users on Github


Find out which users contributed source code to Github repositories

Google Autocomplete

Retrieves autocomplete suggestions from Google

Google Image Scraper

Query Google Images with one or more keywords, and/or use it to query specific sites for images.

Google Play Similar Apps

DMI Google Play Similar Apps is a simple tool to extract the details of individual apps, collect ‘Similar’ apps, and extract their details.

Google Reverse Image scraper

Scrape Google for occurrences of images.

Googlescraper (Lippmannian Device)

Batch queries Google. Query the resonance of a particular term, or a series of terms, in a set of websites.

Image Scraper

Scrape images from a single page.

Instagram Loader

Easily scrape images from Instagram based on hashtag, location, or user data. If the website asks you for a login, try from a different internet connection.

Scrapes links from the Wayback Machine

Internet Archive Wayback Machine Network Per Year

Enter a set of URLs and the archived versions closest to 1 July for a specific year are retrieved. Thereafter links are extracted and a network file is output.

iTunes Store

Query the iTunes store and grab both tabular and .gdf data regarding results.

News Agencies Scraper

A basic scraper that searches various news agencies for particular keywords and extracts titles, images, dates, and full text.


Octoparse

An all-in-one solution for scraping websites, including the ability to scrape platform pages. Closed source, paid, and requires a sign-up, although the website offers a 14-day demo trial.

  1. Using Octoparse for Instagram

    Octoparse provides a tutorial for scraping Instagram. It can be found on their website.

Search Engine Scraper

A browser extension that allows you to build scrapers, scrape websites, and export data in .csv format. Closed-source, but the browser extension is free.

Wikipedia TOC Scraper

Scrape the table of contents for revisions of a Wikipedia page and explore the results by moving a slider to browse across chronologically ordered TOCs.

Wikipedia categories scraper

Scrape Wikipedia for the categories of articles and the categories of related articles in different languages.

Wikipedia Edits Scraper and IP Localizer

Scrapes a Wikipedia page’s edit history and geolocates the IP addresses of anonymous edits.

YouTube Comment Scraper

Scrape comments from YouTube pages.

Use: uh… scrape comments from YouTube pages.

Capturing and exploring data

4CAT: Capture and Analysis Toolkit

Create datasets from web forums such as 4chan and Reddit and perform textual analysis on the resulting datasets. Login required.

Censorship Explorer

Check whether a URL is censored in a particular country by using proxies located around the world.

Expand Tiny Urls

Expands URLs that have been shortened by URL-shortening services.

Geo IP

Translates URLs or IP addresses into geographical locations

Infoscapelab DMi-TCAT

Login required; contact me at ab {at}

An instance of the University of Amsterdam’s Twitter Capture and Analysis toolkit accessible to Ryerson students.

Capture all internal links and/or outlinks from a page.

Robots.txt Discovery

Display a site’s robot exclusion policy.
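
For a rough idea of what reading a robot exclusion policy involves, here is a minimal sketch using Python’s standard-library urllib.robotparser. The sample policy, the example.org URLs, the "my-crawler" user agent, and the is_allowed helper are all made up for illustration; this is not the tool’s own code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy text; a real one is served at https://<site>/robots.txt
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given policy lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the policy as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/private/report.html"):
        url = "https://example.org" + path
        print(path, is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", url))
```

To check a live site instead, RobotFileParser also offers set_url() and read(), which download the policy before you call can_fetch().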

Screenshot generator

Produce screenshots for a list of URLs

Loads a URL and searches for patterns in the page’s source code.

Text Ripper

Rip all non-HTML content (i.e. text) from a specified page.

Timestamp Ripper

Rips and displays a web page’s last modification date (using the page’s HTML header). Beware of dynamically generated pages, where the date stamps will be the time of retrieval.
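
If you want to do something similar by hand, the sketch below assumes the date arrives in an HTTP Last-Modified response header (an assumption about how such a tool works, not a description of its code) and parses it with Python’s standard library. The last_modified helper and the sample headers are hypothetical.

```python
from email.utils import parsedate_to_datetime


def last_modified(headers):
    """Parse the Last-Modified value from a dict of response headers.

    Returns a datetime, or None when the header is absent.
    """
    # Header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    raw = normalized.get("last-modified")
    if raw is None:
        return None
    return parsedate_to_datetime(raw)


if __name__ == "__main__":
    sample_headers = {
        "Content-Type": "text/html",
        "Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT",
    }
    print(last_modified(sample_headers))
```

As the caveat above says, a dynamically generated page may report the retrieval time here, or omit the header entirely, so treat the result as a hint rather than proof.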


Enter two or more lists of URLs or other items to discover commonalities among them. Possible visualizations include a Venn Diagram.

Netvizz Tumblr toolkit

Analyze co-hashtags and other basic text information from Tumblr posts.


Search recent tweets and analyze them.

Use: if you want a quick analysis that the TCAT doesn’t provide.

Wikipedia Cross-Lingual Image Analysis

Makes the images of all language versions of a Wikipedia article comparable.

Wikipedia Entry Check

This tool checks if the issues exist as a Wikipedia page, i.e., an article. If it exists it checks whether the organization is mentioned on that page.

YouTube Data Tools

A collection of simple tools for extracting data from the YouTube platform via the YouTube API v3.

Data organization and maintenance

Data reading


Convert .csv data to .json and vice-versa.

Use: much API data is returned as .json-formatted files.
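
If you would rather do the conversion yourself, a minimal sketch with Python’s built-in csv and json modules might look like the following. It assumes the .csv has a header row and the .json is a flat array of objects; the function names are made up for this example.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Turn CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


def json_to_csv(json_text):
    """Turn a JSON array of flat objects back into CSV text."""
    rows = json.loads(json_text)
    if not rows:
        return ""
    out = io.StringIO()
    # Column order is taken from the first object.
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    sample = "id,title\n1,Hello\n2,World\n"
    as_json = csv_to_json(sample)
    print(as_json)
    print(json_to_csv(as_json))
```

Note that every value comes out of the .csv as a string; nested .json (objects inside objects) needs flattening before it fits into a table.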

.csv to Table

Convert .csv files to searchable and sortable HTML table.

Use: visualize and analyze data formatted in .csv

eBay’s tabular data file utilities

Analyze data saved with tab delimiters, as opposed to the standard comma. Yes, it’s that eBay.

Use: perform maintenance and reading on tab-separated files.


Gain access to the Python programming language’s variety of tools and libraries to perform analysis on .csv, .json, .html files and more.

Use: pretty much any analysis and conversion under the sun; it’s a powerful toolkit but requires reading the documentation to figure out your own use-case.


A Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

Use: analyze graphs and networks using Python.


A web-based data management, network analysis & visualisation environment.

Use: an all-in-one suite for analyzing, managing and graphing data.


A browser-based tool that allows you to parse and analyze .csv data.

Use: look for basic patterns and characteristics of a .csv.


A command-line toolkit to analyze and investigate .csv files.

Use: easily find out things like frequencies of data, different values, and correlations.


Tad is a desktop application for viewing and analyzing tabular data such as .csv files.

Use: easily create “pivot tables” to analyze your data, among other csv functions.


A collection of coding tools, mostly in Python, to analyze text.

Use: worth exploring to find programming examples for the analysis of text. Many use-cases in the repository.

Data maintenance

Ron’s .csv Editor

Deal with massive .csv files easily.

Use: organize, read, and analyze .csv files that would normally crash a spreadsheet program.


Remove bad lines from a .csv file and normalize the rest.

Use: .csv files exported from SQL databases sometimes contain errors, and many tools here, such as the YouTube Data Tools and the Twitter Capture and Analysis Toolkit, export their data this way. This tool discards the error-ridden rows so that you can read the files.
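
To make the idea concrete, here is a small Python sketch of the same kind of cleanup. It assumes that a “bad” row is one whose number of fields differs from the header’s, and that “normalizing” means trimming stray whitespace; the actual tool may apply different rules, and drop_bad_rows is a hypothetical name.

```python
import csv
import io


def drop_bad_rows(csv_text):
    """Return (cleaned_csv_text, number_of_dropped_rows).

    Assumption: a row is bad when its field count differs from the header's.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return "", 0  # empty input, nothing to clean

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([name.strip() for name in header])

    dropped = 0
    for row in reader:
        if len(row) != len(header):
            dropped += 1
            continue
        # Normalize the surviving rows by trimming surrounding whitespace.
        writer.writerow([cell.strip() for cell in row])
    return out.getvalue(), dropped


if __name__ == "__main__":
    sample = "id,text\n1, hello \n2,too,many,fields\n3\n4,ok\n"
    cleaned, dropped = drop_bad_rows(sample)
    print(cleaned)
    print("dropped rows:", dropped)
```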


A tool for cleaning data, transforming it from one format into another, and extending it with web services and external data. OpenRefine can be used to scrape data from websites or convert data between formats. It also lets you save the processing steps to a file that can be loaded back into the tool later, making it easy to repeat the process on a different set of data.

Data analysis and visualization


Bubble Lines

Input tags and values to produce relatively sized bubbles. Output is an SVG.


Generate word clouds and see word correlations in a given text. Calls itself the “not-so-pretty cousin of Wordle” (below).

Use: basic text analysis of word frequencies, along with visualization.

Colors For Data Scientists

Generate and refine palettes of optimally distinct colors. (by Sciences-Po)


Datawrapper allows users to create a variety of basic charts and graphs using submitted tabular data.


Replicates the tags in a tag cloud by their value

Dorling Map Generator

Input tags and values to produce a Dorling Map (i.e. bubbles). Output is an SVG.

Chronos Timeline

Chronos allows scholars and students to dynamically present historical data in a flexible online environment.

Lippmannian Device To Gephi

This tool allows one to visualize the output of the Lippmannian device as a network with Gephi.

Raw Text to Tag Cloud Engine

Takes raw text, counts the words and returns an ordered, unordered or alphabetically ordered tagcloud.
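
The counting step behind a tag cloud is simple enough to sketch in a few lines of Python. This is an illustration under assumptions, not the engine’s code: a “word” here is any run of letters, digits, or underscores, everything is lowercased, and tag_counts is a made-up name.

```python
import re
from collections import Counter


def tag_counts(text, order="frequency"):
    """Count words and return (word, count) pairs in the requested order.

    order: "frequency" (most frequent first), "alphabetical", or "unordered".
    """
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    if order == "alphabetical":
        return sorted(counts.items())
    if order == "frequency":
        # Ties are broken alphabetically so the output is stable.
        return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(counts.items())  # "unordered": order of first appearance


if __name__ == "__main__":
    for word, count in tag_counts("the cat and the hat"):
        print(word, count)
```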

Rawgraphs is an online tabular data processing program that allows users to create advanced charts and graphs using submitted tabular data.

Scene

Create a multimedia story told through 3D “VR” tools.


Create a graph out of the “see also” networks between given Wikipedia pages.


A collection of free, open-source web widgets, mostly for data visualizations.


Stitch together audio from various sources and embed it within a readable text.


Easy-to-use tool to build an annotated, interactive line chart.


Create a narrative, sequential story that moves through locations on a map.

Table to Net

Extract a network from a table. Set a column for nodes and a column for edges. It deals with multiple items per cell. (by Médialab Sciences-Po)
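
One possible reading of that description, sketched in Python: take one column as the source of each link and another as the target, and split cells that hold several items. The “;” separator, the column names in the sample, and table_to_edges itself are assumptions for illustration; the real tool offers more network types than this.

```python
import csv
import io


def table_to_edges(csv_text, source_col, target_col, sep=";"):
    """Return sorted, de-duplicated (source, target) pairs from a CSV table.

    Cells may hold several items separated by `sep`; each item gets its own edge.
    """
    edges = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        sources = [s.strip() for s in row[source_col].split(sep) if s.strip()]
        targets = [t.strip() for t in row[target_col].split(sep) if t.strip()]
        for source in sources:
            for target in targets:
                edges.add((source, target))
    return sorted(edges)


if __name__ == "__main__":
    sample = "author,keywords\nAda,maps; networks\nBen,networks\n"
    for source, target in table_to_edges(sample, "author", "keywords"):
        print(source, "->", target)
```

The resulting pairs can be written out as a two-column edge list, which network tools such as Gephi can import.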

Tag Cloud Combinator

Enter two or more tag clouds and the values of each tag will be summed.
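
Summing tag clouds amounts to adding up the values per tag. A minimal Python sketch, assuming each cloud is a mapping of tag to number (combine_tag_clouds and the sample tags are hypothetical):

```python
from collections import Counter


def combine_tag_clouds(*clouds):
    """Sum the value of each tag across any number of tag clouds."""
    total = Counter()
    for cloud in clouds:
        # Counter.update adds counts instead of overwriting them.
        total.update(cloud)
    return dict(total)


if __name__ == "__main__":
    first = {"privacy": 3, "data": 1}
    second = {"data": 4, "web": 2}
    print(combine_tag_clouds(first, second))
```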

Tag Cloud Generator

Input tags and values to produce a tag cloud. Output is in SVG.

Tag Cloud HTML Generator

Input tags and values in wordle format to produce an HTML tag cloud or tag list.

Tag Cloud To Wordle

This tool allows one to transform a normal tag cloud into a fancy Wordle one.


A web-based timeline builder


Create a visually-appealing annotated timeline.


A tool for creating timelines which can be added to a website or blog.


Create reusable, static, embeddable maps from OpenStreetMap data.


A platform that helps you create customized “views” such as interactive maps and timelines.


An interactive, command-line tool for analyzing and visualizing tabular data.

Use: get quick visualizations and perform other data-scientific methods on tabular data.


Generate word clouds (clouds of words that size the words based on frequency) for a given text.

Use: visualize frequency of words in a given corpus.


Compare Lists

Compare two lists of URLs for their commonalities and differences.
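
The same comparison is a few set operations in Python. The sketch below is only an illustration (compare_lists is a made-up name) and compares URLs as exact strings, so "example.org" and "example.org/" count as different entries unless you normalize them first.

```python
def compare_lists(first, second):
    """Report which items two lists share and which appear in only one."""
    a, b = set(first), set(second)
    return {
        "common": sorted(a & b),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
    }


if __name__ == "__main__":
    list_a = ["https://example.org", "https://example.com"]
    list_b = ["https://example.com", "https://example.net"]
    for label, urls in compare_lists(list_a, list_b).items():
        print(label, urls)
```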


An IPython notebook that walks the user through performing complex sentiment analysis of passages like tweets. You can download the notebook and run it yourself (which requires JupyterLab), or read the text for an example.

Use: learn how to use python for sentiment analysis; perform sentiment analysis on texts.


Gephi is a visualization and exploration software for all kinds of graphs and networks.

Use: analyze the .gdf and .gxml files returned by many scraping and collection tools. The most robust tool available, but sometimes slow and hard to configure; an online alternative is Polinode, below.


Extract URLs from text, source code or search engine results. Produces a clean list of URLs.
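
A rough do-it-yourself version can be built on a regular expression. The pattern below is a deliberately simple assumption (http/https only, stopping at whitespace, quotes, and angle brackets), and extract_urls is a hypothetical name; it will also trim a closing parenthesis that legitimately ends some URLs.

```python
import re

# Simplified pattern: http(s) URLs up to the next space, quote, or angle bracket.
URL_PATTERN = re.compile(r"""https?://[^\s<>"']+""")


def extract_urls(text):
    """Return the unique URLs found in text, in order of first appearance."""
    found = []
    for match in URL_PATTERN.findall(text):
        # Drop punctuation that usually belongs to the sentence, not the URL.
        url = match.rstrip(".,;:!?)")
        if url and url not in found:
            found.append(url)
    return found


if __name__ == "__main__":
    sample = (
        'See https://example.org/a, and <a href="http://example.com/b?x=1">link</a>. '
        "Again https://example.org/a."
    )
    for url in extract_urls(sample):
        print(url)
```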


Easily compare two images within a frame.

Language Detection

Detects language for given URLs. The first 1000 characters on the Web page(s) are extracted, and the language of each page is detected.

Lippmannian Device

The Lippmannian device is named after Walter Lippmann, and provides a coarse means of showing actor partisanship.


NodeXL is a plugin to Microsoft Excel that allows you to visualize and analyze data beyond what the program has normally built in.

Use: visualize and analyze data using Microsoft Excel (although for a faster, lighter, and free alternative, see LibreOffice).


Various analyses of historical data in tabular format.


Login required

Polinode is an online tool that allows for the opening and basic manipulation of .gdf files.

Use: analyze the .gdf and .gxml files returned by many scraping and collection tools. An online tool that is not as powerful as Gephi, above, but easier to understand and get started with.

Rip Sentences

Rip text from a specified page and force line breaks between sentences.
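
The line-breaking step can be approximated with a regular expression. This sketch assumes the page text has already been extracted and that a sentence ends at ".", "!" or "?" followed by a space, so abbreviations such as "Dr." will be split too; rip_sentences is a made-up name.

```python
import re


def rip_sentences(text):
    """Return the text with one sentence per line."""
    # Collapse runs of whitespace (including existing line breaks) first.
    flat = " ".join(text.split())
    if not flat:
        return ""
    # Split at a space that directly follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?]) ", flat)
    return "\n".join(sentences)


if __name__ == "__main__":
    print(rip_sentences("It rained.  Did it\nstop? Yes! Good."))
```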


Table 2 Net

Parse tabular data for relationships and convert them into a network.

TLD counts

Enter URLs, and count the top-level domains.
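
Counting top-level domains by hand takes only the standard library. In this Python sketch the TLD is assumed to be whatever follows the last dot of the host name (so "site.co.uk" counts as "uk", and bare IP addresses are not handled specially); tld_counts is a hypothetical name.

```python
from collections import Counter
from urllib.parse import urlparse


def tld_counts(urls):
    """Count the last label of each URL's host name."""
    counts = Counter()
    for url in urls:
        # urlparse only fills in the host when a scheme or "//" is present.
        parsed = urlparse(url if "//" in url else "//" + url)
        host = parsed.hostname
        if host and "." in host:
            counts[host.rsplit(".", 1)[-1]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "https://example.org/a",
        "http://news.example.com",
        "example.org/b",
        "https://site.co.uk",
    ]
    for tld, count in tld_counts(sample).most_common():
        print(tld, count)
```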


A web-based tool that provides text reading and basic analysis based on copy-pasted text.

Sources / see more

University of North Carolina Digital Humanities Tools list

Duke University Digital Humanities Tools list

DHtech’s Awesome Digital Humanities tools list

University of Amsterdam Digital Methods Initiative’s tool database

Sciences Po médialab tools

dbohan’s Awesome Structured Text Tools list