TubeKit - A Query-based YouTube Crawling Toolkit
School of Information and Library Science (SILS)
University of North Carolina
Chapel Hill, NC 27599
Realizing the importance of capturing contextual information for digital video preservation, we proposed a model of ContextMiner (Shah & Marchionini, 2007a) and demonstrated how to materialize this concept (Shah & Marchionini, 2007b). Extending this notion to enable a digital curator to work with a dynamic collection, we decided to use US presidential election videos from YouTube (Shah & Marchionini, 2007c) to understand the role of contextual information in analyzing and explaining various collection development issues for a digital library or an archive. To obtain videos, metadata, and contextual information from YouTube, we built a crawler. This crawler is query-based: it uses a set of seed queries to search YouTube and obtains the ranked list of the top 100 videos for each query. The crawler then collects a set of attributes for each video. Some of them are static, such as title, description, and tags, and are considered metadata; the others are dynamic, such as views, comments, and ratings, and are considered contextual information. We decided to collect contextual information for the videos in our collection every day. This crawler has been running since May 2007.
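The distinction between static metadata and dynamic contextual information can be illustrated with a minimal Python sketch. It is not TubeKit's code (the toolkit itself is written in PHP), and every name in it is hypothetical: `VideoRecord`, `discover`, and `snapshot` are illustrative, and `search` and `fetch_stats` are placeholder callables standing in for whatever YouTube access layer is used. The sketch stores metadata once per video and adds one dated snapshot of the dynamic attributes on each daily pass.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, Dict, Iterable, List

# Attribute names come from the text; all other names are assumptions.
STATIC_FIELDS = ("title", "description", "tags")      # metadata
DYNAMIC_FIELDS = ("views", "comments", "ratings")     # contextual information
TOP_N = 100  # the crawler keeps the top 100 results per seed query


@dataclass
class VideoRecord:
    video_id: str
    metadata: Dict[str, object]  # static attributes, stored once
    # Dynamic attributes, one entry per crawl date.
    snapshots: Dict[date, Dict[str, object]] = field(default_factory=dict)


def discover(
    queries: Iterable[str],
    search: Callable[[str], List[dict]],  # hypothetical: returns ranked results
    collection: Dict[str, VideoRecord],
) -> None:
    """Add unseen videos from the top results of each seed query."""
    for query in queries:
        for raw in search(query)[:TOP_N]:
            if raw["id"] not in collection:
                metadata = {key: raw[key] for key in STATIC_FIELDS}
                collection[raw["id"]] = VideoRecord(raw["id"], metadata)


def snapshot(
    collection: Dict[str, VideoRecord],
    fetch_stats: Callable[[str], dict],  # hypothetical: current stats for a video
    today: date,
) -> None:
    """Record today's dynamic attributes for every video in the collection."""
    for video_id, record in collection.items():
        stats = fetch_stats(video_id)
        record.snapshots[today] = {key: stats[key] for key in DYNAMIC_FIELDS}
```

Running `discover` and then `snapshot` once per day would yield a time series of views, comments, and ratings for each video, while title, description, and tags are kept as they were when the video first entered the collection.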
In the time that followed, we found several other topics for which we needed to collect similar data from YouTube. This need drove us to create a broader, more general framework that would allow us to build query-based YouTube crawlers for any topic.
We present TubeKit, a toolkit for creating YouTube crawlers. It allows one to build one's own crawler that crawls YouTube based on a set of seed queries and collects up to 17 different attributes at regular intervals. TubeKit assists in all phases of this process, from database creation to preparing analysis reports from the collected data.
The typical steps for creating and setting up a YouTube crawler with TubeKit are as follows.
- Provide basic information (YouTube developer ID, project name, etc.).
- Set up the database. TubeKit can create the database and its tables according to the user's preferences.
- Select up to 17 different attributes to collect for a YouTube video (Figure 1).
- Set up various schedules for crawling.
- Access the crawler and enter seed queries.
The crawler, including the settings, collection, and data analysis, can be accessed from anywhere with a browser.
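The information gathered in the setup steps above can be pictured as a single configuration record. The following Python sketch is an illustration only and does not reflect TubeKit's actual PHP implementation or its setup forms: `CrawlerConfig` and all of its field names are hypothetical, and the example attribute names are limited to those mentioned in the text. The only constraint taken from the source is the limit of 17 attributes.

```python
from dataclasses import dataclass, field
from typing import List

MAX_ATTRIBUTES = 17  # limit stated in the text


@dataclass
class CrawlerConfig:
    # All field names are hypothetical; they mirror the setup steps.
    developer_id: str            # YouTube developer ID
    project_name: str
    database_name: str
    attributes: List[str]        # attributes to collect per video
    crawl_interval_hours: int = 24
    seed_queries: List[str] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if the configuration is unusable."""
        if not self.developer_id or not self.project_name:
            raise ValueError("developer ID and project name are required")
        if not self.database_name:
            raise ValueError("a database name is required")
        if not 1 <= len(set(self.attributes)) <= MAX_ATTRIBUTES:
            raise ValueError(
                f"select between 1 and {MAX_ATTRIBUTES} distinct attributes"
            )
        if self.crawl_interval_hours <= 0:
            raise ValueError("crawl interval must be positive")


# Example with placeholder values; seed queries are entered last.
config = CrawlerConfig(
    developer_id="DEV-ID-PLACEHOLDER",
    project_name="election_videos",
    database_name="election_videos_db",
    attributes=["title", "description", "tags", "views", "comments", "ratings"],
)
config.seed_queries.append("presidential election")
config.validate()
```

Keeping the seed queries separate from the rest of the settings mirrors the order of the steps: the crawler is configured first, and queries are added once it is accessible.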
The toolkit is implemented primarily in PHP and released as open source to the research community under a Creative Commons license. It is available as a free download at http://www.tubekit.org. We believe this toolkit will be useful to researchers in the digital library domain, both for building digital video collections and for analyzing social patterns in such dynamic collections.
Shah, C. and Marchionini, G. (2007a). Capturing relevant information for digital curation. In IEEE ACM Joint Conference on Digital Libraries (JCDL), page 496.
Shah, C. and Marchionini, G. (2007b). ContextMiner: A tool for digital library curators. In IEEE ACM Joint Conference on Digital Libraries (JCDL), page 514.
Shah, C. and Marchionini, G. (2007c). Preserving 2008 US Presidential Election Videos. In International Web Archiving Workshop (IWAW).
Figure 1: A portion of TubeKit's new crawler setup page