DMOZ Internet Directory : Computers : Data_Formats : Archive : WARC : Software

Presented by DMOZLive.com

Tools and utilities for writing, reading, inspecting and managing WARC files.

Sites

GitHub: cc-warc-examples - CommonCrawl WARC/WET/WAT examples and processing code.
GitHub: warc-mapreduce - WARC and WET support for Hadoop's MapReduce API.
GitHub: warc-tools - Miscellaneous tools for processing WARC files from the CommonCrawl.
Heritrix - The Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
GitHub: WarcMiddleware - Lets you download a mirror copy of a website when running a web crawl with the Python web crawler Scrapy.
GitHub: WarcProxy - Saves proxied HTTP traffic to a WARC file.
GitHub: WarcMITMProxy - HTTP(S) proxy that saves traffic to a WARC file, using libmitmproxy.
GitHub: Alard/warc-proxy - Viewer for browsing the contents of a WARC file.
GitHub: Megawarc - Nondestructive warc-in-tar to warc conversion.
GitHub: warctozip-service - An HTTP-based warc-to-zip converter.
GitHub: archiveteam-megawarc-factory - Scripts to bundle Archive Team uploads and upload them to Archive.org.
GitHub: CDX-Writer - Python script to create CDX index files of WARC data.
GitHub: Heritrix-Cassandra - A library for writing Heritrix output directly to Cassandra.
GitHub: python-heritrix - Simple Python wrapper around the Heritrix API.
WARCreate - Extension that allows a user to create a Web ARChive (WARC) file from any browseable webpage. The resulting files can then be used with other tools like the Internet Archive's open source Wayback Machine.
GitHub: Wpull - Wget-compatible web downloader and crawler.
Java Web Archive Toolkit (JWAT) - A package to read and validate WARC, ARC and GZip files.
SiteStory - Transactional archiving: selectively captures and stores transactions that take place between a web client (browser) and a web server.
NetarchiveSuite - A complete web archiving package whose primary function is to plan, schedule and run web harvests of parts of the Internet. It is built around the Heritrix web crawler.
GitHub: WarcQtViewer - UI to view and manage .warc and .warc.gz files.
WarcManager - Database web application that indexes a collection of WARC data and provides a browsing and search interface to it.
DeDuplicator (Heritrix Add-on) - An add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.
IIPC: Open Wayback Development - Landing site for open source Wayback development.
Web Archiving Integration Layer (WAIL) - A graphical user interface (GUI) atop multiple web archiving tools, intended as an easy way for anyone to preserve and replay web pages.
WARCAT - Python tool and library for handling Web ARChive (WARC) files.

This directory is made available through a Creative Commons Attribution license from the DMOZ Organization.