DMOZ Internet Directory
Presented by
DMOZLive.com
Computers: Data Formats: Archive: WARC: Software (25 Sites)
Tools and utilities for writing, reading, inspecting and managing WARC files.
Sites
GitHub: cc-warc-examples
- CommonCrawl WARC/WET/WAT examples and processing code.
GitHub: warc-mapreduce
- WARC and WET support for Hadoop's MapReduce API.
GitHub: warc-tools
- Miscellaneous tools for processing WARC files from the CommonCrawl.
Heritrix
- The Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
GitHub: WarcMiddleware
- Lets you download a mirror copy of a website when running a crawl with the Python web crawler Scrapy.
GitHub: WarcProxy
- Saves proxied HTTP traffic to a WARC file.
GitHub: WarcMITMProxy
- HTTP(S) proxy that saves traffic to a WARC file, using libmitmproxy.
GitHub: Alard/warc-proxy
- Viewer for browsing the contents of a WARC file.
GitHub: Megawarc
- Non-destructive WARC-in-tar to WARC conversion.
GitHub: warctozip-service
- An HTTP-based WARC-to-ZIP converter.
GitHub: archiveteam-megawarc-factory
- Scripts to bundle Archive Team uploads and upload them to Archive.org.
GitHub: CDX-Writer
- Python script to create CDX index files of WARC data.
GitHub: Heritrix-Cassandra
- A library for writing Heritrix output directly to Cassandra.
GitHub: python-heritrix
- Simple Python wrapper around the Heritrix API.
WARCreate
- Extension that lets a user create a Web ARChive (WARC) file from any browsable webpage. The resulting files can then be used with other tools, such as the Internet Archive's open-source Wayback Machine.
GitHub: Wpull
- Wget-compatible web downloader and crawler.
Java Web Archive Toolkit (JWAT)
- A package to read and validate WARC, ARC and GZip files.
SiteStory
- Transactional archiving: selectively captures and stores the transactions that take place between a web client (browser) and a web server.
NetarchiveSuite
- A complete web archiving package whose primary function is to plan, schedule and run web harvests of parts of the Internet. It is built around the Heritrix web crawler.
GitHub: WarcQtViewer
- UI to view and manage .warc and .warc.gz files.
WarcManager
- Database-backed web application that indexes a collection of WARC data and provides a browsing and search interface to it.
DeDuplicator (Heritrix Add-on)
- An add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.
IIPC: Open Wayback Development
- Landing site for open source Wayback development.
Web Archiving Integration Layer (WAIL)
- A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages.
WARCAT
- Python tool and library for handling Web ARChive (WARC) files.
This directory is made available through a Creative Commons Attribution license from the DMOZ Organization.