This tutorial explains what ArchiveBox is, how to install it on Linux, and how to self-host your own personal Internet Archive with it.
Disclaimer: All information given here is strictly for educational purposes. Neither the author nor the OSTechNix team is responsible for any damage done to target sites, such as bandwidth abuse or the downloading of copyrighted or illegal content.
Introduction
The Internet Archive’s Wayback Machine (IAWM) is the largest and oldest public Web archive.
As of this writing, the Internet Archive Wayback Machine (archive.org) has captured more than 778 billion web pages and stores data on the order of petabytes.
Most users come to Archive.org because they do not find the requested pages on the live Web. About 65 % of the requested archived pages no longer exist on the live Web. Thanks to Archive.org, we can still access and view the old and defunct websites.
While archive.org preserves a huge number of web resources, some of you may want to host your own personal, private internet archive on your own server. This is where ArchiveBox comes in.
What is ArchiveBox?
ArchiveBox is a free, open source, and powerful Internet archiving solution for collecting and saving your favorite websites so that you can view or read them offline.
You can feed it a single URL or schedule imports from your browser bookmarks, browser history, plain text, HTML, markdown, feeds like RSS, bookmark services like Pocket/Pinboard and more!
ArchiveBox saves the snapshot of the given URLs in several output formats such as HTML, JSON, PDF, PNG screenshots, WARC, and more!
By default, ArchiveBox also submits the URLs you archive to archive.org for redundancy. You can disable this if you want a local-only archive.
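Once ArchiveBox is installed (the installation methods are covered later in this guide), the archive.org submission can be switched off with the `SAVE_ARCHIVE_DOT_ORG` configuration option. The following is only a minimal sketch: the `disable_archive_org` helper and the `ARCHIVEBOX` variable are hypothetical names introduced here, and the command is assumed to be run from inside an initialized archive directory.

```shell
# Hypothetical helper: switch the current collection to local-only mode.
# ARCHIVEBOX defaults to the real binary; set ARCHIVEBOX=echo for a dry run
# that only prints the arguments instead of changing the configuration.
disable_archive_org() {
    "${ARCHIVEBOX:-archivebox}" config --set SAVE_ARCHIVE_DOT_ORG=False
}

# Usage (from inside ~/archivebox):
#   disable_archive_org
```

The setting applies to the collection in the current directory, so each separate archive directory would need it set on its own.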
ArchiveBox is available as a command-line tool, a web app, and a desktop app. It is cross-platform and supports GNU/Linux, macOS, and Windows. ArchiveBox is written in Python and the source code is available on GitHub.
Install ArchiveBox in Linux
ArchiveBox can be installed in several ways: with Docker-compose, with Docker, with the automated setup script, with pip, or from the Apt repository.
First of all, make sure you have installed Docker and Docker-compose as shown in any one of the following links.
- Install Docker Engine And Docker Compose In AlmaLinux, CentOS, Rocky Linux
- How to Install Docker Engine And Docker Compose In Ubuntu 22.04 LTS
- Setup Docker And Docker Compose With DockSTARTer
Install ArchiveBox using Docker-compose
The officially recommended way to install ArchiveBox is to use Docker-compose.
After installing Docker-compose, create a directory for ArchiveBox and download the docker-compose.yml file into it:
$ mkdir ~/archivebox && cd ~/archivebox
$ curl -O 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/docker-compose.yml'
Run the initial setup and create an admin user by running the following command:
$ docker-compose run archivebox init --setup
Finally, start the ArchiveBox server using command:
$ docker-compose up
Now you can log in to the ArchiveBox admin Web UI dashboard at http://127.0.0.1:8000.
Logging in to the Web UI is completely optional. You can do everything from the command line as well.
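With the Docker-compose setup, command-line work means running ArchiveBox subcommands inside the container, in the same way the `init --setup` step above does. A small wrapper can save some typing. This is a sketch under assumptions: the function name `abx` and the `COMPOSE` variable are hypothetical, and the wrapper has to be called from the directory that holds docker-compose.yml.

```shell
# Hypothetical wrapper: run any ArchiveBox subcommand in the "archivebox"
# service defined in docker-compose.yml.
# COMPOSE defaults to docker-compose; set COMPOSE=echo for a dry run.
abx() {
    "${COMPOSE:-docker-compose}" run archivebox "$@"
}

# Usage (from ~/archivebox):
#   abx add 'https://example.com'
#   abx status
```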
Install ArchiveBox using Docker
After installing Docker, create a directory for ArchiveBox, then run the initial setup and create an admin user:
$ mkdir ~/archivebox && cd ~/archivebox
$ docker run -v $PWD:/data -it archivebox/archivebox init --setup
Finally, start the ArchiveBox Docker instance using command:
$ docker run -v $PWD:/data -p 8000:8000 archivebox/archivebox
Now you can log in to the ArchiveBox admin Web UI dashboard at http://127.0.0.1:8000.
Logging in to the Web UI is completely optional. You can do everything from the command line as well.
Install ArchiveBox using Auto Setup Script
First, install Docker. It is optional, but recommended.
Run the automatic setup script using command:
$ curl -sSL 'https://get.archivebox.io' | sh
This will automatically add the ArchiveBox repository and install ArchiveBox with all necessary dependencies.
ArchiveBox will be installed in a directory named 'archivebox' in your current working directory.
Cd into the archivebox directory (~/archivebox if you ran the script from your home directory) and initialize it using commands:
$ cd ~/archivebox
$ archivebox init --setup
You will be asked to create a new admin user for the Web UI.
[...]
[+] Creating new admin user for the Web UI...
Username (leave blank to use 'ostechnix'):
Email address:
Password:
Password (again):
This password is too short. It must contain at least 8 characters.
Bypass password validation and create user anyway? [y/N]: y
Superuser created successfully.
[...]
Finally, start ArchiveBox server using command:
$ archivebox server 0.0.0.0:8000
Install ArchiveBox using pip
Install Python version 3.7 or later and Node version 14 or later.
Install ArchiveBox using pip3:
$ pip3 install archivebox
Create directory for ArchiveBox and initialize it using commands:
$ mkdir ~/archivebox && cd ~/archivebox
$ archivebox init --setup
Finally, start ArchiveBox server using command:
$ archivebox server 0.0.0.0:8000
You can now access the ArchiveBox web UI from URL http://127.0.0.1:8000.
Install ArchiveBox from Apt Repository
Docker and Docker-compose are not required if you decided to install ArchiveBox from the repository.
First add the ArchiveBox repository.
On Ubuntu 20.04:
$ sudo apt install software-properties-common
$ sudo add-apt-repository -u ppa:archivebox/archivebox
On Ubuntu 22.10 and later versions, Ubuntu 19.10 and earlier versions, and other Debian-based systems:
$ echo "deb http://ppa.launchpad.net/archivebox/archivebox/ubuntu focal main" | sudo tee /etc/apt/sources.list.d/archivebox.list
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C258F79DCC02E369
$ sudo apt update
After adding the relevant repository, install ArchiveBox on Ubuntu and Debian-based systems using command:
$ sudo apt install archivebox
Apt only provides a broken older version of Django, so we also need to install ArchiveBox with the pip package manager.
$ sudo python3 -m pip install --upgrade --ignore-installed archivebox
Create a directory for ArchiveBox and initialize it:
$ mkdir ~/archivebox && cd ~/archivebox
$ archivebox init --setup
Finally, start ArchiveBox server using command:
$ archivebox server 0.0.0.0:8000
You can now access the ArchiveBox web UI from URL http://127.0.0.1:8000.
To view ArchiveBox version, run:
$ archivebox --version
To view the ArchiveBox help section, run:
$ archivebox help
How to Archive Websites using ArchiveBox?
ArchiveBox can archive URLs either from the command line or via its Web UI.
To archive a single URL from the command line, pass the URL as an argument like below:
$ archivebox add https://example.com/some/page
Or,
$ echo https://example.com/some/page | archivebox add
Example:
$ archivebox add https://github.com/ArchiveBox/ArchiveBox
To archive a list of URLs from a text file, run:
$ archivebox add < archive_urls.txt
Or,
$ cat archive_urls.txt | archivebox add
Or,
$ archivebox add ~/Downloads/browser_bookmarks.html
Or,
$ archivebox add ~/Downloads/pinboard_bookmarks.json
Or,
$ curl https://getpocket.com/users/USERNAME/feed/all | archivebox add
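When the input is a hand-maintained text file, it can help to tidy it before piping it into `archivebox add`. The sketch below is not part of ArchiveBox: `clean_url_list` is a hypothetical helper built only from standard tools, and ArchiveBox may already handle some of these cases on its own.

```shell
# Hypothetical helper: print the usable URLs from a text file, one per line.
# - strips Windows carriage returns
# - keeps only lines that begin with http:// or https://
#   (this also drops blank lines, '#' comments and indented lines)
# - removes duplicates (sort -u also sorts the result)
clean_url_list() {
    tr -d '\r' < "$1" \
        | grep -E '^https?://' \
        | sort -u
}

# Usage:
#   clean_url_list archive_urls.txt | archivebox add
```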
You can also add --depth=1 to any of the above commands to archive the given URLs plus every URL they link to (one hop away).
$ archivebox add --depth=1 < ~/Downloads/bookmarks_export.html
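The scheduled imports mentioned in the introduction use the `archivebox schedule` subcommand, which takes the same `--depth` option. A minimal sketch, assuming a daily pull from a feed URL: the helper name `schedule_daily_import` and the `ARCHIVEBOX` variable are hypothetical, and the command is meant to be run from inside the archive directory.

```shell
# Hypothetical helper: register a daily import of the given feed or page.
# ARCHIVEBOX defaults to the real binary; set ARCHIVEBOX=echo for a dry run.
schedule_daily_import() {
    "${ARCHIVEBOX:-archivebox}" schedule --every=day --depth=1 "$1"
}

# Usage (from inside ~/archivebox):
#   schedule_daily_import 'https://getpocket.com/users/USERNAME/feed/all'
```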
Print Archive Collection Statistics
After archiving the URL(s), you can view information and statistics about the archive collection using command:
$ archivebox status
This command scans the archive data directories and displays the indexed links, how many of them are archived or unarchived, and the directories that exist in the archive folder.
You can also list link data directories by status (e.g. indexed, corrupted, archived, etc.) using this command:
$ archivebox list --status=<status>
For example, to list all archived data directories, run:
$ archivebox list --status=archived
Sample output:
[i] [2023-01-05 12:11:06] ArchiveBox v0.6.2: archivebox list --status=archived
    > /home/ostechnix/archivebox

/home/ostechnix/archivebox/archive/1672909053.266666 https://github.com/ArchiveBox/ArchiveBox "GitHub - ArchiveBox/ArchiveBox: ? Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more..."
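For scripting, the snapshot directory and URL can be pulled out of this output with awk. This is only a sketch based on the layout of the sample above: it assumes each snapshot line starts with an absolute path followed by a space and the URL, and that the path contains no spaces. `snapshot_urls` is a hypothetical helper name.

```shell
# Hypothetical helper: read `archivebox list` output on stdin and print
# "<snapshot directory><TAB><URL>" for every snapshot line.
# Header lines (the "[i]" banner and the indented "> ..." line) do not
# start with "/" and are skipped.
snapshot_urls() {
    awk '/^\// && $2 ~ /^https?:\/\// { print $1 "\t" $2 }'
}

# Usage:
#   archivebox list --status=archived | snapshot_urls
```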
As you see in the above output, I have archived the ArchiveBox GitHub repository itself.
You can further view the downloaded contents of this archive folder using the ls command:
$ ls /home/ostechnix/archivebox/archive/1672909053.266666/
avatars.githubusercontent.com  git                      headers.json  media                      readability
camo.githubusercontent.com     github.com               index.html    mercury                    user-images.githubusercontent.com
favicon.ico                    github.githubassets.com  index.json    raw.githubusercontent.com  warc
Save Archives in New Directory
As mentioned in the installation steps above, we are storing the archives in the ~/archivebox directory. You can also create a new directory in another location and initialize it to store the archive collections.
$ mkdir my_new_archive; cd my_new_archive/
$ archivebox init
Now start archiving the URLs as described in the previous section.
This way you can create different archive collections and store them in different directories.
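If you create collections often, the two steps can be wrapped in a small function. A minimal sketch: `new_collection` and the `ARCHIVEBOX` variable are hypothetical names, and the function only combines the `mkdir`, `cd` and `archivebox init` commands already shown.

```shell
# Hypothetical helper: create a directory and initialize it as a new
# ArchiveBox collection. The parentheses run it in a subshell, so the
# caller's working directory is left unchanged.
# ARCHIVEBOX defaults to the real binary; set ARCHIVEBOX=echo for a dry run.
new_collection() {
    (
        mkdir -p "$1" && cd "$1" && "${ARCHIVEBOX:-archivebox}" init
    )
}

# Usage:
#   new_collection ~/archives/news
#   new_collection ~/archives/research
```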
Access ArchiveBox WebUI
To access the ArchiveBox admin Web UI, first start ArchiveBox server using the following command:
$ archivebox server 0.0.0.0:8000
You can now access the ArchiveBox web UI from URL http://127.0.0.1:8000 or http://IP-Address:8000.
As you can see, I have archived the ArchiveBox official GitHub Repository itself. Just click on the archive to open it.
You will see that the archive is saved in different output formats. Simply click on any output format to open the respective file. I clicked the Wget > HTML link and ArchiveBox displayed the archived content in the same window. You can also open it in new browser tab or window.
You can now read the page offline. The archive remains saved on your local system until you delete it.
To stop the ArchiveBox server, go back to the terminal window where you started it and press CTRL+C.
Add New Archive
Click the LOG IN button in the Web UI. Log in with the username and password you created while initializing ArchiveBox in the earlier steps. Refer to the installation section to learn how to initialize ArchiveBox.
Heads Up: Please note that the admin user is created when you initialize ArchiveBox. If it is not created for any reason, cd into your archive folder and then run this command to create the admin user and set a password for it.
$ archivebox manage createsuperuser
Click the ADD button and enter the URL(s) to be archived one by one. Choose the URL format and the archive depth (e.g. 0 or 1), then click the "Add URLs and archive" button.
Now the archiving process will start.
It is safe to leave or close this page as the archiving process will continue in the background.
Once the URL is archived, go to the SNAPSHOTS to view the list of archived pages.
Conclusion
ArchiveBox is a promising solution for self-hosting your own personal Internet Archive and preserving web pages before they are edited or taken down completely.
I request users to use this tool fairly and legitimately. Please don't use it for bandwidth abuse, scraping your competitors' sites, or downloading copyrighted or illegal content.
Resources: