
How To Self-host Your Own Internet Archive With ArchiveBox In Linux

Archive Websites Using ArchiveBox In Linux

By sk

This tutorial explains what ArchiveBox is, how to install it in Linux, and how to self-host your own personal Internet Archive with it.

Disclaimer: All the information given here is strictly for educational purposes only. Neither the author nor the OSTechNix team is responsible for any damage done to the target sites, such as bandwidth abuse or the downloading of copyrighted or illegal content.

Introduction

The Internet Archive’s Wayback Machine (IAWM) is the largest and oldest public Web archive.

As of this writing, the Internet Archive Wayback Machine (archive.org) has captured more than 778 billion web pages and stores petabytes of data.

Most users come to Archive.org because they cannot find the requested pages on the live Web. About 65% of the requested archived pages no longer exist on the live Web. Thanks to Archive.org, we can still access and view old and defunct websites.

While archive.org is quite capable of preserving a lot of web resources, some of you may want to host your own personal and private internet archive on your own server. This is where ArchiveBox comes in handy.

What is ArchiveBox?

ArchiveBox is a free, open source and powerful Internet archiving solution to collect and save your favorite websites, and to view or read them offline.

You can feed it a single URL or schedule imports from your browser bookmarks, browser history, plain text, HTML, markdown, feeds like RSS, bookmark services like Pocket/Pinboard and more!

ArchiveBox saves snapshots of the given URLs in several output formats such as HTML, JSON, PDF, PNG screenshots, WARC, and more!

By default, ArchiveBox also submits all captured URLs to archive.org for redundancy. However, you can disable this if you want a local-only archive.
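
Once ArchiveBox is installed and a collection is initialized (both are covered later in this guide), the archive.org submission can be switched off from the command line. The following is only a minimal sketch. It assumes the setting is named SAVE_ARCHIVE_DOT_ORG, as in the ArchiveBox 0.6.x configuration, and that a collection directory can be recognized by its index.sqlite3 file. The guard around the commands is my own addition, so the snippet does nothing when run outside a collection.

```shell
# Run this from inside an initialized ArchiveBox collection directory.
# Assumption: the collection is identified by its index.sqlite3 file.
if command -v archivebox >/dev/null 2>&1 && [ -f index.sqlite3 ]; then
  # Stop submitting archived URLs to archive.org
  archivebox config --set SAVE_ARCHIVE_DOT_ORG=False
  # Print the stored value to confirm the change
  archivebox config --get SAVE_ARCHIVE_DOT_ORG
  config_status=updated
else
  echo "Not inside an ArchiveBox collection; nothing changed." >&2
  config_status=skipped
fi
```

If your version uses a different option name, `archivebox config` without arguments should list the available settings.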

ArchiveBox is available as a command-line tool, a web app and a desktop app. It is cross-platform and supports GNU/Linux, macOS and Windows. ArchiveBox is written in Python and the source code is available on GitHub.

Install ArchiveBox in Linux

ArchiveBox can be installed in a few different ways. We can install ArchiveBox using Docker-compose, Docker, the automated setup script, Pip, or the Apt repository.

For the Docker-based methods, first make sure you have installed Docker and Docker-compose.

Install ArchiveBox using Docker-compose

The officially recommended way to install ArchiveBox is to use Docker-compose.

After installing Docker-compose, create a directory for ArchiveBox and download the docker-compose.yml file into it:

$ mkdir ~/archivebox && cd ~/archivebox
$ curl -O 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/docker-compose.yml'

Run the initial setup and create an admin user by running the following command:

$ docker-compose run archivebox init --setup

Finally, start the ArchiveBox server using the command:

$ docker-compose up

Now you can log in to the ArchiveBox admin Web UI dashboard at http://127.0.0.1:8000.

Logging in to the Web UI is completely optional. You can do everything from the command line as well.
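
As an illustration of working purely from the command line with Docker-compose, here is a hedged sketch. It starts the server in the background, archives one URL through the container, and shows the most recent log lines. The URL is a placeholder and the compose_status variable is my own addition. The snippet assumes you are in the directory that holds the downloaded docker-compose.yml.

```shell
# Assumes the current directory holds the docker-compose.yml downloaded earlier.
if command -v docker-compose >/dev/null 2>&1 && [ -f docker-compose.yml ]; then
  # Start the ArchiveBox server in the background (detached mode)
  docker-compose up -d
  # Archive a single (placeholder) URL through the container
  docker-compose run archivebox add 'https://example.com'
  # Show the last few lines of the server log
  docker-compose logs --tail=20 archivebox
  compose_status=started
else
  echo "docker-compose or docker-compose.yml not found; nothing started." >&2
  compose_status=skipped
fi
```

When you are done, `docker-compose down` stops and removes the containers. The archived data stays in the directory on the host.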

Install ArchiveBox using Docker

After installing Docker, create a directory for ArchiveBox, then run the initial setup and create an admin user:

$ mkdir ~/archivebox && cd ~/archivebox
$ docker run -v $PWD:/data -it archivebox/archivebox init --setup

Finally, start the ArchiveBox Docker instance using the command:

$ docker run -v $PWD:/data -p 8000:8000 archivebox/archivebox

Now you can log in to the ArchiveBox admin Web UI dashboard at http://127.0.0.1:8000.

Logging in to the Web UI is completely optional. You can do everything from the command line as well.
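
Typing the full docker run line for every ArchiveBox command gets tedious, so a small shell function can wrap it. This is only a convenience sketch. The function name abx is hypothetical, and the --rm flag (which removes the container after each command) is my own addition. Because of the -it flags it needs to be called from an interactive terminal.

```shell
# Hypothetical helper: run any ArchiveBox subcommand in a throwaway container,
# using the current directory as the collection (mounted at /data).
abx() {
  docker run --rm -v "$PWD":/data -it archivebox/archivebox "$@"
}

# Example usage, from inside ~/archivebox (not executed here):
#   abx add 'https://example.com'
#   abx status
```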

Install ArchiveBox using Auto Setup Script

First, install Docker. It is optional, but recommended.

Run the automatic setup script using the command:

$ curl -sSL 'https://get.archivebox.io' | sh

This will automatically add the ArchiveBox repository and install ArchiveBox with all necessary dependencies.

ArchiveBox will be installed in a directory named 'archivebox' in your current working directory.

Change into the archivebox directory (~/archivebox if you ran the script from your home directory) and initialize it using the commands:

$ cd ~/archivebox
$ archivebox init --setup

You will be asked to create a new admin user for the Web UI.

[...]
[+] Creating new admin user for the Web UI...
Username (leave blank to use 'ostechnix'): 
Email address: 
Password: 
Password (again): 
This password is too short. It must contain at least 8 characters.
Bypass password validation and create user anyway? [y/N]: y
Superuser created successfully.
[...]

Initialize ArchiveBox

Finally, start the ArchiveBox server using the command:

$ archivebox server 0.0.0.0:8000

Install ArchiveBox using Pip

Install Python 3.7 or later and Node.js 14 or later.

Install ArchiveBox using pip3:

$ pip3 install archivebox

Create a directory for ArchiveBox and initialize it using the commands:

$ mkdir ~/archivebox && cd ~/archivebox
$ archivebox init --setup

Finally, start the ArchiveBox server using the command:

$ archivebox server 0.0.0.0:8000

You can now access the ArchiveBox web UI from URL http://127.0.0.1:8000.
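
The pip3 command above installs ArchiveBox into the system or user Python environment. If you would rather keep it separate from other Python packages, a virtual environment is one option. The sketch below is not part of the official instructions. The function name and the default location ~/.venvs/archivebox are hypothetical, and the function is only defined here, not run.

```shell
# Hypothetical helper: install ArchiveBox into its own virtual environment.
install_archivebox_venv() {
  venv_dir=${1:-$HOME/.venvs/archivebox}   # assumed location, change as needed
  # Create the virtual environment, install ArchiveBox into it, print its version
  python3 -m venv "$venv_dir" &&
    "$venv_dir/bin/pip" install --upgrade pip archivebox &&
    "$venv_dir/bin/archivebox" --version
}

# Example usage (not executed here):
#   install_archivebox_venv
#   . ~/.venvs/archivebox/bin/activate   # puts the archivebox command on PATH
```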

Install ArchiveBox from Apt Repository

Docker and Docker-compose are not required if you decide to install ArchiveBox from the repository.

First add the ArchiveBox repository.

On Ubuntu 20.04:

$ sudo apt install software-properties-common
$ sudo add-apt-repository -u ppa:archivebox/archivebox

On Ubuntu 22.10 and later, Ubuntu 19.10 and earlier, and other Debian-based systems:

$ echo "deb http://ppa.launchpad.net/archivebox/archivebox/ubuntu focal main" | sudo tee /etc/apt/sources.list.d/archivebox.list
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C258F79DCC02E369
$ sudo apt update

After adding the relevant repository, install ArchiveBox on Ubuntu and Debian-based systems using the command:

$ sudo apt install archivebox

Apt only provides a broken older version of Django, so we need to install ArchiveBox using the Pip package manager as well:

$ sudo python3 -m pip install --upgrade --ignore-installed archivebox

Create a directory for ArchiveBox and initialize it:

$ mkdir ~/archivebox && cd ~/archivebox
$ archivebox init --setup

Finally, start the ArchiveBox server using the command:

$ archivebox server 0.0.0.0:8000

You can now access the ArchiveBox web UI from URL http://127.0.0.1:8000.

To view ArchiveBox version, run:

$ archivebox --version

To view the ArchiveBox help section, run:

$ archivebox help

How to Archive Websites using ArchiveBox?

ArchiveBox can be used to archive URLs either from the command line or via its Web UI.

To archive a single URL from the command line, simply pass the URL as an argument like below:

$ archivebox add https://example.com/some/page

Or,

$ echo https://example.com/some/page | archivebox add

Example:

$ archivebox add https://github.com/ArchiveBox/ArchiveBox

Archive Websites using ArchiveBox

To archive a list of URLs from a text file, run:

$ archivebox add < archive_urls.txt

Or,

$ cat archive_urls.txt | archivebox add

Or,

$ archivebox add ~/Downloads/browser_bookmarks.html

Or,

$ archivebox add ~/Downloads/pinboard_bookmarks.json

Or,

$ curl https://getpocket.com/users/USERNAME/feed/all | archivebox add

You can also add --depth=1 to any one of the above commands to archive the given URLs and all URLs one hop away.

$ archivebox add --depth=1 < ~/Downloads/bookmarks_export.html
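
Before feeding a hand-maintained text file to ArchiveBox, it can help to tidy it up first. The sketch below is one way to do that under a few assumptions. The file names and sample URLs are placeholders, and a collection directory is assumed to be recognizable by its index.sqlite3 file. ArchiveBox may well cope with comments and duplicates on its own, so treat the cleanup step as optional.

```shell
# Hypothetical file names; the sample entries are placeholders.
urls_file=archive_urls.txt
clean_file=archive_urls.clean.txt

cat > "$urls_file" <<'EOF'
# pages to keep
https://example.com/some/page

https://example.com/another/page
https://example.com/some/page
EOF

# Drop comment lines and blank lines, then remove duplicates while keeping order
grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$urls_file" |
  awk '!seen[$0]++' > "$clean_file"

url_count=$(wc -l < "$clean_file" | tr -d ' ')
echo "URLs to archive: $url_count"

# Only call ArchiveBox when it is installed and we are inside a collection
if command -v archivebox >/dev/null 2>&1 && [ -f index.sqlite3 ]; then
  # --depth=0 archives only the listed pages; use --depth=1 to follow links one hop
  archivebox add --depth=0 < "$clean_file"
else
  echo "Not inside an ArchiveBox collection; skipping 'archivebox add'." >&2
fi
```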

Print Archive Collection Statistics

After archiving the URL(s), you can view information and statistics about the archive collection using the command:

$ archivebox status

This command will scan the archive data directories and display the indexed links, the indexed links that are archived or unarchived, and the directories that exist in the archive folder.

Print Archive Collection Statistics

You can also list link data directories by status (e.g. indexed, corrupted, archived, etc.) using this command:

$ archivebox list --status=<status>

For example, to list all archived data directories, run:

$ archivebox list --status=archived

Sample output:

[i] [2023-01-05 12:11:06] ArchiveBox v0.6.2: archivebox list --status=archived
    > /home/ostechnix/archivebox

/home/ostechnix/archivebox/archive/1672909053.266666 https://github.com/ArchiveBox/ArchiveBox "GitHub - ArchiveBox/ArchiveBox: ? Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more..."

As you can see in the above output, I have archived the ArchiveBox GitHub repository itself.
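
To check several statuses in one go, the same list command can be run in a loop. This is a small sketch that uses the three status values mentioned above. The heading lines and the guard are my own additions, and a collection directory is assumed to be recognizable by its index.sqlite3 file.

```shell
# Status values taken from the examples mentioned in this section
statuses="indexed archived corrupted"

if command -v archivebox >/dev/null 2>&1 && [ -f index.sqlite3 ]; then
  for status in $statuses; do
    echo "== $status =="
    # List the link data directories that have this status
    archivebox list --status="$status"
  done
else
  echo "Run this from inside an ArchiveBox collection directory." >&2
fi
```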

You can further view the downloaded contents of this archive folder using the ls command:

$ ls /home/ostechnix/archivebox/archive/1672909053.266666/
avatars.githubusercontent.com  git                      headers.json  media                      readability
camo.githubusercontent.com     github.com               index.html    mercury                    user-images.githubusercontent.com
favicon.ico                    github.githubassets.com  index.json    raw.githubusercontent.com  warc
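
The folder name in the listing above looks like a Unix timestamp with a fractional part. Assuming that is the case, the following sketch turns it into a readable date. It uses the -d option of GNU date, which is what most Linux distributions ship, and the path is simply the one from the sample output.

```shell
# Path copied from the sample output above
snapshot_dir=/home/ostechnix/archivebox/archive/1672909053.266666

snapshot_id=$(basename "$snapshot_dir")   # 1672909053.266666
epoch_seconds=${snapshot_id%%.*}          # strip the fractional part: 1672909053

# Convert seconds since the Unix epoch to a UTC date (GNU date syntax)
snapshot_date=$(date -u -d "@$epoch_seconds" '+%Y-%m-%d %H:%M:%S')

echo "$snapshot_id corresponds to $snapshot_date UTC"
# prints: 1672909053.266666 corresponds to 2023-01-05 08:57:33 UTC
```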

Save Archives in New Directory

As mentioned in the installation steps above, we are storing the archives in the ~/archivebox directory. You can also create a new directory in another location and initialize it to store the archive collections.

$ mkdir my_new_archive && cd my_new_archive/
$ archivebox init

Now start archiving the URLs as described in the previous section.

This way you can create different archive collections and store them in different directories.
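
If you plan to keep several collections, the mkdir and init steps can be repeated in a loop. The sketch below is only an illustration. The base directory ~/archives and the collection names news and docs are hypothetical, and archivebox init is only run when the command is available.

```shell
# Hypothetical base directory and collection names; adjust to your needs.
base_dir=${ARCHIVE_BASE:-$HOME/archives}

for name in news docs; do
  mkdir -p "$base_dir/$name"
  if command -v archivebox >/dev/null 2>&1; then
    # Initialize each directory as its own independent collection
    (cd "$base_dir/$name" && archivebox init)
  else
    echo "archivebox not found; created $base_dir/$name without initializing it." >&2
  fi
done
```

Each collection has its own index and its own archive folder, so remember to cd into the right directory before running archivebox add.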

Access ArchiveBox WebUI

To access the ArchiveBox admin Web UI, first start the ArchiveBox server using the following command:

$ archivebox server 0.0.0.0:8000

You can now access the ArchiveBox web UI from URL http://127.0.0.1:8000 or http://IP-Address:8000.

Access ArchiveBox WebUI

As you can see, I have archived the ArchiveBox official GitHub Repository itself. Just click on the archive to open it.

You will see that the archive is saved in different output formats. Simply click on any output format to open the respective file. I clicked the Wget > HTML link and ArchiveBox displayed the archived content in the same window. You can also open it in new browser tab or window.

Open an Archive

You can now read the page offline. The archive remains saved on your local system until you delete it.

To stop the ArchiveBox server, go back to the terminal window where you started it and press CTRL+C.
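
The address 0.0.0.0 used above makes the Web UI reachable from other machines on your network. If you only need it on the machine that runs ArchiveBox, you can bind the server to the loopback address instead. The following is a minimal sketch. The variable names are my own, and a collection directory is assumed to be recognizable by its index.sqlite3 file.

```shell
# Listen on the loopback interface only, so the UI is not exposed to the network
bind_host=127.0.0.1
port=8000
listen_addr="$bind_host:$port"

echo "Web UI will be served at http://$listen_addr"

if command -v archivebox >/dev/null 2>&1 && [ -f index.sqlite3 ]; then
  # Runs in the foreground; press CTRL+C to stop it
  archivebox server "$listen_addr"
else
  echo "Not inside an ArchiveBox collection; server not started." >&2
fi
```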

Add New Archive

Click the LOG IN button in the Web UI. Log in using the username and password that you created while initializing ArchiveBox in the earlier steps. Refer to the installation section to learn how to initialize ArchiveBox.

Login to ArchiveBox WebUI

Heads Up: Please note that the admin user is created when you initialize ArchiveBox. If it was not created for any reason, cd into your archive folder and run this command to create the admin user and set a password for it:

$ archivebox manage createsuperuser

Click the ADD button and enter the URL(s) to be archived one by one. Choose the URL format and the Archive depth (e.g. 0 or 1), and click the "Add URLs and archive" button.

Now the archiving process will start.

Archive Websites via ArchiveBox WebUI

It is safe to leave or close this page as the archiving process will continue in the background.

Once the URL is archived, go to the SNAPSHOTS page to view the list of archived pages.

View Archive Snapshots in ArchiveBox WebUI

Conclusion

ArchiveBox is a perfect and promising solution to self-host your own personal Internet Archive and preserve webpage(s) before they are edited or taken down completely.

I request users to use this tool fairly and legitimately. Please don't use it for bandwidth abuse, scraping your competitors' sites, or downloading copyrighted and illegal content.
