Canonical Launches Data Science Stack for ML Beginners

Data science is the study of data. It involves collecting, analyzing, and interpreting large amounts of information, using a range of tools and techniques to make sense of complex data sets.

Businesses and organizations use the results to make better decisions, solve problems, and predict future trends.

If you're a beginner just starting with data science, you will probably face several challenges in setting up a proper data science environment.

Here are some reasons why setting up a data science environment can be challenging for beginners:

Software Installation: Newbies often struggle with installing the necessary software, such as programming languages (like Python or R), libraries, and tools (like Jupyter Notebooks or RStudio).
Understanding Dependencies: Software often requires specific versions of other software to work correctly. This can be confusing and lead to errors if not managed properly.
Learning Curve: Data science involves learning new skills, including programming, statistics, and machine learning. This can be overwhelming for beginners.
Data Handling: Working with data can be complex, especially when dealing with large or messy datasets. Understanding how to clean, store, and process data is crucial but can be difficult to grasp initially.
Version Control: Keeping track of changes in code and data is important but can be tricky to set up and manage, especially for those new to version control systems like Git.
Choosing the Right Tools: There are many tools and frameworks available, and choosing the right ones for a specific project can be daunting for beginners.

By understanding these challenges, beginners can better prepare themselves and seek the right resources and support to overcome them.

The initial hurdles can be challenging for new data scientists, but with persistence and consistent learning, the journey will become smoother.

Canonical's Data Science Stack (DSS) makes setting up a data science environment much easier. In this tutorial, we will discuss what the Data Science Stack is and how to use it to set up a data science environment quickly and easily on Ubuntu.

What is Data Science Stack (DSS)?

The Data Science Stack (DSS) by Canonical is an out-of-the-box solution for data scientists and machine learning engineers.

The Data Science Stack simplifies the setup process by providing a pre-configured environment that includes all the necessary tools and libraries for machine learning and data analysis.

DSS is designed to run on Ubuntu workstations and to make use of their GPUs, which can speed up model training and other computationally intensive tasks.

DSS allows users to focus more on the development and optimization of their models rather than the technicalities of the environment setup.

This can save a significant amount of time that would otherwise be spent on installing and configuring individual components.

What's Included in the Data Science Stack?

The Data Science Stack (DSS) provides a comprehensive and integrated environment for data scientists and machine learning engineers. Here's what it offers:

Pre-installed Tools: DSS includes popular open-source tools like MicroK8s, JupyterLab and MLflow, which are essential for data exploration, model development, and experiment tracking.
Machine Learning Frameworks: By default, it comes with two widely used machine learning frameworks, PyTorch and TensorFlow, which are ready to use for building and training models.
Command Line Interface (CLI): DSS provides an intuitive CLI for deploying these tools and frameworks, making it easier to manage and scale the environment.
User Interfaces: After deployment, users can access the UIs of the tools to start working on their data science projects without the hassle of manual setup.
Packaging Dependencies: DSS handles the packaging dependencies, ensuring that all tools, libraries, and frameworks are compatible with each other and work smoothly together.
Hardware Compatibility: It is designed to be compatible with the machine's hardware, optimizing the performance of the tools and frameworks.
Simplified Configuration: Traditionally, setting up machine learning environments on workstations can be complex and difficult to reverse. DSS addresses this by providing accessible, production-ready, isolated, and reproducible ML environments that efficiently utilize a workstation's GPUs.
GPU Configuration: DSS simplifies GPU configuration by including the GPU operator, which manages the setup and usage of GPUs for machine learning tasks, leveraging their computational power effectively.

Overall, DSS aims to provide a hassle-free and optimized environment for data science and machine learning, allowing users to focus on their core tasks rather than the technical setup and maintenance of their tools.

Install Data Science Stack (DSS) in Ubuntu

To begin using the Data Science Stack (DSS) for machine learning and data science, follow these steps to set up your environment:

Prerequisites

Operating System: Ensure you have Ubuntu 22.04 LTS or Ubuntu 24.04 LTS installed on your system.
Internet Connection: You'll need an active internet connection to download and install the necessary software.
Snap: Make sure Snap is installed on your system, as it is required for installing MicroK8s and DSS.
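
The prerequisites above can be checked from a terminal before installing anything. The following is a minimal sketch and not part of DSS: the function names are made up for illustration, and it assumes the standard /etc/os-release file with its ID and VERSION_ID fields.

```shell
# Hypothetical helper (not part of DSS): succeed only on Ubuntu 22.04 or 24.04.
# The optional argument is the os-release file to read; it defaults to /etc/os-release.
is_supported_ubuntu() (
    # The body runs in a subshell so the sourced variables do not leak into the caller.
    . "${1:-/etc/os-release}" || exit 1
    [ "${ID:-}" = "ubuntu" ] || exit 1
    case "${VERSION_ID:-}" in
        22.04|24.04) exit 0 ;;
        *) exit 1 ;;
    esac
)

# Hypothetical helper: succeed if the snap command is on the PATH.
has_snap() {
    command -v snap >/dev/null 2>&1
}
```

For example, running is_supported_ubuntu && has_snap && echo "Ready to install" prints the message only when both checks pass.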

Setting Up MicroK8s

DSS uses MicroK8s as its container orchestration system, which allows workloads to access the host's GPUs.

To install MicroK8s on Ubuntu, run:

$ sudo snap install microk8s --channel 1.28/stable --classic

Next, enable the required services:

$ sudo microk8s enable storage dns rbac
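
MicroK8s can take a moment to come up after installation, and enabling add-ons before the cluster is ready may fail. The sketch below wraps the two commands above and adds a readiness wait between them. The function name is illustrative; microk8s status --wait-ready is the usual MicroK8s readiness check.

```shell
# Illustrative wrapper: install MicroK8s, wait until its services report
# that they are running, then enable the add-ons. Stops at the first failure.
setup_microk8s() {
    sudo snap install microk8s --channel 1.28/stable --classic &&
    sudo microk8s status --wait-ready &&
    sudo microk8s enable storage dns rbac
}

# Usage:
# setup_microk8s
```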

Installing the DSS CLI

The Data Science Stack is managed through a Command Line Interface (CLI).

Install DSS CLI with the following command:

$ sudo snap install data-science-stack --channel latest/stable
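
Before moving on, it can help to confirm that both command-line tools are reachable from your shell. This is a small sketch with a made-up helper name. It assumes the snaps expose the microk8s and dss commands on your PATH, which on Ubuntu normally means /snap/bin is included.

```shell
# Illustrative helper: print each named command that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage: an empty result means both CLIs were found.
# missing_tools microk8s dss
```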

With these steps completed, you'll have the foundational components of DSS installed and ready to use. You can now proceed to set up your machine learning environments and start running your first notebooks using the DSS CLI.

Getting Started with Data Science Stack

After installing MicroK8s and the DSS CLI, the next step is to initialize DSS on top of MicroK8s and prepare MLflow for use.

Initializing DSS and MLflow

To initialize DSS, you'll need to use the dss initialize command, which sets up the necessary resources within the MicroK8s cluster.

$ dss initialize --kubeconfig="$(sudo microk8s config)"

Here, the --kubeconfig flag receives the Kubernetes configuration that sudo microk8s config prints. DSS stores a copy of it for later commands, as the initialization output further down shows.

The dss initialize command may take a few minutes to complete. During this time, the DSS CLI will display messages indicating the progress of the deployment. You will see messages similar to the following:

[INFO] Waiting for deployment mlflow in namespace dss to be ready...

This message indicates that DSS is waiting for the MLflow deployment to be ready. Be patient as the system sets up the environment and ensures all components are correctly configured.

Once the initialization is complete, you will see output like the following:

[INFO] Executing initialize command
[INFO] Storing provided kubeconfig to /home/ostechnix/snap/data-science-stack/16/.dss/config
[INFO] Waiting for deployment mlflow in namespace dss to be ready...
[INFO] Deployment mlflow in namespace dss is ready
[INFO] DSS initialized. To create your first notebook run the command:

dss create

Examples:
  dss create my-notebook --image=pytorch
  dss create my-notebook --image=kubeflownotebookswg/jupyter-scipy:v1.8.0

You are now ready to use the MLflow tracking server and the other components provided by DSS.

You can then proceed to create and run your first machine learning notebook within the DSS environment.
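
To double-check the result from the Kubernetes side, you can query the deployment directly. This sketch assumes the mlflow deployment and dss namespace named in the initialization log. kubectl rollout status is a standard kubectl subcommand, and the function name is illustrative.

```shell
# Illustrative helper: wait up to 5 minutes for the MLflow deployment
# in the dss namespace to finish rolling out.
wait_for_mlflow() {
    sudo microk8s kubectl rollout status deployment/mlflow \
        --namespace dss --timeout=300s
}

# Usage:
# wait_for_mlflow
```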

Starting Your First Jupyter Notebook

To start your first Jupyter Notebook using the Data Science Stack (DSS), you'll need to use the dss create command, which allows you to specify the type of notebook you want to create.

Here, we are creating a TensorFlow notebook named my-tensorflow-notebook with CUDA support:

$ dss create my-tensorflow-notebook --image=kubeflownotebookswg/jupyter-tensorflow-cuda:v1.8.0

Once the notebook has been created, you will see output like the following:

[INFO] Executing create command
[INFO] Waiting for deployment my-tensorflow-notebook in namespace dss to be ready...
[INFO] Waiting for deployment my-tensorflow-notebook in namespace dss to be ready...
[INFO] Waiting for deployment my-tensorflow-notebook in namespace dss to be ready...
[INFO] Deployment my-tensorflow-notebook in namespace dss is ready
[INFO] Success: Notebook my-tensorflow-notebook created successfully.
[INFO] Access the notebook at http://10.152.183.253:80.

Once the notebook is ready, the command prints a URL for the JupyterLab UI. In the output above, the new notebook is available at http://10.152.183.253:80.

To start working with your notebook, open a web browser and enter the URL from your own output into the address bar. The IP address and port will differ depending on your setup.

This takes you to the JupyterLab interface, where you can create new notebooks, upload data, and begin your machine learning tasks using TensorFlow and CUDA.
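
If you want to pick the URL out of the log automatically, for example to open it from a script, a small filter is enough. This is a sketch based only on the log format shown above. The helper name is made up, and whether the CLI writes its log to stdout or stderr is an assumption, hence the 2>&1 in the usage line.

```shell
# Illustrative helper: read `dss create` log lines on stdin and print the notebook URL.
notebook_url() {
    sed -n 's/.*Access the notebook at \(http[^ ]*\).*/\1/p' |
        sed 's/\.$//'   # drop the sentence-ending period after the port
}

# Usage:
# dss create my-tensorflow-notebook --image=kubeflownotebookswg/jupyter-tensorflow-cuda:v1.8.0 2>&1 | notebook_url
```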

That's it. You can now start interacting with your notebook.

View DSS Status

To quickly check the status of your Data Science Stack (DSS) environment, including the status of MLflow and the availability of GPU acceleration, use the dss status command:

$ dss status

The dss status command will provide you with a summary of the current state of your DSS environment. Here's an example of what the output might look like:

[INFO] MLflow deployment: Ready
[INFO] MLflow URL: http://10.152.183.157:5000
[INFO] GPU acceleration: Disabled

Explanation of Output:

MLflow deployment: Ready indicates that the MLflow tracking server is up and running.
MLflow URL provides the URL where you can access the MLflow UI to track your machine learning experiments.
GPU acceleration: Disabled shows that there is no GPU available or configured for use in the current DSS environment.
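
The fields explained above can also be read from a script. This sketch relies only on the "name: value" layout of the example output. The helper name is made up, and the field name must not contain a slash because it is placed directly into a sed expression.

```shell
# Illustrative helper: read `dss status` output on stdin and print the value of one field.
# $1 = field name exactly as it appears in the output, e.g. "MLflow URL".
status_field() {
    sed -n "s/^.*$1: //p"
}

# Usage (2>&1 in case the CLI logs to stderr):
# dss status 2>&1 | status_field "MLflow URL"
# dss status 2>&1 | status_field "GPU acceleration"
```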

To verify, open the MLflow URL (http://10.152.183.157:5000 in this example) in your web browser.

The MLflow dashboard will load.

Experiments tab in the MLflow dashboard:

Since this is a fresh installation, there are no experiments yet. To create one, use the mlflow experiments CLI.
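
As a rough sketch of that step: the mlflow command comes with the MLflow Python package, so this assumes it is installed wherever you run it (for example with pip install mlflow). MLFLOW_TRACKING_URI is the standard environment variable for pointing the CLI at a tracking server. The function name and the experiment name are made up for illustration.

```shell
# Illustrative helper: create an MLflow experiment on the given tracking server.
# $1 = tracking server URL (for example the MLflow URL from `dss status`)
# $2 = experiment name
create_experiment() {
    MLFLOW_TRACKING_URI="$1" mlflow experiments create --experiment-name "$2"
}

# Usage, with the example URL from this article:
# create_experiment http://10.152.183.157:5000 my-first-experiment
```

After the command succeeds, the new experiment should appear in the Experiments tab once you refresh the dashboard.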

Models tab in MLflow Dashboard:

Listing DSS Commands

To view the list of available commands for the Data Science Stack (DSS), you can use the dss command with the --help option.

Run the following command in your terminal:

$ dss --help

This will display a list of commands along with a brief description of their purpose.

If you need more detailed information about a specific DSS command, you can use the command followed by the --help option.

For example, to get details about the logs command, you would run:

$ dss logs --help
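
The same --help pattern can be looped over several commands at once. This sketch covers only the DSS commands mentioned in this article; the function name is illustrative, and dss --help remains the authoritative list for your installed version.

```shell
# Illustrative helper: print the help text for each DSS command used in this article.
dss_help_all() {
    for cmd in initialize create status logs purge; do
        echo "== dss $cmd =="
        dss "$cmd" --help
    done
}

# Usage:
# dss_help_all | less
```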

Removing Data Science Stack from MicroK8s

If you don't need DSS anymore, you can use the dss purge command to remove the Data Science Stack from your MicroK8s cluster.

To remove DSS, execute the following command in your terminal:

$ dss purge

This command will completely remove all DSS components, including Jupyter Notebooks, the MLflow server, and any data stored within the DSS environment.

It's important to note that this action is irreversible, and all data within the DSS environment will be permanently lost. Make sure to back up any important data before proceeding with the purge.
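
Because the purge cannot be undone, you may want a guard in front of it when using it in scripts. This is a sketch of a hypothetical wrapper, not a DSS feature. Whether dss purge asks for confirmation on its own is not covered here, so the wrapper adds its own prompt.

```shell
# Illustrative wrapper: run `dss purge` only after the user types "yes".
confirm_purge() {
    printf 'This permanently deletes all DSS notebooks and data. Type "yes" to continue: ' >&2
    read -r answer
    if [ "${answer:-}" = "yes" ]; then
        dss purge
    else
        echo "Purge cancelled." >&2
        return 1
    fi
}

# Usage:
# confirm_purge
```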

Remove DSS CLI and MicroK8s

While the dss purge command removes the DSS components from the MicroK8s cluster, it does not remove the DSS CLI or the MicroK8s cluster itself. If you wish to remove these as well, you will need to delete their respective snaps:

To remove the DSS CLI, use the following command:

$ sudo snap remove data-science-stack

To remove MicroK8s, use the following command:

$ sudo snap remove microk8s
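
If you script the cleanup, it is convenient to skip snaps that are already gone. This sketch relies on snap list returning a non-zero exit status for a snap that is not installed. The function name is made up for illustration.

```shell
# Illustrative helper: remove a snap only if it is currently installed.
remove_snap_if_installed() {
    if snap list "$1" >/dev/null 2>&1; then
        sudo snap remove "$1"
    else
        echo "$1 is not installed; nothing to remove."
    fi
}

# Usage:
# remove_snap_if_installed data-science-stack
# remove_snap_if_installed microk8s
```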

By following these steps, you can completely remove the Data Science Stack (DSS) and its associated components from your system.

Frequently Asked Questions (FAQ)

Q: What is Data Science Stack (DSS)?

A: Data Science Stack (DSS) is a comprehensive, ready-to-run environment for machine learning and data science. It is designed to simplify the setup and management of data science tools and frameworks, allowing users to focus on their core tasks rather than the intricacies of environment configuration.

Q: What tools are included in DSS?

A: DSS includes open-source tools such as JupyterLab and MLflow, along with popular machine learning frameworks like TensorFlow and PyTorch. It runs on MicroK8s, a container orchestration system for managing workloads.

Q: How do I install DSS?

A: To install DSS, you need to have Ubuntu 22.04 LTS or Ubuntu 24.04 LTS, an internet connection, and Snap installed. Then, you can install MicroK8s and the DSS CLI using Snap commands. For detailed instructions, refer to the official documentation or installation guide.

Q: How do I start a Jupyter Notebook with DSS?

A: You can start a Jupyter Notebook with DSS using the dss create command, specifying the desired image for your notebook. For example, to start a TensorFlow notebook, you would use dss create my-tensorflow-notebook --image=kubeflownotebookswg/jupyter-tensorflow-cuda:v1.8.0.

Q: What is the purpose of the dss status command?

A: The dss status command provides a quick overview of the current state of your DSS environment, including the status of MLflow and the availability of GPU acceleration. It helps you verify that all components are functioning correctly.

Q: How do I remove DSS from my environment?

A: To remove DSS, you can use the dss purge command, which will remove all DSS components, including Jupyter Notebooks and the MLflow server. Note that this action is irreversible and will result in the loss of all data within the DSS environment.

Q: Where can I find more information about DSS commands?

A: You can find detailed information about DSS commands by using the dss --help command to list all available commands and dss <command> --help to get detailed usage for a specific command.

Q: Is DSS free to use?

A: Yes, DSS is based on open-source tools and is free to use.

Q: Is DSS suitable for beginners in data science?

A: Yes, DSS is designed to be user-friendly and can be a great tool for beginners as it reduces the complexity of setting up a data science environment. It provides a ready-made and optimized environment that allows users to start working on data science projects quickly.

Conclusion

In summary, the Data Science Stack (DSS) simplifies the setup for data science tasks. It provides a collection of tools that work well together, making it easier to start projects quickly.

Whether you're new to data science or experienced, DSS helps you focus on your work by handling the technical setup. It's a reliable tool that supports efficient data analysis and model building.

Resource:

Data Science Stack (DSS) Documentation