

Please use IU Canvas to submit some assignments and to check your grades.

Go to IU Canvas



Technology Used

We will use FutureSystems (previously FutureGrid) facilities. Cloud computing experience is helpful but not essential. Good working experience with Java is required.

Go to FutureSystems

FutureSystems Tutorial

Free Online Tutorials



Geoffrey Charles Fox
Senior Associate Dean for Research
Distinguished Professor of Physics,
Computer Science and Informatics

More Details

Course Information

BUEX-V594 Section 37186: Big Data & Open Source Software Projects

The course covers the following material

This course studies software used in many commercial activities to analyze Big Data. The backdrop for the course is a collection of ~300 software subsystems (illustrated at an external site). We will describe the software architecture represented by this collection, which we term HPC-ABDS (High Performance Computing - enhanced Apache Big Data Stack).

  1. The cloud computing architecture underlying ABDS, contrasted with HPC.
  2. The software architecture and its different layers, as shown in the HPC-ABDS Kaleidoscope, covering the broad functionality of and rationale for each layer.
  3. Then we will go through selected software systems, about 5% of those in the Kaleidoscope, that have already been deployed on the FutureSystems (previously called FutureGrid) cloud using OpenStack and Chef recipes.
  4. Then we will go through selected software systems, about 10% of those in the Kaleidoscope, that have already been deployed on FutureSystems systems using OpenStack and Chef recipes.
  5. Each student will choose one other open source member of the Kaleidoscope and deploy it as illustrated in class.
  6. The main activity of the course will be building a significant project using multiple HPC-ABDS subsystems combined with user code and data.
  7. Projects will be suggested, or students can choose their own.

You are presumed to have the following prerequisites for this course:

  1. Elementary knowledge of a scripting language (if you lack this, it can be acquired as part of this course)
  2. Basic knowledge of Python is desirable (if you lack this, it can be acquired as part of this course)
  3. Ability to use, or learn to use, the Linux/Unix command shell (we will have a lesson on this)
  4. Basic understanding of how to install packages and programs on Linux (we will have a lesson on this)

As part of the course you will get experience in:

  1. DevOps: "software deployment automation".
  2. Linux command shell.
  3. Elementary usage of ssh.
  4. Use of GitHub to store software packages and documentation.
  5. The reproducible installation of sophisticated platforms on virtual clusters. This is facilitated by scripts developed in Python, by OpenStack Heat, or by a DevOps framework such as Ansible, Chef, or Puppet. The framework chosen will depend on the experience level of the student.
  6. The utility of the key parts of the Big Data Stack.

Course Lessons

Unit 6: More on Software Stack (only one part)

In this week, you will learn how to gain access to the FutureSystems resources. Some of the lessons have been prepared for beginners to help them understand the basics of the Linux operating system and collaboration tools such as GitHub, Google Hangouts, and remote desktop. Please watch the video lessons and read the online materials on this page. The unit also covers Unix shell scripting, SSH, and other utilities, with various exercises.

In this week, you will learn about OpenStack and public clouds. OpenStack is an open-source cloud computing software platform and a community-driven project. You can use OpenStack to build a cloud infrastructure in your public or private network, or you can simply use cloud software for your services. The lessons in this week are designed to let you try OpenStack and to give you the confidence and understanding needed to use IaaS cloud platforms. There are tutorial lessons that explore the OpenStack web dashboard (Horizon) and compute engine (Nova), as well as public clouds such as Amazon EC2 and Microsoft Azure.

In this week, you will learn about Cloudmesh, cloud resource management software written in Python. It automates launching multiple VM instances across different cloud platforms, including Amazon EC2, Microsoft Azure Virtual Machine, HP Cloud, OpenStack, and Eucalyptus. The Cloudmesh web interface helps users and administrators manage all of their cloud resources. Cutting-edge technologies such as Apache Libcloud, Celery, IPython, Flask, Fabric, Docopt, YAML, MongoDB, and Sphinx are used to enhance its web service, command-line tools, and REST APIs.

In this week, you will learn about open-source configuration management (CM) software as part of IT automation and orchestration. We focus on Ansible and OpenStack Heat to review system configuration and management, but Salt, Puppet, Chef, and Juju are also introduced so that you can explore other tools. By comparing the features of these tools, you will see which one best fits your system environment and understand basic CM techniques. A few lab sessions provide hands-on experience with deploying and configuring applications on IT infrastructure.
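
A common idea behind CM tools is idempotence: a task describes a desired state and changes the system only when that state is not yet met, so running it twice is safe. The following is a minimal Python sketch of that idea under stated assumptions. It is not how Ansible or OpenStack Heat is implemented, and the helper name `ensure_line` is hypothetical.

```python
from pathlib import Path


def ensure_line(path, line):
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed and False if it was already
    in the desired state. (Hypothetical helper, for illustration only.)
    """
    target = Path(path)
    existing = target.read_text().splitlines() if target.exists() else []
    if line in existing:
        return False  # desired state already met, so do nothing
    existing.append(line)
    target.write_text("\n".join(existing) + "\n")
    return True
```

Calling `ensure_line` twice with the same arguments changes the file only the first time; the second call reports that nothing needed to be done.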

This week, you will learn the basics of virtual clusters. Analyzing large data sets that contain unstructured data typically requires distributed computing resources that provide high performance, scalability, and availability. With virtualization technology, cluster computing can be more flexible, effective, and cost-efficient in terms of resource utilization. Three basic tutorials cover deploying a virtual cluster, a Hadoop cluster, and a MongoDB sharded cluster. They give you a chance to gain experience in setting up virtual clusters manually and configuring software with Cloudmesh.
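
When a cluster is set up by hand, one typical step is giving every node the same name-to-address mapping. The sketch below generates hosts-file style entries for one master and several workers. It is only an illustration: the node names, the consecutive address assignment, and the function `cluster_hosts` are assumptions, and the course tutorials may handle this differently.

```python
import ipaddress


def cluster_hosts(first_ip, worker_count):
    """Return hosts-file style lines for one master and `worker_count` workers.

    Addresses are assigned consecutively starting at `first_ip`.
    The names `master` and `workerN` are hypothetical.
    """
    start = ipaddress.ip_address(first_ip)
    names = ["master"] + [f"worker{i}" for i in range(1, worker_count + 1)]
    return [f"{start + offset} {name}" for offset, name in enumerate(names)]


for entry in cluster_hosts("10.0.0.10", 2):
    print(entry)
```

The same list could be written to each node so that all nodes resolve each other by name.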

Unit 8: Overview of HPC-ABDS Software Stack

This consists of the ~300 technologies in HPC-ABDS, described with roughly one page per technology. It is divided into several lessons in separate PowerPoint presentations, organized by the layers of HPC-ABDS technology.

Apache Big Data Stack

FutureSystems Access

Getting started with hands-on access:

  1. Create an account on the FutureSystems Portal.
  2. Request to be added to project FG-465.
  3. Upload a public SSH key to the FutureSystems portal in order to access FutureSystems systems. The initial steps are described in the videos 1) Get a Portal Account and 2) Upload an SSH key.
  4. Explore the OpenStack Tutorial.
  5. Instructions for account creation, joining a project and uploading an SSH key are all available here.
  6. If you are using Windows, the simplest solution for using SSH keys is the Putty SSH client and its SSH authentication agent, Pageant. Putty and its associated programs are available here.
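
Before uploading a key, it can help to check that the text being pasted is a well-formed public key and not, for example, a private key or a truncated line. The sketch below checks the usual OpenSSH single-line layout (`<type> <base64 blob> [comment]`), in which the decoded blob begins with a 4-byte length followed by the same type name. The function name `public_key_type` is hypothetical, and the portal may apply its own, different checks.

```python
import base64
import struct


def public_key_type(line):
    """Return the key type of an OpenSSH public key line, or None if malformed."""
    fields = line.strip().split()
    if len(fields) < 2:
        return None
    key_type, blob_b64 = fields[0], fields[1]
    try:
        blob = base64.b64decode(blob_b64, validate=True)
    except ValueError:
        return None  # not valid base64
    if len(blob) < 4:
        return None
    # The blob starts with a big-endian length, then the key type name.
    (name_len,) = struct.unpack(">I", blob[:4])
    if blob[4:4 + name_len] != key_type.encode():
        return None
    return key_type
```

For example, the function could be applied to the contents of a public key file such as `~/.ssh/id_rsa.pub`, assuming the key was generated in the default location.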

FREE Java/Python Training via IU to