
Instructor

Professor Geoffrey Fox received a PhD in Theoretical Physics from Cambridge University and is now Professor of Informatics and Computing as well as Physics at Indiana University, where he is director of the Digital Science Center and Associate Dean for Research and Graduate Studies at the School of Informatics and Computing. He previously held positions at Caltech, Syracuse University and Florida State University.

He has published around 1,000 papers in Physics and Computer Science, supervised 65 PhD students, and has an h-index of 67 with over 23,000 citations. Professor Fox currently works on applying Computer Science to Bioinformatics, Sensor Clouds, Earthquake and Ice-sheet Science, and Particle Physics. He is principal investigator of FutureGrid -- a facility to enable development of new approaches to computing. He is involved in several projects, including the eHumanity portal, to enhance the capability of Minority Serving Institutions. A Fellow of APS and ACM, he has experience in online education and its use in MOOCs for areas like Data and Computational Science.

Course Content

Section 1 - Introduction
1, 2
4h 39min

This section has a technical overview of the course followed by a broad motivation for the course.

The course overview covers its content and structure. It presents the X-Informatics fields (defined values of X) and the rallying cry of the course: Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics (or e-X). The course is set up as a MOOC divided into units that vary in length but are typically around an hour, and those are further subdivided into 5-15 minute lessons. The course covers a mix of applications (the X in X-Informatics) and technologies needed to support the field electronically, i.e. to process the application data. The overview ends with a discussion of course content at the highest level. The course starts with a longish Motivation unit summarizing clouds and data science, then units describing applications (X = Physics, e-Commerce, Web Search and Text mining, Health, Sensors and Remote Sensing). These are interspersed with discussions of infrastructure (clouds) and data analytics (algorithms like clustering and collaborative filtering used in applications). The course uses either Python or Java and there are Side MOOCs discussing the Python and Java tracks.

The course motivation starts with striking examples of the data deluge, with examples from research, business and the consumer. The growing number of jobs in data science is highlighted. He describes industry trends in both clouds and big data. Then the cloud computing model developed at amazing speed by industry is introduced. The 4 paradigms of scientific research are described, with the growing importance of the data-oriented version. He covers 3 major X-Informatics areas: Physics, e-Commerce and Web Search, followed by a broad discussion of cloud applications. Parallel computing in general and particular features of MapReduce are described. He comments on data science education and the benefits of using MOOCs.

  • Unit Overview
  • Lesson 1 - Course in One Page
  • Lesson 2 - Overall Introduction
  • Lesson 3 - Course Topics I
  • Lesson 4 - Course Topics II
  • Lesson 5 - Course Topics III
  • Overview

    Geoffrey gives a short introduction to the course covering its content and structure. He presents the X-Informatics fields (defined values of X) and the rallying cry of the course: Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics (or e-X). The course is set up as a MOOC divided into units that vary in length but are typically around an hour, and those are further subdivided into 5-15 minute lessons.

    The course covers a mix of applications (the X in X-Informatics) and technologies needed to support the field electronically, i.e. to process the application data. The introduction ends with a discussion of course content at the highest level.

    The course starts with a longish Motivation unit summarizing clouds and data science, then units describing applications (X = Physics, e-Commerce, Web Search and Text mining, Health, Sensors and Remote Sensing). These are interspersed with discussions of infrastructure (clouds) and data analytics (algorithms like clustering and collaborative filtering used in applications).

    The course uses either Python or Java and there are Side MOOCs discussing Python and Java tracks.

  • 1.1 - Course in One Page

    Geoffrey gives a short introduction to the course covering its content and structure. He presents the X-Informatics fields (defined values of X) and the rallying cry of the course: Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics (or e-X). The course is set up as a MOOC divided into units that vary in length but are typically around an hour, and those are further subdivided into 5-15 minute lessons. Geoffrey follows the discussion of course mechanics with a list of all the units offered.

  • 1.2 - Overall Introduction

    This course gives an overview of big data from a use case (application) point of view, noting that big data in field X drives the concept of X-Informatics. It covers applications, algorithms and infrastructure/technology (cloud computing). There are 3 versions of the Spring 2014 course: I400 Informatics at IU for undergraduates, I590 Informatics at IU for graduate students, and an I590 component of the non-residential data science certificate. They differ in homework and recommended/required lectures. A single web resource handles lectures for all 3 classes.

  • 1.3 - Course Topics I

    Geoffrey discusses some of the available units:
    Motivation: Big Data and the Cloud; Centerpieces of the Future Economy
    Introduction: What is Big Data, Data Analytics and X-Informatics
    Python for Big Data Applications and Analytics: NumPy, SciPy, MatPlotlib
    Using FutureGrid for Big Data Applications and Analytics Course
    X-Informatics Physics Use Case, Discovery of Higgs Particle; Counting Events and Basic Statistics Parts I-IV

  • 1.4 - Course Topics II

    Geoffrey discusses some more of the available units:
    X-Informatics Use Cases: Big Data Use Cases Survey
    Using Plotviz Software for Displaying Point Distributions in 3D
    X-Informatics Use Case: e-Commerce and Lifestyle with recommender systems
    Technology Recommender Systems - K-Nearest Neighbors, Clustering and heuristic methods
    Parallel Computing Overview and familiar examples
    Cloud Technology for Big Data Applications & Analytics

  • 1.5 - Course Topics III

    Geoffrey discusses the remainder of the available units:
    X-Informatics Use Case: Web Search and Text Mining and their technologies
    Technology for X-Informatics: PageRank
    Technology for X-Informatics: Kmeans
    Technology for X-Informatics: MapReduce
    Technology for X-Informatics: Kmeans and MapReduce Parallelism
    X-Informatics Use Case: Sports
    X-Informatics Use Case: Health
    X-Informatics Use Case: Sensors
    X-Informatics Use Case: Radar for Remote Sensing.

  • Unit Overview
  • Lesson 1 - Introduction
  • Lesson 2 - Data Deluge
  • Lesson 3 - Jobs
  • Lesson 4 - Industrial Trends
  • Lesson 5 - Digital Disruption of Old Favorites
  • Lesson 6 - Computing Model: Industry adopted clouds which are attractive for data analytics
  • Lesson 7 - Research Model: 4th Paradigm; From Theory to Data driven science?
  • Lesson 8 - Data Science Process
  • Lesson 9 - Physics-Informatics Looking for Higgs Particle with Large Hadron Collider LHC
  • Lesson 10 - Recommender Systems I
  • Lesson 11 - Recommender Systems II
  • Lesson 12 - Web Search and Information Retrieval
  • Lesson 13 - Cloud Application in Research
  • Lesson 14 - Parallel Computing and MapReduce
  • Lesson 15 - Data Science Education
  • Lesson 16 - Conclusions
  • Overview

    Geoffrey motivates the study of X-Informatics by describing data science and clouds. He starts with striking examples of the data deluge, with examples from research, business and the consumer. The growing number of jobs in data science is highlighted. He describes industry trends in both clouds and big data.

    He introduces the cloud computing model developed at amazing speed by industry. The 4 paradigms of scientific research are described, with the growing importance of the data-oriented version. He covers 3 major X-Informatics areas: Physics, e-Commerce and Web Search, followed by a broad discussion of cloud applications. Parallel computing in general and particular features of MapReduce are described. He comments on data science education and the benefits of using MOOCs.

  • 2.1 - Introduction

    This presents an overview of the talk and some trends in computing, data and jobs. Gartner's emerging technology hype cycle shows many areas of Clouds and Big Data. Geoffrey highlights 6 issues of importance: the economic imperative, the computing model, the research model, opportunities in advancing computing, opportunities in X-Informatics, and data science education.

  • 2.2 - Data Deluge

    Geoffrey gives some amazing statistics for total storage; uploaded videos and photos; the social media interactions every minute; aspects of the business big data tidal wave; monitors of aircraft engines; the science research data sizes from particle physics to astronomy and earth science; genes sequenced; and finally the long tail of science. The next slide emphasizes applications using algorithms on clouds. This leads to the rallying cry ''Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics educated in data science'' with a catalog of the many values of X: ''Astronomy, Biology, Biomedicine, Business, Chemistry, Climate, Crisis, Earth Science, Energy, Environment, Finance, Health, Intelligence, Lifestyle, Marketing, Medicine, Pathology, Policy, Radar, Security, Sensor, Social, Sustainability, Wealth and Wellness''.

  • 2.3 - Jobs

    Jobs abound in clouds and data science. There are documented shortages of talent in data science and computer science, and the major tech companies advertise for new talent.

  • 2.4 - Industrial Trends

    Trends include the growing importance of mobile devices and the comparative decrease in desktop access, the export of internet content, the change in dominant client operating systems, the use of social media, and thriving Chinese internet companies.

  • 2.5 - Digital Disruption of Old Favorites

    Not everything goes up. The rise of the Internet has led to declines in some traditional areas, including shopping malls and postal services.

  • 2.6 - Computing Model: Industry adopted clouds which are attractive for data analytics

    Clouds and Big Data are transformational on a 2-5 year time scale. Already Amazon AWS is a lucrative business with almost $4B in revenue. Geoffrey describes the nature of cloud centers with their economies of scale and gives examples of the importance of virtualization in server consolidation. Then key characteristics of clouds are reviewed, with expected high growth in Infrastructure, Platform and Software as a Service.

  • 2.7 - Research Model: 4th Paradigm; From Theory to Data driven science?

    Geoffrey introduces the 4 paradigms of scientific research with the focus on the new fourth data-driven methodology.

  • 2.8 - Data Science Process

    Geoffrey introduces the DIKW data to information to knowledge to wisdom paradigm. Data flows through cloud services transforming itself and emerging as new information to input into other transformations.

  • 2.9 - Physics-Informatics Looking for Higgs Particle with Large Hadron Collider LHC

    Geoffrey looks at an important particle physics example where the Large Hadron Collider has observed the Higgs Boson. He shows this discovery as a bump in a histogram; something that so amazed him 50 years ago that he got a PhD in this field. He left the field partly due to the incredible size of author lists on papers.

  • 2.10 - Recommender Systems I

    Many important applications involve matching users, web pages, jobs, movies, books, events etc. These are all optimization problems, with recommender systems one important way of performing this optimization. Geoffrey goes through the example of Netflix -- everything is a recommendation -- and muses about the power of viewing all sorts of things as items in a bag, or more abstractly as points in some space with funny properties.
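
    As a concrete hint of how such matching can work, here is a minimal user-based nearest-neighbor sketch (not the Netflix system; the tiny rating matrix is invented for illustration) that scores a user's unrated items from the ratings of the most similar users.

        import numpy as np

        # Toy user-item rating matrix (rows = users, columns = items); 0 means "unrated".
        ratings = np.array([[5, 4, 0, 1],
                            [4, 5, 1, 0],
                            [1, 0, 5, 4],
                            [0, 1, 4, 5]], dtype=float)

        def cosine(u, v):
            return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

        def recommend(user, k=2):
            # Find the k most similar users and average their ratings for unseen items.
            sims = [(cosine(ratings[user], ratings[other]), other)
                    for other in range(len(ratings)) if other != user]
            top = sorted(sims, reverse=True)[:k]
            scores = np.mean([ratings[other] for _, other in top], axis=0)
            unseen = np.where(ratings[user] == 0)[0]
            return sorted(unseen, key=lambda i: -scores[i])

        print(recommend(0))   # items user 0 has not rated, best-scored first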

  • 2.11 - Recommender Systems II

    Many important applications involve matching users, web pages, jobs, movies, books, events etc. These are all optimization problems, with recommender systems one important way of performing this optimization. Geoffrey goes through the example of Netflix -- everything is a recommendation -- and muses about the power of viewing all sorts of things as items in a bag, or more abstractly as points in some space with funny properties.

  • 2.12 - Web Search and Information Retrieval

    This course also looks at Web Search, and here Geoffrey gives an overview of the data analytics for web search, PageRank as a method of ranking the web pages returned, and uses material from Yahoo on the subtle algorithms for dynamic personalized choice of material for web pages.
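
    As a rough illustration of the PageRank idea (a simplified sketch, not the production algorithm or the Yahoo material discussed in the lecture), rank can be computed by repeatedly redistributing it along links with a damping factor; the four-page link graph below is made up.

        import numpy as np

        # Tiny link graph: links[i] lists the pages that page i points to (made-up example).
        links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
        n = len(links)
        d = 0.85                      # damping factor
        rank = np.full(n, 1.0 / n)    # start from a uniform rank

        for _ in range(50):           # power iteration
            new = np.full(n, (1 - d) / n)
            for page, outgoing in links.items():
                for target in outgoing:
                    new[target] += d * rank[page] / len(outgoing)
            rank = new

        print(rank)   # pages with many incoming links accumulate more rank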

  • 2.13 - Cloud Application in Research

    Geoffrey describes scientific applications and how they map onto clouds, supercomputers, grids and high throughput systems. He likes the cloud use of the Internet of Things and gives examples.

  • 2.14 - Parallel Computing and MapReduce

    Geoffrey defines MapReduce and gives a homely example from fruit blending.
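
    To make the definition concrete, here is a minimal single-machine sketch of the MapReduce pattern applied to word counting (the classic example, used here instead of the lecture's fruit-blending analogy): map each record to key-value pairs, group by key, then reduce each group.

        from collections import defaultdict

        documents = ["big data on clouds", "clouds for big data analytics"]

        # Map: emit (key, value) pairs from each input record.
        def map_phase(doc):
            return [(word, 1) for word in doc.split()]

        # Shuffle: group all emitted values by key.
        grouped = defaultdict(list)
        for doc in documents:
            for key, value in map_phase(doc):
                grouped[key].append(value)

        # Reduce: combine the values for each key.
        counts = {key: sum(values) for key, values in grouped.items()}
        print(counts)   # e.g. {'big': 2, 'data': 2, 'clouds': 2, ...}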

  • 2.15 - Data Science Education

    Geoffrey discusses one reason you are taking this course -- Data Science as an educational initiative -- and aspects of its Indiana University implementation. Then general features of online education are discussed, with clear growth spearheaded by MOOCs, where Geoffrey uses this course and others as examples. He stresses the choice between one class taught to 100,000 students or 2,000 classes taught to 50 students each, and an online library of MOOC lessons. In olden days he suggested a ''hermit's cage virtual university'' -- gurus in isolated caves putting together exciting curricula outside the traditional university model. Grading and mentoring models and important online tools are discussed. Clouds have MOOCs describing them and MOOCs are stored in clouds; a pleasing symmetry.

  • 2.16 - Conclusions

    The conclusions highlight clouds, data-intensive methodology, employment, data science, MOOCs, and the Big Data ecosystem captured in one sentence: ''Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics educated in data science''.

Section 2 - Overview of Data Science: What is Big Data, Data Analytics and X-Informatics?
3, 4, 5
2h 32min

The course introduction starts with X-Informatics and its rallying cry. The growing number of jobs in data science is highlighted. The first unit offers a look at the phenomenon described as the Data Deluge starting with its broad features. Data science and the famous DIKW (Data to Information to Knowledge to Wisdom) pipeline are covered. Then more detail is given on the flood of data from Internet and Industry applications with eBay and General Electric discussed in most detail.

In the next unit, Geoffrey continues the discussion of the data deluge with a focus on scientific research. He takes a first peek at data from the Large Hadron Collider considered later as physics Informatics and gives some biology examples. He discusses the implication of data for the scientific method which is changing with the data-intensive methodology joining observation, theory and simulation as basic methods. Two broad classes of data are the long tail of sciences: many users with individually modest data adding up to a lot; and a myriad of Internet connected devices -- the Internet of Things.

Geoffrey gives an initial technical overview of cloud computing as pioneered by companies like Amazon, Google and Microsoft with new centers holding up to a million servers. The benefits of Clouds in terms of power consumption and the environment are also touched upon, followed by a list of the most critical features of Cloud computing with a comparison to supercomputing. Features of the data deluge are discussed with a salutary example where more data did better than more thought. Then comes Data science and one part of it -- data analytics -- the large algorithms that crunch the big data to give big wisdom. There are many ways to describe data science and several are discussed to give a good composite picture of this emerging field.

  • Unit Overview
  • Lesson 1 - What is X-Informatics and its Motto
  • Lesson 2 - Jobs
  • Lesson 3 - Data Deluge -- General Structure
  • Lesson 4 - Data Science -- Process
  • Lesson 5 - Data Deluge -- Internet
  • Lesson 6 - Data Deluge -- Business I
  • Lesson 7 - Data Deluge -- Business II
  • Lesson 8 - Data Deluge -- Business III
  • Overview

    Geoffrey starts with X-Informatics and its rallying cry. The growing number of jobs in data science is highlighted. This unit offers a look at the phenomenon described as the Data Deluge starting with its broad features. Then he discusses data science and the famous DIKW (Data to Information to Knowledge to Wisdom) pipeline. Then more detail is given on the flood of data from Internet and Industry applications with eBay and General Electric discussed in most detail.

  • 3.1 - What is X-Informatics and its Motto

    This discusses trends that are driven by and accompany Big Data. We give some key terms including data, information, knowledge, wisdom, data analytics and data science. We introduce the motto of the course: Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics. We list the many values of X defined by various activities across the world.

  • 3.2 - Jobs

    Big data is especially important as there are so many related jobs. We illustrate this for both cloud computing and data science from reports by Microsoft and the McKinsey Institute respectively. We show a plot from LinkedIn showing the rapid increase in the number of data science and analytics jobs as a function of time.

  • 3.3 - Data Deluge -- General Structure

    We look at some broad features of the data deluge, starting with the size of data in various areas, especially in science research. We give examples from the real world of the importance of big data and illustrate how it is integrated into an enterprise IT architecture. We give some views as to what characterizes Big Data and why data science is a science that is needed to interpret all the data.

  • 3.4 - Data Science -- Process

    We stress the DIKW pipeline: Data becomes information that becomes knowledge and then wisdom, policy and decisions. This pipeline is illustrated with Google maps and we show how complex the ecosystem of data, transformations (filters) and its derived forms is.

  • 3.5 - Data Deluge -- Internet

    We give examples of Big data from the Internet with Tweets, uploaded photos and an illustration of the vitality and size of many commodity applications.

  • 3.6 - Data Deluge -- Business I

    We give examples including the Big Data that enables wind farms, city transportation, telephone operations, machines with health monitors, and the banking, manufacturing and retail industries both online and offline in shopping malls. We give examples from eBay showing how analytics allows it to refine and improve the customer experience.

  • 3.7 - Data Deluge -- Business II

    We give examples including the Big Data that enables wind farms, city transportation, telephone operations, machines with health monitors, and the banking, manufacturing and retail industries both online and offline in shopping malls. We give examples from eBay showing how analytics allows it to refine and improve the customer experience.

  • 3.8 - Data Deluge -- Business III

    We give examples including the Big Data that enables wind farms, city transportation, telephone operations, machines with health monitors, and the banking, manufacturing and retail industries both online and offline in shopping malls. We give examples from eBay showing how analytics allows it to refine and improve the customer experience.

  • Unit Overview
  • Lesson 1 - Science & Research I
  • Lesson 2 - Science & Research II
  • Lesson 3 - Implications for Scientific Method
  • Lesson 4 - Long Tail of Science
  • Lesson 5 - Internet of Things
  • Overview

    Geoffrey continues the discussion of the data deluge with a focus on scientific research. He takes a first peek at data from the Large Hadron Collider considered later as physics Informatics and gives some biology examples. He discusses the implication of data for the scientific method which is changing with the data-intensive methodology joining observation, theory and simulation as basic methods.

    We discuss the long tail of sciences; many users with individually modest data adding up to a lot. The last lesson emphasizes how everyday devices -- the Internet of Things -- are being used to create a wealth of data.

  • 4.1 - Science & Research I

    We look into more big data examples with a focus on science and research. We cover astronomy, genomics, radiology, particle physics and the discovery of the Higgs particle (covered in more detail in later lessons), and the European Bioinformatics Institute, and contrast these with Facebook and Walmart.

  • 4.2 - Science & Research II

    We look into more big data examples with a focus on science and research. We cover astronomy, genomics, radiology, particle physics and the discovery of the Higgs particle (covered in more detail in later lessons), and the European Bioinformatics Institute, and contrast these with Facebook and Walmart.

  • 4.3 - Implications for Scientific Method

    We discuss the emergence of a new fourth methodology for scientific research based on data-driven inquiry. We contrast this with the third -- computation or simulation based discovery -- methodology, which itself emerged some 25 years ago.

  • 4.4 - Long Tail of Science

    There is big science such as particle physics, where a single experiment has 3000 people collaborating! Then there are individual investigators who don't generate a lot of data each, but together they add up to Big Data.

  • 4.5 - Internet of Things

    A final category of Big Data comes from the Internet of Things, where lots of small devices -- smart phones, web cams, video games -- collect and disseminate data and are controlled and coordinated in the cloud.

  • Unit Overview
  • Lesson 1 - Clouds
  • Lesson 2 - Features of Data Deluge I
  • Lesson 3 - Features of Data Deluge II
  • Lesson 4 - Data Science Process
  • Lesson 5 - Data Analytics I
  • Lesson 6 - Data Analytics II
  • Overview

    Geoffrey gives an initial technical overview of cloud computing as pioneered by companies like Amazon, Google and Microsoft with new centers holding up to a million servers. The benefits of Clouds in terms of power consumption and the environment are also touched upon, followed by a list of the most critical features of Cloud computing with a comparison to supercomputing.

    He discusses features of the data deluge with a salutary example where more data did better than more thought. He introduces data science and one part of it -- data analytics -- the large algorithms that crunch the big data to give big wisdom. There are many ways to describe data science and several are discussed to give a good composite picture of this emerging field.

  • 5.1 - Clouds

    We describe cloud data centers with their staggering size with up to a million servers in a single data center and centers built modularly from shipping containers full of racks. The benefits of Clouds in terms of power consumption and the environment are also touched upon, followed by a list of the most critical features of Cloud computing and a comparison to supercomputing.

  • 5.2 - Features of Data Deluge I

    Data, information, intelligence, algorithms, infrastructure, data structure, semantics and knowledge are related. The semantic web and Big Data are compared. We give an example where ''More data usually beats better algorithms''. We discuss examples of intelligent big data and list 8 different types of data deluge.

  • 5.3 - Features of Data Deluge II

    Data, information, intelligence, algorithms, infrastructure, data structure, semantics and knowledge are related. The semantic web and Big Data are compared. We give an example where ''More data usually beats better algorithms''. We discuss examples of intelligent big data and list 8 different types of data deluge.

  • 5.4 - Data Science Process

    We describe and critique one view of the work of a data scientist. Then we discuss and contrast 7 views of the process needed to speed data through the DIKW pipeline.

  • 5.5 - Data Analytics I

    We stress the importance of data analytics giving examples from several fields. We note that better analytics is as important as better computing and storage capability.

  • 5.6 - Data Analytics II

    We stress the importance of data analytics giving examples from several fields. We note that better analytics is as important as better computing and storage capability.

Section 3 - Technology Training - Python & FutureGrid
6, 7
1h 53min

This section is meant to give an overview of the Python tools needed for this course. These are really powerful tools which every data scientist who wishes to use Python must know. This section covers: Canopy -- an IDE for Python developed by Enthought, whose aim is to bring the various Python libraries under one single framework or ''canopy'', hence the name. NumPy -- a popular library on top of which many other libraries (like pandas and SciPy) are built; it provides a way of vectorizing data, which helps organize the data in a more intuitive fashion and also lets us use the various matrix operations popular in the machine learning community. Matplotlib -- a data visualization package that allows you to create graphs, charts and other such diagrams; it supports images in JPEG, GIF and TIFF formats. SciPy -- a library built on top of NumPy with a number of off-the-shelf algorithms and operations implemented, including algorithms from calculus (like integration), statistics, linear algebra, image processing, signal processing, machine learning, etc.

  • Unit Overview
  • Lesson 1 - Introduction
  • Lesson 2 - Canopy
  • Lesson 3 - Numpy 1
  • Lesson 4 - Numpy 2
  • Lesson 5 - Numpy 3
  • Lesson 6 - Matplotlib 1
  • Lesson 7 - Matplotlib 2
  • Lesson 8 - Scipy 1
  • Lesson 9 - Scipy 2
  • Overview

    This section is meant to give an overview of the Python tools needed for this course. These are really powerful tools which every data scientist who wishes to use Python must know.

  • 6.1 - Introduction

    This section is meant to give an overview of the Python tools needed for this course. These are really powerful tools which every data scientist who wishes to use Python must know. This section covers Canopy, NumPy, Matplotlib, and SciPy.

  • 6.2 - Canopy

    Canopy is an IDE for Python developed by Enthought. The aim of this IDE is to bring the various Python libraries under one single framework or ''canopy'' -- that is why the name.

  • 6.3 - Numpy 1

    NumPy is a popular library on top of which many other libraries (like pandas and SciPy) are built. It provides a way of vectorizing data, which helps organize the data in a more intuitive fashion and also lets us use the various matrix operations popular in the machine learning community.
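
    As a minimal illustration (not taken from the lesson) of what vectorizing data buys you, array arithmetic and matrix operations replace explicit Python loops:

        import numpy as np

        x = np.linspace(0.0, 1.0, 5)       # 5 evenly spaced points in [0, 1]
        y = 3.0 * x ** 2 + 1.0             # whole-array arithmetic, no explicit loop

        A = np.array([[1.0, 2.0],
                      [3.0, 4.0]])
        b = np.array([1.0, 1.0])
        print(A @ b)                       # matrix-vector product
        print(y.mean(), y.std())           # simple statistics over the whole array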

  • 6.4 - Numpy 2

    NumPy is a popular library on top of which many other libraries (like pandas and SciPy) are built. It provides a way of vectorizing data, which helps organize the data in a more intuitive fashion and also lets us use the various matrix operations popular in the machine learning community.

  • 6.5 - Numpy 3

    NumPy is a popular library on top of which many other libraries (like pandas and SciPy) are built. It provides a way of vectorizing data, which helps organize the data in a more intuitive fashion and also lets us use the various matrix operations popular in the machine learning community.

  • 6.6 - Matplotlib 1

    Matplotlib is a data visualization package. It allows you to create graphs, charts and other such diagrams. It supports images in JPEG, GIF and TIFF formats.
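
    A minimal sketch of the kind of plot used throughout the course -- a labeled histogram written to a PNG file (the data and filename are invented for illustration):

        import numpy as np
        import matplotlib.pyplot as plt

        data = np.random.normal(loc=0.0, scale=1.0, size=10000)

        plt.hist(data, bins=50)
        plt.xlabel("value")
        plt.ylabel("counts per bin")
        plt.title("Histogram of normally distributed samples")
        plt.savefig("histogram.png")   # or plt.show() in an interactive session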

  • 6.7 - Matplotlib 2

    Matplotlib is a data visualization package. It allows you to create graphs, charts and other such diagrams. It supports images in JPEG, GIF and TIFF formats.

  • 6.8 - Scipy 1

    SciPy is a library built on top of NumPy with a number of off-the-shelf algorithms and operations implemented. These include algorithms from calculus (like integration), statistics, linear algebra, image processing, signal processing, machine learning, etc.
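
    For instance, a small sketch (assuming a standard SciPy installation) of its numerical integration and statistics routines:

        import numpy as np
        from scipy import integrate, stats

        # Numerically integrate sin(x) from 0 to pi (the exact answer is 2).
        value, error = integrate.quad(np.sin, 0.0, np.pi)
        print(value, error)

        # Probability that a standard normal variable lies within one sigma (about 0.68).
        print(stats.norm.cdf(1.0) - stats.norm.cdf(-1.0))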

  • 6.9 - Scipy 2

    SciPy is a library built on top of NumPy with a number of off-the-shelf algorithms and operations implemented. These include algorithms from calculus (like integration), statistics, linear algebra, image processing, signal processing, machine learning, etc.

  • Unit Overview
  • Lesson 1 - FutureSystems Overview
  • Lesson 2 - Creating Portal Account
  • Lesson 3 - Upload an OpenId
  • Lesson 4 - Upload SSH Key
  • Lesson 5 - Joining a project
  • Lesson 6 - Using FS - Creating VM using Cloudmesh and running IPython
  • Lesson 7 - How to run Java Class Programs on Virtual Machine
  • Overview

    This unit gives an overview of FutureGrid and how to use it for the Big Data course. It also covers creating a FutureGrid account, uploading an OpenID and an SSH key, instantiating and logging into a virtual machine, and accessing IPython. At the end we discuss running Python and Java on the virtual machine.

  • 7.1 - FutureSystems Overview

    In this video Geoffrey introduces FutureGrid in terms of its services and features.

  • 7.2 - Creating Portal Account

    This lesson explains how to create a portal account, which is the first step in gaining access to FutureGrid.

  • 7.3 - Upload an OpenId

    This lesson explains how to upload and use OpenID to easily log into the FutureGrid portal.

  • 7.4 - Upload SSH Key

    This lesson explains how to upload and use an SSH key to log in to FutureGrid resources.

  • 7.5 - Joining a project

    This lesson explains how to join a FutureSystems project. For this class please join project number 455.

  • 7.6 - Using FS - Creating VM using Cloudmesh and running IPython

    This lesson explains how to log into FutureGrid and use our customized shell and menu options that will simplify management of the VMs for the upcoming lessons.

  • 7.7 - How to run Java Class Programs on Virtual Machine

    This lesson explains how to run Java and Python on FutureGrid.

Section 4 - X= Physics Case Study
8, 9, 10, 11
3h 7min

This section starts by describing the LHC accelerator at CERN and the evidence found by the experiments suggesting the existence of the Higgs Boson. The huge number of authors on a paper and remarks on histograms and Feynman diagrams are followed by an accelerator picture gallery. The next unit is devoted to Python experiments looking at histograms of Higgs Boson production with various shapes of the signal, various backgrounds and various event totals. Then random variables and some simple principles of statistics are introduced, with an explanation as to why they are relevant to physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena. Several Python illustrations are given. Random numbers with their generators and seeds lead to a discussion of the Binomial and Poisson distributions and of Monte Carlo and accept-reject methods. The Central Limit Theorem concludes the discussion.

  • Unit Overview
  • Lesson 1 - Looking for Higgs Particle and Counting Introduction I
  • Lesson 2 - Looking for Higgs Particle and Counting Introduction II
  • Lesson 3 - Physics-Informatics Looking for Higgs Particle Experiments
  • Lesson 4 - Accelerator Picture Gallery of Big Science
  • Overview

    This unit is devoted to Python and Java experiments with Geoffrey looking at histograms of Higgs Boson production with various shapes of the signal, various backgrounds and various event totals. The lectures use Python but the use of Java is described.

  • 8.1 - Looking for Higgs Particle and Counting Introduction I

    We return to the particle physics case with slides used in the introduction and stress that particles often manifest themselves as bumps in histograms, and that those bumps need to be large enough to stand out from the background in a statistically significant fashion.

  • 8.2 - Looking for Higgs Particle and Counting Introduction II

    We give a few details on one LHC experiment ATLAS. Experimental physics papers have a staggering number of authors and quite big budgets. Feynman diagrams describe processes in a fundamental fashion.

  • 8.3 - Physics-Informatics Looking for Higgs Particle Experiments

    We give a few details on one LHC experiment ATLAS. Experimental physics papers have a staggering number of authors and quite big budgets. Feynman diagrams describe processes in a fundamental fashion.

  • 8.4 - Accelerator Picture Gallery of Big Science

    This lesson gives a small picture gallery of accelerators: accelerators, detection chambers and magnets in tunnels, and a large underground laboratory used for experiments where you need to be shielded from backgrounds like cosmic rays.

  • Unit Overview
  • Lesson 1 - Physics Use Case II 1: Class Software
  • Lesson 2 - Physics Use Case II 2: Event Counting
  • Lesson 3 - Physics Use Case II 3: With Python examples of Signal plus Background
  • Lesson 4 - Physics Use Case II 4: Change shape of background & num of Higgs Particles
  • Overview

    This unit is devoted to Python experiments with Geoffrey looking at histograms of Higgs Boson production with various shapes of the signal, various backgrounds and various event totals.

  • 9.1 - Physics Use Case II 1: Class Software

    We discuss how this unit uses Java and Python on either a backend server (FutureGrid) or a local client. We point out a useful book on Python for data analysis. This builds on the technology training in Section 3.

  • 9.2 - Physics Use Case II 2: Event Counting

    We define ''event counting'' data collection environments. We discuss the Python and Java code to generate events according to a particular scenario (the important idea of Monte Carlo data). Here we have a sloping background plus either a Higgs particle generated similarly to the LHC observation or one observed with better resolution (smaller measurement error).
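
    A minimal sketch of this kind of Monte Carlo event generation (the numbers are illustrative, not the actual class parameters): a sloping background produced by accept-reject plus a Gaussian ''Higgs'' peak, histogrammed as counts per mass bin.

        import numpy as np
        import matplotlib.pyplot as plt

        rng = np.random.default_rng(seed=1)

        # Sloping (falling) background over a 110-140 GeV mass window,
        # generated by accept-reject against a linear shape.
        def background(n, lo=110.0, hi=140.0):
            events = []
            while len(events) < n:
                m = rng.uniform(lo, hi)
                if rng.uniform(0.0, 1.0) < (hi - m) / (hi - lo):
                    events.append(m)
            return np.array(events)

        signal = rng.normal(loc=126.0, scale=2.0, size=300)   # the "Higgs" peak
        data = np.concatenate([background(20000), signal])

        plt.hist(data, bins=60)
        plt.xlabel("mass (GeV)")
        plt.ylabel("events per bin")
        plt.savefig("higgs_counting.png")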

  • 9.3 - Physics Use Case II 3: With Python examples of Signal plus Background

    This uses Monte Carlo data both to generate data like the experimental observations and to explore the effect of changing the amount of data and changing the measurement resolution for the Higgs.

  • 9.4 - Physics Use Case II 4: Change shape of background & num of Higgs Particles

    This lesson continues the examination of Monte Carlo data, looking at the effect of changing the number of Higgs particles produced and of changing the shape of the background.

  • Unit Overview
  • Lesson 1 - Statistics Overview and Fundamental Idea: Random Variables
  • Lesson 2 - Physics and Random Variables I
  • Lesson 3 - Physics and Random Variables II
  • Lesson 4 - Statistics of Events with Normal Distributions
  • Lesson 5 - Gaussian Distributions
  • Lesson 6 - Using Statistics
  • Overview

    Geoffrey introduces random variables and some simple principles of statistics and explains why they are relevant to physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena. Several Python illustrations are given. Java is currently not available in this unit.

  • 10.1 - Statistics Overview and Fundamental Idea: Random Variables

    We go through the many different areas of statistics covered in the Physics unit. We define the statistics concept of a random variable.

  • 10.2 - Physics and Random Variables I

    We describe the DIKW pipeline for the analysis of this type of physics experiment and go through details of the analysis pipeline for the LHC ATLAS experiment. We give examples of event displays showing the final state particles seen in a few events. We illustrate how physicists decide what's going on with a plot of expected Higgs production experimental cross sections (probabilities) for signal and background.

  • 10.3 - Physics and Random Variables II

    We describe the DIKW pipeline for the analysis of this type of physics experiment and go through details of the analysis pipeline for the LHC ATLAS experiment. We give examples of event displays showing the final state particles seen in a few events. We illustrate how physicists decide what's going on with a plot of expected Higgs production experimental cross sections (probabilities) for signal and background.

  • 10.4 - Statistics of Events with Normal Distributions

    We introduce Poisson and Binomial distributions and define independent identically distributed (IID) random variables. We give the law of large numbers defining the errors in counting and leading to Gaussian distributions for many things. We demonstrate this in Python experiments.

  • 10.5 - Gaussian Distributions

    We introduce the Gaussian distribution and give Python examples of the fluctuations in counting Gaussian distributions.

  • 10.6 - Using Statistics

    We discuss the significance of a standard deviation and the role of biases and insufficient statistics in getting incorrect answers, with a Python example.

  • Unit Overview
  • Lesson 1 - Generators and Seeds I
  • Lesson 2 - Generators and Seeds II
  • Lesson 3 - Binomial Distribution
  • Lesson 4 - Accept-Reject
  • Lesson 5 - Monte Carlo Method
  • Lesson 6 - Poisson Distribution
  • Lesson 7 - Central Limit Theorem
  • Lesson 8 - Interpretation of Probability: Bayes v. Frequency
  • Overview

    Geoffrey discusses random numbers with their generators and seeds. He introduces the Binomial and Poisson distributions. Monte Carlo and accept-reject methods are discussed. The Central Limit Theorem and Bayes's law conclude the discussion. Python and Java (for students -- not reviewed in class) examples and physics applications are given.

  • 11.1 - Generators and Seeds I

    We define random numbers and describe how to generate them on the computer, giving Python examples. We define the seed used to specify how to start the generation.
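
    A small illustration (a sketch, not the class code): fixing the seed makes the ''random'' sequence reproducible, while a different seed gives a different stream.

        import numpy as np

        rng1 = np.random.default_rng(seed=42)
        rng2 = np.random.default_rng(seed=42)
        rng3 = np.random.default_rng(seed=7)

        print(rng1.random(3))   # three numbers ...
        print(rng2.random(3))   # ... repeated exactly, because the seed is the same
        print(rng3.random(3))   # a different seed gives a different stream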

  • 11.2 - Generators and Seeds II

    We define random numbers and describe how to generate them on the computer, giving Python examples. We define the seed used to specify how to start the generation.

  • 11.3 - Binomial Distribution

    We define the binomial distribution and give LHC data as an example of where this distribution is valid.

  • 11.4 - Accept-Reject

    We introduce an advanced method -- accept/reject -- for generating random variables with arbitrary distributions.
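
    A minimal accept-reject sketch (the target density is invented for illustration): propose points uniformly in a bounding box and keep only those that fall under the desired curve.

        import numpy as np

        rng = np.random.default_rng(seed=0)

        def target(x):
            # Unnormalized density we want to sample from, defined on [0, 1].
            return np.sin(np.pi * x) ** 2

        def accept_reject(n, fmax=1.0):
            samples = []
            while len(samples) < n:
                x = rng.uniform(0.0, 1.0)      # propose uniformly in x
                y = rng.uniform(0.0, fmax)     # uniform height in the bounding box
                if y < target(x):              # accept if the point lies under the curve
                    samples.append(x)
            return np.array(samples)

        print(accept_reject(5))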

  • 11.5 - Monte Carlo Method

    We define the Monte Carlo method, which in the typical case uses the accept/reject method to sample a distribution.

  • 11.6 - Poisson Distribution

    We extend the Binomial to the Poisson distribution and give a set of amusing examples from Wikipedia.

  • 11.7 - Central Limit Theorem

    We introduce the Central Limit Theorem and give examples from Wikipedia.
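
    A small numerical illustration of the theorem (a sketch with made-up parameters): means of many uniform samples cluster into an approximately Gaussian shape whose spread shrinks like 1/sqrt(n).

        import numpy as np

        rng = np.random.default_rng(seed=3)

        n = 100                                    # samples averaged per experiment
        means = rng.uniform(0.0, 1.0, size=(10000, n)).mean(axis=1)

        # Uniform(0,1) has mean 0.5 and standard deviation 1/sqrt(12);
        # the CLT predicts the sample means spread as (1/sqrt(12)) / sqrt(n).
        print(means.mean(), means.std())
        print(0.5, (1.0 / np.sqrt(12.0)) / np.sqrt(n))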

  • 11.8 - Interpretation of Probability: Bayes v. Frequency

    This lesson describes the difference between the Bayes and frequency views of probability. Bayes's law of conditional probability is derived and applied to the Higgs example to enable information about the Higgs from multiple channels and multiple experiments to be accumulated.
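
    As a toy illustration of how evidence from independent channels accumulates under Bayes's law (purely invented numbers, not the actual Higgs analysis), the prior odds are multiplied by each channel's likelihood ratio:

        # Toy Bayesian update: prior probability of "signal present" combined with
        # likelihood ratios from two independent measurement channels (made-up numbers).
        prior = 0.5
        likelihood_ratios = [3.0, 2.5]   # P(data | signal) / P(data | background) per channel

        odds = prior / (1.0 - prior)
        for lr in likelihood_ratios:
            odds *= lr                   # independent channels multiply the odds

        posterior = odds / (1.0 + odds)
        print(posterior)                 # probability of signal after both channels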

Section 5 - Big Data Use Cases Survey
12, 13, 14
5h 18min

This section covers 51 values of X and an overall study of Big Data that emerged from a NIST (National Institute of Standards and Technology) study of Big Data. The section covers the NIST Big Data Public Working Group (NBD-PWG) Process and summarizes the work of five subgroups: the Definitions and Taxonomies Subgroup, the Reference Architecture Subgroup, the Security and Privacy Subgroup, the Technology Roadmap Subgroup and the Requirements and Use Case Subgroup. The 51 use cases collected in this process are briefly discussed, with a classification of the source of parallelism and the high and low level computational structure. We describe the key features of this classification.

  • Unit Overview
  • Lesson 1 - Introduction to NIST Big Data Public Working Group (NBD-PWG) Process
  • Lesson 2 - Definitions and Taxonomies Subgroup
  • Lesson 3 - Reference Architecture Subgroup
  • Lesson 4 - Security and Privacy Subgroup
  • Lesson 5 - Technology Roadmap Subgroup
  • Lesson 6 - Requirements and Use Case Subgroup Introduction I
  • Lesson 7 - Requirements and Use Case Subgroup Introduction II
  • Lesson 8 - Requirements and Use Case Subgroup Introduction III
  • Overview

    This unit covers the NIST Big Data Public Working Group (NBD-PWG) Process and summarizes the work of five subgroups: the Definitions and Taxonomies Subgroup, the Reference Architecture Subgroup, the Security and Privacy Subgroup, the Technology Roadmap Subgroup and the Requirements and Use Case Subgroup. The work of the latter is continued in the next two units.

  • 12.1 - Introduction to NIST Big Data Public Working Group (NBD-PWG) Process

    The focus of the NBD-PWG is to form a community of interest from industry, academia, and government, with the goal of developing consensus definitions, taxonomies, secure reference architectures, and a technology roadmap. The aim is to create vendor-neutral, technology- and infrastructure-agnostic deliverables to enable big data stakeholders to pick and choose the best analytics tools for their processing and visualization requirements on the most suitable computing platforms and clusters, while allowing value-added from big data service providers and flow of data between the stakeholders in a cohesive and secure manner.

  • 12.2 - Definitions and Taxonomies Subgroup

    The focus is to gain a better understanding of the principles of Big Data. It is important to develop a consensus-based common language and vocabulary of terms used in Big Data across stakeholders from industry, academia, and government. In addition, it is also critical to identify essential actors with roles and responsibilities, and subdivide them into components and sub-components according to how they interact/relate with each other and their similarities and differences.

    For Definitions: compile terms used by all stakeholders regarding the meaning of Big Data from various standard bodies, domain applications, and diversified operational environments. For Taxonomies: identify key actors with their roles and responsibilities from all stakeholders, and categorize them into components and subcomponents based on their similarities and differences. In particular, Data Science and Big Data terms are discussed.

  • 12.3 - Reference Architecture Subgroup

    The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus-based approach to orchestrate vendor-neutral, technology- and infrastructure-agnostic analytics tools and computing environments. The goal is to enable Big Data stakeholders to pick and choose technology-agnostic analytics tools for processing and visualization on any computing platform and cluster, while allowing value-added from Big Data service providers and the flow of the data between the stakeholders in a cohesive and secure manner. Results include a reference architecture with well-defined components and linkages as well as several exemplars.

  • 12.4 - Security and Privacy Subgroup

    The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus secure reference architecture to handle security and privacy issues across all stakeholders. This includes gaining an understanding of what standards are available or under development, as well as identifying which key organizations are working on these standards. The Top Ten Big Data Security and Privacy Challenges from the CSA (Cloud Security Alliance) BDWG are studied. Specialized use cases include Retail/Marketing, Modern Day Consumerism, Nielsen Homescan, Web Traffic Analysis, Healthcare, Health Information Exchange, Genetic Privacy, Pharma Clinical Trial Data Sharing, Cyber-security, Government, Military and Education.

  • 12.5 - Technology Roadmap Subgroup

    The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus vision with recommendations on how Big Data should move forward, by performing a good gap analysis through the materials gathered from all other NBD subgroups. This includes setting standardization and adoption priorities through an understanding of what standards are available or under development as part of the recommendations. Tasks are to gather input from the NBD subgroups and study the taxonomies for the actors' roles and responsibilities, use cases and requirements, and the secure reference architecture; gain an understanding of what standards are available or under development for Big Data; perform a thorough gap analysis and document the findings; identify what possible barriers may delay or prevent adoption of Big Data; and document the vision and recommendations.

  • 12.6 - Requirements and Use Case Subgroup Introduction I

    The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus list of Big Data requirements across all stakeholders. This includes gathering and understanding various use cases from diversified application domains. Tasks are to gather use case input from all stakeholders; derive Big Data requirements from each use case; analyze/prioritize a list of challenging general requirements that may delay or prevent adoption of Big Data deployment; develop a set of general patterns capturing the ''essence'' of the use cases (not done yet); and work with the Reference Architecture Subgroup to validate the requirements and reference architecture by explicitly implementing some patterns based on the use cases. The progress of gathering use cases (discussed in the next two units) and requirements systemization are discussed.

  • 12.7 - Requirements and Use Case Subgroup Introduction II

    The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus list of Big Data requirements across all stakeholders. This includes gathering and understanding various use cases from diversified application domains. Tasks are to gather use case input from all stakeholders; derive Big Data requirements from each use case; analyze/prioritize a list of challenging general requirements that may delay or prevent adoption of Big Data deployment; develop a set of general patterns capturing the ''essence'' of the use cases (not done yet); and work with the Reference Architecture Subgroup to validate the requirements and reference architecture by explicitly implementing some patterns based on the use cases. The progress of gathering use cases (discussed in the next two units) and requirements systemization are discussed.

  • 12.8 - Requirements and Use Case Subgroup Introduction III

    The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus list of Big Data requirements across all stakeholders. This includes gathering and understanding various use cases from diversified application domains. Tasks are to gather use case input from all stakeholders; derive Big Data requirements from each use case; analyze/prioritize a list of challenging general requirements that may delay or prevent adoption of Big Data deployment; develop a set of general patterns capturing the ''essence'' of the use cases (not done yet); and work with the Reference Architecture Subgroup to validate the requirements and reference architecture by explicitly implementing some patterns based on the use cases. The progress of gathering use cases (discussed in the next two units) and requirements systemization are discussed.

  • Unit Overview
  • Lesson 1 - Government Use Cases I
  • Lesson 2 - Government Use Cases II
  • Lesson 3 - Commercial Use Cases I
  • Lesson 4 - Commercial Use Cases II
  • Lesson 5 - Commercial Use Cases III
  • Lesson 6 - Defense Use Cases I
  • Lesson 7 - Defense Use Cases II
  • Lesson 8 - Healthcare and Life Science Use Cases I
  • Lesson 9 - Healthcare and Life Science Use Cases II
  • Lesson 10 - Healthcare and Life Science Use Cases III
  • Lesson 11 - Deep Learning and Social Networks Use Cases
  • Lesson 12 - Research Ecosystem Use Cases
  • Lesson 13 - Astronomy and Physics Use Cases I
  • Lesson 14 - Astronomy and Physics Use Cases II
  • Lesson 15 - Environment, Earth and Polar Science Use Cases I
  • Lesson 16 - Environment, Earth and Polar Science Use Cases II
  • Lesson 17 - Energy Use Case
  • Overview

    This unit consists of one or more slides for each of the 51 use cases -- typically additional (more than one) slides are associated with pictures. Each of the use cases is identified with its source of parallelism and its high and low level computational structure. As each new classification topic is introduced we briefly discuss it, but a full discussion of the topics is given in the following unit.

  • 13.1 - Government Use Cases I

    This covers Census 2010 and 2000 - Title 13 Big Data; National Archives and Records Administration Accession NARA, Search, Retrieve, Preservation; Statistical Survey Response Improvement (Adaptive Design) and Non-Traditional Data in Statistical Survey Response Improvement (Adaptive Design).

  • 13.2 - Government Use Cases II

    This covers Census 2010 and 2000 - Title 13 Big Data; National Archives and Records Administration Accession NARA, Search, Retrieve, Preservation; Statistical Survey Response Improvement (Adaptive Design) and Non-Traditional Data in Statistical Survey Response Improvement (Adaptive Design).

  • 13.3 - Commercial Use Cases I

    This covers Cloud Eco-System, for Financial Industries (Banking, Securities & Investments, Insurance) transacting business within the United States; Mendeley - An International Network of Research; Netflix Movie Service; Web Search; IaaS (Infrastructure as a Service) Big Data Business Continuity & Disaster Recovery (BC/DR) Within A Cloud Eco-System; Cargo Shipping; Materials Data for Manufacturing and Simulation driven Materials Genomics.

  • 13.4 - Commercial Use Cases II

    This covers Cloud Eco-System, for Financial Industries (Banking, Securities & Investments, Insurance) transacting business within the United States; Mendeley - An International Network of Research; Netflix Movie Service; Web Search; IaaS (Infrastructure as a Service) Big Data Business Continuity & Disaster Recovery (BC/DR) Within A Cloud Eco-System; Cargo Shipping; Materials Data for Manufacturing and Simulation driven Materials Genomics.

  • 13.5 - Commercial Use Cases III

    This covers Cloud Eco-System, for Financial Industries (Banking, Securities & Investments, Insurance) transacting business within the United States; Mendeley - An International Network of Research; Netflix Movie Service; Web Search; IaaS (Infrastructure as a Service) Big Data Business Continuity & Disaster Recovery (BC/DR) Within A Cloud Eco-System; Cargo Shipping; Materials Data for Manufacturing and Simulation driven Materials Genomics.

  • 13.6 - Defense Use Cases I

    This covers Large Scale Geospatial Analysis and Visualization; Object identification and tracking from Wide Area Large Format Imagery (WALF) Imagery or Full Motion Video (FMV) - Persistent Surveillance and Intelligence Data Processing and Analysis.

  • 13.7 - Defense Use Cases II

    This covers Large Scale Geospatial Analysis and Visualization; Object identification and tracking from Wide Area Large Format Imagery (WALF) Imagery or Full Motion Video (FMV) - Persistent Surveillance and Intelligence Data Processing and Analysis.

  • 13.8 - Healthcare and Life Science Use Cases I

    This covers Electronic Medical Record (EMR) Data; Pathology Imaging/digital pathology; Computational Bioimaging; Genomic Measurements; Comparative analysis for metagenomes and genomes; Individualized Diabetes Management; Statistical Relational Artificial Intelligence for Health Care; World Population Scale Epidemiological Study; Social Contagion Modeling for Planning, Public Health and Disaster Management and Biodiversity and LifeWatch.

  • 13.9 - Healthcare and Life Science Use Cases II

    This covers Electronic Medical Record (EMR) Data; Pathology Imaging/digital pathology; Computational Bioimaging; Genomic Measurements; Comparative analysis for metagenomes and genomes; Individualized Diabetes Management; Statistical Relational Artificial Intelligence for Health Care; World Population Scale Epidemiological Study; Social Contagion Modeling for Planning, Public Health and Disaster Management and Biodiversity and LifeWatch.

  • 13.10 - Healthcare and Life Science Use Cases III

    This covers Electronic Medical Record (EMR) Data; Pathology Imaging/digital pathology; Computational Bioimaging; Genomic Measurements; Comparative analysis for metagenomes and genomes; Individualized Diabetes Management; Statistical Relational Artificial Intelligence for Health Care; World Population Scale Epidemiological Study; Social Contagion Modeling for Planning, Public Health and Disaster Management and Biodiversity and LifeWatch.

  • 13.11 - Deep Learning and Social Networks Use Cases

    This covers Large-scale Deep Learning; Organizing large-scale, unstructured collections of consumer photos; Truthy: Information diffusion research from Twitter Data; Crowd Sourcing in the Humanities as Source for Big and Dynamic Data; CINET: Cyberinfrastructure for Network (Graph) Science and Analytics and NIST Information Access Division analytic technology performance measurement, evaluations, and standards.

  • 13.12 - Research Ecosystem Use Cases

    This covers DataNet Federation Consortium DFC; The 'Discinnet process', metadata <-> big data global experiment; Semantic Graph-search on Scientific Chemical and Text-based Data and Light source beamlines.

  • 13.13 - Astronomy and Physics Use Cases I

    This covers Catalina Real-Time Transient Survey (CRTS): a digital, panoramic, synoptic sky survey; DOE Extreme Data from Cosmological Sky Survey and Simulations; Large Survey Data for Cosmology; Particle Physics: Analysis of LHC Large Hadron Collider Data: Discovery of Higgs particle and Belle II High Energy Physics Experiment.

  • 13.14 - Astronomy and Physics Use Cases II

    This covers Catalina Real-Time Transient Survey (CRTS): a digital, panoramic, synoptic sky survey; DOE Extreme Data from Cosmological Sky Survey and Simulations; Large Survey Data for Cosmology; Particle Physics: Analysis of LHC Large Hadron Collider Data: Discovery of Higgs particle and Belle II High Energy Physics Experiment.

  • 13.15 - Environment, Earth and Polar Science Use Cases I

    EISCAT 3D incoherent scatter radar system; ENVRI, Common Operations of Environmental Research Infrastructure; Radar Data Analysis for CReSIS Remote Sensing of Ice Sheets; UAVSAR Data Processing, DataProduct Delivery, and Data Services; NASA LARC/GSFC iRODS Federation Testbed; MERRA Analytic Services MERRA/AS; Atmospheric Turbulence - Event Discovery and Predictive Analytics; Climate Studies using the Community Earth System Model at DOE's NERSC center; DOE-BER Subsurface Biogeochemistry Scientific Focus Area and DOE-BER AmeriFlux and FLUXNET Networks.

  • 13.16 - Environment, Earth and Polar Science Use Cases II

    EISCAT 3D incoherent scatter radar system; ENVRI, Common Operations of Environmental Research Infrastructure; Radar Data Analysis for CReSIS Remote Sensing of Ice Sheets; UAVSAR Data Processing, DataProduct Delivery, and Data Services; NASA LARC/GSFC iRODS Federation Testbed; MERRA Analytic Services MERRA/AS; Atmospheric Turbulence - Event Discovery and Predictive Analytics; Climate Studies using the Community Earth System Model at DOE's NERSC center; DOE-BER Subsurface Biogeochemistry Scientific Focus Area and DOE-BER AmeriFlux and FLUXNET Networks.

  • 13.17 - Energy Use Case

    This covers Consumption forecasting in Smart Grids.

  • Unit Overview
  • Lesson 1 - Summary of Use Case Classification I
  • Lesson 2 - Summary of Use Case Classification II
  • Lesson 3 - Summary of Use Case Classification III
  • Lesson 4 - Database(SQL) Use Case Classification
  • Lesson 5 - NoSQL Use Case Classification
  • Lesson 6 - Use Case Classifications I
  • Lesson 7 - Use Case Classifications II Part 1
  • Lesson 8 - Use Case Classifications II Part 2
  • Lesson 9 - Use Case Classifications III Part 1
  • Lesson 10 - Use Case Classifications III Part 2
  • Overview

    This unit discusses the categories used to classify the 51 use-cases. These categories include concepts used for parallelism and low and high level computational structure. The first lesson is an introduction to all categories and the further lessons give details of particular categories.

  • 14.1 - Summary of Use Case Classification I

    This discusses concepts used for parallelism and low and high level computational structure. Parallelism can be over People (users or subjects), Decision makers; Items such as Images, EMR, Sequences; observations, contents of online store; Sensors – Internet of Things; Events; (Complex) Nodes in a Graph; Simple nodes as in a learning network; Tweets, Blogs, Documents, Web Pages etc.; Files or data to be backed up, moved or assigned metadata; Particles/cells/mesh points. Low level computational types include PP (Pleasingly Parallel); MR (MapReduce); MRStat; MRIter (Iterative MapReduce); Graph; Fusion; MC (Monte Carlo) and Streaming. High level computational types include Classification; S/Q (Search and Query); Index; CF (Collaborative Filtering); ML (Machine Learning); EGO (Large Scale Optimizations); EM (Expectation maximization); GIS; HPC; Agents. Patterns include Classic Database; NoSQL; Basic processing of data as in backup or metadata; GIS; Host of Sensors processed on demand; Pleasingly parallel processing; HPC assimilated with observational data; Agent-based models; Multi-modal data fusion or Knowledge Management; Crowd Sourcing.

  • 14.2 - Summary of Use Case Classification II

    This discusses concepts used for parallelism and low and high level computational structure. Parallelism can be over People (users or subjects), Decision makers; Items such as Images, EMR, Sequences; observations, contents of online store; Sensors – Internet of Things; Events; (Complex) Nodes in a Graph; Simple nodes as in a learning network; Tweets, Blogs, Documents, Web Pages etc.; Files or data to be backed up, moved or assigned metadata; Particles/cells/mesh points. Low level computational types include PP (Pleasingly Parallel); MR (MapReduce); MRStat; MRIter (Iterative MapReduce); Graph; Fusion; MC (Monte Carlo) and Streaming. High level computational types include Classification; S/Q (Search and Query); Index; CF (Collaborative Filtering); ML (Machine Learning); EGO (Large Scale Optimizations); EM (Expectation maximization); GIS; HPC; Agents. Patterns include Classic Database; NoSQL; Basic processing of data as in backup or metadata; GIS; Host of Sensors processed on demand; Pleasingly parallel processing; HPC assimilated with observational data; Agent-based models; Multi-modal data fusion or Knowledge Management; Crowd Sourcing.

  • 14.3 - Summary of Use Case Classification III

    This discusses concepts used for parallelism and low and high level computational structure. Parallelism can be over People (users or subjects), Decision makers; Items such as Images, EMR, Sequences; observations, contents of online store; Sensors – Internet of Things; Events; (Complex) Nodes in a Graph; Simple nodes as in a learning network; Tweets, Blogs, Documents, Web Pages etc.; Files or data to be backed up, moved or assigned metadata; Particles/cells/mesh points. Low level computational types include PP (Pleasingly Parallel); MR (MapReduce); MRStat; MRIter (Iterative MapReduce); Graph; Fusion; MC (Monte Carlo) and Streaming. High level computational types include Classification; S/Q (Search and Query); Index; CF (Collaborative Filtering); ML (Machine Learning); EGO (Large Scale Optimizations); EM (Expectation maximization); GIS; HPC; Agents. Patterns include Classic Database; NoSQL; Basic processing of data as in backup or metadata; GIS; Host of Sensors processed on demand; Pleasingly parallel processing; HPC assimilated with observational data; Agent-based models; Multi-modal data fusion or Knowledge Management; Crowd Sourcing.

  • 14.4 - Database(SQL) Use Case Classification

    This discusses the classic (SQL) database approach to data handling with Search & Query and Index features. Comparisons are made to NoSQL approaches.
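
    As a concrete illustration (not part of the course materials), the sketch below shows the classic Search & Query plus Index pattern using Python's standard sqlite3 module; the table and column names are invented.

```python
# Minimal sketch of the classic SQL approach: a table, an index and a query.
# Table/column names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE papers (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
cur.execute("CREATE INDEX idx_year ON papers(year)")   # the Index feature
cur.executemany("INSERT INTO papers (title, year) VALUES (?, ?)",
                [("Clouds for Science", 2013), ("Big Data Survey", 2014)])
conn.commit()

# the Search & Query feature: the query planner can use the index on 'year'
for (title,) in cur.execute("SELECT title FROM papers WHERE year >= ?", (2014,)):
    print(title)
conn.close()
```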

  • 14.5 - NoSQL Use Case Classification

    This discusses NoSQL (compared in the previous lesson) with HDFS, Hadoop and HBase. The Apache Big data stack is introduced and further details of the comparison with SQL are given.

  • 14.6 - Use Case Classifications I

    This discusses a subset of use case features: GIS, Sensors, and the support of data analysis and fusion by streaming data between filters.

  • 14.7 - Use Case Classifications II Part 1

    This discusses a subset of use case features: Pleasingly parallel, MRStat, Data Assimilation, Crowd sourcing, Agents, data fusion and agents, EGO and security.

  • 14.8 - Use Case Classifications II Part 2

    This discusses a subset of use case features: Pleasingly parallel, MRStat, Data Assimilation, Crowd sourcing, Agents, data fusion and agents, EGO and security.

  • 14.9 - Use Case Classifications III Part 1

    This discusses a subset of use case features: Classification, Monte Carlo, Streaming, PP, MR, MRStat, MRIter and HPC(MPI), global and local analytics (machine learning), parallel computing, Expectation Maximization, graphs and Collaborative Filtering.

  • 14.10 - Use Case Classifications III Part 2

    This discusses a subset of use case features: Classification, Monte Carlo, Streaming, PP, MR, MRStat, MRIter and HPC(MPI), global and local analytics (machine learning), parallel computing, Expectation Maximization, graphs and Collaborative Filtering.

Section 6 - Technology Training - Plotviz
15
1h

Geoffrey introduces Plotviz, a data visualization tool developed at Indiana University to display 2 and 3 dimensional data. The motivation is that the human eye is very good at pattern recognition and can ''see'' structure in data. Although most Big data is higher dimensional than 3, all can be transformed by dimension reduction techniques to 3D. He gives several examples to show how the software can be used and what kind of data can be visualized. This includes individual plots and the manipulation of multiple synchronized plots. Finally, he describes the download and software dependencies of Plotviz.

  • Unit Overview
  • Lesson 1 - Motivation and Introduction to use
  • Lesson 2 - Example of Use I: Cube and Structured Dataset
  • Lesson 3 - Example of Use II: Proteomics and Synchronized Rotation
  • Lesson 4 - Example of Use III: More Features and larger Proteomics Sample
  • Lesson 5 - Example of Use IV: Tools and Examples
  • Lesson 6 - Example of Use V: Final Examples
  • Overview

    Geoffrey introduces Plotviz, a data visualization tool developed at Indiana University to display 2 and 3 dimensional data. The motivation is that the human eye is very good at pattern recognition and can ''see'' structure in data. Although most Big data is higher dimensional than 3, all can be transformed by dimension reduction techniques to 3D. He gives several examples to show how the software can be used and what kind of data can be visualized. This includes individual plots and the manipulation of multiple synchronized plots. Finally, he describes the download and software dependency of Plotviz.

  • 15.1 - Motivation and Introduction to use

    The motivation of Plotviz is that the human eye is very good at pattern recognition and can ''see'' structure in data. Although most Big data is higher dimensional than 3, all data can be transformed by dimension reduction techniques to 3D, and one can check analyses like clustering and/or see structure missed in a computer analysis. The motivation shows some Cheminformatics examples. The use of Plotviz starts in slide 4 with a discussion of the input file, which is either simple text or a richer XML syntax in which additional features (like colors) can be specified. Plotviz deals with points and their classification (clustering). Next the protein sequence browser in 3D shows the basic structure of the Plotviz interface. The next two slides explain the core 3D and 2D manipulations respectively. Note all files used in the examples are available to students.

  • 15.2 - Example of Use I: Cube and Structured Dataset

    Initially we start with a simple plot of 8 points -- the corners of a cube in 3 dimensions -- showing basic operations such as size/color/labels and the Legend of points. The second example shows a dataset (coming from GTM dimension reduction) with significant structure. This has .pviz and .txt versions that are compared.

  • 15.3 - Example of Use II: Proteomics and Synchronized Rotation

    This starts with an examination of a sample of the Protein Universe Browser showing how one uses Plotviz to look at different features of this set of Protein sequences projected to 3D. Then we show how to compare two datasets with synchronized rotation of a dataset clustered in 2 different ways; this dataset comes from the k Nearest Neighbor discussion.

  • 15.4 - Example of Use III: More Features and larger Proteomics Sample

    This starts by describing use of Labels and Glyphs and the Default mode in Plotviz. Then we illustrate sophisticated use of these ideas to view a large Proteomics dataset

  • 15.5 - Example of Use IV: Tools and Examples

    This lesson starts by describing the Plotviz tools and then sets up two examples -- Oil Flow and Trading -- described in PowerPoint. It finishes with the Plotviz viewing of Oil Flow data

  • 15.6 - Example of Use V: Final Examples

    This starts with Plotviz looking at the Trading example introduced in the previous lesson and then examines solvent data. It finishes with two large biology examples with 446K and 100K points, each with over 100 clusters. We finish with remarks on the Plotviz software structure and how to download it. We also remind you that a picture is worth a thousand words.

Section 7 - X= e-Commerce and LifeStyle Case Study
16, 17, 18
2h 12min

Recommender systems operate under the hood of such widely recognized sites as Amazon, eBay, Monster and Netflix where everything is a recommendation. This involves a symbiotic relationship between vendor and buyer whereby the buyer provides the vendor with information about their preferences, while the vendor then offers recommendations tailored to match their needs. Kaggle competitions held to improve the success of the Netflix and other recommender systems are described. Attention is paid to models that are used to compare how changes to the systems affect their overall performance. Geoffrey muses how the humble ranking has become such a dominant driver of the world's economy. More examples of recommender systems are given from Google News, Retail stores and, in greater depth, Yahoo!, covering the multi-faceted criteria used in deciding recommendations on web sites. The formulation of recommendations in terms of points in a space or bag is given, where bags of item properties, user properties, rankings and users are useful. Detail is given on basic principles behind recommender systems: user-based collaborative filtering, which uses similarities in user rankings to predict their interests, and the Pearson correlation, used to statistically quantify correlations between users viewed as points in a space of items. Items are viewed as points in a space of users in item-based collaborative filtering. The Cosine Similarity is introduced, as are the difference between implicit and explicit ratings and the k Nearest Neighbors algorithm. General features like the curse of dimensionality in high dimensions are discussed. A simple Python k Nearest Neighbor code and its application to an artificial data set in 3 dimensions is given. Results are visualized in Matplotlib in 2D and with Plotviz in 3D. The concepts of a training and a testing set are introduced, with the training set pre-labelled. The recommender system example is then used to discuss clustering, with k-means based clustering methods used and their results examined in Plotviz. The original labelling is compared to clustering results and an extension to 28 clusters given. General issues in clustering are discussed including local optima, the use of annealing to avoid them and the value of heuristic algorithms.

  • Unit Overview
  • Lesson 1 - Recommender Systems as an Optimization Problem
  • Lesson 2 - Recommender Systems Introduction
  • Lesson 3 - Kaggle Competitions
  • Lesson 4 - Examples of Recommender Systems
  • Lesson 5 - Netflix on Recommender Systems I
  • Lesson 6 - Netflix on Recommender Systems II
  • Lesson 7 - Consumer Data Science
  • Overview

    Geoffrey introduces Recommender systems as an optimization technology used in a variety of applications and contexts online. They operate in the background of such widely recognized sites as Amazon, eBay, Monster and Netflix where everything is a recommendation. This involves a symbiotic relationship between vendor and buyer whereby the buyer provides the vendor with information about their preferences, while the vendor then offers recommendations tailored to match their needs, to the benefit of both.

    There follows an exploration of the Kaggle competition site, other recommender systems and Netflix, as well as competitions held to improve the success of the Netflix recommender system. Finally attention is paid to models that are used to compare how changes to the systems affect their overall performance. Geoffrey muses how the humble ranking has become such a dominant driver of the world's economy.

  • 16.1 - Recommender Systems as an Optimization Problem

    We define a set of general recommender systems as matching of items to people or perhaps collections of items to collections of people where items can be other people, products in a store, movies, jobs, events, web pages etc. We present this as ''yet another optimization problem''

  • 16.2 - Recommender Systems Introduction

    We give a general discussion of recommender systems and point out that they are particularly valuable for the long tail of items (to be recommended) that aren't commonly known. We pose them as a rating system and relate them to information retrieval rating systems. We can contrast recommender systems based on user profile and context; the most familiar collaborative filtering of others' rankings; item properties; knowledge; and hybrid cases mixing some or all of these.

  • 16.3 - Kaggle Competitions

    We look at Kaggle competitions with examples from the web site. In particular we discuss an Irvine class project involving ranking jokes.

  • 16.4 - Examples of Recommender Systems

    We go through a list of 9 recommender systems from the same Irvine class

  • 16.5 - Netflix on Recommender Systems I

    We summarize some interesting points from a tutorial from Netflix for whom ''everything is a recommendation''. Rankings are given in multiple categories and categories that reflect user interests are especially important. Criteria used include explicit user preferences, implicit based on ratings and hybrid methods as well as freshness and diversity. Netflix tries to explain the rationale of its recommendations. We give some data on Netflix operations and some methods used in its recommender systems. We describe the famous Netflix Kaggle competition to improve its rating system. The analogy to maximizing click through rate is given and the objectives of optimization are given.

  • 16.6 - Netflix on Recommender Systems II

    We summarize some interesting points from a tutorial from Netflix for whom ''everything is a recommendation''. Rankings are given in multiple categories and categories that reflect user interests are especially important. Criteria used include explicit user preferences, implicit based on ratings and hybrid methods as well as freshness and diversity. Netflix tries to explain the rationale of its recommendations. We give some data on Netflix operations and some methods used in its recommender systems. We describe the famous Netflix Kaggle competition to improve its rating system. The analogy to maximizing click through rate is given and the objectives of optimization are given.

  • 16.7 - Consumer Data Science

    Here we go through Netflix's methodology of letting data speak for itself in optimizing the recommender engine. An example is given on choosing self-produced movies. A/B testing is discussed with examples showing how testing does allow optimizing of sophisticated criteria. This lesson is concluded by comments on Netflix technology and the full spectrum of issues that are involved, including user interface, data, A/B testing, systems and architectures. We comment on optimizing for a household rather than optimizing for individuals in the household.

  • Unit Overview
  • Lesson 1 - Recap and Examples of Recommender Systems
  • Lesson 2 - Examples of Recommender Systems
  • Lesson 3 - Recommender Systems in Yahoo Use Case Example I
  • Lesson 4 - Recommender Systems in Yahoo Use Case Example II
  • Lesson 5 - Recommender Systems in Yahoo Use Case Example III: Particular Module
  • Lesson 6 - User-based nearest-neighbor collaborative filtering I
  • Lesson 7 - User-based nearest-neighbor collaborative filtering II
  • Lesson 8 - Vector Space Formulation of Recommender Systems
  • Overview

    Geoffrey continues the discussion of recommender systems and their use in e-commerce. More examples are given from Google News, Retail stores and, in greater depth, Yahoo!, covering the multi-faceted criteria used in deciding recommendations on web sites. Then the formulation of recommendations in terms of points in a space or bag is given.

    Here bags of item properties, user properties, rankings and users are useful. Then we go into detail on basic principles behind recommender systems: user-based collaborative filtering, which uses similarities in user rankings to predict their interests, and the Pearson correlation, used to statistically quantify correlations between users viewed as points in a space of items.

  • 17.1 - Recap and Examples of Recommender Systems

    We start with a quick recap of recommender systems from previous unit; what they are with brief examples.

  • 17.2 - Examples of Recommender Systems

    We give 2 examples in more detail: namely Google News and Markdown in Retail.

  • 17.3 - Recommender Systems in Yahoo Use Case Example I

    We describe in greatest detail the methods used to optimize Yahoo web sites. There are two lessons discussing the general approach and a third lesson examining a particular personalized Yahoo page with its different components. We point out the different criteria that must be blended in making decisions; these include analysis of what a user does after a particular page is clicked: is the user satisfied, and can that be quantified by purchase decisions etc.? We need to choose articles, ads, modules, movies, users, updates, etc. to optimize metrics such as relevance score, CTR, revenue and engagement. These lessons stress that even though we have big data, the recommender data is sparse. We discuss the approach that involves both batch (offline) and on-line (real time) components.

  • 17.4 - Recommender Systems in Yahoo Use Case Example II

    We give some examples in more detail, including Google News, Markdown in Retail and, in greatest detail, the methods used to optimize a Yahoo page. Here we review recommender engines yet again and then examine a personalized Yahoo page with its different components. We point out the different criteria that must be blended in making decisions; these include analysis of what a user does after a particular page is clicked: is the user satisfied, and can that be quantified by purchase decisions etc.? We need to choose articles, ads, modules, movies, users, updates, etc. to optimize metrics such as relevance score, CTR, revenue and engagement. This lesson stresses that even though we have big data, the recommender data is sparse. We discuss the approach that involves both batch (offline) and on-line (real time) components.

  • 17.5 - Recommender Systems in Yahoo Use Case Example III: Particular Module

    We describe in greatest detail the methods used to optimize Yahoo web sites. There are two lessons discussing the general approach and a third lesson examining a particular personalized Yahoo page with its different components. We point out the different criteria that must be blended in making decisions; these include analysis of what a user does after a particular page is clicked: is the user satisfied, and can that be quantified by purchase decisions etc.? We need to choose articles, ads, modules, movies, users, updates, etc. to optimize metrics such as relevance score, CTR, revenue and engagement. These lessons stress that even though we have big data, the recommender data is sparse. We discuss the approach that involves both batch (offline) and on-line (real time) components.

  • 17.6 - User-based nearest-neighbor collaborative filtering I

    Collaborative filtering is a core approach to recommender systems. There is user-based and item-based collaborative filtering and here we discuss the user-based case. Here similarities in user rankings allow one to predict their interests, and typically this is quantified by the Pearson correlation, used to statistically quantify correlations between users.

  • 17.7 - User-based nearest-neighbor collaborative filtering II

    Collaborative filtering is a core approach to recommender systems. There is user-based and item-based collaborative filtering and here we discuss the user-based case. Here similarities in user rankings allow one to predict their interests, and typically this is quantified by the Pearson correlation, used to statistically quantify correlations between users.
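
    As a small illustration of these two lessons (a sketch with an invented ratings matrix, not code from the course), user-based collaborative filtering computes the Pearson correlation between users' rating vectors and lets the most correlated neighbors drive predictions:

```python
# User-based collaborative filtering sketch: Pearson correlation between users.
# The tiny ratings matrix (rows = users, columns = items) is invented.
import numpy as np

ratings = np.array([[5.0, 4.0, 1.0, 1.0],
                    [4.0, 5.0, 2.0, 1.0],
                    [1.0, 2.0, 5.0, 4.0]])

def pearson(u, v):
    """Pearson correlation between two users' rating vectors."""
    u_c, v_c = u - u.mean(), v - v.mean()
    denom = np.sqrt((u_c ** 2).sum() * (v_c ** 2).sum())
    return float(u_c @ v_c) / denom if denom > 0 else 0.0

target = 0                                   # predict for the first user
sims = [pearson(ratings[target], ratings[j]) for j in range(len(ratings)) if j != target]
print("similarity of user 0 to users 1 and 2:", np.round(sims, 3))
# The highly correlated neighbor (user 1) would dominate a weighted prediction.
```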

  • 17.8 - Vector Space Formulation of Recommender Systems

    We go through recommender systems thinking of them as formulated in a funny vector space. This suggests using clustering to make recommendations.

  • Unit Overview
  • Lesson 1 - Item-based Collaborative Filtering I
  • Lesson 2 - Item-based Collaborative Filtering II
  • Lesson 3 - k Nearest Neighbors and High Dimensional Spaces
  • Overview

    Geoffrey moves on to item-based collaborative filtering where items are viewed as points in a space of users. The Cosine Similarity is introduced, the difference between implicit and explicit ratings and the k Nearest Neighbors algorithm. General features like the curse of dimensionality in high dimensions are discussed

  • 18.1 - Item-based Collaborative Filtering I

    We covered user-based collaborative filtering in the previous unit. Here we start by discussing memory-based real-time and model-based offline (batch) approaches. Now we look at item-based collaborative filtering where items are viewed in the space of users and the cosine measure is used to quantify distances. We discuss optimizations and how batch processing can help. We discuss different Likert ranking scales and issues with new items that do not have a significant number of rankings.

  • 18.2 - Item-based Collaborative Filtering II

    We covered user-based collaborative filtering in the previous unit. Here we start by discussing memory-based real-time and model-based offline (batch) approaches. Now we look at item-based collaborative filtering where items are viewed in the space of users and the cosine measure is used to quantify distances. We discuss optimizations and how batch processing can help. We discuss different Likert ranking scales and issues with new items that do not have a significant number of rankings.
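
    As a small illustration (a sketch with invented data, not the course's code), item-based collaborative filtering treats each item as a vector in the space of users and compares items with the cosine measure:

```python
# Item-based collaborative filtering sketch: cosine similarity between items,
# where each item is a column of an invented user-by-item ratings matrix.
import numpy as np

ratings = np.array([[5.0, 4.0, 1.0],
                    [4.0, 5.0, 2.0],
                    [1.0, 2.0, 5.0]])

def cosine(a, b):
    """Cosine similarity between two item vectors (viewed over users)."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

items = ratings.T                            # each row is now an item vector over users
for i in range(len(items)):
    for j in range(i + 1, len(items)):
        print(f"cos(item {i}, item {j}) = {cosine(items[i], items[j]):.3f}")
# Items rated similarly by the same users (items 0 and 1) come out most similar.
```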

  • 18.3 - k Nearest Neighbors and High Dimensional Spaces

    We define the k Nearest Neighbor algorithm and present the Python software but do not use it. We give examples from Wikipedia and describe performance issues. This algorithm illustrates the curse of dimensionality. If items were real vectors in a low dimensional space, there would be faster solution methods.

Section 8 - Technology Training - kNN & Clustering
19, 20
1h 23min

This section provides a discussion of the k'th Nearest Neighbor (kNN) algorithm and clustering using K-means. The Python version of kNN is discussed in the video and instructions for both Java and Python are given in the slides. Plotviz is used for generating 3D visualizations.

  • Unit Overview
  • Lesson 1 - Python k'th Nearest Neighbor Algorithms I
  • Lesson 2 - Python k'th Nearest Neighbor Algorithms II
  • Lesson 3 - 3D Visualization
  • Lesson 4 - Testing k'th Nearest Neighbor Algorithms
  • Overview

    Geoffrey discusses a simple Python k Nearest Neighbor code and its application to an artificial data set in 3 dimensions. Results are visualized in Matplotlib in 2D and with Plotviz in 3D. The concepts of training and testing sets are introduced, with the training set pre-labelled.

  • 19.1 - Python k'th Nearest Neighbor Algorithms I

    This lesson considers the Python k Nearest Neighbor code found on the web associated with a book by Harrington on Machine Learning. There are two data sets. First we consider a set of 4 2D vectors divided into two categories (clusters) and use the k=3 Nearest Neighbor algorithm to classify 3 test points. Second we consider a 3D dataset that has already been classified and show how to normalize it. In this lesson we just use Matplotlib to give 2D plots.

  • 19.2 - Python k'th Nearest Neighbor Algorithms II

    This lesson considers the Python k Nearest Neighbor code found on the web associated with a book by Harrington on Machine Learning. There are two data sets. First we consider a set of 4 2D vectors divided into two categories (clusters) and use the k=3 Nearest Neighbor algorithm to classify 3 test points. Second we consider a 3D dataset that has already been classified and show how to normalize it. In this lesson we just use Matplotlib to give 2D plots.
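
    The sketch below is in the spirit of these lessons but is not the Harrington code itself: 4 labelled 2D points, a k=3 majority vote, and a few invented test points.

```python
# k Nearest Neighbor sketch: Euclidean distances, take the k closest training
# points, and return the majority label. Data points are invented.
import numpy as np
from collections import Counter

train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ["A", "A", "B", "B"]

def knn_classify(x, train, labels, k=3):
    dists = np.linalg.norm(train - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]             # indices of the k closest points
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

for test in ([0.1, 0.1], [0.9, 1.0], [0.5, 0.4]):
    print(test, "->", knn_classify(np.array(test), train, labels, k=3))
```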

  • 19.3 - 3D Visualization

    The lesson modifies the online code to allow it to produce files readable by PlotViz. We visualize the already classified 3D set and rotate it in 3D.
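
    A hypothetical sketch of the idea only (the exact PlotViz text and .pviz XML formats used in the course may differ): write each 3D point with an index and a cluster label to a simple whitespace-separated file that a point viewer can read.

```python
# Hedged sketch: dump 3D points as "index x y z cluster" lines.
# The real PlotViz input formats (.txt / .pviz XML) may use a different layout.
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((100, 3))                 # 100 random 3D points
clusters = rng.integers(0, 3, size=100)       # pretend cluster labels 0-2

with open("points_for_plotviz.txt", "w") as f:
    for i, (p, c) in enumerate(zip(points, clusters)):
        f.write(f"{i} {p[0]:.4f} {p[1]:.4f} {p[2]:.4f} {c}\n")
```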

  • 19.4 - Testing k'th Nearest Neighbor Algorithms

    The lesson goes through an example of using the kNN classification algorithm by dividing a dataset into 2 subsets. One is a training set with an initial classification; the other is a set of test points to be classified by k=3 NN using the training set. The code records the fraction of points with a different classification from that input. One can experiment with different sizes of the two subsets. The Python implementation of the algorithm is analyzed in detail.
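
    A minimal sketch of this evaluation idea (synthetic data and an arbitrary 80/20 split, not the lesson's dataset): classify each held-out point with k=3 NN against the training subset and record the error fraction.

```python
# Split labelled data into training and test subsets, classify the test points
# with k=3 Nearest Neighbors, and report the fraction misclassified.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.5, (50, 3)), rng.normal(3.0, 0.5, (50, 3))])
labels = np.array([0] * 50 + [1] * 50)

def knn_predict(x, X, y, k=3):
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return Counter(y[nearest]).most_common(1)[0][0]

idx = rng.permutation(len(data))
split = int(0.8 * len(data))                       # 80% training, 20% test
train_i, test_i = idx[:split], idx[split:]
errors = sum(knn_predict(data[i], data[train_i], labels[train_i]) != labels[i]
             for i in test_i)
print("error fraction:", errors / len(test_i))
```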

  • Unit Overview
  • Lesson 1 - Kmeans Clustering
  • Lesson 2 - Clustering of Recommender System Example
  • Lesson 3 - Clustering of Recommender Example into more than 3 Clusters
  • Lesson 4 - Local Optima in Clustering
  • Lesson 5 - Clustering in General
  • Lesson 6 - Heuristics
  • Overview

    Geoffrey uses the example of a recommender system to discuss clustering. The details of methods are not discussed but k-means based clustering methods are used and their results examined in Plotviz. The original labelling is compared to clustering results and extension to 28 clusters given. General issues in clustering are discussed including local optima, the use of annealing to avoid this and value of heuristic algorithms.

  • 20.1 - Kmeans Clustering

    Geoffrey introduces the k means algorithm in a gentle fashion and describes its key features including dangers of local minima. A simple example from Wikipedia is examined.
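
    As a reference for what the algorithm does (a sketch with invented data, not the lesson's Wikipedia example), each iteration assigns points to their nearest center and then moves each center to the mean of its points; a poor starting guess can leave the result in a local minimum.

```python
# Plain k means iteration: assignment step + update step, repeated.
# Data and the number of clusters are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
points = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in ([0, 0], [3, 0], [0, 3])])
centers = points[rng.choice(len(points), 3, replace=False)]   # random initial centers

for _ in range(20):
    # assignment: index of the nearest center for every point
    assign = np.argmin(((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    # update: move each center to the mean of its assigned points (keep it if empty)
    centers = np.array([points[assign == k].mean(axis=0) if np.any(assign == k)
                        else centers[k] for k in range(3)])

print(np.round(centers, 2))   # should land near (0,0), (3,0) and (0,3)
```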

  • 20.2 - Clustering of Recommender System Example

    Plotviz is used to examine and compare the original classification with an ''optimal'' clustering into 3 clusters using a fancy deterministic annealing method that is similar to k means. The new clustering has centers marked

  • 20.3 - Clustering of Recommender Example into more than 3 Clusters

    The previous division into 3 clusters is compared to a clustering into 28 separate clusters that are naturally smaller in size and divide the 3D space covered by the 1000 points into compact, geometrically local regions.

  • 20.4 - Local Optima in Clustering

    This lesson introduces some general principles. First many important processes are ''just'' optimization problems. Most such problems are rife with local optima. The key idea behind annealing to avoid local optima is described. The pervasive greedy optimization method is described.

  • 20.5 - Clustering in General

    The two different applications of clustering are described. First find geometrically distinct regions and secondly divide spaces into geometrically compact regions that may have no ''thin air'' between them. Generalizations such as mixture models and latent factor methods are just mentioned. The important distinction between applications in vector spaces and those where only inter-point distances are defined is described. Examples are then given using PlotViz from 2D clustering of a mass spectrometry example and the results of clustering genomic data mapped into 3D with Multi Dimensional Scaling MDS.

  • 20.6 - Heuristics

    Some remarks are given on heuristics: why they are so important and why getting exact answers is often not so important.

Section 9 - Cloud Computing Technology for Big Data Applications & Analytics
21, 22, 23, 24, 25
5h 16min

Geoffrey describes the central role of Parallel computing in Clouds and Big Data which is decomposed into lots of ''Little data'' running in individual cores. Many examples are given and it is stressed that issues in parallel computing are seen in day to day life for communication, synchronization, load balancing and decomposition. Cyberinfrastructure for e-moreorlessanything or moreorlessanything-Informatics and the basics of cloud computing are introduced. This includes virtualization and the important ''as a Service'' components and we go through several different definitions of cloud computing.

Gartner's Technology Landscape includes hype cycle and priority matrix and covers clouds and Big Data. Two simple examples of the value of clouds for enterprise applications are given with a review of different views as to nature of Cloud Computing. This IaaS (Infrastructure as a Service) discussion is followed by PaaS and SaaS (Platform and Software as a Service). Features in Grid and cloud computing and data are treated. We summarize the 21 layers and almost 300 software packages in the HPC-ABDS Software Stack explaining how they are used.

Cloud (Data Center) Architectures with physical setup, Green Computing issues and software models are discussed followed by the Cloud Industry stakeholders with a 2014 Gartner analysis of Cloud computing providers. This is followed by applications on the cloud including data intensive problems, comparison with high performance computing, science clouds and the Internet of Things. Remarks on Security, Fault Tolerance and Synchronicity issues in cloud follow. We describe the way users and data interact with a cloud system. Big Data Processing from an application perspective, with commercial examples including eBay, concludes the section after a discussion of data system architectures.

  • Unit Overview
  • Lesson 1 - Decomposition I
  • Lesson 2 - Decomposition II
  • Lesson 3 - Decomposition III
  • Lesson 4 - Parallel Computing in Society I
  • Lesson 5 - Parallel Computing in Society II
  • Lesson 6 - Parallel Processing for Hadrian's Wall
  • Overview

    Geoffrey describes the central role of Parallel computing in Clouds and Big Data which is decomposed into lots of ''Little data'' running in individual cores. Many examples are given and it is stressed that issues in parallel computing are seen in day to day life for communication, synchronization, load balancing and decomposition.

  • 21.1 - Decomposition I

    Geoffrey describes why parallel computing is essential with Big Data and distinguishes parallelism over users from that over the data in a problem. The general ideas behind data decomposition are given followed by a few often whimsical examples dreamed up 30 years ago in the early heady days of parallel computing. These include scientific simulations, defense outside missile attack and computer chess. The basic problem of parallel computing -- efficient coordination of separate tasks processing different data parts -- is described with MPI and MapReduce as two approaches. The challenges of data decomposition in irregular problems are noted.

  • 21.2 - Decomposition II

    Geoffrey describes why parallel computing is essential with Big Data and distinguishes parallelism over users from that over the data in a problem. The general ideas behind data decomposition are given followed by a few often whimsical examples dreamed up 30 years ago in the early heady days of parallel computing. These include scientific simulations, defense outside missile attack and computer chess. The basic problem of parallel computing -- efficient coordination of separate tasks processing different data parts -- is described with MPI and MapReduce as two approaches. The challenges of data decomposition in irregular problems are noted.

  • 21.3 - Decomposition III

    Geoffrey describes why parallel computing is essential with Big Data and distinguishes parallelism over users from that over the data in a problem. The general ideas behind data decomposition are given followed by a few often whimsical examples dreamed up 30 years ago in the early heady days of parallel computing. These include scientific simulations, defense outside missile attack and computer chess. The basic problem of parallel computing -- efficient coordination of separate tasks processing different data parts -- is described with MPI and MapReduce as two approaches. The challenges of data decomposition in irregular problems are noted.

  • 21.4 - Parallel Computing in Society I

    This lesson from the past notes that one can view society as an approach to parallel linkage of people. The largest example given is that of the construction of a long wall such as that (Hadrian's wall) between England and Scotland. Different approaches to parallelism are given with formulae for the speed up and efficiency. The concepts of grain size (size of problem tackled by an individual processor) and coordination overhead are exemplified. This example also illustrates Amdahl's law and the relation between data and processor topology. The lesson concludes with other examples from nature including collections of neurons (the brain) and ants.
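
    For reference, the standard definitions that such speed up and efficiency formulae build on (stated here in generic form, not copied from the lecture slides) are, with T(N) the time to solve a fixed problem on N workers:

```latex
S(N) = \frac{T(1)}{T(N)}, \qquad \varepsilon(N) = \frac{S(N)}{N} = \frac{T(1)}{N\,T(N)}
```

    Perfect scaling gives S(N) = N and efficiency 1; small grain size and coordination overhead both reduce the efficiency.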

  • 21.5 - Parallel Computing in Society II

    This lesson from the past notes that one can view society as an approach to parallel linkage of people. The largest example given is that of the construction of a long wall such as that (Hadrian's wall) between England and Scotland. Different approaches to parallelism are given with formulae for the speed up and efficiency. The concepts of grain size (size of problem tackled by an individual processor) and coordination overhead are exemplified. This example also illustrates Amdahl's law and the relation between data and processor topology. The lesson concludes with other examples from nature including collections of neurons (the brain) and ants.

  • 21.6 - Parallel Processing for Hadrian's Wall

    This lesson returns to Hadrian's wall and uses it to illustrate advanced issues in parallel computing. First Geoffrey describes the basic SPMD -- Single Program Multiple Data -- model. Then irregular but homogeneous and heterogeneous problems are discussed. Static and dynamic load balancing is needed. Inner parallelism (as in vector instruction or the multiple fingers of masons) and outer parallelism (typical data parallelism) are demonstrated. Parallel I/O for Hadrian's wall is followed by a slide summarizing this quaint comparison between Big data parallelism and the construction of a large wall.

  • Unit Overview
  • Lesson 1 - Cyberinfrastructure for E-MoreOrLessAnything
  • Lesson 2 - What is Cloud Computing: Introduction
  • Lesson 3 - What and Why is Cloud Computing: Several Other Views I
  • Lesson 4 - What and Why is Cloud Computing: Several Other Views II
  • Lesson 5 - What and Why is Cloud Computing: Several Other Views III
  • Lesson 6 - Gartner's Emerging Technology Landscape for Clouds and Big Data
  • Lesson 7 - Simple Examples of use of Cloud Computing
  • Lesson 8 - Value of Cloud Computing
  • Overview

    Geoffrey discusses Cyberinfrastructure for e-moreorlessanything or moreorlessanything-Informatics and the basics of cloud computing. This includes virtualization and the important 'as a Service' components and we go through several different definitions of cloud computing.

    Gartner's Technology Landscape includes hype cycle and priority matrix and covers clouds and Big Data. The unit concludes with two simple examples of the value of clouds for enterprise applications. Gartner also has specific predictions for cloud computing growth areas.

  • 22.1 - Cyberinfrastructure for E-MoreOrLessAnything

    This introduction describes Cyberinfrastructure or e-infrastructure and its role in solving the electronic implementation of any problem where e-moreorlessanything is another term for moreorlessanything-Informatics and generalizes early discussion of e-Science and e-Business.

  • 22.2 - What is Cloud Computing: Introduction

    Cloud Computing is introduced with an operational definition involving virtualization and efficient large data centers that can rent computers in an elastic fashion. The role of services is essential -- it underlies capabilities being offered in the cloud. The four basic aaS's -- Software (SaaS), Platform (PaaS), Infrastructure (IaaS) and Network (NaaS) -- are introduced with Research aaS and other capabilities (for example Sensors aaS are discussed later) being built on top of these.

  • 22.3 - What and Why is Cloud Computing: Several Other Views I

    This lesson contains 5 slides with diverse comments on ''what is cloud computing'' from the web.

  • 22.4 - What and Why is Cloud Computing: Several Other Views II

    This lesson contains 5 slides with diverse comments on ''what is cloud computing'' from the web.

  • 22.5 - What and Why is Cloud Computing: Several Other Views III

    This lesson contains 5 slides with diverse comments on ''what is cloud computing'' from the web.

  • 22.6 - Gartner's Emerging Technology Landscape for Clouds and Big Data

    This lesson gives Gartner's projections around futures of cloud and Big data. We start with a review of hype charts and then go into detailed Gartner analyses of the Cloud and Big data areas. Big data itself is at the top of the hype and by definition predictions of doom are emerging. Before too much excitement sets in, note that spinach is above clouds and Big data in Google trends.

  • 22.7 - Simple Examples of use of Cloud Computing

    This short lesson gives two examples of rather straightforward commercial applications of cloud computing. One is server consolidation for multiple Microsoft database applications and the second is the benefits of scale comparing gmail to multiple smaller installations. It ends with some fiscal comments

  • 22.8 - Value of Cloud Computing

    Some comments on fiscal value of cloud computing

  • Unit Overview
  • Lesson 1 - What is Cloud Computing
  • Lesson 2 - Introduction to Cloud Software Architecture: IaaS and PaaS I
  • Lesson 3 - Introduction to Cloud Software Architecture: IaaS and PaaS II
  • Lesson 4 - Using the HPC-ABDS Software Stack
  • Overview

    Geoffrey covers different views as to nature of architecture and application for Cloud Computing. Then we discuss cloud software for the cloud starting at virtual machine management (IaaS) and the broad Platform (middleware) capabilities with examples from Amazon and academic studies. We summarize the 21 layers and almost 300 software packages in the HPC-ABDS Software Stack explaining how they are used.

  • 23.1 - What is Cloud Computing

    This lesson gives some general remarks on cloud systems from an architecture and application perspective.

  • 23.2 - Introduction to Cloud Software Architecture: IaaS and PaaS I

    Geoffrey discusses cloud software for the cloud starting at virtual machine management (IaaS) and the broad Platform (middleware) capabilities with examples from Amazon and academic studies.

  • 23.3 - Introduction to Cloud Software Architecture: IaaS and PaaS II

    Geoffrey discusses cloud software for the cloud starting at virtual machine management (IaaS) and the broad Platform (middleware) capabilities with examples from Amazon and academic studies.

  • 23.4 - Using the HPC-ABDS Software Stack

    This lesson describes how the layers and software packages of the HPC-ABDS Software Stack are used.

  • Unit Overview
  • Lesson 1 - Cloud (Data Center) Architectures I
  • Lesson 2 - Cloud (Data Center) Architectures II
  • Lesson 3 - Analysis of Major Cloud Providers
  • Lesson 4 - Commercial Cloud Storage Trends
  • Lesson 5 - Cloud Applications I
  • Lesson 6 - Cloud Applications II
  • Lesson 7 - Science Clouds
  • Lesson 8 - Security
  • Lesson 9 - Comments on Fault Tolerance and Synchronicity Constraints
  • Overview

    Geoffrey opens up with a discussion of Cloud (Data Center) Architectures with physical setup, Green Computing issues and software models. We summarize a 2014 Gartner analysis of Cloud computing providers. This is followed by applications on the cloud including data intensive problems, comparison with high performance computing, science clouds and the Internet of Things. Remarks on Security, Fault Tolerance and Synchronicity issues in cloud follow.

  • 24.1 - Cloud (Data Center) Architectures I

    Some remarks on what it takes to build (in software) a cloud ecosystem, and why clouds are the data center of the future are followed by pictures and discussions of several data centers from Microsoft (mainly) and Google. The role of containers is stressed as part of modular data centers that trade scalability for fault tolerance. Sizes of cloud centers and supercomputers are discussed as is ''green'' computing.

  • 24.2 - Cloud (Data Center) Architectures II

    Some remarks on what it takes to build (in software) a cloud ecosystem, and why clouds are the data center of the future are followed by pictures and discussions of several data centers from Microsoft (mainly) and Google. The role of containers is stressed as part of modular data centers that trade scalability for fault tolerance. Sizes of cloud centers and supercomputers are discussed as is ''green'' computing.

  • 24.3 - Analysis of Major Cloud Providers

    Gartner 2014 Analysis of leading cloud providers.

  • 24.4 - Commercial Cloud Storage Trends

    Use of Dropbox, iCloud, Box etc.

  • 24.5 - Cloud Applications I

    This lesson discusses applications on the cloud, including data intensive problems and a comparison with high performance computing.

  • 24.6 - Cloud Applications II

    This lesson continues the discussion of applications on the cloud, including data intensive problems and the comparison with high performance computing.

  • 24.7 - Science Clouds

    Science Applications and Internet of Things.

  • 24.8 - Security

    This short lesson discusses the need for security and issues in its implementation.

  • 24.9 - Comments on Fault Tolerance and Synchronicity Constraints

    Clouds trade scalability for greater possibility of faults but here clouds offer good support for recovery from faults. We discuss both storage and program fault tolerance noting that parallel computing is especially sensitive to faults as a fault in one task will impact all other tasks in the parallel job.

  • Unit Overview
  • Lesson 1 - The 10 Interaction scenarios (access patterns) I
  • Lesson 2 - The 10 Interaction scenarios - Science Examples
  • Lesson 3 - Remaining general access patterns
  • Lesson 4 - Data in the Cloud
  • Lesson 5 - Applications Processing Big Data
  • Overview

    We describe the way users and data interact with a cloud system. The unit concludes with the treatment of data in the cloud from an architecture perspective and Big Data Processing from an application perspective with commercial examples including eBay.

  • 25.1 - The 10 Interaction scenarios (access patterns) I

    The next 3 lessons describe the way users and data interact with the system.

  • 25.2 - The 10 Interaction scenarios - Science Examples

    This lesson describes the way users and data interact with the system for some science examples.

  • 25.3 - Remaining general access patterns

    This lesson describes the way users and data interact with the system for the final set of examples.

  • 25.4 - Data in the Cloud

    Databases, File systems, Object Stores and NoSQL are discussed and compared. The way to build a modern data repository in the cloud is introduced.

  • 25.5 - Applications Processing Big Data

    This lesson collects remarks on Big data processing from several sources: Berkeley, Teradata, IBM, Oracle and eBay with architectures and application opportunities.

Section 10 - X-Informatics with X = Web Search and Text Mining and their technologies
26, 27
1h 39min

This section starts with an overview of data mining and puts our study of classification, clustering and exploration methods in context. We examine the problem to be solved in web and text search and note the relevance of history with libraries, catalogs and concordances. An overview of web search is given describing the continued evolution of search engines and the relation to the field of Information Retrieval. The importance of recall, precision and diversity is discussed. The important Bag of Words model is introduced, along with both Boolean queries and the more general fuzzy indices. The important vector space model follows, revisiting the Cosine Similarity as a distance in this bag. The basic TF-IDF approach is discussed. Relevance is discussed with a probabilistic model while the distinction between Bayesian and frequency views of probability distribution completes this unit. Geoffrey starts with an overview of the different steps (data analytics) in web search and then goes through key steps in detail starting with document preparation. An inverted index is described and then how it is prepared for web search. The Boolean and Vector Space approaches to query processing follow. This is followed by Link Structure Analysis including Hubs, Authorities and PageRank. The application of PageRank ideas as reputation outside web search is covered. The web graph structure, crawling it and issues in web advertising and search follow. The use of clustering and topic models completes the section.

  • Unit Overview
  • Lesson 1 - Web and Document/Text Search: The Problem
  • Lesson 2 - Information Retrieval leading to Web Search
  • Lesson 3 - History behind Web Search
  • Lesson 4 - Key Fundamental Principles behind Web Search
  • Lesson 5 - Information Retrieval (Web Search) Components
  • Lesson 6 - Search Engines
  • Lesson 7 - Boolean and Vector Space Models
  • Lesson 8 - Web crawling and Document Preparation
  • Lesson 9 - Indices
  • Lesson 10 - TF-IDF and Probabilistic Models
  • Overview

    The unit starts with the web: its size, its shape (coming from the mutual linkage of pages by URL's) and the universal power laws for the number of pages with a particular number of URL's linking out of or in to a page. Information retrieval is introduced and compared to web search. A comparison is given between semantic searches as in databases and the full text search that is the basis of Web search. The origin of web search in libraries, catalogs and concordances is summarized. The DIKW -- Data Information Knowledge Wisdom -- model for web search is discussed, followed by features of documents, collections and the important Bag of Words representation. Queries are presented in the context of an Information Retrieval architecture. The method of judging quality of results, including recall, precision and diversity, is described. A time line for the evolution of search engines is given.

    Boolean and Vector Space models for queries, including the cosine similarity, are introduced. Web Crawlers are discussed and then the steps needed to analyze data from the Web and produce a set of terms. Building and accessing an inverted index is followed by the importance of term specificity and how it is captured in TF-IDF. We note how frequencies are converted into belief and relevance.

  • 26.1 - Web and Document/Text Search: The Problem

    This lesson starts with the web: its size, its shape (coming from the mutual linkage of pages by URL's) and the universal power laws for the number of pages with a particular number of URL's linking out of or in to a page.

  • 26.2 - Information Retrieval leading to Web Search

    Information retrieval is introduced. A comparison is given between semantic searches as in databases and the full text search that is the basis of Web search. The ACM classification illustrates the potential complexity of ontologies. Some differences between web search and information retrieval are given.

  • 26.3 - History behind Web Search

    The origin of web search in libraries, catalogs and concordances is summarized.

  • 26.4 - Key Fundamental Principles behind Web Search

    This lesson describes the DIKW -- Data Information Knowledge Wisdom -- model for web search. Then it discusses documents, collections and the important Bag of Words representation.
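
    As a tiny illustration (invented documents, not from the lecture), the Bag of Words representation keeps only which terms occur in a document and how often, discarding word order:

```python
# Bag of Words sketch: each document becomes a multiset of term counts.
from collections import Counter

docs = ["the quick brown fox jumps over the lazy dog",
        "the dog barks at the quick fox"]

bags = [Counter(doc.lower().split()) for doc in docs]
for i, bag in enumerate(bags):
    print(f"doc {i}:", dict(bag))
# Word order is gone; only term occurrence counts remain.
```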

  • 26.5 - Information Retrieval (Web Search) Components

    This describes queries in context of an Information Retrieval architecture. The method of judging quality of results including recall, precision and diversity is described.

  • 26.6 - Search Engines

    This short lesson describes a time line for evolution of search engines. The first web search approaches were directly built on Information retrieval but in 1998 the field was changed when Google was founded and showed the importance of URL structure as exemplified by PageRank.

  • 26.7 - Boolean and Vector Space Models

    This lesson describes the Boolean and Vector Space models for queries, including the cosine similarity.

  • 26.8 - Web crawling and Document Preparation

    This describes a Web Crawler and then the steps needed to analyze data from the Web and produce a set of terms.

  • 26.9 - Indices

    This lesson describes both building and accessing an inverted index. It describes how phrases are treated and gives details of query structure from some early logs.
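
    A minimal sketch of the data structure (with invented documents): an inverted index maps each term to the set of documents containing it, and a Boolean AND query is just a set intersection.

```python
# Inverted index sketch: term -> set of document ids, plus a Boolean AND query.
from collections import defaultdict

docs = {0: "big data on clouds",
        1: "clouds for science",
        2: "big data analytics"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

hits = index["big"] & index["data"]    # documents containing both terms
print(sorted(hits))                    # -> [0, 2]
```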

  • 26.10 - TF-IDF and Probabilistic Models

    This lesson describes the importance of term specificity and how it is captured in TF-IDF. It notes how frequencies are converted into belief and relevance.
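
    A minimal TF-IDF sketch (a tiny invented collection and one common variant of the weighting): term frequency within a document multiplied by the logarithmic inverse document frequency across the collection, so common terms get low weight and rare terms high weight.

```python
# TF-IDF sketch: tf(term, doc) * log(N / df(term)) over a tiny invented collection.
import math
from collections import Counter

docs = ["big data on clouds", "clouds for science", "big data analytics"]
tokenized = [d.lower().split() for d in docs]
N = len(tokenized)

# document frequency: number of documents each term appears in
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)
    idf = math.log(N / df[term])
    return tf * idf

print(round(tf_idf("clouds", tokenized[0]), 3))   # frequent across docs -> low weight
print(round(tf_idf("science", tokenized[1]), 3))  # rare term -> higher weight
```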

  • Unit Overview
  • Lesson 1 - Data Analytics for Web Search
  • Lesson 2 - Link Structure Analysis including PageRank I
  • Lesson 3 - Link Structure Analysis including PageRank II
  • Lesson 4 - Web Advertising and Search
  • Lesson 5 - Clustering and Topic Models
  • Overview

    Geoffrey starts with an overview of the different steps (data analytics) in web search. This is followed by Link Structure Analysis including Hubs, Authorities and PageRank. The application of PageRank ideas as reputation outside web search is covered. Issues in web advertising and search follow; this leads to the emerging field of computational advertising. The use of clustering and topic models completes the unit, with Google News as an example.

  • 27.1 - Data Analytics for Web Search

    This short lesson describes the different steps needed in web search including: Get the digital data (from web or from scanning); Crawl web; Preprocess data to get searchable things (words, positions); Form Inverted Index mapping words to documents; Rank relevance of documents with potentially sophisticated techniques; and integrate technology to support advertising and ways to allow or stop pages artificially enhancing relevance.

  • 27.2 - Link Structure Analysis including PageRank I

    The value of links and the concepts of Hubs and Authorities are discussed. This leads to the definition of PageRank with examples. Extensions of PageRank viewed as a reputation are discussed with journal rankings and university department rankings as examples. There are many extensions of these ideas which are not discussed here, although topic models are covered briefly in a later lesson.

  • 27.3 - Link Structure Analysis including PageRank II

    The value of links and the concepts of Hubs and Authorities are discussed. This leads to the definition of PageRank with examples. Extensions of PageRank viewed as a reputation are discussed with journal rankings and university department rankings as examples. There are many extensions of these ideas which are not discussed here, although topic models are covered briefly in a later lesson.
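
    A minimal PageRank sketch (a four-page invented link graph and the conventional damping factor 0.85, not the lecture's example): power iteration converges to the leading eigenvector of the damped link matrix.

```python
# PageRank by power iteration on a tiny invented web graph.
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # page i -> pages it links to
n, d = 4, 0.85                                 # damping factor 0.85

# column-stochastic matrix: M[j, i] = 1/outdegree(i) when i links to j
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

r = np.full(n, 1.0 / n)                        # start from the uniform distribution
for _ in range(50):
    r = (1 - d) / n + d * M @ r                # PageRank update
print(np.round(r, 3))                          # page 2, with the most in-links, ranks highest
```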

  • 27.4 - Web Advertising and Search

    Internet and mobile advertising is growing fast and can be personalized more than traditional media. There are several advertising types (sponsored search, contextual ads, display ads) and different models: cost per viewing, cost per clicking and cost per action. This leads to the emerging field of computational advertising.

  • 27.5 - Clustering and Topic Models

    We discuss briefly approaches to defining groups of documents. We illustrate this for Google News and give an example that this can give different answers from word-based analyses. We mention some work at Indiana University on a Latent Semantic Indexing model.

Section 11 - Technology for Big Data Applications & Analytics
28, 29, 30, 31
1h 58min

Geoffrey uses the K-means Python code in the SciPy package to show real code for clustering. After a simple example we generate 4 clusters with distinct centers and various choices of sizes, using Matplotlib for visualization. We show results can sometimes be incorrect and sometimes make different choices among comparable solutions. We discuss the ''hill'' between different solutions and the rationale for running K-means many times and choosing the best answer. Then we introduce MapReduce with the basic architecture and a homely example. The discussion of advanced topics includes an extension to Iterative MapReduce from Indiana University called Twister and a generalized Map Collective model. Some measurements of parallel performance are given. The SciPy K-means code is modified to support a MapReduce execution style. This illustrates the key ideas of mappers and reducers. With an appropriate runtime this code would run in parallel but here the ''parallel'' maps run sequentially. This simple 2 map version can be generalized to scalable parallelism. Python is used to calculate PageRank from the Web Linkage Matrix, showing several different formulations of the basic matrix equations for finding the leading eigenvector. The unit is concluded by a calculation of PageRank for general web pages by extracting the secret from Google.

  • Unit Overview
  • Lesson 1 - K-means in Python
  • Lesson 2 - Analysis of 4 Artificial Clusters I
  • Lesson 3 - Analysis of 4 Artificial Clusters II
  • Lesson 4 - Analysis of 4 Artificial Clusters III
  • Overview

    Geoffrey uses the K-means Python code in the SciPy package to show real code for clustering. After a simple example we generate 4 clusters with distinct centers and various choices of sizes, using Matplotlib for visualization. We show results can sometimes be incorrect and sometimes make different choices among comparable solutions. We discuss the ''hill'' between different solutions and the rationale for running K-means many times and choosing the best answer.

  • 28.1 - K-means in Python

    Geoffrey uses the K-means Python code in the SciPy package to show real code for clustering and applies it to a set of 85 two-dimensional vectors -- officially sets of weights and heights to be clustered to find T-shirt sizes. We run through the Python code with Matplotlib displays to divide into 2-5 clusters. Then we discuss Python code to generate 4 clusters of varying sizes, centered at the corners of a square in two dimensions. We formally give the K-means algorithm more carefully than before and make the definition consistent with the code in SciPy.
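
    A minimal usage sketch of the SciPy routines discussed here (with invented weight/height style data rather than the course's 85-point set): whiten the observations, run kmeans, then assign each point to its nearest center with vq.

```python
# SciPy k means usage sketch on invented 2D weight/height style data.
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

rng = np.random.default_rng(0)
data = np.vstack([rng.normal([60, 160], [5, 6], (40, 2)),    # lighter/shorter group
                  rng.normal([85, 185], [6, 5], (45, 2))])   # heavier/taller group

obs = whiten(data)                       # rescale each feature to unit variance
centers, distortion = kmeans(obs, 2)     # by default runs k means 20 times, keeps the best
labels, _ = vq(obs, centers)             # assign every point to its nearest center
print(centers.round(2), round(float(distortion), 3))
```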

  • 28.2 - Analysis of 4 Artificial Clusters I

    We present clustering results on the artificial set of 1000 2D points described in the previous lesson for 3 choices of cluster sizes: ''small'', ''large'' and ''very large''. We emphasize that SciPy always does 20 independent K-means runs and takes the best result -- an approach to avoiding local minima. We allow this number of independent runs to be changed and in particular set it to 1 to generate more interesting erratic results. We describe changes in our new K-means code, which also allows two measures of quality. The slides give many results of clustering into 2, 4, 6 and 8 clusters (there were only 4 real clusters). We show that the ''very small'' case has two very different solutions when clustered into two clusters and use this to discuss functions with multiple minima and a hill between them. The lesson has both discussion of already produced results in the slides and interactive use of Python for new runs.

  • 28.3 - Analysis of 4 Artificial Clusters II

    We present clustering results on the artificial set of 1000 2D points described in the previous lesson for 3 choices of cluster size: "small", "large" and "very large". We emphasize that SciPy by default performs 20 independent K-means runs and takes the best result -- an approach to avoiding local minima. We allow this number of independent runs to be changed and, in particular, set it to 1 to generate more interesting, erratic results. We describe changes in our new K-means code, which also supports two measures of cluster quality. The slides give many results of clustering into 2, 4, 6 and 8 clusters (there were only 4 real clusters). We show that the "very small" case has two very different solutions when clustered into two clusters and use this to discuss functions with multiple minima and the hill between them. The lesson includes both discussion of already-produced results in the slides and interactive use of Python for new runs.

  • 28.4 - Analysis of 4 Artificial Clusters III

    We present clustering results on the artificial set of 1000 2D points described in the previous lesson for 3 choices of cluster size: "small", "large" and "very large". We emphasize that SciPy by default performs 20 independent K-means runs and takes the best result -- an approach to avoiding local minima. We allow this number of independent runs to be changed and, in particular, set it to 1 to generate more interesting, erratic results. We describe changes in our new K-means code, which also supports two measures of cluster quality. The slides give many results of clustering into 2, 4, 6 and 8 clusters (there were only 4 real clusters). We show that the "very small" case has two very different solutions when clustered into two clusters and use this to discuss functions with multiple minima and the hill between them. The lesson includes both discussion of already-produced results in the slides and interactive use of Python for new runs.

  • Unit Overview
  • Lesson 1 - Introduction
  • Lesson 2 - Advanced Topics I
  • Lesson 3 - Advanced Topics II
  • Overview

    Geoffrey's introduction to MapReduce describes the basic architecture and a homely example. The discussion of advanced topics includes an extension to Iterative MapReduce from Indiana University called Twister and a generalized Map Collective model. Some measurements of parallel performance are given.

  • 29.1 - Introduction

    This introduction uses an analogy to making fruit punch -- slicing and blending fruit -- to illustrate MapReduce. The formal structure of MapReduce and Iterative MapReduce is presented, with parallel data flowing from disks through multiple Map and Reduce phases to be inspected by the user. A toy sketch of the Map-shuffle-Reduce pattern follows.
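
    The generic Map -> shuffle -> Reduce pattern can be illustrated with a toy word count in plain Python; this is a standard illustration, not code from the course.

      # Toy illustration of the Map -> shuffle -> Reduce pattern (a generic word count).
      from collections import defaultdict

      def mapper(line):
          # Map: emit (key, value) pairs for one input record
          for word in line.split():
              yield word.lower(), 1

      def reducer(key, values):
          # Reduce: combine all values seen for one key
          return key, sum(values)

      lines = ["the cat sat", "the cat ran", "a dog ran"]

      intermediate = [pair for line in lines for pair in mapper(line)]   # Map phase

      groups = defaultdict(list)                                         # shuffle: group by key
      for key, value in intermediate:
          groups[key].append(value)

      counts = dict(reducer(k, v) for k, v in groups.items())            # Reduce phase
      print(counts)   # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 2, 'a': 1, 'dog': 1}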

  • 29.2 - Advanced Topics I

    This defines 4 types of MapReduce and the Map Collective model of Qiu. The Iterative MapReduce model from Indiana University called Twister is described and a few performance measurements on Microsoft Azure are presented.

  • 29.3 - Advanced Topics II

    This defines 4 types of MapReduce and the Map Collective model of Qiu. The Iterative MapReduce model from Indiana University called Twister is described and a few performance measurements on Microsoft Azure are presented.

  • Unit Overview
  • Lesson 1 - MapReduce Kmeans in Python I
  • Lesson 2 - MapReduce Kmeans in Python II
  • Overview

    Geoffrey modifies the SciPy K-means code to support a MapReduce execution style and runs it in this short unit. This illustrates the key ideas of mappers and reducers. With an appropriate runtime this code would run in parallel, but here the "parallel" maps run sequentially. Geoffrey stresses that this simple two-map version can be generalized to scalable parallelism.

  • 30.1 - MapReduce Kmeans in Python I

    Geoffrey modifies the SciPy K-means code to support a MapReduce execution style and runs it in this short unit. This illustrates the key ideas of mappers and reducers. With an appropriate runtime this code would run in parallel, but here the "parallel" maps run sequentially. Geoffrey stresses that this simple two-map version can be generalized to scalable parallelism; a small sketch of the idea follows.
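
    The sketch below illustrates the mapper/reducer split for one MapReduce-style K-means iteration, with two "mappers" run sequentially; it is a simplified stand-in for the idea, not the course's modified SciPy code.

      # Simplified MapReduce-style K-means with two "mappers" run one after the other.
      # This is a stand-in for the idea only, not the course's modified SciPy code.
      import numpy as np

      def kmeans_map(block, centroids):
          """Mapper: for one data block, emit per-centroid partial sums and counts."""
          sums = np.zeros_like(centroids)
          counts = np.zeros(len(centroids), dtype=int)
          for x in block:
              j = np.argmin(((centroids - x) ** 2).sum(axis=1))   # nearest centroid
              sums[j] += x
              counts[j] += 1
          return sums, counts

      def kmeans_reduce(partials, old_centroids):
          """Reducer: combine partial results from all mappers into new centroids."""
          total_sums = sum(s for s, _ in partials)
          total_counts = sum(c for _, c in partials)
          new = old_centroids.copy()
          nonempty = total_counts > 0
          new[nonempty] = total_sums[nonempty] / total_counts[nonempty][:, None]
          return new

      rng = np.random.default_rng(0)
      data = rng.normal(size=(200, 2))
      centroids = data[rng.choice(len(data), 3, replace=False)]

      for _ in range(10):                        # a few iterations of Map followed by Reduce
          blocks = np.array_split(data, 2)       # two maps; here they simply run sequentially
          partials = [kmeans_map(b, centroids) for b in blocks]
          centroids = kmeans_reduce(partials, centroids)
      print(centroids)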

  • 30.2 - MapReduce Kmeans in Python II

    Geoffrey modifies the SciPy K-means code to support a MapReduce execution style and runs it in this short unit. This illustrates the key ideas of mappers and reducers. With an appropriate runtime this code would run in parallel, but here the "parallel" maps run sequentially. Geoffrey stresses that this simple two-map version can be generalized to scalable parallelism.

  • Unit Overview
  • Lesson 1 - Calculate PageRank from Web Linkage Matrix I
  • Lesson 2 - Calculate PageRank from Web Linkage Matrix II
  • Lesson 3 - Calculate PageRank of a real page
  • Overview

    Geoffrey uses Python to calculate PageRank from the web linkage matrix, showing several different formulations of the basic matrix equations for finding the leading eigenvector. The unit concludes with a calculation of PageRank for general web pages by extracting the secret from Google.

  • 31.1 - Calculate PageRank from Web Linkage Matrix I

    Geoffrey takes two simple matrices, for 6 and 8 web sites respectively, to illustrate the calculation of PageRank; a minimal power-iteration sketch on a hypothetical 6-page graph follows.
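
    The sketch below runs power iteration for PageRank on a hypothetical 6-page link graph with damping factor 0.85; the link structure is invented for illustration and is not the course's 6- or 8-site matrix.

      # Power-iteration sketch for PageRank on a hypothetical 6-page link graph
      # (damping factor 0.85); the links are invented for illustration.
      import numpy as np

      links = {0: [1, 2], 1: [2], 2: [0], 3: [2, 4], 4: [3, 5], 5: [3]}
      N, d = 6, 0.85

      # Column-stochastic link matrix M: M[j, i] = 1/outdegree(i) if page i links to page j
      M = np.zeros((N, N))
      for i, outs in links.items():
          for j in outs:
              M[j, i] = 1.0 / len(outs)

      r = np.full(N, 1.0 / N)
      for _ in range(100):                        # iterate r <- d*M*r + (1-d)/N until it settles
          r_next = d * M @ r + (1.0 - d) / N
          done = np.abs(r_next - r).sum() < 1e-10
          r = r_next
          if done:
              break
      print(np.round(r / r.sum(), 4))             # approximate leading eigenvector, i.e. the PageRanks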

  • 31.2 - Calculate PageRank from Web Linkage Matrix II

    Geoffrey takes two simple matrices, for 6 and 8 web sites respectively, to illustrate the calculation of PageRank.

  • 31.3 - Calculate PageRank of a real page

    This tiny lesson presents Python code that finds the PageRank Google calculates for any page on the web.

Section 12 - X= Sports Case Study
32, 33, 34
3h 24min

Sports is seeing significant growth in analytics, with pervasive statistics shifting to more sophisticated measures. We start with baseball, as the game is built around segments dominated by individuals, and detailed (video/image) achievement measures including PITCHf/x and FIELDf/x are moving the field into the big data arena. There are interesting relationships between the economics of sports and big data analytics. We look at Wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other sports: Soccer, the Olympics, NFL Football, Basketball, Tennis and Horse Racing.

  • Unit Overview
  • Lesson 1 - Introduction and Sabermetrics (Baseball Informatics) Lesson
  • Lesson 2 - Basic Sabermetrics
  • Lesson 3 - Wins Above Replacement
  • Overview

    This unit discusses baseball, starting with the movie Moneyball and the 2002-2003 Oakland Athletics. Unlike sports such as basketball and soccer, most baseball action is built around individuals, often interacting in pairs. This is much easier to quantify than the many-player phenomena in other sports. We discuss the Performance-Dollar relationship, including new stadiums and media/advertising. We look at classic baseball averages and sophisticated measures like Wins Above Replacement.

  • 32.1 - Introduction and Sabermetrics (Baseball Informatics) Lesson

    Introduction to all Sports Informatics, Moneyball and the 2002-2003 Oakland Athletics, the Diamond Dollars economic model of baseball, the Performance-Dollar relationship, and the Value of a Win.

  • 32.2 - Basic Sabermetrics

    Different Types of Baseball Data, Sabermetrics, Overview of all data, Details of some statistics based on basic data: OPS, wOBA, ERA, ERC, FIP, UZR. A small sketch of two of the classic formulas follows.
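
    The following is a small, self-contained sketch of two of the classic statistics named above, OPS and ERA, using their standard definitions; the season line at the bottom is made up for illustration and is not drawn from course data.

      # Two classic sabermetric statistics from this lesson, using their standard
      # definitions; the season line at the bottom is made up for illustration.

      def obp(h, bb, hbp, ab, sf):
          """On-base percentage: times on base per plate appearance."""
          return (h + bb + hbp) / (ab + bb + hbp + sf)

      def slg(singles, doubles, triples, hr, ab):
          """Slugging percentage: total bases per at-bat."""
          return (singles + 2 * doubles + 3 * triples + 4 * hr) / ab

      def ops(h, bb, hbp, ab, sf, singles, doubles, triples, hr):
          """OPS = on-base percentage plus slugging percentage."""
          return obp(h, bb, hbp, ab, sf) + slg(singles, doubles, triples, hr, ab)

      def era(earned_runs, innings_pitched):
          """Earned run average: earned runs allowed per nine innings."""
          return 9.0 * earned_runs / innings_pitched

      # Hypothetical season line (150 hits = 90 singles + 35 doubles + 5 triples + 20 HR)
      print(round(ops(h=150, bb=60, hbp=5, ab=500, sf=5,
                      singles=90, doubles=35, triples=5, hr=20), 3))
      print(round(era(earned_runs=70, innings_pitched=200.0), 2))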

  • 32.3 - Wins Above Replacement

    Wins Above Replacement (WAR), Discussion of Calculation, Examples, Comparisons of different methods, Coefficient of Determination, Another Sabermetrics Example, Summary of Sabermetrics. A short sketch of the coefficient of determination follows.
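
    As a small aside on the coefficient of determination mentioned above, the sketch below computes R² for a simple linear fit on synthetic data; the "statistic vs. wins" framing and all numbers are illustrative assumptions, not the course's WAR comparisons.

      # Coefficient of determination (R^2) for a simple linear fit; all data is synthetic
      # and only illustrates the definition, not the course's WAR comparisons.
      import numpy as np

      def r_squared(y, y_pred):
          ss_res = np.sum((y - y_pred) ** 2)          # residual sum of squares
          ss_tot = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
          return 1.0 - ss_res / ss_tot

      rng = np.random.default_rng(1)
      x = rng.uniform(0.65, 0.85, size=30)            # e.g. some team-level statistic
      y = 100 * x + rng.normal(0, 3, size=30)         # e.g. wins, loosely driven by x

      slope, intercept = np.polyfit(x, y, 1)          # ordinary least-squares line
      print(round(r_squared(y, slope * x + intercept), 3))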

  • Unit Overview
  • Lesson 1 - Pitching Clustering
  • Lesson 2 - Pitcher Quality
  • Lesson 3 - PITCHf/X
  • Lesson 4 - Other Video Data Gathering in Baseball
  • Overview

    This unit discusses 'advanced sabermetrics', covering the advances made possible by video from PITCHf/x, FIELDf/x, HITf/x, COMMANDf/x and MLBAM.

  • 33.1 - Pitching Clustering

    A Big Data pitcher clustering method introduced by Vince Gennaro, with data from a blog and a video at the 2013 SABR conference.

  • 33.2 - Pitcher Quality

    Results of optimizing match-ups, with data from a video at the 2013 SABR conference.

  • 33.3 - PITCHf/X

    Examples of use of PITCHf/X.

  • 33.4 - Other Video Data Gathering in Baseball

    FIELDf/x, MLBAM, HITf/x, COMMANDf/x.

  • Unit Overview
  • Lesson 1 - Wearables
  • Lesson 2 - Soccer and the Olympics
  • Lesson 3 - Spatial Visualization in NFL and NBA
  • Lesson 4 - Tennis and Horse Racing
  • Overview

    We look at Wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other Sports: Soccer, Olympics, NFL Football, Basketball, Tennis and Horse Racing.

  • 34.1 - Wearables

    Consumer Sports, Stake Holders, and Multiple Factors.

  • 34.2 - Soccer and the Olympics

    Soccer, Tracking Players and Balls, Olympics.

  • 34.3 - Spatial Visualization in NFL and NBA

    NFL, NBA, and Spatial Visualization.

  • 34.4 - Tennis and Horse Racing

    Tennis, Horse Racing, and Continued Emphasis on Spatial Visualization.

Section 13 - X= Health Informatics Case Study
35
2h 25min

This section starts by discussing general aspects of Big Data and Health, including data sizes and different areas such as genomics, the European Bioinformatics Institute (EBI), radiology and the Quantified Self movement. We review the current state of health care and trends associated with it, including the increased use of telemedicine. We summarize an industry survey by GE and Accenture and an impressive exemplar, a Cloud-based medicine system from Potsdam. We give some details of big data in medicine. Some remarks on Cloud computing and Health focus on security and privacy issues. We survey an April 2013 McKinsey report on the Big Data revolution in US health care, a Microsoft report in this area, and a European Union report on how Big Data will allow patient-centered care in the future. Examples are given of the Internet of Things, which will have a great impact on health, including wearables. A study looks at 4 scenarios for healthcare in 2032: two are positive, one is middle of the road, and one is negative. The final topic is Genomics, Proteomics and Information Visualization.

  • Unit Overview
  • Lesson 1 - Big Data and Health
  • Lesson 2 - Status of Healthcare Today
  • Lesson 3 - Telemedicine (Virtual Health)
  • Lesson 4 - Big Data and Healthcare Industry
  • Lesson 5 - Medical Big Data in the Clouds
  • Lesson 6 - Medical image Big Data
  • Lesson 7 - Clouds and Health
  • Lesson 8 - McKinsey Report on the big-data revolution in US health care
  • Lesson 9 - Microsoft Report on Big Data in Health
  • Lesson 10 - EU Report on Redesigning health in Europe for 2020
  • Lesson 11 - Medicine and the Internet of Things
  • Lesson 12 - Extrapolating to 2032
  • Lesson 13 - Genomics, Proteomics and Information Visualization I
  • Lesson 14 - Genomics, Proteomics and Information Visualization II
  • Lesson 15 - Genomics, Proteomics and Information Visualization III
  • Overview

    This section starts by discussing general aspects of Big Data and Health, including data sizes and different areas such as genomics, the European Bioinformatics Institute (EBI), radiology and the Quantified Self movement. We review the current state of health care and trends associated with it, including the increased use of telemedicine. We summarize an industry survey by GE and Accenture and an impressive exemplar, a Cloud-based medicine system from Potsdam. We give some details of big data in medicine. Some remarks on Cloud computing and Health focus on security and privacy issues. We survey an April 2013 McKinsey report on the Big Data revolution in US health care, a Microsoft report in this area, and a European Union report on how Big Data will allow patient-centered care in the future. Examples are given of the Internet of Things, which will have a great impact on health, including wearables. A study looks at 4 scenarios for healthcare in 2032: two are positive, one is middle of the road, and one is negative. The final topic is Genomics, Proteomics and Information Visualization.

  • 35.1 - Big Data and Health

    This lesson starts with general aspects of Big Data and Health, including a list of subareas where Big Data is important. Data sizes are given for radiology, genomics, personalized medicine, and the Quantified Self movement, along with data sizes at, and access to, the European Bioinformatics Institute.

  • 35.2 - Status of Healthcare Today

    This covers trends in the cost and type of healthcare, with low-cost genomes and an aging population, as well as social media and the government BRAIN Initiative.

  • 35.3 - Telemedicine (Virtual Health)

    This describes the increasing use of telemedicine and how we tried and failed to do this in 1994.

  • 35.4 - Big Data and Healthcare Industry

    Summary of an industry survey by GE and Accenture.

  • 35.5 - Medical Big Data in the Clouds

    An impressive exemplar Cloud-based medicine system from Potsdam.

  • 35.6 - Medical image Big Data
  • 35.7 - Clouds and Health

    This lesson starts with a look at a startup using clouds in Health/Medical informatics. Then we review an online presentation on Healthcare and clouds. Conforming with the HIPAA act is important. Advantages of clouds include low cost, ease of anywhere access and the convenience of sharing information. Security and privacy are the main problem areas.

  • 35.8 - McKinsey Report on the big-data revolution in US health care

    This lesson covers 9 aspects of the McKinsey report: the convergence of multiple positive changes has created a tipping point for innovation; primary data pools are at the heart of the big data revolution in healthcare; big data is changing the paradigm through value pathways; applying early successes at scale could reduce US healthcare costs by $300 billion to $450 billion; most new big-data applications target consumers and providers across pathways; innovations are weighted towards influencing individual decision-making levers; big data innovations use a range of public, acquired, and proprietary data types; organizations implementing a big data transformation should provide the leadership required for the associated cultural transformation; and companies must develop a range of big data capabilities.

  • 35.9 - Microsoft Report on Big Data in Health

    This lesson identifies data sources as Clinical Data, Pharma & Life Science Data, Patient & Consumer Data, Claims & Cost Data and Correlational Data. Three approaches are Live data feed, Advanced analytics and Social analytics.

  • 35.10 - EU Report on Redesigning health in Europe for 2020

    This lesson summarizes an EU Report on Redesigning health in Europe for 2020. The power of data is seen as a lever for change through: My Data, My Decisions; Liberate the Data; Connect Up Everything; Revolutionize Health; and Include Everyone, removing the current correlation between health and wealth.

  • 35.11 - Medicine and the Internet of Things

    The Internet of Things will have great impact on health including telemedicine and wearables. Examples are given.

  • 35.12 - Extrapolating to 2032

    A study looks at 4 scenarios for healthcare in 2032: two are positive, one is middle of the road, and one is negative.

  • 35.13 - Genomics, Proteomics and Information Visualization I

    A study of an Azure application with an Excel frontend and a cloud BLAST backend starts this lesson. This is followed by a big data analysis of personal genomics and an analysis of a typical DNA sequencing analytics pipeline. The Protein Sequence Universe is defined and used to motivate Multidimensional Scaling (MDS). Sammon's method is defined and its use illustrated by a metagenomics example. Subtleties in the use of MDS include a monotonic mapping of the dissimilarity function. The application to the COG proteomics dataset is discussed. We note that the MDS approach is related to the well-known chi-squared method, and some aspects of nonlinear minimization of chi-squared (least squares) are discussed. A small stand-in MDS sketch follows.
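
    As a rough stand-in for the dimension-reduction step discussed here, the sketch below runs scikit-learn's MDS on a precomputed dissimilarity matrix; the random "sequence" vectors are illustrative, and this is not the course's Sammon / chi-squared MDS implementation.

      # Rough stand-in for the dimension-reduction step: scikit-learn MDS on a
      # precomputed dissimilarity matrix. The random "sequence" vectors are illustrative;
      # this is not the course's Sammon / chi-squared MDS implementation.
      import numpy as np
      from sklearn.manifold import MDS

      rng = np.random.default_rng(0)
      vectors = rng.normal(size=(50, 20))             # pretend high-dimensional feature vectors
      diss = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)

      embedding = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
      points_2d = embedding.fit_transform(diss)       # 2D coordinates for visualization
      print(points_2d.shape)                          # (50, 2)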

  • 35.14 - Genomics, Proteomics and Information Visualization II

    This lesson continues the discussion of the COG Protein Universe introduced in the last lesson. It is shown how proteomics clusters are clearly seen in the Universe browser. This motivates a side remark on different clustering methods applied to metagenomics. Then we discuss the Generative Topographic Mapping (GTM) method, which can be used for dimension reduction when the original data is in a metric space; in this case it is faster than MDS, as GTM's computational complexity scales like N rather than the N² seen in MDS. Examples are given of GTM, including an application to topic models in Information Retrieval. Indiana University has developed a deterministic annealing improvement of GTM. Three separate clusterings are projected for visualization and show very different structure, emphasizing the importance of visualizing the results of data analytics. The final slide shows an application of MDS to generate and visualize phylogenetic trees.

  • 35.15 - Genomics, Proteomics and Information Visualization III

    This lesson continues the discussion of the COG Protein Universe introduced in the last lesson. It is shown how proteomics clusters are clearly seen in the Universe browser. This motivates a side remark on different clustering methods applied to metagenomics. Then we discuss the Generative Topographic Mapping (GTM) method, which can be used for dimension reduction when the original data is in a metric space; in this case it is faster than MDS, as GTM's computational complexity scales like N rather than the N² seen in MDS. Examples are given of GTM, including an application to topic models in Information Retrieval. Indiana University has developed a deterministic annealing improvement of GTM. Three separate clusterings are projected for visualization and show very different structure, emphasizing the importance of visualizing the results of data analytics. The final slide shows an application of MDS to generate and visualize phylogenetic trees.

Section 14 - X= Sensors Case Study
36
1h 5min

Geoffrey starts with the Internet of Things (IoT), giving examples such as monitors of machine operation, QR codes, surveillance cameras, scientific sensors, drones, self-driving cars and, more generally, transportation systems. We give examples of robots and drones. We introduce the Industrial Internet of Things (IIoT) and summarize industry-wide surveys and expectations. We give examples from General Electric. Sensor clouds control the many small distributed devices of the IoT and IIoT. More detail is given for radar data gathered by sensors; ubiquitous or smart cities and homes, including U-Korea; and finally the smart electric grid.

  • Unit Overview
  • Lesson 1 - Internet of Things
  • Lesson 2 - Robotics and IOT Expectations
  • Lesson 3 - Industrial Internet of Things I
  • Lesson 4 - Industrial Internet of Things II
  • Lesson 5 - Sensor Clouds
  • Lesson 6 - Earth/Environment/Polar Science data gathered by Sensors
  • Lesson 7 - Ubiquitous/Smart Cities
  • Lesson 8 - U-Korea(U=Ubiquitous)
  • Lesson 9 - Smart Grid
  • Overview

    Geoffrey starts with the Internet of Things (IoT), giving examples such as monitors of machine operation, QR codes, surveillance cameras, scientific sensors, drones, self-driving cars and, more generally, transportation systems. We give examples of robots and drones. We introduce the Industrial Internet of Things (IIoT) and summarize industry-wide surveys and expectations. We give examples from General Electric. Sensor clouds control the many small distributed devices of the IoT and IIoT. More detail is given for radar data gathered by sensors; ubiquitous or smart cities and homes, including U-Korea; and finally the smart electric grid.

  • 36.1 - Internet of Things

    There are predicted to be 24-50 billion devices on the Internet by 2020; these are typically some sort of sensor, defined here as any source or sink of time series data. Sensors include smartphones, webcams, monitors of machine operation, barcodes, surveillance cameras, scientific sensors (especially in earth and environmental science), drones, self-driving cars and, more generally, transportation systems. The lesson gives many examples of distributed sensors, which form a Grid that is controlled by a cloud.

  • 36.2 - Robotics and IOT Expectations

    Examples of Robots and Drones.

  • 36.3 - Industrial Internet of Things I

    We summarize industry-wide surveys and expectations.

  • 36.4 - Industrial Internet of Things II

    Examples from General Electric.

  • 36.5 - Sensor Clouds

    Geoffrey describes the architecture of a Sensor Cloud control environment and gives an example of the interface to an older version of it. The performance of the system is measured in terms of processing latency as a function of the number of involved sensors, each delivering data at a rate of 1.8 Mbps.

  • 36.6 - Earth/Environment/Polar Science data gathered by Sensors

    This lesson gives examples of some sensors in the Earth/Environment/Polar Science field. It starts with material from the CReSIS polar remote sensing project and then looks at the NSF Ocean Observing Initiative and NASA's MODIS or Moderate Resolution Imaging Spectroradiometer instrument on a satellite.

  • 36.7 - Ubiquitous/Smart Cities

    For Ubiquitous/Smart cities we give two examples: Ubiquitous Korea and smart electrical grids.

  • 36.8 - U-Korea(U=Ubiquitous)

    Korea has an interesting position: it is first worldwide in broadband access per capita, e-government, scientific literacy and total working hours. However, it ranks far lower in measures like quality of life and GDP. U-Korea aims to improve the latter through pervasive computing, everywhere and anytime, i.e. by spreading sensors everywhere. The example of the 'High-Tech Utopia' New Songdo is given.

  • 36.9 - Smart Grid

    The electrical Smart Grid aims to enhance the USA's aging electrical infrastructure through the pervasive deployment of sensors and the integration of their measurements in a cloud or equivalent server infrastructure. The variety of new instruments includes smart meters, power monitors, and measures of solar irradiance, wind speed, and temperature. One goal is autonomous local power units where good use is made of waste heat.

Section 15 - X= Radar Case Study
37
21min
  • Unit Overview
  • Lesson 1 - Introduction
  • Lesson 2 - Remote Sensing
  • Lesson 3 - Ice Sheet Science
  • Lesson 4 - Global Climate Change
  • Lesson 5 - Radio Overview
  • Lesson 6 - Radio Informatics
  • Overview

    The changing global climate is suspected to have long-term effects on many of the world's inhabitants. Among the various effects, the rising sea level will directly affect many people living in low-lying coastal regions. While the ocean's thermal expansion has been the dominant contributor to rising sea levels, the potential contribution of discharges from the polar ice sheets in Greenland and Antarctica may pose a more significant threat due to their unpredictable response to the changing climate. The Radar-Informatics unit provides a glimpse into the processes fueling global climate change and explains the methods used for ice data acquisition and analysis.

  • 37.1 - Introduction

    This lesson motivates radar-informatics by building on previous discussions of why X-applications are growing in data size and why analytics are necessary for acquiring knowledge from large data. The lesson details three mosaics of a changing Greenland ice sheet and provides a concise overview of subsequent lessons by explaining how other remote sensing technologies, such as radar, can be used to sound the polar ice sheets and what we are doing with radar images to extract knowledge to be incorporated into numerical models.

  • 37.2 - Remote Sensing

    This explains the basics of remote sensing, the characteristics of remote sensors and remote sensing applications. Emphasis is on image acquisition and data collection in the electromagnetic spectrum.

  • 37.3 - Ice Sheet Science

    This lesson provides a brief understanding of why meltwater at the base of the ice sheet can be detrimental and why it's important for sensors to sound the bedrock.

  • 37.4 - Global Climate Change

    This lesson provides an understanding of the greenhouse effect and its processes, how warming affects the Polar Regions, and the implications of a rise in sea level.

  • 37.5 - Radio Overview

    This lesson provides an elementary introduction to radar and its importance to remote sensing, especially for acquiring information about Greenland and Antarctica.

  • 37.6 - Radio Informatics

    This lesson focuses on the use of sophisticated computer vision algorithms, such as active contours and hidden Markov models, to support data analysis for extracting layers, so that ice sheet models can accurately forecast future changes in climate.