
useR! Machine Learning Tutorial

useR! 2016

Overview

This tutorial contains training modules for six popular supervised machine learning methods, including decision trees, random forests, gradient boosting machines (GBM), generalized linear models (GLM), and deep neural networks. For each algorithm, we will also cover the following practical, related topics:
  • Dimensionality Issues
  • Sparsity
  • Normalization
  • Categorical Data
  • Missing Data
  • Class Imbalance
  • Overfitting
  • Software
  • Scalability
Instructions for installing the necessary software for this tutorial are available here. Data for the tutorial can be downloaded by running ./data/get-data.sh (requires wget).

Dimensionality Issues

Certain algorithms don't scale well when there are millions of features. For example, decision trees require computing some sort of metric (to determine the splits) on all the feature values (or some fraction of the values, as in Random Forest and Stochastic GBM), so computation time grows linearly with the number of features. Other algorithms, such as GLM, scale much better to high-dimensional, wide data (n << p, i.e. far more features than observations) when paired with appropriate regularization (e.g. Lasso, Elastic Net, Ridge).
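
For concreteness, here is a minimal sketch of fitting a regularized GLM to wide (n << p) data using the glmnet package; the package choice and the simulated data are illustrative assumptions, not necessarily what the tutorial modules use.

```r
# Fit a Lasso-regularized logistic regression on wide data (p >> n)
# using glmnet (illustrative package choice).
library(glmnet)

set.seed(1)
n <- 100     # observations
p <- 5000    # features -- far more features than observations
x <- matrix(rnorm(n * p), nrow = n)
y <- rbinom(n, size = 1, prob = plogis(x[, 1] - x[, 2]))

# alpha = 1 is the Lasso, alpha = 0 is Ridge, 0 < alpha < 1 is Elastic Net
fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

fit$lambda.min                         # penalty chosen by cross-validation
sum(coef(fit, s = "lambda.min") != 0)  # number of non-zero coefficients
```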

Sparsity

Algorithms deal with data sparsity (where many of the feature values are zero) in different ways. Some implementations can exploit sparsity to store the data more compactly and speed up computation, so it's good to know whether these shortcuts are available.
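
In R, the Matrix package provides sparse matrix classes that several learners (e.g. glmnet, xgboost) accept directly; the sketch below, on simulated data, shows the storage savings.

```r
# Store a mostly-zero matrix in sparse format with the Matrix package.
library(Matrix)

set.seed(1)
dense  <- matrix(rbinom(1000 * 100, size = 1, prob = 0.01), nrow = 1000)
sparse <- Matrix(dense, sparse = TRUE)  # dgCMatrix: only non-zeros stored

object.size(dense)   # full n x p storage
object.size(sparse)  # much smaller: indices + values of non-zero entries
```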

Normalization

Some algorithms, such as Deep Neural Nets and GLMs, require that the data be normalized for effective training and interpretation of the models. Tree-based algorithms (Decision Trees, Random Forest, Gradient Boosting Machines) do not require normalization: tree-based methods only use information about whether a value is greater than or less than a split point (e.g. x > 7 vs. x ≤ 7), so the values themselves do not matter.
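
A minimal base-R sketch of standardization (zero mean, unit variance), including the detail that new data must be transformed with the training set's statistics, not its own:

```r
# Standardize training features with scale().
train <- data.frame(x1 = rnorm(100, mean = 50, sd = 10),
                    x2 = runif(100, min = 0, max = 1000))
train_scaled <- scale(train)

# Reuse the training means and standard deviations on new data.
centers <- attr(train_scaled, "scaled:center")
scales  <- attr(train_scaled, "scaled:scale")
new_obs <- data.frame(x1 = 55, x2 = 300)
scale(new_obs, center = centers, scale = scales)
```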

Categorical Data

Algorithms handle categorical data differently. Some algorithms, such as GLM and Deep Neural Nets, require that a categorical variable be expanded into a set of indicator variables prior to training. With tree-based methods and software that supports it, there are ways around this requirement, allowing the algorithm to handle categorical features directly. It is important to be cognizant of the cardinality of your categorical features before training, as additional pre-processing (collapsing categories, etc.) may be beneficial for high-cardinality features.
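
In base R, model.matrix() performs this expansion into indicator (dummy) variables; a small illustrative sketch:

```r
# Expand a factor into indicator columns, as GLMs and neural nets require.
df <- data.frame(color = factor(c("red", "green", "blue", "green")),
                 y     = c(1, 0, 1, 0))

# One column per level except the reference level ("blue", alphabetically first)
model.matrix(y ~ color, data = df)

# Check cardinality before training; high-cardinality factors may need
# pre-processing such as collapsing rare categories.
nlevels(df$color)
```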

Missing Data

Assuming the features are missing completely at random, there are a number of ways of handling missing data:
  1. Discard observations with any missing values.
  2. Rely on the learning algorithm to deal with missing values in its training phase.
  3. Impute all missing values before training.
For most learning methods, the imputation approach (3) is necessary. The simplest tactic is to impute the missing value with the mean or median of the nonmissing values for that feature. If the features have at least some moderate degree of dependence, one can do better by estimating a predictive model for each feature given the other features and then imputing each missing value by its prediction from the model.
Some software packages handle missing data automatically, although many don't, so it's important to know whether any pre-processing is required on the user's side.
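
A minimal base-R sketch of strategies (1) and (3) above on toy data:

```r
df <- data.frame(x = c(1.2, NA, 3.4, 2.1, NA),
                 y = c(0, 1, 1, 0, 1))

# (1) Discard observations with any missing values
complete_rows <- na.omit(df)

# (3) Impute missing values with the mean of the non-missing values
df$x[is.na(df$x)] <- mean(df$x, na.rm = TRUE)
df
```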

Class Imbalance

Algorithms that optimize a metric such as accuracy may fail to perform well on training sets that contain a significant degree of class imbalance. Certain algorithms, such as GBM, allow the user to optimize a performance metric of choice, which is useful when you have a highly imbalanced training set.
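
One simple, generic remedy (an illustrative option, not necessarily what the tutorial recommends) is to re-balance the training set, for example by down-sampling the majority class:

```r
# Down-sample the majority class to match the minority class size.
set.seed(1)
y <- factor(rbinom(1000, size = 1, prob = 0.05))  # ~5% positive class
table(y)

minority <- which(y == "1")
majority <- sample(which(y == "0"), size = length(minority))
balanced_idx <- c(minority, majority)
table(y[balanced_idx])  # now 50/50
```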

Overfitting

It is always good to pay attention to the potential for overfitting, but certain algorithms and certain implementations are more prone to this issue than others. For example, when using Deep Neural Nets and Gradient Boosting Machines, it's especially important to check for overfitting.
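
A simple check is to compare performance on the training data against a held-out set; the sketch below (using rpart purely for illustration) grows a deliberately unpruned tree so the gap is visible.

```r
# Overfitting check: training accuracy vs. held-out accuracy.
library(rpart)

set.seed(1)
df <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
df$y <- factor(ifelse(df$x1 + rnorm(500) > 0, "yes", "no"))

idx   <- sample(nrow(df), 350)
train <- df[idx, ]
test  <- df[-idx, ]

# cp = 0 and minsplit = 2 let the tree grow until it fits the noise
fit <- rpart(y ~ ., data = train,
             control = rpart.control(cp = 0, minsplit = 2))

mean(predict(fit, train, type = "class") == train$y)  # near-perfect
mean(predict(fit, test,  type = "class") == test$y)   # noticeably lower
```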

Software

For each algorithm, we will provide examples of open source R packages that implement it. All implementations are different, so we will provide information on how each of the implementations differs.

Scalability

We will address scalability issues inherent to the algorithm and discuss algorithmic or technological solutions to scalability concerns for "big data."

Resources

Where to learn more?
