Spark Python Notebooks

Join the chat at https://gitter.im/jadianes/spark-py-notebooks
This is a collection of IPython/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, using the Python language.
If Python is not your language and R is, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if you are interested in an introduction to some basic Data Science Engineering, you might find this series of tutorials interesting. There we explain different concepts and applications using Python and R.

Instructions

A good way of using these notebooks is to first clone the repo and then start your own IPython/Jupyter notebook in pySpark mode. For example, if we have a standalone Spark installation running on localhost, with a maximum of 6 GB per node assigned to IPython:
MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="6G" IPYTHON_OPTS="notebook --pylab inline" ~/spark-1.5.0-bin-hadoop2.6/bin/pyspark
Note that the path to the pyspark command depends on your specific installation. As a requirement, you need to have Spark installed on the same machine where you start the IPython notebook server.
For more Spark options see here. In general, an option described in the form spark.executor.memory is passed as SPARK_EXECUTOR_MEMORY when calling IPython/pySpark.
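The naming rule above can be sketched in a few lines of Python. This only illustrates the pattern described in the text (dots become underscores, letters become upper case). The helper name spark_property_to_env is hypothetical, and the sketch does not check whether Spark actually honours a given variable.

```python
def spark_property_to_env(name):
    """Turn a dotted property name into the upper-case, underscore form.

    Hypothetical helper: it applies the naming pattern only and does not
    verify that Spark reads the resulting environment variable.
    """
    return name.replace(".", "_").upper()


print(spark_property_to_env("spark.executor.memory"))  # SPARK_EXECUTOR_MEMORY
```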

Datasets

We will be using datasets from the KDD Cup 1999. The results of this competition can be found here.

References

The reference book for these and other Spark related topics is:
  • Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.

Notebooks

The following notebooks can be examined individually, although there is a more or less linear 'story' when followed in sequence. They all use the same dataset to solve a related set of tasks.

RDD creation

About creating RDDs by reading files and by using parallelize.

RDDs basics

A look at map, filter, and collect.

Sampling RDDs

RDD sampling methods explained.

RDD set operations

Brief introduction to some of the RDD pseudo-set operations.

Data aggregations on RDDs

RDD actions reduce, fold, and aggregate.
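As a rough idea of what these three actions do, here is a plain-Python model that needs no Spark installation. It is a sketch of the semantics, not the notebook's code: the data, the two-partition split, and the rdd_* function names are all assumptions made for illustration. Each action works on every partition first and then merges the partial results.

```python
from functools import reduce

# Hypothetical data, split into two "partitions" to mimic a distributed RDD.
partitions = [[1, 2, 3], [4, 5]]


def rdd_reduce(parts, op):
    # Reduce each non-empty partition, then reduce the partial results.
    partials = [reduce(op, p) for p in parts if p]
    return reduce(op, partials)


def rdd_fold(parts, zero, op):
    # Like reduce, but every partition starts from `zero`,
    # and so does the final merge of the partial results.
    partials = [reduce(op, p, zero) for p in parts]
    return reduce(op, partials, zero)


def rdd_aggregate(parts, zero, seq_op, comb_op):
    # seq_op folds one element into an accumulator inside a partition;
    # comb_op merges the per-partition accumulators.
    partials = [reduce(seq_op, p, zero) for p in parts]
    return reduce(comb_op, partials, zero)


total = rdd_reduce(partitions, lambda a, b: a + b)
folded = rdd_fold(partitions, 0, lambda a, b: a + b)

# aggregate can return a different type than the elements:
# here a (sum, count) pair, from which the mean follows.
sum_count = rdd_aggregate(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
mean = sum_count[0] / sum_count[1]

print(total, folded, sum_count, mean)
```

Because the zero value is applied once per partition and once more in the final merge, it should be neutral for the operation (0 for addition, for instance).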

Working with key/value pair RDDs

How to deal with key/value pairs in order to aggregate and explore data.
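The idea of aggregating by key can be modelled in plain Python, without Spark. This is a sketch under assumptions: the (protocol, duration) pairs are made-up values, and reduce_by_key is a hypothetical stand-in that merges all values sharing a key with a binary function.

```python
# Hypothetical (key, value) pairs: (protocol, duration in seconds).
pairs = [("tcp", 10), ("udp", 2), ("tcp", 30), ("icmp", 1), ("udp", 4)]


def reduce_by_key(kv_pairs, op):
    # Merge the values of each key with `op`, keeping one result per key.
    merged = {}
    for key, value in kv_pairs:
        merged[key] = op(merged[key], value) if key in merged else value
    return merged


# Total duration per protocol.
totals = reduce_by_key(pairs, lambda a, b: a + b)

# Number of interactions per protocol: map every pair to (key, 1) first.
counts = reduce_by_key([(key, 1) for key, _ in pairs], lambda a, b: a + b)

# Mean duration per protocol, combining the two results.
means = {key: totals[key] / counts[key] for key in totals}

print(totals, counts, means)
```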

MLlib: Basic Statistics and Exploratory Data Analysis

A notebook introducing Local Vector types, basic statistics in MLlib for Exploratory Data Analysis and model selection.

MLlib: Logistic Regression

Labeled points and Logistic Regression classification of network attacks in MLlib. Application of model selection techniques using correlation matrix and Hypothesis Testing.

MLlib: Decision Trees

Use of tree-based methods and how they help with explaining models and selecting features.

Spark SQL: structured processing for Data Analysis

In this notebook a schema is inferred for our network interactions dataset. Based on that, we use Spark's SQL DataFrame abstraction to perform a more structured exploratory data analysis.

Applications

Beyond the basics. Close to real-world applications using Spark and other technologies.

Olssen: On-line Spectral Search ENgine for proteomics

Same tech stack this time with an AngularJS client app.

An on-line movie recommendation web service

This tutorial can be used independently to build a movie recommender model based on the MovieLens dataset. Most of the code in the first part, about how to use ALS with the public MovieLens dataset, comes from my solution to one of the exercises proposed in the CS100.1x Introduction to Big Data with Apache Spark course by Anthony D. Joseph on edX, which has also been publicly available since 2014 at Spark Summit.
I've made minor modifications to use a larger dataset and added code showing how to store and reload the model for later use. On top of that we build a Flask web service so the recommender can be used to provide movie recommendations on-line.

KDD Cup 1999

My attempt at using Spark with this classic dataset and Knowledge Discovery competition.

Contributing

Contributions are welcome! For bug reports or requests please submit an issue.

Contact

Feel free to contact me to discuss any issues, questions, or comments.

License

This repository contains a variety of content: some developed by Jose A. Dianes, and some from third parties. The third-party content is distributed under the license provided by those parties.
The content developed by Jose A. Dianes is distributed under the following license:
Copyright 2016 Jose A Dianes

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
