Haskell Data Analysis Cookbook

Over 130 Practical Recipes For Data Analysis and Machine Learning





  • Harness Haskell in the real world
  • Master data analysis techniques
  • Discover useful Haskell libraries
  • Implement machine learning


Chapter 1  Obtaining Data

input <- readFile "input.txt"

Data is everywhere, logging is cheap, and analysis is inevitable. The recipes in this chapter cover how to gather useful data.

  • Read text from a file path
  • CSV
  • JSON
  • XML
  • HTML
  • HTTP GET requests
  • HTTP POST requests
  • MongoDB

Chapter 2  Cleaning Data

isWhitespace x = elem x " \t\r\n"

Learn how to validate and clean data carefully. Run sanity checks on data before analyzing it.

  • Trimming whitespace
  • Ignoring punctuation
  • Unexpected input
  • Regular expressions
  • Deduplication
  • Frequency table
  • Manhattan distance
  • Euclidean distance
  • Pearson correlation
  • Cosine similarity
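
As a rough illustration of two recipes in this list, the sketch below reuses the isWhitespace predicate from the teaser line to trim a string, and computes Manhattan and Euclidean distances over plain lists. The names trim, manhattan, and euclidean are chosen here for illustration and are not taken from the book; only the base library is assumed.

```haskell
import Data.List (dropWhileEnd)

-- Same predicate as the chapter teaser.
isWhitespace :: Char -> Bool
isWhitespace x = x `elem` " \t\r\n"

-- Strip leading and trailing whitespace; interior whitespace is kept.
-- (Illustrative name, not from the book.)
trim :: String -> String
trim = dropWhileEnd isWhitespace . dropWhile isWhitespace

-- Manhattan distance: sum of absolute coordinate differences.
-- Extra coordinates in the longer list are ignored by zipWith.
manhattan :: [Double] -> [Double] -> Double
manhattan xs ys = sum (zipWith (\x y -> abs (x - y)) xs ys)

-- Euclidean distance: square root of the sum of squared differences.
euclidean :: [Double] -> [Double] -> Double
euclidean xs ys = sqrt (sum (zipWith (\x y -> (x - y) ^ (2 :: Int)) xs ys))
```

In GHCi, trim "  hello world \n" evaluates to "hello world", and the points [0,0] and [3,4] are 7 apart by Manhattan distance and 5 apart by Euclidean distance.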

Chapter 3  Taming Strings

let strs = splitOn "," "bob,joe,nick"

Many interesting analysis techniques can be used on a large corpus of words to examine the structure of a sentence or the contents of a book.

  • Base conversion
  • Substring search (Boyer–Moore–Horspool, Rabin-Karp)
  • Split a string
  • Longest common subsequence
  • Phonetic code
  • Edit distance
  • Jaro–Winkler distance
  • Scraping text
  • Fixing spelling mistakes
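
To make the "Edit distance" entry concrete, here is a minimal sketch of the Levenshtein distance computed one row at a time. It is one common formulation, not necessarily the one the book uses, and the name editDistance is an assumption made for this example.

```haskell
import Data.List (foldl')

-- Levenshtein distance: the minimum number of single-character
-- insertions, deletions, and substitutions turning s into t.
-- (Illustrative name; base library only.)
editDistance :: String -> String -> Int
editDistance s t = last (foldl' nextRow [0 .. length t] s)
  where
    -- prev holds the distances from the current prefix of s to every
    -- prefix of t; nextRow extends the prefix of s by the character c.
    nextRow prev@(p : ps) c = scanl step (p + 1) (zip3 t prev ps)
      where
        step left (tc, diag, up) =
          minimum [left + 1, up + 1, diag + (if tc == c then 0 else 1)]
    nextRow [] _ = []  -- unreachable: every row has at least one entry
```

For example, editDistance "kitten" "sitting" is 3 (two substitutions and one insertion).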

Chapter 4  Hashing Data

let checksum = md5 file

To summarize an item into a small and typically fixed-length value, we apply a hashing function to it. This chapter covers the following recipes.

  • Hashing data
  • MD5 and cryptographic checksums
  • Using a hash table
  • Google's CityHash
  • Geohashing
  • Bloom filter
  • Perceptual hashing
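
The idea of reducing an item to a small fixed-length value can be sketched with the well-known djb2 string hash. This is a simple non-cryptographic hash used purely for illustration; it is not MD5 and not one of the library functions the chapter's recipes rely on. The helper bucketFor is a hypothetical name showing how such a value might pick a slot in a hash table.

```haskell
import Data.Char (ord)
import Data.List (foldl')
import Data.Word (Word32)

-- djb2: start at 5381, then for each character multiply by 33 and add
-- the character code. Word32 arithmetic wraps, so the result always
-- fits in 32 bits regardless of input length.
djb2 :: String -> Word32
djb2 = foldl' step 5381
  where
    step h c = h * 33 + fromIntegral (ord c)

-- Hypothetical helper: map a key to one of n buckets (n must be > 0).
bucketFor :: Int -> String -> Int
bucketFor n key = fromIntegral (djb2 key `mod` fromIntegral n)
```

Inputs of any length map to a 32-bit value, and reordering characters generally changes the result: djb2 "abc" and djb2 "acb" differ.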

Chapter 5  Using Trees

data Tree a = Node a (Tree a) (Tree a) | Null

Everything from creating simple binary trees to practical applications such as Huffman trees is covered in this chapter.

  • Binary tree
  • Rose tree
  • Depth-first traversal
  • Breadth-first traversal
  • Height of a tree
  • Binary search tree
  • AVL tree
  • Min-heap
  • Huffman tree encoding and decoding
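
Several of these recipes can be sketched on a single parameterised binary tree type. The functions below (height, a pre-order depth-first traversal, and unbalanced binary search tree insertion) are a minimal illustration; the function names are assumptions made for this example rather than the book's own definitions.

```haskell
-- A binary tree holding values of type a; Null marks an empty subtree.
data Tree a = Node a (Tree a) (Tree a) | Null

-- Height: number of nodes on the longest root-to-leaf path.
height :: Tree a -> Int
height Null = 0
height (Node _ l r) = 1 + max (height l) (height r)

-- Depth-first traversal in pre-order: node, then left, then right.
preorder :: Tree a -> [a]
preorder Null = []
preorder (Node v l r) = v : preorder l ++ preorder r

-- In-order traversal; on a binary search tree this yields sorted output.
inorder :: Tree a -> [a]
inorder Null = []
inorder (Node v l r) = inorder l ++ [v] ++ inorder r

-- Binary search tree insertion without rebalancing; duplicates are dropped.
insert :: Ord a => a -> Tree a -> Tree a
insert x Null = Node x Null Null
insert x t@(Node v l r)
  | x < v = Node v (insert x l) r
  | x > v = Node v l (insert x r)
  | otherwise = t
```

Building a tree with foldr insert Null [3, 1, 4, 1, 5, 9, 2, 6] and calling inorder on it returns [1,2,3,4,5,6,9]. Because nothing rebalances the tree, its height depends on insertion order, which is the problem an AVL tree addresses.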

Chapter 6  Using Graphs

type Graph = Table [Vertex]

A graph can represent network data such as social networks, biological gene relationships, and road topologies. Graphs are very common in data analysis, and this chapter covers some essential algorithms.

  • List of edges
  • Adjacency list
  • Topological sort
  • Depth-first traversal
  • Breadth-first traversal
  • Visualizing a graph
  • Directed acyclic word graphs
  • Hexagonal and square grids
  • Maximal cliques
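
The Graph type in the teaser matches the adjacency-list representation in Data.Graph from the containers package, which ships with GHC. Assuming that module, the sketch below builds a small directed acyclic graph from a list of edges and leaves topological sorting and reachability to the library. The graph itself and the names example and reachableFrom are made up for illustration.

```haskell
import Data.Graph (Graph, Vertex, buildG, reachable, topSort)
import Data.List (sort)

-- A small DAG over vertices 0..3:  0 -> 1 -> 3  and  0 -> 2 -> 3.
example :: Graph
example = buildG (0, 3) [(0, 1), (0, 2), (1, 3), (2, 3)]

-- Vertices reachable from v (including v itself), in ascending order.
reachableFrom :: Graph -> Vertex -> [Vertex]
reachableFrom g v = sort (reachable g v)
```

topSort example returns an ordering in which every edge points forward, so vertex 0 comes first and vertex 3 last; the relative order of 1 and 2 is left to the library.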

Chapter 7  Statistics

let (b, m) = linearRegression xs ys

This chapter contains recipes that answer questions about data deviation from the norm, existence of linear and quadratic trends, and probabilistic values of a network.

  • Moving average and median
  • Linear and quadratic regression
  • Covariance matrix
  • Pearson correlation coefficient
  • Bayesian network
  • Playing cards
  • Markov chain
  • N-grams
  • Neural network perceptron
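
As a sketch of two of these recipes using only the base library, the code below fits a straight line by ordinary least squares and computes a simple moving average. It is a hand-rolled illustration, not the library call in the teaser line; the names leastSquares and movingAverage are assumptions, and the (intercept, slope) result order is chosen here to mirror the (b, m) pattern.

```haskell
import Data.List (tails)

-- Ordinary least squares fit of y = b + m * x.
-- Returns (b, m) = (intercept, slope). Assumes at least two points
-- with distinct x values; otherwise the division yields NaN.
leastSquares :: [Double] -> [Double] -> (Double, Double)
leastSquares xs ys = (b, m)
  where
    n = fromIntegral (length xs)
    meanX = sum xs / n
    meanY = sum ys / n
    sxy = sum (zipWith (\x y -> (x - meanX) * (y - meanY)) xs ys)
    sxx = sum (map (\x -> (x - meanX) ^ (2 :: Int)) xs)
    m = sxy / sxx
    b = meanY - m * meanX

-- Simple moving average over every full window of k consecutive values.
movingAverage :: Int -> [Double] -> [Double]
movingAverage k xs
  | k <= 0 = []
  | otherwise = [sum w / fromIntegral k | w <- windows, length w == k]
  where
    windows = map (take k) (tails xs)
```

For points lying exactly on y = 2x + 1, such as xs = [1,2,3,4] and ys = [3,5,7,9], leastSquares recovers an intercept of 1 and a slope of 2, and movingAverage 2 [1,2,3,4] gives [1.5,2.5,3.5].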

Chapter 8  Clustering Data

let clusters = kmeans points

Computer algorithms are becoming better and better at analyzing large data sets. As machines get faster, so does their ability to detect interesting patterns in data.

  • K-means clustering
  • Hierarchical clustering
  • Number of clusters
  • Parts of speech
  • Training a parts of speech tagger
  • Word lexemes clustering
  • Visualizing
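
K-means clustering can be sketched in a few lines as a loop that assigns each point to its nearest centroid and then moves each centroid to the mean of its points. The version below is a deliberately small illustration with caller-supplied starting centroids and a fixed iteration count; it is not the kmeans function in the teaser line, and every name in it is an assumption made for this example.

```haskell
import Data.List (minimumBy, transpose)
import Data.Ord (comparing)

type Point = [Double]

-- Squared Euclidean distance; enough for comparing which centroid is closer.
dist2 :: Point -> Point -> Double
dist2 p q = sum (zipWith (\a b -> (a - b) ^ (2 :: Int)) p q)

-- The centroid closest to a point (cs must be non-empty).
nearest :: [Point] -> Point -> Point
nearest cs p = minimumBy (comparing (dist2 p)) cs

-- Coordinate-wise mean of a non-empty list of points.
mean :: [Point] -> Point
mean ps = map (/ fromIntegral (length ps)) (map sum (transpose ps))

-- One iteration: group points by nearest centroid, then recompute each
-- centroid. A centroid that attracts no points is left where it is.
step :: [Point] -> [Point] -> [Point]
step ps cs =
  [ if null members then c else mean members
  | c <- cs
  , let members = filter ((== c) . nearest cs) ps
  ]

-- Run a fixed number of iterations from the given starting centroids.
-- Starting centroids are assumed to be distinct.
kmeansFrom :: Int -> [Point] -> [Point] -> [Point]
kmeansFrom iters cs0 ps = iterate (step ps) cs0 !! iters
```

With the one-dimensional points [[0],[1],[10],[11]] and starting centroids [[0],[10]], kmeansFrom 5 settles on [[0.5],[10.5]]. Choosing the starting centroids and the number of clusters is a separate problem, as the "Number of clusters" recipe suggests.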

Chapter 9  Performance

a <- rpar task1

This chapter covers parallel and concurrent design. Analyzing massive data sets is a very real problem, and the recipes here aim to make it tractable.

  • Benchmarking runtime
  • Evaluating in parallel
  • Controlling algorithms in sequence
  • Forking IO
  • Parallelizing pure functions
  • Mapping in parallel
  • Accessing tuple elements in parallel
  • MapReduce

Chapter 10  Real-Time

h <- connectTo "localhost" myPort

The gratifying nature of analyzing data the moment it is received is the core subject of this chapter. The following real-time data topics will be covered.

  • Streaming Twitter data
  • IRC bot
  • Polling a webserver
  • Responding to system events
  • Sockets

Chapter 11  Visualizing

plot X11 Data2D [Color Red] [] pts

Visualizing data is important in all steps of data analysis. It is always useful to have an intuitive understanding of the data, so this chapter covers many ways to graph it.

  • Plotting a line graph
  • Plotting a pie-chart
  • Plotting a bar graph
  • Displaying a scatter plot
  • Visualizing a graphical network
  • Using D3.js

Chapter 12  Exporting

save = insertMany "item" mongoList

The last important step in data analysis is to export and present the data in a usable format. The recipes in this chapter cover how to save and present data.

  • Exporting to CSV
  • JSON
  • SQLite
  • MongoDB
  • HTML
  • LaTeX
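
For the "Exporting to CSV" entry, a minimal sketch is to render rows of strings as CSV text and write the result with writeFile. The quoting below follows the usual convention of wrapping a field in double quotes when it contains a comma, quote, or line break, and doubling any embedded quotes. The names escapeField and toCSV are assumptions made for this example, and a dedicated CSV library would be the safer choice for real data.

```haskell
import Data.List (intercalate)

-- Quote a field only when it contains a comma, double quote, or line
-- break; embedded double quotes are doubled.
escapeField :: String -> String
escapeField f
  | any (`elem` ",\"\r\n") f = "\"" ++ concatMap dbl f ++ "\""
  | otherwise = f
  where
    dbl '"' = "\"\""
    dbl c = [c]

-- Render rows as CSV text, one line per row, each ending in a newline.
toCSV :: [[String]] -> String
toCSV = unlines . map (intercalate "," . map escapeField)
```

For example, writeFile "output.csv" (toCSV [["name", "age"], ["Bob, Jr.", "42"]]) writes a header line followed by a row whose first field is quoted; the file name here is hypothetical.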