Over 130 Practical Recipes For Data Analysis and Machine Learning

Harness Haskell in the real world
Master data analysis techniques
Discover useful Haskell libraries
Implement machine learning

Chapter 1 Obtaining Data

input <- readFile "input.txt"

Data is everywhere, logging is cheap, and analysis is inevitable. The recipes in this chapter cover how to gather useful data.

Read text from a file path
CSV
JSON
XML
HTML
HTTP GET requests
HTTP POST requests
MongoDB

Chapter 2 Cleaning Data

isWhitespace x = elem x " \t\r\n"

Learn how to validate and clean data carefully. Run sanity checks on data before analyzing it.

Trimming whitespace
Ignoring punctuation
Unexpected input
Regular expressions
Deduplication
Frequency table
Manhattan distance
Euclidean distance
Pearson Correlation
Cosine similarity
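
As a rough illustration of the whitespace-trimming recipe, the isWhitespace predicate shown for this chapter can be combined with standard list functions. The trim helper below is a hypothetical name used only for this sketch, not necessarily how the book's recipe is written.

```haskell
import Data.List (dropWhileEnd)

-- Predicate from the chapter snippet, with a type signature added.
isWhitespace :: Char -> Bool
isWhitespace x = x `elem` " \t\r\n"

-- Hypothetical helper: strip whitespace from both ends of a string.
-- Interior whitespace is left untouched.
trim :: String -> String
trim = dropWhileEnd isWhitespace . dropWhile isWhitespace
```

Loaded in GHCi, trim "  hello world \n" evaluates to "hello world".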

Chapter 3 Taming Strings

let strs = splitOn "," "bob,joe,nick"

Many interesting analysis techniques can be used on a large corpus of words to examine the structure of a sentence or the contents of a book.

Base conversion
Substring search (Boyer–Moore–Horspool, Rabin-Karp)
Split a string
Longest common subsequence
Phonetic code
Edit distance
Jaro–Winkler distance
Scraping text
Fixing spelling mistakes
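
The splitOn call in the chapter snippet comes from a library outside base (presumably Data.List.Split in the split package). A minimal base-only sketch of the same idea, restricted to a single delimiter character, might look like the following; splitOnChar is a hypothetical name introduced for illustration.

```haskell
-- Hypothetical helper: split a string on one delimiter character.
-- Empty fields are preserved, so "a,,b" yields three pieces.
splitOnChar :: Char -> String -> [String]
splitOnChar delim s =
  case break (== delim) s of
    (chunk, [])       -> [chunk]                          -- no delimiter left
    (chunk, _ : rest) -> chunk : splitOnChar delim rest   -- drop delimiter, recurse
```

For the input from the snippet, splitOnChar ',' "bob,joe,nick" evaluates to ["bob","joe","nick"].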

Chapter 4 Hashing Data

let checksum = md5 file

To summarize an item into a small and typically fixed-length value, we apply a hashing function to it. This chapter covers the following recipes.

Hashing data
MD5 and cryptographic checksums
Using a hash table
Google's CityHash
Geohashing
Bloom filter
Perceptual hashing
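
To make the idea of a fixed-length summary concrete, here is a small sketch of a non-cryptographic hash, 32-bit FNV-1a, written with base only. This algorithm is a stand-in chosen for brevity; it is not one of the recipes listed above, and the function name fnv1a is an assumption of this example. The sketch mixes one Char per step, so it agrees with byte-oriented FNV-1a only for ASCII input.

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.List (foldl')
import Data.Word (Word32)

-- 32-bit FNV-1a: start from the offset basis, then for each character
-- XOR it into the state and multiply by the FNV prime.
-- Word32 arithmetic wraps around, which gives the fixed 32-bit width.
fnv1a :: String -> Word32
fnv1a = foldl' step 2166136261
  where
    step h c = (h `xor` fromIntegral (ord c)) * 16777619
```

Whatever the length of the input string, the result is a single 32-bit word; the empty string maps to the offset basis 2166136261.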

Chapter 5 Using Trees

data Tree a = Node a (Tree a) (Tree a) | Null

Everything from creating simple binary trees to practical applications such as Huffman trees is covered in this chapter.

Binary tree
Rose tree
Depth-first traversal
Breadth-first traversal
Height of a tree
Binary search tree
AVL tree
Min-heap
Huffman tree encoding and decoding
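
The sketch below shows one way a binary search tree and its height could be expressed. It uses a Tree type parameterized over the element type; the functions insert, height, and toList are illustrative names for this example and are not taken from the book.

```haskell
-- A binary tree: either an empty leaf (Null) or a node holding a value
-- with left and right subtrees.
data Tree a = Node a (Tree a) (Tree a) | Null

-- Insert a value while keeping the search-tree ordering; duplicates are ignored.
insert :: Ord a => a -> Tree a -> Tree a
insert x Null = Node x Null Null
insert x t@(Node v l r)
  | x < v     = Node v (insert x l) r
  | x > v     = Node v l (insert x r)
  | otherwise = t

-- Number of nodes on the longest path from the root down to a leaf.
height :: Tree a -> Int
height Null         = 0
height (Node _ l r) = 1 + max (height l) (height r)

-- In-order traversal: for a search tree this yields the values sorted.
toList :: Tree a -> [a]
toList Null         = []
toList (Node v l r) = toList l ++ [v] ++ toList r
```

For example, toList (foldr insert Null [3, 1, 2]) evaluates to [1,2,3].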

Chapter 6 Using Graphs

type Graph = Table [Vertex]

A graph can represent network data such as social networks, biological gene relationships, and road topologies. Graphs are very common in data analysis, and this chapter covers some essential algorithms.

List of edges
Adjacency list
Topological sort
Depth-first traversal
Breadth-first traversal
Visualizing a graph
Directed acyclic word graphs
Hexagonal and square grids
Maximal cliques
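
The Graph type in the chapter snippet matches the adjacency-table representation in Data.Graph from the containers package. Assuming that library, a topological sort over a small dependency graph can be sketched as follows; the vertex numbering and the names deps and order are made up for this example.

```haskell
import Data.Graph (Graph, buildG, topSort)

-- Four vertices (0..3) and directed edges: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
deps :: Graph
deps = buildG (0, 3) [(0, 1), (0, 2), (1, 3), (2, 3)]

-- A topological order: every vertex appears before the vertices it points to.
order :: [Int]
order = topSort deps
```

In any valid result, 0 comes first and 3 comes last; the relative order of 1 and 2 is up to the library.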

Chapter 7 Statistics

let (b, m) = linearRegression xs ys

This chapter contains recipes that answer questions about data deviation from the norm, existence of linear and quadratic trends, and probabilistic values of a network.

Moving average and median
Linear and quadratic regression
Covariance matrix
Pearson correlation coefficient
Bayesian network
Playing cards
Markov chain
N-grams
Neural network perceptron
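
The linearRegression call in the chapter snippet comes from an external library. As a rough base-only sketch of what such a function computes, the ordinary least-squares fit of a line y = b + m * x can be written as below. The name fitLine is hypothetical, and returning the pair as (intercept, slope) is an assumption made to mirror the (b, m) in the snippet. The inputs are assumed to be equal-length lists with at least two distinct x values.

```haskell
-- Ordinary least squares for a straight line y = b + m * x.
-- Returns (b, m): the intercept and the slope.
fitLine :: [Double] -> [Double] -> (Double, Double)
fitLine xs ys = (intercept, slope)
  where
    n         = fromIntegral (length xs)
    meanX     = sum xs / n
    meanY     = sum ys / n
    -- Sum of products of deviations from the means (n times the covariance).
    sxy       = sum (zipWith (\x y -> (x - meanX) * (y - meanY)) xs ys)
    -- Sum of squared deviations of x (n times the variance of x).
    sxx       = sum (map (\x -> (x - meanX) ^ (2 :: Int)) xs)
    slope     = sxy / sxx
    intercept = meanY - slope * meanX
```

For points lying exactly on y = 1 + 2x, such as xs = [0, 1, 2, 3] and ys = [1, 3, 5, 7], the fit recovers an intercept of 1 and a slope of 2.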

Chapter 8 Clustering Data

let clusters = kmeans points

Computer algorithms are becoming better and better at analyzing large data sets. As machines get faster, so does their ability to detect interesting patterns in data.

K-means clustering
Hierarchical clustering
Number of clusters
Parts of speech
Training a parts of speech tagger
Word lexemes clustering
Visualizing

Chapter 9 Performance

a <- rpar task1

This chapter covers parallel and concurrent design. Analyzing massive data sets is a very real problem, which this chapter tries to address.

Benchmarking runtime
Evaluating in parallel
Controlling algorithms in sequence
Forking IO
Parallelizing pure functions
Mapping in parallel
Accessing tuple elements in parallel
MapReduce

Chapter 10 Real-Time

h <- connectTo "localhost" myPort

The gratifying nature of analyzing data the moment it is received is the core subject of this chapter. The following real-time data topics will be covered.

Streaming Twitter data
IRC bot
Polling a webserver
Responding to system events
Sockets

Chapter 11 Visualizing

plot X11 Data2D [Color Red] [] pts

Visualizing data is important in all steps of data analysis. It is always useful to have an intuitive understanding of the data, so this chapter covers many ways to graph it.

Plotting a line graph
Plotting a pie-chart
Plotting a bar graph
Displaying a scatter plot
Visualizing a graphical network
Using D3.js

Chapter 12 Exporting

save = insertMany "item" mongoList

The last important step in data analysis is to export and present the data in a usable format. The recipes in this chapter cover how to save and present data.

Exporting to CSV
JSON
SQLite
MongoDB
HTML
LaTeX