It turns out that’s pretty simple. The process can be broken down into three steps:

- defining a feature vector;
- training a classifier;
- testing the classifier on unknown images.

A feature vector is a vector that summarizes the features of the objects we want to classify.

Obviously, photos taken by night have more dark pixels than photos taken by day. So we can use the number of dark/midrange/light pixels as a feature vector to classify photos.

However, simply counting the number of pixels of each color would make the feature vector depend on the size of each photo. So it makes more sense to store in the feature vector the ratio of pixels of each color to the total number of pixels.

An additional problem is that, so far, the feature vector would be larger than necessary: the RGB color space has \(\left(2^8\right)^3 = 16{,}777{,}216\) elements, so a city with a dark sky would be substantially different from a city whose sky has a slightly different shade of dark blue.

We can reduce the feature vector size by mapping each of the 256 possible values of each color channel to a smaller set of values, for example 4.

The final problem is that most machine learning libraries assume the feature vector to be a 1-dimensional vector, while the RGB color space is 3-dimensional. For this reason we can simply map the \(4^3 = 64\) cells of the RGB color space to a 1-dimensional vector with 64 slots.

Let’s use Pillow to read images and write some Python code that, given an image file path or an image URL, calculates its feature vector:
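A minimal sketch of such a function might look like this (the function name and constants are mine, and for brevity this sketch handles local file paths only, not URLs):

```python
from PIL import Image

BUCKETS = 4  # levels per color channel, as discussed above


def feature_vector(path):
    """Map an image to a 64-slot vector of color-bucket frequencies."""
    image = Image.open(path).convert("RGB")
    pixels = list(image.getdata())
    vector = [0] * (BUCKETS ** 3)
    for r, g, b in pixels:
        # Map each 0-255 channel value to one of 4 buckets (0-3), then
        # flatten the 4x4x4 RGB cube into a single 1-dimensional index.
        index = (r // 64) * BUCKETS * BUCKETS + (g // 64) * BUCKETS + (b // 64)
        vector[index] += 1
    # Normalize by the pixel count so the vector is size-independent.
    return [count / len(pixels) for count in vector]
```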

Just to have an idea of what we get, this is the feature vector plot of a city by day:

And this is the feature vector of a city by night:

Exactly as expected: night photos have plenty of dark pixels.

Another good quality of this approach is that feature values are already normalized, which makes most classifiers work better.

I chose scikit-learn as the machine learning library, but you are free to choose the one that excites you the most.

First, what is a classifier? It is a “thing” that, given a feature vector, returns its class. In our example, it should return “1” given the feature vector of a picture of a city by day, and “0” for a city by night. The procedure that teaches the classifier which feature vectors belong to which classes is called *training*.

So go collect pictures of cities by day and by night, I’ll wait. Once you have them, put them in two separate folders. This will be our *training set*, which we’ll use to train a classifier.

Assume that we have no clue which machine learning algorithm we should use. scikit-learn provides a useful cheat sheet to guide us: http://scikit-learn.org/stable/tutorial/machine_learning_map/. In our case, it suggests using C-Support Vector Classification (also called Support Vector Machine, SVM).

SVMs are all-around awesome and are one of those algorithms that are pretty much always worth trying because they work well in a wide range of settings. Go read about them.

In very simple terms, an SVM puts all the objects in the training set in an n-dimensional space (n can even be infinite!) and then looks for the hyperplane that best separates objects of type A from objects of type B.

So far, this would work only if the objects are linearly separable. However, with *one weird trick* (mathematicians hate it) SVMs work even on non-linearly separable classes (it’s actually called the *kernel trick*).

The SVM implementation of scikit-learn is available in the sklearn.svm module. As the cheat sheet suggests, we are going to use SVC.

SVC has plenty of parameters; the most important are C (the error penalty), kernel (the type of kernel to use), and gamma (the kernel coefficient).

Even with a good knowledge of SVMs, it is not straightforward to choose these parameters. The simplest way out of this dilemma is to try all possible combinations of these parameters and pick the classifier that works best. scikit-learn automates this with the GridSearchCV class.

Putting the pieces of the puzzle together, we have to:

- gather training data using the code shown in the previous section;
- define the parameter search space to find a good classifier;
- return the classifier.

Here’s the code that does it:
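A sketch of that training step, assuming the feature vectors and labels have already been computed from the two folders (the parameter grid values below are illustrative choices, not the only reasonable ones):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def train_classifier(features, labels):
    """Search the SVC parameter space and return the best classifier.

    `features` is a list of feature vectors, `labels` the class of each
    (e.g. 1 for day, 0 for night).
    """
    # Illustrative search space over C, kernel and gamma.
    param_grid = [
        {"kernel": ["linear"], "C": [1, 10, 100]},
        {"kernel": ["rbf"], "C": [1, 10, 100], "gamma": [0.01, 0.1, 1]},
    ]
    # Try every combination with 3-fold cross-validation and keep the
    # classifier that scored best.
    search = GridSearchCV(SVC(), param_grid, cv=3)
    search.fit(features, labels)
    return search.best_estimator_
```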

There are several other techniques to properly train a classifier, such as cross-validation. Read about them on the official scikit-learn documentation.

We got the training data, we got the classifier, we only need to test it:
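As a sketch of this last step, using toy two-slot feature vectors instead of the real 64-slot ones (all the names and values here are illustrative):

```python
from sklearn.svm import SVC

# Toy stand-ins for the real feature vectors: in practice these would
# come from feature extraction applied to the training photos.
day_vectors = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]    # few dark pixels
night_vectors = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]]  # mostly dark pixels

classifier = SVC(kernel="linear", C=1)
classifier.fit(day_vectors + night_vectors, [1, 1, 1, 0, 0, 0])

# Classify an unseen vector: 1 means "day", 0 means "night".
prediction = classifier.predict([[0.25, 0.75]])[0]
print(prediction)
```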

And this is an example of how it works:

**Yay!**

The classifier actually returned 1 for a photo taken by day and 0 for a picture taken by night!

Here is the full source code:

This binary classifier works quite well if you feed it enough training data. Although as an example I chose daytime vs. nighttime photos, it works for all images that have reasonably different color distributions, e.g. photos of tigers vs. elephants, landscapes vs. portraits, sea vs. meadow, and so on.

Moreover it is quite easy to modify it so that it works with *multiple classes*. Of course, this is left as an exercise to the reader.

A **confidence interval** gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data.

This is the definition of confidence interval given in the Statistics Glossary v1.1 by Valerie J. Easton & John H. McColl. The “unknown population parameter” is usually the population mean, so in the following I will just assume that the “unknown population parameter” is indeed the mean.

Thus we are dealing with a sample of a population and we want to measure how close we get to the population mean using only data about a sample.

If independent samples are taken from the same population and a confidence interval is evaluated for each sample, then a certain percentage (called the confidence level) of the intervals will include the population mean. The confidence level is usually 95%, but we can choose 99%, 90% or any other percentage we fancy.

I’m always a bit let down when I read a paper and authors do not report the confidence interval of their experimental results. It means that whatever measure they are reporting you have to guess whether it is significant or not.

I think it’s good to make a habit of including the confidence interval for any measurement you are reporting.

In most practical settings we don’t actually know the population distribution, and we just assume that it is normally distributed. For samples from other population distributions, what I am going to describe is approximately correct by the Central Limit Theorem.

For a population with unknown mean \(\mu\) and unknown standard deviation \(\sigma\), a confidence interval for the population mean, based on a random sample of size \(n\), is \(\overline{x}\pm t^*\frac{s}{\sqrt{n}}\), where:

- \(\overline{x}\) is the sample mean;
- \(n\) is the sample size;
- \(s\) is the sample standard deviation (\(\frac{s}{\sqrt{n}}\) is known as the standard error);
- \(t^*\) is the upper \(\frac{1-C}{2}\) critical value for the Student’s t-distribution with \(n-1\) degrees of freedom.

The most difficult element to evaluate is \(t^*\).

Assume that we are given the height in cm of 30 one year old toddlers: 63.5, 81.3, 88.9, 63.5, 76.2, 67.3, 66.0, 64.8, 74.9, 81.3, 76.2, 72.4, 76.2, 81.3, 71.1, 80.0, 73.7, 74.9, 76.2, 86.4, 73.7, 81.3, 68.6, 71.1, 83.8, 71.1, 68.6, 81.3, 73.7, 74.9.

The average height is 74.8 cm. What is the 95% confidence interval of this mean?

The Apache Commons Math 3 library can give critical values for the Student’s t-distribution, so download it or add it through your dependency manager. Here is the code that calculates the 95% confidence interval:
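A possible version of that code, using the TDistribution class from commons-math3 (the class and variable names are mine; the original code may differ in details):

```java
import org.apache.commons.math3.distribution.TDistribution;

public class ConfidenceInterval {
    public static void main(String[] args) {
        double[] heights = {63.5, 81.3, 88.9, 63.5, 76.2, 67.3, 66.0, 64.8,
                            74.9, 81.3, 76.2, 72.4, 76.2, 81.3, 71.1, 80.0,
                            73.7, 74.9, 76.2, 86.4, 73.7, 81.3, 68.6, 71.1,
                            83.8, 71.1, 68.6, 81.3, 73.7, 74.9};
        int n = heights.length;

        // Sample mean.
        double sum = 0;
        for (double h : heights) sum += h;
        double mean = sum / n;

        // Sample standard deviation (n - 1 at the denominator).
        double squares = 0;
        for (double h : heights) squares += (h - mean) * (h - mean);
        double s = Math.sqrt(squares / (n - 1));

        // Upper critical value of the Student's t-distribution with
        // n - 1 degrees of freedom, for a 95% confidence level.
        TDistribution t = new TDistribution(n - 1);
        double tStar = t.inverseCumulativeProbability(1 - 0.05 / 2);

        double margin = tStar * s / Math.sqrt(n);
        System.out.printf("%.2f +/- %.2f%n", mean, margin);
    }
}
```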

The output of this program is:

The code to do the same calculation in Python is very similar. We will use numpy and scipy:
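A possible Python version, using scipy.stats to get the critical value (variable names are mine):

```python
import numpy as np
from scipy import stats

heights = np.array([63.5, 81.3, 88.9, 63.5, 76.2, 67.3, 66.0, 64.8, 74.9,
                    81.3, 76.2, 72.4, 76.2, 81.3, 71.1, 80.0, 73.7, 74.9,
                    76.2, 86.4, 73.7, 81.3, 68.6, 71.1, 83.8, 71.1, 68.6,
                    81.3, 73.7, 74.9])
n = len(heights)
mean = heights.mean()
s = heights.std(ddof=1)  # sample standard deviation (n - 1 denominator)

# Upper (1 - C) / 2 critical value of Student's t with n - 1 d.o.f.
t_star = stats.t.ppf(1 - 0.05 / 2, n - 1)

margin = t_star * s / np.sqrt(n)
print("%.2f +/- %.2f" % (mean, margin))
```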

The output is exactly the same as that of the Java version.

Tries are a type of associative array, like hash tables. However, compared to hash tables, tries have several advantages:

- looking up a word takes O(word length) in the worst case for tries, whereas an imperfect hash table can take up to O(number of words);
- tries have no collisions;
- it is possible to walk through all keys in order;
- it is not necessary to design a hash function.

Tries do have disadvantages:

- the naïve implementation of tries uses pointers between nodes, which reduces their cache efficiency;
- if tries are naïvely stored on a storage device they perform much worse than hash tables;
- a trie may require more memory than a hash table.

So, given these pros and cons, here is a simple trie implementation:
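A minimal sketch of such an implementation in Python (names and details are mine; the original code may differ):

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child TrieNode
        self.is_word = False  # True if a word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Add a word to the trie, one character per level."""
        node = self.root
        for char in word:
            node = node.children.setdefault(char, TrieNode())
        node.is_word = True

    def contains(self, word):
        """Return True if the exact word is in the trie."""
        node = self.root
        for char in word:
            if char not in node.children:
                return False
            node = node.children[char]
        return node.is_word


# Load the dictionary, one word per line (assumes wlist.txt exists):
# trie = Trie()
# with open("wlist.txt") as f:
#     for line in f:
#         trie.insert(line.strip())
```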

This code assumes that the file `wlist.txt` is accessible and that it contains a dictionary of words, one per line.

For example, this is how the trie that stores all the English words that start with “archi” looks like (click for a large version):

As a side note, it is possible to “collapse” long chains of prefixes. For the same set of words we would get something like this:

Let’s now use a trie to solve a common ~~interview question~~ problem.

T9 is a predictive text technology for mobile phones with a 3×4 keypad. Its core idea is simple: users press each key only once for each letter of the word they want to type. For example, to write “arching” a user would tap “2724464”. Mapping digit sequences to words is exactly the kind of prefix lookup a trie supports! Compared with the previous implementation, we need to change two things:

- nodes are numbers rather than letters;
- a node is associated with a list of values, because the same sequence of numbers can generate different words.

Here’s the code:
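A sketch of those two changes (the translation table and class names are my own):

```python
# Map each letter to its digit on a 3x4 phone keypad
# (abc -> 2, def -> 3, ..., wxyz -> 9).
T9_TABLE = str.maketrans("abcdefghijklmnopqrstuvwxyz",
                         "22233344455566677778889999")


class T9Node:
    def __init__(self):
        self.children = {}  # maps a digit to the child node
        self.values = []    # words ending at this node: several words
                            # can share the same digit sequence


class T9Trie:
    def __init__(self):
        self.root = T9Node()

    def insert(self, word):
        """Index a word under its keypad digit sequence."""
        node = self.root
        for digit in word.translate(T9_TABLE):
            node = node.children.setdefault(digit, T9Node())
        node.values.append(word)
```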

The most prominent changes are that we are using `str.translate` to map a–z characters to phone digits and that each node has a list of associated values (`node.values`).

Here is how this “trie” looks:

The algorithm to implement T9 is now straightforward: given a sequence of numbers, iterate over them and use each one to select the next trie node. When we run out of numbers, we have found the longest valid prefix in the trie. We then start a depth-first search from the last node we explored, and collect words from each node we visit.

Here is the code that does it:
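A self-contained sketch of the lookup (it repeats the trie definitions from above so it runs on its own; one design choice here, which may differ from the original code, is to return nothing when a digit has no matching child):

```python
T9_TABLE = str.maketrans("abcdefghijklmnopqrstuvwxyz",
                         "22233344455566677778889999")


class T9Node:
    def __init__(self):
        self.children = {}  # maps a digit to the child node
        self.values = []    # words ending at this node


def insert(root, word):
    """Index a word under its keypad digit sequence."""
    node = root
    for digit in word.translate(T9_TABLE):
        node = node.children.setdefault(digit, T9Node())
    node.values.append(word)


def t9_lookup(root, digits):
    """Return all words whose digit sequence starts with `digits`."""
    # Follow the digits down the trie to the longest valid prefix.
    node = root
    for digit in digits:
        if digit not in node.children:
            return []  # no stored word matches this sequence
        node = node.children[digit]
    # Depth-first search from the last explored node, collecting the
    # words stored in every node we visit.
    words, stack = [], [node]
    while stack:
        node = stack.pop()
        words.extend(node.values)
        stack.extend(node.children.values())
    return words
```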

This code assumes that there is a `wlist.txt` file available. Several English wordlists can be downloaded here: http://www.keithv.com/software/wlist/.

A sample output of this simple application is:

Tries are ubiquitous in all problems where prefix matching is useful. A few examples are:

- Google autocomplete: tries are augmented with words popularity;
- spell checkers: each reasonable misspelling of a word (due to insertion, deletion, or substitution of one or more characters) is linked to the correct spelling (see http://norvig.com/spell-correct.html for an overview of how simple spell correction works);
- firewalls often store IP ranges associated with a policy (e.g., drop packet, forward packet, accept packet, etc.): IP ranges can be efficiently stored and searched using a trie (for example, see this paper);
- tries are also used in bioinformatics for overlap detection in fragment assembly (ref.);
- burstsort, one of the fastest algorithms for sorting large string data sets, is based on tries (ref.).

Whenever you are dealing with a problem where prefixes are important, tries might be the right tool.

Visualizing the “depends on” relationship lets us understand at a glance which packages are at the core and why each package was installed; so let’s explore the dependencies of the packages installed on a Debian-based Linux system.

The dependencies of a package can be listed using `apt-cache depends <packagename>`. For example, these are the dependencies of vim:

So what we want to do is:

- list all installed packages;
- list dependencies for each installed package;
- put dependencies in some format that allows easy visualization.

Listing installed packages is easily done with `dpkg`:
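Reconstructed from the explanation that follows, the command is:

```shell
# List the names of all installed packages, one per line.
dpkg --get-selections | grep -v 'deinstall' | cut -f1 | cut -d: -f1
```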

Let’s explain this command:

- `dpkg --get-selections`: list all packages;
- `grep -v 'deinstall'`: remove all lines that contain the word “deinstall”, so that only packages whose selection state is “install” are kept;
- `cut -f1 | cut -d: -f1`: keep only the package name, dropping the selection state and the architecture suffix, if any.

Now we want to get the list of dependencies for each of these packages.

`apt-cache` makes it easy to list the dependencies of a package, given its name. When parsing its output, we should keep in mind that alternative dependencies start with a pipe (`|`) and virtual packages are shown within angle brackets (`<>`).

Assuming that the variable `$pkgname` contains the name of a valid package, we can use the following command to list the packages it depends on:
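Reconstructed from the explanation that follows, the command is:

```shell
# List the packages "$pkgname" depends on (direct and pre-dependencies),
# skipping virtual packages.
apt-cache depends "$pkgname" | grep 'Depends:' | grep -v "<" \
    | cut -d':' -f2- | tr -d ' '
```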

What it does is:

- `apt-cache depends "$pkgname"`: list all dependencies of the $pkgname package;
- `grep 'Depends:'`: keep only lines containing the word “Depends” (i.e., “Depends” and “PreDepends”);
- `grep -v "<"`: remove references to virtual packages;
- `cut -d':' -f2-`: keep only package names;
- `tr -d ' '`: trim possible whitespace.

Let's put it all together in a bash script:
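A sketch of such a script, combining the two pipelines above (the helper function names and the one-edge-per-line output format are my choices, not necessarily the original's):

```shell
#!/bin/bash
# Print one "package dependency" edge per line for every installed package.

list_installed() {
    dpkg --get-selections | grep -v 'deinstall' | cut -f1 | cut -d: -f1
}

list_deps() {  # $1 = package name
    apt-cache depends "$1" | grep 'Depends:' | grep -v "<" \
        | cut -d':' -f2- | tr -d ' '
}

for pkgname in $(list_installed); do
    for dep in $(list_deps "$pkgname"); do
        echo "$pkgname $dep"
    done
done
```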

I chose Gephi to visualize the dependency graph.

A possible alternative would be dot, from the GraphViz suite. However, on my desktop machine there are about 2,500 installed packages and more than 13,000 dependencies: this graph is big enough to make it very hard for `dot` to find a reasonable layout for all the graph nodes and output an image that can actually be visualized.

Gephi supports several graph definition formats, including GEXF. GEXF is an XML-based format and it is very simple to adapt our script so that it outputs dependencies in the GEXF format:

I ran this script on a server running a barebones installation of Debian Wheezy. I loaded the resulting GEXF file in Gephi and tweaked the controls a bit:

In particular, I ran the "ForceAtlas 2" layout algorithm to neatly position all nodes (bottom left corner in the screenshot), I filtered the nodes to remove those that have a 0 degree, such as font packages (bottom right corner), and I ranked nodes based on their outdegree: nodes with higher outdegree are more red.

A few more tweaks in the Preview section of Gephi allow us to create something like this:

The red node at the center of the graph is `libc6` which, unsurprisingly, many packages depend on. The other two red nodes on the top left are `debconf` and `dpkg`: due to the very small installation size, they are relatively strongly connected to the other packages. The two nodes just below `libc6` are `zlib1g` and `multiarch-support`. Finally, the node on the bottom right is `python`.

Here is the same graph for a Kubuntu installation:

Gephi also provides metrics about the graphs it loads. For example, the average degree of this graph is 5.3, its diameter is 12 (i.e., the longest shortest path between any two connected nodes has 12 edges), the average path length is 3.7, and so on.

The Gephi team has collected several datasets to explore their software, available here.
