Introduction to data analysis with Python for current MATLAB users

What you can do with this article

Even if you are new to Python, you will be able to write Python code that builds a prediction model.

Targeted readers

Those who are interested in Python for data analysis but have never tried it
Those who want to start trying another language in addition to MATLAB
Those who are not interested in programming languages

Table of Contents

Who am I? (You can skip this part.)
Reasons why I recommend Python
Python environment construction for users who are familiar with MATLAB
Points that are confusing when you try to use Python in the same way as MATLAB
Footnotes

Who am I?

I am Naoya OOSUGI, and I joined the Advanced Technology Lab (ATL) at Recruit Technologies Co., Ltd. this July. My research interests are signal processing and machine learning. I am not an engineer by background; until last year I analyzed monkey brain waves at a research center that is famous for basic research.

MATLAB was popular for data analysis in neuroscience, so I have used it for 5 years. MATLAB is paid software with good support (^*1), so it is commonly used in academic fields where reproducibility and reliability are important. I used the Signal Processing Toolbox in MATLAB a lot.

When I joined ATL in July, I came into an environment where MATLAB is not used (^*2), so I tried to see what I could do with Python. Once I got to know it, I felt "I don't need MATLAB any more". Although I have only known Python for a month, I hope that writing about the points I struggled with will make it easier for you to start with Python.

Reasons I recommend Python

Here are the reasons why I came to like Python, with a rough breakdown:

scikit-learn was great (about 80%)
Other modules were available (about 10%)
Environment construction was easy (about 5%)
MeCab could be used without any trouble (about 3%)
Drawing similar to MATLAB was easily done (about 2%)

So, rather than Python itself, I would like to talk mostly about one of the Python modules, scikit-learn.

scikit-learn is a module for machine learning. Even those who are not interested in machine learning at all should find this article useful, because it shows a good example of how to write object-oriented code in Python.

For those who are interested in hypothesis tests such as the t test, see the reference for a module called scipy. scipy is a collection of mathematical functions used in science, and you will find useful functions there.
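As a small illustration of the kind of function scipy offers, here is a sketch of a two-sample t test with scipy.stats.ttest_ind. The two groups are made-up numbers for this example, not data from this article.

```python
import numpy
from scipy import stats

# Hypothetical measurements for two groups (made up for illustration)
group_a = numpy.array([5.1, 4.9, 5.6, 5.8, 5.0, 5.3])
group_b = numpy.array([6.2, 6.0, 5.9, 6.5, 6.1, 6.4])

# Independent two-sample t test; returns the t statistic and the p-value
t_value, p_value = stats.ttest_ind(group_a, group_b)
print(t_value, p_value)
```

A small p-value suggests that the two group means differ.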

To explain scikit-learn, I will use the following use case: checking whether the features contain information about the labels, by cross-validation with a Support Vector Machine (SVM). Let's say the data is stored in two csv files in the local environment: one holds the numeric feature values (explanatory variables), and the other holds the teacher labels (response variable).

The following is the code. How short is it? 7 lines only.

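import numpy
import sklearn.svm, sklearn.cross_validation  # submodules must be imported explicitly

# Load the files
train = numpy.genfromtxt('./data/training.csv', delimiter=',', dtype='float')
label = numpy.genfromtxt('./data/label.csv', delimiter=',', dtype='float')

# Define the model
model = sklearn.svm.SVC()

# 10-fold cross-validation
# (newer scikit-learn releases provide cross_val_score in sklearn.model_selection instead)
score = sklearn.cross_validation.cross_val_score(model, train, label, cv=10)

# Show the result
print(score.mean())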

The import lines declare that we use two modules, NumPy and scikit-learn.

train = numpy.genfromtxt('./data/training.csv',delimiter=',',dtype='float')

and

label = numpy.genfromtxt('./data/label.csv',delimiter=',',dtype='float')

load the csv files in the local environment into Python. The features are stored in the variable train (number of observations × number of features) and the labels in the variable label (number of observations). (^*3)

In numpy.genfromtxt we pass two options, delimiter and dtype. delimiter specifies the delimiter used in the file being read. dtype specifies the data type in which the values are stored in the variable.
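To see what the two options do without preparing files, here is a minimal sketch that reads CSV text from memory. The CSV content is hypothetical and only stands in for ./data/training.csv.

```python
import io
import numpy

# Hypothetical CSV content standing in for ./data/training.csv
csv_text = io.StringIO("1.0,2.0,3.0\n4.0,5.0,6.0\n")

# delimiter: how the columns are separated; dtype: how the values are stored
train = numpy.genfromtxt(csv_text, delimiter=',', dtype='float')

print(train.shape)  # (number of observations, number of features)
print(train.dtype)
```

Each row of the text becomes one observation and each comma-separated column becomes one feature.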

When you read a file with numpy.genfromtxt, the result is a NumPy array (^*5) rather than a basic Python list (^*4). A NumPy array is used much like a MATLAB matrix. Note, however, that unlike a MATLAB matrix, its size cannot be changed dynamically.
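The fixed size is easy to trip over when coming from MATLAB, so here is a small sketch with made-up values. Using numpy.append to build a larger array is one possible workaround, chosen here only for illustration.

```python
import numpy

a = numpy.array([1.0, 2.0, 3.0])

# In MATLAB, a(4) = 4 would grow the matrix. A NumPy array does not grow:
try:
    a[3] = 4.0
    grew = True
except IndexError:
    grew = False
print(grew)

# Instead, build a new, larger array; the original is left unchanged
b = numpy.append(a, 4.0)
print(a.shape, b.shape)
```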

model = sklearn.svm.SVC() specifies which model scikit-learn will use.

This line is the biggest difference from MATLAB, and it is one of the points that impressed me most!

This one line specifies two things:

Which model you use.
What values its hyperparameters take.

If I say "constructor", people who are somewhat familiar with object-oriented programming will understand. For those whose reaction to object-oriented programming is "Is that something you can eat?" (^*6), let me add a little more. An object is a convenient thing that attaches functions to a structure, and those functions can change the values held in the structure. For this article, that is all you need to know.
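For readers who want to see this description of an object in code, here is a minimal sketch. The Counter class is a hypothetical example invented for this explanation and has nothing to do with scikit-learn.

```python
class Counter(object):
    """A tiny object: one stored value plus functions that act on it."""

    def __init__(self, start=0):
        # The constructor sets the initial state of the object
        self.value = start

    def increment(self):
        # A method changes the value stored in the object
        self.value = self.value + 1
        return self.value


counter = Counter(start=10)  # calling the constructor creates the object
counter.increment()
counter.increment()
print(counter.value)
```

sklearn.svm.SVC() plays the same role as Counter(start=10) here: it creates an object, and functions such as fit then change what the object holds.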

For the first point, which model to use: we call an SVM here, but you can also choose random forests, generalized linear models, and so on. More details are available on the official website.

Moreover, liblinear (^*7) can be used in the same way.

For the second point, the hyperparameter values: since the sample code does not specify any, the defaults are used, for example "rbf" for the kernel. You can set each hyperparameter through the constructor arguments, such as which kernel to use and the values of C and the norm (^*8).
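As a sketch of how hyperparameters are passed to the constructor, the example below compares a default SVC with one whose kernel and C are set explicitly. The values 'linear' and 10.0 are arbitrary choices for illustration, not recommendations.

```python
from sklearn.svm import SVC

# No arguments: every hyperparameter keeps its default value
default_model = SVC()
print(default_model.kernel)

# Hyperparameters are given as constructor arguments
tuned_model = SVC(kernel='linear', C=10.0)
print(tuned_model.get_params()['C'])
```

get_params() returns all hyperparameters of a model as a dictionary, which is handy for checking what you actually configured.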

So this time the object called model is an SVM with an RBF kernel.

To learn the model parameters, you use a function called "fit".

model.fit(train,label)

Once the parameters are learned, use the function called "predict" to get the output.

model.predict(train)

You would not write it like this in MATLAB, but this style makes cross-validation easy to write. (^*9)
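To make fit and predict concrete, here is a minimal sketch on toy data. The six points and their labels are made up for this example, and predicting on the training data itself is done only to show the call.

```python
import numpy
from sklearn.svm import SVC

# Hypothetical toy data: two small clusters with labels 0 and 1
train = numpy.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                     [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
label = numpy.array([0, 0, 0, 1, 1, 1])

model = SVC()
model.fit(train, label)           # learn the model parameters
predicted = model.predict(train)  # one predicted label per observation
print(predicted)
```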

score = sklearn.cross_validation.cross_val_score(model,train,label,cv=10) stores the prediction accuracy obtained by cross-validation in a variable called "score".
We apply 10-fold cross-validation this time.
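For readers on a recent scikit-learn release, where cross_val_score is provided by sklearn.model_selection, the same step might look like the sketch below. The data comes from make_classification and is only a synthetic stand-in for training.csv and label.csv.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the two csv files (not the data of this article)
train, label = make_classification(n_samples=200, n_features=5, random_state=0)

model = SVC()

# cv=10 splits the data into 10 folds and returns one score per fold
score = cross_val_score(model, train, label, cv=10)
print(score.shape)
print(score.mean())
```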

It is honestly very simple.

Because the model is passed as an argument as it is, you only have to rewrite the one line that creates "model" in order to adjust its hyperparameters or to try another model. For example, L2-regularized logistic regression looks like this.

model = sklearn.linear_model.LogisticRegression(penalty='l2')
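The sketch below illustrates this one-line swap by running the same cross-validation call over several models. It assumes a recent scikit-learn release and synthetic data from make_classification. The logistic regression relies on its default L2 regularization, and max_iter=1000 and n_estimators=20 are arbitrary illustrative settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (not the data of this article)
train, label = make_classification(n_samples=200, n_features=5, random_state=0)

# Only the constructor call differs between the models
models = {
    'svm': SVC(),
    'logistic regression': LogisticRegression(max_iter=1000),  # L2 by default
    'random forest': RandomForestClassifier(n_estimators=20, random_state=0),
}

mean_scores = {}
for name, model in models.items():
    # The evaluation line is identical for every model
    mean_scores[name] = cross_val_score(model, train, label, cv=10).mean()
    print(name, mean_scores[name])
```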

Finally, the average of the scores obtained by the cross-validation of

score = sklearn.cross_validation.cross_val_score(model,train,label,cv=10)

is shown on the console. "print" displays the content of the variable given to it on the console.

The return value of sklearn.cross_validation.cross_val_score is a NumPy array, as mentioned before.

The array format has basic functions such as sum and mean.

To get the average value, you can write

numpy.mean(score)

in addition to

score.mean()

I do not know which is preferable, but I use the latter because it is easier to write.
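A quick check that the two spellings give the same number, using made-up per-fold scores:

```python
import numpy

# Hypothetical per-fold scores (made up for illustration)
score = numpy.array([0.8, 0.9, 0.7, 1.0])

mean_by_function = numpy.mean(score)  # function form
mean_by_method = score.mean()         # method form

print(mean_by_function == mean_by_method)
```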

That covers about 80% of the reasons why I like Python. There are other convenient modules (10% of the reasons), but since I am still studying them, I will stop here with this simple supervised-learning sample using scikit-learn. I will write up another simple use case like this one if this article gets a lot of page views.

You can, of course, write something similar in MATLAB with an external tool such as weka, but setting up an environment to try the code above was easier than I had expected, which is another reason (5%) why I came to like Python (^*10).

The following is the environment that I found easy to install and use.

Python environment construction for users who are familiar with MATLAB

Prepare the following two tools:

anaconda
spyder

Although these names are awkward to search for, both are very easy to install, and with them you can easily try the code above. The picture below shows what it looks like after installation on a Mac; it is similar on Windows.

spyder

First of all, I will explain what anaconda is about.

anaconda bundles many convenient Python packages, and with it you can easily install and update popular packages. You can download it from the website below. Installation is not difficult, so I will skip it here.
https://store.continuum.io/cshop/anaconda/

With anaconda alone you can already run the code above. But those who are used to the GUI of MATLAB may find a command-line interface (CUI) hard to use.

So, I will introduce a development environment called spyder. Although the official anaconda homepage says you can download XXX in anaconda, the link seems to be dead as of August 2014, so let's download and install it from the site below.
https://bitbucket.org/spyder-ide/spyderlib/downloads

If you already have anaconda installed, you should see that the Python from anaconda starts when you launch spyder.

That's all for the environment construction. With anaconda and spyder you can do most of your analysis (^*11).

That is all for the main content of this article. I am just a beginner, so there must be some immature points, but thank you for reading.

Points that are difficult when you try to use Python in the same way as MATLAB

Finally, I want to mention some points where MATLAB users tend to stumble.

1. The body of for and if is delimited by indentation, not by brackets or an end keyword

Let’s take a look at this.

count = 0
for i in [0,1,2,3]:
    count = count + 1

print(count)

For MATLAB users unfamiliar with Python, it is not easy to see where the for loop ends.

In this case, the value of count is printed just once, after the for loop finishes. In Python the range of if and for is defined by aligning the indentation (^*12). Once you get used to it, it is no problem, but I often caused errors because the indentation was not aligned when I copied code from an editor and pasted it into the console.
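To show how the indentation decides what belongs to the loop, here is a small sketch that records when each line runs. The list names inside and outside are made up for this example.

```python
inside = []
outside = []

count = 0
for i in [0, 1, 2, 3]:
    count = count + 1
    inside.append(count)  # indented: part of the loop body, runs 4 times

outside.append(count)     # not indented: runs once, after the loop ends

print(inside)
print(outside)
```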

2. Saving intermediate results is a hassle

In MATLAB you can easily save an intermediate result, whether it is a single variable, a set of variables or a figure, with the commands save and saveas (^*13). In Python you will find it harder.

You can use numpy.save and numpy.load for data in the NumPy array format, but not when you want to save a whole trained model with its parameters. Using a module called pickle seems good, but I have not yet decided on the best way. I will report the right way once I find it.
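As one possible approach rather than an established best practice, the sketch below saves an array with numpy.save and a trained model with pickle, then loads both back. The toy data and the file names train.npy and model.pkl are made up, and a temporary directory is used so that nothing is left behind.

```python
import os
import pickle
import tempfile

import numpy
from sklearn.svm import SVC

# Hypothetical toy data and a trained model
train = numpy.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
label = numpy.array([0, 0, 1, 1])
model = SVC().fit(train, label)

with tempfile.TemporaryDirectory() as tmp:
    # NumPy arrays: numpy.save / numpy.load
    array_path = os.path.join(tmp, 'train.npy')
    numpy.save(array_path, train)
    restored_train = numpy.load(array_path)

    # Trained model: pickle stores the whole object, learned parameters included
    model_path = os.path.join(tmp, 'model.pkl')
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)
    with open(model_path, 'rb') as f:
        restored_model = pickle.load(f)

print(restored_model.predict(train))
```

Keep in mind that a pickled model is generally only safe to load in an environment with the same library versions, and that pickle files from untrusted sources should not be loaded.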

3. Handling Japanese character encodings is a hassle

Handling Japanese character encodings bothers me. The flow usually goes File input → Decode to Unicode → Process → Encode to UTF-8, etc. → File output, but some parts of it are still troublesome.
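The flow above can be sketched as follows. The input is assumed to be Shift_JIS for illustration, and the bytes are built in memory instead of being read from a real file.

```python
# Bytes as they might arrive from a Shift_JIS encoded file
# (assumption: the input encoding is Shift_JIS)
raw_bytes = u'日本語のテキスト'.encode('shift_jis')

# Decode the bytes to Unicode text before processing
text = raw_bytes.decode('shift_jis')

# Process the text as Unicode
processed = text.replace(u'テキスト', u'文章')

# Encode to UTF-8 before writing the output
utf8_bytes = processed.encode('utf-8')

print(processed)
```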

Footnotes

*1: MATLAB is a product of MathWorks. They have great support via e-mail and so on.
*2: Let me add this just in case you think we are short on budget. We have no budget shortage, and nobody is being stingy.
*3: For this example we specify the file locations with the relative paths ./data/training.csv and ./data/label.csv from the current directory, but you can of course use absolute paths. To change the current directory, you can use the cd command.
*4: A Python list can store various types of data mixed together. A list can also store another list.
*5: Since an array can store only a single data type, you need to choose the data type first. If you skip this, a data type is chosen automatically, but it is better to declare it yourself even if it is a bother.
*6: This was me a couple of months ago. MATLAB was the only thing I was familiar with. If I say that a typical constructor in MATLAB is figure(), you may get a picture of what an object is.
*7: This is an implementation specialized for linear SVMs. It needs less computation time and less memory, which impressed me.
*8: You can also specify here whether to use processors in parallel. Awesome!
*9: Because a model can be taken as an argument, even ensemble learning such as boosting is easy to implement.
*10: I personally find the license authorization a bother when installing MATLAB on a new computer. There is nothing we can do about it, though.
*11: The Japanese morphological analysis tool MeCab-Python is, of course, not included, so I needed to install it myself.
*12: These are tabs and spaces. If there is meaningless whitespace at the beginning of a line in Python, it returns an error.
*13: There are some traps, such as the order of arguments being different between save and saveas, but looking back now, it was easy.