Introduction to Machine Learning

In the past few months you have all indicated an interest in learning a bit of coding and a bit of data scienc-y stuff while I’ve been yammering on about machine learning.

So, the purpose of this is to create something that all of us will be interested in and we can all learn fun stuff – going through this tutorial will achieve a couple of things –

  1. Teach you python basics (not really – but you’ll see some code)
  2. Machine learning basics (using Weka and some terminology)
  3. Bit of natural language processing

And now I’ve blown it, as I said a couple of things and then listed three, which sort of brings us to our final point – I won’t have all the answers and this won’t be me teaching you stuff. I know the basics and I’ll be able to offer occasional guidance and pointers, but mostly we’ll all be learning together.

So, if you’re not interested then turn back here. If yes, go on.

Introduction to Machine Learning

So, loads of cool sounding terms were heard (natural language processing, machine learning, coding) and we’re now all excited. Time to get us back to earth. We’ll be building a simple email spam filter just to prove out some points. Note that this is a rather “classic” machine learning example mostly because emails are ubiquitous and thus rich data sets are easy to find and email text is rich so we can practice python to clean up data for ingestion.

  • What is Python?

Python is a general purpose high level programming language. It’s really good at text and string processing so it’s well loved by the data science community as usually a major challenge is to clean up the data and make it usable by a machine.

We’ll be using Python 2.7

You can get it here –

  • What is Weka?

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

Weka can be found here –

  • What is Machine Learning?

Machine learning is basically learning from data “automatically”.

A long time ago when I was a wee programmer, access to data was not plentiful and computing power was non-existent (Spectrum ZX anyone? Amstrad CPC?), so things were nice and simple and people could hand-write rules to solve lots of problems. In our spam example that means rules like: if there are (lots of) links in the email body it’s possibly spam, and if you see {{ some list of words }} it’s probably spam. That generally worked all right, but unfortunately problems are usually more complicated than finding spam in an email – we’re just using it as a proof of concept – so the number of rules, and of their combinations, that must be implemented and processed by the machine quickly becomes unmanageable. Machine learning is the rather vast umbrella of methods that automate learning these rules based on features of your data.
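To make the hand-written approach concrete, here is a minimal sketch of what such a rule-based filter might look like. The word list, the link threshold and the function name are all made up for illustration; they are not taken from any real filter.

```python
# Hypothetical hand-written rules: the word list and the threshold are
# illustrative assumptions, not values from a real spam filter.
SUSPICIOUS_WORDS = ["lottery", "winner", "free money"]
MAX_LINKS = 5


def looks_like_spam(emailtext):
    text = emailtext.lower()
    # Rule 1: lots of links in the body is suspicious
    if text.count("http") > MAX_LINKS:
        return True
    # Rule 2: any word from our hand-picked list is suspicious
    for word in SUSPICIOUS_WORDS:
        if word in text:
            return True
    return False
```

Every new trick a spammer invents means another rule, another threshold and another interaction with the existing rules, which is exactly the maintenance problem described above.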

Back to the spam filter example.

What we are trying to do is classify whether an email is spam or not spam. Classification is one of the common high-level problems in machine learning – simply put, we must decide which class something fits – is this email spam or not spam? Does this picture contain a human face or not? Note that classification is not necessarily boolean – there can be multiple classes, so “does this picture contain a male, a female or no human?” is still a valid classification example.

So, what we will be building  is a system to automatically differentiate between two (or more) classes – spam or not-spam (and for future use potential-spam, high interest emails or even classify emails based on work/social/whatever categories). As this is machine learning we want to achieve this by not implementing the rules manually. The general process is for the computer to learn the rules based on our dataset (which should reflect the real world), and then use what it learned from our dataset on ‘new’ data. This implies a major caveat of all machine learning – without a backing dataset to teach the machine what is correct we cannot do machine learning. This is one of the major reasons that machine learning systems get better over time – as more data becomes available to them they become more proficient because of the added training.

Note that in real world problems getting the data (and making it usable) is generally one of the major challenges for machine learning, as we usually start off without any data and, even worse, we are not entirely sure what data we need. This is why Python is useful for machine learning – it is very powerful for manipulating text and data.

So, we have a ready made dataset in two files – named is-spam and not-spam . These are about 1200-odd emails in total which is a decent training set. Now would be a good time to download the files and extract them in a new, empty, directory (folder).

A  typical spam email example –

Get Money! Significant Others! (Not writing women because we're PC even in SPAM!) WIN THE LOTTERY! <html><head></head><body bgcolor=3Dblack>
<table border=3D0 cellspacing=3D0 cellpadding=3D5 align=3Dcenter><tr><th b=
<table border=3D0 cellspacing=3D0 cellpadding=3D5 align=3Dcenter><tr><th b=
<table border=3D0 cellspacing=3D0 cellpadding=3D0 align=3Dcenter>
<th><a href=3D"
<img src=3D"" width=3D279 height=3D=
286 border=3D0></a></th>
<th><a href=3D"
<img src=3D"" width=3D301 height=3D=
286 border=3D0></a></th>
<th><a href=3D"
<img src=3D"" width=3D279 height=3D=
94 border=3D0></a></th>
<th><a href=3D"
<img src=3D"" width=3D301 height=3D=
94 border=3D0></a></th>


A typical normal email example –

CLAIM: A leaked e-mail revealed Clinton aide Cheryl Mills calling the men who died in Benghazi “idiot soldiers” and saying she was glad they were tortured.


EXAMPLE: [Collected via e-mail and Twitter, October 2016]


ORIGIN:In October 2016, the above-reproduced e-mail screenshot began circulating on social media, purportedly capturing a message from longtime Clinton aide Cheryl Mills to Hillary Clinton in which Mills expressed a need to “hide” that requests for help made by the four U.S. personnel who died in the attack on the U.S. diplomatic compound in Benghazi had been denied, and in which Mills referred to those dead personnel as “idiot US soldiers [who] deserved to die”:

From: Mills, Cheryl D. <>
Sent: July 26, 2012 2:11 PM
To: H
Subject: Re: Benghazi

We need to hide all traces of the denied requests for help. Those idiot US soldiers deserved to die and I’m glad they were tortured.

But if anyone finds out we are toast.

Black Power!


If you quickly look at the emails in the two folders you will see that the emails are pretty much representative of what spam (and not-spam) looks like. This leads us to another machine learning caveat – if your initial data sets do not represent what will be encountered in the real world then the machine will not learn the correct rules and will not do anything useful – in fact it might get it totally wrong. So, there must always be effort allocated to preparing the initial data sets and ensuring that they are applicable to your problem.

Going back to machine learning – we are trying to determine the features that will allow us to differentiate between the emails in the not-spam and is-spam folders. The emails from both is-spam and not-spam are basically text – with some metadata (HTML) thrown in. Humans understand text and language quite well, however computers are not (yet) very good at it – just try asking Siri to find you a good restaurant in your vicinity. So, for our machine learning task – and in order to be able to extract the features – we need to translate language into something understandable by the computer – i.e. numbers. Enter Natural Language Processing (NLP) – we want the computer to understand the gist of the context of these messages. The machine learning part is then understanding the patterns in the numbers output from NLP and applying those patterns to future things (i.e. the emails in our test data set or other emails we’ve got).

What is a feature?

Let’s just shamelessly steal the answer from Wikipedia here – “In machine learning and pattern recognition, a feature is an individual measurable property of a phenomenon being observed. Choosing informative, discriminating and independent features is a crucial step for effective algorithms in pattern recognition, classification and regression.”  So, in simple terms – we want to translate our text emails into a set of numbers that represent the contents of the email. The better we can represent an email in numeric form the more patterns we’ll hopefully be able to extract from our emails if we pick good features.

Before writing any other features let’s just look at the number of words in our emails –
def numwords(emailtext):
    splittext = emailtext.split(" ")
    return len(splittext)

Save the above into a file called in the same directory (folder) where you have extracted the is-spam and not-spam files.

Don’t worry if the above python code makes no sense – it just counts the number of words. You can catch up on python through the tutorial here or just google python tutorial – there’s plenty high quality ones out there.
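One thing worth knowing about the word count above: split(" ") only splits on single spaces, so repeated spaces produce empty "words" (and line breaks do not separate words at all). A quick sketch of the difference, using a made-up sample string; numwords_whitespace is a hypothetical alternative and not part of the tutorial's files.

```python
def numwords(emailtext):
    # same logic as the feature above: split on single spaces only
    return len(emailtext.split(" "))


def numwords_whitespace(emailtext):
    # hypothetical variant: split on any run of whitespace
    return len(emailtext.split())


sample = "Get  rich  now"  # note the double spaces
print(numwords(sample))             # 5: 'Get', '', 'rich', '', 'now'
print(numwords_whitespace(sample))  # 3: 'Get', 'rich', 'now'
```

The numbers later in this post come from the original version, so stick with that one if you want to reproduce them.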

Weka uses .arff files to analyze its data sets – so we need another bit of python (more complicated) to create those files. We won’t be editing this file at all so you can just copy paste the below table in your text editor and save it as in the same directory as
import glob
import inspect
import os

import features

RELATION_NAME = "spam"  # assumed value; Weka only uses it as a label


def main():
    arff = open("spam.arff", "w")

    # every function defined in the features module becomes one attribute
    ml_functions = inspect.getmembers(features, inspect.isfunction)
    feature_functions = [f[1] for f in ml_functions]

    arff.write("@RELATION " + RELATION_NAME + "\n")
    for feature in feature_functions:
        # change REAL here if we have a non-real number feature
        arff.write("@ATTRIBUTE " + str(feature.__name__) + " REAL\n")

    arff.write("@ATTRIBUTE SPAM {True, False}\n")
    arff.write("@DATA\n")

    spam_directory = "is-spam"
    not_spam = "not-spam"
    # iterate through all spam emails
    for email in glob.glob(os.path.join(spam_directory, "*")):
        extract_features(open(email).read(), feature_functions, arff, True)

    # iterate through all not-spam emails
    for email in glob.glob(os.path.join(not_spam, "*")):
        extract_features(open(email).read(), feature_functions, arff, False)

    arff.close()


def extract_features(data, feature_functions, arff, spam):
    # run every feature function on the email and write one data row
    values = []
    for feature in feature_functions:
        values.append(feature(data))

    label = "True" if spam else "False"
    arff.write(",".join([str(x) for x in values]) + ", " + label + "\n")


if __name__ == "__main__":
    main()

Before we begin the fun

You should have four items in your machine learning folder – two python files ( and and two directories – is-spam and not-spam as shown below.


If that is not the case then make sure all files are where they should be. Download the dataset again from here –  is-spam and not-spam – and copy and paste the python scripts in the tables above again.

The fun begins

Now, try to execute from a command prompt in Windows – it will generate an .arff file which we can then load into Weka. However, before we begin let’s open up the .arff file in our text editor to have a look at its contents (or just have a look at the table below).

638, True
74, True
88, True

@RELATION is just the name of the problem. It’s actually just a static name set in the filename. It can be set to anything and it does not really matter.

@ATTRIBUTE is what Weka calls a feature. We have already defined two attributes in this baseline .arff  file. The first feature is numwords, which is set as a REAL number which indicates the number of words in the email. The second is spam which is a boolean attribute and can take the values True or False – the second attribute is there for purposes of training the model and obviously will not exist in real world data.

@DATA is merely an indicator of where the data begins – each line after it is a comma-separated record which contains the attribute values for one email.
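Putting the three keywords together, here is a small sketch that builds the text of such a file in memory, so you can see the header and the data rows in one place. The function name build_arff and the relation name "spam" are assumptions for illustration; the layout follows what the generator script writes.

```python
def build_arff(relation, feature_names, rows):
    # header: relation name, one REAL attribute per feature, then the class
    lines = ["@RELATION " + relation]
    for name in feature_names:
        lines.append("@ATTRIBUTE " + name + " REAL")
    lines.append("@ATTRIBUTE SPAM {True, False}")
    # data: one comma-separated row per email, class label last
    lines.append("@DATA")
    for values, is_spam in rows:
        lines.append(",".join(str(v) for v in values) + ", " + str(is_spam))
    return "\n".join(lines) + "\n"


# the three data rows shown above, each with a single numwords value
print(build_arff("spam", ["numwords"],
                 [([638], True), ([74], True), ([88], True)]))
```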

The .arff file will never be produced manually – however, we do need to have knowledge of its structure as our scripts will be producing .arff files. That said, we never have to look at .arff files again – we just need to load them into Weka.

Let’s fire up Weka now – it starts as a standard application and shows the GUI Chooser window as shown below.


Select Explorer – the first option.

Weka opens up in the pre-process tab as shown below.


Select open file and navigate to the directory where you just executed the python script – where the .arff file is located and load it up into Weka.


We start off in the pre-process tab.

The left column contains all our features.

If you click SPAM we see the distribution of spam or not in the right column – we can see that out of 1388 emails, 501 are spam and 887 are not. If you click numwords you can see various statistics about the email length etc. however the distribution does not mean much now.

Let’s start building a classifier to start making sense of the data.


Go into the Classify tab, and click the classifier button on the top left of the screen. The drop down menu above will appear. It does not really matter at this time which classifier we choose but in order to follow this intro tutorial let’s go for J48. J48 is an open source implementation of C4.5 written for Weka – we’ll talk more about J48 later on.

Regardless, click Start and you will see a lot of interesting data appear in your Weka window.


Have a look at the classifier output – under summary (red circle) you see the correct hit-rate of the classifier.

Play around with changing classifiers – you won’t see significant differences between them now, as we only have a single feature and thus none of them will produce particularly good results for this specific problem. However, some of them will always be better for specific problems. If you are serious about machine learning you should read up on the capabilities of each of the classifiers to better understand their fitness for each problem. Wikipedia is your friend as a starting point, but we’ll explore classifiers more in future posts.

Test options are also another important part of this tab – we not only want to train our classifier, we also want to have something to test it against. So, play around with the test options and run the classifier again. Generally, increasing training data should increase accuracy (even by small increments).


So, for example for this data set using J48 we can see the below results for the following percentage splits –

Training Set Percentage    Accuracy
60                         65.9459 %
70                         66.8269 %
80                         67.6259 %
90                         69.0647 %

Note that generally we should leave sufficient data to test our model against – an 80-20 or 90-10 split is generally OK.
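If you are curious what a percentage split does under the hood, here is a minimal sketch: shuffle the rows, then cut the list at the requested percentage. The function name and the seed parameter are assumptions for illustration, and Weka's own shuffling and rounding may differ in detail.

```python
import random


def percentage_split(rows, train_percent, seed=1):
    # shuffle a copy so the original order is left untouched
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # everything before the cut is training data, the rest is test data
    cut = int(round(len(shuffled) * train_percent / 100.0))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(1388))  # stand-ins for the 1388 emails
train, test = percentage_split(rows, 80)
print("%d train, %d test" % (len(train), len(test)))
```

With 1388 emails an 80% split leaves 278 emails for testing, which is the instance count that shows up in the Weka summaries in this post.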

So, we see that we sit at just under 70% correct spam/not-spam classification – that’s apparently great for a start. So, this will be our baseline. In machine learning a baseline is a method that uses heuristics, simple summary statistics, randomness, or machine learning to create predictions for a data set. You can use these predictions to measure the baseline’s performance (e.g., accuracy) – this metric will then become what you compare any other machine learning algorithm against. So, let’s start off with a baseline of roughly 70% and improve on that.
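For a sense of scale, the simplest possible baseline is to always predict the most common class. A small sketch using the class counts Weka showed us earlier (501 spam, 887 not spam); the function name is made up, and this is computed over the whole data set rather than a test split.

```python
def majority_baseline_accuracy(class_counts):
    # always predict the most common class and see how often that is right
    total = sum(class_counts.values())
    return max(class_counts.values()) / float(total)


counts = {"spam": 501, "not-spam": 887}  # class counts shown in Weka above
print("%.2f%%" % (100 * majority_baseline_accuracy(counts)))  # 63.90%
```

So a "classifier" that blindly says not-spam every time is right about 64% of the time, which puts our single-feature result of just under 70% into perspective.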

Before we start digging in deeper it’s perhaps a good time to take a look at the confusion matrix. This output is at the bottom of the classifier output screen.


A confusion matrix (a.k.a. error matrix) is a specific table layout that allows visualization of the performance of an algorithm – typically a supervised learning one (one with a training data set, such as ours). Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class.

So, our confusion matrix looks like –

=== Confusion Matrix ===

  a   b   <-- classified as
 43  52 |   a = True
 40 143 |   b = False

The row indicates the true class, the column indicates the classifier output. Each entry, then, gives the number of instances of <row> that were classified as <column>.

So, we classified 43 a’s correctly as a’s and 52 a’s as b’s (not spam). And 40 b’s were classified as a’s while 143 b’s (not spam) were classified correctly. An easier way to remember this is that all correct classifications are on the top-left to bottom-right diagonal. Everything off that diagonal is an incorrect classification of some sort.

Confusion matrices are fun and Wikipedia has a brilliant article with cats, dogs and rabbits in it, so have a look there for more detail. In purely practical terms a confusion matrix is useful as it allows us to see what we’re classifying mistakenly and allows us to tweak our features so that we get better classification.
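The usual summary numbers can be read straight off the matrix. Here is a short sketch that does the arithmetic for our four cells, treating a = True (spam) as the positive class; the function and variable names are just for illustration.

```python
def confusion_metrics(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / float(total)  # the diagonal over everything
    precision = tp / float(tp + fp)      # of emails flagged as spam, how many were spam
    recall = tp / float(tp + fn)         # of actual spam, how many were flagged
    return accuracy, precision, recall


# cells from the matrix above: row a is (tp, fn), row b is (fp, tn)
accuracy, precision, recall = confusion_metrics(tp=43, fn=52, fp=40, tn=143)
print("accuracy  %.4f" % accuracy)
print("precision %.4f" % precision)
print("recall    %.4f" % recall)
```

That works out to roughly 67% accuracy, but only about 52% precision and 45% recall for spam, so with a single feature more than half of the actual spam slips through.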

Adding more features

In order to improve accuracy we definitely need to extract more features from our data set. Generally, to come up with ideas for features to add it makes sense to manually have a look at your data set (use Linux command line tools to do it fast), speak to a domain expert to give you ideas (i.e. how would you classify emails as spam?) and so on. Selecting the correct features is a challenging aspect.

So, looking at the emails in our spam folder (again using simple Linux utilities) we can see that

$ cd is-spam/
$ grep -i HTML * | wc -l
$ cd ..
$ cd not-spam/
$ grep -i HTML * | wc -l

more of the spam emails have the word HTML in them rather than those in not-spam. So this could be a really good feature.
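If you don't have grep handy, a similar check can be sketched in Python. Note one difference: grep piped into wc -l counts matching lines, while this sketch counts each email once. The function name is hypothetical, and the directory names assume you run it from the folder that contains is-spam and not-spam.

```python
import glob
import os


def count_emails_containing(directory, needle):
    # count files, not lines: each email counts once however often the word appears
    needle_bytes = needle.lower().encode("ascii")
    count = 0
    for path in glob.glob(os.path.join(directory, "*")):
        if not os.path.isfile(path):
            continue
        # read as bytes so odd encodings in the emails cannot trip us up
        with open(path, "rb") as handle:
            if needle_bytes in handle.read().lower():
                count += 1
    return count


if __name__ == "__main__":
    for directory in ("is-spam", "not-spam"):
        print("%s: %d" % (directory, count_emails_containing(directory, "html")))
```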

Let’s add this to our file by adding the below code to it - add has_html function
def has_html(emailtext):
    return 1 if "html" in emailtext.lower() else 0

and execute again which will generate a brand new .arff file which we can load again into Weka.


We can see that the has_html feature has been added in the attributes section.

As expected, we can see that most of the emails where the has_html feature is 1 (i.e. have html) are spam, while most of those where has_html is 0 are not-spam. So, potentially this looks like a good feature as we can learn from it – spam emails appear to be html.

Let’s move on and add a third feature now.

Machine Learning - - add num_links function
def num_links(emailtext):
    return emailtext.count('http')

What this feature does is count the number of times the word http appears in an email – this is a good indicator of the number of links in an email, as most links start with http. The expectation is that most spam emails will have many links, as they want you to go on their site and buy stuff (or just generate clicks). This is again far from a perfect feature, as a lively email discussion about the http protocol will potentially have http appearing multiple times. However, it is a far more granular feature, as it counts the number of occurrences of the word rather than just giving true/false like the has_html feature.

So, once again let’s execute and generate and load the new spam.arff file into Weka.


Our minimum value is 0 whereas our maximum is 68 – i.e. there are emails with no links in them and there’s at least one that had 68 links in it.

Looking at the distribution graph below we see that at some point (about 10 links) all the emails are spam. So, this seems like a really good feature as we can create a more general rule – if links are over 10 then it’s definitely spam and a general weight rule – the more links an email has the more likely it is spam.

So, we have two new features – let’s go back to the classify tab and see what we can achieve with the new features.


It looks like nothing has changed!

This gets us back to our classifier discussion – the OneR classifier only looks at a _single_ feature and attempts to classify based on that. So, we need a classifier that is capable of using multiple features before reaching a final decision. Let’s go back and choose J48 again.

Intermission – What are Decision Trees? 

As we mentioned J48 is a decision tree learning model. You are probably familiar with decision trees – basically we make decisions based on one of the features and move down the tree branches until we hopefully reach a correct conclusion. There are numerous algorithms on how to build a decision tree but it is far beyond the scope of this intro to go into them.

However, a brief explanation of decision trees is necessary

Regardless, run the test again. In the classifier output window we’ll see our decision tree –


J48 pruned tree

num_links <= 3
|   has_html <= 0
|   |   numwords <= 17: True (35.0/8.0)
|   |   numwords > 17
|   |   |   num_links <= 2: False (810.0/146.0)
|   |   |   num_links > 2
|   |   |   |   numwords <= 270
|   |   |   |   |   numwords <= 106: False (18.0/3.0)
|   |   |   |   |   numwords > 106: True (52.0/15.0)
|   |   |   |   numwords > 270: False (20.0/1.0)
|   has_html > 0
|   |   num_links <= 0: True (25.0/2.0)
|   |   num_links > 0
|   |   |   num_links <= 1
|   |   |   |   numwords <= 384: False (46.0/12.0)
|   |   |   |   numwords > 384: True (19.0/3.0)
|   |   |   num_links > 1
|   |   |   |   num_links <= 2: True (82.0/32.0)
|   |   |   |   num_links > 2: False (59.0/26.0)
num_links > 3: True (222.0/62.0)

Our tree shows us exactly how the machine makes decisions – it is still fairly simplistic – let’s work from the bottom up.

if num_links > 3 we say the email is spam.

if num_links is 3 or less we take a look at whether the email is html (or more accurately contains the word html) and then split again into two branches, where we check against the number of links and the number of words in the email, and so on and so forth.
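A decision tree is really just a pile of nested if statements. As an illustration, here is the J48 tree above transcribed by hand into Python; the function name is made up, and Weka does not generate this code for you. It returns True for spam.

```python
def tree_says_spam(num_links, has_html, numwords):
    # bottom branch of the tree: more than 3 links is always spam
    if num_links > 3:
        return True
    if has_html <= 0:
        if numwords <= 17:
            return True
        if num_links <= 2:
            return False
        # exactly 3 links from here on
        if numwords <= 270:
            return numwords > 106
        return False
    # has_html > 0
    if num_links <= 0:
        return True
    if num_links <= 1:
        return numwords > 384
    return num_links <= 2


print(tree_says_spam(num_links=5, has_html=0, numwords=200))  # True
print(tree_says_spam(num_links=1, has_html=0, numwords=200))  # False
```

This is exactly the kind of rule set we said we did not want to write by hand; the difference is that the machine worked out the thresholds from the data.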

What’s important is that our classification has improved significantly to 76.62% (about nine percentage points better than the 67.63% we got from J48 at the same 80% split) with only a few clicks here and there.

=== Summary ===

Correctly Classified Instances 213 76.6187 %
Incorrectly Classified Instances 65 23.3813 %

Looks like we’re doing better – about 3/4 of emails are classified correctly, so we’ve reached late 90s levels of accuracy. It thus looks like if we keep adding features then surely our classification accuracy will eventually get close to 100%! Unfortunately, that is not the case, as not all features have the same value.

In order to demonstrate this let’s add one more feature in our file.

Machine Learning - - dummy function
def dummy(emailtext):
    return 1

This function will just return 1 regardless of what kind of email it is. Now, let’s run another time and reload the .arff file into Weka and see what happens.

We will see the exact same results as before – same summary.

=== Summary ===

Correctly Classified Instances 213 76.6187 %
Incorrectly Classified Instances 65 23.3813 %

Same decision tree.

=== Classifier model (full training set) ===

J48 pruned tree

num_links <= 3
|   has_html <= 0
|   |   numwords <= 17: True (35.0/8.0)
|   |   numwords > 17
|   |   |   num_links <= 2: False (810.0/146.0)
|   |   |   num_links > 2
|   |   |   |   numwords <= 270
|   |   |   |   |   numwords <= 106: False (18.0/3.0)
|   |   |   |   |   numwords > 106: True (52.0/15.0)
|   |   |   |   numwords > 270: False (20.0/1.0)
|   has_html > 0
|   |   num_links <= 0: True (25.0/2.0)
|   |   num_links > 0
|   |   |   num_links <= 1
|   |   |   |   numwords <= 384: False (46.0/12.0)
|   |   |   |   numwords > 384: True (19.0/3.0)
|   |   |   num_links > 1
|   |   |   |   num_links <= 2: True (82.0/32.0)
|   |   |   |   num_links > 2: False (59.0/26.0)
num_links > 3: True (222.0/62.0)

So it looks like Weka (the machine learning actually) is ignoring our new attribute fully and completely.

This happens because we can actually measure ‘information’ or ‘knowledge’ gain of each feature. In Weka, navigate to the ‘Select attributes’ tab.


On the first item (Attribute Evaluator) Select InfoGainAttributeEval

Weka will ask to enable Ranker – click yes on the menu that pops up.


Now, we’re finally ready to evaluate our attributes – just hit Start (left side, middle of the screen). In the attribute selection output you will see the ranked attributes displayed.

Attribute Evaluator (supervised, Class (nominal): 5 SPAM):
Information Gain Ranking Filter

Ranked attributes:
0.0985 3 num_links
0.0712 2 has_html
0.0612 4 numwords
0 1 dummy

As can be seen, the dummy feature has a ranking of 0 – the algorithm recognizes that it gains absolutely no information from this attribute, so it ignores it automatically. This gives us some insight on feature selection, as we now have a way to tell good, high information gain features from poor ones. Recall that each feature may potentially spawn a branch in our decision tree, so additional features mean more work for the machine. Feature selection is therefore extremely important for performance.
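To see why a constant feature scores exactly 0, here is a minimal sketch of information gain for a feature with a handful of distinct values: the entropy of the labels minus the weighted entropy left after splitting on the feature. The toy data is made up, and Weka also has to deal with numeric attributes, which this sketch does not attempt.

```python
import math
from collections import Counter


def entropy(labels):
    # Shannon entropy, in bits, of a list of class labels
    total = float(len(labels))
    result = 0.0
    for count in Counter(labels).values():
        p = count / total
        result -= p * math.log(p, 2)
    return result


def information_gain(feature_values, labels):
    # entropy of the labels minus the weighted entropy after splitting on the feature
    total = float(len(labels))
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for v, l in zip(feature_values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


labels = [True, True, False, False]  # toy data, not from the email set
constant = [1, 1, 1, 1]              # behaves like our dummy() feature
perfect = [1, 1, 0, 0]               # matches the label exactly
print(information_gain(constant, labels))
print(information_gain(perfect, labels))
```

Splitting on the constant feature leaves the labels exactly as mixed as before, so the gain is zero, while the perfect feature removes all the uncertainty (one full bit in this toy case).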

So, now we can judge features based on information gain. Let’s have a look at what happens if we start applying more specific features.

‘Common sense’ seems to indicate that words such as “buy”, “purchase”, “click” (as it indicates a link), “discount”, “free”, “offer” and “one-time” likely indicate some sort of attempt to get you to purchase something via email, and so a potential spam email. So, let’s add a more complicated feature to our program, such as the below –
def spammy_words(emailtext):
    spam_words = ['buy', 'purchase', 'click', 'discount', 'free', 'offer', 'one-time']  # words we consider spam
    splittext = emailtext.split(" ")
    total = 0
    for word in spam_words:
        total += splittext.count(word)  # add to word count

    return total

Now, let’s run again and load up the .arff file again.


We can see from the distribution of spammy_words that the more words from our array an email contains, the bluer the bars get – so this looks like a really good feature.

Before we continue with our analysis let’s play around with our Classifier configuration. Go to the classify tab and click on J48.


The below options will pop up – simply set the Unpruned option to True.


We briefly discussed decision trees previously and have seen that as we add new features the tree’s size (branches and leaves) grows. This is generally a problem, as trees can then become too large (so they won’t perform well in practical conditions) or become highly specific to our training cases (over-fitting).

Now, let’s run our classifier again. With our brand new feature and an unpruned tree we should see significant gains in classification accuracy!


Unfortunately, that is not the case as we can see. There is a slight gain in accuracy – about a percentage point.

=== Summary ===

Correctly Classified Instances 215 77.3381 %
Incorrectly Classified Instances 63 22.6619 %
Kappa statistic 0.5051
Mean absolute error 0.2802
Root mean squared error 0.4055
Relative absolute error 61.238 %
Root relative squared error 85.3748 %
Total Number of Instances 278

So, why has our accuracy not increased significantly? This is probably because of over-fitting – you can see that our decision tree has significantly increased in size, especially when compared to our previous tree – it now has a size of 89 with 45 leaves, i.e. 45 different, but highly specific, rules.

J48 pruned tree

num_links <= 3
|   has_html <= 0
|   |   spammy_words <= 0
|   |   |   numwords <= 17
|   |   |   |   num_links <= 1
|   |   |   |   |   num_links <= 0: True (19.0/3.0)
|   |   |   |   |   num_links > 0
|   |   |   |   |   |   numwords <= 15
|   |   |   |   |   |   |   numwords <= 11: False (3.0/1.0)
|   |   |   |   |   |   |   numwords > 11: True (7.0/1.0)
|   |   |   |   |   |   numwords > 15: False (2.0)
|   |   |   |   num_links > 1: True (4.0)
|   |   |   numwords > 17
|   |   |   |   num_links <= 1: False (525.0/61.0)
|   |   |   |   num_links > 1
|   |   |   |   |   numwords <= 219: False (135.0/40.0)
|   |   |   |   |   numwords > 219
|   |   |   |   |   |   num_links <= 2: False (42.0/2.0)
|   |   |   |   |   |   num_links > 2
|   |   |   |   |   |   |   numwords <= 270: True (3.0/1.0)
|   |   |   |   |   |   |   numwords > 270: False (15.0)
|   |   spammy_words > 0
|   |   |   num_links <= 2
|   |   |   |   spammy_words <= 1
|   |   |   |   |   numwords <= 350
|   |   |   |   |   |   numwords <= 99
|   |   |   |   |   |   |   num_links <= 0: False (3.0)
|   |   |   |   |   |   |   num_links > 0: True (11.0/3.0)
|   |   |   |   |   |   numwords > 99: False (41.0/2.0)
|   |   |   |   |   numwords > 350
|   |   |   |   |   |   numwords <= 524: True (19.0/4.0)
|   |   |   |   |   |   numwords > 524: False (32.0/9.0)
|   |   |   |   spammy_words > 1
|   |   |   |   |   numwords <= 3562: True (40.0/14.0)
|   |   |   |   |   numwords > 3562: False (4.0)
|   |   |   num_links > 2
|   |   |   |   numwords <= 304
|   |   |   |   |   numwords <= 148: False (3.0/1.0)
|   |   |   |   |   numwords > 148: True (22.0/2.0)
|   |   |   |   numwords > 304: False (5.0/1.0)
|   has_html > 0
|   |   num_links <= 0: True (25.0/2.0)
|   |   num_links > 0
|   |   |   spammy_words <= 0
|   |   |   |   num_links <= 1
|   |   |   |   |   numwords <= 559
|   |   |   |   |   |   numwords <= 115
|   |   |   |   |   |   |   numwords <= 94: False (15.0/4.0)
|   |   |   |   |   |   |   numwords > 94: True (4.0)
|   |   |   |   |   |   numwords > 115: False (20.0)
|   |   |   |   |   numwords > 559: True (5.0)
|   |   |   |   num_links > 1
|   |   |   |   |   num_links <= 2: True (59.0/27.0)
|   |   |   |   |   num_links > 2: False (41.0/14.0)
|   |   |   spammy_words > 0
|   |   |   |   spammy_words <= 1
|   |   |   |   |   num_links <= 1
|   |   |   |   |   |   numwords <= 194
|   |   |   |   |   |   |   numwords <= 117: False (2.0)
|   |   |   |   |   |   |   numwords > 117: True (3.0)
|   |   |   |   |   |   numwords > 194: False (3.0)
|   |   |   |   |   num_links > 1
|   |   |   |   |   |   num_links <= 2
|   |   |   |   |   |   |   numwords <= 120: True (9.0)
|   |   |   |   |   |   |   numwords > 120: False (7.0/2.0)
|   |   |   |   |   |   num_links > 2: True (12.0/3.0)
|   |   |   |   spammy_words > 1
|   |   |   |   |   num_links <= 2: True (20.0/1.0)
|   |   |   |   |   num_links > 2
|   |   |   |   |   |   numwords <= 298: True (3.0)
|   |   |   |   |   |   numwords > 298: False (3.0)
num_links > 3
|   spammy_words <= 0
|   |   numwords <= 838
|   |   |   has_html <= 0: True (50.0/18.0)
|   |   |   has_html > 0
|   |   |   |   num_links <= 7: False (59.0/25.0)
|   |   |   |   num_links > 7: True (21.0/1.0)
|   |   numwords > 838: True (25.0/1.0)
|   spammy_words > 0
|   |   spammy_words <= 2
|   |   |   has_html <= 0: True (22.0/2.0)
|   |   |   has_html > 0
|   |   |   |   spammy_words <= 1: True (22.0/4.0)
|   |   |   |   spammy_words > 1
|   |   |   |   |   numwords <= 645: True (4.0)
|   |   |   |   |   numwords > 645: False (3.0/1.0)
|   |   spammy_words > 2: True (16.0)

Number of Leaves  : 	45

Size of the tree : 	89

An indication that we may be overfitting to our training data is that if we _reduce_ the training set percentage our classifier accuracy actually improves (again by a small percentage). In the Test Options of Weka reduce the percentage split to 66% (2/3 of data). You will see that the hit rate improves a bit –

=== Summary ===

Correctly Classified Instances         377               79.8729 %
Incorrectly Classified Instances        95               20.1271 %
Kappa statistic                          0.5605
Mean absolute error                      0.2703
Root mean squared error                  0.4025
Relative absolute error                 58.797  %
Root relative squared error             84.4408 %
Total Number of Instances              472     

So, we can definitely tweak and play around with configuration options of both classifier and training set to improve accuracy further.

However, let’s take a step back and try to improve our spammy_words feature first. We conceived of this feature without any sort of data – we just decided on a few words that we _thought_ indicated spam based on our own personal experience. Let’s take a look at the data first however.

Navigate to our is-spam directory and using our new-found mastery of Linux execute the below –

sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' 0* | LC_ALL=C sort | LC_ALL=C uniq -ic | LC_ALL=C sort -k1 -r | head -50

What this command does is break all the emails in the directory into words and then count the number of occurrences of each word across all the emails in the directory. You will see output such as the below –

$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' 0* | LC_ALL=C sort | LC_ALL=C uniq -ic | LC_ALL=C sort -k1 -r | head -150
5529 the
4643 to
3539 and
3459 of
3030 you
2253 a
1966 in
1941 for
1825 your
1610 this
1517 is
1048 i
1041 that
955 with

Again, go into the not-spam directory and execute the same command.

$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' 0* | LC_ALL=C sort | LC_ALL=C uniq -ic | LC_ALL=C sort -k1 -r | head -150
  13366 the
   8842 >
   7148 to
   6665 of
   6142 and
   5561 a
   3955 in
   3232 is
   3132 that
   2505 for
   2174 it
   2015 i
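The same kind of word count can be sketched in Python with collections.Counter. This is a rough equivalent rather than a faithful port of the sed pipeline: it reads every file in the directory rather than only those matching 0*, and it splits on any whitespace. The function names and the latin-1 decoding are assumptions for illustration.

```python
import glob
import os
from collections import Counter


def count_words(text):
    # lowercase, drop full stops, split on whitespace
    return Counter(text.lower().replace(".", "").split())


def top_words(directory, n=50):
    # add up the word counts of every file in the directory
    totals = Counter()
    for path in glob.glob(os.path.join(directory, "*")):
        if os.path.isfile(path):
            with open(path, "rb") as handle:
                # latin-1 never fails to decode, which suits messy email data
                totals.update(count_words(handle.read().decode("latin-1")))
    return totals.most_common(n)


if __name__ == "__main__":
    for word, count in top_words("is-spam"):
        print("%7d %s" % (count, word))
```

Running top_words on both directories and comparing the two lists is a quick way to spot words that are common in one class and rare in the other.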

Most words are actually fairly common – articles and such (a, the, to, of etc.) so we can’t do much with those as they appear with similar frequency in both spam and not-spam emails. However, we see some interesting differences. For example, the word “helvetica,” and “face=”verdana”>

=== Summary ===

Correctly Classified Instances 231 83.0935 %
Incorrectly Classified Instances 47 16.9065 %
Kappa statistic 0.6194
Mean absolute error 0.2167
Root mean squared error 0.3809
Relative absolute error 47.3546 %
Root relative squared error 80.2008 %
Total Number of Instances 278

83% of emails classified correctly. That’s not bad at all but we can improve further.




