Introduction to Linux

The purpose of this post is to provide a brief introduction to Linux commands that are useful for data science work.

Mainly, we will look at how we can manipulate large data files using commands such as grep, awk, sort and uniq, and how we can use commands such as curl to quickly download full webpages and look at their data.

Note that we will be using Unix/Linux interchangeably across these posts – it will always mean a computer that can execute unix shell commands so please don’t go all pedantic on me.

Environment setup

Before we begin we will need to set up our environment. We’ll need two things – a way to execute unix shell commands and a good, robust text editor. Fortunately, there are options for both.

Unix shell options on Windows

We’re talking about unix commands but unfortunately we’re mostly stuck on Windows. There are solutions though, as even Microsoft has seen the light and realized the utility and power of having unix commands at your fingertips. Windows 10 now offers the Windows Subsystem for Linux, and instructions on how to install it can be found here. However, as Windows 7 is still fairly ubiquitous we will mainly use Cygwin. Both options offer the same functionality – a command line window where we can execute our unix shell commands.

Cygwin is a large collection of GNU and Open Source tools which provide functionality similar to a Linux distribution on Windows. After installing either option, all you get is basically a window such as the one below

[Screenshot: a Cygwin terminal window]

where you can type in your commands and write scripts. Impressive huh?

Text Editor

You will also need a text editor. There are numerous options out there, but favourites include notepad++, atom and/or sublime. Feel free to use whichever suits you. Examples will mostly be using notepad++.

The text editor will mostly be needed because, although one liners can be extremely powerful, more complicated tasks will require a series of commands – or, in computer parlance, a script.

The data

As part of this example series a sample data set has been created by yours truly – in total there are 1.2 million rows of data and the file is a fairly simplistic sales ledger. Note that we have deliberately gone over Excel’s maximum row count of 1,048,576 rows – and our data set is still small in the brave new world we’re in.

You can download the data set here in zip format.

The file is pipe separated and contains the following columns –

Column Name – Column Description
Name – Name of the person who bought the item
ItemBought – The type of item bought, in text format
Quantity – Number of the item bought
Value – Value of the item bought in US dollars
Country – Country where the purchase was made
CostCenter – Division in the company where the item was purchased
Company – Company that bought the item
DateStamp – A date in DD-MM-YYYY HH:MM:SS format

The first five lines of the file look like the below –

Name|ItemBought|Quantity|Value|Country|CostCenter|Company|DateStamp
Walter U. Yang|cereals|6|$97.12|Ireland|90|Yahoo|01-01-2015 11:33:39
Zephr Y. Sloan|noodles|8|$46.29|United Kingdom|96|Altavista|23-03-2015 11:31:49
Nero Butler|noodles|5|$41.50|Ireland|92|Altavista|26-01-2015 18:59:08
Barclay Landry|salads|4|$94.57|Germany|94|Macromedia|24-12-2015 20:24:09

Before we begin – a note on pipes and redirects

All of the commands we will be using accept text input and produce text output. So, most of the time it is useful to be able to control where the input comes from and where the output goes. Unix supports the concept of pipes (represented by the | symbol) and redirects (represented by the > or >> symbols).

We need these funny symbols mostly because we usually want to perform fairly complex operations on our file, such as selecting only the lines containing a certain word, sorting them, getting rid of duplicate lines and then saving the output into a new file.

When you want to send the output of a command to another command use a pipe | between the commands.

When you want to save the output of a command into a file use > (which creates the file, overwriting it if it already exists) or >> (which appends to the file, creating it if it doesn’t already exist).
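
Here’s a quick sketch combining the two, using the sample data file introduced above (grep and wc are explained further down – the output file name germanyOrders is just an arbitrary example):

#pipe the output of grep into wc to count the lines containing germany
grep -i germany sourceData_Lesson1 | wc -l
#redirect the matching lines into a new file instead of printing them on screen
grep -i germany sourceData_Lesson1 > germanyOrders
#append the lines matching france to the same file
grep -i france sourceData_Lesson1 >> germanyOrders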

Unix Commands

Unix offers a plethora of commands and utilities – the below list is mainly my personal choice and preference. Feel free to disagree or use alternatives.

cat

cat is probably the simplest Unix command of them all. It just prints the full contents of a file.

#print contents of a file.
cat sourceData_Lesson1

will thus print a lot of lines (1.2 million of them) on our screen so don’t run this command unless you are willing to wait for a rather long time.

head & tail

Head and tail will print out the beginning and ending lines of a (huge) file. This is usually useful if you want to inspect what a file looks like – having a look at the data using the good old eyeball. You can optionally include the -n flag to indicate the number of lines you want displayed. Generally, you should head and tail a big file to check for header or trailer rows (some file formats have those), as they may mess up your data sets, especially when loading into a database.

#Example use :
#See the first 3 lines of the test file sourceData_Lesson1
head -n 3 sourceData_Lesson1
#See the last 3 lines of the test file sourceData_Lesson1
tail -n 3 sourceData_Lesson1
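
As a quick aside (my own addition, not one of the original examples), head -n 1 is a handy way to confirm whether a file starts with a header row before loading it anywhere:

#print just the first line to check for a header row
head -n 1 sourceData_Lesson1
# Name|ItemBought|Quantity|Value|Country|CostCenter|Company|DateStamp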

wc (word count)

By default, wc will quickly tell you how many lines, words, and bytes are in a file. If you’re looking for just the line count, you can pass the -l parameter in.

It is used most often to verify record counts between files or database tables throughout an analysis – if the file’s line count is equal to your table row count then you’re doing OK. It can also help you determine whether your loading process loaded everything.

wc sourceData_Lesson1
# 1292928  5060480 94437632 sourceData_Lesson1
wc -l sourceData_Lesson1
# 1292928 sourceData_Lesson1

grep

Grep allows you to search files using regular expressions. As regular expressions can be fairly complex, it is usually enough to just run grep word_that_I_am_looking_for filename to get the output we need – complicated regular expressions are seldom needed in the type of work we’ll be doing.

 

#print out all the lines containing the word germany in them (warning lots of lines!)
grep -i germany sourceData_Lesson1 

Generally, for the type of work we’ll be doing we want to aggregate things so here’s the first example of a useful pipe

#print out the number of lines containing the word germany (more useful than above)
grep -i germany sourceData_Lesson1 | wc -l
#206080
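
As a side note (not in the original examples), grep can count matching lines on its own via the -c flag, which gives the same number without the extra pipe:

#count the matching lines directly with grep's -c flag
grep -ic germany sourceData_Lesson1
#206080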

We can further dig into the data by piping the grep output to another grep –

#print out the number of lines containing the words germany, microsoft and noodles
grep -i germany sourceData_Lesson1 | grep -i microsoft | grep -i noodles | wc -l
#1792

sort & uniq

Sort outputs the lines of a file in order, optionally based on a column key specified with the -k parameter. If a key isn’t specified, sort treats each whole line as the sort key and sorts the lines lexicographically. The -n and -r parameters allow you to sort numerically and in reverse order, respectively.
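
For instance (a sketch of my own, not one of the original examples), to order our pipe-separated file by the Quantity column you would combine these flags with -t, which tells sort which field separator to use:

#sort numerically (-n) and in reverse (-r) on the third column (-k3,3), treating | as the field separator
$ sort -t'|' -k3,3 -n -r sourceData_Lesson1 | head -5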

Sometimes you want to check for duplicate records in a large text file – that’s when uniq comes in handy. By using the -c parameter, uniq will output the count of occurrences along with the line. You can also use the -d and -u parameters to output only duplicated or unique records.

#take the first 100K lines of the source file, sort them, count how many times each line appears, then show just the first 10 result lines
$ head -100000 sourceData_Lesson1 | sort | uniq -c | head -10
 10 Aaron F. Byers|soups|2|$98.21|United States|93|Lavasoft|29-08-2015 17:12:18
 9 Aaron Gonzalez|cereals|9|$48.06|Canada|96|Apple Systems|29-03-2015 07:08:15
 10 Aaron M. Dawson|cereals|7|$15.03|Canada|88|Borland|17-03-2015 19:28:44
 10 Aaron P. Merritt|sandwiches|4|$72.61|United States|92|Lycos|17-08-2015 02:14:51
 10 Aaron Snyder|sandwiches|5|$16.90|Canada|93|Google|14-06-2015 06:13:26
 10 Aaron Stephenson|noodles|7|$6.20|Canada|92|Borland|15-05-2015 08:24:43
 10 Abbot D. Keller|salads|5|$64.44|Canada|98|Borland|17-10-2015 02:05:24

Oh dear! It looks like there are duplicate lines in our source data, as there are instances of the same order appearing 10 times (same timestamp, same name, same items ordered, same cost etc.) – this is a common issue in data science and cleaning up your source data is almost always a challenge. To clean up the file just issue the below command. Always remember that uniq only removes adjacent duplicate lines, so it is important to first sort the file and then pipe it into uniq.

#sort the file and pipe into uniq to remove all duplicate entries (will take a while)
$ sort sourceData_Lesson1 | uniq > noDuplicates

You will note that you are left with a small file of only 10101 lines – this is an extreme case of duplication, but this is a tutorial so that’s OK – you will almost never find such bad cases of duplication in the wild.
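
If you ever want to see which lines are duplicated rather than simply drop them, the -d flag mentioned earlier prints only the repeated lines (again, a small sketch of my own):

#show only the lines that appear more than once (the input must be sorted first)
$ sort sourceData_Lesson1 | uniq -d | head -5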

sed

Sed is another powerful command that we will mainly use for string replacement – i.e. use sed’s substitute command.

As an example, let’s take a look at our data again. We have lines such as the below –

Aaron F. Byers|soups|2|$98.21|United States|93|Lavasoft|29-08-2015 17:12:18
Aaron Gonzalez|cereals|9|$48.06|Canada|96|Apple Systems|29-03-2015 07:08:15
Aaron M. Dawson|cereals|7|$15.03|Canada|88|Borland|17-03-2015 19:28:44

The dollar sign in the Value column makes manipulation more difficult, as the computer won’t be able to easily treat the value as a number and do the necessary calculations (i.e. sum all the numbers).

Sed commands should be encased in single quotes and there are four parts to the substitute command:

s – the substitute command
/../../ – the delimiters
one – the search pattern (a regular expression)
ONE – the replacement string

Here are some examples – generally, don’t fret about complex regular expressions; most of the time they will not be needed and this is just an intro tutorial.

#Replace Lycos with Telefonica (as Lycos was acquired and we need to assign those costs to Telefonica - so early 2000s!)
$ sed 's/Lycos/Telefonica/g' noDuplicates | head -10
Aaron F. Byers|soups|2|$98.21|United States|93|Lavasoft|29-08-2015 17:12:18
Aaron Gonzalez|cereals|9|$48.06|Canada|96|Apple Systems|29-03-2015 07:08:15
Aaron M. Dawson|cereals|7|$15.03|Canada|88|Borland|17-03-2015 19:28:44
Aaron P. Merritt|sandwiches|4|$72.61|United States|92|Telefonica|17-08-2015 02:14:51
Aaron Snyder|sandwiches|5|$16.90|Canada|93|Google|14-06-2015 06:13:26
Aaron Stephenson|noodles|7|$6.20|Canada|92|Borland|15-05-2015 08:24:43
Abbot D. Keller|salads|5|$64.44|Canada|98|Borland|17-10-2015 02:05:24
Abbot E. Wilkinson|salads|10|$2.92|United Kingdom|91|Cakewalk|19-06-2015 15:20:04
Abbot F. Mann|cereals|8|$9.54|United States|94|Telefonica|28-04-2015 03:13:04
Abbot J. Romero|pies|8|$84.01|France|98|Cakewalk|26-06-2015 19:58:10

Some characters we might occasionally want to replace are special characters (i.e. they mean something in code). The dollar sign is one such character (as are the quote and the pipe, which may also turn up in places we don’t want in our source files). To replace those we need to escape them, which we do by adding a backslash in front of the character, as in the example below. Note that we’re replacing the dollar sign with nothing (the replacement string is empty), so effectively we’re just removing the dollar sign from the entire file.

#Get rid of the dollar sign in all lines of the noDuplicates file.
$ sed 's/\$//g' noDuplicates | head -4
Aaron F. Byers|soups|2|98.21|United States|93|Lavasoft|29-08-2015 17:12:18
Aaron Gonzalez|cereals|9|48.06|Canada|96|Apple Systems|29-03-2015 07:08:15
Aaron M. Dawson|cereals|7|15.03|Canada|88|Borland|17-03-2015 19:28:44
Aaron P. Merritt|sandwiches|4|72.61|United States|92|Lycos|17-08-2015 02:14:51

awk

awk is the command we’ll use for most of our arithmetic (i.e. sums). Awk is really a programming language which allows easy manipulation of structured data and the generation of formatted reports – however, we will only use a tiny part of its feature set for now.

So, let’s use sed to clean up our file again, only this time we’ll save the output to a new file as well –

$ sed 's/\$//g' noDuplicates > noDuplicatesNoDollarSign

Now that we have a clean file with no dollar signs (“$”) we can start using awk against it.

$ awk -F"|" '{sum+=$4} END {print sum}' noDuplicatesNoDollarSign
499336

The above line says:

  1. Set the field separator to the pipe character (-F “|”). Note that this pipe character is just our file’s column delimiter (remember, our source file is pipe separated) – it has nothing to do with the Unix pipes discussed earlier.
  2. For every line, increment the variable sum by the value in the fourth column ($4) (i.e. the total value of all orders).
  3. When the end of the file is reached, print the sum.
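
As a small extension (my own sketch, not from the original example), the same pattern can also keep a line count and print the average order value:

#sum the 4th column and count the lines, then print the average value per order
#(note the header line is also counted, so the average is very slightly off)
$ awk -F"|" '{sum+=$4; count++} END {print sum/count}' noDuplicatesNoDollarSign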

So, let’s have an example of combining commands and providing meaningful output.

For instance, how much money was spent by Lavasoft on all orders?

#output only the lines with Lavasoft in them and then sum using awk
$ grep Lavasoft noDuplicatesNoDollarSign | awk -F"|" '{sum+=$4} END {print sum}'
34589.5

Awk has another useful feature that we may end up using a lot – it can be used to print only specific columns of a file.

#set field separator to pipe then print the 7th column of the file. Use head to print only first 5 lines.
$ awk -F"|" '{print $7}' noDuplicatesNoDollarSign | head -5
Lavasoft
Apple Systems
Borland
Lycos
Google

Why is that useful? Well, we can easily find out all possible values of a specific column using the below –

#use awk to print the 7th column of the file, sort it, then just print the unique values once

$ awk -F"|" '{print $7}' noDuplicatesNoDollarSign | sort | uniq
Adobe
Altavista
Apple Systems
Borland
Cakewalk
Chami
Company
Finale
Google
Lavasoft
Lycos
Macromedia
Microsoft
Sibelius
Yahoo

Or alternatively, find the number of orders per company using uniq -c (print the number of occurrences of each value).

$ awk -F"|" '{print $7}' noDuplicatesNoDollarSign | sort | uniq -c
    741 Adobe
    761 Altavista
    729 Apple Systems
    748 Borland
    735 Cakewalk
    750 Chami
      1 Company
    714 Finale
    703 Google
    706 Lavasoft
    735 Lycos
    688 Macromedia
    705 Microsoft
    679 Sibelius
    706 Yahoo

Note, by the way, that the lone Company entry with a count of 1 is just the file’s header line. So, let’s write up a quick script to find the total value of transactions for all companies – these sorts of quick scripts are where Unix really shines. You can do all of these things using Python or a database, but for fast checks the Unix utilities are awesome.

Fire up your text editor (from environment setup) and copy paste the below –

#save the company names in a file - print the 7th column of the file, sort it and keep only the unique values
awk -F"|" '{print $7}' noDuplicatesNoDollarSign | sort | uniq > tempFile

#read tempFile line by line and, while there's a line to read, execute the commands between do and done
cat tempFile | while read line
do

#grep the company name, then sum the value of its transactions (the 4th column) and store the result in SUM
#an important note here is that $line (the value of each line read) is enclosed in double quotes - values
#such as Apple Systems contain a space and would otherwise break the command; the double quotes make
#Unix treat the value as a single parameter
SUM=`grep "$line" noDuplicatesNoDollarSign | awk -F"|" '{sum+=$4} END {print sum}'`

#print the company name and the sum, so we have a mini report
echo $line ":" $SUM

done
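
To run it, save the script under any name you like (companyTotals.sh below is just a placeholder name) and hand it to your shell:

#run the script with bash (the file name is arbitrary)
$ bash companyTotals.sh
#or make it executable once and then run it directly
$ chmod +x companyTotals.sh
$ ./companyTotals.sh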

So, that’s it – the examples above are fairly practical and can be extended to real world problems – it’s mostly a matter of grep-ing the right things and then modifying which columns to add.
