
Introduction to Natural Language Processing

May 6 2019



Natural Language Processing, commonly referred to as NLP, is a field dealing with human-computer interaction, specifically in relation to language. NLP is inevitably intertwined with research in Computer Science, Linguistics and Artificial Intelligence. The range of research topics of NLP is broad and varied — a comprehensive list can be found on the Wikipedia website.

To start off, let's briefly consider IBM's Watson, the computer system that competed on the quiz show Jeopardy!, as a recent example, and one which is easy to understand for non-specialists. As explained in the video presentations that opened this section, the task here seems at first quite trivial: a computer hears a Jeopardy! question, then it computes and returns the answer. Trivial, however, it certainly is not, as further explained in the news item 'Trivial, it's not'. Have a look at the article and see what you think.

1.1 Why NLP for historians?

There are various examples of NLP being used by historians. For instance, in the opening section of this course we showed you a video where Professor Tim Hitchcock talked about text mining for the Proceedings of the Old Bailey Online project.

Another excellent example of NLP in use by historians can be found on the blog historying, where Cameron Blevins has discussed over several posts his use of NLP to understand the diaries of a midwife named Martha Ballard.

You can find this here.

NLP can be useful to any historian who deals with large amounts of text, as long as that text is available in a digital format. Increasingly, swathes of data are being made available online, and if you are a modern historian many of your sources may be born-digital. NLP offers a way into such large bodies of data.

1.2 Aims and organisation

This section of the module aims to give you an introduction to NLP and a taste of its possible applications to the historical domain. In order to be as accessible as possible to a non-technical audience this section will focus more on basic concepts of NLP and on how to use existing implementations of NLP tools.

At the end of the course you will have mastered the main technical concepts which underlie many NLP applications, such as tokenisation, part-of-speech tagging and named entity recognition. You will also be able to apply to historical texts pre-existing NLP tools drawn from robust frameworks such as the Natural Language Toolkit (NLTK), an NLP toolkit written in Python.

Since it is impossible to teach NLP without using a programming language, we will use Python as the programming language for the code examples.

Although we talked a little about this in the previous section, this course is not meant to be an introduction to Python and you will need to look elsewhere for a more detailed introduction. For this, have a look at some of these other resources:

  • Dive into Python — DiP version 5.4
  • The Programming Historian
  • Natural Language Processing with Python NLTK — page references are to the 1st edition (2009)

Or work your way through this free online tutorial for using Python.

For the purposes of this course we will guide you through the basics of Python to enable you to complete the tasks regarding NLP. This should give you an understanding of what you might achieve, but to use it properly you will need to go further.

1.3 Using the code examples

Generally speaking, all the code snippets that are given can be copied and pasted into a Python interactive shell session.

The code snippets have been tested with Python 2.7, so it is recommended that you use the same version. The step-by-step installation instructions in the exercise below are for Windows machines; for Macs you can also use the instructions on the NLTK documentation website. Mac and Linux machines will usually have Python pre-installed, but you will probably need to add the three add-on modules: NLTK itself, Numpy and Yaml.

To install these modules on a Mac it is easier if you install Setuptools first. Type this in the Terminal, from the top level of your home directory:

sudo sh setuptools-0.6c11-py2.7.egg

Enter your password when prompted.

When downloading is finished you should be able to use the easy_install command for the rest of the modules; with the sudo command you’ll need to enter your password each time, to confirm the command:

sudo easy_install pip

When this has downloaded you need to do the same for numpy:

sudo pip install -U numpy

and, finally…

sudo pip install -U pyyaml nltk

To install these modules on a Linux machine, it is probably as simple as getting them through your distribution’s package manager (such as the Ubuntu Software Center); you may find that numpy comes pre-installed with Python, as seems to be the case with many recent distributions.

Exercise

1. Download Python 2.7 to your computer if you have not already done so.

2. On your computer locate the Python folder. Open up the interactive shell session — the best version to use is called IDLE.

3. You will become familiar with this piece of software very soon. However, for now close IDLE down. You now have the programming software installed but no data to use with it. To avoid errors appearing later, we are going to download three separate files, one after the other.

Install Numpy (download)

Install NLTK (this stands for Natural Language Toolkit) (download). Choose the Windows installer option.

Install Yaml (download). Choose the 2.7 version (ending in Py2.7.exe).

4. This step is optional because it is not necessary for this course. But if you want to follow some of the exercises in the NLTK handbook you will need to install the corpora that those exercises use: open up IDLE and type the following to integrate these files with your software.

>>> import nltk
>>> nltk.download()

These commands will bring up a second window. Choose the All Packages option and click Download. This download will take some time.
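If you prefer not to download everything, nltk.download() also accepts the identifier of a single package, for example the pre-trained sentence tokeniser used in section 2.5 (a sketch of our own; the identifiers come from the standard NLTK data index):

>>> import nltk
>>> nltk.download("punkt")       # the sentence tokeniser models used in section 2.5
>>> nltk.download("stopwords")   # a small list of very common words, often useful later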

If you receive error messages on IDLE check the NLTK documentation pages for further advice. Also check this page if you are using a Mac or wish to download from source.

You have now installed nearly everything that you will need to follow the rest of this course. Windows users should find all of the downloaded material on the hard drive in a folder called Python27, generally reached via Start — My Computer — Local Drive (C:) — Python27; on non-Windows systems the NLTK data is usually stored in ~/nltk_data in your home directory. You will need to add further documents directly to this folder later in this module, but don't worry too much about it now, just note the location.

Whenever you see the triple angle bracket sign (>>>) in the examples, as in the following command

>>> import nltk

keep in mind that you should not type the angle brackets into the Python interactive shell or you will get an error. They are simply the shell's prompt, included here to remind you that these commands are meant to be executed from the interactive shell.

However, you can use a tool such as IPython, which provides an enhanced interactive shell where, for instance, you can paste commands that still include the triple angle bracket prompt and they will still be executed. IPython also provides other useful features, such as inspection of modules, methods and functions by adding a question mark "?" just after the object you want to inspect. For the purposes of this module we will not be looking any further at IPython, but if you want to go further it is worth a look.

Note:

Every time you open up a new session of Python in this course you must begin with the command

>>> import nltk

This command instructs Python to load the NLTK module (the Natural Language Toolkit). The import command is a central feature of Python (if you looked at the Python script in the Regular Expressions and Scripting section of this course you may remember that it was necessary to use "import re" before you could run regular expressions). If you forget to import what you need, you will usually get an error telling you so.
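For example, a hypothetical session that forgets the import first (traceback abbreviated):

>>> nltk.data.path         # fails: nltk has not been imported yet
Traceback (most recent call last):
  ...
NameError: name 'nltk' is not defined
>>> import nltk
>>> type(nltk)             # the module is now available
<type 'module'>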

2. Text processing

Now that you have downloaded Python we are going to break down the steps required to transform a corpus of text files into a corpus ready for further processing. As discussed in the previous section of the course, we need to make sure that the text is not only machine readable but that we can separate text into smaller units (e.g., sentences, tokens).

We will be focusing on creating strings in Python. String is the term used for a piece of text handled by the program. Thus, when we ask you to create a string we are in fact asking you to specify some text, which can then be matched against a corpus of texts so that meaningful results can be discovered. For example, if we create a string designed to search for all occurrences of the word 'snow', then this is exactly what it will do. If you recall, this is precisely what happened in the Python example in the previous section, where we created a search string for the words "session of the".
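For example, a minimal sketch of our own, echoing the kind of search used in the previous section, counting how often a search string occurs inside a longer string:

>>> text = "snow fell on the snowy hills, and more snow is forecast"
>>> print text.count("snow")                    # every occurrence, including inside "snowy"
3
>>> import re
>>> print len(re.findall(r"\bsnow\b", text))    # whole-word matches only
2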

Most of the processing steps covered in this section are commonly used in NLP and are chained together into a single executable flow, usually referred to as the NLP pipeline. Another term you will meet is concatenation, the operation of joining two character strings end-to-end. For example, the strings "snow" and "ball" may be concatenated to give "snowball".
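As a quick illustration of concatenation (our own example, not from the original course), in the interactive shell:

>>> first = "snow"
>>> second = "ball"
>>> print first + second
snowball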

To be able to make such strings work, however, you need to be aware of a few other things, specifically Unicode. We will begin there, but don’t worry if you don’t yet fully understand what we mean by strings and the NLP pipeline, as we will be going into this shortly.

2.1 Working with Unicode

Let’s start then from the very basics. How are texts represented digitally?

Unicode is a standard for representing text electronically and currently supports more than a million characters. Each character is assigned a number, technically called a code point. Unicode is implemented in different encodings, such as UTF-8, UTF-16 and UTF-32; the difference between these encodings lies in the number of bytes used to store each code point. ASCII, the older encoding format whose limitations Unicode was designed to overcome, uses a single byte for each character and can therefore represent only a very limited set of them.
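To make this concrete, here is a small sketch of our own showing how many bytes a single character (the euro sign) occupies in each encoding; the "-le" variants are used simply to avoid counting a byte-order mark:

>>> euro = u"\u20ac"                      # the euro sign: a single Unicode code point
>>> print len(euro.encode("utf-8"))       # 3 bytes in UTF-8
3
>>> print len(euro.encode("utf-16-le"))   # 2 bytes in UTF-16
2
>>> print len(euro.encode("utf-32-le"))   # 4 bytes in UTF-32
4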

From a practical point of view, it is essential to know the basics of Unicode in order to avoid annoying errors related to the encoding format when reading from or writing to files. You may wish to familiarise yourself further with this in the NLTK document: pp. 93–97, online section 3.3, or Dive into Python: section 9.4 (pp. 125–129).

Let’s see a first example of how strings are handled in Python.

Exercise 1:

The first snippet creates and prints a string. Open up Python (IDLE) and try the following:

1. Create a string. Type the following into Python:

>>> my_string = "hello world!"

Remember not to type the angle brackets, which represent the command prompt. If all goes well, when you press return, Python will just give you another prompt.

What we’re doing here is assigning a variable, which we’re arbitrarily calling my_string.

2. Now try the following. What do you get as a result?

>>> print my_string

Python remembers the value of our variable and promptly prints it out.

3. Print the type of the variable we just initialised

>>> print type(my_string)

Python will now tell you that your type is str, because it’s a string (i.e., str = string).

There are, of course, other data types in Python, but for obvious reasons this course will concentrate on strings (text). A couple of other common data types are integers, declared like this:

>>> my_int = 99

And Booleans, declared like this:

>>> my_boolean = True

You’ll notice the key difference: in declaring a string you put the value of the variable in quotation marks, but with integers, Booleans and others you don’t.

Try creating first an integer and then a boolean variable. Ask Python what data type each is and then ask Python to print out your new variables.
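For reference, here is roughly what that might look like (your variable names can of course differ):

>>> my_int = 99
>>> my_boolean = True
>>> print type(my_int)
<type 'int'>
>>> print type(my_boolean)
<type 'bool'>
>>> print my_int, my_boolean
99 True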

4. Create a Unicode string (as shown below):

This next string declaration creates a Unicode string. As you will notice, the main difference is the "u" used as a prefix before the string we are going to create. This tells the Python interpreter that we are creating a string and that we want Unicode to be used to represent it.

Type the following:

>>> my_ustring = u"hello world!"

5. Print the type of the variable we just initialised:

>>> print type(my_ustring)

You should get the response:

<type 'unicode'>

What is this snippet of code telling us? There are two ways of representing a string in Python:

  1. as a string
  2. as a Unicode string
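The two representations can be converted into one another: decode turns a plain string into a Unicode string, and encode does the reverse. A minimal sketch (the variable names are ours):

>>> my_bytes = "hello world!"
>>> my_text = my_bytes.decode("utf-8")     # str -> unicode
>>> print type(my_text)
<type 'unicode'>
>>> print type(my_text.encode("utf-8"))    # unicode -> str
<type 'str'>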

Tip: if you want to have Unicode-encoded text in a Python source file you need to add an encoding declaration at the very beginning of the .py file (the kind of script file we mentioned in the Regular Expressions section of the course). This tells the Python interpreter which encoding should be applied to the source file. Thus the file would begin:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
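For instance, a minimal, hypothetical script (the file name and contents are ours, purely for illustration) might look like this; without the coding line, the non-ASCII characters in the source would cause the Python 2 interpreter to refuse to run the file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Without the coding declaration above, the accented characters below
# would raise a SyntaxError in Python 2.
greeting = u"Grüße aus München!"
print greeting.encode("utf-8")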

Hands-on: reading Unicode from text files

Let’s open and read a file without specifying the text encoding.

1. Create a folder called web inside your data folder that you created for section 3, then download to that folder this file: sueddeutsche_article.txt (right click). If you have not created this folder yet, then go to the Python27 folder and create the folder data (i.e., on Windows c:\python27) and then web. Make sure both folder names are entirely in lower case.

2. In IDLE type:

>>> fname = "data/web/sueddeutsche_article.txt"
>>> file = open(fname, 'r')
>>> data = file.read()
>>> print type(data)

This will print <type 'str'>, confirming that the data has been read as a plain string. Next close down the file (Python normally does this for you if necessary, but it's good practice to do so explicitly):

>>> file.close()

To open and read a file directly into Unicode:

>>> import codecs
>>> file = codecs.open(fname, 'r', 'utf-8')
>>> data = file.read()
>>> print type(data)

This should print <type 'unicode'>, confirming that the data has been decoded into a Unicode string. Next close down the file as good practice:

>>> file.close()

Exercise 2:

Now try this with your own data.

1. Select or create a .txt file and add it to the c:\python27\data folder or the equivalent on your system.

2. Now try the above instructions but replacing the file name with that of your own .txt file.

2.2 Tokenisation

What is tokenisation? Tokenisation is the task of separating a portion of running text into words or tokens. Token is the name — more general than word — we give to the sequence of characters that we want to treat as a group. You will soon see why the distinction is important. In any NLP pipeline tokenisation is one of the first necessary steps we need to take to make our texts ready for further processing. What we consider a token can vary according to the task we are performing or the domain we are dealing with.

As you will see in the following examples, there are different ways to deal with the tokenisation of texts. A straightforward difference is, for example, how punctuation is treated, particularly with regard to abbreviations. The string "e.g." can either be split into two tokens ("e.", "g.") or be properly recognised as an abbreviation and thus tokenised as a single token ("e.g.").
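As a quick illustration (our own example, using the two NLTK tokenisers that appear later in this section), here is how differently the abbreviation "e.g." can be treated:

>>> from nltk.tokenize.regexp import WhitespaceTokenizer, WordPunctTokenizer
>>> sample = "Some tasks, e.g. tokenisation, look simple."
>>> print WhitespaceTokenizer().tokenize(sample)
['Some', 'tasks,', 'e.g.', 'tokenisation,', 'look', 'simple.']
>>> print WordPunctTokenizer().tokenize(sample)
['Some', 'tasks', ',', 'e', '.', 'g', '.', 'tokenisation', ',', 'look', 'simple', '.']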

2.3 Word count

After tokenising a text, the first figure we can calculate is the word frequency. By word frequency we indicate the number of times each token occurs in a text. When talking about word frequency, we distinguish between types and tokens: types are the distinct words in a corpus, whereas tokens are the running words, that is, all the words including repeats. Let's see how this works in practice.

Let’s take as example one of the sentences above:

Types are the distinct words in a corpus, whereas tokens are the running words.

How many types and tokens are there in the above sentence? Try to work it out before reading on.

Let's see how we can use Python to calculate these figures. First of all, let's tokenise the sentence using a tokeniser which simply splits the text on whitespace.

>>> from nltk.tokenize.regexp import WhitespaceTokenizer
>>> my_str = "Types are the distinct words in a corpus, whereas tokens are the running words."

Note here that we used a slightly different syntax for importing from a module. You’ll recognise by now the variable assignment. Now type:

>>> tokens = WhitespaceTokenizer().tokenize(my_str)
>>> print len(tokens)

You should get the answer 14. Next type:

>>> my_vocab = set(tokens)
>>> print len(my_vocab)

This time the answer is 12: the number of distinct types, since "are" and "the" each occur twice (and note that "words" and "words." still count as two separate types, because the full stop was not split off).

Now we are going to perform the same operation but using a different tokeniser.

>>> my_str = "Types are the distinct words in a corpus, whereas tokens are the running words."

We'll now import a different tokeniser:

>>> from nltk.tokenize.regexp import WordPunctTokenizer

This tokeniser also splits our string into tokens:

>>> my_toks = WordPunctTokenizer().tokenize(my_str)
>>> print len(my_toks)

The answer should be 16.

>>> my_vocab = set(my_toks)
>>> print len(my_vocab)

The answer should be 13.

What is the difference between the two approaches? In the first one, the vocabulary ends up containing "words" and "words." as two distinct types, because the full stop stays attached to the word. In the second, the full stop is split off: "words" appears as a single type and "." (the dot) becomes a separate token, and hence an extra type of its own.
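If you want to see the difference for yourself, print the two vocabularies. This continues the same session, where tokens still holds the whitespace tokens and my_vocab the vocabulary produced by WordPunctTokenizer:

>>> print sorted(set(tokens))    # whitespace vocabulary: contains "corpus," and "words."
>>> print sorted(my_vocab)       # WordPunctTokenizer vocabulary: "," and "." are types of their own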

You can see here the subtleties inherent even in a fairly simple idea, and why we avoid using a phrase like “word count” and prefer the term tokens.

2.4 Hands-on: tokenisation and word frequency

Let's now see in practice how tokenisation influences the results we get from analysing a text.

>>> import codecs
>>> file = codecs.open("data/web/sueddeutsche_article.txt", 'r', 'utf-8')
>>> data = file.read()

So we have read our text file. Don’t forget that you can read this file yourself if you want to have a sense of what’s in it. You can even swap it, in these examples, for a file of your own choice, if you prefer. Now we’ll import the whitespace tokeniser from NLTK:

>>> from nltk.tokenize.regexp import WhitespaceTokenizer
>>> tokens = WhitespaceTokenizer().tokenize(data)
>>> from nltk import FreqDist
>>> frequency_distribution = FreqDist(tokens)
>>> print len(frequency_distribution.hapaxes())

This gives us the number of hapaxes — tokens which only appear once in the text. If you used our example file you should have got the result 216. At this point you might want to see the tokens listed; if so just append this line after the code above:

>>> print frequency_distribution.hapaxes()
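The same FreqDist object can answer a few other simple questions. A brief sketch of our own (the exact figures depend on your file, and u"die" is just an example token):

>>> print frequency_distribution.N()         # total number of tokens counted
>>> print len(frequency_distribution)        # number of distinct types
>>> print frequency_distribution.max()       # the single most frequent token
>>> print frequency_distribution[u"die"]     # how often one particular token occurs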

Now let’s tokenise using a different method. If you’re entering all this code into IDLE in one go you won’t need to repeat the first three lines of code at the top of the page, because the file is still read into memory. If you’re coming back after starting a new session you’ll need to repeat the first lines because, of course, IDLE won’t remember anything from previous sessions.

>>> from nltk.tokenize.regexp import WordPunctTokenizer
>>> tokens = WordPunctTokenizer().tokenize(data)
>>> frequency_distribution = FreqDist(tokens)
>>> print len(frequency_distribution.hapaxes())

If you used our example you should have got the result 205. If you used your own file, your two results will again most likely differ from each other. As before, you can see the list with:

>>> print frequency_distribution.hapaxes()

You might be surprised that words like Der and Die occur only once in this German text. This is because the counts are case-sensitive: Der and der are treated as different tokens, and it is only the capitalised forms that happen to appear just once.
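If you would rather count case-insensitively, one option is to lowercase the text before tokenising. A minimal sketch, continuing the same session so that data, WordPunctTokenizer and FreqDist are still in memory:

>>> lower_tokens = WordPunctTokenizer().tokenize(data.lower())
>>> lower_frequency = FreqDist(lower_tokens)
>>> print len(lower_frequency.hapaxes())

The number of hapaxes will usually shrink, since Der and der now count as the same type.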

To sum up, in the example above we have:

  • read a text file into memory
  • tokenised the text using two different methods
  • computed some basic word frequencies over the extracted tokens

2.5 Sentence segmentation

Another preliminary step that is commonly performed on texts before further processing is so-called sentence segmentation, or sentence boundary detection: the process of dividing a running text up into sentences. One aspect which makes this task less straightforward than it sounds is that punctuation marks such as the full stop can mark either the end of a sentence or an abbreviation and the like.

The NLTK framework includes an implementation of a sentence tokeniser — that is, a program which performs sentence segmentation — which can handle texts in several languages. This tokeniser is called PunktSentenceTokenizer and is based on the publication by Kiss, T. & Strunk, J., 2006. Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics, 32(4), pp. 485–525.

The code below contains a number of programming and/or NLP concepts that might be unclear to you at the moment. Nevertheless, try it out — it should work provided that you have correctly set up your Python + NLTK environment. The concept of training a piece of software to perform a given task will become clearer after reading the next section, but for the time being just take this as an example of how sentence segmentation works, as the input and output of the example are intuitively understandable.

>>> import codecs

>>> import nltk

>>> text = codecs.open("data/web/nytimes_article.txt", 'r', 'utf-8').read()

The following bit of code might look a bit obscure so let’s talk through it.

The variable sentence_tokenizer contains an instance of the class nltk.tokenize.punkt.PunktSentenceTokenizer. This instance is not created from scratch but is loaded from disk — specifically from the path 'tokenizers/punkt/english.pickle'. It is an instance of PunktSentenceTokenizer that has already been trained on English text and then saved by serialising (pickling) it to a file.

>>> sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

Now we simply call the method tokenize, passing the text to be split into sentences as its only parameter:

>>> result = sentence_tokenizer.tokenize(text)

The following prints the number of sentences into which the text was split:

>>> print len(result)

Finally, let's print the list of sentences, each preceded by its number in brackets. Note that this is the first time we have used the for construction. In IDLE you won't get a prompt for the second line but an indent, because the interpreter is waiting for the instruction that follows the colon. Enter the second line and press return twice to execute the command:

>>> for n, sentence in enumerate(result):
        print "(%i) %s" % (n+1, sentence)
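Once sentence_tokenizer is loaded you can also experiment with short strings of your own to see how it handles abbreviations. A small sketch (the example sentence is ours, not from the course): the tokeniser should treat "Mr." as an abbreviation rather than a sentence boundary, while still splitting after "1990."

>>> sample = u"Mr. Brown visited the U.K. in 1990. He stayed for two weeks."
>>> for n, sentence in enumerate(sentence_tokenizer.tokenize(sample)):
        print "(%i) %s" % (n+1, sentence)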

Exercise:

Again, try this methodology with an English-language file of your own. Repetition will make these processes seem much easier.

2.6 Further resources:
  • to see more kinds of tokenisation in action check out this demo.
  • you can try it out by using the data from the example above just by copying and pasting the file content into the browser.
  • read more about object serialisation from Python’s documentation.