nltk: all the computational linguistics you could ever want, and then some

What can you do? 

nltk is a Python module that contains probably every text processing module you’ve ever had a vague inkling of a need for.  It contains corpuses of language for machine learning/training; word tokenizers (splits sentences into individual words or ngrams); part-of-speech taggers; parse parts of speech in sentences (with trees!); and much, much more.  It’s good for analyzing lots of text for sentiment analysis, text classification, and tagging mentions of named entities (people, places, and companies).

How do you get started?

The creators of nltk have published a book for free online that explains how to use many of the features nltk has.  It explains how to do things like access the corpora that nltk has; categorize words; classify text; and even build grammars.  Basically, the best way to get started is install nltk, then go through the book and try the examples they present.  They include code examples in the book so you can follow along and practice using different functions and corpuses.  There’s also a wiki attached to the github and stackoverflow, where programmers go when they’re lost, is of course a useful (but often very specific) resource.  The learning curve required to become comfortable leveraging the different functions available is fairly steep because they are so many and so specialized, and in my opinion the best way to gain that comfort level is to simply play around with nltk and build cool things to gain experience.  Simply reading the book, while interesting, won’t be enough to become good at using nltk.

How easy or hard is it?

Well, it’s certainly easier than writing all of this from scratch, no matter how competent a programmer you are.  The one thing that can be difficult with Python modules is that you’re not entirely sure what’s under the hood unless you get cozy with the source code.  That means you might not be sure what’s causing a performance issue, why it doesn’t like your input, or why your output looks a certain way.  Also, figuring out exactly which function to use for a specific task might be somewhat confusing as well unless you have a certain amount of experience in machine learning or know exactly what you want (it’s hard to go wrong with tokenization).  For example, the built-in classifier is only as good as the features you feed it; giving it too many high-dimensionality items might result in overfitting or just horrendously slow code, and giving it low-dimensionality items might mean it can’t classify the items effectively.  Experience with Python datatypes and object-oriented programming is also very, very important; if you don’t understand what a function is, what list comprehensions look like, and how Python dictionaries work, the example code given in the book will be incomprehensible.  Even though the printouts from the example code look very nice and fancy and clean, the knowledge behind their creation (how do you print things that look nice? what is a development set? how do you use/leverage helper functions like tokenizer and the nltk function that gets the n most common words/letters? how do decision trees work?) is far from simple.  Anyone with programming experience can use the simpler functions very effectively and the less simple functions with probable success, but in my opinion knowing how classifiers and parsers work is important to use them well.  The bottom line is that they’re only as good as what you feed them, and understanding how definitive or accurate their output is requires a degree of understanding of what’s under the hood.

Would I recommend this to a friend?

If that friend had a similar programming background to me (can write Python code pretty well; knows a little bit about machine learning) I’d recommend it with little reservations other than a warning about the learning curve and the overwhelming abundance of options.  I’d still suggest they at least skim the book and keep stackoverflow close at hand (although that’s true for most programming projects that venture into unknown territory).  If my friend wasn’t comfortable with machine learning, I’d suggest they read up on Wikipedia about whatever classifiers they use so they have an idea of why the classifier misbehaves, if it does, or what errors it’s likely to make.  And if they weren’t comfortable with programming, I’d suggest they look into other natural language processing tools.  This is a tool that’s made by programmers and scientists, and it shows in the documentation, the resources, and the wealth of options available to those who know how to use them.

tl;dr: nltk has a ton of really cool natural language processing tools.  However, they are by no means idiot-proof, and you will be sad if you don’t know Python.  One does not simply download nltk and spit out useful results in five minutes.  


Leave a Reply

Your email address will not be published. Required fields are marked *