Is There a Good Natural Language Processing Library

Is there a good natural language processing library

LingPipe is very nice and well documented. You can also take a look at:

  • OpenNLP
  • Stanford NLP
  • Apache UIMA
  • GATE
  • CogComp-NLP
  • FrameNet

The last one specifically might be of interest to you, although I don't know whether there are any readily available Java implementations (and maybe that's too big of a gun for your problem anyway :-)

Paul's idea of using a DSL is probably easier and faster to implement, and more reliable to use for your customers. I, too, would recommend looking into that first.

Natural language processing

As I said in comments, the question is not about a language, but about suitable library. And there are a lot of NLP libraries in both Java and C++. I believe you must inspect some of them (in both languages) and then, when you will know all the plenty of available libraries, create some kind of "big plan", how to implement your task. So, here I'll just give you some links with a brief explanation what is what.

Java

GATE - it is exactly what its name means - General Architecture for Text Processing. Application in GATE is a pipeline. You put language processing resources like tokenizers, POS-taggers, morphological analyzers, etc. on it and run the process. The result is represented as a set of annotations - meta information, attached to a peace of text (e.g. token). In addition to great number of plugins (including plugins for integration with other NLP resources like WordNet or Stanford Parser), it has many predefined dictionaries (cities, names, etc.) and its own regex-like language JAPE. GATE comes with its own IDE (GATE Developer), where you can try your pipeline setup, and then save it and load from Java code.

UIMA - or Unstructured Information Management Applications. It is very similar to GATE in terms of architecture. It also represents pipeline and produces set of annotations. Like GATE, it has visual IDE, where you can try out your future application. The difference is that UIMA mostly concerns information extraction while GATE performs text processing without explicit consideration of its purpose. Also UIMA comes with simple REST server.

OpenNLP - they call themselves organization center for open source projects on NLP, and this is the most appropriate definition. Main direction of development is to use machine learning algorithms for the most general NLP tasks like part-of-speech tagging, named entity recognition, coreference resolution and so on. It also has good integration with UIMA, so its tools are also available.

Stanford NLP - probably best choice for engineers and researchers with NLP and ML knowledge. Unlike libraries like GATE and UIMA, it doesn't aim to provide as much tools as possible, but instead concentrates on idiomatic models. E.g. you don't have comprehensive dictionaries, but you can train probabilistic algorithm to create it! In addition to its CoreNLP component, that provides most wildly used tools like tokenization, POS tagging, NER, etc., it has several very interesting subprojects. E.g. their Dependency framework allows you to extract complete sentence structure. That is, you can, for example, easily extract information about subject and object of a verb in question, which is much harder using other NLP tools.

C++

UIMA - yes, there are complete implementations for both Java and C++.

Stanford Parser - some Stanford's projects are only in Java, others - only in C++, and some of them are available in both languages. You can find many of them here.

APIs

A number of web service APIs perform specific language processing, including:

Alchemy API - language identification, named entity recognition, sentiment analysis and much more! Take a look at their main page - it is quite self-descriptive.

OpenCalais - this service tries to build giant graph of everything. You pass it a web page URL and it enriches this page text with found entities, together with relations between them. For example, you pass it a page with "Steve Jobs" and it returns "Apple Inc." (roughly speaking) together with probability that this is the same Steve Jobs.

Other recommendations

And yes, you should definitely take a look at Python's NLTK. It is not only a powerful and easy-to-use NLP library, but also a part of excellent scientific stack created by extremely friendly community.

Update (2017-11-15): 7 years later there are even more impressive tools, cool algorithms and interesting tasks. One comprehensive description may be found here:

https://tomassetti.me/guide-natural-language-processing/

Ideas for Natural Language Processing project?

  1. Obnoxious language filtering - I think this will reduce down to a process very similar to spam email filtering. That is, counting the frequency of a set of more-or-less 'obnoxious' words. It doesn't sound like you will get the scope to do anything particularly clever, unless you also use other sources of information (e.g. the structure of the social links shared between the sender and recipient, perhaps). On the other hand, online bullying is a very serious thing and you can bet Facebook/Myspace and the other social networking sites care a lot about tackling it.

  2. Stylistic Analysis - There has been some work done on this in various forms, often under the name authorship analysis. Shlomo Argamon does a lot of work in this area and you could probably discover a lot more from the references in his papers. One of the best ways to profile an author is to learn the distribution of their usage of a set of stopwords (a.k.a functional words), such as 'and' ,'but', 'if', etc. I think there's a lot more scope to do something new and interesting in this area - authorship analysis on internet data is a hard problem - but also a lot more scope to fail.

  3. Chat bot - You're right, this is a pretty standard project. It's also quite hard to measure success/failure. I think the project would be more compelling if it was a chat-bot with some kind of purpose, like answering questions in a limited domain, but that's something that's very difficult to do well.

The rest are really too vague to make any comments on, sorry.

There aren't any NLP libraries that I know of in OCaml, it's just not a particularly popular programming language. However, I do know of a machine learning library in Ocaml, called MEGAM, written by Hal Daume, who is a very good NLP researcher, which has been used for NLP tasks. I get a feeling that figuring out MEGAM and using it to do some NLP task might be too big a project to take on, however.

Some other ideas:

  • Sentiment Analysis - A very trendy area of research. You could make this task as easy or hard as you like, from scoring a document as positive/negative to extracting specific topics and generating a sentiment score for each one.
  • Coreference/Anaphora resolution - A difficult task but a very important one. Some approaches use a graph representation (each mention is a node with edges between them if they co-refer) to enforce things like transitivity.
  • Document Classification - You could try and learn a system on the StackOverflow data set to suggest tags for a given question. It's a fairly well known problem with some established techniques, but an it's interesting data set and has an obvious and useful application to the real world . You could also see if you can find specific features of a question (word choice, length, formatting, punctuation, etc.) that cause them to be voted highly.
  • Haiku Generation - Kind of a silly one, but I always thought it was an interesting idea. Syllable counting could be done with the CMU pronouncing dictionary. Should be a lot of fun, if not particularly useful.


Related Topics



Leave a reply



Submit