Java or Python for Natural Language Processing

Python vs Java for natural language processing

Assuming you don't have a platform restriction that would constrain your choice of language, choose based on whichever language you're most comfortable with (I prefer Python myself) and which has the best libraries for your application (as @GregHewgill pointed out, Python's Natural Language Toolkit is mature and comprehensive).

So while I personally would choose Python, it's really something you have to choose for yourself.

== EDIT ==

This question about Java NLP libraries might help you decide if you can use Java for your analysis; the top answer has a list you can investigate. Without more information about your problem set, I can't provide more specific advice.

Natural Language Processing in Java (NLP)

You should try Stanford NLP. It has many utilities and libraries for NLP, such as a part-of-speech tagger, all of which are great to use and easy to understand.

Is there a good natural language processing library

LingPipe is very nice and well documented. You can also take a look at:

  • OpenNLP
  • Stanford NLP
  • Apache UIMA
  • GATE
  • CogComp-NLP
  • FrameNet

The last one in particular might be of interest to you, although I don't know whether there are any readily available Java implementations (and maybe that's too big a gun for your problem anyway :-)

Paul's idea of using a DSL is probably easier and faster to implement, and more reliable to use for your customers. I, too, would recommend looking into that first.

Natural language processing

As I said in the comments, the question is not about a language, but about a suitable library. And there are a lot of NLP libraries in both Java and C++. I believe you should inspect some of them (in both languages) and then, once you know the full range of available libraries, come up with an overall plan for how to implement your task. So here I'll just give you some links with a brief explanation of what is what.

Java

GATE - it is exactly what its name means: a General Architecture for Text Engineering. An application in GATE is a pipeline: you put language processing resources like tokenizers, POS taggers, morphological analyzers, etc. on it and run the process. The result is represented as a set of annotations - meta information attached to a piece of text (e.g. a token). In addition to a great number of plugins (including plugins for integration with other NLP resources like WordNet or the Stanford Parser), it has many predefined dictionaries (cities, names, etc.) and its own regex-like language, JAPE. GATE comes with its own IDE (GATE Developer), where you can try out your pipeline setup and then save it and load it from Java code.

UIMA - or Unstructured Information Management Architecture. It is very similar to GATE in terms of architecture: it also builds a pipeline and produces a set of annotations. Like GATE, it has a visual IDE where you can try out your future application. The difference is that UIMA is mostly concerned with information extraction, while GATE performs text processing without explicit consideration of its purpose. UIMA also comes with a simple REST server.

OpenNLP - they describe themselves as an organizational center for open source NLP projects, and this is the most appropriate definition. The main direction of development is using machine learning algorithms for the most general NLP tasks like part-of-speech tagging, named entity recognition, coreference resolution, and so on. It also has good integration with UIMA, so its tools are available from there as well.

Stanford NLP - probably the best choice for engineers and researchers with NLP and ML knowledge. Unlike libraries such as GATE and UIMA, it doesn't aim to provide as many tools as possible, but instead concentrates on idiomatic models. E.g. you don't get comprehensive dictionaries, but you can train a probabilistic algorithm to create one! In addition to its CoreNLP component, which provides the most widely used tools like tokenization, POS tagging, NER, etc., it has several very interesting subprojects. E.g. its dependencies framework lets you extract the complete structure of a sentence; that is, you can, for example, easily extract information about the subject and object of a verb in question, which is much harder using other NLP tools.
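
To make that last point concrete, here is a minimal sketch of pulling a verb's subject and object out of a dependency parse. It is only an illustration: it assumes a CoreNLP server is already running locally on port 9000, and it uses NLTK's CoreNLPDependencyParser wrapper from Python rather than the Java API.

# Minimal sketch: extract subject/object relations from a dependency parse.
# Assumes a Stanford CoreNLP server is running on localhost:9000.
from nltk.parse.corenlp import CoreNLPDependencyParser

parser = CoreNLPDependencyParser(url='http://localhost:9000')
parse, = parser.raw_parse('The dragon guards the treasure.')

# triples() yields ((governor, tag), relation, (dependent, tag)) entries.
for (gov, _), rel, (dep, _) in parse.triples():
    if rel in ('nsubj', 'obj', 'dobj'):  # subject and (direct) object relations
        print(rel, gov, '->', dep)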

C++

UIMA - yes, there are complete implementations for both Java and C++.

Stanford Parser - some of Stanford's projects are in Java only, others in C++ only, and some are available in both languages. You can find many of them here.

APIs

A number of web service APIs perform specific language processing, including:

Alchemy API - language identification, named entity recognition, sentiment analysis and much more! Take a look at their main page - it is quite self-descriptive.

OpenCalais - this service tries to build a giant graph of everything. You pass it a web page URL and it enriches the page's text with the entities it finds, together with the relations between them. For example, you pass it a page mentioning "Steve Jobs" and it returns "Apple Inc." (roughly speaking), together with the probability that this is the same Steve Jobs.

Other recommendations

And yes, you should definitely take a look at Python's NLTK. It is not only a powerful and easy-to-use NLP library, but also part of an excellent scientific stack created by an extremely friendly community.
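
Getting started takes only a few lines; a minimal sketch, assuming the 'punkt' tokenizer and the default POS tagger models have been fetched once via nltk.download():

import nltk

nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

# Tokenize a sentence and tag parts of speech.
tokens = nltk.word_tokenize("GATE and UIMA are Java frameworks; NLTK is Python.")
print(nltk.pos_tag(tokens))
# e.g. [('GATE', 'NNP'), ('and', 'CC'), ('UIMA', 'NNP'), ...]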

Update (2017-11-15): 7 years later there are even more impressive tools, cool algorithms and interesting tasks. One comprehensive description may be found here:

https://tomassetti.me/guide-natural-language-processing/

Natural Language Processing Solution in Java?

Two popular ones that I know of are:

GATE

OpenNLP

Natural Language Processing of Topics

I suggest TextBlob, since it simplifies the process of training a classifier. See the tutorial here on how to build a text classifier. Of course, for your specific problem you need to decide how many different categories you want to classify into; you then train the classifier on a reasonably sized training set (not too large, to avoid overfitting); at that point your classifier will be ready to take new data of the form

"dragons": {
"category": "lifestyle",
"category_id": 17,
"score": 0.279108277990115,
"topic_id": 2137
}

and classify it. At that point you have to evaluate the classification against a test dataset.
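A minimal sketch of that train / classify / evaluate loop with TextBlob's NaiveBayesClassifier; the tiny training and test sets below are invented placeholders, and in practice you would build them from your real topic data:

from textblob.classifiers import NaiveBayesClassifier

# Invented placeholder data: (text, label) pairs.
train = [
    ("dragons legendary creatures furry fandom", "lifestyle"),
    ("cosplay and mythical beasts", "lifestyle"),
    ("wine tasting with friends", "socializing"),
    ("craft beer meetup downtown", "socializing"),
]
test = [
    ("unicorns and other mythical creatures", "lifestyle"),
    ("whisky tasting night", "socializing"),
]

cl = NaiveBayesClassifier(train)
print(cl.classify("dragons and griffins"))  # expected: 'lifestyle'
print(cl.accuracy(test))                    # evaluation on held-out data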
Evaluation is not as obvious as it seems, by the way. Looking at this mini dataset (could you provide a bigger one? it would help), it seems that you have some clusters of data like:

First cluster, tagged as lifestyle:

"dragons": {
"category": "lifestyle",
"category_id": 17,
"score": 0.279108277990115,
"topic_id": 2137
},
"furry-fandom": {
"category": "lifestyle",
"category_id": 17,
"score": 0.279108277990115,
"topic_id": 48595
},
"legendarycreatures": {
"category": "lifestyle",
"category_id": 17,
"score": 0.279108277990115,
}

Second cluster, tagged as socializing:

"wine": {
"category": "socializing",
"category_id": 31,
"score": 0.0,
"topic_id": 611
}

To define your super-category, you have to tell the classifier that terms like dragons and legendarycreatures belong to the same class; let's call it fantasy. So this is not only a matter of classification, but of text analysis and semantics as well: legendarycreatures => legendary + creatures (bag of words) is closer in distance to the term dragons than to other words, so word2vec could help here to compute vectors for those names and to define a metric and the distances between them. A good implementation is provided by gensim.

I'm mentioning word2vec since it will work whether or not you have a text description for each of those entries. In the latter case you can just define a metric on the title of the item, like dragons or legendarycreatures.
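
A small gensim sketch of that idea. The toy corpus is invented, so with this little data the similarities are not meaningful; note also that gensim releases before 4.0 call the vector_size parameter size:

from gensim.models import Word2Vec

# Invented toy corpus: one tokenized "description" per topic.
corpus = [
    ["dragons", "are", "legendary", "creatures"],
    ["furry", "fandom", "loves", "mythical", "creatures"],
    ["wine", "tasting", "is", "a", "social", "event"],
]

model = Word2Vec(corpus, vector_size=50, min_count=1, epochs=50)

# Cosine similarity between the learned vectors.
print(model.wv.similarity("dragons", "creatures"))
print(model.wv.similarity("dragons", "wine"))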

[UPDATE]
So, to figure out the right classification algorithm, I'm trying a relatively new tool that "automatically creates and optimizes machine learning pipelines using genetic programming": TPOT, made by @rhiever.

In this case, the tool needs the feature vectors (from word2vec) as input, which must be provided in a supervised dataset format. Here is the discussion, which is a good starting point.
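
A hedged sketch of feeding such feature vectors into TPOT; X and y below are random placeholders standing in for the word2vec vectors and the category labels:

import numpy as np
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# Placeholders: 100 topics x 50-dimensional word2vec vectors, 2 categories.
X = np.random.rand(100, 50)
y = np.random.randint(0, 2, 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('best_pipeline.py')  # writes the winning scikit-learn pipeline to a file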


