MITIE: MIT Information Extraction

pbnjay · on Aug 27, 2015

This looks pretty cool. Are there any comparisons to NLTK et al.? The examples seem pretty straightforward and well commented, but overall the documentation is a bit lacking.

Now the really important question: is it pronounced the same as "mai tai" or more like "mitty"?

phy6 · on Aug 27, 2015

It's very fast once the model(s) have been trained. Training can take a long time, though: it might be something like N^4, where N is the number of distinct types of features. Something about intersecting planes for each dimension.

It is used in several DARPA programs, including XDATA and MEMEX, to name a couple. One of the committers (Wade Shen) is actually now a DARPA PM who has taken over some programs from a previous PM you may remember from the MEMEX segment on 60 Minutes.

As far as speed goes, once trained we used it as part of a batch job to enhance various types of freetext and semi-structured text, and the performance was very good (I don't have numbers in front of me, but some groups should).

We would also wrap MITIE with a Tangelo wrapper so we could use it as a REST service (for other webapps to hit at runtime), posting freetext to it and getting back a list of entities and annotated freetext.
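To give a rough idea of what such a response could look like, here is a minimal sketch that turns tokens plus entity spans into a list of entities and an annotated string. The `annotate` function, the `[TAG text]` markup, and the response keys are all hypothetical, not what the Tangelo wrapper actually returned. The input shape (a token range paired with a tag) is only an assumption about what an extractor hands back, and the spans are assumed to be non-empty and non-overlapping.

```python
import json

def annotate(tokens, entities):
    """Wrap each entity span in [TAG ...] markers and list the entities.

    tokens   -- list of token strings
    entities -- list of (range, tag) pairs; the shape is an assumption
    """
    # Index spans by their first token so we can jump over them while walking.
    starts = {span.start: (span.stop, tag) for span, tag in entities}
    out, found, i = [], [], 0
    while i < len(tokens):
        if i in starts:
            stop, tag = starts[i]
            text = " ".join(tokens[i:stop])
            out.append(f"[{tag} {text}]")
            found.append({"tag": tag, "text": text})
            i = stop  # skip past the whole entity span
        else:
            out.append(tokens[i])
            i += 1
    return {"entities": found, "annotated": " ".join(out)}

# Made-up sentence and hand-written spans, standing in for model output.
tokens = "Alice Smith works at Acme .".split()
entities = [(range(0, 2), "PERSON"), (range(4, 5), "ORGANIZATION")]
response = annotate(tokens, entities)
print(json.dumps(response, indent=2))
```

A web handler would only need to tokenize the posted text, run the extractor, and return this dictionary as JSON.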

It can also work well on semi-structured text, for instance a table of semi-regular data that was pasted into a string, losing its pagination/formatting. This requires a tailored model but works well.

Training MITIE can be a bit challenging if you have too many types that might appear in similar locations in text. One of the DARPA teams built a MITIE trainer that lets a subject-matter expert (SME) annotate text in a web UI to help build the model, which is then run against the corpus of data in batch.

The stock model is built on newspaper data, IIRC, so it may not be suited to something like, say, tweets or books.

I hope this helps. I highly recommend checking it out if your project needs something like this. A lot of man hours went into developing it, and the developers would love for it to gain traction and have the technology transfer outside academia/defense. Drop them a line or a pull request!

Note: there have been suggestions for including some rudimentary, low-hanging-fruit post-processing techniques, like applying supplied regexes, whitelists, blacklists, pronoun dictionaries, etc. One variant was also looking to pull out relationships as well as entities as tagged fields.
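As an illustration of the kind of post-processing being suggested, here is a minimal sketch that filters and extends a list of extracted entities. Nothing here is part of MITIE: the `post_process` function, its parameters, and the `(tag, text)` entity shape are all hypothetical, and the sample text and model output are made up.

```python
import re

def post_process(entities, text, patterns=(), whitelist=None, blacklist=()):
    """Clean up (tag, text) entity pairs after extraction.

    patterns  -- (tag, regex) pairs; matches in `text` are added as entities
    whitelist -- dict mapping an exact entity text to the tag it should get
    blacklist -- entity texts to drop, compared case-insensitively
    """
    whitelist = whitelist or {}
    blocked = {b.lower() for b in blacklist}
    result = []
    for tag, value in entities:
        if value.lower() in blocked:
            continue  # known false positive, drop it
        # A whitelist entry overrides whatever tag the model assigned.
        result.append((whitelist.get(value, tag), value))
    seen = {value for _, value in result}
    for tag, pattern in patterns:
        for match in re.finditer(pattern, text):
            if match.group(0) not in seen:
                result.append((tag, match.group(0)))
                seen.add(match.group(0))
    return result

# Made-up text and deliberately imperfect "model output".
text = "Email Alice Smith at Acme: alice@example.org, sent Monday."
model_output = [("PERSON", "Alice Smith"), ("LOCATION", "Acme"), ("PERSON", "Monday")]
cleaned = post_process(
    model_output,
    text,
    patterns=[("EMAIL", r"[\w.+-]+@[\w-]+\.[\w.]+")],
    whitelist={"Acme": "ORGANIZATION"},
    blacklist=["monday"],
)
print(cleaned)
```

The blacklist runs before the whitelist here, so a blacklisted string is dropped even if it also appears in the whitelist. That ordering is a design choice of this sketch, not something from the suggestions themselves.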

mark_l_watson · on Aug 27, 2015

Thanks for that explanation. I bookmarked the site but was going to pass on playing with it before reading your post. I am most interested in generating relationships between named entities. I can find NEs in my NLP code but I can't generate links like "owns", "located at", etc.

bowyakka · on Aug 27, 2015

Results from the last time we ran a runtime performance benchmark:

http://gbowyer.freeshell.org/ner-perf.png

unclesaamm · on Aug 27, 2015

The people I used to work with called it "mighty".

phy6 · on Aug 27, 2015

Yes, we call it 'mighty'

eterps · on Aug 27, 2015

What can you do with it?

adamio · on Aug 27, 2015

natural language processing

ninjin · on Aug 27, 2015

Has this been published somewhere? The usage guide looks good, but is there a model description?

gherkin0 · on Aug 27, 2015

Is there any documentation for this anywhere?

The "binary relation detection" looks interesting, and I'd like to know more about it. Are there any other NLP libraries that can perform similar functions?

tycho01 · on Aug 28, 2015

Uses BLAS but no mention of cuBLAS to speed things up? Does that mean the linear algebra wasn't big enough a component to merit optimizing on?

yeukhon · on Aug 27, 2015

So what does MIT stand for?

phy6 · on Aug 27, 2015

Massachusetts Institute of Technology

It's a school you should know of.

yeukhon · on Aug 28, 2015

Okay. I was assuming that too, but I couldn't find a reference to the school in the README... hence the question. Okay...

fluffyllemon · on Aug 27, 2015

MIT is the Massachusetts Institute of Technology, largely regarded as one of the top universities in the world.

0xdeadbeefbabe · on Aug 27, 2015

They are anti-Harvard for sure.