Link Grammar Parser

by Davy Temperley, John Lafferty and Daniel Sleator
Maintained and extended by Linas Vepstas - <linasvepstas@gmail.com>, Dom Lachowicz - <domlachowicz@gmail.com>, the Open Cognition project and Abiword.

News

February, 2015: link-grammar 5.2.5 released! See below for a description of recent changes.

The 5.0.0 version of Link Grammar now uses a new license: the LGPL v2.1 license. Older versions remain available under the BSD license. This license change was made to allow greater participation in the project.

The new version includes the Persian and Arabic systems, which were previously distributed separately. It also includes prototype, experimental dictionaries for Hebrew and Turkish, and an expanded Lithuanian dictionary. In addition, the programming interfaces for Python and OCaml are now integrated, joining those for Java and Common Lisp. A shell script to run the JSON network parse server is included.

What is Link Grammar?

The Link Grammar Parser is a syntactic parser of English, Russian, Arabic and Persian (and other languages as well), based on Link Grammar, an original theory of syntax and morphology. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labelled links connecting pairs of words. The parser also produces a "constituent" (HPSG style phrase tree) representation of a sentence (showing noun phrases, verb phrases, etc.). The RelEx extension provides Stanford-style Dependency Grammar output.
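
As a rough illustration of the "set of labelled links connecting pairs of words" described above, a linkage can be modelled as a list of (left word, right word, label) triples. The sketch below is not the parser's API: the `Link` type, the helpers `links_of` and `describe`, and the hand-written linkage are all hypothetical. The labels D, S and O are used in the sense of the determiner, subject and object link types.

```python
from typing import List, NamedTuple


class Link(NamedTuple):
    """One labelled link between two word positions (hypothetical type)."""
    left: int    # index of the left word
    right: int   # index of the right word
    label: str   # link type, e.g. "S" for subject-verb


words = ["the", "cat", "chased", "a", "mouse"]

# A hand-written linkage for illustration; it was not produced by the parser.
linkage: List[Link] = [
    Link(0, 1, "D"),  # the / cat: determiner
    Link(1, 2, "S"),  # cat / chased: subject
    Link(2, 4, "O"),  # chased / mouse: object
    Link(3, 4, "D"),  # a / mouse: determiner
]


def links_of(index: int, links: List[Link]) -> List[Link]:
    """Return every link that touches the word at position `index`."""
    return [link for link in links if index in (link.left, link.right)]


def describe(link: Link) -> str:
    """Render one link as 'left --LABEL-- right'."""
    return f"{words[link.left]} --{link.label}-- {words[link.right]}"


for link in linkage:
    print(describe(link))
```

Storing word positions rather than word strings keeps the structure unambiguous when the same word occurs twice in a sentence.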

The theory of Link Grammar parsing and the original version of the parser were created in 1991 by Davy Temperley, John Lafferty and Daniel Sleator, at the time professors of linguistics and computer science at Carnegie Mellon University. It is the product of decades of academic research into grammar and morphology, and is discussed in numerous publications.

Ongoing development by OpenCog

The practical day-to-day mechanics of maintaining an open-source project made ongoing hosting by Carnegie-Mellon impractical. Thus, this, the main Link Grammar website, is hosted by AbiWord, while the source code is located at GitHub.

Ongoing development of Link Grammar is guided and supported by the Open Cognition project, where the parser plays an important role in the OpenCog natural language processing subsystem. Research and implementation are ongoing; current work includes investigations into unsupervised learning of language, unsupervised learning of morphology, semantically guided parsing and grammatically induced word-sense disambiguation.

A sibling project, RelEx, uses constraint-grammar-like techniques to extract dependency relations and assorted additional linguistic information, including FrameNet-style framing and reference (anaphora) resolution. The dependency output is similar to that of the Stanford parser. Its performance is comparable to the Stanford PCFG parsing model, and it is more than three times faster than the Stanford "lexicalized" (factored) model.

The NLGen and NLGen2 projects provide natural language generation modules based on, and compatible with, link-grammar and RelEx. They implement the SegSim ideas for NL generation. See the following NLGen demos: Demo of Virtual Dog Learning to Play Fetch via Imitation and Reinforcement, AI Virtual Dog's Emotions Fluctuate Based on Its Experiences, Demo of Embodied Anaphora Resolution and AI Virtual Dog Answers Simple Questions about Itself and Its Environment.

Although based on the original Carnegie-Mellon code base, the current Link Grammar package has evolved and changed in certain profound and important ways. There have been innumerable bug fixes, and performance has improved by more than an order of magnitude. Other notable differences include:

  • Actively maintained! New releases typically happen quarterly.
  • Russian dictionaries!
  • Morphology support!
  • Expanded English dictionaries, with many thousands of new words; dramatically improved parse coverage for a wide variety of constructions.
  • Merger of BioLG project changes, for improved parsing of biomedical text. This includes enhanced entity recognition, and precise identification of numeric quantities.
  • New bindings, including Ruby, Python, Perl, Lisp, Java and OCaml.
  • Support for UTF8 Unicode; Arabic and Persian dictionaries; prototype German dictionary.
  • Multi-threading support; a standard build system; pkg-config integration; a CMake config file; dynamic/shared library support; a TCP/IP-based parse server; fixes for non-Linux platforms, including Windows, MacOSX and FreeBSD.

Downloading Link Grammar

The system can be downloaded either as a tarball, or via git. The current stable version is Link Grammar 5.2.5 (February, 2015). Older versions are available here. Unstable, development versions are available via the link-grammar github repository.

Documentation

One of the best ways to obtain a solid, easy-to-understand overview of the parser is to review the original papers describing it, here, here, here and here. There is an extensive set of pages documenting the dictionary; specifically, the names of links and their meanings, as well as how to write new rules. There is also a short primer for creating dictionaries for new languages. The documentation for the programming API is here. Documentation for additions made in the 4.0 release is on the improvements page. A fairly comprehensive bibliography of papers written before 2004 is here (mirror).

Mailing Lists

The mailing list for Link Grammar discussion is at the link-grammar google group.

Linguistic Disclaimer

Link Grammar is a natural language parser, not a human-level artificial general intelligence. This means that there are many sentences that it cannot parse correctly, or at all. There are entire classes of speech and writing that it cannot handle, including Twitter posts, IRC chat logs, Valley-girl basilect, Old and Middle English, stock-market listings and raw HTML dumps.

Link Grammar works best with "newspaper English", as taught to and written by those educated in American colleges: standard-sized sentences, with good grammar, proper punctuation, and correct capitalization. Link Grammar has difficulties with the following types of textual input:

  • Phrases (that are not a part of a complete sentence).
  • Twitter posts. These tend to be sentence fragments, often lacking proper grammatical structure.
  • Any text containing a large number of spelling errors.
  • "Registers", such as newspaper headlines, where determiners are omitted; for example, "Thieves rob bank."
  • Bulleted lists, such as this.
  • Quotations within sentences (and parenthetical remarks). It does not currently understand the hierarchical nature of quotations.
  • Dialog, stage plays and movie scripts. Such dialog tends to consist of interleaved sentences.
  • Speech-to-text output. Such systems generate large numbers of mis-heard words that, taken at face value, cannot be a part of valid sentences. Even if such recognition were perfect, spoken English tends not to be as well-constructed or grammatical as written English.
  • Support for British English and Commonwealth English is poor. This includes any English dialects spoken in India, Pakistan, Nigeria, Bangladesh, South Africa, as well as former American protectorates, such as the Philippines. British and regional spelling of words is missing from the dictionaries.
  • Slang and various regional non-middle-class-American dialects. This includes most dialects spoken by anyone living in economically poor or under-educated geographical regions, whether in urban housing projects or the red-state small-town and rural poor. Self-identifying subgroup dialects are also not handled, such as drug-culture, gang-culture and hacker-culture.
  • Long run-on sentences. These can generate thousands of alternative parses in a combinatorial explosion.
It is hoped that the unsupervised learning of language proposal will be of sufficient power and ability to handle most of these exceptional cases. Work is ongoing.

Languages

Ranked in order of maturity.

English
The main English documentation is here.
Russian
A set of Russian dictionaries providing full coverage for the language has been incorporated into the main distribution as of version 4.7.10 (March 2013). An older version, by Sergey Protasov, from which these are derived, can be found at http://slashzone.ru/parser/. It includes link documentation (mirror) and subscript (morphology) documentation (mirror). Russian morpheme dictionaries can be had at http://aot.ru.

Documentation for the links and for the word classes is available as a list of examples.

Persian
The Persian dictionaries from Jon Dehdari have been incorporated into the main distribution, as of version 5.0.0 (April 2014). This includes a copy of the Persian stemming engine, as significant morphology analysis needs to be performed to parse Persian.
Arabic
The Arabic dictionaries from Jon Dehdari have been incorporated into the main distribution, as of version 5.0.0 (April 2014). These are derived from the older, original version (mirror). They require the Aramorph stemming package, which is included.
German
A small German dictionary is available as a part of the distribution. It contains roughly one thousand words. A brief description is provided here.
Lithuanian
A very small Lithuanian prototype dictionary has been created. It contains a few hundred words. A few basic sentences parse just fine; the current version focuses on morphological analysis coupled with grammatical analysis. Documentation is here.

A very rudimentary Lithuanian dictionary has been created; almost nothing works so far. Documentation is here.

Indonesian
A small Indonesian prototype dictionary has been created. It contains about one hundred words.
Hebrew
A very small Hebrew prototype dictionary has been created. It contains a few dozen words. Almost nothing works correctly (yet).
Turkish
A very small Turkish prototype dictionary has been created. It contains a few dozen words. Almost nothing works correctly (yet).
French, Luthor project
The Luthor project aims to develop a set of scripts to automatically construct Link Grammar linkage dictionaries by mining Wiktionary data. Current efforts are focusing on French. (This project appears to be defunct).

Adjunct Projects

The default distribution for Link Grammar includes bindings for Java, Python, OCaML, Common Lisp, and AutoIt, as well as a SWIG FFI interface file. Additional language bindings, and some related projects, are listed below:

RelEx Semantic Relation Extractor
RelEx is an English-language semantic relationship extractor, built on the Link Parser. It can identify subject, object, indirect object and many other relationships between words in a sentence. It will also provide part-of-speech tagging, noun-number tagging, verb tense tagging, gender tagging, and so on. RelEx includes a basic implementation of the Hobbs anaphora (pronoun) resolution algorithm.
Ruby bindings
Ruby bindings are coordinated at the Ruby-LinkParser website. The code can be found at the ged/link-parser github page.
Perl bindings
The perl bindings, created by Danny Brian, have been updated. See the Lingua-LinkParser page on CPAN. There is also a tutorial written against an older version of the bindings; some details may be different.
Psi Toolkit (Perl)
The Psi Toolkit, an NLP toolkit aimed at linguists and NLP engineers, includes bindings for link-grammar, via perl.
Javascript
Obsolete Javascript bindings can be found at the dijs/link-grammar github page. Someone, please port these to the latest version!
Pre-parsed Wikipedia
Parsed versions of various texts, including all articles from a May 2008 dump of Wikipedia, as well as a partial parse of an October 2010 dump, are available at http://gnucash.org/linas/nlp/data/

Of related interest

Genia tagger
The Genia tagger is useful for named entity extraction. BSD license source.
After the Deadline
After the Deadline is a GPL-licensed language-checking tool. If you just want to have your text proof-read, this is probably a good choice.

Recent Applications and Publications

The original homepage hosted at the Carnegie Mellon University lists an extensive bibliography (mirror) referencing several dozen older (pre-2004) papers pertaining to the Link Grammar Parser. More recent publications and announcements are listed below.

Some miscellaneous facts:

  • Any categorial grammar can be easily converted to a link grammar; see section 6 of Daniel Sleator and Davy Temperley. 1993. "Parsing English with a Link Grammar." Third International Workshop on Parsing Technologies.
  • Link grammars can be learned by performing a statistical analysis on a large corpus: see John Lafferty, Daniel Sleator, and Davy Temperley. 1992. "Grammatical Trigrams: A Probabilistic Model of Link Grammar." Proceedings of the AAAI Conference on Probabilistic Approaches to Natural Language, October, 1992.

Recent Changes

Version 5.2.5 (1 February 2015)

Fixes for assorted breaks.

  • Fix contracted "is" verb.
  • Fix crash in batch mode (issue #63).
  • Fix Python bindings so that setting PYTHONPATH is not required.
  • Fix "... as I expected him to be."
  • Fix incorrect printing for some Russian linkages. (pull #75)
  • Fix warning from glibc version 2.20 (bug #77)

Version 5.2.4 (12 January 2015)

Fixes for assorted breaks.

  • Fix usage of 'less'.
  • Fix MS Windows random number portability API.
  • Fix mis-handled dictionary subscript dot (issue #53).
  • Fix crash on dots used as fill (issue #57).
  • Fix memory leak (issue #54).

Version 5.2.3 (4 January 2015)

Fixes for assorted build breaks.

  • Fix broken check for editline UTF8 support.
  • Work around broken perl binding definition for clang.

Version 5.2.2 (3 January 2015)

Fixes for assorted build breaks.

  • Fix OSX build break.
  • MSVC12 project file fixes.
  • Check for UTF8 support in libedit ("undefined reference to el_wgets")
  • Enable the 'make check' target for the multi-threading unit test.
  • Misc verb fixes.

Version 5.2.1 (28 December 2014)

Failed to run all of the tests when creating 5.2.0. So try again.

  • Prototype Indonesian dictionary from Hendy Irawan.
  • Fix crash on long sentences.

Version 5.2.0 (27 December 2014)

This is a major release of the parser, with many important changes in it. The internals of the parser have been re-organized, resulting in a speedup of 2x to 4x for typical English texts. Multiple multi-threading bugs were fixed, and there is now a simple multi-threading unit test. A memory leak was fixed, and a memory over-consumption bug was fixed. These changes were enabled by the final removal of the "fat link" code from the parser.

  • y'all, ain't, gonna, y'gotta: Beverly Hillbillies basilect.
  • Permanent removal of the fat-link code.
  • Remove deprecated constituent tree code.
  • Windows: add terminal screen resizing support.
  • Windows: a build fix.
  • reign, rule, run, leave, come: can take predicative adjective.
  • Rework costs for many verb-derived adjectives.
  • Handle (predicative) adjectival modifiers for assorted perfect verbs.
  • Fixes for various color names.
  • Fixes for various affirmative answers.
  • Add 100 missing verbs.
  • Add preliminary lxc-docker (docker.io) support.
  • Remove MSVC6 support.
  • Fix memleak introduced in version 5.1.0
  • Speedup of 1.7x to 4x (depending on text) from linkage processing redesign.
  • Fix multi-threading safety bug.
  • Fix link-and-domain printing alignment (to handle utf8 char widths).
  • Windows: fixes for MSVC12 support.
  • Fix memory consumption bug (EMPTY_WORD) introduced in version 4.7.10.
  • Get rid of xrealloc, which clashes with libbfd symbol xrealloc.
  • Add multi-threaded parsing unit test.

Version 5.1.3 (7 October 2014)

This release continues with fixes for build-breaks for Apple OSX.

  • More fixes for build breaks on Apple OSX.
  • Minor fixes involving "to do"

Version 5.1.2 (4 October 2014)

The most serious fix in this release is a build-break fix for Apple OSX Mavericks.

  • Fix greeting: "How do you do?"
  • Fix indirect object in 'what' questions: 'To what do you owe your success?'
  • Fix assorted questions with verb "to be".
  • Compile fixes for Apple OSX version "Mavericks"

Version 5.1.1 (23 September 2014)

The most serious fix in this release is a fix involving parse ranking in the Java API, which was causing RelEx to generate incorrect parse rankings for certain sentences.

  • Minor post-processing cleanup.
  • English dict: Fix questions with "it".
  • swig: add missing API functions sentence_split, dictionary_get_lang
  • Swap order of post-processing and bad morphology rejection.
  • Fix handling of ellipsis when there's missing whitespace.
  • Java: API bugfix/change: costs should have been doubles not ints.
  • Fat-linkage code: fix it so it compiles again.
  • Sat-solver: re-enable it so that it runs.

Version 5.1.0 (29 August 2014)

This version includes a number of important changes. One of these is that the connectors can now be given a direction (head and tail indicators), so that link-grammar dependencies can now be true, hierarchical dependency arrows. This is of marginal importance for English, where dependency directions are implicit, but is vital for free-word-order languages, where bi-directional links are not enough.

Another important change is that costs can now be arbitrary floating point numbers. This is particularly useful for providing fine-grained parse ranking. The LG cost system assigns a "cost" to every connector, and the sum-total of costs for a sentence determines the parse ranking. Since costs are additive, they behave as entropies (-log P, the negative logarithm of a probability: probabilities are multiplicative, logarithms are additive).
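
The additive behaviour of costs can be sketched numerically. The snippet below is an illustration only, not the parser's ranking code: the connector costs are made-up numbers, the function names are hypothetical, and reading a cost c as the probability exp(-c) is an assumption adopted for the example. It also assumes that the parse with the lowest total cost ranks first.

```python
import math


def total_cost(costs):
    """Sum of the connector costs for one parse."""
    return sum(costs)


def as_probability(cost):
    """Interpret a cost as a negative log-probability (an assumption)."""
    return math.exp(-cost)


# Hypothetical per-connector costs for two competing parses.
parse_a = [0.0, 0.5, 1.25]
parse_b = [0.0, 2.0, 0.1]

# Adding costs is equivalent to multiplying per-connector probabilities.
product_a = math.prod(as_probability(c) for c in parse_a)
assert math.isclose(product_a, as_probability(total_cost(parse_a)))

# Rank the parses by ascending total cost (lowest cost first).
ranked = sorted([("a", parse_a), ("b", parse_b)],
                key=lambda item: total_cost(item[1]))
print([name for name, _ in ranked])
```

Floating point costs matter here because integer costs would force many parses into ties, whereas fractional values allow finer distinctions between competing parses.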

Under the covers, there's been some major work on the tokenization (splitting sentences into words) and morphology (splitting words into morphemes) code. This work is ongoing, and should eventually result in much better support for non-English languages.

Other notable changes include an updated Russian dictionary, and an assortment of changes to the English dictionary. An intriguing step towards phonology: LG can now distinguish between the use of the determiners "a" and "an" preceding nouns that start with consonants or vowels. Whether fancier phonology support is possible is a curious question.

  • Updated Russian dictionaries from Sergey Protasov.
  • Added morphology-based unknown-word handling for Russian, from Sergey Protasov.
  • Fix up fat-linkage code, which was recently broken...
  • API cleanup: many command-line options never belonged in the API.
  • New emoticon support was clobbering certain dictionary words.
  • Fix: "Go to spot X", "It happens at time T."
  • Add a dozen missing verbs.
  • Minor work on greetings.
  • Add mechanism for denoting fractional costs in the file-backed dict.
  • Fix: broken handling of gerunds (due to bad verb-wall connectors)
  • Major redesign of morpheme splitting mechanism (from AmirP)
  • Minor extensions to support numeric formulas, e.g. 1 + 1 = 2.
  • Remove fat linkage support from the SAT solver.
  • Enable build of SAT solver by default.
  • Fix multiple bugs with unit stripping.
  • Add bounds-checking to the C API.
  • Fix the old disjunct-printing implementation.
  • Add support for easy-to-use link direction indicator.
  • Add random morphology generator tool.
  • Partial support for phonetic use of "a" vs. "an" for English.
  • Rework how coordination between conjunctions works: "either... or ...", etc.
  • Major redesign of tokenization mechanism (from AmirP)

Version 5.0.8 (30 April 2014)

  • Fix handling of initial letters in ordered lists.
  • Fix another serious error in constituent printing, introduced in 4.8.0.
  • New emoticon support was clobbering certain number expressions.
  • Misc English dict fixes, more verb-wall connectors.

Version 5.0.7 (29 April 2014)

  • Compile fixes in SAT solver.
  • Add missing verb-wall connectors for is, hasn't, haven't, hadn't, etc.
  • Remove verb-wall connector for imperative verbs.
  • Fix serious error in constituent printing, introduced in 5.0.3
  • Fix old bug in command-line handling of options.
  • Fix parsing of various ordered lists, including some tables of contents.

Version 5.0.6 (18 April 2014)

  • Fix: JSON output format missing brace; from Matt Kruse.
  • Fix: Serious error in Russian morphology printing.

Version 5.0.5 (17 April 2014)

  • Fix packaging bug with the English dictionary.

Version 5.0.4 (16 April 2014)

  • Expanded unit tests for capitalization.
  • Fix who questions: "Who are they?", "Who are you?", etc.
  • Provide verb-wall linkage for many questions.
  • Add Biblical naming idioms: "Lud, son of Shem, ..."
  • Fix MacOSX build break.
  • Fix the 'make clean' target to not remove critical files.
  • Fix broken emoticon support in English dict.
  • Remove obsolete entity detection tokens from English dict.
  • Fix broken equation parsing.

Version 5.0.3 (13 April 2014)

  • Minor memory usage optimization
  • Fix unit test: suppress printing of empty word, and of morphology.
  • Fix: Swig and python were meant to be optional, not required!

Version 5.0.2 (10 April 2014)

  • Expanded unit tests
  • Fix another sqlite3-dev build break

Version 5.0.1 (9 April 2014)

  • Dictionary debugging print fixes from Amir P
  • Print summary of parse statistics when in batch mode (from AmirP)
  • Generalize the notion of prefix/suffix to arbitrary classes (Amir P)
  • Fixes for German adjectives.
  • Fix build break when sqlite3-dev not installed.
  • Fix regression in Russian morphology handling.

Version 5.0.0 (1 April 2014)

We are proud to announce a major new release of the Link Grammar Parser! It contains many important changes and new additions. One of the most significant changes is that the license has been changed from the BSD license to the LGPL. This was done to enable considerably more flexibility in accepting contributions to the project: it seems that few are particularly interested in contributing to a BSD-licensed project. This change has enabled folding in some new work:

  • Arabic and Persian dictionaries! These were previously maintained as separate add-ons. Including them as part of the distribution should make it easier for interested users.
  • A new 'bindings' directory, containing code for Java, Python, Common Lisp, OCaML and AutoIt programming languages. The Python bindings are an updated version of the older pylinkgrammar-0.2.13 bindings. A SWIG interface file should make it easy to create other language bindings as well.
  • Improved morphology support. This will be invisible to most users, but it lays the groundwork for adding Hebrew support to the parser.
  • Expanded Lithuanian support. This remains a simplistic prototype, but it now performs a more sophisticated morphological analysis.
  • Experimental Turkish and Hebrew dictionaries.
  • A demo of the JSON parser server: it shows how to run the server, which accepts raw sentences on a socket and returns the parsed forms.
  • Some slightly incompatible changes to the API: it was time for some housekeeping.
  • Misc minor updates to the English Language dictionaries.
  • Preliminary work for SQL-backed dynamic dictionaries. This should enable certain types of automated language learning.

The full changelog is shown below.

  • License upgrade to LGPLv2.1
  • Arabic dictionaries, from Jon Dehdari
  • Persian dictionaries, from Jon Dehdari
  • Support for Hebrew tokenization, from Amir P.
  • Fix wild-card matching for user-supplied word lookup.
  • Prototype Turkish dictionary from Can Bruce.
  • Re-arrange programming language bindings directory.
  • Adopt the orphaned/unsupported pylinkgrammar Python bindings.
  • Deprecate the obsolete CNode interface.
  • Provide low-level perl bindings.
  • Adopt the orphaned/unsupported OCaML bindings.
  • Support affirmative replies: "Who did it?" "John's evil twin."
  • Expanded Lithuanian dictionary.
  • Minor disjunct printing fixes.
  • Fix: "Mary is too XXX to talk to."
  • Prototype Hebrew dictionary from Amir P.
  • Change !suffixes flag to !morphology.
  • Introduce a bi-directional connector, for free-word-order languages.
  • Introduce a symmetric-AND operator, for free-word-order languages.
  • Add demo shell script for running the JSON parse server.
  • Bugfix: Java server failing when input sentence has commas in it!
  • New !test and !debug commands for selective debugging support.
  • Print post-processing rejection message, when !bad is enabled.
  • Remove some deprecated functions for C API.
  • Remove all deprecated functions from Java API.
  • Initial support for an SQL-backed dynamic dictionary.
A list of older changes can be found here.

License

Current versions of the Link Grammar parser software, language dictionaries and documentation are available under the LGPL v2.1 license. Versions prior to 5.0.0 are available under a variant of the BSD license.

Copyright (c) 2003-2004 Daniel Sleator, David Temperley, and John Lafferty. All rights reserved.
Copyright (c) 2003 Peter Szolovits
Copyright (c) 2004,2012,2013 Sergey Protasov
Copyright (c) 2006 Sampo Pyysalo
Copyright (c) 2007 Mike Ross
Copyright (c) 2008,2009,2010 Borislav Iordanov
Copyright (c) 2008-2015 Linas Vepstas
Copyright (c) 2014, 2015 Amir Plivatsky