NooJ: A Linguistic Development Environment



RA Project

The RA project:
An industrial linguistic engine for NooJ

Presented at the NooJ2017 Conference


1. Architecture


Swift: a new programming language and paradigm. RA is entirely compiled: there is no Java or .NET virtual machine, and no garbage collection at runtime. RA runs natively on Windows, Mac OS X, Linux and UNIX. RA's functionalities will be available via a Web Service; the developer version will also be available in a playground (an interpreter environment equivalent to Python's).

All of RA's methods are written as functional code, with no side effects. RA is therefore thread-safe, well suited to concurrent programming and easy to distribute: it will run well on multi-core computers.
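
The link between side-effect-free code and thread safety can be illustrated with a minimal sketch. It is written in Python for brevity and is not RA's Swift code; the function names `tokenize` and `tokenize_all` and the sample sentences are hypothetical. Because the function reads only its argument and modifies nothing outside itself, it can be applied from several threads without locks, and the result matches a sequential run.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize(sentence: str) -> tuple[str, ...]:
    """Pure function: the result depends only on the argument,
    and no shared state is read or modified."""
    return tuple(sentence.lower().split())


def tokenize_all(sentences: list[str], workers: int = 4) -> list[tuple[str, ...]]:
    # Executor.map returns results in input order, so the output is
    # identical to a sequential map over the same sentences.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tokenize, sentences))


# Hypothetical sample input, for illustration only.
SAMPLE = ["The cat sleeps", "Dogs bark loudly"]
```

Calling `tokenize_all(SAMPLE)` gives the same list as `[tokenize(s) for s in SAMPLE]`, whatever the number of workers.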

RA's design follows unit-testing practices: each functionality was first specified by tests (more than 3 on average) before being implemented. This systematic practice will help make RA industrially robust, i.e. it guards RA's successive versions against regressions.


2. Linguistics 


RA processes only one type of finite-state machine, in the new .ra format. The new .ra machine will be used to process dictionaries (similar to .nod), morphological (similar to .nom) and syntactic (similar to .nog) resources.

-- NooJ v5 inflectional grammars (.nof) will be converted to new .ra machines that are reversible, i.e. they can be used both for generation and lemmatisation. 

-- NooJ v5 dictionaries (.dic + .nof files) will have to be recompiled by RA into the new .ra machines.

-- NooJ v5 morphological grammars (.nom) will be recompiled by RA into .ra machines.

-- NooJ v5 syntactic grammars (.nog) will have to be modified manually before being recompiled by RA into .ra machines. NooJ v5.1 will add new functionalities to help users adapt their syntactic grammars to the new engine.
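
What "reversible" means for an inflectional grammar can be sketched as follows. This is a simplified illustration in Python, not the .ra format or RA's algorithm: the rule table `PARADIGM`, the feature codes and the `lexicon` argument are all hypothetical. Each rule is a triple (suffix removed from the lemma, suffix added, features); the same table is read left-to-right for generation and right-to-left for lemmatisation.

```python
# Hypothetical paradigm for nouns like "city":
# (suffix stripped from the lemma, suffix added, features)
PARADIGM = [
    ("", "", "s"),      # city -> city   (singular)
    ("y", "ies", "p"),  # city -> cities (plural)
]


def generate(lemma, paradigm):
    """Lemma -> list of (inflected form, features)."""
    forms = []
    for strip, add, feats in paradigm:
        if lemma.endswith(strip):
            stem = lemma[: len(lemma) - len(strip)]
            forms.append((stem + add, feats))
    return forms


def lemmatise(form, paradigm, lexicon):
    """Inflected form -> list of (lemma, features), applying the same
    rules backwards and keeping only candidates found in the lexicon."""
    analyses = []
    for strip, add, feats in paradigm:
        if form.endswith(add):
            stem = form[: len(form) - len(add)]
            candidate = stem + strip
            if candidate in lexicon:
                analyses.append((candidate, feats))
    return analyses
```

With this table, `generate("city", PARADIGM)` produces both forms, and `lemmatise("cities", PARADIGM, {"city"})` recovers the lemma from the same rules, so no separate form-to-lemma table is needed.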

RA will process morphological and syntactic grammars in the same way; it will be possible to include morphological grammars inside syntactic grammars, and/or to use syntactic contexts to constrain morphological analyses.

RA will contain a new, optimized annotation system, different from and incompatible with NooJ's.


3. Algorithms


The new linguistic engine will contain only one parser, similar to NooJ's previous dictionary lookup method. The same parser is used whether the user applies a dictionary, a morphological grammar, a syntactic grammar or any combination of them.

 

RA's dictionaries should be substantially smaller than current NooJ .nod dictionaries, because the .ra machines do not need to store the lemma for each lexical entry (since RA's inflectional/derivational grammars are reversible).

RA will contain a Cascade compiler that transforms a set of linguistic resources, each associated with a priority level (as selected in Info > Preferences), into a single finite-state machine. For instance, it will apply 4 dictionaries, 2 morphological grammars and 5 syntactic grammars in a single step.
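
The idea of compiling prioritized resources into one structure can be sketched in Python. This is a deliberate simplification, not RA's compiler: plain lookup tables stand in for finite-state machines, the function name `compile_cascade` is hypothetical, and it is an assumption that a larger number means a higher priority and that a higher-priority analysis replaces a lower-priority one for the same form.

```python
def compile_cascade(resources):
    """resources: iterable of (priority, {form: analysis}) pairs.
    Returns a single merged table that can be queried in one step."""
    merged = {}
    # Visit resources from lowest to highest priority so that entries
    # from higher-priority resources overwrite lower-priority ones.
    for _priority, table in sorted(resources, key=lambda r: r[0]):
        merged.update(table)
    return merged


# Hypothetical resources, for illustration only.
GENERAL = (0, {"bank": "N", "run": "V"})
FINANCE = (1, {"bank": "N+Finance"})
CASCADE = compile_cascade([FINANCE, GENERAL])
```

After compilation, a single lookup such as `CASCADE["bank"]` already reflects the priorities; the individual resources no longer need to be consulted one after another.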

RA, as opposed to NooJ, dynamically computes a Partial Text Annotation Structure: annotations are added to the TAS only where needed. For instance, if a grammar does not contain any <V> symbol (Verb), there is no need for RA to annotate all the verbs of the text before applying the grammar.
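
The partial annotation strategy can be sketched as follows, in Python and under simplifying assumptions: the grammar is represented as a plain string of symbols such as `<N>`, the one-category-per-word `LEXICON` is hypothetical, and the function names are illustrative rather than RA's. The sketch first collects the categories the grammar mentions, then annotates only the tokens that belong to one of them.

```python
import re

# Hypothetical lexicon: one category per word form.
LEXICON = {"the": "DET", "cat": "N", "sleeps": "V"}


def required_categories(grammar: str) -> set[str]:
    """Collect the category symbols, e.g. <N> or <V>, used by the grammar."""
    return set(re.findall(r"<([A-Z]+)>", grammar))


def partial_annotations(tokens, grammar):
    """Return (position, token, category) only for tokens whose category
    is actually referenced by the grammar."""
    needed = required_categories(grammar)
    return [
        (i, tok, LEXICON[tok])
        for i, tok in enumerate(tokens)
        if LEXICON.get(tok) in needed
    ]
```

For a grammar that contains only `<DET>` and `<N>`, the verb "sleeps" is never annotated, which mirrors the <V> case described above.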