In 1985, Sakhr Software Company launched a multi-phase, large-scale research project for
the automatic processing of written Arabic. The company mobilized massive human, material,
and financial resources to achieve this ambitious goal, which took more than ten years to
accomplish.
Arabic Core Linguistic Engines
The Core Linguistic Engines address the following four language levels:
- Character Level:
Arabic Optical Character Recognition (A-OCR):
Since 1993, Sakhr has been developing technology for the automatic recognition of printed
Arabic text, known as OCR. Although it has been targeting the Arab market, Sakhr
realized from the very beginning the importance of developing a bilingual OCR for
Arabic/English text.
The program is able to learn character shapes, which can be used to improve
recognition accuracy. No other Arabic OCR system has this unique feature of
combining both recognition technologies. After establishing its leadership in the field of
Arabic/English printed text recognition, Sakhr internationalized its OCR system by
supporting more languages. Sakhr has released a Persian version of its product for an
Iranian distributor. Special versions for other Arabic-script languages, such as Urdu and
Jawi, can be developed based on the same technology. In addition to supporting
Arabic-script languages, Sakhr now supports 16 European languages.
- Word Level:
Multi-Mode Morphological Processor (MMMP):
Sakhr's MMMP is a morphological analyzer-synthesizer for Arabic. The analyzer identifies
all possible stem forms of a word, that is, its basic form stripped of affixes.
Unlike an English stemmer, the MMMP analyzer does not stop at the stem level but proceeds
to extract the root and the Morphological Pattern (MP) of the word.
Decomposing Arabic words into their morphological primitives is a basic requirement for
full-text indexing, search, dictionary organization and lookup, as well as for spelling
and grammar checking. Even more important, the MMMP enables deeper processing of
Arabic at the syntactic and semantic levels. The MMMP synthesizer works in reverse mode to
generate linguistically correct final word forms. The synthesizer is a key tool for
generating the required output in machine translation systems and other text generation
applications, such as summarizers and style checkers.
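The analysis and synthesis directions described above can be illustrated with a minimal
Python sketch. It is not Sakhr's implementation: the affix lists, the pattern list, and the
function names analyze and synthesize are hypothetical, and a real analyzer would rely on a
full lexicon and far richer rules. Patterns use the letters ف ع ل as placeholders for the
three root consonants.

```python
# Toy root-and-pattern analysis of unvowelized Arabic words.
# All tables below are illustrative assumptions, not Sakhr's lexicon or rules.
PREFIXES = ["", "وال", "بال", "ال", "و", "ب", "ل"]   # "" means no prefix
SUFFIXES = ["", "ات", "ون", "ين", "ة"]               # "" means no suffix
PATTERNS = ["فاعل", "مفعول", "فعال", "فعل"]
SLOTS = "فعل"  # placeholder letters for root consonants 1, 2, 3


def match_pattern(stem, pattern):
    """Return the three-letter root if stem fits pattern, else None."""
    if len(stem) != len(pattern):
        return None
    root = [""] * 3
    for letter, slot in zip(stem, pattern):
        if slot in SLOTS:
            root[SLOTS.index(slot)] = letter   # fill a root position
        elif letter != slot:
            return None                        # fixed pattern letter differs
    return "".join(root)


def analyze(word):
    """List every (prefix, stem, suffix, root, pattern) reading of word."""
    analyses = []
    for prefix in PREFIXES:
        if not word.startswith(prefix):
            continue
        for suffix in SUFFIXES:
            if not word.endswith(suffix):
                continue
            stem = word[len(prefix):len(word) - len(suffix)]
            for pattern in PATTERNS:
                root = match_pattern(stem, pattern)
                if root is not None:
                    analyses.append((prefix, stem, suffix, root, pattern))
    return analyses


def synthesize(root, pattern, prefix="", suffix=""):
    """Reverse direction: build a word form from a root and a pattern."""
    body = "".join(root[SLOTS.index(c)] if c in SLOTS else c for c in pattern)
    return prefix + body + suffix


if __name__ == "__main__":
    print(analyze("والكاتب"))
    print(synthesize("كتب", "مفعول"))
```

Because the sketch returns every reading that fits its tables, an ambiguous word yields
several analyses; choosing among them is left to later processing stages.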
- Sentence Level:
a) Multi-Mode Syntactical Processor (MMSP):
The Sakhr MMSP parses the Arabic sentence into its syntactical constituents
(verb, subject, object, adverbs, predicate, etc.). MMSP is driven by a formal grammar of
Arabic with extensive linguistic and lexical coverage, and it integrates a set of advanced
deterministic and preferential parsing techniques. Its major power lies in its ability to
resolve the intermixed ambiguities of unvowelized Arabic text. This
ability is a major factor for successful in-depth processing of text for translation,
summarization, automatic understanding, and content analysis.
b) Arabic Automatic Diacritizer (AAD)
The AAD is a technological breakthrough that solves the basic problem of automatically
handling unvowelized Arabic text. It is an intelligent processor based on the MMMS.
It simulates the mental process exercised by native Arabic speakers in interpreting
undiacritized text and supplying the missing vowels. The Automatic Diacritizer provides
different options for diacritization: full, mandatory, or case-ending diacritics. The AAD
is the entry point for rendering written Arabic text suitable for serious computation.
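To make the difference between output options concrete, the following Python sketch works
in the easy direction only: it takes text that is already fully diacritized and reduces it
to an undiacritized form or to a form that keeps marks on the final letter of each word.
It does not perform diacritization itself. The function names are hypothetical, and reading
"case ending" as "marks on the last letter only" is an assumption; the "mandatory" option
is not defined in the text and is not modeled.

```python
# Arabic marks from fathatan (U+064B) through sukun (U+0652), including shadda.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text):
    """Remove all diacritic marks, leaving the bare consonantal text."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def case_endings_only(text):
    """Keep only the marks attached to the final letter of each word.

    Assumption: 'case ending' output means word-final marks only.
    """
    words = []
    for word in text.split(" "):
        end = len(word)
        # Walk back over the trailing run of diacritics, if any.
        while end > 0 and word[end - 1] in DIACRITICS:
            end -= 1
        words.append(strip_diacritics(word[:end]) + word[end:])
    return " ".join(words)


if __name__ == "__main__":
    full = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, fully diacritized
    print(strip_diacritics(full))
    print(case_endings_only(full))
```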
- Continuous Text Level:
a) Arabic Text Fragmenter (ATF)
Sakhr's ATF automatically divides continuous text into separate sentences. It is a
basic front-end processor, which prepares narratives for sentence-based processors such as
parsers and machine translation systems.
b) Arabic Automatic Indexer (AAI)
Sakhr's AAI automatically examines the content of a document to identify key words and
phrases. For the first time, the Arabic automatic indexer enables the creation of book
indices with an ease never before seen for Arabic books. AAI has different levels of
indexing and has an HTML version for the Internet.
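The two continuous-text steps, fragmenting text into sentences and then indexing it, can be
sketched together in Python. This is a simplified illustration under stated assumptions,
not the ATF or AAI algorithms: sentences are split only on terminal punctuation (including
the Arabic question mark), words are indexed as surface forms rather than stems or roots,
input is assumed to be undiacritized, and the names split_sentences and build_index are
hypothetical.

```python
import re
from collections import defaultdict

# Sentence-ending marks: Latin ones plus the Arabic question mark (U+061F).
TERMINATORS = ".!?\u061F"
_SPLIT_RE = re.compile(rf"(?<=[{re.escape(TERMINATORS)}])\s+")


def split_sentences(text):
    """Split text after a terminator that is followed by whitespace."""
    return [part for part in _SPLIT_RE.split(text.strip()) if part]


def build_index(sentences, stopwords=frozenset(), min_length=3):
    """Map each kept word to the 1-based numbers of sentences containing it.

    Words are surface forms; a real indexer would likely normalize them
    (for example to stems or roots) before indexing.
    """
    index = defaultdict(list)
    for number, sentence in enumerate(sentences, start=1):
        for word in sorted(set(re.findall(r"\w+", sentence))):
            if len(word) >= min_length and word not in stopwords:
                index[word].append(number)
    return dict(index)


if __name__ == "__main__":
    sample = "alpha beta. gamma alpha\u061F delta!"
    sentences = split_sentences(sample)
    print(sentences)
    print(build_index(sentences, stopwords={"beta"}))
```

Splitting on punctuation alone will mis-handle abbreviations and unpunctuated text, which
is one reason a dedicated fragmenter is treated as its own processing stage.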