Hello, this is ITOH (@takahi_i), a software engineer at ATL. I have released RedPen, a checker for documents written in natural languages. (Sorry for the late announcement, which comes with the beta version.) RedPen targets natural-language documents such as manuals, essays, and e-mails. The project URL is below.
This article introduces RedPen's features and shows how to use it.
Features of RedPen
RedPen generates a warning when it finds invalid expressions, overly complicated sentences, or obvious inconsistencies in a document. RedPen currently provides only very primitive features, as follows:
RedPen provides only primitive features for now because our focus is not the sophisticated analysis studied in the field of natural language processing, but the detection of obviously invalid expressions and inappropriate formatting, much as a static analysis tool does for a programming language.
Supported format
RedPen currently supports Markdown, a subset of Textile tags, and plain text.
Usage of RedPen
Let’s get started with RedPen using the supplied configuration files and sample document. Please refer to the manual for more detailed usage information.
Sample Setting
RedPen has three kinds of configuration files. The first is the main configuration file, which specifies the other two configuration files and the document language. (In the setting below, Japanese, “ja”, is specified.)
    <configuration>
      <validator>conf/validation-conf-ja.xml</validator>
      <lang char-conf="conf/symbol-conf-ja.xml">ja</lang>
    </configuration>
The other two configuration files specified in the main configuration file are the Validator configuration file and the character configuration file.
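To make the structure of the main configuration file concrete, here is a minimal Python sketch that reads the three values it carries. It is only an illustration and not RedPen's own loader; the function name `read_main_conf` and the dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# The main configuration file shown above, inlined for the example.
MAIN_CONF = """<configuration>
  <validator>conf/validation-conf-ja.xml</validator>
  <lang char-conf="conf/symbol-conf-ja.xml">ja</lang>
</configuration>"""


def read_main_conf(xml_text):
    """Return the validator file, language, and character file named in the XML.

    Hypothetical helper for illustration; RedPen's real loader may differ.
    """
    root = ET.fromstring(xml_text)
    lang = root.find("lang")
    return {
        "validator_conf": root.findtext("validator"),
        "lang": lang.text,
        "char_conf": lang.get("char-conf"),
    }


settings = read_main_conf(MAIN_CONF)
print(settings)
```

The language is the text of the `lang` element, while the character configuration file is an attribute of that same element, which is why the sketch reads both from one node.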
Validator configuration file
In the Validator configuration file, you add a Validator for each thing you want to check. Each Validator checks the input document from one perspective, for example the length of sentences or the use of invalid symbols.
    <component name="Validator">
      <component name="SentenceIterator">
        <!-- Warn when a sentence is longer than max_length characters -->
        <component name="SentenceLength">
          <property name="max_length" value="150"/>
        </component>
        <component name="InvalidCharacter" />
        <component name="SpaceWithSymbol" />
        <component name="KatakanaEndHyphen" />
        <component name="KatakanaSpellCheck" />
      </component>
    </component>
The setting above registers the following Validators: SentenceLength, InvalidCharacter, SpaceWithSymbol, KatakanaEndHyphen, and KatakanaSpellCheck.
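As a rough illustration of what a Validator such as SentenceLength does with its max_length property, the following Python sketch splits a text into sentences and reports those that are too long. It is an assumption-laden sketch and not RedPen's implementation: the set of end-of-sentence characters and the names `split_sentences` and `check_sentence_length` are chosen here for the example.

```python
import re

# End-of-sentence characters; this set is an assumption made for the example.
SENTENCE_END = "。?!"


def split_sentences(text):
    """Split text into sentences, keeping each terminating character."""
    ends = re.escape(SENTENCE_END)
    pattern = "[^" + ends + "]+[" + ends + "]?"
    return [s.strip() for s in re.findall(pattern, text) if s.strip()]


def check_sentence_length(text, max_length=150):
    """Return (sentence index, length) for every sentence over max_length."""
    errors = []
    for index, sentence in enumerate(split_sentences(text)):
        if len(sentence) > max_length:
            errors.append((index, len(sentence)))
    return errors


# A short sentence followed by a long one, checked against a small limit.
sample = "短い文です。" + "あ" * 20 + "。"
for index, length in check_sentence_length(sample, max_length=10):
    print("sentence", index, "is too long:", length, "characters")
```

Length is counted in characters rather than bytes, which is the natural unit for Japanese text.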
Character configuration file
The default character settings are determined by the language (the lang field) in the main configuration file. To use settings that differ from the defaults for the specified language, you can override them in the character configuration file.
In the character configuration file, you define the characters you use, the characters you must not use, and whether a space is required before or after a character. I will not go into further detail here, but the file below defines the full stop as “。” and the comma as “、”. The invalid-chars attribute lists characters that carry the same meaning as the symbol but must not be used. For example, in the setting below the comma is defined as “、”, and “,” is registered as an invalid comma character.
    <?xml version="1.0"?>
    <character-table>
      <character name="EXCLAMATION_MARK" value="!" invalid-chars="!" after-space="true" />
      <character name="LEFT_QUOTATION_MARK" value="'" invalid-chars="“" before-space="true" />
      <character name="RIGHT_QUOTATION_MARK" value="'" invalid-chars="”" after-space="true" />
      <character name="NUMBER_SIGN" value="#" invalid-chars="#" after-space="true" />
      <character name="FULL_STOP" value="。" invalid-chars=".." />
      <character name="COMMA" value="、" invalid-chars="," />
    </character-table>
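The way invalid-chars could drive a check can be sketched in Python as follows. This is a minimal sketch under assumptions, not RedPen's code: it assumes every character listed in invalid-chars is an individual invalid character, and the names `load_invalid_chars` and `find_invalid_chars` are hypothetical. The full-width comma is written as the escape `\uff0c` so that it cannot be confused with the ASCII comma.

```python
import xml.etree.ElementTree as ET

# A reduced character table; \uff0c is the full-width comma.
CHAR_TABLE = """<character-table>
  <character name="FULL_STOP" value="。" invalid-chars="." />
  <character name="COMMA" value="、" invalid-chars="\uff0c" />
</character-table>"""


def load_invalid_chars(xml_text):
    """Map each invalid character to (symbol name, preferred character).

    Assumption: each character in invalid-chars is invalid on its own.
    """
    table = {}
    for node in ET.fromstring(xml_text).iter("character"):
        for ch in node.get("invalid-chars", ""):
            table[ch] = (node.get("name"), node.get("value"))
    return table


def find_invalid_chars(sentence, table):
    """Return (position, found character, preferred character) tuples."""
    return [
        (pos, ch, table[ch][1])
        for pos, ch in enumerate(sentence)
        if ch in table
    ]


table = load_invalid_chars(CHAR_TABLE)
for pos, found, preferred in find_invalid_chars("本稿では\uff0c複数の計算機", table):
    print("position", pos, ": found", found, ", expected", preferred)
```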
Sample document
We use the following sample document which is supplied in the RedPen package.
    最近利用されているソフトウェアの中には複数の計算機上で動作(分散)するものが多く存在し、このような分散ソフトウェアは複数の計算機で動作することで大量のデータを扱えたり、高負荷な状況に対処できたりします。
    本稿では,複数の計算機(クラスタ)で動作する各サーバーを「インスタンス」と呼びます。
    たとえば検索エンジンやデータベースではインデックスを複数のインスタンスで分割して保持します。
    このような場合、各インデクスの結果をマージしてクライアントプログラムに渡す機構が必要となります。
Installation and running
First of all, download RedPen.
Installation of RedPen
Download and build RedPen with the following steps. (You need Git and Maven installed.)
    $ git clone git@github.com:recruit-tech/redpen.git
    $ cd redpen
    $ mvn package
Running of RedPen
Then, run the installed RedPen.
    $ cd redpen-app/target/
    $ tar xzvf redpen-app-0.6-assembled.tar.gz
    $ cd redpen-app-0.6
    $ bin/redpen -c conf/dv-conf-ja.xml doc/txt/ja/sampledoc-ja.txt
Result after run
When you run the command above, some errors are reported, as shown below.
    15:53:31.833 [main] INFO o.b.docvalidator.ConfigurationLoader - Succeeded to load configuration file
    15:53:31.838 [main] INFO o.b.docvalidator.ConfigurationLoader - Validation Setting file: conf/validation-conf-ja.xml
    15:53:31.842 [main] WARN o.b.d.c.ValidationConfigurationLoader - Found more than one root "component" blocks in the configuration
    15:53:31.842 [main] WARN o.b.d.c.ValidationConfigurationLoader - Use the first configuration block ...
    15:53:31.843 [main] INFO o.b.docvalidator.ConfigurationLoader - Succeeded to load validator configuration setting
    15:53:31.843 [main] INFO o.b.docvalidator.ConfigurationLoader - Setting lang as "ja"
    15:53:31.843 [main] INFO o.b.docvalidator.ConfigurationLoader - Setting character table setting file as "conf/symbol-conf-ja.xml"
    15:53:31.844 [main] INFO o.b.docvalidator.ConfigurationLoader - Symbol setting file: conf/symbol-conf-ja.xml
    15:53:31.852 [main] INFO o.b.d.config.CharacterTableLoader - Succeeded to load character table
    15:53:31.852 [main] INFO o.b.docvalidator.ConfigurationLoader - Succeeded to load character configuration setting
    15:53:31.864 [main] INFO o.b.d.parser.BasicDocumentParser - "。" is added as a end of sentence character
    15:53:31.864 [main] INFO o.b.d.parser.BasicDocumentParser - "?" is added as a end of sentence character
    15:53:31.864 [main] INFO o.b.d.parser.BasicDocumentParser - "!" is added as a end of sentence character
    15:53:31.872 [main] INFO o.b.d.distributor.ResultDistributor - Creating Distributor...
    ValidationError[SentenceLength][doc/txt/ja/sampledoc-ja.txt : 0 (The length of the line exceeds the maximum 101.)] at line: 最近利用されているソフトウェアの中には複数の計算機上で動作(分散)するものが多 く存在し、このような分散ソフトウェアは複数の計算機で動作することで大量のデータを扱えたり、高負荷な状況に対処できたりします。
    ValidationError[InvalidCharacter][doc/txt/ja/sampledoc-ja.txt : 1 (Invalid symbol found: ",")] at line: 本稿では,複数の計算機(クラスタ)で動作する各サーバーを「インスタンス」と呼びまます。
    ValidationError[KatakanaEndHyphen][doc/txt/ja/sampledoc-ja.txt : 1 (Invalid Katakana end hypen found "サーバー")] at line: 本稿では,複数の計算機(クラスタ)で動作する各サーバーを「インスタンス」と呼びまます。
    ValidationError[KatakanaSpellCheck][doc/txt/ja/sampledoc-ja.txt : 3 (Found a Katakana word: "インデクス", which is similar to "インデックス" at postion 2.)] at line: このような場合、各インデクスの結果をマージしてクライアントプログラムに渡す機構が必要となります。
Now, take a look at the result above: four errors (ValidationError) are reported. The first says that the first sentence of the input document is too long. The second says that the comma used differs from the registered one. The third says that the trailing hyphen (the long-vowel mark “ー”) in the word “サーバー” is not needed. The fourth says that the Katakana word “インデクス” is spelled very similarly to another Katakana word in the document, “インデックス”.
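The fourth error, a Katakana word that closely resembles another one, can be approximated with an edit-distance comparison. The Python sketch below is one possible approach and is not taken from RedPen: the use of Levenshtein distance, the threshold of one edit, and the names `katakana_words`, `edit_distance`, and `find_similar_katakana` are all assumptions made for illustration.

```python
import re

# Katakana letters (U+30A1 to U+30FA) plus the long-vowel mark (U+30FC).
KATAKANA_WORD = re.compile("[\u30a1-\u30fa\u30fc]+")


def katakana_words(text):
    """Return every run of Katakana characters in the text, in order."""
    return KATAKANA_WORD.findall(text)


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def find_similar_katakana(text, max_distance=1):
    """Return (word, earlier word) pairs that differ by at most max_distance edits."""
    seen = []
    hits = []
    for word in katakana_words(text):
        for earlier in seen:
            if word != earlier and edit_distance(word, earlier) <= max_distance:
                hits.append((word, earlier))
        if word not in seen:
            seen.append(word)
    return hits


text = ("インデックスを複数のインスタンスで分割して保持します。"
        "各インデクスの結果をマージします。")
for word, earlier in find_similar_katakana(text):
    print(word, "is similar to", earlier)
```

A threshold of one edit keeps unrelated words such as “インスタンス” from being reported, at the cost of missing variants that differ in more than one character.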
Current status and future
RedPen has been released as a beta version. For now the application works only as a standalone tool, and there will be many changes before the official release. The format of the configuration files (XML) will change greatly in response to engineers’ requests. The Validator interface will also change as engineers actively make suggestions. I ask for your kind understanding of these inconveniences.
What we are least satisfied with is how complicated the settings are when you start using the current version of RedPen. For example, to use the Validator that detects invalid expressions (InvalidExpressionValidator), you currently have to write the list of invalid expressions yourself. We want to solve this by providing default settings for each language and Validators that need no settings, so that RedPen can be used without configuring it or adding resources. As Validators that need no settings, we are considering one that checks whether the formal name of an abbreviation appears in the input document, and one that detects a mixture of written and spoken Japanese.
Other major improvements will include plugin support so that Validators can be added easily. I will keep developing RedPen step by step, so please bear with me. Thank you.