Support to Annotate Arabic #774

ghost · 2012-05-14T09:50:17Z

This issue is to discuss the current shortcomings regarding Arabic script and how/if they can be resolved given our current architecture.

Emad Mohamed mentioned on the corpora mailing list that they can use the ASCII Buckwalter encoding for Arabic but that it is sub-optimal. We really need a native speaker to help out with this, but at least from what I could read at CPAN it looks like a dreadful hack to get Arabic into ASCII.

According to @amadanmath, the following should be an issue:

<head>
    <meta charset="utf-8"/>
</head>

<body dir="rtl">
    <p>
        <span dir="rtl">
            لغة إنجليزية <span dir="ltr">English</span> لغة إنجليزية
        </span>
    </p>

    <p>
        <span dir="rtl">
            لغة إنجليزية English لغة إنجليزية
        </span>
    </p>
</body>

But it appears that at least Firefox renders both the same and handles the English portion correctly.

From talking to one of the attendees at EACL 2012, tokenisation may also become an issue. For this we could use an approach similar to the one we already use for Japanese and incorporate a morphological analyser to find the start and end of the "tokens".

Here is one I found after some minor Googling:

https://github.com/mosta/raramorph


fsalotaibi · 2012-06-04T22:33:34Z

Hi there,

I work in Arabic NLP and am very interested in helping this tool support Arabic.

I believe that if this happens, the tool will get many citations, as research on Arabic NLP is flourishing these days and has become important.

Regarding supporting the ASCII transliteration (Buckwalter) instead of the actual Arabic glyphs, I believe this is not a good choice. As you know, the transliteration is hard to read, especially for someone working on an annotation task.

The optimal choice is to support RTL with Arabic glyphs.

Please feel free to contact me, as I'd be happy to be involved.

Fahd

spyysalo · 2012-06-05T03:48:52Z

Hi @fsalotaibi,

Thanks for your interest in brat! We're happy to welcome any contribution to Arabic support in brat, and would much appreciate your help on this feature.

For rendering the actual glyphs in brat, as a first step, we would need to know how to create an SVG document with Arabic that renders correctly in at least some major browser. If you can look into this, it would be very helpful if you could try exporting an SVG with Arabic from brat (from Data->Visualization->SVG) and see if you can edit it to render correctly.

amadanmath · 2012-06-05T04:18:01Z

"But it appears that at least Firefox renders both the same and handles the English portion correctly."

Yeah, Firefox might do the right thing when rendering HTML. However, note that we're laying out each word separately by drawing it onto the SVG canvas; so we do not have access to Firefox's heuristics. We need to know the order of the spans. So in the case of English language لغة إنجليزية, the linear order (in the text file) is

لغة (language)
إنجليزية (English)
English
language

and it should also be the order in which the elements are set in SVG (for copy/paste purposes); but the coordinates on the screen (and ultimately the visual effect) need to be (seen from left to right):

English
language
إنجليزية (English)
لغة (language)

spyysalo · 2012-06-05T04:27:12Z

@amadanmath : do I understand correctly that this last issue you mention is that it would be necessary to reverse the RTL order for parts of the document that do not use Arabic glyphs? If we were to assume that there are no such strings (i.e. everything is RTL) or that the text input has already reversed these appropriately, would this substantially ease the task?

amadanmath · 2012-06-05T05:07:58Z

Yes, I suppose that's what I'm saying. Note that for the copy-paste to work properly you'd need to make sure that only the coordinates are reshuffled, but the order in which they're put into SVG is not. I believe a good algorithm might be: lay all chunks out as they appear (showing them RTL); then find sequences of LTR chunks in the same row, and recalculate their coordinates so that they appear in the reverse order, without changing anything else.
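The reshuffling described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the chunk shape (`text`, `x`, `width`, `row`, `dir`) and both function names are hypothetical and are not taken from brat's code. The array order, which is what copy/paste depends on, is left untouched; only the `x` coordinates of consecutive LTR chunks in the same row are recomputed.

```javascript
// Hypothetical chunk shape: { text, x, width, row, dir }, where x is the left
// edge in pixels and dir is "ltr" or "rtl". This is not brat's real data model.
function reverseLtrRuns(chunks) {
  // Copy the chunks so the caller's objects keep their original coordinates.
  const result = chunks.map((c) => ({ ...c }));
  let i = 0;
  while (i < result.length) {
    if (result[i].dir !== "ltr") {
      i += 1;
      continue;
    }
    // Find the end of the run of LTR chunks that share this row.
    let j = i;
    while (
      j + 1 < result.length &&
      result[j + 1].dir === "ltr" &&
      result[j + 1].row === result[i].row
    ) {
      j += 1;
    }
    relayoutRun(result, i, j);
    i = j + 1;
  }
  return result;
}

function relayoutRun(chunks, start, end) {
  // Remember the gap between each pair of neighbours in the RTL layout.
  const gaps = [];
  for (let k = start; k < end; k += 1) {
    gaps.push(chunks[k].x - (chunks[k + 1].x + chunks[k + 1].width));
  }
  // In the RTL layout the last chunk of the run is the leftmost one, so the
  // run starts there; now place the chunks left to right in document order.
  let cursor = chunks[end].x;
  for (let k = start; k <= end; k += 1) {
    chunks[k].x = cursor;
    cursor += chunks[k].width + (k < end ? gaps[k - start] : 0);
  }
}

// Chunks in document order, already laid out right to left.
const row = [
  { text: "لغة", x: 260, width: 30, row: 0, dir: "rtl" },
  { text: "إنجليزية", x: 190, width: 60, row: 0, dir: "rtl" },
  { text: "English", x: 120, width: 60, row: 0, dir: "ltr" },
  { text: "language", x: 40, width: 70, row: 0, dir: "ltr" },
];
const fixedRow = reverseLtrRuns(row);
// "English" moves to x = 40 and "language" to x = 110; the Arabic chunks and
// the order of the array stay as they were.
```

The run keeps the same horizontal extent as before, so nothing outside it has to move.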

Obviously if there's no RTL text, the task is easier. Still not easy, since we have a bunch of places where the assumption is LTR. Also, I'm still not 100% convinced I'd know how to tell LTR chunks from RTL ones.
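One rough way to tell LTR chunks from RTL ones is to look at which kinds of characters a chunk contains. The sketch below is a heuristic only, not the Unicode Bidirectional Algorithm: it assumes Arabic and Hebrew are the only RTL scripts of interest, treats every other letter as LTR, and treats digits as running left to right. The function name `guessDirection` and the four labels are made up for illustration.

```javascript
// Heuristic direction guess for a chunk. Assumptions (not from brat):
// only Arabic and Hebrew count as RTL, any other letter counts as LTR,
// and digits always run left to right.
const DIGIT = /\p{Nd}/u;
const RTL_SCRIPT = /[\p{Script=Arabic}\p{Script=Hebrew}]/u;
const LETTER = /\p{L}/u;

function guessDirection(chunk) {
  let hasRtl = false;
  let hasLtr = false;
  for (const ch of chunk) {
    if (DIGIT.test(ch)) {
      hasLtr = true; // digits are displayed left to right
    } else if (RTL_SCRIPT.test(ch)) {
      hasRtl = true;
    } else if (LETTER.test(ch)) {
      hasLtr = true; // any other letter is assumed to be LTR
    }
  }
  if (hasRtl && hasLtr) return "mixed";
  if (hasRtl) return "rtl";
  if (hasLtr) return "ltr";
  return "neutral"; // punctuation or whitespace only
}
```

A chunk reported as "mixed" cannot be positioned as a single LTR or RTL box, so a layout built on this check would still need a separate rule for such chunks.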

It may never happen, I don't know, but say you have "كربون-12" ("carbon-12"), and someone annotates "بون-1" ("bon-1"). You can see that it visually becomes a discontinuous span (but it is not discontinuous byte-wise). The chunk is "كربون-12", but it's neither LTR nor RTL - it's hybrid.

spyysalo · 2012-06-05T09:58:20Z

Even though I don't really know about the client, the example you give sounds like it would take a lot of work to do right. If we want to get all that for the first iteration of Arabic support, I'm guessing it might be a while.

Could the tool still be useful for annotating Arabic if we were to assume that everything is RTL? This would get cases like English language لغة إنجليزية and "كربون-12" wrong, but perhaps it would still be better to have "mostly OK" support now rather than perfect support much later? (Comments from someone with an understanding of the frequency of these types of cases would be much appreciated!)

fsalotaibi · 2012-06-05T10:27:02Z

"It may never happen, I don't know, but say you have "كربون-12" ("carbon-12"), and someone annotates "بون-1" ("bon-1"). You can see that it visually becomes a discontinuous span (but it is not discontinuous byte-wise). The chunk is "كربون-12", but it's neither LTR nor RTL - it's hybrid."

What researchers do when they want to annotate a piece of Arabic text is tokenize it first as a preprocessing step, so proper tokenization is not brat's duty. I believe no one will try to annotate something like "بون-1" ("bon-1") inside "كربون-12" ("carbon-12").

The word order of mixed Arabic and English text is handled perfectly by Microsoft software such as Word. We could take inspiration from the same algorithm. But I'll give you a simple statistic that may convince you.
Taking the Arabic Wikipedia as a case study, I found that:
85.3% of the tokens are Arabic words.
1.2% of the tokens are English words.
2.71% of the tokens are numbers.
10.83% of the tokens are symbols.

So the mix of RTL and LTR would not be that serious (for now, though it would be very powerful to support), as the total number of English words is very small. I'm in doubt about the numbers and symbols.
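A breakdown like the one above could be reproduced on another corpus with a short script. The sketch below is not how the Wikipedia figures were obtained (the thread does not say); it assumes whitespace tokenization and four made-up categories, and the function name `tokenBreakdown` is hypothetical.

```javascript
// Counts whitespace-separated tokens by kind. The categories and the
// tokenization are assumptions made for this sketch.
function tokenBreakdown(text) {
  const counts = { arabic: 0, latin: 0, number: 0, symbol: 0 };
  const tokens = text.split(/\s+/u).filter((t) => t.length > 0);
  for (const token of tokens) {
    if (/^\p{N}+$/u.test(token)) {
      counts.number += 1; // digits only
    } else if (/\p{Script=Arabic}/u.test(token)) {
      counts.arabic += 1; // contains at least one Arabic-script character
    } else if (/\p{Script=Latin}/u.test(token)) {
      counts.latin += 1; // contains at least one Latin-script character
    } else {
      counts.symbol += 1; // punctuation and everything else
    }
  }
  const total = tokens.length;
  const percent = {};
  for (const [kind, n] of Object.entries(counts)) {
    percent[kind] = total === 0 ? 0 : (100 * n) / total;
  }
  return { total, counts, percent };
}
```

Tokens that mix Arabic with digits or Latin letters are counted as Arabic here, so the share of hybrid tokens would need a separate counter.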

** If you go with the option of assuming everything is RTL, I'm happy to test it and give you feedback on the pros and cons.

Fahd

spyysalo · 2012-06-05T10:35:48Z

@fsalotaibi : thank you for the information and statistics! I believe it should make the initial implementation much easier if we can make the assumptions that 1) the text is pre-tokenized and 2) everything is RTL. @amadanmath : what would implementing this require on the server side?

fsalotaibi · 2012-07-02T20:45:37Z

I hope I'm not being annoying: is there any news about supporting Arabic? I'm involved with others in building an Arabic NE corpus, and we are planning to start annotating in two weeks' time. I really support using this nice tool because of its functionality, and the team members are still waiting for it as well. I'm afraid time will be the issue.

I believe supporting an RTL language like this would be very good for the tool's reputation.

** I tried to modify the code, but I got stuck trying to understand how the glyph positions are calculated so that I could switch them to RTL. Can anyone point me to the right piece of code so I can try?

spyysalo · 2012-07-03T10:47:38Z

@fsalotaibi : not annoying at all, thanks for reminding us! We have a few other features prioritized right now, but if you're willing to have a look at the code, we'd be happy to help.

@amadanmath : could you provide some pointers on what would need to be changed to make this happen?

fsalotaibi · 2012-07-03T18:36:25Z

@spyysalo: Thank you, I'm trying my best to understand how this could happen. It seems brat is a big project to understand in a short time. I only have two weeks before the annotation project starts, and I still support this tool within my team.

@amadanmath : I worked on a prototype to illustrate what is needed to support Arabic:

Actually brat already supports UTF-8, so the characters are fully supported.
The only problem is with the direction displayed for both the text and the annotation tags including the arcs.
The current output when displaying Arabic sentences:
http://i50.tinypic.com/1slf7s.png
By the way, I tested this on Chrome, Safari and Firefox. The best display I got is with Firefox, even though it is not fully supported by brat. Look at how this looks on Chrome and Safari:
http://i48.tinypic.com/2wbvww7.png
It is completely overlapping.

The desired and proper way is shown in the following prototype:
http://i48.tinypic.com/2r3kta8.png

As you can see:

The box is all in RTL
Sentence numbers, i.e. row numbers, are on the right
The token, i.e. word, and the tag are aligned properly.
The arc direction is from right to left.

This is what we need at this stage. I'm not sure how difficult this work is. As I said earlier, I'm very happy to evaluate this work while the support is being developed. I'm really excited about getting this tool to support Arabic. I believe this will open many doors for other researchers.

spyysalo · 2012-07-04T03:36:08Z

@fsalotaibi : thank you for your efforts on this! I'm afraid I can't help myself on the technical aspects as I don't know the relevant part of the client code, but hopefully @amadanmath can. I agree this would be a valuable feature to have.

For ease of reference, I'm placing your screenshots inline here (click on "GitHub Flavored Markdown" in the comment form for syntax):

The current output when displaying Arabic sentences:

how this looks on Chrome and Safari:

The desired and proper way is shown in the following prototype:

amadanmath · 2012-07-04T13:47:41Z

Okay, some quick pointers:

If you look at client/src/visualizer.js, you will find the function renderDataReal. It is rather huge, and does the layout.

In it you will find the variable currentX. It starts a little past the left edge (leaving room for the sentence number), and will be used to position the next chunk. There is a check whether the current chunk has overflowed the right margin and needs to be put into a new row; if so, currentX is reset to the start of the next row.

As the first step, these procedures would need to be reversed; if an RTL language is rendered, start with the right edge (leaving the space for the sentence number), decrease currentX, and check if it falls below the left margin.
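To make the reversal concrete, here is a minimal, self-contained sketch of the row-wrapping logic in both directions. It is not the code from renderDataReal: only the name `currentX` comes from the description above, while the function `layoutChunks`, its parameters (`canvasWidth`, `margin`, `gap`, `rtl`) and the returned shape are assumptions made for illustration. The margin reserved for the sentence number is assumed to sit on the side where each row starts.

```javascript
// Places chunks row by row and returns the left edge and row of each chunk.
// Only `currentX` is a name taken from the discussion; everything else is
// hypothetical and simplified.
function layoutChunks(widths, { canvasWidth, margin, gap, rtl }) {
  const placed = [];
  // LTR rows start just past the left margin, RTL rows just before the
  // right one (the margin leaves room for the sentence number).
  const rowStart = rtl ? canvasWidth - margin : margin;
  let row = 0;
  let currentX = rowStart;
  for (const width of widths) {
    const overflows = rtl
      ? currentX - width < 0 // would cross the left edge
      : currentX + width > canvasWidth; // would cross the right edge
    // Wrap to a new row, unless the chunk is already first in its row.
    if (overflows && currentX !== rowStart) {
      row += 1;
      currentX = rowStart;
    }
    if (rtl) {
      currentX -= width; // move left first, then draw at currentX
      placed.push({ x: currentX, row });
      currentX -= gap;
    } else {
      placed.push({ x: currentX, row }); // draw first, then move right
      currentX += width + gap;
    }
  }
  return placed;
}

const widths = [60, 60, 70, 50];
const options = { canvasWidth: 200, margin: 20, gap: 10 };
const ltrLayout = layoutChunks(widths, { ...options, rtl: false });
const rtlLayout = layoutChunks(widths, { ...options, rtl: true });
// ltrLayout x values: 20, 90, then a new row with 20, 100
// rtlLayout x values: 120, 50, then a new row with 110, 50
```

In both modes the third chunk wraps to the second row; in RTL mode each `x` is still the left edge of the chunk, which is why `currentX` is decreased before the chunk is placed.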

I don't know what getStartPositionOfChar and similar functions return for RTL languages (positive or negative numbers? where is the origin?) but you will likely need to also mess with the function getTextAndSpanTextMeasurements, which calculates at which point spans start and finish inside their chunks. Also depending on where the origin is, you might need to change the calculation of the position of the span boxes... And places where it says fragment.right or similar, they would actually need to point at the left side...

There's a bunch of things I am skipping over here, as the visualisation part is quite complex.

spyysalo · 2012-10-18T05:27:40Z

Hi @fsalotaibi : I chanced on https://www.odesk.com/o/jobs/job/Modifying-Javascript-canvas-GUI_~~fb065ce0129fa79c/, which suggests that you found a way to implement Arabic support. Great! Would you be prepared to consider contributing the implementation of this feature back to brat, so that others in the user community could also benefit from it?

ghost · 2012-12-25T00:39:31Z

I had no idea that Unicode had LTR and RTL control characters, so I will leave this link here for future reference even though using them is discouraged: http://www.w3.org/International/questions/qa-bidi-controls

FatimahNLP · 2014-10-20T17:08:08Z

Hello all.
I need urgent help. Does the brat tool support Arabic labeling? My project needs an Arabic annotation tool. If yes, please tell me the steps to support Arabic labeling in brat.

spyysalo · 2014-10-20T18:21:25Z

No explicit support has been implemented, but from some recent discussion on the mailing list it appears that it is possible to use brat to annotate Arabic using recent versions of Firefox.

FatimahNLP · 2014-10-20T22:50:18Z

Thanks @spyysalo
Who can help me add labeling and annotation in Arabic?

ghost · 2014-11-05T12:24:40Z

As relevant as this is, I don't see it happening before v1.4.

spyysalo · 2014-11-05T13:09:50Z

As discussed on the list recently, there has been some success annotating Arabic on recent versions of Firefox. We might wish to document the conditions for making this work.

fsalotaibi · 2014-11-05T14:19:12Z

That was a very long time ago. We successfully managed to apply right-to-left (RTL) rendering in brat. Please see this example of Arabic (RTL) text:
http://www.ebsar.com/brat/#/FGANER/109-out

The modification is part of our project and it is still not released to the public. Meanwhile, anyone who wants to use brat on our server, please don't hesitate to contact me on fahd_alotaibi(AT)hotmail.com, we may be able to give you such access to use it online to tag Arabic text.

** Please use either Google Chrome or Firefox to get the correct rendering result. (Internet Explorer is not supported.)

icycandy · 2014-11-05T14:39:20Z

@fsalotaibi Do you have any plans to release it to the public? I have some Arabic text to annotate, and currently Excel is used.

FatimahNLP · 2014-11-05T18:27:22Z

Thanks very much, fsalotaibi and icycandy.
I appreciate your help.
I need the steps to let brat accept text from right to left, i.e. the steps to annotate Arabic text using brat.
Thanks in advance.

reckart · 2015-06-29T14:27:22Z

We have added experimental support for right-to-left to WebAnno now. To this end, I have patched the brat JavaScript files that we use in WebAnno to support an LTR and an RTL mode. The changes are all conspicuously marked and should be reasonably easy to transfer back into brat.

In particular, the changes do

place the sentence numbers on the right in RTL mode
support token-level annotations in RTL mode
support relations (arcs) in RTL mode

Some functionalities may not have been fixed for RTL because we don't use them in WebAnno.

Also, there are some known issues, e.g.:

sub-token annotations in Firefox are broken (but work for Chrome/Safari)
sub-token annotations on LTR tokens in RTL mode (e.g. numbers) are rendered wrong

Anybody interested in integrating this back into brat?

https://github.com/webanno/webanno/blob/2.2.x/webanno-brat/src/main/java/de/tudarmstadt/ukp/clarin/webanno/brat/resource/visualizer.js

ghost · 2015-06-29T15:24:51Z

@reckart: Cool! We are certainly interested. @amadanmath: When you have the time, could you have a look at putting this into a branch?

lcrist · 2015-08-18T19:34:26Z

Hi, just wondered if there's been any activity or a timeline for inclusion of RTL-enabled brat?

spyysalo · 2015-09-18T14:08:47Z

@amadanmath : could you please have a look at #774 (comment) and #1150?

amadanmath · 2016-03-11T09:05:25Z

Sorry it took me forever to address this; WebAnno changes backported to brat. Thank you, @reckart.

It is committed to the branch feature-rtl; if anyone wants to test it, please do (I can't test it properly as I can't read any RTL languages).

You will need to include the following in the visual.conf:

[options]
Text direction:rtl

amadanmath · 2016-03-11T09:12:45Z

Seems it bugs a bit on mixed directionality text -- try selecting half of the abbreviation and half of the neighbouring Arabic word:

والحلقة الناقصة كانت دمج برنامج CYPNET الذي ينقل الملفات، ببرنامج SNDMSG الذي يكتب الرسالة، وكان نتاج هذا الاندماج البريد الإلكتروني.

reckart · 2016-03-11T09:32:41Z

Fabulous!

Well, yes, mixed tokens are a known issue in our code. I hope that sharing the code between brat and WebAnno increases the chance that somebody picks up the baton and addresses the remaining issues and that both projects can profit from this.
