Support to Annotate Arabic #774

Open
ghost opened this issue May 14, 2012 · 37 comments

@ghost

ghost commented May 14, 2012

This issue is to discuss the current shortcomings regarding Arabic script and how/if they can be resolved given our current architecture.

Emad Mohamed mentioned on the corpora mailing list that they can use the ASCII Buckwalter encoding for Arabic but that it is sub-optimal. We really need a native speaker to help out with this, but at least from what I could read on CPAN it looks like a dreadful hack to get Arabic into ASCII.

According to @amadanmath, the following should be an issue:

<head>
    <meta charset="utf-8"/>
</head>

<body dir="rtl">
    <p>
        <span dir="rtl">
            لغة إنجليزية <span dir="ltr">English</span> لغة إنجليزية
        </span>
    </p>

    <p>
        <span dir="rtl">
            لغة إنجليزية English لغة إنجليزية
        </span>
    </p>
</body>

But it appears that at least Firefox renders both the same and handles the English portion correctly.

From talking to one of the attendees at EACL 2012, tokenisation may also become an issue. For this we could use an approach similar to what we have already done for Japanese and incorporate a morphological analyser to find the start and end of the "tokens".

Here is one I found after some minor Googling:

https://github.com/mosta/raramorph
@ghost ghost assigned spyysalo May 14, 2012
@fsalotaibi

Hi there,

I'm working in Arabic NLP and am very interested in helping this tool support Arabic.

I believe that if this happens, the tool will get many citations, as research on Arabic NLP is flourishing these days and has become important.

Regarding supporting the ASCII transliteration (Buckwalter) instead of the actual Arabic glyphs, I believe this is not a good choice. As you know, the transliteration is difficult to read, especially for someone working on an annotation task.

The optimal choice is to support RTL rendering with Arabic glyphs.
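
To make the readability concern concrete, here is a minimal Python sketch that transliterates the two example words used earlier in this thread. The table is only a small subset of the Buckwalter scheme, written out by hand for illustration; it is not taken from brat or from any library, and the name `to_buckwalter` is hypothetical.

```python
# Partial Buckwalter table: only the letters needed for the example below.
# Assumption: these follow the standard Buckwalter scheme; check a full
# reference table before using this on real data.
BUCKWALTER_SUBSET = {
    "\u0625": "<",  # alef with hamza below
    "\u0644": "l",  # lam
    "\u063a": "g",  # ghain
    "\u0629": "p",  # teh marbuta
    "\u0646": "n",  # noon
    "\u062c": "j",  # jeem
    "\u064a": "y",  # yeh
    "\u0632": "z",  # zain
}


def to_buckwalter(text):
    """Transliterate known letters; pass anything else through unchanged."""
    return "".join(BUCKWALTER_SUBSET.get(ch, ch) for ch in text)


# The phrase "Arabic for 'English language'" from the HTML example above.
phrase = "\u0644\u063a\u0629 \u0625\u0646\u062c\u0644\u064a\u0632\u064a\u0629"
print(to_buckwalter(phrase))
```

An annotator would be reading strings like the printed one instead of the Arabic glyphs, which is the difficulty described above.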

Please feel free to contact me; I'm very happy to be involved.

Fahd

@spyysalo
Member

spyysalo commented Jun 5, 2012

Hi @fsalotaibi,

Thanks for your interest in brat! We're happy to welcome any contribution to Arabic support in brat, and would much appreciate your help on this feature.

For rendering the actual glyphs in brat, as a first step, we would need to know how to create an SVG document with Arabic that renders correctly in at least one major browser. If you can look into this, it would be very helpful if you could try exporting an SVG with Arabic from brat (from Data->Visualization->SVG) and see if you can edit it to render correctly.

@amadanmath
Contributor

But it appears that at least Firefox renders both the same and handles the English portion correctly.

Yeah, Firefox might do the right thing when rendering HTML. However, note that we're laying out each word separately by drawing it onto the SVG canvas; so we do not have access to Firefox's heuristics. We need to know the order of the spans. So in the case of English language لغة إنجليزية, the linear order (in the text file) is

  1. لغة (language)
  2. إنجليزية (English)
  3. English
  4. language

and it should also be the order in which the elements are set in SVG (for copy/paste purposes); but the coordinates on the screen (and ultimately the visual effect) need to be (seen from left to right):

  1. English
  2. language
  3. إنجليزية (English)
  4. لغة (language)

@spyysalo
Member

spyysalo commented Jun 5, 2012

@amadanmath : do I understand correctly that this last issue you mention is that it would be necessary to reverse the RTL order for parts of the document that do not use Arabic glyphs? If we were to assume that there are no such strings (i.e. everything is RTL) or that the text input has already reversed these appropriately, would this substantially ease the task?

@amadanmath
Contributor

Yes, I suppose that's what I'm saying. Note that for the copy-paste to work properly you'd need to make sure that only the coordinates are reshuffled, but the order in which they're put into SVG is not. I believe a good algorithm might be: lay all chunks out as they appear (showing them RTL); then find sequences of LTR chunks in the same row, and recalculate their coordinates so that they appear in the reverse order, without changing anything else.
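
The algorithm described above can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not brat's actual client code (which is JavaScript): the `Chunk` class, `layout_rtl`, `fix_ltr_runs` and the `gap` parameter are all hypothetical names, chunks are assumed to already know whether they are LTR, and only a single row is laid out.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    width: float
    is_ltr: bool
    row: int = 0
    x: float = 0.0  # left edge of the chunk on screen


def layout_rtl(chunks, right_edge, gap=5.0):
    """Place chunks right to left on one row, in document order."""
    cursor = right_edge
    for chunk in chunks:
        cursor -= chunk.width
        chunk.x = cursor
        cursor -= gap


def fix_ltr_runs(chunks, gap=5.0):
    """Flip the screen order of each consecutive run of LTR chunks on a row.

    Only the x coordinates change; the list order (document order, which
    matters for copy/paste) is left untouched.
    """
    i = 0
    while i < len(chunks):
        if not chunks[i].is_ltr:
            i += 1
            continue
        j = i
        while (j + 1 < len(chunks) and chunks[j + 1].is_ltr
               and chunks[j + 1].row == chunks[i].row):
            j += 1
        # After the RTL pass the last chunk of the run is the leftmost one,
        # so re-lay the run from that left edge in document order.
        cursor = chunks[j].x
        for chunk in chunks[i:j + 1]:
            chunk.x = cursor
            cursor += chunk.width + gap
        i = j + 1
```

With the four chunks from the earlier example in document order (two Arabic words, then "English", then "language"), calling `layout_rtl` and then `fix_ltr_runs` leaves "English" to the left of "language" on screen while the Arabic words keep their positions.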

Obviously if there's no RTL text, the task is easier. Still not easy, since we have a bunch of places where the assumption is LTR. Also, I'm still not 100% convinced I'd know how to tell LTR chunks from RTL ones.

It may never happen, I don't know, but say you have "كربون-12" ("carbon-12"), and someone annotates "بون-1" ("bon-1"). You can see that it visually becomes a discontinuous span (but it is not discontinuous byte-wise). The chunk is "كربون-12", but it's neither LTR nor RTL - it's hybrid.
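
On telling LTR chunks from RTL ones: one possible heuristic, sketched here in Python, is to look at the Unicode bidirectional class of each character via the standard library's `unicodedata.bidirectional`. This is only a rough approximation and not the full Unicode Bidirectional Algorithm; the function name `chunk_direction` and the decision to count European digits as LTR are assumptions made for this sketch.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left: Hebrew and Arabic letters
LTR_CLASSES = {"L", "EN"}  # strong left-to-right letters, plus European
                           # digits, which are displayed left to right


def chunk_direction(text):
    """Classify a chunk as 'rtl', 'ltr', 'mixed' or 'neutral'."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    has_rtl = bool(classes & RTL_CLASSES)
    has_ltr = bool(classes & LTR_CLASSES)
    if has_rtl and has_ltr:
        return "mixed"
    if has_rtl:
        return "rtl"
    if has_ltr:
        return "ltr"
    return "neutral"  # only punctuation, spaces, or other weak classes
```

Under these assumptions the "carbon-12" chunk above comes out as "mixed", which at least flags the hybrid case so it can be handled (or rejected) separately.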

@spyysalo
Member

spyysalo commented Jun 5, 2012

Even though I don't really know about the client, the example you give sounds like it would take a lot of work to do right. If we want to get all that for the first iteration of Arabic support, I'm guessing it might be a while.

Could the tool still be useful for annotating Arabic if we were to assume that everything is RTL? This would get cases like English language لغة إنجليزية and "كربون-12" wrong, but perhaps it would still be better to have "mostly OK" support now rather than perfect support much later? (Comments from someone with an understanding of the frequency of these types of cases would be much appreciated!)

@fsalotaibi

"It may never happen, I don't know, but say you have "كربون-12" ("carbon-12"), and someone annotates "بون-1" ("bon-1"). You can see that it visually becomes a discontinuous span (but it is not discontinuous byte-wise). The chunk is "كربون-12", but it's neither LTR nor RTL - it's hybrid."

What researchers do when they want to annotate a piece of Arabic text is to tokenize it first as a preprocessing step, so it is not brat's job to take care of proper tokenization. I believe no one will try to annotate something like "بون-1" ("bon-1") inside "كربون-12" ("carbon-12").

The word order of mixed Arabic and English text is handled perfectly by Microsoft software such as Word. We could take inspiration from the same algorithm. But I'll give you a simple statistic that may convince you.
Taking the Arabic Wikipedia as a case study, I found that:
85.3% of the tokens are Arabic words.
1.2% of the tokens are English words.
2.71% of the tokens are numbers.
10.83% of the tokens are symbols.

So the mix of RTL and LTR should not be that serious for now (although supporting it would be very valuable), as the total number of English words is very small. I'm unsure about the numbers and symbols.

** If you go with the option of assuming everything is RTL, I'm happy to test it and give you feedback on the pros and cons.

Fahd

@spyysalo
Member

spyysalo commented Jun 5, 2012

@fsalotaibi : thank you for the information and statistics! I believe it should make the initial implementation much easier if we can make the assumptions that 1) the text is pre-tokenized and 2) everything is RTL. @amadanmath : what would implementing this require on the server side?

@fsalotaibi

I hope I'm not being annoying, but is there any news about supporting Arabic? I'm involved with others in building an Arabic NE corpus, and we are planning to start annotating in two weeks' time. I strongly support using this nice tool because of its functionality, and the team members are still waiting for it as well. I'm afraid time will be the issue.

I believe supporting an RTL language would be very good for the tool's reputation.

** I tried to modify the code, but I got stuck trying to understand how the glyph positions are calculated so that I could switch them to RTL. Can anyone point me to the right piece of code so I can try?

@spyysalo
Member

spyysalo commented Jul 3, 2012

@fsalotaibi : not annoying at all, thanks for reminding us! We have a few other features prioritized right now, but if you're willing to have a look at the code, we'd be happy to help.

@amadanmath : could you provide some pointers on what would need to be changed to make this happen?

@fsalotaibi

@spyysalo: Thank you, I'm trying my best to understand how this could be done. brat seems to be a big project to understand in a short time. I only have two weeks before the annotation project starts, and I still support using this tool within my team.

@amadanmath : I worked on a prototype to illustrate what is needed to support Arabic:

  1. brat already supports UTF-8, so the characters are fully supported.
  2. The only problem is the direction in which the text and the annotation tags, including the arcs, are displayed.
    The current output when displaying Arabic sentences:
    http://i50.tinypic.com/1slf7s.png
    By the way, I tested this on Chrome, Safari and Firefox. The best display I got is with Firefox, even though it is not fully supported by brat. Look at how this looks on Chrome and Safari:
    http://i48.tinypic.com/2wbvww7.png
    The text is completely overlapping.

The desired and proper way is shown in the following prototype:
http://i48.tinypic.com/2r3kta8.png

As you can see:

  1. The box is all in RTL
  2. Sentence numbers, i.e. row numbers, are on the right
  3. The token, i.e. word, and the tag are aligned properly.
  4. The arc direction is from right to left.

This is what we need at this stage. I'm not sure how difficult this work is. As I said earlier, I'm very happy to evaluate this work while the support is being developed. I'm really excited to see this tool support Arabic. I believe this will open many doors for other researchers.

@spyysalo
Member

spyysalo commented Jul 4, 2012

@fsalotaibi : thank you for your efforts on this! I'm afraid I can't help myself on the technical aspects as I don't know the relevant part of the client code, but hopefully @amadanmath can. I agree this would be a valuable feature to have.

For ease of reference, I'm placing your screenshots inline here (click on "GitHub Flavored Markdown" in the comment form for syntax):

The current output when displaying Arabic sentences:

how this looks on Chrome and Safari:

The desired and proper way is shown in the following prototype:

@amadanmath
Contributor

Okay, some quick pointers:

If you look at client/src/visualizer.js, you will find the function renderDataReal. It is rather huge, and does the layout.

In it you will find the variable currentX. It starts a little past the left edge (leaving room for the sentence number), and will be used to position the next chunk. There is a check whether the current chunk has overflowed the right margin and needs to be put into a new row; if so, currentX is reset to the start of the next row.

As the first step, these procedures would need to be reversed: if an RTL language is rendered, start at the right edge (leaving space for the sentence number), decrease currentX, and check whether it falls below the left margin.
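
A minimal sketch of that reversed loop, written in Python for brevity rather than in the client's JavaScript, might look like the following. Only the idea of `currentX` comes from the description above; the function name `layout_rows_rtl`, its parameters and the default margins are hypothetical, and the real `renderDataReal` does far more than this.

```python
def layout_rows_rtl(chunk_widths, canvas_width, number_margin=40.0,
                    left_margin=10.0, gap=5.0):
    """Return (row, x) for each chunk, where x is the chunk's left edge."""
    # Start at the right edge, leaving room for the sentence number.
    start_x = canvas_width - number_margin
    current_x = start_x
    row = 0
    positions = []
    for width in chunk_widths:
        # Wrap when the chunk would cross the left margin, unless it is the
        # first chunk of its row (an over-wide chunk still has to be placed).
        if current_x - width < left_margin and current_x != start_x:
            row += 1
            current_x = start_x
        current_x -= width          # move leftwards instead of rightwards
        positions.append((row, current_x))
        current_x -= gap
    return positions
```

For example, three chunks of width 100 on a 300-wide canvas with the defaults above put the first two on row 0 and wrap the third to row 1, starting again from the right.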

I don't know what getStartPositionOfChar and similar functions return for RTL languages (positive or negative numbers? where is the origin?) but you will likely need to also mess with the function getTextAndSpanTextMeasurements, which calculates at which point spans start and finish inside their chunks. Also depending on where the origin is, you might need to change the calculation of the position of the span boxes... And places where it says fragment.right or similar, they would actually need to point at the left side...

There's a bunch of things I am skipping over here, as the visualisation part is quite complex.

@spyysalo
Member

Hi @fsalotaibi : I chanced on https://www.odesk.com/o/jobs/job/Modifying-Javascript-canvas-GUI_~~fb065ce0129fa79c/, which suggests that you found a way to implement Arabic support. Great! Would you be prepared to consider contributing the implementation of this feature back to brat, so that others in the user community could also benefit from it?

@ghost
Author

ghost commented Dec 25, 2012

I had no idea that Unicode had LTR and RTL features, so I will leave this link here for future reference even though using it is discouraged: http://www.w3.org/International/questions/qa-bidi-controls
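
For reference, the embedding controls in question are ordinary code points that can be placed in a string. The Python sketch below shows what wrapping text in them looks like; the helper names `embed` and `strip_bidi_controls` are hypothetical. One practical caveat, stated as an assumption about how this would interact with brat: the controls are extra characters, so inserting them into a document would shift any character offsets computed on the original text.

```python
import unicodedata

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def embed(text, rtl):
    """Wrap text in an explicit directional embedding."""
    return (RLE if rtl else LRE) + text + PDF


def strip_bidi_controls(text):
    """Remove the embedding controls again, e.g. before computing offsets."""
    return "".join(ch for ch in text if ch not in (RLE, LRE, PDF))


# The names printed here come from the Unicode character database.
for control in (RLE, LRE, PDF):
    print(f"U+{ord(control):04X}", unicodedata.name(control))
```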

@FatimahNLP

Hello all.
I need urgent help. Does the brat tool support Arabic labeling? My project needs an Arabic annotation tool. If yes, please tell me the steps needed to support Arabic labeling in brat.

@spyysalo
Member

No explicit support has been implemented, but from some recent discussion on the mailing list it appears that it is possible to use brat to annotate Arabic using recent versions of Firefox.

@FatimahNLP

Thanks @spyysalo
Who can help me with how to add labeling and annotation in Arabic?

@ghost ghost modified the milestones: v1.5 Amaze-a-Vole, v1.4 Lemon Curry Nov 5, 2014
@ghost
Author

ghost commented Nov 5, 2014

As relevant as this is, I don't see it happening before v1.4.

@spyysalo
Member

spyysalo commented Nov 5, 2014

As discussed on the list recently, there has been some success annotating Arabic on recent versions of Firefox. We might wish to document the conditions for making this work.

@fsalotaibi

That was a very long time. We have successfully managed to apply right-to-left (RTL) rendering in brat. Please see this example of Arabic (RTL) text:
http://www.ebsar.com/brat/#/FGANER/109-out

The modification is part of our project and it is still not released to the public. Meanwhile, anyone who wants to use brat on our server, please don't hesitate to contact me on fahd_alotaibi(AT)hotmail.com, we may be able to give you such access to use it online to tag Arabic text.

** Please use either Google Chrome or Firefox to get the correct rendering result (Internet Explorer is not supported).

@icycandy

icycandy commented Nov 5, 2014

@fsalotaibi Do you have any plan to release it to the public? I have some Arabic text to annotate, and currently Excel is used.

@FatimahNLP

Thanks very much fsalotaibi and icycandy
I appreciate your help
I need the steps to make brat accept text from right to left, i.e. the steps to annotate Arabic text using brat.
thanks in advance.

@reckart

reckart commented Jun 29, 2015

We have now added experimental support for right-to-left text to WebAnno. To this end, I have patched the brat JavaScript files that we use in WebAnno to support an LTR and an RTL mode. The changes are all conspicuously marked and should be reasonably easy to transfer back into brat.

In particular, the changes:

  • place the sentence numbers on the right in RTL mode
  • support token-level annotations in RTL mode
  • support relations (arcs) in RTL mode

Some functionalities may not have been fixed for RTL because we don't use them in WebAnno.

Also, there are some known issues, e.g.:

  • sub-token annotations in Firefox are broken (but work in Chrome/Safari)
  • sub-token annotations on LTR tokens in RTL mode (e.g. numbers) are rendered incorrectly

Anybody interested in integrating this back into brat?

https://github.com/webanno/webanno/blob/2.2.x/webanno-brat/src/main/java/de/tudarmstadt/ukp/clarin/webanno/brat/resource/visualizer.js

@ghost
Author

ghost commented Jun 29, 2015

@reckart: Cool! We are certainly interested. @amadanmath: When you have the time, could you have a look at putting this into a branch?

@lcrist

lcrist commented Aug 18, 2015

Hi, just wondering whether there's been any activity or a timeline for the inclusion of RTL-enabled brat?

@spyysalo
Member

@amadanmath : could you please have a look at #774 (comment) and #1150?

amadanmath pushed a commit that referenced this issue Mar 11, 2016
@amadanmath
Contributor

Sorry it took me forever to address this; WebAnno changes backported to brat. Thank you, @reckart.

It is committed to the branch feature-rtl; if anyone wants to test it, please do (I can't test it properly as I can't read any RTL languages).

You will need to include the following in the visual.conf:

[options]
Text direction:rtl

@amadanmath
Contributor

It seems to be a bit buggy on mixed-directionality text -- try selecting half of the abbreviation and half of the neighbouring Arabic word:

والحلقة الناقصة كانت دمج برنامج CYPNET الذي ينقل الملفات، ببرنامج SNDMSG الذي يكتب الرسالة، وكان نتاج هذا الاندماج البريد الإلكتروني.

@reckart

reckart commented Mar 11, 2016

Fabulous!

Well, yes, mixed tokens are a known issue in our code. I hope that sharing the code between brat and WebAnno increases the chance that somebody picks up the baton and addresses the remaining issues and that both projects can profit from this.

See also: webanno/webanno#49

@reckart

reckart commented May 17, 2016

I finally found some non-trivially annotated RTL data (in Hebrew) which shows that the RTL layout doesn't push out the labels sufficiently. This needs some improvement. Cf. webanno/webanno#273

@reckart

reckart commented May 17, 2016

@amadanmath if you have any hot pointers on where to look regarding fixing the "pushing", that would be great!

@reckart

reckart commented May 17, 2016

Looks like a general layout problem with wide labels, not limited to the RTL layout or RTL glyphs.

@reckart

reckart commented May 18, 2016

Managed to fix the layout issue ;) webanno/webanno#273

@reckart

reckart commented May 19, 2016

You might find this also interesting: webanno/webanno#265 (comment)

@amadanmath
Contributor

Merged into master branch now.

@reckart

reckart commented Jun 30, 2016

I should mention that there have been more improvements to RTL mode in WebAnno, though some issues are still open:

https://github.com/webanno/webanno/issues?utf8=✓&q=is%3Aissue%20label%3ARTL

8 participants
@amadanmath @spyysalo @icycandy @reckart @fsalotaibi @lcrist @FatimahNLP and others