
.dic and .aff content by param. #19

Open · alecuba16 opened this issue Jan 14, 2022 · 7 comments · Labels: question

alecuba16 commented Jan 14, 2022

Hello!

Would it be possible to populate the dictionary by submitting a LIST with the contents of .dic and .aff?

This is useful in the case of Spark UDFs, where it is easier to pass LIST variables than to copy .dic and .aff files from the driver node to the executors.

Btw, is there any way to do stemming like in the original Hunspell library? Or is there some alternative for stemming?

zverok (Owner) commented Jan 15, 2022

> Would it be possible to populate the dictionary by submitting a LIST with the contents of .dic and .aff?

It is theoretically possible. You'll need to implement some wrapper around a list (sketched below) that meets two requirements:

  1. It is iterable, producing pairs of (line number, line).
  2. It has a method reset_encoding(encoding_name) which works in the middle of iteration and makes the following lines be read in a different encoding.
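
A minimal sketch of such a wrapper, assuming the list already holds decoded strings (the MyReader name matches the usage below; this implementation is illustrative, not part of spylls):

class MyReader:
    def __init__(self, lines):
        self.lines = lines
        self.encoding = 'Windows-1252'  # Hunspell's historical default encoding

    def __iter__(self):
        # Requirement 1: produce (line number, line) pairs.
        for num, line in enumerate(self.lines, start=1):
            yield num, line

    def reset_encoding(self, encoding):
        # Requirement 2: with an in-memory list of already-decoded strings
        # there is nothing to re-decode, so it is enough to remember the new
        # encoding; a reader over raw bytes would decode subsequent lines with it.
        self.encoding = encoding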

Once you have this, you can just:

# Parse the .aff lines first; the .dic reader needs the resulting aff and context.
aff, context = spylls.hunspell.readers.read_aff(MyReader(aff_lines_list))
dic = spylls.hunspell.readers.read_dic(MyReader(dic_lines_list), aff=aff, context=context)
dictionary = spylls.hunspell.Dictionary(aff, dic)

> Btw, is there any way to do stemming like in the original Hunspell library? Or is there some alternative for stemming?

It is not that convenient, but:

import spylls.hunspell
from spylls.hunspell.algo.capitalization import Type as CapType

dic = spylls.hunspell.Dictionary.from_files('examples/en_US')
# Each affix form the word decomposes into carries its stem.
for form in dic.lookuper.affix_forms('kittens', captype=CapType.NO):
    print(form.stem)
# prints: "kitten"
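
If this is needed often, the lookup can be wrapped into a small helper (the stem_word name is illustrative, not part of spylls; it reuses the dic and CapType from the block above):

def stem_word(dictionary, word):
    # An unknown word produces no affix forms, so this returns an empty list.
    return [form.stem
            for form in dictionary.lookuper.affix_forms(word, captype=CapType.NO)]

print(stem_word(dic, 'kittens'))  # ['kitten']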

zverok added the question label Jan 15, 2022
alecuba16 (Author) commented Jan 16, 2022

Thanks for the reply. About the first issue: I was able to populate the dictionary with the method you suggested. I have some problems with the encoding of special characters, but that is something I will address next week.

About the second issue, stemming: I ran a test with the code you provided, but it seems that some import (or library version) is preventing the captype from being passed:

Expected zero arguments for construction of ClassDict (for spylls.hunspell.algo.capitalization.Type)

The complete code snippet is below (ignore the Spark UDF wrapper):

from pyspark.sql import *
import pyspark.sql.functions as F
import pyspark.sql.types as T
import spylls.hunspell
from spylls.hunspell.algo.capitalization import Type as CapType
from pyspark import SparkFiles

def pyspark_transform(spark, df):
    def hunspell(desc):
        if desc:
            dic = spylls.hunspell.Dictionary.from_zip(SparkFiles.get("es_ES.zip"))
            return [sug for sug in dic.lookuper.affix_forms(desc, captype=CapType.NO)]
        else:
            return [""]

    # Distribute the zipped dictionary to the executors.
    dic_path = "hdfs:///hunspell/es_ES.zip"
    spark.sparkContext.addFile(dic_path)

    udf_hunspell = F.udf(hunspell, T.ArrayType(T.StringType()))

    df = df.withColumn("result", udf_hunspell(F.col("desc")))

    return df
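
As an aside, the UDF above reloads the dictionary for every row. A hedged optimization, assuming es_ES.zip is reachable via SparkFiles on every executor, is to cache the Dictionary once per worker process and return plain-string stems (matching the declared ArrayType(StringType())):

import spylls.hunspell
from spylls.hunspell.algo.capitalization import Type as CapType
from pyspark import SparkFiles

_dic = None  # per-process cache; illustrative name

def hunspell(desc):
    global _dic
    if _dic is None:
        # Load the dictionary once per executor process instead of once per row.
        _dic = spylls.hunspell.Dictionary.from_zip(SparkFiles.get("es_ES.zip"))
    if desc:
        return [form.stem for form in _dic.lookuper.affix_forms(desc, captype=CapType.NO)]
    return [""]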

zverok (Owner) commented Jan 16, 2022

> Expected zero arguments for construction of ClassDict (for spylls.hunspell.algo.capitalization.Type)

That's very weird! Can you show the full backtrace of the error?

alecuba16 (Author) commented

Thanks for the fast reply.

The stack trace shows a lot of Spark garbage that is not informative, and the only Python-related message is the weird one above. But going by your message, it seems to be something related to the Spark environment: I have executed the code in a local instance of Python, on the driver side of the Spark (PySpark) environment, and there it works properly.

So I suppose there is something with the executors' Python versions and the import of the spylls library: it is either not being imported, or being imported as None.

I will check that and come back with the solution.

alecuba16 (Author) commented

I found the problem. As I suspected, the executors' Python instances weren't able to install the spylls library, and the failing import produced a cascade of Scala<->Java errors (common in PySpark stack traces) that was hiding the real problem; I had to log in to the cluster manager to find it.

Summarizing: you were totally right, and your code can be integrated into a Spark UDF. Thanks!
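
For reference, since spylls is pure Python, one way to sidestep this class of failure (assuming a zipped copy of the package is reachable from the cluster; the path below is illustrative) is to ship the package with the job instead of installing it on every executor:

# Illustrative: distribute a zipped copy of the spylls package so that
# `import spylls` succeeds on the executors as well.
spark.sparkContext.addPyFile("hdfs:///hunspell/spylls.zip")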

alecuba16 (Author) commented

Victor, one final question about the stemming process: what is the procedure for stemming accented words like "específicos"? It seems that the affix_forms method requires non-accented words, am I right?

Thanks!

zverok (Owner) commented Jan 17, 2022

It should depend on the dictionary only (if the dictionary has accents, they should be properly processed); but with Unicode quirks you never know :)
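
A quick way to check this against a concrete dictionary (the examples/es_ES path is an assumption; substitute wherever your Spanish dictionary lives):

import spylls.hunspell
from spylls.hunspell.algo.capitalization import Type as CapType

# If the es_ES dictionary stores accented stems, the accented word
# should decompose like any other.
dic = spylls.hunspell.Dictionary.from_files('examples/es_ES')
for form in dic.lookuper.affix_forms('específicos', captype=CapType.NO):
    print(form.stem)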
