Active Normalizer

Normalizes unusual Japanese characters; see the tests for examples.

Normalizes fullwidth and halfwidth hiragana, katakana, and symbols.

Usage

Each normalizer class accepts one of four options: :nfc, :nfd, :nfkd, or :nfkc (see Normalization Forms for more information). Each normalizer instance responds to run.

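require "active_normalizer/normalizers/ruby"

# Build a normalizer backed by the standard-library implementation, using NFKC.
nfkc_normalizer = ActiveNormalizer.new(
  ActiveNormalizer::Normalizers::Ruby,
  options: :nfkc
)

input = "ｶﾞ" # any string you want to normalize
nfkc_normalizer.run(input)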
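
To make the difference between the four forms concrete, here is a minimal sketch that uses Ruby's built-in String#unicode_normalize directly rather than the active_normalizer API. The FORMS constant and the normalize_all helper are illustrative names, not part of this gem, and whether the Ruby normalizer delegates to this method internally is an assumption.

```ruby
# Illustrative only: relies on Ruby's built-in String#unicode_normalize,
# not on the active_normalizer API.
FORMS = %i[nfc nfd nfkc nfkd].freeze

# Returns a hash mapping each normalization form to the normalized string.
def normalize_all(text)
  FORMS.to_h { |form| [form, text.unicode_normalize(form)] }
end

# Halfwidth katakana KA followed by the halfwidth voiced sound mark.
halfwidth_ga = "\uFF76\uFF9E"
results = normalize_all(halfwidth_ga)

# Print each result with its code points so the differences are visible.
results.each do |form, value|
  codepoints = value.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
  puts "#{form}: #{value} (#{codepoints})"
end
```

The canonical forms (:nfc, :nfd) leave halfwidth characters as they are, while the compatibility forms (:nfkc, :nfkd) map them to their fullwidth equivalents. That is why :nfkc is the usual choice for cleaning up Japanese input.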

Benchmark

Benchmarking simple string: 800ー12345
Warming up --------------------------------------
                 UNF    92.981k i/100ms
             Unicode    36.002k i/100ms
                Ruby    17.044k i/100ms
        UnicodeUtils    12.681k i/100ms
       ActiveSupport     7.482k i/100ms
Calculating -------------------------------------
                 UNF      1.173M (±17.6%) i/s -      5.672M in   5.041037s
             Unicode    404.502k (± 6.8%) i/s -      2.016M in   5.008748s
                Ruby    191.562k (±30.3%) i/s -    835.156k in   5.106057s
        UnicodeUtils    132.477k (± 5.3%) i/s -    672.093k in   5.088759s
       ActiveSupport     75.011k (±34.9%) i/s -    329.208k in   5.058559s

Comparison:
                 UNF:  1172663.8 i/s
             Unicode:   404502.1 i/s - 2.90x  slower
                Ruby:   191562.4 i/s - 6.12x  slower
        UnicodeUtils:   132477.3 i/s - 8.85x  slower
       ActiveSupport:    75010.6 i/s - 15.63x  slower

Warming up --------------------------------------
                 UNF    67.181k i/100ms
             Unicode    31.572k i/100ms
                Ruby    14.947k i/100ms
        UnicodeUtils    12.443k i/100ms
       ActiveSupport     5.561k i/100ms
Calculating -------------------------------------
                 UNF    997.098k (±25.2%) i/s -     27.477M in  30.052018s
             Unicode    328.071k (±19.5%) i/s -      9.503M in  30.090451s
                Ruby    177.045k (±32.8%) i/s -      4.529M in  30.071040s
        UnicodeUtils    134.513k (± 6.7%) i/s -      4.019M in  30.059621s
       ActiveSupport     68.063k (±44.7%) i/s -      1.668M in  30.131968s

Comparison:
                 UNF:   997097.6 i/s
             Unicode:   328070.8 i/s - 3.04x  slower
                Ruby:   177044.6 i/s - 5.63x  slower
        UnicodeUtils:   134512.7 i/s - 7.41x  slower
       ActiveSupport:    68063.1 i/s - 14.65x  slower


Benchmarking longer string: ㍻㍼㍽㍾㌀㌁㌂㌃㌄㌅㌆㌇㌈㌉㌊㌋㌌㌍㌎㌏㌐㌑㌒㌓㌔㌕㌖㌗㌘㌙㌚㌛㌜㌝㌞㌟㌠㌡㌢㌣㌤㌥㌦㌧㌨㌩㌪㌫㌬㌭㌮㌯㌰㌱㌲㌳㌴㌵㌶㌷㌸㌹㌺㌻㌼㌽㌾㌿㍀㍁㍂㍃㍄㍅㍆㍇㍈㍉㍊㍋㍌㍍㍎㍏㍐㍑㍒㍓㍔㍕㍖㍗
Warming up --------------------------------------
                 UNF     6.023k i/100ms
             Unicode     1.238k i/100ms
                Ruby     1.068k i/100ms
        UnicodeUtils   319.000  i/100ms
       ActiveSupport   258.000  i/100ms
Calculating -------------------------------------
                 UNF     59.891k (± 6.8%) i/s -    301.150k in   5.055411s
             Unicode     11.740k (± 9.0%) i/s -     59.424k in   5.103353s
                Ruby     10.655k (±10.9%) i/s -     53.400k in   5.091860s
        UnicodeUtils      3.087k (± 8.9%) i/s -     15.312k in   5.004688s
       ActiveSupport      2.533k (±11.1%) i/s -     12.642k in   5.064477s

Comparison:
                 UNF:    59890.8 i/s
             Unicode:    11740.2 i/s - 5.10x  slower
                Ruby:    10655.0 i/s - 5.62x  slower
        UnicodeUtils:     3087.4 i/s - 19.40x  slower
       ActiveSupport:     2532.6 i/s - 23.65x  slower

Warming up --------------------------------------
                 UNF     5.739k i/100ms
             Unicode     1.122k i/100ms
                Ruby     1.113k i/100ms
        UnicodeUtils   312.000  i/100ms
       ActiveSupport   254.000  i/100ms
Calculating -------------------------------------
                 UNF     59.371k (± 4.4%) i/s -      1.779M in  30.026571s
             Unicode     10.780k (±17.3%) i/s -    310.794k in  30.106556s
                Ruby     11.144k (± 6.7%) i/s -    332.787k in  30.034689s
        UnicodeUtils      3.164k (± 4.9%) i/s -     94.848k in  30.056928s
       ActiveSupport      2.635k (± 8.8%) i/s -     78.486k in  30.075836s

Comparison:
                 UNF:    59371.2 i/s
                Ruby:    11143.9 i/s - 5.33x  slower
             Unicode:    10779.6 i/s - 5.51x  slower
        UnicodeUtils:     3163.5 i/s - 18.77x  slower
       ActiveSupport:     2635.3 i/s - 22.53x  slower

Benchmark code can be found at bin/benchmark.
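
The "x slower" figures in each Comparison table are the fastest normalizer's iterations per second divided by each other normalizer's rate. As a small worked example, the sketch below recomputes them from the first simple-string run; RATES and slowdown are illustrative names, not part of bin/benchmark.

```ruby
# Iterations per second, copied from the first simple-string comparison above.
RATES = {
  "UNF"           => 1_172_663.8,
  "Unicode"       => 404_502.1,
  "Ruby"          => 191_562.4,
  "UnicodeUtils"  => 132_477.3,
  "ActiveSupport" => 75_010.6
}.freeze

# Divides the fastest rate by each rate, rounded to two decimal places.
def slowdown(rates)
  fastest = rates.values.max
  rates.transform_values { |ips| (fastest / ips).round(2) }
end

slowdowns = slowdown(RATES)
slowdowns.each { |name, factor| puts format("%-14s %.2fx", name, factor) }
```

Note that several runs report error margins above ±30%, so small differences between neighbouring entries (for example Ruby and Unicode on the longer string) should not be read as significant.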

Installation

Add this line to your application's Gemfile:

gem "active_normalizer"

And then execute:

$ bundle

Or install it yourself as:

$ gem install active_normalizer

Dependencies

Active Normalizer provides a handful of normalizers. Their dependencies are not bundled, except for the one that uses only the standard library. To use any other normalizer, add its gem dependency to your own Gemfile.

ActiveNormalizer::Normalizers::Ruby

# no dependency required, standard library

require "active_normalizer/normalizers/ruby"

ActiveNormalizer::Normalizers::UNF - unf

gem "unf"

require "active_normalizer/normalizers/unf"

ActiveNormalizer::Normalizers::Unicode - unicode

gem "unicode"

require "active_normalizer/normalizers/unicode"

ActiveNormalizer::Normalizers::UnicodeUtils - unicode_utils

gem "unicode_utils"

require "active_normalizer/normalizers/unicode_utils"

ActiveNormalizer::Normalizers::ActiveSupportMultibyte - activesupport

gem "activesupport"

require "active_normalizer/normalizers/active_support"
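
Because the optional gems are not bundled, requiring one that is not installed raises LoadError. The sketch below shows one way an application might prefer the unf gem when it is available and fall back to the standard library otherwise. build_normalizer is a hypothetical helper, not part of active_normalizer, and it calls the underlying libraries directly rather than going through this gem's normalizer classes.

```ruby
# Hypothetical helper: returns a callable that normalizes text with the given
# form, preferring the optional unf gem and falling back to the stdlib.
def build_normalizer(form = :nfkc)
  require "unf" # raises LoadError when the gem is not installed
  ->(text) { UNF::Normalizer.normalize(text, form) }
rescue LoadError
  # Standard-library fallback (String#unicode_normalize, Ruby 2.2+).
  ->(text) { text.unicode_normalize(form) }
end

normalizer = build_normalizer(:nfkc)
puts normalizer.call("\uFF76\uFF9E") # halfwidth katakana input
```

Either branch should produce the same normalized string, so the choice only affects speed and which gems the application must install.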

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/hack for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/JuanitoFatas/active_normalizer.

License

The gem is available as open source under the terms of the MIT License.