fastcws

A lightweight, high-performance Chinese word segmentation project

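Animated demo

As the tagline says, fastcws is very fast. The animation shows that a cold start (loading only) takes 0.125 s, while a cold start plus segmenting 180,000 characters takes 0.35 s. That leaves about 0.225 s for the segmentation itself, or roughly 800,000 characters per second: on the order of a million characters per second on a single core.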

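Command-line tool

The fastcws command-line tool (located at src/tools/fastcws when built from source) reads stdin, segments the input sentence by sentence, and writes the result to stdout: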

$ fastcws
在春风吹拂的季节翩翩起舞
在/春风/吹拂/的/季节/翩翩起舞/

A pipe makes it easy to segment a file and save the result to another file:

$ cat input.txt | fastcws > output.txt

The tool also supports a custom delimiter, loading the dictionary and the HMM model from files, and more. See fastcws --help for details.
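A downstream program may want to read the delimited output back into individual words. The following is a minimal sketch in C, assuming the default layout shown above in which every word is followed by "/". The names split_segmented_line and word_fn are hypothetical helpers for illustration and are not part of fastcws.

```c
#include <stddef.h>
#include <string.h>

/* Callback invoked once per word; `word` is NOT NUL-terminated. */
typedef void (*word_fn)(const char *word, size_t len, void *user);

/*
 * Hypothetical helper (not part of fastcws): split one line of segmenter
 * output on `delim` and report each non-empty field through `fn`.
 * Returns the number of words found. `fn` may be NULL to only count.
 */
size_t split_segmented_line(const char *line, char delim, word_fn fn, void *user)
{
    size_t count = 0;
    const char *start = line;
    /* Stop at the first CR or LF so a trailing newline is ignored. */
    const char *end = line + strcspn(line, "\r\n");

    while (start < end) {
        const char *sep = memchr(start, delim, (size_t)(end - start));
        const char *stop = sep ? sep : end;
        if (stop > start) {            /* skip empty fields */
            if (fn)
                fn(start, (size_t)(stop - start), user);
            count++;
        }
        start = sep ? sep + 1 : end;
    }
    return count;
}
```

Splitting on a single ASCII byte is safe for UTF-8 text, because bytes below 0x80 never occur inside a multi-byte sequence. If the input text itself may contain "/", the split becomes ambiguous, and choosing a different delimiter would be the safer option.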

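`Windows` notes

On Windows the default encoding is UTF-16, but this project currently uses UTF-8 as its only encoding.

When typing input directly in a console window, this is not a concern, because the tool uses nowide to convert automatically: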

$ fastcws
在春风吹拂的季节翩翩起舞
在/春风/吹拂/的/季节/翩翩起舞/

When segmenting a file through a pipe, make sure the file is saved as UTF-8 without a BOM; otherwise segmentation may misbehave or produce errors:

$ type input.txt | fastcws.exe > output.txt

input.txt must be saved as UTF-8.
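If a program reads such a file itself before handing the text to a segmenter, it can skip a leading BOM first. The sketch below is plain C and uses no fastcws functions; skip_utf8_bom is a hypothetical helper name chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of fastcws): return a pointer past a
 * leading UTF-8 byte order mark (EF BB BF), if one is present, and
 * update *len to the remaining length. The buffer is not modified.
 */
const char *skip_utf8_bom(const char *buf, size_t *len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    if (*len >= 3 && memcmp(buf, bom, 3) == 0) {
        *len -= 3;
        return buf + 3;
    }
    return buf;
}
```

This only handles the UTF-8 BOM. A file saved as UTF-16 still has to be converted to UTF-8 beforehand.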

C library

This project is written in C++17, but the shared library produced by the build exposes a stable C API for calling the segmenter:

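// #include "libfastcws.h"

/* Initialize the library, then allocate a result object to hold the output. */
fastcws_init();
fastcws_result* result = fastcws_alloc_result();

/* The input must be UTF-8; a nonzero return value signals an error. */
int err = fastcws_word_break("在春风吹拂的季节翩翩起舞", result);
if (err) {
	...
}
/* Iterate over the words; each one is reported as a pointer plus a length. */
const char *word_begin;
size_t word_len;
while (fastcws_result_next(result, &word_begin, &word_len) == 0) {
	...
}
fastcws_result_free(result);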

As you can see, segmentation is zero-copy, which is why it performs so well.
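One practical consequence of a pointer-plus-length interface is that a word is presumably not NUL-terminated on its own. This is an assumption inferred from the zero-copy design and the word_len out-parameter, not something the text above confirms. Under that assumption, a caller who needs an ordinary C string would copy the word out explicitly. A minimal sketch, where word_dup is a hypothetical helper and not part of the fastcws API:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of fastcws): duplicate a
 * (pointer, length) word into a freshly allocated, NUL-terminated
 * string. Returns NULL if allocation fails. The caller owns the
 * returned string and must free() it.
 */
char *word_dup(const char *word_begin, size_t word_len)
{
    char *copy = malloc(word_len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, word_begin, word_len);
    copy[word_len] = '\0';
    return copy;
}
```

When a word only needs to be printed, no copy is required: printf("%.*s\n", (int)word_len, word_begin) writes exactly word_len bytes.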

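The C API likewise supports loading the dictionary and the HMM model from files. More samples can be found in the examples directory.

Note that here, too, the data passed in must be encoded as UTF-8.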

Build and install

As with most CMake projects:

git submodule update --init --recursive
cmake -S . -B build
cmake --build build
cmake --build build --target install
