Static deps: Move unibreak to be an external dep.

Any version of libunibreak will do. The first one was released in mid-2012.
Even slow-moving distros like Ubuntu already have an LTS release with a good
enough version, so I consider this enough to drop the maintenance cost.
This has been discussed on IRC.

@feature
Tom Hacohen 2015-05-07 10:02:40 +01:00
parent 92ff90ecca
commit a2a9f33802
16 changed files with 3 additions and 5715 deletions

@@ -1762,6 +1762,8 @@ EFL_ADD_LIBS([EVAS], [-lm])
# Freetype
EFL_DEPEND_PKG([EVAS], [FREETYPE], [freetype2 >= 9.3.0])
EFL_DEPEND_PKG([EVAS], [UNIBREAK], [libunibreak])
## optional dependencies
# FontConfig

@@ -87,24 +87,8 @@ lib/evas/include/evas_blend_ops.h \
lib/evas/include/evas_filter.h \
lib/evas/canvas/evas_vg_private.h
# Linebreak
noinst_HEADERS += \
static_libs/libunibreak/linebreak.h \
static_libs/libunibreak/linebreakdef.h \
static_libs/libunibreak/wordbreakdef.h \
static_libs/libunibreak/wordbreak.h \
static_libs/libunibreak/wordbreakdata.c
# Linebreak
lib_evas_libevas_la_SOURCES = \
static_libs/libunibreak/linebreak.c \
static_libs/libunibreak/linebreakdata.c \
static_libs/libunibreak/linebreakdef.c \
static_libs/libunibreak/wordbreak.c
# Main
lib_evas_libevas_la_SOURCES += \
lib_evas_libevas_la_SOURCES = \
lib/evas/main.c
# Canvas
@@ -314,7 +298,6 @@ lib_evas_libevas_la_CPPFLAGS = -I$(top_builddir)/src/lib/efl \
-I$(top_srcdir)/src/lib/evas/cserve2 \
-I$(top_srcdir)/src/lib/evas/file \
-I$(top_srcdir)/src/lib/evas/include \
-I$(top_srcdir)/src/static_libs/libunibreak \
-I$(top_builddir)/src/lib/evas/canvas \
-I$(top_builddir)/src/modules/evas/engines/software_generic \
-I$(top_builddir)/src/modules/evas/engines/gl_generic \
@@ -366,15 +349,6 @@ lib/evas/common/libevas_convert_rgb_32.la \
lib_evas_libevas_la_LDFLAGS = @EFL_LTLIBRARY_FLAGS@
# Linebreak
EXTRA_DIST += \
static_libs/libunibreak/LICENCE \
static_libs/libunibreak/AUTHORS \
static_libs/libunibreak/NEWS \
static_libs/libunibreak/README \
static_libs/libunibreak/ChangeLog
# Engines
EXTRA_DIST += \

@@ -1,11 +0,0 @@
Wu Yongwei. Designed and implemented the original liblinebreak.
Current maintainer of libunibreak.
Nikolay Pultsin. Put forward the original requirements on liblinebreak,
performed tests, and made a lot of suggestions on the initial versions.
Thomas Klausner. Autoconfiscated and libtoolized liblinebreak.
Tom Hacohen. Added word boundaries support.
Petr Filipsky. Added incremental processing for line-breaking.

@@ -1,702 +0,0 @@
2013-11-14 Wu Yongwei <wuyongwei@gmail.com>
* src/linebreak.c: Add/update comments and doc comments.
(lb_init_breaking_class): Rename to treat_first_char.
(lb_classify_break_simple): Rename to get_lb_result_simple.
(lb_classify_break_lookup): Rename to get_lb_result_lookup.
(set_linebreaks): Remove an unused local variable.
2013-11-14 Wu Yongwei <wuyongwei@gmail.com>
* src/linebreakdata.c: Regenerate from LineBreak-6.3.0.txt.
2013-11-13 Wu Yongwei <wuyongwei@gmail.com>
Fix compilation problems under MSVC.
* src/linebreak.c (lb_init_breaking_class): Remove `inline'.
(lb_classify_break_simple): Ditto.
(lb_classify_break_lookup): Ditto.
(lb_classify_break_lookup): Move local variable declaration before
assertions.
2013-11-10 Wu Yongwei <wuyongwei@gmail.com>
* src/Makefile.am (libunibreak_la_LDFLAGS): Set the version-info to
`2:0:1'.
2013-11-10 Wu Yongwei <wuyongwei@gmail.com>
* src/linebreakdef.c: Adjust the order of code.
(lb_process_next_char): Make its return type int.
* src/linebreak.c (lb_process_next_char): Ditto.
2013-11-10 Wu Yongwei <wuyongwei@gmail.com>
* src/linebreak.c: Make minor changes in doc comments, formatting,
and names.
* src/linebreakdef.c: Ditto.
2013-11-10 Wu Yongwei <wuyongwei@gmail.com>
* AUTHORS: Add `Petr Filipsky'.
2013-11-10 Petr Filipsky <philodej@gmail.com>
Expose low level line-breaking API for incremental processing.
* src/linebreak.h: Add prototype declarations for
lb_init_break_context and lb_process_next_char.
(struct LineBreakContext): New struct.
* src/linebreak.c (LINEBREAK_UNDEFINED): New macro constant.
(lb_init_breaking_class): New static function.
(lb_classify_break_simple): New static function.
(lb_classify_break_lookup): New static function.
(lb_init_break_context): New function.
(lb_process_next_char): New function.
(set_linebreaks): Implement with lb_init_break_context and
lb_process_next_char.
2013-11-05 Petr Filipsky <philodej@gmail.com>
* src/wordbreakdef.h (enum WordBreakClass): Update according to
Table 3 of Unicode Standard Annex 29, Revision 23.
2013-09-30 Wu Yongwei <wuyongwei@gmail.com>
Update for the libunibreak 1.1 release.
* configure.ac (AC_INIT): Change the library version to `1.1'.
* Doxyfile (PROJECT_NUMBER): Change to `1.1'.
* Makefile.am (EXTRA_DIST): Add the `tools' directory.
* NEWS: Add information about libunibreak 1.1.
* src/Makefile.am (libunibreak_la_LDFLAGS): Set the version to `1:1'.
2013-09-29 Wu Yongwei <wuyongwei@gmail.com>
* src/Makefile.msvc: Modernize obsolete/deprecated MSVC options.
2013-09-28 Wu Yongwei <wuyongwei@gmail.com>
* src/wordbreak.c: Update copyright year and UAX information.
* src/wordbreak.h: Ditto.
* src/wordbreakdef.h: Ditto.
2013-09-28 Wu Yongwei <wuyongwei@gmail.com>
Fix the errors caused by libtool 2.4 (really annoying to the level
of WTF for making me add the foolish dependency on m4).
* Makefile.am (ACLOCAL_AMFLAGS): Add `-I m4'.
* bootstrap: Add a line to execute autoreconf.
* configure.ac (AC_CONFIG_MACRO_DIR): Set to `[m4]'.
* purge: Make it remove also the m4 directory.
2013-09-28 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.am (EXTRA_DIST): Add `README.md'.
2013-09-28 Wu Yongwei <wuyongwei@gmail.com>
* README.md: New Markdown version of README.
* README: Remove.
2013-05-13 Tom Hacohen <tom@stosb.com>
Update files according to UAX #29-21, for Unicode 6.2.0.
* README: Update the reference to UAX #29-21.
* src/wordbreak.c (set_wordbreaks): Update for WBP_Regional.
* src/wordbreakdef.h (WBP_Regional): New enumerator for the new
property `RI' as defined in UAX #29-21.
* src/wordbreakdata.c: Regenerate from WordBreakProperty-6.2.0.txt.
2013-05-06 Wu Yongwei <wuyongwei@gmail.com>
* src/Makefile.am (install-exec-hook): Make sure `--disable-static'
can work (thanks to Eugene V. Lyubimkin).
2012-10-06 Wu Yongwei <wuyongwei@gmail.com>
Update files according to UAX #14-30, for Unicode 6.2.0.
* README: Update the reference to UAX #14-30.
* src/linebreak.c (baTable): Update for the new class `RI'.
* src/linebreak.h (LINEBREAK_VERSION): Set to 0x0202.
* src/linebreakdef.h (LBP_RI): New enumerator for the new class `RI'
as defined in UAX #14-30.
* src/linebreakdata.c: Regenerate from LineBreak-6.2.0.txt.
2012-10-06 Wu Yongwei <wuyongwei@gmail.com>
* src/linebreak.c (baTable): Correct the issue that one column was
missing in the table.
2012-10-06 Wu Yongwei <wuyongwei@gmail.com>
* README: Update to reflect the recent changes.
2012-10-06 Wu Yongwei <wuyongwei@gmail.com>
Make `make linebreakdata' and `make wordbreakdata' work again.
* src/Makefile.am (EXTRA_DIST): Add missing `filter_dup.c'.
(linebreakdata): New make target.
(wordbreakdata): New make target.
2012-10-06 Wu Yongwei <wuyongwei@gmail.com>
Make `make dist' work again after the directory adjustment.
* Doxyfile (INPUT): Change to `src'.
(FILE_PATTERNS): Set to `*.c *.h'.
* Makefile.am (EXTRA_DIST): Move content from src/Makefile.am.
(doc): Move target from src/Makefile.am.
* src/Makefile.am (EXTRA_DIST): Move partial content to Makefile.am.
(doc): Move target to Makefile.am.
2012-09-16 Wu Yongwei <wuyongwei@gmail.com>
Update files according to UAX #14-28, for Unicode 6.1.0.
* README: Update the reference to UAX #14-28.
* src/linebreak.c (baTable): Update for the new class `HL'.
(resolve_lb_class): Resolve the new class `CJ' to `ID' (simplified).
* src/linebreakdef.h (LBP_HL): New enumerator for the new class `HL'
as defined in UAX #14-28.
(LBP_CJ): New enumerator for the new class `CJ' as defined in
UAX #14-28.
* src/linebreakdata.c: Regenerate from LineBreak-6.1.0.txt.
2012-08-13 Tom Hacohen <tom@stosb.com>
Move source files to under src.
* Makefile.am: Split from original Makefile.am.
(SUBDIRS): Add `src'.
* configure.ac (AC_CONFIG_SRCDIR): Add `src/' before `linebreak.c'.
(AC_CONFIG_FILES): Add `src/Makefile'.
* src/LineBreak1.sed: Move from LineBreak1.sed.
* src/LineBreak2.sed: Move from LineBreak2.sed.
* src/Makefile.am: Split from Makefile.am
* src/Makefile.gcc: Move from Makefile.gcc.
* src/Makefile.msvc: Move from Makefile.msvc.
* src/filter_dup.c: Move from filter_dup.c.
* src/linebreak.c: Move from linebreak.c.
* src/linebreak.h: Move from linebreak.h.
* src/linebreakdata.c: Move from linebreakdata.c.
* src/linebreakdata1.tmpl: Move from linebreakdata1.tmpl.
* src/linebreakdata2.tmpl: Move from linebreakdata2.tmpl.
* src/linebreakdata3.tmpl: Move from linebreakdata3.tmpl.
* src/linebreakdef.c: Move from linebreakdef.c.
* src/linebreakdef.h: Move from linebreakdef.h.
* src/sort_numeric_hex.py: Move from sort_numeric_hex.py.
* src/wordbreak.c: Move from wordbreak.c.
* src/wordbreak.h: Move from wordbreak.h.
* src/wordbreakdata.c: Move from wordbreakdata.c.
* src/wordbreakdata1.tmpl: Move from wordbreakdata1.tmpl.
* src/wordbreakdata2.tmpl: Move from wordbreakdata2.tmpl.
* src/wordbreakdef.h: Move from wordbreakdef.h.
2012-08-12 Wu Yongwei <wuyongwei@gmail.com>
* README: Change the home URL to github; remove $Id$; eliminate
non-ASCII characters.
2012-08-11 Wu Yongwei <wuyongwei@gmail.com>
Update for the libunibreak 1.0 release.
* configure.ac (AC_INIT): Change the library name and version to
`libunibreak' and `1.0'.
(AC_PROG_LN_S): New macro.
(AC_OUTPUT): Change to `libunibreak.pc'.
* Doxyfile (PROJECT_NAME): Change to `libunibreak'.
(PROJECT_NUMBER): Change to `1.0'.
* LICENCE: Add copyright information about Tom Hacohen.
* Makefile.am (lib_LTLIBRARIES): Change to `libunibreak.la'.
(pkgconfig_DATA): Change to `libunibreak.pc'.
(libunibreak_la_LDFLAGS): Reset the version to `1:0'.
(install-exec-hook): Replace the static library liblinebreak.a with
a symlink to libunibreak.a.
* Makefile.msvc: Change the library name to `libunibreak', and the
output library to `unibreak.lib'.
* NEWS: Add information about libunibreak 1.0.
* README: Change the library name, and add information about word
break.
2012-02-04 Wu Yongwei <wuyongwei@gmail.com>
* wordbreak.h (WORDBREAK_INSIDEACHAR): Change from
WORDBREAK_INSIDECHAR.
* wordbreak.c (set_brks_to): Change `WORDBREAK_INSIDECHAR' to
`WORDBREAK_INSIDEACHAR'.
2012-01-19 Wu Yongwei <wuyongwei@gmail.com>
* wordbreak.h: Change angle brackets to quotation marks (which
caused build errors).
2012-01-19 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.gcc (CFILES): Add wordbreak.c.
(WordBreakProperty.txt): New target.
(wordbreakdata): New target.
2012-01-19 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.am (liblinebreak_la_SOURCES): Remove wordbreakdata.c.
(EXTRA_DIST): Add wordbreakdata.c, wordbreakdata1.tmpl, and
wordbreakdata2.tmpl.
2012-01-19 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.msvc: Add wordbreak files.
2012-01-18 Tom Hacohen <tom@stosb.com>
Add word breaking support.
* AUTHORS: Add `Tom Hacohen'.
* Makefile.am (include_HEADERS): Add header files for word breaking.
(liblinebreak_la_SOURCES): Add source files for word breaking.
(sort_numeric_hex.py): Add `sort_numeric_hex.py'.
(distclean-local): Clean also `WordBreakData.txt'.
(WordBreakProperty.txt): New target.
(wordbreakdata): New target.
* sort_numeric_hex.py: New file.
* wordbreak.c: New file.
* wordbreak.h: New file.
* wordbreakdef.h: New file.
* wordbreakdata.c: New file.
* wordbreakdata1.tmpl: New file.
* wordbreakdata2.tmpl: New file.
2011-05-17 Wu Yongwei <wuyongwei@gmail.com>
Add support for pkg-config (thanks to Tom Hacohen).
* liblinebreak.pc.in: New file.
* configure.ac (AC_OUTPUT): Add `liblinebreak.pc'.
* Makefile.am (pkgconfig_DATA): Set to `liblinebreak.pc'.
(pkgconfigdir): Set to `$(libdir)/pkgconfig'.
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
* README: Update the reference to UAX #14-26, for Unicode 6.0.0.
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
* configure.ac (AC_INIT): Increase the version to 2.1.
* Makefile.am (liblinebreak_la_LDFLAGS): Set the version-info to
`2:1'.
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
* LICENCE: Update the copyright year.
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
Update for the 2.1 release.
* Doxyfile (PROJECT_NUMBER): Set to `2.1'.
* NEWS: Add information about the 2.1 release.
* linebreak.h (LINEBREAK_VERSION): Set to `0x0201'.
* linebreak.h: Update comments.
* linebreak.c: Ditto.
* linebreakdef.h: Ditto.
* linebreakdef.c: Ditto.
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
* linebreakdata.c: Regenerate from LineBreak-6.0.0.txt.
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.c (set_linebreaks): Fix the assertion failure when
U+FFFC (OBJECT REPLACEMENT CHARACTER) appears at the beginning of a
line (thanks to Tom Hacohen).
2010-01-03 Wu Yongwei <wuyongwei@gmail.com>
* LICENCE: Update the copyright year.
2010-01-03 Wu Yongwei <wuyongwei@gmail.com>
* NEWS: Add information about the 2.0 release.
2010-01-03 Wu Yongwei <wuyongwei@gmail.com>
* Doxyfile (PROJECT_NUMBER): Set to `2.0'.
(HAVE_DOT): Set to `YES'.
2010-01-03 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.c: Update the version number in comment to 2.0.
* linebreak.h: Ditto.
* linebreakdef.c: Ditto.
* linebreakdef.h: Ditto.
2009-12-17 Wu Yongwei <wuyongwei@gmail.com>
Change the values of enum BreakAction to the same length.
* linebreak.c (DIRECT_BRK): Rename to DIR_BRK.
(INDIRECT_BRK): Rename to IND_BRK.
(CM_INDIRECT_BRK): Rename to CMI_BRK.
(CM_PROHIBITED_BRK): Rename to CMP_BRK.
(PROHIBITED_BRK): Rename to PRH_BRK.
2009-11-29 Wu Yongwei <wuyongwei@gmail.com>
* Doxyfile (TAB_SIZE): Set to the correct size `4', as used in the
source files.
2009-11-29 Wu Yongwei <wuyongwei@gmail.com>
Update files according to UAX #14-24, for Unicode 5.2.0.
* linebreak.c: Update comments about UAX #14.
* linebreak.h: Ditto.
* linebreakdef.c: Ditto.
* linebreakdef.h: Ditto.
(LBP_CP): New enumerator for the new `CP' class as defined in
UAX #14-24.
* linebreak.c (baTable): Update for the new class `CP'.
* linebreakdata.c: Regenerate from LineBreak-5.2.0.txt.
* README: Update the reference to UAX #14-24, for Unicode 5.2.0.
2009-05-03 Wu Yongwei <wuyongwei@gmail.com>
* NEWS: Add information about the 1.2 release.
2009-04-30 Wu Yongwei <wuyongwei@gmail.com>
Optimize the Doxygen output.
* linebreak.c (lb_prop_index): Adjust its definition format
slightly.
2009-04-30 Wu Yongwei <wuyongwei@gmail.com>
* Doxyfile (USE_WINDOWS_ENCODING): Remove obsolete tag.
(DETAILS_AT_TOP): Ditto.
(MAX_DOT_GRAPH_WIDTH): Ditto.
(MAX_DOT_GRAPH_HEIGHT): Ditto.
(REFERENCED_BY_RELATION): Set to `NO'.
(REFERENCES_RELATION): Ditto.
(EXCLUDE): Add `filter_dup.c'.
2009-04-28 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.c (lb_get_next_char_utf8): Fix the issue that the index
can point to the middle of a UTF-8 sequence if End of String (EOS)
is encountered prematurely (thanks to Nikolay Pultsin and Rick Xu).
(lb_get_next_char_utf16): Fix the issue that the index can point to
the middle of a UTF-16 surrogate pair if EOS is encountered
prematurely.
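
The fix described in the entry above can be illustrated with a minimal sketch. This is not the library's code: `sketch_next_utf8` and `SKETCH_EOS` are hypothetical names, and the sketch assumes continuation bytes are well formed. It only shows the idea that a sequence cut off by the end of input reports end-of-string without moving the index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical end-of-string marker; not the library's definition. */
#define SKETCH_EOS 0xFFFFFFFFu

/*
 * Decode the code point starting at s[*ip] and advance *ip past it.
 * If the sequence is cut off by the end of input, return SKETCH_EOS
 * and leave *ip untouched, so the index never lands inside a sequence.
 * Continuation bytes are not validated in this sketch.
 */
static uint32_t sketch_next_utf8(const unsigned char *s, size_t len,
                                 size_t *ip)
{
    unsigned char ch;
    size_t need, k;
    uint32_t cp;

    assert(*ip <= len);
    if (*ip == len)
        return SKETCH_EOS;

    ch = s[*ip];
    if (ch < 0xC2 || ch > 0xF4) {
        /* ASCII, stray continuation byte, or invalid lead byte:
         * consume one byte and pass it through unchanged. */
        *ip += 1;
        return ch;
    }
    if (ch < 0xE0)      { need = 2; cp = ch & 0x1Fu; }
    else if (ch < 0xF0) { need = 3; cp = ch & 0x0Fu; }
    else                { need = 4; cp = ch & 0x07u; }

    if (*ip + need > len)
        return SKETCH_EOS;          /* truncated: do not advance */

    for (k = 1; k < need; ++k)
        cp = (cp << 6) | (s[*ip + k] & 0x3Fu);
    *ip += need;
    return cp;
}
```

Called on the three bytes `E4 B8 AD` with the full length, the sketch yields U+4E2D and moves the index to 3; called with a length of 2, it yields the end-of-string marker and the index stays at 0.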
2009-04-20 Wu Yongwei <wuyongwei@gmail.com>
* linebreakdef.c (lb_prop_English): Remove the specialization of
right single quotation mark as closing punctuation mark, because it
can be used as apostrophe.
(lb_prop_Spanish): Ditto.
(lb_prop_French): Ditto.
2009-04-09 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.msvc: Make the `clean' target work on MSVC versions other
than 6.0; do not use precompiled header.
2009-03-07 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.h: Correct the wrong date in the documentation comment.
* linebreakdef.h: Ditto.
2009-02-10 Wu Yongwei <wuyongwei@gmail.com>
* configure.ac (AC_INIT): Increase the version to 2.0.
* Makefile.am (liblinebreak_la_LDFLAGS): Set the version-info to
`2:0'.
2009-02-10 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.h (LINEBREAK_VERSION): New macro.
(linebreak_version): New global constant declaration.
* linebreak.c (linebreak_version): New global constant definition.
2009-02-10 Wu Yongwei <wuyongwei@gmail.com>
Reduce namespace pollution.
* linebreak.c (get_lb_prop_lang): Mark as static.
(get_next_char_utf8): Rename to lb_get_next_char_utf8.
(get_next_char_utf16): Rename to lb_get_next_char_utf16.
(get_next_char_utf32): Rename to lb_get_next_char_utf32.
(is_breakable): Rename to is_line_breakable.
* linebreak.h (get_next_char_utf8): Remove the function prototype
declaration.
(get_next_char_utf16): Ditto.
(get_next_char_utf32): Ditto.
(is_breakable): Rename to is_line_breakable.
* linebreakdef.h (lb_get_next_char_utf8): Add the function prototype
declaration.
(lb_get_next_char_utf16): Ditto.
(lb_get_next_char_utf32): Ditto.
2009-02-06 Wu Yongwei <wuyongwei@gmail.com>
* NEWS: Add information about the 1.1 release.
2009-01-02 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.am (EXTRA_DIST): Add the missing `LICENCE' file.
2008-12-31 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.c: Update the version number in comment to 1.0.
* linebreak.h: Ditto.
* linebreakdef.c: Ditto.
* linebreakdef.h: Ditto.
2008-12-31 Wu Yongwei <wuyongwei@gmail.com>
* NEWS: Update for the 1.0 release.
2008-12-31 Wu Yongwei <wuyongwei@gmail.com>
* README: Correct two typos.
2008-12-31 Wu Yongwei <wuyongwei@gmail.com>
* README: Add the online URL reference.
2008-12-30 Wu Yongwei <wuyongwei@gmail.com>
* README: Update the reference to UAX #14-22, for Unicode 5.1.0.
2008-12-13 Wu Yongwei <wuyongwei@gmail.com>
Update files according to UAX #14-22, for Unicode 5.1.0.
* linebreak.c (baTable): Update according to Table 2 of UAX #14-22.
* linebreakdef.c (lb_prop_Spanish): Remove the unnecessary
customization for inverted marks in Spanish.
* linebreakdata.c: Regenerate from LineBreak-5.1.0.txt.
* linebreak.h: Update comment only.
* linebreakdef.h: Ditto.
2008-12-12 Wu Yongwei <wuyongwei@gmail.com>
* README: Update for the new build methods and better readability.
2008-12-12 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.msvc: Correct the inconsistent naming in the output
message.
2008-12-12 Wu Yongwei <wuyongwei@gmail.com>
* configure.ac (AM_INIT_AUTOMAKE): Mark `foreign'.
* bootstrap: New file.
* purge: New file.
* Makefile.gcc (purge): Remove this target.
2008-12-10 Wu Yongwei <wuyongwei@gmail.com>
* NEWS: New file.
2008-12-10 Wu Yongwei <wuyongwei@gmail.com>
* AUTHORS: New file.
2008-12-10 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.gcc (purge): New phony target to purge files generated by
autoconfiscation.
2008-12-10 Thomas Klausner <tk@giga.or.at>
* configure.ac: New file.
* Makefile.am: New file.
2008-12-10 Wu Yongwei <wuyongwei@gmail.com>
* Doxyfile (OUTPUT_DIRECTORY): Set to `doc'.
(ALPHABETICAL_INDEX): Set to `YES'.
2008-12-09 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.msvc: New file.
2008-12-09 Wu Yongwei <wuyongwei@gmail.com>
* Makefile: Remove (to become Makefile.gcc).
* Makefile.gcc: New file (was Makefile).
2008-12-07 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.c: Adjust the comment that refers to Unicode Annex 14.
* linebreak.h: Ditto.
* linebreakdef.c: Ditto.
* linebreakdef.h: Ditto.
2008-12-07 Wu Yongwei <wuyongwei@gmail.com>
Use only POSIX basic regexp to ensure maximum portability (issues
have been found on Mac OS X, where GNU extensions do not work).
* LineBreak1.sed: Replace `[:xdigit:]' with `0-9A-F', and `\+' with
`\{1,\}'.
* LineBreak2.sed: Ditto.
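
The substitution in the entry above is made in sed scripts; the same POSIX basic regular expression syntax can be tried from C through `regcomp`, which is swapped in here purely for illustration. The helper name `sketch_is_hex` is hypothetical and the anchors `^` and `$` are additions for the example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Return 1 if `text` is one or more uppercase hexadecimal digits.
 * The pattern uses only POSIX basic regexp syntax: an explicit
 * bracket range instead of [:xdigit:], and \{1,\} instead of the
 * GNU extension \+.
 */
static int sketch_is_hex(const char *text)
{
    regex_t re;
    int rc;

    /* Omitting REG_EXTENDED selects basic regular expressions. */
    if (regcomp(&re, "^[0-9A-F]\\{1,\\}$", REG_NOSUB) != 0)
        return 0;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this pattern, `"1F600"` is accepted while the empty string and `"12G"` are rejected.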
2008-12-07 Wu Yongwei <wuyongwei@gmail.com>
* Makefile: Replace `*.exe' with `filter_dup$(EXEEXT)', since the
extension `.exe' is specific to Windows.
2008-04-20 Wu Yongwei <wuyongwei@gmail.com>
Add README and LICENCE files, as well as a Doxyfile to generate
documents.
* README: New file.
* LICENCE: New file.
* Doxyfile: New file.
* Makefile (doc): Add new phony target.
2008-04-04 Wu Yongwei <wuyongwei@gmail.com>
Remove the English override for plus sign: it is better treated in
the text breaking program (see ../breaktext/ for an example).
* linebreakdef.c (lb_prop_English): Remove the line for plus sign.
2008-03-29 Wu Yongwei <wuyongwei@gmail.com>
* Makefile: Correct the dependency-making rules when OLDGCC=Y.
2008-03-23 Wu Yongwei <wuyongwei@gmail.com>
* Makefile (clean): Do not remove *.exe and tags here.
(distclean): Remove *.exe and tags.
2008-03-23 Wu Yongwei <wuyongwei@gmail.com>
Remove the English override for solidus: it is better treated in the
text breaking program (see ../breaktext/ for an example).
* linebreakdef.c (lb_prop_English): Remove the line for solidus.
2008-03-16 Wu Yongwei <wuyongwei@gmail.com>
Rename init_linebreak_prop_index to init_linebreak for future
safety; make visible certain functions that are potentially useful.
* linebreak.c (init_linebreak_prop_index): Rename to init_linebreak.
(get_next_char_t): Move to linebreakdef.h.
(get_next_char_utf8): Make non-static.
(get_next_char_utf16): Ditto.
(get_next_char_utf32): Ditto.
(set_linebreaks): Ditto.
* linebreak.h (init_linebreak_prop_index): Rename to init_linebreak.
(get_next_char_utf8): Add the function prototype.
(get_next_char_utf16): Ditto.
(get_next_char_utf32): Ditto.
* linebreakdef.h (get_next_char_t): Add the typedef.
(set_linebreaks): Add the function prototype.
2008-03-16 Wu Yongwei <wuyongwei@gmail.com>
* Makefile (OLDGCC): Add support for GCC 2.95.3 (when OLDGCC=Y).
2008-03-15 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.c (set_linebreaks): Fix a bug that `==' was wrongly used
for `='.
2008-03-05 Wu Yongwei <wuyongwei@gmail.com>
Improve the performance by reducing the look-ups of the
language-specific line breaking properties array from the language
name (thanks to Nikolay Pultsin).
* linebreak.c (get_lb_prop_lang): New function.
(get_char_lb_class_lang): Change the second parameter from the
language name to the line breaking properties array.
(set_linebreaks): Look up the language-specific line breaking
properties array from the language name only once in one function
call.
2008-03-03 Wu Yongwei <wuyongwei@gmail.com>
Make minor adjustments in code and comments.
* linebreak.c: Adjust the doc comments.
(init_linebreak_prop_index): Modify a conditional to make it more
robust and consistent.
* linebreakdef.c (lb_prop_lang_map): Replace the pointer
lb_prop_default with NULL, since the value is never used.
2008-03-03 Wu Yongwei <wuyongwei@gmail.com>
Accelerate get_char_lb_class for invalid Unicode code points.
* linebreak.c (get_char_lb_class): Adjust the conditionals so that
getting the line breaking class for an invalid code point is much
faster, which requires the array of line breaking properties be
sorted.
* linebreakdef.h: Adjust a comment that the array of line break
properties must be sorted.
2008-03-02 Wu Yongwei <wuyongwei@gmail.com>
Change the values of enum BreakAction to more complete forms.
* linebreak.c (INDRCT_BRK): Rename to INDIRECT_BRK.
(CM_INDRCT_BRK): Rename to CM_INDIRECT_BRK.
(CM_PROHIBTD_BRK): Rename to CM_PROHIBITED_BRK.
(PROHIBTD_BRK): Rename to PROHIBITED_BRK.
2008-03-02 Wu Yongwei <wuyongwei@gmail.com>
Implement a two-stage search in get_char_lb_class_default to
accelerate the overall performance, especially for non-Latin
languages.
* linebreak.c (LINEBREAK_INDEX_SIZE): New constant macro.
(struct LineBreakPropertiesIndex): New struct.
(lb_prop_index): New static variable.
(init_linebreak_prop_index): New function.
(get_char_lb_class_default): New function.
(get_char_lb_class_lang): Use get_char_lb_class_default.
* linebreak.h: Detect C++ and add extern "C" guard if necessary.
(init_linebreak_prop_index): Add the prototype declaration.
* linebreakdef.h: Adjust a comment.
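
A rough sketch of the two-stage search described in the entry above, under stated assumptions: the table, the class names, and every `toy_` identifier are made up for illustration and are far smaller than the real data. A small index first narrows the lookup to a slice of the sorted property array, and a short linear scan then finishes the job. Because the array is sorted, the scan can also stop as soon as it passes the code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classes and data, for illustration only. */
enum toy_class { TOY_UNDEF = 0, TOY_A, TOY_B, TOY_C };

struct toy_range { uint32_t start, end; enum toy_class cls; };

/* Sorted by start; terminated by a TOY_UNDEF sentinel. */
static const struct toy_range toy_table[] = {
    { 0x0041, 0x005A, TOY_A },
    { 0x0061, 0x007A, TOY_A },
    { 0x0391, 0x03A9, TOY_B },
    { 0x03B1, 0x03C9, TOY_B },
    { 0x4E00, 0x9FFF, TOY_C },
    { 0xAC00, 0xD7A3, TOY_C },
    { 0, 0, TOY_UNDEF }
};

#define TOY_INDEX_SIZE 3

/* One index entry: last code point covered, and where to start scanning. */
struct toy_index { uint32_t end; const struct toy_range *first; };
static struct toy_index toy_idx[TOY_INDEX_SIZE];

/* Stage 0: split the sorted table into TOY_INDEX_SIZE slices. */
static void toy_init_index(void)
{
    size_t len = 0, step, pos = 0, i;

    while (toy_table[len].cls != TOY_UNDEF)
        ++len;
    step = len / TOY_INDEX_SIZE;
    for (i = 0; i < TOY_INDEX_SIZE; ++i) {
        toy_idx[i].first = toy_table + pos;
        pos += step;
        /* Slice ends just before the next slice's first range. */
        toy_idx[i].end = toy_table[pos].start - 1;
    }
    toy_idx[TOY_INDEX_SIZE - 1].end = 0xFFFFFFFFu;
}

static enum toy_class toy_lookup(uint32_t ch)
{
    size_t i = 0;
    const struct toy_range *r;

    /* Stage 1: find the slice; the last entry always matches. */
    while (ch > toy_idx[i].end)
        ++i;
    /* Stage 2: linear scan; stop once the sorted ranges pass ch. */
    for (r = toy_idx[i].first; r->cls != TOY_UNDEF && ch >= r->start; ++r)
        if (ch <= r->end)
            return r->cls;
    return TOY_UNDEF;
}
```

After calling `toy_init_index()` once, `toy_lookup(0x4E2D)` skips the first two slices and scans only the last one, which is where the saving for non-Latin text comes from in this toy setup.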
2008-03-02 Wu Yongwei <wuyongwei@gmail.com>
Split/refactor the code; add (doc) comments.
* Makefile (CFILES): Add linebreakdata.c and linebreakdef.c.
* linebreak.c: Add and adjust comments.
(linebreakdef.h): Add include file.
(linebreakdata.c): Remove include file.
(EOS): Remove (now in linebreakdef.h).
(enum LineBreakClass): Ditto.
(struct LineBreakProperties): Ditto.
(lbpEnglish): Remove (now in linebreakdef.c as lb_prop_English).
(lbpGerman): Remove (now in linebreakdef.c as lb_prop_German).
(lbpSpanish): Remove (now in linebreakdef.c as lb_prop_Spanish).
(lbpFrench): Remove (now in linebreakdef.c as lb_prop_French).
(lbpRussian): Remove (now in linebreakdef.c as lb_prop_Russian).
(lbpChinese): Remove (now in linebreakdef.c as lb_prop_Chinese).
(struct LineBreakPropertiesLang): Remove (now in linebreakdef.h).
(lbpLangs): Remove (now in linebreakdef.c as lb_prop_lang_map).
(get_next_char_utf16): Make sure memory access not go beyond len.
* linebreak.h: Add copyright information and adjust comments.
(stddef.h): Add include file.
* linebreakdata.c (linebreak.h): Add include file.
(linebreakdef.h): Add include file.
(lbpDefault): Make global and rename to lb_prop_default.
* linebreakdata2.tmpl: Add two include files, a comment line, and
remove `static'.
* linebreakdef.c: New file.
* linebreakdef.h: New file.
2008-02-26 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.c (lbpSpanish): New array for Spanish-specific data.
(lbpLangs): Update the index array for Spanish.
(resolve_lb_class): Resolve AmbIguous class to IDeographic in
Chinese, Japanese, and Korean.
2008-02-26 Wu Yongwei <wuyongwei@gmail.com>
* Makefile (LineBreak.txt): Add new rule to retrieve it from the Web
if it is not already there.
2008-02-23 Wu Yongwei <wuyongwei@gmail.com>
Add files for linebreak.
* LineBreak1.sed: New file.
* LineBreak2.sed: New file.
* Makefile: New file.
* filter_dup.c: New file.
* linebreak.c: New file.
* linebreak.h: New file.
* linebreakdata.c: New file.
* linebreakdata1.tmpl: New file.
* linebreakdata2.tmpl: New file.
* linebreakdata3.tmpl: New file.

@@ -1,19 +0,0 @@
Copyright (C) 2008-2012 Wu Yongwei <wuyongwei at gmail dot com>
Copyright (C) 2012 Tom Hacohen <tom dot hacohen at samsung dot com>
This software is provided 'as-is', without any express or implied
warranty. In no event will the author be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgement in the product documentation would
be appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not
be misrepresented as being the original software.
3. This notice may not be removed or altered from any source
distribution.

@@ -1,56 +0,0 @@
New in libunibreak 1.1
- Update the code and data to conform to Unicode 6.2.0
- Update build files to support libtool 2.4
- Adjust code structure
- Make a few bug fixes
New in libunibreak 1.0
- Add word breaking support
- Change the library name to "libunibreak", while keeping maximum compatibility
- Add pkg-config support
New in liblinebreak 2.1
- Update the data according to LineBreak-6.0.0.txt
- Fix the bug that an assertion in code can fail if U+FFFC is
encountered at the beginning of a line
New in liblinebreak 2.0
- Update the algorithm and data according to UAX #14-24 and
LineBreak-5.2.0.txt
- Rename some functions to reduce namespace pollution
- Make Doxygen documentation better
New in liblinebreak 1.2
- Fix the bug that an assertion in code can fail if an invalid UTF-8 or
UTF-16 sequence is encountered near the end of input
- Remove the specialization of right single quotation mark as closing
punctuation mark in English, French, and Spanish, because it can be
used as apostrophe
- Make Doxygen documentation better
New in liblinebreak 1.1
- Make get_lb_prop_lang static and not an exported symbol
- Define is_line_breakable to alias to is_breakable
- Declare get_next_char_utf* will be changed to lb_get_next_char_utf*
- Move the declarations of get_next_char_utf* from linebreak.h to
linebreakdef.h
- Add the function documentation comments to the header files
New in liblinebreak 1.0
- Update the line breaking data according to UAX #14-22 and
LineBreak-5.1.0.txt
- Add autoconfiscation support (./configure, make, make install)
- Add Makefile for MSVC
First public release (0.9.6, or 20080421)
- Implement line breaking algorithm according to UAX #14-19
- Line breaking data is generated from LineBreak-5.0.0.txt
- Makefile only supports GCC

@@ -1,87 +0,0 @@
LIBUNIBREAK
===========
Overview
--------
This is the README file for libunibreak, an implementation of the line
breaking and word breaking algorithms as described in [Unicode Standard
Annex 14] [1] and [Unicode Standard Annex 29] [2]. Check the project's
[home page] [3] for up-to-date information.
[1]: http://www.unicode.org/reports/tr14/tr14-30.html
[2]: http://www.unicode.org/reports/tr29/tr29-21.html
[3]: https://github.com/adah1972/libunibreak
Licence
-------
This library is released under an open-source licence, the zlib/libpng
licence. Please check the file *LICENCE* for details.
Apart from using the algorithm, part of the code is derived from the
[Unicode Public Data] [4], and the [Unicode Terms of Use] [5] may apply.
[4]: http://www.unicode.org/Public/
[5]: http://www.unicode.org/copyright.html
Installation
------------
There are three ways to build the library:

1. On \*NIX systems supported by the autoconfiscation tools, do the
   normal

        ./configure
        make
        sudo make install

   to build and install both the dynamic and static libraries. In
   addition, one may
   - type `make doc` to generate the doxygen documentation;
   - type `make linebreakdata` to regenerate *linebreakdata.c* from
     *LineBreak.txt*; or
   - type `make wordbreakdata` to regenerate *wordbreakdata.c* from
     *WordBreakProperty.txt*.

2. On systems where GCC and Binutils are supported, one can type

        cd src
        cp -p Makefile.gcc Makefile
        make

   to build the static library. In addition, one may
   - type `make debug` or `make release` to explicitly generate the
     debug or release build;
   - type `make doc` to generate the doxygen documentation;
   - type `make linebreakdata` to regenerate *linebreakdata.c* from
     *LineBreak.txt*; or
   - type `make wordbreakdata` to regenerate *wordbreakdata.c* from
     *WordBreakProperty.txt*.

3. On Windows, apart from using method 1 (Cygwin/MSYS) and method 2
   (MinGW), MSVC can also be used. Type

        cd src
        nmake -f Makefile.msvc

   to build the static library. By default the debug version is built.
   To build the release version, type

        nmake -f Makefile.msvc CFG="libunibreak - Win32 Release"

Documentation
-------------
Check the generated document *doc/html/linebreak\_8h.html* and
*doc/html/wordbreak\_8h.html* in the downloaded file for the public
interfaces exposed to applications.
<!--
vim:autoindent:expandtab:formatoptions=tcqlmn:textwidth=72:
-->

@@ -1,879 +0,0 @@
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Line breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2008-2013 Wu Yongwei <wuyongwei at gmail dot com>
* Copyright (C) 2013 Petr Filipsky <philodej at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*
* The main reference is Unicode Standard Annex 14 (UAX #14):
* <URL:http://www.unicode.org/reports/tr14/>
*
* When this library was designed, this annex was at Revision 19, for
* Unicode 5.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
*
* This library has been updated according to Revision 30, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-30.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
/**
* @file linebreak.c
*
* Implementation of the line breaking algorithm as described in Unicode
* Standard Annex 14.
*
* @version 2.5, 2013/11/14
* @author Wu Yongwei
* @author Petr Filipsky
*/
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include "linebreak.h"
#include "linebreakdef.h"
/**
* Special value used internally to indicate an undefined break result.
*/
#define LINEBREAK_UNDEFINED -1
/**
* Size of the second-level index to the line breaking properties.
*/
#define LINEBREAK_INDEX_SIZE 40
/**
* Version number of the library.
*/
const int linebreak_version = LINEBREAK_VERSION;
/**
* Enumeration of break actions. They are used in the break action
* pair table below.
*/
enum BreakAction
{
DIR_BRK, /**< Direct break opportunity */
IND_BRK, /**< Indirect break opportunity */
CMI_BRK, /**< Indirect break opportunity for combining marks */
CMP_BRK, /**< Prohibited break for combining marks */
PRH_BRK /**< Prohibited break */
};
/**
* Break action pair table. This is a direct mapping of Table 2 of
* Unicode Standard Annex 14, Revision 30.
*/
static enum BreakAction baTable[LBP_RI][LBP_RI] = {
{ /* OP */
PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK,
CMP_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK,
PRH_BRK },
{ /* CL */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, PRH_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* CP */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, PRH_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* QU */
PRH_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
IND_BRK },
{ /* GL */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
IND_BRK },
{ /* NS */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* EX */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* SY */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* IS */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, IND_BRK,
DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* PR */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, IND_BRK,
IND_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
DIR_BRK },
{ /* PO */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, IND_BRK,
DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* NU */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
DIR_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* AL */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, IND_BRK,
DIR_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* HL */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, IND_BRK,
DIR_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* ID */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* IN */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* HY */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, DIR_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* BA */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, DIR_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* BB */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
IND_BRK },
{ /* B2 */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, PRH_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* ZW */
DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* CM */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, IND_BRK,
DIR_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK },
{ /* WJ */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
IND_BRK },
{ /* H2 */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK,
DIR_BRK },
{ /* H3 */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, IND_BRK,
DIR_BRK },
{ /* JL */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK,
DIR_BRK },
{ /* JV */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK,
DIR_BRK },
{ /* JT */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, IND_BRK,
DIR_BRK },
{ /* RI */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK,
CMI_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
IND_BRK },
};
/**
* Struct for the second-level index to the line breaking properties.
*/
struct LineBreakPropertiesIndex
{
utf32_t end; /**< End code point */
struct LineBreakProperties *lbp;/**< Pointer to line breaking properties */
};
/**
* Second-level index to the line breaking properties.
*/
static struct LineBreakPropertiesIndex lb_prop_index[LINEBREAK_INDEX_SIZE] =
{
{ 0xFFFFFFFF, lb_prop_default }
};
/**
* Initializes the second-level index to the line breaking properties.
* If it is not called, the performance of #get_char_lb_class_lang (and
* thus the main functionality) can be pretty bad, especially for big
* code points like those of Chinese.
*/
void init_linebreak(void)
{
size_t i;
size_t iPropDefault;
size_t len;
size_t step;
len = 0;
while (lb_prop_default[len].prop != LBP_Undefined)
++len;
step = len / LINEBREAK_INDEX_SIZE;
iPropDefault = 0;
for (i = 0; i < LINEBREAK_INDEX_SIZE; ++i)
{
lb_prop_index[i].lbp = lb_prop_default + iPropDefault;
iPropDefault += step;
lb_prop_index[i].end = lb_prop_default[iPropDefault].start - 1;
}
lb_prop_index[--i].end = 0xFFFFFFFF;
}
/**
* Gets the language-specific line breaking properties.
*
* @param lang language of the text
* @return pointer to the language-specific line breaking
* properties array if found; \c NULL otherwise
*/
static struct LineBreakProperties *get_lb_prop_lang(const char *lang)
{
struct LineBreakPropertiesLang *lbplIter;
if (lang != NULL)
{
for (lbplIter = lb_prop_lang_map; lbplIter->lang != NULL; ++lbplIter)
{
if (strncmp(lang, lbplIter->lang, lbplIter->namelen) == 0)
{
return lbplIter->lbp;
}
}
}
return NULL;
}
/**
* Gets the line breaking class of a character from a line breaking
* properties array.
*
* @param ch character to check
* @param lbp pointer to the line breaking properties array
* @return the line breaking class if found; \c LBP_XX otherwise
*/
static enum LineBreakClass get_char_lb_class(
utf32_t ch,
struct LineBreakProperties *lbp)
{
while (lbp->prop != LBP_Undefined && ch >= lbp->start)
{
if (ch <= lbp->end)
return lbp->prop;
++lbp;
}
return LBP_XX;
}
/**
* Gets the line breaking class of a character from the default line
* breaking properties array.
*
* @param ch character to check
* @return the line breaking class if found; \c LBP_XX otherwise
*/
static enum LineBreakClass get_char_lb_class_default(
utf32_t ch)
{
size_t i = 0;
while (ch > lb_prop_index[i].end)
++i;
assert(i < LINEBREAK_INDEX_SIZE);
return get_char_lb_class(ch, lb_prop_index[i].lbp);
}
/**
* Gets the line breaking class of a character for a specific
* language. This function will check the language-specific data first,
* and then the default data if there is no language-specific property
* available for the character.
*
* @param ch character to check
* @param lbpLang pointer to the language-specific line breaking
* properties array
* @return the line breaking class if found; \c LBP_XX
* otherwise
*/
static enum LineBreakClass get_char_lb_class_lang(
utf32_t ch,
struct LineBreakProperties *lbpLang)
{
enum LineBreakClass lbcResult;
/* Find the language-specific line breaking class for a character */
if (lbpLang)
{
lbcResult = get_char_lb_class(ch, lbpLang);
if (lbcResult != LBP_XX)
return lbcResult;
}
/* Fall back to the default line breaking class if no language
* context is provided, or if language-specific data are not available
* for the specific character in the specified language */
return get_char_lb_class_default(ch);
}
/**
* Resolves the line breaking class for certain ambiguous or complicated
* characters. They are treated in a simplistic way in this
* implementation.
*
* @param lbc line breaking class to resolve
* @param lang language of the text
* @return the resolved line breaking class
*/
static enum LineBreakClass resolve_lb_class(
enum LineBreakClass lbc,
const char *lang)
{
switch (lbc)
{
case LBP_AI:
if (lang != NULL &&
(strncmp(lang, "zh", 2) == 0 || /* Chinese */
strncmp(lang, "ja", 2) == 0 || /* Japanese */
strncmp(lang, "ko", 2) == 0)) /* Korean */
{
return LBP_ID;
}
else
{
return LBP_AL;
}
case LBP_CJ:
/* Simplified for `normal' line breaking. See
* <url:http://www.unicode.org/reports/tr14/tr14-30.html#CJ>
* for details. */
return LBP_ID;
case LBP_SA:
case LBP_SG:
case LBP_XX:
return LBP_AL;
default:
return lbc;
}
}
/**
* Applies special treatment to the first character in a line.
*
* @param[in,out] lbpCtx pointer to the line breaking context
* @pre \a lbpCtx->lbcCur has a valid line break class
* @post \a lbpCtx->lbcCur has the updated line break class
*/
static void treat_first_char(
struct LineBreakContext* lbpCtx)
{
switch (lbpCtx->lbcCur)
{
case LBP_LF:
case LBP_NL:
lbpCtx->lbcCur = LBP_BK; /* Rule LB5 */
break;
case LBP_CB:
lbpCtx->lbcCur = LBP_BA; /* Rule LB20 */
break;
case LBP_SP:
lbpCtx->lbcCur = LBP_WJ; /* Leading space treated as WJ */
break;
default:
break;
}
}
/**
* Tries to determine the line break opportunity using simple rules.
*
* @param[in,out] lbpCtx pointer to the line breaking context
* @pre \a lbpCtx->lbcCur has the current line break
* class; and \a lbpCtx->lbcNew has the line
* break class for the next character
* @post \a lbpCtx->lbcCur has the updated line break
* class
* @return break result, one of #LINEBREAK_MUSTBREAK,
* #LINEBREAK_ALLOWBREAK, and #LINEBREAK_NOBREAK
* if identified; or #LINEBREAK_UNDEFINED if
* table lookup is needed
*/
static int get_lb_result_simple(
struct LineBreakContext* lbpCtx)
{
if (lbpCtx->lbcCur == LBP_BK
|| (lbpCtx->lbcCur == LBP_CR && lbpCtx->lbcNew != LBP_LF))
{
return LINEBREAK_MUSTBREAK; /* Rules LB4 and LB5 */
}
switch (lbpCtx->lbcNew)
{
case LBP_SP:
return LINEBREAK_NOBREAK; /* Rule LB7; no change to lbcCur */
case LBP_BK:
case LBP_LF:
case LBP_NL:
lbpCtx->lbcCur = LBP_BK; /* Mandatory break after */
return LINEBREAK_NOBREAK; /* Rule LB6 */
case LBP_CR:
lbpCtx->lbcCur = LBP_CR;
return LINEBREAK_NOBREAK; /* Rule LB6 */
case LBP_CB:
lbpCtx->lbcCur = LBP_BA;
return LINEBREAK_ALLOWBREAK; /* Rule LB20 */
default:
return LINEBREAK_UNDEFINED; /* Table lookup is needed */
}
}
/**
* Determines the line break opportunity by table lookup.
*
* @param[in,out] lbpCtx pointer to the line breaking context
* @pre \a lbpCtx->lbcCur has the current line break
* class; \a lbpCtx->lbcLast has the line break
* class for the last character; and \a
* lbpCtx->lbcNew has the line break class for
* the next character
* @post \a lbpCtx->lbcCur has the updated line break
* class
* @return break result, one of #LINEBREAK_MUSTBREAK,
* #LINEBREAK_ALLOWBREAK, and #LINEBREAK_NOBREAK
*/
static int get_lb_result_lookup(
struct LineBreakContext* lbpCtx)
{
/* TODO: Rule LB21a, as introduced by Revision 28 of UAX#14, is not
* yet implemented below. */
int brk = LINEBREAK_UNDEFINED;
assert(lbpCtx->lbcCur <= LBP_JT);
assert(lbpCtx->lbcNew <= LBP_JT);
switch (baTable[lbpCtx->lbcCur - 1][lbpCtx->lbcNew - 1])
{
case DIR_BRK:
brk = LINEBREAK_ALLOWBREAK;
break;
case CMI_BRK:
case IND_BRK:
brk = (lbpCtx->lbcLast == LBP_SP)
? LINEBREAK_ALLOWBREAK
: LINEBREAK_NOBREAK;
break;
case CMP_BRK:
brk = LINEBREAK_NOBREAK;
if (lbpCtx->lbcLast != LBP_SP)
return brk; /* Do not update lbcCur */
break;
case PRH_BRK:
brk = LINEBREAK_NOBREAK;
break;
}
lbpCtx->lbcCur = lbpCtx->lbcNew;
return brk;
}
/**
* Initializes line breaking context for a given language.
*
* @param[in,out] lbpCtx pointer to the line breaking context
* @param[in] ch the first character to process
* @param[in] lang language of the input
* @post the line breaking context is initialized
*/
void lb_init_break_context(
struct LineBreakContext* lbpCtx,
utf32_t ch,
const char* lang)
{
lbpCtx->lang = lang;
lbpCtx->lbpLang = get_lb_prop_lang(lang);
lbpCtx->lbcLast = LBP_Undefined;
lbpCtx->lbcNew = LBP_Undefined;
lbpCtx->lbcCur = resolve_lb_class(
get_char_lb_class_lang(ch, lbpCtx->lbpLang),
lbpCtx->lang);
treat_first_char(lbpCtx);
}
/**
* Updates the LineBreakContext for the next code point and returns
* the detected break.
*
* @param[in,out] lbpCtx pointer to the line breaking context
* @param[in] ch Unicode code point
* @return break result, one of #LINEBREAK_MUSTBREAK,
* #LINEBREAK_ALLOWBREAK, and #LINEBREAK_NOBREAK
* @post the line breaking context is updated
*/
int lb_process_next_char(
struct LineBreakContext* lbpCtx,
utf32_t ch )
{
int brk;
lbpCtx->lbcLast = lbpCtx->lbcNew;
lbpCtx->lbcNew = get_char_lb_class_lang(ch, lbpCtx->lbpLang);
brk = get_lb_result_simple(lbpCtx);
switch (brk)
{
case LINEBREAK_MUSTBREAK:
lbpCtx->lbcCur = resolve_lb_class(lbpCtx->lbcNew, lbpCtx->lang);
treat_first_char(lbpCtx);
break;
case LINEBREAK_UNDEFINED:
lbpCtx->lbcNew = resolve_lb_class(lbpCtx->lbcNew, lbpCtx->lang);
brk = get_lb_result_lookup(lbpCtx);
break;
default:
break;
}
return brk;
}
/**
* Gets the next Unicode character in a UTF-8 sequence. The index will
* be advanced to the next complete character, unless the end of string
* is reached in the middle of a UTF-8 sequence.
*
* @param[in] s input UTF-8 string
* @param[in] len length of the string in bytes
* @param[in,out] ip pointer to the index
* @return the Unicode character beginning at the index; or
* #EOS if end of input is encountered
*/
utf32_t lb_get_next_char_utf8(
const utf8_t *s,
size_t len,
size_t *ip)
{
utf8_t ch;
utf32_t res;
assert(*ip <= len);
if (*ip == len)
return EOS;
ch = s[*ip];
if (ch < 0xC2 || ch > 0xF4)
{ /* One-byte sequence, tail (should not occur), or invalid */
*ip += 1;
return ch;
}
else if (ch < 0xE0)
{ /* Two-byte sequence */
if (*ip + 2 > len)
return EOS;
res = ((ch & 0x1F) << 6) + (s[*ip + 1] & 0x3F);
*ip += 2;
return res;
}
else if (ch < 0xF0)
{ /* Three-byte sequence */
if (*ip + 3 > len)
return EOS;
res = ((ch & 0x0F) << 12) +
((s[*ip + 1] & 0x3F) << 6) +
((s[*ip + 2] & 0x3F));
*ip += 3;
return res;
}
else
{ /* Four-byte sequence */
if (*ip + 4 > len)
return EOS;
res = ((ch & 0x07) << 18) +
((s[*ip + 1] & 0x3F) << 12) +
((s[*ip + 2] & 0x3F) << 6) +
((s[*ip + 3] & 0x3F));
*ip += 4;
return res;
}
}
/**
* Gets the next Unicode character in a UTF-16 sequence. The index will
* be advanced to the next complete character, unless the end of string
* is reached in the middle of a UTF-16 surrogate pair.
*
* @param[in] s input UTF-16 string
* @param[in] len length of the string in words
* @param[in,out] ip pointer to the index
* @return the Unicode character beginning at the index; or
* #EOS if end of input is encountered
*/
utf32_t lb_get_next_char_utf16(
const utf16_t *s,
size_t len,
size_t *ip)
{
utf16_t ch;
assert(*ip <= len);
if (*ip == len)
return EOS;
ch = s[(*ip)++];
if (ch < 0xD800 || ch > 0xDBFF)
{ /* If the character is not a high surrogate */
return ch;
}
if (*ip == len)
{ /* If the input ends here (an error) */
--(*ip);
return EOS;
}
if (s[*ip] < 0xDC00 || s[*ip] > 0xDFFF)
{ /* If the next character is not the low surrogate (an error) */
return ch;
}
/* Return the constructed character and advance the index again */
return (((utf32_t)ch & 0x3FF) << 10) + (s[(*ip)++] & 0x3FF) + 0x10000;
}
/**
* Gets the next Unicode character in a UTF-32 sequence. The index will
* be advanced to the next character.
*
* @param[in] s input UTF-32 string
* @param[in] len length of the string in dwords
* @param[in,out] ip pointer to the index
* @return the Unicode character beginning at the index; or
* #EOS if end of input is encountered
*/
utf32_t lb_get_next_char_utf32(
const utf32_t *s,
size_t len,
size_t *ip)
{
assert(*ip <= len);
if (*ip == len)
return EOS;
return s[(*ip)++];
}
/**
* Sets the line breaking information for a generic input string.
*
* @param[in] s input string
* @param[in] len length of the input
* @param[in] lang language of the input
* @param[out] brks pointer to the output breaking data,
* containing #LINEBREAK_MUSTBREAK,
* #LINEBREAK_ALLOWBREAK, #LINEBREAK_NOBREAK,
* or #LINEBREAK_INSIDEACHAR
* @param[in] get_next_char function to get the next UTF-32 character
*/
void set_linebreaks(
const void *s,
size_t len,
const char *lang,
char *brks,
get_next_char_t get_next_char)
{
utf32_t ch;
struct LineBreakContext lbCtx;
size_t posCur = 0;
size_t posLast = 0;
--posLast; /* To be ++'d later */
ch = get_next_char(s, len, &posCur);
if (ch == EOS)
return;
lb_init_break_context(&lbCtx, ch, lang);
/* Process a line till an explicit break or end of string */
for (;;)
{
for (++posLast; posLast < posCur - 1; ++posLast)
{
brks[posLast] = LINEBREAK_INSIDEACHAR;
}
assert(posLast == posCur - 1);
ch = get_next_char(s, len, &posCur);
if (ch == EOS)
break;
brks[posLast] = lb_process_next_char(&lbCtx, ch);
}
assert(posLast == posCur - 1 && posCur <= len);
/* Break after the last character */
brks[posLast] = LINEBREAK_MUSTBREAK;
/* When the input contains incomplete sequences */
while (posCur < len)
{
brks[posCur++] = LINEBREAK_INSIDEACHAR;
}
}
/**
* Sets the line breaking information for a UTF-8 input string.
*
* @param[in] s input UTF-8 string
* @param[in] len length of the input
* @param[in] lang language of the input
* @param[out] brks pointer to the output breaking data, containing
* #LINEBREAK_MUSTBREAK, #LINEBREAK_ALLOWBREAK,
* #LINEBREAK_NOBREAK, or #LINEBREAK_INSIDEACHAR
*/
void set_linebreaks_utf8(
const utf8_t *s,
size_t len,
const char *lang,
char *brks)
{
set_linebreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf8);
}
/**
* Sets the line breaking information for a UTF-16 input string.
*
* @param[in] s input UTF-16 string
* @param[in] len length of the input
* @param[in] lang language of the input
* @param[out] brks pointer to the output breaking data, containing
* #LINEBREAK_MUSTBREAK, #LINEBREAK_ALLOWBREAK,
* #LINEBREAK_NOBREAK, or #LINEBREAK_INSIDEACHAR
*/
void set_linebreaks_utf16(
const utf16_t *s,
size_t len,
const char *lang,
char *brks)
{
set_linebreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf16);
}
/**
* Sets the line breaking information for a UTF-32 input string.
*
* @param[in] s input UTF-32 string
* @param[in] len length of the input
* @param[in] lang language of the input
* @param[out] brks pointer to the output breaking data, containing
* #LINEBREAK_MUSTBREAK, #LINEBREAK_ALLOWBREAK,
* #LINEBREAK_NOBREAK, or #LINEBREAK_INSIDEACHAR
*/
void set_linebreaks_utf32(
const utf32_t *s,
size_t len,
const char *lang,
char *brks)
{
set_linebreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf32);
}
/**
* Tells whether a line break can occur between two Unicode characters.
* This is a wrapper function to expose a simple interface. Generally
* speaking, it is better to use #set_linebreaks_utf32 instead, since
* complicated cases involving combining marks, spaces, etc. cannot be
* correctly processed by this two-character interface.
*
* @param char1 the first Unicode character
* @param char2 the second Unicode character
* @param lang language of the input
* @return one of #LINEBREAK_MUSTBREAK, #LINEBREAK_ALLOWBREAK,
* #LINEBREAK_NOBREAK, or #LINEBREAK_INSIDEACHAR
*/
int is_line_breakable(
utf32_t char1,
utf32_t char2,
const char* lang)
{
utf32_t s[2];
char brks[2];
s[0] = char1;
s[1] = char2;
set_linebreaks_utf32(s, 2, lang, brks);
return brks[0];
}
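
The decoding helpers in the removed file can be summarised with a small standalone sketch. It is not the library's code: the names `sketch_next_utf8`, `sketch_combine_surrogates` and the sentinel `SKETCH_EOS` are hypothetical, and the sentinel value is an assumption made only so the example compiles on its own. It follows the same approach as the functions above: lead bytes outside 0xC2..0xF4 are returned as-is, a truncated sequence yields the sentinel without advancing the index, and a surrogate pair is combined with the usual 10-bit shift plus 0x10000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical end-of-input sentinel (assumption, not taken from the header). */
#define SKETCH_EOS 0xFFFFFFFFu

/* Decode one code point from s[*ip .. len) and advance *ip past it. */
uint32_t sketch_next_utf8(const unsigned char *s, size_t len, size_t *ip)
{
    unsigned char ch;
    size_t need, k;
    uint32_t res;

    assert(s != NULL && ip != NULL);
    if (*ip >= len)
        return SKETCH_EOS;
    ch = s[*ip];
    if (ch < 0xC2 || ch > 0xF4) {
        /* ASCII, a stray continuation byte, or an invalid lead byte */
        *ip += 1;
        return ch;
    }
    need = (ch < 0xE0) ? 2 : (ch < 0xF0) ? 3 : 4;
    if (*ip + need > len)
        return SKETCH_EOS;      /* truncated sequence: index left unchanged */
    /* Payload mask of the lead byte: 0x1F, 0x0F or 0x07 */
    res = ch & (0xFFu >> (need + 1));
    for (k = 1; k < need; ++k)
        res = (res << 6) | (s[*ip + k] & 0x3Fu);
    *ip += need;
    return res;
}

/* Combine a high and a low surrogate into one code point. */
uint32_t sketch_combine_surrogates(uint16_t hi, uint16_t lo)
{
    return (((uint32_t)hi & 0x3FFu) << 10) + (lo & 0x3FFu) + 0x10000u;
}
```

Like the original, the sketch does not validate continuation bytes; it only guards against reading past the end of the buffer.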

View File

@ -1,87 +0,0 @@
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Line breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2008-2012 Wu Yongwei <wuyongwei at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*
* The main reference is Unicode Standard Annex 14 (UAX #14):
* <URL:http://www.unicode.org/reports/tr14/>
*
* When this library was designed, this annex was at Revision 19, for
* Unicode 5.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
*
* This library has been updated according to Revision 30, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-30.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
/**
* @file linebreak.h
*
* Header file for the line breaking algorithm.
*
* @version 2.2, 2012/10/06
* @author Wu Yongwei
*/
#ifndef LINEBREAK_H
#define LINEBREAK_H
#include <stddef.h>
#ifdef __cplusplus
extern "C" {
#endif
#define LINEBREAK_VERSION 0x0202 /**< Version of the library linebreak */
extern const int linebreak_version;
#ifndef LINEBREAK_UTF_TYPES_DEFINED
#define LINEBREAK_UTF_TYPES_DEFINED
typedef unsigned char utf8_t; /**< Type for UTF-8 data points */
typedef unsigned short utf16_t; /**< Type for UTF-16 data points */
typedef unsigned int utf32_t; /**< Type for UTF-32 data points */
#endif
#define LINEBREAK_MUSTBREAK 0 /**< Break is mandatory */
#define LINEBREAK_ALLOWBREAK 1 /**< Break is allowed */
#define LINEBREAK_NOBREAK 2 /**< No break is possible */
#define LINEBREAK_INSIDEACHAR 3 /**< A UTF-8/16 sequence is unfinished */
void init_linebreak(void);
void set_linebreaks_utf8(
const utf8_t *s, size_t len, const char* lang, char *brks);
void set_linebreaks_utf16(
const utf16_t *s, size_t len, const char* lang, char *brks);
void set_linebreaks_utf32(
const utf32_t *s, size_t len, const char* lang, char *brks);
int is_line_breakable(utf32_t char1, utf32_t char2, const char* lang);
#ifdef __cplusplus
}
#endif
#endif /* LINEBREAK_H */
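
As a rough illustration of how a caller might consume the `brks` array filled in by the `set_linebreaks_*` functions, the sketch below picks the length of one wrapped line. The helper `sketch_line_length` and the `BRK_*` names are hypothetical; the numeric values mirror the `LINEBREAK_*` macros in the header above and are redefined only so the example compiles without the library. It assumes, as in the removed implementation, that `brks[i]` describes the break opportunity after position `i`.

```c
#include <assert.h>
#include <stddef.h>

/* Same values as the LINEBREAK_* macros; redefined for a standalone build. */
enum { BRK_MUST = 0, BRK_ALLOW = 1, BRK_NO = 2, BRK_INSIDE = 3 };

/*
 * Return how many units starting at `start` belong on the current line
 * when at most `max_units` fit. Prefers a mandatory break, then the last
 * allowed break, and falls back to a hard cut when no break is available.
 */
size_t sketch_line_length(const char *brks, size_t len,
                          size_t start, size_t max_units)
{
    size_t last_allowed = 0;    /* 0 means "no allowed break seen yet" */
    size_t i;

    assert(brks != NULL && start <= len && max_units > 0);
    for (i = start; i < len && i - start < max_units; ++i) {
        if (brks[i] == BRK_MUST)
            return i - start + 1;
        if (brks[i] == BRK_ALLOW)
            last_allowed = i - start + 1;
    }
    if (i == len)
        return len - start;     /* the remainder fits on the line */
    return last_allowed ? last_allowed : max_units;
}
```

The hard-cut fallback can split a multi-unit character; a real renderer would likely back up over `BRK_INSIDE` entries first.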

File diff suppressed because it is too large.

View File

@ -1,139 +0,0 @@
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Line breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2008-2012 Wu Yongwei <wuyongwei at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*
* The main reference is Unicode Standard Annex 14 (UAX #14):
* <URL:http://www.unicode.org/reports/tr14/>
*
* When this library was designed, this annex was at Revision 19, for
* Unicode 5.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
*
* This library has been updated according to Revision 30, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-30.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
/**
* @file linebreakdef.c
*
* Definition of language-specific data.
*
* @version 2.2, 2012/10/06
* @author Wu Yongwei
*/
#include "linebreak.h"
#include "linebreakdef.h"
/**
* English-specific data over the default Unicode rules.
*/
static struct LineBreakProperties lb_prop_English[] = {
{ 0x2018, 0x2018, LBP_OP }, /* Left single quotation mark: opening */
{ 0x201C, 0x201C, LBP_OP }, /* Left double quotation mark: opening */
{ 0x201D, 0x201D, LBP_CL }, /* Right double quotation mark: closing */
{ 0, 0, LBP_Undefined }
};
/**
* German-specific data over the default Unicode rules.
*/
static struct LineBreakProperties lb_prop_German[] = {
{ 0x00AB, 0x00AB, LBP_CL }, /* Left double angle quotation mark: closing */
{ 0x00BB, 0x00BB, LBP_OP }, /* Right double angle quotation mark: opening */
{ 0x2018, 0x2018, LBP_CL }, /* Left single quotation mark: closing */
{ 0x201C, 0x201C, LBP_CL }, /* Left double quotation mark: closing */
{ 0x2039, 0x2039, LBP_CL }, /* Left single angle quotation mark: closing */
{ 0x203A, 0x203A, LBP_OP }, /* Right single angle quotation mark: opening */
{ 0, 0, LBP_Undefined }
};
/**
* Spanish-specific data over the default Unicode rules.
*/
static struct LineBreakProperties lb_prop_Spanish[] = {
{ 0x00AB, 0x00AB, LBP_OP }, /* Left double angle quotation mark: opening */
{ 0x00BB, 0x00BB, LBP_CL }, /* Right double angle quotation mark: closing */
{ 0x2018, 0x2018, LBP_OP }, /* Left single quotation mark: opening */
{ 0x201C, 0x201C, LBP_OP }, /* Left double quotation mark: opening */
{ 0x201D, 0x201D, LBP_CL }, /* Right double quotation mark: closing */
{ 0x2039, 0x2039, LBP_OP }, /* Left single angle quotation mark: opening */
{ 0x203A, 0x203A, LBP_CL }, /* Right single angle quotation mark: closing */
{ 0, 0, LBP_Undefined }
};
/**
* French-specific data over the default Unicode rules.
*/
static struct LineBreakProperties lb_prop_French[] = {
{ 0x00AB, 0x00AB, LBP_OP }, /* Left double angle quotation mark: opening */
{ 0x00BB, 0x00BB, LBP_CL }, /* Right double angle quotation mark: closing */
{ 0x2018, 0x2018, LBP_OP }, /* Left single quotation mark: opening */
{ 0x201C, 0x201C, LBP_OP }, /* Left double quotation mark: opening */
{ 0x201D, 0x201D, LBP_CL }, /* Right double quotation mark: closing */
{ 0x2039, 0x2039, LBP_OP }, /* Left single angle quotation mark: opening */
{ 0x203A, 0x203A, LBP_CL }, /* Right single angle quotation mark: closing */
{ 0, 0, LBP_Undefined }
};
/**
* Russian-specific data over the default Unicode rules.
*/
static struct LineBreakProperties lb_prop_Russian[] = {
{ 0x00AB, 0x00AB, LBP_OP }, /* Left double angle quotation mark: opening */
{ 0x00BB, 0x00BB, LBP_CL }, /* Right double angle quotation mark: closing */
{ 0x201C, 0x201C, LBP_CL }, /* Left double quotation mark: closing */
{ 0, 0, LBP_Undefined }
};
/**
* Chinese-specific data over the default Unicode rules.
*/
static struct LineBreakProperties lb_prop_Chinese[] = {
{ 0x2018, 0x2018, LBP_OP }, /* Left single quotation mark: opening */
{ 0x2019, 0x2019, LBP_CL }, /* Right single quotation mark: closing */
{ 0x201C, 0x201C, LBP_OP }, /* Left double quotation mark: opening */
{ 0x201D, 0x201D, LBP_CL }, /* Right double quotation mark: closing */
{ 0, 0, LBP_Undefined }
};
/**
* Association data of language-specific line breaking properties with
* language names. This is the definition for the static data in this
* file. If you want more flexibility, or do not need the data here,
* you may want to redefine \e lb_prop_lang_map in your C source file.
*/
struct LineBreakPropertiesLang lb_prop_lang_map[] = {
{ "en", 2, lb_prop_English },
{ "de", 2, lb_prop_German },
{ "es", 2, lb_prop_Spanish },
{ "fr", 2, lb_prop_French },
{ "ru", 2, lb_prop_Russian },
{ "zh", 2, lb_prop_Chinese },
{ NULL, 0, NULL }
};


@ -1,174 +0,0 @@
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Line breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2008-2013 Wu Yongwei <wuyongwei at gmail dot com>
* Copyright (C) 2013 Petr Filipsky <philodej at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*
* The main reference is Unicode Standard Annex 14 (UAX #14):
* <URL:http://www.unicode.org/reports/tr14/>
*
* When this library was designed, this annex was at Revision 19, for
* Unicode 5.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
*
* This library has been updated according to Revision 30, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-30.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
/**
* @file linebreakdef.h
*
* Definitions of internal data structures, declarations of global
* variables, and function prototypes for the line breaking algorithm.
*
* @version 2.4, 2013/11/10
* @author Wu Yongwei
* @author Petr Filipsky
*/
/**
* Constant value to mark the end of string. It is not a valid Unicode
* character.
*/
#define EOS 0xFFFFFFFF
/**
* Line break classes. This is a direct mapping of Table 1 of Unicode
* Standard Annex 14, Revision 26.
*/
enum LineBreakClass
{
/* This is used to signal an error condition. */
LBP_Undefined, /**< Undefined */
/* The following break classes are treated in the pair table. */
LBP_OP, /**< Opening punctuation */
LBP_CL, /**< Closing punctuation */
LBP_CP, /**< Closing parenthesis */
LBP_QU, /**< Ambiguous quotation */
LBP_GL, /**< Glue */
LBP_NS, /**< Non-starters */
LBP_EX, /**< Exclamation/Interrogation */
LBP_SY, /**< Symbols allowing break after */
LBP_IS, /**< Infix separator */
LBP_PR, /**< Prefix */
LBP_PO, /**< Postfix */
LBP_NU, /**< Numeric */
LBP_AL, /**< Alphabetic */
LBP_HL, /**< Hebrew letter */
LBP_ID, /**< Ideographic */
LBP_IN, /**< Inseparable characters */
LBP_HY, /**< Hyphen */
LBP_BA, /**< Break after */
LBP_BB, /**< Break before */
LBP_B2, /**< Break on either side (but not pair) */
LBP_ZW, /**< Zero-width space */
LBP_CM, /**< Combining marks */
LBP_WJ, /**< Word joiner */
LBP_H2, /**< Hangul LV */
LBP_H3, /**< Hangul LVT */
LBP_JL, /**< Hangul L Jamo */
LBP_JV, /**< Hangul V Jamo */
LBP_JT, /**< Hangul T Jamo */
LBP_RI, /**< Regional indicator */
/* The following break classes are not treated in the pair table */
LBP_AI, /**< Ambiguous (alphabetic or ideograph) */
LBP_BK, /**< Break (mandatory) */
LBP_CB, /**< Contingent break */
LBP_CJ, /**< Conditional Japanese starter */
LBP_CR, /**< Carriage return */
LBP_LF, /**< Line feed */
LBP_NL, /**< Next line */
LBP_SA, /**< South-East Asian */
LBP_SG, /**< Surrogates */
LBP_SP, /**< Space */
LBP_XX /**< Unknown */
};
/**
* Struct for entries of line break properties. The array of the
* entries \e must be sorted.
*/
struct LineBreakProperties
{
utf32_t start; /**< Starting code point */
utf32_t end; /**< End code point */
enum LineBreakClass prop; /**< The line breaking property */
};
/**
* Struct for association of language-specific line breaking properties
* with language names.
*/
struct LineBreakPropertiesLang
{
const char *lang; /**< Language name */
size_t namelen; /**< Length of name to match */
struct LineBreakProperties *lbp; /**< Pointer to associated data */
};
/**
* Context representing internal state of the line breaking algorithm.
* This is useful to callers if incremental analysis is wanted.
*/
struct LineBreakContext
{
const char *lang; /**< Language name */
struct LineBreakProperties *lbpLang;/**< Pointer to LineBreakProperties */
enum LineBreakClass lbcCur; /**< Breaking class of current codepoint */
enum LineBreakClass lbcNew; /**< Breaking class of next codepoint */
enum LineBreakClass lbcLast; /**< Breaking class of last codepoint */
};
/**
* Abstract function interface for #lb_get_next_char_utf8,
* #lb_get_next_char_utf16, and #lb_get_next_char_utf32.
*/
typedef utf32_t (*get_next_char_t)(const void *, size_t, size_t *);
/* Declarations */
extern struct LineBreakProperties lb_prop_default[];
extern struct LineBreakPropertiesLang lb_prop_lang_map[];
/* Function Prototype */
utf32_t lb_get_next_char_utf8(const utf8_t *s, size_t len, size_t *ip);
utf32_t lb_get_next_char_utf16(const utf16_t *s, size_t len, size_t *ip);
utf32_t lb_get_next_char_utf32(const utf32_t *s, size_t len, size_t *ip);
void lb_init_break_context(
struct LineBreakContext* lbpCtx,
utf32_t ch,
const char* lang);
int lb_process_next_char(
struct LineBreakContext* lbpCtx,
utf32_t ch);
void set_linebreaks(
const void *s,
size_t len,
const char *lang,
char *brks,
get_next_char_t get_next_char);
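The `get_next_char_t` interface above takes the string, its length and a position pointer, and `EOS` marks the end of input. A minimal sketch of a callback with that shape for UTF-32 input follows. The `demo_` names are hypothetical, `uint32_t` is assumed as a stand-in for `utf32_t`, and this is not the library's own implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_EOS 0xFFFFFFFFu    /* same value as EOS in the header above */

typedef uint32_t demo_utf32_t;  /* assumption: stand-in for utf32_t */

/* Callback shape mirroring get_next_char_t. */
typedef demo_utf32_t (*demo_get_next_char_t)(const void *, size_t, size_t *);

/* Returns the code point at *ip and advances *ip, or DEMO_EOS once the
 * position has reached len. */
static demo_utf32_t demo_next_utf32(const void *s, size_t len, size_t *ip)
{
    const demo_utf32_t *p = s;

    if (*ip >= len)
        return DEMO_EOS;
    return p[(*ip)++];
}

/* Counts code points by calling the callback until it reports the end. */
static size_t demo_count_chars(const void *s, size_t len,
                               demo_get_next_char_t next)
{
    size_t pos = 0;
    size_t n = 0;

    while (next(s, len, &pos) != DEMO_EOS)
        ++n;
    return n;
}
```

Passing the decoder as a function pointer is what lets one generic routine serve UTF-8, UTF-16 and UTF-32 input alike.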


@ -1,453 +0,0 @@
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Word breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2013 Tom Hacohen <tom at stosb dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*
* The main reference is Unicode Standard Annex 29 (UAX #29):
* <URL:http://unicode.org/reports/tr29>
*
* When this library was designed, this annex was at Revision 17, for
* Unicode 6.0.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
*
* This library has been updated according to Revision 21, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-21.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
/**
* @file wordbreak.c
*
* Implementation of the word breaking algorithm as described in Unicode
* Standard Annex 29.
*
* @version 2.4, 2013/09/28
* @author Tom Hacohen
*/
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include "linebreak.h"
#include "linebreakdef.h"
#include "wordbreak.h"
#include "wordbreakdata.c"
#define ARRAY_LEN(x) (sizeof(x) / sizeof(x[0]))
/**
* Initializes the wordbreak internals. It currently does nothing, but
* it may in the future.
*/
void init_wordbreak(void)
{
}
/**
* Gets the word breaking class of a character.
*
* @param ch character to check
* @param wbp pointer to the word breaking properties array
* @param len size of the wbp array in number of items
* @return the word breaking class if found; \c WBP_Any otherwise
*/
static enum WordBreakClass get_char_wb_class(
utf32_t ch,
struct WordBreakProperties *wbp,
size_t len)
{
int min = 0;
int max = len - 1;
int mid;
do
{
mid = (min + max) / 2;
if (ch < wbp[mid].start)
max = mid - 1;
else if (ch > wbp[mid].end)
min = mid + 1;
else
return wbp[mid].prop;
}
while (min <= max);
return WBP_Any;
}
/**
* Sets the word break types to a specific value in a range.
*
* It sets the inside chars to #WORDBREAK_INSIDEACHAR and the rest to brkType.
* Assumes \a brks is initialized - all the cells with #WORDBREAK_NOBREAK are
* cells that we really don't want to break after.
*
* @param[in] s input string
* @param[out] brks breaks array to fill
* @param[in] posStart start position
* @param[in] posEnd end position (exclusive)
* @param[in] len length of the string
* @param[in] brkType breaks type to use
* @param[in] get_next_char function to get the next UTF-32 character
*/
static void set_brks_to(
const void *s,
char *brks,
size_t posStart,
size_t posEnd,
size_t len,
char brkType,
get_next_char_t get_next_char)
{
size_t posNext = posStart;
while (posNext < posEnd)
{
utf32_t ch;
(void)ch;
ch = get_next_char(s, len, &posNext);
assert(ch != EOS);
for (; posStart < posNext - 1; ++posStart)
brks[posStart] = WORDBREAK_INSIDEACHAR;
assert(posStart == posNext - 1);
/* Only set it if we haven't set it not to break before. */
if (brks[posStart] != WORDBREAK_NOBREAK)
brks[posStart] = brkType;
posStart = posNext;
}
}
/* Checks to see if the class is newline, CR, or LF (rules WB3a and b). */
#define IS_WB3ab(cls) ((cls == WBP_Newline) || (cls == WBP_CR) || \
(cls == WBP_LF))
/**
* Sets the word breaking information for a generic input string.
*
* @param[in] s input string
* @param[in] len length of the input
* @param[in] lang language of the input
* @param[out] brks pointer to the output breaking data, containing
* #WORDBREAK_BREAK, #WORDBREAK_NOBREAK, or
* #WORDBREAK_INSIDEACHAR
* @param[in] get_next_char function to get the next UTF-32 character
*/
static void set_wordbreaks(
const void *s,
size_t len,
const char *lang,
char *brks,
get_next_char_t get_next_char)
{
enum WordBreakClass wbcLast = WBP_Undefined;
/* wbcSeqStart is the class that started the current sequence.
* WBP_Undefined is a special case that means "start of text" (sot).
* This value is the class that is at the start of the current rule
* matching sequence. For example, in case of Numeric+MidNum+Numeric
* it'll be Numeric all the way.
*/
enum WordBreakClass wbcSeqStart = WBP_Undefined;
utf32_t ch;
size_t posNext = 0;
size_t posCur = 0;
size_t posLast = 0;
/* TODO: Language-specific specialization. */
(void) lang;
/* Init brks. */
memset(brks, WORDBREAK_BREAK, len);
ch = get_next_char(s, len, &posNext);
while (ch != EOS)
{
enum WordBreakClass wbcCur;
wbcCur = get_char_wb_class(ch, wb_prop_default,
ARRAY_LEN(wb_prop_default));
switch (wbcCur)
{
case WBP_CR:
/* WB3b */
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
wbcSeqStart = wbcCur;
posLast = posCur;
break;
case WBP_LF:
if (wbcSeqStart == WBP_CR) /* WB3 */
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_NOBREAK, get_next_char);
wbcSeqStart = wbcCur;
posLast = posCur;
break;
}
/* Fall through */
case WBP_Newline:
/* WB3a,3b */
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
wbcSeqStart = wbcCur;
posLast = posCur;
break;
case WBP_Extend:
case WBP_Format:
/* WB4 - If not the first char/after a newline (WB3a,3b), skip
* this class, set it to be the same as the prev, and mark
* brks not to break before them. */
if ((wbcSeqStart == WBP_Undefined) || IS_WB3ab(wbcSeqStart))
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
wbcSeqStart = wbcCur;
}
else
{
/* It's surely not the first */
brks[posCur - 1] = WORDBREAK_NOBREAK;
/* "inherit" the previous class. */
wbcCur = wbcLast;
}
break;
case WBP_Katakana:
if ((wbcSeqStart == WBP_Katakana) || /* WB13 */
(wbcSeqStart == WBP_ExtendNumLet)) /* WB13b */
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_NOBREAK, get_next_char);
}
/* No rule found, reset */
else
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
}
wbcSeqStart = wbcCur;
posLast = posCur;
break;
case WBP_ALetter:
if ((wbcSeqStart == WBP_ALetter) || /* WB5,6,7 */
(wbcLast == WBP_Numeric) || /* WB10 */
(wbcSeqStart == WBP_ExtendNumLet)) /* WB13b */
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_NOBREAK, get_next_char);
}
/* No rule found, reset */
else
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
}
wbcSeqStart = wbcCur;
posLast = posCur;
break;
case WBP_MidNumLet:
if ((wbcLast == WBP_ALetter) || /* WB6,7 */
(wbcLast == WBP_Numeric)) /* WB11,12 */
{
/* Go on */
}
else
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
wbcSeqStart = wbcCur;
posLast = posCur;
}
break;
case WBP_MidLetter:
if (wbcLast == WBP_ALetter) /* WB6,7 */
{
/* Go on */
}
else
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
wbcSeqStart = wbcCur;
posLast = posCur;
}
break;
case WBP_MidNum:
if (wbcLast == WBP_Numeric) /* WB11,12 */
{
/* Go on */
}
else
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
wbcSeqStart = wbcCur;
posLast = posCur;
}
break;
case WBP_Numeric:
if ((wbcSeqStart == WBP_Numeric) || /* WB8,11,12 */
(wbcLast == WBP_ALetter) || /* WB9 */
(wbcSeqStart == WBP_ExtendNumLet)) /* WB13b */
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_NOBREAK, get_next_char);
}
/* No rule found, reset */
else
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
}
wbcSeqStart = wbcCur;
posLast = posCur;
break;
case WBP_ExtendNumLet:
/* WB13a,13b */
if ((wbcSeqStart == wbcLast) &&
((wbcLast == WBP_ALetter) ||
(wbcLast == WBP_Numeric) ||
(wbcLast == WBP_Katakana) ||
(wbcLast == WBP_ExtendNumLet)))
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_NOBREAK, get_next_char);
}
/* No rule found, reset */
else
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
}
wbcSeqStart = wbcCur;
posLast = posCur;
break;
case WBP_Regional:
/* WB13c */
if (wbcSeqStart == WBP_Regional)
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_NOBREAK, get_next_char);
}
wbcSeqStart = wbcCur;
posLast = posCur;
break;
case WBP_Any:
/* Allow breaks and reset */
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
wbcSeqStart = wbcCur;
posLast = posCur;
break;
default:
/* Error, should never get here! */
assert(0);
break;
}
wbcLast = wbcCur;
posCur = posNext;
ch = get_next_char(s, len, &posNext);
}
/* WB2 */
set_brks_to(s, brks, posLast, posNext, len,
WORDBREAK_BREAK, get_next_char);
}
/**
* Sets the word breaking information for a UTF-8 input string.
*
* @param[in] s input UTF-8 string
* @param[in] len length of the input
* @param[in] lang language of the input
* @param[out] brks pointer to the output breaking data, containing
* #WORDBREAK_BREAK, #WORDBREAK_NOBREAK, or
* #WORDBREAK_INSIDEACHAR
*/
void set_wordbreaks_utf8(
const utf8_t *s,
size_t len,
const char *lang,
char *brks)
{
set_wordbreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf8);
}
/**
* Sets the word breaking information for a UTF-16 input string.
*
* @param[in] s input UTF-16 string
* @param[in] len length of the input
* @param[in] lang language of the input
* @param[out] brks pointer to the output breaking data, containing
* #WORDBREAK_BREAK, #WORDBREAK_NOBREAK, or
* #WORDBREAK_INSIDEACHAR
*/
void set_wordbreaks_utf16(
const utf16_t *s,
size_t len,
const char *lang,
char *brks)
{
set_wordbreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf16);
}
/**
* Sets the word breaking information for a UTF-32 input string.
*
* @param[in] s input UTF-32 string
* @param[in] len length of the input
* @param[in] lang language of the input
* @param[out] brks pointer to the output breaking data, containing
* #WORDBREAK_BREAK, #WORDBREAK_NOBREAK, or
* #WORDBREAK_INSIDEACHAR
*/
void set_wordbreaks_utf32(
const utf32_t *s,
size_t len,
const char *lang,
char *brks)
{
set_wordbreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf32);
}
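`get_char_wb_class` in the file above relies on the property table being sorted by code point, so each lookup is a binary search over inclusive ranges. The sketch below shows the same idea on a four-row table. The `demo_` names and class constants are hypothetical, and the loop uses a half-open interval with `size_t` indices rather than the `int`-based `do`/`while` of the original.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct WordBreakProperties. */
struct demo_range
{
    uint32_t start;  /* First code point of the range */
    uint32_t end;    /* Last code point of the range (inclusive) */
    int prop;        /* Class assigned to the range */
};

enum { DEMO_ANY = 0, DEMO_LF, DEMO_NUMERIC, DEMO_ALETTER };

/* A few ASCII ranges, sorted by start as the search requires. */
static const struct demo_range demo_table[] = {
    { 0x000A, 0x000A, DEMO_LF },
    { 0x0030, 0x0039, DEMO_NUMERIC },
    { 0x0041, 0x005A, DEMO_ALETTER },
    { 0x0061, 0x007A, DEMO_ALETTER }
};

/* Returns the class of ch, or DEMO_ANY when no range contains it. */
static int demo_lookup(uint32_t ch, const struct demo_range *tbl, size_t len)
{
    size_t lo = 0;
    size_t hi = len;  /* Search the half-open interval [lo, hi) */

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (ch < tbl[mid].start)
            hi = mid;
        else if (ch > tbl[mid].end)
            lo = mid + 1;
        else
            return tbl[mid].prop;
    }
    return DEMO_ANY;
}
```

The half-open form also copes with an empty table, since the loop body never runs when `len` is zero.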


@ -1,76 +0,0 @@
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Word breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2013 Tom Hacohen <tom at stosb dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*
* The main reference is Unicode Standard Annex 29 (UAX #29):
* <URL:http://unicode.org/reports/tr29>
*
* When this library was designed, this annex was at Revision 17, for
* Unicode 6.0.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
*
* This library has been updated according to Revision 21, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-21.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
/**
* @file wordbreak.h
*
* Header file for the word breaking (segmentation) algorithm.
*
* @version 2.3, 2013/09/28
* @author Tom Hacohen
*/
#ifndef WORDBREAK_H
#define WORDBREAK_H
#include <stddef.h>
#include "linebreak.h"
#ifdef __cplusplus
extern "C" {
#endif
#define WORDBREAK_BREAK 0 /**< Break is allowed */
#define WORDBREAK_NOBREAK 1 /**< No break is allowed */
#define WORDBREAK_INSIDEACHAR 2 /**< A UTF-8/16 sequence is unfinished */
void init_wordbreak(void);
void set_wordbreaks_utf8(
const utf8_t *s, size_t len, const char* lang, char *brks);
void set_wordbreaks_utf16(
const utf16_t *s, size_t len, const char* lang, char *brks);
void set_wordbreaks_utf32(
const utf32_t *s, size_t len, const char* lang, char *brks);
#ifdef __cplusplus
}
#endif
#endif
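The `set_wordbreaks_*` functions declared above fill one `brks` cell per code unit, and a break cell means a break is allowed after that unit. A small sketch of walking such an array to find segment boundaries follows. The `demo_` names are hypothetical and the constants simply mirror the `WORDBREAK_*` values from the header.

```c
#include <stddef.h>

#define DEMO_BREAK 0        /* mirrors WORDBREAK_BREAK */
#define DEMO_NOBREAK 1      /* mirrors WORDBREAK_NOBREAK */
#define DEMO_INSIDEACHAR 2  /* mirrors WORDBREAK_INSIDEACHAR */

/* Returns the offset one past the end of the segment that begins at
 * start: the unit after the first break cell, or len if none is left. */
static size_t demo_segment_end(const char *brks, size_t len, size_t start)
{
    size_t i;

    for (i = start; i < len; ++i)
    {
        if (brks[i] == DEMO_BREAK)
            return i + 1;
    }
    return len;
}

/* Counts the segments described by a brks array. */
static size_t demo_count_segments(const char *brks, size_t len)
{
    size_t start = 0;
    size_t n = 0;

    while (start < len)
    {
        start = demo_segment_end(brks, len, start);
        ++n;
    }
    return n;
}
```

For a hand-written array such as `{ NOBREAK, BREAK, BREAK, NOBREAK, BREAK }` this yields the segments [0, 2), [2, 3) and [3, 5); that array is an illustration, not output captured from the library.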


@ -1,949 +0,0 @@
/* The content of this file is generated from:
# WordBreakProperty-6.2.0.txt
# Date: 2012-08-13, 19:12:09 GMT [MD]
*/
#include "linebreak.h"
#include "wordbreakdef.h"
static struct WordBreakProperties wb_prop_default[] = {
{0x000A, 0x000A, WBP_LF},
{0x000B, 0x000C, WBP_Newline},
{0x000D, 0x000D, WBP_CR},
{0x0027, 0x0027, WBP_MidNumLet},
{0x002C, 0x002C, WBP_MidNum},
{0x002E, 0x002E, WBP_MidNumLet},
{0x0030, 0x0039, WBP_Numeric},
{0x003A, 0x003A, WBP_MidLetter},
{0x003B, 0x003B, WBP_MidNum},
{0x0041, 0x005A, WBP_ALetter},
{0x005F, 0x005F, WBP_ExtendNumLet},
{0x0061, 0x007A, WBP_ALetter},
{0x0085, 0x0085, WBP_Newline},
{0x00AA, 0x00AA, WBP_ALetter},
{0x00AD, 0x00AD, WBP_Format},
{0x00B5, 0x00B5, WBP_ALetter},
{0x00B7, 0x00B7, WBP_MidLetter},
{0x00BA, 0x00BA, WBP_ALetter},
{0x00C0, 0x00D6, WBP_ALetter},
{0x00D8, 0x00F6, WBP_ALetter},
{0x00F8, 0x01BA, WBP_ALetter},
{0x01BB, 0x01BB, WBP_ALetter},
{0x01BC, 0x01BF, WBP_ALetter},
{0x01C0, 0x01C3, WBP_ALetter},
{0x01C4, 0x0293, WBP_ALetter},
{0x0294, 0x0294, WBP_ALetter},
{0x0295, 0x02AF, WBP_ALetter},
{0x02B0, 0x02C1, WBP_ALetter},
{0x02C6, 0x02D1, WBP_ALetter},
{0x02E0, 0x02E4, WBP_ALetter},
{0x02EC, 0x02EC, WBP_ALetter},
{0x02EE, 0x02EE, WBP_ALetter},
{0x0300, 0x036F, WBP_Extend},
{0x0370, 0x0373, WBP_ALetter},
{0x0374, 0x0374, WBP_ALetter},
{0x0376, 0x0377, WBP_ALetter},
{0x037A, 0x037A, WBP_ALetter},
{0x037B, 0x037D, WBP_ALetter},
{0x037E, 0x037E, WBP_MidNum},
{0x0386, 0x0386, WBP_ALetter},
{0x0387, 0x0387, WBP_MidLetter},
{0x0388, 0x038A, WBP_ALetter},
{0x038C, 0x038C, WBP_ALetter},
{0x038E, 0x03A1, WBP_ALetter},
{0x03A3, 0x03F5, WBP_ALetter},
{0x03F7, 0x0481, WBP_ALetter},
{0x0483, 0x0487, WBP_Extend},
{0x0488, 0x0489, WBP_Extend},
{0x048A, 0x0527, WBP_ALetter},
{0x0531, 0x0556, WBP_ALetter},
{0x0559, 0x0559, WBP_ALetter},
{0x0561, 0x0587, WBP_ALetter},
{0x0589, 0x0589, WBP_MidNum},
{0x0591, 0x05BD, WBP_Extend},
{0x05BF, 0x05BF, WBP_Extend},
{0x05C1, 0x05C2, WBP_Extend},
{0x05C4, 0x05C5, WBP_Extend},
{0x05C7, 0x05C7, WBP_Extend},
{0x05D0, 0x05EA, WBP_ALetter},
{0x05F0, 0x05F2, WBP_ALetter},
{0x05F3, 0x05F3, WBP_ALetter},
{0x05F4, 0x05F4, WBP_MidLetter},
{0x0600, 0x0604, WBP_Format},
{0x060C, 0x060D, WBP_MidNum},
{0x0610, 0x061A, WBP_Extend},
{0x0620, 0x063F, WBP_ALetter},
{0x0640, 0x0640, WBP_ALetter},
{0x0641, 0x064A, WBP_ALetter},
{0x064B, 0x065F, WBP_Extend},
{0x0660, 0x0669, WBP_Numeric},
{0x066B, 0x066B, WBP_Numeric},
{0x066C, 0x066C, WBP_MidNum},
{0x066E, 0x066F, WBP_ALetter},
{0x0670, 0x0670, WBP_Extend},
{0x0671, 0x06D3, WBP_ALetter},
{0x06D5, 0x06D5, WBP_ALetter},
{0x06D6, 0x06DC, WBP_Extend},
{0x06DD, 0x06DD, WBP_Format},
{0x06DF, 0x06E4, WBP_Extend},
{0x06E5, 0x06E6, WBP_ALetter},
{0x06E7, 0x06E8, WBP_Extend},
{0x06EA, 0x06ED, WBP_Extend},
{0x06EE, 0x06EF, WBP_ALetter},
{0x06F0, 0x06F9, WBP_Numeric},
{0x06FA, 0x06FC, WBP_ALetter},
{0x06FF, 0x06FF, WBP_ALetter},
{0x070F, 0x070F, WBP_Format},
{0x0710, 0x0710, WBP_ALetter},
{0x0711, 0x0711, WBP_Extend},
{0x0712, 0x072F, WBP_ALetter},
{0x0730, 0x074A, WBP_Extend},
{0x074D, 0x07A5, WBP_ALetter},
{0x07A6, 0x07B0, WBP_Extend},
{0x07B1, 0x07B1, WBP_ALetter},
{0x07C0, 0x07C9, WBP_Numeric},
{0x07CA, 0x07EA, WBP_ALetter},
{0x07EB, 0x07F3, WBP_Extend},
{0x07F4, 0x07F5, WBP_ALetter},
{0x07F8, 0x07F8, WBP_MidNum},
{0x07FA, 0x07FA, WBP_ALetter},
{0x0800, 0x0815, WBP_ALetter},
{0x0816, 0x0819, WBP_Extend},
{0x081A, 0x081A, WBP_ALetter},
{0x081B, 0x0823, WBP_Extend},
{0x0824, 0x0824, WBP_ALetter},
{0x0825, 0x0827, WBP_Extend},
{0x0828, 0x0828, WBP_ALetter},
{0x0829, 0x082D, WBP_Extend},
{0x0840, 0x0858, WBP_ALetter},
{0x0859, 0x085B, WBP_Extend},
{0x08A0, 0x08A0, WBP_ALetter},
{0x08A2, 0x08AC, WBP_ALetter},
{0x08E4, 0x08FE, WBP_Extend},
{0x0900, 0x0902, WBP_Extend},
{0x0903, 0x0903, WBP_Extend},
{0x0904, 0x0939, WBP_ALetter},
{0x093A, 0x093A, WBP_Extend},
{0x093B, 0x093B, WBP_Extend},
{0x093C, 0x093C, WBP_Extend},
{0x093D, 0x093D, WBP_ALetter},
{0x093E, 0x0940, WBP_Extend},
{0x0941, 0x0948, WBP_Extend},
{0x0949, 0x094C, WBP_Extend},
{0x094D, 0x094D, WBP_Extend},
{0x094E, 0x094F, WBP_Extend},
{0x0950, 0x0950, WBP_ALetter},
{0x0951, 0x0957, WBP_Extend},
{0x0958, 0x0961, WBP_ALetter},
{0x0962, 0x0963, WBP_Extend},
{0x0966, 0x096F, WBP_Numeric},
{0x0971, 0x0971, WBP_ALetter},
{0x0972, 0x0977, WBP_ALetter},
{0x0979, 0x097F, WBP_ALetter},
{0x0981, 0x0981, WBP_Extend},
{0x0982, 0x0983, WBP_Extend},
{0x0985, 0x098C, WBP_ALetter},
{0x098F, 0x0990, WBP_ALetter},
{0x0993, 0x09A8, WBP_ALetter},
{0x09AA, 0x09B0, WBP_ALetter},
{0x09B2, 0x09B2, WBP_ALetter},
{0x09B6, 0x09B9, WBP_ALetter},
{0x09BC, 0x09BC, WBP_Extend},
{0x09BD, 0x09BD, WBP_ALetter},
{0x09BE, 0x09C0, WBP_Extend},
{0x09C1, 0x09C4, WBP_Extend},
{0x09C7, 0x09C8, WBP_Extend},
{0x09CB, 0x09CC, WBP_Extend},
{0x09CD, 0x09CD, WBP_Extend},
{0x09CE, 0x09CE, WBP_ALetter},
{0x09D7, 0x09D7, WBP_Extend},
{0x09DC, 0x09DD, WBP_ALetter},
{0x09DF, 0x09E1, WBP_ALetter},
{0x09E2, 0x09E3, WBP_Extend},
{0x09E6, 0x09EF, WBP_Numeric},
{0x09F0, 0x09F1, WBP_ALetter},
{0x0A01, 0x0A02, WBP_Extend},
{0x0A03, 0x0A03, WBP_Extend},
{0x0A05, 0x0A0A, WBP_ALetter},
{0x0A0F, 0x0A10, WBP_ALetter},
{0x0A13, 0x0A28, WBP_ALetter},
{0x0A2A, 0x0A30, WBP_ALetter},
{0x0A32, 0x0A33, WBP_ALetter},
{0x0A35, 0x0A36, WBP_ALetter},
{0x0A38, 0x0A39, WBP_ALetter},
{0x0A3C, 0x0A3C, WBP_Extend},
{0x0A3E, 0x0A40, WBP_Extend},
{0x0A41, 0x0A42, WBP_Extend},
{0x0A47, 0x0A48, WBP_Extend},
{0x0A4B, 0x0A4D, WBP_Extend},
{0x0A51, 0x0A51, WBP_Extend},
{0x0A59, 0x0A5C, WBP_ALetter},
{0x0A5E, 0x0A5E, WBP_ALetter},
{0x0A66, 0x0A6F, WBP_Numeric},
{0x0A70, 0x0A71, WBP_Extend},
{0x0A72, 0x0A74, WBP_ALetter},
{0x0A75, 0x0A75, WBP_Extend},
{0x0A81, 0x0A82, WBP_Extend},
{0x0A83, 0x0A83, WBP_Extend},
{0x0A85, 0x0A8D, WBP_ALetter},
{0x0A8F, 0x0A91, WBP_ALetter},
{0x0A93, 0x0AA8, WBP_ALetter},
{0x0AAA, 0x0AB0, WBP_ALetter},
{0x0AB2, 0x0AB3, WBP_ALetter},
{0x0AB5, 0x0AB9, WBP_ALetter},
{0x0ABC, 0x0ABC, WBP_Extend},
{0x0ABD, 0x0ABD, WBP_ALetter},
{0x0ABE, 0x0AC0, WBP_Extend},
{0x0AC1, 0x0AC5, WBP_Extend},
{0x0AC7, 0x0AC8, WBP_Extend},
{0x0AC9, 0x0AC9, WBP_Extend},
{0x0ACB, 0x0ACC, WBP_Extend},
{0x0ACD, 0x0ACD, WBP_Extend},
{0x0AD0, 0x0AD0, WBP_ALetter},
{0x0AE0, 0x0AE1, WBP_ALetter},
{0x0AE2, 0x0AE3, WBP_Extend},
{0x0AE6, 0x0AEF, WBP_Numeric},
{0x0B01, 0x0B01, WBP_Extend},
{0x0B02, 0x0B03, WBP_Extend},
{0x0B05, 0x0B0C, WBP_ALetter},
{0x0B0F, 0x0B10, WBP_ALetter},
{0x0B13, 0x0B28, WBP_ALetter},
{0x0B2A, 0x0B30, WBP_ALetter},
{0x0B32, 0x0B33, WBP_ALetter},
{0x0B35, 0x0B39, WBP_ALetter},
{0x0B3C, 0x0B3C, WBP_Extend},
{0x0B3D, 0x0B3D, WBP_ALetter},
{0x0B3E, 0x0B3E, WBP_Extend},
{0x0B3F, 0x0B3F, WBP_Extend},
{0x0B40, 0x0B40, WBP_Extend},
{0x0B41, 0x0B44, WBP_Extend},
{0x0B47, 0x0B48, WBP_Extend},
{0x0B4B, 0x0B4C, WBP_Extend},
{0x0B4D, 0x0B4D, WBP_Extend},
{0x0B56, 0x0B56, WBP_Extend},
{0x0B57, 0x0B57, WBP_Extend},
{0x0B5C, 0x0B5D, WBP_ALetter},
{0x0B5F, 0x0B61, WBP_ALetter},
{0x0B62, 0x0B63, WBP_Extend},
{0x0B66, 0x0B6F, WBP_Numeric},
{0x0B71, 0x0B71, WBP_ALetter},
{0x0B82, 0x0B82, WBP_Extend},
{0x0B83, 0x0B83, WBP_ALetter},
{0x0B85, 0x0B8A, WBP_ALetter},
{0x0B8E, 0x0B90, WBP_ALetter},
{0x0B92, 0x0B95, WBP_ALetter},
{0x0B99, 0x0B9A, WBP_ALetter},
{0x0B9C, 0x0B9C, WBP_ALetter},
{0x0B9E, 0x0B9F, WBP_ALetter},
{0x0BA3, 0x0BA4, WBP_ALetter},
{0x0BA8, 0x0BAA, WBP_ALetter},
{0x0BAE, 0x0BB9, WBP_ALetter},
{0x0BBE, 0x0BBF, WBP_Extend},
{0x0BC0, 0x0BC0, WBP_Extend},
{0x0BC1, 0x0BC2, WBP_Extend},
{0x0BC6, 0x0BC8, WBP_Extend},
{0x0BCA, 0x0BCC, WBP_Extend},
{0x0BCD, 0x0BCD, WBP_Extend},
{0x0BD0, 0x0BD0, WBP_ALetter},
{0x0BD7, 0x0BD7, WBP_Extend},
{0x0BE6, 0x0BEF, WBP_Numeric},
{0x0C01, 0x0C03, WBP_Extend},
{0x0C05, 0x0C0C, WBP_ALetter},
{0x0C0E, 0x0C10, WBP_ALetter},
{0x0C12, 0x0C28, WBP_ALetter},
{0x0C2A, 0x0C33, WBP_ALetter},
{0x0C35, 0x0C39, WBP_ALetter},
{0x0C3D, 0x0C3D, WBP_ALetter},
{0x0C3E, 0x0C40, WBP_Extend},
{0x0C41, 0x0C44, WBP_Extend},
{0x0C46, 0x0C48, WBP_Extend},
{0x0C4A, 0x0C4D, WBP_Extend},
{0x0C55, 0x0C56, WBP_Extend},
{0x0C58, 0x0C59, WBP_ALetter},
{0x0C60, 0x0C61, WBP_ALetter},
{0x0C62, 0x0C63, WBP_Extend},
{0x0C66, 0x0C6F, WBP_Numeric},
{0x0C82, 0x0C83, WBP_Extend},
{0x0C85, 0x0C8C, WBP_ALetter},
{0x0C8E, 0x0C90, WBP_ALetter},
{0x0C92, 0x0CA8, WBP_ALetter},
{0x0CAA, 0x0CB3, WBP_ALetter},
{0x0CB5, 0x0CB9, WBP_ALetter},
{0x0CBC, 0x0CBC, WBP_Extend},
{0x0CBD, 0x0CBD, WBP_ALetter},
{0x0CBE, 0x0CBE, WBP_Extend},
{0x0CBF, 0x0CBF, WBP_Extend},
{0x0CC0, 0x0CC4, WBP_Extend},
{0x0CC6, 0x0CC6, WBP_Extend},
{0x0CC7, 0x0CC8, WBP_Extend},
{0x0CCA, 0x0CCB, WBP_Extend},
{0x0CCC, 0x0CCD, WBP_Extend},
{0x0CD5, 0x0CD6, WBP_Extend},
{0x0CDE, 0x0CDE, WBP_ALetter},
{0x0CE0, 0x0CE1, WBP_ALetter},
{0x0CE2, 0x0CE3, WBP_Extend},
{0x0CE6, 0x0CEF, WBP_Numeric},
{0x0CF1, 0x0CF2, WBP_ALetter},
{0x0D02, 0x0D03, WBP_Extend},
{0x0D05, 0x0D0C, WBP_ALetter},
{0x0D0E, 0x0D10, WBP_ALetter},
{0x0D12, 0x0D3A, WBP_ALetter},
{0x0D3D, 0x0D3D, WBP_ALetter},
{0x0D3E, 0x0D40, WBP_Extend},
{0x0D41, 0x0D44, WBP_Extend},
{0x0D46, 0x0D48, WBP_Extend},
{0x0D4A, 0x0D4C, WBP_Extend},
{0x0D4D, 0x0D4D, WBP_Extend},
{0x0D4E, 0x0D4E, WBP_ALetter},
{0x0D57, 0x0D57, WBP_Extend},
{0x0D60, 0x0D61, WBP_ALetter},
{0x0D62, 0x0D63, WBP_Extend},
{0x0D66, 0x0D6F, WBP_Numeric},
{0x0D7A, 0x0D7F, WBP_ALetter},
{0x0D82, 0x0D83, WBP_Extend},
{0x0D85, 0x0D96, WBP_ALetter},
{0x0D9A, 0x0DB1, WBP_ALetter},
{0x0DB3, 0x0DBB, WBP_ALetter},
{0x0DBD, 0x0DBD, WBP_ALetter},
{0x0DC0, 0x0DC6, WBP_ALetter},
{0x0DCA, 0x0DCA, WBP_Extend},
{0x0DCF, 0x0DD1, WBP_Extend},
{0x0DD2, 0x0DD4, WBP_Extend},
{0x0DD6, 0x0DD6, WBP_Extend},
{0x0DD8, 0x0DDF, WBP_Extend},
{0x0DF2, 0x0DF3, WBP_Extend},
{0x0E31, 0x0E31, WBP_Extend},
{0x0E34, 0x0E3A, WBP_Extend},
{0x0E47, 0x0E4E, WBP_Extend},
{0x0E50, 0x0E59, WBP_Numeric},
{0x0EB1, 0x0EB1, WBP_Extend},
{0x0EB4, 0x0EB9, WBP_Extend},
{0x0EBB, 0x0EBC, WBP_Extend},
{0x0EC8, 0x0ECD, WBP_Extend},
{0x0ED0, 0x0ED9, WBP_Numeric},
{0x0F00, 0x0F00, WBP_ALetter},
{0x0F18, 0x0F19, WBP_Extend},
{0x0F20, 0x0F29, WBP_Numeric},
{0x0F35, 0x0F35, WBP_Extend},
{0x0F37, 0x0F37, WBP_Extend},
{0x0F39, 0x0F39, WBP_Extend},
{0x0F3E, 0x0F3F, WBP_Extend},
{0x0F40, 0x0F47, WBP_ALetter},
{0x0F49, 0x0F6C, WBP_ALetter},
{0x0F71, 0x0F7E, WBP_Extend},
{0x0F7F, 0x0F7F, WBP_Extend},
{0x0F80, 0x0F84, WBP_Extend},
{0x0F86, 0x0F87, WBP_Extend},
{0x0F88, 0x0F8C, WBP_ALetter},
{0x0F8D, 0x0F97, WBP_Extend},
{0x0F99, 0x0FBC, WBP_Extend},
{0x0FC6, 0x0FC6, WBP_Extend},
{0x102B, 0x102C, WBP_Extend},
{0x102D, 0x1030, WBP_Extend},
{0x1031, 0x1031, WBP_Extend},
{0x1032, 0x1037, WBP_Extend},
{0x1038, 0x1038, WBP_Extend},
{0x1039, 0x103A, WBP_Extend},
{0x103B, 0x103C, WBP_Extend},
{0x103D, 0x103E, WBP_Extend},
{0x1040, 0x1049, WBP_Numeric},
{0x1056, 0x1057, WBP_Extend},
{0x1058, 0x1059, WBP_Extend},
{0x105E, 0x1060, WBP_Extend},
{0x1062, 0x1064, WBP_Extend},
{0x1067, 0x106D, WBP_Extend},
{0x1071, 0x1074, WBP_Extend},
{0x1082, 0x1082, WBP_Extend},
{0x1083, 0x1084, WBP_Extend},
{0x1085, 0x1086, WBP_Extend},
{0x1087, 0x108C, WBP_Extend},
{0x108D, 0x108D, WBP_Extend},
{0x108F, 0x108F, WBP_Extend},
{0x1090, 0x1099, WBP_Numeric},
{0x109A, 0x109C, WBP_Extend},
{0x109D, 0x109D, WBP_Extend},
{0x10A0, 0x10C5, WBP_ALetter},
{0x10C7, 0x10C7, WBP_ALetter},
{0x10CD, 0x10CD, WBP_ALetter},
{0x10D0, 0x10FA, WBP_ALetter},
{0x10FC, 0x10FC, WBP_ALetter},
{0x10FD, 0x1248, WBP_ALetter},
{0x124A, 0x124D, WBP_ALetter},
{0x1250, 0x1256, WBP_ALetter},
{0x1258, 0x1258, WBP_ALetter},
{0x125A, 0x125D, WBP_ALetter},
{0x1260, 0x1288, WBP_ALetter},
{0x128A, 0x128D, WBP_ALetter},
{0x1290, 0x12B0, WBP_ALetter},
{0x12B2, 0x12B5, WBP_ALetter},
{0x12B8, 0x12BE, WBP_ALetter},
{0x12C0, 0x12C0, WBP_ALetter},
{0x12C2, 0x12C5, WBP_ALetter},
{0x12C8, 0x12D6, WBP_ALetter},
{0x12D8, 0x1310, WBP_ALetter},
{0x1312, 0x1315, WBP_ALetter},
{0x1318, 0x135A, WBP_ALetter},
{0x135D, 0x135F, WBP_Extend},
{0x1380, 0x138F, WBP_ALetter},
{0x13A0, 0x13F4, WBP_ALetter},
{0x1401, 0x166C, WBP_ALetter},
{0x166F, 0x167F, WBP_ALetter},
{0x1681, 0x169A, WBP_ALetter},
{0x16A0, 0x16EA, WBP_ALetter},
{0x16EE, 0x16F0, WBP_ALetter},
{0x1700, 0x170C, WBP_ALetter},
{0x170E, 0x1711, WBP_ALetter},
{0x1712, 0x1714, WBP_Extend},
{0x1720, 0x1731, WBP_ALetter},
{0x1732, 0x1734, WBP_Extend},
{0x1740, 0x1751, WBP_ALetter},
{0x1752, 0x1753, WBP_Extend},
{0x1760, 0x176C, WBP_ALetter},
{0x176E, 0x1770, WBP_ALetter},
{0x1772, 0x1773, WBP_Extend},
{0x17B4, 0x17B5, WBP_Extend},
{0x17B6, 0x17B6, WBP_Extend},
{0x17B7, 0x17BD, WBP_Extend},
{0x17BE, 0x17C5, WBP_Extend},
{0x17C6, 0x17C6, WBP_Extend},
{0x17C7, 0x17C8, WBP_Extend},
{0x17C9, 0x17D3, WBP_Extend},
{0x17DD, 0x17DD, WBP_Extend},
{0x17E0, 0x17E9, WBP_Numeric},
{0x180B, 0x180D, WBP_Extend},
{0x1810, 0x1819, WBP_Numeric},
{0x1820, 0x1842, WBP_ALetter},
{0x1843, 0x1843, WBP_ALetter},
{0x1844, 0x1877, WBP_ALetter},
{0x1880, 0x18A8, WBP_ALetter},
{0x18A9, 0x18A9, WBP_Extend},
{0x18AA, 0x18AA, WBP_ALetter},
{0x18B0, 0x18F5, WBP_ALetter},
{0x1900, 0x191C, WBP_ALetter},
{0x1920, 0x1922, WBP_Extend},
{0x1923, 0x1926, WBP_Extend},
{0x1927, 0x1928, WBP_Extend},
{0x1929, 0x192B, WBP_Extend},
{0x1930, 0x1931, WBP_Extend},
{0x1932, 0x1932, WBP_Extend},
{0x1933, 0x1938, WBP_Extend},
{0x1939, 0x193B, WBP_Extend},
{0x1946, 0x194F, WBP_Numeric},
{0x19B0, 0x19C0, WBP_Extend},
{0x19C8, 0x19C9, WBP_Extend},
{0x19D0, 0x19D9, WBP_Numeric},
{0x1A00, 0x1A16, WBP_ALetter},
{0x1A17, 0x1A18, WBP_Extend},
{0x1A19, 0x1A1B, WBP_Extend},
{0x1A55, 0x1A55, WBP_Extend},
{0x1A56, 0x1A56, WBP_Extend},
{0x1A57, 0x1A57, WBP_Extend},
{0x1A58, 0x1A5E, WBP_Extend},
{0x1A60, 0x1A60, WBP_Extend},
{0x1A61, 0x1A61, WBP_Extend},
{0x1A62, 0x1A62, WBP_Extend},
{0x1A63, 0x1A64, WBP_Extend},
{0x1A65, 0x1A6C, WBP_Extend},
{0x1A6D, 0x1A72, WBP_Extend},
{0x1A73, 0x1A7C, WBP_Extend},
{0x1A7F, 0x1A7F, WBP_Extend},
{0x1A80, 0x1A89, WBP_Numeric},
{0x1A90, 0x1A99, WBP_Numeric},
{0x1B00, 0x1B03, WBP_Extend},
{0x1B04, 0x1B04, WBP_Extend},
{0x1B05, 0x1B33, WBP_ALetter},
{0x1B34, 0x1B34, WBP_Extend},
{0x1B35, 0x1B35, WBP_Extend},
{0x1B36, 0x1B3A, WBP_Extend},
{0x1B3B, 0x1B3B, WBP_Extend},
{0x1B3C, 0x1B3C, WBP_Extend},
{0x1B3D, 0x1B41, WBP_Extend},
{0x1B42, 0x1B42, WBP_Extend},
{0x1B43, 0x1B44, WBP_Extend},
{0x1B45, 0x1B4B, WBP_ALetter},
{0x1B50, 0x1B59, WBP_Numeric},
{0x1B6B, 0x1B73, WBP_Extend},
{0x1B80, 0x1B81, WBP_Extend},
{0x1B82, 0x1B82, WBP_Extend},
{0x1B83, 0x1BA0, WBP_ALetter},
{0x1BA1, 0x1BA1, WBP_Extend},
{0x1BA2, 0x1BA5, WBP_Extend},
{0x1BA6, 0x1BA7, WBP_Extend},
{0x1BA8, 0x1BA9, WBP_Extend},
{0x1BAA, 0x1BAA, WBP_Extend},
{0x1BAB, 0x1BAB, WBP_Extend},
{0x1BAC, 0x1BAD, WBP_Extend},
{0x1BAE, 0x1BAF, WBP_ALetter},
{0x1BB0, 0x1BB9, WBP_Numeric},
{0x1BBA, 0x1BE5, WBP_ALetter},
{0x1BE6, 0x1BE6, WBP_Extend},
{0x1BE7, 0x1BE7, WBP_Extend},
{0x1BE8, 0x1BE9, WBP_Extend},
{0x1BEA, 0x1BEC, WBP_Extend},
{0x1BED, 0x1BED, WBP_Extend},
{0x1BEE, 0x1BEE, WBP_Extend},
{0x1BEF, 0x1BF1, WBP_Extend},
{0x1BF2, 0x1BF3, WBP_Extend},
{0x1C00, 0x1C23, WBP_ALetter},
{0x1C24, 0x1C2B, WBP_Extend},
{0x1C2C, 0x1C33, WBP_Extend},
{0x1C34, 0x1C35, WBP_Extend},
{0x1C36, 0x1C37, WBP_Extend},
{0x1C40, 0x1C49, WBP_Numeric},
{0x1C4D, 0x1C4F, WBP_ALetter},
{0x1C50, 0x1C59, WBP_Numeric},
{0x1C5A, 0x1C77, WBP_ALetter},
{0x1C78, 0x1C7D, WBP_ALetter},
{0x1CD0, 0x1CD2, WBP_Extend},
{0x1CD4, 0x1CE0, WBP_Extend},
{0x1CE1, 0x1CE1, WBP_Extend},
{0x1CE2, 0x1CE8, WBP_Extend},
{0x1CE9, 0x1CEC, WBP_ALetter},
{0x1CED, 0x1CED, WBP_Extend},
{0x1CEE, 0x1CF1, WBP_ALetter},
{0x1CF2, 0x1CF3, WBP_Extend},
{0x1CF4, 0x1CF4, WBP_Extend},
{0x1CF5, 0x1CF6, WBP_ALetter},
{0x1D00, 0x1D2B, WBP_ALetter},
{0x1D2C, 0x1D6A, WBP_ALetter},
{0x1D6B, 0x1D77, WBP_ALetter},
{0x1D78, 0x1D78, WBP_ALetter},
{0x1D79, 0x1D9A, WBP_ALetter},
{0x1D9B, 0x1DBF, WBP_ALetter},
{0x1DC0, 0x1DE6, WBP_Extend},
{0x1DFC, 0x1DFF, WBP_Extend},
{0x1E00, 0x1F15, WBP_ALetter},
{0x1F18, 0x1F1D, WBP_ALetter},
{0x1F20, 0x1F45, WBP_ALetter},
{0x1F48, 0x1F4D, WBP_ALetter},
{0x1F50, 0x1F57, WBP_ALetter},
{0x1F59, 0x1F59, WBP_ALetter},
{0x1F5B, 0x1F5B, WBP_ALetter},
{0x1F5D, 0x1F5D, WBP_ALetter},
{0x1F5F, 0x1F7D, WBP_ALetter},
{0x1F80, 0x1FB4, WBP_ALetter},
{0x1FB6, 0x1FBC, WBP_ALetter},
{0x1FBE, 0x1FBE, WBP_ALetter},
{0x1FC2, 0x1FC4, WBP_ALetter},
{0x1FC6, 0x1FCC, WBP_ALetter},
{0x1FD0, 0x1FD3, WBP_ALetter},
{0x1FD6, 0x1FDB, WBP_ALetter},
{0x1FE0, 0x1FEC, WBP_ALetter},
{0x1FF2, 0x1FF4, WBP_ALetter},
{0x1FF6, 0x1FFC, WBP_ALetter},
{0x200C, 0x200D, WBP_Extend},
{0x200E, 0x200F, WBP_Format},
{0x2018, 0x2018, WBP_MidNumLet},
{0x2019, 0x2019, WBP_MidNumLet},
{0x2024, 0x2024, WBP_MidNumLet},
{0x2027, 0x2027, WBP_MidLetter},
{0x2028, 0x2028, WBP_Newline},
{0x2029, 0x2029, WBP_Newline},
{0x202A, 0x202E, WBP_Format},
{0x203F, 0x2040, WBP_ExtendNumLet},
{0x2044, 0x2044, WBP_MidNum},
{0x2054, 0x2054, WBP_ExtendNumLet},
{0x2060, 0x2064, WBP_Format},
{0x206A, 0x206F, WBP_Format},
{0x2071, 0x2071, WBP_ALetter},
{0x207F, 0x207F, WBP_ALetter},
{0x2090, 0x209C, WBP_ALetter},
{0x20D0, 0x20DC, WBP_Extend},
{0x20DD, 0x20E0, WBP_Extend},
{0x20E1, 0x20E1, WBP_Extend},
{0x20E2, 0x20E4, WBP_Extend},
{0x20E5, 0x20F0, WBP_Extend},
{0x2102, 0x2102, WBP_ALetter},
{0x2107, 0x2107, WBP_ALetter},
{0x210A, 0x2113, WBP_ALetter},
{0x2115, 0x2115, WBP_ALetter},
{0x2119, 0x211D, WBP_ALetter},
{0x2124, 0x2124, WBP_ALetter},
{0x2126, 0x2126, WBP_ALetter},
{0x2128, 0x2128, WBP_ALetter},
{0x212A, 0x212D, WBP_ALetter},
{0x212F, 0x2134, WBP_ALetter},
{0x2135, 0x2138, WBP_ALetter},
{0x2139, 0x2139, WBP_ALetter},
{0x213C, 0x213F, WBP_ALetter},
{0x2145, 0x2149, WBP_ALetter},
{0x214E, 0x214E, WBP_ALetter},
{0x2160, 0x2182, WBP_ALetter},
{0x2183, 0x2184, WBP_ALetter},
{0x2185, 0x2188, WBP_ALetter},
{0x24B6, 0x24E9, WBP_ALetter},
{0x2C00, 0x2C2E, WBP_ALetter},
{0x2C30, 0x2C5E, WBP_ALetter},
{0x2C60, 0x2C7B, WBP_ALetter},
{0x2C7C, 0x2C7D, WBP_ALetter},
{0x2C7E, 0x2CE4, WBP_ALetter},
{0x2CEB, 0x2CEE, WBP_ALetter},
{0x2CEF, 0x2CF1, WBP_Extend},
{0x2CF2, 0x2CF3, WBP_ALetter},
{0x2D00, 0x2D25, WBP_ALetter},
{0x2D27, 0x2D27, WBP_ALetter},
{0x2D2D, 0x2D2D, WBP_ALetter},
{0x2D30, 0x2D67, WBP_ALetter},
{0x2D6F, 0x2D6F, WBP_ALetter},
{0x2D7F, 0x2D7F, WBP_Extend},
{0x2D80, 0x2D96, WBP_ALetter},
{0x2DA0, 0x2DA6, WBP_ALetter},
{0x2DA8, 0x2DAE, WBP_ALetter},
{0x2DB0, 0x2DB6, WBP_ALetter},
{0x2DB8, 0x2DBE, WBP_ALetter},
{0x2DC0, 0x2DC6, WBP_ALetter},
{0x2DC8, 0x2DCE, WBP_ALetter},
{0x2DD0, 0x2DD6, WBP_ALetter},
{0x2DD8, 0x2DDE, WBP_ALetter},
{0x2DE0, 0x2DFF, WBP_Extend},
{0x2E2F, 0x2E2F, WBP_ALetter},
{0x3005, 0x3005, WBP_ALetter},
{0x302A, 0x302D, WBP_Extend},
{0x302E, 0x302F, WBP_Extend},
{0x3031, 0x3035, WBP_Katakana},
{0x303B, 0x303B, WBP_ALetter},
{0x303C, 0x303C, WBP_ALetter},
{0x3099, 0x309A, WBP_Extend},
{0x309B, 0x309C, WBP_Katakana},
{0x30A0, 0x30A0, WBP_Katakana},
{0x30A1, 0x30FA, WBP_Katakana},
{0x30FC, 0x30FE, WBP_Katakana},
{0x30FF, 0x30FF, WBP_Katakana},
{0x3105, 0x312D, WBP_ALetter},
{0x3131, 0x318E, WBP_ALetter},
{0x31A0, 0x31BA, WBP_ALetter},
{0x31F0, 0x31FF, WBP_Katakana},
{0x32D0, 0x32FE, WBP_Katakana},
{0x3300, 0x3357, WBP_Katakana},
{0xA000, 0xA014, WBP_ALetter},
{0xA015, 0xA015, WBP_ALetter},
{0xA016, 0xA48C, WBP_ALetter},
{0xA4D0, 0xA4F7, WBP_ALetter},
{0xA4F8, 0xA4FD, WBP_ALetter},
{0xA500, 0xA60B, WBP_ALetter},
{0xA60C, 0xA60C, WBP_ALetter},
{0xA610, 0xA61F, WBP_ALetter},
{0xA620, 0xA629, WBP_Numeric},
{0xA62A, 0xA62B, WBP_ALetter},
{0xA640, 0xA66D, WBP_ALetter},
{0xA66E, 0xA66E, WBP_ALetter},
{0xA66F, 0xA66F, WBP_Extend},
{0xA670, 0xA672, WBP_Extend},
{0xA674, 0xA67D, WBP_Extend},
{0xA67F, 0xA67F, WBP_ALetter},
{0xA680, 0xA697, WBP_ALetter},
{0xA69F, 0xA69F, WBP_Extend},
{0xA6A0, 0xA6E5, WBP_ALetter},
{0xA6E6, 0xA6EF, WBP_ALetter},
{0xA6F0, 0xA6F1, WBP_Extend},
{0xA717, 0xA71F, WBP_ALetter},
{0xA722, 0xA76F, WBP_ALetter},
{0xA770, 0xA770, WBP_ALetter},
{0xA771, 0xA787, WBP_ALetter},
{0xA788, 0xA788, WBP_ALetter},
{0xA78B, 0xA78E, WBP_ALetter},
{0xA790, 0xA793, WBP_ALetter},
{0xA7A0, 0xA7AA, WBP_ALetter},
{0xA7F8, 0xA7F9, WBP_ALetter},
{0xA7FA, 0xA7FA, WBP_ALetter},
{0xA7FB, 0xA801, WBP_ALetter},
{0xA802, 0xA802, WBP_Extend},
{0xA803, 0xA805, WBP_ALetter},
{0xA806, 0xA806, WBP_Extend},
{0xA807, 0xA80A, WBP_ALetter},
{0xA80B, 0xA80B, WBP_Extend},
{0xA80C, 0xA822, WBP_ALetter},
{0xA823, 0xA824, WBP_Extend},
{0xA825, 0xA826, WBP_Extend},
{0xA827, 0xA827, WBP_Extend},
{0xA840, 0xA873, WBP_ALetter},
{0xA880, 0xA881, WBP_Extend},
{0xA882, 0xA8B3, WBP_ALetter},
{0xA8B4, 0xA8C3, WBP_Extend},
{0xA8C4, 0xA8C4, WBP_Extend},
{0xA8D0, 0xA8D9, WBP_Numeric},
{0xA8E0, 0xA8F1, WBP_Extend},
{0xA8F2, 0xA8F7, WBP_ALetter},
{0xA8FB, 0xA8FB, WBP_ALetter},
{0xA900, 0xA909, WBP_Numeric},
{0xA90A, 0xA925, WBP_ALetter},
{0xA926, 0xA92D, WBP_Extend},
{0xA930, 0xA946, WBP_ALetter},
{0xA947, 0xA951, WBP_Extend},
{0xA952, 0xA953, WBP_Extend},
{0xA960, 0xA97C, WBP_ALetter},
{0xA980, 0xA982, WBP_Extend},
{0xA983, 0xA983, WBP_Extend},
{0xA984, 0xA9B2, WBP_ALetter},
{0xA9B3, 0xA9B3, WBP_Extend},
{0xA9B4, 0xA9B5, WBP_Extend},
{0xA9B6, 0xA9B9, WBP_Extend},
{0xA9BA, 0xA9BB, WBP_Extend},
{0xA9BC, 0xA9BC, WBP_Extend},
{0xA9BD, 0xA9C0, WBP_Extend},
{0xA9CF, 0xA9CF, WBP_ALetter},
{0xA9D0, 0xA9D9, WBP_Numeric},
{0xAA00, 0xAA28, WBP_ALetter},
{0xAA29, 0xAA2E, WBP_Extend},
{0xAA2F, 0xAA30, WBP_Extend},
{0xAA31, 0xAA32, WBP_Extend},
{0xAA33, 0xAA34, WBP_Extend},
{0xAA35, 0xAA36, WBP_Extend},
{0xAA40, 0xAA42, WBP_ALetter},
{0xAA43, 0xAA43, WBP_Extend},
{0xAA44, 0xAA4B, WBP_ALetter},
{0xAA4C, 0xAA4C, WBP_Extend},
{0xAA4D, 0xAA4D, WBP_Extend},
{0xAA50, 0xAA59, WBP_Numeric},
{0xAA7B, 0xAA7B, WBP_Extend},
{0xAAB0, 0xAAB0, WBP_Extend},
{0xAAB2, 0xAAB4, WBP_Extend},
{0xAAB7, 0xAAB8, WBP_Extend},
{0xAABE, 0xAABF, WBP_Extend},
{0xAAC1, 0xAAC1, WBP_Extend},
{0xAAE0, 0xAAEA, WBP_ALetter},
{0xAAEB, 0xAAEB, WBP_Extend},
{0xAAEC, 0xAAED, WBP_Extend},
{0xAAEE, 0xAAEF, WBP_Extend},
{0xAAF2, 0xAAF2, WBP_ALetter},
{0xAAF3, 0xAAF4, WBP_ALetter},
{0xAAF5, 0xAAF5, WBP_Extend},
{0xAAF6, 0xAAF6, WBP_Extend},
{0xAB01, 0xAB06, WBP_ALetter},
{0xAB09, 0xAB0E, WBP_ALetter},
{0xAB11, 0xAB16, WBP_ALetter},
{0xAB20, 0xAB26, WBP_ALetter},
{0xAB28, 0xAB2E, WBP_ALetter},
{0xABC0, 0xABE2, WBP_ALetter},
{0xABE3, 0xABE4, WBP_Extend},
{0xABE5, 0xABE5, WBP_Extend},
{0xABE6, 0xABE7, WBP_Extend},
{0xABE8, 0xABE8, WBP_Extend},
{0xABE9, 0xABEA, WBP_Extend},
{0xABEC, 0xABEC, WBP_Extend},
{0xABED, 0xABED, WBP_Extend},
{0xABF0, 0xABF9, WBP_Numeric},
{0xAC00, 0xD7A3, WBP_ALetter},
{0xD7B0, 0xD7C6, WBP_ALetter},
{0xD7CB, 0xD7FB, WBP_ALetter},
{0xFB00, 0xFB06, WBP_ALetter},
{0xFB13, 0xFB17, WBP_ALetter},
{0xFB1D, 0xFB1D, WBP_ALetter},
{0xFB1E, 0xFB1E, WBP_Extend},
{0xFB1F, 0xFB28, WBP_ALetter},
{0xFB2A, 0xFB36, WBP_ALetter},
{0xFB38, 0xFB3C, WBP_ALetter},
{0xFB3E, 0xFB3E, WBP_ALetter},
{0xFB40, 0xFB41, WBP_ALetter},
{0xFB43, 0xFB44, WBP_ALetter},
{0xFB46, 0xFBB1, WBP_ALetter},
{0xFBD3, 0xFD3D, WBP_ALetter},
{0xFD50, 0xFD8F, WBP_ALetter},
{0xFD92, 0xFDC7, WBP_ALetter},
{0xFDF0, 0xFDFB, WBP_ALetter},
{0xFE00, 0xFE0F, WBP_Extend},
{0xFE10, 0xFE10, WBP_MidNum},
{0xFE13, 0xFE13, WBP_MidLetter},
{0xFE14, 0xFE14, WBP_MidNum},
{0xFE20, 0xFE26, WBP_Extend},
{0xFE33, 0xFE34, WBP_ExtendNumLet},
{0xFE4D, 0xFE4F, WBP_ExtendNumLet},
{0xFE50, 0xFE50, WBP_MidNum},
{0xFE52, 0xFE52, WBP_MidNumLet},
{0xFE54, 0xFE54, WBP_MidNum},
{0xFE55, 0xFE55, WBP_MidLetter},
{0xFE70, 0xFE74, WBP_ALetter},
{0xFE76, 0xFEFC, WBP_ALetter},
{0xFEFF, 0xFEFF, WBP_Format},
{0xFF07, 0xFF07, WBP_MidNumLet},
{0xFF0C, 0xFF0C, WBP_MidNum},
{0xFF0E, 0xFF0E, WBP_MidNumLet},
{0xFF1A, 0xFF1A, WBP_MidLetter},
{0xFF1B, 0xFF1B, WBP_MidNum},
{0xFF21, 0xFF3A, WBP_ALetter},
{0xFF3F, 0xFF3F, WBP_ExtendNumLet},
{0xFF41, 0xFF5A, WBP_ALetter},
{0xFF66, 0xFF6F, WBP_Katakana},
{0xFF70, 0xFF70, WBP_Katakana},
{0xFF71, 0xFF9D, WBP_Katakana},
{0xFF9E, 0xFF9F, WBP_Extend},
{0xFFA0, 0xFFBE, WBP_ALetter},
{0xFFC2, 0xFFC7, WBP_ALetter},
{0xFFCA, 0xFFCF, WBP_ALetter},
{0xFFD2, 0xFFD7, WBP_ALetter},
{0xFFDA, 0xFFDC, WBP_ALetter},
{0xFFF9, 0xFFFB, WBP_Format},
{0x10000, 0x1000B, WBP_ALetter},
{0x1000D, 0x10026, WBP_ALetter},
{0x10028, 0x1003A, WBP_ALetter},
{0x1003C, 0x1003D, WBP_ALetter},
{0x1003F, 0x1004D, WBP_ALetter},
{0x10050, 0x1005D, WBP_ALetter},
{0x10080, 0x100FA, WBP_ALetter},
{0x10140, 0x10174, WBP_ALetter},
{0x101FD, 0x101FD, WBP_Extend},
{0x10280, 0x1029C, WBP_ALetter},
{0x102A0, 0x102D0, WBP_ALetter},
{0x10300, 0x1031E, WBP_ALetter},
{0x10330, 0x10340, WBP_ALetter},
{0x10341, 0x10341, WBP_ALetter},
{0x10342, 0x10349, WBP_ALetter},
{0x1034A, 0x1034A, WBP_ALetter},
{0x10380, 0x1039D, WBP_ALetter},
{0x103A0, 0x103C3, WBP_ALetter},
{0x103C8, 0x103CF, WBP_ALetter},
{0x103D1, 0x103D5, WBP_ALetter},
{0x10400, 0x1044F, WBP_ALetter},
{0x10450, 0x1049D, WBP_ALetter},
{0x104A0, 0x104A9, WBP_Numeric},
{0x10800, 0x10805, WBP_ALetter},
{0x10808, 0x10808, WBP_ALetter},
{0x1080A, 0x10835, WBP_ALetter},
{0x10837, 0x10838, WBP_ALetter},
{0x1083C, 0x1083C, WBP_ALetter},
{0x1083F, 0x10855, WBP_ALetter},
{0x10900, 0x10915, WBP_ALetter},
{0x10920, 0x10939, WBP_ALetter},
{0x10980, 0x109B7, WBP_ALetter},
{0x109BE, 0x109BF, WBP_ALetter},
{0x10A00, 0x10A00, WBP_ALetter},
{0x10A01, 0x10A03, WBP_Extend},
{0x10A05, 0x10A06, WBP_Extend},
{0x10A0C, 0x10A0F, WBP_Extend},
{0x10A10, 0x10A13, WBP_ALetter},
{0x10A15, 0x10A17, WBP_ALetter},
{0x10A19, 0x10A33, WBP_ALetter},
{0x10A38, 0x10A3A, WBP_Extend},
{0x10A3F, 0x10A3F, WBP_Extend},
{0x10A60, 0x10A7C, WBP_ALetter},
{0x10B00, 0x10B35, WBP_ALetter},
{0x10B40, 0x10B55, WBP_ALetter},
{0x10B60, 0x10B72, WBP_ALetter},
{0x10C00, 0x10C48, WBP_ALetter},
{0x11000, 0x11000, WBP_Extend},
{0x11001, 0x11001, WBP_Extend},
{0x11002, 0x11002, WBP_Extend},
{0x11003, 0x11037, WBP_ALetter},
{0x11038, 0x11046, WBP_Extend},
{0x11066, 0x1106F, WBP_Numeric},
{0x11080, 0x11081, WBP_Extend},
{0x11082, 0x11082, WBP_Extend},
{0x11083, 0x110AF, WBP_ALetter},
{0x110B0, 0x110B2, WBP_Extend},
{0x110B3, 0x110B6, WBP_Extend},
{0x110B7, 0x110B8, WBP_Extend},
{0x110B9, 0x110BA, WBP_Extend},
{0x110BD, 0x110BD, WBP_Format},
{0x110D0, 0x110E8, WBP_ALetter},
{0x110F0, 0x110F9, WBP_Numeric},
{0x11100, 0x11102, WBP_Extend},
{0x11103, 0x11126, WBP_ALetter},
{0x11127, 0x1112B, WBP_Extend},
{0x1112C, 0x1112C, WBP_Extend},
{0x1112D, 0x11134, WBP_Extend},
{0x11136, 0x1113F, WBP_Numeric},
{0x11180, 0x11181, WBP_Extend},
{0x11182, 0x11182, WBP_Extend},
{0x11183, 0x111B2, WBP_ALetter},
{0x111B3, 0x111B5, WBP_Extend},
{0x111B6, 0x111BE, WBP_Extend},
{0x111BF, 0x111C0, WBP_Extend},
{0x111C1, 0x111C4, WBP_ALetter},
{0x111D0, 0x111D9, WBP_Numeric},
{0x11680, 0x116AA, WBP_ALetter},
{0x116AB, 0x116AB, WBP_Extend},
{0x116AC, 0x116AC, WBP_Extend},
{0x116AD, 0x116AD, WBP_Extend},
{0x116AE, 0x116AF, WBP_Extend},
{0x116B0, 0x116B5, WBP_Extend},
{0x116B6, 0x116B6, WBP_Extend},
{0x116B7, 0x116B7, WBP_Extend},
{0x116C0, 0x116C9, WBP_Numeric},
{0x12000, 0x1236E, WBP_ALetter},
{0x12400, 0x12462, WBP_ALetter},
{0x13000, 0x1342E, WBP_ALetter},
{0x16800, 0x16A38, WBP_ALetter},
{0x16F00, 0x16F44, WBP_ALetter},
{0x16F50, 0x16F50, WBP_ALetter},
{0x16F51, 0x16F7E, WBP_Extend},
{0x16F8F, 0x16F92, WBP_Extend},
{0x16F93, 0x16F9F, WBP_ALetter},
{0x1B000, 0x1B000, WBP_Katakana},
{0x1D165, 0x1D166, WBP_Extend},
{0x1D167, 0x1D169, WBP_Extend},
{0x1D16D, 0x1D172, WBP_Extend},
{0x1D173, 0x1D17A, WBP_Format},
{0x1D17B, 0x1D182, WBP_Extend},
{0x1D185, 0x1D18B, WBP_Extend},
{0x1D1AA, 0x1D1AD, WBP_Extend},
{0x1D242, 0x1D244, WBP_Extend},
{0x1D400, 0x1D454, WBP_ALetter},
{0x1D456, 0x1D49C, WBP_ALetter},
{0x1D49E, 0x1D49F, WBP_ALetter},
{0x1D4A2, 0x1D4A2, WBP_ALetter},
{0x1D4A5, 0x1D4A6, WBP_ALetter},
{0x1D4A9, 0x1D4AC, WBP_ALetter},
{0x1D4AE, 0x1D4B9, WBP_ALetter},
{0x1D4BB, 0x1D4BB, WBP_ALetter},
{0x1D4BD, 0x1D4C3, WBP_ALetter},
{0x1D4C5, 0x1D505, WBP_ALetter},
{0x1D507, 0x1D50A, WBP_ALetter},
{0x1D50D, 0x1D514, WBP_ALetter},
{0x1D516, 0x1D51C, WBP_ALetter},
{0x1D51E, 0x1D539, WBP_ALetter},
{0x1D53B, 0x1D53E, WBP_ALetter},
{0x1D540, 0x1D544, WBP_ALetter},
{0x1D546, 0x1D546, WBP_ALetter},
{0x1D54A, 0x1D550, WBP_ALetter},
{0x1D552, 0x1D6A5, WBP_ALetter},
{0x1D6A8, 0x1D6C0, WBP_ALetter},
{0x1D6C2, 0x1D6DA, WBP_ALetter},
{0x1D6DC, 0x1D6FA, WBP_ALetter},
{0x1D6FC, 0x1D714, WBP_ALetter},
{0x1D716, 0x1D734, WBP_ALetter},
{0x1D736, 0x1D74E, WBP_ALetter},
{0x1D750, 0x1D76E, WBP_ALetter},
{0x1D770, 0x1D788, WBP_ALetter},
{0x1D78A, 0x1D7A8, WBP_ALetter},
{0x1D7AA, 0x1D7C2, WBP_ALetter},
{0x1D7C4, 0x1D7CB, WBP_ALetter},
{0x1D7CE, 0x1D7FF, WBP_Numeric},
{0x1EE00, 0x1EE03, WBP_ALetter},
{0x1EE05, 0x1EE1F, WBP_ALetter},
{0x1EE21, 0x1EE22, WBP_ALetter},
{0x1EE24, 0x1EE24, WBP_ALetter},
{0x1EE27, 0x1EE27, WBP_ALetter},
{0x1EE29, 0x1EE32, WBP_ALetter},
{0x1EE34, 0x1EE37, WBP_ALetter},
{0x1EE39, 0x1EE39, WBP_ALetter},
{0x1EE3B, 0x1EE3B, WBP_ALetter},
{0x1EE42, 0x1EE42, WBP_ALetter},
{0x1EE47, 0x1EE47, WBP_ALetter},
{0x1EE49, 0x1EE49, WBP_ALetter},
{0x1EE4B, 0x1EE4B, WBP_ALetter},
{0x1EE4D, 0x1EE4F, WBP_ALetter},
{0x1EE51, 0x1EE52, WBP_ALetter},
{0x1EE54, 0x1EE54, WBP_ALetter},
{0x1EE57, 0x1EE57, WBP_ALetter},
{0x1EE59, 0x1EE59, WBP_ALetter},
{0x1EE5B, 0x1EE5B, WBP_ALetter},
{0x1EE5D, 0x1EE5D, WBP_ALetter},
{0x1EE5F, 0x1EE5F, WBP_ALetter},
{0x1EE61, 0x1EE62, WBP_ALetter},
{0x1EE64, 0x1EE64, WBP_ALetter},
{0x1EE67, 0x1EE6A, WBP_ALetter},
{0x1EE6C, 0x1EE72, WBP_ALetter},
{0x1EE74, 0x1EE77, WBP_ALetter},
{0x1EE79, 0x1EE7C, WBP_ALetter},
{0x1EE7E, 0x1EE7E, WBP_ALetter},
{0x1EE80, 0x1EE89, WBP_ALetter},
{0x1EE8B, 0x1EE9B, WBP_ALetter},
{0x1EEA1, 0x1EEA3, WBP_ALetter},
{0x1EEA5, 0x1EEA9, WBP_ALetter},
{0x1EEAB, 0x1EEBB, WBP_ALetter},
{0x1F1E6, 0x1F1FF, WBP_Regional},
{0xE0001, 0xE0001, WBP_Format},
{0xE0020, 0xE007F, WBP_Format},
{0xE0100, 0xE01EF, WBP_Extend},
{0xFFFFFFFF, 0xFFFFFFFF, WBP_Undefined}
};

@ -1,88 +0,0 @@
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Word breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2013 Tom Hacohen <tom at stosb dot com>
* Copyright (C) 2013 Petr Filipsky <philodej at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*
* The main reference is Unicode Standard Annex 29 (UAX #29):
* <URL:http://unicode.org/reports/tr29>
*
* When this library was designed, this annex was at Revision 17, for
* Unicode 6.0.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
*
* This library has been updated according to Revision 21, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-21.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
/**
* @file wordbreakdef.h
*
* Definitions of internal data structures, declarations of global
* variables, and function prototypes for the word breaking algorithm.
*
* @version 2.4, 2013/11/10
* @author Tom Hacohen
* @author Petr Filipsky
*/
/**
* Word break classes. This is a direct mapping of Table 3 of Unicode
* Standard Annex 29, Revision 23.
*/
enum WordBreakClass
{
WBP_Undefined,
WBP_CR,
WBP_LF,
WBP_Newline,
WBP_Extend,
WBP_Format,
WBP_Katakana,
WBP_ALetter,
WBP_MidNumLet,
WBP_MidLetter,
WBP_MidNum,
WBP_Numeric,
WBP_ExtendNumLet,
WBP_Regional,
WBP_Hebrew,
WBP_Single,
WBP_Double,
WBP_Any
};
/**
* Struct for entries of word break properties. The array of the
* entries \e must be sorted.
*/
struct WordBreakProperties
{
utf32_t start; /**< Starting code point */
utf32_t end; /**< End code point */
enum WordBreakClass prop; /**< The word breaking property */
};
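
The requirement that the array of entries be sorted is what makes a fast range lookup possible. The following is a minimal sketch of such a lookup, not the code removed by this commit. It assumes that utf32_t is a 32-bit unsigned type, uses a trimmed-down copy of the enum, and takes an explicit length instead of relying on the 0xFFFFFFFF sentinel row. The names sample_wb_table and sample_wb_lookup are hypothetical, and returning WBP_Undefined for an unmatched code point is a choice made for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: utf32_t is a 32-bit unsigned integer type. */
typedef uint32_t utf32_t;

/* Trimmed subset of the enum above; the numeric values differ from the real one. */
enum WordBreakClass
{
    WBP_Undefined,
    WBP_Extend,
    WBP_Format,
    WBP_Katakana,
    WBP_ALetter,
    WBP_Numeric
};

struct WordBreakProperties
{
    utf32_t start;            /* first code point of the range */
    utf32_t end;              /* last code point of the range (inclusive) */
    enum WordBreakClass prop; /* class shared by the whole range */
};

/* A few rows copied from the table above, kept in ascending order. */
static const struct WordBreakProperties sample_wb_table[] = {
    {0x17E0, 0x17E9, WBP_Numeric},
    {0x200C, 0x200D, WBP_Extend},
    {0x200E, 0x200F, WBP_Format},
    {0x30A1, 0x30FA, WBP_Katakana},
    {0xAC00, 0xD7A3, WBP_ALetter},
    {0x1D7CE, 0x1D7FF, WBP_Numeric},
};

static const size_t sample_wb_table_len =
    sizeof sample_wb_table / sizeof sample_wb_table[0];

/*
 * Hypothetical helper: binary search over sorted, non-overlapping ranges.
 * Returns WBP_Undefined when no range contains ch.
 */
enum WordBreakClass
sample_wb_lookup(const struct WordBreakProperties *table, size_t len, utf32_t ch)
{
    size_t lo = 0;
    size_t hi = len; /* search window is [lo, hi) */

    assert(table != NULL || len == 0);

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (ch < table[mid].start)
            hi = mid;         /* ch lies before this range */
        else if (ch > table[mid].end)
            lo = mid + 1;     /* ch lies after this range */
        else
            return table[mid].prop;
    }
    return WBP_Undefined;
}
```

Because the ranges are sorted and do not overlap, each comparison discards half of the remaining rows, so a table of this size needs only about ten steps per lookup. An unsorted table would silently return wrong classes, which is presumably why the comment stresses that the entries must be sorted.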