Synced libunibreak local copy with upstream.

This fixes T805.
Tom Hacohen 2014-01-21 16:41:06 +00:00
parent cc8fa1da45
commit cff1a9a59f
12 changed files with 1359 additions and 1100 deletions

@ -1,4 +1,5 @@
Wu Yongwei. Designed and implemented liblinebreak.
Wu Yongwei. Designed and implemented the original liblinebreak.
Current maintainer of libunibreak.
Nikolay Pultsin. Put forward the original requirements on liblinebreak,
performed tests, and made a lot of suggestions on the initial versions.
@ -6,3 +7,5 @@ performed tests, and made a lot of suggestions on the initial versions.
Thomas Klausner. Autoconfiscated and libtoolized liblinebreak.
Tom Hacohen. Added word boundaries support.
Petr Filipsky. Added incremental processing for line-breaking.

@ -1,3 +1,116 @@
2013-11-14 Wu Yongwei <wuyongwei@gmail.com>
* src/linebreak.c: Add/update comments and doc comments.
(lb_init_breaking_class): Rename to treat_first_char.
(lb_classify_break_simple): Rename to get_lb_result_simple.
(lb_classify_break_lookup): Rename to get_lb_result_lookup.
(set_linebreaks): Remove an unused local variable.
2013-11-14 Wu Yongwei <wuyongwei@gmail.com>
* src/linebreakdata.c: Regenerate from LineBreak-6.3.0.txt.
2013-11-13 Wu Yongwei <wuyongwei@gmail.com>
Fix compilation problems under MSVC.
* src/linebreak.c (lb_init_breaking_class): Remove `inline'.
(lb_classify_break_simple): Ditto.
(lb_classify_break_lookup): Ditto.
(lb_classify_break_lookup): Move local variable declaration before
assertions.
2013-11-10 Wu Yongwei <wuyongwei@gmail.com>
* src/Makefile.am (libunibreak_la_LDFLAGS): Set the version-info to
`2:0:1'.
2013-11-10 Wu Yongwei <wuyongwei@gmail.com>
* src/linebreakdef.c: Adjust the order of code.
(lb_process_next_char): Make its return type int.
* src/linebreak.c (lb_process_next_char): Ditto.
2013-11-10 Wu Yongwei <wuyongwei@gmail.com>
* src/linebreak.c: Make minor changes in doc comments, formatting,
and names.
* src/linebreakdef.c: Ditto.
2013-11-10 Wu Yongwei <wuyongwei@gmail.com>
* AUTHORS: Add `Petr Filipsky'.
2013-11-10 Petr Filipsky <philodej@gmail.com>
Expose low level line-breaking API for incremental processing.
* src/linebreak.h: Add prototype declarations for
lb_init_break_context and lb_process_next_char.
(struct LineBreakContext): New struct.
* src/linebreak.h (LINEBREAK_UNDEFINED): New macro constant.
(lb_init_breaking_class): New static function.
(lb_classify_break_simple): New static function.
(lb_classify_break_lookup): New static function.
(lb_init_break_context): New function.
(lb_process_next_char): New function.
(set_linebreaks): Implement with lb_init_break_context and
lb_process_next_char.
2013-11-05 Petr Filipsky <philodej@gmail.com>
* src/wordbreakdef.h (enum WordBreakClass): Update according to
Table 3 of Unicode Standard Annex 29, Revision 23.
2013-09-30 Wu Yongwei <wuyongwei@gmail.com>
Update for the libunibreak 1.1 release.
* configure.ac (AC_INIT): Change the library version to `1.1'.
* Doxyfile (PROJECT_NUMBER): Change to `1.1'.
* Makefile.am (EXTRA_DIST): Add the `tools' directory.
* NEWS: Add information about libunibreak 1.1.
* src/Makefile.am (libunibreak_la_LDFLAGS): Set the version to `1:1'.
2013-09-29 Wu Yongwei <wuyongwei@gmail.com>
* src/Makefile.msvc: Modernize obsolete/deprecated MSVC options.
2013-09-28 Wu Yongwei <wuyongwei@gmail.com>
* src/wordbreak.c: Update copyright year and UAX information.
* src/wordbreak.h: Ditto.
* src/wordbreakdef.h: Ditto.
2013-09-28 Wu Yongwei <wuyongwei@gmail.com>
Fix the errors caused by libtool 2.4 (really annoying to the level
of WTF for making me add the foolish dependency on m4).
* Makefile.am (ACLOCAL_AMFLAGS): Add `-I m4'.
* bootstrap: Add a line to execute autoreconf.
* configure.ac (AC_CONFIG_MACRO_DIR): Set to `[m4]'.
* purge: Make it remove also the m4 directory.
2013-09-28 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.am (EXTRA_DIST): Add `README.md'.
2013-09-28 Wu Yongwei <wuyongwei@gmail.com>
* README.md: New Markdown version of README.
* README: Remove.
2013-05-13 Tom Hacohen <tom@stosb.com>
Update files according to UAX #29-21, for Unicode 6.2.0.
* README: Update the reference to UAX #29-21.
* src/wordbreak.c (set_wordbreaks): Update for WBP_Regional.
* src/wordbreakdef.h (WBP_Regional): New enumerator for the new
property `RI' as defined in UAX #29-21.
* src/wordbreakdata.c: Regenerate from WordBreakProperty-6.2.0.txt.
2013-05-06 Wu Yongwei <wuyongwei@gmail.com>
* src/Makefile.am (install-exec-hook): Make sure `--disable-static'
can work (thanks to Eugene V. Lyubimkin).
2012-10-06 Wu Yongwei <wuyongwei@gmail.com>
Update files according to UAX #14-30, for Unicode 6.2.0.
@ -82,11 +195,12 @@
2012-08-11 Wu Yongwei <wuyongwei@gmail.com>
Update for the libunibreak 1.0 release.
* configure.ac (AC_INIT): Change the library name and version to
`libunibreak' and `1.0'.
(AC_PROG_LN_S): New macro.
(AC_OUTPUT): Change to `libunibreak.pc'.
* Doxyfile: (PROJECT_NAME): Change to `libunibreak'.
* Doxyfile (PROJECT_NAME): Change to `libunibreak'.
(PROJECT_NUMBER): Change to `1.0'.
* LICENCE: Add copyright information about Tom Hacohen.
* Makefile.am (lib_LTLIBRARIES): Change to `libunibreak.la'.
@ -96,7 +210,7 @@
a symlink to libunibreak.a.
* Makefile.msvc: Change the library name to `libunibreak', and the
output library to `unibreak.lib'.
* NEW: Add information about libunibreak 1.0.
* NEWS: Add information about libunibreak 1.0.
* README: Change the library name, and add information about word
break.

@ -1,3 +1,10 @@
New in libunibreak 1.1
- Update the code and data to conform to Unicode 6.2.0
- Update build files to support libtool 2.4
- Adjust code structure
- Make a few bug fixes
New in libunibreak 1.0
- Add word breaking support

@ -1,31 +1,30 @@
L I B U N I B R E A K
=====================
LIBUNIBREAK
===========
Overview
--------
This is the README file for libunibreak, an implementation of the line
breaking and word breaking algorithms as described in Unicode
Standard Annex 14 and Unicode Standard Annex 30, available at
<URL:http://www.unicode.org/reports/tr14/tr14-30.html>
<URL:http://www.unicode.org/reports/tr29/tr29-17.html>
breaking and word breaking algorithms as described in [Unicode Standard
Annex 14] [1] and [Unicode Standard Annex 29] [2]. Check the project's
[home page] [3] for up-to-date information.
Check this URL for up-to-date information:
<URL:https://github.com/adah1972/libunibreak>
[1]: http://www.unicode.org/reports/tr14/tr14-30.html
[2]: http://www.unicode.org/reports/tr29/tr29-21.html
[3]: https://github.com/adah1972/libunibreak
Licence
-------
This library is released under an open-source licence, the zlib/libpng
licence. Please check the file LICENCE for details.
licence. Please check the file *LICENCE* for details.
Apart from using the algorithm, part of the code is derived from the
data provided under
<URL:http://www.unicode.org/Public/>
[Unicode Public Data] [4], and the [Unicode Terms of Use] [5] may apply.
And the Unicode Terms of Use may apply:
<URL:http://www.unicode.org/copyright.html>
[4]: http://www.unicode.org/Public/
[5]: http://www.unicode.org/copyright.html
Installation
@ -33,7 +32,7 @@ Installation
There are three ways to build the library:
1) On *NIX systems supported by the autoconfiscation tools, do the
1. On \*NIX systems supported by the autoconfiscation tools, do the
normal
./configure
@ -42,30 +41,28 @@ There are three ways to build the library:
to build and install both the dynamic and static libraries. In
addition, one may
- type `make doc` to generate the doxygen documentation; or
- type `make linebreakdata` to regenerate *linebreakdata.c* from
*LineBreak.txt*.
- type `make wordbreakdata` to regenerate *wordbreakdata.c* from
*WordBreakProperty.txt*.
- type `make doc' to generate the doxygen documentation; or
- type `make linebreakdata' to regenerate linebreakdata.c from
LineBreak.txt.
- type `make wordbreakdata' to regenerate wordbreakdata.c from
WordBreakProperty.txt.
2) On systems where GCC and Binutils are supported, one can type
2. On systems where GCC and Binutils are supported, one can type
cd src
cp -p Makefile.gcc Makefile
make
to build the static library. In addition, one may
- type `make debug' or `make release' to explicitly generate the
- type `make debug` or `make release` to explicitly generate the
debug or release build;
- type `make doc' to generate the doxygen documentation; or
- type `make linebreakdata' to regenerate linebreakdata.c from
LineBreak.txt.
- type `make wordbreakdata' to regenerate wordbreakdata.c from
WordBreakProperty.txt.
- type `make doc` to generate the doxygen documentation; or
- type `make linebreakdata` to regenerate *linebreakdata.c* from
*LineBreak.txt*.
- type `make wordbreakdata` to regenerate *wordbreakdata.c* from
*WordBreakProperty.txt*.
3) On Windows, apart from using method 1 (Cygwin/MSYS) and method 2
3. On Windows, apart from using method 1 (Cygwin/MSYS) and method 2
(MinGW), MSVC can also be used. Type
cd src
@ -80,9 +77,11 @@ There are three ways to build the library:
Documentation
-------------
Check the generated document doc/html/linebreak_8h.html and
doc/html/wordbreak_8h.html in the downloaded file for the public
Check the generated document *doc/html/linebreak\_8h.html* and
*doc/html/wordbreak\_8h.html* in the downloaded file for the public
interfaces exposed to applications.
<!--
vim:autoindent:expandtab:formatoptions=tcqlmn:textwidth=72:
-->

@ -1,10 +1,11 @@
/* vim: set tabstop=4 shiftwidth=4: */
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Line breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2008-2012 Wu Yongwei <wuyongwei at gmail dot com>
* Copyright (C) 2008-2013 Wu Yongwei <wuyongwei at gmail dot com>
* Copyright (C) 2013 Petr Filipsky <philodej at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
@ -44,8 +45,9 @@
* Implementation of the line breaking algorithm as described in Unicode
* Standard Annex 14.
*
* @version 2.3, 2012/10/06
* @version 2.5, 2013/11/14
* @author Wu Yongwei
* @author Petr Filipsky
*/
#include <assert.h>
@ -54,6 +56,11 @@
#include "linebreak.h"
#include "linebreakdef.h"
/**
* Special value used internally to indicate an undefined break result.
*/
#define LINEBREAK_UNDEFINED -1
/**
* Size of the second-level index to the line breaking properties.
*/
@ -424,7 +431,7 @@ static enum LineBreakClass resolve_lb_class(
}
case LBP_CJ:
/* Simplified for `normal' line breaking. See
* <url:http://www.unicode.org/reports/tr14/tr14-28.html#CJ>
* <url:http://www.unicode.org/reports/tr14/tr14-30.html#CJ>
* for details. */
return LBP_ID;
case LBP_SA:
@ -436,6 +443,180 @@ static enum LineBreakClass resolve_lb_class(
}
}
/**
* Applies special treatment to the first character in a line.
*
* @param[in,out] lbpCtx pointer to the line breaking context
* @pre \a lbpCtx->lbcCur has a valid line break class
* @post \a lbpCtx->lbcCur has the updated line break class
*/
static void treat_first_char(
struct LineBreakContext* lbpCtx)
{
switch (lbpCtx->lbcCur)
{
case LBP_LF:
case LBP_NL:
lbpCtx->lbcCur = LBP_BK; /* Rule LB5 */
break;
case LBP_CB:
lbpCtx->lbcCur = LBP_BA; /* Rule LB20 */
break;
case LBP_SP:
lbpCtx->lbcCur = LBP_WJ; /* Leading space treated as WJ */
break;
default:
break;
}
}
/**
* Tries to determine the line break opportunity by simple rules.
*
* @param[in,out] lbpCtx pointer to the line breaking context
* @pre \a lbpCtx->lbcCur has the current line break
* class; and \a lbpCtx->lbcNew has the line
* break class for the next character
* @post \a lbpCtx->lbcCur has the updated line break
* class
* @return break result, one of #LINEBREAK_MUSTBREAK,
* #LINEBREAK_ALLOWBREAK, and #LINEBREAK_NOBREAK
* if identified; or #LINEBREAK_UNDEFINED if
* table lookup is needed
*/
static int get_lb_result_simple(
struct LineBreakContext* lbpCtx)
{
if (lbpCtx->lbcCur == LBP_BK
|| (lbpCtx->lbcCur == LBP_CR && lbpCtx->lbcNew != LBP_LF))
{
return LINEBREAK_MUSTBREAK; /* Rules LB4 and LB5 */
}
switch (lbpCtx->lbcNew)
{
case LBP_SP:
return LINEBREAK_NOBREAK; /* Rule LB7; no change to lbcCur */
case LBP_BK:
case LBP_LF:
case LBP_NL:
lbpCtx->lbcCur = LBP_BK; /* Mandatory break after */
return LINEBREAK_NOBREAK; /* Rule LB6 */
case LBP_CR:
lbpCtx->lbcCur = LBP_CR;
return LINEBREAK_NOBREAK; /* Rule LB6 */
case LBP_CB:
lbpCtx->lbcCur = LBP_BA;
return LINEBREAK_ALLOWBREAK; /* Rule LB20 */
default:
return LINEBREAK_UNDEFINED; /* Table lookup is needed */
}
}
/**
* Determines the line break opportunity by table lookup.
*
* @param[in,out] lbpCtx pointer to the line breaking context
* @pre \a lbpCtx->lbcCur has the current line break
* class; \a lbpCtx->lbcLast has the line break
* class for the last character; and \a
* lbpCtx->lbcNew has the line break class for
* the next character
* @post \a lbpCtx->lbcCur has the updated line break
* class
* @return break result, one of #LINEBREAK_MUSTBREAK,
* #LINEBREAK_ALLOWBREAK, and #LINEBREAK_NOBREAK
*/
static int get_lb_result_lookup(
struct LineBreakContext* lbpCtx)
{
/* TODO: Rule LB21a, as introduced by Revision 28 of UAX#14, is not
* yet implemented below. */
int brk = LINEBREAK_UNDEFINED;
assert(lbpCtx->lbcCur <= LBP_JT);
assert(lbpCtx->lbcNew <= LBP_JT);
switch (baTable[lbpCtx->lbcCur - 1][lbpCtx->lbcNew - 1])
{
case DIR_BRK:
brk = LINEBREAK_ALLOWBREAK;
break;
case CMI_BRK:
case IND_BRK:
brk = (lbpCtx->lbcLast == LBP_SP)
? LINEBREAK_ALLOWBREAK
: LINEBREAK_NOBREAK;
break;
case CMP_BRK:
brk = LINEBREAK_NOBREAK;
if (lbpCtx->lbcLast != LBP_SP)
return brk; /* Do not update lbcCur */
break;
case PRH_BRK:
brk = LINEBREAK_NOBREAK;
break;
}
lbpCtx->lbcCur = lbpCtx->lbcNew;
return brk;
}
/**
* Initializes line breaking context for a given language.
*
* @param[in,out] lbpCtx pointer to the line breaking context
* @param[in] ch the first character to process
* @param[in] lang language of the input
* @post the line breaking context is initialized
*/
void lb_init_break_context(
struct LineBreakContext* lbpCtx,
utf32_t ch,
const char* lang)
{
lbpCtx->lang = lang;
lbpCtx->lbpLang = get_lb_prop_lang(lang);
lbpCtx->lbcLast = LBP_Undefined;
lbpCtx->lbcNew = LBP_Undefined;
lbpCtx->lbcCur = resolve_lb_class(
get_char_lb_class_lang(ch, lbpCtx->lbpLang),
lbpCtx->lang);
treat_first_char(lbpCtx);
}
/**
* Updates LineBreakContext for the next code point and returns
* the detected break.
*
* @param[in,out] lbpCtx pointer to the line breaking context
* @param[in] ch Unicode code point
* @return break result, one of #LINEBREAK_MUSTBREAK,
* #LINEBREAK_ALLOWBREAK, and #LINEBREAK_NOBREAK
* @post the line breaking context is updated
*/
int lb_process_next_char(
struct LineBreakContext* lbpCtx,
utf32_t ch )
{
int brk;
lbpCtx->lbcLast = lbpCtx->lbcNew;
lbpCtx->lbcNew = get_char_lb_class_lang(ch, lbpCtx->lbpLang);
brk = get_lb_result_simple(lbpCtx);
switch (brk)
{
case LINEBREAK_MUSTBREAK:
lbpCtx->lbcCur = resolve_lb_class(lbpCtx->lbcNew, lbpCtx->lang);
treat_first_char(lbpCtx);
break;
case LINEBREAK_UNDEFINED:
lbpCtx->lbcNew = resolve_lb_class(lbpCtx->lbcNew, lbpCtx->lang);
brk = get_lb_result_lookup(lbpCtx);
break;
default:
break;
}
return brk;
}
/**
* Gets the next Unicode character in a UTF-8 sequence. The index will
* be advanced to the next complete character, unless the end of string
@ -577,10 +758,7 @@ void set_linebreaks(
get_next_char_t get_next_char)
{
utf32_t ch;
enum LineBreakClass lbcCur;
enum LineBreakClass lbcNew;
enum LineBreakClass lbcLast;
struct LineBreakProperties *lbpLang;
struct LineBreakContext lbCtx;
size_t posCur = 0;
size_t posLast = 0;
@ -588,28 +766,7 @@ void set_linebreaks(
ch = get_next_char(s, len, &posCur);
if (ch == EOS)
return;
lbpLang = get_lb_prop_lang(lang);
lbcCur = resolve_lb_class(get_char_lb_class_lang(ch, lbpLang), lang);
lbcNew = LBP_Undefined;
nextline:
/* Special treatment for the first character */
switch (lbcCur)
{
case LBP_LF:
case LBP_NL:
lbcCur = LBP_BK;
break;
case LBP_CB:
lbcCur = LBP_BA;
break;
case LBP_SP:
lbcCur = LBP_WJ;
break;
default:
break;
}
lb_init_break_context(&lbCtx, ch, lang);
/* Process a line till an explicit break or end of string */
for (;;)
@ -619,75 +776,10 @@ nextline:
brks[posLast] = LINEBREAK_INSIDEACHAR;
}
assert(posLast == posCur - 1);
lbcLast = lbcNew;
ch = get_next_char(s, len, &posCur);
if (ch == EOS)
break;
lbcNew = get_char_lb_class_lang(ch, lbpLang);
if (lbcCur == LBP_BK || (lbcCur == LBP_CR && lbcNew != LBP_LF))
{
brks[posLast] = LINEBREAK_MUSTBREAK;
lbcCur = resolve_lb_class(lbcNew, lang);
goto nextline;
}
switch (lbcNew)
{
case LBP_SP:
brks[posLast] = LINEBREAK_NOBREAK;
continue;
case LBP_BK:
case LBP_LF:
case LBP_NL:
brks[posLast] = LINEBREAK_NOBREAK;
lbcCur = LBP_BK;
continue;
case LBP_CR:
brks[posLast] = LINEBREAK_NOBREAK;
lbcCur = LBP_CR;
continue;
case LBP_CB:
brks[posLast] = LINEBREAK_ALLOWBREAK;
lbcCur = LBP_BA;
continue;
default:
break;
}
lbcNew = resolve_lb_class(lbcNew, lang);
/* TODO: LB21a, as introduced by Revision 28 of UAX#14, is not
* yet implemented below. */
assert(lbcCur <= LBP_JT);
assert(lbcNew <= LBP_JT);
switch (baTable[lbcCur - 1][lbcNew - 1])
{
case DIR_BRK:
brks[posLast] = LINEBREAK_ALLOWBREAK;
break;
case CMI_BRK:
case IND_BRK:
if (lbcLast == LBP_SP)
{
brks[posLast] = LINEBREAK_ALLOWBREAK;
}
else
{
brks[posLast] = LINEBREAK_NOBREAK;
}
break;
case CMP_BRK:
brks[posLast] = LINEBREAK_NOBREAK;
if (lbcLast != LBP_SP)
continue;
break;
case PRH_BRK:
brks[posLast] = LINEBREAK_NOBREAK;
break;
}
lbcCur = lbcNew;
brks[posLast] = lb_process_next_char(&lbCtx, ch);
}
assert(posLast == posCur - 1 && posCur <= len);
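The hunks above replace the inline state machine in set_linebreaks with a context struct plus an "init, then feed one character at a time" pair of functions. A minimal toy sketch of that calling pattern follows. Every name in it (ToyBreak, ToyBreakContext, toy_init_break_context, toy_process_next_char, toy_set_breaks) is hypothetical, and its three rules are a drastic simplification for illustration, not the UAX #14 rules the library implements.

```c
#include <stddef.h>

/* Toy break results; hypothetical names, not libunibreak's macros. */
enum ToyBreak { TOY_MUSTBREAK, TOY_ALLOWBREAK, TOY_NOBREAK };

/* Toy state: only the previous character is remembered. */
struct ToyBreakContext {
    char prev;
};

/* Seeds the context with the first character of the text. */
static void toy_init_break_context(struct ToyBreakContext *ctx, char first)
{
    ctx->prev = first;
}

/* Feeds the next character and returns the break status of the
 * position just before it (that is, after the previous character). */
static enum ToyBreak toy_process_next_char(struct ToyBreakContext *ctx,
                                           char ch)
{
    enum ToyBreak brk;

    if (ctx->prev == '\n')
        brk = TOY_MUSTBREAK;            /* Break after a newline */
    else if (ctx->prev == ' ' && ch != ' ' && ch != '\n')
        brk = TOY_ALLOWBREAK;           /* Break after trailing spaces */
    else
        brk = TOY_NOBREAK;
    ctx->prev = ch;
    return brk;
}

/* Driver shaped like the refactored loop: brks[i] describes the
 * position after s[i].  Marking the end of text as a mandatory break
 * is a choice of this toy. */
static void toy_set_breaks(const char *s, size_t len, enum ToyBreak *brks)
{
    struct ToyBreakContext ctx;
    size_t i;

    if (len == 0)
        return;
    toy_init_break_context(&ctx, s[0]);
    for (i = 1; i < len; ++i)
        brks[i - 1] = toy_process_next_char(&ctx, s[i]);
    brks[len - 1] = TOY_MUSTBREAK;
}
```

Because the state lives in the context rather than in local variables of one loop, a caller could in principle feed characters as they arrive instead of handing over a complete buffer, which appears to be the point of the "incremental processing" ChangeLog entry.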

@ -1,4 +1,4 @@
/* vim: set tabstop=4 shiftwidth=4: */
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Line breaking in a Unicode sequence. Designed to be used in a

@ -1,6 +1,6 @@
/* The content of this file is generated from:
# LineBreak-6.2.0.txt
# Date: 2012-08-08, 19:26:00 GMT [KW]
# LineBreak-6.3.0.txt
# Date: 2013-02-06, 19:45:00 GMT [KW, LI]
*/
#include "linebreak.h"
@ -114,7 +114,9 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x060C, 0x060D, LBP_IS },
{ 0x060E, 0x060F, LBP_AL },
{ 0x0610, 0x061A, LBP_CM },
{ 0x061B, 0x061F, LBP_EX },
{ 0x061B, 0x061B, LBP_EX },
{ 0x061C, 0x061C, LBP_CM },
{ 0x061E, 0x061F, LBP_EX },
{ 0x0620, 0x064A, LBP_AL },
{ 0x064B, 0x065F, LBP_CM },
{ 0x0660, 0x0669, LBP_NU },
@ -456,7 +458,7 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x205D, 0x205F, LBP_BA },
{ 0x2060, 0x2060, LBP_WJ },
{ 0x2061, 0x2064, LBP_AL },
{ 0x206A, 0x206F, LBP_CM },
{ 0x2066, 0x206F, LBP_CM },
{ 0x2070, 0x2071, LBP_AL },
{ 0x2074, 0x2074, LBP_AI },
{ 0x2075, 0x207C, LBP_AL },
@ -473,7 +475,7 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x20A7, 0x20A7, LBP_PO },
{ 0x20A8, 0x20B5, LBP_PR },
{ 0x20B6, 0x20B6, LBP_PO },
{ 0x20B7, 0x20BA, LBP_PR },
{ 0x20B7, 0x20CF, LBP_PR },
{ 0x20D0, 0x20F0, LBP_CM },
{ 0x2100, 0x2102, LBP_AL },
{ 0x2103, 0x2103, LBP_PO },
@ -774,7 +776,8 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x2E33, 0x2E34, LBP_BA },
{ 0x2E35, 0x2E39, LBP_AL },
{ 0x2E3A, 0x2E3B, LBP_B2 },
{ 0x2E80, 0x3000, LBP_ID },
{ 0x2E80, 0x2FFB, LBP_ID },
{ 0x3000, 0x3000, LBP_BA },
{ 0x3001, 0x3002, LBP_CL },
{ 0x3003, 0x3004, LBP_ID },
{ 0x3005, 0x3005, LBP_NS },
@ -803,7 +806,9 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x301E, 0x301F, LBP_CL },
{ 0x3020, 0x3029, LBP_ID },
{ 0x302A, 0x302F, LBP_CM },
{ 0x3030, 0x303A, LBP_ID },
{ 0x3030, 0x3034, LBP_ID },
{ 0x3035, 0x3035, LBP_CM },
{ 0x3036, 0x303A, LBP_ID },
{ 0x303B, 0x303C, LBP_NS },
{ 0x303D, 0x303F, LBP_ID },
{ 0x3041, 0x3041, LBP_CJ },
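The regenerated table is a list of sorted, non-overlapping code point ranges, so splitting { 0x2E80, 0x3000, LBP_ID } into { 0x2E80, 0x2FFB, LBP_ID } and { 0x3000, 0x3000, LBP_BA } changes the class of U+3000 from ID to BA. The sketch below shows one way such a table can be queried, using a binary search over five entries copied from the hunk above. The type and function names (SketchClass, SketchRange, sketch_lookup) are hypothetical, and the search strategy is an assumption for illustration; the library's own lookup code is not part of this diff.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a few line break classes. */
enum SketchClass { SK_XX, SK_ID, SK_BA, SK_CL, SK_NS };

struct SketchRange {
    uint32_t start;             /* First code point of the range */
    uint32_t end;               /* Last code point of the range */
    enum SketchClass cls;       /* Class shared by the whole range */
};

/* Entries copied from the new side of the hunk around U+3000. */
static const struct SketchRange sketch_ranges[] = {
    { 0x2E80, 0x2FFB, SK_ID },
    { 0x3000, 0x3000, SK_BA },
    { 0x3001, 0x3002, SK_CL },
    { 0x3003, 0x3004, SK_ID },
    { 0x3005, 0x3005, SK_NS },
};

/* Binary search over the sorted ranges.  Code points that fall in a
 * gap return SK_XX here; that fallback is a choice of this sketch. */
static enum SketchClass sketch_lookup(uint32_t ch)
{
    size_t lo = 0;
    size_t hi = sizeof sketch_ranges / sizeof sketch_ranges[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ch < sketch_ranges[mid].start)
            hi = mid;
        else if (ch > sketch_ranges[mid].end)
            lo = mid + 1;
        else
            return sketch_ranges[mid].cls;
    }
    return SK_XX;
}
```

With these entries, sketch_lookup(0x3000) yields SK_BA, whereas a table still containing the old { 0x2E80, 0x3000, ... } range would have reported the ID class for the same code point.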

@ -1,4 +1,4 @@
/* vim: set tabstop=4 shiftwidth=4: */
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Line breaking in a Unicode sequence. Designed to be used in a

@ -1,10 +1,11 @@
/* vim: set tabstop=4 shiftwidth=4: */
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Line breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2008-2012 Wu Yongwei <wuyongwei at gmail dot com>
* Copyright (C) 2008-2013 Wu Yongwei <wuyongwei at gmail dot com>
* Copyright (C) 2013 Petr Filipsky <philodej at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
@ -44,15 +45,16 @@
* Definitions of internal data structures, declarations of global
* variables, and function prototypes for the line breaking algorithm.
*
* @version 2.3, 2012/10/06
* @version 2.4, 2013/11/10
* @author Wu Yongwei
* @author Petr Filipsky
*/
/**
* Constant value to mark the end of string. It is not a valid Unicode
* character.
*/
#define EOS 0xFFFF
#define EOS 0xFFFFFFFF
/**
* Line break classes. This is a direct mapping of Table 1 of Unicode
@ -130,6 +132,19 @@ struct LineBreakPropertiesLang
struct LineBreakProperties *lbp; /**< Pointer to associated data */
};
/**
* Context representing internal state of the line breaking algorithm.
* This is useful to callers if incremental analysis is wanted.
*/
struct LineBreakContext
{
const char *lang; /**< Language name */
struct LineBreakProperties *lbpLang;/**< Pointer to LineBreakProperties */
enum LineBreakClass lbcCur; /**< Breaking class of current codepoint */
enum LineBreakClass lbcNew; /**< Breaking class of next codepoint */
enum LineBreakClass lbcLast; /**< Breaking class of last codepoint */
};
/**
* Abstract function interface for #lb_get_next_char_utf8,
* #lb_get_next_char_utf16, and #lb_get_next_char_utf32.
@ -144,6 +159,13 @@ extern struct LineBreakPropertiesLang lb_prop_lang_map[];
utf32_t lb_get_next_char_utf8(const utf8_t *s, size_t len, size_t *ip);
utf32_t lb_get_next_char_utf16(const utf16_t *s, size_t len, size_t *ip);
utf32_t lb_get_next_char_utf32(const utf32_t *s, size_t len, size_t *ip);
void lb_init_break_context(
struct LineBreakContext* lbpCtx,
utf32_t ch,
const char* lang);
int lb_process_next_char(
struct LineBreakContext* lbpCtx,
utf32_t ch);
void set_linebreaks(
const void *s,
size_t len,
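The header hunk above also changes EOS from 0xFFFF to 0xFFFFFFFF. The commit does not state the reason, but a plausible one is that 0xFFFF lies inside the Unicode code space (U+0000 to U+10FFFF) and can therefore appear in UTF-32 input, while 0xFFFFFFFF cannot be a code point. The sketch below illustrates the difference with a reader shaped like lb_get_next_char_utf32; the names SKETCH_EOS_OLD, SKETCH_EOS_NEW, sketch_next_utf32, and sketch_count_until_eos are hypothetical, and the reader is a simplification rather than the library's implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EOS_OLD 0xFFFFu          /* Old sentinel value */
#define SKETCH_EOS_NEW 0xFFFFFFFFu      /* New sentinel value */

/* Reads one UTF-32 unit and advances *ip, or returns `eos` when the
 * input is exhausted. */
static uint32_t sketch_next_utf32(const uint32_t *s, size_t len,
                                  size_t *ip, uint32_t eos)
{
    if (*ip >= len)
        return eos;
    return s[(*ip)++];
}

/* Counts the units read before the reader's result equals `eos`,
 * which is how a caller comparing against the sentinel sees the text. */
static size_t sketch_count_until_eos(const uint32_t *s, size_t len,
                                     uint32_t eos)
{
    size_t pos = 0;
    size_t n = 0;

    while (sketch_next_utf32(s, len, &pos, eos) != eos)
        ++n;
    return n;
}
```

For the three-unit input { 0x41, 0xFFFF, 0x42 }, a caller using the old sentinel stops after one unit because the embedded 0xFFFF is indistinguishable from end of string, while the new sentinel lets all three units through.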

@ -1,10 +1,10 @@
/* vim: set tabstop=4 shiftwidth=4: */
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Word breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2012 Tom Hacohen <tom@stosb.com>
* Copyright (C) 2013 Tom Hacohen <tom at stosb dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
@ -30,6 +30,10 @@
* Unicode 6.0.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
*
* This library has been updated according to Revision 21, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-21.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
@ -40,7 +44,7 @@
* Implementation of the word breaking algorithm as described in Unicode
* Standard Annex 29.
*
* @version 2.3, 2013/05/14
* @version 2.4, 2013/09/28
* @author Tom Hacohen
*/

@ -1,10 +1,10 @@
/* vim: set tabstop=4 shiftwidth=4: */
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Word breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2012 Tom Hacohen <tom@stosb.com>
* Copyright (C) 2013 Tom Hacohen <tom at stosb dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
@ -30,6 +30,10 @@
* Unicode 6.0.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
*
* This library has been updated according to Revision 21, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-21.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
@ -39,7 +43,7 @@
*
* Header file for the word breaking (segmentation) algorithm.
*
* @version 2.2, 2012/02/04
* @version 2.3, 2013/09/28
* @author Tom Hacohen
*/

@ -1,10 +1,11 @@
/* vim: set tabstop=4 shiftwidth=4: */
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Word breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2012 Tom Hacohen <tom@stosb.com>
* Copyright (C) 2013 Tom Hacohen <tom at stosb dot com>
* Copyright (C) 2013 Petr Filipsky <philodej at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
@ -30,6 +31,10 @@
* Unicode 6.0.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
*
* This library has been updated according to Revision 21, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-21.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
@ -40,13 +45,14 @@
* Definitions of internal data structures, declarations of global
* variables, and function prototypes for the word breaking algorithm.
*
* @version 2.2, 2013/05/14
* @version 2.4, 2013/11/10
* @author Tom Hacohen
* @author Petr Filipsky
*/
/**
* Word break classes. This is a direct mapping of Table 3 of Unicode
* Standard Annex 29, Revision 17.
* Standard Annex 29, Revision 23.
*/
enum WordBreakClass
{
@ -64,6 +70,9 @@ enum WordBreakClass
WBP_Numeric,
WBP_ExtendNumLet,
WBP_Regional,
WBP_Hebrew,
WBP_Single,
WBP_Double,
WBP_Any
};