Synced libunibreak local copy with upstream.
This fixes T805.
parent cc8fa1da45
commit cff1a9a59f
@@ -1,4 +1,5 @@
-Wu Yongwei. Designed and implemented liblinebreak.
+Wu Yongwei. Designed and implemented the original liblinebreak.
+Current maintainer of libunibreak.

 Nikolay Pultsin. Put forward the original requirements on liblinebreak,
 performed tests, and made a lot of suggestions on the initial versions.
@@ -6,3 +7,5 @@ performed tests, and made a lot of suggestions on the initial versions.
 Thomas Klausner. Autoconfiscated and libtoolized liblinebreak.

 Tom Hacohen. Added word boundaries support.
+
+Petr Filipsky. Added incremental processing for line-breaking.
@ -1,3 +1,116 @@
|
|||
2013-11-14 Wu Yongwei <wuyongwei@gmail.com>
|
||||
|
||||
* src/linebreak.c: Add/update comments and doc comments.
|
||||
(lb_init_breaking_class): Rename to treat_first_char.
|
||||
(lb_classify_break_simple): Rename to get_lb_result_simple.
|
||||
(lb_classify_break_lookup): Rename to get_lb_result_lookup.
|
||||
(set_linebreaks): Remove an unused local variable.
|
||||
|
||||
2013-11-14 Wu Yongwei <wuyongwei@gmail.com>
|
||||
|
||||
* src/linebreakdata.c: Regenerate from LineBreak-6.3.0.txt.
|
||||
|
||||
2013-11-13 Wu Yongwei <wuyongwei@gmail.com>
|
||||
|
||||
Fix compilation problems under MSVC.
|
||||
* src/linebreak.c (lb_init_breaking_class): Remove `inline'.
|
||||
(lb_classify_break_simple): Ditto.
|
||||
(lb_classify_break_lookup): Ditto.
|
||||
(lb_classify_break_lookup): Move local variable declaration before
|
||||
assertions.
|
||||
|
||||
2013-11-10 Wu Yongwei <wuyongwei@gmail.com>
|
||||
|
||||
* src/Makefile.am (libunibreak_la_LDFLAGS): Set the version-info to
|
||||
`2:0:1'.
|
||||
|
||||
2013-11-10 Wu Yongwei <wuyongwei@gmail.com>
|
||||
|
||||
* src/linebreakdef.c: Adjust the order of code.
|
||||
(lb_process_next_char): Make its return type int.
|
||||
* src/linebreak.c (lb_process_next_char): Ditto.
|
||||
|
||||
2013-11-10 Wu Yongwei <wuyongwei@gmail.com>
|
||||
|
||||
* src/linebreak.c: Make minor changes in doc comments, formatting,
|
||||
and names.
|
||||
* src/linebreakdef.c: Ditto.
|
||||
|
||||
2013-11-10 Wu Yongwei <wuyongwei@gmail.com>
|
||||
|
||||
* AUTHORS: Add `Petr Filipsky'.
|
||||
|
||||
2013-11-10 Petr Filipsky <philodej@gmail.com>
|
||||
|
||||
Expose low level line-breaking API for incremental processing.
|
||||
* src/linebreak.h: Add prototype declarations for
|
||||
lb_init_break_context and lb_process_next_char.
|
||||
(struct LineBreakContext): New struct.
|
||||
* src/linebreak.h (LINEBREAK_UNDEFINED): New macro constant.
|
||||
(lb_init_breaking_class): New static function.
|
||||
(lb_classify_break_simple): New static function.
|
||||
(lb_classify_break_lookup): New static function.
|
||||
(lb_init_break_context): New function.
|
||||
(lb_process_next_char): New function.
|
||||
(set_linebreaks): Implement with lb_init_break_context and
|
||||
lb_process_next_char.
|
||||
|
||||
2013-11-05 Petr Filipsky <philodej@gmail.com>
|
||||
|
||||
* src/wordbreakdef.h (enum WordBreakClass): Update according to
|
||||
Table 3 of Unicode Standard Annex 29, Revision 23.
|
||||
|
||||
2013-09-30 Wu Yongwei <wuyongwei@gmail.com>
|
||||
|
||||
Update for the libunibreak 1.1 release.
|
||||
* configure.ac (AC_INIT): Change the library version to `1.1'.
|
||||
* Doxyfile (PROJECT_NUMBER): Change to `1.1'.
|
||||
* Makefile.am (EXTRA_DIST): Add the `tools' directory.
|
||||
* NEWS: Add information about libunibreak 1.1.
|
||||
* src/Makefile.am (libunibreak_la_LDFLAGS): Set the version to `1:1'.
|
||||
|
||||
2013-09-29 Wu Yongwei <wuyongwei@gmail.com>
|
||||
|
||||
* src/Makefile.msvc: Modernize obsolete/deprecated MSVC options.
|
||||
|
||||
2013-09-28 Wu Yongwei <wuyongwei@gmail.com>
|
||||
|
||||
* src/wordbreak.c: Update copyright year and UAX information.
|
||||
* src/wordbreak.h: Ditto.
|
||||
* src/wordbreakdef.h: Ditto.
|
||||
|
||||
2013-09-28 Wu Yongwei <wuyongwei@gmail.com>
|
||||
|
||||
Fix the errors caused by libtool 2.4 (really annoying to the level
|
||||
of WTF for making me add the foolish dependency on m4).
|
||||
* Makefile.am (ACLOCAL_AMFLAGS): Add `-I m4'.
|
||||
* bootstrap: Add a line to execute autoreconf.
|
||||
* configure.ac (AC_CONFIG_MACRO_DIR): Set to `[m4]'.
|
||||
* purge: Make it remove also the m4 directory.
|
||||
|
||||
2013-09-28 Wu Yongwei <wuyongwei@gmail.com>
|
||||
|
||||
* Makefile.am (EXTRA_DIST): Add `README.md'.
|
||||
|
||||
2013-09-28 Wu Yongwei <wuyongwei@gmail.com>
|
||||
|
||||
* README.md: New Markdown version of README.
|
||||
* README: Remove.
|
||||
|
||||
2013-05-13 Tom Hacohen <tom@stosb.com>
|
||||
|
||||
Update files according to UAX #29-21, for Unicode 6.2.0.
|
||||
* README: Update the reference to UAX #29-21.
|
||||
* src/wordbreak.c (set_wordbreaks): Update for WBP_Regional.
|
||||
* src/wordbreakdef.h (WBP_Regional): New enumerator for the new
|
||||
property `RI' as defined in UAX #29-21.
|
||||
* src/wordbreakdata.c: Regenerate from WordBreakProperty-6.2.0.txt.
|
||||
|
||||
2013-05-06 Wu Yongwei <wuyongwei@gmail.com>
|
||||
|
||||
* src/Makefile.am (install-exec-hook): Make sure `--disable-static'
|
||||
can work (thanks to Eugene V. Lyubimkin).
|
||||
|
||||
2012-10-06 Wu Yongwei <wuyongwei@gmail.com>
|
||||
|
||||
Update files according to UAX #14-30, for Unicode 6.2.0.
|
||||
|
@ -82,11 +195,12 @@
|
|||
|
||||
2012-08-11 Wu Yongwei <wuyongwei@gmail.com>
|
||||
|
||||
Update for the libunibreak 1.0 release.
|
||||
* configure.ac (AC_INIT): Change the library name and version to
|
||||
`libunibreak' and `1.0'.
|
||||
(AC_PROG_LN_S): New macro.
|
||||
(AC_OUTPUT): Change to `libunibreak.pc'.
|
||||
* Doxyfile: (PROJECT_NAME): Change to `libunibreak'.
|
||||
* Doxyfile (PROJECT_NAME): Change to `libunibreak'.
|
||||
(PROJECT_NUMBER): Change to `1.0'.
|
||||
* LICENCE: Add copyright information about Tom Hacohen.
|
||||
* Makefile.am (lib_LTLIBRARIES): Change to `libunibreak.la'.
|
||||
|
@ -96,7 +210,7 @@
|
|||
a symlink to libunibreak.a.
|
||||
* Makefile.msvc: Change the library name to `libunibreak', and the
|
||||
output library to `unibreak.lib'.
|
||||
* NEW: Add information about libunibreak 1.0.
|
||||
* NEWS: Add information about libunibreak 1.0.
|
||||
* README: Change the library name, and add information about word
|
||||
break.
|
||||
|
||||
|
|
|
@@ -1,3 +1,10 @@
+New in libunibreak 1.1
+
+- Update the code and data to conform to Unicode 6.2.0
+- Update build files to support libtool 2.4
+- Adjust code structure
+- Make a few bug fixes
+
 New in libunibreak 1.0

 - Add word breaking support
@@ -1,31 +1,30 @@
-L I B U N I B R E A K
-=====================
+LIBUNIBREAK
+===========

 Overview
 --------

 This is the README file for libunibreak, an implementation of the line
-breaking and word breaking algorithms as described in Unicode
-Standard Annex 14 and Unicode Standard Annex 30, available at
-<URL:http://www.unicode.org/reports/tr14/tr14-30.html>
-<URL:http://www.unicode.org/reports/tr29/tr29-17.html>
+breaking and word breaking algorithms as described in [Unicode Standard
+Annex 14] [1] and [Unicode Standard Annex 29] [2]. Check the project's
+[home page] [3] for up-to-date information.

-Check this URL for up-to-date information:
-<URL:https://github.com/adah1972/libunibreak>
+[1]: http://www.unicode.org/reports/tr14/tr14-30.html
+[2]: http://www.unicode.org/reports/tr29/tr29-21.html
+[3]: https://github.com/adah1972/libunibreak


 Licence
 -------

 This library is released under an open-source licence, the zlib/libpng
-licence. Please check the file LICENCE for details.
+licence. Please check the file *LICENCE* for details.

 Apart from using the algorithm, part of the code is derived from the
-data provided under
-<URL:http://www.unicode.org/Public/>
+[Unicode Public Data] [4], and the [Unicode Terms of Use] [5] may apply.

-And the Unicode Terms of Use may apply:
-<URL:http://www.unicode.org/copyright.html>
+[4]: http://www.unicode.org/Public/
+[5]: http://www.unicode.org/copyright.html


 Installation
@@ -33,7 +32,7 @@ Installation

 There are three ways to build the library:

-1) On *NIX systems supported by the autoconfiscation tools, do the
+1. On \*NIX systems supported by the autoconfiscation tools, do the
 normal

 ./configure
@@ -42,30 +41,28 @@ There are three ways to build the library:

 to build and install both the dynamic and static libraries. In
 addition, one may
+- type `make doc` to generate the doxygen documentation; or
+- type `make linebreakdata` to regenerate *linebreakdata.c* from
+*LineBreak.txt*.
+- type `make wordbreakdata` to regenerate *wordbreakdata.c* from
+*WordBreakProperty.txt*.

-- type `make doc' to generate the doxygen documentation; or
-- type `make linebreakdata' to regenerate linebreakdata.c from
-LineBreak.txt.
-- type `make wordbreakdata' to regenerate wordbreakdata.c from
-WordBreakProperty.txt.
-
-2) On systems where GCC and Binutils are supported, one can type
+2. On systems where GCC and Binutils are supported, one can type

 cd src
 cp -p Makefile.gcc Makefile
 make

 to build the static library. In addition, one may
-
-- type `make debug' or `make release' to explicitly generate the
+- type `make debug` or `make release` to explicitly generate the
 debug or release build;
-- type `make doc' to generate the doxygen documentation; or
-- type `make linebreakdata' to regenerate linebreakdata.c from
-LineBreak.txt.
-- type `make wordbreakdata' to regenerate wordbreakdata.c from
-WordBreakProperty.txt.
+- type `make doc` to generate the doxygen documentation; or
+- type `make linebreakdata` to regenerate *linebreakdata.c* from
+*LineBreak.txt*.
+- type `make wordbreakdata` to regenerate *wordbreakdata.c* from
+*WordBreakProperty.txt*.

-3) On Windows, apart from using method 1 (Cygwin/MSYS) and method 2
+3. On Windows, apart from using method 1 (Cygwin/MSYS) and method 2
 (MinGW), MSVC can also be used. Type

 cd src
@@ -80,9 +77,11 @@ There are three ways to build the library:
 Documentation
 -------------

-Check the generated document doc/html/linebreak_8h.html and
-doc/html/wordbreak_8h.html in the downloaded file for the public
+Check the generated document *doc/html/linebreak\_8h.html* and
+*doc/html/wordbreak\_8h.html* in the downloaded file for the public
 interfaces exposed to applications.


+<!--
 vim:autoindent:expandtab:formatoptions=tcqlmn:textwidth=72:
+-->
@ -1,10 +1,11 @@
|
|||
/* vim: set tabstop=4 shiftwidth=4: */
|
||||
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
|
||||
|
||||
/*
|
||||
* Line breaking in a Unicode sequence. Designed to be used in a
|
||||
* generic text renderer.
|
||||
*
|
||||
* Copyright (C) 2008-2012 Wu Yongwei <wuyongwei at gmail dot com>
|
||||
* Copyright (C) 2008-2013 Wu Yongwei <wuyongwei at gmail dot com>
|
||||
* Copyright (C) 2013 Petr Filipsky <philodej at gmail dot com>
|
||||
*
|
||||
* This software is provided 'as-is', without any express or implied
|
||||
* warranty. In no event will the author be held liable for any damages
|
||||
|
@ -44,8 +45,9 @@
|
|||
* Implementation of the line breaking algorithm as described in Unicode
|
||||
* Standard Annex 14.
|
||||
*
|
||||
* @version 2.3, 2012/10/06
|
||||
* @version 2.5, 2013/11/14
|
||||
* @author Wu Yongwei
|
||||
* @author Petr Filipsky
|
||||
*/
|
||||
|
||||
#include <assert.h>
|
||||
|
@ -54,6 +56,11 @@
|
|||
#include "linebreak.h"
|
||||
#include "linebreakdef.h"
|
||||
|
||||
/**
|
||||
* Special value used internally to indicate an undefined break result.
|
||||
*/
|
||||
#define LINEBREAK_UNDEFINED -1
|
||||
|
||||
/**
|
||||
* Size of the second-level index to the line breaking properties.
|
||||
*/
|
||||
|
@ -424,7 +431,7 @@ static enum LineBreakClass resolve_lb_class(
|
|||
}
|
||||
case LBP_CJ:
|
||||
/* Simplified for `normal' line breaking. See
|
||||
* <url:http://www.unicode.org/reports/tr14/tr14-28.html#CJ>
|
||||
* <url:http://www.unicode.org/reports/tr14/tr14-30.html#CJ>
|
||||
* for details. */
|
||||
return LBP_ID;
|
||||
case LBP_SA:
|
||||
|
@ -436,6 +443,180 @@ static enum LineBreakClass resolve_lb_class(
|
|||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Applies special treatment to the first character in a line.
|
||||
*
|
||||
* @param[in,out] lbpCtx pointer to the line breaking context
|
||||
* @pre \a lbpCtx->lbcCur has a valid line break class
|
||||
* @post \a lbpCtx->lbcCur has the updated line break class
|
||||
*/
|
||||
static void treat_first_char(
|
||||
struct LineBreakContext* lbpCtx)
|
||||
{
|
||||
switch (lbpCtx->lbcCur)
|
||||
{
|
||||
case LBP_LF:
|
||||
case LBP_NL:
|
||||
lbpCtx->lbcCur = LBP_BK; /* Rule LB5 */
|
||||
break;
|
||||
case LBP_CB:
|
||||
lbpCtx->lbcCur = LBP_BA; /* Rule LB20 */
|
||||
break;
|
||||
case LBP_SP:
|
||||
lbpCtx->lbcCur = LBP_WJ; /* Leading space treated as WJ */
|
||||
break;
|
||||
default:
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Tries telling the line break opportunity by simple rules.
|
||||
*
|
||||
* @param[in,out] lbpCtx pointer to the line breaking context
|
||||
* @pre \a lbpCtx->lbcCur has the current line break
|
||||
* class; and \a lbpCtx->lbcNew has the line
|
||||
* break class for the next character
|
||||
* @post \a lbpCtx->lbcCur has the updated line break
|
||||
* class
|
||||
* @return break result, one of #LINEBREAK_MUSTBREAK,
|
||||
* #LINEBREAK_ALLOWBREAK, and #LINEBREAK_NOBREAK
|
||||
* if identified; or #LINEBREAK_UNDEFINED if
|
||||
* table lookup is needed
|
||||
*/
|
||||
static int get_lb_result_simple(
|
||||
struct LineBreakContext* lbpCtx)
|
||||
{
|
||||
if (lbpCtx->lbcCur == LBP_BK
|
||||
|| (lbpCtx->lbcCur == LBP_CR && lbpCtx->lbcNew != LBP_LF))
|
||||
{
|
||||
return LINEBREAK_MUSTBREAK; /* Rules LB4 and LB5 */
|
||||
}
|
||||
|
||||
switch (lbpCtx->lbcNew)
|
||||
{
|
||||
case LBP_SP:
|
||||
return LINEBREAK_NOBREAK; /* Rule LB7; no change to lbcCur */
|
||||
case LBP_BK:
|
||||
case LBP_LF:
|
||||
case LBP_NL:
|
||||
lbpCtx->lbcCur = LBP_BK; /* Mandatory break after */
|
||||
return LINEBREAK_NOBREAK; /* Rule LB6 */
|
||||
case LBP_CR:
|
||||
lbpCtx->lbcCur = LBP_CR;
|
||||
return LINEBREAK_NOBREAK; /* Rule LB6 */
|
||||
case LBP_CB:
|
||||
lbpCtx->lbcCur = LBP_BA;
|
||||
return LINEBREAK_ALLOWBREAK; /* Rule LB20 */
|
||||
default:
|
||||
return LINEBREAK_UNDEFINED; /* Table lookup is needed */
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Tells the line break opportunity by table lookup.
|
||||
*
|
||||
* @param[in,out] lbpCtx pointer to the line breaking context
|
||||
* @pre \a lbpCtx->lbcCur has the current line break
|
||||
* class; \a lbpCtx->lbcLast has the line break
|
||||
* class for the last character; and \a
|
||||
* lbpCtx->lbcNew has the line break class for
|
||||
* the next character
|
||||
* @post \a lbpCtx->lbcCur has the updated line break
|
||||
* class
|
||||
* @return break result, one of #LINEBREAK_MUSTBREAK,
|
||||
* #LINEBREAK_ALLOWBREAK, and #LINEBREAK_NOBREAK
|
||||
*/
|
||||
static int get_lb_result_lookup(
|
||||
struct LineBreakContext* lbpCtx)
|
||||
{
|
||||
/* TODO: Rule LB21a, as introduced by Revision 28 of UAX#14, is not
|
||||
* yet implemented below. */
|
||||
int brk = LINEBREAK_UNDEFINED;
|
||||
assert(lbpCtx->lbcCur <= LBP_JT);
|
||||
assert(lbpCtx->lbcNew <= LBP_JT);
|
||||
switch (baTable[lbpCtx->lbcCur - 1][lbpCtx->lbcNew - 1])
|
||||
{
|
||||
case DIR_BRK:
|
||||
brk = LINEBREAK_ALLOWBREAK;
|
||||
break;
|
||||
case CMI_BRK:
|
||||
case IND_BRK:
|
||||
brk = (lbpCtx->lbcLast == LBP_SP)
|
||||
? LINEBREAK_ALLOWBREAK
|
||||
: LINEBREAK_NOBREAK;
|
||||
break;
|
||||
case CMP_BRK:
|
||||
brk = LINEBREAK_NOBREAK;
|
||||
if (lbpCtx->lbcLast != LBP_SP)
|
||||
return brk; /* Do not update lbcCur */
|
||||
break;
|
||||
case PRH_BRK:
|
||||
brk = LINEBREAK_NOBREAK;
|
||||
break;
|
||||
}
|
||||
lbpCtx->lbcCur = lbpCtx->lbcNew;
|
||||
return brk;
|
||||
}
|
||||
|
||||
/**
|
||||
* Initializes line breaking context for a given language.
|
||||
*
|
||||
* @param[in,out] lbpCtx pointer to the line breaking context
|
||||
* @param[in] ch the first character to process
|
||||
* @param[in] lang language of the input
|
||||
* @post the line breaking context is initialized
|
||||
*/
|
||||
void lb_init_break_context(
|
||||
struct LineBreakContext* lbpCtx,
|
||||
utf32_t ch,
|
||||
const char* lang)
|
||||
{
|
||||
lbpCtx->lang = lang;
|
||||
lbpCtx->lbpLang = get_lb_prop_lang(lang);
|
||||
lbpCtx->lbcLast = LBP_Undefined;
|
||||
lbpCtx->lbcNew = LBP_Undefined;
|
||||
lbpCtx->lbcCur = resolve_lb_class(
|
||||
get_char_lb_class_lang(ch, lbpCtx->lbpLang),
|
||||
lbpCtx->lang);
|
||||
treat_first_char(lbpCtx);
|
||||
}
|
||||
|
||||
/**
|
||||
* Updates LineBreakContext for the next code point and returns
|
||||
* the detected break.
|
||||
*
|
||||
* @param[in,out] lbpCtx pointer to the line breaking context
|
||||
* @param[in] ch Unicode code point
|
||||
* @return break result, one of #LINEBREAK_MUSTBREAK,
|
||||
* #LINEBREAK_ALLOWBREAK, and #LINEBREAK_NOBREAK
|
||||
* @post the line breaking context is updated
|
||||
*/
|
||||
int lb_process_next_char(
|
||||
struct LineBreakContext* lbpCtx,
|
||||
utf32_t ch )
|
||||
{
|
||||
int brk;
|
||||
|
||||
lbpCtx->lbcLast = lbpCtx->lbcNew;
|
||||
lbpCtx->lbcNew = get_char_lb_class_lang(ch, lbpCtx->lbpLang);
|
||||
brk = get_lb_result_simple(lbpCtx);
|
||||
switch (brk)
|
||||
{
|
||||
case LINEBREAK_MUSTBREAK:
|
||||
lbpCtx->lbcCur = resolve_lb_class(lbpCtx->lbcNew, lbpCtx->lang);
|
||||
treat_first_char(lbpCtx);
|
||||
break;
|
||||
case LINEBREAK_UNDEFINED:
|
||||
lbpCtx->lbcNew = resolve_lb_class(lbpCtx->lbcNew, lbpCtx->lang);
|
||||
brk = get_lb_result_lookup(lbpCtx);
|
||||
break;
|
||||
default:
|
||||
break;
|
||||
}
|
||||
return brk;
|
||||
}
|
||||
|
||||
/**
|
||||
* Gets the next Unicode character in a UTF-8 sequence. The index will
|
||||
* be advanced to the next complete character, unless the end of string
|
||||
|
@@ -577,10 +758,7 @@ void set_linebreaks(
         get_next_char_t get_next_char)
 {
     utf32_t ch;
-    enum LineBreakClass lbcCur;
-    enum LineBreakClass lbcNew;
-    enum LineBreakClass lbcLast;
-    struct LineBreakProperties *lbpLang;
+    struct LineBreakContext lbCtx;
     size_t posCur = 0;
     size_t posLast = 0;

@@ -588,28 +766,7 @@ void set_linebreaks(
     ch = get_next_char(s, len, &posCur);
     if (ch == EOS)
         return;
-    lbpLang = get_lb_prop_lang(lang);
-    lbcCur = resolve_lb_class(get_char_lb_class_lang(ch, lbpLang), lang);
-    lbcNew = LBP_Undefined;
-
-nextline:
-
-    /* Special treatment for the first character */
-    switch (lbcCur)
-    {
-    case LBP_LF:
-    case LBP_NL:
-        lbcCur = LBP_BK;
-        break;
-    case LBP_CB:
-        lbcCur = LBP_BA;
-        break;
-    case LBP_SP:
-        lbcCur = LBP_WJ;
-        break;
-    default:
-        break;
-    }
+    lb_init_break_context(&lbCtx, ch, lang);

     /* Process a line till an explicit break or end of string */
     for (;;)
@@ -619,75 +776,10 @@ nextline:
             brks[posLast] = LINEBREAK_INSIDEACHAR;
         }
         assert(posLast == posCur - 1);
-        lbcLast = lbcNew;
         ch = get_next_char(s, len, &posCur);
         if (ch == EOS)
             break;
-        lbcNew = get_char_lb_class_lang(ch, lbpLang);
-        if (lbcCur == LBP_BK || (lbcCur == LBP_CR && lbcNew != LBP_LF))
-        {
-            brks[posLast] = LINEBREAK_MUSTBREAK;
-            lbcCur = resolve_lb_class(lbcNew, lang);
-            goto nextline;
-        }
-
-        switch (lbcNew)
-        {
-        case LBP_SP:
-            brks[posLast] = LINEBREAK_NOBREAK;
-            continue;
-        case LBP_BK:
-        case LBP_LF:
-        case LBP_NL:
-            brks[posLast] = LINEBREAK_NOBREAK;
-            lbcCur = LBP_BK;
-            continue;
-        case LBP_CR:
-            brks[posLast] = LINEBREAK_NOBREAK;
-            lbcCur = LBP_CR;
-            continue;
-        case LBP_CB:
-            brks[posLast] = LINEBREAK_ALLOWBREAK;
-            lbcCur = LBP_BA;
-            continue;
-        default:
-            break;
-        }
-
-        lbcNew = resolve_lb_class(lbcNew, lang);
-
-        /* TODO: LB21a, as introduced by Revision 28 of UAX#14, is not
-         * yet implemented below. */
-
-        assert(lbcCur <= LBP_JT);
-        assert(lbcNew <= LBP_JT);
-        switch (baTable[lbcCur - 1][lbcNew - 1])
-        {
-        case DIR_BRK:
-            brks[posLast] = LINEBREAK_ALLOWBREAK;
-            break;
-        case CMI_BRK:
-        case IND_BRK:
-            if (lbcLast == LBP_SP)
-            {
-                brks[posLast] = LINEBREAK_ALLOWBREAK;
-            }
-            else
-            {
-                brks[posLast] = LINEBREAK_NOBREAK;
-            }
-            break;
-        case CMP_BRK:
-            brks[posLast] = LINEBREAK_NOBREAK;
-            if (lbcLast != LBP_SP)
-                continue;
-            break;
-        case PRH_BRK:
-            brks[posLast] = LINEBREAK_NOBREAK;
-            break;
-        }
-
-        lbcCur = lbcNew;
+        brks[posLast] = lb_process_next_char(&lbCtx, ch);
     }

     assert(posLast == posCur - 1 && posCur <= len);
|
|
|
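Note on the refactoring in this file: the per-character state that `set_linebreaks` used to keep in local variables now lives in `struct LineBreakContext`, so a caller can feed code points one at a time through `lb_init_break_context` and `lb_process_next_char`. The sketch below is a self-contained toy model of that pattern, not the library code. Every `Toy*`/`toy_*` name is hypothetical, only five classes are modelled, and the "break is allowed only after a space" fallback is an assumption standing in for the real pair-table lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy classes; the real enum LineBreakClass is much larger. */
enum ToyClass { TC_UNDEFINED, TC_BK, TC_CR, TC_LF, TC_SP, TC_AL };

/* Hypothetical results mirroring MUSTBREAK / ALLOWBREAK / NOBREAK. */
enum ToyBreak { TB_MUSTBREAK, TB_ALLOWBREAK, TB_NOBREAK };

/* Per-stream state, shaped like struct LineBreakContext in the diff. */
struct ToyContext {
    enum ToyClass cur;   /* class governing the current position */
    enum ToyClass next;  /* class of the character fed most recently */
    enum ToyClass last;  /* class of the character fed before that */
};

/* Assumption: a tiny classifier instead of the generated data tables. */
static enum ToyClass toy_classify(unsigned long ch)
{
    switch (ch) {
    case 0x0A: return TC_LF;
    case 0x0D: return TC_CR;
    case 0x0B: case 0x0C: return TC_BK;
    case 0x20: return TC_SP;
    default:   return TC_AL;
    }
}

/* First character of a line: LF acts as BK; a leading space is folded
 * into an ordinary class (the real code uses WJ for this). */
static void toy_first_char(struct ToyContext *ctx)
{
    if (ctx->cur == TC_LF)
        ctx->cur = TC_BK;
    else if (ctx->cur == TC_SP)
        ctx->cur = TC_AL;
}

void toy_init(struct ToyContext *ctx, unsigned long ch)
{
    assert(ctx != NULL);
    ctx->last = TC_UNDEFINED;
    ctx->next = TC_UNDEFINED;
    ctx->cur = toy_classify(ch);
    toy_first_char(ctx);
}

/* Feeds one character; returns the break status of the position before it. */
enum ToyBreak toy_next(struct ToyContext *ctx, unsigned long ch)
{
    ctx->last = ctx->next;
    ctx->next = toy_classify(ch);

    /* Mandatory break after BK, and after CR unless LF follows. */
    if (ctx->cur == TC_BK || (ctx->cur == TC_CR && ctx->next != TC_LF)) {
        ctx->cur = ctx->next;
        toy_first_char(ctx);
        return TB_MUSTBREAK;
    }
    switch (ctx->next) {
    case TC_SP:                      /* never break before a space */
        return TB_NOBREAK;
    case TC_BK: case TC_LF:          /* never break before a hard break */
        ctx->cur = TC_BK;
        return TB_NOBREAK;
    case TC_CR:
        ctx->cur = TC_CR;
        return TB_NOBREAK;
    default:
        break;
    }
    /* Stand-in for the pair table: allow a break only after a space. */
    ctx->cur = ctx->next;
    return (ctx->last == TC_SP) ? TB_ALLOWBREAK : TB_NOBREAK;
}

/* Batch wrapper built on the incremental calls; fills len - 1 entries. */
void toy_set_breaks(const char *s, size_t len, enum ToyBreak *brks)
{
    struct ToyContext ctx;
    size_t i;

    if (len == 0)
        return;
    toy_init(&ctx, (unsigned char)s[0]);
    for (i = 1; i < len; ++i)
        brks[i - 1] = toy_next(&ctx, (unsigned char)s[i]);
}
```

The batch wrapper is only a loop around the two incremental calls, which is the same shape `set_linebreaks` takes after this commit. A caller that receives text in pieces could keep one context alive and call the per-character function as data arrives.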
@ -1,4 +1,4 @@
|
|||
/* vim: set tabstop=4 shiftwidth=4: */
|
||||
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
|
||||
|
||||
/*
|
||||
* Line breaking in a Unicode sequence. Designed to be used in a
|
||||
|
|
|
@@ -1,6 +1,6 @@
 /* The content of this file is generated from:
-# LineBreak-6.2.0.txt
-# Date: 2012-08-08, 19:26:00 GMT [KW]
+# LineBreak-6.3.0.txt
+# Date: 2013-02-06, 19:45:00 GMT [KW, LI]
 */

 #include "linebreak.h"
@@ -114,7 +114,9 @@ struct LineBreakProperties lb_prop_default[] = {
     { 0x060C, 0x060D, LBP_IS },
     { 0x060E, 0x060F, LBP_AL },
     { 0x0610, 0x061A, LBP_CM },
-    { 0x061B, 0x061F, LBP_EX },
+    { 0x061B, 0x061B, LBP_EX },
+    { 0x061C, 0x061C, LBP_CM },
+    { 0x061E, 0x061F, LBP_EX },
     { 0x0620, 0x064A, LBP_AL },
     { 0x064B, 0x065F, LBP_CM },
     { 0x0660, 0x0669, LBP_NU },
@@ -456,7 +458,7 @@ struct LineBreakProperties lb_prop_default[] = {
     { 0x205D, 0x205F, LBP_BA },
     { 0x2060, 0x2060, LBP_WJ },
     { 0x2061, 0x2064, LBP_AL },
-    { 0x206A, 0x206F, LBP_CM },
+    { 0x2066, 0x206F, LBP_CM },
     { 0x2070, 0x2071, LBP_AL },
     { 0x2074, 0x2074, LBP_AI },
     { 0x2075, 0x207C, LBP_AL },
@@ -473,7 +475,7 @@ struct LineBreakProperties lb_prop_default[] = {
     { 0x20A7, 0x20A7, LBP_PO },
     { 0x20A8, 0x20B5, LBP_PR },
     { 0x20B6, 0x20B6, LBP_PO },
-    { 0x20B7, 0x20BA, LBP_PR },
+    { 0x20B7, 0x20CF, LBP_PR },
     { 0x20D0, 0x20F0, LBP_CM },
     { 0x2100, 0x2102, LBP_AL },
     { 0x2103, 0x2103, LBP_PO },
@@ -774,7 +776,8 @@ struct LineBreakProperties lb_prop_default[] = {
     { 0x2E33, 0x2E34, LBP_BA },
     { 0x2E35, 0x2E39, LBP_AL },
     { 0x2E3A, 0x2E3B, LBP_B2 },
-    { 0x2E80, 0x3000, LBP_ID },
+    { 0x2E80, 0x2FFB, LBP_ID },
+    { 0x3000, 0x3000, LBP_BA },
     { 0x3001, 0x3002, LBP_CL },
     { 0x3003, 0x3004, LBP_ID },
     { 0x3005, 0x3005, LBP_NS },
@@ -803,7 +806,9 @@ struct LineBreakProperties lb_prop_default[] = {
     { 0x301E, 0x301F, LBP_CL },
     { 0x3020, 0x3029, LBP_ID },
     { 0x302A, 0x302F, LBP_CM },
-    { 0x3030, 0x303A, LBP_ID },
+    { 0x3030, 0x3034, LBP_ID },
+    { 0x3035, 0x3035, LBP_CM },
+    { 0x3036, 0x303A, LBP_ID },
     { 0x303B, 0x303C, LBP_NS },
     { 0x303D, 0x303F, LBP_ID },
     { 0x3041, 0x3041, LBP_CJ },
|
|
|
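Note on the data changes in this file: each row is an inclusive code point range followed by its line break class, so regenerating from LineBreak-6.3.0.txt mostly splits or widens ranges. The sketch below shows how such a sorted range table can be queried and what the split around U+061C means for a lookup. It is a minimal illustration only: the `Toy*`/`toy_*` names are hypothetical, the rows are a handful copied from the 6.3.0 side of the hunks in this file, and the linear scan is an assumption, since the library's actual lookup routine is not part of this diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of classes; TOY_XX means "no row covers this". */
enum ToyLbClass { TOY_XX, TOY_AL, TOY_BA, TOY_CM, TOY_EX, TOY_ID, TOY_IS };

/* One row: inclusive code point range plus its class. */
struct ToyRange {
    unsigned long start;
    unsigned long end;
    enum ToyLbClass cls;
};

/* A few rows as they read after the update, kept in ascending order. */
static const struct ToyRange toy_table[] = {
    { 0x060C, 0x060D, TOY_IS },
    { 0x060E, 0x060F, TOY_AL },
    { 0x0610, 0x061A, TOY_CM },
    { 0x061B, 0x061B, TOY_EX },
    { 0x061C, 0x061C, TOY_CM },
    { 0x061E, 0x061F, TOY_EX },
    { 0x0620, 0x064A, TOY_AL },
    { 0x2E80, 0x2FFB, TOY_ID },
    { 0x3000, 0x3000, TOY_BA },
};

/* Linear scan over the sorted rows; stops early once past the target. */
enum ToyLbClass toy_lookup(unsigned long ch)
{
    size_t i;

    for (i = 0; i < sizeof toy_table / sizeof toy_table[0]; ++i) {
        if (ch < toy_table[i].start)
            break;              /* sorted table: no later row can match */
        if (ch <= toy_table[i].end)
            return toy_table[i].cls;
    }
    return TOY_XX;
}
```

With these rows, U+061C resolves to the combining-mark class instead of being swallowed by the old `0x061B..0x061F` exclamation range, and U+3000 resolves to its own single-entry row rather than the old `0x2E80..0x3000` ideographic range.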
@ -1,4 +1,4 @@
|
|||
/* vim: set tabstop=4 shiftwidth=4: */
|
||||
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
|
||||
|
||||
/*
|
||||
* Line breaking in a Unicode sequence. Designed to be used in a
|
||||
|
|
|
@ -1,10 +1,11 @@
|
|||
/* vim: set tabstop=4 shiftwidth=4: */
|
||||
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
|
||||
|
||||
/*
|
||||
* Line breaking in a Unicode sequence. Designed to be used in a
|
||||
* generic text renderer.
|
||||
*
|
||||
* Copyright (C) 2008-2012 Wu Yongwei <wuyongwei at gmail dot com>
|
||||
* Copyright (C) 2008-2013 Wu Yongwei <wuyongwei at gmail dot com>
|
||||
* Copyright (C) 2013 Petr Filipsky <philodej at gmail dot com>
|
||||
*
|
||||
* This software is provided 'as-is', without any express or implied
|
||||
* warranty. In no event will the author be held liable for any damages
|
||||
|
@@ -44,15 +45,16 @@
  * Definitions of internal data structures, declarations of global
  * variables, and function prototypes for the line breaking algorithm.
  *
- * @version 2.3, 2012/10/06
+ * @version 2.4, 2013/11/10
  * @author Wu Yongwei
+ * @author Petr Filipsky
  */

 /**
  * Constant value to mark the end of string. It is not a valid Unicode
  * character.
  */
-#define EOS 0xFFFF
+#define EOS 0xFFFFFFFF

 /**
  * Line break classes. This is a direct mapping of Table 1 of Unicode
@ -130,6 +132,19 @@ struct LineBreakPropertiesLang
|
|||
struct LineBreakProperties *lbp; /**< Pointer to associated data */
|
||||
};
|
||||
|
||||
/**
|
||||
* Context representing internal state of the line breaking algorithm.
|
||||
* This is useful to callers if incremental analysis is wanted.
|
||||
*/
|
||||
struct LineBreakContext
|
||||
{
|
||||
const char *lang; /**< Language name */
|
||||
struct LineBreakProperties *lbpLang;/**< Pointer to LineBreakProperties */
|
||||
enum LineBreakClass lbcCur; /**< Breaking class of current codepoint */
|
||||
enum LineBreakClass lbcNew; /**< Breaking class of next codepoint */
|
||||
enum LineBreakClass lbcLast; /**< Breaking class of last codepoint */
|
||||
};
|
||||
|
||||
/**
|
||||
* Abstract function interface for #lb_get_next_char_utf8,
|
||||
* #lb_get_next_char_utf16, and #lb_get_next_char_utf32.
|
||||
|
@ -144,6 +159,13 @@ extern struct LineBreakPropertiesLang lb_prop_lang_map[];
|
|||
utf32_t lb_get_next_char_utf8(const utf8_t *s, size_t len, size_t *ip);
|
||||
utf32_t lb_get_next_char_utf16(const utf16_t *s, size_t len, size_t *ip);
|
||||
utf32_t lb_get_next_char_utf32(const utf32_t *s, size_t len, size_t *ip);
|
||||
void lb_init_break_context(
|
||||
struct LineBreakContext* lbpCtx,
|
||||
utf32_t ch,
|
||||
const char* lang);
|
||||
int lb_process_next_char(
|
||||
struct LineBreakContext* lbpCtx,
|
||||
utf32_t ch);
|
||||
void set_linebreaks(
|
||||
const void *s,
|
||||
size_t len,
|
||||
|
|
|
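Note on the `EOS` change in this header: the commit does not state why the sentinel moved from `0xFFFF` to `0xFFFFFFFF`. One plausible reading is that `0xFFFF` is a value a 32-bit character reader can legitimately return (U+FFFF is a noncharacter, but it is still an encodable code point), while `0xFFFFFFFF` lies far above the highest code point, U+10FFFF. The sketch below illustrates that collision under those assumptions. The `toy_*` names and the simplified UTF-32 reader are hypothetical and are not the library's `lb_get_next_char_utf32`.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long toy_utf32;   /* hypothetical stand-in for utf32_t */

#define TOY_EOS_OLD 0xFFFFUL       /* sentinel value before this commit */
#define TOY_EOS_NEW 0xFFFFFFFFUL   /* sentinel value after this commit */

/* Returns the next code unit, or `eos` once the input is exhausted. */
static toy_utf32 toy_next_utf32(const toy_utf32 *s, size_t len,
                                size_t *ip, toy_utf32 eos)
{
    assert(s != NULL && ip != NULL);
    if (*ip == len)
        return eos;
    return s[(*ip)++];
}

/* Counts characters by reading until the sentinel comes back. */
size_t toy_count_chars(const toy_utf32 *s, size_t len, toy_utf32 eos)
{
    size_t pos = 0;
    size_t n = 0;

    while (toy_next_utf32(s, len, &pos, eos) != eos)
        ++n;
    return n;
}
```

For the three-element input `{ 0x41, 0xFFFF, 0x42 }`, a loop that compares against the old sentinel stops after the first character because the embedded U+FFFF looks like end of string, whereas the new sentinel lets all three characters through.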
@ -1,10 +1,10 @@
|
|||
/* vim: set tabstop=4 shiftwidth=4: */
|
||||
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
|
||||
|
||||
/*
|
||||
* Word breaking in a Unicode sequence. Designed to be used in a
|
||||
* generic text renderer.
|
||||
*
|
||||
* Copyright (C) 2012 Tom Hacohen <tom@stosb.com>
|
||||
* Copyright (C) 2013 Tom Hacohen <tom at stosb dot com>
|
||||
*
|
||||
* This software is provided 'as-is', without any express or implied
|
||||
* warranty. In no event will the author be held liable for any damages
|
||||
|
@ -30,6 +30,10 @@
|
|||
* Unicode 6.0.0:
|
||||
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
|
||||
*
|
||||
* This library has been updated according to Revision 21, for
|
||||
* Unicode 6.2.0:
|
||||
* <URL:http://www.unicode.org/reports/tr29/tr29-21.html>
|
||||
*
|
||||
* The Unicode Terms of Use are available at
|
||||
* <URL:http://www.unicode.org/copyright.html>
|
||||
*/
|
||||
|
@ -40,7 +44,7 @@
|
|||
* Implementation of the word breaking algorithm as described in Unicode
|
||||
* Standard Annex 29.
|
||||
*
|
||||
* @version 2.3, 2013/05/14
|
||||
* @version 2.4, 2013/09/28
|
||||
* @author Tom Hacohen
|
||||
*/
|
||||
|
||||
|
|
|
@ -1,10 +1,10 @@
|
|||
/* vim: set tabstop=4 shiftwidth=4: */
|
||||
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
|
||||
|
||||
/*
|
||||
* Word breaking in a Unicode sequence. Designed to be used in a
|
||||
* generic text renderer.
|
||||
*
|
||||
* Copyright (C) 2012 Tom Hacohen <tom@stosb.com>
|
||||
* Copyright (C) 2013 Tom Hacohen <tom at stosb dot com>
|
||||
*
|
||||
* This software is provided 'as-is', without any express or implied
|
||||
* warranty. In no event will the author be held liable for any damages
|
||||
|
@ -30,6 +30,10 @@
|
|||
* Unicode 6.0.0:
|
||||
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
|
||||
*
|
||||
* This library has been updated according to Revision 21, for
|
||||
* Unicode 6.2.0:
|
||||
* <URL:http://www.unicode.org/reports/tr29/tr29-21.html>
|
||||
*
|
||||
* The Unicode Terms of Use are available at
|
||||
* <URL:http://www.unicode.org/copyright.html>
|
||||
*/
|
||||
|
@ -39,7 +43,7 @@
|
|||
*
|
||||
* Header file for the word breaking (segmentation) algorithm.
|
||||
*
|
||||
* @version 2.2, 2012/02/04
|
||||
* @version 2.3, 2013/09/28
|
||||
* @author Tom Hacohen
|
||||
*/
|
||||
|
||||
|
|
|
@ -1,10 +1,11 @@
|
|||
/* vim: set tabstop=4 shiftwidth=4: */
|
||||
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
|
||||
|
||||
/*
|
||||
* Word breaking in a Unicode sequence. Designed to be used in a
|
||||
* generic text renderer.
|
||||
*
|
||||
* Copyright (C) 2012 Tom Hacohen <tom@stosb.com>
|
||||
* Copyright (C) 2013 Tom Hacohen <tom at stosb dot com>
|
||||
* Copyright (C) 2013 Petr Filipsky <philodej at gmail dot com>
|
||||
*
|
||||
* This software is provided 'as-is', without any express or implied
|
||||
* warranty. In no event will the author be held liable for any damages
|
||||
|
@ -30,6 +31,10 @@
|
|||
* Unicode 6.0.0:
|
||||
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
|
||||
*
|
||||
* This library has been updated according to Revision 21, for
|
||||
* Unicode 6.2.0:
|
||||
* <URL:http://www.unicode.org/reports/tr29/tr29-21.html>
|
||||
*
|
||||
* The Unicode Terms of Use are available at
|
||||
* <URL:http://www.unicode.org/copyright.html>
|
||||
*/
|
||||
|
@@ -40,13 +45,14 @@
  * Definitions of internal data structures, declarations of global
  * variables, and function prototypes for the word breaking algorithm.
  *
- * @version 2.2, 2013/05/14
+ * @version 2.4, 2013/11/10
  * @author Tom Hacohen
+ * @author Petr Filipsky
  */

 /**
  * Word break classes. This is a direct mapping of Table 3 of Unicode
- * Standard Annex 29, Revision 17.
+ * Standard Annex 29, Revision 23.
  */
 enum WordBreakClass
 {
@@ -64,6 +70,9 @@ enum WordBreakClass
     WBP_Numeric,
     WBP_ExtendNumLet,
     WBP_Regional,
+    WBP_Hebrew,
+    WBP_Single,
+    WBP_Double,
     WBP_Any
 };
