Static deps unibreak: update to what will soon be version 3.

Version 3 has not been released yet, but this snapshot is on track to become it.
This is based on commit: a815e11f7ebf35b59278f783227a829ee4692760.

@feature.
Tom Hacohen 2015-05-07 10:53:11 +01:00
parent ba77a837a3
commit 7a49d23f90
17 changed files with 952 additions and 303 deletions


@ -90,6 +90,8 @@ lib/evas/canvas/evas_vg_private.h
# Linebreak
noinst_HEADERS += \
static_libs/libunibreak/unibreakbase.h \
static_libs/libunibreak/unibreakdef.h \
static_libs/libunibreak/linebreak.h \
static_libs/libunibreak/linebreakdef.h \
static_libs/libunibreak/wordbreakdef.h \
@ -98,6 +100,8 @@ static_libs/libunibreak/wordbreakdata.c
# Linebreak
lib_evas_libevas_la_SOURCES = \
static_libs/libunibreak/unibreakbase.c \
static_libs/libunibreak/unibreakdef.c \
static_libs/libunibreak/linebreak.c \
static_libs/libunibreak/linebreakdata.c \
static_libs/libunibreak/linebreakdef.c \


@ -1,3 +1,167 @@
2015-04-19 Wu Yongwei <wuyongwei@gmail.com>
* LICENCE: Update copyright information.
2015-04-19 Wu Yongwei <wuyongwei@gmail.com>
* src/linebreakdata2.tmp: Remove the unnecessary inclusion of
"linebreak.h".
* src/linebreakdata.c: Ditto.
2015-04-19 Wu Yongwei <wuyongwei@gmail.com>
Use extended regexp to simplify expressions.
* src/LineBreak1.sed: Simplify with extended regexp.
* src/LineBreak2.sed: Ditto.
* src/Makefile.am: Add `-E' to the command line of sed.
2015-04-19 Wu Yongwei <wuyongwei@gmail.com>
Make further clean-up for the 3.0 release.
* configure.ac (AC_INIT): Change the library version to `3.0'.
* Doxyfile (PROJECT_NUMBER): Change to `3.0'.
(EXCLUDE): Add the missing `src/' before `filter_dup.c'.
* src/wordbreakdata1.tmpl: Remove the inclusion of "linebreak.h".
* src/wordbreakdata.c: Ditto.
2015-04-19 Wu Yongwei <wuyongwei@gmail.com>
* src/wordbreakdef.h: Include "unibreakdef.h".
2015-04-19 Wu Yongwei <wuyongwei@gmail.com>
* purge: Make it remove `compile'.
2015-04-18 Wu Yongwei <wuyongwei@gmail.com>
* src/unibreakdef.c: New file.
* src/unibreakdef.h: New file.
* src/wordbreak.c: Rename reference to `lb_get_next_char...' to
`ub_get_next_char...'.
* src/linebreak.c: Ditto.
(lb_get_next_char_utf8): Remove definition.
(lb_get_next_char_utf16): Ditto.
(lb_get_next_char_utf32): Ditto.
* src/linebreakdef.h: Include "unibreakdef.h".
(EOS): Remove definition.
(get_next_char_t): Remove typedef.
(lb_get_next_char_utf8): Remove declaration.
(lb_get_next_char_utf16): Ditto.
(lb_get_next_char_utf32): Ditto.
* src/Makefile.am (include_HEADERS): Add `unibreakdef.h'.
(libunibreak_la_SOURCES): Add `unibreakdef.c'.
(libunibreak_la_CFLAGS): Define to `-W -Wall'.
2015-04-18 Wu Yongwei <wuyongwei@gmail.com>
* src/unibreakbase.c: New file.
* src/unibreakbase.h: New file.
* src/linebreak.c (linebreak_version): Remove definition.
* src/linebreak.h: Include "unibreakbase.h".
(linebreak_version): Remove declaration.
(LINEBREAK_VERSION): Remove definition.
(utf8_t): Remove typedef.
(utf16_t): Remove typedef.
(utf32_t): Remove typedef.
* src/wordbreak.h: Include "unibreakbase.h" instead of
"linebreak.h".
* src/Makefile.am (include_HEADERS): Add `unibreakbase.h'.
(libunibreak_la_SOURCES): Add `unibreakbase.c'.
(libunibreak_la_LDFLAGS): Set the version-info to `3:0:0'.
2015-04-13 Wu Yongwei <wuyongwei@gmail.com>
* src/wordbreak.c: Update copyright and version information.
* src/wordbreak.h: Ditto.
* src/wordbreakdef.h: Ditto.
2015-04-13 Tom Hacohen <tom@stosb.com>
* src/wordbreakdef.h (enum WordBreakClass): Clean up and reorder.
2015-04-10 Tom Hacohen <tom@stosb.com>
Don't ship internal header.
* src/Makefile.am (include_HEADERS): Remove `wordbreakdef.h'.
(EXTRA_DIST): Add `wordbreakdef.h'.
2015-04-10 Tom Hacohen <tom@stosb.com>
Update files according to UAX #29-29, for Unicode 7.0.0.
* src/wordbreak.c (set_wordbreaks): Take care of Hebrew letters.
* src/wordbreakdata.h (enum WordBreakClass): Add WBP_Hebrew_Letter,
WBP_Single_Quote, and WBP_Double_Quote.
* src/wordbreakdata.c: Regenerate from WordBreakProperty-7.0.0.txt.
2015-04-10 Tom Hacohen <tom@stosb.com>
* src/sort_numeric_hex.py: Fix compatibility issue with new Python.
* src/Makefile.am (wordbreakdata): Fix word break data enum for
names with underscores.
* src/wordbreakdef.h (enum WordBreakClass): Correct WBP_Regional to
WBP_Regional_Indicator.
* src/wordbreak.c: Ditto.
* src/wordbreakdata.c: Ditto.
2015-04-05 Wu Yongwei <wuyongwei@gmail.com>
* src/linebreak.c: Make pointer alignment consistent.
* src/linebreak.h: Ditto.
* src/linebreakdef.h: Ditto.
2015-04-05 Wu Yongwei <wuyongwei@gmail.com>
* src/linebreak.h: Update copyright year and UAX information.
* src/linebreakdef.c: Ditto.
2015-04-05 Wu Yongwei <wuyongwei@gmail.com>
Implement rule LB21a, as introduced by Revision 28 of UAX #14.
* src/linebreakdef.h (struct LineBreakContext): Add new field
fLb21aHebrew.
* src/linebreak.c (treat_first_char): Initialize fLb21aHebrew
properly.
(lb_init_break_context): Clear fLb21aHebrew.
(get_lb_result_lookup): Apply rule LB21a and update fLb21aHebrew.
2014-12-06 Mikhail Polubisok <mpolubisok@gmail.com>
* src/linebreak.c (get_lb_result_lookup): Extend assertion condition
that has been wrong since Unicode 6.2.
2014-09-19 Petr Filipsky <philodej@gmail.com>
* src/LineBreak1.sed: Fix sed expression due to changed
LineBreak.txt file format.
2014-05-24 Wu Yongwei <wuyongwei@gmail.com>
* src/Makefile.gcc (TARGET): Change from `liblinebreak.a' to
`libunibreak.a'.
2014-05-23 Christoph Junghans <junghans@votca.org>
Fix `make install DESTDIR=...'.
* Makefile.am (install-exec-hook): Prefix `$(DESTDIR)/' before
`${libdir}'.
2014-02-16 Wu Yongwei <wuyongwei@gmail.com>
Following https://people.gnome.org/~walters/docs/build-api.txt, add
a quasi-standard autogen.sh, which generates `configure' and runs it
optionally.
* autogen.sh: New file.
2014-02-12 Wu Yongwei <wuyongwei@gmail.com>
* bootstrap: Remove the overkill bits and add back autoreconf.
* purge: Ensure config.cache is removed.
2014-02-10 Tom Hacohen <tom@stosb.com>
* bootstrap: Solve bootstrap problems found on Linux and Mac (thanks
to Nick Shvelidze and Christopher Baker).
2013-11-14 Wu Yongwei <wuyongwei@gmail.com>
* src/linebreak.c: Add/update comments and doc comments.


@ -1,5 +1,6 @@
Copyright (C) 2008-2012 Wu Yongwei <wuyongwei at gmail dot com>
Copyright (C) 2012 Tom Hacohen <tom dot hacohen at samsung dot com>
Copyright (C) 2008-2015 Wu Yongwei <wuyongwei at gmail dot com>
Copyright (C) 2012-2015 Tom Hacohen <tom at stosb dot com>
Copyright (C) 2013 Petr Filipsky <philodej at gmail dot com>
This software is provided 'as-is', without any express or implied
warranty. In no event will the author be held liable for any damages


@ -4,7 +4,7 @@
* Line breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2008-2013 Wu Yongwei <wuyongwei at gmail dot com>
* Copyright (C) 2008-2015 Wu Yongwei <wuyongwei at gmail dot com>
* Copyright (C) 2013 Petr Filipsky <philodej at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
@ -31,9 +31,9 @@
* Unicode 5.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
*
* This library has been updated according to Revision 30, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-30.html>
* This library has been updated according to Revision 33, for
* Unicode 7.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-33.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
@ -45,7 +45,7 @@
* Implementation of the line breaking algorithm as described in Unicode
* Standard Annex 14.
*
* @version 2.5, 2013/11/14
* @version 2.7, 2015/04/18
* @author Wu Yongwei
* @author Petr Filipsky
*/
@ -66,11 +66,6 @@
*/
#define LINEBREAK_INDEX_SIZE 40
/**
* Version number of the library.
*/
const int linebreak_version = LINEBREAK_VERSION;
/**
* Enumeration of break actions. They are used in the break action
* pair table below.
@ -451,7 +446,7 @@ static enum LineBreakClass resolve_lb_class(
* @post \a lbpCtx->lbcCur has the updated line break class
*/
static void treat_first_char(
struct LineBreakContext* lbpCtx)
struct LineBreakContext *lbpCtx)
{
switch (lbpCtx->lbcCur)
{
@ -465,6 +460,8 @@ static void treat_first_char(
case LBP_SP:
lbpCtx->lbcCur = LBP_WJ; /* Leading space treated as WJ */
break;
case LBP_HL:
lbpCtx->fLb21aHebrew = 1; /* Rule LB21a */
default:
break;
}
@ -485,7 +482,7 @@ static void treat_first_char(
* table lookup is needed
*/
static int get_lb_result_simple(
struct LineBreakContext* lbpCtx)
struct LineBreakContext *lbpCtx)
{
if (lbpCtx->lbcCur == LBP_BK
|| (lbpCtx->lbcCur == LBP_CR && lbpCtx->lbcNew != LBP_LF))
@ -528,13 +525,12 @@ static int get_lb_result_simple(
* #LINEBREAK_ALLOWBREAK, and #LINEBREAK_NOBREAK
*/
static int get_lb_result_lookup(
struct LineBreakContext* lbpCtx)
struct LineBreakContext *lbpCtx)
{
/* TODO: Rule LB21a, as introduced by Revision 28 of UAX#14, is not
* yet implemented below. */
int brk = LINEBREAK_UNDEFINED;
assert(lbpCtx->lbcCur <= LBP_JT);
assert(lbpCtx->lbcNew <= LBP_JT);
assert(lbpCtx->lbcCur <= LBP_RI);
assert(lbpCtx->lbcNew <= LBP_RI);
switch (baTable[lbpCtx->lbcCur - 1][lbpCtx->lbcNew - 1])
{
case DIR_BRK:
@ -555,6 +551,19 @@ static int get_lb_result_lookup(
brk = LINEBREAK_NOBREAK;
break;
}
/* Special processing due to rule LB21a */
if (lbpCtx->fLb21aHebrew &&
(lbpCtx->lbcCur == LBP_HY || lbpCtx->lbcCur == LBP_BA))
{
brk = LINEBREAK_NOBREAK;
lbpCtx->fLb21aHebrew = 0;
}
else if (!(lbpCtx->lbcNew == LBP_HY || lbpCtx->lbcNew == LBP_BA))
{
lbpCtx->fLb21aHebrew = (lbpCtx->lbcNew == LBP_HL);
}
lbpCtx->lbcCur = lbpCtx->lbcNew;
return brk;
}
@ -568,9 +577,9 @@ static int get_lb_result_lookup(
* @post the line breaking context is initialized
*/
void lb_init_break_context(
struct LineBreakContext* lbpCtx,
struct LineBreakContext *lbpCtx,
utf32_t ch,
const char* lang)
const char *lang)
{
lbpCtx->lang = lang;
lbpCtx->lbpLang = get_lb_prop_lang(lang);
@ -579,6 +588,7 @@ void lb_init_break_context(
lbpCtx->lbcCur = resolve_lb_class(
get_char_lb_class_lang(ch, lbpCtx->lbpLang),
lbpCtx->lang);
lbpCtx->fLb21aHebrew = 0;
treat_first_char(lbpCtx);
}
@ -593,7 +603,7 @@ void lb_init_break_context(
* @post the line breaking context is updated
*/
int lb_process_next_char(
struct LineBreakContext* lbpCtx,
struct LineBreakContext *lbpCtx,
utf32_t ch )
{
int brk;
@ -617,127 +627,6 @@ int lb_process_next_char(
return brk;
}
/**
* Gets the next Unicode character in a UTF-8 sequence. The index will
* be advanced to the next complete character, unless the end of string
* is reached in the middle of a UTF-8 sequence.
*
* @param[in] s input UTF-8 string
* @param[in] len length of the string in bytes
* @param[in,out] ip pointer to the index
* @return the Unicode character beginning at the index; or
* #EOS if end of input is encountered
*/
utf32_t lb_get_next_char_utf8(
const utf8_t *s,
size_t len,
size_t *ip)
{
utf8_t ch;
utf32_t res;
assert(*ip <= len);
if (*ip == len)
return EOS;
ch = s[*ip];
if (ch < 0xC2 || ch > 0xF4)
{ /* One-byte sequence, tail (should not occur), or invalid */
*ip += 1;
return ch;
}
else if (ch < 0xE0)
{ /* Two-byte sequence */
if (*ip + 2 > len)
return EOS;
res = ((ch & 0x1F) << 6) + (s[*ip + 1] & 0x3F);
*ip += 2;
return res;
}
else if (ch < 0xF0)
{ /* Three-byte sequence */
if (*ip + 3 > len)
return EOS;
res = ((ch & 0x0F) << 12) +
((s[*ip + 1] & 0x3F) << 6) +
((s[*ip + 2] & 0x3F));
*ip += 3;
return res;
}
else
{ /* Four-byte sequence */
if (*ip + 4 > len)
return EOS;
res = ((ch & 0x07) << 18) +
((s[*ip + 1] & 0x3F) << 12) +
((s[*ip + 2] & 0x3F) << 6) +
((s[*ip + 3] & 0x3F));
*ip += 4;
return res;
}
}
/**
* Gets the next Unicode character in a UTF-16 sequence. The index will
* be advanced to the next complete character, unless the end of string
* is reached in the middle of a UTF-16 surrogate pair.
*
* @param[in] s input UTF-16 string
* @param[in] len length of the string in words
* @param[in,out] ip pointer to the index
* @return the Unicode character beginning at the index; or
* #EOS if end of input is encountered
*/
utf32_t lb_get_next_char_utf16(
const utf16_t *s,
size_t len,
size_t *ip)
{
utf16_t ch;
assert(*ip <= len);
if (*ip == len)
return EOS;
ch = s[(*ip)++];
if (ch < 0xD800 || ch > 0xDBFF)
{ /* If the character is not a high surrogate */
return ch;
}
if (*ip == len)
{ /* If the input ends here (an error) */
--(*ip);
return EOS;
}
if (s[*ip] < 0xDC00 || s[*ip] > 0xDFFF)
{ /* If the next character is not the low surrogate (an error) */
return ch;
}
/* Return the constructed character and advance the index again */
return (((utf32_t)ch & 0x3FF) << 10) + (s[(*ip)++] & 0x3FF) + 0x10000;
}
/**
* Gets the next Unicode character in a UTF-32 sequence. The index will
* be advanced to the next character.
*
* @param[in] s input UTF-32 string
* @param[in] len length of the string in dwords
* @param[in,out] ip pointer to the index
* @return the Unicode character beginning at the index; or
* #EOS if end of input is encountered
*/
utf32_t lb_get_next_char_utf32(
const utf32_t *s,
size_t len,
size_t *ip)
{
assert(*ip <= len);
if (*ip == len)
return EOS;
return s[(*ip)++];
}
/**
* Sets the line breaking information for a generic input string.
*
@ -809,7 +698,7 @@ void set_linebreaks_utf8(
char *brks)
{
set_linebreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf8);
(get_next_char_t)ub_get_next_char_utf8);
}
/**
@ -829,7 +718,7 @@ void set_linebreaks_utf16(
char *brks)
{
set_linebreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf16);
(get_next_char_t)ub_get_next_char_utf16);
}
/**
@ -849,7 +738,7 @@ void set_linebreaks_utf32(
char *brks)
{
set_linebreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf32);
(get_next_char_t)ub_get_next_char_utf32);
}
/**
@ -868,7 +757,7 @@ void set_linebreaks_utf32(
int is_line_breakable(
utf32_t char1,
utf32_t char2,
const char* lang)
const char *lang)
{
utf32_t s[2];
char brks[2];
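
The LB21a handling added to `get_lb_result_lookup` above can be read as a small state machine: a flag remembers that a Hebrew letter was seen, survives a following hyphen (HY) or break-after (BA) character, and then suppresses the break right after that hyphen. The following is a minimal sketch of that logic in isolation, not the library's code. The names `SketchClass`, `SketchContext`, `sketch_init`, and `sketch_lb21a_forbids_break` are hypothetical, and the pair-table lookup and all other rules are left out.

```c
#include <assert.h>

/* Hypothetical, reduced set of line break classes for illustration. */
enum SketchClass { SK_AL, SK_HL, SK_HY, SK_BA };

/* Hypothetical context: only the fields needed for rule LB21a. */
struct SketchContext {
    enum SketchClass cur;   /* class of the current code point */
    int after_hebrew;       /* plays the role of fLb21aHebrew in the diff */
};

/* Start a context on the first character (compare treat_first_char). */
void sketch_init(struct SketchContext *ctx, enum SketchClass first)
{
    assert(ctx != 0);
    ctx->cur = first;
    ctx->after_hebrew = (first == SK_HL);
}

/*
 * Feed the class of the next code point.  Returns 1 when rule LB21a
 * forbids a break between the current and the next code point, else 0.
 */
int sketch_lb21a_forbids_break(struct SketchContext *ctx,
                               enum SketchClass next)
{
    int forbidden = 0;

    assert(ctx != 0);
    if (ctx->after_hebrew && (ctx->cur == SK_HY || ctx->cur == SK_BA)) {
        /* HL (HY | BA) x : no break after the hyphen. */
        forbidden = 1;
        ctx->after_hebrew = 0;
    } else if (!(next == SK_HY || next == SK_BA)) {
        /* Keep the flag across a pending HY/BA; otherwise refresh it. */
        ctx->after_hebrew = (next == SK_HL);
    }
    ctx->cur = next;
    return forbidden;
}
```

Feeding the classes HL, HY, AL reports a forbidden break only at the last step, while AL, HY, AL never does, which is the difference the new `fLb21aHebrew` field is meant to capture.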


@ -4,7 +4,7 @@
* Line breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2008-2012 Wu Yongwei <wuyongwei at gmail dot com>
* Copyright (C) 2008-2015 Wu Yongwei <wuyongwei at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
@ -30,9 +30,9 @@
* Unicode 5.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
*
* This library has been updated according to Revision 30, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-30.html>
* This library has been updated according to Revision 33, for
* Unicode 7.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-33.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
@ -43,7 +43,7 @@
*
* Header file for the line breaking algorithm.
*
* @version 2.2, 2012/10/06
* @version 2.4, 2015/04/18
* @author Wu Yongwei
*/
@ -51,21 +51,12 @@
#define LINEBREAK_H
#include <stddef.h>
#include "unibreakbase.h"
#ifdef __cplusplus
extern "C" {
#endif
#define LINEBREAK_VERSION 0x0202 /**< Version of the library linebreak */
extern const int linebreak_version;
#ifndef LINEBREAK_UTF_TYPES_DEFINED
#define LINEBREAK_UTF_TYPES_DEFINED
typedef unsigned char utf8_t; /**< Type for UTF-8 data points */
typedef unsigned short utf16_t; /**< Type for UTF-16 data points */
typedef unsigned int utf32_t; /**< Type for UTF-32 data points */
#endif
#define LINEBREAK_MUSTBREAK 0 /**< Break is mandatory */
#define LINEBREAK_ALLOWBREAK 1 /**< Break is allowed */
#define LINEBREAK_NOBREAK 2 /**< No break is possible */
@ -73,12 +64,12 @@ typedef unsigned int utf32_t; /**< Type for UTF-32 data points */
void init_linebreak(void);
void set_linebreaks_utf8(
const utf8_t *s, size_t len, const char* lang, char *brks);
const utf8_t *s, size_t len, const char *lang, char *brks);
void set_linebreaks_utf16(
const utf16_t *s, size_t len, const char* lang, char *brks);
const utf16_t *s, size_t len, const char *lang, char *brks);
void set_linebreaks_utf32(
const utf32_t *s, size_t len, const char* lang, char *brks);
int is_line_breakable(utf32_t char1, utf32_t char2, const char* lang);
const utf32_t *s, size_t len, const char *lang, char *brks);
int is_line_breakable(utf32_t char1, utf32_t char2, const char *lang);
#ifdef __cplusplus
}


@ -1,9 +1,8 @@
/* The content of this file is generated from:
# LineBreak-6.3.0.txt
# Date: 2013-02-06, 19:45:00 GMT [KW, LI]
# LineBreak-7.0.0.txt
# Date: 2014-02-28, 23:15:00 GMT [KW, LI]
*/
#include "linebreak.h"
#include "linebreakdef.h"
/** Default line breaking properties as from the Unicode Web site. */
@ -93,11 +92,12 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x0363, 0x036F, LBP_CM },
{ 0x0370, 0x037D, LBP_AL },
{ 0x037E, 0x037E, LBP_IS },
{ 0x0384, 0x0482, LBP_AL },
{ 0x037F, 0x0482, LBP_AL },
{ 0x0483, 0x0489, LBP_CM },
{ 0x048A, 0x0587, LBP_AL },
{ 0x0589, 0x0589, LBP_IS },
{ 0x058A, 0x058A, LBP_BA },
{ 0x058D, 0x058E, LBP_AL },
{ 0x058F, 0x058F, LBP_PR },
{ 0x0591, 0x05BD, LBP_CM },
{ 0x05BE, 0x05BE, LBP_BA },
@ -159,7 +159,7 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x0829, 0x082D, LBP_CM },
{ 0x0830, 0x0858, LBP_AL },
{ 0x0859, 0x085B, LBP_CM },
{ 0x085E, 0x08AC, LBP_AL },
{ 0x085E, 0x08B2, LBP_AL },
{ 0x08E4, 0x0903, LBP_CM },
{ 0x0904, 0x0939, LBP_AL },
{ 0x093A, 0x093C, LBP_CM },
@ -171,7 +171,7 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x0962, 0x0963, LBP_CM },
{ 0x0964, 0x0965, LBP_BA },
{ 0x0966, 0x096F, LBP_NU },
{ 0x0970, 0x097F, LBP_AL },
{ 0x0970, 0x0980, LBP_AL },
{ 0x0981, 0x0983, LBP_CM },
{ 0x0985, 0x09B9, LBP_AL },
{ 0x09BC, 0x09BC, LBP_CM },
@ -223,14 +223,14 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x0BF0, 0x0BF8, LBP_AL },
{ 0x0BF9, 0x0BF9, LBP_PR },
{ 0x0BFA, 0x0BFA, LBP_AL },
{ 0x0C01, 0x0C03, LBP_CM },
{ 0x0C00, 0x0C03, LBP_CM },
{ 0x0C05, 0x0C3D, LBP_AL },
{ 0x0C3E, 0x0C56, LBP_CM },
{ 0x0C58, 0x0C61, LBP_AL },
{ 0x0C62, 0x0C63, LBP_CM },
{ 0x0C66, 0x0C6F, LBP_NU },
{ 0x0C78, 0x0C7F, LBP_AL },
{ 0x0C82, 0x0C83, LBP_CM },
{ 0x0C81, 0x0C83, LBP_CM },
{ 0x0C85, 0x0CB9, LBP_AL },
{ 0x0CBC, 0x0CBC, LBP_CM },
{ 0x0CBD, 0x0CBD, LBP_AL },
@ -239,7 +239,7 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x0CE2, 0x0CE3, LBP_CM },
{ 0x0CE6, 0x0CEF, LBP_NU },
{ 0x0CF1, 0x0CF2, LBP_AL },
{ 0x0D02, 0x0D03, LBP_CM },
{ 0x0D01, 0x0D03, LBP_CM },
{ 0x0D05, 0x0D3D, LBP_AL },
{ 0x0D3E, 0x0D4D, LBP_CM },
{ 0x0D4E, 0x0D4E, LBP_AL },
@ -252,7 +252,9 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x0D7A, 0x0D7F, LBP_AL },
{ 0x0D82, 0x0D83, LBP_CM },
{ 0x0D85, 0x0DC6, LBP_AL },
{ 0x0DCA, 0x0DF3, LBP_CM },
{ 0x0DCA, 0x0DDF, LBP_CM },
{ 0x0DE6, 0x0DEF, LBP_NU },
{ 0x0DF2, 0x0DF3, LBP_CM },
{ 0x0DF4, 0x0DF4, LBP_AL },
{ 0x0E01, 0x0E3A, LBP_SA },
{ 0x0E3F, 0x0E3F, LBP_PR },
@ -363,7 +365,7 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x1810, 0x1819, LBP_NU },
{ 0x1820, 0x18A8, LBP_AL },
{ 0x18A9, 0x18A9, LBP_CM },
{ 0x18AA, 0x191C, LBP_AL },
{ 0x18AA, 0x191E, LBP_AL },
{ 0x1920, 0x193B, LBP_CM },
{ 0x1940, 0x1940, LBP_AL },
{ 0x1944, 0x1945, LBP_EX },
@ -378,7 +380,7 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x1A7F, 0x1A7F, LBP_CM },
{ 0x1A80, 0x1A99, LBP_NU },
{ 0x1AA0, 0x1AAD, LBP_SA },
{ 0x1B00, 0x1B04, LBP_CM },
{ 0x1AB0, 0x1B04, LBP_CM },
{ 0x1B05, 0x1B33, LBP_AL },
{ 0x1B34, 0x1B44, LBP_CM },
{ 0x1B45, 0x1B4B, LBP_AL },
@ -412,7 +414,9 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x1CED, 0x1CED, LBP_CM },
{ 0x1CEE, 0x1CF1, LBP_AL },
{ 0x1CF2, 0x1CF4, LBP_CM },
{ 0x1CF5, 0x1DBF, LBP_AL },
{ 0x1CF5, 0x1CF6, LBP_AL },
{ 0x1CF8, 0x1CF9, LBP_CM },
{ 0x1D00, 0x1DBF, LBP_AL },
{ 0x1DC0, 0x1DFF, LBP_CM },
{ 0x1E00, 0x1FFC, LBP_AL },
{ 0x1FFD, 0x1FFD, LBP_BB },
@ -475,7 +479,9 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x20A7, 0x20A7, LBP_PO },
{ 0x20A8, 0x20B5, LBP_PR },
{ 0x20B6, 0x20B6, LBP_PO },
{ 0x20B7, 0x20CF, LBP_PR },
{ 0x20B7, 0x20BA, LBP_PR },
{ 0x20BB, 0x20BB, LBP_PO },
{ 0x20BC, 0x20CF, LBP_PR },
{ 0x20D0, 0x20F0, LBP_CM },
{ 0x2100, 0x2102, LBP_AL },
{ 0x2103, 0x2103, LBP_PO },
@ -564,7 +570,12 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x22A5, 0x22A5, LBP_AI },
{ 0x22A6, 0x22BE, LBP_AL },
{ 0x22BF, 0x22BF, LBP_AI },
{ 0x22C0, 0x2311, LBP_AL },
{ 0x22C0, 0x2307, LBP_AL },
{ 0x2308, 0x2308, LBP_OP },
{ 0x2309, 0x2309, LBP_CL },
{ 0x230A, 0x230A, LBP_OP },
{ 0x230B, 0x230B, LBP_CL },
{ 0x230C, 0x2311, LBP_AL },
{ 0x2312, 0x2312, LBP_AI },
{ 0x2313, 0x2319, LBP_AL },
{ 0x231A, 0x231B, LBP_ID },
@ -573,7 +584,7 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x232A, 0x232A, LBP_CL },
{ 0x232B, 0x23EF, LBP_AL },
{ 0x23F0, 0x23F3, LBP_ID },
{ 0x2400, 0x244A, LBP_AL },
{ 0x23F4, 0x244A, LBP_AL },
{ 0x2460, 0x24FE, LBP_AI },
{ 0x24FF, 0x24FF, LBP_AL },
{ 0x2500, 0x254B, LBP_AI },
@ -671,8 +682,8 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x270E, 0x2756, LBP_AL },
{ 0x2757, 0x2757, LBP_AI },
{ 0x2758, 0x275A, LBP_AL },
{ 0x275B, 0x275E, LBP_QU },
{ 0x275F, 0x2761, LBP_AL },
{ 0x275B, 0x2760, LBP_QU },
{ 0x2761, 0x2761, LBP_AL },
{ 0x2762, 0x2763, LBP_EX },
{ 0x2764, 0x2767, LBP_AL },
{ 0x2768, 0x2768, LBP_OP },
@ -737,7 +748,7 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x29FD, 0x29FD, LBP_CL },
{ 0x29FE, 0x2B54, LBP_AL },
{ 0x2B55, 0x2B59, LBP_AI },
{ 0x2C00, 0x2CEE, LBP_AL },
{ 0x2B5A, 0x2CEE, LBP_AL },
{ 0x2CEF, 0x2CF1, LBP_CM },
{ 0x2CF2, 0x2CF3, LBP_AL },
{ 0x2CF9, 0x2CF9, LBP_EX },
@ -776,6 +787,10 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x2E33, 0x2E34, LBP_BA },
{ 0x2E35, 0x2E39, LBP_AL },
{ 0x2E3A, 0x2E3B, LBP_B2 },
{ 0x2E3C, 0x2E3E, LBP_BA },
{ 0x2E3F, 0x2E3F, LBP_AL },
{ 0x2E40, 0x2E41, LBP_BA },
{ 0x2E42, 0x2E42, LBP_OP },
{ 0x2E80, 0x2FFB, LBP_ID },
{ 0x3000, 0x3000, LBP_BA },
{ 0x3001, 0x3002, LBP_CL },
@ -882,7 +897,7 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0xA66F, 0xA672, LBP_CM },
{ 0xA673, 0xA673, LBP_AL },
{ 0xA674, 0xA67D, LBP_CM },
{ 0xA67E, 0xA697, LBP_AL },
{ 0xA67E, 0xA69D, LBP_AL },
{ 0xA69F, 0xA69F, LBP_CM },
{ 0xA6A0, 0xA6EF, LBP_AL },
{ 0xA6F0, 0xA6F1, LBP_CM },
@ -923,7 +938,11 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0xA9C7, 0xA9C9, LBP_BA },
{ 0xA9CA, 0xA9CF, LBP_AL },
{ 0xA9D0, 0xA9D9, LBP_NU },
{ 0xA9DE, 0xAA28, LBP_AL },
{ 0xA9DE, 0xA9DF, LBP_AL },
{ 0xA9E0, 0xA9EF, LBP_SA },
{ 0xA9F0, 0xA9F9, LBP_NU },
{ 0xA9FA, 0xA9FE, LBP_SA },
{ 0xAA00, 0xAA28, LBP_AL },
{ 0xAA29, 0xAA36, LBP_CM },
{ 0xAA40, 0xAA42, LBP_AL },
{ 0xAA43, 0xAA43, LBP_CM },
@ -1753,8 +1772,8 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0xFB29, 0xFB29, LBP_AL },
{ 0xFB2A, 0xFB4F, LBP_HL },
{ 0xFB50, 0xFD3D, LBP_AL },
{ 0xFD3E, 0xFD3E, LBP_OP },
{ 0xFD3F, 0xFD3F, LBP_CL },
{ 0xFD3E, 0xFD3E, LBP_CL },
{ 0xFD3F, 0xFD3F, LBP_OP },
{ 0xFD50, 0xFDFB, LBP_AL },
{ 0xFDFC, 0xFDFC, LBP_PO },
{ 0xFDFD, 0xFDFD, LBP_AL },
@ -1766,7 +1785,7 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0xFE17, 0xFE17, LBP_OP },
{ 0xFE18, 0xFE18, LBP_CL },
{ 0xFE19, 0xFE19, LBP_IN },
{ 0xFE20, 0xFE26, LBP_CM },
{ 0xFE20, 0xFE2D, LBP_CM },
{ 0xFE30, 0xFE34, LBP_ID },
{ 0xFE35, 0xFE35, LBP_OP },
{ 0xFE36, 0xFE36, LBP_CL },
@ -1852,13 +1871,17 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x10100, 0x10102, LBP_BA },
{ 0x10107, 0x101FC, LBP_AL },
{ 0x101FD, 0x101FD, LBP_CM },
{ 0x10280, 0x1039D, LBP_AL },
{ 0x10280, 0x102D0, LBP_AL },
{ 0x102E0, 0x102E0, LBP_CM },
{ 0x102E1, 0x10375, LBP_AL },
{ 0x10376, 0x1037A, LBP_CM },
{ 0x10380, 0x1039D, LBP_AL },
{ 0x1039F, 0x1039F, LBP_BA },
{ 0x103A0, 0x103CF, LBP_AL },
{ 0x103D0, 0x103D0, LBP_BA },
{ 0x103D1, 0x1049D, LBP_AL },
{ 0x104A0, 0x104A9, LBP_NU },
{ 0x10800, 0x10855, LBP_AL },
{ 0x10500, 0x10855, LBP_AL },
{ 0x10857, 0x10857, LBP_BA },
{ 0x10858, 0x1091B, LBP_AL },
{ 0x1091F, 0x1091F, LBP_BA },
@ -1868,7 +1891,12 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x10A38, 0x10A3F, LBP_CM },
{ 0x10A40, 0x10A47, LBP_AL },
{ 0x10A50, 0x10A57, LBP_BA },
{ 0x10A58, 0x10B35, LBP_AL },
{ 0x10A58, 0x10AE4, LBP_AL },
{ 0x10AE5, 0x10AE6, LBP_CM },
{ 0x10AEB, 0x10AEF, LBP_AL },
{ 0x10AF0, 0x10AF5, LBP_BA },
{ 0x10AF6, 0x10AF6, LBP_IN },
{ 0x10B00, 0x10B35, LBP_AL },
{ 0x10B39, 0x10B3F, LBP_BA },
{ 0x10B40, 0x10E7E, LBP_AL },
{ 0x11000, 0x11002, LBP_CM },
@ -1877,7 +1905,7 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x11047, 0x11048, LBP_BA },
{ 0x11049, 0x11065, LBP_AL },
{ 0x11066, 0x1106F, LBP_NU },
{ 0x11080, 0x11082, LBP_CM },
{ 0x1107F, 0x11082, LBP_CM },
{ 0x11083, 0x110AF, LBP_AL },
{ 0x110B0, 0x110BA, LBP_CM },
{ 0x110BB, 0x110BD, LBP_AL },
@ -1889,6 +1917,11 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x11127, 0x11134, LBP_CM },
{ 0x11136, 0x1113F, LBP_NU },
{ 0x11140, 0x11143, LBP_BA },
{ 0x11150, 0x11172, LBP_AL },
{ 0x11173, 0x11173, LBP_CM },
{ 0x11174, 0x11174, LBP_AL },
{ 0x11175, 0x11175, LBP_BB },
{ 0x11176, 0x11176, LBP_AL },
{ 0x11180, 0x11182, LBP_CM },
{ 0x11183, 0x111B2, LBP_AL },
{ 0x111B3, 0x111C0, LBP_CM },
@ -1896,12 +1929,46 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x111C5, 0x111C6, LBP_BA },
{ 0x111C7, 0x111C7, LBP_AL },
{ 0x111C8, 0x111C8, LBP_BA },
{ 0x111CD, 0x111CD, LBP_AL },
{ 0x111D0, 0x111D9, LBP_NU },
{ 0x111DA, 0x1122B, LBP_AL },
{ 0x1122C, 0x11237, LBP_CM },
{ 0x11238, 0x11239, LBP_BA },
{ 0x1123A, 0x1123A, LBP_AL },
{ 0x1123B, 0x1123C, LBP_BA },
{ 0x1123D, 0x112DE, LBP_AL },
{ 0x112DF, 0x112EA, LBP_CM },
{ 0x112F0, 0x112F9, LBP_NU },
{ 0x11301, 0x11303, LBP_CM },
{ 0x11305, 0x11339, LBP_AL },
{ 0x1133C, 0x1133C, LBP_CM },
{ 0x1133D, 0x1133D, LBP_AL },
{ 0x1133E, 0x11357, LBP_CM },
{ 0x1135D, 0x11361, LBP_AL },
{ 0x11362, 0x11374, LBP_CM },
{ 0x11480, 0x114AF, LBP_AL },
{ 0x114B0, 0x114C3, LBP_CM },
{ 0x114C4, 0x114C7, LBP_AL },
{ 0x114D0, 0x114D9, LBP_NU },
{ 0x11580, 0x115AE, LBP_AL },
{ 0x115AF, 0x115C0, LBP_CM },
{ 0x115C1, 0x115C1, LBP_BB },
{ 0x115C2, 0x115C3, LBP_BA },
{ 0x115C4, 0x115C5, LBP_EX },
{ 0x115C6, 0x115C8, LBP_AL },
{ 0x115C9, 0x115C9, LBP_BA },
{ 0x11600, 0x1162F, LBP_AL },
{ 0x11630, 0x11640, LBP_CM },
{ 0x11641, 0x11642, LBP_BA },
{ 0x11643, 0x11644, LBP_AL },
{ 0x11650, 0x11659, LBP_NU },
{ 0x11680, 0x116AA, LBP_AL },
{ 0x116AB, 0x116B7, LBP_CM },
{ 0x116C0, 0x116C9, LBP_NU },
{ 0x12000, 0x12462, LBP_AL },
{ 0x12470, 0x12473, LBP_BA },
{ 0x118A0, 0x118DF, LBP_AL },
{ 0x118E0, 0x118E9, LBP_NU },
{ 0x118EA, 0x1246E, LBP_AL },
{ 0x12470, 0x12474, LBP_BA },
{ 0x13000, 0x13257, LBP_AL },
{ 0x13258, 0x1325A, LBP_OP },
{ 0x1325B, 0x1325D, LBP_CL },
@ -1915,10 +1982,27 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x1328A, 0x13378, LBP_AL },
{ 0x13379, 0x13379, LBP_OP },
{ 0x1337A, 0x1337B, LBP_CL },
{ 0x1337C, 0x16F50, LBP_AL },
{ 0x1337C, 0x16A5E, LBP_AL },
{ 0x16A60, 0x16A69, LBP_NU },
{ 0x16A6E, 0x16A6F, LBP_BA },
{ 0x16AD0, 0x16AED, LBP_AL },
{ 0x16AF0, 0x16AF4, LBP_CM },
{ 0x16AF5, 0x16AF5, LBP_BA },
{ 0x16B00, 0x16B2F, LBP_AL },
{ 0x16B30, 0x16B36, LBP_CM },
{ 0x16B37, 0x16B39, LBP_BA },
{ 0x16B3A, 0x16B43, LBP_AL },
{ 0x16B44, 0x16B44, LBP_BA },
{ 0x16B45, 0x16B45, LBP_AL },
{ 0x16B50, 0x16B59, LBP_NU },
{ 0x16B5B, 0x16F50, LBP_AL },
{ 0x16F51, 0x16F92, LBP_CM },
{ 0x16F93, 0x16F9F, LBP_AL },
{ 0x1B000, 0x1B001, LBP_ID },
{ 0x1BC00, 0x1BC9C, LBP_AL },
{ 0x1BC9D, 0x1BC9E, LBP_CM },
{ 0x1BC9F, 0x1BC9F, LBP_BA },
{ 0x1BCA0, 0x1BCA3, LBP_CM },
{ 0x1D000, 0x1D164, LBP_AL },
{ 0x1D165, 0x1D169, LBP_CM },
{ 0x1D16A, 0x1D16C, LBP_AL },
@ -1931,15 +2015,19 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x1D242, 0x1D244, LBP_CM },
{ 0x1D245, 0x1D7CB, LBP_AL },
{ 0x1D7CE, 0x1D7FF, LBP_NU },
{ 0x1E800, 0x1E8CF, LBP_AL },
{ 0x1E8D0, 0x1E8D6, LBP_CM },
{ 0x1EE00, 0x1EEF1, LBP_AL },
{ 0x1F000, 0x1F0DF, LBP_ID },
{ 0x1F000, 0x1F0F5, LBP_ID },
{ 0x1F100, 0x1F12D, LBP_AI },
{ 0x1F12E, 0x1F12E, LBP_AL },
{ 0x1F130, 0x1F169, LBP_AI },
{ 0x1F16A, 0x1F16B, LBP_AL },
{ 0x1F170, 0x1F19A, LBP_AI },
{ 0x1F1E6, 0x1F1FF, LBP_RI },
{ 0x1F200, 0x1F3B4, LBP_ID },
{ 0x1F200, 0x1F39B, LBP_ID },
{ 0x1F39C, 0x1F39D, LBP_AL },
{ 0x1F39E, 0x1F3B4, LBP_ID },
{ 0x1F3B5, 0x1F3B6, LBP_AL },
{ 0x1F3B7, 0x1F3BB, LBP_ID },
{ 0x1F3BC, 0x1F3BC, LBP_AL },
@ -1953,14 +2041,23 @@ struct LineBreakProperties lb_prop_default[] = {
{ 0x1F4AF, 0x1F4AF, LBP_AL },
{ 0x1F4B0, 0x1F4B0, LBP_ID },
{ 0x1F4B1, 0x1F4B2, LBP_AL },
{ 0x1F4B3, 0x1F4FC, LBP_ID },
{ 0x1F4B3, 0x1F4FE, LBP_ID },
{ 0x1F500, 0x1F506, LBP_AL },
{ 0x1F507, 0x1F516, LBP_ID },
{ 0x1F517, 0x1F524, LBP_AL },
{ 0x1F525, 0x1F531, LBP_ID },
{ 0x1F532, 0x1F543, LBP_AL },
{ 0x1F550, 0x1F6C5, LBP_ID },
{ 0x1F700, 0x1F773, LBP_AL },
{ 0x1F532, 0x1F549, LBP_AL },
{ 0x1F54A, 0x1F5D3, LBP_ID },
{ 0x1F5D4, 0x1F5DB, LBP_AL },
{ 0x1F5DC, 0x1F5F3, LBP_ID },
{ 0x1F5F4, 0x1F5F9, LBP_AL },
{ 0x1F5FA, 0x1F64F, LBP_ID },
{ 0x1F650, 0x1F675, LBP_AL },
{ 0x1F676, 0x1F678, LBP_QU },
{ 0x1F679, 0x1F67B, LBP_NS },
{ 0x1F67C, 0x1F67F, LBP_AL },
{ 0x1F680, 0x1F6F3, LBP_ID },
{ 0x1F700, 0x1F8AD, LBP_AL },
{ 0x20000, 0x3FFFD, LBP_ID },
{ 0xE0001, 0xE01EF, LBP_CM },
{ 0xF0000, 0x10FFFD, LBP_XX },
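
Each entry of `lb_prop_default` is an inclusive code point range with one class, and the entries are sorted and do not overlap. As an illustration of how such a table can be queried, here is a binary search over a short excerpt of the rows above. This is a sketch under assumptions: the library's own lookup may be organized differently, and `SketchRange`, `SketchProp`, `sketch_lookup`, and `sketch_class_of` are hypothetical names, with `SK_XX` standing in for "no entry found".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a few LBP_* classes. */
enum SketchProp { SK_XX, SK_AL, SK_IS, SK_CM, SK_BA };

/* One inclusive range of code points sharing a class. */
struct SketchRange {
    uint32_t start;
    uint32_t end;
    enum SketchProp prop;
};

/* Excerpt of the 7.0.0 rows shown in the diff (sorted, non-overlapping). */
static const struct SketchRange sketch_table[] = {
    { 0x0370, 0x037D, SK_AL },
    { 0x037E, 0x037E, SK_IS },
    { 0x037F, 0x0482, SK_AL },
    { 0x0483, 0x0489, SK_CM },
    { 0x048A, 0x0587, SK_AL },
    { 0x0589, 0x0589, SK_IS },
    { 0x058A, 0x058A, SK_BA },
};

/* Binary search; returns SK_XX when no range contains ch. */
enum SketchProp sketch_lookup(const struct SketchRange *tbl, size_t n,
                              uint32_t ch)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ch < tbl[mid].start)
            hi = mid;
        else if (ch > tbl[mid].end)
            lo = mid + 1;
        else
            return tbl[mid].prop;
    }
    return SK_XX;
}

/* Convenience wrapper over the excerpt above. */
enum SketchProp sketch_class_of(uint32_t ch)
{
    return sketch_lookup(sketch_table,
                         sizeof sketch_table / sizeof sketch_table[0], ch);
}
```

With this excerpt, U+037F resolves to the AL stand-in because the range that started at 0x0384 in the 6.3.0 data now starts at 0x037F, while U+0588 falls into a gap between rows and yields `SK_XX`.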


@ -4,7 +4,7 @@
* Line breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2008-2012 Wu Yongwei <wuyongwei at gmail dot com>
* Copyright (C) 2008-2015 Wu Yongwei <wuyongwei at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
@ -30,9 +30,9 @@
* Unicode 5.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
*
* This library has been updated according to Revision 30, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-30.html>
* This library has been updated according to Revision 33, for
* Unicode 7.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-33.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>


@ -4,7 +4,7 @@
* Line breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2008-2013 Wu Yongwei <wuyongwei at gmail dot com>
* Copyright (C) 2008-2015 Wu Yongwei <wuyongwei at gmail dot com>
* Copyright (C) 2013 Petr Filipsky <philodej at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
@ -31,9 +31,9 @@
* Unicode 5.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
*
* This library has been updated according to Revision 30, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-30.html>
* This library has been updated according to Revision 33, for
* Unicode 7.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-33.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
@ -45,16 +45,12 @@
* Definitions of internal data structures, declarations of global
* variables, and function prototypes for the line breaking algorithm.
*
* @version 2.4, 2013/11/10
* @version 2.6, 2015/04/18
* @author Wu Yongwei
* @author Petr Filipsky
*/
/**
* Constant value to mark the end of string. It is not a valid Unicode
* character.
*/
#define EOS 0xFFFFFFFF
#include "unibreakdef.h"
/**
* Line break classes. This is a direct mapping of Table 1 of Unicode
@ -143,28 +139,20 @@ struct LineBreakContext
enum LineBreakClass lbcCur; /**< Breaking class of current codepoint */
enum LineBreakClass lbcNew; /**< Breaking class of next codepoint */
enum LineBreakClass lbcLast; /**< Breaking class of last codepoint */
int fLb21aHebrew; /**< Flag for Hebrew letters (LB21a) */
};
/**
* Abstract function interface for #lb_get_next_char_utf8,
* #lb_get_next_char_utf16, and #lb_get_next_char_utf32.
*/
typedef utf32_t (*get_next_char_t)(const void *, size_t, size_t *);
/* Declarations */
extern struct LineBreakProperties lb_prop_default[];
extern struct LineBreakPropertiesLang lb_prop_lang_map[];
/* Function Prototype */
utf32_t lb_get_next_char_utf8(const utf8_t *s, size_t len, size_t *ip);
utf32_t lb_get_next_char_utf16(const utf16_t *s, size_t len, size_t *ip);
utf32_t lb_get_next_char_utf32(const utf32_t *s, size_t len, size_t *ip);
void lb_init_break_context(
struct LineBreakContext* lbpCtx,
struct LineBreakContext *lbpCtx,
utf32_t ch,
const char* lang);
const char *lang);
int lb_process_next_char(
struct LineBreakContext* lbpCtx,
struct LineBreakContext *lbpCtx,
utf32_t ch);
void set_linebreaks(
const void *s,

@ -0,0 +1,41 @@
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Break processing in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2015 Wu Yongwei <wuyongwei at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*/
/**
* @file unibreakbase.c
*
* Definition of basic libunibreak information.
*
* @version 1.0, 2015/04/18
* @author Wu Yongwei
*/
#include "unibreakbase.h"
/**
* Version number of the library.
*/
const int unibreak_version = UNIBREAK_VERSION;
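The exported constant above lets a program compare the version it was compiled against (`UNIBREAK_VERSION`) with the version it is linked against (`unibreak_version`). A minimal sketch of such a check follows. It assumes the value packs the major version in the high byte and the minor version in the low byte (so `0x0300` reads as 3.0); the diff does not state that layout. All `demo_` and `DEMO_` names are hypothetical stand-ins so the sketch compiles without libunibreak.

```c
#include <assert.h>

/* Stand-ins for UNIBREAK_VERSION and unibreak_version (hypothetical names). */
#define DEMO_HEADER_VERSION 0x0300
static const int demo_linked_version = 0x0300;

/* Assumption: high byte = major version, low byte = minor version. */
static int demo_major(int v) { return (v >> 8) & 0xFF; }
static int demo_minor(int v) { return v & 0xFF; }

/* Returns 1 when headers and linked library agree on the major version. */
static int demo_versions_compatible(int header_v, int linked_v)
{
    assert(header_v >= 0 && linked_v >= 0);
    return demo_major(header_v) == demo_major(linked_v);
}
```

In a real program the two arguments would presumably be `UNIBREAK_VERSION` and `unibreak_version`; comparing only the major part is a design choice of this sketch, not something the library prescribes.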

@ -0,0 +1,73 @@
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Break processing in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2015 Wu Yongwei <wuyongwei at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*
* The main reference is Unicode Standard Annex 14 (UAX #14):
* <URL:http://www.unicode.org/reports/tr14/>
*
* When this library was designed, this annex was at Revision 19, for
* Unicode 5.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
*
* This library has been updated according to Revision 33, for
* Unicode 7.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-33.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
/**
* @file unibreakbase.h
*
* Header file for common definitions in the libunibreak library.
*
* @version 1.0, 2015/04/18
* @author Wu Yongwei
*/
#ifndef UNIBREAKBASE_H
#define UNIBREAKBASE_H
#include <stddef.h>
#ifdef __cplusplus
extern "C" {
#endif
#define UNIBREAK_VERSION 0x0300 /**< Version of the library linebreak */
extern const int unibreak_version;
#ifndef UNIBREAK_UTF_TYPES_DEFINED
#define UNIBREAK_UTF_TYPES_DEFINED
typedef unsigned char utf8_t; /**< Type for UTF-8 data points */
typedef unsigned short utf16_t; /**< Type for UTF-16 data points */
typedef unsigned int utf32_t; /**< Type for UTF-32 data points */
#endif
#ifdef __cplusplus
}
#endif
#endif /* UNIBREAKBASE_H */

@ -0,0 +1,159 @@
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Break processing in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2015 Wu Yongwei <wuyongwei at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*/
/**
* @file unibreakdef.c
*
* Definition of utility functions used by the libunibreak library.
*
* @version 1.0, 2015/04/18
* @author Wu Yongwei
*/
#include <assert.h>
#include <stddef.h>
#include "unibreakdef.h"
/**
* Gets the next Unicode character in a UTF-8 sequence. The index will
* be advanced to the next complete character, unless the end of string
* is reached in the middle of a UTF-8 sequence.
*
* @param[in] s input UTF-8 string
* @param[in] len length of the string in bytes
* @param[in,out] ip pointer to the index
* @return the Unicode character beginning at the index; or
* #EOS if end of input is encountered
*/
utf32_t ub_get_next_char_utf8(
const utf8_t *s,
size_t len,
size_t *ip)
{
utf8_t ch;
utf32_t res;
assert(*ip <= len);
if (*ip == len)
return EOS;
ch = s[*ip];
if (ch < 0xC2 || ch > 0xF4)
{ /* One-byte sequence, tail (should not occur), or invalid */
*ip += 1;
return ch;
}
else if (ch < 0xE0)
{ /* Two-byte sequence */
if (*ip + 2 > len)
return EOS;
res = ((ch & 0x1F) << 6) + (s[*ip + 1] & 0x3F);
*ip += 2;
return res;
}
else if (ch < 0xF0)
{ /* Three-byte sequence */
if (*ip + 3 > len)
return EOS;
res = ((ch & 0x0F) << 12) +
((s[*ip + 1] & 0x3F) << 6) +
((s[*ip + 2] & 0x3F));
*ip += 3;
return res;
}
else
{ /* Four-byte sequence */
if (*ip + 4 > len)
return EOS;
res = ((ch & 0x07) << 18) +
((s[*ip + 1] & 0x3F) << 12) +
((s[*ip + 2] & 0x3F) << 6) +
((s[*ip + 3] & 0x3F));
*ip += 4;
return res;
}
}
/**
* Gets the next Unicode character in a UTF-16 sequence. The index will
* be advanced to the next complete character, unless the end of string
* is reached in the middle of a UTF-16 surrogate pair.
*
* @param[in] s input UTF-16 string
* @param[in] len length of the string in words
* @param[in,out] ip pointer to the index
* @return the Unicode character beginning at the index; or
* #EOS if end of input is encountered
*/
utf32_t ub_get_next_char_utf16(
const utf16_t *s,
size_t len,
size_t *ip)
{
utf16_t ch;
assert(*ip <= len);
if (*ip == len)
return EOS;
ch = s[(*ip)++];
if (ch < 0xD800 || ch > 0xDBFF)
{ /* If the character is not a high surrogate */
return ch;
}
if (*ip == len)
{ /* If the input ends here (an error) */
--(*ip);
return EOS;
}
if (s[*ip] < 0xDC00 || s[*ip] > 0xDFFF)
{ /* If the next character is not the low surrogate (an error) */
return ch;
}
/* Return the constructed character and advance the index again */
return (((utf32_t)ch & 0x3FF) << 10) + (s[(*ip)++] & 0x3FF) + 0x10000;
}
/**
* Gets the next Unicode character in a UTF-32 sequence. The index will
* be advanced to the next character.
*
* @param[in] s input UTF-32 string
* @param[in] len length of the string in dwords
* @param[in,out] ip pointer to the index
* @return the Unicode character beginning at the index; or
* #EOS if end of input is encountered
*/
utf32_t ub_get_next_char_utf32(
const utf32_t *s,
size_t len,
size_t *ip)
{
assert(*ip <= len);
if (*ip == len)
return EOS;
return s[(*ip)++];
}
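The decoders in this file share one contract: return the code point at `*ip` and advance the index, or return `EOS` at the end of input (leaving the index in place when a multi-byte sequence is cut off). The sketch below shows how a caller might walk a UTF-8 buffer under that contract. `demo_next_utf8` is a local stand-in that mirrors the UTF-8 logic from the diff so the example builds without the library; `demo_count_code_points` and `DEMO_EOS` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_EOS 0xFFFFFFFFu   /* same sentinel value as EOS in the diff */

/* Local stand-in mirroring the UTF-8 decoder shown in the diff. */
static unsigned int demo_next_utf8(const unsigned char *s, size_t len, size_t *ip)
{
    unsigned char ch;
    unsigned int res;

    assert(*ip <= len);
    if (*ip == len)
        return DEMO_EOS;
    ch = s[*ip];
    if (ch < 0xC2 || ch > 0xF4)
    {   /* ASCII, stray tail byte, or invalid lead byte: passed through as-is */
        *ip += 1;
        return ch;
    }
    else if (ch < 0xE0)
    {   /* Two-byte sequence */
        if (*ip + 2 > len)
            return DEMO_EOS;
        res = ((ch & 0x1Fu) << 6) + (s[*ip + 1] & 0x3Fu);
        *ip += 2;
        return res;
    }
    else if (ch < 0xF0)
    {   /* Three-byte sequence */
        if (*ip + 3 > len)
            return DEMO_EOS;
        res = ((ch & 0x0Fu) << 12) +
              ((s[*ip + 1] & 0x3Fu) << 6) +
              (s[*ip + 2] & 0x3Fu);
        *ip += 3;
        return res;
    }
    else
    {   /* Four-byte sequence */
        if (*ip + 4 > len)
            return DEMO_EOS;
        res = ((ch & 0x07u) << 18) +
              ((s[*ip + 1] & 0x3Fu) << 12) +
              ((s[*ip + 2] & 0x3Fu) << 6) +
              (s[*ip + 3] & 0x3Fu);
        *ip += 4;
        return res;
    }
}

/* Counts decoded code points; stops at end of input or a truncated sequence. */
static size_t demo_count_code_points(const unsigned char *s, size_t len)
{
    size_t i = 0, n = 0;
    while (demo_next_utf8(s, len, &i) != DEMO_EOS)
        ++n;
    return n;
}
```

Note that, as in the diff, continuation bytes are not validated: the decoder trusts the lead byte and masks whatever follows.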

@ -0,0 +1,80 @@
/* vim: set expandtab tabstop=4 softtabstop=4 shiftwidth=4: */
/*
* Break processing in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2015 Wu Yongwei <wuyongwei at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*
* The main reference is Unicode Standard Annex 14 (UAX #14):
* <URL:http://www.unicode.org/reports/tr14/>
*
* When this library was designed, this annex was at Revision 19, for
* Unicode 5.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
*
* This library has been updated according to Revision 33, for
* Unicode 7.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-33.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
/**
* @file unibreakdef.h
*
* Header file for private definitions in the libunibreak library.
*
* @version 1.1, 2015/04/19
* @author Wu Yongwei
*/
#ifndef UNIBREAKDEF_H
#define UNIBREAKDEF_H
#include "unibreakbase.h"
#ifdef __cplusplus
extern "C" {
#endif
/**
* Constant value to mark the end of string. It is not a valid Unicode
* character.
*/
#define EOS 0xFFFFFFFF
/**
* Abstract function interface for #ub_get_next_char_utf8,
* #ub_get_next_char_utf16, and #ub_get_next_char_utf32.
*/
typedef utf32_t (*get_next_char_t)(const void *, size_t, size_t *);
/* Function Prototype */
utf32_t ub_get_next_char_utf8(const utf8_t *s, size_t len, size_t *ip);
utf32_t ub_get_next_char_utf16(const utf16_t *s, size_t len, size_t *ip);
utf32_t ub_get_next_char_utf32(const utf32_t *s, size_t len, size_t *ip);
#ifdef __cplusplus
}
#endif
#endif /* UNIBREAKDEF_H */
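The `get_next_char_t` typedef lets one algorithm run over UTF-8, UTF-16, or UTF-32 input by passing the matching decoder as a function pointer. The sketch below illustrates that dispatch pattern with hypothetical `demo_` names. Unlike the library, which casts its typed decoders to `get_next_char_t`, the stand-in decoders here take `const void *` directly so no function-pointer cast is needed; that is a choice made for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_EOS 0xFFFFFFFFu   /* same sentinel value as EOS in the diff */

/* Same shape as get_next_char_t in the diff (hypothetical local name). */
typedef uint32_t (*demo_get_next_char_t)(const void *, size_t, size_t *);

/* UTF-32 stand-in: one unit per code point. */
static uint32_t demo_next_utf32(const void *s, size_t len, size_t *ip)
{
    const uint32_t *p = (const uint32_t *)s;
    assert(*ip <= len);
    if (*ip == len)
        return DEMO_EOS;
    return p[(*ip)++];
}

/* UTF-16 stand-in mirroring the surrogate handling shown in the diff. */
static uint32_t demo_next_utf16(const void *s, size_t len, size_t *ip)
{
    const uint16_t *p = (const uint16_t *)s;
    uint16_t ch;
    assert(*ip <= len);
    if (*ip == len)
        return DEMO_EOS;
    ch = p[(*ip)++];
    if (ch < 0xD800 || ch > 0xDBFF)
        return ch;                      /* not a high surrogate */
    if (*ip == len)
    {                                   /* input ends inside a pair */
        --(*ip);
        return DEMO_EOS;
    }
    if (p[*ip] < 0xDC00 || p[*ip] > 0xDFFF)
        return ch;                      /* unpaired high surrogate */
    return (((uint32_t)ch & 0x3FF) << 10) + (p[(*ip)++] & 0x3FF) + 0x10000;
}

/* Encoding-agnostic loop: the decoder is chosen by the caller. */
static size_t demo_count(const void *s, size_t len, demo_get_next_char_t next)
{
    size_t i = 0, n = 0;
    while (next(s, len, &i) != DEMO_EOS)
        ++n;
    return n;
}
```

The same `demo_count` body serves both encodings; only the pointer passed in changes, which is the role `get_next_char` plays inside `set_wordbreaks`.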

@ -4,7 +4,7 @@
* Word breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2013 Tom Hacohen <tom at stosb dot com>
* Copyright (C) 2013-2015 Tom Hacohen <tom at stosb dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
@ -30,9 +30,9 @@
* Unicode 6.0.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
*
* This library has been updated according to Revision 21, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-21.html>
* This library has been updated according to Revision 25, for
* Unicode 7.0.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-25.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
@ -44,16 +44,14 @@
* Implementation of the word breaking algorithm as described in Unicode
* Standard Annex 29.
*
* @version 2.4, 2013/09/28
* @version 2.6, 2015/04/18
* @author Tom Hacohen
*/
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include "linebreak.h"
#include "linebreakdef.h"
#include "unibreakdef.h"
#include "wordbreak.h"
#include "wordbreakdata.c"
@ -128,7 +126,6 @@ static void set_brks_to(
while (posNext < posEnd)
{
utf32_t ch;
(void)ch;
ch = get_next_char(s, len, &posNext);
assert(ch != EOS);
for (; posStart < posNext - 1; ++posStart)
@ -257,8 +254,24 @@ static void set_wordbreaks(
posLast = posCur;
break;
case WBP_Hebrew_Letter:
case WBP_ALetter:
if ((wbcSeqStart == WBP_ALetter) || /* WB5,6,7 */
if ((wbcSeqStart == WBP_Hebrew_Letter) &&
(wbcLast == WBP_Double_Quote)) /* WB7b,c */
{
if (wbcCur == WBP_Hebrew_Letter)
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_NOBREAK, get_next_char);
}
else
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
}
}
else if (((wbcSeqStart == WBP_ALetter) ||
(wbcSeqStart == WBP_Hebrew_Letter)) || /* WB5,6,7 */
(wbcLast == WBP_Numeric) || /* WB10 */
(wbcSeqStart == WBP_ExtendNumLet)) /* WB13b */
{
@ -275,8 +288,18 @@ static void set_wordbreaks(
posLast = posCur;
break;
case WBP_Single_Quote:
if (wbcLast == WBP_Hebrew_Letter) /* WB7a */
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_NOBREAK, get_next_char);
wbcSeqStart = wbcCur;
posLast = posCur;
}
/* No break on purpose */
case WBP_MidNumLet:
if ((wbcLast == WBP_ALetter) || /* WB6,7 */
if (((wbcLast == WBP_ALetter) ||
(wbcLast == WBP_Hebrew_Letter)) || /* WB6,7 */
(wbcLast == WBP_Numeric)) /* WB11,12 */
{
/* Go on */
@ -291,7 +314,8 @@ static void set_wordbreaks(
break;
case WBP_MidLetter:
if (wbcLast == WBP_ALetter) /* WB6,7 */
if ((wbcLast == WBP_ALetter) ||
(wbcLast == WBP_Hebrew_Letter)) /* WB6,7 */
{
/* Go on */
}
@ -320,7 +344,8 @@ static void set_wordbreaks(
case WBP_Numeric:
if ((wbcSeqStart == WBP_Numeric) || /* WB8,11,12 */
(wbcLast == WBP_ALetter) || /* WB9 */
((wbcLast == WBP_ALetter) ||
(wbcLast == WBP_Hebrew_Letter)) || /* WB9 */
(wbcSeqStart == WBP_ExtendNumLet)) /* WB13b */
{
set_brks_to(s, brks, posLast, posCur, len,
@ -340,6 +365,7 @@ static void set_wordbreaks(
/* WB13a,13b */
if ((wbcSeqStart == wbcLast) &&
((wbcLast == WBP_ALetter) ||
(wbcLast == WBP_Hebrew_Letter) ||
(wbcLast == WBP_Numeric) ||
(wbcLast == WBP_Katakana) ||
(wbcLast == WBP_ExtendNumLet)))
@ -357,9 +383,9 @@ static void set_wordbreaks(
posLast = posCur;
break;
case WBP_Regional:
case WBP_Regional_Indicator:
/* WB13c */
if (wbcSeqStart == WBP_Regional)
if (wbcSeqStart == WBP_Regional_Indicator)
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_NOBREAK, get_next_char);
@ -368,6 +394,20 @@ static void set_wordbreaks(
posLast = posCur;
break;
case WBP_Double_Quote:
if (wbcLast == WBP_Hebrew_Letter) /* WB7b,c */
{
/* Go on */
}
else
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
wbcSeqStart = wbcCur;
posLast = posCur;
}
break;
case WBP_Any:
/* Allow breaks and reset */
set_brks_to(s, brks, posLast, posCur, len,
@ -409,7 +449,7 @@ void set_wordbreaks_utf8(
char *brks)
{
set_wordbreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf8);
(get_next_char_t)ub_get_next_char_utf8);
}
/**
@ -429,7 +469,7 @@ void set_wordbreaks_utf16(
char *brks)
{
set_wordbreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf16);
(get_next_char_t)ub_get_next_char_utf16);
}
/**
@ -449,5 +489,5 @@ void set_wordbreaks_utf32(
char *brks)
{
set_wordbreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf32);
(get_next_char_t)ub_get_next_char_utf32);
}
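The hunks above add handling for the Hebrew letter and quotation mark rules WB7a, WB7b and WB7c. As a reading aid, the sketch below states those three rules as a standalone predicate over the classes around a candidate break position. It is not how `set_wordbreaks` is structured (the library tracks `wbcSeqStart` and `wbcLast` in a state machine); the enum and function names are hypothetical.

```c
/* Reduced set of word break classes, enough for WB7a-c (hypothetical names). */
enum demo_wb_class
{
    DEMO_ALetter,
    DEMO_Hebrew_Letter,
    DEMO_Single_Quote,
    DEMO_Double_Quote,
    DEMO_Other
};

/*
 * Returns 1 when WB7a, WB7b or WB7c forbids a break between `left` and
 * `right`. `prev` is the class before `left`, `next` the class after
 * `right`; pass DEMO_Other where there is no such character.
 */
static int demo_hebrew_quote_no_break(
        enum demo_wb_class prev,
        enum demo_wb_class left,
        enum demo_wb_class right,
        enum demo_wb_class next)
{
    /* WB7a: Hebrew_Letter x Single_Quote */
    if (left == DEMO_Hebrew_Letter && right == DEMO_Single_Quote)
        return 1;
    /* WB7b: Hebrew_Letter x Double_Quote Hebrew_Letter */
    if (left == DEMO_Hebrew_Letter && right == DEMO_Double_Quote &&
        next == DEMO_Hebrew_Letter)
        return 1;
    /* WB7c: Hebrew_Letter Double_Quote x Hebrew_Letter */
    if (prev == DEMO_Hebrew_Letter && left == DEMO_Double_Quote &&
        right == DEMO_Hebrew_Letter)
        return 1;
    return 0;
}
```

A return value of 0 only means these three rules do not apply; other rules (WB5 to WB7, WB9 and so on) may still forbid the break.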

@ -4,7 +4,7 @@
* Word breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2013 Tom Hacohen <tom at stosb dot com>
* Copyright (C) 2013-2015 Tom Hacohen <tom at stosb dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
@ -30,9 +30,9 @@
* Unicode 6.0.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
*
* This library has been updated according to Revision 21, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-21.html>
* This library has been updated according to Revision 25, for
* Unicode 7.0.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-25.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
@ -43,7 +43,7 @@
*
* Header file for the word breaking (segmentation) algorithm.
*
* @version 2.3, 2013/09/28
* @version 2.5, 2015/04/18
* @author Tom Hacohen
*/
@ -51,7 +51,7 @@
#define WORDBREAK_H
#include <stddef.h>
#include "linebreak.h"
#include "unibreakbase.h"
#ifdef __cplusplus
extern "C" {

@ -1,16 +1,16 @@
/* The content of this file is generated from:
# WordBreakProperty-6.2.0.txt
# Date: 2012-08-13, 19:12:09 GMT [MD]
# WordBreakProperty-7.0.0.txt
# Date: 2014-02-19, 15:51:39 GMT [MD]
*/
#include "linebreak.h"
#include "wordbreakdef.h"
static struct WordBreakProperties wb_prop_default[] = {
{0x000A, 0x000A, WBP_LF},
{0x000B, 0x000C, WBP_Newline},
{0x000D, 0x000D, WBP_CR},
{0x0027, 0x0027, WBP_MidNumLet},
{0x0022, 0x0022, WBP_Double_Quote},
{0x0027, 0x0027, WBP_Single_Quote},
{0x002C, 0x002C, WBP_MidNum},
{0x002E, 0x002E, WBP_MidNumLet},
{0x0030, 0x0039, WBP_Numeric},
@ -36,6 +36,7 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x0295, 0x02AF, WBP_ALetter},
{0x02B0, 0x02C1, WBP_ALetter},
{0x02C6, 0x02D1, WBP_ALetter},
{0x02D7, 0x02D7, WBP_MidLetter},
{0x02E0, 0x02E4, WBP_ALetter},
{0x02EC, 0x02EC, WBP_ALetter},
{0x02EE, 0x02EE, WBP_ALetter},
@ -46,6 +47,7 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x037A, 0x037A, WBP_ALetter},
{0x037B, 0x037D, WBP_ALetter},
{0x037E, 0x037E, WBP_MidNum},
{0x037F, 0x037F, WBP_ALetter},
{0x0386, 0x0386, WBP_ALetter},
{0x0387, 0x0387, WBP_MidLetter},
{0x0388, 0x038A, WBP_ALetter},
@ -55,7 +57,7 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x03F7, 0x0481, WBP_ALetter},
{0x0483, 0x0487, WBP_Extend},
{0x0488, 0x0489, WBP_Extend},
{0x048A, 0x0527, WBP_ALetter},
{0x048A, 0x052F, WBP_ALetter},
{0x0531, 0x0556, WBP_ALetter},
{0x0559, 0x0559, WBP_ALetter},
{0x0561, 0x0587, WBP_ALetter},
@ -65,13 +67,14 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x05C1, 0x05C2, WBP_Extend},
{0x05C4, 0x05C5, WBP_Extend},
{0x05C7, 0x05C7, WBP_Extend},
{0x05D0, 0x05EA, WBP_ALetter},
{0x05F0, 0x05F2, WBP_ALetter},
{0x05D0, 0x05EA, WBP_Hebrew_Letter},
{0x05F0, 0x05F2, WBP_Hebrew_Letter},
{0x05F3, 0x05F3, WBP_ALetter},
{0x05F4, 0x05F4, WBP_MidLetter},
{0x0600, 0x0604, WBP_Format},
{0x0600, 0x0605, WBP_Format},
{0x060C, 0x060D, WBP_MidNum},
{0x0610, 0x061A, WBP_Extend},
{0x061C, 0x061C, WBP_Format},
{0x0620, 0x063F, WBP_ALetter},
{0x0640, 0x0640, WBP_ALetter},
{0x0641, 0x064A, WBP_ALetter},
@ -117,10 +120,8 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x0829, 0x082D, WBP_Extend},
{0x0840, 0x0858, WBP_ALetter},
{0x0859, 0x085B, WBP_Extend},
{0x08A0, 0x08A0, WBP_ALetter},
{0x08A2, 0x08AC, WBP_ALetter},
{0x08E4, 0x08FE, WBP_Extend},
{0x0900, 0x0902, WBP_Extend},
{0x08A0, 0x08B2, WBP_ALetter},
{0x08E4, 0x0902, WBP_Extend},
{0x0903, 0x0903, WBP_Extend},
{0x0904, 0x0939, WBP_ALetter},
{0x093A, 0x093A, WBP_Extend},
@ -138,8 +139,7 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x0962, 0x0963, WBP_Extend},
{0x0966, 0x096F, WBP_Numeric},
{0x0971, 0x0971, WBP_ALetter},
{0x0972, 0x0977, WBP_ALetter},
{0x0979, 0x097F, WBP_ALetter},
{0x0972, 0x0980, WBP_ALetter},
{0x0981, 0x0981, WBP_Extend},
{0x0982, 0x0983, WBP_Extend},
{0x0985, 0x098C, WBP_ALetter},
@ -247,12 +247,12 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x0BD0, 0x0BD0, WBP_ALetter},
{0x0BD7, 0x0BD7, WBP_Extend},
{0x0BE6, 0x0BEF, WBP_Numeric},
{0x0C00, 0x0C00, WBP_Extend},
{0x0C01, 0x0C03, WBP_Extend},
{0x0C05, 0x0C0C, WBP_ALetter},
{0x0C0E, 0x0C10, WBP_ALetter},
{0x0C12, 0x0C28, WBP_ALetter},
{0x0C2A, 0x0C33, WBP_ALetter},
{0x0C35, 0x0C39, WBP_ALetter},
{0x0C2A, 0x0C39, WBP_ALetter},
{0x0C3D, 0x0C3D, WBP_ALetter},
{0x0C3E, 0x0C40, WBP_Extend},
{0x0C41, 0x0C44, WBP_Extend},
@ -263,6 +263,7 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x0C60, 0x0C61, WBP_ALetter},
{0x0C62, 0x0C63, WBP_Extend},
{0x0C66, 0x0C6F, WBP_Numeric},
{0x0C81, 0x0C81, WBP_Extend},
{0x0C82, 0x0C83, WBP_Extend},
{0x0C85, 0x0C8C, WBP_ALetter},
{0x0C8E, 0x0C90, WBP_ALetter},
@ -284,6 +285,7 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x0CE2, 0x0CE3, WBP_Extend},
{0x0CE6, 0x0CEF, WBP_Numeric},
{0x0CF1, 0x0CF2, WBP_ALetter},
{0x0D01, 0x0D01, WBP_Extend},
{0x0D02, 0x0D03, WBP_Extend},
{0x0D05, 0x0D0C, WBP_ALetter},
{0x0D0E, 0x0D10, WBP_ALetter},
@ -311,6 +313,7 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x0DD2, 0x0DD4, WBP_Extend},
{0x0DD6, 0x0DD6, WBP_Extend},
{0x0DD8, 0x0DDF, WBP_Extend},
{0x0DE6, 0x0DEF, WBP_Numeric},
{0x0DF2, 0x0DF3, WBP_Extend},
{0x0E31, 0x0E31, WBP_Extend},
{0x0E34, 0x0E3A, WBP_Extend},
@ -391,6 +394,7 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x1681, 0x169A, WBP_ALetter},
{0x16A0, 0x16EA, WBP_ALetter},
{0x16EE, 0x16F0, WBP_ALetter},
{0x16F1, 0x16F8, WBP_ALetter},
{0x1700, 0x170C, WBP_ALetter},
{0x170E, 0x1711, WBP_ALetter},
{0x1712, 0x1714, WBP_Extend},
@ -411,6 +415,7 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x17DD, 0x17DD, WBP_Extend},
{0x17E0, 0x17E9, WBP_Numeric},
{0x180B, 0x180D, WBP_Extend},
{0x180E, 0x180E, WBP_Format},
{0x1810, 0x1819, WBP_Numeric},
{0x1820, 0x1842, WBP_ALetter},
{0x1843, 0x1843, WBP_ALetter},
@ -419,7 +424,7 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x18A9, 0x18A9, WBP_Extend},
{0x18AA, 0x18AA, WBP_ALetter},
{0x18B0, 0x18F5, WBP_ALetter},
{0x1900, 0x191C, WBP_ALetter},
{0x1900, 0x191E, WBP_ALetter},
{0x1920, 0x1922, WBP_Extend},
{0x1923, 0x1926, WBP_Extend},
{0x1927, 0x1928, WBP_Extend},
@ -434,7 +439,8 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x19D0, 0x19D9, WBP_Numeric},
{0x1A00, 0x1A16, WBP_ALetter},
{0x1A17, 0x1A18, WBP_Extend},
{0x1A19, 0x1A1B, WBP_Extend},
{0x1A19, 0x1A1A, WBP_Extend},
{0x1A1B, 0x1A1B, WBP_Extend},
{0x1A55, 0x1A55, WBP_Extend},
{0x1A56, 0x1A56, WBP_Extend},
{0x1A57, 0x1A57, WBP_Extend},
@ -449,6 +455,8 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x1A7F, 0x1A7F, WBP_Extend},
{0x1A80, 0x1A89, WBP_Numeric},
{0x1A90, 0x1A99, WBP_Numeric},
{0x1AB0, 0x1ABD, WBP_Extend},
{0x1ABE, 0x1ABE, WBP_Extend},
{0x1B00, 0x1B03, WBP_Extend},
{0x1B04, 0x1B04, WBP_Extend},
{0x1B05, 0x1B33, WBP_ALetter},
@ -471,8 +479,7 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x1BA6, 0x1BA7, WBP_Extend},
{0x1BA8, 0x1BA9, WBP_Extend},
{0x1BAA, 0x1BAA, WBP_Extend},
{0x1BAB, 0x1BAB, WBP_Extend},
{0x1BAC, 0x1BAD, WBP_Extend},
{0x1BAB, 0x1BAD, WBP_Extend},
{0x1BAE, 0x1BAF, WBP_ALetter},
{0x1BB0, 0x1BB9, WBP_Numeric},
{0x1BBA, 0x1BE5, WBP_ALetter},
@ -504,13 +511,14 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x1CF2, 0x1CF3, WBP_Extend},
{0x1CF4, 0x1CF4, WBP_Extend},
{0x1CF5, 0x1CF6, WBP_ALetter},
{0x1CF8, 0x1CF9, WBP_Extend},
{0x1D00, 0x1D2B, WBP_ALetter},
{0x1D2C, 0x1D6A, WBP_ALetter},
{0x1D6B, 0x1D77, WBP_ALetter},
{0x1D78, 0x1D78, WBP_ALetter},
{0x1D79, 0x1D9A, WBP_ALetter},
{0x1D9B, 0x1DBF, WBP_ALetter},
{0x1DC0, 0x1DE6, WBP_Extend},
{0x1DC0, 0x1DF5, WBP_Extend},
{0x1DFC, 0x1DFF, WBP_Extend},
{0x1E00, 0x1F15, WBP_ALetter},
{0x1F18, 0x1F1D, WBP_ALetter},
@ -544,7 +552,7 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x2044, 0x2044, WBP_MidNum},
{0x2054, 0x2054, WBP_ExtendNumLet},
{0x2060, 0x2064, WBP_Format},
{0x206A, 0x206F, WBP_Format},
{0x2066, 0x206F, WBP_Format},
{0x2071, 0x2071, WBP_ALetter},
{0x207F, 0x207F, WBP_ALetter},
{0x2090, 0x209C, WBP_ALetter},
@ -631,7 +639,8 @@ static struct WordBreakProperties wb_prop_default[] = {
{0xA670, 0xA672, WBP_Extend},
{0xA674, 0xA67D, WBP_Extend},
{0xA67F, 0xA67F, WBP_ALetter},
{0xA680, 0xA697, WBP_ALetter},
{0xA680, 0xA69B, WBP_ALetter},
{0xA69C, 0xA69D, WBP_ALetter},
{0xA69F, 0xA69F, WBP_Extend},
{0xA6A0, 0xA6E5, WBP_ALetter},
{0xA6E6, 0xA6EF, WBP_ALetter},
@ -642,8 +651,9 @@ static struct WordBreakProperties wb_prop_default[] = {
{0xA771, 0xA787, WBP_ALetter},
{0xA788, 0xA788, WBP_ALetter},
{0xA78B, 0xA78E, WBP_ALetter},
{0xA790, 0xA793, WBP_ALetter},
{0xA7A0, 0xA7AA, WBP_ALetter},
{0xA790, 0xA7AD, WBP_ALetter},
{0xA7B0, 0xA7B1, WBP_ALetter},
{0xA7F7, 0xA7F7, WBP_ALetter},
{0xA7F8, 0xA7F9, WBP_ALetter},
{0xA7FA, 0xA7FA, WBP_ALetter},
{0xA7FB, 0xA801, WBP_ALetter},
@ -683,6 +693,8 @@ static struct WordBreakProperties wb_prop_default[] = {
{0xA9BD, 0xA9C0, WBP_Extend},
{0xA9CF, 0xA9CF, WBP_ALetter},
{0xA9D0, 0xA9D9, WBP_Numeric},
{0xA9E5, 0xA9E5, WBP_Extend},
{0xA9F0, 0xA9F9, WBP_Numeric},
{0xAA00, 0xAA28, WBP_ALetter},
{0xAA29, 0xAA2E, WBP_Extend},
{0xAA2F, 0xAA30, WBP_Extend},
@ -696,6 +708,8 @@ static struct WordBreakProperties wb_prop_default[] = {
{0xAA4D, 0xAA4D, WBP_Extend},
{0xAA50, 0xAA59, WBP_Numeric},
{0xAA7B, 0xAA7B, WBP_Extend},
{0xAA7C, 0xAA7C, WBP_Extend},
{0xAA7D, 0xAA7D, WBP_Extend},
{0xAAB0, 0xAAB0, WBP_Extend},
{0xAAB2, 0xAAB4, WBP_Extend},
{0xAAB7, 0xAAB8, WBP_Extend},
@ -714,6 +728,9 @@ static struct WordBreakProperties wb_prop_default[] = {
{0xAB11, 0xAB16, WBP_ALetter},
{0xAB20, 0xAB26, WBP_ALetter},
{0xAB28, 0xAB2E, WBP_ALetter},
{0xAB30, 0xAB5A, WBP_ALetter},
{0xAB5C, 0xAB5F, WBP_ALetter},
{0xAB64, 0xAB65, WBP_ALetter},
{0xABC0, 0xABE2, WBP_ALetter},
{0xABE3, 0xABE4, WBP_Extend},
{0xABE5, 0xABE5, WBP_Extend},
@ -728,15 +745,16 @@ static struct WordBreakProperties wb_prop_default[] = {
{0xD7CB, 0xD7FB, WBP_ALetter},
{0xFB00, 0xFB06, WBP_ALetter},
{0xFB13, 0xFB17, WBP_ALetter},
{0xFB1D, 0xFB1D, WBP_ALetter},
{0xFB1D, 0xFB1D, WBP_Hebrew_Letter},
{0xFB1E, 0xFB1E, WBP_Extend},
{0xFB1F, 0xFB28, WBP_ALetter},
{0xFB2A, 0xFB36, WBP_ALetter},
{0xFB38, 0xFB3C, WBP_ALetter},
{0xFB3E, 0xFB3E, WBP_ALetter},
{0xFB40, 0xFB41, WBP_ALetter},
{0xFB43, 0xFB44, WBP_ALetter},
{0xFB46, 0xFBB1, WBP_ALetter},
{0xFB1F, 0xFB28, WBP_Hebrew_Letter},
{0xFB2A, 0xFB36, WBP_Hebrew_Letter},
{0xFB38, 0xFB3C, WBP_Hebrew_Letter},
{0xFB3E, 0xFB3E, WBP_Hebrew_Letter},
{0xFB40, 0xFB41, WBP_Hebrew_Letter},
{0xFB43, 0xFB44, WBP_Hebrew_Letter},
{0xFB46, 0xFB4F, WBP_Hebrew_Letter},
{0xFB50, 0xFBB1, WBP_ALetter},
{0xFBD3, 0xFD3D, WBP_ALetter},
{0xFD50, 0xFD8F, WBP_ALetter},
{0xFD92, 0xFDC7, WBP_ALetter},
@ -745,7 +763,7 @@ static struct WordBreakProperties wb_prop_default[] = {
{0xFE10, 0xFE10, WBP_MidNum},
{0xFE13, 0xFE13, WBP_MidLetter},
{0xFE14, 0xFE14, WBP_MidNum},
{0xFE20, 0xFE26, WBP_Extend},
{0xFE20, 0xFE2D, WBP_Extend},
{0xFE33, 0xFE34, WBP_ExtendNumLet},
{0xFE4D, 0xFE4F, WBP_ExtendNumLet},
{0xFE50, 0xFE50, WBP_MidNum},
@ -784,11 +802,14 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x101FD, 0x101FD, WBP_Extend},
{0x10280, 0x1029C, WBP_ALetter},
{0x102A0, 0x102D0, WBP_ALetter},
{0x10300, 0x1031E, WBP_ALetter},
{0x102E0, 0x102E0, WBP_Extend},
{0x10300, 0x1031F, WBP_ALetter},
{0x10330, 0x10340, WBP_ALetter},
{0x10341, 0x10341, WBP_ALetter},
{0x10342, 0x10349, WBP_ALetter},
{0x1034A, 0x1034A, WBP_ALetter},
{0x10350, 0x10375, WBP_ALetter},
{0x10376, 0x1037A, WBP_Extend},
{0x10380, 0x1039D, WBP_ALetter},
{0x103A0, 0x103C3, WBP_ALetter},
{0x103C8, 0x103CF, WBP_ALetter},
@ -796,12 +817,19 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x10400, 0x1044F, WBP_ALetter},
{0x10450, 0x1049D, WBP_ALetter},
{0x104A0, 0x104A9, WBP_Numeric},
{0x10500, 0x10527, WBP_ALetter},
{0x10530, 0x10563, WBP_ALetter},
{0x10600, 0x10736, WBP_ALetter},
{0x10740, 0x10755, WBP_ALetter},
{0x10760, 0x10767, WBP_ALetter},
{0x10800, 0x10805, WBP_ALetter},
{0x10808, 0x10808, WBP_ALetter},
{0x1080A, 0x10835, WBP_ALetter},
{0x10837, 0x10838, WBP_ALetter},
{0x1083C, 0x1083C, WBP_ALetter},
{0x1083F, 0x10855, WBP_ALetter},
{0x10860, 0x10876, WBP_ALetter},
{0x10880, 0x1089E, WBP_ALetter},
{0x10900, 0x10915, WBP_ALetter},
{0x10920, 0x10939, WBP_ALetter},
{0x10980, 0x109B7, WBP_ALetter},
@ -816,9 +844,14 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x10A38, 0x10A3A, WBP_Extend},
{0x10A3F, 0x10A3F, WBP_Extend},
{0x10A60, 0x10A7C, WBP_ALetter},
{0x10A80, 0x10A9C, WBP_ALetter},
{0x10AC0, 0x10AC7, WBP_ALetter},
{0x10AC9, 0x10AE4, WBP_ALetter},
{0x10AE5, 0x10AE6, WBP_Extend},
{0x10B00, 0x10B35, WBP_ALetter},
{0x10B40, 0x10B55, WBP_ALetter},
{0x10B60, 0x10B72, WBP_ALetter},
{0x10B80, 0x10B91, WBP_ALetter},
{0x10C00, 0x10C48, WBP_ALetter},
{0x11000, 0x11000, WBP_Extend},
{0x11001, 0x11001, WBP_Extend},
@ -826,7 +859,7 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x11003, 0x11037, WBP_ALetter},
{0x11038, 0x11046, WBP_Extend},
{0x11066, 0x1106F, WBP_Numeric},
{0x11080, 0x11081, WBP_Extend},
{0x1107F, 0x11081, WBP_Extend},
{0x11082, 0x11082, WBP_Extend},
{0x11083, 0x110AF, WBP_ALetter},
{0x110B0, 0x110B2, WBP_Extend},
@ -842,6 +875,9 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x1112C, 0x1112C, WBP_Extend},
{0x1112D, 0x11134, WBP_Extend},
{0x11136, 0x1113F, WBP_Numeric},
{0x11150, 0x11172, WBP_ALetter},
{0x11173, 0x11173, WBP_Extend},
{0x11176, 0x11176, WBP_ALetter},
{0x11180, 0x11181, WBP_Extend},
{0x11182, 0x11182, WBP_Extend},
{0x11183, 0x111B2, WBP_ALetter},
@ -850,6 +886,68 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x111BF, 0x111C0, WBP_Extend},
{0x111C1, 0x111C4, WBP_ALetter},
{0x111D0, 0x111D9, WBP_Numeric},
{0x111DA, 0x111DA, WBP_ALetter},
{0x11200, 0x11211, WBP_ALetter},
{0x11213, 0x1122B, WBP_ALetter},
{0x1122C, 0x1122E, WBP_Extend},
{0x1122F, 0x11231, WBP_Extend},
{0x11232, 0x11233, WBP_Extend},
{0x11234, 0x11234, WBP_Extend},
{0x11235, 0x11235, WBP_Extend},
{0x11236, 0x11237, WBP_Extend},
{0x112B0, 0x112DE, WBP_ALetter},
{0x112DF, 0x112DF, WBP_Extend},
{0x112E0, 0x112E2, WBP_Extend},
{0x112E3, 0x112EA, WBP_Extend},
{0x112F0, 0x112F9, WBP_Numeric},
{0x11301, 0x11301, WBP_Extend},
{0x11302, 0x11303, WBP_Extend},
{0x11305, 0x1130C, WBP_ALetter},
{0x1130F, 0x11310, WBP_ALetter},
{0x11313, 0x11328, WBP_ALetter},
{0x1132A, 0x11330, WBP_ALetter},
{0x11332, 0x11333, WBP_ALetter},
{0x11335, 0x11339, WBP_ALetter},
{0x1133C, 0x1133C, WBP_Extend},
{0x1133D, 0x1133D, WBP_ALetter},
{0x1133E, 0x1133F, WBP_Extend},
{0x11340, 0x11340, WBP_Extend},
{0x11341, 0x11344, WBP_Extend},
{0x11347, 0x11348, WBP_Extend},
{0x1134B, 0x1134D, WBP_Extend},
{0x11357, 0x11357, WBP_Extend},
{0x1135D, 0x11361, WBP_ALetter},
{0x11362, 0x11363, WBP_Extend},
{0x11366, 0x1136C, WBP_Extend},
{0x11370, 0x11374, WBP_Extend},
{0x11480, 0x114AF, WBP_ALetter},
{0x114B0, 0x114B2, WBP_Extend},
{0x114B3, 0x114B8, WBP_Extend},
{0x114B9, 0x114B9, WBP_Extend},
{0x114BA, 0x114BA, WBP_Extend},
{0x114BB, 0x114BE, WBP_Extend},
{0x114BF, 0x114C0, WBP_Extend},
{0x114C1, 0x114C1, WBP_Extend},
{0x114C2, 0x114C3, WBP_Extend},
{0x114C4, 0x114C5, WBP_ALetter},
{0x114C7, 0x114C7, WBP_ALetter},
{0x114D0, 0x114D9, WBP_Numeric},
{0x11580, 0x115AE, WBP_ALetter},
{0x115AF, 0x115B1, WBP_Extend},
{0x115B2, 0x115B5, WBP_Extend},
{0x115B8, 0x115BB, WBP_Extend},
{0x115BC, 0x115BD, WBP_Extend},
{0x115BE, 0x115BE, WBP_Extend},
{0x115BF, 0x115C0, WBP_Extend},
{0x11600, 0x1162F, WBP_ALetter},
{0x11630, 0x11632, WBP_Extend},
{0x11633, 0x1163A, WBP_Extend},
{0x1163B, 0x1163C, WBP_Extend},
{0x1163D, 0x1163D, WBP_Extend},
{0x1163E, 0x1163E, WBP_Extend},
{0x1163F, 0x11640, WBP_Extend},
{0x11644, 0x11644, WBP_ALetter},
{0x11650, 0x11659, WBP_Numeric},
{0x11680, 0x116AA, WBP_ALetter},
{0x116AB, 0x116AB, WBP_Extend},
{0x116AC, 0x116AC, WBP_Extend},
@ -859,16 +957,36 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x116B6, 0x116B6, WBP_Extend},
{0x116B7, 0x116B7, WBP_Extend},
{0x116C0, 0x116C9, WBP_Numeric},
{0x12000, 0x1236E, WBP_ALetter},
{0x12400, 0x12462, WBP_ALetter},
{0x118A0, 0x118DF, WBP_ALetter},
{0x118E0, 0x118E9, WBP_Numeric},
{0x118FF, 0x118FF, WBP_ALetter},
{0x11AC0, 0x11AF8, WBP_ALetter},
{0x12000, 0x12398, WBP_ALetter},
{0x12400, 0x1246E, WBP_ALetter},
{0x13000, 0x1342E, WBP_ALetter},
{0x16800, 0x16A38, WBP_ALetter},
{0x16A40, 0x16A5E, WBP_ALetter},
{0x16A60, 0x16A69, WBP_Numeric},
{0x16AD0, 0x16AED, WBP_ALetter},
{0x16AF0, 0x16AF4, WBP_Extend},
{0x16B00, 0x16B2F, WBP_ALetter},
{0x16B30, 0x16B36, WBP_Extend},
{0x16B40, 0x16B43, WBP_ALetter},
{0x16B50, 0x16B59, WBP_Numeric},
{0x16B63, 0x16B77, WBP_ALetter},
{0x16B7D, 0x16B8F, WBP_ALetter},
{0x16F00, 0x16F44, WBP_ALetter},
{0x16F50, 0x16F50, WBP_ALetter},
{0x16F51, 0x16F7E, WBP_Extend},
{0x16F8F, 0x16F92, WBP_Extend},
{0x16F93, 0x16F9F, WBP_ALetter},
{0x1B000, 0x1B000, WBP_Katakana},
{0x1BC00, 0x1BC6A, WBP_ALetter},
{0x1BC70, 0x1BC7C, WBP_ALetter},
{0x1BC80, 0x1BC88, WBP_ALetter},
{0x1BC90, 0x1BC99, WBP_ALetter},
{0x1BC9D, 0x1BC9E, WBP_Extend},
{0x1BCA0, 0x1BCA3, WBP_Format},
{0x1D165, 0x1D166, WBP_Extend},
{0x1D167, 0x1D169, WBP_Extend},
{0x1D16D, 0x1D172, WBP_Extend},
@ -908,6 +1026,8 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x1D7AA, 0x1D7C2, WBP_ALetter},
{0x1D7C4, 0x1D7CB, WBP_ALetter},
{0x1D7CE, 0x1D7FF, WBP_Numeric},
{0x1E800, 0x1E8C4, WBP_ALetter},
{0x1E8D0, 0x1E8D6, WBP_Extend},
{0x1EE00, 0x1EE03, WBP_ALetter},
{0x1EE05, 0x1EE1F, WBP_ALetter},
{0x1EE21, 0x1EE22, WBP_ALetter},
@ -941,7 +1061,10 @@ static struct WordBreakProperties wb_prop_default[] = {
{0x1EEA1, 0x1EEA3, WBP_ALetter},
{0x1EEA5, 0x1EEA9, WBP_ALetter},
{0x1EEAB, 0x1EEBB, WBP_ALetter},
{0x1F1E6, 0x1F1FF, WBP_Regional},
{0x1F130, 0x1F149, WBP_ALetter},
{0x1F150, 0x1F169, WBP_ALetter},
{0x1F170, 0x1F189, WBP_ALetter},
{0x1F1E6, 0x1F1FF, WBP_Regional_Indicator},
{0xE0001, 0xE0001, WBP_Format},
{0xE0020, 0xE007F, WBP_Format},
{0xE0100, 0xE01EF, WBP_Extend},

View File

@ -4,8 +4,7 @@
* Word breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2013 Tom Hacohen <tom at stosb dot com>
* Copyright (C) 2013 Petr Filipsky <philodej at gmail dot com>
* Copyright (C) 2013-15 Tom Hacohen <tom at stosb dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
@ -31,9 +30,8 @@
* Unicode 6.0.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
*
* This library has been updated according to Revision 21, for
* Unicode 6.2.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-21.html>
* This library has been updated according to Revision 25, for
* Unicode 7.0.0:
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
@ -45,11 +43,12 @@
* Definitions of internal data structures, declarations of global
* variables, and function prototypes for the word breaking algorithm.
*
* @version 2.4, 2013/11/10
* @version 2.6, 2015/04/19
* @author Tom Hacohen
* @author Petr Filipsky
*/
#include "unibreakdef.h"
/**
* Word break classes. This is a direct mapping of Table 3 of Unicode
* Standard Annex 29, Revision 23.
@ -61,18 +60,18 @@ enum WordBreakClass
WBP_LF,
WBP_Newline,
WBP_Extend,
WBP_Regional_Indicator,
WBP_Format,
WBP_Katakana,
WBP_Hebrew_Letter,
WBP_ALetter,
WBP_Single_Quote,
WBP_Double_Quote,
WBP_MidNumLet,
WBP_MidLetter,
WBP_MidNum,
WBP_Numeric,
WBP_ExtendNumLet,
WBP_Regional,
WBP_Hebrew,
WBP_Single,
WBP_Double,
WBP_Any
};
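
Note: the range table above pairs each code-point interval with one of these enum values. A minimal sketch of how such a table could be searched follows, assuming the rows are sorted by start and do not overlap. The `Sketch*`/`SK_*` names and `sketch_lookup` are hypothetical stand-ins, not the library's own types or functions; the rows are copied from the diff.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the library's types; only a few classes
 * are reproduced here. */
enum SketchWordBreakClass { SK_Extend, SK_ALetter, SK_Numeric, SK_Any };

struct SketchRange {
    uint32_t start;                   /* first code point of the range */
    uint32_t end;                     /* last code point, inclusive */
    enum SketchWordBreakClass prop;   /* class shared by the whole range */
};

/* Rows copied from the table in the diff; assumed sorted by start. */
static const struct SketchRange sketch_table[] = {
    {0x11680, 0x116AA, SK_ALetter},
    {0x116AB, 0x116AB, SK_Extend},
    {0x116C0, 0x116C9, SK_Numeric},
    {0x118A0, 0x118DF, SK_ALetter},
    {0x118E0, 0x118E9, SK_Numeric},
};

/* Binary search over sorted, non-overlapping inclusive ranges.
 * Returns SK_Any when no range contains ch. */
static enum SketchWordBreakClass
sketch_lookup(const struct SketchRange *table, size_t len, uint32_t ch)
{
    size_t lo = 0, hi = len;          /* search window is [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (ch < table[mid].start)
            hi = mid;                 /* ch lies before this range */
        else if (ch > table[mid].end)
            lo = mid + 1;             /* ch lies after this range */
        else
            return table[mid].prop;   /* start <= ch <= end */
    }
    return SK_Any;
}
```

Code points that fall in a gap between rows, such as 0x116B0 in this excerpt, come back as the fallback class, which is the role `WBP_Any` plays at the end of the enum.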