% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/search.R
\name{stringi-search-charclass}
\alias{stringi-search-charclass}
\title{Character Classes in \pkg{stringi}}
\description{
This man page describes how character classes are
declared in the \pkg{stringi} package,
so that you can e.g. search for their occurrences in text
or generate random code points with \code{\link{stri_rand_strings}}.
The \pkg{ICU} regex engine uses the same
scheme for denoting character classes.
}
\details{
All \code{stri_*_charclass} functions in \pkg{stringi} perform
search-based operations on single characters (i.e. Unicode code points).
Since stringi_0.2-1 you may obtain
roughly the same results using \link{stringi-search-regex}.
However, the charclass-based functions aim to be faster.

Character classes are defined using \pkg{ICU}'s \code{UnicodeSet}
patterns. Below we briefly summarize their syntax.
For more details refer to the bibliographic References below.
}
\section{\code{UnicodeSet} patterns}{


A \code{UnicodeSet} represents a subset of Unicode code points
(recall that \pkg{stringi} converts strings in your native encoding
to Unicode automatically). Legal code points are U+0000 to U+10FFFF,
inclusive.

A pattern consists either of a series of characters bounded by square brackets
(such patterns follow a syntax similar to that employed
by version 8 regular expression character classes)
or of a Perl-like Unicode property set specifier.

\code{[]} denotes an empty set, \code{[a]} --
a set consisting of character ``a'',
\code{[\\u0105]} -- a set with character U+0105,
and \code{[abc]} -- a set with ``a'', ``b'', and ``c''.

\code{[a-z]} denotes a set consisting of characters
``a'' through ``z'' inclusively, in Unicode code point order.

Some set-theoretic operations are available.
\code{^} denotes the complement, e.g. \code{[^a-z]} contains
all characters but ``a'' through ``z''.
On the other hand, \code{[[pat1][pat2]]},
\code{[[pat1]\&[pat2]]}, and \code{[[pat1]-[pat2]]}
denote union, intersection, and asymmetric difference of sets
specified by \code{pat1} and \code{pat2}, respectively.

Note that whitespace is ignored unless it is quoted or escaped with a backslash
(it can be used freely for clarity, as \code{[a c d-f m]}
means the same as \code{[acd-fm]}).
\pkg{stringi} does not allow for including so-called multicharacter strings
(see \code{UnicodeSet} API documentation).
Also, empty string patterns are disallowed.

Any character may be preceded by
a backslash in order to remove any special meaning.

A malformed pattern always results in an error.
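
As a minimal sketch of the syntax summarized above, the following R snippet
counts the code points of a string that fall into a few sets.
It assumes that \pkg{stringi} is installed; the sample string and the
variable names are made up for illustration.

```r
library(stringi)

x <- "abc xyz 123"  # hypothetical sample input

n_range <- stri_count_charclass(x, "[a-z]")            # letters a through z
n_compl <- stri_count_charclass(x, "[^a-z]")           # complement of the above
n_union <- stri_count_charclass(x, "[[a-z][0-9]]")     # union
n_inter <- stri_count_charclass(x, "[[a-z]&[aeiou]]")  # intersection
n_diff  <- stri_count_charclass(x, "[[a-z]-[aeiou]]")  # asymmetric difference

# whitespace inside a pattern is ignored, so both patterns denote the same set
same <- stri_count_charclass(x, "[a c d-f m]") ==
        stri_count_charclass(x, "[acd-fm]")
```

Counting is used here only because it gives a single number per pattern;
the same patterns can be passed to the other \code{stri_*_charclass} functions.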

Set expressions at a glance
(according to \url{http://userguide.icu-project.org/strings/regexp}):


Some examples:

\describe{
\item{\code{[abc]}}{Match any of the characters a, b or c.}
\item{\code{[^abc]}}{Negation -- match any character except a, b or c.}
\item{\code{[A-M]}}{Range -- match any character from A to M. The characters
   to include are determined by Unicode code point ordering.}
\item{\code{[\\u0000-\\U0010ffff]}}{Range -- match all characters.}
\item{\code{[\\p{Letter}]} or \code{[\\p{General_Category=Letter}]} or \code{[\\p{L}]}}{
   Characters with Unicode Category = Letter. All forms shown are equivalent.}
\item{\code{[\\P{Letter}]}}{Negated property.
   (Upper case \code{\\P}) Match everything except Letters.}
\item{\code{[\\p{numeric_value=9}]}}{Match all numbers with a numeric value of 9.
   Any Unicode Property may be used in set expressions.}
\item{\code{[\\p{Letter}&&\\p{script=cyrillic}]}}{Logical AND
   or intersection -- match the set of all Cyrillic letters.}
\item{\code{[\\p{Letter}--\\p{script=latin}]}}{Subtraction --
   match all non-Latin letters.}
\item{\code{[[a-z][A-Z][0-9]]} or \code{[a-zA-Z0-9]}}{Implicit Logical
   OR or Union of Sets -- the examples match ASCII letters and digits.
   The two forms are equivalent.}
\item{\code{[:script=Greek:]}}{Alternate POSIX-like syntax for properties --
   equivalent to \code{\\p{script=Greek}}.}
}
}

\section{Unicode properties}{


Unicode property sets are specified with a POSIX-like syntax,
e.g. \code{[:Letter:]},
or with an (extended) Perl-style syntax, e.g. \code{\\p{L}}.
The complements of the above sets are
\code{[:^Letter:]} and \code{\\P{L}}, respectively.

The properties' names are normalized before matching
(for example, the match is case-insensitive).
Moreover, many names have short aliases.

Among predefined Unicode properties we find e.g.
\itemize{
\item Unicode General Categories, e.g. \code{Lu} for uppercase letters,
\item Unicode Binary Properties, e.g. \code{WHITE_SPACE},
}
and many more (including Unicode scripts).

Each property provides access to the large and comprehensive
Unicode Character Database.
Generally, the list of properties available in \pkg{ICU}
is not perfectly documented. Please refer to the References section
for some links.

Please note that some classes may seem to overlap.
However, e.g. General Category \code{Z} (some space) and Binary Property
\code{WHITE_SPACE} match different character sets.
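
The points above can be illustrated with a short R sketch, assuming that
\pkg{stringi} is installed. The POSIX-like form is used throughout so that
no backslash escaping is needed; the sample inputs and variable names are
hypothetical.

```r
library(stringi)

x <- "ab1"  # hypothetical sample input

# property names are matched loosely, and L is a short alias of Letter
n_long  <- stri_count_charclass(x, "[:Letter:]")
n_lower <- stri_count_charclass(x, "[:letter:]")
n_short <- stri_count_charclass(x, "[:L:]")
n_neg   <- stri_count_charclass(x, "[:^Letter:]")  # complement

# TAB (U+0009) is a control character (Cc), yet it has the White_Space property
tab <- intToUtf8(9L)
tab_is_ws <- stri_detect_charclass(tab, "[:White_Space:]")
tab_is_z  <- stri_detect_charclass(tab, "[:Z:]")
```

The last two calls should disagree, which is one concrete case where
\code{Z} and \code{WHITE_SPACE} differ.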
}

\section{Unicode General Categories}{


The Unicode General Category property of a code point provides the most
general classification of that code point.
Each code point falls into one and only one Category.

\describe{
 \item{\code{Cc}}{a C0 or C1 control code.}
 \item{\code{Cf}}{a format control character.}
 \item{\code{Cn}}{a reserved unassigned code point or a non-character.}
 \item{\code{Co}}{a private-use character.}
 \item{\code{Cs}}{a surrogate code point.}
 \item{\code{Lc}}{the union of Lu, Ll, Lt.}
 \item{\code{Ll}}{a lowercase letter.}
 \item{\code{Lm}}{a modifier letter.}
 \item{\code{Lo}}{other letters, including syllables and ideographs.}
 \item{\code{Lt}}{a digraphic character, with first part uppercase.}
 \item{\code{Lu}}{an uppercase letter.}
 \item{\code{Mc}}{a spacing combining mark (positive advance width).}
 \item{\code{Me}}{an enclosing combining mark.}
 \item{\code{Mn}}{a non-spacing combining mark (zero advance width).}
 \item{\code{Nd}}{a decimal digit.}
 \item{\code{Nl}}{a letter-like numeric character.}
 \item{\code{No}}{a numeric character of other type.}
 \item{\code{Pd}}{a dash or hyphen punctuation mark.}
 \item{\code{Ps}}{an opening punctuation mark (of a pair).}
 \item{\code{Pe}}{a closing punctuation mark (of a pair).}
 \item{\code{Pc}}{a connecting punctuation mark, like a tie.}
 \item{\code{Po}}{a punctuation mark of other type.}
 \item{\code{Pi}}{an initial quotation mark.}
 \item{\code{Pf}}{a final quotation mark.}
 \item{\code{Sm}}{a symbol of mathematical use.}
 \item{\code{Sc}}{a currency sign.}
 \item{\code{Sk}}{a non-letter-like modifier symbol.}
 \item{\code{So}}{a symbol of other type.}
 \item{\code{Zs}}{a space character (of non-zero width).}
 \item{\code{Zl}}{U+2028 LINE SEPARATOR only.}
 \item{\code{Zp}}{U+2029 PARAGRAPH SEPARATOR only.}
 \item{\code{C} }{the union of Cc, Cf, Cs, Co, Cn.}
 \item{\code{L} }{the union of Lu, Ll, Lt, Lm, Lo.}
 \item{\code{M} }{the union of Mn, Mc, Me.}
 \item{\code{N} }{the union of Nd, Nl, No.}
 \item{\code{P} }{the union of Pc, Pd, Ps, Pe, Pi, Pf, Po.}
 \item{\code{S} }{the union of Sm, Sc, Sk, So.}
 \item{\code{Z} }{the union of Zs, Zl, Zp.}
}
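
A small sketch, assuming that \pkg{stringi} is installed, of how the
categories listed above partition the code points of a string.
The sample string and the variable names are hypothetical.

```r
library(stringi)

x <- "Hello, World 42!"  # hypothetical sample input

# the pattern argument is vectorized, so one call counts several categories
cats <- c("[:Lu:]", "[:Ll:]", "[:Nd:]", "[:P:]", "[:Zs:]")
counts <- stri_count_charclass(x, cats)
names(counts) <- cats

# L is the union of the letter subcategories
n_letters <- stri_count_charclass(x, "[:L:]")
```

Because each code point belongs to exactly one category and this sample
contains only the five categories queried, the counts should add up to
\code{stri_length(x)}.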
}

\section{Unicode Binary Properties}{


Each character may have many Binary Properties at a time.

Here is a comprehensive list of supported Binary Properties:

\describe{
  \item{\code{ALPHABETIC}     }{alphabetic character.}
  \item{\code{ASCII_HEX_DIGIT}}{a character matching the \code{[0-9A-Fa-f]} charclass.}
  \item{\code{BIDI_CONTROL}   }{a format control character that has a specific function
                             in the Bidi (bidirectional text) Algorithm.}
  \item{\code{BIDI_MIRRORED}  }{a character that may change display in right-to-left text.}
  \item{\code{DASH}           }{a kind of a dash character.}
  \item{\code{DEFAULT_IGNORABLE_CODE_POINT}}{characters that are ignorable in most
                               text processing activities,
                               e.g. <2060..206F, FFF0..FFFB, E0000..E0FFF>.}
  \item{\code{DEPRECATED}     }{a deprecated character according
          to the current Unicode standard (the usage of deprecated characters
          is strongly discouraged).}
  \item{\code{DIACRITIC}      }{a character that linguistically modifies
             the meaning of another character to which it applies.}
  \item{\code{EXTENDER}       }{a character that extends the value
                             or shape of a preceding alphabetic character,
                             e.g. a length and iteration mark.}
  \item{\code{HEX_DIGIT}      }{a character commonly
                            used for hexadecimal numbers,
                            cf. also \code{ASCII_HEX_DIGIT}.}
  \item{\code{HYPHEN}}{a dash used to mark connections between
              pieces of words, plus the Katakana middle dot.}
  \item{\code{ID_CONTINUE}}{a character that can continue an identifier,
                     \code{ID_START}+\code{Mn}+\code{Mc}+\code{Nd}+\code{Pc}.}
  \item{\code{ID_START}}{a character that can start an identifier,
                 \code{Lu}+\code{Ll}+\code{Lt}+\code{Lm}+\code{Lo}+\code{Nl}.}
  \item{\code{IDEOGRAPHIC}}{a CJKV (Chinese-Japanese-Korean-Vietnamese)
               ideograph.}
  \item{\code{LOWERCASE}}{}
  \item{\code{MATH}}{}
  \item{\code{NONCHARACTER_CODE_POINT}}{}
  \item{\code{QUOTATION_MARK}}{}
  \item{\code{SOFT_DOTTED}}{a character with a ``soft dot'', like i or j,
such that an accent placed on this character causes the dot to disappear.}
  \item{\code{TERMINAL_PUNCTUATION}}{a punctuation character that generally
marks the end of textual units.}
  \item{\code{UPPERCASE}}{}
  \item{\code{WHITE_SPACE}}{a space character or TAB or CR or LF or ZWSP or ZWNBSP.}
  \item{\code{CASE_SENSITIVE}}{}
  \item{\code{POSIX_ALNUM}}{}
  \item{\code{POSIX_BLANK}}{}
  \item{\code{POSIX_GRAPH}}{}
  \item{\code{POSIX_PRINT}}{}
  \item{\code{POSIX_XDIGIT}}{}
  \item{\code{CASED}}{}
  \item{\code{CASE_IGNORABLE}}{}
  \item{\code{CHANGES_WHEN_LOWERCASED}}{}
  \item{\code{CHANGES_WHEN_UPPERCASED}}{}
  \item{\code{CHANGES_WHEN_TITLECASED}}{}
  \item{\code{CHANGES_WHEN_CASEFOLDED}}{}
  \item{\code{CHANGES_WHEN_CASEMAPPED}}{}
  \item{\code{CHANGES_WHEN_NFKC_CASEFOLDED}}{}
}
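
As a brief sketch of how two of the properties listed above differ,
the snippet below compares \code{ASCII_HEX_DIGIT} with \code{HEX_DIGIT}.
It assumes that \pkg{stringi} is installed; the sample inputs and variable
names are hypothetical.

```r
library(stringi)

x <- "0x1F and zz"  # hypothetical sample input
n_hex <- stri_count_charclass(x, "[:ASCII_Hex_Digit:]")

# U+FF11 FULLWIDTH DIGIT ONE is a hexadecimal digit outside of ASCII
fw_one <- intToUtf8(0xFF11)
fw_is_hex       <- stri_detect_charclass(fw_one, "[:Hex_Digit:]")
fw_is_ascii_hex <- stri_detect_charclass(fw_one, "[:ASCII_Hex_Digit:]")
```

Note that \code{ASCII_HEX_DIGIT} also matches ordinary letters such as
the ``a'' and ``d'' in ``and'', since the property is defined per code point
and knows nothing about the surrounding text.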
}

\section{POSIX Character Classes}{


Beware of using POSIX character classes,
e.g. \code{[:punct:]}. The ICU User Guide (see below)
states that in general they are not well-defined, so you may end up
with something different than you expect.

In particular, in POSIX-like regex engines, \code{[:punct:]} stands for
the character class corresponding to the \code{ispunct()} classification
function (check out \code{man 3 ispunct} on UNIX-like systems).
According to ISO/IEC 9899:1990 (ISO C90), the \code{ispunct()} function
tests for any printing character except for space or a character
for which \code{isalnum()} is true. However, in a POSIX setting,
the details of which characters belong to which class depend
on the current locale. So the \code{[:punct:]} class does not lead
to portable code (again, in POSIX-like regex engines).

So a POSIX flavor of \code{[:punct:]} is more like
\code{[\\p{P}\\p{S}]} in \pkg{ICU}. You have been warned.
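
The difference can be checked with a short sketch, assuming that
\pkg{stringi} is installed and that \pkg{ICU} resolves \code{punct} as an
alias of General Category \code{P}. The sample string and variable names
are hypothetical.

```r
library(stringi)

x <- "a+b=c, ok!"  # hypothetical sample input

# ICU reading of punct: the plus and equals signs are symbols (Sm), not P
n_punct <- stri_count_charclass(x, "[:punct:]")

# closer to the POSIX reading: punctuation or symbols
n_p_s <- stri_count_charclass(x, "[[:P:][:S:]]")
```

Under that assumption the second count should be larger, as it also
includes ``+'' and ``=''.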
}
\references{
\emph{The Unicode Character Database} -- Unicode Standard Annex #44,
\url{http://www.unicode.org/reports/tr44/}

\emph{UnicodeSet} -- ICU User Guide,
\url{http://userguide.icu-project.org/strings/unicodeset}

\emph{Properties} -- ICU User Guide,
\url{http://userguide.icu-project.org/strings/properties}

\emph{C/POSIX Migration} -- ICU User Guide,
\url{http://userguide.icu-project.org/posix}

\emph{Unicode Script Data}, \url{http://www.unicode.org/Public/UNIDATA/Scripts.txt}

\emph{icu::Unicodeset Class Reference} -- ICU4C API Documentation,
\url{http://www.icu-project.org/apiref/icu4c/classicu_1_1UnicodeSet.html}
}
\seealso{
Other search_charclass: \code{\link{stri_trim}},
  \code{\link{stri_trim_both}},
  \code{\link{stri_trim_left}},
  \code{\link{stri_trim_right}};
  \code{\link{stringi-search}}

Other stringi_general_topics: \code{\link{stringi-arguments}};
  \code{\link{stringi-encoding}};
  \code{\link{stringi-locale}};
  \code{\link{stringi-search-boundaries}};
  \code{\link{stringi-search-coll}};
  \code{\link{stringi-search-fixed}};
  \code{\link{stringi-search-regex}};
  \code{\link{stringi-search}}; \code{\link{stringi}},
  \code{\link{stringi-package}}
}

