.. default-domain:: chpl .. module:: String :synopsis: The following documentation shows functions and methods used to Strings ======= The following documentation shows functions and methods used to manipulate and process Chapel strings. Methods Available in Other Modules ---------------------------------- Besides the functions below, some other modules provide routines that are useful for working with strings. The :mod:`IO` module provides `IO.string.format` which creates a string that is the result of formatting. It also includes functions for reading and writing strings. The :mod:`Regexp` module also provides some routines for searching within strings. Casts from String to a Numeric Type ----------------------------------- The :mod:`string ` type supports casting to numeric types. Such casts will convert the string to the numeric type and throw an error if the string is invalid. For example: .. code-block:: chapel var number = "a":int; throws an error when it is executed, but .. code-block:: chapel var number = "1":int; stores the value ``1`` in ``number``. To learn more about handling these errors, see the :ref:`Error Handling technical note `. Unicode Support --------------- Chapel strings use the UTF-8 encoding. Note that ASCII strings are a simple subset of UTF-8 strings, because every ASCII character is a UTF-8 character with the same meaning. UTF-8 strings might not work properly if a UTF-8 environment is not used. See :ref:`character set environment ` for more information. .. _string.nonunicode: Non-Unicode Data and Chapel Strings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For doing string operations on non-Unicode or arbitrary data, consider using :mod:`bytes ` instead of string. However, there may be cases where :mod:`string ` must be used with non-Unicode data. Examples of this are file system and path operations on systems where UTF-8 file names are not enforced. In such scenarios, non-UTF-8 data can be escaped and stored in a string in a way that it can be restored when needed. For example: .. code-block:: chapel var myBytes = b"Illegal \xff sequence"; // \xff is non UTF-8 var myEscapedString = myBytes.decode(policy=decodePolicy.escape); will escape the illegal `0xFF` byte and store it in the string. The escaping strategy is similar to Python's "surrogate escapes" and is as follows. - Each individual byte in an illegal sequence is bitwise-or'ed with `0xDC00` to create a 2-byte codepoint. - Then, this codepoint is encoded in UTF-8 and stored in the string buffer. This strategy typically results in storing 3 bytes for each byte in the illegal sequence. Similarly escaped strings can also be created with :proc:`createStringWithNewBuffer` using a C buffer. An escaped data sequence can be reconstructed with :proc:`~string.encode`: .. code-block:: chapel var reconstructedBytes = myEscapedString.encode(policy=encodePolicy.unescape); writeln(myBytes == reconstructedBytes); // prints true Alternatively, escaped sequence can be used as-is without reconstructing the bytes: .. code-block:: chapel var escapedBytes = myEscapedString.encode(policy=encodePolicy.pass); writeln(myBytes == escapedBytes); // prints false .. note:: Strings that contain escaped sequences cannot be directly used with unformatted I/O functions such as ``writeln``. :ref:`Formatted I/O ` can be used to print such strings with binary formatters such as ``%|s``. .. note:: The standard :mod:`FileSystem`, :mod:`Path` and :mod:`IO` modules can use escaped strings as described above for paths and file names. Lengths and Offsets in Unicode Strings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For Unicode strings, and in particular UTF-8 strings, there are several possible units for offsets or lengths: * bytes * codepoints * graphemes Most methods on the Chapel string type currently work with codepoint units by default. For example, :proc:`~string.size` returns the length in codepoints and `int` values passed into :proc:`~string.this` are offsets in codepoint units. It is possible to indicate byte or codepoint units for indexing in the string methods by using arguments of type ``byteIndex`` or ``codepointIndex`` respectively. For speed of indexing with their result values, :proc:`~string.find()` and :proc:`~string.rfind()` return a ``byteIndex``. .. note:: Support for grapheme units is not implemented at this time. Using the ``byteIndex`` and ``codepointIndex`` types ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A value of type ``byteIndex`` or ``codepointIndex`` can be passed to certain ``string`` functions to indicate that the function should operate with units of bytes or codepoints. Passing a ``codepointIndex`` has the same behavior as passing an integral type. See :proc:`~string.this` for an example. Both of these types can be created from an ``int`` via assignment or cast. They also support addition and subtraction with ``int``. Finally, values of same types can be compared. For example, the following function returns a string containing only the second byte of the argument: .. code-block:: chapel proc getSecondByte(arg:string) { var offsetInBytes = 1:byteIndex; return arg[offsetInBytes]; } Whereas the following function returns a string containing only the second codepoint of the argument: .. code-block:: chapel proc getSecondCodepoint(arg:string) { var offsetInCodepoints = 1:codepointIndex; return arg[offsetInCodepoints]; } .. function:: proc createStringWithBorrowedBuffer(x: string) Creates a new string which borrows the internal buffer of another string. If the buffer is freed before the string returned from this function, accessing it is undefined behavior. :arg x: Object to borrow the buffer from :type x: `string` :returns: A new `string` .. function:: proc createStringWithBorrowedBuffer(x: c_string, length = x.size) throws Creates a new string which borrows the internal buffer of a `c_string`. If the buffer is freed before the string returned from this function, accessing it is undefined behavior. :arg x: Object to borrow the buffer from :type x: `c_string` :arg length: Length of the string stored in `x` in bytes, excluding the terminating null byte. :type length: `int` :throws: `DecodeError` if `x` contains non-UTF-8 characters. :returns: A new `string` .. function:: proc createStringWithBorrowedBuffer(x: c_ptr(?t), length: int, size: int) throws Creates a new string which borrows the memory allocated for a `c_ptr`. If the buffer is freed before the string returned from this function, accessing it is undefined behavior. :arg x: Object to borrow the buffer from :type x: `c_ptr(uint(8))` or `c_ptr(c_char)` :arg length: Length of the string stored in `x` in bytes, excluding the terminating null byte. :type length: `int` :arg size: Size of memory allocated for `x` in bytes :type length: `int` :throws: `DecodeError` if `x` contains non-UTF-8 characters. :returns: A new `string` .. function:: proc createStringWithOwnedBuffer(x: c_string, length = x.size) throws Creates a new string which takes ownership of the internal buffer of a `c_string`. The buffer will be freed when the string is deinitialized. :arg x: Object to take ownership of the buffer from :type x: `c_string` :arg length: Length of the string stored in `x` in bytes, excluding the terminating null byte. :type length: `int` :throws: `DecodeError` if `x` contains non-UTF-8 characters. :returns: A new `string` .. function:: proc createStringWithOwnedBuffer(x: c_ptr(?t), length: int, size: int) throws Creates a new string which takes ownership of the memory allocated for a `c_ptr`. The buffer will be freed when the string is deinitialized. :arg x: Object to take ownership of the buffer from :type x: `c_ptr(uint(8))` or `c_ptr(c_char)` :arg length: Length of the string stored in `x` in bytes, excluding the terminating null byte. :type length: `int` :arg size: Size of memory allocated for `x` in bytes :type length: `int` :throws: `DecodeError` if `x` contains non-UTF-8 characters. :returns: A new `string` .. function:: proc createStringWithNewBuffer(x: string) Creates a new string by creating a copy of the buffer of another string. :arg x: Object to copy the buffer from :type x: `string` :returns: A new `string` .. function:: proc createStringWithNewBuffer(x: c_string, length = x.size, policy = decodePolicy.strict) throws Creates a new string by creating a copy of the buffer of a `c_string`. :arg x: Object to copy the buffer from :type x: `c_string` :arg length: Length of the string stored in `x` in bytes, excluding the terminating null byte. :type length: `int` :arg policy: - `decodePolicy.strict` raises an error - `decodePolicy.replace` replaces the malformed character with UTF-8 replacement character - `decodePolicy.drop` drops the data silently - `decodePolicy.escape` escapes each illegal byte with private use codepoints :throws: `DecodeError` if `decodePolicy.strict` is passed to the `policy` argument and `x` contains non-UTF-8 characters. :returns: A new `string` .. function:: proc createStringWithNewBuffer(x: c_ptr(?t), length: int, size = length+1, policy = decodePolicy.strict) throws Creates a new string by creating a copy of a buffer. :arg x: The buffer to copy :type x: `c_ptr(uint(8))` or `c_ptr(c_char)` :arg length: Length of the string stored in `x` in bytes, excluding the terminating null byte. :type length: `int` :arg size: Size of memory allocated for `x` in bytes. This argument is ignored by this function. :type size: `int` :arg policy: `decodePolicy.strict` raises an error, `decodePolicy.replace` replaces the malformed character with UTF-8 replacement character, `decodePolicy.drop` drops the data silently, `decodePolicy.escape` escapes each illegal byte with private use codepoints :throws: `DecodeError` if `x` contains non-UTF-8 characters. :returns: A new `string` .. function:: proc cStrAssignmentDeprWarn() .. method:: proc string.size :returns: The number of codepoints in the string. .. method:: proc string.indices :returns: The indices that can be used to index into the string (i.e., the range ``0..` that is on the currently executing locale. :returns: A shallow copy if the :mod:`string ` is already on the current locale, otherwise a deep copy is performed. .. method:: proc string.c_str(): c_string Get a `c_string` from a :mod:`string `. .. warning:: This can only be called safely on a :mod:`string ` whose home is the current locale. This property can be enforced by calling :proc:`string.localize()` before :proc:`~string.c_str()`. If the string is remote, the program will halt. For example: .. code-block:: chapel var my_string = "Hello!"; on different_locale { printf("%s", my_string.localize().c_str()); } :returns: A `c_string` that points to the underlying buffer used by this :mod:`string `. The returned `c_string` is only valid when used on the same locale as the string. .. method:: proc string.encode(policy = encodePolicy.pass): bytes Returns a :mod:`bytes ` from the given :mod:`string `. If the string contains some escaped non-UTF8 bytes, `policy` argument determines the action. :arg policy: `encodePolicy.pass` directly copies the (potentially escaped) data, `encodePolicy.unescape` recovers the escaped bytes back. :returns: :mod:`bytes ` .. itermethod:: iter string.items(): string Iterates over the string character by character. For example: .. code-block:: chapel var str = "abcd"; for c in str { writeln(c); } Output:: a b c d .. itermethod:: iter string.these(): string Iterates over the string character by character, yielding 1-codepoint strings. (A synonym for :iter:`string.items`) For example: .. code-block:: chapel var str = "abcd"; for c in str { writeln(c); } Output:: a b c d .. itermethod:: iter string.bytes(): byteType Iterates over the string byte by byte. .. itermethod:: iter string.codepoints(): int(32) Iterates over the string Unicode character by Unicode character. .. method:: proc string.toByte(): uint(8) :returns: The value of a single-byte string as an integer. .. method:: proc string.byte(i: int): uint(8) :returns: The value of the `i` th byte as an integer. .. method:: proc string.toCodepoint(): int(32) :returns: The value of a single-codepoint string as an integer. .. method:: proc string.codepoint(i: int): int(32) :returns: The value of the `i` th multibyte character as an integer. .. method:: proc string.this(i: byteIndex): string Return the codepoint starting at the `i` th byte in the string :returns: A string with the complete multibyte character starting at the specified byte index from ``0..#string.numBytes`` .. method:: proc string.this(i: codepointIndex): string Return the `i` th codepoint in the string. (A synonym for :proc:`string.item`) :returns: A string with the complete multibyte character starting at the specified codepoint index from ``0..#string.numCodepoints`` .. method:: proc string.this(i: int): string Return the `i` th codepoint in the string. (A synonym for :proc:`string.item`) :returns: A string with the complete multibyte character starting at the specified codepoint index from ``1..string.numCodepoints`` .. method:: proc string.item(i: codepointIndex): string Return the `i` th codepoint in the string :returns: A string with the complete multibyte character starting at the specified codepoint index from ``1..string.numCodepoints`` .. method:: proc string.item(i: int): string Return the `i` th codepoint in the string :returns: A string with the complete multibyte character starting at the specified codepoint index from ``0..#string.numCodepoints`` .. method:: proc string.this(r: range(?)): string Slice a string. Halts if r is non-empty and not completely inside the range ``0.. 0``, remove ``columns`` leading whitespace characters from each line. Tabs are not considered whitespace when ``columns > 0``, so only leading spaces are removed. :arg columns: The number of columns of indentation to remove. Infer common leading whitespace if ``columns == 0``. :arg ignoreFirst: When ``true``, ignore first line when determining the common leading whitespace, and make no changes to the first line. :returns: A new `string` with indentation removed. .. warning:: ``string.dedent`` is not considered stable and is subject to change in future Chapel releases. .. method:: proc string.isUpper(): bool Checks if all the characters in the string are either uppercase (A-Z) or uncased (not a letter). :returns: * `true` -- if the string contains at least one uppercase character and no lowercase characters, ignoring uncased characters. * `false` -- otherwise .. method:: proc string.isLower(): bool Checks if all the characters in the string are either lowercase (a-z) or uncased (not a letter). :returns: * `true` -- when there are no uppercase characters in the string. * `false` -- otherwise .. method:: proc string.isSpace(): bool Checks if all the characters in the string are whitespace (' ', '\t', '\n', '\v', '\f', '\r'). :returns: * `true` -- when all the characters are whitespace. * `false` -- otherwise .. method:: proc string.isAlpha(): bool Checks if all the characters in the string are alphabetic (a-zA-Z). :returns: * `true` -- when the characters are alphabetic. * `false` -- otherwise .. method:: proc string.isDigit(): bool Checks if all the characters in the string are digits (0-9). :returns: * `true` -- when the characters are digits. * `false` -- otherwise .. method:: proc string.isAlnum(): bool Checks if all the characters in the string are alphanumeric (a-zA-Z0-9). :returns: * `true` -- when the characters are alphanumeric. * `false` -- otherwise .. method:: proc string.isPrintable(): bool Checks if all the characters in the string are printable. :returns: * `true` -- when the characters are printable. * `false` -- otherwise .. method:: proc string.isTitle(): bool Checks if all uppercase characters are preceded by uncased characters, and if all lowercase characters are preceded by cased characters. :returns: * `true` -- when the condition described above is met. * `false` -- otherwise .. method:: proc string.toLower(): string :returns: A new string with all uppercase characters replaced with their lowercase counterpart. .. note:: The case change operation is not currently performed on characters whose cases take different number of bytes to represent in Unicode mapping. .. method:: proc string.toUpper(): string :returns: A new string with all lowercase characters replaced with their uppercase counterpart. .. note:: The case change operation is not currently performed on characters whose cases take different number of bytes to represent in Unicode mapping. .. method:: proc string.toTitle(): string :returns: A new string with all cased characters following an uncased character converted to uppercase, and all cased characters following another cased character converted to lowercase. .. note:: The case change operation is not currently performed on characters whose cases take different number of bytes to represent in Unicode mapping. .. function:: proc =(ref lhs: string, rhs: string) Copies the string `rhs` into the string `lhs`. .. function:: proc =(ref lhs: string, rhs_c: c_string) Copies the c_string `rhs_c` into the string `lhs`. Halts if `lhs` is a remote string. .. function:: proc +(s0: string, s1: string) :returns: A new string which is the result of concatenating `s0` and `s1` .. function:: proc *(s: string, n: integral) :returns: A new string which is the result of repeating `s` `n` times. If `n` is less than or equal to 0, an empty string is returned. For example: .. code-block:: chapel writeln("Hello! " * 3); Results in:: Hello! Hello! Hello! .. function:: proc +=(ref lhs: string, const ref rhs: string): void Appends the string `rhs` to the string `lhs`. .. function:: proc codepointToString(i: int(32)) :returns: A string storing the complete multibyte character sequence that corresponds to the codepoint value `i`.