Package com.google.common.base
Class Utf8
- java.lang.Object
  - com.google.common.base.Utf8
@Beta @GwtCompatible(emulated=true) public final class Utf8 extends Object
Low-level, high-performance utility methods related to the UTF-8 character encoding. UTF-8 is defined in section D92 of The Unicode Standard Core Specification, Chapter 3.

The variant of UTF-8 implemented by this class is the restricted definition of UTF-8 introduced in Unicode 3.1. One implication of this is that it rejects "non-shortest form" byte sequences, even though the JDK decoder may accept them.
- Since: 16.0
- Author: Martin Buchholz, Clément Roux
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| static int | encodedLength(CharSequence sequence) | Returns the number of bytes in the UTF-8-encoded form of sequence. |
| static boolean | isWellFormed(byte[] bytes) | Returns true if bytes is a well-formed UTF-8 byte sequence according to Unicode 6.0. |
| static boolean | isWellFormed(byte[] bytes, int off, int len) | Returns whether the given byte array slice is a well-formed UTF-8 byte sequence, as defined by isWellFormed(byte[]). |
Method Detail
-
encodedLength
public static int encodedLength(CharSequence sequence)
Returns the number of bytes in the UTF-8-encoded form of sequence. For a string, this method is equivalent to string.getBytes(UTF_8).length, but is more efficient in both time and space.
- Throws:
  IllegalArgumentException - if sequence contains ill-formed UTF-16 (unpaired surrogates)
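To make the stated equivalence concrete, here is a minimal JDK-only sketch. The class `EncodedLengthSketch` and its methods `viaGetBytes` and `countBytes` are hypothetical names introduced for illustration; this is not Guava's implementation, only one way to count UTF-8 bytes without allocating the encoded array.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Guava: illustrates what encodedLength computes. */
final class EncodedLengthSketch {

    // Reference form quoted in the documentation: allocates the full byte array.
    static int viaGetBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Allocation-free count: walks the UTF-16 code units and sums the
    // number of UTF-8 bytes each code point would need.
    static int countBytes(CharSequence seq) {
        int bytes = 0;
        for (int i = 0; i < seq.length(); i++) {
            char c = seq.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // ASCII
            } else if (c < 0x800) {
                bytes += 2;                       // U+0080..U+07FF
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < seq.length()
                    && Character.isLowSurrogate(seq.charAt(i + 1))) {
                bytes += 4;                       // supplementary code point
                i++;                              // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                // Mirrors the documented contract: unpaired surrogates are rejected.
                throw new IllegalArgumentException("Unpaired surrogate at index " + i);
            } else {
                bytes += 3;                       // rest of the BMP
            }
        }
        return bytes;
    }
}
```

The two methods agree only for well-formed UTF-16 input. For an unpaired surrogate, `String.getBytes` substitutes a replacement byte rather than failing, whereas the documented behavior of `encodedLength` is to throw `IllegalArgumentException`.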
-
isWellFormed
public static boolean isWellFormed(byte[] bytes)
Returns true if bytes is a well-formed UTF-8 byte sequence according to Unicode 6.0. Note that this is a stronger criterion than simply whether the bytes can be decoded. For example, some versions of the JDK decoder will accept "non-shortest form" byte sequences, but encoding never reproduces these. Such byte sequences are not considered well-formed.

This method returns true if and only if Arrays.equals(bytes, new String(bytes, UTF_8).getBytes(UTF_8)) does, but is more efficient in both time and space.
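The round-trip criterion above can be written out with the JDK alone. The sketch below uses a hypothetical class `WellFormedSketch` with a hypothetical method `roundTrips`; it spells out the reference expression from the documentation and does not show how Guava implements the check.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper, not part of Guava: the decode/re-encode reference check. */
final class WellFormedSketch {

    // Decodes the bytes as UTF-8, encodes the result again, and compares.
    // Malformed input decodes to replacement characters, so the re-encoded
    // bytes differ from the original and the method returns false.
    static boolean roundTrips(byte[] bytes) {
        byte[] reencoded =
                new String(bytes, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(bytes, reencoded);
    }
}
```

For instance, the two-byte sequence 0xC0 0x80 (a non-shortest-form encoding of U+0000) fails this check whether or not the decoder accepts it: a lenient decoder would yield U+0000, which re-encodes as the single byte 0x00.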
-
isWellFormed
public static boolean isWellFormed(byte[] bytes, int off, int len)
Returns whether the given byte array slice is a well-formed UTF-8 byte sequence, as defined by isWellFormed(byte[]). Note that this can be false even when isWellFormed(bytes) is true.
- Parameters:
  bytes - the input buffer
  off - the offset in the buffer of the first byte to read
  len - the number of bytes to read from the buffer
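A slice can be ill-formed inside a well-formed array when it cuts a multi-byte sequence. The following sketch shows this with the euro sign (U+20AC, encoded as 0xE2 0x82 0xAC). The class `SliceSketch` and method `sliceRoundTrips` are hypothetical and rely on the same decode/re-encode reference check over a copied range; the explicit bounds check is an assumption about reasonable behavior, not a statement of what Guava throws.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper, not part of Guava: reference check applied to a slice. */
final class SliceSketch {

    static boolean sliceRoundTrips(byte[] bytes, int off, int len) {
        // Explicit bounds check (an assumption for this sketch):
        // Arrays.copyOfRange would otherwise zero-pad past the end of the array.
        if (off < 0 || len < 0 || off > bytes.length - len) {
            throw new IndexOutOfBoundsException(
                    "off=" + off + ", len=" + len + ", length=" + bytes.length);
        }
        byte[] slice = Arrays.copyOfRange(bytes, off, off + len);
        byte[] reencoded =
                new String(slice, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(slice, reencoded);
    }
}
```

With `byte[] euro = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC}`, the full range `(euro, 0, 3)` passes, while `(euro, 0, 2)` is a truncated sequence and `(euro, 1, 2)` starts on a continuation byte, so both fail.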