Write a utility class that filters out emoji: public class EmojiFilter { private static boolean isEmojiCharacter(char codePoint) { return (codePoint == 0x0) || (codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD) || ((codePoint >= 0x20) && (codePoint <= 0xD7FF)) || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD)) || ((codePoint >= 0x10000) && (codePoint = source.charAt(i); if (isEmojiCharacter(codePoint)) { if (buf == null)
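Code points in Java. The content of a String object is logically a sequence of char values (before Java 9 it was literally backed by a char array). A char occupies 2 bytes, and those 2 bytes hold one UTF-16 code unit. To convert a code point to a char[], call Character.toChars; the result can then be turned into a string. What toChars does is the conversion described above, from a Unicode code point into its UTF-16 code units (two of them for a supplementary code point).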
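The conversion described above can be sketched as follows. This is a minimal illustration, not code from the original article: the class name ToCharsDemo and the helper fromCodePoint are hypothetical, and only the standard Character.toChars and String APIs are used.

```java
final class ToCharsDemo {
    // Hypothetical helper: turns one Unicode code point into a String
    // by way of its UTF-16 code units.
    static String fromCodePoint(int codePoint) {
        // toChars returns 1 char for a BMP code point, 2 for a supplementary one.
        char[] units = Character.toChars(codePoint);
        return new String(units);
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x54C8);      // a CJK character inside the BMP
        String astral = fromCodePoint(0x1F603);  // an emoji outside the BMP
        System.out.println(bmp.length());        // number of char code units
        System.out.println(astral.length());     // number of char code units
        System.out.println(astral.codePointCount(0, astral.length()));
    }
}
```

Note that String.length() counts code units, not code points, which is why codePointCount is shown alongside it.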
* @return */ private static boolean isEmojiCharacter(char codePoint) { return (codePoint == 0x0) || (codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD) || ((codePoint >= 0x20) && (codePoint <= 0xD7FF)) || ((codePoint >= 0xE000 ) && (codePoint <= 0xFFFD)) || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF)); = source.charAt(i); if (isEmojiCharacter(codePoint)) { if (buf == null)
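(int codePoint) Processing char arrays or sequences by code point
Character contains a number of methods that make it easier to process char arrays or sequences by code point.
Check for a letter or digit: public static boolean isLetterOrDigit(int codePoint). Returns true if either check (letter or digit) returns true.
Check for a lowercase character: public static boolean isLowerCase(int codePoint). The common cases are mainly the lowercase English letters a to z.
Check for an uppercase character: public static boolean isUpperCase(int codePoint). The common cases are mainly the uppercase English letters A to Z.
Check for an ideograph: public static boolean isIdeographic(int codePoint). Returns true for most Chinese characters.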
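As a quick illustration of the int-based Character methods listed above, the sketch below labels a code point with the checks it passes. The class CharClassDemo and the method describe are hypothetical names introduced only for this example.

```java
final class CharClassDemo {
    // Hypothetical helper: lists which of the four checks a code point passes.
    static String describe(int codePoint) {
        StringBuilder sb = new StringBuilder();
        if (Character.isLetterOrDigit(codePoint)) sb.append("letterOrDigit ");
        if (Character.isLowerCase(codePoint)) sb.append("lower ");
        if (Character.isUpperCase(codePoint)) sb.append("upper ");
        if (Character.isIdeographic(codePoint)) sb.append("ideographic ");
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // 'a', 'A', '7', a CJK ideograph, and an emoji code point
        int[] samples = {'a', 'A', '7', 0x54C8, 0x1F603};
        for (int cp : samples) {
            System.out.println(Integer.toHexString(cp) + " -> [" + describe(cp) + "]");
        }
    }
}
```

Because these overloads take an int, they also work for supplementary code points that do not fit in a single char.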
) { return CharacterData.of(codePoint).isEmoji(codePoint); } public static boolean isEmojiPresentation (int codePoint) { return CharacterData.of(codePoint).isEmojiPresentation(codePoint); } public static boolean isEmojiModifier(int codePoint) { return CharacterData.of(codePoint).isEmojiModifier(codePoint (int codePoint) { return CharacterData.of(codePoint).isExtendedPictographic(codePoint); } These static methods take a character's code point and return a boolean indicating whether it is an emoji.
) { return (codePoint == 0x0) || (codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD) || ((codePoint >= 0x20) && (codePoint <= 0xD7FF)) || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD)) || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF buf = null; int len = source.length(); for (int i = 0; i < len; i++) { char codePoint buf = new StringBuilder(source.length()); } buf.append(codePoint);
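Two things stand out in the fragments above, as far as the truncated code shows. The parameter is a char, so the `codePoint >= 0x10000` branch can never be true and each half of a surrogate pair is tested separately. The method also returns true for the characters that get appended (kept), despite being named isEmojiCharacter. The sketch below is one possible code-point-based variant under those observations. It is not the original author's code: the names SupplementaryFilter, isKept and filter are hypothetical, and the kept ranges simply mirror the BMP ranges from the fragments.

```java
final class SupplementaryFilter {
    // Assumption: keep the same BMP ranges as the fragments above and drop
    // everything else, including all supplementary-plane code points.
    static boolean isKept(int cp) {
        return cp == 0x0 || cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD);
    }

    // Walks the string by code point, so a surrogate pair is judged as one unit.
    static String filter(String source) {
        StringBuilder buf = new StringBuilder(source.length());
        source.codePoints()
              .filter(SupplementaryFilter::isKept)
              .forEach(buf::appendCodePoint);
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(filter("ab\uD83D\uDE03cd"));
    }
}
```

This range check is coarse: it removes every supplementary character (including rare CJK ideographs) and keeps emoji that live in the BMP, so it is a plane filter rather than a true emoji detector.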
</p>" text_string = htmlentitydefs.codepoint2name[ord("<")] print(text_string) # Output: lt Alternatively, you can use the following dictionary to convert Numeric character reference if entity[1] == "x": # Hexadecimal codepoint = int(entity[2:], 16) else: # Decimal codepoint = int(entity [1:]) return chr(codepoint) else: # Named character reference codepoint = htmlentitydefs.name2codepoint[entity] return chr(codepoint) return re.sub(
isFinite(codePoint) || // `NaN`, `+Infinity`, or `-Infinity` codePoint < 0 || // not a valid Unicode code point codePoint > 0x10FFFF || // not a valid Unicode code point floor(codePoint) != codePoint // not an integer ) { throw RangeError('Invalid code point: ' + codePoint); } if (codePoint <= 0xFFFF) { // BMP code point codeUnits.push(codePoint); } else { // Astral -= 0x10000; highSurrogate = (codePoint >> 10) + 0xD800; lowSurrogate = (codePoint % 0x400)
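= testCode.codePointAt(i); } // Output: i:0 index: 0 codePoint: 97 i:1 index: 1 codePoint: 98 i:2 index: 2 codePoint: 128515 i:4 index: 3 codePoint: 99 i:5 index: 4 codePoint: 100 In other words, once each code point has been retrieved by its code point index, characters can be filtered (or otherwise processed) by their Unicode value. Suppose the requirement is to filter characters by Unicode value and also by regular expression, with a whitelist on top: how should that be implemented? = testCode.codePointAt(i); // convert the Unicode value to a char array char[] chars = Character.toChars(codepoint); The codePointAtImpl method checks whether the current char is a high-surrogate code unit and the next one is a low-surrogate code unit; if so, the two chars together form one code point.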
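The loop that produces output like the above is cut off in the excerpt. A minimal sketch that yields the same rows, assuming the test string was "ab" followed by U+1F603 and "cd", might look like this; the class CodePointWalk and the method walk are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointWalk {
    // Returns one row per code point: char index i, code point index, and value.
    static List<String> walk(String s) {
        List<String> rows = new ArrayList<>();
        int index = 0; // counts code points, not chars
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            rows.add("i:" + i + " index: " + index + " codePoint: " + cp);
            i += Character.charCount(cp); // advance by 1 or 2 chars
            index++;
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed input: "ab", U+1F603 written as a surrogate pair, then "cd"
        walk("ab\uD83D\uDE03cd").forEach(System.out::println);
    }
}
```

Advancing by Character.charCount(cp) is what makes the char index jump from 2 to 4 across the surrogate pair.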
) { if ($codePoint >= VARIATION_SELECTOR_START && $codePoint <= VARIATION_SELECTOR_END) { return $codePoint - VARIATION_SELECTOR_START; } elseif ($codePoint >= VARIATION_SELECTOR_SUPPLEMENT_START && $codePoint <= VARIATION_SELECTOR_SUPPLEMENT_END) { return ($codePoint - VARIATION_SELECTOR_SUPPLEMENT_START) + 16; } return null; } /** * Embed a text watermark (improved robustness: distributed embedding + padding at the end) * watermarkBytes = []; // filter variation selectors, convert to a byte array $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY); foreach ($chars as $char) { $codePoint = mb_ord($char, 'UTF-8'); $byte = wxs_fromVariationSelector($codePoint); if ($byte!
{}: codepoint {}".format(offset, codepoint)) At byte offset 0: codepoint 127880 At byte offset 4: codepoint the codepoint for the j'th character in # the i'th sentence. sentence_char_codepoint = tf.strings.unicode_decode [i, j] is the codepoint for the j'th character in the # i'th word. word_char_codepoint = tf.RaggedTensor.from_row_starts ( values=sentence_char_codepoint.values, row_starts=word_starts) print(word_char_codepoint) < [i, j, k] is the codepoint for the k'th character # in the j'th word in the i'th sentence. sentence_word_char_codepoint
The result is a string of length 1 or 2, consisting solely of the specified codePoint */ int codePoint = (int) '哈'; System.out.println(codePoint); //21704 int codePoint = (int) '芏'; System.out.println(codePoint); System.out.println(Character.isBmpCodePoint * * Parameters * codePoint - the character (Unicode code point) to convert. * dst - a char array in which the UTF-16 value of codePoint is stored. Parameters codePoint - a Unicode code point Returns a char array holding the UTF-16 representation of codePoint. (codePoint) * .toUpperCase(Locale.ROOT); * Parameters * codePoint - the character (Unicode code point) * Returns
= 0; boolean uncapitalizeNext = true; for (int index = 0; index < strLen;) { final int codePoint = str.codePointAt(index); if (delimiterSet.contains(codePoint)) { uncapitalizeNext = true; newCodePoints[outOffset++] = codePoint; index += Character.charCount(codePoint } else if (uncapitalizeNext) { final int titleCaseCodePoint = Character.toLowerCase(codePoint ; index += Character.charCount(codePoint); } } return new String(newCodePoints
= text.codePointAt(text.offsetByCodePoints(0, i)); String word = alphabets.get(codePoint ); if (word == null) { word = Integer.toBinaryString(codePoint); String word = tokenizer.nextToken().replace(dit, '0').replace(dah, '1'); Integer codePoint = dictionaries.get(word); if (codePoint == null) { codePoint = Integer.valueOf (word, 2); } textBuilder.appendCodePoint(codePoint); } return
Summary of the formulas: code point = ((high - 0xD800)<< 10 ) + low - 0xDC00 + 0x10000 high = (codepoint - 0x10000) >> (int codePoint) { int plane = codePoint >>> 16; return plane < ((0x10FFFF + 1) >>> 16); } // Analysis point 2.2: supplementary-plane characters - rule 2 static void toSurrogates(int codePoint, char[] dst, int index) { // high goes first and low second, i.e. big-endian order dst[index+1] = lowSurrogate(codePoint); dst[index] = highSurrogate(codePoint); } // compute the high surrogate public static char highSurrogate(int codePoint) { return (char) ((codePoint >>> 10) + (
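The formulas summarized above can be checked with a small worked example. The sketch below spells out the standard UTF-16 arithmetic directly; the class SurrogateMath and its methods are hypothetical names, and the low-surrogate expression is the standard definition rather than something quoted from the truncated excerpt.

```java
final class SurrogateMath {
    // high = ((codePoint - 0x10000) >> 10) + 0xD800
    static char high(int codePoint) {
        return (char) (((codePoint - 0x10000) >> 10) + 0xD800);
    }

    // low = ((codePoint - 0x10000) & 0x3FF) + 0xDC00 (standard UTF-16 definition)
    static char low(int codePoint) {
        return (char) (((codePoint - 0x10000) & 0x3FF) + 0xDC00);
    }

    // code point = ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000
    static int combine(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        int cp = 0x1F603; // a supplementary-plane code point
        char h = high(cp);
        char l = low(cp);
        System.out.println(Integer.toHexString(h) + " " + Integer.toHexString(l));
        // Cross-check against the JDK's own implementations.
        System.out.println(h == Character.highSurrogate(cp) && l == Character.lowSurrogate(cp));
        System.out.println(combine(h, l) == Character.toCodePoint(h, l));
    }
}
```

These helpers are only meaningful for code points in the range 0x10000 to 0x10FFFF; BMP code points need no surrogates.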
ushort>(); ushort glyphIndex; for (int codePoint = 0; codePoint <= FontFamilyMap.LastUnicodeScalar; ++codePoint) { if (TryGetValue(codePoint, out glyphIndex)) { _cmap.Add(codePoint, glyphIndex); }
IllegalArgumentException {@inheritDoc} * * @since 21 */ @Override public StringBuilder repeat(int codePoint, int count) { super.repeat(codePoint, count); return this; } /** * @throws = new StringBuilder().repeat("*", 10); System.out.println(sb); The final output is: ********** The other repeat method takes a codePoint as its first parameter, which presumably refers to a code point in the Unicode character set, so this overload repeats a single Unicode character.
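6. Anonymous classes 7. Unicode codepoint escape syntax: this takes a Unicode codepoint in hexadecimal form and outputs that codepoint in UTF-8 to a double-quoted string or a heredoc. Any valid codepoint is accepted, and leading 0s may be omitted. 8. Closure::call() class A {private $x = 1;} // PHP 7+ code $getX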
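To help readers better understand the relationship between bytes (byte) and characters (rune), I drew the figure below [figure], in which codepoint is the starting byte offset of each character. 63 68 69 6e 61 Iterating by character (rune): package main import "fmt" func main() { var s = "嘻哈china" for codepoint , runeValue := range s { fmt.Printf("%d %d ", codepoint, int32(runeValue)) } } --------- -- 0 22075 3 21704 6 99 7 104 8 105 9 110 10 97 A range loop over a string yields two variables on each iteration, codepoint and runeValue. codepoint is the byte position at which the character starts, and runeValue is the corresponding Unicode code point (of type rune). Memory representation of a byte string: if a string is merely a byte array, how is its length information obtained?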
boolean uncapitalizeNext = true; for (int index = 0; index < strLen;) { final int codePoint = str.codePointAt(index); if (delimiterSet.contains(codePoint)) { uncapitalizeNext = true; newCodePoints[outOffset++] = codePoint; index += Character.charCount (codePoint); newCodePoints[outOffset++] = titleCaseCodePoint; index += Character.charCount ; index += Character.charCount(codePoint); } } return new String(newCodePoints