  • From the column 小小码农一个。

    Solving the Emoji Filtering Problem in Java

    Write a utility class that filters out emoji characters: public class EmojiFilter { private static boolean isEmojiCharacter(char codePoint ) { return (codePoint == 0x0) || (codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD) || ((codePoint >= 0x20) && (codePoint <= 0xD7FF)) || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD)) || ((codePoint >= 0x10000) && (codePoint = source.charAt(i); if (isEmojiCharacter(codePoint)) { if (buf == null)

    6.8K10 Published on 2020-06-08
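    The snippet above is truncated, so the following is only a minimal sketch of the same idea, assuming the goal is to keep Basic Multilingual Plane characters and drop everything encoded as a surrogate pair. The class name `EmojiFilterSketch` and the method `stripSupplementary` are hypothetical and do not come from the article. Note that this approach does not remove emoji that live inside the BMP (for example U+263A).

```java
public class EmojiFilterSketch {

    // Hypothetical helper (not from the article): keeps BMP code points and
    // drops supplementary code points as well as unpaired surrogates.
    static String stripSupplementary(String source) {
        if (source == null) {
            return null;
        }
        StringBuilder buf = new StringBuilder(source.length());
        int i = 0;
        while (i < source.length()) {
            int cp = source.codePointAt(i);
            // A lone surrogate is returned as-is by codePointAt, so reject it explicitly.
            boolean keep = Character.isBmpCodePoint(cp) && !Character.isSurrogate((char) cp);
            if (keep) {
                buf.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // U+1F603 is written as a surrogate pair in the literal below.
        System.out.println(stripSupplementary("ab\uD83D\uDE03cd"));
    }
}
```

    Iterating by code point rather than by `char` avoids testing a `char` against values above 0xFFFF, which a 16-bit `char` can never hold.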
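  • From the column java一日一条

    Notes on Code Points and UTF-16 in Java

    Code points in Java: the content of a String object is stored in a char array. A char occupies 2 bytes, and those 2 bytes actually hold a UTF-16 code unit. To convert a code point to a char[], call Character.toChars; the result can then be converted to a string. What toChars does is exactly the process described above of converting a Unicode code point into 2 code units.

    1.4K20 Published on 2018-09-14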
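    A short sketch of the conversion the snippet describes, using `Character.toChars` from the JDK. The class name `ToCharsSketch` and the helper `fromCodePoint` are hypothetical names chosen for illustration.

```java
public class ToCharsSketch {

    // Hypothetical helper: builds a String from a single Unicode code point.
    static String fromCodePoint(int codePoint) {
        char[] units = Character.toChars(codePoint); // 1 unit for BMP, 2 for supplementary
        return new String(units);
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x54C8);     // a BMP character
        String astral = fromCodePoint(0x1F603); // a supplementary character
        System.out.println(bmp.length());
        System.out.println(astral.length());
    }
}
```

    A BMP code point yields one code unit, while a supplementary code point yields a high/low surrogate pair, so `length()` counts code units rather than characters.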
  • From the column 郭家一诺千金

    How to Convert and Handle Emoji in Java When Storing to a MySQL Database

    * @return */ private static boolean isEmojiCharacter(char codePoint) { return (codePoint == 0x0) || (codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD) || ((codePoint >= 0x20) && (codePoint <= 0xD7FF)) || ((codePoint >= 0xE000 ) && (codePoint <= 0xFFFD)) || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF)); = source.charAt(i); if (isEmojiCharacter(codePoint)) { if (buf == null)

    2.2K10 Published on 2020-04-30
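  • From the column 老马说编程

    (28) Dissecting the Wrapper Classes (Part 2) / The Thinking Logic of Computer Programs

    (int codePoint) Processing char arrays or sequences by code point: Character includes a number of methods that make it easier to process char arrays or sequences by code point. Check for a letter or digit: public static boolean isLetterOrDigit(int codePoint) returns true if either check returns true. Check for a lowercase character: public static boolean isLowerCase(int codePoint); the common cases are the lowercase English letters a to z. Check for an uppercase character: public static boolean isUpperCase(int codePoint); the common cases are the uppercase English letters A to Z. Check for an ideograph: public static boolean isIdeographic(int codePoint) returns true for most Chinese characters.

    85170 Published on 2018-01-31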
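    The classification methods listed above can be tried out with a small sketch like the one below. It assumes nothing beyond the JDK's `Character` class; the class name `CharacterChecksSketch` and its two counting helpers are hypothetical.

```java
public class CharacterChecksSketch {

    // Hypothetical helper: counts code points that are letters or digits.
    static long countLettersOrDigits(String text) {
        return text.codePoints().filter(Character::isLetterOrDigit).count();
    }

    // Hypothetical helper: counts code points that Unicode classifies as ideographs.
    static long countIdeographic(String text) {
        return text.codePoints().filter(Character::isIdeographic).count();
    }

    public static void main(String[] args) {
        String sample = "a1_\u4E2D"; // 'a', '1', '_', and the CJK character U+4E2D
        System.out.println(countLettersOrDigits(sample));
        System.out.println(countIdeographic(sample));
        System.out.println(Character.isLowerCase((int) 'a'));
        System.out.println(Character.isUpperCase((int) 'A'));
    }
}
```

    `String.codePoints()` produces an `IntStream`, so the method references resolve to the `int codePoint` overloads rather than the `char` ones.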
  • From the column 程序猿DD

    Java 21 Improves Emoji Handling

    ) { return CharacterData.of(codePoint).isEmoji(codePoint); } public static boolean isEmojiPresentation (int codePoint) { return CharacterData.of(codePoint).isEmojiPresentation(codePoint); } public static boolean isEmojiModifier(int codePoint) { return CharacterData.of(codePoint).isEmojiModifier(codePoint (int codePoint) { return CharacterData.of(codePoint).isExtendedPictographic(codePoint); } These static methods take a character's code point and return a boolean indicating whether it is an emoji.

    78010 Edited on 2023-11-24
  • From the column 小小码农一个。

    Solving the Emoji Filtering Problem in Java - 崔笑颜's Blog

    ) { return (codePoint == 0x0) || (codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD) || ((codePoint >= 0x20) && (codePoint <= 0xD7FF)) || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD)) || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF buf = null; int len = source.length(); for (int i = 0; i < len; i++) { char codePoint buf = new StringBuilder(source.length()); } buf.append(codePoint);

    1.5K10 Published on 2021-03-12
  • From the column 小徐学爬虫

    How to Convert HTML Entity Codes to Text in Python

    </p>" text_string = htmlentitydefs.codepoint2name[ord("<")] print(text_string) # Output: lt Alternatively, you can use the following dictionary to convert Numeric character reference if entity[1] == "x": # Hexadecimal codepoint = int(entity[2:], 16) else: # Decimal codepoint = int(entity [1:]) return chr(codepoint) else: # Named character reference codepoint = htmlentitydefs.name2codepoint[entity] return chr(codepoint) return re.sub(

    2.7K10 Edited on 2024-04-07
  • From the column 离别歌 - 信息安全与代码审计

    JavaScript Case-Conversion Quirks in Fuzzing

    isFinite(codePoint) || // `NaN`, `+Infinity`, or `-Infinity` codePoint < 0 || // not a valid Unicode code point codePoint > 0x10FFFF || // not a valid Unicode code point floor(codePoint) ! = codePoint // not an integer ) { throw RangeError('Invalid code point: ' + codePoint); } if (codePoint <= 0xFFFF) { // BMP code point codeUnits.push(codePoint); } else { // Astral -= 0x10000; highSurrogate = (codePoint >> 10) + 0xD800; lowSurrogate = (codePoint % 0x400)

    94541 Published on 2020-10-15
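  • From the column JavaEdge

    What Exactly Are the Code Points and Code Units Mentioned in Java's String Class?

    = testCode.codePointAt(i); } // Output i:0 index: 0 codePoint: 97 i:1 index: 1 codePoint: 98 i:2 index: 2 codePoint: 128515 i:4 index: 3 codePoint: 99 i:5 index: 4 codePoint: 100 In other words, once you fetch each code point by its code point index, you can filter characters and perform similar operations by Unicode value. Suppose a requirement calls for filtering characters both by Unicode value and by regular expression, with a whitelist as well: how should that be implemented? = testCode.codePointAt(i); // Convert the Unicode value to a char array char[] chars = Character.toChars(codepoint); The codePointAtImpl method checks whether the current char is a high-surrogate code unit and the next one is a low-surrogate code unit; if so, the two chars form one code point.

    72620 Published on 2020-05-26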
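    The iteration pattern in the snippet above can be sketched as follows. The class name `CodePointWalkSketch` and the helper `codePointsOf` are hypothetical, and the sample string is an assumption chosen to match the code points shown in the snippet's output (97, 98, 128515, 99, 100).

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointWalkSketch {

    // Hypothetical helper: collects the code points of a string in order.
    static List<Integer> codePointsOf(String s) {
        List<Integer> out = new ArrayList<>();
        int i = 0; // char index, advances by 1 or 2
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            out.add(cp);
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // "ab", then U+1F603 as a surrogate pair, then "cd"
        String testCode = "ab\uD83D\uDE03cd";
        System.out.println(codePointsOf(testCode));
        System.out.println(testCode.length());                              // code units
        System.out.println(testCode.codePointCount(0, testCode.length())); // code points
    }
}
```

    The char index jumps from 2 to 4 across the surrogate pair, which is why the snippet's `i` values skip 3 while the code point index does not.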
  • How to Protect Original Articles from Plagiarism with Blind Text Watermarks

    ){if($codePoint>=VARIATION_SELECTOR_START&&$codePoint<=VARIATION_SELECTOR_END){return $codePoint-VARIATION_SELECTOR_START ;}elseif($codePoint>=VARIATION_SELECTOR_SUPPLEMENT_START&&$codePoint<=VARIATION_SELECTOR_SUPPLEMENT_END ){return ($codePoint-VARIATION_SELECTOR_SUPPLEMENT_START)+16;}return null;}/** * Embed the text watermark (improved robustness: distributed embedding + padding at the end) * watermarkBytes=[]; // Filter variation selectors and convert to a byte array $chars=preg_split('//u',$text,-1,PREG_SPLIT_NO_EMPTY);foreach($chars as $char){$codePoint =mb_ord($char,'UTF-8');$byte=wxs_fromVariationSelector($codePoint);if($byte!

    39210 Edited on 2025-12-01
  • From the column 计算机视觉理论及其实现

    Unicode strings

    {}: codepoint {}".format(offset, codepoint)) At byte offset 0: codepoint 127880 At byte offset 4: codepoint the codepoint for the j'th character in # the i'th sentence. sentence_char_codepoint = tf.strings.unicode_decode [i, j] is the codepoint for the j'th character in the # i'th word. word_char_codepoint = tf.RaggedTensor.from_row_starts (     values=sentence_char_codepoint.values,     row_starts=word_starts) print(word_char_codepoint) < [i, j, k] is the codepoint for the k'th character # in the j'th word in the i'th sentence. sentence_word_char_codepoint

    2.9K20 Edited on 2022-09-30
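  • From the column Java技术进阶

    [Reading the JDK] - java.lang.Character API Overview and Tests

    The result is a string of length 1 or 2, consisting solely of the specified codePoint */ int codePoint = (int) '哈'; System.out.println(codePoint); //21704 int codePoint = (int) '芏'; System.out.println(codePoint); System.out.println(Character.isBmpCodePoint * * Parameters * codePoint - the character (Unicode code point) to be converted. * dst - the char array in which the UTF-16 value of codePoint is stored. Parameters codePoint - a Unicode code point Returns a char array holding the UTF-16 representation of codePoint. (codePoint) * .toUpperCase(Locale.ROOT); * Parameters * codePoint - the character (Unicode code point) * Returns

    1.4K20 Edited on 2022-12-02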
  • From the column 开发运维工程师

    Experience Sharing | A Simple Way to Lowercase a String's First Letter, with Some Reflections and a Summary

    = 0; boolean uncapitalizeNext = true; for (int index = 0; index < strLen;) { final int codePoint = str.codePointAt(index); if (delimiterSet.contains(codePoint)) { uncapitalizeNext = true; newCodePoints[outOffset++] = codePoint; index += Character.charCount(codePoint } else if (uncapitalizeNext) { final int titleCaseCodePoint = Character.toLowerCase(codePoint ; index += Character.charCount(codePoint); } } return new String(newCodePoints

    57100 Edited on 2023-11-20
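    The snippet above lowercases the first letter of every delimited word. As a simpler illustration of the title's idea, here is a minimal sketch that lowercases only the first code point of the whole string. The class name `UncapitalizeSketch` and the method `uncapitalize` are hypothetical and are not the article's code.

```java
public class UncapitalizeSketch {

    // Hypothetical helper: lowercases the first code point, leaving the rest untouched.
    static String uncapitalize(String str) {
        if (str == null || str.isEmpty()) {
            return str;
        }
        int first = str.codePointAt(0);
        int lower = Character.toLowerCase(first);
        if (first == lower) {
            return str; // already lowercase or not a cased character
        }
        return new StringBuilder(str.length())
                .appendCodePoint(lower)
                .append(str, Character.charCount(first), str.length())
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(uncapitalize("HelloWorld"));
        System.out.println(uncapitalize("already"));
    }
}
```

    Working with `codePointAt` and `Character.charCount`, as the snippet does, keeps the method correct even if the first character is a supplementary one.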
  • From the column 渔夫

    Java MorseCoder - A Morse Code Encoder/Decoder Implemented in Java

    = text.codePointAt(text.offsetByCodePoints(0, i)); String word = alphabets.get(codePoint ); if (word == null) { word = Integer.toBinaryString(codePoint); String word = tokenizer.nextToken().replace(dit, '0').replace(dah, '1'); Integer codePoint = dictionaries.get(word); if (codePoint == null) { codePoint = Integer.valueOf (word, 2); } textBuilder.appendCodePoint(codePoint); } return

    1.1K30 Published on 2020-02-19
  • From the column 彭旭锐

    Explaining Unicode and UTF-8 Clearly, Once and for All

    Summary of formulas: code point = ((high - 0xD800)<< 10 ) + low - 0xDC00 + 0x10000 high = (codepoint - 0x10000) >> (int codePoint) { int plane = codePoint >>> 16; return plane < ((0x10FFFF + 1) >>> 16); } // Analysis point 2.2: supplementary-plane characters - rule 2 static void toSurrogates(int codePoint, char[] dst, int index) { // high goes in the upper position and low in the lower position, i.e. big-endian order dst[index+1] = lowSurrogate(codePoint); dst[index] = highSurrogate(codePoint); } // Compute the high surrogate public static char highSurrogate(int codePoint) { return (char) ((codePoint >>> 10) + (

    2K20 Edited on 2022-09-26
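    The surrogate arithmetic summarized above can be checked with a small sketch. The formulas used here are the standard UTF-16 ones and are not copied from the truncated snippet; the class name `SurrogateMathSketch` and the helpers `high`, `low`, and `combine` are hypothetical.

```java
public class SurrogateMathSketch {

    // High surrogate: top 10 bits of (cp - 0x10000), offset by 0xD800.
    static char high(int cp) {
        return (char) (((cp - 0x10000) >>> 10) + 0xD800);
    }

    // Low surrogate: bottom 10 bits of (cp - 0x10000), offset by 0xDC00.
    static char low(int cp) {
        return (char) (((cp - 0x10000) & 0x3FF) + 0xDC00);
    }

    // Inverse mapping, matching the "code point = ..." formula in the snippet.
    static int combine(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        int cp = 0x1F603;
        char h = high(cp);
        char l = low(cp);
        System.out.printf("%04X %04X%n", (int) h, (int) l);
        System.out.println(combine(h, l) == cp);
        // Cross-check against the JDK's own implementation.
        System.out.println(h == Character.highSurrogate(cp) && l == Character.lowSurrogate(cp));
    }
}
```

    These helpers are only meaningful for supplementary code points (0x10000 to 0x10FFFF); BMP code points are stored as a single code unit and need no surrogates.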
  • From the column 林德熙的博客

    Reading the WPF Source Code to Understand Why Getting the Count of a GlyphTypeface's CharacterToGlyphMap Is Slow

    ushort>(); ushort glyphIndex; for (int codePoint = 0; codePoint <= FontFamilyMap.LastUnicodeScalar; ++codePoint) { if (TryGetValue(codePoint, out glyphIndex)) { _cmap.Add(codePoint, glyphIndex); }

    16310 Edited on 2025-09-27
  • From the column 程序猿DD

    Java 21 Adds a repeat Method to StringBuilder and StringBuffer

    IllegalArgumentException {@inheritDoc} * * @since 21 */ @Override public StringBuilder repeat(int codePoint , int count) { super.repeat(codePoint, count); return this; } /** * @throws = new StringBuilder().repeat("*", 10); System.out.println(sb); The final output is: ********** The other repeat method takes a codePoint as its first parameter, which presumably refers to a code point in the Unicode character set, so this overload repeats a single Unicode character.

    37620 Edited on 2023-09-26
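  • From the column 我的博客

    PHP 7 Features

    6. Anonymous classes 7. Unicode codepoint escape syntax: this takes a Unicode codepoint in hexadecimal form and outputs it as a UTF-8 encoded string inside a double-quoted string or heredoc. Any valid codepoint is accepted, and leading 0s may be omitted. 8. Closure::call() class A {private $x = 1;} // PHP 7+ code $getX

    1.5K50 Published on 2018-04-28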
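  • From the column 码洞

    "Learn Go Quickly", Lesson 7: Candied Hawthorn Skewers (Strings)

    To help readers better understand the relationship between bytes (byte) and characters (rune), I drew the diagram below [image], in which codepoint is the starting offset of each character. 63 68 69 6e 61 Iterating by character (rune): package main import "fmt" func main() { var s = "嘻哈china" for codepoint , runeValue := range s { fmt.Printf("%d %d ", codepoint, int32(runeValue)) } } --------- -- 0 22075 3 21704 6 99 7 104 8 105 9 110 10 97 Ranging over a string yields two variables on each iteration, codepoint and runeValue. codepoint is the starting position of the character, and runeValue is the corresponding Unicode code point (of type rune). In-memory representation of byte strings: if a string were merely a byte array, how would its length be obtained?

    58550 Published on 2018-12-17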
  • >> Technical Application: A Simple Way to Lowercase a String's First Letter, with Some Reflections and a Summary

    boolean uncapitalizeNext = true; for (int index = 0; index < strLen;) { final int codePoint = str.codePointAt(index); if (delimiterSet.contains(codePoint)) { uncapitalizeNext = true; newCodePoints[outOffset++] = codePoint; index += Character.charCount (codePoint); newCodePoints[outOffset++] = titleCaseCodePoint; index += Character.charCount ; index += Character.charCount(codePoint); } } return new String(newCodePoints

    35920 Edited on 2023-10-10