Convert unicode symbols to their codes
I have the following XML file containing emoji: http://pastebin.com/8f0GeE96
Now, what I want is to convert each Unicode character to its code (as a string). For that, I wrote the code below. The problem is that I get a lot of duplicates (e.g. d83d), which makes me think there is a problem with the parsing. What's the explanation for this?
public static void main(String[] args) {
    File file = new File("c:\\EmojisList.plist.txt");
    try {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF8"));
        String str;
        while ((str = in.readLine()) != null) {
            if (str.trim().startsWith("<string>")) {
                int emoji_pos = str.indexOf('>') + 1;
                char emoji_char = str.charAt(emoji_pos);
                String emoji_code_str = Integer.toHexString(emoji_char);
                System.out.println(emoji_code_str);
            }
        }
        in.close();
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The Unicode standard started out with a set of characters that fit in 16 bits (two bytes). However, more and more scripts and symbols have been added to it, and nowadays you can't represent every character in 16 bits: the legal range of code points is U+0000 to U+10FFFF.
Unfortunately, this doesn't fit in a Java char, which is only 16 bits wide and can only represent values from 0 to FFFF.
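A quick illustration of those limits, using the standard constants and helpers in the Character class:

```java
public class CharLimits {
    public static void main(String[] args) {
        // A char is a 16-bit UTF-16 code unit, not a full code point
        System.out.println((int) Character.MAX_VALUE);   // 65535 (0xFFFF)
        // But code points go all the way up to U+10FFFF
        System.out.println(Character.MAX_CODE_POINT);    // 1114111 (0x10FFFF)
        // U+1F600 (the grinning-face emoji) lies above the 16-bit range
        System.out.println(Character.isSupplementaryCodePoint(0x1F600)); // true
    }
}
```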
The most common Western languages have no problem with this: the Latin ranges (including accented letters), Cyrillic, Arabic, Hebrew, and so on are all within the 16-bit range. Even common Chinese and Japanese characters are in this range.
However, most emoji are actually in the supplementary range: U+1F300 to U+1F5FF in the Unicode "Miscellaneous Symbols and Pictographs" block and U+1F600 to U+1F64F in the "Emoticons" block.
Characters in this range are represented in strings using the UTF-16 encoding, which uses two char values (a "surrogate pair") for each such character. If a character's code point (its official Unicode value) is in the range U+10000 to U+10FFFF, it is represented by two char values: one in the range U+D800 to U+DBFF (the "high surrogate") and one in the range U+DC00 to U+DFFF (the "low surrogate").
So when your program reads the value charAt(emoji_pos), you are actually reading only the first half of the actual character. In fact, all emoji in the "Emoticons" range have a high surrogate of U+D83D, which is exactly the duplicate value you are seeing.
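You can see this with a short sketch, taking 😀 (U+1F600) as an example:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // 😀 U+1F600, stored as a surrogate pair

        // Two char values, even though it is one character
        System.out.println(s.length());                            // 2
        // charAt(0) yields only the high surrogate...
        System.out.println(Integer.toHexString(s.charAt(0)));      // d83d
        // ...while codePointAt(0) yields the full code point
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f600
    }
}
```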
So, to get the actual Unicode code point of the emoji, you need to convert its UTF-16 representation to an actual int value; a char can't hold it. The String and Character classes provide methods for doing this. In this case, you can simply use the codePointAt method instead of charAt.
So, instead of
char emoji_char = str.charAt(emoji_pos);
use:
int emojiCodePoint = str.codePointAt(emojiPos);
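Applied to your program, it could look like the sketch below (the file path and the `<string>` parsing are kept as in your question; the emojiCode helper is only split out here for clarity):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class EmojiCodes {

    // Extract the hex code point of the first character after '>'
    static String emojiCode(String line) {
        int emojiPos = line.indexOf('>') + 1;
        // codePointAt returns the full code point, not half a surrogate pair
        int emojiCodePoint = line.codePointAt(emojiPos);
        return Integer.toHexString(emojiCodePoint);
    }

    public static void main(String[] args) throws IOException {
        File file = new File("c:\\EmojisList.plist.txt");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
            String str;
            while ((str = in.readLine()) != null) {
                if (str.trim().startsWith("<string>")) {
                    System.out.println(emojiCode(str));
                }
            }
        }
    }
}
```

With this change, a line such as `<string>😀</string>` prints 1f600 instead of d83d.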
For more information, read the UTF FAQ on the Unicode Consortium website.
Note: the Java coding convention is that variable, field, and method names should be in lower camel case: the first word starts with a lowercase letter, subsequent words start with an uppercase letter, and there are no underscores. So the variable name should be emojiCodePoint, not emoji_code_point. Underscores are acceptable only in constant names (which are all uppercase, for example CASE_INSENSITIVE_ORDER).