Convert unicode symbols to their codes


Hello and goodbye

I have the following XML file containing emoji: http://pastebin.com/8f0GeE96

Now, what I want is to convert each Unicode character to its code (as a string). For this, I wrote the following code. The problem is that I get a lot of duplicates (e.g. d83d), which makes me think there is a problem with the parsing. What's the explanation for this?

public static void main(String[] args) {

        File file = new File("c:\\EmojisList.plist.txt");

        try {
            BufferedReader in = new BufferedReader(
                       new InputStreamReader(new FileInputStream(file), "UTF8"));

            String str;
            while ((str = in.readLine()) != null) { 
                if(str.trim().startsWith("<string>"))
                {
                    int emoji_pos = str.indexOf('>') + 1;
                    char emoji_char = str.charAt(emoji_pos);
                    String emoji_code_str = Integer.toHexString(emoji_char);

                    System.out.println(emoji_code_str);
                }

            }

            in.close();


        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
skeptic

The Unicode standard started out with a set of characters for which 16 bits (two bytes) were sufficient.

However, more and more scripts and symbols were added to it, and nowadays you can't represent all characters in 16 bits. The legal range of code points is from U+0000 to U+10FFFF.

Unfortunately, this doesn't fit in a Java char, which is only 16 bits wide and can represent values from 0 to FFFF.
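These limits can be checked against constants in java.lang.Character. The following is only a small sketch; the class name CharRangeDemo and the helper fitsInOneChar are made up for illustration.

```java
public class CharRangeDemo {
    // Hypothetical helper: true when the code point fits in a single 16-bit char.
    static boolean fitsInOneChar(int codePoint) {
        return Character.isBmpCodePoint(codePoint);
    }

    public static void main(String[] args) {
        // Largest value a Java char can hold
        System.out.println(Integer.toHexString(Character.MAX_VALUE));      // ffff
        // Largest legal Unicode code point
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT)); // 10ffff

        System.out.println(fitsInOneChar(0x0041));  // true  (Latin capital A)
        System.out.println(fitsInOneChar(0x1F600)); // false (an emoji)
    }
}
```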

Most common western languages have no problem with this: the ranges for Latin (including accented characters), Russian, Arabic, Hebrew, etc. are all within the 16-bit range. Even the common Chinese and Japanese characters are in this range.

However, most emoji are actually in the "extended" range, in the Unicode "Miscellaneous Symbols and Pictographs" and "Emoticons" blocks, which span U+1F300 to U+1F5FF and U+1F600 to U+1F64F, respectively.

Characters in this range are represented in strings using the UTF-16 encoding, which basically uses two char values for each such character. So if a character's code point (its official Unicode value) is in the range U+10000 to U+10FFFF, then two char values, one from the range U+D800 to U+DBFF ("high surrogate") and one from the range U+DC00 to U+DFFF ("low surrogate"), are used to represent it.
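To make the surrogate mechanism concrete, here is a minimal sketch of the standard UTF-16 arithmetic, compared with what Character.toChars returns. The class and method names (SurrogateDemo, toSurrogates) are illustrative only, and the helper assumes its argument is already in the range U+10000 to U+10FFFF.

```java
public class SurrogateDemo {
    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 high and low surrogate. No range checking is done here.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;               // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int grinningFace = 0x1F600; // first code point of the Emoticons block

        char[] manual = toSurrogates(grinningFace);
        char[] library = Character.toChars(grinningFace); // JDK equivalent

        System.out.printf("%x %x%n", (int) manual[0], (int) manual[1]);   // d83d de00
        System.out.printf("%x %x%n", (int) library[0], (int) library[1]); // d83d de00
    }
}
```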

So when you read the value of charAt(emoji_pos) in your program, you're actually only reading the first half of the actual character. In fact, all emoji in the "Emoticons" range have the high surrogate U+D83D.
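The repeated d83d values can be reproduced without the file. This sketch hard-codes two emoji as escaped surrogate pairs (chosen here as examples, not taken from the pastebin) and formats only their first char, as the question's code does. The names HalfCharacterDemo and firstCharHex are hypothetical.

```java
public class HalfCharacterDemo {
    // Mirrors the question's approach: take a single char and format it as hex.
    static String firstCharHex(String s) {
        return Integer.toHexString(s.charAt(0));
    }

    public static void main(String[] args) {
        String grinning = "\uD83D\uDE00"; // U+1F600 written as its surrogate pair
        String winking  = "\uD83D\uDE09"; // U+1F609 written as its surrogate pair

        // Two different emoji, same first half
        System.out.println(firstCharHex(grinning)); // d83d
        System.out.println(firstCharHex(winking));  // d83d

        // One visible character, but two char values in the String
        System.out.println(grinning.length());      // 2
    }
}
```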

So, to get the actual Unicode code point of the emoji, you need to convert the UTF-16 representation to an actual int value. A char is not enough. You can use the methods available in the String and Character classes to do this.
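As one possible illustration of the Character methods involved, the sketch below combines a surrogate pair by hand with Character.isHighSurrogate, Character.isLowSurrogate and Character.toCodePoint. The class name ToCodePointDemo and the helper decodeAt are assumptions made for this example.

```java
public class ToCodePointDemo {
    // Hypothetical helper: returns the code point that starts at the given index,
    // combining a high/low surrogate pair when one is present.
    static int decodeAt(String s, int index) {
        char first = s.charAt(index);
        if (Character.isHighSurrogate(first) && index + 1 < s.length()) {
            char second = s.charAt(index + 1);
            if (Character.isLowSurrogate(second)) {
                return Character.toCodePoint(first, second);
            }
        }
        // Not part of a valid pair: the char value is the code point itself
        return first;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(decodeAt("\uD83D\uDE00", 0))); // 1f600
        System.out.println(Integer.toHexString(decodeAt("A", 0)));            // 41
    }
}
```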

In this case, you can use the codePointAt method instead of charAt.

So, instead of

char emoji_char = str.charAt(emoji_pos);

use:

int emojiCodePoint = str.codePointAt(emojiPos);
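Applied to one line of the input, the change might look like the sketch below. It assumes the line has the shape seen in the question's loop (a <string> tag directly followed by the emoji); the class EmojiCodePointDemo, the helper codePointHexAfterTag and the sample line are all made up for illustration.

```java
public class EmojiCodePointDemo {
    // Returns the hex code point of the character that follows the first '>'.
    static String codePointHexAfterTag(String line) {
        int emojiPos = line.indexOf('>') + 1;
        // codePointAt combines a surrogate pair into one int where needed
        int emojiCodePoint = line.codePointAt(emojiPos);
        return Integer.toHexString(emojiCodePoint);
    }

    public static void main(String[] args) {
        // Sample line; the emoji is written as an escaped surrogate pair
        String line = "<string>\uD83D\uDE00</string>";
        System.out.println(codePointHexAfterTag(line)); // 1f600
    }
}
```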

For more information, read the UTF FAQ on the Unicode Consortium website.


Note: The Java coding convention is that variable, field and method names should be in lower camel case: the first word starts with a lowercase letter, the other words start with an uppercase letter, and there are no underscores. So the variable name should be emojiCodePoint, not emoji_code_point. Underscores are only acceptable in constant names (all uppercase, for example CASE_INSENSITIVE_ORDER).
