Convert unicode symbols to their codes

Hello and goodbye

I have the following XML file containing emoji : http://pastebin.com/8f0GeE96

Now, what I want is to convert each Unicode character to its code (as a string). For this, I wrote the following code. The problem is that I get a lot of duplicates (e.g. d83d), which makes me think there is a problem with the parsing. What's the explanation for this?

public static void main(String[] args) {

        File file = new File("c:\\EmojisList.plist.txt");

        try {
            BufferedReader in = new BufferedReader(
                       new InputStreamReader(new FileInputStream(file), "UTF8"));

            String str;
            while ((str = in.readLine()) != null) { 
                if(str.trim().startsWith("<string>"))
                {
                    int emoji_pos = str.indexOf('>') + 1;
                    char emoji_char = str.charAt(emoji_pos);
                    String emoji_code_str = Integer.toHexString(emoji_char);

                    System.out.println(emoji_code_str);
                }

            }

            in.close();


        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

skeptic

The Unicode standard started out with a set of characters small enough to fit in 16 bits (two bytes).

However, more and more scripts and symbols have been added to it since, and nowadays not all characters can be represented in 16 bits. The legal range of code points is U+0000 to U+10FFFF.

Unfortunately, this doesn't fit in a Java char, which is only 16 bits wide and can represent values from 0 to FFFF.

The most common western languages have no problem with this: the Latin ranges (including accented letters), Russian, Arabic, Hebrew, etc. are all in the 16-bit range. Even the common Chinese and Japanese characters are in this range.

However, most emoji are actually in the "extended" range, in the Unicode blocks "Miscellaneous Symbols and Pictographs" and "Emoticons", from U+1F300 to U+1F5FF and from U+1F600 to U+1F64F, respectively.

Characters in this range are represented in strings using the UTF-16 encoding, which basically uses two char values for each such character. So if a character's code point (its official Unicode value) is in the range U+10000 to U+10FFFF, then two char values are used to represent it: one in the range U+D800 to U+DBFF (the "high surrogate") and one in the range U+DC00 to U+DFFF (the "low surrogate").
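To make the surrogate arithmetic concrete, here is a minimal sketch of how a code point above U+FFFF is split into two char values. The class name SurrogateMath and its helper methods are made up for illustration; in real code the JDK's Character.highSurrogate and Character.lowSurrogate do the same job.

```java
// Sketch only: SurrogateMath, high() and low() are illustrative names, not JDK API.
class SurrogateMath {
    // High (leading) surrogate for a code point in U+10000..U+10FFFF:
    // the top 10 bits of (codePoint - 0x10000), offset by 0xD800.
    static char high(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low (trailing) surrogate: the bottom 10 bits, offset by 0xDC00.
    static char low(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int grinningFace = 0x1F600; // U+1F600 GRINNING FACE
        // Cast to int because %x does not accept a char argument.
        System.out.printf("%x %x%n", (int) high(grinningFace), (int) low(grinningFace));
        // The JDK helper performs the same computation.
        System.out.println(high(grinningFace) == Character.highSurrogate(grinningFace));
    }
}
```

For U+1F600 this yields the pair D83D DE00, which is where the repeated d83d in the question's output comes from.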

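So when you read the value charAt(emoji_pos) in your program, you're actually reading only the first half of the actual character. In fact, every code point from U+1F400 to U+1F7FF, which covers most of the emoji in the ranges above, has the same high surrogate, U+D83D (those from U+1F300 to U+1F3FF have U+D83C). That is why you see the same value over and over.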

So, to get the actual Unicode code point of the emoji, you need to convert the UTF-16 representation to the actual int value. A char is not enough. You can do this with the methods available in the String and Character classes.
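As a rough illustration of what that conversion involves, the sketch below combines a surrogate pair by hand using Character.isHighSurrogate, Character.isLowSurrogate and Character.toCodePoint. The class CombineSurrogates, the helper codePointFrom and the sample line are assumptions made up for this example.

```java
// Sketch only: CombineSurrogates and codePointFrom are illustrative names.
class CombineSurrogates {
    // Returns the code point that starts at index i,
    // combining a high/low surrogate pair when one is present.
    static int codePointFrom(String s, int i) {
        char first = s.charAt(i);
        if (Character.isHighSurrogate(first) && i + 1 < s.length()) {
            char second = s.charAt(i + 1);
            if (Character.isLowSurrogate(second)) {
                return Character.toCodePoint(first, second);
            }
        }
        return first; // BMP character, or an unpaired surrogate returned as-is
    }

    public static void main(String[] args) {
        // Hypothetical sample line; \uD83D\uDE00 is U+1F600 written as a surrogate pair.
        String line = "<string>\uD83D\uDE00</string>";
        int pos = line.indexOf('>') + 1;
        System.out.println(Integer.toHexString(line.charAt(pos)));         // d83d
        System.out.println(Integer.toHexString(codePointFrom(line, pos))); // 1f600
    }
}
```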

In this case, you can use the method codePointAt instead of charAt.

So, instead of

char emoji_char = str.charAt(emoji_pos);

use:

int emojiCodePoint = str.codePointAt(emojiPos);
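Put together, the loop from the question might look roughly like the sketch below. To keep it self-contained it works on a list of lines instead of reading the file, and the class name EmojiCodes, the method codes and the sample lines are assumptions for illustration rather than part of the original program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: EmojiCodes and codes() are illustrative names.
class EmojiCodes {
    // For each line starting with <string>, returns the hex code point
    // of the first character after the opening tag.
    static List<String> codes(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().startsWith("<string>")) {
                int emojiPos = line.indexOf('>') + 1;
                if (emojiPos < line.length()) {
                    // codePointAt combines a surrogate pair into one int value.
                    int emojiCodePoint = line.codePointAt(emojiPos);
                    result.add(Integer.toHexString(emojiCodePoint));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for the lines of the plist file.
        List<String> sample = Arrays.asList(
                "    <string>\uD83D\uDE04</string>",
                "    <key>People</key>",
                "    <string>\uD83C\uDF39</string>");
        System.out.println(codes(sample)); // [1f604, 1f339]
    }
}
```

In the original program, the same change amounts to replacing the charAt call and the char variable inside the while loop.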

For more information, read the UTF FAQ on the Unicode Consortium website.

Note: The Java coding convention is that variable, field and method names should be in lower camel case: the first word starts with a lowercase letter, each following word starts with an uppercase letter, and there are no underscores. So the variable name should be emojiCodePoint, not emoji_code_point. Underscores are only acceptable in constant names (all uppercase, for example CASE_INSENSITIVE_ORDER).
