What Unicode Code Points Are
Every character in every script — Latin, Cyrillic, CJK, emoji, mathematical symbols, ancient hieroglyphs — has a unique number assigned by the Unicode Consortium. A is U+0041 (decimal 65). 世 is U+4E16. 👋 is U+1F44B. The number is called a code point; the encoding (UTF-8, UTF-16, UTF-32) determines how the code point is stored as bytes. Code points themselves are encoding-agnostic — they're the abstract identity of each character.
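As a quick illustration of the examples above, the following sketch prints the code point of each character in U+XXXX notation. The helper name toCodePointLabel is made up for this example and is not part of the tool.

```javascript
// Format one character's code point as U+XXXX (at least four hex digits).
// toCodePointLabel is a hypothetical helper name used only in this sketch.
function toCodePointLabel(ch) {
  const cp = ch.codePointAt(0);
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

const labels = ["A", "世", "👋"].map(toCodePointLabel);
console.log(labels.join(" ")); // U+0041 U+4E16 U+1F44B
```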
The Text to Unicode Converter goes both directions: encode any text into its code points (in five different formats), and decode a list of code points back into text. The conversion handles emoji and rare CJK correctly, which sounds trivial but isn't — naive JavaScript breaks on anything outside the Basic Multilingual Plane.
How the Microapp Text to Unicode Converter Works
Pick a mode (encode or decode). In encode mode, type or paste text, pick the output format (U+XXXX, \uXXXX, &#x; HTML hex, &#; HTML decimal, or plain decimal numbers), and pick a separator (space, comma, none, or newline). The output appears below as you type.
Decode mode: paste any list of code points in any of the five formats — they can even be mixed in the same input. The decoder finds them with a regex and reconstructs the original text. Both modes use String.prototype.codePointAt and String.fromCodePoint, the modern JavaScript APIs that handle the entire Unicode space correctly.
Example: Hi 👋 encodes to U+0048 U+0069 U+0020 U+1F44B in U+XXXX format. Notice the wave emoji becomes a single code point (U+1F44B), not two surrogate halves (U+D83D U+DC4B). Decode the same string back and you get Hi 👋. Round-trip safe.
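The tool's actual regex is not published here, so the following is only a sketch of how a mixed-format decoder along these lines could work. The pattern, the constant TOKEN, and the function decodeCodePoints are all assumptions made for illustration.

```javascript
// One alternative per input format. Order matters: the bare-decimal
// alternative comes last so it cannot swallow digits belonging to another format.
const TOKEN = new RegExp(
  [
    "U\\+([0-9A-Fa-f]{4,6})",      // U+0048
    "\\\\u\\{([0-9A-Fa-f]{1,6})\\}", // \u{1F44B}
    "\\\\u([0-9A-Fa-f]{4})",       // \u0048
    "&#[xX]([0-9A-Fa-f]+);",       // &#x48;
    "&#(\\d+);",                   // &#72;
    "(\\d+)",                      // 72
  ].join("|"),
  "g"
);

// Hypothetical decoder: find every token, convert it to a number, rebuild the text.
function decodeCodePoints(input) {
  let out = "";
  for (const m of input.matchAll(TOKEN)) {
    const hex = m[1] ?? m[2] ?? m[3] ?? m[4];
    const cp = hex !== undefined ? parseInt(hex, 16) : parseInt(m[5] ?? m[6], 10);
    // String.fromCodePoint throws above U+10FFFF, so skip out-of-range values.
    if (cp <= 0x10ffff) out += String.fromCodePoint(cp);
  }
  return out;
}

console.log(decodeCodePoints("U+0048 \\u0069 &#x20; &#128075;")); // Hi 👋
```

One design consequence worth noting: a surrogate pair written as two \uXXXX escapes still decodes to a single emoji, because the two halves end up adjacent in the output string.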
Five Output Formats — Why So Many?
Different ecosystems use different conventions for representing the same code point. Pick the one that matches the place you're pasting:
| Format | Example | Where you'll see it |
|---|---|---|
| U+XXXX | U+0048 | Unicode Consortium docs, character pickers, OS info dialogs |
| \uXXXX | \u0048 | JavaScript and JSON string literals |
| &#x; HTML hex | `&#x48;` | HTML entities (modern style) |
| &#; HTML decimal | `&#72;` | HTML entities (legacy/decimal style) |
| Plain decimal | 72 | Database fields, CSVs, raw integer arrays |
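The table can be read as five small formatting functions over the same number. The sketch below shows one way that might look; the format keys (uplus, js, htmlHex, htmlDec, decimal) and the function encodeText are hypothetical names, not the tool's real options, and details such as zero-padding of HTML hex values are assumptions.

```javascript
const hex = (cp) => cp.toString(16).toUpperCase();

// One formatter per row of the table. Keys are made up for this sketch.
const FORMATTERS = {
  uplus: (cp) => "U+" + hex(cp).padStart(4, "0"),
  js: (cp) =>
    cp <= 0xffff ? "\\u" + hex(cp).padStart(4, "0") : "\\u{" + hex(cp) + "}",
  htmlHex: (cp) => "&#x" + hex(cp) + ";",
  htmlDec: (cp) => "&#" + cp + ";",
  decimal: (cp) => String(cp),
};

// Array.from iterates a string by code point, so an emoji arrives as one item.
function encodeText(text, format, separator = " ") {
  return Array.from(text, (ch) => FORMATTERS[format](ch.codePointAt(0))).join(separator);
}

console.log(encodeText("Hi 👋", "uplus"));  // U+0048 U+0069 U+0020 U+1F44B
console.log(encodeText("H", "htmlHex"));    // &#x48;
console.log(encodeText("Hi", "decimal", ",")); // 72,105
```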
The Surrogate-Pair Trap
Code points up to U+FFFF fit in a single 16-bit JavaScript "code unit." Code points above U+FFFF (emoji, ancient scripts, supplementary CJK) require surrogate pairs: two 16-bit values that together encode one code point. Naive code that uses str.charCodeAt(0) only sees the first half of a surrogate pair, which is meaningless on its own.
Example of the trap: "👋".charCodeAt(0) returns 55357 (the high-surrogate half, 0xD83D), not the actual code point 128075. Old code that splits text into "characters" using array indexing breaks on emoji for the same reason. Modern str.codePointAt(0) looks at both halves and returns the real code point. Same fix on output: String.fromCharCode(0x1F44B) silently truncates the value to 16 bits and returns the unrelated private-use character U+F44B; String.fromCodePoint(0x1F44B) produces 👋.
This tool uses the modern APIs throughout, which is why emoji round-trip cleanly.
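The trap is easy to reproduce in any JavaScript console. This short sketch contrasts the code-unit APIs with the code-point APIs on the wave emoji; the variable names are arbitrary.

```javascript
const wave = "👋";

const units = wave.length;          // counts UTF-16 code units
const points = [...wave].length;    // spread iterates by code point
const firstUnit = wave.charCodeAt(0);
const firstPoint = wave.codePointAt(0);

console.log(units, points);         // 2 1
console.log(firstUnit, firstPoint); // 55357 128075

// fromCharCode keeps only the low 16 bits of its argument.
console.log(String.fromCharCode(0x1f44b) === "\uF44B"); // true
console.log(String.fromCodePoint(0x1f44b) === wave);    // true
```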
What's the Difference Between Unicode and UTF-8?
Unicode is the character set — it answers "which characters exist and what's each one's code point?" UTF-8, UTF-16, and UTF-32 are encodings — they answer "how do I store these code points as actual bytes?"
UTF-8 uses 1-4 bytes per code point depending on the value (ASCII gets 1 byte; emoji get 4). UTF-16 uses 2 or 4 bytes. UTF-32 always uses 4. UTF-8 won the web because it's backwards-compatible with ASCII (an ASCII file is also a valid UTF-8 file) and it doesn't waste bytes on Latin text. This tool works at the code point level — encoding-agnostic — so the output is the same regardless of how your text is stored.
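To make the byte counts concrete, here is a small sketch that measures the same characters in each encoding. It assumes an environment with the standard TextEncoder global (modern browsers and Node.js); byteSizes is a hypothetical helper, and the UTF-16 and UTF-32 sizes are computed arithmetically rather than by actually encoding.

```javascript
const encoder = new TextEncoder(); // always encodes to UTF-8

// Hypothetical helper: byte size of a string in each encoding.
function byteSizes(s) {
  return {
    utf8: encoder.encode(s).length,
    utf16: s.length * 2,      // 2 bytes per UTF-16 code unit
    utf32: [...s].length * 4, // 4 bytes per code point
  };
}

for (const ch of ["A", "é", "世", "👋"]) {
  console.log(ch, byteSizes(ch));
}
// "A" is 1 byte in UTF-8, "é" (U+00E9) is 2, "世" is 3, "👋" is 4.
```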
Common Pitfalls
Mixing escape syntaxes. JavaScript's \uXXXX only handles up to U+FFFF; for larger code points you need \u{XXXXX} with curly braces (introduced in ES2015). The encoder uses the right form automatically based on the code point's value.
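One related wrinkle: JSON does not accept the curly-brace form, so a code point above U+FFFF has to be written there as two \uXXXX escapes forming a surrogate pair. The sketch below computes that pair by hand; toJsonEscape is a hypothetical helper and is not claimed to be what the tool emits.

```javascript
const hex4 = (n) => n.toString(16).toUpperCase().padStart(4, "0");

// Hypothetical helper: JSON-safe escape for any code point.
function toJsonEscape(cp) {
  if (cp <= 0xffff) return "\\u" + hex4(cp);
  const offset = cp - 0x10000;
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return "\\u" + hex4(high) + "\\u" + hex4(low);
}

console.log(toJsonEscape(0x1f44b));                         // \uD83D\uDC4B
console.log(JSON.parse('"' + toJsonEscape(0x1f44b) + '"')); // 👋

// In JavaScript source, both spellings denote the same string.
console.log("\u{1F44B}" === "\uD83D\uDC4B"); // true
```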
Combining characters. Some characters render as one visual glyph but are encoded as multiple code points — accented letters can be a base letter plus a combining accent (é = U+0065 + U+0301) or a single precomposed character (é = U+00E9). The encoder shows you the actual code points, which can be more than the visible character count. Use Unicode normalization (str.normalize("NFC")) if you need a canonical form.
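A minimal sketch of the two spellings of é and how normalization reconciles them (variable names are arbitrary):

```javascript
const decomposed = "e\u0301"; // base letter + combining acute accent
const precomposed = "\u00E9"; // single precomposed character

console.log(decomposed === precomposed);  // false
console.log([...decomposed].length);      // 2
console.log([...precomposed].length);     // 1

// NFC composes where possible; NFD decomposes.
console.log(decomposed.normalize("NFC") === precomposed); // true
console.log(precomposed.normalize("NFD") === decomposed); // true
```

Normalizing both sides to the same form before comparing or encoding avoids treating visually identical strings as different.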
Skin-tone and ZWJ-joined emoji. Compound emoji like 👨‍👩‍👧 are sequences of several emoji joined by zero-width joiners (U+200D), and a skin-tone variant is a base emoji followed by a modifier code point (U+1F3FB to U+1F3FF). The encoder shows every code point in the sequence, typically five or more for family emoji. That's correct, just longer than expected.
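To see what such a sequence contains, this sketch builds a three-person family emoji and a skin-toned wave from explicit escapes and lists their code points. The helper name listCodePoints is made up for the example.

```javascript
// Hypothetical helper: every code point of a string in U+XXXX form.
function listCodePoints(s) {
  return Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// man + ZWJ + woman + ZWJ + girl
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
console.log(listCodePoints(family).join(" "));
// U+1F468 U+200D U+1F469 U+200D U+1F467

// waving hand + medium skin-tone modifier
const tonedWave = "\u{1F44B}\u{1F3FD}";
console.log(listCodePoints(tonedWave).join(" ")); // U+1F44B U+1F3FD
```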
Related Tools
For browsing emoji visually, use the Emoji Picker or search by name with the Emoji Search. To encode characters as HTML entities for embedding in pages, the HTML Encoder/Decoder is the right tool. For URL-encoding (different from Unicode encoding — covers reserved URL characters), see the URL Encoder/Decoder. To encode arbitrary binary as ASCII-safe text, use the Base64 Encoder/Decoder.