What are HTML entities and why do they matter?
An HTML entity is a special sequence of characters that represents a reserved or hard-to-type character in HTML markup. Entities start with an ampersand (&) and end with a semicolon (;). The entity &lt; renders as <, &amp; renders as &, and &copy; renders as ©. Present since early HTML, expanded in HTML 4.01, and extended in HTML5 to over 2,000 named entities, they exist for two reasons: escaping reserved characters (so the browser doesn't think your text is markup) and typing characters that aren't on your keyboard (em dashes, math symbols, currency).
The five mandatory escapes — the ones that XSS-prevention guides drill into every web developer:
| Char | Named entity | Numeric (decimal) | Numeric (hex) | Why escape? |
|---|---|---|---|---|
| < | &lt; | &#60; | &#x3C; | Starts a tag; the #1 XSS vector |
| > | &gt; | &#62; | &#x3E; | Ends a tag |
| & | &amp; | &#38; | &#x26; | Starts an entity; escape it to get a literal & |
| " | &quot; | &#34; | &#x22; | Closes attribute values |
| ' (apostrophe) | &apos; (HTML5) | &#39; | &#x27; | Closes single-quoted attributes |
If your application takes user input and renders it inside HTML without escaping these five characters, you have an XSS vulnerability. Most modern web frameworks and template engines escape by default; raw string concatenation or "trust-us" templating bypasses that protection. This is why innerHTML with user data is dangerous and textContent is safe.
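As a minimal sketch of what escaping the five characters achieves, the Python snippet below uses the standard library's html.escape. The render_comment helper and the payload are illustrative assumptions, not part of any framework.

```python
import html


def render_comment(user_text: str) -> str:
    """Hypothetical helper: wrap user text in a <p>, escaping the five characters."""
    # quote=True (the default) also escapes " and ', which matters in attributes.
    return "<p>" + html.escape(user_text, quote=True) + "</p>"


payload = '<img src=x onerror="alert(1)">'
print(render_comment(payload))
# <p>&lt;img src=x onerror=&quot;alert(1)&quot;&gt;</p>
```

The browser displays the payload as text because no character that could open a tag or close an attribute survives unescaped.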
Named vs numeric vs hex — three ways to encode the same character
| Form | Example | Pros | Cons |
|---|---|---|---|
| Named entity | &copy; | Readable, memorable | ~2,000 names; not all parsers support all of them |
| Decimal numeric | &#169; | Universal: every char has a decimal code point | Less readable than named |
| Hex numeric | &#xA9; | Matches Unicode references (U+00A9) | Slightly less common; same support as decimal |
For the five mandatory escapes, named entities are most readable and (&apos; aside) universally supported. For uncommon symbols (em dash, em space, math operators, arrows), numeric entities are safer because every parser recognizes them. Always include the trailing semicolon: HTML5 browsers tolerate a missing semicolon only for a legacy subset of names (such as &amp and &copy), and XML and strict parsers reject the entity entirely.
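The equivalence of the three forms, and the semicolon behavior, can be checked with Python's html.unescape, which follows the HTML5 decoding rules. This is a sketch: numeric_forms is a hypothetical helper introduced only for illustration.

```python
import html

# All three spellings of the copyright sign decode to the same character.
forms = ["&copy;", "&#169;", "&#xA9;"]
decoded = [html.unescape(f) for f in forms]
print(decoded)  # ['©', '©', '©']


def numeric_forms(ch: str) -> tuple[str, str]:
    """Hypothetical helper: decimal and hex references for one character."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:X};"


print(numeric_forms("\u2014"))  # em dash -> ('&#8212;', '&#x2014;')

# Missing semicolon: only legacy names such as "copy" are still decoded.
print(html.unescape("&copy 2024"))   # © 2024
print(html.unescape("&mdash 2024"))  # &mdash 2024
```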
HTML5's named-entity gotchas
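- &apos; only became official in HTML5. It works in modern browsers, but HTML 4 never defined it, so HTML 4-era user agents (and XHTML 1.0 pages served to them as text/html) may show it literally. Use &#39; if legacy compatibility matters.
- Some "obvious" names don't exist. There's no &asterisk; (HTML5 spells it &ast;), and plain ASCII characters such as * and $ never need escaping. Don't over-escape.
- Case matters. &Auml; = Ä, &auml; = ä: different characters.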
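These gotchas are easy to confirm with Python's standard html module. The snippet is a small sketch; the sample strings are made up for illustration.

```python
import html

# Case matters: the two names map to different characters.
print(html.unescape("&Auml;"), html.unescape("&auml;"))  # Ä ä

# A name that is not in the HTML5 table is left untouched.
print(html.unescape("&asterisk;"))  # &asterisk;

# Plain ASCII punctuation is not rewritten by the HTML escaper.
print(html.escape("5 * $3"))  # 5 * $3
```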
Useful named entities by category
Currency & math
| Symbol | Named | Numeric |
|---|---|---|
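| © | &copy; | &#169; |
| ® | &reg; | &#174; |
| ™ | &trade; | &#8482; |
| € | &euro; | &#8364; |
| £ | &pound; | &#163; |
| ¥ | &yen; | &#165; |
| × | &times; | &#215; |
| ÷ | &divide; | &#247; |
| ± | &plusmn; | &#177; |
| ° | &deg; | &#176; |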
Punctuation & whitespace
| Symbol | Named | Numeric |
|---|---|---|
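| (non-breaking space) | &nbsp; | &#160; |
| — | &mdash; | &#8212; |
| – | &ndash; | &#8211; |
| … | &hellip; | &#8230; |
| “ | &ldquo; | &#8220; |
| ” | &rdquo; | &#8221; |
| ‘ | &lsquo; | &#8216; |
| ’ | &rsquo; | &#8217; |
| « | &laquo; | &#171; |
| » | &raquo; | &#187; |
| · | &middot; | &#183; |
| • | &bull; | &#8226; |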
Arrows
| Symbol | Named | Numeric |
|---|---|---|
| ← | &larr; | &#8592; |
| → | &rarr; | &#8594; |
| ↑ | &uarr; | &#8593; |
| ↓ | &darr; | &#8595; |
| ⇐ | &lArr; | &#8656; |
| ⇒ | &rArr; | &#8658; |
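Rows like the ones in these tables can be looked up programmatically. The sketch below uses Python's html.entities.codepoint2name, which covers the HTML 4 names only; entity_row is a hypothetical helper for illustration.

```python
from html.entities import codepoint2name


def entity_row(ch: str) -> tuple[str, str, str]:
    """Hypothetical helper: (symbol, named entity, decimal entity) for one character."""
    cp = ord(ch)
    name = codepoint2name.get(cp)  # None when HTML 4 defines no name
    named = f"&{name};" if name else "(none)"
    return ch, named, f"&#{cp};"


for ch in "©€→":
    print(entity_row(ch))
# ('©', '&copy;', '&#169;')
# ('€', '&euro;', '&#8364;')
# ('→', '&rarr;', '&#8594;')
```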
HTML entity encoding for XSS prevention — the critical rules
XSS (Cross-Site Scripting) happens when user-supplied content is rendered as HTML/JavaScript instead of as text. The fix: encode user input before placing it in an HTML context. But "an HTML context" is plural — different contexts need different encoding.
| Context | Example | Required encoding |
|---|---|---|
| HTML body / text node | <p>USER</p> | Escape < > & |
| HTML attribute (quoted) | <a title="USER"> | Escape < > & " |
| HTML attribute (unquoted) | <a title=USER> | Encode every non-alphanumeric (or quote the attribute) |
| JavaScript context | <script>var x = "USER";</script> | JavaScript escape (\x3C), NOT HTML escape |
| CSS context | <style>.a { content: "USER" }</style> | CSS escape (\3C), NOT HTML escape |
| URL parameter | <a href="?q=USER"> | URL encode (%3C), NOT HTML escape |
The classic mistake: HTML-escaping content that ends up inside JavaScript. Inside a <script> block, &lt; stays the literal text &lt; because the JavaScript parser does not decode entities, so the < is never recovered. Use the right escape for the destination context, not the source.
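To make the per-context rule concrete, here is a rough Python sketch that sends the same input to three destinations using only the standard library. The variable names and the sample payload are assumptions for illustration; production code would normally rely on a vetted encoder library.

```python
import html
import json
from urllib.parse import quote

user = '"><script>alert(1)</script>'

# HTML body or quoted attribute: HTML-escape.
body = f"<p>{html.escape(user)}</p>"

# URL parameter: percent-encode first, then HTML-escape for the attribute.
link = f'<a href="?q={html.escape(quote(user, safe=""))}">search</a>'

# JavaScript string inside <script>: JSON-encode, then neutralize "<" so a
# "</script>" inside the data cannot close the element early.
js_literal = json.dumps(user).replace("<", "\\u003c")
script = f"<script>var x = {js_literal};</script>"

print(body)
print(link)
print(script)
```

Each destination gets a different encoding of the same raw value, and the JavaScript literal still evaluates to the original string.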
Defense in depth: add a Content Security Policy (for example, script-src 'self'). Even if escaping fails, a tight CSP blocks most injected scripts.
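A policy like this is delivered as an HTTP response header. The following is only a sketch of assembling one in Python; csp_header is a hypothetical helper, and the default-src directive is an assumption added for illustration.

```python
def csp_header(script_sources: list[str]) -> tuple[str, str]:
    """Hypothetical helper: build a Content-Security-Policy header pair."""
    # Without 'unsafe-inline' in script-src, inline <script> blocks are refused.
    policy = "default-src 'self'; script-src " + " ".join(script_sources)
    return ("Content-Security-Policy", policy)


print(csp_header(["'self'"]))
# ('Content-Security-Policy', "default-src 'self'; script-src 'self'")
```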
HTML escaping in 8 programming languages
JavaScript
// Modern: use textContent (no encoding bugs possible)
el.textContent = userInput; // Safe ✓
// If you MUST build HTML strings, escape manually
function escapeHtml(s) {
  // Map each of the five reserved characters to its entity
  return s.replace(/[&<>"']/g, c => ({
    '&': '&amp;', '<': '&lt;', '>': '&gt;',
    '"': '&quot;', "'": '&#39;'
  }[c]));
}
// To DECODE entities (use a textarea — the browser does it)
function decodeHtml(s) {
  const ta = document.createElement('textarea');
  ta.innerHTML = s;
  return ta.value;
}
Python
import html
html.escape("<script>alert(1)</script>")
# '&lt;script&gt;alert(1)&lt;/script&gt;'
html.escape("It's & \"quoted\"", quote=True)
# 'It&#x27;s &amp; &quot;quoted&quot;'
# Decode
html.unescape("&lt;p&gt;Hello&lt;/p&gt;") # '<p>Hello</p>'
PHP
// Mandatory: htmlspecialchars (escapes 5 reserved chars)
echo htmlspecialchars($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');
// htmlentities — encodes ALL applicable entities (rarely what you want)
echo htmlentities($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');
// Decode
echo html_entity_decode($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');
Java
// Apache Commons Text
import org.apache.commons.text.StringEscapeUtils;
String safe = StringEscapeUtils.escapeHtml4(userInput);
String back = StringEscapeUtils.unescapeHtml4(safe);
// OWASP Java Encoder (recommended for XSS prevention)
import org.owasp.encoder.Encode;
String safeHtml = Encode.forHtml(userInput); // distinct name: "safe" is already declared above
String safeAttr = Encode.forHtmlAttribute(userInput);
String safeJs = Encode.forJavaScript(userInput); // different escape!
Go
import "html"
escaped := html.EscapeString("<script>")
// "&lt;script&gt;"
unescaped := html.UnescapeString("&lt;p&gt;hi&lt;/p&gt;")
// "<p>hi</p>"
// In Go templates, html/template auto-escapes by default, so use it:
import (
    "html/template"
    "os" // needed for os.Stdout below
)
t := template.Must(template.New("x").Parse(`<p>{{.}}</p>`))
t.Execute(os.Stdout, "<script>") // automatically escaped
Ruby
require 'cgi'
CGI.escapeHTML("<script>alert(1)</script>")
# "&lt;script&gt;alert(1)&lt;/script&gt;"
CGI.unescapeHTML("&lt;p&gt;hi&lt;/p&gt;")
# "<p>hi</p>"
# In Rails ERB, the <%= %> syntax auto-escapes
<%= user.name %> # auto-escaped, safe
<%= raw user.name %> # NOT escaped — only when you've already trusted the input
Rust
use html_escape::{encode_text, decode_html_entities};
let safe = encode_text("<script>");
// "&lt;script&gt;"
let back = decode_html_entities("&lt;p&gt;");
// "<p>"
// Or use Askama / Tera templates — they auto-escape
Bash (recode / sed / xmlstarlet)
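# GNU recode (decodes HTML entities to plain text)
echo '&lt;p&gt;Hello &amp; world&lt;/p&gt;' | recode html..ascii
# "<p>Hello & world</p>"
# Plain sed (5 mandatory chars); \& is a literal ampersand in the replacement
sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' \
    -e 's/"/\&quot;/g' -e "s/'/\&#39;/g"
# Decode with xmlstarlet
echo '&lt;p&gt;hi&lt;/p&gt;' | xmlstarlet unesc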
HTML entity best practices
- Always use auto-escaping templating engines. Rails ERB, Thymeleaf, html/template (Go), and React's JSX escape by default; Jinja2 does so when autoescape is enabled (as Flask does for HTML templates). Manual escaping is bug-prone.
- Use textContent in JavaScript, not innerHTML, when inserting user data. Eliminates the encoding question entirely.
- Never trust input from anywhere. Database content, third-party APIs, your own admin panel: all can be sources of XSS payloads.
- Encode at the boundary, not the storage. Store raw user input; encode only when rendering. Lets you change templates and re-render correctly.
- Pick the right encoder for the context. HTML escape, JS escape, URL encode, CSS escape — they're all different. OWASP's library handles them all.
- Layer defenses with CSP. Even if XSS slips through, a tight Content Security Policy blocks inline-script execution.
- Sanitize, don't escape, when allowing some HTML. If users paste rich content, use DOMPurify (JS) or Bleach (Python) to strip dangerous tags. Don't try to write a regex sanitizer.
- Don't HTML-escape data going into JSON. JSON has different rules; use JSON.stringify or your language's JSON library.
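Two of these practices, encoding at the boundary and keeping HTML escaping out of JSON, can be sketched in a few lines of Python. The stored value and variable names are made up for illustration.

```python
import html
import json

# The stored value is kept raw, exactly as the user typed it.
stored_name = 'Tom & "Jerry" <tj@example.com>'

# HTML output: escape at render time.
html_out = f"<li>{html.escape(stored_name)}</li>"

# JSON output: use the JSON encoder, not the HTML escaper.
json_out = json.dumps({"name": stored_name})

print(html_out)  # <li>Tom &amp; &quot;Jerry&quot; &lt;tj@example.com&gt;</li>
print(json_out)  # {"name": "Tom & \"Jerry\" <tj@example.com>"}
```

Because the stored value is never pre-escaped, both outputs can be regenerated correctly if the template or the API format changes later.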