XSS defenses go wrong when teams treat “sanitize user input” like a single magic step. It isn’t. Different kinds of input need different handling, and some of the most common advice online is flat-out incomplete.
My opinion: if you only remember one thing, remember this — validate for business rules, encode for output, and sanitize only when you intentionally allow HTML.
That distinction matters because “sanitizing input” can mean wildly different things:
- stripping dangerous characters
- validating format
- cleaning user-supplied HTML
- escaping data before rendering
- filtering URLs or CSS values
Those are not interchangeable.
The short version
Here’s the practical comparison:
| Approach | Best for | Pros | Cons |
|---|---|---|---|
| Input validation | Emails, usernames, IDs, dates | Simple, predictable, blocks bad data early | Does not stop XSS by itself |
| Output encoding | Any untrusted data rendered into HTML/JS/CSS/URLs | Most reliable general defense | Must match the exact output context |
| HTML sanitization | Rich text editors, comments with formatting | Lets users keep safe HTML | Easy to misconfigure, library-dependent |
| Character stripping / regex cleaning | Very limited controlled formats | Fast for narrow cases | Dangerous as a general XSS defense |
| CSP | Defense in depth | Reduces impact of some XSS bugs | Not a replacement for proper handling |
If your app does not need user HTML, do not sanitize HTML. Store the text and output-encode it.
1. Input validation
Input validation is about enforcing what the data should be, not trying to detect every possible attack payload.
For example:
- usernames: letters, numbers, underscores, 3–30 chars
- age: integer in a sensible range
- country code: two uppercase letters
- product ID: UUID or numeric ID
Pros
- Easy to reason about
- Improves data quality
- Shrinks attack surface
- Usually fast and cheap to implement
Cons
- Doesn’t solve XSS when you later render the data unsafely
- Breaks down for free-form text fields
- Too many teams overestimate what regex can do here
A good validation rule:
function validateUsername(input) {
  return /^[a-zA-Z0-9_]{3,30}$/.test(input);
}
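The other rules listed above could be sketched the same way. The function names and the exact limits here (for example the 0 to 130 age range) are assumptions for illustration, not rules from any particular codebase:

```javascript
// Hypothetical validators for the example fields listed earlier.
// Each one checks shape only; none of them is an XSS defense.

function validateAge(input) {
  // Accept only plain decimal digits, then range-check the number.
  if (!/^\d{1,3}$/.test(String(input))) return false;
  const age = Number(input);
  return age >= 0 && age <= 130; // the "sensible range" is an assumption
}

function validateCountryCode(input) {
  // Two uppercase ASCII letters; does not check that the code is assigned.
  return /^[A-Z]{2}$/.test(input);
}

function validateProductId(input) {
  // Either a positive integer or a canonical 8-4-4-4-12 hex UUID.
  return (
    /^[1-9]\d{0,14}$/.test(input) ||
    /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(input)
  );
}
```

Each rule describes what the value should look like and rejects everything else, rather than listing things it should not contain.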
A bad validation rule pretending to stop XSS:
function antiXssFilter(input) {
  return !/<script|javascript:|onerror=|onload=/i.test(input);
}
That second one is brittle and bypassable. Attackers do not politely use <script> every time.
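To make that concrete, here is a small sketch that runs the filter against a few payloads. The payloads are illustrative picks, not an exhaustive list; each avoids every substring on the blocklist:

```javascript
// The blocklist filter from above, reproduced so this snippet runs on its own.
function antiXssFilter(input) {
  return !/<script|javascript:|onerror=|onload=/i.test(input);
}

// Illustrative payloads chosen for this sketch.
const payloads = [
  '<img src=x onerror =alert(1)>',             // whitespace before "="
  '<details open ontoggle=alert(1)>',          // handler that is not on the blocklist
  '<a href="java&#x09;script:alert(1)">x</a>', // entity-encoded tab splits the scheme
];

for (const p of payloads) {
  // Logs true for each payload: the filter considers it "safe".
  console.log(antiXssFilter(p), p);
}
```

Adding more patterns to the regex only moves the problem; the list of things a browser will execute is longer than any blocklist you are likely to maintain.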
Use validation to enforce expected structure. Don’t use it as your primary XSS control.
2. Output encoding
This is the workhorse defense. If you render untrusted data as text in the browser, output encoding is usually what you want.
The catch: encoding depends on where the data lands.
HTML text context
Safe pattern:
<div>{{ user.bio }}</div>
In modern templating systems, this is often auto-escaped by default.
Rendered safely, <img src=x onerror=alert(1)> becomes text, not executable HTML.
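If you want to see roughly what that auto-escaping does, here is a minimal sketch. `escapeHtml` is a hypothetical helper written for this example; real template engines ship their own implementation:

```javascript
// Minimal escaper for HTML text and quoted attribute values.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

function escapeHtml(value) {
  // Replace each character that is significant to the HTML parser.
  return String(value).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

console.log(escapeHtml('<img src=x onerror=alert(1)>'));
// → &lt;img src=x onerror=alert(1)&gt;
```

The original string is unchanged in storage; only the rendered representation is encoded.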
Pros
- Reliable when used correctly
- Built into many frameworks
- Preserves original data
- Works well for plain text user content
Cons
- Context-sensitive
- Easy to break when developers bypass framework protections
- Not enough when you intentionally allow HTML
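The "context-sensitive" point is easier to see side by side. The sketch below uses hypothetical helper names and covers three contexts only; CSS values and unquoted attributes need their own handling and are not shown:

```javascript
// Hypothetical helpers, one per output context. Names are illustrative.

// HTML text and quoted attribute values.
function encodeForHtml(value) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return String(value).replace(/[&<>"']/g, (ch) => map[ch]);
}

// A single URL query-parameter value.
function encodeForUrlParam(value) {
  return encodeURIComponent(String(value));
}

// A data value embedded inside an inline <script> block.
// JSON.stringify alone leaves "</script>" intact, so "<" is escaped as well.
function encodeForScript(value) {
  return JSON.stringify(value)
    .replace(/</g, '\\u003c')
    .replace(/\u2028/g, '\\u2028')
    .replace(/\u2029/g, '\\u2029');
}

console.log(encodeForScript('</script>'));
// → "\u003c/script>"
```

The same input produces three different encodings, and using the HTML one inside a script block (or the reverse) leaves the value unsafe for that context.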
Dangerous mistake: wrong sink
This is where teams get burned:
element.innerHTML = userInput;
If userInput is untrusted, you have probably just created an XSS vulnerability: innerHTML parses the string as markup.
Safer:
element.textContent = userInput;
Or, when building elements programmatically:
const div = document.createElement('div');
div.textContent = userInput;
container.appendChild(div);
If you’re using React, Vue, Angular, Razor, Django templates, or similar, the default escaped rendering is usually the safe path. Problems start when someone reaches for raw HTML rendering like dangerouslySetInnerHTML, v-html, or direct DOM injection.
3. HTML sanitization
This is the right tool when users are allowed to submit rich text: comments with formatting, CMS content, support tickets with markup, WYSIWYG editor output.
You are no longer treating input as plain text. You are allowing some HTML, so you need a sanitizer that removes dangerous elements and attributes while preserving approved markup.
Pros
- Supports rich content
- Better user experience than stripping all formatting
- Can enforce allowlists for tags and attributes
Cons
- Harder than it looks
- Misconfiguration creates holes
- Sanitizer bypasses happen, so patching matters
- HTML is not the only problem — URLs, SVG, MathML, and CSS can be tricky
Here’s a typical server-side example in Node.js using a sanitizer library:
import sanitizeHtml from 'sanitize-html';

const clean = sanitizeHtml(userHtml, {
  allowedTags: ['p', 'b', 'i', 'em', 'strong', 'a', 'ul', 'ol', 'li', 'code', 'pre'],
  allowedAttributes: {
    a: ['href', 'title']
  },
  allowedSchemes: ['http', 'https', 'mailto']
});
That’s a sane starting point. Still, I would review every allowed tag and attribute with suspicion.
A few rules I follow:
- Prefer a small allowlist
- Be careful with style
- Be very careful with SVG
- Restrict URL schemes
- Patch sanitizer dependencies promptly
- Test with real payloads, not just happy-path formatting
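For the last rule, a coarse smoke check can be wired into your test suite. This is a sketch under assumptions: `findLeaks`, the payload list, and the patterns are all hypothetical, and `sanitize` stands for any `(string) => string` wrapper around your sanitizer and its config. A pattern scan like this can only flag failures; it cannot prove a configuration is safe.

```javascript
// Illustrative payloads; extend with ones relevant to your allowlist.
const PAYLOADS = [
  '<img src=x onerror=alert(1)>',
  '<a href="javascript:alert(1)">x</a>',
  '<svg><script>alert(1)</script></svg>',
  '<p style="background:url(javascript:alert(1))">x</p>',
];

// Things that should not survive a strict allowlist like the one above.
const SUSPICIOUS = [/<script/i, /\son\w+\s*=/i, /javascript:/i, /<svg/i, /\sstyle\s*=/i];

function findLeaks(sanitize, payloads = PAYLOADS) {
  // Return every payload whose sanitized output still looks dangerous.
  return payloads
    .map((input) => ({ input, output: sanitize(input) }))
    .filter(({ output }) => SUSPICIOUS.some((re) => re.test(output)));
}
```

In a test you would pass something like `(html) => sanitizeHtml(html, options)` and assert that the returned array is empty.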
If your product does not truly need HTML, don’t sanitize HTML “just in case.” That adds complexity you don’t need.
4. Character stripping and regex-based cleaning
This is the old-school move:
input = input.replace(/<script.*?>.*?<\/script>/gi, '');
input = input.replace(/[<>]/g, '');
I don’t recommend this as a general XSS strategy.
Pros
- Simple for highly constrained input
- Can be okay as a normalization step in narrow cases
- Sometimes useful for cosmetic cleanup
Cons
- Incomplete by design
- Easy to bypass with encoding tricks or alternate payload forms
- Often destroys legitimate user content
- Gives teams false confidence
If the field is “first name,” then yes, you can aggressively constrain it. If the field is “message,” “profile bio,” or “article body,” regex stripping is the wrong abstraction.
The browser parses HTML, not your intentions. That parser is much more flexible than a few regular expressions.
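A short sketch shows how the two replacements above behave on a few inputs. The function names are made up for this example; the regexes are the ones from the snippet:

```javascript
// The two replacements from above, split into functions for this sketch.
function stripScripts(input) {
  return input.replace(/<script.*?>.*?<\/script>/gi, '');
}

function stripAngles(input) {
  return input.replace(/[<>]/g, '');
}

// Nested tag: removing the inner pair reassembles a working outer tag.
console.log(stripScripts('<scr<script></script>ipt>alert(1)</script>'));
// → <script>alert(1)</script>

// "." does not match newlines, so a multi-line script is never matched.
console.log(stripScripts('<script>\nalert(1)\n</script>') === '<script>\nalert(1)\n</script>');
// → true

// No angle brackets are needed when the value lands inside a quoted attribute.
console.log(stripAngles('" onmouseover="alert(1)'));
// → " onmouseover="alert(1)

// Legitimate text is damaged.
console.log(stripAngles('if a < b then c > d'));
// → if a  b then c  d
```

So the same two lines both miss real payloads and corrupt harmless input.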
5. URL sanitization
URLs deserve their own section because developers often allow them into href, src, or redirect parameters without enough checks.
Bad:
link.href = userInput;
If userInput is javascript:alert(1), you have a problem.
Safer:
function isSafeUrl(value) {
  try {
    const url = new URL(value, 'https://example.com');
    return ['http:', 'https:'].includes(url.protocol);
  } catch {
    return false;
  }
}
Then:
if (isSafeUrl(userInput)) {
  link.href = userInput;
}
Pros
- Effective for link and media handling
- Easy to build around protocol allowlists
Cons
- Developers forget relative URLs, base resolution, and odd schemes
- Different sinks may have different parsing behavior
Treat URLs as structured data, not random strings.
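One case worth spelling out is redirect parameters. A protocol allowlist accepts `//evil.example/x`, because it resolves to an https URL on another host. That is fine for an outbound link but not for a post-login redirect. A minimal sketch of a stricter check, assuming you only ever redirect within your own origin (`APP_ORIGIN` and the function name are placeholders):

```javascript
// Hypothetical helper: allow a redirect target only if it stays on our origin.
const APP_ORIGIN = 'https://app.example'; // placeholder for this sketch

function isSameOriginRedirect(value, appOrigin = APP_ORIGIN) {
  if (typeof value !== 'string') return false;
  try {
    // Resolve against our own origin, the way a browser resolves a relative URL.
    const url = new URL(value, appOrigin);
    // Compare the parsed origin instead of matching on the raw string.
    return url.origin === appOrigin;
  } catch {
    return false;
  }
}

console.log(isSameOriginRedirect('/account'));         // → true
console.log(isSameOriginRedirect('//evil.example/x')); // → false
```

Comparing the parsed origin avoids string tricks such as `https://app.example.evil.test/`, which starts with the right text but points elsewhere.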
6. CSP as backup, not cleanup
Content Security Policy won’t sanitize input. It won’t fix unsafe innerHTML. What it does is reduce blast radius when something slips through.
A decent CSP can block inline script execution, restrict script sources, and make some classes of XSS much harder to exploit.
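As a rough illustration, a nonce-based policy could be assembled like this. The directive set is one reasonable starting point rather than a recommendation for every app, and `buildCsp` is a hypothetical helper; generating a fresh, unpredictable nonce per response is left to the caller:

```javascript
// Build a Content-Security-Policy header value for one response.
// The caller supplies a fresh random nonce (for example base64 of 16+ random bytes).
function buildCsp(nonce) {
  // Reject anything that could break out of the quoted nonce source.
  if (!/^[A-Za-z0-9+/_-]+={0,2}$/.test(nonce)) {
    throw new Error('nonce must be base64 or base64url');
  }
  return [
    "default-src 'self'",
    `script-src 'self' 'nonce-${nonce}'`, // inline scripts need a matching nonce attribute
    "object-src 'none'",
    "base-uri 'none'",
  ].join('; ');
}

console.log(buildCsp('abc123'));
// → default-src 'self'; script-src 'self' 'nonce-abc123'; object-src 'none'; base-uri 'none'
```

With a policy like this, an injected inline script without the nonce should be refused by the browser, which is the blast-radius reduction described above.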
Pros
- Strong defense in depth
- Helps contain mistakes
- Useful visibility with reporting
Cons
- Doesn’t replace output encoding or sanitization
- Can be painful to retrofit
- Weak CSPs are common and often overestimated
For implementation patterns, nonce usage, and rollout strategy, see CSP Guide.
What I recommend in real projects
Here’s the practical decision tree I use:
If the field should be plain text
- Validate length and business rules
- Store raw text
- Output-encode on render
- Use safe DOM APIs like textContent
Example:
app.post('/comment', (req, res) => {
  const comment = String(req.body.comment || '').slice(0, 2000);
  saveComment(comment);
  res.sendStatus(204);
});
Then during rendering, escape by default in your template engine.
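If you are building HTML strings by hand rather than through a template engine, a tagged template can stand in for escape-by-default. This is a sketch: `html` and `escapeHtml` are hypothetical helpers, and it is only appropriate for HTML text and quoted attribute positions:

```javascript
function escapeHtml(value) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return String(value).replace(/[&<>"']/g, (ch) => map[ch]);
}

// Tagged template: literal parts are trusted markup, interpolated values are escaped.
function html(strings, ...values) {
  return strings.reduce(
    (out, part, i) => out + part + (i < values.length ? escapeHtml(values[i]) : ''),
    ''
  );
}

const comment = '<img src=x onerror=alert(1)>';
console.log(html`<li class="comment">${comment}</li>`);
// → <li class="comment">&lt;img src=x onerror=alert(1)&gt;</li>
```

The stored comment stays raw; escaping happens at the moment it is placed into markup.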
If the field should contain limited HTML
- Sanitize with a maintained HTML sanitizer
- Use a strict allowlist
- Validate URLs inside attributes
- Re-sanitize if content is transformed later
- Pass content to raw HTML sinks (innerHTML, v-html, and similar) only after sanitization
If the field has a strict format
- Validate hard
- Reject anything outside expected shape
- Still encode on output
That last part matters. Even validated data can become dangerous in the wrong output context.
Common mistakes
These show up constantly in code reviews:
- Sanitizing once at input time and assuming the data is forever safe
- Using the same escaping for HTML, JavaScript, CSS, and URLs
- Trusting client-side sanitization alone
- Allowing raw HTML because “the admin panel is internal”
- Forgetting that stored XSS is usually worse than reflected XSS
- Using innerHTML for convenience
- Building custom sanitizers when a maintained library exists
My strongest opinion here: don’t invent your own XSS filter. I’ve never seen a homegrown one age well.
Pros and cons recap
If you want the blunt answer:
- Best default: output encoding
- Best for strict fields: input validation
- Best for rich text: HTML sanitization
- Worst general advice: strip “bad characters” and hope
- Best backup layer: CSP
XSS prevention works best when you stop asking “How do I sanitize all user input?” and start asking “What kind of data is this, and where will it be rendered?”
That shift is where most security programs start getting this right.