URL Extraction
Extracting and validating URLs is essential for web scraping, content processing, and security applications. Let's explore patterns from simple to comprehensive.
Basic URL Matching
Simple HTTP/HTTPS URLs
https?://[\w.-]+(?:/[\w./-]*)?Matches:
https://example.comhttp://sub.domain.com/path/to/pagehttps://api.example.com/v1/users
With Query Parameters
https?://[\w.-]+(?:/[\w./-]*)?(?:\?[\w=&%-]*)?Matches:
https://example.com/search?q=regexhttps://api.example.com/users?page=1&limit=10
With Hash/Anchor
https?://[\w.-]+(?:/[\w./-]*)?(?:\?[\w=&%-]*)?(?:#[\w-]*)?Matches:
https://docs.example.com/guide#installationhttps://example.com/page?ref=home#section
Comprehensive URL Pattern
For more complete URL matching including all components:
(?<protocol>https?):\/\/(?<host>[\w.-]+)(?::(?<port>\d+))?(?<path>\/[\w.\/-]*)?(?:\?(?<query>[\w=&%-]*))?(?:#(?<hash>[\w-]*))?Captured groups:
protocol— http or httpshost— Domain nameport— Port number (optional)path— URL path (optional)query— Query string (optional)hash— Fragment/anchor (optional)
Specific URL Types
GitHub Repository URLs
https?://github\.com/(?<owner>[\w.-]+)/(?<repo>[\w.-]+)YouTube Video URLs
(?:https?://)?(?:www\.)?(?:youtube\.com/watch\?v=|youtu\.be/)(?<id>[\w-]{11})Amazon Product URLs
https?://(?:www\.)?amazon\.com/(?:dp|gp/product)/(?<asin>[A-Z0-9]{10})Twitter/X Profile URLs
https?://(?:www\.)?(?:twitter|x)\.com/(?<username>\w+)Extracting Links from HTML
All href Attributes
href=["'](?<url>[^"']+)["']Image Sources
src=["'](?<url>[^"']+\.(?:jpg|jpeg|png|gif|webp))["']URL Validation vs Extraction
Important distinction: Extraction patterns find URLs in text. Validation patterns check if an entire string is a valid URL. Use anchors (^ and $) for validation.
Validation (entire string must be URL)
^https?://[\w.-]+(?:/[\w./-]*)?$Extraction (find URLs in text)
https?://[\w.-]+(?:/[\w./-]*)?Edge Cases
URLs with Special Characters
URLs can contain encoded characters:
https?://[\w.-]+(?:/[\w./%+-]*)?(?:\?[\w=&%+-]*)?Internationalized Domain Names
For IDN support, you'd need Unicode patterns or the u flag:
https?://[\p{L}\p{N}.-]+(Requires u flag)
Localhost and IP Addresses
https?://(?:localhost|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::\d+)?(?:/[\w./-]*)?Practical Implementation
// Extract all URLs from text
function extractUrls(text) {
const pattern = /https?:\/\/[\w.-]+(?:\/[\w.\/-]*)?(?:\?[\w=&%-]*)?/gi;
return text.match(pattern) || [];
}
// Validate a single URL
function isValidUrl(url) {
const pattern = /^https?:\/\/[\w.-]+(?:\/[\w.\/-]*)?(?:\?[\w=&%-]*)?$/i;
return pattern.test(url);
}
// Parse URL components
function parseUrl(url) {
const pattern = /^(?<protocol>https?):\/\/(?<host>[\w.-]+)(?::(?<port>\d+))?(?<path>\/[\w.\/-]*)?/;
const match = url.match(pattern);
return match?.groups || null;
}Common Patterns Cheatsheet
| Type | Pattern |
|---|---|
| Basic HTTP(S) | https?://[\w.-]+ |
| With path | https?://[\w.-]+(?:/[\w./-]*)? |
| With query | https?://[\w.-]+(?:/[\w./-]*)?(?:\?[\w=&%-]*)? |
| Full URL | https?://[\w.-]+(?::\d+)?(?:/[\w./-]*)?(?:\?[\w=&%-]*)?(?:#[\w-]*)? |
| Domain only | (?:[\w-]+\.)+[a-z]{2,} |