URL Extraction

Extracting and validating URLs is essential for web scraping, content processing, and security applications. Let's explore patterns from simple to comprehensive.

Basic URL Matching

Simple HTTP/HTTPS URLs

https?://[\w.-]+(?:/[\w./-]*)?

Matches:

https://example.com
http://sub.domain.com/path/to/page
https://api.example.com/v1/users

Try it → (opens in a new tab)

With Query Parameters

https?://[\w.-]+(?:/[\w./-]*)?(?:\?[\w=&%-]*)?

Matches:

https://example.com/search?q=regex
https://api.example.com/users?page=1&limit=10

Try it → (opens in a new tab)

With Hash/Anchor

https?://[\w.-]+(?:/[\w./-]*)?(?:\?[\w=&%-]*)?(?:#[\w-]*)?

Matches:

https://docs.example.com/guide#installation
https://example.com/page?ref=home#section

Comprehensive URL Pattern

For more complete URL matching including all components:

(?<protocol>https?):\/\/(?<host>[\w.-]+)(?::(?<port>\d+))?(?<path>\/[\w.\/-]*)?(?:\?(?<query>[\w=&%-]*))?(?:#(?<hash>[\w-]*))?

Try it → (opens in a new tab)

Captured groups:

protocol — http or https
host — Domain name
port — Port number (optional)
path — URL path (optional)
query — Query string (optional)
hash — Fragment/anchor (optional)

Specific URL Types

GitHub Repository URLs

https?://github\.com/(?<owner>[\w.-]+)/(?<repo>[\w.-]+)

Try it → (opens in a new tab)

YouTube Video URLs

(?:https?://)?(?:www\.)?(?:youtube\.com/watch\?v=|youtu\.be/)(?<id>[\w-]{11})

Try it → (opens in a new tab)

Amazon Product URLs

https?://(?:www\.)?amazon\.com/(?:dp|gp/product)/(?<asin>[A-Z0-9]{10})

Twitter/X Profile URLs

https?://(?:www\.)?(?:twitter|x)\.com/(?<username>\w+)

Extracting Links from HTML

All href Attributes

href=["'](?<url>[^"']+)["']

Try it → (opens in a new tab)

Image Sources

src=["'](?<url>[^"']+\.(?:jpg|jpeg|png|gif|webp))["']

Try it → (opens in a new tab)

URL Validation vs Extraction

Important distinction: Extraction patterns find URLs in text. Validation patterns check if an entire string is a valid URL. Use anchors (^ and $) for validation.

Validation (entire string must be URL)

^https?://[\w.-]+(?:/[\w./-]*)?$

Extraction (find URLs in text)

https?://[\w.-]+(?:/[\w./-]*)?

Edge Cases

URLs with Special Characters

URLs can contain encoded characters:

https?://[\w.-]+(?:/[\w./%+-]*)?(?:\?[\w=&%+-]*)?

Internationalized Domain Names

For IDN support, you'd need Unicode patterns or the u flag:

https?://[\p{L}\p{N}.-]+

(Requires u flag)

Localhost and IP Addresses

https?://(?:localhost|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::\d+)?(?:/[\w./-]*)?

Try it → (opens in a new tab)

Practical Implementation

// Extract all URLs from text
function extractUrls(text) {
  const pattern = /https?:\/\/[\w.-]+(?:\/[\w.\/-]*)?(?:\?[\w=&%-]*)?/gi;
  return text.match(pattern) || [];
}
 
// Validate a single URL
function isValidUrl(url) {
  const pattern = /^https?:\/\/[\w.-]+(?:\/[\w.\/-]*)?(?:\?[\w=&%-]*)?$/i;
  return pattern.test(url);
}
 
// Parse URL components
function parseUrl(url) {
  const pattern = /^(?<protocol>https?):\/\/(?<host>[\w.-]+)(?::(?<port>\d+))?(?<path>\/[\w.\/-]*)?/;
  const match = url.match(pattern);
  return match?.groups || null;
}

Common Patterns Cheatsheet

Type	Pattern
Basic HTTP(S)	`https?://[\w.-]+`
With path	`https?://[\w.-]+(?:/[\w./-]*)?`
With query	`https?://[\w.-]+(?:/[\w./-])?(?:\?[\w=&%-])?`
Full URL	`https?://[\w.-]+(?::\d+)?(?:/[\w./-])?(?:\?[\w=&%-])?(?:#[\w-]*)?`
Domain only	`(?:[\w-]+\.)+[a-z]{2,}`

Log Parsing Code Refactoring