URL Extraction

Extracting and validating URLs is essential for web scraping, content processing, and security applications. Let's explore patterns from simple to comprehensive.

Basic URL Matching

Simple HTTP/HTTPS URLs

https?://[\w.-]+(?:/[\w./-]*)?

Matches:

  • https://example.com
  • http://sub.domain.com/path/to/page
  • https://api.example.com/v1/users
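
As a minimal sketch of this pattern in use (JavaScript, with a sample string invented for illustration), the g flag lets match() return every URL found in a block of text:

```javascript
// Basic HTTP/HTTPS pattern from above, written as a JavaScript regex literal.
// The g flag makes match() return every occurrence instead of only the first.
const basicUrl = /https?:\/\/[\w.-]+(?:\/[\w.\/-]*)?/g;

// Sample text is made up for this example.
const sampleText =
  "Docs at https://api.example.com/v1/users and http://sub.domain.com/path/to/page today";

// match() returns null when nothing is found, so fall back to an empty array.
const basicMatches = sampleText.match(basicUrl) ?? [];
console.log(basicMatches);
```

Forward slashes are escaped here only because they delimit a JavaScript regex literal; the pattern itself is unchanged.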

With Query Parameters

https?://[\w.-]+(?:/[\w./-]*)?(?:\?[\w=&%-]*)?

Matches:

  • https://example.com/search?q=regex
  • https://api.example.com/users?page=1&limit=10

With Hash/Anchor

https?://[\w.-]+(?:/[\w./-]*)?(?:\?[\w=&%-]*)?(?:#[\w-]*)?

Matches:

  • https://docs.example.com/guide#installation
  • https://example.com/page?ref=home#section
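
To see what each optional group contributes, the sketch below runs the three patterns so far against the same URL. The variable names are chosen for this example only.

```javascript
const sampleUrl = "https://example.com/page?ref=home#section";

// The three patterns from this section, each extending the previous one.
const pathOnly = /https?:\/\/[\w.-]+(?:\/[\w.\/-]*)?/;
const withQuery = /https?:\/\/[\w.-]+(?:\/[\w.\/-]*)?(?:\?[\w=&%-]*)?/;
const withHash = /https?:\/\/[\w.-]+(?:\/[\w.\/-]*)?(?:\?[\w=&%-]*)?(?:#[\w-]*)?/;

// Each pattern stops at the first character its last group cannot consume.
const stages = [pathOnly, withQuery, withHash].map((re) => sampleUrl.match(re)[0]);
console.log(stages);
// [
//   "https://example.com/page",
//   "https://example.com/page?ref=home",
//   "https://example.com/page?ref=home#section"
// ]
```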

Comprehensive URL Pattern

To capture every URL component as a named group in a single pattern:

(?<protocol>https?):\/\/(?<host>[\w.-]+)(?::(?<port>\d+))?(?<path>\/[\w.\/-]*)?(?:\?(?<query>[\w=&%-]*))?(?:#(?<hash>[\w-]*))?

Captured groups:

  • protocol — http or https
  • host — Domain name
  • port — Port number (optional)
  • path — URL path (optional)
  • query — Query string (optional)
  • hash — Fragment/anchor (optional)
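
In JavaScript, named groups are exposed on the match object's groups property. The following sketch uses an invented URL; optional groups that do not take part in the match come back as undefined.

```javascript
// Comprehensive pattern from above with named capture groups.
const fullUrl =
  /(?<protocol>https?):\/\/(?<host>[\w.-]+)(?::(?<port>\d+))?(?<path>\/[\w.\/-]*)?(?:\?(?<query>[\w=&%-]*))?(?:#(?<hash>[\w-]*))?/;

// A URL that exercises every group (made up for this example).
const parts = "https://api.example.com:8443/v1/users?page=1&limit=10#top".match(fullUrl).groups;
console.log(parts.protocol, parts.host, parts.port, parts.path, parts.query, parts.hash);

// A bare URL: the optional groups are undefined rather than empty strings.
const bare = "http://example.com".match(fullUrl).groups;
console.log(bare.port, bare.path); // undefined undefined
```

Note that port is captured as a string; convert it with Number() if you need to compare it numerically.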

Specific URL Types

GitHub Repository URLs

https?://github\.com/(?<owner>[\w.-]+)/(?<repo>[\w.-]+)
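
One thing to watch for: because the repo character class allows dots, a clone URL ending in .git is captured with that suffix. The helper below, whose name parseGithubUrl is hypothetical, is one possible way to handle it.

```javascript
const githubRepo = /https?:\/\/github\.com\/(?<owner>[\w.-]+)\/(?<repo>[\w.-]+)/;

// Hypothetical helper: returns { owner, repo } or null if the URL does not match.
function parseGithubUrl(url) {
  const match = url.match(githubRepo);
  if (!match) return null;
  const { owner, repo } = match.groups;
  // Strip a trailing ".git" that the [\w.-]+ class would otherwise keep.
  return { owner, repo: repo.replace(/\.git$/, "") };
}

console.log(parseGithubUrl("https://github.com/nodejs/node.git"));
console.log(parseGithubUrl("https://github.com/nodejs/node/issues/1"));
```

Both calls yield owner "nodejs" and repo "node", since the repo group stops at the next slash.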

YouTube Video URLs

(?:https?://)?(?:www\.)?(?:youtube\.com/watch\?v=|youtu\.be/)(?<id>[\w-]{11})

Amazon Product URLs

https?://(?:www\.)?amazon\.com/(?:dp|gp/product)/(?<asin>[A-Z0-9]{10})

Twitter/X Profile URLs

https?://(?:www\.)?(?:twitter|x)\.com/(?<username>\w+)
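
The site-specific patterns above can be combined into a small classifier. This is a sketch only: the urlTypes table and classifyUrl function are names invented for this example, and the sample URLs are illustrative.

```javascript
// Each entry pairs a label with one of the site-specific patterns above.
const urlTypes = [
  { type: "github", pattern: /https?:\/\/github\.com\/(?<owner>[\w.-]+)\/(?<repo>[\w.-]+)/ },
  { type: "youtube", pattern: /(?:https?:\/\/)?(?:www\.)?(?:youtube\.com\/watch\?v=|youtu\.be\/)(?<id>[\w-]{11})/ },
  { type: "amazon", pattern: /https?:\/\/(?:www\.)?amazon\.com\/(?:dp|gp\/product)\/(?<asin>[A-Z0-9]{10})/ },
  { type: "twitter", pattern: /https?:\/\/(?:www\.)?(?:twitter|x)\.com\/(?<username>\w+)/ },
];

// Returns the first matching type together with its named groups, or null.
function classifyUrl(url) {
  for (const { type, pattern } of urlTypes) {
    const match = url.match(pattern);
    if (match) return { type, ...match.groups };
  }
  return null;
}

console.log(classifyUrl("https://youtu.be/dQw4w9WgXcQ"));
console.log(classifyUrl("https://x.com/jack"));
```

As written, the Amazon pattern expects /dp/ or /gp/product/ directly after the domain, so product URLs with a title segment before /dp/ would need a looser pattern.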

Extracting Links from HTML

All href Attributes

href=["'](?<url>[^"']+)["']

Image Sources

src=["'](?<url>[^"']+\.(?:jpg|jpeg|png|gif|webp))["']
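
A short sketch of the href pattern applied with matchAll(), using markup invented for illustration. Regex extraction like this is fine for quick scraping of well-formed markup, but it does not handle unquoted attributes or mismatched quotes; an HTML parser is the safer choice for untrusted input.

```javascript
// href pattern from above; the g flag is required by matchAll().
const hrefPattern = /href=["'](?<url>[^"']+)["']/g;

// Sample markup made up for this example.
const html = `
  <a href="https://example.com/docs">Docs</a>
  <a href='/relative/path'>Relative</a>
  <img src="logo.png">
`;

// Collect the named "url" group from every match.
const links = Array.from(html.matchAll(hrefPattern), (m) => m.groups.url);
console.log(links); // ["https://example.com/docs", "/relative/path"]
```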

URL Validation vs Extraction

Important distinction: extraction patterns find URLs inside larger text, while validation patterns check whether an entire string is a URL. Anchor a validation pattern with ^ and $ so that nothing may precede or follow the match.

Validation (entire string must be URL)

^https?://[\w.-]+(?:/[\w./-]*)?$

Extraction (find URLs in text)

https?://[\w.-]+(?:/[\w./-]*)?
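
The difference is easy to see by testing both patterns against the same input. The sketch below also shows a JavaScript-specific pitfall: a regex with the g flag remembers its position between test() calls, so validation patterns are best written without it.

```javascript
const extractPattern = /https?:\/\/[\w.-]+(?:\/[\w.\/-]*)?/;
const validatePattern = /^https?:\/\/[\w.-]+(?:\/[\w.\/-]*)?$/;

const sentence = "see https://example.com/page for details";

const foundInText = extractPattern.test(sentence);          // true: a URL occurs somewhere
const sentenceIsUrl = validatePattern.test(sentence);       // false: extra text around it
const urlIsUrl = validatePattern.test("https://example.com/page"); // true
console.log(foundInText, sentenceIsUrl, urlIsUrl);

// Pitfall: with the g flag, test() resumes from lastIndex on the next call.
const globalPattern = /https?:\/\/[\w.-]+/g;
const firstCall = globalPattern.test("https://example.com");  // true
const secondCall = globalPattern.test("https://example.com"); // false, search started at index 19
console.log(firstCall, secondCall);
```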

Edge Cases

URLs with Special Characters

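URLs can contain percent-encoded characters (such as %20) and plus signs. This variant adds % and + to the path character class and + to the query character class: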

https?://[\w.-]+(?:/[\w./%+-]*)?(?:\?[\w=&%+-]*)?
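
To illustrate why the extra characters matter, this sketch compares the earlier query pattern with the encoded-aware one on an invented URL. The earlier pattern stops at the first % in the path.

```javascript
const plainPattern = /https?:\/\/[\w.-]+(?:\/[\w.\/-]*)?(?:\?[\w=&%-]*)?/;
const encodedAware = /https?:\/\/[\w.-]+(?:\/[\w.\/%+-]*)?(?:\?[\w=&%+-]*)?/;

const encodedUrl = "https://example.com/my%20file?q=a+b";

const plainMatch = encodedUrl.match(plainPattern)[0];
const encodedMatch = encodedUrl.match(encodedAware)[0];
console.log(plainMatch);   // "https://example.com/my"
console.log(encodedMatch); // "https://example.com/my%20file?q=a+b"

// decodeURIComponent turns %20 back into a space (it leaves "+" untouched).
const decoded = decodeURIComponent(encodedMatch);
console.log(decoded);      // "https://example.com/my file?q=a+b"
```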

Internationalized Domain Names

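For IDN support, match letters and digits from any script with Unicode property escapes. In JavaScript these only work when the regex has the u flag:

https?://[\p{L}\p{N}.-]+

(Requires the u flag. Without it, JavaScript reads \p{L} as the literal characters p, {, L, and } rather than as "any letter".)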
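
A brief sketch comparing the ASCII-only host class with the Unicode-aware one. The sample domain is written with a \u escape so the example does not depend on how the source file is encoded.

```javascript
// ASCII-only: \w covers A-Z, a-z, 0-9 and underscore.
const asciiPattern = /https?:\/\/[\w.-]+/;
// Unicode-aware: \p{L} and \p{N} need the u flag.
const idnPattern = /https?:\/\/[\p{L}\p{N}.-]+/u;

// "m\u00fcnchen" is "münchen" with a precomposed u-umlaut.
const idnText = "Visit https://m\u00fcnchen.de today";

const asciiMatch = idnText.match(asciiPattern)[0];
const idnMatch = idnText.match(idnPattern)[0];
console.log(asciiMatch); // "https://m" (stops at the first non-ASCII letter)
console.log(idnMatch);   // "https://münchen.de"
```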

Localhost and IP Addresses

https?://(?:localhost|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::\d+)?(?:/[\w./-]*)?
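
The \d{1,3} groups check only the shape of an address, so 999.999.999.999 would pass. One way to tighten this, sketched below, is to let the regex find the pieces and check the numeric ranges in code. The anchors, the named groups, and the function name isLocalOrIpUrl are additions made for this example.

```javascript
// Same structure as the pattern above, anchored and with named groups added.
const localHostPattern =
  /^https?:\/\/(?<host>localhost|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::(?<port>\d+))?(?:\/[\w.\/-]*)?$/;

// Hypothetical helper: true only for localhost or a well-formed IPv4 URL.
function isLocalOrIpUrl(url) {
  const match = url.match(localHostPattern);
  if (!match) return false;
  const { host, port } = match.groups;
  // Ports are limited to 0-65535.
  if (port !== undefined && Number(port) > 65535) return false;
  if (host === "localhost") return true;
  // Each IPv4 octet must be in the range 0-255.
  return host.split(".").every((octet) => Number(octet) <= 255);
}

console.log(isLocalOrIpUrl("http://localhost:3000/api")); // true
console.log(isLocalOrIpUrl("http://999.1.1.1"));          // false
```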

Practical Implementation

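// Extract all URLs from text.
// Returns an array of matched strings (empty if none are found).
function extractUrls(text) {
  const pattern = /https?:\/\/[\w.-]+(?:\/[\w.\/-]*)?(?:\?[\w=&%-]*)?/gi;
  return text.match(pattern) || [];
}

// Validate a single URL.
// The ^ and $ anchors require the whole string to match.
function isValidUrl(url) {
  const pattern = /^https?:\/\/[\w.-]+(?:\/[\w.\/-]*)?(?:\?[\w=&%-]*)?$/i;
  return pattern.test(url);
}

// Parse URL components.
// Returns the named groups (protocol, host, port, path, query, hash),
// or null if the string does not start with an http(s) URL.
// Optional components that are absent come back as undefined.
function parseUrl(url) {
  const pattern = /^(?<protocol>https?):\/\/(?<host>[\w.-]+)(?::(?<port>\d+))?(?<path>\/[\w.\/-]*)?(?:\?(?<query>[\w=&%-]*))?(?:#(?<hash>[\w-]*))?/;
  const match = url.match(pattern);
  return match?.groups || null;
}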
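
One practical wrinkle when extracting from prose: the host and path character classes both allow dots, so a URL at the end of a sentence is captured together with the full stop. The variant below, with the hypothetical name extractUrlsTrimmed, strips trailing dots after matching. It assumes a URL never legitimately ends in a dot.

```javascript
// Sketch: same extraction pattern as above, with trailing dots removed afterwards.
function extractUrlsTrimmed(text) {
  const pattern = /https?:\/\/[\w.-]+(?:\/[\w.\/-]*)?(?:\?[\w=&%-]*)?/gi;
  const matches = text.match(pattern) || [];
  // Sentence punctuation: drop any run of dots at the very end of each match.
  return matches.map((url) => url.replace(/\.+$/, ""));
}

const prose = "Read https://example.com/guide. Then visit https://example.com.";
const trimmedUrls = extractUrlsTrimmed(prose);
console.log(trimmedUrls); // ["https://example.com/guide", "https://example.com"]
```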

Common Patterns Cheatsheet

Type             Pattern
Basic HTTP(S)    https?://[\w.-]+
With path        https?://[\w.-]+(?:/[\w./-]*)?
With query       https?://[\w.-]+(?:/[\w./-]*)?(?:\?[\w=&%-]*)?
Full URL         https?://[\w.-]+(?::\d+)?(?:/[\w./-]*)?(?:\?[\w=&%-]*)?(?:#[\w-]*)?
Domain only      (?:[\w-]+\.)+[a-z]{2,}
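
Since each row extends the previous one, the patterns can also be assembled from named pieces instead of being written out in full each time. This is a sketch of one possible approach; the piece table and buildPattern function are invented for this example.

```javascript
// Building blocks matching the cheatsheet rows. In a string passed to
// new RegExp, backslashes are doubled and "/" needs no escaping.
const piece = {
  scheme: "https?://",
  host: "[\\w.-]+",
  port: "(?::\\d+)?",
  path: "(?:/[\\w./-]*)?",
  query: "(?:\\?[\\w=&%-]*)?",
  hash: "(?:#[\\w-]*)?",
};

// Hypothetical helper: joins the named pieces in order into one RegExp.
function buildPattern(names, flags = "") {
  return new RegExp(names.map((name) => piece[name]).join(""), flags);
}

const withPathPattern = buildPattern(["scheme", "host", "path"]);
const fullUrlPattern = buildPattern(["scheme", "host", "port", "path", "query", "hash"]);

const target = "http://localhost:3000/a/b?x=1#y";
const withPathResult = withPathPattern.exec(target)[0];
const fullResult = fullUrlPattern.exec(target)[0];
console.log(withPathResult); // "http://localhost" (no port piece, so it stops at ":")
console.log(fullResult);     // "http://localhost:3000/a/b?x=1#y"
```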