Normalize and compare URLs with C#

I'm constantly tweaking the uniqueness detection on elmah.io. It's the feature that identifies whether an error has been logged before, which in turn triggers a range of other features like notifications. A while back, I wanted to add support for identifying GUIDs as part of the URL, and a friend suggested a really simple solution. Hopefully, this will help anyone needing to do something similar.

Before digging into the code, let's set the stage. URLs are used to look up resources on the web. There are many ways of putting URLs together, but common patterns look like this:

https://example.com/posts/42
https://example.com/product/e40b1d0d-dc6b-4d78-9172-d183b3ff3769/view
https://example.com/article?id=42

For uniqueness detection, I'm simplifying URLs to better identify when errors from two different URLs are actually the same error. Consider the following URLs:

https://example.com/product/e40b1d0d-dc6b-4d78-9172-d183b3ff3769/view
https://example.com/product/5bfce00d-9ddd-439a-99b0-918ba7fe55c1/view

While these are two distinct URLs, a human quickly sees that they point to the same page, just with different product IDs. The same goes for:

https://example.com/article?id=42
https://example.com/article?id=43

Let's create a method that will normalize/simplify a URL for better comparison:

public static class StringExtensions
{
    public static string NormalizeUrl(this string url)
    {
        // If empty return empty string
        if (string.IsNullOrWhiteSpace(url)) return string.Empty;
    
        // If url not a valid Uri return empty string
        if (!Uri.TryCreate(url, UriKind.RelativeOrAbsolute, out Uri uri)) return string.Empty;

        // The rest of the method is added step by step below
    }
}

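The first couple of lines verify that the input contains a value and can be parsed as a URI. Note that UriKind.RelativeOrAbsolute is fairly permissive and accepts most strings as relative URIs, so this check mainly guards against empty or badly malformed input.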

For the next part, we want to strip away the query string and fragment, since URLs that differ only in those parts typically reference the same destination anyway:

// Remove any trailing slash and remove content after any ? or #
string result = url.Split('?', '#')[0].TrimEnd('/');

This line splits the URL on the characters ? and # and grabs the first part. If the URL has no query or fragment, we get the original URL back. Finally, I trim any / characters from the end, since variants with and without a trailing / would still be the same URL.
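
To see what that line does in isolation, here's a minimal sketch. The names UrlTrimSketch and StripQueryAndFragment are made up for illustration and aren't part of the code running on elmah.io:

```csharp
using System;

public static class UrlTrimSketch
{
    // Hypothetical helper wrapping the single line from the post:
    // keep everything before the first ? or #, then drop trailing slashes.
    public static string StripQueryAndFragment(string url) =>
        url.Split('?', '#')[0].TrimEnd('/');
}
```

With this helper, https://example.com/article?id=42 becomes https://example.com/article, and https://example.com/posts/42/#comments becomes https://example.com/posts/42.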

To identify integers, GUIDs, and dates inside the path of the URL, we could use regular expressions or write an ugly foreach loop. But luckily, being C# developers we have LINQ to easily write code like this:

// Now replace any part of the URL that is a number, GUID, or date with 0
return string
    .Join("/", result
        .Split('/')
        .Select(part => 
            int.TryParse(part, out _)
            || Guid.TryParse(part, out _)
            || DateTime.TryParse(part, out _) ? "0" : part));

The code splits the remaining part of the URL by / and tries to parse each part as an int, Guid, or DateTime. If a part successfully parses as one of those types, we replace it with a 0. The goal here is only to be able to compare URLs, so the replacement could be anything, including an empty string (that would produce // as part of the path, though). Finally, we join the parts back together using the Join method.
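
If you want to test the per-segment check on its own, it can be pulled out into a small helper. This is only a sketch: SegmentSketch, IsVariableSegment, and ReplaceVariableSegments are hypothetical names. Unlike the code above, it passes CultureInfo.InvariantCulture to the parsers, on the assumption that you want the same result regardless of the server's culture settings (the overloads used above parse with the current culture):

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class SegmentSketch
{
    // Hypothetical helper: true when a path segment looks like an ID or a date.
    // Assumption: invariant culture is used so results don't depend on the machine's settings.
    public static bool IsVariableSegment(string part) =>
        int.TryParse(part, NumberStyles.Integer, CultureInfo.InvariantCulture, out _)
        || Guid.TryParse(part, out _)
        || DateTime.TryParse(part, CultureInfo.InvariantCulture, DateTimeStyles.None, out _);

    // Split on /, replace every variable segment with "0", and join the parts again.
    public static string ReplaceVariableSegments(string url) =>
        string.Join("/", url.Split('/').Select(part => IsVariableSegment(part) ? "0" : part));
}
```

Running ReplaceVariableSegments on https://example.com/product/e40b1d0d-dc6b-4d78-9172-d183b3ff3769/view gives https://example.com/product/0/view, since only the GUID segment matches one of the three parsers.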

And that's the end of the method. The code posted here is a simplified version of the one currently running on elmah.io, but I wanted to keep it simple for this post. The entire method looks like this:

using System;
using System.Linq;

public static class StringExtensions
{
    public static string NormalizeUrl(this string url)
    {
        // If empty return empty string
        if (string.IsNullOrWhiteSpace(url)) return string.Empty;

        // If url not a valid Uri return empty string
        if (!Uri.TryCreate(url, UriKind.RelativeOrAbsolute, out Uri uri)) return string.Empty;

        // Remove any trailing slash and remove content after any ? or #
        string result = url.Split('?', '#')[0].TrimEnd('/');

        // Now replace any part of the URL that is a number, GUID, or date with 0
        return string
            .Join("/", result
                .Split('/')
                .Select(part => 
                    int.TryParse(part, out _)
                    || Guid.TryParse(part, out _)
                    || DateTime.TryParse(part, out _) ? "0" : part));
    }
}

Here's a quick NUnit test that validates all three scenarios:

[TestCase("https://example.com/posts/42", "https://example.com/posts/43")]
[TestCase(
    "https://example.com/product/e40b1d0d-dc6b-4d78-9172-d183b3ff3769/view",
    "https://example.com/product/5bfce00d-9ddd-439a-99b0-918ba7fe55c1/view")]
[TestCase("https://example.com/article?id=42", "https://example.com/article?id=43")]
public void CanNormalize(string url1, string url2)
{
    StringAssert.AreEqualIgnoringCase(url1.NormalizeUrl(), url2.NormalizeUrl());
}

To finish up, I know that this implementation doesn't work on all URLs. A pattern that I sometimes see is this:

https://example.com/product-42

The current implementation won't successfully parse product-42 as an integer, GUID, or date. If you have any suggestions for improving the implementation, feel free to reach out.
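
One possible direction, sketched here under my own assumptions and not what runs on elmah.io, is to also replace a trailing -<digits> suffix inside a segment. SlugSuffixSketch and NormalizeSegment are hypothetical names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SlugSuffixSketch
{
    // Matches a hyphen followed by digits at the end of a segment, e.g. "-42" in "product-42".
    private static readonly Regex TrailingNumber = new Regex(@"-\d+$");

    // Hypothetical helper: normalizes a single path segment.
    public static string NormalizeSegment(string part)
    {
        // Whole-segment checks run first, so GUIDs and dates (which contain hyphens)
        // are replaced as a whole before the regex gets a chance to touch them.
        if (int.TryParse(part, out _) || Guid.TryParse(part, out _) || DateTime.TryParse(part, out _))
            return "0";

        // Otherwise only the trailing "-<digits>" suffix is replaced; the slug itself is kept.
        return TrailingNumber.Replace(part, "-0");
    }
}
```

With this, product-42 and product-43 both become product-0. The trade-off is that slugs where the number is part of the name, like top-10, would be collapsed too, so whether this is an improvement depends on the URLs you see in practice.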
