How to Apply URL Normalization Rules in PHP

How do I apply URL normalization rules in PHP?

The PEAR Net_URL2 library looks like it'll do at least part of what you want. It removes dot segments, fixes capitalization, and drops the default port:

require_once 'Net/URL2.php'; // PEAR package Net_URL2
$url = new Net_URL2('HTTP://example.com:80/a/../b/c');
print $url->getNormalizedURL();

emits:

http://example.com/b/c

I doubt there's a general-purpose mechanism for adding trailing slashes to directories, because that requires mapping URLs to directories, which is hard to do generically. Otherwise, the library gets you close.
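If installing PEAR is not an option, the three steps named above (lowercase scheme and host, drop the default port, remove dot segments) can be roughly sketched with the built-in parse_url(). The function name normalizeUrlSketch is hypothetical, the default-port table only covers http and https, and this is not how Net_URL2 itself is implemented:

```php
<?php
// Hypothetical helper, not part of Net_URL2: a rough sketch of the same steps.
function normalizeUrlSketch(string $url): string
{
    $p = parse_url($url);
    if (!is_array($p) || !isset($p['scheme'], $p['host'])) {
        return $url; // leave anything we cannot parse untouched
    }

    $scheme = strtolower($p['scheme']);
    $host   = strtolower($p['host']);

    // Keep the port only when it is not the default for the scheme.
    $defaults = ['http' => 80, 'https' => 443];
    $port = '';
    if (isset($p['port']) && ($defaults[$scheme] ?? null) !== $p['port']) {
        $port = ':' . $p['port'];
    }

    // Remove "." and ".." segments from the (absolute) path.
    $segments = explode('/', $p['path'] ?? '/');
    $last = count($segments) - 1;
    $out = [];
    foreach ($segments as $i => $segment) {
        if ($segment === '.' || $segment === '..') {
            // Never pop the empty segment that stands for the leading slash.
            if ($segment === '..' && count($out) > 1) {
                array_pop($out);
            }
            if ($i === $last) {
                $out[] = ''; // keep a trailing slash, as RFC 3986 does
            }
        } else {
            $out[] = $segment;
        }
    }
    $path = implode('/', $out);

    $userinfo = '';
    if (isset($p['user'])) {
        $userinfo = $p['user'] . (isset($p['pass']) ? ':' . $p['pass'] : '') . '@';
    }

    return $scheme . '://' . $userinfo . $host . $port . $path
        . (isset($p['query']) ? '?' . $p['query'] : '')
        . (isset($p['fragment']) ? '#' . $p['fragment'] : '');
}

echo normalizeUrlSketch('HTTP://example.com:80/a/../b/c'), "\n";
```

For the URL used above this prints http://example.com/b/c. Percent-encoding is deliberately left alone here.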

References:

http://pear.php.net/package/Net_URL2
http://pear.php.net/package/Net_URL2/docs/latest/Net_URL2/Net_URL2.html

PHP normalize remote URLs

Since you asked for "quick," here's a one-liner that does the job:

$url = 'HtTp://User:Pass@www.ExAmPle.com:80/Blah';

echo preg_replace_callback(
  '#(^[a-z]+://)(.+@)?([^/]+)(.*)$#i',
  // create_function() was removed in PHP 8.0, so use a closure instead
  function ($m) {
    return strtolower($m[1]) . $m[2] . strtolower($m[3]) . $m[4];
  },
  $url);

Outputs:

http://User:Pass@www.example.com:80/Blah

EDIT/ADD:

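I've tested, and this version is about 55% faster than using preg_replace_callback with an anonymous function. Note that it relies on the /e modifier, which was deprecated in PHP 5.5 and removed in PHP 7.0, so it only runs on older PHP versions: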

echo preg_replace(
  '#(^[a-z]+://)(.+@)?([^/]+)(.*)$#ei',
  "strtolower('\\1').'\\2'.strtolower('\\3').'\\4'",
  $url);

PHP normalization of URL to identical form

You can use the glenscott/url-normalizer package to normalize URLs in compliance with RFC 3986. The following simple example shows the result of normalization:

$urls = [
    'http://example.com/~smith/home.html',
    'http://example.com:80/~smith/home.html',
    'http://EXAMPLE.com/%7Esmith/home.html',
    'http://EXAMPLE.COM/%7esmith/home.html',
    'https://example.com:443/~smith/home.html'
];

foreach ($urls as $url) {
    $normalizer = new URL\Normalizer($url);
    echo $normalizer->normalize(), "<br>";
}

The result:

http://example.com/~smith/home.html
http://example.com/~smith/home.html
http://example.com/~smith/home.html
http://example.com/~smith/home.html
https://example.com/~smith/home.html
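The %7E and %7e cases above collapse to ~ because RFC 3986 treats percent-encoded unreserved characters as equivalent to their literal form. A minimal sketch of that one rule follows; normalizePercentEncoding is a hypothetical name, and this is not the package's own code:

```php
<?php
// Hypothetical helper illustrating RFC 3986 percent-encoding normalization.
function normalizePercentEncoding(string $s): string
{
    $result = preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        function (array $m): string {
            $char = chr(hexdec($m[1]));
            // Unreserved characters (ALPHA / DIGIT / "-" / "." / "_" / "~")
            // never need escaping, so decode them.
            if (preg_match('/^[A-Za-z0-9._~-]$/D', $char)) {
                return $char;
            }
            // Everything else stays escaped, with uppercase hex digits.
            return '%' . strtoupper($m[1]);
        },
        $s
    );

    return $result ?? $s; // fall back to the input if PCRE reports an error
}

echo normalizePercentEncoding('/%7esmith/a%2fb.html'), "\n";
```

Reserved characters such as %2F are kept escaped on purpose: decoding them would change the meaning of the URL.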

URL availability from MySQL Database using PHP

Before you implement a solution, I think you should flesh out your policy more thoroughly in the abstract. Many parts of a URL may or may not be equivalent. Do you want to treat protocols as equivalent (https://foo.com vs http://foo.com)? Some subdomains might be aliases and some might not: http://www.foo.com vs http://foo.com, or http://site1.foo.com vs http://foo.com. What about the path of the URL: http://foo.com vs http://foo.com/index.php? I wouldn't spend time writing a comparison function until you've completely thought through your policy. Good luck!

UPDATE:

Something like this perhaps:

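// Assumes $siteurl holds a bare host name such as "www.foo.com", not a full URL.
$ignore_subdomains = array('www', 'web', 'site');
$domain_parts = explode('.', strtolower($siteurl));
$subdomain = array_shift($domain_parts);
// Strip the prefix only if a real domain (at least two labels) remains,
// so that a host like "site.com" is not reduced to "com".
if (in_array($subdomain, $ignore_subdomains, true) && count($domain_parts) >= 2) {
    $siteurl = implode('.', $domain_parts);
}
// now run your DB comparison query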
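The snippet above works on a bare host name. A sketch that also accepts full URLs could pull the host out with parse_url() first. The function name comparableHost is hypothetical, and the requirement that at least two labels remain is an added assumption so that a host like site.com is not reduced to com:

```php
<?php
// Hypothetical helper; the default ignore list is the one used above.
function comparableHost(string $siteurl, array $ignore = ['www', 'web', 'site']): string
{
    // parse_url() returns the host only when a scheme is present;
    // otherwise treat the whole input as a bare host name.
    $host = parse_url($siteurl, PHP_URL_HOST);
    if (!is_string($host)) {
        $host = $siteurl;
    }

    $parts = explode('.', strtolower($host));

    // Strip an ignorable prefix only if a two-label domain is left over.
    if (count($parts) > 2 && in_array($parts[0], $ignore, true)) {
        array_shift($parts);
    }

    return implode('.', $parts);
}

echo comparableHost('http://WWW.Foo.com/index.php'), "\n";
```

The returned value is what you would store and compare in the database, whatever policy you settle on for the ignore list.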

Filtering links based on whether already seen

The short answer is no, there's no straightforward way to do that. Have a read of this article about URL normalization to find out some of the reasons why it is hard to accomplish.
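A partial filter is still possible if you accept that it only catches purely syntactic duplicates. The sketch below keys a "seen" set on a small canonical form (lowercased scheme and host, empty path treated as /, fragment dropped). Both function names are hypothetical, and which rules are safe for your links is a policy decision this code does not make for you:

```php
<?php
// Hypothetical helper: reduce a URL to a small canonical key.
function seenKey(string $url): string
{
    $p = parse_url(trim($url));
    if (!is_array($p) || !isset($p['host'])) {
        return $url; // unparseable or relative: use it verbatim
    }

    $scheme = strtolower($p['scheme'] ?? 'http');
    $host   = strtolower($p['host']);
    $port   = isset($p['port']) ? ':' . $p['port'] : '';
    $path   = $p['path'] ?? '/';
    $query  = isset($p['query']) ? '?' . $p['query'] : '';

    // The fragment is dropped: it is never sent to the server.
    return $scheme . '://' . $host . $port . $path . $query;
}

// Hypothetical helper: return only links whose key has not been seen yet.
function filterUnseen(array $links, array &$seen): array
{
    $fresh = [];
    foreach ($links as $link) {
        $key = seenKey($link);
        if (!isset($seen[$key])) {
            $seen[$key] = true; // array keys act as a set
            $fresh[] = $link;
        }
    }
    return $fresh;
}

$seen = [];
print_r(filterUnseen(
    ['http://Example.com', 'HTTP://example.com/#top', 'http://example.com/a'],
    $seen
));
```

Here the second link is filtered out because it produces the same key as the first. Two different URLs that serve the same page (for example / and /index.php) are still treated as distinct.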

How to apply normalization on mysql using php

Ok, couple of things:

PHP has nothing to do with this. Normalization is about modelling data.
Normalization is not about saving disk space. It is about organizing data so that it is easily maintainable, which in turn is a way to maintain data integrity.
Normalization is typically described in a few stages or 'normal forms'. In practice, people who design relational databases often intuitively 'get it right' most of the time. But it is still good to be aware of the normal forms and their characteristics. There is a lot of documentation on that on the internet (e.g. http://en.wikipedia.org/wiki/Database_normalization), and you should certainly do your own research, but the most important stages are:

Unnormalized data: in this stage, data is not truly tabular ('relational'). There is a lot of discussion about what tabular really means, and experts disagree with one another, but most people agree that data is unnormalized if there are multi-valued attributes (columns that can contain a list of values in a single row) or repeating groups (multiple columns or groups of columns that store the same type of data).

Example of multi-valued column: person (first_name, last_name, phonenumbers)
Here, phonenumbers implies that several phone numbers could be stored in one column.

Example of repeating group: person(first_name, last_name, child1_first_name, child1_birth_date, child2_first_name, child2_birth_date..., childN_first_name, childN_birth_date)
Here, the person table has a number of column pairs (child_first_name, child_birth_date) to store the person's children.

Note that something like order (shipping_address, billing_address) is not a repeating group: the billing and shipping addresses may be similar pieces of data, but each has its own distinct role and represents a different aspect of the order. child1 through childN do not: children have no specific roles, and the list of children is variable (you never know in advance how many groups to reserve).

In both cases, multi-valued columns and repeating groups, you basically have a "nested table" structure: a table within a table. Data is said to be in 1NF (first normal form) if neither of these occurs.

1NF is about structural characteristics: the tabular form of the data. All subsequent normal forms have to do with eliminating redundancy. Redundancy occurs when the same information is independently stored multiple times. Redundancy is bad: if you want to change some fact, you have to change it in multiple places. If you forget to change one of them, you have inconsistent data: the data contradicts itself.

There are a number of steps that can eliminate redundancy, each leading to a higher normal form, all the way from 1NF up to 6NF. However, most databases are adequately normalized at 3NF (or a slight variation of it called Boyce-Codd normal form, BCNF). You should study 2NF and 3NF, but the principle is very simple: a table is adequately normalized if:

the table is in 1NF
the table has a key (a column or column combination whose values are required, and which uniquely identifies a row - ie. there can be only one row having that combination of values in the key columns)
there are no functional dependencies between the non-key columns
non-key columns are not functionally dependent upon part of the key (but are completely functionally dependent upon the entire key).

Functional dependency means that a column's value can be derived from another column. A simple example:

order_item (order_id, item_number, customer_id, product_code, product_description, amount)

Let's assume (order_id, item_number) is the key. product_code and product_description are functionally dependent upon each other: for one particular product_code, you will always find the same product description (as if product_description were a function of product_code). The problem is this: if the description changes for a particular product code, you have to change all order items that use that product_code. Forget only one and you have an inconsistent database.

The way to solve it is to create a new product table with (product_code, product_description), having (product_code) as key. Then, instead of storing all product fields in order_item, store only a reference to a row in the product table (in this case, order_item should keep only product_code, which is sufficient to look up a row in the product table and find the product_description).

So as you can see, with this solution you do actually save space (by not storing the product description in every order_item that orders the product), and you do get more tables (product is split off from order_item). But remember that the point is not saving disk space: it is eliminating redundancy, which makes the data easier to maintain, because now you only have to change one row in the product table to change the description.
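Although normalization is a modelling question rather than a PHP one, the split can be illustrated with plain PHP arrays standing in for tables. The rows and product names below are made up, customer_id is left out for brevity, and the helper describeItem is hypothetical:

```php
<?php
// Made-up rows in the redundant order_item shape from the example.
$orderItems = [
    ['order_id' => 1, 'item_number' => 1, 'product_code' => 'P1',
     'product_description' => 'Red mug', 'amount' => 2],
    ['order_id' => 2, 'item_number' => 1, 'product_code' => 'P1',
     'product_description' => 'Red mug', 'amount' => 1],
    ['order_id' => 2, 'item_number' => 2, 'product_code' => 'P2',
     'product_description' => 'Blue plate', 'amount' => 4],
];

// Split off a product "table" keyed by product_code ...
$products = [];
foreach ($orderItems as $i => $row) {
    $products[$row['product_code']] = [
        'product_description' => $row['product_description'],
    ];
    // ... and keep only the reference (product_code) in order_item.
    unset($orderItems[$i]['product_description']);
}

// Changing a description is now a single update in one place.
$products['P1']['product_description'] = 'Crimson mug';

// Hypothetical helper: "join" an order item back to its product.
function describeItem(array $item, array $products): string
{
    $description = $products[$item['product_code']]['product_description'];
    return $item['amount'] . ' x ' . $description;
}

foreach ($orderItems as $item) {
    echo describeItem($item, $products), "\n";
}
```

Every order item that references P1 picks up the new description, because the description is stored exactly once.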

Normalizing Unicode according to the W3C in PHP

It is up to you to decide, on the basis of the purpose and nature of your application, whether to apply normalization when reading user input, when storing it to a database, when writing it out, or not at all. To summarize the long thread mentioned in the comments to the question, also available in the official list archive at http://validator.w3.org/feedback.html:

The warning message comes from the experimental “HTML5 validation” (which is really a linter, applying subjective rules in addition to some formal tests).
The message is not based on any requirement in HTML5 drafts but on opinions on what might cause problems in some software.
The opinions originally made “HTML5 validation” issue an error message, now a warning.

It is certainly possible, though uncommon, to get unnormalized data as user input. This does not depend on normalization carried out by browsers (they don’t do such things, though they conceivably might in the future) but on input methods and habits. For example, methods of typing the letter ü (u umlaut, or u with diaeresis) tend to produce the character in precomposed form, as normalized. People can produce it as unnormalized, in decomposed form, as letter u followed by combining diaeresis, but they usually have no reason to do so, and most people wouldn’t even know how to do that.

If you do string comparisons in your software, they may or may not (depending on comparison routines used) treat e.g. a precomposed ü as equal to the decomposed presentation. Simple implementations treat them as different, as they are definitely distinct at the simple character level (Unicode code points).
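The difference can be seen directly in PHP. This sketch assumes the intl extension is installed, since that is what provides the Normalizer class, and a PHP version that supports the "\u{...}" escape (7.0 or later):

```php
<?php
// Requires the intl extension (Normalizer class).
$precomposed = "\u{00FC}";  // ü as a single code point
$decomposed  = "u\u{0308}"; // u followed by COMBINING DIAERESIS

// A plain comparison works on bytes, so the two forms differ.
var_dump($precomposed === $decomposed); // bool(false)

// After normalizing both to NFC they compare as equal.
$a = Normalizer::normalize($precomposed, Normalizer::FORM_C);
$b = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($a === $b); // bool(true)
```

Whether you normalize before comparing, before storing, or not at all remains the application-level decision described above.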

One reason to normalize at some point, in the writing phase at the latest, is that precomposed characters generally get displayed more reliably. To present a normalized ü, a program just has to pick up a glyph from a font. To present a decomposed ü, a program must either recognize it as canonically equivalent to the normalized ü or write the letter u with a diaeresis symbol properly placed above it, with due attention to the graphic properties of the glyph for u, and many programs fail in this.

On the other hand, in the rare cases where unnormalized data is received as user input, the user may well have a reason to have produced it. He may have the idea that normalized ü and unnormalized ü are distinct and need to be treated as such.

Saving GET parameters from an HTACCESS rule

Change the condition to

RewriteCond %{THE_REQUEST} ^GET[^?]+index\.php [NC]

The above would work for all index.php files anywhere on your site. If you want to redirect for a particular file only, like /index.php, you can use:

RewriteCond %{THE_REQUEST} ^GET\ /index\.php [NC]
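To see which requests each condition would catch, the two patterns can be tried out in PHP, which also uses PCRE. The request lines below are made up, and the "i" flag stands in for [NC]. This is only a way to experiment with the patterns, not a replacement for testing the rule in Apache:

```php
<?php
// Made-up request lines, in the shape Apache exposes as %{THE_REQUEST}.
$requests = [
    'GET /index.php?page=2 HTTP/1.1',
    'GET /blog/index.php HTTP/1.1',
    'GET /page?next=index.php HTTP/1.1',
];

// PHP versions of the two conditions.
$anyIndex  = '~^GET[^?]+index\.php~i';  // index.php anywhere before the "?"
$rootIndex = '~^GET\ /index\.php~i';    // only /index.php at the site root

foreach ($requests as $line) {
    printf(
        "%-36s any=%d root=%d\n",
        $line,
        preg_match($anyIndex, $line),
        preg_match($rootIndex, $line)
    );
}
```

The [^?]+ part is what keeps the first pattern from matching an index.php that only appears in the query string, as in the third request.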