The Inside Scoop On URL Canonicalization And Duplicate Content

David Montalvo, Active Web Group, March 2008
About The Author:
David Montalvo is the Web Marketing Strategist at Active Web Group (www.activewebgroup.com). He has achieved over 15,000 top 10 positions for Fortune 500 companies since 1997.

Canonicalization is the process used by search engines to determine the best URL or website address when several different choices exist. This is a common issue that is predominantly related to home page files at the root level. For example, to a web user, the following URLs all appear to be the same:

 

www.yourdomain.com

yourdomain.com

www.yourdomain.com/index.shtml

yourdomain.com/default.asp

 

However, search engines view the above URLs as different pages. Web servers see each URL differently and treat each independently; therefore each of the above URLs could display different content if desired. When a search engine attempts to “canonicalize” a URL, it seeks to choose the best page to represent the website.
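To make the idea concrete, here is a minimal Python sketch of how the four homepage variants above could be collapsed into one preferred address. The function name, the list of default document names, and the choice of the www form as the preferred host are assumptions made for illustration; they are not how any particular search engine works.

```python
from urllib.parse import urlsplit

# Assumed list of root-level default documents; real servers are configured differently.
DEFAULT_DOCUMENTS = {"index.html", "index.htm", "index.shtml", "index.php",
                     "default.asp", "default.aspx"}


def canonical_home_url(url, preferred_host="www.yourdomain.com"):
    """Map homepage variants to one preferred URL (assumed policy: www host, bare "/")."""
    # Add a scheme when the input is a bare host so urlsplit can parse it.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Treat the www and non-www forms of the preferred host as the same site.
    bare = preferred_host[4:] if preferred_host.startswith("www.") else preferred_host
    if host in (bare, "www." + bare):
        host = preferred_host
    path = parts.path or "/"
    # Treat a root-level default document as the root itself.
    if path.lstrip("/").lower() in DEFAULT_DOCUMENTS:
        path = "/"
    return f"http://{host}{path}"


variants = [
    "www.yourdomain.com",
    "yourdomain.com",
    "www.yourdomain.com/index.shtml",
    "yourdomain.com/default.asp",
]
canonical = {canonical_home_url(v) for v in variants}
```

With this policy, all four variants end up as the single address http://www.yourdomain.com/.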

The simplest way to find out whether you have this issue is to search for a distinctive phrase from your homepage. If more than one of your own URLs appears on the search engine result pages, you may want to correct the issue using a 301 redirect on your web server. The 301 redirect tells the search engine which URL you want to be “canonical” and permanently forwards requests to that URL, as seen in the sample below.

 

When a user types this address into their browser:

yourdomain.com

 

The 301 redirect will send the request to the specified address:

www.yourdomain.com
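The decision behind such a redirect can be sketched in a few lines of Python. This is only an illustration, assuming the www form is the preferred host; the function name redirect_for is hypothetical, and on a real site the rule would normally live in the web server configuration rather than in application code.

```python
def redirect_for(host, path="/", preferred_host="www.yourdomain.com"):
    """Return (301, location) when the request host is not the preferred one, else None."""
    # Requests that already use the preferred host need no redirect.
    if host.lower() == preferred_host:
        return None
    # 301 means "Moved Permanently": the same path is served from the preferred host.
    return 301, f"http://{preferred_host}{path}"


moved = redirect_for("yourdomain.com")
already_canonical = redirect_for("www.yourdomain.com")
```

Here moved holds the 301 status together with the preferred address, while already_canonical is None because no redirect is needed.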

 

I’ve seen countless websites displaying the same content on two or more different URLs. Search engines penalize websites that feature duplicate content, which hurts those sites’ organic rankings. Ironically, most website owners aren’t aware of this problem, and those who are often don’t realize that it applies to their own sites.

Duplicate Content Issues

Search engines dislike duplicate content for a few reasons. One is that major search engines such as Google, Yahoo, MSN, and Ask aim to provide searchers with a diverse cross-section of unique content, and duplicate content often results in duplicate listings that impair the searcher’s experience. Another reason is that search engines don’t want to spend the resources (bandwidth) on indexing pages that are very similar.

In some instances, pages containing duplicate content are filtered at the time search engine results are sorted, so there is no guarantee as to which version of a page will appear in results and which won’t. Duplicate content may even hinder some sites and web pages from getting indexed by search engines, and there are some cases in which a search engine crawler will stop indexing all of the pages of a site because it finds too many copies of the same pages under different URLs.
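If you want to check your own site for copies of the same page under different URLs, one rough approach is to fingerprint each page body and group the URLs that share a fingerprint. The Python sketch below assumes you have already collected the page text yourself; the function names and sample pages are hypothetical, and the exact-match comparison is far simpler than whatever search engines actually use.

```python
import hashlib
from collections import defaultdict


def fingerprint(text):
    """Hash the page text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def duplicate_groups(pages):
    """pages maps URL -> page text; return lists of URLs whose text is identical."""
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[fingerprint(body)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical sample data for illustration only.
pages = {
    "http://www.yourdomain.com/": "Welcome to  our store",
    "http://yourdomain.com/index.shtml": "welcome to our store",
    "http://www.yourdomain.com/contact.html": "Contact us",
}
groups = duplicate_groups(pages)
```

In this sample, the two homepage URLs land in the same group and the contact page is left out.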

While content duplication is sometimes used in an attempt to manipulate search engine rankings and garner more website traffic, in most cases it occurs without ill intent on the part of the site owner or webmaster. The following is a list of duplicate content scenarios that could be burdening your site.

Scenario #1: Ecommerce sites that include product descriptions from manufacturers, producers, and publishers

Product distribution websites often use text from the manufacturer or producer of the product as the description for the item on their own pages. Add the product name, creator, manufacturer, writer, or recording artist that also appears on each page, and a considerable amount of content ends up duplicated across pages on entirely different websites. Here are some examples:

http://www.amazon.com/Sony-VGN-TXN15P-B-Notebook-Processor/dp/B000J43MR0

http://www.crowdstorm.com/Sony_VAIO_11_1_Widescreen_Notebook_PC_VGN_TXN15P_B+2973.html

http://www.clearanceclub.com/products/6495-VAIO-VGN-TXN15P-B

http://www.provantage.com/sony-vgntxn15p-b~7SONN0UX.htm

Scenario #2: Printer-friendly pages

Many sites offer “printer friendly” versions of their content on different pages. Without the application of robots.txt disallow statements or meta “noindex” tags on these pages to keep search engines from indexing them, they may be indexed as duplicate content. See these samples:

http://www.constructionbook.com/xq/ASP/productid.5395/qx/printable_view_product.htm

http://www.tigerdirect.com/applications/searchtools/item-details-print.asp?EdpNo=1556143&Sku=H24-PX849%20SB
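As a quick way to confirm that a robots.txt disallow statement covers your printer-friendly pages, you can test sample URLs against the rules with Python's standard urllib.robotparser module. The rules and the /print/ directory below are hypothetical; substitute your own robots.txt content and paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules that keep crawlers out of a /print/ directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def crawlable(url):
    """Return True when the rules above allow any crawler to fetch the URL."""
    return parser.can_fetch("*", url)


regular_page = crawlable("http://www.yourdomain.com/article.html")
print_page = crawlable("http://www.yourdomain.com/print/article.html")
```

With these rules the regular article stays crawlable while its printer-friendly copy is blocked. The alternative mentioned above, a meta “noindex” tag, is placed in the head of each printer-friendly page instead.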

Scenario #3: Websites that create session IDs

A session ID lets you create customized applications for a more personalized user experience, thus increasing the appeal of your website. A visitor to your site would be assigned a unique session ID which is either stored in a cookie on the user side or is propagated in the URL.

Websites with session IDs serve information in their URLs to track visitors as they go through the pages of that site. When search engine crawlers detect this tracking information they may index the same page several times under different URLs. A good example of this is http://www.staples.com.

Search engine guidelines advise you to allow bots or spiders to crawl your sites without session IDs that track their path through the site. While this technique is great for tracking individual user behavior, the access pattern of bots is entirely different. Since bots cannot always decipher URLs that look different but point to the same page, the use of session IDs may result in incomplete indexing of your site.
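To see why session IDs in the URL create duplicates, consider this small Python sketch that removes session parameters from a query string. The parameter names in SESSION_PARAMS are assumptions; check which names your own platform uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names (compared in lowercase); yours may differ.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url):
    """Return the URL with any session parameters removed from its query string."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))


# Two visits to the same page with different (hypothetical) session IDs.
first = strip_session_id("http://www.yourdomain.com/products?id=42&sid=abc123")
second = strip_session_id("http://www.yourdomain.com/products?id=42&sid=xyz789")
```

Both visits reduce to the same address once the session ID is gone, which is the single URL you want a crawler to see.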

Scenario #4: URLs that include multiple data variables

Multiple data variables within a URL can cause bots to crawl and index the same page under different URLs. Here are some examples of sites that show several data variables in their URLs.

http://www.homedepot.com/webapp/wcs/stores/servlet/ProductDisplay?storeId=10051&langId=-1&catalogId=10053&productId=100022126&categoryID=502813

storeId=10051

catalogId=10053

productId=100022126

categoryID=502813

http://www1.macys.com/catalog/index.ognc?CategoryID=30977&PageID=30977*1*24*-1*-1&kw=Hugo%20Boss&LinkType=EverGreen

CategoryID=30977

PageID=30977*1*24*-1*-1

LinkType=EverGreen

It is difficult for a search engine bot or spider to crawl the URLs listed above. If this scenario applies to your website, you may want to rewrite these URLs into shorter, static-looking addresses, for example with the Apache mod_rewrite module.
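The idea behind URL rewriting can be sketched in Python: the visitor and the crawler see a short, clean path, and the server translates it into the long query string internally. The clean /product/ pattern is hypothetical, and the fixed parameter values are simply borrowed from the example URL above. This imitates the concept only; it is not mod_rewrite syntax, and it does not describe how that site actually works.

```python
import re
from urllib.parse import urlencode

# Hypothetical clean URL pattern: /product/<productId>
CLEAN_PRODUCT_PATH = re.compile(r"^/product/(\d+)/?$")

# Assumed fixed values, copied from the example URL for illustration.
FIXED_PARAMS = {"storeId": "10051", "langId": "-1", "catalogId": "10053"}


def rewrite(path):
    """Translate a clean product path into the internal query-string URL."""
    match = CLEAN_PRODUCT_PATH.match(path)
    if match is None:
        # Paths that do not match the pattern are passed through unchanged.
        return path
    params = dict(FIXED_PARAMS, productId=match.group(1))
    return "/webapp/wcs/stores/servlet/ProductDisplay?" + urlencode(params)


internal = rewrite("/product/100022126")
```

The public address stays short and stable, while the application still receives every variable it expects.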

Scenario #5: Pages sharing similar elements

Some websites have elements that are very common from one page to another, such as title, meta descriptions, headings, navigation, and text that is shared sitewide. This can be a problem since bots might consider it to be duplicate content. Beware of this scenario if you own an ecommerce site that includes your brand name and information about that brand in every title on every page of your site. In addition, the use of content management systems that do not allow for distinct meta description tags to be placed on each page of a website can cause a similar dilemma.

Here are two well-known websites that use their brand names on every page:

http://www.barnesandnoble.com

http://www.officemax.com
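A simple audit for this scenario is to list the titles that appear on more than one page of your own site. The Python sketch below assumes you have already gathered each page's title, for example from a crawl or a CMS export; the function name and sample pages are hypothetical.

```python
from collections import defaultdict


def shared_titles(pages):
    """pages maps URL -> title; return {title: [urls]} for titles used on several URLs."""
    by_title = defaultdict(list)
    for url, title in pages.items():
        # Compare titles case-insensitively and ignore surrounding whitespace.
        by_title[title.strip().lower()].append(url)
    return {title: sorted(urls) for title, urls in by_title.items() if len(urls) > 1}


# Hypothetical sample data for illustration only.
pages = {
    "http://www.yourdomain.com/shoes.html": "Your Brand | Quality Goods",
    "http://www.yourdomain.com/hats.html": "Your Brand | Quality Goods ",
    "http://www.yourdomain.com/contact.html": "Contact Your Brand",
}
repeated = shared_titles(pages)
```

Any title that shows up in the result is a candidate for a more specific, page-level rewrite. The same check could be run on meta descriptions.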

These five scenarios represent situations in which search engine crawlers may perceive your website to have duplicate content. Although it is probably inadvertent on your part, you should take steps to resolve these issues to ensure that all of your web pages are properly indexed on the search engines.


