This session was moderated by Chris Sherman and included the following speakers: Mikkel deMib Svendsen, Shari Thurow, Adam Lasnik (Google), Tim Converse (Yahoo) and Jon Glick (Become.com), who is filling in for Anne Kennedy, whose husband unfortunately passed away over the weekend.
Duplicate content has resurfaced as an issue due to the increased use of syndication.
First up is Jon Glick.
What is duplicate content? One common example is multiple domains serving the same home page. Duplicate content is a problem because while search engines want your content, they want unique content. Search engine webmaster guidelines offer specifics.
Example problems:
– Dynamic urls where different urls can access the same content.
– Multiple domains and one web site
Search engines will pick the “canonical” domain and ignore any others. That is the domain name that will be indexed and included in search results. Typically search engines will pick the url that has the most links to it.
What to do: Choose your canonical domain and make sure your links point to that domain. Use 301 redirects to point any other versions of URLs to the canonical domain. Exclude landing pages that are duplicates from getting indexed.
Oftentimes people will use a 302 redirect. If the change is permanent, tell the search engine by using a 301 redirect. 302 redirects are only appropriate for content that will change, such as events.
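The canonical-domain advice above can be sketched in Apache's mod_rewrite. This is a minimal illustration, assuming Apache with mod_rewrite enabled and `www.example.com` as the chosen canonical domain (both names are placeholders):

```apache
# .htaccess — permanently (301) redirect any non-canonical hostname
# to the canonical domain, preserving the requested path.
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

With this in place, requests to `example.com` or any parked variant land on the canonical hostname, so links and indexing consolidate on one domain.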
If you get delisted by a search engine for duplicate content issues, fix the problem and fill out a reinclusion request. Yahoo will actually tell you whether you're banned or not; you can then fix the issues and ask them to re-review your site.
Next up is Shari Thurow who reminds us she is a grad student in information architecture and makes this funny comment: “Excuse me if I have de-factualized my content to the point of not being factually accurate so that it is easier to understand, the search engines can take care of that.”
Search engines don’t want duplicate content in the search results because it causes excessive load on search engine resources. It’s also a usability issue. Users do not respond well to duplicate content in the search results.
Ways that search engines filter out redundant content:
- Content properties – removes boilerplate elements and reviews what’s left
- Linkage properties – Is a press release on the wire service and the press release on a web site duplicate content? No, because the linkage properties are distinctly different, even though the content is the same
- Content evolution – most content does not change. A news site has a high average page-mutation rate; a manufacturing site has a low one
- Host name resolution – If the hostname resolves to the same domain name, one version of the redundant content will be selected as the canonical url
- Shingle comparison – References Andrei Broder. Shingles are word sets. The more shingles two documents share, the more likely one is a duplicate of the other.
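The shingle idea above can be sketched in a few lines of Python. This is an illustration only, not any engine's actual algorithm: a shingle here is a run of w consecutive words (w=4 is an assumed size), and resemblance is measured with Jaccard similarity over the two shingle sets.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles in a text (case-folded)."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """Jaccard similarity of the two documents' shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not (sa | sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Hypothetical example texts:
original  = "Acme Corp today announced the release of its new widget line"
reprint   = "Acme Corp today announced the release of its new widget line for retailers"
unrelated = "Completely different text about something else entirely with no overlap"

print(round(resemblance(original, reprint), 2))    # → 0.8 (near-duplicate)
print(round(resemblance(original, unrelated), 2))  # → 0.0 (no shared shingles)
```

Real engines avoid comparing full shingle sets pairwise; Broder's approach samples a sketch of hashed shingles per document, but the shared-shingles intuition is the same.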
With printer versions of your pages, use robots.txt to exclude that duplicate content.
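For example, if the printer-friendly versions live under a single directory (the `/print/` path below is hypothetical), a short robots.txt rule keeps crawlers away from the duplicates:

```
# robots.txt — keep printer-friendly duplicates out of the index
User-agent: *
Disallow: /print/
```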
Some duplicate content is as a result of copyright infringement. “Hire an attorney” and get them to remove. Use copyscape.com and archive.org as tools to detect duplicate content.
Use your web analytics software to find the versions of your pages with the best conversions and use those as your canonical URLs. Be proactive and don't exploit the search engines.
Next up is Mikkel deMib Svendsen.
With and without www: there may not be big indexing issues, but there are linking issues if you do not redirect one to the other. Session IDs are a problem because they create large numbers of URLs for the same content. Dump all session information in a cookie.
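One way to clean up session-ID URLs that have already leaked into links is a redirect rule. A rough mod_rewrite sketch, assuming the parameter is named `sid` (a placeholder); note this simplified rule drops the entire query string, so a real site would need to preserve any other parameters:

```apache
# .htaccess — 301-redirect URLs carrying a session-id parameter to the
# clean URL; session state should live in a cookie instead.
RewriteEngine On
RewriteCond %{QUERY_STRING} (^|&)sid=[^&]+
RewriteRule ^(.*)$ /$1? [R=301,L]
```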
WordPress: Customize permalink structure. The problem is that WordPress does not block the original version of the url. Solution would be to redirect via 301. WordPress Canonical URL plugin does this for you.
Another problem is using parameters to change the display of a page. This creates the same content under different URLs.
Breadcrumb navigation can be problematic if there are multiple ways to get to the same product. A solution would be to have products in only one place.
Do not leave it to the search engines to find your canonical content. Chances are, they will pick the wrong content.
Adam Lasnik (Google) and Tim Converse (Yahoo) are on the panel for the Q&A.
Q/A
How can you get the Wayback Machine to index sites more frequently?
Shari: It’s not the best search engine, direct contact with Archive.org has not had good results.
If you have different versions of your site for different languages, a country-specific top-level domain is a significant indicator, as is the presence of a local address, in helping the site get indexed by the appropriate language version of the search engine. Other tips include local links and local hosting.
If you get to choose whether you or the search engine picks the content that gets indexed, make sure it's you telling the search engine what to do.