Some common SEO and Web development practices can be used quite legitimately, yet still could look like spam to search engines. In today's Web Analytics and ROI column, "Coding Problems: How to Avoid Getting Flagged by Search Engines," Eric Enge explains that the key is to use these techniques for legitimate reasons, use them in moderation, and in ways commonly found on the Web.
Posted by Kevin Newcomb at 12:00 AM | Permalink | Comments (0)
I reported at the Search Engine Roundtable on a thread in our forums that shows how some hosts are inserting links on sites they host, without notifying the web site owner, and doing it via cloaking. Matt Cutts from Google looked deeper into the reported issue in the thread and said that "it looks like this webhost is cloaking." The web hosting company is placing paid links within the content using cloaking techniques.
If you are worried about this for your site, then check the Google index for your site. You can use a Google site command "with a porn phrase such as [site:www.mydomainc.com porn] or [site:www.mydomainc.com sex] and see what comes up."
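For anyone checking more than one domain, the site command check above can be sketched in a few lines of Python. This is only an illustration: the phrase list and the function names `site_queries` and `search_url` are assumptions, not part of any Google tool, and the resulting URLs are meant to be opened by hand in a browser.

```python
from urllib.parse import quote_plus

# Hypothetical list of phrases that injected spam links might contain.
SUSPECT_PHRASES = ["porn", "sex"]


def site_queries(domain, phrases=SUSPECT_PHRASES):
    """Return one 'site:domain phrase' query string per suspect phrase."""
    return [f"site:{domain} {phrase}" for phrase in phrases]


def search_url(query):
    """Build a Google search URL for the query, to be opened manually."""
    return "https://www.google.com/search?q=" + quote_plus(query)
```

Calling `site_queries("www.example.com")` gives the two queries to try; any result that mentions content you never published is worth a closer look.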
Posted by Barry Schwartz at 12:06 PM | Permalink
I covered a DigitalPoint thread which uncovered several domains that were able to rank billions of pages at the top of the Google results within a couple of weeks. The methods deployed to rank the pages seemed to include excessive use of subdomains, cloaking, content theft via scraping, Alexa traffic boosting and blog comment spam. I listed the documented steps here. Some suspect that Google's new URL handling with the Big Daddy update allowed "old school" cloaking to begin working again.
A Threadwatch post shows screen captures of the spam and also has a comment from a Google representative, Adam Lasnik. Adam directly responds to over 5 billion pages of this domain being indexed, saying:
We have noticed that some site: queries are showing bizarre results and it's turned out to be tied to a bad data push. We're fixing it now.
Yes, we are aware of the site command issues (Google's mentioned them itself). That may mean it is far less than 5 billion pages indexed in this case -- but still, plenty of pages got through.
If the site command is the issue or even if it is not, this is still indicative of other substantial problems plaguing Google that are making the rounds on discussion boards and blogs lately.
Posted by Barry Schwartz at 9:09 AM | Permalink
The New York Times has one of the most popular news web sites, but until this year that was largely because of the strength of its brand. After its acquisition of About.com, the Times embarked on an aggressive campaign to make its web site more search friendly, a complex process that's paid off with notable traffic gains for the company. Today's SearchDay article, Getting The New York Times More Search Engine Friendly, takes a look behind the scenes at how the Times and its vice president of enterprise search, Marshall Simmonds, pulled it off.
Posted by Chris Sherman at 5:48 AM | Permalink
Dave Naylor's been doing a tour of European automotive sites and finding others that are doing the doorway page dance that got BMW banned from Google. Meanwhile, there's some concern in the blogosphere about whether people should be worried about Google's spam rules in general. A look at both issues, below.
Dave's found this page over at Porsche Denmark that redirects to the Porsche Denmark home page. Disable JavaScript (use this handy tool for Firefox), and you can see the underlying textual content that's being cloaked.
It's hard to know what exactly is going on, as I don't read Danish. Since you can't get to this page from the Porsche Denmark home page -- and since it redirects to that home page -- it seems designed mainly to capture searchers looking for a particular topic and route them into Porsche. In other words, a classic doorway page operation.
Here's a better example. Look for klassiske porscher on Google, then you get this page, which redirects to the home page. Disable JavaScript, and the redirection stops, showing you the hidden content. A user never sees that. Porsche has no intention for them to see it. They only want Google to see it, to rank the page well and deliver them a user to a completely different page on the site.
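The pattern described above, a page that redirects by script yet still carries text a visitor never sees, can be checked against raw HTML without a browser. The following is a rough sketch under stated assumptions: the redirect patterns are an incomplete, illustrative list, the 50-word threshold is arbitrary, and `looks_like_doorway` is a hypothetical helper, not a test any search engine is known to use.

```python
import re

# Common client-side redirect idioms; illustrative and incomplete.
JS_REDIRECT = re.compile(
    r"(?:window\.|document\.|self\.|top\.)?location(?:\.href)?\s*=\s*['\"]"
    r"|location\.(?:replace|assign)\s*\(",
    re.IGNORECASE,
)
SCRIPT = re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def looks_like_doorway(html, min_words=50):
    """True if the page redirects via script yet still carries body text."""
    has_redirect = bool(JS_REDIRECT.search(html))
    # Drop script blocks first, then the remaining tags, and count words.
    text = TAG.sub(" ", SCRIPT.sub(" ", html))
    return has_redirect and len(text.split()) >= min_words
```

A hit only means the page deserves a manual look with JavaScript disabled; plenty of legitimate pages redirect by script.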
In the comments on Dave's post, David Thulin points to this page at Chevrolet Sweden. Use that tool I mentioned above and disable styles. Now the pretty picture of a Chevy goes away, replaced by hidden text. My Swedish is as good as my Danish -- i.e., I can't read this. But it doesn't seem spammy in terms of repetition. Still, scroll to the bottom, and you'll see links to additional doorway pages. Someone clearly realizes search engines don't like the graphical pages they are feeding out, so they've created a series of doorway pages. That degree of savviness also means they should be aware that search engines generally don't like doorways.
Of course, the entire BMW situation has sparked some interesting pushback in new quarters, people who feel like Google in particular shouldn't be pushing "orthodoxy" or their own results on site designers. Google Orwellian at Publishing 2.0 is one example (I left some comments there), Death Penalty, Investigations? Sounds like the FBI... is another and Google Delists BMW-Germany at Slashdot has some similar comments. Jeremy Zawodny has some pushback of his own on the pushback over here: Google vs. BMW, a sanity check.
I think some of the outcry is mistaken. Google is simply doing what all search engines do, enforcing its own rules on what spam is. That's not anything new or Google specific. Sure, it does warrant examination. Then again, it has also been heavily debated in the past. Not everyone agrees with spam rules, but even those who don't understand that if they do something against the rules, they risk getting tossed out. But perhaps the times are a changing...
For those looking to educate themselves on spam issues, here's a reading list:
Need yet more? The SEO: Cloaking and SEO: Spamming categories of the Search Topics area available to Search Engine Watch members take you back for years with articles on these topics. Plus, becoming a member helps support the site and the creation of content like you're reading right now.
Want to comment or discuss? Please visit our Search Engine Watch Forums thread, Google Removes BMW Germany For Spamming.
Posted by Danny Sullivan at 9:32 AM | Permalink
WebmasterWorld, which banned Google, Yahoo, MSN, Ask Jeeves and other search spiders last month, is now allowing them back in and thus returned to the land of the living, in terms of being listed with search engines.
WebmasterWorld chief Brett Tabke gives more of his rundown on the situation in the site's robots.txt file, which he's now using as a blog. C'mon Brett -- you're posting good stuff in there beyond the whole robots thing. Put the material into proper web pages, if not an actual blog, so we can link to individual items.
Look closely at that file, and you'll see that it seems to still ban all the robots. Now look here at what the robots.txt file tells you is the "real" robots.txt file. That's made real to the major search spiders through this code, which checks to see if a spider is reporting a useragent from any major search engines. If so, then a cloaked robots.txt file is sent to them.
Cloaked! Cloaked! You mean Google and gang are all anti-cloaking but they don't mind this cloaking? Apparently so, and not that surprising. The robots.txt file really isn't designed to be read by humans, though they can. So while technically this is another example of search engines allowing cloaking, it's more a footnote than a big exception as with things like Google Scholar.
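To make the mechanism concrete, here is a minimal sketch of a user agent check that serves two different robots.txt files. It is not WebmasterWorld's actual code: the spider substrings, the two file bodies and the function name `robots_txt_for` are all assumptions made for illustration.

```python
# Hypothetical substrings for the spiders named above; the real list and
# the real file contents served by WebmasterWorld are not known here.
ALLOWED_SPIDERS = ("googlebot", "slurp", "msnbot", "teoma")

BLOCK_ALL = "User-agent: *\nDisallow: /\n"   # what ordinary visitors get
ALLOW_ALL = "User-agent: *\nDisallow:\n"     # what recognized spiders get


def robots_txt_for(user_agent):
    """Pick the robots.txt body based only on the reported user agent."""
    ua = (user_agent or "").lower()
    if any(name in ua for name in ALLOWED_SPIDERS):
        return ALLOW_ALL
    return BLOCK_ALL
```

Because a check like this trusts whatever the request header says, any browser that reports a spider's user agent gets the spider's version of the file.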
Ah, but what about people who might visit WebmasterWorld while pretending to be one of the major spiders? How could you do that? Here, Greg Boser points you at one of many tools that let you do this.
Greg's pointing at that because last week, he found himself blocked from WebmasterWorld after surfing in there as if he was from Google. He wasn't alone in being caught by some detection stuff Brett's set up, and now he and others are back with access, as Greg explains. Found yourself in the same situation? Brett explains here to send a sticky mail to an admin to have access restored. I'm told by a good source that a number of Google folks found themselves locked out as well, because many of them use browsers that report the Google useragent.
What about the entire rogue spider thing? They were ignoring robots.txt in the first place. That's why, as I covered earlier, WebmasterWorld also set up required logins to block the spiders. My understanding is that the major search spiders are being excluded from this requirement, plus referring data is also being used to help prevent some people clicking from the search engines from getting a login request for the first two or three clicks.
WebmasterWorld Back In Google Index? has discussion at WebmasterWorld, WMW - the bots are back has discussion over at Threadwatch, and WebmasterWorld Off Of Google & Others Due To Banning Spiders at our Search Engine Watch Forums has older discussion and is also a place where you can comment on or discuss the latest developments.
Posted by Danny Sullivan at 11:34 AM | Permalink
Rogue host changing customers' websites over at SEO Forum is an interesting read and a warning to watch your hosting service. What PhilC describes there is a hosting company that, unbeknownst to clients, was apparently inserting links at the bottom of client pages to benefit the host. The screenshots here tell the tale much better. Apparently, the tactic was supposed to be stopped but started again.
Moral for anyone? Look at the cached pages you have in the major search engines. They'll show you what the search engine spider saw -- and any links that you might not have realized were cloaked without permission to feed to the spiders.
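If you have saved both the cached copy and the page as your browser sees it, the comparison can be automated. The sketch below uses only Python's standard `html.parser`; the helper names `extract_links` and `links_only_in_cache` are hypothetical, and it assumes you have already obtained the two HTML strings yourself.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(href)


def extract_links(html):
    """Return the set of link targets found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def links_only_in_cache(cached_html, live_html):
    """Links the spider was served that a normal visitor does not see."""
    return sorted(extract_links(cached_html) - extract_links(live_html))
```

An empty list is what you hope to see. Anything else is a link that was fed to the spider but not to you.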
Want to discuss? Visit our forum thread, Obnoxious cloaking scam.
Posted by Danny Sullivan at 8:43 AM | Permalink
On Google Scholar noted that some going to Google from within university campuses were seeing a new Google Scholar link on the Google home page. Google confirms this is the case.
We have been offering Google Scholar as a tab [link] for the .edu domain for a few weeks now. We have expanded this to a larger set of universities. This includes a large number of universities around the world, not just .edu.
In other words, if Google can tell you are coming from within an institution using IP addresses that resolve to an .edu domain, or from a list of universities it chooses to target, then you'll see a new "Scholar" link on the Google home page, as the screenshot shows below:
Thanks to CKP for the screenshot! We asked Google if there was a way for those who wanted to add the Scholar link to the home page to do so if it doesn't show up automatically, but the company didn't respond. We think it would be a good idea.
For that matter, it would be nice if people could pick and choose exactly what links they want on the home page, given that Google offers a variety of search services that aren't normally shown. Perhaps that's something the Google personalized home page launched last month will allow, as it matures.
FYI, Yahoo's pure search page has an edit option just above the search box that lets you add and remove links to many of the company's vertical search services. A9 also allows you to pick-and-choose from hundreds of sources.
Postscript: Gary points out that you can also do something similar on the main Yahoo home page, if you are logged in as a registered user. Look for the very small edit link in the upper right hand corner. That will let you change three of the home page "buttons" to the left of the Yahoo logo to whatever you'd like.
Posted by Danny Sullivan at 11:21 AM | Permalink
Search Marketing Techniques, Deceptive Advertising Laws & Other Laws from Alan Perkins at Search Engine Guide looks at how laws about deceptive advertising might be applied to search marketing. Alan's long argued that cloaking could be considered deceptive advertising, and he tries to build that case here -- the deception being that the search engine itself was being deceived about the real relevancy of a page.
He cites the FTC action over a pagejacking scam in 1999 as one extreme example of deception being found in a legal instance. I agree with that (and my own write-up of that case is here, FTC Steps In To Stop Spamming). Alan does make clear that search spam itself is not necessarily the same as deception from a legal perspective. But he does conclude specifically that cloaking content with the intent of getting a better ranking is deceptive advertising:
So, those search engine spamming techniques that involve delivering the same content to searchers and search engines, such as hidden text or single pixel transparent links, do not constitute deceptive advertising. However, those techniques that involve delivering different content to searchers and search engines constitute deceptive advertising if the intent and result of the technique is a preferable placement.
I completely disagree. First, I don't know that getting organic listings in a search engine would be considered "advertising" under US laws, much less those of other countries. In addition, if what was promised in the search listing is generally the same as what someone gets when they arrive at the page, it's hard to argue consumer deception.
But the search engine itself was deceived! Maybe, but that doesn't mean laws about deceptive advertising were violated. And search engines get deceived about things all the time, including when they naturally fail to index pages properly or assign them a better ranking because the pages themselves are not necessarily search engine friendly.
In fact, that's one reason that Google itself allows approved cloaking, as I've written before. Without allowing this, it can't properly index some content.
It's also why I find the entire argument over cloaking to be so tiresome to the point I may no longer even comment on articles about it in the future. Cloaking is not necessarily spam or misleading, as I wrote at great length in my Ending The Debate Over Cloaking article of Feb. 2003.
If cloaking alone (independent of WHAT is being cloaked) were spam and misleading, then Google wouldn't allow it at all, in any circumstances, nor would Yahoo and others that accept XML feeds allow that form of cloaking. Cloaking is simply a method of feeding content to a search engine. How that content is described to a consumer and what ultimately is delivered when they arrive at a page after reading a listing is where you determine deception.
Did you promise "kids internet games" as with the 1999 pagejacking case and instead deliver up porn? That's deceptive, regardless of whether you cloaked, meta refreshed or whatever. Did you promise games and actually deliver them? Then how you gained the listing isn't likely deceptive from a legal point of view. Deception in getting the ranking will remain the sole jurisdiction of the search engine itself (and more about that in my past Spam Rules Require Effective Spam Police article)
Later, I'll be writing about new page-specific markup that Yahoo is proposing, which was raised at the Indexing Summit we held at SES New York (for some fast details, see our Indexing Summit - SES NYC 05 forum thread with live coverage of that). This markup would allow portions of a page seen by humans to be ignored by spiders -- effectively, a form of cloaking.
There are good reasons for doing it, but if the change comes, it's going to once again move forward the definition of cloaking. More important, it's going to further move forward the fact that search engines are no longer (and haven't been for some time) only comparing pages to each other that have been spidered exactly as seen by humans. They aren't, nor should they, and nor would doing so somehow restore some type of "level playing field" that never existed in the first place.
Want to discuss? Please join our forum thread, Deceptive Advertising in Search Results.
Posted by Danny Sullivan at 7:20 AM | Permalink | TrackBack
No, it's not April Fool's Day. Google has indeed cloaked pages on its own search engine and now banned those pages from its index.
Earlier I posted about Google cloaking pages as spotted on Threadwatch (and updated here). Turns out, Google says it's an accident that happened due to it trying to optimize its internal search engine used by AdWords support people. Nevertheless, the company's now banned its own pages from its own search engine for cloaking.
The move is sort of odd given that Google does allow other people to cloak on its search engine, as my Google & Approved Cloaking and Cloaking By NPR OK At Google stories explain more. Nevertheless, it's a PR move the company probably felt it had to make, lest it be accused of not following guidelines it tells others to follow.
Google's GoogleGuy forum rep provided the explanation early today in this WebmasterWorld thread: Cloaked Pages Targeted at Search Box To Be Removed. Specifically, he said:
Those pages were primarily intended for the Google Search Appliances that do site search on individual help center pages. For example, http://adwords.google.com/support has a search box, and that search is powered by a Google Search Appliance. In order to help the Google Search Appliance find answers to questions, the user support system checked for the user agent of "Googlebot" (the Google Search Appliance uses "Googlebot" as a user agent), and if it found it, it added additional information from the user support database into the title.
The issue is that in addition to being accessed via the internal site-search at each help center, these pages can be accessed by static links via the web. When the web-crawl Googlebot visits, the user support system thinks that it's the Google Search Appliance (the code only checks for "Googlebot") and adds these additional keywords.
That's the background, so let me talk about what we're doing. To be consistent with our guidelines, we're removing these pages from our index. I think the pages are already gone from most of our data centers--a search like [site:google.com/support] didn't return any of these pages when I checked. Once the pages are fully changed, people will have to follow the same procedure that anyone else would (email webmaster at google.com with the subject "Reinclusion request" to explain the situation).
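The bug GoogleGuy describes comes down to a single substring test. Here is a minimal sketch of that logic, assuming a hypothetical `render_title` function; Google's actual support system code is not public, so this only mirrors the behavior as described in the quote.

```python
def render_title(base_title, extra_keywords, user_agent):
    """Return the title text sent for a request with this user agent.

    Mirrors the behavior described in the quote: any user agent that
    contains "Googlebot" gets the extra keywords prepended, so the
    site-search appliance and the public web crawler are treated alike.
    """
    if "Googlebot" in user_agent:
        return ", ".join(extra_keywords) + " " + base_title
    return base_title
```

Since both the appliance and the web crawler report "Googlebot," both receive the keyword-laden title, while a regular browser gets the plain one. Telling them apart would need some other signal than that substring.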
I did follow up with Google on Monday, immediately after I posted my original story on the cloaking. I got a preliminary "we're checking and we'll get back you" message. I'm still waiting on that official response. If it finally comes, I'll let you know.
That preliminary message I received, however, conflicts with what was later posted. On Monday, I was told directly by Google that a quick check of the page in question from a Google IP address and with either a Google user agent or a Googlebot user agent didn't show any cloaking.
In other words, the title of the page displayed to the person at Google, pretending to be Google's web indexing agent on Monday was:
Google AdWords Support: Why do traffic estimates for my Ad Group differ from those given by the standalone tool?
Nevertheless, the title actually recorded by Google in its index was:
traffic estimator, traffic estimates, traffic tool, estimate traffic Google AdWords Support: Why do traffic estimates for my Ad Group differ from those given by the standalone tool?
If it was actually the case that Google's web indexer, Googlebot, accidentally got served these pages, then that preliminary check should have revealed it. (UPDATE: Why it didn't is uncertain, Google says. One likely culprit seems to be that the page content itself had been changed by another Google department when the check was done, as I speculated in the paragraph below).
It could be that the cloaking had stopped by the time the check was done. I do know that the last time I looked at that page as recorded in Google's cache, Google had recorded the cloaked content as of March 7 at 4:54am GMT. That's the time stamp for when Googlebot last indexed the page. The fast check by a Google employee was done at 5:15pm later that day. During the 12 hours from when the spider last visited the page and when the checking was done, someone at Google may have shut off the cloaking.
By the way, GoogleGuy is indeed a real Google employee that you can trust as speaking for Google, even though as I've also written before, comments he makes have been sometimes said to be unofficial in nature.
Confusing? Yep, it is. I've also written before that it's time for the lid to come off GoogleGuy's identity. That's especially so if Google's going to continue releasing official information about controversial topics such as cloaking or nofollow via forums, blog entries and so on in this way. The company needs to finally identify the person behind the nickname, so that the general public doesn't have to wonder if it's really Google talking. I've had reporters ask me in the past how they can know the person is real; John Battelle on his blog wondered the same earlier this year after getting a GoogleGuy comment:
As I understand it from the Google Guy post (and I am not sure this really is a "Google Guy" - when will Google just stop being coy and let actual real people make comments?)
Hopefully, we'll see Google finally identify GoogleGuy so there's no confusion that he does speak for the company. If not, and if we have to keep getting "official" information in this "non-official" way, I'll simply out him myself.
Want to comment or discuss? Please visit our forum thread, Google Caught Cloaking and Keyword Stuffing.
Posted by Danny Sullivan at 9:27 AM | Permalink | TrackBack
Threadwatch has a nice catch that this page from Google on AdWords traffic estimates looks different from the cached version recorded by Google's spider. In particular, the HTML title tag of the page humans see says:
Google AdWords Support: Why do traffic estimates for my Ad Group differ from those given by the standalone tool?
Whereas the title tag of the cached version says:
traffic estimator, traffic estimates, traffic tool, estimate traffic Google AdWords Support: Why do traffic estimates for my Ad Group differ from those given by the standalone tool?
What's going on? The first thought is that Google wants this page to rank well for terms like "traffic estimator" or "traffic estimates" and so has put them in the title tag -- but doesn't want that bad looking title to show up to those reading the page, so it's cloaking it.
To see the ranking impact in real-life, try a search for traffic estimator on Google, and you'll see the US version page in the top results (it's first for me).
It could also be that the title previously said this and has since changed. The cached version of the page is dated March 6 as of 5AM GMT. The difference between the cached copy and the current one was spotted on March 7. It is possible, unlikely but possible, that the page was changed within a day and that the Google spider hasn't yet caught up with it.
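The title comparison above can be done mechanically once you have the HTML of both copies. This is a rough sketch: `page_title` and `extra_title_text` are hypothetical helpers, and the regular expression is a simplification that assumes one well-formed title tag per page.

```python
import re

TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def page_title(html):
    """Return the whitespace-normalized <title> text, or '' if none."""
    match = TITLE.search(html)
    return " ".join(match.group(1).split()) if match else ""


def extra_title_text(cached_html, live_html):
    """Text the cached title carries in front of the live title, if any."""
    cached, live = page_title(cached_html), page_title(live_html)
    if cached != live and live and cached.endswith(live):
        return cached[: -len(live)].strip()
    return ""
```

Run against the two titles quoted above, it would return just the prepended keyword list. On its own that proves a difference, not cloaking, since the page may simply have changed between the crawl and your visit.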
I'm checking with Google to find out what they have to say and will update as I hear. For more, see these Threadwatch posts: Google Caught Cloaking - Keyword Stuffing Titles and the follow-up Are Google Cloaking and Keyword Stuffing?
Posted by Danny Sullivan at 11:21 AM | Permalink | TrackBack
Leigh Dodds provides a great rundown in his Google Scholar piece about one aggregator's experience in getting content prepared for entry into the new Google Scholar service.
My favorite part was this:
The second issue was to ensure that the crawler got the full text so they could work their on the full content rather than just the titles and abstracts. A bit of sleight-of-hand at our end ensured that the crawler got what it needed but with the URLs in the Google index being a suitable entry point for an end user.
That sleight-of-hand is almost certainly cloaking, showing an end user something different than what the crawler saw. Cloaking, of course, is against Google's published policies for webmasters.
As I've covered before in the situations of cloaking allowed at Google for NPR and Google Scholar, this type of cloaking is helpful to searchers. It's good cloaking. I have to stress, Dodds and the other Google Scholar participants are doing nothing wrong. They are working directly with Google, with Google's full approval, in a way that Google rightly feels will help searchers.
Nevertheless, Google's failure to update its policy continues to make it sound hypocritical. Telling general web publishers not to cloak, then having your Google Scholar participants talk about "sleight-of-hand" is a mixed message.
As I blogged earlier, it's long overdue for Google's policy on cloaking to be updated, to eliminate this mixed message. Simple changes like shown in bold below would be enough:
The term "cloaking" is used to describe a website that returns altered webpages to search engines crawling the site without permission. In other words, the webserver is programmed to return different content to Google than it returns to regular users, usually in an attempt to distort search engine rankings. This can mislead users about what they'll find when they click on a search result. To preserve the accuracy and quality of our search results, Google may permanently ban from our index any sites or site authors that engage in cloaking without our permission, if we feel it is harmful to our search rankings.
If you're a Search Engine Watch member, Google & The Approved Cloaking Problem takes an even longer look at the issue, not just about the needed definition change, but also the fact that general web publishers are long overdue for some of the special assistance being given to merchants, book and scholarly publishers by Google.
FYI, I came across the Dodds article via the new On Google Scholar blog, a nice resource for those wanting to track things about the new service.
Posted by Danny Sullivan at 6:38 AM | Permalink | TrackBack
Last May, I wrote about how Google effectively approves of cloaking in the case of content from NPR. The new Google Scholar launch, while good for searchers, leaves the company open to even more hypocrisy over its published policy on cloaking.
My article on Google Scholar touches on this to a limited degree. I've also posted a new article for our Search Engine Watch members that takes a longer look at the issues involved: Google & The Approved Cloaking Problem.
In summary, Google needs to change its cloaking definition to acknowledge that approved cloaking is allowed -- and it definitely needs to move forward with providing better support to ALL web site owners, rather than just some of them.
Posted by Danny Sullivan at 7:53 AM | Permalink | TrackBack
Last week, one of our most energetic forum moderators Nacho Hernandez started a thread called Search Engine Marketing 101. In it, he leads off with a variety of resources useful for those getting started with search engine marketing. Comments and further contributions follow.
Nacho also kicked off a theme. Orion, one of our newest moderators, followed up with Block Analysis 101. That looks at the concept of search engines breaking up a page into "blocks," to better understand which particular content or links within that content should be given greater or less weight.
Member Nick W's now dived in to look at the often controversial issue of cloaking: Cloaking 101 - Questions and Answers. Some previous good threads and debate on this topic include The Great Doorway Debate, How Do I Spot Cloaked Sites?. You might also look over an article I did last year, Ending The Debate Over Cloaking.
Returning to Nacho, he's compiled a great list of Google Sandbox 101-style resources in Sandbox - IN or OUT? The sandbox concept relates to the idea that new pages, new links or new sites might not be allowed to do well in Google until a certain period of time has passed. The Filthy Linking Rich thread touches on this, as well.
Posted by Danny Sullivan at 11:24 AM | Permalink | Comments (0) | TrackBack