Google Sitemaps has been discussed heavily since Google first announced the new system over the summer. Matt Cutts discussed it but did not demo it at the New Orleans Webmaster World Conference in June. Just before the Webmaster World Conference in Vegas earlier this month, some new features were added, and Matt Cutts actually went into the interface and demonstrated them. It got a few ‘oohs’ and ‘aahs’. If you haven’t seen it, you need to check it out. Very cool stuff. Which makes me ask: where are all the cool toys from Yahoo! and MSN?
The problem with Google Sitemaps is the format Google asks you to submit. It isn’t a text file or an HTML file, but an XML feed. I would estimate that 90% of the webmasters out there don’t know the proper way to compile an XML feed in the format that Google requests. Sad, but true. So, I have been playing around with Google Sitemaps for about 4-5 months now, I have run into quite a few problems, and I have tested all sorts of software packages and scripts offering to build a Google sitemap in the simplest way possible.
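For reference, here is roughly what that XML feed looks like. The URLs and dates below are placeholders, and the exact namespace can vary with the protocol version, so treat this as a sketch of the format rather than something to copy verbatim:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <!-- placeholder URL; loc is the only required tag for each entry -->
    <loc>http://www.example.com/</loc>
    <lastmod>2005-11-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
    <lastmod>2005-10-15</lastmod>
  </url>
</urlset>

Each url entry lists one page on your site; lastmod, changefreq, and priority are optional hints to Google. Hand-building this for a site of any size is painful, which is why the tools below exist.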
Believe me, many were far from simple.
This edition will take you from start to finish and show you the program that tested out the best over the last four months and also carries the best price: free. Now, what I consider to be the best may not be what works best for you. I always advise you to use the product that best suits your needs and your business.
First things first: You need to go to the Google Sitemaps page and log in to your Google Account. If you don’t have one, you will need to create one.
Next, remember that very important word that is listed: Beta. This system is still in Beta Testing and there will be problems, inconsistencies, and other issues. Be patient. If you run into trouble, there is a Google Group dedicated to Google Sitemaps that you can post your issue in.
To create your sitemap, the program that worked the best (note: not perfect) is SOFTplus GSiteCrawler. After the installation, launch the program. In the lower left corner, click the ‘Add’ button and choose ‘Yes’ to run the New Site Wizard. Next, in the Main Address section, add the domain you would like the sitemap generated for. Then name your project; by default, it will use the domain as the name. I highly recommend that you NOT skip the server check. This step helps the program compile the sitemap correctly; other programs that lack this feature often produced incorrect feeds, and Google never verified them.
The next screen is based on the server check. If it detected that you are on a Linux/Unix server, it will check the ‘URLs are case sensitive’ option. There is also a filter to detect session IDs, which really comes in handy for dynamically generated sites, and an area to list the file extensions you use in your web development.
Next, there is an option to upload the sitemap files to your server via FTP after they are generated. I highly recommend this option, as it automates a step you would otherwise have to do by hand. Automation is always good!
The last screen before it starts to create the files will check various issues:
Robots.txt. Believe it or not, there are some domains that exclude Googlebot from accessing the site entirely. This check ensures that there isn’t anything in your robots.txt file that will cause a problem (see the sample robots.txt after this list).
Check for Custom ‘File Not Found’ Error Pages. If you have your domain set up to serve a custom 404 page, Google will not like that and will not verify the sitemap (i.e. not good). You will need to disable this function until you get the sitemap file verified.
If your site is older, Google probably already has the majority of your pages indexed. This check helps the process get moving faster, and Google responds very well to it.
Scan Your website now. Make sure this is checked. This is why you came here in the first place, right?
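As a point of reference for the robots.txt check, a file like the first example below shuts Googlebot out of the entire server, while the second allows crawling and only blocks one directory (the directory name is just a placeholder):

# This blocks Googlebot from everything and will cause problems:
User-agent: Googlebot
Disallow: /

# This is fine; crawlers can reach everything except the one directory:
User-agent: *
Disallow: /private/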
Clicking ‘Finish’ gets the process rolling. The program has six crawlers, and the time it takes to crawl your domain and create the sitemap files depends on the speed of your connection and the number of pages on your site.
Now, when the program gets done, it will post a group of files in a project folder. The Aborted file is a list of all the URLs that the crawlers attempted to crawl but couldn’t find. These could be bad links, old pages, or just a sign that some general housekeeping is in order. The Robots.txt file is a copy of your robots.txt file – nothing more. The other three files have to do with the sitemap itself, and you want to upload all three to the root of your server. The Sitemap.xml is the file that you want to tell Google about. However, if your site is HUGE, then you want to give Google the compressed file (Sitemap.xml.gz) instead. I would suggest using the compressed file if the uncompressed version is over 500 KB.
Once the files are uploaded, go into Google Sitemaps and click the ADD tab. Tell Google where to find the sitemap file and it will go fetch it. Google will then come back and tell you that you have to post an empty HTML file with a long file name so it can verify that you are the manager of the domain. Here is where it can get tricky. Some common issues:
What if I host with a provider that won’t allow me to post a blank file?
Not an issue. Google only says for it to be blank because it is not going to read the contents; it just wants to make sure the file is there. So, enter ‘test’ or ‘abc’ in the file and then post it. Don’t make this harder than it has to be. :-)
I get the error ‘We’ve detected that your 404 (file not found) error page returns a status of 200 (OK) in the header.’ The sitemap never verifies.
The problem is that you have a custom 404 page set in your HTAccess file. Remove this line, allow Google to verify the sitemap file, and then put the 404 error line back in. The reason this happens is that when every request returns a real page (a 200 status), Google cannot verify whether its verification file is actually there.
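On an Apache server, the custom 404 page is typically set with an ErrorDocument line in the HTAccess file; the page name below is just a placeholder. Comment the line out with a leading ‘#’ (or remove it), get the sitemap verified, and then restore it:

The line to look for:
ErrorDocument 404 /notfound.html

Temporarily disabled while Google verifies:
# ErrorDocument 404 /notfound.html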
I get an ‘Our system is currently busy. Please try again in a few minutes.’ error.
If the above error occurs and lasts more than one day, delete the sitemap you have posted, and direct Google to the other version. For example, if I had pointed Google to sitemap.xml, I would point Google to sitemap.xml.gz. This is often all that is needed to overcome this error.
Once the file has been verified, Google will grab the file and will continue to grab it about every 4-6 hours thereafter. You will start to see Googlebot on your site more frequently, spidering pages that it hasn’t spidered in some time. If you make constant changes to your site, I advise you to update your sitemap file every week. The file tells Google when each page was last modified, and Googlebot will hit the files that were recently changed. This is a better and more accurate way to get your changed pages reindexed than simply hoping it happens on its own.
Answers to Common Questions
Question: If my site has been banned from Google, will using Google Sitemaps lift the ban?
Answer: No. You must clean up the site and then do a reinclusion request. We do offer guaranteed reinclusion within 90 days.
Question: Will using Google Sitemaps increase my ranking in the SERPs (Search Engine Results Pages)?
Answer: According to testing, no, it will not.
Question: If I clean up my bad links and errors, will Googlebot index my site better?
Answer: Absolutely. It pays to run a clean ship.
Accuracy of Reporting
I noticed in testing that the reporting is, well, not all that accurate. However, the information it provides really makes housekeeping a lot easier. The ‘top searches’ section is not close to being accurate yet, but it will be great to see it evolve.
The ‘Crawl Stats’ are super cool. Here Google will tell you some vital facts about your domain. First, the pages that were successfully crawled (bar graph); you want this bar full. It will also give you the URLs that are restricted by your robots.txt file, unreachable URLs, timed-out URLs, and not-followed URLs. All great information.
It will also give you PageRank information: the distribution of PageRank across your site (High, Medium, Low and Not Assigned). This is very helpful for gauging your internal linking structure and your overall linking campaign. How effective is it? This will tell you. If the majority of your pages are Low (4 or lower), you have work to do. Mediums are 5-6 and Highs are 7+.
The Page Analysis covers Content and determines the type of pages you have and the encodings used. The Index Stats section gives some basic queries that most advanced SEOs are already familiar with, such as: site:, allinurl:, link:, cache:, info:, and related:.
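If you have not used those query operators before, they are typed straight into the Google search box; for example, with a placeholder domain:

site:www.example.com
allinurl:example.com
link:www.example.com
cache:www.example.com/page.html
info:www.example.com
related:www.example.com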
The Errors tab will show you the URLs that returned errors when Googlebot attempted to access them. The important thing is that these URLs are referenced somewhere on your site. I also see that Google Sitemaps does not currently obey 301 redirects in the HTAccess file: URLs that showed up as errors were 301-redirected to new pages, yet on the next crawl Google still had them listed in the error section. I’m sure this will be corrected soon.
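For reference, the sort of 301 redirect I am talking about is a single line in the HTAccess file on an Apache server; the old and new page names here are just placeholders:

Redirect 301 /old-page.html http://www.example.com/new-page.html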
In the next three days, block out an hour of time, go through this process, and get your Google Sitemap rolling. This is, hands down, the best way to get a site crawled and deeply indexed. I have seen it be effective with new sites as well. It does NOT avoid the Sandbox, but it does allow more pages on a new site to be found than would be found without it.