Web Page Extraction Tool Explanations

Do you have 'dead' links (links that are unreachable on the Web)?

Unreachable links are annoying to visitors. Links are often unreachable because the file or page isn't on the Web server: perhaps the file was never uploaded, it is in the wrong place, or its name is misspelled. Some Web servers, for example those running Linux, treat file names as case sensitive (img.jpg is different from img.JPG!). I test on a Windows OS, but I usually work with Linux servers and get bitten by the case-sensitive nature of Linux. Image files are the main culprits; most cameras use upper case in the file names, but image manipulation software may use lower case for the image extension (.jpg, .png).
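The case problem can be caught before uploading. The following is a minimal sketch in Python, not part of our tool; the function name find_case_mismatches and the sample file names are made up for illustration. It compares the file names a page refers to against the names actually on the server and reports the ones that differ only by letter case.

```python
def find_case_mismatches(referenced, on_server):
    """Return (referenced, actual) pairs that differ only by letter case."""
    exact = set(on_server)
    # Map each lower-cased name to the real name on the server.
    by_lower = {name.lower(): name for name in on_server}
    mismatches = []
    for ref in referenced:
        if ref in exact:
            continue  # exact match: works on any server
        actual = by_lower.get(ref.lower())
        if actual is not None:
            # Works on Windows, breaks on a case-sensitive server.
            mismatches.append((ref, actual))
    return mismatches


print(find_case_mismatches(["img.jpg", "logo.png"], ["img.JPG", "logo.png"]))
# prints [('img.jpg', 'img.JPG')]
```

A reference that has no match at all, in any case, is simply a missing file and is not reported by this sketch.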

Some unreachable links are not visible to the user, namely script files and stylesheets. A missing script file can break page functionality; for instance, a slideshow might not work because its script file can't be found. A missing stylesheet can cause a Web page to display incorrectly, for example with links in the wrong color.

One way to find these problems is to open the page in a browser and click every link, looking for missing images or problems in the way the page displays. This process can be very slow and painstaking. The easy way to find missing links is to use our Web page extraction tool.
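To give an idea of what an automated check involves, here is a rough sketch in Python using only the standard library. It is not how our tool is implemented; the names LinkCollector, extract_links and is_reachable are hypothetical. It collects every href and src on a page (which covers anchors, images, scripts and stylesheets) and asks the server whether each one exists.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request


class LinkCollector(HTMLParser):
    """Collect URLs from href and src attributes (a, link, script, img, ...)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


def is_reachable(url, timeout=10):
    """Return True if the server answers a HEAD request without an error."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # Covers HTTP errors such as 404, network failures and malformed URLs.
        return False


page = '<link href="style.css" rel="stylesheet"><img src="/img/photo.jpg">'
print(extract_links(page, "http://example.com/dir/index.html"))
# prints ['http://example.com/dir/style.css', 'http://example.com/img/photo.jpg']
```

Calling is_reachable on each extracted link gives the list of dead ones. Some servers refuse HEAD requests, so a fuller version would probably fall back to GET before declaring a link dead.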

Does the page have a title and meta description tag?

The title and meta description tags are in the <head> part of the HTML file. This part of the HTML file is usually invisible to the viewer; however, browsers use the title as the tab title.

Although mostly invisible to the visitor, these tags are not invisible to search engines. Search engines use the title and meta description when ranking the Web page and when displaying it in the search results. If the page is missing either of these, the search engines will select some text on the page to display in the results.

Google's search engine results page (SERP) uses approximately the following limits; longer text is truncated in the display:

  • Title: 69 characters
  • Meta description: 156 characters
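A check against these limits could look like the following Python sketch. The names HeadInspector and check_head are hypothetical, and the two constants simply restate the limits listed above.

```python
from html.parser import HTMLParser

TITLE_LIMIT = 69          # characters, from the list above
DESCRIPTION_LIMIT = 156   # characters, from the list above


class HeadInspector(HTMLParser):
    """Pick out the <title> text and the meta description content."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
            self.title = ""
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def check_head(html):
    """Return a list of problems; an empty list means both tags look fine."""
    inspector = HeadInspector()
    inspector.feed(html)
    inspector.close()
    problems = []
    for label, text, limit in (
        ("title", inspector.title, TITLE_LIMIT),
        ("meta description", inspector.description, DESCRIPTION_LIMIT),
    ):
        text = (text or "").strip()
        if not text:
            problems.append(f"missing {label}")
        elif len(text) > limit:
            problems.append(f"{label} is {len(text)} characters (limit {limit})")
    return problems


print(check_head("<html><head></head><body>Hi</body></html>"))
# prints ['missing title', 'missing meta description']
```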

Do you have Google Analytics installed on the page?

Google Analytics (GA) is an advanced 'hit' counter. It goes beyond an ordinary hit counter by reporting, for example, where visitors come from and which pages they view. I highly recommend installing GA code on all Web pages.

Does the page have inline scripts and CSS?

Inline scripts use the <script> tag without linking to a file; the script is embedded in the HTML for the Web page. When using a content-management system, this is the easiest way to go. Most pages will need at least one inline script simply to call into the linked scripts. This inline script is very useful for website maintenance: by looking at the source for the Web page, the maintainer can quickly trace down into the linked script files.

The Web page may also contain gadgets or the Google Analytics code. Gadgets are independent pieces of code that can be placed on a Web page; some link to a script file on a different server but have a small inline script to initialize the gadget. Small pieces of code add very little overhead when downloading the page and usually don't need to be cached by the browser.

When there are large blocks of inline JavaScript or CSS, consider moving them to linked files instead. Browsers can cache JavaScript and CSS files, so they can retrieve the cached files instead of downloading them each time.
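To judge whether a page has 'large blocks' of inline code, it helps to measure them. Below is a small Python sketch (the names InlineCodeMeter and inline_code_sizes are hypothetical) that adds up the characters inside <script> tags that have no src attribute and inside <style> tags. It does not count style attributes on individual elements.

```python
from html.parser import HTMLParser


class InlineCodeMeter(HTMLParser):
    """Add up the characters of inline <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.sizes = {"script": 0, "style": 0}
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._current = "style"
        elif tag == "script" and not dict(attrs).get("src"):
            # A script with a src attribute is linked, not inline.
            self._current = "script"

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.sizes[self._current] += len(data)


def inline_code_sizes(html):
    meter = InlineCodeMeter()
    meter.feed(html)
    meter.close()
    return meter.sizes


page = ('<script src="app.js"></script><script>init();</script>'
        '<style>a{color:red}</style><p>text</p>')
print(inline_code_sizes(page))
# prints {'script': 7, 'style': 12}
```

What counts as too large is a judgment call; a one-line call such as init() is exactly the kind of small inline script described above.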

Do your images have the 'alt' attribute set?

The <img> tag has src and alt attributes. The src attribute points to the image file, which should be named something other than DSC2542.JPG (straight from a camera). Take a little time and name the image file with a description of the image. The alt attribute is displayed when the link to the image file is broken and can be used by search engines to 'understand' the content of the image. In HTML email, many email clients block remote images by default and display only the alt text.
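Finding images without alt text is easy to automate. Here is a minimal Python sketch; the names AltChecker and images_missing_alt are hypothetical. It treats an empty alt attribute the same as a missing one, which is an assumption on my part, since an empty alt is sometimes used on purpose for purely decorative images.

```python
from html.parser import HTMLParser


class AltChecker(HTMLParser):
    """Record the src of every <img> whose alt is missing or blank."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        alt = attrs.get("alt")
        if alt is None or not alt.strip():
            self.missing.append(attrs.get("src") or "(no src)")


def images_missing_alt(html):
    checker = AltChecker()
    checker.feed(html)
    checker.close()
    return checker.missing


page = ('<img src="DSC2542.JPG">'
        '<img src="sunset.jpg" alt="Sunset over the bay">'
        '<img src="spacer.gif" alt="">')
print(images_missing_alt(page))
# prints ['DSC2542.JPG', 'spacer.gif']
```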

Are you using descriptive anchor text?

The <a> tag (anchor tag) is the way visitors navigate around a site. The anchor tag has an href attribute and anchor text. For example:

<a href='mypage.html'>click here</a>
links to the Web page mypage.html. When a visitor clicks the 'click here' text, the browser will load mypage.html. The phrase 'click here' in the example is called the anchor text. 'Click here' is adequate for a visitor who can use the surrounding text to determine the content on mypage.html, but we also want to make it easier for search engines to determine the purpose of mypage.html. Instead of using the phrase 'click here,' use more descriptive anchor text, like 'Web page extraction tool.'
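Generic anchor text can be flagged automatically. The Python sketch below is only an illustration: the names AnchorTextChecker and generic_anchors are hypothetical, and the phrase list is a short made-up sample that you would extend for your own site.

```python
from html.parser import HTMLParser

# Hypothetical sample list of anchor text that says nothing about the target.
GENERIC_PHRASES = {"click here", "here", "more", "read more", "link"}


class AnchorTextChecker(HTMLParser):
    """Collect (href, anchor text) pairs for every <a> tag."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._text).split())
            self.anchors.append((self._href, text))
            self._href = None


def generic_anchors(html):
    checker = AnchorTextChecker()
    checker.feed(html)
    checker.close()
    return [(href, text) for href, text in checker.anchors
            if text.lower() in GENERIC_PHRASES]


page = ("<a href='mypage.html'>click here</a> "
        "<a href='tool.html'>Web page extraction tool</a>")
print(generic_anchors(page))
# prints [('mypage.html', 'click here')]
```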

Are you using compression (gzip) to serve the page and all the linked files?

gzip is the standard compression used by Web servers to serve compressed files. Most browsers can accept gzip encoding. gzip can compress text files by up to 80%. A Web page is made up of HTML, JavaScript files, stylesheet files, images, videos, etc. HTML, JavaScript and stylesheet files are all text files and compress really well. A Web page that is 16,000 bytes may compress to 5,000 bytes.
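You can estimate the savings for any file yourself. The Python sketch below uses the standard gzip module; the function name gzip_savings is hypothetical, and the result will not exactly match what a particular Web server produces, since servers may use different compression levels.

```python
import gzip


def gzip_savings(data: bytes):
    """Return (original size, compressed size, fraction saved) for data."""
    compressed = gzip.compress(data)
    saved = 1 - len(compressed) / len(data) if data else 0.0
    return len(data), len(compressed), saved


# Repetitive markup, as found in most HTML pages, compresses very well.
sample = b"<li><a href='page.html'>Web page extraction tool</a></li>\n" * 200
original, compressed, saved = gzip_savings(sample)
print(f"{original} bytes -> {compressed} bytes ({saved:.0%} smaller)")
```

Images and videos are usually stored in already-compressed formats, so running them through gzip gains little.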

Use our Web page extraction tool to determine whether a Web page is using compression and how much smaller the Web page, script and stylesheet files would be if they were compressed. The tool will also provide compressed versions of the script and stylesheet files.
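For the curious, whether a server is compressing a response can be read from the Content-Encoding header it sends back when the request says gzip is acceptable. The following Python sketch shows the idea; it is not the tool's implementation, and the names is_gzip_encoded and fetch_headers are hypothetical.

```python
import urllib.request


def is_gzip_encoded(headers):
    """headers: mapping of response header names to values."""
    for name, value in headers.items():
        if name.lower() == "content-encoding":
            return "gzip" in value.lower()
    return False


def fetch_headers(url, timeout=10):
    """Request url, announcing that gzip is acceptable, and return the headers."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())


print(is_gzip_encoded({"Content-Type": "text/html", "Content-Encoding": "gzip"}))
# prints True
```

To check a live page, pass the result of fetch_headers for its address to is_gzip_encoded, and repeat for each linked script and stylesheet.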

Sonata, Inc.  |  Bend, OR 97702  |  541-318-0567  |  tgrimm@sonatainc.com