Google indexing and SEO

It is crucial that both the Open Access full-text research content of the repository and the metadata records of citation material are fully indexed by Google (and other search engines); in future this is also likely to be required for other Open Educational Resources (learning objects). However, a Google search for site:http://repository-intralibrary.leedsmet.ac.uk/ currently returns just 4 results (in addition to the Login page itself), and it is a bit of a mystery how these 4 are being picked up when the majority of records are not.

In intraLibrary, for a given collection, the administrator may choose to:

• Allow published content in this collection to be searched by external systems

This effectively means SRU (Search/Retrieve via URL), a standard search protocol that uses CQL (Contextual Query Language, formerly Common Query Language).

• Allow published records in this collection to be harvested by external systems

This effectively means harvesting by OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting).
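To make the difference between the two options concrete, here is a minimal sketch of what the corresponding requests look like. The request parameters are the standard SRU and OAI-PMH ones, but the base URLs are placeholders (the actual intraLibrary endpoints are not given here), and the use of the dc.creator index assumes the server supports the Dublin Core context set.

```python
from urllib.parse import urlencode

# Hypothetical endpoints: the real intraLibrary paths are not given in this post.
SRU_BASE = "http://repository.example.org/sru"
OAI_BASE = "http://repository.example.org/oai"


def sru_search_url(cql_query, max_records=10, base=SRU_BASE):
    """Build an SRU searchRetrieve request URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": max_records,
    }
    return base + "?" + urlencode(params)


def oai_list_records_url(metadata_prefix="oai_dc", base=OAI_BASE):
    """Build an OAI-PMH ListRecords request URL (a harvest, not a search)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base + "?" + urlencode(params)


# Search: ask the repository for records matching a CQL query.
print(sru_search_url('dc.creator = "Morton, Veronica"'))
# Harvest: ask the repository to list its records as Dublin Core.
print(oai_list_records_url())
```

Both return XML intended for other systems rather than pages for people, which is why neither, on its own, gives a crawler ordinary links to follow.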

XML Sitemaps

Intrallect have suggested that it is necessary to implement an XML sitemap to ensure that content is properly crawled by Google. Until 2008, Google supported sitemaps using OAI-PMH but have since withdrawn this and now support only the standard XML format. Intrallect have therefore developed a software tool that converts OAI-PMH output to an appropriate XML format. A sitemap has been generated and registered using Google’s webmaster tools, but it is currently reporting a series of errors of the form “This URL is not allowed for a Sitemap at this location”. Nine errors are listed, sequentially from the very first URL; it seems that the crawl does not go any further, and none of the 100+ URLs in the sitemap has been successfully recognised. Two possible reasons have been suggested for this:

• All of the URLs in the sitemap are external; it may be that Google does not permit URLs outside the mapped domain.
• There is a problem with the XML itself

Sitemap here: http://repository-intralibrary.leedsmet.ac.uk/sitemap/Sitemap.xml
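The first of these possibilities can be checked mechanically. Under the sitemaps.org protocol, a sitemap may only list URLs on its own host and at or below the directory it is served from, so a file at /sitemap/Sitemap.xml covers only URLs beginning /sitemap/. The following is a rough sketch of that check; the sample entries are made up for illustration and are not taken from the real sitemap.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

LOCATION = "http://repository-intralibrary.leedsmet.ac.uk/sitemap/Sitemap.xml"

# Illustrative entries only; these record paths are hypothetical.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://repository-intralibrary.leedsmet.ac.uk/sitemap/record-1.html</loc></url>
  <url><loc>http://repository-intralibrary.leedsmet.ac.uk/records/record-2.html</loc></url>
  <url><loc>http://other.example.org/record-3.html</loc></url>
</urlset>"""


def out_of_scope_urls(sitemap_location, sitemap_xml):
    """Return the <loc> URLs that fall outside the sitemap's own scope."""
    loc = urlparse(sitemap_location)
    # Scope is scheme + host + the directory that holds the sitemap file.
    directory = loc.path.rsplit("/", 1)[0] + "/"
    rejected = []
    for node in ET.fromstring(sitemap_xml).findall("sm:url/sm:loc", NS):
        url = (node.text or "").strip()
        parts = urlparse(url)
        same_site = (parts.scheme, parts.netloc) == (loc.scheme, loc.netloc)
        if not (same_site and parts.path.startswith(directory)):
            rejected.append(url)
    return rejected


for url in out_of_scope_urls(LOCATION, SAMPLE):
    print("not allowed at this location:", url)
```

If every URL in the real sitemap is flagged this way, moving the sitemap file to the root of the host that serves the records would be the obvious thing to try before suspecting the XML itself.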

Sitemaps using RSS

It is also possible to submit a sitemap based on RSS. However, this approach has not been any more successful: the Open URL/virtual file paths generated by intraLibrary are inaccessible to Google, resulting in the following warning:

URLs not followed
When we tested a sample of URLs from your Sitemap, we found that some URLs redirect to other locations. We recommend that your Sitemap contain URLs that point to the final destination (the redirect target) instead of redirecting to another URL.
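One way to act on this warning is to find out where each URL actually ends up and list that address instead. The sketch below uses Python's standard urlopen, which follows redirects and reports the final address; it is only an approximation of whatever test Google runs, and it will not help if the final destination itself requires a login.

```python
from urllib.request import urlopen


def final_url(url):
    """Fetch a URL, following redirects, and return where it ended up."""
    with urlopen(url) as response:
        return response.geturl()


def redirecting_urls(urls, resolve=final_url):
    """Map each URL that redirects to its final destination."""
    redirects = {}
    for url in urls:
        target = resolve(url)
        if target != url:
            redirects[url] = target
    return redirects


# Usage (needs network access), with URLs taken from the RSS feed:
#   for source, target in redirecting_urls(feed_urls).items():
#       print(source, "->", target)
```

The resolve parameter is there so the check can be tried out against a fixed lookup table without touching the network.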

Google and SRU

Though SRU does not facilitate indexing by Google per se, the integration of the SRU Open Search interface may provide a potential solution. site:http://repository.leedsmet.ac.uk/ currently returns 247 records; these largely appear to represent Googlebot following the various browse links (many of which themselves return no results where there is no content to find!). In addition, Googlebot appears to be following the hyperlinked author names, publisher and subject(s) in the individual metadata records:

[Image: screenshot of Google search results]

The third of these “The Repository search for Morton, Veronica” links to the two metadata records associated with that name as though it had simply been entered into http://repository.leedsmet.ac.uk/ as a search term:

http://repository.leedsmet.ac.uk/main/search.php?q=Morton%2C+Veronica+

Presumably these records were initially indexed via the appropriate links on the browse interface (http://repository.leedsmet.ac.uk/main/browse.php, under Faculty of Health and R – Medicine) and then re-indexed via the hyperlinks embedded in the metadata records. It is interesting to note that, though Morton, Veronica has only two records associated with her name, this result appears relatively high, at the top of the second page. This is probably because there are so many other authors also associated with these papers; all of these names are hyperlinked, giving over 21 separate indexable links.

It seems that we might need to formalise the structure of the SRU interface to ensure it is optimised for Google, possibly with some sort of SRU sitemap. For example, we could generate a page that links to all the individual metadata records in the repository and optimise this page to be crawled by search engine spiders (it doesn’t need to be human readable; it could be XML), which could then follow the links to the associated metadata.
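As a rough illustration of the XML version of that idea, the sketch below turns a list of record URLs into a standard sitemap. The record URLs are hypothetical (the real address pattern for an individual record would need to be substituted), and how the list is obtained, whether from OAI-PMH, SRU or the database, is left open.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(record_urls):
    """Return a sitemaps.org <urlset> document listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in record_urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & in the URL for us.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record addresses, for illustration only.
records = [
    "http://repository.leedsmet.ac.uk/main/record.php?id=1",
    "http://repository.leedsmet.ac.uk/main/record.php?id=2",
]
print(build_sitemap(records))
```

For the listed URLs to be accepted, the resulting file would have to be served from the same host as the records, at or above their directory (for example from the site root).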

It also seems to me that Search Engine Optimisation will need to comprise appropriate customisation of the SRU interface; for example, we want to facilitate browse by author which, in turn, will provide indexable links for Googlebot.

Full text indexing

There is also the issue of indexing full text. As already mentioned, Google does not follow the Open URL/virtual file paths generated by intraLibrary, and all the results from site:http://repository.leedsmet.ac.uk/ are search result pages. Potentially this is a benefit inasmuch as people are less likely to bypass the metadata record and go directly to the PDF, but we do also want to facilitate full text indexing. We may have to wait for Intrallect on this; they have assured us they are looking into facilitating full text indexing, probably via intraLibrary itself rather than the SRU interface.

9 Responses to Google indexing and SEO

  1. Pingback: SEO for Open Access Repositories « Open Education News

  2. Pingback: Development of Research Repository Aspect of IntraLibrary « Repository News

  3. Pingback: Open Educational Resources Programme start-up meeting: What I learned « Repository News

  4. Pingback: JorumOpen will use DSpace « Repository News

  5. Pingback: Separate HTML pages for individual records « Repository News

  6. If you look at the picture at this link: http://www.lab860.net/cnn.jpg
    You’ll see 3 URLs: the top and bottom ones have an update date; the center one doesn’t, but it is updated daily.
    How can you get the update down to minutes on search results?

    Thomas,
    Miami Web Designer

  7. On most of the websites I design, I use an xml sitemap generator, and then immediately submit the sitemap to Google using Google Webmaster Tools. I know they will probably find it anyway but this has always worked fine for me!

  8. Tom Desai says:

    Miami, the best way is to create an XML sitemap and set the change frequency to daily. Then point to this sitemap from your robots.txt file. It will not change in minutes though; Googlebot isn’t that fast.

    SEO Agency Manchester

  9. Pingback: Infobib » Kein IRrweg, aber dennoch Handlungsbedarf
