Extract Google Search Terms from Web Logs

In the preceding example I showed you how to extract information and statistics from web logs. I will now build upon that example to accomplish a specific task: get a list of search terms that people used to find your web site in Google.

The regular expression for matching web log entries needs three adaptations. The first one is optional. I like to restrict the search to hits to web pages, so I’ve changed the part of the regular expression that matches the file in the HTTP request to /([-_a-z0-9]++\.html)?+

The second change is what makes this example work. Instead of using "[^"]*+" to match any referring URL, we’ll use (?:http://www\.google\.(?:com?\.)?[a-z]{2,3}/search\?.*?\bq=\+*+([^&"\r\n]++)[^"\r\n]*+) to match only Google search pages, and extract the search terms. The http://www\.google\.(?:com?\.)?[a-z]{2,3}/search part matches the search page URL on google.com as well as on country-specific domains such as google.co.uk and google.de. The remaining part \bq=\+*+([^&"\r\n]++)[^"\r\n]*+ matches the q parameter in the search page URL. This parameter holds the URL-encoded search terms. The \+*+ skips any leading pluses, which are URL-encoded spaces, so that the capturing group holds only the search terms themselves. The regex captures these into a backreference.
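To see what the referrer part of the regex captures, here is a minimal Python sketch that applies it outside PowerGREP. The function name extract_query and the sample URL are made up for illustration. The possessive quantifiers (*+ and ++) are swapped for ordinary greedy ones, because older versions of Python's re module do not accept them; for URLs that match, the captured text is the same.

```python
import re

# Referrer pattern from the text, with greedy quantifiers standing in
# for the possessive ones (an adaptation, not the PowerGREP original).
GOOGLE_REFERRER = re.compile(
    r'http://www\.google\.(?:com?\.)?[a-z]{2,3}/search\?'
    r'.*?\bq=\+*([^&"\r\n]+)[^"\r\n]*'
)

def extract_query(referrer):
    """Return the raw, still URL-encoded q parameter, or None if no match."""
    match = GOOGLE_REFERRER.match(referrer)
    return match.group(1) if match else None

# Hypothetical referrer, for illustration only.
print(extract_query("http://www.google.co.uk/search?hl=en&q=regex+tutorial&start=10"))
# prints: regex+tutorial
```

The captured text is still URL-encoded at this point; the pluses have not yet been turned back into spaces.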

The third change speeds up the action. Since we only care about the HTTP request and the referrer, we can remove the parts of the regex before the HTTP request and after the referrer. The part of the regex matching the HTTP request cannot match anywhere else in the log entries, so our regular expression is still properly anchored.

Since the search terms are part of the referring URL, they are URL-encoded. Spaces have been substituted with pluses, and various other special characters are substituted with hexadecimal values. For example, the plus itself was substituted with %2B, and the quote character with %22. With PowerGREP’s “extra processing” feature, the search terms can easily be made readable again.
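The decoding step can be approximated outside PowerGREP with Python's standard library. This is only a sketch of the idea, not a description of how PowerGREP's extra processing is implemented; the function name decode_terms is made up for illustration.

```python
from urllib.parse import unquote_plus

def decode_terms(raw):
    """Turn pluses into spaces and %XX escapes into the characters they encode."""
    return unquote_plus(raw)

print(decode_terms("%22regular+expression%22+c%2B%2B"))
# prints: "regular expression" c++
```

Note that unquote_plus converts the pluses to spaces before it decodes the %XX escapes, so a literal plus encoded as %2B survives as a plus.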

Select the log files you want to search through in the File Selector.
Open the PowerGREP5.pgl library file included with PowerGREP. You can find it in the folder where PowerGREP is installed, c:\Program Files\Just Great Software\PowerGREP 5 by default.
Select “Logs: Inspect Apache web logs - Google search terms” in the library, and click the Use Action button. This sets up the regular expression and extra processing as explained above.
Click the Preview button to run the action.

When the action finishes running, the Results panel will show a list of search terms, sorted from most to least occurrences.
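If you ever need the same kind of tally outside PowerGREP, the grouping and sorting can be sketched in a few lines of Python. The function name rank_terms and the sample data are hypothetical, and the input is assumed to be a list of already decoded search terms.

```python
from collections import Counter

def rank_terms(terms):
    """Count identical search terms and list them from most to least frequent."""
    return Counter(terms).most_common()

# Hypothetical sample data, for illustration only.
for term, count in rank_terms(["regex", "grep", "regex"]):
    print(count, term)
```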

If you select the action “Logs: Inspect Apache web logs - Google search terms with landing pages” in the library, you will get a list of search terms paired with the page the visitor clicked on in Google’s search results. Search terms without a page brought the visitor to the home page. The only difference in the action that shows landing pages is the text to be collected, which uses two backreferences instead of one.

The regular expression has two capturing groups. The first one matches the file name in the HTTP request, which is the landing page. The second group matches the Google search terms. The text to be collected uses \l2 (backslash ell two) to collect the search terms converted to lowercase, and \1 to collect the landing page.
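The pairing of the two groups can be illustrated with a rough Python equivalent. This is a sketch under several assumptions, not the regex stored in the PowerGREP library: the log is assumed to be in Apache's combined format, the parts of the pattern matching the request method, protocol, status code and size are guesses, the possessive quantifiers are replaced by greedy ones, and the function name and return format are made up for illustration.

```python
import re
from urllib.parse import unquote_plus

LOG_ENTRY = re.compile(
    # HTTP request; group 1 is the landing page (absent for the home page).
    r'"(?:GET|POST) /([-_a-z0-9]+\.html)? HTTP/[\d.]+" '
    # Status code and response size (assumed combined log layout).
    r'\d{3} (?:\d+|-) '
    # Google referrer; group 2 is the URL-encoded search terms.
    r'"http://www\.google\.(?:com?\.)?[a-z]{2,3}/search\?'
    r'.*?\bq=\+*([^&"\r\n]+)[^"\r\n]*"'
)

def terms_and_landing_page(line):
    """Return (lowercased search terms, landing page), or None if no match."""
    match = LOG_ENTRY.search(line)
    if match is None:
        return None
    terms = unquote_plus(match.group(2)).lower()   # plays the role of \l2
    page = match.group(1) or ""                    # plays the role of \1
    return terms, page
```

An empty page in the result corresponds to a search that brought the visitor to the home page.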