The wwwscan command allows developers to view log files created by the HTTP daemons. These logs cannot be directly viewed because they are on hosts on which developers do not have accounts.
Note: In most cases, Google Analytics will give you better and more detailed information, without requiring manual processing of log files.
Usage for the wwwscan command is:
wwwscan -[a|e] -[p|t|d|v|u] [-s] [-h virtual_host] regexp|-nlines

The option flags are:
-a | Get data from the access log (default)
-e | Get data from the error log
-p | Get data for the production (www) environment (default)
-t | Get data for the test (wwwtest) environment
-d | Get data for the development (wwwdev) environment
-u | Get data for the u-development (wwwudev) environment
-s | Get data for SSL connections
-h virtual_host | Get data for virtual host www.HOST.org or www.HOST.net
-c | Return access logs in common log format
-C | Return access logs in combined log format
-D | Return access logs sorted by date (may take a long time)
regexp | Regular expression for which to search
-nlines | Show last nlines lines for each file
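For example (a sketch combining the flags documented above), to show the last 10 lines of each error log in the test (wwwtest) environment:

% wwwscan -e -t -10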
Warning: Be sure that you do not save any wwwscan output to the web directories. If you put it into the web directories, it becomes accessible via the web. Because of the content of the log files, there may be privacy concerns if the raw data were viewed by people not involved with maintaining content for www.washington.edu. For this reason, please be sure all output goes into your home directory or into a group directory.
If you have problems saving the output of the wwwscan command because you get the error message "write stdout: Permission denied", it could be because you are in a directory for which you do not have write permission. If that's not the case, you may be running into a bug in some filesystem implementations that prevents wwwscan from writing its output directly. In these cases, you need to put another command into the pipeline before saving the output. The examples below use the grep command; if you aren't using grep, you can substitute cat, such as:
% wwwscan ' /webinfo/' | cat >webinfo.scan
% wwwscan -1 world
www1:www/world/access: green.alexa.com - - [27/Jan/2000:12:13:14 -0800] "GET /students/timeschd/sln.cgi?QTRYR=AUT+1999&SLN=6327 HTTP/1.0 User-Agent='ia_archiver' Referer='-'" 11918 200 2203 0
www2:www/world/access: 128.220.12.65 - - [27/Jan/2000:12:13:21 -0800] "GET /admin/eoo/ads/index.html HTTP/1.0 User-Agent='Mozilla/4.61 [en] (Win95; I)' Referer='http://www.washington.edu/admin/eoo/ads/'" 17549 200 8214 0
www3:www/world/access: house11.studaff.calpoly.edu - - [27/Jan/2000:12:13:23 -0800] "GET /students/uga/css/images/bg_navDesc_yTitle.jpg HTTP/1.0 User-Agent='Mozilla/4.61 [en] (Win98; I)' Referer='http://www.washington.edu/students/uga/tr/'" 11740 200 1702 0
www4:www/world/access: host-216-79-211-24.shv.bellsouth.net - - [27/Jan/2000:12:13:23 -0800] "GET /home/graphics/mo/arrow.gif HTTP/1.1 User-Agent='Mozilla/4.0 (compatible; MSIE 4.01; MSN 2.5; Windows 98)' Referer='http://www.washington.edu/'" 21403 200 62 0

Note that the access logs give extra fields which are passed from the browser. The User-Agent field is the same string passed by the client, if any, as is the Referer field. The root field shows the root used to find a document if it is different from the default document root.
The last four numbers are the process ID of the server handling the request, the return code (200 in this case means a successful transfer), the number of data bytes transferred, and how many seconds the transfer took.
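For example, a quick awk one-liner (a sketch, relying on the field positions described above, with the byte count second from last on each line) can total the requests and bytes transferred for a given search:

% wwwscan ' /webinfo/' | awk '{bytes += $(NF-1)} END {print NR " requests, " bytes " bytes"}'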
If you are searching on a virtual host, note that you still must specify a search string or a number of lines. However, you can search on a string which will appear in all requests:
% wwwscan -h kcmu /
www3t:kcmu/prod/access: shiva1.cac.washington.edu - - [07/Jan/2000:13:02:40 -0800] "GET / HTTP/1.0 User-Agent='-' Referer='-'" 21921 200 3577 4
www3t:kcmu/prod/access: shiva1.cac.washington.edu - - [07/Jan/2000:13:02:46 -0800] "GET / HTTP/1.0 User-Agent='-' Referer='-'" 21922 200 3577 1
www3t:kcmu/prod/access: banshee.kerbango.com - - [07/Jan/2000:13:32:41 -0800] "GET /ra/wp991019.ram HTTP/1.1 User-Agent='CheckURL/1.0 libwww/5.2.6' Referer='-'" 21923 200 42 1
www3t:kcmu/prod/access: banshee.kerbango.com - - [07/Jan/2000:13:32:44 -0800] "GET /ra/swdr991014.ram HTTP/1.1 User-Agent='CheckURL/1.0 libwww/5.2.6' Referer='-'" 21926 200 44 1
www3t:kcmu/prod/access: banshee.kerbango.com - - [07/Jan/2000:13:32:45 -0800] "GET /ra/sts991022.ram HTTP/1.1 User-Agent='CheckURL/1.0 libwww/5.2.6' Referer='-'" 21921 200 43 0
If you wish to search for all hits on a certain set of pages, the best way is with a string that's as specific as possible, and begins with a space (so it won't match the Referer information). For example:
% wwwscan ' /cambots/archive'
www1:www/world/access: 205.68.79.66 - - [01/Jan/2000:02:26:15 -0800] "GET /cambots/archive.html HTTP/1.0 User-Agent='Mozilla/4.6 [en] (WinNT; U)' Referer='http://www.washington.edu/cambots/'" 21844 200 8296 0
www1:www/world/access: 205.68.79.66 - - [01/Jan/2000:02:26:21 -0800] "GET /cambots/archive/june.mpg HTTP/1.0 User-Agent='Mozilla/4.6 [en] (WinNT; U)' Referer='http://www.washington.edu/cambots/archive.html'" 21844 200 57344 3
www1:www/world/access: 1cust175.tnt5.bos2.da.uu.net - - [01/Jan/2000:15:45:59 -0800] "GET /cambots/archive.html HTTP/1.1 User-Agent='Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)' Referer='http://www.washington.edu/cambots/'" 30085 200 8296 0
www1:www/world/access: 1cust175.tnt5.bos2.da.uu.net - - [01/Jan/2000:15:46:42 -0800] "GET /cambots/archive/april97/0545.gif HTTP/1.1 User-Agent='Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)' Referer='http://www.washington.edu/cambots/archive.html'" 2516 200 130013 17
www1:www/world/access: 1cust175.tnt5.bos2.da.uu.net - - [01/Jan/2000:15:47:08 -0800] "GET /cambots/archive/april96/0550.gif HTTP/1.1 User-Agent='Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)' Referer='http://www.washington.edu/cambots/archive.html'" 24409 200 124040 16
etc.
If you know you will want to do searches on many different pages in the same directory, it's best to do a general search in the top-level directory and save results in a file. The Advanced Use section shows how this is done.
If you have access to tools for web log analysis (such as webalizer or analog), you will probably need to use the -C and -D flags to convert the logs to the combined log format and to sort the output by date.
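For example (a sketch; the output file name is just a placeholder, and the exact analysis-tool invocation may differ on your system), you could save a date-sorted, combined-format copy of the hits and run webalizer on it:

% wwwscan -C -D ' /webinfo/' >webinfo.combined
% webalizer webinfo.combined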
A project is currently under way to provide a web interface for viewing wwwscan logs, as well as summaries of usage. Until then, however, there are several things you can do at the Unix prompt to get basic count information.
Note that the documentation below assumes that the sections are used in order, and that some of the commands rely on the results of previous ones. For example, Computing Total Hits relies on the file generated by Saving wwwscan output for later use.
Warning: As noted above, do not save any wwwscan output into the web directories. All the examples below assume the log files are written to one's home directory.
% wwwscan ' /webinfo/' | grep Dec/1999 >webinfo.scan

Note that the grep is run separately because keeping the pattern passed to wwwscan simple makes the search more efficient, so it has less of an effect on the production web servers.
% wc -l webinfo.scan
3214 webinfo.scan

This tells us there are 3,214 entries. However, it doesn't tell us what those files are, and it counts errors as well as successful requests.
% grep -c ' /webinfo/wwwscan.html' webinfo.scan
49
% grep -c ' /webinfo/wwwinst.html' webinfo.scan
36
If we just did a grep for the string wwwscan.html, not only would we get all hits, but we'd also get results for entries where wwwscan.html was the referring page. Specifying the grep string as shown above prevents this from happening.
The -c flag for the grep command is what generates a count. If you wish to see the actual matching entries, do not use the -c flag.
% awk '$(NF-2) < 400 {print}' webinfo.scan | sed 's,/index.html,/,g' >webinfo.scan.valid
% awk '{++u[$8]} END {for (x in u) print u[x], x}' webinfo.scan.valid | sort -rn
228 /webinfo/graphics/1pix.gif
221 /webinfo/webinfo.css
153 /webinfo/
62 /webinfo/tidy.html
49 /webinfo/wwwscan.html
45 /webinfo/ssl.html
41 /webinfo/weblint.cgi
36 /webinfo/wwwinst.html
35 /webinfo/mailto/
33 /webinfo/chtml/
33 /webinfo/announcetech.html
32 /webinfo/env.html
31 /webinfo/htaccess.html
etc.

The first line creates a file which eliminates requests that resulted in errors (the third-from-last field of every log entry is the HTTP return code, and codes greater than or equal to 400 are considered errors). This intermediate file will be used for all other computations, since we do not want to count error requests.
The first line also makes sure that references to index.html turn into references to just the directory. For example, any access to "/webinfo/index.html" is converted into "/webinfo/", since both are requests for the webinfo main page; this way the accesses are grouped together. Note that if your index file has another name (such as index.cgi), you'll want to either use that name instead of index.html in the command, or add another sed expression to handle both, as shown below.
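For example (a sketch, assuming your index file is named index.cgi), the filtering step could normalize both names at once:

% awk '$(NF-2) < 400 {print}' webinfo.scan | sed -e 's,/index.html,/,g' -e 's,/index.cgi,/,g' >webinfo.scan.valid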
% cut -f2 -d' ' webinfo.scan.valid >webinfo.scan.hosts
% grep -c -v '[a-z]' webinfo.scan.hosts
340

The cut command extracts the hostname field into its own file, and the grep -c -v '[a-z]' counts the entries that contain no letters, that is, clients that appear only as numeric IP addresses. Next, to compute hits on a per-domain basis:
% grep '[a-z]' webinfo.scan.hosts | sed 's/^.*\.\([^.]*\.[^.]*\)$/\1/' | sort | uniq -c | sort -rn
1352 washington.edu
415 alltheweb.com
208 inktomi.com
132 edu.tw
62 home.com
50 ziplink.net
35 sanpaolo.net
32 earthlink.net
26 oz.net
24 idt.net
22 uswest.net
22 stsi.net
22 aol.com
etc.
The sed command converts each hostname to just its domain. The first sort groups the domains together, and uniq -c counts how many of each there are.
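To see what the sed expression does, you can run a sample hostname through it by hand (a sketch):

% echo shiva1.cac.washington.edu | sed 's/^.*\.\([^.]*\.[^.]*\)$/\1/'
washington.edu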
The Referer field allows you to see how people got to your pages. For example, to compute which links were followed to get to the wwwscan.html page:
% sed -n "/ \/webinfo\/wwwscan.html/s/^.*Referer='\([^']*\).*/\1/p" webinfo.scan.valid | sort | uniq -c | sort -rn 25 - 21 http://www.washington.edu/webinfo/ 1 http://www.webtop.com/ 1 http://www.washington.edu/cgi-bin/search/webinfo/?Kind=Results&Key=wwwscan&Phrase=&CaseSensitive=&PartialWords= 1 http://huskysearch.cs.washington.edu/results/99120917-0/30007-zhadum-0/rmain.html
This shows that 25 of the accesses did not have a referer field (either because the URL was typed by the user or because the browser did not forward the information), and 21 of the accesses came from the webinfo main page. Of interest is the access from http://www.washington.edu/cgi-bin/search/webinfo/, which is the webinfo search function, and the access from http://huskysearch.cs.washington.edu/results/, which is the huskysearch search engine.
In the command shown, the arguments to sed are fairly complex: the address part selects only the log entries for wwwscan.html, and the substitution keeps just the URL inside the quotes following Referer=. See the sed documentation if you are interested in the details.
Another use of the Referer field is to do the opposite of checking references: that is, to find out which pages a particular page refers people to. Note that this information is much less complete, because someone may follow a link to another system, and you won't be able to detect that. However, suppose we wish to know how many people followed a link on the top-level page to another page in the same directory:
% grep "Referer='[^']*/webinfo/'" webinfo.scan.valid | awk '{++u[$8]} END {for (x in u) print u[x], x}' | sort -rn 103 /webinfo/graphics/1pix.gif 60 /webinfo/webinfo.css 21 /webinfo/wwwscan.html 18 /webinfo/wwwinst.html 12 /webinfo/wwwauth.html 12 /webinfo/env.html
Note that some files are almost never directly accessed by users, such as the first two in the list (the first is a placeholder image used by some of the tables, and the second is the style sheet). However, we see that wwwscan.html is the page most often referenced from the webinfo main page.
The log files also have information about the type of browser the client used to access your files. Computing usage based on browser type is just like computing references, but the User-Agent field is used instead of the Referer field.
To see what browser types have accessed the main webinfo page:
% sed -n "/ \/webinfo\/ /s/^.*User-Agent='\([^']*\).*/\1/p" webinfo.scan.valid | sort | uniq -c | sort -rn 13 Mozilla/4.61 [en] (X11; U; HP-UX B.10.20 9000/715) 10 Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt) 9 Mozilla/4.0 (compatible; MSIE 4.01; Windows NT) 8 Mozilla/4.0 (compatible; MSIE 4.5; Mac_PowerPC) 8 Mozilla/4.0 (compatible; MSIE 4.01; Windows 98) 6 Mozilla/4.7 [en] (Win98; U) 6 Mozilla/4.7 [en] (Win98; I) 6 Mozilla/4.5 [en] (WinNT; I) 6 Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt) 5 Mozilla/4.0 (compatible; MSIE 4.01; Windows 95) (etc.)
We see that Netscape (which identifies itself as "Mozilla" in the User-Agent field) version 4.61 for X accessed the webinfo pages 13 times, while the browser that accessed the page 10 times was really Microsoft Internet Explorer 5.0. However, this probably gives more detail than you want, because every combination of version number and platform is listed separately.
To group the results by browser name alone:
% sed -n "/ \/webinfo\/ /s/^.*User-Agent='\([^']*\).*/\1/p" webinfo.scan.valid >webinfo.scan.browsers % sed -e 's,^.*MSIE ,MSIE/,' -e 's,/.*,,' webinfo.scan.browsers | sort | uniq -c | sort -rn 70 Mozilla 61 MSIE 4 Slurp 3 libwww-perl 3 Teleport Pro 2 Slurp.so 2 FAST-WebCrawler 2 CCU_GAIS Robot 2 ArchitextSpider 1 www.WebWombat.com.au 1 WebCopier Session 6 1 Spider 1 EliteSys SuperBot
To include version numbers along with the browser name:
% sed -e 's,^.*MSIE ,MSIE/,' -e 's,\(/[0-9a-z.]*\).*,\1,' webinfo.scan.browsers | sort | uniq -c | sort -rn
28 MSIE/4.01
24 Mozilla/4.7
18 MSIE/5.0
15 Mozilla/4.61
8 Mozilla/4.5
8 MSIE/4.5
7 Mozilla/4.04
6 MSIE/5.01
etc.
Note that there is extra sed code that manipulates text containing MSIE. This is needed to properly identify Internet Explorer browser strings, which at first glance look like Netscape strings.
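To see that manipulation in action, you can run one of the Internet Explorer strings from the earlier output through the same sed by hand (a sketch):

% echo "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)" | sed -e 's,^.*MSIE ,MSIE/,' -e 's,/.*,,'
MSIE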
If you wish to see what operating systems your users are running, that information is harder to extract because there are even more variations in the User-Agent strings than there are browser vendors. However, we can get close:
% grep -c Mac webinfo.scan.browsers
19
% grep -c Win webinfo.scan.browsers
80
% grep -c X11 webinfo.scan.browsers
28
% wc -l webinfo.scan.browsers
153 webinfo.scan.browsers
By computing the total number of lines (153) and subtracting the three counts above (19 + 80 + 28 = 127), we find that there are 26 entries that didn't fall into one of our categories. These can be either search engine robots (programs that crawl the web for content to put into search engines such as AltaVista or Google) or programs that people wrote which use a generic User-Agent field.
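If you prefer to do the whole breakdown in one pass, an awk sketch over the same webinfo.scan.browsers file gives the per-category counts and the leftover count directly (like the greps above, it simply subtracts the matched lines from the total):

% awk 'BEGIN {mac=win=x11=0} /Mac/ {mac++} /Win/ {win++} /X11/ {x11++} END {print "Mac:", mac, "Win:", win, "X11:", x11, "other:", NR-mac-win-x11}' webinfo.scan.browsers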