UW ODIN Pilot Project Report -- Technical Architecture


Technical Architecture

A nightly batch job extracts the student index from the legacy DMS II database on a Unisys A19 mainframe. The extract is merged with supplemental index records for non-student borrowers and loaded into the ODIN index database, a Sybase System 11 database on an IBM RS6000 R50 running AIX 4.2.
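The merge step of the nightly load can be pictured with a minimal sketch. It assumes both inputs arrive as records keyed by a hypothetical `borrower_id` field, and that the mainframe extract takes precedence when the same key appears in both sources. Neither detail is stated in this report, and the actual job's record layout and conflict rule may differ.

```python
from typing import Dict, Iterable, List


def merge_index(student_rows: Iterable[Dict[str, str]],
                supplemental_rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Combine the mainframe extract with supplemental records.

    Assumption: records are keyed by 'borrower_id' (hypothetical field name),
    and a student record replaces a supplemental record with the same key.
    """
    merged: Dict[str, Dict[str, str]] = {}
    # Load supplemental records first so student records can overwrite them.
    for row in supplemental_rows:
        merged[row["borrower_id"]] = dict(row, source="supplemental")
    for row in student_rows:
        merged[row["borrower_id"]] = dict(row, source="student")
    # Return rows in key order so each nightly load is reproducible.
    return [merged[key] for key in sorted(merged)]
```

The resulting list would then be bulk-loaded into the index database; that step depends on the database client in use and is omitted here.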

Figure: ODIN Architecture Schematic

The scanning module was built using PixTools, an imaging toolkit by Pixel Translations, with minimal customization. It currently runs under Microsoft Windows 95 on a 150MHz Pentium PC. Documents are scanned on a Fujitsu M3096GX high-speed monochrome document scanner with an automatic document feeder. They are converted to GIF files and saved directly to a temporary location on the network file server (a DEC Alpha 200 4/166 running Digital Unix 4.0b), so no data is stored on the PC. The GIF format was chosen because it compresses document images without loss and because browsers can render it in the browser window without plug-ins or helper applications.

The rest of the ODIN application runs on the UW web servers, a cluster of three DEC Alpha 200 4/166 systems running Digital Unix 4.0b and the Apache 1.1.3 (soon to be 1.2.4) web server software. The help screens, online manual, and a few other pages are written in static HTML, but most of the forms and pages are dynamically generated and processed by Perl scripts using the Common Gateway Interface (CGI).
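To illustrate how a CGI script generates a page dynamically, here is a minimal sketch. It is written in Python as a stand-in for the Perl scripts described above, and the `student_id` parameter and page layout are hypothetical. The web server passes the request's query string in the `QUERY_STRING` environment variable, and the script writes an HTTP header block followed by HTML.

```python
import html
from typing import Mapping
from urllib.parse import parse_qs


def render_search_page(environ: Mapping[str, str]) -> str:
    """Build a CGI response (header block plus HTML body) from the environment.

    Assumption: the form submits a 'student_id' field (hypothetical name).
    """
    params = parse_qs(environ.get("QUERY_STRING", ""))
    student_id = params.get("student_id", [""])[0]
    # Escape user input before placing it in the page.
    body = (
        "<html><body><h1>ODIN search</h1>"
        "<p>Student ID: {}</p></body></html>".format(html.escape(student_id))
    )
    # A CGI response is a header block, a blank line, then the body.
    return "Content-Type: text/html\r\n\r\n" + body
```

A real script would call `render_search_page(os.environ)` and write the result to standard output, where the web server picks it up and returns it to the browser.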

PC browsers connect directly to the web servers. The PCs range from 486/66 to Pentium 166 CPUs, all running Windows 95 and Netscape. X terminals run the browser on a host, and the host connects to the web servers. The X terminals boot from a cluster of nine DECstation 5000/240 systems running Ultrix 4.2a, and they run Netscape on a cluster of two HP 9000 715/80 systems running HP-UX 9.05.

When the indexing module is invoked on the web server, ODIN moves the images to a pair of DEC Alpha 400 4/233 systems running Digital Unix 4.0b and the Apache 1.1.3 web server software. Each image is assigned a pseudo-random URL based on a hash code of the image itself and a timestamp. The images are stored in Unix directories so the web server software can access them directly. The pilot application would fit on one image server, but splitting it between two demonstrates the scalable nature of the architecture.
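One way to derive such a URL is sketched below. The report does not name the hash function, the URL layout, or the rule for choosing an image server, so SHA-256, the host names, the two-level directory layout, and the modulo-based server choice are all assumptions made for illustration.

```python
import hashlib

# Hypothetical host names for the two image servers.
IMAGE_SERVERS = ["images1.example.edu", "images2.example.edu"]


def image_url(image_bytes: bytes, timestamp: float) -> str:
    """Return a hard-to-guess URL derived from the image content and a timestamp.

    Assumption: SHA-256 stands in for the unspecified hash code.
    """
    material = image_bytes + repr(timestamp).encode("ascii")
    digest = hashlib.sha256(material).hexdigest()
    # Use part of the digest to spread images across the available servers.
    server = IMAGE_SERVERS[int(digest[:8], 16) % len(IMAGE_SERVERS)]
    # A short directory prefix keeps any single Unix directory from growing too large.
    return "http://{}/{}/{}.gif".format(server, digest[:2], digest[2:34])
```

Because the URL depends on both the image bytes and the time of indexing, the same page scanned twice gets two different URLs, and the address cannot be guessed from the index data alone.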

Each of the major components (web servers, image servers, database server, scanning station, mainframe database, network file servers, etc.) is independent. This modularity makes it possible to modify one part of the architecture while retaining the others. As new browsers become available, they can be used without affecting the storage, indexing, or searching modules. If the indexing database is changed, none of the software used for scanning and displaying images will be affected.

The response time for a typical ODIN exchange ranges from 6 to 10 seconds, which is slower than desired. Part of the delay is overhead from CGI, part comes from accessing the data server using remote shell commands, and the biggest part comes from Netscape rendering the image. Tools available today can boost performance by eliminating some of the overhead of CGI and database connections.

A flexible data model makes it possible to use common tables to index different record series. For example, a single employee table storing name, employee number, and social security number could be used to validate entries into the benefits, personnel, and payroll files, even if those files were on their own image servers.
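The shared-table idea can be sketched with an in-memory SQLite database standing in for the Sybase index database. The table and column names are hypothetical, and the social security number column is omitted for brevity. One `employee` table validates index entries for any record series.

```python
import sqlite3


def build_index(conn: sqlite3.Connection) -> None:
    """Create one shared lookup table and one index table for all record series."""
    conn.executescript("""
        CREATE TABLE employee (
            employee_number TEXT PRIMARY KEY,
            name            TEXT NOT NULL
        );
        CREATE TABLE document_index (
            image_url       TEXT PRIMARY KEY,
            record_series   TEXT NOT NULL,  -- e.g. benefits, personnel, payroll
            employee_number TEXT NOT NULL REFERENCES employee(employee_number)
        );
    """)


def index_document(conn: sqlite3.Connection, image_url: str,
                   record_series: str, employee_number: str) -> bool:
    """Index an image only if the employee number exists in the shared table."""
    found = conn.execute(
        "SELECT 1 FROM employee WHERE employee_number = ?", (employee_number,)
    ).fetchone()
    if found is None:
        return False  # Reject entries that fail validation.
    conn.execute(
        "INSERT INTO document_index VALUES (?, ?, ?)",
        (image_url, record_series, employee_number),
    )
    return True
```

Because the index rows store only a URL, the images for each record series can live on different image servers without any change to the validation logic.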

ODIN is secured in several ways. Access to the web application is granted on an individual basis using HTTP Basic authentication. There are no user accounts on the web servers or image servers. Access to the data server is limited, and it requires entry of a pseudo-random number from a SecurID card. The database itself is restricted still further, and requires another login and password for database administration. Scanning into the system requires a network login with appropriate authorization. Once stored, the images cannot be edited. Each image is assigned an obscure URL based on the image itself, and the URL is stored in the database. All scanning, indexing, and deletion of images are logged by the application, and all web access is logged by the web servers. The index data and the images are backed up daily.
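For readers unfamiliar with HTTP Basic authentication, the sketch below shows what the browser's `Authorization` header carries. In ODIN the web server software performs this check, so the function is purely illustrative and its name is hypothetical. Note that the credentials are base64-encoded, not encrypted.

```python
import base64
import binascii
from typing import Optional, Tuple


def parse_basic_auth(header_value: str) -> Optional[Tuple[str, str]]:
    """Decode an 'Authorization: Basic ...' header value into (user, password).

    Returns None if the header is not a well-formed Basic credential.
    """
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic" or not encoded:
        return None
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None
    # The user name ends at the first colon; the password may contain colons.
    user, sep, password = decoded.partition(":")
    if not sep:
        return None
    return user, password
```

The server compares the decoded pair against its list of authorized users and answers with status 401 when the header is missing or the pair is not recognized.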