UCT CS Research Document Archive

A Web Browsing Workload for Simulation

Walters, Lourens O. (2004) A Web Browsing Workload for Simulation. MSc, Department of Computer Science, University of Cape Town.

Abstract

The simulation of packet-switched networks depends on accurate web workload models as input for network models. We derived a workload model for traffic generated by an individual browsing the web, by studying packet traces of web traffic generated by individuals on a campus network.

We attempted to model aggregate traffic generated by many web users browsing the web on a campus network by decomposing the traffic into its constituent elements, i.e. traffic generated by individual users. We furthermore identified elements within the traffic generated by individual users which contributed to the characteristics of the aggregate traffic stream. We identified parameters which were not directly influenced by network-specific characteristics such as latency and throughput, in order to ensure that our model was as general as possible.

We found that web traffic was extremely complex. The dynamic behaviour of client and server side scripts introduced dependencies in the data. Our model was more detailed than any existing web workload model at the time the study was conducted, but did not take into account the behaviour of web client and server scripts. There is room for improvement in our model here.

We tried to break down aggregate web traffic into parts containing observations which were independent of each other. An analysis of autocorrelation between observations within parameter datasets showed that dependencies existed between observations in most of the parameter datasets. The dynamic behaviour of scripts might explain some of these dependencies.
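The autocorrelation check described above can be illustrated with a minimal sketch. It assumes the standard sample autocorrelation estimator; the abstract does not say which estimator or lags the study used, and the function name and example series are hypothetical.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at a positive lag (standard estimator)."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    # Denominator: total sum of squared deviations from the mean.
    deviation_sum = sum((v - mean) ** 2 for v in values)
    if deviation_sum == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    # Numerator: products of deviations for observations `lag` positions apart.
    lagged_sum = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return lagged_sum / deviation_sum


# Hypothetical data: an alternating series, so neighbouring observations
# are clearly dependent on each other.
series = [1.0, -1.0] * 50
lag1 = autocorrelation(series, 1)
lag2 = autocorrelation(series, 2)
```

Values near zero at all lags are consistent with independent observations; values far from zero, as in this artificial series, indicate dependence.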

We implemented a measurement system which measured data on a campus network by extracting selected information from IP, TCP and HTTP message headers. The system extracted parameter datasets for our workload model from the captured data. Capturing only selected information from the TCP/IP packets transmitted between web clients and servers, as opposed to capturing all the data transmitted, avoided the problem of extremely large amounts of data accumulating over small periods of time. Capturing all the data transmitted between web clients and servers would have required large amounts of storage space and processing power which were not available to us. By using our measurement system, we were able to record data for a 30-day period, capturing web traffic generated by 6 692 hosts on a campus network.
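To illustrate what keeping only selected header information can look like, here is a minimal sketch that pulls a few fields out of a raw IPv4 packet carrying TCP. The choice of fields and the function name are assumptions for illustration; the abstract does not list the fields the measurement system recorded, and HTTP header parsing is not covered here.

```python
import socket
import struct


def extract_header_fields(packet: bytes) -> dict:
    """Return a few IPv4 and TCP header fields from a raw IPv4 packet."""
    if len(packet) < 20:
        raise ValueError("too short for an IPv4 header")
    version_ihl = packet[0]
    if version_ihl >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    ip_header_len = (version_ihl & 0x0F) * 4  # IHL is in 32-bit words
    total_length = struct.unpack("!H", packet[2:4])[0]
    if packet[9] != 6:  # IP protocol number 6 is TCP
        raise ValueError("not a TCP segment")
    tcp = packet[ip_header_len:ip_header_len + 20]
    if len(tcp) < 20:
        raise ValueError("too short for a TCP header")
    # Ports, sequence number, acknowledgement number, data offset and flags.
    src_port, dst_port, seq, ack, offset_flags = struct.unpack("!HHIIH", tcp[:14])
    tcp_header_len = (offset_flags >> 12) * 4  # data offset is in 32-bit words
    return {
        "src_ip": socket.inet_ntoa(packet[12:16]),
        "dst_ip": socket.inet_ntoa(packet[16:20]),
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "flags": offset_flags & 0x3F,  # URG, ACK, PSH, RST, SYN, FIN bits
        "payload_len": total_length - ip_header_len - tcp_header_len,
    }
```

A record like this occupies a few tens of bytes per packet regardless of payload size, which is the storage saving the paragraph above refers to.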

The measurement system could extract parameter datasets in real-time, or write selected data to secondary storage in order to extract parameter datasets off-line. The real-time version of the measurement system could not extract parameter datasets during peak traffic hours. We used the off-line version of the measurement system to obtain parameter datasets for the study. We believe that with certain optimisations the real-time system would be able to extract parameter datasets during peak traffic hours as well.

We extracted parameter datasets from data recorded to secondary storage by the measurement system. We used the packet trace method to record data to secondary storage. Because of the nature of the packet trace method of measurement, the recorded data did not contain sufficient information to extract parameter datasets directly.

We used a heuristic algorithm to extract parameter datasets from the incomplete data. The heuristic algorithm was novel as it used information from TCP, IP and HTTP packet headers to reconstruct a user's browsing behaviour. This had not been done before. The algorithm used a list of characteristics of web client requests which we compiled by studying packet traces of traffic generated by web users. The algorithm inferred user behaviour from the list of web client request characteristics. By using the algorithm we were able to extract parameter datasets from the incomplete measured data.
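The abstract does not give the list of request characteristics the algorithm relied on, so the following is not the thesis algorithm. It is a deliberately simplified sketch of one common kind of heuristic for inferring browsing behaviour from a trace: treating a long idle gap between one client's requests as the boundary between page views. The function name and the `idle_gap` default are hypothetical.

```python
def group_into_page_views(request_times, idle_gap=1.0):
    """Group one client's request timestamps (in seconds) into page views.

    Assumption for illustration only: a request that follows the previous
    one within `idle_gap` seconds belongs to the same page view; a longer
    silence is taken to mean the user was reading and then requested a
    new page.
    """
    pages = []
    for t in sorted(request_times):
        if pages and t - pages[-1][-1] <= idle_gap:
            pages[-1].append(t)  # continue the current page view
        else:
            pages.append([t])    # an idle period ended: start a new page view
    return pages


# Hypothetical timestamps for a single client.
views = group_into_page_views([0.0, 0.2, 0.5, 10.0, 10.3, 30.0])
```

A heuristic built on header information, as described above, would use more evidence than timing alone; this sketch only shows the general shape of inferring user actions from incomplete trace data.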

We analysed the extracted parameter datasets by using visual techniques and goodness-of-fit measures. We tested several families of mathematical functions in order to find a function which fitted the model parameter data well. The parameter datasets were very large, typically containing millions of entries. The commercial statistical analysis packages we had at our disposal could not analyse datasets with millions of entries. We overcame the problem posed by the size of the datasets by implementing our analysis routines in the R statistical analysis environment, a freely available open source software package. We implemented the Anderson-Darling and (lambda)^2 goodness-of-fit statistics in R for eight mathematical families of functions. We also implemented Q-Q and P-P plots in order to analyse the data visually.
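As an illustration of one of the statistics named above, the following sketch computes the Anderson-Darling A^2 statistic using its standard textbook formula. It is written in Python rather than R and is not the thesis code; the exponential distribution and the sample are assumptions chosen only to make the example self-contained.

```python
import math


def anderson_darling(sample, cdf):
    """Anderson-Darling A^2 statistic of `sample` against a hypothesised CDF.

    A^2 = -n - (1/n) * sum_{i=1..n} (2i - 1) * [ln F(x_(i)) + ln(1 - F(x_(n+1-i)))]
    where x_(1) <= ... <= x_(n) are the ordered observations.
    The CDF must return values strictly between 0 and 1 for every observation.
    """
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i in range(1, n + 1):
        f_low = cdf(xs[i - 1])   # F at the i-th smallest observation
        f_high = cdf(xs[n - i])  # F at the i-th largest observation
        total += (2 * i - 1) * (math.log(f_low) + math.log(1.0 - f_high))
    return -n - total / n


def exponential_cdf(rate):
    """CDF of an exponential distribution with the given rate."""
    return lambda x: 1.0 - math.exp(-rate * x)


# Hypothetical sample: exact quantiles of a rate-1 exponential distribution.
n = 100
sample = [-math.log(1.0 - (i - 0.5) / n) for i in range(1, n + 1)]
good_fit = anderson_darling(sample, exponential_cdf(1.0))
poor_fit = anderson_darling(sample, exponential_cdf(5.0))
```

Smaller values of A^2 indicate a closer fit, so `good_fit` is much smaller than `poor_fit` here. Turning the statistic into a formal test requires critical values for the sample size and distribution family in question.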

We found that the Anderson-Darling statistic could not be used to analyse large datasets. The Anderson-Darling statistic did not have p-value tables for datasets with more than 200 observations. We used the (lambda)^2 statistic as an indicator of goodness-of-fit. The (lambda)^2 statistic indicated that there was evidence against a perfect fit between parameter datasets and any of the eight mathematical function families we tested for. At least one mathematical function did fit each dataset very well, albeit not perfectly well. The visual evidence provided by Q-Q and P-P plots corroborated this finding. We tabulated the data for each of the parameter datasets. Random values could be generated from the tabulated values.
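Generating random values from a tabulated distribution is commonly done by inverse transform sampling. The sketch below assumes the table is stored as sorted values with matching cumulative probabilities; the abstract does not specify the table format, and the function name and example table are hypothetical.

```python
import bisect
import random


def make_sampler(values, cumulative_probs, rng=random):
    """Return a function that draws values from a tabulated distribution.

    `values` are the tabulated outcomes; `cumulative_probs` are the matching
    cumulative probabilities, non-decreasing and ending at 1.0.
    """
    if len(values) != len(cumulative_probs):
        raise ValueError("values and cumulative_probs must have equal length")

    def sample():
        u = rng.random()  # uniform on [0, 1)
        # First table row whose cumulative probability is at least u.
        idx = bisect.bisect_left(cumulative_probs, u)
        return values[min(idx, len(values) - 1)]

    return sample


# Hypothetical table: response sizes in bytes with cumulative probabilities.
draw = make_sampler([100, 1000, 10000], [0.5, 0.9, 1.0], rng=random.Random(0))
draws = [draw() for _ in range(10000)]
```

Each tabulated value is returned with probability equal to the step in cumulative probability at its row, so a simulation can reproduce the measured distribution without assuming any analytic function family.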

EPrint Type: Electronic Thesis or Dissertation
Subjects: UNSPECIFIED
ID Code: 137
Deposited By: Arnab, A
Deposited On: 14 June 2004