The web data pipelines

I wanted to address another observation made in the article “Things That Throw Your Stats”. The author states:

Web analysis is statistics, not accounting.

While I think his overall message is a disservice to the people trying hard to increase accuracy and accountability on the web, I won’t go on about that here. Instead, I want to point out that his view of web analysis is too narrow.

There are actually three different components to web analysis. At Yahoo! we have many sources of data, but fundamentally three data pipelines:

  • Operational
  • Financial
  • Analytical

Each may start from a central place, such as the web server log files, but they move through the infrastructure at different speeds, and in different ways, because they are used for different things.

The operational data pipeline is largely concerned with availability, quality of service, consistency, correctness, etc. Some of the analysis needs to be available in real time, and some of it much less so. A lot of the analysis is accounting, but statistics are involved for things like failure prediction.

The financial data pipeline is all about the money. If you can’t account for it, you can’t charge for it. Since Y! is largely ad-driven, it’s important to get this aspect right. A 10% “fudge” won’t sit right with advertisers, nor with shareholders, nor with the fine folks who brought you Sarbanes-Oxley. Not everything needs to be collected (e.g. click paths aren’t very interesting), just metrics like ad views and clickthroughs. It’s not real-time, but needs to be available relatively soon after a campaign ends, or at the end of an accounting quarter. This is largely straight accounting, yet there are statistics involved, for things like detecting click fraud.
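To make the "straight accounting" side concrete, here is a minimal sketch of tallying ad views and clickthroughs per campaign. The record layout, campaign IDs, and event labels are hypothetical, chosen only for illustration; they are not an actual Yahoo! log format.

```python
from collections import Counter

# Hypothetical log records: (campaign_id, event_type).
# These names are illustrative assumptions, not a real log schema.
events = [
    ("camp-1", "ad_view"),
    ("camp-1", "ad_view"),
    ("camp-1", "click"),
    ("camp-2", "ad_view"),
    ("camp-2", "page_view"),  # not billable, so it is ignored below
]

BILLABLE = ("ad_view", "click")

def tally(records):
    """Count every billable event exactly; nothing is sampled or estimated."""
    counts = Counter()
    for campaign_id, event_type in records:
        if event_type in BILLABLE:
            counts[(campaign_id, event_type)] += 1
    return counts

totals = tally(events)
```

The point of the sketch is that every billable event is counted, one by one, which is what separates this pipeline from one that can tolerate sampling or estimation.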

The analytics data pipeline largely parallels the financial pipeline, but doesn’t have to be SOX-compliant. It also collects much more data (e.g. browser string), and computes even more algorithmically (e.g. visit duration). The intention, of course, is to use analytics to impact the other two systems. The tricky part is that the way to positively impact the operational and financial systems is by improving the user experience (better response times, more engaging content, etc.), which largely must be inferred from observed behavior. There’s some accounting here, but it is largely statistics, advanced metrics, and data research/mining, with a heavy dose of human-based synthesis. Some of the results of the analytics systems feed the operational pipeline, for things like providing targeted advertising based on observed interest.
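As an illustration of a metric that is computed rather than collected, here is a rough sketch of deriving visit duration from one user's request timestamps. It assumes a visit ends after 30 minutes of inactivity, a common convention but an assumption here; the function name and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: a gap of more than 30 minutes between requests starts a new visit.
VISIT_TIMEOUT = timedelta(minutes=30)

def visit_durations(timestamps, timeout=VISIT_TIMEOUT):
    """Group one user's request times into visits; return each visit's duration."""
    ordered = sorted(timestamps)
    durations = []
    if not ordered:
        return durations
    start = last = ordered[0]
    for ts in ordered[1:]:
        if ts - last > timeout:
            # The gap is too long: close the current visit and start a new one.
            durations.append(last - start)
            start = ts
        last = ts
    durations.append(last - start)  # close the final visit
    return durations

# Hypothetical request times for a single user.
hits = [
    datetime(2007, 1, 1, 9, 0),
    datetime(2007, 1, 1, 9, 10),
    datetime(2007, 1, 1, 9, 25),
    datetime(2007, 1, 1, 13, 0),
]
durations = visit_durations(hits)
```

Note that a visit with a single request comes out with a duration of zero, since the log never records when the user actually left. That is one reason this kind of number is an inference rather than a count.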

While the group I’m in largely focuses on strategic uses of the web data – the analytics pipeline – it’s never done in a vacuum; we’re always cognizant of the other two pipelines. All three groups – operational, financial, and analytical – are doing analysis, all with the same source data, all towards the same overall goals. The data we keep, the tools we use, and the methods we employ can be very different, but it’s always a combination of accounting and statistics – never just one or the other.
