andygates: (Default)
[personal profile] andygates
So it comes to pass (thick with irony) that I'm involved in the organisation's web logs and all that jazz. The logs currently dump out as text files from three proxy servers, so each day I get about 1.5GB (no, really) of logs.

Currently I'm manually importing them into an MSSQL database and, for reasons of management's own, each day's logfile ends up in a separate table, ideal for difficult and tedious analysis. Clearly, I'll be automating that in just a few days' time.
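
For the curious, here's a rough sketch of what the automated load could look like in Python, pushing each day's dump into a single date-keyed table rather than a table per day. It assumes a tab-delimited log format, a pre-existing dbo.ProxyLog table with a LogDate column, and a share the SQL Server box can see; every name and path here is made up for illustration.

import datetime
import pyodbc

LOG_DIR = r"\\proxyshare\logs"  # hypothetical share where the proxies dump their files
CONN_STR = "DRIVER={SQL Server};SERVER=logserver;DATABASE=WebLogs;Trusted_Connection=yes"

def import_day(day: datetime.date) -> None:
    path = LOG_DIR + "\\proxy-{:%Y%m%d}.log".format(day)
    conn = pyodbc.connect(CONN_STR, autocommit=True)
    try:
        # BULK INSERT runs on the server, so the 1.5GB file never travels
        # through the client; the path must be visible to SQL Server itself.
        conn.cursor().execute(
            "BULK INSERT dbo.ProxyLog FROM '{}' "
            "WITH (FIELDTERMINATOR = '\\t', ROWTERMINATOR = '\\n', TABLOCK)".format(path)
        )
    finally:
        conn.close()

if __name__ == "__main__":
    # Load yesterday's file; a scheduled task could run this each morning.
    import_day(datetime.date.today() - datetime.timedelta(days=1))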

But I mean, a gig and a half daily? That's half a terabyte in a year! I know we're enterprise-class, but that's a stuposterously humungous wodge of data. It's particularly unwieldy when (as has happened) I'm asked to mine it for, say, J Random User's access to see if he's been doing "anything naughty".
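
For what that kind of trawl could look like, here's a sketch that just scans the raw text dumps for one username, skipping the database entirely. It assumes the username appears as a whole token on each log line; the path pattern is hypothetical.

import glob

def user_hits(username, pattern=r"\\proxyshare\logs\proxy-*.log"):
    """Yield (file, line) pairs for every log line mentioning the user."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="latin-1", errors="replace") as fh:
            for line in fh:
                if username in line.split():
                    yield path, line.rstrip()

for source, hit in user_hits("jrandomuser"):
    print(source, hit)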

It strikes me that there's a trick we're missing. We need historical logs, because evidence of naughtiness is a long-term thing. But a terabyte database in a thousand tables is daft. How do real organisations handle this?

Date: 2006-07-05 03:20 pm (UTC)
From: [identity profile] thudthwacker.livejournal.com
How do real organisations handle this?

I'm going to go out on a limb and propose "not using MSSQL." Now, I don't do this sort of thing (and hope I never have to), but I would think this kind of data belongs in a database that keeps all the log data in a single table (or a smallish set of tables -- not one table per day, which is clearly insane), designed so that pulling a single user's traffic for a week some time last year out of a couple of terabytes of table data wouldn't be no thang. "Old" data (defined as "data we don't need readily available but can't discard yet") could be moved out of the "current" tables and into some data warehouse arrangement (which I say in such a vague way because I, personally, know nothing about this "data warehousing" thing).
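
One way the "move old data out of the current tables" step could be sketched, assuming the single dbo.ProxyLog table from earlier and a dbo.ProxyLogArchive table with the same columns (both hypothetical names):

import pyodbc

CONN_STR = "DRIVER={SQL Server};SERVER=logserver;DATABASE=WebLogs;Trusted_Connection=yes"
KEEP_DAYS = 90  # illustrative cut-off for what stays "current"

conn = pyodbc.connect(CONN_STR)
cur = conn.cursor()
# Copy anything older than the cut-off into the archive table, then drop it
# from the current one; both statements commit together.
cur.execute(
    "INSERT INTO dbo.ProxyLogArchive "
    "SELECT * FROM dbo.ProxyLog WHERE LogDate < DATEADD(day, ?, GETDATE())",
    -KEEP_DAYS,
)
cur.execute(
    "DELETE FROM dbo.ProxyLog WHERE LogDate < DATEADD(day, ?, GETDATE())",
    -KEEP_DAYS,
)
conn.commit()
conn.close()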

Mine's bigger than yours

Date: 2006-07-05 06:45 pm (UTC)
From: [identity profile] gedhrel.livejournal.com
Our AD cluster generates 50GB a day. Generally we don't give a shit because it's all tedious and boring. When it gets interesting the numbers go up to about 10 times that. The firewall generates about 500GB a day, raw. These are kept for under a week normally because they're only really useful in tracking down immediate problems. Processed logs to get login events and their ilk we keep for longer.

Our syslogs are tiny but across the machines we look after that's a few tens of GB a week. Web logs and app server logs are huge, but typically only need to hang around for a short while for troubleshooting. We hang onto stuff for 30 days, or at a maximum (unless otherwise asked nicely by the police) for 90 days.

It all winds up on a big box with lots of cheap disk for processing; historical logs (in the 90-day category) are extracted via post-processing.
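
A minimal sketch of that rotate-and-postprocess pattern: raw logs sit on the big cheap-disk box for a short window, only the distilled events are kept long term. The directories, the 30-day figure, and the "login" filter are illustrative assumptions, not anyone's actual setup.

import glob
import os
import time

RAW_DIR = "/data/logs/raw"          # hypothetical big-cheap-disk dumping ground
KEEP_DIR = "/data/logs/extracted"   # distilled events kept long term
RAW_RETENTION_DAYS = 30             # illustrative; 90 if asked nicely

def extract_interesting(path):
    """Pull out the login-type events worth keeping past the raw retention."""
    out = os.path.join(KEEP_DIR, os.path.basename(path) + ".events")
    with open(path, errors="replace") as src, open(out, "w") as dst:
        for line in src:
            if "login" in line.lower():  # stand-in for the real event filter
                dst.write(line)

def expire_raw():
    """Delete raw files older than the retention window."""
    cutoff = time.time() - RAW_RETENTION_DAYS * 86400
    for path in glob.glob(os.path.join(RAW_DIR, "*.log")):
        if os.path.getmtime(path) < cutoff:
            os.remove(path)

if __name__ == "__main__":
    for path in glob.glob(os.path.join(RAW_DIR, "*.log")):
        extract_interesting(path)
    expire_raw()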

This is separate from the IDS stuff or other log keeping that people set up and don't tell us about.

There are lots of nice tools to help with this. I'm rather taken with splunk (www.splunk.com), although it does a lot more than just web logs; you might want to get hold of the trial version. For your data rates you might find it affordable (and the NHS has tons of money to throw at IT, obviously).

If you need lots of cross-searching then unless your logs are very structured a stock relational DB might not be what you're after; but you can certainly make mssql scale up to that sort of thing if you want. Not convinced at all by the table a day thing - sounds like a misinterpretation of forensic evidence requirements.
