It’s been a few days and I’ve made a few choices regarding NRT … mainly the database I’m going to use and its structure (sounds exciting, doesn’t it?).
The Database
After lots of thinking (imagine steam and turning gears) I’ve decided to use sqlite3 as my database (mine … all mine!). Some of the other options I considered were MySQL and PostgreSQL (what about MSSQL, you say? … didn’t even cross my mind)
I really didn’t want to force users to install an entire DB, what if you’re on a machine that you don’t have admin rights to (like an ethical hacker on a compromised box)? What then? You’d be up the creek without a canoe. Besides I think it’d be really cool to have this run on my phone and there’s no way I’m going to get MySQL installed on my Android (hmm … a possible future project, maybe?)
Anyway, sqlite3 supports databases up to 281 terabytes (point 12 in the docs) … if you’re scanning enough networks to fill up a database with that much info, I applaud you. Even though the DB can support that much data it might become burdensome, so we’ll be creating new sqlite3 databases based on a combination of size/date/number of space x-rays in the last fortnight/etc. (This will probably be user-defined).
The analytics portion of NRT will be able to ingest multiple databases at once so that data mining can happen across vast numbers of saved scans.
Data Structure
This is still a bit of a work-in-progress, but I have the basic structure nailed down. There will be two main tables; Hosts and Ports, which can be queried individually or together.
Hosts will have the following columns:
hosts_id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
user_scan_id TEXT,
dns TEXT,
mac_address TEXT,
ip_full TEXT,
ip_split1 INTEGER,
ip_split2 INTEGER,
ip_split3 INTEGER,
ip_split4 INTEGER,
create_date DATE,
create_time TIME,
create_datetime DATETIME,
UNIQUE(dns, ip_full)
A few notes about these columns:
- user_scan_id: A user-defined string to identify the scan. Could be a location, a business name, etc. (it’s a TEXT field, so have fun!).
- dns: This is the DNS name resolved by Nmap … so it’s not always accurate, especially if the DNS servers don’t have stale record scavenging enabled (but we can’t really do anything about that).
- mac_address: This is an interesting field (even though it might change often … thanks to Apple/Andriod spoofing their MAC’s), we can get it, so we should store it.
- ip_full: The full IP (ie. 127.0.0.1)
- ip_split1 – 4: Here is where things get interesting … you might be asking, why split the IP? That’s a silly thing to do. It’s all for sorting purposes and for ease of analysis. This way you don’t need to keep parsing ip_full if you’re interested in a specific octet of the IP. (Also, I realize this won’t work with IPv6 … we’re not there yet, once we do that work, I’ll add more columns to this table)
- create_date/time/datetime: Same thing here, instead of needing to parse these out (or combine them together) at analysis time, we’ll just store a few more bits. These fields are handy because we can tell when a unique combination of DNS and IP_full was first scanned.
Ports will have the following columns:
ports_id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
dns TEXT,
ip TEXT,
mac_address TEXT,
protocol TEXT,
port INTEGER,
state TEXT,
reason TEXT,
name TEXT,
product TEXT,
version TEXT,
extra_info TEXT,
confidence TEXT,
common_platform_enumeration TEXT,
start_scan_datetime DATETIME,
end_scan_datetime DATETIME
A few notes about these columns:
- dns/ip/mac_address: These are the same values as in the hosts table (I know I know … it’s not good to store the same info in two places, but it will make these tables more flexible)
- protocol – common_platform_enumeration: All standard info from Nmap
- start_scan/end_scan: Over time this will be used as a predictive tool when scanning same or similar networks. Or if a scan takes significantly longer/shorter than the same scan that was previously done, then there might be something wrong (like hackers taking down machines or a network failure cutting off access). This would be an alert condition. They are full datetime fields because scans might run over midnight.
That’s it for now … if you have any suggestions (or see glaring mistakes) let me know!