This is going to be an informal post.  If you are looking for an Open Data Model (ODM), check out Apache Spot.  This video gives the general idea of the whole project:

https://youtu.be/8c_Z1G-MyZM

My interest is in the ODM.  Here is a direct link:

http://spot.incubator.apache.org/project-components/open-data-models/

It is a cyber-focused data model that takes user, endpoint, and network data into account.  A good data model can really help direct your development.  Determining the common fields is essential for correlating data sources later.
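Here is a minimal sketch of what that correlation looks like once two sources agree on a field name.  The sample events and the field names (src_ip, url, process) are made up for illustration and are not taken from the Spot ODM.

```python
from collections import defaultdict

# Hypothetical sample events from two different sources.
# The only thing they have in common is the shared field "src_ip".
proxy_events = [
    {"src_ip": "10.0.0.5", "url": "http://example.com/a"},
    {"src_ip": "10.0.0.9", "url": "http://example.org/b"},
]
endpoint_events = [
    {"src_ip": "10.0.0.5", "process": "powershell.exe"},
    {"src_ip": "10.0.0.7", "process": "chrome.exe"},
]


def correlate(left, right, key):
    """Pair up records from two sources that share the same value for `key`."""
    index = defaultdict(list)
    for record in right:
        index[record[key]].append(record)
    # Merge each left record with every right record that has the same key value.
    return [
        {**l, **r}
        for l in left
        for r in index.get(l[key], [])
    ]


matches = correlate(proxy_events, endpoint_events, "src_ip")
print(matches)
```

If the two sources had named that field differently, the join would need a translation step first, which is exactly the work a shared data model saves you.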

If any of you are wondering, “Why not use Splunk’s CIM?”, that is an option, though I found it to be overly complicated.  What Splunk attempted to do is take every data type and tell vendors what they should name these fields for integration purposes.  A better approach is to focus on the data fields that might get shared.  The rest can be named whatever.  Take mail, for example.  Apache Spot defines:

 

 

SMTP

| Field name | Data type | Description |
|---|---|---|
| trans_depth | int | Depth of email into SMTP exchange |
| headers_helo | string | Helo header |
| headers_mailfrom | string | Mailfrom header |
| headers_rcptto | string | Rcptto header |
| headers_date | string | Header date |
| headers_from | string | From header |
| headers_to | string | To header |
| headers_reply_to | string | Reply to header |
| headers_msg_id | string | Message ID |
| headers_in_reply_to | string | In reply to header |
| headers_subject | string | Subject |
| headers_x_originating_ip4 | bigint | Originating IP address |
| headers_first_received | string | First to receive message |
| headers_second_received | string | Second to receive message |
| last_reply | string | Last reply in message chain |
| path | string | Path of message |
| user_agent | string | User agent |
| tls | boolean | Indication of TLS use |
| is_webmail | boolean | Indication of webmail |

Now compare that to the Splunk CIM model for email:

| Dataset name | Field name | Data type | Description | Possible values |
|---|---|---|---|---|
| Email | action | string | Action taken by the reporting device. | delivered, blocked, quarantined, deleted |
| Email | delay | number | Total sending delay in milliseconds. | |
| Email | dest | string | The endpoint system to which the message was delivered. You can alias this from more specific fields, such as dest_host, dest_ip, or dest_name. | |
| Email | dest_bunit | string | The business unit of the endpoint system to which the message was delivered. | |
| Email | dest_category | string | The category of the endpoint system to which the message was delivered. | |
| Email | dest_priority | string | The priority of the endpoint system to which the message was delivered. | |
| Email | duration | number | The amount of time for the completion of the messaging event, in seconds. | |
| Email | file_hash | string | The hashes for the files attached to the message, if any exist. | |
| Email | file_name | string | The names of the files attached to the message, if any exist. | |
| Email | file_size | number | The size of the files attached the message, in bytes. | |
| Email | internal_message_id | string | Host-specific unique message identifier (such as aid in sendmail, IMI in Domino, Internal-Message-ID in Exchange, and MID in Ironport). | |
| Email | message_id | string | The globally-unique message identifier. | |
| Email | message_info | string | Additional information about the message. | |
| Email | orig_dest | string | The original destination host of the message. The message destination host can change when a message is relayed or bounced. | |
| Email | orig_recipient | string | The original recipient of the message. The message recipient can change when the original email address is an alias and has to be resolved to the actual recipient. | |
| Email | orig_src | string | The original source of the message. | |
| Email | process | string | The name of the email executable that carries out the message transaction, such as sendmail, postfix, or the name of an email client. | |
| Email | process_id | number | The numeric identifier of the process invoked to send the message. | |
| Email | protocol | string | The email protocol involved, such as SMTP or RPC. | smtp, imap, pop3, mapi |
| Email | recipient | string | A field listing individual recipient email addresses, such as recipient="foo@splunk.com", recipient="bar@splunk.com". | |
| Email | recipient_count | number | The total number of intended message recipients. | |
| Email | recipient_status | string | The recipient delivery status, if available. | |
| Email | response_time | number | The amount of time it took to receive a response in the messaging event, in seconds. | |
| Email | retries | number | The number of times that the message was automatically resent because it was bounced back, or a similar transmission error condition. | |
| Email | return_addr | string | The return address for the message. | |
| Email | size | number | The size of the message, in bytes. | |
| Email | src | string | The system that sent the message. You can alias this from more specific fields, such as src_host, src_ip, or src_name. | |
| Email | src_bunit | string | The business unit of the system that sent the message. | |
| Email | src_category | string | The category of the system that sent the message. | |
| Email | src_priority | string | The priority of the system that sent the message. | |
| Email | src_user | string | The email address of the message sender. | |
| Email | src_user_bunit | string | The business unit of the message sender. | |
| Email | src_user_category | string | The category of the message sender. | |
| Email | src_user_priority | string | The priority of the message sender. | |
| Email | status_code | string | The status code associated with the message. | |
| Email | subject | string | The subject of the message. | |
| Email | tag | string | This automatically generated field is used to access tags from within data models. Add-on builders do not need to populate it. | |
| Email | url | string | The URL associated with the message, if any. | |
| Email | user | string | The user context for the process. This is not the email address for the sender. For that, look at the src_user field. | |
| Email | user_bunit | string | The business unit of the user context for the process. | |
| Email | user_category | string | The category of the user context for the process. | |
| Email | user_priority | string | The priority of the user context for the process. | |
| Email | vendor_product | string | The vendor and product of the email server used for the email transaction. This field can be automatically populated by vendor and product fields in your data. | |
| Email | xdelay | string | Extended delay information for the message transaction. May contain details of all the delays from all the servers in the message transmission chain. | |
| Email | xref | string | An external reference. Can contain message IDs or recipient addresses from related messages. | |
| Filtering | filter_action | string | The status produced by the filter, such as "accepted", "rejected", or "dropped". | |
| Filtering | filter_score | number | Numeric indicator assigned to specific emails by an email filter. | |
| Filtering | signature | string | The name of the filter applied. | |
| Filtering | signature_extra | string | Any additional information about the filter. | |
| Filtering | signature_id | string | The id associated with the filter name. | |

You might think, “more is better.”  If you feel that way, good for you; go forth and use Splunk's CIM.  My experience makes me think less is better.  Life is complicated enough without trying to control everything.  I worked through the CIM model for Nessus vulnerability information, and that was an experience I never want to repeat.  I simply want a model that aligns the fields that will be used to correlate information.  Whatever will likely not be correlated, leave alone.  As you begin to build your data sources, you can use Apache Spot to help identify fields that will be significant and may need to be indexed.
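The "align only what you correlate" idea can be sketched in a few lines of Python.  The target names (headers_mailfrom, headers_rcptto, headers_msg_id) come from the Spot SMTP fields listed earlier in this post; the source-side names (sender, rcpt, msgid, queue_ms) are hypothetical stand-ins for whatever your mail log actually emits.

```python
# Hypothetical mapping from one source's native field names to shared names.
# Only the fields expected to be used for correlation are listed here.
SHARED_FIELDS = {
    "sender": "headers_mailfrom",
    "rcpt": "headers_rcptto",
    "msgid": "headers_msg_id",
}


def align(record, mapping=SHARED_FIELDS):
    """Rename correlation fields to shared names; leave every other field as-is."""
    return {mapping.get(name, name): value for name, value in record.items()}


# Hypothetical raw event from a mail log.
raw = {
    "sender": "a@example.com",
    "rcpt": "b@example.com",
    "msgid": "<1@example.com>",
    "queue_ms": 42,  # not used for correlation, so it keeps its native name
}

aligned = align(raw)
print(aligned)
```

Everything outside the mapping passes through under its original name, so onboarding a new source only costs you a handful of renames.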

O'Reilly had a security conference where Rocky DeStefano gave a presentation "Moving cybersecurity forward: Introducing Apache Spot":

https://conferences.oreilly.com/security/sec-ny-2016/public/schedule/detail/56305

Rocky DeStefano was Director of the Security Operations practice at ArcSight.  I do realize he is part of Cloudera now, and that is one of the major companies supporting Spot.  Cloudera is all about big data and bringing it together.  While all this talk is machine learning focused, machine learning is a good application that requires information integration.  My point is that when you bring together people who have experience with data models developed for products, and they work towards a community-based ODM, the resulting Spot ODM is something worth checking out.