Thursday, July 24, 2014

URL Analysis - a [semi] data-driven approach

--------------------------------------
Been a while since I sat down and posted something here... Still slaving away at the grindstone, but having fun and doing something I enjoy (shhh... don't let my bosses know though).

I don't think I've finished any of the projects I've been working on either; I basically have the same list with more added... though the rate at which they've been piling up has slowed down a little. Or increased. Who knows.

--------------------------------------

URL analysis is something I've been working on since I first started trying to build pattern matches for malicious websites back in 2010-2011. It can be both easy and tricky.

Easy: Well, easier if you know regular expressions (RegEx/RE/PCRE) and how to finesse a pattern for the tool you are using. There can be a bit of variance between products; Splunk understands RegEx a little differently than Suricata or Snort does, and so on. So, figure out what you have to write to. Additionally, take advantage of other people's RegEx and what they are writing for. I highly recommend looking at the rules the awesome people of the Emerging Threats community have put out; that's where I've learned the most about pattern matching and refinement.

I recommend pulling down their latest rule set from rules.emergingthreats.net. I like grabbing the emerging-all.rules file (the everything-in-one file) and then using grep to find the rules I am looking for.

An example of extracting the PCREs for my favorite exploit kit, Fiesta:

 grep -i 'fiesta' emerging-all.rules | egrep -o 'pcre:"[^"]*"'

This will spit out just the PCRE bits from every rule that matches Fiesta; the [^"]* stops the match at the closing quote instead of greedily running to the last quote on the line. It can probably be improved, but it works.

Tricky: This is where approach, and having a good-sized data set to play with, is extremely important. Building RE matches against malicious URLs is tricky, depending on the deviousness of the miscreants, because they like to make their deviance hard to match on. Either the URL is terribly dynamic, constantly changing lengths and other characteristics, so a solid match, or even an anchor, is hard to find; or it blends in with common web traffic, so figuring out how to separate the badness from the good is time consuming.

I've written before about my (and friends') work on tracking and writing signatures to match Neosploit, a.k.a. Fiesta kit, and hopefully that writing shows a marked growth and maturation in both the sample data and the matching.

A quick aside on tools: there are certain things that must be taken into account when building Snort/Suricata signatures that can slide in something like Splunk. Anchoring on a string of text is one of them; a frustration with Fiesta kit was not really having anything to anchor off of, which was a pain and made for rough, intense rules for the IDS engine. Splunk, however, will generally accept poorly written searches, and will hopefully kill a poorly written search-and-RE combination if it can't muscle through it. Ugh, I've written a few doozies hunting Fiesta in my time. And though I thought I was following 'start small, get bigger with testing', I was definitely not doing as well as I thought. Yay for not melting servers.
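To make the anchoring point concrete, here is a toy Python illustration (my own made-up pattern and input, and nothing to do with how Snort or Splunk work internally). A pattern with no fixed starting point gets retried at every offset of the input, while an anchored one is tried once at the start and bails out fast:

 import re
 import timeit

 # A long, non-matching "URL" to scan; the contents are arbitrary.
 haystack = "/" + "a" * 50000

 # No anchor: search() retries this pattern at every offset.
 floating = re.compile(r"[a-z0-9]{8}\.php\?k=\d+")

 # Anchored: match() tries the pattern at offset 0 only.
 anchored = re.compile(r"/[a-z0-9]{8}\.php\?k=\d+")

 print("floating: %.4f" % timeit.timeit(lambda: floating.search(haystack), number=50))
 print("anchored: %.4f" % timeit.timeit(lambda: anchored.match(haystack), number=50))

The gap between those two numbers is roughly the extra work the engine eats when a rule has nothing to narrow down where the PCRE should even be attempted.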

Yes, I am rambling...

Recently, I've had the pleasure of seeing some ninjas in action when it comes to hunting miscreants and doing good things on the Internet. One of the challenges I have enjoyed is trying to attack different parts of the exploit kit chain of events. It's always good to have most of the kits stopped before the user is exposed, but there is a good amount of data to be gathered even when the incident has been thwarted.

(Warning, still very much a python noob... so the following idea isn't magic of any sort)

Something I have been working on is a python script that I can feed parts of URLs, or entire URLs, to find their lengths (a rough sketch of the idea is below). Why? When sifting through dynamically generated crap that is attempting to hide among annoying, but (hopefully) innocent, dynamically generated traffic, I end up dealing with wide ranges of traffic types. Being able to narrow my match down until it is specific enough for at least a low false positive rate gives me something to go on, and something I'm not too embarrassed to share. (Who am I kidding, I'm still a noob, and if I get something that matches but hit a brick wall on refining it, I will share it and let the pros and ninjas go to work.)
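Here is a minimal sketch of what the core of such a script might look like; it is a stand-in, not the actual script. It reads one URL or URL chunk per line on stdin and prints the length next to it, ready for the usual Unix tools downstream:

 #!/usr/bin/env python
 # Read one URL (or URL chunk) per line and print its length plus
 # the original value, tab separated, for sorting and counting.
 import sys

 for line in sys.stdin:
     chunk = line.strip()
     if not chunk:
         continue
     print("%d\t%s" % (len(chunk), chunk))

Feed it something like cat fiesta_urls.txt | python url_len.py | sort -n (file names made up, of course) and the length distribution of the malicious set falls right out.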

I need to work on the script some more to tailor it to the data I have, and then have it spit out data usable both for my own purposes and for sharing. Right now I can take one field, calculate the length, and then, using the tools the Unix gods bestowed on us (sort/uniq/cat/sed/awk/less), find the ranges present in the malicious traffic (a sketch of folding that step into the script itself follows below). Simple, but important. Being able to tell an EK gate apart from analytics-site traffic is the difference between driving your SOC mad with false positives and solid intel on which legit sites have been popped, or which ad rotators are serving bad content, and are now sending users to bad places.
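The sort/uniq pipeline does the job, but the range-finding can also live in the script itself. A hedged sketch of that idea, standing in for sort | uniq -c | sort -rn (again, not the actual script):

 #!/usr/bin/env python
 # Tally the length of each input line and report the range and the
 # distribution, most common lengths first.
 import sys
 from collections import Counter

 counts = Counter(len(line.strip()) for line in sys.stdin if line.strip())

 if counts:
     print("min length: %d" % min(counts))
     print("max length: %d" % max(counts))
     for length, hits in counts.most_common():
         print("%6d hits at length %d" % (hits, length))

Run the malicious URLs through that and you get the band of lengths to build (and bound) your RegEx quantifiers around.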

Learn how to build solid network connections with good people who share good info; pivot off of that info in your own environment; gather as much as you can; see if you can improve (expand and tighten) the patterns; and, most importantly, share back.

It is one of the best feelings and experiences to be able to share back with the community and help protect others from the evils of malicious miscreants.

Find things, build matches, share and be awesome!

Happy hunting!
-Demon117

@demon117
