Illustrating KDE speed and depth with SQL queries
 
Jet engineIn part 1 of this series we introduced Kentik Data Engine™, the backend to Kentik Detect™, which is a large-scale distributed datastore that is optimized for querying IP flow records (NetFlow v5/9, sFlow, IPFIX) and related network data (GeoIP, BGP, SNMP). We started our tour of KDE with a word about our database schema, and then used SQL queries to quantify how design features such as time-slice subqueries, results caching, and parallel dataseries help us achieve extraordinarily fast query performance even over huge volumes of data. In this post we’ll continue our tour of KDE, starting with filtering and aggregation.
 
SQL gives us easy methods to do both filtering and aggregation, either by adding additional terms to the WHERE clause or by adding a GROUP BY clause. We can filter and aggregate by any combination of columns using those options. Let’s look at traffic volume for the top source dest country pairs where both the source and dest are outside the US:

SELECT src_geo,
   dst_geo,
   Sum(both_bytes) AS f_sum_both_bytes
FROM big_backbone_router
WHERE src_geo <> ‘US’
  AND dst_geo <> ‘US’
  AND i_start_time > now() – interval ‘1 week’
GROUP BY src_geo,
   dst_geo
ORDER BY f_sum_both_bytes DESC
LIMIT 10
| src_geo   | dst_geo   | f_sum_both_bytes   |
| HK        | BR        | 27568963549063     |
| GB        | BR        | 8594666838327      |
| NL        | DE        | 6044367035356      |
| HK        | GB        | 6004897386415      |
| HK        | SG        | 5305439621766      |
| HK        | CO        | 4893091337832      |
| NL        | BR        | 4330923877223      |
| HK        | JP        | 4086102823771      |
| HK        | PL        | 3833512917644      |
| HK        | TR        | 3501243783418      |
SELECT 10
Time: 0.675s

The first row, HK (Hong Kong) BR (Brazil) seems like an interesting, unlikely pair. Let’s filter on that and see who was talking to whom from a network/ASN perspective:

SELECT src_as,
   i_src_as_name,
   dst_as,
   i_dst_as_name,
   Sum(both_bytes) AS f_sum_both_bytes
FROM big_backbone_router
WHERE src_geo = ‘HK’
  AND dst_geo = ‘BR’
  AND i_start_time > Now() – interval ‘1 week’
GROUP BY src_as,
   i_src_as_name,
   dst_as,
   i_dst_as_name
ORDER BY f_sum_both_bytes DESC
LIMIT 10
| src_as | i_src_as_name      |   dst_as | i_dst_as_name      | f_sum_both_bytes   |
|  65001 | Global Transit Net |    65101 | Eyeball Net A      | 11686340849432     |
|  65001 | Global Transit Net |    65201 | Eyeball Net B ASN1 | 5076468451751      |
|  65001 | Global Transit Net |    65301 | CDN Net A          | 3337948976347      |
|  65001 | Global Transit Net |    65102 | Eyeball Net C      | 1261908657743      |
|  65001 | Global Transit Net |    65103 | Eyeball Net D      | 1234101190857      |
|  65001 | Global Transit Net |    65104 | Eyeball Net E      | 1211922009485      |
|  65001 | Global Transit Net |    65105 | Eyeball Net F      | 334959552542       |
|  65001 | Global Transit Net |    65202 | Eyeball Net B ASN2 | 247936925394       |
|  65001 | Global Transit Net |    65106 | Eyeball Net G      | 229671528291       |
|  65001 | Global Transit Net |    65401 | Enterprise Net A   | 209961848484       |
SELECT 10
Time: 0.458s

Substring and regex matching
 
If we wanted to drill down on the first row, we could additionally filter on the specific source/dest ASNs. But let’s filter on the ASN names instead, so we can see how KDE supports SQL substring and regular expression matching on text columns. Substring/regex matching also works on other strings such as interface names and descriptions, AS_PATHs, and user-defined flow tags.

SELECT ipv4_src_addr AS f_cidr24_ipv4_src_addr,
   ipv4_dst_addr AS f_cidr24_ipv4_dst_addr,
   Sum(both_bytes) AS f_sum_both_bytes
FROM big_backbone_router
WHERE i_src_as_name ~ ‘Peer|Transit’
  AND i_dst_as_name LIKE ‘%Eyeball Net A%’
  AND i_start_time > Now() – interval ‘1 week’
GROUP BY f_cidr24_ipv4_src_addr,
   f_cidr24_ipv4_dst_addr
ORDER BY f_sum_both_bytes DESC
LIMIT 10
| f_cidr24_ipv4_src_addr  | f_cidr24_ipv4_dst_addr   | f_sum_both_bytes   |
| 10.129.204.0            | 10.156.25.0              | 115419904954801    |
| 10.221.155.0            | 10.141.201.0             | 78651524382556     |
| 10.156.25.0             | 10.129.204.0             | 62329500664567     |
| 10.254.246.0            | 10.31.38.0               | 39162753399340     |
| 10.117.35.0             | 10.31.38.0               | 39073550458830     |
| 10.144.99.0             | 10.254.210.0             | 28582936121869     |
| 10.31.73.0              | 10.254.244.0             | 27632400104976     |
| 10.31.75.0              | 10.17.153.0              | 26265050083173     |
| 10.144.99.0             | 10.254.244.0             | 25763076333705     |
| 10.17.100.0             | 10.93.63.0               | 23713868194889     |
SELECT 10
Time: 0.980s

As we see above, KDE can aggregate IP address columns by arbitrary subnet masks. In this query we’ve grouped the data by source dest /24 subnet pairs. KDE also natively understands IP addresses and CIDR notation in filters, so we can drill down on the top subnet pair and look at pairs of individual IPs:

SELECT ipv4_src_addr,
   ipv4_dst_addr,
   Sum(both_bytes) AS f_sum_both_bytes
FROM big_backbone_router
WHERE ipv4_src_addr LIKE ‘10.129.204.0/24’
  AND ipv4_dst_addr LIKE ‘10.156.25.0/24’
  AND i_start_time > Now() – interval ‘1 week’
GROUP BY ipv4_src_addr,
   ipv4_dst_addr
ORDER BY f_sum_both_bytes DESC
LIMIT 10
| ipv4_src_addr  | ipv4_dst_addr  | f_sum_both_bytes   |
| 10.129.204.41  | 10.156.25.5    | 101922511168965    |
| 10.129.204.34  | 10.156.25.4    | 16534277019052     |
| 10.129.204.69  | 10.156.25.79   | 12821801454        |
| 10.129.204.85  | 10.156.25.79   | 12408606234        |
| 10.129.204.116 | 10.156.25.79   | 11170668135        |
| 10.129.204.110 | 10.156.25.79   | 11078339112        |
| 10.129.204.76  | 10.156.25.79   | 10895308401        |
| 10.129.204.84  | 10.156.25.79   | 10497115055        |
| 10.129.204.115 | 10.156.25.79   | 10361345421        |
| 10.129.204.75  | 10.156.25.79   | 9923494659         |
SELECT 10
Time: 0.660s

Time-series data
 
Summary tables are great, but often we want time-series data to build visualizations. In KDE, this is as simple as adding a time column to the SELECT and GROUP BY clauses. Let’s take the top IP pair from the results above and get time-series data (bytes and packets) over the last week:

SELECT i_start_time,
   Max(i_duration),
   Sum(both_bytes) AS f_sum_both_bytes,
   Sum(both_pkts) AS f_sum_both_pkts
FROM big_backbone_router
WHERE ipv4_src_addr LIKE ‘10.129.204.41’
  AND ipv4_dst_addr LIKE ‘10.156.25.5’
  AND i_start_time > Now() – interval ‘1 week’
GROUP BY i_start_time
ORDER BY i_start_time ASC
| i_start_time              |   max | f_sum_both_bytes   | f_sum_both_pkts   |
| 2016-03-10 23:00:00+00:00 |  3600 | 475231866603       | 558646000         |
| 2016-03-11 00:00:00+00:00 |  3600 | 141987820665       | 180911990         |
| 2016-03-11 01:00:00+00:00 |  3600 | 85119841990        | 130098569         |
| 2016-03-11 02:00:00+00:00 |  3600 | 102749092833       | 124245217         |
| 2016-03-11 03:00:00+00:00 |  3600 | 40349266424        | 74404852          |
| 2016-03-11 04:00:00+00:00 |  3600 | 47615668871        | 80084659          |
| 2016-03-11 05:00:00+00:00 |  3600 | 39601556357        | 71966274          |
| 2016-03-11 06:00:00+00:00 |  3600 | 44595721100        | 55644084          |
| 2016-03-11 07:00:00+00:00 |  3600 | 36984645947        | 73379683          |
| 2016-03-11 08:00:00+00:00 |  3600 | 57309415120        | 86561840          |
| 2016-03-11 09:00:00+00:00 |  3600 | 221576669330       | 219835996         |
SELECT 168
Time: 0.430s

It doesn’t make sense to return 10,800 1-minute intervals for a week-long time series query; you can’t display that many data points in a visualization that fits in your average web browser. So KDE auto-selects the interval width to return an appropriate number of points / rows, based on the overall time range covered by the query. In this case, for a query covering a week, we get 168 one-hour intervals. For more information, see Time Rounding. KDE can also return a column showing the width (in seconds) of each interval so we (or the Kentik Detect frontend) can easily calculate rates like bits/sec or packets/sec.

Maintaining full granularity
 
None of the per-interval data above was pre-calculated; it was generated on the fly, at query time, from the individual records that we saw in the response to the first query in part 1 of this blog series. That means that when we see an area of interest it’s easy to narrow the time range or apply filters to drill down. We can get full-resolution, 1-minute granularity for any historical period within the data that KDE stores, such as this one-hour time range from 90 days ago:

SELECT i_start_time,
   Max(i_duration) AS i_duration,
   Sum(both_bytes) AS f_sum_both_bytes,
   Sum(both_pkts) AS f_sum_both_pkts
FROM big_backbone_router
WHERE i_start_time > Now() – interval ‘2159 hours’
  AND i_start_time < Now() - interval ‘2158 hours’
GROUP BY i_start_time
ORDER BY i_start_time ASC
| i_start_time              |   i_duration | f_sum_both_bytes   | f_sum_both_pkts   |
| 2015-12-15 22:29:00+00:00 |           60 | 179245157376       | 189853696         |
| 2015-12-15 22:30:00+00:00 |           60 | 181873404928       | 192246784         |
| 2015-12-15 22:31:00+00:00 |           60 | 183132584960       | 193918976         |
| 2015-12-15 22:32:00+00:00 |           60 | 180520254464       | 191270912         |
| 2015-12-15 22:33:00+00:00 |           60 | 179917988864       | 190438400         |
| 2015-12-15 22:34:00+00:00 |           60 | 175917901824       | 185893888         |
| 2015-12-15 22:35:00+00:00 |           60 | 174799783936       | 184879104         |
| 2015-12-15 22:36:00+00:00 |           60 | 175613580288       | 185396224         |
| 2015-12-15 22:37:00+00:00 |           60 | 173256493056       | 182279168         |
| 2015-12-15 22:38:00+00:00 |           60 | 170268498944       | 179223552         |
| 2015-12-15 22:39:00+00:00 |           60 | 169344593920       | 178819072         |
| 2015-12-15 22:40:00+00:00 |           60 | 169141132288       | 178192384         |
| 2015-12-15 22:41:00+00:00 |           60 | 169238467584       | 178177024         |
SELECT 60
Time: 1.293s

In summary, the Kentik Data Engine provides a very performant, scalable, and flexible platform that enables Kentik Detect to analyze network traffic data without making any compromises on the granularity or specificity of the results. But that’s actually just the tip of the iceberg. We’re excited about the other types of datasets that KDE will consume, and the additional applications we’ll be able to build on top of it in the areas of network performance, security, and business intelligence.
 
Want to try KDE with your own network data? Start a free trial of Kentik Detect and experience KDE’s performance first hand. Or maybe you’d like to help us take KDE to the next level? We’re hiring.