Suggestions and comments regarding the NetFlow weekly reports are
solicited. Questions are invited as well.
Suggestions and various ideas are documented. They fall into the
following categories:
not yet evaluated,
rejected (these include a reason in brackets),
not yet implemented (these migrate to the next category),
and done.
Some of the items below are actually questions, so this document
serves as a sort of FAQ as well as a TODO list.
- I assume that the Iperf traffic in that table is traffic with a
destination port of 5001, only, correct?
[Iperf is (both TCP and UDP) ports 5000-5009. This captures Iperf
traffic generated by the "bandwidth to the world" project at SLAC.]
- I notice bbftp is listed, but with 0% traffic... is that a
rounding artifact? (e.g., was traffic seen, but not enough to make the
chart once it was rounded to the number of decimals shown?)
[It was a bug (bbftp was included with ftp). It is now separate
and the numbers aren't zero.]
- In the full data set section, are UDP applications included along
with TCP applications? (e.g., I think you are doing so, but if so, I'd
expect more DNS-related traffic flows if that's the case...)
[In the full data set section, UDP applications are included as
well as TCP applications, as well as mixed. DNS is defined as port 53
TCP or UDP.]
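The port definitions quoted in these answers can be sketched as a small lookup. This is only an illustration in Python, not the code used to produce the reports: the function name is hypothetical, matching on either the source or the destination port is an assumption, and so is the order in which the rules are tried. The NTP rule comes from the ntp item further down (UDP port 123).

```python
# Hypothetical sketch of port-based labelling. Only the port numbers come
# from this document; names, structure, and rule order are assumptions.
IPERF_PORTS = range(5000, 5010)  # ports 5000-5009, both TCP and UDP


def classify(protocol, src_port, dst_port):
    """Return an application label for one flow, or "other"."""
    ports = (src_port, dst_port)  # assumption: either end may match
    if protocol in ("tcp", "udp"):
        if any(p in IPERF_PORTS for p in ports):
            return "iperf"
        if 53 in ports:  # DNS is port 53, TCP or UDP
            return "dns"
    if protocol == "udp" and 123 in ports:  # NTP is UDP port 123
        return "ntp"
    return "other"
```

Under these assumptions a TCP flow to port 5001 is labelled "iperf", and so is one to port 5009, which a rule for destination port 5001 only would miss.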
- Provide data files, not only plots.
[The time-series data is already
provided. The (much larger) bulk TCP raw datasets aren't (and contain
privacy-sensitive information).]
- Provide gnuplot scripts used to produce the plots.
[They're in the source.]
- Oh yes: in your analysis, you put IRC and AOL IM into a different
category than Napster/Gnutella/Kazaa... given the existence of DCC and
Aimster, I'm not sure that's correct.
[With IRC, DCC uses random port numbers, so DCC won't show up as
part of IRC; with AIM, we don't know what fraction of AIM is Aimster.]
- Can reports show the start and end time of the data set being
used? (e.g., rather than doing "week of" can we show explicitly a
start and stop time/date?)
[Globally meaningful timestamps are not readily available to us.
Mark Fullmer supplies us with daily files. We rely on correctness of
generation of these files.]
- Is there any way of showing which (if any) core nodes were not
contributing NetFlow records for periods during the analysis window?
[Node IDs are not present in the data format that we receive. We
cannot reliably tell if data from a given node is missing. We rely
on completeness of data supplied to us.]
- Is there confidence that the traffic type shown as FTP (in
"Popular Applications (Bulk TCP)") is indeed FTP, rather than a file
sharing application masquerading as FTP? E.g., can sources/sinks be
examined in the non-anonymized data by an I2 staff person to see if
they correspond with known FTP archives, say?
[We do not distinguish between p2p file-sharing applications that
use FTP ports and FTP. Doing so would have only limited usefulness.
In fact, at least one of these applications actually uses the FTP
protocol. Hand-examination of 1,700,000,000 or so flows per week is
hardly an option.]
- It appears that the current graphs simply show the weekly
aggregate values; it would be super if the actual daily values could
also be shown for the time series graphs.
[We chose to produce weekly (rather than daily) reports because we
felt that daily averaging isn't sufficient. It appears that even
weekly averaging displays signs of statistical volatility. We do not
feel that doing any finer-grain analysis than weekly is actually
helpful for our purposes. Finer granularity would help network
operations applications, but our reports aim to capture long-term
trends rather than provide operational support to any NOC. Even if
we wanted to support
NOCs, we wouldn't be able to conveniently provide timely data as
network view datasets are generated nightly.]
- It would help users to interpret the bulk TCP throughput
distribution graph if the empirically observed graph could be overlaid
with a theoretically expected graph (e.g., would one expect the 1-CDF
distribution to be declining exponential, or something else).
[We do not know what the theoretically expected distribution would
be. Our empirical evidence isn't at this point sufficient to make
observations, either.]
- Is there any plan to try to aggregate concurrent parallel flows
associated with applications such as parallelized FTP tools?
[Aggregation of parallel flows isn't planned. It is
computationally hard and brings in little value, since what we
want to observe is single-stream TCP throughput.]
- Analyses by type of participant (e.g., primary participant,
network peer, sponsored participant, SEGP, corporate participant,
etc.) would also be nice to see.
[The ongoing overhead of maintaining the tables of prefixes would
be unbearable. Maybe this is something that the NOC could do in
co-operation with Mark Fullmer of Ohio ITEC.]
- There are some tables that other weekly reports include that
aren't yet part of the Abilene report; any plans to add additional
stats over time? For example, the CANet ARDNOC weekly report (see,
for example http://www.canet3.net/stats/reports/tr_020210.html)
includes a variety of per peer/gigapop reports (e.g., a 30 minute
average transfer rate on a peer/gigapop basis; an AS matrix, showing
top src-dst pairs by src Peer/GigaPOP; top applications by
peer/gigapop, etc.)
[See previous item.]
- Any plans to begin breaking out IPv6 traffic by application?
[Because of our network design, the IPv4 core currently sees IPv6
traffic as GRE tunnels. NetFlow doesn't look inside. The amount of
it is tiny, anyway.]
- Any plans to begin looking at IP multicast traffic in more depth?
[Not unless the amount of multicast traffic increases and we have
ideas of how to classify it.]
- Would it make sense to move the (more general) table of IP
Protocol Distribution (Full Data Set) above the (more specific) table
of applications? (there's also a typo in that table legend)
[We present tables in essentially the order we want the reader
(who might stop reading at any point) to see them, not in any logical
order.]
- Add iMesh file-sharing application.
[Port hopping, Abilene core doesn't see control traffic.]
- Provide all-time Top10 table.
[Complicates week removal. A gimmick.]
- Replace 1-CDF with CDF, because it's The Standard.
[1-CDF and log scale allow us to zoom in on the high end (which
is what is interesting). CDF plots in log scale would be totally
useless; without log scale both CDF and 1-CDF are useless for
throughput and transfer sizes.]
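For readers unfamiliar with the term, 1-CDF at a value x is the fraction of observations greater than x. A minimal Python sketch of how such points could be computed from raw samples follows; the function name is hypothetical and this is not the code behind the plots.

```python
def complementary_cdf(samples):
    """Return sorted (x, fraction of samples strictly greater than x) pairs."""
    ordered = sorted(samples)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered):
        # Emit one point per distinct value, at its last occurrence.
        if i + 1 < n and ordered[i + 1] == x:
            continue
        points.append((x, (n - i - 1) / n))
    return points
```

The largest sample always gets a value of 0, which cannot be drawn on a logarithmic axis, so a plotting script would have to drop that last point.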
- Provide links forward and backwards a week for each table.
[Complicates week removal, makes the first week special, is a
hassle with little benefit added over time-series plots of every
parameter (which we already have).]
- Convert all numbers to HP engineering notation.
[What about people who bought calculators made by a different
firm? My MK-52 didn't have this.]
- Can a text-only version of the reports be sent to a mailing list
to which interested parties might subscribe, with a link to the
full-blown report? (sort of like Tony Bates's CIDR reports?)
[We decided against it. We expect very few people to be
sufficiently interested in this stuff to subscribe to a weekly
message. And these are probably the ones who could look at the
reports on the web...]
- If there are natural knees or inflection points on distribution
plots that should be highlighted, it would be great if vertical
reference lines could be added (e.g., if you believe that physical
choke points/natural break points play a role, could you add vertical
rules at 56K, 1.5Mbps, 10Mbps, 45Mbps, 100Mbps, say)?
[Hard to do right in gnuplot (in a way that doesn't break
autoscaling and autoticking). In addition, the most interesting
points (10Mb/s and 100Mb/s) are already present.]
- Table 3 caption: "Table 3. Popular Applications (Bulk TCPs only)"
or "Table 3. Popular Applications (Bulk TCP Subset)". Similarly,
the "total" label on the last line might be changed to "Bulk TCP
Total" or "Total (Bulk TCPs only)".
[We think the table is now less confusing. Added a note before
it, typeset in bold.]
- Any plans to include security related data in the reports, such as
anomalous flows identified as part of DDoS attacks, etc.?
[DDoS identification is not part of this report. We do include a
"Port 0" category that captures some fraction of the invalid traffic.
Bogus source address identification isn't easy because of routing
asymmetries.]
- 2234=DirectPlay (?)
[Not a major source of traffic; identification uncertain.]
- Produce some statistics (such as throughputs distribution) for
bulk TCP except for file sharing.
[The lasting value of this seems dubious. I did it as a one-off
exercise, and didn't find meaningful differences.]
- For Table 9, Packet Sizes (Full Data Set), can you provide AS's
for endpoints where frames >1500 bytes were seen?
[Not sure how to format it.]
- Port 7668 might be Aimster.
[Not a major source of traffic.]
- port 10700 could be KDX (an encrypted Hotline-like file sharing
product)
[Not a major source of traffic.]
- 4444 may be Kerberos 5->4/heimdal; however, it may also be a DOS or
CrackDown. I also find it intriguing that CSUPomona claims that it is
"Napster" (Google cached copy of that page)
[Too uncertain.]
- port 1336 may be ischat
[Not a major source of traffic.]
- 1026/udp may be statd
[I doubt it.]
- 1026/tcp may be Microsoft LSA or nterm remote login
[Or it could just be a low-numbered non-privileged port diluted by
noise.]
- 6968 might be use of netcat as part of an exploit against Microsoft
SQL server
[Not a lasting application, one would hope.]
- Produce per-AS median bulk TCP throughputs and report top 10 ASs.
- Add new application: ntp (123)
[Was in there from the beginning, see full traffic types table.
Defined as UDP port 123.]
- Can popular applications in the chart be sorted in descending
order by octets? (The "other" category can stay at the bottom, even if
it is unavoidably large)
- Could you add a table of contents/set of jump links at the top of
the document so that one can simply click on a section of interest to
immediately jump down to a particular chunk of the report?
- Can all figures and tables be numbered for easy reference?
- For the drill down graphs (e.g., clicking on 20.85% for NNTP
Octets in 20020225 report), can consecutive values be connected by
linear segments or splines or something else (currently the
unconnected dots which are being used are sort of hard to pick
out).
[This is easy to do, and right now (while we only have a small
number of points) would look better. Later, we will revert to dot
plots. They look better when there are many points, and they show
missing data nicely.]
- Time series box-plot-style plot of bulk TCP throughput; would show
5%, median, and 95% on the same graph.
- Include a piece of useful information (a "teaser") on the root
weekly page.
[The previous item will work nicely.]
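The 5%, median, and 95% summary proposed above for bulk TCP throughput could be computed per week along these lines. This is a hedged Python sketch: the function names are hypothetical, and the nearest-rank definition of a percentile is an assumption, since the item does not say which definition would be used.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, with a minimum rank of 1.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def weekly_summary(throughputs):
    """Return the (5%, median, 95%) throughputs for one week of flows."""
    return tuple(percentile(throughputs, p) for p in (5, 50, 95))
```

One such triple per week gives the three values needed for each box in the time series.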
- Can the fraction of total traffic associated with bulk flows be
shown vs the fraction of total traffic associated with everything
else? That
is, do bulk flows represent 10% of the total traffic, 38%, 73% or
something else (this should probably be at the top of the bulk flows
section)
- Convert all numbers to SI-style notation with k, M, G, T, etc.,
suffixes. Keep 4 significant digits in all numbers.
[We give in. God intended to have scientific notation used here,
but we ourselves end up doing the conversion for every number. Maybe
it just makes sense to have the computer do it...]
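Having the computer do the conversion might look like the following Python sketch. The function name and the suffix list are assumptions, and it rounds to at most 4 significant digits, dropping trailing zeros, which may or may not be what "keep 4 significant digits" was meant to require.

```python
SUFFIXES = ["", "k", "M", "G", "T"]  # powers of 1000, as in the suggestion


def format_si(value):
    """Format a number of 1 or more with up to 4 significant digits."""
    scaled = float(value)
    index = 0
    # Test the rounded value so that 999960 becomes "1M", not "1000k".
    while index < len(SUFFIXES) - 1 and float(f"{scaled:.4g}") >= 1000:
        scaled /= 1000
        index += 1
    return f"{scaled:.4g}{SUFFIXES[index]}"
```

For example, the 1,700,000,000 flows per week mentioned earlier would be printed as 1.7G.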
- Add new application: Shoutcast (8000-8005, ugh)
- Add new application: hotline (5500-5503)
- Add new application: snmp (161)
- Add new application: identd (113)
- Add new application: web proxy traffic (e.g., 3128)
- Add new application: socks proxy traffic (e.g., 1080)
- Add new application: battlenet (4000, 6112-6119)
- Add new application: quake (26000, 27910-27961/udp)
- Add new application: Starsiege Tribes (28000-28008)
- Add new application: Carracho (6700-6702)
- Add Blubster file-sharing application.
- Add WinMX file-sharing application.
- Add "total" line wherever it is applicable.
- Can the semi-log or log-log graph backgrounds be drawn with faint
hair-lines rather than just ticked on the axes?
- Add new application: DirectX game traffic (see:
http://support.microsoft.com/default.aspx?scid=kb;EN-US;q240429;
TCP 47624, UDP 6073, TCP/UDP 2300-2400)
- For the "fastest bulk TCP flows with unique AS source and
destination" could you actually generate two tables, one for Iperf
flows (which dominate the current display), and another for everything
else, each with ten or so values?
- Can traffic types be enumerated from the same set of classifications
for both the bulk and the full data set sections, so that comparability is
maintained? For example, SSH is the category shown in the bulk section, while
"crypto" is used as a type in the full data set section.
[Not sure what to do with application identification in bulk TCP
section. We try to make things consistent. If they get out of sync,
it's a bug that we want to fix.]
- Define all application types in the aggregated traffic type table.
- Application identification in Top10 table.
[Remove port numbers if application is identified. Keep
otherwise.]
- Current images are 640x480 in size; can they either be made larger
(e.g., say 800x600, more appropriate for folks running large tubes in
high resolution), or smaller, so that the semi-log and the log-log
versions could be seen side by side for comparison purposes?
[Make two small graphs side-by-side. "Click to enlarge."]
- In "Average Packet Sizes" table, change label "Packet size" to
"Average packet size".
[Will come up with a better explanatory text in front of the table.]
- If some days are missing, multiply all actual numbers and
concurrency by something like 7/number_of_days_present.
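The scaling suggested in this item amounts to one multiplication. A minimal Python sketch follows, with a hypothetical function name; it assumes that traffic on the missing days resembles traffic on the days that are present, which the item does not claim.

```python
def scale_to_full_week(observed_total, days_present, days_in_week=7):
    """Extrapolate a partial-week total to a full week."""
    if not 0 < days_present <= days_in_week:
        raise ValueError("days_present must be between 1 and days_in_week")
    return observed_total * days_in_week / days_present
```

With 6 of 7 daily files present, an observed total of 600 would be reported as 700.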
- Close all TD tags (e.g., in the Top10 table, where some newlines
wouldn't hurt either) and remove redundant ALIGN attributes.
- Insert new table after "Application Types (Full Data Set)" that
would summarize ports[number] data for guessing emerging
applications. Only output 5 most frequent (by octets) unidentified
ports.
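The table of unidentified ports described in this item could be produced roughly as follows. This is a Python sketch under stated assumptions: the (port, octets) record shape, the function name, and the set of already identified ports are all hypothetical and do not reflect the actual data format.

```python
from collections import Counter


def top_unidentified_ports(flows, known_ports, limit=5):
    """Return the `limit` unidentified ports carrying the most octets.

    `flows` is an iterable of (port, octets) pairs (assumed shape).
    """
    octets_by_port = Counter()
    for port, octets in flows:
        if port not in known_ports:
            octets_by_port[port] += octets
    return octets_by_port.most_common(limit)
```

Ranking by octets rather than by flow count follows the wording of the item.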
- Remove any mention of flows from full data set.
- In packet size distribution tables, only leave "packets" column.
- Add ECN characterization.
- New application: 6881-9/TCP==BitTorrent
- New application: Single-Source Multicast (SSM): destination in 232/8.
- New application: tcp/5031-3 is bbcp (SLAC's file transfer tool)
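The last three definitions above could be expressed as in the Python sketch below. It reads "6881-9" as ports 6881-6889 and "5031-3" as ports 5031-5033, and it matches on the destination port only; the function name and the order of the checks are assumptions as well.

```python
import ipaddress

SSM_RANGE = ipaddress.ip_network("232.0.0.0/8")  # destination in 232/8


def classify_new(protocol, dst_address, dst_port):
    """Return a label for the three proposed applications, else None."""
    if ipaddress.ip_address(dst_address) in SSM_RANGE:
        return "ssm"
    if protocol == "tcp" and 6881 <= dst_port <= 6889:
        return "bittorrent"
    if protocol == "tcp" and 5031 <= dst_port <= 5033:
        return "bbcp"
    return None
```

The SSM check runs first because it depends on the destination address rather than on a port.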
If your suggestion, request, comment, or question is not addressed
above to your satisfaction, please contact Stanislav Shalunov.
(An edited extract of your communication will probably appear here.)
$Date: 2004/05/20 23:40:40 $