Logstash would pull from the RSS feeds every hour, as per our configuration. (Note: the RSS input plugin does not ship as part of the Logstash package; it can be installed by running bin/logstash-plugin install logstash-input-rss.) We are using tags for Kibana filtering purposes, so we wanted our tags to be succinct, letting us clearly differentiate the news feeds we chose.

On each pull we would sporadically get one new document and several duplicates. We would be like, "Cool! A document referencing a zero-day!" and then an hour later we were like, "Sweet! This is the same article we saw an hour ago!" It was ugly, and we knew it would negatively impact our analysis and research going forward. We quickly determined that Logstash filtering would be necessary to avoid ingesting duplicate documents.

We initially wrote a Python script for data deduplication, which worked but seemed excessive. Further research led us to Alexander Marquardt's blog post, "Deduplicating documents in Elasticsearch," which in turn led us to Elastic's fingerprint filter plugin - a much simpler and equally effective solution.
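Putting those pieces together, a minimal Logstash pipeline along these lines would cover both the hourly RSS pull (with a tag for Kibana filtering) and fingerprint-based deduplication. This is a sketch, not our exact configuration: the feed URL, tag, hosts, and index name are placeholders, and the fields you hash in the fingerprint filter depend on your feed's structure.

```
input {
  rss {
    url      => "https://example.com/security-news/feed.xml"  # placeholder feed URL
    interval => 3600                                          # pull every hour
    tags     => ["examplefeed"]                               # succinct tag for Kibana filtering
  }
}

filter {
  fingerprint {
    source => "message"                   # field(s) to hash; duplicates hash identically
    target => "[@metadata][fingerprint]"  # stored in metadata, not indexed
    method => "MURMUR3"
  }
}

output {
  elasticsearch {
    hosts       => ["localhost:9200"]               # placeholder
    index       => "news-feeds"                     # placeholder index name
    document_id => "%{[@metadata][fingerprint]}"    # duplicate articles overwrite instead of re-indexing
  }
}
```

Using the fingerprint as the Elasticsearch document_id means a duplicate article simply overwrites its earlier copy rather than creating a new document, which is the approach Marquardt's post describes.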