Journal Highly Available Systems, No. 4, 2015
Article in issue:
Processing unstructured textual data for support of search and rescue operations
Authors:
D.A. Devyatkin - Junior Research Scientist, Institute for Systems Analysis of FRC CSC RAS (Moscow). E-mail: devyatkin@isa.ru
A.O. Shelmanov - Junior Research Scientist, Institute for Systems Analysis of FRC CSC RAS (Moscow). E-mail: shelmanov@isa.ru
Abstract:
The paper discusses the main scientific and technical problems in creating methods and software tools that process unstructured textual data to support search and rescue operations. We review systems that use data and messages from social media for information and analytical support of response and recovery operations in emergency situations. The paper considers methods for focused crawling, for preliminary parsing of crawled data, and for information extraction from natural language texts. Against the background of this review, we propose approaches to the tasks that arise when developing methods and software tools for processing unstructured data and providing search and analytical support for search and rescue operations. We propose intelligent (ontology-based) crawl strategies for focused crawling and machine learning techniques for classifying individual pages of target resources. Natural language processing and indexing of texts will be performed by means of the Exactus platform. Rule-based approaches as well as machine learning techniques will be adapted to extract information from natural language texts related to emergency situations. Ontological resources and lexicons will be created to extract geographical objects and the names of ships and aircraft from texts. Structured data will be stored in distributed, scalable NoSQL databases, which can load and process huge amounts of data given sufficient hardware. We suggest requirements for software tools that support search and rescue operations. To satisfy these requirements, we propose a distributed service-based architecture. It provides the ability to process large streams of information gathered online, along with scalability, information security, and low cost of implementation and maintenance of intelligent data processing systems.
We plan to evaluate the considered methods and software tools experimentally on openly available retrospective data about emergencies that occurred in the Arctic zone.
Pages: 45-60