Journal «Системы высокой доступности» (Highly Available Systems), No. 4, 2015
Article in this issue:
Analysis of unstructured text data to support search and rescue operations
Keywords:
information extraction
machine learning
unstructured data
search and rescue operations
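Authors:
D.A. Devyatkin, Junior Researcher, Institute for Systems Analysis, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences (Moscow). E-mail: devyatkin@isa.ru
A.O. Shelmanov, Junior Researcher, Institute for Systems Analysis, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences (Moscow). E-mail: shelmanov@isa.ru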
Abstract:
The paper examines the main scientific and technical problems in developing methods and software for analyzing unstructured text data to support search and rescue (SAR) operations. It reviews systems that analyze social network messages and use them to provide information and analytical support for responding to the consequences of various emergencies. It covers methods for focused collection and preliminary analysis of unstructured and semi-structured data, as well as methods for extracting information from natural-language texts. A distributed service-oriented architecture is proposed for software that analyzes unstructured text data to provide search and analytical support for SAR operations.
Pages: 45-60