INFORMATION PROCESSING AND DATA ANALYSIS
O.N. Tushkanova "Experimental Study of Numerical Measures for Assessing Associative and Causal Relationships in Big Data"

Abstract.

The paper briefly describes the essence of associative and causal data analysis methods and the problems that hinder their application to big data. It outlines a scheme for the accelerated search for a set of causal relationships. It lists the numerical measures proposed to date in statistics, sociology, machine learning, and data mining for assessing the “strength” of an associative relationship between a pair of attributes. It presents an analysis of their formal properties, in terms of which the necessary conditions that measures of causal relationships must satisfy are formulated. It reports the results of an experimental study of the selected numerical measures, which make it possible to form a ranked list of the most promising measures suitable for assessing the strength of a causal relationship.

Keywords:

association measure, causal measure, causal analysis, big data.

pp. 23-32.

O.N. Tushkanova

"Experimental Study of the Numerical Measures for Mining Associative and Causal Relationship in Big Data"

Big data analysis is one of the most pressing problems in information technology. In this context, associative and causal analyses are considered promising approaches to efficiently discovering relationships between big data attributes. However, the causal structure discovery models traditionally used have exponential complexity. A current trend in big data causal analysis is to use various measures that indicate the “strength” of association between pairs of attributes. However, data scientists have no guidance on which of these measures are preferable in which applications. The paper surveys the numerical measures proposed to date and carries out theoretical and experimental comparative analyses to identify those that best meet the basic requirements of big data processing. Conclusions are drawn about the most promising measures, which are recommended to researchers and practitioners in big data causal analysis.

Keywords:

association measure, causal measure, causal analysis, big data.
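
To make the notion of a pairwise “strength” measure concrete, the following is a minimal Python sketch and is not taken from the paper. The abstract does not list the measures studied, so the ones shown here (support, confidence, lift, Jaccard index, Yule's Q, and Piatetsky-Shapiro leverage) are assumed, well-known examples. The `Contingency` class, the function names, and the sample counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    """2x2 co-occurrence counts for two binary attributes A and B.

    Hypothetical container used only for this illustration.
    """
    n11: int  # records where A and B are both present
    n10: int  # records where A is present and B is absent
    n01: int  # records where A is absent and B is present
    n00: int  # records where both are absent

    @property
    def n(self) -> int:
        """Total number of records."""
        return self.n11 + self.n10 + self.n01 + self.n00


def support(t: Contingency) -> float:
    """P(A and B)."""
    return t.n11 / t.n


def confidence(t: Contingency) -> float:
    """P(B | A) for the rule A -> B."""
    return t.n11 / (t.n11 + t.n10)


def lift(t: Contingency) -> float:
    """P(B | A) / P(B); equals 1 when A and B are independent."""
    p_b = (t.n11 + t.n01) / t.n
    return confidence(t) / p_b


def jaccard(t: Contingency) -> float:
    """|A and B| / |A or B|."""
    return t.n11 / (t.n11 + t.n10 + t.n01)


def yule_q(t: Contingency) -> float:
    """Yule's Q, ranging from -1 to 1, with 0 for independence."""
    concordant = t.n11 * t.n00
    discordant = t.n10 * t.n01
    return (concordant - discordant) / (concordant + discordant)


def leverage(t: Contingency) -> float:
    """Piatetsky-Shapiro leverage: P(A and B) - P(A) * P(B)."""
    p_a = (t.n11 + t.n10) / t.n
    p_b = (t.n11 + t.n01) / t.n
    return support(t) - p_a * p_b


# Illustrative counts only (not data from the paper).
# Zero denominators are not handled in this sketch.
table = Contingency(n11=40, n10=10, n01=20, n00=30)
for measure in (support, confidence, lift, jaccard, yule_q, leverage):
    print(f"{measure.__name__:>10}: {measure(table):.4f}")
```

Each of these measures needs only the four counts of one attribute pair, so it can be computed in a single pass over the data. That is the practical appeal of pairwise measures over structure discovery models with exponential complexity. Note that most of the measures above are symmetric in A and B, whereas confidence is not.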


 
