Abstract. Machine learning is an actively developing field of research. Many machine learning and data mining problems require working with very large datasets. Such datasets often cannot be processed on a single computer, or processing them takes too long. If only a subset of the available data is used for training, model accuracy usually degrades. Distributed computing systems are used to address this problem. The most popular approaches to building software for such systems are the Map/Reduce computational model, Spark, graph-based computational models, message passing based on the MPI standard, and the parameter server architecture. This paper surveys such systems and analyzes their advantages and drawbacks with respect to machine learning tasks. A separate section analyzes distributed systems for training artificial neural networks.
Keywords: machine learning, data mining, big data, distributed systems.
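To make the parameter server architecture named above more concrete, the following is a minimal, purely illustrative Python sketch of synchronous data-parallel SGD: workers hold shards of the training data, pull the current model from a central server, and push gradients back to it. The ParameterServer class, the synchronous averaging scheme, and all names here are assumptions made for illustration only; the production systems surveyed in the article shard parameters across many server nodes, communicate over the network, and often update asynchronously.

```python
# Illustrative sketch (not any specific system's API): the parameter-server
# pattern for data-parallel SGD on linear regression with squared loss.
import numpy as np


class ParameterServer:
    """Holds the global model and applies gradients pushed by workers."""

    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        # Workers fetch the current parameters before computing gradients.
        return self.w.copy()

    def push(self, grad):
        # Workers send gradients; the server updates the global model.
        self.w -= self.lr * grad


def worker_gradient(w, X_shard, y_shard):
    """Gradient of the mean squared error on this worker's data shard."""
    residual = X_shard @ w - y_shard
    return X_shard.T @ residual / len(y_shard)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = np.array([2.0, -3.0, 0.5])
    X = rng.normal(size=(3000, 3))
    y = X @ w_true + 0.01 * rng.normal(size=3000)

    # Data parallelism: split the training set into shards, one per worker.
    shards = list(zip(np.array_split(X, 3), np.array_split(y, 3)))
    server = ParameterServer(dim=3, lr=0.1)

    for step in range(200):
        w = server.pull()
        # Synchronous round: average gradients from all workers, then update.
        grads = [worker_gradient(w, Xs, ys) for Xs, ys in shards]
        server.push(np.mean(grads, axis=0))

    print("estimated parameters:", np.round(server.w, 3))
```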