Advanced Analytics with Spark
About the Book
You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques (including classification, clustering, collaborative filtering, and anomaly detection) to fields such as genomics, security, and finance.
If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find the book’s patterns useful for working on your own data applications.
With this book, you will:
Familiarize yourself with the Spark programming model
Become comfortable within the Spark ecosystem
Learn general approaches in data science
Examine complete implementations that analyze large public data sets
Discover which machine learning tools make sense for particular problems
Acquire code that can be adapted to many uses
What’s in This Book
The first chapter will place Spark within the wider context of data science and big data analytics. After that, each chapter will comprise a self-contained analysis using Spark. The second chapter will introduce the basics of data processing in Spark and Scala through a use case in data cleansing. The next few chapters will delve into the meat and potatoes of machine learning with Spark, applying some of the most common algorithms in canonical applications. The remaining chapters are a bit more of a grab bag and apply Spark in slightly more exotic applications, for example, querying Wikipedia through latent semantic relationships in the text or analyzing genomics data.
The Second Edition
Since the first edition, Spark has experienced a major version upgrade that introduced an entirely new core API and sweeping changes in subcomponents like MLlib and Spark SQL. In the second edition, we’ve made major renovations to the example code and brought the materials up to date with Spark’s new best practices.
About the Authors
Author: Hougland, Juliet
Sandy Ryza develops algorithms for public transit at Remix. Before that, he was a senior data scientist at Cloudera and Clover Health. He is an Apache Spark committer, Apache Hadoop PMC member, and founder of the Time Series for Spark project. He holds the Brown University computer science department's 2012 Twining award for "Most Chill."
Uri Laserson is an Assistant Professor of Genetics at the Icahn School of Medicine at Mount Sinai, where he develops scalable technology for genomics and immunology using the Hadoop ecosystem.
Sean Owen is Director of Data Science at Cloudera. He is an Apache Spark committer and PMC member, and was an Apache Mahout committer.
Josh Wills is the Head of Data Engineering at Slack, the founder of the Apache Crunch project, and wrote a tweet about data scientists once.
Table of Contents
Foreword vii
Preface ix
1. Analyzing Big Data 1
The Challenges of Data Science 3
Introducing Apache Spark 4
About This Book 6
The Second Edition 7
2. Introduction to Data Analysis with Scala and Spark 9
Scala for Data Scientists 10
The Spark Programming Model 11
Record Linkage 12
Getting Started: The Spark Shell and SparkContext 13
Bringing Data from the Cluster to the Client 19
Shipping Code from the Client to the Cluster 22
From RDDs to Data Frames 23
Analyzing Data with the DataFrame API 26
Fast Summary Statistics for DataFrames 32
Pivoting and Reshaping DataFrames 33
Joining DataFrames and Selecting Features 37
Preparing Models for Production Environments 38
Model Evaluation 40
Where to Go from Here 41
3. Recommending Music and the Audioscrobbler Data Set 43
Data Set 44
The Alternating Least Squares Recommender Algorithm 45
Preparing the Data 48
Building a First Model 51
Spot Checking Recommendations 54
Evaluating Recommendation Quality 57
Computing AUC 58
Hyperparameter Selection 60
Making Recommendations 62
Where to Go from Here 64
4. Predicting Forest Cover with Decision Trees 67
Fast Forward to Regression 67
Vectors and Features 68
Training Examples 69
Decision Trees and Forests 70
Covtype Data Set 73
Preparing the Data 73
A First Decision Tree 76
Decision Tree Hyperparameters 82
Tuning Decision Trees 84
Categorical Features Revisited 88
Random Decision Forests 91
Making Predictions 93
Where to Go from Here 94
5. Anomaly Detection in Network Traffic with K-means Clustering 97
Anomaly Detection 98
K-means Clustering 98
Network Intrusion 99
KDD Cup 1999 Data Set 100
A First Take on Clustering 101
Choosing k 103
Visualization with SparkR 106
Feature Normalization 110
Categorical Variables 112
Using Labels with Entropy 114
Clustering in Action 115
Where to Go from Here 117
6. Understanding Wikipedia with Latent Semantic Analysis 119
The Document-Term Matrix 120
Getting the Data 122
Parsing and Preparing the Data 122
Lemmatization 124
Computing the TF-IDFs 125
Singular Value Decomposition 127
Finding Important Concepts 129
Querying and Scoring with a Low-Dimensional Representation 133
Term-Term Relevance 134
Document-Document Relevance 136
Document-Term Relevance 137
Multiple-Term Queries 138
Where to Go from Here 140
7. Analyzing Co-Occurrence Networks with GraphX 141
The MEDLINE Citation Index: A Network Analysis 143
Getting the Data 144
Parsing XML Documents with Scala's XML Library 146
Analyzing the MeSH Major Topics and Their Co-Occurrences 147
Constructing a Co-Occurrence Network with GraphX 150
Understanding the Structure of Networks 154
Connected Components 154
Degree Distribution 157
Filtering Out Noisy Edges 159
Processing EdgeTriplets 160
Analyzing the Filtered Graph 162
Small-World Networks 163
Cliques and Clustering Coefficients 164
Computing Average Path Length with Pregel 165
Where to Go from Here 170
8. Geospatial and Temporal Data Analysis on New York City Taxi Trip Data 173
Getting the Data 174
Working with Third-Party Libraries in Spark 175
Geospatial Data with the Esri Geometry API and Spray 176
Exploring the Esri Geometry API 176
Intro to GeoJSON 178
Preparing the New York City Taxi Trip Data 180
Handling Invalid Records at Scale 182
Geospatial Analysis 186
Sessionization in Spark 189
Building Sessions: Secondary Sorts in Spark 190
Where to Go from Here 193
9. Estimating Financial Risk Through Monte Carlo Simulation 195
Terminology 196
Methods for Calculating VaR 197
Variance-Covariance 197
Historical Simulation 197
Monte Carlo Simulation 197
Our Model 198
Getting the Data 199
Preprocessing 199
Determining the Factor Weights 202
Sampling 205
The Multivariate Normal Distribution 208
Running the Trials 209
Visualizing the Distribution of Returns 212
Evaluating Our Results 213
Where to Go from Here 215
10. Analyzing Genomics Data and the BDG Project 217
Decoupling Storage from Modeling 218
Ingesting Genomics Data with the ADAM CLI 221
Parquet Format and Columnar Storage 227
Predicting Transcription Factor Binding Sites from ENCODE Data 229
Querying Genotypes from the 1000 Genomes Project 236
Where to Go from Here 239
11. Analyzing Neuroimaging Data with PySpark and Thunder 241
Overview of PySpark 242
PySpark Internals 243
Overview and Installation of the Thunder Library 245
Loading Data with Thunder 245
Thunder Core Data Types 252
Categorizing Neuron Types with Thunder 253
Where to Go from Here 258
Index 259
Basic Information
ISBN: 9781491972953 (1491972955)
Publication date: July 6, 2017
Pages: 280
Dimensions: 178 * 231 * 13 mm / 408 g
Volumes: 1
Language: English
Edition: This is the most recent edition.