Advanced Analytics with Spark
About the Book
You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques (including classification, clustering, collaborative filtering, and anomaly detection) to fields such as genomics, security, and finance.
If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find the book’s patterns useful for working on your own data applications.
With this book, you will:
Familiarize yourself with the Spark programming model
Become comfortable within the Spark ecosystem
Learn general approaches in data science
Examine complete implementations that analyze large public data sets
Discover which machine learning tools make sense for particular problems
Acquire code that can be adapted to many uses
What’s in This Book
The first chapter will place Spark within the wider context of data science and big data analytics. After that, each chapter will comprise a self-contained analysis using Spark. The second chapter will introduce the basics of data processing in Spark and Scala through a use case in data cleansing. The next few chapters will delve into the meat and potatoes of machine learning with Spark, applying some of the most common algorithms in canonical applications. The remaining chapters are a bit more of a grab bag and apply Spark in slightly more exotic applications, for example, querying Wikipedia through latent semantic relationships in the text or analyzing genomics data.
The Second Edition
Since the first edition, Spark has experienced a major version upgrade that introduced an entirely new core API and sweeping changes in subcomponents like MLlib and Spark SQL. In the second edition, we’ve made major renovations to the example code and brought the materials up to date with Spark’s new best practices.
About the Authors
Author: Hougland, Juliet
Sandy Ryza develops algorithms for public transit at Remix. Before that, he was a senior data scientist at Cloudera and Clover Health. He is an Apache Spark committer, Apache Hadoop PMC member, and founder of the Time Series for Spark project. He holds the Brown University computer science department's 2012 Twining award for "Most Chill."
Uri Laserson is an Assistant Professor of Genetics at the Icahn School of Medicine at Mount Sinai, where he develops scalable technology for genomics and immunology using the Hadoop ecosystem.
Sean Owen is Director of Data Science at Cloudera. He is an Apache Spark committer and PMC member, and was an Apache Mahout committer.
Josh Wills is the Head of Data Engineering at Slack, the founder of the Apache Crunch project, and wrote a tweet about data scientists once.
Table of Contents
Foreword vii
Preface ix
1. Analyzing Big Data 1
The Challenges of Data Science 3
Introducing Apache Spark 4
About This Book 6
The Second Edition 7
2. Introduction to Data Analysis with Scala and Spark 9
Scala for Data Scientists 10
The Spark Programming Model 11
Record Linkage 12
Getting Started: The Spark Shell and SparkContext 13
Bringing Data from the Cluster to the Client 19
Shipping Code from the Client to the Cluster 22
From RDDs to Data Frames 23
Analyzing Data with the DataFrame API 26
Fast Summary Statistics for DataFrames 32
Pivoting and Reshaping DataFrames 33
Joining DataFrames and Selecting Features 37
Preparing Models for Production Environments 38
Model Evaluation 40
Where to Go from Here 41
3. Recommending Music and the Audioscrobbler Data Set 43
Data Set 44
The Alternating Least Squares Recommender Algorithm 45
Preparing the Data 48
Building a First Model 51
Spot Checking Recommendations 54
Evaluating Recommendation Quality 57
Computing AUC 58
Hyperparameter Selection 60
Making Recommendations 62
Where to Go from Here 64
4. Predicting Forest Cover with Decision Trees 67
Fast Forward to Regression 67
Vectors and Features 68
Training Examples 69
Decision Trees and Forests 70
Covtype Data Set 73
Preparing the Data 73
A First Decision Tree 76
Decision Tree Hyperparameters 82
Tuning Decision Trees 84
Categorical Features Revisited 88
Random Decision Forests 91
Making Predictions 93
Where to Go from Here 94
5. Anomaly Detection in Network Traffic with K-means Clustering 97
Anomaly Detection 98
K-means Clustering 98
Network Intrusion 99
KDD Cup 1999 Data Set 100
A First Take on Clustering 101
Choosing k 103
Visualization with SparkR 106
Feature Normalization 110
Categorical Variables 112
Using Labels with Entropy 114
Clustering in Action 115
Where to Go from Here 117
6. Understanding Wikipedia with Latent Semantic Analysis 119
The Document-Term Matrix 120
Getting the Data 122
Parsing and Preparing the Data 122
Lemmatization 124
Computing the TF-IDFs 125
Singular Value Decomposition 127
Finding Important Concepts 129
Querying and Scoring with a Low-Dimensional Representation 133
Term-Term Relevance 134
Document-Document Relevance 136
Document-Term Relevance 137
Multiple-Term Queries 138
Where to Go from Here 140
7. Analyzing Co-Occurrence Networks with GraphX 141
The MEDLINE Citation Index: A Network Analysis 143
Getting the Data 144
Parsing XML Documents with Scala's XML Library 146
Analyzing the MeSH Major Topics and Their Co-Occurrences 147
Constructing a Co-Occurrence Network with GraphX 150
Understanding the Structure of Networks 154
Connected Components 154
Degree Distribution 157
Filtering Out Noisy Edges 159
Processing EdgeTriplets 160
Analyzing the Filtered Graph 162
Small-World Networks 163
Cliques and Clustering Coefficients 164
Computing Average Path Length with Pregel 165
Where to Go from Here 170
8. Geospatial and Temporal Data Analysis on New York City Taxi Trip Data 173
Getting the Data 174
Working with Third-Party Libraries in Spark 175
Geospatial Data with the Esri Geometry API and Spray 176
Exploring the Esri Geometry API 176
Intro to GeoJSON 178
Preparing the New York City Taxi Trip Data 180
Handling Invalid Records at Scale 182
Geospatial Analysis 186
Sessionization in Spark 189
Building Sessions: Secondary Sorts in Spark 190
Where to Go from Here 193
9. Estimating Financial Risk Through Monte Carlo Simulation 195
Terminology 196
Methods for Calculating VaR 197
Variance-Covariance 197
Historical Simulation 197
Monte Carlo Simulation 197
Our Model 198
Getting the Data 199
Preprocessing 199
Determining the Factor Weights 202
Sampling 205
The Multivariate Normal Distribution 208
Running the Trials 209
Visualizing the Distribution of Returns 212
Evaluating Our Results 213
Where to Go from Here 215
10. Analyzing Genomics Data and the BDG Project 217
Decoupling Storage from Modeling 218
Ingesting Genomics Data with the ADAM CLI 221
Parquet Format and Columnar Storage 227
Predicting Transcription Factor Binding Sites from ENCODE Data 229
Querying Genotypes from the 1000 Genomes Project 236
Where to Go from Here 239
11. Analyzing Neuroimaging Data with PySpark and Thunder 241
Overview of PySpark 242
PySpark Internals 243
Overview and Installation of the Thunder Library 245
Loading Data with Thunder 245
Thunder Core Data Types 252
Categorizing Neuron Types with Thunder 253
Where to Go from Here 258
Index 259
Basic Information
ISBN: 9781491972953 (1491972955)
Publication date: July 6, 2017
Pages: 280
Dimensions: 178 * 231 * 13 mm / 408 g
Volumes: 1
Language: English
Edition: This is the most recent edition.