Becoming a Data Scientist at Houzz: A Checklist
In this blog post, Peizhou Liao (pictured above) takes us through his first few months as a Houzzer and shares a checklist about what it takes to be a great data scientist at Houzz.
On my first day at Houzz, I sat in on a product review meeting. As I listened to the team debate the merits of various new features, I was struck by the sincere passion for products that solve real problems and by the data-driven culture that drives new product development. Houzz was just the place I had been looking for, and I couldn’t wait to plunge into new projects.
As my first quarter comes to an end, I’ve prepared a checklist to help other qualified data scientists get ready for their role at Houzz. These tips fall into four essential categories: data intuition, model development, programming skills and product sense.
Data intuition is partially innate but mostly learned by playing with numerous data sets. Houzz logs the activities of over 40 million monthly unique users and over two million home renovation and design professionals. That’s a massive amount of data, which enables data scientists to develop their data intuition and grow technical skills and expertise, all while contributing to Houzz’s business.
To fully utilize the data, an intuitive understanding of basic concepts and interesting relationships is highly desirable. For instance, we predict inventory availability for advertising packages that can be offered to home professionals in a future period. Such inventory forecasting is based on historical time series data. To obtain an accurate forecast, it is crucial to understand the fundamentals of time series, including trend, seasonality, and noise, and to ask questions like:
– Is the increasing trend linear or exponential?
– Can the seasonality be explained by the nature of the industry?
– Is it safe to assume normally distributed noise?
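The first question can be explored with a quick heuristic. The sketch below is only an illustration, not the forecasting method used at Houzz: it fits a straight line to the raw series and to its logarithm and reports which fit is closer. The function names and the sample numbers are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot

def trend_shape(series):
    """Rough heuristic: compare a line fit on raw values with one on log values.

    An exponential trend becomes a straight line after taking logs, so a
    clearly better fit on the log scale hints at exponential growth.
    Requires strictly positive values.
    """
    t = list(range(len(series)))
    _, _, r2_raw = fit_line(t, series)
    _, _, r2_log = fit_line(t, [math.log(v) for v in series])
    return "exponential" if r2_log > r2_raw else "linear"

# Hypothetical monthly inventory counts (made-up numbers).
linear_like = [100 + 5 * i for i in range(24)]
exp_like = [100 * 1.08 ** i for i in range(24)]

print(trend_shape(linear_like))
print(trend_shape(exp_like))
```

Real series also carry seasonality and noise, so a check like this is best treated as a first look before plotting the data and examining residuals.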
Another important element is to identify related external variables beyond the past observations of the series itself. At Houzz, the effects of seasonal increases in website traffic, product updates, and renovation trends in the wider housing market are often of particular interest. These effects differ considerably: seasonal increases tend to be confined to very narrow timeframes, while new product launches and renovation trends have a longer-term impact. By carefully examining these relationships, we can decide whether or not to include an external variable in the downstream model development, that is, whether to employ a dynamic regression model or a regular ARIMA (AutoRegressive Integrated Moving Average) model.
To get the most information out of the data, we commonly aggregate or segment and graph the data in different ways. Data segmentation helps to avoid Simpson’s paradox, in which a trend appears in several different groups of data but disappears or reverses when these groups are combined, and is therefore important for correctly interpreting the information. Data aggregation effectively alleviates sparsity problems in calculating summary statistics. Additionally, visualizing data from different angles gives us various perspectives on our business.
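A small worked example makes Simpson’s paradox concrete. The numbers below are made up for illustration (the variant names, segment names, and counts are all hypothetical): variant A converts better within each segment, yet variant B looks better once the segments are pooled, because the two variants see very different mixes of traffic.

```python
# Hypothetical (conversions, visits) per variant and segment; made-up numbers.
data = {
    "A": {"desktop": (81, 87), "mobile": (192, 263)},
    "B": {"desktop": (234, 270), "mobile": (55, 80)},
}

def segment_rates(variant):
    """Conversion rate of one variant within each segment."""
    return {seg: conv / visits for seg, (conv, visits) in data[variant].items()}

def overall_rate(variant):
    """Conversion rate of one variant after pooling all segments."""
    conv = sum(c for c, _ in data[variant].values())
    visits = sum(v for _, v in data[variant].values())
    return conv / visits

for variant in ("A", "B"):
    rates = segment_rates(variant)
    print(variant,
          {seg: round(r, 3) for seg, r in rates.items()},
          "overall:", round(overall_rate(variant), 3))
```

Looking only at the pooled rate would favor B, while the segmented view favors A in every segment, which is why we check both before drawing a conclusion.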
A data scientist cannot tell a good story with data in the absence of appropriate modeling, which is indispensable to turn raw data into meaningful business insights. All Houzz data scientists are experts in the multifaceted process of model development, which typically includes four major components: hypothesis generation, feature engineering, model building, and performance evaluation.
Models are always developed for practical use cases and business operations improvement. Before diving into the data, it’s critical for us to talk with stakeholders to define the use cases, identify associations of particular interest, and establish what the model should predict and how. Such communication for creating a hypothesis ensures that the model focuses on the right problems, and that the results positively influence operations for our business. Our data scientists are always encouraged to take a hypothesis-centred approach in practice.
Data has to be refined into relevant information in order to train a model. Feature engineering allows us to craft data features optimized for accurately representing key patterns, which leads to higher prediction power for the model. To do feature engineering well, our data scientists must gain industry-specific domain knowledge and develop a range of techniques, including transforming individual predictors into more contextually meaningful information and grouping data into reasonable bins. For example, field experience is indispensable to detect and remove a “smoking gun” feature, and log transformation is often necessary to reduce data variability.
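The two techniques named above, log transformation and binning, can be sketched in a few lines. This is a minimal illustration under assumed data: the budget values, bin edges, and bin labels are hypothetical and not taken from Houzz’s pipelines.

```python
import math
from bisect import bisect_right
from statistics import mean, pstdev

# Hypothetical project budgets in dollars (made-up, heavily right-skewed).
budgets = [500, 1200, 3000, 8000, 25000, 60000, 250000]

def coefficient_of_variation(values):
    """Spread relative to the mean; a simple way to compare variability."""
    return pstdev(values) / mean(values)

# Log transformation compresses the long right tail.
log_budgets = [math.log10(b) for b in budgets]

# Binning: map each raw value to a coarse, human-readable bucket.
BIN_EDGES = [1_000, 10_000, 100_000]          # hypothetical cut points
BIN_LABELS = ["under $1k", "$1k-$10k", "$10k-$100k", "$100k+"]

def budget_bin(amount):
    """Return the label of the bin that contains `amount`."""
    return BIN_LABELS[bisect_right(BIN_EDGES, amount)]

print(round(coefficient_of_variation(budgets), 2))
print(round(coefficient_of_variation(log_budgets), 2))
print([budget_bin(b) for b in budgets])
```

Where the cut points belong is a domain question, which is one reason industry knowledge matters as much as the technique itself.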
Models are often imperfect but some may still be useful. At Houzz, we strive to maximize model elegance and prediction power. Given our fast-paced environment, however, we often seek useful models rather than perfect ones. If the total development cost of a new model exceeds the value it can add, it is preferable to keep using the existing model and to revisit and improve it later. It is highly desirable for a data scientist to have extensive expertise in at least one machine learning model, because most of the time they can obtain satisfactory results by fine-tuning that model and successfully deploying it.
Good models are typically those that have very high prediction accuracy for new data and are easy to interpret. To assess generalizability, we perform cross-validation and receiver operating characteristic (ROC) curve analysis. In the case of imbalanced classification problems, we use a precision-recall curve as well as an F1 score. For interpretability, we often prefer an interpretable model to a black box, so that stakeholders can comprehend why certain decisions or predictions have been made. Sometimes only a complex model can be used. In those cases, we employ explanation techniques that describe the predictions of the model (e.g., LIME, or Local Interpretable Model-Agnostic Explanations) to understand the cause or reason for a decision.
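To see why precision, recall, and F1 matter for imbalanced problems, consider the sketch below. The labels and predictions are made up: a hypothetical classifier that always predicts the majority class reaches the same accuracy as one that actually finds a positive, and only the F1 score tells them apart.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced data: 2 positives out of 20 examples.
y_true = [1, 1] + [0] * 18
always_negative = [0] * 20                 # never predicts the rare class
some_model = [1, 0] + [1] + [0] * 17       # one hit, one miss, one false alarm

for name, y_pred in [("always_negative", always_negative),
                     ("some_model", some_model)]:
    print(name, accuracy(y_true, y_pred), precision_recall_f1(y_true, y_pred))
```

Both predictors score 0.9 on accuracy here, which is why accuracy alone is a poor guide when one class is rare.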
Good programming skills are important for data scientists at Houzz. When writing code to create data pipelines or advanced analytics on “big data”, the most important consideration is to make the code as efficient as possible (i.e., to optimize memory and runtime). Our data scientists frequently work with gigantic data sets that require huge clusters to manipulate. Sloppy code could clog the entire cluster’s resources, blocking production jobs as well as other Houzzers’ projects. Such a situation can be avoided by carefully designing the code and explicitly requesting the right amount of resources. Inefficient code also tends to slow down model development. Take Hive queries in feature engineering as an example. Whether the filtering in a WHERE clause is applied before or after a join can make a big difference in runtime. If a large number of unnecessary rows are filtered after an outer join rather than before it, you are likely to wait a long time to extract features, which leaves less time for training different models. At Houzz, my team has been very helpful in suggesting ways to make my code more efficient and recommending various computational tricks (NumPy broadcasting, SciPy sparse matrices, etc.) to speed up my programs.
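The effect of filtering before versus after a join can be illustrated without a cluster. The sketch below is a plain-Python simulation, not a Hive query: the tables, column names, and the `left_outer_join` helper are hypothetical. It produces the same result both ways but counts how many intermediate rows the join has to materialize.

```python
def left_outer_join(left, right, key):
    """Naive left outer join of two lists of dicts on `key`."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for l_row in left:
        matches = index.get(l_row[key])
        if matches:
            joined.extend({**r_row, **l_row} for r_row in matches)
        else:
            joined.append(dict(l_row))  # keep unmatched left rows
    return joined

# Hypothetical tables: 1,000 users (10% in the US) and 5,000 events.
users = [{"user_id": i, "country": "US" if i % 10 == 0 else "other"}
         for i in range(1000)]
events = [{"user_id": i % 1000, "event": f"e{i}"} for i in range(5000)]

def post_filtered():
    """Join everything first, then keep only US rows."""
    joined = left_outer_join(users, events, "user_id")
    result = [row for row in joined if row["country"] == "US"]
    return result, len(joined)

def pre_filtered():
    """Keep only US users first, then join."""
    us_users = [u for u in users if u["country"] == "US"]
    joined = left_outer_join(us_users, events, "user_id")
    return joined, len(joined)

post_result, post_rows = post_filtered()
pre_result, pre_rows = pre_filtered()
print(post_rows, pre_rows, post_result == pre_result)
```

The two orders are interchangeable here only because the filter is on a column of the left table. Filtering on a right-table column after an outer join also drops the unmatched rows, so in that case the two queries are no longer equivalent.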
Another consideration is to organize the code well so that it is easy to understand. There are two primary reasons for improving the readability of code. First, it helps us decipher the code after time has passed. Often, we have to revisit old code, either to debug it or to reuse it. In this situation, clearer code makes changes easier and less risky. Second, it helps our team stay in sync with the code. As data scientists, we always work in a team and collaborate with software engineers. Clear standards and proper naming reduce the effort we spend understanding the code and facilitate code reviews. Additionally, when data insights are put into production, our code serves as the reference point and North Star for engineers to develop high-performing solutions and test against.
The tricky thing about programming is the balance between efficiency and clarity, and there are no strict rules for trading one against the other. In practice, Houzz data scientists follow our existing best practices and keep learning from new lessons.
Good data scientists at Houzz are technical product managers who have empathy for our customers and understand how their work influences product decisions. For example, data scientists on our Industry Solutions team, which is focused on helping home professionals connect with homeowners, communicate directly with sales teams to explain actionable insights at the appropriate technical level and to collect feedback on products. While meeting with our sales team, I joined calls with our customers to gain a holistic understanding of which features our customers appreciate the most and which can be improved. As a result, I know which real problems to solve in order to improve our products in tangible ways, and when to stop model development rather than chase marginal returns with additional effort. Such exposure to our customers also motivates me to always deliver analysis and insights that my cross-functional partners can understand and act on. Whenever I measure the impact of product changes, I carefully choose proper evaluation metrics. What’s more, I keep explainability in mind when creating models to predict the potential impact of a particular change.
Like the data science field, the essential traits and skills of data scientists are evolving very fast. The above checklist summarizes the qualities that Houzz values in a data scientist based on my initial experiences. In one sentence, I’ve found that a data scientist should have professional engineering skills and a strong product sense, in addition to good data intuition and a remarkable ability for model development. I believe that the role of the Houzz data scientist is a fun, challenging, and rewarding career.