Crawling the internet: data science within a large engineering system
by BILL RICHOUX
Data scientists promote principled decision-making through several different arrangements. In some cases, data scientists provide executive-level guidance, reporting insights and trends. Alternatively, guidance and insight may be delivered below the executive level to product managers and engineering leads, directing product feature development via metrics and A/B experiments.
This post focuses on an even lower-level pattern, when data scientists are themselves implementing solutions to analytical problems within the software system codebase. Instead of closing the loop (and impacting the product) through strategic decision making, in this scenario the data scientists are directly rewiring the software stack. This kind of action usually comes with a prelude, working through higher-level scenarios — first metrics are developed, next metrics are universally embraced, and lastly software changes are driven by the desire to improve those metrics.
Every large-scale system has its own constraints and pressures that the data scientist will need to accommodate. Such systems are typically built as a software suite that has been abstracted into several interacting components, each owned by a distinct subteam of infrastructure engineers. Most of these subteams interact with only a small subset of subteams upstream or downstream of their subsystem. Although the software components themselves may have limited cross-dependencies in the software layer, the subteams remain heavily dependent on one another to ensure the product as a whole is successful.
Good data scientists typically seek to solve problems not just once, but continuously. Even in instances when data scientists supply metrics, insights, and trends to inform strategic decision-making, they still aim to minimize the future work needed to continue solving such problems. But when a data scientist is implementing solutions within software systems, it is arguably a higher form of solving problems continuously. Such solutions are no longer necessarily just strategic and may have to address tactical issues. For instance, tactical responsiveness is critical whenever the software system interacts with a continuously changing external environment (e.g., user behaviors/interests, the internet, etc.). Moreover, there is a much higher bar on reliability, as well as a higher bar on immediately recognizing when solutions break, as your eyes will no longer be sanity-checking each solution that your algorithm emits.
Example: Recrawl Logic within Google search
Google search works because our software has previously crawled many billions of web pages, that is, scraped and snapshotted each one. These snapshots comprise what we refer to as our search index. When queries arrive, the search system matches the inferred meaning of the query to web pages on the basis of these snapshots. The best-matching snapshots then tell the search system which web pages should appear on the search results page as an answer to the user's query. In other words, Google search heavily relies on its own dated snapshot of the internet — dated to the extent that it is based on the contents that appeared on web pages when we last crawled them.
Freshness
Many web pages have their contents updated repeatedly, and it is imperative that Google search keep the snapshots as up-to-date as possible. Whenever a snapshot’s contents match its real-world counterpart, we call that snapshot ‘fresh.’ Even if it has been a long while since that snapshot was taken, as long as the contents continue to match, the snapshot is fresh. On the other hand, if a web page’s contents have meaningfully changed since Google search last crawled it — regardless of how little time has passed since this last crawl — Google search’s snapshot of the web page is not acceptable, or ‘stale.’ When a user clicks through to a stale web page from a Google search results page, his or her browser loads a web page whose contents may no longer be relevant to the user’s query. This results in a poor user experience.
A team of data scientists defined a measure of freshness for each snapshot and built a system to continuously estimate the weighted average of this freshness measure across all snapshots in our web search index. The weights were provided by a separate system that assigns a value to each web page, akin to PageRank. These web pages' values are used ubiquitously across Google search for prioritization. This measure of web page value is on a meaningful linear scale, such that our freshness metric (a weighted average) has an intuitive interpretation. It was widely agreed that we should undertake effort to improve this freshness metric.
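To make the metric concrete, here is a minimal sketch (with made-up page weights and freshness labels, and nothing like the continuously running estimation system) of the value-weighted average itself:

```python
# Sketch of the freshness metric: a value-weighted average over snapshots.
# `pages` is a hypothetical list of (value_weight, is_fresh) pairs; the real
# system estimates freshness continuously rather than from a static list.

def weighted_freshness(pages):
    """Return the value-weighted fraction of fresh snapshots."""
    total_weight = sum(w for w, _ in pages)
    if total_weight == 0:
        return 0.0
    return sum(w for w, fresh in pages if fresh) / total_weight

print(weighted_freshness([(10.0, True), (1.0, False), (5.0, True)]))  # 0.9375
```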
Crawl policy
Improving this freshness metric would require changes to the logic that decides when Google search recrawls web pages. We call this logic the ‘crawl policy’, and the implementation of that policy is owned by the crawl team of infrastructure engineers. There are two sets of constraints that make crawl an interesting problem:
- Each host (a collection of web pages sharing a common URL prefix) imposes an implicit or explicit limit on the rate of crawls Google’s web crawler can request. Google continuously estimates this maximum crawl rate for each host (with such estimates being made available to any logic implementing how web pages should be recrawled).
- A global constraint of how much compute and network resources Google itself is willing to dedicate to crawling web pages. Effectively, this imposes an overall limit on the rate of crawls across all web pages.
Mathematical abstraction
An optimal crawl policy could be derived from this objective of maximizing average freshness, if we could abstract this problem into an optimization framework. We first made some simplifying assumptions:
- We could assume the problem was static, solve for an optimal crawl policy under such static assumptions, and then apply this policy for a brief period of time — however long it would take to solve (again) and update the policy.
- Each web page will experience, in its future, meaningful changes. We model the arrival of these changes as a Poisson process. The Poisson rate of such meaningful changes for Page $j$ on Host $i$ is denoted as $\delta_{ij}$, and assumed constant over the life of a solution. An estimate of this change rate for each web page would be available to the recrawl logic.
- At a period of time $\tau$ since the last time Page $j$ on Host $i$ is crawled, the probability that it will be fresh (not have meaningfully changed) will be $ e^{-\delta_{ij} \tau} $
- If Page $j$ is recrawled every $\Delta_{ij}$ time units, its probability of being fresh at a time chosen uniformly over this period will be $ \frac{1}{\Delta_{ij}} \int_0^{\Delta_{ij}} e^{-\delta_{ij} \tau} d \tau $ or equivalently, $ \frac{1}{\delta_{ij} \Delta_{ij}} \left( 1 - e^{-\delta_{ij} \Delta_{ij}} \right) $.
Putting these parameters together, we pose an optimization problem: $$
\arg \max_{\Delta_{11}, \Delta_{12}, \ldots } \sum_{ij} \frac{w_{ij}}{\delta_{ij} \Delta_{ij}} \left( 1 - e^{-\delta_{ij} \Delta_{ij}} \right) \\
\textrm{s.t.} \sum_j \frac{1}{\Delta_{ij}} \leq k_i \textrm{ for the }i \textrm{th host} \\
\;\;\; \sum_{ ij } \frac{1}{\Delta_{ij}} \leq k_0 \\
\;\;\; \Delta_{ij} > 0 \;\; \forall i,j
$$
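As a sanity check of the per-page objective term (a quick sketch with made-up parameter values, not anything from the production system), the closed-form expected freshness over a recrawl period can be compared against a direct simulation of the Poisson change process:

```python
import math
import random

def expected_freshness(delta, period):
    """Closed form: probability a page is fresh at a uniformly chosen time
    within a recrawl period, given Poisson change rate `delta`."""
    x = delta * period
    return (1.0 - math.exp(-x)) / x

def simulated_freshness(delta, period, n=200_000, seed=0):
    """Monte Carlo check: draw a uniform time tau in [0, period]; the page is
    fresh iff no meaningful change has arrived by tau."""
    rng = random.Random(seed)
    fresh = 0
    for _ in range(n):
        tau = rng.uniform(0.0, period)
        first_change = rng.expovariate(delta)  # time of first meaningful change
        fresh += first_change > tau
    return fresh / n

delta, period = 0.2, 3.0   # hypothetical change rate and recrawl period
print(expected_freshness(delta, period))   # ~0.752
print(simulated_freshness(delta, period))  # close to the closed form
```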
We can express this optimization problem in a more compact abstract form, clarifying its special structure, by
- denoting $C_{ij}(\Delta_{ij})$ as the term in the objective function relating to Page $j$ on Host $i$
- denoting $K_{ij}(\Delta_{ij})$ as the term in the constraints relating to Page $j$ on Host $i$
Then the optimization problem can be expressed as $$
\arg \max_{\Delta_{11}, \Delta_{12}, \ldots } \sum_{ij} C_{ij}(\Delta_{ij}) \\
\textrm{s.t.} \sum_{ j } K_{ij}(\Delta_{ij}) \leq k_i \textrm{ for the }i \textrm{th host} \\
\;\;\; \sum_{ ij } K_{ij}(\Delta_{ij}) \leq k_0
$$Functions $C_{ij}$ and $K_{ij}$ are convex. Consequently, our optimization problem is convex. Determining optimal recrawl periods for each web page, optimal in the sense of maximizing our freshness metric, has been mapped to solving a convex optimization problem. That makes it feasible to perform the optimization repeatedly and frequently, regularly updating each web page’s recrawl period as our estimates associated with each web page, host, and the web itself, all evolve over time.
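To make the formulation concrete, here is a toy instance solved with a general-purpose solver (scipy's SLSQP). All page values, change rates, and rate limits are invented for illustration; the production problem has several free parameters for every one of billions of pages and, as the rest of the post explains, a dedicated solver was never actually built.

```python
# Toy illustration only: solve a tiny instance of the static recrawl problem.
# Page values (w), change rates (delta), and rate limits are made-up numbers.
import numpy as np
from scipy.optimize import minimize

w     = np.array([5.0, 1.0, 3.0, 2.0])    # page values (weights)
delta = np.array([0.5, 0.1, 0.3, 0.2])    # Poisson change rates
host  = np.array([0, 0, 1, 1])            # host index of each page
k_host   = np.array([1.5, 1.0])           # per-host crawl-rate limits
k_global = 2.0                            # global crawl-rate limit

def neg_freshness(periods):
    x = delta * periods
    return -np.sum(w * (1.0 - np.exp(-x)) / x)

constraints = [
    # global: total crawl rate sum_ij 1/Delta_ij <= k_global
    {"type": "ineq", "fun": lambda p: k_global - np.sum(1.0 / p)},
]
for h in range(len(k_host)):
    constraints.append(
        # per-host: crawl rate on host h must stay under its limit
        {"type": "ineq", "fun": lambda p, h=h: k_host[h] - np.sum(1.0 / p[host == h])}
    )

res = minimize(
    neg_freshness,
    x0=np.full(len(w), 4.0),           # initial guess for the recrawl periods
    bounds=[(1e-3, None)] * len(w),    # Delta_ij > 0
    constraints=constraints,
    method="SLSQP",
)
print("optimal recrawl periods:", np.round(res.x, 2))
print("weighted average freshness:", -res.fun / w.sum())
```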
The Missing Link!
This optimization problem was the missing mathematical link between our singular intention (improve our freshness metric) and the multitude of decisions to be made in the software layer (how often to crawl each web page). We had the mathematical formulation, and we had the technical expertise to solve it. As our optimization problem could not reasonably reside in memory on a normal computer — it has several free parameters for every page in Google’s large index of web pages — the challenges that remained would be exciting for both data scientists and infrastructure engineers.
We envisioned two potential approaches to solving this optimization problem: employ a solver on a single machine, or employ a solver distributed across multiple machines, partitioned to guarantee that no host would be split across machines. Our preference was for the latter because of 1) its better scaling properties over time, and 2) its closer fit with Google’s usual approach to solving large computational problems — parallelize computation across machines. Moreover, intuitively, the structure of our optimization problem (diagonal Hessian matrices associated with the objective function and all constraints) lent itself to being well-conditioned under a distributed-across-machines approach. To us, this appeared to be a near-ideal project for collaboration between data scientists and infrastructure engineers.
Or was it? Hint: there is arguably a much simpler way to understand when web pages should be recrawled. Do you see it? Initially, all of us were so fixated on designing a solver for our optimization problem, especially those of us with backgrounds in optimization, that none of us saw it.
Pushback
When we discussed this proposal with the crawl team of infrastructure engineers, none of its members were enthusiastic about it. Their concerns fell into four categories:
- black box solution: Although we data scientists felt that we could clearly explain the optimization problem and also explain the rough structure of the infrastructure to solve it, the infrastructure engineers saw this proposed solution as introducing a new black box component to their crawl system. None of the infrastructure engineers on the crawl team were convinced that they could easily understand and diagnose when or why such new optimization infrastructure would fail. All of this led to significant concerns about being able to quickly debug problems — they promised us that problems would invariably occur, as was always the case when working at the scale of Google search with a continuously evolving external environment (the web), no matter how well designed the system.
- solving the wrong problem formulation: Our solution approach was to freeze time, solve a static optimization problem, and then solve again repeatedly. However, in the software layer, the decision being made repeatedly was to determine the subset of web pages that should be crawled next, whenever some crawl bandwidth becomes available. This meant that we would need to take the solution from our solver, merge it with information detailing when each URL was last crawled, and only then could we determine which URLs should be crawled next. It’s arguably a minor extra step, but it nevertheless requires a bridge between our problem solution and the software implementation. By not matching the problem formulation in the infrastructure, we imposed an extra layer of complexity to interpret our solution.
- substantial expansion in complexity: Although we were excited when we had a concrete proposal to solve for many billions of optimal crawl periods, our proposal was effectively to replace an existing heuristic, implemented in a few lines of code, with some new software infrastructure: a distributed solver for a very large convex optimization problem — likely tens of thousands of lines of new code. This would sound fun to many developers thinking only about the short term. But for anyone with a long-term perspective, there were obvious concerns with the burden of maintaining such a large, specialized system, whose technical complexity exceeds what either a data scientist or an infrastructure engineer could individually master. Moreover, the infrastructure engineers insisted that any new, not-battle-tested system of moderate complexity would likely fail in unintended ways. Without any indication that such proposed infrastructure would become a standard tool to solve additional problems, the crawl team was hesitant to commit to it.
- limitations on responsiveness: The structure of the proposed solution implicitly imposed a batch approach to solving the optimization problem in practice. That is, the approach was to freeze time and solve a static problem, and then repeat ad infinitum. Although we referred to the solution as optimal, in reality it would never be optimal considering that the parameters feeding into such recrawl logic (the maximum crawl rates, the value of a page, the estimated change rate of a page) were all being updated asynchronously. No matter how responsive one wanted our recrawl logic to be, the runtime of each iteration of our solver would impose a non-negotiable lower bound on such responsiveness, eroding any claim of optimality of our solution. Moreover, as our index would invariably grow over time, such runtimes (and responsiveness) potentially could worsen.
Deconstructing our mistakes
At a high level, our mistakes boiled down to either 1) not knowing and appreciating our context, both the people (the infrastructure engineers on the crawl team) and the existing software implementation, or 2) an excessive eagerness to build a new solution that aligned with our technical interests (building a large-scale distributed convex optimization solver). In both instances, these mistakes led us to willingly embrace an expansion of complexity, rather than simpler abstractions.
Failing to appreciate the infrastructure engineer’s pressures and responsibilities
Simpler software systems make software team responsibilities easier, whether it be onboarding new engineers, maintenance, or debugging. Especially when a component sits far upstream in a complex software system (much of Search is dependent on the crawl working intelligently), debuggability becomes critical. When you have a large software system interacting with an external environment, it is accepted that the software team will continually encounter cases where the existing code behaves in undesirable ways. Because the continuously running system does not just take a break when this occurs, responsiveness in fixing such issues affects the long-term performance of the software system. Data scientists must appreciate this, and particularly appreciate the value of simpler solutions when integrating their work into the software layer.
Knowing too little of the actual, existing software implementation
The actual problem within the software was to determine the next few web pages to crawl whenever capacity becomes available. An extra step was required to map the problem that our optimization solved (determining web page recrawl rates) to this actual problem. If we had been better aware of the actual problem initially, we would have spent more time thinking about a solution that could rank/sort web pages, as such a formulation would align with the actual decision-making within the software layer.
Furthermore, it would have benefited us to appreciate the importance of a solution being responsive to dynamic conditions. Our batch solution assumes that we can treat short durations of time as a relatively uniform steady state. Obviously, this assumption is inappropriate if, over the lifetime of a solution, there are significant changes to the parameters assumed frozen in our optimization formulation (e.g., the crawl rate capacity of a host).
Less obviously, this static approach also assumes sufficient ‘mixing’ to ensure that the resulting crawl rate on a host would effectively be uniform over time when an optimal solution is followed. Because of the continuous evolution of the internet, the sudden emergence of new blocks of web pages, and the thick tail of the web, consisting of a large number of moderately sized hosts without enough pages to ensure sufficient ‘mixing’, this second assumption is especially inappropriate (this will be discussed further in the next section). Lastly, we had also failed to notice an ongoing transition of the existing batch-based crawl software infrastructure to what is effectively a streaming system — hence adding a new batch-based component would be unwelcome. Knowing all of this would have compelled us to avoid proposing our batch solution based on static assumptions.
Seduction of familiarity
When you’re holding a hammer, every problem looks like a nail. It is easy to get drawn into a complex solution when you’ve spent years training yourself to design complex solutions. When trying to solve problems in software systems, however, arguably more time ought to be spent thinking of new ways to abstract the problem. It is best to hold off on building a solution when only a small amount of time has been spent understanding the problem.
In summary, we solved the wrong problem — wrong both at the human client level and at the level of mathematical abstraction. We proposed a solution to a stand-alone static problem, whereas what was needed was a method to make good use of all available capacity of the crawl system. In other words, what the crawl system needed was a forward-looking means of deciding which next few web pages to crawl.
A revised proposal
Learning specifics about the underlying software implementation left us wishing that we could define, for each web page, a function that would tell us the value of recrawling the page at any given time since it was last crawled. If we could sort our queues by such a function, it would be relatively easy to determine, at any time, the web pages to crawl next whenever capacity was available. But then suddenly we realized that we had precisely that, made possible by special mathematical structure in the optimization problem itself.
As before, let $i$ index hosts and $j$ index pages on each host. Thus each page is uniquely indexed by $(i,j)$. Furthermore, define the crawl rate for Page $i,j$ as the reciprocal of its recrawl period $$
\rho_{ij}=\frac{1}{\Delta_{ij}}
$$This is the same as $K_{ij}(\Delta_{ij})$ in the original framing but we want to think of it as a variable over which we are optimizing rather than as a function. We may now write down the contribution of the Page $i,j$ to the overall objective as $$
C_{ij}(\rho_{ij})=\frac{w_{ij}}{\delta_{ij}}\rho_{ij}\left(1-\exp(-\frac{\delta_{ij}}{\rho_{ij}})\right)
$$If we define $\underline{\rho}$ as the vector of all crawl rates to be chosen, the overall freshness objective to be maximized is $$
\arg \max_{\underline{\rho}}\sum_{ij}C_{ij}(\rho_{ij})
$$under the constraint for each host indexed by $i$ as $$
\sum_{j}\rho_{ij}\leq k_{i}
$$and the global crawl constraint $$
\sum_{ij}\rho_{ij}\leq k_{0}
$$We use Lagrange multipliers to optimize the unconstrained objective
\begin{align}
J(\underline{\rho}) & =\left(\sum_{ij}C_{ij}(\rho_{ij})\right)-\left(\sum_{i}\lambda_{i}\sum_{j}\rho_{ij}\right)-\left(\lambda_{0}\sum_{ij}\rho_{ij}\right)\\
 & =\sum_{ij}\left(C_{ij}(\rho_{ij})-(\lambda_{i}+\lambda_{0})\rho_{ij}\right)
\end{align}From the KKT conditions, the Lagrange multipliers $\lambda_{i}$ must all be non-negative. Setting $\partial J/\partial\rho_{ij}=0$ we get $$
C'_{ij}(\rho_{ij}^*)=\lambda_{i}+\lambda_{0}
$$at the optimal crawl rates $\rho_{ij}^*$ where
\begin{align}
C'(\rho) & =\frac{w}{\delta} \left(1-\exp(-\frac{\delta}{\rho})\right)-\frac{w}{\rho}\exp(-\frac{\delta}{\rho})
\end{align}all variables indexed by $i,j$. Furthermore, $C'(\rho)$ is monotone decreasing in $\rho$ because $C''(\rho)$ is always negative$$
C''(\rho)=-\frac{w\delta\exp(-\frac{\delta}{\rho})}{\rho^{3}}
$$For hosts whose crawl rate does not constrain the solution, $\lambda_{i}=0$.
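As a quick aside (not part of the proposal), the derivative expressions above are easy to mis-transcribe, so a short symbolic check is reassuring:

```python
# Symbolic check that the stated C'(rho) and C''(rho) match derivatives of
# C(rho) = (w/delta) * rho * (1 - exp(-delta/rho)).
import sympy as sp

w, delta, rho = sp.symbols("w delta rho", positive=True)

C = (w / delta) * rho * (1 - sp.exp(-delta / rho))
C1_stated = (w / delta) * (1 - sp.exp(-delta / rho)) - (w / rho) * sp.exp(-delta / rho)
C2_stated = -w * delta * sp.exp(-delta / rho) / rho**3

print(sp.simplify(sp.diff(C, rho) - C1_stated))     # 0
print(sp.simplify(sp.diff(C, rho, 2) - C2_stated))  # 0
```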
Up to this point, we have considered how to solve the static optimization problem. But the form of the solution offers us a new perspective on how to understand and implement an optimal solution, primarily in terms of functions that tell us the value of crawling any web page at any given time. To see this, first consider the function
\begin{align}
V(\Delta) &= C'(1 / \Delta) \\
&= \frac{w}{\delta} \left(1-\mathrm{e}^{-\delta \Delta}\right)
- w\Delta \mathrm{e}^{-\delta\Delta}
\end{align}The function $V(\Delta)$ is monotonically increasing, starting with $V(0) = 0$. At the optimal crawl period $\Delta_{ij}^*=1/\rho^*_{ij}$, it follows that for each Page $j$ on Host $i$ $$
V_{ij}(\Delta_{ij}^*) =\lambda_{i}+\lambda_{0}
$$This perspective motivates explicitly expressing this function temporally, that is, envision $\tau$ as the time that has elapsed since Page $j$ on Host $i$ was last crawled; we can then consider the function$$
V_{ij}(\tau) = \frac{w_{ij}}{\delta_{ij}} \left( 1 - e^{-\delta_{ij} \tau} \right ) - w_{ij} \tau e^{-\delta_{ij} \tau}
$$which is a function of time since the web page was last crawled. We refer to $V_{ij}(\tau)$ as Page $i,j$’s crawl value function. When we temporally observe a solution to the static optimization problem being executed, we can track each Page $i,j$’s crawl value function and observe it climbing until it reaches a threshold of $\lambda_i + \lambda_0$, whereupon Page $i,j$ will be recrawled — as the crawl value function equals $\lambda_i + \lambda_0$ at the web page’s optimal crawl period $\Delta^*_{ij}$. Upon recrawl, its crawl value function will reset to 0.
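To make the threshold interpretation concrete, here is a small sketch (with made-up parameter values) that recovers a page's optimal recrawl period from a given threshold $\lambda_i + \lambda_0$ by finding where its crawl value function crosses that threshold:

```python
# Sketch: given a page's value w, change rate delta, and a threshold
# lam = lambda_i + lambda_0, find the recrawl period Delta* at which the
# crawl value function V(Delta) crosses the threshold. Numbers are made up.
import math

def crawl_value(tau, w, delta):
    """V(tau): value of recrawling the page tau time units after its last crawl."""
    return (w / delta) * (1.0 - math.exp(-delta * tau)) - w * tau * math.exp(-delta * tau)

def optimal_period(w, delta, lam, hi=1e6, tol=1e-9):
    """Bisection for Delta* with V(Delta*) = lam; V is increasing from V(0) = 0."""
    lo_t, hi_t = 0.0, hi
    if crawl_value(hi_t, w, delta) < lam:
        return float("inf")  # V is bounded; threshold never reached, never recrawl
    while hi_t - lo_t > tol:
        mid = 0.5 * (lo_t + hi_t)
        if crawl_value(mid, w, delta) < lam:
            lo_t = mid
        else:
            hi_t = mid
    return hi_t

w, delta, lam = 2.0, 0.25, 1.5                # hypothetical page parameters and threshold
period = optimal_period(w, delta, lam)
print(period, crawl_value(period, w, delta))  # V(Delta*) ~= lam
```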
As for the parameters, it turns out that it is not necessary to estimate $\lambda_i$ in order to enforce the crawl rate limits on individual hosts — each host itself limits how fast it allows itself to be crawled. The implementation assigns a number of parallel workers, each with its own range of pages to crawl. Thus, $\lambda_i$ are maintained by the worker assigned to each host, estimated based on the web pages on that host crawled recently. In contrast, $\lambda_0$ is estimated centrally and updated periodically. Its value is then propagated to all the parallel workers. While in theory we need $\lambda_i$ in order to estimate $\lambda_0$, in practice $\lambda_0$ is relatively constant over time. Thus, its value can be updated based only on hosts which are not constrained.
Note what this crawl value function is not — it is not greedy. The crawl value function we derived for the static optimization problem has some properties that were seen as favorable — at least to some people — when compared to what results from a greedy algorithm. In some regimes (and in practice for Google search), a greedy algorithm would devote more recrawl resources towards high value pages, and lower value pages would commonly starve. Many in Google search had experienced past situations where greedy algorithms accumulated serious issues over time, and hence there was a lot of institutional intuition to avoid them.
As for the use of the crawl value function — at any given time, we can evaluate the crawl value function for each web page on a host. We can use this function to sort the web pages, and then determine which web pages should be scheduled for immediate crawl. This is the substantial simplification alluded to earlier — that for each web page, there is a function (this crawl value function) telling us the value of recrawling that page, only in terms of parameters for the given web page, at any given time since it was last crawled. It’s precisely what one wants when one has a queue, as in the software implementation in Google search, and it dramatically simplifies how a solution to the static optimization problem can be implemented on a per-host basis.
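A minimal sketch of that per-host routine might look like the following (hypothetical data structures and parameter values; the real implementation is a large distributed system): score each page by its crawl value at the current time, and crawl the highest-valued pages that clear the threshold.

```python
# Minimal sketch of per-host scheduling by crawl value. Pages clearing the
# host's threshold lambda_i + lambda_0 are crawled first, highest value first.
import math
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    value: float        # w_ij, the page's weight
    change_rate: float  # delta_ij, estimated rate of meaningful changes
    last_crawl: float   # timestamp of the last crawl

def crawl_value(page: Page, now: float) -> float:
    tau = now - page.last_crawl
    w, d = page.value, page.change_rate
    return (w / d) * (1.0 - math.exp(-d * tau)) - w * tau * math.exp(-d * tau)

def pages_to_crawl(pages, now, threshold, capacity):
    """Return up to `capacity` pages whose crawl value exceeds `threshold`
    (an estimate of lambda_i + lambda_0), highest value first."""
    scored = [(crawl_value(p, now), p) for p in pages]
    due = [p for v, p in sorted(scored, key=lambda vp: vp[0], reverse=True) if v >= threshold]
    return due[:capacity]

pages = [
    Page("http://example.com/a", value=5.0, change_rate=0.50, last_crawl=0.0),
    Page("http://example.com/b", value=1.0, change_rate=0.05, last_crawl=0.0),
    Page("http://example.com/c", value=3.0, change_rate=0.20, last_crawl=5.0),
]
for p in pages_to_crawl(pages, now=10.0, threshold=2.0, capacity=2):
    print(p.url)
```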
As this reasoning is derived directly from the KKT conditions of a solution to the static optimization problem, the reader should be convinced that a crawl policy that sorts web pages by their crawl value functions and recrawls each one when it reaches the threshold $\lambda_i+\lambda_0$ is equivalent to a crawl policy that solves the optimization problem for the optimal recrawl periods and then schedules URLs to be recrawled at intervals according to those solved-for periods. What should be obvious is that one can apply the policy of sorting by crawl value to prioritize what to recrawl, even in situations where the available crawl rate on a host varies over time (recall that it was assumed to be fixed in the static optimization problem’s formulation). What may be less obvious, however, is that this recrawl policy can be an optimal crawl policy (optimal in the sense that it will maximize weighted freshness over time) even under some more general conditions.
There are some very desirable properties of such a solution, properties which we hadn’t really focused on until learning more about the existing software implementation and the longer-term goals of the crawl team (which led us to revisit our original proposal):
- familiar: the optimization problem is mapped to a solution that matches a common routine in software engineering: scan a list and find the top $N$. Minimal effort was required to convince infrastructure engineers to be open to the proposal.
- adaptive to heterogeneity: this alternative solution is not burdened by the easily-overlooked assumption in the static optimization problem that there is sufficient ‘mixing’ to ensure that the resulting crawl rate on a host will be effectively uniform over time. A consequence of such an assumption is that there’s real potential to thrash a host, i.e., exhibit a lot of volatility in the crawl demand imposed on the host. On the other hand, our alternative approach is more adaptive, and from the perspective of the host, less volatile. This difference in behavior between the two approaches is evident under a common scenario — a new host comes online, and Google search quickly crawls all the new web pages (over a day or two). As these new web pages initially are not well-differentiated, a solution emitted by a solver to the optimization problem would suggest that all web pages be recrawled roughly $x$ days later — meaning no recrawls are initially requested for some time, and then all the web pages on the host would be scheduled to be recrawled over a short duration. If the available crawl rate on the host is restricted when it is time to recrawl, such behavior will be suboptimal. With our new solution, if recrawls would otherwise be bunched, the threshold at which web pages will be recrawled will fall downwards until reaching $\lambda_0$. Consequently, some pages will be crawled sooner than suggested by the solution to the static optimization problem, with an active spreading of the recrawls as the crawl capacity is fully utilized much earlier.
- debuggable: one may need to debug why a web page was recrawled (or hasn’t been recrawled). Debugging of an individual web page is decoupled into two parts: 1) how the web page itself influences when it will be recrawled (its crawl value function is a function of only its own parameters), and 2) how the other web pages on a host will influence when it will be recrawled, as they collectively determine $\lambda_i$. This provides more intuition and clarity when compared to an optimization framework that merely emits optimal recrawl periods — one can understand how changes in a web page’s underlying parameters, its value (weight) and estimated change rate will tend to drive changes in its recrawl rate.
- continuous: the solution in practice much more closely mirrors a streaming solution. The only meaningful batch component is repeated scanning of lists to re-sort. This can be controlled by splitting data and parallelizing such scans, or by choosing how deep to scan — familiar workarounds to Google infrastructure engineers.
- responsive to external changes: the solution can be responsive to changes on a host. As mentioned earlier, we have found that in practice the global crawl KKT multiplier $\lambda_0$ is rather stable (conceptually, it is determined by the aggregated behavior of many independent hosts). This is less the case for many host-specific KKT multipliers (in some cases this is due to recrawl demand being non-uniform, the concern highlighted above).
- easily generalized: generalizing the approach to a probabilistic setting (assuming the estimated parameters are distributions rather than point estimates) isn’t a significant increase in complexity.
We’re going to sweep under the rug some additional issues that needed to be resolved, e.g., the crawl value functions are bounded, meaning the optimal solution would elect never to recrawl some pages. However, we hopefully made our point that often much simpler solutions exist, which can also much better align with existing software system design.
Conclusions
In this post we described our experience of a very specific problem, but our message to data scientists is broader. Finding the right abstraction is always important in data science, but it is absolutely crucial when designing algorithms to control the behavior of complex software systems. Black-box solutions to well-understood mathematical problems often don't work well in this capacity because of the assumptions they make. This mismatch in assumptions can lead to additional system complexity around the black box, especially in addressing issues of robustness and responsiveness to other parts of the system. Worst of all, the solution may expose a poor conceptual model to the infrastructure team whose job it is to keep the system running. The people on pager duty might not appreciate the elegance of your solution when they have no easy means of debugging when things go wrong.
In projects where decision-making is to be deployed to control a complex software system, we run a greater risk of incurring a Type III error — giving the right answer to the wrong problem. But with additional diligence and humility, it is possible to avoid such a scenario.