
Commit 2948bfb

committed Oct 3, 2017
resolve local merge conflicts
2 parents 3c1c18e + f234c97

10 files changed: +115 -2 lines

README.md

+2 -1

@@ -1,2 +1,3 @@
 # mudrod.github.io
-A website for the MUDROD semantic discovery and search project funded by NASA AIST (NNX15AM85G), part of OceanWorks
+
+A website for the MUDROD semantic discovery and search project funded by NASA AIST (NNX15AM85G), part of OceanWorks, now hosted at https://github.com/aist-oceanworks/mudrod

@@ -0,0 +1,45 @@
---
layout: post
title: "An introduction to MUDROD ranking algorithm"
categories: weekly update
---

When a user types keywords into a search engine, there are typically hundreds or even thousands of datasets related to the given query. Although a high level of recall can be useful in some cases, the user is usually interested in a much smaller subset. Current search engines in most geospatial data portals tend to induce end users to focus on one single data characteristic/feature dimension (e.g., spatial resolution), which often results in a less than optimal user experience (Ghose, Ipeirotis, and Li 2012).

To overcome this fundamental ranking problem, we therefore 1) identify a number of ranking features of geospatial data that represent users’ multidimensional preferences, considering semantics, user behaviour, spatial similarity, and static dataset metadata attributes; and 2) apply a machine learning method to automatically learn, from a training set, a function capable of ranking geospatial data according to those features.

Within the ranking process, each query is associated with a set of datasets, and each dataset can be represented as a feature vector. The eleven features listed below were identified by considering user behaviour, query-text matching, and common geospatial metadata attributes.

| Query-dependent features |
| ------------------------ |
| Lucene relevance score |
| Semantic popularity |
| Spatial similarity |

| Query-independent features |
| -------------------------- |
| Release date |
| Processing level |
| Version number |
| Spatial resolution |
| Temporal resolution |
| All-time popularity |
| Monthly popularity |
| User popularity |

RankSVM (Joachims 2002), a well-recognized learning-to-rank approach, is selected to learn the feature weights used to rank search results. In RankSVM, ranking is transformed into a pairwise classification task in which a classifier is trained to predict the ranking order of data pairs.

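As an illustration, here is a minimal sketch of that pairwise transformation. The toy data, the eleven-feature shape, and the use of scikit-learn’s linear SVM are our own assumptions for the sketch, not MUDROD’s actual implementation:

```python
import numpy as np
from sklearn.svm import LinearSVC

def to_pairwise(X, y):
    """Turn per-dataset feature vectors X with relevance labels y into
    difference vectors labeled by which dataset should rank higher."""
    X_pairs, y_pairs = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:  # dataset i should rank above dataset j
                X_pairs.append(X[i] - X[j])
                y_pairs.append(1)
                X_pairs.append(X[j] - X[i])
                y_pairs.append(-1)
    return np.array(X_pairs), np.array(y_pairs)

# Toy training set: one row of the eleven ranking features per candidate
# dataset, with human-labelled relevance grades for a single query.
rng = np.random.default_rng(0)
X = rng.random((20, 11))
y = rng.integers(0, 3, size=20)

X_p, y_p = to_pairwise(X, y)
model = LinearSVC().fit(X_p, y_p)    # learned weights act as feature weights
scores = X @ model.coef_.ravel()     # score each candidate dataset
ranking = np.argsort(-scores)        # indices sorted best-first
```
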
<center>
<img src="/images/ranking.png">
Figure 1. System workflow and architecture
</center>

The proposed architecture primarily consists of six components: a semantic knowledge base, a geocoding service, a search index, a feature extractor, a learning algorithm, and a ranking model (Figure 1). When a user submits a query, it is converted into a semantic query and a geographical bounding box by the semantic knowledge base and the geocoding service. The search index then returns the top K results for the semantic query combined with the bounding box. Next, the feature extractor extracts the ranking features, including the semantic click score, for each of the search results. Once all the features are prepared, the top K results are passed to a pre-trained ranking model, which re-ranks them. Because the index in this architecture can be any Lucene-based software, the design keeps the data portal’s software structure loosely coupled and avoids the cost of replacing the existing system.

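In rough Python, the end-to-end flow might look like the sketch below. Every function here is a stub standing in for a real component (knowledge base, geocoder, Lucene-based index); all names are illustrative placeholders, not MUDROD’s actual API:

```python
# Stub stand-ins for the pipeline components described above.
def expand(query):                     # semantic knowledge base (stub)
    return {query, "sst"} if "temperature" in query else {query}

def bounding_box(query):               # geocoding service (stub)
    return (-180.0, -90.0, 180.0, 90.0)

def retrieve(terms, bbox, k):          # Lucene-based search index (stub)
    return [f"dataset-{i}" for i in range(k)]

def extract_features(query, dataset):  # feature extractor (stub)
    return [(hash((query, dataset, f)) % 100) / 100 for f in range(11)]

def search(query, weights, k=10):
    """Query -> semantic expansion + bbox -> top-K retrieval ->
    feature extraction -> re-ranking with pre-trained weights."""
    candidates = retrieve(expand(query), bounding_box(query), k)
    scored = [(sum(w * x for w, x in zip(weights, extract_features(query, c))), c)
              for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)]

print(search("sea surface temperature", weights=[1.0] * 11))
```
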
References:

* Ghose, Anindya, Panagiotis G. Ipeirotis, and Beibei Li. 2012. "Designing ranking systems for hotels on travel search engines by mining user-generated and crowdsourced content." Marketing Science 31 (3): 493-520.

* Joachims, Thorsten. 2002. "Optimizing search engines using clickthrough data." In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

@@ -0,0 +1,17 @@
---
layout: post
title: "An introduction to MUDROD recommendation algorithm"
categories: weekly update
---

With the recent advances in remote sensing satellites and other sensors, geographic datasets have been growing faster than ever. In response, a number of Spatial Data Infrastructure (SDI) components (e.g., catalogues and portals) have been developed to archive those datasets and make them available online. However, finding the right data for scientific research and application development is still a challenge due to the lack of data relevancy information.

Recommendation systems have become extremely common in recent years and are utilized in a variety of areas to help users quickly find useful information. We propose a recommendation system that improves geographic data discovery by mining and utilizing metadata and usage logs. Metadata abstracts are processed with natural language processing methods to find semantic relationships between metadata. Metadata variables are used to calculate spatial and temporal similarity between metadata. In addition, portal logs are analysed to capture user preferences.

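For instance, the spatial and temporal parts can be reduced to range-overlap measures. Here is a toy Jaccard-style version; the formula is our assumption for illustration, not necessarily MUDROD’s exact measure:

```python
def overlap_ratio(a_start, a_end, b_start, b_end):
    """Jaccard-style overlap of two 1-D ranges, usable for time spans
    or for each axis of a latitude/longitude bounding box."""
    intersection = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = max(a_end, b_end) - min(a_start, b_start)
    return intersection / union if union > 0 else 0.0

# Temporal similarity of two datasets covering 2002-2010 and 2005-2015
print(overlap_ratio(2002, 2010, 2005, 2015))  # 5 / 13, about 0.38
```
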
<center>
<img src="/images/recommendation.png">
Figure 1. Recommendation workflow
</center>

The system starts by pre-processing raw web logs and metadata (Figure 1). After the pre-processing step, sessions are reconstructed from the raw web logs and then used to calculate session-based metadata similarity. Metadata are harvested from PO.DAAC web service APIs. Metadata variable values are then converted to a common unit so that metadata content similarity can be calculated. All of these similarities are calculated offline and stored in Elasticsearch. Once a user views a metadata record, the system finds the top-k related metadata records with hybrid recommendation. The hybrid recommendation module integrates the results of the content-based and session-based recommendation methods and ranks the final recommendation list in descending order of similarity.

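A minimal sketch of that hybrid merge step follows. The equal weighting, the dictionary format, and the dataset ids are toy assumptions, not MUDROD’s actual code:

```python
def hybrid_recommend(content_sim, session_sim, k=5, alpha=0.5):
    """Blend offline content-based and session-based similarity scores
    and return the top-k dataset ids in descending order of similarity."""
    candidates = set(content_sim) | set(session_sim)
    blended = {
        d: alpha * content_sim.get(d, 0.0) + (1 - alpha) * session_sim.get(d, 0.0)
        for d in candidates
    }
    return sorted(blended, key=blended.get, reverse=True)[:k]

# Precomputed similarities to the metadata record the user is viewing
content_sim = {"ghrsst-l4": 0.9, "avhrr-sst": 0.7, "modis-chl": 0.2}
session_sim = {"ghrsst-l4": 0.6, "ascat-wind": 0.5}
print(hybrid_recommend(content_sim, session_sim))
```
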
@@ -0,0 +1,17 @@
---
layout: post
title: "An introduction to MUDROD vocabulary similarity calculation algorithm"
categories: weekly update
---

Big geospatial data have been produced, archived, and made available online, but finding the right data for scientific research and decision-support applications remains a significant challenge. A long-standing problem in data discovery is how to locate, assimilate, and utilize the semantic context of a given query. Most past research in the geospatial domain attempts to solve this problem through one of two approaches: 1) building a domain-specific ontology manually; or 2) discovering semantic relationships from dataset metadata automatically using machine learning techniques. The former captures rich expert knowledge but is static, costly, and labour intensive, while the latter is automatic but prone to noise.

An emerging trend in information science is to take advantage of large-scale user search history, which is dynamic but contains user- and crawler-generated noise. Leveraging the benefits of all three approaches while avoiding their weaknesses, a novel approach is proposed in this article to 1) discover vocabulary semantic relationships from user clickstream; 2) refine the similarity calculation methods of an existing ontology; and 3) integrate the results of ontology, metadata, user search history, and clickstream analysis to better determine semantic relationships.

<center>
<img src="/images/vocabulary.png">
Figure 1. System workflow and architecture
</center>

The system starts by pre-processing the raw web logs, metadata, and ontology (Figure 1). After the pre-processing step, search history and clickstream data are extracted from the raw logs, selected properties are extracted from the metadata, and ocean-related triples are extracted from the SWEET ontology. These four types of processed data are then fed into their corresponding processors as discussed in the last section. Once all the processors finish their jobs, the results of the different methods are integrated to produce a final list of the most related terms.

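A minimal sketch of that final integration step follows. The linear weighting and the toy similarity tables are our assumptions; the actual combination scheme is MUDROD’s own:

```python
def integrate(term, sources, weights):
    """Combine per-method similarity tables into one descending list
    of terms related to `term`."""
    combined = {}
    for name, table in sources.items():
        for related, sim in table.get(term, {}).items():
            combined[related] = combined.get(related, 0.0) + weights[name] * sim
    return sorted(combined.items(), key=lambda kv: -kv[1])

# Toy outputs of the four processors for the term "sst"
sources = {
    "clickstream":    {"sst": {"sea surface temperature": 0.9}},
    "search_history": {"sst": {"ghrsst": 0.8}},
    "metadata":       {"sst": {"sea surface temperature": 0.7}},
    "ontology":       {"sst": {"ocean temperature": 0.6}},
}
weights = dict.fromkeys(sources, 1.0)
print(integrate("sst", sources, weights))
```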

images/architecture.jpg (22.9 KB)

images/ranking.png (143 KB)

images/recommendation.png (21.4 KB)

images/vocabulary.png (134 KB)

introduction.md

+33
@@ -0,0 +1,33 @@
---
layout: page
title: Introduction
permalink: /introduction/
---

## Mining and Utilizing Dataset Relevancy from Oceanographic Dataset (MUDROD) Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access

*Bookmark this page!*

**Background and goal**: Massive amounts of geospatial datasets are archived and made available through online web discovery and access methods. However, finding the right data for scientific research and application development is still a challenge. We propose to mine and utilize the combination of Earth Science dataset metadata, usage metrics, and user feedback to objectively extract relevance for improved data discovery and access across a NASA Distributed Active Archive Center (DAAC) and other data centers. As a point of reference, the Physical Oceanography Distributed Active Archive Center (PO.DAAC) aims to provide datasets that help scientists select the Earth observation data that best fit their needs in various aspects of physical oceanography.

**What are the problems in Earth data discovery?**

* **Only keyword matching, no semantic context**: If a user searches for “sea surface temperature,” a traditional search engine understands it as “sea AND surface AND temperature,” but the real intent of the user might be “sea surface temperature” OR “sst” OR “ghrsst” OR …

* **Single-dimension-based ranking**: There are typically hundreds or even thousands of search hits for a keyword query. Many websites provide some sort of metric (e.g., spatial resolution, processing level) to help sort the hits. This can be helpful, but it also induces users to focus on one single data characteristic. What if a user is looking for a dataset with both high spatial resolution and a high processing level?

* **Lack of data-to-data relevancy**: In theory, there exist hidden relationships among the data hosted within a data center or across different data centers. In reality, we can only view a particular dataset after clicking on it, without knowing its related data.

**How does MUDROD address them?**

<center>
<img src="/images/architecture.jpg">
Figure 1. System architecture
</center>

* **Query expansion and suggestion**: Rather than manually creating a domain ontology, MUDROD discovers latent semantic relationships by analyzing web logs and metadata documents. The process can be thought of as a kind of statistical analysis. The hypothesis for user behavior analysis is that similar queries result in similar clicking behaviors: if two queries are similar, the data clicked for them will be similar in the context of large-scale user behaviors. The analysis yields a similarity score between any pair of query terms. For example, we found that the similarity between “ocean temperature” and “sea surface temperature” is nearly one, meaning they are essentially interchangeable. These similarities are very helpful for understanding users’ search intent. For example, an original query “ocean temperature” can be expanded to “ocean temperature (1.0) OR sea surface temperature (1.0)”. This converted query has been proven to improve both search recall and precision. (A small sketch of the expansion step follows this list.)

* **Learning to rank**: After discussing with domain experts, we identified eleven features that reflect users’ search interests. These features primarily come from three aspects: user behavior, query-metadata overlap, and metadata attributes. From there, we trained a machine learning ranking algorithm on human-labelled query results. The reason we use machine learning here is that it is difficult to weight these features by hand, especially as their number may grow down the road. The learned model is then used to predict rankings for unseen queries.

* **Recommendation**: To find the relatedness of different datasets, two types of information are considered: user behavior and metadata attributes. For the metadata attributes, it is straightforward: we simply compare the text of the metadata documents. For the user-behavior-based recommendation, we find the user(s) most similar to you and then find the data they have clicked that you haven’t. Merging the results of these two methods yields the best recommendations.
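
As promised in the query expansion bullet above, here is a minimal sketch of the expansion step. The similarity table, threshold, and function name are toy assumptions for illustration, not MUDROD’s actual code:

```python
def expand_query(query, similarity, threshold=0.9):
    """Rewrite a query as an OR of itself and its highly similar terms,
    annotating each term with its similarity score."""
    terms = [(query, 1.0)] + [
        (t, s) for t, s in similarity.get(query, {}).items() if s >= threshold
    ]
    return " OR ".join(f"{t} ({s:.1f})" for t, s in terms)

# Similarities mined offline from web logs and metadata documents
similarity = {"ocean temperature": {"sea surface temperature": 1.0, "sst": 0.9}}
print(expand_query("ocean temperature", similarity))
# ocean temperature (1.0) OR sea surface temperature (1.0) OR sst (0.9)
```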

team-info.md

+1 -1

@@ -16,4 +16,4 @@ permalink: /team-info/
 * David Moroni - [Jet Propulsion Laboratory](http://www.jpl.nasa.gov/), [NASA](http://www.nasa.gov)
 * Chris Finch - [Jet Propulsion Laboratory](http://www.jpl.nasa.gov/), [NASA](http://www.nasa.gov)
 * [Lewis John Mcgibbney](https://www.linkedin.com/in/lmcgibbney) - [Jet Propulsion Laboratory](http://www.jpl.nasa.gov/), [NASA](http://www.nasa.gov)
-
+* [Frank Greguska](https://www.linkedin.com/in/frankgreguska/) - [Jet Propulsion Laboratory](http://www.jpl.nasa.gov/), [NASA](http://www.nasa.gov)
