Conversation

@anshuman9468 anshuman9468 commented Jan 17, 2026

The goal of this project is to implement a dataset recommendation system for the DBpedia Databus that assists users in selecting appropriate datasets for various types of analysis. By analyzing dataset metadata such as domain, structure, update frequency, and provenance, the system will recommend relevant datasets, improving discoverability, usability, and efficiency in data analysis workflows.

Issue link: #39

Summary by CodeRabbit

  • New Features
    • Added dataset recommendation feature enabling users to search and discover relevant datasets by keyword
    • Enhanced chatbot to recognize multiple natural language patterns for dataset queries (e.g., "recommend datasets for X", "search databus for X")
    • Recommended datasets are displayed with labels, descriptions (truncated for readability), and direct links to dataset resources


coderabbitai bot commented Jan 17, 2026

📝 Walkthrough

A new dataset recommendation feature is introduced that queries the Databus SPARQL endpoint for datasets matching a given keyword. The functionality is integrated through template routing, request type configuration, and RiveScript patterns supporting various user query variations.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Service Layer**<br>`src/main/java/chatbot/lib/api/DatabusService.java` | New service class with `buildQuery()` for prepending shared SPARQL prefixes and `getRecommendedDatasets()` executing SPARQL queries against the Databus endpoint with a configurable timeout, truncation of descriptions to 200 chars, and mapping of results to `ResponseData` objects (a hedged sketch follows this table). |
| **Request Configuration**<br>`src/main/java/chatbot/lib/request/TemplateType.java` | Added constant `DBPEDIA_DATABUS_RECOMMENDATION` with value `"dbpedia-databus-recommendation"` for template type routing. |
| **Handler Integration**<br>`src/main/java/chatbot/lib/handlers/TemplateHandler.java`, `src/main/java/chatbot/lib/handlers/templates/dbpedia/DatasetTemplateHandler.java` | TemplateHandler adds a routing case for `DBPEDIA_DATABUS_RECOMMENDATION`. DatasetTemplateHandler introduces handling logic that invokes DatabusService to fetch and format recommended datasets as a carousel or text response; minor formatting adjustments are applied to existing template construction. |
| **RiveScript Patterns**<br>`src/main/resources/rivescript/dbpedia-databus.rive` | New template patterns for dataset recommendations covering variations such as `recommend datasets for *`, `find * dataset`, `search databus for *`, `recommend * data`, `dataset recommendation system []`, and a primary catch-all pattern. |
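
A hedged sketch of the service shape described in the first row follows. This is not the code submitted in this PR: the endpoint URL, the `ResponseData` API (constructor and setters), and the omission of the configurable timeout are assumptions made purely for illustration.

```java
// Hedged sketch only: approximates the behavior described in the walkthrough table,
// not the code actually submitted in this PR. The endpoint URL and the ResponseData
// API (a project class; its import, constructor, and setters) are assumptions.
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

import java.util.ArrayList;
import java.util.List;

public class DatabusService {

    private static final String ENDPOINT = "https://databus.dbpedia.org/sparql"; // assumed endpoint
    private static final String PREFIXES =
            "PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>\n" +
            "PREFIX dct: <http://purl.org/dc/terms/>\n" +
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n";

    // Prepend the shared SPARQL prefixes to a query body.
    private String buildQuery(String body) {
        return PREFIXES + body;
    }

    // Query the Databus for groups whose label or description matches the keyword,
    // truncate descriptions to 200 characters, and map the rows to ResponseData objects.
    public List<ResponseData> getRecommendedDatasets(String keyword) {
        List<ResponseData> results = new ArrayList<>();
        String query = buildQuery(
                "SELECT DISTINCT ?dataset ?label ?description WHERE {\n" +
                "  ?dataset a dataid:Group ;\n" +
                "           rdfs:label ?label ;\n" +
                "           dct:description ?description .\n" +
                "  FILTER (regex(?label, \"" + keyword + "\", \"i\")" +
                " || regex(?description, \"" + keyword + "\", \"i\"))\n" +
                "} LIMIT 5");
        // The configurable timeout mentioned in the walkthrough is omitted here for brevity.
        try (QueryExecution qexec = QueryExecutionFactory.sparqlService(ENDPOINT, query)) {
            ResultSet rs = qexec.execSelect();
            while (rs.hasNext()) {
                QuerySolution sol = rs.next();
                ResponseData data = new ResponseData();                 // assumed no-arg constructor
                data.setTitle(sol.getLiteral("label").getString());     // assumed setter
                String desc = sol.getLiteral("description").getString();
                data.setText(desc.length() > 200 ? desc.substring(0, 200) + "..." : desc);
                data.setUri(sol.getResource("dataset").getURI());       // assumed setter
                results.add(data);
            }
        }
        return results;
    }
}
```

The keyword is concatenated directly into the FILTER here only to mirror the behavior summarized above; the review comments further down flag exactly that as a SPARQL injection risk and suggest safer alternatives.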

Sequence Diagram

sequenceDiagram
    actor User
    participant RiveScript
    participant TemplateHandler
    participant DatasetTemplateHandler
    participant DatabusService
    participant DatabusEndpoint as Databus SPARQL<br/>Endpoint

    User->>RiveScript: User query (e.g., "recommend datasets for health")
    RiveScript->>RiveScript: Pattern match to dbpedia-databus-recommendation
    RiveScript->>TemplateHandler: Request with TemplateType.DBPEDIA_DATABUS_RECOMMENDATION
    TemplateHandler->>DatasetTemplateHandler: Route to handler
    DatasetTemplateHandler->>DatabusService: Call getRecommendedDatasets(keyword)
    DatabusService->>DatabusService: buildQuery() with SPARQL prefixes
    DatabusService->>DatabusEndpoint: Execute SPARQL SELECT query with<br/>keyword filter and limit
    DatabusEndpoint->>DatabusService: Return ResultSet
    DatabusService->>DatabusService: Parse results, truncate descriptions,<br/>map to ResponseData objects
    DatabusService->>DatasetTemplateHandler: Return List<ResponseData>
    DatasetTemplateHandler->>DatasetTemplateHandler: Format as carousel or text response
    DatasetTemplateHandler->>RiveScript: Return ResponseData
    RiveScript->>User: Display recommendations

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'Implement Dataset recommendation system in databus for several analysis...' is partially related to the changeset but incomplete and truncated. | Complete the title without truncation and ensure it clearly specifies what type of analysis or which specific aspect is being addressed. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/main/java/chatbot/lib/api/DatabusService.java`:
- Line 42: The line logger.info("Querying Databus: " + sparqlQuery) in
DatabusService logs raw user input; change it to avoid exposing user-provided
keywords by either logging at DEBUG level (logger.debug(...)) or logging a
sanitized/redacted version of the query (e.g., build a sanitizedQuery that masks
the user keyword) and use that in the log; update the call site where
sparqlQuery is constructed (in the same DatabusService method) to produce
sanitizedQuery or maskedKeyword and log that instead of the full sparqlQuery.
- Around line 38-40: The SPARQL query concatenates the user-provided variable
`keyword` directly, causing a SPARQL injection risk; update the code in
DatabusService where the query string is built (the fragment that uses
`keyword`) to avoid direct concatenation by using Jena's
ParameterizedSparqlString (or an equivalent parameterized API) and bind the
keyword as a literal (or escape/normalize it) before inserting into the FILTER
regex, or alternatively properly escape quotes/backslashes and regex
metacharacters in `keyword` before concatenation; ensure the code uses
ParameterizedSparqlString.setLiteral/setEscapedString (or a safe escaping
helper) instead of string concatenation so malicious input cannot alter the
query structure.

In `@src/main/java/chatbot/lib/handlers/TemplateHandler.java`:
- Around line 68-73: The switch in TemplateHandler that routes template types to
DatasetTemplateHandler is missing the TemplateType.DBPEDIA_DATASET_ONTOLOGY
case, causing ontology requests to fall back; update the switch in
TemplateHandler.java to include TemplateType.DBPEDIA_DATASET_ONTOLOGY alongside
DBPEDIA_DATASET, DBPEDIA_DATASET_NLP, and DBPEDIA_DATABUS_RECOMMENDATION so that
responseGenerator is set via new DatasetTemplateHandler(request, payload,
helper).handleTemplateMessage() for ontology requests.
🧹 Nitpick comments (2)
src/main/java/chatbot/lib/api/DatabusService.java (2)

34-37: Consider making description optional to avoid missing datasets.

The query requires both ?label and ?description to be present. Datasets without descriptions will be excluded from results. Consider using OPTIONAL for the description field.

♻️ Proposed fix
     String sparqlQuery = buildQuery(
             "SELECT DISTINCT ?dataset ?label ?description WHERE {\n" +
                     "  ?dataset a dataid:Group .\n" +
                     "  ?dataset rdfs:label ?label .\n" +
-                    "  ?dataset dct:description ?description .\n" +
+                    "  OPTIONAL { ?dataset dct:description ?description . }\n" +
                     "  FILTER (regex(?label, \"" + keyword + "\", \"i\") || regex(?description, \"" + keyword
                     + "\", \"i\"))\n" +
                     "} LIMIT 5");

21-25: Remove redundant new String() wrapper.

The new String(...) constructor is unnecessary when the argument is already a string literal. This adds no value and creates an extra object.

♻️ Proposed fix
-    private static final String PREFIXES = new String(
-            "PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>\n" +
-                    "PREFIX dct: <http://purl.org/dc/terms/>\n" +
-                    "PREFIX dcat: <http://www.w3.org/ns/dcat#>\n" +
-                    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n");
+    private static final String PREFIXES =
+            "PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>\n" +
+            "PREFIX dct: <http://purl.org/dc/terms/>\n" +
+            "PREFIX dcat: <http://www.w3.org/ns/dcat#>\n" +
+            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n";

Comment on lines +38 to +40
" FILTER (regex(?label, \"" + keyword + "\", \"i\") || regex(?description, \"" + keyword
+ "\", \"i\"))\n" +
"} LIMIT 5");

⚠️ Potential issue | 🔴 Critical

Critical: SPARQL injection vulnerability.

The keyword parameter is directly concatenated into the SPARQL query without sanitization. A malicious user could inject SPARQL syntax to manipulate the query behavior.

For example, input like `test", "i")) . ?x <http://evil> ?y . FILTER(regex(?label, "` could alter the query structure.

🔒 Proposed fix to sanitize input
 public List<ResponseData> getRecommendedDatasets(String keyword) {
     List<ResponseData> responseDataList = new ArrayList<>();
+    // Sanitize keyword: escape backslashes and quotes for SPARQL string literal
+    String sanitizedKeyword = keyword
+        .replace("\\", "\\\\")
+        .replace("\"", "\\\"");
     String sparqlQuery = buildQuery(
             "SELECT DISTINCT ?dataset ?label ?description WHERE {\n" +
                     "  ?dataset a dataid:Group .\n" +
                     "  ?dataset rdfs:label ?label .\n" +
                     "  ?dataset dct:description ?description .\n" +
-                    "  FILTER (regex(?label, \"" + keyword + "\", \"i\") || regex(?description, \"" + keyword
+                    "  FILTER (regex(?label, \"" + sanitizedKeyword + "\", \"i\") || regex(?description, \"" + sanitizedKeyword
                     + "\", \"i\"))\n" +
                     "} LIMIT 5");
🤖 Prompt for AI Agents
In `@src/main/java/chatbot/lib/api/DatabusService.java` around lines 38 - 40, The
SPARQL query concatenates the user-provided variable `keyword` directly, causing
a SPARQL injection risk; update the code in DatabusService where the query
string is built (the fragment that uses `keyword`) to avoid direct concatenation
by using Jena's ParameterizedSparqlString (or an equivalent parameterized API)
and bind the keyword as a literal (or escape/normalize it) before inserting into
the FILTER regex, or alternatively properly escape quotes/backslashes and regex
metacharacters in `keyword` before concatenation; ensure the code uses
ParameterizedSparqlString.setLiteral/setEscapedString (or a safe escaping
helper) instead of string concatenation so malicious input cannot alter the
query structure.
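
To make the suggested fix concrete, here is a minimal sketch of the `ParameterizedSparqlString` variant, assuming Apache Jena is already on the classpath and reusing the `PREFIXES` constant from this class; variable names are otherwise illustrative:

```java
// Hedged sketch of the parameterized alternative; not the PR's actual code.
// Requires: import org.apache.jena.query.ParameterizedSparqlString;
ParameterizedSparqlString pss = new ParameterizedSparqlString();
pss.setCommandText(PREFIXES +
        "SELECT DISTINCT ?dataset ?label ?description WHERE {\n" +
        "  ?dataset a dataid:Group .\n" +
        "  ?dataset rdfs:label ?label .\n" +
        "  ?dataset dct:description ?description .\n" +
        "  FILTER (regex(?label, ?keyword, \"i\") || regex(?description, ?keyword, \"i\"))\n" +
        "} LIMIT 5");
// Bind the user input as a plain string literal; Jena quotes and escapes it,
// so it cannot terminate the literal and alter the query structure.
pss.setLiteral("keyword", keyword);
String sparqlQuery = pss.toString();
// Note: regex metacharacters in the keyword are still interpreted by regex(),
// so escaping or normalizing them separately may still be worthwhile.
```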

+ "\", \"i\"))\n" +
"} LIMIT 5");

logger.info("Querying Databus: " + sparqlQuery);

⚠️ Potential issue | 🟡 Minor

Avoid logging user input directly.

The SPARQL query containing user input is logged at INFO level. If the keyword contains sensitive data or the injection issue is exploited, this could expose sensitive information in logs.

Consider logging at DEBUG level or redacting the user-provided keyword:

-        logger.info("Querying Databus: " + sparqlQuery);
+        logger.debug("Querying Databus with keyword: {}", keyword);
🤖 Prompt for AI Agents
In `@src/main/java/chatbot/lib/api/DatabusService.java` at line 42, The line
logger.info("Querying Databus: " + sparqlQuery) in DatabusService logs raw user
input; change it to avoid exposing user-provided keywords by either logging at
DEBUG level (logger.debug(...)) or logging a sanitized/redacted version of the
query (e.g., build a sanitizedQuery that masks the user keyword) and use that in
the log; update the call site where sparqlQuery is constructed (in the same
DatabusService method) to produce sanitizedQuery or maskedKeyword and log that
instead of the full sparqlQuery.
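
If masking is preferred over dropping to DEBUG, here is a minimal sketch of the redaction route (`maskKeyword` is a hypothetical helper, not existing project code):

```java
// Hedged sketch: redact the user-supplied keyword before logging.
// maskKeyword is a hypothetical helper; sparqlQuery and keyword are the locals
// already present in getRecommendedDatasets.
private static String maskKeyword(String sparqlQuery, String keyword) {
    return (keyword == null || keyword.isEmpty())
            ? sparqlQuery
            : sparqlQuery.replace(keyword, "<redacted>");
}

// At the call site:
// logger.debug("Querying Databus: {}", maskKeyword(sparqlQuery, keyword));
```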

Comment on lines 68 to 73
// DBpedia Datset Scenarios
case TemplateType.DBPEDIA_DATASET:
case TemplateType.DBPEDIA_DATASET_NLP:
case TemplateType.DBPEDIA_DATABUS_RECOMMENDATION:
responseGenerator = new DatasetTemplateHandler(request, payload, helper).handleTemplateMessage();
break;

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg -n "DBPEDIA_DATASET_ONTOLOGY" --type java

Repository: dbpedia/chatbot

Length of output: 319


🏁 Script executed:

# Get line count and view the relevant sections of TemplateHandler.java
wc -l src/main/java/chatbot/lib/handlers/TemplateHandler.java

Repository: dbpedia/chatbot

Length of output: 118


🏁 Script executed:

# Search for all case statements in TemplateHandler.java to see complete routing
rg -n "case TemplateType\." src/main/java/chatbot/lib/handlers/TemplateHandler.java

Repository: dbpedia/chatbot

Length of output: 1901


🏁 Script executed:

# Check DatasetTemplateHandler.java around line 51 to see DBPEDIA_DATASET_ONTOLOGY handling
sed -n '45,65p' src/main/java/chatbot/lib/handlers/templates/dbpedia/DatasetTemplateHandler.java

Repository: dbpedia/chatbot

Length of output: 1513


🏁 Script executed:

# View the full TemplateHandler.java to see the default case
sed -n '1,150p' src/main/java/chatbot/lib/handlers/TemplateHandler.java | tail -60

Repository: dbpedia/chatbot

Length of output: 2640


🏁 Script executed:

# Also check if there's a default case after all the cases
rg -n "default:" src/main/java/chatbot/lib/handlers/TemplateHandler.java -A 5

Repository: dbpedia/chatbot

Length of output: 245


Add DBPEDIA_DATASET_ONTOLOGY to the routing switch statement.

The DatasetTemplateHandler contains a case for DBPEDIA_DATASET_ONTOLOGY (lines 51-60), but this template type is not included in the routing switch in TemplateHandler.java. Currently only DBPEDIA_DATASET, DBPEDIA_DATASET_NLP, and DBPEDIA_DATABUS_RECOMMENDATION are routed there. Without this case, ontology-related requests will fall through to the default handler and receive a generic fallback response instead of the proper ontology information.

Fix: Add the missing case
            case TemplateType.DBPEDIA_DATASET:
            case TemplateType.DBPEDIA_DATASET_NLP:
            case TemplateType.DBPEDIA_DATASET_ONTOLOGY:
            case TemplateType.DBPEDIA_DATABUS_RECOMMENDATION:
                responseGenerator = new DatasetTemplateHandler(request, payload, helper).handleTemplateMessage();
                break;
🤖 Prompt for AI Agents
In `@src/main/java/chatbot/lib/handlers/TemplateHandler.java` around lines 68 -
73, The switch in TemplateHandler that routes template types to
DatasetTemplateHandler is missing the TemplateType.DBPEDIA_DATASET_ONTOLOGY
case, causing ontology requests to fall back; update the switch in
TemplateHandler.java to include TemplateType.DBPEDIA_DATASET_ONTOLOGY alongside
DBPEDIA_DATASET, DBPEDIA_DATASET_NLP, and DBPEDIA_DATABUS_RECOMMENDATION so that
responseGenerator is set via new DatasetTemplateHandler(request, payload,
helper).handleTemplateMessage() for ontology requests.
