Implement Dataset recommendation system in databus for several analysis... #64
base: master
Conversation
📝 Walkthrough

A new dataset recommendation feature is introduced that queries the Databus SPARQL endpoint for datasets matching a given keyword. The functionality is integrated through template routing, request type configuration, and RiveScript patterns supporting various user query variations.
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant RiveScript
    participant TemplateHandler
    participant DatasetTemplateHandler
    participant DatabusService
    participant DatabusEndpoint as Databus SPARQL<br/>Endpoint
    User->>RiveScript: User query (e.g., "recommend datasets for health")
    RiveScript->>RiveScript: Pattern match to dbpedia-databus-recommendation
    RiveScript->>TemplateHandler: Request with TemplateType.DBPEDIA_DATABUS_RECOMMENDATION
    TemplateHandler->>DatasetTemplateHandler: Route to handler
    DatasetTemplateHandler->>DatabusService: Call getRecommendedDatasets(keyword)
    DatabusService->>DatabusService: buildQuery() with SPARQL prefixes
    DatabusService->>DatabusEndpoint: Execute SPARQL SELECT query with<br/>keyword filter and limit
    DatabusEndpoint->>DatabusService: Return ResultSet
    DatabusService->>DatabusService: Parse results, truncate descriptions,<br/>map to ResponseData objects
    DatabusService->>DatasetTemplateHandler: Return List<ResponseData>
    DatasetTemplateHandler->>DatasetTemplateHandler: Format as carousel or text response
    DatasetTemplateHandler->>RiveScript: Return ResponseData
    RiveScript->>User: Display recommendations
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/main/java/chatbot/lib/api/DatabusService.java`:
- Line 42: The line logger.info("Querying Databus: " + sparqlQuery) in
DatabusService logs raw user input; change it to avoid exposing user-provided
keywords by either logging at DEBUG level (logger.debug(...)) or logging a
sanitized/redacted version of the query (e.g., build a sanitizedQuery that masks
the user keyword) and use that in the log; update the call site where
sparqlQuery is constructed (in the same DatabusService method) to produce
sanitizedQuery or maskedKeyword and log that instead of the full sparqlQuery.
- Around line 38-40: The SPARQL query concatenates the user-provided variable
`keyword` directly, causing a SPARQL injection risk; update the code in
DatabusService where the query string is built (the fragment that uses
`keyword`) to avoid direct concatenation by using Jena's
ParameterizedSparqlString (or an equivalent parameterized API) and bind the
keyword as a literal (or escape/normalize it) before inserting into the FILTER
regex, or alternatively properly escape quotes/backslashes and regex
metacharacters in `keyword` before concatenation; ensure the code uses
ParameterizedSparqlString.setLiteral/setEscapedString (or a safe escaping
helper) instead of string concatenation so malicious input cannot alter the
query structure.
In `@src/main/java/chatbot/lib/handlers/TemplateHandler.java`:
- Around line 68-73: The switch in TemplateHandler that routes template types to
DatasetTemplateHandler is missing the TemplateType.DBPEDIA_DATASET_ONTOLOGY
case, causing ontology requests to fall back; update the switch in
TemplateHandler.java to include TemplateType.DBPEDIA_DATASET_ONTOLOGY alongside
DBPEDIA_DATASET, DBPEDIA_DATASET_NLP, and DBPEDIA_DATABUS_RECOMMENDATION so that
responseGenerator is set via new DatasetTemplateHandler(request, payload,
helper).handleTemplateMessage() for ontology requests.
🧹 Nitpick comments (2)
src/main/java/chatbot/lib/api/DatabusService.java (2)
34-37: Consider making `description` optional to avoid missing datasets.

The query requires both `?label` and `?description` to be present. Datasets without descriptions will be excluded from results. Consider using `OPTIONAL` for the description field.

♻️ Proposed fix

```diff
 String sparqlQuery = buildQuery(
     "SELECT DISTINCT ?dataset ?label ?description WHERE {\n" +
     " ?dataset a dataid:Group .\n" +
     " ?dataset rdfs:label ?label .\n" +
-    " ?dataset dct:description ?description .\n" +
+    " OPTIONAL { ?dataset dct:description ?description . }\n" +
     " FILTER (regex(?label, \"" + keyword + "\", \"i\") || regex(?description, \"" + keyword + "\", \"i\"))\n" +
     "} LIMIT 5");
```
21-25: Remove redundant `new String()` wrapper.

The `new String(...)` constructor is unnecessary when the argument is already a string literal. This adds no value and creates an extra object.

♻️ Proposed fix

```diff
-    private static final String PREFIXES = new String(
-            "PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>\n" +
-            "PREFIX dct: <http://purl.org/dc/terms/>\n" +
-            "PREFIX dcat: <http://www.w3.org/ns/dcat#>\n" +
-            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n");
+    private static final String PREFIXES =
+            "PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>\n" +
+            "PREFIX dct: <http://purl.org/dc/terms/>\n" +
+            "PREFIX dcat: <http://www.w3.org/ns/dcat#>\n" +
+            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n";
```
```java
    " FILTER (regex(?label, \"" + keyword + "\", \"i\") || regex(?description, \"" + keyword
        + "\", \"i\"))\n" +
    "} LIMIT 5");
```
Critical: SPARQL injection vulnerability.
The keyword parameter is directly concatenated into the SPARQL query without sanitization. A malicious user could inject SPARQL syntax to manipulate the query behavior.
For example, input like `test", "i")) . ?x <http://evil> ?y . FILTER(regex(?label, "` could alter the query structure.
🔒 Proposed fix to sanitize input
```diff
 public List<ResponseData> getRecommendedDatasets(String keyword) {
     List<ResponseData> responseDataList = new ArrayList<>();
+    // Sanitize keyword: escape backslashes and quotes for SPARQL string literal
+    String sanitizedKeyword = keyword
+            .replace("\\", "\\\\")
+            .replace("\"", "\\\"");
     String sparqlQuery = buildQuery(
         "SELECT DISTINCT ?dataset ?label ?description WHERE {\n" +
         " ?dataset a dataid:Group .\n" +
         " ?dataset rdfs:label ?label .\n" +
         " ?dataset dct:description ?description .\n" +
-        " FILTER (regex(?label, \"" + keyword + "\", \"i\") || regex(?description, \"" + keyword
+        " FILTER (regex(?label, \"" + sanitizedKeyword + "\", \"i\") || regex(?description, \"" + sanitizedKeyword
         + "\", \"i\"))\n" +
         "} LIMIT 5");
```
```java
        + "\", \"i\"))\n" +
        "} LIMIT 5");

    logger.info("Querying Databus: " + sparqlQuery);
```
Avoid logging user input directly.
The SPARQL query containing user input is logged at INFO level. If the keyword contains sensitive data or the injection issue is exploited, this could expose sensitive information in logs.
Consider logging at DEBUG level or redacting the user-provided keyword:
```diff
-    logger.info("Querying Databus: " + sparqlQuery);
+    logger.debug("Querying Databus with keyword: {}", keyword);
```
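If the full query must still be logged at INFO level for operational reasons, masking the user-supplied keyword is an alternative; the helper below is a hypothetical sketch, not code from `DatabusService`.

```java
// Hypothetical masking helper; not part of DatabusService.
public class LogMaskSketch {

    // Replace every occurrence of the user-supplied keyword in the query
    // text with a fixed placeholder before logging.
    public static String maskKeyword(String sparqlQuery, String keyword) {
        if (keyword == null || keyword.isEmpty()) {
            return sparqlQuery;
        }
        return sparqlQuery.replace(keyword, "***");
    }

    public static void main(String[] args) {
        String query = "FILTER (regex(?label, \"health\", \"i\"))";
        // prints: FILTER (regex(?label, "***", "i"))
        System.out.println(maskKeyword(query, "health"));
    }
}
```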
```java
        // DBpedia Datset Scenarios
        case TemplateType.DBPEDIA_DATASET:
        case TemplateType.DBPEDIA_DATASET_NLP:
        case TemplateType.DBPEDIA_DATABUS_RECOMMENDATION:
            responseGenerator = new DatasetTemplateHandler(request, payload, helper).handleTemplateMessage();
            break;
```
🧩 Analysis chain

🏁 Scripts executed (repository: dbpedia/chatbot):

```shell
rg -n "DBPEDIA_DATASET_ONTOLOGY" --type java
# Get line count and view the relevant sections of TemplateHandler.java
wc -l src/main/java/chatbot/lib/handlers/TemplateHandler.java
# Search for all case statements in TemplateHandler.java to see complete routing
rg -n "case TemplateType\." src/main/java/chatbot/lib/handlers/TemplateHandler.java
# Check DatasetTemplateHandler.java around line 51 to see DBPEDIA_DATASET_ONTOLOGY handling
sed -n '45,65p' src/main/java/chatbot/lib/handlers/templates/dbpedia/DatasetTemplateHandler.java
# View the full TemplateHandler.java to see the default case
sed -n '1,150p' src/main/java/chatbot/lib/handlers/TemplateHandler.java | tail -60
# Also check if there's a default case after all the cases
rg -n "default:" src/main/java/chatbot/lib/handlers/TemplateHandler.java -A 5
```
Add DBPEDIA_DATASET_ONTOLOGY to the routing switch statement.
The DatasetTemplateHandler contains a case for DBPEDIA_DATASET_ONTOLOGY (lines 51-60), but this template type is not included in the routing switch in TemplateHandler.java. Currently only DBPEDIA_DATASET, DBPEDIA_DATASET_NLP, and DBPEDIA_DATABUS_RECOMMENDATION are routed there. Without this case, ontology-related requests will fall through to the default handler and receive a generic fallback response instead of the proper ontology information.
Fix: Add the missing case

```java
case TemplateType.DBPEDIA_DATASET:
case TemplateType.DBPEDIA_DATASET_NLP:
case TemplateType.DBPEDIA_DATASET_ONTOLOGY:
case TemplateType.DBPEDIA_DATABUS_RECOMMENDATION:
    responseGenerator = new DatasetTemplateHandler(request, payload, helper).handleTemplateMessage();
    break;
```
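To illustrate why the missing label matters, here is a minimal, self-contained sketch of the case-grouping pattern; the enum and return values below are illustrative stand-ins, since the real TemplateHandler works with the project's own TemplateType constants and handler classes.

```java
// Illustrative sketch of the routing switch; names are stand-ins,
// not the chatbot's actual types.
public class RoutingSketch {

    enum TemplateType {
        DBPEDIA_DATASET,
        DBPEDIA_DATASET_NLP,
        DBPEDIA_DATASET_ONTOLOGY,
        DBPEDIA_DATABUS_RECOMMENDATION,
        SOMETHING_ELSE
    }

    // Grouped case labels fall through to the same handler; any label
    // left out of the group drops to the default branch instead.
    static String route(TemplateType type) {
        switch (type) {
            case DBPEDIA_DATASET:
            case DBPEDIA_DATASET_NLP:
            case DBPEDIA_DATASET_ONTOLOGY: // the label the review adds
            case DBPEDIA_DATABUS_RECOMMENDATION:
                return "DatasetTemplateHandler";
            default:
                return "generic fallback";
        }
    }

    public static void main(String[] args) {
        System.out.println(route(TemplateType.DBPEDIA_DATASET_ONTOLOGY)); // DatasetTemplateHandler
        System.out.println(route(TemplateType.SOMETHING_ELSE));           // generic fallback
    }
}
```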
The goal of this project is to implement a dataset recommendation system for the DBpedia Databus that assists users in selecting appropriate datasets for various types of analysis. By analyzing dataset metadata such as domain, structure, update frequency, and provenance, the system will recommend relevant datasets, improving discoverability, usability, and efficiency in data analysis workflows.
Issue link: #39