<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Innogence Blog</title>
	<atom:link href="http://www.innogence.com/blog/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://www.innogence.com/blog</link>
	<description></description>
	<lastBuildDate>Wed, 11 Apr 2012 05:36:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=</generator>
		<item>
		<title>Text Data Processing in Data Services blog #3</title>
		<link>http://www.innogence.com/blog/?p=479</link>
		<comments>http://www.innogence.com/blog/?p=479#comments</comments>
		<pubDate>Wed, 11 Apr 2012 05:28:40 +0000</pubDate>
		<dc:creator>Roman Bukarev</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.innogence.com/blog/?p=479</guid>
		<description><![CDATA[This is the third blog entry in what becomes the unstructured text data processing series. The first two entries discussed data acquisition into SAP BusinessObjects Data Services from various sources using JSONAdapter, and in this one I will discuss using &#8230; <a href="http://www.innogence.com/blog/?p=479">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>This is the third blog entry in what has become a series on unstructured text data processing. The first two entries discussed data acquisition into SAP BusinessObjects Data Services from various sources using JSONAdapter, and in this one I will discuss using Data Services for consumer sentiment analysis of the data collected from Twitter.</p>
<p>As discussed in the <a href="http://www.innogence.com/blog/?p=441">first blog entry</a>, the Twitter Search API has been accessed with the word &#8216;cityrail&#8217; as the search term. For those not in the know, Cityrail is the train network of the Sydney (Australia) metropolitan area. It was a very obvious target: with a relatively big customer base it was guaranteed to yield enough unstructured data. Over time (2-3 months) Data Services collected a few thousand tweets supposedly related to Cityrail.<br />
<span id="more-479"></span></p>
<p style="margin-left: 36pt;">It is worth elaborating on the data collection process. In one request-response session, the Twitter Search API returns up to 100 most recent tweets. Provided that within every 15 minutes the number of tweets about Cityrail does not exceed 100 (that assumption has been confirmed), a Data Services Batch Job running every 15 minutes can collect all such tweets almost in real time.</p>
<p style="margin-left: 36pt;">I would also like to mention a conversation I had with a colleague recently. He wondered if JSONAdapter could help to obtain a large number of tweets for analysis instantly, and if the Twitter Streaming API might help with that. The answer to the last question seems to be negative: once you open a Stream, Twitter feeds entries there in real time, but those are <em>current</em> entries; they can be grouped into time slots or used in another ETL process by Data Services. The Streaming API should be used, for example, when the number of tweets on some topic exceeds, say, 100 per minute, so that the Batch Job described above may not be able to cope, even if executed every minute.</p>
<p style="margin-left: 36pt;">Otherwise, the only difference between Stream and Search APIs becomes that the Stream API would provide raw data, while Search would apply some extra filtering/ranking by relevance to the search term. In fact, it is possible to build a Data Services job to get historical results, executing consecutive search requests to Search API deeper and deeper into the past (by restricting the TweetID field in a request) &#8212; the process would not be <em>instant</em>, though, but probably running for 1-2 hours (consider it an Initial Load), and it is hard to tell how far into the past it can go.</p>
<p style="margin-left: 36pt;">The bottom line is: if there is an immediate need to analyze the historical data, you may have to contact <a href="https://dev.twitter.com/docs/twitter-data-providers">Twitter&#8217;s partner data providers</a>. Otherwise, JSONAdapter may help to start collecting the data and implement (near) real-time analysis.</p>
<p>The further discussion will be around the following points:</p>
<ul>
<li>text parsing using Data Services,</li>
<li>&#8216;noise&#8217; reduction,</li>
<li>Topic-Sentiment links rebuilding,</li>
<li>the sarcasm problem (no pun intended!).</li>
</ul>
<p><strong>Text parsing </strong>itself is simpler than one might expect. A special Transform in Data Services v.4.0, called Entity Extraction, parses the input unstructured text and extracts entities and facts about them. Its output is a number of ID&#8217;ed records containing one entity/fact each, accompanied by location attributes (paragraph number, sentence number and offset) and categorized according to the rules specified in the Transform options.</p>
<p>Provided <span style="text-decoration: underline;">out of the box</span> are a dictionary of categorized entities and a few rulesets for fact extraction – they are located in the folder <em>TextAnalysis</em> of a standard Data Services installation (availability for use is subject to the license: either full DI+DQ or DI Premium). One of those rulesets, Voice Of Customer (VOC), is used for this work. SAP allows <a href="http://wiki.sdn.sap.com/wiki/display/EIM/How+to+Customize+Rules">customization of rulesets</a> (at your own risk, of course) and implementation of user-defined dictionaries and rules. SAP has also published <a href="http://scn.sap.com/docs/DOC-8820">several blueprints</a>, which could be used to start new text analysis developments. For this blog, a <a href="http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/406fca3b-2f10-2e10-d6a0-d22296e24786">blueprint for sentiment analysis</a> has been used; it does the following:</p>
<ol style="margin-left: 54pt;">
<li>parse the incoming unstructured text into Topic and Sentiment entities using the Voice Of Customer ruleset: for example, the phrase &#8220;<em>I like apples</em>&#8221; would be parsed into Topic=apples and Sentiment=like (accompanied by Sentiment Type &#8216;StrongPositive&#8217;),</li>
<li>process Topics data and put Topics into groups, to enable some measures, like number of Sentiments per group, and ensure topics like &#8216;apple&#8217; and &#8216;apples&#8217; fall into the same group,</li>
<li>build an SAP BusinessObjects Universe on top of the resulting tables, to enable WebI reporting with slice and dice capabilities.</li>
</ol>
<p>A few important changes have been made to the blueprint design to deal with Twitter data, the first one covering the issue of <strong>noise elimination</strong>. For starters, the blueprint assumed the original text to be in plain English; in reality, tweets constitute quite a lingo, full of abbreviations, expletives of all kinds, and with incomplete grammar. In the upcoming SP release for Data Services, SAP makes an attempt to keep in touch with social media and adds new entities for trend/hashtag and for handle/mention. That did not seem enough, and a custom ruleset has been implemented and added to the Entity Extraction transform to detect and mark words that should be excluded from further processing. The picture below demonstrates options of the transform, including two out of the box rulesets followed by the custom one:</p>
<p><img src="http://www.innogence.com/blog/wp-content/uploads/2012/04/041112_0526_TextDataPro1.png" alt="" /></p>
<p>This way, if an entity is extracted as both e.g. Topic and a (custom-defined) Blather type of entity, it will be detected in a simple join:</p>
<p><img src="http://www.innogence.com/blog/wp-content/uploads/2012/04/041112_0526_TextDataPro2.png" alt="" /></p>
<p>The following screenshot displays a sample output from the transform &#8216;Blather&#8217;; the second column contains the entity extracted from the original tweet and categorized as an expletive:</p>
<p><img src="http://www.innogence.com/blog/wp-content/uploads/2012/04/041112_0526_TextDataPro3.png" alt="" /></p>
<p>All such entities would be filtered, thus clearing the output of most of the &#8216;yucks&#8217;, &#8216;hells&#8217;, and &#8216;lols&#8217;. There is one more use of those, which will be discussed a couple of paragraphs below.</p>
<p>Noise may also occur at the macro level. Tweet analysis differs from the blueprint in one more way: while the blueprint assumed the source text is completely relevant (for example, customer feedback on the imaginary Swiftmobile, collected in a separate folder), tweets don&#8217;t have to be. Filtering tweets by a word X returns not just a customer&#8217;s view on X, but all aspects of people&#8217;s lives that somehow involve X and that people care to write about. The amount of such macro-noise in the &#8216;cityrail&#8217; selection is actually small, but in a selection for, say, &#8216;westfield&#8217; (a major chain of shopping malls in Australia) it becomes much bigger, for obvious reasons. A possible way to filter the results further would be to use a predefined list of topics specific to the bigger theme.</p>
<p>By default, the output of the Entity Extraction transform looks like what might be called a &#8216;spaghetti&#8217; type of data, i.e. it doesn&#8217;t care about relationships between Topics and Sentiments. While that may be considered sufficient, there may still be a need to relate Topics and Sentiments. Assuming that a related topic and sentiment should be located close to each other within a sentence, it&#8217;s possible to derive Topic-Sentiment pairs from the ParentID and Offset fields of the Entity Extraction transform output:</p>
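<p>Outside Data Services, the same filtering idea could be sketched in a few lines of Python. This is only an illustration, not the actual dataflow: the record layout (<em>tweet_id</em>, <em>offset</em>, <em>text</em>, <em>type</em>) and the join on tweet and offset are my assumptions.</p>

```python
from typing import NamedTuple


class Entity(NamedTuple):
    tweet_id: int
    offset: int   # character offset of the entity within the tweet
    text: str
    type: str     # e.g. "Topic", "Sentiment" or the custom "Blather"


def drop_blather(entities, noise_type="Blather"):
    """Remove every entity that was also extracted as a noise-type entity."""
    # Positions flagged as noise; joining on (tweet_id, offset) is an assumption.
    noisy = {(e.tweet_id, e.offset) for e in entities if e.type == noise_type}
    return [e for e in entities if (e.tweet_id, e.offset) not in noisy]
```

<p>An entity extracted both as a Topic and as Blather at the same position disappears from the result, while all other entities pass through unchanged.</p>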
<p><img src="http://www.innogence.com/blog/wp-content/uploads/2012/04/041112_0526_TextDataPro4.png" alt="" /></p>
<p>This design obviously ignores topics not accompanied by sentiments and vice versa, and those could be added to the reporting data model.</p>
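<p>The pairing logic can be sketched in Python as follows. It is a simplified illustration under my own assumptions: records are plain dictionaries with hypothetical keys <em>parent_id</em>, <em>offset</em>, <em>type</em> and <em>text</em>, records sharing a <em>parent_id</em> are treated as belonging together, and each Topic is paired with the Sentiment nearest to it by offset.</p>

```python
from collections import defaultdict


def pair_topics_with_sentiments(records):
    """Return (topic_text, sentiment_text) pairs from extracted entity records."""
    groups = defaultdict(list)
    for record in records:
        groups[record["parent_id"]].append(record)

    pairs = []
    for group in groups.values():
        topics = [r for r in group if r["type"] == "Topic"]
        sentiments = [r for r in group if r["type"] == "Sentiment"]
        if not sentiments:
            continue  # topics without any sentiment are ignored in this sketch
        for topic in topics:
            # The closest sentiment by character offset is taken as the related one.
            nearest = min(sentiments, key=lambda s: abs(s["offset"] - topic["offset"]))
            pairs.append((topic["text"], nearest["text"]))
    return pairs
```

<p>Sentiments that are never the nearest one to any Topic are dropped as well, which matches the limitation mentioned above.</p>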
<p>A preview of the &#8216;raw&#8217; tweets in the database revealed that tweets mostly expressed negative feedback on Cityrail: people tend to complain more often than praise, and – by the way – I wrote the first draft of this blog on the day and hour when a Cityrail <a href="http://www.smh.com.au/nsw/train-chaos-hits-sydney-20120315-1v7cj.html">equipment failure</a> caused a major suspension of service and delays. Therefore, it was surprising to see significant &#8216;StrongPositiveSentiment&#8217;-related numbers in the reporting. The reason was that many tweets were <a href="http://en.wikipedia.org/wiki/Sarcasm">sarcastic</a> and should not have been taken literally; rather, their sentiment is the opposite of their literal meaning. So, if a tweet is deemed sarcastic, its positive Sentiment should be inverted, while negative Sentiment still counts.</p>
<p>Apparently, <strong>sarcasm detection</strong> in user feedback is a much bigger problem without a general solution. Even a human cannot detect sarcasm perfectly (73% accuracy has been reported in one study), as familiarity with the context is often required. Given the ability of Data Services to process Python scripts in User Defined Transforms, one might attempt to build sarcasm detection functionality in Data Services based on Bayesian classification and using not only words, but also markers of emotions: emoticons, the &#8216;blather&#8217; words discussed above, words highlighted with &#8216;*&#8217; (as in &#8220;I *love* when trains go slowly in rainy weather&#8221;) or enclosed in quotation marks, and, of course, the hashtag <em>#sarcasm</em>. Coincidence of negative and positive (rather, strong positive, in terms of the VOC ruleset) sentiments or, rather, emotions in one tweet is also a potential sarcasm marker. The last one, actually, can be implemented with regular Data Services ETL:</p>
<p><img src="http://www.innogence.com/blog/wp-content/uploads/2012/04/041112_0526_TextDataPro5.png" alt="" /></p>
<p>The results below could have been slightly better if VOC knew that &#8216;con&#8217; is a short form for &#8216;conditioner&#8217;, not a negative sentiment expression. Some extra customization of the dictionary may be required.</p>
<p><img src="http://www.innogence.com/blog/wp-content/uploads/2012/04/041112_0526_TextDataPro6.png" alt="" /></p>
<p>Implementing the full scope of the sarcasm detection functionality outlined above, however, seems to be a project of its own and beyond this blog.</p>
<p><strong>Setting up reports</strong> on the analysed data was not a primary goal of this work, as the SAP blueprint&#8217;s approach of using a BusinessObjects Universe was adopted. The original plan was to use SAP BW BEx reporting, but as storage of texts longer than 60 characters in BW InfoProviders is not trivial, the idea was discarded.</p>
<p>Consumer sentiment is quantified here by counting the number of feedback items; restricted measures have been created for each feedback type. The screenshot below demonstrates how Data Quality grouped topics into larger groups using fuzzy matching logic:</p>
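<p>Just to illustrate the two simplest markers (the <em>#sarcasm</em> hashtag and the coincidence of strong positive and negative sentiment in one tweet), here is a small Python sketch. It is a plain heuristic, not the Bayesian classifier mentioned above, and apart from &#8216;StrongPositiveSentiment&#8217; the sentiment type names are assumed labels.</p>

```python
# Assumed sentiment type labels; only "StrongPositiveSentiment" appears in my reports.
NEGATIVE = {"StrongNegativeSentiment", "WeakNegativeSentiment"}
FLIP = {
    "StrongPositiveSentiment": "StrongNegativeSentiment",
    "WeakPositiveSentiment": "WeakNegativeSentiment",
}


def looks_sarcastic(tweet_text, sentiment_types):
    """Flag a tweet by hashtag or by mixed strong-positive and negative sentiment."""
    has_strong_positive = "StrongPositiveSentiment" in sentiment_types
    has_negative = any(t in NEGATIVE for t in sentiment_types)
    return "#sarcasm" in tweet_text.lower() or (has_strong_positive and has_negative)


def adjust(sentiment_type, sarcastic):
    """Invert positive sentiment of sarcastic tweets; negative sentiment still counts."""
    if sarcastic and sentiment_type in FLIP:
        return FLIP[sentiment_type]
    return sentiment_type
```

<p>A User Defined Transform could apply something like <em>adjust</em> per extracted Sentiment before the data reaches the reporting tables.</p>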
<p><img src="http://www.innogence.com/blog/wp-content/uploads/2012/04/041112_0526_TextDataPro7.png" alt="" /></p>
<p>A drilldown into a group is then possible, like below:</p>
<p><img src="http://www.innogence.com/blog/wp-content/uploads/2012/04/041112_0526_TextDataPro8.png" alt="" /></p>
<p>An extra characteristic, time, has been added to reporting as an obvious choice: there is a clear correlation between the number of &#8216;cityrail&#8217; tweets and morning/afternoon transport peak hours. One might think of implementing a &#8220;rolling total negative sentiment&#8221; over a 30-minute window and raising an alert if that value exceeds some threshold.</p>
<p style="text-align: center;"><img src="http://www.innogence.com/blog/wp-content/uploads/2012/04/041112_0526_TextDataPro9.png" alt="" /></p>
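<p>The rolling total idea could look roughly like the Python sketch below. The class name, the default threshold of 20 and the assumption that timestamps arrive in chronological order are all mine, chosen only for illustration.</p>

```python
from collections import deque
from datetime import datetime, timedelta


class RollingNegativeCount:
    """Counts negative sentiments inside a sliding time window."""

    def __init__(self, window=timedelta(minutes=30), threshold=20):
        self.window = window
        self.threshold = threshold
        self._times = deque()

    def add(self, timestamp):
        """Register one negative sentiment; return True if the alert should fire."""
        self._times.append(timestamp)
        # Forget events older than the window (timestamps are assumed to be ordered).
        while timestamp - self._times[0] > self.window:
            self._times.popleft()
        return len(self._times) > self.threshold
```

<p>Each negative Sentiment record would call <em>add</em> with the tweet&#8217;s creation time, and the caller decides what raising an alert actually means.</p>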
<p>Lastly, <strong>beyond consumer sentiment analysis</strong>, another obvious idea would be to geocode Tweets using either geolocation information (GPS coordinates) from tweets metadata, or geographical names from tweets themselves (post-processing is required for the latter, of course, to eliminate noise). The geocoded data could be made available for visualization in Business Objects or provided to a GIS product like ArcGIS for use in spatial analyses.</p>
<p>&nbsp;</p>
<p>- Roman Bukarev</p>
]]></content:encoded>
			<wfw:commentRss>http://www.innogence.com/blog/?feed=rss2&#038;p=479</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SAP Data Services JSON Adapter, part 2: load of Facebook and RSS feeds unstructured text data</title>
		<link>http://www.innogence.com/blog/?p=463</link>
		<comments>http://www.innogence.com/blog/?p=463#comments</comments>
		<pubDate>Fri, 02 Mar 2012 04:01:10 +0000</pubDate>
		<dc:creator>Roman Bukarev</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.innogence.com/blog/?p=463</guid>
		<description><![CDATA[In the previous blog entry I have introduced an idea of a Data Services custom adapter accessing unstructured data from sources in the Web, using JSON as the medium, and demonstrated how that JSONAdapter can obtain data from Twitter. In &#8230; <a href="http://www.innogence.com/blog/?p=463">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>In the <a href="http://www.innogence.com/blog/?p=441">previous blog entry</a> I introduced the idea of a Data Services custom adapter accessing unstructured data from sources on the Web, using JSON as the medium, and demonstrated how that JSONAdapter can obtain data from Twitter. In this one I am going to extend the case to Facebook and RSS feeds.</p>
<p><strong>Facebook data</strong> is based on the so-called social Graph using the <a href="http://ogp.me/">Open Graph Protocol</a>. Every object in that Graph (a person, a feed, a post etc.) is a node having an ID and a few properties. These properties can be accessed using the <a href="http://developers.facebook.com/docs/reference/api/">Graph API</a>, which, unsurprisingly, uses JSON: an inquisitive developer can play with the <a href="https://developers.facebook.com/tools/explorer">Graph Explorer</a> to get the idea. A couple of implications follow from that observation: a) every type of Graph node has its own &#8216;schema&#8217; when described in JSON, b) nodes represent either objects (say, a person) or feeds (wall posts, events etc.).</p>
<p><span id="more-463"></span></p>
<p>From the Data Services extraction point of view, constantly updated feeds are better candidates for extraction, as they may contain unstructured data worth analysing. I targeted a feed with wall posts of the Commonwealth Bank of Australia (CBA), expecting to see various feedback left by CBA&#8217;s consumers, which might then be interesting to analyse for consumer sentiment.</p>
<p>Here&#8217;s what I would see in Graph Explorer as the typical contents of the Feed node (I edited the image to change names and mask IDs):</p>
<p><img src="http://www.innogence.com/blog/wp-content/uploads/2012/03/030212_0358_SAPDataServ1.png" alt="" /></p>
<p>So my first step in Data Services was to import that metadata into JSONAdapter using the URL from Explorer:</p>
<p><img src="http://www.innogence.com/blog/wp-content/uploads/2012/03/030212_0358_SAPDataServ2.png" alt="" /></p>
<p>I used the option to generate a sample DTD from the JSON data. As I mentioned in the previous blog, it may not be perfect, but is a good start.</p>
<p>A problem that I found with the Facebook Feed object is related to the liquid nature of JSON, maybe somehow utilised by Graph: some nodes of the JSON data have changing names, as below:</p>
<p><img src="http://www.innogence.com/blog/wp-content/uploads/2012/03/030212_0358_SAPDataServ3.png" alt="" /></p>
<p>Imagine what a mess that would make in the data structure definition. These changing numbers, however, do not seem to make much sense, as tag objects here have their IDs anyway. In order to skip them in the data definition I had to introduce the setting &#8220;Skip level under Elements&#8221; and re-import the metadata from the generated DTD file, instructing JSONAdapter to skip one level under the &#8220;message_tags&#8221; element. I also had to manually adjust the DTD to make it less relaxed, based on my analysis of the data, basically replacing <span style="font-family: Courier New;"><em>&#8220;(element1 | element2)*&#8221;</em></span>-like DTD constructions with <span style="font-family: Courier New;"><em>&#8220;(element1?, element2?)&#8221;</em></span>.</p>
<p>As a result of that import, here is a partial screenshot of the NRDM schema corresponding to the Facebook Feed:</p>
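<p>To show what skipping a level means for the data itself, here is a rough Python equivalent. It is not the adapter&#8217;s Java code; the function name and the sample structure are assumptions based on what Graph Explorer displayed.</p>

```python
def skip_level(node, under):
    """Return a copy of node where, for every key in `under`, the variably
    named children are replaced by a plain list of their contents."""
    if isinstance(node, dict):
        result = {}
        for key, value in node.items():
            if key in under and isinstance(value, dict):
                # Drop the changing names (e.g. "0", "19") and keep only the tag objects.
                flattened = []
                for child in value.values():
                    flattened.extend(child if isinstance(child, list) else [child])
                result[key] = [skip_level(item, under) for item in flattened]
            else:
                result[key] = skip_level(value, under)
        return result
    if isinstance(node, list):
        return [skip_level(item, under) for item in node]
    return node
```

<p>Calling <em>skip_level(post, {"message_tags"})</em> turns the numbered children of &#8220;message_tags&#8221; into one repeating element, which a DTD can describe without listing every possible number.</p>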
<p><img src="http://www.innogence.com/blog/wp-content/uploads/2012/03/030212_0358_SAPDataServ4.png" alt="" /></p>
<p>This is enough to build a Data Services dataflow, similar to the one from the previous blog:</p>
<p><img src="http://www.innogence.com/blog/wp-content/uploads/2012/03/030212_0358_SAPDataServ5.png" alt="" />In this dataflow, the table T_FB_FEEDS collects messages on CBA&#8217;s wall, while another table, T_FB_TIMER, keeps the change pointer (the latest feed entry&#8217;s &#8220;created_time&#8221; attribute) and thus allows loading changed data only.</p>
<p>So, this is how Facebook Feed is finally stored in a database, ready for further analysis:</p>
<p><img src="http://www.innogence.com/blog/wp-content/uploads/2012/03/030212_0358_SAPDataServ6.png" alt="" /></p>
<p>Now let us discuss RSS (RDF Site Summary) documents in the context of data acquisition via JSON format. RSS is a list of some documents, where each entry in the list contains some summary of the original document (or just data snippet) and some metadata, like posting date or the author. RSS is published as an XML document and is meant to be updated as new documents appear in the original collection, frequency of updates varying from days to minutes. An RSS document may be then accessed by various reader applications or by some aggregator services like Google Reader &#8212; or by Data Services.</p>
<p>Keep in mind that RSS is an XML document, and it needs to be converted to JSON, so it could be accessed by the JSONAdapter.</p>
<p>Such a conversion looks redundant, as JSONAdapter internally converts JSON data to XML again to pass it to Data Services. However, there is no RSS-reading Adapter for Data Services that I know of (and I wouldn&#8217;t care to write an extra Adapter just for that), so the exercise is legit.</p>
<p>As for what tool may be used to convert RSS to JSON, preferably on the fly, there are a few online services around, easily googlable. My first candidate was Google (as I&#8217;m a long-time user of their Reader application, even though it has degraded recently), which worked just fine – however, a few points in their Terms of Service made me not so willing to try and use the Google Feed API. So I picked a free <a href="http://www.blastcasta.com/convert-feed-to-json.aspx">RSS to JSON service</a> on blastcasta.com that is not so encumbered with usage terms. Mind you, different RSS-JSON converters deliver different JSON schemas.</p>
<p>After all this research, the actual ETL setup took less than an hour or so; here&#8217;s a partial screenshot of the RSS schema I got:</p>
<p><img src="http://www.innogence.com/blog/wp-content/uploads/2012/03/030212_0358_SAPDataServ7.png" alt="" /></p>
<p>As I had mentioned in the first blog entry, element names are built from JSON elements using a &#8216;snowball&#8217;-approach, and some names may become too long to serve as column headers, after the NRDM-data gets unnested and routed into a database table. A Query transform should be used to a) pick only the required columns, b) map the element names to column names satisfying the staging database&#8217;s requirements.</p>
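<p>The naming issue and the job of the Query transform can be illustrated with a short Python sketch. The underscore separator and the sample element names are assumptions; the adapter&#8217;s real naming may differ.</p>

```python
def snowball_names(node, prefix=""):
    """Flatten nested dictionaries into {snowballed_name: value}."""
    flat = {}
    for key, value in node.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(snowball_names(value, name))
        else:
            flat[name] = value
    return flat


def to_columns(flat, mapping):
    """Keep only the mapped elements and rename them to database-friendly columns."""
    return {column: flat[name] for name, column in mapping.items() if name in flat}
```

<p>With a mapping such as <em>{"rss_channel_item_title": "TITLE"}</em>, only the required elements survive, and under short names that a staging database accepts.</p>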
<p>That said, my resulting staging table looks like this:</p>
<p><img src="http://www.innogence.com/blog/wp-content/uploads/2012/03/030212_0358_SAPDataServ8.png" alt="" /></p>
<p>The ETL process likely needs some delta capability, e.g. based on a hash code generated from the RSS item&#8217;s URL or title; that part is trivial, though.</p>
<p>Technically, the RSS format allows storing the original document&#8217;s whole contents, and I personally would prefer to see it that way, as it saves the reader&#8217;s time and clicks – a few RSS publishers do so, and I love them for that. However, many content owners prefer to release only a small text snippet into RSS, as (I guess) they still want to get their revenue from advertisement etc. That leaves two options: a) only analyse those ~50-word snippets, b) get Data Services to crawl to the obtained links and extract the text from there, ignoring all the markup/navigational/advertisement additions. The second option is obviously more attractive; however, that functionality is missing in the standard Data Services build (as of the time of writing this) and may be a good candidate for a nice little Adapter.</p>
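<p>For completeness, a hash-based delta could be sketched like this in Python. The choice of SHA-1 and the item keys <em>link</em> and <em>title</em> are assumptions; in Data Services the set of seen keys would live in a staging table rather than in memory.</p>

```python
import hashlib


def item_key(url, title=""):
    """Stable hash code for an RSS item, built from its URL and title."""
    return hashlib.sha1(f"{url}|{title}".encode("utf-8")).hexdigest()


def new_items(items, seen_keys):
    """Return the items not loaded before and register their keys in seen_keys."""
    fresh = []
    for item in items:
        key = item_key(item["link"], item.get("title", ""))
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append(item)
    return fresh
```

<p>Running <em>new_items</em> twice over the same feed returns an empty list the second time, so only genuinely new entries reach the staging table.</p>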
<p>In the next blog entry I will try to cover data analysis opportunities for the data acquired by Data Services.</p>
<p>&nbsp;</p>
<p>-Roman Bukarev</p>
]]></content:encoded>
			<wfw:commentRss>http://www.innogence.com/blog/?feed=rss2&#038;p=463</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data Services JSON Adapter helps to load unstructured text data into SAP ecosystem</title>
		<link>http://www.innogence.com/blog/?p=441</link>
		<comments>http://www.innogence.com/blog/?p=441#comments</comments>
		<pubDate>Thu, 15 Dec 2011 23:34:49 +0000</pubDate>
		<dc:creator>Roman Bukarev</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.innogence.com/blog/?p=441</guid>
		<description><![CDATA[In the recent couple of years I became convinced in what researchers had been telling for quite a long time – in the rise of unstructured data, or, simply put, plain text in natural language. Some analysts go as far &#8230; <a href="http://www.innogence.com/blog/?p=441">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Over the last couple of years I have become convinced of what researchers have been saying for quite a long time – the rise of unstructured data, or, simply put, plain text in natural language. Some analysts go as far as predicting that the volume of unstructured data available to companies will exceed that of the ‘traditional’ structured data, which we read from database tables and are accustomed to building intelligence on. The analytical potential of unstructured data is well discussed (to mention a couple of uses: consumer sentiment analysis and entity finding), so why don’t we talk about some practicalities of text data processing, and how SAP products could be of use?</p>
<p>I would like to start this series of blogs with a discussion of how unstructured data from social media can be loaded into a data warehouse.</p>
<p>This is where JSON comes in: an open, text-based data interchange format. The acronym stands for JavaScript Object Notation – the name points to the roots of the format, but, actually, the standard is used outside of JavaScript, with implementations of JSON for various platforms referenced at its homepage, <a href="http://json.org/">http://json.org</a>.</p>
<p><span id="more-441"></span></p>
<p>Have a glance at JSON code:</p>
<pre><code>{
  "Plant": {
    "Colour": "green",
    "Height": {
      "Measure": "50",
      "Units": "cm"
    }
  }
}</code></pre>
<p>As you might guess, it describes a green plant of 50 cm height.</p>
<p>Just so you might compare – in XML, a well-known standard for data interchange, one possible way to express the same would be:</p>
<pre><code>&lt;Plant&gt;
  &lt;Colour&gt;green&lt;/Colour&gt;
  &lt;Height&gt;
    &lt;Measure&gt;50&lt;/Measure&gt;
    &lt;Units&gt;cm&lt;/Units&gt;
  &lt;/Height&gt;
&lt;/Plant&gt;</code></pre>
<p>&nbsp;</p>
<p>On large data volumes the JSON representation is more lightweight than XML, mainly because it has no closing tags. JSON became widely used in web development (“Ajax programming” are your keywords for further reading), which brings us to the topic of those famous social web-based applications, including Twitter, Facebook, Friendfeed, or RSS feeds collected by Google Reader.</p>
<p>So this is why JSON looks like a good candidate to be the medium for unstructured data extraction into a data warehouse: it is supported by source web applications, and it can be converted relatively easily into XML, which is understood by major ETL tools – in our case, SAP BusinessObjects Data Services.</p>
<p>Even out of the box, Data Services can integrate with a wide range of databases and data formats from various vendors, and that range can be extended by using plugins called Adapters. SAP provides a Java Software Development Kit to create such Adapters. A very helpful introduction to the Data Services Adapter SDK can be found at the SDN website; sample code is also provided as a part of the product installation.</p>
<p>With that SDK I have developed a JSONAdapter that obtains data via HTTP in JSON format, converts it to XML and passes it to Data Services – so I will describe the steps to configure such an interface with Twitter, assuming the JSONAdapter has been installed, and its Datastore in the Local Repository has been created.</p>
<p>Twitter provides a search interface, where search parameters are included in a URL. For example, the URL “http://search.twitter.com/search.json?q=intelligence&amp;rpp=50” would make Twitter return a JSON-formatted result with the 50 most recent tweets containing the word “intelligence”.</p>
<p>The first step in ETL setup is to describe the data structure (metadata) in XML DTD form, one that Data Services understands. Unfortunately, it is not always possible to obtain a JSON Schema, used to describe the structure of a JSON document – nor is it actually standardized yet. That is why the Adapter makes an attempt to derive metadata from sample data. While the sample may be incomplete, and there may be several ways to describe the same XML data by DTD, it is still a good start.</p>
<p>So, let’s use the JSONAdapter’s “Import By Name” functionality for metadata import.</p>
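<p>To give a feeling for how simple the JSON-to-XML conversion can be, here is a minimal Python sketch applied to the Plant example above. It only uses the standard library, assumes that JSON keys are valid XML element names, and wraps array members in a hypothetical &#8220;item&#8221; element; it is an illustration, not the adapter&#8217;s Java implementation.</p>

```python
import json
import xml.etree.ElementTree as ET


def json_to_xml(name, value):
    """Build an XML element from a decoded JSON value."""
    element = ET.Element(name)
    if isinstance(value, dict):
        for key, child in value.items():
            element.append(json_to_xml(key, child))
    elif isinstance(value, list):
        for child in value:
            element.append(json_to_xml("item", child))  # "item" is an assumed wrapper name
    elif value is not None:
        element.text = str(value)
    return element


doc = json.loads('{"Plant": {"Colour": "green", "Height": {"Measure": "50", "Units": "cm"}}}')
root = json_to_xml("Plant", doc["Plant"])
xml_text = ET.tostring(root, encoding="unicode")
```

<p>Here <em>xml_text</em> holds the same Plant document as the XML listing above, just without line breaks and indentation.</p>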
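<p>Such a request URL is easy to assemble from a constant part and variable parameters. The Python sketch below only builds the string and sends nothing; the helper name is hypothetical, and the endpoint and parameters are the ones quoted above.</p>

```python
from urllib.parse import urlencode


def build_search_url(base_url, **params):
    """Join the constant part of the request with its variable query parameters."""
    return f"{base_url}?{urlencode(params)}"


url = build_search_url("http://search.twitter.com/search.json", q="intelligence", rpp=50)
```

<p>The value of <em>url</em> is then the example URL above, with the query string properly encoded.</p>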
<p><a href="http://www.innogence.com/blog/wp-content/uploads/2011/12/7.png"><img class="alignnone size-full wp-image-450" title="7" src="http://www.innogence.com/blog/wp-content/uploads/2011/12/7.png" alt="" width="364" height="410" /></a>, which leads to..</p>
<p><a href="http://www.innogence.com/blog/wp-content/uploads/2011/12/6.png"><img class="alignnone size-full wp-image-449" title="6" src="http://www.innogence.com/blog/wp-content/uploads/2011/12/6.png" alt="" width="584" height="327" /></a></p>
<p>While actually creating a Function Call that may be used straight away, the import process also (optionally) generates a DTD file.  That file may be reviewed and adjusted, for example, to denote some XML elements as optional rather than required. “Import By Name” functionality should be then used again, but this time to import metadata from the adjusted DTD file, not a sample URL.</p>
<p>With that second import or without, the final result would be a Function returning data in Nested Relational Data Model (NRDM), the Data Services’ internal representation of XML data (screenshot below is partial):</p>
<p><a href="http://www.innogence.com/blog/wp-content/uploads/2011/12/5.png"><img class="alignnone size-full wp-image-448" title="5" src="http://www.innogence.com/blog/wp-content/uploads/2011/12/5.png" alt="" width="571" height="464" /></a></p>
<p>In the new Function, the Input parameters are always URL and EXTRA_PARAM, which together form the URL submitted to the web application; at development time such a split provides more clarity regarding the constant and variable parts of the request.</p>
<p>The Output is a nested schema that may be processed using standard Data Services tools. Let’s have a closer look at Data Flow design for that.</p>
<p><a href="http://www.innogence.com/blog/wp-content/uploads/2011/12/4.png"><img class="alignnone size-full wp-image-447" title="4" src="http://www.innogence.com/blog/wp-content/uploads/2011/12/4.png" alt="" width="340" height="189" /></a></p>
<p>A Function cannot be placed into a Data Flow by itself; rather, it should be inserted into a Query transform, which, in turn, should be preceded by a Row_Generation transform to provide an input. Hence, the Row_Generation transform generates exactly 1 row, and the Query transform Twitter_Search calls the JSONAdapter-based Function:</p>
<p><a href="http://www.innogence.com/blog/wp-content/uploads/2011/12/3.png"><img class="alignnone size-full wp-image-446" title="3" src="http://www.innogence.com/blog/wp-content/uploads/2011/12/3.png" alt="" width="461" height="351" /></a></p>
<p>Greater flexibility may be added to the Function call’s input parameters using variables:</p>
<p><a href="http://www.innogence.com/blog/wp-content/uploads/2011/12/2.png"><img class="alignnone size-full wp-image-445" title="2" src="http://www.innogence.com/blog/wp-content/uploads/2011/12/2.png" alt="" width="583" height="231" /></a></p>
<p>These variables may be global variables or the Data Flow’s input parameters, which makes it possible to parameterise Function calls from outside Data Services.</p>
<p>The NRDM data received by Data Services may then be unnested and stored in table format, <em>voilà!</em></p>
<p><a href="http://www.innogence.com/blog/wp-content/uploads/2011/12/1.png"><img class="alignnone size-full wp-image-444" title="1" src="http://www.innogence.com/blog/wp-content/uploads/2011/12/1.png" alt="" width="606" height="315" /></a></p>
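<p>Outside Data Services, the same unnesting idea can be sketched in a few lines of Python. This is only an analogy for what the Data Flow does, and the field names (<code>query</code>, <code>results</code>, <code>id</code>, <code>from_user</code>, <code>text</code>) and the sample data are assumptions made up for the example, not the actual schema returned by the Function:</p>

```python
def unnest(response):
    """Flatten a nested search response into one flat row per nested result."""
    rows = []
    for result in response.get("results", []):
        rows.append({
            # The parent-level value is repeated on every child row.
            "query": response.get("query"),
            "id": result.get("id"),
            "from_user": result.get("from_user"),
            "text": result.get("text"),
        })
    return rows


# Hypothetical nested input: one parent record holding a list of child records.
sample = {
    "query": "sap",
    "results": [
        {"id": 1, "from_user": "alice", "text": "first tweet"},
        {"id": 2, "from_user": "bob", "text": "second tweet"},
    ],
}

# Each printed dictionary corresponds to one row of a flat table.
for row in unnest(sample):
    print(row)
```

<p>Each nested result becomes its own row and carries the parent’s values with it, so the output can be loaded into an ordinary relational table.</p>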
<p>From here, the social media data can be used and reused for analytical processing as part of the data warehouse.</p>
<p>In the next posts I will outline how to configure JSONAdapter to load data from Facebook and RSS feeds and, using the collected tweets as a model, discuss text data processing options in Data Services.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.innogence.com/blog/?feed=rss2&#038;p=441</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>BW 7.3 upgrade completed successfully for one of our clients</title>
		<link>http://www.innogence.com/blog/?p=421</link>
		<comments>http://www.innogence.com/blog/?p=421#comments</comments>
		<pubDate>Thu, 11 Aug 2011 01:54:31 +0000</pubDate>
		<dc:creator>Rachel.McCusker</dc:creator>
				<category><![CDATA[News]]></category>

		<guid isPermaLink="false">http://www.innogence.com/blog/?p=421</guid>
		<description><![CDATA[Innogence finished a BW 7.3 upgrade, covering both the ABAP and Java stacks, for one of our clients last weekend. BW 7.3 is a major SAP BW release following NetWeaver 7 and includes new functionality such as: Enhanced scalability &#38; performance Flexible &#8230; <a href="http://www.innogence.com/blog/?p=421">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Innogence finished a BW 7.3 upgrade, covering both the ABAP and Java stacks, for one of our clients last weekend. BW 7.3 is a major SAP BW release following NetWeaver 7 and includes new functionality such as:</p>
<ul>
<li>Enhanced scalability &amp; performance</li>
<li>Flexible and tighter integration with SAP BusinessObjects products</li>
<li>Reduced TCO and higher development efficiency</li>
<li>Simplified configuration and operations</li>
</ul>
<p>Innogence was one of the first SAP partners in the ANZ region to embark on the BW 7.3 ramp-up at the end of last year. With the experience gathered during the ramp-up phase, Innogence helped the client upgrade to BW 7.3 quickly, ahead of a major user community go-live. The upgrade went very smoothly and finished within the planned downtime window. Congratulations to the team!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.innogence.com/blog/?feed=rss2&#038;p=421</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SAUG Summit 2011</title>
		<link>http://www.innogence.com/blog/?p=418</link>
		<comments>http://www.innogence.com/blog/?p=418#comments</comments>
		<pubDate>Wed, 27 Jul 2011 05:41:47 +0000</pubDate>
		<dc:creator>Rachel.McCusker</dc:creator>
				<category><![CDATA[News]]></category>

		<guid isPermaLink="false">http://www.innogence.com/blog/?p=418</guid>
		<description><![CDATA[Innogence will be attending the SAP Australia User Group (SAUG) Summit from 2nd to 4th August 2011. At 1pm on Tuesday 2nd August, Dimitri Zarganakis will give a live demo of HANA, SAP’s in-memory computing technology. Innogence staff will be available at the &#8230; <a href="http://www.innogence.com/blog/?p=418">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Innogence will be attending the SAP Australia User Group (SAUG) Summit from 2<sup>nd</sup> to 4<sup>th</sup> August 2011. At 1pm on Tuesday 2<sup>nd</sup> August, Dimitri Zarganakis will give a live demo of HANA, SAP’s in-memory computing technology. Innogence staff will be available at the event to discuss HANA and SAP Business Intelligence and to answer any queries you may have.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.innogence.com/blog/?feed=rss2&#038;p=418</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Innogence launches the first HANA environment in the ANZ region at its Innovation Centre!</title>
		<link>http://www.innogence.com/blog/?p=414</link>
		<comments>http://www.innogence.com/blog/?p=414#comments</comments>
		<pubDate>Tue, 26 Jul 2011 06:40:06 +0000</pubDate>
		<dc:creator>Rachel.McCusker</dc:creator>
				<category><![CDATA[News]]></category>

		<guid isPermaLink="false">http://www.innogence.com/blog/?p=414</guid>
		<description><![CDATA[Another first for Innogence as it receives the first SAP HANA platform environment in the ANZ region. It is currently housed at the Innogence Innovation Centre in Sydney, in partnership with IBM Australia. &#8220;Innogence has also reinforced its commitment to &#8230; <a href="http://www.innogence.com/blog/?p=414">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Another first for Innogence as it receives the first SAP HANA platform environment in the ANZ region. It is currently housed at the Innogence Innovation Centre in Sydney, in partnership with IBM Australia.</p>
<p>&#8220;Innogence has also reinforced its commitment to SAP to deliver next-generation solutions on the SAP HANA platform based on SAP in-memory computing technology. This carries on from the success Innogence delivered to the APJ market around BIA/BWA (Business Warehouse Accelerator &#8211; in-memory computing technology) over the last 3 years. Innogence deployed around 17 BIA/BWA environments, so it is a natural progression for the Innogence in-memory team,&#8221; said Ian Markram, Director of Innogence. Based on the initial findings and the global ramp-up success, Ian believes HANA will be another extremely successful SAP product and a must-have for clients.</p>
<p>At the Innogence Innovation Centre, clients in the ANZ region will be able to experience how HANA can help them run their business and simplify existing IT landscapes. This will help clients recognise the benefits of in-memory computing for their organisations.</p>
<p>If you would like to enquire about the HANA Test Drive offered by Innogence, please email <a href="mailto:enquiries@innogence.com">enquiries@innogence.com</a> or speak to your account manager.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.innogence.com/blog/?feed=rss2&#038;p=414</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Innogence appoints new CEO</title>
		<link>http://www.innogence.com/blog/?p=406</link>
		<comments>http://www.innogence.com/blog/?p=406#comments</comments>
		<pubDate>Tue, 26 Jul 2011 06:22:43 +0000</pubDate>
		<dc:creator>Rachel.McCusker</dc:creator>
				<category><![CDATA[News]]></category>

		<guid isPermaLink="false">http://www.innogence.com/blog/?p=406</guid>
		<description><![CDATA[Innogence has appointed Phil Cameron as the new CEO whilst the previous CEO, Hernus Carelsen, has moved into the role of Chairman. “The appointment of Phil in this role is arguably the best strategic move we’ve ever made” says Hernus Carelsen. Phil &#8230; <a href="http://www.innogence.com/blog/?p=406">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Innogence has appointed Phil Cameron as the new CEO whilst the previous CEO, Hernus Carelsen, has moved into the role of Chairman.</p>
<p>“The appointment of Phil in this role is arguably the best strategic move we’ve ever made,” says Hernus Carelsen. Phil Cameron is qualified as a Chartered Accountant and holds a Master’s in Project Management and Human Resource Management from UNSW. He has sold and delivered over $120 million in revenue over the last 12 years. During this time he has performed the roles of Sales and Marketing Manager for Deloitte Outsourcing in NZ, SAP Client Partner responsible for Consulting in SAP Queensland, Managing Director and Founder of Zer01 (an SAP GRC consultancy in Australia) and Global Sales Director for Spendvision (a global SaaS expense management product).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.innogence.com/blog/?feed=rss2&#038;p=406</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Innogence Appoints Innovation Director</title>
		<link>http://www.innogence.com/blog/?p=385</link>
		<comments>http://www.innogence.com/blog/?p=385#comments</comments>
		<pubDate>Tue, 26 Apr 2011 23:15:35 +0000</pubDate>
		<dc:creator>Liam Nesteroff</dc:creator>
				<category><![CDATA[News]]></category>

		<guid isPermaLink="false">http://www.innogence.com/blog/?p=385</guid>
		<description><![CDATA[Innogence is investing further in innovation as a means of enhancing its value proposition to its clients. A new role, Innovation Director, has been created to accelerate and stimulate innovation within Innogence. Andrew Small has &#8230; <a href="http://www.innogence.com/blog/?p=385">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Innogence is investing further in innovation as a means of enhancing its value proposition to its clients. A new role, Innovation Director, has been created to accelerate and stimulate innovation within Innogence. Andrew Small has been appointed as the Innovation Director.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.innogence.com/blog/?feed=rss2&#038;p=385</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Innogence cements offshore partnership in India with Innokey</title>
		<link>http://www.innogence.com/blog/?p=394</link>
		<comments>http://www.innogence.com/blog/?p=394#comments</comments>
		<pubDate>Sun, 17 Apr 2011 23:38:42 +0000</pubDate>
		<dc:creator>Liam Nesteroff</dc:creator>
				<category><![CDATA[News]]></category>

		<guid isPermaLink="false">http://www.innogence.com/blog/?p=394</guid>
		<description><![CDATA[Innogence&#8217;s NSW State Director, Ian Markram, visited Hyderabad this week to further establish the offshore relationship and work on value-added initiatives. With average SAP BI experience of more than 5 years and all consultants certified, this relationship is emerging &#8230; <a href="http://www.innogence.com/blog/?p=394">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Innogence&#8217;s NSW State Director, Ian Markram, visited Hyderabad this week to further establish the offshore relationship and work on value-added initiatives. With average SAP BI experience of more than 5 years and all consultants certified, this relationship is emerging as a real point of difference in the Australian marketplace.</p>
<p>An NSW-based client also visited Innokey&#8217;s offices in Hyderabad to work with the existing offshore team and to get ready for the support phase of their project.</p>
<p>A link to Innogence&#8217;s offshore partner can be found at <a href="http://innokeysoft.com/">http://innokeysoft.com/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.innogence.com/blog/?feed=rss2&#038;p=394</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Innogence ranked 3rd highest by SAP for number of certified consultants</title>
		<link>http://www.innogence.com/blog/?p=390</link>
		<comments>http://www.innogence.com/blog/?p=390#comments</comments>
		<pubDate>Tue, 29 Mar 2011 23:23:19 +0000</pubDate>
		<dc:creator>Liam Nesteroff</dc:creator>
				<category><![CDATA[News]]></category>

		<guid isPermaLink="false">http://www.innogence.com/blog/?p=390</guid>
		<description><![CDATA[Innogence are ranked 3rd highest in the number of certified consultants by SAP. http://www.sap.com/australia/partners/partner_table.epx For a company that specialises just in Business Intelligence, this cements Innogence&#8217;s position as the undisputed BI leader in Australia.]]></description>
			<content:encoded><![CDATA[<p>Innogence are ranked 3rd highest in the number of certified consultants by SAP.</p>
<p><a href="http://www.sap.com/australia/partners/partner_table.epx">http://www.sap.com/australia/partners/partner_table.epx</a></p>
<p>For a company that specialises just in Business Intelligence, this cements Innogence&#8217;s position as the undisputed BI leader in Australia.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.innogence.com/blog/?feed=rss2&#038;p=390</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

