Apache makes Pig out of Big Data .. and it's Good

Apache Pig is a platform for querying large semi-structured data sets on Hadoop. Its high-level procedural language, Pig Latin, is compiled into MapReduce jobs, so Pig simplifies the use of Hadoop by allowing SQL-like queries against a distributed data set.

Hopefully this is the beginning of a standardized query format across the various Big Data and MapReduce vendors. For now, Apache Pig takes some of the grunt out of ad hoc queries. For example, parsing a large data set for new error messages:

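-- Load the raw log; with no schema given, fields are addressed by position ($0, $1, ...).
messages = LOAD 'messages';
-- Keep the records whose first field contains the text ERROR.
errors = FILTER messages BY $0 MATCHES '.*ERROR.*';
-- Write the matching records out.
STORE errors INTO 'error_report';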

Apache Pig provides a variety of relational operators, including:

Operator   Description
FILTER     Select a set of tuples from a relation based on a condition.
FOREACH    Iterate over the tuples of a relation, generating a data transformation.
GROUP      Group the data in one or more relations.
JOIN       Join two or more relations (inner or outer join).
LOAD       Load data from the file system.
ORDER      Sort a relation based on one or more fields.
SPLIT      Partition a relation into two or more relations.
STORE      Store data in the file system.
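
As a rough sketch of how several of these operators combine, the script below counts error records per host and lists the busiest hosts first. The three-column, tab-separated layout (host, level, text), the relation names, and the output path 'errors_by_host' are assumptions made purely for illustration; they do not describe a real data set.

```pig
-- Assumed input layout: tab-separated lines of (host, level, text).
messages = LOAD 'messages' USING PigStorage('\t')
           AS (host:chararray, level:chararray, text:chararray);

-- FILTER: keep only the records logged at ERROR level.
errors = FILTER messages BY level == 'ERROR';

-- GROUP: collect the error records for each host into one bag.
by_host = GROUP errors BY host;

-- FOREACH: emit one (host, count) tuple per group.
counts = FOREACH by_host GENERATE group AS host, COUNT(errors) AS n;

-- ORDER: hosts with the most errors come first.
sorted = ORDER counts BY n DESC;

-- STORE: write the result to a hypothetical output path.
STORE sorted INTO 'errors_by_host';
```

Declaring a schema in the LOAD statement is optional, but it lets later statements refer to fields by name instead of by position ($0, $1, ...).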

More details on Apache Pig will follow as we experiment with it; in the meantime, further information is available on Apache's own page.